Dumping a sample of production data with MySQL

This is a question @joshuapaling asked on twitter:

I've done something similar. I struggled to explain in 140 characters. So, impromptu blog post!

Ignoring large tables

In the past I have wanted to dump production database without data from large tables. I did that by using mysqldump in two steps.

Dump an empty schema into a new dump.sql file.

mysqldump --no-data mydatabase > dump.sql

Dump data, ignoring large tables, appending it to dump.sql

mysqldump --no-create-info
          --ignore-table mydatabase.big_table
          --ignore-table mydatabase.another_big_table
          mydatabase >> dump.sql

What about just certain rows?

I haven't tried this but I think it should work using the same idea.

Same as before, dump an empty schema.

mysqldump --no-data mydatabase > dump.sql

Dump the data, ignoring tables you only want some data from.

mysqldump --no-create-info
          --ignore-table mydatabase.table_i_only_want_some_data_from
          mydatabase >> dump.sql

Dump the data from the table, using an SQL query to only dump certain rows.

mysqldump --no-create-info
          --where="updated_at > '2016-06-19'"
          mydatabase table_i_only_want_some_data_from >> dump.sql

Restoring

dump.sql can be restored with

mysql mydatabasename < dump.sql

What about invalid data?

Assuming that you have key constraints in your database, the restore will fail to insert any records where the parent doesn't exist.

You can make use of this. If you use --where to dump the data for blog_posts that were updated today. You can try to restore all comments, and only comments from blog posts updated today will be restored.

That also mean that the order matters. Don't try to restore the comments table before blog_posts.

Will this work?

I've only tried this with ignoring whole tables, but I think it should work for a subset of rows if you have key constraints in the database which prevent you from inserting invalid data.

What else?

I have a few extra tips up my sleeve for large database dumps. If you want to know more ask me about pv and gzip.