Docs: Features: Seed Data

Skeema Premium v1.10+ includes the ability to manage seed data in your schema repository. This feature allows you to track the initial data for populating a newly-created table. Seed data can be useful for ensuring that new tables immediately have some necessary data in production. It can also be used for tracking development data or “fixture” rows, for use when populating a new dev/test/CI database.

Seed data is represented by SQL INSERT statements in your *.sql files. By default, skeema diff and skeema push only process these statements when the INSERT is targeting an empty or newly-created table.

Skeema Premium also provides optional support for automatically generating your seed INSERT statements, based on conditionally dumping data from a specific environment when running skeema pull. These conditional data dumps are controlled by specially-named “seed views”, which allow you to track arbitrary SELECT statements for use in seed data population. This provides a powerful mechanism for subsetting or anonymizing production data using the full expressiveness of declarative SQL.

Repo layout and INSERT statements

INSERT statements may be placed in any *.sql files within the same directory as the definition of their target tables. You may place them in the same file as their corresponding CREATE TABLE, or in a separate file per INSERT, or place all INSERTs into a single file.

There are several approaches for creating and managing INSERT statements in your *.sql files:

  • Manually: write INSERT statements from scratch, and adjust them manually when needed.

  • Skeema seed views: skeema pull’s update-seed-inserts option provides an automated mechanism for using specially-named views to dump seed data from the appropriate environment; see below for usage.

  • External dump tools: for example, a custom wrapper around the output of mysqldump --no-create-info --where='...' --tables ... could be used in an automation pipeline, in order to obtain the desired INSERTs and write them to a consistent location in your schema repo.

If maintaining INSERT statements manually, you can split seed data for a single table into multiple separate INSERT statements if you wish. If a single table has multiple INSERT statements defined in the same .sql file, their execution ordering will match their order in the file.

Diff/push and INSERT statements

The inserts enum option specifies how skeema diff and skeema push should handle INSERT statements.

With the default of inserts=seed, these commands process INSERT statements only when the target table has no rows.

With inserts=none, no INSERT statements are processed. This is effectively how the Community Edition operates. The alias skip-inserts may also be used equivalently.

With inserts=all, all INSERT statements are processed, regardless of table emptiness. It is important to understand that Skeema does not diff your table row data or support imperative patching of table data / generic data migrations and this option simply re-runs your full INSERTs. When using inserts=all, you must take caution to avoid duplicate key errors in your INSERT statements, such as by manually ensuring your statements all use syntax INSERT IGNORE or INSERT ... ON DUPLICATE KEY UPDATE as necessary.

As with all other operations, skeema diff only ever displays INSERTs (as appropriate based on the value of the inserts option) but does not execute them. Meanwhile skeema push executes them, but always after any DDL for the corresponding table. When skeema push executes INSERTs, each statement is run in its own separate transaction. If INSERTs are present for multiple tables, their execution ordering is random. The session will use foreign_key_checks=0, avoiding foreign key dependency chain issues, at the cost of not ensuring referential integrity of the seed data.

With inserts=seed, if a single table has multiple INSERT statements defined in the directory, the table-emptiness check is run only once prior to execution of any INSERT statements.

The ddl-wrapper option does not affect INSERT statements; skeema push always executes INSERTs directly, not through a wrapper script. Query timeouts (readTimeout in connect-options) are not enforced for INSERTs.

Neither skeema diff nor skeema push pre-verifies INSERT statements for correctness prior to execution. INSERTs are not run inside of a workspace.

Presence of INSERTs does not affect the exit code behavior of skeema diff.

Skeema Seed Views

Seed views provide a mechanism for auto-creating INSERT statements, based on data dumps from a specific database environment. Each seed view encapsulates a SELECT query for obtaining seed data for a specific table. Seed data INSERTs are created or updated, based on the contents of corresponding seed views, whenever skeema pull is run with the update-seed-inserts option enabled.

Among your *.sql files, Skeema considers a CREATE VIEW statement to be a seed view definition if where the view name begins with “_seed_”, followed by a table name. As an example, consider the following view definition:

CREATE VIEW _seed_app_users AS SELECT * FROM app_users WHERE user_id < 100;

Skeema would consider this to be a seed view, defining which seed data to dump for the table `app_users`. When running skeema pull, if this view definition is present in a .sql file and update-seed-inserts is enabled, Skeema will automatically run the query from `_seed_app_users`, convert that row data into an INSERT statement, and write it to the end of app_users.sql – or, if there was already an INSERT statement there, dynamically replace the statement’s contents to reflect the current data.

The update-seed-inserts option is not enabled by default. Appropriate usage of this option depends on your preferred workflow for INSERT statements, and the source of the seed data. A few possibilities include:

  • Obtaining seed data using a subset of production data: enable update-seed-inserts in the [production] section of a .skeema file. Your seed views will typically involve WHERE clauses which define what subset of the row data to dump on pull, and your SELECT clauses might use string manipulation functions to mask sensitive data if desired. Running skeema pull production will update CREATE as well as INSERT statements; pulling from other environments will only update CREATE statements.

  • Maintaining a canonical set of seed data in one database environment, such as considering the “staging” DB to be the source-of-truth for dumping seed data: enable update-seed-inserts in that environment section in a .skeema file. In this case, typically each seed view will be of the simple form CREATE VIEW _seed_foo AS SELECT * FROM foo; which dump the entirety of the table. Running skeema pull staging (for example) will update CREATE as well as INSERT statements; meanwhile pulling from other environments will only update CREATE statements.

  • Updating seed data ad hoc: enable update-seed-inserts on the command-line only when needed, for example skeema pull production --update-seed-inserts. If needed, you can combine this approach with --ignore-table to filter out tables whose seed data should not be updated.

When update-seed-inserts is enabled, its exact behavior is as follows: skeema pull looks in the *.sql files for view definitions where the view name begins with “_seed_”. For each such view definition that is found, skeema pull will take the following actions:

  1. Create the view on the target database (replacing it if it already exists) using the filesystem definition.
    • Note that this is the only case where skeema pull treats the filesystem CREATE VIEW definition as its source-of-truth, rather than the live database’s definition. It is also the only case where skeema pull modifies the live database schema (to create or replace the seed view definition as needed).
    • This design decision is based on the notion of treating the filesystem view definition as a declarative mechanism for mapping a result-set to a table name. In a sense, the seed views are representing configuration metadata for the pull operation, rather than expressing a desired state for the schema.
    • Favoring the filesystem definition also avoids problems with the DBMS normalizing view definitions. For example, the filesystem definition can use SELECT * and this will reflect all current columns dynamically, whereas view definitions on the database side become “frozen” at creation time by the DBMS.
  2. Query all row data exposed by the view, e.g. SELECT * FROM _seed_foo
  3. Convert the data into an INSERT statement
    • The table’s name will be whatever followed “_seed_” in the view’s name; for example view `_seed_foo` would emit INSERT INTO `foo` ....
    • The column names in the INSERT statement will match whatever columns were found in the view’s definition.
  4. Write the INSERT statement to the filesystem
    • If there is no INSERT statement for this table yet in the directory, the new INSERT statement will be written to the same file as the table’s definition. In the previous example of view `_seed_foo` and table `foo`, by default this would write the INSERT to foo.sql. You may manually move the statement to a different file in the same directory if you wish, and subsequent pulls will respect the new location.
    • If there was already a single INSERT statement for this table in the directory, its text will be replaced with that of the new INSERT statement.
    • If there were multiple INSERT statements for this table in the directory, a warning will be logged and none of the statements will be modified.

Seed views may be defined anywhere within a directory’s *.sql files: you may place them all in a single file, or place them in the same file as each corresponding table definition, or place each of them in their own file. Seed view definitions are always manually maintained: commands such as skeema init, skeema pull, and skeema format will not ever export or update CREATE VIEW statements if the view name begins with “_seed_”.

Seed view edge cases

The seed data dumping functionality currently has several minor shortcomings:

  • As a boolean, the operation of update-seed-inserts is currently “all or nothing”, operating on all or none of your seed views. Using --ignore-table on the command-line can help filter out some tables, but is insufficient if you have many seed views and only wish to update a small subset of your tables' INSERT statements. As a work-around, after pulling you can use git restore (or similar) to discard unwanted changes from your working directory before committing.
  • In order to create or update the seed view definition in the live database, Skeema’s database user needs the CREATE VIEW privilege, along with the DROP privilege. There is not currently a mechanism to have skeema pull just run raw SELECT queries directly instead of using views.
  • The seed view feature only emits standard INSERT statements. If you need INSERT IGNORE or perhaps INSERT ... ON DUPLICATE KEY UPDATE, you will need to manually edit the generated statements after each pull.
  • For tables with generated columns, you must manually list columns in your seed view’s SELECT and cannot use SELECT *.

These situations may be improved in future Skeema releases, depending on customer interest. Please contact us with any feedback on the seed data and seed view functionality.