Docs: FAQ

Is Skeema another online schema change tool?

No. Skeema is a tool for managing schemas, and the workflow around how schema changes are requested, reviewed, and performed. It can be used as a “glue” layer between git and existing online schema change tools, or perhaps as part of a continuous integration / continuous deployment pipeline.

Skeema is designed to be a unified solution to the following common problems:

  • Keeping schemas in sync across development, staging, and production environments
  • Keeping schemas in sync across multiple shards
  • Exporting schemas to a repo and managing them like code, using pull requests and code review
  • Configuring use of an external online schema change tool, optionally only for certain table sizes, clusters, schema names, or environments

Skeema does not implement its own method for online schema changes, but it can be configured to shell out to other arbitrary online schema change tools.

Is it safe?

Schema changes can be scary. Skeema includes a number of safety mechanisms to help ensure correct operation.

Extensive automated testing suite

Skeema has an extensive suite of unit, integration, and end-to-end functional tests; its libraries/subpackages do as well. This suite includes automated testing against a Dockerized database instance of 14+ different vendor/version combinations of MySQL, MariaDB, and Percona Server.

Destructive operations are prevented by default

Destructive operations only occur when specifically requested via the allow-unsafe option. This prevents human error with running skeema push from an out-of-date repo working copy, as well as misinterpreting accidental attempts to rename tables or columns (both of which are not yet supported).

The following operations are considered unsafe:

  • Dropping a table
  • Altering a table to drop a normal column or stored (non-virtual) generated column
  • Altering a table to modify an existing column in a way that potentially causes data loss, length truncation, or reduction in precision
  • Altering a table to modify the character set of an existing column
  • Altering a table to change its storage engine
  • Dropping a stored procedure or function (even if just to re-create it with a modified definition)

Note that skeema diff also has the same safety logic as skeema push, even though skeema diff never actually modifies tables. This behavior exists so that skeema diff can serve as a safe dry-run that exactly matches the logic for skeema push. If unsafe operations are not explicitly allowed, skeema diff will display unsafe operations as commented-out DDL.

You may also configure Skeema to always permit unsafe operations on tables below a certain size (in bytes), or always permit unsafe operations on tables that have no rows. See the safe-below-size option.

Detection of unsupported table features

If a table uses a feature not supported by Skeema, such as spatial indexes, Skeema will refuse to generate ALTERs for the table. These cases are detected by comparing the output of SHOW CREATE TABLE to what Skeema thinks the generated CREATE TABLE should be, and flagging any discrepancies as tables that aren’t supported for diffing or altering. This is noted in the output, and does not block execution of other schema changes. When in doubt, always check skeema diff as a safe dry-run prior to using skeema push.

No reliance on SQL parsing

Skeema’s behavior does not rely on parsing SQL DDL. Parsing SQL is brittle across various MySQL and MariaDB versions, which have subtle differences in features, functionality, and syntax. Instead, Skeema introspects your database objects by querying information_schema and using SHOW commands, to obtain metadata directly from the database itself.

In order to accurately introspect the schemas represented in your filesystem’s *.sql files, Skeema actually runs the files' CREATE statements in a temporary location, called a workspace. By default, Skeema creates, uses, and then drops a database called _skeema_tmp on each database it interacts with. (This behavior is configurable: a different schema name can be used, or a local Docker container can be used instead of each live database, or a separate database server may be configured for workspaces).

When operating on a workspace, Skeema halts immediately if any workspace table is detected to be non-empty (contains any rows). This prevents disaster if someone accidentally misconfigures Skeema’s workspace-related options.

Only skeema push manipulates real schemas

Aside from the workspace operations described above, only one command modifies schemas and tables: skeema push. All other commands are read-only in terms of interactions with live tables.

Auto-generated DDL is verified for correctness

Skeema is a declarative tool: users declare what the table should look like (via CREATE TABLE files), and the tool generates the corresponding ALTER TABLE in skeema diff (outputted but not run) and skeema push (actually executed). When generating these statements, Skeema automatically verifies their correctness by testing them in a workspace. This confirms that running the generated DDL against an empty copy of the old (live) table definition correctly yields the expected new (from filesystem/repo) table definition. If verification fails, Skeema aborts with an error.

Pedigree

Skeema’s creator has been using MySQL extensively since 2003, and is a former member of Facebook’s elite team that maintains and automates the world’s largest MySQL environment. Prior to Facebook, he started and led the database team at Tumblr, and created the open-source Ruby database automation library and shard-split tool Jetpants. Rest assured that safety of data is baked into Skeema’s DNA. All outside open source contributions are held to an equally high standard of code quality review and automated test coverage.

Since Skeema’s initial release in 2016, a number of large well-known public tech companies have adopted Skeema as a critical part of their development and deployment toolchain. Between the Community and Premium editions, the Skeema CLI is now installed over 60,000 times per month, powering thousands of schema management operations worldwide every day.

Responsibilities for the user

Please see the requirements doc for important notes relating to running Skeema safely in your environment.

Do schema changes get pushed automatically when I change files?

No; the Skeema CLI is not a daemon / server. When using the CLI tool, schema changes only occur when you run skeema push.

To achieve automated deployment, you may optionally wrap skeema push in custom automation. The CLI is designed to be easy to integrate into a pipeline, in terms of exit codes and STDOUT vs STDERR. Additionally, Skeema Cloud Linter is designed to provide configurable guardrails around pull-request-driven automation pipelines.

I have several database schemas. How do I interact with just one at a time?

The simplest solution is to cd to the desired schema’s subdirectory before invoking skeema. Most Skeema commands operate recursively from the current working directory. This means you can run e.g. skeema diff from a parent directory to diff everything, or from a specific schema subdirectory to diff just that schema.

If you prefer to always invoke skeema from a parent / top-level directory, another solution is to leverage environment names in .skeema files. A bottom-level schema directory is only processed if its .skeema file defines the schema name option for the current environment. By default, skeema init places schema names in .skeema files “outside” of any environment section (applying to all environments), but you can freely edit this configuration to restrict subdirs to specific environment names.

For example, say you have multiple schemas, one of which is called “blog”. By default, skeema init makes blog/.skeema look something like this:

schema=blog
default-character-set=latin1
default-collation=latin1_swedish_ci

You could manually change it to this instead:

default-character-set=latin1
default-collation=latin1_swedish_ci

[production]
schema=blog

[blog]
schema=blog

[blog_dev]
schema=blog

Now even if you invoke skeema from a parent directory, this directory will be processed when running skeema diff blog, skeema diff blog_dev, skeema diff production, or skeema diff with no environment (defaults to “production”). However, the blog directory will not be processed for skeema diff otherapp since it does not define a schema name in an “otherapp” section, nor does it define a default schema name “outside” (above) any section.

How do I configure Skeema to use online schema change tools?

The alter-wrapper option for skeema diff and skeema push allows you to shell out to arbitrary external commands to perform ALTERs. You can set this option in ~/.skeema or any other .skeema config file to automatically apply it every time. For example, to always use pt-online-schema-change to perform ALTERs, you might have a config file line of:

alter-wrapper=/usr/local/bin/pt-online-schema-change --execute --alter {CLAUSES} D={SCHEMA},t={TABLE},h={HOST},P={PORT},u={USER},p={PASSWORDX}

The brace-wrapped variables will automatically be replaced with appropriate values from the corresponding .skeema files. A few special explanations about the command-line above:

  • The {PASSWORDX} variable is equivalent to {PASSWORD} in execution, but it displays as X’s whenever printed to STDOUT.
  • The {CLAUSES} variable returns the portion of the DDL statement after the prefix, e.g. everything after ALTER TABLE table_name . You can also obtain the full DDL statement via {DDL}.
  • Variable values containing spaces or control characters will be escaped and wrapped in single-quotes, and then the entire command string is passed to /bin/sh -c on Linux/MacOS or powershell.exe -Command on Windows.

This feature works most easily for pt-online-schema-change. Integration with gh-ost can be more challenging (but definitely possible), because its recommended execution mode requires passing it a replica, not the master; but meanwhile .skeema files should only refer to the master, since this is where CREATE TABLE and DROP TABLE statements need to be run. Similar problems exist with using fb-osc, which must be run on the master and all replicas individually.

How do I force Skeema to use the database’s built-in online DDL functionality?

By default, ALTER TABLE in MySQL 5.6+ and MariaDB 10.0+ will use the least-disruptive algorithm/locking method possible. However, some types of ALTERs will still block writes to the table. Skeema’s alter-algorithm and alter-lock provide a way to fast-fail any ALTER which would otherwise block writes.

Note that online-DDL ALTERs on larger tables can still be problematic for traditional binlog replication, due to the replication lag they create. In such situations, consider using alter-wrapper and alter-wrapper-min-size to implement conditional logic: use built-in online DDL for smaller tables, and an external online schema change (OSC) tool for larger tables.

How do I configure Skeema to use service discovery?

There are several possibilities here, all based on how the host and host-wrapper options are configured:

  • DNS: host set to a domain name, and don’t set host-wrapper at all. This works if you can provide a consistently up-to-date domain name for the master of each cluster. It isn’t friendly towards sharded environments though, nor is it a good solution if nonstandard port numbers are in use. (Skeema does not yet support SRV record lookups.)

  • External command shellout: configuring host-wrapper to shell out to a service discovery client. In this configuration, rather than host being set to a literal address, it should be a lookup key to pass to service discovery. The host-wrapper command-line can then use the {HOST} placeholder variable to obtain each directory’s lookup key, such as host-wrapper=/path/to/service_discovery_lookup.sh /databases/{ENVIRONMENT}/{HOST}. The executed script should be capable of doing lookups such as “return the master of cluster foo” or “return all shard masters for sharded cluster xyz”.

  • Configuration management: You could use a system like Chef or Puppet to rewrite directories' .skeema config files periodically, ensuring that an up-to-date master IP is listed for host in each file.

Simpler integration with etcd, Consul, and ZooKeeper may be added in the future.