By Evan Elias | August 12, 2021
Converting your database to a different character set can be a difficult and perilous process. In this post, we’ll cover several ways that Skeema can help simplify this task.
utf8mb4 character set is capable of encoding all unicode characters, including emoji and other supplemental 4-byte characters. Although
utf8mb4 is now the default in MySQL 8, older installations of MySQL and MariaDB frequently still use more restrictive charsets, such as
utf8 (also called
Manually converting the charset of all objects in a database can be both tedious and error-prone. Most tutorials on this topic rely on custom scripting in order to query
information_schema and shell out to external OSC tools. With Skeema, no custom scripting is required, as the tool already provides the necessary functionality:
As a declarative infrastructure-as-code tool, Skeema operates entirely on a repo of
CREATEstatements. This means you can find and modify the relevant charset clauses using your favorite IDE’s multi-file search-and-replace functionality, and/or command-line tools like
Skeema automatically tests all DDL in a temporary location called a workspace, and notifies you of any errors. This means you’ll immediately learn if the new charset causes any columns to be too large for their data type or indexes – prior to any DDL being applied to your actual tables.
Skeema’s configurable linter rules allow you to specify which charsets to permit, and whether to treat violations as a warning or fatal error. This aids in identifying every table with the old charset, verifying nothing was missed after the conversion, and preventing future accidental use of the old charset.
Skeema’s config files support having any number of named environments such as “production”, “staging”, “qa”, “development”, etc. This way, you can easily test the charset conversion process on each non-production environment before moving on to prod.
Preparing the schema repo changes
This post assumes you’re already using Skeema and have a schema repo.
If your schema repo has grown organically over time, it may be useful to run
skeema format once, to restore a consistent set of style rules. This command will reformat everything in the database’s native style, matching the output of
SHOW CREATE queries. In terms of tables and their
CREATE TABLE statements in *.sql files, this ensures each table’s
DEFAULT CHARSET clause will always be present, simplifying search-and-replace operations.
Now you can begin bulk-editing
CREATE statements to replace the old charset with the new one.
In terms of version control, you can choose to put everything in a single giant commit, or you can split it up into multiple commits (and, later, multiple
skeema push operations) however you prefer.
Automated search-and-replace typically works well for the first pass. For example, to use BSD
sed (as found on MacOS) to replace
utf8mb4, along with also replacing their corresponding default collations:
sed -e 's/utf8\([^m_]\)/utf8mb4\1/g' -e 's/utf8_general_ci/utf8mb4_0900_ai_ci/g' -i '' *.sql
Or with GNU
sed (as found on Linux), note that the -i parameter (enabling modification of files in-place) works slightly differently:
sed -e 's/utf8\([^m_]\)/utf8mb4\1/g' -e 's/utf8_general_ci/utf8mb4_0900_ai_ci/g' -i *.sql
The first regular expression is designed to match “utf8” but not “utf8m” (as in “utf8mb4”) or “utf8_” (as in utf8 collation names). The \1 backreference then substitutes back in whichever character followed “utf8”, such as a space, comma, or semicolon.
Afterwards, some fine-tuning and verification may be needed, differing slightly by object type.
The overall goal is to adjust the
DEFAULT CHARSET clause in your tables' *.sql files. If any individual columns also use a different charset than the table default, you will need to adjust those too as appropriate, but this is relatively uncommon. In most cases, you just need to change the table’s default; any columns that don’t specify an explicit charset will be handled by Skeema automatically. In any event, automated search-and-replace tends to work well.
After modifying the
CREATE TABLE statements in your *.sql files, you can verify you caught everything using
skeema lint. Be sure to adjust your lint-charset and allow-charset options as needed, either on the command-line (as a one-time override) or in .skeema config files (when you’re ready to persist the linter config change).
Later, during the execution step below, Skeema will emit the appropriate
ALTER TABLE clauses to change tables' default charset, as well as column-level changes. Note that Skeema does not ever use the
CONVERT TO CHARACTER SET shortcut clause, since this can silently change column data types, among other problems. Skeema will instead emit alter clauses for each individual string-typed column.
Procs and Funcs
Automated search-and-replace can catch any cases where an explicit charset is used in your routines' definitions. But what about strings that don’t explicitly specify a charset? It turns out that the DB permanently remembers the schema-level default charset and collation as it existed at the routine’s creation time, and uses that for parameters – even if the schema-level default later changes to something else! From the MySQL manual:
For stored routines (procedures and functions), the database character set and collation in effect at routine creation time are used as the character set and collation of character data parameters for which the declaration includes no CHARACTER SET or a COLLATE attribute. To override this, provide CHARACTER SET and COLLATE explicitly.
Skeema has an option to account for changes in creation-time metadata in its diff logic of routines, but this option isn’t enabled by default. See the discussion on compare-metadata below in the execution section.
The situation for triggers is essentially identical to that of procs and funcs: automated search-and-replace handles most cases, but you may also need to enable Skeema’s compare-metadata option during execution of the change if any trigger bodies rely on strings which lack an explicit charset.
Please note that triggers are only supported by premium Skeema products. The open source Community edition of the Skeema CLI simply ignores triggers entirely.
Views rarely need any special handling for character set conversions. Once you’ve altered the underlying tables, the view definition will automatically reflect the new character set. The only special-case is when strings are explicitly cast to a specific charset and/or collation in a view’s SELECT query, but generally search-and-replace tools should find these.
Please note that views are only supported by premium Skeema products. The open source Community edition of the Skeema CLI simply ignores views entirely.
Every database schema has a notion of a default charset and default collation. These are applied in situations in which an explicit charset or collation are not specified. In Skeema, these are configured using default-character-set and default-collation in the .skeema file of each subdir that has *.sql files. You should generally configure them alongside the schema option which specifies the schema name.
After modifying these values, during the execution step Skeema will emit the appropriate
ALTER DATABASE clauses.
Each DB instance also has a notion of a server-level default, in the form of the
collation_server dynamic global variables. These must be modified outside of Skeema. Be sure to change the value in the running DB instance, as well as persisting the change so that it survives server restarts.
As with any schema change in Skeema, the core process is simple:
skeema diff to review a dry-run of the change, and then
skeema push to actually execute it. Before actually running this in production, there are a few important considerations:
Due to their delicate nature, table charset changes are always considered “unsafe” by Skeema. You will need to use the
--allow-unsafeoption on the command-line to proceed, for both
skeema push. (Even though
skeema diffdoesn’t actually run DDL, it still requires this option so that its parameter handling exactly matches that of
If you have any procs, funcs, or triggers whose bodies use strings without an explicit charset, you may want to enable Skeema’s compare-metadata option so that these objects are re-created alongside the schema default charset change. Please note that these DROP-and-re-CREATE operations are not atomic, so there will be a split-second period in which the object does not exist. This may cause a brief moment of failing queries or skipped triggers, so take extreme care when doing this in production.
It’s a good practice to have multiple team members review the *.sql changes (e.g. git commits), as well as the
skeema diffoutput. If your schema repo is stored on GitHub.com, Skeema Cloud Linter makes code review easy by including the diff and lint output inline in each pull request.
Test the change in non-prod environments. For example, if you have a “staging” environment configured, use
skeema diff stagingfollowed by
skeema push staging.
Be sure to create a backup of your database right before running any major schema change in prod.
This post has focused only on the mechanical side of the charset change operation. Your overall process should also include quite a bit of planning, application testing, and backup / emergency restore considerations. These topics are largely outside the scope of this post, and will just be discussed here briefly.
You will need to evaluate the particulars of your application’s charset and collation usage, in both the old and new charsets. Some conversions are easier than others; for example,
utf8mb4 is somewhat straightforward since
utf8 is a subset of
utf8mb4. Meanwhile, moving from
utf8mb4 will require extra work if your “latin1” actually consists of utf8 shoved into a latin1 byte stream – an extremely common scenario for older applications. In this case, you will need an intermediate migration to binary column types prior to moving to utf8mb4, doubling the work involved.
Your choice of collation requires serious consideration. In most cases, you should use a collation that is both case-insensitive (“ci”) and accent-insensitive (“ai”). However, keep in mind this has implications for comparisons and sorting, especially if your previous collation had different equality rules: accent or case insensitivity affects evaluation of unique constraints for any
PRIMARY KEY or
UNIQUE KEY on string columns. DB version upgrades may also complicate things; MySQL 8 adds some great new collations for utf8mb4, but these are unavailable in prior versions. Even the default collation for utf8mb4 has changed in MySQL 8.
If any of your string columns have
FOREIGN KEYs with
ON UPDATE CASCADE, beware a major landmine that can cause data corruption. The workaround involves temporarily disabling foreign key checks globally, which in turn may require performing the operation on a paused replica which is later promoted to be the new primary.
On the application side, don’t forget that connections (sessions) also have a charset and collation. Be sure that your application / client / driver is configured properly to avoid garbled data, and test interactions between the application and DB thoroughly, both during and after the charset switch.
Finally, and perhaps most importantly, it is absolutely essential to have a well-tested database backup-and-restore solution prior to attempting a major charset conversion.
All these complications can be a bit overwhelming, especially if your tables are large. For safety’s sake, you may want to consider setting up a retainer agreement with an expert consultant, and/or subscribing to a support contract with your database vendor.