MySQL's default character set determines which characters can be stored safely and how well an application can support internationalization. Older installations (MySQL 5.7 and earlier) default to latin1, and many deployments were later switched to utf8; MySQL 8.0 already defaults to utf8mb4. Both legacy encodings have limits: latin1 is a single-byte character set and cannot store Chinese or other multi-byte text, while MySQL's utf8 (an alias for utf8mb3, at most 3 bytes per character) cannot store Emoji or some rare Chinese characters, which leads to garbled text, truncation, or errors on insert and query. To avoid these problems, use utf8mb4 consistently at the server, database, table, and column levels, together with a suitable collation (such as utf8mb4_unicode_ci or utf8mb4_general_ci). This is now widely regarded as best practice.
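The byte-width limits described above can be checked from any language with Unicode support. The following is a minimal Python sketch, not a MySQL API: the helper names utf8_width, needs_utf8mb4, and fits_latin1 are hypothetical, and Python's cp1252 codec is used as an approximation of MySQL's latin1.

```python
def utf8_width(ch: str) -> int:
    """Return how many bytes a single character occupies in UTF-8."""
    return len(ch.encode("utf-8"))


def needs_utf8mb4(text: str) -> bool:
    """True if any character lies outside the Basic Multilingual Plane.

    Such characters need 4 bytes in UTF-8 and therefore do not fit into
    MySQL's 3-byte utf8 (utf8mb3) character set.
    """
    return any(ord(ch) > 0xFFFF for ch in text)


def fits_latin1(text: str) -> bool:
    """True if every character can be encoded in a single-byte Western set.

    Assumption: cp1252 stands in for MySQL's latin1 here.
    """
    try:
        text.encode("cp1252")
        return True
    except UnicodeEncodeError:
        return False


if __name__ == "__main__":
    for sample in ("A", "\u4e2d", "\U0001F600"):
        print(repr(sample), utf8_width(sample), needs_utf8mb4(sample))
```

A check like needs_utf8mb4 can be run over sample application data before the migration to estimate how much content is currently at risk of being truncated or rejected.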
Migrating the default encoding from latin1 or 3-byte utf8 to utf8mb4 involves four key steps: modifying the configuration file, restarting the service and verifying the result, migrating existing data, and configuring client connections.
First, declare the global character set and collation in the MySQL configuration file. On common Linux systems the file is usually /etc/my.cnf or /etc/mysql/my.cnf. Add character-set-server=utf8mb4 and collation-server=utf8mb4_unicode_ci to the [mysqld] section so that the server starts with utf8mb4 as its default character set and collation. Also add default-character-set=utf8mb4 to the [client] and [mysql] sections so that the bundled client programs connect with utf8mb4 by default. An example configuration:
[mysqld]
character-set-server = utf8mb4
collation-server = utf8mb4_unicode_ci
init_connect = 'SET NAMES utf8mb4'
[client]
default-character-set = utf8mb4
[mysql]
default-character-set = utf8mb4
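The init_connect setting makes each new session run SET NAMES utf8mb4 automatically, which guards against applications that do not set a character set themselves and would otherwise produce garbled text. Note that init_connect is not executed for accounts with the SUPER privilege (or CONNECTION_ADMIN in MySQL 8.0), so administrative sessions are not covered by it. After editing, restart the MySQL service with systemctl restart mysqld or service mysql restart (the service name depends on the distribution) so that the configuration takes effect.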
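Before restarting, it can be useful to confirm that the edited file really contains the intended values. The sketch below parses my.cnf-style text with Python's standard configparser and compares it with the settings from the example above. The function name check_mycnf is hypothetical, and the parser is only an approximation of MySQL's own option-file handling: it does not follow !include or !includedir directives.

```python
import configparser

# Settings taken from the example configuration above.
EXPECTED = {
    ("mysqld", "character-set-server"): "utf8mb4",
    ("mysqld", "collation-server"): "utf8mb4_unicode_ci",
    ("client", "default-character-set"): "utf8mb4",
    ("mysql", "default-character-set"): "utf8mb4",
}


def check_mycnf(text: str) -> list:
    """Return a list of problem descriptions; an empty list means all OK."""
    # allow_no_value: option files may contain bare flags without "= value".
    # strict=False: tolerate repeated sections or options.
    # interpolation=None: keep "%" characters literal.
    parser = configparser.ConfigParser(
        allow_no_value=True, strict=False, interpolation=None
    )
    # MySQL treats dashes and underscores in option names as equivalent,
    # so normalize both spellings to the dashed form.
    parser.optionxform = lambda option: option.lower().replace("_", "-")
    parser.read_string(text)

    problems = []
    for (section, option), wanted in EXPECTED.items():
        actual = parser.get(section, option, fallback=None)
        if actual != wanted:
            problems.append(
                f"[{section}] {option}: expected {wanted}, found {actual}"
            )
    return problems
```

In practice the text would come from reading the configuration file, for example open("/etc/my.cnf").read(), and any returned problems would be fixed before the restart.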
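After the restart, verify the global variables by executing SHOW VARIABLES LIKE 'character_set_%'; and SHOW VARIABLES LIKE 'collation%';. If character_set_server, character_set_database, character_set_client, character_set_connection, and character_set_results all report utf8mb4, and collation_server and collation_database report utf8mb4_unicode_ci, the global default encoding has been changed successfully. character_set_system (utf8) and character_set_filesystem (binary) are expected to stay as they are. collation_connection belongs to the client session, so after a plain SET NAMES utf8mb4 it may show the server version's default utf8mb4 collation rather than utf8mb4_unicode_ci.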
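The verification above can also be automated. The following sketch assumes the rows returned by the two SHOW VARIABLES statements have already been collected into a dict by whatever database driver is in use; the function name find_mismatches is hypothetical.

```python
# Variables that should report utf8mb4 after the change. character_set_system
# and character_set_filesystem are left out on purpose, because they are not
# expected to change.
CHARSET_VARS = (
    "character_set_client",
    "character_set_connection",
    "character_set_database",
    "character_set_results",
    "character_set_server",
)
# Server-side collations that should match the configured collation exactly.
COLLATION_VARS = ("collation_database", "collation_server")


def find_mismatches(variables, charset="utf8mb4",
                    collation="utf8mb4_unicode_ci"):
    """Return {variable_name: actual_value} for every unexpected setting."""
    mismatches = {}
    for name in CHARSET_VARS:
        if variables.get(name) != charset:
            mismatches[name] = variables.get(name)
    for name in COLLATION_VARS:
        if variables.get(name) != collation:
            mismatches[name] = variables.get(name)
    # collation_connection depends on how the client session was set up, so
    # only require that it belongs to the target character set.
    conn = variables.get("collation_connection")
    if conn is None or not conn.startswith(charset + "_"):
        mismatches["collation_connection"] = conn
    return mismatches
```

An empty result means the session sees the expected settings; anything else names the variable that still needs attention.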
Next, change the encoding and collation of existing databases and tables. Databases created after the change need no extra work, because a new database picks up the global default automatically. For each existing database, execute ALTER DATABASE dbname CHARACTER SET = utf8mb4 COLLATE = utf8mb4_unicode_ci; to change the database-level default to utf8mb4. This only affects tables created afterwards, so each existing table must then be converted, for example with ALTER TABLE tablename CONVERT TO CHARACTER SET utf8mb4 COLLATE utf8mb4_unicode_ci;. This statement changes the table's default character set and also transcodes every text column (CHAR, VARCHAR, TEXT, and so on) while preserving the data; MySQL may widen a TEXT-type column to the next larger type so that the converted values still fit. On a large table the statement can run for a long time, and writes to the table are blocked while it runs. Run it during off-peak hours, or use an online schema change tool (such as gh-ost or pt-online-schema-change) to reduce downtime.
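When these statements are produced by a script rather than typed by hand, identifiers should be quoted so that unusual database or table names do not break the SQL. A minimal sketch, with hypothetical helper names quote_ident, alter_database_sql, and convert_table_sql:

```python
def quote_ident(name: str) -> str:
    """Quote a MySQL identifier with backticks, doubling embedded backticks."""
    return "`" + name.replace("`", "``") + "`"


def alter_database_sql(schema, charset="utf8mb4",
                       collation="utf8mb4_unicode_ci"):
    """Build the statement that changes a database's default charset."""
    return (
        f"ALTER DATABASE {quote_ident(schema)} "
        f"CHARACTER SET = {charset} COLLATE = {collation};"
    )


def convert_table_sql(schema, table, charset="utf8mb4",
                      collation="utf8mb4_unicode_ci"):
    """Build the statement that converts one table and its text columns."""
    return (
        f"ALTER TABLE {quote_ident(schema)}.{quote_ident(table)} "
        f"CONVERT TO CHARACTER SET {charset} COLLATE {collation};"
    )


if __name__ == "__main__":
    # "shop" and "orders" are placeholder names for illustration only.
    print(alter_database_sql("shop"))
    print(convert_table_sql("shop", "orders"))
```

Printing the statements first and reviewing them before execution is a cautious way to start, since the conversion itself is slow and blocks writes.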
In large clusters or sharded deployments, the migration can be scripted across all databases and tables. Query SCHEMATA and TABLES in INFORMATION_SCHEMA, then use a scripting language (Bash, Python, etc.) to generate and execute the ALTER statements one by one. Measure and monitor migration progress, table locks, and disk space from the start, so that unexpected failures or resource exhaustion are caught early. After the migration, list the tables with SHOW FULL TABLES WHERE Table_Type = 'BASE TABLE'; and inspect SHOW CREATE TABLE tablename; for each one to confirm that the table-level character set has been updated.
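The selection step of such a script might look like the sketch below. It assumes the rows of the query in LIST_TABLES_SQL have been fetched with some database driver as (schema, table, collation) tuples; the names LIST_TABLES_SQL, SYSTEM_SCHEMAS, and tables_to_convert are hypothetical. Only the table default collation is inspected, so individual columns with their own character set would need a separate check against INFORMATION_SCHEMA.COLUMNS.

```python
# Built-in schemas that should not be touched by the migration.
SYSTEM_SCHEMAS = {"mysql", "information_schema", "performance_schema", "sys"}

# Query to run against the server to list candidate tables.
LIST_TABLES_SQL = """
SELECT TABLE_SCHEMA, TABLE_NAME, TABLE_COLLATION
FROM information_schema.TABLES
WHERE TABLE_TYPE = 'BASE TABLE'
ORDER BY TABLE_SCHEMA, TABLE_NAME
"""


def tables_to_convert(rows):
    """Return (schema, table) pairs whose default collation is not utf8mb4.

    Rows with an unknown (None) collation are kept so that they can be
    reviewed manually instead of being skipped silently.
    """
    pending = []
    for schema, table, collation in rows:
        if schema.lower() in SYSTEM_SCHEMAS:
            continue
        if collation is not None and collation.startswith("utf8mb4_"):
            continue
        pending.append((schema, table))
    return pending
```

Each returned pair can then be passed to an ALTER TABLE ... CONVERT TO CHARACTER SET statement, and because already converted tables are skipped, the script can be rerun safely after an interruption.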
Beyond the database and table structure, specific column data types also need attention. BLOB and other binary columns need no change, but for CHAR, VARCHAR, TEXT, ENUM, and SET columns, watch the maximum byte length. A utf8mb4 character can occupy up to 4 bytes, so a latin1 or utf8 column that is already close to its length limit may exceed the row size limit after conversion (MySQL's maximum row size is 65,535 bytes). Indexed columns are affected as well: with the InnoDB COMPACT or REDUNDANT row format the index key prefix is limited to 767 bytes, which covers VARCHAR(255) in utf8 but only 191 characters in utf8mb4. In such cases, shorten the column, switch to a TEXT type, index only a prefix, or move the data to a separate table so that the conversion does not fail.
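The length arithmetic behind these limits is simple enough to script. The sketch below is an estimate only: it assumes a single NOT NULL VARCHAR column with a 2-byte length prefix, and the helper names worst_case_bytes and max_varchar_chars are hypothetical. The 767-byte figure used in the example is the InnoDB index key prefix limit for the COMPACT and REDUNDANT row formats.

```python
ROW_SIZE_LIMIT = 65535  # MySQL row size limit in bytes
# Maximum bytes per character for the character sets discussed here.
BYTES_PER_CHAR = {"latin1": 1, "utf8": 3, "utf8mb4": 4}


def worst_case_bytes(char_length: int, charset: str) -> int:
    """Bytes a column of char_length characters can need in the worst case."""
    return char_length * BYTES_PER_CHAR[charset]


def max_varchar_chars(charset: str, byte_limit: int = ROW_SIZE_LIMIT,
                      overhead: int = 2) -> int:
    """Largest VARCHAR(n) that fits into byte_limit for the given charset.

    overhead is the length prefix stored with the value (2 bytes for long
    VARCHAR columns); pass overhead=0 when checking an index prefix limit.
    """
    return (byte_limit - overhead) // BYTES_PER_CHAR[charset]


if __name__ == "__main__":
    print(worst_case_bytes(255, "utf8"))       # 765
    print(worst_case_bytes(255, "utf8mb4"))    # 1020
    print(max_varchar_chars("utf8"))           # 21844
    print(max_varchar_chars("utf8mb4"))        # 16383
    print(max_varchar_chars("utf8mb4", byte_limit=767, overhead=0))  # 191
```

Running this kind of check against the declared lengths of existing columns before the migration shows which ones need to be shortened or changed to TEXT first.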
Finally, configure the character set in the client and application layers. Once the server uses utf8mb4 consistently, the application's database connection must request it as well. With Java JDBC (MySQL Connector/J), specify useUnicode=true&characterEncoding=UTF-8 in the connection URL; the driver expects a Java encoding name here, and characterEncoding=utf8mb4 is rejected by many driver versions. In PHP, set charset=utf8mb4 in the PDO DSN or call set_charset('utf8mb4') on a mysqli connection; executing SET NAMES 'utf8mb4' right after connecting also works. It is best to put this setting in one shared place in the application's initialization code, so that no connection path is missed and writes garbled data. In multi-language environments or frameworks, set the default character set in a configuration file or environment variable so that the whole stack, from client to database, uses utf8mb4.
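Keeping the connection settings in one place can be as simple as generating every connection string from the same helper. The sketch below builds a JDBC URL and a PHP PDO DSN; the function names and the host and database values are hypothetical. It uses characterEncoding=UTF-8 for JDBC on the assumption that Connector/J expects a Java encoding name rather than the MySQL name utf8mb4, which should be verified against the driver version in use.

```python
from urllib.parse import urlencode


def jdbc_url(host: str, database: str, port: int = 3306) -> str:
    """Build a MySQL JDBC URL that requests Unicode with UTF-8 encoding."""
    params = urlencode({"useUnicode": "true", "characterEncoding": "UTF-8"})
    return f"jdbc:mysql://{host}:{port}/{database}?{params}"


def pdo_dsn(host: str, database: str, port: int = 3306) -> str:
    """Build a PHP PDO DSN that selects utf8mb4 for the connection."""
    return f"mysql:host={host};port={port};dbname={database};charset=utf8mb4"


if __name__ == "__main__":
    # "db.example.internal" and "shop" are placeholder values.
    print(jdbc_url("db.example.internal", "shop"))
    print(pdo_dsn("db.example.internal", "shop"))
```

Generating the strings centrally means a later change, such as a different collation or an extra driver option, only has to be made once.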
With these four steps (configuration file changes, restart and verification, data migration, and client configuration), the MySQL default encoding can be switched to utf8mb4. Verify the whole process in a test environment first, and prepare a data backup and recovery plan to cover migration risks. Automated scripts and monitoring with alerting simplify the migration and make it more efficient. Once complete, the database has broader character compatibility, supports multilingual applications and Emoji in every table, and provides a solid foundation for later product iterations and international expansion.