Until version 4.1, MySQL tables were encoded with the latin1 character set. By default, the character set is now utf8. This is a good thing in terms of non-latin character support, but if you’re upgrading from an older database you may run into a lot of character encoding problems.

I’ve recently migrated several older databases from 4.0 to MySQL 5 and converted a few tables from latin1 to utf8, and I’ve put together a few tricks that might help you through the same. So far, I’ve basically run into three main issues: converting a table from the latin1 charset to utf8, upgrading a whole database set that was properly exported with mysqldump, and (the worst) migrating an entire set of database’s MYI and MYD files that weren’t properly exported.

Converting a latin1 table to utf8
If you’ve already migrated successfully to the new MySQL version, but you have an older latin1 encoded table and you simply want to change it’s character set to utf8, it’s a fairly simple exercise. First, export the table with mysqldump:

mysqldump -u username -p database --default-character-set=latin1 table > tableoutput.sql

First make a back up of that file and just edit it to adjust two things. Find the line that contains “SET NAMES latin1” and change that to “SET NAMES utf8”. Then, look at the table definition and change “DEFAULT CHARSET=latin1” to “DEFAULT CHARSET=utf8”. Finally, reimport the table:

cat tableoutput.sql | mysql --default-character-set=utf8 -u username -p database

When it’s imported back in, it will properly be created as a utf8 table and you shouldn’t have any character encoding problems. You can do this for a whole database with the same technique and using a search and replace to switch latin1 to utf8.

Upgrading a whole database set that was properly exported with mysqldump
This is a pretty simple scenario. You’ve exported your entire database from the previous MySQL version with something like this:

mysqldump --default-character-set=latin1 -u username -p database > dboutput.sql

You can then import it either as a latin1 table, or convert it to utf8 as we did above. Just make sure to adjust the “SET NAMES” and “CHARACTER SET” values appropriately, and set the “–default-character-set” parameter to utf8 or latin1, as appropriate.

Note that the default character set is now utf8 unless you change it in the my.cnf file. I’ve been on systems where it’s been set to latin1 for compatability reasons, however, so you can’t just assume this. Make sure that all of your database code (in your php, perl, whatever) issues the “SET NAMES utf8” SQL statement before issuing any further SQL commands. This will ensure that what you set in INSERTs and receive in SELECTs isn’t mis-encoded. If you are using a legacy latin1 table, make sure you do the same, but with the latin1 setting.

Migrating an entire set of database’s MYI and MYD files that weren’t properly exported
I was a bonehead the other day and didn’t properly export one of my server’s databases before upgrading it to MySQL 5. After upgrading, all of my databases were corrupted, including the mysql database, meaning I couldn’t even log in. The mysql_upgrade command didn’t fix things properly, and it was all because MySQL assumed my tables from that older version were encoded in utf8.

It was, like, a bummer.

I don’t know if there is an easier way, but this is how I managed to transition all those MYI and MYD files to the new database, without reverting to the older version and exporting properly.

First, you need to change the mysql config so that when it loads your old databases it assumes they are latin1 encoded. To do this, edit the my.cnf file. Find all the lines that say “default-character-set=utf8” and change them all to say “default-character-set=latin1”.

Start up mysqld and have it ignore permissions:

mysqld_safe --skip-grant-tables &

Now, run “mysql_upgrade”. When it finishes, do a SELECT on the mysql.user table. You should see that usernames and encoded passwords haven’t been truncated in half and replaced with gobbledeguk.

Kill and restart the mysql server. You should be able to log in, meaning the mysql users table is, in fact, uncorrupted. You’ll find that all your other tables are fine as well. One thing that you might notice is that the mysql_upgrade script will have updated the mysql databases to use utf8. All of your other databases are still in latin1, however, so you’ll need to convert them to utf8 (if you desire) using the instructions above. Also, if you don’t require legacy support for older applications, you should probably go back to the my.cnf file and change the default character set back to utf8 for everything.

Your thoughts
From this point forward, everyone should be using utf8 across the board, but it seems like making the jump is more painful than it should be. If you know any good MySQL character encoding tips and tricks, help out your fellow hacker. Share them in comments!