MySQL collations demystified
Posted by Kelvin on 07 Jun 2010 at 06:52 pm | Tagged as: programming
Good article here: http://code.openark.org/blog/mysql/mysqls-character-sets-and-collations-demystified
MySQL’s character sets and collations are often considered as a mystery, and many users either completely disregard them and keep with the defaults, or set everything to UTF8.
This post will attempt to shed some light on the mystery, and provide with some best practices for use with text columns with regard to character sets.
Character Sets
A thorough discussion of how the character sets have evolved through history is beyond the scope of this post. While the Unicode standard is gaining recognition, the “older” character sets are still around. Understanding the difference between Unicode and local character sets is crucial.
Consider, for example, MySQL’s latin1 character set. In this character set there are 256 different characters, represented by one byte. The first 128 characters map to ASCII, the standard “ABCabc012 dot comma” set, of which most of this post is composed. The latter 128 characters in latin1 are composed of West European specific characters, such as À, ë, õ, Ñ.
A Name VARCHAR(60) CHARSET latin1 column can describe names with West European characters. But it cannot describe Russian or Hebrew names. To represent a name in Hebrew, you’d need the hebrew charset (ISO 8859-8), in which the first 128 characters are, as always, mapped to ASCII, and the latter 128 characters describe the Hebrew alphabet and punctuation marks, such as ?,?,?,?. The Cyrillic, Arabic and Turkish charsets follow in a similar manner.
Assume now I’m building a world wide web application, such as a popular social network. I would like to store the first names of my users, in every possible language. None of the above character sets support all languages. I therefore turn to Unicode. In particular, MySQL supports utf8, a Unicode encoding scheme, which is commonly used due to its economic storage requirements.
In Unicode there is a dedicated number for each letter in the known languages, in ancient languages, and some imaginary or otherwise non existing languages, such as Klingon (yes, I know there are people who actually speak Klingon), may yet find their way into the standard…