What Is Unicode?

If you're familiar with the technical details of how text is stored in a computer and you're a native English speaker, you've probably heard of ASCII, the American Standard Code for Information Interchange. ASCII assigns a number to each letter, digit, common symbol and control character; control characters do things like beep the computer speaker or signal the beginning of a new line. It's been around for decades and it works great - as long as your primary language is U.S. English.
However, computer users around the world speak many other languages, some of which aren't even close to English. If you're a developer and your software doesn't take this into account, you can have some real headaches.
Joel Spolsky, a software engineer and writer, recounts what can happen if programmers aren't careful about locales:
"A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they 'couldn't do anything about it.' Like many programmers, he just wished it would all blow over somehow."

If you know about Unicode, you can build applications that handle other languages with ease. Unicode is a truly universal character set, maintained by an international consortium, that makes it possible to develop truly universal programs. Computer scientist Donald E. Knuth called it "the best tool I know to help bring understanding between people of different cultures."
So in Unicode, characters are represented as "code points" rather than bytes: abstract numbers that are independent of how the text is actually stored in the computer. The Unicode standard represents letters, numbers, currency symbols and writing direction (for languages that read right-to-left, like Hebrew and Arabic).
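To see the abstraction in action, here is a short sketch (Python is used purely for illustration; any Unicode-aware language exposes code points in a similar way):

```python
# Each character corresponds to a numeric code point,
# regardless of how it will later be stored as bytes.
for ch in "Aé€א":
    print(f"{ch!r} is code point U+{ord(ch):04X}")
```

The Latin letter, accented letter, currency symbol and Hebrew letter all get numbers from the same single space: U+0041, U+00E9, U+20AC and U+05D0 respectively.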
The History of Unicode

As mentioned previously, ASCII has been around for a long time. While it maps characters to bytes, it uses only seven bits of each byte instead of the usual eight. At the same time, the spread of computers around the world made representing different languages necessary. Extending ASCII by using the extra bit seemed the most obvious solution. The problem was that every company and country extended it in a different way, making it almost impossible to exchange data with people who used different languages.
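The incompatibility is easy to demonstrate: a byte value above 127 means a different character under different legacy code pages. A small sketch using two real code pages that Python ships with:

```python
# The same byte, 0xE9, is a valid character in many 8-bit
# code pages, but which character depends on the code page.
raw = b"\xe9"
print(raw.decode("latin-1"))  # Western European reading: é
print(raw.decode("cp1251"))   # Cyrillic (Windows) reading: й
```

A file written on one system and read on another with a different code page turns into gibberish, which is exactly the data-exchange problem described above.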
In the late 1980s, some employees at Apple and Xerox decided to band together to solve the problem and develop a way to represent every language on earth. They were soon joined by people from Sun Microsystems and IBM to form the Unicode Consortium. The first version of the standard was released in 1991, and it's been continually improved. Almost all programs that deal with text and virtually all modern operating systems support Unicode.
UTF-8, UTF-16, UTF-32

Since Unicode is ubiquitous these days, it's pretty easy to add support to your application. Just consult the documentation for your favorite programming language.
Unicode comes in three main flavors, which differ in how code points are stored as bytes:

- UTF-8: a variable-width encoding that uses one to four bytes per code point. It's backward-compatible with ASCII and is the dominant encoding on the web.
- UTF-16: uses two bytes for most characters and four bytes (a "surrogate pair") for the rest.
- UTF-32: uses a fixed four bytes for every code point, trading space for simplicity.
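A quick sketch of how the same text occupies different numbers of bytes under each scheme (using Python's built-in codecs; the "-le" little-endian variants are chosen here so no byte-order mark is prepended):

```python
s = "héllo"  # five code points

# Encode the same code points three ways and compare sizes.
for enc in ("utf-8", "utf-16-le", "utf-32-le"):
    data = s.encode(enc)
    print(f"{enc}: {len(data)} bytes")
```

This prints 6, 10 and 20 bytes: UTF-8 spends an extra byte only on the accented character, UTF-16 spends two bytes on everything, and UTF-32 spends four.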
Why Unicode?

There's a reason they call it the World Wide Web: it's becoming almost universally available. And if you're a developer, it pays to reach a worldwide audience. That means representing text in the modern way. Spolsky puts it more bluntly:
"All that stuff about 'plain text = ascii = characters are eight bits' is not only wrong, it's hopelessly wrong, and if you're still programming that way, you're not much better than a medical doctor who doesn't believe in germs."