If you haven’t noticed, there’s a whole world outside of the United States. Unfortunately, representing text in different languages can be challenging for programmers. Unicode is a universal standard for representing text that makes it easy to support almost any language. Here we’ll take a look at the basics of Unicode.
What Is Unicode?
If you’re familiar with the technical details of how text is stored in a computer and you’re a native English speaker, you’ve probably heard of ASCII, the American Standard Code for Information Interchange. ASCII maps numeric codes to letters, digits, various symbols and control characters, which do things like beep the computer speaker or signal the beginning of a new line. It’s been around forever and it works great – if your primary language is U.S. English.
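For example, here’s a tiny Python sketch (just to make the mapping concrete) of how ASCII pairs characters with numbers:

```python
# ASCII assigns each character a number from 0 to 127.
print(ord("A"))       # 65  -> the capital letter A
print(ord("7"))       # 55  -> the digit 7 has a code, too
print(repr(chr(10)))  # '\n'   -> the control character that starts a new line
print(repr(chr(7)))   # '\x07' -> the bell character, which beeps the speaker
```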
However, many of the computer users around the world speak other languages, many of which aren’t even close to English. If you’re a developer and your software doesn’t take this into account, you can have some real headaches.
Joel Spolsky, a software engineer and writer, recounts what can happen if programmers aren’t careful about locales:
“A couple of years ago, a beta tester for FogBUGZ was wondering whether it could handle incoming email in Japanese. Japanese? They have email in Japanese? I had no idea. When I looked closely at the commercial ActiveX control we were using to parse MIME email messages, we discovered it was doing exactly the wrong thing with character sets, so we actually had to write heroic code to undo the wrong conversion it had done and redo it correctly. When I looked into another commercial library, it, too, had a completely broken character code implementation. I corresponded with the developer of that package and he sort of thought they ‘couldn’t do anything about it.’ Like many programmers, he just wished it would all blow over somehow.”
If you know about Unicode, you can build applications that work with other languages with ease. Unicode is an international standard, maintained by the Unicode Consortium, that defines a truly universal character set and makes it possible to develop truly universal programs. Computer scientist Donald E. Knuth called it “the best tool I know to help bring understanding between people of different cultures.”
So in Unicode, characters are represented as “code points” rather than raw bytes: abstract numbers that are separate from the way text is actually stored in the computer. The Unicode standard covers letters, numbers, currency symbols and writing direction (for representing languages that read right-to-left, like Hebrew and Arabic).
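To make code points concrete, here’s a minimal Python sketch (the characters are arbitrary examples): every character corresponds to an abstract number, regardless of how the bytes end up on disk:

```python
# Each character has a numeric code point, independent of any byte encoding.
for ch in "Aé€日":
    print(ch, hex(ord(ch)))  # 'A' -> 0x41, 'é' -> 0xe9, '€' -> 0x20ac, '日' -> 0x65e5

# chr() goes the other way, from code point back to character.
print(chr(0x20AC))  # €
```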
The History of Unicode
As mentioned previously, ASCII has been around for a long time. While it translates characters to bytes, it only uses seven bits in each byte instead of the usual eight. At the same time, the spread of computers around the world made representing different languages necessary. Extending ASCII by using the extra bit seemed the most obvious solution. The problem was that every company and country extended it in a different way, making it almost impossible to exchange data with people who used different languages.
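Here’s a small Python sketch of why that hurt (the code pages chosen are just examples): the very same byte decodes to different characters depending on which extension of ASCII you assume:

```python
raw = b"\xe9"  # one byte with the eighth bit set

# The same byte means different things under different extended-ASCII code pages.
print(raw.decode("latin-1"))  # 'é' (Western European)
print(raw.decode("cp1251"))   # 'й' (Cyrillic)
print(raw.decode("cp437"))    # 'Θ' (original IBM PC code page)
```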
In the late 1980s, some employees at Apple and Xerox decided to band together to solve the problem and develop a way to represent every language on earth. They were soon joined by people from Sun Microsystems and IBM to form the Unicode Consortium. The first version of the standard was released in 1991, and it’s been continually improved. Almost all programs that deal with text and virtually all modern operating systems support Unicode.
UTF-8, UTF-16, UTF-32
Since Unicode is ubiquitous these days, it’s pretty easy to add support for it to your application. Just consult the documentation for your favorite programming language.
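In Python 3, for instance (shown here just as one example; other languages offer similar facilities), strings are Unicode by default and you simply name an encoding when reading or writing files:

```python
# Python 3 strings are sequences of Unicode code points.
text = "Héllo, wörld, こんにちは"

# Name the encoding explicitly when writing to or reading from disk.
with open("greeting.txt", "w", encoding="utf-8") as f:
    f.write(text)

with open("greeting.txt", encoding="utf-8") as f:
    print(f.read() == text)  # True
```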
Unicode comes in three flavors:
- UTF-8
- UTF-16
- UTF-32
The numbers refer to the size, in bits, of the units in which characters are stored. UTF-8 stores each character as a sequence of one to four eight-bit bytes. It’s backward compatible with ASCII, and is used very widely on the web because it can handle any language while remaining compact for mostly English text. UTF-16 stores each character in one or two 16-bit units, offering a good balance between compact storage and ease of processing. UTF-32 stores every character in a single 32-bit unit, which keeps processing simple; it’s ideal when you don’t have to worry about storage space.
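Here’s a small Python sketch comparing the three (the sample string is arbitrary); the same text takes up a different number of bytes in each:

```python
text = "Hi €"  # four characters: three from ASCII, one non-ASCII symbol

for encoding in ("utf-8", "utf-16", "utf-32"):
    data = text.encode(encoding)
    print(f"{encoding}: {len(data)} bytes -> {data!r}")

# utf-8:  6 bytes  (one byte each for 'H', 'i' and ' ', three bytes for '€')
# utf-16: 10 bytes (two bytes per character, plus a two-byte byte-order mark)
# utf-32: 20 bytes (four bytes per character, plus a four-byte byte-order mark)
```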
Why Unicode?
There’s a reason they call it the world wide web. It’s becoming almost universally available. And if you’re a developer, it pays to reach a worldwide audience. That means representing text in the modern way. Spolsky puts it more bluntly:
“All that stuff about ‘plain text = ASCII = characters are eight bits’ is not only wrong, it’s hopelessly wrong, and if you’re still programming that way, you’re not much better than a medical doctor who doesn’t believe in germs.”
Unicode vs. ASCII
Since ASCII is the bare minimum that computers support, it won’t be going away. If you’re absolutely, positively certain that your application will only be used in the English-speaking world, you might be able to get away with it. But even though plenty of people around the world learn English anyway, you’re still better off using Unicode, since it’s much more flexible than ASCII. You never know when your users might want to type the Euro symbol or characters with accents.
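Here’s a quick Python sketch of that last point: ASCII simply has no code for the euro sign, while a Unicode encoding handles it without complaint:

```python
text = "price: 10€"

print(text.encode("utf-8"))   # works: b'price: 10\xe2\x82\xac'

try:
    text.encode("ascii")
except UnicodeEncodeError as err:
    print("ASCII can't represent this:", err)
```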
A Global Standard
If you want your applications to be truly global, they have to be able to handle languages other than U.S. English. Fortunately, Unicode provides a relatively painless way to do it. Why not start globalizing your software right now?