Web, Japanese and UTF8

Getting Japanese text to work on a webpage came up recently at work, so I felt inspired to touch on the subject briefly. This piece is a bit more technical than other posts, but I am not an expert either, so hopefully it will find some middle ground.

If you can read this text below:

日本語が読める?

You’re off to a good start. If not, that’s ok. This post will contain a few tips about being able to read Japanese. But first a bit of history:

History of Computing and Character Sets

If you go back in computing all the way to the 1970s, the process of improving internationalization runs like so:

  • The ASCII standard supplants earlier and more difficult EBCDIC character sets. It was widely adopted and is still more or less in use today. The trouble is that it was English-only, yet it was the de facto standard of computing back then. The ASCII standard was 7 bits in total, so you could only cram in 128 characters.
  • To compensate for the English-centric behavior of ASCII, similar code pages were developed for other languages. They all had the same restriction of 7 bits, but now if you used the French code page, for example, you could get nice accented letters. Still useless for things like Japanese and Chinese, which have lots and lots more characters.
  • The ISO-2022 standard, which dates back to the 1970s, lays out a way of representing languages like Japanese and Korean. This then leads to well-known character sets like EUC-JP (Extended Unix Code, JP). As you can guess from the name, it was widely used in UNIX and other systems.
  • In the early 1980s, Microsoft, in conjunction with a Japanese company, creates a new character set known as Shift-JIS. With Microsoft behind it, it gains widespread use, though EUC-JP is thought by some to produce cleaner, easier to manipulate computer code. Both EUC-JP and Shift-JIS compete for presenting Japanese text on the Internet, and both are still frequently in use.
  • Though begun in 1986, the famous Unicode Standard came out in 1991/1992, as a way of having a single character set and standard to store all characters from all languages. It started with a limited repertoire, but as it grew, it absorbed more languages and characters into its massive mapping scheme. Already by 1991, Japanese hiragana and katakana letters were mapped out, but not kanji (Chinese characters).
  • By 1992, the first 20,000 Chinese characters are mapped out, with another 40,000 added in 2001. These of course are used in Japanese as “kanji”.
  • To make Unicode easier to program, the UTF-8 encoding scheme is presented to the public in 1993, which takes off after that.
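
To make the difference between these encodings a bit more concrete, here is a small Python sketch that encodes the sample sentence from the top of this post in each of them. The codec names are the ones that ship with Python's standard library; the helper name `byte_lengths` is just something I made up for illustration.

```python
# The sample sentence from the top of the post.
TEXT = "日本語が読める?"

# Python's standard codec names for the encodings discussed above.
ENCODINGS = ["utf-8", "euc_jp", "shift_jis", "iso2022_jp"]


def byte_lengths(text):
    """Return {encoding name: number of bytes} for the given text."""
    return {name: len(text.encode(name)) for name in ENCODINGS}


if __name__ == "__main__":
    for name, size in byte_lengths(TEXT).items():
        # Show the size and the raw bytes (as hex) for each encoding.
        print(f"{name:12} {size:3d} bytes  {TEXT.encode(name).hex()}")

    # Plain 7-bit ASCII cannot represent the Japanese characters at all.
    try:
        TEXT.encode("ascii")
    except UnicodeEncodeError as err:
        print("ascii: cannot encode:", err.reason)
```

For this sentence, UTF-8 spends three bytes on each kana or kanji where EUC-JP and Shift-JIS spend two, and ISO-2022-JP adds escape sequences to switch between ASCII and Japanese. The same text, four different byte sequences, which is exactly why a program has to know which character set it is looking at.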

The Web and Japanese language today

UTF-8 is the standard today. The problem is that lots of cruft, old websites and outdated software still persist in computing and on the Internet, so although most modern programs, web browsers and such use UTF-8 by default, there are plenty of leftovers that don’t. So if you visit a Japanese website, even one that represents a major company, you may find a different character set in use.
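
What happens when a program guesses the character set wrong can be sketched in a few lines of Python. This is only an illustration, not how any particular browser works: the function name `decode_as` is my own, and `errors="replace"` is just one way of handling bytes that don't fit.

```python
TEXT = "日本語が読める?"


def decode_as(raw, charset):
    """Decode raw bytes with the given charset, substituting U+FFFD for
    any byte sequence that is invalid in that charset."""
    return raw.decode(charset, errors="replace")


if __name__ == "__main__":
    # Pretend these bytes came from an older site that serves Shift-JIS.
    raw = TEXT.encode("shift_jis")

    print("right guess:", decode_as(raw, "shift_jis"))
    # Wrong guesses produce garbled characters instead of Japanese.
    print("as utf-8:   ", decode_as(raw, "utf-8"))
    print("as cp1252:  ", decode_as(raw, "cp1252"))
```

Only the first line prints readable Japanese. The bytes are the same in all three cases; the only thing that changes is which character set the program assumes.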

In your web browser, you can check the character set of a website using the “view source” option. For Firefox and Opera users, this is a simple matter of clicking on “view” then either “source” or “page source”. In both cases, you can use the shortcut of typing CTRL+U or Command+U (Mac users). Look for a line near the top that says:

<meta http-equiv="Content-Type" content="text/html; charset=utf-8" />

Here, the page I chose declares its character set as “utf-8”. ;)
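
If you'd rather not hunt through the source by eye, the same check can be sketched with Python's built-in `html.parser` module. The names `MetaCharsetFinder` and `find_meta_charset` are my own inventions, and the sketch assumes you already have the page's HTML as a string. It only looks at the meta tag; a server can also announce the character set in the HTTP Content-Type header, which this ignores.

```python
from html.parser import HTMLParser


class MetaCharsetFinder(HTMLParser):
    """Remember the first charset declared in a <meta> tag."""

    def __init__(self):
        super().__init__()
        self.charset = None

    def handle_starttag(self, tag, attrs):
        if tag != "meta" or self.charset is not None:
            return
        attributes = dict(attrs)
        if attributes.get("charset"):
            # Short style: <meta charset="utf-8">
            self.charset = attributes["charset"].strip().lower()
            return
        if (attributes.get("http-equiv") or "").lower() == "content-type":
            # Older style: content="text/html; charset=utf-8"
            content = attributes.get("content") or ""
            for part in content.split(";"):
                name, _, value = part.strip().partition("=")
                if name.strip().lower() == "charset" and value:
                    self.charset = value.strip().strip("\"'").lower()


def find_meta_charset(html):
    """Return the charset named in the page's <meta> tag, or None."""
    finder = MetaCharsetFinder()
    finder.feed(html)
    return finder.charset


if __name__ == "__main__":
    page = (
        '<html><head><meta http-equiv="Content-Type" '
        'content="text/html; charset=utf-8" /></head></html>'
    )
    print(find_meta_charset(page))
```

The function returns the declared name in lowercase, or `None` when the page has no such tag, in which case the browser has to fall back on the HTTP header or a guess.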

But for example:

These are both major websites in Japan, and yet use different character sets. So, how can you see Japanese when reading websites? Read on.

Setting your computer to read Japanese

So, to be able to read Japanese, you need to have two things:

  • A modern web browser that is standards-compliant, and fully internationalized. I recommend Firefox, but also Opera, Safari and Camino. All four have worked for me. But this isn’t enough.
  • Your computer’s operating system also needs to be enabled to display Japanese fonts and characters. Mac OS X works right out of the box, so that’s easy. Linux and BSD have both worked for me right out of the box too, though it’s a lot harder to get Japanese input working compared to Mac OS X. Windows is the hardest one, since the Japanese fonts are not pre-installed and you have to enable them. A simple Google search reveals many websites showing how to do this.

If you have met the two prerequisites above, you should be able to read the text I had at the top of the post. If not, leave a comment and I can try to help out. Though, it’s been a while since I used a computer that didn’t display Japanese characters right away. ;)

What about inputting characters?

Enabling Japanese Input

This is important for Japanese students who use things like Anki, or like to search for things in Japanese.1 Again, Mac OS X is by far the easiest once Kotoeri is enabled. Frankly, that’s what I most often use. Windows actually does this pretty well with its IME input editor. Both have their quirks: Mac’s Kotoeri doesn’t always find the kanji right away, and IME lacks the cool keyboard shortcuts.

For Linux and BSD, it’s somewhat harder. Recently, I finally got it to work right on FreeBSD, so try following those instructions, or do a Google search for more information.

Conclusion

As technology converges, things will get easier, but the Internet is a prime example of inertia, so things do not change overnight. Still, when my daughter is old enough to use the Internet well, I hope she will not have to worry about such things. :)

1 Like certain blog writers. *cough*



2 Comments on “Web, Japanese and UTF8”

  1. ロバート says:

    I’ve found ISO2022 encoding the safest for email to Japan.
    Using UTF8 I constantly hit the problem of mojibake (nonsense characters).
    Unfortunately there are a couple of characters I tend to use that aren’t in the 2022 set, but at least it seems that any UTF-8 programs are backward compatible and can read ISO2022 encodings.

  2. Doug says:

    Hi Robert,

    I wasn’t aware that ISO2022 was still in use, but that’s good to know. I have only limited experience with emailing Japanese folks, mostly at work where the software works pretty well as-is. I’ve not tried outside of that much. :-/

