Working With Chinese Text

Note: This post is not about Unicode.

Loosely speaking, most languages have some kind of an alphabet. The written word, roughly or precisely, spells out how the spoken word sounds. Some languages use a ~consonant-only script (e.g. Arabic); some languages use a script that spells out whole syllables at a time (e.g. Cherokee). They may not be “alphabets”, strictly speaking, but they all have a small set of symbols that make up the entire written language.

Designing a keyboard, then, is easy: make a key for each possible symbol. Maybe add a few dead keys (combination keys) for åcćeñtş. Voilà.

That’s simply impossible for Chinese text. A high school graduate in China knows about 4,000 to 5,000 Chinese characters, and there is no easy way to break a character down further. Surely we can’t have a keyboard with 4,000 keys on it?

Not with that attitude. 活字印刷术 (movable-character-printing-method, or movable type) was invented in China about 1,000 years ago, and was still in commercial use as late as the 1990s. It’s a learned skill, and some people made a living picking out the blocks quickly and accurately. As computers proliferated, though, working with Chinese text went through a revolution.

拼音 Pinyin

Hanyu Pinyin, often just pinyin (literally “spell-sound”), uses the Latin alphabet to spell out pronunciations of Chinese characters. It was developed in the late 1950s by Chinese linguists, and by the 1970s, it was already commonly taught in school. Today, it serves as the official romanization of Chinese characters in both mainland China and Taiwan.

If you thought, hang on, I thought “yin” meant the opposite of “yang”, you’ve perfectly highlighted why pinyin remains an auxiliary tool. A Chinese learner will encounter this familiar phrase many times on their journey to fluency: “same pronunciation, different character”. The opposite of yang is 阴 (trad. 陰), and the yin in pinyin, meaning “sound”, is 音. Even a casual observer can tell the difference. Disregarding the tones, there are only about 400 distinct syllables in Mandarin Chinese, so thousands of characters have to share them. Pinyin cannot and will not replace the actual characters.

Nevertheless, pinyin serves as a teaching tool, and a way to transcribe Chinese characters into the Latin alphabet when it’s necessary (e.g. in passports). Today, it also allows people to type Chinese text into their computers and smartphones.

输入法 IME

Typing in English on a computer is exactly like typing on a typewriter. One key, one symbol. With minor modifications, this can be made to work for the majority of languages.

You might not be surprised to learn that there are no mechanical typewriters for Chinese, if you don’t count the giant movable type machines used only by the publishers. Ever since the first electronic Chinese typewriter, an input method editor, better known by its abbreviation IME, has been used to convert keystrokes into characters.

Germans have their QWERTZ keyboard, French people have AZERTY, but Chinese people just use a standard English QWERTY keyboard, plus an IME. You may already have an idea how this IME would work, and you’re probably right:

The default pinyin IME on macOS

Pinyin IMEs do exactly what they sound like: type in the pinyin, out come the characters. It used to be that each character had to be typed in individually, and the user had to choose from a list of homophonous characters over and over again. Lately, pinyin IMEs have gotten so smart that a user can often type a whole sentence without having to choose a single character.
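To make the idea concrete, here is a toy sketch of the two generations of pinyin IME described above. It is not how any real IME is implemented: the tables, the candidate ordering, and the function names (`candidates`, `convert`) are all made up for illustration, and real IMEs use far larger dictionaries and statistical language models rather than a greedy match.

```python
# Hypothetical, hand-picked tables. The order of candidates is illustrative only.
CHAR_CANDIDATES = {
    "yin": ["音", "阴", "因", "银"],
    "pin": ["拼", "品", "贫"],
    "yang": ["阳", "羊", "样"],
}

# Multi-syllable phrases let the IME pick characters without asking the user.
PHRASES = {
    ("pin", "yin"): "拼音",
    ("yin", "yang"): "阴阳",
}


def candidates(syllable):
    """Old-style IME: return every known character for one syllable."""
    return CHAR_CANDIDATES.get(syllable, [])


def convert(syllables):
    """Smarter IME sketch: greedily match the longest known phrase,
    falling back to the first single-character candidate."""
    out = []
    i = 0
    while i < len(syllables):
        # Try the longest remaining span first, down to two syllables.
        for n in range(len(syllables) - i, 1, -1):
            phrase = PHRASES.get(tuple(syllables[i:i + n]))
            if phrase:
                out.append(phrase)
                i += n
                break
        else:
            # No phrase matched: take the top candidate, or echo the input.
            cands = candidates(syllables[i])
            out.append(cands[0] if cands else syllables[i])
            i += 1
    return "".join(out)


print(candidates("yin"))         # the user would have to pick one of these
print(convert(["pin", "yin"]))   # phrase match, no picking needed
print(convert(["yin", "yang"]))
```

With only a character table, “yin” always needs a manual choice. Once the phrase table knows “pinyin” and “yinyang”, the same syllable resolves to 音 in one and 阴 in the other without any help from the user.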

“Genius is one percent inspiration and ninety-nine percent perspiration.” Shame that it doesn’t rhyme in Chinese.

Before pinyin IMEs got smart, people invented all sorts of ways to avoid the time-consuming process of choosing which character they meant. This almost always required “breaking up” the characters in some way, occasionally very counter-intuitively. It would often take weeks, even months, to become proficient in one of these IMEs, but once you were proficient, you could type much, much faster than those who typed in pinyin. One could still make a living as a typist.

From left to right: Wubi, Stroke, Cangjie

Another advantage of these break-up-a-character IMEs over pinyin IMEs was that you could still type out a character even if you didn’t know how it’s pronounced. These days, though, there’s a way around it, thanks to advances in handwriting recognition:

Hey, you try writing on a trackpad before criticizing my penmanship.

On mobile phones, an IME from the T9 era (called 10-key or 九宫格 on iOS) remains popular among people who grew up with it:
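Here is a rough sketch of why a keypad IME needs a candidate list even before it gets to the characters. It assumes the standard phone keypad letter layout (2 = abc through 9 = wxyz); the syllable list is a tiny hand-picked sample, and `to_digits` and `build_index` are names I made up. I’m not claiming this is how the iOS keyboard works internally.

```python
# Standard phone keypad letter assignment.
KEYPAD = {
    "2": "abc", "3": "def", "4": "ghi", "5": "jkl",
    "6": "mno", "7": "pqrs", "8": "tuv", "9": "wxyz",
}
LETTER_TO_DIGIT = {ch: d for d, letters in KEYPAD.items() for ch in letters}

# A tiny sample of pinyin syllables, for illustration only.
SYLLABLES = ["yin", "xin", "pin", "yang", "ni", "mi"]


def to_digits(syllable):
    """Spell a pinyin syllable as the keys you would press."""
    return "".join(LETTER_TO_DIGIT[ch] for ch in syllable)


def build_index(syllables):
    """Group syllables by the key sequence that produces them."""
    index = {}
    for s in syllables:
        index.setdefault(to_digits(s), []).append(s)
    return index


index = build_index(SYLLABLES)
print(index["946"])  # both "yin" and "xin" live on the same three keys
print(index["64"])   # so do "ni" and "mi"
```

So the user first narrows down the syllable, then the character, which is two rounds of “same keys, different result” instead of one.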

Good for some one-handed typing.

笔画顺序 Stroke Order

“Stroke order” usually means 笔顺, the “correct” order of the strokes a character should be written in. I’m sure I got it wrong multiple times in the GIF above.

By pure coincidence, it’s also the translation of 笔画顺序, a traditional method of ordering Chinese characters, such as in a dictionary, or in the Parade of Nations at the opening ceremony of the Beijing Olympics. When comparing two characters, the one with fewer strokes comes first (so 一 comes before 二). If they have the same number of strokes, we compare them stroke by stroke. Remember, there’s a “correct” way to write each character.

Each stroke comes in one of five types: (1) left-to-right, (2) top-to-bottom, (3) NE-to-SW, (4) NW-to-SE, and (5) zigzags. If the first strokes of the two characters have different types, the one with a smaller type number comes first (so 二 comes before 人). If their first strokes have the same type, we look at the second strokes (so 二 comes before 十), etc. If all strokes have the same types, the order is undefined (i.e. up to the editor).
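The comparison rule above maps neatly onto a sort key: stroke count first, then the sequence of stroke types. Below is a minimal sketch. The `STROKES` table is hand-entered for a handful of simple characters using the five type numbers from the previous paragraph; a real tool would load a complete stroke table from somewhere, and the function names are my own.

```python
# Stroke types, numbered as in the text:
# 1 = left-to-right, 2 = top-to-bottom, 3 = NE-to-SW, 4 = NW-to-SE, 5 = zigzag.
# Hand-entered for illustration only.
STROKES = {
    "一": (1,),
    "二": (1, 1),
    "十": (1, 2),
    "人": (3, 4),
    "八": (3, 4),
}


def stroke_key(char):
    """Sort key: fewer strokes first, then compare stroke types in order."""
    strokes = STROKES[char]
    return (len(strokes), strokes)


def stroke_sorted(chars):
    """Sort characters in stroke order. Ties keep their input order,
    because Python's sort is stable."""
    return sorted(chars, key=stroke_key)


print(stroke_sorted(["人", "十", "二", "一"]))
# 人 and 八 have identical keys, so the rule leaves their order undefined.
print(stroke_key("人") == stroke_key("八"))
```

Python compares tuples element by element, which is exactly the “look at the first stroke, then the second” procedure. 人 and 八 are a handy example of the undefined case: two strokes each, with the same types.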

Since pinyin has become common knowledge, stroke order has taken a back seat to 音序, the phonetic order or “pinyin order”. Characters are usually ordered first by pinyin (using the same plain old dictionary order as in English), and secondarily by stroke for characters with the same pronunciation.
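Pinyin order can be sketched as one more layer on the sort key: pinyin first, stroke order as the tie-breaker. The `ENTRIES` table below is hand-entered for illustration, with stroke types numbered 1 to 5 as described earlier. One assumption of mine that the text doesn’t spell out: I break ties between identical spellings by tone number before falling back to strokes.

```python
# char -> (pinyin without tone marks, tone number, stroke types)
# Hand-entered for illustration only.
ENTRIES = {
    "一": ("yi", 1, (1,)),
    "二": ("er", 4, (1, 1)),
    "十": ("shi", 2, (1, 2)),
    "石": ("shi", 2, (1, 3, 2, 5, 1)),
    "人": ("ren", 2, (3, 4)),
    "八": ("ba", 1, (3, 4)),
}


def pinyin_key(char):
    """Sort key: spelling, then tone (an assumption), then stroke order."""
    syllable, tone, strokes = ENTRIES[char]
    return (syllable, tone, len(strokes), strokes)


def pinyin_sorted(chars):
    return sorted(chars, key=pinyin_key)


print(pinyin_sorted(["一", "二", "十", "石", "人", "八"]))
```

十 and 石 are both pronounced shí, so the stroke count decides: 十 has two strokes, 石 has five, and 十 comes first.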

I can’t, off the top of my head, think of two characters that have the same pronunciation and the same stroke types.

查字典 Using A Dictionary

Apart from looking up definitions, a dictionary has two basic functions: checking how a word is spelled (or how a character is written), and checking how it’s pronounced.

The first problem can be a little tricky in English. If the word begins with a silent letter (e.g. “mnemonics” or “psychology”), you’ll have a hard time finding it without help. It’s less of a problem in Chinese, especially when you know what the character roughly looks like.

If the character you’re looking for isn’t in box 2, turn to the next page.

The opposite problem, knowing how a word is spelled (or how a character is written) and checking its pronunciation, is almost trivial in English. The word’s location in the dictionary is determined by its spelling. In Chinese though, it can be tricky, since the characters are primarily ordered by pronunciation, which is the very thing we didn’t know and wanted to find out (and it’s not always guessable from how the character looks).

To address this use case, dictionaries have a complete index of all the characters in stroke order, usually first thing in the book. To make it even faster to find the character, they are often further indexed by radicals (partial shapes that occur frequently in characters).

It’s not as complicated as it looks, after some practice.

Here’s what you do:

Step 0. Identify a radical in the character. This starts to come naturally once you’ve learned a couple hundred characters. Dictionaries make this easy by indexing the same character under multiple plausible radicals (even though there’s a “correct” one you should look under). There’s usually also a list of “difficult characters” whose radicals are hard to identify even for native speakers.

Step 1. Count the number of strokes in the radical.

Step 2. Find the radical in the list. Radicals are listed in stroke order.

Step 3. Find the page number you should turn to.

Step 4. Go to that page.

Step 5. Find the radical’s section on that page.

Step 6. Count the number of strokes remaining in the character, not including the radical itself.

Step 7. Find the character and the page number for its entry. Characters with multiple entries (because they have multiple pronunciations) might have multiple page numbers.
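If it helps, the steps above amount to a two-level lookup: radical first, then the number of remaining strokes. Here is a small sketch of that data structure. The radicals and stroke counts are real, but the page numbers are invented and the function names are mine; no actual dictionary is laid out exactly like this.

```python
# Radical -> number of strokes in the radical (Step 1).
RADICAL_STROKES = {"氵": 3, "女": 3, "木": 4}

# Radical -> remaining stroke count -> character -> entry pages (Steps 5 to 7).
# Page numbers are made up for illustration.
RADICAL_INDEX = {
    "氵": {3: {"江": [231]}, 5: {"河": [182]}, 7: {"海": [175]}},
    "女": {3: {"好": [185, 186]}},  # two pronunciations, two entries
    "木": {4: {"林": [304]}, 5: {"树": [460]}},
}


def radical_list():
    """Step 2: radicals listed by stroke count (ties keep table order)."""
    return sorted(RADICAL_STROKES, key=RADICAL_STROKES.get)


def look_up(radical, remaining_strokes, char):
    """Steps 5 to 7: find the radical's section, narrow it down by the
    number of strokes outside the radical, and return the entry pages."""
    section = RADICAL_INDEX.get(radical, {})
    return section.get(remaining_strokes, {}).get(char, [])


print(radical_list())
print(look_up("氵", 5, "河"))  # 河 = 氵 plus 5 more strokes
print(look_up("女", 3, "好"))  # multiple pronunciations, multiple pages
```

Counting strokes twice, once for the radical and once for the rest, is what keeps each group in the index short enough to scan by eye.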

All grade school students learn this system and are tested on it, though not very many of them retain this knowledge once they get older. It’s much easier on a computer anyway:

Kids these days…