Pinyin translator tested with "HSK Standard Course"

If you follow the news thread, you know that I have spent the last few months studying Chinese in China. During my stay, I used some of my free time to thoroughly test the Chinese pinyin translator. For that, I used the "HSK Standard Course" textbook series published by Beijing Language and Culture University Press (incidentally, I studied Chinese at this very university).

If you read my article "A Complete Guide to Language Learning. Part 1. Learning Pronunciation", you know that I prefer video materials when learning a new language. "HSK Standard Course" is an audio course, so it is not a perfect solution for beginners, but it is still very decent material. The advantages of this course are:

  1. The authors used only the most popular Chinese words and the most frequent grammar constructions that are required for the HSK exam.
  2. The audio recordings are very good. The combination of textbook audio and workbook audio gives you a lot of examples of how these popular Chinese words are used in different sentences. The intonation is very natural. The speed is a little fast starting from HSK level 3, but that's how Chinese people actually speak in everyday life.
  3. All dialogues in the books for HSK levels 1 and 2 are translated into English.
  4. The authors made a very good decision about how to show pinyin. In the books for HSK levels 1 and 2, the pinyin is above each line of the dialogue. In the books for HSK levels 3 and 4, the pinyin for each dialogue is at the bottom of the page, so it is less distracting. Starting from the book for HSK level 5 (only the first book has been published so far), pinyin is shown only for the new words.

So, to test the pinyin translator, I took all the dialogues from the books for HSK levels 1 to 4 (32,800 characters), ran them through the translator, and compared the output with the pinyin transcription printed in the books.
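
A comparison like this can be partly automated. The snippet below is only a minimal sketch of the idea, not the code actually used for the test: the function names `normalize` and `find_mismatches` are made up for illustration, and it assumes both transcriptions are available as plain strings with syllables or words separated by spaces.

```python
from difflib import SequenceMatcher


def normalize(pinyin):
    """Lowercase, split on whitespace, and drop surrounding punctuation."""
    tokens = []
    for token in pinyin.lower().split():
        token = token.strip(".,!?;:\"'()")
        if token:
            tokens.append(token)
    return tokens


def find_mismatches(converted, reference):
    """Return (converted, reference) pairs for every span that differs."""
    a = normalize(converted)
    b = normalize(reference)
    matcher = SequenceMatcher(a=a, b=b, autojunk=False)
    mismatches = []
    for tag, i1, i2, j1, j2 in matcher.get_opcodes():
        if tag != "equal":
            mismatches.append((" ".join(a[i1:i2]), " ".join(b[j1:j2])))
    return mismatches


# Hypothetical example: the translator picked the rare reading of 吗.
print(find_mismatches("nǐ hǎo mǎ?", "Nǐ hǎo ma?"))
```

Each reported pair still has to be checked by hand, because a mismatch can come from the dictionary, from word segmentation, or from a difference in spacing conventions between the book and the translator.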

For my translator I use the CC-CEDICT dictionary, but it has three problems:

  1. Sometimes it contains multiple transcriptions for the same word, and some of them are very rare. For example, the question particle 吗 has two entries: one is "ma5" and the other is "ma3". In the CC-CEDICT dictionary, the rare "ma3" comes first.
  2. The second problem is what I call "the problem of long words". For example, there is a Chinese word "几分", which means "somewhat; rather". But when the phrase "几分钟" appears in a Chinese text, it should be split as 几 + 分钟 and converted as "jǐ fēnzhōng" ("several minutes"), not matched against the dictionary word "几分".
  3. The last problem is "the problem of excessive entries". For example, the CC-CEDICT dictionary has an entry for "等一下儿", which is not a word but a phrase. For such entries, the tone correction algorithm for 一 and 不 sometimes doesn't work. And since this phrase is not in the HSK vocabulary lists, its HSK level is not highlighted either.
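
The first two problems can be illustrated with a small sketch. It is not the translator's actual code: it assumes a greedy longest-match lookup, a tiny hypothetical dictionary (`CEDICT`) that stores readings in file order, and a hypothetical table of manual corrections (`OVERRIDES`) that is consulted before the dictionary.

```python
# Hypothetical miniature dictionary: readings are listed in file order,
# which is not necessarily frequency order.
CEDICT = {
    "吗": ["mǎ", "ma"],
    "几": ["jǐ"],
    "几分": ["jǐfēn"],
    "分钟": ["fēnzhōng"],
    "钟": ["zhōng"],
    "你": ["nǐ"],
    "好": ["hǎo"],
}

# Hypothetical manual corrections, checked before the dictionary.
OVERRIDES = {
    "吗": "ma",                 # prefer the common neutral-tone reading
    "几分钟": "jǐ fēnzhōng",     # do not split as 几分 + 钟
}

MAX_WORD_LEN = 4  # longest chunk tried at each position (an assumption)


def lookup(word, overrides):
    """Return a reading for word, preferring manual overrides."""
    if word in overrides:
        return overrides[word]
    readings = CEDICT.get(word)
    return readings[0] if readings else None


def to_pinyin(text, overrides=OVERRIDES):
    """Greedy longest-match conversion; unknown characters pass through."""
    out = []
    i = 0
    while i < len(text):
        for length in range(min(MAX_WORD_LEN, len(text) - i), 0, -1):
            reading = lookup(text[i:i + length], overrides)
            if reading is not None:
                out.append(reading)
                i += length
                break
        else:
            out.append(text[i])
            i += 1
    return " ".join(out)


print(to_pinyin("几分钟", overrides={}))  # uncorrected: 几分 is matched first
print(to_pinyin("几分钟"))                # corrected by the override table
```

With an empty override table, the sketch matches "几分" first and leaves 钟 on its own, which is exactly the "long words" error. A phrase-level override fixes this one case without changing the general matching rule.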

So I had to find and manually correct all these little errors. One day I will probably add an online form that lets users submit such errors themselves. I also plan to show multiple transcriptions for the most frequent Chinese words, such as 得, which can be pronounced "de", "děi", or "dé". Right now, only the most frequent pronunciation is shown.
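
One possible way to store several readings per word is sketched below. This is only an illustration of the planned feature, not existing code: the `Entry` class and its `render` method are hypothetical, and the order of the readings for 得 is assumed to be from most to least frequent.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Entry:
    word: str
    readings: List[str]  # assumed order: most frequent reading first

    def render(self, show_alternatives=False):
        """Return the main reading, optionally followed by the others."""
        if show_alternatives and len(self.readings) > 1:
            others = ", ".join(self.readings[1:])
            return f"{self.readings[0]} ({others})"
        return self.readings[0]


de = Entry("得", ["de", "děi", "dé"])
print(de.render())                        # current behaviour: one reading
print(de.render(show_alternatives=True))  # planned: show the alternatives too
```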

