Pinyin translator tested with "HSK Standard Course"
If you follow the news thread, you know that I spent the last few months studying Chinese in China. During my stay I used some of my free time to thoroughly test the Chinese pinyin translator. For that, I used the "HSK Standard Course" series of textbooks published by Beijing Language and Culture University Press (by the way, I studied Chinese at this very university).
If you read my article "A Complete Guide to Language Learning. Part 1. Learning Pronunciation", you know that I prefer video materials when learning a new language. HSK Standard Course is an audio course, so it is not a perfect solution for beginners, but it is still very decent material. The advantages of this course are:
- The authors used only the most popular Chinese words and the most frequent grammar constructions that are required for the HSK exam.
- The audio recordings are very good. Combining the audio from the textbook with the audio from the workbook gives you a lot of examples of how these popular Chinese words are used in different sentences. The intonation is very natural. The speed is a little fast starting from HSK level 3, but that's how Chinese people actually speak in everyday life.
- All dialogues in the books for HSK levels 1 and 2 are translated into English.
- The authors made a very good decision about how to show pinyin. In the books for HSK levels 1 and 2, the pinyin is above each line of the dialogue. In the books for HSK levels 3 and 4, the pinyin for each dialogue is at the bottom of the page, so it doesn't distract you as much. Starting from the book for HSK level 5 (only the first book has been published so far), they show pinyin only for the new words.
So basically, to test the pinyin translator I took all the dialogues from the books for HSK levels 1 to 4 (32,800 characters) and compared the result of the conversion with the pinyin transcription from the book.
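A comparison like this can be automated. The following is only a minimal sketch of one way to do it, not the procedure actually used for the test: it assumes the converter output and the textbook transcription are available as two parallel lists of lines, and it ignores differences in capitalization, spacing and punctuation. The function names normalize and find_mismatches are hypothetical.

```python
import unicodedata


def normalize(pinyin: str) -> str:
    """Lowercase the text and keep only letters (tone marks included),
    so that spacing, punctuation and capitalization are ignored."""
    text = unicodedata.normalize("NFC", pinyin.lower())
    return "".join(ch for ch in text if ch.isalpha())


def find_mismatches(converted, reference):
    """Return (line number, converter output, textbook pinyin) for every
    line where the two transcriptions differ after normalization."""
    mismatches = []
    pairs = zip(converted, reference)
    for number, (got, expected) in enumerate(pairs, start=1):
        if normalize(got) != normalize(expected):
            mismatches.append((number, got, expected))
    return mismatches


if __name__ == "__main__":
    # Hypothetical sample data: the second line has a wrong tone on "ma".
    converted = ["Nǐ hǎo!", "Nǐ hǎo mǎ?"]
    reference = ["nǐ hǎo", "nǐ hǎo ma"]
    for number, got, expected in find_mismatches(converted, reference):
        print(f"line {number}: got {got!r}, expected {expected!r}")
```

Each reported line still has to be reviewed by hand, because a difference may come either from the dictionary or from the way the textbook groups syllables into words.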
For my translator I use the CC-CEDICT dictionary, but there are three problems with this dictionary:
- Sometimes it contains multiple transcriptions for the same word, and some of them are very rare. For example, the question particle 吗 has two entries: one is "ma5" and the other is "ma3". And in the CC-CEDICT dictionary the rare "ma3" comes first.
- The second problem is what I call "the problem of long words". For example, there's a Chinese word "几分" which means "somewhat; rather". But when you encounter the phrase "几分钟" in a Chinese text, it should be converted as "jǐ fēnzhōng" ("several minutes"), that is, split as 几 + 分钟 and not as 几分 + 钟.
- And the last problem is "the problem of excessive entries". For example, the CC-CEDICT dictionary has an entry for "等一下儿", which is not a word but a phrase. For such entries the tone correction algorithm for 一 and 不 sometimes doesn't work. And since this phrase is not listed in the HSK vocabulary lists, the HSK level is not highlighted either.
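To illustrate "the problem of long words", here is a minimal sketch that assumes the converter looks words up with a greedy longest-match strategy. That strategy is an assumption made for the example; the translator's real algorithm is not described here. The toy dictionary, the OVERRIDES table and the convert function are all hypothetical, and the override table only stands in for the kind of manual correction mentioned in this article.

```python
# Toy dictionary with a few CC-CEDICT-style entries (tone marks instead
# of tone numbers). Hypothetical data, for illustration only.
TOY_DICT = {
    "几": "jǐ",
    "几分": "jǐfēn",
    "分钟": "fēnzhōng",
    "钟": "zhōng",
}

# Hypothetical manual corrections for phrases that are split wrongly.
OVERRIDES = {"几分钟": "jǐ fēnzhōng"}


def convert(text, dictionary, overrides=None):
    """Greedy longest-match conversion: at each position take the longest
    entry that matches; overrides take part in the lookup as well."""
    entries = dict(dictionary)
    if overrides:
        entries.update(overrides)
    max_len = max(len(word) for word in entries)

    result = []
    i = 0
    while i < len(text):
        # Try the longest candidate first, then shorter ones.
        for length in range(min(max_len, len(text) - i), 0, -1):
            chunk = text[i:i + length]
            if chunk in entries:
                result.append(entries[chunk])
                i += length
                break
        else:
            # Unknown character: copy it through unchanged.
            result.append(text[i])
            i += 1
    return " ".join(result)


if __name__ == "__main__":
    print(convert("几分钟", TOY_DICT))             # 几分 is matched first
    print(convert("几分钟", TOY_DICT, OVERRIDES))  # the override wins
```

Without the override the lookup consumes 几分 first and leaves 钟 on its own. Adding the longer phrase as an explicit entry is one simple way to force the intended split, at the cost of having to find such cases one by one.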
So I had to find and manually correct all these little errors. Probably one day I will add an online form that would allow users to submit such errors themselves. I also plan to show multiple transcriptions for the most frequent Chinese words, such as 得, which can be pronounced as "de", "děi", or "dé". Right now, only the most frequent pronunciation is shown.