[Python] Automatically Convert a Traditional Chinese PO File to Simplified Chinese


In this post, we write a Python script that automatically converts a Traditional Chinese (zh_TW) PO file to Simplified Chinese (zh_CN) using OpenCC (Open Chinese Convert) and pyOpenCC (the OpenCC Python binding). Please read my previous post [1] first to install OpenCC and pyOpenCC.

Source Code

The zh_TW PO file used for testing:

messages.po
# Chinese translations for PACKAGE package.
# Copyright (C) 2013 THE PACKAGE'S COPYRIGHT HOLDER
# This file is distributed under the same license as the PACKAGE package.
# Automatically generated, 2013.
#
msgid ""
msgstr ""
"Project-Id-Version: PACKAGE VERSION\n"
"Report-Msgid-Bugs-To: \n"
"POT-Creation-Date: 2013-06-04 10:20+0800\n"
"PO-Revision-Date: 2013-03-10 05:19+0800\n"
"Last-Translator: Automatically generated\n"
"Language-Team: none\n"
"Language: zh_TW\n"
"MIME-Version: 1.0\n"
"Content-Type: text/plain; charset=UTF-8\n"
"Content-Transfer-Encoding: 8bit\n"

msgid "Definition and Meaning"
msgstr "定義與意義"

msgid "Words Start with"
msgstr "單字,開頭為"

msgid "Home"
msgstr "首頁"

msgid "Canon"
msgstr "經典"

msgid "About"
msgstr "關於"

msgid "Setting"
msgstr "設定"

msgid "Translation"
msgstr "翻譯"

The Python script:

tw2cn.py
#!/usr/bin/env python
# -*- coding:utf-8 -*-

import re
import pyopencc

# Traditional-to-Simplified converter from pyOpenCC.
tw2cn = pyopencc.OpenCC('zht2zhs.ini').convert


def convert_msgstr(line):
  # Convert the quoted text of a msgstr line; an empty msgstr "" is left as-is.
  result = re.sub(r'msgstr "(.+)"', lambda m: 'msgstr "%s"' % tw2cn(m.group(1)), line)
  # The result may be unicode, which must be encoded before writing (Python 2).
  if isinstance(result, unicode):
    result = result.encode('utf-8')
  return result


if __name__ == '__main__':
  with open("locale/zh_TW/LC_MESSAGES/messages.po", 'r') as ftw:
    with open("locale/zh_CN/LC_MESSAGES/messages.po", "w") as fcn:
      for line in ftw:
        if 'zh_TW' in line:
          # Update the Language header for the target locale.
          fcn.write(line.replace('zh_TW', 'zh_CN'))
        elif line.startswith('msgstr'):
          fcn.write(convert_msgstr(line))
        else:
          fcn.write(line)
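
The key step in the script is calling re.sub with a function as the replacement: the function receives the match object and returns the converted text. Here is a minimal sketch of that step alone. It is written for Python 3 (not the Python 2.7 the script was tested on), and fake_tw2cn is a hypothetical stand-in with a tiny hand-written character table, used in place of pyOpenCC so the example runs without OpenCC installed.

```python
import re

# Stand-in for the pyOpenCC converter (an assumption for illustration only):
# a hand-written table covering just the characters used in this demo.
_TABLE = str.maketrans({"頁": "页", "關": "关", "於": "于"})


def fake_tw2cn(text):
    return text.translate(_TABLE)


MSGSTR_RE = re.compile(r'msgstr "(.+)"')


def convert_msgstr_line(line, convert=fake_tw2cn):
    # The replacement is a function: it gets the match and returns new text.
    return MSGSTR_RE.sub(lambda m: 'msgstr "%s"' % convert(m.group(1)), line)


print(convert_msgstr_line('msgstr "首頁"\n'), end="")
print(convert_msgstr_line('msgstr ""\n'), end="")
```

Because the pattern requires at least one character between the quotes, an empty msgstr "" does not match and passes through unchanged.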

Tested on: Ubuntu Linux 15.10, Python 2.7.10, opencc 0.4.3-2build1, pyopencc-0.4.2.2.
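
One limitation worth noting: the script converts only text that sits on the msgstr line itself. In a PO file, a long translation can continue on following lines that start with a double quote, and those lines fall through to the else branch unconverted. Below is a rough Python 3 sketch of one way to cover them by tracking whether the current line belongs to a msgstr entry. The names convert_po_lines and convert are hypothetical, and the demo uses a one-character stand-in table rather than pyOpenCC.

```python
import re

QUOTED_RE = re.compile(r'"(.*)"')


def convert_po_lines(lines, convert):
    """Yield PO lines with zh_TW replaced by zh_CN and msgstr text converted."""
    in_msgstr = False
    for line in lines:
        if line.startswith("msgstr"):
            in_msgstr = True
        elif not line.startswith('"'):
            # Anything that is not a quoted continuation ends the msgstr entry.
            in_msgstr = False
        if "zh_TW" in line:
            yield line.replace("zh_TW", "zh_CN")
        elif in_msgstr:
            # Convert only the text between the outer quotes.
            yield QUOTED_RE.sub(lambda m: '"%s"' % convert(m.group(1)), line)
        else:
            yield line


# Demo with a stand-in converter (assumption: real use would pass tw2cn).
table = str.maketrans({"譯": "译"})
sample = ['msgid "Translation"\n', 'msgstr ""\n', '"翻"\n', '"譯"\n']
result = list(convert_po_lines(sample, lambda s: s.translate(table)))
print("".join(result), end="")
```

The header entry is also a msgstr followed by quoted lines, so its lines pass through the converter as well; this sketch assumes the converter leaves ASCII text untouched.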


References:

[1] [Python] Conversion of Traditional and Simplified Chinese
[2] Python Regular Expressions | Google for Education | Google Developers
[3] Regex replace (in Python) - a simpler way? - Stack Overflow
[4] python - Import a module from a relative path - Stack Overflow