[Python] Use Regular Expression to Find Strings Marked For Internationalization (i18n)


Python Regular Expression

An internationalized (i18n) web application usually marks translatable strings as _("string"). The xgettext tool from the GNU gettext utilities can extract these strings from given input files. This post, however, uses regular expressions in Python to do the work.

The basic pattern below searches for the marker as it appears inside template delimiters, {{ _("string") }}:

import re

def searchI18n(string):
  # re.search returns only the first match, and (.+) is greedy, so the
  # match is as long as possible. For example, the string
  # {{_("ddd")}}12345{{_("sss")}} yields a single match covering
  # {{_("ddd")}}12345{{_("sss")}}, not just {{_("ddd")}}.
  return re.search(r'{{\s*_\(\s*(.+)\s*\)\s*}}', string)
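The greedy behaviour described in the comments can be checked directly. The sketch below compares the original pattern with a non-greedy variant, (.+?), which is my own substitution and not part of the original function; the variable names are illustrative.

```python
import re

sample = '{{_("ddd")}}12345{{_("sss")}}'

# Greedy: (.+) extends to the last ")" that still lets the pattern finish,
# so both markers are swallowed into one match.
greedy = re.search(r'{{\s*_\(\s*(.+)\s*\)\s*}}', sample)

# Non-greedy (assumed variant): (.+?) stops as soon as the rest of the
# pattern can match, so only the first marker is returned.
lazy = re.search(r'{{\s*_\(\s*(.+?)\s*\)\s*}}', sample)

greedy_match = greedy.group(0)   # the whole sample string
lazy_match = lazy.group(0)       # only the first {{_(...)}} marker
lazy_capture = lazy.group(1)     # captured text, quotes included
```

Note that the captured group still contains the quotation marks, because the pattern does not treat them specially.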

A more advanced pattern excludes ) from the captured group and collects every match in a file:

def getAllMatchesInFile(filepath):
  with open(filepath, 'r') as f:
    # [^)]+ stops the capture at the first ")", so
    # {{_("ddd")}}12345{{_("sss")}} gives two matches instead of one
    return re.findall(r'{{\s*_\(\s*([^)]+)\s*\)\s*}}', f.read())

The above function returns a list of all captured strings in a file. Each entry still includes its surrounding quotation marks.
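The [^)]+ capture has two limits: it keeps the quotation marks, and it cannot match a string that itself contains ). One possible refinement, shown here as a sketch rather than a complete solution, is to match the opening quote and capture up to the same closing quote. I18N_PATTERN and extract_messages are hypothetical names, and escaped quotes inside the string are not handled.

```python
import re

# Group 1 is the opening quote (" or '); group 2 is the text up to the
# next occurrence of that same quote followed by the closing ")" and "}}".
I18N_PATTERN = re.compile(r'{{\s*_\(\s*(["\'])(.*?)\1\s*\)\s*}}')

def extract_messages(text):
    """Return the translatable strings found in text, without quotes."""
    return [m.group(2) for m in I18N_PATTERN.finditer(text)]

sample = '<h1>{{ _("Hello (world)") }}</h1><p>{{_(\'Bye\')}}</p>'
messages = extract_messages(sample)
```

Here the first string contains parentheses, which the [^)]+ version would fail to match at all.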

Alternative (Use xgettext)

You can also extract the strings from a Linux console with the following command (assuming your strings are in HTML files under the current directory, .):

xgettext --no-wrap --from-code=UTF-8 --keyword=_ --output=messages.pot `find . -name '*.html'`

xgettext will save the strings in the file named messages.pot.
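For comparison, the regular-expression approach can be run over the same set of files that find selects. The following is a minimal sketch that assumes UTF-8 encoded files; collect_matches is a hypothetical helper, and the temporary directory exists only to make the example self-contained.

```python
import re
import tempfile
from pathlib import Path

# Same pattern as getAllMatchesInFile: the capture stops at the first ")".
PATTERN = re.compile(r'{{\s*_\(\s*([^)]+)\s*\)\s*}}')

def collect_matches(root):
    """Map each .html file under root to the strings captured in it."""
    results = {}
    for path in sorted(Path(root).rglob('*.html')):
        found = PATTERN.findall(path.read_text(encoding='utf-8'))
        if found:
            results[str(path)] = found
    return results

# Small demonstration on a throwaway directory.
with tempfile.TemporaryDirectory() as tmp:
    (Path(tmp) / 'index.html').write_text('<p>{{ _("Hello") }}</p>', encoding='utf-8')
    (Path(tmp) / 'notes.txt').write_text('{{ _("ignored") }}', encoding='utf-8')
    demo = collect_matches(tmp)
    demo_strings = list(demo.values())
```

Unlike xgettext, this only collects the raw captures; it does not write a .pot file.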

