247 lines
11 KiB
ReStructuredText
247 lines
11 KiB
ReStructuredText
Finding non-translated strings in Python code
|
||
#############################################
|
||
|
||
:date: 2016-12-22T09:35:11Z
|
||
:category: blog
|
||
:tags: development,python
|
||
:url: 2016/12/22/finding-non-translated-strings-in-python-code/
|
||
:save_as: 2016/12/22/finding-non-translated-strings-in-python-code/index.html
|
||
:status: published
|
||
:author: Gergely Polonkai
|
||
|
||
When creating multilingual software, be it on the web, mobile, or desktop, you will eventually
|
||
fail to mark strings as translatable. I know, I know, we developers are superhuman and never do
|
||
that, but somehow I stopped trusting myself recently, so I came up with an idea.
|
||
|
||
Right now I assist in the creation of a multilingual site/web application, where a small part of
|
||
the strings come from the Python code instead of HTML templates. Call it bad practice if you
|
||
like, but I could not find a better way yet.
|
||
|
||
As a start, I tried to parse the source files with simple regular expressions, so I could find
|
||
anything between quotation marks or apostrophes. This attempt quickly failed with strings that
|
||
had such characters inside, escaped or not; my regexps became so complex I lost all hope. Then
|
||
the magic word “lexer” came to mind.
|
||
|
||
While searching for ready made Python lexers, I bumped into the awesome ``ast`` module. AST
|
||
stands for Abstract Syntax Tree, and this module does that: parses a Python file and returns a
|
||
tree of nodes. For walking through these nodes there is a ``NodeVisitor`` class (among other
|
||
means), which is meant to be subclassed. You add a bunch of ``visitN`` methods (where ``N`` is an
|
||
``ast`` class name like ``Str`` or ``Call``), instantiate it, and call its ``visit()`` method with
|
||
the root node. For example, the ``visitStr()`` method will be invoked for every string it finds.
|
||
|
||
How does it work?
|
||
=================
|
||
|
||
Before getting into the details, let’s me present you the code I made:
|
||
|
||
.. code-block:: python
|
||
|
||
import ast
|
||
import gettext
|
||
from gettext import gettext as _
|
||
import sys
|
||
|
||
|
||
def get_func_name(node):
|
||
cls = node.__class__.__name__
|
||
|
||
if cls == 'Call':
|
||
return get_func_name(node.func)
|
||
elif cls == 'Attribute':
|
||
return '{}.{}'.format(
|
||
get_func_name(node.value),
|
||
node.attr)
|
||
elif cls == 'Name':
|
||
return get_func_name(node.id)
|
||
elif cls == 'str':
|
||
return node
|
||
elif cls == 'Str':
|
||
return "<String literal>"
|
||
elif cls == 'Subscript':
|
||
return '{}[{}]'.format(get_func_name(node.value),
|
||
get_func_name(node.slice))
|
||
elif cls == 'Index':
|
||
return get_func_name(node.value)
|
||
else:
|
||
print('ERROR: Unknown class: {}'.format(cls))
|
||
|
||
|
||
class ShowStrings(ast.NodeVisitor):
|
||
TRANSLATION_FUNCTIONS = [
|
||
'_', # gettext.gettext is often imported under this name
|
||
'gettext',
|
||
'gettext.gettext',
|
||
# FIXME: this list is pretty much incomplete
|
||
]
|
||
UNTRANSLATED = 'untranslated 9'
|
||
|
||
def __init__(self, filename=None):
|
||
super(ShowStrings, self).__init__()
|
||
|
||
self.in_call = []
|
||
self.filename = filename or '<parsed string>'
|
||
|
||
def visit_with_trace(self, node, func):
|
||
self.in_call.append((func, node.lineno, node.col_offset))
|
||
self.visit(node)
|
||
self.in_call.pop()
|
||
|
||
def visit_Str(self, node):
|
||
# TODO: make it possible to ignore untranslated strings
|
||
# TODO: make this ignore docstrings
|
||
|
||
# if we are not in a translator function, issue a warning
|
||
if not self.in_call or \
|
||
self.in_call[-1][0] not in self.TRANSLATION_FUNCTIONS:
|
||
try:
|
||
funcname = self.in_call[-1][0]
|
||
except IndexError:
|
||
funcname = None
|
||
|
||
funcall_msg = "outside a function call" if funcname is None \
|
||
else "inside a call to {funcname}".format(
|
||
funcname=funcname)
|
||
|
||
print("WARNING: Untranslated string found at "
|
||
"{filename}:{line}:{col} {funcall_msg}".format(
|
||
filename=self.filename,
|
||
line=node.lineno,
|
||
col=node.col_offset,
|
||
funcall_msg=funcall_msg))
|
||
|
||
def visit_Call(self, node):
|
||
# if we are in a translator function, issue a warninc
|
||
if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
|
||
print("WARNING: function call within a translation function at "
|
||
"{filename}:{line}:{col}".format(filename=self.filename,
|
||
line=node.lineno,
|
||
col=node.col_offset))
|
||
funcname = get_func_name(node)
|
||
|
||
for arg in node.args:
|
||
self.visit_with_trace(arg, funcname)
|
||
|
||
for kwarg in node.keywords:
|
||
self.visit_with_trace(kwarg.value, funcname)
|
||
|
||
def generic_visit(self, node):
|
||
# if we are inside a translator function, issue a warning
|
||
if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
|
||
# Some ast nodes, like Add don’t have position information
|
||
if hasattr(node, 'lineno'):
|
||
print("WARNING: something not a string ({klass}) found in a "
|
||
"translation function at {filename}:{line}:{col}".format(
|
||
filename=self.filename,
|
||
klass=node.__class__.__name__,
|
||
line=node.lineno,
|
||
col=node.col_offset))
|
||
else:
|
||
print("WARNING: something not a string ({klass}) found in a "
|
||
"translation function. Position unknown; function call "
|
||
"is at {filename}:{line}:{col}".format(
|
||
filename=self.filename,
|
||
klass=node.__class__.__name__,
|
||
line=self.in_call[-1][1],
|
||
col=self.in_call[-1][2]))
|
||
|
||
super(ShowStrings, self).generic_visit(node)
|
||
|
||
|
||
def tst(*args, **kwargs):
|
||
pass
|
||
|
||
|
||
def actual_tests():
|
||
_('translated 1')
|
||
tst(_('translated 2'))
|
||
tst(gettext.gettext('translated 3'))
|
||
tst(_('translated 4') + 'native 1')
|
||
tst('native 2'
|
||
'native 3')
|
||
tst(_('native 4' + 'native 5'))
|
||
tst('native 6', b='native 7')
|
||
tst(_(tst('hello!')))
|
||
|
||
|
||
if __name__ == '__main__':
|
||
try:
|
||
filename = sys.argv[1]
|
||
except IndexError:
|
||
filename = __file__
|
||
print("INFO: No filename specified, checking myself.")
|
||
|
||
with open(filename, 'r') as f:
|
||
code = f.read()
|
||
|
||
root = ast.parse(code)
|
||
|
||
show_strings = ShowStrings(filename=filename)
|
||
show_strings.visit(root)
|
||
|
||
The class initialization does two things: creates an empty ``in_call`` list (this will hold our
|
||
primitive backtrace), and saves the filename, if provided.
|
||
|
||
``visitCall``, again, has two tasks. First, it checks if we are inside a translation function.
|
||
If so, it reports the fact that we are translating something that is not a raw string. Although
|
||
it is not necessarily a bad thing, I consider it bad practice as it may result in undefined
|
||
behaviour.
|
||
|
||
Its second task is to walk through the positional and keyword arguments of the function call. For
|
||
each argument it calls the ``visit_with_trace()`` method.
|
||
|
||
This method updates the ``in_call`` property with the current function name and the position of
|
||
the call. This latter is needed because ``ast`` doesn’t store position information for every node
|
||
(operators are a notable example). Then it simply visits the argument node, which is needed
|
||
because ``NodeVisitor.visit()`` is not recursive. When the visit is done (which, with really
|
||
deeply nested calls like ``visit(this(call(iff(you(dare)))))`` will be recursive), the current
|
||
function name is removed from ``in_call``, so subsequent calls on the same level see the same
|
||
“backtrace”.
|
||
|
||
The ``generic_visit()`` method is called for every node that doesn’t have a named visitor (like
|
||
``visitCall`` or ``visitStr``. For the same reason we generate a warning in ``visitCall``, we do
|
||
the same here. If there is anything but a raw string inside a translation function call,
|
||
developers should know about it.
|
||
|
||
The last and I think the most important method is ``visitStr``. All it does is checking the last
|
||
element of the ``in_call`` list, and generates a warning if a raw string is found somewhere that
|
||
is not inside a translation function call.
|
||
|
||
For accurate reports, there is a ``get_func_name()`` function that takes an ``ast`` node as an
|
||
argument. As function call can be anything from actual functions to object methods, this goes all
|
||
down the node’s properties, and recursively reconstructs the name of the actual function.
|
||
|
||
Finally, there are some test functions in this code. ``tst`` and
|
||
``actual_tests`` are there so if I run a self-check on this script, it will
|
||
find these strings and report all the untranslated strings and all the
|
||
potential problems like the string concatenation.
|
||
|
||
Drawbacks
|
||
=========
|
||
|
||
There are several drawbacks here. First, translation function names are built in, to the
|
||
``TRANSLATION_FUNCTIONS`` property of the ``ShowString`` class. You must change this if you use
|
||
other translation functions like ``dngettext``, or if you use a translation library other than
|
||
``gettext``.
|
||
|
||
Second, it cannot ignore untranslated strings right now. It would be great if a pragma like
|
||
``flake8``’s ``# noqa`` or ``coverage.py``’s ``# pragma: no cover`` could be added. However,
|
||
``ast`` doesn’t parse comment blocks, so this proves to be challenging.
|
||
|
||
Third, it reports docstrings as untranslated. Clearly, this is wrong, as docstrings generally
|
||
don’t have to be translated. Ignoring them, again, is a nice challenge I couldn’t yet overcome.
|
||
|
||
The ``get_func_name()`` helper is everything but done. As long as I cannot remove that final
|
||
``else`` clause, there may be error reports. If that happens, the reported class should be
|
||
treated in a new ``elif`` branch.
|
||
|
||
Finally (and the most easily fixed), the warnings are simply printed on the console. It is nice,
|
||
but it should be optional; the problems identified should be stored so the caller can obtain it as
|
||
an array.
|
||
|
||
Bottom line
|
||
===========
|
||
|
||
Finding strings in Python sources is not as hard as I imagined. It was fun to learn using the
|
||
``ast`` module, and it does a great job. Once I can overcome the drawbacks above, this script
|
||
will be a fantastic piece of code that can assist me in my future tasks.
|