Finding non-translated strings in Python code
#############################################

:date: 2016-12-22T09:35:11Z
:category: blog
:tags: development,python
:url: 2016/12/22/finding-non-translated-strings-in-python-code/
:save_as: 2016/12/22/finding-non-translated-strings-in-python-code/index.html
:status: published
:author: Gergely Polonkai

When creating multilingual software, be it on the web, mobile, or desktop, you will eventually
fail to mark strings as translatable. I know, I know, we developers are superhuman and never do
that, but somehow I stopped trusting myself recently, so I came up with an idea.

Right now I assist in the creation of a multilingual site/web application, where a small part of
the strings come from the Python code instead of HTML templates. Call it bad practice if you
like, but I could not find a better way yet.

As a start, I tried to parse the source files with simple regular expressions, so I could find
anything between quotation marks or apostrophes. This attempt quickly failed with strings that
had such characters inside, escaped or not; my regexps became so complex I lost all hope. Then
the magic word “lexer” came to mind.

While searching for ready-made Python lexers, I bumped into the awesome ``ast`` module. AST
stands for Abstract Syntax Tree, and the module does just that: it parses a Python file and
returns a tree of nodes. For walking through these nodes there is a ``NodeVisitor`` class (among
other means), which is meant to be subclassed. You add a bunch of ``visit_N`` methods (where
``N`` is an ``ast`` class name like ``Str`` or ``Call``), instantiate it, and call its
``visit()`` method with the root node. For example, the ``visit_Str()`` method will be invoked
for every string it finds.

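A minimal sketch of the visitor pattern described above. The class name ``StringCollector`` and the sample source are made up for illustration. It assumes Python 3.8 or newer, where string literals arrive as ``ast.Constant`` nodes and ``visit_Str`` is deprecated, so the sketch implements ``visit_Constant`` instead.

```python
import ast


class StringCollector(ast.NodeVisitor):
    """Hypothetical visitor that records every string literal it meets."""

    def __init__(self):
        super().__init__()
        self.found = []

    def visit_Constant(self, node):
        # Assumption: Python 3.8+, where string literals are ast.Constant nodes
        if isinstance(node.value, str):
            self.found.append((node.lineno, node.value))


source = "greeting = 'hello'\nprint(greeting, 'world', 42)\n"
collector = StringCollector()
collector.visit(ast.parse(source))
print(collector.found)
```

Node types without a dedicated method fall through to ``generic_visit()``, which is what carries the walk down to the string literals here.
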
How does it work?
=================

Before getting into the details, let me present the code I made:

.. code-block:: python

    import ast
    import gettext
    from gettext import gettext as _
    import sys


    def get_func_name(node):
        cls = node.__class__.__name__

        if cls == 'Call':
            return get_func_name(node.func)
        elif cls == 'Attribute':
            return '{}.{}'.format(
                get_func_name(node.value),
                node.attr)
        elif cls == 'Name':
            return get_func_name(node.id)
        elif cls == 'str':
            return node
        elif cls == 'Str':
            return "<String literal>"
        elif cls == 'Subscript':
            return '{}[{}]'.format(get_func_name(node.value),
                                   get_func_name(node.slice))
        elif cls == 'Index':
            return get_func_name(node.value)
        else:
            print('ERROR: Unknown class: {}'.format(cls))


    class ShowStrings(ast.NodeVisitor):
        TRANSLATION_FUNCTIONS = [
            '_',  # gettext.gettext is often imported under this name
            'gettext',
            'gettext.gettext',
            # FIXME: this list is pretty much incomplete
        ]
        UNTRANSLATED = 'untranslated 9'

        def __init__(self, filename=None):
            super(ShowStrings, self).__init__()

            self.in_call = []
            self.filename = filename or '<parsed string>'

        def visit_with_trace(self, node, func):
            self.in_call.append((func, node.lineno, node.col_offset))
            self.visit(node)
            self.in_call.pop()

        def visit_Str(self, node):
            # TODO: make it possible to ignore untranslated strings
            # TODO: make this ignore docstrings

            # if we are not in a translator function, issue a warning
            if not self.in_call or \
               self.in_call[-1][0] not in self.TRANSLATION_FUNCTIONS:
                try:
                    funcname = self.in_call[-1][0]
                except IndexError:
                    funcname = None

                funcall_msg = "outside a function call" if funcname is None \
                    else "inside a call to {funcname}".format(
                        funcname=funcname)

                print("WARNING: Untranslated string found at "
                      "{filename}:{line}:{col} {funcall_msg}".format(
                          filename=self.filename,
                          line=node.lineno,
                          col=node.col_offset,
                          funcall_msg=funcall_msg))

        def visit_Call(self, node):
            # if we are in a translator function, issue a warning
            if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
                print("WARNING: function call within a translation function at "
                      "{filename}:{line}:{col}".format(filename=self.filename,
                                                       line=node.lineno,
                                                       col=node.col_offset))
            funcname = get_func_name(node)

            for arg in node.args:
                self.visit_with_trace(arg, funcname)

            for kwarg in node.keywords:
                self.visit_with_trace(kwarg.value, funcname)

        def generic_visit(self, node):
            # if we are inside a translator function, issue a warning
            if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
                # Some ast nodes, like Add don’t have position information
                if hasattr(node, 'lineno'):
                    print("WARNING: something not a string ({klass}) found in a "
                          "translation function at {filename}:{line}:{col}".format(
                              filename=self.filename,
                              klass=node.__class__.__name__,
                              line=node.lineno,
                              col=node.col_offset))
                else:
                    print("WARNING: something not a string ({klass}) found in a "
                          "translation function. Position unknown; function call "
                          "is at {filename}:{line}:{col}".format(
                              filename=self.filename,
                              klass=node.__class__.__name__,
                              line=self.in_call[-1][1],
                              col=self.in_call[-1][2]))

            super(ShowStrings, self).generic_visit(node)


    def tst(*args, **kwargs):
        pass


    def actual_tests():
        _('translated 1')
        tst(_('translated 2'))
        tst(gettext.gettext('translated 3'))
        tst(_('translated 4') + 'native 1')
        tst('native 2'
            'native 3')
        tst(_('native 4' + 'native 5'))
        tst('native 6', b='native 7')
        tst(_(tst('hello!')))


    if __name__ == '__main__':
        try:
            filename = sys.argv[1]
        except IndexError:
            filename = __file__
            print("INFO: No filename specified, checking myself.")

        with open(filename, 'r') as f:
            code = f.read()

        root = ast.parse(code)

        show_strings = ShowStrings(filename=filename)
        show_strings.visit(root)

The class initialization does two things: it creates an empty ``in_call`` list (this will hold
our primitive backtrace), and saves the filename, if provided.

``visit_Call``, again, has two tasks. First, it checks if we are inside a translation function.
If so, it reports the fact that we are translating something that is not a raw string. Although
it is not necessarily a bad thing, I consider it bad practice as it may result in undefined
behaviour.

Its second task is to walk through the positional and keyword arguments of the function call. For
each argument it calls the ``visit_with_trace()`` method.

This method updates the ``in_call`` property with the current function name and the position of
the call. The latter is needed because ``ast`` doesn’t store position information for every node
(operators are a notable example). Then it simply visits the argument node, which is needed
because once a node type has its own visitor method (like ``visit_Call``), ``NodeVisitor`` no
longer descends into that node’s children on its own. When the visit is done (which, with really
deeply nested calls like ``visit(this(call(iff(you(dare)))))``, will be recursive), the current
function name is removed from ``in_call``, so subsequent calls on the same level see the same
“backtrace”.

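The push, visit, pop sequence can be sketched in isolation. The ``CallTracer`` class below is hypothetical and simplified: it only resolves plain function names, it assumes Python 3.8+ (``ast.Constant``), and it wraps the visit in ``try``/``finally`` so the stack is restored even if a visitor raises, which the script above does not do.

```python
import ast


class CallTracer(ast.NodeVisitor):
    """Hypothetical visitor recording which call each string literal sits in."""

    def __init__(self):
        super().__init__()
        self.in_call = []  # stack of enclosing function names
        self.seen = []     # (string, innermost enclosing call or None)

    def visit_with_trace(self, node, func):
        self.in_call.append(func)
        try:
            self.visit(node)
        finally:
            # Always restore the stack for the next sibling argument
            self.in_call.pop()

    def visit_Call(self, node):
        # Simplification: only plain names are resolved here
        name = node.func.id if isinstance(node.func, ast.Name) else '<other>'
        for arg in node.args:
            self.visit_with_trace(arg, name)
        for keyword in node.keywords:
            self.visit_with_trace(keyword.value, name)

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            caller = self.in_call[-1] if self.in_call else None
            self.seen.append((node.value, caller))


tracer = CallTracer()
tracer.visit(ast.parse("tst(_('a'), 'b', key='c')\nx = 'd'\n"))
print(tracer.seen)
```

Each sibling argument of ``tst`` sees ``tst`` on top of the stack, while the string inside ``_()`` sees ``_``, and the string in the plain assignment sees an empty stack.
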
The ``generic_visit()`` method is called for every node that doesn’t have a named visitor (like
``visit_Call`` or ``visit_Str``). For the same reason we generate a warning in ``visit_Call``, we
do the same here. If there is anything but a raw string inside a translation function call,
developers should know about it.

The last and, I think, the most important method is ``visit_Str``. All it does is check the last
element of the ``in_call`` list and generate a warning if a raw string is found somewhere that
is not inside a translation function call.

For accurate reports, there is a ``get_func_name()`` function that takes an ``ast`` node as an
argument. As a function call can be anything from actual functions to object methods, this goes
all the way down the node’s properties, and recursively reconstructs the name of the actual
function.

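A minimal sketch of that recursive reconstruction, covering only calls, attributes, and plain names. ``dotted_name`` and ``first_call`` are hypothetical helpers written for this illustration, and everything they do not recognise is reported as ``<unknown>`` rather than handled.

```python
import ast


def dotted_name(node):
    """Hypothetical helper: rebuild 'a.b.c' from the callee of a call."""
    if isinstance(node, ast.Call):
        return dotted_name(node.func)
    if isinstance(node, ast.Attribute):
        return '{}.{}'.format(dotted_name(node.value), node.attr)
    if isinstance(node, ast.Name):
        return node.id
    # Anything else (subscripts, literals, lambdas) is left unresolved here
    return '<unknown>'


def first_call(source):
    """Return the first ast.Call node found in the given source."""
    return next(n for n in ast.walk(ast.parse(source))
                if isinstance(n, ast.Call))


print(dotted_name(first_call("gettext.gettext('x')")))
print(dotted_name(first_call("self.i18n.translate('x')")))
```

Using ``isinstance()`` instead of comparing class names is a design choice of this sketch, not something the script above does.
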
Finally, there are some test functions in this code. ``tst`` and ``actual_tests`` are there so
if I run a self-check on this script, it will find these strings and report all the untranslated
strings and all the potential problems like the string concatenation.

Drawbacks
=========

There are several drawbacks here. First, translation function names are hard-coded in the
``TRANSLATION_FUNCTIONS`` property of the ``ShowStrings`` class. You must change this if you use
other translation functions like ``dngettext``, or if you use a translation library other than
``gettext``.

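One possible way around the hard-coded list is to pass the names in when the visitor is created. This is only a sketch: ``ConfigurableChecker``, its ``translation_functions`` parameter, and ``is_translator()`` are hypothetical names that do not exist in the script above.

```python
import ast


class ConfigurableChecker(ast.NodeVisitor):
    """Hypothetical variant whose translation function names are configurable."""

    DEFAULT_TRANSLATION_FUNCTIONS = ('_', 'gettext', 'gettext.gettext')

    def __init__(self, translation_functions=None):
        super().__init__()
        names = translation_functions or self.DEFAULT_TRANSLATION_FUNCTIONS
        # A frozenset gives fast membership tests and cannot be changed by accident
        self.translation_functions = frozenset(names)

    def is_translator(self, funcname):
        return funcname in self.translation_functions


checker = ConfigurableChecker(['_', 'ngettext', 'dngettext'])
print(checker.is_translator('dngettext'))
print(checker.is_translator('gettext'))
```

The visitor methods would then ask ``is_translator()`` instead of looking into a class-level list.
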
Second, it cannot ignore untranslated strings right now. It would be great if a pragma like
``flake8``’s ``# noqa`` or ``coverage.py``’s ``# pragma: no cover`` could be added. However,
``ast`` doesn’t keep comments in the tree, so this proves to be challenging.

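Since the tree holds no comments, one possible approach is to collect them in a separate pass with the standard ``tokenize`` module. The sketch below is an assumption about how that could look: the ``# notrans`` pragma and the ``ignored_lines()`` helper are made up for illustration.

```python
import io
import tokenize


def ignored_lines(source, pragma='# notrans'):
    """Return the line numbers that carry the (made-up) pragma comment."""
    lines = set()
    for tok in tokenize.generate_tokens(io.StringIO(source).readline):
        if tok.type == tokenize.COMMENT and tok.string.strip() == pragma:
            lines.add(tok.start[0])
    return lines


code = "a = 'keep me'  # notrans\nb = 'report me'\nc = 'x'  # unrelated\n"
print(ignored_lines(code))
```

A visitor could then skip any string node whose ``lineno`` is in the returned set. Strings spanning several lines would need extra care, because the node’s ``lineno`` is the line where the string starts while the comment sits on the line where it ends.
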
Third, it reports docstrings as untranslated. Clearly, this is wrong, as docstrings generally
don’t have to be translated. Ignoring them, again, is a nice challenge I couldn’t yet overcome.

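One way docstrings might be recognised is by their position: a docstring is a bare string expression that opens the body of a module, class, or function. The ``docstring_nodes()`` helper below is a hypothetical sketch of that idea and assumes Python 3.8+, where the string is an ``ast.Constant`` node.

```python
import ast


def docstring_nodes(tree):
    """Collect the string nodes that serve as docstrings in the given tree."""
    holders = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
    found = set()
    for node in ast.walk(tree):
        if not isinstance(node, holders) or not node.body:
            continue
        first = node.body[0]
        # A docstring is a bare string expression opening the body
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            found.add(first.value)
    return found


tree = ast.parse(
    '"""Module doc."""\n'
    '\n'
    'def f():\n'
    '    "Function doc."\n'
    '    return "not a docstring"\n'
)
print(sorted(node.value for node in docstring_nodes(tree)))
```

A string visitor could return early for any node contained in this set, leaving the other literals to be reported as before.
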
The ``get_func_name()`` helper is anything but done. As long as I cannot remove that final
``else`` clause, there may be error reports. If that happens, the reported class should be
handled in a new ``elif`` branch.

Finally (and most easily fixed), the warnings are simply printed on the console. It is nice,
but it should be optional; the problems identified should be stored so the caller can obtain
them as a list.

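A sketch of what storing the findings could look like. ``Problem``, ``CollectingChecker``, and ``report()`` are hypothetical names, the check itself is deliberately simplified (every string literal is recorded), and Python 3.8+ is assumed for ``visit_Constant``.

```python
import ast
from collections import namedtuple

# Hypothetical record type for one finding
Problem = namedtuple('Problem', 'filename line col message')


class CollectingChecker(ast.NodeVisitor):
    """Hypothetical visitor that stores findings instead of printing them."""

    def __init__(self, filename='<parsed string>'):
        super().__init__()
        self.filename = filename
        self.problems = []

    def report(self, node, message):
        self.problems.append(
            Problem(self.filename, node.lineno, node.col_offset, message))

    def visit_Constant(self, node):
        # Simplification: every string literal is recorded, translated or not
        if isinstance(node.value, str):
            self.report(node, 'string literal found')


checker = CollectingChecker('example.py')
checker.visit(ast.parse("x = 'hi'\n"))
for problem in checker.problems:
    print('{0.filename}:{0.line}:{0.col} {0.message}'.format(problem))
```

Printing becomes a decision of the caller, who can just as well filter, sort, or count the collected entries.
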
Bottom line
===========

Finding strings in Python sources is not as hard as I imagined. It was fun to learn using the
``ast`` module, and it does a great job. Once I can overcome the drawbacks above, this script
will be a fantastic piece of code that can assist me in my future tasks.