
Finding non-translated strings in Python code
#############################################

:date: 2016-12-22T09:35:11Z
:category: blog
:tags: development,python
:url: 2016/12/22/finding-non-translated-strings-in-python-code/
:save_as: 2016/12/22/finding-non-translated-strings-in-python-code/index.html
:status: published
:author: Gergely Polonkai

When creating multilingual software, be it on the web, mobile, or desktop, you will eventually
fail to mark strings as translatable.  I know, I know, we developers are superhuman and never do
that, but somehow I stopped trusting myself recently, so I came up with an idea.

Right now I assist in the creation of a multilingual site/web application, where a small part of
the strings come from the Python code instead of HTML templates.  Call it bad practice if you
like, but I could not find a better way yet.

As a start, I tried to parse the source files with simple regular expressions, so I could find
anything between quotation marks or apostrophes.  This attempt quickly failed with strings that
had such characters inside, escaped or not; my regexps became so complex I lost all hope.  Then
the magic word “lexer” came to mind.

While searching for ready-made Python lexers, I bumped into the awesome ``ast`` module.  AST
stands for Abstract Syntax Tree, and this module does exactly that: it parses a Python file and
returns a tree of nodes.  For walking through these nodes there is a ``NodeVisitor`` class (among
other means), which is meant to be subclassed.  You add a bunch of ``visit_N`` methods (where ``N``
is an ``ast`` class name like ``Str`` or ``Call``), instantiate it, and call its ``visit()`` method
with the root node.  For example, the ``visit_Str()`` method will be invoked for every string it
finds.
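
As a minimal sketch of the visitor pattern described above (not the script presented later), the
following collects every string literal in a piece of source code.  It assumes Python 3.8 or newer,
where string literals are parsed as ``ast.Constant`` nodes and handled by ``visit_Constant()``; on
older versions ``visit_Str()`` plays this role.  The ``StringCollector`` class and the sample source
are made up for illustration.

.. code-block:: python

   import ast


   class StringCollector(ast.NodeVisitor):
       """Hypothetical example visitor: records (line, value) for each string."""

       def __init__(self):
           self.found = []

       def visit_Constant(self, node):
           # ast.Constant also covers numbers, bytes, None, etc., so filter
           if isinstance(node.value, str):
               self.found.append((node.lineno, node.value))


   sample = "greeting = 'hello'\nprint(greeting, \"world\", 42)\n"
   collector = StringCollector()
   collector.visit(ast.parse(sample))
   print(collector.found)

Nodes without a matching ``visit_N`` method fall through to ``generic_visit()``, which is what
carries the traversal down to the string nodes here.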

How does it work?
=================

Before getting into the details, let me present the code I made:

.. code-block:: python

   import ast
   import gettext
   from gettext import gettext as _
   import sys


   def get_func_name(node):
       cls = node.__class__.__name__

       if cls == 'Call':
           return get_func_name(node.func)
       elif cls == 'Attribute':
           return '{}.{}'.format(
               get_func_name(node.value),
               node.attr)
       elif cls == 'Name':
           return get_func_name(node.id)
       elif cls == 'str':
           return node
       elif cls == 'Str':
           return "<String literal>"
       elif cls == 'Subscript':
           return '{}[{}]'.format(get_func_name(node.value),
                                  get_func_name(node.slice))
       elif cls == 'Index':
           return get_func_name(node.value)
       else:
           print('ERROR: Unknown class: {}'.format(cls))


   class ShowStrings(ast.NodeVisitor):
       TRANSLATION_FUNCTIONS = [
           '_',  # gettext.gettext is often imported under this name
           'gettext',
           'gettext.gettext',
           # FIXME: this list is pretty much incomplete
       ]
       UNTRANSLATED = 'untranslated 9'

       def __init__(self, filename=None):
           super(ShowStrings, self).__init__()

           self.in_call = []
           self.filename = filename or '<parsed string>'

       def visit_with_trace(self, node, func):
           self.in_call.append((func, node.lineno, node.col_offset))
           self.visit(node)
           self.in_call.pop()

       def visit_Str(self, node):
           # TODO: make it possible to ignore untranslated strings
           # TODO: make this ignore docstrings

           # if we are not in a translator function, issue a warning
           if not self.in_call or \
              self.in_call[-1][0] not in self.TRANSLATION_FUNCTIONS:
               try:
                   funcname = self.in_call[-1][0]
               except IndexError:
                   funcname = None

               funcall_msg = "outside a function call" if funcname is None \
                             else "inside a call to {funcname}".format(
                                     funcname=funcname)

               print("WARNING: Untranslated string found at "
                     "{filename}:{line}:{col} {funcall_msg}".format(
                         filename=self.filename,
                         line=node.lineno,
                         col=node.col_offset,
                         funcall_msg=funcall_msg))

       def visit_Call(self, node):
           # if we are in a translator function, issue a warning
           if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
               print("WARNING: function call within a translation function at "
                     "{filename}:{line}:{col}".format(filename=self.filename,
                                                      line=node.lineno,
                                                      col=node.col_offset))
           funcname = get_func_name(node)

           for arg in node.args:
               self.visit_with_trace(arg, funcname)

           for kwarg in node.keywords:
               self.visit_with_trace(kwarg.value, funcname)

       def generic_visit(self, node):
           # if we are inside a translator function, issue a warning
           if self.in_call and self.in_call[-1][0] in self.TRANSLATION_FUNCTIONS:
               # Some ast nodes, like Add don’t have position information
               if hasattr(node, 'lineno'):
                   print("WARNING: something not a string ({klass}) found in a "
                         "translation function at {filename}:{line}:{col}".format(
                             filename=self.filename,
                             klass=node.__class__.__name__,
                             line=node.lineno,
                             col=node.col_offset))
               else:
                   print("WARNING: something not a string ({klass}) found in a "
                         "translation function.  Position unknown; function call "
                         "is at {filename}:{line}:{col}".format(
                             filename=self.filename,
                             klass=node.__class__.__name__,
                             line=self.in_call[-1][1],
                             col=self.in_call[-1][2]))

           super(ShowStrings, self).generic_visit(node)


   def tst(*args, **kwargs):
       pass


   def actual_tests():
       _('translated 1')
       tst(_('translated 2'))
       tst(gettext.gettext('translated 3'))
       tst(_('translated 4') + 'native 1')
       tst('native 2'
           'native 3')
       tst(_('native 4' + 'native 5'))
       tst('native 6', b='native 7')
       tst(_(tst('hello!')))


   if __name__ == '__main__':
       try:
           filename = sys.argv[1]
       except IndexError:
           filename = __file__
           print("INFO:    No filename specified, checking myself.")

       with open(filename, 'r') as f:
           code = f.read()

       root = ast.parse(code)

       show_strings = ShowStrings(filename=filename)
       show_strings.visit(root)

The class initialization does two things: creates an empty ``in_call`` list (this will hold our
primitive backtrace), and saves the filename, if provided.

``visit_Call``, again, has two tasks.  First, it checks if we are inside a translation function.
If so, it reports the fact that we are translating something that is not a raw string.  Although
it is not necessarily a bad thing, I consider it bad practice as it may result in undefined
behaviour.

Its second task is to walk through the positional and keyword arguments of the function call.  For
each argument it calls the ``visit_with_trace()`` method.

This method updates the ``in_call`` attribute with the current function name and the position of
the call.  The latter is needed because ``ast`` doesn’t store position information for every node
(operators are a notable example).  Then it simply visits the argument node, which is needed
because ``NodeVisitor.visit()`` does not descend into child nodes by itself once a specific visitor
method such as ``visit_Call()`` is defined.  When the visit is done (with really deeply nested
calls like ``visit(this(call(iff(you(dare)))))`` it will itself recurse), the current function name
is removed from ``in_call``, so subsequent arguments on the same level see the same “backtrace”.
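
The push/visit/pop idea can be sketched in isolation as follows.  This is a simplified
illustration, not the script above: the ``CallTracker`` class is hypothetical, it only handles
positional arguments and plain function names, and it assumes Python 3.8 or newer (string literals
arrive as ``ast.Constant``).

.. code-block:: python

   import ast


   class CallTracker(ast.NodeVisitor):
       """Hypothetical visitor noting the innermost call around each string."""

       def __init__(self):
           self.in_call = []  # stack of enclosing function names
           self.seen = []     # (string value, innermost call name or None)

       def visit_Call(self, node):
           # Only plain names are resolved in this sketch
           name = node.func.id if isinstance(node.func, ast.Name) else '<other>'

           for arg in node.args:
               self.in_call.append(name)
               try:
                   self.visit(arg)
               finally:
                   # Always pop, so the next argument sees the same stack
                   self.in_call.pop()

       def visit_Constant(self, node):
           if isinstance(node.value, str):
               caller = self.in_call[-1] if self.in_call else None
               self.seen.append((node.value, caller))


   tracker = CallTracker()
   tracker.visit(ast.parse("outer(_('a'), 'b')\nx = 'c'\n"))
   print(tracker.seen)

Wrapping the pop in ``try``/``finally`` is a design choice of this sketch; it keeps the stack
balanced even if a visitor method raises.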

The ``generic_visit()`` method is called for every node that doesn’t have a named visitor (like
``visit_Call`` or ``visit_Str``).  We generate a warning here for the same reason we do in
``visit_Call``: if there is anything but a raw string inside a translation function call,
developers should know about it.

The last and, I think, the most important method is ``visit_Str``.  All it does is check the last
element of the ``in_call`` list and generate a warning if a raw string is found somewhere that
is not inside a translation function call.

For accurate reports, there is a ``get_func_name()`` function that takes an ``ast`` node as an
argument.  As a function call can be anything from an actual function to an object method, this
walks all the way down the node’s attributes and recursively reconstructs the name of the actual
function.
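
To illustrate the kind of tree this has to deal with, here is a reduced sketch that handles only
names and attribute access.  The ``dotted_name()`` helper is a hypothetical, simplified variant
written for this example, not a replacement for the full function.

.. code-block:: python

   import ast


   def dotted_name(node):
       """Simplified, hypothetical variant: names and attributes only."""
       if isinstance(node, ast.Call):
           # The callee of a call lives in its ``func`` attribute
           return dotted_name(node.func)
       if isinstance(node, ast.Attribute):
           # ``a.b`` is Attribute(value=<a>, attr='b'); recurse on the left side
           return '{}.{}'.format(dotted_name(node.value), node.attr)
       if isinstance(node, ast.Name):
           return node.id
       return '<unknown>'


   # mode='eval' parses a single expression; ``.body`` is the Call node
   call = ast.parse("gettext.gettext('hello')", mode='eval').body
   print(dotted_name(call))

   chained = ast.parse("self.translator.translate('hello')", mode='eval').body
   print(dotted_name(chained))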

Finally, there are some test functions in this code.  ``tst`` and
``actual_tests`` are there so if I run a self-check on this script, it will
find these strings and report all the untranslated strings and all the
potential problems like the string concatenation.

Drawbacks
=========

There are several drawbacks here.  First, translation function names are hard-coded in the
``TRANSLATION_FUNCTIONS`` attribute of the ``ShowStrings`` class.  You must change this if you use
other translation functions like ``dngettext``, or if you use a translation library other than
``gettext``.
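
One possible way around this, sketched under the assumption that the names are passed in by the
caller, is to accept them as a constructor argument.  The ``ConfigurableVisitor`` class and its
``translation_functions`` parameter are hypothetical and not part of the script above.

.. code-block:: python

   import ast


   class ConfigurableVisitor(ast.NodeVisitor):
       """Hypothetical base class with caller-supplied translation functions."""

       DEFAULT_TRANSLATION_FUNCTIONS = ('_', 'gettext', 'gettext.gettext')

       def __init__(self, translation_functions=None):
           super().__init__()
           # Fall back to the defaults when nothing is supplied
           self.translation_functions = set(
               translation_functions or self.DEFAULT_TRANSLATION_FUNCTIONS)

       def is_translation_function(self, name):
           return name in self.translation_functions


   default_visitor = ConfigurableVisitor()
   custom_visitor = ConfigurableVisitor(['_', 'ngettext', 'dngettext'])
   print(default_visitor.is_translation_function('dngettext'))
   print(custom_visitor.is_translation_function('dngettext'))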

Second, it cannot ignore untranslated strings right now.  It would be great if a pragma like
``flake8``’s ``# noqa`` or ``coverage.py``’s ``# pragma: no cover`` could be added.  However,
``ast`` doesn’t parse comment blocks, so this proves to be challenging.
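
One direction that might work is the standard ``tokenize`` module, which does report comments.
The sketch below collects the line numbers that carry a marker comment; the ``# notranslate``
pragma and the ``lines_with_pragma()`` helper are invented for this example, and matching those
lines against ``node.lineno`` in the visitor is left out.

.. code-block:: python

   import io
   import tokenize


   def lines_with_pragma(source, pragma='# notranslate'):
       """Return the line numbers whose comment equals the (hypothetical) pragma."""
       ignored = set()
       tokens = tokenize.generate_tokens(io.StringIO(source).readline)

       for tok in tokens:
           if tok.type == tokenize.COMMENT and tok.string.strip() == pragma:
               # tok.start is a (row, column) pair; rows are 1-based
               ignored.add(tok.start[0])

       return ignored


   sample = (
       "name = 'internal-key'  # notranslate\n"
       "label = 'Shown to users'\n"
   )
   print(lines_with_pragma(sample))

A string spanning several lines starts on one line while its trailing comment sits on another, so
a real implementation would need to decide which of the two lines the pragma applies to.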

Third, it reports docstrings as untranslated.  Clearly, this is wrong, as docstrings generally
don’t have to be translated.  Ignoring them, again, is a nice challenge I couldn’t yet overcome.
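
A possible approach, shown here only as a sketch, relies on the fact that a docstring is the first
statement in the body of a module, class, or function, and is a bare string expression.  The
``docstring_positions()`` helper is hypothetical, and the code assumes Python 3.8 or newer, where
string literals are ``ast.Constant`` nodes.

.. code-block:: python

   import ast


   def docstring_positions(tree):
       """Return (line, column) pairs of string nodes that act as docstrings."""
       owners = (ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef)
       positions = set()

       for node in ast.walk(tree):
           if not isinstance(node, owners) or not node.body:
               continue

           first = node.body[0]
           # A docstring is an expression statement holding a plain string
           if (isinstance(first, ast.Expr)
                   and isinstance(first.value, ast.Constant)
                   and isinstance(first.value.value, str)):
               positions.add((first.value.lineno, first.value.col_offset))

       return positions


   sample = 'def f():\n    """Doc."""\n    return "real"\n'
   positions = docstring_positions(ast.parse(sample))
   print(positions)

The visitor could then skip any string whose position is in this set.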

The ``get_func_name()`` helper is anything but done.  As long as I cannot remove that final
``else`` clause, there may be error reports.  If that happens, the reported class should be
handled in a new ``elif`` branch.

Finally (and most easily fixed), the warnings are simply printed on the console.  That is nice,
but it should be optional; the problems identified should be stored so the caller can obtain them
as a list.
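
A minimal sketch of that idea follows, with a deliberately simplified check that flags every
string literal.  The ``Problem`` tuple, the ``CollectingVisitor`` class, and its ``report()``
method are hypothetical names; Python 3.8 or newer is assumed for ``visit_Constant()``.

.. code-block:: python

   import ast
   from collections import namedtuple

   # Hypothetical record type for one finding
   Problem = namedtuple('Problem', 'filename line col message')


   class CollectingVisitor(ast.NodeVisitor):
       """Stores findings in a list instead of printing them."""

       def __init__(self, filename='<parsed string>'):
           super().__init__()
           self.filename = filename
           self.problems = []

       def report(self, node, message):
           self.problems.append(
               Problem(self.filename, node.lineno, node.col_offset, message))

       def visit_Constant(self, node):
           # Simplified: every string is reported, translated or not
           if isinstance(node.value, str):
               self.report(node, 'string literal found')


   visitor = CollectingVisitor('example.py')
   visitor.visit(ast.parse("x = 'a'\ny = 1\n"))

   # Printing becomes the caller's decision
   for problem in visitor.problems:
       print('{0.filename}:{0.line}:{0.col} {0.message}'.format(problem))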

Bottom line
===========

Finding strings in Python sources is not as hard as I imagined.  It was fun to learn to use the
``ast`` module, and it does a great job.  Once I can overcome the drawbacks above, this script
will be a fantastic piece of code that can assist me in my future tasks.