Add post about finding non-translated strings in Python
Commit 3734bb67d0 (parent a0e65ff2ff)
---
layout: post
title: "Finding non-translated strings in Python code"
date: 2016-12-22 09:35:11
tags: [development, python]
published: true
author:
  name: Gergely Polonkai
  email: gergely@polonkai.eu
---
When creating multilingual software, be it on the web, mobile, or desktop,
you will eventually fail to mark strings as translatable. I know, I know,
we developers are superhuman and never do that, but somehow I stopped
trusting myself recently, so I came up with an idea.

Right now I assist in the creation of a multilingual site/web application,
where a small part of the strings comes from the Python code instead of HTML
templates. Call it bad practice if you like, but I have not found a better
way yet.

As a start, I tried to parse the source files with simple regular
expressions, so I could find anything between quotation marks or
apostrophes. This attempt quickly failed with strings that had such
characters inside, escaped or not; my regexps became so complex I lost all
hope. Then the magic word “lexer” came to mind.

While searching for ready-made Python lexers, I bumped into the awesome
`ast` module. AST stands for Abstract Syntax Tree, and this module does
exactly that: it parses Python source and returns a tree of nodes. For
walking through these nodes there is a `NodeVisitor` class (among other
means), which is meant to be subclassed. You add a bunch of `visit_N`
methods (where `N` is an `ast` class name like `Str` or `Call`),
instantiate it, and call its `visit()` method with the root node. For
example, the `visit_Str()` method will be invoked for every string it finds.
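To make the visitor pattern concrete, here is a minimal sketch that is not
taken from the gist. `StringCollector` and `SOURCE` are names made up for
this illustration. It also assumes Python 3.8 or later, where string
literals are `ast.Constant` nodes, so the method is called `visit_Constant`
instead of `visit_Str`.

```python
import ast


class StringCollector(ast.NodeVisitor):
    """Collect every string literal found in a piece of source code."""

    def __init__(self):
        self.found = []

    def visit_Constant(self, node):
        # Numbers, None, etc. are Constant nodes too; keep only strings.
        if isinstance(node.value, str):
            self.found.append((node.lineno, node.value))


# The second literal mixes both quote types, the case that broke the regexps.
SOURCE = '''
greeting = "Hello"
print(greeting, 'it\\'s "quoted"', 42)
'''

collector = StringCollector()
collector.visit(ast.parse(SOURCE))

for lineno, value in collector.found:
    print(lineno, repr(value))
```

Because this class defines no `visit_Call`, the default `generic_visit()`
descends into the call arguments on its own.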
#### How does it work?

Before getting into the details, let me present the code I wrote:

{% gist gergelypolonkai/1a16a47e5a1971ca33e58bdfd88c5059 string-checker.py %}
The class initialization does two things: it creates an empty `in_call`
list (this will hold our primitive backtrace), and saves the filename, if
provided.

`visit_Call`, again, has two tasks. First, it checks if we are inside a
translation function. If so, it reports the fact that we are translating
something that is not a plain string literal. Although it is not
necessarily a bad thing, I consider it bad practice: the value is only
known at run time, so it may not match anything in the message catalog.

Its second task is to walk through the positional and keyword arguments of
the function call. For each argument it calls the `visit_with_trace()`
method.

This method updates the `in_call` property with the current function name
and the position of the call. The latter is needed because `ast` doesn’t
store position information for every node (operators are a notable example).
Then it simply visits the argument node, which is needed because once a
class defines its own `visit_Call`, `NodeVisitor` no longer descends into
the children of call nodes by itself. When the visit is done (which, with
really deeply nested calls like `visit(this(call(iff(you(dare)))))`, will
be recursive), the current function name is removed from `in_call`, so
subsequent calls on the same level see the same “backtrace”.
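The push, visit, pop cycle can be sketched roughly like this. It is a
simplified stand-in for the gist, not a copy of it: `CallTracer` and `seen`
are hypothetical names, the function name is only resolved for plain names,
and it assumes Python 3.8+ (`visit_Constant`).

```python
import ast


class CallTracer(ast.NodeVisitor):
    """Remember which calls enclose each string literal."""

    def __init__(self):
        self.in_call = []  # primitive backtrace: (function name, line number)
        self.seen = []     # (string value, names of the enclosing calls)

    def visit_Call(self, node):
        # Simplification: only plain names such as `_` or `print` are resolved.
        name = node.func.id if isinstance(node.func, ast.Name) else "<other>"
        arguments = node.args + [keyword.value for keyword in node.keywords]
        for argument in arguments:
            self.visit_with_trace(argument, name, node.lineno)

    def visit_with_trace(self, node, name, lineno):
        # Push the current call, visit the argument, then pop again so the
        # next argument on the same level sees the same backtrace.
        self.in_call.append((name, lineno))
        try:
            self.visit(node)
        finally:
            self.in_call.pop()

    def visit_Constant(self, node):
        if isinstance(node.value, str):
            trace = [name for name, _lineno in self.in_call]
            self.seen.append((node.value, trace))


tracer = CallTracer()
tracer.visit(ast.parse('outer(inner("a"), key="b")\n"c"\n'))
print(tracer.seen)
```

Here `"a"` is seen inside `outer` and `inner`, `"b"` only inside `outer`,
and `"c"` with an empty backtrace.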
The `generic_visit()` method is called for every node that doesn’t have a
named visitor (like `visit_Call` or `visit_Str`). For the same reason we
generate a warning in `visit_Call`, we do the same here. If there is
anything but a plain string literal inside a translation function call,
developers should know about it.

The last, and I think the most important, method is `visit_Str`. All it
does is check the last element of the `in_call` list and generate a warning
if a string literal is found somewhere that is not inside a translation
function call.
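A stripped-down version of that check might look like the following. This
is a sketch under assumptions: `UntranslatedFinder` and `find_untranslated`
are invented names, the contents of `TRANSLATION_FUNCTIONS` are a guess at
a typical `gettext` setup, and the method is `visit_Constant` because the
example targets Python 3.8+.

```python
import ast

# Assumed set of translation function names; adjust to your own project.
TRANSLATION_FUNCTIONS = {"_", "gettext", "ngettext"}


class UntranslatedFinder(ast.NodeVisitor):
    """Collect string literals that are not direct translation arguments."""

    def __init__(self, translation_functions=TRANSLATION_FUNCTIONS):
        self.translation_functions = set(translation_functions)
        self.in_call = []
        self.problems = []

    def visit_Call(self, node):
        # Simplification: the callee expression itself is not visited, and
        # only plain function names are recognised.
        name = node.func.id if isinstance(node.func, ast.Name) else None
        self.in_call.append(name)
        for argument in node.args:
            self.visit(argument)
        for keyword in node.keywords:
            self.visit(keyword.value)
        self.in_call.pop()

    def visit_Constant(self, node):
        if not isinstance(node.value, str):
            return
        innermost = self.in_call[-1] if self.in_call else None
        if innermost not in self.translation_functions:
            self.problems.append((node.lineno, node.value))


def find_untranslated(source):
    finder = UntranslatedFinder()
    finder.visit(ast.parse(source))
    return finder.problems


print(find_untranslated('print(_("Hello"))\nprint("Bye")\ntitle = "Plain"\n'))
```

Only the innermost call matters here: `"Hello"` passes because it sits
directly inside `_()`, while `"Bye"` and `"Plain"` are reported.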
For accurate reports, there is a `get_func_name()` function that takes an
`ast` node as an argument. As a function call can be anything from actual
functions to object methods, this goes all the way down the node’s
properties and recursively reconstructs the name of the actual function.
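The recursion could look something like the sketch below. It borrows the
name `get_func_name()` from the text, but the body is an independent guess
at the idea, not the gist's implementation; `call_name()` is a helper
invented for the demonstration.

```python
import ast


def get_func_name(node):
    """Rebuild a dotted name such as `obj.method` from a call's func node."""
    if isinstance(node, ast.Name):
        return node.id
    if isinstance(node, ast.Attribute):
        # `a.b.c` is Attribute(Attribute(Name(a), b), c): recurse leftwards.
        return get_func_name(node.value) + "." + node.attr
    if isinstance(node, ast.Call):
        return get_func_name(node.func) + "()"
    # Fallback for node types this sketch does not know about.
    return "<" + type(node).__name__ + ">"


def call_name(expression):
    """Parse a single call expression and return the name being called."""
    call = ast.parse(expression, mode="eval").body
    return get_func_name(call.func)


print(call_name("gettext('x')"))
print(call_name("self.i18n.translate('x')"))
print(call_name("factory().build('x')"))
print(call_name("handlers[0]('x')"))
```

The final branch plays the role of a catch-all: anything it reports is a
node type that still needs its own case.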
Finally, there are some test functions in this code. `tst` and
`actual_tests` are there so that if I run a self-check on this script, it
will find these strings and report all the untranslated strings and all the
potential problems like the string concatenation.

#### Drawbacks

There are several drawbacks here. First, translation function names are
hard-coded in the `TRANSLATION_FUNCTIONS` property of the `ShowString`
class. You must change this if you use other translation functions like
`dngettext`, or if you use a translation library other than `gettext`.

Second, it cannot ignore untranslated strings right now. It would be great
if a pragma like `flake8`’s `# noqa` or `coverage.py`’s `# pragma: no cover`
could be added. However, `ast` doesn’t keep comments, so this proves to be
challenging.
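One possible way around this, which I have not wired into the script, is to
collect the comments separately with the standard `tokenize` module and
match them to `ast` nodes by line number. The pragma text `# notranslate`
and the function name `lines_with_pragma()` are made up for this sketch.

```python
import io
import tokenize


def lines_with_pragma(source, pragma="# notranslate"):
    """Return the set of line numbers whose comment contains `pragma`."""
    ignored = set()
    for token in tokenize.generate_tokens(io.StringIO(source).readline):
        # Only real comments count; the same text inside a string literal
        # arrives as a STRING token and is skipped.
        if token.type == tokenize.COMMENT and pragma in token.string:
            ignored.add(token.start[0])
    return ignored


SAMPLE = (
    'a = "keep me"  # notranslate\n'
    'b = "report me"  # just a comment\n'
    'c = "# notranslate"\n'
)
print(lines_with_pragma(SAMPLE))
```

A visitor could then skip any string whose `lineno` is in the returned set.
For multi-line strings `lineno` is the line where the string starts, so the
pragma would have to be matched against the end line instead.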
Third, it reports docstrings as untranslated. Clearly, this is wrong, as
docstrings generally don’t have to be translated. Ignoring them, again, is
a nice challenge I couldn’t yet overcome.
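A possible approach, sketched here and not part of the script, relies on the
fact that a docstring is the first statement of a module, class, or function
body and consists of a bare string expression. `docstring_nodes()` is a
hypothetical helper, and the example assumes Python 3.8+ (`ast.Constant`).

```python
import ast

DOCSTRING_OWNERS = (
    ast.Module, ast.ClassDef, ast.FunctionDef, ast.AsyncFunctionDef,
)


def docstring_nodes(tree):
    """Return the set of Constant nodes that act as docstrings in `tree`."""
    found = set()
    for node in ast.walk(tree):
        if not isinstance(node, DOCSTRING_OWNERS) or not node.body:
            continue
        first = node.body[0]
        if (isinstance(first, ast.Expr)
                and isinstance(first.value, ast.Constant)
                and isinstance(first.value.value, str)):
            found.add(first.value)
    return found


SOURCE = '''"""Module docstring."""

def greet():
    """Function docstring."""
    return "Hello"
'''

tree = ast.parse(SOURCE)
docstrings = {node.value for node in docstring_nodes(tree)}
print(docstrings)
```

The string visitor could compute this set once per file and return early
for any node it contains, so `"Hello"` would still be reported.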
Fourth, the `get_func_name()` helper is anything but done. As long as I
cannot remove that final `else` clause, there may be error reports. If that
happens, the reported class should be handled in a new `elif` branch.

Finally (and the most easily fixed), the warnings are simply printed on the
console. This is nice, but it should be optional; the problems identified
should be stored so the caller can obtain them as a list.

#### Bottom line

Finding strings in Python sources is not as hard as I imagined. It was fun
to learn to use the `ast` module, and it does a great job. Once I can
overcome the drawbacks above, this script will be a fantastic piece of code
that can assist me in my future tasks.