Protecting a Python codebase - Part 3

This is the third part of the Protecting a Python codebase series. This time we will be playing with the Python interpreter itself in order to protect the original code of a Python-based project.

The Python Interpreter

Going back to basics, the Python interpreter is a machine that:

Consumes Python code (.py files, from the command line, from a REPL shell)
Compiles the code to bytecode
Executes the bytecode
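
The three steps can be reproduced by hand with built-ins available in the standard interpreter. This is only a minimal sketch; the source string and the "<example>" file name are made up for illustration.

```python
# Step 1: some Python source to consume (here, a plain string).
source = "answer = 6 * 7"

# Step 2: compile the source into a code object that holds the bytecode.
code = compile(source, "<example>", "exec")
raw_bytecode = code.co_code  # the raw bytes the interpreter loop walks over

# Step 3: execute the bytecode in a fresh namespace.
namespace = {}
exec(code, namespace)

print("%d bytes of bytecode" % len(raw_bytecode))
print(namespace["answer"])
```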

Of course, that’s a very rough description, but it’s enough to get started on modifying the interpreter to build your own version: one that runs your code, while your code won’t run on other interpreters (not even the standard one).

What’s bytecode and where is it defined?

Python bytecode is just an integer representation of an operation. The values are defined in the opcode.h header file, where each constant is given a human-friendly name and an integer value. We saw those names in the first article.

The interpreter reads one bytecode instruction at a time, then invokes the piece of code crafted to handle it (take LOAD_CONST, for instance). Each of these handlers updates the stack according to the parameters expected by the operation in question.
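
The same name-to-integer table is exposed to Python code through the dis module, which makes it easy to inspect. A small sketch, assuming a stock CPython; the exact numbers differ between Python versions, so no output is shown.

```python
import dis

# dis.opmap maps the human-friendly names to their integer values;
# dis.opname is the reverse lookup (a list indexed by opcode value).
load_const = dis.opmap["LOAD_CONST"]
print("LOAD_CONST = %d" % load_const)
print("name of %d = %s" % (load_const, dis.opname[load_const]))

# Boundary between opcodes without and with an argument.
print("HAVE_ARGUMENT = %d" % dis.HAVE_ARGUMENT)
```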

Redefining Python

Since these values are just constants, and thanks to Python’s well-practiced DRY principle, nothing stops us from replacing the values in the table with our own. It’s then safe to assume that the generated bytecode won’t match the standard definition at all.

There are a couple of small rules in the definition of the table; they are stated only as comments in opcode.h.

#define SLICE		30
/* Also uses 31-33 */

#define STORE_SLICE	40
/* Also uses 41-43 */

#define DELETE_SLICE	50
/* Also uses 51-53 */

/* ... */

#define HAVE_ARGUMENT	90 /* Opcodes from here have an argument: */

/* ... */

/* The next 3 opcodes must be contiguous and satisfy
   (CALL_FUNCTION_VAR - CALL_FUNCTION) & 3 == 1  */
#define CALL_FUNCTION_VAR          140	/* #args + (#kwargs<<8) */
#define CALL_FUNCTION_KW           141	/* #args + (#kwargs<<8) */
#define CALL_FUNCTION_VAR_KW       142	/* #args + (#kwargs<<8) */

As long as we respect these rules, everything will work just fine.

In the companion project there’s a Python script that generates a new opcode.h; its only dependency is a copy of the Python source code. After running the script, opcode.h contains a new table with the values rearranged, and the interpreter is ready to be compiled and used.
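
The companion script is not reproduced here. The sketch below only illustrates the idea on a small subset of the Python 2.7 table, and the names STANDARD, PINNED and shuffle_table are hypothetical. To respect the rules above in the simplest possible way, it assumes the opcodes mentioned in the comments can stay where they are, and it permutes the remaining values separately below and above HAVE_ARGUMENT.

```python
import random

HAVE_ARGUMENT = 90

# Illustrative subset of the Python 2.7 table, not the full opcode.h.
STANDARD = {
    "POP_TOP": 1, "ROT_TWO": 2, "DUP_TOP": 4, "BINARY_ADD": 23,
    "SLICE": 30, "STORE_SLICE": 40, "DELETE_SLICE": 50,
    "RETURN_VALUE": 83,
    "STORE_NAME": 90, "LOAD_CONST": 100, "LOAD_NAME": 101,
    "LOAD_FAST": 124, "CALL_FUNCTION": 131,
    "CALL_FUNCTION_VAR": 140, "CALL_FUNCTION_KW": 141,
    "CALL_FUNCTION_VAR_KW": 142,
}

# Opcodes covered by the comment rules; this sketch leaves them untouched.
PINNED = frozenset([
    "SLICE", "STORE_SLICE", "DELETE_SLICE", "CALL_FUNCTION",
    "CALL_FUNCTION_VAR", "CALL_FUNCTION_KW", "CALL_FUNCTION_VAR_KW",
])


def shuffle_table(table, seed=None):
    """Return a new name -> value table with the unpinned values permuted."""
    rng = random.Random(seed)
    shuffled = dict((name, table[name]) for name in table if name in PINNED)
    # Permute each side of HAVE_ARGUMENT on its own, so that no opcode
    # crosses the boundary.
    for takes_argument in (False, True):
        names = sorted(
            name for name in table
            if name not in PINNED
            and (table[name] >= HAVE_ARGUMENT) == takes_argument
        )
        values = [table[name] for name in names]
        rng.shuffle(values)
        shuffled.update(zip(names, values))
    return shuffled


custom = shuffle_table(STANDARD, seed=2015)
for name in sorted(custom, key=custom.get):
    print("#define %s\t%d" % (name, custom[name]))
```

Pinning the constrained opcodes keeps the sketch short, but it also leaves those entries at their standard values; a more thorough generator would move the slice and CALL_FUNCTION groups as blocks instead.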

And now what?

With this custom interpreter we can go back to the first solution proposed in this series of articles: distributing bytecode.

Now we can run the compileall module over the project with the custom interpreter, grab the .pyc files and install them in the desired location to run them securely. Don’t forget to also install the custom interpreter, since the generated bytecode will only run on it.
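
The compilation step can also be driven from Python. A minimal sketch, assuming a hypothetical one-module project created in a temporary directory; run under the custom interpreter, the same calls would produce .pyc files that use the rearranged table.

```python
import compileall
import os
import tempfile

# Hypothetical one-module project, created only for this example.
project_dir = tempfile.mkdtemp()
with open(os.path.join(project_dir, "hello.py"), "w") as handle:
    handle.write(
        "def hello(name='World'):\n"
        "    return 'Hello {0}'.format(name)\n"
    )

# Byte-compile every module found under the project directory.
succeeded = compileall.compile_dir(project_dir, quiet=1)

# Collect the .pyc files wherever the running interpreter put them
# (next to the source on 2.7, inside __pycache__ on Python 3).
compiled = []
for root, _dirs, files in os.walk(project_dir):
    compiled.extend(
        os.path.join(root, name) for name in files if name.endswith(".pyc")
    )

print(compiled)
```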

How secure is this?

This time the dis module will be useless since it won’t be able to interpret the opcodes in the bytecode. If we try to disassemble the simple hello world example we’ve been using:

>>> import dis
>>>
>>> def hello(name='World'):
...    print('Hello {0}'.format(name))
...
>>> hello('Python')
Hello Python
>>> dis.disassemble(hello.func_code)
  2           0 LOAD_CLOSURE             1
Traceback (most recent call last):
  File "<stdin>", line 1, in <module>
  File "python-obfuscation/custom_interpreter/opcodes/Python-2.7.9-customized/Lib/dis.py", line 107, in disassemble
    print '(' + free[oparg] + ')',
IndexError: tuple index out of range

We get an error. On other functions it might work, but the output won’t make any sense.

Also, the uncompyle2 application won’t be of any help, since it works on the same principles.

This is probably good enough for several scenarios, but a statistical analysis attack can be used to figure out the opcode table. The idea is that a well-known piece of code can be compiled with both the standard Python interpreter and the custom one; the generated bytecode can then be compared to start mapping the custom opcodes to the standard ones.
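
The comparison can be sketched in a few lines. There is no custom interpreter at hand in this example, so its output is simulated with a made-up permutation (secret); a real attacker would read those values from a .pyc produced by the custom build. The sketch uses dis.get_instructions, which assumes Python 3.4 or later.

```python
import dis
import random

KNOWN_SOURCE = "x = 1\ny = x + 2\nprint(y)\n"

# Opcode sequence produced by the standard interpreter for the known code.
code = compile(KNOWN_SOURCE, "<known>", "exec")
standard_ops = [ins.opcode for ins in dis.get_instructions(code)]

# Stand-in for the custom interpreter: a secret renumbering of the opcodes.
observed = sorted(set(standard_ops))
renumbered = observed[:]
random.Random(7).shuffle(renumbered)
secret = dict(zip(observed, renumbered))
custom_ops = [secret[op] for op in standard_ops]

# The attack: both sequences come from the same source, so instructions
# line up position by position and each pair reveals one table entry.
recovered = {}
for custom_op, standard_op in zip(custom_ops, standard_ops):
    recovered[custom_op] = standard_op

for custom_op in sorted(recovered):
    print("%3d -> %s" % (custom_op, dis.opname[recovered[custom_op]]))
```

Only the opcodes that appear in the known code are recovered, so the more varied the known code, the more of the table it gives away.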

More involved changes can be introduced to avoid such attacks.

Per module opcode table.
Encrypted marshal output.
Encryption based import hook.
Disable compile() functionality, remove compileall module.
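
As an illustration of the second item, the round trip below scrambles the marshal output before it would be written to disk and restores it before execution. It is only a sketch under loud assumptions: KEY and xor_bytes are hypothetical, the XOR transform is a toy and not real encryption, and in a custom interpreter the restoring step would live in the C code rather than in Python.

```python
import marshal

KEY = b"not-a-real-key"  # hypothetical key, for illustration only


def xor_bytes(data, key):
    """Toy reversible transform; a real setup would use a proper cipher."""
    data, key = bytearray(data), bytearray(key)
    return bytes(bytearray(b ^ key[i % len(key)] for i, b in enumerate(data)))


code = compile("result = sum(range(5))", "<protected>", "exec")

# What would be written to disk instead of the plain marshal output.
blob = xor_bytes(marshal.dumps(code), KEY)

# What the loader would do before handing the code to the interpreter.
restored = marshal.loads(xor_bytes(blob, KEY))
namespace = {}
exec(restored, namespace)
print(namespace["result"])
```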

Final thoughts

Introducing changes to the Python interpreter is a really interesting protection mechanism. In a way, we are writing our own Python version to run our own application; we can even extend the language syntax. We are forced to distribute bytecode modules, but that’s not a big deal.

Of all the options discussed, Cython seems to be the way to go; just be sure to have a test suite with full coverage to confirm that everything works as it should. Encryption plus import hooks is another option worth checking.

In the end, it all depends on the project to which we want to apply any of these methods.