This is the third part of the Protecting a Python codebase series. This time we will be playing with the Python interpreter in order to protect the original code of a Python-based project.
The Python Interpreter
Going to the basics, the Python Interpreter is a machine that:
- Consumes Python code (.py files, from the command line, from a REPL shell)
- Compiles the code to bytecode
- Executes the bytecode
Of course, that’s a very rough description, but it’s enough to get started on modifying the interpreter to build your own version that runs your own code, while your code won’t run on other interpreters (not even the standard one).
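The three steps above can be observed from within Python itself. This is only a minimal sketch, assuming Python 3; the source string and the `<example>` file name are made up for illustration.

```python
# A tiny piece of source code, standing in for a .py file.
source = "greeting = 'Hello, world!'"

# Steps 1 and 2: consume the source text and compile it to a code object.
code = compile(source, "<example>", "exec")

# The raw bytecode is a plain bytes string of opcodes and their arguments.
raw_bytecode = code.co_code
print(len(raw_bytecode), "bytes of bytecode")

# Step 3: execute the compiled bytecode in a fresh namespace.
namespace = {}
exec(code, namespace)
print(namespace["greeting"])
```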
What’s bytecode and where is it defined?
Python bytecode is just an integer representation of an operation. The values are defined in the opcode.h header file, which declares constants with a human-friendly name and an integer value. We saw those names in the first article.
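The same name-to-integer table is exposed at runtime by the standard library's `opcode` module, which makes it easy to inspect without opening the C header. A short sketch follows; the printed value depends on the Python version in use, so no particular number is assumed here.

```python
import opcode

# opmap maps each human-friendly name to its integer value;
# opname is the reverse lookup, indexed by that integer.
load_const = opcode.opmap["LOAD_CONST"]

print("LOAD_CONST =", load_const)
print("name for that value:", opcode.opname[load_const])
```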
The interpreter reads one bytecode at a time, then invokes the piece of code crafted to handle it (take LOAD_CONST, for instance). Each of these handlers updates the stack according to the parameters the operation expects.
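To make the read-and-dispatch idea concrete, here is a toy stack machine. It is not CPython's evaluation loop: the opcode numbers, the `run` function and the program format are all invented for this sketch, and only three operations are handled.

```python
# Hypothetical opcode numbers for a toy machine, not CPython's real values.
LOAD_CONST, BINARY_ADD, RETURN_VALUE = 1, 2, 3


def run(program, consts):
    """Execute a list of (opcode, argument) pairs on a value stack."""
    stack = []
    pc = 0  # index of the next instruction to read
    while True:
        op, arg = program[pc]
        pc += 1
        if op == LOAD_CONST:
            # Push the constant found at index `arg`.
            stack.append(consts[arg])
        elif op == BINARY_ADD:
            # Pop two operands and push their sum.
            right = stack.pop()
            left = stack.pop()
            stack.append(left + right)
        elif op == RETURN_VALUE:
            # Pop the top of the stack and hand it back to the caller.
            return stack.pop()
        else:
            raise ValueError("unknown opcode: %r" % (op,))


program = [
    (LOAD_CONST, 0),
    (LOAD_CONST, 1),
    (BINARY_ADD, None),
    (RETURN_VALUE, None),
]
result = run(program, consts=(2, 3))
print(result)
```

Swapping the integers assigned to the three names changes the encoded program but not the behaviour of the handlers, which is the property the rest of this article relies on.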
Since these values are just constants, and thanks to Python’s well-practiced DRY principle, nothing stops us from replacing the values in the table with our own. It’s then safe to assume that the generated bytecode won’t match the standard definition at all.
There are a couple of small rules governing the table; they are documented only as comments in opcode.h. As long as we respect them, everything will work just fine.
In the companion project there’s a Python script that will generate a new opcode.h; the only dependency is a copy of the Python source code. After running the script, a new table will be written to opcode.h with the values rearranged, and the interpreter will be ready to be compiled and used.
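The companion script is not reproduced here. As a rough sketch of the idea only, the snippet below shuffles a small opcode table while assuming the main rule to respect is the HAVE_ARGUMENT split (opcodes at or above that value take an argument, so values are only swapped within each group). The function names and the sample table are hypothetical, and a real header has more entries and possibly more constraints.

```python
import random


def shuffle_opcodes(opmap, have_argument, seed=None):
    """Return a new name -> value table with values permuted per group."""
    rng = random.Random(seed)
    no_arg = [name for name, value in opmap.items() if value < have_argument]
    with_arg = [name for name, value in opmap.items() if value >= have_argument]

    new_map = {}
    for names in (no_arg, with_arg):
        # Shuffle the values among the names of the same group only.
        values = [opmap[name] for name in names]
        rng.shuffle(values)
        new_map.update(zip(names, values))
    return new_map


def render_defines(opmap):
    """Render the table as C #define lines, ordered by value."""
    ordered = sorted(opmap.items(), key=lambda item: item[1])
    return "\n".join("#define %s %d" % (name, value) for name, value in ordered)


# Sample values for illustration only, not a complete opcode table.
HAVE_ARGUMENT = 90
sample = {
    "POP_TOP": 1,
    "NOP": 9,
    "RETURN_VALUE": 83,
    "LOAD_CONST": 100,
    "LOAD_NAME": 101,
    "CALL_FUNCTION": 131,
}

custom = shuffle_opcodes(sample, HAVE_ARGUMENT, seed=42)
print(render_defines(custom))
```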
And now what?
With this custom interpreter we can go back to the first solution proposed in this series of articles: distributing bytecode.
Now we can run the compileall module through the project, grab the .pyc files, and install them in the desired location to run them securely. Don’t forget to also install the custom interpreter, since the generated bytecode will only run on it.
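A possible way to script that step is sketched below, assuming Python 3, where `legacy=True` makes compileall write each .pyc next to its source file instead of under `__pycache__`. The `compile_project` helper and the temporary project are hypothetical; to get protected bytecode, the script would have to be run with the custom interpreter rather than the standard one.

```python
import compileall
import pathlib
import tempfile


def compile_project(project_dir):
    """Compile every .py file under project_dir and return the .pyc paths."""
    # legacy=True writes module.pyc beside module.py (no __pycache__ folder).
    ok = compileall.compile_dir(str(project_dir), quiet=1, legacy=True)
    if not ok:
        raise RuntimeError("some files failed to compile")
    return sorted(pathlib.Path(project_dir).rglob("*.pyc"))


# Throwaway project with a single module, just to exercise the helper.
with tempfile.TemporaryDirectory() as tmp:
    (pathlib.Path(tmp) / "hello.py").write_text("print('Hello, world!')\n")
    compiled = [path.name for path in compile_project(tmp)]

print(compiled)
```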
How secure is this?
This time the dis module will be useless, since it won’t be able to interpret the opcodes in the bytecode. If we try to disassemble the hello world example we’ve been using, we get an error. On other functions it might work, but the output won’t make any sense.
The uncompyle2 application won’t be of any help either, since it relies on the same principles.
This is probably good enough for several scenarios, but a statistical analysis attack can be used to figure out the opcode table. The idea is that a well-known piece of code can be compiled with both the standard Python interpreter and the custom one; the generated bytecode can then be compared to start mapping the custom opcodes to the standard ones.
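The comparison step can be sketched as follows, assuming Python 3. No custom interpreter is available here, so `custom_sequence` fakes one by pushing the standard opcodes through a hypothetical random permutation (which, for brevity, ignores the ordering rules of opcode.h). The attacker-side logic is only `recover_mapping`, and every name in this block is made up for the example.

```python
import dis
import random


def opcode_sequence(source):
    """Opcodes the running interpreter emits for a piece of source code."""
    code = compile(source, "<known>", "exec")
    return [instruction.opcode for instruction in dis.get_instructions(code)]


# Stand-in for a custom interpreter: a hypothetical shuffled opcode table.
rng = random.Random(0)
shuffled = list(range(256))
rng.shuffle(shuffled)


def custom_sequence(source):
    """Opcodes a custom interpreter would emit for the same source."""
    return [shuffled[op] for op in opcode_sequence(source)]


def recover_mapping(standard_ops, custom_ops):
    """Map custom opcode -> standard opcode from two aligned sequences."""
    mapping = {}
    for standard, custom in zip(standard_ops, custom_ops):
        # Each custom value must always correspond to the same standard one.
        if mapping.setdefault(custom, standard) != standard:
            raise ValueError("sequences are not aligned")
    return mapping


known_source = "x = 1\ny = x + 2\nprint(y)\n"
recovered = recover_mapping(opcode_sequence(known_source),
                            custom_sequence(known_source))

for custom, standard in sorted(recovered.items()):
    print(custom, "->", dis.opname[standard])
```

Only the opcodes that appear in the known source are recovered, so an attacker would need a sample that exercises as many operations as possible.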
More involved changes can be introduced to avoid such attacks.
- Per module opcode table.
- Encrypted marshal output.
- Encryption based import hook.
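As an illustration of the second item, the sketch below marshals a code object, scrambles the bytes, and reverses the process before executing. The XOR routine is a toy placeholder and is not secure; a real setup would use a proper cipher and would likely perform the decryption inside the interpreter or an import hook. The helper names and the key are hypothetical.

```python
import marshal
from itertools import cycle


def xor_bytes(data, key):
    # Toy cipher for illustration only: repeating-key XOR is NOT secure.
    return bytes(byte ^ k for byte, k in zip(data, cycle(key)))


def pack(source, name, key):
    """Compile source, marshal the code object, and scramble the result."""
    code = compile(source, name, "exec")
    return xor_bytes(marshal.dumps(code), key)


def load(blob, key):
    """Unscramble a blob, unmarshal the code object, and execute it."""
    code = marshal.loads(xor_bytes(blob, key))
    namespace = {}
    exec(code, namespace)
    return namespace


key = b"hypothetical-key"
blob = pack("def greet():\n    return 'Hello, world!'\n", "<protected>", key)
module = load(blob, key)
print(module["greet"]())
```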
Introducing changes to the Python interpreter is a really interesting protection mechanism. In a way, we are writing our own Python version to run our own application, and we can even extend the language syntax. We are forced to distribute bytecode modules, but that’s not a big deal.
Of all the options discussed, Cython seems to be the way to go; just be sure to have a test suite with full coverage to confirm that everything works as it should. Encryption plus import hooks is another option worth checking.
In the end, it all depends on the project to which we want to apply any of these methods.