Deep into the Python interpreter to understand the bytecode

Last Update:2017-01-19 Source: Internet

Author: User

I have recently been doing work related to Python bytecode and would like to share some of that experience. To be more precise, I have been working with the bytecode of the CPython interpreter, versions 2.6 to 2.7.

Python is a dynamic language. When it is run from the command line tool, it essentially performs the following steps:

The code is compiled the first time it is encountered (for example, when it is loaded as a module or executed directly). Depending on your system, this step generates a binary file with a .pyc or .pyo extension.
The interpreter reads the binary file and executes the instructions (opcodes) sequentially.

The Python interpreter is stack-based. To understand the data flow, we need to know the stack effect of each instruction (i.e., of its opcode and argument).
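As a rough illustration of what "stack effect" means, the sketch below asks the interpreter itself. It assumes a Python 3 interpreter, since `dis.stack_effect` was only added in Python 3.4 and is not available in the 2.6/2.7 interpreters discussed here. The helper name `effect` is my own.

```python
import dis

def effect(opname, oparg=None):
    """Net change in stack depth caused by a single instruction."""
    return dis.stack_effect(dis.opmap[opname], oparg)

# POP_TOP removes one value, LOAD_CONST pushes one, and BUILD_TUPLE n
# pops n values and pushes a single tuple back.
print(effect('POP_TOP'))         # -1
print(effect('LOAD_CONST', 0))   # 1
print(effect('BUILD_TUPLE', 3))  # -2
```

A negative number means the instruction leaves the stack shallower than it found it.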

Explore Python binaries

The simplest way to get the bytecode out of a binary file is to unmarshal the CodeType structure:

import marshal

fd = open('path/to/my.pyc', 'rb')
magic = fd.read(4)  # magic number, related to the Python version
date = fd.read(4)   # compilation date
code_object = marshal.load(fd)
fd.close()
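The snippet above targets Python 2, where the header is 8 bytes long. As a hedged sketch for readers on Python 3.7 or later, where to my knowledge the header grew to 16 bytes (magic, flags, then either mtime and size or a source hash), the same idea could look like this. The file names and the helper `load_pyc` are made up for the example.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile
import types

def load_pyc(path):
    """Return (magic, code_object) from a .pyc written by Python 3.7+."""
    with open(path, 'rb') as fd:
        magic = fd.read(4)   # magic number, tied to the interpreter version
        fd.read(12)          # flags, then mtime + size (or a source hash)
        return magic, marshal.load(fd)

# Build a throwaway module so the example is self-contained.
with tempfile.TemporaryDirectory() as tmp:
    src = os.path.join(tmp, 'my.py')
    with open(src, 'w') as f:
        f.write('a = 1\n')
    pyc = py_compile.compile(src, cfile=os.path.join(tmp, 'my.pyc'))
    magic, code_object = load_pyc(pyc)

print(magic == importlib.util.MAGIC_NUMBER)
print(isinstance(code_object, types.CodeType), code_object.co_name)
```

Because marshal data is only readable by the interpreter version that wrote it, the file is compiled and loaded by the same interpreter here.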

The code_object now holds a CodeType object that represents the entire module of the loaded file. To see all of the nested code objects for this module's class definitions, methods, and so on, we need to recursively inspect the constant pool of the CodeType, as in the following code:

import types

def inspect_code_object(co_obj, indent=''):
    print indent, "%s(lineno:%d)" % (co_obj.co_name, co_obj.co_firstlineno)
    for c in co_obj.co_consts:
        if isinstance(c, types.CodeType):
            inspect_code_object(c, indent + '  ')

inspect_code_object(code_object)  # start with the first object

In this case, we print a tree of code objects, where each code object is a child of its parent. For the following code:

class A:
    def __init__(self):
        pass
    def __repr__(self):
        return 'A()'
a = A()
print a

The resulting tree is:

<module>(lineno:2)
  A(lineno:2)
    __init__(lineno:3)
    __repr__(lineno:5)

For testing, we can compile a string containing Python source code with the compile built-in, which gives us a code object:

co_obj = compile(python_source_code, '<string>', 'exec')

For more information on code objects, see the section on the co_* fields in the Python documentation.
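Putting the last two ideas together, here is a minimal sketch that compiles the example class from a string and walks the nested code objects. It is written for Python 3 (hence `print(a)` in the source string) and returns the tree as data instead of printing directly. The name `code_tree` is mine.

```python
import types

SOURCE = """
class A:
    def __init__(self):
        pass
    def __repr__(self):
        return 'A()'
a = A()
print(a)
"""

def code_tree(co_obj, depth=0):
    """Yield (depth, name, first line) for a code object and its children."""
    yield depth, co_obj.co_name, co_obj.co_firstlineno
    for const in co_obj.co_consts:
        # Nested classes and functions live in the parent's constant pool.
        if isinstance(const, types.CodeType):
            yield from code_tree(const, depth + 1)

tree = list(code_tree(compile(SOURCE, '<string>', 'exec')))
for depth, name, lineno in tree:
    print('%s%s(lineno:%d)' % ('  ' * depth, name, lineno))
```

The reported line numbers can differ slightly between interpreter versions, so only the shape of the tree should be relied on.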

A first look at the bytecode

Once we have the code object, we can start disassembling it (its co_code field). Parsing the bytecode means:

Interpreting what each opcode means
Extracting any arguments

The disassemble function of the dis module shows how this is done. For our previous example, it outputs:

  2           0 LOAD_CONST               0 ('A')
              3 LOAD_CONST               3 (())
              6 LOAD_CONST               1 (<code object A at 0x42424242, file "<string>", line 2>)
              9 MAKE_FUNCTION            0
             12 CALL_FUNCTION            0
             15 BUILD_CLASS
             16 STORE_NAME               0 (A)

  8          19 LOAD_NAME                0 (A)
             22 CALL_FUNCTION            0
             25 STORE_NAME               1 (a)

  9          28 LOAD_NAME                1 (a)
             31 PRINT_ITEM
             32 PRINT_NEWLINE
             33 LOAD_CONST               2 (None)
             36 RETURN_VALUE

Each line gives us:

The line number (when it changes)
The index of the instruction
The opcode of the current instruction
The operation argument (oparg), which the opcode uses to compute the actual argument. For example, for the LOAD_NAME opcode, the oparg is an index into the co_names tuple.
The actual argument after computation (in parentheses)

For the instruction at index 6, the oparg of the LOAD_CONST opcode points to the object to load from the co_consts tuple. Here, it points to the code object for the definition of class A. In the same way, we can go on and disassemble all of the code objects to get the complete bytecode of the module.

The first part of the bytecode (indices 0 to 16) deals with the definition of class A; the rest is the code that instantiates A and prints the instance.
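The same fields can be read programmatically. The sketch below assumes Python 3.4 or later, where `dis.get_instructions` exists. Opcode names and offsets differ from the 2.7 listing above, but the relation between oparg and the co_names tuple is the same for STORE_NAME.

```python
import dis

co = compile("a = A()", '<string>', 'exec')

for ins in dis.get_instructions(co):
    # index in the bytecode, opcode name, raw oparg, resolved argument
    print(ins.offset, ins.opname, ins.arg, repr(ins.argval))

# For a name-based opcode such as STORE_NAME, oparg is an index into co_names.
store = next(i for i in dis.get_instructions(co) if i.opname == 'STORE_NAME')
print(co.co_names[store.arg] == store.argval)
```

Only compilation happens here; the code is never executed, so the undefined name A is not a problem.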

Interesting byte code constructs

All opcodes are fairly straightforward, but some constructs may look strange for the following reasons:

Compiler optimizations
Interpreter optimizations (which lead to additional opcodes)

Assigning to a sequence of variables

First, let's look at what happens when we assign to several variables at once:

(1) a, b = 1, '2'
(2) a, b = 1, e
(3) a, b, c = 1, 2, e
(4) a, b, c, d = 1, 2, 3, e

These 4 statements produce considerably different bytecode.

The first case is the simplest because the right-hand side (RHS) of the assignment contains only constants. In this case, CPython creates the tuple (1, '2'), uses the UNPACK_SEQUENCE opcode to push the two elements onto the stack, and performs a STORE_FAST for each of the variables a and b:

 0 LOAD_CONST               5 ((1, '2'))
 3 UNPACK_SEQUENCE          2
 6 STORE_FAST               0 (a)
 9 STORE_FAST               1 (b)

In the second case, a variable is introduced on the right-hand side, so in general an instruction that fetches its value is emitted (here, simply LOAD_GLOBAL). However, the compiler does not need to build a new tuple from the values on the stack, nor to call UNPACK_SEQUENCE: at index 18, a ROT_TWO is enough to swap the two elements on top of the stack (although swapping instructions 19 and 22, the two stores, would achieve the same goal).

12 LOAD_CONST               1 (1)
15 LOAD_GLOBAL              0 (e)
18 ROT_TWO
19 STORE_FAST               0 (a)
22 STORE_FAST               1 (b)

The third case is stranger. The expressions are pushed onto the stack as in the previous case, but the top 3 elements of the stack are rotated first, and then the top 2 elements are swapped:

25 LOAD_CONST               1 (1)
28 LOAD_CONST               3 (2)
31 LOAD_GLOBAL              0 (e)
34 ROT_THREE
35 ROT_TWO
36 STORE_FAST               0 (a)
39 STORE_FAST               1 (b)
42 STORE_FAST               2 (c)
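To see why a rotation followed by a swap yields the right assignment order, it helps to replay the third case by hand. The sketch below models the value stack as a Python list whose last element is the top. The function names are mine, and this is only a model of the two opcodes, not interpreter code.

```python
def rot_two(stack):
    """Swap the two topmost items."""
    stack[-1], stack[-2] = stack[-2], stack[-1]

def rot_three(stack):
    """Move the top item down to third position, lifting the other two."""
    stack[-1], stack[-2], stack[-3] = stack[-2], stack[-3], stack[-1]

def run_case_3(e):
    stack = [1, 2, e]   # LOAD_CONST 1, LOAD_CONST 2, LOAD_GLOBAL e
    rot_three(stack)    # stack is now [e, 1, 2]
    rot_two(stack)      # stack is now [e, 2, 1]
    a = stack.pop()     # STORE_FAST a
    b = stack.pop()     # STORE_FAST b
    c = stack.pop()     # STORE_FAST c
    return a, b, c

print(run_case_3('E'))  # (1, 2, 'E')
```

The two rotations simply reverse the three topmost values, so that the stores, which pop from the top, see them in source order.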

The last case is the generic approach: the ROT_* trick no longer seems to apply, so the compiler builds a tuple and then calls UNPACK_SEQUENCE to put the elements back on the stack:

45 LOAD_CONST               1 (1)
48 LOAD_CONST               3 (2)
51 LOAD_CONST               4 (3)
54 LOAD_GLOBAL              0 (e)
57 BUILD_TUPLE              4
60 UNPACK_SEQUENCE          4
63 STORE_FAST               0 (a)
66 STORE_FAST               1 (b)
69 STORE_FAST               2 (c)
72 STORE_FAST               3 (d)
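The generic case can be replayed the same way. In this sketch, again a list-based model with names of my own choosing, BUILD_TUPLE collapses the top values into one tuple and UNPACK_SEQUENCE pushes the items back from right to left, so the first item ends up on top.

```python
def build_tuple(stack, count):
    """Pop `count` items (count > 0) and push them back as one tuple."""
    items = tuple(stack[-count:])
    del stack[-count:]
    stack.append(items)

def unpack_sequence(stack, count):
    """Pop a sequence and push its items right-to-left (first item on top)."""
    seq = stack.pop()
    if len(seq) != count:
        raise ValueError('expected %d values, got %d' % (count, len(seq)))
    stack.extend(reversed(seq))

def run_case_4(e):
    stack = [1, 2, 3, e]       # three LOAD_CONST, then LOAD_GLOBAL e
    build_tuple(stack, 4)      # stack is now [(1, 2, 3, e)]
    unpack_sequence(stack, 4)  # stack is now [e, 3, 2, 1]
    # Four STORE_FAST instructions pop a, b, c, d in that order.
    return tuple(stack.pop() for _ in range(4))

print(run_case_4('E'))  # (1, 2, 3, 'E')
```

The net effect is again a reversal of the topmost values, obtained through a temporary tuple instead of rotations.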

Function call constructs

The last set of interesting examples concerns function call constructs and the 4 opcodes that create calls. I guess the number of these opcodes is meant to optimize the interpreter code, since it is not like Java, which has invokedynamic, invokeinterface, invokespecial, invokestatic, and invokevirtual.

In Java, invokeinterface, invokespecial, and invokevirtual all stem from the static typing of the language (invokespecial is only used to call constructors and superclass methods, AFAIK). invokestatic is self-describing (no need to put the receiver on the stack), and there is no similar concept in Python (at the interpreter level, as opposed to decorators). In short, Python calls could all be translated to invokedynamic.

In Python, the different CALL_* opcodes do not exist because of the type system, static methods, or a need for special access to constructors. They all relate to how a function call is specified in Python. From the grammar:

The call construct allows code such as:

func(arg1, arg2, keyword=some_value, *unpack_list, **unpack_dict)

Keyword arguments allow passing an argument by the name of the formal parameter, not just by position. The * symbol takes all the elements of an iterable and passes them as individual positional arguments (not as a single tuple), while the ** symbol takes a dictionary of keywords and values.

This example uses almost all of the features of the call construct:
Passing a variable argument list (_VAR): CALL_FUNCTION_VAR, CALL_FUNCTION_VAR_KW
Passing a dictionary of keywords (_KW): CALL_FUNCTION_KW, CALL_FUNCTION_VAR_KW

The bytecode looks like this:

 0 LOAD_NAME                0 (func)
 3 LOAD_NAME                1 (arg1)
 6 LOAD_NAME                2 (arg2)
 9 LOAD_CONST               0 ('keyword')
12 LOAD_NAME                3 (some_value)
15 LOAD_NAME                4 (unpack_list)
18 LOAD_NAME                5 (unpack_dict)
21 CALL_FUNCTION_VAR_KW   258

Typically, a CALL_FUNCTION interprets its oparg as the number of arguments. However, more information is encoded in it. The first byte (0xFF mask) stores the number of positional arguments, and the second byte ((value >> 8) & 0xFF) stores the number of keyword arguments passed. To compute the number of elements to pop from the top of the stack, we need to do this:

na = arg & 0xff         # num args
nk = (arg >> 8) & 0xff  # num keywords
n_to_pop = na + 2 * nk + CALL_EXTRA_ARG_OFFSET[op]

CALL_EXTRA_ARG_OFFSET contains an offset determined by the call opcode (2 for CALL_FUNCTION_VAR_KW). Here, we need to pop 6 elements before reaching the function name.

For the other CALL_* opcodes, it depends entirely on whether the code passes arguments using a list or a dictionary. It is just a simple combination.
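As a worked check of the arithmetic, the sketch below decodes the oparg 258 from the listing. Only the offset of 2 for CALL_FUNCTION_VAR_KW comes from the text. The other three table entries are my assumption, following the "simple combination" rule (one extra slot per * or ** argument), and the table is keyed by opcode name rather than number for readability.

```python
# Extra stack slots used by each call opcode. Only the value for
# CALL_FUNCTION_VAR_KW is stated in the article; the rest are assumed.
CALL_EXTRA_ARG_OFFSET = {
    'CALL_FUNCTION': 0,
    'CALL_FUNCTION_VAR': 1,
    'CALL_FUNCTION_KW': 1,
    'CALL_FUNCTION_VAR_KW': 2,
}

def n_to_pop(opname, arg):
    """Number of stack items sitting above the callable for a 2.7 call opcode."""
    na = arg & 0xff          # positional arguments
    nk = (arg >> 8) & 0xff   # keyword arguments (each is a name plus a value)
    return na + 2 * nk + CALL_EXTRA_ARG_OFFSET[opname]

# 258 == 0x0102: 2 positional arguments and 1 keyword argument.
print(n_to_pop('CALL_FUNCTION_VAR_KW', 258))  # 6
```

The 6 items are arg1, arg2, 'keyword', some_value, unpack_list, and unpack_dict, which matches the six loads that follow func in the listing.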

Constructing a minimal CFG

To understand how the code works, it is interesting to build a control-flow graph (CFG). It lets us see which sequences of opcodes (basic blocks) are executed unconditionally, and under which conditions control moves from one block to another.

Even though the bytecode is a really small language, building a reliable CFG requires a lot of detailed work, far beyond the scope of this blog post. So if you need a real CFG implementation, you can look at equip.

Here, we only focus on code with no loops and no exceptions, so the control flow depends only on if statements.

Only a few opcodes can perform jumps (in the case without loops and exceptions):

JUMP_FORWARD: jumps to a relative position in the bytecode. The argument is the number of bytes to skip.
JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP, JUMP_ABSOLUTE, POP_JUMP_IF_FALSE, and POP_JUMP_IF_TRUE: the argument is an absolute address in the bytecode.
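The list above is specific to CPython 2.7. To check which jump opcodes a given interpreter actually defines, one can query the `dis` module, which exposes the `hasjrel` and `hasjabs` lists. This is a small sketch meant for Python 3; on recent releases the absolute list may well be empty, because newer bytecode favors relative jumps.

```python
import dis

# Opcode numbers are mapped back to names through dis.opname.
relative = sorted(dis.opname[op] for op in dis.hasjrel)
absolute = sorted(dis.opname[op] for op in dis.hasjabs)

print('relative jumps:', relative)
print('absolute jumps:', absolute)
```

Any code that classifies jumps should consult these lists rather than hard-code opcode numbers, unless it targets one specific interpreter version.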

Building the CFG of a function means creating the basic blocks, that is, sequences of opcodes that execute without any conditional branching (unless an exception occurs), and connecting them with conditions and branches to form a graph. In our case, we only have true, false, and unconditional branches.

Let's consider the following code example (never write this in practice):

def factorial(n):
    if n <= 1:
        return 1
    elif n == 2:
        return 2
    return n * factorial(n - 1)

As mentioned earlier, we get the code object of the factorial function:

module_co = compile(python_source, '', 'exec')
meth_co = module_co.co_consts[0]

The disassembly looks like this (the <<< annotations are my comments):

  3           0 LOAD_FAST                0 (n)
              3 LOAD_CONST               1 (1)
              6 COMPARE_OP               1 (<=)
              9 POP_JUMP_IF_FALSE       16          <<< control flow

  4          12 LOAD_CONST               1 (1)
             15 RETURN_VALUE                        <<< control flow

  5     >>   16 LOAD_FAST                0 (n)
             19 LOAD_CONST               2 (2)
             22 COMPARE_OP               2 (==)
             25 POP_JUMP_IF_FALSE       32          <<< control flow

  6          28 LOAD_CONST               2 (2)
             31 RETURN_VALUE                        <<< control flow

  7     >>   32 LOAD_FAST                0 (n)
             35 LOAD_GLOBAL              0 (factorial)
             38 LOAD_FAST                0 (n)
             41 LOAD_CONST               1 (1)
             44 BINARY_SUBTRACT
             45 CALL_FUNCTION            1
             48 BINARY_MULTIPLY
             49 RETURN_VALUE                        <<< control flow

In this bytecode, we have 5 instructions that change the structure of the CFG (adding conditions, or allowing an early exit):

POP_JUMP_IF_FALSE: jumps to absolute addresses 16 and 32;
RETURN_VALUE: pops an element from the top of the stack and returns it.

Extracting the basic blocks is easy, since we only care about the instructions that change the control flow. In our case, we do not encounter any unconditional jump instruction such as JUMP_FORWARD or JUMP_ABSOLUTE.

Sample code to extract this kind of structure:

import opcode

RETURN_VALUE = 83
JUMP_FORWARD, JUMP_ABSOLUTE = 110, 113
FALSE_BRANCH_JUMPS = (111, 114)  # JUMP_IF_FALSE_OR_POP, POP_JUMP_IF_FALSE

def find_blocks(meth_co):
    blocks = {}
    code = meth_co.co_code
    finger_start_block = 0
    i, length = 0, len(code)
    while i < length:
        op = ord(code[i])
        i += 1
        if op == RETURN_VALUE:
            # We force finishing the block after the return,
            # dead code might still exist after though...
            blocks[finger_start_block] = {
                'length': i - finger_start_block - 1,
                'exit': True
            }
            finger_start_block = i
        elif op >= opcode.HAVE_ARGUMENT:
            oparg = ord(code[i]) + (ord(code[i + 1]) << 8)
            i += 2
            if op in opcode.hasjabs:
                # Absolute jump to oparg
                blocks[finger_start_block] = {
                    'length': i - finger_start_block
                }
                if op == JUMP_ABSOLUTE:
                    # Only unconditional absolute jump
                    blocks[finger_start_block]['conditions'] = {
                        'uncond': oparg
                    }
                else:
                    false_index, true_index = (oparg, i) if op in FALSE_BRANCH_JUMPS else (i, oparg)
                    blocks[finger_start_block]['conditions'] = {
                        'true': true_index,
                        'false': false_index
                    }
                finger_start_block = i
            elif op in opcode.hasjrel:
                # Essentially do the same...
                pass
    return blocks
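The extraction code above only runs on Python 2.7. As a way to experiment on any interpreter, here is a hedged Python 3 port of the same idea, applied to hand-assembled 2.7 bytecode for factorial. The opcode numbers are hard-coded from the Python 2.7 opcode table as I recall it, the `assemble` helper and `FACTORIAL_CODE` are my own scaffolding, and relative jumps are left out, as in the original sketch.

```python
# Python 2.7 opcode numbers, hard-coded so this runs on any interpreter.
HAVE_ARGUMENT = 90
BINARY_MULTIPLY, BINARY_SUBTRACT, RETURN_VALUE = 20, 24, 83
LOAD_CONST, COMPARE_OP, LOAD_GLOBAL, LOAD_FAST, CALL_FUNCTION = 100, 107, 116, 124, 131
JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP = 111, 112
JUMP_ABSOLUTE, POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE = 113, 114, 115
ABSOLUTE_JUMPS = {JUMP_IF_FALSE_OR_POP, JUMP_IF_TRUE_OR_POP,
                  JUMP_ABSOLUTE, POP_JUMP_IF_FALSE, POP_JUMP_IF_TRUE}
FALSE_BRANCH_JUMPS = {JUMP_IF_FALSE_OR_POP, POP_JUMP_IF_FALSE}

def assemble(instructions):
    """Encode (opcode, oparg) pairs: 1 byte, plus a 2-byte little-endian oparg."""
    out = bytearray()
    for op, arg in instructions:
        out.append(op)
        if op >= HAVE_ARGUMENT:
            out += bytes([arg & 0xFF, (arg >> 8) & 0xFF])
    return bytes(out)

# The factorial disassembly from the article, one tuple per instruction.
FACTORIAL_CODE = assemble([
    (LOAD_FAST, 0), (LOAD_CONST, 1), (COMPARE_OP, 1), (POP_JUMP_IF_FALSE, 16),
    (LOAD_CONST, 1), (RETURN_VALUE, None),
    (LOAD_FAST, 0), (LOAD_CONST, 2), (COMPARE_OP, 2), (POP_JUMP_IF_FALSE, 32),
    (LOAD_CONST, 2), (RETURN_VALUE, None),
    (LOAD_FAST, 0), (LOAD_GLOBAL, 0), (LOAD_FAST, 0), (LOAD_CONST, 1),
    (BINARY_SUBTRACT, None), (CALL_FUNCTION, 1), (BINARY_MULTIPLY, None),
    (RETURN_VALUE, None),
])

def find_blocks(code):
    """Split 2.7-style bytecode into basic blocks keyed by start index."""
    blocks, start, i = {}, 0, 0
    while i < len(code):
        op = code[i]  # indexing bytes yields an int in Python 3
        i += 1
        if op == RETURN_VALUE:
            blocks[start] = {'length': i - start - 1, 'exit': True}
            start = i
        elif op >= HAVE_ARGUMENT:
            oparg = code[i] | (code[i + 1] << 8)
            i += 2
            if op in ABSOLUTE_JUMPS:
                block = {'length': i - start}
                if op == JUMP_ABSOLUTE:
                    block['conditions'] = {'uncond': oparg}
                elif op in FALSE_BRANCH_JUMPS:
                    block['conditions'] = {'true': i, 'false': oparg}
                else:
                    block['conditions'] = {'true': oparg, 'false': i}
                blocks[start] = block
                start = i
    return blocks

blocks = find_blocks(FACTORIAL_CODE)
for index in sorted(blocks):
    print('Block %d: %r' % (index, blocks[index]))
```

Working on raw bytes keeps the example independent of the running interpreter's own opcode numbering, which has changed considerably since 2.7.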

We get the following basic blocks:

Block 0: {'length': 12, 'conditions': {'false': 16, 'true': 12}}
Block 12: {'length': 3, 'exit': True}
Block 16: {'length': 12, 'conditions': {'false': 32, 'true': 28}}
Block 28: {'length': 3, 'exit': True}
Block 32: {'length': 17, 'exit': True}

And the current structure of a block:

Basic block
  start_block_index :=
    length := size of instructions
    condition := true | false | uncond -> target_index
    exit* := true

We now have the control-flow graph (minus the entry and implicit exit blocks), which we can convert to a visual representation:

def to_dot(blocks):
    cache = {}

    def get_node_id(idx, buf):
        if idx not in cache:
            cache[idx] = 'node_%d' % idx
            buf.append('%s [label="Block Index %d"];' % (cache[idx], idx))
        return cache[idx]

    buffer = ['digraph CFG {']
    buffer.append('entry [label="CFG Entry"];')
    buffer.append('exit [label="CFG Implicit Return"];')

    for block_idx in blocks:
        node_id = get_node_id(block_idx, buffer)
        if block_idx == 0:
            buffer.append('entry -> %s;' % node_id)
        if 'conditions' in blocks[block_idx]:
            for cond_kind in blocks[block_idx]['conditions']:
                target_id = get_node_id(blocks[block_idx]['conditions'][cond_kind], buffer)
                buffer.append('%s -> %s [label="%s"];' % (node_id, target_id, cond_kind))
        if 'exit' in blocks[block_idx]:
            buffer.append('%s -> exit;' % node_id)

    buffer.append('}')
    return '\n'.join(buffer)

Visualization of the control-flow graph:
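For readers without Graphviz at hand, the edges of the graph can also be listed as text. The sketch below assumes the basic blocks that follow from the factorial disassembly shown earlier (the `BLOCKS` literal is my transcription of them) and produces the same edges that the DOT output describes. The function name `cfg_edges` is mine.

```python
# Basic blocks of factorial, keyed by the index of their first instruction.
BLOCKS = {
    0: {'length': 12, 'conditions': {'true': 12, 'false': 16}},
    12: {'length': 3, 'exit': True},
    16: {'length': 12, 'conditions': {'true': 28, 'false': 32}},
    28: {'length': 3, 'exit': True},
    32: {'length': 17, 'exit': True},
}

def cfg_edges(blocks):
    """Return (source, label, target) triples, including entry and exit edges."""
    edges = [('entry', '', 0)]
    for idx in sorted(blocks):
        conditions = blocks[idx].get('conditions', {})
        for kind, target in sorted(conditions.items()):
            edges.append((idx, kind, target))
        if 'exit' in blocks[idx]:
            edges.append((idx, '', 'exit'))
    return edges

for src, label, dst in cfg_edges(BLOCKS):
    print('%s -%s-> %s' % (src, label, dst))
```

Each conditional block fans out into a true and a false edge, and the three blocks ending in RETURN_VALUE all lead to the implicit exit node.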

Why this article?

The need to access Python bytecode is rare, but I have run into it several times. I hope this article helps people who are getting started with reverse engineering Python.

Right now, however, I am working on instrumenting Python code, and especially its bytecode. No such tool exists in Python (and instrumenting the source code usually leaves very inefficient decorator-based instrumentation code), which is why equip came to life.

This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.
