PyC File Parsing

Last Update: 2018-12-04 Source: Internet
Author: User
1. PyCodeObject and PyC files

Python is generally thought of as an interpreted language, but this is not quite accurate. When Python runs a program, the source code in the .py file is first compiled into Python bytecode, and the compiled bytecode is then executed by the Python virtual machine. The basic idea of this mechanism is the same as in Java and .NET. What distinguishes the Python virtual machine from the Java or .NET virtual machine is that it is a higher-level virtual machine. "Higher-level" is not meant in the usual sense here: it does not say that Python's virtual machine is more powerful than Java's or .NET's, but that, compared with them, it is further away from the real machine. In other words, the Python virtual machine has a higher level of abstraction.

Consider the following Python code:

[demo.py]

    class A:
        pass

    def fun():
        pass

    value = 1
    str = "Python"
    a = A()
    fun()

When Python executes demo.py, the first thing it does is compile it. What is the result of the compilation? Bytecode, of course; otherwise Python would have nothing to run. But the result also contains other things that are needed while Python is running. Take a look at demo.py and parse it by eye. The file contains some strings, some constant values, and some operations. The operations are what Python turns into bytecode. So what does the compilation process do with the strings and constant values? In fact, all the static information contained in the source code is collected by Python: the compiled result contains the strings, constant values, bytecode and other useful static information from the source. At run time this information is stored in an object, and after the run ends it may even be stored in a file.
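The static information described above can be inspected from Python itself. The following is a minimal sketch, assuming a current CPython 3 interpreter rather than the Python 2.4 discussed in this article; the built-in compile() returns a code object whose co_consts, co_names and co_code attributes correspond to the constants, symbols and bytecode collected by the compiler. The exact contents differ between Python versions.

```python
# demo.py from the text, embedded as a string so the sketch is self-contained.
DEMO_SOURCE = '''
class A:
    pass

def fun():
    pass

value = 1
str = "Python"
a = A()
fun()
'''

# Compile only; the source is never executed here.
code = compile(DEMO_SOURCE, "demo.py", "exec")

# Plain string and integer constants collected from the source.
simple_consts = [c for c in code.co_consts if isinstance(c, (int, str))]
print("constants:", simple_consts)

# Symbol names used at module level.
print("names:", code.co_names)

# The raw bytecode (a bytes object in Python 3).
print("bytecode length:", len(code.co_code))
```

Nothing is run by this sketch, which is the point: strings, names and bytecode are all available immediately after compilation.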
This object and this file are the focus of this chapter: the PyCodeObject object and the PyC file. PyCodeObject can be described as the collection of static information about a program after its Python source code has been compiled:

[compile.h]

    /* Bytecode object */
    typedef struct {
        PyObject_HEAD
        int co_argcount;        /* #arguments, except *args */
        int co_nlocals;         /* #local variables */
        int co_stacksize;       /* #entries needed for evaluation stack */
        int co_flags;           /* CO_..., see below */
        PyObject *co_code;      /* instruction opcodes */
        PyObject *co_consts;    /* list (constants used) */
        PyObject *co_names;     /* list of strings (names used) */
        PyObject *co_varnames;  /* tuple of strings (local variable names) */
        PyObject *co_freevars;  /* tuple of strings (free variable names) */
        PyObject *co_cellvars;  /* tuple of strings (cell variable names) */
        /* The rest doesn't count for hash/cmp */
        PyObject *co_filename;  /* string (where it was loaded from) */
        PyObject *co_name;      /* string (name, for reference) */
        int co_firstlineno;     /* first source line number */
        PyObject *co_lnotab;    /* string (encoding addr<->lineno mapping) */
    } PyCodeObject;

When Python source code is compiled, one PyCodeObject is created for each code block. So how do we decide how much code makes up a code block? Whenever we enter a new scope, a new code block begins. Consider the following Python source:

[codeobject.py]

    class A:
        pass

    def fun():
        pass

    a = A()
    fun()

After compilation, three PyCodeObject objects have been created: one for codeobject.py as a whole, one for the code (scope) of class A, and one for the code of def fun. Each PyCodeObject contains the bytecode obtained by compiling its code block. Unfortunately, Python destroys the PyCodeObjects after executing the bytecode, so the next time the .py file is executed Python has to compile the source again, create the three PyCodeObjects again, and only then execute the bytecode. Not good, right? Python should provide a mechanism to save the intermediate result of compilation, namely the bytecode or, more precisely, the PyCodeObject. Python does provide such a mechanism: the PyC file.

The PyC file is the key to saving PyCodeObjects, so our analysis of the Python interpreter begins with the PyC file and its format.

Before analyzing the format, let's see how a PyC file is generated. After the source in a .py file has been executed, Python does not automatically generate the corresponding .pyc file; we have to trigger its creation ourselves. One simple way is to use Python's import mechanism. While Python is running, when it encounters a statement such as import abc, it searches the configured path for abc.pyc or abc.dll. If neither is found and only abc.py exists, Python first compiles abc.py into the corresponding PyCodeObject, then creates abc.pyc and writes the intermediate result into it. Next, Python imports abc.pyc, which in effect means rebuilding in memory the PyCodeObject stored in abc.pyc. Knowing this process, we can use the generator.py shown below to create the PyC file for the code above (codeobject.py).

[generator.py]

    import test
    print "done"

[codeobject.py]

    class A:
        pass

    def fun():
        pass

    a = A()
    fun()

Figure 1 shows the PyC file generated by Python.

As you can see, a PyC file is a binary file, so what matters is how Python interprets this pile of seemingly meaningless bytes. That is the PyC file format.
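On a current CPython 3 the same experiment can be done without relying on import side effects. The sketch below is an illustration under stated assumptions, not the Python 2.4 procedure from the text: it uses the standard py_compile module to write the file explicitly, assumes the 16-byte header used by CPython 3.7 and later (the Python 2.4 header described in the next section is shorter), and uses marshal.loads to rebuild the code object. The file names and the HEADER_SIZE constant are chosen for this example only.

```python
import importlib.util
import marshal
import os
import py_compile
import tempfile

SOURCE = "class A:\n    pass\n\ndef fun():\n    pass\n\na = A()\nfun()\n"

with tempfile.TemporaryDirectory() as tmp:
    src_path = os.path.join(tmp, "codeobject.py")
    pyc_path = os.path.join(tmp, "codeobject.pyc")
    with open(src_path, "w") as f:
        f.write(SOURCE)
    # Compile explicitly and choose where the .pyc is written.
    py_compile.compile(src_path, cfile=pyc_path, doraise=True)
    with open(pyc_path, "rb") as f:
        data = f.read()

# Assumption: CPython 3.7+ layout, where the header is 16 bytes
# and starts with the 4-byte magic value.
HEADER_SIZE = 16
magic = data[:4]
code = marshal.loads(data[HEADER_SIZE:])

print("first bytes:", data[:HEADER_SIZE].hex())
print("magic matches this interpreter:", magic == importlib.util.MAGIC_NUMBER)
print("top-level code object:", code.co_name)
```

A PyC file written by one interpreter version should only be unmarshalled by the same version, which is why the sketch compares the magic value before trusting the rest of the file.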
To understand the format of the PyC file, we must first understand what each field in PyCodeObject represents; there is no way around this.

    Field            Content
    co_argcount      Number of parameters of the code block, for example of a function
    co_nlocals       Number of local variables in the code block
    co_stacksize     Stack space required to execute the code block
    co_flags         N/A
    co_code          Bytecode instruction sequence compiled from the code block, stored as a PyStringObject
    co_consts        PyTupleObject holding all constants in the code block
    co_names         PyTupleObject holding all symbols in the code block
    co_varnames      N/A
    co_freevars      N/A
    co_cellvars      N/A
    co_filename      Full path of the .py file the code block belongs to
    co_name          Name of the code block, usually a function name or class name
    co_firstlineno   Starting line of the code block in the corresponding .py file
    co_lnotab        Mapping between bytecode and source line numbers in the .py file, stored as a PyStringObject

The co_lnotab field deserves an explanation. Before Python 2.3 there was a bytecode called SET_LINENO, which recorded the position of the source code in the .py file; this is useful for debugging and for displaying exceptions. Since Python 2.3 the compiler no longer generates this bytecode and instead records the information in co_lnotab. The correspondence between bytecode and source lines is stored in co_lnotab as an array of unsigned bytes, which can be read as a list of (increment of the bytecode offset in co_code, increment of the source line number) pairs. Take the following example:

    Offset of bytecode in co_code    Line number in the .py source
    0                                1
    6                                2
    50                               7

There is a small trick here: Python does not record these values directly. Instead it records the increments between successive entries. The corresponding co_lnotab is therefore: 0, 1, 6, 1, 44, 5.
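The increment encoding can be checked with a few lines of Python. This is a simplified sketch of the idea only: the function names are made up for this example, both counters start at 0 as in the table above, and the splitting of increments larger than 255 that a real byte array would need is ignored.

```python
def encode_increments(pairs):
    """Turn absolute (offset, line) pairs into a flat list of increments."""
    flat = []
    prev_offset = prev_line = 0
    for offset, line in pairs:
        flat += [offset - prev_offset, line - prev_line]
        prev_offset, prev_line = offset, line
    return flat


def decode_increments(flat):
    """Rebuild the absolute (offset, line) pairs from the increments."""
    pairs = []
    offset = line = 0
    for offset_incr, line_incr in zip(flat[0::2], flat[1::2]):
        offset += offset_incr
        line += line_incr
        pairs.append((offset, line))
    return pairs


table = [(0, 1), (6, 2), (50, 7)]
lnotab = encode_increments(table)
print(lnotab)                     # [0, 1, 6, 1, 44, 5]
print(decode_increments(lnotab))  # [(0, 1), (6, 2), (50, 7)]
```

Storing increments keeps most entries small enough to fit in a single unsigned byte, which is what makes the compact array representation possible.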
2. Generating the PyC file

As mentioned earlier, if Python cannot find a matching PyC or DLL file during an import, it automatically creates the PyC file from the .py file. To understand the PyC format, we only need to look at what Python does when it writes the compiled PyCodeObject into the PyC file. The following function is our entry point:

[import.c]

    static void write_compiled_module(PyCodeObject *co, char *cpathname, long mtime)
    {
        FILE *fp;

        fp = open_exclusive(cpathname);
        PyMarshal_WriteLongToFile(pyc_magic, fp, Py_MARSHAL_VERSION);
        /* First write a 0 for mtime */
        PyMarshal_WriteLongToFile(0L, fp, Py_MARSHAL_VERSION);
        PyMarshal_WriteObjectToFile((PyObject *)co, fp, Py_MARSHAL_VERSION);
        /* Now write the true mtime */
        fseek(fp, 4L, 0);
        PyMarshal_WriteLongToFile(mtime, fp, Py_MARSHAL_VERSION);
        fflush(fp);
        fclose(fp);
    }

Here cpathname is the absolute path of the PyC file. The first thing written to the file is the value of pyc_magic, the so-called magic value. The magic value is how Python guards compatibility: to prevent a Python 2.4 runtime from loading a PyC file generated by Python 1.5, for example, it is enough to give Python 2.4 and Python 1.5 different magic values. When loading a PyC file, Python checks the magic value first and refuses to load incompatible files. Why would PyC files be incompatible? One of the main reasons is changes to the bytecode. As Python evolves, some bytecodes leave the stage, such as the SET_LINENO mentioned above, and new language features add new bytecodes. Both cause incompatibilities.

The write operations for the PyC file finally come down to the following few functions (the code shown only handles writing to a file, that is, it assumes p->fp is valid, so it has been abridged, and w_short is not listed; for the omitted parts, see the Python source code):

[marshal.c]

    typedef struct {
        FILE *fp;
        int error;
        int depth;
        PyObject *strings; /* dict on marshal, list on unmarshal */
    } WFILE;

    #define w_byte(c, p) putc((c), (p)->fp)

    static void w_long(long x, WFILE *p)
    {
        w_byte((char)( x        & 0xff), p);
        w_byte((char)((x >> 8)  & 0xff), p);
        w_byte((char)((x >> 16) & 0xff), p);
        w_byte((char)((x >> 24) & 0xff), p);
    }

    static void w_string(char *s, int n, WFILE *p)
    {
        fwrite(s, 1, n, p->fp);
    }

PyMarshal_WriteLongToFile calls w_long directly, while PyMarshal_WriteObjectToFile in turn calls w_object. Note that the first argument of PyMarshal_WriteObjectToFile is the PyCodeObject produced by compilation.

The code of w_object is long and is not listed in full here. Its logic is simple: each kind of object, such as string, int or list, has its own write action, but the ultimate goal is always to write the entire PyCodeObject to the PyC file through the basic w_long and w_string. For a PyCodeObject, all fields are visited and written in order:

[marshal.c]

    static void w_object(PyObject *v, WFILE *p)
    {
        ......
        else if (PyCode_Check(v)) {
            PyCodeObject *co = (PyCodeObject *)v;
            w_byte(TYPE_CODE, p);
            w_long(co->co_argcount, p);
            w_long(co->co_nlocals, p);
            w_long(co->co_stacksize, p);
            w_long(co->co_flags, p);
            w_object(co->co_code, p);
            w_object(co->co_consts, p);
            w_object(co->co_names, p);
            w_object(co->co_varnames, p);
            w_object(co->co_freevars, p);
            w_object(co->co_cellvars, p);
            w_object(co->co_filename, p);
            w_object(co->co_name, p);
            w_long(co->co_firstlineno, p);
            w_object(co->co_lnotab, p);
        }
        ......
    }

And for a PyListObject, what do you think happens? That's right, it is traversed:

[w_object() in marshal.c]

    ......
    else if (PyList_Check(v)) {
        w_byte(TYPE_LIST, p);
        n = PyList_GET_SIZE(v);
        w_long((long)n, p);
        for (i = 0; i < n; i++) {
            w_object(PyList_GET_ITEM(v, i), p);
        }
    }
    ......

For a PyIntObject it is so simple that there is almost nothing to say:

[w_object() in marshal.c]

    ......
    else if (PyInt_Check(v)) {
        w_byte(TYPE_INT, p);
        w_long(x, p);
    }
    ......

Did you notice flags such as TYPE_LIST, TYPE_CODE and TYPE_INT? The PyC file uses these flags to mark the beginning of a new object, so that when the file is loaded the loader knows where a new object starts and what kind of object it is. The flags are defined in marshal.c:

[marshal.c]

    #define TYPE_NULL       '0'
    #define TYPE_NONE       'N'
    ......
    #define TYPE_INT        'i'
    #define TYPE_STRING     's'
    #define TYPE_INTERNED   't'
    #define TYPE_STRINGREF  'R'
    #define TYPE_TUPLE      '('
    #define TYPE_LIST       '['
    #define TYPE_CODE       'c'

As you can see, exporting the intermediate result in Python is not complicated. In the write path, every object, however complex (PyCodeObject, PyListObject and so on), is eventually reduced to two simple forms: writing a number and writing a string. We have already seen how numbers are written. Writing strings involves a slightly more complicated mechanism. Before looking at it, we need to understand the key structure of the write process, WFILE (abridged):

[marshal.c]

    typedef struct {
        FILE *fp;
        int error;
        int depth;
        PyObject *strings; /* dict on marshal, list on unmarshal */
    } WFILE;

Here we only consider the case where fp is valid, that is, writing to a file. WFILE can be seen as a thin wrapper around FILE *, but it has one peculiar field, strings. This field is the key to writing strings to and reading strings from the PyC file. When writing to the PyC file, strings is a PyDictObject; when reading from it, strings is a PyListObject.

[marshal.c]

    void PyMarshal_WriteObjectToFile(PyObject *x, FILE *fp, int version)
    {
        WFILE wf;
        wf.fp = fp;
        wf.error = 0;
        wf.depth = 0;
        wf.strings = (version > 0) ? PyDict_New() : NULL;
        w_object(x, &wf);
    }

As you can see, strings is created before anything is written. Its use shows up in the string-handling part of w_object:

[w_object() in marshal.c]

    ......
    else if (PyString_Check(v)) {
        if (p->strings && PyString_CHECK_INTERNED(v)) {
            PyObject *o = PyDict_GetItem(p->strings, v);
            if (o) {
                long w = PyInt_AsLong(o);
                w_byte(TYPE_STRINGREF, p);
                w_long(w, p);
                goto exit;
            }
            else {
                o = PyInt_FromLong(PyDict_Size(p->strings));
                PyDict_SetItem(p->strings, v, o);
                Py_DECREF(o);
                w_byte(TYPE_INTERNED, p);
            }
        }
        else {
            w_byte(TYPE_STRING, p);
        }
        n = PyString_GET_SIZE(v);
        w_long((long)n, p);
        w_string(PyString_AS_STRING(v), n, p);
    }
    ......

The interesting case is a string that needs the intern operation. We can see that the strings field of WFILE is in fact a PyDictObject that maps a string to an int. What is that int? It is the ordinal position at which the string was added to WFILE.strings: the first string added gets 0, the next gets 1, and so on.

At first this int does not seem necessary. Why record the sequence number at which a string was added to WFILE.strings? Consider the following situation. Suppose we need to write three strings to the PyC file: "Jython", "Ruby" and "Jython", all of which must be interned. The first two strings are no problem; we can write them with our eyes closed. After the first two strings have been written, WFILE.strings and the PyC file look as shown in Figure 2.

Writing the third string is where the trouble starts. What should we do with this second "Jython"? Treat it like the previous two? If we do, WFILE.strings and the PyC file look as shown in Figure 3 after the write.

Never mind what happens to WFILE.strings; one look at the PyC file shows the problem. The file now contains duplicate content: the "Jython" information appears twice. What trouble does this cause?
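A toy model makes the cost of the duplication measurable. The sketch below is loosely modeled on w_long and the string branch of w_object, but it is not CPython's implementation: write_strings and read_strings are hypothetical helpers, every string is treated as interned, and only the two flags TYPE_INTERNED and TYPE_STRINGREF from the listing above are used. With use_refs=False every string is written in full; with use_refs=True a repeated string is replaced by the flag 'R' followed by an index, which is the remedy the text turns to next.

```python
import struct

TYPE_INTERNED = b"t"
TYPE_STRINGREF = b"R"


def w_long(value):
    """Pack a 32-bit integer least significant byte first, like w_long."""
    return struct.pack("<i", value)


def write_strings(strings, use_refs):
    """Serialize strings; with use_refs, repeats become (R, index)."""
    out = bytearray()
    seen = {}  # string -> index, playing the role of WFILE.strings
    for s in strings:
        if use_refs and s in seen:
            out += TYPE_STRINGREF + w_long(seen[s])
            continue
        seen.setdefault(s, len(seen))
        data = s.encode("ascii")
        out += TYPE_INTERNED + w_long(len(data)) + data
    return bytes(out)


def read_strings(blob):
    """Rebuild the list of strings, resolving back-references by index."""
    table, result, pos = [], [], 0
    while pos < len(blob):
        tag = blob[pos:pos + 1]
        (n,) = struct.unpack_from("<i", blob, pos + 1)
        pos += 5
        if tag == TYPE_STRINGREF:
            result.append(table[n])  # n is an index into the table
        else:
            s = blob[pos:pos + n].decode("ascii")  # n is a length
            pos += n
            table.append(s)
            result.append(s)
    return result


words = ["Jython", "Ruby", "Jython"]
naive = write_strings(words, use_refs=False)
compact = write_strings(words, use_refs=True)
print(len(naive), len(compact))  # 31 25
print(read_strings(compact))     # ['Jython', 'Ruby', 'Jython']
```

The saving here is only six bytes, but it grows with the length of the repeated string and the number of repetitions.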
Imagine that our Python code creates a variable named button and then uses it many times, so that the symbol "button" appears again and again in the code. Imagine how bloated the PyC file would become, stuffed with worthless redundant information. If you were Guido, could you put up with such a design? Of course not! So Guido gave us TYPE_STRINGREF. When a PyC file is parsed, this flag indicates that the value that follows is an index, and the required string can be found in WFILE.strings using that index.

With TYPE_STRINGREF our PyC file slims down, as shown in Figure 4.

Looking at how the PyC file is loaded makes the mechanism clearer. As mentioned above, WFILE.strings is a PyListObject when a PyC file is read, so after the first two strings have been read, WFILE.strings looks as shown in Figure 5.

When the following (R, 0) is loaded, the parser sees the TYPE_STRINGREF flag, uses the index 0 that follows the flag to access WFILE.strings directly, and immediately obtains the string "Jython".

3. One PyCodeObject or multiple PyCodeObjects?

One last interesting topic about PyCodeObject and the PyC file remains. Do you remember test.py from earlier? We said that this simple piece of Python code, which does nothing useful, produces three PyCodeObjects. Yet in write_compiled_module we saw that the Python runtime calls PyMarshal_WriteObjectToFile on only one PyCodeObject. Suddenly two PyCodeObjects seem to have gone missing.

Python obviously would not make such an elementary mistake. If you were Guido, how would you solve this problem? A first guess is that the two missing PyCodeObjects must be contained in the other one. And that is indeed the case. Do you remember how, as pointed out at the beginning, Python decides what a code block is? Right, by scope. Take a closer look at test.py.
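One way to take that closer look is to compile the source and walk through the resulting code objects. This is a small sketch assuming a current CPython 3; walk_code is a hypothetical helper written for this example, and the source text is the class-plus-function example used throughout this chapter.

```python
import types

SOURCE = '''
class A:
    pass

def fun():
    pass

a = A()
fun()
'''


def walk_code(code, depth=0):
    """Yield (depth, name) for a code object and every code object
    found, recursively, among its constants."""
    yield depth, code.co_name
    for const in code.co_consts:
        if isinstance(const, types.CodeType):
            yield from walk_code(const, depth + 1)


top = compile(SOURCE, "test.py", "exec")
found = list(walk_code(top))
for depth, name in found:
    print("  " * depth + name)
```

The output lists the module-level code object first, with the code objects for A and fun indented beneath it.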
You will find that the scopes form a nested structure, and this is exactly the structure that exists between the PyCodeObject objects. So now it is clear: the PyCodeObjects for fun and A must be contained in the PyCodeObject for the global scope, and the co_consts field of the PyCodeObject structure is where the two objects are hiding, as shown in Figure 6.

When a PyCodeObject is written to the PyC file and another PyCodeObject is found inside it, the write operation is applied to that object recursively. In this way all PyCodeObjects end up in the PyC file, and the PyCodeObjects in the PyC file keep their nested relationship.

4. Python bytecode

Python source code is compiled into Python bytecode before execution, and the Python execution engine carries out a series of operations based on this bytecode to run the program. Python 2.4.1 defines a total of 103 bytecodes:

[opcode.h]

    #define STOP_CODE               0
    #define POP_TOP                 1
    #define ROT_TWO                 2
    ......
    #define CALL_FUNCTION_KW        141
    #define CALL_FUNCTION_VAR_KW    142
    #define EXTENDED_ARG            143

The meaning of each bytecode is described on a dedicated page of the Python documentation: http://docs.python.org/lib/bytecodes.html.

You must have noticed that the bytecode numbers go up to 143 even though only 103 bytecodes are defined. Indeed, the bytecode numbers in Python 2.4.1 are not contiguous. For example, ROT_FOUR, encoded as 5, is followed by NOP, encoded as 9. This is probably a legacy of history. In our line of work history is not always a good thing, and it leaves many people frustrated.

Some bytecodes take an argument and others do not. All bytecodes that take an argument have an encoding greater than or equal to 90.
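The same rule can be tried out from Python. This is a hedged sketch for a current CPython 3: the threshold is read from the standard opcode module as opcode.HAVE_ARGUMENT instead of hard-coding the value 90 given for Python 2.4.1, because the numbering differs between versions, and recent releases document some exceptions to the rule. The helper has_arg is a hypothetical name used only here.

```python
import opcode


def has_arg(op):
    """Return True if the opcode number is at or above the threshold
    from which bytecodes take an argument."""
    return op >= opcode.HAVE_ARGUMENT


print("threshold on this interpreter:", opcode.HAVE_ARGUMENT)
for name in ("POP_TOP", "LOAD_CONST"):
    op = opcode.opmap[name]  # opmap maps bytecode names to numbers
    print(name, op, has_arg(op))
```

POP_TOP only discards the top of the stack and needs no argument, whereas LOAD_CONST needs the index of the constant to load.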
Python provides a special macro to determine whether a bytecode takes an argument:

[opcode.h]

    #define HAS_ARG(op) ((op) >= HAVE_ARGUMENT)

Now that we know everything about PyCodeObject and the PyC file, we can do something interesting with Python. In my opinion, the most interesting thing is to write a PyC file parser ourselves. With what we know now, we can do that. Figure 7 shows the result of parsing the PyC file of test.py from earlier in this chapter.

Going further, we can also parse the bytecode. As we already know, when Python generates the PyC file it writes the bytecode of each PyCodeObject into the file, and the PyC file also records the correspondence between the bytecode and the Python source code: that is co_lnotab. If we know the offset A of a bytecode in co_code, the source line corresponding to this bytecode can be obtained with the following algorithm (Python pseudocode):

    lineno = addr = 0
    for addr_incr, line_incr in co_lnotab:
        addr += addr_incr
        if addr > A:
            return lineno
        lineno += line_incr
    return lineno

Below is the result of disassembling a piece of Python source code into bytecode. The analysis of the Python execution engine in the next chapter starts from this result:

    i = 1
        # LOAD_CONST   0
        # STORE_NAME   0
    s = "Python"
        # LOAD_CONST   1
        # STORE_NAME   1
    d = {}
        # BUILD_MAP    0
        # STORE_NAME   2
    l = []
        # BUILD_LIST   0
        # STORE_NAME   3
        # LOAD_CONST   2
        # RETURN_VALUE none

Let's go one step further. From where we stand now, we could in fact build a Python execution engine ourselves. Wow, how exciting. The sky is wide, and the sun is rising ......

In fact, the Python standard library already provides a disassembly tool for Python, dis. With this tool we can easily obtain the results shown here, and in more detail. Figure 8 shows the result of disassembling codeobject.py with dis.

In the result shown in Figure 8, the leftmost column is the line number in the source file codeobject.py.
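A listing in the style described here can be produced with a few lines of Python. This is a sketch for a current CPython 3, so the instruction names and offsets will not match the Python 2.4 listing above exactly; dis.get_instructions is a standard-library function, while the four-line source text and the output layout are chosen for this example.

```python
import dis

SOURCE = 'i = 1\ns = "Python"\nd = {}\nl = []\n'
code = compile(SOURCE, "<example>", "exec")

# One row per bytecode instruction: offset in co_code, name, argument.
rows = [(ins.offset, ins.opname, ins.argrepr)
        for ins in dis.get_instructions(code)]

for offset, opname, argrepr in rows:
    print(f"{offset:4d} {opname:<20} {argrepr}")
```

Calling dis.dis(code) instead prints the standard listing, which also includes the source line numbers.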
The second column from the left is the offset of the current bytecode instruction in co_code. In later analysis we will mostly use the disassembly produced by the dis tool, and only in some special cases our own.

Address: http://blog.donews.com/lemur/archive/2006/02/21/736881.aspx
This article is an English version of an article which is originally in the Chinese language on aliyun.com and is provided for information purposes only. This website makes no representation or warranty of any kind, either expressed or implied, as to the accuracy, completeness ownership or reliability of the article or any translations thereof. If you have any concerns or complaints relating to the article, please send an email, providing a detailed description of the concern or complaint, to info-contact@alibabacloud.com. A staff member will contact you within 5 working days. Once verified, infringing content will be removed immediately.