Understanding the C Language at a Low Level

Source: Internet
Author: User

To understand the C language in depth, you need a few pieces of background knowledge:

1. Code written in a compiled high-level language (as opposed to a scripting language) goes through the same phases: preprocessing -> compilation (into assembly code) -> assembly -> linking. Preprocessing produces .i files, compilation produces .s files, assembly produces .o files, and linking produces the executable file. The .o files differ from machine to machine. Java can be "compiled once and run anywhere" because it does not produce machine-specific .o files the way C does; it compiles to bytecode and relies on the JVM to hide the differences between machines, so each machine only needs its own JVM and the same compiled file runs everywhere. (This may be one reason Android devices often ship with stronger hardware than the iPhone: Android uses Java technology, so there is one more translation step at run time, which costs efficiency compared with iOS apps written in Objective-C. I have heard that the JVM has recently adopted techniques to improve efficiency, but I have not studied them, so I will not say more.)
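To make the four phases concrete, here is a small sketch of a file you could push through each phase by hand. The file name stages.c, the macro SQUARE and the function square_plus_one are made up for illustration, and the commands in the comment assume GCC.

```c
/* stages.c: a tiny translation unit to run through each phase by hand.
 *
 *   gcc -E stages.c -o stages.i    preprocess only (macros expanded)
 *   gcc -S stages.i -o stages.s    compile to assembly code
 *   gcc -c stages.s -o stages.o    assemble into an object file
 *
 * Linking is the final step: stages.o would be combined with an object
 * file that provides main() (not shown here) to form the executable.
 */
#define SQUARE(x) ((x) * (x))   /* disappears in stages.i; only its expansion remains */

int square_plus_one(int n)
{
    return SQUARE(n) + 1;
}
```

Opening stages.i after the first command shows that the preprocessor has already replaced SQUARE(n) with ((n) * (n)); the later phases never see the macro.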

2. The compiler turns your code into an executable file. On Windows that is not necessarily an .exe; assuming so is a mistake. Taking the PE format as an example, the kind of file is indicated by the Characteristics field at offset 0x16 in the PE header (for an ordinary 32-bit EXE the two bytes stored there are typically 0F 01, that is, the little-endian value 0x010F). Executable formats differ between operating systems: Linux uses ELF and Windows uses PE. Because I am more familiar with the PE format, I will use it as the example. Disassemble any Windows executable and you will find it divided into sections, roughly .text, .idata, .rdata, .data and .rsrc. Why? Because this makes it easy to map the program into the memory space of the process: to simplify management and to implement various mechanisms, the process memory space is itself divided into segments. Under Linux, the process memory space is laid out in much the same way.
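As a rough illustration of where the Characteristics field sits, the following sketch reads it from a PE image that has already been loaded into a byte buffer. The function name pe_characteristics and the two PE_FLAG_* names are hypothetical; the offsets (the e_lfanew pointer at 0x3C and the field at 0x16 past the "PE\0\0" signature) and the flag values follow the published PE layout. It is a minimal sketch with only basic bounds checks, not a full parser.

```c
#include <stddef.h>
#include <stdint.h>

#define PE_FLAG_EXECUTABLE_IMAGE 0x0002  /* IMAGE_FILE_EXECUTABLE_IMAGE */
#define PE_FLAG_DLL              0x2000  /* IMAGE_FILE_DLL */

/* Returns the 16-bit Characteristics field of a PE image held in buf,
   or -1 if the buffer does not look like a PE file. */
long pe_characteristics(const unsigned char *buf, size_t len)
{
    if (len < 0x40 || buf[0] != 'M' || buf[1] != 'Z')
        return -1;

    /* e_lfanew: little-endian offset of the "PE\0\0" signature */
    uint32_t pe = (uint32_t)buf[0x3C]
                | (uint32_t)buf[0x3D] << 8
                | (uint32_t)buf[0x3E] << 16
                | (uint32_t)buf[0x3F] << 24;

    if (pe > len || len - pe < 0x18)
        return -1;
    if (buf[pe] != 'P' || buf[pe + 1] != 'E' || buf[pe + 2] != 0 || buf[pe + 3] != 0)
        return -1;

    /* Characteristics: two little-endian bytes at signature + 0x16 */
    return (long)(buf[pe + 0x16] | buf[pe + 0x17] << 8);
}
```

A caller would test the result with the flags, for example (result & PE_FLAG_DLL) to tell a DLL from an ordinary executable.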

The user-mode linear address space of a process runs from 0x00000000 to 0xBFFFFFFF (on 32-bit Linux). This is the linear address space in which ordinary applications run; every byte of memory in it has its own address. Note that this is the linear address space. The addresses shown in the left-hand column of a disassembly listing are logical addresses (written in hexadecimal). A logical address is translated by the segmentation mechanism into a linear address, and the linear address is translated by paging into a physical address, which identifies a location in the actual memory modules. (Some operating systems do not use segmentation, in which case the logical address equals the linear address.) That is a chapter's worth of detail that I will not go into; if you are interested, read a book on the Linux kernel. What you need to know is that before your program can run, the operating system must allocate memory for it (and do many other things). While it runs, your program exists in the operating system as a process or a thread, depending on the situation. (A process is described by more than a simple PID: the kernel describes it with a task_struct, called the process descriptor; again, see a book on the kernel.) The following illustration maps out a Windows executable (I was lazy and simply reused a figure from my notes):

I hope this description gives you a rough picture of how a program exists and runs on your computer. Next I will analyze declarations.


Static scope:

Yes, this belongs to compiler theory, but I have added some of my own bottom-up insights to explain what static scope is. In plain terms, static scope means you can determine the scope of a declaration just by reading the source code: every use of the declared name inside that scope refers to that declaration. The scope rules of C (and C-like languages) are based on program structure, that is, on blocks, which are delimited by your "{}" symbols, as in the following figure:
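Nested blocks of this kind are easy to try out in code. The sketch below is an assumption about what such nesting looks like, not the original figure: the block labels B1 to B4 and the initial values are chosen to match the values discussed in this section, and scope_demo and out are hypothetical names. The cout statements are replaced by stores into an array so that the example is plain C.

```c
/* Records the values of a and b visible at four points:
   out[0], out[1]: inside B3      out[2], out[3]: inside B4
   out[4], out[5]: at end of B2   out[6], out[7]: at end of B1 */
void scope_demo(int out[8])
{
    int a = 1;                      /* B1 */
    int b = 1;
    {
        int b = 2;                  /* B2: hides the b of B1 */
        {
            int a = 3;              /* B3: hides the a of B1 */
            out[0] = a;             /* a declared in B3 */
            out[1] = b;             /* no b in B3, so the b of B2 is used */
        }
        {
            int b = 4;              /* B4: hides the b of B2 */
            out[2] = a;             /* no a in B4 or B2, so the a of B1 */
            out[3] = b;
        }
        out[4] = a;
        out[5] = b;
    }
    out[6] = a;                     /* back in B1: both are 1 again */
    out[7] = b;
}
```

Each inner declaration creates a new variable; the outer one is untouched and becomes visible again as soon as the inner block ends.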


The last cout<<a<<b prints 1 for both a and b. A statement inside a block (a block is also a scope) first uses the declarations made in that block: for example, the value of a printed by cout<<a<<b in block B3 is 3. If the block has no such declaration, the compiler looks in the enclosing (parent) block: for example, the value of b printed by cout<<a<<b in block B3 is 2. Note that int a; is a declaration and int a = 1; is also a declaration (one with an initializer), whereas a statement such as a = 1; that comes after the declaration is an assignment; you can think of it as fixing a value. The identifier a itself is only a name. A name and a variable (a memory location, with its own memory address) are different things, as shown in the diagram:


Names in different scopes can be the same, but because their environments (scopes) differ, they refer to different memory locations. You must declare a name before you assign to it; otherwise it is unclear which memory location the assignment targets. For example, blocks B2 and B3 each have a declaration of the form int a = <value>; the two names refer to different memory locations, so they can hold different values. Which variable (that is, which memory location) an assignment targets depends on the declaration visible in that scope, so even the same statement (the same name) can assign to different variables as the environment changes.

The compiler also processes declarations in source order. A common example on the internet starts with int max(int, int);, which declares a function at file scope: the name is visible from that point to the end of the file, but no body has been supplied yet. int main() opens another scope. That scope contains no declaration of max, so the compiler looks in the enclosing scope and finds the function declaration there (because int max(int, int); appears before main), and the call succeeds. If you moved int max(int, int); below main(), a modern compiler would reject or warn about the call, because the declaration must already have been seen at the point of the call. In fact, the last 7 lines of that example, the definition of max, also serve as a declaration, just as int a = 1; does, so they could equally be placed before main.
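The ordering rule can be sketched as follows. This is not the original internet example, only a minimal stand-in: largest_of_three is a hypothetical caller that plays the role of main(), and the body of max is an assumption.

```c
int max(int, int);      /* declaration: the name is visible from here to the end of the file */

/* Hypothetical caller standing in for main(). The call compiles because
   the declaration above has already been seen at this point. */
int largest_of_three(int a, int b, int c)
{
    return max(max(a, b), c);
}

/* Definition: supplies the body, and also counts as a declaration.
   Had it been placed before largest_of_three, the first line of this
   file would not have been needed. */
int max(int x, int y)
{
    return x > y ? x : y;
}
```

Deleting the first line and leaving the definition at the bottom makes the call refer to an undeclared function, which current compilers diagnose.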

Java and C++ differ from C in that they have access specifiers such as public, private and protected; scope is not limited only by the programmer's "{}" blocks and by functions (different functions are different scopes), as it is in C. Java, for instance, has many kinds of encapsulated scope: a method declared public can be called through any object of the class that defines it, and so on. That brings us to objects. I think the root reason C is not an object-oriented language is that it has no customizable encapsulation scopes and so cannot produce an "object" that owns its own methods (functions, in C terms); put simply, it has no keywords such as public. (This is just my personal opinion; experts, please don't laugh.)



Now for the main point. Everything above is of little use unless we go down to the assembly level: the process memory knowledge I described earlier has not yet been put to work, and I have not yet shown how different scopes are implemented at the bottom. That is what comes next.

This is my simplified picture of how a Linux process is laid out in memory (similar to the first picture, which is itself simplified). The .text and .data segments contain many other things as well; even a modest program pulls in dynamically linked libraries, including libc, the ANSI C library on Linux. Those parts relate to kernel calls, library functions or even the GCC compiler rather than to your own source code. Linear addresses increase from left to right. The .text segment holds the read-only binary code. The .data segment holds initialized global and static variables, for example static int a = 1; (note that GCC normally places a variable explicitly initialized to 0 in .bss instead). The .bss segment holds uninitialized global and static variables, for example static int a;. You may wonder where variables declared inside functions live, such as the b of int b = 2; in block B2. They live on the stack: any variable that is neither global nor static is stored there. A variable declared static has a fixed address, which means that once you change its value, the change persists for the rest of the program's run, no matter which scope the changing statement is in. Next I will disassemble a program to show how the variables of ordinary functions live on the stack. (You may not follow what comes next, but that cannot be helped: my topic is understanding C at a low level, and I think the material above is already worthwhile.)
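The difference between static storage and stack storage shows up in a few lines of C. The names below (initialized_global, zeroed_global, bump) are hypothetical, and the section placement in the comments describes what GCC on Linux typically does rather than anything the C standard promises.

```c
int initialized_global = 7;   /* typically placed in .data */
int zeroed_global;            /* typically placed in .bss; starts out as 0 */

/* Returns calls * 10 + local so that both values can be seen at once. */
int bump(void)
{
    static int calls;   /* static storage: fixed address, keeps its value between calls */
    int local = 0;      /* automatic storage: created afresh in each stack frame */

    calls++;
    local++;
    return calls * 10 + local;
}
```

Successive calls return 11, 21 and 31: the static counter keeps growing across calls, while the stack variable starts from 0 every time.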

First, on Linux, use the command objdump -S test1.o to disassemble the .o file, that is, the object file test1.o that has been compiled but not yet linked into an executable. (It can be produced with gcc -c -o test1.o test1.c; the -S option shows the assembly code interleaved with the source code, which requires the file to have been compiled with debug information, for example with -g.) Because a .o file is not an executable and has not been linked, a call from one function to another does not yet have its final target address filled in. Linking a .o file also adds many sections that do not come from your source code: code supplied by the system, library linkage, and even code used to pass the register state of the user-mode process to the kernel. I think these mechanisms exist not only for functionality but also, to a large extent, for security: the earliest stack layouts were very vulnerable to buffer overflow attacks.

There are several ways to protect the stack. Some place a guard value in front of the return address; others use ASLR, address space layout randomization. After analyzing the code I will briefly demonstrate this technique. These defenses keep evolving as Linux is updated, making the operating system more secure. That is the charm of open source: experts all over the world take part in updating the operating system, which gives Linux its vitality. I have dabbled in writing shellcode, and although I am still a beginner, I know a few common buffer overflow attacks and some outdated vulnerabilities. Since I do not understand the protection mechanisms of the current kernel version, I am not sure what some of the instructions do and can only guess; experts, please don't laugh.

I mentioned the .o file above, but I do not think a .o file is well suited for a demonstration, so I will disassemble the executable test1 instead. It is compiled with gcc -o test1 test1.c and disassembled with objdump -d -M i386 test1. The -M option passes options to the disassembler, here selecting the form of the assembly output (objdump -i lists the architectures and object formats that objdump supports). In terms of syntax there are two formats, Intel and AT&T, with AT&T the default; the two do not differ much. Each also comes in 32-bit and 64-bit variants, where the registers change but the 64-bit registers remain compatible with the 32-bit ones. For convenience I use the AT&T 32-bit (i386) form. You can even give two -M options, as in objdump -d -M i386 -M intel test1, which uses Intel syntax with the 32-bit instruction set. The relationship between the 64-bit and 32-bit registers is shown below:


Source code test1.c:

    #include <stdio.h>

    int sum(int temp1, int temp2);

    int main()
    {
        int i;
        i = sum(2, 3);
        return 0;
    }

    int sum(int temp1, int temp2)
    {
        int c = temp1;
        int b = temp2;
        int a = b + c;
        return a;
    }

Here I have to correct a mistake that many people make: writing the main function as void main(). The return value of main() should be int. On Linux a process exits either normally or abnormally. One way to exit normally is to execute return in main(); the others are to call exit() or _exit(), where exit() is the library function that first flushes buffered stream data out to files, and _exit() terminates the process immediately. The value returned by main is received by __libc_start_main and passed to exit. By convention a nonzero return value indicates failure. (A process can also terminate abnormally, for example by calling abort().) In a way, C's return mechanism resembles an exception thrown with try and catch in Java: most of us use return only to hand back a value, but from another angle it can also signal an "anomaly". When a function called from main() (or from any other function) executes return, control (see the CPU's instruction pointer register, EIP, or RIP on 64-bit) goes back to the calling function. When return is executed in main() itself, control goes back to the operating system. Earlier compilers treated void main() as an error; I have read that newer compilers automatically add return 0 to void main(), but this is not guaranteed, and the exit status of such a program should be treated as unspecified. This mistake may look harmless, but I believe a highly skilled attacker could take advantage of it.
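One way to watch an exit status travel from a terminating process back to its parent is the sketch below. It assumes a POSIX system such as Linux; observed_exit_status is a hypothetical helper, and the child calls _exit() directly instead of returning from main() so that the example fits in a single function.

```c
#include <sys/types.h>
#include <sys/wait.h>
#include <unistd.h>

/* Starts a child process that terminates with the given code and returns
   the status the parent observes, or -1 on error or abnormal termination. */
int observed_exit_status(int code)
{
    pid_t pid = fork();
    if (pid < 0)
        return -1;                     /* fork failed */
    if (pid == 0)
        _exit(code);                   /* child: terminate at once, no stdio flushing */

    int status = 0;
    if (waitpid(pid, &status, 0) < 0)
        return -1;
    if (WIFEXITED(status))
        return WEXITSTATUS(status);    /* normal exit: low 8 bits of the code */
    return -1;                         /* abnormal exit, e.g. killed by a signal */
}
```

The parent sees 0 for a child that exits with 0 and 3 for a child that exits with 3, which is the same channel a shell uses when it checks whether a command succeeded.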

Okay, back to the point: the disassembly. I am mainly concerned with the functions main and sum from the source code, because the other sections have essentially nothing to do with the source (and, frankly, some of them are too complex for me to understand).
