To understand multi-threading you naturally need some hardware knowledge. I have no formal background in hardware — everything below comes from books and the internet — so please point out any errors.
CPU is short for Central Processing Unit. It is an ultra-large-scale integrated circuit containing billions of transistors, and it is the computing and control core (control unit) of a computer. Its main job is to interpret machine instructions and to process the data handled by software.
Its main components are: the arithmetic unit, the controller, and the registers.
Arithmetic unit: performs fixed-point and floating-point arithmetic, shift operations, and logical operations, as well as address calculations and conversions.
Controller: mainly responsible for decoding instructions and issuing the control signals needed to carry out each operation of an instruction. There are two kinds of structure: microprogrammed control, built around a micro-instruction store, and hard-wired control, built from combinational logic.
Registers: the register file, including general-purpose registers, special-purpose registers, and control registers. General-purpose registers can be further divided into fixed-point and floating-point registers; they temporarily hold operands and intermediate (or final) results during instruction execution. The general-purpose registers are among the most important parts of the CPU.
Stage one: fetch — retrieve the instruction (a numeric value or a series of values) from memory or from cache. The memory location is specified by the program counter, which holds a value identifying the current position in the program; in other words, the program counter tracks where the CPU is in the program.
Stage two: decode — the controller breaks the instruction down into the operations to be performed.
Stage three: execute — the arithmetic logic unit (ALU) is connected to a set of inputs and a set of outputs. The inputs supply the values to be operated on, and the outputs receive the result. The ALU contains the circuitry for simple arithmetic and logical operations (such as addition and bitwise operations). If an addition produces a result too large for the CPU to represent, the arithmetic overflow flag may be set in the flag register.
Stage four: write-back — the result is written back to a register, to cache, or to larger, cheaper, slower storage (main memory, hard disk, etc.).
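The four stages above can be sketched as a toy fetch–decode–execute–write-back loop. The instruction format, register names, and 8-bit ALU width below are all invented for illustration; no real instruction set works exactly like this.

```python
# Toy model of the fetch / decode / execute / write-back cycle.
# The 3-field instruction format and register names are made up for illustration.

memory = {0: ("ADD", "r1", "r2"),   # program: r1 = r1 + r2
          1: ("ADD", "r1", "r1"),   #          r1 = r1 + r1
          2: ("HALT", None, None)}
registers = {"pc": 0, "r1": 3, "r2": 4, "overflow": False}
MAX_INT = 255                                 # pretend the ALU is 8 bits wide

while True:
    instr = memory[registers["pc"]]           # 1. fetch: the program counter selects the instruction
    op, dst, src = instr                      # 2. decode: split into operation and operands
    if op == "HALT":
        break
    registers["pc"] += 1
    if op == "ADD":                           # 3. execute: the ALU adds its two inputs
        result = registers[dst] + registers[src]
        if result > MAX_INT:                  #    result too large for the ALU -> set overflow flag
            registers["overflow"] = True
            result &= MAX_INT
        registers[dst] = result               # 4. write-back: result returns to a register

print(registers["r1"])        # 14  (3+4=7, then 7+7=14)
print(registers["overflow"])  # False — 14 fits in 8 bits
```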
Register: a very important storage unit in an integrated circuit, usually built from flip-flops. In IC design, registers are divided into those used internally by the circuit and those that serve as interfaces between the inside and the outside. Internal registers cannot be accessed by external circuitry or software; they only implement storage inside the circuit or satisfy its timing requirements. Interface registers can be accessed both by internal circuits and by external circuits or software; the registers in a CPU belong to this kind — they are the interface between hardware and software, and are familiar to most programmers.
1) Data registers — hold integers (compare the floating-point registers below). In some simple or older CPUs there was a special data register, the accumulator.
4) Accumulator — used for arithmetic calculations.
5) Address registers — hold memory addresses used to access memory. In some simple or older CPUs the special address register was an index register (there may be one or several).
6) General-purpose registers (GPRs) — can hold either data or addresses; they combine the roles of data and address registers.
7) Floating-point registers (FPRs) — hold floating-point numbers.
8) Constant registers — hold read-only values (e.g. 0, 1, pi).
9) Vector registers — hold data for SIMD (single instruction, multiple data) instructions executed by vector processors.
10) Special-purpose registers — hold CPU-internal state, such as the program counter (or instruction pointer), the stack pointer, and the status register (or program status word).
11) Instruction register — holds the instruction currently being executed.
12) Index registers — used to modify operand addresses while a program runs.
Registers are divided into internal and external registers. So-called internal registers are themselves small storage units that can hold data. Compared with memory, however, registers have their own characteristics:
① Registers sit inside the CPU and are few in number — the classic 8086, for example, has only 14.
② A register does not necessarily hold 8 bits: some registers hold 16 bits, and some registers in 386/486 processors hold 32 bits.
③ Each internal register has a name rather than a memory-style address number.
Registers have three main uses:
1. Performing arithmetic and logical operations on the data they hold.
2. Holding addresses that point to locations in memory.
3. Reading data from, and writing data to, the computer's peripheral devices.
In short: instruction parsing — data/operations (registers) — write-back (cache/memory/disk).
Computer storage hierarchy (memory hierarchy): registers are fastest, memory comes next, and the hard disk is slowest. Registers and memory are both transistor-based storage devices, so why are registers faster than memory? Mike Ash wrote a good, very accessible explanation of this question, which helps deepen one's understanding of the hardware.
Reason one: distance is different
Distance is not the main factor, but it is the easiest to understand, so it comes first. Memory is farther from the CPU than the registers are, so reads take longer.
Take a 3 GHz CPU: its clock oscillates 3 billion times per second, so each cycle takes about 0.33 nanoseconds. Light travels 30 centimeters in 1 nanosecond, which means that within one CPU clock cycle light can travel about 10 centimeters. So if memory sits more than 5 centimeters from the CPU, the round trip alone cannot complete within one clock cycle — and that ignores hardware limitations and the fact that electrical signals do not actually travel at the speed of light. Registers, by contrast, are inside the CPU, so of course they read faster. Distance matters a lot for desktop computers and much less for phones: a phone's CPU clock is slower (the iPhone 5s runs at 1.3 GHz) and its memory sits right next to the CPU.
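The arithmetic above is easy to check:

```python
# How far light travels during one clock cycle of a 3 GHz CPU.
SPEED_OF_LIGHT_CM_PER_S = 3.0e10   # ~30 cm per nanosecond
clock_hz = 3.0e9                   # 3 GHz
cycle_s = 1.0 / clock_hz           # one clock cycle, in seconds
cm_per_cycle = SPEED_OF_LIGHT_CM_PER_S * cycle_s

print(round(cycle_s * 1e9, 2))     # 0.33 (nanoseconds per cycle)
print(round(cm_per_cycle, 1))      # 10.0 (centimeters per cycle)
# A signal must go out AND come back, hence the 5 cm limit quoted in the text.
```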
Reason two: the hardware design differs (1 B = 1 byte = 8 bits)
The iPhone 5s's CPU, the A7, has more than 6,000 bits of registers (31 64-bit registers plus 32 128-bit registers). The iPhone 5s's memory is 1 GB — about 8 billion bits. This means a high-performance, high-cost, high-power design can be used for the registers' six-thousand-odd bits but not for memory: any extra cost or power per bit would be multiplied 8 billion times. In fact the design of memory is comparatively simple — each bit is one capacitor and one transistor — while a register's design is completely different, using many more electronic components. Moreover, once powered on, a register's transistors are always energized, while a memory cell's transistors are energized only when needed, which saves power. These design factors make registers faster to read than memory.
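The sizes quoted above work out as follows:

```python
# The register-file and memory sizes quoted above, in bits.
a7_register_bits = 31 * 64 + 32 * 128   # general-purpose + SIMD registers
memory_bits = 1 * 1024**3 * 8           # 1 GB of RAM, 8 bits per byte

print(a7_register_bits)                 # 6080 — "more than 6,000"
print(memory_bits)                      # 8589934592 — about 8 billion
print(memory_bits // a7_register_bits)  # memory has ~1.4 million times as many bits
```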
Reason three: different ways of working
Registers work very simply, in just two steps: (1) find the relevant bits; (2) read them.
The way memory works is much more complicated:
(1) Find the pointer to the data. (The pointer may itself be stored in a register, so this step already includes all the work a register does.)
(2) Send the pointer to the memory management unit (MMU), which translates the virtual memory address into an actual physical address.
(3) Send the physical address to the memory controller, which works out which memory bank the address lives in.
(4) Determine which memory chunk the data is in, and read the data from that chunk.
(5) Send the data back to the memory controller, then back to the CPU, where it can finally be used.
Memory's workflow has many more steps than a register's. Each step adds delay, and the delays accumulate, making memory much slower than registers. To soften this huge speed gap, hardware designers have put in a lot of effort: placing caches inside the CPU, optimizing how the CPU works, fetching more than one piece of data per memory access, and so on.
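How the per-step delays accumulate can be sketched numerically. The latencies below are invented for illustration only — they are not measurements of any real hardware:

```python
# Illustrative (made-up) per-step delays for one memory read, in nanoseconds.
register_read_ns = 0.3
memory_read_steps_ns = {
    "find pointer (register read)": 0.3,
    "MMU virtual->physical translation": 1.0,
    "memory controller locates bank": 2.0,
    "read chunk from DRAM": 10.0,
    "send data back to CPU": 2.0,
}
memory_read_ns = sum(memory_read_steps_ns.values())

print(memory_read_ns)                               # 15.3
print(round(memory_read_ns / register_read_ns))     # 51 — every step's delay adds up
```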
Cache was mentioned above, and most programmers know the caching concept from software architecture. Here, though, cache means the hardware "buffer memory": a level of storage that sits between main memory and the CPU (a common CPU spec lists a first-level cache and a second-level cache; well-equipped servers also have a third level). It is built from static RAM (SRAM) chips; its capacity is small, but it is much faster than main memory, approaching the speed of the CPU. In the hierarchical structure of a computer's storage system, it is a small, high-speed memory between the central processing unit and main memory, and together with main memory it forms one level of storage. Scheduling and transfer of information between the cache and main memory is handled automatically by the hardware. The most important technical metric of a cache is its hit ratio — for example, a first-level cache hitting a = n × 80% of n accesses, a second level b = a × 80%, a third level c = b × 80%. A "hit" means the requested data is found in the cache.
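The 80% chain in the source is ambiguous; one plausible reading is that each cache level satisfies 80% of the requests that reach it (i.e. of the misses from the level above). Under that interpretation:

```python
# Each cache level hits 80% of the requests that reach it (one interpretation
# of the 80% chain in the text; the numbers are illustrative).
n = 1000                                       # total memory accesses
hit_rate = 0.80
l1_hits = n * hit_rate                         # 800 served by L1
l2_hits = (n - l1_hits) * hit_rate             # 160 of the 200 L1 misses
l3_hits = (n - l1_hits - l2_hits) * hit_rate   # 32 of the 40 remaining

print(l1_hits, l2_hits, l3_hits)               # 800.0 160.0 32.0
print(n - l1_hits - l2_hits - l3_hits)         # 8.0 — only 8 requests reach main memory
```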
Since it is a cache, its size is necessarily limited — not everything the CPU needs can be hit in the cache — so the cache needs a replacement policy. As follows:
1. By the principle of program locality: a running program tends to keep using the instructions and data it used most recently. This gives the replacement policy its theoretical basis. Weighing hit ratio, implementation difficulty, and speed, the common replacement policies are the random method, the FIFO method, and the least-recently-used method.
(1). Random method (RAND)
The random method picks the block to replace at random: a random-number generator decides which block is evicted. This method is simple and easy to implement, but its hit ratio is relatively low.
(2). First-in-first-out method (FIFO)
The FIFO method replaces the block that was loaded earliest. A block that was loaded first but is still being hit can thus be evicted prematurely, which violates locality. This method is better than the random method but still falls short; it is, however, easy to implement.
(3). Least-recently-used method (LRU)
The LRU method tracks each block's usage and always replaces the block that has gone unused for the longest time. It reflects program locality well. There are several ways to implement an LRU policy.
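FIFO and LRU can be compared on a tiny trace. The cache size and access sequence below are invented for illustration:

```python
from collections import OrderedDict

def fifo_hits(trace, size):
    """Count hits with first-in-first-out replacement."""
    cache, hits = [], 0
    for block in trace:
        if block in cache:
            hits += 1                        # hit: FIFO order does not change
        else:
            if len(cache) == size:
                cache.pop(0)                 # evict the oldest-loaded block
            cache.append(block)
    return hits

def lru_hits(trace, size):
    """Count hits with least-recently-used replacement."""
    cache, hits = OrderedDict(), 0
    for block in trace:
        if block in cache:
            hits += 1
            cache.move_to_end(block)         # mark as most recently used
        else:
            if len(cache) == size:
                cache.popitem(last=False)    # evict the least recently used
            cache[block] = True
    return hits

trace = [1, 2, 3, 1, 2, 4, 1, 2]   # a loop that keeps reusing blocks 1 and 2
print(fifo_hits(trace, 3), lru_hits(trace, 3))   # 2 4 — LRU keeps the hot blocks
```

On this trace FIFO evicts block 1 just before it is reused, while LRU protects the recently used blocks, which is exactly the locality argument made above.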
2. In a multi-bank parallel storage system, I/O devices have higher priority for main-memory access than the CPU does, so the CPU can end up waiting for an I/O device's memory access — possibly for several main-memory cycles — which lowers CPU efficiency. To avoid the CPU and I/O devices competing for memory, a cache can be added between the CPU and main memory: main memory sends the information the CPU will need to the cache ahead of time, and once main memory is busy exchanging data with an I/O device, the CPU reads what it needs directly from the cache without waiting, preserving efficiency.
3. The replacement algorithms proposed so far fall into three categories (the first category is the key one to master):
(1) Traditional replacement algorithms and their direct descendants. Representative algorithms: ① LRU (Least Recently Used): evict the content used least recently; ② LFU (Least Frequently Used): evict the content accessed least often; ③ if everything in the cache was cached on the same day, evict the largest document, otherwise fall back to LRU; ④ FIFO (First In First Out): when the cache is full, evict whatever entered the cache first.
(2) Replacement algorithms based on key properties of the cached content. Representative algorithms: ① SIZE: evict the largest object; ② LRU-MIN: try to minimize the number of documents replaced — if the document to be cached has size S, apply LRU among the cached documents of size at least S; if there is no object of at least S, apply LRU among documents of size at least S/2, and so on; ③ LRU-Threshold: the same as LRU, except documents larger than a certain threshold are never cached; ④ Lowest Latency First: evict the document with the lowest access latency.
(3) Cost-based replacement algorithms, which score each cached object with a cost function and choose the victim by that score. Representative algorithms: ① Hybrid: assign each cached object a utility function and evict the object with the lowest utility; ② Lowest Relative Value: evict the object with the lowest relative value; ③ Least Normalized Cost Replacement (LCNR): use an inference function over document access frequency, transfer time, and size to choose the victim; ④ Bolot et al. proposed a weighted inference function over transfer time, size, and time since last access to choose the victim; ⑤ Size-Adjusted LRU (SLRU): sort cached objects by cost-to-size ratio and evict the object with the lowest ratio.
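The SLRU idea in ⑤ — evicting the object with the lowest cost-to-size ratio — can be sketched like this. The object names, costs, and sizes are made up for illustration:

```python
# Size-adjusted eviction: drop the cached object with the lowest cost/size ratio.
# Costs (expense of refetching) and sizes are invented for illustration.
cache = {
    "a.html": {"cost": 10, "size": 2},   # ratio 5.0
    "b.jpg":  {"cost": 12, "size": 6},   # ratio 2.0  <- cheapest per byte held
    "c.css":  {"cost": 3,  "size": 1},   # ratio 3.0
}

def evict_lowest_ratio(cache):
    """Remove and return the object whose cost/size ratio is lowest."""
    victim = min(cache, key=lambda k: cache[k]["cost"] / cache[k]["size"])
    del cache[victim]
    return victim

print(evict_lowest_ratio(cache))   # b.jpg — big but cheap to refetch
print(sorted(cache))               # ['a.html', 'c.css']
```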