Numbers, bases, and representations


integers

Integers are the familiar numbers ..., -2, -1, 0, +1, +2, .... Integers are also referred to as 'whole numbers'. They divide into the positive numbers (1 on up), the negative numbers (-1 on down), zero itself, the non-negative numbers (0 or positive), and the rarely-mentioned non-positive numbers (0 or negative). The distinction between positive and non-negative is often quite important; for instance, C array subscripts are non-negative, explicitly including 0.

bases

When we write integers (and other numbers) we normally use 'radix 10', or 'decimal', arithmetic. This is a positional notation, in which each 'place' is worth 10 times the next. The last digit counts ones, the next-to-last counts tens, and so on: the digit sequence 593 means 'five hundreds, nine tens, and three ones', or five hundred ninety-three. Mathematically, we have a base b (normally a positive integer, here 10) and a sequence of n digits a_{n-1}, a_{n-2}, ..., a_1, a_0, which denotes the value a_{n-1}*b^{n-1} + a_{n-2}*b^{n-2} + ... + a_1*b^1 + a_0*b^0. (Note that b^1 = b and b^0 = 1; we could simplify those terms, but the symmetry is prettier.)
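
To make the positional formula concrete, here is a minimal C sketch (digits_value is a hypothetical helper written for this page, not a standard function). It evaluates a digit sequence in a given base using Horner's rule, which computes the same sum as the formula with fewer multiplications:

    #include <stdio.h>

    /* Evaluate the digit sequence d[0..n-1], most significant digit
       first, in base b: the sum of d[i] * b^(n-1-i). */
    static unsigned long digits_value(const int *d, int n, int b) {
        unsigned long value = 0;
        for (int i = 0; i < n; i++)
            value = value * b + d[i];   /* Horner's rule */
        return value;
    }

    int main(void) {
        int d[] = { 5, 9, 3 };
        printf("%lu\n", digits_value(d, 3, 10));   /* prints 593 */
        return 0;
    }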

Note that in this mathematical notation, writing extra 0s in front of a number does not affect its value: 0042 has zero thousands, zero hundreds, four tens, and two ones, which is the same as 42. We normally discard leading zeros, since they add nothing useful.

Modern computers internally use binary, or radix 2. The digit sequence 100110 represents 1*2^5 + 0*2^4 + 0*2^3 + 1*2^2 + 1*2^1 + 0*2^0, or 32 + 4 + 2, or 38. You will need to become familiar with binary, or at least with some of the smaller powers of two (1, 2, 4, 8, 16, 32, 64, 128, 256, and so on).

As a C programmer, you also need to be familiar with two more bases: 8 (also called 'octal') and 16 ('hexadecimal'). Bases greater than 10 have a notation problem: a single digit position must be able to hold a value greater than 9. C uses the letters A through F (in either upper or lower case) to represent the digit values 10 through 15.

In C, the number forty-two can be written as usual in decimal, or in octal or hexadecimal. In octal, 42 takes five 8s and two 1s, so we write 52 rather than 42. In hexadecimal we need two 16s and ten 1s, so we write 2A. Of course, if you write '52', people may think you mean fifty-two rather than forty-two; as with digit sequences in general, we need some way of making the base explicit. In mathematics we use subscripts: 42₁₀ = 52₈ = 2A₁₆. Some assembly languages use suffixes instead of subscripts; C uses prefixes. To indicate that a number is hexadecimal, we precede it with a digit zero and an x (either case). To indicate that it is octal, we precede it with a digit zero alone. So in C, 42, 052, and 0x2a all represent the abstract number 'forty-two'.
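
A short C example to confirm this; the three literals all denote the same abstract number, and printf can render that number in any of the three bases:

    #include <stdio.h>

    int main(void) {
        int d = 42;      /* decimal */
        int o = 052;     /* octal: five 8s and two 1s */
        int h = 0x2a;    /* hexadecimal: two 16s and ten 1s */

        printf("%d %d %d\n", d, o, h);   /* prints: 42 42 42 */
        printf("%d %o %x\n", d, d, d);   /* prints: 42 52 2a */
        return 0;
    }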

Note that the number itself does not change when we represent it in a different base. This fact is quite important: the number itself is an abstraction, and the various representations are concrete renderings of that abstraction.

A useful property of binary (base 2) numbers is that each digit is always either 0 or 1. Zero times anything is zero, and one times anything is that thing, so there is no real 'digit arithmetic': only the positions holding a 1 contribute, and each power of two is simply included or not. When counting, the lowest digit just alternates, and each higher digit flips whenever the digit to its right goes from 1 to 0: 0000, 0001, 0010, 0011, 0100, 0101, 0110, 0111, 1000, 1001, 1010, .... For a given number of binary digits, the largest possible value has all digits 1 (and the smallest, naturally, all 0s), and that largest value is exactly 1 less than the next power of two: for instance, 11₂ = 3₁₀, which is 1 less than 100₂ = 4₁₀. If you remember a few of the powers of two, you can immediately say how many bits a given value requires. For instance, since 2^11 is 2048 and 2^12 is 4096, an integer that is at least 2048 but less than 4096 requires 12 bits. (Extra bits are allowed, of course: the extra leading bits just have to be 0.)

The C language has no way to write or print binary numbers directly, but octal and hexadecimal come close enough.
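
If you do want binary output, it is easy enough to produce by hand. Here is a minimal sketch (print_binary is a hypothetical helper, not part of the standard library) that walks from the highest set bit down:

    #include <stdio.h>

    /* Print v in binary, most significant digit first. */
    static void print_binary(unsigned int v) {
        unsigned int bit = 1;
        while (bit <= v / 2)    /* find the highest power of 2 <= v */
            bit <<= 1;          /* (the v/2 keeps bit from overflowing) */
        do {
            putchar(v & bit ? '1' : '0');
            bit >>= 1;
        } while (bit != 0);
        putchar('\n');
    }

    int main(void) {
        print_binary(38);   /* prints 100110 */
        return 0;
    }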

integer representations

Today's computers store numeric values in binary. Representing non-negative integers this way is easy, up to some limit. If a machine has 8-bit bytes and uses 4 of those bytes to store an integer, it has 32 binary digits available. From the positional notation above, we can see immediately that this can represent values up to 2^31 + 2^30 + ... + 2^1 + 2^0, which is 2^32 - 1, or (in decimal) 4,294,967,295.
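
That arithmetic is easy to check in C, assuming the fixed-width types of <stdint.h> are available:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        uint64_t max32 = ((uint64_t)1 << 32) - 1;       /* 2^32 - 1 */
        printf("%llu\n", (unsigned long long)max32);    /* prints 4294967295 */
        return 0;
    }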

That takes care of non-negative integers, but what about negative ones? Over time, many schemes have been used to represent negative integers on binary computers. Perhaps the simplest in concept is 'sign and magnitude', also called 'signed magnitude'. The other two are called 'one's complement' and 'two's complement'. All three methods are permitted in C. In each of the three, we use one bit to represent the sign and the remaining bits to represent a value. Which bit we pick does not really matter; for notational convenience we tend to pick the 'first' one. (This later provides some entertaining wrangling when we consider machine 'byte order', because which bit is 'first' depends on whether you view the bits and bytes from the low end or the high end.)

In sign-magnitude representation, you simply read the value from the 'value' bits, and call the number negative if the sign bit is set. So the bit sequence 1 (the sign) followed by 011 (the value) has value 3₁₀ and is negative, meaning 'negative three'. To represent positive three, we turn the sign bit off and get +3. One drawback of this method is that it has two zeros: the regular, or 'positive', zero (sign and value bits all 0), and a 'negative' zero (value bits 0, sign bit set). A more serious drawback is that the representation makes arithmetic complicated, which makes computers slow. Before adding two numbers, we must check whether they have the same sign; if not, we must subtract the negative one rather than just adding. (This does, of course, offer a chance to detect overflow, but as we see over and over in computing, it is apparently more important to get an answer fast than to get the right answer.)

One's complement avoids the second defect. Here, to represent a negative number, we again set the sign bit, but this time we also invert all the value bits (turning 1s to 0s and vice versa). Now, for negative three, we set the sign bit and invert the value bits, giving 1 (the sign bit) followed by 100 (the value bits). When we add two such numbers, we need an 'end-around carry': a carry out of the sign bit is brought back around and added in at the lowest bit.

Observe how this works as we add -3 plus +2, -3 plus -1, and -3 plus +4. Where a sum produces a carry out of the sign bit, the carry is shown brought back around and added in at the low end:

       1100 (-3)
     + 0010 (+2)
       ----
       1110 (-1)

       1100 (-3)
     + 1110 (-1)
       ----
     1 1010
         + 1        (end-around carry)
       ----
       1011 (-4)

       1100 (-3)
     + 0100 (+4)
       ----
     1 0000
         + 1        (end-around carry)
       ----
       0001 (+1)

The sums in the first two cases are negative, and in the third positive. There are no carries in the first addition, so summing is easy. In the second addition, the two highest value bits are both 1, producing a carry into the sign bits (which are both set), so the resulting sign bit is still set; adding the two sign bits produces a carry out as well, which is sent around to the end and added to the lowest value bits (here both 0), so the lowest bit of the final sum is 1. The last example is similar: the two highest value bits again produce a carry, leaving 0 behind. This time, though, adding that carry into the (one set) sign bit produces another carry and leaves the sign bit 0. The carry out of the sign bit is sent around to the end and added into the lowest bit.
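
Here is a small C sketch of the rule (add_ones_complement4 is a hypothetical helper that simulates 4-bit one's complement patterns inside an ordinary unsigned int; it is an illustration, not how C arithmetic works on today's machines):

    #include <stdio.h>

    /* Add two 4-bit one's complement patterns (0..15),
       applying the end-around carry when one occurs. */
    static unsigned add_ones_complement4(unsigned a, unsigned b) {
        unsigned sum = a + b;          /* ordinary binary addition */
        if (sum > 0xF)                 /* carry out of the sign bit? */
            sum = (sum & 0xF) + 1;     /* bring it around to the low end */
        return sum & 0xF;
    }

    int main(void) {
        /* 1100 is -3 and 1110 is -1 in 4-bit one's complement: */
        printf("%x\n", add_ones_complement4(0xC, 0xE));   /* prints b: 1011 = -4 */
        return 0;
    }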

As you may have noticed, this method also has two representations of zero: the normal, everyday 0, and a 'negative zero' with the sign bit (in fact every bit) set. On some one's complement machines, an attempt to use '-0' is caught at runtime (sometimes optionally); this makes it possible to catch uses of 'improperly set' variables, simply by pre-filling all of memory with all-1-bits. (A rather similar trick exists in a different form on today's computers, as we will see below.)

Let us look at a few more sums in one's complement arithmetic. Note that when the lowest value bits of both addends are 1, their sum bit is 0 (with a carry into the next value bit up), but the end-around carry then turns that lowest bit back into a 1:

       11001 ( -6)
     + 11011 ( -4)
       -----
     1 10100
         + 1        (end-around carry)
       -----
       10101 (-10)

(We have had to go to five bits in total, or four value bits, here.) Likewise, if only one of the two lowest bits is 1, their sum is 1 with no carry; but this time an end-around carry can land on that 1 and set off a ripple of carries. The worst case occurs when adding 'negative zero' to a value:

       00010 (+2)
     + 11111 ( -0)
       -----
     1 00001
         + 1        (end-around carry)
       -----
       00010 (+2)

In particular, adding -0 to any number other than +0 gives back the original value; +0 plus -0, however, gives -0.

With five bits, the allowed range is -15 to +15:

       10111 ( -8)
     + 10000 (-15)
       -----
     1 00111
         + 1        (end-around carry)
       -----
       01000 (+8)

Here we really do get a wrong answer, though not because of one's complement itself: the true result simply does not fit. It should be -23.

The last method for representing negative integers is two's complement, which is by far the most common today. (Incidentally, see Knuth for the reason the apostrophes properly sit in different places in "ones' complement" and "two's complement".) This method works much like one's complement, except that when the sign bit is set, we invert the remaining bits and then add 1 to get the value represented. So while -3 in four-bit one's complement is 1100, in two's complement it becomes 1101. The extra step, the 'add 1', does away with the end-around carry: now we can add any two values directly, simply discarding any carry out of the sign bit:

       11010 ( -6)
     + 11100 ( -4)
       -----
     1 10110 (-10)      (the carry out of the sign bit is simply discarded)

This means we can add signed integers with the very same machine instruction we use for unsigned integers. That is not true of one's complement: consider the first example above, -6 plus -4 giving -10. The end-around carry produced the correct one's complement result (-10); but if the two five-bit sequences are read as unsigned integers, their values are 25 and 27, and their sum is 52, in binary 110100. That value needs six bits rather than five, and its low five bits are 10100, yet the one's complement addition produced 10101. In the two's complement example, -6 and -4 read as unsigned are not 25 and 27 but 26 and 28; their sum is 54, binary 110110. Now the low five bits, 10110, are exactly right (plus there is a carry out, which indicates that the result does not fit in an unsigned five-bit sequence). This wrap-around arithmetic is precisely what C uses to implement its unsigned integers.
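
A short sketch of that equivalence, using the same five-bit patterns (the masking with 0x1F stands in for a five-bit machine word; this is an illustration rather than normal C style):

    #include <stdio.h>

    int main(void) {
        unsigned a = 0x1A, b = 0x1C;     /* 11010 and 11100: 26 and 28 unsigned */
        unsigned sum = (a + b) & 0x1F;   /* keep 5 bits, discard the carry */
        printf("%u\n", sum);             /* prints 22, i.e. 10110 */

        /* Read 10110 as 5-bit two's complement: invert and add 1
           to recover the magnitude, -(01010) = -10. */
        unsigned magnitude = (~sum + 1) & 0x1F;
        printf("-%u\n", magnitude);      /* prints -10 */
        return 0;
    }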

For the most part these are fairly minor issues, because as programmers working in a high-level language we rarely have to care how the machine implements numbers underneath. But there are places where the details poke through, and there, understanding these representations and their shortcomings becomes crucial. For instance, sign-magnitude and one's complement both have a 'negative zero' to deal with, while two's complement has a different problem: it can represent one more negative number than positive.

In sign-magnitude, a set sign bit with all value bits 0 is '-0'. In one's complement, the all-1-bits pattern is '-0'. In two's complement, all-1-bits is -1 — so what does a set sign bit followed by all 0s represent? The answer is that it is a negative value with no corresponding positive value. If we take the bit pattern 1000...00 and negate it, we invert it to 0111...11 and add 1, and the result is 1000...00 again. In 16 bits, a sign bit of 1 followed by all 0s represents -32768, but the largest positive number is +32767; negating -32768 gives back -32768 (unless the machine catches the error for you, which it almost never does; besides, it is usually more important to get the wrong answer as fast as possible).
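
The asymmetry is easy to demonstrate. In this sketch the negation is done on an unsigned 16-bit copy of the value, because negating the most negative int16_t directly would overflow, which C leaves undefined:

    #include <stdio.h>
    #include <stdint.h>

    int main(void) {
        int16_t most_negative = -32768;             /* 16-bit range: -32768..+32767 */
        uint16_t bits = (uint16_t)most_negative;    /* bit pattern 0x8000 */
        uint16_t negated = (uint16_t)(~bits + 1);   /* invert and add 1 */
        printf("%#x -> %#x\n", (unsigned)bits, (unsigned)negated);
        /* prints: 0x8000 -> 0x8000 -- negating -32768 gives -32768 back */
        return 0;
    }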

All three representations are flawed, then: sign-magnitude is easy to understand, but requires a lot of extra hardware and has a '-0'; one's complement requires less extra hardware (just the end-around carry), but like sign-magnitude needs separate instructions for signed and unsigned arithmetic, and still has a '-0'; and two's complement has one negative number with no corresponding positive value. But two's complement needs no extra hardware, can be made very fast, and can use a single simplified instruction set to get the wrong answer as fast as possible — so that is what most modern computers do.

Incidentally, notice that I have been describing negation procedurally, as inverting bits (and possibly adding 1). There is also a rather elegant mathematical reading of the positional sum given earlier. Instead of contributing a_{n-1}*2^{n-1}, the most significant 'sign' bit contributes -(2^{n-1}) in two's complement, or -(2^{n-1} - 1) in one's complement. For a 4-bit binary number, for example, the low three bits represent 4, 2, and 1 (so those three bits max out at 7), but the first bit represents -8 or -7 rather than +8. The sequence 1001 thus represents -8+1, or -7, in two's complement, and 1000 represents -8; in one's complement those same sequences represent -7+1, or -6, and -7. All-1-bits in two's complement is -8+7, or -1, while in one's complement it is -7+7, or 0 (with the sign bit set, making it -0 rather than +0).

fixed-point numbers

The C language has no built-in fixed-point type, but fixed-point numbers are easy to build from the integer types. For a binary fixed-point number, we simply declare that some of the bits of the number fall after a 'binary point'. Continuing the positional notation above, the digits before the point represent 2^{n-1}, 2^{n-2}, and so on down to 2^0, while the digits after it represent 2^{-1}, 2^{-2}, and so on. Since 2^{-1} is 0.5, 2^{-2} is 0.25, and 2^{-3} is 0.125, we can represent 4.5, 4.25, and 4.75 — but depending on how many bits we have, we may not be able to represent some values exactly. If the last two bits fall after the binary point, then 10000 means 4.00, 10001 means 4.25, 10010 means 4.50, and 10011 means 4.75, but no bit pattern can represent 4.10. In effect we 'pre-multiply' (or scale) every number by 2^k, where k is the number of digits after the point: 4.5 times 4 is 18, binary 10010, but 4.1 times 4 is 16.4, and the '.4' part cannot be represented.

C's unsigned integer types work well for fixed-point numbers, which are then of course unsigned. If you need negative fixed-point numbers, you might simply use signed arithmetic instead. This mostly works, but watch what happens when you multiply or divide two fixed-point numbers: each already includes a scale factor of 2^k, so their product includes a factor of 2^{2k}, while a quotient loses the scale factor entirely. Depending on what you do with the result, you may need to re-scale it. The obvious way to do that is with shift operations, but shifts are only fully well-defined for unsigned integers. If the underlying hardware uses one's complement or sign-magnitude, you will usually get wrong answers; even on two's complement, shifting a negative number can sometimes produce a wrong answer. (Or you can console yourself with the thought that you are getting that wrong answer as fast as possible.)
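
Here is a minimal unsigned fixed-point sketch with k = 2 bits after the binary point (every value is pre-multiplied by 4). Note how the product needs one scale factor divided back out, and how precision is lost in the process:

    #include <stdio.h>
    #include <stdint.h>

    #define SCALE 4u   /* 2^k with k = 2 fraction bits */

    int main(void) {
        uint32_t a = 18;                  /* 4.50 * 4, binary 100.10 */
        uint32_t b = 9;                   /* 2.25 * 4 */

        uint32_t sum  = a + b;            /* 27: sums need no adjustment */
        uint32_t prod = (a * b) / SCALE;  /* 162/4 = 40, truncated from 40.5:
                                             4.5 * 2.25 = 10.125 needs a third
                                             fraction bit, so we get 10.00 */

        printf("%.2f %.2f\n", sum / 4.0, prod / 4.0);   /* prints 6.75 10.00 */
        return 0;
    }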

floating-point numbers

Floating-point numbers come in a bewildering variety of implementations, at least on older hardware, and C allows nearly all of them. Covering all the details is far beyond the scope of this page, so I will deliberately skip sad messes like IBM's 'hexadecimal floating point' and concentrate on the IEEE floating-point formats found on most microprocessors today.

Consider a typical decimal 'scientific notation' number, 7.2345 x 10^12, often written 7.2345E12 to avoid the superscript. It has two parts: a significand (also called the 'mantissa', although strictly speaking the mantissa here would be just 0.2345, while the significand is the whole 7.2345) and an exponent — here 12. This notation has some useful properties, only some of which carry over to computer representations. The first is that in this normalized format there is exactly one digit before the decimal point; we can use this to avoid storing the 'point' character in the computer at all. One property that does not carry over, at least not directly, is the notion of 'significant digits': 7.2345 has five digits (not counting the decimal point), so the number is taken as accurate to five places; if it were accurate to six, we would write 7.23450E12, the extra 0 changing nothing about the value. A third property is that multiplication and division become easy: we multiply or divide the significands, and simply add or subtract the exponents. Curiously, addition and subtraction become harder. Adding 1.4171901E7 and 3.3E1 requires that we 'denormalize' (so called because the point is no longer next to the first digit), or 'rescale', one of the numbers: 3.3E1 first becomes 0.0000033E7, and then we add the two, getting 1.4171934E7. (We could instead denormalize the other operand, or even both, adding and then renormalizing the result, but with extreme exponents things get tricky. Subtracting two nearly equal numbers is also delicate: 1.99999E5 minus 1.99998E5 gives 0.00001E5, which has to be renormalized to 1.00000E0.)

Scientific notation is, in effect, a decimal floating-point format. It is easy to fall into a trap here: computer arithmetic results are usually printed in decimal floating-point form, which makes it easy to believe the numbers are stored that way too. In fact, most computers today use a binary floating-point format internally.

Consider the number 4.5 in binary floating point. As with the fixed-point numbers above, 4.5 is 2^2 + 2^{-1}, in binary notation '100.1'. All we need do now is 'normalize' the number, writing it (with the character 'b' playing the role of 'E') as '1.001b+2'. In other words, we have one one, zero halves, zero quarters, and one eighth (1.125), multiplied by 2 squared, or 4. With the fixed-point numbers we simply scaled everything by 4; the difference between fixed point and floating point is that the scale factor can change. To represent 9.0, we scale by 8 (2^3): one one, zero halves, zero quarters, and one eighth, times 8, gives nine, or 1.001b+3.

In scientific-notation decimal floating point, scaling by a power of ten is trivial: we just change the exponent. Just so, in binary floating point, scaling by a power of two is trivial. Since 9 is twice 4.5, we merely add one to the exponent; to get 18 we increment the exponent again. The significand stays 1.001 throughout; only the exponent changes.
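
The C library exposes exactly this operation: ldexp scales a double by a power of two (exactly, since only the exponent changes), and frexp splits a value back into significand and exponent. One caveat: frexp normalizes its significand into the range [0.5, 1) rather than the [1, 2) used in the text.

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        double x = 4.5;
        printf("%g %g\n", ldexp(x, 1), ldexp(x, 2));   /* prints 9 18 */

        int e;
        double m = frexp(x, &e);
        printf("%g * 2^%d\n", m, e);   /* prints 0.5625 * 2^3 */
        return 0;
    }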

Oddly enough, binary floating point is downright terrible at representing a number like one-tenth. In decimal this is easy: 1.0E-1. In binary, however, one-tenth is one sixteenth (0.0625) plus one thirty-second (0.03125) plus zero sixty-fourths (0.015625; this takes us on to the next) plus zero 1/128s (0.0078125; and on again) plus one 1/256 plus one 1/512 plus zero 1/1024s, and so on: its binary representation is 1.100110011001100...b-4, with the sequence '1100' repeating forever, much as one-third in decimal is 3.33333...E-1 with an infinite string of 3s. (It is easy to show, in fact, that any decimal fraction that does not end in the digit 5 has this problem.) Fortunately, integer values — at least 'small enough' ones — can always be represented exactly in binary floating point.
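
The classic demonstration: summing ten copies of 0.1 in IEEE double arithmetic does not give exactly 1.0, because each 0.1 is really the nearest representable binary fraction. (The exact digits printed may vary slightly between implementations.)

    #include <stdio.h>

    int main(void) {
        double sum = 0.0;
        for (int i = 0; i < 10; i++)
            sum += 0.1;                /* 0.1 is not exactly representable */
        printf("%.17g\n", sum);        /* typically prints 0.99999999999999989 */
        printf("%d\n", sum == 1.0);    /* prints 0: not equal */
        return 0;
    }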

Note that if we always write binary floating-point numbers in this 'normalized' format, not only is the binary point always the second character, but the first character is always a '1'. That means we can omit not just the storage for the binary point, but also the storage for that first digit. So inside the computer, the number 1.1011b+4 (which represents 16+8+2+1, or 27) can be stored as the bit sequence '1011' plus an exponent (+4). Accordingly, IEEE binary floating-point storage is divided into two parts: a mantissa (the term is now apt, since the leading digit is no longer stored) and an exponent. Of course, we have a nasty sign problem to handle: both the mantissa and the exponent are signed, since numbers can have the forms -1.1011b-4 and +1.1011b+4. The trick the IEEE format uses here is a little unusual: the mantissa is stored in sign-magnitude form, like the integers on some older computers, while the exponent is stored in 'excess' or biased form, in which a fixed bias is added to every exponent to make the stored value non-negative.

Without going into all the remaining details, this means a floating-point number has three parts (not four): one bit for the sign of the mantissa, some fixed number of bits for the mantissa itself, and the remaining bits for the exponent, in excess-something form, where the 'something' depends on how many bits remain. For example, IEEE single precision is a 32-bit format with 1 bit for the mantissa's sign, 23 bits for the mantissa, and 8 bits for the exponent. Since eight unsigned bits can represent 0 through 255, you might expect the exponent to be stored in 'excess 128' format. For reasons I do not intend to explain, the actual bias is 127, so a stored exponent of 1 represents 2^{1-127}, that is, 2^{-126}. As we will see, a stored exponent of 0 is reserved for another purpose.

Because we use the normalized format, the 23 mantissa bits actually store 24 significant bits, thanks to the implied leading '1' and binary point. Our numbers always have the form: plus-or-minus (per the sign bit), one, point, some digits, 'b', plus-or-minus some number (the corrected exponent). Or, perhaps easier to follow: if the stored mantissa bits are m and the stored (uncorrected) exponent is e, the value is 1.m * 2^{e-127}, negated if the sign bit is set.
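
Here is a sketch that pulls a single-precision value apart into its three fields; it assumes float is the 32-bit IEEE format described above (true on most machines today, though C itself does not promise it):

    #include <stdio.h>
    #include <string.h>
    #include <stdint.h>

    int main(void) {
        float f = 4.5f;                    /* 1.001b+2 from the text */
        uint32_t bits;
        memcpy(&bits, &f, sizeof bits);    /* reinterpret the bytes */

        uint32_t sign     = bits >> 31;
        uint32_t exponent = (bits >> 23) & 0xFF;   /* stored excess-127 */
        uint32_t mantissa = bits & 0x7FFFFF;       /* the implied 1 is absent */

        printf("sign=%u exponent=%u (2^%d) mantissa=%#x\n",
               (unsigned)sign, (unsigned)exponent,
               (int)exponent - 127, (unsigned)mantissa);
        /* prints: sign=0 exponent=129 (2^2) mantissa=0x100000 */
        return 0;
    }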

If you have been paying attention, or already know this material, you will see a problem here. If the mantissa always means 1-point-something, how do we represent zero? The best we could manage would be the smallest normalized number, using the minimum normal exponent (a stored value of 1, meaning 2^{-126}): 1.000...000b-126, which is about 1.17549435e-38. So we steal a few values, declaring certain exponents to be 'special cases'. Specifically, an all-0-bits exponent, which would otherwise mean b-127, is used for two special cases, and — for (possibly misguided) symmetry — an all-1-bits exponent, which would otherwise mean b+128, is used for two more.

The first two special cases are 'zero' and the denormalized (or subnormal) numbers: with a stored exponent of 0, an all-0-bits mantissa represents 0.0 (or -0.0 if the sign bit is set), while a nonzero mantissa is taken as 'denormalized'. Such a number represents 0.m * 2^{-126} rather than 1.m * 2^{-127} (or, if you prefer, move the binary point one place right and make the exponent -127). The smallest nonzero single-precision number is therefore 0.00000000000000000000001b-126 (twenty-two 0s after the binary point, then a 1), which is 2^{-149}, or about 1.4e-45.
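
A sketch of the bottom of the single-precision range, using C99 hexadecimal floating constants (the printed values are approximate, and assume IEEE arithmetic without flush-to-zero modes):

    #include <stdio.h>

    int main(void) {
        float smallest_normal = 0x1p-126f;      /* 1.000...0b-126 */
        float denormal = smallest_normal / 2;   /* 0.100...0b-126 */
        float tiniest = 0x1p-149f;              /* the last denormal */

        printf("%g %g %g\n", smallest_normal, denormal, tiniest);
        /* prints roughly: 1.17549e-38 5.87747e-39 1.4013e-45 */

        printf("%g\n", tiniest / 2);   /* below the last denormal: prints 0 */
        return 0;
    }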

The last two special cases use an all-1-bits exponent, which indicates infinity if all the mantissa bits are 0, and 'not a number', or 'NaN', if any mantissa bits are set. As a useful trick, if every bit of the number is set, the exponent is certainly all 1s and the mantissa clearly has at least some nonzero bits (in fact, all of them), so the value is a NaN. Setting memory to all-1-bits can therefore catch some uses of uninitialized variables, just as '-0' could on one's complement machines.

Note that infinities are signed — positive infinity is greater than every other number, and negative infinity less than every other — but although NaNs carry a sign bit, they are not considered signed. Not every implementation gets this right. To complicate things further, NaNs are divided into 'quiet NaNs' and 'signaling NaNs': a QNaN used in a computation simply produces another QNaN as its result, while an SNaN is meant to be caught as a runtime error. Again, not every implementation gets this right.

An operation whose result would exceed the largest representable value overflows to positive infinity: 1e38 times 1e1, for instance, overflows single precision, producing positive infinity. Results beyond the most negative representable number likewise overflow to negative infinity (so -1e38 * 1e1 is negative infinity).
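
A demonstration of overflow to infinity, and of a NaN arising from it (infinity minus infinity has no sensible value). As noted above, the sign a NaN happens to carry is not meaningful:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float big = 1e38f;
        float inf = big * 10;                /* overflows float: +infinity */
        printf("%g %g\n", inf, -big * 10);   /* prints inf -inf */

        float n = inf - inf;                 /* infinity - infinity: NaN */
        printf("%g\n", n);                   /* prints nan (sign may vary) */
        printf("%d %d\n", isnan(n) != 0, n == n);   /* prints 1 0 */
        return 0;
    }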

With an exponent as large as b+127, single-precision numbers run up to about 1e38. (The exact maximum is 340282346638528859811704183484516925440, or about 3.4e+38: the value of 1.m * 2^{e-127} when m is all 1-bits and e is 254.) Double precision typically uses 64 bits, divided into 1 sign bit, 52 mantissa bits, and 11 exponent bits (stored excess-1023, since 2^10 is 1024); with the larger exponent, the maximum is about 1e308. Extended precision, where it exists, occupies 80 bits on some systems (including IA-32) and 128 bits on others. Both the mantissa and the exponent widen — the exponent usually gets 15 bits and the mantissa the remainder (which therefore differs between systems) — and the largest number is about 1e4932. The Intel IA-32 architecture's floating-point unit is unusual in that it uses 80 bits internally for all numbers, converting to and from the 32- and 64-bit formats as values are loaded and stored. There are control bits to suppress the extra significand precision, but none to suppress the extra exponent range. This means that some computations that ought to overflow will actually produce a finite answer. For example, multiplying the double values 1e200 and 1e200 overflows, so the result should be positive infinity rather than 1e400 — and dividing that result by 1e300 should still give positive infinity, not 1e100. On many Intel chips, the only way to force the overflow to happen is to store the intermediate result to memory, which slows the computation down. The C code looks like this:

    double a = 1.0e200, r;
    r = a * a;           /* overflows double: the result should be +infinity */
    r = r / 1.0e300;     /* and infinity / 1e300 should still be infinity */
    printf("%g\n", r);

This should print an infinity ('inf' with most C libraries), but on many Intel machines, with many C compilers, it actually prints '1e+100'. (Once again we see how important it is to get the wrong answer as fast as possible — though at least the IEEE people care about which wrong answer. Some will no doubt argue that 1e100 is obviously the right answer here; but the rules of C and IEEE arithmetic say the result should be infinity.)
