Fractions present another number representation problem. How can one represent fractional quantities using bits?
Representing fractions can be handled in the same way that positive powers of 2 represent integers: use negative powers of two, added up to approximate fractional quantities.
The small negative powers of two are 2^-1 = 0.5, 2^-2 = 0.25, 2^-3 = 0.125, 2^-4 = 0.0625, and so on.
Thus you can pick some part of a 32-bit computer word and decide where the binary point should be. Let's say we pick the middle of the word somewhere. This gives 1 bit of sign, 15 bits of integer, and 16 bits of fraction. Thus one can have numbers from about -32,767 to +32,767 with about 4½ decimal digits of fractional precision.
If you use this technique you must convert numbers from the bit representation to decimal representation correctly. Pascal and C do not provide this flexibility in conversion, but some other programming languages do. PL/I for example allows you to declare numeric variables with the number of bits and where the binary point should be located. PL/I then builds code to do the shifting and adjusting for you, and its input/output routines also work properly. However, one can always write conversion subroutines in other languages so that you can both read in fractional quantities and print them out correctly.
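As a sketch of what such a conversion routine might look like in C, using the 16.16 split from the example above (the type name fix16_t and the function print_fix16 are made up for illustration, not taken from any standard library):

    #include <stdint.h>
    #include <stdio.h>

    /* One way to view a 32-bit word: 1 sign bit, 15 integer bits, 16 fraction
       bits.  The stored value is an ordinary two's-complement integer; the
       value it represents is that integer divided by 2^16 = 65536. */
    typedef int32_t fix16_t;

    /* Print a 16.16 fixed-point value with `digits' decimal places,
       without using floating point anywhere. */
    void print_fix16(fix16_t f, int digits) {
        uint32_t mag = (f < 0) ? (uint32_t)(-(int64_t)f) : (uint32_t)f;
        if (f < 0) putchar('-');
        printf("%u.", mag >> 16);            /* integer part            */
        uint32_t frac = mag & 0xFFFF;        /* fractional part (q=16)  */
        for (int i = 0; i < digits; i++) {   /* peel off decimal digits */
            frac *= 10;
            putchar('0' + (int)(frac >> 16));
            frac &= 0xFFFF;
        }
        putchar('\n');
    }

    int main(void) {
        fix16_t a = (5 << 16) | 0x4000;      /* 5.25  */
        fix16_t b = -(3 << 16);              /* -3.0  */
        print_fix16(a, 5);                   /* prints 5.25000  */
        print_fix16(b, 5);                   /* prints -3.00000 */
        return 0;
    }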
Note that the rules for binary arithmetic on these fractional numbers are the same as they are for integers, so no change in the hardware is needed to deal with binary fractions. That is, this fractional representation is isomorphic to the binary two's-complement integer representation that the machine uses. It only requires that you think of the numbers as fractions and that you understand and treat properly the binary point (the binary equivalent of the decimal point). For example, you can only meaningfully add two numbers if their binary points are aligned. Of course, the computer will add them no matter where you think the binary point is. This is just like decimal arithmetic, where you must add two numbers with the decimal points aligned if you want the right answer.
One can shift the binary point of a number by multiplying or dividing by the proper power of two, just as one shifts the decimal point by multiplying or dividing by a power of ten. Most machines also provide arithmetic shift operations that shift the bit representation of the number right or left more quickly than a multiply instruction would. An arithmetic right shift of 3 is a binary division by 2^3, or 8; a left shift is a binary multiplication. Arithmetic right shifts usually copy the sign bit so that negative numbers stay negative. Arithmetic left shifts introduce zeros from the right side of the number. If the sign bit changes in an arithmetic left shift then the number has overflowed.
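As an illustration, here is a small C sketch of shifting and aligning binary points (note that in C the right shift of a negative value is technically implementation-defined, though nearly all compilers perform the arithmetic shift described here):

    #include <stdio.h>

    int main(void) {
        /* a has q = 4 fraction bits, b has q = 2 fraction bits. */
        int a = 0x35;             /* 53/16 = 3.3125 with q = 4 */
        int b = 0x0A;             /* 10/4  = 2.5    with q = 2 */

        /* Align the binary points before adding: shift b left by 2 so that
           both numbers have q = 4.  A left shift by 2 multiplies by 2^2 = 4. */
        int b_aligned = b << 2;   /* 40/16 = 2.5 with q = 4 */
        int sum = a + b_aligned;  /* 93/16 = 5.8125         */
        printf("sum = %d/16 = %g\n", sum, sum / 16.0);

        /* An arithmetic right shift by 3 divides by 2^3 = 8 and copies the
           sign bit, so a negative number stays negative. */
        int c = -40;
        printf("-40 >> 3 = %d\n", c >> 3);    /* -5 with an arithmetic shift */
        return 0;
    }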
The binary point can be represented by counting the number of binary
fractional digits in the number.
Multiplication of a number with binary point of 5 and one with a binary
point of 3 will give a number with a binary point of 8.
That is, binary numbers can be represented in general as having p binary digits and q fractional digits.
The result of binary arithmetic for A, with digits (p_A, q_A), and B, with digits (p_B, q_B), gives the worst case p:

A + B, A - B (binary points aligned, q_A = q_B = q) | result has q fractional digits | worst case p = max(p_A, p_B) + 1 |
A * B | result has q_A + q_B fractional digits | worst case p = p_A + p_B |
A / B | result keeps as many fractional digits as you choose | worst case p depends on the scaling chosen |
One reason that this scaled integer arithmetic is not popular is that unless the program scales numbers using multiplication and division by powers of the radix to make sure that the two operands of addition and subtraction have the same q, answers will be wrong. The programming language PL/I does this scaling automatically but for other languages you have to do it yourself.
Note that you can do this in Pascal or C because it only involves the way you think about the numbers you are using. Pascal doesn't have to know what you are doing, as long as you write the code properly.
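Here is a sketch in C of the kind of scaling you have to write yourself, using the q=5 and q=3 values from the multiplication example above (the variable names are just for illustration):

    #include <stdio.h>

    int main(void) {
        long x = 45;              /* 45/32 = 1.40625 with q = 5 */
        long y = 21;              /* 21/8  = 2.625   with q = 3 */

        /* The q's add under multiplication: the product has q = 5 + 3 = 8. */
        long prod_q8 = x * y;     /* 945/256 = 3.69140625 */

        /* Before adding it to a q = 5 number, rescale it back to q = 5 by an
           arithmetic right shift of 3 (dividing by 2^3 and dropping the
           three lowest fraction bits). */
        long prod_q5 = prod_q8 >> 3;   /* 118/32 = 3.6875        */
        long z       = 64;             /* 64/32 = 2.0 with q = 5 */
        long sum_q5  = z + prod_q5;    /* 182/32 = 5.6875        */

        printf("product (q=8) = %ld/256 = %g\n", prod_q8, prod_q8 / 256.0);
        printf("sum     (q=5) = %ld/32  = %g\n", sum_q5,  sum_q5  / 32.0);
        return 0;
    }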
One is not restricted to scaling in the binary representation; one can also scale in the decimal representation.
The rules for decimal scaling are the same as in the table above.
For example, suppose you read in an integer but know that it contains a number with two `implied' decimal digits; that is, the stored number 500 represents 5.00, with q=2.
You treat it that way in the machine: multiply it by 2.1 (stored as 21 with q=1), and the result is 10500 with q=3.
Now you can output the result correctly with the decimal point properly placed, because you know that q=3: 10.500.
Again you must be careful to add and subtract only numbers with the same decimal scale, and to scale the numbers carefully around multiplication and division so as to preserve the high-order digits while maintaining the required precision after the decimal point.
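The decimal example just described looks like this in C (the variable names are illustrative; the point is only that the programmer keeps track of each q):

    #include <stdio.h>

    int main(void) {
        /* "5.00" read in as the integer 500 with two implied decimal digits. */
        long value_q2   = 500;                     /* q = 2                  */
        long factor_q1  = 21;                      /* 2.1 stored with q = 1  */

        /* The q's add: 500 * 21 = 10500 with q = 2 + 1 = 3. */
        long product_q3 = value_q2 * factor_q1;

        /* Place the decimal point by hand, since we know q = 3. */
        printf("%ld.%03ld\n", product_q3 / 1000, product_q3 % 1000);  /* 10.500 */
        return 0;
    }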
One can represent any rational number as the ratio of two integers. To do this with bits, rational fractional quantities are represented as two integers, one representing the numerator and the other the denominator of a fraction that equals the number.
Multiplication and division of these numbers is easy. Addition and subtraction require finding a common denominator (the denominator of the result) and then adding or subtracting the adjusted numerators.
The major problem with this representation of numbers is that it requires twice as much space, and comparison is costly because it requires two divisions. Non-rational numbers must be approximated by a rational. Rationals whose denominator gets too large for the integer size available must also be approximated. Error analysis is non-trivial for algorithms represented this way.
Few computers use it as a hardware method of representing numbers, but the algorithms for doing rational arithmetic are not difficult to code in any language using standard binary integers and structures.
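A minimal sketch in C of such a rational representation (the type name rational and the functions rat_add and rat_mul are made up for illustration; a serious implementation would also have to watch for overflow of the numerator and denominator):

    #include <stdio.h>

    /* A rational number is a pair of integers: numerator and denominator. */
    typedef struct { long num, den; } rational;

    /* Greatest common divisor, used to keep numerator and denominator small. */
    static long gcd(long a, long b) {
        if (a < 0) a = -a;
        if (b < 0) b = -b;
        while (b != 0) { long t = a % b; a = b; b = t; }
        return a;
    }

    static rational reduce(rational r) {
        long g = gcd(r.num, r.den);
        if (g != 0) { r.num /= g; r.den /= g; }
        return r;
    }

    /* Multiplication is easy: multiply numerators and denominators. */
    rational rat_mul(rational a, rational b) {
        rational r = { a.num * b.num, a.den * b.den };
        return reduce(r);
    }

    /* Addition needs a common denominator first. */
    rational rat_add(rational a, rational b) {
        rational r = { a.num * b.den + b.num * a.den, a.den * b.den };
        return reduce(r);
    }

    int main(void) {
        rational third = { 1, 3 }, fifth = { 1, 5 };
        rational sum  = rat_add(third, fifth);   /* 8/15 */
        rational prod = rat_mul(third, fifth);   /* 1/15 */
        printf("1/3 + 1/5 = %ld/%ld\n", sum.num,  sum.den);
        printf("1/3 * 1/5 = %ld/%ld\n", prod.num, prod.den);
        return 0;
    }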
For computer languages that do not do fixed scaling of integers, like Pascal and C, the easiest way of handling fractions is with floating point numbers, sometimes called real numbers. Using this technique a number is represented in bits by three parts: sign, exponent, and fraction. This is similar to scientific notation used to represent large or small numbers (e.g. -0.35 x 10^8). The sign is negative, the exponent is 8 and the fraction is 0.35. Inside a computer these are kept in binary and put into separate fields in the floating point number. Most modern computers allow more than one `size' of floating point numbers.
Take the floating point used on the IBM PC and similar clones. These machines, like many others, use a hardware chip set that follows the rules defined by the IEEE Standard Floating Point conventions.
Internal to the machine all arithmetic is done in an 80-bit floating
point format.
For example, 13 is 1101 in binary, that is 1.101 x 2^3; with an exponent bias of 127 the exponent field holds 3 + 127 = 130 = 10000010, so:

13.0 is 0 10000010 11010...0 |
The floating point result is normalized after each arithmetic operation. Normalizing means shifting the significand left to eliminate high-order zeros and adjusting the exponent to match. This means that in every normalized floating point number the top bit of the significand should be one. For example, let's subtract 10 from 13
13.0 is 0 10000010 11010...0 |
10.0 is 0 10000010 10100...0 |
result 3.0 is 0 10000010 00110...0 |
normalize 0 10000001 01100...0 |
normalize 0 10000000 11000...0 |
The exponent of this number is 128 - 127 = 1; so the number is 1.1 (binary) x 2^1, or 3.
If one always knows the top bit should be a one, is there any reason to store it?
Of course not.
So in the short and long IEEE floating point form there is an
implied extra one bit before the most significant digit of the significand.
Thus 3.0 in memory is represented as:
+3.0 | 0 10000000 10000...0 | 0x40400000 |
+2.0 | 0 10000000 00000...0 | 0x40000000 |
+1.0 | 0 01111111 00000...0 | 0x3f800000 |
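You can check these bit patterns for yourself with a few lines of C, assuming (as is true on essentially all current machines) that float is the 32-bit IEEE format:

    #include <stdio.h>
    #include <stdint.h>
    #include <string.h>

    /* Copy the 32 bits of a float into an integer so we can print them in hex. */
    static uint32_t bits_of(float f) {
        uint32_t u;
        memcpy(&u, &f, sizeof u);
        return u;
    }

    int main(void) {
        printf("+3.0 = 0x%08x\n", bits_of(3.0f));   /* 0x40400000 */
        printf("+2.0 = 0x%08x\n", bits_of(2.0f));   /* 0x40000000 */
        printf("+1.0 = 0x%08x\n", bits_of(1.0f));   /* 0x3f800000 */
        return 0;
    }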
How does one represent zero, if there is always an implied bit in the significand?
+zero | 0 00000000 00000...0 | 0x00000000 |
-zero | 1 00000000 00000...0 | 0x80000000 |
There are two special values for the IEEE exponents: zero and all ones. A zero exponent is used for zero and for very small unnormalized (denormalized) numbers. Because the significand is kept in sign-magnitude form, there are both a positive and a negative zero in IEEE floating point.
The all-ones exponent (the number 255 in short floating point format) represents special values like infinity, and values that are not-a-number, or NaN. There are two infinities, positive and negative: positive infinity is greater than any other floating point number, and negative infinity is less than any other floating point number. Infinity is represented by an all-zero significand with an all-ones exponent. Arithmetic with infinity works as you would expect: adding, subtracting, or multiplying by infinity produces ±∞; division by infinity generally produces 0.
0 11111111 00000...0 | 0x7f800000 |
1 11111111 00000...0 | 0xff800000 |
Things that are `Not a Number', or NaNs, can be used by a programmer to represent uninitialized values; if such a floating point value is ever used in a computation an error is generated.
0 11111111 10000...0 | 0x7fc00000 |
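A short C illustration of these special values, using isnan from <math.h>:

    #include <stdio.h>
    #include <math.h>

    int main(void) {
        float big   = 1e30f;
        float inf   = big * big;       /* overflows to +infinity            */
        float ninf  = -inf;            /* negative infinity                 */
        float nan_v = inf - inf;       /* infinity minus infinity is a NaN  */

        printf("inf  >  1e38f ? %d\n", inf  >  1e38f);   /* 1: larger than any finite value  */
        printf("ninf < -1e38f ? %d\n", ninf < -1e38f);   /* 1: smaller than any finite value */
        printf("1.0/inf = %g\n", 1.0f / inf);            /* division by infinity gives 0     */
        printf("isnan ? %d\n", isnan(nan_v));            /* 1 */
        printf("nan_v == nan_v ? %d\n", nan_v == nan_v); /* 0: a NaN is not even equal to itself */
        return 0;
    }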
Floating point in some computers comes in a 96-bit format. Occasionally one finds machines with 128 or 256 bit floating point numbers. Why this massive precision? The reason is found in the scaling, especially when multiplication is combined with subtraction of numbers that can be close to the same value. A series of apparently harmless computations can turn the floating point result into nonsense. Users are usually unaware of the arithmetic problems they are about to encounter, and the machine, for its part, has no way to determine that the precision of the computation has ebbed away. One way to put off incorrect answers is to use a lot of precision. But that only puts the problem off, it doesn't solve it. The only good solution is for the programmer to understand the program he has written and the floating point computational issues it involves.
The classic example of this problem is the equation for the sample standard deviation. Given a set of n values x_i, the most common single-pass way to compute the sample standard deviation, sigma, is to evaluate:

sigma = sqrt( (x_1^2 + x_2^2 + ... + x_n^2)/n - xbar^2 )

where xbar is the mean of the x_i.
The problem with computing this quantity in floating point is that the average sum of squares and the square of the mean can be large numbers that are close. For example if the mean you computed is 50,000, which is not a particularly large number; then the square of the mean is 2,500,000,000. The average sum of the squares will be a similar number, say 2,500,000,125. The difference is only 125, but it is in the low-order three digits.
IEEE single precision floating points have 23 bits of significand. The number of bits, 23, divided by our magic number 3.32, gives 6.9 digits, or just about 7 digits of precision. So the answer after the floating point subtraction will be close to zero. All the significance in the formula comes from the low order digits, and the subtraction throws away the high order digits. This produces garbage numbers to feed to the square root routine. If you are lucky this will result in a domain error from trying to take the square root of a negative number. If you are not lucky you will just get a garbage answer. The result of a square root has half the precision of its argument, so even if you had two digits of precision to start with, by the time you finish there may be only a single digit of precision left.
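You can watch the low-order digits vanish with a couple of lines of C. Near 2,500,000,000 (which lies between 2^31 and 2^32) the spacing between adjacent single precision values is 2^8 = 256, so the 125 in the example above disappears completely:

    #include <stdio.h>

    int main(void) {
        float mean_sq = 2500000125.0f;   /* "average sum of squares" */
        float sq_mean = 2500000000.0f;   /* "square of the mean"     */

        /* The true difference is 125, but near 2.5e9 a float can only change
           in steps of 256, so 2500000125 rounds to 2500000000 and the
           subtraction gives 0. */
        printf("float  difference: %g\n", mean_sq - sq_mean);           /* 0   */
        printf("double difference: %g\n", 2500000125.0 - 2500000000.0); /* 125 */
        return 0;
    }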
What is a computationally safe way to compute the standard deviation?
One way is to compute the mean first.
Then subtract the mean from each item and square the difference and sum
across the items.
The square root of the average sum of the squares is the sample
standard deviation.
This requires two passes across the data, one to compute the mean,
the second one to find the sum of the squares of the differences.
Computationally it can be done in one pass, if you
memorize and use the following algorithm for standard deviation:
Computationally Accurate Sample Mean and SD Algorithm |
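The algorithm box itself is not reproduced here. The usual one-pass, numerically stable method, commonly attributed to Welford, updates a running mean and a running sum of squared deviations; a sketch in C of that method (presumably the kind of algorithm the box contained):

    #include <stdio.h>
    #include <math.h>

    /* One-pass, numerically stable running mean and sample standard deviation.
       Each new value updates the mean and a running sum of squared deviations
       from the current mean, so large nearly-equal quantities are never
       subtracted from one another. */
    int main(void) {
        double data[] = { 50000.1, 50000.2, 49999.9, 50000.0, 49999.8 };
        int    n      = sizeof data / sizeof data[0];

        double mean = 0.0, m2 = 0.0;
        for (int i = 0; i < n; i++) {
            double delta = data[i] - mean;
            mean += delta / (i + 1);          /* running mean of first i+1 items   */
            m2   += delta * (data[i] - mean); /* running sum of squared deviations */
        }
        double sd = (n > 1) ? sqrt(m2 / (n - 1)) : 0.0;

        printf("mean = %f  sample sd = %f\n", mean, sd);
        return 0;
    }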
Although this algorithm is somewhat slower to compute it gives computationally accurate answers. If you don't believe it, then do a little proof to show it is mathematically equivalent to the other formula, and then examine the precision loss in the two methods.
Floating point should always be used cautiously. It doesn't just work for a while and then break, as fixed point computations do; with floating point, computational errors can sneak up on you and reduce your calculations to mush. With floating point calculations it is not always true, for example, that x + 1 > x; when x becomes large enough, adding one just yields x. That's because x + 1 rounds back to x for large x. So if you think about floating point arithmetic as approximate arithmetic you will be closer to the mark.
Because of this approximate nature of floating point computation, comparisons for equality of two computed quantities seldom yield equal, even if mathematically the quantities should be equal. Care should be taken to write algorithms that you can be sure will work. One good rule of thumb: if computations have been done on a floating point number, check for a range of values or check that something has exceeded some boundary. Don't rely on equal! The only time an equals test can be guaranteed to work for floating point values is when you assign a value to a floating point number and check to see if that same value is still there.
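Both effects are easy to demonstrate in C; single precision is used here so they show up with modest numbers:

    #include <stdio.h>

    int main(void) {
        /* For large enough x, adding 1 changes nothing: near 3.0e8 the spacing
           between adjacent single precision values is 32, so x + 1 rounds
           right back to x. */
        float x = 3.0e8f;
        printf("x + 1 > x ?  %d\n", (x + 1.0f) > x);      /* 0 */

        /* A value that "should" be exactly 1.0 after computation usually isn't. */
        float sum = 0.0f;
        for (int i = 0; i < 10; i++) sum += 0.1f;         /* 0.1 is not exact in binary */
        printf("sum == 1.0 ? %d\n", sum == 1.0f);         /* 0 */
        printf("sum - 1.0   = %g\n", (double)sum - 1.0);

        /* A value that was simply assigned, and never computed with, does
           compare equal. */
        float y = 0.5f;
        printf("y == 0.5 ?   %d\n", y == 0.5f);           /* 1 */
        return 0;
    }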
These floating point precision problems also have a nasty habit of cropping up in production software that has passed all the debugging tests. One reason for this is that programmers tend to be lazy, so they write small test cases in which answers are easy to compute. These small cases seldom generate products or sums that challenge the precision limitations of the hardware. When the real data comes along, the program doesn't work.
You may feel that all these problems are too awful to have to think about;
therefore they must have been fixed by modern technology.
From this you conclude that your Pascal system can't possibly
exhibit these horrible problems.
Don't be fooled! All these problems are here in Think Pascal,
Think C, and other languages.
The problems run deeper than the machines that exhibit the problems.
They have their origin in the unmathematical finite nature of our world.
Even more subtle `errors' than the ones demonstrated are possible.
Programmers, beware!