In STEP 2, floating point representation was introduced as a convenient way of dealing with large or small numbers. Since most scientific computations involve such numbers, many students will be familiar with floating point arithmetic and will appreciate the way in which it facilitates calculations involving multiplication or division.
In order to investigate the implications of finite number representation, one must examine the way in which arithmetic is carried out with floating point numbers. The following specifications apply to most computers which round, and are easily adapted to those which chop. For the sake of simplicity in the examples, we will use a three-digit decimal mantissa normalized to lie in the range $[1, 10)$ (most computers use binary representation and the mantissa is commonly normalized to lie in the range $[\tfrac{1}{2}, 1)$). Note that up to six digits are used for intermediate results, but the final result of each operation is a normalized three-digit decimal floating point number.
In addition and subtraction, the mantissae are added or subtracted (after shifting the mantissa and increasing the exponent of the smaller number, if necessary, to make the exponents agree); the final normalized result is obtained by rounding (after shifting the mantissa and adjusting the exponent, if necessary). Thus:
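$$3.12 \times 10^{1} + 4.26 \times 10^{1} = 7.38 \times 10^{1},$$
$$2.77 \times 10^{2} + 7.55 \times 10^{2} = 10.32 \times 10^{2} \to 1.03 \times 10^{3},$$
$$6.18 \times 10^{1} + 1.84 \times 10^{-1} = 6.18 \times 10^{1} + 0.0184 \times 10^{1} = 6.1984 \times 10^{1} \to 6.20 \times 10^{1},$$
$$3.65 \times 10^{-1} - 2.78 \times 10^{-1} = 0.87 \times 10^{-1} \to 8.70 \times 10^{-2}.$$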
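The addition and subtraction examples above can be checked mechanically. The following is a minimal sketch, assuming that Python's `decimal` module with a precision of three digits and round-half-up is an adequate model of the arithmetic described here; the names `CTX`, `fl_add` and `fl_sub` are introduced only for this illustration.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumed model of the text's arithmetic: three significant decimal digits,
# with ties rounded away from zero.
CTX = Context(prec=3, rounding=ROUND_HALF_UP)


def fl_add(x: str, y: str) -> Decimal:
    """Form the exact sum of two decimal literals, then round it to three digits."""
    return CTX.add(Decimal(x), Decimal(y))


def fl_sub(x: str, y: str) -> Decimal:
    """Form the exact difference of two decimal literals, then round it to three digits."""
    return CTX.subtract(Decimal(x), Decimal(y))


# The three additions and the one subtraction from the text.
for x, y in [("3.12E1", "4.26E1"), ("2.77E2", "7.55E2"), ("6.18E1", "1.84E-1")]:
    print(f"{x} + {y} -> {fl_add(x, y):.2E}")
print(f"3.65E-1 - 2.78E-1 -> {fl_sub('3.65E-1', '2.78E-1'):.2E}")
```

The alignment of exponents is handled internally by `decimal`, so only the final rounding step is visible in the output.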
In multiplication, the exponents are added and the mantissae are multiplied; the final result is obtained by rounding (after shifting the mantissa right and increasing the exponent by 1, if necessary). Thus:
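$$(4.27 \times 10^{1}) \times (3.68 \times 10^{1}) = 15.7136 \times 10^{2} \to 1.57 \times 10^{3},$$
$$(2.73 \times 10^{2}) \times (-3.64 \times 10^{-2}) = -9.9372 \times 10^{0} \to -9.94 \times 10^{0}.$$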
In division, the exponents are subtracted and the mantissae are divided; the final result is obtained by rounding (after shifting the mantissa left and reducing the exponent by 1, if necessary). Thus:
$$(5.43 \times 10^{1}) / (4.55 \times 10^{2}) = 1.19340\ldots \times 10^{-1} \to 1.19 \times 10^{-1},$$
$$(-2.75 \times 10^{2}) / (9.87 \times 10^{-2}) = -0.278622\ldots \times 10^{4} \to -2.79 \times 10^{3}.$$
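The multiplication and division examples can be reproduced in the same spirit. This is a sketch under the assumption that a three-digit, round-half-up `decimal` context models the arithmetic of this section; `fl_mul` and `fl_div` are hypothetical helper names.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

# Assumed model: three significant decimal digits, round-half-up.
CTX = Context(prec=3, rounding=ROUND_HALF_UP)


def fl_mul(x: str, y: str) -> Decimal:
    """Multiply two decimal literals and round the product to three digits."""
    return CTX.multiply(Decimal(x), Decimal(y))


def fl_div(x: str, y: str) -> Decimal:
    """Divide two decimal literals and round the quotient to three digits."""
    return CTX.divide(Decimal(x), Decimal(y))


print(f"{fl_mul('4.27E1', '3.68E1'):.2E}")    # product of the first example
print(f"{fl_mul('2.73E2', '-3.64E-2'):.2E}")  # product of the second example
print(f"{fl_div('5.43E1', '4.55E2'):.2E}")    # quotient of the first example
print(f"{fl_div('-2.75E2', '9.87E-2'):.2E}")  # quotient of the second example
```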
In an expression involving several operations, the order of evaluation is determined in the standard way and the result of each operation is a normalized floating point number. Thus:
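$$(6.18 \times 10^{1} + 1.84 \times 10^{-1}) / ((4.27 \times 10^{1}) \times (3.68 \times 10^{1})) \to (6.20 \times 10^{1}) / (1.57 \times 10^{3}) = 3.94904\ldots \times 10^{-2} \to 3.95 \times 10^{-2}.$$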
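To see what rounding after every operation does to a whole expression, one can compare the stepwise result with a reference value that is rounded only once, at the end. The sketch below assumes the same three-digit, round-half-up model; `EXACT` is a hypothetical 28-digit context used only as the reference.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model
EXACT = Context(prec=28)                       # wide reference context (assumption)

a, b = Decimal("6.18E1"), Decimal("1.84E-1")
c, d = Decimal("4.27E1"), Decimal("3.68E1")

# Stepwise evaluation: every intermediate result is a three-digit number.
numerator = CTX.add(a, b)
denominator = CTX.multiply(c, d)
stepwise = CTX.divide(numerator, denominator)

# Reference evaluation: carry many digits and round only the final quotient.
reference = EXACT.divide(EXACT.add(a, b), EXACT.multiply(c, d))
reference_3s = CTX.plus(reference)

print(f"stepwise:  {stepwise:.2E}")
print(f"reference: {reference_3s:.2E}")
```

In this example the two values differ in the last digit, which shows how the individual rounding steps accumulate.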
Note that all the above examples (except the subtraction and the first addition) involve generated errors which are relatively large due to the small number of digits in the mantissae. Thus the generated error in
$$2.77 \times 10^{2} + 7.55 \times 10^{2} = 10.32 \times 10^{2} \to 1.03 \times 10^{3}$$
is $0.002 \times 10^{3}$. Since the propagated error in this example may be as large as $0.01 \times 10^{2}$ (assuming the operands are correct to 3S), one can use the result given in Error propagation to deduce that the accumulated error cannot exceed $0.002 \times 10^{3} + 0.01 \times 10^{2} = 0.003 \times 10^{3}$.
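The error bookkeeping for this addition can be written out as a short calculation. The sketch assumes a three-digit, round-half-up `decimal` context for the rounded sum, and takes "correct to 3S" to mean that each operand is off by at most half a unit in its last digit; all variable names are illustrative.

```python
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model
EXACT = Context(prec=28)                       # wide reference context (assumption)

x, y = Decimal("2.77E2"), Decimal("7.55E2")

exact_sum = EXACT.add(x, y)    # 10.32 x 10^2, before rounding
rounded_sum = CTX.add(x, y)    # 1.03 x 10^3, after rounding
generated_error = abs(EXACT.subtract(exact_sum, rounded_sum))

# An operand correct to 3S may be off by half a unit in its last digit.
operand_bound = Decimal("0.005E2")
propagated_bound = EXACT.add(operand_bound, operand_bound)

# Bound on the accumulated error: generated plus propagated.
accumulated_bound = EXACT.add(generated_error, propagated_bound)

print(generated_error, propagated_bound, accumulated_bound)
```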
The peculiarities of floating point arithmetic lead to some unexpected and unfortunate consequences, including the following:
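Adding a small but nonzero number may have no effect; for example,
$$5.18 \times 10^{2} + 4.37 \times 10^{-1} = 5.18 \times 10^{2} + 0.00437 \times 10^{2} = 5.18437 \times 10^{2} \to 5.18 \times 10^{2},$$
whence the additive identity is not unique.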
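The result of $a \times (1/a)$ is frequently not 1; for example, if $a = 3.00 \times 10^{0}$, then $1/a$ is $3.33 \times 10^{-1}$ and $a \times (1/a)$ is $9.99 \times 10^{-1}$, whence the multiplicative inverse may not exist.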
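Both of these consequences are easy to reproduce. The sketch below assumes a three-digit, round-half-up `decimal` context, and takes $a = 3.00 \times 10^{0}$ as the value of $a$ (an assumption, chosen because it is consistent with the quoted value of $1/a$).

```python
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model

# A nonzero number that is too small to change the larger operand.
big, small = Decimal("5.18E2"), Decimal("4.37E-1")
absorbed = CTX.add(big, small)
print(absorbed == big)   # the small operand has been absorbed

# a * (1/a) need not return 1 (a = 3.00 is an assumed value).
a = Decimal("3.00E0")
reciprocal = CTX.divide(Decimal(1), a)
product = CTX.multiply(a, reciprocal)
print(f"1/a = {reciprocal:.2E}, a * (1/a) = {product:.2E}")
```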
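The result of $(a+b)+c$ is not always the same as the result of $a+(b+c)$; for example, if
$$a = 6.31 \times 10^{1}, \quad b = 4.24 \times 10^{0}, \quad c = 2.47 \times 10^{-1},$$
then
$$(a+b)+c = (6.31 \times 10^{1} + 0.424 \times 10^{1}) + 2.47 \times 10^{-1} \to 6.73 \times 10^{1} + 0.0247 \times 10^{1} \to 6.75 \times 10^{1},$$
whereas
$$a+(b+c) = 6.31 \times 10^{1} + (4.24 \times 10^{0} + 0.247 \times 10^{0}) \to 6.31 \times 10^{1} + 4.49 \times 10^{0} = 6.31 \times 10^{1} + 0.449 \times 10^{1} \to 6.76 \times 10^{1},$$
whence the associative law for addition does not always apply.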
Examples involving adding many numbers of varying size indicate that adding in order of increasing magnitude is preferable to adding in the reverse order.
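The effect of grouping and of summation order can be checked with the same three numbers. This is a minimal sketch, assuming a three-digit, round-half-up `decimal` context; `fl_sum` is a hypothetical helper that adds from left to right and rounds after every addition.

```python
from decimal import Context, Decimal, ROUND_HALF_UP
from functools import reduce

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model


def fl_sum(values):
    """Add left to right, rounding to three digits after every addition."""
    return reduce(CTX.add, values)


a, b, c = Decimal("6.31E1"), Decimal("4.24E0"), Decimal("2.47E-1")

left_grouped = CTX.add(CTX.add(a, b), c)    # (a + b) + c
right_grouped = CTX.add(a, CTX.add(b, c))   # a + (b + c)

increasing = fl_sum(sorted([a, b, c], key=abs))                # smallest first
decreasing = fl_sum(sorted([a, b, c], key=abs, reverse=True))  # largest first
exact_rounded = CTX.plus(a + b + c)  # exact sum, rounded once at the end

print(left_grouped, right_grouped, increasing, decreasing, exact_rounded)
```

For these values, adding the smallest terms first reproduces the correctly rounded sum, while adding the largest first does not.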
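Subtracting two nearly equal numbers may cause a loss of significant digits; for example, in three-digit arithmetic,
$$1 - \cos(0.05) = 1 - 0.99875\ldots \to 1.00 \times 10^{0} - 0.999 \times 10^{0} \to 1.00 \times 10^{-3}.$$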
Although the value of 1 is exact and cos(0.05) is correct to 3S when expressed as a three-digit floating point number, their computed difference is correct to only 1S! (The two zeros after the decimal point in $1.00 \times 10^{-3}$ pad the number.)
The approximation $0.999 \approx \cos(0.05)$ has a relative error of about $2.5 \times 10^{-4}$. By comparison, the relative error of $1.00 \times 10^{-3} \approx 1 - \cos(0.05)$ is about 0.2, i.e., it is much larger. Thus, subtraction of two nearly equal numbers should be avoided whenever possible.
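The two relative errors quoted above can be estimated numerically. The sketch assumes a three-digit, round-half-up `decimal` context for the computed values and treats the double-precision results of `math.cos` as the "true" values; the variable names are illustrative.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model

x = 0.05
reference_cos = math.cos(x)           # double precision, taken as the true value
reference_diff = 1.0 - reference_cos  # accurate enough here to serve as a reference

cos_3s = CTX.plus(Decimal(reference_cos))   # cos(0.05) stored to three digits
diff_3s = CTX.subtract(Decimal(1), cos_3s)  # three-digit 1 - cos(0.05)

rel_err_cos = abs(float(cos_3s) - reference_cos) / reference_cos
rel_err_diff = abs(float(diff_3s) - reference_diff) / reference_diff

print(f"relative error of cos(0.05):     {rel_err_cos:.2e}")
print(f"relative error of 1 - cos(0.05): {rel_err_diff:.2e}")
```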
In the case of f(x)= 1 - cos x, one can avoid loss of significant digits by writing
$$1 - \cos x = 2\sin^{2}(x/2).$$
This last formula is more suitable for calculations when x is close to 0. It can be verified that the more accurate approximation of $1.25 \times 10^{-3}$ is obtained for $1 - \cos(0.05)$ when three-digit floating point arithmetic is used.
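The value $1.25 \times 10^{-3}$ can be checked with one standard rewriting, the half-angle identity $1 - \cos x = 2\sin^{2}(x/2)$, which the exercise in this section also uses. The sketch assumes a three-digit, round-half-up `decimal` context, and assumes that $\sin(x/2)$ is supplied as a three-digit number obtained by rounding the double-precision value from `math.sin`; `fl` is a hypothetical helper.

```python
import math
from decimal import Context, Decimal, ROUND_HALF_UP

CTX = Context(prec=3, rounding=ROUND_HALF_UP)  # assumed three-digit model


def fl(value: float) -> Decimal:
    """Round a double-precision value to the three-digit format."""
    return CTX.plus(Decimal(value))


x = Decimal("5.00E-2")
half = CTX.divide(x, Decimal(2))                       # x / 2
s = fl(math.sin(float(half)))                          # sin(x/2) to three digits
stable = CTX.multiply(Decimal(2), CTX.multiply(s, s))  # 2 * sin^2(x/2)

reference = 1.0 - math.cos(float(x))  # double precision, taken as the true value
rel_err = abs(float(stable) - reference) / reference

print(f"2 sin^2(x/2) -> {stable:.2E}, relative error {rel_err:.1e}")
```

No subtraction of nearly equal numbers occurs in this form, which is why the relative error stays near the level of a single rounding.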
Why is it sometimes necessary to shift the mantissa and adjust the exponent of a floating point number?
Exercises
Evaluate the following expressions, using three-digit decimal normalized floating point arithmetic with rounding:
Since $\tan x - \sin x = \tan x\,(1 - \cos x) = \tan x\,(2\sin^{2}(x/2))$, $f(x)$ may be written as $f(x) = 2\tan x\,\sin^{2}(x/2)$. Repeat the calculation using this alternative expression. Which of the two values is more accurate?
Answers