Rounding Errors And Arbitrary Precision Arithmetic

The computer computes. Plus, minus, times and sometimes even divide. The results are always correct. Really? Computers usually use floating point numbers, which are very flexible, but still can cause quite substantial rounding errors – at least when you are doing complex calculations with many steps. These small inaccuracies can accumulate and become a creeping poison. Arbitrary precision arithmetic offers a way out.


“If you drink much from a bottle marked ‘poison’ it is certain to disagree with you sooner or later.” – Lewis Carroll

Numerical Inaccuracies in Practice

OK, let’s start up the Windows Calculator and switch it to “scientific mode”:

√4 – 2 = -8.1648465955514287168521180122928e-39

The correct result would have been 0, but this is at least close, and square roots are a bit problematic anyway. Next try:

10⁻³⁷ + 1 = 1

Not quite, although the rounded result certainly looks a lot nicer than the correct one.
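This absorption effect is easy to reproduce in a program (my own snippet, not from the calculator): adding 10⁻³⁷ to 1 leaves the 1 completely unchanged, because even a 64-bit double, with its 52 mantissa bits, cannot hold two digits that lie 37 decimal places apart.

```java
public class Absorption {
    public static void main(String[] args) {
        double tiny = 1e-37;
        double sum = tiny + 1.0;         // the tiny addend is rounded away entirely
        System.out.println(sum == 1.0);  // true: 1e-37 is "absorbed" by the 1
    }
}
```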

In programming languages we encounter the same kinds of effects when using floating point numbers:

… which will return an intermediate result of 3.0000002 and a final result of 4194304.

When using the data type double (64-bit double precision instead of 32-bit single precision), the result does not inspire confidence either: the intermediate result in this case is 2.9999999999999996, and this time the final result is a very large negative number: -2251799813685248. This is not a bug in the floating point implementation (like the infamous Pentium FDIV bug), but the normal behaviour of floating point numbers as specified in the IEEE 754 standard.
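The original code listing is missing here; it evidently accumulated small increments and then divided by the tiny difference between the computed sum and the exact result. The same effect can be reproduced with a sketch of my own (not the article's listing), using ten additions of 0.1: the float sum overshoots by one unit in the last place, so the reciprocal of the residual is a power of two instead of infinity, while the double sum undershoots and flips the sign.

```java
public class Accumulation {
    public static void main(String[] args) {
        float f = 0.0f;
        double d = 0.0;
        for (int i = 0; i < 10; i++) {
            f += 0.1f;   // 0.1 has no finite binary representation,
            d += 0.1;    // so every single addition rounds a little
        }
        System.out.println(f);            // 1.0000001 -- slightly too large
        System.out.println(1 / (f - 1));  // 8388608.0 (2^23) instead of infinity
        System.out.println(d);            // slightly too small
        System.out.println(1 / (d - 1));  // a huge negative number
    }
}
```

The sign flip between the float and the double result mirrors the article's 4194304 versus -2251799813685248: which side of the exact value the rounded sum lands on is essentially an accident of the representation.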

The errors usually start very small, but can accumulate during complex calculations to quite substantial errors. A few examples:

  • Solving systems of linear equations can be numerically very unstable. One such example involving Hilbert matrices can be found on the JLinAlg homepage (more about this open source library later).
  • Algorithms that approximate mathematical constants such as π may no longer converge to the correct result when floating point numbers are used. This problem can often be remedied to some extent by a suitable mathematical transformation of the problem.
  • Failure at Dhahran: On February 25, 1991, an Iraqi Scud missile hit a US Army barracks in Dhahran (Saudi Arabia), even though a Patriot missile defense system was in place. 28 soldiers died because a time calculation inside the radar system, based on an inexact 24-bit binary representation of the number 0.1, became more and more imprecise the longer the system was in operation. After 100 hours of continuous operation, the radar was unable to track the incoming missile properly and dismissed it as a measurement error. Details can be found in the official report.
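The Dhahran numbers can be reconstructed arithmetically. The system counted time in tenths of a second, and 0.1 was truncated to a 24-bit binary fraction; the sketch below (my own reconstruction of the figures in the report, not the original software) shows the error growing to roughly a third of a second after 100 hours, which at Scud speeds of about 1,700 m/s translates into a tracking error of over half a kilometre.

```java
import java.math.BigDecimal;

public class PatriotDrift {
    public static void main(String[] args) {
        // 0.1 truncated (chopped, not rounded) to a 24-bit binary fraction:
        // floor(0.1 * 2^24) / 2^24 = 1677721 / 16777216
        BigDecimal stored = new BigDecimal(1677721).divide(new BigDecimal(1 << 24));
        BigDecimal errorPerTick = new BigDecimal("0.1").subtract(stored);
        // the clock ticks 10 times per second; 100 hours = 3,600,000 ticks
        BigDecimal drift = errorPerTick.multiply(new BigDecimal(100 * 3600 * 10));
        System.out.println(errorPerTick); // about 9.54E-8 seconds lost per tick
        System.out.println(drift);        // about 0.34 seconds after 100 hours
    }
}
```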

The radar system used for Patriot missiles

Floating Point Numbers

Floating point numbers are designed to perform numerical calculations as exactly as possible with a limited number of bits. Unlike fixed point numbers such as 0.0000042, floating point numbers do not waste bits on leading or trailing zeros. Instead, numbers are normalized in a way that resembles scientific notation, which turns 0.0000042 into 4.2 · 10⁻⁶.
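This normalized layout can be made visible in a program by taking a float apart into its sign, exponent and mantissa bits (an illustration of the IEEE 754 single precision format, not code from the article):

```java
public class FloatAnatomy {
    public static void main(String[] args) {
        float f = 4.2e-6f;                            // 4.2 * 10^-6
        int bits = Float.floatToIntBits(f);
        int sign     = bits >>> 31;                   // 1 sign bit
        int exponent = ((bits >>> 23) & 0xFF) - 127;  // 8 exponent bits, biased by 127
        int mantissa = bits & 0x7FFFFF;               // 23 bits of the normalized mantissa
        System.out.println(sign);      // 0 (positive)
        System.out.println(exponent);  // -18, because 4.2e-6 is about 1.1 * 2^-18
        System.out.println(mantissa);  // the fractional bits of the significand
    }
}
```

The exponent of -18 is the binary counterpart of the -6 in the decimal scientific notation: the normalization step only chooses a power of 2 instead of a power of 10.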

Using scientific notation, one can write very small and very large numbers in a compact form. Fixed point numbers often need a lot more space for the same information. Two examples:

  • Planck time is the smallest time interval in which the laws of physics as we understand them today are valid. This constant is 5.391 · 10⁻⁴⁴ seconds. As a fixed point number this would be: 0.000,000,000,000,000,000,000,000,000,000,000,000,000,000,053,91 seconds
  • The mass of the earth is approximately 5.974 · 10²⁴ kg, or as a fixed point number: 5,974,000,000,000,000,000,000,000 kg

Moreover, floating point numbers are usually displayed in e-notation, e.g. 4.2e-6 for 4.2 · 10⁻⁶, and in floating point arithmetic the result is rounded and normalized (i.e. the exponent is recalculated) after every operation.

Despite their great flexibility, floating point numbers are often imprecise, because many numbers cannot be described with a finite number of digits in the binary system. The decimal number 0.0625, which is 2⁻⁴, is no problem for the binary system, but the decimal number 0.1 has infinitely many binary places (just as 1/7 = 0.142857… has infinitely many decimal places). The best approximation of 0.1 using single precision floating point numbers is – transformed back into a decimal number – exactly:

0.100000001490116119384765625
Source: Wikipedia
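This value can be checked in Java: the BigDecimal(double) constructor exposes the exact binary value hiding behind a floating point literal (a small verification of my own, not from the article).

```java
import java.math.BigDecimal;

public class ExactValue {
    public static void main(String[] args) {
        // widening 0.1f to double preserves its exact binary value; the
        // BigDecimal(double) constructor then prints it without any rounding
        System.out.println(new BigDecimal(0.1f));
        // -> 0.100000001490116119384765625
        System.out.println(new BigDecimal(0.1));
        // the double approximation of 0.1 is closer, but also not exact
    }
}
```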

Arbitrary Precision Arithmetic

One way to get reliable results is arbitrary precision arithmetic. JLinAlg is an open source project which offers (among other things) various data types with arbitrary precision.


Logo of the JLinAlg project, a Java library for linear algebra.

Now let us look at the above example, but this time using rational numbers, as provided by JLinAlg’s Rational class:

This time we simply get 3 as the intermediate result, and the last line produces an error message, because this time we really are dividing by zero.
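The JLinAlg listing itself is not reproduced here, but the behaviour it describes can be sketched with the JDK's exact BigDecimal type instead (my own code, not JLinAlg's Rational): thirty additions of 0.1 yield exactly 3, and the subsequent division fails loudly instead of silently returning a huge bogus number.

```java
import java.math.BigDecimal;

public class ExactSum {
    public static void main(String[] args) {
        BigDecimal sum = BigDecimal.ZERO;
        BigDecimal tenth = new BigDecimal("0.1"); // exactly one tenth
        for (int i = 0; i < 30; i++) {
            sum = sum.add(tenth);                 // no rounding at any step
        }
        System.out.println(sum);  // 3.0 -- exactly three
        // 1 / (sum - 3) is now a genuine division by zero and throws an
        // ArithmeticException, rather than producing a huge finite number
        BigDecimal diff = sum.subtract(new BigDecimal(3));
        System.out.println(BigDecimal.ONE.divide(diff));
    }
}
```

Getting an exception instead of a plausible-looking wrong number is exactly the point: the error surfaces where it happens, not millions of operations later.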

Another argument for arbitrary precision is that results are often more concise and therefore easier to handle. For example: what is 116 / 406? The result is as follows:

  • Using single precision floating point numbers (Float): 0.2857143
  • Using double precision floating point numbers (Double): 0.2857142857142857
  • With arbitrary precision (Rational): 2 / 7

None of the results are wrong, but only the last one is really precise and easy to handle. The source code to do this calculation using JLinAlg looks like this:
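The JLinAlg source code is not reproduced here. The reduction its Rational class performs automatically is, at its core, just a greatest-common-divisor computation, which can be sketched with the JDK's BigInteger (my own illustration, not the library's actual code):

```java
import java.math.BigInteger;

public class Reduce {
    public static void main(String[] args) {
        BigInteger num = BigInteger.valueOf(116);
        BigInteger den = BigInteger.valueOf(406);
        BigInteger gcd = num.gcd(den);  // gcd(116, 406) = 58
        // cancel the common factor to obtain the fully reduced fraction
        System.out.println(num.divide(gcd) + " / " + den.divide(gcd)); // 2 / 7
    }
}
```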

The complete source code for the examples above is available here.


It is important to be aware of rounding errors when using floating point numbers and to avoid them where possible. In some cases it is worth sacrificing computational performance and using arbitrary precision arithmetic instead.
