Some programs appear to produce inconsistent floating-point results when compiled by g77 versus other compilers.

Often the reason for this behavior is that floating-point values are represented on almost all Fortran systems by approximations, and these approximations are inexact even for apparently simple values like 0.1, 0.2, 0.3, 0.4, 0.6, 0.7, 0.8, 0.9, 1.1, and so on.

Most Fortran systems, including all current ports of g77, use binary arithmetic to represent these approximations.

Therefore, the exact value of any floating-point approximation as manipulated by g77-compiled code is representable by adding some combination of the values 1.0, 0.5, 0.25, 0.125, and so on (just keep dividing by two) through the precision of the fraction (typically around 23 bits for REAL(KIND=1), 52 for REAL(KIND=2)), then multiplying the sum by an integral power of two (in Fortran, by 2**N) that typically is between -127 and +128 for REAL(KIND=1) and between -1023 and +1024 for REAL(KIND=2), then multiplying by -1 if the number is negative.

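For example, the following program is only a sketch of how to make the inexactness visible; it assumes IEEE 754 single precision for REAL(KIND=1), and the exact digits printed can vary from system to system:

C     Display the value actually stored for the single-precision
C     constant 0.2, using enough decimal digits to expose the
C     approximation.
      REAL R
      DOUBLE PRECISION D
      R = 0.2
      D = R
      WRITE (*, '(F25.20)') D
      END

On a typical IEEE 754 system, this prints a value close to 0.20000000298023223877, the nearest sum of binary fractions, of the kind described above, to 0.2.
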
So, a value like 0.2 is exactly representable in decimal--it is the fraction 2/10, and its denominator is compatible with the base of the number system (base 10). However, 2/10 cannot be represented as any finite sum of 1.0, 0.5, 0.25, and so on, so 0.2 cannot be exactly represented in binary notation.

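Written out in binary notation, 0.2 is an infinitely repeating fraction:

     0.2 (decimal) = 0.001100110011001100110011... (binary)
                   = 1/8 + 1/16 + 1/128 + 1/256 + 1/2048 + 1/4096 + ...

Any finite number of bits must cut this pattern off somewhere, so the value actually stored is always slightly above or below 0.2.
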
(On the other hand, decimal notation can represent any binary number in a finite number of digits. Decimal notation cannot do the same for ternary, or base-3, notation, which would represent floating-point numbers as sums of 1/1, 1/3, 1/9, and so on. After all, no finite number of decimal digits can exactly represent 1/3. Fortunately, few systems use ternary notation.)

Moreover, run-time I/O libraries differ in how they convert between these approximations and the decimal representation that programmers, and the programs they write, usually use. These differences can produce apparent discrepancies between results that do not actually exist, or that exist to such a small degree that they usually are not worth worrying about.

For example, consider the following program:

      PRINT *, 0.2
      END

When compiled by g77, the above program might output 0.20000003, while another compiler might produce an executable that outputs 0.2.

This particular difference arises because, currently, conversion of floating-point values by the libg2c library, used by g77, handles only double-precision values. Since 0.2 in the program is a single-precision value, it is converted to double precision (still in binary notation) before being converted back to decimal. The conversion to double precision appends binary zero digits to the original value--which, again, is an inexact approximation of 0.2--resulting in an approximation that is much less exact than is connoted by the use of double precision.

(The appending of binary zero digits has essentially the same effect as taking a particular decimal approximation of 1/3, such as 0.3333333, and appending decimal zeros to it, producing 0.33333330000000000. Treating the resulting decimal approximation as if it really had 18 or so digits of valid precision would make it seem a very poor approximation of 1/3.)

As a result of converting the single-precision approximation to double precision by appending binary zeros, the conversion of the resulting double-precision value to decimal produces what looks like an incorrect result. In fact, the result is merely inexact, and it is probably no worse an approximation of 0.2 than the one produced by other compilers that happen to output the converted value as exactly 0.2.

(Some compilers behave in a way that can make them appear to retain more accuracy across a conversion of a single-precision constant to double precision. See Context-Sensitive Constants for why this practice is illusory and even dangerous.)

Note that a more exact approximation of the constant is computed when the program is changed to specify a double-precision constant:

      PRINT *, 0.2D0
      END

Future versions of g77 and/or libg2c might convert single-precision values directly to decimal, instead of converting them to double precision first. This would tend to result in output that is more consistent with that produced by some other Fortran implementations.

A useful source of information on floating-point computation is David Goldberg, `What Every Computer Scientist Should Know About Floating-Point Arithmetic', Computing Surveys, 23, March 1991, pp. 5-48. An online version is available at http://docs.sun.com/, and there is a supplemented version, in PostScript form, at http://www.validgh.com/goldberg/paper.ps.

Information related to the IEEE 754 floating-point standard by a leading light can be found at http://http.cs.berkeley.edu/%7Ewkahan/ieee754status/; see also slides from the short course referenced from http://http.cs.berkeley.edu/%7Efateman/. http://www.linuxsupportline.com/%7Ebillm/ has a brief guide to IEEE 754, a somewhat x86-GNU/Linux-specific FAQ, and library code for GNU/Linux x86 systems.

The supplement to the PostScript-formatted Goldberg document, referenced above, is available in HTML format. See `Differences Among IEEE 754 Implementations' by Doug Priest, available online at http://www.validgh.com/goldberg/addendum.html. This document explores some of the issues surrounding computing of extended (80-bit) results on processors such as the x86, especially when those results are arbitrarily truncated to 32-bit or 64-bit values by the compiler as "spills".

(Note: g77 specifically, and gcc generally, do arbitrarily truncate 80-bit results during spills as of this writing. It is not yet clear whether a future version of the GNU compiler suite will offer 80-bit spills as an option, or perhaps even as the default behavior.)

The GNU C library provides routines for controlling the FPU, along with documentation about them. See Floating-point precision, regarding IEEE 754 conformance.