Floating point is a mathematically elegant form of data storage whose beauty is often overlooked.

If you use a programming language that supports IEEE 754 single- or double-precision floating point, you may one day be exposed to phenomena such as:

$ php -a
Interactive shell

php > echo(0.1 + 0.7);
0.8
php > echo(floor((0.1 + 0.7) * 10));
7

If you execute similar code in Python, you will gain insight into why this is happening:

>>> 0.1 + 0.7
0.7999999999999999

If you peruse the PHP documentation on floating point numbers, you will see why: the IEEE 754 double closest to 0.1 + 0.7 is in fact approximately 0.79999999999999993, slightly less than 0.8.

Multiplying by 10 then gives approximately 7.9999999999999991118, slightly less than 8.

And when you floor() that, PHP discards the fractional part of the internal floating point value, yielding 7 instead of the intuitively expected 8.
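
You can confirm these exact internal values yourself. Here is a quick sketch using Python's decimal module, which can print the exact decimal expansion of a stored double; any language that exposes IEEE 754 doubles would show the same values:

from decimal import Decimal
import math

total = 0.1 + 0.7
print(Decimal(total))          # 0.79999999999999993338661852249060757458209991455078125
print(Decimal(total * 10))     # 7.99999999999999911182158029987476766109466552734375
print(math.floor(total * 10))  # 7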

The same behavior appears in any language that uses IEEE 754 floating point, and it affects double precision just as much as single precision. Simply throwing more bits of precision at the problem does not solve it; it only gives that 7.99999… representation more 9’s before its trailing digits.
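
As a quick illustration, here is a sketch that uses Python's struct module to round a value through single precision (the to_float32 helper is just an illustrative name). A 32-bit float simply lands on a different, coarser approximation of 0.1 than a 64-bit double does; neither is exact:

import struct
from decimal import Decimal

def to_float32(x):
    # Round a Python double to the nearest IEEE 754 single-precision value.
    return struct.unpack('f', struct.pack('f', x))[0]

print(Decimal(to_float32(0.1)))  # 0.100000001490116119384765625  (single precision)
print(Decimal(0.1))              # 0.1000000000000000055511151231257827021181583404541015625  (double precision)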

This is because IEEE 754 relies on fractional binary (fractional base-2), which cannot perfectly represent decimal numbers like 0.1 and 0.7. As a result, there is a fractional binary rounding error when operating on such numbers.
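
You can see this directly by asking Python for the hexadecimal form of the stored doubles, which maps straight onto their binary representation. The repeating digits are the binary analogue of a repeating decimal such as 1/3 = 0.333…:

print((0.1).hex())   # 0x1.999999999999ap-4   (0.1 is 0.000110011001100... in binary, repeating)
print((0.7).hex())   # 0x1.6666666666666p-1   (0.7 is 0.101100110011001... in binary, repeating)
print((0.5).hex())   # 0x1.0000000000000p-1   (0.5 is 0.1 in binary: exact, because 0.5 = 2^-1)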

This is also not a flaw; it is an inherent part of the tradeoff made by the IEEE 754 standard.

To understand this tradeoff, you have to appreciate the problem that the IEEE 754 standard is trying to solve. It is trying to map an uncountably infinite set (the real numbers) onto a finite sequence of bits, with an eye towards minimizing space (memory) usage.

As a result, a choice has to be made about “what values” are encoded. What values are “most useful” to encode?

A uniform approach, where the interval between encoded values is constant, is not useful here, because it is simply a re-implementation of signed and unsigned two’s complement integers.

If space and memory were not constraints, then realistically, it would simply make sense to extend two’s complement integers to arbitrary precision, stringing multiple integers together to build numbers with whatever precision is desired.
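
Python's built-in arbitrary-precision integers, and the fractions module layered on top of them, give a taste of what that looks like in practice; the cost is unbounded memory use and slower arithmetic:

from fractions import Fraction

# Exact rational arithmetic built on arbitrary-precision integers: no rounding error at all.
total = Fraction(1, 10) + Fraction(7, 10)
print(total)        # 4/5
print(total * 10)   # 8 -- the intuitively expected answer, exactly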

Especially “back in the day” (pre-1980s), space and memory were HUGE considerations. Remember, people were dealing with kilobytes and megabytes if they were lucky.

The scientists and engineers working on the spec for standardization focused on these ranges:

  • Extremely small numbers
  • Extremely large numbers
  • Plus the full range of 32-bit signed integers

Oh, and they intended to capture these ranges within 32 bits.

You are (hopefully) thinking: “wait a second… they wanted to capture the full range of 32-bit signed integers PLUS more ranges, in just 32 bits?”
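
To get a feel for how much range actually fits into 32 bits, you can decode a few specific bit patterns as single-precision floats. Here is a small Python sketch using the struct module (the f32_from_bits helper is just an illustrative name; the bit patterns are the standard single-precision extremes):

import struct

def f32_from_bits(hexbits):
    # Interpret a 32-bit pattern (given as a big-endian hex string) as an IEEE 754 single.
    return struct.unpack('>f', bytes.fromhex(hexbits))[0]

print(f32_from_bits('7f7fffff'))  # ~3.4028235e+38 : largest finite single-precision value
print(f32_from_bits('00000001'))  # ~1.4e-45       : smallest positive (subnormal) value
print(f32_from_bits('4f000000'))  # 2147483648.0   : 2^31, the magnitude of the int32 range, comfortably covered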

Hopefully now you’re beginning to see exactly why floating point is so brilliant and beautiful.

If you read the IEEE 754-2008 specification itself, you begin to see the mathematical methodology by which this was accomplished.

The density of floating point values is much higher near zero, i.e. representable numbers are clustered much closer together, while the density at large magnitudes is much lower, i.e. representable numbers are much more spread out.
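
Python exposes this directly through math.ulp() (available in Python 3.9 and later), which reports the gap between a double and the next representable double above it:

import math

print(math.ulp(1.0))     # ~2.22e-16 : doubles near 1 are packed extremely densely
print(math.ulp(1.0e6))   # ~1.16e-10 : still dense, but the gap has grown
print(math.ulp(1.0e15))  # 0.125     : the gap is approaching 1
print(math.ulp(1.0e20))  # 16384.0   : consecutive doubles are now thousands apart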

This produces a “generally” useful and efficient number encoding system that is extremely valuable to many science, engineering, and computation fields, including graphics and video games.

It also produces a system that is undesirable for financial applications. In other words, if you’re using floats to track money, you’re probably doing it wrong.
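
The usual alternatives for money are integer minor units (e.g. cents) or a decimal type. A small sketch with Python's decimal module shows why:

from decimal import Decimal

# Add ten cents, ten times.
print(sum(0.1 for _ in range(10)))              # 0.9999999999999999 -- the float drifts
print(sum(Decimal("0.10") for _ in range(10)))  # 1.00 -- decimal arithmetic stays exact here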

Love this stuff?
Computer Science Fundamentals – Variables and Data

