So you summed some floats and got some unexpected results? Don’t be afraid, it’s normal.
When summing floats you get errors not only because some numbers can’t be represented exactly as a float, but also because of differences in magnitude between the numbers you add.
Say you have a large float and add a small one to it: the small one will lose a lot of its precision in the process. You can see what happens if you use fixed-width decimal numbers:
  12345.6
+     0.12345
-------------
  12345.72345

With six significant digits only 12345.7 can be stored; the tail of the small number is lost.
If you only have one addition, the error could be acceptable, but if you have to sum a lot of numbers the small ones could start to be a significant part of the sum, and that part can get lost in the process.
If you had:
  12345.6
+     0.05123
-------------
  12345.65123   (stored as 12345.6)
+     0.05123
-------------
  12345.65123   (stored as 12345.6 again)
You just lost 0.1, even though you could have avoided that by reordering the operations:
      0.05123
+     0.05123
-------------
      0.10246
+ 12345.6
-------------
  12345.70246
So what can you do?
Actually, you have multiple choices, depending on what you want to achieve:
- You can sort the numbers (small to big) and sum them in that order. This groups together numbers of similar magnitude and yields smaller errors. But it’s not the best method: there are more accurate methods that don’t need a sort.
- You can use Kahan summation. This is more accurate than sorting and it should be faster too.
- You can use doubles. This is the best method, but what if you’re summing a vector of doubles in the first place?
- If you’re feeling adventurous you can run the sum on the FPU, but not many places let you insert assembly in the codebase.
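Kahan summation keeps a running compensation term that captures the low-order bits each addition loses and feeds them back into the next one. A minimal sketch (my own version, not the exact code from the download below; note that aggressive flags like -ffast-math would let the compiler optimize the compensation away):

```cpp
#include <vector>

// Kahan (compensated) summation: 'c' accumulates the rounding error
// lost by each addition and corrects the next addend with it.
float kahan_sum(const std::vector<float>& values) {
    float sum = 0.0f;
    float c = 0.0f;                 // running compensation
    for (float x : values) {
        float y = x - c;            // apply the correction from last round
        float t = sum + y;          // big + small: low bits of y get lost
        c = (t - sum) - y;          // algebraically zero; captures the lost bits
        sum = t;
    }
    return sum;
}
```

For example, summing 20 million 1.0f values, a naive float loop gets stuck at 16777216 (2^24), while the compensated loop returns the exact 20000000.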
If you want to see some code, you can download this file: Float sums example. That code works on GCC 4.5 with the
Here is the output of a run, summing 10M random floats:
Starting: /home/florin/projects/floats/build/floats
2916.660400390625 <- Reverse sorted
3057.517578125 <- Normal sum
3192.88330078125 <- Sorted
3194.340576171875 <- Kahan summation
3194.340576171875 <- FPU
3194.340576171875 <- Double truncated
3194.34058601572951374691911041736602783203125 <- Double precision