Fixed-point vs. Floating-point

Both are ways to represent fractional (non-integer) numbers with a bounded range and limited precision.


Fixed-point

With a fixed-point number, the point position is static.

C++ has no fundamental fixed-point type yet; fixed-point values are usually stored in ordinary integer types.

Typical fixed-point layout:
– Q15.16 (32-bit): 1 bit sign, 15 bits integer part, 16 bits fractional part
– Q31.32 (64-bit): 1 bit sign, 31 bits integer part, 32 bits fractional part

Example: 10.186 in Q15.16

Calculation:
stored_val = real_val × 2^16
           = 10.186 × 65536
           = 667549.696
           ≈ 667550 (rounded to nearest)

Representation:
      Sign (1) | Integer (15)       | Fraction (16)
Bin | 0        | 000 0000 0000 1010 | 0010 1111 1001 1110
Hex | 0        | A                  | 2F9E
Dec | 0        | 10                 | 12190

Reversing:
real_val = (−1)^sign × stored_val / 2^16
         = 667550 / 65536
         ≈ 10.18600464
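
The calculation and its reversal can be sketched in C++. This is a minimal sketch, not a library API: the names to_q15_16 and from_q15_16 are hypothetical, and it assumes the value is stored as a two's-complement std::int32_t, which matches the layout above for non-negative values.

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>

// Scale factor for 16 fractional bits: 2^16.
constexpr double kScale = 65536.0;

// Encode: stored_val = round(real_val * 2^16).
// Hypothetical helper; assumes real_val is roughly within [-32768, 32768).
std::int32_t to_q15_16(double real_val) {
    const long long scaled = std::llround(real_val * kScale);
    // Guard against values that do not fit into 32 bits.
    assert(scaled >= INT32_MIN && scaled <= INT32_MAX);
    return static_cast<std::int32_t>(scaled);
}

// Decode: real_val = stored_val / 2^16.
double from_q15_16(std::int32_t stored_val) {
    return static_cast<double>(stored_val) / kScale;
}
```

With these helpers, to_q15_16(10.186) gives 667550 (hex 000A2F9E), and from_q15_16(667550) gives 10.186004638671875.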

Fixed-point operations are ordinary integer operations, so they are executed by the integer ALU.
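
To illustrate that only integer instructions are needed, here is a small sketch of Q15.16 addition and multiplication. The alias q15_16 and the functions q_add and q_mul are hypothetical names; overflow handling is deliberately left out of q_add.

```cpp
#include <cassert>
#include <cstdint>

using q15_16 = std::int32_t;  // hypothetical alias: raw Q15.16 value

// Addition: both operands share the scale 2^16, so a plain integer add works.
// Overflow is not checked in this sketch.
q15_16 q_add(q15_16 a, q15_16 b) {
    return a + b;
}

// Multiplication: the 64-bit product carries 32 fractional bits,
// so it is shifted right by 16 to get back to 16 fractional bits.
// The shift truncates (arithmetic shift, guaranteed since C++20).
q15_16 q_mul(q15_16 a, q15_16 b) {
    const std::int64_t wide = static_cast<std::int64_t>(a) * b;
    const std::int64_t shifted = wide >> 16;
    assert(shifted >= INT32_MIN && shifted <= INT32_MAX);
    return static_cast<q15_16>(shifted);
}
```

For example, 1.5 is stored as 98304 and 2.25 as 147456; q_add yields 245760 (3.75) and q_mul yields 221184 (3.375).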


Floating-point

With a floating-point number, the point position is dynamic.

In C++, float, double, and std::floatN_t (C++23, header <stdfloat>) are floating-point types.
– float/double are usually 32-bit/64-bit IEEE 754 types, but this is not guaranteed on every system.
– std::floatN_t is optional and is only provided if the implementation supports the corresponding IEEE 754 format.

The IEEE 754 binary floating-point layouts:
– binary32 (32-bit): 1 bit sign, 8 bits exponent (bias 127), 23 bits mantissa (std::float32_t, usually also float)
– binary64 (64-bit): 1 bit sign, 11 bits exponent (bias 1023), 52 bits mantissa (std::float64_t, usually also double)
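
Whether float and double actually use these layouts on a given system can be checked at compile time. The following is a sketch with a hypothetical helper, is_ieee_binary. Note that std::numeric_limits<T>::digits counts the implicit leading 1, so it is 24 for binary32 and 53 for binary64.

```cpp
#include <climits>
#include <cstddef>
#include <limits>

// Hypothetical helper: true if T is an IEEE 754 (IEC 60559) type with the
// given total width and significand precision (including the implicit bit).
template <typename T>
constexpr bool is_ieee_binary(std::size_t total_bits, int precision_bits) {
    return std::numeric_limits<T>::is_iec559 &&
           sizeof(T) * CHAR_BIT == total_bits &&
           std::numeric_limits<T>::digits == precision_bits;
}

// 23 stored mantissa bits + 1 implicit bit = 24 digits.
constexpr bool float_is_binary32 = is_ieee_binary<float>(32, 24);
// 52 stored mantissa bits + 1 implicit bit = 53 digits.
constexpr bool double_is_binary64 = is_ieee_binary<double>(64, 53);
```

Code that depends on the exact bit layout can guard itself with static_assert(float_is_binary32, "...").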

Example: 10.186 in IEEE 754 binary32

Calculation:
real_exp = floor(log2(real_val)) = floor(log2(10.186)) ≈ floor(3.35) = 3
stored_exp = real_exp + bias = 3 + 127 = 130
real_man = real_val / 2^real_exp ≈ 1010.001011111001110110110₂ / 2^3 = 1.010001011111001110110110₂
stored_man = (real_man − 1) × 2^23 ≈ (1.010001011111001110110110₂ − 1) × 2^23 ≈ 01000101111100111011011₂ (rounded to 23 bits)

Representation:
      Sign (1) | Exponent (8) | Mantissa (23)
Bin | 0        | 1000 0010    | 0100 0101 1111 0011 1011 011
Hex | 0        | 82           | 22F9DB
Dec | 0        | 130          | 2292187

Reversing:
real_val = (−1)^sign × (1 + stored_man / 2^23) × 2^(stored_exp − 127)
         = (1 + 2292187 / 2^23) × 2^3
         ≈ 10.18599987
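
The three fields can also be read directly from a float in C++. This is a minimal sketch that assumes float is IEEE 754 binary32; the struct Binary32Fields and the functions split_binary32 and join_binary32 are hypothetical names, and join_binary32 only handles normal numbers (no zero, subnormal, infinity, or NaN).

```cpp
#include <cassert>
#include <cmath>
#include <cstdint>
#include <cstring>
#include <limits>

// Hypothetical container for the three binary32 fields.
struct Binary32Fields {
    std::uint32_t sign;        // 1 bit
    std::uint32_t stored_exp;  // 8 bits, biased by 127
    std::uint32_t stored_man;  // 23 bits, without the implicit leading 1
};

// Copy the raw bits of the float and mask out the fields.
Binary32Fields split_binary32(float value) {
    static_assert(std::numeric_limits<float>::is_iec559 && sizeof(float) == 4,
                  "this sketch assumes float is IEEE 754 binary32");
    std::uint32_t bits = 0;
    std::memcpy(&bits, &value, sizeof bits);
    return {bits >> 31, (bits >> 23) & 0xFFu, bits & 0x7FFFFFu};
}

// Reverse: (-1)^sign * (1 + stored_man / 2^23) * 2^(stored_exp - 127).
double join_binary32(const Binary32Fields& f) {
    assert(f.stored_exp != 0 && f.stored_exp != 255);  // normal numbers only
    const double man = 1.0 + f.stored_man / 8388608.0;  // 8388608 = 2^23
    const double mag = std::ldexp(man, static_cast<int>(f.stored_exp) - 127);
    return f.sign ? -mag : mag;
}
```

For 10.186f, split_binary32 yields sign 0, stored_exp 130, and stored_man 2292187, matching the table above.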

Floating-point operations are executed by the FPU.


Comparison

                | Fixed-point     | Floating-point
Range           | Fixed (smaller) | Larger
Precision       | Constant        | Depends on the magnitude [1]
Performance [2] | Faster          | Slower

[1] If the magnitude is small, the precision is high, and vice versa.
[2] On modern CPUs, the performance can be nearly the same.
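
The precision difference can be made concrete by measuring the gap to the next representable value. This is a sketch assuming float is IEEE 754 binary32; float_step and q15_16_step are hypothetical names.

```cpp
#include <cassert>
#include <cmath>
#include <limits>

// Gap between x and the next representable float above it.
// This gap grows with the magnitude of x.
float float_step(float x) {
    assert(std::isfinite(x));
    return std::nextafter(x, std::numeric_limits<float>::infinity()) - x;
}

// In Q15.16 the gap is the same everywhere in the range: 2^-16.
constexpr double q15_16_step = 1.0 / 65536.0;
```

Under that assumption, float_step(1.0f) is 2^-23 and float_step(10.186f) is 2^-20, both finer than the Q15.16 step of 2^-16, while float_step(1024.0f) is 2^-13 and therefore coarser.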