*Have you been using float or double variables to perform mathematical operations on embedded systems without a Floating-Point Unit (FPU)? You are doing it wrong! That’s incredibly inefficient. Use fixed-point representation instead.*

An FPU is an hardware block specially designed to carry on arithmetic operations on floating point numbers. Even though the C/C++ code may work without an FPU, it’s always much faster to use hardware designed for a specific purpose, like this one, instead of relying on a software implementation, something that the compiler will do for you, knowing the hardware restrictions you have but not in an efficient manner. Essentially, it will generate a lot of assembly code, greatly increasing the size of your program and the amount of time required to complete the operation. Thus, if you don’t have an FPU available and you still want to perform those arithmetic operations efficiently you’ll have to *convert* those numbers to *fixed-point representation*. Integers! But how? By scaling them. Let’s see how that scaling value may be determined.

The scaling value as well as the resulting scaled number, which is an integer, really much depends on the bitness of the CPU’s architecture being used. You want to use values that fit in the available registers which have the same width as the CPU buses. So, whether you are working with an 8, 16 or 32-bit architecture, the range of integer values we can store on those registers, being *b* the number of bits and representing numbers in two’s complement, is given by:

### Fixed-Point Representation

If one bit is used to represent the sign (and in this text we’ll always consider signed numbers) the remaining ones may be used to represent the integer and fractional parts of the floating-point number.We may textually represent this format as follows (denoted as Q-format):

Where *m* corresponds to the bits available to represent the integer part of and *n* corresponds to the bits available to represent the fractional part. If *m* is zero you may use just . So, when you use a register to save both integer and fractional parts (and the sign bit!), the value range is then given by:

(note that the expression above is a particular case of this one, for n=0 and m=b-1).

It’s up to you deciding how many bits are reserved for *m* and *n* (still, you should base your decision on a good criteria: the more bits, the greater the precision you can achieve; more on this bellow). So, you are essentially *fixing* an *imaginary point* in your register that separates the integer and fractional parts. That’s why it’s called fixed-point.

Now, consider the following floating-point number:

Since the integer part is zero, you have bits to represent the fractional part. You do that by multiplying the floating-point number by . And that’s our scaler! Simple as that. For 8, 16 and 32-bit architectures, these are the resulting scalers and corresponding scaled values (i.e. the floating-point number represented in fixed-point):

Q-format | Scaler | ||
---|---|---|---|

Q7 | 15.802 | 16 | |

Q15 | 4045.4 | 4045 | |

Q31 | 265119741.247488 | 265119741 |

Yes, after scaling the floating-point number you may still get a floating-point number as well, so you have to round it to get an integer.

### Precision (Or The Lack of It)

Using finite word lengths (i.e. 8, 16 or 32-bit registers/variables) limits the precision with which we can represent a number. The more the bits we have, the greater the precision we can achieve. It’s a trade-off. Usually, the less bits you have available (in the architecture as well as on physical memory), the cheaper is the hardware.

Thus, it is important to respond to the question: “how many bits do I need to represent a floating-point number in fixed-point format and retaining the same accuracy?”

The answer is given by this expression:

Where d is the number of digits of a fractional number. For example, to retain the same accuracy of 0.123456 we need at least bits.

### Arithmetic Operations

Now if you want to perform arithmetic operations with fixed-point numbers, there are some *rules* to follow.

#### Multiplication

Multiplying two numbers with the same base and different exponents results in . Since for each operand we have possible signed numbers, by multiplying them we’ll get possible signed numbers as a result. That means we need twice the space to store the result of a multiplication and that you get 2 sign bits in it. For example, if both operands are 16-bit values, we’ll need a 32-bit register to store the result, where the two most significant bits are sign bits and the remaining 30 bits contain the result.

The same principle applies to fixed-point multiplication and the output Q-format is given by:

In the example above, if we represent the operands in Q15 format, then we’ll get a result in Q30 format. To remove the extra sign bit you only need to shift-left the result. You may also want to reuse it in another multiplication. In that case, it’s useful to convert the result to the same Q-format as the operands. That means we have to *truncate* our value. Discarding bits containing information will result in loss of precision. In order to reduce that loss the result is *rounded* before truncation. Adding to our result will do the rounding. Truncation is then accomplished by right-shifting the result by *n* bits or by (*n*+1) bits in case the extra sign bit was already discarded (by left-shifting).

The following code snippet illustrates these operations by multiplying 0.123456 by itself. This code can also be used to test fixed-point multiplication for different Q-formats.

By running it with Q7, Q15 and Q31 formats we can build the following table:

Format | Result (Fixed) | Result (Float) | Error |
---|---|---|---|

Float | - | 0.015241 | – |

Q7 | 2 | 0.015625 | 0.000384 |

Q15 | 500 | 0.015259 | 0.000017 |

Q31 | 32730621 | 0.015241 | 0 |

As you can see, the result obtained with Q31 is the only one where the error is zero. That’s because it’s the only format in which we can retain the same accuracy as the floating-point numbers used (the multiplication operands).

#### Division

Division is trickier. When dividing two numbers with the same base and different exponents, the exponents are subtracted: , which means:

If both operands were represented in the same format you would get the result in Q0. That’s a problem! But has very simple solution: if we convert the dividend to format, then we get the result in . This is done by left-shifting the dividend by *n*. We also want to round our values. In division this is accomplished by adding half the divisor to the dividend before proceeding with the division itself. The following code snippet illustrate these operations.

There’s still another “trick”. Consider the following two operands and divisions:

The result of x/y can be represented with (i.e., Q7, Q15, Q31). However, for y/x at least 3 bits must be reserved for the decimal part. Hence, possible formats to use in this operation, for both operands and, consequently, the result, would be Q3.4, Q3.12 and Q3.28.

Unlike multiplication, the division of two fractional numbers without integer part, may or may not result in a number with integer part. This is of extreme importance when choosing the most appropriate Q-format.

Again, the following tables can be built:

Format | x/y Result (Fixed) | x/y Result (Float) | Error |
---|---|---|---|

Float | - | 0.18868 | – |

Q7 | 23 | 0.179688 | -0.008992 |

Q15 | 6182 | 0.18866 | 0.000012 |

Q31 | 405182993 | 0.188679 | 0.000002 |

Format | y/x Result (Fixed) | y/x Result (Float) | Error |
---|---|---|---|

Float | - | 5.3 | – |

Q3.4 | 160 | 10.0 | 4.3 |

Q3.12 | 21737 | 5.306885 | 0.006885 |

Q3.28 | 1422717077 | 5.300034 | 0.000034 |

We can conclude that in Q3.4 format there aren’t enough bits to get a result with reasonable accuracy. We can also see that Q31 is the only format that, by rounding the result to the same decimal place as the “original result” (float format) would yield the same exact value.

Several small microprocessors may not have an hardware divider available, unlike multipliers which are very common. So the division may still be considered as an heavy operation (to be avoided). However, it’s always more efficient to do it with integers than with floats.

#### Addition and Subtraction

Addition and subtraction are equivalent operations since we are dealing with signed numbers. Two possible issues with these operations are: *overflow* and *underflow*, if the resulting value is larger/smaller than the one that can be stored in hardware.

The rule is: the sum of N individual b-bit binary numbers can require as many as bits to represent the results. For example, the sum of two 8-bit words requires an accumulator whose word length is to ensure no overflow errors occur.

Also, in fixed-point representation the operands to be added or subtracted must be in the same Q-format.

### Additional Notes

Why did we choose as the scaler value? Couldn’t it be another number? Usually, we, humans, tend to use a value to scale or aid on the representation of fractional numbers, such as scientific notation. That’s a value we easily understand, it just moves the decimal place and the digits keep the same. However, in hardware values are stored in binary and as such using powers of two is more efficient. That’s what allows us to convert numbers in different Q-formats by simply shifting them and thus improving the efficiency of arithmetic operations.

Now you should feel comfortable enough to embrace the world of fixed-point representation. However, if you don’t know where to use it (or where it’s commonly used), let me tell you one thing: fixed-point representation is thoroughly used in Digital Signal Processing (DSP). You may use it to create digital audio effects, create inertial measurement units or extract other meaningful data from digital (and noisy) signals by filtering and processing them.

On the next post I will write about some types of digital filters and how they can be implemented in fixed-point.

The full source code to test fixed-point multiplication and division is available here.

Did you find this interesting? Let me know what you think.

Great post, really understandable!

I believe you miss-typed an “m” when you meant an “n” in the “fractional part (m bits)” from your graphic http://theramblingness.com/wp-content/uploads/2015/07/fixedpoint.png .

Hope this helps.

Best regards!

Thanks Josep!

Indeed, it was a typo, it´s fixed now.