Is there any explanation I've missed in this article? #111432
Fixed-point and floating-point representations are two methods used in computing to store real numbers, each with its own advantages and disadvantages. Understanding how these representations work, especially the conversion from floating-point to fixed-point, requires a deep dive into their structure and the principles behind them.
Fixed-Point Representation
Fixed-point numbers are stored as integers in memory, with a notional radix point (the equivalent of the decimal point in decimal numbers) placed at a fixed position relative to the least significant bit. The position of this radix point is determined by a scaling factor, which is a power of 2: a positive power moves the radix point to the right (scaling the value up), a negative power moves it to the left (scaling the value down), and a power of zero leaves it unchanged [1](https://andybargh.com/fixed-and-floating-point-binary/).
Advantages: Fixed-point representation offers performance benefits because the CPU can perform the underlying integer arithmetic efficiently. It is well suited to applications where the range of values is known in advance and stays within what the fixed-point format can represent.
Disadvantages: The range of values that can be represented is limited, which makes fixed-point less suitable for applications requiring a wide range of values or high precision [1](https://andybargh.com/fixed-and-floating-point-binary/).
Floating-Point Representation (IEEE 754)
Floating-point numbers are represented using a sign bit, an exponent, and a mantissa (also known as the significand). The sign bit indicates whether the number is positive or negative, the exponent gives the power of two by which the mantissa is scaled, and the mantissa holds the significant digits of the number. The exponent is stored with a bias so that negative exponents do not require their own sign bit, which simplifies the representation [1](https://andybargh.com/fixed-and-floating-point-binary/) [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
Exponent Representation: The exponent is stored with a bias so that both positive and negative exponents can be represented. For example, in a 32-bit (single-precision) floating-point number the exponent field is 8 bits wide, the bias is 127, and the effective exponent ranges from -126 to 127 [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
Mantissa Representation: The mantissa contains the significant digits of the number. In a 32-bit floating-point number, the mantissa occupies the 23 bits remaining after the sign bit and the exponent, with an implicit leading 1 for normalized values [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
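A quick way to see these fields is to unpack a value's single-precision bit pattern; the sketch below is illustrative only (the function name and the `struct`-based packing are choices made here, not part of the cited articles):

```python
import struct

def decode_float32(x):
    """Split a float into its IEEE 754 single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1      # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # 23-bit fraction field
    return sign, exponent - 127, mantissa   # exponent returned unbiased

print(decode_float32(6.5))   # (0, 2, 5242880) -> +1.625 * 2**2 = 6.5
```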
Conversion from Floating-Point to Fixed-Point
To convert a floating-point number to a fixed-point representation, you need to decide on the number of fractional bits (the bits dedicated to the fractional part of the number) and hence the scaling factor. The algorithm involves determining the number of bits required for the integer part and the fractional part, calculating the scale, and then quantizing the floating-point number with this scale. The quantized number is then clipped to the minimum and maximum values that can be represented with the given bit width [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
Here's a simplified version of the algorithm:
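A minimal sketch in Python, assuming a signed two's-complement fixed-point format with `total_bits` total bits and `frac_bits` fractional bits (the names and the round-then-clip behaviour are illustrative, not necessarily identical to the linked article's code):

```python
def float_to_fixed(value, total_bits=8, frac_bits=4):
    """Quantize a float to a signed fixed-point code: scale, round, then clip."""
    scale = 1 << frac_bits                  # scaling factor = 2**frac_bits
    quantized = int(round(value * scale))   # quantize to the nearest representable step
    lo = -(1 << (total_bits - 1))           # most negative code for this bit width
    hi = (1 << (total_bits - 1)) - 1        # most positive code for this bit width
    return max(lo, min(hi, quantized))      # clip (saturate) to the representable range

def fixed_to_float(code, frac_bits=4):
    """Recover the approximate real value from a fixed-point code."""
    return code / (1 << frac_bits)
```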
This process ensures that the floating-point number is represented as accurately as possible as a fixed-point number, taking into account the limitations of fixed-point representation in terms of range and precision [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
Example
Consider converting a floating-point number to an 8-bit fixed-point representation with 4 fractional bits. With 4 fractional bits the scale is 2^4 = 16, so the algorithm multiplies the floating-point number by 16, rounds it to an integer, and then clips the result to fit within the 8-bit range. This ensures the floating-point number is represented as closely as possible within the constraints of fixed-point representation [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
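Using the `float_to_fixed` sketch above (again, an assumed signed two's-complement layout rather than the article's exact code), the example works out as follows:

```python
code = float_to_fixed(3.3, total_bits=8, frac_bits=4)
print(code, fixed_to_float(code, frac_bits=4))   # 53 3.3125  (round(3.3 * 16) = 53)

# A value outside the representable range saturates at the largest code:
big = float_to_fixed(20.0, total_bits=8, frac_bits=4)
print(big, fixed_to_float(big, frac_bits=4))     # 127 7.9375
```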
In summary, understanding the underpinnings of fixed-point and floating-point representations, especially the conversion from floating-point to fixed-point, is crucial for applications requiring precise control over numerical representation and performance.