Is there any explanation I've missed in this article? #111432
Fixed-point and floating-point representations are two methods used in computing to store real numbers, each with its own advantages and disadvantages. Understanding how these representations work, especially the conversion from floating-point to fixed-point, requires a deep dive into their structure and the principles behind them.
Fixed-Point Representation
Fixed-point numbers are stored as integers in memory, with a notional radix point (the equivalent of the decimal point in decimal numbers) placed at a fixed position relative to the least significant bit. The position of this radix point is determined by a scaling factor, which is a power of 2: a positive power moves the radix point to the right (scaling the value up), a negative power moves it to the left (scaling the value down), and a power of zero leaves it unchanged [1](https://andybargh.com/fixed-and-floating-point-binary/).
Advantages: Fixed-point representation offers performance benefits because the CPU can perform the underlying integer arithmetic efficiently. It is well suited to applications where the range of values is known in advance and stays within what the fixed-point format can represent.
Disadvantages: The range of values that can be represented is limited, which makes fixed-point less suitable for applications requiring a wide range of values or high precision [1](https://andybargh.com/fixed-and-floating-point-binary/).
Floating-Point Representation (IEEE 754)
Floating-point numbers are represented using a sign bit, an exponent, and a mantissa (also known as the significand). The sign bit indicates whether the number is positive or negative, the exponent gives the power of two by which the mantissa is scaled, and the mantissa holds the significant digits of the number. The exponent is stored with a bias so that negative exponents do not require their own sign bit, which simplifies the representation [1](https://andybargh.com/fixed-and-floating-point-binary/) [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
Exponent Representation: The exponent is stored with a bias so that both positive and negative exponents can be represented. For example, in a 32-bit (single-precision) floating-point number the exponent field is 8 bits wide, the bias is 127, and the effective exponent ranges from -126 to 127 [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
Mantissa Representation: The mantissa contains the significant digits of the number. In a 32-bit floating-point number, the mantissa occupies the 23 bits remaining after the sign bit and the exponent, with an implicit leading 1 for normalized values [7](https://www.geeksforgeeks.org/introduction-of-floating-point-representation/).
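A quick way to see these fields is to unpack a value's single-precision bit pattern; the sketch below is illustrative only (the function name and the `struct`-based packing are choices made here, not part of the cited articles):

```python
import struct

def decode_float32(x):
    """Split a float into its IEEE 754 single-precision fields."""
    bits = struct.unpack(">I", struct.pack(">f", x))[0]   # raw 32-bit pattern
    sign     = (bits >> 31) & 0x1      # 1 bit
    exponent = (bits >> 23) & 0xFF     # 8 bits, stored with a bias of 127
    mantissa = bits & 0x7FFFFF         # 23-bit fraction field
    return sign, exponent - 127, mantissa   # exponent returned unbiased

print(decode_float32(6.5))   # (0, 2, 5242880) -> +1.625 * 2**2 = 6.5
```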
Conversion from Floating-Point to Fixed-Point
To convert a floating-point number to a fixed-point representation, you need to decide on the number of fractional bits (the bits dedicated to the fractional part of the number) and hence the scaling factor. The algorithm involves determining the number of bits required for the integer part and the fractional part, calculating the scale, and then quantizing the floating-point number with this scale. The quantized number is then clipped to the minimum and maximum values that can be represented with the given bit width [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
Here's a simplified version of the algorithm:
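A minimal sketch in Python, assuming a signed two's-complement fixed-point format with `total_bits` total bits and `frac_bits` fractional bits (the names and the round-then-clip behaviour are illustrative, not necessarily identical to the linked article's code):

```python
def float_to_fixed(value, total_bits=8, frac_bits=4):
    """Quantize a float to a signed fixed-point code: scale, round, then clip."""
    scale = 1 << frac_bits                  # scaling factor = 2**frac_bits
    quantized = int(round(value * scale))   # quantize to the nearest representable step
    lo = -(1 << (total_bits - 1))           # most negative code for this bit width
    hi = (1 << (total_bits - 1)) - 1        # most positive code for this bit width
    return max(lo, min(hi, quantized))      # clip (saturate) to the representable range

def fixed_to_float(code, frac_bits=4):
    """Recover the approximate real value from a fixed-point code."""
    return code / (1 << frac_bits)
```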
This process ensures that the floating-point number is represented as accurately as possible as a fixed-point number, taking into account the limitations of fixed-point representation in terms of range and precision [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
Example
Consider converting a floating-point number to an 8-bit fixed-point representation with 4 fractional bits. With 4 fractional bits the scale is 2^4 = 16, so the algorithm multiplies the floating-point number by 16, rounds it to an integer, and then clips the result to fit within the 8-bit range. This ensures the floating-point number is represented as closely as possible within the constraints of fixed-point representation [5](http://www.kaizou.org/2023/05/quantization-fixed-point.html).
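Using the `float_to_fixed` sketch above (again, an assumed signed two's-complement layout rather than the article's exact code), the example works out as follows:

```python
code = float_to_fixed(3.3, total_bits=8, frac_bits=4)
print(code, fixed_to_float(code, frac_bits=4))   # 53 3.3125  (round(3.3 * 16) = 53)

# A value outside the representable range saturates at the largest code:
big = float_to_fixed(20.0, total_bits=8, frac_bits=4)
print(big, fixed_to_float(big, frac_bits=4))     # 127 7.9375
```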
In summary, understanding the underpinnings of fixed-point and floating-point representations, especially the conversion from floating-point to fixed-point, is crucial for applications requiring precise control over numerical representation and performance.