
Differences Between float and double Types

Category Programming Technology

float (Single Precision Floating Point) occupies 4 bytes in memory and is described by 32 bits in binary.

double (Double Precision Floating Point) occupies 8 bytes in memory and is described by 64 bits in binary.

Floating point numbers are stored in memory in a binary exponential (scientific notation) format. On virtually all current platforms this is the IEEE 754 layout, which has three fields: sign, exponent, and mantissa (fraction).

The sign occupies 1 bit and indicates whether the number is positive (0) or negative (1).

The exponent has no separate sign bit. It is stored as an unsigned value with a fixed bias added (127 for float, 1023 for double), so negative exponents appear as stored values below the bias.

The mantissa represents the significant digits of the floating point number. A normalized binary value has the form 1.xxxxxxx, and the leading 1 and the binary point are not stored.

The C and C++ standards do not fix the number of bits allocated to the exponent and mantissa, so in principle it depends on the computer system. IEEE 754 implementations use the following widths:

float: 1 sign bit, 8 exponent bits, and 23 mantissa bits (24 bits of precision counting the implicit leading 1).

double: 1 sign bit, 11 exponent bits, and 52 mantissa bits (53 bits of precision counting the implicit leading 1).

Knowing the bit allocation of these three fields, you can work out the range in binary and then convert it to decimal to understand the numerical range.

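For programmers, the practical difference between double and float is precision: double carries 15-16 significant decimal digits, while float carries 6-7. In exchange, double consumes twice as much memory as float. Whether it is also slower depends on the hardware: scalar double arithmetic is often about as fast as float on current desktop CPUs, but float still wins where memory bandwidth, cache footprint, or SIMD and GPU throughput dominate. In C, the mathematical functions for double and float have different names (for example, sin takes a double and sinf takes a float), so do not mix them up. Use single precision when its accuracy is sufficient and memory or speed matters.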

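Type	Bits	Significant Digits	Range (magnitude of normalized values)
float	32	6-7	±1.2×10^-38 to ±3.4×10^38
double	64	15-16	±2.2×10^-308 to ±1.8×10^308
long double	128	18-19	±3.4×10^-4932 to ±1.2×10^4932

The long double row assumes the x87 80-bit extended format padded to 128 bits of storage, as on typical x86-64 GCC and Clang builds. On Visual C++, long double has the same format as double.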

In summary, float is single precision, occupying 4 bytes in memory, with about 7 significant digits (its 24 bits of binary precision correspond to roughly 7.2 decimal digits); double is double precision, occupying 8 bytes, with 15-16 significant digits. By default, on my computer and VC++6.0 platform, both display 6 significant digits. That comes from the default output precision of 6 (for example in cout), not from the types themselves.

Example: In C and C++, the following assignment statement:

float a=0.1;

The compiler issues a warning: warning C4305: 'initializing' : truncation from 'const double ' to 'float '

Reason: In C and C++, an unsuffixed floating-point literal such as 0.1 has type double. This is a language rule, not something specific to VC++; only the warning number C4305 is. Initializing a float with it therefore narrows the value, hence the warning. Writing 0.1f makes the literal a float and resolves the issue.

I typically use double rather than float.

In C and C#, floating-point values are stored as single-precision float or double-precision double, where float data occupies 32 bits and double data occupies 64 bits. When you declare a variable such as float f = 2.25f, how are those bits laid out in memory? If the layout were arbitrary, the world would be chaotic. In practice, both float and double follow the IEEE 754 standard: float uses the format labeled R32.24 (32 bits in total, 24 bits of precision), and double uses R64.53 (64 bits in total, 53 bits of precision).

Both single and double precision storage consists of three parts:

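Sign bit: 0 for positive, 1 for negative.
Exponent: Stores the power of two from the binary scientific notation as a biased (offset) value: the actual exponent plus 127 for float, or plus 1023 for double.
Mantissa: Stores the fraction bits after the binary point of the normalized value 1.xxx; the leading 1 is implied and not stored.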

Source: https://my.oschina.net/zd370982/blog/724265

Note by TRDSF

Precision Differences Between float and double Types in C++

double has higher precision with 15-16 significant digits, while float has lower precision with 6-7 significant digits. However, double consumes twice as much memory as float and can be slower, particularly in memory-bound, vectorized, or GPU code. It is recommended to use float when its precision is sufficient and memory or speed matters.

#include <iostream>
#include <iomanip>
using namespace std;

int main()
{
    float a = 12.257902012398877f;    // f suffix: a float literal, so no truncation warning
    double b = 12.257902012398877;
    const float PI = 3.1415926f;      // Constant definition
    cout << setprecision(15) << a << endl;   // Only 6-7 significant digits, the rest are inaccurate
    cout << setprecision(15) << b << endl;   // 15-16 significant digits, so it is accurate
    cout << setprecision(15) << PI << endl;  // float again: digits past the 7th are inaccurate
    return 0;
}
