# 4.8 — Floating point numbers

Integers are great for counting whole numbers, but sometimes we need to store very large numbers, or numbers with a fractional component. A floating point type variable is a variable that can hold a real number, such as 4320.0, -3.33, or 0.01226. The floating part of the name floating point refers to the fact that the decimal point can “float”; that is, it can support a variable number of digits before and after the decimal point.

There are three different floating point data types: float, double, and long double. As with integers, C++ does not define the actual size of these types (but it does guarantee minimum sizes). On modern architectures, floating point representation almost always follows IEEE 754 binary format. In this format, a float is 4 bytes, a double is 8, and a long double can be equivalent to a double (8 bytes), 80-bits (often padded to 12 bytes), or 16 bytes.

Floating point data types are always signed (can hold positive and negative values).

| Category | Type | Minimum Size | Typical Size |
|---|---|---|---|
| floating point | float | 4 bytes | 4 bytes |
| | double | 8 bytes | 8 bytes |
| | long double | 8 bytes | 8, 12, or 16 bytes |

Here are some definitions of floating point numbers:

When using floating point literals, always include at least one decimal place (even if the decimal is 0). This helps the compiler understand that the number is a floating point number and not an integer.

Note that floating point literals default to type double. An f suffix is used to denote a literal of type float.

Best practice

Always make sure the type of your literals matches the type of the variables they’re being assigned to or used to initialize. Otherwise an unnecessary conversion will result, possibly with a loss of precision.

Warning

Make sure you don’t use integer literals where floating point literals should be used. This includes when initializing or assigning values to floating point objects, doing floating point arithmetic, and calling functions that expect floating point values.

Printing floating point numbers

Now consider this simple program:

The results of this seemingly simple program may surprise you:

```
5
6.7
9.87654e+06
```

In the first case, std::cout printed 5, even though we typed in 5.0. By default, std::cout will not print the fractional part of a number if the fractional part is 0.

In the second case, the number prints as we expect.

In the third case, it printed the number in scientific notation (if you need a refresher on scientific notation, see lesson 4.7 -- Introduction to scientific notation).

Floating point range

Assuming IEEE 754 representation:

| Size | Range | Precision |
|---|---|---|
| 4 bytes | ±1.18 x 10^-38 to ±3.4 x 10^38 | 6-9 significant digits, typically 7 |
| 8 bytes | ±2.23 x 10^-308 to ±1.80 x 10^308 | 15-18 significant digits, typically 16 |
| 80-bits (typically uses 12 or 16 bytes) | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 18-21 significant digits |
| 16 bytes | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 33-36 significant digits |

The 80-bit floating point type is a bit of a historical anomaly. On modern processors, it is typically implemented using 12 or 16 bytes (which is a more natural size for processors to handle).

It may seem a little odd that the 80-bit floating point type has the same range as the 16-byte floating point type. This is because they have the same number of bits dedicated to the exponent -- however, the 16-byte number can store more significant digits.

Floating point precision

Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3’s going out to infinity. If you were writing this number on a piece of paper, your arm would get tired at some point, and you’d eventually stop writing. The number you were left with would be close to 0.3333333333… (with 3’s going out to infinity), but not exactly equal to it.

On a computer, an infinite length number would require infinite memory to store, and typically we only have 4 or 8 bytes. This limited memory means floating point numbers can only store a certain number of significant digits -- and that any additional significant digits are lost. The number that is actually stored will be close to the desired number, but not exact.

The precision of a floating point number defines how many significant digits it can represent without information loss.

When outputting floating point numbers, std::cout has a default precision of 6 -- that is, it assumes all floating point variables are only significant to 6 digits (the minimum precision of a float), and hence it will round off anything after that.

The following program shows std::cout limiting its output to 6 digits:

This program outputs:

```
9.87654
987.654
987654
9.87654e+006
9.87654e-005
```

Note that each of these has only 6 significant digits.

Also note that std::cout will switch to outputting numbers in scientific notation in some cases. Depending on the compiler, the exponent will typically be padded to a minimum number of digits. Fear not, 9.87654e+006 is the same as 9.87654e6, just with some padding 0’s. The minimum number of exponent digits displayed is compiler-specific (Visual Studio uses 3, some others use 2 as per the C99 standard).

The number of digits of precision a floating point variable has depends on both the size (floats have less precision than doubles) and the particular value being stored (some values have more precision than others). Float values have between 6 and 9 digits of precision, with most float values having at least 7 significant digits. Double values have between 15 and 18 digits of precision, with most double values having at least 16 significant digits. Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies.

We can override the default precision that std::cout shows by using the std::setprecision() function that is defined in the iomanip header.

Outputs:

```
3.333333253860474
3.333333333333334
```

Because we set the precision to 16 digits, each of the above numbers is printed with 16 digits. But, as you can see, the numbers certainly aren’t precise to 16 digits! And because floats are less precise than doubles, the float exhibits more error.

Precision issues don’t just impact fractional numbers, they impact any number with too many significant digits. Let’s consider a big number:

Output:

```
123456792
```

123456792 is greater than 123456789. The value 123456789.0 has 10 significant digits, but float values typically have 7 digits of precision (and the result of 123456792 is precise only to 7 significant digits). We lost some precision! When precision is lost because a number can’t be stored precisely, this is called a rounding error.

Consequently, one has to be careful when using floating point numbers that require more precision than the variables can hold.

Best practice

Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to inaccuracies.

Rounding errors make floating point comparisons tricky

Floating point numbers are tricky to work with due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number with 1 significant digit. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems.

You can see the effects of this in the following program:

This outputs:

```
0.1
0.10000000000000001
```

On the top line, std::cout prints 0.1, as we expect.

On the bottom line, where we have std::cout show us 17 digits of precision, we see that d is actually not quite 0.1! This is because the double had to truncate the approximation due to its limited memory. The result is a number that is precise to 16 significant digits (which type double guarantees), but the number is not exactly 0.1. Rounding errors may make a number either slightly smaller or slightly larger, depending on where the truncation happens.

Rounding errors can have unexpected consequences:

```
1
0.99999999999999989
```

Although we might expect that d1 and d2 should be equal, we see that they are not. If we were to compare d1 and d2 in a program, the program would probably not perform as expected. Because floating point numbers tend to be inexact, comparing floating point numbers is generally problematic -- we discuss the subject more (and solutions) in lesson 5.6 -- Relational operators and floating point comparisons.

One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit. Continued operations would cause this error to become increasingly significant.

Key insight

Rounding errors occur when a number can’t be stored precisely. This can happen even with simple numbers, like 0.1. Therefore, rounding errors can, and do, happen all the time. Rounding errors aren’t the exception -- they’re the rule. Never assume your floating point numbers are exact.

A corollary of this rule is: never use floating point numbers for financial or currency data.

NaN and Inf

There are two special categories of floating point numbers. The first is Inf, which represents infinity. Inf can be positive or negative. The second is NaN, which stands for “Not a Number”. There are several different kinds of NaN (which we won’t discuss here).

Here’s a program showing all three:

And the results using Visual Studio 2008 on Windows:

```
1.#INF
-1.#INF
1.#IND
```

INF stands for infinity, and IND stands for indeterminate. Note that the results of printing Inf and NaN are platform specific, so your results may vary.

Conclusion

To summarize, the two things you should remember about floating point numbers:

1) Floating point numbers are useful for storing very large or very small numbers, including those with fractional components.

2) Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are truncated for output. However, comparisons of floating point numbers may not give the expected results. Performing mathematical operations on these values will cause the rounding errors to grow larger.

### 344 comments to 4.8 — Floating point numbers

• James Ray

Want to know more about how floating point numbers work? See here: https://en.wikipedia.org/wiki/Floating_point.

• Garnik

Hi,

Why does the following code compile at all?

I get a compile error when dividing 0.0/0.0 directly, while the outcome of the above is -nan(ind).

Is this because of the variable initialization, the usage of which C++ does not limit in order to avoid unexpected compile errors?

Thanks,
Garnik

• Alex

If you try to divide by a 0.0 literal, the compiler can easily determine that you're trying to divide by 0.0, and disallow it.
If you try to divide by a variable, the compiler can't easily determine that the variable may hold the value 0.0. Therefore, it allows it, and you get a runtime issue.

• Garnik

Hi Alex,

Thanks for your reply, and like everyone doing your tutorials, I am very much grateful for the work you have done, as well as for being so responsive and supportive.

Best Regards,
Garnik

• Mufti

can someone explain this code for me

a,b,c are variables of type int while d, e f and g are of type float.
•If a= 4, b = 6 and c -3, Find

•d=3a + -ac
•f = d/a
•e = f+2c
•g = f/e

thank you...

• Nurlan

Hello
Alex, for asking the question again, again.
You have stated above: Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies. And the table above states:

| Size | Range | Precision |
|---|---|---|
| 80-bits (12 bytes) | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 18-21 significant digits |
| 16 bytes | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 33-36 significant digits |

```cpp
#include<iostream>

int main()
{
    std::cout << std::setprecision(22);

    long double d1(1.44+1.69);
    std::cout << d1 << std::endl; // output 3.1299999999999998933419

    long double d2(0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1); // should equal 1.0
    std::cout << d2 << std::endl; // output: 0.9999999999999998889777
    std::cout << sizeof(d2); // output 16
}
```
In my case the long double has 15 precision as you see in program. the size of long data type is 16 bytes. I didn't understand what you said that minimum precision of 15, 18, or 33 significant digits depend on how many bytes it occupies. Mine is 16 bytes as it cout in program.What does prevent getting minimum 33 significant bytes? is there anything that i missed,Thanks in advance.

• Alex

Your request to display 22 digits of precision isn't being honored. You need to std::cout << std::fixed before setting the precision.

Also don't forget to #include iomanip.

• Mufti

thank you very much Mr. Alex, I really appreciate this tutorial.

please teach me how to solve this question. i have tried but not sure of the answers am getting.

I will be very grateful if anyone help me out.

A,b,c are variables of type int while d, e, f, and g are of type float.
•If a= 4, b = 6 and c -3, Find
•d = 3a + -ac
•f = d/a
•e = f+2c

• Matt

Your table at the beginning of this lesson has different minimum sizes for floating point numbers than your table in lesson 2.3. In this lesson you have float 4, double 4, and long double 4.
In lesson 2.3 you have float 4, double 8, long double 8.

• Alex

Thanks for pointing that out. I've updated the article accordingly.

• reznov

In the C language I was taught to force truncate any unnecessary significant digits by hand before comparing a float, e.g. in a while loop condition. Now I understand that if I didn't, I could've created an infinite loop.

I don't quite remember how to do that in C, but that shouldn't matter for C++ now, problem is I don't know it for C++ either yet. Will this be covered in a future lesson? If so I'll wait patiently and read along, else I'd like to have a quick heads up, because I tend to try alot of crazy things and wouldn't want to melt my CPU doing them.

Anyway, thanks for the tutorial so far. It's been really helpful already!

• Alex

I'm not sure what you mean by "truncate any unnecessary significant digits". If they're significant, then they're not unnecessary, are they?

I do talk in the next chapter about how to properly compare floating point numbers, so if that's the topic of interest here, keep reading.

• reznov

Oh my bad there, I tend to call any digit that isn't a zero before the first significant digit significant even though they're unsignificant or even unwanted.

For example, if my float stores 0.1000000000000000001 instead of 0.1 flat, I've been taught to tell the compiler it has to cut off any digits that go beyond the 1st decimal (in the C language). Is this process explained / still used in C++?

Also I started wondering about the origin of the error, does it have to do with how we convert decimals to binary? In example: 0.5f = 2^-1 or 0.125f = 2^-3 etc. Not gonna break my head over it, but just wondered if that has something to do with it.

• Alex

Just out of curiosity, how are you telling the compiler to convert 0.10000000000001 to 0.1? What if the number gets represented as 0.999999999999 instead?

The origin of precision issues does generally have to do with how we convert decimals to binary, as explained in the section "Rounding errors" above.

• Mauricio Mirabetti

Alex, me again.
As a correction note, when explaining how to convert numbers to scientific notation, on the first example, on step two (trimming leading zeros) I believe you trimmed the trailing zero by mistake. That could lead to a misunderstanding of the "protocol".

Slide decimal left 4 spaces: 4.2030e4
No leading zeros to trim: 4.203e4
Trim trailing zeros: 4.203e4 (4 significant digits)"

It could be:
Slide decimal left 4 spaces: 4.2030e4
No leading zeros to trim: 4.2030e4 //no change on trailing zeroes yet
Trim trailing zeros: 4.203e4 (4 significant digits)"

Best regards.

Mauricio

• Alex

Thanks! Fixed.

• joe

I have a question.

Lets say I am setting up a clock for the hours and minutes and seconds and even miliseconds.
how would I write my code so as to show a maximum number for my miliseconds as well as a minimum?

lets say I want it to show all numbers up to the 100ths so it would say .50 or .55, how would I write?

• Alex

If you want to display a fixed number of decimal places for floating point numbers, you need to tell std::cout to print numbers in fixed-point format. To do that:

The other alternative is to store your hours, minutes, seconds, and milliseconds as integers and do your own formatting.

• joe

cool, thank you.

Another question, is there a way to capture a specific part of the data?
lets say with 123.4567.  I only want to keep the .4567 and discard the rest.

The way I am thinking is to create an integer, making it equal to the float so it captures the whole numbers, then subtracting it from the float.  Would that work?

• Alex

The easiest way is as you suggest, cast the floating point number to an integer and subtract that from the floating point number. e.g.

• Ronald Hofmann

Yes, you are right. I realized that 5 seconds after my comment, sorry for that.
Well, it was the first time I put an address on the white list.

Still trying

Greetings from Switzerland, Ronald Hofmann
---

• Ronald Hofmann

Hi, the word significand is correctly written with a t at the end -> significant

I put this page on the witelist of my AdBlocker Plus but stil get a message that ABP is active.
Any idea?

Greetings from Switzerland, Ronald Hofmann
---

• Alex

No, significand is actually the correct term. It's a mathematical term dealing with numbers in scientific notation (along with the mantissa and exponent).

Unsure about Adblock Plus, I've never used it so I can't advise on how to fix issues with it. :(

• Nyap

can someone please explain to me what a "significand" or "exponent" is

• Alex

The significand is the part of a number in scientific notation that contains the significant digits.

The exponent is the same exponent used in standard math. It determines how many times the base is multiplied. E.g. 2^4 means 2 is multiplied 4 times to get 16.

So in the number 1.2 x 10^4, 1.2 is the significand and 4 is the exponent. This evaluates to 1.2 * 10000 = 12000.

• Nyap

I don't understand why you need to use the f suffix
isn't a double basically the same as a float, but with more breathing space (memory)? and if so, why don't you need to use suffixes for the different types of integers?

• Alex

Generally speaking, literals don't know what type they are -- numbers are just numbers devoid of context. The suffixes are used to specify what type the literal is intended to be. If you don't specify a suffix, the compiler will make an assumption about what your intent is.

With integers, the compiler will automatically find the smallest integer that fits your integer literal and assume your integer literal is of that type. This works because integers do not suffer any precision issues, so the compiler only needs to worry about the range of the type. If you then assign the integer to a larger type, no problem.

However, with floating point numbers, there are precision issues, and that complicates things. A floating point literal gets represented differently as a float or a double, with a possible loss of precision, so the compiler can't just put your floating point literal inside the smallest floating point type that it fits (e.g. a float) and call it good. But it has to do something if you don't specify a suffix, so it assumes you meant double in that case.

Consider what happens if you initialize a float variable with the literal 4.3.

4.3 is a decimal number with no suffix, so the compiler assumes you meant a double. 4.3 gets converted to a double, with some loss of precision. That double is then assigned to a float, which incurs another loss of precision. Double precision loss!

If 4.3 had an f suffix, then you only get one loss of precision, when 4.3f is converted to a float. The assignment doesn't incur an additional loss of precision because the types are already identical by that point.

Make sense?

• Peng

34500.0 is equal to 3.45000e4 (6 significant digits)
But why 123456789.0 has 9 significant digits not 10 significant digits in above?

• Alex

It should have been 10. I've updated the example.

• Toasty Turnip

Hi,
I'm a bit confused by the bit regarding the f appended to some numbers. I did some googling but am still confused.

"Note that by default, floating point literals default to type double. An f suffix is used to denote a literal of type float."

However, when you initialize or define a variable, you would have to specify its type as "float" or "double" wouldn't you? Or do the two for some reason initialize the same type, and the f is the only way to initialize a non-double float?

Thanks for the great tutorial and for still keeping the comment section updated after all these years! :)

• Alex

Even though you must assign your variable a type, C++ won't infer what kind of literal you're intending based on the type of variable you're initializing. Therefore, you must use a suffix to specify what type the literal is.

• Toasty Turnip

So I can call it a float and use it as a double? And if I assign it a "double" type, but add the f suffix, it will be treated as a single float?

• Alex

If you do something like double d = 0.333333f; the compiler will implicitly convert the float literal to a double before initialization.

• Toasty Turnip

I decided to do a small test but... I still don't seem to understand the trailing "f"...

I'm not sure how the format turns out, so here's a pastebin:
http://pastebin.com/TU7R91qS

The output is:

fValue1:                4 bytes
fValue2:                4 bytes
dValue1:                8 bytes
dValue2:                8 bytes
ldValue1                16 bytes
ldValue2                16 bytes

It looks like the size of the float remains 4 bytes regardless of using the trailing f or not. Does it have to do with the compiler? Is using the f the standard, and hence "safe" way to do it, or am I just not getting something?

Sorry for all the questions, and thanks for your patience. :)

• Alex

The f suffix exists to tell C++ that the literal number (in this case 0.1) should be treated as a float rather than a double. It doesn't have any impact on the size of the variable you assign that value to, but you may see differences in precision:

• Nurlan

Hello,

Alex.
Could you please clarify below this question.
```cpp
double d1 = 3123456712345678912345678912345678912345.6789;
std::cout << d1 << '\n'; // it prints output - works okay.
double d2 = 3123456712345678912345678912345678912345.6789f;
std::cout << d2 << '\n'; // but this is not. it shows infinity.
```
As you have stated before (above comment) :
double d = 0.333333f; // converts a float literal to a double

My question- if the suffix converts a float literal to a double then in my case why my d2 variable doesn't print which is converted from a literal to a double? As i understood, it is converted to double not a float.i.e  the size of double could have cout the d2 value, but this acts as if it has float size (4 byte). This same situation between long double and double.

• Alex

I think you have it backwards -- the f suffix causes the compiler to treat the literal as a float, not a double. So d1 is being initialized with a double literal, whereas d2 is being initialized with a float literal. Your very large number appears to be in range for a double, but not for a float, so the compiler says, 3123456712345678912345678912345678912345.6789f is out of range for a float, we'll treat this as infinity. That gets assigned to d2.

• trebor

I'm a newby but I disagree with a) and b) solutions. We physicists would assume that the zeroes on the right are not significant digits.

• Darren

I think you might be eligible for a refund on your degree.

• Osman Zakir

It compiled when I took away the "f" suffix.  I don't know why, though, as it compiled fine with that suffix included on any other code I wrote.

• Alex

Because then it treated the literal as an integer, and converted the integer literal to a float when the initialization took place.

• Osman Zakir

I’m having some trouble with this in Visual Studio (Community 2015 edition) concerning user-defined literal operators.  I have no idea what they are, though.  I tried to do this:

• Alex

123456789 is an integer literal, so I guess the compiler is refusing to do a conversion in this case.

• Gemparks

Hi, there.
Its me again.

Based on the following codes

C++ source file:

Why does the ans return "inf" instead of a proper one, that is 1.43?

• Alex

It works fine for me. I ran your example and got 1.43. It looks like you might be running an older compiler (since you're using conio.h). Maybe try upgrading your compiler to a more modern one?

• Ben

I understand that there are rounding errors due to conversions between binary and decimal.  Why don't these errors propagate consistently?

For example, in your example of 0.1, the compiler prints 0.10000000000000001 (which is slightly greater than 0.10000...) when precision is set to 17 sig figs.  So why do you get a number slightly less than 1 when you add 0.1 ten times? Logically, it would seem like the compiler might give a number like 1.00000000000000010 instead of 0.99999999999999989.

• Alex

The errors do propagate as you'd expect. However, when a number can't be stored exactly in floating point representation, it will generally be rounded to the closest floating-point representation with the given precision. This number could be larger or smaller than the actual number. It also means two numbers that "round down" may sum to one that "rounds up", or vice versa.

Consider the following program:

This prints:

```
0.1 is 0.10000000000000001
0.4 is 0.40000000000000002
0.4 + 0.1 + 0.1 is 0.59999999999999998
```

Three numbers that are larger than expected add up to one that is smaller than expected.

• Austin

Last two lessons.... very tough for me. Just like a lot of information to take in. And it's very hard to remember without connecting to something meaningful.

• R4Z3R

Hi Alex
look at the following snippet:

result: 0.12345
but the following snippet:

result: 0.123457

How can I fix it and why it happens?

• Alex

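std::cout has a default precision of 6 digits, but your expected value of 0.123456789 is 9 digits of precision. You need to tell std::cout to print more digits of precision, for example by sending std::setprecision(9) (defined in the iomanip header) to std::cout before printing the value.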

• Max

I just have to say: Amazing tutorial! THANK YOU!!!!!!!!!!!

• babu.tritran

oh ^^ thanks Alex.you are a great support.

• babu.tritran

Why?
45.5+45.5=90
or
123.456/654.321=0.188073
It should be 123.456/654.321= 0.1886780342 (right)

Thank Alex.

main.cpp

• Alex

You should heed the warnings of your compiler. Your compiler should be warning you that enterNumber() is returning a float, but the return type is an integer. That's why you're getting incorrect answers: all the fractional numbers you're entering are getting truncated to integers.

As a side note, your summation function is misnamed, as it doesn't do any summation. You're doing the summation in the function argument. As written, the function should be named "printResult" or something.

• babu.tritran

You can edit my post again. to give me a better idea.

• Alex

Change:

to

• Kyle

Hi, I'm trying to write a program that evaluates a fraction from the user, does a math operation on it and spits out another fraction (in fractional format, not decimals). Is it possible to do this in C++? I ran a simple program with inputting a fraction into a float but the program doesn't work, I suspect it's not possible with floats but how else would I do it?

• Alex

Yes, it is definitely possible, but in order to do that you'll need to manage the numerator and denominator yourself (e.g. store them as integers) and write the operations to add, subtract, multiply, and divide them yourself. Floats won't be of use for this.

• BROKEN WINDOW

Hi Alex,
if we want to store a 60 digits number in a variable or print it in output without scientific notation, what can we do ?
for example in python we can calculate 9999 to the exponent 9999 and it will print the result.
is there any way to do it ?

Thanks.

• Alex

Since C++ doesn't have support for arbitrarily large integers, probably the best solution here would be to install a library that implements large integers (e.g. https://mattmccutchen.net/bigint/).

• takise

Hello, I can't get one thing - is there any difference with rounding errors depending on which math operations we use.

```cpp
double d = 0.1;
cout << d << endl;         // shows 0.10000000000000001
double d2 = (d + d + d + d + d + d + d + d + d + d);
cout << d2 << endl;         // shows 0.99999999999999989
double d3 = 10 * d;
cout << d3 << endl;              // shows 1  ?? why ??
return 0;
```

Why after multiplying the result is without rounding error ?

Thanks upfront for answer and great tutorial!

• Alex

Yes, the amount of rounding error can depend both on which mathematical operations you use and how many times you use them. The more times you use them, the more errors tend to grow. The plus case has more error because we used plus 10 times.

• AK

A good example that helped me understand precision errors is the "Patriot Missile Failure" during the Gulf War. The internal clock was multiplied by 1/10 to get real time.
This calculation used a fixed point 24 bit multiplication. Over 100 hours, this multiplication yielded a drift of 0.34 sec, which was enough for the missile to go undetected by the radar.

Details at :

• Pranav

float f is printed out as 8e-010. I am using visual studio. Is there a mistake in initializing the variable? ignore the random long double.

• Alex

No. 8e-10 and 8e-010 are the same number.

• Pranav

But rather than giving out 0.0000000008 it gives out 8e-10.

• Alex

Yup. std::cout will print some numbers in scientific notation, particularly if they are large or small.

• Avneet

After mixing my stupid experiments with your great lessons on floating point numbers, I ended up with this conclusion.  Please correct me if I m wrong somewhere:

Floating numbers are different in a machine from what we expect. For example: 1.11 is not stored as 1.11 in a machine. Floating point numbers can't take more than 4,8,12 or 16 bytes in memory. That is why numbers going out of Infinity has to be rounded up or down to an approximate value. Setting precision shows how numbers are stored in the machine but can't result the exact number because it has to truncate the number according to the given limit or due to limited memory. After truncating, it returns an approximate value (approximate according to machine).  However, we can see exact 1.11 to be printed out to the console because cout displays expected results when precision is set under the precision range of a type given to the object (e.g. double x; // x is okay with precision 15,16 or 17 when cout prints it's value. If precision is set to 18 for x, cout will display unexpected digits after 15,16 or 17 significant digits) type and fails only when floating numbers are compared to each other using relational operator, or when evaluated as a result of an expression, using arithmetic operators. 50 is the largest parameter setprecision can hold (setprecision (50)). I m a bit confused here. Why 50. Does a number with 51 digits goes out of 16 byte (maximum a floating point number can reserve). Or setprecision () has a rule that this object can't display more than 50 significant digits.

• Alex

You're pretty much correct. Floating point numbers are not stored in decimal format internally (1.11 is not stored as "1.11"); they are stored as some magic combination of bits that gets reconstituted into 1.11. Just like decimal numbers, some floating point numbers can be represented precisely, and some can not. For example 1/3rd can't be represented precisely in decimal format (0.333333... you have to truncate somewhere), but 1/10 can (0.1).

It's unclear to me whether setprecision() having a max value 50 is a limitation of your compiler or something else.

• Avneet

cout prints maximum 50 significant digits for all setprecision values greater than 50 in my machine. I m using code::blocks.