# 2.5 — Floating point numbers

Integers are great for counting whole numbers, but sometimes we need to store very large numbers, or numbers with a fractional component. A floating point type variable is a variable that can hold a real number, such as 4320.0, -3.33, or 0.01226. The floating part of the name floating point refers to the fact that the decimal point can “float”; that is, it can support a variable number of digits before and after the decimal point.

There are three different floating point data types: float, double, and long double. As with integers, C++ does not define the size of these types. On modern architectures, floating point representation almost always follows IEEE 754 binary format. In this format, a float is 4 bytes, a double is 8, and a long double can be equivalent to a double (8 bytes), 80-bits (often padded to 12 bytes), or 16 bytes.

Floating point data types are always signed (can hold positive and negative values).

| Category | Type | Minimum Size | Typical Size |
|---|---|---|---|
| floating point | float | 4 bytes | 4 bytes |
| | double | 4 bytes | 8 bytes |
| | long double | 4 bytes | 8, 12, or 16 bytes |

Here are some definitions of floating point numbers:

When we assign literal numbers to floating point numbers, it is convention to use at least one decimal place. This helps distinguish floating point values from integer values.

Note that floating point literals are of type double by default. An f suffix denotes a literal of type float (for example, 5.0f).

Scientific notation

How floating point variables store information is beyond the scope of this tutorial, but it is very similar to how numbers are written in scientific notation. Scientific notation is a useful shorthand for writing lengthy numbers in a concise manner. And although scientific notation may seem foreign at first, understanding scientific notation will help you understand how floating point numbers work, and more importantly, what their limitations are.

Numbers in scientific notation take the following form: significand x 10^exponent. For example, in the scientific notation `1.2 x 10^4`, `1.2` is the significand and `4` is the exponent. This number evaluates to 12,000.

By convention, numbers in scientific notation are written with one digit before the decimal, and the rest of the digits afterward.

Consider the mass of the Earth. In decimal notation, we’d write this as `5973600000000000000000000 kg`. That’s a really large number (too big to fit even in an 8 byte integer). It’s also hard to read (is that 19 or 20 zeros?). In scientific notation, this would be written as `5.9736 x 10^24 kg`, which is much easier to read. Scientific notation has the added benefit of making it easier to compare the magnitude of two really large or really small numbers simply by comparing the exponent.

Because it can be hard to type or display exponents in C++, we use the letter ‘e’ or ‘E’ to represent the “times 10 to the power of” part of the equation. For example, `1.2 x 10^4` would be written as `1.2e4`, and `5.9736 x 10^24` would be written as `5.9736e24`.

For numbers smaller than 1, the exponent can be negative. The number `5e-2` is equivalent to `5 * 10^-2`, which is `5 / 10^2`, or `0.05`. The mass of an electron is `9.1093822e-31 kg`.

In fact, we can use scientific notation to assign values to floating point variables.

How to convert numbers to scientific notation

Use the following procedure:

• Your exponent starts at zero.
• Slide the decimal so there is only one non-zero digit to the left of the decimal.
• Each place you slide the decimal to the left increases the exponent by 1.
• Each place you slide the decimal to the right decreases the exponent by 1.
• Trim off any leading zeros (on the left end)
• Trim off any trailing zeros (on the right end) if the original number had no decimal point. We’re assuming they’re not significant unless otherwise specified.

Here are some examples:

```
Start with: 42030
Slide decimal left 4 spaces: 4.2030e4
No leading zeros to trim: 4.2030e4
Trim trailing zeros: 4.203e4 (4 significant digits)
```
```
Start with: 0.0078900
Slide decimal right 3 spaces: 0007.8900e-3
Trim leading zeros: 7.8900e-3
Don't trim trailing zeros: 7.8900e-3 (5 significant digits)
```
```
Start with: 600.410
Slide decimal left 2 spaces: 6.00410e2
No leading zeros to trim: 6.00410e2
Don't trim trailing zeros: 6.00410e2 (6 significant digits)
```

Here’s the most important thing to understand: The digits in the significand (the part before the E) are called the significant digits. The number of significant digits defines a number’s precision. The more digits in the significand, the more precise a number is.

Precision and range

Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3’s going out to infinity. An infinite length number would require infinite memory to store, and we typically only have 4 or 8 bytes. Floating point numbers can only store a certain number of significant digits, and the rest are lost. The precision of a floating point number defines how many significant digits it can represent without information loss.

When outputting floating point numbers, cout has a default precision of 6 -- that is, it assumes all floating point variables are only significant to 6 digits, and hence it drops anything after that (rounding the last digit it keeps).

The following program shows cout truncating to 6 digits:

This program outputs:

```
9.87654
987.654
987654
9.87654e+006
9.87654e-005
```

Note that each of these is only 6 significant digits.

Also note that cout will switch to outputting numbers in scientific notation in some cases. Depending on the compiler, the exponent will typically be padded to a minimum number of digits. Fear not, 9.87654e+006 is the same as 9.87654e6, just with some padding 0’s. The minimum number of exponent digits displayed is compiler-specific (Visual Studio uses 3, some others use 2 as per the C99 standard).

However, we can override the default precision that cout shows by using the std::setprecision() function that is defined in a header file called iomanip.

Outputs:

```
3.333333253860474
3.333333333333334
```

Because we set the precision to 16 digits, each of the above numbers is printed with 16 digits. But, as you can see, the numbers certainly aren’t precise to 16 digits!

The number of digits of precision a floating point variable has depends on both the size (floats have less precision than doubles) and the particular value being stored (some values have more precision than others). Float values have between 6 and 9 digits of precision, with most float values having at least 7 significant digits (which is why everything after that many digits in our answer above is junk). Double values have between 15 and 18 digits of precision, with most double values having at least 16 significant digits. Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies.

Precision issues don’t just impact fractional numbers, they impact any number with too many significant digits. Let’s consider a big number:

Output:

```
123456792
```

123456792 is greater than 123456789. The value 123456789.0 has 10 significant digits, but float values typically have 7 digits of precision. We lost some precision!

Consequently, one has to be careful when using floating point numbers that require more precision than the variables can hold.

Assuming IEEE 754 representation:

| Size | Range | Precision |
|---|---|---|
| 4 bytes | ±1.18 x 10^-38 to ±3.4 x 10^38 | 6-9 significant digits, typically 7 |
| 8 bytes | ±2.23 x 10^-308 to ±1.80 x 10^308 | 15-18 significant digits, typically 16 |
| 80-bits (12 bytes) | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 18-21 significant digits |
| 16 bytes | ±3.36 x 10^-4932 to ±1.18 x 10^4932 | 33-36 significant digits |

It may seem a little odd that the 12-byte floating point number has the same range as the 16-byte floating point number. This is because they have the same number of bits dedicated to the exponent -- however, the 16-byte number offers a much higher precision.

Rule: Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to inaccuracies.

Rounding errors

One of the reasons floating point numbers can be tricky is due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems.

You can see the effects of this in the following program:

This outputs:

```
0.1
0.10000000000000001
```

On the top line, cout prints 0.1, as we expect.

On the bottom line, where we have cout show us 17 digits of precision, we see that d is actually not quite 0.1! This is because the double had to truncate the approximation due to its limited memory, which resulted in a number that is not exactly 0.1. This is called a rounding error.

Rounding errors can have unexpected consequences:

```
1
0.99999999999999989
```

Although we might expect that d1 and d2 should be equal, we see that they are not. If we were to compare d1 and d2 in a program, the program would probably not perform as expected. We discuss this more in section 3.5 -- Relational operators (comparisons).

One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit.

NaN and Inf

There are two special categories of floating point numbers. The first is Inf, which represents infinity. Inf can be positive or negative. The second is NaN, which stands for “Not a Number”. There are several different kinds of NaN (which we won’t discuss here).

Here’s a program showing all three:

And the results using Visual Studio 2008 on Windows:

```
1.#INF
-1.#INF
1.#IND
```

INF stands for infinity, and IND stands for indeterminate. Note that the results of printing Inf and NaN are platform specific, so your results may vary.

Conclusion

To summarize, the two things you should remember about floating point numbers:

1) Floating point numbers are great for storing very large or very small numbers, including those with fractional components, so long as they have a limited number of significant digits (precision).

2) Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are truncated for output. Consequently, comparisons of floating point numbers may not give the expected results. Performing mathematical operations on these values will cause the rounding errors to grow larger.

Quiz

1) Convert the following numbers to C++ style scientific notation (using an e to represent the exponent) and determine how many significant digits each has:
a) 34.50
b) 0.004000
c) 123.005
d) 146000
e) 146000.001
f) 0.0000000008
g) 34500.0

### 161 comments to 2.5 — Floating point numbers

• C++ Student

It is far less confusing after reading this. Thank you. In class we covered this in about ten minutes and moved on to the next thing.

I’ve been to a few sites already trying to get a grasp on these (floats/doubles) and this summation really did the trick.

Thanks!

• Jefferson

What do the f’s after some of the float and double values mean?

• By default, if you type a floating point value into C++ it’s typed as a double. Consequently, if you initialize a float variable with a plain literal such as 4.53, you’re assigning a double to a float, which loses precision, and the compiler will probably complain.

Putting an “f” after the value (for example, 4.53f) means that you intend that value to be a float, not a double. You’re then assigning a float value to a float variable, which makes more sense.

• jsieckman

I missed something. How is it that 4.53 is a double and not a float?

• rameye

The 4.53 is a literal constant of type `double` by default. When you add the f suffix to it like 4.53f it then becomes a literal constant of type `float`

• cpplx

it might be good if you add this explanation in the lesson instead of just a comment in the code. I was wondering myself about the usage of the f suffix due overlooking the comment.

• Alex

Good idea. Done.

• Jim

Alex,
It seems kind of dumb to initialize a variable as a float, and then specify that you want a float by adding the f suffix (for example, 9.87654321f), except for the fact that all floating point literals default to double in C++. I’ve run your example with all the numbers with the f suffix and without, and get the same output. My PC has an i7 processor. Why use float at all, since both use four bytes?

• Alex

I’m not quite sure I understand what you’re getting at. Doubles are generally 8 bytes, and floats are normally 4.

• Masss

Is there something wrong with my code?

The compiler brings up a problem with setprecision()..

Thanks

• You need to include the iomanip header (#include <iomanip>) to use setprecision() that way.

• Coming form a Java background, I wonder if anyone can advise me a C++ library with a similar function as Java’s BigDecimal.
Preferably one that works on Linux with gcc(so not the decimal type from Visual C++)

• So with all the rounding errors and precision problems, how do programmers deal with operations that need to display something that would end up with a precision or rounding error? Or am I just over-thinking things?

• Most of the time it’s simply not necessary to display a number to the number of significant digits where precision/rounding errors creep in. Generally with floating point numbers, programs will truncate the display to 2-5 decimals.

• This is a very helpful site.

• iseneirik

how can i make the value a user inputs into a float?

then when i run the program from main() and i put in 2 values like eg. x = 10 y = 3 then the answer is 3 instead of 3.333333

• You are already storing the user input values as a float. The problem is that your function is returning an integer, so it’s truncating the result of x/y. Change your function to return a float and you will be good.

• Sam

Hi Alex, really great guide. Enjoying the challenges its throwing up so far. With regards to functions returning integer’s, how do you change a function to actually return a float? My little "add" program is working but it cuts of the decimal? so if I have x = 4.5 + y = 5.5 it will give me the answer of 9! im probably missing something ridiculously obvious but if you could help that would be great!

• Alex

Just change the return type of the function from int to float (or even better, double).

However, if x and y are ints, and you try to assign fractional values to them (e.g. 4.5 or 5.5) the fractions will get lost. So you may need to make those floats/doubles as well.

• Chris

I set the precision level to 4, and added cout for the 2 values, fValue1 + fValue2.

I got fValue1 IS actually rounded off to 1.345 and fValue2 IS actually 1.123, expecting now the get the result of 2.468, but still reports ‘fTotal is not 2.468’

Why is that?

Chris

• Rounding error. The numbers printed on your screen by cout are rounded in this case, so you’re not seeing the full representation. However, when you do the comparison, it does so with the actual numbers, not the rounded ones, which can lead to rounding issues.

• Vivek Hiremath

How do I convert a Float say

x = 1234.567890123456789

to
y = 1234.5678901234 (small float ..10 decimal places only)

Something similar to setPrecision, to use NOT for display/Printing, but to use as a value for calculations / pass it on to a Database etc ?

• I’m not sure what the best way to do this is. For small numbers, you can multiply by 10^x, cast to an integer to drop the remaining decimals, then divide by 10^x. However, if your number is too large you’ll overflow the int when you do the casting so I won’t say this is foolproof.

• Noha

Didn’t understand how 0.1 is represented in binary by 0.00011001100110011…

• jeremy

In decimal, .1 is tenths, .01 is hundredths, .001 is thousandths and so on. Likewise, in binary, .1 is halves, .01 is quarters, .001 is eighths, and so on.
0.000110011… would be equal to 1/16 + 1/32 + 1/256 + 1/512 + …

• Kostas81

Ok, so in binary, we have to approximate the 1 of decimal 0.1 with an infinite sum. What I still don’t understand is why we don’t use all the “weights”, that is, all the powers of 2, but only 1/16, 1/32, 1/256, 1/512 and so on, that is, the 4th position (2^4 = 16), the 5th, the 8th, the 9th, and so on. In other words, why don’t we have 0.011111111…….. which is equal to 1/2 + 1/4 + 1/8 + 1/16 + …? It also approaches 1! (I am referring of course to the decimal part of 0.1, that is, the 1.)

• johnbobz

That’s exactly why you can’t use every power of 1/2. The infinite sum would add up to 1, which is ten times the number we require. In order for the sum to add up to 0.1, you would need to add Sum[(1/2)^(4n) + (1/2)^(4n+1)], taking n from 1 to infinity. You can try it yourself if you want.

• Kostas81

Yes, you are right, this sum indeed converges to 0.1, whereas the sum I used converges to 1.0. Thank you for the clear and concise explanation!

• Kostas81

There is also a good explanation in Wikipedia (yes, sometimes - not often though - Wikipedia has good articles):

“Fractions in binary

Fractions in binary only terminate if the denominator has 2 as the only prime factor. As a result, 1/10 does not have a finite binary representation, and this causes 10 × 0.1 not to be precisely equal to 1 in floating point arithmetic. As an example, to interpret the binary expression for 1/3 = .010101…, this means: 1/3 = 0 × 2^(-1) + 1 × 2^(-2) + 0 × 2^(-3) + 1 × 2^(-4) + … = 0.3125 + … An exact value cannot be found with a sum of a finite number of inverse powers of two, and zeros and ones alternate forever.”

Follows a table of the conversion (fractional approximations) for fractions from decimal to binary. For the ones who are interested:

http://en.wikipedia.org/wiki/Binary_numeral_system

• Stacey Brewer

When I run the following code, the values seem really wrong when output. What is going wrong here?

• sanjiv

your C ++ compiler has a tendency to roundoff 8th precision onwards.
For any value lesser then 8;
It will display 1 lesser than called for.

Hi there! Congratulation, very good explanation!!! Just what I was looking for.

Thank you.

• jimboshack

When I run this code, I get z = 0.333333 and q = 0

float x = 1;
float y = 3;
float z = (x/y);
float q = (1/3);

Can someone explain why? I realize that if I write

float q = (1.0/3.0);

that this problem doesn’t occur, but I’m just wondering why I can’t use (1/3) since q is defined as a float. This page says it’s just a convention to have the decimal point.

• lharriger

Think it through as follows:

float x = 1 reads “put INT 1 into FLOAT x.” This changes its type from int to float. The same is true for float y = 3.
Thus float z = x/y divides two floats and returns a float.

However for float q = (1/3), this is a two part statement.
The first part (1/3) reads “divide INT 1 by INT 3”. Since this is division of two integers, this means it must return an integer (the floor), which in this case is 0.
The second part is then q = 0, which reads “put INT 0 into float q.”

An important thing to keep in mind is that division on a float is different than division on an integer. The literal 1 is read as an integer, however, the literal 1.0 is read as a float/double. This is why q = (1.0/3.0) is different than q = (1/3).

Hope this helped.

• duyquyen1710

thank you!

• Alex

Yup, 1 / 3 performs integer division (which gives an answer of 0, as the fractional component is dropped) , whereas 1.0 / 3.0 performs floating point division (which gives an answer of 0.333333…)

• Sergey

This is a very good article on the floating-point computation issue: “Microsoft Visual C++ Floating-Point Optimization”, by Eric Fleegal, MSDN, 2004
http://msdn.microsoft.com/en-us/library/aa289157(v=vs.71).aspx

• mohibouki

Hi people,
I am new to C++ so please don’t flame me 🙂

I wrote a simple prog. to test this course but something isn’t really working well and I can’t figure out why…

```
#include <iostream>
#include <string>
#include <iomanip> // for setprecision()
using namespace std;

int main()
{
    cout << setprecision(7); // 7 decimals
    float v = 1;
    float j = 3;
    float cc;
    cc = v / j;

    // TEST with FLOAT numbers
    float ff = 0.3333333; // 7 decimals as set in "setprecision(7);"
    if (cc < ff)
    {
        cout << "cc is smaller then ff" << endl;
    }
    else if (cc > ff)
    {
        cout << "cc is bigger then ff" << endl;
    }
    else
    {
        cout << "cc equals to ff" << endl;
    }
    cout << cc << " = cc" << endl;
    cout << ff << " = ff" << endl;

    return 0;
}
```

The output gives me that cc is bigger than ff…
I don’t understand why as I set precision to 7 and my var ff has also 7 decimals.
They should both be equal.

Any suggestions where I made an error?
Thanks!!

• sajibfinix

No, you didn’t do anything wrong.
I think the setprecision() function only sets the precision used when showing your variable with cout.
I mean setprecision() can’t change your variable. In your code cc = v/j, so cc is stored as 0.3333333333… and setprecision() can’t change this. You stored ff as 0.3333333.
0.3333333333… is greater than 0.3333333, isn’t it? So your code shows that cc is bigger than ff.

Read the “Comparison of floating point numbers” part of this tutorial. It doesn’t say that you can use setprecision() for comparison of floating point numbers.
Thank You

• sajibfinix

How to control the numbers after decimal point in C++?

In C, if I take a floating point variable f = 123.4567 and I want to show only 2 digits after the decimal point, I use printf("%0.2f", f). Then it shows 123.46.

In C++ I have to use setprecision(). But it determines the total number of digits, not just the digits after the decimal point, so it causes a problem. If I don’t know what value my floating point variable will contain after a calculation, it could be 123.123 or 1234.123. If I set the precision to 5, the first case shows 123.12 and the second case shows 1234.1! But I always want to show 2 digits after the decimal point in every case. How can I do that in C++?

• duyquyen1710

i have same question like you

• rameye

Use the `std::fixed` stream manipulator and the stream member function `precision()` (for example, `std::cout.precision()`).

For example if you want to display with 2 decimal places:

```
double pi = 3.14159;
std::cout.precision(2);
std::cout << "Today's price for a slice of pi is $" << std::fixed << pi << std::endl;
```

and it should print:

`Today's price for a slice of pi is $3.14`

• rameye

Alex, the first example demonstrating setprecision() is missing the closing curly bracket for main()

• Alex

Fixed. Thanks!

• noobcaxe

I'm perplexed! I'm working through Bjarne S's book and have done one of the drills but I don't understand what's going on with varying results.

Here's some output;

9.99
10
smaller is a 9.99  the larger is b 10
result 0.01
they are almost equal

99.99
100
smaller is a 99.99  the larger is b 100
result 0.01

199.99
200
smaller is a 199.99  the larger is b 200
result 0.01
they are almost equal

Why doesn't entering 99.99 and 100 give the message "they are almost equal"?

• Alex

This is an interesting example of rounding errors.

10 - 9.99 = 0.01, but due to rounding error, C++ is representing this as 0.0099999999999997868
100 - 99.99 = 0.01, but due to rounding error, C++ is representing this as 0.0100000000000005116

One of these is larger than 0.01, and one is smaller.

• Noobcaxe

Argggh! I almost went insane yesterday evening try to see where I'd gone wrong.
How does one avoid rounding errors? I can think of any number of applications where rounding errors even at the level of precision in my example might be disastrous.

• Alex

There are a couple of ways to “avoid” rounding errors:
1) Avoid use of floating point numbers altogether (sometimes this is possible, sometimes it isn’t).
2) Don’t do raw comparisons like you’re doing. In section 3.5 -- Relational operators (comparisons), we discuss how to tell if floating point numbers are equal. These can be extended to handle less than/greater than cases. This would help avoid the case you see above.
3) Ensure that when you use floating point numbers, you only treat them as accurate to a certain level of precision.

• Billy

I don’t know why all outputs are same.
Please someone help me.(VS2015,64 bit Windows)
Thank you!!

• Alex

Your float doesn’t have enough precision to store the entire number, so it’s being truncated. If you change f to a double (and remove the f suffix on the literals) then you’ll see that the numbers print differently.

• Billy

Sorry, I still feel confused.
So even i use setprecision, i still cannot change precision.
Is my understanding correct?

• Alex

setprecision can’t show more significant digits than the underlying number has. A float only has 6 to 9 digits of precision, so you’re generally only going to get between 6 and 9 digits of precision even if you ask for more.

The number you picked (9876543.21f) has 9 significant digits, but because you tried to put it in a float, it got truncated to 7 significant digits (9876543) internally. So regardless of whether you ask for 7, 8, or 9 significant digits of precision, you’re only going to get 7 because that’s all the underlying number that got stored can offer.

• Billy

Got it.
Thank you Alex!

• PixelHero

Answer "a" should be 3.45e1, because we need to trim off any trailing zeros if it is a positive exponent. Am I right?

• Alex

No, my instructions were incorrect. I’ve updated the instructions.

• PixelHero

Thank you, I’ll read this lesson again.

• Kevin

whats the difference between:

cout<<precision(17)

&

cout.setprecision(17)

• Alex

With Visual Studio, cout << setprecision() calls cout.precision() internally. So they essentially do the same thing. Calling cout.precision() is probably slightly more efficient, but unless you're calling it hundreds of times, it won't really matter.

Sir, please tell me how to print 14 and 0.5 on the output screen if the given value is 14.5. Basically, I want to print them separately on the output screen using the float and int data types.

• Alex

Casting your float to an int via static_cast will drop the remainder, allowing you to print the integer. You can then subtract the integer from the floating point value to get just the remainder.

• Hello, I’m not sure if you’ll read this but I’m confused about some of the outputs:
9.87654e+006
9.87654e-005

Why is 006 and 005 after the e?
How does that make sense?
I don’t get what 9.87654e+006 means… I thought you could only have numbers like 6 or -3 and stuff after the e not numbers like 006.
I’m confused now… Shouldn’t it just be 9.87654321e6 ? Why is it 006 ?

PS: I noticed your site is built on WordPress and I think it would be really helpful if you got a plugin that would email commenters when someone replies, would save me a lot of time, lol.

Anyway, thanks a lot, I’m finding this site really helpful 🙂

• Alex

9.87654e+006 is the same as 9.87654e6. Visual Studio always prints exponents with at least 3 digits. Other compilers print exponents with at least 2 (as required by C99). In other words, it’s implementation specific.

• Oh right, Ok I understand thanks a lot!

• Metereo_Rex

So… Just to clarify, the varying results seen in the above examples are caused by varying error(s) in rounding?

E.g., each time you round to the hundreds/hundredths place, it will be different from rounding to the tens/tenths place. And with that, the higher the place you round to, the more accurate the answer will be (as with standard math). Is this correct?

Also, I’m getting slightly different results than what’s presented above. For instance, what’s 3.333333333333334 to you is 3.333333333333333 on my machine, despite rounding to the same place. Just thinking about that makes me uncomfortable!

Edit: it’s also worth noting that despite the fact that the comment box says we can use certain HTML tags, it automatically encodes the brackets.

• Alex

I’m not sure which examples you’re referring to. But generally speaking, the fact that floating point numbers have limited precision leads to rounding errors, especially in cases where a number is represented as an infinite sequence in binary (like 0.1).

re: The HTML tags issue, I used to allow HTML on the site, but I’ve installed a plugin to treat all text literally so that people who post code snippets won’t have lines such as “#include ” treated as invalid HTML. I’ll see if I can remove/disable the misleading text.

• Stack

Greetings.
The example says:
Slide decimal left 5 spaces: 4.2030e4

Shouldn’t that be slide the decimal left 4 spaces?

~Stack~

• Alex

Yes, that was a typo. It’s fixed now. Thanks for pointing it out. 🙂

• Tito

In your 0.1 example you forgot to add "#include <iostream>" at the top

• Alex

Fixed.

• Twisted Code

Your quiz questions are not lettered correctly. There are two F’s. The answers are fine, in that respect, though

• Alex

Fixed, thanks for the note.

• Jimmy

I understood everything perfectly except one thing: why are we leaving trailing zeroes untrimmed when the original number has a decimal point?

• Alex

Good question. If we write a large number with no decimal point, e.g. 21,000,000,000, it’s not clear whether we mean “exactly 21 billion”, or “somewhere around 21 billion”. More often than not, we mean “Somewhere around 21 billion” (because in most cases, the difference between 21,000,000,000 and 21,000,000,001 is insignificant).

So when we convert 21,000,000,000, we assume that the trailing zeros are not significant.

However, when the number has a decimal point, e.g. 21,000,000,000.01, or even 21,000,000,000.0, it’s clear that this number is exact, otherwise we wouldn’t have provided the decimal point. So in the case where a decimal is provided, we assume all of the trailing zeros are significant.

Make sense?

Your explanation makes sense, but does the compiler care?  What does it do differently? Or is this just to let the code document what we know about the number?

• Alex

I’m guessing that the compiler would consider 21,000,000,000.0 to be equivalent to 21,000,000,000. So although we would probably say one has a high precision and one doesn’t, the compiler likely treats them both as low-precision numbers.

• I have to point this out, for the sake of correct math:

First off, infinity is not a number, it’s a concept.

Second, x/0 is not and never will be infinity or -infinity.

Numberphile did a good video that explains the division by 0 problem: https://www.youtube.com/watch?v=BRRolKTlF6Q

• Darren

Ah mathemagicians and their concepts. Almost as bad as physicist.

• Big_Jiraffe

Outputs:

0.10000000000000001
0.99999999999999989

On my machine. So, when I use double without setprecision(17), it shows the correct results: 0.1 and 1.
So my question is: without using setprecision(17), do they store the same value? Or does a double without setprecision store 0.10000000000000001 but somehow get rounded to 0.1?
Great tutorials by the way! You are helping a lot!!

• Alex

Cout rounds the numbers for display based on the precision. Internally, they’re stored as 0.10000000000000001 and 0.99999999999999989.

• Todd

Typos.

"In decimal, this is easy (easily) represented as 0.1"

• Alex

Fixed. Thanks for the heads up.

• Jeff

This is driving me crazy! Do floats have 6 or 7 digits of precision?

"When outputting floating point numbers, cout has a default precision of 6 -- that is, it assumes all variables are only significant to 6 digits, and hence it will truncate anything after that."

"The value 123456789.0 has 9 significant digits, but float only supports about 7 digits of precision"

"4 bytes    ±1.18 x 10^-38 to ±3.4 x 10^38    7 significant digits"

• Alex

The answer to this question is complicated due to the way floating point numbers are stored. Most floats have (at least) 7 digits of precision; however, there is a subset of the floating point numbers that only have 6 digits of precision. I’ll update the article to indicate 6 instead of 7.

• Odgarig

If f = 9876549.21f, would the scientific notation be 9.87655e006 because the digit after the 6th significant digit is 9? Mathematically, 0.49 can be rounded to 0.5. Would it be the same in C++?

• Alex

The precise scientific notation would be 9.87654921e006.

However, because std::cout rounds to 6 significant digits, 9.87654921e006 would be rounded to 9.87655e006.

This is something that is easy to verify yourself with a simple program:

• Successfully compiled,

Gives a compiler error. Why..???
I am using codeblocks.

• Alex

What compiler error are you getting?

• Compiler says:
"error: ‘setprecision’ was not declared in this scope."
but when I change "std::cout << setprecision(x)" to use std::setprecision(x), it works as expected. Pretty confusing for beginners like me.

• Alex

I updated the lesson to make it more clear that std::setprecision() lives in the std:: namespace.

• waqas

Dear Alex, is there a way to compare an integer value with a float? Say I want to write code that takes a number from the user and finds out whether it’s prime, even, or odd. For a prime, division by any number other than 1 and the entered integer results in a fraction. Now, if I had taken input from the user as an int and it just got converted to a fraction (which should be stored in a float variable), how would I know that had happened? Of course, one could use many if-else statements to compare the division or even the entered number, but I want to write the smallest code for the first 100 or 200 prime numbers.

• Alex

Hi waqas,

> Now if i had taken input from user in Int and it just got converted to a fraction

I’m having a hard time understanding what you’re asking.

Instead of doing a floating point division on an integer to see if the result is a fraction or not, you’d be better off using the modulus operator (%) to see whether an integer division leaves a remainder. You can use a for loop to go through all of the possible divisors and check whether any of them leaves a remainder of 0. Modulus is covered in chapter 3, and loops are covered in chapter 5.

• Sarah Gunn

I wanted to ask about the division by zero.  While I am sure you understand the mathematics behind why it is undefined, here in your code it indicates that it will return infinity.  Is it the double having a decimal that causes this to happen rather than just giving an error instead?  I ask because I want to understand what the code is doing under the hood so that I never mistakenly fall into this trap.  Thanks and great set of lessons!

• Alex

The short answer is that it works that way because that’s how IEEE 754 (the standard that most floating point implementations follow) defines it. See http://grouper.ieee.org/groups/754/faq.html#exceptions.

• Sarah Gunn

Thanks!  I will bear this in mind when I wish to reprogram my coffee pot!  😉

It does make sense though from that perspective, hardware can be stupid and this has to be adaptable to that.

• After mixing my stupid experiments with your great lessons on floating point numbers, I ended up with this conclusion. Please correct me if I’m wrong somewhere:

Floating point numbers are stored differently in a machine from what we expect. For example, 1.11 is not stored as 1.11. A floating point number can’t take more than 4, 8, 12, or 16 bytes in memory, which is why values that don’t fit have to be rounded up or down to an approximate value. Setting a higher precision shows how a number is actually stored, but it still can’t show the exact number we typed, because the value was already truncated to fit the limited memory. However, we can see exactly 1.11 printed to the console as long as the precision is set within the precision range of the object’s type (e.g. for double x, cout prints the expected value at precision 15, 16, or 17; if precision is set to 18, cout displays unexpected digits after the 15th, 16th, or 17th significant digit). The approximation only shows up when floating point numbers are compared to each other using relational operators, or when they are the result of an expression using arithmetic operators. 50 is the largest value setprecision seems to accept (setprecision(50)), and I’m a bit confused here. Why 50? Does a number with 51 digits exceed 16 bytes (the maximum a floating point number can occupy)? Or does setprecision() have a rule that it can’t display more than 50 significant digits?

• Alex

You’re pretty much correct. Floating point numbers are not stored in decimal format internally (1.11 is not stored as “1.11”); they are stored as a combination of bits that gets reconstituted into a value very close to 1.11. Just like decimal numbers, some floating point numbers can be represented precisely, and some can not. For example 1/3rd can’t be represented precisely in decimal format (0.333333… you have to truncate somewhere), but 1/10 can (0.1). In binary floating point, 1/10 is one of the values that can’t be represented precisely.

It’s unclear to me whether setprecision() having a max value 50 is a limitation of your compiler or something else.

• cout prints a maximum of 50 significant digits for all setprecision values greater than 50 on my machine. I’m using Code::Blocks.

• Pranav

float f is printed out as 8e-010. I am using visual studio. Is there a mistake in initializing the variable? ignore the random long double.

• Alex

No. 8e-10 and 8e-010 are the same number.

• Pranav

But rather than giving out 0.0000000008 it gives out 8e-10.

• Alex

Yup. std::cout will print some numbers in scientific notation, particularly if they are large or small.

• AK

A good example that helped me understand precision error is the Patriot missile failure during the Gulf War. The internal clock value was multiplied by 1/10 to get real time.
This calculation used 24-bit fixed point multiplication, and over 100 hours it accumulated a drift of 0.34 seconds, which was enough for the missile to go undetected by the radar.

Details at :

• takise

Hello, there’s one thing I can’t figure out: does the rounding error depend on which math operations we use?

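#include <iomanip> // for std::setprecision()
#include <iostream>

int main()
{
    std::cout << std::setprecision(17); // show 17 significant digits
    double d = 0.1;
    std::cout << d << '\n';             // shows 0.10000000000000001
    double d2 = (d + d + d + d + d + d + d + d + d + d);
    std::cout << d2 << '\n';            // shows 0.99999999999999989
    double d3 = 10 * d;
    std::cout << d3 << '\n';            // shows 1  ?? why ??
    return 0;
}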

Why after multiplying the result is without rounding error ?

Thanks upfront for answer and great tutorial!

• Alex

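Yes, the amount of rounding error can depend both on which mathematical operations you use and on how many times you use them. Every operation rounds its result to the nearest representable value, so the more operations you chain, the more the errors tend to grow. The addition case has more error because it performs nine separate additions, each of which can round, while 10 * d rounds only once (and here happens to round to exactly 1).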

• BROKEN WINDOW

Hi Alex,
if we want to store a 60 digits number in a variable or print it in output without scientific notation, what can we do ?
for example in python we can calculate 9999 to the exponent 9999 and it will print the result.
is there any way to do it ?

Thanks.

• Alex

Since C++ doesn’t have support for arbitrarily large integers, probably the best solution here would be to install a library that implements large integers (e.g. https://mattmccutchen.net/bigint/).

• Marina

Hello!
I get an error while trying to run setprecision function.
Mine looks just like in the lesson above.
#include <iostream>
#include <iomanip> // for std::setprecision()

int main()
{
float f(123456789.0f); // f has 9 significant digits
std::cout << std::setprecision(9); // because we want to show all 9 significant digits in f
std::cout << f << std::endl;
return 0;
}

It compiles, but when I try to run it I get "sh:1: Syntax error:"(" unexpected" in the answer box.

What is wrong and how can I fix it?
Regards,
Marina

• Alex

How are you compiling your program? sh:1: Syntax error looks like a script error, not a compiled program error…

• Joseph

Minor typo: "This is because the double had to truncate the approximation due to it’s limited memory," should be ‘its’.

• Alex

it’s vs its is the bane of my existence. Fixed. 🙂

• Kyle

Hi, I’m trying to write a program that evaluates a fraction from the user, does a math operation on it and spits out another fraction (in fractional format, not decimals). Is it possible to do this in C++? I ran a simple program with inputting a fraction into a float but the program doesn’t work, I suspect it’s not possible with floats but how else would I do it?

• Alex

Yes, it is definitely possible, but in order to do that you’ll need to manage the numerator and denominator yourself (e.g. store them as integers) and write the operations to add, subtract, multiply, and divide them yourself. Floats won’t be of use for this.

• babu.tritran

Why?
45.5+45.5=90
or
123.456/654.321=0.188073
It should be 123.456/654.321= 0.1886780342 (right)

Thank Alex.

main.cpp

• Alex

You should heed the warnings of your compiler. Your compiler should be warning you that enterNumber() is returning a float, but the return type is an integer. That’s why you’re getting incorrect answers: all the fractional numbers you’re entering are getting truncated to integers.

As a side note, your summation function is misnamed, as it doesn’t do any summation. You’re doing the summation in the function argument. As written, the function should be named “printResult” or something.

• babu.tritran

You can edit my post again. to give me a better idea.

• Alex

Change:

to

• babu.tritran

oh ^^ thanks Alex.you are a great support.

• Luis

Hi Alex, I’m having trouble reconciling your statement "123456789.0f has 9 significant digits".  Rules for determining the number of sig figs would lead most to say there are 10 significant figures.  Why does the last zero not count as a sig fig?  Just some convention?

Thanks.

• Alex

You are correct -- this was a mistake on my part. I’ve changed 123456789.0 to 123456789, which has 9 significant digits, as intended.

• I just have to say: Amazing tutorial! THANK YOU!!!!!!!!!!!

• Hi Alex
Look at the following snippet:

result: 0.12345
but with the following snippet:

result: 0.123457

Why does this happen, and how can I fix it?

• Alex

std::cout has a default precision of 6 digits, but your expected value of 0.123456789 is 9 digits of precision. You need to tell std::cout to print more digits of precision:

• Austin

Last two lessons…. very tough for me. Just like a lot of information to take in. And it’s very hard to remember without connecting to something meaningful.

• Ben

I understand that there are rounding errors due to conversions between binary and decimal.  Why don’t these errors propagate consistently?

For example, in your example of 0.1, the compiler prints 0.10000000000000001 (which is slightly greater than 0.10000…) when precision is set to 17 sig figs.  So why do you get a number slightly less than 1 when you add 0.1 ten times? Logically, it would seem like the compiler might give a number like 1.00000000000000010 instead of 0.99999999999999989.

• Alex

The errors do propagate as you’d expect. However, when a number can’t be stored exactly in floating point representation, it will generally be rounded to the closest floating-point representation with the given precision. This number could be larger or smaller than the actual number. It also means two numbers that “round down” may sum to one that “rounds up”, or vice versa.

Consider the following program:

This prints:

```
0.1 is 0.10000000000000001
0.4 is 0.40000000000000002
0.4 + 0.1 + 0.1 is 0.59999999999999998
```

Three numbers that are larger than expected add up to one that is smaller than expected.

• Gemparks

Hi, there.
It’s me again.

Based on the following code

C++ source file:

Why does the ans return "inf" instead of a proper one, that is 1.43?

• Alex

It works fine for me. I ran your example and got 1.43. It looks like you might be running an older compiler (since you’re using conio.h). Maybe try upgrading your compiler to a more modern one?

• Osman Zakir

I’m having some trouble with this in Visual Studio (Community 2015 edition) concerning user-defined literal operators.  I have no idea what they are, though.  I tried to do this:

• Alex

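123456789 is an integer literal, and the f suffix is only valid on floating point literals. The compiler therefore interprets 123456789f as a user-defined literal and fails because no matching literal operator exists.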

• Osman Zakir

It compiled when I took away the "f" suffix.  I don’t know why, though, as it compiled fine with that suffix included on any other code I wrote.

• Alex

Because then it treated the literal as an integer, and converted the integer literal to a float when the initialization took place.

• Nicolas

Dear Alex,

Again, thank you so much for this tutorials, they are incredibly helpful, well structured and clear!

I am sorry for bothering with this question, but I am having a very weird issue. I first create a class that has a double attribute. That same object has a function that returns the double attribute. Additionally, I have another class that has an std::vector attribute of objects from the initial class, and a double attribute that is initialized using the member function of the previous class that returns its respective double attribute. Finally, the last class also has a member function that returns the corresponding double attribute.

The problem is that when I use the member function of the second class mentioned in the previous paragraph, the double attribute loses precision.

It is much more clear if you check this piece of code:

Running this code prints the following:

From LC_Raster object 2129292.633
From LC_Product object 2129292.63
Press a key to continue …

Now, I tried to reproduce the error in a simpler piece of code that had the exact same idea, with the difference that I was not using the GDAL library, and I did not get the error. Here is the simpler code:

The output of this code is:

From LC_Raster object 2129292.633
From LC_Product object 2129292.633
Press a key to continue …

As you can see, in this case the member function of the second class (LC_Product) does not reduce the precision of the double variable it returns. I have been trying to figure this out for a long time with no success. I thought the problem was related to the GDAL library, but if you check the codes in detail the problem is related to the transition from the first class (LC_Raster) to the second class (LC_Product), but I have no clue why the simpler code works. I would really appreciate so so much if you have any insight in this issue.

Thank you again for all your help!

Best,
Nicolas

• Nicolas

I forgot to add that I am working in Visual Studio 2010.

• Alex

I am not sure. I’d use a debugger to see if I could determine where the precision was being lost. Step through the program and watch how the value 2129292.633 propagates.

• Vishal

I am running the same code as given above and i am getting this error
"Error    C3688    invalid literal suffix ‘f’; literal operator or literal operator template ‘operator ""f’ not found".  HELP

[code]
#include"stdafx.h"
#include<iostream>
#include <iostream>
#include <iomanip> // for std::setprecision()

int main()
{
float f(123456789f); // f has 9 significant digits
std::cout << std::setprecision(9); // because we want to show all 9 significant digits in f
std::cout << f << std::endl;
return 0;
}
[/code]

• Alex

Try 123456789.0f. The .0 tells the compiler this is a floating point number. I’ve updated the example.

• trebor

I’m a newby but I disagree with a) and b) solutions. We physicists would assume that the zeroes on the right are not significant digits.

• Darren

I think you might be eligible for a refund on your degree.

• David

Can you check the numbers in the chart that is below:

Assuming IEEE 754 representation:

Specifically the exponents in the range column in the last 2 rows for 80 bits (12 bytes) and the 16 bytes.

They don’t seem right.

• Alex

Agreed, the minimum of the range for the 12-byte (80-bit) type was incorrect. It’s fixed now.

• Toasty Turnip

Hi,
I’m a bit confused by the bit regarding the f appended to some numbers. I did some googling but am still confused.

"Note that by default, floating point literals default to type double. An f suffix is used to denote a literal of type float."

However, when you initialize or define a variable, you would have to specify its type as "float" or "double" wouldn’t you? Or do the two for some reason initialize the same type, and the f is the only way to initialize a non-double float?

Thanks for the great tutorial and for still keeping the comment section updated after all these years! 🙂

• Alex

Even though you must assign your variable a type, C++ won’t infer what kind of literal you’re intending based on the type of variable you’re initializing. Therefore, you must use a suffix to specify what type the literal is.

• Toasty Turnip

So I can call it a float and use it as a double? And if I assign it a "double" type, but add the f suffix, it will be treated as a single float?

• Alex

If you do something like this:

The compiler will implicitly convert the float literal to a double before initialization.

• Toasty Turnip

I decided to do a small test but… I still don’t seem to understand the trailing "f"…

I’m not sure how the format turns out, so here’s a pastebin:
http://pastebin.com/TU7R91qS

The output is:

fValue1:                4 bytes
fValue2:                4 bytes
dValue1:                8 bytes
dValue2:                8 bytes
ldValue1                16 bytes
ldValue2                16 bytes

It looks like the size of the float remains 4 bytes regardless of using the trailing f or not. Does it have to do with the compiler? Is using the f the standard, and hence "safe" way to do it, or am I just not getting something?

Sorry for all the questions, and thanks for your patience. 🙂

• Alex

The f suffix exists to tell C++ that the literal number (in this case 0.1) should be treated as a float rather than a double. It doesn’t have any impact on the size of the variable you assign that value to, but you may see differences in precision:

• Peng

34500.0 is equal to 3.45000e4 (6 significant digits)
But why 123456789.0 has 9 significant digits not 10 significant digits in above?

• Alex

It should have been 10. I’ve updated the example.

• Nyap

can someone please explain to me what a "significand" or "exponent" is

• Alex

The significand is the part of a number in scientific notation that contains the significant digits.

The exponent is the same exponent used in standard math. It determines how many times the base is multiplied by itself. E.g. 2^4 means four 2s are multiplied together to get 16.

So in the number 1.2 x 10^4, 1.2 is the significand and 4 is the exponent. This evaluates to 1.2 * 10000 = 12000.

• Nyap

I don’t understand why you need to use the f suffix
isn’t a double basically the same as a float, but with more breathing space (memory)? and if so, why don’t you need to use suffixes for the different types of integers?

• Alex

Generally speaking, literals don’t know what type they are -- numbers are just numbers devoid of context. The suffixes are used to specify what type the literal is intended to be. If you don’t specify a suffix, the compiler will make an assumption about what your intent is.

With integers, the compiler gives an unsuffixed decimal literal the first type among int, long, and long long that can hold its value. This works because integers do not suffer any precision issues, so the compiler only needs to worry about the range of the type. If you then assign the integer to a larger type, no problem.

However, with floating point numbers, there are precision issues, and that complicates things. A floating point literal gets represented differently as a float or a double, with a possible loss of precision, so the compiler can’t just put your floating point literal inside the smallest floating point type that it fits (e.g. a float) and call it good. But it has to do something if you don’t specify a suffix, so it assumes you meant double in that case.

Consider what happens if you do something like this:

4.3 is a decimal number with no suffix, so the compiler assumes you meant a double. 4.3 gets converted to a double, with some loss of precision. That double is then assigned to a float, which incurs another loss of precision. Double precision loss!

If 4.3 had an f suffix, then you only get one loss of precision, when 4.3f is converted to a float. The assignment doesn’t incur an additional loss of precision because the types are already identical by that point.

Make sense?

• Hi, the word significand is correctly written with a t at the end -> significant

I put this page on the whitelist of my Adblock Plus but still get a message that ABP is active.
Any idea?

Greetings from Switzerland, Ronald Hofmann
--

• Alex

No, significand is actually the correct term. It’s a mathematical term for the part of a number in scientific notation that holds the significant digits (mantissa is another name for it).

Unsure about Adblock Plus, I’ve never used it so I can’t advise on how to fix issues with it. 🙁

• Yes, you are right. I realized that 5 seconds after my comment, sorry for that.
Well, it was the first time I put an address on the white list.

Still trying

Greetings from Switzerland, Ronald Hofmann
--

• joe

I have a question.

Let’s say I am setting up a clock for the hours, minutes, seconds, and even milliseconds.
How would I write my code so as to show a maximum number for my milliseconds as well as a minimum?

Let’s say I want it to show all numbers up to the hundredths so it would say .50 or .55. How would I write that?

• Alex

If you want to display a fixed number of decimal places for floating point numbers, you need to tell std::cout to print numbers in fixed-point format. To do that:

The other alternative is to store your hours, minutes, seconds, and milliseconds as integers and do your own formatting.

• joe

cool, thank you.

Another question, is there a way to capture a specific part of the data?
lets say with 123.4567.  I only want to keep the .4567 and discard the rest.

The way I am thinking is to create an integer, making it equal to the float so it captures the whole numbers, then subtracting it from the float.  Would that work?

• Alex

The easiest way is as you suggest, cast the floating point number to an integer and subtract that from the floating point number. e.g.

• Mauricio Mirabetti

Alex, me again.
As a correction note, when explaining how to convert numbers to scientific notation, on the first example, on step two (trimming leading zeros) I believe you trimmed the trailing zero by mistake. That could lead to a misunderstanding of the "protocol".

Slide decimal left 4 spaces: 4.2030e4
No leading zeros to trim: 4.203e4
Trim trailing zeros: 4.203e4 (4 significant digits)"

It could be: