Integers are great for counting whole numbers, but sometimes we need to store *very* large numbers, or numbers with a fractional component. A **floating point** type variable is a variable that can hold a real number, such as 4320.0, -3.33, or 0.01226. The *floating* part of the name *floating point* refers to the fact that the decimal point can “float”; that is, it can support a variable number of digits before and after the decimal point.

There are three different floating point data types: **float**, **double**, and **long double**. As with integers, C++ does not define the size of these types. On modern architectures, floating point representation almost always follows IEEE 754 binary format. In this format, a float is 4 bytes, a double is 8, and a long double can be equivalent to a double (8 bytes), 80-bits (often padded to 12 bytes), or 16 bytes.

Floating point data types are always signed (can hold positive and negative values).

Category | Type | Minimum Size | Typical Size |
---|---|---|---|
floating point | float | 4 bytes | 4 bytes |
floating point | double | 4 bytes | 8 bytes |
floating point | long double | 4 bytes | 8, 12, or 16 bytes |

Here are some definitions of floating point numbers:

```cpp
float fValue;
double dValue;
long double dValue2;
```

When we assign literal values to floating point variables, it is convention to use at least one decimal place. This helps distinguish floating point values from integer values.

```cpp
int n(5);      // 5 means integer
double d(5.0); // 5.0 means floating point (double by default)
float f(5.0f); // 5.0 means floating point, f suffix means float
```

**Scientific notation**

How floating point variables store information is beyond the scope of this tutorial, but it is very similar to how numbers are written in scientific notation. **Scientific notation** is a useful shorthand for writing lengthy numbers in a concise manner. And although scientific notation may seem foreign at first, understanding scientific notation will help you understand how floating point numbers work, and more importantly, what their limitations are.

Numbers in scientific notation take the following form: *significand* x 10^{exponent}. For example, in the scientific notation `12 x 10^{3}`, `12` is the significand and `3` is the exponent. This number evaluates to 12,000.

Consider the mass of the Earth. In decimal notation, we’d write this as `5973600000000000000000000 kg`. That’s a really large number (too big to fit even in an 8 byte integer). It’s also hard to read (is that 19 or 20 zeros?). In scientific notation, this would be written as `5.9736 x 10^{24} kg`, which is much easier to read. In addition, scientific notation makes it easier to compare the magnitude of two really large or really small numbers simply by comparing the exponent.

Because it can be hard to type or display exponents in C++, we use the letter ‘e’ or ‘E’ to represent the “times 10 to the power of” part of the equation. For example, `12 x 10^{3}` would be written as `12e3`, and `5.9736 x 10^{24}` would be written as `5.9736e24`.

For numbers smaller than 1, the exponent can be negative. The number `5e-2` is equivalent to `5 * 10^{-2}`, which is `5 / 10^{2}`, or `0.05`. The mass of an electron is `9.1093822e-31 kg`.

In fact, we can use scientific notation to assign values to floating point variables.

```cpp
double d1(500.0);
double d2(5e2);  // another way to assign 500

double d3(0.05);
double d4(5e-2); // another way to assign 0.05
```

Here’s the most important thing to understand: The digits in the significand (the part before the E) are called the **significant digits**. The number of significant digits defines a number’s **precision**. The more digits in the significand, the more precise a number is.

**Precision and range**

Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3’s going out to infinity. An infinite length number would require infinite memory to store, and we typically only have 4 or 8 bytes. Floating point numbers can only store a certain number of significant digits, and the rest are lost. The **precision** of a floating point number defines how many *significant digits* it can represent without information loss.

When outputting floating point numbers, cout has a default precision of 6; that is, it assumes all variables are only significant to 6 digits, and hence it will round away anything after that.

The following program shows cout limiting its output to 6 significant digits:

```cpp
#include <iostream>

int main()
{
    using namespace std;

    float f;
    f = 9.87654321f;
    cout << f << endl;
    f = 987.654321f;
    cout << f << endl;
    f = 987654.321f;
    cout << f << endl;
    f = 9876543.21f;
    cout << f << endl;
    f = 0.0000987654321f;
    cout << f << endl;

    return 0;
}
```

This program outputs:

```
9.87654
987.654
987654
9.87654e+006
9.87654e-005
```

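Note that each of these is only 6 significant digits. Also note that cout will switch to outputting numbers in scientific notation in some cases. The number of digits shown in the exponent (`e+006` here, `e+06` on many other compilers) depends on the compiler.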

However, we can override the default precision that cout shows by using the setprecision() function that is defined in a header file called iomanip.

```cpp
#include <iostream>
#include <iomanip> // for setprecision()

int main()
{
    using namespace std;

    cout << setprecision(16); // show 16 digits
    float f = 3.33333333333333333333333333333333333333f;
    cout << f << endl;
    double d = 3.3333333333333333333333333333333333333;
    cout << d << endl;

    return 0;
}
```

Outputs:

```
3.333333253860474
3.333333333333334
```

Because we set the precision to 16 digits, each of the above numbers has 16 digits. But, as you can see, the numbers certainly aren’t precise to 16 digits!

Variables of type float typically have a precision of about 7 significant digits (which is why everything after that many digits in our answer above is junk). Variables of type double typically have a precision of about 15 to 16 significant digits. Long double can have a precision of about 15 to 16, 18, or 33 to 34 significant digits depending on how many bytes it occupies.

Precision issues don’t just impact fractional numbers; they impact any number with too many significant digits. Let’s consider a big number:

```cpp
#include <iostream>
#include <iomanip> // for setprecision()

int main()
{
    using namespace std;

    float f(123456789.0f);   // f has 9 significant digits
    cout << setprecision(9); // because we want to show all 9 significant digits in f
    cout << f << endl;

    return 0;
}
```

Output:

123456792

123456792 is greater than 123456789. The value 123456789.0 has 9 significant digits, but float only supports about 7 digits of precision. We lost some precision!

Consequently, one has to be careful when using floating point numbers that require more precision than the variables can hold.

Assuming IEEE 754 representation:

Size | Range | Precision |
---|---|---|
4 bytes | ±1.18 x 10^{-38} to ±3.4 x 10^{38} | 7 significant digits |
8 bytes | ±2.23 x 10^{-308} to ±1.80 x 10^{308} | 15 significant digits |
80-bits (12 bytes) | ±3.65 x 10^{-4951} to ±1.18 x 10^{4932} | 18 significant digits |
16 bytes | ±3.36 x 10^{-4932} to ±1.18 x 10^{4932} | 34 significant digits |

*Rule: Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to challenges.*

**Rounding errors**

One of the reasons floating point numbers can be tricky is due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems.

You can see the effects of this in the following program:

```cpp
#include <iostream>
#include <iomanip> // for setprecision()

int main()
{
    using namespace std;

    double d(0.1);
    cout << d << endl; // use default cout precision of 6
    cout << setprecision(17);
    cout << d << endl;

    return 0;
}
```

This outputs:

```
0.1
0.10000000000000001
```

On the top line, cout prints 0.1, as we expect.

On the bottom line, where we have cout show us 17 digits of precision, we see that d is actually *not quite* 0.1! This is because the double had to truncate the approximation due to its limited memory, which resulted in a number that is not exactly 0.1. This is called a **rounding error**.

Rounding errors can have unexpected consequences:

```cpp
#include <iostream>
#include <iomanip>

int main()
{
    using namespace std;

    cout << setprecision(17);

    double d1(1.0);
    cout << d1 << endl;

    double d2(0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1); // should equal 1.0
    cout << d2 << endl;
}
```

This outputs:

```
1
0.99999999999999989
```

Although we might expect that d1 and d2 should be equal, we see that they are not. If we were to compare d1 and d2 in a program, the program would probably not perform as expected. We discuss this more in section 3.5 — Relational operators (comparisons).

One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit.

**Conclusion**

To summarize, the two things you should remember about floating point numbers:

1) Floating point numbers are great for storing very large or very small numbers, including those with fractional components, so long as they have a limited number of significant digits (precision).

2) Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are truncated for output. Regardless, comparisons on floating point numbers may not give the expected results, and performing math on these values will cause the rounding errors to grow larger.


It is far less confusing after reading this. Thank you. In class we covered this in about ten minutes and moved on to the next thing.

I’ve been to a few sites already trying to get a grasp on these (floats/doubles) and this summation really did the trick.

Thanks!

What do the f’s after some of the float and double values mean?

By default, if you type a floating point value into C++ it’s typed as a double. Consequently, if you initialize a float variable with a literal such as `4.53`, you’re assigning a double to a float, which loses precision, and the compiler will probably complain.

Putting an “f” after the value means that you intend that value to be a float, not a double. Then when you initialize the variable with `4.53f` instead, you’re assigning a float value to a float variable, which makes more sense.

I missed something. How is it that 4.53 is a double and not a float?

The 4.53 is a literal constant of type `double` by default. When you add the f suffix to it like 4.53f it then becomes a literal constant of type `float`.

Is there something wrong with my code?

The compiler brings up a problem with setprecision()..

Thanks

You need to include the iomanip header (`#include <iomanip>`) to use setprecision() that way.

See the lesson on ostream for more info about output manipulators and stuff.

Coming from a Java background, I wonder if anyone can recommend a C++ library with functionality similar to Java’s BigDecimal.

Preferably one that works on Linux with gcc (so not the decimal type from Visual C++).

So with all the rounding errors and precision problems, how do programmers deal with operations that need to display something that would end up with a precision or rounding error? Or am I just over-thinking things?

Most of the time it’s simply not necessary to display a number to the number of significant digits where precision/rounding errors creep in. Generally with floating point numbers, programs will truncate the display to 2-5 decimals.

This is a very helpful site.

how can i make the value a user inputs into a float?

then when i run the program from main() and i put in 2 values like eg. x = 10 y = 3 then the answer is 3 instead of 3.333333

You are already storing the user input values as a float. The problem is that your function is returning an integer, so it’s truncating the result of x/y. Change your function to return a float and you will be good.

I set the precision level to 4, and added cout for the 2 values, fValue1 + fValue2.

I got fValue1 IS actually rounded off to 1.345 and fValue2 IS actually 1.123, so I now expected to get the result of 2.468, but it still reports 'fTotal is not 2.468'.

Why is that?

Chris

Rounding error. The numbers printed on your screen by cout are rounded in this case, so you’re not seeing the full representation. However, when you do the comparison, it does so with the actual numbers, not the rounded ones, which can lead to rounding issues.

How do I convert a Float say

x = 1234.567890123456789

to

y = 1234.5678901234 (small float ..10 decimal places only)

Something similar to setPrecision, to use NOT for display/Printing, but to use as a value for calculations / pass it on to a Database etc ?

I’m not sure what the best way to do this is. For small numbers, you can multiply by 10^x, cast to an integer to drop the remaining decimals, then divide by 10^x. However, if your number is too large you’ll overflow the int when you do the casting so I won’t say this is foolproof.

Didn’t understand how 0.1 is represented in binary by 0.00011001100110011…

In decimal, .1 is tenths, .01 is hundredths, .001 is thousandths and so on. Likewise, in binary, .1 is halves, .01 is quarters, .001 is eighths, and so on.

0.000110011… would be equal to 1/16 + 1/32 + 1/256 + 1/512 + …

Ok, so in binary, we have to approximate the 1 of decimal 0.1 with an infinite sum. What I still don’t understand is why we don’t use all the “weights”, that is, all the powers of 2, but only 1/16, 1/32, 1/256, 1/512 and so on, that is, the 4th position (2^4 = 16), the 5th, the 8th, the 9th, and so on. In other words, why don’t we have 0.011111111…….. which is equal to 1/2 + 1/4 + 1/8 + 1/16 + …? It also approaches 1! (I am referring of course to the decimal part of 0.1, that is, the 1.)

That’s exactly why you can’t use every power of 1/2. The infinite sum would add up to 1, which is ten times the number we require. In order for the sum to add up to 0.1, you would need to add Sum[(1/2)^(4n) + (1/2)^(4n+1)], taking n from 1 to infinity. You can try it yourself if you want.

Yes, you are right, this sum indeed converges to 0.1, whereas the sum I used converges to 1.0. Thank you for the clear and concise explanation!

There is also a good explanation in Wikipedia (yes, sometimes – not often though – Wikipedia has good articles):

“Fractions in binary

Fractions in binary only terminate if the denominator has 2 as the only prime factor. As a result, 1/10 does not have a finite binary representation, and this causes 10 × 0.1 not to be precisely equal to 1 in floating point arithmetic. As an example, to interpret the binary expression for 1/3 = .010101…, this means: 1/3 = 0 × 2^(-1) + 1 × 2^(-2) + 0 × 2^(-3) + 1 × 2^(-4) + … = 0.3125 + … An exact value cannot be found with a sum of a finite number of inverse powers of two, and zeros and ones alternate forever.”

Follows a table of the conversion (fractional approximations) for fractions from decimal to binary. For the ones who are interested:

http://en.wikipedia.org/wiki/Binary_numeral_system

When I run the following code, the values seem really wrong when output. What is going wrong here?

your C ++ compiler has a tendency to roundoff 8th precision onwards.

For any value lesser then 8;

It will display 1 lesser than called for.

Hi there! Congratulation, very good explanation!!! Just what I was looking for.

Thank you.

When I run this code, I get z = 0.333333 and q = 0

```cpp
float x = 1;
float y = 3;
float z = (x/y);
float q = (1/3);
```

Can someone explain why? I realize that if I write

float q = (1.0/3.0);

that this problem doesn’t occur, but I’m just wondering why I can’t use (1/3) since q is defined as a float. This page says it’s just a convention to have the decimal point.

Think it through as follows:

float x = 1 reads “put INT 1 into FLOAT x.” This changes its type from int to float. The same is true for float y = 3.

Thus float z = x/y divides two floats and returns a float.

However for float q = (1/3), this is a two part statement.

The first part (1/3) reads “divide INT 1 by INT 3”. Since this is division of two integers, this means it must return an integer (the fractional part is dropped), which in this case is 0.

The second part is then q = 0, which reads “put INT 0 into float q.”

An important thing to keep in mind is that division on a float is different than division on an integer. The literal 1 is read as an integer, however, the literal 1.0 is read as a float/double. This is why q = (1.0/3.0) is different than q = (1/3).

Hope this helped.

thank you!

Yup, 1 / 3 performs integer division (which gives an answer of 0, as the fractional component is dropped) , whereas 1.0 / 3.0 performs floating point division (which gives an answer of 0.333333…)

This is a very good article on the floating-point computation issue: “Microsoft Visual C++ Floating-Point Optimization”, by Eric Fleegal, MSDN, 2004

http://msdn.microsoft.com/en-us/library/aa289157(v=vs.71).aspx

Hi people,

I am new to C++ so please don’t flame me

I wrote a simple prog. to test this course but something isn’t really working well and I can’t figure out why…

```cpp
#include <iostream>
#include <string>
#include <iomanip> // for setprecision()

using namespace std;

int main()
{
    cout << setprecision(7); // 7 decimals

    float v = 1;
    float j = 3;
    float cc;
    cc = v / j;

    // TEST with FLOAT NUMBERS
    float ff = 0.3333333; // 7 decimals as set in "setprecision(7);"

    if (cc < ff) {
        cout << "cc is smaller than ff" << endl;
    }
    else if (cc > ff) {
        cout << "cc is bigger than ff" << endl;
    }
    else {
        cout << "cc equals to ff" << endl;
    }

    cout << cc << " = cc" << endl;
    cout << ff << " = ff" << endl;

    return 0;
}
```

The output gives me that cc is bigger than ff…

I don’t understand why as I set precision to 7 and my var ff has also 7 decimals.

They should both be equal.

Any suggestions where I made an error?

Thanks!!

I have tested your code.

No, you didn’t make a mistake.

I think the setprecision() function only sets the precision used when cout displays your variable.

I mean setprecision() can’t change your variable. In your code cc = v/j, so cc is stored as approximately 0.3333333333…, and setprecision() can’t change this. You stored ff as 0.3333333.

0.3333333333… is greater than 0.3333333, isn’t it? So your code shows cc is bigger than ff.

Read the “Comparison of floating point numbers” part of this tutorial. It doesn’t say that you can use setprecision() for comparison of floating point numbers.

Thank You

How do I control the number of digits after the decimal point in C++?

In C, if I take a floating point variable f = 123.4567 and I want to show only 2 digits after the decimal point, I use `printf("%0.2f", f)`. Then it will show 123.46.

In C++ I have to use setprecision(). But it determines the total number of digits, not just the digits after the decimal point, so it causes a problem. If I don’t know what my floating point variable will contain after a calculation, it could contain 123.123 or 1234.123. If I set the precision to 5, the first case will show 123.12 and the second case will show 1234.1! But I always want to show 2 digits after the decimal point in every case. How can I do that in C++?

i have same question like you

Use the `std::fixed` stream manipulator and the `precision()` member function of the stream. For example, if you want to display with 2 decimal places:

```cpp
double pi = 3.14159;
std::cout.precision(2);
std::cout << "Today's price for a slice of pi is $" << std::fixed << pi << std::endl;
```

and it should print:

`Today's price for a slice of pi is $3.14`

Alex, the first example demonstrating setprecision() is missing the closing curly bracket for main()

Fixed. Thanks!

I'm perplexed! I'm working through Bjarne S's book and have done one of the drills but I don't understand what's going on with varying results.

Here's some output;

```
9.99
10
smaller is a 9.99 the larger is b 10
result 0.01
they are almost equal

99.99
100
smaller is a 99.99 the larger is b 100
result 0.01

199.99
200
smaller is a 199.99 the larger is b 200
result 0.01
they are almost equal
```

Why doesn't entering 99.99 and 100 give the message "they are almost equal"?

This is an interesting example of rounding errors.

10 – 9.99 = 0.01, but due to rounding error, C++ is representing this as 0.0099999999999997868

100 – 99.99 = 0.01, but due to rounding error, C++ is representing this as 0.0100000000000005116

One of these is larger than 0.01, and one is smaller.

Argggh! I almost went insane yesterday evening trying to see where I'd gone wrong.

How does one avoid rounding errors? I can think of any number of applications where rounding errors even at the level of precision in my example might be disastrous.

There are a couple of ways to “avoid” rounding errors:

1) Avoid use of floating point numbers altogether (sometimes this is possible, sometimes it isn’t).

2) Don’t do raw comparisons like you’re doing. In section 3.5 — Relational operators (comparisons), we discuss how to tell if floating point numbers are equal. These can be extended to handle less than/greater than cases. This would help avoid the case you see above.

3) Ensure that when you use floating point numbers, you only treat them as accurate to a certain level of precision.