Integers are great for counting whole numbers, but sometimes we need to store *very* large numbers, or numbers with a fractional component. A **floating point** type variable is a variable that can hold a real number, such as 4320.0, -3.33, or 0.01226. The *floating* part of the name *floating point* refers to the fact that the decimal point can “float”; that is, it can support a variable number of digits before and after the decimal point.

There are three different floating point data types: **float**, **double**, and **long double**. As with integers, C++ does not define the size of these types. On modern architectures, floating point representation almost always follows IEEE 754 binary format. In this format, a float is 4 bytes, a double is 8, and a long double can be equivalent to a double (8 bytes), 80-bits (often padded to 12 bytes), or 16 bytes.

Floating point data types are always signed (can hold positive and negative values).

| Category | Type | Minimum Size | Typical Size |
|---|---|---|---|
| floating point | float | 4 bytes | 4 bytes |
| | double | 8 bytes | 8 bytes |
| | long double | 8 bytes | 8, 12, or 16 bytes |

Here are some definitions of floating point numbers:

    float fValue;
    double dValue;
    long double dValue2;

When using floating point literals, it is convention to always include at least one decimal place. This helps distinguish floating point values from integer values.

    int x(5); // 5 means integer
    double y(5.0); // 5.0 is a floating point literal (no suffix means double type by default)
    float z(5.0f); // 5.0 is a floating point literal, f suffix means float type

Note that floating point literals have type double by default. An f suffix is used to denote a literal of type float.

**Scientific notation**

How floating point variables store information is beyond the scope of this tutorial, but it is very similar to how numbers are written in scientific notation. **Scientific notation** is a useful shorthand for writing lengthy numbers in a concise manner. And although scientific notation may seem foreign at first, understanding scientific notation will help you understand how floating point numbers work, and more importantly, what their limitations are.

Numbers in scientific notation take the following form: *significand* x 10^{exponent}. For example, in the scientific notation `1.2 x 10^4`, `1.2` is the significand and `4` is the exponent. This number evaluates to 12,000.

By convention, numbers in scientific notation are written with one digit before the decimal, and the rest of the digits afterward.

Consider the mass of the Earth. In decimal notation, we’d write this as `5973600000000000000000000 kg`. That’s a really large number (too big to fit even in an 8 byte integer). It’s also hard to read (is that 19 or 20 zeros?). In scientific notation, this would be written as `5.9736 x 10^24 kg`, which is much easier to read. Scientific notation has the added benefit of making it easier to compare the magnitude of two really large or really small numbers simply by comparing the exponent.

Because it can be hard to type or display exponents in C++, we use the letter ‘e’ or ‘E’ to represent the “times 10 to the power of” part of the equation. For example, `1.2 x 10^4` would be written as `1.2e4`, and `5.9736 x 10^24` would be written as `5.9736e24`.

For numbers smaller than 1, the exponent can be negative. The number `5e-2` is equivalent to `5 * 10^-2`, which is `5 / 10^2`, or `0.05`. The mass of an electron is `9.1093822e-31 kg`.

In fact, we can use scientific notation to assign values to floating point variables.

    double d1(5000.0);
    double d2(5e3); // another way to assign 5000

    double d3(0.05);
    double d4(5e-2); // another way to assign 0.05

**How to convert numbers to scientific notation**

Use the following procedure:

- Your exponent starts at zero.
- Slide the decimal so there is only one non-zero digit to the left of the decimal.
- Each place you slide the decimal to the left increases the exponent by 1.
- Each place you slide the decimal to the right decreases the exponent by 1.
- Trim off any leading zeros (on the left end)
- Trim off any trailing zeros (on the right end) only if the original number had no decimal point. We’re assuming they’re not significant unless otherwise specified.

Here are some examples:

    Start with: 42030
    Slide decimal left 4 spaces: 4.2030e4
    No leading zeros to trim: 4.2030e4
    Trim trailing zeros: 4.203e4 (4 significant digits)

&nbsp;

    Start with: 0.0078900
    Slide decimal right 3 spaces: 0007.8900e-3
    Trim leading zeros: 7.8900e-3
    Don't trim trailing zeros: 7.8900e-3 (5 significant digits)

&nbsp;

    Start with: 600.410
    Slide decimal left 2 spaces: 6.00410e2
    No leading zeros to trim: 6.00410e2
    Don't trim trailing zeros: 6.00410e2 (6 significant digits)

Here’s the most important thing to understand: The digits in the significand (the part before the E) are called the **significant digits**. The number of significant digits defines a number’s **precision**. The more digits in the significand, the more precise a number is.

**Precision and trailing zeros after the decimal**

Consider the case where we ask two lab assistants each to weigh the same apple. One returns and says the apple weighs 87 grams. The other returns and says the apple weighs 87.000 grams. Assuming the weighings were correct, in the former case, we know the apple actually weighs somewhere between 86.50 and 87.49 grams. Maybe the scale was only precise to the nearest gram. Or maybe our assistant rounded a bit. In the latter case, we are confident about the actual weight of the apple to a much higher degree (it weighs between 86.9950 and 87.0049 grams, which has much less variability).

So in scientific notation, we prefer to keep trailing zeros after a decimal, because those digits impart useful information about the precision of the number.

However, in C++, 87 and 87.000 are treated exactly the same, and the compiler will store the same value for each. There’s no technical reason why we should prefer one over the other (though there might be scientific reasons, if you’re using the source code as documentation).

**Precision and range**

Consider the fraction 1/3. The decimal representation of this number is 0.33333333333333… with 3’s going out to infinity. An infinite length number would require infinite memory to store, and we typically only have 4 or 8 bytes. Floating point numbers can only store a certain number of significant digits, and the rest are lost. The **precision** of a floating point number defines how many *significant digits* it can represent without information loss.

When outputting floating point numbers, std::cout has a default precision of 6 -- that is, it assumes all floating point variables are only significant to 6 digits, and hence it will round off anything after that.

The following program shows std::cout limiting its output to 6 significant digits:

    #include <iostream>
    int main()
    {
        float f;
        f = 9.87654321f; // f suffix means this number should be treated as a float
        std::cout << f << std::endl;
        f = 987.654321f;
        std::cout << f << std::endl;
        f = 987654.321f;
        std::cout << f << std::endl;
        f = 9876543.21f;
        std::cout << f << std::endl;
        f = 0.0000987654321f;
        std::cout << f << std::endl;
        return 0;
    }

This program outputs:

    9.87654
    987.654
    987654
    9.87654e+006
    9.87654e-005

Note that each of these is only 6 significant digits.

Also note that cout will switch to outputting numbers in scientific notation in some cases. Depending on the compiler, the exponent will typically be padded to a minimum number of digits. Fear not, 9.87654e+006 is the same as 9.87654e6, just with some padding 0’s. The minimum number of exponent digits displayed is compiler-specific (Visual Studio uses 3, some others use 2 as per the C99 standard).

However, we can override the default precision that cout shows by using the std::setprecision() function that is defined in a header file called iomanip.

    #include <iostream>
    #include <iomanip> // for std::setprecision()
    int main()
    {
        std::cout << std::setprecision(16); // show 16 digits
        float f = 3.33333333333333333333333333333333333333f;
        std::cout << f << std::endl;
        double d = 3.3333333333333333333333333333333333333;
        std::cout << d << std::endl;
        return 0;
    }

Outputs:

    3.333333253860474
    3.333333333333334

Because we set the precision to 16 digits, each of the above numbers is printed with 16 digits. But, as you can see, the numbers certainly aren’t precise to 16 digits!

The number of digits of precision a floating point variable has depends on both the size (floats have less precision than doubles) and the particular value being stored (some values have more precision than others). Float values have between 6 and 9 digits of precision, with most float values having at least 7 significant digits (which is why everything after that many digits in our answer above is junk). Double values have between 15 and 18 digits of precision, with most double values having at least 16 significant digits. Long double has a minimum precision of 15, 18, or 33 significant digits depending on how many bytes it occupies.

Precision issues don’t just impact fractional numbers, they impact any number with too many significant digits. Let’s consider a big number:

    #include <iostream>
    #include <iomanip> // for std::setprecision()
    int main()
    {
        float f(123456789.0f); // f has 10 significant digits
        std::cout << std::setprecision(9); // to show 9 digits in f
        std::cout << f << std::endl;
        return 0;
    }

Output:

123456792

123456792 is greater than 123456789. The value 123456789.0 has 10 significant digits, but float values typically have 7 digits of precision. We lost some precision!

Consequently, one has to be careful when using floating point numbers that require more precision than the variables can hold.

Assuming IEEE 754 representation:

| Size | Range | Precision |
|---|---|---|
| 4 bytes | ±1.18 x 10^{-38} to ±3.4 x 10^{38} | 6-9 significant digits, typically 7 |
| 8 bytes | ±2.23 x 10^{-308} to ±1.80 x 10^{308} | 15-18 significant digits, typically 16 |
| 80-bits (12 bytes) | ±3.36 x 10^{-4932} to ±1.18 x 10^{4932} | 18-21 significant digits |
| 16 bytes | ±3.36 x 10^{-4932} to ±1.18 x 10^{4932} | 33-36 significant digits |

It may seem a little odd that the 12-byte floating point number has the same range as the 16-byte floating point number. This is because they have the same number of bits dedicated to the exponent -- however, the 16-byte number offers a much higher precision.

*Rule: Favor double over float unless space is at a premium, as the lack of precision in a float will often lead to inaccuracies.*

**Rounding errors**

One of the reasons floating point numbers can be tricky is due to non-obvious differences between binary (how data is stored) and decimal (how we think) numbers. Consider the fraction 1/10. In decimal, this is easily represented as 0.1, and we are used to thinking of 0.1 as an easily representable number. However, in binary, 0.1 is represented by the infinite sequence: 0.00011001100110011… Because of this, when we assign 0.1 to a floating point number, we’ll run into precision problems.

You can see the effects of this in the following program:

    #include <iostream>
    #include <iomanip> // for std::setprecision()
    int main()
    {
        double d(0.1);
        std::cout << d << std::endl; // use default cout precision of 6
        std::cout << std::setprecision(17);
        std::cout << d << std::endl;
        return 0;
    }

This outputs:

    0.1
    0.10000000000000001

On the top line, cout prints 0.1, as we expect.

On the bottom line, where we have cout show us 17 digits of precision, we see that d is actually *not quite* 0.1! This is because the double had to truncate the approximation due to its limited memory, which resulted in a number that is not exactly 0.1. This is called a **rounding error**.

Rounding errors can have unexpected consequences:

    #include <iostream>
    #include <iomanip> // for std::setprecision()
    int main()
    {
        std::cout << std::setprecision(17);

        double d1(1.0);
        std::cout << d1 << std::endl;

        double d2(0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1 + 0.1); // should equal 1.0
        std::cout << d2 << std::endl;
        return 0;
    }

This outputs:

    1
    0.99999999999999989

Although we might expect that d1 and d2 should be equal, we see that they are not. If we were to compare d1 and d2 in a program, the program would probably not perform as expected. We discuss this more in section 3.5 -- Relational operators (comparisons).

One last note on rounding errors: mathematical operations (such as addition and multiplication) tend to make rounding errors grow. So even though 0.1 has a rounding error in the 17th significant digit, when we add 0.1 ten times, the rounding error has crept into the 16th significant digit.

**NaN and Inf**

There are two special categories of floating point numbers. The first is **Inf**, which represents infinity. Inf can be positive or negative. The second is **NaN**, which stands for “Not a Number”. There are several different kinds of NaN (which we won’t discuss here).

Here’s a program showing all three:

    #include <iostream>

    int main()
    {
        double zero = 0.0;
        double posinf = 5.0 / zero; // positive infinity
        std::cout << posinf << std::endl;

        double neginf = -5.0 / zero; // negative infinity
        std::cout << neginf << std::endl;

        double nan = zero / zero; // not a number (mathematically invalid)
        std::cout << nan << std::endl;

        return 0;
    }

And the results using Visual Studio 2008 on Windows:

    1.#INF
    -1.#INF
    1.#IND

INF stands for infinity, and IND stands for indeterminate. Note that the results of printing Inf and NaN are platform specific, so your results may vary.

**Conclusion**

To summarize, the two things you should remember about floating point numbers:

1) Floating point numbers are great for storing very large or very small numbers, including those with fractional components, so long as they have a limited number of significant digits (precision).

2) Floating point numbers often have small rounding errors, even when the number has fewer significant digits than the precision. Many times these go unnoticed because they are so small, and because the numbers are truncated for output. Consequently, comparisons of floating point numbers may not give the expected results. Performing mathematical operations on these values will cause the rounding errors to grow larger.

**Quiz**

1) Convert the following numbers to C++ style scientific notation (using an e to represent the exponent) and determine how many significant digits each has (keep trailing zeros after the decimal):

a) 34.50

b) 0.004000

c) 123.005

d) 146000

e) 146000.001

f) 0.0000000008

g) 34500.0

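**Quiz Answers**

a) 3.450e1 (4 significant digits)

b) 4.000e-3 (4 significant digits)

c) 1.23005e2 (6 significant digits)

d) 1.46e5 (3 significant digits)

e) 1.46000001e5 (9 significant digits)

f) 8e-10 (1 significant digit)

g) 3.45000e4 (6 significant digits)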


Hi Alex!,

Can you give me your definition of the function "long"

maybe show me how to use it in code

Hi Camron!

"long" isn’t a function but a keyword, just like unsigned for example.

"long" tells the compiler to make the variable at least 32 bits in size.

Exactly how much memory is used is up to the compiler.

Writing "long double" asks the compiler to use a little more memory than for a normal "double" which causes the variable to be more accurate and/or be able to store bigger numbers.

More of this is covered in chapters 2.4 and B.2

I don’t understand the question. “long” isn’t a function, it’s a type specifier that can be used to define a long integer or long double.

If I divide two integers, the result should be a float, but I could not get that result. Why?

Here is the program:

    #include <iostream>
    #include <iomanip>
    using namespace std;

    int main()
    {
        int a=8,b=7;
        float area;
        area= (a/b);
        cout << area << endl;
        return 0;
    }

Nope, if you divide two integers, you’ll get an integer. We talk about this more in the next chapter. If you want a float result, you have to cast one of the integers to a float _before_ doing the division:

Is it not inconsistent to keep the zero value significant digits for numbers with a decimal point but ignore those zero value digits for numbers without a decimal point?

For example:

b) 0.004000 = 4.000e-3 (4 significant digits) - in c++ style scientific notation, we are stating that this value is accurate to 4 significant digits.

d) 146000 = 1.46e5 (3 significant digits) - here we are ignoring the fact that this value may actually be accurate to 6 significant digits. We are throwing away 3 digits of precision just because those digits happen to be 0’s.

For notation consistency, should d) not be 1.46000e5?

Or alternatively, to be consistent, b) should be 4e-3 and d) should be 1.46e5?

Nope. With zeros immediately to the right of the decimal, we can be sure they’re significant (because those numbers can’t be rounded). With zeros immediately to the left of the decimal, we can’t be sure whether they’re significant or rounded. Thus, standard convention is that we assume numbers to the left of the decimal aren’t treated as significant unless otherwise stated.

Thanks for the reply, I think I get it now. I’m used to engineering where we would be consistent, like I said above. I realise that digits after the decimal point will only have a limited precision in c++, so that’s why you have to include the significant digits. Hope I’ve got that right.

Can you explain your sentence here:

"Note that by default, floating point literals default to type double. An f suffix is used to denote a literal of type float."

When you initialise a variable, you have to define its type, e.g. int, float or double etc. So in what circumstance would my computer need to default to a variable type, because every time you create a variable syntax dictates that you have to define its type? Hope that makes sense. So if I am initialising a variable:

In what circumstance would the compiler default to a double?

So let’s differentiate two things. Your variable has a type. Your literals also have a type. In your sample code, your variable x has type float. But your literal (because it does not have an f suffix) has a default type of double. Therefore, your compiler will have to implicitly convert double 1.1 to a float value, so it can be used to initialize variable x. By using 1.1f, the literal has type float, which matches with the type of your variable, so no conversion (and chance for error) is necessary.

I think I get it now. Also, I didn’t consider expressions, where I could use the literal 1.1 in an equation, and I wouldn’t have given it a type myself, so it will default to be type double. Thanks

Hi Alex

First, your C++ tutorials are superb - by far the best I found on the web. Clear explanations with lots of great examples!!

At the top of this page you write:

I am still not sure why the f suffix needs to be there (I read your answers to comments about this in the thread)

Reason I ask is that using

my compiler tells me that it takes f as a float whether I include the f suffix or not.

Perhaps some compilers need the "f" and some don’t….?

5.0f means 5.0 as a float. 5.0 (no f suffix) means 5.0 as a double. If you initialize a float variable with 5.0 (no suffix), then you’re assigning a double literal to a float variable, which means a conversion happens.

Your typeid(f).name() is showing you the type of the variable f, not the f suffix. Perhaps it’s a bit confusing to have the variable have the same name as the suffix. I’ll update that.

Hi Alex

Many thanks. That makes sense.

I guess the fact that the f suffix avoids an implicit type conversion from double to float had escaped me!

Rex

Hi Alex,

"The size of the variable puts a limit on the amount of information it can store -- variables that utilize more bytes can hold a wider range of values."

Data type : int

Size: 4 bytes

Range: -2,147,483,648 to 2,147,483,647

Data type: float

Size: 4 bytes

Range: ±1.18 x E-38 to ±3.4 x E38

I am wondering why floating point variables can hold much larger range of values than integer variables can since both have the same size of 4 bytes?

Thank you, Have a great day

Good question. The answer is that integers and floating point numbers use their bits differently, with each method having tradeoffs. Let’s explore this with an exercise.

If I gave you 4 digits to make numbers out of, how many different numbers could you make? And what is the largest number you could make? (assume positive numbers only for simplicity). You might initially think: with 4 digits, you can make all the numbers between 0 and 9999, so 10000 numbers, with the highest being 9999. That’s what integer representation does.

But consider what would happen instead if you used the first two digits as a base and the last two as an exponent. You could still make around 10000 different numbers, but now your largest number would be 99^99, which is much larger than 9999. However, the tradeoff is that you could no longer represent a number like 115 exactly in this scheme, because you no longer have enough precision to precisely represent this number. That’s what floating point representation does.

Does that make sense?

Looks like we can’t use scientific notation to assign values to integer variables?. Let’s say if I’d like to assign 5973600000000000000000000 to an integer variable, I’d have to type 5973600000000000000000000, not 5.9736 x E24?

For example:

ll x = 5973600000000000000000000;

ll x = 5.9736 x E24; // is it ok?

Thank you.

You can do this, but it’s not a good idea because these numbers are considered floating point literals, and thus subject to precision issues. With 32-bit integers, this isn’t likely to be an issue, but with 64-bit integers, the precision of the floating point literal may be less than the number desired. Your compiler will probably also complain about the conversion (or reject it altogether if you’re doing a uniform initialization, which disallows narrowing conversions)

Dear Teacher,

Please let me say you that I have a special interest in floating point numbers. Could you please suggest me a website about?

Regards.

What more do you want to learn about floating point numbers? In the comments of this article, there should be a link to an article on how floating point numbers are represented in memory, but it’s pretty complicated.

Dear Teacher, please accept my thanks for you replied my comment. Already I have chosen Wikipedia’s article "Single-precision floating-point format" as starting point for learn this subject. Regards.

Dear Teacher, could you please explain me why your second number in output of your program

is 3.333333333333334? I have used different platforms and output is 3.333333333333333. Regards.

Floating point operations are always imprecise to some degree. Some architectures may have better hardware support for precise calculations than others. They should never be different by much, but as you’ve just discovered, with floating point you can never assume a precise value.

Dear Teacher, please let me ask: by "In the comments of this article, there should be a link to an article on how floating point numbers are represented in memory, but it’s pretty complicated" do you mean https://en.wikipedia.org/wiki/Floating_point suggested by Mr. James Ray, posted by February 16, 2017 at 1:20 am? Regards.

No, not that one. I thought it was linked from the comments but apparently not, and I can’t find it in my notes any more. If you look up “floating point representation ieee 754 denormalized” on Google I’m sure you’ll get a ton of articles of interest.

My dear Teacher,

Please let me answer your question: (is that 19 or 20 zeros?). That is 20 zeros!

With regards and friendship.

Hi Alex.

Whenever you talked about uniform initialization, you said that narrowing conversions were forbidden. Why is it that float value{ 1.0 } runs fine but float value{ 1.0 / 3 } does not? In both cases, aren’t you converting a double to a float?

The standard says, “A narrowing conversion is an implicit conversion… from long double to double or float, or from double to float, except where the source is a constant expression and the actual value after conversion is within the range of values that can be represented (even if it cannot be represented exactly)”

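So it looks like technically the 1.0 isn’t considered a narrowing conversion in this context since it’s a constant expression and within the range of a float. By that same wording, 1.0 / 3 is also a constant expression whose value fits within the range of a float, so a conforming compiler should accept it as well. If your compiler rejects it, that is most likely a compiler-specific quirk.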

My dear c++ Teacher,

Can you please give me your definition of literal number?

With regards and friendship.

A fixed value that has been inserted into the code, such as 5, 6.7, or ‘a’.

In the expression x + 5, 5 is an integer literal.

My dear Teacher,

Please let me thank you for replied my request, and say that phrase "When we assign literal numbers to floating point numbers" in the introduction of this lesson, is ambiguous. Do you mean "When literal numbers are floating point numbers"?

With regards and friendship.

I rewrote the sentence as, “When using floating point literals, it is convention to always include at least one decimal place”. That should be a bit clearer. Thanks for pointing out the ambiguity.

Dear Teacher, please accept my thanks and congratulations for rewriting this sentence. Now it is clear. Regards.

Please write like a normal human.

Dear Teacher, please let me the question: Do you consider "dear Teacher" an abnormal human’s expression? Regards.

No, but it’s a bit overly formal for the internets. 🙂

Dear Teacher, please let me thank you for replied my comment.

1. In your comment there is not "reply" capability. Then I replied with Mr. William’s "Reply" capability.

2. Also I please you forgive me for I use formal expressions. Although Greek, I live in France and like french savoir vivre as much I dislike cowboy’s. Regards.

I had a hard time understanding this lesson. I got the gist of the first part but you lost me towards the end. Honestly, how important is it to understand ALL of this? Is it really going to be used much in everyday coding?

From a lesson standpoint, I don’t think there are lessons that build on top of this one, so assuming you understand basic usage, you should be fine to continue the tutorials.

As for whether it’s used in everyday coding, that depends entirely on what type of programs you’re creating. In some cases, they’re not used at all. In other cases, they are used all the time.

Yeah I suppose that makes sense. Alrighty then

Hey Alex,

In this code why did you use “\n”. I tried the code after omitting it and it worked just fine. Thanks in advance.

‘\n’ does a newline, same as std::endl. I realized I hadn’t explained this yet, so I replaced the \n with std::endl for now.

Hi. Great set of tutorials. Thank you for sharing. I had a couple of questions.

1) I compiled and executed the following piece of code :

Giving the following output :

Shouldn’t it have printed 1.23456 in the first line and 1.234567 in the second ?

2) Why do we need a suffix ‘f’ while initializing a variable of type float?

1) Floating point numbers round the last digit if more precision is available that can’t be displayed. So 1.234568 rounds to 1.23457, not 1.23456.

2) The f suffix tells the compiler the literal is of type float. If you don’t include the f suffix, your literal will be of type double instead.

Hey Alex, thank you very much for this wonderful website, but I’d like to ask you about this specific lesson and how a rounding error occurs.

I saw the code

d2 came up as 0.99999999999999989 as expected, and when I used

it also came as 0.99999999999999989, but when I tried

it came out as 1 perfectly fine, can you explain why this is? thank you very much for your tutorials and i hope this weird doesn’t go unanswered, also if it is possible is there a desirable or better way of writing this code?

I’m not sure on this one. The topic of floating point representation of numbers (which is what produces precision issues) is an esoteric topic.

0.7 + 0.1 + 0.1 + 0.1; also produces 0.99999999999999989, but 0.8 + 0.1 + 0.1 produces 1.

That’s a shame D: But thank you for answering regardless, have a good day 😀

Question: doesnt std:: signify that something is in the standard library? yet we need <iomanip> for std::setprecision()?

Yes, std:: means it’s in the standard library. However, the standard library is scattered across many different header files. std::setprecision is declared inside the std namespace in the iomanip header, so you should include that header if you want to use it.

Hi Alex, I have four problems about floating point numbers,

Here’re my Questions:

1) In my computer architecture the ranges of long double and double are different (when I sizeof the long double and double types, I have 12 bytes for long double and 8 bytes for double). But the precision of double and long double is the same. Why?

2) When I use float for summing the numbers 0.1 ten times I get 1 as expected result. Why this?

3) Can I have some explanations about how the compiler rounds the numbers, please if it’s possible.

4) When I divide 0 (integer) by 0

the program falls

but I divide 0.0 (decimal) by 0

I have the expected result:

nan

Why an integer divided by 0 make the program falls?

1) That’s the way it’s defined. Presumably long double has a larger range.

2) Your compiler may be optimizing this and avoiding the precision error.

3) The details of how numbers are converted into floating point format is way outside the scope of these tutorials.

4) Dividing an integer by an integer produces an integer result. Dividing a floating point number by an integer produces a floating point result. NaN is a floating point result.

Thanks you very much for your replies Alex, I understood

Have a nice day!

So I’ve been trying to playing around with the precision of the various float types and how they can produce error, so I made this program:

and using VS Community 2015, it compiled and output:

So three questions:

1) Why did I have to use 1.0f and 3.0f on all of my numbers? If I remove the ‘f’ on any of them (float, double float, or long double) then the code does not compile.

2) Why does it appear that my float type has the highest precision of the three results? I would think if 1/3 were a terminating decimal in binary and float had enough bits to hold it, I would simply get the same precision on all three results.

3) Why does it appear that both double and long double types have the same precision? Shouldn’t long double allow for more bits to be used and would result in a more accurate decimal for 1/3?

1) With no f suffix, your floating point literals are treated as doubles. I’m not sure why this wouldn’t compile -- perhaps your compiler is warning you about converting a double value back into a float, which can cause a loss of precision.

2) Most likely it has something to do with the fact that your literals are floats. Perhaps the compiler is doing some kind of optimization here helping to avoid a precision issue in that one case.

3) Your compiler may not support long double, and thus treat double and long double the same. Try doing a sizeof() each and see if they both come out as 8 bytes.

After carefully reading your definitions on double, float and long double I still don’t know what they are for or how I can use them. I need some more in-depth confirmation on the subjects, but perhaps I found something that would kick me off to a start.

I wrote this code:

It will put up a prompt where I can type stuff, and it gathers 2 or more entries of numbers until I type "-1" where it will sum the numbers I’ve typed and it will display it. I noticed when I used "int a,z = 0;" that I was only able to type numbers without decimals. If I typed, "9" it would be fine, but if I typed, "9.2" the program would not work anymore. With double or float I am able to type a lot of numbers beyond the decimal, "9.2, 9.222, 9.1239123123123," anything goes. So my conclusion is, int cannot store enough memory for numbers with decimal points and other things like float and double can. Let me know if I’m correct or not, I’d love to know.

By the way, what should I call commands like, "int, double, float, long double"? I called them "things" in this question because I don’t know what to call them. Info on that would be appreciated too. Thanks for the tutorials.

Integers (int, long, etc…) are designed for storing integer values (those with no decimal points). If you don’t need to store decimal points, you should use an integer.

Floating point numbers are used for storing numbers that need a decimal. The main difference between float, double, and long double is how many significant digits (precision) they can store before running out of room, and how large a range of values they can represent.

int, float, double, long double, etc… are called types (short for data types).

1. WHY 87 AND 87.000 ARE TREATED SAME? ONE IS CONSIDERED AS FLOAT AND OTHER AS INTEGER.

2. WHY WE SPECIFY F? EG. float f=4.76f;

3. WHY float f=1.0;

cout<<f;

PRINTS "1" ?

4. WHY THE RESULTS ARE IRRELEVANT

EG. float f=3.3333333333333;

cout<<f;

OUTPUT

3.33333326347 ????

THE FLOAT SHOULD SHOW ONLY 6-7 NUMBERS LIKE 3.333334

1) 87 is an integer literal, and 87.000 is a double literal.

2) The f suffix tells the compiler that a floating point literal should be interpreted as a float instead of a double.

3) By default, std::cout doesn’t print trailing zeros (or the decimal point, if nothing would follow it).

4) When I tried this on Visual Studio 2015, I got 3.33333, which is correct -- std::cout’s default precision is 6 significant digits. It sounds like either your compiler is defaulting to some other precision, or a statement in some code above is changing the default precision for your program.

What is the meaning of this: "\n" when we write std::cout << posinf << "\n";

I cover ‘\n’ in the lesson on chars, which is later in this chapter.

I was wondering why some of my answers were wrong…

a) 34.50 => 3.45e1

b) 0.004000 => 4e-3

c) 123.005 => 1.23005e2

d) 146000 => 1.46e5

e) 146000.001 => 1.46000001e5

f) 0.0000000008 => 8e-10

g) 34500.0 => 3.45e4

With a little Google-Fu, I found out that, in science, when a measurement is made, 42 means something while 42.0000 means the measurement has been done with a much higher precision than in the first case. But this is the scientific reason.

What is the C++ reason why removing trailing 0s is wrong? Can someone give me an example where removing trailing 0s from the scientific notation leads to errors?

I tried to hunt the error with:

I added a new subsection called “Precision and trailing zeros after the decimal” that talks more about this. In short, C++ doesn’t care about trailing zeros after the decimal. Whether you include them or not is up to you.

f = 9.87654321f;

Why is the f here??

The f suffix means the number is a float. I discuss this in a few lessons.

Thank you for your answer.

Hi Alex,

When you set the precision of some floating point higher than its significant digits, where do all the numbers after the last significant digit come from (like in the example of 3.3333333333333)?

Not all decimal numbers can be represented precisely in floating point format. This can manifest in a few different ways:

1) Much like 1/3 ends up as 0.333333333 (repeating) in decimal, some numbers that have concise representations in decimal format (such as 0.1) have repeating representations in binary, which is what floating point formats use.

2) In some other cases, the exact decimal value can’t be represented in floating point format, so the closest value that can be represented is picked. This can lead to cases where you wanted some number but get a number that’s slightly smaller or larger than the one you were expecting. The extra digits you see at a high precision are simply the decimal expansion of that closest stored value.

"Each place you slide the decimal to the left increases the exponent by 1.

Each place you slide the decimal to the right decreases the exponent by 1."

Isn’t it the other way around?

I r cunfosd

Nope.

Start with: 42030

This is the same as 42030.0

Slide decimal left 4 spaces: 4.2030e4

We slid the decimal 4 spaces to the left, so the exponent increased from 0 to 4.

that made me cringe.

5/1 = 5

5/0.5 = 10

5/-1 = -5

5/-0.5 = -10

as they approach 0 they are simultaneously negatively and positively ‘infinite’ and hence we say anything divided by 0 is undefined.

I am genuinely surprised and concerned that 5/0 does not return NaN or at least some error.

please watch this video;

https://www.youtube.com/watch?v=BRRolKTlF6Q

On Visual Studio 2015, Windows 10, Intel Core i3-2130 @3.40 GHz, 8 GB RAM, x64 processor, when I run:

I get the output:

inf

-inf

-nan(ind)

Learn more about NaN here: https://en.wikipedia.org/wiki/NaN.