September 2020
Distributions form the central element of probabilistic thinking. We confront the difficulty of understanding the concepts of probability distribution functions, cumulative distribution functions, and their assorted conversions. Learning how to use the R functions that work with these distributions adds more difficulty because the functions can be inscrutable. I sympathize with students because the functions confuse me.
I hope that this document helps alleviate some of that confusion.
Conceptual Foundation
Distributions
Distributions provide the foundation for statistical thinking. Distributions help us determine how likely it is that we will observe particular values in our data. Throughout this document, I use the Gaussian, or normal, distribution, but many of the same concepts apply to other distributions.
Turning values of data into locations in a distribution provides the key to unlock probabilistic reasoning. In the Gaussian distribution, the key comes in the form of z-scores. Z-scores represent the data in units of standard deviations away from the mean. To get a z-score for any value, \(Y^*\), we subtract the mean from the value, \(Y^*-\bar{Y}\), which gives us the error.
The error represents the value of a measure in the units of measurement. For example, if we measure height in inches, the error would also be measured in inches. To use z-scores, we need to convert the error into units of standard deviations. To do that, we divide the error by the standard deviation (\(\sigma\)):
\[\frac{Y^*-\bar{Y}}{\sigma} = \frac{e_i}{\sigma} = z^*\]
The term \(e_i/\sigma\) provides the key to convert units of measurement to z-scores and z-scores to units of measurement.
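As a quick illustration, here is a minimal R sketch that computes a z-score by hand. The height values are hypothetical numbers invented for this example, and sd() returns the sample standard deviation as a stand-in for \(\sigma\):
heights <- c(64, 66, 67, 68, 69, 70, 71, 72, 73, 75)  # hypothetical heights in inches
y_star  <- 72                       # the value we want to locate in the distribution
e_i     <- y_star - mean(heights)   # the error, in inches
z_star  <- e_i / sd(heights)        # the error, in standard deviations
z_star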
Domain and Range
We will take a little detour to define two terms: domain and range. The domain of any function refers to the set of possible values that you can put into a function. The range of a function refers to the set of possible values that you can get out of a function. Keep those definitions in mind because they will become helpful in a minute.
Probabilities
We want to calculate the probabilities of observing particular events. Remember that the law of total probability says that the probabilities of all possible outcomes must sum to one. The area underneath the curve of any probability density function (PDF), like the one in Fig. 1, equals one. Therefore, the probability density function satisfies the requirement of the law of total probability.
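If you want to check this numerically, one way (an illustration, not part of the original text) is to integrate the standard normal PDF, dnorm(), over the whole real line; the printed result is 1, up to a small numerical error:
integrate(dnorm, lower = -Inf, upper = Inf)   # area under the standard normal PDF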
Converting Between Z-Scores and Probabilities
With that out of the way, we can now turn to the R functions that provide values for the normal distribution. Before we do, let’s start by looking at the PDF of the Gaussian (a.k.a. normal) distribution, as shown in Fig. 1. The x-axis records values in z-scores and the y-axis records the probability density at each z-score. The graph shows that values near the mean attain the highest probability of being observed, with lower probabilities as the absolute value of the error increases.
Figure 1: A Gaussian (Normal) Probability Density Function
pnorm(): Converting Z-Scores to Probabilities
The function plotted in Fig. 1 represents the chances that you obtain a particular z-score value in your data. Generally, we care more about the cumulative distribution function (CDF), which represents the sum of all of the probabilities up to a given value. We use the pnorm() function to get the value of the latter function (the CDF).
We use pnorm() to calculate the cumulative probability associated with a particular z-score. As an example, let’s say that we want to figure out the probability associated with a z-score of -0.5. Fig. 2 graphically represents the problem that we want to solve.
Figure 2: Gaussian distribution with values less than z=-0.5 shaded
We know our z-score and we want to figure out the area of the shaded portion of the curve in Fig. 2, which represents the probability that a value sits between negative infinity and -0.5 standard deviations away from the mean. We therefore type pnorm(-0.5) to get that value:
pnorm(-0.5)
## [1] 0.3085375
The value means that the probability of observing an error less than or equal to -0.5 standard deviations away from the mean equals 0.31. Fig. 3 shows this graphically.
Figure 3: Gaussian distribution with response to pnorm(-0.5)
Since the Gaussian probability distribution can take any value from negative infinity to positive infinity, the domain of pnorm() equals any real number. The range, however, is limited to probabilities, so pnorm() returns a value from 0 to 1 (inclusive).
qnorm(): Converting Probabilities to Z-Scores
Now we come to qnorm(), which I consider the more difficult of the two functions. As the description from the help for these functions explains, the “q” refers to the “quantile function”:
Description

Density, distribution function, quantile function and random generation for the normal distribution with mean equal to mean and standard deviation equal to sd.
Quantiles are cut points that divide a distribution into pieces containing equal shares of the probability. The median is a quantile: it marks the point where the same number of values falls on each side. That means that we can think of quantiles as percentages of the distribution. The median, for example, sits at 50%. Since the area under the probability density function across all values of the domain equals one (by construction), we can treat those percentages of the distribution as probabilities. The median, therefore, represents the value where the cumulative probability equals 50%.
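To make the connection concrete, here is a small illustrative example (not from the original text): the median of the standard normal sits at a z-score of 0, and the quartiles sit roughly 0.67 standard deviations on either side of the mean. qnorm() returns these cut points directly:
qnorm(0.5)                  # the median: 0 standard deviations from the mean
qnorm(c(0.25, 0.5, 0.75))   # the quartiles: about -0.67, 0, and 0.67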
The upshot of all of this comes to the following: qnorm() converts probabilities to z-scores. Therefore, if we want to figure out the z-score associated with a particular probability, we use qnorm(). Fig. 4 below represents an example where we want to find the z-score associated with a 95% probability. We know that the area shaded in black equals 95% of the area under the curve. We want to find the value of the dashed line.
Figure 4: Gaussian distribution with 95% of area shaded
To find the z-score associated with a 95% probability (the dashed line), meaning the z-score below which 95% of all values fall, we would type:
qnorm(0.95)
## [1] 1.644854
Fig. 5 shows how the answer above corresponds to Fig. 4.
Figure 5: Gaussian distribution with response to qnorm(0.95)
You should note that qnorm() expects a probability. Since probabilities must take on values between 0 and 1 (inclusive), the domain of qnorm() must be a number between 0 and 1 (inclusive). If you try to enter a number greater than 1 or less than 0, R will tell you that the answer is not a number:
qnorm(2)
## Warning in qnorm(2): NaNs produced
## [1] NaN
The range of values that you get back from qnorm() is any real number.
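Because pnorm() and qnorm() are inverses of each other, feeding the output of one into the other returns the value you started with. That round trip is a handy way to check that you picked the right function:
qnorm(pnorm(-0.5))   # returns the z-score we started with: -0.5
pnorm(qnorm(0.95))   # returns the probability we started with: 0.95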
Summary
I hope that this document helps you decide whether to use pnorm() or qnorm() when you want to convert between z-scores and probabilities. Table 1 summarizes both functions and when you would want to use each one.
If you have a… | …and you want a… | …you use: | Code | Domain | Range
---|---|---|---|---|---
Z-score | Probability | pnorm() | pnorm(z-score) | \((-\infty,\infty)\) | \([0,1]\)
Probability | Z-score | qnorm() | qnorm(probability) | \([0,1]\) | \((-\infty,\infty)\)