Standard Deviation in Statistics

These are my notes on standard deviation in Statistics. 


Expressing the distance of a value from the mean in standard deviations standardizes that value. To standardize a value, we subtract the mean and then divide this difference by the standard deviation.

\[z = \frac{y - \bar{y}}{s}\]

 

Standardizing Values

The resulting values are called standardized values and are commonly denoted with the letter z, so we usually call them z-scores. Z-scores measure the distance of a value from the mean in standard deviations. A z-score of 2 says that a data value is 2 standard deviations above the mean. Data values below the mean have a negative z-score, so a z-score of -1.6 means that the data value is 1.6 standard deviations below the mean.
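
As a minimal sketch of the formula above, the following Python snippet computes z-scores for a small made-up data set. The function name `z_scores` and the data are assumptions for illustration only; `stdev` is the sample standard deviation from Python's standard library.

```python
from statistics import mean, stdev

def z_scores(values):
    """Return (y - mean) / standard deviation for every value."""
    y_bar = mean(values)
    s = stdev(values)  # sample standard deviation (n - 1 in the denominator)
    return [(y - y_bar) / s for y in values]

scores = [60, 70, 80]  # made-up data: mean 70, standard deviation 10
z = z_scores(scores)
print(z)  # the value below the mean gets a negative z-score
```

Here 60 lies one standard deviation below the mean, 70 sits at the mean, and 80 lies one standard deviation above it.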

 

There are two steps to finding a z-score. First, the data are shifted by subtracting the mean. Then, they are rescaled by dividing by the standard deviation. Adding or subtracting a constant to every data value adds or subtracts the same constant to measures of position, but leaves measures of spread unchanged.

 

When we multiply or divide all the data values by any positive constant, all measures of position, such as the median, mean, and percentiles, and all measures of spread, such as the range, IQR, and standard deviation, are multiplied or divided by that same constant.
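
The two rules above can be checked numerically. This is only a sketch on made-up values; the constants 10 and 2 are arbitrary choices.

```python
from statistics import mean, median, stdev

data = [3, 7, 8, 12, 20]  # made-up values for illustration

shifted = [y + 10 for y in data]  # add a constant to every value
scaled = [y * 2 for y in data]    # multiply every value by a constant

# Shifting moves measures of position but leaves spread alone.
print(mean(data), mean(shifted))      # mean moves by 10
print(median(data), median(shifted))  # median moves by 10
print(stdev(data), stdev(shifted))    # standard deviation is unchanged

# Scaling multiplies both position and spread by the constant.
print(mean(scaled), median(scaled), stdev(scaled))
```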

 

Shifting and Scaling Values

Standardizing data into z-scores is just shifting them by the mean and rescaling them by the standard deviation. Now we can see how standardizing affects the distribution. When we subtract the mean of the data from every data value, we shift the mean to zero. As we have seen, such a shift does not change the standard deviation. 

 

When we divide each of these shifted values by s, the standard deviation is divided by s as well. Since the standard deviation was s to start with, the new standard deviation becomes 1. Standardizing into z-scores does not change the shape of the distribution of a variable. It changes the center by making the mean 0, and it changes the spread by making the standard deviation 1.
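
The two steps can be traced separately in code. This is a sketch on made-up data, using the sample standard deviation; the variable names are illustrative.

```python
from statistics import mean, stdev

data = [3, 7, 8, 12, 20]  # made-up values for illustration

y_bar = mean(data)
s = stdev(data)  # sample standard deviation

# Step 1: shift by the mean. The mean becomes 0; the spread is unchanged.
shifted = [y - y_bar for y in data]

# Step 2: rescale by s. The standard deviation becomes 1.
z = [d / s for d in shifted]

print(mean(shifted), stdev(shifted))
print(mean(z), stdev(z))
```

The values keep their original order and relative spacing, which is why the shape of the distribution does not change.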

 

Normal Models

A z-score gives an indication of how unusual a value is because it tells how far it is from the mean. If the data value sits right at the mean, it’s not very far at all and its z-score is 0. A z-score of 1 tells us that the data value is 1 standard deviation above the mean, while a z-score of -1 tells us that the value is 1 standard deviation below the mean.

 

For many unimodal and symmetric distributions, about 68% of the values fall within one standard deviation of the mean, about 95% fall within two standard deviations of the mean, and about 99.7%, or almost all of the values, fall within three standard deviations of the mean.

 

In 1809, Gauss worked out the formula for the model that accounts for this observation. It is called the Normal or Gaussian model. It illustrates one of the most important uses of the standard deviation: the standard deviation is the statistician’s ruler. This model for unimodal, symmetric data gives us even more information because it tells us how likely it is to have z-scores between -1 and 1, between -2 and 2, and between -3 and 3.

 

These magic 68, 95, and 99.7 values come from the Normal model. As a model, it can give us corresponding values for any z-score.
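
As a sketch of where these numbers come from, Python's standard library class `statistics.NormalDist` can compute the share of a Normal model that lies within k standard deviations of the mean. The helper name `share_within` is an assumption for illustration.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def share_within(k):
    """Proportion of the model between z = -k and z = +k."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

within = {k: share_within(k) for k in (1, 2, 3)}
for k, share in within.items():
    print(f"within {k} SD: {share:.1%}")
```

The same function works for any k, not only 1, 2, and 3, which is what it means for the model to give values for any z-score.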

 

We write N(μ, σ) to represent a Normal model with mean μ and standard deviation σ. N always denotes a Normal model. The mu symbol, μ, is the Greek letter for m and always represents the mean in a model. The sigma character, σ, is the lowercase Greek letter for s and always represents the standard deviation in a model. This mean and standard deviation are not numerical summaries of data. They are characteristics of the model called parameters. Parameters are the values we choose that completely specify a model. We do not want to confuse the parameters with summaries of the data, so we use special symbols. In statistics, we almost always use Greek letters for parameters. Summaries of data, like the sample mean, median, or standard deviation, are called statistics and are usually written with Latin letters.

 

If we model data with a Normal model and standardize them using the corresponding mu and sigma, we still call the standardized value a z-score.

\[z = \frac{y - \mu}{\sigma}\]

Usually, it is easier to standardize data using the mean and standard deviation first. Then we only need the model with mean 0 and standard deviation 1. This Normal model is called the Standard Normal model.
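
The sketch below illustrates why the Standard Normal model is enough. The parameters (mean 100, standard deviation 15) and the value 130 are made-up assumptions; the point is only that the area below a value in N(mu, sigma) equals the area below its z-score in the Standard Normal model.

```python
from statistics import NormalDist

mu, sigma = 100, 15  # hypothetical model parameters
y = 130              # hypothetical data value

z = (y - mu) / sigma  # standardize with the model's parameters

area_original = NormalDist(mu, sigma).cdf(y)  # area below y in N(100, 15)
area_standard = NormalDist(0, 1).cdf(z)       # area below z in N(0, 1)

print(z, area_original, area_standard)
```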

 

Notice how well the 68-95-99.7 rule works when the distribution is unimodal and symmetric. Be careful, though: you should not use the Normal model for just any dataset. Standardizing will not change the shape of the distribution. If the distribution is not unimodal and symmetric to begin with, standardizing will not make it Normal.

 

All models make assumptions. Whenever we model, we will be careful to point out the assumptions that we are making. We will also check the associated conditions in the data to make sure that those assumptions are reasonable. So, do not model data with a Normal model without first checking whether the data are nearly Normal. For a Normal model to be appropriate, the shape of the data’s distribution should be unimodal and symmetric, with no obvious outliers.

 

It is important to sketch a Normal curve that actually looks Normal. The Normal curve is bell-shaped and symmetric around its mean. Start at the middle and sketch to the right and left from there. Even though the Normal model extends forever on either side, you need to draw it only for 3 standard deviations. After that, there is little left that is worth sketching. The place where the bell shape changes from curving downward to curving back up, called the inflection point, is exactly one standard deviation away from the mean.
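
The location of the inflection point can be checked numerically. This sketch estimates the curvature of the Standard Normal density with a finite difference; the helper name `curvature` and the step size are assumptions for illustration, not part of any standard method named in these notes.

```python
from statistics import NormalDist

pdf = NormalDist(0, 1).pdf  # height of the Standard Normal curve

def curvature(z, h=1e-3):
    """Finite-difference estimate of the second derivative of the curve at z."""
    return (pdf(z + h) - 2 * pdf(z) + pdf(z - h)) / h ** 2

inside = curvature(0.5)    # negative: the curve bends downward
at_one_sd = curvature(1.0) # approximately zero: the bend changes direction
outside = curvature(1.5)   # positive: the curve bends back up

print(inside, at_one_sd, outside)
```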

 

Normal Percentiles

When a value does not fall exactly one, two, or three standard deviations from the mean, we need to find the percentiles. Mathematically, the percentage of values falling between two z-scores is the area under the Normal model between those values. A Normal percentile is the percentage of values in a standard Normal distribution found at or below a given z-score.
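
As a sketch, a Normal percentile is the cumulative area at a z-score, and the area between two z-scores is the difference of two such areas. The z-scores -0.5 and 1.3 are arbitrary examples.

```python
from statistics import NormalDist

standard_normal = NormalDist(0, 1)

z_low, z_high = -0.5, 1.3  # arbitrary example z-scores

percentile_high = standard_normal.cdf(z_high)  # share at or below z = 1.3
percentile_low = standard_normal.cdf(z_low)    # share at or below z = -0.5
between = percentile_high - percentile_low     # share between the two z-scores

print(f"{percentile_high:.4f} {percentile_low:.4f} {between:.4f}")
```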

 

Finding areas from z-scores is the simplest way to work with the Normal model. But sometimes we start with areas and are asked to work backward to find the corresponding z-score or even the original data value. 
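
Working backward can be sketched with `NormalDist.inv_cdf`, which returns the z-score that has a given area below it. The 90% cutoff and the model parameters (mean 100, standard deviation 15) are made-up assumptions.

```python
from statistics import NormalDist

area_below = 0.90    # we want the value with 90% of the model below it
mu, sigma = 100, 15  # hypothetical model parameters

z = NormalDist(0, 1).inv_cdf(area_below)  # area -> z-score
y = mu + z * sigma                        # z-score -> original units

print(round(z, 2), round(y, 1))
```

The last step simply undoes the standardization: multiply the z-score by the standard deviation, then add the mean.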

 

Normal Probability Plots

We have assumed that the underlying data distribution was roughly unimodal and symmetric so that using a Normal model is reasonable. Drawing a histogram of the data and looking at the shape is one good way to see whether a Normal model might work. 

 

However, there is a more specialized graphical display that can help you decide whether a Normal model is appropriate: the Normal probability plot. If the distribution of the data is roughly Normal, the plot will be roughly a diagonal straight line. Deviations from a straight line indicate that the distribution is not Normal. This plot is usually able to show deviations from Normality more clearly than the corresponding histogram, but it is usually easier to understand how a distribution fails to be Normal by looking at its histogram.

 

A Normal probability plot takes each data value and plots it against the z-score you would expect that point to have if the distribution were perfectly Normal. When the values match up well, the line is straight. If one or two points are surprising from the Normal’s point of view, they do not line up. When the entire distribution is skewed or different from the Normal in some other way, the values do not match up very well at all and the plot bends. 

 

It turns out to be tricky to find the values we expect. They are called Normal scores, but you cannot easily look them up in tables. That is why probability plots are best made with technology and not by hand. The best advice on using probability plots is to see whether they are straight. If so, then your data look like data from a Normal model. If not, make a histogram to understand how they differ from the model. 
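
To make the idea concrete, here is a rough sketch of what such technology does. It assumes one simple approximation for the Normal scores, placing the i-th smallest of n values at the (i - 0.5)/n quantile of the Standard Normal model; statistics packages use various formulas, so treat this as an illustration only. The correlation used to summarize straightness is also an added convenience, not something these notes prescribe.

```python
from statistics import NormalDist, mean

def normal_scores(n):
    """Approximate expected z-scores for n sorted values (assumed formula)."""
    std = NormalDist(0, 1)
    return [std.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

def straightness(values):
    """Correlation between sorted values and their Normal scores.

    Values near 1 suggest the probability plot is close to a straight line.
    """
    ys = sorted(values)
    xs = normal_scores(len(ys))
    x_bar, y_bar = mean(xs), mean(ys)
    sxy = sum((x - x_bar) * (y - y_bar) for x, y in zip(xs, ys))
    sxx = sum((x - x_bar) ** 2 for x in xs)
    syy = sum((y - y_bar) ** 2 for y in ys)
    return sxy / (sxx * syy) ** 0.5

# Made-up data: one set built to follow the Normal scores, one strongly skewed.
normal_like = [50 + 10 * q for q in normal_scores(20)]
skewed = [2 ** k for k in range(20)]

print(straightness(normal_like), straightness(skewed))
```

Plotting the sorted data against `normal_scores(len(data))` gives the probability plot itself; the skewed data set bends away from a line, which shows up here as a noticeably lower correlation.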

 

Changing the spread and center of a variable is equivalent to changing the units. Indeed, the only part of the data’s context changed by standardizing is the units. All other aspects of the context do not depend on the choice or modification of measurement units. This fact points out an important distinction between the numbers the data provide for calculation and the meaning of the variables and the relationships among them. Standardizing can make the numbers easier to work with, but it does not alter the meaning.

 

Another way to look at this is to note that standardizing may change the center and spread values, but it does not affect the shape of a distribution. A histogram or boxplot of standardized values looks just the same as the histogram or boxplot of the original values except for the numbers on the axes. When we summarized shape, center, and spread for histograms, we compared them to unimodal, symmetric shapes. You could not ask for a nicer example than the Normal model. If the shape is like a Normal, we will use the mean and standard deviation to standardize the values.