Comparing Distributions in Statistics
These are my notes on comparing distributions in Statistics.
Displays For Comparing Groups
It is almost always more interesting to compare groups than to summarize data for a single group. There are several ways to summarize a variable. The median and quartiles are suitable even for data that may be skewed or have outliers, and they are usually reported together. Along with these three values, we can report the maximum and minimum. Together, these five values make up the 5-number summary of the data. It is a useful, concise summary because it gives a good idea of the center, spread, and range.
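As a minimal sketch, the 5-number summary can be computed with Python's standard library. The function name five_number_summary and the sample data are made up for illustration, and the "inclusive" quartile method is an assumption: textbooks and software use several quartile conventions that can give slightly different values.

```python
import statistics

def five_number_summary(values):
    """Return (min, Q1, median, Q3, max) for a list of at least two numbers."""
    data = sorted(values)
    # Assumption: the "inclusive" quartile method. Other conventions can
    # give slightly different quartiles for the same data.
    q1, median, q3 = statistics.quantiles(data, n=4, method="inclusive")
    return (data[0], q1, median, q3, data[-1])

# Hypothetical data, chosen only to keep the arithmetic easy to follow.
print(five_number_summary([7, 1, 9, 3, 5, 2, 8, 4, 6]))
# prints (1, 3.0, 5.0, 7.0, 9)
```

Here the IQR is 7.0 - 3.0 = 4.0 and the range is 9 - 1 = 8, so the summary gives center, spread, and range at a glance.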
A boxplot highlights several features of the distribution. The central box shows the middle half of the data, between the quartiles. The height of the box is equal to the IQR. If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If the median is not centered, the distribution is skewed. The whiskers show skewness as well if they are not roughly the same length.
Histograms or stem-and-leaf displays work well for a single distribution but not for comparing many groups, say 20, because it would be hard to see patterns across so many displays. By placing boxplots side by side, you can easily see which groups have higher medians, which have greater IQRs, where the central 50% of the data is located in each group, and which have the greater overall range.
Outliers
Outliers arise for many reasons. They may be the most important values in the dataset, or they may be errors. An outlier could be an exceptional case, or it could illuminate a pattern by being the exception to the rule. Many outliers are not wrong; they are just different. Most repay the effort to understand them. You can sometimes learn more from extraordinary cases than from summaries of the entire dataset.
There are two things you should never do with outliers. The first is to leave an outlier in place and proceed as if nothing happened; analyses of data with outliers are very likely to be wrong. The second is to omit an outlier from the analysis without comment. A histogram is often a better way than a boxplot to see more detail about how the outlier fits in or doesn’t fit at all.
Timeplots
A display of values against time is called a timeplot. Timeplots often show a great deal of point-to-point variation. We usually want to see past this variation to understand any underlying smooth trends. We also want to think about how the values vary around that trend, the timeplot version of center and spread.
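One simple way to see past point-to-point variation is a moving average. This is only one possible smoother, picked here for illustration; the notes do not prescribe a particular method. The function name moving_average, the window of 3, and the data are all assumptions.

```python
def moving_average(values, window=3):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

# Hypothetical series that wobbles up and down around a rising trend.
series = [3, 5, 4, 6, 8, 7, 9]
print(moving_average(series))
# prints [4.0, 5.0, 6.0, 7.0, 8.0]
```

The smoothed values rise steadily, which is the trend. The differences between the original values and the smoothed ones show how the data vary around that trend.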
Re-Expressing Data
When data are skewed, it can be hard to summarize them simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched-out tail. To re-express (transform) the data, we apply a simple function, such as a square root or logarithm, to each value so that the skewed distribution becomes more symmetric. Variables that are skewed to the right often benefit from a re-expression by square roots, logs, or reciprocals. Those skewed to the left may benefit from squaring the data. Re-expressing can also help when comparing groups that have very different spreads.
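A small sketch of the effect, using made-up right-skewed values and a base-10 logarithm. The helper mean_minus_median is a hypothetical, rough check for skewness: in a right-skewed distribution the long tail pulls the mean well above the median, while in a symmetric one the two are close.

```python
import math
import statistics

def mean_minus_median(values):
    """Rough skewness check: a large positive gap suggests a right skew."""
    return statistics.mean(values) - statistics.median(values)

# Hypothetical right-skewed data: each value is ten times the previous one.
raw = [1, 10, 100, 1000, 10000]
logged = [math.log10(x) for x in raw]  # re-express with base-10 logs

print(mean_minus_median(raw))     # large gap: the tail drags the mean up
print(mean_minus_median(logged))  # gap near zero: roughly symmetric
```

On the log scale the values are evenly spaced, so the mean and median essentially coincide.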
Choose the right tool for comparing distributions. Compare the distributions of two or three groups with histograms. Compare several groups with boxplots, which make it easy to compare centers and spreads and spot outliers, but hide much of the detail of distribution shape.
Treat outliers with attention and care. Outliers are nominated by the boxplot rule, but you must decide what to do with them. Track down the background for outliers; it may be informative.
Re-express data to make them easier to work with. Re-expression can make skewed distributions more nearly symmetric. Re-expression can make the spreads of different groups more nearly comparable.
Outlier
A value that is large or small compared to most of the other values in a variable. Whether it is an outlier is a judgement call that depends on the context. A boxplot displays values more than 1.5 IQRs beyond the nearest quartile as potential outliers, but that is a rule of thumb, not a definition of outlier that applies everywhere.
5-Number Summary
A summary of a variable’s distribution that consists of the extremes, the quartiles, and the median.
Boxplot
A display of a box between the quartiles and whiskers extending to the highest and lowest values not nominated as outliers.
Far Outlier
In a boxplot, a value more than 3 IQRs beyond the nearest quartile. Such values deserve special attention.
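The boxplot rules for outliers and far outliers can be sketched as follows. The function name nominate_outliers and the data are hypothetical, and the quartiles use the "inclusive" method as an assumption; another quartile convention could move the cutoffs slightly. The rule only nominates values; deciding what to do with them remains a judgement call.

```python
import statistics

def nominate_outliers(values):
    """Split values into (outliers, far_outliers) using the boxplot rule."""
    # Assumption: "inclusive" quartiles; other conventions shift the cutoffs.
    q1, _, q3 = statistics.quantiles(values, n=4, method="inclusive")
    iqr = q3 - q1

    def beyond(x):
        # Distance of x below Q1 or above Q3 (0 if x is inside the box).
        return max(q1 - x, x - q3, 0)

    # More than 1.5 IQRs but at most 3 IQRs beyond the nearest quartile.
    outliers = [x for x in values if 1.5 * iqr < beyond(x) <= 3 * iqr]
    # More than 3 IQRs beyond the nearest quartile.
    far_outliers = [x for x in values if beyond(x) > 3 * iqr]
    return outliers, far_outliers

# Hypothetical data: Q1 = 3.5, Q3 = 8.5, so the IQR is 5.0.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9, 18, 30]
print(nominate_outliers(data))
# prints ([18], [30])
```

With an IQR of 5.0, the cutoffs above Q3 are 8.5 + 7.5 = 16 and 8.5 + 15 = 23.5, so 18 is nominated as an outlier and 30 as a far outlier.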
Timeplot
A timeplot displays data that change over time. Often, successive values are connected with lines to show trends more clearly. Sometimes, a smooth curve is added to the plot to help show long-term patterns and trends.
Re-Express
This is another name for transform. The structure of data may be improved by working with a simple function of the data. The logarithm, square root, and reciprocal are the most common re-expression functions.