Displaying and Describing Data in Statistics

These are my notes on displaying and describing data in Statistics.

A symmetric distribution has roughly the same shape reflected around the center. A skewed distribution extends farther on one side than on the other. A unimodal distribution has a single major hump. A bimodal distribution has two humps. Multimodal distributions have more than two humps. Outliers are values that lie far from the rest of the data.

 

The mean is the sum of the values divided by the count. The median is the middle value: half the values are above it and half are below. The mean and median can differ because of outliers or skewness, both of which pull the mean toward the extreme values while leaving the median largely unchanged. For a roughly symmetric distribution with no outliers, the mean and median should be almost the same.
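A small sketch of how a single outlier moves the mean much more than the median. The salary figures are made up for illustration; `mean` and `median` come from Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical salaries in thousands (illustrative values, not real data).
salaries = [40, 42, 45, 47, 50]

# The same data with one extreme value added.
with_outlier = salaries + [400]

# Without the outlier the two summaries are close together.
print("no outlier:   mean =", mean(salaries), " median =", median(salaries))

# With the outlier the mean jumps, while the median barely moves.
print("with outlier: mean =", mean(with_outlier), " median =", median(with_outlier))
```

This is why the median is usually the safer summary of center when the data are skewed or contain outliers.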

 

The standard deviation is roughly the square root of the average squared difference between each data value and the mean. It is the summary of choice for the spread of unimodal, symmetric variables. The IQR is the difference between the third and first quartiles. It is the preferred summary of spread for skewed distributions or data with outliers. 

 

Area Principle

In a statistical display, each data value should be represented by the same amount of area.

 

Frequency Table

A frequency table lists the categories in a categorical variable and gives the count of observations for each category.

 

Distribution

The distribution of a categorical variable gives the possible values of the variable and the relative frequency of each value.
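As a rough illustration, a frequency table and the matching relative frequencies can be built with `collections.Counter`. The survey responses below are hypothetical, and the names `responses`, `counts`, and `relative` are chosen only for this sketch.

```python
from collections import Counter

# Hypothetical survey responses for one categorical variable.
responses = ["car", "bus", "car", "bike", "car", "bus", "walk", "car"]

# Frequency table: category -> count of observations.
counts = Counter(responses)

# Distribution: category -> relative frequency (proportion of all observations).
total = sum(counts.values())
relative = {category: count / total for category, count in counts.items()}

# Print the categories from most to least common.
for category, count in counts.most_common():
    print(f"{category:<5} {count:>2}  {relative[category]:.3f}")
```

The relative frequencies always add up to 1, which is what lets a pie chart show them as parts of a whole.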

 

Bar Chart

Bar charts show a bar whose area represents the count of observations for each category of a categorical variable.

 

Pie Chart

Pie charts show how a whole is divided into categories. The area of each wedge of the circle corresponds to the proportion in each category.

 

Histogram

A histogram uses adjacent bars to show the distribution of a quantitative variable. The height of each bar represents the frequency of values falling in its bin.
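A minimal sketch of the counting behind a histogram, assuming equal-width bins that include their lower edge and exclude their upper edge. The function `histogram_counts` and the ages are hypothetical, made up for this example.

```python
def histogram_counts(values, start, width, n_bins):
    """Count how many values fall in each of n_bins equal-width bins.

    Bin k covers the interval [start + k*width, start + (k+1)*width).
    Values outside the covered range are ignored.
    """
    counts = [0] * n_bins
    for v in values:
        k = int((v - start) // width)  # index of the bin that contains v
        if 0 <= k < n_bins:
            counts[k] += 1
    return counts


# Hypothetical ages, binned by decade starting at 20.
ages = [21, 22, 25, 31, 34, 35, 38, 41, 47, 58]
counts = histogram_counts(ages, start=20, width=10, n_bins=4)

# A crude text histogram: one '#' per observation in the bin.
for k, count in enumerate(counts):
    low = 20 + 10 * k
    print(f"{low}-{low + 9}: {'#' * count}")
```

Changing `width` changes the picture, so it is worth trying a few bin widths before describing the shape.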

 

Gap

A region of the distribution where there are no values.

 

Stem and Leaf Display

A display that shows quantitative data values in a way that sketches the distribution of the data.
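One way to sketch a stem and leaf display in code, assuming non-negative whole numbers with the tens digit as the stem and the ones digit as the leaf. The function name `stem_and_leaf` and the scores are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Return {stem: [leaves]} for non-negative integers, using tens as stems."""
    display = defaultdict(list)
    for v in sorted(values):
        stem, leaf = divmod(v, 10)  # e.g. 83 -> stem 8, leaf 3
        display[stem].append(leaf)
    return dict(display)


# Hypothetical test scores.
scores = [62, 75, 78, 81, 83, 83, 90, 94]
display = stem_and_leaf(scores)

# Print every stem in the range, including empty ones, so gaps stay visible.
for stem in range(min(display), max(display) + 1):
    leaves = "".join(str(leaf) for leaf in display.get(stem, []))
    print(f"{stem} | {leaves}")
```

Unlike a histogram, the display keeps the individual data values readable.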

 

Dotplot

A dotplot graphs a dot for each case along a single axis.

 

Density Plot

A density plot shows the shape of a variable’s distribution by smoothing out its histogram to make a gentle curve.

 

Shape

To describe the shape of a distribution, look for single versus multiple modes, symmetry versus skewness, and any outliers or gaps.

 

Mode

A hump or local high point in the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed.

 

Uniform

A distribution that does not appear to have any mode and in which all the bars of its histogram are approximately the same height.

 

Symmetric

A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other.

 

Tails

The parts of a distribution that trail off on either side. Distributions can be characterized as having long tails or short tails.

 

Skewed

A distribution is skewed if it’s not symmetric and one tail stretches out farther than the other. Distributions are said to be skewed left when the longer tail stretches to the left, and skewed right when it goes to the right.

 

Outlier

Outliers are extreme values that don’t appear to belong with the rest of the data. They may be unusual values that deserve further investigation or they may just be mistakes.

 

Center

The place in the distribution of a variable that you would point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number. Measures of the center include the mean and median.

 

Median

The median is the middle value, with half the data above and half below it. If n is even, it is the average of the two middle values. It is usually paired with the IQR.
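The odd and even cases can be sketched by hand as follows. The helper name `median_of` is made up for this example, and it assumes a non-empty list of numbers.

```python
def median_of(values):
    """Return the median of a non-empty list of numbers."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the average of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2


print(median_of([3, 1, 2]))     # odd count
print(median_of([4, 1, 3, 2]))  # even count
```

Python's standard `statistics.median` function follows the same rule.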

 

Mean

The mean is found by adding up all the data values and dividing by the count.

 

Spread

A numerical summary of how tightly the values are clustered around the center. Measures of spread include the IQR and standard deviation.

 

Range

The difference between the highest and lowest values in a data set.

 

Quartile

The lower quartile Q1 is the value with a quarter of the data below it. The upper quartile Q3 has three quarters of the data below it. The median and quartiles divide the data into four parts with approximately equal numbers of data values.

 

Percentile

The ith percentile is the number that falls above i% of the data.
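A hedged sketch using `statistics.quantiles` (available in Python 3.8 and later). Software packages use slightly different interpolation rules for percentiles, so other tools may give slightly different answers; this example relies on the function's default method. The helper name `percentile` and the data are assumptions made for illustration.

```python
from statistics import quantiles


def percentile(values, i):
    """Return the i-th percentile (i from 1 to 99) of at least two values."""
    if not 1 <= i <= 99:
        raise ValueError("i must be between 1 and 99")
    # n=100 yields the 99 cut points that split the data into 100 groups.
    cut_points = quantiles(values, n=100)
    return cut_points[i - 1]


# Hypothetical data: the whole numbers 1 through 99.
scores = list(range(1, 100))

print(percentile(scores, 90))  # value that falls above about 90% of the data
print(percentile(scores, 50))  # the 50th percentile is the median
```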

 

IQR - Interquartile Range

The IQR is the difference between the third and first quartiles, Q3 - Q1. It is usually reported along with the median.
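To illustrate why the IQR suits data with outliers, this sketch compares it with the range on two made-up data sets. It uses `statistics.quantiles` (Python 3.8 and later) with its default quartile rule, which is one of several conventions, so textbook or Excel quartiles may differ slightly. The helper name `iqr` is hypothetical.

```python
from statistics import quantiles


def iqr(values):
    """Return Q3 - Q1 using the default quartile rule of statistics.quantiles."""
    q1, _median, q3 = quantiles(values, n=4)
    return q3 - q1


# Hypothetical data, then the same data with the largest value made extreme.
data = [1, 2, 3, 4, 5, 6, 7]
with_outlier = [1, 2, 3, 4, 5, 6, 70]

# The range reacts strongly to the outlier; the IQR does not change here.
print("IQR:", iqr(data), " range:", max(data) - min(data))
print("IQR:", iqr(with_outlier), " range:", max(with_outlier) - min(with_outlier))
```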

 

Least Squares Property

A statistic has the least squares property if the sum of the squared deviations of the data values from that statistic is as small as it could be for any single summary value. The mean has this property.
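A quick numerical check of this property for the mean, using made-up data. The helper `sum_squared_deviations` is a name invented for this sketch; it compares the mean against a few other candidate summary values rather than proving the result in general.

```python
from statistics import mean, median


def sum_squared_deviations(values, center):
    """Sum of squared differences between each value and a candidate center."""
    return sum((v - center) ** 2 for v in values)


data = [1, 2, 3, 6]  # hypothetical values
m = mean(data)

# Sum of squared deviations around the mean.
at_mean = sum_squared_deviations(data, m)
print("mean", m, "->", at_mean)

# Any other candidate summary gives a larger sum.
for candidate in (median(data), m - 1, m + 1):
    print("candidate", candidate, "->", sum_squared_deviations(data, candidate))
```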

 

Residuals

A residual is the difference between an observed data value and some summary or model for that value.

 

Variance

The variance is the sum of squared deviations from the mean, divided by the count minus 1.

 

Standard Deviation

The standard deviation is the square root of the variance.
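The two definitions above can be followed step by step and then checked against Python's `statistics` module, whose `variance` and `stdev` functions also divide by the count minus 1. The data values are hypothetical, chosen so the arithmetic comes out evenly.

```python
import math
from statistics import mean, stdev, variance

data = [2, 4, 4, 5, 10]  # hypothetical values

# Step 1: deviations from the mean, squared.
m = mean(data)
squared_deviations = [(x - m) ** 2 for x in data]

# Step 2: variance = sum of squared deviations / (count - 1).
var = sum(squared_deviations) / (len(data) - 1)

# Step 3: standard deviation = square root of the variance.
sd = math.sqrt(var)

print("by hand:   ", var, sd)
print("statistics:", variance(data), stdev(data))
```

Because the standard deviation is in the same units as the data, it is usually easier to interpret than the variance.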

 

Bar Chart In Excel

First make a pivot table, which is Excel’s name for a frequency table. From the Data menu, choose Pivot Table and Pivot Chart Report. When you reach the layout window, drag your variable to the row area and drag the same variable again to the data area. This tells Excel to count the occurrences of each category. Once you have an Excel pivot table, you can construct bar charts and pie charts.

 

Compute Average in Excel

To chart the pivot table, click inside it and click the Pivot table chart wizard button. Excel creates a bar chart. To compute the mean, click on an empty cell. Go to the Formulas tab in the ribbon. Click the drop-down arrow next to AutoSum and choose Average. Enter the data range in the formula displayed in the empty cell you selected earlier. Press Enter, and Excel computes the mean of the values in that range.

 

Compute Standard Deviation in Excel

To compute the standard deviation, click on an empty cell. Go to the Formulas tab in the ribbon, click the drop-down arrow next to AutoSum, and select More Functions. In the dialog box that opens, select STDEV from the list of functions and click OK. A new dialog box opens. Enter a range of cells into the text fields and click OK. Excel computes the standard deviation of the values in that range and places it in the cell you selected.