Univariate Data in Statistics

These are my notes on univariate data in Statistics.

The population is the entire group of individuals or things that we are interested in. The sample is the part of the population that is actively studied. 

 

When examining a graphical summary of a distribution of univariate data, describe its center, spread, and shape. In addition, note any clusters, gaps, and outliers in the data. If possible, try to explain such features. Be sure to write your descriptions within the context of the problem.

 

Continuous Variables

Numerical, tabular, and graphical methods complement one another. Numerical methods are precise and can be used in a variety of ways for statistical inference. Graphical methods allow us to view a large amount of data and many relationships at once. Tabular methods allow us to find precise values but are not as good for grasping relationships among variables. The three types of numerical measures are measures of central tendency, measures of variation, and measures of position.

 

The Greek capital letter sigma, \(\Sigma\), is used to indicate the sum of a set of measurements.

 

Measures of Central Tendency

Measures of central tendency determine the central point of a data set or the point around which all the measurements are scattered. The two main measures of central tendency are the mean and the median.

 

The arithmetic mean, or average, is the most commonly used measure of the center of a set of data. The mean can be described as a data set’s center of gravity, the point at which the whole group of data balances. Unlike the median, the mean is affected by extreme or outlier measurements. One very large or very small measurement can pull the mean up or down. We say that the mean is not resistant to changes caused by outliers.

 

The population mean is denoted by the Greek letter \(\mu\). To compute it, add up all of the values in the entire population and divide by the number of values. The sample mean is generally denoted by an English letter with a bar on top, such as \(\bar{x}\). It is computed the same way as the population mean, using only the values in the sample.
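As a minimal sketch of the computation just described, the Python snippet below finds a mean by hand and with the standard library's `statistics.mean`. The data values are hypothetical and chosen only for illustration. The second list swaps one value for an extreme one to show how a single outlier pulls the mean.

```python
from statistics import mean

# Hypothetical measurements, not taken from the notes.
data = [2, 4, 4, 5, 10]

# Add up all of the values and divide by the number of values.
manual_mean = sum(data) / len(data)

# The standard library performs the same computation.
library_mean = mean(data)

# Replace the largest value with an extreme one: the mean moves a lot.
with_outlier = [2, 4, 4, 5, 100]
outlier_mean = mean(with_outlier)

print(manual_mean, library_mean, outlier_mean)
```

Here four of the five values are unchanged, yet the mean rises from 5 to 23, which is what "not resistant" means in practice.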

 

Median

The median is another commonly used measure of central tendency. The median is the point that divides the measurements in half: half the values are at or below the median and half are at or above it. Unlike the mean, the median is resistant to changes caused by outliers. Therefore, for skewed data sets, it is better to use the median rather than the mean to measure the center of the data.

 

Note that if the data set contains an odd number of measurements, the median is the middle value of the sorted data. If there is an even number of measurements, the median is the mean of the middle two measurements.
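The odd and even cases can be sketched in Python as follows. The helper name `manual_median` and the data are illustrative assumptions. The standard library's `statistics.median` is used as a cross-check.

```python
from statistics import median

def manual_median(values):
    """Median by the odd/even rule: sort, then take the middle."""
    ordered = sorted(values)
    n = len(ordered)
    mid = n // 2
    if n % 2 == 1:
        # Odd count: the single middle value.
        return ordered[mid]
    # Even count: the mean of the two middle values.
    return (ordered[mid - 1] + ordered[mid]) / 2

odd_data = [7, 1, 3]       # sorted: 1, 3, 7
even_data = [7, 1, 3, 5]   # sorted: 1, 3, 5, 7

odd_median = manual_median(odd_data)
even_median = manual_median(even_data)

# An extreme value barely matters: the middle of the sorted list is unchanged.
resistant_median = median([2, 4, 4, 5, 100])

print(odd_median, even_median, resistant_median)
```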

 

If the data set is symmetric, use the mean; if it is skewed, use the median.

 

Measures of Variation

Measures of variation summarize the spread of a data set. They describe how measurements differ from each other and from their mean. The three most commonly used measures of variation are range, interquartile range, and standard deviation.

 

The range is the difference between the largest and the smallest measurements in a data set. It is the simplest of the measures of spread. It is very easy to compute and understand, but it is not a reliable measure because it depends only on the two extreme measurements and does not take into account the values of the remaining measurements.
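A short Python sketch makes the weakness concrete. The two hypothetical data sets below share the same smallest and largest values, so they have the same range even though their interior values are spread very differently. The function name `data_range` is an illustrative choice.

```python
def data_range(values):
    """Largest measurement minus smallest measurement."""
    return max(values) - min(values)

# Hypothetical data: same extremes, very different interiors.
spread_out = [2, 4, 6, 8, 10]
bunched = [2, 2, 2, 2, 10]

range_spread_out = data_range(spread_out)
range_bunched = data_range(bunched)

print(range_spread_out, range_bunched)
```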

 

The interquartile range is the range of the middle 50% of the data, that is, the difference between the third quartile and the first quartile. The interquartile range is resistant to outliers. If you choose to measure the center using the median, you should use the interquartile range to measure the spread.
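The following sketch computes an interquartile range with Python's `statistics.quantiles`. Textbooks and software use several slightly different quartile conventions. The `method="inclusive"` setting used here is one such convention, chosen as an assumption for this example, and other conventions can give somewhat different quartiles on small data sets. The data are hypothetical.

```python
from statistics import quantiles

def iqr(values):
    """Third quartile minus first quartile (inclusive convention)."""
    q1, _q2, q3 = quantiles(values, n=4, method="inclusive")
    return q3 - q1

# Hypothetical data, with and without an extreme largest value.
data = [1, 2, 3, 4, 5, 6, 7, 8, 9]
with_outlier = [1, 2, 3, 4, 5, 6, 7, 8, 90]

iqr_plain = iqr(data)
iqr_outlier = iqr(with_outlier)

print(iqr_plain, iqr_outlier)
```

Changing the largest value from 9 to 90 leaves the middle half of the data untouched, so both calls return the same interquartile range.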

 

Standard deviation is often a more useful measure of variation than range is. Unlike range, standard deviation takes every measurement into account. However, like range, the standard deviation is affected by outliers. When there are outliers, the interquartile range may be a more useful measurement, similar to how the median is more useful than the mean when outliers are present. The square of the standard deviation is known as the variance.

 

The lowercase Greek letter \(\sigma\) is used to denote a population standard deviation, so \(\sigma^2\) denotes a population variance. To compute \(\sigma\), we square the difference between each point and the mean, add those squares, divide by the number of points, and take the square root. The letter \(s\) is used to denote a sample standard deviation, so \(s^2\) denotes a sample variance. The sample version is computed the same way, except that the sum of squares is conventionally divided by one less than the number of points.
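The steps for the population standard deviation can be sketched in Python as below, with the standard library's `statistics` functions as a cross-check. The data are hypothetical. Note one assumption worth stating: `statistics.stdev` and `statistics.variance` are the sample versions, which divide by one less than the number of points, while `pstdev` and `pvariance` divide by the number of points.

```python
import math
from statistics import mean, pstdev, pvariance, stdev, variance

# Hypothetical measurements; their mean is 5.
data = [2, 4, 4, 4, 5, 5, 7, 9]
n = len(data)
mu = mean(data)

# Square each deviation from the mean and add the squares.
sum_of_squares = sum((x - mu) ** 2 for x in data)

# Population: divide by n, then take the square root.
population_variance = sum_of_squares / n
population_sd = math.sqrt(population_variance)

# Sample: divide by n - 1 instead.
sample_variance = sum_of_squares / (n - 1)
sample_sd = math.sqrt(sample_variance)

print(population_variance, population_sd)
print(sample_variance, sample_sd)
```

For the same data, the sample standard deviation is always a little larger than the population standard deviation, because the same sum of squares is divided by a smaller number.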

 

Note that standard deviation is measured in the same units as the data values, whereas variance is measured in squared units of the data values.

 

Standard deviation can be used as a unit for measuring the distance between any measurement and the mean of the data set. For example, a measurement can be described as being so many standard deviations above or below the mean.

 

A standard deviation of 0 indicates that all of the measurements are identical. The standard deviation is the non-negative square root of the variance. Because variance is built from squared quantities, it is never negative. A larger standard deviation indicates a larger spread among the measurements: the larger the standard deviation, the wider the graph of the distribution.

 

Measures of Position

These measures are used to describe the position of a value with respect to the rest of the values of the data set. Quartiles, percentiles, and standardized scores (z-scores) are the most commonly used measures of position. To compute quartiles and percentiles, the data must be sorted by value; this is not necessary for z-scores.

 

Percentiles divide a set of values into 100 equal parts. A 95th percentile means that 95% of the values are at or below this point. To find the percentile of a particular point, arrange all \(n\) values in order. If the point is in the \(i\)th position when counted from the lowest measurement, its percentile is approximately \(\frac{i}{n} \times 100\).
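As a rough sketch of the "at or below" definition, the Python function below reports what percentage of the values are at or below a given point. The name `percentile_rank` and the scores are hypothetical, and this is only one simple convention. Statistical software often interpolates between values and may give slightly different answers.

```python
def percentile_rank(values, x):
    """Percentage of the values that are at or below x."""
    at_or_below = sum(1 for v in values if v <= x)
    return 100 * at_or_below / len(values)

# Hypothetical test scores.
scores = [55, 60, 65, 70, 75, 80, 85, 90, 95, 100]

# 9 of the 10 scores are at or below 95.
rank_of_95 = percentile_rank(scores, 95)

print(rank_of_95)
```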

 

Quartiles divide a set of values into four equal parts by using the 25th, 50th, and 75th percentiles. Q1 is the 25th percentile: 25% of values are below it and 75% are above. Q2 is the 50th percentile, which is the median: 50% of values are below it and 50% are above. Q3 is the 75th percentile: 75% of values are below it and 25% are above.
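The sketch below checks these fractions on a small hypothetical data set using Python's `statistics.quantiles`. The `method="inclusive"` convention is an assumption made for this example. Other quartile conventions exist and can place Q1 and Q3 at slightly different values.

```python
from statistics import median, quantiles

# Hypothetical data with eight values.
data = [10, 20, 30, 40, 50, 60, 70, 80]

# n=4 asks for the three cut points that split the data into quarters.
q1, q2, q3 = quantiles(data, n=4, method="inclusive")

# Fraction of the values that fall below each quartile.
below_q1 = sum(1 for v in data if v < q1) / len(data)
below_q2 = sum(1 for v in data if v < q2) / len(data)
below_q3 = sum(1 for v in data if v < q3) / len(data)

print(q1, q2, q3)
print(below_q1, below_q2, below_q3)
```

In this example Q2 coincides with the value returned by `median(data)`, and the fractions below Q1, Q2, and Q3 come out to one quarter, one half, and three quarters.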

 

Standardized scores or z-scores are independent of the units in which the data values are measured. Therefore, they are useful when comparing observations measured on different scales. They are computed as:

\[z = \frac{\text{measurement} - \text{mean}}{\text{standard deviation}}\]

A z-score gives the distance between a measurement and the mean in terms of the number of standard deviations. A negative z-score indicates that the measurement is smaller than the mean. A positive z-score indicates that the measurement is larger than the mean.
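A minimal Python sketch of the z-score formula follows. The data are hypothetical, and the choice of the population standard deviation (`statistics.pstdev`) is an assumption for this example. With sample data, the sample standard deviation (`statistics.stdev`) would be used instead.

```python
from statistics import mean, pstdev

def z_score(x, values):
    """Number of standard deviations that x lies from the mean of values."""
    return (x - mean(values)) / pstdev(values)

# Hypothetical measurements with mean 5 and population standard deviation 2.
data = [2, 4, 4, 4, 5, 5, 7, 9]

z_high = z_score(9, data)   # above the mean, so positive
z_low = z_score(2, data)    # below the mean, so negative
z_mean = z_score(5, data)   # exactly at the mean, so zero

print(z_high, z_low, z_mean)
```

The value 9 lies two standard deviations above the mean, and the value 2 lies one and a half standard deviations below it.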