Exploring Data in Statistics

These are my notes on exploring data in statistics.


We use data to make decisions. We make estimations and develop guidelines using data. Therefore, data and its analysis are important.


Businesses collect data on their users. Scientists collect data on their experiments. Doctors collect data on their patients. Police collect data on criminals. Lots of people, organizations, and businesses collect data now. In fact, almost all of them do.


Most data that is collected is not immediately useful. It has to be organized and put in the proper format. It also needs to be summarized so decision makers can easily choose what is best for their organization. The techniques used for this kind of analysis are called descriptive methods. They are useful for presentation, data reduction, and summarization of data.



There are two types of variables, categorical and numerical. A variable is categorical if it places the individuals being studied into one of several groups or categories. A variable is numerical if its outcomes are quantitative and can be analyzed using arithmetic. Numerical variables can be either discrete or continuous. Different methods of analysis must be used for categorical and numerical variables. 

If we take only one measurement on each object, we get univariate data. With two measurements on each object, we get bivariate data. 


Types of Descriptive Methods

There are different descriptive methods depending on the type of data that is collected. These are tabular, graphical, and numerical methods.

Different descriptive methods will answer different questions about data. 



Collected data needs to be rearranged before analysis. One tabular method is the frequency distribution table. The letter \(n\) is used to denote the number of observations in a data set. The frequency of a value is the number of times that value occurs among the observations. Frequency is usually denoted by the letter \(f\). The relative frequency of a value is the ratio of its frequency to the total number of observations. It is usually denoted by \(rf\) and equals \(\frac{f}{n}\). The cumulative frequency gives the number of observations less than or equal to a specified value and is denoted by \(cf\). A frequency distribution table is a table giving all possible values of a variable and their frequencies.
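Here is a minimal sketch of building a frequency distribution table in Python. The function name `frequency_table` and the sibling counts are made up for illustration. Each row holds a value with its \(f\), \(rf\), and \(cf\) as defined above.

```python
from collections import Counter

def frequency_table(observations):
    """Return rows of (value, f, rf, cf), sorted by value."""
    n = len(observations)           # number of observations
    counts = Counter(observations)  # frequency f of each distinct value
    rows = []
    cf = 0                          # running cumulative frequency
    for value in sorted(counts):
        f = counts[value]
        cf += f
        rows.append((value, f, f / n, cf))
    return rows

# Hypothetical data: number of siblings reported by 10 students.
siblings = [0, 1, 1, 2, 2, 2, 3, 1, 0, 2]

print(f"{'value':>5} {'f':>3} {'rf':>5} {'cf':>3}")
for value, f, rf, cf in frequency_table(siblings):
    print(f"{value:>5} {f:>3} {rf:>5.2f} {cf:>3}")
```

The last cumulative frequency always equals \(n\), and the relative frequencies add up to 1, which makes for a quick sanity check on any table.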



Tables are usually harder to read at a glance than charts, but they are still widely used. You should know how to interpret a table if you have to analyze one. For presentation, charts are usually the better tool.

Bar charts are used a lot. They can have either horizontal or vertical bars, and they are very commonly used to display categorical data.

Pie charts can also display amounts and frequencies of data. They are a popular graphical method but not usually the best choice. They are difficult to make and read. 


Segmented Bar Charts

It is important to see categorical data from different groups side by side in order to make comparisons. A segmented bar chart draws one bar for each group and arranges the bars along either the horizontal or vertical axis. Each bar is divided into segments, one per category, and the size of each segment shows the frequency or relative frequency of that category within the group. These charts can show frequency, with bars of various sizes, or relative frequency, where all bars are the same size regardless of group size. Segmented bar charts that show relative frequency can be somewhat misleading because bars of equal size hide differences in sample size between the groups.
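To see why sample size matters, here is a small sketch that computes the segment sizes for a relative frequency segmented bar chart. The function name `segment_shares` and the survey data are hypothetical.

```python
from collections import Counter

def segment_shares(groups):
    """Map each group to {category: relative frequency within that group}."""
    shares = {}
    for name, observations in groups.items():
        counts = Counter(observations)
        n = len(observations)
        shares[name] = {category: count / n for category, count in counts.items()}
    return shares

# Hypothetical survey of preferred drink in two groups of very different size.
survey = {
    "students": ["tea"] * 10 + ["coffee"] * 30,
    "staff": ["tea"] * 2 + ["coffee"] * 2,
}

for name, parts in segment_shares(survey).items():
    n = len(survey[name])
    print(name, f"(n={n})", {c: round(p, 2) for c, p in parts.items()})
```

Both bars would be drawn at the same height, even though one is based on 40 observations and the other on only 4. Printing the group size next to each bar is one way to keep that visible.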


Mosaic Plots

Mosaic plots are similar to segmented bar charts. They are just a different way to compare categorical data. In a mosaic plot, the width of each bar represents the size of that group's sample. Each header indicates a different group. The groups can be arranged along the x or y axis. The lengths of these bars along the axis represent the relative frequencies of these groups compared to each other.

Along the other axis, the bars of each group are the same length. Each section within a group bar represents the percentage of that group's observations that fall in a given category. The same categories should appear within each of the group bars. We can make comparisons about the size of each group based on the length of each group bar. We can also evaluate the proportions of the categorical variables within each group by comparing the relative sizes of each section.
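The two measurements that go into a mosaic plot can be sketched in a few lines. This only computes the layout and does not draw anything. The function name `mosaic_layout` and the data are hypothetical.

```python
from collections import Counter

def mosaic_layout(groups):
    """Return {group: (bar_width, {category: section_size})}.

    bar_width is the group's share of all observations, so widths sum to 1.
    section_size is the category's share within its own group.
    """
    total = sum(len(observations) for observations in groups.values())
    layout = {}
    for name, observations in groups.items():
        counts = Counter(observations)
        sections = {c: k / len(observations) for c, k in counts.items()}
        layout[name] = (len(observations) / total, sections)
    return layout

# Hypothetical data: pet ownership in two neighborhoods.
pets = {
    "north": ["dog"] * 30 + ["cat"] * 30,
    "south": ["dog"] * 15 + ["cat"] * 5,
}

for name, (width, sections) in mosaic_layout(pets).items():
    print(name, f"width={width:.2f}", sections)
```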


Graphical Methods For Numerical Data

To summarize and describe numerical data, dotplots and stemplots are used for small sets of data. For larger sets, histograms, cumulative frequency charts, and boxplots are often used.

We can describe the overall pattern of the distribution of a numerical variable using the following three features: center, spread, and shape.

The center of a distribution describes where the middle of the data lies. Common measures of central tendency include the mean, the median, and the mode. Each measure has different pros and cons depending on the type and shape of the data.
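As a quick illustration, Python's standard `statistics` module can compute all three measures. The wrapper `center_measures` and the waiting times are made up. The data includes one extreme value on purpose to show how differently the measures react.

```python
import statistics

def center_measures(data):
    """Return the mean, median, and mode of a list of numbers."""
    return {
        "mean": statistics.mean(data),
        "median": statistics.median(data),
        "mode": statistics.mode(data),
    }

# Hypothetical waiting times in minutes, with one extreme value (40).
wait_times = [2, 3, 3, 4, 5, 6, 40]

# The values sum to 63, so the mean is 63 / 7 = 9.
# The middle value of the sorted data is 4, and 3 occurs most often.
print(center_measures(wait_times))
```

The single extreme value pulls the mean well above most of the observations, while the median and the mode stay put. That is one reason the shape of the data matters when choosing a measure of center.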

The spread of a distribution tells us how widely the data is scattered around the center.

The shape of a distribution can be symmetric or skewed.

A distribution is called symmetric if its left half is approximately a mirror image of its right half. This means that the data is spread out in the same way on both sides and that there is the same amount of data on each side of the center.

A distribution is called skewed if extreme values in only one direction give one side a longer tail. It is right skewed if the longer tail is on the right and left skewed if the longer tail is on the left.


Patterns of Data

When looking at data, we should look for patterns and deviations. Common things to look for are clusters, gaps, and outliers. In a cluster, observations are grouped together tightly. A gap is a stretch of values where no observations fall, such as the space between two clusters. It is important to make these distinctions.

An outlier is an observation that is very different from the rest of the data. Outliers fall far away from the middle of the data set.


Graphical Methods for Continuous Variables

There are several ways to display continuous variables graphically. These include dotplots, stemplots, histograms, and cumulative frequency charts.



Dotplots are easy to make. They are nice for smaller data sets. However, if there is too much data the dotplot becomes too cluttered to read. To make a dotplot, draw a horizontal line to indicate the data range, scale the line to accommodate the entire range of data, and mark a dot for each observation in the appropriate place above the scaled line. If more than one observation has the same value, stack the dots one above the other.

Each dot on the plot indicates the location of the value of a data point. For any data point, we can look directly down at the scale to determine the value of the point. When looking at a dotplot, we can see how the data points are spread, what kind of shape the points make collectively, and where the approximate center of the distribution is. 
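The steps for making a dotplot can be sketched as a small text-based plot. This is only an illustration and assumes integer-valued observations. The function name `dotplot_lines` and the quiz scores are hypothetical.

```python
from collections import Counter

def dotplot_lines(observations):
    """Return the rows of a text dotplot for integer-valued observations."""
    counts = Counter(observations)
    scale = range(min(observations), max(observations) + 1)
    width = max(len(str(v)) for v in scale) + 1  # room for the widest label
    lines = []
    # Work from the tallest stack of dots down to the first row above the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(
            ("." if counts[v] >= level else " ").rjust(width) for v in scale
        )
        lines.append(row.rstrip())
    # The scaled horizontal axis goes underneath the dots.
    lines.append("".join(str(v).rjust(width) for v in scale))
    return lines

# Hypothetical quiz scores out of 10.
quiz_scores = [3, 5, 5, 6, 6, 6, 7, 9]
print("\n".join(dotplot_lines(quiz_scores)))
```

Repeated values stack vertically, so the tallest column marks the most frequent score, and empty columns show up as gaps.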



Stemplots are also used a lot. An advantage of the stemplot is that it shows every value. However, since that is the case, it is only useful for small data sets. 

To make a stemplot, separate each observation into two parts. The left part of the observation is called the stem and the right part is called the leaf. For example, the observation 47 can be split into a stem of 4 and a leaf of 7. Draw a vertical line on the left side of the page to separate the stems from the leaves. Write all possible stems in increasing order on the left of the line. For each observation, write its leaf on the right side of the vertical line next to the corresponding stem, keeping the leaves on each stem in increasing order.

The numbers on the left side of the vertical line are stems. The value of a data point is read by joining its stem and its leaf. The number of leaves on a stem indicates the frequency of that class. Each leaf indicates a single observation.

We use stemplots to see how the data is shaped and how it is spread. We also use them to see where the center of the data is.
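Here is a rough sketch of the stemplot steps in Python. It assumes non-negative whole numbers and uses the last digit as the leaf, which is only one of several possible ways to split the observations. The function name `stemplot_lines` and the ages are made up.

```python
from collections import defaultdict

def stemplot_lines(observations):
    """Return stemplot rows, using the last digit of each value as the leaf."""
    leaves = defaultdict(list)
    # Sorting first keeps the leaves on each stem in increasing order.
    for value in sorted(observations):
        stem, leaf = divmod(value, 10)
        leaves[stem].append(leaf)
    lines = []
    # List every possible stem, including stems that have no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = "".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(f"{stem:>2} | {row}".rstrip())
    return lines

# Hypothetical ages of 10 people.
ages = [23, 25, 31, 31, 38, 42, 47, 47, 49, 65]
print("\n".join(stemplot_lines(ages)))
```

Empty stems are kept on purpose so that gaps in the data stay visible in the plot.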



The histogram is one of the most popular ways of displaying numerical data. It resembles a stemplot turned on its side. Histograms are useful for showing patterns in large data sets. A histogram can be drawn using frequencies, relative frequencies, or percentages.

To make a histogram, create groups (classes) from the continuous data, draw the x axis and the y axis to scale so that they accommodate all of the groups and frequencies, draw bars with heights equal to the corresponding frequencies, and add a label for each group. Draw the bars next to each other without any gaps. There are no gaps between histogram bars because the data values are continuous and the values in one bar flow right into the next one.

Each bar represents a single group or class. There is only one bar for each class. The classes are placed on the x axis in numerically increasing order, just as on a number line. The height of a bar in a frequency histogram corresponds to the frequency of that class. Percentage or relative frequency histograms can be read similarly. In a relative frequency histogram, the height of the bar reflects the relative frequency corresponding to the class. In a percentage frequency histogram, the height of the bar reflects the percent frequency that corresponds to the class. 
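The grouping step is the part that usually takes the most care, so here is a minimal sketch of it. It assumes equal-width classes that include their lower boundary and exclude their upper boundary, which is a common convention but not the only one. The function name `histogram_counts` and the heights are hypothetical.

```python
def histogram_counts(observations, start, width, n_classes):
    """Count observations in classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * n_classes
    for x in observations:
        index = int((x - start) // width)
        # Observations outside the chosen classes are ignored in this sketch.
        if 0 <= index < n_classes:
            counts[index] += 1
    return counts

# Hypothetical heights in cm, grouped into classes of width 10 starting at 150.
heights = [151, 155, 158, 160, 162, 163, 165, 167, 168, 171, 174, 182]
counts = histogram_counts(heights, start=150, width=10, n_classes=4)

n = len(heights)
for i, f in enumerate(counts):
    low = 150 + 10 * i
    # Each row is one bar: its class, its frequency, and its relative frequency.
    print(f"[{low}, {low + 10})  {'#' * f:<6}  f={f}  rf={f / n:.2f}")
```

The same counts give a relative frequency or percentage histogram just by rescaling the bar heights. The shape of the picture does not change.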


Cumulative Frequency Charts

The cumulative frequency for any group is the frequency for that group plus the frequencies of all groups of smaller observations. 

To draw a cumulative frequency chart, draw the x and y axes, scale the x axis to accommodate the range of all groups, and mark the upper boundary of each group. Scale the y axis from 0 to n. Above the upper boundary of each group, place a dot at the height equal to the cumulative frequency for that group, then connect all the dots with straight lines.

From any point on the graph, we can draw a vertical line to read the x value from the x axis and a horizontal line to read the y value from the y axis. For right skewed distributions, the curve increases quickly in the beginning but then levels off in the later part. For left skewed distributions, the curve increases slowly in the beginning but then rises steeply later on. The cumulative frequency chart for a symmetric distribution is often described as s-shaped because it begins with a slow increase on the left, rises rapidly in the middle, and then tapers off to a slow increase again at the right.
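The points to plot on a cumulative frequency chart can be computed with a running total. This sketch only produces the coordinates of the dots and leaves the drawing to whatever plotting tool you like. The function name `cumulative_points` and the class frequencies are hypothetical.

```python
from itertools import accumulate

def cumulative_points(upper_boundaries, frequencies):
    """Pair each class's upper boundary with its cumulative frequency."""
    running_totals = list(accumulate(frequencies))
    return list(zip(upper_boundaries, running_totals))

# Hypothetical classes 150-160, 160-170, 170-180, 180-190 and their frequencies.
upper_boundaries = [160, 170, 180, 190]
frequencies = [3, 6, 2, 1]

for x, cf in cumulative_points(upper_boundaries, frequencies):
    print(f"x={x}  cf={cf}")
```

The last dot always sits at a height equal to the total number of observations, which is why the y axis only needs to run from 0 to that total.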



Visualizing the data is the very first step you should take when analyzing it. The summary statistics, inferential tests, and analyses that are appropriate depend on the shape of the distribution. The key point to remember is that there are different calculations for symmetric and skewed data. Knowing the shape of the distributions will help you get started.