Working with Tables and Graphs in Statistics
These are my notes and thoughts on working with tables and graphs in statistics.
When working with large data sets, a frequency distribution is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.
Frequency Distribution
A frequency distribution or table shows how data are partitioned among several categories by listing the categories along with the number of data values in each of them.
Lower class limits are the smallest numbers that can belong to each of the different classes. Upper class limits are the largest numbers that can belong to each of the different classes. Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits. Class midpoints are the values in the middle of the classes. Class width is the difference between two consecutive lower class limits in a frequency distribution.
Finding the correct class width can be tricky. For class width, don’t make the most common mistake of using the difference between a lower class limit and an upper class limit. For class boundaries, remember that they split the difference between the end of one class and the beginning of the next class.
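The definitions above can be checked with a short Python sketch. It uses the class limits from the age table in the exercises later in these notes; the variable names are only illustrative.

```python
# Class limits taken from the age table in the exercises later in these notes.
lower_limits = [15, 25, 35, 45, 55, 65, 75]
upper_limits = [24, 34, 44, 54, 64, 74, 84]

# Class width: difference between two consecutive lower class limits
# (not upper limit minus lower limit of the same class).
class_width = lower_limits[1] - lower_limits[0]

# Class midpoints: average of each class's lower and upper limits.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class boundaries: split the gap between one class's upper limit and the
# next class's lower limit, and extend by the same half-gap at both ends.
half_gap = (lower_limits[1] - upper_limits[0]) / 2
boundaries = [lo - half_gap for lo in lower_limits] + [upper_limits[-1] + half_gap]

print(class_width)
print(midpoints)
print(boundaries)
```

Note that the width comes out as 10, while the common mistake (24 - 15) would give 9.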
We construct frequency distributions to:
- Summarize large data sets
- See the distribution and identify outliers
- Have a basis for constructing graphs
Technology can generate frequency distributions, but these are the common steps for constructing one by hand:
- Select the number of classes, usually between 5 and 20
- Calculate class width: \(\text{class width} \approx \frac{\text{max data value} - \text{min data value}}{\text{number of classes}}\)
- Round this result to get a convenient number (it is usually best to round up)
- Choose the value for the first lower class limit by using either the min value or a convenient value below the minimum.
- Using the first lower class limit and the class width, list the other lower class limits.
- List the lower class limits in a vertical column and then determine and enter the upper class limits.
- Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to find the total frequency for each class.
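The steps above can be sketched in Python. This is a minimal sketch, not a definitive implementation: it uses the 20 BMI values from the exercise later in these notes, and the function name `frequency_distribution` and its parameters are hypothetical. With 5 classes, (44.9 - 17.7) / 5 = 5.44, which rounds up to a convenient width of 6, and 15 is a convenient first lower class limit below the minimum.

```python
import math

def frequency_distribution(values, first_lower_limit, class_width, num_classes):
    """Tally how many values fall in each class, starting at first_lower_limit."""
    counts = [0] * num_classes
    for v in values:
        # Index of the class whose lower limit is the largest one <= v.
        index = math.floor((v - first_lower_limit) / class_width)
        if 0 <= index < num_classes:
            counts[index] += 1
    return counts

bmi = [17.7, 33.5, 26.3, 25.9, 22.9,
       27.1, 21.9, 18.3, 27.7, 22.9,
       19.2, 22.3, 23.7, 37.7, 32.4,
       27.8, 44.9, 30.6, 28.7, 22.9]

counts = frequency_distribution(bmi, first_lower_limit=15, class_width=6, num_classes=5)
for i, count in enumerate(counts):
    lower = 15 + 6 * i
    # Upper class limit is 0.1 below the next lower limit for one-decimal data.
    print(f"{lower:.1f}-{lower + 5.9:.1f}  {count}")
```

Values outside the listed classes are silently skipped in this sketch, so it is worth confirming that the counts add up to the number of data values.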
Relative Frequency Distribution
A variation of the basic frequency distribution is a relative frequency distribution. Each class frequency is replaced by a relative frequency as a percentage.
\[ \text{relative frequency} = \frac{\text{frequency for class}}{\text{sum of frequencies}} \times 100\% \]
This will give you the frequency percentage.
The sum of the percentages in a relative frequency distribution will be very close to 100 percent.
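As a small illustration, the sketch below applies the formula to the grade frequencies from the exercise later in these notes (2, 13, 16, 7, and 1 for grades A through F). Rounding to two decimal places is a choice made here, not a requirement.

```python
grades = ["A", "B", "C", "D", "F"]
frequencies = [2, 13, 16, 7, 1]

total = sum(frequencies)
# Relative frequency of each class as a percentage, rounded to 2 decimals.
relative = [round(f / total * 100, 2) for f in frequencies]

for grade, pct in zip(grades, relative):
    print(f"{grade}: {pct}%")

# The rounded percentages should add up to roughly 100.
print(f"Sum: {sum(relative):.2f}%")
```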
Another variation of a frequency distribution is a cumulative frequency distribution in which the frequency for each class is the sum of the frequencies for that class and all previous classes.
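A running total is all that is needed for the cumulative version. The sketch below uses the class frequencies from the first cumulative frequency exercise later in these notes and the standard-library `itertools.accumulate`.

```python
from itertools import accumulate

upper_limits = [29, 39, 49, 59, 69, 79, 89]
frequencies = [25, 35, 11, 2, 4, 1, 1]

# Each cumulative frequency is the sum of this class and all previous classes.
cumulative = list(accumulate(frequencies))

for upper, running_total in zip(upper_limits, cumulative):
    # Integer class limits, so "less than upper + 1" covers the whole class.
    print(f"Less than {upper + 1}: {running_total}")
```

The last cumulative frequency always equals the total number of data values.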
At the beginning we noted that a frequency distribution can help us understand the distribution of a data set, which is the nature or shape of the spread of the data over the range of values. In statistics, we are often interested in determining whether the data have a normal distribution. Data that have an approximately normal distribution are characterized by a frequency distribution with the following features:
- The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
- The distribution is approximately symmetric. Frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.
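The two features above can be turned into a rough check. This is only a sketch of a strict reading of the criteria: the function names and the `tolerance` value are assumptions introduced here, not a standard test for normality.

```python
def rises_then_falls(frequencies):
    """True if frequencies never decrease before the peak or increase after it."""
    peak = frequencies.index(max(frequencies))
    rising = all(a <= b for a, b in zip(frequencies[:peak], frequencies[1:peak + 1]))
    falling = all(a >= b for a, b in zip(frequencies[peak:], frequencies[peak + 1:]))
    return rising and falling

def roughly_symmetric(frequencies, tolerance=0.25):
    """True if mirrored frequencies around the peak differ by at most
    tolerance * peak frequency (tolerance is an arbitrary, assumed cutoff)."""
    peak = frequencies.index(max(frequencies))
    left = frequencies[:peak][::-1]   # classes before the peak, nearest first
    right = frequencies[peak + 1:]    # classes after the peak, nearest first
    limit = tolerance * max(frequencies)
    # zip stops at the shorter side, so unmatched classes are ignored.
    return all(abs(a - b) <= limit for a, b in zip(left, right))

# The two temperature tables from the exercises later in these notes.
table_1 = [3, 0, 6, 13, 7, 6, 1]
table_2 = [1, 2, 5, 14, 5, 4, 1]

print(rises_then_falls(table_1), roughly_symmetric(table_1))
print(rises_then_falls(table_2), roughly_symmetric(table_2))
```

The first table fails the rise-then-fall check because the frequency drops from 3 to 0 before the peak, which matches the "not normal" answer in the exercise.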
The presence of gaps can suggest that the data are from two or more different populations.
Comparing two or more relative frequency distributions in one table makes comparisons of data much easier.
While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.
A histogram visually displays the shape of the distribution of the data. It shows the location of the center of the data. Histograms show the spread of data and can also identify outliers.
A histogram is basically a graph of a frequency distribution. Class frequencies should be used for the vertical scale and that scale should be labeled. There is no universal agreement on the procedure for selecting which values are used for the bar locations along the horizontal scale, but it is common to use class boundaries, class midpoints, class limits, or something else. It is often easier for us to use class midpoints for the horizontal scale. Histograms can usually be generated using technology.
A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies.
The ultimate objective of using histograms is to be able to understand characteristics of data. Exploring the data means to:
- Find the center of the data
- Find the variation
- Find the shape of the distribution
- Find any outliers
- Find the change of data over time
When a graph is said to be skewed to the right, it means the histogram shape has a tail on the right.
When a graph is said to be skewed to the left, it means the histogram shape has a tail on the left.
A bell-shaped distribution, called a normal distribution, has its highest frequencies in the middle.
A uniform distribution produces a histogram with bars of roughly the same height all the way across.
Many statistical methods require that sample data come from a population having a distribution that is approximately a normal distribution.
In a uniform distribution, the different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform.
A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right, called positively skewed, have a longer right tail.
Data skewed to the left, called negatively skewed, have a longer left tail.
Some really important methods have a requirement that sample data must be from a population having a normal distribution. Histograms can be helpful in determining whether the normality requirement is satisfied, but they are not very helpful with very small data sets.
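A normal quantile plot graphs each sample value against the z score expected for it under a standard normal distribution. The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line, and the points do not show some systematic pattern that is not a straight-line pattern.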
The population distribution is not normal if the normal quantile plot has either or both of these two conditions:
- The points do not lie reasonably close to a straight-line pattern
- The points show some systematic pattern that is not a straight-line pattern
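The points of a normal quantile plot can be computed with the standard library. In this sketch the sample is made up for illustration, the function name is hypothetical, and the cumulative area (2i - 1) / (2n) for the i-th sorted value is one common plotting convention assumed here; software packages may use slightly different positions.

```python
from statistics import NormalDist

def normal_quantile_points(sample):
    """Pair each sorted value with the z score expected at its position."""
    n = len(sample)
    standard_normal = NormalDist()  # mean 0, standard deviation 1
    points = []
    for i, x in enumerate(sorted(sample), start=1):
        cumulative_area = (2 * i - 1) / (2 * n)  # assumed plotting position
        points.append((x, standard_normal.inv_cdf(cumulative_area)))
    return points

# Made-up sample for illustration only (not from the notes).
sample = [98.6, 97.9, 98.2, 99.1, 98.4]
points = normal_quantile_points(sample)
for x, z in points:
    print(f"{x:5.1f}  {z:+.2f}")
```

Plotting these (x, z) pairs and judging whether they fall near a straight line is the visual check described above.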
Graphs that Enlighten
A dot plot is a graph of quantitative data in which each data value is plotted as a point above a horizontal scale of values. Dots representing equal values are stacked.
A dot plot:
- Displays the shape of the distribution of data
- Usually makes it possible to recreate the original list of data values
A stem plot is another type of graph and it represents quantitative data by separating each value into two parts: the stem and the leaf. Better stem plots are often obtained by first rounding the original data values. Also, stem plots can be expanded to include more rows and can be condensed to include fewer rows.
Stem plots:
- Show the shape of the distribution of data
- Retain the original data values
- Sort the sample data
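A stem plot is easy to build in code. The sketch below assumes nonnegative integer data with the last digit as the leaf, and uses the test scores from the exercise later in these notes; `stem_plot` is a hypothetical helper name.

```python
from collections import defaultdict

def stem_plot(values):
    """Group integer values by stem (all but the last digit) and leaf (last digit)."""
    rows = defaultdict(list)
    for v in sorted(values):          # sorting keeps the leaves in order
        rows[v // 10].append(v % 10)
    return dict(rows)

scores = [68, 72, 85, 75, 89, 89, 87, 90, 98, 100]
plot = stem_plot(scores)

# Print every stem in range so that empty rows (gaps) stay visible.
for stem in range(min(plot), max(plot) + 1):
    leaves = "".join(str(leaf) for leaf in plot.get(stem, []))
    print(f"{stem:>2} | {leaves}")
```

The row lengths play the same role as bar heights in a histogram, and every original score can be read back from its stem and leaf.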
A time-series graph is a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.
Time-series graphs:
- Reveal information about trends over time
Bar graphs use bars of equal width to show frequencies of categories of categorical data. The bars may or may not be separated by small gaps.
Bar graphs:
- Show the relative distribution of categorical data so that it is easier to compare the different categories.
A Pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right.
Pareto charts:
- Show the relative distribution of categorical data so that it is easier to compare the different categories.
- Draw attention to the more important categories.
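The only extra step a Pareto chart needs over a bar graph is sorting the categories by frequency. A minimal sketch, using the journal retraction counts from the exercise later in these notes (the short category labels are abbreviations chosen here):

```python
retractions = {
    "Error": 431,
    "Plagiarism": 205,
    "Fraud": 812,
    "Duplication": 306,
    "Other": 285,
}

# Pareto order: categories sorted by frequency, largest first.
pareto_order = sorted(retractions.items(), key=lambda item: item[1], reverse=True)

total = sum(retractions.values())
for category, count in pareto_order:
    print(f"{category:<12} {count:>4}  {count / total:.1%}")
```

Passing the sorted categories and counts to any bar-chart routine then gives bars that decrease in height from left to right.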
A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Although pie charts are very common, they are not as effective as Pareto charts.
Pie charts:
- Show the distribution of categorical data in a commonly used format.
Try to never use pie charts because they waste ink on components that are not data, and they lack an appropriate scale.
A frequency polygon uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.
A variation of the basic frequency polygon is the relative frequency polygon, which uses relative frequencies for the vertical scale. An advantage of relative frequency polygons is that two or more of them can be combined on a single graph for easy comparison.
Graphs that Deceive
Deceptive graphs are commonly used to mislead people. Graphs should be constructed in a way that is fair and objective.
A common deceptive graph starts the vertical scale at some value greater than zero, which exaggerates differences between groups. This is called a nonzero vertical axis. Always examine a graph carefully to see whether the vertical axis begins at some point other than zero so that differences are exaggerated.
Pictographs are another type of chart often used to mislead. Data that are one-dimensional in nature are often depicted with two-dimensional or three-dimensional objects. By using pictographs, artists can create false impressions that grossly distort differences, because of these principles of basic geometry:
- When you double each side of a square, its area doesn’t merely double, it increases by a factor of four
- When you double each side of a cube, its volume doesn’t merely double, it increases by a factor of eight
When examining data depicted with a pictograph, determine whether the graph is misleading because objects of area or volume are used to depict amounts that are actually one-dimensional.
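The two doubling facts are special cases of a general scaling rule. As a worked illustration, if every side of an object is multiplied by a factor \(k\) (a symbol introduced here, not used elsewhere in these notes), then

\[ A_{\text{new}} = k^2 \, A_{\text{old}}, \qquad V_{\text{new}} = k^3 \, V_{\text{old}} \]

With \(k = 2\) the area grows by \(2^2 = 4\) and the volume by \(2^3 = 8\). A pictograph that triples the height of a three-dimensional object to show a value three times as large actually displays \(3^3 = 27\) times the volume.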
For small data sets of 20 values or fewer, use a table instead of a graph. A graph of data should make us focus on the true nature of the data, not on other elements, such as eye-catching but distracting design features. Do not distort data. Construct a graph to reveal the true nature of the data. Almost all of the ink in a graph should be used for the data, not for the design elements.
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired quantitative data in which the horizontal axis is used for the first variable x and the vertical axis is used for the second variable y.
The presence of correlation between two variables is not evidence that one of the variables causes the other. We might find a correlation between beer consumption and weight, but we cannot conclude from the statistical evidence that drinking beer has a direct effect on weight.
A scatterplot can be very helpful in determining whether there is a correlation between the two variables.
The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables.
When we do conclude that there appears to be a linear correlation between two variables, we can find the equation of the straight line that best fits the sample data, and that equation can be used to predict the value of one variable when given a specific value of the other variable. Instead of using the straight-line equation of \(y = mx + b \) that we have all learned in prior math courses, we use the format \(\hat{y} = b_0 + b_1 x\), where \(b_0\) is the y-intercept and \(b_1\) is the slope.
Given a collection of paired sample data, the regression line, or line of best fit, is the straight line that best fits the scatter plot of the data.
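Both r and the regression line can be computed from the same sums. The sketch below uses the usual least-squares formulas and writes the line as y-hat = b0 + b1 * x (intercept b0, slope b1). The function name `linear_fit` is hypothetical and the paired data are made up for illustration; they do not come from these notes.

```python
import math

def linear_fit(xs, ys):
    """Return (r, b0, b1): correlation coefficient, intercept, and slope."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of squared deviations and of cross products.
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / math.sqrt(sxx * syy)   # strength of the linear association
    b1 = sxy / sxx                   # slope of the least-squares line
    b0 = mean_y - b1 * mean_x        # line passes through (mean_x, mean_y)
    return r, b0, b1

# Made-up paired data for illustration only.
xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

r, b0, b1 = linear_fit(xs, ys)
print(f"r = {r:.3f}")
print(f"y-hat = {b0:.2f} + {b1:.2f}x")
```

The fitted line should only be used for prediction when r indicates a linear correlation, and a strong r still says nothing about causation.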
Round the number to the nearest ten
66,843.908
- 66,840
Round the number to the nearest hundredth
-0.451
- -0.45
Simplify
27/90
- 3/10
Write the percentage as a decimal number
55%
- 0.55
Write the percent as a simplified fraction
88%
- 22/25
Write the fraction as a percent
2/13
- 15.38%
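The conversions above can be double-checked with Python's `fractions` module, which reduces fractions to lowest terms automatically. This is just a checking aid under the same numbers as the exercises; the variable names are illustrative.

```python
from fractions import Fraction

# Simplify 27/90; Fraction reduces to lowest terms automatically.
simplified = Fraction(27, 90)

# 55% as a decimal number.
percent_as_decimal = 55 / 100

# 88% as a simplified fraction.
percent_as_fraction = Fraction(88, 100)

# 2/13 as a percent, rounded to two decimal places.
fraction_as_percent = round(2 / 13 * 100, 2)

print(simplified, percent_as_decimal, percent_as_fraction, fraction_as_percent)
```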
Practice questions for a textbook are marked with difficulty levels of easy, intermediate, and difficult. If 46 of the 147 practice questions are rated as intermediate, approximately what percentage of the questions are intermediate level?
- 31%
Find the percentage of total calories from fat
Calories=120, Calories from fat=20
- 16.7%
What is 27% of 23
- 6.21
What is the y coordinate of (2,1)
- 1
How many individuals are included in the summary?
- 52
Is it possible to identify the exact values of all the original service times?
- No. The data values in each class could take on any value between the class limits, inclusive
A frequency table of grades has 5 classes(A,B,C,D,F) with frequencies of 2,13,16,7, and 1 respectively. Using percentages, what are the relative frequencies of the five classes?
- 5.13%
- 33.33%
- 41.03%
- 17.95%
- 2.56%
Heights of adult males are known to have a normal distribution. A researcher claims to have randomly selected adult males and measured their heights with the resulting relative frequency distribution as shown here. Identify two major flaws with these results.
- The sum of the relative frequencies is 124%, but it should be 100%, with a small possible round-off error
- All of the relative frequencies appear to be roughly the same. If they are from a normal distribution, they should start low, reach a maximum, and then decrease
Identify the lower class limits, upper class limits, class width, class midpoints, and class boundaries for the given frequency distribution. Also identify the number of individuals in the summary.
Age Frequency
15-24 29
25-34 33
35-44 14
45-54 4
55-64 6
65-74 1
75-84 1
Identify the lower class limits
- 15,25,35,45,55,65,75
Identify the upper class limits
- 24,34,44,54,64,74,84
Identify the class width
- 10
Identify the class midpoints
- 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5
Identify the class boundaries
- 14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5
Identify the number of individuals in the summary
- 88
Identify the lower class limits, upper class limits, class width, class midpoints, and class boundaries for the given frequency distribution. Also identify the number of individuals included in the summary.
Platelet Count Frequency
100-199 24
200-299 91
300-399 30
400-499 0
500-599 4
Identify the lower class limits
- 100, 200, 300, 400, 500
Identify the upper class limits
- 199, 299, 399, 499, 599
Identify the class width
- 100
Identify the class midpoints
- 149.5, 249.5, 349.5, 449.5, 549.5
Identify the class boundaries
- 99.5, 199.5, 299.5, 399.5, 499.5, 599.5
Identify the number of individuals in the summary
- 149
Does the frequency distribution appear to have a normal distribution using a strict interpretation of the relevant criteria?
Temp Frequency
45-49 3
50-54 0
55-59 6
60-64 13
65-69 7
70-74 6
75-79 1
Does the frequency distribution appear to have a normal distribution?
- No, the distribution does not appear to be normal
Does the frequency distribution appear to have a normal distribution?
Temp Frequency
40-44 1
45-49 2
50-54 5
55-59 14
60-64 5
65-69 4
70-74 1
- Yes, because the frequencies start low, proceed to one or two higher frequencies, then decrease to a low frequency, and the distribution is approximately symmetric
The data represents the BMI values for 20 females. Construct a frequency distribution beginning with a lower class limit of 15 and use a class width of 6.0
17.7 33.5 26.3 25.9 22.9
27.1 21.9 18.3 27.7 22.9
19.2 22.3 23.7 37.7 32.4
27.8 44.9 30.6 28.7 22.9
BMI Frequency
15.0-20.9 3
21.0-26.9 8
27.0-32.9 6
33.0-38.9 2
39.0-44.9 1
The following data show the ages of recent award-winning male actors at the time when they won their award. Make a frequency table for the data, using bins of 20-29, 30-39, and so on
Age Number
20-29 2
30-39 10
40-49 11
50-59 7
60-69 2
70-79 2
Construct one table that includes relative frequencies based on the frequency distributions shown below. Compare the amounts of tar in unfiltered and filtered cigarettes.
Tar Nonfiltered Filtered
6-10 0% 8%
11-15 0% 12%
16-20 4% 20%
21-25 4% 60%
26-30 52% 0%
31-35 28% 0%
36-40 12% 0%
Do cigarette filters appear to be effective?
- Yes, because the relative frequency of the higher tar classes is greater for nonfiltered cigarettes
Construct the cumulative frequency distribution for the given data
20-29 25
30-39 35
40-49 11
50-59 2
60-69 4
70-79 1
80-89 1
Less than 30 = 25
Less than 40 = 60
Less than 50 = 71
Less than 60 = 73
Less than 70 = 77
Less than 80 = 78
Less than 90 = 79
Construct the cumulative frequency distribution for the given data
Daily Low Frequency
35-39 2
40-44 4
45-49 5
50-54 12
55-59 7
60-64 8
65-69 1
Less than 40 2
Less than 45 6
Less than 50 11
Less than 55 23
Less than 60 30
Less than 65 38
Less than 70 39
Among fatal plane crashes that occurred during the past 55 years, 466 were due to pilot error, 79 were due to other human error, 517 were due to weather, 343 were due to mechanical problems, and 485 were due to sabotage.
Construct the relative frequency distribution. What is the most serious threat to aviation safety, and can anything be done about it?
Total crashes=1890
Crashes per year=1890/55 = 34.36 fatal crashes per year
Pilot error = 24.7%
Other human error = 4.2%
Weather = 27.4%
Mechanical problems = 18.1%
Sabotage = 25.7%
What is the most serious threat to aviation safety and can anything be done about it?
- Weather is the most serious threat to aviation safety. Weather monitoring systems could be improved.
Use the given categorical data to construct the relative frequency distribution
Natural births randomly selected from four hospitals in a highly populated region occurred on the days of the week with the frequencies 53, 64, 71, 57, 54, 46, and 55. Does it appear that such births occur on the days of the week with equal frequency?
Total births = 400
Day Relative Frequency
Mon 13.25%
Tue 16%
Wed 17.75%
Thur 14.25%
Fri 13.5%
Sat 11.5%
Sun 13.75%
Let the frequencies be substantially different if any frequency is at least twice any other frequency. Does it appear that these births occur on the days of the week with equal frequency?
- Yes, it appears that births occur on the days of the week with frequencies that are about the same.
Which characteristic of data is a measure of the amount that the data values vary?
- Variation
_____ are sample values that lie very far away from the majority of the other sample values.
- Outliers
_____ helps us understand the nature of the distribution of a data set.
- Frequency distribution
In a _____ distribution, the frequency of a class is replaced with a proportion or percent.
- Relative frequency distribution
Heights of adult males are normally distributed. If a large sample of heights of adult males is randomly selected and the heights are illustrated in a histogram, what is the shape of that histogram?
- Bell-shaped
If we collect a large sample of blood platelet counts and if our sample includes a single outlier, how will that outlier appear in a histogram?
- The outlier will appear as a bar far from all of the other bars with a height that corresponds to a frequency of 1.
Listed below are body temperatures of healthy adults. Why is it that a graph of these data would not be very effective in helping us understand the data?
- The data set is too small for a graph to reveal important characteristics of the data
If we have a large voluntary response sample consisting of weights of subjects who chose to respond to a survey posted on the internet, can a graph help to overcome the deficiency of having a voluntary response sample?
- No, a graph cannot help to overcome the deficiency. If the sample is a bad sample, there are no graphs or other techniques that can be used to salvage the data
How does the stem-and-leaf plot show the distribution of data?
- The lengths of the rows are similar to the heights of bars in a histogram, longer rows of data correspond to higher frequencies.
The accompanying data represent women’s median earnings as a percentage of men’s median earnings for recent years beginning with 1989. Is there a trend?
- There is a general upward trend though there have been some down years. An upward trend would be helpful to women so that their earnings become equal to those of men.
In a study of retractions in biomedical journals, 431 were due to error, 205 were due to plagiarism, 812 were due to fraud, 306 were due to duplication of publications, and 285 had other causes. Does misconduct appear to be a major factor?
- Yes, misconduct appears to be a major factor because the majority of retractions were due to misconduct
The graph to the right uses cylinders to represent barrels of oil consumed by two countries. Does the graph distort the data or does it depict the data fairly?
- Yes it distorts the data because the graph incorrectly uses objects of volume to represent the data
In this section we use r to denote the value of the linear correlation coefficient. Why do we refer to this correlation coefficient as being linear?
- The term linear refers to a straight line, and r measures how well a scatter plot fits a straight-line pattern
If we find that there is a linear correlation between the concentration of carbon dioxide in our atmosphere and the global temperature, does that indicate that changes in the concentration of carbon dioxide cause changes in the global temperature?
- No, the presence of a linear correlation between two variables does not imply that one of the variables is the cause of the other variable.
What is a scatter plot and how does it help us?
- A scatter plot is a graph of paired (x,y) quantitative data. It provides a visual image of the data plotted as points, which helps show any patterns in the data.
For a data set of brain volumes and IQ scores of seven males, the linear correlation coefficient is r=.805. Use the table available to find the critical values of r. Based on a comparison of the linear correlation coefficient r and the critical values, what do you conclude about a linear correlation?
- The critical values are -.754, .754
Since the correlation coefficient of r is:
- In the right tail above the positive critical value, there is sufficient evidence to support the claim of a linear correlation
For a data set of brain volumes and IQ scores of four males, the linear correlation coefficient is found and the P-value is .641. Write a statement that interprets the P-value and includes a conclusion about linear correlation.
- The P-value indicates that the probability of a linear correlation coefficient that is at least as extreme is 64.1%, which is high, so there is not sufficient evidence to conclude that there is a linear correlation between brain volumes and IQ scores in males.
For a data set of weights and highway fuel consumption amounts of ten types of automobile, the linear correlation coefficient is found and the P-value is .021. Write a statement that interprets the P-value and includes a conclusion about linear correlation.
- The P-value indicates that the probability of a linear correlation coefficient that is at least as extreme is 2.1% which is low, so there is sufficient evidence to conclude that there is a linear correlation between weight and highway fuel consumption in automobiles.
A magazine, which does not accept free products or advertisements from anyone, prints a review of new cars. Are there sources of bias in this situation?
- There do not appear to be any sources of bias
Determine whether the given value is a statistic or a parameter.
A sample of 568 doctors showed that 16% go to school
- The value is a statistic because it is a numerical measurement describing some characteristic of a sample
State whether the data described below are discrete or continuous, and explain why.
The number of eyes that different people have.
- The data are discrete because the data can only take on specific values
Determine whether the given value is a statistic or a parameter
Thirty percent of all dog owners poop scoop after their dog
- Parameter
Determine which of the four levels of measurement is most appropriate
Student’s grades(A, B, C, D) on a test
- Ordinal
Determine which of the four levels of measurement is most appropriate
Level of satisfaction of survey respondents
- Ordinal
The following frequency distribution displays the scores on a math test. Find the class boundaries of scores interval 50-59
- 49.5, 59.5
Construct a stem and leaf plot of the test scores 68, 72, 85, 75, 89, 89, 87, 90, 98, 100
How does the stem and leaf plot show the distribution of these data
6 8
7 25
8 5799
9 08
10 0
- The lengths of the rows are similar to the heights of bars in a histogram, longer rows of data correspond to higher frequencies
The linear _____ coefficient denoted by r measures the _____ of the linear association between two variables.
- Correlation, strength
Identify the type of sampling used(random, systematic, convenience, stratified, or cluster) in the situation described below.
A researcher selects every 324th social security number and surveys the corresponding person.
- Systematic
Identify which type of sampling is used, (random, systematic, convenience, stratified, or cluster).
To determine customer opinion of their musical variety, Sony randomly selects 120 concerts during a certain week and surveys all concert goers.
- Cluster
A polling company reported that 49% of 1018 surveyed adults said that secondhand smoke is annoying.
What is the exact value that is 49% of 1018
- 498.82
Could the result be the actual number of adults who said that secondhand smoke is quite annoying?
- No, the result from part a could not be the actual number of adults who said that because a count of people must result in a whole number
What could be the actual number of adults who said that secondhand smoke is annoying?
- 499
Among the 1018 respondents, 190 said that secondhand smoke is not annoying at all. What percentage of respondents said that?
- (190/1018) * 100 = 18.66%
A polling company reported that 49% of 2302 surveyed adults said they play baseball.
What is the exact value of 49% of 2302
- (.49*2302) = 1127.98
Could the previous result be the actual number of adults who play baseball?
- No, the result must result in a whole number
What could be the actual number of adults who play baseball?
- 1128
Among the 2302 respondents, 114 said they play hockey. What percentage play hockey?
- (114/2302) * 100 = 4.95%
Determine whether the study is an experiment or an observational study, and then identify a major problem with the study.
A medical researcher tested for a difference in systolic blood pressure levels between male and female students who are 12 years of age. She randomly selected four males and four females for her study.
- This is an observational study because the researcher does not attempt to modify the individuals
What is a major problem with the study?
- The sample is too small
Identify the type of observational study(cross-sectional, retrospective, or prospective)
A research company uses a device to record the viewing habits of about 10000 households, and the data collected today will be used to determine the proportion of households tuned to a particular sports program
- Cross-sectional study
Identify the type of observational study
A researcher plans to obtain data by interviewing siblings of victims who perished in a bombing. He will interview them, and people unrelated to the victims, over the next ten years to see how closeness to a traumatic event might affect recovery time.
- Prospective, because the data will be collected in the future