# Introduction to Statistics

These are my notes and thoughts on an introduction to statistics.

**Statistics and Problem Solving**

A population is the total set of subjects or things we are interested in studying. Populations are defined by what a researcher is studying and can come in all shapes and sizes.

A frame is a list containing all members of the population.

Population parameters are facts about the population. Since parameters are descriptions of the population, a population can have many parameters. Parameters can be averages, percentages, minimums, or maximums. For a specific population at a specific point in time, population parameters do not change.

A sample is a subset of the population which is used to gain insight about the population. Samples are used to represent a larger group, the population.

A statistic is a fact or characteristic about the sample. For any given sample a statistic is a fixed number. Statistics are used as estimates of population parameters.

A process is a method for obtaining a desired result. The idea of a process is closely tied to quality control. In order to improve a process, there must be an understanding of how the process is currently performing. This requires definition and measurement of the process.

The science of statistics is divided into two categories, descriptive and inferential. Descriptive methods describe and summarize data. Descriptive statistics is the collection, organization, and presentation of data.

The objective of inferential statistics is to make reasonable guesses about the population characteristics using sample data.

**Collecting and Analyzing Data**

Part of becoming a problem solver and user of statistics is developing an ability to appraise the quality of measurements. When you encounter data, consider whether the concept under study is adequately reflected by the proposed measurements, whether the data is measured accurately, and whether there is a sufficient quantity of data to draw a reasonable conclusion.

Measurement and data are an integral part of science. A general method has been developed to solve research problems: gather information about the phenomenon being studied; on the basis of the data, formulate a preliminary generalization or hypothesis; then collect further data to test the hypothesis. If the data and other subsequent experiments support the hypothesis, it becomes a law.

There are two ways to obtain data, observation and controlled experiments. In a statistical analysis, it is usually not possible to recover from poorly measured concepts or badly collected measurements.

A response variable measures the outcome of interest in a study. An explanatory variable causes or explains changes in a response variable. Isolating the effects of one variable on another means anticipating potentially confounding variables and designing a controlled experiment to produce data in which the values of the confounding variable are regulated.

Observational data comes about from measuring things. It can be extremely valuable.

Much of the statistical information presented to us is in the form of surveys. So, it is important to understand them and how they are done. In some cases, the purpose of a survey is purely descriptive. However, in many cases the researcher is interested in discovering a relationship.

Data in which the observations are restricted to a set of values that possess gaps is called discrete. Data that can take on any value within some interval is called continuous. The quality of data is referred to as its level of measurement. When analyzing data, you must be exceedingly conscious of the data’s level of measurement because many statistical analyses can only be applied to data that possess a certain level of measurement.

Data that represents whether a variable possesses some characteristic is called nominal. Ordinal data represents categories that have some associated order. Note that ordinal data is also nominal, but it also possesses the additional property of ordinality.

If the data can be ordered and the arithmetic difference is meaningful, the data is interval. An example of interval data is temperature. Interval data is numerical data that possesses both the property of ordinality and the interval property. Ratio data is similar to interval data, except that it has a meaningful zero point and the ratio of two data points is meaningful.

Qualitative data is data measured on a nominal or ordinal scale. Quantitative data is measured on an interval or ratio scale.

Time series data originates as measurements, usually taken from some process over equally spaced intervals of time. Processes can be divided into two categories: stationary and nonstationary. All time series that are interesting vary, and the nature of the variability determines how the process is characterized. In a stationary process, the time series varies around some central value and has approximately the same variation over the series. In a nonstationary process, the time series possesses a trend, the tendency for the series to either increase or decrease over time.

Cross-sectional data are measurements created at approximately the same period of time.

**Organization of Data in Statistics**

A frequency distribution is a summary technique that organizes data into classes and provides in tabular form a list of the classes along with the number of observations in each class.

The process begins with an analyst refining information: taking the raw data and organizing it by counting the number of observations in each classification.

A frequency distribution is a good way to handle large amounts of data. With it, we can see the overall structure of the data.

There are two steps in creating a frequency distribution:

- Choose the classifications
- Count the number of observations in each class
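The two steps can be sketched in a few lines of Python. This is only an illustration: the scores are made-up data, and the class width of 10 and the `classify` helper are assumptions chosen for the example, not part of the notes.

```python
from collections import Counter

# Hypothetical raw data: a dozen exam scores, invented for illustration.
scores = [55, 62, 67, 71, 74, 78, 81, 83, 85, 88, 91, 95]


def classify(score):
    """Step 1: assign a score to a class of width 10, such as '70-79'."""
    lower = (score // 10) * 10
    return f"{lower}-{lower + 9}"


# Step 2: count the number of observations in each class.
frequency = Counter(classify(s) for s in scores)

# Print the frequency distribution in tabular form.
for label in sorted(frequency):
    print(label, frequency[label])
```

Changing the class width changes the shape of the table, which is why the choice of classifications comes first.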

Graphs are important because they put information in visual form. Although individual data values can be lost in a graph, the gain in overall clarity usually more than makes up for it. Graphing software makes this easy, and many different programs are available for creating good-looking graphs.

**Bar Charts**

The bar chart is a simple graph in which the length of each bar corresponds to the number of observations in a category.

They are a good presentation tool and helpful in showing the differences in magnitude.

Creating a bar chart can get complicated. You should think about size, color, and labeling.

**Pie Charts**

Pie charts can represent the same information as a bar chart. The slices in a pie chart are proportional to the total in each category. You can easily compare the total of each category to the total overall.

When your data is qualitative, choosing categories is pretty easy. However, when your data is quantitative, choosing those categories is more complicated. The reason is that your choices often affect how others will interpret the data. So, you have to be careful when doing this.

Choosing the number of categories is your choice and should depend on the amount of data available. You want enough categories to make the comparisons meaningful but not so many that it is hard to understand. Each situation will be different in this regard.

**Relative Frequency Distribution**

A relative frequency distribution represents the proportion of the total observations that fall in each category. It enables a person to view the number in each category in relation to the total number of observations. It also changes the frequency in each category to a proportion, so we can compare data sets more easily. It looks like this:

\[ \text{relative frequency} = \frac{\text{number in category}}{\text{total number}} \]
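As a small sketch of the formula, the snippet below divides each category count by the total. The category names and counts are hypothetical.

```python
# Hypothetical category counts, invented for illustration.
counts = {"A": 5, "B": 10, "C": 25, "D": 10}

total = sum(counts.values())

# relative frequency = number in category / total number
relative = {category: n / total for category, n in counts.items()}

for category, proportion in relative.items():
    print(category, proportion)
```

The relative frequencies always add up to 1, which is a quick check on the arithmetic.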

**Cumulative Frequency Distribution**

This gives a person the ability to quickly look at any category and see the number of observations and how they are related. The cumulative frequency is the sum of the frequency of a particular category and all preceding categories.

**Cumulative Relative Frequency**

The cumulative relative frequency is the proportion of observations in a particular category and all preceding categories.
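Both cumulative versions can be built with a running sum. The sketch below uses hypothetical class labels and frequencies; `itertools.accumulate` is just one convenient way to do it.

```python
from itertools import accumulate

# Hypothetical frequency distribution, invented for illustration.
classes = ["50-59", "60-69", "70-79", "80-89", "90-99"]
frequencies = [1, 2, 3, 4, 2]

# Cumulative frequency: this class plus all preceding classes.
cumulative = list(accumulate(frequencies))

# Cumulative relative frequency: the same running sum as a proportion.
total = cumulative[-1]
cumulative_relative = [c / total for c in cumulative]

for label, c, cr in zip(classes, cumulative, cumulative_relative):
    print(label, c, round(cr, 3))
```

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.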

**Histograms**

A histogram is used frequently and reveals the distribution of data. It is a bar graph of the frequency in which the height of each bar corresponds to the frequency of the category. Each category is represented by a vertical bar whose height is proportional to the frequency of the interval. The horizontal boundaries of each vertical bar correspond to the category endpoints. Once the frequency distribution has been calculated, all the information necessary for plotting a histogram is available.

**Stem and Leaf Display**

The stem and leaf display is a mix of methods. It is similar to a histogram, but the raw data remains visible and usable rather than being lost in the graph. It is useful for ordering and detecting patterns in the data.
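A rough sketch of how such a display could be built, assuming two-digit whole numbers with the tens digit as the stem and the ones digit as the leaf. The data and the `stem_and_leaf` helper are hypothetical.

```python
from collections import defaultdict


def stem_and_leaf(values):
    """Group each value's ones digit (leaf) under its tens digit (stem)."""
    stems = defaultdict(list)
    for value in sorted(values):
        stems[value // 10].append(value % 10)
    return dict(stems)


# Hypothetical data, invented for illustration.
data = [23, 25, 31, 37, 38, 42, 45, 45, 49, 53]
display = stem_and_leaf(data)

for stem in sorted(display):
    leaves = " ".join(str(leaf) for leaf in display[stem])
    print(f"{stem} | {leaves}")
```

Each row reads like a histogram bar turned on its side, yet every original value can still be recovered from its stem and leaf.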

**Ordered Array**

An ordered array is a listing of all the data in either increasing or decreasing magnitude. Data listed in increasing order is said to be listed in rank order. If listed in decreasing order, it is listed in reverse rank order. Listing data in an order is very useful and usually done. It allows you to scan the data quickly for the largest and smallest values.

**Dot Plots**

A dot plot is a graph where each data value is plotted as a point. If there are multiple entries, they are plotted above each other.

**Time Series Data**

A time series plot graphs data using time as the horizontal axis.

**Statistical and Critical Thinking**

Surveys provide data that enable us to improve products or services. Surveys guide political candidates, shape business practices, influence social media, and affect many aspects of our lives.

A voluntary response sample is a sample in which respondents themselves decide whether to participate. Those with a strong interest in the topic are more likely to participate. Sample data must be collected in an appropriate way, such as through a process of random selection. If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical torturing can salvage them.

When using methods of statistics with sample data to form conclusions about a population, it is absolutely essential to collect sample data in a way that is appropriate.

Data are collections of observations, such as measurements, genders, or survey responses. A single data value is called a datum. The term data is plural.

Statistics is the science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them.

A population is the complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data that we would like to make inferences about.

A census is the collection of data from every member of the population.

A sample is a subcollection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

A voluntary response sample is one in which the respondents themselves decide whether to be included.

The word statistics is derived from the Latin word status, meaning state. Early uses of statistics involved compilations of data and graphs describing various aspects of a state or country.

The following types of polls are common examples of voluntary response samples. By their very nature, all are seriously flawed because we should not make conclusions about a population on the basis of samples with a strong possibility of bias.

- Internet polls: people online can decide whether to respond.
- Mail-in polls: people can decide whether to reply.
- Telephone polls: newspaper, radio, or television announcements ask that you call a special number to respond.

**Analyze**

After completing our preparation by considering the context, source, and sampling method, we begin to analyze the data.

**Graph and Explore**

An analysis should begin with appropriate graphs and explorations of data.

**Apply Statistical Methods**

A good statistical analysis does not require strong computational skills. A good statistical analysis does require using common sense and paying careful attention to sound statistical methods.

**Conclude**

The final step in our statistical process involves conclusions, and we should develop an ability to distinguish between statistical significance and practical significance.

Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5 percent or less. Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.
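The two birth examples can be checked with a short calculation. This is a sketch under a stated assumption: each birth is independently a girl with probability 0.5, so the number of girls follows a binomial distribution. The helper name `prob_at_least` is made up for the example.

```python
from math import comb


def prob_at_least(k, n, p=0.5):
    """Binomial probability of k or more successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p) ** (n - i) for i in range(k, n + 1))


# Assumption: P(girl) = 0.5 for every birth, independently of the others.
p_98_or_more = prob_at_least(98, 100)
p_52_or_more = prob_at_least(52, 100)

print(f"P(98 or more girls in 100 births) = {p_98_or_more:.3g}")
print(f"P(52 or more girls in 100 births) = {p_52_or_more:.3g}")
```

Under that assumption the first probability is vanishingly small, far below the 5 percent criterion, while the second is roughly 0.38, far above it.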

Practical significance is a separate question. It is possible that some treatment or finding is effective, but common sense might suggest that it does not make enough of a difference to justify its use or to be practical.

**Misleading Conclusions**

When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by statistical analysis.

**Sample Data Reported**

When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results. Ask people what they weigh and you are likely to get their desired weights, not their actual weight.

**Loaded Questions**

If survey questions are not worded carefully, the results of a study can be misleading. Survey questions can be loaded or intentionally worded to elicit a desired response.

**Order of Questions**

Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered.

**Nonresponse**

A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable. When people are asked survey questions, some firmly refuse to answer.

**Percentages**

To find a percentage of an amount, replace the % symbol with division by 100, and then interpret “of” to be multiplication.

6% of 1200 responses = \(\frac{6}{100} * 1200 = 72 \)

**Decimal to Percentage**

To convert from a decimal to a percentage, multiply by 100%.

\[ 0.25 \rightarrow 0.25 * 100\% = 25\% \]

**Fraction to Percentage**

To convert from a fraction to a percentage, divide the denominator into the numerator to get an equivalent decimal number. Then multiply by 100 percent.

\[ \frac{3}{4} = 0.75 \rightarrow 0.75 * 100\% = 75\% \]

**Percentage to Decimal**

To convert from a percentage to a decimal number, replace the % symbol with division by 100.

\[ 85\% = \frac{85}{100} = 0.85 \]
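The four percentage rules above translate directly into small functions. The function names are hypothetical, chosen only to mirror the headings.

```python
def percent_of(percent, amount):
    """Replace % with division by 100 and read 'of' as multiplication."""
    return percent / 100 * amount


def decimal_to_percent(decimal):
    """Multiply a decimal by 100 to get a percentage."""
    return decimal * 100


def fraction_to_percent(numerator, denominator):
    """Divide to get a decimal, then multiply by 100."""
    return numerator / denominator * 100


def percent_to_decimal(percent):
    """Replace % with division by 100."""
    return percent / 100


print(percent_of(6, 1200))        # 6% of 1200 responses
print(decimal_to_percent(0.25))   # 0.25 as a percentage
print(fraction_to_percent(3, 4))  # 3/4 as a percentage
print(percent_to_decimal(85))     # 85% as a decimal
```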

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

If we have more than one statistic, we have statistics. Another meaning of statistics is the science of planning studies and experiments; obtaining data, organizing, summarizing, presenting, analyzing, and interpreting those data.

Some data are numbers representing counts or measurements, whereas others are attributes that are not counts or measurements. Quantitative data consist of numbers representing counts or measurements.

Categorical data consist of names or labels. Categorical data are sometimes coded with numbers, with those numbers replacing names. Although such numbers might appear to be quantitative, they are actually categorical data.

**Include Units of Measurement**

With quantitative data, it is important to use the appropriate units of measurement, such as dollars, hours, feet, or meters. We should carefully observe information given about the units of measurement, such as all amounts are in thousands of dollars or all units are in kilograms.

**Discrete or Continuous**

Quantitative data can be further described by distinguishing between discrete and continuous types. Discrete data result when the data values are quantitative and the number of values is finite or countable. Continuous (numerical) data result from infinitely many possible quantitative values, where the collection of values is not countable.

The concept of countable data plays a key role in the preceding definitions, but it is not a particularly easy concept to understand. Continuous data can be measured, but not counted. If you select a particular value from continuous data, there is no next data value.

**Levels of Measurement**

Another common way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. When we are applying statistics to real problems, the level of measurement of the data helps us to decide which procedure to use. Don’t do computations and don’t use statistical methods that are not appropriate for the data.

**Ratio**

There is a natural zero starting point and ratios make sense. These are heights, lengths, distances, and volumes.

**Interval**

Differences are meaningful, but there is no natural zero starting point and ratios are meaningless. Body temperatures in degrees Fahrenheit or Celsius are an example.

**Ordinal**

Data can be arranged in order, but differences either can’t be found or are meaningless. Examples are ranks of colleges.

**Nominal**

Categories only. Data cannot be arranged in order. An example is eye colors.

The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in some order.

Because nominal data lack any ordering or numerical significance, they should not be used for calculations. Numbers such as 1,2,3, or 4 are sometimes assigned to the different categories, but these numbers have no real computational significance and any average calculated from them is meaningless and possibly misleading.

Data are at the ordinal level of measurement if they can be arranged in some order, but differences between data values cannot be determined or are meaningless.

Ordinal data provide information about relative comparisons, but not the magnitudes of the differences. Usually, ordinal data should not be used for calculations such as an average, but this guideline is sometimes ignored.

Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful. Data at this level do not have a natural zero starting point at which none of the quantity is present.

Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. For data at this level, differences and ratios are both meaningful.

The distinction between the interval and ratio levels of measurement can be a bit tricky. To distinguish between them, use a ratio test by asking this question: Does use of the term twice make sense? Here the term twice describes one value being double the other value. Twice makes sense for data at the ratio level of measurement, but it does not make sense for data at the interval level of measurement.

For the true zero test, and for ratios to make sense, there must be a value of true zero, where the value of zero indicates that none of the quantity is present, and zero is not simply an arbitrary value on a scale. The temperature of 0°F is arbitrary and does not indicate that there is no heat, so temperatures on the Fahrenheit scale are at the interval level of measurement, not the ratio level.

Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.

Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields such as sociology or finance.

**Example of Data Set Magnitudes**

- Terabytes
- Petabytes
- Exabytes
- Zettabytes
- Yottabytes

**Statistics in Data Science**

The modern data scientist has a solid background in statistics and computer systems as well as expertise in fields that extend beyond statistics. The modern data scientist might be skilled with Hadoop software, which uses parallel processing on many computers for the analysis of big data. The modern data scientist might also have a strong background in some other field such as psychology, biology, medicine, chemistry, or economics.

**Missing Data**

When collecting sample data, it is quite common to find that some values are missing. Ignoring missing data can sometimes create misleading results. If you make the mistake of skipping over a few different samples when you are manually typing them into a statistics software program, the missing values are not likely to have a serious effect on the results. However, if a survey includes many missing salary entries because those with very low incomes are reluctant to reveal their salaries, those missing values will have the serious effect of making salaries appear higher than they really are.

A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value.

A data value is missing not at random if the missing value is related to the reason that it is missing.

Data can be missing completely at random. An example is when someone using a keyboard to manually enter the ages of survey respondents makes the mistake of failing to enter the age of 37 years. That data value is missing completely at random.

**Biased Results**

Based on the two definitions and examples above, it makes sense to conclude that if we ignore data missing completely at random, the remaining values are not likely to be biased and good results should be obtained. However, if we ignore data that are missing not at random, it is very possible that the remaining values are biased and results will be misleading.

**Correcting for Missing Data**

There are different methods for dealing with missing data. One very common method for dealing with missing data is to delete all subjects having any missing values. If the data are missing completely at random, the remaining values are not likely to be biased and good results can be obtained, but with a smaller sample size. If the data are missing not at random, deleting subjects having any missing values can easily result in a bias among the remaining values, so results can be misleading.

We can also impute missing data values, which means substituting values for them. There are different methods of determining the replacement values, such as using the mean of the other values, or using a randomly selected value from other similar cases, or using a method based on regression analysis.
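A minimal sketch of the two approaches described above, deleting subjects with missing values and substituting the mean of the other values. The ages are hypothetical, and `None` is an assumed marker for a missing entry.

```python
from statistics import mean

# Hypothetical survey ages; None marks a missing value (an assumption).
ages = [34, None, 29, 41, None, 36]

# Deletion: keep only the subjects with no missing value.
observed = [age for age in ages if age is not None]

# Mean substitution: replace each missing value with the mean of the rest.
fill_value = mean(observed)
imputed = [fill_value if age is None else age for age in ages]

print("after deletion:", observed)
print("after mean substitution:", imputed)
```

Neither approach repairs data that are missing not at random. If the missing ages were systematically older or younger than the rest, both lists would still be biased.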

When analyzing sample data with missing values, try to determine why they are missing, then decide whether it makes sense to treat the remaining values as being representative of the population. If it appears that there are missing values that are missing not at random, know that the remaining data may well be biased and any conclusions based on those remaining values may well be misleading.

In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. The individuals in experiments are called experimental units and they are often called subjects when they are people. In an observational study, we observe and measure specific characteristics, but we don’t attempt to modify the individuals being studied.

Experiments are often better than observational studies because well planned experiments typically reduce the chance of having the results affected by some variable that is not part of the study. A lurking variable is one that affects the variables included in the study, but it is not included in the study.

**Design of Experiments**

Good design of experiments includes replication, blinding, and randomization.

Replication is the repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments.

Blinding is used when the subject doesn’t know whether he or she is receiving a treatment or a placebo. Blinding is a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms.

Randomization is used when individuals are assigned to different groups through a process of random selection. The logic behind randomization is to use chance as a way to create two groups that are similar.
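Random assignment can be sketched by shuffling the subjects and splitting the list. The subject labels, the group sizes, and the fixed seed (used only so the example is repeatable) are all assumptions for illustration.

```python
import random

# Hypothetical subject labels, invented for illustration.
subjects = [f"subject_{i}" for i in range(1, 11)]

# A seeded generator makes the example repeatable; a real study
# would not reuse a published seed.
rng = random.Random(42)

shuffled = subjects[:]
rng.shuffle(shuffled)

# Chance alone decides who lands in each group.
half = len(shuffled) // 2
treatment_group = shuffled[:half]
placebo_group = shuffled[half:]

print("treatment:", treatment_group)
print("placebo:  ", placebo_group)
```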

A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Unlike careless or haphazard sampling, random sampling usually requires very careful planning and execution.

**Simple Random Sample**

A sample of n subjects is selected so that every sample of the same size n has the same chance of being selected.

**Systematic Sample**

Select every kth subject.

**Convenience Sample**

Use data that are very easy to get.

**Stratified Sample**

Subdivide populations into strata or groups with the same characteristics, then randomly sample within those strata.

**Cluster Sample**

Partition the population into clusters or groups, then randomly select some of those clusters and include all members of the selected clusters.
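The sampling methods above can be compared side by side in a short sketch. The population of 100 numbered subjects, the two strata, the ten clusters, and the sample sizes are all hypothetical choices made for the example.

```python
import random

rng = random.Random(0)  # seeded only so the example is repeatable

# Hypothetical population: subjects numbered 1 through 100.
population = list(range(1, 101))

# Simple random sample: every sample of size 10 is equally likely.
simple = rng.sample(population, 10)

# Systematic sample: random starting point, then every kth subject.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split into strata, then sample within each one.
strata = {"first_half": population[:50], "second_half": population[50:]}
stratified = [s for group in strata.values() for s in rng.sample(group, 5)]

# Cluster sample: pick whole clusters at random and keep every member.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
chosen_clusters = rng.sample(clusters, 2)
cluster_sample = [s for cluster in chosen_clusters for s in cluster]

print(simple, systematic, stratified, cluster_sample, sep="\n")
```

Note the difference between the last two: stratified sampling takes some members from every group, while cluster sampling takes every member from some groups.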

**Multistage Sampling**

Professional pollsters and government researchers often collect data by using some combination of the preceding sampling methods. In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.

In a cross sectional study, data are observed, measured, and collected at one point in time, not over a period of time.

In a retrospective study, data are collected from a past time period by going back in time.

In a prospective study, data are collected in the future from groups that share common factors.

**Experiments**

In an experiment, confounding occurs when we can see some effect, but we can’t identify the specific factor that caused it.

A randomized block design uses the same basic idea as stratified sampling, but randomized block designs are used when designing experiments, whereas stratified sampling is used for surveys.

**Matched Pairs Design**

Compare two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics.

**Rigorously Controlled Design**

Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors.

**Sampling Errors**

In statistics, you could use a good sampling method and do everything correctly, and yet it is possible to get wrong results. No matter how well you plan and execute the sample collection process, there is likely to be some error in the results.

A sampling error occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result. Such an error results from chance sample fluctuations.

A non sampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.

A non random sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

**The Gold Standard**

Randomization with placebo/treatment groups is sometimes called the gold standard because it is so effective.

List the elements in the set

{x|x is a natural number between 5 and 13}

{6,7,8,9,10,11,12}

List the elements in the set

{x|x is an integer between -1 and 1}

{0}

Perform the exponentiation by hand

-5^2

= -25

Express 2*2*2*2*3*3*3*3*3 using exponents

= 2^4 * 3^5

Rewrite the expression using exponents

4*3*3*3

= 3^3 * 4^1

Perform the indicated operations

5^3 * 3^2

= 1125

Write the number in scientific notation

0.0001

= 1 * 10^-4

Write the number in scientific notation

3000

= 3 * 10^3

Write the number in scientific notation

4,420,000

= 4.42 * 10^6
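The three scientific notation answers can be double-checked with Python's `e` format specifier, which writes a number as a coefficient followed by a power of ten. This is only a quick check, not part of the exercises.

```python
# Each value from the exercises, formatted in Python's e-notation.
# "1e-04" means 1 * 10^-4, and "4.42e+06" means 4.42 * 10^6.
small = f"{0.0001:.0e}"
medium = f"{3000:.0e}"
large = f"{4420000:.2e}"

print(small, medium, large)
```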

A newspaper posted this question on its website: How often do you seek medical information online? Of 1072 internet users who chose to respond, 38 percent of them responded with frequently. What term is used to describe this type of survey in which the people surveyed consist of those who decided to respond?

- The respondents are a self selected sample
- The respondents are a voluntary response sample

What is wrong with this type of sampling method?

- Responses may not reflect the opinions of the general population
- Many people may choose not to respond to the survey

Determine whether the source given below has the potential to create a bias in a statistical study.

A certain medical organization tends to oppose the use of meat and dairy products in our diets, and that organization has received hundreds of thousands of dollars from an animal rights organization.

- There does appear to be a potential to create a bias. There is an incentive to produce results that are in line with the organization's creed and that of its founders.

An article noted that chocolate is rich in flavonoids. The article reports that regular consumption of foods rich in flavonoids may reduce the risk of coronary heart disease. The study received funding from a candy company and a chocolate manufacturers association. Identify and explain at least one source of bias in the study described.

- The researchers may have been more inclined to provide favorable results because funding was provided by a party with a definite interest. The bias could have been avoided if the researchers were not paid by the candy company and the chocolate manufacturers.

Determine whether the sampling method described below appears to be sound or is flawed.

In a survey of 572 subjects, each was asked how often he or she drank milk. The survey subjects were internet users who responded to a question that was posted on a news website.

- It is flawed because it is a voluntary response sample

Determine whether the sampling method described below appears to be sound or is flawed.

In a survey of 735 human resource professionals, each was asked about the importance of the experience of a job applicant. The survey subjects were randomly selected by pollsters from a reputable market research firm.

- It appears to be sound because the data are not biased in any way

Determine whether the results appear to have statistical significance, and also determine whether the results appear to have practical significance.

In a study of a birth sex selection method used to increase the likelihood of a baby being born female, 1929 users of the method gave birth to 946 males and 983 females. There is about a 21% chance of getting that many babies born female if the method had no effect.

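- The results do not have statistical significance, because a 21% chance is too high to rule out random variation.
- Not many couples would use a method that makes so little difference.
- About 51% of the babies were female (983 of 1929).
- The results do not have practical significance.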
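The "about 21%" figure can be checked directly. The sketch below assumes that "no effect" means each birth is female with probability 0.5, independently of the others, and sums the binomial probabilities for 983 or more girls among 1929 births. The variable names are illustrative only.

```python
from math import comb

n = 1929          # births among users of the method
girls = 983       # observed number of female births

# P(X >= 983) for X ~ Binomial(n, 0.5), computed with exact integers
# (assumption: "no effect" means each birth is a girl with probability 0.5)
favorable = sum(comb(n, k) for k in range(girls, n + 1))
p_value = favorable / 2**n

proportion_girls = girls / n

print(f"P(at least {girls} girls by chance) = {p_value:.3f}")
print(f"Observed proportion of girls = {proportion_girls:.1%}")
```

A probability this large means the result is easily explained by chance, and a proportion this close to 50% means the method would make little difference in practice.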

In the data table below, the x-values are the weight of cars and the y-values are the corresponding highway fuel consumption amounts.

| Weight (x) | Highway fuel consumption (y) |
|---|---|
| 4034 | 26 |
| 3364 | 32 |
| 4179 | 28 |
| 3674 | 29 |
| 3599 | 30 |

Given the context of the car measurement data, what issue can be addressed by conducting a statistical analysis of the values?

- Is there a relationship or an association between the weight of a car and its fuel consumption amount?

A magazine ran a survey about a website for downloading music. Readers could register their responses on the magazine’s website. Identify what is wrong.

- The sample is a voluntary response sample, so there is a good chance that the results do not reflect the population.

A polling company reported that 27 percent of 1013 surveyed adults said that cell phones are very harmful.

What is the exact value of 27% of 1013?

- = 273.51

Could the result from part A be the actual number of adults who said that cell phones are very harmful?

- No, the result from part A could not be the actual number of adults who said cell phones are very harmful because a count of people must result in a whole number.

What could be the actual number of adults who said that cellular phones are harmful?

- = 274 (just round)

Among the 1013 respondents, 406 said that cell phones are not at all harmful. What percentage said that cell phones are not harmful?

- = 40.08% (406/1013)*100
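The arithmetic in this exercise can be reproduced with a few lines of Python. This is only a sketch: the variable names are made up, and the list of "possible counts" rests on the assumption that the reported 27% was rounded to the nearest whole percent.

```python
respondents = 1013

exact_value = 0.27 * respondents          # 27% of 1013, not a whole number
nearest_count = round(exact_value)        # a count of people must be whole

# Assumption: the reported 27% was rounded to the nearest whole percent,
# so any count whose percentage rounds to 27 could be the actual number.
possible_counts = [k for k in range(respondents + 1)
                   if round(100 * k / respondents) == 27]

not_harmful_pct = 406 / respondents * 100

print(round(exact_value, 2))
print(nearest_count)
print(possible_counts[0], possible_counts[-1])
print(round(not_harmful_pct, 2))
```

Rounding 273.51 gives 274, but under the rounding assumption several nearby whole numbers are also consistent with a reported 27%.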

A polling company reported that 59% of 2302 surveyed adults said that they play basketball.

What is the exact value that is 59% of 2302?

- 1358.18 (.59*2302)

Could the result from part A be the actual number of adults who said they play basketball?

- No, the result from part A could not be the actual number of adults who said they play basketball because a count of people must result in a whole number.

What could be the actual number of adults who said they play basketball?

- = 1358

Among the 2302 respondents, 301 said that they only play hockey. What percentage of respondents said that they only play hockey?

- = 13.08% (301/2302)

Determine whether the data described below are qualitative or quantitative and explain why.

The types of food served by restaurants.

- The data are qualitative because they don’t measure or count anything.

State whether the data described below are discrete or continuous.

The populations of cities

- The data are discrete because the data can only take on specific values

Determine whether the given value is a statistic or a parameter

A homeowner measured the voltage supplied to his home on 5 days of a given week, and the average value is 147.6

- The given value is a statistic for the week because the data collected represent a sample

A particular country has 50 total states. If the areas of all 50 states are added and the sum is divided by 50, the result is 194,953 square kilometers. Determine whether this result is a statistic or a parameter.

- The result is a parameter because it describes some characteristic of a population

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

State whether the data described below are discrete or continuous.

The numbers of people looking at a website at different times.

- The data are discrete because the data can only take on specific values

Determine whether the value given below is from a discrete or continuous data set.

When a car is randomly selected and weighed, it is found to weigh 1531.3 kg.

- A continuous data set because there are infinitely many possible values and those values cannot be counted.

Determine whether the value is from a discrete or continuous data set

Number of beats in a song is 5

- Discrete because it is countable

The nominal level consists of categories only; the data cannot be arranged in an ordering scheme.

The ordinal level consists of categories that can be ordered, but differences between values either cannot be found or are meaningless.

The interval level applies when differences are meaningful but there is no natural zero starting point.

The ratio level applies when there is a natural zero starting point and ratios are meaningful.

Determine which of the four levels of measurement is most appropriate for the data below

Body temperature in degrees Fahrenheit

- The interval level of measurement is most appropriate because the data can be ordered, differences can be found and are meaningful, and there is no natural starting zero point.

Determine which of the four levels of measurement is most appropriate

Favorite types of music

- Nominal

Determine which of the four levels of measurement is most appropriate

Ages of children: 5,6,7,8 and 9

- Ratio

Determine which of the four levels of measurement is most appropriate for the data below.

Volume of planets in cubic meters

- The ratio level of measurement is most appropriate because the data can be ordered, differences can be found and are meaningful, and there is a natural starting zero point.

Identify the level of measurement of the data, and explain what is wrong with the given calculation.

In a survey, the hair colors of respondents are identified as 10 for brown hair, 20 for blonde hair, 30 for black hair, and 40 for anything else. The average is calculated for 702 respondents and the result is 22.3

The data are at the _________ level of measurement

- Nominal

What is wrong with the given calculation?

- Such data are not counts or measures of anything, so it makes no sense to compute their average

Identify the level of measurement of the data, and explain what is wrong with the given calculation.

In a set of data, course grades are represented as 10 for A, 20 for B, and 30 for C. The average of the 769 course grades is 25.4.

The data are at the _____ level of measurement.

- Ordinal

What is wrong with the given calculation?

- Such data should not be used for calculation such as an average

Which of the following is associated with a parameter?

- Data that were obtained from an entire population. (A parameter is a numerical measurement describing some characteristic of a population. So, a parameter is associated with data that were obtained from an entire population.)

Which level of measurement consists of categories only where data cannot be arranged in an ordering scheme?

- Nominal. (The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in an ordering scheme such as low to high.)

Determine whether the given description corresponds to an observational study or an experiment?

In a study of 405 men with a particular disease, the subjects were photographed daily.

- The given description corresponds to an observational study

In a double blind experiment designed to test the effectiveness of a new medication as a treatment for lower back pain, 1643 patients were randomly assigned to one of three groups.

What does it mean to say that the experiment was double blind?

- The subjects in the study did not know whether they were taking a placebo or the new medication, and those who administered the pills also did not know.

In a study designed to test the effectiveness of a medication as a treatment for lower back pain, 1643 patients were randomly assigned to one of three groups. In what specific way was replication applied in the study?

Replication is the repetition of an experiment on more than one individual.

- The group sample sizes are all large so the researchers could see the effects of the treatment.

Determine whether the description corresponds to an observational study or an experiment.

Fifty patients with lung cancer are divided into two groups. One group receives an experimental drug to fight cancer, the other a placebo. After two years the spread of the cancer is measured.

Does the description correspond to an observational study or an experiment?

- Experiment

Identify the type of sampling used in the situation below.

In a poll conducted by a certain researcher, 980 adults were called after their telephone numbers were randomly generated by a computer, and 61% were able to correctly identify the vice president.

- Random sampling

Identify which type of sampling is used: random, systematic, convenience, stratified, or cluster.

A magazine asks its readers to call in their opinion regarding the quality of the articles.

- Convenience sampling

Determine whether the study is an experiment or an observational study, and identify a major problem with the study.

In a survey, 1465 internet users chose to respond to this question posted on a newspaper electronic edition. Is news online as satisfying as print and TV news? 52% of the respondents said yes.

- This is an observational study because the researchers do not attempt to modify the individuals.

What is a major problem with the study?

- This is a convenience sample with voluntary response, which has a high chance of leading to bias.

Determine whether the study is an experiment or an observational study, and then identify a major problem with the study.

A study involved 22071 male physicians. Based on random selections, 11037 of them were treated with aspirin and the other 11034 were given placebos. The study was stopped early because it became clear that aspirin reduced the risk of myocardial infarctions by a substantial amount.

This is an_______

- Experiment

Because the researchers_______

- Apply a treatment to the individuals

What is a major problem with the study?

- The results apply only to male physicians

Determine whether the study is an experiment or an observational study, and then identify a major problem with the study.

A medical researcher tested for a difference in systolic blood pressure levels between male and female students who are 12 years of age. She randomly selected four males and four females for her study.

- This is an observational study because the researcher does not attempt to modify the individuals

What is a major problem with the study?

- The sample is too small

_____ is used when subjects are assigned to different groups through a process of random selection.

Randomization is used when subjects are assigned to different groups through a process of random selection. The logic behind randomization is to use chance as a way to create two groups that are similar. Although it might seem that we should not leave anything to chance in experiments, randomization has been found to be an extremely effective method for assigning subjects to groups.

- Randomization

A study is conducted to measure children’s growth rates without any treatment applied to the children. What best classifies this study?

- Observational study (An observational study involves observing and measuring specific characteristics without attempting to modify the subjects being studied)

Which of the following corresponds to the case when every sample of size n has the same chance of being chosen?

- Simple random sample (A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen)
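Two of the sampling methods named in these exercises are easy to sketch in code. The example below assumes a hypothetical frame of 500 member IDs and uses Python's standard `random` module. It draws a simple random sample and, for contrast, a systematic sample (random start, then every k-th member). The frame, seed, and sample size are illustrative assumptions.

```python
import random

rng = random.Random(42)             # fixed seed so the sketch is repeatable
frame = list(range(1, 501))         # hypothetical frame: member IDs 1..500
n = 20

# Simple random sample: every possible group of n members is equally likely
simple_random = rng.sample(frame, n)

# Systematic sample: random starting point, then every k-th member
k = len(frame) // n
start = rng.randrange(k)
systematic = frame[start::k]

print(sorted(simple_random))
print(systematic)
```

The systematic sample is random only in its starting point, so not every group of 20 members can occur. That is why it is not a simple random sample.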

**What Statistics Is All About**

One of the first considerations is designing appropriate studies. The purpose is to collect data, which can be done with either surveys or experiments. One of the most popular ways to collect data is the observational study, in which data are collected from individuals in a way that does not affect them. Surveys have to be worded carefully to get good information.

An experiment is another popular way to gather data. It involves treatments on participants so that clear comparisons can be made. After treatments are made, responses are recorded.

Collecting quality data is a major consideration. It really does no good to get bad data. So, studies and experiments must be planned well. Once you have good data, you can make a good report on what you found. To minimize bias in a survey, you have to be random when selecting participants.

**Descriptive Statistics**

These are numerical values that describe a data set. This is usually done through different types of categories. If the data are categorical they are usually summarized using the number of individuals in each group. This is called the frequency. If you use the percentage of individuals, it is called the relative frequency.

Numerical data represent measurements or counts. You can do more with numerical data. For example, you can get the measure of center and the measure of spread in the data.

Some descriptive statistics are more appropriate than others in certain situations. The average is not always the best measure of the center of a data set.
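A small example shows why the average can be a poor measure of center. The salaries below are hypothetical and chosen so that one value is extreme. The sketch uses Python's standard `statistics` module.

```python
from statistics import mean, median

# Hypothetical annual salaries in thousands of dollars; one value is extreme
salaries = [42, 45, 47, 50, 52, 55, 400]

average = mean(salaries)      # pulled upward by the single extreme value
middle = median(salaries)     # middle value of the sorted data

print(round(average, 1))
print(middle)
```

Here the mean is larger than six of the seven salaries, while the median stays near the typical value.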

**Charts and Graphs**

Data is summarized in a visual way using charts and graphs. These are displays that are organized to give you a big picture of the data.

Some of the basic graphs used for categorical data include pie charts and bar graphs. These break down variables in the data.

For numerical data, a different type of graph is needed. Histograms and box plots are usually used to represent numerical data. These types of graphs make it easier to visualize the data.

**Distributions**

A variable is a characteristic that is being counted or measured. A distribution is a listing of the possible values of a variable and how often they occur.

Different types of distributions exist for different types of variables.

If a variable is counting the number of successes in a certain number of trials, it has a binomial distribution.

If the variable takes on values that occur according to a bell-shaped curve, then that variable has a normal distribution.

If the variable is based on sample averages and you have limited data, the t-distribution may be in order.

When it comes to distributions, you need to know how to decide which distribution a particular variable has, how to find probabilities for it, and how to figure out what the long-term average and standard deviation of the outcomes would be.
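For the binomial case, the probability, long-term average, and standard deviation all have standard formulas. The sketch below assumes a hypothetical setting of 10 fair coin flips with "heads" as success. The function name `binomial_pmf` is illustrative and not from any library.

```python
from math import comb, sqrt

def binomial_pmf(k, n, p):
    """Probability of exactly k successes in n independent trials."""
    return comb(n, k) * p**k * (1 - p)**(n - k)

n, p = 10, 0.5   # hypothetical: 10 fair coin flips, success = heads

prob_exactly_5 = binomial_pmf(5, n, p)
mean = n * p                         # long-term average number of successes
std_dev = sqrt(n * p * (1 - p))      # standard deviation of the count

print(round(prob_exactly_5, 4), mean, round(std_dev, 3))
```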

**Performing Analyses**

After data has been collected and described, it is time to do the statistical analysis. There are many types of analyses. You have to choose the appropriate type for your data.

You often see statistics that try to estimate numbers pertaining to an entire population. However, it is just an estimate and most studies only ask a small number of people their questions. What happens is that data is collected on a small sample of people. Sometimes the results they get are very inaccurate.

Sample results vary from sample to sample, and this amount of variability needs to be reported, but usually it is not. The statistic used to measure and report the level of precision in someone’s sample result is called the margin of error. The range of values formed by the sample result plus or minus the margin of error is called the confidence interval.
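As a rough illustration, the margin of error for a sample proportion can be computed with the usual large-sample formula. The sketch reuses the poll from an earlier exercise in these notes (980 adults, 61% answered correctly). It assumes a 95% confidence level (critical value 1.96) and a simple random sample, neither of which is stated in the notes.

```python
from math import sqrt

n = 980          # adults polled
p_hat = 0.61     # sample proportion who identified the vice president
z = 1.96         # assumed critical value for 95% confidence

# Large-sample margin of error for a proportion
margin_of_error = z * sqrt(p_hat * (1 - p_hat) / n)

# Confidence interval: sample result plus or minus the margin of error
lower = p_hat - margin_of_error
upper = p_hat + margin_of_error

print(f"margin of error = {margin_of_error:.3f}")
print(f"confidence interval = ({lower:.3f}, {upper:.3f})")
```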

**Hypothesis Tests**

One major staple of research studies is called hypothesis testing. A hypothesis test is a technique for using data to validate or invalidate a claim about a population.

The elements about a population that are most often tested are:

- The population mean
- The population proportion
- The difference in two population means or proportions

Hypothesis tests are used in a host of areas that affect your everyday life, such as medical studies, advertisements, and polling data. Often you only hear the conclusions of hypothesis tests but you don’t see the methods used to come to these conclusions.

**Drawing Conclusions**

To perform statistical analyses, researchers use software that depends on formulas, and those formulas have to be used correctly. One of the most common mistakes made in conclusions is overstating the results. Until you do a controlled experiment, you can’t make a cause-and-effect conclusion based on relationships you find.

Statistics is about much more than numbers. You need to understand how to make appropriate conclusions from studying data and be smart enough to not believe everything you read.

**Working with Tables and Graphs**

When working with large data sets, a frequency distribution is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.

**Frequency Distribution**

A frequency distribution or table shows how data are partitioned among several categories by listing the categories along with the number of data values in each of them.

Lower class limits are the smallest numbers that can belong to each of the different classes. Upper class limits are the largest numbers that can belong to each of the different classes. Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits. Class midpoints are the values in the middle of the classes. Class width is the difference between two consecutive lower class limits in a frequency distribution.

Finding the correct class width can be tricky. For class width, don’t make the most common mistake of using the difference between a lower class limit and an upper class limit. For class boundaries, remember that they split the difference between the end of one class and the beginning of the next class.

We construct frequency distributions to:

- Summarize large data sets
- See the distribution and identify outliers
- Have a basis for constructing graphs

Technology can generate frequency distributions but these are the common steps:

- Select the number of classes, usually between 5 and 20
- Calculate class width: \(\frac{\text{max data value} - \text{min data value}}{\text{number of classes}}\)
- Round this result to get a convenient number
- Choose the value for the first lower class limit by using either the min value or a convenient value below the minimum.
- Using the first lower class limit and the class width, list the other lower class limits.
- List the lower class limits in a vertical column and then determine and enter the upper class limits.
- Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to find the total frequency for each class.
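The steps above can be sketched in Python. The example uses the 20 BMI values from an exercise later in these notes. The choices of 5 classes, rounding the width up to a whole number, and starting at 15 are assumptions made for illustration.

```python
import math

bmi = [17.7, 33.5, 26.3, 25.9, 22.9, 27.1, 21.9, 18.3, 27.7, 22.9,
       19.2, 22.3, 23.7, 37.7, 32.4, 27.8, 44.9, 30.6, 28.7, 22.9]

num_classes = 5                                  # assumed number of classes
raw_width = (max(bmi) - min(bmi)) / num_classes
class_width = math.ceil(raw_width)               # round up to a convenient number
first_lower = 15                                 # convenient value below the minimum

lower_limits = [first_lower + i * class_width for i in range(num_classes)]

# Tally each value into the class whose range contains it
frequencies = [0] * num_classes
for value in bmi:
    index = int((value - first_lower) // class_width)
    frequencies[index] += 1

for low, freq in zip(lower_limits, frequencies):
    high = low + class_width - 0.1               # data are recorded to one decimal
    print(f"{low:.1f}-{high:.1f}  {freq}")
```

Rounding the width up rather than to the nearest value guarantees that the chosen number of classes covers the largest data value.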

**Relative Frequency Distribution**

A variation of the basic frequency distribution is a relative frequency distribution. Each class frequency is replaced by a relative frequency as a percentage.

\[ \text{relative frequency} = \frac{\text{frequency for class}}{\text{sum of frequencies}} \times 100\% \]

This will give you the frequency percentage.

The sum of the percentages in a relative frequency distribution will be very close to 100 percent.

Another variation of a frequency distribution is a cumulative frequency distribution in which the frequency for each class is the sum of the frequencies for that class and all previous classes.
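Both variations are short computations once the class frequencies are known. The sketch below uses the class frequencies from a cumulative frequency exercise later in these notes. Rounding to one decimal place is an arbitrary choice.

```python
from itertools import accumulate

# Class frequencies taken from a cumulative frequency exercise in these notes
frequencies = [25, 35, 11, 2, 4, 1, 1]
total = sum(frequencies)

# Relative frequency: each class frequency as a percentage of the total
relative = [round(f / total * 100, 1) for f in frequencies]

# Cumulative frequency: running total of this class and all previous classes
cumulative = list(accumulate(frequencies))

print(relative)
print(cumulative)
```

The last cumulative frequency always equals the total number of data values, and the rounded percentages add up to roughly 100.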

At the beginning we noted that a frequency distribution can help us understand the distribution of a data set, which is the nature or shape of the spread of the data over the range of values. In statistics, we are often interested in determining whether the data have a normal distribution. Data that have an approximately normal distribution are characterized by a frequency distribution with the following features:

- The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
- The distribution is approximately symmetric. Frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.

The presence of gaps can suggest that the data are from two or more different populations.

Comparing two or more relative frequency distributions in one table makes comparisons of data much easier.

While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.

A histogram visually displays the shape of the distribution of the data. It shows the location of the center of the data. Histograms show the spread of data and can also identify outliers.

A histogram is basically a graph of a frequency distribution. Class frequencies should be used for the vertical scale and that scale should be labeled. There is no universal agreement on the procedure for selecting which values are used for the bar locations along the horizontal scale, but it is common to use class boundaries, class midpoints, class limits, or something else. It is often easier for us to use class midpoints for the horizontal scale. Histograms can usually be generated using technology.

A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies.

The ultimate objective of using histograms is to be able to understand characteristics of data. Exploring the data means to:

- Find the center of the data
- Find the variation
- Find the shape of the distribution
- Find any outliers
- Find the change of data over time

When a graph is said to be skewed to the right, it means the histogram shape has a tail on the right.

When a graph is said to be skewed to the left, it means the histogram shape has a tail on the left.

A bell-shaped distribution is called a normal distribution and has its highest frequencies in the middle.

A uniform distribution produces a histogram with bars of roughly the same height all the way across.

Many statistical methods require that sample data come from a population having a distribution that is approximately a normal distribution.

In a uniform distribution, the different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform.

A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right, called positively skewed, have a longer right tail.

Data skewed to the left, called negatively skewed, have a longer left tail.

Some really important methods have a requirement that sample data must be from a population having a normal distribution. Histograms can be helpful in determining whether the normality requirement is satisfied, but they are not very helpful with very small data sets.

The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line, and the points do not show some systematic pattern that is not a straight-line pattern.

The population distribution is not normal if the normal quantile plot has either or both of these two conditions:

- The points do not lie reasonably close to a straight-line pattern
- The points show some systematic pattern that is not a straight-line pattern
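The coordinates of a normal quantile plot can be computed without plotting software. The sketch below uses hypothetical data and pairs each sorted value with the z score of its cumulative position. The plotting positions `(i - 0.5) / n` are one common convention and are an assumption here, since statistical packages differ on this detail.

```python
from statistics import NormalDist

# Hypothetical small sample (for example, heights in inches)
data = [61.2, 63.5, 64.1, 65.0, 65.8, 66.4, 67.3, 68.9, 70.2, 72.6]

n = len(data)
sorted_data = sorted(data)

# z score for each cumulative position (one common convention; others exist)
z_scores = [NormalDist().inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Each (z, x) pair is one point on the normal quantile plot
points = list(zip(z_scores, sorted_data))
for z, x in points:
    print(f"{z:6.2f}  {x}")
```

If these points fall close to a straight line when plotted, the normality requirement is plausible. A curved or otherwise systematic pattern suggests that it is not.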

**Graphs that Enlighten**

A dot plot is a graph of quantitative data in which each data value is plotted as a point above a horizontal scale of values. Dots representing equal values are stacked.

A dot plot:

- Displays the shape of the distribution of data
- It is usually possible to recreate the original list of data values.

A stem plot is another type of graph and it represents quantitative data by separating each value into two parts: the stem and the leaf. Better stem plots are often obtained by first rounding the original data values. Also, stem plots can be expanded to include more rows and can be condensed to include fewer rows.

Stem plots:

- Show the shape of the distribution of data
- Retain the original data values
- Sort the sample data

A time-series graph is a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.

Time-series graphs:

- Reveal information about trends over time

Bar graphs use bars of equal width to show frequencies of categories of categorical data. The bars may or may not be separated by small gaps.

Bar graphs:

- Show the relative distribution of categorical data so that it is easier to compare the different categories.

A pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right.

Pareto charts:

- Show the relative distribution of categorical data so that it is easier to compare the different categories.
- Draw attention to the more important categories.

A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Although pie charts are very common, they are not as effective as Pareto charts.

Pie charts:

- Show the distribution of categorical data in a commonly used format.

Try to never use pie charts because they waste ink on components that are not data, and they lack an appropriate scale.

A frequency polygon uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.

A variation of the basic frequency polygon is the relative frequency polygon, which uses relative frequencies for the vertical scale. An advantage of relative frequency polygons is that two or more of them can be combined on a single graph for easy comparison.

**Graphs that Deceive**

Deceptive graphs are commonly used to mislead people. Graphs should be constructed in a way that is fair and objective.

A common deceptive graph starts the vertical scale at some value greater than zero to exaggerate differences between groups. This is called a nonzero vertical axis. Always examine a graph carefully to see whether the vertical axis begins at some point other than zero so that differences are exaggerated.

Pictographs are another type of chart that are used to mislead. Data that are one-dimensional in nature are often depicted with two-dimensional objects or three-dimensional objects. By using pictographs, artists can create false impressions that grossly distort differences by using these same principles of basic geometry:

- When you double each side of a square, its area doesn’t merely double, it increases by a factor of four
- When you double each side of a cube, its volume doesn’t merely double, it increases by a factor of eight

When examining data depicted with a pictograph, determine whether the graph is misleading because objects of area or volume are used to depict amounts that are actually one-dimensional.

For small data sets of 20 values or fewer, use a table instead of a graph. A graph of data should make us focus on the true nature of the data, not on other elements, such as eye-catching but distracting design features. Do not distort data. Construct a graph to reveal the true nature of the data. Almost all of the ink in a graph should be used for the data, not for the design elements.

A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.

A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired quantitative data in which the horizontal axis is used for the first variable x and the vertical axis is used for the second variable y.

The presence of correlation between two variables is not evidence that one of the variables causes the other. We might find a correlation between beer consumption and weight, but we cannot conclude from the statistical evidence that drinking beer has a direct effect on weight.

A scatterplot can be very helpful in determining whether there is a correlation between the two variables.

The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables.

When we conclude that there appears to be a linear correlation between two variables, we can find the equation of the straight line that best fits the sample data, and that equation can be used to predict the value of one variable when given a specific value of the other variable. Instead of using the straight-line equation \(y = mx + b\) that we have all learned in prior math courses, we use the format \(\hat{y} = b_0 + b_1 x\), where \(b_0\) is the y-intercept and \(b_1\) is the slope.

Given a collection of paired sample data, the regression line, or line of best fit, is the straight line that best fits the scatter plot of the data.
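To make the correlation coefficient and the regression line concrete, the sketch below computes both for the five car weight and fuel consumption pairs from an earlier exercise in these notes. It uses the standard least-squares formulas. The names `b0` and `b1` for the intercept and slope follow a common textbook convention and are otherwise arbitrary.

```python
from math import sqrt

weights = [4034, 3364, 4179, 3674, 3599]   # x: car weights
fuel = [26, 32, 28, 29, 30]                # y: highway fuel consumption

n = len(weights)
mean_x = sum(weights) / n
mean_y = sum(fuel) / n

# Sums of squared and cross deviations from the means
sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(weights, fuel))
sxx = sum((x - mean_x) ** 2 for x in weights)
syy = sum((y - mean_y) ** 2 for y in fuel)

r = sxy / sqrt(sxx * syy)          # linear correlation coefficient
b1 = sxy / sxx                     # slope of the regression line
b0 = mean_y - b1 * mean_x          # y-intercept of the regression line

print(f"r = {r:.3f}")
print(f"y-hat = {b0:.2f} + ({b1:.5f})x")
```

The negative value of r reflects that the heavier cars in this small sample tend to have lower fuel consumption values. With only five pairs, it describes the sample and does not by itself establish a relationship in the population.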

Round the number to the nearest ten

66,843.908

- 66,840

Round the number to the nearest hundredth

-0.451

- -0.45

Simplify

27/90

- 3/10

Write the percentage as a decimal number

55%

- 0.55

Write the percent as a simplified fraction

88%

- 22/25

Write the fraction as a percent

2/13

- 15.38%

Practice questions for a textbook are marked with difficulty levels of easy, intermediate, and difficult. If 46 of the 147 practice questions are rated as intermediate, approximately what percentage of the questions are intermediate level?

- 31%

Find the percentage of total calories from fat

Calories=120, Calories from fat=20

- 16.7%

What is 27% of 23

- 6.21
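The review answers above can be checked with Python's built-in `round` and the standard `fractions` module. This is only a sketch, and the variable names are illustrative. Note that `round` rounds exact ties to the nearest even digit, which does not affect any of the values here.

```python
from fractions import Fraction

nearest_ten = round(66843.908, -1)          # round to the nearest ten
nearest_hundredth = round(-0.451, 2)        # round to the nearest hundredth
simplified = Fraction(27, 90)               # Fraction reduces to lowest terms
percent_as_decimal = 55 / 100
percent_as_fraction = Fraction(88, 100)
fraction_as_percent = round(2 / 13 * 100, 2)
intermediate_pct = round(46 / 147 * 100)    # share of intermediate questions
fat_pct = round(20 / 120 * 100, 1)          # calories from fat
part_of_total = round(0.27 * 23, 2)         # 27% of 23

print(nearest_ten, nearest_hundredth, simplified, percent_as_decimal)
print(percent_as_fraction, fraction_as_percent, intermediate_pct)
print(fat_pct, part_of_total)
```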

What is the y coordinate of (2,1)

- 1

How many individuals are included in the summary?

- 52

Is it possible to identify the exact values of all the original service times?

- No. The data values in each class could take on any value between the class limits, inclusive

A frequency table of grades has 5 classes (A, B, C, D, F) with frequencies of 2, 13, 16, 7, and 1 respectively. Using percentages, what are the relative frequencies of the five classes?

- A: 5.13%
- B: 33.33%
- C: 41.03%
- D: 17.95%
- F: 2.56%

Heights of adult males are known to have a normal distribution. A researcher claims to have randomly selected adult males and measured their heights with the resulting relative frequency distribution as shown here. Identify two major flaws with these results.

- The sum of the relative frequencies is 124%, but it should be 100%, with a small possible round-off error
- All of the relative frequencies appear to be roughly the same. If they are from a normal distribution, they should start low, reach a maximum, and then decrease

Identify the lower class limits, upper class limits, class width, class midpoints, and class boundaries for the given frequency distribution. Also identify the number of individuals in the summary.

| Age | Frequency |
|---|---|
| 15-24 | 29 |
| 25-34 | 33 |
| 35-44 | 14 |
| 45-54 | 4 |
| 55-64 | 6 |
| 65-74 | 1 |
| 75-84 | 1 |

Identify the lower class limits

- 15,25,35,45,55,65,75

Identify the upper class limits

- 24,34,44,54,64,74,84

Identify the class width

- 10

Identify the class midpoints

- 19.5, 29.5, 39.5, 49.5, 59.5, 69.5, 79.5

Identify the class boundaries

- 14.5, 24.5, 34.5, 44.5, 54.5, 64.5, 74.5, 84.5

Identify the number of individuals in the summary

- 88
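The quantities asked for in this exercise all follow from the class limits. The sketch below recomputes them for the age table above. The variable names are illustrative.

```python
lower_limits = [15, 25, 35, 45, 55, 65, 75]
upper_limits = [24, 34, 44, 54, 64, 74, 84]
frequencies = [29, 33, 14, 4, 6, 1, 1]

# Class width: difference between two consecutive lower class limits
class_width = lower_limits[1] - lower_limits[0]

# Class midpoint: halfway between the lower and upper limit of each class
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Boundaries split the gap between one upper limit and the next lower limit
gap = lower_limits[1] - upper_limits[0]
boundaries = [lower_limits[0] - gap / 2] + [hi + gap / 2 for hi in upper_limits]

total = sum(frequencies)

print(class_width, midpoints, boundaries, total)
```

Note that the width is 10, not 9. Subtracting a lower limit from the upper limit of the same class is the common mistake mentioned earlier in these notes.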

Identify the lower class limits, upper class limits, class midpoints, and class boundaries for the given frequency distribution. Also identify the number of individuals included in the summary.

| Platelet Count | Frequency |
|---|---|
| 100-199 | 24 |
| 200-299 | 91 |
| 300-399 | 30 |
| 400-499 | 0 |
| 500-599 | 4 |

Identify the lower class limits

- 100, 200, 300, 400, 500

Identify the upper class limits

- 199, 299, 399, 499, 599

Identify the class width

- 100

Identify the class midpoints

- 149.5, 249.5, 349.5, 449.5, 549.5

Identify the class boundaries

- 99.5, 199.5, 299.5, 399.5, 499.5, 599.5

Identify the number of individuals in the summary

- 149

Does the frequency distribution appear to have a normal distribution using a strict interpretation of the relevant criteria?

| Temp | Frequency |
|---|---|
| 45-49 | 3 |
| 50-54 | 0 |
| 55-59 | 6 |
| 60-64 | 13 |
| 65-69 | 7 |
| 70-74 | 6 |
| 75-79 | 1 |

Does the frequency distribution appear to have a normal distribution?

- No, the distribution does not appear to be normal

Does the frequency distribution appear to have a normal distribution?

| Temp | Frequency |
|---|---|
| 40-44 | 1 |
| 45-49 | 2 |
| 50-54 | 5 |
| 55-59 | 14 |
| 60-64 | 5 |
| 65-69 | 4 |
| 70-74 | 1 |

- Yes, because the frequencies start low, proceed to one or two higher frequencies, then decrease to a low frequency, and the distribution is approximately symmetric

The data represents the BMI values for 20 females. Construct a frequency distribution beginning with a lower class limit of 15 and use a class width of 6.0

17.7 33.5 26.3 25.9 22.9

27.1 21.9 18.3 27.7 22.9

19.2 22.3 23.7 37.7 32.4

27.8 44.9 30.6 28.7 22.9

| BMI | Frequency |
| --- | --- |
| 15.0-20.9 | 3 |
| 21.0-26.9 | 8 |
| 27.0-32.9 | 6 |
| 33.0-38.9 | 2 |
| 39.0-44.9 | 1 |
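
The tally above can be double-checked with a short Python sketch. It assumes the values are recorded to one decimal place (so each upper class limit is 0.1 below the next lower limit) and that five classes are enough to cover the data; the variable names are my own.

```python
# BMI values for the 20 females, copied from the notes.
bmi = [17.7, 33.5, 26.3, 25.9, 22.9,
       27.1, 21.9, 18.3, 27.7, 22.9,
       19.2, 22.3, 23.7, 37.7, 32.4,
       27.8, 44.9, 30.6, 28.7, 22.9]

first_lower_limit = 15.0
class_width = 6.0
num_classes = 5  # assumption: enough classes to reach the maximum of 44.9

# Count how many values fall into each class.
counts = [0] * num_classes
for value in bmi:
    index = int((value - first_lower_limit) // class_width)
    counts[index] += 1

# Print each class with its frequency.
for i, count in enumerate(counts):
    lower = first_lower_limit + i * class_width
    upper = lower + class_width - 0.1  # one-decimal data assumed
    print(f"{lower:.1f}-{upper:.1f}  {count}")
```

The counts come out as 3, 8, 6, 2, and 1, which sum to the 20 values in the data set.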

The following data show the ages of recent award-winning male actors at the time when they won their award. Make a frequency table for the data, using bins of 20-29, 30-39, and so on

| Age | Number |
| --- | --- |
| 20-29 | 2 |
| 30-39 | 10 |
| 40-49 | 11 |
| 50-59 | 7 |
| 60-69 | 2 |
| 70-79 | 2 |

Construct one table that includes relative frequencies based on the frequency distributions shown below. Compare the amounts of tar in nonfiltered and filtered cigarettes.

| Tar | Nonfiltered | Filtered |
| --- | --- | --- |
| 6-10 | 0% | 8% |
| 11-15 | 0% | 12% |
| 16-20 | 4% | 20% |
| 21-25 | 4% | 60% |
| 26-30 | 52% | 0% |
| 31-35 | 28% | 0% |
| 36-40 | 12% | 0% |

Do cigarette filters appear to be effective?

- Yes, because the relative frequencies of the higher tar classes are greater for nonfiltered cigarettes than for filtered cigarettes

Construct the cumulative frequency distribution for the given data

20-29 25

30-39 35

40-49 11

50-59 2

60-69 4

70-79 1

80-89 1

Less than 30 = 25

Less than 40 = 60

Less than 50 = 71

Less than 60 = 73

Less than 70 = 77

Less than 80 = 78

Less than 90 = 79
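
The running totals above are just cumulative sums of the class frequencies. A minimal Python sketch using `itertools.accumulate` from the standard library, with variable names of my own choosing:

```python
from itertools import accumulate

# Lower class limits and frequencies from the table (classes 20-29 through 80-89).
lower_limits = [20, 30, 40, 50, 60, 70, 80]
frequencies = [25, 35, 11, 2, 4, 1, 1]
class_width = 10

# Each cumulative frequency is the sum of that class and all classes before it.
cumulative = list(accumulate(frequencies))

for lower, running_total in zip(lower_limits, cumulative):
    print(f"Less than {lower + class_width} = {running_total}")
```

The last cumulative frequency (79) should always equal the total number of values.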

Construct the cumulative frequency distribution for the given data

| Daily Low | Frequency |
| --- | --- |
| 35-39 | 2 |
| 40-44 | 4 |
| 45-49 | 5 |
| 50-54 | 12 |
| 55-59 | 7 |
| 60-64 | 8 |
| 65-69 | 1 |

| Daily Low | Cumulative Frequency |
| --- | --- |
| Less than 40 | 2 |
| Less than 45 | 6 |
| Less than 50 | 11 |
| Less than 55 | 23 |
| Less than 60 | 30 |
| Less than 65 | 38 |
| Less than 70 | 39 |

Among fatal plane crashes that occurred during the past 55 years, 466 were due to pilot error, 79 were due to other human error, 517 were due to weather, 343 were due to mechanical problems, and 485 were due to sabotage.

Construct the relative frequency distribution. What is the most serious threat to aviation safety, and can anything be done about it?

Total crashes=1890

Crashes per year=1890/55 = 34.36 fatal crashes per year

Pilot error = 24.7%

Other human error = 4.2%

Weather = 27.4%

Mechanical problems = 18.1%

Sabotage = 25.7%

What is the most serious threat to aviation safety and can anything be done about it?

- Weather is the most serious threat to aviation safety. Weather monitoring systems could be improved.

Use the given categorical data to construct the relative frequency distribution

Natural births randomly selected from four hospitals in a highly populated region occurred on the days of the week with the frequencies 53, 64, 71, 57, 54, 46, and 55. Does it appear that such births occur on the days of the week with equal frequency?

Total births = 400

| Day | Relative Frequency |
| --- | --- |
| Mon | 13.25% |
| Tue | 16% |
| Wed | 17.75% |
| Thur | 14.25% |
| Fri | 13.5% |
| Sat | 11.5% |
| Sun | 13.75% |

Let the frequencies be substantially different if any frequency is at least twice any other frequency. Does it appear that these births occur on the days of the week with equal frequency?

- Yes, it appears that births occur on the days of the week with frequencies that are about the same.
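
The relative frequencies and the "at least twice" rule can be checked with a short Python sketch. It assumes the seven frequencies are listed in Monday-to-Sunday order, as in the table; the variable names are my own.

```python
days = ["Mon", "Tue", "Wed", "Thur", "Fri", "Sat", "Sun"]
births = [53, 64, 71, 57, 54, 46, 55]

total = sum(births)

# Relative frequency: each count as a percentage of the total.
relative = [100 * count / total for count in births]
for day, percent in zip(days, relative):
    print(f"{day} {percent:.2f}%")

# Rule from the question: frequencies are "substantially different"
# if any frequency is at least twice any other frequency.
substantially_different = max(births) >= 2 * min(births)
print(substantially_different)  # False, since 71 is less than 2 * 46
```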

Which characteristic of data is a measure of the amount that the data values vary?

- Variation

_____ are sample values that lie very far away from the majority of the other sample values.

- Outliers

_____ helps us understand the nature of the distribution of a data set.

- Frequency distribution

In a _____ distribution, the frequency of a class is replaced with a proportion or percent.

- Relative frequency

Heights of adult males are normally distributed. If a large sample of heights of adult males is randomly selected and the heights are illustrated in a histogram, what is the shape of that histogram?

- Bell-shaped

If we collect a large sample of blood platelet counts and if our sample includes a single outlier, how will that outlier appear in a histogram?

- The outlier will appear as a bar far from all of the other bars with a height that corresponds to a frequency of 1.

Listed below are body temperatures of healthy adults. Why is it that a graph of these data would not be very effective in helping us understand the data?

- The data set is too small for a graph to reveal important characteristics of the data

If we have a large voluntary response sample consisting of weights of subjects who chose to respond to a survey posted on the internet, can a graph help to overcome the deficiency of having a voluntary response sample?

- No, a graph cannot help to overcome the deficiency. If the sample is a bad sample, there are no graphs or other techniques that can be used to salvage the data

How does the stem-and-leaf plot show the distribution of data?

- The lengths of the rows are similar to the heights of bars in a histogram, longer rows of data correspond to higher frequencies.

The accompanying data represent women’s median earnings as a percentage of men’s median earnings for recent years beginning with 1989. Is there a trend?

- There is a general upward trend though there have been some down years. An upward trend would be helpful to women so that their earnings become equal to those of men.

In a study of retractions in biomedical journals, 431 were due to error, 205 were due to plagiarism, 812 were due to fraud, 306 were due to duplication of publications, and 285 had other causes. Does misconduct appear to be a major factor?

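- Yes, misconduct appears to be a major factor because the majority of retractions were due to misconduct: plagiarism, fraud, and duplication together account for 1323 of the 2039 retractions, or about 65%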

The graph to the right uses cylinders to represent barrels of oil consumed by two countries. Does the graph distort the data or does it depict the data fairly?

- Yes, the graph distorts the data because it uses three-dimensional objects (volumes) to represent one-dimensional quantities, which exaggerates the difference between them

In this section we use r to denote the value of the linear correlation coefficient. Why do we refer to this correlation coefficient as being linear?

- The term linear refers to a straight line, and r measures how well a scatter plot fits a straight-line pattern

If we find that there is a linear correlation between the concentration of carbon dioxide in our atmosphere and the global temperature, does that indicate that changes in the concentration of carbon dioxide causes changes in the global temperature?

- No, the presence of a linear correlation between two variables does not imply that one of the variables is the cause of the other variable.

What is a scatter plot and how does it help us?

- A scatter plot is a graph of paired (x,y) quantitative data. It provides a visual image of the data plotted as points, which helps show any patterns in the data.

For a data set of brain volumes and IQ scores of seven males, the linear correlation coefficient is r=.805. Use the table available to find the critical values of r. Based on a comparison of the linear correlation coefficient r and the critical values, what do you conclude about a linear correlation?

- The critical values are -.754, .754

Since the correlation coefficient r = .805 is:

- In the right tail, above the positive critical value of .754, there is sufficient evidence to support the claim of a linear correlation

For a data set of brain volumes and IQ scores of four males, the linear correlation coefficient is found and the P-value is .641. Write a statement that interprets the P-value and includes a conclusion about linear correlation.

- The P-value indicates that the probability of a linear correlation coefficient that is at least as extreme is 64.1%, which is high, so there is not sufficient evidence to conclude that there is a linear correlation between brain volumes and IQ scores in males.

For a data set of weights and highway fuel consumption amounts of ten types of automobile, the linear correlation coefficient is found and the P-value is .021. Write a statement that interprets the P-value and includes a conclusion about linear correlation.

- The P-value indicates that the probability of a linear correlation coefficient that is at least as extreme is 2.1% which is low, so there is sufficient evidence to conclude that there is a linear correlation between weight and highway fuel consumption in automobiles.

A magazine, which does not accept free products or advertisements from anyone, prints a review of new cars. Are there sources of bias in this situation?

- There do not appear to be any sources of bias

Determine whether the given value is a statistic or a parameter.

A sample of 568 doctors showed that 16% go to school

- The value is a statistic because it is a numerical measurement describing some characteristic of a sample

State whether the data described below are discrete or continuous, and explain why.

The number of eyes that different people have.

- The data are discrete because the data can only take on specific values

Determine whether the given value is a statistic or a parameter

Thirty percent of all dog owners poop scoop after their dog

- Parameter

Determine which of the four levels of measurement is most appropriate

Students' grades (A, B, C, D) on a test

- Ordinal

Determine which of the four levels of measurement is most appropriate

Level of satisfaction of survey respondents

- Ordinal

The following frequency distribution displays the scores on a math test. Find the class boundaries of the score interval 50-59

- 49.5, 59.5

Construct a stem and leaf plot of the test scores 68, 72, 85, 75, 89, 89, 87, 90, 98, 100

How does the stem and leaf plot show the distribution of these data

| Stem | Leaves |
| --- | --- |
| 6 | 8 |
| 7 | 25 |
| 8 | 5799 |
| 9 | 08 |
| 10 | 0 |

- The lengths of the rows are similar to the heights of bars in a histogram, longer rows of data correspond to higher frequencies
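
A stem-and-leaf plot like the one above can be built in Python by splitting each score into a tens part (the stem) and a ones digit (the leaf). This is a sketch under that assumption; note that it skips any stem that has no leaves.

```python
from collections import defaultdict

scores = [68, 72, 85, 75, 89, 89, 87, 90, 98, 100]

# Group the ones digits (leaves) under their tens part (stems).
stems = defaultdict(list)
for score in sorted(scores):
    stem, leaf = divmod(score, 10)
    stems[stem].append(leaf)

# One row per stem, with the leaves written side by side in order.
rows = [
    f"{stem} | {''.join(str(leaf) for leaf in leaves)}"
    for stem, leaves in sorted(stems.items())
]
for row in rows:
    print(row)
```

The longest row is the one for stem 8, which matches the class with the highest frequency.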

The linear _____ coefficient denoted by r measures the _____ of the linear association between two variables.

- Correlation, strength

Identify the type of sampling used (random, systematic, convenience, stratified, or cluster) in the situation described below.

A researcher selects every 324th social security number and surveys the corresponding person.

- Systematic
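
Systematic sampling takes every k-th member of the frame. A minimal Python sketch follows; the function `systematic_sample` and the frame of ID numbers are hypothetical, made up only to illustrate the idea with k = 324.

```python
def systematic_sample(frame, k, start=0):
    """Select every k-th member of the frame, beginning at index `start`."""
    return frame[start::k]


# Hypothetical frame: ID numbers 1 through 3240 (not real data).
frame = list(range(1, 3241))

# Take every 324th member, starting with the 324th (index 323).
sample = systematic_sample(frame, 324, start=323)
print(sample)  # [324, 648, 972, 1296, 1620, 1944, 2268, 2592, 2916, 3240]
```

In practice the starting point is usually chosen at random among the first k members.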

Identify which type of sampling is used (random, systematic, convenience, stratified, or cluster).

To determine customer opinion of their musical variety, Sony randomly selects 120 concerts during a certain week and surveys all concert goers.

- Cluster

A polling company reported that 49% of 1018 surveyed adults said that secondhand smoke is annoying.

What is the exact value that is 49% of 1018

- 498.82

Could the result be the actual number of adults who said that secondhand smoke is annoying?

- No, the previous result could not be the actual number of adults who said that because a count of people must result in a whole number

What could be the actual number of adults who said that secondhand smoke is annoying?

- 499

Among the 1018 respondents, 190 said that secondhand smoke is not annoying at all. What percentage of respondents said that?

- (190/1018) * 100 = 18.66%

A polling company reported that 49% of 2302 surveyed adults said they play baseball.

What is the exact value of 49% of 2302

- (.49*2302) = 1127.98

Could the previous result be the actual number of adults who play baseball?

- No, a count of people must be a whole number

What could be the actual number of adults who play baseball?

- 1128

Among the 2302 respondents, 114 said they play hockey. What percentage play hockey?

- (114/2302) * 100 = 4.95%
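
The arithmetic in this problem can be sketched in Python as follows. The variable names are my own; the numbers come from the problem statement.

```python
respondents = 2302
reported_percent = 49

# Exact value of 49% of 2302.
exact_value = reported_percent * respondents / 100
print(round(exact_value, 2))  # 1127.98

# A count of people must be a whole number, so this cannot be the actual count.
is_whole_number = exact_value.is_integer()
print(is_whole_number)  # False

# The nearest whole number is a plausible actual count.
plausible_count = round(exact_value)
print(plausible_count)  # 1128

# Sanity check: 1128 out of 2302 still rounds to the reported 49%.
print(round(100 * plausible_count / respondents))  # 49

# Percentage of respondents who play hockey.
hockey_percent = 100 * 114 / respondents
print(round(hockey_percent, 2))  # 4.95
```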

Determine whether the study is an experiment or an observational study, and then identify a major problem with the study.

A medical researcher tested for a difference in systolic blood pressure levels between male and female students who are 12 years of age. She randomly selected four males and four females for her study.

- This is an observational study because the researcher does not attempt to modify the individuals

What is a major problem with the study?

- The sample is too small

Identify the type of observational study (cross-sectional, retrospective, or prospective)

A research company uses a device to record the viewing habits of about 10,000 households, and the data collected today will be used to determine the proportion of households tuned to a particular sports program.

- Cross-sectional study

Identify the type of observational study

A researcher plans to obtain data by interviewing siblings of victims who perished in a bombing. He will interview them, and people unrelated to the victims, over the next ten years to see how closeness to a traumatic event might affect recovery time.

- Prospective, because the data will be collected in the future from groups that are followed over time