Statistics Essentials

These are my notes and thoughts on an introduction to statistics.

Exploring Data

We use data to make decisions. We make estimations and develop guidelines using data. Therefore, data and its analysis are important.

Businesses collect data on their users. Scientists collect data on their experiments. Doctors collect data on their patients. Police collect data on criminals. Today, nearly every person, organization, and business collects data.

Most data that is collected is not immediately useful. It has to be organized and put in the proper format. It also needs to be summarized so decision makers can easily choose what is best for their organization. The techniques for doing this are called descriptive methods. They are useful for presentation, data reduction, and summarization of data.

Variables

There are two types of variables, categorical and numerical. A variable is categorical if it places the individuals being studied into one of several groups or categories. A variable is numerical if its outcomes are quantitative and can be analyzed using arithmetic. Numerical variables can be either discrete or continuous. Different methods of analysis must be used for categorical and numerical variables. 

If we take only one measurement on each object, we get univariate data. With two measurements on each object, we get bivariate data. 

Types of Descriptive Methods

There are different descriptive methods depending on the type of data that is collected. These are tabular, graphical, and numerical methods.

Different descriptive methods will answer different questions about data. 

Tabular

Collected data needs to be rearranged before analysis. One tabular method is the frequency distribution table. The letter \(n\) is used to denote the number of observations in a data set. The frequency of a value is the number of times that value occurs in the data set. Frequency is usually denoted by the letter \(f\). The relative frequency of a value is the ratio of the frequency to the total number of observations. It is usually denoted by \(rf\) and equals \(\frac{f}{n}\). The cumulative frequency gives the number of observations less than or equal to a specified value and is denoted by \(cf\). A frequency distribution table is a table giving all possible values of a variable and their frequencies.
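As a quick illustration of these definitions, here is a minimal Python sketch (the data set is made up) that computes \(f\), \(rf\), and \(cf\) for each value:

```python
from collections import Counter

# Hypothetical data set: number of pets per household
data = [0, 1, 1, 2, 2, 2, 3, 3, 4, 4]
n = len(data)                      # n = number of observations

freq = Counter(data)               # f = frequency of each value
cumulative = 0
print("value   f    rf   cf")
for value in sorted(freq):
    f = freq[value]
    rf = f / n                     # relative frequency = f / n
    cumulative += f                # cf = observations <= this value
    print(f"{value:5d} {f:4d} {rf:5.2f} {cumulative:4d}")
```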

Graphical

Presenting data in tables is common, though often not the most effective choice. You should still know how to interpret a table if you have to analyze one, but charts are usually a better tool.

Bar charts are used a lot. They can have either horizontal or vertical bars, and they are most commonly used to display categorical data.

Pie charts can also display amounts and frequencies of data. They are a popular graphical method but not usually the best choice, since they are harder to read accurately than bar charts.

Segmented Bar Charts

It is important to see categorical data that stems from different groups in order to make comparisons. A segmented bar chart arranges the distribution from each group along either the horizontal or vertical axis, with one bar per group whose segments show the categories within that group. These charts can show frequency, with bars of various sizes, or relative frequency, where all bars are the same height regardless of group size. Segmented bar charts that show relative frequency between groups can be somewhat misleading where sample size is concerned.

Mosaic Plots

Mosaic plots are similar to segmented bar charts; they are just a different way to compare categorical data. In a mosaic plot, the width of each bar represents the size of that group's sample. Each header indicates a different group, and the groups can be arranged along the x or y axis. The lengths of the bars along that axis represent the relative frequencies of the groups compared to each other.

Along the other axis, the bars of each group are the same length. Each section within the group bars represents the percentage that category occurred in the data set for that group. These same categories should appear within each of the group bars. We can make comparisons about the size of each group based on the length of each group bar. We can also evaluate the proportions of the categorical variables within each group by comparing the relative sizes of each section.

Graphical Methods For Numerical Data

To summarize and describe numerical data, dotplots and stemplots are used for small sets of data. For larger sets, histograms, cumulative frequency charts, and boxplots are often used.

We can describe the overall pattern of the distribution of a numerical variable using three characteristics: center, spread, and shape.

The center of a distribution describes the central data point. There are a few ways to measure the central tendency which include the mean, median, and the mode. Each measure has different pros and cons depending on the type and shape of the data.

The spread of a distribution can tell us where most of the data is. You can have a symmetric distribution and a skewed distribution.

If the left half of the distribution is approximately a mirror image of the right half, the distribution is called symmetric. This means that the data is spread out in the same way on both sides and that there is the same amount of data on each side of the center.

If there are extreme values in only one direction that cause one side to have a longer tail, the distribution is called skewed. It is right skewed if the longer tail is on the right and left skewed if the longer tail is on the left.

Patterns of Data

When looking at data, we should look for patterns and deviations from those patterns. Data may form clusters, in which observations are grouped together tightly; where data is not clustered, it can be described as having gaps. It is important to make these distinctions.

An outlier is an observation that differs markedly from the rest of the data. Outliers fall far from the middle of the data set.

Graphical Methods for Continuous Variables

There are several ways to show graphical data for continuous variables. These include dotplots, stemplots, histograms, and cumulative frequency charts.

Dotplots

Dotplots are easy to make. They are nice for smaller data sets. However, if there is too much data the dotplot becomes too cluttered to read. To make a dotplot, draw a horizontal line to indicate the data range, scale the line to accommodate the entire range of the data, mark a dot for each observation in the appropriate place above the scaled line, and if more than one observation has the same value, stack the dots above one another.

Each dot on the plot indicates the location of the value of a data point. For any data point, we can look directly down at the scale to determine the value of the point. When looking at a dotplot, we can see how the data points are spread, what kind of shape the points make collectively, and where the approximate center of the distribution is.  
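The stacking procedure is easy to automate. Below is a minimal sketch using matplotlib (an assumption on my part; any plotting tool works) with a made-up data set:

```python
import matplotlib.pyplot as plt
from collections import Counter

# Hypothetical small data set
data = [2, 3, 3, 4, 4, 4, 5, 5, 7]

# Stack a dot for each repeated value above the scaled line
counts = Counter(data)
xs, ys = [], []
for value, count in counts.items():
    for height in range(1, count + 1):
        xs.append(value)
        ys.append(height)

plt.scatter(xs, ys)
plt.yticks([])                 # vertical position only shows stacking
plt.xlabel("value")
plt.title("Dotplot")
plt.show()
```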

Stemplots

Stemplots are also used a lot. An advantage of the stemplot is that it shows every value. However, since that is the case, it is only useful for small data sets. 

To make a stemplot, separate each observation into two parts. The left part of the observation is called the stem and the right part is called the leaf. Draw a vertical line on the left side of the page to separate the stems from the leaves. Write all possible stems in increasing order on the left of the line. For each observation, write in the leaf to the right of the corresponding stem on the right side of the vertical line in increasing order. 

The numbers on the left side of the vertical line are stems. The value of a data point is the stem plus the leaf. Each stem has a different number of leaves, indicating the frequency of the class. Each leaf indicates a single observation. 

We use stemplots to see how the data is shaped and how it is spread. We also use it to see where the center of the data is. 
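Here is a small Python sketch of the construction just described, assuming two-digit observations so that the tens digit is the stem and the ones digit is the leaf (the data are made up):

```python
from collections import defaultdict

# Hypothetical two-digit observations
data = [12, 15, 21, 24, 24, 30, 31, 31, 36, 42]

stems = defaultdict(list)
for x in sorted(data):
    stems[x // 10].append(x % 10)   # left digit = stem, right digit = leaf

# Write all possible stems in increasing order, leaves to the right
for stem in range(min(stems), max(stems) + 1):
    leaves = "".join(str(leaf) for leaf in stems.get(stem, []))
    print(f"{stem} | {leaves}")
```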

Histograms

A histogram is the most popular form of displaying data. It resembles a stemplot on its side. They are useful for showing patterns in large data sets. A histogram can be drawn using frequencies, relative frequencies, or percentages.

To make a histogram:

  1. Create groups (classes) from the continuous data
  2. Draw the x axis and the y axis to scale so they accommodate all of the groups and frequencies
  3. Draw bars with heights equal to the corresponding frequencies and add a label for each group
  4. Draw the bars next to each other without any gaps

There are no gaps between histogram bars because the data values are continuous and the values in one bar flow right into the next one.

Each bar represents a single group or class. There is only one bar for each class. The classes are placed on the x axis in numerically increasing order, just as on a number line. The height of a bar in a frequency histogram corresponds to the frequency of that class. Percentage or relative frequency histograms can be read similarly. In a relative frequency histogram, the height of the bar reflects the relative frequency corresponding to the class. In a percentage frequency histogram, the height of the bar reflects the percent frequency that corresponds to the class. 
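A minimal matplotlib sketch of a frequency histogram, assuming made-up continuous data and hand-picked class boundaries:

```python
import matplotlib.pyplot as plt

# Hypothetical continuous measurements
data = [4.1, 4.8, 5.2, 5.9, 6.0, 6.3, 6.7, 7.1, 7.4, 8.2, 8.8, 9.5]

# bins defines the class boundaries; bars are drawn with no gaps
plt.hist(data, bins=[4, 5, 6, 7, 8, 9, 10], edgecolor="black")
plt.xlabel("value")
plt.ylabel("frequency")
plt.title("Frequency histogram")
plt.show()
```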

Cumulative Frequency Charts

The cumulative frequency for any group is the frequency for that group plus the frequencies of all groups of smaller observations. 

To draw a cumulative frequency chart:

  1. Draw the x and y axes
  2. Scale the x axis to accommodate the range of all groups and mark the upper boundary of each group
  3. Scale the y axis from 0 to \(n\)
  4. For each group, place a dot above its upper boundary at a height equal to that group's cumulative frequency
  5. Connect the dots with straight lines

From any point on the graph, we can draw a vertical line to read the x value from the x axis and a horizontal line to read the y value from the y axis. For right skewed distributions, the curve increases quickly in the beginning but then steadies in the later part. For left skewed distributions, the curve increases slowly in the beginning, but then steeply later on. The cumulative frequency chart for a symmetric distribution is often described as s-shaped because it begins with a slow increase on the left, rises rapidly in the middle, and then tapers off to a slow increase again at the right. 
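A short sketch of those construction steps, again assuming matplotlib and made-up grouped data:

```python
import matplotlib.pyplot as plt

# Hypothetical grouped data: upper class boundaries and class frequencies
upper_boundaries = [10, 20, 30, 40, 50]
frequencies = [2, 5, 9, 6, 3]

# Cumulative frequency at each upper boundary
cf, running = [], 0
for f in frequencies:
    running += f
    cf.append(running)

# Start the curve at the lower boundary of the first class with cf = 0
plt.plot([0] + upper_boundaries, [0] + cf, marker="o")
plt.xlabel("upper class boundary")
plt.ylabel("cumulative frequency")
plt.title("Cumulative frequency chart")
plt.show()
```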

Summary

Visualizations are the very first step you should take when analyzing data. The types of summary statistics, inferential tests, and analysis that can be calculated are dependent upon the shape of the distribution. The key point to remember is that there are different calculations for symmetric and skewed data. Knowing the shape of the distributions will help you get started. 

Statistics and Problem Solving

A population is the total set of subjects or things we are interested in studying. Populations are defined by what a researcher is studying and can come in all shapes and sizes.

A frame is a list containing all members of the population.

Population parameters are facts about the population. Since parameters are descriptions of the population, a population can have many parameters. Parameters can be averages, percentages, minimums, or maximums. For a specific population at a specific point in time, population parameters do not change.

A sample is a subset of the population which is used to gain insight about the population. Samples are used to represent a larger group, the population. 

A statistic is a fact or characteristic about the sample. For any given sample a statistic is a fixed number. Statistics are used as estimates of population parameters. 

A process is a method for obtaining a desired result. The idea of a process is closely tied to quality control. In order to improve a process, there must be an understanding of how the process is currently performing. This requires defining and measuring the process.

The science of statistics is divided into two categories, descriptive and inferential. Descriptive methods describe and summarize data. Descriptive statistics is the collection, organization, and presentation of data.

The objective of inferential statistics is to make reasonable guesses about the population characteristics using sample data. 


Collecting and Analyzing Data

Part of becoming a problem solver and user of statistics is developing an ability to appraise the quality of measurements. When you encounter data, consider whether the concept under study is adequately reflected by the proposed measurements, whether the data are measured accurately, and whether there is a sufficient quantity of data to draw a reasonable conclusion.

Measurement and data are an integral part of science, and methods have been developed for solving research problems: gather information about the phenomenon being studied; on the basis of the data, formulate a preliminary generalization or hypothesis; then collect further data to test the hypothesis. If the data and other subsequent experiments support the hypothesis, it becomes a law.

There are two ways to obtain data, observation and controlled experiments. In a statistical analysis, it is usually not possible to recover from poorly measured concepts or badly collected measurements. 

A response variable measures the outcome of interest in a study. An explanatory variable causes or explains changes in a response variable. Isolating the effects of one variable on another means anticipating potentially confounding variables and designing a controlled experiment to produce data in which the values of the confounding variable are regulated.

Observational data come about from measuring things as they occur. They can be extremely valuable.

Much of the statistical information presented to us is in the form of surveys. So, it is important to understand them and how they are done. In some cases, the purpose of a survey is purely descriptive. However, in many cases the researcher is interested in discovering a relationship.

Data in which the observations are restricted to a set of values that possess gaps is called discrete. Data that can take on any value within some interval is called continuous. The quality of data is referred to as its level of measurement. When analyzing data, you must be exceedingly conscious of the data’s level of measurement because many statistical analyses can only be applied to data that possess a certain level of measurement. 

Data that represents whether a variable possesses some characteristic is called nominal. Ordinal data represents categories that have some associated order. Note that ordinal data is also nominal, but it also possesses the additional property of ordinality. 

If the data can be ordered and the arithmetic difference is meaningful, the data is interval. An example of interval data is temperature. Interval data is numerical data that possesses both the property of ordinality and the interval property. Ratio data is similar to interval data, except that it has a meaningful zero point and the ratio of two data points is meaningful. 

Qualitative data is data measured on a nominal or ordinal scale. Quantitative data is measured on an interval or ratio scale. 

Time series data originate as measurements taken from some process, usually over equally spaced intervals of time. Processes can be divided into two categories: stationary and nonstationary. All time series that are interesting vary, and the nature of the variability determines how the process is characterized. In a stationary process the time series varies around some central value and has approximately the same variation over the series. In a nonstationary process, the time series possesses a trend, the tendency for the series to either increase or decrease over time.

Cross-sectional data are measurements created at approximately the same period of time. 

Organization of Data in Statistics

A frequency distribution is a summary technique that organizes data into classes and provides in tabular form a list of the classes along with the number of observations in each class.

The process begins with refining information: an analyst takes raw data and organizes it by counting the number of observations in each classification.

A frequency distribution is a good way to handle large amounts of data. With it, we can see the overall structure of the data.

There are two steps in creating a frequency distribution:

  1. Choose the classifications
  2. Count the number in each class

Graphs are important because they put information in visual form. While individual data can be lost, this is more than made up for by a nice graph. Use some type of graphing software to do this easily. Lots of different programs are available to create nice looking graphs these days. 

Bar Charts

The bar chart is a simple graph in which the length of each bar corresponds to the number of observations in a category.

They are a good presentation tool and helpful in showing the differences in magnitude. 

Creating a bar chart can get complicated. You should think about size, color, and labeling. 

Pie Charts

Pie charts can represent the same information as a bar chart. The slices in a pie chart are proportional to the total in each category. You can easily compare the total of each category to the total overall. 

When your data is qualitative, choosing categories is pretty easy. However, when your data is quantitative, choosing those categories is more complicated. The reason is that your choices often affect how others will interpret the data. So, you have to be careful when doing this.

Choosing the number of categories is your choice and should depend on the amount of data available. You want enough categories to make the comparisons meaningful but not so many that it is hard to understand. Each situation will be different in this regard. 


Relative Frequency Distribution

This represents the proportion of total observations in a category. It enables a person to view the number in each category in relation to the total number of observations. It also converts the frequency in each category to a proportion so that data sets can be compared more easily. It looks like this:

\[ \text{relative frequency} = \frac{\text{number in category}}{\text{total number}} \]

Cumulative Frequency Distribution

This gives a person the ability to quickly look at any category and see the number of observations and how they are related. The cumulative frequency is the sum of the frequency of a particular category and all preceding categories.

Cumulative Relative Frequency

The cumulative relative frequency is the proportion of observations in a particular category and all preceding categories. 

Histograms

A histogram is used frequently and reveals the distribution of data. It is a bar graph of the frequencies in which each category is represented by a vertical bar whose height is proportional to the frequency of the interval. The horizontal boundaries of each vertical bar correspond to the category endpoints. Once the frequency distribution has been calculated, all the information necessary for plotting a histogram is available.

Stem and Leaf Display

The stem and leaf display is a mix of methods. It is similar to a histogram, but the data remain visible and usable: the raw data are not lost in the graph. It is useful for ordering and detecting patterns in the data.

Ordered Array

An ordered array is a listing of all the data in either increasing or decreasing magnitude. Data listed in increasing order is said to be listed in rank order. If listed in decreasing order, it is listed in reverse rank order. Listing data in an order is very useful and usually done. It allows you to scan the data quickly for the largest and smallest values. 

Dot Plots

A dot plot is a graph where each data value is plotted as a point. If there are multiple entries, they are plotted above each other. 

Time Series Data

A time series plot graphs data using time as the horizontal axis. 

Statistical and Critical Thinking

Surveys provide data that enable us to improve products or services. Surveys guide political candidates, shape business practices, influence social media, and affect many aspects of our lives. 

A voluntary response sample is a sample in which respondents themselves decide whether to participate. Those with a strong interest in the topic are more likely to participate. Sample data must be collected in an appropriate way, such as through a process of random selection. If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical torturing can salvage them.

When using methods of statistics with sample data to form conclusions about a population, it is absolutely essential to collect sample data in a way that is appropriate. 

Data are collections of observations, such as measurements, genders, or survey responses. A single data value is called a datum. The term data is plural.

Statistics is the science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them. 

A population is the complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data that we would like to make inferences about. 

A census is the collection of data from every member of the population.

A sample is a sub-collection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

The word statistics is derived from the Latin word status, meaning state. Early uses of statistics involved compilations of data and graphs describing various aspects of a state or country. 

The following types of polls are common examples of voluntary response samples. By their very nature, all are seriously flawed because we should not make conclusions about a population on the basis of samples with a strong possibility of bias.

  1. Internet polls, in which people online can decide whether to respond.
  2. Mail-in polls, in which people can decide whether to reply.
  3. Telephone polls, in which newspaper, radio, or television announcements ask that you call a special number to respond.

Analyze

After completing our preparation by considering the context, source, and sampling method, we begin to analyze the data.

Graph and Explore

An analysis should begin with appropriate graphs and explorations of data.

Apply Statistical Methods

A good statistical analysis does not require strong computational skills. A good statistical analysis does require using common sense and paying careful attention to sound statistical methods.

Conclude

The final step in our statistical process involves conclusions, and we should develop an ability to distinguish between statistical significance and practical significance. 


Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5 percent or less. Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.
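These two probabilities are easy to check. The sketch below assumes scipy is available; binom.sf(k - 1, n, p) gives the probability of k or more successes:

```python
from scipy.stats import binom

# P(at least k girls in 100 births) if girls and boys are equally likely
n, p = 100, 0.5
print(binom.sf(97, n, p))   # P(X >= 98): astronomically small -> significant
print(binom.sf(51, n, p))   # P(X >= 52): about 0.38 -> easily due to chance
```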


Practical significance concerns whether a treatment or finding makes enough of a difference to justify its use. A result can be statistically significant, yet common sense might suggest that it does not make enough of a difference to be practical.


Misleading Conclusions

When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by statistical analysis. 


Sample Data Reported

When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results. Ask people what they weigh and you are likely to get their desired weights, not their actual weights.


Loaded Questions

If survey questions are not worded carefully, the results of a study can be misleading. Survey questions can be loaded or intentionally worded to elicit a desired response. 


Order of Questions

Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered. 


Nonresponse

A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable. When people are asked survey questions, some firmly refuse to answer. 


Percentages

To find a percentage of an amount, replace the % symbol with division by 100, and then interpret “of” to be multiplication. 

6% of 1200 responses = \(\frac{6}{100} \times 1200 = 72\)


Decimal to Percentage

To convert from a decimal to a percentage, multiply by 100%.

\[ 0.25 \rightarrow 0.25 \times 100\% = 25\% \]


Fraction to Percentage

To convert from a fraction to a percentage, divide the denominator into the numerator to get an equivalent decimal number. Then multiply by 100 percent.

\[ \frac{3}{4} = 0.75 \rightarrow 0.75 \times 100\% = 75\% \]


Percentage to Decimal

To convert from a percentage to a decimal number, replace the % symbol with division by 100. 

\[ 85\% = \frac{85}{100} = 0.85 \]
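All four rules reduce to one-line arithmetic. A quick Python check:

```python
# Percentage arithmetic from the rules above
print(6 / 100 * 1200)   # 6% of 1200 responses -> 72.0
print(0.25 * 100)       # decimal to percentage -> 25.0
print(3 / 4 * 100)      # fraction to percentage -> 75.0
print(85 / 100)         # percentage to decimal -> 0.85
```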

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

If we have more than one statistic, we have statistics. Another meaning of statistics is the science of planning studies and experiments, obtaining data, and then organizing, summarizing, presenting, analyzing, and interpreting those data.

Some data are numbers representing counts or measurements, whereas others are attributes that are not counts or measurements. Quantitative data consist of numbers representing counts or measurements. 

Categorical data consist of names or labels. Categorical data are sometimes coded with numbers, with those numbers replacing names. Although such numbers might appear to be quantitative, they are actually categorical data. 

Include Units of Measurement

With quantitative data, it is important to use the appropriate units of measurement, such as dollars, hours, feet, or meters. We should carefully observe information given about the units of measurement, such as all amounts are in thousands of dollars or all units are in kilograms. 

Discrete or Continuous

Quantitative data can be further described by distinguishing between discrete and continuous types. Discrete data result when the data values are quantitative and the number of values is finite or countable. Continuous data result from infinitely many possible quantitative values, where the collection of values is not countable.

The concept of countable data plays a key role in the preceding definitions, but it is not a particularly easy concept to understand. Continuous data can be measured, but not counted. If you select a particular value from continuous data, there is no next data value.  

Levels of Measurement

Another common way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. When we are applying statistics to real problems, the level of measurement of the data helps us to decide which procedure to use. Don't do computations and don't use statistical methods that are not appropriate for the data.

Ratio

There is a natural zero starting point and ratios make sense. Examples are heights, lengths, distances, and volumes.

Interval

Differences are meaningful, but there is no natural zero starting point and ratios are meaningless. Body temperature in degrees Fahrenheit is an example.

Ordinal

Data can be arranged in order, but differences either can’t be found or are meaningless. Examples are ranks of colleges.

Nominal

Categories only. Data cannot be arranged in order. An example is eye colors.

The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in some order. 

Because nominal data lack any ordering or numerical significance, they should not be used for calculations. Numbers such as 1, 2, 3, or 4 are sometimes assigned to the different categories, but these numbers have no real computational significance and any average calculated from them is meaningless and possibly misleading.

Data are at the ordinal level of measurement if they can be arranged in some order, but differences between data values cannot be determined or are meaningless.

Ordinal data provide information about relative comparisons, but not the magnitudes of the differences. Usually, ordinal data should not be used for calculations such as an average, but this guideline is sometimes ignored.

Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful. Data at this level do not have a natural zero starting point at which none of the quantity is present.

Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. For data at this level, differences and ratios are both meaningful.

The distinction between the interval and ratio levels of measurement can be a bit tricky. To distinguish between them, use a ratio test by asking this question: does use of the term twice make sense? The term twice describes the ratio of one value being double another value. Twice makes sense for data at the ratio level of measurement, but it does not make sense for data at the interval level of measurement.

For the true zero test, and for ratios to make sense, there must be a true zero, where the value of zero indicates that none of the quantity is present and zero is not simply an arbitrary value on a scale. The temperature of 0 °F is arbitrary and does not indicate that there is no heat, so temperatures on the Fahrenheit scale are at the interval level of measurement, not the ratio level.

Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.

Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields such as sociology or finance.

Examples of Data Set Magnitudes

  • Terabytes 
  • Petabytes
  • Exabytes
  • Zettabytes
  • Yottabytes


Statistics in Data Science

The modern data scientist has a solid background in statistics and computer systems as well as expertise in fields that extend beyond statistics. The modern data scientist might be skilled with Hadoop software, which uses parallel processing on many computers for the analysis of big data. The modern data scientist might also have a strong background in some other field such as psychology, biology, medicine, chemistry, or economics.

Missing Data

When collecting sample data, it is quite common to find that some values are missing. Ignoring missing data can sometimes create misleading results. If you make the mistake of skipping over a few different samples when you are manually typing them into a statistics software program, the missing values are not likely to have a serious effect on the results. However, if a survey includes many missing salary entries because those with very low incomes are reluctant to reveal their salaries, those missing values will have the serious effect of making salaries appear higher than they really are.

A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value. 

A data value is missing not at random if the missing value is related to the reason that it is missing.

Missing data completely at random can happen. For example, someone using a keyboard to manually enter the ages of survey respondents might make the mistake of failing to enter an age of 37 years. That data value is missing completely at random.

Biased Results

Based on the two definitions and examples above, it makes sense to conclude that if we ignore data missing completely at random, the remaining values are not likely to be biased and good results should be obtained. However, if we ignore data that are missing not at random, it is very possible that the remaining values are biased and results will be misleading.

Correcting for Missing Data

There are different methods for dealing with missing data. One very common method for dealing with missing data is to delete all subjects having any missing values. If the data are missing completely at random, the remaining values are not likely to be biased and good results can be obtained, but with a smaller sample size. If the data are missing not at random, deleting subjects having any missing values can easily result in a bias among the remaining values, so results can be misleading. 

We can also impute missing data values by substituting values for them. There are different methods of determining the replacement values, such as using the mean of the other values, using a randomly selected value from other similar cases, or using a method based on regression analysis.
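As a sketch of these two approaches, assuming pandas and a made-up survey table with one missing salary:

```python
import pandas as pd
import numpy as np

# Hypothetical survey data with a missing salary entry
df = pd.DataFrame({"age": [25, 37, 52, 41],
                   "salary": [48000, np.nan, 61000, 55000]})

# Method 1: delete all subjects having any missing values
complete_cases = df.dropna()

# Method 2: impute the missing value with the mean of the other values
imputed = df.fillna({"salary": df["salary"].mean()})
print(imputed)
```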

When analyzing sample data with missing values, try to determine why they are missing, then decide whether it makes sense to treat the remaining values as being representative of the population. If it appears that there are missing values that are missing not at random, know that the remaining data may well be biased and any conclusions based on those remaining values may well be misleading.

In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. The individuals in experiments are called experimental units and they are often called subjects when they are people. In an observational study, we observe and measure specific characteristics, but we don’t attempt to modify the individuals being studied. 

Experiments are often better than observational studies because well planned experiments typically reduce the chance of having the results affected by some variable that is not part of the study. A lurking variable is one that affects the variables included in the study, but it is not included in the study.

Design of Experiments

Good design of experiments includes replication, blinding, and randomization.

Replication is the repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments. 

Blinding is used when the subject doesn’t know whether he or she is receiving a treatment or a placebo. Blinding is a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms. 

Randomization is used when individuals are assigned to different groups through a process of random selection. The logic behind randomization is to use chance as a way to create two groups that are similar. 

A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Unlike careless or haphazard sampling, random sampling usually requires very careful planning and execution. 

Simple Random Sample

A sample of n subjects is selected so that every sample of the same size n has the same chance of being selected.

Systematic Sample

Select every kth subject.

Convenience Sample

Use data that are very easy to get.

Stratified Sample

Subdivide populations into strata or groups with the same characteristics, then randomly sample within those strata.

Cluster Sample

Partition the population into clusters or groups, randomly select some of those clusters, then include all members of the selected clusters.
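A minimal sketch of the first two methods using Python's random module, with a hypothetical frame of 100 numbered subjects:

```python
import random

population = list(range(1, 101))   # hypothetical frame of 100 subjects

# Simple random sample: every sample of size n is equally likely
srs = random.sample(population, 10)

# Systematic sample: random start, then every kth subject
k = 10
start = random.randrange(k)
systematic = population[start::k]

print(srs)
print(systematic)
```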

Multistage Sampling

Professional pollsters and government researchers often collect data by using some combination of the preceding sampling methods. In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.

In a cross sectional study, data are observed, measured, and collected at one point in time, not over a period of time.

In a retrospective study, data are collected from a past time period by going back in time.

In a prospective study, data are collected in the future from groups that share common factors.

Experiments

In an experiment, confounding occurs when we can see some effect, but we can’t identify the specific factor that caused it.

A randomized block design uses the same basic idea as stratified sampling, but randomized block designs are used when designing experiments, whereas stratified sampling is used for surveys.

Matched Pairs Design

Compare two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics.

Rigorously Controlled Design

Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors. 

Sampling Errors

In statistics, you could use a good sampling method and do everything correctly, and yet it is possible to get wrong results. No matter how well you plan and execute the sample collection process, there is likely to be some error in the results.

A sampling error occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations.

A non sampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances. 

A non random sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

The Gold Standard

Randomization with placebo/treatment groups is sometimes called the gold standard because it is so effective.

What Statistics Is All About

One of the first considerations is designing appropriate studies. The purpose is to collect data, which can be done with either surveys or experiments. One of the most popular approaches is the observational study, which collects data on individuals in a way that does not affect them. Surveys have to be worded carefully to get good information.

An experiment is another popular way to gather data. It involves treatments on participants so that clear comparisons can be made. After treatments are made, responses are recorded.  

Collecting quality data is a major consideration. It really does no good to get bad data. So, studies and experiments must be planned well. Once you have good data, you can make a good report on what you found. To minimize bias in a survey, you have to be random when selecting participants. 

Descriptive Statistics

Descriptive statistics are numerical values that describe a data set, usually by way of categories. If the data are categorical, they are usually summarized using the number of individuals in each group, which is called the frequency. If you use the percentage of individuals instead, it is called the relative frequency.

Numerical data represent measurements or counts. You can do more with numerical data. For example, you can get the measure of center and the measure of spread in the data. 

Some descriptive statistics are more appropriate than others in certain situations. The average is not always the best measure of the center of a data set. 

Charts and Graphs

Data is summarized in a visual way using charts and graphs. These are displays that are organized to give you a big picture of the data. 

Some of the basic graphs used for categorical data include pie charts and bar graphs. These break down variables in the data. 

For numerical data, a different type of graph is needed. Histograms and box plots are usually used to represent numerical data. These types of graphs make it easier to visualize the data.

Distributions

A variable is a characteristic that is being counted or measured. A distribution is a listing of the possible values of a variable and how often they occur. 

Different types of distributions exist for different types of variables.

If a variable is counting the number of successes in a certain number of trials, it has a binomial distribution. 

If the variable takes on values that occur according to a bell-shaped curve, then that variable has a normal distribution.

If the variable is based on sample averages and you have limited data, the t-distribution may be in order.

When it comes to distributions, you need to know how to decide which distribution a particular variable has, how to find probabilities for it, and how to figure out what the long-term average and standard deviation of the outcomes would be.
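A brief sketch of probability lookups for these three distributions, assuming scipy (the numbers chosen are arbitrary):

```python
from scipy.stats import binom, norm, t

# Binomial: P(exactly 3 successes in 10 trials with p = 0.5)
print(binom.pmf(3, n=10, p=0.5))

# Normal: P(value below 1 standard deviation above the mean)
print(norm.cdf(1))                  # about 0.841

# t-distribution: same question with limited data (9 degrees of freedom)
print(t.cdf(1, df=9))               # slightly less than the normal value
```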

Performing Analyses

After data has been collected and described, it is time to do the statistical analysis. There are many types of analyses. You have to choose the appropriate type for your data. 

You often see statistics that try to estimate numbers pertaining to an entire population, but these are just estimates: most studies question only a small number of people. Data is collected on a small sample, and sometimes the results are very inaccurate.

Sample results vary from sample to sample, and this amount of variability needs to be reported, but usually it is not. The statistic used to measure and report the level of precision in someone's sample result is called the margin of error. The range of values formed by the sample result plus and minus the margin of error is called the confidence interval.

Hypothesis Tests

One major staple of research studies is called hypothesis testing. A hypothesis test is a technique for using data to validate or invalidate a claim about a population. 

The elements about a population that are most often tested are:

  • The population mean
  • The population proportion
  • The difference in two population means or proportions

Hypothesis tests are used in a host of areas that affect your everyday life, such as medical studies, advertisements, and polling data. Often you only hear the conclusions of hypothesis tests but you don’t see the methods used to come to these conclusions. 

Drawing Conclusions

To perform statistical analyses, researchers use software that depends on formulas, and you have to use them correctly. One of the most common mistakes in conclusions is overstating the results. Until you do a controlled experiment, you can't make a cause-and-effect conclusion based on relationships you find.

Statistics is about much more than numbers. You need to understand how to make appropriate conclusions from studying data and be smart enough to not believe everything you read. 

Working with Tables and Graphs

When working with large data sets, a frequency distribution is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.

Frequency Distribution 

A frequency distribution or table shows how data are partitioned among several categories by listing the categories along with the number of data values in each of them.

Lower class limits are the smallest numbers that can belong to each of the different classes. Upper class limits are the largest numbers that can belong to each of the different classes. Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits. Class midpoints are the values in the middle of the classes. Class width is the difference between two consecutive lower class limits in a frequency distribution.  

Finding the correct class width can be tricky. For class width, don’t make the most common mistake of using the difference between a lower class limit and an upper class limit. For class boundaries, remember that they split the difference between the end of one class and the beginning of the next class.

We construct frequency distributions to:

  1. Summarize large data sets
  2. See the distribution and identify outliers
  3. Have a basis for constructing graphs

Technology can generate frequency distributions but these are the common steps:

  • Select the number of classes, usually between 5 and 20
  • Calculate class width: \(\frac{\text{max data value} - \text{min data value}}{\text{number of classes}}\)
  • Round this result to get a convenient number
  • Choose the value for the first lower class limit by using either the min value or a convenient value below the minimum.
  • Using the first lower class limit and the class width, list the other lower class limits.
  • List the lower class limits in a vertical column and then determine and enter the upper class limits.
  • Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to find the total frequency for each class.
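The steps above can be sketched in a few lines of Python; the data set and number of classes here are made up:

```python
import math

# Hypothetical data and a chosen number of classes
data = [52, 57, 61, 63, 68, 70, 74, 79, 83, 88, 91, 96]
num_classes = 5

# Class width, rounded up to a convenient whole number
width = math.ceil((max(data) - min(data)) / num_classes)

# List the lower class limits, then the matching upper class limits
lower = [min(data) + i * width for i in range(num_classes)]
for lo in lower:
    hi = lo + width - 1
    f = sum(1 for x in data if lo <= x <= hi)
    print(f"{lo}-{hi}: {f}")
```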

Relative Frequency Distribution

A variation of the basic frequency distribution is a relative frequency distribution. Each class frequency is replaced by a relative frequency as a percentage. 

\[ \text{relative frequency} = \frac{\text{frequency for class}}{\text{sum of frequencies}} \times 100\% \]

This will give you the frequency percentage.

The sum of the percentages in a relative frequency distribution will be very close to 100 percent.

Another variation of a frequency distribution is a cumulative frequency distribution in which the frequency for each class is the sum of the frequencies for that class and all previous classes. 

At the beginning we noted that a frequency distribution can help us understand the distribution of a data set, which is the nature or shape of the spread of the data over the range of values. In statistics, we are often interested in determining whether the data have a normal distribution. Data that have an approximately normal distribution are characterized by a frequency distribution with the following features:

  1. The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
  2. The distribution is approximately symmetric. Frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.

The presence of gaps can suggest that the data are from two or more different populations.

Comparing two or more relative frequency distributions in one table makes comparisons of data much easier.

While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.

A histogram visually displays the shape of the distribution of the data. It shows the location of the center of the data. Histograms show the spread of data and can also identify outliers.

A histogram is basically a graph of a frequency distribution. Class frequencies should be used for the vertical scale and that scale should be labeled. There is no universal agreement on the procedure for selecting which values are used for the bar locations along the horizontal scale, but it is common to use class boundaries, class midpoints, class limits, or something else. It is often easier for us to use class midpoints for the horizontal scale. Histograms can usually be generated using technology.

A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies. 

The ultimate objective of using histograms is to be able to understand characteristics of data. Exploring the data means to:

  1. Find the center of the data
  2. Find the variation
  3. Find the shape of the distribution
  4. Find any outliers
  5. Find the change of data over time

When a graph is said to be skewed to the right, it means the histogram shape has a tail on the right.

When a graph is said to be skewed to the left, it means the histogram shape has a tail on the left.

A bell-shaped distribution is called a normal distribution and has its highest frequencies in the middle.

A uniform distribution is one in which the histogram has roughly the same bar heights all the way across.

Many statistical methods require that sample data come from a population having a distribution that is approximately a normal distribution.

In a uniform distribution, the different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform. 

A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right, called positively skewed, have a longer right tail.

Data skewed to the left, called negatively skewed, have a longer left tail.

Some really important methods have a requirement that sample data must be from a population having a normal distribution. Histograms can be helpful in determining whether the normality requirement is satisfied, but they are not very helpful with very small data sets. A normal quantile plot, which plots the data values against the z scores expected from a normal distribution, works better in those cases.

The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line, and the points do not show some systematic pattern that is not a straight-line pattern.

The population distribution is not normal if the normal quantile plot has either or both of these two conditions:

  1. The points do not lie reasonably close to a straight-line pattern
  2. The points show some systematic pattern that is not a straight-line pattern
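A normal quantile plot can be produced with scipy's probplot (a sketch, assuming scipy and matplotlib; the sample is randomly generated):

```python
import numpy as np
import matplotlib.pyplot as plt
from scipy import stats

# Hypothetical sample to check for normality
sample = np.random.normal(loc=50, scale=5, size=40)

# probplot draws sample quantiles against theoretical normal quantiles;
# points close to the straight line suggest a normal population
stats.probplot(sample, dist="norm", plot=plt)
plt.show()
```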

Graphs that Enlighten

A dot plot is a useful type of graph. It consists of a graph of quantitative data in which each data value is plotted as a point above a horizontal scale of values. Dots representing equal values are stacked.

A dot plot:

  1. Displays the shape of the distribution of data
  2. Usually makes it possible to recreate the original list of data values

A stem plot is another type of graph and it represents quantitative data by separating each value into two parts: the stem and the leaf. Better stem plots are often obtained by first rounding the original data values. Also, stem plots can be expanded to include more rows and can be condensed to include fewer rows.

Stem plots:

  1. Show the shape of the distribution of the data
  2. Retain the original data values
  3. Present the sample data in sorted order

A time-series graph is a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.

Time-series graphs:

  1. Reveal information about trends over time

Bar graphs use bars of equal width to show frequencies of categories of categorical data. The bars may or may not be separated by small gaps.

Bar graphs:

  1. Show the relative distribution of categorical data so that it is easier to compare the different categories.

A pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right. 

Pareto charts:

  1. Show the relative distribution of categorical data so that it is easier to compare the different categories.
  2. Draw attention to the more important categories.

A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Although pie charts are very common, they are not as effective as Pareto charts.

Pie charts:

  1. Show the distribution of categorical data in a commonly used format.

Try to never use pie charts because they waste ink on components that are not data, and they lack an appropriate scale.

A frequency polygon uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars. 

A variation of the basic frequency polygon is the relative frequency polygon, which uses relative frequencies for the vertical scale. An advantage of relative frequency polygons is that two or more of them can be combined on a single graph for easy comparison. 

Graphs that Deceive

Deceptive graphs are commonly used to mislead people. Graphs should be constructed in a way that is fair and objective. 

A common deceptive graph involves starting the vertical scale at some value greater than zero to exaggerate differences between groups. This is called a nonzero vertical axis. Always examine a graph carefully to see whether the vertical axis begins at some point other than zero so that differences are exaggerated.

Pictographs are another type of chart used to mislead. Data that are one-dimensional in nature are often depicted with two-dimensional or three-dimensional objects. Pictographs can create false impressions that grossly distort differences through these principles of basic geometry:

  1. When you double each side of a square, its area doesn't merely double; it increases by a factor of four
  2. When you double each side of a cube, its volume doesn't merely double; it increases by a factor of eight

When examining data depicted with a pictograph, determine whether the graph is misleading because objects of area or volume are used to depict amounts that are actually one-dimensional. 

For small data sets of 20 values or fewer, use a table instead of a graph. A graph of data should make us focus on the true nature of the data, not on other elements, such as eye-catching but distracting design features. Do not distort data. Construct a graph to reveal the true nature of the data. Almost all of the ink in a graph should be used for the data, not for the design elements.

A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.

A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired quantitative data in which the horizontal x-axis is used for the first variable and the vertical y-axis is used for the second variable y.

The presence of correlation between two variables is not evidence that one of the variables causes the other. We might find a correlation between beer consumption and weight, but we cannot conclude from the statistical evidence that drinking beer has a direct effect on weight. 

A scatterplot can be very helpful in determining whether there is a correlation between the two variables.

The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables. 

When we conclude that there appears to be a linear correlation between two variables, we can find the equation of the straight line that best fits the sample data, and that equation can be used to predict the value of one variable when given a specific value of the other variable. Instead of using the straight-line equation \(y = mx + b\) that we have all learned in prior math courses, we write the regression equation as \(\hat{y} = b_0 + b_1 x\).

Given a collection of paired sample data, the regression line, or line of best fit, is the straight line that best fits the scatter plot of the data. 
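As a sketch of both ideas, assuming numpy and made-up paired data: corrcoef gives \(r\), and a degree-1 polyfit gives the least-squares slope \(b_1\) and intercept \(b_0\):

```python
import numpy as np

# Hypothetical paired data
x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

# Linear correlation coefficient r
r = np.corrcoef(x, y)[0, 1]
print(r)                      # close to 1: strong positive linear correlation

# Regression line y-hat = b0 + b1 * x (least-squares fit)
b1, b0 = np.polyfit(x, y, 1)  # polyfit returns highest-degree term first
print(b0, b1)
```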


Measures of Center

Measures of center are widely used to provide representative values that summarize data sets.

A measure of center is a value at the center or middle of a data set. 

The mean is generally the most important of all numerical measurements used to describe data. It is what most people call an average.

The mean of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values.

Sample means drawn from the same population tend to vary less than other measures of center. The mean of a data set uses every data value. A disadvantage of the mean is that just one extreme value can change the value of the mean substantially. This extreme value is called an outlier. By this definition, we say the mean is not resistant.

A statistic is resistant if the presence of extreme values does not cause it to change very much. 

The definition of the mean can be expressed by the formula:

\[\bar{x} = \frac{\Sigma x}{n} \]

\(\Sigma\) denotes the sum of the values, \(x\) represents the individual data values, and \(n\) is the number of values.

If the data are from a sample of the population, the mean is denoted by x-bar.

If the data are from the entire population, the mean is denoted by mu.

Sample statistics are usually represented by English letters and population parameters are usually represented by Greek letters.

\(\Sigma\) denotes the sum of a set of data values.

\(x\) is the variable usually used to represent the individual data values.

\(n\) represents the number of data values in a sample.

\(N\) represents the number of data values in a population.

Never use the vague term average when referring to a measure of center. The word average is often used for the mean, but because different people interpret it differently, a specific measure of center should be named instead.

The median can be thought of as a middle value. More precisely, the median of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing or decreasing magnitude. 

The median does not change by large amounts when we include just a few extreme values, so the median is a resistant measure of center. The median does not directly use every data value.

The median of a sample is sometimes denoted by x-tilde, m, or Med. To find the median, first sort the values.

If the number of data values is odd, the median is the number located in the exact middle of the sorted list.

If the number of data values is even, the median is found by computing the mean of the two middle numbers in the sorted list. 

Mode isn’t used much with quantitative data, but it is the only measure of center that can be used with qualitative data. The mode of a data set is the value that occurs with the greatest frequency. 

The mode can be found with qualitative data. A data set can have no mode, one mode, or multiple modes. When two data values occur with the same greatest frequency, each one is a mode and the data set is said to be bimodal. When more than two data values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal. When no data value is repeated, we say there is no mode.

Midrange is another measure of center. The midrange of a data set is the measure of center that is the value midway between the max and min values in the original data set. It is found by adding the max data value to the min data value and then dividing the sum by 2.

Because the midrange uses only the max and min values, it is very sensitive to those extremes so the midrange is not resistant. In practice, the midrange is rarely used, but it has 3 redeeming features:

  1. It is very easy to compute
  2. It helps reinforce the very important point that there are several different ways to define the center of a data set.
  3. The value of the midrange is sometimes used incorrectly for the median, so confusion can be reduced by clearly defining the midrange along with the median. 
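The four measures of center above are easy to compute directly. A minimal Python sketch, using the standard statistics module and a made-up sample:

```python
import statistics

# Hypothetical sample data
data = [3, 7, 7, 2, 9, 4, 7, 5]

mean = statistics.mean(data)            # sum of values / number of values
median = statistics.median(data)        # middle value of the sorted data
modes = statistics.multimode(data)      # value(s) with the greatest frequency
midrange = (max(data) + min(data)) / 2  # midway between max and min

print(mean, median, modes, midrange)    # 5.5 6.0 [7] 5.5
```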

When calculating measures of center, we often need to round the result.

For the mean, median, and midrange, carry one more decimal place than is present in the original set of values.

For the mode, leave the value as is without rounding.

When applying any rounding rules, round only the final result, not anything before that.

We can always calculate measures of center from a sample of numbers, but we should always think about whether it makes sense to do that. 

For example, it makes no sense to do numerical calculations with data at the nominal level of measurement. We should also think about the sampling method used to collect data. If the sampling method is not sound, the statistics we obtain may be very misleading. 

Measures of Variation

 

To understand variation, we begin by introducing the range. The range of a set of data values is the difference between the max data value and the min data value. The range uses only the maximum and the minimum data values, so it is very sensitive to extreme values. It is not resistant. Because the range uses only the max and min values, it does not take every value into account and therefore does not truly reflect the variation among all of the data values.

\[ \text{Range = max value - min value} \]

Range Rule of Thumb

The range rule of thumb is a quick way to ballpark the standard deviation:

\[ s \approx \frac{\text{range}}{4} \]

Standard Deviation of a Sample

The standard deviation is the measure of variation most commonly used in statistics. It is a measure of how much data values deviate from the mean. The standard deviation found from sample data is a statistic denoted by \(s\).

The symbol for sample standard deviation is \(s\).

The symbol for population standard deviation is \(\sigma\).

The symbol for sample variance is \(s^2\).

The symbol for population variance is \(\sigma^{2}\).

The standard deviation is a measure of how much data values deviate from the mean. The value of the standard deviation is never negative. It is zero only when all of the data values are exactly the same. Larger values indicate greater amounts of variation. The standard deviation can increase dramatically with one or more outliers. The units of the standard deviation are the same as the units of the original data values.

Here are the steps to finding standard deviation:

  1. Find the mean of your data values
  2. Subtract the mean from each individual sample value
  3. Square each of the deviations obtained from the previous step
  4. Add all of the squares obtained from the previous step
  5. Divide the total from the previous step by n-1, which is 1 less than the total number of data values present
  6. Find the square root of the result of the previous step.
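The six steps translate directly into code. A minimal Python sketch with a made-up sample:

```python
import math

def sample_standard_deviation(values):
    n = len(values)
    mean = sum(values) / n                     # step 1: find the mean
    deviations = [x - mean for x in values]    # step 2: subtract the mean
    squares = [d ** 2 for d in deviations]     # step 3: square each deviation
    total = sum(squares)                       # step 4: add the squares
    variance = total / (n - 1)                 # step 5: divide by n - 1
    return math.sqrt(variance)                 # step 6: take the square root

print(sample_standard_deviation([3, 7, 7, 2, 9, 4, 7, 5]))
```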

Standard Deviation of a Population

A different formula is used to find the standard deviation of a population: we divide by the population size N instead of n-1. When using a calculator, make sure you know which kind of standard deviation it is giving you. The variance of a set of values is a measure of variation equal to the square of the standard deviation.

The units of the variance are the squares of the units of the original data values. The value of the variance can increase dramatically with the inclusion of outliers. So, the variance is not resistant. The value of the variance is never negative. It is zero only when all of the data values are the same number. 

In measuring variation in a set of sample data, it makes sense to begin with the individual amounts by which values deviate from the mean, and then to combine those deviations into one number that can serve as a measure of variation. We cannot simply add the deviations, because their sum is always zero. Instead, we can use the absolute values of the deviations. When we find the mean of that sum, we get the mean absolute deviation, which is the mean distance of the data from the mean.

Computation of the mean absolute deviation uses absolute values, so it uses an operation that is not algebraic. The use of absolute values would be simple but it would create algebraic difficulties in inferential statistics. The standard deviation has the advantage of using only algebraic operations. Because it is based on the square root of a sum of squares, the standard deviation closely parallels distance formulas found in algebra. There are many instances where a statistical procedure is based on a similar sum of squares. Consequently, instead of using absolute values, we square all deviations so that they are nonnegative and those squares are used to calculate the standard deviation. 

After finding all of the individual values we combine them by finding their sum. We then divide by n-1 because there are only n-1 values that can be assigned without constraint. With a given mean, we can use any numbers for the first n-1 values, but the last value will then be automatically determined. With division by n-1, sample variances tend to center around the value of the population variance. With division by n, sample variances tend to underestimate the value of the population variance.

A concept helpful in interpreting the value of the standard deviation is the empirical rule. This rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply:

  1. 68 percent of all values fall within 1 standard deviation of the mean
  2. 95 percent of all values fall within 2 standard deviations of the mean
  3. 99.7 percent of all values fall within 3 standard deviations of the mean

Another concept helpful in understanding the value of a standard deviation is Chebyshev’s theorem: for any data set, at least \(1 - \frac{1}{K^2}\) of the values lie within K standard deviations of the mean (for K greater than 1). The empirical rule applies only to data sets with bell-shaped distributions, but Chebyshev’s theorem applies to any data set. Because the results are only lower limits, this theorem has limited usefulness.

If the population mean is \(\mu\) and the population standard deviation is \(\sigma\), then the range rule of thumb for identifying significant values is as follows:

Significantly low values are \(\mu - 2\sigma\) or lower

Significantly high values are \(\mu + 2\sigma\) or higher.

Values that are not significant are between the previous two values.

Measures Of Relative Standing

Measures of relative standing are numbers showing the location of data values relative to the other values within the same data set.

A z score is found by converting a value to a standardized scale. This definition shows that a z score is the number of standard deviations that a data value is away from the mean.

The z score is calculated by using:

\[z = \frac{x - \bar{x}}{s}\]

Or

\[z = \frac{x - \mu}{\sigma}\]

A z score is the number of standard deviations that a given value is above or below the mean.

Z scores are expressed as numbers with no units of measurement.

A data value is significantly low if its z score is less than or equal to -2, and significantly high if its z score is greater than or equal to +2.

If an individual data value is less than the mean, its corresponding z score is a negative number.

A value is significantly low or significantly high if it is at least two standard deviations away from the mean. It follows that significantly low values have z scores less than or equal to -2 and significantly high values have z scores greater than or equal to +2. If a value is in between these values then it is not significant.
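A quick Python sketch of the z score and the significance check, with made-up numbers:

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

z = z_score(x=74, mean=65, std_dev=3)
print(z)                      # 3.0
print(z <= -2 or z >= 2)      # True: the value is significant
```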

A z score is a measure of position, in the sense that it describes the location of a value relative to the mean. Percentiles and quartiles are other measures of position useful for comparing values within the same data set or between different data sets.

Percentiles

Percentiles are one type of quantiles or fractiles which partition data into groups with roughly the same number of values in each group.

The 50th percentile has about 50% of the data values below and above it.

The process of finding the percentile that corresponds to a particular data value is given by the following formula:

\[\text{percentile} = \frac{\text{number of values less than x}}{\text{total number of values}}*100\]

Notation

  • n = total number of values in the data set
  • k = percentile being used, for example k=25
  • L = locator that gives the position of a value
  • \(P_k\) = kth percentile

Algorithm

Sort the data from lowest to highest.

Compute \(L=\frac{k}{100}*n\) where n= number of values and k= percentile in question.

Is L a whole number?

If yes, the value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find \(P_k\) by adding the Lth value and the next value and dividing the total by 2.

If no, change L by rounding it up to the next larger whole number.

The value of \(P_k\) is then the Lth value, counting from the lowest.
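The locator algorithm above can be sketched in Python as follows. This is a simple implementation under the stated rules; note that spreadsheets and statistics packages may use slightly different percentile conventions.

```python
import math

def kth_percentile(data, k):
    """Locator method for the kth percentile (assumes 1 <= k <= 99)."""
    values = sorted(data)               # sort from lowest to highest
    n = len(values)
    L = k / 100 * n                     # compute the locator
    if L == int(L):                     # L is a whole number:
        i = int(L)                      # average the Lth value and the next
        return (values[i - 1] + values[i]) / 2
    i = math.ceil(L)                    # otherwise round L up;
    return values[i - 1]                # P_k is the Lth value from the lowest

print(kth_percentile([2, 4, 5, 7, 8, 10, 12, 15], 25))   # 4.5
```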

Quartiles

Just as there are 99 percentiles that divide the data into 100 groups, there are three quartiles that divide the data into four groups.

Quartiles are measures of location, Q1,Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group. 

Interquartile range = \(Q_3 - Q_1\)

Semi-interquartile range = \(\frac{Q_3 - Q_1}{2}\)

Midquartile = \(\frac{Q_3 + Q_1}{2}\)

10-90 percentile range = \(P_{90} - P_{10}\)

Boxplots

The values of the minimum, maximum, and three quartiles are used for the summary and construction of boxplot graphs.

For a set of data the summary consists of these 5 values:

  • Minimum
  • First quartile, Q1
  • Second quartile, Q2
  • Third quartile, Q3
  • Maximum

A boxplot is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, the median, and the third quartile.
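A minimal Python sketch that computes the five-number summary with numpy. Note that np.quantile's default interpolation may give slightly different quartiles than the locator method described earlier.

```python
import numpy as np

# Hypothetical data set (40 is an outlier that stretches the whisker)
data = [2, 4, 5, 7, 8, 10, 12, 15, 40]

# Five-number summary: min, Q1, median (Q2), Q3, max
minimum, q1, q2, q3, maximum = np.quantile(data, [0, 0.25, 0.5, 0.75, 1])

iqr = q3 - q1   # interquartile range
print(minimum, q1, q2, q3, maximum, iqr)
```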

A boxplot can often be used to identify skewness, meaning a distribution that is not symmetric because it extends more to one side than the other.

Basics of Probability

An event is any collection of results or outcomes of a procedure.

A simple event is an outcome or an event that cannot be further broken down into simpler components.

The sample space for a procedure consists of all possible simple events. That is, the sample space consists of all outcomes that cannot be broken down any further.

Simple Events

With one birth, the result of 1 girl is a simple event and the result of 1 boy is another simple event. They are individual simple events because they cannot be broken down any further.

With three births, the result of 2 girls followed by a boy is a simple event.

When rolling a single die, the outcome of 5 is a simple event, but the outcome of an even number is not a simple event.

Simple Events and Sample Spaces

With three births, the event of 2 girls and 1 boy is not a simple event because it can occur with different simple events.

With three births, the sample space consists of the eight different simple events.

Probability plays a central role in the important statistical method of hypothesis testing. Statisticians make decisions using data by rejecting explanations based on very low probabilities.

In probability, we deal with procedures that produce outcomes.

Notation for Probabilities

P denotes a probability

A,B, and C denote specific events

P(A) denotes the probability of event A occurring

Three Approaches to Finding the Probability

  1. Relative Frequency Approximation - Conduct a procedure and count the number of times that event A occurs. P(A) is then approximated by: \(P(A) = \frac{\text{number of times A occurred}}{\text{number of times procedure repeated}}\)
  2. Classical Approach to Probability - If a procedure has n different simple events that are equally likely, and if event A can occur in s different ways, then: \(P(A)=\frac{\text{number of ways A occurs}}{\text{number of different simple events}}=\frac{s}{n}\)
  3. Subjective Probabilities - P(A), the probability of event A, is estimated by using knowledge of the relevant circumstances.

Simulations

Sometimes none of the preceding three approaches can be used. A simulation of a procedure is a process that behaves in the same ways as the procedure itself so that similar results are produced. Probabilities can sometimes be found by using a simulation.

Rounding Probabilities

When expressing the value of a probability, either give the exact fraction or decimal or round off final decimal results to three significant digits. When a probability is not a simple fraction such as \(\frac{2}{3}\), express it as a decimal so that the number can be better understood.

Law of Large Numbers

As a procedure is repeated again and again, the relative frequency probability of an event tends to approach the actual probability. It tells us that relative frequency approximations tend to get better with more observations. This law reflects a simple notion supported by common sense: a probability estimate based on only a few trials can be off by a substantial amount, but with a very large number of trials, the estimate tends to be much more accurate.
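A quick simulation in Python illustrates this: the relative frequency of heads in fair coin flips drifts toward the actual probability 0.5 as the number of trials grows.

```python
import random

# Estimate P(heads) by relative frequency over increasing numbers of trials.
for n in [10, 100, 10_000, 1_000_000]:
    heads = sum(random.random() < 0.5 for _ in range(n))
    print(n, heads / n)   # tends toward the actual probability 0.5 as n grows
```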

Don’t make the common mistake of finding a probability value by mindlessly dividing a smaller value by a larger number. Instead, think carefully about the numbers involved and what they represent. Carefully identify the total number of items being considered. 

Complementary Events

Sometimes we need to find the probability that an event does not occur. The complement of event A, denoted by \(\bar{A}\), consists of all outcomes in which event A does not occur.

Identifying Significant Results

If, under a given assumption, the probability of a particular observed event is very small and the observed event occurs significantly less than or significantly greater than what we typically expect with that assumption, we conclude that the assumption is probably not correct.

We can use probabilities to identify values that are significantly low or significantly high.

  1. High number of successes: x successes among n trials is a significantly high number of successes if the probability of x or more successes is 0.05 or less.
  2. Low number of successes: x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is 0.05 or less.

Odds

Expressions of likelihood are often given as odds, such as 50:1. Here are advantages of probabilities and odds:

  1. Odds make it easier to deal with money transfers associated with gambling.
  2. Probabilities make calculations easier, so they tend to be used by statisticians, mathematicians, scientists, and researchers in all fields.

In the three definitions that follow, the actual odds against and the actual odds in favor reflect the actual likelihood of an event, but the payoff odds describe the payoff amounts that are determined by gambling houses. 

The actual odds against event A occurring are the ratio \(P(\bar{A}) / P(A)\), usually expressed in the form of a:b, where a and b are integers.

The actual odds in favor of event A occurring are the ratio \(P(A) / P(\bar{A})\), which is the reciprocal of the actual odds against that event. If the odds against an event are a:b, then the odds in favor are b:a.

The payoff odds against event A occurring are the ratio of net profit (if you win) to the amount bet.

Payoff odds against event A = net profit:amount bet

If you bet $5 on the number 13 in roulette, your probability of winning is \(\frac{1}{38}\), but the payoff odds are given by the casino as 35:1.

With \(P(13) = \frac{1}{38}\) and \(P(\text{not } 13) = \frac{37}{38}\), the actual odds against 13 are \(\frac{37/38}{1/38}\), or 37:1.


Addition and Multiplication of Probabilities

Addition Rule

The addition rule is a tool for finding P(A or B), which is the probability that either event A occurs or event B occurs as the single outcome of a procedure. The word “or” in the addition rule is associated with the addition of probabilities.

Multiplication Rule

This section also presents the basic multiplication rule used for finding P(A and B), which is the probability that event A occurs and event B occurs. The word “and” in the multiplication rule is associated with the multiplication of probabilities.

Compound Event

A compound event is any event combining two or more simple events.

Addition Rule

Here is the notation for the addition rule. P(A or B) = P(in a single trial, event A occurs or event B occurs or they both occur).

Intuitive Addition Rule

To find P(A or B), add the number of ways event A can occur and the number of ways event B can occur, but add in such a way that every outcome is counted only once. P(A or B) is equal to that sum, divided by the total number of outcomes in the sample space.

Formal Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)

Where P(A and B) denotes the probability that A and B both occur at the same time as an outcome in a trial of a procedure.

Disjoint Events and the Addition Rule

Events A and B are disjoint or mutually exclusive if they cannot occur at the same time. That is, disjoint events do not overlap.

Disjoint events:

Event A - Randomly selecting someone for a clinical trial who is a male.

Event B - Randomly selecting someone for a clinical trial who is a female.

Events that are not disjoint:

Event A - Randomly selecting someone taking a statistics course.

Event B - Randomly selecting someone who is a female.

Complementary Events and the Addition Rule

We use \(\bar{A}\) to indicate that event A does not occur. Common sense dictates this principle: we are certain, with probability 1, that either event A occurs or it does not occur, so \(P(A \text{ or } \bar{A}) = 1\). Because events \(A\) and \(\bar{A}\) must be disjoint, we can use the addition rule to express this principle as follows:

\[P(A \text{ or } \bar{A}) = P(A) + P(\bar{A}) = 1 \]

Rule of Complementary Events

\[ P(A) + P(\bar{A}) = 1 \]

\[ P(\bar{A}) = 1 - P(A) \]

\[ P(A) = 1 - P(\bar{A}) \]

Multiplication Rule

P(A and B) = P(event A occurs in a first trial and event B occurs in a second trial)

P(B | A) represents the probability of event B occurring after it is assumed that event A has already occurred.


Intuitive Multiplication Rule

To find the probability that event A occurs in one trial and event B occurs in another trial, multiply the probability of event A by the probability of event B, but be sure that the probability of event B is found by assuming that event A has already occurred.

Formal Multiplication Rule

P(A and B) = P(A) * P(B | A)

Independence and the Multiplication Rule

Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other. Several events are independent if the occurrence of any does not affect the probabilities of the occurrence of the others. If A and B are not independent, they are said to be dependent.

Sampling

In the world of statistics, sampling methods are critically important.

Sampling with replacement: Selections are independent events.

Sampling without replacement: Selections are dependent events.

Treating Dependent Events as Independent

When sampling without replacement and the sample size is no more than 5% of the size of the population, treat the selections as being independent, even though they are actually dependent.
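A small Python sketch, with a made-up population, shows why this guideline works: the exact probability computed with shrinking counts is nearly identical to the independent approximation.

```python
# Hypothetical population: 10,000 items, 3,000 of which are "successes".
# Select 5 without replacement and find P(all 5 are successes).
pop, succ, n = 10_000, 3_000, 5

exact = 1.0
for i in range(n):                  # dependent: counts shrink with each draw
    exact *= (succ - i) / (pop - i)

approx = (succ / pop) ** n          # treat the draws as independent

print(exact, approx)                # nearly equal: n is under 5% of population
```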

Redundancy

The principle of redundancy is used to increase the reliability of many systems. Our eyes have passive redundancy in the sense that if one of them fails, we continue to see. An important finding of modern biology is that genes in an organism can often work in place of each other. Engineers often design redundant components so that the whole system will not fail because of the failure of a single component.

 

When randomly selecting an adult, A denotes the event of selecting someone with blue eyes. What do \(P(A)\) and \(P(\bar{A})\) represent?

\(P(A)\) represents the probability of selecting an adult with blue eyes.

\(P(\bar{A})\) represents the probability of selecting an adult who does not have blue eyes.

 

There are 15,958,866 adults in a region. If a polling organization randomly selects 1235 adults without replacement, are the selections independent or dependent? If the selections are dependent, can they be treated as independent for the purposes of calculations?

The selections are dependent because the selection is done without replacement.

Yes, because the sample size is less than 5% of the population.

When randomly selecting an adult, let B represent the event of randomly selecting someone with type B blood. Write a sentence describing what the rule of complements below is telling us.

\(P(B \text{ or } \bar{B}) = 1\)

 It is certain that the selected adult has type B blood or does not have type B blood.

A research center poll showed that 76% of people believe that it is morally wrong to not report all income on tax returns. What is the probability that someone does not have this belief?

1 - .76 = .24

Find the indicated complement.

A certain group of women has a 0.2% rate of red/green color blindness. If a woman is randomly selected, what is the probability that she does not have this color blindness?

The rate of 0.2% corresponds to a probability of .002, so the probability that she does not have this color blindness is 1 - .002 = .998.

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.

                     A    B    C    D
Order accurate      316  266  250  125
Order not accurate   32   56   37   20

If one order is selected, find the probability of getting food that is not from restaurant A.

Add up all of B, C, and D, then divide by the total of A, B, C, and D.

754/1102 = .684

Use the data in the following table which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.

If one order is selected, find the probability of getting an order that is not accurate.

Add up incorrect orders and then total orders

                     A    B    C    D
Order accurate      320  260  236  149
Order not accurate   39   59   32   12

142/1107= .128

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains. 

                     A    B    C    D
Order accurate      321  280  244  129
Order not accurate   39   51   30   14

If one order is selected, find the probability of getting an order from restaurant A or an order that is accurate. Are the events of selecting an order from restaurant A and selecting an accurate order disjoint events?

The formal addition rule is \( P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \)

Accurate orders =974

Inaccurate orders from restaurant A=39

Add together to get 1013

1013/1108 = .914

The events are not disjoint, because an order can be both from restaurant A and accurate.

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.

                     A    B    C    D
Order accurate      367  255  206  176
Order not accurate   45   53   22   28

If two orders are selected, find the probability that they are both from restaurant D:

  1. Assume that the selections are made with replacement. Are the events independent?
  2. Assume that the selections are made without replacement. Are the events independent?

\[ P(A \text{ and } B) = P(A) * P(B \mid A) \]

Calculate total orders from all restaurants

Calculate orders from restaurant D

Divide orders from restaurant D by the total number of orders. This gives \(P(A)\)

  1. Assume that the selections are made with replacement

The events are independent and probability of event B stays the same regardless of event A

So, \( P(A \text{ and } B) = \frac{204}{1152} * \frac{204}{1152} = .0314 \)

  2. Assume that the selections are made without replacement.

The probability of event A will be the same \(\frac{204}{1152}\)

When replacements are not used, the events are not independent and the probability of event B changes depending on the outcome of event A.

Since event A was selecting an order from D and the selected order does not get replaced, the number of orders from D and the total number of orders each decrease by 1 when choosing event B.

So:

\[ P(A) = \frac{204}{1152} \text{ and } P(B \mid A) = \frac{204-1}{1152-1} \]

Multiply the probability of event A by the conditional probability of event B:

\[ P(A \text{ and } B) = .0312 \]

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.

                     A    B    C    D
Order accurate      323  267  241  128
Order not accurate   30   55   34   12

If two orders are selected, find the probability that they are both accurate.

  1. Assume that the selections are made with replacement. Are the events independent?

Calculate total number of orders: 1090

Accurate orders: 959

\[\frac{959}{1090} * \frac{959}{1090} = .7741 \]

With replacement, the selections are independent.

  2. Assume that the selections are made without replacement. Are the events independent?

Because the selections are made without replacement, the events are dependent events. 

The probability of each order being accurate is affected by the other orders.

The probability \(P(A)\) remains the same as in part 1.

The probability \(P(B|A)\) must be adjusted to reflect that the first order was accurate and is not available for the second order.

Recall that originally there were 959 accurate orders out of 1090.

After the first accurate order is selected, there are 1089 orders remaining, of which 958 are accurate.

\[ P(A \text{ and } B) = \frac{959}{1090} * \frac{958}{1089} = .7740 \]

The events are not independent because the sampling is done without replacement.

Use the data in the following table.

                     A    B    C    D
Order accurate      321  260  243  121
Order not accurate   35   52   32   14

If three orders are selected with replacement, find the probability that they are all from B.

\[(312 / 1078)^3 = .0242 \]

Use the following results from a test for marijuana use, which is provided by a certain drug testing company. Among 145 subjects with positive test results, there are 29 false positive results. Among 157 negative results, there are 3 false negative results.

  1. How many subjects were in the study?

                  Did not use   Used
Positive result        29        116
Negative result       154          3

Add the subjects who tested positive to those who tested negative: 145 + 157 = 302.

How many subjects did not use marijuana? 29 + 154 = 183.

What is the probability that a randomly selected subject did not use marijuana? 183/302 = .606.

Among 132 subjects with positive test results, there are 32 false positive results

Among 168 negative results, there are 8 false negative results.

If one of the test subjects is randomly selected, find the probability that the subject tested negative or did not use marijuana.

                  Did not use   Used
Positive result        32        100
Negative result       160          8

Total subjects=300

Next, find the probability that a randomly selected subject tested negative 

168/300

Now, find the number of subjects that did not use marijuana

Two groups did not use marijuana. True negatives and the false positives

160+32=192

Next, find the probability that a randomly selected test subject did not use marijuana.

Did not use=192/300

Next, find the probability that a randomly selected test subject tested negative and did not use it

160/300

Finally, use the formal addition rule to find the probability that a randomly selected subject tested negative or did not use it, rounding to 3 decimal places

168/300+192/300-160/300 = .667

The principle of redundancy is used when system reliability is improved through redundant components. Assume that a student’s alarm has a 16.0% daily failure rate.

  1. What is the probability that the student’s alarm clock will not work on the morning of an important exam?

To convert a percentage to a decimal number, remove the % symbol and divide by 100.

For the stated failure rate of 16% remove the percent symbol and divide by 100.

16/100 = .160

So, the probability that the student’s alarm clock will not work on the morning of an important exam is .160.

  2. If the student has two such alarm clocks, what is the probability that they both fail on the morning of an important exam?

Use the formal multiplication rule, which states that if P(A) is the probability of event A occurring and P(B|A) is the probability of B occurring given that A has occurred, the probability of both A and B occurring is given by:

\[P(A \text{ and } B)=P(A)*P(B \mid A)\]

The functioning of the second alarm clock is not affected by the failure of the first, so by definition they are independent events.

Multiply the two failure probabilities together.

.160*.160=.0256

  3. What is the probability of not being awakened if the student uses three independent alarm clocks?

A * B * C = .160*.160*.160= .00410

  4. Do the second and third alarm clocks result in greatly improved reliability?

Compare the probability of one alarm clock not working to the probabilities of 2 or 3 alarm clocks not working. In general, an event that occurs with probability 1 is called certain, an event occurring with probability .05 or less is called unlikely, and an event occurring with probability 0 is called impossible. Since .0256 and .00410 are both far smaller than .160, yes, the second and third alarm clocks result in greatly improved reliability.

Surge protectors p and q are used to protect a television. If there is a surge in the voltage, the surge protector reduces it to a safe level. Assume that each surge protector has a .88 probability of working correctly when a voltage surge occurs.

  1. If the two surge protectors are arranged in a series, what is the probability that a voltage surge will not damage the television?

With two independent surge protectors in series, the television will be protected unless both surge protectors fail. In other words, at least one surge protector needs to work. Find the probability that the television is protected by calculating 1 - P(both p and q fail). The probability that both fail can be found by applying the multiplication rule for independent events.

\[P(A \text{ and } B)=P(A)*P(B)\]

The probability that a surge protector works correctly is .88. The probability that a surge protector fails is calculated below.

1-.88=.12

The probability that one surge protector fails is .12. The probability that both surge protectors fail is the product of the probabilities that either one fails.

.12*.12=.0144

There is a .0144 probability that both surge protectors fail. The probability that the television is protected in a series configuration is the complement of the probability that both fail.

1-.0144=.9856

  2. If the two surge protectors are arranged in parallel, what is the probability that a voltage surge will not damage the television?

With two independent surge protectors in parallel, the television will be protected as long as both surge protectors work. The probability that the two independent surge protectors both work is found by applying the multiplication rule for independent events.

\[P(A \text{ and } B)=P(A)*P(B)\]

The probability that a surge protector works correctly is .88. The probability that both surge protectors work is the product of the probabilities that both work correctly.

.88*.88=.7744

  3. Which arrangement should be used for better protection?

The series arrangement, because .9856 is greater than .7744.

Complements and Conditional Probability

Complements

When finding the probability of some event occurring at least once, we should understand that at least one has the same meaning as one or more. The complement of getting at least one particular event is that you get no occurrences of that event.

Finding the probability of getting at least one of some event:

  1. Let A = getting at least one of some event.
  2. Then \(\bar{A}\) = getting none of the event being considered.
  3. Find \(P(\bar{A})\) = the probability that event A does not occur.
  4. Subtract the result from 1.
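A one-line illustration of these steps in Python, using an assumed example of getting at least one six in four rolls of a fair die:

```python
# P(at least one) = 1 - P(none).
# Example: at least one six in four rolls of a fair die.
p_none = (5 / 6) ** 4             # all four rolls miss the six
p_at_least_one = 1 - p_none
print(round(p_at_least_one, 3))   # 0.518
```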

Conditional Probability

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.

\(P(B \mid A)\) denotes the conditional probability of event B occurring, given that event A has already occurred.

Intuitive Approach For Finding P(B|A)

The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur.

Formal Approach For Finding P(B|A)

The probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.

\[P(B \mid A)=\frac{P(A \text{ and } B)}{P(A)}\]

The preceding formula is a formal expression of conditional probability, but blind use of formulas is not recommended. The intuitive approach is recommended.
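A minimal Python sketch of the formal approach, using hypothetical two-way counts (the same shape of data as the worked examples later in these notes):

```python
# Conditional probability from counts: P(B | A) = P(A and B) / P(A).
# Hypothetical counts: 100 subjects, 50 given quarters, 37 of whom spent them.
n_total = 100
n_A = 50                   # given four quarters
n_A_and_B = 37             # given four quarters AND spent the money

p_A = n_A / n_total
p_A_and_B = n_A_and_B / n_total
p_B_given_A = p_A_and_B / p_A
print(p_B_given_A)         # 0.74 -- same as 37/50 by the intuitive approach
```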

Bayes’ Theorem

The importance and usefulness of Bayes’ theorem is that it can be used with sequential events, whereby new additional information is obtained for a subsequent event, and that new information is used to revise the probability of the initial event. In this context, the terms prior probability and posterior probability are commonly used.

A prior probability is an initial probability value originally obtained before any additional information is obtained.

A posterior probability is a probability value that has been revised by using additional information that is later obtained.
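A small Python sketch of a prior being revised into a posterior, with assumed numbers for a disease test (1% prevalence, 95% sensitivity, 5% false positive rate):

```python
# Bayes' theorem: revise the prior P(disease) after a positive test result.
p_disease = 0.01                        # prior probability
p_pos_given_disease = 0.95              # sensitivity
p_pos_given_no_disease = 0.05           # false positive rate

# Total probability of a positive result
p_pos = (p_pos_given_disease * p_disease
         + p_pos_given_no_disease * (1 - p_disease))

# Posterior: probability of disease given a positive test
p_disease_given_pos = p_pos_given_disease * p_disease / p_pos
print(round(p_disease_given_pos, 3))    # 0.161
```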

Multiplication Counting Rule

The multiplication counting rule is used to find the total number of possibilities from some sequence of events. For a sequence of events in which the first event can occur \(n_1\) ways, the second event can occur \(n_2\) ways, and so on, the total number of outcomes is \(n_1 \cdot n_2 \cdot n_3 \cdots\).

Factorial Rule

The factorial rule is used to find the total number of ways that n different items can be rearranged. The factorial symbol (!) denotes the product of decreasing positive whole numbers. The factorial rule states that the number of different arrangements of n different items, when all n of them are selected, is n!. The rule is based on the principle that the first item may be selected n different ways, the second item may be selected n-1 ways, and so on. This is really the multiplication counting rule modified for the elimination of one item on each selection.

Permutations and Combinations

When using different counting methods, it is essential to know whether different arrangements of the same items are counted only once or are counted separately. The terms permutations and combinations are standard in this context. 

Permutations of items are arrangements in which different sequences of the same items are counted separately.

Combinations of items are arrangements in which different sequences of the same items are counted as being the same.

Permutations Rule

The permutation rule is used when there are n different items available for selection, we must select r of them without replacement, and the sequence of the items matters. The result is the total number of arrangements that are possible. Remember, rearrangements of the same items are counted as different permutations.

\[{}_nP_r=\frac{n!}{(n-r)!}\]

When n items are all selected without replacement, but some items are identical, the number of possible permutations is found by using the following rule:

\[\frac{n!}{n_1!n_2!...n_k!}\]

Combinations Rule

The combinations rule is used when there are n different items available for selection, only r of them are selected without replacement, and order does not matter. The result is the total number of combinations that are possible. Remember, rearrangements of the same items are considered to be the same combination.

\[{}_nC_r=\frac{n!}{(n-r)!\,r!}\]
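Python's math module implements all three counting rules directly (math.perm and math.comb require Python 3.8 or later):

```python
import math

# Factorial rule: arrangements of all n different items
print(math.factorial(5))   # 5! = 120

# Permutations rule: order matters, select r of n without replacement
print(math.perm(10, 3))    # 10!/(10-3)! = 720

# Combinations rule: order does not matter
print(math.comb(10, 3))    # 10!/((10-3)! * 3!) = 120
```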

Find the probability that when a couple has three children, at least one of them is a girl. Assume that boys and girls are equally likely.

For each event there are two possibilities. There are 3 events.

½*½*½ = ⅛

1-⅛=⅞

In a certain country, the true probability of a baby being a girl is .509. Among the next six randomly selected births in the country, what is the probability that at least one of them is a girl?

The probability of at least one can be computed using the rule of complements. Let A represent the event that at least one of the next six births is a girl. Use the rule of complements below to find the probability of event A, P(A), where \(\bar{A}\) is the complement of A.

\[P(A)=1-P(\bar{A})\]

The complement of A, \(\bar{A}\), is the event that the next six births are all boys.

Since each birth has no effect on any of the other births, the births are all independent events. The probability that the next six births are all boys can be found using the multiplication rule for independent events. The probability of the event can be written as shown below:

It is given that the probability of a birth being a girl is .509, so the probability of a birth being a boy is 1 - .509 = .491.

Use the multiplication rule for independent events to find the probability that the next six births are all boys. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to 6 independent events.

.491*.491*.491*.491*.491*.491 = .014

Then use the rule of complements to find the probability of at least one girl.

1 - .014 = .986

Therefore, the probability that the next six randomly selected births include at least one girl is .986.

Subjects for the next presidential election poll are contacted using telephone numbers in which the last four digits are randomly selected​ (with replacement). Find the probability that for one such phone​ number, the last four digits include at least one 0.

10^4 - 9^4 = 3439 (the number of four-digit sequences that include at least one 0, since 9^4 counts the sequences with no 0)

10^4 = 10000 (the total number of four-digit sequences)

3439/10000 = .344

Based on a poll, 72% of internet users are more careful about personal information when using a public wi-fi hotspot. What is the probability that among three randomly selected internet users, at least one is more careful about personal information when using a public wi-fi hotspot? How is the result affected by the additional information that the survey subjects volunteered to respond to?

The probability of at least one can be computed using the rule of complements. The rule of complements states that the following expression is true for events A and \(\bar{A}\), where \(\bar{A}\) indicates that event A did not take place.

\[P(A)=1-P(\bar{A})\]

Identify the event that is the complement of A

\[\bar{A} = \text{none of the internet users are more careful}\]

To find the probability of the complement, first find the probability that an internet user is not more careful with personal information while using a public wi-fi hotspot.

1-P(is more careful)

1 - .72 = .28

Find the probability of the complement using the multiplication rule for independent events, rounding to three decimal places. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to three independent events.

.28 * .28 * .28 = .022

Now use the rule of complements

1-.022 = .978

Because the survey subjects volunteered to respond, the sample is a voluntary response sample, and the result might not be representative of the population of internet users.

In an experiment, college students were given either four quarters or a $1 bill and they could either keep the money or spend it on gum.

                      Purchased Gum   Kept the Money
Given four quarters         37               13
Given $1 bill               11               39

  1. Find the probability of randomly selecting a student who spent the money, given that the student was given four quarters.

The conditional probability of B occurring given that A has occurred, P(B|A), can be found intuitively by assuming that event A has occurred and then calculating the probability that event B will occur.

More formally, the probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.

In this case, given four quarters corresponds to event A and spent the money corresponds to event B.

First determine the number of students given four quarters that spent the money

37 students

Now calculate the probability

37/50=.74

  2. Find the probability of randomly selecting a student who kept the money given that the student was given four quarters.

Recall that 50 students were given four quarters

Identify the number of students given four quarters that kept the money

13 students

Now calculate the probability

13/50=.26

Note that since the students either kept the money or spent the money, these probabilities are complements.

.26=1-.74

  3. What do the preceding results suggest?

Compare the probabilities found in the first two parts.

Spent the money = .74

Kept the money = .26

Since .74 > .26, P(spent the money | given four quarters) has the greater probability, suggesting that students given four quarters were more likely to spend the money.

The accompanying table shows the results from a test for a certain disease. Find the probability of selecting a subject with a negative test result, given that the subject has the disease. What would be an unfavorable consequence for this error?

                  Has disease   Does not have disease
Positive result       357                26
Negative result        18              1150

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred. P(B|A) denotes the conditional probability of event B occurring, given that event A has already occurred.

A is the event which is known to have occurred. The given event is “the individual has the disease”.

B is the event for which the probability is sought. The event is “the individual tests negative for the disease”.

The conditional probability of B given A can be found by assuming that event A has occurred and, working under that assumption, calculating the probability that event B will occur.

First, determine the number of individuals who have the disease. Add all the values in the indicated column.

357+18=375

From the table, there are 18 individuals who have the disease and test negative. Divide to find the probability

18/375=.048

Therefore, the probability that a randomly selected individual who has the disease tests negative is .048.

To determine an unfavorable consequence of this error, consider a subject that has the disease but with a negative test result.

Note that a negative test result would lead the subject to believe that they do not have the disease, so the disease might go untreated.

The table below displays results from experiments with polygraph instruments. Find the positive predictive value for the test. That is, find the probability that the subject lied, given that the test yields a positive result.

                  Did not lie   Lied
Positive result        9          46
Negative result       30          13

Use the intuitive approach to conditional probability. The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur. Find the probability of selecting a subject who lied, given that the selected subject had a positive test result. If it is assumed that the subject had a positive test result, then only the 9+46=55 subjects in the top row of the table are to be used. Among those 55 subjects, 46 actually lied.

Divide the number of subjects who had a positive test result and actually lied by the total number of subjects who had a positive test result to find the probability, rounding to three decimal places.

46/55=.836

Assume that there is a 12% rate of disk drive failure in a year.

  1. If all your computer data is stored on a hard disk with a copy stored on a second hard disk drive, what is the probability that during a year, you can avoid catastrophe with at least one working drive?
  2. If copies of all your computer data are stored on three independent disk drives, what is the probability that during a year, you can avoid catastrophe with at least one working drive?

For part 1, use the rule of complements shown below to find the probability that you can avoid catastrophe. Let A = at least one hard drive works correctly.

\[P(A)=1-P(\bar{A})\]

Identify the event that is the complement of A

\[\bar{A} = \text{both hard drives fail}\]

Since the two hard drives operate separately, their failures are independent events. Use the multiplication rule for independent events to find the probability of the complement of event A. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. The probability of any one of the hard drives failing to work correctly is 0.12

\[ P(\bar{A})= .12*.12=.0144 \]

Now find P(A) by evaluating \(1-P(\bar{A})\)

1-.0144 = .9856

For part 2, again let A = at least one hard drive works correctly.

\[P(\bar{A}) = .12 * .12 * .12 = .001728 \]

Now find P(A) by evaluating \(1-P(\bar{A})\)

1-.001728 = .998272

Probability Distributions

Basic Concepts

A random variable is a variable that has a single numeric value, determined by chance, for each outcome of a procedure.

A probability distribution is a description that gives the probability for each value of the random variable. It is often expressed in the format of a table, formula, or graph.

A discrete random variable has a collection of values that is finite or countable. If there are infinitely many values, the number of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting heads.

A continuous random variable has infinitely many values, and the collection of values is not countable. That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as body temperatures.

Probability Distribution Requirements

Every probability distribution must satisfy each of the following three requirements.

  1. There is a numerical random variable, and its number values are associated with corresponding probabilities.
  2. \(\Sigma P(x)=1\) where x assumes all possible values.
  3. \(0 \leq P(x) \leq 1\) for every individual value of the random variable x. That is, each probability value must be between 0 and 1 inclusive.

The second requirement comes from the simple fact that the random variable x represents all possible events in the entire sample space, so we are certain that one of the events will occur. The third requirement comes from the basic principle that any probability value must be 0 or 1 or a value between 0 and 1.

 

As an example, let x be the number of heads in two coin tosses, so x takes the values 0, 1, and 2 with probabilities .25, .50, and .25. This x is a random variable because its numerical values depend on chance. The variable x is a numerical random variable, and its values are associated with probabilities. \(\Sigma P(x)=.25+.50+.25=1\). Each value of P(x) is between 0 and 1. The random variable x is a discrete random variable, because it has three possible values and three is a finite number.

Notation for 0+

In tables of binomial probabilities, we recommend using 0+ to represent a probability value that is positive but very small, such as .0000000123. When rounding a probability value for inclusion in such a table, rounding to 0 would be misleading because it would incorrectly suggest the event is impossible.

Probability Histogram

There are various ways to graph a probability distribution, but for now we will consider only the probability histogram. 

Parameters of a Probability Distribution

Remember that with a probability distribution, we have a description of a population instead of a sample, so the values of the mean, standard deviation, and variance are parameters, not statistics. The mean, variance, and standard deviation of a discrete probability distribution can be found with the following formulas.

This is the mean for a probability distribution:

\[ \mu = \sum [x * P(x)] \]

Variance for a probability distribution that should be easier to understand:

\[\sigma^2 = \Sigma[(x - \mu)^2 * P(x)]\]

Variance for probability distribution that is good for manual calculations:

\[\sigma^2 = \Sigma[x^2*P(x)] - \mu^2 \]

Standard deviation for probability distribution:

\[\sigma = \sqrt{\Sigma[x^2*P(x)] - \mu^2}\]
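These formulas are straightforward to compute. A minimal Python sketch using the girls-in-three-births probabilities that appear in Example 1 below:

```python
import math

# Discrete probability distribution: number of girls in three births
x_values = [0, 1, 2, 3]
probs = [0.125, 0.375, 0.375, 0.125]

mu = sum(x * p for x, p in zip(x_values, probs))               # mean
var = sum(x**2 * p for x, p in zip(x_values, probs)) - mu**2   # variance
sigma = math.sqrt(var)                                         # standard deviation
print(mu, var, sigma)   # 1.5 0.75 0.866...
```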

Expected Value

The mean of a discrete random variable is the theoretical mean outcome for infinitely many trials. We can think of that mean as the expected value in the sense that it is the average value that we would expect to get if the trials could continue indefinitely.

The expected value of a discrete random variable is denoted by E, and it is the mean value of the outcomes, so \(E=\mu\), and E can also be found by evaluating \(\Sigma[x*P(x)]\).

An expected value need not be a whole number, even if the different possible values of x might all be whole numbers. The expected number of girls in five births is 2.5, even though five particular children can never result in 2.5 girls. If we were to survey many couples with 5 children, we expect that the mean number of girls will be 2.5.

Making Sense of Significant Values

We present the following two different approaches for determining whether a value of a random variable is significantly low or high.

Range Rule of Thumb

The range rule of thumb may be helpful in interpreting the value of a standard deviation. According to the range rule of thumb, the vast majority of values should lie within 2 standard deviations of the mean, so we can consider a value to be significant if it is at least 2 standard deviations away from the mean. We can identify significant values as follows:

  1. Significantly low values are \(\mu-2\sigma\) or lower
  2. Significantly high values are \(\mu+2\sigma\) or higher
  3. Values not significant are between the previous two conditions

Know that the use of the number 2 in the range rule of thumb is somewhat arbitrary and this is a guideline, not an absolutely rigid rule.

Identifying Significant Results With Probabilities

x successes among n trials is a significantly high number of successes if the probability of x or more successes is .05 or less. That is, x is a significantly high number of successes if \(P(x \text{ or more}) \leq .05\).

x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is .05 or less. That is, x is a significantly low number of successes if \(P(x \text{ or fewer}) \leq .05\).

The Rare Event Rule For Inferential Statistics

If, under a given assumption, the probability of a particular outcome is very small and the outcome occurs significantly less than or significantly greater than what we expect with that assumption, we conclude that the assumption is probably not correct.

For example, if testing the assumption that boys and girls are equally likely, the outcome of 20 girls in 100 births is significantly low and would be a basis for rejecting that assumption.

Expected Value and Rationale for Formulas

Earlier we noted that the expected value of a random variable is equal to the mean. We can therefore find the expected value by computing \(\Sigma[x*P(x)]\), just as we do for finding the value of \(\mu\). We also noted that the concept of expected value is used in decision theory. 

Rationale for Earlier Formulas

Instead of blindly accepting and using formulas, it is much better to have some understanding of why they work. When computing the mean from a frequency distribution, f represents class frequency and N represents population size. In the fraction f/N, the value of f is the frequency with which the value x occurs and N is the population size, so f/N is the probability for the value of x. When we replace f/N with P(x), we make the transition from relative frequency based on a limited number of observations to probability based on infinitely many trials.

Example 1

 The table below lists probabilities for the corresponding numbers of girls in three births. What is the random variable, what are its possible values, and are its values numerical?

Girls(x) P(x)

0 0.125

1 0.375

2 0.375

3 0.125

The random variable is x, which is the number of girls in three births. The possible values of x are 0, 1, 2, and 3. The values of the random variable x are numerical.

Example 2

Is the random variable given in the accompanying table discrete or continuous?

Girls(x) P(x)

0 0.063

1 0.250

2 0.375

3 0.250

4 0.063

The random variable given in the accompanying table is discrete because there are a finite number of values.

Example 3

For 100 births, P(exactly 56 girls)=0.0390 and P(56 or more girls)=0.136. Is 56 girls in 100 births a significantly high number of girls? Which probability is relevant to answering that question? Consider a number of girls to be significantly high if the appropriate probability is 0.05 or less.

The relevant probability is P(56 or more girls), so 56 girls in 100 births is not a significantly high number of girls because the relevant probability is greater than 0.05.

Example 4

Five males with an x-linked genetic disorder have one child each. The random variable x is the number of children among the five who inherit the x-linked genetic disorder. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If a probability distribution is not given, identify the requirements that are not satisfied.

x   P(x)
0   0.024
1   0.167
2   0.309
3   0.309
4   0.167
5   0.024

The random variable x is numerical because x takes on the integer values from 0 to 5.

The numerical values are associated with probabilities because each value of x has a corresponding value of P(x) in the next column of the table. The probabilities are each between 0 and 1 and sum to 1, so the table gives a probability distribution.

The mean for a probability distribution is given by the formula below.

\[\mu = \Sigma[x*P(x)]\]

Find each product of x and P(x), then add the products:

0 + .167 + .618 + .927 + .668 + .120 = 2.5

\[\mu=2.5\]

The standard deviation for a probability distribution is given by the formula below.

\[\sigma=\sqrt{\Sigma[x^2*P(x)]-\mu^2}\]

Create another table for \(x^2\) and \(x^2*P(x)\):

x^2   x^2 * P(x)
0     0
1     .167
4     1.236
9     2.781
16    2.672
25    .600

Sum = 7.456

Substitute into the formula:

\[\sqrt{7.456-2.5^2}=1.1\]
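The same arithmetic scripts easily. This sketch recomputes the Example 4 mean and standard deviation with the two formulas above (plain Python; the table values are copied from the example):

```python
import math

xs = [0, 1, 2, 3, 4, 5]
ps = [0.024, 0.167, 0.309, 0.309, 0.167, 0.024]

mu = sum(x * p for x, p in zip(xs, ps))                           # 2.5
sigma = math.sqrt(sum(x**2 * p for x, p in zip(xs, ps)) - mu**2)  # about 1.1

print(round(mu, 1), round(sigma, 1))  # 2.5 1.1
```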

Example 5

When conducting research on color blindness in males, a researcher forms random groups with five males in each group. The random variable x is the number of males in the group who have a form of color blindness. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If not, state why.

x   P(x)
0   .657
1   .284
2   .053
3   .005
4   .001
5   .000

The probabilities are each between 0 and 1 and sum to 1, so a probability distribution is given.

Find the mean of the random variable x

0+.284+.106+.015+.004+0=.409

Find the standard deviation of the random variable x

0+(1^2*.284)+(2^2*.053)+(3^2*.005)+(4^2*.001)+(5^2*0)=.557

\[\sqrt{.557-.409^2}=.6243\]

Example 6

Look at the next table. Determine whether a probability distribution is given. If it is, find the mean and standard deviation. If not, state why.

x   P(x)
0   .001
1   .009
2   .034
3   .056

Does the table show a probability distribution?

No, the table does not show a probability distribution: the probabilities sum to only 0.1, not 1.
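Both requirements for a probability distribution (every P(x) between 0 and 1, and the probabilities summing to 1) are easy to check by script. A sketch; the rounding tolerance is my own assumption, since published tables are often rounded:

```python
def is_probability_distribution(ps, tol=0.001):
    """True if every P(x) lies in [0, 1] and the total is 1 within tol."""
    return all(0 <= p <= 1 for p in ps) and abs(sum(ps) - 1) <= tol

print(is_probability_distribution([0.001, 0.009, 0.034, 0.056]))  # False: sums to 0.1
print(is_probability_distribution([0.094, 0.347, 0.395, 0.164]))  # True (Example 7 below)
```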

Example 7

Look at the following table.

x   P(x)
0   .094
1   .347
2   .395
3   .164

Does the table show a probability distribution?

Yes, the table shows a probability distribution

Find the mean of the random variable x

(0)+(.347)+(2*.395)+(3*.164)=1.629

Find the standard deviation of x

0+.347+(4*.395)+(9*.164)=3.403

\[\sqrt{3.403-1.629^2}=.8656\]

Example 8

Look at the following table

x   P(x)
0   .365
1   .431
2   .178
3   .026

Does the table show a probability distribution?

Yes, the table shows a probability distribution

Find the mean of the random variable x

0+.431+(2*.178)+(3*.026)=.865

Find the standard deviation of x

0+.431+(4*.178)+(9*.026)=1.377

\[\sqrt{1.377-.865^2}=.7929\]

Example 9

Look at the table below

x   P(x)
0   .002
1   .035
2   .111
3   .221
4   .272
5   .211
6   .116
7   .027
8   .005

Find the mean

0+.035+(2*.111)+(3*.221)+(4*.272)+(5*.211)+(6*.116)+(7*.027)+(8*.005)=3.988

Find the standard deviation

0+.035+(2^2*.111)+(3^2*.221)+(4^2*.272)+(5^2*.211)+(6^2*.116)+(7^2*.027)+(8^2*.005)=17.914

\[\sqrt{17.914-3.988^2}=1.4\]

Example 10

The following table describes results from groups of 10 births from 10 different sets of parents. The random variable x represents the number of girls among 10 children. Use the range rule of thumb to determine whether 1 girl in 10 births is a significantly low number of girls.

x    P(x)
0    .005
1    .010
2    .046
3    .113
4    .194
5    .241
6    .211
7    .111
8    .039
9    .020
10   .010

The range rule of thumb for identifying significant values is shown below.

Significantly low values are \(\mu-2\sigma\) or lower

Significantly high values are \(\mu+2\sigma\) or higher

Values between these are not significant

To find the range of values that are not significant, first find the mean and standard deviation

Let us start with the mean

0+.010+.092+.339+.776+1.205+1.266+.777+.312+.180+.100=5.057

Now find the standard deviation

0+.010+(4*.046)+(9*.113)+(16*.194)+(25*.241)+(36*.211)+(49*.111)+(64*.039)+(81*.020)+(100*.010)=28.491

\[\sqrt{28.491-5.057^2}=1.708\]

Now find the maximum value that is not significant

Max value = \(\mu+2\sigma\)

5.1+2*1.7=8.5

Now find the minimum value that is not significant

Min value = \(\mu-2\sigma\)

5.1-2*1.7=1.7

Since 1 girl is below this minimum of 1.7, 1 girl in 10 births is a significantly low number of girls.
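Putting the whole range rule of thumb into one script for Example 10 (plain Python; the table values are copied from above, and the unrounded cutoffs differ slightly from the hand-rounded 1.7 and 8.5):

```python
import math

xs = list(range(11))
ps = [0.005, 0.010, 0.046, 0.113, 0.194, 0.241,
      0.211, 0.111, 0.039, 0.020, 0.010]

mu = sum(x * p for x, p in zip(xs, ps))                           # about 5.057
sigma = math.sqrt(sum(x**2 * p for x, p in zip(xs, ps)) - mu**2)  # about 1.708

low_cutoff = mu - 2 * sigma   # about 1.64
high_cutoff = mu + 2 * sigma  # about 8.47

print(1 <= low_cutoff)  # True: 1 girl in 10 births is significantly low
```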