Basics of Statistics
These are my notes and thoughts on an introduction to statistics.
Table of Contents
- Exploring Data
- What Is Statistics
- Statistics and Problem Solving
- Statistical and Critical Thinking
- What Statistics Is All About
- Relationships Between Categorical Variables
- Working with Tables and Graphs
- Displaying and Describing Data
- Comparing Distributions in Statistics
- Measures of Center
- Measures of Variation
- Measures of Relative Standing
- Standard Deviation
- Basics of Probability
- Addition and Multiplication of Probabilities
- Complements and Conditional Probability
- Beginning Probability
- Probability Distributions
- Univariate Data
- Graphing Univariate Data
- Numerical Methods for Continuous Data
- Joint Frequencies of Two-Way Tables
- Methods of Data Collection
- Probability in Statistics
- Mutually Exclusive Events
- Probability Distributions of Random Variables
Exploring Data
We use data to make decisions. We make estimations and develop guidelines using data. Therefore, data and its analysis are important.
Businesses collect data on their users. Scientists collect data on their experiments. Doctors collect data on their patients. Police collect data on criminals. Lots of people, organizations, and businesses collect data now. In fact, almost all of them do.
Most data that is collected is not immediately useful. It has to be organized and put in the proper format. It also needs to be summarized so decision makers can easily make choices on what is best for their organization. The techniques for doing this are called descriptive methods, and they are useful for presenting, reducing, and summarizing data.
Variables
There are two types of variables, categorical and numerical. A variable is categorical if it places the individuals being studied into one of several groups or categories. A variable is numerical if its outcomes are quantitative and can be analyzed using arithmetic. Numerical variables can be either discrete or continuous. Different methods of analysis must be used for categorical and numerical variables.
If we take only one measurement on each object, we get univariate data. With two measurements on each object, we get bivariate data.
Types of Descriptive Methods
There are different descriptive methods depending on the type of data that is collected. These are tabular, graphical, and numerical methods.
Different descriptive methods will answer different questions about data.
Tabular
Collected data needs to be rearranged before analysis. One tabular method is the frequency distribution table. The letter n is used to denote the number of observations in a data set. The frequency of a value is the number of times that value occurs in the data. Frequency is usually denoted by the letter f. The relative frequency of a value is the ratio of the frequency to the total number of observations. It is usually denoted by \(rf\) and equals \(\frac{f}{n}\). The cumulative frequency gives the number of observations less than or equal to a specified value and is denoted by \(cf\). A frequency distribution table is a table giving all possible values of a variable and their frequencies.
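As a rough sketch, these quantities can be computed in R from a small hypothetical vector of observations (the data below are made up for illustration):

x <- c(2, 3, 3, 5, 3, 2, 4, 5, 2, 3)  # hypothetical observations, n = 10
n <- length(x)
f <- table(x)    # frequency of each value
rf <- f / n      # relative frequency, f/n
cf <- cumsum(f)  # cumulative frequency
data.frame(f = as.vector(f), rf = as.vector(rf), cf = as.vector(cf), row.names = names(f))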
Graphical
Presenting your data only in tables is not always effective, but it is common, so you should know how to interpret a table when you have to analyze one. Charts are usually a better tool.
Bar charts are used a lot. They can have either horizontal or vertical bars, and they are very commonly used to display categorical data.
Pie charts can also display amounts and frequencies of data. They are a popular graphical method but not usually the best choice. They are difficult to make and read.
Segmented Bar Charts
It is important to see categorical data that stems from different groups in order to make comparisons. A segmented bar chart takes the distribution from each group and arranges them along either the horizontal or vertical axis. Then it shows the relative frequency of each group represented in one bar for each group. These data charts can be used to show frequency with bars of various sizes or relative frequency where all bars are the same size regardless of group size. Segmented bar charts that measure relative frequency between groups can be somewhat misleading when sample size is concerned.
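As a sketch of both versions in R, assuming a hypothetical table of counts with one column per group:

counts <- rbind(yes = c(30, 45), no = c(20, 15))    # hypothetical category-by-group counts
colnames(counts) <- c("GroupA", "GroupB")
barplot(counts, legend.text = TRUE)                 # segmented bars showing frequencies
barplot(prop.table(counts, 2), legend.text = TRUE)  # relative frequencies; every bar has the same height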
Mosaic Plots
These are similar to segmented bar charts; they are just a different way to compare categorical data. In a mosaic plot, the width of the bars represents the size of the sample in each group. Each header indicates a different group. The groups can be arranged along the x or y axis. The lengths of these bars along the axis represent the relative frequencies of the groups compared to each other.
Along the other axis, the bars of each group are the same length. Each section within the group bars represents the percentage that category occurred in the data set for that group. These same categories should appear within each of the group bars. We can make comparisons about the size of each group based on the length of each group bar. We can also evaluate the proportions of the categorical variables within each group by comparing the relative sizes of each section.
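Base R can draw a mosaic plot directly from a two-way table; a minimal sketch with hypothetical counts, where bar widths reflect the group sizes:

tbl <- as.table(rbind(GroupA = c(30, 20), GroupB = c(45, 15)))  # hypothetical group-by-response counts
dimnames(tbl) <- list(group = c("GroupA", "GroupB"), response = c("yes", "no"))
mosaicplot(tbl, main = "Response by group")  # widths are proportional to group totals (50 vs 60)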
Graphical Methods for Numerical Data
To summarize and describe numerical data, dotplots and stemplots are used for small sets of data. For larger sets, histograms, cumulative frequency charts, and boxplots are often used.
We can describe the overall pattern of the distribution of a numerical variable using three characteristics: center, spread, and shape.
The center of a distribution describes the central data point. There are a few ways to measure the central tendency which include the mean, median, and the mode. Each measure has different pros and cons depending on the type and shape of the data.
The spread of a distribution can tell us where most of the data is. You can have a symmetric distribution and a skewed distribution.
If the left half of the distribution is approximately a mirror image of the right half, then the distribution is called symmetric. This means that the data is spread out in the same way on both sides and that there is the same amount of data on each side of the center.
If there are extreme values in only one direction that cause one side to have a longer tail, we call that distribution skewed. It is right skewed if the longer tail is on the right and left skewed if the longer tail is on the left.
Patterns of Data
When looking at data, we should look for patterns and deviations. To describe patterns, you can have clusters of data and outlier data. In clustered data, observations are grouped together tightly. If data is not clustered it can be described as having gaps. It is important to make these distinctions.
If you have outliers in your data, you have an observation that is a lot different from the rest of the data. Outliers fall away from the middle of the data set.
Graphical Methods for Continuous Variables
There are several ways to show graphical data for continuous variables. These include dotplots, stemplots, histograms, and cumulative frequency charts.
Dotplots
Dotplots are easy to make. They are nice for smaller data sets. However, if there is too much data the dotplot becomes too cluttered to read. To make a dotplot, draw a horizontal line, scale the line to accommodate the entire range of the data, mark a dot for each observation in the appropriate place above the scaled line, and if more than one observation has the same value, stack the dots above one another.
Each dot on the plot indicates the location of the value of a data point. For any data point, we can look directly down at the scale to determine the value of the point. When looking at a dotplot, we can see how the data points are spread, what kind of shape the points make collectively, and where the approximate center of the distribution is.
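In R, stripchart can produce a dotplot that stacks repeated values; a sketch with hypothetical data:

x <- c(12, 15, 15, 16, 18, 18, 18, 21, 24)  # hypothetical small data set
stripchart(x, method = "stack", pch = 16)   # dots stack when values repeat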
Stemplots
Stemplots are also used a lot. An advantage of the stemplot is that it shows every value. However, since that is the case, it is only useful for small data sets.
To make a stemplot, separate each observation into two parts. The left part of the observation is called the stem and the right part is called the leaf. Draw a vertical line on the left side of the page to separate the stems from the leaves. Write all possible stems in increasing order on the left of the line. For each observation, write in the leaf to the right of the corresponding stem on the right side of the vertical line in increasing order.
The numbers on the left side of the vertical line are stems. The value of a data point is the stem plus the leaf. Each stem has a different number of leaves, indicating the frequency of the class. Each leaf indicates a single observation.
We use stemplots to see how the data is shaped and how it is spread. We also use it to see where the center of the data is.
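R's stem function follows exactly this recipe; a sketch with hypothetical data:

x <- c(12, 15, 15, 16, 18, 21, 24, 28, 31, 33)  # hypothetical data
stem(x)  # prints a stemplot: stems are the tens digits, leaves the ones digits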
Histograms
A histogram is the most popular form of displaying data. It resembles a stemplot on its side. They are useful for showing patterns in large data sets. A histogram can be drawn using frequencies, relative frequencies, or percentages.
To make a histogram, create groups from the continuous data, draw the x axis and the y axis to scale so they accommodate all of the groups and frequencies, draw bars of heights equal to the corresponding frequencies, label each group, and draw the bars next to each other without any gaps. There are no gaps between histogram bars because the data values are continuous and the values in one bar flow right into the next one.
Each bar represents a single group or class. There is only one bar for each class. The classes are placed on the x axis in numerically increasing order, just as on a number line. The height of a bar in a frequency histogram corresponds to the frequency of that class. Percentage or relative frequency histograms can be read similarly. In a relative frequency histogram, the height of the bar reflects the relative frequency corresponding to the class. In a percentage frequency histogram, the height of the bar reflects the percent frequency that corresponds to the class.
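A sketch of both a frequency and a relative-frequency version in R, using made-up data:

set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)  # hypothetical continuous data
hist(x, main = "Frequency histogram", xlab = "Value")
hist(x, freq = FALSE, main = "Density scale", xlab = "Value")  # for equal-width classes, heights are proportional to relative frequencies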
Cumulative Frequency Charts
The cumulative frequency for any group is the frequency for that group plus the frequencies of all groups of smaller observations.
To draw a cumulative frequency chart, draw the x and y axes, scale the x axis to accommodate the range of all groups, mark the upper boundary of each group, scale the y axis from 0 to n, place a dot above the upper boundary of each group at a height equal to that group's cumulative frequency, then connect the dots with straight lines.
From any point on the graph, we can draw a vertical line to read the x value from the x axis and a horizontal line to read the y value from the y axis. For right skewed distributions, the curve increases quickly in the beginning but then steadies in the later part. For left skewed distributions, the curve increases slowly in the beginning, but then steeply later on. The cumulative frequency chart for a symmetric distribution is often described as s-shaped because it begins with a slow increase on the left, rises rapidly in the middle, and then tapers off to a slow increase again at the right.
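One way to sketch a cumulative frequency chart in R is to reuse the class boundaries and counts that hist computes (hypothetical data again):

set.seed(1)
x <- rnorm(200, mean = 50, sd = 10)
h <- hist(x, plot = FALSE)    # class boundaries in h$breaks, frequencies in h$counts
cf <- c(0, cumsum(h$counts))  # cumulative frequency at each boundary
plot(h$breaks, cf, type = "b", xlab = "Upper class boundary", ylab = "Cumulative frequency")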
Summary
Visualizations are the very first step you should take when analyzing data. The types of summary statistics, inferential tests, and analysis that can be calculated are dependent upon the shape of the distribution. The key point to remember is that there are different calculations for symmetric and skewed data. Knowing the shape of the distributions will help you get started.
What Is Statistics
Data is any collection of numbers, characters, images, or any other items that provide information about something. What is Statistics? It is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world. What are Statistics? Statistics are particular calculations made from data.
The characteristics recorded about each individual are called variables. They are usually found as the columns of a data table with a name in the header that identifies what has been recorded.
Some variables are called nominal because they name categories. That means you can’t do math on the data or that it would make no sense if you did. Descriptive responses to questions are often categories.
When a variable contains measured numerical values with measurement units, we call it a quantitative variable. Quantitative variables typically record an amount or degree of something. For quantitative variables, its measurement units provide a meaning for the numbers. Some quantitative variables do not have obvious units, like the stock market. Sometimes a variable with numerical values can be treated as either categorical or quantitative depending on what we want to know from it.
For a categorical variable, each individual is assigned one of two or more possible values. However, some variables have a different value for every individual; these are identifier variables. Identifier variables do not tell us anything useful about their categories because we know there is exactly one individual in each. Identifiers are part of what is called metadata, or the data about data.
Variables that report order without natural units are often called ordinal variables. You still have to consider what you want to learn from the variable in your study to decide whether to treat it as categorical or quantitative.
Models are summaries and simplifications of data that help our understanding in many ways. A model is a simplification of reality that gives us information we can learn from and use. Without making models for how data vary, we would be limited to reporting only what the data we have say.
Don’t label a variable as categorical or quantitative without thinking about the data and what they represent. The same variable can sometimes take on different roles. Do not assume that a variable is quantitative just because its values are numbers. Categories are often given numerical labels. Do not let that fool you into thinking they have quantitative meaning. Look at the context.
Always be skeptical. One reason to analyze data is to discover the truth. Even when you are told a context for the data, it may turn out that the truth is a bit different. The context colors our interpretation of the data, so those who want to influence what you think may slant the context.
Data are recorded values, whether numbers or labels, together with their context. A data table is an arrangement of data in which each row represents a case and each column represents a variable. The context ideally tells who was measured, what was measured, how the data were collected, and why the study was performed.
An individual about whom or which we have data is a case. A respondent is someone who answers or responds to a survey. A subject, also called a participant, is a human experimental unit. A variable holds information about the same characteristic for many cases. A categorical variable names categories with words or numerals.
A nominal variable can be applied to a variable whose values are used only to name categories. A quantitative variable is a variable in which the numbers are values of measured quantities. A unit is a quantity or amount adopted as a standard of measurement, such as dollars or hours.
Metadata is the data about data. It can provide information to uniquely identify cases, making it possible to combine data from different sources, protect privacy, or label cases uniquely. An ordinal variable can be applied to a variable whose categorical values possess some kind of order. A model is a description or representation, in mathematical or statistical terms, of the behavior of a phenomenon based on data.
Example 1
Because of the difficulty of weighing a dolphin in the ocean, researchers caught and measured 12 dolphins, recording their weight, fin length, body length, and sex. They hoped to find a way to estimate weight from the other more easily determined quantities.
- Who was measured?
12 dolphins
- When were the measurements taken?
This information is not given
- Where were the measurements taken?
This information is not given
- Why were the measurements taken?
To find an easier way to estimate the weight of a dolphin
- How did the researchers obtain the measurements?
Researchers collected data on the 12 dolphins they were able to catch
- Specify whether the variables are categorical or quantitative.
The variable weight is quantitative and units were not provided
The variable fin length is quantitative and units were not provided
The variable body length is quantitative and units were not provided
The variable sex is categorical
Example 2
Researchers investigating the impact of prenatal care on newborn health collected data from 708 births during 1991-1993. They kept track of the mother’s age, the number of weeks the pregnancy lasted, the type of birth, the level of prenatal care the mother had, the weight and sex of the babies, and whether the babies exhibited health problems.
- Identify the who for the description of data.
The 708 births
- Identify the what for the description of data.
Baby’s health problems, sex of the babies, level of prenatal care, type of birth, weight of the babies, duration of pregnancy, and mother’s age
- Identify the when for the description of data.
Between the years 1991-1993
- Identify the where for the description of data
This information is not given
- Identify the why for the description of data
To determine the effect of prenatal care on the babies' health
- Identify the how for the description of data
This information is not given
Statistics and Problem Solving
A population is the total set of subjects or things we are interested in studying. Populations are defined by what a researcher is studying and can come in all shapes and sizes.
A frame is a list containing all members of the population.
Population parameters are facts about the population. Since parameters are descriptions of the population, a population can have many parameters. Parameters can be averages, percentages, minimums, or maximums. For a specific population at a specific point in time, population parameters do not change.
A sample is a subset of the population which is used to gain insight about the population. Samples are used to represent a larger group, the population.
A statistic is a fact or characteristic about the sample. For any given sample a statistic is a fixed number. Statistics are used as estimates of population parameters.
A process is a method for obtaining a desired result. The idea of a process is closely tied to quality control. In order to improve a process, there must be an understanding of how the process is currently performing. This requires definition and measurement of the process.
The science of statistics is divided into two categories, descriptive and inferential. Descriptive methods describe and summarize data. Descriptive statistics is the collection, organization, and presentation of data.
The objective of inferential statistics is to make reasonable guesses about the population characteristics using sample data.
Collecting and Analyzing Data
Part of becoming a problem solver and user of statistics is developing an ability to appraise the quality of measurements. When you encounter data, consider whether the concept under study is adequately reflected by the proposed measurements, whether the data are measured accurately, and whether there is a sufficient quantity of data to draw a reasonable conclusion.
Measurement and data are an integral part of science, and methods have been developed to solve research problems: gather information about the phenomenon being studied; on the basis of the data, formulate a preliminary generalization or hypothesis; collect further data to test the hypothesis. If the data and other subsequent experiments support the hypothesis, it may become a law.
There are two ways to obtain data, observation and controlled experiments. In a statistical analysis, it is usually not possible to recover from poorly measured concepts or badly collected measurements.
A response variable measures the outcome of interest in a study. An explanatory variable causes or explains changes in a response variable. Isolating the effects of one variable on another means anticipating potentially confounding variables and designing a controlled experiment to produce data in which the values of the confounding variable are regulated.
Observational data come from measuring things as they are, without intervention. They can be extremely valuable.
Much of the statistical information presented to us is in the form of surveys. So, it is important to understand them and how they are done. In some cases, the purpose of a survey is purely descriptive. However, in many cases the researcher is interested in discovering a relationship.
Data in which the observations are restricted to a set of values that possess gaps is called discrete. Data that can take on any value within some interval is called continuous. The quality of data is referred to as its level of measurement. When analyzing data, you must be exceedingly conscious of the data’s level of measurement because many statistical analyses can only be applied to data that possess a certain level of measurement.
Data that represents whether a variable possesses some characteristic is called nominal. Ordinal data represents categories that have some associated order. Note that ordinal data is also nominal, but it also possesses the additional property of ordinality.
If the data can be ordered and the arithmetic difference is meaningful, the data is interval. An example of interval data is temperature. Interval data is numerical data that possesses both the property of ordinality and the interval property. Ratio data is similar to interval data, except that it has a meaningful zero point and the ratio of two data points is meaningful.
Qualitative data is data measured on a nominal or ordinal scale. Quantitative data is measured on an interval or ratio scale.
Time series data originate as measurements, usually taken from some process over equally spaced intervals of time. Processes can be divided into two categories: stationary and nonstationary. All time series that are interesting vary, and the nature of the variability determines how the process is characterized. In a stationary process the time series varies around some central value and has approximately the same variation over the series. In a nonstationary process, the time series possesses a trend, the tendency for the series to either increase or decrease over time.
Cross-sectional data are measurements created at approximately the same period of time.
Organization of Data in Statistics
A frequency distribution is a summary technique that organizes data into classes and provides in tabular form a list of the classes along with the number of observations in each class.
The process begins by refining information: an analyst takes raw data and organizes it by counting the number of observations in each classification.
A frequency distribution is a good way to handle large amounts of data. With it, we can see the overall structure of the data.
There are two steps in creating a frequency distribution:
- Choose the classifications
- Count the number in each class
Graphs are important because they put information in visual form. While some individual detail can be lost, a good graph more than makes up for it. Use graphing software to create graphs easily; lots of different programs are available to create nice looking graphs these days.
Bar Charts
The bar chart is a simple graph in which the length of each bar corresponds to the number of observations in a category.
They are a good presentation tool and helpful in showing the differences in magnitude.
Creating a bar chart can get complicated. You should think about size, color, and labeling.
Pie Charts
Pie charts can represent the same information as a bar chart. The slices in a pie chart are proportional to the total in each category. You can easily compare the total of each category to the total overall.
When your data is qualitative, choosing categories is pretty easy. However, when your data is quantitative, choosing those categories is more complicated. The reason is that your choices often reflect how others will interpret the data. So, you have to be careful when doing this.
Choosing the number of categories is your choice and should depend on the amount of data available. You want enough categories to make the comparisons meaningful but not so many that it is hard to understand. Each situation will be different in this regard.
Relative Frequency Distribution
This expresses the number of observations in a category as a proportion of the total. It enables a person to view the number in each category in relation to the total number of observations. Another thing it does is change the frequency in each category to a proportion so we can compare data sets more easily. It looks like this:
\[ \text{relative frequency} = \frac{\text{number in category}}{\text{total number}} \]
Cumulative Frequency Distribution
This gives a person the ability to quickly look at any category and see the number of observations and how they are related. The cumulative frequency is the sum of the frequency of a particular category and all preceding categories.
Cumulative Relative Frequency
The cumulative relative frequency is the proportion of observations in a particular category and all preceding categories.
Histograms
A histogram is used frequently and reveals the distribution of data. It is a bar graph of the frequency in which the height of each bar corresponds to the frequency of the category. Each category is represented by a vertical bar whose height is proportional to the frequency of the interval. The horizontal boundaries of each vertical bar correspond to the category endpoints. Once the frequency distribution has been calculated, all the information necessary for plotting a histogram is available.
Stem and Leaf Display
The stem and leaf display is a mix of methods. It is similar to a histogram, but the raw data remains visible and usable, so nothing is lost in the graph. It is useful for ordering and detecting patterns in the data.
Ordered Array
An ordered array is a listing of all the data in either increasing or decreasing magnitude. Data listed in increasing order is said to be listed in rank order. If listed in decreasing order, it is listed in reverse rank order. Listing data in an order is very useful and usually done. It allows you to scan the data quickly for the largest and smallest values.
Dot Plots
A dot plot is a graph where each data value is plotted as a point. If there are multiple entries, they are plotted above each other.
Time Series Data
A time series plot graphs data using time as the horizontal axis.
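A minimal R sketch, assuming a hypothetical monthly series:

sales <- c(102, 110, 98, 120, 130, 125, 140)  # hypothetical monthly values
plot(ts(sales, start = c(2024, 1), frequency = 12), ylab = "Sales")  # time runs along the horizontal axis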
Statistical and Critical Thinking
Surveys provide data that enable us to improve products or services. Surveys guide political candidates, shape business practices, influence social media, and affect many aspects of our lives.
A voluntary response sample is a sample in which respondents themselves decide whether to participate. Those with a strong interest in the topic are more likely to participate. Sample data must be collected in an appropriate way, such as through a process of random selection. If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical torturing can salvage them.
When using methods of statistics with sample data to form conclusions about a population, it is absolutely essential to collect sample data in a way that is appropriate.
Data are collections of observations, such as measurements, genders, or survey responses. A single data value is called a datum. The term data is plural.
Statistics is the science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them.
A population is the complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data that we would like to make inferences about.
A census is the collection of data from every member of the population.
A sample is a sub-collection of members selected from a population.
Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.
A voluntary response sample is one in which the respondents themselves decide whether to be included.
The word statistics is derived from the Latin word status, meaning state. Early uses of statistics involved compilations of data and graphs describing various aspects of a state or country.
The following types of polls are common examples of voluntary response samples. By their very nature, all are seriously flawed because we should not make conclusions about a population on the basis of samples with a strong possibility of bias.
- Internet polls, in which people online can decide whether to respond
- Mail-in polls, in which people can decide whether to reply
- Telephone call-in polls, in which newspaper, radio, or television announcements ask that you call a special number to respond
Analyze
After completing our preparation by considering the context, source, and sampling method, we begin to analyze the data.
Graph and Explore
An analysis should begin with appropriate graphs and explorations of data.
Apply Statistical Methods
A good statistical analysis does not require strong computational skills. A good statistical analysis does require using common sense and paying careful attention to sound statistical methods.
Conclude
The final step in our statistical process involves conclusions, and we should develop an ability to distinguish between statistical significance and practical significance.
Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5 percent or less. Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.
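Both birth examples can be checked directly. Assuming independent births with a 0.5 probability of a girl, the chances of outcomes at least that extreme are:

pbinom(97, size = 100, prob = 0.5, lower.tail = FALSE)  # P(98 or more girls): vanishingly small
pbinom(51, size = 100, prob = 0.5, lower.tail = FALSE)  # P(52 or more girls): roughly 0.38, easily due to chance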
Practical significance is a separate question: a treatment or finding may be statistically effective, but common sense might suggest that it does not make enough of a difference to justify its use or to be practical.
Misleading Conclusions
When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by statistical analysis.
Sample Data Reported
When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results. Ask people what they weigh and you are likely to get their desired weights, not their actual weights.
Loaded Questions
If survey questions are not worded carefully, the results of a study can be misleading. Survey questions can be loaded or intentionally worded to elicit a desired response.
Order of Questions
Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered.
Nonresponse
A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable. When people are asked survey questions, some firmly refuse to answer.
Percentages
To find a percentage of an amount, replace the % symbol with division by 100, and then interpret “of” to be multiplication.
6% of 1200 responses = \(\frac{6}{100} * 1200 = 72 \)
Decimal to Percentage
To convert from a decimal to a percentage, multiply by 100%.
\[ 0.25 \rightarrow 0.25 * 100\% = 25\% \]
Fraction to Percentage
To convert from a fraction to a percentage, divide the denominator into the numerator to get an equivalent decimal number. Then multiply by 100 percent.
\[ \frac{3}{4} = 0.75 \rightarrow 0.75 * 100\% = 75\% \]
Percentage to Decimal
To convert from a percentage to a decimal number, replace the % symbol with division by 100.
\[ 85\% = \frac{85}{100} = 0.85 \]
A parameter is a numerical measurement describing some characteristic of a population.
A statistic is a numerical measurement describing some characteristic of a sample.
If we have more than one statistic, we have statistics. Another meaning of statistics is the science of planning studies and experiments: obtaining data, then organizing, summarizing, presenting, analyzing, and interpreting those data.
Some data are numbers representing counts or measurements, whereas others are attributes that are not counts or measurements. Quantitative data consist of numbers representing counts or measurements.
Categorical data consist of names or labels. Categorical data are sometimes coded with numbers, with those numbers replacing names. Although such numbers might appear to be quantitative, they are actually categorical data.
Include Units of Measurement
With quantitative data, it is important to use the appropriate units of measurement, such as dollars, hours, feet, or meters. We should carefully observe information given about the units of measurement, such as all amounts are in thousands of dollars or all units are in kilograms.
Discrete or Continuous
Quantitative data can be further described by distinguishing between discrete and continuous types. Discrete data result when the data values are quantitative and the number of values is finite. Continuous or numerical data result from infinitely many possible quantitative values, where the collection of values is not countable.
The concept of countable data plays a key role in the preceding definitions, but it is not a particularly easy concept to understand. Continuous data can be measured, but not counted. If you select a particular value from continuous data, there is no next data value.
Levels of Measurement
Another common way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. When we are applying statistics to real problems, the level of measurement of the data helps us to decide which procedure to use. Don't do computations and don't use statistical methods that are not appropriate for the data.
Ratio
There is a natural zero starting point and ratios make sense. Examples are heights, lengths, distances, and volumes.
Interval
Differences are meaningful, but there is no natural zero starting point and ratios are meaningless. Body temperature in degrees Fahrenheit is an example.
Ordinal
Data can be arranged in order, but differences either can’t be found or are meaningless. Examples are ranks of colleges.
Nominal
Categories only. Data cannot be arranged in order. An example is eye colors.
The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in some order.
Because nominal data lack any ordering or numerical significance, they should not be used for calculations. Numbers such as 1, 2, 3, or 4 are sometimes assigned to the different categories, but these numbers have no real computational significance and any average calculated from them is meaningless and possibly misleading.
Data are at the ordinal level of measurement if they can be arranged in some order, but differences between data values cannot be determined or are meaningless.
Ordinal data provide information about relative comparisons, but not the magnitudes of the differences. Usually, ordinal data should not be used for calculations such as an average, but this guideline is sometimes ignored.
Data are at the interval level of measurement if they can be arranged in order, and differences between data values can be found and are meaningful. Data at this level do not have a natural zero starting point at which none of the quantity is present.
Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. For data at this level, differences and ratios are both meaningful.
The distinction between the interval and ratio levels of measurement can be a bit tricky. To distinguish between them, use a ratio test by asking this question: Does use of the term twice make sense? The term twice describes the ratio of one value being double another value. Twice makes sense for data at the ratio level of measurement, but it does not make sense for data at the interval level of measurement.
For the true zero test, and for ratios to make sense, there must be a value of true zero, where the value of zero indicates that none of the quantity is present, and zero is not simply an arbitrary value on a scale. The temperature of 0 F is arbitrary and does not indicate that there is no heat, so temperatures on the Fahrenheit scale are at the interval level of measurement not the ratio level.
Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.
Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields such as sociology or finance.
Example of Data Set Magnitudes
- Terabytes
- Petabytes
- Exabytes
- Zettabytes
- Yottabytes
Statistics in Data Science
The modern data scientist has a solid background in statistics and computer systems as well as expertise in fields that extend beyond statistics. The modern data scientist might be skilled with Hadoop software, which uses parallel processing on many computers for the analysis of big data. The modern data scientist might also have a strong background in some other field such as psychology, biology, medicine, chemistry, or economics.
Missing Data
When collecting sample data, it is quite common to find that some values are missing. Ignoring missing data can sometimes create misleading results. If you make the mistake of skipping over a few different samples when you are manually typing them into a statistics software program, the missing values are not likely to have a serious effect on the results. However, if a survey includes many missing salary entries because those with very low incomes are reluctant to reveal their salaries, those missing values will have the serious effect of making salaries appear higher than they really are.
A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value.
A data value is missing not at random if the missing value is related to the reason that it is missing.
Missing completely at random can happen, for example, when someone using a keyboard to manually enter the ages of survey respondents makes the mistake of failing to enter one respondent's age of 37 years. That data value is missing completely at random.
Biased Results
Based on the two definitions and examples above, it makes sense to conclude that if we ignore data missing completely at random, the remaining values are not likely to be biased and good results should be obtained. However, if we ignore data that are missing not at random, it is very possible that the remaining values are biased and results will be misleading.
Correcting for Missing Data
There are different methods for dealing with missing data. One very common method for dealing with missing data is to delete all subjects having any missing values. If the data are missing completely at random, the remaining values are not likely to be biased and good results can be obtained, but with a smaller sample size. If the data are missing not at random, deleting subjects having any missing values can easily result in a bias among the remaining values, so results can be misleading.
We can also impute missing data values by substituting values for them. There are different methods of determining the replacement values, such as using the mean of the other values, using a randomly selected value from other similar cases, or using a method based on regression analysis.
When analyzing sample data with missing values, try to determine why they are missing, then decide whether it makes sense to treat the remaining values as being representative of the population. If it appears that there are missing values that are missing not at random, know that the remaining data may well be biased and any conclusions based on those remaining values may well be misleading.
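A minimal sketch of the two approaches in R, using a hypothetical vector with missing salaries:

salary <- c(48000, 52000, NA, 61000, NA, 45000)        # hypothetical data with missing values
complete <- salary[!is.na(salary)]                     # deletion: keep only the complete cases
imputed <- salary
imputed[is.na(imputed)] <- mean(salary, na.rm = TRUE)  # imputation: substitute the mean of the other values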
In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. The individuals in experiments are called experimental units and they are often called subjects when they are people. In an observational study, we observe and measure specific characteristics, but we don’t attempt to modify the individuals being studied.
Experiments are often better than observational studies because well planned experiments typically reduce the chance of having the results affected by some variable that is not part of the study. A lurking variable is one that affects the variables included in the study, but it is not included in the study.
Design of Experiments
Good design of experiments includes replication, blinding, and randomization.
Replication is the repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments.
Blinding is used when the subject doesn’t know whether he or she is receiving a treatment or a placebo. Blinding is a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms.
Randomization is used when individuals are assigned to different groups through a process of random selection. The logic behind randomization is to use chance as a way to create two groups that are similar.
A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.
Unlike careless or haphazard sampling, random sampling usually requires very careful planning and execution.
Simple Random Sample
A sample of n subjects is selected so that every sample of the same size n has the same chance of being selected
Systematic Sample
Select every kth subject
Convenience Sample
Use data that are very easy to get
Stratified Sample
Subdivide populations into strata or groups with the same characteristics, then randomly sample within those strata.
Cluster Sample
Partition the population into clusters or groups, randomly select some of those clusters, and then include all members of the selected clusters.
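The first two methods are easy to sketch in R, assuming a hypothetical frame of N = 1000 numbered subjects:

N <- 1000
n <- 50
srs <- sample(N, n)  # simple random sample of n subjects
k <- N %/% n         # systematic sampling: every kth subject
systematic <- seq(from = sample(k, 1), to = N, by = k)  # random starting point within the first k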
Multistage Sampling
Professional pollsters and government researchers often collect data by using some combination of the preceding sampling methods. In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.
In a cross sectional study, data are observed, measured, and collected at one point in time, not over a period of time.
In a retrospective study, data are collected from a past time period by going back in time.
In a prospective study, data are collected in the future from groups that share common factors.
Experiments
In an experiment, confounding occurs when we can see some effect, but we can’t identify the specific factor that caused it.
A randomized block design uses the same basic idea as stratified sampling, but randomized block designs are used when designing experiments, whereas stratified sampling is used for surveys.
Matched Pairs Design
Compare two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics.
Rigorously Controlled Design
Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors.
Sampling Errors
In statistics, you could use a good sampling method and do everything correctly, and yet it is possible to get wrong results. No matter how well you plan and execute the sample collection process, there is likely to be some error in the results.
A sampling error occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result; such an error results from chance sample fluctuations.
A nonsampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances.
A nonrandom sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.
The Gold Standard
Randomization with placebo/treatment groups is sometimes called the gold standard because it is so effective.
What Statistics Is All About
One of the first considerations is designing appropriate studies. The purpose is to collect data. This can be done with either surveys or experiments. One of the most popular approaches is the observational study, which collects data on individuals in a way that does not affect them. Surveys have to be worded carefully to get good information.
An experiment is another popular way to gather data. It involves treatments on participants so that clear comparisons can be made. After treatments are made, responses are recorded.
Collecting quality data is a major consideration. It really does no good to get bad data. So, studies and experiments must be planned well. Once you have good data, you can make a good report on what you found. To minimize bias in a survey, you have to be random when selecting participants.
Descriptive Statistics
These are numerical values that describe a data set, usually broken down by category. If the data are categorical, they are usually summarized using the number of individuals in each group. This is called the frequency. If you use the percentage of individuals, it is called the relative frequency.
Numerical data represent measurements or counts. You can do more with numerical data. For example, you can get the measure of center and the measure of spread in the data.
Some descriptive statistics are more appropriate than others in certain situations. The average is not always the best measure of the center of a data set.
Charts and Graphs
Data is summarized in a visual way using charts and graphs. These are displays that are organized to give you a big picture of the data.
Some of the basic graphs used for categorical data include pie charts and bar graphs. These break down variables in the data.
For numerical data, a different type of graph is needed. Histograms and box plots are usually used to represent numerical data. These types of graphs make it easier to visualize the data.
Distributions
A variable is a characteristic that is being counted or measured. A distribution is a listing of the possible values of a variable and how often they occur.
Different types of distributions exist for different types of variables.
If a variable is counting the number of successes in a certain number of trials, it has a binomial distribution.
If the variable takes on values that occur according to a bell-shaped curve, then that variable has a normal distribution.
If the variable is based on sample averages and you have limited data, the t-distribution may be in order.
When it comes to distributions, you need to know how to decide which distribution a particular variable has, how to find probabilities for it, and how to figure out what the long-term average and standard deviation of the outcomes would be.
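R has built-in functions for all three; a few one-liners with hypothetical parameter values:

dbinom(3, size = 10, prob = 0.5)  # binomial: P(exactly 3 successes in 10 trials)
pnorm(1.2)                        # normal: P(Z <= 1.2) for a standard normal variable
qt(0.975, df = 9)                 # t-distribution: critical value for a sample of size 10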
Performing Analyses
After data has been collected and described, it is time to do the statistical analysis. There are many types of analyses. You have to choose the appropriate type for your data.
You often see statistics that try to estimate numbers pertaining to an entire population, but these are just estimates: most studies collect data on only a small sample of people. Sometimes the results are very inaccurate.
Sample results vary from sample to sample, and this amount of variability needs to be reported, but usually it is not. The statistic used to measure and report the level of precision in someone's sample result is called the margin of error. The sample result plus or minus the margin of error gives a range of values called the confidence interval.
Hypothesis Tests
One major staple of research studies is called hypothesis testing. A hypothesis test is a technique for using data to validate or invalidate a claim about a population.
The elements about a population that are most often tested are:
- The population mean
- The population proportion
- The difference in two population means or proportions
Hypothesis tests are used in a host of areas that affect your everyday life, such as medical studies, advertisements, and polling data. Often you only hear the conclusions of hypothesis tests but you don’t see the methods used to come to these conclusions.
Drawing Conclusions
To perform statistical analyses, researchers use software that depends on formulas, and you have to use them correctly. One of the most common mistakes in conclusions is overstating the results. Until you do a controlled experiment, you can't make a cause-and-effect conclusion based on relationships you find.
Statistics is about much more than numbers. You need to understand how to make appropriate conclusions from studying data and be smart enough to not believe everything you read.
Relationships Between Categorical Variables
When we want to see how two categorical variables are related, we put the counts in a two-way table called a contingency table. Look at the marginal distribution of each variable. Also look at the conditional distribution of a variable within each category of the other variable. Comparing conditional distributions of one variable across categories of another tells us about the association between variables. If the conditional distributions of one variable are roughly the same for every category of the other, the variables are independent. Consider a third variable whenever it is appropriate, and be able to describe the relationships among the three variables.
Contingency Table
A contingency table displays counts and sometimes percentages of individuals falling into named categories on two or more variables. The table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other.
Marginal Distribution
In a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins of the table.
Table Percents
When a cell of a contingency table holds percents, these can be percents of the row total, the column total, or the grand total. These are called row, column, and table percents, respectively.
Conditional Distribution
The distribution of a variable when the Who is restricted to consider only a smaller group of individuals is called a conditional distribution.
Independence
Variables are said to be independent if the conditional distribution of one variable is the same for each category of the other.
Segmented Bar Chart
A segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable.
Mosaic Plot
A mosaic plot is a graphical representation of a contingency table. The plot is divided into rectangles so that the area of each rectangle is proportional to the number of cases in the corresponding cell.
Simpson’s Paradox
When averages are taken across different groups, they can appear to contradict the overall averages.
Lurking Variables
A lurking variable is one that is not immediately evident in an analysis, but changes the apparent relationships among the variables being studied.
Contingency Tables in Excel
Excel calls contingency tables Pivot Tables. To make a pivot table, from the Data menu, choose Pivot Table. In the layout window, drag one variable to the row area, drag the other variable to the column area, and then drag either variable again to the data area. This tells Excel to count the occurrences of each category.
Contingency Tables in R
Using the function xtabs, you can create a contingency table from two variables x and y in a data frame called mydata with the command:
con.table <- xtabs(~ x + y, data = mydata)  # counts for each combination of x and y
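From there, the marginal and conditional distributions described above can be read off with base R (a sketch continuing the hypothetical mydata example):

margin.table(con.table, 1)  # marginal distribution of x
margin.table(con.table, 2)  # marginal distribution of y
prop.table(con.table, 1)    # conditional distribution of y within each category of x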
Working with Tables and Graphs
When working with large data sets, a frequency distribution is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.
Frequency Distribution
A frequency distribution or table shows how data are partitioned among several categories by listing the categories along with the number of data values in each of them.
Lower class limits are the smallest numbers that can belong to each of the different classes. Upper class limits are the largest numbers that can belong to each of the different classes. Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits. Class midpoints are the values in the middle of the classes. Class width is the difference between two consecutive lower class limits in a frequency distribution.
Finding the correct class width can be tricky. For class width, don’t make the most common mistake of using the difference between a lower class limit and an upper class limit. For class boundaries, remember that they split the difference between the end of one class and the beginning of the next class.
We construct frequency distributions to:
- Summarize large data sets
- See the distribution and identify outliers
- Have a basis for constructing graphs
Technology can generate frequency distributions, but these are the common steps (an R sketch follows the list):
- Select the number of classes, usually between 5 and 20
- Calculate class width: \(\frac{\text{max data value} - \text{min data value}}{\text{number of classes}} \)
- Round this result to get a convenient number
- Choose the value for the first lower class limit by using either the min value or a convenient value below the minimum.
- Using the first lower class limit and the class width, list the other lower class limits.
- List the lower class limits in a vertical column and then determine and enter the upper class limits.
- Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to find the total frequency for each class.
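Here is that R sketch of the steps above, using hypothetical data, five classes of width 8, and a first lower class limit of 10:

x <- c(12, 17, 23, 25, 31, 34, 36, 42, 47, 49)  # hypothetical data
breaks <- seq(10, 50, by = 8)                   # lower class limits 10, 18, 26, 34, 42
classes <- cut(x, breaks, right = FALSE)        # tally each value into [10,18), [18,26), ...
table(classes)                                  # the frequency distribution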
Relative Frequency Distribution
A variation of the basic frequency distribution is a relative frequency distribution. Each class frequency is replaced by a relative frequency as a percentage.
\[ \text{relative frequency} = \frac{\text{frequency for class}}{\text{sum of frequencies}} * 100 \]
This will give you the frequency percentage.
The sum of the percentages in a relative frequency distribution will be very close to 100 percent.
Another variation of a frequency distribution is a cumulative frequency distribution in which the frequency for each class is the sum of the frequencies for that class and all previous classes.
At the beginning we noted that a frequency distribution can help us understand the distribution of a data set, which is the nature or shape of the spread of the data over the range of values. In statistics, we are often interested in determining whether the data have a normal distribution. Data that have an approximately normal distribution are characterized by a frequency distribution with the following features:
- The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
- The distribution is approximately symmetric. Frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.
The presence of gaps can suggest that the data are from two or more different populations.
Comparing two or more relative frequency distributions in one table makes comparisons of data much easier.
While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.
A histogram visually displays the shape of the distribution of the data. It shows the location of the center of the data. Histograms show the spread of data and can also identify outliers.
A histogram is basically a graph of a frequency distribution. Class frequencies should be used for the vertical scale and that scale should be labeled. There is no universal agreement on the procedure for selecting which values are used for the bar locations along the horizontal scale, but it is common to use class boundaries, class midpoints, class limits, or something else. It is often easier for us to use class midpoints for the horizontal scale. Histograms can usually be generated using technology.
A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies.
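In R, hist() draws the basic histogram; for a relative frequency histogram, one common idiom is to compute the bins first and rescale the counts. A sketch on simulated data:

```r
# Basic histogram, then a relative frequency version of the same data.
x <- rnorm(200, mean = 70, sd = 10)    # simulated data for illustration
hist(x, main = "Histogram", xlab = "Value")

h <- hist(x, plot = FALSE)             # compute the bins without plotting
h$counts <- h$counts / sum(h$counts)   # rescale counts to relative frequencies
plot(h, main = "Relative frequency histogram", ylab = "Relative frequency")
```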
The ultimate objective of using histograms is to be able to understand characteristics of data. Exploring the data means to:
- Find the center of the data
- Find the variation
- Find the shape of the distribution
- Find any outliers
- Find the change of data over time
When a graph is said to be skewed to the right, it means the histogram shape has a tail on the right.
When a graph is said to be skewed to the left, it means the histogram shape has a tail on the left.
Bell-shaped distribution is called a normal distribution and has its highest values in the middle.
Uniform distribution is a histogram with roughly the same values all the way across.
Many statistical methods require that sample data come from a population having a distribution that is approximately a normal distribution.
In a uniform distribution, the different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform.
A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right, called positively skewed, have a longer right tail.
Data skewed to the left, called negatively skewed, have a longer left tail.
Some really important methods have a requirement that sample data must be from a population having a normal distribution. Histograms can be helpful in determining whether the normality requirement is satisfied, but they are not very helpful with very small data sets.
The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line, and the points do not show some systematic pattern that is not a straight-line pattern.
The population distribution is not normal if the normal quantile plot has either or both of these two conditions:
- The points do not lie reasonably close to a straight-line pattern
- The points show some systematic pattern that is not a straight-line pattern
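R can draw a normal quantile plot with qqnorm(), and qqline() adds a reference line for judging straightness; a quick sketch:

```r
# Normal quantile plot: points near the line suggest normality.
x <- rnorm(100)   # data that really are Normal, for comparison
qqnorm(x)
qqline(x)         # strong bends away from this line suggest non-normality
```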
Graphs that Enlighten
A dot plot is a simple but useful graph of quantitative data in which each data value is plotted as a point above a horizontal scale of values. Dots representing equal values are stacked.
A dot plot:
- Displays the shape of the distribution of data
- It is usually possible to recreate the original list of data values.
A stem plot is another type of graph and it represents quantitative data by separating each value into two parts: the stem and the leaf. Better stem plots are often obtained by first rounding the original data values. Also, stem plots can be expanded to include more rows and can be condensed to include fewer rows.
Stem plots:
- Show the shape of the distribution of the data
- Retain the original data values
- Keep the sample data in sorted order
A time-series graph is a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.
Time-series graphs:
- Reveal information about trends over time
Bar graphs use bars of equal width to show frequencies of categories of categorical data. The bars may or may not be separated by small gaps.
Bar graphs:
- Show the relative distribution of categorical data so that it is easier to compare the different categories.
A Pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right.
Pareto charts:
- Show the relative distribution of categorical data so that it is easier to compare the different categories.
- Draw attention to the more important categories.
A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Although pie charts are very common, they are not as effective as Pareto charts.
Pie charts:
- Show the distribution of categorical data in a commonly used format.
Avoid pie charts whenever possible: they waste ink on components that are not data, and they lack an appropriate scale.
A frequency polygon uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars.
A variation of the basic frequency polygon is the relative frequency polygon, which uses relative frequencies for the vertical scale. An advantage of relative frequency polygons is that two or more of them can be combined on a single graph for easy comparison.
Graphs that Deceive
Deceptive graphs are commonly used to mislead people. Graphs should be constructed in a way that is fair and objective.
A common deceptive technique is to start the vertical scale at some value greater than zero to exaggerate differences between groups; such a graph is said to have a nonzero vertical axis. Always examine a graph carefully to see whether the vertical axis begins at some point other than zero so that differences are exaggerated.
Pictographs are another type of chart that are used to mislead. Data that are one-dimensional in nature are often depicted with two-dimensional objects or three-dimensional objects. By using pictographs, artists can create false impressions that grossly distort differences by using these same principles of basic geometry:
- When you double each side of a square, its area doesn’t merely double, it increases by a factor of four
- When you double each side of a cube, its volume doesn’t merely double, it increases by a factor of eight
When examining data depicted with a pictograph, determine whether the graph is misleading because objects of area or volume are used to depict amounts that are actually one-dimensional.
For small data sets of 20 values or fewer, use a table instead of a graph. A graph of data should make us focus on the true nature of the data, not on other elements, such as eye-catching but distracting design features. Do not distort data. Construct a graph to reveal the true nature of the data. Almost all of the ink in a graph should be used for the data, not for the design elements.
A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.
A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired quantitative data with the first variable on the horizontal x-axis and the second variable y on the vertical axis.
The presence of correlation between two variables is not evidence that one of the variables causes the other. We might find a correlation between beer consumption and weight, but we cannot conclude from the statistical evidence that drinking beer has a direct effect on weight.
A scatterplot can be very helpful in determining whether there is a correlation between the two variables.
The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables.
When we conclude that there appears to be a linear correlation between two variables, we can find the equation of the straight line that best fits the sample data, and that equation can be used to predict the value of one variable when given a specific value of the other variable. Instead of using the straight-line equation \(y = mx + b \) that we have all learned in prior math courses, we use the format that follows.
Given a collection of paired sample data, the regression line, or line of best fit, is the straight line that best fits the scatter plot of the data.
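As a sketch of these ideas in R, with made-up paired data: cor() gives the linear correlation coefficient r, and lm() fits the least-squares regression line.

```r
# Correlation and the regression line for made-up paired data.
x <- c(1, 2, 3, 4, 5, 6)
y <- c(2.1, 3.9, 6.2, 8.1, 9.8, 12.2)
cor(x, y)                        # linear correlation coefficient r
fit <- lm(y ~ x)                 # least-squares regression line
coef(fit)                        # intercept and slope of the line of best fit
plot(x, y); abline(fit)          # scatterplot with the fitted line drawn through it
predict(fit, data.frame(x = 7))  # predict y for a new x value
```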
Displaying and Describing Data
A symmetric distribution has roughly the same shape reflected around the center. A skewed distribution extends farther on one side than on the other. A unimodal distribution has a single major hump. A bimodal distribution has two humps. Multimodal distributions have more than two humps. Outliers are values that lie far from the rest of the data.
The mean is the sum of the values divided by the count. The median is the middle value. Half the values are above and half the values are below the median. The mean and median may differ because of outliers. If there are no outliers then the mean and median should be almost the same.
The standard deviation is roughly the square root of the average squared difference between each data value and the mean. It is the summary of choice for the spread of unimodal, symmetric variables. The IQR is the difference between the third and first quartiles. It is the preferred summary of spread for skewed distributions or data with outliers.
Area Principle
In a statistical display, each data value should be represented by the same amount of area.
Frequency Table
A frequency table lists the categories in a categorical variable and gives the count of observations for each category.
Distribution
The distribution of a categorical variable gives the possible values of the variable and the relative frequency of each value.
Bar Chart
Bar charts show a bar whose area represents the count of observations for each category of a categorical variable.
Pie Chart
Pie charts show how a whole is divided into categories. The area of each wedge of the circle corresponds to the proportion in each category.
Histogram
A histogram uses adjacent bars to show the distribution of a quantitative variable. Each bar represents the frequency of values falling in each bin.
Gap
A region of the distribution where there are no values.
Stem and Leaf Display
A display that shows quantitative data values in a way that sketches the distribution of the data.
Dotplot
A dotplot graphs a dot for each case along a single axis.
Density Plot
A density plot shows the shape of a variable’s distribution by smoothing out its histogram to make a gentle curve.
Shape
To describe the shape of a distribution, look for single versus multiple modes, symmetry versus skewness, and outliers versus gaps.
Mode
A hump or local high point in the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed.
Uniform
A distribution that does not appear to have any mode and in which all the bars of its histogram are approximately the same height.
Symmetric
A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other.
Tails
The parts of a distribution that trail off on either side. Distributions can be characterized as having long tails or short tails.
Skewed
A distribution is skewed if it’s not symmetric and one tail stretches out farther than the other. Distributions are said to be skewed left when the longer tail stretches to the left, and skewed right when it goes to the right.
Outlier
Outliers are extreme values that don’t appear to belong with the rest of the data. They may be unusual values that deserve further investigation or they may just be mistakes.
Center
The place in the distribution of a variable that you would point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number. Measures of the center include the mean and median.
Median
The median is the middle value, with half the data above and half below it. If n is even, it is the average of the two middle values. It is usually paired with the IQR.
Mean
The mean is found by adding up all the data values and dividing by the count.
Spread
A numerical summary of how tightly the values are clustered around the center. Measures of spread include the IQR and standard deviation.
Range
The difference between the lowest and highest value in a dataset.
Quartile
The lower quartile Q1 is the value with a quarter of the data below it. The upper quartile Q3 has three quarters of the data below it. The median and quartiles divide the data into the four parts with approximately equal numbers of data values.
Percentile
The ith percentile is the number that falls above i% of the data.
IQR - Interquartile Range
The IQR is the difference between the third and first quartiles, Q3 - Q1. It is usually reported along with the median.
Least Squares Property
A statistic has the least squares property if the sum of the squared deviations of the data values from that statistic is smaller than it would be for any other value; the mean has this property.
Residuals
A residual is the difference between an observed data value and some summary or model for that value.
Variance
The variance is the sum of squared deviations from the mean, divided by the count minus 1.
Standard Deviation
The standard deviation is the square root of the variance.
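All of these numerical summaries have one-line equivalents in R. A sketch on made-up data:

```r
# One-line versions of the summaries defined above.
x <- c(12, 15, 11, 19, 14, 13, 22, 16)
mean(x); median(x)   # measures of center
var(x); sd(x)        # variance and standard deviation (n - 1 in the denominator)
IQR(x)               # Q3 - Q1
diff(range(x))       # range: highest value minus lowest value
```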
Bar Chart In Excel
First make a pivot table which is Excel’s name for a frequency table. From the data menu, choose Pivot table and Pivot Chart Report. When you reach the layout window, drag your variable to the row area and drag your variable again to the data area. This tells Excel to count the occurrences of each category. Once you have an Excel pivot table, you can construct bar charts and pie charts.
Compute Average in Excel
Click inside the pivot table and click the pivot table chart wizard button; Excel creates a bar chart. To compute the mean, click on an empty cell. Go to the Formulas tab in the ribbon. Click the drop-down arrow next to AutoSum and choose Average. Enter the data range in the formula displayed in the empty cell you selected earlier. Press Enter and this will compute the mean for the values in that range.
Compute Standard Deviation in Excel
To compute the standard deviation, click on an empty cell. Go to the Formulas tab in the ribbon, click the drop-down arrow next to AutoSum, and select More Functions. In the dialog box that opens, select STDEV from the list of functions and click OK. A new dialog box opens. Enter a range of fields into the text fields and click OK. Excel computes the standard deviation for the values in that range and places it in the specified cell of the spreadsheet.
Comparing Distributions in Statistics
It is almost always more interesting to compare groups than to summarize data for a single group. There are several ways to summarize a variable. The median and quartiles are suitable even for data that may be skewed or have outliers and are usually used together. Along with these three values, we can report the max and min values. These five values together make up the 5-number summary of the data. They include the median, quartiles, max, and min. It is a useful, concise summary because it gives a good idea of the center, spread, and range.
A boxplot highlights several features of the distribution. The central box shows the middle half of the data, between the quartiles. The height of the box is equal to the IQR. If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If the median is not centered, the distribution is skewed. The whiskers show skewness as well if they are not roughly the same length.
Histograms or stem and leaf displays are good for a single distribution, but they are not practical for comparing 20 of them; it would be hard to see patterns. By placing boxplots side by side, you can easily see which groups have higher medians, which have greater IQRs, where the central 50% of the data is located in each group, and which have the overall greater range.
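A sketch of side-by-side boxplots in R, using simulated groups; the formula interface draws one box per group on a common scale:

```r
# Side-by-side boxplots make group comparisons easy.
values <- c(rnorm(50, mean = 10, sd = 2), rnorm(50, mean = 14, sd = 4))
group  <- rep(c("A", "B"), each = 50)
boxplot(values ~ group, ylab = "Value")  # compare centers, spreads, and outliers
```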
Outliers
Outliers arise for many reasons. They may be the most important values in the dataset or they may be an error. It could be an exceptional case or illuminating a pattern by being the exception to the rule. Many outliers are not wrong, they are just different. Most repay the effort to understand them. You can sometimes learn more from extraordinary cases than from summaries of the entire dataset.
There are two things you should never do with outliers. The first is to leave an outlier in place and proceed as if nothing happened; analyses of data with outliers are very likely to be wrong. The other is to omit an outlier from the analysis without comment. A histogram is often a better way to see more detail about how the outlier fits in or doesn’t fit at all.
Timeplots
A display of values against time is called a timeplot. Timeplots often show a great deal of point-to-point variation. We usually want to see past this variation to understand any underlying smooth trends. We also want to think about how the values vary around that trend, the timeplot version of center and spread.
Re-Expressing Data
When data are skewed, it can be hard to summarize them simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched out tail. We re-express the data by applying a simple function to each value. Re-express means to transform the data by applying a simple function to make the skewed distribution more symmetric. It could be either a square root or logarithm function. Variables that are skewed to the right often benefit from a re-expression by square roots, logs, or reciprocals. Those skewed to the left may benefit from squaring the data. Re-expressing can help alleviate the problem of comparing groups that have very different spreads.
Choose the right tool for comparing distributions. Compare the distributions of two or three groups with histograms. Compare several groups with boxplots, which make it easy to compare centers and spreads and spot outliers, but hide much of the detail of distribution shape.
Treat outliers with attention and care. Outliers are nominated by the boxplot rule, but you must decide what to do with them. Track down the background for outliers, it may be informative.
Re-express data to make them easier to work with. Re-expression can make skewed distributions more nearly symmetric. Re-expression can make the spreads of different groups more nearly comparable.
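A quick sketch of re-expression in R: lognormal data are strongly skewed to the right, and taking logs makes the histogram far more symmetric.

```r
# Re-expressing right-skewed data with a log transform.
x <- rlnorm(200)                          # strongly right-skewed simulated data
par(mfrow = c(1, 2))                      # two plots side by side
hist(x, main = "Original (skewed)")
hist(log(x), main = "Log re-expressed")   # much closer to symmetric
```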
Outlier
Values that are large or small compared to most of the other values in a variable. Whether they are outliers is a judgement call that depends on the context. A boxplot displays values more than 1.5 IQRs beyond the nearest quartile as potential outliers, but that is a rule of thumb, not a universal definition of an outlier.
5-Number Summary
A summary of a variable’s distribution that consists of the extremes, the quartiles, and the median.
Boxplot
A display of a box between the quartiles and whiskers extending to the highest and lowest values not nominated as outliers.
Far Outlier
In a boxplot a value more than 3 IQR’s beyond the nearest quartile. Such values deserve special attention.
Timeplot
A timeplot displays data that change over time. Often, successive values are connected with lines to show trends more clearly. Sometimes, a smooth curve is added to the plot to help show long-term patterns and trends.
Re-Express
This is another name for transform. The structure of data may be improved by working with a simple function of the data. The logarithm, square root, and reciprocal are the most common re-expression functions.
Measures of Center
Measures of center are widely used to provide representative values that summarize data sets.
A measure of center is a value at the center or middle of a data set.
The mean is generally the most important of all numerical measurements used to describe data. It is what most people call an average.
The mean of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values.
Sample means drawn from the same population tend to vary less than other measures of center. The mean of a data set uses every data value. A disadvantage of the mean is that just one extreme value can change the value of the mean substantially. This extreme value is called an outlier. By this definition, we say the mean is not resistant.
A statistic is resistant if the presence of extreme values does not cause it to change very much.
The definition of the mean can be expressed by the formula:
\[\bar{x} = \frac{\sum x}{n} \]
The symbol \(\sum\) (capital sigma) denotes the sum of the values, \(x\) represents the individual data values, and \(n\) is the number of values.
If the data are from a sample of the population, the mean is denoted by x-bar.
If the data are from the entire population, the mean is denoted by mu.
Sample statistics are usually represented by English letters and population parameters are usually represented by Greek letters.
\(\sum\) denotes the sum of a set of data values.
\(x\) is the variable usually used to represent the individual data values.
\(n\) represents the number of data values in a sample.
\(N\) represents the number of data values in a population.
Never use the term average when referring to a measure of center; the word is often used loosely for the mean, but it is ambiguous and statisticians avoid it.
The median can be thought of as a middle value. More precisely, the median of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing or decreasing magnitude.
The median does not change by large amounts when we include just a few extreme values, so the median is a resistant measure of center. The median does not directly use every data value.
The median of a sample is sometimes denoted by x-tilde or m or Med. To find the median, first sort the values.
If the number of data values is odd, the median is the number located in the exact middle of the sorted list.
If the number of data values is even, the median is found by computing the mean of the two middle numbers in the sorted list.
Mode isn’t used much with quantitative data, but it is the only measure of center that can be used with qualitative data. The mode of a data set is the value that occurs with the greatest frequency.
The mode can be found with qualitative data. A data set can have no mode, one mode, or multiple modes. When two data values occur with the same greatest frequency, each one is a mode and the data set is said to be bimodal. When more than two data values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal. When no data value is repeated, we say there is no mode.
Midrange is another measure of center. The midrange of a data set is the measure of center that is the value midway between the max and min values in the original data set. It is found by adding the max data value to the min data value and then dividing the sum by 2.
Because the midrange uses only the max and min values, it is very sensitive to those extremes so the midrange is not resistant. In practice, the midrange is rarely used, but it has 3 redeeming features:
- It is very easy to compute
- It helps reinforce the very important point that there are several different ways to define the center of a data set.
- The value of the midrange is sometimes used incorrectly for the median, so confusion can be reduced by clearly defining the midrange along with the median.
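R has no built-in mode function for data values, so here is a small sketch that computes the midrange and the mode(s) directly from the definitions above; the helper logic is my own illustration.

```r
# Midrange and mode(s) computed directly from the definitions.
x <- c(2, 3, 3, 5, 7, 7, 7, 9)
midrange <- (max(x) + min(x)) / 2   # value midway between the extremes
freq  <- table(x)                   # frequency of each distinct value
modes <- as.numeric(names(freq)[freq == max(freq)])  # most frequent value(s)
midrange  # 5.5
modes     # 7
```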
When calculating measures of center, we often need to round the result.
For the mean, median, and midrange, carry one more decimal than is present in the original set of values.
For the mode, leave the value as is without rounding.
When applying any rounding rules, round only the final result, not anything before that.
We can always calculate measures of center from a sample of numbers, but we should always think about whether it makes sense to do that.
For example, it makes no sense to do numerical calculations with data at the nominal level of measurement. We should also think about the sampling method used to collect data. If the sampling method is not sound, the statistics we obtain may be very misleading.
Measures of Variation
To understand variation, we begin by introducing the range. The range of a set of data values is the difference between the max data value and the min data value. The range uses only the maximum and the minimum data values, so it is very sensitive to extreme values. It is not resistant. Because the range uses only the max and min values, it does not take every value into account and therefore does not truly reflect the variation among all of the data values.
\[ \text{Range = max value - min value} \]
Range Rule of Thumb
The range rule of thumb is a quick way to ballpark the standard deviation:
\[ s \approx \frac{\text{range}}{4} \]
Standard Deviation of a Sample
The standard deviation is the measure of variation most commonly used in statistics. It is a measure of how much data values deviate away from the mean. The standard deviation found from sample data is a statistic denoted by \(s\).
The symbol for sample standard deviation is \(s\).
The symbol for population standard deviation is \(\sigma\).
The symbol for sample variance is \(s^2\)
The symbol for population variance is \(\sigma^{2}\)
The standard deviation is a measure of how much data values deviate from the mean. The value of the standard deviation is never negative. It is zero only when all of the data values are exactly the same. Larger values indicate greater amounts of variation. The standard deviation can increase dramatically with one or more outliers. The units of the standard deviation are the same as the units of the original data values.
Here are the steps for finding the standard deviation (mirrored in the code sketch after this list):
- Find the mean of your data values
- Subtract the mean from each individual sample value
- Square each of the deviations obtained from the previous step
- Add all of the squares obtained from previous step
- Divide the total from previous step by n-1, which is 1 less than the total number of data values present
- Find the square root of the result of the previous step.
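Here is a sketch that follows those steps one by one and checks the result against R's built-in sd():

```r
# Standard deviation computed step by step, then checked against sd().
x   <- c(4, 8, 6, 5, 3, 7)
m   <- mean(x)                 # step 1: the mean
dev <- x - m                   # step 2: deviations from the mean
sq  <- dev^2                   # step 3: squared deviations
ssq <- sum(sq)                 # step 4: sum of the squares
vr  <- ssq / (length(x) - 1)   # step 5: divide by n - 1
s   <- sqrt(vr)                # step 6: square root
c(manual = s, builtin = sd(x)) # the two values agree
```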
Standard Deviation of a Population
A different formula is used to find the standard deviation of a population. We use the value of N instead of n-1. When using a calculator, make sure which kind of deviation it is giving you. The variance of a set of values is a measure of variation equal to the square of the standard deviation.
The units of the variance are the squares of the units of the original data values. The value of the variance can increase dramatically with the inclusion of outliers. So, the variance is not resistant. The value of the variance is never negative. It is zero only when all of the data values are the same number.
In measuring variation in a set of sample data, it makes sense to begin with the individual amounts by which values deviate from the mean, and to combine those deviations into one number that can serve as a measure of variation. We cannot simply add the deviations, because they always sum to zero. Instead, we can use the absolute values of the deviations. When we find the mean of those absolute values, we get the mean absolute deviation, which is the mean distance of the data from the mean.
Computation of the mean absolute deviation uses absolute values, so it uses an operation that is not algebraic. The use of absolute values would be simple but it would create algebraic difficulties in inferential statistics. The standard deviation has the advantage of using only algebraic operations. Because it is based on the square root of a sum of squares, the standard deviation closely parallels distance formulas found in algebra. There are many instances where a statistical procedure is based on a similar sum of squares. Consequently, instead of using absolute values, we square all deviations so that they are nonnegative and those squares are used to calculate the standard deviation.
After finding all of the individual values we combine them by finding their sum. We then divide by n-1 because there are only n-1 values that can be assigned without constraint. With a given mean, we can use any numbers for the first n-1 values, but the last value will then be automatically determined. With division by n-1, sample variances tend to center around the value of the population variance. With division by n, sample variances tend to underestimate the value of the population variance.
A concept helpful in interpreting the value of the standard deviation is the empirical rule. This rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply (checked empirically in the sketch after this list):
- 68 percent of all values fall within 1 standard deviation of the mean
- 95 percent of all values fall within 2 standard deviations of the mean
- 99.7 percent of all values fall within 3 standard deviations of the mean
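The rule is easy to check empirically on simulated bell-shaped data; a sketch:

```r
# Checking the empirical rule on simulated Normal data.
x <- rnorm(10000)
m <- mean(x); s <- sd(x)
mean(abs(x - m) <= 1 * s)  # roughly 0.68
mean(abs(x - m) <= 2 * s)  # roughly 0.95
mean(abs(x - m) <= 3 * s)  # roughly 0.997
```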
Another concept helpful in understanding the value of a standard deviation is Chebyshev’s theorem: for any data set, the proportion of values within K standard deviations of the mean is at least \(1 - \frac{1}{K^2}\). The empirical rule applies only to data sets with bell-shaped distributions, but Chebyshev’s theorem applies to any data set. Because the results are only lower limits, the theorem has limited usefulness.
If the population mean is \(\mu\) and the population standard deviation is \(\sigma\), then the range rule of thumb for identifying significant values is as follows:
Significantly low values are \(\mu - 2\sigma\) or lower
Significantly high values are \(\mu + 2\sigma\) or higher.
Values that are not significant fall between those two limits.
Measures Of Relative Standing
Measures of relative standing are numbers showing the location of data values relative to the other values within the same data set.
A z score is found by converting a value to a standardized scale. This definition shows that a z score is the number of standard deviations that a data value is away from the mean.
The z score is calculated by using:
\[z = \frac{x - \bar{x}}{s}\]
Or
\[z = \frac{x - \mu}{\sigma}\]
A z score is the number of standard deviations that a given value is above or below the mean.
Z scores are expressed as numbers with no units of measurement.
A data value is significantly low if its z score is less than or equal to -2 or the value is significantly high if its z score is greater than or equal to +2.
If an individual data value is less than the mean, its corresponding z score is a negative number.
A value is significantly low or significantly high if it is at least two standard deviations away from the mean. It follows that significantly low values have z scores less than or equal to -2 and significantly high values have z scores greater than or equal to +2. If a value is in between these values then it is not significant.
A z score is a measure of position, in the sense that it describes the location of a value relative to the mean. Percentiles and quartiles are other measures of position useful for comparing values within the same data set or between different data sets.
Percentiles
Percentiles are one type of quantiles or fractiles which partition data into groups with roughly the same number of values in each group.
The 50th percentile has about 50% of the data values below it and about 50% above it.
The process of finding the percentile that corresponds to a particular data value is given by the following formula:
\[\text{percentile} = \frac{\text{number of values less than x}}{\text{total number of values}}*100\]
Notation
- n = total number of values in the data set
- k = percentile being used, for example k = 25
- L = locator that gives the position of a value
- \(P_k\) = kth percentile
Algorithm
Sort the data from lowest to highest.
Compute \(L=\frac{k}{100}*n\) where n= number of values and k= percentile in question.
Is L a whole number?
- If yes, the value of the kth percentile is midway between the Lth value and the next value in the sorted set of data. Find \(P_k\) by adding the Lth value and the next value and dividing the total by 2.
- If no, round L up to the next larger whole number; the value of \(P_k\) is then the Lth value, counting from the lowest.
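A sketch of this locator algorithm as an R function; note that R's own quantile() uses a different interpolation rule by default, so small differences from it are expected.

```r
# kth percentile using the locator method described above.
kth_percentile <- function(x, k) {
  x <- sort(x)                # step 1: sort the data
  L <- (k / 100) * length(x)  # step 2: compute the locator
  if (L == floor(L)) {
    (x[L] + x[L + 1]) / 2     # L is whole: midway between the Lth and next value
  } else {
    x[ceiling(L)]             # otherwise: round L up and take that value
  }
}
kth_percentile(c(2, 4, 6, 8, 10, 12, 14, 16), 25)  # L = 2, so (4 + 6) / 2 = 5
```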
Quartiles
Just as there are 99 percentiles that divide the data into 100 groups, there are three quartiles that divide the data into four groups.
Quartiles are measures of location, Q1,Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group.
Interquartile range = \(Q_3 - Q_1\)
Semi-interquartile range = \(\frac{Q_3 - Q_1}{2}\)
Midquartile = \(\frac{Q_3 + Q_1}{2}\)
10-90 percentile range = \(P_{90} - P_{10}\)
Boxplots
The values of the minimum, maximum, and three quartiles are used for the summary and construction of boxplot graphs.
For a set of data the summary consists of these 5 values:
- Minimum
- First quartile, Q1
- Second quartile, Q2
- Third quartile, Q3
- Maximum
A boxplot is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, the median, and the third quartile.
A boxplot can often be used to identify skewness, which means the distribution is not symmetric.
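In R, fivenum() returns Tukey's five-number summary (its hinges can differ slightly from some quartile conventions) and boxplot() draws the corresponding graph; a sketch:

```r
# 5-number summary and its boxplot.
x <- c(3, 7, 8, 5, 12, 14, 21, 13, 18)
fivenum(x)   # minimum, lower hinge (Q1), median, upper hinge (Q3), maximum
boxplot(x)   # box at the quartiles, line at the median, whiskers to the extremes
```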
Standard Deviation
Expressing a distance from the mean in standard deviations standardizes the performances. To standardize a value, we subtract the mean and then divide this difference by the standard deviation.
\[z = \frac{y - \bar{y}}{s}\]
Standardizing Values
The values are called standardized values, and are commonly denoted with the letter z. Usually, we call them z-scores. Z-scores measure the distance of a value from the mean in standard deviations. A z-score of 2 says that a data value is 2 standard deviations above the mean. Data values below the mean have a negative z-score, so a z-score of -1.6 means that the data value was 1.6 standard deviations below the mean.
There are two steps to finding a z-score. First, the data are shifted by subtracting the mean. Then, they are rescaled by dividing by the standard deviation. Adding or subtracting a constant to every data value adds or subtracts the same constant to measures of position, but leaves measures of spread unchanged.
When we multiply or divide all the data values by any constant, all measures of position such as median, mean, and percentiles, and measures of spread such as range, IQR, and the standard deviation are multiplied by that same constant.
Shifting and Scaling Values
Standardizing data into z-scores is just shifting them by the mean and rescaling them by the standard deviation. Now we can see how standardizing affects the distribution. When we subtract the mean of the data from every data value, we shift the mean to zero. As we have seen, such a shift does not change the standard deviation.
When we divide each of these shifted values by s, the standard deviation should be divided by s as well. Since the standard deviation was s to start with, the new standard deviation becomes 1. Standardizing into z-scores does not change the shape of the distribution of a variable. Standardizing into z-scores changes the center by making the mean 0. Standardizing into z-scores changes the spread by making the standard deviation 1.
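A sketch showing the effect of standardizing on a small made-up data set:

```r
# Standardizing shifts the mean to 0 and rescales the standard deviation to 1.
y <- c(12, 15, 9, 22, 17)
z <- (y - mean(y)) / sd(y)  # same result as scale(y)
mean(z)                     # essentially 0 (up to rounding error)
sd(z)                       # exactly 1
```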
Normal Models
A z-score gives an indication of how unusual a value is because it tells how far it is from the mean. If the data value sits right at the mean, it’s not very far at all and its z-score is 0. A z-score of 1 tells us that the data value is 1 standard deviation above the mean, while a z-score of -1 tells us that the value is 1 standard deviation below the mean.
For many unimodal and symmetric distributions, about 68% of the values fall within one standard deviation of the mean. 95% of the values are found within two standard deviations of the mean. 99.7% or almost all of the values will be within three standard deviations of the mean.
In 1809 Gauss figured out the formula for the model that accounts for this observation; it is called the Normal or Gaussian model. It illustrates one of the most important uses of the standard deviation: the standard deviation is the statistician’s ruler. This model for unimodal symmetric data gives us even more information because it tells us how likely it is to have z-scores between -1 and 1, between -2 and 2, and between -3 and 3.
These magic 68, 95, and 99.7 values come from the Normal model. As a model, it can give us corresponding values for any z-score.
N always denotes a Normal model. The mu symbol is the Greek letter for m and always represents the mean in a model. The sigma character is the lowercase Greek letter for s and always represents the standard deviation in a model. The mean and standard deviation are not numerical summaries of data. They are characteristics of the model called parameters. Parameters are the values we choose that completely specify a model. We do not want to confuse the parameters with summaries of the data, so we use special symbols. In statistics, we almost always use Greek letters for parameters. Summaries of data, like the sample mean, median, or standard deviation, are called statistics and are usually written with Latin letters.
If we model data with a Normal model and standardize them using the corresponding mu or sigma, we still call the standardized value a z-score.
\[z = \frac{y - \mu}{\sigma}\]
Usually, it is easier to standardize data using the mean and standard deviation first. Then we only need the model with mean 0 and standard deviation 1. This Normal model is called the Standard Normal model.
Notice how well the 68-95-99.7 rule works when the distribution is unimodal and symmetric. Be careful, though: you should not use the Normal model for just any dataset. Standardizing will not change the shape of the distribution. If the distribution is not unimodal and symmetric to begin with, standardizing will not make it Normal.
All models make assumptions. Whenever we model we will be careful to point out the assumptions that we are making. We will also check the associated conditions in the data to make sure that those assumptions are reasonable. So, do not model data without checking whether the data is normal or not. To be Normal, the shape of the data’s distribution is unimodal and symmetric and there are no obvious outliers.
To sketch a Normal curve that looks normal is important. The Normal curve is bell-shaped and symmetric around its mean. Start at the middle and sketch to the right and left from there. Even though the Normal model extends forever on either side, you need to draw it only for 3 standard deviations. After that, there is little left that is worth sketching. The place where the bell shape changes from curving downward to curving back up, or inflection point, is exactly one standard deviation away from the mean.
Normal Percentiles
When a value does not fall exactly one, two or three standard deviations from the mean, we need to find the percentiles. Mathematically, the percentage of values falling between two z-scores is the area under the normal model between those values. So, Normal percentiles are the percentage of values in a standard Normal distribution found at that z-score or below.
Finding areas from z-scores is the simplest way to work with the Normal model. But sometimes we start with areas and are asked to work backward to find the corresponding z-score or even the original data value.
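In R, pnorm() goes from a z-score to the area below it, and qnorm() works backward from an area to a z-score; a sketch:

```r
# Normal percentiles in both directions.
pnorm(1.5)            # proportion of the standard Normal below z = 1.5
qnorm(0.90)           # z-score with 90% of the values below it
pnorm(2) - pnorm(-2)  # area between z = -2 and z = 2, about 0.95
```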
Normal Probability Plots
We have assumed that the underlying data distribution was roughly unimodal and symmetric so that using a Normal model is reasonable. Drawing a histogram of the data and looking at the shape is one good way to see whether a Normal model might work.
However, there is a more specialized graphical display that can help you to decide whether a Normal model is appropriate, the Normal probability plot. If the distribution of the data is roughly Normal, the plot will be roughly a diagonal straight line. Deviations from a straight line indicate that the distribution is not Normal. This plot is usually able to show deviations from Normality more clearly than the corresponding histogram, but it is usually easier to understand how a distribution fails to be Normal by looking at its histogram.
A Normal probability plot takes each data value and plots it against the z-score you would expect that point to have if the distribution were perfectly Normal. When the values match up well, the line is straight. If one or two points are surprising from the Normal’s point of view, they do not line up. When the entire distribution is skewed or different from the Normal in some other way, the values do not match up very well at all and the plot bends.
It turns out to be tricky to find the values we expect. They are called Normal scores, but you cannot easily look them up in tables. That is why probability plots are best made with technology and not by hand. The best advice on using probability plots is to see whether they are straight. If so, then your data look like data from a Normal model. If not, make a histogram to understand how they differ from the model.
Changing the spread and center of a variable is equivalent to changing the units. Indeed, the only part of the data’s context changed by standardizing is the units. All other aspects of the context do not depend on the choice or modification of measurement units. This fact points out an important distinction between the numbers the data provide for calculation and the meaning of the variables and the relationships among them. Standardizing can make the numbers easier to work with, but it does not alter the meaning.
Another way to look at this is to note that standardizing may change the center and spread values, but it does not affect the shape of a distribution. A histogram or boxplot of standardized values looks just the same as the histogram or boxplot of the original values except for the numbers on the axes. When we summarized shape, center, and spread for histograms, we compared them to unimodal, symmetric shapes. You could not ask for a nicer example than the Normal model. If the shape is like a Normal, we will use the mean and standard deviation to standardize the values.
Basics of Probability
An event is any collection of results or outcomes of a procedure.
A simple event is an outcome or an event that cannot be further broken down into simpler components.
The sample space for a procedure consists of all possible simple events. That is, the sample space consists of all outcomes that cannot be broken down any further.
Simple Events
With one birth, the result of 1 girl is a simple event and the result of 1 boy is another simple event. They are individual simple events because they cannot be broken down any further.
With three births, the result of 2 girls followed by a boy is a simple event.
When rolling a single die, the outcome of 5 is a simple event, but the outcome of an even number is not a simple event.
Simple Events and Sample Spaces
With three births, the event of 2 girls and 1 boy is not a simple event because it can occur with different simple events.
With three births, the sample space consists of the eight different simple events.
Probability plays a central role in the important statistical method of hypothesis testing. Statisticians make decisions using data by rejecting explanations based on very low probabilities.
In probability, we deal with procedures that produce outcomes. An event is any collection of results or outcomes of a procedure. A simple event is an outcome or an event that cannot be further broken down into simpler components. The sample space for a procedure consists of all possible simple events. That is, the sample space consists of all outcomes that cannot be broken down any further.
Notation for Probabilities
P denotes a probability
A,B, and C denote specific events
P(A) denotes the probability of event A occurring
Three Approaches to Finding the Probability
There are three common approaches to finding the probability of event A:
- Relative frequency approximation: conduct a procedure and count the number of times that event A occurs; then \(P(A) = \frac{\text{number of times A occurred}}{\text{number of times procedure repeated}}\)
- Classical approach: if a procedure has n different simple events that are equally likely, and if event A can occur in s different ways, then \(P(A)=\frac{\text{number of ways A occurs}}{\text{number of different simple events}}=\frac{s}{n}\)
- Subjective probability: P(A), the probability of event A, is estimated by using knowledge of the relevant circumstances.
Simulations
Sometimes none of the preceding three approaches can be used. A simulation of a procedure is a process that behaves in the same ways as the procedure itself so that similar results are produced. Probabilities can sometimes be found by using a simulation.
Rounding Probabilities
When expressing the value of a probability, either give the exact fraction or decimal or round off final decimal results to three significant digits. When a probability is not a simple fraction such as \(\frac{2}{3}\), express it as a decimal so that the number can be better understood.
Law of Large Numbers
As a procedure is repeated again and again, the relative frequency probability of an event tends to approach the actual probability. It tells us that relative frequency approximations tend to get better with more observations. This law reflects a simple notion supported by common sense: a probability estimate based on only a few trials can be off by a substantial amount, but with a very large number of trials, the estimate tends to be much more accurate.
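A simulation makes the law of large numbers visible: flip a fair coin many times and watch the running relative frequency of heads settle toward 0.5. A sketch:

```r
# Relative frequency of heads approaching the actual probability 0.5.
set.seed(1)                                        # for reproducibility
flips   <- sample(c(0, 1), 10000, replace = TRUE)  # 1 represents heads
running <- cumsum(flips) / seq_along(flips)        # relative frequency so far
plot(running, type = "l", xlab = "Flips", ylab = "Relative frequency of heads")
abline(h = 0.5, lty = 2)                           # the actual probability
```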
Don’t make the common mistake of finding a probability value by mindlessly dividing a smaller value by a larger number. Instead, think carefully about the numbers involved and what they represent. Carefully identify the total number of items being considered.
Complementary Events
Sometimes we need to find the probability that an event does not occur. The complement of event A, denoted by \(\bar{A}\), consists of all outcomes in which event A does not occur.
Identifying Significant Results
If, under a given assumption, the probability of a particular observed event is very small and the observed event occurs significantly less than or significantly greater than what we typically expect with that assumption, we conclude that the assumption is probably not correct.
We can use probabilities to identify values that are significantly low or significantly high.
- High number of successes: x successes among n trials is a significantly high number of successes if the probability of x or more successes is unlikely with a probability of 0.05 or less.
- Low number of successes: x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is unlikely with a probability of 0.05 or less.
Odds
Expressions of likelihood are often given as odds, such as 50:1. Here are advantages of probabilities and odds:
- Odds make it easier to deal with money transfers associated with gambling.
- Probabilities make calculations easier, so they tend to be used by statisticians, mathematicians, scientists, and researchers in all fields.
In the three definitions that follow, the actual odds against and the actual odds in favor reflect the actual likelihood of an event, but the payoff odds describe the payoff amounts that are determined by gambling houses.
The actual odds against event A occurring are the ratio \(P(\bar{A}) / P(A)\), usually expressed in the form a:b, where a and b are integers.
The actual odds in favor of event A occurring are the ratio \(P(A) / P(\bar{A})\), which is the reciprocal of the actual odds against that event. If the odds against an event are a:b, then the odds in favor are b:a.
The payoff odds against event A occurring are the ratio of net profit (if you win) to the amount bet.
Payoff odds against event A = net profit:amount bet
If you bet $5 on the number 13 in roulette, your probability of winning is \(\frac{1}{38}\), but the payoff odds are given by the casino as 35:1.
With \(P(13) = \frac{1}{38}\) and \(P(\text{not } 13) = \frac{37}{38}\), the actual odds against 13 are \(\frac{37/38}{1/38}\), or 37:1.
Addition and Multiplication of Probabilities
The addition rule is a tool for finding P(A or B), which is the probability that either event A occurs or event B occurs as the single outcome of a procedure. The word “or” in the addition rule is associated with the addition of probabilities.
Multiplication Rule
This section also presents the basic multiplication rule used for finding P(A and B), which is the probability that event A occurs and event B occurs. The word “and” in the multiplication rule is associated with the multiplication of probabilities.
Compound Event
A compound event is any event combining two or more simple events.
Addition Rule
Here is the notation for the addition rule. P(A or B) = P(in a single trial, event A occurs or event B occurs or they both occur).
Intuitive Addition Rule
To find P(A or B), add the number of ways event A can occur and the number of ways event B can occur, but add in such a way that every outcome is counted only once. P(A or B) is equal to that sum, divided by the total number of outcomes in the sample space.
Formal Addition Rule
P(A or B) = P(A) + P(B) - P(A and B)
Where P(A and B) denotes the probability that A and B both occur at the same time as an outcome in a trial of a procedure.
Disjoint Events and the Addition Rule
Events A and B are disjoint or mutually exclusive if they cannot occur at the same time. That is, disjoint events do not overlap.
Disjoint events:
Event A - Randomly selecting someone for a clinical trial who is a male.
Event B - Randomly selecting someone for a clinical trial who is a female.
Events that are not disjoint:
Event A - Randomly selecting someone taking a statistics course.
Event B - Randomly selecting someone who is a female.
Complementary Events and the Addition Rule
We use \(\bar{A}\) to indicate that event A does not occur. Common sense dictates this principle: we are certain, with probability 1, that either event A occurs or it does not, so \(P(A \text{ or } \bar{A}) = 1\). Because events \(A\) and \(\bar{A}\) must be disjoint, we can use the addition rule to express this principle as follows:
\[P(A \text{ or } \bar{A}) = P(A) + P(\bar{A}) = 1 \]
Rule of Complementary Events
\[ P(A) + P(\bar{A}) = 1 \]
\[ P(\bar{A}) = 1 - P(A) \]
\[ P(A) = 1 - P(\bar{A}) \]
Multiplication Rule
P(A and B) = P(event A occurs in a first trial and event B occurs in a second trial)
P(B | A) represents the probability of event B occurring after it is assumed that event A has already occurred.
Intuitive Multiplication Rule
To find the probability that event A occurs in one trial and event B occurs in another trial, multiply the probability of event A by the probability of event B, but be sure that the probability of event B is found by assuming that event A has already occurred.
Formal Multiplication Rule
P(A and B) = P(A) * P(B | A)
Independence and the Multiplication Rule
Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other. Several events are independent if the occurrence of any does not affect the probabilities of the occurrence of the others. If A and B are not independent, they are said to be dependent.
Sampling
In the world of statistics, sampling methods are critically important.
Sampling with replacement: Selections are independent events.
Sampling without replacement: Selections are dependent events.
Treating Dependent Events as Independent
When sampling without replacement and the sample size is no more than 5% of the size of the population, treat the selections as being independent, even though they are actually dependent.
Redundancy
The principle of redundancy is used to increase the reliability of many systems. Our eyes have passive redundancy in the sense that if one of them fails, we continue to see. An important finding of modern biology is that genes in an organism can often work in place of each other. Engineers often design redundant components so that the whole system will not fail because of the failure of a single component.
When randomly selecting an adult, A denotes the event of selecting someone with blue eyes. What do \(P(A)\) and \(P(\bar{A})\) represent?
\(P(A)\) represents the probability of selecting an adult with blue eyes.
\(P(\bar{A})\) represents the probability of selecting an adult who does not have blue eyes.
There are 15,958,866 adults in a region. If a polling organization randomly selects 1235 adults without replacement, are the selections independent or dependent? If the selections are dependent, can they be treated as independent for the purposes of calculations?
The selections are dependent because the selection is done without replacement.
Yes, because the sample size is less than 5% of the population.
When randomly selecting an adult, let B represent the event of randomly selecting someone with type B blood. Write a sentence describing what the rule of complements below is telling us.
\(P(B \text{ or } \bar{B}) = 1\)
It is certain that the selected adult has type B blood or does not have type B blood.
A research center poll showed that 76% of people believe that it is morally wrong to not report all income on tax returns. What is the probability that someone does not have this belief?
1 - .76 = .24
Find the indicated complement.
A certain group of women has a 0.2% rate of red/green color blindness. If a woman is randomly selected, what is the probability that she does not have this color blindness?
1 - .002 = .998
Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.
Restaurant:          A    B    C    D
Order accurate:     316  266  250  125
Order not accurate:  32   56   37   20
If one order is selected, find the probability of getting food that is not from restaurant A.
Add up all of the orders from B, C, and D (both rows), then divide by the total of all orders from A, B, C, and D:
754/1102 = .684
Use the data in the following table which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.
If one order is selected, find the probability of getting an order that is not accurate.
Restaurant:          A    B    C    D
Order accurate:     320  260  236  149
Order not accurate:  39   59   32   12
Add up the not-accurate orders and divide by the total number of orders:
142/1107 = .128
Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.
Restaurant:          A    B    C    D
Order accurate:     321  280  244  129
Order not accurate:  39   51   30   14
If one order is selected, find the probability of getting an order from restaurant A or an order that is accurate. Are the events of selecting an order from restaurant A and selecting an accurate order disjoint events?
The formal addition rule is \( P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \).
Accurate orders: 974
Not-accurate orders from restaurant A: 39
Add these together to get 1013, which counts every order that is from A or accurate exactly once.
1013/1108 = .914
The events are not disjoint, because an order can be both from restaurant A and accurate.
Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.
Restaurant:          A    B    C    D
Order accurate:     367  255  206  176
Order not accurate:  45   53   22   28
If two orders are selected, find the probability that they are both from restaurant D. Consider selections made both with and without replacement; in each case, are the events independent?
\[ P(A and B) = P(A) * P(B | A) \]
Calculate total orders from all restaurants
Calculate orders from restaurant D
Divide orders from restaurant D by the total number of orders. This gives \(P(A)\)
- Assume that the selections are made with replacement
The events are independent and probability of event B stays the same regardless of event A
So, \( P(A \text{ and } B) = \frac{204}{1152} \cdot \frac{204}{1152} = .0314 \)
- Assume that the selections are made without replacement.
The probability of event A will be the same \(\frac{204}{1152}\)
When replacements are not used, the events are not independent and the probability of event B changes depending on the outcome of event A.
Since event A was selecting an order from D, and the selected order does not get replaced, the number of orders from D and the total number of orders to choose from each decrease by 1 when choosing event B.
So:
\[ P(A) = \frac{204}{1152} \quad\text{and}\quad P(B \mid A) = \frac{204-1}{1152-1} \]
Multiply the probability of event A by the conditional probability of event B.
\[ P(A and B) = .0312 \]
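As a quick check, here is a minimal Python sketch of both calculations, using the counts from the table above (204 orders from D out of 1152 total):

```python
# Counts from the table above: restaurant D has 176 accurate
# and 28 inaccurate orders; there are 1152 orders in total.
total_orders = 367 + 255 + 206 + 176 + 45 + 53 + 22 + 28   # 1152
orders_from_d = 176 + 28                                   # 204

# With replacement the selections are independent: P(A and B) = P(A) * P(B).
p_with = (orders_from_d / total_orders) ** 2

# Without replacement, P(B | A) loses one D order and one total order.
p_without = (orders_from_d / total_orders) * \
            ((orders_from_d - 1) / (total_orders - 1))

print(round(p_with, 4), round(p_without, 4))   # 0.0314 0.0312
```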
Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.
Restaurant: A B C D
Order accurate: 323 267 241 128
Order not accurate: 30 55 34 12
If two orders are selected, find the probability that they are both accurate.
- Assume that the selections are made with replacement. Are the events independent?
Calculate the total number of orders: 1090
Accurate orders: 959
With replacement, the selections are independent, so
\[\frac{959}{1090} * \frac{959}{1090} = .7741 \]
- Assume that the selections are made without replacement. Are the events independent?
Because the selections are made without replacement, the events are dependent events.
The probability of each order being accurate is affected by the other orders.
The probability \(P(A)\) remains the same as in the previous part.
The probability \(P(B|A)\) must be adjusted to reflect that the first order was accurate and is not available for the second order.
Recall that originally there were 959 accurate orders out of 1090.
After the first accurate order is selected, there are 1089 orders remaining, of which 958 are accurate.
\[ P(A and B) = \frac{959}{1090} * \frac{958}{1089} = .7740 \]
The events are not independent because the sampling is done without replacement
Use the data in the following table.
Restaurant: A B C D
Order accurate: 321 260 243 121
Order not accurate: 35 52 32 14
If three orders are selected, find the probability that they are all from B.
There are 312 orders from B out of 1078 total. Treating the selections as independent:
\[\left(\frac{312}{1078}\right)^3 = .0242 \]
Use the following results from a test for marijuana use, which is provided by a certain drug testing company. Among 145 subjects with positive test results, there are 29 false positive results. Among 157 negative results, there are 3 false negative results.
- How many subjects were in the study?
Did Not Use Used
Positive test result: 29 116
Negative test result: 154 3
Add the subjects who tested positive to those who tested negative: 145 + 157 = 302
How many subjects did not use marijuana? 29 + 154 = 183
What is the probability that a randomly selected subject did not use marijuana? 183/302 = .606
Among 132 subjects with positive test results, there are 32 false positive results.
Among 168 negative results, there are 8 false negative results.
If one of the test subjects is randomly selected, find the probability that the subject tested negative or did not use marijuana.
Did Not Use Used
Positive test result: 32 100
Negative test result: 160 8
Total subjects=300
Next, find the probability that a randomly selected subject tested negative
168/300
Now, find the number of subjects that did not use marijuana
Two groups did not use marijuana: the true negatives and the false positives.
160+32=192
Next, find the probability that a randomly selected test subject did not use marijuana.
Did not use=192/300
Next, find the probability that a randomly selected test subject tested negative and did not use it
160/300
Finally, use the formal addition rule to find the probability that a randomly selected subject tested negative or did not use it, rounding to 3 decimal places
168/300+192/300-160/300 = .667
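A small Python sketch of this addition-rule calculation, using the cell counts reconstructed above:

```python
# Cell counts from the problem: 100 true positives, 32 false positives,
# 160 true negatives, 8 false negatives.
true_pos, false_pos = 100, 32
true_neg, false_neg = 160, 8
total = true_pos + false_pos + true_neg + false_neg   # 300

p_negative = (true_neg + false_neg) / total       # tested negative
p_no_use = (true_neg + false_pos) / total         # did not use marijuana
p_both = true_neg / total                         # negative AND did not use

# Formal addition rule: P(A or B) = P(A) + P(B) - P(A and B)
print(round(p_negative + p_no_use - p_both, 3))   # 0.667
```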
The principle of redundancy is used when system reliability is improved through redundant components. Assume that a student’s alarm has a 16.0% daily failure rate.
- What is the probability that the student’s alarm clock will not work on the morning of an important exam?
To convert the stated failure rate of 16% to a decimal number, remove the % symbol and divide by 100: 16/100 = .160
So, the probability that the student’s alarm clock will not work on the morning of an important exam is .160.
- If the student has two such alarm clocks, what is the probability that they both fail on the morning of an important exam?
Use the formal multiplication rule, which states that if P(A) is the probability of event A occurring and P(B|A) is the probability of B occurring given that A has occurred, then the probability of both A and B occurring is given by:
\[P(A \text{ and } B) = P(A) \cdot P(B \mid A)\]
The functioning of the second alarm clock is not affected by the failure of the first, so by definition they are independent events.
Multiply the two failure probabilities together.
.160*.160=.0256
- What is the probability of not being awakened if the student uses three independent alarm clocks?
A * B * C = .160*.160*.160= .00410
- Do the second and third alarm clocks result in greatly improved reliability?
Compare the probability of one alarm clock not working to the probabilities of 2 or 3 alarm clocks not working. The failure probability drops from .160 with one clock to .0256 with two and .00410 with three, so the redundant alarm clocks do greatly improve reliability. In general, an event that occurs with probability 1 is called certain, an event that occurs with probability .05 or less is called unlikely, and an event that occurs with probability 0 is called impossible.
Surge protectors p and q are used to protect a television. If there is a surge in the voltage, the surge protector reduces it to a safe level. Assume that each surge protector has a .88 probability of working correctly when a voltage surge occurs.
- If the two surge protectors are arranged in a series, what is the probability that a voltage surge will not damage the television?
With two independent surge protectors in series, the television will be protected unless both surge protectors fail; in other words, only one surge protector needs to work. Find the probability that at least one surge protector works by calculating 1 - P(both p and q fail). The probability that both fail can be found by applying the multiplication rule for independent events.
\[P(A and B)=P(A)*P(B)\]
The probability that a surge protector works correctly is .88. The probability that a surge protector fails is calculated below.
1-.88=.12
The probability that one surge protector fails is .12. The probability that both surge protectors fail is the product of their individual failure probabilities.
.12*.12=.0144
There is a .0144 probability that both surge protectors fail. The probability that the television is protected in a series configuration is the complement of the probability that both fail.
1-.0144=.9856
- If the two surge protectors are arranged in parallel, what is the probability that a voltage surge will not damage the television?
With two independent surge protectors in parallel, the television will be protected as long as both surge protectors work. The probability that the two independent surge protectors both work is found by applying the multiplication rule for independent events.
\[P(A and B)=P(A)*P(B)\]
The probability that a surge protector works correctly is .88. The probability that both surge protectors work is the product of the probabilities that both work correctly.
.88*.88=.7744
- Which arrangement should be used for better protection?
Series, because the series arrangement protects the television with probability .9856, while the parallel arrangement protects it with probability .7744.
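A short Python sketch comparing the two arrangements, under the same independence assumption used above:

```python
p_work = 0.88           # each protector works with probability .88
p_fail = 1 - p_work     # .12

# Series: the television is damaged only if both protectors fail.
p_series = 1 - p_fail ** 2

# Parallel: the television is protected only if both protectors work.
p_parallel = p_work ** 2

print(round(p_series, 4), round(p_parallel, 4))   # 0.9856 0.7744
# The series arrangement gives better protection.
```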
Complements and Conditional Probability
When finding the probability of some event occurring at least once, we should understand that "at least one" has the same meaning as "one or more." The complement of getting at least one occurrence of a particular event is getting no occurrences of that event.
Finding the probability of getting at least one of some event (a short sketch follows this list):
- Let A = getting at least one of some event.
- Then \(\bar{A}\) = getting none of the event being considered.
- Find \(P(\bar{A})\) = the probability that event A does not occur.
- Subtract the result from 1.
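Here is a minimal Python sketch of that procedure; the function name and example values are just for illustration:

```python
def prob_at_least_one(p_event: float, trials: int) -> float:
    """Rule of complements: P(at least one) = 1 - P(none)."""
    return 1 - (1 - p_event) ** trials

# Example: the chance of at least one girl in three births,
# assuming girls and boys are equally likely.
print(prob_at_least_one(0.5, 3))   # 0.875, i.e. 7/8
```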
Conditional Probability
A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.
\(P(B \mid A)\) denotes the conditional probability of event B occurring, given that event A has already occurred.
Intuitive Approach For Finding P(B|A)
The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur.
Formal Approach For Finding P(B|A)
The probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.
\[P(B \mid A) = \frac{P(A \text{ and } B)}{P(A)}\]
The preceding formula is a formal expression of conditional probability, but blind use of formulas is not recommended. The intuitive approach is recommended.
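A small sketch contrasting the two approaches, with made-up counts (the numbers here are hypothetical, not from any of the tables above):

```python
n = 1000             # total trials observed
count_a = 400        # trials where event A occurred
count_a_and_b = 300  # trials where both A and B occurred

# Formal approach: P(B|A) = P(A and B) / P(A)
p_formal = (count_a_and_b / n) / (count_a / n)

# Intuitive approach: restrict attention to the trials where A occurred.
p_intuitive = count_a_and_b / count_a

print(round(p_formal, 2), round(p_intuitive, 2))   # 0.75 0.75
```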
Bayes’ Theorem
The importance and usefulness of Bayes' Theorem is that it can be used with sequential events, whereby new additional information is obtained for a subsequent event, and that new information is used to revise the probability of the initial event. In this context, the terms prior probability and posterior probability are commonly used.
A prior probability is an initial probability value originally obtained before any additional information is obtained.
A posterior probability is a probability value that has been revised by using additional information that is obtained later.
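The notes above describe Bayes' Theorem without stating it; one standard form, written with the complement notation used earlier, is:
\[P(A \mid B) = \frac{P(A)\,P(B \mid A)}{P(A)\,P(B \mid A) + P(\bar{A})\,P(B \mid \bar{A})}\]
Here \(P(A)\) plays the role of the prior probability, and \(P(A \mid B)\) is the posterior probability after event B is observed.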
Multiplication Counting Rule
The multiplication counting rule is used to find the total number of possibilities from some sequence of events. For a sequence of events in which the first event can occur \(n_1\) ways, the second event can occur \(n_2\) ways, and so on, the total number of outcomes is \(n_1 \cdot n_2 \cdot n_3 \cdots\).
Factorial Rule
The factorial rule is used to find the total number of ways that n different items can be rearranged. The factorial symbol (!) denotes the product of decreasing positive whole numbers. The factorial rule states that the number of different arrangements of n different items, when all n of them are selected, is n!. The rule is based on the principle that the first item may be selected n different ways, the second item n-1 ways, and so on. This is really the multiplication counting rule modified for the elimination of one item on each selection.
Permutations and Combinations
When using different counting methods, it is essential to know whether different arrangements of the same items are counted only once or are counted separately. The terms permutations and combinations are standard in this context.
Permutations of items are arrangements in which different sequences of the same items are counted separately.
Combinations of items are arrangements in which different sequences of the same items are counted as being the same.
Permutations Rule
The permutation rule is used when there are n different items available for selection, we must select r of them without replacement, and the sequence of the items matters. The result is the total number of arrangements that are possible. Remember, rearrangements of the same items are counted as different permutations.
\[{}_nP_r = \frac{n!}{(n-r)!}\]
When n items are all selected without replacement, but some items are identical, the number of possible permutations is found by using the following rule:
\[\frac{n!}{n_1!n_2!...n_k!}\]
Combinations Rule
The combinations rule is used when there are n different items available for selection, only r of them are selected without replacement, and order does not matter. The result is the total number of combinations that are possible. Remember, rearrangements of the same items are considered to be the same combination.
\[{}_nC_r = \frac{n!}{(n-r)!\,r!}\]
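Python's standard math module (version 3.8 and later) implements all three counting rules directly; a quick sketch:

```python
import math

n, r = 10, 4

print(math.factorial(n))   # factorial rule: 10! = 3628800 arrangements
print(math.perm(n, r))     # permutations rule: 10!/6! = 5040
print(math.comb(n, r))     # combinations rule: 10!/(6! * 4!) = 210
```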
Find the probability that when a couple has three children, at least one of them is a girl. Assume that boys and girls are equally likely.
For each event there are two possibilities. There are 3 events.
½*½*½ = ⅛
1-⅛=⅞
In a certain country, the true probability of a baby being a boy is .509. Among the next six randomly selected births in the country, what is the probability that at least one of them is a girl?
The probability of at least one can be computed using the rule of complements. Let A represent the event that at least one of the next six births is a girl. Use the rule of complements below to find the probability of event A, P(A), where \(\bar{A}\) is the complement of A.
\[P(A)=1-P(\bar{A})\]
The complement of A, \(\bar{A}\), is the event that the next six births are all boys.
Since each birth has no effect on any of the other births, the births are all independent events. The probability that the next six births are all boys can be found using the multiplication rule for independent events. The probability of the event can be written as shown below:
It is given that the probability of a birth being a boy is .509.
Use the multiplication rule for independent events to find the probability that the next six births are all boys. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to 6 independent events.
.509*.509*.509*.509*.509*.509 = .017
Then use the rule of complements to find the probability that at least one of the births is a girl.
1-.017=.983
Therefore, the probability that the next six randomly selected births will contain at least one girl is .983
Subjects for the next presidential election poll are contacted using telephone numbers in which the last four digits are randomly selected (with replacement). Find the probability that for one such phone number, the last four digits include at least one 0.
10^4 = 10000 total possibilities for the last four digits
9^4 = 6561 possibilities that contain no 0
10^4 - 9^4 = 3439 possibilities that include at least one 0
3439/10000 = .344
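Because the sample space is small, the counting argument is easy to verify by brute force in Python:

```python
# Count the four-digit endings 0000-9999 that contain at least one 0.
count = sum(1 for i in range(10_000) if "0" in f"{i:04d}")
print(count, count / 10_000)   # 3439 0.3439
```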
Based on a poll, 72% of internet users are more careful about personal information when using a public wi-fi hotspot. What is the probability that among three randomly selected internet users, at least one is more careful about personal information when using a public wi-fi hotspot? How is the result affected by the additional information that the survey subjects volunteered to respond to?
The probability of at least one can be computed using the rule of complements. The rule of complements states that the following expression is true for events A and \(\bar{A}\), where \(\bar{A}\) indicates that event A did not take place.
\[P(A)=1-P(\bar{A})\]
Identify the event that is the complement of A
\[\bar{A} = \text{none of the internet users are more careful}\]
To find the probability of the complement, first find the probability that an internet user is not more careful with personal information while using a public wi-fi hotspot.
1 - P(is more careful)
1 - .72 = .28
Find the probability of the complement using the multiplication rule for independent events, rounding to three decimal places. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to three independent events.
.28 * .28 * .28 = .022
Now use the rule of complements
1-.022 = .978
Because the survey subjects volunteered to respond, the sample may not be representative of internet users, so the result may not reflect the true probability.
In an experiment, college students were given either four quarters or a $1 bill and they could either keep the money or spend it on gum.
Purchased Gum Kept the Money
Given four quarters 37 13
Given $1 bill 11 39
- Find the probability of randomly selecting a student who spent the money, given that the student was given four quarters.
The conditional probability of B occurring given that A has occurred, P(B|A), can be found intuitively by assuming that event A has occurred and then calculating the probability that event B will occur.
More formally, the probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.
In this case, given four quarters corresponds to event A and spent the money corresponds to event B.
First determine the number of students given four quarters that spent the money
37 students
Now calculate the probability
37/50=.74
- Find the probability of randomly selecting a student who kept the money given that the student was given four quarters.
Recall that 50 students were given four quarters
Identify the number of students given four quarters that kept the money
13 students
Now calculate the probability
13/50=.26
Note that since the students either kept the money or spent the money, these probabilities are complements.
.26=1-.74
- What do the preceding results suggest?
Compare the probabilities found in the first two parts.
Spent the money=.74
Kept the money=.26
Since .74 > .26, P(spent the money | four quarters) is the greater probability.
The accompanying table shows the results from a test for a certain disease. Find the probability of selecting a subject with a negative test result, given that the subject has the disease. What would be an unfavorable consequence for this error?
Has Disease Does Not Have Disease
Positive test result: 357 26
Negative test result: 18 1150
A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred. P(B|A) denotes the conditional probability of event B occurring, given that event A has already occurred.
A is the event which is known to have occurred. The given event is “the individual has the disease”.
B is the event for which the probability is sought. The event is “the individual tests negative for the disease”.
The conditional probability of B given A can be found by assuming that event A has occurred and, working under that assumption, calculating the probability that event B will occur.
First, determine the number of individuals who have the disease. Add all the values in the indicated column.
357+18=375
From the table, there are 18 individuals who have the disease and test negative. Divide to find the probability
18/375=.048
Therefore, the probability that a randomly selected individual who has the disease tests negative is .048
To determine an unfavorable consequence of this error, consider a subject that has the disease but with a negative test result.
Note that a negative test result would lead the subject to believe that they do not have the disease, which could delay needed treatment.
The table below displays results from experiments with polygraph instruments. Find the positive predictive value for the test. That is, find the probability that the subject lied, given that the test yields a positive result.
Did Not Lie Lied
Positive test result: 9 46
Negative test result: 30 13
Use the intuitive approach to conditional probability. The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur. Find the probability of selecting a subject who lied, given that the selected subject had a positive test result. If it is assumed that the subject had a positive test result, then only the 9+46=55 subjects in the top row of the table are to be used. Among those 55 subjects, 46 actually lied.
Divide the number of subjects who had a positive test result and actually lied by the total number of subjects who had a positive test result to find the probability, rounding to three decimal places.
46/55=.836
Assume that there is a 12% rate of disk drive failure in a year.
- If all your computer data is stored on a hard disk drive with a copy stored on a second hard disk drive, what is the probability that during a year, you can avoid catastrophe with at least one working drive?
- If copies of all your computer data are stored on three independent hard disk drives, what is the probability that during a year, you can avoid catastrophe with at least one working drive?
- Use the rule of complements shown below to find the probability that you can avoid catastrophe. Let A=at least one hard drive works correctly
\[P(A)=1-P(\bar{A})\]
Identify the event that is the complement of A
\[\bar{A} = \text{both hard drives fail}\]
Since the two hard drives operate separately, their failures are independent events. Use the multiplication rule for independent events to find the probability of the complement of event A. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. The probability of any one of the hard drives failing to work correctly is 0.12
\[ P(\bar{A}) = .12*.12 = .0144 \]
Now find P(A) by evaluating \(1-P(\bar{A})\)
1-.0144 = .9856
- Again, let A = at least one hard drive works correctly.
\[P(\bar{A}) = .12 * .12 * .12 = .001728 \]
Now find P(A) by evaluating \(1-P(\bar{A})\)
1-.001728 = .998272
Beginning Probability
Random Phenomena
A random phenomenon is a situation in which we know what outcomes can possibly occur, but we don’t know which outcome will happen. In general, each occasion upon which we observe a random phenomenon is called a trial. At each trial, we note the value of the random phenomenon, and call that the trial’s outcome. When we combine outcomes, the resulting combination is an event. We call the collection of all possible outcomes the sample space. We will denote the sample space as S.
The law of large numbers says that as we repeat a random process over and over, the proportion of times that an event occurs does settle down to one number. We call this number the probability of an event. But the law of large numbers requires two key assumptions. First, the random phenomenon we are studying must not change; the outcomes must have the same probabilities for each trial. The events must also be independent. Informally, independence means that the outcome of one trial does not affect the outcomes of the others. The law of large numbers says that as the number of independent trials increases, the long-run relative frequency of repeated events gets closer and closer to a single value.
Because the law of large numbers guarantees that relative frequencies settle down in the long run, we can give a name to the value that they approach. We call it the probability of the event. Because this definition is based on repeatedly observing the event’s outcome, this definition is often called empirical probability.
Even though the law of large numbers seems natural, it is often misunderstood because the idea of the long-run is hard to grasp. Many people believe that an outcome of a random event that hasn’t occurred in many trials is due to occur. We know that in the long-run, the relative frequency will settle down to the probability of that outcome.
Example 1
You have just flipped a fair coin and seen six heads in a row.
Does the coin owe you some tails? Suppose you spend that coin and your friend gets it in exchange. When she starts flipping the coin, should she expect a run of tails?
Of course not. Each flip is a new event. The coin cannot remember what it did in the past, so it cannot owe any particular outcomes in the future.
The lesson of the law of large numbers is that sequences of random events do not compensate in the short run and do not need to do so to get back to the right long-run probability. If the probability of an outcome does not change and events are independent, the probability of any outcome in another trial is always what it was, no matter what has happened in other trials.
Modeling Probability
Probability was first studied extensively by a group of French mathematicians who were interested in games of chance. Rather than experiment with the games, they developed mathematical models. When the probability comes from a mathematical model and not from observation, it is called theoretical probability. To make things simple, they started by looking at games in which the different outcomes were equally likely.
It is easy to find probabilities for events that are made up of several equally likely outcomes. We just count all the outcomes that the event contains. The probability of the event is the number of outcomes in the event divided by the total number of possible outcomes.
\[P(A) = \frac{\text{number of outcomes in } A}{\text{total number of possible outcomes}}\]
For example, the probability of drawing a face card from a deck is:
\[P(\text{face card}) = \frac{\text{face cards}}{\text{cards}} = \frac{12}{52} = \frac{3}{13}\]
Formal Probability
If the probability is 0, the event never occurs, and likewise if it has probability 1, it always occurs. Even if you think an event is very unlikely, its probability can’t be negative, and even if you are sure it will happen, its probability can’t be greater than 1.
We have been careful to discuss probabilities only for situations in which the outcomes were finite, or even countably infinite. But if the outcomes can take on any numerical value at all, we say they are continuous.
If a random phenomenon has only one possible outcome, it is not very interesting. So, we need to distribute the probabilities among all the outcomes a trial can have. When we assign probabilities to these outcomes, the first thing to be sure of is that we distribute all of the available probability. If we look at all the events in the entire sample space, the probability of that collection of events has to be 1. So the probability of the entire sample space is 1. Making this more formal gives the Probability Assignment Rule: The set of all possible outcomes of a trial must have probability 1.
Suppose the probability that you get to class on time is 0.8. What’s the probability that you do not get to class on time? It is 0.2. The set of outcomes that are not in the event A is called the complement of A. This leads to the Complement Rule: The probability of an event not occurring is 1 minus the probability that it does occur.
Example 2
If P(green) = 0.35, what is the probability the light is not green when you get to your destination?
Not green is the complement of green, so P(not green) = 1 - P(green) = 1-.35=0.65
There is a 65% chance I will not have a green light.
Suppose the probability that a randomly selected student is a sophomore is 0.20, and the probability that they are a junior is 0.30. What is the probability that the student is either a sophomore or a junior, written P(A or B)? If you guessed 0.50, you have deduced the Addition Rule, which says that you can add the probabilities of events that are disjoint. To see whether two events are disjoint, we take them apart into their component outcomes and check whether they have any outcomes in common. Disjoint events have no outcomes in common. The Addition Rule states: For two disjoint events A and B, the probability that one or the other occurs is the sum of the probabilities of the two events.
P(A or B) = P(A) + P(B), provided that A and B are disjoint.
Example 3
Suppose we find out that P(yellow) is 0.04. What is the probability that the light is red?
The light must be red, green, or yellow, so if we can figure out the probability that the light is green or yellow, we can use the complement rule to find the probability that it is red. To find the probability that the light is green or yellow, I can use the Addition Rule because these are disjoint events: The light can’t be both green and yellow at the same time.
P(green or yellow) = .35 + .04 = .39
Red is the only remaining alternative, and the probabilities must add up to 1, so:
P(red) = P(not green or yellow) = 1-P(green or yellow) = 1-.39=.61
The addition rule can be extended to any number of disjoint events, and that is helpful for checking probability assignments. Because individual sample space outcomes are always disjoint, we have an easy way to check whether the probabilities we have assigned to the possible outcomes are legitimate. The Probability Assignment Rule tells us that to be a legitimate assignment of probabilities, the sum of the probabilities of all possible outcomes must be exactly 1. No more, no less. For example, if we were told that the probabilities of selecting at random a freshman, sophomore, junior, or senior from all the undergraduates at a school were .25,.23,.22, .20, respectively, we would know that something was wrong. These probabilities add only to .90, so this is not a legitimate probability assignment. Either a value is wrong or we just missed some possible outcomes.
Suppose your job requires you to fly from Atlanta to Houston every Monday morning. The airline’s website reports that this flight is on time 85% of the time. What is the chance it will be on time two weeks in a row? That is the same as asking for the probability that your flight is on time this week and it is on time again next week. For independent events, the answer is simple. Remember that independence means that the outcome of one event does not influence the outcome of the other. What happens with your flight this week does not influence whether it will be on time next week, so it is reasonable to assume that those events are independent.
The Multiplication Rule says that for independent events, to find the probability that both events occur, we just multiply the probabilities together. This rule can be extended to more than two independent events. What is the chance of your flight being on time for a month? We can multiply the probabilities of it happening each week:
.85*.85*.85*.85 = .522
Of course, to calculate this probability, we have used the assumption that the four events are independent. Many statistics methods require an Independence Assumption, but assuming independence does not make it true. Always think about whether that assumption is reasonable before using the Multiplication Rule.
Example 4
We have determined that the probability that we encounter a green light at the corner is .35, a yellow light .04, and a red light .61. Let us think about how many times during your morning commute in the week ahead you might hit a red light there.
What is the probability you find the light red on both Monday and Tuesday?
Because the color of the light I see on Monday does not influence the color I will see on Tuesday, these are independent events. I can use the Multiplication Rule:
P(red Monday and Tuesday) = P(red) * P(red) = .61 * .61 = .3721
There is about a 37% chance I will hit red lights both Monday and Tuesday mornings.
What is the probability you do not encounter a red light until Wednesday?
For that to happen, I would have to see green or yellow on Monday, green or yellow on Tuesday, and then red on Wednesday. I can simplify this by thinking of it as not red on Monday, not red on Tuesday, and then red on Wednesday.
P(not red) = 1-P(red) = 1-.61 = .39, so:
P(not red Monday and Tuesday) = P(not red) * P(not red) * P(red)
=.39 * .39 * .61 = .092781
There is about a 9% chance that this week I will hit my first red light on Wednesday morning
What is the probability that you will have to stop at least once during the week?
Having to stop at least once means that I have to stop for the light 1,2,3,4, or 5 times next week. It is easier to think about the complement, never having to stop at a red light. Having to stop at least once means that I did not make it through the week with no red lights.
P(having to stop at light at least once in 5 days)
=1-P(no red lights for 5 days in a row)
=1-P(not red and not red and not red and not red and not red)
\[=1-(.39)^5\]
=1-.0090 = .991
I am not likely to make it through the intersection without having to stop sometime this week.
Note that the phrase at least is often a tip off to think about the complement. Something that happens at least once does happen. Happening at least once is the complement of not happening at all, and that is easier to find.
Example 5
What is the probability that a Japanese M&M’s survey respondent selected at random chose either pink or teal?
Plan and decide which rules to use and check the conditions they require.
The events pink and teal are disjoint because one respondent can’t choose both. We can apply the Addition Rule.
Show your work.
P(pink or teal) = P(pink) + P(teal) = .38 + .36 = .74
Interpret your results in the proper context.
The probability that the respondent chose either pink or teal is .74 or 74%
If we pick two respondents at random, what is the probability that they both said purple?
The word both suggests we want P(A and B), which calls for the Multiplication Rule. Think about the assumption.
Independence Assumption: The choice made by one respondent does not affect the choice of the other, so the events are independent. I can use the Multiplication Rule.
Show your work. For both respondents to choose purple, each one has to choose purple.
P(both purple) = P(first purple and second purple)
= P(first purple) * P(second purple) = .16 * .16 = .0256
Interpret your results in the proper context.
The probability that both chose purple is .0256
If we pick three respondents at random, what is the probability that at least one chose purple?
The phrase, at least, often flags a question best answered by looking at the complement, and that is the best approach here. The complement of at least one preferred purple is none of them preferred purple. Think about the assumption.
P(at least one purple) = 1 - P(none purple)
P(none purple) = P(not purple and not purple and not purple)
Independence Assumption: These are independent events because they are choices by three random respondents. I can use the Multiplication Rule.
We calculate P(none purple) by using the Multiplication Rule.
P(none purple) = P(first not purple) * P(second not purple) * P(third not purple)
=P(not purple)^3
P(not purple) = 1 - P(purple) = 1 - .16 = .84
So, P(none purple) = .84^3 = .5927
Then we can use the Complement Rule to get the probability we want.
P(at least 1 purple) = 1 - P(none purple) = 1 - .5927 = .4073
Interpret your results in the proper context
There is about a 40.7% chance that at least one of the respondents chose purple.
Beware of probabilities that don’t add up to 1. To be a legitimate probability assignment, the sum of the probabilities for all possible outcomes must total 1. If the sum is less than 1, you may need to add another category and assign the remaining probability to that outcome. If the sum is more than 1, check that the outcomes are disjoint. If they are not, then you cannot assign probabilities by just counting relative frequencies.
Do not add probabilities of events if they are not disjoint. Events must be disjoint to use the Addition Rule. The probability of being younger than 80 or a female is not the probability of being younger than 80 plus the probability of being female. That sum may be more than 1.
Do not multiply probabilities of events if they are not independent. The probability of selecting a student at random who is over 6’10” tall and on the basketball team is not the probability the student is over 6’10” tall times the probability he is on the basketball team. Knowing that the student is over 6’10” changes the probability of his being on the basketball team. You cannot multiply these probabilities. The multiplication of probabilities of events that are not independent is one of the most common errors people make in dealing with probabilities.
Do not confuse disjoint and independent. Disjoint events cannot be independent. If A = {you get an A in this class} and B = {you get a B in this class}, A and B are disjoint. Are they independent? If you find out that A is true, does that change the probability of B? Yes it does. So they cannot be independent.
Random Phenomenon
A phenomenon is random if we know what outcomes could happen, but not which particular values will happen.
Trial
A single attempt or realization of a particular phenomenon
Outcome
The value measured, observed, or reported for an individual instance of a trial.
Event
A collection of outcomes. Usually, we identify events so that we can attach probabilities to them. We denote events with bold capital letters such as A, B, or C.
Sample Space
The collection of all possible outcome values. The collection of values in the sample space has a probability of 1. We denote the sample space with a boldface capital S.
Law of Large Numbers
This law states that the long-run relative frequency of an event’s occurrence gets closer and closer to the true relative frequency as the number of trials increases.
Independence
Two events are independent if learning that one event occurs does not change the probability that the other event occurs.
Probability
The probability of an event is a number between 0 and 1 that reports the likelihood of that event’s occurrence. We write P(A) for the probability of the event A.
Empirical Probability
When the probability comes from the long-run relative frequency of the event’s occurrence, it is an empirical probability.
Theoretical Probability
When the probability comes from a model, it is theoretical probability.
Personal Probability
When the probability is subjective and represents your personal degree of belief, it is a personal probability.
Probability Assignment Rule
The probability of an entire sample space must be 1. P(S) = 1.
Complement Rule
The probability of an event not occurring is 1 minus the probability that it does occur.
Addition Rule
If A and B are disjoint events, then the probability of A or B is P(A or B) = P(A) + P(B)
Disjoint
Two events are disjoint if they share no outcomes in common. If A and B are disjoint, then knowing that A occurs tells us that B cannot occur. Disjoint events are called mutually exclusive.
Legitimate Assignment of Probabilities
An assignment of probabilities to outcomes is legitimate if each probability is between 0 and 1 and the sum of the probabilities is 1.
Multiplication Rule
If A and B are independent events, then the probability of A and B is P(A and B) = P(A) * P(B)
Question 1
In a dresser are three blue shirts, four red shirts, and nine black shirts.
What is the probability of randomly selecting a red shirt?
4/16=.25
What is the probability that a randomly selected shirt is not black?
Not black means the blue and red shirts: 3 + 4 = 7.
7/16=.4375
Question 2
A recent study conducted by a health statistics center found that 23% of households in a certain country had no landline service. This raises concerns about the accuracy of certain surveys, as they depend on random-digit dialing to households via landlines. Pick three households from this country at random.
What is the probability that all three of them have a landline?
The Multiplication Rule says that for independent events, to find the probability that all of the events occur, multiply their probabilities together.
.77 * .77 * .77 = .457
What is the probability that at least one of them does not have a landline?
The Complement Rule says that the probability of an event not occurring is 1 minus the probability that it does occur.
1 - .457 = .543
What is the probability that at least one of them does have a landline?
The complement is that none of the three has a landline: .23 * .23 * .23 = .012. So 1 - .012 = .988
Question 3
For each of the following, list the sample space and tell whether you think the events are equally likely.
A sample space is the collection of all possible outcome values. The collection of values in the sample space has a probability of 1.
Roll two dice, record the sum of the numbers. = {2,3,4,5,6,7,8,9,10,11,12}
The events are not equally likely
A family has 3 children, record each child’s sex in order of birth.
= {bbb, bbg, bgb, bgg, gbb, gbg, ggb, ggg}
The events are equally likely
Toss four coins and record the number of tails.
= {0,1,2,3,4}
The events are not equally likely
Toss a coin 10 times and record the length of the longest run of heads.
= {0,1,2,3,4,5,6,7,8,9,10}
The events are not equally likely
Question 4
The plastic arrow on a spinner for a child’s game stops rotating to point at a color that will determine what happens next. Are the given probability assignments possible?
Each probability is between 0 and 1 and the sum of the probabilities is 1, is possible
Each probability is between 0 and 1 and the sum of the probabilities is 1, this is possible
The sum of the probabilities is greater than 1, this is not possible
Each probability is between 0 and 1 and the sum of the probabilities is 1, this is possible
At least one probability is not between 0 and 1, this is not possible
Probability Distributions
Basic Concepts
A random variable is a variable that has a single numeric value, determined by chance, for each outcome of a procedure.
A probability distribution is a description that gives the probability for each value of the random variable. It is often expressed in the format of a table, formula, or graph.
A discrete random variable has a collection of values that is finite or countable. If there are infinitely many values, the number of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting heads.
A continuous random variable has infinitely many values, and the collection of values is not countable. That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as body temperatures.
Probability Distribution Requirements
Every probability distribution must satisfy each of the following three requirements.
- There is a numerical random variable, and its number values are associated with corresponding probabilities.
- \(\Sigma P(x)=1\) where x assumes all possible values.
- \(0 \leq P(x) \leq 1\) for every individual value of the random variable x. That is, each probability value must be between 0 and 1 inclusive.
The second requirement comes from the simple fact that the random variable x represents all possible events in the entire sample space, so we are certain that one of the events will occur. The third requirement comes from the basic principle that any probability value must be 0 or 1 or a value between 0 and 1.
For example, consider a random variable x with three possible values whose probabilities are .25, .50, and .25. This x is a random variable because its numerical values depend on chance. The variable x is a numerical random variable, and its values are associated with probabilities. \(\Sigma P(x) = .25 + .50 + .25 = 1\). Each value of P(x) is between 0 and 1. The random variable x is a discrete random variable, because it has three possible values and three is a finite number.
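The numeric requirements are easy to check mechanically; a minimal Python sketch (the tolerance argument allows for rounding in published tables):

```python
def is_probability_distribution(probs, tol=1e-6):
    """Check that each P(x) is between 0 and 1 and that they sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in probs)
    sums_to_one = abs(sum(probs) - 1) <= tol
    return each_in_range and sums_to_one

print(is_probability_distribution([0.25, 0.50, 0.25]))            # True
print(is_probability_distribution([0.001, 0.009, 0.034, 0.056]))  # False
```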
Notation for 0+
In tables of binomial probabilities, we recommend using 0+ to represent a probability value that is positive but very small, such as .0000000123. When rounding a probability value for inclusion in such a table, rounding to 0 would be misleading because it would incorrectly suggest the event is impossible.
Probability Histogram
There are various ways to graph a probability distribution, but for now we will consider only the probability histogram.
Parameters of a Probability Distribution
Remember that with a probability distribution, we have a description of a population instead of a sample, so the values of the mean, standard deviation, and variance are parameters, not statistics. The mean, variance, and standard deviation of a discrete probability distribution can be found with the following formulas.
This is the mean for a probability distribution:
\[ \mu = \sum [x * P(x)] \]
Variance for a probability distribution, in a form that is easier to understand:
\[\sigma^2 = \Sigma[(x - \mu)^2 * P(x)]\]
Variance for probability distribution that is good for manual calculations:
\[\sigma^2 = \Sigma[x^2*P(x)] - \mu^2 \]
Standard deviation for probability distribution:
\[\sigma = \sqrt{\Sigma[x^2*P(x)] - \mu^2}\]
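A short Python sketch applying these formulas to a distribution given as lists of values and probabilities; the example data is the three-births table from Example 1 below:

```python
import math

def distribution_stats(xs, probs):
    """Mean, variance, and standard deviation of a discrete distribution."""
    mean = sum(x * p for x, p in zip(xs, probs))
    variance = sum(x ** 2 * p for x, p in zip(xs, probs)) - mean ** 2
    return mean, variance, math.sqrt(variance)

xs, probs = [0, 1, 2, 3], [0.125, 0.375, 0.375, 0.125]
mu, var, sigma = distribution_stats(xs, probs)
print(mu, round(sigma, 3))   # 1.5 0.866
```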
Expected Value
The mean of a discrete random variable is the theoretical mean outcome for infinitely many trials. We can think of that mean as the expected value in the sense that it is the average value that we would expect to get if the trials could continue indefinitely.
The expected value of a discrete random variable is denoted by E, and it is the mean value of the outcomes, so \(E=\mu\), and E can also be found by evaluating \(\Sigma[x*P(x)]\).
An expected value need not be a whole number, even if the different possible values of x might all be whole numbers. The expected number of girls in five births is 2.5, even though five particular children can never result in 2.5 girls. If we were to survey many couples with 5 children, we expect that the mean number of girls will be 2.5.
Making Sense of Significant Values
We present the following two different approaches for determining whether a value of a random variable is significantly low or high.
Range Rule of Thumb
The range rule of thumb may be helpful in interpreting the value of a standard deviation. According to the range rule of thumb, the vast majority of values should lie within 2 standard deviations of the mean, so we can consider a value to be significant if it is at least 2 standard deviations away from the mean. We can identify significant values as follows:
- Significantly low values are \(\mu-2\sigma\) or lower
- Significantly high values are \(\mu+2\sigma\) or higher
- Values not significant are between the previous two conditions
Know that the use of the number 2 in the range rule of thumb is somewhat arbitrary and this is a guideline, not an absolutely rigid rule.
Identifying Significant Results With Probabilities
X successes among n trials is a significantly high number of successes if the probability of x or more successes is .05 or less. That is, x is a significantly high number of successes if \(P(x \text{ or more}) \leq .05\).
X successes among n trials is a significantly low number of successes if the probability of x or fewer successes is .05 or less. That is, x is a significantly low number of successes if \(P(x \text{ or fewer}) \leq .05\).
The Rare Event Rule For Inferential Statistics
If, under a given assumption, the probability of a particular outcome is very small and the outcome occurs significantly less than or significantly greater than what we expect with that assumption, we conclude that the assumption is probably not correct.
For example, if testing the assumption that boys and girls are equally likely, the outcome of 20 girls in 100 births is significantly low and would be a basis for rejecting that assumption.
Expected Value and Rationale for Formulas
Earlier we noted that the expected value of a random variable is equal to the mean. We can therefore find the expected value by computing \(\Sigma[x*P(x)]\), just as we do for finding the value of \(\mu\). We also noted that the concept of expected value is used in decision theory.
Rationale for Earlier Formulas
Instead of blindly accepting and using formulas, it is much better to have some understanding of why they work. When computing the mean from a frequency distribution, f represents class frequency and N represents population size. In the expression that follows, we rewrite the formula for the mean of a frequency distribution so that it applies to a population. In the fraction f/N, the value of f is the frequency with which the value x occurs and N is the population size, so f/N is the probability for the value of x. When we replace f/N with P(x), we make the transition from relative frequency based on a limited number of observations to probability based on infinitely many trials.
Example 1
The table below lists probabilities for the corresponding numbers of girls in three births. What is the random variable, what are its possible values, and are its values numerical?
Girls(x) P(x)
0 0.125
1 0.375
2 0.375
3 0.125
The random variable is x, which is the number of girls in three births. The possible values of x are 0, 1, 2, and 3. The values of the random variable x are numerical.
Example 2
Is the random variable given in the accompanying table discrete or continuous?
Girls(x) P(x)
0 0.063
1 0.250
2 0.375
3 0.250
4 0.063
The random variable given in the accompanying table is discrete because there are a finite number of values.
Example 3
For 100 births, P(exactly 56 girls)=0.0390 and P(56 or more girls)=0.136. Is 56 girls in 100 births a significantly high number of girls? Which probability is relevant to answering that question? Consider a number of girls to be significantly high if the appropriate probability is 0.05 or less.
The relevant probability is P(56 or more girls), so 56 girls in 100 births is not a significantly high number of girls because the relevant probability is greater than 0.05.
Example 4
Five males with an x-linked genetic disorder have one child each. The random variable x is the number of children among the five who inherit the x-linked genetic disorder. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If a probability distribution is not given, identify the requirements that are not satisfied.
X P(x)
0 0.024
1 0.167
2 0.309
3 0.309
4 0.167
5 0.024
The random variable x is numerical because x takes on the integer values from 0 to 5.
The number values are associated with probabilities because each value of x has a corresponding value of P(x) in the next column of the table.
The mean for a probability distribution is given by the formula below.
\[\mu = \Sigma[x*P(x)]\]
Find each product of x and P(x)
0+.167+.618+.927+.668+.12=2.5
\[\mu=2.5\]
The standard deviation for a probability distribution is given by the formula below.
\[\sigma=\sqrt{\Sigma[x^2*P(x)]-\mu^2}\]
Create another table for the new values
X^2 X^2*P(x)
0 0
1 .167
4 1.236
9 2.781
16 2.672
25 .6
Sum = 7.456
Substitute into formula
\[\sqrt{7.456-2.5^2}= 1.1\]
Example 5
When conducting research on color blindness in males, a researcher forms random groups with five males in each group. The random variable x is the number of males in the group who have a form of color blindness. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If not, state why.
X P(x)
0 .657
1 .284
2 .053
3 .005
4 .001
5 .000
The table is a probability distribution: each P(x) is between 0 and 1, and the probabilities sum to 1.
Find the mean of the random variable x
0+.284+.106+.015+.004+0=.409
Find the standard deviation of the random variable x
0+(1^2*.284)+(2^2*.053)+(3^2*.005)+(4^2*.001)+(5^2*0)=.557
\[\sqrt{.557-.409^2}\]=.6243
Example 6
Look at the next table. Determine whether a probability distribution is given. If it is, find the mean and standard deviation. If not, state why.
X P(x)
0 .001
1 .009
2 .034
3 .056
Does the table show a probability distribution?
No, the sum of all the probabilities is not equal to 1
Example 7
Look at the following table.
X P(x)
0 .094
1 .347
2 .395
3 .164
Does the table show a probability distribution?
Yes, the table shows a probability distribution
Find the mean of the random variable x
(0)+(.347)+(2*.395)+(3*.164)=1.629
Find the standard deviation of x
0+.347+(4*.395)+(9*.164)=3.403
\[\sqrt{3.403-1.629^2}=.8656\]
Example 8
Look at the following table
X P(x)
0 .365
1 .431
2 .178
3 .026
Does the table show a probability distribution?
Yes, the table shows a probability distribution
Find the mean of the random variable x
0+.431+(2*.178)+(3*.026)=.865
Find the standard deviation of x
0+.431+(4*.178)+(9*.026)=1.377
\[\sqrt{1.377-.865^2}=.7929\]
Example 9
Look at the table below
X P(x)
0 .002
1 .035
2 .111
3 .221
4 .272
5 .211
6 .116
7 .027
8 .005
Find the mean
0+.035+(2*.111)+(3*.221)+(4*.272)+(5*.211)+(6*.116)+(7*.027)+(8*.005)=3.988
Find the standard deviation
0+.035+(2^2*.111)+(3^2*.221)+(4^2*.272)+(5^2*.211)+(6^2*.116)+(7^2*.027)+(8^2*.005)=17.914
\[\sqrt{17.914-3.988^2}=1.4\]
Example 10
The following table describes results from groups of 10 births from 10 different sets of parents. The random variable x represents the number of girls among 10 children. Use the range rule of thumb to determine whether 1 girl in 10 births is a significantly low number of girls.
X P(x)
0 .005
1 .010
2 .046
3 .113
4 .194
5 .241
6 .211
7 .111
8 .039
9 .020
10 .010
The range rule of thumb for identifying significant values is shown below.
Significantly low values are \(\mu-2\sigma\) or lower
Significantly high values are \(\mu+2\sigma\) or higher
Values between these are not significant
To find the range of values that are not significant, first find the mean and standard deviation
Let us start with the mean
0+.010+.092+.339+.776+1.205+1.266+.777+.312+.180+.100=5.057
Now find the standard deviation
0+.010+(4*.046)+(9*.113)+(16*.194)+(25*.241)+(36*.211)+(49*.111)+(64*.039)+(81*.020)+(100*.010)=28.491
\[\sqrt{28.491-5.057^2}=1.708\]
Now find the max range of values that are not significant
Max value = \(\mu+2\sigma\)
5.1+2*1.7=8.5
Now find the minimum range of values that are not significant
Min value = \(\mu-2\sigma\)
5.1-2*1.7=1.7
Since 1 girl is below the minimum value of 1.7, 1 girl in 10 births is a significantly low number of girls.
Univariate Data
The population is the entire group of individuals or things that we are interested in. The sample is the part of the population that is actively studied.
When examining a graphical summary of a distribution of univariate data, describe the center, spread, and shape. In addition, you should also note any clustering of data, any gaps in the data, and any outliers. If possible, try to provide explanations for such features. Be sure to write your descriptions within the context of the problem.
Continuous Variables
Numerical, tabular, and graphical methods complement one another. Numerical methods are precise and can be used in a variety of ways for statistical inference. Graphical methods allow us to view a large amount of data and many relationships at once. Tabular methods allow us to find precise values but are not as good for grasping relationships among variables. The three types of numerical measures are measures of central tendency, measures of variation, and measures of position.
A Greek capital letter \(\Sigma\) is used to indicate the sum of a set of measurements.
Measures of Central Tendency
Measures of central tendency determine the central point of a data set or the point around which all the measurements are scattered. The two main measures of central tendency are the mean and the median.
The arithmetic mean, or average, is the most commonly used measure of the center of a set of data. The mean can be described as a data set’s center of gravity, the point at which the whole group of data balances. Unlike the median, the mean is affected by extreme or outlier measurements. One very large or very small measurement can pull the mean up or down. We say that the mean is not resistant to changes caused by outliers.
The population mean is denoted by the Greek letter \(\mu\). Simply add up all of the values in the entire population and divide by the number of values. The sample mean is generally denoted by an English letter with a bar on top, such as \(\bar{x}\). It is computed the same way as the population mean.
Median
The median is another commonly used measure of central tendency. The median is the point that divides the measurements in half. Half the values are at or below the median and half are at or above the median. The median is not affected by outliers. Therefore, for skewed data sets, it is better to use the median rather than the mean to measure the center of data. The median is resistant to changes caused by outliers.
Note that if the data set contains an odd number of measurements, then the median is the middle value. If there is an even number of measurements it is the mean of the middle two measurements.
If the data set is symmetric use the mean but if it is skewed then use the median value.
Measures of Variation
Measures of variation summarize the spread of a data set. They describe how measurements differ from each other and from their mean. The three most commonly used measures of variation are range, interquartile range, and standard deviation.
The range is the difference between the largest and the smallest measurements in a data set. It is the simplest of the measures of spread. It is very easy to compute and understand, but it is not a reliable measure because it depends only on the two extreme measurements and does not take into account the values of the remaining measurements.
The interquartile range is the range of the middle 50% of the data, the difference between the third quartile and the first quartile. Interquartile range is not affected by outliers. If you choose to measure the center using the median, you should use the interquartile range to measure the spread.
Standard deviation is often a more useful measure of variation than range is. Unlike range, standard deviation takes every measurement into account. However, like range, the standard deviation is affected by outliers. When there are outliers, the interquartile range may be a more useful measurement, similar to how the median is more useful than the mean when outliers are present. The square of the standard deviation is known as the variance.
A lowercase Greek letter \(\sigma\) is used to denote a population standard deviation. So, \(\sigma^2\) denotes a population variance. We square the difference between each point and the mean, add those squares, divide by the number of points, and take the square root. The letter \(s\) is used to denote a sample standard deviation. So \(s^2\) denotes a sample variance.
Note that standard deviation is measured in the same units as are data values, whereas variance is measured in squared units of the data values.
Standard deviation can be used as a unit for measuring the distance between any measurement and the mean of the data set. For example, a measurement can be described as being so many standard deviations above or below the mean.
A standard deviation of 0 indicates that all of the measurements are identical. It is the positive square root of variance. Because variance is a squared quantity, it is always a positive number. A larger standard deviation indicates a larger spread among the measurements. The larger the standard deviation, the wider the graph.
Measures of Position
These measures are used to describe the position of a value with respect to the rest of the values of the data set. Quartiles, percentiles, and standardized scores (z-scores) are the most commonly used measures of position. To compute quartiles and percentiles, but not z-scores, the data must be sorted by value.
Percentiles divide a set of values into 100 equal parts. A 95th percentile means that 95% of the values are at or below this point. So, arrange all values in order; the value in the ith position, counted from the lowest measurement, is then at approximately the \(\left(\frac{i}{n} \times 100\right)\)th percentile.
Quartiles divide a set of values into four equal parts by using the 25th, 50th, and 75th percentiles. Q1 is the 25th percentile: 25% of values are below and 75% of values are above. Q2 is the 50th percentile: 50% of values are below and 50% of values are above. Q3 is the 75th percentile: 75% of values are below and 25% of values are above.
Standardized scores or z-scores are independent of the units in which the data values are measured. Therefore, they are useful when comparing observations measured on different scales. They are computed as:
\[z = \frac{\text{measurement} - \text{mean}}{\text{standard deviation}}\]
A z-score gives the distance between a measurement and the mean in terms of the number of standard deviations. A negative z-score indicates that the measurement is smaller than the mean. A positive z-score indicates that the measurement is larger than the mean.
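For example, in a data set with mean 70 and standard deviation 10, a measurement of 85 has \(z = \frac{85 - 70}{10} = 1.5\); it lies one and a half standard deviations above the mean.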
Graphing Univariate Data
Graphical summary measures are a good way of conveying information, but they are also subject to misinterpretation and can be distorted very easily. Two researchers can take the same data and convey completely different messages just by manipulating the layout of a graph.
Boxplots
A boxplot is a graphical data summary based on measures of position. It is useful for identifying outliers and the general shape of the distribution. A minimal plotting sketch follows the list below.
- Any points below the lower whisker are identified as outliers on the lower end
- Any points above the higher whisker are identified as outliers on the higher end
- The length of the box indicates the IQR or the middle 50% of the data, when the data is arranged in increasing order of value
- The length of the lower whisker shows the spread of the smallest 25% of data, when the data is arranged in increasing order of value
- The length of the first compartment of the box shows the spread of the next smallest 25% of data
- The length of the second compartment of the box shows the spread of the third smallest 25% of data
- The length of the upper whisker shows the spread of the largest 25% of the data
- Compare the lengths of the four parts to compare the respective spread of the data. Use the information about the spread to determine the shape of the distribution
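As a rough illustration, here is a minimal matplotlib sketch, assuming matplotlib is installed; the data values are made up, and the extreme value should appear as an outlier point beyond the whisker.

```python
# A minimal sketch: drawing a boxplot with matplotlib.
import matplotlib.pyplot as plt

data = [4, 7, 7, 8, 9, 10, 12, 14, 15, 40]  # 40 should plot as an outlier

fig, ax = plt.subplots()
ax.boxplot(data, vert=False)  # whiskers, box, median line, and outlier points
ax.set_xlabel("value")
plt.show()
```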
Comparing Distributions
When comparing distributions of two or more groups, use the following criteria.
- Compare the centers of the distributions
- Compare the spreads of the distributions. Consider the differences in the spread of data within each group as well as the differences between groups
- Compare clusters of measures and gaps in measurements
- Compare outliers and any other unusual features
- Compare the shapes of the distributions
- Compare in the context of the question
Exploring Bivariate Data
Bivariate data is data on two different variables collected from each item in a study. We often want to investigate the relationship between two quantitative variables. If two quantitative variables have a linear relation, then we can measure the strength of that relationship with linear regression, a popular and relatively simple method.
There are two commonly used measures to summarize the relation between two variables. These are scatterplots and the correlation coefficient. A scatterplot is used to describe the nature, degree, and direction of the relation between two variables x and y, where each item gives a pair of measurements (x, y).
- Draw an x-axis and a y-axis
- Scale the axes to accommodate the ranges of data for the first and second variable
- For each pair of measurements, mark the point on the graph where the x-value and the y-value meet
A scatterplot can tell us a few things concerning the two variables including the shape, direction, and strength of relationship. A scatterplot tells us whether the nature of the relationship between the two variables is linear or nonlinear. A linear relation is one that can be described well using a straight line. The scatterplot will show whether the y-value increases or decreases as the x increases, or that it changes direction.
- If a scatterplot shows an increasing or upward trend, then it indicates a positive relationship between the two variables.
- If a scatterplot shows a decreasing or downward trend, then it indicates a negative relationship between the two variables.
If the trend of the data can be described with a line or a curve, then the spread of the data values around the line or curve describe the degree or strength of the relationship between the two variables.
- If the data points are close to the line, then it indicates a strong relationship between the two variables
- If a scatterplot has points that are more loosely scattered, then it indicates a weaker relationship between the two variables. If a scatterplot shows points scattered without any apparent pattern, then it indicates no relationship between the two variables.
Numerical Methods For Continuous Data
Correlation coefficients are numerical measures used to judge the relation between two variables. Pearson’s correlation coefficient is a numeric measure of the strength and direction of the linear relation between two quantitative variables. The Pearson’s correlation coefficient between two variables x and y computed from a population is denoted by \(\rho\), whereas the correlation coefficient between two variables computed from a sample is denoted by r.
The positive or negative sign of the correlation coefficient describes the direction of the linear relation between the two variables.
A positive value of the correlation indicates a positive relation between x and y. This means that as x increases, y also increases linearly. For example, the relation between the heights of fathers and the heights of their sons is a positive relation. Taller fathers tend to have taller sons. A scatterplot of such data will show an increasing or upward linear trend.
A negative value of the correlation coefficient indicates a negative relation between x and y. This means that as x increases, y decreases linearly. For example, the relation between the weight of a car and its gas mileage is a negative relation. Heavier cars tend to get lower gas mileage. A scatterplot of such data will show a decreasing or downward linear trend.
The numeric value of the correlation coefficient describes the strength of the linear relation between the two variables.
If the value of the correlation coefficient is equal to +1 or -1, then it indicates a perfect correlation between two variables. In this case, all the points in a scatterplot would fall perfectly on the line.
The farther away the correlation coefficient gets from 0, the stronger the relationship between the two variables, and the closer the correlation coefficient is to 0, the weaker the relationship between the two variables. For example, if the correlation coefficient between x and y is -0.86, whereas the correlation coefficient between x and z is 0.75, then x and y have a stronger relation than x and z. Note again that, unlike scatterplots, correlation coefficients do not show the shape of the relationship.
Correlation coefficients are usually computed using a calculator or by reading computer output from statistics programs.
Since the farther an r-value is from zero, the stronger the correlation, it can be challenging to interpret a specific number as weak, strong, or very strong. Statisticians often rely on arbitrary cutoff numbers to distinguish these values.
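For instance, numpy can compute Pearson's r directly; the paired values below are made up for illustration.

```python
# A minimal sketch: Pearson's correlation coefficient with numpy.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

r = np.corrcoef(x, y)[0, 1]  # off-diagonal entry of the 2x2 correlation matrix
print(r)  # close to +1: a strong positive linear relation
```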
Least-Squares Regression
Once we have established that the two variables are related to each other, we are often interested in estimating or quantifying the relation between the two variables. When one variable explains or causes the other or when one is dependent on the other, estimating a linear regression model can be useful. Such an estimate can be useful for predicting the corresponding values of one variable for known values of the other variable.
A linear regression model or linear regression equation is an equation that gives a straight line relationship between two variables. The linear relation between two variables is given by the following equation for the regression line:
\[Y = a + bX\]
- Y is the dependent variable or response variable
- X is the independent variable or explanatory variable
- a is the y-intercept. It is the value of Y when X = 0
- b is the slope of the line. It gives the amount of change in Y for every unit change in X

The predicted value of Y for a given value of X is denoted by \(\hat{y}\). It is computed by using the estimated regression line: \(\hat{y} = a + bx\)
The least squares regression line is a line that minimizes the sum of the squares of the residuals. It is also known as the line of best fit. The line of best fit always passes through the point \((\bar{x}, \bar{y})\), the means of the two variables.
The coefficient of determination measures the percent of variation in Y-values explained by the linear relation between X and Y values. In other words, it measures the percent of variation in Y-values attributable to the variation in X-values. It can be shown that for a linear regression it is equal to the square of the Pearson’s correlation coefficient.
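As a rough sketch, assuming numpy, polyfit can estimate the line of best fit, and squaring r gives the coefficient of determination; the data values are made up.

```python
# A minimal sketch: fitting a least-squares line and checking the fit.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8, 12.3])

b, a = np.polyfit(x, y, 1)   # degree-1 fit returns (slope, intercept)
y_hat = a + b * x            # predicted values from the estimated line
residuals = y - y_hat        # observed minus predicted

r = np.corrcoef(x, y)[0, 1]
print(b, a)
print(r ** 2)                # coefficient of determination for the linear fit
```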
Outliers and Influential Points
As discussed earlier, an outlier is an observation that is surprisingly different from the rest of the data. It is an observation that does not conform to the general trend. An influential observation is an observation that strongly affects a statistic. Some outliers are influential, while others are not. If there is a considerable difference between the correlation coefficients computed with and without a specific observation, then that observation is influential. The same can be said about the line of best fit. If the estimates of the line of best fit change considerably when including or excluding a point, then that point is an influential observation.
Residuals and Residual Plots
When we use a least-squares regression line to make predictions, there is usually a difference between the predicted values and the actual observed values. We can think of this difference between observed and predicted values as error, or a residual value.
A residual plot is a plot of residuals versus the predicted values of Y. This type of plot is used to assess the fit of the model. A residual plot should look random. If the residual plot shows any patterns or trends, it is an indication that the linear model is not appropriate.
Transformation To Achieve Linearity
Always draw a scatterplot of the data to examine the nature of the relation between two variables. You should also examine the fit of the linear model using a residual plot. If either one of these plots indicates that the linear model might not be appropriate for the data, then there are two options available: You can either use nonlinear models or use a transformation to achieve linearity. For example, if the data seem to have a relation of the nature \(Y = aX^b\), then we can take the logarithm of both sides to get \(\ln(Y) = \ln(a) + b\ln(X)\). This gives the equation of a straight line, in other words, a linear relation.
After the variables have been appropriately transformed, we can then use them to make a model. For example, we could take the natural logarithm of all Y-values or the square root of all Y-values and call the transformed variable Z. Then we would fit the model for Z as a function of X. When using a fitted model for predictions, remember to transform the predicted values back to the original scale using a reverse transformation.
The logarithm transformation is used to linearize the regression model when the relationship between Y and X suggests a model with a consistently increasing slope. The square root transformation is used when the spread of observations increases with the mean. The reciprocal transformation is used to minimize the effect of large values of X. The square transformation is used when the slope of the relation consistently decreases as the independent variable increases. The power transformation is used if the relation between dependent and independent variables is modeled by \(Y = aX^b\).
Joint Frequencies of Two-Way Tables
Suppose data is classified by two different criteria. If the classification criterion 1 has r categories and the classification criterion 2 has c categories, then the classification of data would result in a table with r rows and c columns.
A table of data classified by r categories of classification criterion 1 and c categories of classification criterion 2 is known as an r × c contingency table. The row and column totals give the marginal frequencies for the two sets of categories. The marginal frequency is the frequency with which each category occurs.
Conditional Relative Frequencies
The conditional relative frequency is the relative frequency of one category given that the other category has occurred. This frequency is used to determine whether there is an association between the two classification criteria. To measure the degree of relation between two quantitative variables, we use the concept of correlation. On the other hand, to measure the degree of relation between two categorical variables, we use the concept of association.
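Here is a minimal Python sketch, with made-up counts, of computing marginal frequencies and a conditional relative frequency from a 2 × 2 table.

```python
# A minimal sketch: marginal and conditional relative frequencies for a
# 2x2 contingency table (hypothetical counts).
table = {
    ("male",   "agree"): 30, ("male",   "disagree"): 20,
    ("female", "agree"): 35, ("female", "disagree"): 15,
}

# Marginal frequency of each row category.
row_totals = {}
for (row, col), count in table.items():
    row_totals[row] = row_totals.get(row, 0) + count

# Conditional relative frequency of "agree" given each row category.
for row, total in row_totals.items():
    print(row, table[(row, "agree")] / total)  # 0.6 for male, 0.7 for female
```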
Question 1
Describe a situation in which it is better to use the median as a measure of central tendency over the mean.
When data is highly skewed, the values in the tail affect the mean more than they affect the median. Since the median is more stable, it is often a better measure of central tendency when data are skewed.
Question 2
Which of the three main measures of central tendency can be used with categorical data?
The mode can be used with both categorical and continuous variables, or qualitative and quantitative. Since categories are not numeric, we cannot calculate a mean, and if the categories are not ordinal, we cannot put them in order to calculate a median.
Question 3
Describe the strength and direction of the relationship between two variables with a correlation of -0.5.
Since the correlation coefficient is negative, it means that there is an inverse relationship between X and Y. As X goes up, Y will go down. -0.5 indicates a moderately strong relationship, since values close to 0 indicate little relation between two variables, and values close to -1 or 1 indicate nearly perfect relationships.
Question 4
If there is a clear pattern in a residual plot, is a linear relationship appropriate?
No, if a linear relationship is appropriate, we expect to see a residual plot with near to zero correlation between residuals and predicted values. Patterns indicate that the relationship between X and Y is not linear.
Question 5
Which measure of spread is least affected when there are extreme outliers in your data set?
The interquartile range does not include the lowest or highest 25% of data and is therefore less affected by extreme values. Variance, standard deviation, and range do include these extreme values.
Summary
The two major categories of variables are quantitative and categorical. Data can be described using tables, graphs, and numbers. Bar graphs and pie charts are useful methods for depicting categorical data.
There are many graphical methods for describing quantitative data. Small data sets can be depicted using stem plots or dot plots. Larger data sets can be shown as boxplots, frequency charts, and histograms.
Use a scatter plot to organize and display bivariate quantitative data. Use a two-way contingency table to summarize bivariate categorical data. Ideally, data should always be visualized first, as symmetric data and skewed data are summarized using different numerical summaries.
The r-value, or correlation coefficient, is always between -1 and 1. The r-statistic describes the strength of the correlation, with values closer to +1 or -1 being stronger and values closer to 0 being weaker. The r-squared value describes how much of the variation in the y-values can be attributed to changes in the x-values.
There are two major formulas needed to calculate the least-squares regression line. To find the estimated slope, you need the r-statistic, the standard deviation of the x-values, and the standard deviation of the y-values. A residual plot shows whether a linear model is a good fit. If the points on a residual plot are randomly scattered, then a linear model is appropriate. If the points are not random, then a linear model should not be used.
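In symbols, the estimated slope and intercept are:
\[b = r\,\frac{s_y}{s_x} \qquad a = \bar{y} - b\,\bar{x}\]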
Methods Of Data Collection
If you want to draw valid conclusions from a study, you must collect the data according to a well developed plan. This plan must include the question or questions to be answered as well as an appropriate method of data collection and analysis.
First, we need to introduce some terms and concepts.
A population is the entire group of individuals or items that we are interested in.
A frame is a list of all the members from which the sample is to be taken. This is usually the same as the population.
A sample is the part of the population that is actually being examined.
A sample survey is the process of collecting information from a sample. Information obtained from the sample is usually used to make inferences about population parameters.
A census is the process of collecting information from all the units in a
population. It is feasible to do a census if the population is small and the
process of getting information does not destroy or modify units of the
population. A census is often too costly and sometimes too damaging to the
population being studied. We usually have to take samples instead.
Experiments and Observational Studies
An experiment is a planned activity that results in measurements. In an
experiment, the experimenter creates differences in the variables involved in
the study and then observes the effects of such differences on the resulting
measurements.
An observational study is an activity in which the experimenter observes the relationships among variables rather than creating them. Experiments have some advantages, but unfortunately, it is sometimes impossible or unethical to conduct an experiment. In those cases we must use an observational study.
One of the problems with observational studies is that their results often cannot be generalized to a population, because many observational studies use samples that are not representative of the population of interest. These samples might simply be the easiest to obtain. Another problem is that of confounding factors. These occur when the two variables of interest are related to a third variable instead of just to each other.
Planning and Conducting Surveys
There are many methods of getting a sample from the population. Some sampling
methods are better than others. Biased sampling methods result in values that
are systematically different from the population values or systematically favor
certain outcomes.
Judgmental sampling, samples of convenience, and volunteer samples are some of the methods that generally result in biased outcomes. Sampling methods that are based on a probabilistic selection of samples, such as simple random sampling, generally result in unbiased outcomes.
Judgmental sampling makes use of a nonrandom approach to determine which item
of the population is to be selected in the sample. The approach is entirely
based on the judgement of the person selecting the sample.
Using a sample of convenience is another method that can result in biased
outcomes. Samples of convenience are easy to obtain.
Volunteer samples, in which the subjects choose to be part of the sample, may
also result in biased outcomes.
Simple Random Sampling
Simple random sampling is a process of obtaining a sample from a population in
which each member has an equal chance of being selected. In this type of sample,
there is no bias or preference for one individual over another. Simple random
samples, also known as random samples, are obtained in two different ways:
sampling with replacement from a finite population and sampling without
replacement from an infinite population.
To select a simple random sample from a population, we need to use some kind of
chance mechanism.
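For example, Python's random module can serve as the chance mechanism; the frame of 500 numbered units below is hypothetical.

```python
# A minimal sketch: drawing a simple random sample without replacement.
import random

frame = list(range(1, 501))          # hypothetical frame of 500 numbered units
sample = random.sample(frame, k=25)  # each unit has an equal chance of selection
print(sample)
```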
Besides simple random sampling, there are other sampling procedures that make
use of a random phenomenon to get a sample from a population. In a systematic
sampling procedure, the first item is selected at random from the first k items
in the frame, and then every kth item is included in the sample. This method is
popular among biologists, foresters, environmentalists, and marine scientists.
In stratified random sampling, the population is divided into groups called
strata, and a simple random sample is selected from each stratum. Strata are
homogeneous groups of population units. The units in a given stratum are similar
in some characteristics, whereas those in different strata differ in those
characteristics.
In proportional sampling, the population is divided into groups called strata,
and a simple random sample of size proportional to the stratum size is selected
from each stratum. Proportional sampling is the preferred method of stratified
sampling.
In cluster sampling, a population is divided into existing, non-homogeneous
groups called clusters. A simple random sample of clusters is obtained, and all
individuals within the selected clusters are included in the sample. In order to
safely use cluster sampling, each cluster should be representative of the
population as a whole. Cluster sampling is often used to reduce the cost of
obtaining a sample, especially when the population is large.
Bias In Surveys
For a survey to produce reliable results, it must be properly designed and conducted. The sample should be selected using a proper randomization technique. A nonrandom selection will limit the generalizability of the results. Furthermore,
interviewers should be trained in proper interviewing techniques. The attitude
and behavior of the interviewer should not lead to any specific answers, because
this would result in a biased outcome. Questions should be carefully worded, as
the wording of a question can affect the response, and leading questions should
be avoided.
A survey is biased if it systematically favors certain outcomes. Response bias
occurs when a respondent provides an answer that is either factually wrong or
does not accurately reflect his or her true opinion.
Non-response bias may occur if the person selected for an interview cannot be
contacted or refuses to answer. If such individuals are different, as a group,
from those who are eventually interviewed, then the results may not accurately
reflect the whole population.
Under-coverage bias may occur if part of the population is left out of the
selection process.
Wording effect bias may occur if confusing or leading questions are asked.
Planning Experiments
A dependent or response variable is the variable to be measured in the
experiment. An independent or explanatory variable is a variable that may
explain the differences in responses. We are interested in studying the
effects of independent variables on dependent variables.
An experimental unit is the smallest unit of the population to which a
treatment is applied. A confounding variable is a variable whose effect on the
response cannot be separated from the effect of the explanatory variable. In
properly constructed experiments, an experimenter tries to control confounding
variables. Confounding can be an even more serious problem in observational
studies because the experimenter has no control over the confounding variables.
A factor is a variable whose effect on the response is of interest in the experiment. Factors are of two types: qualitative and quantitative. Qualitative factors take non-numerical, categorical values; quantitative factors are measured numerically.
Levels are the values of a factor used in the experiment. An experiment can have
one or more factors. The number of levels used in the experiment may differ from
factor to factor. Treatments are the factor-level combinations used in the
experiment. If the experiment has only one factor, then all the levels of that
factor are considered treatments of the experiment.
A control group is a group of experimental units similar to all the other
experimental units except that it is not given any treatment. A control group is
used to establish the baseline response expected from experimental units if no
treatment is given.
A placebo group is a control group that receives a treatment that looks and feels similar to an experimental treatment but is expected to have no effect. A placebo is a medicine that looks exactly like the real medicine but does not contain any active ingredients.
Single and Double Blind Experiments
It is possible that measurements or subject interactions will be biased if the person taking the measurements knows whether a person received a placebo or not. The blinding technique is used in medical experiments to prevent such a bias. Blinding can be used in two different fashions: single blinding and double blinding. In a single blind experiment, either the
person does not know which treatment he or she is receiving or the person
measuring the patient's reaction does not know which treatment was given. In
a double blind experiment, both the patient and the person measuring the
patient's reaction do not know which treatment the patient was given.
Double blind experiments are preferred, but in certain situations they simply
cannot be conducted.
Randomization
The technique of randomization is used to average the effects of extraneous
factors on responses. In other words, it balances the effects of factors you
cannot see.
If each experimental unit is supposed to receive only one treatment, then which
experimental unit receives which treatment should be determined randomly. If
each experimental unit is supposed to receive all treatments, then the order of
treatments should be determined randomly for each experimental unit.
Blocking
The technique of blocking is used to control the effects of known factors. A
block is a group of homogeneous experimental units. Experimental units in a
block are similar in certain characteristics, whereas those in different blocks
differ in those characteristics.
Replication
Replication refers to the process of giving a certain treatment numerous times
in an experiment, or even repeating an experiment multiple times. Replication
reduces chance variation among results. It also allows us to estimate chance
variation among results.
Completely Randomized Design
In a completely randomized design, treatments are assigned randomly to all
experimental units, or experimental units are assigned randomly to all
treatments. This design can compare any number of treatments. There are
advantages in having an equal number of experimental units for each treatment.
Randomized Block Design
If the treatments are the only systematic differences present in the
experiment, then the completely randomized design is best for comparing the
responses. But often there are other factors affecting responses. Unless they
are controlled, the results will be biased. One way to control the effects of
known extraneous factors is to form groups of similar units called blocks. In a randomized block design, treatments are randomly assigned to experimental units within each block. Blocking allows the experimenter to account for systematic differences in responses due to a known factor and leads to more precise conclusions from the experiment.
Matched-Pairs Design
If there are only two treatments to be compared in the presence of a blocking
factor, then you should use a randomized paired comparison design. This can be
designed in different ways. Form two or more blocks of two experimental units
each. Experimental units within each block should be matched by some relevant
characteristics. Within each block, toss a coin to assign two treatments to the
two experimental units randomly. Each block will have one experimental and one
control unit. Because both experimental units are similar to each other except
for the treatment received, the differences in responses can be attributed to
the differences in treatments. This type of experiment is called a matched-pairs
design.
Alternatively, each experimental unit can be used as its own block. Assign both
treatments to each experimental unit, but in random order. To control the effect
of the order of treatment, randomly determine the order. With each experimental
unit, toss a coin to decide whether the order of treatments should be treatment 1 or treatment 2. Because both treatments are assigned to the same experimental unit, the individual effects of experimental units are nullified, and the differences in responses can be attributed to the differences in treatments.
Control groups allow us to see what would happen to experimental units over
time. Placebo groups go above and beyond control groups by implementing an
intervention or treatment that is similar to the target intervention or
treatment. This helps us account for even more external factors.
Probability In Statistics
Introduction
Words like probability, chance, and likely have similar meanings when used
casually. They all convey uncertainty. By using probability, we can make
numerical statements about uncertainty. Probability is a measure of the likelihood of an event.
Sample Space
Any process that results in an observation or an outcome is an experiment. An
experiment may have more than one possible outcome. A set of all possible
outcomes of an experiment is known as a sample space. It is generally denoted
using the letter \(S\).
Tossing a coin will result in one of two possible outcomes, heads or tails.
Therefore, the sample space of tossing a coin is
\(S = \{\text{Heads}, \text{Tails}\}\)
Throwing a six sided die will result in one of six possible outcomes. The
resulting sample space is
\(S = \{1, 2, 3, 4, 5, 6\}\)
Tossing two coins will result in one of four possible outcomes. We can indicate
the outcome of each of the two tosses by using a pair of letters, the first
letter of which indicates the outcome of tossing the first coin and the second
letter the outcome of tossing the second coin. H is for heads and T is for
tails. The resulting sample space is
\(S = \{(H,H), (H,T), (T,H), (T,T)\}\)
The outcomes listed in a sample space are never repeated, and no outcome is left
out. Two events are said to be equally likely if one does not occur more often
than the other. For example, the six possible outcomes for a throw of a die are
equally likely.
A tree diagram representation is useful in determining the sample space for an experiment, especially if there are relatively few possible outcomes. For example, imagine an experiment in which a die and a quarter are tossed together. What are all the possible outcomes? The six possible outcomes of throwing a die are 1, 2, 3, 4, 5, 6. The two possible outcomes of tossing a quarter are Heads and Tails. Pairing each die outcome with each coin outcome gives 6 × 2 = 12 possible outcomes in the sample space.
The probability of an event is generally denoted by a capital P followed by the name of the event in parentheses. If all the outcomes in a sample space are equally likely, then by using the concept of relative frequency, we can compute the probability of an event as:
\(P(\text{event}) = \frac{\text{favorable outcomes}}{\text{total outcomes}}\)
For example, let A be getting an even number when a six sided die is thrown, and let B be getting two heads when two coins are tossed. Then:
\(P(A) = \frac{3}{6} = \frac{1}{2} = 0.5\)
The probability of getting an even number when a six sided die is thrown is 0.5.
In other words, there is a 50% chance of getting an even number when a six sided
die is thrown.
\(P(B) = \frac{1}{4} = 0.25\)
The probability of getting two heads when two coins are tossed is 0.25. In other
words, there is a 25% chance of getting two heads when two coins are tossed.
Probability Rules and Terms
There are two rules that all probability must satisfy.
Rule 1: For any event A, the probability of A is always greater than or equal to
0 and less than or equal to 1.
Rule 2: The sum of the probabilities for all possible outcomes in a sample space
is always 1.
So, if an event can never occur, its probability is 0. Such an event is known as
an impossible event.
If an event must occur every time, its probability is 1. Such an event is known
as a sure event.
The odds in favor of an event are the ratio of the probability of the occurrence of the event to the probability of the nonoccurrence of that event.
\(\text{Odds of event} = \frac{P(\text{Event occurs})}{P(\text{Event does not occur})}\)
or:
\(P(\text{Event occurs}) : P(\text{Event does not occur})\)
Example 1
When tossing a die, what are the odds in favor of getting the number 2?
When tossing a die:
\(P(2) = \frac{1}{6}\)
and:
\(P(not-2) = \frac{5}{6}\)
So, the odds in favor of getting the number 2 are \(\frac{1}{6}:\frac{5}{6}\), or 1:5.
A Venn diagram can illustrate some of the following terms: the rectangular box indicates the sample space, and circles indicate different events.
The complement of an event is the set of all possible outcomes in a sample space
that do not lead to the event. The complement of an event is denoted by A'.
Disjoint or mutually exclusive events are events that have no outcome in common.
In other words, they cannot occur together. Two separate circles in a Venn
diagram are disjoint events.
The union of events A and B is the set of all possible outcomes that lead to at least one of the two events A and B. The union of events A and B is denoted by \(A \cup B\), read "A or B". The intersection of events A and B is the set of all possible outcomes that lead to both events A and B. The intersection of events A and B is denoted by \(A \cap B\), read "A and B".
A conditional event, A given B, is the set of outcomes for event A that occur given that B has occurred. It is denoted \(A|B\) and read "A given B". Two events A and B are considered independent if the occurrence of one event does not affect the probability of the other event.
Independence vs. Dependence
Events that do not depend on each other are called independent. If the events are related, then they are dependent.
Example 2
The sample space for throwing a die is S={1,2,3,4,5,6}. Suppose events A,B, and
C are defined as follows:
A=Getting an even number={2,4,6}
B=Getting at least 5={5,6}
C=Getting at most 3={1,2,3}
Find the probability of each of these events and its complement. Then, find the
union, intersection, and conditional probability of each pair of events.
Solution:
\(P(A)=\frac{3}{6}=0.5\)
\(P(B)=\frac{2}{6}\approx 0.33\)
\(P(C)=\frac{3}{6}=0.5\)
Complement:
A' = Getting an odd number = {1,3,5}
\(P(A')=\frac{3}{6}=0.5=1-P(A)\)
B' = Getting a number less than 5 = {1,2,3,4}
\(P(B')=\frac{4}{6}\approx 0.67=1-P(B)\)
C' = Getting a number larger than 3 = {4,5,6}
\(P(C')=\frac{3}{6}=0.5=1-P(C)\)
Union:
A or B = Getting an even number or a number greater than or equal to 5 or both = {2,4,5,6}
\(P(A \text{ or } B)=\frac{4}{6}\approx 0.67\)
A or C = Getting an even number or a number less than or equal to 3 or both = {1,2,3,4,6}
\(P(A \text{ or } C)=\frac{5}{6}\approx 0.83\)
B or C = Getting a number that is at most 3 or at least 5 or both = {1,2,3,5,6}
\(P(B \text{ or } C)=\frac{5}{6}\approx 0.83\)
Intersection:
(A and B) = Getting an even number that is at least 5 = {6}
\(P(A \text{ and } B)=\frac{1}{6}\approx 0.17\)
(A and C) = Getting an even number that is at most 3 = {2}
\(P(A \text{ and } C)=\frac{1}{6}\approx 0.17\)
(B and C) = Getting a number that is at most 3 and at least 5 = {}
\(P(B \text{ and } C)=\frac{0}{6}=0\)
So, B and C are disjoint or mutually exclusive events.
Conditional Events:
(A|C) = Getting an even number given that the number is at most 3 = {2}
\(P(A|C)=\frac{1}{3}\approx 0.33\)
(A|B) = Getting an even number given that the number is at least 5 = {6}
\(P(A|B)=\frac{1}{2}=0.5\)
(B|C) = Getting at least 5 given that the number is at most 3 = {}
\(P(B|C)=0\)
Mutually Exclusive Events
Independence and mutually exclusive events are two relationships that often get
mixed up. Independence describes events occurring that do not affect each
other's probability while mutually exclusive describes events with no shared
outcomes. Though these two relationships mean very different things, they are linked in a way. Suppose we flip a coin. The events heads and tails are clearly mutually exclusive. If you get one, you cannot get the other at the same time. That means if you get heads for the flip, you have no chance of getting tails on the same flip. In other words, since there is no shared outcome, getting one outcome makes it impossible to get the other. What this implies is that mutually exclusive events (with nonzero probabilities) are never independent, which in turn means that independent events can never be mutually exclusive.
Probability Rules
Complements: The probability of the complement of an event A is given by
\(P(A') = 1-P(A)\)
Union: Addition rule, where the probability of the union of two events A and B is given by \(P(A \cup B) = P(A) + P(B) - P(A \cap B)\). If the events A and B are disjoint, then \(P(A \cap B)=0\) and \(P(A \cup B)=P(A)+P(B)\).
Intersection multiplication rule: For events A and B defined in a sample space S, \(P(A \cap B)=P(A)P(B|A)=P(B)P(A|B)\). If the events A and B are independent of each other, then \(P(A|B)=P(A)\), \(P(B|A)=P(B)\), and \(P(A \cap B)=P(A)P(B)\).
Conditional Probabilities (Bayes' Theorem): The probability of A given B is \(P(A|B)=\frac{P(A \cap B)}{P(B)}\).
Independence: Two events A and B are independent if and only if
\(P(A \cap B)=P(A)*P(B)\).
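These rules can be checked by simulation. The sketch below estimates \(P(A \cup B)\) for a die throw, where A is getting an even number and B is getting at least 5, and compares it with the addition rule's answer of \(\frac{3}{6} + \frac{2}{6} - \frac{1}{6} = \frac{4}{6}\).

```python
# A minimal sketch: checking the addition rule by simulating die throws.
import random

trials = 100_000
hits = 0
for _ in range(trials):
    roll = random.randint(1, 6)
    if roll % 2 == 0 or roll >= 5:  # event A or event B occurs
        hits += 1

print(hits / trials)  # should be close to 4/6, about 0.6667
```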
Random Variables and their Probability
A variable is a quantity whose value varies from subject to subject.
A probability experiment is an experiment whose possible outcomes may be known
but whose exact outcome is a random event and cannot be predicted with certainty
in advance. If the outcome of a probability experiment takes a numerical value,
then the outcome is a random variable. Random variables are usually denoted
using capital letters. Sometimes two or more variables are denoted using the
same letter but different subscripts.
There are two types of random variables, discrete and continuous.
A discrete random variable is a quantitative variable that takes a countable
number of values.
Note that between any two possible values of a discrete random variable, there are at most countably many possible values.
A continuous random variable is a quantitative variable that can take all the
possible values in a given range. A person's weight is a good example. A person
can weigh 150 pounds or 155 pounds or anything in between. Other examples are:
altitude of a plane, amount of rainfall in a city in a day, amount of gasoline
pumped into a car's gas tank, weight of a newborn baby, or the amount of water
flowing through a dam per hour.
Probability Distributions of Discrete Random Variables
A probability distribution of a discrete random variable or a discrete
probability distribution is a table, list, graph, or formula giving all possible
values taken by a random variable and their corresponding probabilities.
Mean of a Discrete Random Variable
The mean \(\mu\) of a discrete random variable X is also known as the expected
value. It is denoted by \(E(X)\) and is computed by multiplying each value of
the random variable by its probability and then adding over the sample space.
\(\mu_X = E(X) = \sum_{i} x_i P(x_i)\)
The variance of a discrete random variable is defined as the sum of the product
of squared deviations of the values of the variable from the mean and the
corresponding probabilities:
\(\sigma^2 = \sum(x_{i}-\mu)^2 P(x_{i})\)
Remember that standard deviation is simply the square root of variance. Standard
deviation is our expected value for how much any given data point will vary from
the mean.
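A minimal sketch, using a fair six-sided die as the discrete random variable:

```python
# A minimal sketch: mean, variance, and standard deviation of a discrete
# random variable (the value shown by a fair six-sided die).
values = [1, 2, 3, 4, 5, 6]
probs = [1 / 6] * 6

mean = sum(x * p for x, p in zip(values, probs))                    # E(X) = 3.5
variance = sum((x - mean) ** 2 * p for x, p in zip(values, probs))  # about 2.9167
sd = variance ** 0.5                                                # about 1.7078

print(mean, variance, sd)
```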
Combinations
A combination is the number of ways r items can be selected out of n items if the order of selection is not important. It is denoted by \(\binom{n}{r}\), which reads as "n choose r", and is computed as \(\binom{n}{r} = \frac{n!}{r!(n-r)!}\)
For any integer \(n\geq0,n!\) is read as "n factorial" and is computed as \(n! =
n(n-1)(n-2)(n-3)...(3)(2)(1)\). For example, \(3! = (3)(2)(1) = 6\) and \(5! =
(5)(4)(3)(2)(1) = 120\). Note that \(0!=1\) and \(1!=1\).
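Python's standard library (3.8+) computes these directly:

```python
# A minimal sketch: factorials and "n choose r" with the standard library.
import math

print(math.factorial(5))  # 120
print(math.comb(5, 2))    # 10 ways to choose 2 items out of 5
print(math.factorial(5) // (math.factorial(2) * math.factorial(3)))  # same, by the formula
```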
Why do we use combinations? We do not care when in the sequence our x successes occur. We just want there to be x successes out of n trials. There are \(\binom{n}{x}\) ways to get x successes out of n trials.
\(p^x\) is the probability of getting x successes. If the probability of getting one success is p, then the probability of getting two successes is \(p \cdot p\). Similarly, \((1-p)\) must be the probability of not getting a success. The probability of getting two failures is \((1-p)(1-p)\).
\(P(A\text{ and }B) = P(A) \cdot P(B)\) if A and B are independent.
Binomial Distributions
One example of a distribution of discrete random variables is the binomial
distribution. A binomial distribution occurs in an experiment that possesses the
following properties:
1. There are n repeated trials, with the number n fixed in advance
2. Each trial has two possible outcomes, known as success and failure.
3. All trials are identical and independent, thus the probability for success
remains the same for each trial.
The binomial variable X:
X = the number of successes in n trials = 0,1,2,...n
\(P(X=x)=\binom{n}{x}p^x(1-p)^{n-x}\)
Mean of a binomial random variable (how many times do you expect to succeed?)
\(\mu=np\)
Variance of a binomial random variable (how much do you expect your number of
successes to vary from sample to sample):
\(\sigma^2 = np(1-p)\)
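A minimal sketch of the binomial formula, applied to a hypothetical lot with a 10% defect rate:

```python
# A minimal sketch: binomial probabilities for n = 20 items, p = 0.10.
import math

n, p = 20, 0.10

def binom_pmf(x: int) -> float:
    # P(X = x) = C(n, x) * p^x * (1 - p)^(n - x)
    return math.comb(n, x) * p**x * (1 - p) ** (n - x)

print(binom_pmf(2))     # probability of exactly 2 defectives, about 0.285
print(n * p)            # mean: 2.0
print(n * p * (1 - p))  # variance: 1.8
```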
Some examples of binomial random variables:
1. A quality control inspector takes a random sample of 20 items from a large
lot, inspects each item, classifies each as defective or nondefective, and
counts the number of defective items in the sample.
2. A telephone survey asks 400 area residents, selected at random, whether they
support the new gasoline tax increase. The answers are recorded as yes or no.
The number of persons answering yes is counted.
3. A random sample of families is taken, and for each family with three
children, the number of girls out of the three children is recorded.
4. A certain medical procedure is performed on 15 patients who are not related
to each other. The number of successful procedures is counted.
5. A homeowner buys 20 azalea plants from a nursery. The number of plants that
survive at the end of the year is counted.
The shape of the binomial distribution depends on the values of n and p. The
distribution spreads from 0 to n.
Geometric Distribution
Another example of a distribution of discrete random variables is the geometric
distribution. The geometric distribution occurs in an experiment where repeated
trials possess the following properties:
1. Trials are repeated until the first success occurs, so the number of trials is not fixed in advance.
2. Each trial has two possible outcomes, success or failure.
3. All trials are identical and independent, thus the probability of success remains the same for each trial.
The geometric random variable X:
X = the number of trials required to obtain the first success = 1, 2, 3, ...
P(the first success is observed on trial x) = \((1-p)^{x-1}p\).
Mean of the geometric random variable: \(\mu = E(X) = \frac{1}{p}\)
Variance of the geometric random variable: \(\sigma^2 = Var(X)=\frac{1-p}{p^2}\)
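A minimal sketch, assuming a hypothetical success probability of p = 0.2 per trial:

```python
# A minimal sketch: geometric probabilities, mean, and variance.
p = 0.2

def geom_pmf(x: int) -> float:
    # P(first success occurs on trial x) = (1 - p)^(x - 1) * p
    return (1 - p) ** (x - 1) * p

print(geom_pmf(1))     # 0.2
print(geom_pmf(4))     # 0.8^3 * 0.2 = 0.1024
print(1 / p)           # mean: expect 5 trials until the first success
print((1 - p) / p**2)  # variance: 20.0
```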
Some examples of the geometric random variable:
1. A worker opening oysters to look for pearls counts the number of oysters he
has to open until he finds the first pearl.
2. A supervisor at the end of an assembly line counts the number of nondefective
items produced until he finds the first defective one.
3. An electrician inspecting cable one yard at a time for defects counts the
number of yards he inspects before he finds a defect.
You can think about the mean of a geometric random variable intuitively. If p
gets bigger, the mean number of trials until the first success goes down. If
something happens often, it is very unlikely that you will have to wait very
long for it to occur.
In the binomial distribution, the number of trials is fixed, and the number of
successes is a random event. In the geometric distribution, the number of
successes is fixed, but the number of trials required to get the success is a
random event.
Probability Distributions of Random Variables
A continuous random variable takes all possible values in a given range. An
example is the distance traveled by a car using one gallon of gas. Occasionally,
when a discrete variable takes lots of values, it is treated as a continuous
variable.
The probability distribution of a continuous random variable or the continuous
probability distribution is a graph or a formula giving all possible values
taken by a random variable and the corresponding probabilities. It is also known
as the density function, or probability density function.
Let X be a continuous random variable taking values in the range (a, b). Then, the area under the density curve between two points is equal to the probability that X falls between those points. The total probability under the curve is 1. The probability that X takes any one specific value is 0. For example, the chance of it raining exactly 3.00233221 inches is 0, but the chance of it raining between 3.00 and 3.01 inches is small but measurable.
This may seem hard to understand, but remember that there are an infinite number of points on your probability density function and all of their probabilities add up to 1. Pretend you only have 10 events whose probabilities add up to 1. If they all had equal probability, then each would have a 0.1 probability. With 100 events, each would have a 0.01 probability. With infinitely many events, the probability of each one goes to 0.
The cumulative distribution function of a random variable X is \(F(x_0) = P(X \leq x_0)\). For X taking values in the range (a, b), it is equal to 0 for any \(x_0 < a\), and it is equal to 1 for any \(x_0 > b\).
Normal Distribution
The discovery of the normal distribution is credited to Carl Gauss. It is also known as the bell curve or Gaussian distribution. This is the most commonly used distribution in statistics because it closely approximates the distribution of many different measurements.
If a random variable X follows a normal distribution with mean \(\mu\) and standard deviation \(\sigma\), then it is denoted \(X \sim N(\mu, \sigma)\).
The standard normal is the normal distribution with a mean of 0 and a standard
deviation of 1. Any normal random variable can be transformed into the standard
normal using the relation \(Z=\frac{X-\mu}{\sigma}\). The value of variable Z
for any specific value of X is known as the z-score. For example, suppose \(X \sim N(10, 2)\). The z-score for X = 12.5 is then \(Z=\frac{X-\mu}{\sigma} = \frac{12.5-10}{2}=1.25\).
This process is called z-scoring. It does not change the distribution at all. It
simply changes the units on the x-axis, shifting it over by \(-\mu\) and
relabeling the axis in units of standard deviation so that 1 unit = 1 standard deviation.
Properties of the Normal Distribution
It is continuous. It is symmetric around its mean. It is bell-shaped. Mean = median = mode. The curve approaches the horizontal axis on both sides of the mean without ever touching or crossing it. Nearly all of the distribution lies within three standard deviations of the mean. It has two inflection points: one at \(\mu-\sigma\) and one at \(\mu+\sigma\).
The normal distribution is fully determined by two parameters, the mean and the variance (or standard deviation). The location of the distribution on the number
line depends on the mean of the distribution. The shape of the distribution
depends on the standard deviation. A normal distribution with a larger standard
deviation is more spread out, while one with a smaller standard deviation is
more tightly bunched.
Using the Normal Distribution Table
If the random variable X follows a normal distribution with mean \(\mu\) and
standard deviation \(\sigma\), then the random variable
\(Z=\frac{X-\mu}{\sigma}\) follows a standard normal distribution, a normal
distribution with mean 0 and standard deviation 1.
To find the area under the standard normal distribution, a normal distribution
with mean 0 and standard deviation 1, you can simply look at the standard normal
probability table. To find the area under the curve, the probability, for any
normal distribution other than the standard normal, we convert it to a standard
normal using the above formula. Approximately 68% of the area under the curve
lies between \(\mu-\sigma\) and \(\mu+\sigma\). Approximately 95% of the area
under the curve lies between \(\mu-2\sigma\) and \(\mu+2\sigma\). Lastly,
approximately 99.73% of the area under the curve lies between \(\mu-3\sigma\)
and \(\mu+3\sigma\).
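These areas can also be computed without a table: the standard normal cumulative probability can be written in terms of the error function as \(\Phi(z) = \frac{1}{2}\left(1 + \operatorname{erf}\left(\frac{z}{\sqrt{2}}\right)\right)\). A minimal Python sketch:

```python
# A minimal sketch: standard normal areas from the error function.
import math

def phi(z: float) -> float:
    # Cumulative probability of the standard normal at z.
    return 0.5 * (1.0 + math.erf(z / math.sqrt(2.0)))

# Area within 1, 2, and 3 standard deviations of the mean.
for k in (1, 2, 3):
    print(k, phi(k) - phi(-k))  # about 0.6827, 0.9545, 0.9973
```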
Combining Independent Random Variables
Sometimes we are interested in linear combinations of independent random
variables. If we know the means and the variances of two random variables, we
can determine the means and the variances of a linear combination of these
variables. If X and Y are normally distributed, then a linear combination of the two will also be normally distributed.
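In symbols, for constants a and b, the mean rule holds in general, while the variance rule requires independence:
\[E(aX + bY) = aE(X) + bE(Y)\]
\[Var(aX + bY) = a^2\,Var(X) + b^2\,Var(Y) \quad \text{(X and Y independent)}\]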
Sampling Distributions
A parameter is a numerical measure of a population. For example, a student's GPA is computed using grades from all of his courses, so GPA is a parameter.
A statistic is a numerical measure of a sample. An example is the percent of votes received by a presidential candidate. Generally, not every eligible voter votes in an election. Therefore, the president is elected based on the support received from a sample of the eligible voters, so the percent is a statistic. If every eligible voter did vote, then the percent of votes received would be a parameter.
The sampling distribution is the probability distribution of all possible values
of a statistic. Different samples of the same size from the same population will
result in different values of the statistic. Therefore, a statistic is a random
variable. Any table, list, graph, or formula giving all possible values a
statistic can take and their corresponding probabilities gives a sampling
distribution of that statistic.
The standard error is the standard deviation of the distribution of a statistic.
Central Limit Theorem
Regardless of the shape of the distribution of the population, if the sample size is large and the population has finite variance, then the distribution of the sample means will be approximately normal, with mean \(\mu_{\bar{x}}=\mu\) and standard deviation \(\sigma_{\bar{x}}=\frac{\sigma}{\sqrt{n}}\).
Basically, the central limit theorem tells us that regardless of the shape of the population distribution, as the sample size n increases, the shape of the distribution of \(\bar{X}\) becomes more symmetric and bell-shaped, or more like a normal distribution. The center of the distribution of \(\bar{X}\) remains at \(\mu\). The spread of the distribution of \(\bar{X}\) decreases, and the distribution becomes more peaked.
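A minimal simulation sketch, assuming numpy, drawing samples from a skewed (exponential) population:

```python
# A minimal sketch: the central limit theorem on a skewed population.
import numpy as np

rng = np.random.default_rng(0)
n, num_samples = 50, 10_000

# Exponential population with mean 1 and standard deviation 1 (skewed).
sample_means = rng.exponential(scale=1.0, size=(num_samples, n)).mean(axis=1)

print(sample_means.mean())  # close to the population mean, 1.0
print(sample_means.std())   # close to sigma / sqrt(n) = 1 / sqrt(50), about 0.141
```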
Independent Events
Two events, A and B, are independent if the outcome of one event does not affect the probability of the other. In other words, if \(P(A)=P(A|B)\) and \(P(B)=P(B|A)\), then A and B are independent. Knowing A would not give you any information about B.
Geometric Distribution
A geometric distribution tells you the probability of having your first success on the kth trial. For example, it can be used to describe the probability that the first heads occurs on the 4th coin flip.
Binomial Distribution
The binomial distribution tells you the probability of having k successes in n
trials.
The central limit theorem states that if you take samples of size n from a population with finite variance, and n is large enough, the distribution of sample means will be approximately normal. Even if you are sampling from a skewed distribution, like income, the distribution of sample means will be close to normal.