Basics of Statistics

These are my notes and thoughts on an introduction to statistics.

This is my favorite Statistics book on Amazon. If you are interested in learning Statistics, I highly recommend it.

 


Exploring Data

We use data to make decisions. We make estimations and develop guidelines using data. Therefore, data and its analysis are important.

Businesses collect data on their users. Scientists collect data on their experiments. Doctors collect data on their patients. Police collect data on criminals. Lots of people, organizations, and businesses collect data now. In fact, almost all of them do.

Most data that is collected is not immediately useful. It has to be organized and put in the proper format. It also needs to be summarized so decision makers can easily choose what is best for their organization. The techniques used for this are called descriptive methods. They are useful for presentation, data reduction, and summarization of data.

Variables

There are two types of variables, categorical and numerical. A variable is categorical if it places the individuals being studied into one of several groups or categories. A variable is numerical if its outcomes are quantitative and can be analyzed using arithmetic. Numerical variables can be either discrete or continuous. Different methods of analysis must be used for categorical and numerical variables. 

If we take only one measurement on each object, we get univariate data. With two measurements on each object, we get bivariate data. 

Types of Descriptive Methods

There are different descriptive methods depending on the type of data that is collected. These are tabular, graphical, and numerical methods.

Different descriptive methods will answer different questions about data. 

Tabular

Collected data needs to be rearranged before analysis. One tabular method is the frequency distribution table. The letter \(n\) is used to denote the number of observations in a data set. The frequency of a value is the number of times that value occurs among the observations. Frequency is usually denoted by the letter \(f\). The relative frequency of a value is the ratio of its frequency to the total number of observations. It is usually denoted by \(rf\) and equals \(\frac{f}{n}\). The cumulative frequency gives the number of observations less than or equal to a specified value and is denoted by \(cf\). A frequency distribution table is a table giving all possible values of a variable and their frequencies.
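As a rough sketch of how such a table could be built, the Python below computes \(f\), \(rf\), and \(cf\) for each value. The data set (number of pets in 10 households) is made up for illustration.

```python
from collections import Counter

# Hypothetical data: number of pets reported by n = 10 households.
observations = [0, 1, 1, 2, 0, 3, 1, 2, 1, 0]
n = len(observations)

counts = Counter(observations)  # frequency f of each distinct value

table = []       # rows of (value, f, rf, cf)
cumulative = 0   # running total used for cf
for value in sorted(counts):
    f = counts[value]
    cumulative += f
    table.append((value, f, f / n, cumulative))

print("value  f    rf  cf")
for value, f, rf, cf in table:
    print(f"{value:>5}  {f}  {rf:.2f}  {cf:>2}")
```

The last cumulative frequency always equals \(n\), which is a quick check that no observation was missed.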

Graphical

Presenting your data in tables is not very useful, but it is done. You should know how to interpret a table if you have to analyze one. Charts are just a better tool.

Bar charts are used a lot. They can have either horizontal or vertical bars, and they are very commonly used to display categorical data.

Pie charts can also display amounts and frequencies of data. They are a popular graphical method but not usually the best choice. They are difficult to make and read. 

Segmented Bar Charts

It is important to see categorical data that stems from different groups in order to make comparisons. A segmented bar chart draws one bar for each group, arranged along either the horizontal or vertical axis, and divides each bar into segments that show how that group is distributed across the categories. These charts can show frequency, with bars of various sizes, or relative frequency, where all bars are the same size regardless of group size. Segmented bar charts that show relative frequency can be somewhat misleading because they hide differences in sample size between groups.

Mosaic Plots

These are kind of similar to segmented bar charts. They are just a different way to compare categorical data. In a mosaic plot, use the width of the bars to represent the size of the sample. Each header indicates a different group. The groups can be arranged along the x or y axis. The lengths of these bars along the axis represent the relative frequencies of these groups compared to each other.

Along the other axis, the bars of each group are the same length. Each section within the group bars represents the percentage that category occurred in the data set for that group. These same categories should appear within each of the group bars. We can make comparisons about the size of each group based on the length of each group bar. We can also evaluate the proportions of the categorical variables within each group by comparing the relative sizes of each section.

Graphical Methods For Numerical Data

To summarize and describe numerical data, dotplots and stemplots are used for small sets of data. For larger sets, histograms, cumulative frequency charts, and boxplots are often used.

We can describe the overall pattern of the distribution of a numerical variable using the following three characteristics: center, spread, and shape.

The center of a distribution describes the central data point. There are a few ways to measure the central tendency which include the mean, median, and the mode. Each measure has different pros and cons depending on the type and shape of the data.

The spread of a distribution tells us how widely the data vary around the center. The shape of a distribution can be symmetric or skewed.

If the left half of the distribution is approximately a mirror image of the right half, then the distribution is called symmetric. This means that the data is spread out in the same way on both sides and that there is the same amount of data on each side of the center.

If there are extreme values in only one direction that cause one side to have a longer tail, we call the distribution skewed. It is right skewed if the longer tail is on the right and left skewed if the longer tail is on the left.

Patterns of Data

When looking at data, we should look for patterns and deviations. To describe patterns, look for clusters, gaps, and outliers. In clustered data, observations are grouped together tightly. Gaps are regions between clusters where no observations fall. It is important to make these distinctions.

If you have outliers in your data, you have an observation that is a lot different from the rest of the data. Outliers fall away from the middle of the data set. 

Graphical Methods for Continuous Variables

There are several ways to show graphical data for continuous variables. These include dotplots, stemplots, histograms, and cumulative frequency charts.

Dotplots

Dotplots are easy to make. They are nice for smaller data sets. However, if there is too much data the dotplot becomes too cluttered to read. To make a dotplot, draw a horizontal line to indicate the data range and scale the line to accommodate the entire range of data. Then mark a dot for each observation in the appropriate place above the scaled line. If more than one observation has the same value, stack the dots one above the other.

Each dot on the plot indicates the location of the value of a data point. For any data point, we can look directly down at the scale to determine the value of the point. When looking at a dotplot, we can see how the data points are spread, what kind of shape the points make collectively, and where the approximate center of the distribution is.  
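The steps above can be sketched as a small text-based dotplot. This is only an illustration: the function name `dotplot` and the sample scores are made up, and it assumes a short list of small integers.

```python
from collections import Counter

def dotplot(values):
    """Return a text dotplot for a small list of integers."""
    counts = Counter(values)
    axis = range(min(values), max(values) + 1)  # scaled horizontal line
    rows = []
    # Build rows from the tallest stack of dots down to the axis.
    for level in range(max(counts.values()), 0, -1):
        row = "".join(" * " if counts[v] >= level else "   " for v in axis)
        rows.append(row.rstrip())
    # The bottom row is the scale itself.
    rows.append("".join(f"{v:^3}" for v in axis).rstrip())
    return "\n".join(rows)

# Hypothetical quiz scores.
plot = dotplot([2, 3, 3, 4, 4, 4, 5, 7])
print(plot)
```

Each asterisk is one observation, and repeated values stack vertically, so the tallest column marks the most frequent value.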

Stemplots

Stemplots are also used a lot. An advantage of the stemplot is that it shows every value. However, since that is the case, it is only useful for small data sets. 

To make a stemplot, separate each observation into two parts. The left part of the observation is called the stem and the right part is called the leaf. Draw a vertical line on the left side of the page to separate the stems from the leaves. Write all possible stems in increasing order on the left of the line. For each observation, write the leaf to the right of the vertical line, next to its stem, keeping the leaves on each stem in increasing order.

The numbers on the left side of the vertical line are stems. The value of a data point is read by joining its stem and its leaf. The number of leaves on a stem gives the frequency of that class, so different stems can have different numbers of leaves. Each leaf indicates a single observation.

We use stemplots to see how the data is shaped and how it is spread. We also use it to see where the center of the data is. 
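A minimal sketch of a stemplot, assuming non-negative whole numbers where the tens part is the stem and the ones digit is the leaf. The function name `stemplot` and the data are hypothetical.

```python
from collections import defaultdict

def stemplot(values):
    """Text stemplot for non-negative integers: stem = tens, leaf = ones digit."""
    leaves = defaultdict(list)
    for v in sorted(values):            # sorting keeps leaves in increasing order
        leaves[v // 10].append(v % 10)
    lines = []
    # List every possible stem, even one with no leaves.
    for stem in range(min(leaves), max(leaves) + 1):
        row = f"{stem:>2} | " + " ".join(str(leaf) for leaf in leaves.get(stem, []))
        lines.append(row.rstrip())
    return "\n".join(lines)

# Hypothetical ages.
ages = [23, 12, 41, 15, 28, 21, 34, 23, 47]
print(stemplot(ages))
```

Because every leaf is printed, the original values can be read straight back off the display.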

Histograms

A histogram is one of the most popular ways of displaying numerical data. It resembles a stemplot turned on its side. Histograms are useful for showing patterns in large data sets. A histogram can be drawn using frequencies, relative frequencies, or percentages.

To make a histogram, create groups (classes) from the continuous data, draw the x axis and the y axis to scale so that they accommodate all of the groups and frequencies, draw a bar for each group with height equal to the corresponding frequency, label each group, and draw the bars next to each other without any gaps. There are no gaps between histogram bars because the data values are continuous and the values in one bar flow right into the next one.

Each bar represents a single group or class. There is only one bar for each class. The classes are placed on the x axis in numerically increasing order, just as on a number line. The height of a bar in a frequency histogram corresponds to the frequency of that class. Percentage or relative frequency histograms can be read similarly. In a relative frequency histogram, the height of the bar reflects the relative frequency corresponding to the class. In a percentage frequency histogram, the height of the bar reflects the percent frequency that corresponds to the class. 
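As a rough illustration, the sketch below groups continuous data into equal-width classes and prints a sideways text histogram with both frequency and percentage. The helper `histogram_counts`, the class boundaries, and the height data are assumptions made for the example.

```python
def histogram_counts(values, start, width, num_classes):
    """Count observations in classes [start + i*width, start + (i+1)*width)."""
    counts = [0] * num_classes
    for v in values:
        index = int((v - start) // width)
        if 0 <= index < num_classes:   # ignore values outside the chosen classes
            counts[index] += 1
    return counts

# Hypothetical heights in inches.
heights = [61.5, 63.0, 64.2, 65.8, 66.1, 66.9, 67.3, 68.0, 70.4, 72.9]
start, width = 60, 5
counts = histogram_counts(heights, start, width, num_classes=3)

for i, count in enumerate(counts):
    low = start + i * width
    # One '#' per observation; the percentage is the relative frequency.
    print(f"{low}-{low + width}: {'#' * count} ({count / len(heights):.0%})")
```

The classes share boundaries (60 to 65, 65 to 70, 70 to 75), which is why the bars of a drawn histogram touch.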

Cumulative Frequency Charts

The cumulative frequency for any group is the frequency for that group plus the frequencies of all groups of smaller observations. 

To draw a cumulative frequency chart, draw the x and y axes, scale the x axis to accommodate the range of all groups, and mark the upper boundary of each group. Scale the y axis from 0 to n. Above the upper boundary of each group, place a dot at the height equal to the cumulative frequency for that group, then connect all the dots with straight lines.
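The points to plot can be computed before any drawing is done. The sketch below assumes three hypothetical groups with made-up upper boundaries and frequencies.

```python
from itertools import accumulate

# Hypothetical grouped data: upper boundary and frequency of each group.
upper_boundaries = [65, 70, 75]
frequencies = [3, 5, 2]

# Running total of the frequencies gives the cumulative frequency per group.
cumulative = list(accumulate(frequencies))

# Each dot sits above a group's upper boundary at its cumulative frequency.
points = list(zip(upper_boundaries, cumulative))
print(points)
```

The final point reaches a height of n, the total number of observations, which matches the top of the y axis scale.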

From any point on the graph, we can draw a vertical line to read the x value from the x axis and a horizontal line to read the y value from the y axis. For right skewed distributions, the curve increases quickly in the beginning but then steadies in the later part. For left skewed distributions, the curve increases slowly in the beginning, but then steeply later on. The cumulative frequency chart for a symmetric distribution is often described as s-shaped because it begins with a slow increase on the left, rises rapidly in the middle, and then tapers off to a slow increase again at the right. 

Summary

Visualizations are the very first step you should take when analyzing data. The types of summary statistics, inferential tests, and analysis that can be calculated are dependent upon the shape of the distribution. The key point to remember is that there are different calculations for symmetric and skewed data. Knowing the shape of the distributions will help you get started. 

What Is Statistics

Data is any collection of numbers, characters, images, or any other items that provide information about something. What is Statistics? It is a way of reasoning, along with a collection of tools and methods, designed to help us understand the world. What are Statistics? Statistics are particular calculations made from data. 

 

The characteristics recorded about each individual are called variables. They are usually found as the columns of a data table with a name in the header that identifies what has been recorded. 

 

Some variables are called nominal because they name categories. That means you can’t do math on the data or that it would make no sense if you did. Descriptive responses to questions are often categories. 

 

When a variable contains measured numerical values with measurement units, we call it a quantitative variable. Quantitative variables typically record an amount or degree of something. For quantitative variables, the measurement units give meaning to the numbers. Some quantitative variables do not have obvious units, such as a stock market index. Sometimes a variable with numerical values can be treated as either categorical or quantitative depending on what we want to know from it.

 

For a categorical variable, each individual is assigned one of several possible values. However, some variables have a different value for every individual. Such a variable is an identifier variable. Identifier variables do not tell us anything useful about their categories because we know there is exactly one individual in each. Identifiers are part of what is called metadata, or the data about data.

 

Variables that report order without natural units are often called ordinal variables. To decide whether to treat an ordinal variable as categorical or quantitative, you still have to consider what you want to learn from it in your study.

 

Models are summaries and simplifications of data that help our understanding in many ways. A model is a simplification of reality that gives us information that we can learn from and use. Without making models for how data vary, we would be limited to reporting only what the data we have says.

 

Don’t label a variable as categorical or quantitative without thinking about the data and what they represent. The same variable can sometimes take on different roles. Do not assume that a variable is quantitative just because its values are numbers. Categories are often given numerical labels. Do not let that fool you into thinking they have quantitative meaning. Look at the context. 

 

Always be skeptical. One reason to analyze data is to discover the truth. Even when you are told a context for the data, it may turn out that the truth is a bit different. The context colors our interpretation of the data, so those who want to influence what you think may slant the context.

 

Data are recorded values, whether numbers or labels, together with their context. A data table is an arrangement of data in which each row represents a case and each column represents a variable. The context ideally tells who was measured, what was measured, how the data were collected, and why the study was performed.

 

An individual about whom or which we have data is a case. A respondent is someone who answers or responds to a survey. A subject is a human experimental unit, also called a participant. A variable holds information about the same characteristic for many cases. A categorical variable names categories with words or numerals.

 

The term nominal can be applied to a variable whose values are used only to name categories. A quantitative variable is a variable in which the numbers are values of measured quantities. A unit is a quantity or amount adopted as a standard of measurement, such as dollars or hours.

 

Metadata is the data about data. It can provide information to uniquely identify cases, making it possible to combine data from different sources, protect privacy, or label cases uniquely. The term ordinal can be applied to a variable whose categorical values possess some kind of order. A model is a description or representation, in mathematical or statistical terms, of the behavior of a phenomenon based on data.

 

Example 1

Because of the difficulty of weighing a dolphin in the ocean, researchers caught and measured 12 dolphins, recording their weight, fin length, body length, and sex. They hoped to find a way to estimate weight from the other more easily determined quantities. 

  1. Who was measured?

12 dolphins

  2. When were the measurements taken?

This information is not given

  3. Where were the measurements taken?

This information is not given

  4. Why were the measurements taken?

To find an easier way to estimate the weight of a dolphin

  5. How did the researchers obtain the measurements?

Researchers collected data on the 12 dolphins they were able to catch

  6. Specify whether the variables are categorical or quantitative.

The variable weight is quantitative and units were not provided

The variable fin length is quantitative and units were not provided

The variable body length is quantitative and units were not provided

The variable sex is categorical

 

Example 2

Researchers investigating the impact of prenatal care on newborn health collected data from 708 births during 1991-1993. They kept track of the mother’s age, the number of weeks the pregnancy lasted, the type of birth, the level of prenatal care the mother had, the weight and sex of the babies, and whether the babies exhibited health problems.

  1. Identify the who for the description of data.

The 708 births

  2. Identify the what for the description of data.

Baby’s health problems, sex of the babies, level of prenatal care, type of birth, weight of the babies, duration of pregnancy, and mother’s age

  3. Identify the when for the description of data.

Between the years 1991-1993

  4. Identify the where for the description of data.

This information is not given

  5. Identify the why for the description of data.

To determine the effect of prenatal care on the babies’ health

  6. Identify the how for the description of data.

This information is not given

Statistics and Problem Solving

A population is the total set of subjects or things we are interested in studying. Populations are defined by what a researcher is studying and can come in all shapes and sizes.

A frame is a list containing all members of the population.

Population parameters are facts about the population. Since parameters are descriptions of the population, a population can have many parameters. Parameters can be averages, percentages, minimums, or maximums. For a specific population at a specific point in time, population parameters do not change.

A sample is a subset of the population which is used to gain insight about the population. Samples are used to represent a larger group, the population. 

A statistic is a fact or characteristic about the sample. For any given sample a statistic is a fixed number. Statistics are used as estimates of population parameters. 
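To make the parameter and statistic distinction concrete, here is a small sketch. The population (ages of 1,000 club members) is simulated and entirely hypothetical, and the seed is fixed only so the run is repeatable.

```python
import random
from statistics import mean

random.seed(0)  # fixed seed so the sketch is repeatable

# Hypothetical population: ages of 1,000 members of a club.
population = [random.randint(18, 80) for _ in range(1000)]
parameter = mean(population)            # population mean: a parameter

# A sample is a subset of the population.
sample = random.sample(population, 50)
statistic = mean(sample)                # sample mean: a statistic

print(f"population mean (parameter) = {parameter:.2f}")
print(f"sample mean (statistic)     = {statistic:.2f}")
```

For this one sample the statistic is a fixed number, but drawing a different sample would usually give a slightly different estimate of the same parameter.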

A process is a method for obtaining a desired result. The idea of a process is closely tied to quality control. In order to improve a process, there must be an understanding of how the process is currently performing. This requires definition and measurement of the process.

The science of statistics is divided into two categories, descriptive and inferential. Descriptive methods describe and summarize data. Descriptive statistics is the collection, organization, and presentation of data.

The objective of inferential statistics is to make reasonable guesses about the population characteristics using sample data. 


Collecting and Analyzing Data

Part of becoming a problem solver and user of statistics is developing an ability to appraise the quality of measurements. When you encounter data, consider whether the concept under study is adequately reflected by the proposed measurements, whether the data are measured accurately, and whether there is a sufficient quantity of data to draw a reasonable conclusion.

Measurement and data are an integral part of science. A general method has been developed for solving research problems. First, gather information about the phenomenon being studied. On the basis of the data, formulate a preliminary generalization or hypothesis. Then collect further data to test the hypothesis. If the data and other subsequent experiments support the hypothesis, it becomes a law.

There are two ways to obtain data, observation and controlled experiments. In a statistical analysis, it is usually not possible to recover from poorly measured concepts or badly collected measurements. 

A response variable measures the outcome of interest in a study. An explanatory variable causes or explains changes in a response variable. Isolating the effects of one variable on another means anticipating potentially confounding variables and designing a controlled experiment to produce data in which the values of the confounding variable are regulated.

Observational data come from measuring things as they are, without controlling the conditions. Such data can be extremely valuable.

Much of the statistical information presented to us is in the form of surveys. So, it is important to understand them and how they are done. In some cases, the purpose of a survey is purely descriptive. However, in many cases the researcher is interested in discovering a relationship.

Data in which the observations are restricted to a set of values that possess gaps is called discrete. Data that can take on any value within some interval is called continuous. The quality of data is referred to as its level of measurement. When analyzing data, you must be exceedingly conscious of the data’s level of measurement because many statistical analyses can only be applied to data that possess a certain level of measurement. 

Data that represents whether a variable possesses some characteristic is called nominal. Ordinal data represents categories that have some associated order. Note that ordinal data is also nominal, but it also possesses the additional property of ordinality. 

If the data can be ordered and the arithmetic difference is meaningful, the data is interval. An example of interval data is temperature measured in degrees Celsius or Fahrenheit. Interval data is numerical data that possesses both the property of ordinality and the interval property. Ratio data is similar to interval data, except that it has a meaningful zero point and the ratio of two data points is meaningful.
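A short sketch of why ratios are not meaningful on an interval scale. It compares two hypothetical temperatures in Celsius (interval) and in Kelvin (a ratio scale with a true zero). The helper `celsius_to_kelvin` is written just for this example.

```python
def celsius_to_kelvin(c):
    """Convert degrees Celsius to kelvins."""
    return c + 273.15

# Two hypothetical temperature readings in degrees Celsius.
a_c, b_c = 10.0, 20.0
a_k, b_k = celsius_to_kelvin(a_c), celsius_to_kelvin(b_c)

diff_c = b_c - a_c    # difference on the Celsius scale
diff_k = b_k - a_k    # difference on the Kelvin scale
ratio_c = b_c / a_c   # ratio on the Celsius scale
ratio_k = b_k / a_k   # ratio on the Kelvin scale

print(f"difference: {diff_c:.2f} C vs {diff_k:.2f} K")
print(f"ratio:      {ratio_c:.3f} (Celsius) vs {ratio_k:.3f} (Kelvin)")
```

The difference agrees on both scales, but the ratio does not, so saying that 20 degrees Celsius is "twice as hot" as 10 degrees Celsius has no physical meaning.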

Qualitative data is data measured on a nominal or ordinal scale. Quantitative data is measured on an interval or ratio scale. 

Time series data originate as measurements taken from some process, usually over equally spaced intervals of time. Processes can be divided into two categories: stationary and nonstationary. All time series that are interesting vary, and the nature of the variability determines how the process is characterized. In a stationary process, the time series varies around some central value and has approximately the same variation over the series. In a nonstationary process, the time series possesses a trend, the tendency for the series to either increase or decrease over time.

Cross-sectional data are measurements created at approximately the same period of time. 

Organization of Data in Statistics

A frequency distribution is a summary technique that organizes data into classes and provides in tabular form a list of the classes along with the number of observations in each class.

The process begins by refining information. An analyst takes the raw data and organizes it by counting the number of observations in each classification.

A frequency distribution is a good way to handle large amounts of data. With it, we can see the overall structure of the data.

There are two steps in creating a frequency distribution:

  1. Choose the classifications
  2. Count the number in each class
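The two steps can be sketched as follows. The exam scores and the class boundaries are hypothetical choices made for the example.

```python
# Hypothetical exam scores.
scores = [52, 67, 71, 74, 78, 81, 85, 88, 90, 95, 63, 77]

# Step 1: choose the classifications
# (lower bound inclusive, upper bound exclusive).
classes = [(50, 60), (60, 70), (70, 80), (80, 90), (90, 100)]

# Step 2: count the number of observations in each class.
distribution = {
    f"{low}-{high - 1}": sum(low <= s < high for s in scores)
    for low, high in classes
}

for label, count in distribution.items():
    print(f"{label}: {count}")
```

Making each class include its lower bound and exclude its upper bound ensures every score lands in exactly one class.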

Graphs are important because they put information in visual form. Individual data values can be lost in a graph, but the clearer overall picture more than makes up for it. Graphing software makes this easy, and many programs are available to create good-looking graphs.

Bar Charts

The bar chart is a simple graph in which the length of each bar corresponds to the number of observations in a category.

They are a good presentation tool and helpful in showing the differences in magnitude. 

Creating a bar chart can get complicated. You should think about size, color, and labeling. 

Pie Charts

Pie charts can represent the same information as a bar chart. The slices in a pie chart are proportional to the total in each category. You can easily compare the total of each category to the total overall. 

When your data is qualitative, choosing categories is pretty easy. However, when your data is quantitative, choosing those categories is more complicated. The reason is that your choices often affect how others will interpret the data. So, you have to be careful when doing this.

Choosing the number of categories is your choice and should depend on the amount of data available. You want enough categories to make the comparisons meaningful but not so many that it is hard to understand. Each situation will be different in this regard. 

 

Relative Frequency Distribution

A relative frequency distribution gives the proportion of the total observations that falls in each category. It lets a person view the number in each category in relation to the total number of observations. Because it changes the frequency in each category to a proportion, it also makes data sets easier to compare. It looks like this:

\[ \text{relative frequency} = \frac{\text{number in category}}{\text{total number}} \]

Cumulative Frequency Distribution

This lets a person look at any category and quickly see how many observations fall in that category or in any category before it. The cumulative frequency is the sum of the frequency of a particular category and all preceding categories.

Cumulative Relative Frequency

The cumulative relative frequency is the proportion of observations in a particular category and all preceding categories. 
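As a small worked illustration with made-up counts, suppose three ordered categories have frequencies 4, 10, and 6, so the total number of observations is 20. The three quantities defined above then work out as:

\[ \text{relative frequencies} = \frac{4}{20},\ \frac{10}{20},\ \frac{6}{20} = 0.20,\ 0.50,\ 0.30 \]

\[ \text{cumulative frequencies} = 4,\ 4 + 10,\ 4 + 10 + 6 = 4,\ 14,\ 20 \]

\[ \text{cumulative relative frequencies} = \frac{4}{20},\ \frac{14}{20},\ \frac{20}{20} = 0.20,\ 0.70,\ 1.00 \]

The last cumulative frequency equals the total number of observations, and the last cumulative relative frequency equals 1.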

Histograms

A histogram is used frequently and reveals the distribution of data. It is a bar graph of the frequency in which the height of each bar corresponds to the frequency of the category. Each category is represented by a vertical bar whose height is proportional to the frequency of the interval. The horizontal boundaries of each vertical bar correspond to the category endpoints. Once the frequency distribution has been calculated, all the information necessary for plotting a histogram is available. 

Stem and Leaf Display

The stem and leaf display is a mix of methods. It is similar to a histogram, but the raw data is not lost in the graph and remains visible to the user. It is useful for ordering the data and detecting patterns in it.

Ordered Array

An ordered array is a listing of all the data in either increasing or decreasing magnitude. Data listed in increasing order is said to be listed in rank order. If listed in decreasing order, it is listed in reverse rank order. Listing data in an order is very useful and usually done. It allows you to scan the data quickly for the largest and smallest values. 

Dot Plots

A dot plot is a graph where each data value is plotted as a point. If there are multiple entries, they are plotted above each other. 

Time Series Data

A time series plot graphs data using time as the horizontal axis. 

Statistical and Critical Thinking

Surveys provide data that enable us to improve products or services. Surveys guide political candidates, shape business practices, influence social media, and affect many aspects of our lives. 

A voluntary response sample is a sample in which respondents themselves decide whether to participate. Those with a strong interest in the topic are more likely to participate. Sample data must be collected in an appropriate way, such as through a process of random selection. If sample data are not collected in an appropriate way, the data may be so completely useless that no amount of statistical torturing can salvage them.

When using methods of statistics with sample data to form conclusions about a population, it is absolutely essential to collect sample data in a way that is appropriate. 

Data are collections of observations, such as measurements, genders, or survey responses. A single data value is called a datum. The term data is plural.

Statistics is the science of planning studies and experiments, obtaining data, and organizing, summarizing, presenting, analyzing, and interpreting those data and then drawing conclusions based on them. 

A population is the complete collection of all measurements or data that are being considered. Typically, a population is the complete collection of data that we would like to make inferences about. 

A census is the collection of data from every member of the population.

A sample is a sub-collection of members selected from a population.

Because populations are often very large, a common objective of the use of statistics is to obtain data from a sample and then use those data to form a conclusion about the population.

A voluntary response sample is one in which the respondents themselves decide whether to be included.

The word statistics is derived from the Latin word status, meaning state. Early uses of statistics involved compilations of data and graphs describing various aspects of a state or country. 

The following types of polls are common examples of voluntary response samples. By their very nature, all are seriously flawed because we should not make conclusions about a population on the basis of samples with a strong possibility of bias.

  1. Internet polls, in which people online can decide whether to respond.
  2. Mail-in polls, in which people can decide whether to reply.
  3. Telephone call-in polls, in which newspaper, radio, or television announcements ask that you call a special number to respond.

Analyze

After completing our preparation by considering the context, source, and sampling method, we begin to analyze the data.

Graph and Explore

An analysis should begin with appropriate graphs and explorations of data.

Apply Statistical Methods

A good statistical analysis does not require strong computational skills. A good statistical analysis does require using common sense and paying careful attention to sound statistical methods.

Conclude

The final step in our statistical process involves conclusions, and we should develop an ability to distinguish between statistical significance and practical significance. 

 

Statistical significance is achieved in a study when we get a result that is very unlikely to occur by chance. A common criterion is that we have statistical significance if the likelihood of an event occurring by chance is 5 percent or less. Getting 98 girls in 100 random births is statistically significant because such an extreme outcome is not likely to result from random chance. Getting 52 girls in 100 births is not statistically significant because that event could easily occur with random chance.
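The two birth examples can be checked with a quick calculation. This sketch assumes each birth is independent with a 0.5 chance of a girl, and it looks only at the chance of getting at least that many girls. Both are simplifying assumptions of the example, and `prob_at_least` is a helper written for it.

```python
from math import comb

def prob_at_least(k, n, p=0.5):
    """P(X >= k) when X counts successes in n independent trials."""
    return sum(comb(n, i) * p**i * (1 - p)**(n - i) for i in range(k, n + 1))

p_98 = prob_at_least(98, 100)   # chance of 98 or more girls in 100 births
p_52 = prob_at_least(52, 100)   # chance of 52 or more girls in 100 births

print(f"P(at least 98 girls) = {p_98:.3g}")
print(f"P(at least 52 girls) = {p_52:.3f}")
print("98 girls significant at 5%?", p_98 <= 0.05)
print("52 girls significant at 5%?", p_52 <= 0.05)
```

Under these assumptions the first probability is far below 5 percent and the second is well above it, which agrees with the conclusions stated above.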

 

Practical significance concerns whether a result matters in practice. It is possible that some treatment or finding is effective, yet common sense might suggest that it does not make enough of a difference to justify its use or to be practical.

 

Misleading Conclusions

When forming a conclusion based on a statistical analysis, we should make statements that are clear even to those who have no understanding of statistics and its terminology. We should carefully avoid making statements not justified by statistical analysis. 

 

Sample Data Reported

When collecting data from people, it is better to take measurements yourself instead of asking subjects to report results. Ask people what they weigh and you are likely to get their desired weights, not their actual weights.

 

Loaded Questions

If survey questions are not worded carefully, the results of a study can be misleading. Survey questions can be loaded or intentionally worded to elicit a desired response. 

 

Order of Questions

Sometimes survey questions are unintentionally loaded by such factors as the order of the items being considered. 

 

Nonresponse

A nonresponse occurs when someone either refuses to respond to a survey question or is unavailable. When people are asked survey questions, some firmly refuse to answer. 

 

Percentages

To find a percentage of an amount, replace the % symbol with division by 100, and then interpret “of” to be multiplication. 

6% of 1200 responses = \(\frac{6}{100} \times 1200 = 72 \)

 

Decimal to Percentage

To convert from a decimal to a percentage, multiply by 100%.

\[ 0.25 \rightarrow 0.25 * 100\% = 25\% \]

 

Fraction to Percentage

To convert from a fraction to a percentage, divide the denominator into the numerator to get an equivalent decimal number. Then multiply by 100 percent.

\[ \frac{3}{4} = 0.75 \rightarrow 0.75 * 100\% = 75\% \]

 

Percentage to Decimal

To convert from a percentage to a decimal number, replace the % symbol with division by 100. 

\[ 85\% = \frac{85}{100} = 0.85 \]
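
The four percentage rules above can be sketched as small helper functions. The function names are made up for illustration; the numbers are the ones used in the examples above.

```python
def percent_of(percent, amount):
    """Replace % with division by 100 and read 'of' as multiplication."""
    return percent * amount / 100

def decimal_to_percent(value):
    """Multiply a decimal by 100 to get a percentage."""
    return value * 100

def fraction_to_percent(numerator, denominator):
    """Divide to get a decimal, then multiply by 100."""
    return numerator / denominator * 100

def percent_to_decimal(percent):
    """Replace % with division by 100."""
    return percent / 100

print(percent_of(6, 1200))        # 72.0
print(decimal_to_percent(0.25))   # 25.0
print(fraction_to_percent(3, 4))  # 75.0
print(percent_to_decimal(85))     # 0.85
```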

A parameter is a numerical measurement describing some characteristic of a population.

A statistic is a numerical measurement describing some characteristic of a sample.

If we have more than one statistic, we have statistics. Another meaning of statistics is the science of planning studies and experiments; obtaining data; and organizing, summarizing, presenting, analyzing, and interpreting those data.

Some data are numbers representing counts or measurements, whereas others are attributes that are not counts or measurements. Quantitative data consist of numbers representing counts or measurements. 

Categorical data consist of names or labels. Categorical data are sometimes coded with numbers, with those numbers replacing names. Although such numbers might appear to be quantitative, they are actually categorical data. 

Include Units of Measurement

With quantitative data, it is important to use the appropriate units of measurement, such as dollars, hours, feet, or meters. We should carefully observe information given about the units of measurement, such as all amounts are in thousands of dollars or all units are in kilograms. 

Discrete or Continuous

Quantitative data can be further described by distinguishing between discrete and continuous types. Discrete data result when the data values are quantitative and the number of values is finite or countable. Continuous (numerical) data result from infinitely many possible quantitative values, where the collection of values is not countable. 

The concept of countable data plays a key role in the preceding definitions, but it is not a particularly easy concept to understand. Continuous data can be measured, but not counted. If you select a particular value from continuous data, there is no next data value.  

Levels of Measurement

Another common way of classifying data is to use four levels of measurement: nominal, ordinal, interval, and ratio. When we are applying statistics to real problems, the level of measurement of the data helps us to decide which procedure to use. Don’t do computations and don’t use statistical methods that are not appropriate for the data. 

Ratio

There is a natural zero starting point and ratios make sense. These are heights, lengths, distances, and volumes.

Interval

Differences are meaningful, but there is no natural zero starting point and ratios are meaningless. Body temperatures in degrees Fahrenheit or Celsius are an example.

Ordinal

Data can be arranged in order, but differences either can’t be found or are meaningless. Examples are ranks of colleges.

Nominal

Categories only. Data cannot be arranged in order. An example is eye colors.

The nominal level of measurement is characterized by data that consist of names, labels, or categories only. The data cannot be arranged in some order. 

Because nominal data lack any ordering or numerical significance, they should not be used for calculations. Numbers such as 1,2,3, or 4 are sometimes assigned to the different categories, but these numbers have no real computational significance and any average calculated from them is meaningless and possibly misleading.

Data are at the ordinal level of measurement if they can be arranged in some order, but differences between data values cannot be determined or are meaningless.

Ordinal data provide information about relative comparisons, but not the magnitudes of the differences. Usually, ordinal data should not be used for calculations such as an average, but this guideline is sometimes ignored.

Data are at the interval level of measurement if they can be arranged in order, and differences between data values  can be found and are meaningful. Data at this level do not have a natural zero starting point at which none of the quantity is present. 

Data are at the ratio level of measurement if they can be arranged in order, differences can be found and are meaningful, and there is a natural zero starting point. For data at this level, differences and ratios are both meaningful.

The distinction between the interval and ratio levels of measurement can be a bit tricky. To distinguish between them, use a ratio test by asking this question: Does use of the term twice make sense? The term twice describes one value being double the other. Twice makes sense for data at the ratio level of measurement, but it does not make sense for data at the interval level of measurement.

For the true zero test, and for ratios to make sense, there must be a value of true zero, where the value of zero indicates that none of the quantity is present, and zero is not simply an arbitrary value on a scale. The temperature of 0°F is arbitrary and does not indicate that there is no heat, so temperatures on the Fahrenheit scale are at the interval level of measurement, not the ratio level.
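
To see why twice fails for Fahrenheit temperatures, the sketch below compares 40°F with 20°F. The two temperatures are arbitrary illustrative values. Converting to kelvins, a scale whose zero is a true zero, shows that 40°F is nowhere near twice as hot as 20°F.

```python
def fahrenheit_to_kelvin(f):
    """Convert degrees Fahrenheit to kelvins (a scale with a true zero)."""
    return (f - 32) * 5 / 9 + 273.15

# Dividing the Fahrenheit numbers directly suggests "twice as hot".
naive_ratio = 40 / 20

# On a scale with a true zero, the ratio is only slightly above 1.
kelvin_ratio = fahrenheit_to_kelvin(40) / fahrenheit_to_kelvin(20)

print(naive_ratio)             # 2.0
print(round(kelvin_ratio, 3))  # about 1.042
```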

Big data refers to data sets so large and so complex that their analysis is beyond the capabilities of traditional software tools. Analysis of big data may require software simultaneously running in parallel on many different computers.

Data science involves applications of statistics, computer science, and software engineering, along with some other relevant fields such as sociology or finance.

Example of Data Set Magnitudes

  • Terabytes 
  • Petabytes
  • Exabytes
  • Zettabytes
  • Yottabytes

 

Statistics in Data Science

The modern data scientist has a solid background in statistics and computer systems as well as expertise in fields that extend beyond statistics. The modern data scientist might be skilled with Hadoop software, which uses parallel processing on many computers for the analysis of big data. The modern data scientist might also have a strong background in some other field such as psychology, biology, medicine, chemistry, or economics.

Missing Data

When collecting sample data, it is quite common to find that some values are missing. Ignoring missing data can sometimes create misleading results. If you make the mistake of skipping over a few different samples when you are manually typing them into a statistics software program, the missing values are not likely to have a serious effect on the results. However, if a survey includes many missing salary entries because those with very low incomes are reluctant to reveal their salaries, those missing values will have the serious effect of making salaries appear higher than they really are.

A data value is missing completely at random if the likelihood of its being missing is independent of its value or any of the other values in the data set. That is, any data value is just as likely to be missing as any other data value. 

A data value is missing not at random if the missing value is related to the reason that it is missing.

Data can be missing completely at random. An example is when someone using a keyboard to manually enter ages of survey respondents makes the mistake of failing to enter the age of 37 years. That data value is missing completely at random.

Biased Results

Based on the two preceding definitions and examples, it makes sense to conclude that if we ignore data missing completely at random, the remaining values are not likely to be biased and good results should be obtained. However, if we ignore data that are missing not at random, it is very possible that the remaining values are biased and results will be misleading.

Correcting for Missing Data

There are different methods for dealing with missing data. One very common method for dealing with missing data is to delete all subjects having any missing values. If the data are missing completely at random, the remaining values are not likely to be biased and good results can be obtained, but with a smaller sample size. If the data are missing not at random, deleting subjects having any missing values can easily result in a bias among the remaining values, so results can be misleading. 

We can also impute missing data values, which means substituting values for them. There are different methods of determining the replacement values, such as using the mean of the other values, or using a randomly selected value from other similar cases, or using a method based on regression analysis. 
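
A small sketch of the salary example, using made-up salaries in thousands of dollars. It assumes the three lowest earners decline to answer, so the data are missing not at random. Both deleting the missing cases and substituting the mean of the observed values leave the estimate too high.

```python
from statistics import mean

# Hypothetical true salaries (thousands of dollars); not real data.
true_salaries = [18, 22, 25, 40, 55, 60, 75, 90]

# Missing not at random: the three lowest earners did not report (None).
reported = [None, None, None, 40, 55, 60, 75, 90]

# Method 1: delete all subjects having missing values.
observed = [s for s in reported if s is not None]

# Method 2: replace each missing value with the mean of the observed values.
observed_mean = mean(observed)
imputed = [s if s is not None else observed_mean for s in reported]

print(mean(true_salaries))  # 48.125
print(mean(observed))       # 64
print(mean(imputed))        # 64
```

In this made-up case neither method removes the bias, which is why it matters to ask why values are missing before choosing a correction.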

When analyzing sample data with missing values, try to determine why they are missing, then decide whether it makes sense to treat the remaining values as being representative of the population. If it appears that there are missing values that are missing not at random, know that the remaining data may well be biased and any conclusions based on those remaining values may well be misleading.

In an experiment, we apply some treatment and then proceed to observe its effects on the individuals. The individuals in experiments are called experimental units and they are often called subjects when they are people. In an observational study, we observe and measure specific characteristics, but we don’t attempt to modify the individuals being studied. 

Experiments are often better than observational studies because well planned experiments typically reduce the chance of having the results affected by some variable that is not part of the study. A lurking variable is one that affects the variables included in the study, but it is not included in the study.

Design of Experiments

Good design of experiments includes replication, blinding, and randomization.

Replication is the repetition of an experiment on more than one individual. Good use of replication requires sample sizes that are large enough so that we can see effects of treatments. 

Blinding is used when the subject doesn’t know whether he or she is receiving a treatment or a placebo. Blinding is a way to get around the placebo effect, which occurs when an untreated subject reports an improvement in symptoms. 

Randomization is used when individuals are assigned to different groups through a process of random selection. The logic behind randomization is to use chance as a way to create two groups that are similar. 
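
A minimal sketch of randomization, assuming 20 subjects identified by the numbers 1 through 20. The function name `randomize_groups` and the fixed seed are illustrative choices; the seed is there only so the example is repeatable.

```python
import random

def randomize_groups(subjects, seed=None):
    """Shuffle the subjects and split them into two groups by chance."""
    rng = random.Random(seed)
    shuffled = list(subjects)
    rng.shuffle(shuffled)
    half = len(shuffled) // 2
    return shuffled[:half], shuffled[half:]

treatment, placebo = randomize_groups(range(1, 21), seed=1)

print(sorted(treatment))
print(sorted(placebo))
```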

A simple random sample of n subjects is selected in such a way that every possible sample of the same size n has the same chance of being chosen.

Unlike careless or haphazard sampling, random sampling usually requires very careful planning and execution. 

Simple Random Sample

A sample of n subjects is selected so that every sample of the same size n has the same chance of being selected

Systematic Sample

Select every kth subject

Convenience Sample

Use data that are very easy to get

Stratified Sample

Subdivide populations into strata or groups with the same characteristics, then randomly sample within those strata.

Cluster Sample

Partition the population into clusters or groups, randomly select some of those clusters, then choose all members of the selected clusters.
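
The sampling methods above can be sketched with Python's `random` module. The population of 100 numbered subjects, the two strata, and the ten clusters are all made up for illustration.

```python
import random

rng = random.Random(0)             # fixed seed so the sketch is repeatable
population = list(range(1, 101))   # hypothetical subject IDs 1..100

# Simple random sample: every sample of size 10 is equally likely.
simple = rng.sample(population, 10)

# Systematic sample: pick a random starting point, then every kth subject.
k = 10
start = rng.randrange(k)
systematic = population[start::k]

# Stratified sample: split into strata, then randomly sample within each.
strata = {"low": population[:50], "high": population[50:]}
stratified = [s for group in strata.values() for s in rng.sample(group, 5)]

# Cluster sample: randomly pick whole clusters and keep all their members.
clusters = [population[i:i + 10] for i in range(0, 100, 10)]
cluster_sample = [s for c in rng.sample(clusters, 2) for s in c]

print(len(simple), len(systematic), len(stratified), len(cluster_sample))
```

A convenience sample has no counterpart here because it involves no random selection at all.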

Multistage Sampling

Professional pollsters and government researchers often collect data by using some combination of the preceding sampling methods. In a multistage sample design, pollsters select a sample in different stages, and each stage might use different methods of sampling.

In a cross sectional study, data are observed, measured, and collected at one point in time, not over a period of time.

In a retrospective study, data are collected from a past time period by going back in time.

In a prospective study, data are collected in the future from groups that share common factors.

Experiments

In an experiment, confounding occurs when we can see some effect, but we can’t identify the specific factor that caused it.

A randomized block design uses the same basic idea as stratified sampling, but randomized block designs are used when designing experiments, whereas stratified sampling is used for surveys.

Matched Pairs Design

Compare two treatment groups by using subjects matched in pairs that are somehow related or have similar characteristics. 

Rigorously Controlled Design

Carefully assign subjects to different treatment groups, so that those given each treatment are similar in the ways that are important to the experiment. This can be extremely difficult to implement, and often we can never be sure that we have accounted for all of the relevant factors. 

Sampling Errors

In statistics, you could use a good sampling method and do everything correctly, and yet it is possible to get wrong results. No matter how well you plan and execute the sample collection process, there is likely to be some error in the results.

A sampling error occurs when the sample has been selected with a random method, but there is a discrepancy between a sample result and the true population result. Such an error results from chance sample fluctuations.

A nonsampling error is the result of human error, including such factors as wrong data entries, computing errors, questions with biased wording, false data provided by respondents, forming biased conclusions, or applying statistical methods that are not appropriate for the circumstances. 

A nonrandom sampling error is the result of using a sampling method that is not random, such as using a convenience sample or a voluntary response sample.

The Gold Standard

Randomization with placebo/treatment groups is sometimes called the gold standard because it is so effective.

What Statistics Is All About

One of the first considerations is designing appropriate studies. The purpose is to collect data. This process can be done with either surveys or experiments. One of the most popular ways to collect data is the observational study, in which data are collected from individuals in a way that does not affect them. Surveys have to be worded carefully to get good information.

An experiment is another popular way to gather data. It involves treatments on participants so that clear comparisons can be made. After treatments are made, responses are recorded.  

Collecting quality data is a major consideration. It really does no good to get bad data. So, studies and experiments must be planned well. Once you have good data, you can make a good report on what you found. To minimize bias in a survey, you have to be random when selecting participants. 

Descriptive Statistics

These are numerical values that describe a data set. This is usually done through different types of categories. If the data are categorical they are usually summarized using the number of individuals in each group. This is called the frequency. If you use the percentage of individuals, it is called the relative frequency.

Numerical data represent measurements or counts. You can do more with numerical data. For example, you can get the measure of center and the measure of spread in the data. 

Some descriptive statistics are more appropriate than others in certain situations. The average is not always the best measure of the center of a data set. 

Charts and Graphs

Data is summarized in a visual way using charts and graphs. These are displays that are organized to give you a big picture of the data. 

Some of the basic graphs used for categorical data include pie charts and bar graphs. These break down variables in the data. 

For numerical data, a different type of graph is needed. Histograms and box plots are usually used to represent numerical data. These types of graphs make it easier to visualize the data.

Distributions

A variable is a characteristic that is being counted or measured. A distribution is a listing of the possible values of a variable and how often they occur. 

Different types of distributions exist for different types of variables.

If a variable is counting the number of successes in a certain number of trials, it has a binomial distribution. 

If the variable takes on values that occur according to a bell-shaped curve, then that variable has a normal distribution.

If the variable is based on sample averages and you have limited data, the t-distribution may be in order.

When it comes to distributions, you need to know how to decide which distribution a particular variable has, how to find probabilities for it, and how to figure out what the long-term average and standard deviation of the outcomes would be.

Performing Analyses

After data has been collected and described, it is time to do the statistical analysis. There are many types of analyses. You have to choose the appropriate type for your data. 

You often see statistics that try to estimate numbers pertaining to an entire population. However, it is just an estimate and most studies only ask a small number of people their questions. What happens is that data is collected on a small sample of people. Sometimes the results they get are very inaccurate. 

Sample results vary from sample to sample, and this amount of variability needs to be reported, but usually it is not. The statistic used to measure and report the level of precision in someone’s sample result is called the margin of error. The range formed by the sample result plus or minus the margin of error is called the confidence interval. 

Hypothesis Tests

One major staple of research studies is called hypothesis testing. A hypothesis test is a technique for using data to validate or invalidate a claim about a population. 

The elements about a population that are most often tested are:

  • The population mean
  • The population proportion
  • The difference in two population means or proportions

Hypothesis tests are used in a host of areas that affect your everyday life, such as medical studies, advertisements, and polling data. Often you only hear the conclusions of hypothesis tests but you don’t see the methods used to come to these conclusions. 

Drawing Conclusions

To perform statistical analyses, researchers use software that depends on formulas. You have to use them correctly, though. One of the most common mistakes made in conclusions is overstating the results. Until you do a controlled experiment, you can’t make a cause-and-effect conclusion based on relationships you find. 

Statistics is about much more than numbers. You need to understand how to make appropriate conclusions from studying data and be smart enough to not believe everything you read. 

Working with Tables and Graphs

When working with large data sets, a frequency distribution is often helpful in organizing and summarizing data. A frequency distribution helps us to understand the nature of the distribution of a data set.

Frequency Distribution 

A frequency distribution or table shows how data are partitioned among several categories by listing the categories along with the number of data values in each of them.

Lower class limits are the smallest numbers that can belong to each of the different classes. Upper class limits are the largest numbers that can belong to each of the different classes. Class boundaries are the numbers used to separate the classes, but without the gaps created by class limits. Class midpoints are the values in the middle of the classes. Class width is the difference between two consecutive lower class limits in a frequency distribution.  

Finding the correct class width can be tricky. For class width, don’t make the most common mistake of using the difference between a lower class limit and an upper class limit. For class boundaries, remember that they split the difference between the end of one class and the beginning of the next class.
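
A short sketch of these definitions, using made-up class limits of 50-59, 60-69, 70-79, and 80-89. It shows that the class width comes from consecutive lower limits, not from the two limits of one class.

```python
# Hypothetical class limits for four classes; not taken from real data.
lower_limits = [50, 60, 70, 80]
upper_limits = [59, 69, 79, 89]

# Class width: difference between two consecutive lower class limits.
class_width = lower_limits[1] - lower_limits[0]

# The common mistake: upper limit minus lower limit of the same class.
wrong_width = upper_limits[0] - lower_limits[0]

# Class midpoints: the value in the middle of each class.
midpoints = [(lo + hi) / 2 for lo, hi in zip(lower_limits, upper_limits)]

# Class boundaries: split the gap between one class and the next.
gap = lower_limits[1] - upper_limits[0]
inner = [(hi + lo) / 2 for hi, lo in zip(upper_limits, lower_limits[1:])]
boundaries = [lower_limits[0] - gap / 2] + inner + [upper_limits[-1] + gap / 2]

print(class_width)  # 10
print(wrong_width)  # 9
print(midpoints)    # [54.5, 64.5, 74.5, 84.5]
print(boundaries)   # [49.5, 59.5, 69.5, 79.5, 89.5]
```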

We construct frequency distributions to:

  1. Summarize large data sets
  2. See the distribution and identify outliers
  3. Have a basis for constructing graphs

Technology can generate frequency distributions but these are the common steps:

  • Select the number of classes, usually between 5 and 20
  • Calculate class width: \(\frac{\text{max data value - min data value}}{\text{number of classes}} \)
  • Round this result to get a convenient number
  • Choose the value for the first lower class limit by using either the min value or a convenient value below the minimum.
  • Using the first lower class limit and the class width, list the other lower class limits.
  • List the lower class limits in a vertical column and then determine and enter the upper class limits.
  • Take each individual data value and put a tally mark in the appropriate class. Add the tally marks to find the total frequency for each class.
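
The steps above can be sketched as follows for a small made-up data set of whole numbers. The choices of four classes, a class width of 10, and a first lower class limit of 10 are judgment calls made for this example.

```python
# Hypothetical integer data values; not real data.
data = [12, 15, 21, 22, 22, 27, 30, 34, 35, 41, 44, 48]

num_classes = 4
raw_width = (max(data) - min(data)) / num_classes  # (48 - 12) / 4 = 9.0
class_width = 10   # raw width rounded up to a convenient number
first_lower = 10   # convenient value just below the minimum of 12

# List the lower class limits, then the matching upper class limits.
lower_limits = [first_lower + i * class_width for i in range(num_classes)]
upper_limits = [lo + class_width - 1 for lo in lower_limits]  # assumes integer data

# Tally each data value into its class.
frequencies = [sum(lo <= x <= hi for x in data)
               for lo, hi in zip(lower_limits, upper_limits)]

for lo, hi, f in zip(lower_limits, upper_limits, frequencies):
    print(f"{lo}-{hi}: {f}")
```

Rounding the raw width of 9 up to 10 matters here: with a width of 9 starting at 12, the largest value of 48 would fall outside the last class.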

Relative Frequency Distribution

A variation of the basic frequency distribution is a relative frequency distribution. Each class frequency is replaced by a relative frequency as a percentage. 

\[ \text{relative frequency} = \frac{\text{frequency for class}}{\text{sum of frequencies}} * 100 \]

This will give you the frequency percentage.

The sum of the percentages in a relative frequency distribution will be very close to 100 percent.

Another variation of a frequency distribution is a cumulative frequency distribution in which the frequency for each class is the sum of the frequencies for that class and all previous classes. 
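
A brief sketch of both variations, starting from made-up class frequencies of 2, 4, 3, and 3.

```python
from itertools import accumulate

# Hypothetical class frequencies; not real data.
frequencies = [2, 4, 3, 3]
n = sum(frequencies)

# Relative frequency as a percentage: class frequency / sum of frequencies * 100.
relative = [f / n * 100 for f in frequencies]

# Cumulative frequency: running total of this class and all previous classes.
cumulative = list(accumulate(frequencies))

print([round(r, 1) for r in relative])  # [16.7, 33.3, 25.0, 25.0]
print(cumulative)                       # [2, 6, 9, 12]
```

The rounded percentages sum to 100.0 here, but rounding can make the total land slightly above or below 100 in other data sets.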

At the beginning we noted that a frequency distribution can help us understand the distribution of a data set, which is the nature or shape of the spread of the data over the range of values. In statistics, we are often interested in determining whether the data have a normal distribution. Data that have an approximately normal distribution are characterized by a frequency distribution with the following features:

  1. The frequencies start low, then increase to one or two high frequencies, and then decrease to a low frequency.
  2. The distribution is approximately symmetric. Frequencies preceding the maximum frequency should be roughly a mirror image of those that follow the maximum frequency.

The presence of gaps can suggest that the data are from two or more different populations.

Comparing two or more relative frequency distributions in one table makes comparisons of data much easier.

While a frequency distribution is a useful tool for summarizing data and investigating the distribution of data, an even better tool is a histogram, which is a graph that is easier to interpret than a table of numbers.

A histogram visually displays the shape of the distribution of the data. It shows the location of the center of the data. Histograms show the spread of data and can also identify outliers.

A histogram is basically a graph of a frequency distribution. Class frequencies should be used for the vertical scale and that scale should be labeled. There is no universal agreement on the procedure for selecting which values are used for the bar locations along the horizontal scale, but it is common to use class boundaries, class midpoints, class limits, or something else. It is often easier for us to use class midpoints for the horizontal scale. Histograms can usually be generated using technology.

A relative frequency histogram has the same shape and horizontal scale as a histogram, but the vertical scale uses relative frequencies instead of actual frequencies. 

The ultimate objective of using histograms is to be able to understand characteristics of data. Exploring the data means to:

  1. Find the center of the data
  2. Find the variation
  3. Find the shape of the distribution
  4. Find any outliers
  5. Find the change of data over time

When a graph is said to be skewed to the right, it means the histogram shape has a tail on the right.

When a graph is said to be skewed to the left, it means the histogram shape has a tail on the left.

A bell-shaped distribution is called a normal distribution and has its highest frequencies in the middle.

A uniform distribution gives a histogram with bars of roughly the same height all the way across.

Many statistical methods require that sample data come from a population having a distribution that is approximately a normal distribution.

In a uniform distribution, the different possible values occur with approximately the same frequency, so the heights of the bars in the histogram are approximately uniform. 

A distribution of data is skewed if it is not symmetric and extends more to one side than to the other. Data skewed to the right, called positively skewed, have a longer right tail.

Data skewed to the left, called negatively skewed, have a longer left tail.

Some really important methods have a requirement that sample data must be from a population having a normal distribution. Histograms can be helpful in determining whether the normality requirement is satisfied, but they are not very helpful with very small data sets.

The population distribution is normal if the pattern of the points in the normal quantile plot is reasonably close to a straight line, and the points do not show some systematic pattern that is not a straight-line pattern.

The population distribution is not normal if the normal quantile plot has either or both of these two conditions:

  1. The points do not lie reasonably close to a straight-line pattern
  2. The points show some systematic pattern that is not a straight-line pattern

Graphs that Enlighten

A dot plot graph is a good type of graph. It consists of a graph of quantitative data in which each data value is plotted as a point above a horizontal scale of values. Dots representing equal values are stacked. 

A dot plot:

  1. Displays the shape of the distribution of data
  2. Usually makes it possible to recreate the original list of data values.

A stem plot is another type of graph and it represents quantitative data by separating each value into two parts: the stem and the leaf. Better stem plots are often obtained by first rounding the original data values. Also, stem plots can be expanded to include more rows and can be condensed to include fewer rows.

Stem plots:

  1. Show the shape of the distribution of data
  2. Retain the original data values
  3. Sort the sample data

A time-series graph is a graph of time-series data, which are quantitative data that have been collected at different points in time, such as monthly or yearly.

Time-series graphs:

  1. Reveal information about trends over time

Bar graphs use bars of equal width to show frequencies of categories of categorical data. The bars may or may not be separated by small gaps.

Bar graphs:

  1. Show the relative distribution of categorical data so that it is easier to compare the different categories.

A Pareto chart is a bar graph for categorical data, with the added stipulation that the bars are arranged in descending order according to frequencies, so the bars decrease in height from left to right. 

Pareto charts:

  1. Show the relative distribution of categorical data so that it is easier to compare the different categories.
  2. Draw attention to the more important categories.

A pie chart is a very common graph that depicts categorical data as slices of a circle, in which the size of each slice is proportional to the frequency count for the category. Although pie charts are very common, they are not as effective as Pareto charts.

Pie charts:

  1. Show the distribution of categorical data in a commonly used format.

Try to never use pie charts because they waste ink on components that are not data, and they lack an appropriate scale.

A frequency polygon uses line segments connected to points located directly above class midpoint values. A frequency polygon is very similar to a histogram, but a frequency polygon uses line segments instead of bars. 

A variation of the basic frequency polygon is the relative frequency polygon, which uses relative frequencies for the vertical scale. An advantage of relative frequency polygons is that two or more of them can be combined on a single graph for easy comparison. 

Graphs that Deceive

Deceptive graphs are commonly used to mislead people. Graphs should be constructed in a way that is fair and objective. 

A common deceptive graph involves using a vertical scale that starts at some value greater than zero to exaggerate differences between groups. This is called a nonzero vertical axis. Always examine a graph carefully to see whether a vertical axis begins at some point other than zero so that differences are exaggerated.

Pictographs are another type of chart that are used to mislead. Data that are one-dimensional in nature are often depicted with two-dimensional objects or three-dimensional objects. By using pictographs, artists can create false impressions that grossly distort differences by using these same principles of basic geometry:

  1. When you double each side of a square, its area doesn’t merely double, it increases by a factor of four
  2. When you double each side of a cube, its volume doesn’t merely double, it increases by a factor of eight

When examining data depicted with a pictograph, determine whether the graph is misleading because objects of area or volume are used to depict amounts that are actually one-dimensional. 
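
The geometry behind these distortions can be written out for a square and a cube with side length \(s\), where \(s\) is a symbol introduced here for illustration:

\[ (2s)^2 = 4s^2 \qquad \text{and} \qquad (2s)^3 = 8s^3 \]

So a pictograph that doubles the height and width of an object to show an amount that has only doubled makes the object look four times as large, or eight times as large if it is drawn as a three-dimensional object.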

For small data sets of 20 values or fewer, use a table instead of a graph. A graph of data should make us focus on the true nature of the data, not on other elements, such as eye-catching but distracting design features. Do not distort data. Construct a graph to reveal the true nature of the data. Almost all of the ink in a graph should be used for the data, not for the design elements.

A correlation exists between two variables when the values of one variable are somehow associated with the values of the other variable.

A linear correlation exists between two variables when there is a correlation and the plotted points of paired data result in a pattern that can be approximated by a straight line. A scatterplot is a plot of paired quantitative data in which the horizontal axis is used for the first variable x and the vertical axis is used for the second variable y. 

The presence of correlation between two variables is not evidence that one of the variables causes the other. We might find a correlation between beer consumption and weight, but we cannot conclude from the statistical evidence that drinking beer has a direct effect on weight. 

A scatterplot can be very helpful in determining whether there is a correlation between the two variables.

The linear correlation coefficient is denoted by r, and it measures the strength of the linear association between two variables. 
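
These notes do not give a formula for r, so the sketch below uses the standard Pearson formula, which is an assumption about which formula is intended. The paired data and the function name `linear_correlation` are made up for illustration.

```python
from math import sqrt

def linear_correlation(xs, ys):
    """Pearson linear correlation coefficient r for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / sqrt(sxx * syy)

# Hypothetical paired data; not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

r = linear_correlation(x, y)
print(round(r, 3))  # 0.775
```

Values of r near 1 or -1 indicate a strong linear association, and values near 0 indicate a weak one.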

When we do conclude that there appears to be a linear correlation between two variables, we can find the equation of the straight line that best fits the sample data, and that equation can be used to predict the value of one variable when given a specific value of the other variable. Instead of using the straight-line equation of \(y = mx + b \) that we have all learned in prior math courses, we use the format \(\hat{y} = b_0 + b_1 x\), where \(b_0\) is the y-intercept and \(b_1\) is the slope.

Given a collection of paired sample data, the regression line, or line of best fit, is the straight line that best fits the scatter plot of the data. 
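
A minimal sketch of finding the line of best fit by least squares, which is the usual meaning of "best fits" but is an assumption here since these notes do not name the method. The names `b0` for the intercept and `b1` for the slope, and the paired data, are chosen for illustration.

```python
def regression_line(xs, ys):
    """Return (intercept, slope) of the least-squares line for paired data."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return intercept, slope

# Hypothetical paired data; not real measurements.
x = [1, 2, 3, 4, 5]
y = [2, 4, 5, 4, 5]

b0, b1 = regression_line(x, y)
predicted = b0 + b1 * 6  # predicted y when x is 6

print(round(b0, 2), round(b1, 2))  # 2.2 0.6
print(round(predicted, 2))         # 5.8
```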

Measures of Center

Measures of center are widely used to provide representative values that summarize data sets.

A measure of center is a value at the center or middle of a data set. 

The mean is generally the most important of all numerical measurements used to describe data. It is what most people call an average.

The mean of a set of data is the measure of center found by adding all of the data values and dividing the total by the number of data values.

Sample means drawn from the same population tend to vary less than other measures of center. The mean of a data set uses every data value. A disadvantage of the mean is that just one extreme value can change the value of the mean substantially. This extreme value is called an outlier. By this definition, we say the mean is not resistant.

A statistic is resistant if the presence of extreme values does not cause it to change very much. 

The definition of the mean can be expressed by the formula:

\[\text{mean} = \frac{\sum x}{n} \]

\(\sum\) (capital sigma) refers to the sum of a set of values. \(x\) represents the individual data values. \(n\) is the number of values.

If the data are from a sample of the population, the mean is denoted by x-bar.

If the data are from the entire population, the mean is denoted by mu.

Sample statistics are usually represented by English letters and population parameters are usually represented by Greek letters.

\(\sum\) denotes the sum of a set of data values.

\(x\) is the variable usually used to represent the individual data values.

\(n\) represents the number of data values in a sample.

\(N\) represents the number of data values in a population.

Avoid the term average when referring to a measure of center. The word average is often used for the mean, but it is ambiguous because it is sometimes used for other measures of center as well.

The median can be thought of as a middle value. More precisely, the median of a data set is the measure of center that is the middle value when the original data values are arranged in order of increasing or decreasing magnitude. 

The median does not change by large amounts when we include just a few extreme values, so the median is a resistant measure of center. The median does not directly use every data value.

The median of a sample is sometimes denoted by x-tilde, M, or Med. To find the median, first sort the values.

If the number of data values is odd, the median is the number located in the exact middle of the sorted list.

If the number of data values is even, the median is found by computing the mean of the two middle numbers in the sorted list. 

Mode isn’t used much with quantitative data, but it is the only measure of center that can be used with qualitative data. The mode of a data set is the value that occurs with the greatest frequency. 

The mode can be found with qualitative data. A data set can have no mode, one mode, or multiple modes. When two data values occur with the same greatest frequency, each one is a mode and the data set is said to be bimodal. When more than two data values occur with the same greatest frequency, each is a mode and the data set is said to be multimodal. When no data value is repeated, we say there is no mode.

Midrange is another measure of center. The midrange of a data set is the measure of center that is the value midway between the max and min values in the original data set. It is found by adding the max data value to the min data value and then dividing the sum by 2.

Because the midrange uses only the max and min values, it is very sensitive to those extremes so the midrange is not resistant. In practice, the midrange is rarely used, but it has 3 redeeming features:

  1. It is very easy to compute
  2. It helps reinforce the very important point that there are several different ways to define the center of a data set.
  3. The value of the midrange is sometimes used incorrectly for the median, so confusion can be reduced by clearly defining the midrange along with the median. 
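As a rough illustration of the four measures of center described above, here is a minimal sketch using Python's standard `statistics` module. The data values are made up, and `midrange` is a helper name I chose for this example, not something from a library.

```python
import statistics

def midrange(values):
    """Value midway between the maximum and the minimum (hypothetical helper)."""
    return (max(values) + min(values)) / 2

data = [2, 3, 3, 5, 7, 10]  # made-up sample values

mean = statistics.mean(data)        # sum of the values divided by n
median = statistics.median(data)    # n is even, so the mean of the two middle values
modes = statistics.multimode(data)  # every value that occurs with the greatest frequency
mid = midrange(data)

print(mean, median, modes, mid)

# Replacing the largest value with an outlier moves the mean and midrange a lot,
# while the median stays put (the median is resistant).
with_outlier = [2, 3, 3, 5, 7, 1000]
print(statistics.mean(with_outlier), statistics.median(with_outlier), midrange(with_outlier))
```

The second print is there to show the idea of resistance: only the median is unchanged by the extreme value.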

When calculating measures of center, we often need to round the result.

For the mean, median, and midrange, carry one more decimal than is present in the original set of values.

For the mode, leave the value as is without rounding.

When applying any rounding rules, round only the final result, not anything before that.

We can always calculate measures of center from a sample of numbers, but we should always think about whether it makes sense to do that. 

For example, it makes no sense to do numerical calculations with data at the nominal level of measurement. We should also think about the sampling method used to collect data. If the sampling method is not sound, the statistics we obtain may be very misleading. 

Measures of Variation

 

To understand variation, we begin by introducing the range. The range of a set of data values is the difference between the max data value and the min data value. The range uses only the maximum and the minimum data values, so it is very sensitive to extreme values. It is not resistant. Because the range uses only the max and min values, it does not take every value into account and therefore does not truly reflect the variation among all of the data values.

\[ \text{Range = max value - min value} \]

Range Rule of Thumb

The range rule of thumb is a quick way to ballpark the standard deviation: take 25% of the range of the data.

\[ s \approx \frac{\text{range}}{4} \]

Standard Deviation of a Sample

The standard deviation is the measure of variation most commonly used in statistics. It is a measure of how much data values deviate away from the mean. The standard deviation found from sample data is a statistic denoted by \(s\).

The symbol for sample standard deviation is \(s\).

The symbol for population standard deviation is \(\sigma\)

The symbol for sample variance is \(s^2\)

The symbol for population variance is \(\sigma^{2}\)

The standard deviation is a measure of how much data values deviate from the mean. The value of the standard deviation is never negative. It is zero only when all of the data values are exactly the same. Larger values indicate greater amounts of variation. The standard deviation can increase dramatically with one or more outliers. The units of the standard deviation are the same as the units of the original data values.

Here are the steps to finding standard deviation:

  1. Find the mean of your data values
  2. Subtract the mean from each individual sample value
  3. Square each of the deviations obtained from the previous step
  4. Add all of the squares obtained from previous step
  5. Divide the total from previous step by n-1, which is 1 less than the total number of data values present
  6. Find the square root of the result of the previous step.
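The six steps above amount to the usual formula \(s = \sqrt{\frac{\sum (x - \bar{x})^2}{n-1}}\). Here is a minimal sketch that follows the steps one at a time; the function name `sample_std` and the data values are my own for illustration.

```python
import math
import statistics

def sample_std(values):
    """Sample standard deviation, computed step by step (illustrative helper)."""
    n = len(values)
    mean = sum(values) / n                    # step 1: mean
    deviations = [x - mean for x in values]   # step 2: subtract the mean from each value
    squares = [d ** 2 for d in deviations]    # step 3: square each deviation
    total = sum(squares)                      # step 4: add the squares
    variance = total / (n - 1)                # step 5: divide by n - 1
    return math.sqrt(variance)                # step 6: square root

data = [1, 3, 14]  # made-up sample: mean 6, squared deviations 25, 9, 64
print(sample_std(data))
print(statistics.stdev(data))  # the standard library's sample standard deviation, for comparison
```

Both lines should agree, since `statistics.stdev` also divides by n - 1.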

Standard Deviation of a Population

A different formula is used to find the standard deviation of a population: we divide by the population size N instead of n-1. When using a calculator, check whether it is giving you the sample or the population standard deviation. The variance of a set of values is a measure of variation equal to the square of the standard deviation.

The units of the variance are the squares of the units of the original data values. The value of the variance can increase dramatically with the inclusion of outliers. So, the variance is not resistant. The value of the variance is never negative. It is zero only when all of the data values are the same number. 

In measuring variation in a set of sample data, it makes sense to begin with the individual amounts by which values deviate from the mean. It makes sense to combine those deviations into one number that can serve as a measure of variation. We do not want to simply add the deviations, because their sum is always zero. Instead, we can use the absolute values of the deviations. When we find the mean of those absolute values, we get the mean absolute deviation, which is the mean distance of the data from the mean.

Computation of the mean absolute deviation uses absolute values, so it uses an operation that is not algebraic. The use of absolute values would be simple but it would create algebraic difficulties in inferential statistics. The standard deviation has the advantage of using only algebraic operations. Because it is based on the square root of a sum of squares, the standard deviation closely parallels distance formulas found in algebra. There are many instances where a statistical procedure is based on a similar sum of squares. Consequently, instead of using absolute values, we square all deviations so that they are nonnegative and those squares are used to calculate the standard deviation. 

After finding all of the individual squared deviations, we combine them by finding their sum. We then divide by n-1 because there are only n-1 values that can be assigned without constraint. With a given mean, we can use any numbers for the first n-1 values, but the last value will then be automatically determined. With division by n-1, sample variances tend to center around the value of the population variance. With division by n, sample variances tend to underestimate the value of the population variance.
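To see the effect of dividing by n-1 versus n, here is a small sketch under my own assumptions: a made-up population of three values, and every possible sample of size 2 drawn with replacement. The helper name `variance` and its `divisor_offset` parameter are hypothetical. Exact fractions are used so no rounding gets in the way.

```python
from fractions import Fraction
from itertools import product

def variance(values, divisor_offset):
    """Sum of squared deviations divided by (n - divisor_offset). Illustrative helper."""
    n = len(values)
    mean = Fraction(sum(values), n)
    squared = sum((x - mean) ** 2 for x in values)
    return squared / (n - divisor_offset)

population = [4, 5, 9]             # made-up tiny population
pop_var = variance(population, 0)  # population variance: divide by N

# All 9 equally likely samples of size 2, selected with replacement.
samples = list(product(population, repeat=2))
avg_n_minus_1 = sum(variance(s, 1) for s in samples) / len(samples)
avg_n = sum(variance(s, 0) for s in samples) / len(samples)

print(pop_var, avg_n_minus_1, avg_n)
```

For this population the average of the n-1 sample variances equals the population variance (14/3), while the average with division by n comes out at half of that (7/3).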

A concept helpful in interpreting the value of the standard deviation is the empirical rule. This rule states that for data sets having a distribution that is approximately bell-shaped, the following properties apply:

  1. About 68 percent of all values fall within 1 standard deviation of the mean
  2. About 95 percent of all values fall within 2 standard deviations of the mean
  3. About 99.7 percent of all values fall within 3 standard deviations of the mean

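Another concept helpful in understanding a value of a standard deviation is Chebyshev’s theorem: for any data set, the proportion of values lying within K standard deviations of the mean is at least \(1 - \frac{1}{K^2}\), where K is any number greater than 1. The empirical rule applies only to data sets with bell-shaped distributions, but Chebyshev’s theorem applies to any data set. Unfortunately, the results are only lower limits (at least 75 percent of values within 2 standard deviations, at least 89 percent within 3), so this theorem has limited usefulness.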
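Chebyshev’s theorem is commonly stated as: the proportion of any data set lying within K standard deviations of the mean is at least \(1 - \frac{1}{K^2}\), for K greater than 1. A minimal sketch of that bound, with a function name of my own choosing, shows how much weaker it is than the empirical rule:

```python
def chebyshev_lower_bound(k):
    """Minimum proportion of values within k standard deviations of the mean."""
    if k <= 1:
        raise ValueError("k must be greater than 1")
    return 1 - 1 / k ** 2

for k in (2, 3):
    print(k, round(chebyshev_lower_bound(k), 3))
```

For K = 2 the bound is 75 percent (compared with about 95 percent for bell-shaped data), and for K = 3 it is roughly 89 percent (compared with about 99.7 percent).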

If the population mean is \(\mu\) and the population standard deviation is \(\sigma\), then the range rule of thumb for identifying significant values is as follows:

Significantly low values are \(\mu - 2\sigma\) or lower

Significantly high values are \(\mu + 2\sigma\) or higher.

Values that are not significant lie between \(\mu - 2\sigma\) and \(\mu + 2\sigma\).

Measures Of Relative Standing

Measures of relative standing are numbers showing the location of data values relative to the other values within the same data set.

A z score is found by converting a value to a standardized scale. This definition shows that a z score is the number of standard deviations that a data value is away from the mean.

The z score is calculated by using:

\[z = \frac{x - \bar{x}}{s}\]

Or

\[z = \frac{x - \mu}{\sigma}\]

A z score is the number of standard deviations that a given value is above or below the mean.

Z scores are expressed as numbers with no units of measurement.

A data value is significantly low if its z score is less than or equal to -2 or the value is significantly high if its z score is greater than or equal to +2.

If an individual data value is less than the mean, its corresponding z score is a negative number.

A value is significantly low or significantly high if it is at least two standard deviations away from the mean. It follows that significantly low values have z scores less than or equal to -2 and significantly high values have z scores greater than or equal to +2. If a value is in between these values then it is not significant.
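Here is a short sketch of the z score formula and the "at least two standard deviations" cutoff described above. The mean, standard deviation, and data values are made up, and the function names are my own.

```python
def z_score(x, mean, std_dev):
    """Number of standard deviations that x lies above (+) or below (-) the mean."""
    return (x - mean) / std_dev

def classify(z):
    """Apply the z <= -2 / z >= +2 cutoffs for significant values."""
    if z <= -2:
        return "significantly low"
    if z >= 2:
        return "significantly high"
    return "not significant"

mean, std_dev = 100, 15  # made-up values for illustration
for x in (65, 100, 130):
    z = z_score(x, mean, std_dev)
    print(x, round(z, 2), classify(z))
```

Note that a value below the mean (65 here) produces a negative z score, and a value exactly 2 standard deviations above the mean (130) already counts as significantly high.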

A z score is a measure of position, in the sense that it describes the location of a value relative to the mean. Percentiles and quartiles are other measures of position useful for comparing values within the same data set or between different data sets.

Percentiles

Percentiles are one type of quantiles or fractiles which partition data into groups with roughly the same number of values in each group.

The 50th percentile has about 50% of the data values below and above it.

The process of finding the percentile that corresponds to a particular data value is given by the following formula:

\[\text{percentile} = \frac{\text{number of values less than x}}{\text{total number of values}}*100\]
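As a quick sketch of the formula above, the function below counts the values less than x and converts that count to a percentile. The function name and the data are assumptions for illustration, and rounding the result to the nearest whole number is my own choice here.

```python
def percentile_of_value(values, x):
    """Percentile of x: (number of values less than x) / (total number of values) * 100."""
    below = sum(1 for v in values if v < x)
    return round(100 * below / len(values))

data = [12, 15, 18, 20, 22, 25, 30, 31, 40, 45]  # made-up data values
print(percentile_of_value(data, 30))  # 6 of the 10 values are less than 30
```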

Notation

  • n = total number of values in the data set
  • k = percentile being used, for example k=25
  • L = locator that gives the position of a value
  • \(P_k\) = kth percentile

Algorithm

  1. Sort the data from lowest to highest.
  2. Compute \(L=\frac{k}{100}*n\) where n = number of values and k = percentile in question.
  3. Is L a whole number? If yes, the value of the kth percentile is midway between the Lth value and the next value in the sorted set of data: find \(P_k\) by adding the Lth value and the next value and dividing the total by 2. If no, change L by rounding it up to the next larger whole number; the value of \(P_k\) is then the Lth value, counting from the lowest.
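The locator algorithm above can be sketched as follows. This assumes k is a whole number from 1 to 99; the function name and the data are made up for illustration. Other software often uses slightly different percentile conventions, so results may not match a calculator exactly.

```python
def kth_percentile(values, k):
    """Value of the kth percentile using the locator L = (k / 100) * n."""
    if not 0 < k < 100:
        raise ValueError("k must be between 1 and 99")
    data = sorted(values)                       # sort from lowest to highest
    n = len(data)
    whole, remainder = divmod(k * n, 100)       # L = whole + remainder / 100
    if remainder == 0:
        # L is a whole number: midway between the Lth value and the next value.
        # Positions are 1-based in the notes, so the Lth value is data[whole - 1].
        return (data[whole - 1] + data[whole]) / 2
    # L is not a whole number: round up, then take the Lth value.
    return data[whole]                          # index `whole` is position whole + 1

data = [12, 15, 18, 20, 22, 25, 30, 31, 40, 45]  # made-up data values
print(kth_percentile(data, 25))  # L = 2.5, rounded up to 3, so the 3rd value
print(kth_percentile(data, 50))  # L = 5, so midway between the 5th and 6th values
```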

Quartiles

Just as there are 99 percentiles that divide the data into 100 groups, there are three quartiles that divide the data into four groups.

Quartiles are measures of location, Q1,Q2, and Q3, which divide a set of data into four groups with about 25% of the values in each group. 

Interquartile range = \(Q_3 - Q_1\)

Semi-interquartile range = \(\frac{Q_3 - Q_1}{2}\)

Midquartile = \(\frac{Q_3 + Q_1}{2}\)

10-90 percentile range = \(P_{90} - P_{10}\)

Boxplots

The values of the minimum, maximum, and three quartiles are used for the summary and construction of boxplot graphs.

For a set of data the summary consists of these 5 values:

  • Minimum
  • First quartile, Q1
  • Second quartile, Q2
  • Third quartile, Q3
  • Maximum

A boxplot is a graph of a data set that consists of a line extending from the minimum value to the maximum value, and a box with lines drawn at the first quartile, the median, and the third quartile.
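Here is a minimal sketch of the 5-number summary that a boxplot is drawn from. It treats the quartiles as the 25th, 50th, and 75th percentiles found with the locator method from the Percentiles section; that choice, the function names, and the data are my assumptions, and other tools may compute quartiles slightly differently.

```python
def five_number_summary(values):
    """Return (minimum, Q1, Q2, Q3, maximum) for a list of numbers."""
    data = sorted(values)
    n = len(data)

    def percentile(k):
        # Locator L = (k / 100) * n, split into whole part and remainder.
        whole, remainder = divmod(k * n, 100)
        if remainder == 0:
            return (data[whole - 1] + data[whole]) / 2  # midway between Lth and next value
        return data[whole]                              # L rounded up, 1-based position

    return data[0], percentile(25), percentile(50), percentile(75), data[-1]

data = [1, 3, 4, 6, 7, 8, 10, 12]  # made-up data values
minimum, q1, q2, q3, maximum = five_number_summary(data)
print(minimum, q1, q2, q3, maximum)
print("interquartile range:", q3 - q1)
```

The box of the boxplot would span from Q1 to Q3, with a line at Q2 (the median), and the line through it would extend from the minimum to the maximum.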

A boxplot can often be used to identify skewness. A distribution is skewed if it is not symmetric and extends more to one side than to the other.

Basics of Probability

An event is any collection of results or outcomes of a procedure.

A simple event is an outcome or an event that cannot be further broken down into simpler components.

The sample space for a procedure consists of all possible simple events. That is, the sample space consists of all outcomes that cannot be broken down any further.

Simple Events

With one birth, the result of 1 girl is a simple event and the result of 1 boy is another simple event. They are individual simple events because they cannot be broken down any further.

With three births, the result of 2 girls followed by a boy is a simple event.

When rolling a single die, the outcome of 5 is a simple event, but the outcome of an even number is not a simple event.

Simple Events and Sample Spaces

With three births, the event of 2 girls and 1 boy is not a simple event because it can occur with different simple events.

With three births, the sample space consists of the eight different simple events.
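The three-births example can be checked by listing the sample space directly. In this sketch, "b" and "g" are my own shorthand for boy and girl.

```python
from itertools import product

# Every simple event for three births, e.g. "ggb" = girl, girl, boy.
sample_space = ["".join(outcome) for outcome in product("bg", repeat=3)]
print(len(sample_space), sample_space)

# The event "2 girls and 1 boy" is not a simple event:
# it is made up of several different simple events.
two_girls_one_boy = [s for s in sample_space if s.count("g") == 2]
print(two_girls_one_boy)
```

The listing shows eight simple events, three of which make up the event of 2 girls and 1 boy.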

Probability plays a central role in the important statistical method of hypothesis testing. Statisticians make decisions using data by rejecting explanations based on very low probabilities.

In probability, we deal with procedures that produce outcomes.

Notation for Probabilities

P denotes a probability

A,B, and C denote specific events

P(A) denotes the probability of event A occurring

Three Approaches to Finding the Probability

  1. Relative Frequency Approximation - Conduct (or observe) a procedure and count the number of times that event A occurs. P(A) is then approximated as follows: \(P(A) = \frac{\text{number of times A occurred}}{\text{number of times procedure was repeated}}\)
  2. Classical Approach to Probability - If a procedure has n different simple events that are equally likely, and if event A can occur in s different ways, then: \(P(A)=\frac{\text{number of ways A occurs}}{\text{number of different simple events}}=\frac{s}{n}\)
  3. Subjective Probabilities - P(A), the probability of event A, is estimated by using knowledge of the relevant circumstances.

Simulations

Sometimes none of the preceding three approaches can be used. A simulation of a procedure is a process that behaves in the same ways as the procedure itself so that similar results are produced. Probabilities can sometimes be found by using a simulation.

Rounding Probabilities

When expressing the value of a probability, either give the exact fraction or decimal or round off final decimal results to three significant digits. When a probability is not a simple fraction such as \(\frac{2}{3}\), express it as a decimal so that the number can be better understood.

Law of Large Numbers

As a procedure is repeated again and again, the relative frequency probability of an event tends to approach the actual probability. It tells us that relative frequency approximations tend to get better with more observations. This law reflects a simple notion supported by common sense: a probability estimate based on only a few trials can be off by a substantial amount, but with a very large number of trials, the estimate tends to be much more accurate.
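A simulation is a convenient way to watch the law of large numbers at work. The sketch below rolls a simulated fair die and tracks the relative frequency of an even number, whose actual probability is 3/6 = 0.5. The function name, the seed, and the numbers of rolls are arbitrary choices of mine.

```python
import random

def relative_frequency_of_even(num_rolls, rng):
    """Relative frequency approximation of P(even) from simulated die rolls."""
    evens = sum(1 for _ in range(num_rolls) if rng.randint(1, 6) % 2 == 0)
    return evens / num_rolls

rng = random.Random(0)  # fixed seed so the run is repeatable
for num_rolls in (10, 1_000, 100_000):
    print(num_rolls, relative_frequency_of_even(num_rolls, rng))
```

With only 10 rolls the estimate can be well away from 0.5, while with 100,000 rolls it tends to land very close to it.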

Don’t make the common mistake of finding a probability value by mindlessly dividing a smaller value by a larger number. Instead, think carefully about the numbers involved and what they represent. Carefully identify the total number of items being considered. 

Complementary Events

Sometimes we need to find the probability that an event does not occur. The complement of event A, denoted by \(\bar{A}\), consists of all outcomes in which event A does not occur.

Identifying Significant Results

If, under a given assumption, the probability of a particular observed event is very small and the observed event occurs significantly less than or significantly greater than what we typically expect with that assumption, we conclude that the assumption is probably not correct.

We can use probabilities to identify values that are significantly low or significantly high.

  1. High number of successes: x successes among n trials is a significantly high number of successes if the probability of x or more successes is unlikely with a probability of 0.05 or less. 
  2. Low number of successes: x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is unlikely with a probability of 0.05 or less.

Odds

Expressions of likelihood are often given as odds, such as 50:1. Here are advantages of probabilities and odds:

  1. Odds make it easier to deal with money transfers associated with gambling.
  2. Probabilities make calculations easier, so they tend to be used by statisticians, mathematicians, scientists, and researchers in all fields.

In the three definitions that follow, the actual odds against and the actual odds in favor reflect the actual likelihood of an event, but the payoff odds describe the payoff amounts that are determined by gambling houses. 

The actual odds against event A occurring are the ratio \(P(\bar{A}) / P(A) \), usually expressed in the form of a:b, where a and b are integers.

The actual odds in favor of event A occurring are the ratio \(P(A) / P(\bar{A}) \), which is the reciprocal of the actual odds against that event. If the odds against an event are a:b, then the odds in favor are b:a.

The payoff odds against event A occurring are the ratio of net profit(if you win) to the amount bet.

Payoff odds against event A = net profit:amount bet

If you bet $5 on the number 13 in roulette, your probability of winning is \(\frac{1}{38}\), but the payoff odds are given by the casino as 35:1.

With P(13) = \(\frac{1}{38}\) and P(not 13) = \(\frac{37}{38}\), we get the actual odds against 13:

\[\frac{P(\text{not } 13)}{P(13)} = \frac{37/38}{1/38} = \frac{37}{1} \text{ or } 37:1\]
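The roulette numbers can be checked with exact fractions. In this sketch the function name is my own, and the net profit line simply applies the definition of payoff odds (net profit : amount bet) to the $5 bet.

```python
from fractions import Fraction

def actual_odds_against(p_event):
    """Actual odds against an event, P(not A) / P(A), as a pair (a, b) meaning a:b.

    Expects p_event as a Fraction so the ratio stays exact.
    """
    ratio = (1 - p_event) / p_event
    return ratio.numerator, ratio.denominator

p_13 = Fraction(1, 38)
print(actual_odds_against(p_13))  # actual odds against 13

# Payoff odds of 35:1 mean a net profit of $35 for every $1 bet.
bet = 5
payoff_a, payoff_b = 35, 1
net_profit = bet * payoff_a // payoff_b
print(net_profit)
```

The gap between the actual odds against (37:1) and the payoff odds (35:1) is where the casino makes its money.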


Addition and Multiplication of Probabilities

Addition Rule

The addition rule is a tool for finding P(A or B), which is the probability that either event A occurs or event B occurs as the single outcome of a procedure. The word “or” in the addition rule is associated with the addition of probabilities.

Multiplication Rule

This section also presents the basic multiplication rule used for finding P(A and B), which is the probability that event A occurs and event B occurs. The word “and” in the multiplication rule is associated with the multiplication of probabilities.

Compound Event

A compound event is any event combining two or more simple events.

Addition Rule

Here is the notation for the addition rule. P(A or B) = P(in a single trial, event A occurs or event B occurs or they both occur).

Intuitive Addition Rule

To find P(A or B), add the number of ways event A can occur and the number of ways event B can occur, but add in such a way that every outcome is counted only once. P(A or B) is equal to that sum, divided by the total number of outcomes in the sample space.

Formal Addition Rule

P(A or B) = P(A) + P(B) - P(A and B)

Where P(A and B) denotes the probability that A and B both occur at the same time as an outcome in a trial of a procedure.

Disjoint Events and the Addition Rule

Events A and B are disjoint or mutually exclusive if they cannot occur at the same time. That is, disjoint events do not overlap.

Disjoint Events

Event A - Randomly selecting someone for a clinical trial who is a male.

Event B - Randomly selecting someone for a clinical trial who is a female.

Events That Are Not Disjoint

Event A - Randomly selecting someone taking a statistics course.

Event B - Randomly selecting someone who is a female.

Complementary Events and the Addition Rule

We use \(\bar{A}\) to indicate that event A does not occur. Common sense dictates this principle: we are certain (with probability 1) that event A either occurs or does not occur, so it follows that \(P(A \text{ or } \bar{A}) = 1\). Because events \(A\) and \(\bar{A}\) must be disjoint, we can use the addition rule to express this principle as follows:

\[P(A \text{ or } \bar{A}) = P(A) + P(\bar{A}) = 1 \]

Rule of Complementary Events

\[ P(A) + P(\bar{A}) = 1 \]

\[ P(\bar{A}) = 1 - P(A) \]

\[ P(A) = 1 - P(\bar{A}) \]

Multiplication Rule

P(A and B) = P(event A occurs in a first trial and event B occurs in a second trial)

P(B | A) represents the probability of event B occurring after it is assumed that event A has already occurred.

Intuitive Multiplication Rule

To find the probability that event A occurs in one trial and event B occurs in another trial, multiply the probability of event A by the probability of event B, but be sure that the probability of event B is found by assuming that event A has already occurred.

Formal Multiplication Rule

P(A and B) = P(A) * P(B | A)

Independence and the Multiplication Rule

Two events A and B are independent if the occurrence of one does not affect the probability of the occurrence of the other. Several events are independent if the occurrence of any does not affect the probabilities of the occurrence of the others. If A and B are not independent, they are said to be dependent.

Sampling

In the world of statistics, sampling methods are critically important.

Sampling with replacement: Selections are independent events.

Sampling without replacement: Selections are dependent events.

Treating Dependent Events as Independent

When sampling without replacement and the sample size is no more than 5% of the size of the population, treat the selections as being independent, even though they are actually dependent.

Redundancy

The principle of redundancy is used to increase the reliability of many systems. Our eyes have passive redundancy in the sense that if one of them fails, we continue to see. An important finding of modern biology is that genes in an organism can often work in place of each other. Engineers often design redundant components so that the whole system will not fail because of the failure of a single component.

 

When randomly selecting an adult, A denotes the event of selecting someone with blue eyes. What do \(P(A)\) and \(P(\bar{A})\) represent?

\(P(A)\) represents the probability of selecting an adult with blue eyes.

\(P(\bar{A})\) represents the probability of selecting an adult who does not have blue eyes.

 

There are 15,958,866 adults in a region. If a polling organization randomly selects 1235 adults without replacement, are the selections independent or dependent? If the selections are dependent, can they be treated as independent for the purposes of calculations?

The selections are dependent because the selection is done without replacement.

Yes, because the sample size is less than 5% of the population (5% of 15,958,866 is about 797,943, and 1235 is far below that).
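The 5% guideline is easy to check mechanically. A tiny sketch, with a function name of my own choosing:

```python
def can_treat_as_independent(sample_size, population_size):
    """5% guideline: the sample is no more than 5% of the population."""
    return sample_size <= 0.05 * population_size

print(can_treat_as_independent(1235, 15_958_866))
print(can_treat_as_independent(10, 100))  # 10% of the population, so the guideline fails
```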

When randomly selecting an adult, let B represent the event of randomly selecting someone with type B blood. Write a sentence describing what the rule of complements below is telling us.

\(P(B \text{ or } \bar{B}) = 1\)

 It is certain that the selected adult has type B blood or does not have type B blood.

A research center poll showed that 76% of people believe that it is morally wrong to not report all income on tax returns. What is the probability that someone does not have this belief?

1 - .76 = .24

Find the indicated complement.

A certain group of women has a 0.2% rate of red/green color blindness. If a woman is randomly selected, what is the probability that she does not have this color blindness?

A rate of 0.2% is a probability of .002, so the complement is 1 - .002 = .998

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.

Restaurant: A, B, C, D

Order accurate: 316, 266, 250, 125

Order not accurate: 32, 56, 37, 20

If one order is selected, find the probability of getting food that is not from restaurant A.

Add up all of the orders from B, C, and D, then divide by the total number of orders from A, B, C, and D.

754/1102=.684

Use the data in the following table which lists drive-thru order accuracy at popular fast food chains. Assume that orders are randomly selected from those included in the table.

Restaurant: A, B, C, D

Order accurate: 320, 260, 236, 149

Order not accurate: 39, 59, 32, 12

If one order is selected, find the probability of getting an order that is not accurate.

Add up the inaccurate orders, then divide by the total number of orders.

142/1107= .128

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains. 

Restaurant: A, B, C, D

Order accurate: 321, 280, 244, 129

Order not accurate: 39, 51, 30, 14

If one order is selected, find the probability of getting an order from restaurant A or an order that is accurate. Are the events of selecting an order from restaurant A and selecting an accurate order disjoint events?

The formal addition rule is \( P(A \text{ or } B) = P(A) + P(B) - P(A \text{ and } B) \)

Equivalently, count every qualifying order exactly once:

Accurate orders = 974

Inaccurate orders from restaurant A = 39

Add together to get 1013

1013/1108=.914

The events are not disjoint, because the 321 accurate orders from restaurant A belong to both events.
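The same answer comes out of the formal addition rule term by term. In this sketch the dictionary layout is my own, and I am reading the first row of the table as accurate orders and the second row as inaccurate orders.

```python
from fractions import Fraction

accurate = {"A": 321, "B": 280, "C": 244, "D": 129}
not_accurate = {"A": 39, "B": 51, "C": 30, "D": 14}

total = sum(accurate.values()) + sum(not_accurate.values())

p_from_a = Fraction(accurate["A"] + not_accurate["A"], total)  # P(order is from A)
p_accurate = Fraction(sum(accurate.values()), total)           # P(order is accurate)
p_both = Fraction(accurate["A"], total)                        # P(from A and accurate)

# Formal addition rule: subtract the overlap so it is not counted twice.
p_a_or_accurate = p_from_a + p_accurate - p_both
print(total, p_a_or_accurate, round(float(p_a_or_accurate), 3))
```

Because `p_both` is not zero, the two events overlap, which is another way of seeing that they are not disjoint.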

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.

Restaurant: A, B, C, D

Order accurate: 367, 255, 206, 176

Order not accurate: 45, 53, 22, 28

If two orders are selected, find the probability that they are both from restaurant D.

Let A be the event that the first order is from restaurant D and B the event that the second order is from restaurant D. The formal multiplication rule gives:

\[ P(A \text{ and } B) = P(A) * P(B | A) \]

Calculate the total number of orders from all restaurants: 1152

Calculate the number of orders from restaurant D: 204

Divide the orders from restaurant D by the total number of orders. This gives \(P(A) = \frac{204}{1152}\)

  1. Assume that the selections are made with replacement. Are the events independent?

The events are independent, and the probability of event B stays the same regardless of event A.

So, \( P(A \text{ and } B) = \frac{204}{1152} * \frac{204}{1152} = .0314 \)

  2. Assume that the selections are made without replacement. Are the events independent?

The probability of event A will be the same, \(\frac{204}{1152}\).

When replacements are not used, the events are not independent and the probability of event B changes depending on the outcome of event A.

Since event A was selecting an order from D and the selected order does not get replaced, the number of orders from D and the total number of orders to choose from each decrease by 1 for event B.

So:

\[ P(A) = \frac{204}{1152} \text{ and } P(B | A) = \frac{204-1}{1152-1} \]

Multiply the probability of event A by the probability of event B given A:

\[ P(A \text{ and } B) = \frac{204}{1152} * \frac{203}{1151} = .0312 \]
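Both versions of this calculation can be reproduced with a few lines. The variable names are my own, and the counts come from the table for this problem (restaurant D has 176 + 28 orders).

```python
from fractions import Fraction

accurate = [367, 255, 206, 176]    # restaurants A, B, C, D
not_accurate = [45, 53, 22, 28]

total = sum(accurate) + sum(not_accurate)
from_d = accurate[3] + not_accurate[3]

p_first = Fraction(from_d, total)  # P(A): first order is from D

# With replacement: independent, so P(B | A) = P(B) = P(A).
with_replacement = p_first * p_first

# Without replacement: one fewer D order and one fewer order overall.
without_replacement = p_first * Fraction(from_d - 1, total - 1)

print(round(float(with_replacement), 4), round(float(without_replacement), 4))
```

The two answers differ only in the fourth decimal place, which is the reason dependent selections from a large population are often treated as independent.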

Use the data in the following table, which lists drive-thru order accuracy at popular fast food chains.

Restaurant: A, B, C, D

Order accurate: 323, 267, 241, 128

Order not accurate: 30, 55, 34, 12

If two orders are selected, find the probability that they are both accurate.

  1. Assume that the selections are made with replacement. Are the events independent?

Because the selections are made with replacement, the events are independent.

Calculate total number of orders: 1090

Accurate orders: 959

\[\frac{959}{1090} * \frac{959}{1090} = .7741 \]

  2. Assume that the selections are made without replacement. Are the events independent?

Because the selections are made without replacement, the events are dependent events.

The probability of each order being accurate is affected by the other orders.

The probability \(P(A)\) remains the same as in part 1.

The probability \(P(B|A)\) must be adjusted to reflect that the first order was accurate and is not available for the second order.

Recall that originally there were 959 accurate orders out of 1090.

After the first accurate order is selected, there are 1089 orders remaining, of which 958 are accurate.

\[ P(A \text{ and } B) = \frac{959}{1090} * \frac{958}{1089} = .7740 \]

The events are not independent because the sampling is done without replacement.

Use the data in the following table.

Restaurant: A, B, C, D

Order accurate: 321, 260, 243, 121

Order not accurate: 35, 52, 32, 14

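If three orders are selected, find the probability that they are all from B.

Restaurant B has 260 + 52 = 312 of the 1078 orders. Treating the selections as independent (three orders is far less than 5% of 1078):

\[ \left(\frac{312}{1078}\right)^3 = .0242 \]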

Use the following results from a test for marijuana use, which is provided by a certain drug testing company. Among 145 subjects with positive test results, there are 29 false positive results. Among 157 negative results, there are 3 false negative results.

Used marijuana: No, Yes

Positive test result: 29, 116

Negative test result: 154, 3

  1. How many subjects were included in the study?

Add the subjects who tested positive to those who tested negative: 145 + 157 = 302

  2. How many subjects did not use marijuana?

Add the false positives to the true negatives: 29 + 154 = 183

  3. What is the probability that a randomly selected subject did not use marijuana?

183/302=.606

Among 132 subjects with positive test results, there are 32 false positive results. Among 168 negative results, there are 8 false negative results.

If one of the test subjects is randomly selected, find the probability that the subject tested negative or did not use marijuana.

Used marijuana: No, Yes

Positive test result: 32, 100

Negative test result: 160, 8

Total subjects=300

Next, find the probability that a randomly selected subject tested negative 

168/300

Now, find the number of subjects that did not use marijuana

Two groups did not use marijuana. True negatives and the false positives

160+32=192

Next, find the probability that a randomly selected test subject did not use marijuana.

Did not use=192/300

Next, find the probability that a randomly selected test subject tested negative and did not use it

160/300

Finally, use the formal addition rule to find the probability that a randomly selected subject tested negative or did not use it, rounding to 3 decimal places

168/300+192/300-160/300 = .667
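The drug test calculation can be rebuilt directly from the four numbers in the problem statement. The variable names in this sketch are my own.

```python
from fractions import Fraction

positives, false_positives = 132, 32   # false positives did not use marijuana
negatives, false_negatives = 168, 8    # false negatives did use marijuana

true_negatives = negatives - false_negatives     # tested negative and did not use
total = positives + negatives
did_not_use = true_negatives + false_positives

p_negative = Fraction(negatives, total)
p_did_not_use = Fraction(did_not_use, total)
p_both = Fraction(true_negatives, total)

# Formal addition rule: P(A or B) = P(A) + P(B) - P(A and B)
p_negative_or_did_not_use = p_negative + p_did_not_use - p_both
print(p_negative_or_did_not_use, round(float(p_negative_or_did_not_use), 3))
```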

The principle of redundancy is used when system reliability is improved through redundant components. Assume that a student’s alarm has a 16.0% daily failure rate.

  1. What is the probability that the student’s alarm clock will not work on the morning of an important exam?

To convert a percentage to a decimal number, remove the % symbol and divide by 100.

For the stated failure rate of 16% remove the percent symbol and divide by 100.

16/100 = .160

So, the probability that the student’s alarm clock will not work on the morning of an important exam is .160.

  2. If the student has two such alarm clocks, what is the probability that they both fail on the morning of an important exam?

Use the formal multiplication rule, which states that if P(A) is the probability of event A occurring and P(B|A) is the probability of B occurring given that A has occurred, the probability of both A and B occurring is given by:

\[P(A \text{ and } B)=P(A)*P(B|A)\]

Here A is the event that the first alarm clock fails and B is the event that the second alarm clock fails. The functioning of the second alarm clock is not affected by the failure of the first, so by definition they are independent events and P(B|A) = P(B).

Multiply P(A) and P(B) together.

.160*.160=.0256

  3. What is the probability of not being awakened if the student uses three independent alarm clocks?

P(A) * P(B) * P(C) = .160*.160*.160= .00410

  4. Do the second and third alarm clocks result in greatly improved reliability?

Compare the probability of one alarm clock not working to the probabilities of 2 or 3 alarm clocks not working. In general, when an event will occur with probability 1, it is called certain. An event occurring with probability less than or equal to .05 is called unlikely. An event occurring with probability 0 is called impossible. So yes: with one alarm clock, failure (.160) is not unlikely, while with two (.0256) or three (.00410) alarm clocks, failure is unlikely.
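The effect of redundancy in the alarm clock problem is easy to tabulate. This sketch assumes the clocks fail independently, as the problem does; the function name is my own.

```python
def prob_all_fail(failure_rate, num_clocks):
    """Probability that every one of several independent alarm clocks fails."""
    return failure_rate ** num_clocks

for num_clocks in (1, 2, 3):
    p = prob_all_fail(0.16, num_clocks)
    label = "unlikely" if p <= 0.05 else "not unlikely"
    print(num_clocks, round(p, 5), label)
```

Each extra clock multiplies the chance of oversleeping by another factor of .16, which is why the second clock already pushes the failure probability below the .05 cutoff.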

Surge protectors p and q are used to protect a television. If there is a surge in the voltage, the surge protector reduces it to a safe level. Assume that each surge protector has a .88 probability of working correctly when a voltage surge occurs.

  1. If the two surge protectors are arranged in a series, what is the probability that a voltage surge will not damage the television?

With two independent surge protectors in series, the television will be protected unless both surge protectors fail. In other words, only one surge protector needs to work. Find the probability that at least one surge protector works by calculating 1-P(p fails and q fails). The probability that both fail can be found by applying the multiplication rule for independent events.

\[P(A \text{ and } B)=P(A)*P(B)\]

The probability that a surge protector works correctly is .88. The probability that a surge protector fails is calculated below.

1-.88=.12

The probability that one surge protector fails is .12. The probability that both surge protectors fail is the product of the probabilities that each one fails.

.12*.12=.0144

There is a .0144 probability that both surge protectors fail. The probability that the television is protected in a series configuration is the complement of the probability that both fail.

1-.0144=.9856

  1. If the two surge protectors are arranged in parallel, what is the probability that a voltage surge will not damage the television?

With two independent surge protectors in parallel, the television will be protected as long as both surge protectors work. The probability that the two independent surge protectors both work is found by applying the multiplication rule for independent events.

\[P(A \text{ and } B)=P(A)*P(B)\]

The probability that a surge protector works correctly is .88. The probability that both surge protectors work is the product of the probabilities that each one works correctly.

.88*.88=.7744

  1. Which arrangement should be used for better protection?

Series, because .9856 is greater than .7744.
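The two surge protector cases can be compared side by side. This is a minimal sketch that follows the reasoning used in the worked answers above (series protects if at least one protector works, parallel protects only if both work); the function names are hypothetical and independence is assumed.

```python
P_WORKS = 0.88  # probability that one surge protector works, from the problem


def protected_if_any_works(p_works: float, n: int) -> float:
    """Protection holds unless all n independent protectors fail."""
    return 1 - (1 - p_works) ** n


def protected_if_all_work(p_works: float, n: int) -> float:
    """Protection holds only when all n independent protectors work."""
    return p_works ** n


series = protected_if_any_works(P_WORKS, 2)    # series case as worked above
parallel = protected_if_all_work(P_WORKS, 2)   # parallel case as worked above
print(f"series: {series:.4f}, parallel: {parallel:.4f}")
```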

Complements and Conditional Probability

Complements

When finding the probability of some event occurring at least once, we should understand that at least one has the same meaning as one or more. The complement of getting at least one particular event is that you get no occurrences of that event.

Finding the probability of getting at least one of some event:

  1. Let A = getting at least one of some event.
  2. Then \(\bar{A}\) = getting none of the event being considered.
  3. Find \(P(\bar{A})\) = probability that event A does not occur.
  4. Subtract the result from 1.
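The four steps above can be sketched as a small function. The name `prob_at_least_one` is hypothetical, and the sketch assumes independent trials that all share the same single-trial probability. `Fraction` is used so the answer stays exact.

```python
from fractions import Fraction


def prob_at_least_one(p_single, n_trials):
    """P(at least one occurrence) = 1 - P(no occurrences in n trials)."""
    p_none = (1 - p_single) ** n_trials  # complement: the event never occurs
    return 1 - p_none


# Example: at least one girl among three children, boys and girls equally likely.
print(prob_at_least_one(Fraction(1, 2), 3))  # 7/8
```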

Conditional Probability

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred.

\(P(B|A)\) denotes the conditional probability of event B occurring, given that event A has already occurred.

Intuitive Approach For Finding P(B|A)

The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur.

Formal Approach For Finding P(B|A)

The probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.

\[P(B|A)=\frac{P(A \text{ and } B)}{P(A)}\]

The preceding formula is a formal expression of conditional probability, but blind use of formulas is not recommended. The intuitive approach is recommended.

Bayes’ Theorem

The importance and usefulness of Bayes’ Theorem is that it can be used with sequential events, whereby new additional information is obtained for a subsequent event, and that new information is used to revise the probability of the initial event. In this context, the terms prior probability and posterior probability are commonly used.

A prior probability is an initial probability value originally obtained before any additional information is obtained.

A posterior probability is a probability value that has been revised by using additional information that is later obtained.
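To make the prior and posterior idea concrete, here is a minimal sketch of Bayes' Theorem in its two-outcome form, P(A | E) = P(A)P(E | A) / [P(A)P(E | A) + P(not A)P(E | not A)]. The notes do not state this formula or give numbers for it, so the function name and every value below are assumptions chosen only for illustration.

```python
def posterior(prior: float, p_e_given_a: float, p_e_given_not_a: float) -> float:
    """Revise P(A) into P(A | E) after observing evidence E."""
    numerator = prior * p_e_given_a
    denominator = numerator + (1 - prior) * p_e_given_not_a
    return numerator / denominator


# Hypothetical numbers, not from the notes.
prior_disease = 0.01          # prior probability of having a disease
p_pos_given_disease = 0.95    # P(positive test | disease)
p_pos_given_healthy = 0.05    # P(positive test | no disease)

print(round(posterior(prior_disease, p_pos_given_disease, p_pos_given_healthy), 3))
```

With these made-up inputs the posterior probability is much larger than the prior, yet still far below 1, which is the kind of revision the prior and posterior terminology describes.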

Multiplication Counting Rule

The multiplication counting rule is used to find the total number of possibilities from some sequence of events. For a sequence of events in which the first event can occur n1 ways, the second event can occur n2 ways, and so on, the total number of outcomes is n1*n2*n3*…
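A minimal sketch of the rule, using a hypothetical helper `count_outcomes` and Python's `math.prod` (available in Python 3.8 and later):

```python
import math


def count_outcomes(ways_per_event):
    """Multiplication counting rule: n1 * n2 * n3 * ..."""
    return math.prod(ways_per_event)


# Illustrative example: four digits, each chosen from 10 possibilities.
print(count_outcomes([10, 10, 10, 10]))  # 10000
```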

Factorial Rule

The factorial rule is used to find the total number of ways that n different items can be rearranged. The factorial symbol (!) denotes the product of decreasing positive whole numbers, for example 4! = 4*3*2*1 = 24. The factorial rule states that the number of different arrangements of n different items, when all n of them are selected, is n!. The rule is based on the principle that the first item may be selected n different ways, the second item may be selected n-1 ways, and so on. This rule is really the multiplication counting rule modified for the elimination of one item on each selection.

Permutations and Combinations

When using different counting methods, it is essential to know whether different arrangements of the same items are counted only once or are counted separately. The terms permutations and combinations are standard in this context. 

Permutations of items are arrangements in which different sequences of the same items are counted separately.

Combinations of items are arrangements in which different sequences of the same items are counted as being the same.

Permutations Rule

The permutation rule is used when there are n different items available for selection, we must select r of them without replacement, and the sequence of the items matters. The result is the total number of arrangements that are possible. Remember, rearrangements of the same items are counted as different permutations.

\[{}_nP_r=\frac{n!}{(n-r)!}\]

When n items are all selected without replacement, but some items are identical, the number of possible permutations is found by using the following rule:

\[\frac{n!}{n_1!n_2!...n_k!}\]

Combinations Rule

The combinations rule is used when there are n different items available for selection, only r of them are selected without replacement, and order does not matter. The result is the total number of combinations that are possible. Remember, rearrangements of the same items are considered to be the same combination.

\[{}_nC_r=\frac{n!}{(n-r)!r!}\]
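The factorial, permutations, and combinations rules map directly onto functions in Python's `math` module (`math.perm` and `math.comb` need Python 3.8 or later). The values of n and r, the helper `permutations_with_identical`, and the MISSISSIPPI example are assumptions added for illustration.

```python
import math

n, r = 5, 3  # hypothetical values for illustration

print(math.factorial(n))  # arrangements of all n different items: n!
print(math.perm(n, r))    # permutations: n! / (n - r)!
print(math.comb(n, r))    # combinations: n! / ((n - r)! * r!)


def permutations_with_identical(group_sizes):
    """n! / (n1! * n2! * ... * nk!), where n is the sum of the group sizes."""
    n_total = sum(group_sizes)
    denominator = math.prod(math.factorial(size) for size in group_sizes)
    return math.factorial(n_total) // denominator


# Illustrative example: the letters of "MISSISSIPPI" (1 M, 4 I, 4 S, 2 P).
print(permutations_with_identical([1, 4, 4, 2]))
```

Note that `math.perm(n, r)` is always `math.comb(n, r)` times r!, because each combination of r items can be ordered in r! ways.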

Find the probability that when a couple has three children, at least one of them is a girl. Assume that boys and girls are equally likely.

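The complement of at least one girl is that all three children are boys. For each birth there are two equally likely possibilities, and there are 3 independent births, so the probability of three boys is the product below. Subtracting it from 1 gives the probability of at least one girl.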

½*½*½ = ⅛

1-⅛=⅞

In a certain country, the true probability of a baby being a boy is .509. Among the next six randomly selected births in the country, what is the probability that at least one of them is a girl?

The probability of at least one can be computed using the rule of complements. Let A represent the event that at least one of the next six births is a girl. Use the rule of complements below to find the probability of event A, P(A), where \(\bar{A}\) is the complement of A.

\[P(A)=1-P(\bar{A})\]

The complement of A, \(\bar{A}\), is the event that the next six births are all boys.

Since each birth has no effect on any of the other births, the births are all independent events. The probability that the next six births are all boys can be found using the multiplication rule for independent events, as the product of six copies of the probability that a single birth is a boy.

It is given that the probability of a birth being a boy is .509.

Use the multiplication rule for independent events to find the probability that the next six births are all boys. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to 6 independent events.

.509*.509*.509*.509*.509*.509 = .017

Then use the rule of complements to find the probability that at least one of the six births is a girl.

1-.017=.983

Therefore, the probability that the next six randomly selected births will contain at least one girl is .983.

Subjects for the next presidential election poll are contacted using telephone numbers in which the last four digits are randomly selected​ (with replacement). Find the probability that for one such phone​ number, the last four digits include at least one 0.

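There are 10^4 possible four-digit endings, and 9^4 of them contain no 0, so the number of endings with at least one 0 is

10^4-9^4=3439

10^4=10000

3439/10000=.344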
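The count above can be double-checked two ways: with the rule of complements and by listing every possible four-digit ending. This is a small verification sketch; the variable names are illustrative.

```python
from fractions import Fraction
from itertools import product

# Rule of complements: 1 - P(no digit is 0), with 9 non-zero choices per digit.
p_complement_rule = 1 - Fraction(9, 10) ** 4

# Brute-force check: enumerate all 10^4 four-digit endings.
endings = list(product(range(10), repeat=4))
with_zero = sum(1 for digits in endings if 0 in digits)
p_enumerated = Fraction(with_zero, len(endings))

print(with_zero, len(endings), float(p_complement_rule))
```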

Based on a poll, 72% of internet users are more careful about personal information when using a public wi-fi hotspot. What is the probability that among three randomly selected internet users, at least one is more careful about personal information when using a public wi-fi hotspot? How is the result affected by the additional information that the survey subjects volunteered to respond to?

The probability of at least one can be computed using the rule of complements. The rule of complements states that the following expression is true for events A and \(\bar{A}\), where \(\bar{A}\) indicates that event A did not take place.

\[P(A)=1-P(\bar{A})\]

Identify the event that is the complement of A

\[\bar{A} = \text{none of the internet users are more careful}\]

To find the probability of the complement, first find the probability that an internet user is not more careful with personal information while using a public wi-fi hotspot.

1-P(is more careful)

1-.72 = .28

Find the probability of the complement using the multiplication rule for independent events, rounding to three decimal places. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. This can be extended to three independent events.

.28 * .28 * .28 = .022

Now use the rule of complements

1-.022 = .978

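Because the survey subjects volunteered to respond, this is a voluntary response sample, so it is very possible that the result is not representative of the population of people who use public wi-fi.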

In an experiment, college students were given either four quarters or a $1 bill and they could either keep the money or spend it on gum.

                       Purchased Gum    Kept the Money

Given four quarters         37                13

Given $1 bill               11                39

  1. Find the probability of randomly selecting a student who spent the money, given that the student was given four quarters.

The conditional probability of B occurring given that A has occurred, P(B|A), can be found intuitively by assuming that event A has occurred and then calculating the probability that event B will occur.

More formally, the probability P(B|A) can be found by dividing the probability of events A and B both occurring by the probability of event A.

In this case, given four quarters corresponds to event A and spent the money corresponds to event B.

First determine the number of students given four quarters that spent the money

37 students

Now calculate the probability

37/50=.74

  1. Find the probability of randomly selecting a student who kept the money given that the student was given four quarters.

Recall that 50 students were given four quarters

Identify the number of students given four quarters that kept the money

13 students

Now calculate the probability

13/50=.26

Note that since the students either kept the money or spent the money, these probabilities are complements.

.26=1-.74

  1. What do the preceding results suggest?

Compare the probabilities found in the first two parts

Spent the money=.74

Kept the money=.26

Since .74 > .26, P(spent the money | four quarters) is the greater probability. A student given four quarters is more likely to have spent the money than to have kept it.
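The formal approach, P(B|A) = P(A and B) / P(A), gives the same answers as the intuitive approach used above. The sketch below checks this with the counts from the gum experiment; the dictionary layout and the function name `conditional` are assumptions made for illustration.

```python
from fractions import Fraction

# Counts from the gum experiment table above.
counts = {
    ("four quarters", "purchased gum"): 37,
    ("four quarters", "kept the money"): 13,
    ("$1 bill", "purchased gum"): 11,
    ("$1 bill", "kept the money"): 39,
}


def conditional(counts, given, outcome):
    """P(outcome | given), computed as P(given and outcome) / P(given)."""
    total = sum(counts.values())
    n_given = sum(v for (group, _), v in counts.items() if group == given)
    p_given = Fraction(n_given, total)
    p_both = Fraction(counts[(given, outcome)], total)
    return p_both / p_given


print(conditional(counts, "four quarters", "purchased gum"))   # 37/50
print(conditional(counts, "four quarters", "kept the money"))  # 13/50
```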

The accompanying table shows the results from a test for a certain disease. Find the probability of selecting a subject with a negative test result, given that the subject has the disease. What would be an unfavorable consequence for this error?

                   Has the disease    Does not have the disease

Positive test           357                      26

Negative test            18                    1150

A conditional probability of an event is a probability obtained with the additional information that some other event has already occurred. P(B|A) denotes the conditional probability of event B occurring, given that event A has already occurred.

A is the event which is known to have occurred. The given event is “the individual has the disease”.

B is the event for which the probability is sought. The event is “the individual tests negative for the disease”.

The conditional probability of B given A can be found by assuming that event A has occurred and, working under that assumption, calculating the probability that event B will occur.

First, determine the number of individuals who have the disease. Add all the values in the indicated column.

357+18=375

From the table, there are 18 individuals who have the disease and test negative. Divide to find the probability

18/375=.048

Therefore, the probability that a randomly selected individual who has the disease tests negative is .048.

To determine an unfavorable consequence of this error, consider a subject that has the disease but with a negative test result.

Note that a negative test result would lead the subject to believe that they do not have the disease, so the subject might not seek the treatment they need.

The table below displays results from experiments with polygraph instruments. Find the positive predictive value for the test. That is, find the probability that the subject lied, given that the test yields a positive result.

                         Did not lie    Lied

Positive test result          9          46

Negative test result         30          13

Use the intuitive approach to conditional probability. The conditional probability of B occurring given that A has occurred can be found by assuming that event A has occurred and then calculating the probability that event B will occur. Find the probability of selecting a subject who lied, given that the selected subject had a positive test result. If it is assumed that the subject had a positive test result, then only the 9+46=55 subjects in the top row of the table are to be used. Among those 55 subjects, 46 subjects who had a positive test result actually lied.

Divide the number of subjects who had a positive test result and actually lied by the total number of subjects who had a positive test result to find the probability, rounding to three decimal places.

46/55=.836

Assume that there is a 12% rate of disk drive failure in a year.

  1. If all your computer data is stored on a hard disk with a copy stored on a second hard disk drive, what is the probability that during a year, you can avoid catastrophe with at least one working drive?
  2. If copies of all your computer data are stored on three independent disk drives, what is the probability that during a year, you can avoid catastrophe with at least one working drive?
  1. Use the rule of complements shown below to find the probability that you can avoid catastrophe. Let A=at least one hard drive works correctly

\[P(A)=1-P(\bar{A})\]

Identify the event that is the complement of A

\[\bar{A}=\text{both hard drives fail}\]

Since the two hard drives operate separately, their failures are independent events. Use the multiplication rule for independent events to find the probability of the complement of event A. The multiplication rule for independent events states that the probability of two independent events occurring is the product of their individual probabilities. The probability of any one of the hard drives failing to work correctly is 0.12

\[ P(\bar{A})= .12*.12=.0144 \]

Now find P(A) by evaluating \(1-P(\bar{A})\)

1-.0144 = .9856

  1. Again, let A = at least one hard drive works correctly.

\[P(\bar{A}) = .12 * .12 * .12 = .001728 \]

Now find P(A) by evaluating \(1-P(\bar{A})\)

1-.001728 = .998272

Probability Distributions

Basic Concepts

A random variable is a variable that has a single numeric value, determined by chance, for each outcome of a procedure.

A probability distribution is a description that gives the probability for each value of the random variable. It is often expressed in the format of a table, formula, or graph.

A discrete random variable has a collection of values that is finite or countable. If there are infinitely many values, the number of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting heads.

A continuous random variable has infinitely many values, and the collection of values is not countable. That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as body temperatures.

Probability Distribution Requirements

Every probability distribution must satisfy each of the following three requirements.

  1. There is a numerical random variable, and its number values are associated with corresponding probabilities.
  2. \(\Sigma P(x)=1\) where x assumes all possible values.
  3. \(0 \leq P(x) \leq 1\) for every individual value of the random variable x. That is, each probability value must be between 0 and 1 inclusive.

The second requirement comes from the simple fact that the random variable x represents all possible events in the entire sample space, so we are certain that one of the events will occur. The third requirement comes from the basic principle that any probability value must be 0 or 1 or a value between 0 and 1.
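The second and third requirements are easy to check mechanically. Below is a minimal sketch; the function name `is_probability_distribution` and the tolerance parameter are assumptions, with the tolerance there because probabilities rounded for a table may not sum to exactly 1.

```python
import math


def is_probability_distribution(probs, tol=1e-9):
    """Check that every P(x) is in [0, 1] and that the P(x) values sum to 1."""
    each_in_range = all(0 <= p <= 1 for p in probs)
    sums_to_one = math.isclose(sum(probs), 1.0, abs_tol=tol)
    return each_in_range and sums_to_one


print(is_probability_distribution([0.25, 0.50, 0.25]))           # True
print(is_probability_distribution([0.001, 0.009, 0.034, 0.056]))  # False
```

For tables rounded to three decimal places, a looser tolerance such as `tol=0.005` may be more appropriate.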

 

The x variable above is a random variable because its numerical values depend on chance. The variable x is a numerical random variable, and its values are associated with probabilities. \(\Sigma P(x)=.25+.50+.25=1\). Each value of P(x) is between 0 and 1. The random variable x is a discrete random variable, because it has three possible values and three is a finite number.

Notation for 0+

In tables of binomial probabilities, we recommend using 0+ to represent a probability value that is positive but very small, such as .0000000123. When rounding a probability value for inclusion in such a table, rounding to 0 would be misleading because it would incorrectly suggest that the event is impossible.

Probability Histogram

There are various ways to graph a probability distribution, but for now we will consider only the probability histogram. 

Parameters of a Probability Distribution

Remember that with a probability distribution, we have a description of a population instead of a sample, so the values of the mean, standard deviation, and variance are parameters, not statistics. The mean, variance, and standard deviation of a discrete probability distribution can be found with the following formulas.

This is the mean for a probability distribution:

\[ \mu = \sum [x * P(x)] \]

Variance for a probability distribution that should be easier to understand:

\[\sigma^2 = \Sigma[(x - \mu)^2 * P(x)]\]

Variance for probability distribution that is good for manual calculations:

\[\sigma^2 = \Sigma[x^2*P(x)] - \mu^2 \]

Standard deviation for probability distribution:

\[\sigma = \sqrt{\Sigma[x^2*P(x)] - \mu^2}\]
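The formulas above can be sketched in a few lines of Python. The function name `distribution_summary` is hypothetical, and the example data is the distribution of the number of girls in three births when boys and girls are equally likely. Both variance formulas are computed so they can be compared.

```python
import math


def distribution_summary(xs, ps):
    """Return (mean, variance by definition, variance by shortcut, std dev)."""
    mu = sum(x * p for x, p in zip(xs, ps))
    var_definition = sum((x - mu) ** 2 * p for x, p in zip(xs, ps))
    var_shortcut = sum(x ** 2 * p for x, p in zip(xs, ps)) - mu ** 2
    return mu, var_definition, var_shortcut, math.sqrt(var_shortcut)


xs = [0, 1, 2, 3]                  # number of girls in three births
ps = [0.125, 0.375, 0.375, 0.125]  # corresponding probabilities

mu, var_definition, var_shortcut, sigma = distribution_summary(xs, ps)
print(mu, var_definition, var_shortcut, round(sigma, 3))
```

The two variance formulas agree, which is the point of having a shortcut version for manual calculations.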

Expected Value

The mean of a discrete random variable is the theoretical mean outcome for infinitely many trials. We can think of that mean as the expected value in the sense that it is the average value that we would expect to get if the trials could continue indefinitely.

The expected value of a discrete random variable is denoted by E, and it is the mean value of the outcomes, so \(E=\mu\), and E can also be found by evaluating \(\Sigma[x*P(x)]\).

An expected value need not be a whole number, even if the different possible values of x might all be whole numbers. The expected number of girls in five births is 2.5, even though five particular children can never result in 2.5 girls. If we were to survey many couples with 5 children, we expect that the mean number of girls will be 2.5.

Making Sense of Significant Values

We present the following two different approaches for determining whether a value of a random variable is significantly low or high.

Range Rule of Thumb

The range rule of thumb may be helpful in interpreting the value of a standard deviation. According to the range rule of thumb, the vast majority of values should lie within 2 standard deviations of the mean, so we can consider a value to be significant if it is at least 2 standard deviations away from the mean. We can identify significant values as follows:

  1. Significantly low values are \(\mu-2\sigma\) or lower
  2. Significantly high values are \(\mu+2\sigma\) or higher
  3. Values not significant are between the previous two conditions

Know that the use of the number 2 in the range rule of thumb is somewhat arbitrary and this is a guideline, not an absolutely rigid rule.

Identifying Significant Results With Probabilities

x successes among n trials is a significantly high number of successes if the probability of x or more successes is .05 or less. That is, x is a significantly high number of successes if \(P(x \text{ or more}) \leq .05\).

x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is .05 or less. That is, x is a significantly low number of successes if \(P(x \text{ or fewer}) \leq .05\).

The Rare Event Rule For Inferential Statistics

If, under a given assumption, the probability of a particular outcome is very small and the outcome occurs significantly less than or significantly greater than what we expect with that assumption, we conclude that the assumption is probably not correct.

For example, if testing the assumption that boys and girls are equally likely, the outcome of 20 girls in 100 births is significantly low and would be a basis for rejecting that assumption.

Expected Value and Rationale for Formulas

Earlier we noted that the expected value of a random variable is equal to the mean. We can therefore find the expected value by computing \(\Sigma[x*P(x)]\), just as we do for finding the value of \(\mu\). We also noted that the concept of expected value is used in decision theory. 

Rationale for Earlier Formulas

Instead of blindly accepting and using formulas, it is much better to have some understanding of why they work. When computing the mean from a frequency distribution, f represents class frequency and N represents population size. The formula for the mean of a frequency distribution can be rewritten so that it applies to a population. In the fraction f/N, the value of f is the frequency with which the value x occurs and N is the population size, so f/N is the probability for the value of x. When we replace f/N with P(x), we make the transition from relative frequency based on a limited number of observations to probability based on infinitely many trials.
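The rewrite described above can be sketched as the following chain of equalities. This is a reconstruction that assumes each distinct value x occurs with frequency f in a population of size N:

\[\mu=\frac{\Sigma(f*x)}{N}=\Sigma\left[\frac{f*x}{N}\right]=\Sigma\left[x*\frac{f}{N}\right]=\Sigma[x*P(x)]\]

The last step is the replacement of f/N with P(x), which gives the formula for the mean of a probability distribution.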

Example 1

 The table below lists probabilities for the corresponding numbers of girls in three births. What is the random variable, what are its possible values, and are its values numerical?

Girls(x) P(x)

0 0.125

1 0.375

2 0.375

3 0.125

The random variable is x, which is the number of girls in three births. The possible values of x are 0, 1, 2, and 3. The values of the random variable x are numerical.

Example 2

Is the random variable given in the accompanying table discrete or continuous?

Girls(x) P(x)

0 0.063

1 0.250

2 0.375

3 0.250

4 0.063

The random variable given in the accompanying table is discrete because there are a finite number of values.

Example 3

For 100 births, P(exactly 56 girls)=0.0390 and P(56 or more girls)=0.136. Is 56 girls in 100 births a significantly high number of girls? Which probability is relevant to answering that question? Consider a number of girls to be significantly high if the appropriate probability is 0.05 or less.

The relevant probability is P(56 or more girls), so 56 girls in 100 births is not a significantly high number of girls because the relevant probability is greater than 0.05.
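The two probabilities quoted in this example can be reproduced under an assumption the example itself does not state: that the 100 births are independent and each is a girl with probability .5, so the number of girls follows a binomial distribution. The function names below are illustrative.

```python
from math import comb


def binom_pmf(k: int, n: int, p: float) -> float:
    """P(exactly k successes in n independent trials)."""
    return comb(n, k) * p ** k * (1 - p) ** (n - k)


def binom_tail_ge(k: int, n: int, p: float) -> float:
    """P(k or more successes in n independent trials)."""
    return sum(binom_pmf(i, n, p) for i in range(k, n + 1))


p_exact = binom_pmf(56, 100, 0.5)
p_or_more = binom_tail_ge(56, 100, 0.5)

print(round(p_exact, 4), round(p_or_more, 3))
print("significantly high" if p_or_more <= 0.05 else "not significantly high")
```

The "x or more" probability is the relevant one because any single exact outcome among 100 births has a small probability, even an unremarkable one.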

Example 4

Five males with an X-linked genetic disorder have one child each. The random variable x is the number of children among the five who inherit the X-linked genetic disorder. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If a probability distribution is not given, identify the requirements that are not satisfied.

X P(x)

0 0.024

1 0.167

2 0.309

3 0.309

4 0.167

5 0.024

The random variable x is numerical because x takes on the integer values from 0 to 5.

The number values are associated with probabilities because each value of x has a corresponding value of P(x) in the next column of the table.

The mean for a probability distribution is given by the formula below.

\[\mu = \Sigma[x*P(x)]\]

Find each product of x and P(x)

0+.167+.618+.927+.668+.12=2.5

\[\mu=2.5\]

The standard deviation for a probability distribution is given by the formula below.

\[\sigma=\sqrt{\Sigma[x^2*P(x)]-\mu^2}\]

Create another table for the new values

X^2 X^2*P(x)

0 0

1 .167

4 1.236

9 2.781

16 2.672

25 .6

Sum = 7.456

Substitute into formula

\[\sqrt{7.456-2.5^2}= 1.1\]

Example 5

When conducting research on color blindness in males, a researcher forms random groups with five males in each group. The random variable x is the number of males in the group who have a form of color blindness. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If not, state why.

X P(x)

0 .657

1 .284

2 .053

3 .005

4 .001

5 .000

Find the mean of the random variable x

0+.284+.106+.015+.004+0=.409

Find the standard deviation of the random variable x

0+(1^2*.284)+(2^2*.053)+(3^2*.005)+(4^2*.001)+(5^2*0)=.557

\[\sqrt{.557-.409^2}=.6243\]

Example 6

Look at the next table. Determine whether a probability distribution is given. If it is, find the mean and standard deviation. If not, state why.

X P(x)

0 .001

1 .009

2 .034

3 .056

Does the table show a probability distribution?

No, the sum of all the probabilities is .001+.009+.034+.056=.100, which is not equal to 1

Example 7

Look at the following table.

X P(x)

0 .094

1 .347

2 .395

3 .164

Does the table show a probability distribution?

Yes, the table shows a probability distribution

Find the mean of the random variable x

(0)+(.347)+(2*.395)+(3*.164)=1.629

Find the standard deviation of x

0+.347+(4*.395)+(9*.164)=3.403

\[\sqrt{3.403-1.629^2}=.8657\]

Example 8

Look at the following table

X P(x)

0 .365

1 .431

2 .178

3 .026

Does the table show a probability distribution?

Yes, the table shows a probability distribution

Find the mean of the random variable x

0+.431+(2*.178)+(3*.026)=.865

Find the standard deviation of x

0+.431+(4*.178)+(9*.026)=1.377

\[\sqrt{1.377-.865^2}=.7930\]

Example 9

Look at the table below

X P(x)

0 .002

1 .035

2 .111

3 .221

4 .272

5 .211

6 .116

7 .027

8 .005

Find the mean

0+.035+(2*.111)+(3*.221)+(4*.272)+(5*.211)+(6*.116)+(7*.027)+(8*.005)=3.988

Find the standard deviation

0+.035+(2^2*.111)+(3^2*.221)+(4^2*.272)+(5^2*.211)+(6^2*.116)+(7^2*.027)+(8^2*.005)=17.914

\[\sqrt{17.914-3.988^2}=1.4\]

Example 10

The following table describes results from groups of 10 births from 10 different sets of parents. The random variable x represents the number of girls among 10 children. Use the range rule of thumb to determine whether 1 girl in 10 births is a significantly low number of girls.

X P(x)

0 .005

1 .010

2 .046

3 .113

4 .194

5 .241

6 .211

7 .111

8 .039

9 .020

10 .010

The range rule of thumb for identifying significant values is shown below.

Significantly low values are \(\mu-2\sigma\) or lower

Significantly high values are \(\mu+2\sigma\) or higher

Values between these are not significant

To find the range of values that are not significant, first find the mean and standard deviation

Let us start with the mean

0+.010+.092+.339+.776+1.205+1.266+.777+.312+.180+.100=5.057

Now find the standard deviation

0+.010+(4*.046)+(9*.113)+(16*.194)+(25*.241)+(36*.211)+(49*.111)+(64*.039)+(81*.020)+(100*.010)=28.491

\[\sqrt{28.491-5.057^2}=1.708\]

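Now find the upper limit of the range of values that are not significant

Max value = \(\mu+2\sigma\)

5.1+2*1.7=8.5

Now find the lower limit of the range of values that are not significant

Min value = \(\mu-2\sigma\)

5.1-2*1.7=1.7

Since 1 is less than 1.7, 1 girl in 10 births is a significantly low number of girls.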
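The range rule of thumb calculation for this example can be sketched in Python as follows. The function name `classify` is hypothetical. The code keeps the unrounded mean and standard deviation, so its cutoffs (about 1.64 and 8.47) differ slightly from the rounded 1.7 and 8.5 above.

```python
import math

# Number of girls in 10 births and the probabilities from the table above.
xs = list(range(11))
ps = [.005, .010, .046, .113, .194, .241, .211, .111, .039, .020, .010]

mu = sum(x * p for x, p in zip(xs, ps))
sigma = math.sqrt(sum(x ** 2 * p for x, p in zip(xs, ps)) - mu ** 2)

low_cutoff = mu - 2 * sigma    # values at or below this are significantly low
high_cutoff = mu + 2 * sigma   # values at or above this are significantly high


def classify(x: float) -> str:
    """Apply the range rule of thumb to a single value of x."""
    if x <= low_cutoff:
        return "significantly low"
    if x >= high_cutoff:
        return "significantly high"
    return "not significant"


print(round(mu, 3), round(sigma, 3), classify(1))
```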

Displaying and Describing Data

A symmetric distribution has roughly the same shape reflected around the center. A skewed distribution extends farther on one side than on the other. A unimodal distribution has a single major hump. A bimodal distribution has two humps. Multimodal distributions have more than two humps. Outliers are values that lie far from the rest of the data.

 

The mean is the sum of the values divided by the count. The median is the middle value. Half the values are above and half the values are below the median. The mean and median may differ because of outliers or skewness. For a roughly symmetric distribution with no outliers, the mean and median should be almost the same.

 

The standard deviation is roughly the square root of the average squared difference between each data value and the mean. It is the summary of choice for the spread of unimodal, symmetric variables. The IQR is the difference between the third and first quartiles. It is the preferred summary of spread for skewed distributions or data with outliers. 

 

Area Principle

In a statistical display, each data value should be represented by the same amount of area.

 

Frequency Table

A frequency table lists the categories in a categorical variable and gives the count of observations for each category.

 

Distribution

The distribution of a categorical variable gives the possible values of the variable and the relative frequency of each value.
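A frequency table and the corresponding distribution can be built with `collections.Counter`. The survey responses below are hypothetical data, chosen only for illustration.

```python
from collections import Counter

# Hypothetical categorical data, not from the notes.
responses = ["yes", "no", "yes", "maybe", "yes", "no"]

counts = Counter(responses)  # frequency table: category -> count
n = len(responses)
relative = {category: count / n for category, count in counts.items()}

# Print category, count, and relative frequency, most common first.
for category, count in counts.most_common():
    print(f"{category:<6} {count:>3} {relative[category]:.3f}")
```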

 

Bar Chart

Bar charts show a bar whose area represents the count of observations for each category of a categorical variable.

 

Pie Chart

Pie charts show how a whole is divided into categories. The area of each wedge of the circle corresponds to the proportion in each category.

 

Histogram

A histogram uses adjacent bars to show the distribution of a quantitative variable. Each bar represents the frequency of values falling in each bin.

 

Gap

A region of the distribution where there are no values.

 

Stem and Leaf Display

A display that shows quantitative data values in a way that sketches the distribution of the data.

 

Dotplot

A dotplot graphs a dot for each case along a single axis.

 

Density Plot

A density plot shows the shape of a variable’s distribution by smoothing out its histogram to make a gentle curve.

 

Shape

To describe the shape of a distribution, look for single versus multiple modes, symmetry versus skewness, and outliers versus gaps.

 

Mode

A hump or local high point in the distribution of a variable. The apparent location of modes can change as the scale of a histogram is changed.

 

Uniform

A distribution that does not appear to have any mode and in which all the bars of its histogram are approximately the same height.

 

Symmetric

A distribution is symmetric if the two halves on either side of the center look approximately like mirror images of each other.

 

Tails

The parts of a distribution that trail off on either side. Distributions can be characterized as having long tails or short tails.

 

Skewed

A distribution is skewed if it’s not symmetric and one tail stretches out farther than the other. Distributions are said to be skewed left when the longer tail stretches to the left, and skewed right when it goes to the right.

 

Outlier

Outliers are extreme values that don’t appear to belong with the rest of the data. They may be unusual values that deserve further investigation or they may just be mistakes.

 

Center

The place in the distribution of a variable that you would point to if you wanted to attempt the impossible by summarizing the entire distribution with a single number. Measures of the center include the mean and median.

 

Median

The median is the middle value, with half the data above and half below it. If n is even, it is the average of the two middle values. It is usually paired with the IQR.

 

Mean

The mean is found by adding up all the data values and dividing by the count.

 

Spread

A numerical summary of how tightly the values are clustered around the center. Measures of spread include the IQR and standard deviation.

 

Range

The difference between the lowest and highest value in a dataset.

 

Quartile

The lower quartile Q1 is the value with a quarter of the data below it. The upper quartile Q3 has three quarters of the data below it. The median and quartiles divide the data into four parts with approximately equal numbers of data values.

 

Percentile

The ith percentile is the number that falls above i% of the data.

 

IQR - Interquartile Range

The IQR is the difference between the first and third quartiles, so Q3-Q1. It is usually reported along with the median.
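
The quartile and IQR definitions can be sketched in Python as follows. The data are hypothetical, and the quartile convention is whatever statistics.quantiles uses by default; other software may use a slightly different rule and report slightly different quartiles.

```python
from statistics import quantiles

# Hypothetical data, already sorted so the quartiles are easy to check by eye.
values = [1, 2, 3, 4, 5, 6, 7, 8, 9, 10, 11]

# quantiles(..., n=4) returns the three cut points Q1, median, Q3.
q1, q2, q3 = quantiles(values, n=4)

iqr = q3 - q1  # spread of the middle half of the data

print(q1, q2, q3, iqr)
```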

 

Least Squares Property

A statistic has the least squares property if the sum of the squared deviations of the data values from that statistic is as small as it could be for any single summary value. The mean has this property.
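
The sketch below checks the least squares property numerically for the mean. The data and the handful of alternative summary values are made up, so this is an illustration rather than a proof.

```python
from statistics import mean

# Hypothetical data values, for illustration only.
values = [2, 4, 4, 5, 7, 9, 40]

def sum_squared_deviations(center):
    """Sum of squared deviations of the data from a candidate summary value."""
    return sum((v - center) ** 2 for v in values)

m = mean(values)
at_mean = sum_squared_deviations(m)

# Each of these other candidate summaries gives a larger total.
others = [sum_squared_deviations(c) for c in (m - 1, m + 1, 5, 0)]

print(at_mean, others)
```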

 

Residuals

A residual is the difference between an observed data value and some summary or model for that value.

 

Variance

The variance is the sum of squared deviations from the mean, divided by the count minus 1.

 

Standard Deviation

The standard deviation is the square root of the variance.
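
To make the variance and standard deviation definitions concrete, here is a short Python sketch that computes them by hand and compares the results with the standard library functions. The data values are hypothetical.

```python
from math import sqrt
from statistics import mean, stdev, variance

# Hypothetical data values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

n = len(values)
y_bar = mean(values)

# Sum of squared deviations from the mean, divided by the count minus 1.
manual_variance = sum((y - y_bar) ** 2 for y in values) / (n - 1)

# The standard deviation is the square root of the variance.
manual_sd = sqrt(manual_variance)

print(manual_variance, variance(values))
print(manual_sd, stdev(values))
```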

 

Bar Chart In Excel

First make a pivot table, which is Excel’s name for a frequency table. From the data menu, choose Pivot table and Pivot Chart Report. When you reach the layout window, drag your variable to the row area and drag your variable again to the data area. This tells Excel to count the occurrences of each category. Once you have an Excel pivot table, you can construct bar charts and pie charts. 

 

Compute Average in Excel

Click inside the Pivot table. Click the Pivot table chart wizard button. Excel creates a bar chart. To compute the mean, click on an empty cell. Go to the Formulas tab in the ribbon. Click the drop down arrow next to Auto-Sum and choose Average. Enter the data range in the formula displayed in the empty cell you selected earlier. Press enter and this will compute the mean for the values in that range.

 

Compute Standard Deviation in Excel

To compute the standard deviation, click on an empty cell. Go to the Formulas tab in the ribbon and click the drop down arrow next to Auto-sum and select More Functions. In the dialog box that opens, select STDEV from the list of functions and click Ok. A new dialog box opens. Enter the data range into the text fields and click Ok. Excel computes the standard deviation for the values in that range and places it in the specified cell of the spreadsheet.

Relationships Between Categorical Variables

When we want to see how two categorical variables are related, put the counts in a two-way table called a contingency table. Look at the marginal distribution of each variable. Also look at the conditional distribution of a variable within each category of the other variable. Comparing conditional distributions of one variable across categories of another tells us about the association between variables. If the conditional distributions of one variable are roughly the same for every category of the other, the variables are independent. Consider a third variable whenever it is appropriate, and be able to describe the relationships among the three variables.

 

Contingency Table

A contingency table displays counts and sometimes percentages of individuals falling into named categories on two or more variables. The table categorizes the individuals on all variables at once to reveal possible patterns in one variable that may be contingent on the category of the other.

 

Marginal Distribution

In a contingency table, the distribution of either variable alone is called the marginal distribution. The counts or percentages are the totals found in the margins of the table.

 

Table Percents

When a cell of a contingency table holds percents, these can be percents of the total in the row of that cell, the total in its column, or the total for the whole table. These are row, column, and table percents.

 

Conditional Distribution

The distribution of a variable when the Who (the set of individuals under study) is restricted to a smaller group of individuals is called a conditional distribution.

 

Independence
Variables are said to be independent if the conditional distribution of one variable is the same for each category of the other.   
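
The ideas of contingency table, marginal distribution, conditional distribution, and independence can be sketched in a few lines of Python. The records below are invented for the example, and the names group and answer are placeholders, not variables from any real study.

```python
from collections import Counter

# Hypothetical survey records: (group, answer). Made up for illustration.
records = (
    [("A", "yes")] * 30 + [("A", "no")] * 20
    + [("B", "yes")] * 15 + [("B", "no")] * 35
)

# Cells of the contingency table: counts for each (group, answer) pair.
cell_counts = Counter(records)

# Marginal distributions: totals for each variable on its own.
row_totals = Counter(group for group, _ in records)
col_totals = Counter(answer for _, answer in records)

# Conditional distribution of answer within each group (row proportions).
conditional = {
    group: {
        answer: cell_counts[(group, answer)] / row_totals[group]
        for answer in col_totals
    }
    for group in row_totals
}

print(conditional)
```

Here 60% of group A but only 30% of group B answered yes, so the conditional distributions differ and the two variables are not independent in this made-up table.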

 

Segmented Bar Chart

A segmented bar chart displays the conditional distribution of a categorical variable within each category of another variable.

 

Mosaic Plot

A mosaic plot is a graphical representation of a contingency table. The plot is divided into rectangles so that the area of each rectangle is proportional to the number of cases in the corresponding cell.

 

Simpson’s Paradox

When averages are taken across different groups, they can appear to contradict the overall averages. 
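
A small numerical sketch can make the paradox easier to see. The counts below are invented purely to produce the effect: option X has the higher success rate within each group, yet option Y has the higher rate when the groups are pooled.

```python
# Hypothetical (successes, attempts) for two options, X and Y, in two groups.
results = {
    "easy": {"X": (90, 100), "Y": (800, 1000)},
    "hard": {"X": (300, 1000), "Y": (20, 100)},
}

def rate(successes, attempts):
    """Success rate as a fraction."""
    return successes / attempts

# Within every group, X has the higher success rate.
x_wins_each_group = all(
    rate(*group["X"]) > rate(*group["Y"]) for group in results.values()
)

def overall(option):
    """Success rate for an option after pooling all groups."""
    successes = sum(group[option][0] for group in results.values())
    attempts = sum(group[option][1] for group in results.values())
    return successes / attempts

# Pooled over the groups, Y comes out ahead.
y_wins_overall = overall("Y") > overall("X")

print(x_wins_each_group, y_wins_overall)
```

The reversal happens because X was mostly tried on the hard group and Y mostly on the easy group, so the group acts as a lurking variable.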

 

Lurking Variables

A lurking variable is one that is not immediately evident in an analysis, but changes the apparent relationships among the variables being studied.

 

Contingency Tables in Excel

Excel calls contingency tables Pivot Tables. To make a pivot table, from the Data menu, choose pivot table. In the layout window, drag your variables to the row area, the column area, and drag your variable again to the data area. This tells Excel to count the occurrences of each category. 

 

Contingency Tables in R

Using the function xtabs, you can create a contingency table from two variables x and y in a data frame called mydata with the command:

con.table=xtabs(~x+y,data=mydata)

Comparing Distributions in Statistics

 

Displays For Comparing Groups

It is almost always more interesting to compare groups than to summarize data for a single group. There are several ways to summarize a variable. The median and quartiles are suitable even for data that may be skewed or have outliers, and they are usually used together. Along with these three values, we can report the max and min values. These five values (min, lower quartile, median, upper quartile, and max) together make up the 5-number summary of the data. It is a useful, concise summary because it gives a good idea of the center, spread, and range. 
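
A 5-number summary can be sketched in Python as below. The helper name five_number_summary and the data are made up for this example, and the quartiles follow the default convention of statistics.quantiles, which may differ slightly from other software.

```python
from statistics import quantiles

# Hypothetical data values, for illustration only.
values = [12, 15, 17, 20, 22, 25, 28, 31, 35, 40, 90]

def five_number_summary(data):
    """Return (min, Q1, median, Q3, max) for a list of numbers."""
    q1, med, q3 = quantiles(data, n=4)
    return min(data), q1, med, q3, max(data)

summary = five_number_summary(values)
print(summary)
```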

 

A boxplot highlights several features of the distribution. The central box shows the middle half of the data, between the quartiles. The height of the box is equal to the IQR. If the median is roughly centered between the quartiles, then the middle half of the data is roughly symmetric. If the median is not centered, the distribution is skewed. The whiskers show skewness as well if they are not roughly the same length. 

 

Histograms or stem and leaf displays are good for single distributions but not for comparing many groups, say 20, at once. It would be hard to see patterns. By placing boxplots side by side, you can easily see which groups have higher medians, which have greater IQRs, where the central 50% of the data is located in each group, and which have the greater overall range. 

 

Outliers

Outliers arise for many reasons. They may be the most important values in the dataset or they may be errors. An outlier could be an exceptional case, or it could illuminate a pattern by being the exception to the rule. Many outliers are not wrong, they are just different. Most repay the effort to understand them. You can sometimes learn more from extraordinary cases than from summaries of the entire dataset. 

 

There are two things you should never do with outliers. You should not leave an outlier in place and proceed as if nothing happened. Analyses of data with outliers are very likely to be wrong. The other is to omit an outlier from the analysis without comment. A histogram is often a better way to see more detail about how the outlier fits in or doesn’t fit at all.  

 

Timeplots

A display of values against time is called a timeplot. Timeplots often show a great deal of point to point variation. We usually want to see past this variation to understand any underlying smooth trends. We also want to think about how the values vary around that trend, the timeplot version of center and spread. 
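
One simple way to see past point to point variation is a moving average. The sketch below is only one of many possible smoothers, and the series and window length are made up for the example.

```python
# Hypothetical values recorded at equally spaced times.
series = [10, 14, 9, 15, 13, 18, 12, 19, 17, 22]

def moving_average(values, window):
    """Average each run of `window` consecutive values."""
    return [
        sum(values[i:i + window]) / window
        for i in range(len(values) - window + 1)
    ]

smoothed = moving_average(series, window=3)
print(smoothed)
```

The smoothed values jump around less than the raw series, which makes the upward trend easier to see.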

 

Re-Expressing Data

When data are skewed, it can be hard to summarize them simply with a center and spread, and hard to decide whether the most extreme values are outliers or just part of the stretched out tail. To re-express the data means to transform them by applying a simple function, such as a square root or logarithm, to each value in order to make the skewed distribution more symmetric. Variables that are skewed to the right often benefit from a re-expression by square roots, logs, or reciprocals. Those skewed to the left may benefit from squaring the data. Re-expressing can also help alleviate the problem of comparing groups that have very different spreads.
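
Here is a minimal sketch of re-expression with a logarithm. The values are hypothetical and were chosen so that the effect is exaggerated and easy to check; real data will rarely become this tidy.

```python
from math import log10
from statistics import mean, median

# Hypothetical right-skewed values, chosen so the effect is easy to see.
values = [1, 10, 100, 1000, 10000]

# Re-express each value with a base-10 logarithm.
logged = [log10(v) for v in values]

print(mean(values), median(values))  # mean far above the median: skewed right
print(mean(logged), median(logged))  # mean and median agree: symmetric
```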

 

Choose the right tool for comparing distributions. Compare the distributions of two or three groups with histograms. Compare several groups with boxplots, which make it easy to compare centers and spreads and spot outliers, but hide much of the detail of distribution shape.

 

Treat outliers with attention and care. Outliers are nominated by the boxplot rule, but you must decide what to do with them. Track down the background for outliers, because it may be informative. 

 

Re-express data to make them easier to work with. Re-expression can make skewed distributions more nearly symmetric. Re-expression can make the spreads of different groups more nearly comparable. 

 

Outlier

Values that are large or small compared to most of the other values in a variable. Whether they are outliers is a judgement call that depends on the context. A boxplot displays values more than 1.5 IQR’s beyond the nearest quartile as potential outliers, but that is a rule of thumb, not a universal definition of an outlier.

 

5-Number Summary

A summary of a variable’s distribution that consists of the extremes, the quartiles, and the median.

 

Boxplot

A display of a box between the quartiles and whiskers extending to the highest and lowest values not nominated as outliers. 

 

Far Outlier

In a boxplot a value more than 3 IQR’s beyond the nearest quartile. Such values deserve special attention.
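
The boxplot rules for nominating outliers and far outliers can be sketched as follows. The data and the helper name beyond are made up for the example, and the quartiles use the default convention of statistics.quantiles, so other software may place the fences slightly differently.

```python
from statistics import quantiles

# Hypothetical data values, for illustration only.
values = [12, 15, 17, 20, 22, 25, 28, 31, 35, 40, 90]

q1, _, q3 = quantiles(values, n=4)
iqr = q3 - q1

def beyond(value, k):
    """True if value lies more than k IQRs beyond the nearest quartile."""
    return value < q1 - k * iqr or value > q3 + k * iqr

outliers = [v for v in values if beyond(v, 1.5)]
far_outliers = [v for v in values if beyond(v, 3)]

# Whiskers reach the most extreme values that are not nominated as outliers.
inside = [v for v in values if not beyond(v, 1.5)]
whiskers = (min(inside), max(inside))

print(outliers, far_outliers, whiskers)
```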

 

Timeplot

A timeplot displays data that change over time. Often, successive values are connected with lines to show trends more clearly. Sometimes, a smooth curve is added to the plot to help show long-term patterns and trends.

 

Re-Express

This is another name for transform. The structure of data may be improved by working with a simple function of the data. The logarithm, square root, and reciprocal are the most common re-expression functions.

Standard Deviation 

Expressing a distance from the mean in standard deviations standardizes the performances. To standardize a value, we subtract the mean and then divide this difference by the standard deviation.

\[z = \frac{y - \bar{y}}{s}\]

 

Standardizing Values

The values are called standardized values, and are commonly denoted with the letter z. Usually, we call them z-scores. Z-scores measure the distance of a value from the mean in standard deviations. A z-score of 2 says that a data value is 2 standard deviations above the mean. Data values below the mean have a negative z-score, so a z-score of -1.6 means that the data value was 1.6 standard deviations below the mean.
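
As a quick sketch, a z-score can be computed with a one-line function. The mean of 70 and standard deviation of 5 below are hypothetical, picked so that the results match the z-scores of 2 and -1.6 mentioned above.

```python
def z_score(y, y_bar, s):
    """Number of standard deviations that y lies from the mean y_bar."""
    return (y - y_bar) / s

# Hypothetical summary values, for illustration only.
y_bar, s = 70, 5

z_above = z_score(80, y_bar, s)  # 2 standard deviations above the mean
z_below = z_score(62, y_bar, s)  # 1.6 standard deviations below the mean

print(z_above, z_below)
```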

 

There are two steps to finding a z-score. First, the data are shifted by subtracting the mean. Then, they are rescaled by dividing by the standard deviation. Adding or subtracting a constant to every data value adds or subtracts the same constant to measures of position, but leaves measures of spread unchanged.

 

When we multiply or divide all the data values by any positive constant, all measures of position such as the median, mean, and percentiles, and all measures of spread such as the range, IQR, and standard deviation, are multiplied or divided by that same constant. 
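
The effect of shifting and rescaling on the summaries can be checked with a short Python sketch. The data, the shift of 10, and the scale factor of 3 are all made up for the example.

```python
from statistics import mean, median, stdev

# Hypothetical data values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

shifted = [y + 10 for y in values]  # add a constant to every value
scaled = [y * 3 for y in values]    # multiply every value by a constant

# Shifting moves the mean and median by 10 but leaves the spread alone.
print(mean(values), median(values), stdev(values))
print(mean(shifted), median(shifted), stdev(shifted))

# Rescaling multiplies the mean, median, and standard deviation by 3.
print(mean(scaled), median(scaled), stdev(scaled))
```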

 

Shifting and Scaling Values

Standardizing data into z-scores is just shifting them by the mean and rescaling them by the standard deviation. Now we can see how standardizing affects the distribution. When we subtract the mean of the data from every data value, we shift the mean to zero. As we have seen, such a shift does not change the standard deviation. 

 

When we divide each of these shifted values by s, the standard deviation is divided by s as well. Since the standard deviation was s to start with, the new standard deviation becomes 1. Standardizing into z-scores does not change the shape of the distribution of a variable. Standardizing into z-scores changes the center by making the mean 0. Standardizing into z-scores changes the spread by making the standard deviation 1.
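
To check these claims on a small example, the sketch below standardizes a hypothetical dataset and confirms that the z-scores have mean 0 and standard deviation 1, up to floating point rounding.

```python
from statistics import mean, stdev

# Hypothetical data values, for illustration only.
values = [2, 4, 4, 4, 5, 5, 7, 9]

y_bar = mean(values)
s = stdev(values)

# Shift by the mean, then rescale by the standard deviation.
z_scores = [(y - y_bar) / s for y in values]

print(mean(z_scores), stdev(z_scores))
```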

 

Normal Models

A z-score gives an indication of how unusual a value is because it tells how far it is from the mean. If the data value sits right at the mean, it’s not very far at all and its z-score is 0. A z-score of 1 tells us that the data value is 1 standard deviation above the mean, while a z-score of -1 tells us that the value is 1 standard deviation below the mean.

 

For many unimodal and symmetric distributions, about 68% of the values fall within one standard deviation of the mean, about 95% fall within two standard deviations, and about 99.7%, or almost all of the values, fall within three standard deviations. 

 

In 1809, Gauss figured out the formula for the model that accounts for this observation. It is called the Normal or Gaussian model. It illustrates one of the most important uses of the standard deviation. The standard deviation is the statistician’s ruler. This model for unimodal symmetric data gives us even more information because it tells us how likely it is to have z-scores between -1 and 1, between -2 and 2, and between -3 and 3.

 

These magic 68, 95, and 99.7 values come from the Normal model. As a model, it can give us corresponding values for any z-score.
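
The 68, 95, and 99.7 values can be recovered from the Normal model with a short Python sketch. It uses NormalDist from the standard library; the helper name fraction_within is made up for this example.

```python
from statistics import NormalDist

standard_normal = NormalDist(mu=0, sigma=1)

def fraction_within(k):
    """Fraction of a Normal model within k standard deviations of the mean."""
    return standard_normal.cdf(k) - standard_normal.cdf(-k)

for k in (1, 2, 3):
    print(k, round(fraction_within(k), 4))
```

The same function works for any z-score, not just 1, 2, and 3.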

 

We write \(N(\mu, \sigma)\) to represent a Normal model with mean \(\mu\) and standard deviation \(\sigma\). N always denotes a Normal model. The mu symbol is the Greek letter for m and always represents the mean in a model. The sigma character is the lowercase Greek letter for s and always represents the standard deviation in a model. The mean and standard deviation are not numerical summaries of data. They are characteristics of the model called parameters. Parameters are the values we choose that completely specify a model. We do not want to confuse the parameters with summaries of the data so we use special symbols. In statistics, we almost always use Greek letters for parameters. Summaries of data, like the sample mean, median, or standard deviation, are called statistics and are usually written with Latin letters.

 

If we model data with a Normal model and standardize them using the corresponding mu and sigma, we still call the standardized value a z-score.

\[z = \frac{y - \mu}{\sigma}\]

Usually, it is easier to standardize data using the mean and standard deviation first. Then we only need the model with mean 0 and standard deviation 1. This Normal model is called the Standard Normal model.

 

Notice how well the 68-95-99.7 rule works when the distribution is unimodal and symmetric. Careful though, you should not use the Normal model for just any dataset. Standardizing will not change the shape of the distribution. If the distribution is not unimodal and symmetric to begin with, standardizing will not make it Normal.

 

All models make assumptions. Whenever we model, we will be careful to point out the assumptions that we are making. We will also check the associated conditions in the data to make sure that those assumptions are reasonable. So, do not use a Normal model without first checking whether the data are nearly Normal: the shape of the data’s distribution should be unimodal and symmetric, with no obvious outliers. 

 

It is important to sketch a Normal curve that looks Normal. The Normal curve is bell-shaped and symmetric around its mean. Start at the middle and sketch to the right and left from there. Even though the Normal model extends forever on either side, you need to draw it only for 3 standard deviations on either side of the mean. After that, there is little left that is worth sketching. The place where the bell shape changes from curving downward to curving back up, or inflection point, is exactly one standard deviation away from the mean. 

 

Normal Percentiles

When a value does not fall exactly one, two, or three standard deviations from the mean, we need to find the percentiles. Mathematically, the percentage of values falling between two z-scores is the area under the Normal model between those values. So, the Normal percentile for a z-score is the percentage of values in a standard Normal distribution found at that z-score or below. 

 

Finding areas from z-scores is the simplest way to work with the Normal model. But sometimes we start with areas and are asked to work backward to find the corresponding z-score or even the original data value. 
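
Both directions, from a z-score to an area and from an area back to a z-score or data value, can be sketched with NormalDist from the Python standard library. The model with mean 500 and standard deviation 100 is a hypothetical example.

```python
from statistics import NormalDist

standard_normal = NormalDist()  # mean 0, standard deviation 1

# From a z-score to an area: fraction of values at or below z = 1.
area_below = standard_normal.cdf(1.0)

# From an area back to a z-score: the z-score at the 90th percentile.
z_90 = standard_normal.inv_cdf(0.90)

# Back to the original scale, assuming a hypothetical model N(500, 100).
mu, sigma = 500, 100
value_90 = mu + z_90 * sigma

print(round(area_below, 4), round(z_90, 4), round(value_90, 1))
```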

 

Normal Probability Plots

We have assumed that the underlying data distribution was roughly unimodal and symmetric so that using a Normal model is reasonable. Drawing a histogram of the data and looking at the shape is one good way to see whether a Normal model might work. 

 

However, there is a more specialized graphical display that can help you to decide whether a Normal model is appropriate, the Normal probability plot. If the distribution of the data is roughly Normal, the plot will be roughly a diagonal straight line. Deviations from a straight line indicate that the distribution is not Normal. This plot is usually able to show deviations from Normality more clearly than the corresponding histogram, but it is usually easier to understand how a distribution fails to be Normal by looking at its histogram. 

 

A Normal probability plot takes each data value and plots it against the z-score you would expect that point to have if the distribution were perfectly Normal. When the values match up well, the line is straight. If one or two points are surprising from the Normal’s point of view, they do not line up. When the entire distribution is skewed or different from the Normal in some other way, the values do not match up very well at all and the plot bends. 

 

It turns out to be tricky to find the values we expect. They are called Normal scores, but you cannot easily look them up in tables. That is why probability plots are best made with technology and not by hand. The best advice on using probability plots is to see whether they are straight. If so, then your data look like data from a Normal model. If not, make a histogram to understand how they differ from the model. 
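
To give a feel for what the technology does, here is a rough sketch of approximate Normal scores in Python. It assumes the plotting positions (i - 0.5) / n, which is only one of several conventions, so statistical software may produce slightly different scores. The data values are made up.

```python
from statistics import NormalDist

# Hypothetical data values, for illustration only.
values = [4.1, 5.0, 5.6, 6.2, 6.9, 7.4, 8.3]

def normal_scores(n):
    """Approximate expected z-scores for n sorted values from a Normal model.

    Uses the plotting positions (i - 0.5) / n, an assumed convention;
    statistical software may use a different formula.
    """
    standard_normal = NormalDist()
    return [standard_normal.inv_cdf((i - 0.5) / n) for i in range(1, n + 1)]

# Pair each sorted data value with its Normal score. Plotting these pairs
# gives the Normal probability plot.
points = list(zip(normal_scores(len(values)), sorted(values)))

for score, value in points:
    print(round(score, 2), value)
```

If the pairs fall close to a straight line when plotted, the data look like data from a Normal model.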

 

Changing the spread and center of a variable is equivalent to changing the units. Indeed, the only part of the data’s context changed by standardizing is the units. All other aspects of the context do not depend on the choice or modification of measurement units. This fact points out an important distinction between the numbers the data provide for calculation and the meaning of the variables and the relationships among them. Standardizing can make the numbers easier to work with, but it does not alter the meaning.

 

Another way to look at this is to note that standardizing may change the center and spread values, but it does not affect the shape of a distribution. A histogram or boxplot of standardized values looks just the same as the histogram or boxplot of the original values except for the numbers on the axes. When we summarized shape, center, and spread for histograms, we compared them to unimodal, symmetric shapes. You could not ask for a nicer example than the Normal model. If the shape is like a Normal, we will use the mean and standard deviation to standardize the values.