Probability Distributions in Statistics
These are my notes on discrete probability distributions in statistics.
Basic Concepts
A random variable is a variable that has a single numeric value, determined by chance, for each outcome of a procedure.
A probability distribution is a description that gives the probability for each value of the random variable. It is often expressed in the format of a table, formula, or graph.
A discrete random variable has a collection of values that is finite or countable. If there are infinitely many values, the number of values is countable if it is possible to count them individually, such as the number of tosses of a coin before getting heads.
A continuous random variable has infinitely many values, and the collection of values is not countable. That is, it is impossible to count the individual items because at least some of them are on a continuous scale, such as body temperatures.
Probability Distribution Requirements
Every probability distribution must satisfy each of the following three requirements.
- There is a numerical random variable, and its numerical values are associated with corresponding probabilities.
- \(\Sigma P(x)=1\) where x assumes all possible values.
- \(0 \leq P(x) \leq 1\) for every individual value of the random variable x. That is, each probability value must be between 0 and 1 inclusive.
The second requirement comes from the simple fact that the random variable x represents all possible events in the entire sample space, so we are certain that one of the events will occur. The third requirement comes from the basic principle that any probability value must be 0 or 1 or a value between 0 and 1.
The above x variable is a random variable because its numerical values depend on chance. The variable x is a numerical random variable, and its values are associated with probabilities. \(\Sigma P(x)=.25+.50+.25=1\). Each value of P(x) is between 0 and 1. The random variable x is a discrete random variable, because it has three possible values and three is a finite number.
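The three requirements can also be checked mechanically. The following is a minimal sketch, not part of the original notes: the function name `is_probability_distribution`, the tolerance parameter `tol`, and the x values 0, 1, 2 paired with the probabilities .25, .50, .25 are all assumptions made for illustration.

```python
import math

def is_probability_distribution(values, probs, tol=1e-9):
    """Return True if the table (values, probs) meets the three requirements."""
    # Requirement 1: numerical values, each paired with a probability.
    if len(values) != len(probs) or len(values) == 0:
        return False
    if not all(isinstance(v, (int, float)) for v in values):
        return False
    # Requirement 3: every probability lies between 0 and 1 inclusive.
    if not all(0 <= p <= 1 for p in probs):
        return False
    # Requirement 2: the probabilities sum to 1 (within a rounding tolerance).
    return math.isclose(sum(probs), 1.0, abs_tol=tol)

# Hypothetical x values paired with the probabilities quoted above.
print(is_probability_distribution([0, 1, 2], [0.25, 0.50, 0.25]))  # True
print(is_probability_distribution([0, 1, 2], [0.25, 0.50, 0.50]))  # False
```

Published tables are usually rounded, so their probabilities may sum to something like .999 or 1.001. In that case a looser tolerance, such as `tol=0.005`, may be more appropriate.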
Notation for 0+
In tables of binomial probabilities, we recommend using 0+ to represent a probability value that is positive but very small, such as .0000000123. When rounding a probability value for inclusion in such a table, rounding to 0 would be misleading because it would incorrectly suggest that the event is impossible.
Probability Histogram
There are various ways to graph a probability distribution, but for now we will consider only the probability histogram.
Parameters of a Probability Distribution
Remember that with a probability distribution, we have a description of a population instead of a sample, so the values of the mean, standard deviation, and variance are parameters, not statistics. The mean, variance, and standard deviation of a discrete probability distribution can be found with the following formulas.
This is the mean for a probability distribution:
\[ \mu = \sum [x * P(x)] \]
Variance for a probability distribution that should be easier to understand:
\[\sigma^2 = \Sigma[(x - \mu)^2 * P(x)] \]
Variance for probability distribution that is good for manual calculations:
\[\sigma^2 = \Sigma[x^2*P(x)] - \mu^2 \]
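The two variance formulas are algebraically equivalent. A short derivation is sketched here, using only \(\Sigma P(x)=1\) and \(\mu=\Sigma[x*P(x)]\):

\[
\begin{aligned}
\sigma^2 &= \Sigma[(x-\mu)^2 * P(x)] \\
&= \Sigma[x^2*P(x)] - 2\mu\,\Sigma[x*P(x)] + \mu^2\,\Sigma P(x) \\
&= \Sigma[x^2*P(x)] - 2\mu^2 + \mu^2 \\
&= \Sigma[x^2*P(x)] - \mu^2
\end{aligned}
\]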
Standard deviation for probability distribution:
\[\sigma = \sqrt{\Sigma[x^2*P(x)] - \mu^2}\]
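As an illustration, the formulas above translate almost directly into code. This is a minimal sketch; the function names `mean`, `variance`, `variance_shortcut`, and `std_dev` are hypothetical helpers chosen for this example, and the data is the three-births table that appears in Example 1.

```python
import math

def mean(values, probs):
    # mu = sum of x * P(x)
    return sum(x * p for x, p in zip(values, probs))

def variance(values, probs):
    # Definitional form: sum of (x - mu)^2 * P(x)
    mu = mean(values, probs)
    return sum((x - mu) ** 2 * p for x, p in zip(values, probs))

def variance_shortcut(values, probs):
    # Shortcut form: sum of x^2 * P(x), minus mu^2
    mu = mean(values, probs)
    return sum(x ** 2 * p for x, p in zip(values, probs)) - mu ** 2

def std_dev(values, probs):
    # Square root of the variance (shortcut form)
    return math.sqrt(variance_shortcut(values, probs))

# Number of girls in three births (the table used in Example 1).
x = [0, 1, 2, 3]
p = [0.125, 0.375, 0.375, 0.125]

mu = mean(x, p)
print(mu)                                       # 1.5
print(variance(x, p), variance_shortcut(x, p))  # 0.75 0.75
print(round(std_dev(x, p), 3))                  # 0.866
```

Both variance functions give the same result here, which matches the claim that the second formula is only a more convenient form of the first.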
Expected Value
The mean of a discrete random variable is the theoretical mean outcome for infinitely many trials. We can think of that mean as the expected value in the sense that it is the average value that we would expect to get if the trials could continue indefinitely.
The expected value of a discrete random variable is denoted by E, and it is the mean value of the outcomes, so \(E=\mu\) and E can also be found by evaluating \(\Sigma[x*P(x)]\).
An expected value need not be a whole number, even if the different possible values of x might all be whole numbers. The expected number of girls in five births is 2.5, even though five particular children can never result in 2.5 girls. If we were to survey many couples with 5 children, we expect that the mean number of girls will be 2.5.
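The figure of 2.5 girls can be reproduced with a short calculation. The sketch below assumes that each birth is a girl with probability 1/2, independently of the others, so that \(P(x)\) is the number of ways to choose x girls among 5 births divided by \(2^5\). That assumption is mine, not something stated in the notes.

```python
from fractions import Fraction
from math import comb

n = 5  # births
# Assumption: each birth is a girl with probability 1/2, independently,
# so P(x) = C(n, x) / 2^n.
probs = {x: Fraction(comb(n, x), 2 ** n) for x in range(n + 1)}

# E = sum of x * P(x)
expected = sum(x * p for x, p in probs.items())
print(expected)         # 5/2
print(float(expected))  # 2.5
```

The expected value 5/2 is not one of the possible values 0 through 5, which is exactly the point made above.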
Making Sense of Significant Values
We present the following two different approaches for determining whether a value of a random variable is significantly low or high.
Range Rule of Thumb
The range rule of thumb may be helpful in interpreting the value of a standard deviation. According to the range rule of thumb, the vast majority of values should lie within 2 standard deviations of the mean, so we can consider a value to be significant if it is at least 2 standard deviations away from the mean. We can identify significant values as follows:
- Significantly low values are \(\mu-2\sigma\) or lower
- Significantly high values are \(\mu+2\sigma\) or higher
- Values not significant are between the previous two conditions
Note that the use of the number 2 in the range rule of thumb is somewhat arbitrary and this is a guideline, not an absolutely rigid rule.
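The three conditions of the range rule of thumb can be sketched as a small function. The name `classify_by_range_rule` and the example distribution with mean 50 and standard deviation 5 are hypothetical, chosen only to illustrate the cutoffs.

```python
def classify_by_range_rule(value, mu, sigma):
    """Label a value using the mu +/- 2*sigma cutoffs of the range rule of thumb."""
    low_cutoff = mu - 2 * sigma
    high_cutoff = mu + 2 * sigma
    if value <= low_cutoff:
        return "significantly low"
    if value >= high_cutoff:
        return "significantly high"
    return "not significant"

# Hypothetical distribution with mean 50 and standard deviation 5,
# so the cutoffs are 40 and 60.
print(classify_by_range_rule(38, 50, 5))  # significantly low
print(classify_by_range_rule(55, 50, 5))  # not significant
print(classify_by_range_rule(60, 50, 5))  # significantly high
```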
Identifying Significant Results With Probabilities
x successes among n trials is a significantly high number of successes if the probability of x or more successes is .05 or less. That is, x is a significantly high number of successes if \(P(x \text{ or more}) \leq .05\).
x successes among n trials is a significantly low number of successes if the probability of x or fewer successes is .05 or less. That is, x is a significantly low number of successes if \(P(x \text{ or fewer}) \leq .05\).
The Rare Event Rule For Inferential Statistics
If, under a given assumption, the probability of a particular outcome is very small and the outcome occurs significantly less than or significantly greater than what we expect with that assumption, we conclude that the assumption is probably not correct.
For example, if testing the assumption that boys and girls are equally likely, the outcome of 20 girls in 100 births is significantly low and would be a basis for rejecting that assumption.
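To see how small the relevant probability is in this example, the sketch below computes P(20 or fewer girls in 100 births) under the assumption being tested, namely that each birth is a girl with probability 0.5, independently. The function name `prob_at_most` is a hypothetical helper for this illustration.

```python
from math import comb

def prob_at_most(k, n):
    """P(k or fewer girls in n births), assuming P(girl) = 0.5 for each birth."""
    # Count the equally likely birth sequences with at most k girls,
    # then divide by the total number of sequences, 2^n.
    return sum(comb(n, i) for i in range(k + 1)) / 2 ** n

p_low = prob_at_most(20, 100)
print(p_low < 0.05)  # True
```

The probability is far below .05, so under the assumption of equally likely boys and girls, 20 girls in 100 births would be an extremely rare outcome.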
Expected Value and Rationale for Formulas
Earlier we noted that the expected value of a random variable is equal to the mean. We can therefore find the expected value by computing \(\Sigma[x*P(x)]\), just as we do for finding the value of \(\mu\). The concept of expected value is also used in decision theory.
Rationale for Earlier Formulas
Instead of blindly accepting and using formulas, it is much better to have some understanding of why they work. When computing the mean from a frequency distribution, f represents class frequency and N represents population size. In the expression that follows, we rewrite the formula for the mean of a frequency distribution so that it applies to a population. In the fraction f/N, the value of f is the frequency with which the value x occurs and N is the population size, so f/N is the probability for the value of x. When we replace f/N with P(x), we make the transition from relative frequency based on a limited number of observations to probability based on infinitely many trials.
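Written out, the rewriting described in this paragraph would look roughly as follows. This is a reconstruction from the description, using the same symbols f, N, and P(x):

\[\mu = \frac{\Sigma(f*x)}{N} = \Sigma\left[\frac{f*x}{N}\right] = \Sigma\left[x*\frac{f}{N}\right] = \Sigma[x*P(x)]\]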
Example 1
The table below lists probabilities for the corresponding numbers of girls in three births. What is the random variable, what are its possible values, and are its values numerical?
Girls(x) P(x)
0 0.125
1 0.375
2 0.375
3 0.125
The random variable is x, which is the number of girls in three births. The possible values of x are 0, 1, 2, and 3. The values of the random variable x are numerical.
Example 2
Is the random variable given in the accompanying table discrete or continuous?
Girls(x) P(x)
0 0.063
1 0.250
2 0.375
3 0.250
4 0.063
The random variable given in the accompanying table is discrete because there are a finite number of values.
Example 3
For 100 births, P(exactly 56 girls)=0.0390 and P(56 or more girls)=0.136. Is 56 girls in 100 births a significantly high number of girls? Which probability is relevant to answering that question? Consider a number of girls to be significantly high if the appropriate probability is 0.05 or less.
The relevant probability is P(56 or more girls), so 56 girls in 100 births is not a significantly high number of girls because the relevant probability is greater than 0.05.
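The two probabilities quoted in this example can be reproduced with a short calculation. The sketch assumes that each birth is a girl with probability 0.5, independently, which the example does not state explicitly. The variable names are chosen for illustration.

```python
from math import comb

n = 100
# Assumption: P(girl) = 0.5 for each birth, so P(x girls) = C(n, x) / 2^n.
p_exactly_56 = comb(n, 56) / 2 ** n
p_56_or_more = sum(comb(n, k) for k in range(56, n + 1)) / 2 ** n

print(round(p_exactly_56, 4))  # 0.039
print(round(p_56_or_more, 3))  # 0.136
print(p_56_or_more <= 0.05)    # False
```

The tail probability is the one compared against 0.05, because the question is whether a result at least as extreme as 56 girls is unusual, not whether exactly 56 is.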
Example 4
Five males with an X-linked genetic disorder have one child each. The random variable x is the number of children among the five who inherit the X-linked genetic disorder. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If a probability distribution is not given, identify the requirements that are not satisfied.
X P(x)
0 0.024
1 0.167
2 0.309
3 0.309
4 0.167
5 0.024
The random variable x is numerical because x takes on the integer values from 0 to 5.
The number values are associated with probabilities because each value of x has a corresponding value of P(x) in the next column of the table.
The mean for a probability distribution is given by the formula below.
\[\mu = \Sigma[x*P(x)]\]
Find each product of x and P(x)
0+.167+.618+.927+.668+.12=2.5
\[\mu=2.5\]
The standard deviation for a probability distribution is given by the formula below.
\[\sigma=\sqrt{\Sigma[x^2*P(x)]-\mu^2}\]
Create another table for the new values
X^2 X^2*P(x)
0 0
1 .167
4 1.236
9 2.781
16 2.672
25 .6
Sum = 7.456
Substitute into formula
\[\sigma=\sqrt{7.456-2.5^2} \approx 1.1\]
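The hand calculation for Example 4 can be checked by building the same working columns in code. This is a minimal sketch; the variable names `x_p` and `x2_p` are assumptions that mirror the columns x*P(x) and x^2*P(x).

```python
import math

# Table from Example 4.
x_values = [0, 1, 2, 3, 4, 5]
probs = [0.024, 0.167, 0.309, 0.309, 0.167, 0.024]

# Build the working columns x*P(x) and x^2*P(x).
x_p = [x * p for x, p in zip(x_values, probs)]
x2_p = [x ** 2 * p for x, p in zip(x_values, probs)]

mu = sum(x_p)
sigma = math.sqrt(sum(x2_p) - mu ** 2)

print(round(sum(probs), 3))  # 1.0
print(round(mu, 3))          # 2.5
print(round(sum(x2_p), 3))   # 7.456
print(round(sigma, 1))       # 1.1
```

The first printed value also confirms that the probabilities sum to 1, which is the requirement that makes the table a probability distribution.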
Example 5
When conducting research on color blindness in males, a researcher forms random groups with five males in each group. The random variable x is the number of males in the group who have a form of color blindness. Determine whether a probability distribution is given. If a probability distribution is given, find its mean and standard deviation. If not, state why.
X P(x)
0 .657
1 .284
2 .053
3 .005
4 .001
5 .000
Find the mean of the random variable x
0+.284+.106+.015+.004+0=.409
Find the standard deviation of the random variable x
0+(1^2*.284)+(2^2*.053)+(3^2*.005)+(4^2*.001)+(5^2*0)=.557
\[\sigma=\sqrt{.557-.409^2} \approx .6243\]
Example 6
Look at the next table. Determine whether a probability distribution is given. If it is, find the mean and standard deviation. If not, state why.
X P(x)
0 .001
1 .009
2 .034
3 .056
Does the table show a probability distribution?
No, the sum of all the probabilities is not equal to 1
Example 7
Look at the following table.
X P(x)
0 .094
1 .347
2 .395
3 .164
Does the table show a probability distribution?
Yes, the table shows a probability distribution
Find the mean of the random variable x
(0)+(.347)+(2*.395)+(3*.164)=1.629
Find the standard deviation of x
0+.347+(4*.395)+(9*.164)=3.403
\[\sigma=\sqrt{3.403-1.629^2} \approx .8657\]
Example 8
Look at the following table
X P(x)
0 .365
1 .431
2 .178
3 .026
Does the table show a probability distribution?
Yes, the table shows a probability distribution
Find the mean of the random variable x
0+.431+(2*.178)+(3*.026)=.865
Find the standard deviation of x
0+.431+(4*.178)+(9*.026)=1.377
\[\sigma=\sqrt{1.377-.865^2} \approx .7930\]
Example 9
Look at the table below
X P(x)
0 .002
1 .035
2 .111
3 .221
4 .272
5 .211
6 .116
7 .027
8 .005
Find the mean
0+.035+(2*.111)+(3*.221)+(4*.272)+(5*.211)+(6*.116)+(7*.027)+(8*.005)=3.988
Find the standard deviation
0+.035+(2^2*.111)+(3^2*.221)+(4^2*.272)+(5^2*.211)+(6^2*.116)+(7^2*.027)+(8^2*.005)=17.914
\[\sigma=\sqrt{17.914-3.988^2} \approx 1.4\]
Example 10
The following table describes results from groups of 10 births from 10 different sets of parents. The random variable x represents the number of girls among 10 children. Use the range rule of thumb to determine whether 1 girl in 10 births is a significantly low number of girls.
X P(x)
0 .005
1 .010
2 .046
3 .113
4 .194
5 .241
6 .211
7 .111
8 .039
9 .020
10 .010
The range rule of thumb for identifying significant values is shown below.
Significantly low values are \(\mu-2\sigma\) or lower
Significantly high values are \(\mu+2\sigma\) or higher
Values between these are not significant
To find the range of values that are not significant, first find the mean and standard deviation
Let us start with the mean
0+.010+.092+.339+.776+1.205+1.266+.777+.312+.180+.100=5.057
Now find the standard deviation
0+.010+(4*.046)+(9*.113)+(16*.194)+(25*.241)+(36*.211)+(49*.111)+(64*.039)+(81*.020)+(100*.010)=28.491
\[\sigma=\sqrt{28.491-5.057^2} \approx 1.708\]
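Now find the maximum of the range of values that are not significant
Max value = \(\mu+2\sigma\)
5.1+2*1.7=8.5
Now find the minimum of the range of values that are not significant
Min value = \(\mu-2\sigma\)
5.1-2*1.7=1.7
Because 1 is less than 1.7, 1 girl in 10 births is a significantly low number of girls.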
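The whole of Example 10 can be run end to end as a check. The sketch below uses unrounded values of the mean and standard deviation, so the lower cutoff comes out near 1.6 rather than the 1.7 obtained from the rounded values 5.1 and 1.7. Either way, 1 falls at or below the lower cutoff. The variable names are chosen for illustration.

```python
import math

# Table from Example 10: number of girls among 10 children.
probs = [0.005, 0.010, 0.046, 0.113, 0.194, 0.241,
         0.211, 0.111, 0.039, 0.020, 0.010]
x_values = range(len(probs))  # 0 through 10

mu = sum(x * p for x, p in zip(x_values, probs))
sigma = math.sqrt(sum(x ** 2 * p for x, p in zip(x_values, probs)) - mu ** 2)

# Range rule of thumb cutoffs.
low_cutoff = mu - 2 * sigma
high_cutoff = mu + 2 * sigma

print(round(mu, 3), round(sigma, 3))                # 5.057 1.708
print(round(low_cutoff, 1), round(high_cutoff, 1))  # 1.6 8.5
print(1 <= low_cutoff)                              # True
```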