Joint Frequencies of Two-Way Tables

These are my notes on joint frequencies of two-way tables.


Suppose data are classified by two different criteria. If criterion 1 has r categories and criterion 2 has c categories, the classification produces a table with r rows and c columns.

 

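A table of data classified by r categories of criterion 1 and c categories of criterion 2 is known as an r × c contingency table, or two-way table. The count in each cell is a joint frequency: the number of observations that fall in that row category and that column category at the same time. The row and column totals give the marginal frequencies for the two criteria. A marginal frequency is the frequency with which a single category occurs, ignoring the other criterion.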
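As a rough illustration of the definitions above, here is a minimal sketch that tallies joint and marginal frequencies with the standard library. The survey records and category names (class year, commute mode) are made up for the example and are not from any real data set.

```python
from collections import Counter

# Hypothetical survey: each record is (criterion 1 category, criterion 2 category).
records = [
    ("freshman", "bus"), ("freshman", "bus"), ("freshman", "car"),
    ("senior", "car"), ("senior", "car"), ("senior", "bus"),
    ("senior", "car"), ("freshman", "bus"),
]

# Joint frequencies: one count per (row category, column category) cell.
joint = Counter(records)

# Marginal frequencies: the row totals and the column totals.
row_totals = Counter(row for row, _ in records)
col_totals = Counter(col for _, col in records)

rows = sorted(row_totals)
cols = sorted(col_totals)

# Print the r x c table together with its margins.
print("".ljust(10) + "".join(c.ljust(6) for c in cols) + "total")
for r in rows:
    cells = "".join(str(joint[(r, c)]).ljust(6) for c in cols)
    print(r.ljust(10) + cells + str(row_totals[r]))
footer = "".join(str(col_totals[c]).ljust(6) for c in cols)
print("total".ljust(10) + footer + str(len(records)))
```

Every joint frequency sits inside the table, and every marginal frequency sits on its edge, which is where the name "marginal" comes from.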

 

Conditional Relative Frequencies

The conditional relative frequency is the relative frequency of one category given that a category of the other criterion has occurred. It is found by dividing a joint frequency by the corresponding marginal frequency (the row or column total). Comparing conditional relative frequencies across categories helps determine whether there is an association between the two classification criteria. To measure the degree of relation between two quantitative variables, we use the concept of correlation. To measure the degree of relation between two categorical variables, we use the concept of association.
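To make the idea concrete, the sketch below divides each joint frequency by its row total. The table values and the helper name conditional_relative_frequencies are hypothetical, chosen only for illustration.

```python
# Hypothetical 2 x 2 table of joint frequencies:
# rows are class year, columns are commute mode.
table = {
    "freshman": {"bus": 3, "car": 1},
    "senior": {"bus": 1, "car": 3},
}


def conditional_relative_frequencies(table):
    """Return the relative frequency of each column category given the row."""
    result = {}
    for row, cells in table.items():
        row_total = sum(cells.values())  # marginal frequency of this row
        result[row] = {col: count / row_total for col, count in cells.items()}
    return result


cond = conditional_relative_frequencies(table)
for row, dist in cond.items():
    print(row, dist)
```

In this made-up table the conditional distributions differ sharply between the two rows (0.75 versus 0.25 for "bus"), which is the kind of difference that suggests an association. If the rows had nearly identical conditional distributions, there would be little evidence of one.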

 

Question 1

Describe a situation in which it is better to use the median as a measure of central tendency over the mean.

 

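When data are highly skewed, the values in the tail pull the mean toward them far more than they move the median. For example, a few very high incomes raise the mean income noticeably but barely change the median. Because the median is resistant to extreme values, it is often the better measure of central tendency for skewed data.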
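A small sketch of this effect, using invented income figures (in thousands) with one very large value in the right tail:

```python
import statistics

# Hypothetical right-skewed data: nine modest incomes and one very large one.
incomes = [30, 32, 35, 36, 38, 40, 41, 43, 45, 400]

mean_income = statistics.mean(incomes)      # pulled upward by the 400
median_income = statistics.median(incomes)  # average of the two middle values

print("mean:", mean_income)
print("median:", median_income)
```

Here the mean lands well above nine of the ten values, while the median stays in the middle of the bulk of the data.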

 

Question 2

Which of the three main measures of central tendency can be used with categorical data?

 

The mode can be used with both categorical (qualitative) and quantitative variables. Since categories are not numeric, we cannot calculate a mean. If the categories are ordinal, we can put them in order and find a median, but if they are not ordinal, the mode is the only one of the three measures that applies.
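The following sketch shows the mode on nominal data and, as a contrast, a median on ordinal data found by ranking the categories. The color and size lists are invented for the example.

```python
import statistics

# Nominal categories: no natural order, so only the mode applies.
colors = ["red", "blue", "blue", "green", "blue", "red"]
most_common = statistics.mode(colors)

# Ordinal categories: an order exists, so a median category can be found
# by converting each value to its rank in that order.
order = ["small", "medium", "large"]
sizes = ["small", "large", "medium", "medium", "small"]
ranks = sorted(order.index(s) for s in sizes)
median_size = order[statistics.median_low(ranks)]

print("mode of colors:", most_common)
print("median size:", median_size)
```

median_low is used so the result is always an actual rank rather than an average of two ranks, which would not correspond to any category.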

 

Question 3

Describe the strength and direction of the relationship between two variables with a correlation of -0.5.

 

Since the correlation coefficient is negative, there is an inverse (negative) relationship between X and Y: as X goes up, Y tends to go down. A value of -0.5 indicates a moderate relationship, since values close to 0 indicate little linear relation between two variables, and values close to -1 or 1 indicate nearly perfect linear relationships.
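For reference, here is a minimal sketch of how a correlation coefficient is computed from paired data. The function name pearson_r and the five data points are assumptions made for the example; the points were picked so that the result comes out to -0.5.

```python
import math


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    # Sums of cross-products and of squared deviations from the means.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    syy = sum((y - mean_y) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


x = [1, 2, 3, 4, 5]
y = [3, 5, 2, 4, 1]  # generally decreasing, but with noticeable scatter

r = pearson_r(x, y)
print("r =", r)
```

The sign of r gives the direction of the relationship and its distance from 0 gives the strength.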

 

Question 4

If there is a clear pattern in a residual plot, is a linear relationship appropriate?

 

No. If a linear relationship is appropriate, we expect the residuals to be scattered randomly around zero with no discernible pattern. A clear pattern, such as a curve or a fan shape, indicates that a straight line does not adequately describe the relationship between X and Y.
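As a sketch of what a patterned residual plot looks like numerically, the code below fits a straight line to data that actually follow a curve (y = x squared, chosen purely for illustration) and lists the residuals. The helper fit_line is hypothetical and uses the ordinary least-squares formulas.

```python
def fit_line(xs, ys):
    """Return (slope, intercept) of the least-squares line through the points."""
    n = len(xs)
    mean_x = sum(xs) / n
    mean_y = sum(ys) / n
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    sxx = sum((x - mean_x) ** 2 for x in xs)
    slope = sxy / sxx
    intercept = mean_y - slope * mean_x
    return slope, intercept


xs = [0, 1, 2, 3, 4, 5, 6]
ys = [x ** 2 for x in xs]  # curved relationship, not linear

slope, intercept = fit_line(xs, ys)

# Residual = observed y minus the y predicted by the line.
residuals = [y - (intercept + slope * x) for x, y in zip(xs, ys)]

for x, res in zip(xs, residuals):
    print(x, res)
```

The residuals start positive, dip negative in the middle, and turn positive again. That U shape is a clear pattern, so a linear model is not appropriate for these data even though the residuals still sum to zero.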

 

Question 5

Which measure of spread is least affected when there are extreme outliers in your data set?

 

The interquartile range. It spans only the middle 50% of the data, leaving out the lowest and highest 25%, and is therefore much less affected by extreme values. Variance, standard deviation, and range are all computed using these extreme values.
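The sketch below compares the measures of spread on a small invented data set before and after its largest value is replaced by an extreme outlier. Quartile conventions vary between textbooks and software; this example simply uses the default method of statistics.quantiles, so other conventions may give slightly different quartiles.

```python
import statistics


def iqr(values):
    # With n=4, statistics.quantiles returns the three quartile cut points.
    q1, _, q3 = statistics.quantiles(values, n=4)
    return q3 - q1


def value_range(values):
    return max(values) - min(values)


clean = [10, 12, 13, 15, 16, 18, 19, 21, 22]
with_outlier = clean[:-1] + [200]  # replace the largest value with an outlier

for name, data in [("clean", clean), ("with outlier", with_outlier)]:
    print(
        name,
        "| IQR:", iqr(data),
        "| range:", value_range(data),
        "| stdev:", round(statistics.stdev(data), 1),
    )
```

In this example the interquartile range is identical for both data sets, while the range and standard deviation grow substantially once the outlier is present.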

 

Summary

The two major categories of variables are quantitative and categorical. Data can be described using tables, graphs, and numbers. Bar graphs and pie charts are useful methods for depicting categorical data. 

 

There are many graphical methods for describing quantitative data. Small data sets can be depicted using stem plots or dot plots. Larger data sets can be shown as boxplots, frequency charts, and histograms.

 

Use a scatter plot to organize and display bivariate quantitative data. Use a two-way contingency table to summarize bivariate categorical data. Ideally, data should always be visualized first, because symmetric data and skewed data call for different numerical summaries (for example, mean and standard deviation versus median and interquartile range).

 

The r-value, or correlation coefficient, is always between -1 and 1. It describes the direction and strength of the linear relationship, with values closer to +1 or -1 being stronger and values closer to 0 being weaker. The r-squared value describes the proportion of the variation in the y-values that is explained by the linear relationship with the x-values.

 

There are two major formulas needed to calculate the least-squares regression line: one for the slope and one for the intercept. To find the estimated slope, you need the r-statistic, the standard deviation of the x-values, and the standard deviation of the y-values: slope = r × (s_y / s_x). The intercept then follows from the means: intercept = (mean of y) - slope × (mean of x). A residual plot shows whether a linear model is a good fit. If the points on a residual plot are randomly scattered around zero, then a linear model is appropriate. If the points show a pattern, then a linear model should not be used.
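The two formulas can be sketched as follows, taking the standard forms slope = r × (s_y / s_x) and intercept = (mean of y) - slope × (mean of x). The function name least_squares_line and the data points are assumptions made for this example.

```python
import statistics


def least_squares_line(xs, ys):
    """Return (slope, intercept, r) using slope = r * s_y / s_x."""
    n = len(xs)
    mean_x = statistics.mean(xs)
    mean_y = statistics.mean(ys)
    s_x = statistics.stdev(xs)  # sample standard deviation of x
    s_y = statistics.stdev(ys)  # sample standard deviation of y

    # Correlation coefficient from the sum of cross-products.
    sxy = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    r = sxy / ((n - 1) * s_x * s_y)

    slope = r * s_y / s_x
    intercept = mean_y - slope * mean_x
    return slope, intercept, r


xs = [1, 2, 3, 4, 5]
ys = [2, 4, 5, 4, 5]

slope, intercept, r = least_squares_line(xs, ys)
print("slope:", round(slope, 3))
print("intercept:", round(intercept, 3))
print("r:", round(r, 3), "r-squared:", round(r ** 2, 3))
```

Because the intercept is built from the two means, the fitted line always passes through the point (mean of x, mean of y).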