Numerical Methods For Continuous Data in Statistics
These are my notes on numerical methods for continuous data.
Correlation
Correlation coefficients are numerical measures used to judge the relation between two variables. Pearson’s correlation coefficient is a numeric measure of the strength and direction of the linear relation between two quantitative variables. The Pearson’s correlation coefficient between two variables x and y computed from a population is denoted by \(\rho\), whereas the correlation coefficient computed from a sample is denoted by \(r\).
The positive or negative sign of the correlation coefficient describes the direction of the linear relation between the two variables.
A positive value of the correlation indicates a positive relation between x and y. This means that as x increases, y also increases linearly. For example, the relation between the heights of fathers and the heights of their sons is a positive relation. Taller fathers tend to have taller sons. A scatterplot of such data will show an increasing or upward linear trend.
A negative value of the correlation coefficient indicates a negative relation between x and y. This means that as x increases, y decreases linearly. For example, the relation between the weight of a car and its gas mileage is a negative relation. Heavier cars tend to get lower gas mileage. A scatterplot of such data will show a decreasing or downward linear trend.
The numeric value of the correlation coefficient describes the strength of the linear relation between the two variables.
If the value of the correlation coefficient is equal to +1 or -1, then it indicates a perfect linear relation between the two variables. In this case, all the points in a scatterplot would fall exactly on a straight line.
The farther away the correlation coefficient gets from 0, the stronger the relationship between the two variables, and the closer the correlation coefficient is to 0, the weaker the relationship between the two variables. For example, if the correlation coefficient between x and y is -0.86, whereas the correlation coefficient between x and z is 0.75, then x and y have a stronger relation than x and z. Note again that, unlike scatterplots, correlation coefficients do not show the shape of the relationship.
Correlation coefficients are usually computed using a calculator or by reading computer output from statistics programs.
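For instance, here is a minimal sketch of such a computation in Python (NumPy assumed; the father-and-son heights below are made up purely for illustration):

```python
# A minimal sketch of computing Pearson's r with NumPy.
# The height data are made up for illustration only.
import numpy as np

fathers = np.array([68.0, 70.0, 72.0, 65.0, 71.0])  # heights in inches
sons = np.array([69.0, 71.0, 73.5, 66.0, 70.5])

# np.corrcoef returns the 2x2 correlation matrix; the off-diagonal
# entry is the sample correlation coefficient r.
r = np.corrcoef(fathers, sons)[0, 1]
print(f"r = {r:.3f}")  # close to +1: a strong positive linear relation
```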
Because the strength of a correlation is judged by how far \(r\) is from zero, it can be challenging to decide whether a specific value counts as weak, strong, or very strong. Statisticians often rely on arbitrary cutoff values to make these distinctions.
Least-Squares Regression
Once we have established that the two variables are related to each other, we are often interested in estimating or quantifying the relation between the two variables. When one variable explains or causes the other or when one is dependent on the other, estimating a linear regression model can be useful. Such an estimate can be useful for predicting the corresponding values of one variable for known values of the other variable.
A linear regression model or linear regression equation is an equation that gives a straight line relationship between two variables. The linear relation between two variables is given by the following equation for the regression line:
\[Y = a + bX\]
- Y is the dependent variable or response variable
- X is the independent variable or explanatory variable
- a is the y-intercept. It is the value of Y when X = 0
- b is the slope of the line. It gives the amount of change in Y for every unit change in X
The predicted value of Y for a given value of X is denoted by \(\hat{y}\). It is computed by using the estimated regression line: \(\hat{y} = a + bx\)
The least-squares regression line is the line that minimizes the sum of the squares of the residuals. It is also known as the line of best fit. The line of best fit always passes through the point \((\bar{x}, \bar{y})\), the means of the x-values and y-values.
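As a sketch of the least-squares formulas (made-up data, NumPy assumed), the slope is \(b = S_{xy}/S_{xx}\) and the intercept is \(a = \bar{y} - b\bar{x}\), which is exactly why the fitted line passes through \((\bar{x}, \bar{y})\):

```python
# A sketch of the least-squares slope and intercept formulas.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

x_bar, y_bar = x.mean(), y.mean()
b = np.sum((x - x_bar) * (y - y_bar)) / np.sum((x - x_bar) ** 2)
a = y_bar - b * x_bar

print(f"y-hat = {a:.3f} + {b:.3f} x")
# The line of best fit passes through the point of means:
print(np.isclose(a + b * x_bar, y_bar))  # True
```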
The coefficient of determination, denoted \(R^2\), measures the percent of variation in Y-values explained by the linear relation between X and Y values. In other words, it measures the percent of variation in Y-values attributable to the variation in X-values. It can be shown that for a simple linear regression it is equal to the square of Pearson’s correlation coefficient, \(R^2 = r^2\).
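To make that identity concrete, here is a small sketch (same made-up data as above, NumPy assumed) that computes \(R^2\) from the residuals and checks it against \(r^2\):

```python
# A sketch verifying that, for simple linear regression, R^2 equals
# the square of Pearson's r. Data are made up for illustration.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 8.1, 9.8])

b = np.sum((x - x.mean()) * (y - y.mean())) / np.sum((x - x.mean()) ** 2)
a = y.mean() - b * x.mean()
y_hat = a + b * x

sse = np.sum((y - y_hat) ** 2)     # unexplained variation
sst = np.sum((y - y.mean()) ** 2)  # total variation in y
r_squared = 1 - sse / sst

r = np.corrcoef(x, y)[0, 1]
print(np.isclose(r_squared, r ** 2))  # True
```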
Outliers and Influential Points
As discussed earlier, an outlier is an observation that is surprisingly different from the rest of the data. It is an observation that does not conform to the general trend. An influential observation is an observation that strongly affects a statistic. Some outliers are influential, while others are not. If there is a considerable difference between the correlation coefficients computed with and without a specific observation, then that observation is influential. The same can be said about the line of best fit. If the estimates of the line of best fit change considerably when including or excluding a point, then that point is an influential observation.
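A sketch of that with-and-without comparison (made-up data, NumPy assumed) where a single off-trend point even flips the sign of the correlation:

```python
# A sketch of checking whether a single observation is influential:
# compare r computed with and without the suspect point.
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 12.0])  # last point is far off the trend
y = np.array([2.0, 4.1, 5.9, 8.2, 9.9, 1.0])

r_all = np.corrcoef(x, y)[0, 1]
r_without = np.corrcoef(x[:-1], y[:-1])[0, 1]

# A large gap between the two values suggests the point is influential.
print(f"r with the point:    {r_all:.3f}")
print(f"r without the point: {r_without:.3f}")
```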
Residuals and Residual Plots
When we use a least-squares regression line to make predictions, there is usually a difference between the predicted values and the actual observed values. We can think of this difference between observed and predicted values as error, or a residual: \(\text{residual} = y - \hat{y}\).
A residual plot is a plot of residuals versus the predicted values of Y. This type of plot is used to assess the fit of the model. A residual plot should look random. If the residual plot shows any patterns or trends, it is an indication that the linear model is not appropriate.
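As a sketch of how such a plot might be produced (NumPy and Matplotlib assumed, with made-up data that follow a roughly linear trend):

```python
# A sketch of a residual plot: residuals versus predicted values.
# A random-looking scatter around zero supports the linear model.
import numpy as np
import matplotlib.pyplot as plt

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0])
y = np.array([2.2, 3.8, 6.1, 7.9, 10.2, 11.8, 14.1, 15.9])

# Fit the least-squares line, then compute predictions and residuals.
b, a = np.polyfit(x, y, deg=1)  # returns (slope, intercept) for deg=1
y_hat = a + b * x
residuals = y - y_hat

plt.scatter(y_hat, residuals)
plt.axhline(0, linestyle="--")
plt.xlabel("Predicted y")
plt.ylabel("Residual (observed - predicted)")
plt.title("Residual plot")
plt.show()
```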
Transformation To Achieve Linearity
Always draw a scatterplot of the data to examine the nature of the relation between two variables. You should also examine the fit of the linear model using a residual plot. If either one of these plots indicates that the linear model might not be appropriate for the data, then there are two options available: You can either use nonlinear models or use a transformation to achieve linearity. For example, if the data seem to follow a relation of the form \(Y = aX^b\), then we can take the logarithm of both sides to get \(\ln(Y) = \ln(a) + b\ln(X)\). This gives the equation of a straight line, in other words, a linear relation.
After the variables have been appropriately transformed, we can then use them to fit a model. For example, we could define a new variable Z as the natural logarithm of the Y-values or the square root of the Y-values, and then fit a model for Z as a function of X. When using a fitted model for predictions, remember to transform the predicted values back to the original scale using the reverse transformation.
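A sketch of the whole round trip for the power model above (made-up data, NumPy assumed): fit a line to \(\ln(Y)\) versus \(\ln(X)\), then reverse the transformation with the exponential function when predicting.

```python
# A sketch of transformation to achieve linearity for data that
# roughly follow a power model Y = a * X**b. Fitting a line to
# ln(Y) versus ln(X) estimates ln(a) and b; predictions are then
# transformed back to the original scale with exp().
import numpy as np

x = np.array([1.0, 2.0, 4.0, 8.0, 16.0])
y = np.array([3.1, 8.7, 24.2, 69.0, 190.5])  # roughly y = 3 * x**1.5

b, ln_a = np.polyfit(np.log(x), np.log(y), deg=1)

# Predict y at a new x-value, then reverse the transformation.
x_new = 10.0
y_hat = np.exp(ln_a + b * np.log(x_new))
print(f"estimated b = {b:.2f}, estimated a = {np.exp(ln_a):.2f}")
print(f"predicted y at x = {x_new}: {y_hat:.1f}")
```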
Commonly used transformations include the following:
- The logarithm transformation is used to linearize the regression model when the relationship between Y and X suggests a model with a consistently increasing slope.
- The square root transformation is used when the spread of the observations increases with the mean.
- The reciprocal transformation is used to minimize the effect of large values of X.
- The square transformation is used when the slope of the relation consistently decreases as the independent variable increases.
- The power transformation is used if the relation between the dependent and independent variables is modeled by \(Y = aX^b\).