Part I: A review of Regression
(aka SLR- Simple Linear Regression)
Description: A method used
to generate a mathematical equation which will describe the nature of the
relationship between two variables.
Differences between correlation and
regression:
1. Regression assumes causation; correlation
does not.
2. Regression generates a mathematical model;
correlation does not.
Uses of regression models:
1. description
A model is a more compact description of a
set of data.
Formulation of a model allows you to assess
the relative degree to which each predictor variable accounts for variation
in the criterion variable.
2. prediction
Extrapolation
Interpolation
Objectives of regression analysis:
to determine whether or not a relationship
exists between two variables.
to describe the nature of the relationship
(should one exist) in the form of a mathematical equation.
to assess the degree of accuracy of description
or prediction achieved by the regression equation.
to assess the relative importance of the various
predictor variables in their contribution to the variation in the criterion
variable (specific to multiple regression).
Classification of Variables:
Criterion variable- the dependent variable
which the model will attempt to predict.
Predictor variables- the independent variables
which influence the criterion variable.
Part II: Developing a regression model
Two main issues to be resolved…
-
1. Which variables to include (which are true
"predictors" and which are incidental conditions in the system).
-
2. The relative contributions of each of those
variables. These values will be affected by the interaction between
the variables decided upon in #1.
Regression requirements (both SLR and multiple
regression):
-
1. All variables are continuous.
-
2. The independent variables are fixed (i.e.
they are all under the control of the investigator) while the dependent
variable is random.
-
3. A linear function will be described by
the data. You may have to transform some of the data in order to
accomplish this.
-
4. At each level of the independent variable,
the values of the dependent variable are independent and normally distributed.
-
5. At each level of the independent variable,
the values of the dependent variable are homoscedastic (have equal variance).
How a model is generated:
A regression model draws the Line of Best
Fit: the line which passes through the individual data points with the
smallest sum of squared residuals. This is accomplished by using
the Method of Least Squares.
Residuals:
- sum of residuals = 0, therefore the average
of residuals = 0
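The Method of Least Squares and the zero-sum property of residuals can be sketched in a few lines. This is a minimal illustration, not a replacement for a statistics package; the data values and the function name least_squares are hypothetical, invented for the example.

```python
def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    # Sum of squares of x and sum of cross-products about the means.
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    return intercept, slope


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

intercept, slope = least_squares(x, y)
residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]

print("intercept:", intercept, "slope:", slope)
print("sum of residuals:", sum(residuals))  # zero, up to rounding error
```

The residuals sum to zero because the fitted line always passes through the point (mean of x, mean of y).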
General equation for a linear bivariate
relationship:
<dependent variable> (units)
= intercept + slope * <independent variable> (units)
The y-intercept represents the value of
the criterion variable when the predictor variable(s) equal zero.
Limitations:
-
1. Regression tells us nothing about causation!
You have to establish causation prior to developing a regression model.
-
2. Sample size: Most authors recommend that
one should have at least 10 to 20 times as many observations (cases, respondents)
as one has variables, otherwise the estimates of the regression line are
probably very unstable and unlikely to replicate if one were to do the
study over.
Part III: The Coefficient of Determination
The Coefficient of Determination (r^2):
-
reported as the measure of fit between the independent
and dependent variables.
-
measure of the proportion (or percentage)
of variation accounted for by the model
-
It has a range of 0 to 1.
-
r^2 = SSregression
/ SStotal = variation explained by the model / total variation.
-
It is not directly tested for significance.
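As a rough sketch of the calculation, the coefficient of determination can be computed from the sums of squares of a fitted line. The data and the function name sums_of_squares are hypothetical; the point is only that the regression and residual sums of squares add up to the total, so the ratio SSregression / SStotal cannot leave the interval 0 to 1.

```python
def sums_of_squares(x, y):
    """Return (ss_regression, ss_residual, ss_total) for the least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ss_regression = sum((fi - y_bar) ** 2 for fi in fitted)   # explained
    ss_residual = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted))  # unexplained
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    return ss_regression, ss_residual, ss_total


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

ss_regression, ss_residual, ss_total = sums_of_squares(x, y)
r2 = ss_regression / ss_total
print("r^2 =", round(r2, 4))
```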
When reporting a regression model,
give:
-
The complete model with the independent and
dependent variables named
-
The probability of the model
-
r^2
For example: The following highly significant
(p < 0.0001, r^2 = 0.91) linear model was found between
hematocrit and age of 9 men:
hematocrit (%) = 65.5
- 0.563 (age, years).
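Using a reported model for prediction is simple arithmetic, sketched here with the coefficients from the example. The ages are chosen arbitrarily; the example does not state the age range of the 9 men, so it is unknown which of these predictions would count as extrapolation.

```python
def predicted_hematocrit(age_years):
    """Hematocrit (%) predicted by the reported model for a given age in years."""
    return 65.5 - 0.563 * age_years


# Arbitrary ages, for illustration only.
for age in (30, 40, 50):
    print(age, "years ->", round(predicted_hematocrit(age), 2), "%")
```

At age 0 the function returns the intercept, 65.5, which shows why an intercept far outside the observed range of the predictor should be interpreted with caution.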
Outliers
-
Single points which would dramatically affect
the regression line
-
Ways to test for them:
-
1. eyeball a scatterplot (proc plot)
-
2. analysis of residuals
-
Cook's D statistical assessment is also useful
for identifying these troublesome points.
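For a simple linear regression, Cook's D can be computed by hand from the residuals and the leverage of each point. The sketch below uses the usual textbook form D = (e^2 / (p * MSE)) * h / (1 - h)^2 with p = 2 estimated parameters; the data are hypothetical, with the last point deliberately placed far from the others. A value above roughly 1 is a commonly quoted rule of thumb for a point worth a closer look, not a formal test.

```python
def cooks_distance(x, y):
    """Return Cook's D for each point of a simple linear regression."""
    n = len(x)
    p = 2  # parameters estimated: intercept and slope
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    residuals = [yi - (intercept + slope * xi) for xi, yi in zip(x, y)]
    mse = sum(e * e for e in residuals) / (n - p)
    distances = []
    for xi, e in zip(x, residuals):
        # Leverage: how far this x lies from the mean of x.
        leverage = 1.0 / n + (xi - x_bar) ** 2 / sxx
        distances.append((e * e / (p * mse)) * leverage / (1.0 - leverage) ** 2)
    return distances


# Hypothetical data; the last point is a deliberate outlier.
x = [1.0, 2.0, 3.0, 4.0, 5.0, 10.0]
y = [1.1, 1.9, 3.2, 3.9, 5.1, 2.0]

distances = cooks_distance(x, y)
for xi, d in zip(x, distances):
    print(xi, round(d, 3))
```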
Confidence belts:
Errors in predicting y are due to 3 sources:
-
1. s_y.x - the variation around the true regression
line.
-
2. error in estimating the overall elevation
(y-intercept) of the true regression line.
-
3. error in estimating the slope of the true
regression line.
Confidence belts take these into account.
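The three error sources map onto three terms under a square root, which the following sketch makes explicit. It assumes the standard formulas for a simple linear regression; the data are hypothetical, and the critical t value (3.182 for 3 degrees of freedom, two-tailed alpha = 0.05) is taken from a t table rather than computed.

```python
import math


def belt_half_widths(x, y, x0, t_crit):
    """Return half-widths of the belts at x0: (for the mean of y, for a new y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    s_yx = math.sqrt(ss_resid / (n - 2))   # source 1: scatter around the line
    elevation_term = 1.0 / n               # source 2: error in the elevation
    slope_term = (x0 - x_bar) ** 2 / sxx   # source 3: error in the slope
    mean_half = t_crit * s_yx * math.sqrt(elevation_term + slope_term)
    new_obs_half = t_crit * s_yx * math.sqrt(1.0 + elevation_term + slope_term)
    return mean_half, new_obs_half


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]
T_CRIT = 3.182  # t table value, df = n - 2 = 3, two-tailed alpha = 0.05

at_mean = belt_half_widths(x, y, 3.0, T_CRIT)
at_edge = belt_half_widths(x, y, 5.0, T_CRIT)
print("half-widths at x = 3:", at_mean)
print("half-widths at x = 5:", at_edge)
```

The slope term grows with the distance of x0 from the mean of x, which is why the belts flare outward toward the ends of the data.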
Hypothesis testing:
What can you do with one regression model?
-
Test to see if it is statistically significant
(i.e. is the slope significantly different from zero?). H0: b = 0
-
A significant result indicates that the observed
linear equation is unlikely to be simply a chance departure from a horizontal line.
Can be accomplished in two ways:
-
1. F test = MSregr / MSresid
-
2. t test = slope / standard error of the
slope
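Both test statistics can be computed from the same intermediate quantities, as in this sketch on hypothetical data. With a single predictor the two tests are equivalent: F equals t squared. Looking up the associated probability requires an F or t table (or a statistics package) and is left out here.

```python
import math


def slope_tests(x, y):
    """Return (F, t) for the null hypothesis that the slope is zero."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    fitted = [intercept + slope * xi for xi in x]
    ms_regr = sum((fi - y_bar) ** 2 for fi in fitted) / 1          # 1 df
    ms_resid = sum((yi - fi) ** 2 for yi, fi in zip(y, fitted)) / (n - 2)
    f_stat = ms_regr / ms_resid
    se_slope = math.sqrt(ms_resid / sxx)
    t_stat = slope / se_slope
    return f_stat, t_stat


# Hypothetical data, invented for illustration only.
x = [1.0, 2.0, 3.0, 4.0, 5.0]
y = [2.1, 3.9, 6.2, 7.8, 10.1]

f_stat, t_stat = slope_tests(x, y)
print("F =", f_stat, " t =", t_stat, " t^2 =", t_stat ** 2)
```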
What can you do with two regression
models?**
-
Compare the slopes of two models (most
common comparison made).*
-
Compare the elevations of two models.*
-
Compare predicted Y values for a given X between
two models.
*Use approximately the same range of X
when comparing two models.
** none of these tests can be accomplished
purely via SAS. However, SAS will provide you with intermediate values
required for the calculations.
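One common way to do the slope comparison by hand is a pooled-variance t test, as given in standard biostatistics texts; it is sketched here as an assumption about which test is intended, not as the method these notes prescribe. The intermediate values it needs (slopes, sums of squares of X, residual sums of squares) are the kind a regression printout supplies. Both data sets are hypothetical and cover the same range of X.

```python
import math


def line_summary(x, y):
    """Return (slope, sxx, ss_residual, n) for a least-squares line."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    intercept = y_bar - slope * x_bar
    ss_resid = sum((yi - (intercept + slope * xi)) ** 2 for xi, yi in zip(x, y))
    return slope, sxx, ss_resid, n


def compare_slopes(sample_1, sample_2):
    """Return (t, df) for the null hypothesis that the two slopes are equal."""
    b1, sxx1, ssr1, n1 = line_summary(*sample_1)
    b2, sxx2, ssr2, n2 = line_summary(*sample_2)
    df = n1 + n2 - 4                      # two parameters estimated per line
    pooled_var = (ssr1 + ssr2) / df       # pooled residual mean square
    se_diff = math.sqrt(pooled_var / sxx1 + pooled_var / sxx2)
    return (b1 - b2) / se_diff, df


# Hypothetical data, invented for illustration only.
sample_a = ([1.0, 2.0, 3.0, 4.0, 5.0], [2.1, 3.9, 6.2, 7.8, 10.1])
sample_b = ([1.0, 2.0, 3.0, 4.0, 5.0], [1.0, 1.6, 1.9, 2.6, 3.0])

t_stat, df = compare_slopes(sample_a, sample_b)
print("t =", t_stat, "with", df, "degrees of freedom")
```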
Additional considerations regarding regression
analysis:
-
The range of the independent variable is important:
i.e. you could have a "window" in which the relationship appears linear.
-
Incorporate all the data, not just the means
(or medians). This increases the sample size and ensures that the
raw data display a relationship, not just the means.
-
Make sure you know which is the independent
variable and which is the dependent. If you can't establish causation,
then you will have to report correlation instead of a regression model.
-
Just because a model is statistically significant
does not mean that it is the best model (i.e. it might not be a linear
relationship).
-
Don't force a linear model on a data set.
The model could be significant, but the relationship not linear.
Nonsignificance does not mean that there is no relationship, just not a
linear one.
Nonlinear regression:
Part IV: Multiple regression
Often more than one independent
variable contributes to the value of a dependent variable.
Approach multiple regression with
a practical mind. The ultimate goal of multiple regression is to
generate the most simple, compact model which will accurately describe
and/or predict the value of the dependent variable of interest. This
means choosing the independent variables which contribute the most to the
value of the dependent variable. There are many tests which will
make these evaluations for you.
Beta coefficients
Tell you the relative contribution of a predictor
to the model.
Beta coefficients will change when variables
are added to or removed from a model.
These partial contributions are estimated
by converting the variables to z scores before fitting, so that the
coefficients are expressed in standard deviation units.
Note: beta coefficients can inform us only
of the relative importance of each of the predictor variables, not the
absolute contributions, since there are still the joint contributions of
two or more variables taken together that cannot be disentangled.
The relative importance of any two predictor variables depends upon
which other predictor variables have been included in the analysis.
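For the special case of two predictors, the beta coefficients can be written directly in terms of the pairwise correlations, which makes their dependence on the other predictor easy to see. The sketch below uses that standard two-predictor formula on hypothetical data; with a single predictor the beta coefficient is simply the correlation between predictor and criterion.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical data, invented for illustration only.
x1 = [1.0, 2.0, 3.0, 4.0, 5.0, 6.0]
x2 = [2.0, 1.0, 4.0, 3.0, 6.0, 5.0]
y = [3.1, 3.9, 7.2, 7.8, 11.1, 11.9]

r_y1 = correlation(y, x1)
r_y2 = correlation(y, x2)
r_12 = correlation(x1, x2)

# Beta coefficients when both predictors are in the model.
beta_1 = (r_y1 - r_y2 * r_12) / (1.0 - r_12 ** 2)
beta_2 = (r_y2 - r_y1 * r_12) / (1.0 - r_12 ** 2)

# Beta coefficient for x1 when x2 is left out of the model.
beta_1_alone = r_y1

print("with x2 in the model:", beta_1, beta_2)
print("x1 alone:", beta_1_alone)
```

The change in the coefficient for x1 between the two models is the effect described above: the value depends on which other predictors are included.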
Stepwise procedures:
These procedures stop at the point
when the introduction of another variable would account for only a trivial
or statistically insignificant proportion of the unexplained variance.
I The step-down (aka backward elimination)
procedure:
start with all the predictor variables.
sequentially eliminate the least predictive
one at a time.
stop when the elimination of the next variable
would sacrifice a significant amount of explained variance in the criterion
variable.
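II The step-up (aka forward addition)
procedure:
just the opposite: start with no predictor variables, sequentially add the
most predictive one at a time, and stop when the next addition would
account for no significant additional variance.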
Note: The step-up and step-down procedures
will not always result in the same regression equation. It is
even possible to arrive at the same R^2 with completely different
sets of variables eliminated from the analysis.
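The step-down procedure can be sketched as a loop that repeatedly drops the weakest predictor. This is an illustration under stated assumptions, not the algorithm of any particular package: the stopping rule is a partial F statistic compared against a fixed threshold (f_to_stay = 4.0 is an arbitrary stand-in for a critical value from an F table), and the names fit_r2 and backward_eliminate, like the data, are hypothetical.

```python
def fit_r2(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * v for u, v in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    y_bar = sum(y) / n
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
                   for r, yi in zip(rows, y))
    return 1.0 - ss_resid / ss_total


def backward_eliminate(predictors, y, f_to_stay=4.0):
    """Drop the weakest predictor until every remaining one earns its place."""
    kept = list(predictors)
    n = len(y)
    while kept:
        r2_full = fit_r2([predictors[v] for v in kept], y)
        df_resid = n - len(kept) - 1
        partial_f = {}
        for v in kept:
            rest = [predictors[u] for u in kept if u != v]
            r2_reduced = fit_r2(rest, y)
            # Explained variance lost by dropping v, relative to residual variance.
            partial_f[v] = (r2_full - r2_reduced) / ((1.0 - r2_full) / df_resid)
        weakest = min(partial_f, key=partial_f.get)
        if partial_f[weakest] >= f_to_stay:
            break  # dropping anything more would sacrifice too much
        kept.remove(weakest)
    return kept


# Hypothetical data: y depends mainly on x1.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [5.0, 6.0, 7.0, 3.0, 2.0, 8.0, 6.0, 3.0],
}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

kept = backward_eliminate(predictors, y)
print("variables retained:", kept)
```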
III All-regressions:
individually evaluates models containing all
possible combinations of the predictor variables.
The Maximum R^2 procedure:
This attempts to find the best
1-, 2-, 3-, ... n-variable models. In this way you can choose the most appropriately
sized model from the output of the most accurate models.
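The idea of evaluating every combination and keeping the best model of each size can be sketched by brute force. This is not how SAS implements its Maximum R^2 selection (which searches by swapping variables rather than enumerating every subset); it is only an illustration of the goal, using hypothetical data and a hypothetical helper fit_r2.

```python
from itertools import combinations


def fit_r2(columns, y):
    """R^2 of a least-squares fit of y on the given columns plus an intercept."""
    n = len(y)
    rows = [[1.0] + [col[i] for col in columns] for i in range(n)]
    k = len(rows[0])
    # Normal equations (X'X) b = X'y, stored as an augmented matrix.
    a = [[sum(r[i] * r[j] for r in rows) for j in range(k)]
         + [sum(r[i] * yi for r, yi in zip(rows, y))] for i in range(k)]
    # Gauss-Jordan elimination with partial pivoting.
    for c in range(k):
        pivot = max(range(c, k), key=lambda r: abs(a[r][c]))
        a[c], a[pivot] = a[pivot], a[c]
        for r in range(k):
            if r != c:
                factor = a[r][c] / a[c][c]
                a[r] = [u - factor * v for u, v in zip(a[r], a[c])]
    b = [a[i][k] / a[i][i] for i in range(k)]
    y_bar = sum(y) / n
    ss_total = sum((yi - y_bar) ** 2 for yi in y)
    ss_resid = sum((yi - sum(bj * rj for bj, rj in zip(b, r))) ** 2
                   for r, yi in zip(rows, y))
    return 1.0 - ss_resid / ss_total


# Hypothetical data, invented for illustration only.
predictors = {
    "x1": [1.0, 2.0, 3.0, 4.0, 5.0, 6.0, 7.0, 8.0],
    "x2": [2.0, 1.0, 4.0, 3.0, 6.0, 5.0, 8.0, 7.0],
    "x3": [5.0, 6.0, 7.0, 3.0, 2.0, 8.0, 6.0, 3.0],
}
y = [3.1, 4.9, 7.2, 8.8, 11.1, 12.9, 15.2, 16.8]

best = {}       # model size -> (variable names, R^2)
evaluated = 0
for size in range(1, len(predictors) + 1):
    for names in combinations(predictors, size):
        r2 = fit_r2([predictors[name] for name in names], y)
        evaluated += 1
        if size not in best or r2 > best[size][1]:
            best[size] = (names, r2)

for size, (names, r2) in best.items():
    print(size, names, round(r2, 4))
```

With k predictors there are 2^k - 1 non-empty models, so exhaustive evaluation becomes impractical quickly as k grows.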
Validation of the regression equation:
Apply the regression equation
to a fresh sample of objects to see how well it does in fact predict values
on the criterion variable. This will demonstrate whether or not the
decision to include the predictor variables was based purely on chance
relationships.
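A minimal sketch of validation: fit the equation on one sample, then measure how far its predictions fall from the observed values in a fresh sample. Both samples are hypothetical, and root-mean-square error is just one reasonable summary of prediction accuracy, chosen here for simplicity.

```python
import math


def least_squares(x, y):
    """Return (intercept, slope) of the least-squares line through (x, y)."""
    n = len(x)
    x_bar = sum(x) / n
    y_bar = sum(y) / n
    sxx = sum((xi - x_bar) ** 2 for xi in x)
    sxy = sum((xi - x_bar) * (yi - y_bar) for xi, yi in zip(x, y))
    slope = sxy / sxx
    return y_bar - slope * x_bar, slope


# Hypothetical original sample, used to build the model.
fit_x = [1.0, 2.0, 3.0, 4.0, 5.0]
fit_y = [2.1, 3.9, 6.2, 7.8, 10.1]

# Hypothetical fresh sample, never seen during fitting.
fresh_x = [1.5, 2.5, 3.5, 4.5]
fresh_y = [3.0, 5.1, 7.0, 9.0]

intercept, slope = least_squares(fit_x, fit_y)
errors = [yi - (intercept + slope * xi) for xi, yi in zip(fresh_x, fresh_y)]
rmse = math.sqrt(sum(e * e for e in errors) / len(errors))
print("prediction RMSE on the fresh sample:", rmse)
```

A prediction error on the fresh sample that is much larger than the residual scatter in the original sample suggests the model capitalized on chance relationships.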
Pairwise combinations:
The number of pairwise combinations of variables
is given by the formula:
x (x - 1) / 2, where x is the number of variables.
This is useful for anticipating the size of
your correlation matrix.
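The formula can be checked against a direct enumeration of pairs. The variable names below are placeholders.

```python
from itertools import combinations


def pairwise_count(x):
    """Number of distinct pairs among x variables: x (x - 1) / 2."""
    return x * (x - 1) // 2


names = ["v%d" % i for i in range(1, 11)]   # ten placeholder variables
pairs = list(combinations(names, 2))

print(pairwise_count(len(names)), len(pairs))
```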
Collinearity (aka variable redundancy,
aka multicollinearity)
This is the problem of using two or more predictor
variables which are highly correlated with one another.
A related problem is using variables that
are directly related to one another. For example, costs, sales, and
profits are directly related to one another. Including any two of
these variables in a model effectively includes the third.
In such a situation, the computer cannot do
the thinking for you.
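A quick screen for this kind of redundancy is to look at the correlation matrix of the predictors: if one variable is an exact combination of the others, the determinant of that matrix is zero. The sketch assumes, purely for illustration, that profits are defined as sales minus costs; the figures are hypothetical.

```python
import math


def correlation(a, b):
    """Pearson correlation between two equal-length lists."""
    n = len(a)
    a_bar = sum(a) / n
    b_bar = sum(b) / n
    sab = sum((ai - a_bar) * (bi - b_bar) for ai, bi in zip(a, b))
    saa = sum((ai - a_bar) ** 2 for ai in a)
    sbb = sum((bi - b_bar) ** 2 for bi in b)
    return sab / math.sqrt(saa * sbb)


# Hypothetical figures; profits are assumed to be exactly sales minus costs.
sales = [100.0, 120.0, 90.0, 150.0, 130.0]
costs = [60.0, 80.0, 70.0, 90.0, 75.0]
profits = [s - c for s, c in zip(sales, costs)]

variables = [sales, costs, profits]
r = [[correlation(a, b) for b in variables] for a in variables]

# Determinant of the 3 x 3 correlation matrix, expanded along the first row.
det = (r[0][0] * (r[1][1] * r[2][2] - r[1][2] * r[2][1])
       - r[0][1] * (r[1][0] * r[2][2] - r[1][2] * r[2][0])
       + r[0][2] * (r[1][0] * r[2][1] - r[1][1] * r[2][0]))

print("determinant of the correlation matrix:", det)  # zero, up to rounding
```

A determinant near zero only flags the problem; deciding which variable to drop still depends on knowing what the variables mean.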