Chapter 12: Q118E (page 781)

Question: Adverse effects of hot-water runoff. The Environmental Protection Agency (EPA) wants to determine whether the hot-water runoff from a particular power plant located near a large gulf is having an adverse effect on the marine life in the area. The goal is to acquire a prediction equation for the number of marine animals located at certain designated areas, or stations, in the gulf. Based on past experience, the EPA considered the following environmental factors as predictors for the number of animals at a particular station:
X₁ = Temperature of water (TEMP)
X₂ = Salinity of water (SAL)
X₃ = Dissolved oxygen content of water (DO)
X₄ = Turbidity index, a measure of the turbidity of the water (TI)
x₅ = Depth of the water at the station (ST_DEPTH)
x₆ = Total weight of sea grasses in sampled area (TGRSWT)
As a preliminary step in the construction of this model, the EPA used a stepwise regression procedure to identify the most important of these six variables. A total of 716 samples were taken at different stations in the gulf, producing the SPSS printout shown below. (The response measured was y, the logarithm of the number of marine animals found in the sampled area.)
a. According to the SPSS printout, which of the six independent variables should be used in the model? (Use α = .10.)
b. Are we able to assume that the EPA has identified all the important independent variables for the prediction of y? Why?
c. Using the variables identified in part a, write the first-order model with interaction that may be used to predict y.
d. How would the EPA determine whether the model specified in part c is better than the first-order model?
e.Note the small value of R². What action might the EPA take to improve the model?

Short Answer

Expert verified

Answer

a. The variables which should be used in the model are ST_DEPTH, TGRSWT, and TI.

b. The EPA should not assume that they have identified all the important independent variables for prediction. The stepwise procedure tends to perform a large number of t-tests, inflating the overall probability of a Type I error, and does not automatically include higher-order terms (e.g., interactions and squared terms) in the final model which might not give all the important variables for prediction.

c. Using variables identified in part a, the first-order model with interaction can be written as $E (y) = β_{0} + β_{1} (S T_{D} E P T H) + β_{2} (T G R S W T) + β_{3} (T I) + β_{4} (S T_{D E P T H}) (T G R S W T) + β_{5} (T G R S W T) (T I) + β_{6} (S T_{D E P T H}) (T I)$ .

d. To determine if model described in part c is better than first-order model, t-test hypothesis testing is conducted on interaction terms present in the model to check if they are statistically significant to the model or not.

e. The R² values for the three models are 0.122, 0.182, and 0.187. These values are significantly low and indicate that the model fitted to the data is not a good fit. To improve the model, different sets of variables ca be used which explain the variation in the data better.

Step by step solution

Variable selection

From the SPSS printout, it is clear that for ST_DEPTH, TGRSWT, and TI the p-value are <0.050. At α = .10, if p-value < α then H₀that the β parameter is not statistically significantrejected. Here for all three variables p-value < α indicating that all β values are statistically significant.

The variables which should be used in the model are ST_DEPTH, TGRSWT, and TI.

Drawbacks of stepwise regression model

The EPA should not assume that they have identified all the important independent variables for prediction. The stepwise procedure tends to perform a large number of t-tests, inflating the overall probability of a Type I error, and does not automatically include higher-order terms (e.g., interactions and squared terms) in the final model which might not give all the important variables for prediction.

Stepwise regression model

Using variables identified in part a, the first-order model with interaction can be written as $E (y) = β_{0} + β_{1} (S T_{D} E P T H) + β_{2} (T G R S W T) + β_{3} (T I) + β_{4} (S T_{D E P T H}) (T G R S W T) + β_{5} (T G R S W T) (T I) + β_{6} (S T_{D E P T H}) (T I)$ .

Significance of interaction term

To determine if model described in part c is better than first-order model, t-test hypothesis testing is conducted on interaction terms present in the model to check if they are statistically significant to the model or not.

Interpretation of R2

The R² values for the three models are 0.122, 0.182, and 0.187. These values are significantly low and indicate that the model fitted to the data is not a good fit. To improve the model, different sets of variables ca be used which explain the variation in the data better.