Question: Adverse effects of hot-water runoff. The Environmental Protection Agency (EPA) wants to determine whether the hot-water runoff from a particular power plant located near a large gulf is having an adverse effect on the marine life in the area. The goal is to acquire a prediction equation for the number of marine animals located at certain designated areas, or stations, in the gulf. Based on past experience, the EPA considered the following environmental factors as predictors for the number of animals at a particular station:

x1 = Temperature of water (TEMP)

x2 = Salinity of water (SAL)

x3 = Dissolved oxygen content of water (DO)

x4 = Turbidity index, a measure of the turbidity of the water (TI)

x5 = Depth of the water at the station (ST_DEPTH)

x6 = Total weight of sea grasses in sampled area (TGRSWT)

As a preliminary step in the construction of this model, the EPA used a stepwise regression procedure to identify the most important of these six variables. A total of 716 samples were taken at different stations in the gulf, producing the SPSS printout shown below. (The response measured was y, the logarithm of the number of marine animals found in the sampled area.)

a. According to the SPSS printout, which of the six independent variables should be used in the model? (Use α = .10.)

b. Are we able to assume that the EPA has identified all the important independent variables for the prediction of y? Why?

c. Using the variables identified in part a, write the first-order model with interaction that may be used to predict y.

d. How would the EPA determine whether the model specified in part c is better than the first-order model?

e. Note the small value of R². What action might the EPA take to improve the model?

Short Answer

Answer

a. The variables which should be used in the model are ST_DEPTH, TGRSWT, and TI.

b. No. The EPA should not assume that it has identified all the important independent variables for predicting y. The stepwise procedure performs a large number of t-tests, which inflates the overall probability of making at least one Type I error. It also considers only first-order terms in the six candidate variables: higher-order terms (e.g., interactions and squared terms) are not automatically included, so important terms may be missing from the final model.

c. Using the variables identified in part a, the first-order model with interaction can be written as E(y) = β0 + β1(ST_DEPTH) + β2(TGRSWT) + β3(TI) + β4(ST_DEPTH)(TGRSWT) + β5(TGRSWT)(TI) + β6(ST_DEPTH)(TI).

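d. To determine whether the model in part c is better than the first-order model, test the interaction terms as a group: H0: β4 = β5 = β6 = 0 against the alternative that at least one of these β parameters is nonzero, using a nested-model (partial) F-test that compares the interaction model with the first-order model. If H0 is rejected, the interaction terms contribute to the prediction of y; t-tests on the individual interaction terms can then show which of them are statistically significant.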

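e. The R² values for the three models in the printout are 0.122, 0.182, and 0.187. These values are low: even the largest means that only about 18.7% of the sample variation in y is explained, so the model does not fit the data well. To improve the model, the EPA can consider other sets of variables, including higher-order terms such as interactions and squared terms, that explain more of the variation in the data.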

Step by step solution

01

Variable selection

From the SPSS printout, the p-values for ST_DEPTH, TGRSWT, and TI are all less than 0.05. At α = .10, the null hypothesis H0: β = 0 (the variable does not contribute to the prediction of y) is rejected whenever the p-value < α. Here the p-value < α for all three variables, indicating that all three β parameters are statistically significant.

The variables which should be used in the model are ST_DEPTH, TGRSWT, and TI.

02

Drawbacks of stepwise regression model

The EPA should not assume that it has identified all the important independent variables for predicting y. The stepwise procedure performs a large number of t-tests, which inflates the overall probability of making at least one Type I error. It also considers only first-order terms in the six candidate variables: higher-order terms (e.g., interactions and squared terms) are not automatically included, so important terms may be missing from the final model.
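The inflation of the Type I error rate can be illustrated with a rough calculation. The sketch below assumes the t-tests are independent, which is not strictly true in stepwise regression, so the numbers are only indicative. The function name and the test counts (6 tests in the first step, 6 + 5 + 4 = 15 over three steps) are illustrative assumptions, not values taken from the SPSS printout.

```python
def familywise_error_rate(alpha: float, num_tests: int) -> float:
    """Probability of at least one Type I error in num_tests tests.

    Assumes the tests are independent and each is run at level alpha.
    """
    return 1.0 - (1.0 - alpha) ** num_tests


# alpha = .10 as in part a; the test counts are illustrative assumptions.
for num_tests in (1, 6, 15):
    rate = familywise_error_rate(0.10, num_tests)
    print(f"{num_tests:2d} tests -> P(at least one Type I error) = {rate:.3f}")
```

Under these assumptions the chance of at least one false positive grows quickly with the number of tests, which is why the selected variables should be treated as candidates rather than as a final model.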

03

First-order model with interaction

Using the variables identified in part a, the first-order model with interaction can be written as E(y) = β0 + β1(ST_DEPTH) + β2(TGRSWT) + β3(TI) + β4(ST_DEPTH)(TGRSWT) + β5(TGRSWT)(TI) + β6(ST_DEPTH)(TI).
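To make the structure of this model concrete, the sketch below evaluates E(y) for one station from seven coefficients. The function name and the coefficient values are hypothetical (the printout does not give estimates for the interaction model); the values are chosen only so the arithmetic is easy to check by hand.

```python
def predict_log_count(st_depth, tgrswt, ti, betas):
    """Evaluate E(y) for the first-order model with interaction.

    betas = (b0, b1, ..., b6), in the same order as the terms in the
    equation above.
    """
    if len(betas) != 7:
        raise ValueError("expected 7 coefficients: b0 through b6")
    terms = (
        1.0,                # intercept, multiplied by b0
        st_depth,           # main effect, b1
        tgrswt,             # main effect, b2
        ti,                 # main effect, b3
        st_depth * tgrswt,  # interaction, b4
        tgrswt * ti,        # interaction, b5
        st_depth * ti,      # interaction, b6
    )
    return sum(b * t for b, t in zip(betas, terms))


# Hypothetical coefficients, NOT estimates from the EPA data.
example_betas = (1.0, 1.0, 1.0, 1.0, 1.0, 1.0, 1.0)
print(predict_log_count(2.0, 3.0, 4.0, example_betas))  # 1+2+3+4+6+12+8 = 36.0
```

Because y is the logarithm of the number of animals, a value returned by this function is on the log scale.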

04

Significance of interaction term

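To determine whether the model in part c is better than the first-order model, test the interaction terms as a group: H0: β4 = β5 = β6 = 0 against the alternative that at least one of these β parameters is nonzero, using a nested-model (partial) F-test that compares the interaction model with the first-order model. If H0 is rejected, the interaction terms contribute to the prediction of y; t-tests on the individual interaction terms can then show which of them are statistically significant.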
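One common way to compare the interaction model with the first-order model as a whole is a nested-model F-test of H0: β4 = β5 = β6 = 0. The sketch below computes only the test statistic from the error sums of squares (SSE) of the two fitted models. The function name is hypothetical and the SSE values in the example are made up for illustration, since the printout does not report them.

```python
def nested_f_statistic(sse_reduced, sse_complete, n, k_complete, num_tested):
    """F statistic for comparing a reduced model with a complete model.

    sse_reduced  : SSE of the reduced (first-order) model
    sse_complete : SSE of the complete (interaction) model
    n            : number of observations
    k_complete   : number of beta parameters in the complete model,
                   excluding the intercept
    num_tested   : number of beta parameters set to zero under H0
    Returns (F, numerator df, denominator df).
    """
    df_denominator = n - (k_complete + 1)
    numerator = (sse_reduced - sse_complete) / num_tested
    denominator = sse_complete / df_denominator
    return numerator / denominator, num_tested, df_denominator


# n = 716 samples, 6 terms in the interaction model, 3 interaction terms tested.
# The SSE values below are hypothetical, NOT taken from the EPA data.
f_stat, df1, df2 = nested_f_statistic(500.0, 490.0, n=716, k_complete=6, num_tested=3)
print(f"F = {f_stat:.2f} with {df1} and {df2} degrees of freedom")
```

The statistic is compared with the F critical value for 3 and 709 degrees of freedom (roughly 2.1 at α = .10); a larger value leads to rejecting H0 in favor of the interaction model.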

05

Interpretation of R²

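The R² values for the three models in the printout are 0.122, 0.182, and 0.187. These values are low: even the largest means that only about 18.7% of the sample variation in y is explained, so the model does not fit the data well. To improve the model, the EPA can consider other sets of variables, including higher-order terms such as interactions and squared terms, that explain more of the variation in the data.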
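A related check is the adjusted R², a standard companion to R² that penalizes additional predictors; it is not discussed in the solution above and is shown here only as a supplementary sketch. The code assumes that the three reported models contain 1, 2, and 3 predictors respectively, which is the usual pattern for stepwise output but is not stated explicitly in the text. The function name is hypothetical.

```python
def adjusted_r_squared(r_squared, n, k):
    """Adjusted R-squared for a model with k predictors fit to n observations."""
    return 1.0 - (1.0 - r_squared) * (n - 1) / (n - (k + 1))


n = 716  # number of samples, from the problem statement
# (assumed number of predictors, reported R-squared) for each model
for k, r2 in ((1, 0.122), (2, 0.182), (3, 0.187)):
    unexplained = 1.0 - r2
    print(
        f"k = {k}: R2 = {r2:.3f}, "
        f"adjusted R2 = {adjusted_r_squared(r2, n, k):.4f}, "
        f"unexplained share = {unexplained:.3f}"
    )
```

With a sample this large the adjustment is small, so the main message is unchanged: most of the variation in y remains unexplained by these models.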

Most popular questions from this chapter

Question: Accuracy of software effort estimates. Periodically, software engineers must provide estimates of their effort in developing new software. In the Journal of Empirical Software Engineering (Vol. 9, 2004), multiple regression was used to predict the accuracy of these effort estimates. The dependent variable, defined as the relative error in estimating effort, y = (Actual effort - Estimated effort)/(Actual effort), was determined for each task in a sample of n = 49 software development tasks. Eight independent variables were evaluated as potential predictors of relative error using stepwise regression. Each of these was formulated as a dummy variable, as shown in the table.

Company role of estimator: x1 = 1 if developer, 0 if project leader

Task complexity: x2 = 1 if low, 0 if medium/high

Contract type: x3 = 1 if fixed price, 0 if hourly rate

Customer importance: x4 = 1 if high, 0 if low/medium

Customer priority: x5 = 1 if time of delivery, 0 if cost or quality

Level of knowledge: x6 = 1 if high, 0 if low/medium

Participation: x7 = 1 if estimator participates in work, 0 if not

Previous accuracy: x8 = 1 if more than 20% accurate, 0 if less than 20% accurate

a. In step 1 of the stepwise regression, how many different one-variable models are fit to the data?

b. In step 1, the variable x1 is selected as the best one-variable predictor. How is this determined?

c. In step 2 of the stepwise regression, how many different two-variable models (where x1 is one of the variables) are fit to the data?

d. The only two variables selected for entry into the stepwise regression model were x1 and x8. The stepwise regression yielded the following prediction equation:

Give a practical interpretation of the β estimates multiplied by x1 and x8.

e. Why should a researcher be wary of using the model in part d as the final model for predicting effort (y)?

Question: Tipping behaviour in restaurants. Can food servers increase their tips by complimenting the customers they are waiting on? To answer this question, researchers collected data on the customer tipping behaviour for a sample of 348 dining parties and reported their findings in the Journal of Applied Social Psychology (Vol. 40, 2010). Tip size (y, measured as a percentage of the total food bill) was modelled as a function of the size of the dining party (x1) and whether or not the server complimented the customers’ choice of menu items (x2). One theory states that the effect of the size of the dining party on tip size is independent of whether or not the server compliments the customers’ menu choices. A second theory hypothesizes that the effect of the size of the dining party on tip size is greater when the server compliments the customers’ menu choices than when the server refrains from complimenting menu choices.

a. Write a model for E(y) as a function of x1 and x2 that corresponds to Theory 1.

b. Write a model for E(y) as a function of x1 and x2 that corresponds to Theory 2.

c. The researchers summarized the results of their analysis with the following graph. Based on the graph, which of the two models would you expect to fit the data better? Explain.

Question: Write a regression model relating E(y) to a qualitative independent variable that can assume three levels. Interpret all the terms in the model.

Consider fitting the multiple regression model

E(y) = β0 + β1x1 + β2x2 + β3x3 + β4x4 + β5x5

A matrix of correlations for all pairs of independent variables is given below. Do you detect a multicollinearity problem? Explain.


The first-order model E(y) = β0 + β1x1 was fit to n = 19 data points. A residual plot for the model is provided below. Is the need for a quadratic term in the model evident from the residual plot? Explain.

