Python File and Dataset
Python IDE PyCharm - pandas, matplotlib.pyplot, statsmodels.api, patsy.dmatrices, pandas.tools.plotting
The Study
In a study on cars, data was collected on the average price (in thousands of US dollars) of 91 new car models for the year 1993, together with the following information.
- airbag - air bag standard: 0 = none, 1 = driver only, 2 = driver and passenger
- cmpg - miles per gallon for city driving
- hmpg - miles per gallon for motorway driving
- origin - place of manufacture: 0 = non-US, 1 = US
- hp - maximum horsepower (a measure of the power of the engine)
Python File
In line 9 I load the data file 'cars.csv' into a pandas DataFrame object called 'df1'. Then in lines 12 to 16 I produce a scatterplot matrix of the data frame, print it and save it to a file. In lines 19 and 20 I calculate and print the correlations for all variables excluding price.
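The loading, plotting and correlation steps can be sketched roughly as follows. This is a minimal illustration, not the original script: the data frame is a synthetic stand-in for 'cars.csv' (the values are made up, only the column names follow the study), the output path is hypothetical, and `scatter_matrix` is imported from `pandas.plotting`, where it lives in current pandas releases, rather than from `pandas.tools.plotting`.

```python
import os
import tempfile

import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import numpy as np
import pandas as pd
from pandas.plotting import scatter_matrix

# Synthetic stand-in for 'cars.csv' (NOT the study data), same column names.
# With the real file this would be: df1 = pd.read_csv("cars.csv")
rng = np.random.default_rng(0)
n = 91
hp = rng.uniform(60, 300, n)
cmpg = 45 - 0.08 * hp + rng.normal(0, 2, n)
df1 = pd.DataFrame({
    "price": 5 + 0.12 * hp + rng.normal(0, 3, n),
    "airbag": rng.integers(0, 3, n),
    "cmpg": cmpg,
    "hmpg": cmpg + 7 + rng.normal(0, 1.5, n),
    "origin": rng.integers(0, 2, n),
    "hp": hp,
})

# Scatterplot matrix of every pair of variables, saved to a file
axes = scatter_matrix(df1, figsize=(10, 10), diagonal="hist")
out_path = os.path.join(tempfile.gettempdir(), "fig1.png")  # hypothetical output path
axes[0, 0].get_figure().savefig(out_path)

# Correlations between the explanatory variables only (price excluded)
corr = df1.drop(columns="price").corr()
print(corr.round(4))
```

Dropping price before calling `corr()` keeps the table restricted to the explanatory variables, which is the part used to judge how much they overlap with each other.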
In lines 22 to 25 I transform price to 1/price and transform hp to 1/hp, replacing the initial variables. In lines 28 to 32 I produce a scatterplot matrix of the transformed data frame, print and save it to a file. In lines 19 and 20 I calculate and print the correlations for all variables excluding price.
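A reciprocal transformation that replaces the original columns might look like the sketch below. The three-row frame is purely illustrative and its values are assumptions, not taken from the data set.

```python
import pandas as pd

# Tiny hypothetical frame; the values are illustrative only
df1 = pd.DataFrame({
    "price": [10.0, 20.0, 40.0],
    "hp": [100.0, 200.0, 250.0],
    "cmpg": [30, 22, 18],
})

# Replace price and hp by their reciprocals, keeping the original column names
df1["price"] = 1.0 / df1["price"]
df1["hp"] = 1.0 / df1["hp"]

print(df1)
```

Because the columns are overwritten in place, every later step (scatterplot matrix, correlations, regression formula) picks up the transformed values without any change to the variable names.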
In line 40 I set up the endogenous and exogenous matrices using patsy.dmatrices by specifying the linear regression relationship 'price ~ C(airbag) + cmpg + hmpg + C(origin) + hp', where C() marks a variable as categorical.
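As a rough sketch of what patsy's `dmatrices` function produces for this formula, the example below builds the two matrices from a small made-up frame. The eight rows are hypothetical and only serve to make every airbag and origin level appear at least once.

```python
import pandas as pd
from patsy import dmatrices

# Small hypothetical frame with the study's column names (values are made up)
df1 = pd.DataFrame({
    "price":  [0.10, 0.05, 0.04, 0.08, 0.03, 0.06, 0.02, 0.07],
    "airbag": [0, 1, 2, 0, 2, 1, 2, 0],
    "cmpg":   [31, 24, 20, 29, 18, 25, 17, 28],
    "hmpg":   [38, 30, 27, 36, 25, 31, 24, 34],
    "origin": [0, 1, 0, 1, 1, 0, 0, 1],
    "hp":     [0.012, 0.007, 0.005, 0.010, 0.004, 0.008, 0.003, 0.009],
})

formula = "price ~ C(airbag) + cmpg + hmpg + C(origin) + hp"

# y is the endogenous (response) matrix, X the exogenous (design) matrix
y, X = dmatrices(formula, data=df1, return_type="dataframe")

print(X.columns.tolist())
```

With patsy's default treatment coding, each categorical variable expands into indicator columns measured against a reference level, so airbag contributes two columns, origin contributes one, and an intercept column is added automatically.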
Then in lines 44 to 46 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 50 to 66 I produce a composite residuals plot of the deviance residuals as a test for an appropriate fit.
In lines 70 to 96 I rerun the analysis with the new linear regression relationship 'price ~ C(airbag) + cmpg + C(origin) + hp'.
Analysis
From the initial scatterplot matrix (fig1) I can see that cars with airbag = 0 are generally low in price and that the variation in price is small for this group. Conversely, for airbag = 1 or 2 the variation in price is much greater, with possibly increasing means, so there appears to be a positive linear relationship between price and airbag.
Fig 1
The relationship between price and hp is generally a strong positive linear one; however, there is clearly an increase in variability as either variable increases.
Finally, there is little evidence of a relationship between price and origin, as the groups origin = 0 and origin = 1 have roughly equal variance and mean.
The correlations between origin and the other explanatory variables are low; the largest in magnitude is with cmpg at -0.2663. Given that origin has a low correlation with the other variables and an uninformative scatterplot, it is not clear that origin will be in the final model.
The correlation between cmpg and hmpg is very high at 0.9434. This means that high values of one correspond to high values of the other, so there is little to be gained by including both variables. In a good linear model I would therefore expect to see either cmpg or hmpg but not both, and that its weighting will be small.
The correlations between airbag and cmpg, hmpg and origin are -0.3281, -0.2595 and 0.0707, which are all low in magnitude, and for hp it is only 0.4993. The correlations between origin and hp, airbag, cmpg and hmpg are 0.0720, 0.0707, -0.2663 and -0.1869 respectively, which are also all low. Since neither variable closely resembles another, I would expect both airbag and origin to be included in a good linear model, but with relatively small weightings.
The correlation for hp against origin is low at 0.0720, and the rest are 0.4993, -0.6772 and -0.6280 for airbag, cmpg and hmpg respectively. These correlations are shown quite clearly in the scatterplot matrix as a positive relationship against airbag and a negative relationship against cmpg and hmpg. So in a good linear model I would expect to see hp, and for it to have a large weighting because of its strong positive relationship with price.
I can conclude that in a good regression model for the response variable price with the explanatory variables airbag, cmpg, hmpg, hp and origin, I would expect to see either cmpg or hmpg but not both, together with airbag, origin and hp, with hp having the largest weighting.
In order to remove the curved nature of the relationship between price and hp I will need to transform price to 1/price, but in doing so I will also have to transform hp to 1/hp. If only price were transformed, only the values of price would increase or decrease, so the points on the plot would move only left or right. In that case either the variability already seen as hp increases would become even greater, or the plot would become even more curved. Hence hp must also be transformed.
After performing the transformation I then reproduce the scatterplot matrix (fig2). The plots of price against airbag, cmpg, hmpg and hp seem to show good linear relationships, and there is no longer a problem of increasing variance, so they could be approximated by straight lines with either positive or negative slopes. The plots of origin against the other variables continue to show little to no apparent relationship. The transformation of price does not appear to have had much effect on its relationships with the untransformed variables, so I can conclude that no further transformations are required for a linear regression analysis.
Fig 2
From the composite residuals plot (fig3) of deviance residuals, I can see that the histogram appears to be normally distributed about zero, being symmetrical and with values tailing off towards the extremes. The fitted values plot shows no clear pattern, and the residuals appear to be randomly distributed with a constant variance. There does appear to be one outlier towards the extreme end, but there is no reason to suspect a departure from normality.
Fig 3
I can conclude from this composite residuals plot that there is no evidence against the assumption that the residuals of the fitted model are normally distributed, but it might be worth analysing the results both including and excluding the two potential outliers.
Rerunning the analysis as a linear regression on the full set of transformed variables with the relationship 'price ~ C(airbag) + cmpg + C(origin) + hp', where C() marks a variable as categorical, I find that dropping the variable hmpg from the analysis has not affected the significance of any of the other variables and has reduced the degrees of freedom for the analysis. Comparing the composite residuals plot (fig4) with the previous (fig3)
Fig 4
By Edward Adcock