Monday, 19 September 2016

Cars - Multiple Linear Regression

Python File and Dataset
Python IDE PyCharm - pandas, matplotlib.pyplot, statsmodels.api, patsy.dmatrices, pandas.tools.plotting

The Study
In a study on cars, data was collected on the average price, in 1000s of US dollars, for 91 new car models for the year 1993, together with the following information.
  • airbag - air bag standard: 0 = none, 1 = driver only, 2 = driver and passenger
  • cmpg - miles per gallon for city driving
  • hmpg - miles per gallon for motorway driving
  • origin - place of manufacture: 0 = non-US, 1 = US
  • hp - maximum horsepower (a measure of the power of the engine)
The data was stored in a CSV file, 'cars.csv', with variates named as above and 91 rows (one for each car). I investigate this dataset to determine whether, and how, the average price is related to the other five variables.

Python File
In line 9 I load the datafile 'cars.csv' into a pandas dataframe object called 'df1'. Then in lines 12 to 16 I produce a scatterplot matrix of the data frame, print it and save it to a file. In lines 19 and 20 I calculate and print the correlations for all variables excluding price.
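
As a rough sketch of these steps (not the original script): the rows below are invented for illustration, and an in-memory buffer stands in for the output file, whose name is not given here. Newer pandas releases expose scatter_matrix from pandas.plotting rather than pandas.tools.plotting.

```python
import io

import matplotlib
matplotlib.use("Agg")  # draw off-screen so no display is needed
import matplotlib.pyplot as plt
import pandas as pd
from pandas.plotting import scatter_matrix  # pandas.tools.plotting in 2016-era pandas

# Invented rows for illustration only; the real script reads the 91-row 'cars.csv'.
csv_text = """price,airbag,cmpg,hmpg,origin,hp
15.9,0,25,31,0,140
33.9,2,18,25,0,200
29.1,1,20,26,1,172
37.7,2,19,26,0,172
30.0,1,22,30,1,208
12.5,0,29,34,1,110
9.8,0,31,38,0,92
40.1,2,17,23,1,255
"""
df1 = pd.read_csv(io.StringIO(csv_text))  # pd.read_csv('cars.csv') in the real script

# Scatterplot matrix of every variable against every other.
axes = scatter_matrix(df1, figsize=(8, 8))
png = io.BytesIO()  # stand-in for an output file name
plt.savefig(png, format="png")
plt.close("all")

# Correlations between the explanatory variables only (price excluded).
corr = df1.drop(columns="price").corr()
print(corr.round(4))
```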

In lines 22 to 25 I transform price to 1/price and hp to 1/hp, replacing the initial variables. In lines 28 to 32 I produce a scatterplot matrix of the transformed data frame, print it and save it to a file.
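
A minimal sketch of the reciprocal transformation, assuming a data frame named df1 with numeric price and hp columns (the three rows here are made up):

```python
import pandas as pd

# Made-up values; the real df1 holds the 91 rows from 'cars.csv'.
df1 = pd.DataFrame({"price": [10.0, 20.0, 40.0], "hp": [100.0, 125.0, 250.0]})

# Overwrite each column with its reciprocal, so the original values are replaced.
df1["price"] = 1.0 / df1["price"]
df1["hp"] = 1.0 / df1["hp"]

print(df1)
```

Because the columns are overwritten in place, any later step that needs the untransformed values would have to reload the file or work from a copy.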

In line 40 I set up the endogenous and exogenous matrices using patsy.dmatrices by specifying the linear regression relationship 'price ~ C(airbag) + cmpg + hmpg + C(origin) + hp', where C() marks a variable as categorical.
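
To show what patsy builds from this formula, here is a small sketch on invented data. The return_type="dataframe" argument is my choice for readable column names; the original call may differ.

```python
import pandas as pd
from patsy import dmatrices

# Invented rows covering every airbag and origin level.
df = pd.DataFrame({
    "price":  [15.9, 33.9, 29.1, 37.7, 30.0, 12.5, 9.8, 40.1],
    "airbag": [0, 2, 1, 2, 1, 0, 0, 2],
    "cmpg":   [25, 18, 20, 19, 22, 29, 31, 17],
    "hmpg":   [31, 25, 26, 26, 30, 34, 38, 23],
    "origin": [0, 0, 1, 0, 1, 1, 0, 1],
    "hp":     [140, 200, 172, 172, 208, 110, 92, 255],
})

formula = "price ~ C(airbag) + cmpg + hmpg + C(origin) + hp"
# y is the endogenous (response) matrix, X the exogenous (design) matrix.
y, X = dmatrices(formula, data=df, return_type="dataframe")

# C(airbag) expands to two indicator columns and C(origin) to one,
# each measured against the level-0 baseline; an intercept is added.
print(list(X.columns))
```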

Then in lines 44 to 46 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 50 to 66 I produce a composite residuals plot of the deviance residuals as a test for an appropriate fit.

In lines 70 to 96 I rerun the analysis with the new linear regression relationship 'price ~ C(airbag) + cmpg + C(origin) + hp'.

Analysis
From the initial scatterplot matrix (fig1) I can see that cars with airbag = 0 are generally low in price and that the variation in price is small for this group. Conversely, for airbag = 1 or 2 the variation in price is much greater, with possibly increasing means, so there appears to be a positive linear relationship between price and airbag.
Fig 1
The relationship between price and cmpg is nearly the same as that between price and hmpg, with high values of cmpg or hmpg generally corresponding to low values of price. There appears to be a negative relationship between price and cmpg/hmpg, although it is distinctly curved in nature.

The relationship between price and hp is generally a strong positive linear one; however, there is clearly an increase in variability as either variable increases.

Finally, there is little evidence of a relationship between price and origin, as origin = 0 and origin = 1 have roughly equal variance and mean.

The correlations between origin and the other explanatory variables are low; the largest in magnitude is with cmpg at -0.2663. Given that origin has low correlations with the other variables and an uninformative scatterplot, it is not clear that origin will be in the final model.

The correlation between cmpg and hmpg is very high at 0.9434. This means that high values of one correspond with high values of the other, so there is little to be gained by including both variables. In a good linear model I would therefore expect to see either cmpg or hmpg, but not both, and I would expect its weighting to be small.
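
One way to pick out such a pair programmatically is to scan the correlation matrix for the largest off-diagonal entry. The sketch below uses invented rows, so its numbers will not match the 0.9434 quoted above.

```python
from itertools import combinations

import pandas as pd

# Invented explanatory variables; the real values come from 'cars.csv'.
df = pd.DataFrame({
    "airbag": [0, 2, 1, 2, 1, 0, 0, 2],
    "cmpg":   [25, 18, 20, 19, 22, 29, 31, 17],
    "hmpg":   [31, 25, 26, 26, 30, 34, 38, 23],
    "origin": [0, 0, 1, 0, 1, 1, 0, 1],
    "hp":     [140, 200, 172, 172, 208, 110, 92, 255],
})
corr = df.corr()

# Correlation for each unordered pair of distinct variables.
pairs = {(a, b): corr.loc[a, b] for a, b in combinations(corr.columns, 2)}

# The pair with the largest correlation in magnitude is the likeliest redundancy.
strongest = max(pairs, key=lambda pair: abs(pairs[pair]))
print(strongest, round(pairs[strongest], 4))
```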

The correlations between airbag and cmpg, hmpg and origin are -0.3281, -0.2595 and 0.0707 respectively, all low in magnitude, and the correlation with hp is only 0.4993. The correlations between origin and hp, airbag, cmpg and hmpg are 0.0720, 0.0707, -0.2663 and -0.1869 respectively, which are also all low. Since neither airbag nor origin closely resembles another variable, I would expect both to be included in a good linear model, but with relatively small weightings.

The correlation of hp with origin is low at 0.0720, and its correlations with airbag, cmpg and hmpg are 0.4993, -0.6772 and -0.6280 respectively. These show up quite clearly in the scatterplot matrix as a positive relationship with airbag and negative relationships with cmpg and hmpg. So in a good linear model I would expect to see hp, with a large weighting because of its strong positive relationship with price.

I can conclude that a good regression model for the response variable price, given the explanatory variables airbag, cmpg, hmpg, hp and origin, would include either cmpg or hmpg but not both, together with airbag, origin and hp, with hp having the largest weighting.

In order to remove the curved nature of the relationship between price and hp I will need to transform price to 1/price, but in doing so I will also have to transform hp to 1/hp. If only price were transformed, its values would only increase or decrease, so the points on the plot would move only left or right. If they increased, the variability already seen as hp increases would become even greater; if they decreased, the plot would become even more curved. Hence hp must also be transformed.

After performing the transformation I reproduce the scatterplot matrix (fig2). The plots of price against airbag, cmpg, hmpg and hp seem to show good linear relationships, and there is no longer a problem of increasing variance, so they could be approximated by straight lines with either positive or negative slopes. The plots of origin against the other variables continue to show little to no apparent relationship. The transformation of price does not appear to have had much effect on the untransformed variables, so I can conclude that no further transformations are required for a linear regression analysis.
Fig 2
I proceed with a linear regression analysis on the full set of transformed variables using the relationship 'price ~ C(airbag) + cmpg + hmpg + C(origin) + hp', where C() marks a variable as categorical. From the table of parameter estimates, airbag and hp have p values of <0.001, so they are highly significant, and origin has p<0.01, so there is also strong evidence for its significance. The p value for cmpg is 0.067, which is only weak evidence for significance, and hmpg has a p value of 0.916, which means that there is little evidence that the slope for hmpg is not zero.
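
As a rough guide to reading these p values, the sketch below maps a p value to a verbal strength of evidence. The cut-offs (0.01, 0.05, 0.10) are a common convention that I am assuming here; other texts draw the lines differently, and evidence() is a hypothetical helper.

```python
def evidence(p):
    """Hypothetical helper: describe the evidence against a zero coefficient.

    The thresholds are a common convention, not a fixed rule.
    """
    if p < 0.01:
        return "strong"
    if p < 0.05:
        return "moderate"
    if p < 0.10:
        return "weak"
    return "little or none"


# The two exact p values quoted in the text.
p_values = {"cmpg": 0.067, "hmpg": 0.916}
for name, p in p_values.items():
    print(f"{name}: p = {p:.3f} -> {evidence(p)} evidence")
```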

From the composite residuals plot (fig3) of deviance residuals, I can see that the histogram appears to be normally distributed about zero, being symmetrical with values tailing off towards the extremes. The fitted values plot shows no clear pattern: the points appear randomly distributed with constant variance. There does appear to be one outlier towards the extreme end, but there is no reason to suspect a departure from normality.
Fig 3
The normal probability plot shows a very good straight line along the line y = x, so there is no evidence of departure from normality. However, there do appear to be two potential outliers, one at each extreme end of the data.
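
The straightness of a normal probability plot can also be summarised numerically. In this sketch the residuals are simulated from a normal distribution as a stand-in for the fitted model's residuals, and the two largest standardised residuals are listed as candidates for closer inspection; this check is my addition rather than part of the original analysis.

```python
import numpy as np
from scipy import stats

# Simulated residuals standing in for those of the fitted model.
rng = np.random.default_rng(1)
resid = rng.normal(0.0, 1.0, 91)

# probplot returns the plotted points and a least-squares line through them;
# r close to 1 means the points lie close to a straight line.
(osm, osr), (slope, intercept, r) = stats.probplot(resid, dist="norm")
print(f"probability plot correlation: {r:.3f}")

# Standardise the residuals and list the two most extreme ones.
z = (resid - resid.mean()) / resid.std(ddof=1)
candidates = np.argsort(np.abs(z))[-2:]
print("rows with the largest standardised residuals:", candidates, z[candidates].round(2))
```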

I can conclude from this composite residuals plot that there is no evidence against the assumption that the residuals of the fitted model are normally distributed, but it might be worth analysing the results both including and excluding the two potential outliers.

Rerunning the linear regression analysis on the transformed variables with the relationship 'price ~ C(airbag) + cmpg + C(origin) + hp', where C() marks a variable as categorical, I find that dropping the variable hmpg has not affected the significance of any of the other variables and has reduced the degrees of freedom used by the model. Comparing the composite residuals plot (fig4) with the previous one (fig3),
Fig 4
I can see that the residuals now look even closer to a normal distribution. Hence I can conclude that this new model is stronger than the previous model, as the residuals are more normally distributed and the power is increased by the reduction in the degrees of freedom.

By Edward Adcock
