Monday, 12 September 2016

Pines - Simple Linear Regression Model

Python File and Dataset
Python IDE PyCharm - pandas, numpy, matplotlib.pyplot, statsmodels.api, patsy.dmatrices

The Study
In a study, data on the heights (in centimeters) and ages (in years) of seedlings and saplings of 204 Japanese pine trees were collected. The data were stored in a CSV data file 'pines.csv', with variates called 'height' and 'age'. I investigate this dataset to see whether I can predict the height of a tree from its age.

Python File
In line 9 I load the data file 'pines.csv' into a pandas dataframe object called 'df1'. Then, in lines 12 to 15, to get a feel for the data, I produce a scatterplot with 'age' as the explanatory variable and 'height' as the response variable.
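As a rough illustration of this loading and plotting step, here is a minimal sketch. The original script is not reproduced in this post, so this is not the author's code; the few rows of data below are invented placeholders standing in for 'pines.csv', and only the names 'df1', 'age' and 'height' come from the description above.

```python
import io

import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Invented placeholder rows, NOT the real pines data.
csv_text = """age,height
1,1.5
2,4.0
3,8.5
4,13.0
5,21.0
"""

# With the real file this would be: df1 = pd.read_csv('pines.csv')
df1 = pd.read_csv(io.StringIO(csv_text))

# Scatterplot with 'age' as the explanatory variable and 'height' as the response.
fig, ax = plt.subplots()
ax.scatter(df1['age'], df1['height'])
ax.set_xlabel('age (years)')
ax.set_ylabel('height (cm)')
ax.set_title('Height against age')
```

In an interactive session, plt.show() (with the default backend) would display the figure.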

In lines 17 and 18 I transform both variables using the natural log and store the results in new columns in the dataframe. In lines 23 to 26 I produce a scatterplot of the transformed data. In line 30 I set up the endogenous and exogenous matrices using patsy.dmatrices by specifying the linear regression relationship 'logh' ~ 'loga'.
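The transformation and design-matrix step might look something like the sketch below. The column names 'loga' and 'logh' are taken from the formula quoted above; the data values are invented placeholders, and return_type='dataframe' is my own choice here, not something stated in the original script.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Invented placeholder rows, NOT the real pines data.
df1 = pd.DataFrame({
    'age': [1, 2, 3, 4, 5],
    'height': [1.5, 4.0, 8.5, 13.0, 21.0],
})

# Natural-log transform of both variables, stored as new columns.
df1['loga'] = np.log(df1['age'])
df1['logh'] = np.log(df1['height'])

# y is the endogenous (response) matrix, X the exogenous (design) matrix.
# patsy adds an 'Intercept' column of ones to X automatically.
y, X = dmatrices('logh ~ loga', data=df1, return_type='dataframe')
print(X.head())
```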

Then in lines 33 to 54 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 43 to 62 I produce a composite residuals plot of the deviance residuals as a check that the fit is appropriate.

In line 65 I select the rows from 'df1' where 'age' is greater than 1, and in lines 69 to 101 I rerun the above analysis on that subset.
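The row selection can be done with a boolean mask, roughly as in the sketch below. The name 'df2' is hypothetical (the post does not say what the subset is called) and the rows are invented placeholders.

```python
import pandas as pd

# Invented placeholder rows, NOT the real pines data.
df1 = pd.DataFrame({
    'age': [1, 1, 2, 3, 4],
    'height': [1.2, 1.6, 4.0, 8.5, 13.0],
})

# Keep only the trees older than 1 year; 'df2' is a hypothetical name.
df2 = df1[df1['age'] > 1]
print(len(df1) - len(df2), 'rows excluded')
```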


Analysis
From the initial scatterplot (fig1) I can see that fitting a simple regression line to the untransformed data would be inappropriate: the variability clearly increases with the age of the pine tree, and the data appear to follow a curve. A transformation of the 'height' values alone would reduce the variability but not remove the curve, and a transformation of the 'age' values alone would remove the curve but not the variability, so both variables need to be transformed for the data to be modelled appropriately by a simple linear regression model.
fig 1

Taking the natural log of both variables, I produce a scatterplot (fig2) of the transformed variables. The relationship now appears to be linear and is much better suited to a simple linear regression model.
fig 2
Fitting a regression model yields a fitted regression line of Yi = 0.1621 + 1.7237*Xi, where Yi is the log of the height and Xi is the log of the age. From the composite residuals plot (fig3) of the deviance residuals, I can see that the histogram appears to be roughly normally distributed about zero, although it is slightly negatively skewed. The fitted values plot and normal probability plot give no cause for concern: the normal plot is approximately linear and the residuals are scattered randomly against the fitted values with no obvious pattern. I conclude that the regression model appears to be appropriate for the transformed data.
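Because the model was fitted to 'logh' and 'loga', the line above translates into a power-law relationship on the original scale: height = exp(0.1621) * age^1.7237. The short sketch below applies that back-transformation; the function name predict_height is hypothetical, and only the two coefficients come from the fitted line quoted above.

```python
import math

# Coefficients of the fitted line on the log-log scale (from the first analysis).
INTERCEPT = 0.1621
SLOPE = 1.7237


def predict_height(age_years):
    """Back-transform the log-log fit to a height in cm for an age in years."""
    log_height = INTERCEPT + SLOPE * math.log(age_years)
    return math.exp(log_height)


for age_years in (1, 2, 5):
    print(age_years, round(predict_height(age_years), 2))
```

Note that exponentiating a prediction made on the log scale gives an estimate of the typical (median) height at that age rather than the mean.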

The histogram from the composite residuals plot (fig3) is negatively skewed, suggesting that the younger trees might be better modelled in a separate analysis. In addition, the fitted values plot shows notably more values above the line at the lowest fitted values, implying that the residuals there are positive; in fact all three observations are positive, which provides further evidence that young trees might be better modelled separately.
fig 3
I repeat the analysis excluding the 5 trees that are 1 year old. The new equation for the regression line is Yi = 0.1441 + 1.7333*Xi. From the composite residuals plot (fig4) there is little change to the normal plot. The histogram is still slightly negatively skewed, but it has a much greater peak at zero, which makes the assumption of normality more plausible. The fitted values plot now shows that the residuals at lower fitted values are well distributed about zero, but the residuals for older trees now tend to be slightly negative.
fig 4
The advantages of this new analysis are that the model fits better: the residuals are less negatively skewed and are centred close to 0. The disadvantages are that the model can no longer be used for 1-year-old trees, which narrows its applicability, and that it now tends to overestimate the height of older trees.

By Edward Adcock
