Python File and Dataset
Python IDE: PyCharm. Libraries: pandas, numpy, matplotlib.pyplot, statsmodels.api, patsy.dmatrices
The Study
In a study, data on the heights (in centimeters) and ages (in years) of 204 Japanese pine seedlings and saplings was collected. The data was stored in a CSV file, 'pines.csv', with variates called 'height' and 'age'. I investigate this dataset to see whether I can predict the height of a tree from its age.
Python File
In line 9 I load the datafile 'pines.csv' into a pandas dataframe object called 'df1'. Then, to get a feel for the data, in lines 12 to 15 I produce a scatterplot with 'age' as the explanatory variable and 'height' as the response variable.
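A minimal sketch of the load-and-plot step, not the original file. The helper names 'load_pines' and 'scatter_height_vs_age' and the six sample rows are made up for illustration, so that the snippet runs without 'pines.csv'; with the real file, pass its path instead.

```python
import io

import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical stand-in rows, NOT the study data (the real file has 204 rows).
SAMPLE_CSV = """height,age
3.0,1
9.5,2
20.0,3
55.0,5
140.0,8
230.0,10
"""


def load_pines(source):
    """Read the dataset from a path such as 'pines.csv' or a file-like object."""
    return pd.read_csv(source)


def scatter_height_vs_age(df):
    """Scatterplot with 'age' as explanatory and 'height' as response variable."""
    fig, ax = plt.subplots()
    ax.scatter(df["age"], df["height"])
    ax.set_xlabel("age (years)")
    ax.set_ylabel("height (cm)")
    ax.set_title("Height against age")
    return fig, ax


df1 = load_pines(io.StringIO(SAMPLE_CSV))  # or load_pines("pines.csv")
fig, ax = scatter_height_vs_age(df1)
# Call plt.show() to display the figure interactively.
```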
In lines 17 and 18 I transform both variables using the natural log and store the results in new columns of the dataframe. In lines 23 to 26 I produce a scatterplot of the transformed data. In line 30 I set up the endogenous and exogenous matrices using patsy.dmatrices by specifying the linear regression relationship 'logh ~ loga'.
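The transformation and matrix set-up can be sketched as follows. The column names 'logh' and 'loga' come from the formula above; the small dataframe is a made-up stand-in for 'df1', and asking patsy for dataframes via return_type is my choice here, not necessarily what the original file does.

```python
import numpy as np
import pandas as pd
from patsy import dmatrices

# Hypothetical stand-in for df1, NOT the study data.
df1 = pd.DataFrame({
    "age": [1, 2, 3, 5, 8, 10],
    "height": [3.0, 9.5, 20.0, 55.0, 140.0, 230.0],
})

# Natural-log transforms stored as new columns.
df1["logh"] = np.log(df1["height"])
df1["loga"] = np.log(df1["age"])

# y is the endogenous (response) matrix, X the exogenous (design) matrix.
# patsy adds an intercept column to X automatically.
y, X = dmatrices("logh ~ loga", data=df1, return_type="dataframe")
print(X.head())
```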
Then in lines 33 to 54 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 43 to 62 I produce a composite residuals plot of the deviance residuals as a check that the fit is appropriate.
In line 65 I select the rows from 'df1' where 'age' is greater than 1, and in lines 69 to 101 I rerun the above analysis on that subset.
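The row selection is a boolean mask on the 'age' column. In this sketch the name 'df2' and the sample rows are hypothetical; the copy is taken so that new columns can be added to the subset without pandas warning about modifying a view.

```python
import pandas as pd

# Hypothetical stand-in for df1, NOT the study data.
df1 = pd.DataFrame({
    "age": [1, 1, 2, 3, 5],
    "height": [2.5, 3.0, 9.5, 20.0, 55.0],
})

# Keep only trees older than one year.
df2 = df1[df1["age"] > 1].copy()
print(df2)
```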
Analysis
From the initial scatterplot (fig 1) I can see that fitting a simple regression line to the untransformed data would be inappropriate: the variability clearly increases as the age of the pine tree increases, and the data appear to follow a curve. A transformation of the 'height' values alone would reduce the variability but not remove the curve, and a transformation of the 'age' values alone would remove the curve but not the variability, so both transformations are required for the data to be modelled appropriately by a simple linear regression model.
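To make the reasoning explicit, here is the model that logging both variables implies. The symbols \beta_0, \beta_1 and \varepsilon are my notation for the intercept, slope and error term; they do not come from the output.

```latex
\begin{align*}
\log(\text{height}) &= \beta_0 + \beta_1 \log(\text{age}) + \varepsilon \\
\text{height} &= e^{\beta_0} \, \text{age}^{\beta_1} \, e^{\varepsilon}
\end{align*}
```

If this relationship holds, then on the original scale the typical height follows a power curve in 'age' and the error is multiplicative, so the spread grows with the height. That matches both features seen in the scatterplot.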
fig 1
fig 2
The histogram in the composite residuals plot (fig 3) is negatively skewed, suggesting that the younger trees might be better modelled in a separate analysis. In addition, the plot against fitted values shows notably more points above the zero line at the low end, meaning those residuals are positive. In fact, all three observations there are positive, which is further evidence that the young trees might be better modelled separately.
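Both observations can also be checked numerically rather than by eye. This is a sketch only: the function name 'summarise_residuals' and the residual values are invented for illustration and are not results from the pine data.

```python
import pandas as pd


def summarise_residuals(resid, age):
    """Return the sample skewness of the residuals and their mean per age."""
    resid = pd.Series(resid, dtype=float).reset_index(drop=True)
    age = pd.Series(age).reset_index(drop=True)
    skewness = resid.skew()                      # negative means a long left tail
    mean_by_age = resid.groupby(age).mean()      # positive means under-prediction
    return skewness, mean_by_age


# Invented residuals, NOT from the fitted model.
resid = [0.4, 0.3, 0.5, -2.0, 0.1, 0.2, 0.3, 0.2]
age = [1, 1, 1, 2, 2, 3, 3, 3]

skewness, mean_by_age = summarise_residuals(resid, age)
print(skewness)
print(mean_by_age)
```

With real output, the residuals from the fitted model and df1['age'] would be passed in place of the invented lists.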
fig 3
fig 4
By Edward Adcock