Sunday, 25 September 2016

Daphnia - Analysis of Variance (ANOVA)

Daphnia

Python File and Data Set
Python IDE PyCharm - pandas, matplotlib.pyplot, statsmodels.api, patsy.dmatricies

The Study
A study was carried out to investigate the effect of detergent pollution on the growth of the common freshwater invertebrate Daphnia. Seventy-two small populations of Daphnia were raised in samples of river water to which detergent of different brands had been added, and the growth rates of the Daphnia were recorded. As well as differences in detergent brand, the populations differed in terms of Daphnia clone from which they were raised, and in terms of the source of the river water.

There were four variables recorded:
  • growth - the growth rate of the Daphnia which is the response variable
  • water - the source of the water in which the Daphnia was raised (Tyne, Wear)
  • deterg - the detergent brand used (BrandA, BrandB, BrandC, BrandD)
  • clone - the Daphnia clone (Clone1, Clone2, Clone3)
The data was stored in a csv datafile 'daphnia.csv', with variates with the above names and 72 rows (one for each population). I investigate this dataset with a view to seeing if I can determine whether, and how, the average growth rate is related to the other three categorical variables.

Python File
In line 9 I load the datafile 'daphnia.csv' into a pandas dataframe object called 'df1'. Then in line 11 I print the statistical summaries of the data. In lines 14 to 16 I slice the df1 into three new data frames, one for each categorical variables against growth.  Then in lines 19 to 21 I print the statistical summaries of each of the three new dataframes.

In lines 25 to 38 I setup a figure with three axes and plot boxplots of the three categorical variables against growth to each, then adjust the titles and finally print and save it to a file.

In line 42 I set up the endogenous and exogenous matrices using patsy.damatricies by specifying the linear regression relationship 'growth ~ C(water) * C(deterg) * C(clone)'. Then in lines 36 to 48 I fit the model and print the results summary.Then in lines 51 and 52 I print out the analysis of variance summary table.

In lines 56 to 72 I produce a composite residuals plot of the deviance residuals as test for an appropriate fit. In lines 75 to 83 I use the group by function and the mean function to print out the array of mean value of growth rate grouped by the main effect, the secondary effects and the tertiary effects.

In lines 86 to 98 I set up a single figure with one ax and plot the mean growth rates for water at different levels of clone, set the labels and the legend then save and show the plot. In lines 101 to 113 I set up a single figure with one ax and plot the mean growth rates for deterg at different levels of clone, set the labels, tick marks and the legend then save and show the plot.

Analysis
I begin with some exploratory analysis of the dataset.

From the boxplots of water (fig1-top left) the growth rates of Tyne are less dispersed than the growth rates of wear, both show a positive skew and the mean of Tyne is within the second quartile of Wear, so the means appear to be similar but the distributions are not normal.
fig 1
From the boxplots of deterg (fig1-top right) the growth rates of brands A and B appear to be very similar and nearly symmetrical, however brands C and D are both positively skewed with much larger ranges. The medians of brands A B and C are relatively close together and could be the same, however the median of brand D is below the second quartile of brands B and C and on the extreme lower end of brand A. Hence it is in some doubt that the median of brand D is the same as the other three brands and that brands C and D are normally distributed.

From the boxplots of clone (fig1-bottom left) the growth rates of clone 1 are significantly lower and less dispersed than clones 2 and 3, however clone 1 is contained entirely within the range of clone 3. Clone 2 appears to have a positive skew and clone 3 appears to have a negative skew however their medians appear to be relatively close together. Hence i can conclude that there is some doubt about the median on clones 2 and 3 being the same as clone 1 and that clones 2 and 3 are normally distributed.

The assumption of independence for each result is assumed from the design of the experiment and the assumption of equal variance across all groups is within the 'rule of 4', (that is the variance is within 4 times the variance of the other groups) so both assumptions are met.

However from the above exploration there is definitely doubt about a normal distribution of the data however it would not be unwise to proceed with an Analysis of Variance (ANOVA) under caution.

I fit the ANOVA model 'growth ~ C(water) * C(deterg) * C(clone)' where * means to include all factors and all possible interactions between the three factors.

From the ANOVA results table, the explanatory variable clone has a p value of p<0.001, which is strong evidence that there is a main effect for clone. The p values for deterg and water and 0.375 and 0.098 respectively and so show little evidence of main effects.

From the ANOVA results table, the p value for clone.deterg and clone.water have p values of p<0.001, which is strong evidence that the two variables interact with clone. So clone appears to have both a significant main effect and strong interactions with water and deterg. All other second order interactions have p values > 0.09 and so show little to no evidence of significant effects.

From the ANOVA results table, the p value for clone.deterg.water is p=0.234 with is little evidence of any interactions and so there appears to be no significant third order interaction between the variables.

From the composite residuals plot (fig2) for the fitted ANOVA model the histogram is uni-model and is plausibly normally distributed. The fitted values plot shows no evidence of changing variance or obvious patterns. The normal plot appear relatively straight along the x=y axis with not obvious outliers so there appears to be no cause for concern.  
fig 2
So I can conclude from these plots that the assumptions underlying the model appear to be satisfied even given the concern I raised earlier about the possibility of growth rates not being normally distributed.

From the table of means I can see that all three factors should be included in the model because clone has a significant main effect and there are significant interactions between both clone.deterg and clone.water.

The means plot (fig3) for water (Tyne and Wear) at different levels of clone (1, 2 and 3) shows that the water source has plasisably no effect on clone 1 and a large positive effect on clone 2, however there is a slight negative effect on clone 3 which is within the standard error and so cloud plausibly be due to sample variation and so have no effect on clone 3.
fig 3
The gain from the data of clone2.water gives the means for Tyne = 3.806 and Wear = 5.348, which is a much larger difference than the standard error 0.3407., So overall for the water means plot, the change in source of water from Tyne to Wear has a significant effect on clone 2 but not on clones 1 or 3, and as the p value for the interaction is p<0.001 then the  evidence strongly suggests that the clone 2 mean is not equal to the means of clones 1 or 3.

Hence I can conclude that the water source has no effect on clone 1 or 3, but has a positive effect on clone 2 by increasing the mean growth rate of water if water from the Wear river is used.

The means plot (fig4) for deterg (BrandA, B, C and D) at different levels of clone (1, 2 and 3) shows that for clone 1 the growth rate remains roughly the same, as the changes are within the standard error so it is plausible that deterg does not effect clone1.
fig 4
Clone 2 shows an increasing trend on A, B, C and D and these changes are roughly one Standard Error per level so it is plausible that these means for clone 2 are different. Conversely, clone 3 shows a decreasing trend on A, B, C and D with increasing steepness in change, so it is clear that the means for clone 3 are different.

So overall the deterg plot shows that the changes in detergent used has a significant effect on clone bu increasing growth rate, on clone 3 by decreasing growth rate and having no effect on clone 1, and as the p value for the interaction of p<0.001 the evidence strongly suggests that the means for clone.deterg are not equal.

Hence I can conclude that the brand of detergent used has no effect on clone 1, but has a positive effect on clone 2 growth rates and a negative effect on clone 3 growth rates.

From the above analysis it was shown that the source of water has the main effect p value of p=0.098 which is rejected at the 5% significance level. So although the main effect of water is not significantly different from zero it does have an impact on the growth rates of different clone types, as the p value for the clone.water interaction is p<0.001.

This effect is most clearly seen in the clone2.water group means of Tyne=3.806 and Wear=5.348. This is because the explanatory variable water effects the explanatory variable clone which does have a main effect on growth, so the interaction between water and clone means that water indirectly effects growth through clone.

So I can conclude that although there is no main effect for water, there is a significant interaction with clone which does have a main effect and so water should be included in a good model.

Monday, 19 September 2016

Cars - Multiple Linear Regression

Cars - Multiple Linear Regression

Python File and Dataset
Python IDE PyCharm - pandas, matplotlib.pyplot, statsmodels.api, patsy.dmatricies, pandas.tools.plotting

The Study
In a study on cars data was collected on the average price in 1000s of US dollars for 91 new car models for the year 1993, together with the following information.
  • airbag - air bag standard: 0 = none, 1 = driver only, 2 = driver and passenger
  • cmpg - miles per gallon for city driving
  • hmpg - miles per gallon for motorway driving
  • origin - place of manufacture: 0 = non-US, 1 = US
  • hp - maximum horsepower (a measure of the power of the engine)
The data was stored in a csv datafile 'cars.csv', with variates with the above names and 91 rows (one for each car). I investigate this dataset with a view to seeing if I can determine whether, and how, the average price is related to the other five variables.

Python File
In line 9 I load the datafile 'cars.csv' into a pandas dataframe object called 'df1' . Then in lines 12 to 16, I produce a scatterplot matrix of the data frame, print and save it to a file. In lines 19 and 20 I calculate and print the correlations for all variables excluding price.

In lines 22 to 25 I transform price to 1/price and transform hp to 1/hp, replacing the initial variables. In lines 28 to 32  I produce a scatterplot matrix of the transformed data frame, print and save it to a file. In lines 19 and 20 I calculate and print the correlations for all variables excluding price.

In line 40 I set up the endogenous and exogenous matrices using patsy.dmatricies by specifying the linear regression relationship 'price ~ C(airbag) + cmpg + hmpg + C(origin) + hp' where C is a categorical variable.

Then in lines 44 to 46 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 50 to 66 I produce a composite residuals plot of the deviance residuals as test for an appropriate fit.

In lines 70 to 96 I rerun the analysis with new linear regression relationship  'price ~ C(airbag) + cmpg + C(origin) + hp'.

Analysis
From the initial scatterplot matrix (fig1) I can see that the relationship between price and airbag, shows that cars with airbag = 0 are generally low in price and that the variation in price is small for this group, conversely for airbag = 1 or 2 the variation of price is much greater with possibly increasing means, so there appears to be a positive linear relationship between price and airbag.
Fig 1
The relationship between price and cmpg is nearly the same as between price and hmpg with generally high values of cmpg or hmpg corresponding with low values of price, so there appears to be a negative relationship between price and cmpg/hmpg although the relationship is distinctly curved in nature.

The relationship between price and hp shows a generally strong positive linear relationship, however there is clearly an increase in variability as the either variable increases.

Finally there is little evidence of a relationship between price and origin as both origin = 1 or 2 have roughly equal variance and mean.

The correlations between the explanatory variables and origin are low, the maximum in magnitude is against cmpg at - 0.2663, so given that origin has a low correlation to the other variable and an uninformative scatterplot then it is not clear that origin will be in the final model.

The correlation between cmpg and hmpg is very high at 0.9434, this means that high values of one correspond with high values of the other so there is little to be gained by including both variables, so in a good linear model I would expect to see either cmpg or hmpg but not both and that the weighting will be small.

The correlations between airbag and cmpg, hmpg and origin are -0.3281, -0.2595 and 0.0707, which are all low in magnitude and for hp it is only 0.4993, also the correlations between origin and hp, airbag, cmpg, hmpg are 0.0720, 0.0707, -0.2663, -0.1869 respectively which are also all low or at least not too much like another variable then I would expect both airbag and origin to be included in a good linear model but relatively small in weighting.

The correlations for hp against origin is low at 0.0720 and the rest are 0.4993, -0.6772 and -0.6280 for airbag, cmpg and hmpg respectively, these correlations are shown quite clearly in the scatterplot matrix as a positive relationship against airbag and a negative relationship against cmpg and hmpg. So in a good liner model I would expect to see hp and for it to have a large weighting because of the strong positive relationship with price.

I can conclude that in a good regression model for the response variable price for the explanatory variables airbag, cmpg, hmpg, hp and origin, IO would expect to see either cmpg or hmpg but not both with airbag and origin and hp, with hp having the largest weighting.

In order to remove the curved nature of the relationship between price and hp I will need to transform price to 1/price, but in doing so I will also have to transform hp to 1/hp because if only price was transformed the values of price would only increase or decrease so the points on the plot would move only left or right, if this happens then the variability already seen as hp increases would be even greater, or if price was decreased then the plot would become even more curved, hence hp must also be transformed.

After performing the transformation I then reproduce the scatterplot matrix (fig2). The variable price against airbag, cmpg, hmpg and hp plots seem to show good linear relationships and there is no longer a problem of increasing variance so they could be approximated by straight lines with either positive or negative slopes. The origin plots against the other variables continue to show little to no apparent relationship. The transformation of price does not appear to have had much effect on the untransformed variables and so I can conclude that no further transformations are required for a linear regression model analysis.
Fig 2
Proceeding with a linear regression analysis on the full set of transformed variables with the relationship, 'price ~ C(airbag) + cmpg + hmpg + C(origin) + hp' where C is a categorical variable. From the table of estimates of parameters airbag and hp all have p values of <0.001 so they are highly significant and origin has p<0.01 so also strong evidence for significance. The p value for cmpg is 0.067 which is strong evidence for significance and hmpg has p value 0.916 which means that there is little evidence that the slope for hmpg is not zero.

From the composite residuals plots (fig3) of deviance residuals, I can see that the histogram appears to be normally distributed about zero, being symmetrical and with values tailing off towards the extremes. The fitted values plot shows no clear pattern and these are clearly randomly distributed with a constant variance, however there does appear to be one outlier towards the extreme end however there is no reason to suspect a departure from normality.
Fig 3
The normal probability plot clearly shows a very good straight line along the x=y line so there is no evidence of departure from normality, however there does appear to be two potential outliers that are at either extreme end of the data.

I can conclude from this composite residuals plot the the plots do not show any evidence of the assumption that the residuals of the fitted model are normally distributed, but it might be worth analysing the results both including and excluding the two potential outliers.

Rerunning the analysis with with a linear regression analysis on the full set of transformed variables with the relationship, 'price ~ C(airbag) + cmpg + C(origin) + hp' where C is a categorical variable. I find that by dropping the variable hmpg from the analysis has not effected the significance of any of the other variables and has reduced the degrees of freedom for the analysis. Comparing the composite residuals plot (fig4) with the previous (fig3)
Fig 4
I can see that the model is now even more representative of a normal distribution. Hence I can conclude that this new model is stronger than the previous model as the residuals are more normally distributed and the power is increased by the reduction to the degrees of freedom.

By Edward Adcock

Monday, 12 September 2016

Ventrial - Analysis of Variance (One-way)

Ventrial

Python File and Dataset 
Python IDE PyCharm - pandas, numpy, matplotlib.pyplot, statsmodels.api.stats.anova_lm, patsy.dmatricies, statsmodels.formula.api

The Study
In a clinical trial to investigate the effects of different ventilation treatments on patients undergoing cardiac bypass surgery, twenty-two patients were randomized to receive one of three treatments as follows:
  • Treatment I - Patients received a 50% nitrous oxide and 50% oxygen mixture continuously for 24 hours
  • Treatment II - Patients received a 50% nitrous oxide and 50% oxygen mixture only during the operation
  • Treatment III - Patients received no nitrous oxide but received 35-50% oxygen for 24 hours
Among the information recorded about the patients was the level of folate in their red blood cells 24 hours after the start of ventilation. It was suspected that different types of ventilation, and in particular different amounts of nitrous oxide, might have an effect on this.

The data was stored in a csv datafile 'ventrial.csv', with a variate called 'folate' containing the red blood cell folate level (in ug/l), and the factor 'ventil' with three levels (I, II and III) indicating the treatment group. I investigate this dataset with a view to seeing if there are any differences in the treatment groups and which treatment performs the best.


Python File
In line 8 I load the datafile 'ventrial.csv' into a pandas dataframe object called 'df1' . Then to get a feel for the data in lines 11 to 13, I produce a boxplot of the data grouped by treatment I, II and III, and use the .describe function to calculate and print the summary statistics for the variable 'folate'.

In lines 16 to 18 I fit the ordinary least squares regression model the run the analysis of variance summary results and print the table. In lines 19 and 20 I save the residuals of the fitted model and the fitted values of the model for later reference.

In lines 24 to 39 I produce a composite residuals plot of the deviance residuals as test for an appropriate fit.

In lines 42 to 44 I run a further analysis based on the treatment contrast O1 = vantil I - ventil II and print the summary table. In lines 46 to 48 I run a further analysis based on the treatment contrast O1 = vantil I - ventil III and print the summary table.



Analysis
This study is a controlled experiment because each person can be viewed as s single experiment taking one randomly selected treatment, so in theory this study could be repeated many times with different people. The treatment groups have been defined in a way that the researcher has set, hence the experiment is controlled.

From the boxplots (fig1) I can see that the ventil treatment group I has the largest range with the highest median 319, the lower quartile contains the median of group III but not of group II, also the data for group I appear to be normally distributed as the boxplot is symmetrical.
fig 1
The ranges of both groups II and III are very similar, however the median of group II 254 is the lowest but contained within the lower quartile of group III. Also the data for group II appear to be normally distributed.

Lastly the mean of group III is 270 which is in the middle of group I and II medians, the data for this group show a slight negative skew however due to the small number of observations (only 5) for group III then this is well within sampling variation of a normal distribution.

The assumption of independence for each result is assumed from the design of the experiment, the assumption of equal variance across all groups is within the 'rule of 4', that is the variance is within 4 times the variance of the other groups, so this holds as well.

Finally the assumption that the data for each group is normally distributed is clearly shown by the boxplots as all boxplots appear to be roughly symmetrical about their group medians, although group III is slightly skewed. I can conclude it is reasonable to proceed with an analysis of variance.

From the model summary table the test statistic given in is 3.711, this is compared to the distribution F(2,19) giving a p value of p = 0.044, this is moderate evidence against the null hypothesis of equal means for all groups i.e. mean(I)=mean(II)=mean(III). I can conclude from this that at least one of the group means for folate levels for each treatment group is not equal to the others.

From composite residual plots (fig2) of the fitted model above, the histogram looks slightly odd but given the small number of observations (only 22) this is well within expected sampling variation, also the residuals appear to tail off towards the extreme values of the data, so it not unreasonable to assume the residuals are normally distributed.
fig 2
The fitted values plot shows that the residuals variance for group I appears to be higher than the other groups, but altogether there is no apparent pattern to the residuals.

The normal plot show that the two most extreme values which could be considered to be potential outliers five a slight curve to the extreme edges of the residuals, but again due to the small number of observations this is not a cause for concern, however it might be beneficial to examine the analysis of variance results without these two points.

I can conclude that the composite residuals plots do not give any substantial evidence that the residuals are not normally distributed so I can conclude that the one-way ANOVA model appears to be adequate.

I then proceed to test the null hypothesis that O1 = mean(I) - mean(II) = 0, that is the mean of the group I minus the mean of group II is equal to 0. By using the contrast group parameters [[1],[-1],[0]]. This gives the test statistic 7.62 compared against the F(1,19) distribution giving a p values of P=0.012. This is strong evidence against the null hypothesis that the means of groups I and II are the same.

I then proceed to test the null hypothesis that O2 = mean(I) - mean(III) = 0, that is the mean of the group I minus the mean of group III is equal to 0. By using the contrast group parameters [[1],[0],[-1]]. This gives the test statistic 2.82 compared against the F(1,19) distribution giving a p values of P=0.108. This is little to no evidence against the null hypothesis that the means of groups I and III are the same.

In light of the above tests of group contrasts, there is evidence to suggest that the treatment groups I and II do have different means, that is, providing the same treatment for 24 hours (group I) against only during the operation (group II), both with a 50% nitrous oxide and 50% oxygen. In fact the evidence shows that group II has a lower mean then group I.

However there is little evidence to support the study's main aim that nitrous oxide improves the folate levels of patients because the test O2 = mean(I) - mean(III) shows little evidence of different group means, because as group I received 50% nitrous oxide and 50% oxygen group III only received 35-50% oxygen for the same period of time.

I can conclude that there is little to no evidence to support that nitrous oxide does affect the folate levels 24 hours after ventilation starts.

By Edward Adcock

Pines - Simple Linear Regression Model

Pines

Python File and Dataset
Python IDE PyCharm - pandas, numpy, matplotlib.pyplot, statsmodels.api, patsy.dmatricies

The Study
In a study, data on the heights (in centimeters) and ages (in years) of seedlings and saplings of 204 Japanese pine trees was collected. The data was stored in a csv datafile 'pines.csv', with variates called 'height' and 'age'. I investigate this dataset with a view to seeing if I can predict the height based on the age of the tree.

Python File
In line 9 I load the datafile 'pines.csv' into a pandas dataframe object called 'df1' . Then to get a feel for the data in lines 12 to 15, I produce a scatterplot of the data with 'age' as the explanatory variable and 'height' as the response variable.

In lines 17 and 18 I transform the variables using the natural log and store these in new columns in the dataframe. In lines 23 to 26 I produce a scatter plot of the transformed data. In line 30 I set up the endogenous and exogenous matrices using patsy.damatricies by specifying the linear regression relationship 'logh' ~ 'loga'.

Then in lines 33 to 54 I fit the model and print the results summary. In lines 38 and 39 I calculate the regression line for later use. In lines 43 to 62 I produce a composite residuals plot of the deviance residuals as test for an appropriate fit.

In line 65 I select the rows from 'df1' where 'age' is greater the 1 and in lines 69 to 101 I rerun the above analysis.


Analysis
From the initial scatterplot (fig1) I can see that fitting a simple regression line to the untransformed data would be inappropriate because there is a clear increase in the variability as the age of the pine tree increases and that the data appear to exhibit a curve. As a transformation of the 'height' values would only reduce the variability but not the curve and a transformation of the 'age' variables would only remove the curve but not the variability, so a combination is required for the data to be modelled appropriately by a simple linear regression model.
fig 1

Taking the natural log of both variables, I plot a scatterplot (fig2) of the transformed variables. The relationship now appears to be linear and is certainly more appropriate to be modelled by a simple linear regression model.
fig 2
Fitting a regression model yields a fitted regression line of Yi = 0.1621 + 1.7237Xi. From the composite residuals plots (fig3) of deviance residuals, I can see that the histogram appears to be normally distributed about zero, however it is slightly negatively skewed. The fitted values plot and normal probability plot give no cause for concern as the normal plot is approximately linear and the fitted values are randomly distributed with no obvious pattern. I can conclude that the regression model appears to be appropriate for the transformed data.

The histogram from the composite residuals plot (fig3) is negatively skewed, implying that the younger trees might be better modelled using a separate analysis. In addition the fitted values plot does show notably more values above the line in the beginning implying that the residuals are positive, in fact all three observations are positive providing evidence that young trees might be better modelled using a separate analysis.
fig 3
I re-perform the analysis excluding the 5 tress that are 1 year old. The new equation for the regression line is Yi = 0.1441 + 1.7333*Xi. From the composite residuals plots (fig4) there is little change to the normal plot, however the histogram is still slightly negatively skewed but it does have  a much greater peak at zero, thus improving on the assumption of normality. The fitted values plot now shows that lower values are well distributed however older trees now show as slightly negatively distributed.
fig 4
The advantages of this new analysis are that the model is stronger as the data residuals are less negatively skewed and the location of the residuals is close to 0. The disadvantages of this new analysis are that it is now inappropriate to use the model for trees of 1 year old so reducing its use and that the model now tends to overestimate the height of older trees.

By Edward Adcock