Sunday, 25 September 2016

Daphnia - Analysis of Variance (ANOVA)


Python File and Data Set
Python IDE PyCharm - pandas, matplotlib.pyplot, statsmodels.api, patsy.dmatrices

The Study
A study was carried out to investigate the effect of detergent pollution on the growth of the common freshwater invertebrate Daphnia. Seventy-two small populations of Daphnia were raised in samples of river water to which detergent of different brands had been added, and the growth rates of the Daphnia were recorded. As well as differences in detergent brand, the populations differed in terms of Daphnia clone from which they were raised, and in terms of the source of the river water.

There were four variables recorded:
  • growth - the growth rate of the Daphnia which is the response variable
  • water - the source of the water in which the Daphnia was raised (Tyne, Wear)
  • deterg - the detergent brand used (BrandA, BrandB, BrandC, BrandD)
  • clone - the Daphnia clone (Clone1, Clone2, Clone3)
The data was stored in a csv datafile 'daphnia.csv', with variates with the above names and 72 rows (one for each population). I investigate this dataset with a view to seeing if I can determine whether, and how, the average growth rate is related to the other three categorical variables.

Python File
In line 9 I load the datafile 'daphnia.csv' into a pandas dataframe object called 'df1'. Then in line 11 I print the statistical summaries of the data. In lines 14 to 16 I slice df1 into three new dataframes, one for each categorical variable against growth. Then in lines 19 to 21 I print the statistical summaries of each of the three new dataframes.
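The load, summarise and slice steps can be sketched roughly as follows. This is a minimal illustration rather than the actual file: the six rows are invented values in the same column layout, and the names water_df, deterg_df and clone_df are my own placeholders.

```python
import io
import pandas as pd

# Hypothetical rows in the same layout as 'daphnia.csv'; the growth values are invented.
csv_text = """growth,water,deterg,clone
2.9,Tyne,BrandA,Clone1
3.1,Wear,BrandB,Clone1
4.8,Tyne,BrandC,Clone2
5.2,Wear,BrandD,Clone2
4.1,Tyne,BrandA,Clone3
3.7,Wear,BrandD,Clone3
"""

# With the real file this would be pd.read_csv('daphnia.csv').
df1 = pd.read_csv(io.StringIO(csv_text))
print(df1.describe())

# One two-column frame per categorical variable, each paired with growth.
water_df = df1[["water", "growth"]]
deterg_df = df1[["deterg", "growth"]]
clone_df = df1[["clone", "growth"]]

# Per-level summaries of growth for each factor (one possible form of summary).
for frame, factor in [(water_df, "water"), (deterg_df, "deterg"), (clone_df, "clone")]:
    print(frame.groupby(factor)["growth"].describe())
```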

In lines 25 to 38 I set up a figure with three axes and plot boxplots of growth against each of the three categorical variables, one per axes, then adjust the titles and finally show the figure and save it to a file.
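A boxplot figure of this kind might look something like the sketch below. It runs on a synthetic stand-in for the data (random growth values in an assumed 2 x 4 x 3 layout with 3 replicates per cell, giving 72 rows), and make_demo_frame and the output filename are hypothetical.

```python
import itertools
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def make_demo_frame(seed=0):
    """Synthetic stand-in for daphnia.csv (assumed balanced layout, random growth)."""
    rng = np.random.default_rng(seed)
    cells = itertools.product(["Tyne", "Wear"],
                              ["BrandA", "BrandB", "BrandC", "BrandD"],
                              ["Clone1", "Clone2", "Clone3"])
    rows = [cell for cell in cells for _ in range(3)]
    df = pd.DataFrame(rows, columns=["water", "deterg", "clone"])
    df["growth"] = rng.normal(4.0, 1.0, size=len(df))
    return df

df = make_demo_frame()

# 2 x 2 grid: three panels are used, the fourth is hidden.
fig, axes = plt.subplots(2, 2, figsize=(10, 8))
panels = axes.ravel()
for ax, factor in zip(panels, ["water", "deterg", "clone"]):
    levels = sorted(df[factor].unique())
    data = [df.loc[df[factor] == lev, "growth"].to_numpy() for lev in levels]
    ax.boxplot(data)
    ax.set_xticklabels(levels)
    ax.set_title(f"growth by {factor}")
    ax.set_ylabel("growth")
panels[3].set_visible(False)
fig.tight_layout()
fig.savefig("fig1_boxplots.png")  # hypothetical filename
```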

In line 42 I set up the endogenous and exogenous matrices using patsy.dmatrices by specifying the linear regression relationship 'growth ~ C(water) * C(deterg) * C(clone)'. Then in lines 46 to 48 I fit the model and print the results summary. Then in lines 51 and 52 I print out the analysis of variance summary table.
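To see what the formula produces, here is a small sketch of the patsy.dmatrices call on synthetic data (random growth values in an assumed balanced 2 x 4 x 3 layout with 3 replicates; make_demo_frame is a hypothetical helper). The full factorial formula expands to 24 design columns: the intercept, 6 main-effect columns, 11 two-way columns and 6 three-way columns.

```python
import itertools
import numpy as np
import pandas as pd
from patsy import dmatrices

def make_demo_frame(seed=0):
    """Synthetic stand-in for daphnia.csv (assumed balanced layout, random growth)."""
    rng = np.random.default_rng(seed)
    cells = itertools.product(["Tyne", "Wear"],
                              ["BrandA", "BrandB", "BrandC", "BrandD"],
                              ["Clone1", "Clone2", "Clone3"])
    rows = [cell for cell in cells for _ in range(3)]
    df = pd.DataFrame(rows, columns=["water", "deterg", "clone"])
    df["growth"] = rng.normal(4.0, 1.0, size=len(df))
    return df

df = make_demo_frame()

# y is the endogenous (response) matrix, X the exogenous (design) matrix.
y, X = dmatrices("growth ~ C(water) * C(deterg) * C(clone)",
                 data=df, return_type="dataframe")

print(y.shape)                # one response column
print(X.shape)                # one column per model term level
print(list(X.columns)[:4])    # intercept plus the first treatment-coded dummies
```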

In lines 56 to 72 I produce a composite residuals plot of the deviance residuals as a test for an appropriate fit. In lines 75 to 83 I use the groupby function and the mean function to print out the arrays of mean growth rate grouped by the main effects, the two-way interactions and the three-way interaction.
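The tables of means can be sketched with groupby and mean along these lines. The data here are synthetic (random growth values in an assumed balanced layout), and the variable names are placeholders of my own.

```python
import itertools
import numpy as np
import pandas as pd

def make_demo_frame(seed=0):
    """Synthetic stand-in for daphnia.csv (assumed balanced layout, random growth)."""
    rng = np.random.default_rng(seed)
    cells = itertools.product(["Tyne", "Wear"],
                              ["BrandA", "BrandB", "BrandC", "BrandD"],
                              ["Clone1", "Clone2", "Clone3"])
    rows = [cell for cell in cells for _ in range(3)]
    df = pd.DataFrame(rows, columns=["water", "deterg", "clone"])
    df["growth"] = rng.normal(4.0, 1.0, size=len(df))
    return df

df = make_demo_frame()

# Main effects: mean growth per level of each single factor.
main_means = {f: df.groupby(f)["growth"].mean() for f in ["water", "deterg", "clone"]}

# Two-way tables: mean growth for each pair of factor levels.
clone_water = df.groupby(["clone", "water"])["growth"].mean()
clone_deterg = df.groupby(["clone", "deterg"])["growth"].mean()

# Three-way table: mean growth in each of the 24 cells.
cell_means = df.groupby(["water", "deterg", "clone"])["growth"].mean()

print(main_means["clone"])
print(clone_water.unstack("water"))
print(cell_means.head())
```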

In lines 86 to 98 I set up a single figure with one ax and plot the mean growth rates for water at different levels of clone, set the labels and the legend then save and show the plot. In lines 101 to 113 I set up a single figure with one ax and plot the mean growth rates for deterg at different levels of clone, set the labels, tick marks and the legend then save and show the plot.
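A means plot of this type could be built roughly as below, shown for water at each level of clone. The data are synthetic (random growth values in an assumed balanced layout), and the helper name and output filename are hypothetical.

```python
import itertools
import matplotlib
matplotlib.use("Agg")  # non-interactive backend so the sketch runs without a display
import matplotlib.pyplot as plt
import numpy as np
import pandas as pd

def make_demo_frame(seed=0):
    """Synthetic stand-in for daphnia.csv (assumed balanced layout, random growth)."""
    rng = np.random.default_rng(seed)
    cells = itertools.product(["Tyne", "Wear"],
                              ["BrandA", "BrandB", "BrandC", "BrandD"],
                              ["Clone1", "Clone2", "Clone3"])
    rows = [cell for cell in cells for _ in range(3)]
    df = pd.DataFrame(rows, columns=["water", "deterg", "clone"])
    df["growth"] = rng.normal(4.0, 1.0, size=len(df))
    return df

df = make_demo_frame()

# Rows are water sources, columns are clones, entries are mean growth rates.
means = df.groupby(["water", "clone"])["growth"].mean().unstack("clone")

fig, ax = plt.subplots()
for clone in means.columns:
    # One line per clone; non-parallel lines hint at an interaction.
    ax.plot(list(means.index), means[clone].to_numpy(), marker="o", label=clone)
ax.set_xlabel("water")
ax.set_ylabel("mean growth")
ax.legend(title="clone")
fig.savefig("fig3_water_by_clone.png")  # hypothetical filename
```

Swapping "water" for "deterg" in the groupby gives the corresponding plot for detergent brand.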

Analysis
I begin with some exploratory analysis of the dataset.

From the boxplots of water (fig1-top left), the growth rates for Tyne are less dispersed than those for Wear, and both show a positive skew. The mean of Tyne lies within the second quartile of Wear, so the means appear to be similar, but the distributions do not look normal.
fig 1
From the boxplots of deterg (fig1-top right), the growth rates of brands A and B appear to be very similar and nearly symmetrical, whereas brands C and D are both positively skewed with much larger ranges. The medians of brands A, B and C are relatively close together and could be the same, but the median of brand D is below the second quartile of brands B and C and at the extreme lower end of brand A. Hence there is some doubt that the median of brand D is the same as those of the other three brands, and that brands C and D are normally distributed.

From the boxplots of clone (fig1-bottom left), the growth rates of clone 1 are noticeably lower and less dispersed than those of clones 2 and 3, although clone 1 is contained entirely within the range of clone 3. Clone 2 appears to have a positive skew and clone 3 a negative skew, but their medians appear to be relatively close together. Hence I can conclude that there is some doubt about the medians of clones 2 and 3 being the same as that of clone 1, and about clones 2 and 3 being normally distributed.

The independence of the results is assumed from the design of the experiment, and the variances across all groups are within the 'rule of 4' (that is, the variance of each group is within 4 times the variance of the other groups), so both assumptions are taken to be met.
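The 'rule of 4' check can be sketched as a ratio of the largest to the smallest group variance. The function name variance_ratio is my own, and the data are a synthetic stand-in (random growth values in an assumed balanced layout), so the printed ratios say nothing about the real study.

```python
import itertools
import numpy as np
import pandas as pd

def make_demo_frame(seed=0):
    """Synthetic stand-in for daphnia.csv (assumed balanced layout, random growth)."""
    rng = np.random.default_rng(seed)
    cells = itertools.product(["Tyne", "Wear"],
                              ["BrandA", "BrandB", "BrandC", "BrandD"],
                              ["Clone1", "Clone2", "Clone3"])
    rows = [cell for cell in cells for _ in range(3)]
    df = pd.DataFrame(rows, columns=["water", "deterg", "clone"])
    df["growth"] = rng.normal(4.0, 1.0, size=len(df))
    return df

def variance_ratio(frame, factor, response="growth"):
    """Largest sample variance across the levels of `factor` divided by the smallest."""
    variances = frame.groupby(factor)[response].var()
    return variances.max() / variances.min()

df = make_demo_frame()
for factor in ["water", "deterg", "clone"]:
    ratio = variance_ratio(df, factor)
    # The rule of thumb is satisfied when the ratio is below 4.
    print(factor, round(ratio, 2), ratio < 4)
```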

From the above exploration there is definite doubt about whether the data are normally distributed; however, it would not be unwise to proceed with an Analysis of Variance (ANOVA) with caution.

I fit the ANOVA model 'growth ~ C(water) * C(deterg) * C(clone)', where * means to include all three factors and all possible interactions between them.

From the ANOVA results table, the explanatory variable clone has a p value of p<0.001, which is strong evidence that there is a main effect for clone. The p values for deterg and water are 0.375 and 0.098 respectively, and so show little evidence of main effects.

From the ANOVA results table, clone.deterg and clone.water both have p values of p<0.001, which is strong evidence that the two variables interact with clone. So clone appears to have both a significant main effect and strong interactions with water and deterg. All other second order interactions have p values > 0.09 and so show little to no evidence of significant effects.

From the ANOVA results table, the p value for clone.deterg.water is p=0.234, which is little evidence of any interaction, and so there appears to be no significant third order interaction between the variables.

From the composite residuals plot (fig2) for the fitted ANOVA model, the histogram is unimodal and is plausibly normally distributed. The fitted values plot shows no evidence of changing variance or obvious patterns. The normal plot appears relatively straight along the y=x line with no obvious outliers, so there appears to be no cause for concern.
fig 2
So I can conclude from these plots that the assumptions underlying the model appear to be satisfied even given the concern I raised earlier about the possibility of growth rates not being normally distributed.

From the table of means I can see that all three factors should be included in the model because clone has a significant main effect and there are significant interactions between both clone.deterg and clone.water.

The means plot (fig3) for water (Tyne and Wear) at different levels of clone (1, 2 and 3) shows that the water source has plausibly no effect on clone 1 and a large positive effect on clone 2. There is a slight negative effect on clone 3, but it is within the standard error and so could plausibly be due to sample variation, meaning water may have no effect on clone 3.
fig 3
The clone2.water data give means of Tyne = 3.806 and Wear = 5.348, a difference much larger than the standard error of 0.3407. So overall, for the water means plot, changing the source of water from Tyne to Wear has a significant effect on clone 2 but not on clones 1 or 3. As the p value for the interaction is p<0.001, the evidence strongly suggests that the clone 2 mean is not equal to the means of clones 1 or 3.
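As a rough check on the size of that gap, the arithmetic can be written out as below. It takes the quoted standard error of 0.3407 at face value as the yardstick, and the variable names are my own.

```python
# Group means for clone 2 and the standard error, as quoted in the text.
tyne_mean = 3.806
wear_mean = 5.348
standard_error = 0.3407

difference = wear_mean - tyne_mean            # about 1.542
ratio = difference / standard_error           # gap expressed in standard errors

print(round(difference, 3))
print(round(ratio, 2))
```

The gap works out at roughly four and a half standard errors.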

Hence I can conclude that the water source has no effect on clone 1 or 3, but has a positive effect on clone 2, increasing its mean growth rate if water from the river Wear is used.

The means plot (fig4) for deterg (BrandA, B, C and D) at different levels of clone (1, 2 and 3) shows that for clone 1 the growth rate remains roughly the same. The changes are within the standard error, so it is plausible that deterg does not affect clone 1.
fig 4
Clone 2 shows an increasing trend across brands A, B, C and D, and these changes are roughly one standard error per level, so it is plausible that these means for clone 2 are different. Conversely, clone 3 shows a decreasing trend across brands A, B, C and D, with the change becoming steeper, so it is clear that the means for clone 3 are different.

So overall the deterg plot shows that changing the detergent used has a significant effect on clone 2 by increasing growth rate and on clone 3 by decreasing growth rate, while having no effect on clone 1. As the p value for the interaction is p<0.001, the evidence strongly suggests that the means for clone.deterg are not equal.

Hence I can conclude that the brand of detergent used has no effect on clone 1, but has a positive effect on clone 2 growth rates and a negative effect on clone 3 growth rates.

From the above analysis, the main effect of water has a p value of p=0.098, which is not significant at the 5% significance level. So although the main effect of water is not significantly different from zero, water does have an impact on the growth rates of different clone types, as the p value for the clone.water interaction is p<0.001.

This effect is most clearly seen in the clone2.water group means of Tyne=3.806 and Wear=5.348. The interaction means that the effect of water on growth depends on the clone: water changes the growth rate of clone 2 even though its effect averaged over all clones is small.

So I can conclude that although there is no main effect for water, there is a significant interaction with clone which does have a main effect and so water should be included in a good model.
