I found a paper discussing the body temperature of lizards, and how it is expected that the body temps have a left skew. This is for multiple reasons, such as time spent at different body temperatures. Another interesting reason is that in summer, body temps are higher than in the cold season. They also noted that lizards are more active in warmer months. So, I will be creating fake data to reflect this difference in lizard body temperature in winter compared to warmer months. If the distribution of data is skewed to the left, the mean is less than the median usually.
nGroup <- 3 # number of treatment groups
nName <- c("Summer","Winter", "Spring") # names of groups
nSize <- c(80,50,100) # number of observations in each group
nMean <- c(1,-5,0) # mean temp of each group, centered on modal temperature
nSD <- c(3,5,3) # standard deviation of each group
I made the sample sizes higher in warmer seasons to reflect the fact that lizards are more active in these seasons, and would then be more likely to be able to capture more lizards in the field. The means, centered on modal temperature, reflect the difference in typical temperature based on season, and based on the original paper.
ID <- 1:(sum(nSize)) # id vector for each row
resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
TGroup <- rep(nName,nSize)
ANOdata <- data.frame(ID,TGroup,resVar)
str(ANOdata)
## 'data.frame': 230 obs. of 3 variables:
## $ ID : int 1 2 3 4 5 6 7 8 9 10 ...
## $ TGroup: chr "Summer" "Summer" "Summer" "Summer" ...
## $ resVar: num -1.81 1.83 2.99 3.73 6.65 ...
ANOmodel <- aov(resVar~TGroup,data=ANOdata)
print(ANOmodel)
## Call:
## aov(formula = resVar ~ TGroup, data = ANOdata)
##
## Terms:
## TGroup Residuals
## Sum of Squares 1120.657 2507.007
## Deg. of Freedom 2 227
##
## Residual standard error: 3.323264
## Estimated effects may be unbalanced
print(summary(ANOmodel))
## Df Sum Sq Mean Sq F value Pr(>F)
## TGroup 2 1121 560.3 50.74 <2e-16 ***
## Residuals 227 2507 11.0
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# pool seasons to show left skew
ANOPlot <- ggplot(data=ANOmodel, aes(x=resVar, y=after_stat(density))) +
geom_histogram(color="grey60",fill="thistle1",size=0.2) + geom_density(linetype="dotted", linewidth=0.75)
## Warning: Using `size` aesthetic for lines was deprecated in ggplot2 3.4.0.
## ℹ Please use `linewidth` instead.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
print(ANOPlot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
# keep old treatment group sizes, names, and standard deviation, but change the means
newMean <- c(1,-1,1) # mean temp of each group, centered on modal temperature
newTemp <- c(rnorm(n=nSize[1],mean=newMean[1],sd=nSD[1]),
rnorm(n=nSize[2],mean=newMean[2],sd=nSD[2]),
rnorm(n=nSize[3],mean=newMean[3],sd=nSD[3]))
TGroup <- rep(nName,nSize)
newANOdata <- data.frame(ID,TGroup,newTemp)
newANOmodel <- aov(newTemp~TGroup,data=newANOdata)
print(summary(newANOmodel))
## Df Sum Sq Mean Sq F value Pr(>F)
## TGroup 2 49.2 24.59 2.185 0.115
## Residuals 227 2554.9 11.26
newANOPlot <- ggplot(data=newANOmodel, aes(x=newTemp, y=after_stat(density))) +
geom_histogram(color="grey60",fill="lightsalmon",size=0.2) + geom_density(linetype="dotted", linewidth=0.75)
print(newANOPlot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
Using means 1, -1, 2 still gives significant difference between
groups of Pr(>F) = 0.000954
Using means 1, -1, 1 still gives significant difference between
groups of Pr(>F) = 0.0123
However, in both of these examples, a left tail is not very noticeable.
# keep old treatment group names, means, and standard deviation, but change the sample sizes
newSize <- c(10,10,10) # number of observations in each group
newTemp <- c(rnorm(n=newSize[1],mean=nMean[1],sd=nSD[1]),
rnorm(n=newSize[2],mean=nMean[2],sd=nSD[2]),
rnorm(n=newSize[3],mean=nMean[3],sd=nSD[3]))
TGroup <- rep(nName,newSize)
ID <- 1:(sum(newSize)) # id vector for each row
newANOdata <- data.frame(ID,TGroup,newTemp)
newANOmodel <- aov(newTemp~TGroup,data=newANOdata)
print(summary(newANOmodel))
## Df Sum Sq Mean Sq F value Pr(>F)
## TGroup 2 199.2 99.58 5.273 0.0117 *
## Residuals 27 509.9 18.89
## ---
## Signif. codes: 0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newANOPlot <- ggplot(data=newANOmodel, aes(x=newTemp, y=after_stat(density))) +
geom_histogram(color="grey60",fill="lightsalmon",size=0.2) + geom_density(linetype="dotted", linewidth=0.75)
print(newANOPlot)
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
equal sample sizes of 100, 100, 100 gives significant difference
between groups of Pr(>F) = 0.0123
sample sizes of 10, 10, 10 gives significant difference between
groups ranging through Pr(>F) = 0.00183
to
Pr(>F) = 0.985
Write up your results in a markdown file, organized with headers and different code chunks to show your analysis. Be explicit in your explanation and justification for sample sizes, means, and variances.
If you have time, try repeating this exercise with one of the more sophisticated distributions, such as the gamma or negative binomial (depending on the kind of data you have). You will have to spend some time figuring out by trial and error the parameter values you will need to generate appropriate means and variances of the different groups.
trying beta distribution
myBeta <- rbeta(n=1000,shape1=5,shape2 = 2)
qplot(myBeta,xlim=c(0,1), color=I("black"),fill=I("orchid"))
## Warning: `qplot()` was deprecated in ggplot2 3.4.0.
## This warning is displayed once every 8 hours.
## Call `lifecycle::last_lifecycle_warnings()` to see where this warning was
## generated.
## `stat_bin()` using `bins = 30`. Pick better value with `binwidth`.
## Warning: Removed 2 rows containing missing values (`geom_bar()`).