Creating Fake Data Sets to Explore Hypotheses

1. Think about an ongoing study in your lab (or a paper you have read in a different class), and decide on a pattern that you might expect in your experiment if a specific hypothesis were true.

I found a paper discussing lizard body temperatures, which are expected to show a left-skewed distribution. There are multiple reasons for this, such as the amount of time lizards spend at different body temperatures. Another interesting reason is that body temperatures are higher in summer than in the cold season; the authors also noted that lizards are more active in the warmer months. So, I will create fake data to reflect this difference in lizard body temperature between winter and the warmer months. When a distribution is left-skewed, the mean is usually less than the median.
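As a quick illustration of that last point, here is a small check on a left-skewed sample (my own example, not data from the paper; the reflected gamma is used purely to generate a long left tail):

# quick check: in a left-skewed sample the mean falls below the median
set.seed(1) # arbitrary seed, for reproducibility only
leftSkew <- 40 - rgamma(n=1000, shape=2, scale=2) # reflected gamma: long left tail
mean(leftSkew) < median(leftSkew) # TRUE for a left-skewed sample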

2. To start simply, assume that the data in each of your treatment groups follow a normal distribution. Specify the sample sizes, means, and variances for each group that would be reasonable if your hypothesis were true. You may need to consult some previous literature and/or an expert in the field to come up with these numbers.
library(ggplot2) # loaded here because the plots below use ggplot()

nGroup <- 3 # number of treatment groups
nName <- c("Summer","Winter","Spring") # names of groups
nSize <- c(80,50,100) # number of observations in each group
nMean <- c(1,-5,0) # mean temp of each group, centered on modal temperature
nSD <- c(3,5,3) # standard deviation of each group

I made the sample sizes larger in the warmer seasons to reflect the fact that lizards are more active then, so more of them could be captured in the field. The means, centered on the modal temperature, reflect the seasonal differences in typical body temperature reported in the original paper.

3. Using the methods we have covered in class, write code to create a random data set that has these attributes. Organize these data into a data frame with the appropriate structure.
ID <- 1:(sum(nSize)) # id vector for each row
resVar <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
TGroup <- rep(nName,nSize)
ANOdata <- data.frame(ID,TGroup,resVar)
str(ANOdata)
## 'data.frame':    230 obs. of  3 variables:
##  $ ID    : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ TGroup: chr  "Summer" "Summer" "Summer" "Summer" ...
##  $ resVar: num  -1.81 1.83 2.99 3.73 6.65 ...
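As a sanity check (my own addition), the simulated group means and standard deviations can be compared with the parameters they were drawn from:

# compare simulated group statistics with the target parameters
aggregate(resVar~TGroup, data=ANOdata, FUN=function(x) c(mean=mean(x), sd=sd(x)))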
4. Now write code to analyze the data (probably as an ANOVA or regression analysis, but possibly as a logistic regression or contingency table analysis). Write code to generate a useful graph of the data.
ANOmodel <- aov(resVar~TGroup,data=ANOdata)
print(ANOmodel)
## Call:
##    aov(formula = resVar ~ TGroup, data = ANOdata)
## 
## Terms:
##                   TGroup Residuals
## Sum of Squares  1120.657  2507.007
## Deg. of Freedom        2       227
## 
## Residual standard error: 3.323264
## Estimated effects may be unbalanced
print(summary(ANOmodel))
##              Df Sum Sq Mean Sq F value Pr(>F)    
## TGroup        2   1121   560.3   50.74 <2e-16 ***
## Residuals   227   2507    11.0                   
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
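As an optional follow-up, base R's TukeyHSD() shows which pairs of seasons drive the overall difference:

TukeyHSD(ANOmodel) # post-hoc pairwise comparisons between seasons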
# pool seasons to show left skew
ANOPlot <- ggplot(data=ANOdata, aes(x=resVar, y=after_stat(density))) +
  geom_histogram(bins=30, color="grey60", fill="thistle1", linewidth=0.2) +
  geom_density(linetype="dotted", linewidth=0.75)
print(ANOPlot)
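A grouped boxplot is another useful view of the same data, showing the seasonal differences directly; a minimal sketch:

# boxplot of body temperature by season
boxPlot <- ggplot(data=ANOdata, aes(x=TGroup, y=resVar, fill=TGroup)) +
  geom_boxplot() +
  labs(x="Season", y="Body temperature (centered on mode)")
print(boxPlot)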

5. Try running your analysis multiple times to get a feeling for how variable the results are with the same parameters, but different sets of random numbers.
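One way to make this systematic is to wrap the simulation and the ANOVA in a helper function and collect the p-value from each run; a minimal sketch (the helper name simPval and the 100 replicates are my own choices):

# rerun the simulation many times and collect the ANOVA p-values
simPval <- function() {
  temps <- c(rnorm(n=nSize[1], mean=nMean[1], sd=nSD[1]),
             rnorm(n=nSize[2], mean=nMean[2], sd=nSD[2]),
             rnorm(n=nSize[3], mean=nMean[3], sd=nSD[3]))
  simData <- data.frame(TGroup=rep(nName, nSize), temps)
  summary(aov(temps~TGroup, data=simData))[[1]][["Pr(>F)"]][1]
}
pVals <- replicate(100, simPval())
summary(pVals) # spread of p-values across runs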
6. Now begin adjusting the means of the different groups. Given the sample sizes you have chosen, how small can the differences between the groups be (the “effect size”) for you to still detect a significant pattern (p < 0.05)?
# keep old treatment group sizes, names, and standard deviation, but change the means
newMean <- c(1,-1,1) # mean temp of each group, centered on modal temperature

newTemp <- c(rnorm(n=nSize[1],mean=newMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=newMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=newMean[3],sd=nSD[3]))
TGroup <- rep(nName,nSize)
newANOdata <- data.frame(ID,TGroup,newTemp)

newANOmodel <- aov(newTemp~TGroup,data=newANOdata)
print(summary(newANOmodel))
##              Df Sum Sq Mean Sq F value Pr(>F)
## TGroup        2   49.2   24.59   2.185  0.115
## Residuals   227 2554.9   11.26
newANOPlot <- ggplot(data=newANOdata, aes(x=newTemp, y=after_stat(density))) +
  geom_histogram(bins=30, color="grey60", fill="lightsalmon", linewidth=0.2) +
  geom_density(linetype="dotted", linewidth=0.75)
print(newANOPlot)

Using means of 1, -1, and 2 still gave a significant difference between groups, with Pr(>F) = 0.000954.

Using means of 1, -1, and 1 gave Pr(>F) = 0.0123 in one run, although this effect size sits near the detection threshold: the run shown above returned a non-significant Pr(>F) = 0.115 with the same parameters.

However, in both of these examples the left tail of the pooled distribution is not very noticeable.
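Instead of adjusting the means one at a time, the Winter mean can be scanned across a range of offsets in a loop; a sketch (one random run per offset, so results will vary from run to run):

# shrink the Winter offset toward zero and record the p-value at each step
for (offset in c(-5,-4,-3,-2,-1)) {
  temps <- c(rnorm(n=nSize[1], mean=1, sd=nSD[1]),
             rnorm(n=nSize[2], mean=offset, sd=nSD[2]),
             rnorm(n=nSize[3], mean=0, sd=nSD[3]))
  simData <- data.frame(TGroup=rep(nName, nSize), temps)
  p <- summary(aov(temps~TGroup, data=simData))[[1]][["Pr(>F)"]][1]
  cat("Winter mean =", offset, " Pr(>F) =", signif(p,3), "\n")
}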

7. Alternatively, for the effect sizes you originally hypothesized, what is the minimum sample size you would need in order to detect a statistically significant effect? Again, run the model a few times with the same parameter set to get a feeling for the effect of random variation in the data.
# keep old treatment group names, means, and standard deviation, but change the sample sizes
newSize <- c(10,10,10) # number of observations in each group

newTemp <- c(rnorm(n=newSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=newSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=newSize[3],mean=nMean[3],sd=nSD[3]))
TGroup <- rep(nName,newSize)
ID <- 1:(sum(newSize)) # id vector for each row
newANOdata <- data.frame(ID,TGroup,newTemp)

newANOmodel <- aov(newTemp~TGroup,data=newANOdata)
print(summary(newANOmodel))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## TGroup       2  199.2   99.58   5.273 0.0117 *
## Residuals   27  509.9   18.89                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
newANOPlot <- ggplot(data=newANOdata, aes(x=newTemp, y=after_stat(density))) +
  geom_histogram(bins=30, color="grey60", fill="lightsalmon", linewidth=0.2) +
  geom_density(linetype="dotted", linewidth=0.75)
print(newANOPlot)

Equal sample sizes of 100, 100, and 100 gave a significant difference between groups, with Pr(>F) = 0.0123.

Sample sizes of 10, 10, and 10 gave p-values ranging from Pr(>F) = 0.00183 to Pr(>F) = 0.985 across runs, so at this sample size a significant difference is detected only some of the time.
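For a less ad hoc answer, base R's power.anova.test() can approximate the per-group sample size needed for a given power. It assumes a common within-group variance, so plugging in the Summer/Spring SD of 3 here is only an approximation:

# approximate per-group n for 80% power with the hypothesized means
power.anova.test(groups=3,
                 between.var=var(nMean), # variance among the group means
                 within.var=nSD[1]^2,    # assumes a common SD of 3
                 sig.level=0.05,
                 power=0.80)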

8. Write up your results in a markdown file, organized with headers and different code chunks to show your analysis. Be explicit in your explanation and justification for sample sizes, means, and variances.

9. If you have time, try repeating this exercise with one of the more sophisticated distributions, such as the gamma or negative binomial (depending on the kind of data you have). You will have to spend some time figuring out by trial and error the parameter values you will need to generate appropriate means and variances of the different groups.

Trying a beta distribution; with shape1 > shape2 it is left-skewed, matching the shape expected for the body temperature data:

myBeta <- rbeta(n=1000, shape1=5, shape2=2)
betaPlot <- ggplot(data.frame(myBeta), aes(x=myBeta)) +
  geom_histogram(bins=30, color="black", fill="orchid") +
  xlim(0,1)
print(betaPlot)
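Instead of pure trial and error, gamma parameters can be solved directly from a target mean and variance: mean = shape*scale and variance = shape*scale^2, so scale = variance/mean and shape = mean/scale. A sketch with hypothetical targets (note the gamma is right-skewed, so for a left skew the draws could be subtracted from an upper bound, as in the earlier reflected-gamma example):

# solve gamma parameters from a (hypothetical) target mean and variance
targetMean <- 35 # hypothetical absolute body temperature
targetVar <- 9   # hypothetical variance (SD of 3)
gScale <- targetVar/targetMean # from variance = shape*scale^2
gShape <- targetMean/gScale    # from mean = shape*scale
myGamma <- rgamma(n=1000, shape=gShape, scale=gScale)
c(mean=mean(myGamma), var=var(myGamma)) # should land near the targets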