Creating Fake Data Sets to Explore Hypotheses

1. Think about an ongoing study in your lab (or a paper you have read in a different class), and decide on a pattern that you might expect in your experiment if a specific hypothesis were true.

Experiment: How does earthworm invasion intensity affect soil insect richness?

Treatments: Control, Low, High (invasion intensity)

Variable: Insect abundance (# of insects present)

2. To start simply, assume that the data in each of your treatment groups follow a normal distribution. Specify the sample sizes, means, and variances for each group that would be reasonable if your hypothesis were true. You may need to consult some previous literature and/or an expert in the field to come up with these numbers.

my_data <- read.csv("insect_abundance_earthworm_invasion.csv",header=TRUE,sep=",")
str(my_data)
## 'data.frame':    30 obs. of  3 variables:
##  $ Control: int  74 50 69 45 56 49 35 57 78 80 ...
##  $ Low    : int  56 67 45 67 78 98 45 57 87 34 ...
##  $ High   : int  23 32 34 32 34 45 13 42 53 12 ...
summary(my_data)
##     Control           Low             High      
##  Min.   :35.00   Min.   :23.00   Min.   :10.00  
##  1st Qu.:57.00   1st Qu.:34.50   1st Qu.:21.75  
##  Median :68.00   Median :45.00   Median :32.00  
##  Mean   :68.37   Mean   :51.15   Mean   :30.46  
##  3rd Qu.:83.00   3rd Qu.:65.00   3rd Qu.:39.75  
##  Max.   :97.00   Max.   :98.00   Max.   :53.00  
##                  NA's   :3       NA's   :6
var(my_data$Control)
## [1] 274.792
var(my_data$Low, na.rm = TRUE)
## [1] 362.8234
var(my_data$High, na.rm = TRUE)
## [1] 164.9547
sd(my_data$Control)
## [1] 16.57685
sd(my_data$Low, na.rm = TRUE)
## [1] 19.04792
sd(my_data$High, na.rm = TRUE)
## [1] 12.84347

Control: Sample size = 30, Mean = 68.37, Variance = 274.792, SD = 16.57685

Low: Sample size = 27, Mean = 51.15, Variance = 362.8234, SD = 19.04792

High: Sample size = 25, Mean = 30.46, Variance = 164.9547, SD = 12.84347

3. Using the methods we have covered in class, write code to create a random data set that has these attributes. Organize these data into a data frame with the appropriate structure.

# creating data table 
nGroup <- 3 # number of treatment groups
nName <- c("Control","Low", "High") # names of groups
nSize <- c(30,27,25) # number of observations in each group
nMean <- c(68.37,51.15,30.46) # mean of each group
nSD <- c(16.58,19.05,12.84) # standard deviation of each group


ID <- 1:(sum(nSize)) # id vector for each row
Abundance <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
Treatment <- rep(nName,nSize)
ANOdata <- data.frame(ID,Treatment,Abundance)
str(ANOdata)
## 'data.frame':    82 obs. of  3 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Treatment: chr  "Control" "Control" "Control" "Control" ...
##  $ Abundance: num  100.1 88.7 94.7 103 68.8 ...
print(ANOdata)
##    ID Treatment Abundance
## 1   1   Control 100.08390
## 2   2   Control  88.74780
## 3   3   Control  94.71581
## 4   4   Control 103.00641
## 5   5   Control  68.77967
## 6   6   Control  89.55094
## 7   7   Control  57.49989
## 8   8   Control  69.15121
## 9   9   Control  74.32657
## 10 10   Control  45.99579
## 11 11   Control  83.62930
## 12 12   Control  93.95153
## 13 13   Control  42.68466
## 14 14   Control  61.47345
## 15 15   Control  63.60675
## 16 16   Control  72.65832
## 17 17   Control  66.95870
## 18 18   Control  67.54365
## 19 19   Control  67.05332
## 20 20   Control  66.85924
## 21 21   Control  60.33262
## 22 22   Control  48.61222
## 23 23   Control  55.00196
## 24 24   Control  51.42227
## 25 25   Control  64.78489
## 26 26   Control  64.75341
## 27 27   Control  52.08256
## 28 28   Control 105.18089
## 29 29   Control  76.27160
## 30 30   Control  57.23531
## 31 31       Low  68.22492
## 32 32       Low  54.14860
## 33 33       Low  36.92846
## 34 34       Low  50.56000
## 35 35       Low  61.57487
## 36 36       Low  84.27558
## 37 37       Low  25.72168
## 38 38       Low  63.32460
## 39 39       Low  51.33056
## 40 40       Low  56.03900
## 41 41       Low  23.78432
## 42 42       Low  85.94737
## 43 43       Low  70.74214
## 44 44       Low  85.79916
## 45 45       Low  68.63415
## 46 46       Low  78.08207
## 47 47       Low  66.46903
## 48 48       Low  67.08882
## 49 49       Low  56.56022
## 50 50       Low 113.58766
## 51 51       Low  46.20822
## 52 52       Low  55.03833
## 53 53       Low  35.65834
## 54 54       Low  38.26818
## 55 55       Low  26.17867
## 56 56       Low  74.00803
## 57 57       Low  39.08951
## 58 58      High  14.43213
## 59 59      High  36.54602
## 60 60      High  45.10601
## 61 61      High  48.37031
## 62 62      High  35.95813
## 63 63      High  22.64326
## 64 64      High  28.93933
## 65 65      High  25.18287
## 66 66      High  30.06778
## 67 67      High  41.91877
## 68 68      High  35.22264
## 69 69      High  37.16639
## 70 70      High  10.38941
## 71 71      High  16.18895
## 72 72      High  49.06463
## 73 73      High  15.69114
## 74 74      High  31.60111
## 75 75      High  31.53490
## 76 76      High  22.90306
## 77 77      High  34.74800
## 78 78      High  40.72921
## 79 79      High  37.15754
## 80 80      High  33.36239
## 81 81      High  42.53454
## 82 82      High  38.83720

4. Now write code to analyze the data, probably as an ANOVA or regression analysis, but possibly as a logistic regression or contingency table analysis. Write code to generate a useful graph of the data.

Since I want to compare differences among three different variables, I will be using a one-way ANOVA. The independent variable (treatment) is discrete and the dependent variable (abundance) is continuous.

library(tidyverse)
# Basic ANOVA in R
abundanceAOV <- aov(Abundance~Treatment,data=ANOdata)
print(abundanceAOV)
## Call:
##    aov(formula = Abundance ~ Treatment, data = ANOdata)
## 
## Terms:
##                 Treatment Residuals
## Sum of Squares   20469.29  23117.55
## Deg. of Freedom         2        79
## 
## Residual standard error: 17.10635
## Estimated effects may be unbalanced
print(summary(abundanceAOV))
##             Df Sum Sq Mean Sq F value   Pr(>F)    
## Treatment    2  20469   10235   34.98 1.32e-11 ***
## Residuals   79  23118     293                     
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1
# graph for ANOVA
ANOPlot <- ggplot(data=ANOdata,aes(x=Treatment,y=Abundance,fill=Treatment)) +
  geom_boxplot()+
  theme_bw()
print(ANOPlot)

5. Try running your analysis multiple times to get a feeling for how variable the results are with the same parameters, but different sets of random numbers.

Rerunning the code from questions 3 and 4 provides a different, but similar conclusion every time. There is variability within the dataset and the boxplots in the graph are different size, but there is always a significant difference among treatment.

6. Now begin adjusting the means of the different groups. Given the sample sizes you have chosen, how small can the differences between the groups be (the “effect size”) for you to still detect a significant pattern (p < 0.05)?

# creating data table with new/adjusted means
nGroup <- 3 # number of treatment groups
nName <- c("Control","Low", "High") # names of groups
nSize <- c(30,27,25) # number of observations in each group
nMean <- c(50,45,39) # mean of each group
nSD <- c(16.58,19.05,12.84) # standard deviation of each group


ID <- 1:(sum(nSize)) # id vector for each row
Abundance <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
Treatment <- rep(nName,nSize)
ANOdata_adjmeans <- data.frame(ID,Treatment,Abundance)
str(ANOdata_adjmeans)
## 'data.frame':    82 obs. of  3 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Treatment: chr  "Control" "Control" "Control" "Control" ...
##  $ Abundance: num  10.1 55.9 61.8 54.6 61.7 ...
print(ANOdata_adjmeans)
##    ID Treatment Abundance
## 1   1   Control 10.126942
## 2   2   Control 55.886228
## 3   3   Control 61.793226
## 4   4   Control 54.556022
## 5   5   Control 61.739301
## 6   6   Control 48.730382
## 7   7   Control 56.598829
## 8   8   Control 87.348901
## 9   9   Control 35.283963
## 10 10   Control 64.971768
## 11 11   Control 55.978959
## 12 12   Control 47.357390
## 13 13   Control 66.314794
## 14 14   Control 47.172185
## 15 15   Control 42.283649
## 16 16   Control 52.038942
## 17 17   Control 67.968300
## 18 18   Control 54.674130
## 19 19   Control 44.870380
## 20 20   Control 60.450968
## 21 21   Control 39.078994
## 22 22   Control 51.170394
## 23 23   Control 30.802062
## 24 24   Control 68.756830
## 25 25   Control 60.881884
## 26 26   Control 48.013707
## 27 27   Control 43.910754
## 28 28   Control 20.219878
## 29 29   Control 65.095964
## 30 30   Control 43.000610
## 31 31       Low 52.407370
## 32 32       Low 41.453432
## 33 33       Low 65.579034
## 34 34       Low 54.979053
## 35 35       Low 24.622450
## 36 36       Low 29.027672
## 37 37       Low 23.243961
## 38 38       Low 94.595412
## 39 39       Low 58.887979
## 40 40       Low 58.720452
## 41 41       Low 52.032390
## 42 42       Low 22.254370
## 43 43       Low 52.958933
## 44 44       Low 43.001060
## 45 45       Low 68.092041
## 46 46       Low 33.660315
## 47 47       Low 56.797848
## 48 48       Low 43.361099
## 49 49       Low 52.535683
## 50 50       Low 55.064336
## 51 51       Low 12.164619
## 52 52       Low 35.781797
## 53 53       Low 41.061489
## 54 54       Low  8.534351
## 55 55       Low 40.166816
## 56 56       Low 53.401127
## 57 57       Low 63.454156
## 58 58      High 26.772947
## 59 59      High 58.560665
## 60 60      High 37.026120
## 61 61      High 35.130926
## 62 62      High 28.467705
## 63 63      High 32.719520
## 64 64      High 37.925062
## 65 65      High 32.433904
## 66 66      High 18.575099
## 67 67      High 38.188226
## 68 68      High 58.215165
## 69 69      High 36.494967
## 70 70      High 39.299961
## 71 71      High 46.056919
## 72 72      High 38.306512
## 73 73      High 44.380383
## 74 74      High 40.277770
## 75 75      High 37.946671
## 76 76      High 46.249274
## 77 77      High 37.583939
## 78 78      High 22.777028
## 79 79      High 37.909697
## 80 80      High 59.122862
## 81 81      High 53.207288
## 82 82      High 58.871139
# ANOVA results for adjusted data
abundanceAOV_adjmeans <- aov(Abundance~Treatment,data=ANOdata_adjmeans)
print(abundanceAOV_adjmeans)
## Call:
##    aov(formula = Abundance ~ Treatment, data = ANOdata_adjmeans)
## 
## Terms:
##                 Treatment Residuals
## Sum of Squares    1798.48  18760.03
## Deg. of Freedom         2        79
## 
## Residual standard error: 15.41002
## Estimated effects may be unbalanced
print(summary(abundanceAOV_adjmeans))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Treatment    2   1798   899.2   3.787 0.0269 *
## Residuals   79  18760   237.5                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

The original means were: Control: 68.37, Low: 51.15, High: 30.46 (*** significance).

When I adjusted them to be Control: 50, Low: 45, High: 40, the results were not significant. I then adjusted them again at Control: 50, Low: 45, High: 39 and the results showed significance (*significance). Compared to the original means, the new means are much closer together while still being able to detect significance.

7. Alternatively, for the effect sizes you originally hypothesized, what is the minimum sample size you would need in order to detect a statistically significant effect? Again, run the model a few times with the same parameter set to get a feeling for the effect of random variation in the data.

# creating new dataset with smaller sample size 
nGroup <- 3 # number of treatment groups
nName <- c("Control","Low", "High") # names of groups
nSize <- c(5,5,5) # number of observations in each group
nMean <- c(68.37,51.15,30.46) # mean of each group
nSD <- c(16.58,19.05,12.84) # standard deviation of each group


ID <- 1:(sum(nSize)) # id vector for each row
Abundance <- c(rnorm(n=nSize[1],mean=nMean[1],sd=nSD[1]),
            rnorm(n=nSize[2],mean=nMean[2],sd=nSD[2]),
            rnorm(n=nSize[3],mean=nMean[3],sd=nSD[3]))
Treatment <- rep(nName,nSize)
ANOdata_adjss <- data.frame(ID,Treatment,Abundance)
str(ANOdata_adjss)
## 'data.frame':    15 obs. of  3 variables:
##  $ ID       : int  1 2 3 4 5 6 7 8 9 10 ...
##  $ Treatment: chr  "Control" "Control" "Control" "Control" ...
##  $ Abundance: num  62.1 56.3 79.2 64.3 52.7 ...
print(ANOdata_adjss)
##    ID Treatment Abundance
## 1   1   Control  62.06732
## 2   2   Control  56.32211
## 3   3   Control  79.24600
## 4   4   Control  64.26977
## 5   5   Control  52.68688
## 6   6       Low  55.53522
## 7   7       Low  78.41318
## 8   8       Low  34.41172
## 9   9       Low  65.51155
## 10 10       Low  52.82815
## 11 11      High  31.91117
## 12 12      High  32.27341
## 13 13      High  26.81920
## 14 14      High  39.74430
## 15 15      High  53.04779
# ANOVA results for adjusted sample size data
abundanceAOV_adjss <- aov(Abundance~Treatment,data=ANOdata_adjss)
print(abundanceAOV_adjss)
## Call:
##    aov(formula = Abundance ~ Treatment, data = ANOdata_adjss)
## 
## Terms:
##                 Treatment Residuals
## Sum of Squares   1898.323  1894.167
## Deg. of Freedom         2        12
## 
## Residual standard error: 12.56373
## Estimated effects may be unbalanced
print(summary(abundanceAOV_adjss))
##             Df Sum Sq Mean Sq F value Pr(>F)  
## Treatment    2   1898   949.2   6.013 0.0155 *
## Residuals   12   1894   157.8                 
## ---
## Signif. codes:  0 '***' 0.001 '**' 0.01 '*' 0.05 '.' 0.1 ' ' 1

First, I made all treatments have equal sample size at n=25 (sig diff detected).

All sample size n=20, still sig diff.

All sample size n=15, still sig diff.

All sample size n=10, still sig diff.

All sample size n=5, still sig diff. this is the smallest sample size to produce a statistical difference

All sample size n=3, no sig diff.

All sample size n=4, no sig diff.