Homework

Create and use random data

library(ggplot2) # for graphics
library(MASS) # for maximum likelihood estimation
library(tidyverse)

# create a truncated normal distribution to work on the solution set

z <- rnorm(n=3000,mean=0.2)
z <- data.frame(1:3000,z)
names(z) <- list("ID","myVar")
z <- z[z$myVar>0,] # change z to only include the rows where myVar>0 and grab all the columns
str(z)

## 'data.frame':    1746 obs. of  2 variables:
##  $ ID   : int  1 2 4 5 7 9 11 12 13 15 ...
##  $ myVar: num  2.581 1.237 0.517 0.718 0.556 ...

summary(z$myVar)

##     Min.  1st Qu.   Median     Mean  3rd Qu.     Max. 
## 0.000427 0.380944 0.771902 0.882828 1.255593 3.553240

Plot the random data

We are rescaling the y axis of the histogram from counts to density, so that the area under the histogram equals 1.0.

p1 <- ggplot(data=z, aes(x=myVar, y=after_stat(density))) + geom_histogram(color="grey60",fill="cornsilk",linewidth=0.2)
print(p1)

Add empirical density curve

Now modify the code to add in a kernel density plot of the data. This is an empirical curve that is fitted to the data. It does not assume any particular probability distribution, but it smooths out the shape of the histogram:

p1 <-  p1 +  geom_density(linetype="dotted",linewidth=0.75)
print(p1)

Get maximum likelihood parameters for normal

Next, fit a normal distribution to your data and grab the maximum likelihood estimators of the two parameters of the normal, the mean and the variance:

normPars <- fitdistr(z$myVar,"normal")
print(normPars)

##       mean          sd    
##   0.88282838   0.63473877 
##  (0.01519053) (0.01074133)

str(normPars)

## List of 5
##  $ estimate: Named num [1:2] 0.883 0.635
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.0152 0.0107
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 0.000231 0 0 0.000115
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 1746
##  $ loglik  : num -1684
##  - attr(*, "class")= chr "fitdistr"

normPars$estimate["mean"] # note structure of getting a named attribute

##      mean 
## 0.8828284

Plot normal probability density

Now let’s call the dnorm function inside ggplot’s stat_function to generate the probability density for the normal distribution. Read about stat_function in the help system to see how you can use this to add a smooth function to any ggplot. Note that we first get the maximum likelihood parameters for a normal distribution fitted to thse data by calling fitdistr. Then we pass those parameters (meanML and sdML to stat_function):

meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(z$myVar),len=length(z$myVar))

stat <- stat_function(aes(x = xval, y = after_stat(y)), fun = dnorm, colour="red", n = length(z$myVar), args = list(mean = meanML, sd = sdML))
p1 + stat

Notice that the best-fitting normal distribution (red curve) for these data actually has a biased mean. That is because the data set has no negative values, so the normal distribution (which is symmetric) is not working well.

Plot exponential probability density

Now let’s use the same template and add in the curve for the exponential:

expoPars <- fitdistr(z$myVar,"exponential")
rateML <- expoPars$estimate["rate"]
 
stat2 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dexp, colour="blue", n = length(z$myVar), args = list(rate=rateML))
p1 + stat + stat2

Plot uniform probability density

For the uniform, we don’t need to use fitdistr because the maximum likelihood estimators of the two parameters are just the minimum and the maximum of the data:

stat3 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dunif, colour="darkgreen", n = length(z$myVar), args = list(min=min(z$myVar), max=max(z$myVar)))
p1 + stat + stat2 + stat3

Plot gamma probability density

gammaPars <- fitdistr(z$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]
 
stat4 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dgamma, colour="brown", n = length(z$myVar), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4

Plot beta probability density

This one has to be shown in its own plot because the raw data must be rescaled so they are between 0 and 1, and then they can be compared to the beta.

pSpecial <- ggplot(data=z, aes(x=myVar/(max(myVar + 0.1)), y=after_stat(density))) +
  geom_histogram(color="grey60",fill="cornsilk",size=0.2) + 
  xlim(c(0,1)) +
  geom_density(size=0.75,linetype="dotted")

betaPars <- fitdistr(x=z$myVar/max(z$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

statSpecial <- stat_function(aes(x = xval, y = after_stat(y)), fun = dbeta, colour="orchid", n = length(z$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
pSpecial + statSpecial

Now redo this process using real data

I used the Bombus Bumblebee Seasonal Variation data where samples were collected to assay viruses (DWV and BQCV) and Nosema from around 300 bumblebees of various species from 5 sites at 4 different time points

bee <- read.csv("BeeData.csv",sep=",",comment.char='#')
View(bee) # view in new tab - make sure to use capital V
bee <- na.omit(bee) # omit the rows that contain an NA
str(bee)

## 'data.frame':    865 obs. of  11 variables:
##  $ id             : int  1 3 4 5 6 8 10 11 12 13 ...
##  $ target_name    : chr  "BQCV" "BQCV" "BQCV" "BQCV" ...
##  $ pathogen_binary: int  1 1 0 0 1 1 1 1 0 1 ...
##  $ sampling_date  : chr  "9/9/2020" "9/10/2020" "9/10/2020" "9/9/2020" ...
##  $ site_code      : chr  "BOST" "MUDGE" "MUDGE" "BOST" ...
##  $ field_id       : int  6 18 11 11 14 4 13 38 5 12 ...
##  $ bombus_spp     : chr  "impatiens" "impatiens" "impatiens" "impatiens" ...
##  $ host_plant     : chr  "solidago" "crown vetch" "crown vetch" "solidago" ...
##  $ bee_caste      : chr  "worker" "worker" "worker" "male" ...
##  $ sampling_event : int  4 4 4 4 4 4 4 2 4 4 ...
##  $ pathogen_load  : num  2414169 760793 0 0 124237 ...
##  - attr(*, "na.action")= 'omit' Named int [1:41] 22 43 57 77 83 172 207 233 256 289 ...
##   ..- attr(*, "names")= chr [1:41] "22" "43" "57" "77" ...

# make a column called myVar and copy over the pathogen_load column
bee$myVar <- bee$pathogen_load

# filter out rows where 'pathogen_load' is 0, and log() the myVar column
bee <- bee%>%filter(pathogen_load>0)%>%
  mutate(myVar=log(pathogen_load))

# plot this data
p1 <- ggplot(data=bee, aes(x=myVar, y=after_stat(density))) + geom_histogram(color="grey60",fill="cornsilk",linewidth=0.2)
print(p1)

# add empirical density curve
p1 <-  p1 +  geom_density(linetype="dotted",linewidth=0.75)
print(p1)

# get maximum likelihood parameters for normal
normPars <- fitdistr(bee$myVar,"normal")
print(normPars)

##       mean          sd    
##   13.3863402    2.4396499 
##  ( 0.1153915) ( 0.0815941)

str(normPars)

## List of 5
##  $ estimate: Named num [1:2] 13.39 2.44
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ sd      : Named num [1:2] 0.1154 0.0816
##   ..- attr(*, "names")= chr [1:2] "mean" "sd"
##  $ vcov    : num [1:2, 1:2] 0.01332 0 0 0.00666
##   ..- attr(*, "dimnames")=List of 2
##   .. ..$ : chr [1:2] "mean" "sd"
##   .. ..$ : chr [1:2] "mean" "sd"
##  $ n       : int 447
##  $ loglik  : num -1033
##  - attr(*, "class")= chr "fitdistr"

normPars$estimate["mean"] # note structure of getting a named attribute

##     mean 
## 13.38634

# plot normal probability density
meanML <- normPars$estimate["mean"]
sdML <- normPars$estimate["sd"]

xval <- seq(0,max(bee$myVar),len=length(bee$myVar))

stat <- stat_function(aes(x = xval, y = after_stat(y)), fun = dnorm, colour="red", n = length(bee$myVar), args = list(mean = meanML, sd = sdML))
p1 + stat

# plot exponential probability density
expoPars <- fitdistr(bee$myVar,"exponential")
rateML <- expoPars$estimate["rate"]
 
stat2 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dexp, colour="blue", n = length(bee$myVar), args = list(rate=rateML))
p1 + stat + stat2

# plot uniform probability density
stat3 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dunif, colour="darkgreen", n = length(bee$myVar), args = list(min=min(bee$myVar), max=max(bee$myVar)))
p1 + stat + stat2 + stat3

# plot gamma probability density
gammaPars <- fitdistr(bee$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]
 
stat4 <- stat_function(aes(x = xval, y = after_stat(y)), fun = dgamma, colour="brown", n = length(bee$myVar), args = list(shape=shapeML, rate=rateML))
p1 + stat + stat2 + stat3 + stat4

Both the normal and gamma distributions fit this dataset pretty well. The gamma looks like it is a slightly better fit.

# plot beta probability density
pSpecial <- ggplot(data=bee, aes(x=myVar/(max(myVar + 0.1)), y=after_stat(density))) +
  geom_histogram(color="grey60",fill="cornsilk",linewidth=0.2) + 
  xlim(c(0,1)) +
  geom_density(size=0.75,linetype="dotted")

betaPars <- fitdistr(x=bee$myVar/max(bee$myVar + 0.1),start=list(shape1=1,shape2=2),"beta")
shape1ML <- betaPars$estimate["shape1"]
shape2ML <- betaPars$estimate["shape2"]

statSpecial <- stat_function(aes(x = xval, y = after_stat(y)), fun = dbeta, colour="orchid", n = length(bee$myVar), args = list(shape1=shape1ML,shape2=shape2ML))
pSpecial + statSpecial

Simulating a new dataset

# simulate a new dataset using the gamma distribution and parameters from fitdist
gammaPars <- fitdistr(bee$myVar,"gamma")
shapeML <- gammaPars$estimate["shape"]
rateML <- gammaPars$estimate["rate"]

mysimdata <- rgamma(n=length(bee$myVar),shape=shapeML,rate=rateML)
mysimdataframe <- data.frame(mysimdata)

# plot this new simulated data along with the gamma probability density curve from the original data
p2 <- ggplot(data=mysimdataframe, aes(x=mysimdata, y=after_stat(density)))  + geom_histogram(color="grey60",fill="cornsilk",linewidth=0.2) 
p2+stat4

# plot the original data and the original gamma probability density curve
p1 <- ggplot(data=bee, aes(x=myVar, y=after_stat(density))) + geom_histogram(color="grey60",fill="cornsilk",linewidth=0.2)
p1+stat4

The model does a pretty good job at simulating realistic data that match my original measurements because the gamma distribution is likely a reasonable fit for my original data and we also used parameters that were estimated from the original dataset. The histogram for my simulated data looks more normally distributed and is not as right skewed as my original dataset. But it still does a decent job at simulating the data.

Homework_6

Kyle Grasso

2024-02-21

Create and use random data

Plot the random data

We are rescaling the y axis of the histogram from counts to density, so that the area under the histogram equals 1.0.

Add empirical density curve

Now modify the code to add in a kernel density plot of the data. This is an empirical curve that is fitted to the data. It does not assume any particular probability distribution, but it smooths out the shape of the histogram:

Get maximum likelihood parameters for normal

Next, fit a normal distribution to your data and grab the maximum likelihood estimators of the two parameters of the normal, the mean and the variance:

Plot normal probability density

Notice that the best-fitting normal distribution (red curve) for these data actually has a biased mean. That is because the data set has no negative values, so the normal distribution (which is symmetric) is not working well.

Plot exponential probability density

Now let’s use the same template and add in the curve for the exponential:

Plot uniform probability density

For the uniform, we don’t need to use fitdistr because the maximum likelihood estimators of the two parameters are just the minimum and the maximum of the data:

Plot gamma probability density

Plot beta probability density

This one has to be shown in its own plot because the raw data must be rescaled so they are between 0 and 1, and then they can be compared to the beta.

Now redo this process using real data

I used the Bombus Bumblebee Seasonal Variation data where samples were collected to assay viruses (DWV and BQCV) and Nosema from around 300 bumblebees of various species from 5 sites at 4 different time points

Both the normal and gamma distributions fit this dataset pretty well. The gamma looks like it is a slightly better fit.

Simulating a new dataset