1 Overview

This page provides a demonstration on using R and shuffling our data to analyze your soil science hypotheses to generate figures and results for your group lab report. You are welcome to either follow along with the text and copy-paste the code into the console or read through and then copy the whole script into a new R script.

Please start by logging into sso.rstudio.cloud/pomona. Once you’ve logged in, you should see the option to open the EA 30.1 - Fall 2022 workspace. You can also access the EA 30.1 - Fall 2022 workspace at this link.

At the workspace, you should then see the “Assignment” SoilScienceLab. Please click on that Assignment. Here is a direct link to the SoilScienceLab project, but it may not work if this is the first time in a while that you’re logging into RStudio.Cloud. If it doesn’t work, no worries - just log in through the sso.rstudio.cloud/pomona link.

This table below shows the hypotheses for each group:

Group Hypotheses Variables Statistic
1 Plants that have higher C storage will also have higher N storage. Cperc or soilC, Nperc or soilN Regression model (lm) or correlation coefficients
2 Levels of biomass concentration are greater in California sage scrub soil samples compared to chaparral plant soil samples. PlantComm or CSS, microbes Difference in means
3 The sites affected by the 2017 wildfire will have higher N values. Burn2017, Nperc or soilN Difference in means
4 Chaparral plants are woody, and wood presence in the soil depletes nitrogen levels. Hence, we hypothesize that chaparal plants have lower nitrogen levels than California sage scrub plant soil PlantComm or CSS, Nperc or soilN Difference in means
5 Soil clods collected near chaparral species have higher organic matter concentrations as compared to sagescrub species PlantComm or CSS, SOM Difference in means
6 Soil samples from California sage scrub species (artemisia californica, salvia apiana) contain higher levels of carbon than chapparal species (malosma laurina, sambucus nigra) PlantComm or CSS, Cperc or soilC Difference in means

2 Preparing data

2.1 1: Load packages

Below, we will start by loading packages that we’ll need.

### Load packages
library("ggplot2") # plotting functions
library("dplyr") # data wrangling functions
library("readr") # reading in tables, including ones online
library("mosaic") # shuffle (permute) our data

2.2 2: Read data

Next, we will pull in our data and start creating some useful variables, such as SoilDensity, the density of the soil (units: g/mL), or soilC, the quantity of carbon in the soil (units: kg C for some volume in units of \(m^3\)) given the soil’s density.

### Load in dataset
soilDF <- readr::read_tsv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRkRXSzJeNWsU4OJ55XfVQtZH3nWkuUjAmINt8go1UyaEgf_I4524bBV-Os1tw59A2nfN6_iPSzV323/pub?output=tsv") # read in spreadsheet from its URL and store in soilDF

### Preparing data
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(CSS = as.factor(CSS)) # convert 0 and 1 for CSS (coastal sage scrub) from numeric to categorical ("factor")

soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(PlantComm = ifelse(CSS=="1","CSS","CHAP")) # use mutate to create a new column that has two types of plant communities: CSS and CHAP(arral)

### Calculate useful variables
soilDF <- soilDF %>% # take soilDF object and feed it into --> 
  mutate(SoilDensity = ( ( (ClodWeight1dip - (ClodWeight2dip-ClodWeight1dip)) * (1 - DryFrac)) - RockWeightG)/ClodVolumeInmL ) # create SoilDensity variable from columns in the data following Caspi et al. 2019

Note that our units for SoilDensity are g/mL. But we want soilC in units of kilograms of carbon per 0.1\(m^3\) of soil. To calculate soilC in units of kg C per \(m^2\) at a depth of 10cm (or 0.1m), we need to do the following calculations:

  1. Note that 10cm = 0.1m
  2. The total soil volume is therefore 1m * 1m * 0.1m = 0.1 \(m^3\)
  3. Note that 1 \(m^3\) equals \(10^6\) mL
  • Pro tip! Are you ever uncertain about unit conversions? You can use Wolfram Alpha to double check for you.
  1. So 0.1 \(m^3\) of soil equals \(10^6 * 0.1 = 10^5\).
  2. Keep in mind that 1000 (or \(10^3\)) grams = 1 kg

We’ll put all of these pieces together below.

soilDF <- soilDF %>%
  mutate(soilC = SoilDensity * 10^5 * Cperc/100 * 1/10^3) # create SoilDensity variable, which has these units: kg C/m2 at a depth of 10cm

### Display data
# View(soilDF) # open a spreadsheet viewer
  # delete the leading comment/hashtag sign to run the code above

2.3 3: Exploratory data analysis

Below, I provide fully-worked examples of different ways of inspecting your data and performing analyses assuming that SoilDensity is the variable of interest. In applying this code to your own data, you will want to think about what variable name should replace SoilDensity in the commands below. For example, you would change tapply(soilDF$SoilDensity, soilDF$PlantComm, summary) to tapply(soilDF$ResponseVariable,soilDF$PlantComm, summary) where ResponseVariable is the variable you’re analyzing.

Let’s start with exploratory data analysis where you calculate summary statistics and visualize your variable.

### Calculate summary statistics for SoilDensity
### for each PlantComm(unity)
tapply(soilDF$SoilDensity, soilDF$PlantComm, summary) 
## $CHAP
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5424  0.9218  1.1074  1.0483  1.1491  1.4578 
## 
## $CSS
##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max. 
##  0.5146  1.0101  1.1549  1.0749  1.2424  1.4039
  # use PlantComm as the grouping variable
  # for SoilDensity and then summarize SoilDensity
  # for each PlantComm(unity)

How would I visualize the distribution of values of SoilDensity? I could use a histogram and color the values differently for Coastal Sage Scrub vs. chaparral.

### Visualize your variable of interest
  # Histogram
gf_histogram(~SoilDensity,              # plot distribution of SoilDensity
             fill=~CSS,                 # change colors based on CSS (or not)
             data=soilDF,               # the data come from the soilDF object
             xlab="Soil density (g/mL)",# change the x-axis label
             ylab="Count")              # change the y-axis label

Alternatively, I could create a boxplot to visualize the distribution of values in SoilDensity with two boxplots for each plant community.

### Visualize your variable of interest
  # Boxplot
gf_boxplot(SoilDensity~CSS,       # plot distribution of SoilDensity for CSS
       fill= ~CSS,                # change colors based on CSS (or not)
       data=soilDF,               # the data come from the soilDF object
       xlab="Woody vegetation",   # change x-axis label
       ylab="Soil density (g/mL)")# change y-axis label

Finally, if I am interested in the relationship between two or more variables, I can use a scatterplot to visualize each pair of data.

### Visualization
  # Scatterplot
gf_point(SoilDensity~Species,        # plot SoilDensity against plant Species
         color= ~CSS,                # change colors based on CSS (or not)
         data= soilDF,               # the data come from the soilDF object
         xlab="Species",             # change x-axis label
         ylab="Soil density (g/mL)") # change y-axis label

3 Analysis

The code below shows you how to do different types of analyses.

3.1 1: Calculating differences in means

Let’s say I am interested in determining if Coastal Sage Scrub (CSS) plants tend to have higher soil density than Chaparral (CHAP) plants. This sounds like I want to see if there is a clear difference in the mean values of SoilDensity for the CSS plants versus the CHAP plants. We can start by calculating the mean difference we observe.

mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T) # show mean SoilDensity values for the plant communities and remove missing data
##     CHAP      CSS 
## 1.048335 1.074866
obs_diff <- diff( mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff
##        CSS 
## 0.02653111

OK, so the mean difference in mean soil density between CSS and CHAP is 0.03. Are the plant communities meaningfully different in terms of soil density though? Our null hypothesis would be that there is no difference in soil density between the plant communities.

Logically, if there is a meaningful difference, then if we shuffle our data around, that should lead to different results than what we see. That is one way to simulate statistics to test the null hypothesis. And specifically, in this case, we would expect to see our observed difference is much larger than most of the simulated values.

What does simulating our data look like?

Well, here is what the data look like initially:

soilDF[,c(1:4,18:20)]
## # A tibble: 24 x 7
##    SampleID Species Group CSS   PlantComm SoilDensity soilC
##    <chr>    <chr>   <dbl> <fct> <chr>           <dbl> <dbl>
##  1 Artcal1  Artcal      1 1     CSS             1.40  2.32 
##  2 Salapi1  Salapi      1 1     CSS             1.34  2.28 
##  3 Mallau1  Mallau      1 0     CHAP            1.36  3.34 
##  4 Samnig1  Samnig      1 0     CHAP            1.14  1.07 
##  5 Artcal2  Artcal      2 1     CSS             1.13  8.68 
##  6 Salapi2  Salapi      2 1     CSS             1.06  1.87 
##  7 Mallau2  Mallau      2 0     CHAP            0.899 3.90 
##  8 Samnig2  Samnig      2 0     CHAP            1.09  2.07 
##  9 Artcal3  Artcal      3 1     CSS             1.22  2.76 
## 10 Salapi3  Salapi      3 1     CSS             0.515 0.463
## # … with 14 more rows

Here is what it looks like after we’ve “shuffled” it once by randomizing the PlantComm assignments. This is like taking a deck of cards with different SoilDensity or SoilC(arbon) values, and being able to change their color and suite (in this case, plant community associated with that card).

resample(soilDF[,c(1:4,18:20)],groups=PlantComm,shuffled=c("SoilDensity","soilC"))
## # A tibble: 24 x 8
##    SampleID Species Group CSS   PlantComm SoilDensity  soilC orig.id 
##    <chr>    <chr>   <dbl> <fct> <chr>           <dbl>  <dbl> <chr>   
##  1 Mallau2  Mallau      2 0     CHAP            0.899  3.90  7.7.7   
##  2 Mallau4  Mallau      4 0     CHAP            1.09   3.34  15.8.3  
##  3 Mallau6  Mallau      6 0     CHAP            0.542 10.4   23.23.12
##  4 Samnig1  Samnig      1 0     CHAP            1.46   1.07  4.19.4  
##  5 Mallau5  Mallau      5 0     CHAP            1.36   3.90  19.3.7  
##  6 Mallau1  Mallau      1 0     CHAP            1.12   2.07  3.12.8  
##  7 Mallau6  Mallau      6 0     CHAP            1.46   0.315 23.19.23
##  8 Mallau2  Mallau      2 0     CHAP            0.899  2.07  7.7.8   
##  9 Samnig3  Samnig      3 0     CHAP            1.13   0.315 12.15.23
## 10 Mallau5  Mallau      5 0     CHAP            1.14   3.09  19.4.19 
## # … with 14 more rows

We can repeat that procedure again and see how the data shifts when we do another shuffle.

resample(soilDF[,c(1:4,18:20)],groups=PlantComm,shuffled=c("SoilDensity","soilC"))
## # A tibble: 24 x 8
##    SampleID Species Group CSS   PlantComm SoilDensity  soilC orig.id 
##    <chr>    <chr>   <dbl> <fct> <chr>           <dbl>  <dbl> <chr>   
##  1 Samnig4  Samnig      4 0     CHAP            1.07   2.07  16.16.8 
##  2 Samnig1  Samnig      1 0     CHAP            0.929  1.07  4.20.4  
##  3 Mallau1  Mallau      1 0     CHAP            1.12   2.07  3.12.8  
##  4 Samnig3  Samnig      3 0     CHAP            1.36   2.57  12.3.16 
##  5 Mallau2  Mallau      2 0     CHAP            1.36   3.90  7.3.7   
##  6 Samnig2  Samnig      2 0     CHAP            0.899  1.62  8.7.20  
##  7 Samnig2  Samnig      2 0     CHAP            0.542  3.34  8.23.3  
##  8 Samnig5  Samnig      5 0     CHAP            1.07   0.315 20.16.23
##  9 Mallau1  Mallau      1 0     CHAP            1.09   3.90  3.8.7   
## 10 Mallau2  Mallau      2 0     CHAP            1.14  10.4   7.4.12  
## # … with 14 more rows

Let’s shuffle the data and see what that means for the distribution of mean SoilDensity differences between CSS and CHAP.

### Create random differences by shuffling the data
randomizing_diffs <- do(1000) * diff( mean( SoilDensity ~ shuffle(PlantComm),na.rm=T, data = soilDF) ) # calculate the mean in SoilDensity when we're shuffling the plant community around 1000 times
  # Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable

# View first few rows of data
head(randomizing_diffs)
##     DiffMeans
## 1 -0.01179862
## 2 -0.02245818
## 3 -0.04216567
## 4  0.03161914
## 5 -0.02217865
## 6  0.06866587

Now we can visualize the distribution of simulated differences in mean soil density and our observed difference between CSS and CHAP.

gf_histogram(~ DiffMeans, fill = ~ (DiffMeans >= obs_diff),
             data = randomizing_diffs,
             xlab="Difference in mean soil density under null",
             ylab="Count")

In the end, how many of the simulated mean differences were more extreme than the value we observed? Based on this value, if we were using the conventional \(p\)-value (probability value) of 0.05, we would conclude that because this simulated \(p\)-value > 0.05, that we cannot reject the null hypothesis.

# What proportion of simulated values were larger than our observed difference
prop( ~ DiffMeans >= obs_diff, data = randomizing_diffs) # ~0.03 was the observed difference value - see obs_diff
## prop_TRUE 
##     0.399

3.2 2: Calculating correlation coefficients

How would I determine if there is a non-zero correlation between two variables or that two variables are positively correlated? I can again start by calculating the observed correlation coefficient for the data.

### Calculate observed correlation
obs_cor <- cor(BacterialPerc ~ SoilDensity, data=soilDF, use="complete.obs") # store observed correlation in obs_cor of BacterialPerc vs. SoilDensity
obs_cor # print out value
## [1] -0.1320645

Let’s say that I’m interested in determining if the percentage of bacteria is negatively correlated with soil density. Our null hypothesis could be that the correlation coefficient is actually 0. As before though, how do I know that my correlation coefficient of 0 is significantly different from 0? We can tackle this question by simulating a ton of correlation coefficient values from our data by shuffling it!

In this case, the shuffling here lets us estimate the variation in the correlation coefficient given our data. So we are curious now if the distribution of simulated values does or does not include 0 (that is, is it clearly \(< 0\) or \(> 0\)?).

### Calculate correlation coefs for shuffled data
randomizing_cor <- do(1000) * cor(BacterialPerc ~ SoilDensity, 
                                 data = resample(soilDF), 
                                 use="complete.obs") 
# Shuffle the data 1000 times
# Calculate the correlation coefficient each time
# By correlating BacterialPerc to SoilDensity from the
# data object soilDF

What are the distribution of correlation coefficients that we see when we shuffle our data?

quantiles_cor <- qdata( ~ cor, p = c(0.025, 0.975), data=randomizing_cor) # calculate the 2.5% and 97.5% quantiles in our simulated correlation coefficient data (note that 97.5-2.5 = 95!)
quantiles_cor # display the quantiles
##       2.5%      97.5% 
## -0.5288802  0.2217759

The values above give us a 95% confidence interval estimate for our correlation coefficient!

Do we clearly see that our correlation coefficient distribution does or doesn’t include 0?

gf_histogram(~ cor,
             data = randomizing_cor,
             xlab="Simulated correlation coefficients",
             ylab="Count")

In this case, our simulated correlation coefficient includes 0 in its 95% simulated confidence interval. We can also see this in the plot. So given these data, we cannot reject the null hypothesis. There is not sufficiently strong data that percent bacteria associates with soil density in any clear way.

3.3 3: Performing a regression model

The last type of analysis we’ll look at is a regression model. Specifically, we’ll do an analysis with two predictor variables. We are evaluating how soil density covaries with percent bacteria and soil organic matter (SOM).

We can see how our response variable co-varies with each predictor variable by using scatterplots. Here we can also add a line for each pair of variables.

gf_point(SoilDensity ~ BacterialPerc,    # plot soil density against bacteria
         data=soilDF,                    # from the soilDF data
         xlab="Bacteria %",              # change x-axis label
         ylab="Soil density (g/mL)") %>% # change y-axis label
  gf_lm()                                # add a regression line for these two variables only

Now let’s run a linear regression for multiple predictor variables (multiple regression). Here, we’ll use BacterialPerc and SOM as independent variables that predict SoilDensity, the response (or dependent) variable.

lm_mod <- lm(SoilDensity ~ BacterialPerc + SOM, data=soilDF) # perform linear regression of soil density against bacterial percentage and soil organic matter and store in an object called lm_mod
lm_mod # print object
## 
## Call:
## lm(formula = SoilDensity ~ BacterialPerc + SOM, data = soilDF)
## 
## Coefficients:
##   (Intercept)  BacterialPerc            SOM  
##      1.557138      -0.004718      -4.067414

We can also shuffle our data to estimate the variation in the coefficient values. Our null is that the coefficient estimates are equal to 0. That is, there is no relationship between soil density and bacterial percentage or soil organic matter. If the simulated coefficient estimates have an interval that doesn’t cross 0 (go from negative to positive), then that is evidence against the null hypothesis.

In the table below, we’d focus on rows 2-3 in particular.

lm_boot <- do(1000) * lm(SoilDensity ~ BacterialPerc + SOM, data=resample(soilDF)) # shuffle our data 1000 times and re-run the linear model each time and store the outputs in lm_boot
confint(lm_boot) # generate a 95% confidence interval from the simulated values for the linear regression coefficients
##            name        lower        upper level     method     estimate
## 1     Intercept  0.636625147  2.335351358  0.95 percentile  1.557138470
## 2 BacterialPerc -0.015790722  0.005553617  0.95 percentile -0.004718324
## 3           SOM -7.160797035  3.037437189  0.95 percentile -4.067413529
## 4         sigma  0.128758625  0.315910727  0.95 percentile  0.257846423
## 5     r.squared  0.006230317  0.651408223  0.95 percentile  0.159460192
## 6             F  0.065828566 19.621230945  0.95 percentile  1.991972295

Based on the values above, as soil organic matter (SOM) increases by one unit, soil density decreases by as much as 7.1 g/mL or increases as much as 5.3 g/mL. Because the confidence interval for this coefficient crosses 0, we would conclude that there is not a clear directional association between SOM and soil density. Similarly, it looks like bacterial percentage does not have a clear directional effect on soil density.

4 Code in entirety

The segment below can be directly copied and pasted into the code editor in RStudio Cloud.

###=========================================
### Load packages
###=========================================

library("ggplot2") # plotting functions
library("dplyr") # data wrangling functions
library("readr") # reading in tables, including ones online
library("mosaic") # shuffle (permute) our data

###=========================================
### Load and process data
###=========================================

### Load in dataset
soilDF <- readr::read_tsv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRkRXSzJeNWsU4OJ55XfVQtZH3nWkuUjAmINt8go1UyaEgf_I4524bBV-Os1tw59A2nfN6_iPSzV323/pub?output=tsv") # read in spreadsheet from its URL and store in soilDF

### Preparing data
soilDF$CSS <- as.factor(soilDF$CSS) # convert 0 and 1 for CSS (coastal sage scrub) from numeric to categorical ("factor")

soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(PlantComm = ifelse(CSS=="1","CSS","CHAP")) # use mutate to create a new column that has two types of plant communities: CSS and CHAP(arral)

### Calculate useful variables: SoilDensity
soilDF <- soilDF %>% # take soilDF object and feed it into --> 
  mutate(SoilDensity = ( ( (ClodWeight1dip - (ClodWeight2dip-ClodWeight1dip)) * (1 - DryFrac)) - RockWeightG)/ClodVolumeInmL )# create SoilDensity variable from columns in the data following Caspi et al. 2019

### Calculate soilC for soil carbon
soilDF <- soilDF %>%
  mutate(soilC = SoilDensity * 10^5 * Cperc/100 * 1/10^3) # create SoilDensity variable, which has these units: kg C/m2 at a depth of 10cm

### Display data
View(soilDF) # open a spreadsheet viewer

###=========================================
### Exploratory data analysis
###=========================================

### Calculate summary statistics for SoilDensity
### for each PlantComm(unity)
tapply(soilDF$SoilDensity, soilDF$PlantComm, summary) 
  # use PlantComm as the grouping variable
  # for SoilDensity and then summarize SoilDensity
  # for each PlantComm(unity)

### Visualize your variable of interest
  # Histogram
gf_histogram(~SoilDensity,              # plot distribution of SoilDensity
             fill=~CSS,                 # change colors based on CSS (or not)
             data=soilDF,               # the data come from the soilDF object
             xlab="Soil density (g/mL)",# change the x-axis label
             ylab="Count")              # change the y-axis label

### Visualize your variable of interest
  # Boxplot
gf_boxplot(SoilDensity~CSS,       # plot distribution of SoilDensity for CSS
       fill= ~CSS,                # change colors based on CSS (or not)
       data=soilDF,               # the data come from the soilDF object
       xlab="Woody vegetation",   # change x-axis label
       ylab="Soil density (g/mL)")# change y-axis label

### Visualization
  # Scatterplot
gf_point(SoilDensity~Species,        # plot SoilDensity against plant Species
         color= ~CSS,                # change colors based on CSS (or not)
         data= soilDF,               # the data come from the soilDF object
         xlab="Species",             # change x-axis label
         ylab="Soil density (g/mL)") # change y-axis label

###=========================================
### Analysis - Differences in means
###=========================================

mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T) # show mean SoilDensity values for the plant communities and remove missing data
obs_diff <- diff( mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff

### Create random differences by shuffling the data
randomizing_diffs <- do(1000) * diff( mean( SoilDensity ~ shuffle(PlantComm),na.rm=T, data = soilDF) ) # calculate the mean in SoilDensity when we're shuffling the plant community around 1000 times
  # Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable

# View first few rows of data
head(randomizing_diffs)

gf_histogram(~ DiffMeans, fill = ~ (DiffMeans >= obs_diff),
             data = randomizing_diffs,
             xlab="Difference in mean soil density under null",
             ylab="Count")

###=========================================
### Analysis - Correlation coefficient
###=========================================

### Calculate observed correlation
obs_cor <- cor(BacterialPerc ~ SoilDensity, data=soilDF, use="complete.obs") # store observed correlation in obs_cor of BacterialPerc vs. SoilDensity
obs_cor # print out value

### Calculate correlation coefs for shuffled data
randomizing_cor <- do(1000) * cor(BacterialPerc ~ SoilDensity, 
                                 data = resample(soilDF), 
                                 use="complete.obs") 
# Shuffle the data 1000 times
# Calculate the correlation coefficient each time
# By correlating BacterialPerc to SoilDensity from the
# data object soilDF

quantiles_cor <- qdata( ~ cor, p = c(0.025, 0.975), data=randomizing_cor) # calculate the 2.5% and 97.5% quantiles in our simulated correlation coefficient data (note that 97.5-2.5 = 95!)
quantiles_cor # display the quantiles

gf_histogram(~ cor, fill = ~ (cor > 0),
             data = randomizing_cor,
             xlab="Simulated correlation coefficients",
             ylab="Count")

###=========================================
### Analysis - Linear regression
###=========================================
### Initial visualization
gf_point(SoilDensity ~ BacterialPerc,    # plot soil density against bacteria
         data=soilDF,                    # from the soilDF data
         xlab="Bacteria %",              # change x-axis label
         ylab="Soil density (g/mL)") %>% # change y-axis label
  gf_lm()                                # add a regression line for these two variables only

lm_mod <- lm(SoilDensity ~ BacterialPerc + SOM, data=soilDF) # perform linear regression of soil density against bacterial percentage and resin weight and store in an object called lm_mod
lm_mod # print object

lm_boot <- do(1000) * lm(SoilDensity ~ BacterialPerc + SOM, data=resample(soilDF)) # shuffle our data 1000 times and re-run the linear model each time and store the outputs in lm_boot
confint(lm_boot) # generate a 95% confidence interval from the simulated values for the linear regression coefficients