This page demonstrates how to use `R` to shuffle our data, analyze your soil science hypotheses, and generate figures and results for your group lab report. You are welcome to either follow along with the text and copy-paste the code into the console **or** read through and then copy the whole script into a new `R` script.

Please start by logging into sso.rstudio.cloud/pomona. Once you’ve logged in, you should see the option to open the `EA 30.1 - Fall 2022` workspace. You can also access the `EA 30.1 - Fall 2022` workspace at this link.

At the workspace, you should then see the “Assignment” `SoilScienceLab`. Please click on that Assignment. Here is a direct link to the `SoilScienceLab` project, but it may not work if this is the first time in a while that you’re logging into RStudio.Cloud. If it doesn’t work, no worries - just log in through the sso.rstudio.cloud/pomona link.

This table below shows the hypotheses for each group:

Group | Hypotheses | Variables | Statistic |
---|---|---|---|
1 | Plants that have higher C storage will also have higher N storage. | Cperc or soilC, Nperc or soilN | Regression model (lm) or correlation coefficients |
2 | Levels of biomass are greater in California sage scrub soil samples than in chaparral soil samples. | PlantComm or CSS, microbes | Difference in means |
3 | The sites affected by the 2017 wildfire will have higher N values. | Burn2017, Nperc or soilN | Difference in means |
4 | Chaparral plants are woody, and wood presence in the soil depletes nitrogen levels. Hence, we hypothesize that chaparral soils have lower nitrogen levels than California sage scrub soils. | PlantComm or CSS, Nperc or soilN | Difference in means |
5 | Soil clods collected near chaparral species have higher organic matter concentrations than those collected near sage scrub species. | PlantComm or CSS, SOM | Difference in means |
6 | Soil samples from California sage scrub species (*Artemisia californica*, *Salvia apiana*) contain higher levels of carbon than samples from chaparral species (*Malosma laurina*, *Sambucus nigra*). | PlantComm or CSS, Cperc or soilC | Difference in means |
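To make the table concrete, here is a small sketch of how Group 3 could adapt the difference-in-means template shown later on this page. It assumes the `soilDF` data object has already been loaded (as shown below) and that `Burn2017` and `Nperc` are columns in the spreadsheet; each group can swap in the variables from its own row of the table.

```
### Example sketch (Group 3): mean N percentage for burned vs. unburned sites
# Assumes soilDF is loaded and contains Burn2017 and Nperc columns
mean( Nperc ~ Burn2017, data = soilDF, na.rm = TRUE ) # mean Nperc for each burn category
diff( mean( Nperc ~ Burn2017, data = soilDF, na.rm = TRUE ) ) # observed difference in means
```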

Below, we will start by loading packages that we’ll need.

```
### Load packages
library("ggplot2") # plotting functions
library("dplyr") # data wrangling functions
library("readr") # reading in tables, including ones online
library("mosaic") # shuffle (permute) our data
```

Next, we will pull in our data and start creating some useful variables, such as `SoilDensity`, the density of the soil (units: g/mL), or `soilC`, the quantity of carbon in the soil (units: kg C for some volume in units of \(m^3\)) given the soil’s density.

```
### Load in dataset
soilDF <- readr::read_tsv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRkRXSzJeNWsU4OJ55XfVQtZH3nWkuUjAmINt8go1UyaEgf_I4524bBV-Os1tw59A2nfN6_iPSzV323/pub?output=tsv") # read in spreadsheet from its URL and store in soilDF
### Preparing data
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(CSS = as.factor(CSS)) # convert 0 and 1 for CSS (coastal sage scrub) from numeric to categorical ("factor")
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(PlantComm = ifelse(CSS=="1","CSS","CHAP")) # use mutate to create a new column that has two types of plant communities: CSS and CHAP(arral)
### Calculate useful variables
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(SoilDensity = ( ( (ClodWeight1dip - (ClodWeight2dip-ClodWeight1dip)) * (1 - DryFrac)) - RockWeightG)/ClodVolumeInmL ) # create SoilDensity variable from columns in the data following Caspi et al. 2019
```

Note that our units for `SoilDensity` are g/mL. But we want `soilC` in units of kilograms of carbon per 0.1 \(m^3\) of soil. To calculate `soilC` in units of kg C per \(m^2\) at a depth of 10cm (or 0.1m), we need to do the following calculations:

- Note that 10cm = 0.1m
- The total soil volume is therefore 1m * 1m * 0.1m = 0.1 \(m^3\)
- Note that 1 \(m^3\) equals \(10^6\) mL

- Pro tip! Are you ever uncertain about unit conversions? You can use Wolfram Alpha to double check for you.

- So 0.1 \(m^3\) of soil equals \(10^6 * 0.1 = 10^5\) mL.
- Keep in mind that 1000 (or \(10^3\)) grams = 1 kg
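These conversion pieces can be double-checked directly in `R`:

```
### Check the unit conversion pieces
mL_per_m3 <- 10^6                            # 1 m^3 = 10^6 mL
soil_volume_m3 <- 1 * 1 * 0.1                # 1 m x 1 m x 0.1 m depth
g_per_kg <- 10^3                             # 1000 g = 1 kg
soil_volume_mL <- mL_per_m3 * soil_volume_m3 # 10^5 mL in our 0.1 m^3 of soil
soil_volume_mL # displays 1e+05
```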

We’ll put all of these pieces together below.

```
soilDF <- soilDF %>%
  mutate(soilC = SoilDensity * 10^5 * Cperc/100 * 1/10^3) # create soilC variable, which has these units: kg C/m2 at a depth of 10cm
### Display data
# View(soilDF) # open a spreadsheet viewer
# delete the leading comment/hashtag sign to run the code above
```

Below, I provide fully-worked examples of different ways of inspecting your data and performing analyses **assuming that `SoilDensity` is the variable of interest**. In applying this code to your own data, you will want to think about what variable name should replace `SoilDensity` in the commands below. For example, you would change `tapply(soilDF$SoilDensity, soilDF$PlantComm, summary)` to `tapply(soilDF$ResponseVariable, soilDF$PlantComm, summary)`, where `ResponseVariable` is the variable you’re analyzing.

Let’s start with exploratory data analysis, where you calculate summary statistics and visualize your variable.

```
### Calculate summary statistics for SoilDensity
### for each PlantComm(unity)
tapply(soilDF$SoilDensity, soilDF$PlantComm, summary)
```

```
## $CHAP
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5424 0.9218 1.1074 1.0483 1.1491 1.4578
##
## $CSS
## Min. 1st Qu. Median Mean 3rd Qu. Max.
## 0.5146 1.0101 1.1549 1.0749 1.2424 1.4039
```

```
# use PlantComm as the grouping variable
# for SoilDensity and then summarize SoilDensity
# for each PlantComm(unity)
```

How would I visualize the distribution of values of `SoilDensity`? I could use a histogram and color the values differently for Coastal Sage Scrub vs. chaparral.

```
### Visualize your variable of interest
# Histogram
gf_histogram(~SoilDensity,             # plot distribution of SoilDensity
             fill=~CSS,                # change colors based on CSS (or not)
             data=soilDF,              # the data come from the soilDF object
             xlab="Soil density (g/mL)", # change the x-axis label
             ylab="Count")             # change the y-axis label
```

Alternatively, I could create a boxplot to visualize the distribution of values in `SoilDensity`, with one boxplot for each plant community.

```
### Visualize your variable of interest
# Boxplot
gf_boxplot(SoilDensity~CSS,            # plot distribution of SoilDensity for CSS
           fill= ~CSS,                 # change colors based on CSS (or not)
           data=soilDF,                # the data come from the soilDF object
           xlab="Woody vegetation",    # change x-axis label
           ylab="Soil density (g/mL)") # change y-axis label
```

Finally, if I am interested in the relationship between two or more variables, I can use a scatterplot to visualize each pair of data.

```
### Visualization
# Scatterplot
gf_point(SoilDensity~Species,          # plot SoilDensity against plant Species
         color= ~CSS,                  # change colors based on CSS (or not)
         data= soilDF,                 # the data come from the soilDF object
         xlab="Species",               # change x-axis label
         ylab="Soil density (g/mL)")   # change y-axis label
```

The code below shows you how to do different types of analyses.

Let’s say I am interested in determining if Coastal Sage Scrub (CSS) plants tend to have higher soil density than Chaparral (CHAP) plants. This sounds like I want to see if there is a clear difference in the mean values of `SoilDensity` for the CSS plants versus the CHAP plants. We can start by calculating the mean difference we observe.

`mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T) # show mean SoilDensity values for the plant communities and remove missing data`

```
## CHAP CSS
## 1.048335 1.074866
```

```
obs_diff <- diff( mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff
```

```
## CSS
## 0.02653111
```

OK, so the observed difference in mean soil density between CSS and CHAP is about 0.03 g/mL. Are the plant communities meaningfully different in terms of soil density, though? Our null hypothesis would be that there is no difference in soil density between the plant communities.

Logically, if there is a meaningful difference, then shuffling our data around should lead to different results than what we observe. That is one way to *simulate* statistics to test the null hypothesis. Specifically, in this case, we would expect our observed difference to be much larger than most of the simulated values.

What does simulating our data look like?

Well, here is what the data look like initially:

`soilDF[,c(1:4,18:20)]`

```
## # A tibble: 24 x 7
## SampleID Species Group CSS PlantComm SoilDensity soilC
## <chr> <chr> <dbl> <fct> <chr> <dbl> <dbl>
## 1 Artcal1 Artcal 1 1 CSS 1.40 2.32
## 2 Salapi1 Salapi 1 1 CSS 1.34 2.28
## 3 Mallau1 Mallau 1 0 CHAP 1.36 3.34
## 4 Samnig1 Samnig 1 0 CHAP 1.14 1.07
## 5 Artcal2 Artcal 2 1 CSS 1.13 8.68
## 6 Salapi2 Salapi 2 1 CSS 1.06 1.87
## 7 Mallau2 Mallau 2 0 CHAP 0.899 3.90
## 8 Samnig2 Samnig 2 0 CHAP 1.09 2.07
## 9 Artcal3 Artcal 3 1 CSS 1.22 2.76
## 10 Salapi3 Salapi 3 1 CSS 0.515 0.463
## # … with 14 more rows
```

Here is what it looks like after we’ve “shuffled” it once by randomizing the `PlantComm` assignments. This is like taking a deck of cards with different `SoilDensity` or `soilC`(arbon) values, and being able to change each card’s color and suit (in this case, the plant community associated with that card).

`resample(soilDF[,c(1:4,18:20)],groups=PlantComm,shuffled=c("SoilDensity","soilC"))`

```
## # A tibble: 24 x 8
## SampleID Species Group CSS PlantComm SoilDensity soilC orig.id
## <chr> <chr> <dbl> <fct> <chr> <dbl> <dbl> <chr>
## 1 Mallau2 Mallau 2 0 CHAP 0.899 3.90 7.7.7
## 2 Mallau4 Mallau 4 0 CHAP 1.09 3.34 15.8.3
## 3 Mallau6 Mallau 6 0 CHAP 0.542 10.4 23.23.12
## 4 Samnig1 Samnig 1 0 CHAP 1.46 1.07 4.19.4
## 5 Mallau5 Mallau 5 0 CHAP 1.36 3.90 19.3.7
## 6 Mallau1 Mallau 1 0 CHAP 1.12 2.07 3.12.8
## 7 Mallau6 Mallau 6 0 CHAP 1.46 0.315 23.19.23
## 8 Mallau2 Mallau 2 0 CHAP 0.899 2.07 7.7.8
## 9 Samnig3 Samnig 3 0 CHAP 1.13 0.315 12.15.23
## 10 Mallau5 Mallau 5 0 CHAP 1.14 3.09 19.4.19
## # … with 14 more rows
```

We can repeat that procedure again and see how the data shifts when we do another shuffle.

`resample(soilDF[,c(1:4,18:20)],groups=PlantComm,shuffled=c("SoilDensity","soilC"))`

```
## # A tibble: 24 x 8
## SampleID Species Group CSS PlantComm SoilDensity soilC orig.id
## <chr> <chr> <dbl> <fct> <chr> <dbl> <dbl> <chr>
## 1 Samnig4 Samnig 4 0 CHAP 1.07 2.07 16.16.8
## 2 Samnig1 Samnig 1 0 CHAP 0.929 1.07 4.20.4
## 3 Mallau1 Mallau 1 0 CHAP 1.12 2.07 3.12.8
## 4 Samnig3 Samnig 3 0 CHAP 1.36 2.57 12.3.16
## 5 Mallau2 Mallau 2 0 CHAP 1.36 3.90 7.3.7
## 6 Samnig2 Samnig 2 0 CHAP 0.899 1.62 8.7.20
## 7 Samnig2 Samnig 2 0 CHAP 0.542 3.34 8.23.3
## 8 Samnig5 Samnig 5 0 CHAP 1.07 0.315 20.16.23
## 9 Mallau1 Mallau 1 0 CHAP 1.09 3.90 3.8.7
## 10 Mallau2 Mallau 2 0 CHAP 1.14 10.4 7.4.12
## # … with 14 more rows
```

Let’s shuffle the data and see what that means for the distribution of mean `SoilDensity` differences between `CSS` and `CHAP`.

```
### Create random differences by shuffling the data
randomizing_diffs <- do(1000) * diff( mean( SoilDensity ~ shuffle(PlantComm), na.rm=T, data = soilDF) ) # calculate the difference in mean SoilDensity when we shuffle the plant community labels 1000 times
# Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable
# View first few rows of data
head(randomizing_diffs)
```

```
## DiffMeans
## 1 -0.01179862
## 2 -0.02245818
## 3 -0.04216567
## 4 0.03161914
## 5 -0.02217865
## 6 0.06866587
```

Now we can visualize the distribution of simulated differences in mean soil density and our observed difference between CSS and CHAP.

```
gf_histogram(~ DiffMeans, fill = ~ (DiffMeans >= obs_diff),
             data = randomizing_diffs,
             xlab="Difference in mean soil density under null",
             ylab="Count")
```

In the end, how many of the simulated mean differences were more extreme than the value we observed? If we use the conventional \(p\)-value (probability value) threshold of 0.05, then because this simulated \(p\)-value is > 0.05, we cannot reject the null hypothesis.

```
# What proportion of simulated values were larger than our observed difference
prop( ~ DiffMeans >= obs_diff, data = randomizing_diffs) # ~0.03 was the observed difference value - see obs_diff
```

```
## prop_TRUE
## 0.399
```

How would I determine whether there is a non-zero correlation between two variables, or whether two variables are positively correlated? I can again start by calculating the observed correlation coefficient for the data.

```
### Calculate observed correlation
obs_cor <- cor(BacterialPerc ~ SoilDensity, data=soilDF, use="complete.obs") # store observed correlation in obs_cor of BacterialPerc vs. SoilDensity
obs_cor # print out value
```

`## [1] -0.1320645`

Let’s say that I’m interested in determining if the percentage of bacteria is negatively correlated with soil density. Our null hypothesis could be that the correlation coefficient is actually 0. As before though, how do I know that my observed correlation coefficient of -0.13 is significantly different from 0? We can tackle this question by simulating a ton of correlation coefficient values from our data by shuffling it!

In this case, the shuffling here lets us estimate the variation in the correlation coefficient given our data. So we are curious now if the distribution of simulated values does or does not include 0 (that is, is it clearly \(< 0\) or \(> 0\)?).

```
### Calculate correlation coefs for shuffled data
randomizing_cor <- do(1000) * cor(BacterialPerc ~ SoilDensity,
                                  data = resample(soilDF),
                                  use="complete.obs")
# Shuffle the data 1000 times
# Calculate the correlation coefficient each time
# By correlating BacterialPerc to SoilDensity from the
# data object soilDF
```

What is the distribution of correlation coefficients that we see when we shuffle our data?

```
quantiles_cor <- qdata( ~ cor, p = c(0.025, 0.975), data=randomizing_cor) # calculate the 2.5% and 97.5% quantiles in our simulated correlation coefficient data (note that 97.5-2.5 = 95!)
quantiles_cor # display the quantiles
```

```
## 2.5% 97.5%
## -0.5288802 0.2217759
```

The values above give us a 95% confidence interval estimate for our correlation coefficient!

Do we clearly see that our correlation coefficient distribution does or doesn’t include 0?

```
gf_histogram(~ cor,
             data = randomizing_cor,
             xlab="Simulated correlation coefficients",
             ylab="Count")
```

In this case, the 95% simulated confidence interval for our correlation coefficient includes 0. We can also see this in the plot. So given these data, we cannot reject the null hypothesis. There is not sufficiently strong evidence that percent bacteria associates with soil density in any clear way.
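If you would rather make that check explicit in code, here is a small sketch; it recomputes the same 2.5% and 97.5% quantiles with base `R`’s `quantile()` from the `randomizing_cor` object above.

```
### Does the simulated 95% interval include 0?
qs <- quantile(randomizing_cor$cor, probs = c(0.025, 0.975), na.rm = TRUE) # same quantiles as qdata above
(qs[1] < 0) & (0 < qs[2]) # TRUE means 0 is inside the interval, so we cannot reject the null
```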

The last type of analysis we’ll look at is a regression model. Specifically, we’ll do an analysis with two predictor variables. We are evaluating how soil density covaries with percent bacteria and soil organic matter (`SOM`).

We can see how our response variable co-varies with each predictor variable by using scatterplots. Here we can also add a line for each pair of variables.

```
gf_point(SoilDensity ~ BacterialPerc,   # plot soil density against bacteria
         data=soilDF,                   # from the soilDF data
         xlab="Bacteria %",             # change x-axis label
         ylab="Soil density (g/mL)") %>% # change y-axis label
  gf_lm() # add a regression line for these two variables only
```

Now let’s run a linear regression with multiple predictor variables (multiple regression). Here, we’ll use `BacterialPerc` and `SOM` as independent variables that predict `SoilDensity`, the response (or dependent) variable.

```
lm_mod <- lm(SoilDensity ~ BacterialPerc + SOM, data=soilDF) # perform linear regression of soil density against bacterial percentage and soil organic matter and store in an object called lm_mod
lm_mod # print object
```

```
##
## Call:
## lm(formula = SoilDensity ~ BacterialPerc + SOM, data = soilDF)
##
## Coefficients:
## (Intercept) BacterialPerc SOM
## 1.557138 -0.004718 -4.067414
```

We can also shuffle our data to estimate the variation in the coefficient values. Our null is that the coefficient estimates are equal to 0. That is, there is no relationship between soil density and bacterial percentage or soil organic matter. If the simulated coefficient estimates have an interval that doesn’t cross 0 (go from negative to positive), then that is evidence against the null hypothesis.

In the table below, we’d focus on rows 2-3 in particular.

```
lm_boot <- do(1000) * lm(SoilDensity ~ BacterialPerc + SOM, data=resample(soilDF)) # shuffle our data 1000 times and re-run the linear model each time and store the outputs in lm_boot
confint(lm_boot) # generate a 95% confidence interval from the simulated values for the linear regression coefficients
```

```
## name lower upper level method estimate
## 1 Intercept 0.636625147 2.335351358 0.95 percentile 1.557138470
## 2 BacterialPerc -0.015790722 0.005553617 0.95 percentile -0.004718324
## 3 SOM -7.160797035 3.037437189 0.95 percentile -4.067413529
## 4 sigma 0.128758625 0.315910727 0.95 percentile 0.257846423
## 5 r.squared 0.006230317 0.651408223 0.95 percentile 0.159460192
## 6 F 0.065828566 19.621230945 0.95 percentile 1.991972295
```

Based on the values above, as soil organic matter (SOM) increases by one unit, soil density decreases by as much as 7.2 g/mL or increases by as much as 3.0 g/mL. Because the confidence interval for this coefficient crosses 0, we would conclude that there is not a clear directional association between SOM and soil density. Similarly, it looks like bacterial percentage does not have a clear directional effect on soil density.
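One way to pull out that conclusion programmatically is sketched below; it assumes the `lm_boot` object from above and the `name`, `lower`, and `upper` columns shown in the `confint()` table.

```
### Which coefficient intervals cross 0?
ci <- confint(lm_boot) # same confidence-interval table as above
subset(ci, lower < 0 & upper > 0, select = c(name, lower, upper)) # rows whose interval spans 0
```

Given the table above, that would return the `BacterialPerc` and `SOM` rows: the two coefficients for which the data do not support a clear directional effect.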

The segment below can be directly copied and pasted into the code editor in RStudio Cloud.

```
###=========================================
### Load packages
###=========================================
library("ggplot2") # plotting functions
library("dplyr") # data wrangling functions
library("readr") # reading in tables, including ones online
library("mosaic") # shuffle (permute) our data
###=========================================
### Load and process data
###=========================================
### Load in dataset
soilDF <- readr::read_tsv("https://docs.google.com/spreadsheets/d/e/2PACX-1vRkRXSzJeNWsU4OJ55XfVQtZH3nWkuUjAmINt8go1UyaEgf_I4524bBV-Os1tw59A2nfN6_iPSzV323/pub?output=tsv") # read in spreadsheet from its URL and store in soilDF
### Preparing data
soilDF$CSS <- as.factor(soilDF$CSS) # convert 0 and 1 for CSS (coastal sage scrub) from numeric to categorical ("factor")
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(PlantComm = ifelse(CSS=="1","CSS","CHAP")) # use mutate to create a new column that has two types of plant communities: CSS and CHAP(arral)
### Calculate useful variables: SoilDensity
soilDF <- soilDF %>% # take soilDF object and feed it into -->
  mutate(SoilDensity = ( ( (ClodWeight1dip - (ClodWeight2dip-ClodWeight1dip)) * (1 - DryFrac)) - RockWeightG)/ClodVolumeInmL ) # create SoilDensity variable from columns in the data following Caspi et al. 2019
### Calculate soilC for soil carbon
soilDF <- soilDF %>%
  mutate(soilC = SoilDensity * 10^5 * Cperc/100 * 1/10^3) # create soilC variable, which has these units: kg C/m2 at a depth of 10cm
### Display data
View(soilDF) # open a spreadsheet viewer
###=========================================
### Exploratory data analysis
###=========================================
### Calculate summary statistics for SoilDensity
### for each PlantComm(unity)
tapply(soilDF$SoilDensity, soilDF$PlantComm, summary)
# use PlantComm as the grouping variable
# for SoilDensity and then summarize SoilDensity
# for each PlantComm(unity)
### Visualize your variable of interest
# Histogram
gf_histogram(~SoilDensity,             # plot distribution of SoilDensity
             fill=~CSS,                # change colors based on CSS (or not)
             data=soilDF,              # the data come from the soilDF object
             xlab="Soil density (g/mL)", # change the x-axis label
             ylab="Count")             # change the y-axis label
### Visualize your variable of interest
# Boxplot
gf_boxplot(SoilDensity~CSS,            # plot distribution of SoilDensity for CSS
           fill= ~CSS,                 # change colors based on CSS (or not)
           data=soilDF,                # the data come from the soilDF object
           xlab="Woody vegetation",    # change x-axis label
           ylab="Soil density (g/mL)") # change y-axis label
### Visualization
# Scatterplot
gf_point(SoilDensity~Species,          # plot SoilDensity against plant Species
         color= ~CSS,                  # change colors based on CSS (or not)
         data= soilDF,                 # the data come from the soilDF object
         xlab="Species",               # change x-axis label
         ylab="Soil density (g/mL)")   # change y-axis label
###=========================================
### Analysis - Differences in means
###=========================================
mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T) # show mean SoilDensity values for the plant communities and remove missing data
obs_diff <- diff( mean( SoilDensity ~ PlantComm, data = soilDF , na.rm=T)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff
### Create random differences by shuffling the data
randomizing_diffs <- do(1000) * diff( mean( SoilDensity ~ shuffle(PlantComm), na.rm=T, data = soilDF) ) # calculate the difference in mean SoilDensity when we shuffle the plant community labels 1000 times
# Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable
# View first few rows of data
head(randomizing_diffs)
gf_histogram(~ DiffMeans, fill = ~ (DiffMeans >= obs_diff),
             data = randomizing_diffs,
             xlab="Difference in mean soil density under null",
             ylab="Count")
###=========================================
### Analysis - Correlation coefficient
###=========================================
### Calculate observed correlation
obs_cor <- cor(BacterialPerc ~ SoilDensity, data=soilDF, use="complete.obs") # store observed correlation in obs_cor of BacterialPerc vs. SoilDensity
obs_cor # print out value
### Calculate correlation coefs for shuffled data
randomizing_cor <- do(1000) * cor(BacterialPerc ~ SoilDensity,
                                  data = resample(soilDF),
                                  use="complete.obs")
# Shuffle the data 1000 times
# Calculate the correlation coefficient each time
# By correlating BacterialPerc to SoilDensity from the
# data object soilDF
quantiles_cor <- qdata( ~ cor, p = c(0.025, 0.975), data=randomizing_cor) # calculate the 2.5% and 97.5% quantiles in our simulated correlation coefficient data (note that 97.5-2.5 = 95!)
quantiles_cor # display the quantiles
gf_histogram(~ cor, fill = ~ (cor > 0),
             data = randomizing_cor,
             xlab="Simulated correlation coefficients",
             ylab="Count")
###=========================================
### Analysis - Linear regression
###=========================================
### Initial visualization
gf_point(SoilDensity ~ BacterialPerc,   # plot soil density against bacteria
         data=soilDF,                   # from the soilDF data
         xlab="Bacteria %",             # change x-axis label
         ylab="Soil density (g/mL)") %>% # change y-axis label
  gf_lm() # add a regression line for these two variables only
lm_mod <- lm(SoilDensity ~ BacterialPerc + SOM, data=soilDF) # perform linear regression of soil density against bacterial percentage and soil organic matter and store in an object called lm_mod
lm_mod # print object
lm_boot <- do(1000) * lm(SoilDensity ~ BacterialPerc + SOM, data=resample(soilDF)) # shuffle our data 1000 times and re-run the linear model each time and store the outputs in lm_boot
confint(lm_boot) # generate a 95% confidence interval from the simulated values for the linear regression coefficients
```