Please start by logging into sso.rstudio.cloud/pomona. Once you’ve logged in, you should see the option to open the `EA 30.1 - Fall 2022`

workspace. You can also access the `EA 30.1 - Fall 2022`

workspace at this link.

At the workspace, you should then see the “Assignment” `AirQualityLab`

. Please click on that Assignment. Here is a direct link to the `AirQualityLab`

project, but it may not work if this is the first time in a while that you’re logging into RStudio.Cloud. If it doesn’t work, no worries - just log in through the sso.rstudio.cloud/pomona link.

Below, we will start by loading packages that we’ll need.

```
### Load packages
library("ggplot2") # plotting functions
library("dplyr") # data wrangling functions
library("readr") # reading in tables, including ones online
library("lubridate") # a package to specify date-time stamps
library("mosaic") # shuffle (permute) our data
```

Next, we will pull in our data.

```
### Load in dataset
airDF <- readr::read_tsv("https://raw.githubusercontent.com/EA30POM/site/main/data/PLairData.tsv")
### How many observations and columns do we have?
dim(airDF)
```

`## [1] 28531 23`

`View`

commandOftentimes, we may want a more intuitive way to see our data tables. It can be really annoying when `R`

is too clever and only displays a subset of the columns of your spreadsheet. The `View()`

function pulls up an Excel-style data viewer. Let’s try it below:

`View( head( airDF ) )`

I’m going to use the pipe operator `%>%`

to daisy chain commands together into a sequence. It is a handy way to make a complex operation with intermediate steps more transparent. The visualization below describes what pipes do in `R`

in linking up functions:

`airDF %>% head() %>% View()`

Now we’ll use that handy `%>%`

operator to clean our data. We’ll also use a very helpful function called `mutate`

. `mutate`

is a command to either alter an existing column or create a new column in the data.

```
### Pre-processing steps
# We want the Date and Time columns to be datetime objects
airDF <- airDF %>%
mutate(Date = lubridate::ymd(Date), Time=lubridate::hms(Time))
# We want to code which observations are made in the day vs. night
# using 6pm as our cut-off
pm_limit <- lubridate::hms("18:00:00")
### Add on a new column to note whether or not the data is day or night
airDF <- airDF %>%
mutate(DayNight = case_when( Time >= pm_limit ~ "Night",
TRUE ~ "Day"))
```

Below, I provide fully-worked examples of different ways of inspecting your data and performing analyses **assuming that Lux is the variable of interest**. In applying this code to your own data, you will want to think about what variable name should

`Lux`

in the commands below.Let’s start with exploratory data analysis where you calculate summary statistics and visualize your variable. Let’s say that I’m interested in understanding how the values of light (`Lux`

) vary across day and night (a silly example, I know). We’ll use another function: `group_by`

splits a data table into groups based on distinct values a variable that has **categories**. In this case, we tell `group_by`

to divide up the `airDF`

data table into groups based on the values of `DayNight`

. We then will use a function `summarize`

to get summary statistics of `Lux`

in the day and night.

```
### Calculate summary statistics for Lux
### for each DayNight condition
airDF %>%
group_by(DayNight) %>%
summarize(min=min(Lux), mean=mean(Lux), max=max(Lux))
# use DayNight as the grouping variable
# for Lux and then summarize Lux
# for each value of DayNight
```

How do light levels vary across day and night? We can visualize that using a boxplot. Boxplots are useful for depicting how different discrete categories within a variable exhibit variation. Below, we will specify the variable that we use to group our data (Day/Night) as the x variable, and the values of interest as the y variable (Lux, or light).

```
p <- ggplot(airDF, aes(x=DayNight, y=Lux))
p <- p + geom_boxplot()
p <- p + labs(x="",y="Light (lux)")
p <- p + theme_minimal()
p
```

Let’s calculate the difference between Day and Night in terms of their mean `Lux`

values.

```
### Calculating differences in mean Lux values
obs_diff <- diff( mean( Lux ~ DayNight, data=airDF, na.rm=T ) )
obs_diff # print out the value
```

```
## Night
## -0.5770775
```

OK, so we see that there is a difference of -0.58 in mean light levels across day and night. Is this a meaningful difference though? Our null hypothesis is that there is **no difference** in mean light levels between day and night (again, a silly example, I know).

Logically, if there is a meaningful difference, then if we shuffle our data around, that should lead to different results than what we see. That is one way to *simulate* statistics to test the null hypothesis. And specifically, in this case, we would expect to see our observed difference is much larger than most of the simulated values.

What does simulating our data look like?

Well, here is what the data look like initially. We’re going to use another function, `select`

, to pull out particular columns of interest from the data.

```
airDF %>%
dplyr::select(Date,Time,Lux,DayNight)
```

```
## # A tibble: 28,531 x 4
## Date Time Lux DayNight
## <date> <Period> <dbl> <chr>
## 1 2022-07-09 22H 0M 3S 11.1 Night
## 2 2022-07-09 22H 0M 4S 9.1 Night
## 3 2022-07-09 22H 0M 5S 11.1 Night
## 4 2022-07-09 22H 0M 6S 11.1 Night
## 5 2022-07-09 22H 0M 7S 11.1 Night
## 6 2022-07-09 22H 0M 8S 9.1 Night
## 7 2022-07-09 22H 0M 9S 9.1 Night
## 8 2022-07-09 22H 0M 10S 9.1 Night
## 9 2022-07-09 22H 0M 11S 11.1 Night
## 10 2022-07-09 22H 0M 12S 11.1 Night
## # … with 28,521 more rows
```

Here’s what the data look like if we shuffle it by randomizing the assignment of observations as day or night. This is like a chaos agent coming around and slapping a new label for day or night on at random for every row in the data table. (But if we think `mean(Lux)`

is the same for `Day`

and `Night`

, then that should be a perfectly OK action by the chaos agent.)

`print(resample(airDF[,c("Date","Time","Lux","DayNight")],groups=DayNight,shuffled=c("Lux")))`

```
## # A tibble: 28,531 x 5
## Date Time DayNight Lux orig.id
## <date> <Period> <chr> <dbl> <chr>
## 1 2022-07-09 13H 1M 16S Day 1.8 17495.18817
## 2 2022-09-09 13H 12M 30S Day 5.2 11632.16676
## 3 2022-09-09 13H 8M 49S Day 11.1 11411.25590
## 4 2022-07-09 12H 43M 24S Day 1.8 7603.11444
## 5 2022-09-09 12H 18M 1S Day 1.8 2632.20111
## 6 2022-12-09 13H 19M 2S Day 7.2 27111.26684
## 7 2022-07-09 12H 31M 55S Day 9.1 22927.28210
## 8 2022-12-09 2H 16M 49S Day 11.1 26000.6096
## 9 2022-07-09 12H 43M 52S Day 1.8 5784.7647
## 10 2022-09-09 12H 8M 15S Day 1.8 2046.18125
## # … with 28,521 more rows
```

We can repeat that procedure again and see how the data shifts if we do another shuffle.

`print(resample(airDF[,c("Date","Time","Lux","DayNight")],groups=DayNight,shuffled=c("Lux")))`

```
## # A tibble: 28,531 x 5
## Date Time DayNight Lux orig.id
## <date> <Period> <chr> <dbl> <chr>
## 1 2022-08-09 1H 33M 37S Day 5.2 18402.16359
## 2 2022-09-09 13H 11M 43S Day 11.1 11585.25557
## 3 2022-12-09 2H 10M 6S Day 1.8 25597.18103
## 4 2022-12-09 13H 19M 16S Day 1.8 27125.17130
## 5 2022-09-09 12H 1M 10S Day 5.5 19972.23567
## 6 2022-08-09 12H 11M 45S Day 5.5 24448.24233
## 7 2022-08-09 12H 32M 28S Day 1.8 14945.20533
## 8 2022-09-09 12H 10M 38S Day 13 20540.2373
## 9 2022-07-09 12H 41M 29S Day 1.8 5641.17320
## 10 2022-08-09 1H 35M 30S Day 9.1 18515.28531
## # … with 28,521 more rows
```

Let’s shuffle the data and see what that means for the distribution of mean `Lux`

differences between day and night.

```
### Create random differences by shuffling the data
randomizing_diffs <- do(1000) * diff( mean( Lux ~ shuffle(DayNight),na.rm=T, data = airDF) ) # calculate the mean in Lux when we're shuffling the day/night values around 1000 times
# Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable
# View first few rows of data
head(randomizing_diffs)
```

```
## DiffMeans
## 1 -0.084016586
## 2 -0.001129461
## 3 0.077960019
## 4 -0.025729763
## 5 0.024778919
## 6 0.054203638
```

Now we can visualize the distribution of simulated differences in mean light levels. Where would our observed difference in mean light levels fall?

```
gf_histogram(~ DiffMeans, fill = ~ (DiffMeans <= obs_diff),
data = randomizing_diffs,
xlab="Difference in mean light across day/night under null",
ylab="Count")
```

In the end, how many of the simulated mean differences were larger than the value we observed? Based on this value, if we were using the conventional \(p\)-value (probability value) of 0.05, we would conclude that because this simulated \(p\)-value < 0.05, that we reject the null hypothesis. There is a difference in mean light level between day and night.

```
# What proportion of simulated values were larger than our observed difference?
prop( ~ DiffMeans <= obs_diff, data = randomizing_diffs)
```

```
## prop_TRUE
## 0
```

Let’s say that instead of comparing light across day and night, I want to compare light against temperature. Light and temperature are both numeric variables. There aren’t obvious categories for either variable.

What do these two variables look like? What does the potential relationship between them look like? We can use a scatterplot to find out.

```
p <- ggplot(airDF, aes(x=TempC, y=Lux))
p <- p + geom_point()
p <- p + labs(x="Temperature (*C)",y="Light (lux)")
p <- p + theme_bw()
p
```

If I’m interested in seeing if `lower temperature is associated with lower (or higher) light levels`

. I can measure that by calculating a correlation coefficient.

```
### Calculate observed correlation
obs_cor <- cor(Lux ~ TempC, data=airDF, use="complete.obs") # store observed correlation in obs_cor of Lux and Temperature
obs_cor # print out value
```

`## [1] 0.1308332`

Is this value of a correlation coefficient of 0.13 meaningful? Let’s test that against a null hypothesis. Our null is that there’s no meaningful association. That is, the correlation coefficient is actually 0.

How then do I know that my correlation coefficient of 0 is significantly different from 0? We can tackle this question by simulating a ton of correlation coefficient values from our data by shuffling it!

In this case, the shuffling here lets us estimate the variation in the correlation coefficient given our data. So we are curious now if the distribution of simulated values does or does not include 0 (that is, is it clearly \(< 0\) or \(> 0\)?).

```
### Calculate correlation coefs for shuffled data
randomizing_cor <- do(1000) * cor(Lux ~ TempC,
data = resample(airDF),
use="complete.obs")
# Shuffle the data 1000 times
# Calculate the correlation coefficient each time
# By correlating Lux to TempC from the
# data table airDF
```

What are the distribution of correlation coefficients that we see when we shuffle our data?

```
quantiles_cor <- qdata( ~ cor, p = c(0.025, 0.975), data=randomizing_cor) # calculate the 2.5% and 97.5% quantiles in our simulated correlation coefficient data (note that 97.5-2.5 = 95!)
quantiles_cor # display the quantiles
```

```
## 2.5% 97.5%
## 0.1168233 0.1453339
```

The values above give us a 95% confidence interval estimate for our correlation coefficient!

Do we clearly see that our correlation coefficient distribution does or doesn’t include 0?

```
gf_histogram(~ cor,
data = randomizing_cor,
xlab="Simulated correlation coefficients",
ylab="Count")
```

In this case, our simulated correlation coefficient does not include in its 95% simulated confidence interval (we actually never see any values close to 0 here!). We can also see this in the plot. So given these data, we would also reject the null hypothesis that `there is no association between light and temperature`

.

On the other hand, if we had seen 0 in the interval, then we would conclude that given the data, we cannot reject the null hypothesis. In that case, there is not sufficiently strong data that light correlates with temperature in any clear way.