1 Overview

The objective this week is to analyze the class dataset, CSPdf, to address your group’s questions. The table below describes each group’s questions and my recommendations for a visualization.

| Group | Question | Visualization | Test |
|---|---|---|---|
| 1 | What are the patterns for the topic of Human Wildlife Interaction globally? | Map | \(\chi^2\) |
| 1 | Are there differences in the distribution of humans or animals being portrayed as victims in the topic of HWI? | Waffle | \(\chi^2\) |
| 2 | What is the relationship between the topic of climate change and portrayals of endangered species? | Balloon/Waffle | \(\chi^2\) |
| 2 | How are tigers portrayed in Asia and outside of Asia? | Waffle | \(\chi^2\) |
| 3 | What are the differences in sentiment towards elephants in countries where they are native vs. non-native? | Boxplot | Difference in means |
| 3 | If we subset by HWI (Human Wildlife Interaction), what is the distribution of positive vs. negative sentiments towards elephants? | Map/Boxplot | Difference in means |
| 4 | Is there a difference between shark attacks and portrayals of sharks as threats globally? | Scatterplot | \(\chi^2\) |
| 4 | What is the relationship between portrayals and topics for sharks? | Balloon | \(\chi^2\) |
| 5 | How does sentiment vary between sharks and pangolins? | Boxplot | Difference in means |
| 5 | How do sentiments towards sharks vary between countries? | Boxplot | ANOVA |
| 6 | Are carnivorous animals less likely to be portrayed as vulnerable or victimized than herbivorous animals? | Waffle | \(\chi^2\) |
| 6 | Is ecotourism a more common portrayal in developing countries than in developed ones? | Waffle | \(\chi^2\) |

It is not necessary to perform a statistical analysis of the data to answer your question. However, for each question, I have included my own recommendation about an appropriate type of analysis. I include an example of each type of analysis below.

By clicking through the menu for Groups, you will be able to see scaffolded code for your group’s question.

2 Example code for analyses

In each group’s site, I provide code to subset the class data and perform a visualization to help you answer your question. However, if you would like to make a claim about the statistical significance of your observations, you need some way of testing the data. That is, you need some way to quantify how likely it is that a pattern at least as strong as the one you observed could have arisen by random chance alone.

First, let’s get started by pulling in the class dataset.

## Libraries for interacting with data
library(readr) # reads in spreadsheet data
library(dplyr) # lets you wrangle data more effectively
library(tidyr) # lets us interact with data with additional functions
## Additional package to pull in
library(mosaic) # package for permutation-based tests

CSPdf <- read_tsv("https://raw.githubusercontent.com/EA30POM/Fall2021/main/data/OTEC_CSP.tsv") # use the function read_tsv from readr to pull in a spreadsheet from this URL.
    # Then store the output in an object called "CSPdf"
    # CSPdf stands for Conservation Science Partners data frame

2.1 \(\chi^2\) test

The first type of test that I’ll cover is a \(\chi^2\) test (a.k.a. a Chi-squared test). This resource provides a great explanation of \(\chi^2\) tests. This resource also provides another explanation.

A \(\chi^2\) test is most appropriate when you have counts across two categorical variables. For instance, let’s say that I’m interested in seeing if there’s any relationship between Country and Portrayal of species in the class dataset, CSPdf. I would then test the null hypothesis (\(H_0\)) that Country and Portrayal are unrelated. I’d expect to see that the proportional counts of portrayals across countries should be quite consistent under the null hypothesis!

As we saw in the soil science lab, one way that we can deal with wonky data is to shuffle (or re-sample) our data assuming that the null hypothesis is true. In this case, that would mean that we should be able to shuffle around Country with no regard for Portrayal.

Then we can compare our observed \(\chi^2\) value against simulated values from shuffling our data, and see if our value is really large. If it is much larger than most of the simulated values, then that’s an indication that the data are inconsistent with the null hypothesis. Thus, we would reject the null.
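For reference, the \(\chi^2\) value being compared here is the standard Pearson statistic. The symbols below are my own notation and do not appear elsewhere in this lab: \(O_{ij}\) is the observed count in row \(i\) (a Portrayal) and column \(j\) (a Country), \(E_{ij}\) is the count we would expect in that cell if the two variables were unrelated, and \(N\) is the total number of articles.

\[
\chi^2 = \sum_{i}\sum_{j} \frac{(O_{ij} - E_{ij})^2}{E_{ij}},
\qquad
E_{ij} = \frac{(\text{row } i \text{ total}) \times (\text{column } j \text{ total})}{N}
\]

The further the observed counts are from the expected counts, the larger \(\chi^2\) becomes, which is why we ask whether our observed value is large relative to the shuffled ones.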

First, let’s tally up the counts of every combination of Portrayal and Country.

  ### Use tally to count up every combo of Portrayal and Country
  ### in the object CSPdf
Results_table <- tally(~ Portrayal + Country, data= CSPdf )
Results_table # See the observed counts
##                Country
## Portrayal       Armenia Australia Bangladesh Bulgaria Cambodia Cameroon Canada
##   Aesthetic           0         0          0        0        0        0      3
##   Controversial       0         0          0        0        0        0      1
##   Dangerous           0         8          0        0        0        1      4
##   Exciting            0         0          0        0        0        0      3
##   Important           0         1          1        0        1        0      2
##   Interesting         1         2          1        1        1        0      6
##   Neutral             0         1          0        0        0        0      5
##   Nuisance            0         1          0        0        0        0      1
##   Resilient           0         2          0        1        0        0      3
##   Valuable            0         2          1        0        0        0      2
##   Victim              0         3          0        0        0        0      9
##   Vulnerable          1         3          2        1        2        1      9
##                Country
## Portrayal       China Germany India Indonesia Ireland Israel Japan Kenya
##   Aesthetic         0       0     1         0       0      0     0     0
##   Controversial     1       0     4         0       1      0     0     0
##   Dangerous         0       0     4         0       1      2     0     0
##   Exciting          0       0     7         0       0      0     3     0
##   Important         1       0     6         0       0      0     0     1
##   Interesting       0       0     8         1       1      1     0     1
##   Neutral           1       0     9         1       0      1     0     0
##   Nuisance          0       0     5         0       0      1     0     0
##   Resilient         1       0     5         0       0      0     0     0
##   Valuable          4       0     6         0       0      0     0     0
##   Victim            3       2    23         1       0      2     1     1
##   Vulnerable       12       0    29        13       0      1     0     1
##                Country
## Portrayal       Malaysia Mexico New Zealand Nigeria Philippines Qatar Russia
##   Aesthetic            0      0           0       0           0     0      0
##   Controversial        0      0           0       0           0     0      0
##   Dangerous            0      0           2       0           0     0      0
##   Exciting             0      0           1       0           0     0      0
##   Important            0      1           1       2           0     0      0
##   Interesting          0      0           1       0           5     1      2
##   Neutral              0      0           1       0           0     0      0
##   Nuisance             0      0           0       0           0     0      0
##   Resilient            0      0           0       1           0     0      0
##   Valuable             1      0           1       1           2     0      0
##   Victim               2      0           0       3           4     0      0
##   Vulnerable           3      0           0       6           3     0      2
##                Country
## Portrayal       Saudi Arabia Singapore South Africa Switzerland Tanzania
##   Aesthetic                0         0            0           0        0
##   Controversial            0         0            3           0        0
##   Dangerous                0         0            3           1        0
##   Exciting                 1         1            2           0        0
##   Important                0         1            2           0        0
##   Interesting              1         1            2           0        1
##   Neutral                  2         0            1           0        0
##   Nuisance                 0         0            0           0        0
##   Resilient                0         0            0           0        0
##   Valuable                 0         1            3           0        0
##   Victim                   0         0            9           0        0
##   Vulnerable               0         0           11           0        0
##                Country
## Portrayal       Thailand Turkey United Arab Emirates United Kingdom
##   Aesthetic            0      0                    0              2
##   Controversial        0      0                    0              1
##   Dangerous            0      0                    0             13
##   Exciting             2      0                    1              9
##   Important            0      0                    0              4
##   Interesting          0      0                    0             12
##   Neutral              0      0                    0              9
##   Nuisance             0      0                    0              1
##   Resilient            0      0                    0              3
##   Valuable             1      1                    0              1
##   Victim               0      1                    0             14
##   Vulnerable           0      0                    0             52
##                Country
## Portrayal       United States Vietnam Zambia Zimbabwe
##   Aesthetic                 3       0      0        0
##   Controversial            26       0      0        0
##   Dangerous                65       0      2        0
##   Exciting                 36       0      0        0
##   Important                29       1      0        0
##   Interesting              72       2      0        0
##   Neutral                  52       1      0        0
##   Nuisance                 30       0      0        0
##   Resilient                18       0      0        0
##   Valuable                 17       0      0        0
##   Victim                   45       4      1        1
##   Vulnerable               84       7      0        2

Second, let’s calculate \(\chi^2\) for our actual data.

  ### Calculate the Chi-squared value for our data
obs_chi <- chisq(Results_table)
obs_chi # see the observed Chi-squared value
## X.squared 
##  477.5231

Third, let’s shuffle the data 1000 times and calculate a \(\chi^2\) value each time the data are shuffled.

  ### Shuffle/sample the data 1000 times
  ### And calculate the Chi-squared value each time
SimulateChiSq <- do(1000) * chisq( tally( ~ Portrayal + shuffle(Country), data= CSPdf))

Finally, let’s see what proportion (\(p\)-value) of the simulated \(\chi^2\) values are bigger than our observed value.

  ### Calculate the proportion of values larger than our Chi-squared value
sim_p_val <- length(which(SimulateChiSq$X.squared > obs_chi))/nrow(SimulateChiSq)
sim_p_val # Is this < 0.05?
## [1] 0.025
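If it helps to see what the shuffling is doing under the hood, here is a minimal base-R sketch of the same four steps without the mosaic package. It runs on a small made-up data frame, not the class dataset. The object names (`toy`, `chisq_stat`, `obs_chi_toy`, `sim_chi_toy`, `p_toy`) and the category labels are hypothetical and only for illustration.

```r
# Toy data (hypothetical, NOT the class dataset)
set.seed(42)
toy <- data.frame(
  Portrayal = sample(c("Victim", "Dangerous", "Neutral"), 120, replace = TRUE),
  Country   = sample(c("A", "B", "C"), 120, replace = TRUE)
)

# Helper (hypothetical name): Pearson chi-squared statistic for two categorical vectors
chisq_stat <- function(x, y) {
  observed <- table(x, y)                                  # step 1: tally the counts
  expected <- outer(rowSums(observed), colSums(observed)) / sum(observed)
  sum((observed - expected)^2 / expected)
}

# Step 2: observed statistic
obs_chi_toy <- chisq_stat(toy$Portrayal, toy$Country)

# Step 3: shuffle Country 1000 times and recompute the statistic each time
sim_chi_toy <- replicate(1000, chisq_stat(toy$Portrayal, sample(toy$Country)))

# Step 4: proportion of shuffled statistics at least as large as the observed one
p_toy <- mean(sim_chi_toy >= obs_chi_toy)
p_toy
```

Because the toy Portrayal and Country columns are generated independently, a small `p_toy` is not expected here; the point is only to show the mechanics of tally, shuffle, and compare.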

2.2 Difference in means - two samples

We’ve seen this type of analysis before (see the soil science lab, section 3.1) so I won’t go into as much detail.

Let’s take an example from the class dataset. Let’s say that I want to know if Sentiment varies across wolves vs. animals that aren’t wolves. That means that we’re comparing Sentiment across just two types of animals (Taxon=="wolf" or not).

We can test the null hypothesis (\(H_0\)) that there is no difference in the mean sentiment for wolves vs. other animals. First, though, as is the case for some of the groups, we’ll have to re-code the data. We need a new column that says, based on Taxon, whether each article is about a wolf or not.

recode_CSP <- CSPdf %>%
  mutate(WolfOrNot = case_when(Taxon=="wolf"~"Wolf",
                               TRUE ~ "Not Wolf"))
  # create a new column called WolfOrNot
  # That codes the data based on the value in Taxon

# View(recode_CSP)

Second, we’ll calculate the observed difference in mean sentiment between Wolf and Not Wolf articles.

obs_diff <- diff(mean( Sentiment ~ WolfOrNot, data=recode_CSP)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff
##        Wolf 
## -0.01338776

Third, we’ll now shuffle the data 1000 times and calculate the difference in mean sentiment across Wolf or Not Wolf each time.

# Shuffle the data and calculate the simulated differences in mean Sentiment
randomizing_diffs <- do(1000) * diff( mean( Sentiment ~ shuffle(WolfOrNot), data = recode_CSP) ) # calculate the difference in mean Sentiment when we're shuffling the WolfOrNot labels around, 1000 times
# Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable
# View first few rows of data
head(randomizing_diffs)
##      DiffMeans
## 1 -0.067686811
## 2 -0.043996547
## 3  0.084564769
## 4  0.033222414
## 5 -0.005977672
## 6 -0.047298070
# View the mean of the simulated differences
mean(randomizing_diffs$DiffMeans) # is this larger or smaller than our observed difference?
## [1] -0.0006127999

Based on what we saw above, our obs_diff is negative and falls below the mean of the simulated differences, which is very close to zero. In this case, “more extreme” therefore means simulated differences that are smaller than our observed difference.

Fourth, we’ll calculate the proportion of simulated differences in mean sentiment that were more extreme than our observed value.

# What proportion of simulated values were "more extreme" than our value?
prop( ~ DiffMeans < obs_diff, data = randomizing_diffs)
## prop_TRUE 
##     0.399
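A proportion of 0.399 is well above 0.05, so these data are consistent with the null hypothesis of no difference. Note that the proportion above is one-sided: it only counts simulated differences that are smaller than obs_diff. If you have no prior expectation about the direction of the difference, a two-sided version compares absolute values instead. Here is a minimal base-R sketch of both versions on made-up data (not the class dataset). All object names (`toy`, `diff_means`, `obs_diff_toy`, `sim_diffs_toy`, `p_lower`, `p_two`) and the simulated sentiment values are hypothetical.

```r
# Toy data (hypothetical, NOT the class dataset)
set.seed(1)
toy <- data.frame(
  Sentiment = c(rnorm(30, mean = 0.1, sd = 0.3), rnorm(70, mean = 0, sd = 0.3)),
  WolfOrNot = rep(c("Wolf", "Not Wolf"), times = c(30, 70))
)

# Helper (hypothetical name): mean for Wolf minus mean for Not Wolf
diff_means <- function(values, groups) {
  mean(values[groups == "Wolf"]) - mean(values[groups == "Not Wolf"])
}

# Observed difference in means
obs_diff_toy <- diff_means(toy$Sentiment, toy$WolfOrNot)

# Shuffle the group labels 1000 times and recompute the difference each time
sim_diffs_toy <- replicate(1000, diff_means(toy$Sentiment, sample(toy$WolfOrNot)))

# One-sided: proportion of shuffled differences at or below the observed one
p_lower <- mean(sim_diffs_toy <= obs_diff_toy)

# Two-sided: proportion of shuffled differences at least as far from zero
p_two <- mean(abs(sim_diffs_toy) >= abs(obs_diff_toy))

c(one_sided_lower = p_lower, two_sided = p_two)
```

Which version you report depends on your question: use the one-sided proportion only if you decided on the direction before looking at the data.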

2.3 Difference in means - more than two samples

When we want to test whether or not three or more groups have the same mean value for some variable, we tend to use a one-way ANOVA test. While I briefly described ANOVA tests back in Week 3, we didn’t get to cover that type of analysis in any depth. If you’d like some explanation about this type of test, please refer to this resource by JMP or the Handbook of Biological Statistics.

Let’s say I’m interested in comparing mean Sentiment across all of the countries in our sample. I can perform an ANOVA test on the data. Great. But how do I know if 1) there are differences in mean sentiment across countries, and 2) these differences are meaningful?

I can shuffle the data! If I assume the null hypothesis (\(H_0\)) that mean sentiment is the same across all of the countries, then that means that I can shuffle Country independent of Sentiment.

Let’s start by calculating a test statistic using an ANOVA test on our dataset.

obs_aov <- anova( lm( Sentiment ~ Country, data= CSPdf))$`F value`[1] # calculate a test statistic from our data
obs_aov # display the value of the test statistic
## [1] 3.061811
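The value extracted above is the \(F\) statistic from a one-way ANOVA, which compares the variation between group means to the variation within groups. The notation below is my own: \(k\) is the number of countries, \(n_j\) is the number of articles from country \(j\), \(N\) is the total number of articles, \(\bar{y}_j\) is the mean Sentiment in country \(j\), and \(\bar{y}\) is the overall mean Sentiment.

\[
F = \frac{\displaystyle \sum_{j=1}^{k} n_j \left(\bar{y}_j - \bar{y}\right)^2 \Big/ (k - 1)}
         {\displaystyle \sum_{j=1}^{k} \sum_{i=1}^{n_j} \left(y_{ij} - \bar{y}_j\right)^2 \Big/ (N - k)}
\]

If country means differ a lot relative to the scatter within each country, the numerator grows and \(F\) gets larger. Under the null hypothesis, \(F\) should be close to 1.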

Second, we’ll shuffle the data 1000 times, assuming that Country doesn’t matter for Sentiment.

sim_aov <- do(1000) * anova(lm( Sentiment ~ shuffle(Country), data=CSPdf))$`F value`[1]
head(sim_aov) # display the first few rows of simulated test statistics
##      result
## 1 0.5930035
## 2 0.8801567
## 3 1.1233541
## 4 1.1706447
## 5 0.8775084
## 6 0.6880138

Third, we’ll calculate a \(p\)-value based on the proportion of simulated test statistics that are larger than our observed test statistic.

prop( ~ result > obs_aov, data=sim_aov)
## prop_TRUE 
##         0
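A proportion of exactly 0 means that none of the 1000 shuffles produced a test statistic larger than our observed one. That is better read as \(p < 0.001\) than as \(p = 0\), because a finite number of shuffles cannot show that the probability is truly zero. One common convention is to add 1 to both the count and the number of shuffles, so the reported value can never be exactly 0. Below is a minimal base-R sketch of that adjustment on made-up data (not the class dataset). All object names (`toy`, `f_stat`, `obs_f_toy`, `sim_f_toy`, `p_raw`, `p_adj`) and the simulated values are hypothetical.

```r
# Toy data (hypothetical, NOT the class dataset): three groups with different means
set.seed(7)
toy <- data.frame(
  Sentiment = c(rnorm(20, 0.3, 0.2), rnorm(20, 0, 0.2), rnorm(20, -0.3, 0.2)),
  Country   = rep(c("A", "B", "C"), each = 20)
)

# Helper (hypothetical name): F statistic from a one-way ANOVA
f_stat <- function(values, groups) {
  anova(lm(values ~ groups))$`F value`[1]
}

# Observed test statistic
obs_f_toy <- f_stat(toy$Sentiment, toy$Country)

# Shuffle the group labels 1000 times and recompute F each time
sim_f_toy <- replicate(1000, f_stat(toy$Sentiment, sample(toy$Country)))

# Raw proportion (can be exactly 0) and the "+1" adjusted version (never 0)
p_raw <- mean(sim_f_toy >= obs_f_toy)
p_adj <- (sum(sim_f_toy >= obs_f_toy) + 1) / (length(sim_f_toy) + 1)

c(raw = p_raw, adjusted = p_adj)
```

With 1000 shuffles, the smallest adjusted value is 1/1001, which is roughly 0.001. Either way, the conclusion for the class data is the same: the observed F statistic is larger than every shuffled one, so we would reject the null hypothesis.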