The objective this week is to analyze the class dataset, CSPdf, to address your group’s questions. The table below describes each group’s questions and my recommendations for a visualization.
Group | Questions | Visualization | Test |
---|---|---|---|
1 | What are the patterns for the topic of Human Wildlife Interaction (HWI) globally? | Map | \(\chi^2\) |
1 | Are there differences in the distribution of humans or animals being portrayed as victims in the topic of HWI? | Waffle | \(\chi^2\) |
2 | What is the relationship between the topic of climate change and portrayals of endangered species? | Balloon/Waffle | \(\chi^2\) |
2 | How are tigers portrayed in Asia and outside of Asia? | Waffle | \(\chi^2\) |
3 | What are the differences in sentiments towards elephants in countries where they are native vs. non-native? | Boxplot | Difference in means |
3 | If we subset by HWI, what is the distribution of positive vs. negative sentiments towards elephants? | Map/Boxplot | Difference in means |
4 | Is there a difference between shark attacks and portrayals of sharks as threats globally? | Scatterplot | \(\chi^2\) |
4 | What is the relationship between portrayals and topics for sharks? | Balloon | \(\chi^2\) |
5 | How does sentiment vary between sharks and pangolins? | Boxplot | Difference in means |
5 | How do sentiments towards sharks vary between countries? | Boxplot | ANOVA |
6 | Are carnivorous animals less likely to be portrayed as vulnerable or victimized than herbivorous animals? | Waffle | \(\chi^2\) |
6 | Is ecotourism a more common portrayal in developing countries than in developed ones? | Waffle | \(\chi^2\) |
It is not necessary to perform a statistical analysis of the data to answer your question. However, for each question, I have included my own recommendation about an appropriate type of analysis. I include an example of each type of analysis below.
By clicking through the menu for Groups, you will be able to see scaffolded code for your group’s question.
In each group’s site, I provide code to subset the class data and perform a visualization to help you answer your question. However, if you would like to be able to make a claim around the statistical significance of your observations, you would then need some way of testing the data. That is, you need some way to quantify the probability that your data could have arisen by random chance.
First, let’s get started by pulling in the class dataset.
## Libraries for interacting with data
library(readr) # reads in spreadsheet data
library(dplyr) # lets you wrangle data more effectively
library(tidyr) # lets us interact with data with additional functions
# Additional package to pull in
library(mosaic)# package for permutation based tests
CSPdf <- read_tsv("https://raw.githubusercontent.com/EA30POM/Fall2021/main/data/OTEC_CSP.tsv") # use the function read_tsv from readr to pull in a spreadsheet from this URL.
# Then store the output in an object called "CSPdf"
# CSPdf stands for Conservation Science Partners data frame
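After pulling in a spreadsheet, it is worth a quick sanity check that the import worked. Here is a minimal sketch using base R’s read.delim (the base analogue of readr::read_tsv) on a small hypothetical file, since the pattern is the same for CSPdf:

```r
# A minimal sketch of reading tab-separated data (hypothetical file);
# read.delim is the base-R analogue of readr::read_tsv
tmp <- tempfile(fileext = ".tsv")
writeLines(c("Taxon\tSentiment", "wolf\t0.2", "shark\t-0.1"), tmp)
df <- read.delim(tmp)
dim(df)    # number of rows and columns
names(df)  # column names, to confirm the file parsed as expected
```

You can run the same dim() and names() checks on CSPdf after the read_tsv() call above.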
The first type of test that I’ll cover is a \(\chi^2\) test (a.k.a. a Chi-squared test). This resource provides a great explanation of \(\chi^2\) tests. This resource also provides another explanation.
A \(\chi^2\) test is most appropriate when you have counts across two categorical variables. For instance, let’s say that I’m interested in seeing if there’s any relationship between Country and Portrayal of species in the class dataset, CSPdf. I would then test the null hypothesis (\(H_0\)) that Country and Portrayal are unrelated. Under the null hypothesis, I’d expect the proportional counts of portrayals to be quite consistent across countries!
As we saw in the soil science lab, one way that we can deal with wonky data is to shuffle (or re-sample) our data assuming that the null hypothesis is true. In this case, that would mean that we should be able to shuffle around Country with no regard for Portrayal.
Then we can compare our observed \(\chi^2\) value against simulated values from shuffling our data, and see if our value is really large. If it is much larger than most of the simulated values, then that’s an indication that the data are inconsistent with the null hypothesis. Thus, we would reject the null.
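The whole shuffle-and-compare logic can be sketched in base R on a toy dataset (the variable names echo the class data, but the values here are made up):

```r
set.seed(42)  # make the shuffles reproducible
# Toy data standing in for the class data (hypothetical values)
toy <- data.frame(
  Portrayal = sample(c("Victim", "Dangerous"), 60, replace = TRUE),
  Country   = sample(c("A", "B", "C"), 60, replace = TRUE)
)
# Observed chi-squared statistic from the cross-tabulation
obs <- suppressWarnings(chisq.test(table(toy$Portrayal, toy$Country)))$statistic
# Shuffle Country 1000 times, recomputing the statistic each time
sims <- replicate(1000, {
  suppressWarnings(chisq.test(table(toy$Portrayal, sample(toy$Country))))$statistic
})
# Permutation p-value: proportion of shuffled statistics >= observed
p_val <- mean(sims >= obs)
```

The mosaic code below does exactly this, with tally(), chisq(), shuffle(), and do() standing in for table(), chisq.test()$statistic, sample(), and replicate().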
First, let’s tally up the counts of Portrayal and Country.
### Use tally to count up every combo of Portrayal and Country
### in the object CSPdf
Results_table <- tally(~ Portrayal + Country, data= CSPdf )
Results_table # See the observed counts
## Country
## Portrayal Armenia Australia Bangladesh Bulgaria Cambodia Cameroon Canada
## Aesthetic 0 0 0 0 0 0 3
## Controversial 0 0 0 0 0 0 1
## Dangerous 0 8 0 0 0 1 4
## Exciting 0 0 0 0 0 0 3
## Important 0 1 1 0 1 0 2
## Interesting 1 2 1 1 1 0 6
## Neutral 0 1 0 0 0 0 5
## Nuisance 0 1 0 0 0 0 1
## Resilient 0 2 0 1 0 0 3
## Valuable 0 2 1 0 0 0 2
## Victim 0 3 0 0 0 0 9
## Vulnerable 1 3 2 1 2 1 9
## Country
## Portrayal China Germany India Indonesia Ireland Israel Japan Kenya
## Aesthetic 0 0 1 0 0 0 0 0
## Controversial 1 0 4 0 1 0 0 0
## Dangerous 0 0 4 0 1 2 0 0
## Exciting 0 0 7 0 0 0 3 0
## Important 1 0 6 0 0 0 0 1
## Interesting 0 0 8 1 1 1 0 1
## Neutral 1 0 9 1 0 1 0 0
## Nuisance 0 0 5 0 0 1 0 0
## Resilient 1 0 5 0 0 0 0 0
## Valuable 4 0 6 0 0 0 0 0
## Victim 3 2 23 1 0 2 1 1
## Vulnerable 12 0 29 13 0 1 0 1
## Country
## Portrayal Malaysia Mexico New Zealand Nigeria Philippines Qatar Russia
## Aesthetic 0 0 0 0 0 0 0
## Controversial 0 0 0 0 0 0 0
## Dangerous 0 0 2 0 0 0 0
## Exciting 0 0 1 0 0 0 0
## Important 0 1 1 2 0 0 0
## Interesting 0 0 1 0 5 1 2
## Neutral 0 0 1 0 0 0 0
## Nuisance 0 0 0 0 0 0 0
## Resilient 0 0 0 1 0 0 0
## Valuable 1 0 1 1 2 0 0
## Victim 2 0 0 3 4 0 0
## Vulnerable 3 0 0 6 3 0 2
## Country
## Portrayal Saudi Arabia Singapore South Africa Switzerland Tanzania
## Aesthetic 0 0 0 0 0
## Controversial 0 0 3 0 0
## Dangerous 0 0 3 1 0
## Exciting 1 1 2 0 0
## Important 0 1 2 0 0
## Interesting 1 1 2 0 1
## Neutral 2 0 1 0 0
## Nuisance 0 0 0 0 0
## Resilient 0 0 0 0 0
## Valuable 0 1 3 0 0
## Victim 0 0 9 0 0
## Vulnerable 0 0 11 0 0
## Country
## Portrayal Thailand Turkey United Arab Emirates United Kingdom
## Aesthetic 0 0 0 2
## Controversial 0 0 0 1
## Dangerous 0 0 0 13
## Exciting 2 0 1 9
## Important 0 0 0 4
## Interesting 0 0 0 12
## Neutral 0 0 0 9
## Nuisance 0 0 0 1
## Resilient 0 0 0 3
## Valuable 1 1 0 1
## Victim 0 1 0 14
## Vulnerable 0 0 0 52
## Country
## Portrayal United States Vietnam Zambia Zimbabwe
## Aesthetic 3 0 0 0
## Controversial 26 0 0 0
## Dangerous 65 0 2 0
## Exciting 36 0 0 0
## Important 29 1 0 0
## Interesting 72 2 0 0
## Neutral 52 1 0 0
## Nuisance 30 0 0 0
## Resilient 18 0 0 0
## Valuable 17 0 0 0
## Victim 45 4 1 1
## Vulnerable 84 7 0 2
Second, let’s calculate \(\chi^2\) for our actual data.
### Calculate the Chi-squared value for our data
obs_chi <- chisq(Results_table)
obs_chi # see the observed Chi-squared value
## X.squared
## 477.5231
Third, let’s shuffle the data 1000 times and calculate a \(\chi^2\) value each time the data are shuffled.
### Shuffle/sample the data 1000 times
### And calculate the Chi-squared value each time
SimulateChiSq <- do(1000) * chisq( tally( ~ Portrayal + shuffle(Country), data= CSPdf))
Finally, let’s see what proportion (\(p\)-value) of the simulated \(\chi^2\) values are bigger than our observed value.
### Calculate the proportion of values larger than our Chi-squared value
sim_p_val <- length(which(SimulateChiSq$X.squared > obs_chi))/nrow(SimulateChiSq)
sim_p_val # Is this < 0.05?
## [1] 0.025
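As a cross-check on this approach: base R’s chisq.test() can run a Monte Carlo version of the same test via its simulate.p.value argument, which draws B random tables consistent with the null hypothesis instead of relying on the asymptotic \(\chi^2\) distribution. A sketch on a hypothetical table:

```r
set.seed(1)
# Hypothetical 2x2 table of counts (not the class data)
tab <- matrix(c(10, 4, 3, 12), nrow = 2)
# simulate.p.value = TRUE generates B random tables under H0
# instead of using the asymptotic chi-squared distribution
res <- chisq.test(tab, simulate.p.value = TRUE, B = 2000)
res$p.value
```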
We’ve seen this type of analysis before (see the soil science lab, section 3.1) so I won’t go into as much detail.
Let’s take an example from the class dataset. Let’s say that I want to know if Sentiment varies between wolves and animals that aren’t wolves. That means that we’re comparing Sentiment across just two types of animals (Taxon=="wolf" or not).
We can test the null hypothesis (\(H_0\)) that there is no difference in the mean sentiment for wolves vs. other animals. But as is the case for some of the groups, that’s going to mean that we’ll have to re-code the data: based on Taxon, we need to label each article as wolf or not wolf.
recode_CSP <- CSPdf %>%
mutate(WolfOrNot = case_when(Taxon=="wolf"~"Wolf",
TRUE ~ "Not Wolf"))
# create a new column called WolfOrNot
# That codes the data based on the value in Taxon
# View(recode_CSP)
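If the dplyr syntax is unfamiliar, the same two-level re-coding can be done in base R with ifelse(); here is a sketch on a small hypothetical data frame:

```r
# The same re-coding in base R with ifelse, on hypothetical data
df <- data.frame(Taxon = c("wolf", "shark", "wolf", "tiger"))
df$WolfOrNot <- ifelse(df$Taxon == "wolf", "Wolf", "Not Wolf")
df$WolfOrNot
```

case_when() becomes more useful than ifelse() when you need three or more categories (e.g., re-coding Taxon into carnivore/herbivore/omnivore).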
Second, we’ll calculate the observed difference in mean sentiment between Wolf and Not Wolf articles.
obs_diff <- diff(mean( Sentiment ~ WolfOrNot, data=recode_CSP)) # calculate the difference between the means and store in a variable called "obs_diff"
obs_diff # display the value of obs_diff
## Wolf
## -0.01338776
Third, we’ll now shuffle the data 1000 times and calculate the difference in mean sentiment across Wolf or Not Wolf each time.
# Shuffle the data and calculate the simulated differences in mean Sentiment
randomizing_diffs <- do(1000) * diff( mean( Sentiment ~ shuffle(WolfOrNot), data = recode_CSP) ) # calculate the difference in mean Sentiment when WolfOrNot is shuffled 1000 times
# Clean up our shuffled data
names(randomizing_diffs)[1] <- "DiffMeans" # change the name of the variable
# View first few rows of data
head(randomizing_diffs)
## DiffMeans
## 1 -0.067686811
## 2 -0.043996547
## 3 0.084564769
## 4 0.033222414
## 5 -0.005977672
## 6 -0.047298070
# View the mean of the simulated differences
mean(randomizing_diffs$DiffMeans) # is this larger or smaller than our observed difference?
## [1] -0.0006127999
Based on what we saw above, it looks like our obs_diff may be smaller than most of the simulated differences in means. In other words, “more extreme” values here are those that are even smaller (more negative) than our observed difference.
Fourth, we’ll calculate the probability that the simulated differences in mean sentiment were more extreme than our observed value.
# What proportion of simulated values were "more extreme" than our value?
prop( ~ DiffMeans < obs_diff, data = randomizing_diffs)
## prop_TRUE
## 0.399
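Note that the proportion above is a one-sided test: it only counts shuffled differences smaller than our observed one. If you had no prior prediction about the direction of the difference, a two-sided test that compares absolute values would be more appropriate. A base-R sketch on toy data (the values here are simulated, not the class data):

```r
set.seed(7)
# Toy numeric outcome and a two-level group (hypothetical values)
x <- rnorm(40)
g <- rep(c("Wolf", "Not Wolf"), each = 20)
obs_diff <- mean(x[g == "Wolf"]) - mean(x[g == "Not Wolf"])
# Shuffle the group labels and recompute the difference each time
sims <- replicate(1000, {
  gs <- sample(g)
  mean(x[gs == "Wolf"]) - mean(x[gs == "Not Wolf"])
})
# Two-sided p-value: shuffled differences at least as extreme in magnitude
p_two_sided <- mean(abs(sims) >= abs(obs_diff))
```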
When we want to test whether or not three or more groups have the same mean value for some variable, we tend to use a one-way ANOVA test. While I briefly described ANOVA tests back in Week 3, we didn’t get to cover that type of analysis in any depth. If you’d like some explanation about this type of test, please refer to this resource by JMP or the Handbook of Biological Statistics.
Let’s say I’m interested in comparing mean Sentiment across all of the countries in our sample. I can perform an ANOVA test on the data. Great. But how do I know if 1) there are differences in mean sentiment across countries, and 2) these differences are meaningful?
I can shuffle the data! If I assume the null hypothesis (\(H_0\)) that mean sentiment is the same across all of the countries, then I can shuffle Country independent of Sentiment.
Let’s start by calculating a test statistic using an ANOVA test on our dataset.
obs_aov <- anova( lm( Sentiment ~ Country, data= CSPdf))$`F value`[1] # calculate a test statistic from our data
obs_aov # display the value of the test statistic
## [1] 3.061811
Second, we’ll shuffle the data 1000 times, assuming that Country doesn’t matter for Sentiment.
sim_aov <- do(1000) * anova(lm( Sentiment ~ shuffle(Country), data=CSPdf))$`F value`[1]
head(sim_aov) # display the first few rows of simulated test statistics
## result
## 1 0.5930035
## 2 0.8801567
## 3 1.1233541
## 4 1.1706447
## 5 0.8775084
## 6 0.6880138
Third, we’ll calculate a \(p\)-value based on the proportion of simulated test statistics that are larger than our observed test statistic.
prop( ~ result > obs_aov, data=sim_aov)
## prop_TRUE
## 0
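A simulated proportion of 0 doesn’t mean the p-value is literally zero; it means that none of our 1000 shuffles produced an F statistic as large as the observed one, so the best we can say is \(p < 1/1000\). A common, slightly conservative convention (not the only option) is to report \((k+1)/(B+1)\), which never returns exactly zero:

```r
# With B shuffles and k simulated statistics at least as large as the
# observed one, a common (slightly conservative) permutation p-value
# estimate is (k + 1) / (B + 1), which never reports exactly 0.
B <- 1000
k <- 0   # here, no simulated F value exceeded obs_aov
p_hat <- (k + 1) / (B + 1)
p_hat
```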