We will be:
You are welcome to either follow along with the text and copy-paste the code into the console or read through and then copy the whole script into a new R
script in Rstudio.Cloud Enviro Comms Lab. Please make sure that you start a new project using that link. Do not continue using the projects from the Air Quality Lab or the Soil Science Lab.
First, let’s make sure we have useful functions to help us in this project.
## Libraries for interacting with data
library(readr) # reads in spreadsheet data
library(dplyr) # lets you wrangle data more effectively
## Libraries for plotting
library(ggplot2) # plotting library
library(waffle) # square pie chart (waffle chart) library
## Libraries for spatial data
library(sf) # a nicer way to deal with spatial data
library(spData) # spatial datasets
## Libraries for text data
library(vader) # package that permits for sentiment analysis
library(tidytext) # package for wrangling text data
library(stringr) # package with helpful functions dealing with string, or character/text data.
library(tokenizers) # package to split strings into smaller entities (e.g. messages into words, ngrams, or individual letters)
Next, we’re going to import the class dataset.
CSPdf <- read_tsv("https://raw.githubusercontent.com/EA30POM/Fall2021/main/data/OTEC_CSP.tsv") # use the function read_tsv from readr to pull in a spreadsheet from this URL.
# Then store the output in an object called "CSPdf"
# CSPdf stands for Conservation Science Partners data frame
We can always pull up a viewer of our data by using these commands:
## Version 1:
CSPdf %>% # take CSPdf and feed it into the next command
head( ) %>% # look at the first 6 rows
View( ) # open the spreadsheet viewer
# akin to x --> f(x) --> g(x)
## Alternatively, version 2:
View( head( CSPdf ) ) # akin to g( f( x ) )
What topics are in the data?
CSPdf %>% # feed CSPdf into the next command
group_by(Topic) %>% # bucket the data based on the Topics of each article
summarise(Count = n()) # get the counts across all of the articles for every topic
## # A tibble: 12 x 2
## Topic Count
## <chr> <int>
## 1 Climate Change 8
## 2 Conservation 171
## 3 Ecotourism 39
## 4 Exploitation 200
## 5 Habitat Loss 12
## 6 Health 57
## 7 HWI 185
## 8 PAs 40
## 9 Pollution 10
## 10 Threats 65
## 11 Wildlife info 158
## 12 Zoo 26
## How much data do we have?
dim(CSPdf) # displays number of rows and columns
## [1] 971 9
nrow(CSPdf) # displays number of row
## [1] 971
Questions:
Portrayals
?Country
?Taxon
?Let’s say that my group decides to focus on wolves. How do I filter the data to just look at wolves?
Easy! I can use the filter
function from the dplyr
package.
subDF <- CSPdf %>% # take CSPdf and feed it forward into
filter(Taxon=="wolf") # the filter function, selecting only for the rows where Taxon is equal to wolf
# and assign the output to subDF, short for subset data frame
Questions:
deer
?Exploitation
but include all of the species?Vulnerable
?Below, we’ll walk through some example visualizations.
Now that we’ve subset our data, how can we get helpful information about different variables?
summaryDF <- subDF %>% # take the subset data, subDF, and feed it into
group_by(Topic) %>% # a grouping operation to bucket the data based on topic
summarise(Count=n()) # then get the count of articles by topic
Questions:
Country
?min
imum or mean
value for Sentiment
for articles about wolves?We can use a square pie chart, aka a waffle chart, to display this information about the topic by topic breakdown for wolf articles.
names(summaryDF) <- c("name","vals") # changing column names for the summaryDF object
col_ramp <- RColorBrewer::brewer.pal(nrow(summaryDF), "Set3") # selecting a nicer set of colors
waffle(data.frame(summaryDF),colors=col_ramp) # create a waffle plot for the summaryDF object, with the color ramp we have
Questions:
Country
counts for wolves instead?Sentiment
?This is a type of analysis that is called a “cross-tabulation”, or crosstab for short. Every entry in a crosstab table tells you the number of observations that have some specific pairing of variables. For example, how many articles about wolves have the portrayal Vulnerable
with the topic Exploitation
?
crossDF <- subDF %>% # start from our subset data, subDF
group_by(Topic,Portrayal) %>% # bucket the data based on Topic and Portrayal
summarise(Count=n()) # get the counts for each combination of Topic and Portrayal
# save the output in the object crossDF (short for crosstab data frame)
Questions:
Country
and Topic
?Portrayal
and Country
?Sentiment
and Country
?Now we can visualize our crosstab.
p <- ggplot(crossDF, aes(Topic, Portrayal, Count)) # initiate a plot using crossDF
p <- p + geom_point(aes(size = Count), colour = "skyblue") # plot points for each combination of topic and portrayal - change the size of the point based on the counts
p <- p + theme_bw() + labs(x="",y="",size="Count:") # modify the plot appearnace
p <- p + scale_size_continuous(range=c(1,10)) + geom_text(aes(label = Count)) # change the range of the point sizes and add text information about the crosstabs
p <- p + theme(axis.text.x = element_text(angle = 45, hjust=1),
legend.position = "top") # change the plot visualization
p # display the plot
Questions:
What is the distribution of sentiment toward wolves?
p <- ggplot(data=subDF, aes(x=Sentiment)) # create a plot from the subDF object using the column Sentiment
p <- p + geom_histogram() # add a histogram to the plot
p <- p + theme_classic() # change the display of the plot
p <- p + labs(y="Count") # change the y-axis label
p # display the plot
How does sentiment vary across topics for wolves?
p <- ggplot(data=subDF, aes(x=Topic,y=Sentiment)) # create a plot using the columns Topic and Sentiment from the subDF data
p <- p + geom_boxplot() # add a boxplot
p <- p + geom_hline(yintercept=0, linetype="dashed", col="red") # add a red dahsed line to demarcate where the positive vs. negative articles are located
p <- p + coord_flip() # switch the x- and y-axis
p # display plot
Below, we are going to do several things:
worldDF <- world # store the dataset for the world in worldDF, world data frame
speciesCountryCounts <- subDF %>% # feed the subsetDF forward
group_by(Country) %>% # bucket based on country
summarise(Count=n()) # get counts by country
speciesMap <- worldDF %>% # take the worldDF object and feed forward
left_join(speciesCountryCounts, by=c("name_long"="Country")) # combine with our count data by country,
# getting the data to match on the columns "name_long" (countries) and "Country"
# store outputs in speciesMap
p <- ggplot(data=speciesMap, aes(fill=Count)) # use the speciesMap object to create a map
p <- p + geom_sf(aes(fill=Count),color="gray47",size=0.05) # add on the boundaries of countries
p <- p + labs(x="",y="",fill="Count of wolf articles: ") # change the names of the variables as needed
p <- p + theme_minimal() # change the display style
p <- p + theme(legend.position = "top") # move the legend bar to the top
p # display the plot
How would we visualize the distribution of sentiment for wolf media coverage across countries?
First, we probably want to make our lives easier but just focusing on the countries that have more wolf coverage.
subDF %>% # take subDF and feed forward into
group_by(Country) %>% # a grouping step to bucket by Country
summarise(Count=n()) # then get the count of coverage by country
## # A tibble: 9 x 2
## Country Count
## <chr> <int>
## 1 Canada 7
## 2 India 2
## 3 Ireland 2
## 4 New Zealand 2
## 5 Russia 2
## 6 South Africa 1
## 7 Switzerland 1
## 8 United Kingdom 14
## 9 United States 133
Next, we’re going to further subset the data to just the few countries that have more articles on wolves.
subCountryDF <- subDF %>% # take subDF
filter(Country=="United States" |
Country=="United Kingdom" |
Country=="Canada") # only keep the articles that are coming from the US, Canada, or the UK
# This | vertical bar is the same as saying OR
# Save the output in subCountryDF
Finally, we can make a plot of sentiment across countries for wolves.
p <- ggplot(data=subCountryDF, aes(x=Country,y=Sentiment)) # use the subCountryDF object to initiate a plot of country and sentiment
p <- p + geom_boxplot() # add on a boxplot
p <- p + geom_hline(yintercept=0, linetype="dashed", col="red") # add a red dahsed line to demarcate where the positive vs. negative articles are located
p <- p + coord_flip() # switch the x and y-axis
p # display plot
The code below can be copied and pasted into RStudio Cloud for running directly.
###==============================
### Loading packages
###==============================
## Libraries for interacting with data
library(readr) # reads in spreadsheet data
library(dplyr) # lets you wrangle data more effectively
## Libraries for plotting
library(ggplot2) # plotting library
library(waffle) # square pie chart (waffle chart) library
## Libraries for spatial data
library(sf) # a nicer way to deal with spatial data
library(spData) # spatial datasets
## Libraries for text data
library(vader) # package that permits for sentiment analysis
library(tidytext) # package for wrangling text data
library(stringr) # package with helpful functions dealing with string, or character/text data.
library(tokenizers) # package to split strings into smaller entities (e.g. messages into words, ngrams, or individual letters)
###==============================
### Reading in the data
###==============================
CSPdf <- read_tsv("https://raw.githubusercontent.com/EA30POM/Fall2021/main/data/OTEC_CSP.tsv") # use the function read_tsv from readr to pull in a spreadsheet from this URL.
# Then store the output in an object called "CSPdf"
# CSPdf stands for Conservation Science Partners data frame
## Inspecting the data, version 1:
CSPdf %>% # take CSPdf and feed it into the next command
head( ) %>% # look at the first 6 rows
View( ) # open the spreadsheet viewer
# akin to x --> f(x) --> g(x)
## Alternatively, version 2:
View( head( CSPdf ) ) # akin to g( f( x ) )
###==============================
### Data summaries
###==============================
CSPdf %>% # feed CSPdf into the next command
group_by(Topic) %>% # bucket the data based on the Topics of each article
summarise(Count = n()) # get the counts across all of the articles for every topic
###==============================
### Subsetting the data
###==============================
subDF <- CSPdf %>% # take CSPdf and feed it forward into
filter(Taxon=="wolf") # the filter function, selecting only for the rows where Taxon is equal to wolf
# and assign the output to subDF, short for subset data frame
###==============================
### Summarizing the subset
###==============================
summaryDF <- subDF %>% # take the subset data, subDF, and feed it into
group_by(Topic) %>% # a grouping operation to bucket the data based on topic
summarise(Count=n()) # then get the count of articles by topic
###==============================
### Visualizing the distro
### of categories in the data
###==============================
names(summaryDF) <- c("name","vals") # changing column names for the summaryDF object
col_ramp <- RColorBrewer::brewer.pal(nrow(summaryDF), "Set3") # selecting a nicer set of colors
waffle(data.frame(summaryDF),colors=col_ramp) # create a waffle plot for the summaryDF object, with the color ramp we have
###==============================
### Crosstabs
###==============================
crossDF <- subDF %>% # start from our subset data, subDF
group_by(Topic,Portrayal) %>% # bucket the data based on Topic and Portrayal
summarise(Count=n()) # get the counts for each combination of Topic and Portrayal
# save the output in the object crossDF (short for crosstab data frame)
## Visualization of crosstabs
p <- ggplot(crossDF, aes(Topic, Portrayal, Count)) # initiate a plot using crossDF
p <- p + geom_point(aes(size = Count), colour = "skyblue") # plot points for each combination of topic and portrayal - change the size of the point based on the counts
p <- p + theme_bw() + labs(x="",y="",size="Count:") # modify the plot appearnace
p <- p + scale_size_continuous(range=c(1,10)) + geom_text(aes(label = Count)) # change the range of the point sizes and add text information about the crosstabs
p <- p + theme(axis.text.x = element_text(angle = 45, hjust=1),
legend.position = "top") # change the plot visualization
p # display the plot
###==============================
### Visualizing a numeric
### variable like sentiment
###==============================
p <- ggplot(data=subDF, aes(x=Sentiment)) # create a plot from the subDF object using the column Sentiment
p <- p + geom_histogram() # add a histogram to the plot
p <- p + theme_classic() # change the display of the plot
p <- p + labs(y="Count") # change the y-axis label
p # display the plot
p <- ggplot(data=subDF, aes(x=Topic,y=Sentiment)) # create a plot using the columns Topic and Sentiment from the subDF data
p <- p + geom_boxplot() # add a boxplot
p <- p + geom_hline(yintercept=0, linetype="dashed", col="red") # add a red dahsed line to demarcate where the positive vs. negative articles are located
p <- p + coord_flip() # switch the x- and y-axis
p # display plot
###==============================
### Visualizing the distro
### of articles across countries
###==============================
worldDF <- world # store the dataset for the world in worldDF, world data frame
speciesCountryCounts <- subDF %>% # feed the subsetDF forward
group_by(Country) %>% # bucket based on country
summarise(Count=n()) # get counts by country
speciesMap <- worldDF %>% # take the worldDF object and feed forward
left_join(speciesCountryCounts, by=c("name_long"="Country")) # combine with our count data by country,
# getting the data to match on the columns "name_long" (countries) and "Country"
# store outputs in speciesMap
## Create plot
p <- ggplot(data=speciesMap, aes(fill=Count)) # use the speciesMap object to create a map
p <- p + geom_sf(aes(fill=Count),color="gray47",size=0.05) # add on the boundaries of countries
p <- p + labs(x="",y="",fill="Count of wolf articles: ") # change the names of the variables as needed
p <- p + theme_minimal() # change the display style
p <- p + theme(legend.position = "top") # move the legend bar to the top
p # display the plot
###==============================
### Visualizing the distro
### of sentiment across countries
###==============================
subDF %>% # take subDF and feed forward into
group_by(Country) %>% # a grouping step to bucket by Country
summarise(Count=n()) # then get the count of coverage by country
## subset to the countries with the most articles
subCountryDF <- subDF %>% # take subDF
filter(Country=="United States" |
Country=="United Kingdom" |
Country=="Canada") # only keep the articles that are coming from the US, Canada, or the UK
# This | vertical bar is the same as saying OR
p <- ggplot(data=subCountryDF, aes(x=Country,y=Sentiment)) # use the subCountryDF object to initiate a plot of country and sentiment
p <- p + geom_boxplot() # add on a boxplot
p <- p + geom_hline(yintercept=0, linetype="dashed", col="red") # add a red dahsed line to demarcate where the positive vs. negative articles are located
p <- p + coord_flip() # switch the x and y-axis
p # display plot
# Save the output in subCountryDF