‘1955 Carpenter / Steelcraft school bus on 1954 GMC chassis HDR with ominous clouds’ by Darron Birgenheier, licensed under CC BY-SA 2.0 / modified from original

USA School Shootings

22…

… this is the number of school shootings that took place in the USA within the first 21 weeks of 2018. Statistically, this means more than one shooting per week. The Washington Post wrote that more people have been killed at schools this year than members of the US military have been killed while deployed.

We hear about these shootings quite often in the news, and after almost every one of them a new discussion about US gun laws arises. Nevertheless, I do not want to get into these discussions here but rather share some of my research with you. I was wondering if I could visualize USA school shootings on a map and decided to create an interactive dashboard to explore the data a little further. I found the School Shootings in the United States Wikipedia article, which I used as the data source for my project.

Dashboard Screenshot

We need to keep in mind that, hidden behind all these data and statistics, there are people. People who died, people who were injured, people who have been crying for their friends and family members, and many more. Feel free to explore the data on your own to get a feeling for all those incidents and victims:

Link to R-Shiny USA School Shootings Dashboard

One word of caution at this point: we cannot be sure that the article documents every single shooting, but the Wikipedia community works hard to keep the records up to date and adds new entries quickly. Nevertheless, the present dataset does not claim to be complete or to cover every incident that has ever happened.

The code for the dashboard can be found on my GitHub account:

R-Shiny USA School Shootings Dashboard - Code

For those who are interested in the data cleaning and preparation, I will go through this process step by step in the rest of the blog post.

Environment Setup

In order to read the HTML tables from the Wikipedia article we are going to use the rvest package. To clean the data we will use the tidyverse package, and to build an interactive map later we are going to use the leaflet package. Furthermore, we will need ggmap for geocoding.

if (!require(rvest)) {
  install.packages("rvest")
}
if (!require(tidyverse)) {
  install.packages("tidyverse")
}
if (!require(ggmap)) {
  # Install the development version of ggmap from GitHub
  devtools::install_github("dkahle/ggmap")
}
if (!require(leaflet)) {
  install.packages("leaflet")
}

library(rvest)
library(tidyverse)
library(ggmap)
library(leaflet)

Load Data from Wikipedia

To extract the data directly from the Wikipedia article we use the rvest package by Hadley Wickham. This package makes it easy to scrape data from HTML web pages. You can find further information on its GitHub page or in this short blog post by Hadley Wickham.

url <- "https://en.wikipedia.org/wiki/List_of_school_shootings_in_the_United_States"

# Read html and save it to the dashboard data folder for download
articleHTML <- url %>% 
  read_html()

write_html(articleHTML,
           "./usa_school_shootings_shiny/data/List_of_school_shootings_in_the_United_States.html")

# Extract tables and save them to a list
table_list <- articleHTML %>% 
  html_nodes("table") %>% 
  html_table()

# Convert all columns to character to avoid errors because of unclean data when binding rows
dt <- map(table_list, function(x) map(x,as.character)) %>% 
  bind_rows()

# Save raw data to dashboard data folder
saveRDS(dt, "./usa_school_shootings_shiny/data/raw.RDS")

head(dt)
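
It can also be helpful to take a quick look at what html_table() actually returned, since the individual tables do not necessarily share the same column names and bind_rows() fills missing columns with NA. A small check along these lines (my own addition, not part of the original pipeline):

# Number of tables extracted from the article
length(table_list)

# Compare the column names of the individual tables
map(table_list, names)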

Data Preparation

Data Cleaning

We need to perform some processing and cleaning steps before we can use the data in our dashboard. These steps include:

  • Remove the duplicated part of Date (e.g. 000000001764-07-26-0000July 26, 1764)
  • Remove the duplicated Location after the “!” character (e.g. Greencastle, Pennsylvania !Greencastle, Pennsylvania)
  • Convert Injuries and Deaths to integer (values like “?” or “1+” will be converted to NA)

dt <- dt %>% 
  mutate(
    # If Date contains "-0000" then remove the first part from it (first 24 characters)
    Date = ifelse(str_detect(Date, "-0000"), 
                  str_sub(Date, 24),
                  Date),
    # Convert Date to Date type
    Date = parse_date(Date, format = "%B %d, %Y", locale = locale("en")),
    year = as.integer(format(Date, "%Y")),
    century = as.integer(format(Date, "%C")),
    decade = floor(year/10)*10,
    # If Location contains "!", then remove part after that character
    Location = ifelse(str_detect(Location, "!"),
                      str_sub(Location, 1, str_locate(Location, " !")[,1] - 1),
                      Location),
    # Count words in Location for correct State extraction
    words_in_location = str_count(Location, '\\w+'),
    ### Extract State from Location variable ###
    # If City provided (words_in_location > 1), split City and State to only get State
    State = ifelse(words_in_location > 1,
                   str_split_fixed(Location, ",", n = 2)[,2],
                   Location),
    # Trim whitespace and remove "." from abbreviations
    State = gsub("\\.", "", trimws(State)),
    # Correct state abbreviations using the R state.abb and state.name dataset
    State = ifelse(State %in% state.abb,
                   state.name[match(State, state.abb)],
                   State),
    ### End: Extract State from Location variable ###
    # Convert Deaths and Injuries to integer
    Deaths = as.integer(Deaths),
    Injuries = as.integer(Injuries),
    # Create html popup message for later plot
    popup = paste0("<b>Date: ", Date, "</b><br/>",
                   "<b>Deaths: ", Deaths, "</b><br/>",
                   "<b>Injuries: ", Injuries,"</b><br/>",
                   "<br/>",
                   "<b>Description: </b><br/>",
                   Description)
  ) %>% 
  select(-words_in_location)

dt
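
Since values like “?” or “1+” are turned into NA during the integer conversion, it can be worth checking how many casualty values were lost along the way. A quick sanity check (my own addition, not needed for the dashboard):

# Count how many Deaths / Injuries values could not be parsed as integers
dt %>% 
  summarise(missing_deaths = sum(is.na(Deaths)),
            missing_injuries = sum(is.na(Injuries)))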

Geocoding

Now that we have cleaned the data, we can convert the Location column to latitude and longitude for our plot by using the geocode() function from the ggmap package. You’ll find a small introduction in my previous blog post, where I used the package to geocode addresses from San Francisco Police Department data.

# You need to insert your own Google Maps API key here
api_key <- read_lines("./additional_data/api-key.txt")
register_google(api_key, account_type = "standard")

# Get coordinates from the Google Maps Geocoding API
locations <- geocode(dt$Location, messaging = FALSE)

# add latitude and longitude data to our data frame
dt <- bind_cols(dt, locations)

saveRDS(dt, "./usa_school_shootings_shiny/data/cleaned.RDS")
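
The Geocoding API occasionally fails to resolve a location; in that case geocode() returns NA coordinates and leaflet would silently drop those markers later on. A small check for such rows might look like this (again just a sketch, not part of the original code):

# List incidents without coordinates so they can be corrected or geocoded again
dt %>% 
  filter(is.na(lon) | is.na(lat)) %>% 
  select(Date, Location)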

Some Plots

Leaflet Map

Having finished the data preprocessing, we are going to use the leaflet package to create an interactive map. I also used this package within the Shiny dashboard. Here we will show all incidents from 2018:

leafletColors <- colorFactor(palette = c("#e34a33", "#fdbb84"), 
                             domain = c("Incidents with deaths", "No deaths"))

# Subset the data first so that colors and radii refer to the same rows as the markers
dt_2018 <- subset(dt, year == 2018)

leaflet(data = dt_2018) %>%
  addTiles() %>% 
  addCircleMarkers(lng = ~lon,
                   lat = ~lat,
                   popup = ~popup,
                   label = ~Location,
                   # Red for incidents with deaths, light orange otherwise
                   color = ~ifelse(Deaths > 0, "#e34a33", "#fdbb84"),
                   opacity = 0.3,
                   fillOpacity = 0.3,
                   # Scale the marker size with the total number of victims
                   radius = ~sqrt(Deaths + Injuries) + 6
        ) %>% 
  addLegend(position = "topright", 
            pal = leafletColors, 
            values = c("Incidents with deaths", "No deaths"))

State Statistics

As in the dashboard, we are going to create two summary plots showing the absolute count and the share of dead and injured people per state (again only for 2018).

myFillColors <- c(Deaths = "#e34a33", Injuries = "#fdbb84")

dt %>%
  filter(year == 2018) %>% 
  group_by(State) %>% 
  summarise(Deaths = sum(Deaths, na.rm = TRUE),
            Injuries = sum(Injuries, na.rm = TRUE),
            Total = Deaths + Injuries) %>% 
  gather(key = category, value = count, Deaths, Injuries) %>% 
  ggplot() +
  geom_col(aes(x = reorder(State, Total), y = count, fill = category),
           alpha = 0.7, width = 0.8) +
  scale_fill_manual(values = myFillColors,
                    guide = guide_legend(title = NULL, keywidth = 1, keyheight = 1)) +
  xlab("State") +
  ylab("Injured and Death People") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "bottom")

dt %>%
  filter(year == 2018) %>% 
  group_by(State) %>% 
  summarise(Deaths = sum(Deaths, na.rm = TRUE),
            Injuries = sum(Injuries, na.rm = TRUE),
            Total = Deaths + Injuries) %>% 
  gather(key = category, value = count, Deaths, Injuries) %>% 
  ggplot() +
  geom_col(aes(x = reorder(State, Total), y = count, fill = category),
           alpha = 0.7, width = 0.8, position = "fill") +
  scale_fill_manual(values = myFillColors,
                    guide = guide_legend(title = NULL, keywidth = 1, keyheight = 1)) +
  scale_y_continuous(labels = scales::percent) +
  xlab("State") +
  ylab("Share") +
  coord_flip() +
  theme_minimal() +
  theme(legend.position = "bottom")
## Warning: Removed 2 rows containing missing values (geom_col).
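
The decade column created during the data cleaning is not used in the two plots above, but it allows a quick look at how the number of recorded incidents has developed over time. A simple sketch of such a summary (only an illustration, not part of the dashboard):

dt %>% 
  filter(!is.na(decade)) %>% 
  count(decade) %>% 
  ggplot() +
  geom_col(aes(x = decade, y = n), fill = "#e34a33", alpha = 0.7) +
  xlab("Decade") +
  ylab("Recorded incidents") +
  theme_minimal()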
