Chapter 3

Transforming and analyzing data

Consider this Marine Mammal Inventory Report data set from the National Oceanic and Atmospheric Administration.

It has the location of every Beluga and Orca whale in the United States since the ’70s.

Currently, the data frame is in the environment and has 494 rows.

This view shows the first six rows.

Each animal is represented by a NOAA.ID (because some do not have names) and might show up multiple times in the data frame.

For example, the Beluga Whale Blanchon is NOA0000719 and has been at three facilities but his latest Status shows him as “DEAD”.

Filter

Please create a data frame of Orca whales only from the bo data frame.

We’ll name the new data frame orcas.

# Add to the line below 

orcas <- bo %>%
  
unique(orcas$Species)

orcas <- bo %>% filter(Species=="Killer Whale; Orca")
  
unique(orcas$Species)

Hint: Delete the commented line out before submitting.

Select

Drop the column/variable Species since there’s only one variable in it now.

# Add to the line below

orcas2 <- orcas
  
colnames(orcas2)

orcas2 <- select(orcas, -Species)

colnames(orcas2)

Arrange

Sort the orcas data frame based on the Facility column in alphabetical order.

# Add to the line below

orcas3 <- orcas2
  

head(orcas3)

orcas3 <- orcas2 %>% arrange(Facility)
  

head(orcas3)

Mutate

Create a new column called status_num to fill with a 1 if Status is “DEAD” or 0 if alive. Save the transformed data frame back into orcas

# Transform the code below

orcas4 <- orcas3
  

head(orcas4)

orcas4 <- mutate(orcas3, status_num=case_when(Status=="DEAD" ~ 1, TRUE ~ 0))
  

head(orcas4)

Hint: You may want to use the dplyr case_when() function nested in another function.

Summarize

Count up the number of total Orcas by counting the unique NOAA.IDs and how many have died by adding up the status_num. You want to summarize with two columns: total (by counting distinct NOAA.IDS and adding up number of those that have died)

# Add to the line below

alive_dead <- orcas4
  
alive_dead

alive_dead <- orcas4 %>%
  summarize(total=n_distinct(NOAA.ID), died=sum(status_num))

Hint: You may want to use the dplyr n_distinct() function nested in another function.

Percent

What’s the percent of dead Orcas?

Create a new column/variable called percent_dead.

You do not need to round the results.

# Add to the line below

perc_dead <- alive_dead

perc_dead

perc_dead <- alive_dead %>%
  mutate(percent_dead=died/total*100)

perc_dead

Hint: But don’t forget to multiply the results by 100 to get the percent.

Now figure how the percent of Orcas still alive.

Create a new column/variable called percent_alive.

You do not need to round the results.

# Add to the line below

percent_dead_alive <- perc_dead

percent_dead_alive

percent_dead_alive <- perc_dead %>%
  mutate(percent_alive=100-percent_dead)

percent_dead_alive

Group By

Consider the bo data frame again.

Which Facility had the most Orcas and Belugas combined?

Go ahead and call the new column mammal_count.

List them from highest count to lowest.

# Add to the line below

facilities_count <- bo

facilities_count

facilities_count <- bo %>%
  group_by(Facility) %>%
  summarize(mammal_count=n()) %>%
  arrange(-mammal_count)

facilities_count

Hint: n() is used to count. But use it correctly.

Tidying and joining data

Consider the bo data set again.

Turn a tall data frame wide

Summarize the data and count the total number of Orcas and Belugas (separately) per facility from the Species variable. Use the mammal_count as a column name again.

Then turn the tall data into a wide data frame that has these columns: Facility, Belugas, Orcas.

# Add to the line below

facilities_count <- bo %>%
   group_by(Facility, ________) %>%
   summarize(_______=___) %>%
   ______(_______, _____)

facilities_count

facilities_count <- bo %>%
  group_by(Facility, Species) %>%
  summarize(mammal_count=n()) %>%
  spread(Species, mammal_count)

facilities_count

Hint: You may want to use the bo data frame. And eventually the spread() function from the tidyr package.

Turn a wide data frame into a tall tidy data frame

Here’s a wide data frame with Owners and percent of Belugas compared to Orca whales owned.

Make this dataframe called owner_percent tidy for me (So the mammal types are in one column and the total is in another).

Call the new columns Species and Total.

# Add to the line below

owner_percent %>%

owner_percent %>%
  gather("Species", "Total", 2:3)

Hint: You may want to use the gather() function.

Join Data

Consider these two data sets:

owner_percent

owner_total

Join these two data sets and call it owner_total_percent.

Start with owner_percent and have it joined with owner_total in a left direction.

# Add to the line below

owner_percent_total <- _____join(________, ________)
  
owner_percent_total

owner_percent_total <- left_join(owner_percent, owner_total)
  
owner_percent_total

Handling strings

Let’s load the stringr package.

Consider the bo data frame again.

Lower case

Let’s use the functions in it to manipulate the strings in this data frame.

Turn the names in Name to lower case.

bo$Name <- ________(bo$Name)

bo

bo$Name <- str_to_lower(bo$Name)

bo

Title case

Convert the variables in Sex to title case.

bo$Name <- ________(bo$Name)

bo

bo$Sex <- str_to_title(bo$Sex)

bo2 <- bo
bo2

Combine string

Please create a new variable called sex_species that combines Sex and Species columns with a space in between.

bo3 <- mutate(bo2, 
             _________=_____(_______________________))
bo3

bo3 <- mutate(bo2,
             sex_species=str_c(Sex, " ", Species))

Substitute string

Now, replace all mentions of " Killer Whale;" and " Whale" from the sex_species column with a blank.

So “Male Killer Whale; Orca” turns in to “Male Orca” and “Male Beluga Whale” turns into “Male Beluga”

bo4 <- mutate(bo3, 
             sex_species=________(_______________________)) %>%
      mutate(sex_species=________(_______________________))

bo4$sex_species

bo4 <- mutate(bo3,
             sex_species=str_replace(sex_species, " Killer Whale;", "")) %>%
      mutate(sex_species=str_replace(sex_species, " Whale", ""))

bo4$sex_species

Hint: You may need to use the str_replace function more than once. Also, pay attention to the order. That will be important.

Detect strings

Let’s look at the first six rows of the bo data frame again.

Create a new data frame called seaworlds and use filter to keep only the rows that mention “Sea World” in the Facility column.

seaworlds <- filter(bo,
                    ______(_____________))

seaworlds

seaworlds <- filter(bo,
             str_detect(Facility, "Sea World"))

seaworlds

Dealing with dates

Consider the original bo data frame.

Convert this date

Turn Captivity.Date into a date data R can recognize.

We’ll use the lubridate package.

bo_dates <- bo %>%
  mutate(Captivity.Date=____(_______))

bo_dates

bo_dates <- bo %>%
  mutate(Captivity.Date=dmy(Captivity.Date))

bo_dates

Extract the month

Create a new column called month_number and extract the month (as a number) from the Captivity.Date.

bo_dates2 <- bo_dates %>%
  mutate(month_number=____(_______))

bo_dates2

bo_dates2 <- bo_dates %>%
  mutate(month_number=month(Captivity.Date))

bo_dates2

Create a new column called month_number and extract the month (as a label) from the Captivity.Date.

bo_dates3 <- bo_dates2 %>%
  mutate(month_label=____(_______, ______))

bo_dates3

bo_dates3 <- bo_dates2  %>%
  mutate(month_label=month(Captivity.Date, label=T))

bo_dates3

Time difference

Consider the modified bo_dates3 data frame we’ve created.

Based on available data, can you figure out the average age (in years) of Beluga Whales and Orca Whales (separately) when they join a facility (aka Facility.Date)?

Summarize your dataframe (group by Facility and Species) and create a new column called average_years.

bo_dates3 %>%

bo_dates3 %>%
  mutate(Birth.Date=dmy(Birth.Date), 
         Facility.Date=dmy(Facility.Date),
         age=difftime(Facility.Date, Birth.Date),
         age_years=as.numeric(age)/60/60/24/365.25) %>% 
  group_by(Facility, Species) %>% 
  summarize(average_years=mean(age_years, na.rm=T))

Hint: This one’s tough. You’ll need to convert Birth.Date and Facility.Date into dates. Then you’ll need to figure out the difference in time between them. And then you’ll need to convert that into a number and if it’s in seconds, you need to convert that into years. The math for that is seconds/60/60/24/365.25. And then group by how I described. And then when you summarize, be sure when you’re finding the average to remove NAs from the calculations. Good luck.

Alright, we can’t really tell the ages because of negative years? It looks like this data is pretty messy. Probably human error.

But great job! You’ve made it all the way through this chapter!