Consider this Marine Mammal Inventory Report data set from the National Oceanic and Atmospheric Administration.
It has the location of every Beluga and Orca whale in the United States since the ’70s.
Currently, the data frame is in the environment and has 494 rows.
This view shows the first six rows.
Each animal is represented by a NOAA.ID (because some do not have names) and might show up multiple times in the data frame.
For example, the Beluga Whale Blanchon is NOA0000719 and has been at three facilities but his latest Status shows him as “DEAD”.
Please create a data frame of Orca whales only from the bo data frame.
We’ll name the new data frame orcas.
# Add to the line below
orcas <- bo %>%
unique(orcas$Species)
orcas <- bo %>% filter(Species=="Killer Whale; Orca")
unique(orcas$Species)
Hint: Delete the commented line out before submitting.
Drop the column/variable Species since there’s only one variable in it now.
# Add to the line below
orcas2 <- orcas
colnames(orcas2)
orcas2 <- select(orcas, -Species)
colnames(orcas2)
Sort the orcas data frame based on the Facility column in alphabetical order.
# Add to the line below
orcas3 <- orcas2
head(orcas3)
orcas3 <- orcas2 %>% arrange(Facility)
head(orcas3)
Create a new column called status_num to fill with a 1 if Status is “DEAD” or 0 if alive. Save the transformed data frame back into orcas
# Transform the code below
orcas4 <- orcas3
head(orcas4)
orcas4 <- mutate(orcas3, status_num=case_when(Status=="DEAD" ~ 1, TRUE ~ 0))
head(orcas4)
Hint: You may want to use the dplyr case_when()
function nested in another function.
Count up the number of total Orcas by counting the unique NOAA.IDs and how many have died by adding up the status_num. You want to summarize with two columns: total (by counting distinct NOAA.IDS and adding up number of those that have died)
# Add to the line below
alive_dead <- orcas4
alive_dead
alive_dead <- orcas4 %>%
summarize(total=n_distinct(NOAA.ID), died=sum(status_num))
Hint: You may want to use the dplyr n_distinct()
function nested in another function.
What’s the percent of dead Orcas?
Create a new column/variable called percent_dead.
You do not need to round the results.
# Add to the line below
perc_dead <- alive_dead
perc_dead
perc_dead <- alive_dead %>%
mutate(percent_dead=died/total*100)
perc_dead
Hint: But don’t forget to multiply the results by 100 to get the percent.
Now figure how the percent of Orcas still alive.
Create a new column/variable called percent_alive.
You do not need to round the results.
# Add to the line below
percent_dead_alive <- perc_dead
percent_dead_alive
percent_dead_alive <- perc_dead %>%
mutate(percent_alive=100-percent_dead)
percent_dead_alive
Consider the bo data frame again.
Which Facility had the most Orcas and Belugas combined?
Go ahead and call the new column mammal_count.
List them from highest count to lowest.
# Add to the line below
facilities_count <- bo
facilities_count
facilities_count <- bo %>%
group_by(Facility) %>%
summarize(mammal_count=n()) %>%
arrange(-mammal_count)
facilities_count
Hint: n()
is used to count. But use it correctly.
Consider the bo data set again.
Summarize the data and count the total number of Orcas and Belugas (separately) per facility from the Species variable. Use the mammal_count as a column name again.
Then turn the tall data into a wide data frame that has these columns: Facility, Belugas, Orcas.
# Add to the line below
facilities_count <- bo %>%
group_by(Facility, ________) %>%
summarize(_______=___) %>%
______(_______, _____)
facilities_count
facilities_count <- bo %>%
group_by(Facility, Species) %>%
summarize(mammal_count=n()) %>%
spread(Species, mammal_count)
facilities_count
Hint: You may want to use the bo data frame. And eventually the spread()
function from the tidyr package.
Here’s a wide data frame with Owners and percent of Belugas compared to Orca whales owned.
Make this dataframe called owner_percent tidy for me (So the mammal types are in one column and the total is in another).
Call the new columns Species and Total.
# Add to the line below
owner_percent %>%
owner_percent %>%
gather("Species", "Total", 2:3)
Hint: You may want to use the gather()
function.
Consider these two data sets:
Join these two data sets and call it owner_total_percent.
Start with owner_percent and have it joined with owner_total in a left direction.
# Add to the line below
owner_percent_total <- _____join(________, ________)
owner_percent_total
owner_percent_total <- left_join(owner_percent, owner_total)
owner_percent_total
Let’s load the stringr package.
Consider the bo data frame again.
Let’s use the functions in it to manipulate the strings in this data frame.
Turn the names in Name to lower case.
bo$Name <- ________(bo$Name)
bo
bo$Name <- str_to_lower(bo$Name)
bo
Convert the variables in Sex to title case.
bo$Name <- ________(bo$Name)
bo
bo$Sex <- str_to_title(bo$Sex)
bo2 <- bo
bo2
Please create a new variable called sex_species that combines Sex and Species columns with a space in between.
bo3 <- mutate(bo2,
_________=_____(_______________________))
bo3
bo3 <- mutate(bo2,
sex_species=str_c(Sex, " ", Species))
Now, replace all mentions of " Killer Whale;" and " Whale" from the sex_species column with a blank.
So “Male Killer Whale; Orca” turns in to “Male Orca” and “Male Beluga Whale” turns into “Male Beluga”
bo4 <- mutate(bo3,
sex_species=________(_______________________)) %>%
mutate(sex_species=________(_______________________))
bo4$sex_species
bo4 <- mutate(bo3,
sex_species=str_replace(sex_species, " Killer Whale;", "")) %>%
mutate(sex_species=str_replace(sex_species, " Whale", ""))
bo4$sex_species
Hint: You may need to use the str_replace function more than once. Also, pay attention to the order. That will be important.
Let’s look at the first six rows of the bo data frame again.
Create a new data frame called seaworlds and use filter to keep only the rows that mention “Sea World” in the Facility column.
seaworlds <- filter(bo,
______(_____________))
seaworlds
seaworlds <- filter(bo,
str_detect(Facility, "Sea World"))
seaworlds
Consider the original bo data frame.
Turn Captivity.Date into a date data R can recognize.
We’ll use the lubridate package.
bo_dates <- bo %>%
mutate(Captivity.Date=____(_______))
bo_dates
bo_dates <- bo %>%
mutate(Captivity.Date=dmy(Captivity.Date))
bo_dates
Create a new column called month_number and extract the month (as a number) from the Captivity.Date.
bo_dates2 <- bo_dates %>%
mutate(month_number=____(_______))
bo_dates2
bo_dates2 <- bo_dates %>%
mutate(month_number=month(Captivity.Date))
bo_dates2
Create a new column called month_number and extract the month (as a label) from the Captivity.Date.
bo_dates3 <- bo_dates2 %>%
mutate(month_label=____(_______, ______))
bo_dates3
bo_dates3 <- bo_dates2 %>%
mutate(month_label=month(Captivity.Date, label=T))
bo_dates3
Consider the modified bo_dates3 data frame we’ve created.
Based on available data, can you figure out the average age (in years) of Beluga Whales and Orca Whales (separately) when they join a facility (aka Facility.Date)?
Summarize your dataframe (group by Facility and Species) and create a new column called average_years.
bo_dates3 %>%
bo_dates3 %>%
mutate(Birth.Date=dmy(Birth.Date),
Facility.Date=dmy(Facility.Date),
age=difftime(Facility.Date, Birth.Date),
age_years=as.numeric(age)/60/60/24/365.25) %>%
group_by(Facility, Species) %>%
summarize(average_years=mean(age_years, na.rm=T))
Hint: This one’s tough. You’ll need to convert Birth.Date and Facility.Date into dates. Then you’ll need to figure out the difference in time between them. And then you’ll need to convert that into a number and if it’s in seconds, you need to convert that into years. The math for that is seconds/60/60/24/365.25. And then group by how I described. And then when you summarize, be sure when you’re finding the average to remove NAs from the calculations. Good luck.
Alright, we can’t really tell the ages because of negative years? It looks like this data is pretty messy. Probably human error.
But great job! You’ve made it all the way through this chapter!