Chapter 4

ggplot2

Consider this IMDB data set on 5,000 movies, loaded in the environment as a data frame called movies.

This is what glimpse(movies) outputs.

## Observations: 3,258
## Variables: 29
## $ color                     <chr> "Color", "Color", "Color", "Color", ...
## $ director_name             <chr> "James Cameron", "Gore Verbinski", "...
## $ num_critic_for_reviews    <int> 723, 302, 813, 462, 392, 324, 635, 6...
## $ duration                  <int> 178, 169, 164, 132, 156, 100, 141, 1...
## $ director_facebook_likes   <int> 0, 563, 22000, 475, 0, 15, 0, 0, 0, ...
## $ actor_3_facebook_likes    <int> 855, 1000, 23000, 530, 4000, 284, 19...
## $ actor_2_name              <chr> "Joel David Moore", "Orlando Bloom",...
## $ actor_1_facebook_likes    <int> 1000, 40000, 27000, 640, 24000, 799,...
## $ gross                     <int> 760505847, 309404152, 448130642, 730...
## $ genres                    <chr> "Action|Adventure|Fantasy|Sci-Fi", "...
## $ actor_1_name              <chr> "CCH Pounder", "Johnny Depp", "Tom H...
## $ movie_title               <chr> "Avatar ", "Pirates of the Caribbean...
## $ num_voted_users           <int> 886204, 471220, 1144337, 212204, 383...
## $ cast_total_facebook_likes <int> 4834, 48350, 106759, 1873, 46055, 20...
## $ actor_3_name              <chr> "Wes Studi", "Jack Davenport", "Jose...
## $ facenumber_in_poster      <int> 0, 0, 0, 1, 0, 1, 4, 0, 0, 2, 1, 0, ...
## $ plot_keywords             <chr> "avatar|future|marine|native|paraple...
## $ movie_imdb_link           <chr> "http://www.imdb.com/title/tt0499549...
## $ num_user_for_reviews      <int> 3054, 1238, 2701, 738, 1902, 387, 11...
## $ language                  <chr> "English", "English", "English", "En...
## $ country                   <chr> "USA", "USA", "USA", "USA", "USA", "...
## $ content_rating            <chr> "PG-13", "PG-13", "PG-13", "PG-13", ...
## $ budget                    <int> 237000000, 300000000, 250000000, 263...
## $ title_year                <int> 2009, 2007, 2012, 2012, 2007, 2010, ...
## $ actor_2_facebook_likes    <int> 936, 5000, 23000, 632, 11000, 553, 2...
## $ imdb_score                <dbl> 7.9, 7.1, 8.5, 6.6, 6.2, 7.8, 7.5, 6...
## $ aspect_ratio              <dbl> 1.78, 2.35, 2.35, 2.35, 2.35, 1.85, ...
## $ movie_facebook_likes      <int> 33000, 0, 164000, 24000, 0, 29000, 1...
## $ genre1                    <chr> "Action", "Action", "Action", "Actio...

Make a scatterplot

Let’s start easy with a simple scatter plot comparing box office gross to the budget.

ggplot(______) + geom_________(aes(x=____,y=_____))
ggplot(movies) + geom_point(aes(x=gross,y=budget))

Change the color

Let’s change the color of the circles to blue.

ggplot(______) + geom_________(aes(x=_____,y=_____), ________)
ggplot(movies) + geom_point(aes(x=gross,y=budget), color="blue")

Make a scatterplot with color groups

Color the points by content_rating.

ggplot(______) + geom_________(aes(x=_____, y=______, _______=_____))
ggplot(movies) + geom_point(aes(x=gross,y=budget, color=content_rating))

Did you notice where color sits this time? It is inside aes(), before the second-to-last parenthesis, because it is mapped to a variable instead of being set to a fixed value.
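The difference between setting a color and mapping one can be checked directly. The sketch below uses a tiny made-up data frame called toy (a hypothetical stand-in for movies, with invented values) and layer_data(), which returns the values ggplot2 computed for a layer.

```r
library(ggplot2)

# Hypothetical toy data standing in for `movies`; the values are invented.
toy <- data.frame(
  gross          = c(10, 25, 40, 55),
  budget         = c(5, 20, 30, 60),
  content_rating = c("PG", "PG-13", "R", "PG")
)

# Outside aes(): a constant. Every point is blue and no legend is drawn.
p_set <- ggplot(toy) +
  geom_point(aes(x = gross, y = budget), color = "blue")

# Inside aes(): a mapping. Each content_rating value gets its own color,
# and ggplot2 adds a legend.
p_mapped <- ggplot(toy) +
  geom_point(aes(x = gross, y = budget, color = content_rating))

# Inspect the colors ggplot2 actually assigned to the points.
unique(layer_data(p_set)$colour)             # one constant color
length(unique(layer_data(p_mapped)$colour))  # one color per rating
```

If you put color="blue" inside aes() by mistake, ggplot2 treats "blue" as a category label and picks its own color for it, which is a common source of confusion.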

Bar plot

Make a bar chart that counts up the number of titles per year (title_year).

ggplot(______,
             aes(x=_________)) +
  geom__________()
ggplot(movies,
             aes(x=title_year)) +
  geom_bar()

Stacked bar plot ver. 1

Break each year's count of movies down by content_rating to create a stacked bar chart.

ggplot(______,
             aes(x=_________,_________)) +
  geom__________()
ggplot(movies,
             aes(x=title_year, fill=content_rating)) +
  geom_bar()

Hint: You may want to use the fill argument in the aes().

Stacked bar plot ver. 2

Great, now split up the bars so they’re not stacked but next to each other.

And we’ll focus on movies released after 2001 (title_year is the variable).

movies %>% 
  filter(___________) %>%
  ggplot(aes(x=_________,fill=________)) +
  geom__________(________________)
movies %>% 
  filter(title_year>2001) %>%
  ggplot(aes(x=title_year,fill=content_rating)) +
  geom_bar(position="dodge")

Hint: You may want to use the position argument in the geom_bar() function.

Stacked bar plot ver. 3

Alright, let’s make a percent stacked chart this time:

movies %>% 
  filter(___________) %>%
  ggplot(aes(x=_________,fill=_________)) +
  geom__________(position=________)
movies %>% 
  filter(title_year>2001) %>%
  ggplot(aes(x=title_year,fill=content_rating)) +
  geom_bar(position="fill")
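The three bar charts above differ only in the position argument. Here is a minimal side-by-side sketch on a made-up data frame (toy and its values are invented for illustration), with a quick check of what position="fill" does to the bar heights.

```r
library(ggplot2)

# Hypothetical toy data; the values are invented for illustration.
toy <- data.frame(
  title_year     = c(2002, 2002, 2002, 2003, 2003),
  content_rating = c("PG", "R", "R", "PG", "R")
)

base <- ggplot(toy, aes(x = title_year, fill = content_rating))

p_stack <- base + geom_bar()                    # default: position = "stack"
p_dodge <- base + geom_bar(position = "dodge")  # bars side by side
p_fill  <- base + geom_bar(position = "fill")   # stacked, rescaled to proportions

# With position = "fill", each year's stack tops out at 1,
# so the y axis shows shares instead of counts.
d <- layer_data(p_fill)
tapply(d$ymax, as.numeric(d$x), max)
```

Use "stack" or "dodge" when the raw counts matter, and "fill" when you care about the mix of ratings within each year.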

Customizing charts

Another bar chart

Let’s look at box office results for all the movies that James Cameron has directed (variable is director_name).

movies %>% 
  filter(__________) %>% 
  ggplot(aes(x=_________,y=___________)) +
  geom_bar(___________)
movies %>% 
  filter(director_name=="James Cameron") %>%
  ggplot(aes(x=movie_title,y=gross)) +
  geom_bar(stat="identity")

Hint: You may want to pass the argument stat= to the geom_bar() function. What do you fill with stat? You’ll need to check your notes.
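By default geom_bar() counts rows, which is why it needs stat="identity" when you already have the bar heights in a column. ggplot2 also provides geom_col(), which uses the values as they are. A small sketch on invented data (toy is hypothetical) showing the two give the same bars:

```r
library(ggplot2)

# Hypothetical toy data; titles and grosses are invented.
toy <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  gross       = c(120, 80, 200)
)

# Bar heights taken straight from the gross column.
p_identity <- ggplot(toy, aes(x = movie_title, y = gross)) +
  geom_bar(stat = "identity")

# geom_col() is the shorthand for the same thing.
p_col <- ggplot(toy, aes(x = movie_title, y = gross)) +
  geom_col()

# Both layers end up with the same bar heights.
layer_data(p_identity)$y
layer_data(p_col)$y
```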

Flip that chart

Transpose that chart so that the movies are on the y axis instead of the x axis (without swapping the x and y mappings in the code above).

movies %>% 
  filter(__________) %>% 
  ggplot(aes(x=movie_title,y=gross)) +
  geom_bar(___________) +
  __________()
movies %>% 
  filter(director_name=="James Cameron") %>%
  ggplot(aes(x=movie_title,y=gross)) +
  geom_bar(stat="identity") +
  coord_flip()

Hint: What’s that line you need to add? You’re flipping the coordinates.

Reorder the labels in the chart

Let’s recreate the chart so that the movies listed are in order of release date (variable is title_year).

Remember, you’ll need the forcats package and its fct_reorder() function.

movies %>% 
  filter(__________) %>%
  ggplot(aes(x=fct_reorder(_________,____(_______)), y=___________)) +
  geom_bar(___________) +
  __________()
movies %>% 
  filter(director_name=="James Cameron") %>%
  ggplot(aes(x=fct_reorder(movie_title,desc(title_year)), y=gross)) +
  geom_bar(stat="identity") +
  coord_flip()

Hint: I’ve already given you the fct_reorder() function! Just look up how it works so you use it right.
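If fct_reorder() still feels mysterious, it helps to look at the factor levels it produces, since ggplot2 draws a discrete axis in level order. The titles and years below are invented for illustration. The .desc argument shown here is an alternative to wrapping the sorting variable in desc() as the exercise does.

```r
library(forcats)

# Hypothetical titles and release years, invented for illustration.
title <- c("Film B", "Film A", "Film C")
year  <- c(2009, 1997, 1984)

# A plain factor orders its levels alphabetically.
levels(factor(title))
# "Film A" "Film B" "Film C"

# fct_reorder() sorts the levels by a second variable, ascending.
levels(fct_reorder(title, year))
# "Film C" "Film A" "Film B"

# .desc = TRUE reverses that order.
levels(fct_reorder(title, year, .desc = TRUE))
# "Film B" "Film A" "Film C"
```

Because coord_flip() draws the first level at the bottom of the axis, the descending order puts the earliest film at the top of the flipped chart.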

Style it

Let’s fix the x-axis and y-axis labels (“Movie” and “Box Office Gross”), add a title to the chart (“How James Cameron movies performed at the box office”), and add a subtitle saying where the data came from (“Source: IMDB.com”).

movies %>% 
  filter(____________) %>%
  ggplot(aes(x=fct_reorder(__________,desc(_______)), y=_____)) +
  geom_bar(stat="identity") +
  coord_flip() +
  ____(x=___________,
       y=___________,
       title=___________,
       sub____=___________) +
  theme_minimal()
movies %>% 
  filter(director_name=="James Cameron") %>%
  ggplot(aes(x=fct_reorder(movie_title,desc(title_year)), y=gross)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x="Movie",
       y="Box Office Gross",
       title="How James Cameron movies performed at the box office",
       subtitle="Source: IMDB.com") +
  theme_minimal()

Hint: You might want to use the labs() function. But you’ll have to look up the rest.
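labs() also takes a caption argument, which ggplot2 places at the bottom of the chart by default and is a common home for source credits. A minimal sketch on invented data (toy and the label text are placeholders, not part of the exercise):

```r
library(ggplot2)

# Hypothetical toy data; titles and grosses are invented.
toy <- data.frame(
  movie_title = c("Film A", "Film B"),
  gross       = c(120, 80)
)

p <- ggplot(toy, aes(x = movie_title, y = gross)) +
  geom_col() +
  coord_flip() +
  labs(x = "Movie",
       y = "Box Office Gross",
       title = "A placeholder title",
       subtitle = "The subtitle sits right under the title",
       caption = "Source: IMDB.com") +  # the caption goes below the plot
  theme_minimal()

p
```

Note that the labels follow the aesthetics, not the screen position: after coord_flip(), x="Movie" still labels the movie axis even though it is now drawn vertically.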

Color palettes

Let’s look at Wes Anderson’s box office performance using the same code as above, but with Wes Anderson substituted for James Cameron.

And for fun, we’ll use the wesanderson color palette from his first movie, Bottle Rocket (BottleRocket1).

Because you’re changing the color of the bars based on the movie_title variable, you need to add that to the aes().

Keep the chart title similar to the previous chart’s, but with Wes Anderson instead of James Cameron.

library(wesanderson)
movies %>% 
  filter(director_name=="Wes Anderson") %>%
  ggplot(aes(x=fct_reorder(_________,desc(_______)), y=___________, ____=_____________)) +
  geom_bar(___________) +
  __________() +
  ____(____=___________,
           ____=___________,
           ______=__________________,
           ______=___________________) +
  theme_minimal() +
  ______________(values=wes_palette("_________"), guide="none")
movies %>% 
  filter(director_name=="Wes Anderson") %>%
  ggplot(aes(x=fct_reorder(movie_title,desc(title_year)), y=gross, fill=movie_title)) +
  geom_bar(stat="identity") +
  coord_flip() +
  labs(x="Movie",
       y="Box Office Gross",
       title="How Wes Anderson movies performed at the box office",
       subtitle="Source: IMDB.com") +
  theme_minimal() +
  scale_fill_manual(values=wes_palette("BottleRocket1"), guide="none")

Hint: In the last line, you’re using a scale function that lets you change the fill colors manually. The wesanderson package documentation lists the available palettes.
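The manual fill scale works with any vector of colors, so you can try it without the wesanderson package. In this sketch my_colors is a hand-typed, hypothetical vector standing in for what wes_palette() would supply, and toy is invented data.

```r
library(ggplot2)

# Hypothetical toy data; titles and grosses are invented.
toy <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  gross       = c(120, 80, 200)
)

# Hand-picked colors, assumed here in place of a wes_palette() call.
my_colors <- c("#A42820", "#5F5647", "#3F5151")

p <- ggplot(toy, aes(x = movie_title, y = gross, fill = movie_title)) +
  geom_col() +
  # One color per movie; guide = "none" drops the redundant legend.
  scale_fill_manual(values = my_colors, guide = "none")

# The fill colors that ended up on the bars.
layer_data(p)$fill
```

The vector needs at least as many colors as there are movies. If it has fewer, ggplot2 stops with an "Insufficient values in manual scale" error.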

Labeling

Let’s filter these movies to those directed by “Kathryn Bigelow”, “Martin Scorsese”, and “Steven Spielberg” and then make a scatter plot comparing box office to budget.

The color should represent the director.

Then add a label for the movie using the ggrepel package.

library(ggrepel)
movies %>% 
  filter(director_name ____ c("Kathryn Bigelow", "Martin Scorsese", "Steven Spielberg")) %>% 
  ggplot(aes(x=gross, y=budget, color=______________, ______=___________)) +
  geom_point() +
  geom_____________()
movies %>% 
  filter(director_name %in% c("Kathryn Bigelow", "Martin Scorsese", "Steven Spielberg")) %>% 
  ggplot(aes(x=gross, y=budget, color=director_name, label=movie_title)) +
  geom_point() +
  geom_text_repel()

Hint: The operator that lets you filter on several values at once is %in%. The color is based on the director and the label is based on the movie name. Can you find those variable names in the movies data frame? And what’s the function for the last line? If you forgot, refer to the lesson.
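To see what %in% hands to filter(), try it on a plain vector first. The directors vector below is a small made-up sample, not the real column.

```r
# A made-up sample of director names, for illustration only.
directors <- c("Kathryn Bigelow", "James Cameron", "Martin Scorsese", "Wes Anderson")
wanted    <- c("Kathryn Bigelow", "Martin Scorsese", "Steven Spielberg")

# %in% asks, for each element on the left, "is it anywhere on the right?"
directors %in% wanted
# TRUE FALSE TRUE FALSE

# filter() keeps the rows where this is TRUE.
directors[directors %in% wanted]
# "Kathryn Bigelow" "Martin Scorsese"
```

Using == here would compare the two vectors position by position instead of testing membership, which is not what you want.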

Small multiples

Nice!

But crowded.

Let’s break them out individually by director so it’s easier to see the patterns.

movies %>% 
  filter(director_name ____ c("Kathryn Bigelow", "Martin Scorsese", "Steven Spielberg")) %>% 
  ggplot(aes(x=gross, y=budget, color=______________, ______=___________)) +
  geom_point() +
  geom_____________(size=2) +
  _____________(~____________, nrow=2) +
  theme(legend.position="none")
movies %>% 
  filter(director_name %in% c("Kathryn Bigelow", "Martin Scorsese", "Steven Spielberg")) %>% 
  ggplot(aes(x=gross, y=budget, color=director_name, label=movie_title)) +
  geom_point() +
  geom_text_repel(size=2) +
  facet_wrap(~director_name, nrow=2) +
  theme(legend.position="none")

Hint: What’s the term for creating facets of the chart? It’s facet_something. There’s more than one, but we want the one that lets us specify the number of rows in the output (with nrow).
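Here is a stripped-down sketch of small multiples on invented data (toy and its director labels are hypothetical), leaving out the ggrepel labels so it runs with ggplot2 alone.

```r
library(ggplot2)

# Hypothetical toy data; directors and values are invented.
toy <- data.frame(
  gross         = c(10, 40, 25, 60, 15, 35),
  budget        = c(5, 30, 20, 45, 10, 50),
  director_name = c("Director A", "Director A",
                    "Director B", "Director B",
                    "Director C", "Director C")
)

p <- ggplot(toy, aes(x = gross, y = budget, color = director_name)) +
  geom_point() +
  # One panel per director, wrapped into two rows.
  facet_wrap(~director_name, nrow = 2) +
  # Each panel is already titled, so the legend is redundant.
  theme(legend.position = "none")

# How many panels the points were split into.
length(unique(layer_data(p)$PANEL))
```

facet_grid() is the other option. It lays panels out in a strict grid defined by one or two variables and has no nrow argument, which is why facet_wrap() is the one to reach for here.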