Chapter 4

ggplot2

Consider this data set from IMDB on 5,000 movies, available in the environment as movies.

This is what glimpse(movies) outputs.

## Observations: 3,258
## Variables: 29
## $ color                     <chr> "Color", "Color", "Color", "Color", ...
## $ director_name             <chr> "James Cameron", "Gore Verbinski", "...
## $ num_critic_for_reviews    <int> 723, 302, 813, 462, 392, 324, 635, 6...
## $ duration                  <int> 178, 169, 164, 132, 156, 100, 141, 1...
## $ director_facebook_likes   <int> 0, 563, 22000, 475, 0, 15, 0, 0, 0, ...
## $ actor_3_facebook_likes    <int> 855, 1000, 23000, 530, 4000, 284, 19...
## $ actor_2_name              <chr> "Joel David Moore", "Orlando Bloom",...
## $ actor_1_facebook_likes    <int> 1000, 40000, 27000, 640, 24000, 799,...
## $ gross                     <int> 760505847, 309404152, 448130642, 730...
## $ genres                    <chr> "Action|Adventure|Fantasy|Sci-Fi", "...
## $ actor_1_name              <chr> "CCH Pounder", "Johnny Depp", "Tom H...
## $ movie_title               <chr> "Avatar ", "Pirates of the Caribbean...
## $ num_voted_users           <int> 886204, 471220, 1144337, 212204, 383...
## $ cast_total_facebook_likes <int> 4834, 48350, 106759, 1873, 46055, 20...
## $ actor_3_name              <chr> "Wes Studi", "Jack Davenport", "Jose...
## $ facenumber_in_poster      <int> 0, 0, 0, 1, 0, 1, 4, 0, 0, 2, 1, 0, ...
## $ plot_keywords             <chr> "avatar|future|marine|native|paraple...
## $ movie_imdb_link           <chr> "http://www.imdb.com/title/tt0499549...
## $ num_user_for_reviews      <int> 3054, 1238, 2701, 738, 1902, 387, 11...
## $ language                  <chr> "English", "English", "English", "En...
## $ country                   <chr> "USA", "USA", "USA", "USA", "USA", "...
## $ content_rating            <chr> "PG-13", "PG-13", "PG-13", "PG-13", ...
## $ budget                    <int> 237000000, 300000000, 250000000, 263...
## $ title_year                <int> 2009, 2007, 2012, 2012, 2007, 2010, ...
## $ actor_2_facebook_likes    <int> 936, 5000, 23000, 632, 11000, 553, 2...
## $ imdb_score                <dbl> 7.9, 7.1, 8.5, 6.6, 6.2, 7.8, 7.5, 6...
## $ aspect_ratio              <dbl> 1.78, 2.35, 2.35, 2.35, 2.35, 1.85, ...
## $ movie_facebook_likes      <int> 33000, 0, 164000, 24000, 0, 29000, 1...
## $ genre1                    <chr> "Action", "Action", "Action", "Actio...

Make a scatterplot

Let’s start with a simple scatterplot comparing box office gross to budget.

ggplot(______) + geom_________(aes(x=____,y=_____))
ggplot(movies) + geom_point(aes(x=gross,y=budget))

Change the color

Let’s change the color of the circles to blue.

ggplot(______) + geom_________(aes(x=_____,y=_____), ________)
ggplot(movies) + geom_point(aes(x=gross,y=budget), color="blue")

Make a scatterplot with color groups

Color the points by content_rating.

ggplot(______) + geom_________(aes(x=_____, y=______, _______=_____))
ggplot(movies) + geom_point(aes(x=gross,y=budget, color=content_rating))

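Did you notice where the closing parenthesis of aes() falls this time? Because color is now mapped to a variable (content_rating), it goes inside aes(). In the previous exercise, color="blue" was a fixed value, so it went outside aes().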
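
To make the inside-versus-outside distinction concrete, here is a minimal sketch on a made-up data frame. The name toy and its values are assumptions for illustration only and are not part of movies. layer_data() returns the data ggplot2 computes for a layer, so it shows which colors each point ends up with.

```r
library(ggplot2)

# Hypothetical toy data standing in for movies (values are made up)
toy <- data.frame(
  gross = c(10, 20, 30, 40),
  budget = c(5, 15, 10, 30),
  content_rating = c("PG", "PG", "R", "R")
)

# Fixed color: a constant outside aes(), so every point is blue
p_fixed <- ggplot(toy) +
  geom_point(aes(x = gross, y = budget), color = "blue")

# Mapped color: a variable inside aes(), so each rating gets its own color
p_mapped <- ggplot(toy) +
  geom_point(aes(x = gross, y = budget, color = content_rating))

# Inspect the colors ggplot2 assigned to each point
fixed_colours <- layer_data(p_fixed)$colour
mapped_colours <- layer_data(p_mapped)$colour
```

The fixed version carries a single color for all points, while the mapped version has one color per distinct content_rating and draws a legend for it.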

Bar plot

Make a bar chart that counts the number of titles per year (title_year).

ggplot(______,
             aes(x=_________)) +
  geom__________()
ggplot(movies,
             aes(x=title_year)) +
  geom_bar()

Stacked bar plot ver. 1

Group the yearly movie counts by content_rating to create a stacked bar chart.

ggplot(______,
             aes(x=_________,_________)) +
  geom__________()
ggplot(movies,
             aes(x=title_year, fill=content_rating)) +
  geom_bar()

Hint: You may want to use the fill argument in the aes().

Stacked bar plot ver. 2

Great, now split up the bars so they’re not stacked but next to each other.

We’ll also focus on movies released after 2001 (the variable is title_year).

movies %>% 
  filter(___________) %>%
  ggplot(aes(x=_________,fill=________)) +
  geom__________(________________)
movies %>% 
  filter(title_year>2001) %>%
  ggplot(aes(x=title_year,fill=content_rating)) +
  geom_bar(position="dodge")

Hint: You may want to use the position argument in the geom_bar() function.
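
As a rough illustration of what the position argument changes, the sketch below builds the same bars stacked and dodged on a made-up data frame. The name toy and its values are assumptions, not rows from movies. It then compares the left edges (xmin) that ggplot2 computes for each bar segment.

```r
library(ggplot2)

# Hypothetical toy data: two years, two ratings (values are made up)
toy <- data.frame(
  title_year = c(2002, 2002, 2002, 2003, 2003),
  content_rating = c("PG", "R", "R", "PG", "R")
)

base <- ggplot(toy, aes(x = title_year, fill = content_rating))

# Default position ("stack"): segments share the same horizontal slot
p_stack <- base + geom_bar()

# position = "dodge": segments sit side by side within each year
p_dodge <- base + geom_bar(position = "dodge")

# Count the distinct left edges of the drawn rectangles
n_slots_stack <- length(unique(layer_data(p_stack)$xmin))
n_slots_dodge <- length(unique(layer_data(p_dodge)$xmin))
```

Stacked bars use one horizontal slot per year, whereas dodged bars get one slot per year and rating combination.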

Stacked bar plot ver. 3

Alright, let’s make a percent stacked chart this time:

movies %>% 
  filter(___________) %>%
  ggplot(aes(x=_________,fill=_________)) +
  geom__________(position=________)
movies %>% 
  filter(title_year>2001) %>%
  ggplot(aes(x=title_year,fill=content_rating)) +
  geom_bar(position="fill")
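
A small sketch of what position="fill" does to the counts, again on made-up data. The name toy and its values are assumptions for illustration. Each segment's height becomes its share of that year's total, so every bar reaches the same height.

```r
library(ggplot2)

# Hypothetical toy data (values are made up)
toy <- data.frame(
  title_year = c(2002, 2002, 2002, 2002, 2003, 2003),
  content_rating = c("PG", "R", "R", "R", "PG", "R")
)

p_fill <- ggplot(toy, aes(x = title_year, fill = content_rating)) +
  geom_bar(position = "fill")

built <- layer_data(p_fill)

# Height of each segment is its share of the movies in that year
built$share <- built$ymax - built$ymin

# The top of the tallest segment in each year, i.e. the total bar height
top_by_year <- tapply(built$ymax, built$x, max)
```

Because the y axis now shows proportions rather than counts, this view is useful for comparing the mix of ratings across years but hides how many movies each year contains.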

Customizing charts

Another bar chart

Let’s look at box office results for all the movies that James Cameron has directed (the variable is director_name).

movies %>% 
  filter(__________) %>% 
  ggplot(aes(x=_________,y=___________)) +
  geom_bar(___________)
movies %>% 
  filter(director_name=="James Cameron") %>%
  ggplot(aes(x=movie_title,y=gross)) +
  geom_bar(stat="identity")

Hint: You may want to pass the argument stat= to the geom_bar() function. What do you fill with stat? You’ll need to check your notes.
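
To see why the stat argument matters here, the sketch below compares the default stat with stat="identity" on a made-up data frame. The name toy, the film titles, and the gross values are assumptions for illustration. It also shows geom_col(), which ggplot2 provides as a shorthand for bars whose heights come from a y column.

```r
library(ggplot2)

# Hypothetical toy data: one row per film (values are made up)
toy <- data.frame(
  movie_title = c("Film A", "Film B", "Film C"),
  gross = c(100, 250, 175)
)

# Default stat ("count"): bar height is the number of rows per title
p_count <- ggplot(toy, aes(x = movie_title)) +
  geom_bar()

# stat = "identity": bar height is taken directly from the y column
p_identity <- ggplot(toy, aes(x = movie_title, y = gross)) +
  geom_bar(stat = "identity")

# geom_col() gives the same bars without spelling out the stat
p_col <- ggplot(toy, aes(x = movie_title, y = gross)) +
  geom_col()

count_heights <- layer_data(p_count)$y
identity_heights <- layer_data(p_identity)$y
col_heights <- layer_data(p_col)$y
```

With one row per film, the default stat would draw every bar at height 1, which is why the exercise needs the values in gross to be used as-is.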

Flip that chart

Flip that chart so that the movies are on the y axis instead of the x axis, without swapping the x and y variables in aes() from the code above.

movies %>% 
  filter(__________) %>% 
  ggplot(aes(x=movie_title,y=gross)) +
  geom_bar(___________) +
  __________()