  • 9. Data Manipulation with dplyr in R


    Table of Contents

    1. Transforming Data with dplyr

    1.1 The counties dataset (video)

    1.2 Understanding your data

    1.3 Selecting columns

    Select the following four columns from the counties variable:

    • state
    • county
    • population
    • poverty

    You don’t need to save the result to a variable.

    Instruction:

    • Select the columns listed from the counties variable.
    # Select the columns 
    counties %>%
    select(state, county, population, poverty)
    

    1.4 The filter and arrange verbs (video)

    1.5 Arranging observations

    Here you see the counties_selected dataset with a few interesting variables selected. The variables private_work, public_work, and self_employed describe whether people work for private companies, for the government, or for themselves.

    In these exercises, you’ll sort these observations to find the most interesting cases.

    Instruction:

    • Add a verb to sort the observations of the public_work variable in descending order.
    counties_selected <- counties %>%
    select(state, county, population, private_work, public_work, self_employed)
    
    # Add a verb to sort in descending order of public_work
    counties_selected %>%
    arrange(desc(public_work))
    

    1.6 Filtering for conditions

    You use the filter() verb to get only observations that match a particular condition, or match multiple conditions.

    Instruction 1:

    • Find only the counties that have a population above one million (1000000).
    counties_selected <- counties %>%
    select(state, county, population)
    
    # Filter for counties with a population above 1000000
    counties_selected %>%
    filter(population > 1000000)
    

    Instruction 2:

    • Find only the counties in the state of California that also have a population above one million (1000000).
    counties_selected <- counties %>%
    select(state, county, population)
    
    # Filter for counties in the state of California that have a population above 1000000
    counties_selected %>%
    filter(state == "California" & population > 1000000)
    

    1.7 Filtering and arranging

    You’re often interested in both filtering and sorting a dataset, to focus on the observations of particular interest to you. Here, you’ll find counties that are extreme examples of what fraction of the population works in the private sector.

    Instruction:

    • Filter for counties in the state of Texas that have more than ten thousand people (10000), and sort them in descending order of the percentage of people employed in private work.
    counties_selected <- counties %>%
    select(state, county, population, private_work, public_work, self_employed)
    
    # Filter for Texas and more than 10000 people; sort in descending order of private_work
    counties_selected %>%
    filter(state == "Texas", population > 10000) %>%
    arrange(desc(private_work))
    

    1.8 Mutate (video)

    1.9 Calculating the number of government employees

    In the video, you used the unemployment variable, which is a percentage, to calculate the number of unemployed people in each county. In this exercise, you’ll do the same with another percentage variable: public_work.

    The code provided already selects the state, county, population, and public_work columns.

    Instruction 1:

    • Use mutate() to add a column called public_workers to the dataset, with the number of people employed in public (government) work.
    counties_selected <- counties %>%
    select(state, county, population, public_work)
    
    # Add a new column public_workers with the number of people employed in public work
    counties_selected %>%
    mutate(public_workers = population * public_work / 100)
    

    Instruction 2:

    • Sort the new column in descending order.
    counties_selected <- counties %>%
    select(state, county, population, public_work)
    
    # Sort in descending order of the public_workers column
    counties_selected %>%
    mutate(public_workers = public_work * population / 100) %>%
    arrange(desc(public_workers))
    

    1.10 Calculating the percentage of women in a county

    The dataset includes columns for the total number (not percentage) of men and women in each county. You could use this, along with the population variable, to compute the fraction of men (or women) within each county.

    In this exercise, you’ll select the relevant columns yourself.

    Instruction:

    • Select the columns state, county, population, men, and women.
    • Add a new variable called proportion_women with the fraction of the county’s population made up of women.
    # Select the columns state, county, population, men, and women
    counties_selected <- counties %>%
    select(state, county, population, men, women)
      
    # Calculate proportion_women as the fraction of the population made up of women
    counties_selected %>%
    mutate(proportion_women = women / population)
    

    1.11 Select, mutate, filter, and arrange

    In this exercise, you’ll put together everything you’ve learned in this chapter (select(), mutate(), filter() and arrange()), to find the counties with the highest proportion of men.

    Instruction:

    • Select only the columns state, county, population, men, and women.
    • Add a variable proportion_men with the fraction of the county’s population made up of men.
    • Filter for counties with a population of at least ten thousand (10000).
    • Arrange counties in descending order of their proportion of men.
    counties %>%
    # Select the five columns 
    select(state, county, population, men, women) %>%
    # Add the proportion_men variable
    mutate(proportion_men = men / population) %>%
    # Filter for population of at least 10,000
    filter(population >= 10000) %>%
    # Arrange proportion of men in descending order 
    arrange(desc(proportion_men))
    

    2. Aggregating Data

    2.1 The count verb (video)

    2.2 Counting by region

    The counties dataset contains columns for region, state, population, and the number of citizens, which we selected and saved as the counties_selected table. In this exercise, you’ll focus on the region column.

    counties_selected <- counties %>%
    select(region, state, population, citizens)
    

    Instruction:

    • Use count() to find the number of counties in each region, using a second argument to sort in descending order.
    # Use count to find the number of counties in each region
    counties_selected %>%
    count(region, sort = TRUE)
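
    For reference, count() here is shorthand for a grouped summarize. A minimal sketch of the equivalent longhand, using the same counties_selected table:

    # Equivalent to count(region, sort = TRUE)
    counties_selected %>%
      group_by(region) %>%
      summarize(n = n()) %>%
      arrange(desc(n))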
    

    2.3 Counting citizens by state

    You can weigh your count by particular variables rather than finding the number of counties. In this case, you’ll find the number of citizens in each state.

    counties_selected <- counties %>%
    select(region, state, population, citizens)
    

    Instruction:

    • Count the number of counties in each state, weighted by the citizens column, and sort in descending order.
    # Find number of counties per state, weighted by citizens
    counties_selected %>%
    count(state, wt = citizens, sort = TRUE) 
    

    2.4 Mutating and counting

    You can combine multiple verbs together to answer increasingly complicated questions of your data. For example: “What are the US states where the most people walk to work?”

    You’ll use the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count based on it.

    counties_selected <- counties %>%
    select(region, state, population, walk)
    

    Instruction:

    • Use mutate() to calculate and add a column called population_walk, containing the total number of people who walk to work in a county.
    • Use a (weighted and sorted) count() to find the total number of people who walk to work in each state.
    counties_selected %>%
    # Add population_walk containing the total number of people who walk to work 
    mutate(population_walk = walk * population / 100) %>% 
    # Count weighted by the new column
    count(state, wt = population_walk, sort = TRUE)
    

    2.5 The group_by, summarize, and ungroup verbs (video)

    2.6 Summarizing

    The summarize() verb is very useful for collapsing a large dataset into a single observation.

    counties_selected <- counties %>%
    select(county, population, income, unemployment)
    

    Instruction:

    • Summarize the counties dataset to find the following columns: min_population (with the smallest population), max_unemployment (with the maximum unemployment), and average_income (with the mean of the income variable).
    # Summarize to find minimum population, maximum unemployment, and average income
    counties_selected %>%
    summarize(min_population = min(population), 
              max_unemployment = max(unemployment),
              average_income = mean(income))
    

    2.7 Summarizing by state

    Another interesting column is land_area, which shows the land area in square miles. Here, you’ll summarize both population and land area by state, with the purpose of finding the density (in people per square mile).

    counties_selected <- counties %>%
    select(state, county, population, land_area)
    

    Instruction 1:

    • Group the data by state, and summarize to create the columns total_area (with total area in square miles) and total_population (with total population).
    # Group by state and find the total area and population
    counties_selected %>%
    group_by(state) %>%
    summarize(total_area = sum(land_area), total_population = sum(population))
    

    Instruction 2:

    • Add a density column with the people per square mile, then arrange in descending order.
    # Add a density column, then sort in descending order
    counties_selected %>%
    group_by(state) %>%
    summarize(total_area = sum(land_area),
              total_population = sum(population)) %>%
    mutate(density = total_population / total_area) %>%
    arrange(desc(density))
    

    2.8 Summarizing by state and region

    You can group by multiple columns instead of grouping by one. Here, you’ll practice aggregating by state and region, and notice how useful it is for performing multiple aggregations in a row.

    counties_selected <- counties %>%
    select(region, state, county, population)
    

    Instruction 1:

    • Summarize to find the total population, as a column called total_pop, in each combination of region and state.
    # Summarize to find the total population
    counties_selected %>%
    group_by(region, state) %>%
    summarize(total_pop = sum(population))
    

    Instruction 2:

    • Notice the tibble is still grouped by region; use another summarize step to calculate two new columns: the average state population in each region (average_pop) and the median state population in each region (median_pop).
    # Calculate the average_pop and median_pop columns 
    counties_selected %>%
    group_by(region, state) %>%
    summarize(total_pop = sum(population)) %>%
    summarize(average_pop = mean(total_pop),
              median_pop = median(total_pop))
    

    2.9 The top_n verb (video)

    2.10 Selecting a county from each region

    Previously, you used the walk column, which gives the percentage of people in each county who walk to work, to add a new column and count to find the total number of people who walk to work in each state.

    Now, you’re interested in finding the county within each region with the highest percentage of citizens who walk to work.

    counties_selected <- counties %>%
    select(region, state, county, metro, population, walk)
    

    Instruction:

    • Find the county in each region with the highest percentage of citizens who walk to work.
    # Group by region and find the greatest number of citizens who walk to work
    counties_selected %>%
    group_by(region) %>%
    top_n(1, walk)
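
    As a side note, in dplyr 1.0.0 and later top_n() is superseded by slice_max(), which keeps the rows with the largest values of a column. An equivalent sketch for this exercise:

    # Same result with the newer slice_max() verb
    counties_selected %>%
      group_by(region) %>%
      slice_max(walk, n = 1)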
    

    2.11 Finding the highest-income state in each region

    You’ve been learning to combine multiple dplyr verbs together. Here, you’ll combine group_by(), summarize(), and top_n() to find the state in each region with the highest income.

    When you group by multiple columns and then summarize, it’s important to remember that the summarize “peels off” one of the groups, but leaves the rest on. For example, if you group_by(X, Y) then summarize, the result will still be grouped by X.
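
    If you want to verify which groups remain, group_vars() reports them. A minimal sketch using the counties data from this chapter:

    # After summarizing a group_by(region, state), only region stays grouped
    counties %>%
      group_by(region, state) %>%
      summarize(total_pop = sum(population)) %>%
      group_vars()
    # [1] "region"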

    counties_selected <- counties %>%
    select(region, state, county, population, income)
    

    Instruction:

    • Calculate the average income (as average_income) of counties within each region and state (notice the group_by() has already been done for you).
    • Find the highest income state in each region.
    counties_selected %>%
    group_by(region, state) %>%
    # Calculate average income
    summarize(average_income = mean(income)) %>%
    # Find the highest income state in each region
    top_n(1, average_income)
    

    2.12 Using summarize, top_n, and count together

    In this chapter, you’ve learned to use five dplyr verbs related to aggregation: count(), group_by(), summarize(), ungroup(), and top_n(). In this exercise, you’ll use all of them to answer a question: In how many states do more people live in metro areas than non-metro areas?

    Recall that the metro column has one of the two values “Metro” (for high-density city areas) or “Nonmetro” (for suburban and country areas).

    counties_selected <- counties %>%
    select(state, metro, population)
    

    Instruction 1:

    • For each combination of state and metro, find the total population as total_pop.
    # Find the total population for each combination of state and metro
    counties_selected %>%
    group_by(state, metro) %>%
    summarize(total_pop = sum(population))
    

    Instruction 2:

    • Extract the most populated row from each state, which will be either Metro or Nonmetro.
    # Extract the most populated row for each state
    counties_selected %>%
    group_by(state, metro) %>%
    summarize(total_pop = sum(population)) %>%
    top_n(1, total_pop)
    

    Instruction 3:

    • Ungroup, then count how often Metro or Nonmetro appears to see how many states have more people living in those areas.
    # Count the states with more people in Metro or Nonmetro areas
    counties_selected %>%
    group_by(state, metro) %>%
    summarize(total_pop = sum(population)) %>%
    top_n(1, total_pop) %>%
    ungroup() %>%
    count(metro)
    

    3. Selecting and Transforming Data

    3.1 Selecting (video)

    3.2 Selecting columns

    Using the select verb, we can answer interesting questions about our dataset by focusing in on related groups of variables. The colon (:) is useful for getting many columns at a time.

    Instruction:

    • Use glimpse() to examine all the variables in the counties table.
    • Select the columns for state, county, population, and (using a colon) all five of those industry-related variables; there are five consecutive variables in the table related to the industry of people’s work: professional, service, office, construction, and production.
    • Arrange the table in descending order of service to find which counties have the highest rates of working in the service industry.
    # Glimpse the counties table
    glimpse(counties)
    
    counties %>%
    # Select state, county, population, and industry-related columns
    select(state, county, population, professional:production) %>%
    # Arrange service in descending order 
    arrange(desc(service))
    

    3.3 Select helpers

    In the video you learned about the select helper starts_with(). Another select helper is ends_with(), which finds the columns that end with a particular string.
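
    As a quick illustration of both helpers, here is a toy sketch on a made-up tibble (the column names are invented purely for demonstration):

    # starts_with() and ends_with() match columns by prefix and suffix
    df <- tibble(id = 1, income_low = 2, income_high = 3, public_work = 4)
    df %>% select(starts_with("income"))  # income_low, income_high
    df %>% select(ends_with("work"))      # public_work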

    Instruction:

    • Select the columns state, county, population, and all those that end with work.
    • Filter just for the counties where at least 50% of the population is engaged in public work.
    counties %>%
    # Select the state, county, population, and those ending with "work"
    select(state, county, population, ends_with("work")) %>%
    # Filter for counties that have at least 50% of people engaged in public work
    filter(public_work >= 50)
    

    3.4 The rename verb (video)

    3.5 Renaming a column after count

    The rename() verb is often useful for changing the name of a column that comes out of another verb, such as count(). In this exercise, you’ll rename the n column from count() (which you learned about in Chapter 2) to something more descriptive.

    Instruction 1:

    • Use count() to determine how many counties are in each state.
    # Count the number of counties in each state
    counties %>%
    count(state)
    

    Instruction 2:

    • Notice the n column in the output; use rename() to rename that to num_counties.
    # Rename the n column to num_counties
    counties %>%
    count(state)  %>%
    rename(num_counties = n)
    

    3.6 Rename a column as part of a select

    rename() isn’t the only way you can choose a new name for a column: you can also choose a name as part of a select().

    Instruction:

    • Select the columns state, county, and poverty from the counties dataset; in the same step, rename the poverty column to poverty_rate.
    # Select state, county, and poverty as poverty_rate
    counties %>%
    select(state, county, poverty_rate = poverty)
    

    3.7 The transmute verb (video)

    3.8 Choosing among verbs

    3.9 Using transmute

    As you learned in the video, the transmute verb allows you to control which variables you keep, which variables you calculate, and which variables you drop.

    Instruction:

    • Keep only the state, county, and population columns, and add a new column, density, that contains the population per land_area.
    • Filter for only counties with a population greater than one million.
    • Sort the table in ascending order of density.
    counties %>%
    # Keep the state, county, and population columns, and add a density column
    transmute(state, county, population, density = population / land_area) %>%
    # Filter for counties with a population greater than one million 
    filter(population > 1000000) %>%
    # Sort density in ascending order 
    arrange(density)
    

    3.10 Matching verbs to their definitions

    3.11 Choosing among the four verbs

    In this chapter you’ve learned about the four verbs: select, mutate, transmute, and rename. Here, you’ll choose the appropriate verb for each situation. You won’t need to change anything inside the parentheses.

    Instruction:

    • Choose the right verb for changing the name of the unemployment column to unemployment_rate.
    • Choose the right verb for keeping only the columns state, county, and the ones containing poverty.
    • Calculate a new column called fraction_women with the fraction of the population made up of women, without dropping any columns.
    • Keep only three columns: the state, county, and employed / population, which you’ll call employment_rate.
    # Change the name of the unemployment column
    counties %>%
    rename(unemployment_rate = unemployment)
    
    # Keep the state and county columns, and the columns containing poverty
    counties %>%
    select(state, county, contains("poverty"))
    
    # Calculate the fraction_women column without dropping the other columns
    counties %>%
    mutate(fraction_women = women / population)
    
    # Keep only the state, county, and employment_rate columns
    counties %>%
    transmute(state, county, employment_rate = employed / population)
    

    4. Case Study: The babynames Dataset

    4.1 The babynames data (video)

    4.2 Filtering and arranging for one year

    The dplyr verbs you’ve learned are useful for exploring data. For instance, you could find out the most common names in a particular year.

    Instruction:

    • Filter for only the year 1990.
    • Sort the table in descending order of the number of babies born.
    babynames %>%
    # Filter for the year 1990
    filter(year == 1990) %>%
    # Sort the number column in descending order 
    arrange(desc(number))
    

    4.3 Using top_n with babynames

    You saw that you could use filter() and arrange() to find the most common names in one year. However, you could also use group_by and top_n to find the most common name in every year.

    Instruction:

    • Use group_by and top_n to find the most common name for US babies in each year.
    # Find the most common name in each year
    babynames %>%
    group_by(year) %>%
    top_n(1, number)
    

    4.4 Visualizing names with ggplot2

    The dplyr package is very useful for exploring data, but it’s especially useful when combined with other tidyverse packages like ggplot2.

    Instruction 1:

    • Filter for only the names Steven, Thomas, and Matthew, and assign it to an object called selected_names.
    # Filter for the names Steven, Thomas, and Matthew 
    selected_names <- babynames %>%
    filter(name %in% c("Steven","Thomas","Matthew"))
    

    Instruction 2:

    • Visualize those three names as a line plot over time, with each name represented by a different color.
    # Plot the names using a different color for each name
    ggplot(selected_names, aes(x = year, y = number, color = name)) +
    geom_line()
    

    4.5 Grouped mutates (video)

    4.6 Finding the year each name is most common

    In an earlier video, you learned how to filter for a particular name to determine the frequency of that name over time. Now, you’re going to explore which year each name was the most common.

    To do this, you’ll be combining the grouped mutate approach with a top_n.

    Instruction:

    • Complete the code so that it finds the year each name is most common.
    # Find the year each name is most common 
    babynames %>%
    group_by(year) %>%
    mutate(year_total = sum(number)) %>%
    ungroup() %>%
    mutate(fraction = number / year_total) %>%
    group_by(name) %>%
    top_n(1, fraction)
    

    4.7 Adding the total and maximum for each name

    In the video, you learned how you could group by the year and use mutate() to add a total for that year.

    In these exercises, you’ll learn to normalize by a different, but also interesting metric: you’ll divide each name by the maximum for that name. This means that every name will peak at 1.

    Once you add new columns, the result will still be grouped by name. This splits it into 48,000 groups, which actually makes later steps like mutates slower.

    Instruction 1:
    Use a grouped mutate to add two columns:

    • name_total, with the total number of babies born with that name in the entire dataset.
    • name_max, with the highest number of babies born in any year.
    # Add columns name_total and name_max for each name
    babynames %>%
    group_by(name) %>%
    mutate(name_total = sum(number),
           name_max = max(number))
    

    Instruction 2:

    • Add another step to ungroup the table.
    • Add a column called fraction_max, with the number in the year divided by the maximum for that name.
    babynames %>%
    group_by(name) %>%
    mutate(name_total = sum(number),
           name_max = max(number)) %>%
    # Ungroup the table 
    ungroup() %>%
    # Add the fraction_max column containing the number divided by the name maximum 
    mutate(fraction_max = number / name_max)
    

    4.8 Visualizing the normalized change in popularity

    You picked a few names and calculated each of them as a fraction of their peak. This is a type of “normalizing” a name, where you’re focused on the relative change within each name rather than the overall popularity of the name.

    In this exercise, you’ll visualize the normalized popularity of each name. Your work from the previous exercise, names_normalized, has been provided for you.

    names_normalized <- babynames %>%
                         group_by(name) %>%
                         mutate(name_total = sum(number),
                                name_max = max(number)) %>%
                         ungroup() %>%
                         mutate(fraction_max = number / name_max)
    

    Instruction:

    • Filter the names_normalized table to limit it to the three names Steven, Thomas, and Matthew.
    • Visualize fraction_max for those names over time.
    # Filter for the names Steven, Thomas, and Matthew
    names_filtered <- names_normalized %>%
    filter(name %in% c("Steven", "Thomas", "Matthew"))
    
    # Visualize these names over time
    ggplot(names_filtered, aes(x = year, y = fraction_max, color = name)) + 
    geom_line()
    

    4.9 Window functions (video)

    4.10 Using ratios to describe the frequency of a name

    In the video, you learned how to find the difference in the frequency of a baby name between consecutive years. What if instead of finding the difference, you wanted to find the ratio?

    You’ll start with the babynames_fraction data, which is already loaded for you, so that you can consider the popularity of each name within each year.
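
    As a refresher on the window function involved, dplyr’s lag() shifts a vector back by one position, so dividing a column by lag() of itself gives consecutive-year ratios. A minimal sketch:

    # lag() pads the front with NA
    x <- c(1, 2, 4, 8)
    lag(x)      # NA 1 2 4
    x / lag(x)  # NA 2 2 2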

    Instruction:

    • Arrange the data in ascending order of name and then year.
    • Group by name so that your mutate works within each name.
    • Add a column ratio containing the ratio of the frequency between each year and the previous year.
    babynames_fraction %>%
    # Arrange the data in order of name, then year 
    arrange(name, year) %>%
    # Group the data by name
    group_by(name) %>%
    # Add a ratio column that contains the ratio between each year 
    mutate(ratio = fraction / lag(fraction))
    

    4.11 Biggest jumps in a name

    Previously, you added a ratio column to describe the ratio of the frequency of a baby name between consecutive years to describe the changes in the popularity of a name. Now, you’ll look at a subset of that data, called babynames_ratios_filtered, to look further into the names that experienced the biggest jumps in popularity in consecutive years.

    babynames_ratios_filtered <- babynames_fraction %>%
                         arrange(name, year) %>%
                         group_by(name) %>%
                         mutate(ratio = fraction / lag(fraction)) %>%
                         filter(fraction >= 0.00001)
    

    Instruction:

    • From each name in the data, keep the observation (the year) with the largest ratio; note the data is already grouped by name.
    • Sort the ratio column in descending order.
    • Filter the babynames_ratios_filtered data further by filtering the fraction column to only display results greater than or equal to 0.001.
    babynames_ratios_filtered %>%
    # Extract the largest ratio from each name 
    top_n(1, ratio) %>%
    # Sort the ratio column in descending order 
    arrange(desc(ratio)) %>%
    # Filter for fractions greater than or equal to 0.001
    filter(fraction >= 0.001)
    

    4.12 Congratulations!

  • 10. Joining Data with dplyr in R


    Table of Contents

    1. Joining Tables

    1.1 The inner_join verb (video)

    1.2 What columns would you join on?

    1.3 Joining parts and part categories

    The inner_join verb is the key to bringing tables together. To use it, you need to provide the two tables that must be joined and the columns on which they should be joined.

    In this exercise, you’ll join a list of LEGO parts, available as parts, with these parts’ corresponding categories, available as part_categories. For example, the part Sticker Sheet 1 for Set 1650-1 is from the Stickers part category. You can join these tables to see all parts’ categories!

    Instruction 1:

    • Add the correct joining verb, the name of the second table, and the joining column for the second table.
    # Add the correct verb, table, and joining column
    parts %>% 
    inner_join(part_categories, by = c("part_cat_id" = "id"))
    

    Instruction 2:

    • Now, use the suffix argument to add "_part" and "_category" suffixes to replace the name.x and name.y fields.
    # Use the suffix argument to replace .x and .y suffixes
    parts %>% 
    inner_join(part_categories, by = c("part_cat_id" = "id"), suffix = c("_part", "_category"))
    

    1.4 Joining with a one-to-many relationship (video)

    1.5 Joining parts and inventories

    The LEGO data has many tables that can be joined together. Oftentimes, some of the things you care about may be a few tables away (we’ll get to that later in the course). For now, we know that parts is a list of all LEGO parts, and a new table, inventory_parts, has some additional information about those parts, such as the color_id of each part you would find in a specific LEGO kit.

    Let’s join these two tables together to observe how joining parts with inventory_parts increases the size of your table because of the one-to-many relationship that exists between these two tables.

    Instruction:

    • Connect the parts and inventory_parts tables by their part numbers using an inner join.
    # Combine the parts and inventory_parts tables
    parts %>%
    inner_join(inventory_parts, by = "part_num")
    

    1.6 Joining in either direction

    An inner_join works the same way with either table in either position. The table that is specified first is arbitrary, since you will end up with the same information in the resulting table either way.

    Let’s prove this by joining the same two tables from the last exercise in the opposite order!

    Instruction:

    • Connect the inventory_parts table with the parts table.
    # Combine the parts and inventory_parts tables
    inventory_parts %>%
    inner_join(parts, by = "part_num")
    

    1.7 Joining three or more tables (video)

    1.8 Joining three tables

    You can string together multiple joins with inner_join and the pipe (%>%), both of which you are already very familiar with!

    We’ll now connect sets, a table that tells us about each LEGO kit, with inventories, a table that tells us the specific version of a given set, and finally to inventory_parts, a table which tells us how many of each part is available in each LEGO kit.

    So if you were building a Batman LEGO set, sets would tell you the name of the set, inventories would give you IDs for each of the versions of the set, and inventory_parts would tell you how many of each part would be in each version.

    Instruction:

    • Combine the inventories table with the sets table.
    • Next, join the inventory_parts table to the table you created in the previous join by the inventory IDs.
    sets %>%
    # Add inventories using an inner join 
    inner_join(inventories, by = "set_num") %>%
    # Add inventory_parts using an inner join 
    inner_join(inventory_parts, by = c("id" = "inventory_id"))
    

    1.9 What’s the most common color?

    Now let’s join an additional table, colors, which will tell us the color of each part in each set, so that we can answer the question, “what is the most common color of a LEGO piece?”

    Instruction 1:

    • Inner join the colors table using the color_id column from the previous join and the id column from colors; use the suffixes "_set" and "_color".
    # Add an inner join for the colors table
    sets %>%
    inner_join(inventories, by = "set_num") %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color"))
    

    Instruction 2:

    • Count the name_color column and sort the results so the most prominent colors appear first.
    # Count the number of colors and sort
    sets %>%
    inner_join(inventories, by = "set_num") %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    inner_join(colors, by = c("color_id" = "id"), suffix = c("_set", "_color")) %>%
    count(name_color) %>%
    arrange(desc(n))
    

    2. Left and Right Joins

    2.1 The left_join verb (video)

    2.2 Left joining two sets by part and color

    In the video, you learned how to left join two LEGO sets. Now you’ll practice your ability to do this by looking at two new sets: the Millennium Falcon and Star Destroyer sets. We’ve created these and they have been preloaded for you:

    millennium_falcon <- inventory_parts_joined %>%
      filter(set_num == "7965-1")
    
    star_destroyer <- inventory_parts_joined %>%
      filter(set_num == "75190-1")
    

    Instruction:

    • Combine the star_destroyer and millennium_falcon tables with the suffixes _falcon and _star_destroyer.
    # Combine the star_destroyer and millennium_falcon tables
    millennium_falcon %>%
    left_join(star_destroyer, by = c("part_num", "color_id"), suffix = c("_falcon", "_star_destroyer"))
    

    2.3 Left joining two sets by color

    In the videos and the last exercise, you joined two sets based on their part and color. What if you joined the datasets by color alone? As with the last exercise, the Millennium Falcon and Star Destroyer sets have been created and preloaded for you:

    millennium_falcon <- inventory_parts_joined %>%
    filter(set_num == "7965-1")
    
    star_destroyer <- inventory_parts_joined %>%
    filter(set_num == "75190-1")
    

    Instruction 1:

    • Sum the quantity column by color_id in the Millennium Falcon dataset.
    # Aggregate Millennium Falcon for the total quantity of each color
    millennium_falcon_colors <- millennium_falcon %>%
    group_by(color_id) %>%
    summarize(total_quantity = sum(quantity))
    

    Instruction 2:
    Now, sum the quantity column by color_id in the Star Destroyer dataset.

    # Aggregate Star Destroyer for the total quantity of each color
    star_destroyer_colors <- star_destroyer %>%
    group_by(color_id) %>%
    summarize(total_quantity = sum(quantity))
    

    Instruction 3:

    • Left join the two datasets, millennium_falcon_colors and star_destroyer_colors, using the color_id column and the _falcon and _star_destroyer suffixes.
    # Left join the Millennium Falcon colors to the Star Destroyer colors
    millennium_falcon_colors %>%
    left_join(star_destroyer_colors, by = c("color_id"), suffix = c("_falcon", "_star_destroyer"))
    

    2.4 Finding an observation that doesn’t have a match

    Left joins are really great for testing your assumptions about a data set and ensuring your data has integrity.

    For example, the inventories table has a version column, for when a LEGO kit gets some kind of change or upgrade. It would be fair to assume that all sets (which join with inventories) would have at least a version 1. But let’s test this assumption out in the following exercise.

    Instruction:

    • Use a left_join to join together sets and inventory_version_1 using their common column.
    • Filter for where the version column is NA using is.na().
    inventory_version_1 <- inventories %>%
    filter(version == 1)
    
    # Join versions to sets
    sets %>%
    left_join(inventory_version_1, by = "set_num") %>%
    # Filter for where version is NA
    filter(is.na(version))
    

    2.5 The right_join verb (video)

    2.6 Which join is best?

    2.7 Counting part colors

    Sometimes you’ll want to do some processing before you do a join, and prioritize keeping the second (right) table’s rows instead. In this case, a right join is for you.

    In the example below, we’ll count the part_cat_id from parts, before using a right_join to join with part_categories. The reason we do this is that we don’t only want to know the count of part_cat_id in parts, but we also want to know if there are any part_cat_ids in part_categories that are not present in parts.

    Instruction 1:

    • Use the count verb to count each part_cat_id in the parts table.
    • Use a right_join to join part_categories. You’ll need to use the part_cat_id from the count and the id column from part_categories.
    parts %>%
    # Count the part_cat_id
    count(part_cat_id) %>%
    # Right join part_categories
    right_join(part_categories, by = c("part_cat_id" = "id"))
    

    Instruction 2:

    • Filter for where the column n is NA.
    parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Filter for NA
    filter(is.na(n))
    

    2.8 Cleaning up your count

    In both left and right joins, there is the opportunity for NA values to appear in the resulting table. Fortunately, the replace_na function can turn those NAs into meaningful values.

    In the last exercise, we saw that the n column had NAs after the right_join. Let’s use the replace_na function, which takes a list of column names and the values with which NAs should be replaced, to clean up our table.

    Instruction:

    • Use replace_na to replace NAs in the n column with the value 0.
    parts %>%
    count(part_cat_id) %>%
    right_join(part_categories, by = c("part_cat_id" = "id")) %>%
    # Use replace_na to replace missing values in the n column
    replace_na(list(n = 0))
    

    2.9 Joining tables to themselves (video)

    2.10 Joining themes to their children

    Tables can be joined to themselves!

    In the themes table, which is available for you to inspect in the console, you’ll notice there is both an id column and a parent_id column. Keeping that in mind, you can join the themes table to itself to determine the parent-child relationships that exist for different themes.

    In the videos, you saw themes joined to their own parents. In this exercise, you’ll try a similar approach of joining themes to their own children, which is similar but reversed. Let’s try this out to discover what children the theme "Harry Potter" has.

    Instruction:

    • Inner join themes to their own children, resulting in the suffixes "_parent" and "_child", respectively.
    • Filter this table to find the children of the “Harry Potter” theme.
    themes %>% 
    # Inner join the themes table
    inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
    # Filter for the "Harry Potter" parent name 
    filter(name_parent == "Harry Potter")
    

    2.11 Joining themes to their grandchildren

    We can go a step further than looking at themes and their children. Some themes actually have grandchildren: their children’s children.

    Here, we can inner join themes to a filtered version of itself again to establish a connection between our last join’s children and their children.

    Instruction:
    Use another inner join to combine themes again with itself.

    • Be sure to use the suffixes "_parent" and "_grandchild" so the columns in the resulting table are clear.
    • Update the by argument to specify the correct columns to join on.
    # Join themes to itself again to find the grandchild relationships
    themes %>% 
    inner_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>% 
    inner_join(themes, by = c("id_child" = "parent_id"), suffix = c("_parent", "_grandchild"))
    

    2.12 Left-joining a table to itself

    So far, you’ve been inner joining a table to itself in order to find the children of themes like "Harry Potter" or "The Lord of the Rings".

    But some themes might not have any children at all, which means they won’t be included in the inner join. As you’ve learned in this chapter, you can identify those with a left_join and a filter().

    Instruction:

    • Left join the themes table to its own children, with the suffixes _parent and _child respectively.
    • Filter the result of the join to find themes that have no children.
    themes %>% 
    # Left join the themes table to its own children
    left_join(themes, by = c("id" = "parent_id"), suffix = c("_parent", "_child")) %>%
    # Filter for themes that have no child themes
    filter(is.na(id_child))
    

    3. Full, Semi, and Anti Joins

    3.1 The full_join verb (video)

    3.2 Differences between Batman and Star Wars

    In the video, you compared two sets. Now, you’ll compare two themes, each of which is made up of many sets.

    First, you’ll need to join in the themes. Recall that doing so requires going through the sets first. You’ll use the inventory_parts_joined table from the video, which is already available to you in the console.

    inventory_parts_joined <- inventories %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    arrange(desc(quantity)) %>%
    select(-id, -version)
    

    Instruction:

    • In order to join in the themes, you’ll first need to combine the sets and inventory_parts_joined tables.
    • Then, combine the themes table with your first join, using the suffix argument to clarify which table each name came from ("_set" or "_theme").
    inventory_parts_joined %>%
    # Combine the sets table with inventory_parts_joined 
    inner_join(sets, by = "set_num") %>%
    # Combine the themes table with your first join 
    inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
    

    3.3 Aggregating each theme

    Previously, you combined tables to compare themes. Before doing this comparison, you’ll want to aggregate the data to learn more about the pieces that are a part of each theme, as well as the colors of those pieces.

    The table you created previously has been preloaded for you as inventory_sets_themes. It was filtered for each theme, and the objects have been saved as batman and star_wars.

    inventory_sets_themes <- inventory_parts_joined %>%
    inner_join(sets, by = "set_num") %>%
    inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
    
    batman <- inventory_sets_themes %>%
    filter(name_theme == "Batman")
    
    star_wars <- inventory_sets_themes %>%
    filter(name_theme == "Star Wars")
    

    Instruction:

    • Count the part number and color id for the parts in Batman and Star Wars, weighted by quantity.
    # Count the part number and color id, weight by quantity
    batman %>%
    count(part_num, color_id, wt = quantity)
    
    star_wars %>%
    count(part_num, color_id, wt = quantity)
    

    3.4 Full-joining Batman and Star Wars LEGO parts

    Now that you’ve got separate tables for the pieces in the batman and star_wars themes, you’ll want to combine them to see any similarities or differences between the two themes. The aggregated tables from the last exercise have been saved as batman_parts and star_wars_parts and are preloaded for you.

    batman_parts <- batman %>%
    count(part_num, color_id, wt = quantity)
    
    star_wars_parts <- star_wars %>%
    count(part_num, color_id, wt = quantity)
    

    Instruction:

    • Combine the star_wars_parts table with the batman_parts table; use the suffix argument to include the "_batman" and "_star_wars" suffixes.
    • Replace all the NA values in the n_batman and n_star_wars columns with 0s.
    batman_parts %>%
    # Combine the star_wars_parts table 
    full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
    # Replace NAs with 0s in the n_batman and n_star_wars columns 
    replace_na(list(n_batman = 0, n_star_wars = 0))
    

    3.5 Comparing Batman and Star Wars LEGO parts

    The table you created in the last exercise includes the part number of each piece, the color id, and the number of each piece in the Star Wars and Batman themes. However, we have more information about each of these parts that we can gain by combining this table with some of the information we have in other tables. Before we compare the themes, let’s ensure that we have enough information to make our findings more interpretable. The table from the last exercise has been saved as parts_joined and is preloaded for you.

    parts_joined <- batman_parts %>%
    full_join(star_wars_parts, by = c("part_num", "color_id"), suffix = c("_batman", "_star_wars")) %>%
    replace_na(list(n_batman = 0, n_star_wars = 0))
    

    Instruction:

    • Sort the number of Star Wars pieces in the parts_joined table in descending order.
    • Join the colors table to the parts_joined table.
    • Combine the parts table with the previous join; add "_color" and "_part" suffixes to specify whether the information came from the colors table or the parts table.
    parts_joined %>%
    # Sort the number of star wars pieces in descending order 
    arrange(desc(n_star_wars)) %>%
    # Join the colors table to the parts_joined table
    inner_join(colors, by = c("color_id" = "id")) %>%
    # Join the parts table to the previous join 
    inner_join(parts, by = "part_num", suffix = c("_color", "_part"))
    

    3.6 The semi- and anti-join verbs (video)

    3.7 Select the join

    3.8 Something within one set but not another

    In the videos, you learned how to filter using the semi- and anti-join verbs to answer questions you have about your data. Let’s focus on the batwing dataset, and use our skills to determine which parts are in both the batwing and batmobile sets, and which parts are in one, but not the other. While answering these questions, we’ll also be determining whether or not the parts we’re looking at in both sets also have the same color in common.

    The batmobile and batwing datasets have been preloaded for you.

    batmobile <- inventory_parts_joined %>%
    filter(set_num == "7784-1") %>%
    select(-set_num)
    
    batwing <- inventory_parts_joined %>%
    filter(set_num == "70916-1") %>%
    select(-set_num)
    

    Instruction:

    • Filter the batwing set for parts that are also in the batmobile, whether or not they have the same color.
    • Filter the batwing set for parts that aren’t also in the batmobile, whether or not they have the same color.
    # Filter the batwing set for parts that are also in the batmobile set
    batwing %>%
    semi_join(batmobile, by = "part_num")
    
    # Filter the batwing set for parts that aren't in the batmobile set
    batwing %>%
    anti_join(batmobile, by = "part_num")
    

    3.9 What colors are included in at least one set?

    Besides comparing two sets directly, you could also use a filtering join like semi_join to find out which colors ever appear in any inventory part. Some of the colors could be optional, meaning they aren’t included in any sets.

    The inventory_parts and colors tables have been preloaded for you.

    Instruction:
    Use the inventory_parts table to find the colors that are included in at least one set.

    # Use inventory_parts to find colors included in at least one set
    colors %>%
    semi_join(inventory_parts, by = c("id" = "color_id"))
    

    3.10 Which set is missing version 1?

    Each set included in the LEGO data has an associated version number. We want to understand the version we are looking at to learn more about the parts that are included. Before doing that, we should confirm that there aren’t any sets that are missing a particular version.

    Let’s start by looking at the first version of each set to see if there are any sets that don’t include a first version.

    Instruction:

    • Use filter() to extract version 1 from the inventories table; save the filter to version_1_inventories.
    • Use anti_join to combine version_1_inventories with sets to determine which set is missing a version 1.
    # Use filter() to extract version 1 
    version_1_inventories <- inventories %>%
    filter(version == 1)
    
    # Use anti_join() to find which set is missing a version 1
    sets %>%
    anti_join(version_1_inventories, by = "set_num")
    

    3.11 Visualizing set differences (video)

    3.12 Aggregating sets to look at their differences

    To compare two individual sets, and the kinds of LEGO pieces that comprise them, we’ll need to aggregate the data into separate themes. Additionally, as we saw in the video, we’ll want to add a column so that we can understand the percentages of specific pieces that are part of each set, rather than looking at the numbers of pieces alone.

    The inventory_parts_themes table has been preloaded for you.

    inventory_parts_themes <- inventories %>%
    inner_join(inventory_parts, by = c("id" = "inventory_id")) %>%
    arrange(desc(quantity)) %>%
    select(-id, -version) %>%
    inner_join(sets, by = "set_num") %>%
    inner_join(themes, by = c("theme_id" = "id"), suffix = c("_set", "_theme"))
    

    Instruction:

    • Add a filter for the Batman set to create the batman_colors object.
    • Add a percent column to batman_colors that displays the total divided by the sum of the total.
    • Filter and aggregate the Star Wars set data to create the star_wars_colors object; add a percent column to the object to display the percent of the total.
    batman_colors <- inventory_parts_themes %>%
    # Filter the inventory_parts_themes table for the Batman theme
    filter(name_theme == "Batman") %>%
    group_by(color_id) %>%
    summarize(total = sum(quantity)) %>%
    # Add a percent column of the total divided by the sum of the total 
    mutate(percent = total / sum(total))
    
    # Filter and aggregate the Star Wars set data; add a percent column
    star_wars_colors <- inventory_parts_themes %>%
    filter(name_theme == "Star Wars") %>%
    group_by(color_id) %>%
    summarize(total = sum(quantity)) %>%
    mutate(percent = total / sum(total))
    

    3.13 Combining sets

    The data you aggregated in the last exercise has been preloaded for you as batman_colors and star_wars_colors. Prior to visualizing the data, you’ll want to combine these tables to be able to directly compare the themes’ colors.

    batman_colors <- inventory_parts_themes %>%
    filter(name_theme == "Batman") %>%
    group_by(color_id) %>%
    summarize(total = sum(quantity)) %>%
    mutate(percent = total / sum(total))
    
    star_wars_colors <- inventory_parts_themes %>%
    filter(name_theme == "Star Wars") %>%
    group_by(color_id) %>%
    summarize(total = sum(quantity)) %>%
    mutate(percent = total / sum(total))
    

    Instruction 1:

    • Join the batman_colors and star_wars_colors tables; be sure to include all observations from both tables.
    • Replace the NAs in the total_batman and total_star_wars columns.
    batman_colors %>%
    # Join the Batman and Star Wars colors
    full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
    # Replace NAs in the total_batman and total_star_wars columns
    replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
    inner_join(colors, by = c("color_id" = "id")) 
    

    Instruction 2:

    • Add a difference column that calculates the difference between percent_batman and percent_star_wars, and a total column, which is the sum of total_batman and total_star_wars.
    • Add a filter to select observations where total is at least 200.
    batman_colors %>%
    full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
    replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
    inner_join(colors, by = c("color_id" = "id")) %>%
    # Create the difference and total columns
    mutate(difference = percent_batman - percent_star_wars,
           total = total_batman + total_star_wars) %>%
    # Filter for totals of at least 200
    filter(total >= 200)
    

    3.14 Visualizing the difference: Batman and Star Wars

    In the last exercise, you created colors_joined. Now you’ll create a bar plot with one bar for each color (name), showing the difference in percentages.

    Because factors and visualization are beyond the scope of this course, we’ve done some processing for you: here is the code that created the colors_joined table used in the plot below.

    colors_joined <- batman_colors %>%
    full_join(star_wars_colors, by = "color_id", suffix = c("_batman", "_star_wars")) %>%
    replace_na(list(total_batman = 0, total_star_wars = 0)) %>%
    inner_join(colors, by = c("color_id" = "id")) %>%
    mutate(difference = percent_batman - percent_star_wars,
                total = total_batman + total_star_wars) %>%
    filter(total >= 200) %>%
    mutate(name = fct_reorder(name, difference)) 
    

    Instruction:
    Create a bar plot using the colors_joined table to display the most prominent colors in the Batman and Star Wars themes, with the bars colored by their name.

    # Create a bar plot using colors_joined and the name and difference columns
    ggplot(colors_joined, aes(name, difference, fill = name)) +
    geom_col() +
    coord_flip() +
    scale_fill_manual(values = color_palette, guide = FALSE) +
    labs(y = "Difference: Batman - Star Wars")
    

    4. Case Study: Joins and Stack Overflow Data

    4.1 Stack overflow questions (video)

    4.2 Left-joining questions and tags

    Three of the Stack Overflow survey datasets are questions, question_tags, and tags:

    • questions: an ID and the score, or how many times the question has been upvoted; the data only includes R-based questions
    • question_tags: a tag ID for each question and the question’s id
    • tags: a tag id and the tag’s name, which can be used to identify the subject of each question, such as ggplot2 or dplyr

    In this exercise, we’ll be stitching together these datasets and replacing NAs in important fields.

    Note that we’ll be using left_joins in this exercise to ensure we keep all questions, even those without a corresponding tag. However, since we know the questions data is all R data, we’ll want to manually tag these as R questions with replace_na.

    Instruction 1:

    • Join together questions and question_tags using the id and question_id columns, respectively.
    # Join the questions and question_tags tables
    questions %>%
    left_join(question_tags, by = c("id" = "question_id"))
    

    Instruction 2:

    • Use another join to add in the tags table.
    # Join in the tags table
    questions %>%
    left_join(question_tags, by = c("id" = "question_id")) %>%
    left_join(tags, by = c("tag_id" = "id"))
    

    Instruction 3:

    • Use replace_na to change the NAs in the tag_name column to "only-r".
    # Replace the NAs in the tag_name column
    questions %>%
    left_join(question_tags, by = c("id" = "question_id")) %>%
    left_join(tags, by = c("tag_id" = "id")) %>%
    replace_na(list(tag_name = "only-r"))
    

    4.3 Comparing scores across tags

    The complete dataset you created in the last exercise is available to you as questions_with_tags. Let’s do a quick bit of analysis on it! You’ll use familiar dplyr verbs like group_by, summarize, arrange, and n() to find out the average score of the most asked questions.

    Instruction:

    • Aggregate by the tag_name.
    • Summarize to get the total number of questions, num_questions, as well as the mean score for each question, score.
    • Arrange num_questions in descending order to sort the answers by the most asked questions.
    questions_with_tags %>%
    # Group by tag_name
    group_by(tag_name) %>%
    # Get mean score and num_questions
    summarize(score = mean(score),
              num_questions = n()) %>%
    # Sort num_questions in descending order
    arrange(desc(num_questions))
    

    4.4 What tags never appear on R questions?

    The tags table includes all Stack Overflow tags, but some have nothing to do with R. How could you filter for just the tags that never appear on an R question? The tags and question_tags tables have been preloaded for you.

    Instruction:

    • Use a join to determine which tags never appear on an R question.
    # Using a join, filter for tags that are never on an R question
    tags %>%
    anti_join(question_tags, by = c("id" = "tag_id"))
    

    4.5 Joining questions and answers (video)

    4.6 Finding gaps between questions and answers

    Now we’ll join together questions with answers so we can measure the time between questions and answers.

    Instruction:

    • Use an inner join to combine the questions and answers tables using the suffixes "_question" and "_answer", respectively.
    • Subtract creation_date_question from creation_date_answer within the as.integer() function to create the gap column.
    questions %>%
    # Inner join questions and answers with proper suffixes
    inner_join(answers, by = c("id" = "question_id"), suffix = c("_question", "_answer")) %>%
    # Subtract creation_date_question from creation_date_answer to create gap
    mutate(gap = as.integer(creation_date_answer - creation_date_question))
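
    A note on the date arithmetic: subtracting two dates in R yields a difftime, and as.integer() strips its units. If the creation dates are Date objects (as they appear to be here), the gap is measured in days; with POSIXct timestamps the units can vary, which is why converting deliberately matters. A throwaway example:

    # Invented dates, just to show the mechanics
    q_date <- as.Date("2019-01-01")
    a_date <- as.Date("2019-01-04")

    a_date - q_date              # a difftime: "Time difference of 3 days"
    as.integer(a_date - q_date)  # a plain integer: 3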
    

    4.7 Joining question and answer counts

    We can also determine how many questions actually yield answers. If we count the number of answers for each question, we can then join the answer counts with the questions table.

    Instruction:

    • Count and sort the question_id column in the answers table to create the answer_counts table.
    • Join the questions table with the answer_counts table.
    • Replace the NA values in the n column with 0s.
    # Count and sort the question_id column in the answers table
    answer_counts <- answers %>%
    count(question_id, sort = TRUE)
    
    # Combine the answer_counts and questions tables
    questions %>%
    left_join(answer_counts, by = c("id" = "question_id")) %>%
    # Replace the NAs in the n column
    replace_na(list(n = 0))
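
    As a side note, count() is shorthand for a grouped summarize, which is why no explicit group_by is needed when building answer_counts above; the two sketches below are equivalent:

    answers %>%
    count(question_id)

    # ...does the same as:
    answers %>%
    group_by(question_id) %>%
    summarize(n = n())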
    

    4.8 Joining questions, answers and tags

    Let’s build on the last exercise by adding the tags table to our previous joins. This will allow us to do a better job of identifying which R topics get the most traction on Stack Overflow. The tables you created in the last exercise have been preloaded for you as answer_counts and question_answer_counts.

    answer_counts <- answers %>%
    count(question_id, sort = TRUE)
    
    question_answer_counts <- questions %>%
    left_join(answer_counts, by = c("id" = "question_id")) %>%
    replace_na(list(n = 0))
    

    Instruction:

    • Combine the question_tags table with question_answer_counts using an inner_join.
    • Now, use another inner_join to add the tags table.
    question_answer_counts %>%
    # Join the question_tags tables
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    # Join the tags table
    inner_join(tags, by = c("tag_id" = "id"))
    

    4.9 Average answers by question

    The table you created in the last exercise has been preloaded for you as tagged_answers. You can use this table to determine, on average, how many answers each question gets.

    tagged_answers <- question_answer_counts %>%
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    inner_join(tags, by = c("tag_id" = "id"))
    

    Some of the important variables from this table include: n, the number of answers for each question, and tag_name, the name of each tag associated with each question.

    Let’s use some of our favorite dplyr verbs to find out how many answers each question gets on average.

    Instruction:

    • Aggregate the tagged_answers table by tag_name.
    • Summarize tagged_answers to get the count of questions and the average_answers.
    • Sort the resulting questions column in descending order.
    tagged_answers %>%
    # Aggregate by tag_name
    group_by(tag_name) %>%
    # Summarize questions and average_answers
    summarize(questions = n(),
              average_answers = mean(n)) %>%
    # Sort the questions in descending order
    arrange(desc(questions))
    

    4.10 The bind_rows verb (video)

    4.11 Joining questions and answers with tags

    To learn more about the questions and answers table, you’ll want to use the question_tags table to understand the tags associated with each question that was asked, and each answer that was provided. You’ll be able to combine these tables using two inner joins on both the questions table and the answers table.

    Instruction:

    • Use two inner joins to combine the question_tags and tags tables with the questions table.
    • Now, use two inner joins to combine the question_tags and tags tables with the answers table.
     # Inner join the question_tags and tags tables with the questions table
    questions %>%
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    inner_join(tags, by = c("tag_id" = "id"))
    
    # Inner join the question_tags and tags tables with the answers table
    answers %>%
    inner_join(question_tags, by = "question_id") %>%
    inner_join(tags, by = c("tag_id" = "id"))
    

    4.12 Binding and counting posts with tags

    The tables you created in the previous exercise have been preloaded as questions_with_tags and answers_with_tags. First, you'll want to combine these tables into a single table called posts_with_tags. Once the information is consolidated into a single table, you can add more information by creating a year variable from each post's creation date using the lubridate package, which has been preloaded for you.
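
    If year() is new to you: it simply extracts the year from a date. A throwaway example (not from the data):

    library(lubridate)

    year(as.Date("2019-12-27"))  # returns 2019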

    questions_with_tags <- questions %>%
    inner_join(question_tags, by = c("id" = "question_id")) %>%
    inner_join(tags, by = c("tag_id" = "id"))
    
    answers_with_tags <- answers %>%
    inner_join(question_tags, by = "question_id") %>% 
    inner_join(tags, by = c("tag_id" = "id"))
    

    Instruction:

    • Combine the questions_with_tags and answers_with_tags tables into posts_with_tags.
    • Add a year column to the posts_with_tags table, then aggregate by type, year, and tag_name.
    # Combine the two tables into posts_with_tags
    posts_with_tags <- bind_rows(questions_with_tags %>% mutate(type = "question"),
                                 answers_with_tags %>% mutate(type = "answer"))
    
    # Add a year column, then aggregate by type, year, and tag_name
    posts_with_tags %>%
    mutate(year = year(creation_date)) %>%
    count(type, year, tag_name)
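
    If you prefer, bind_rows() can label the source of each row itself via named inputs and its .id argument; this sketch produces the same type column as the mutate calls above:

    posts_with_tags <- bind_rows(question = questions_with_tags,
                                 answer = answers_with_tags,
                                 .id = "type")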
    

    4.13 Visualizing questions and answers in tags

    In the last exercise, you modified the posts_with_tags table to add a year column, and aggregated by type, year, and tag_name. The modified table has been preloaded for you as by_type_year_tag, and has one observation for each type (question/answer), year, and tag. Let’s create a plot to examine the information that the table contains about questions and answers for the dplyr and ggplot2 tags. The ggplot2 package has been preloaded for you.

    by_type_year_tag <- posts_with_tags %>%
    mutate(year = year(creation_date)) %>%
    count(type, year, tag_name)
    

    Instruction:

    • Filter the by_type_year_tag table for the dplyr and ggplot2 tags.
    • Create a line plot with that filtered table that plots the frequency (n) over time, colored by question/answer and faceted by tag.
    # Filter for the dplyr and ggplot2 tag names 
    by_type_year_tag_filtered <- by_type_year_tag %>%
    filter(tag_name == "dplyr" | tag_name == "ggplot2")
    
    # Create a line plot faceted by the tag name 
    ggplot(by_type_year_tag_filtered, aes(year, n, color = type)) +
    geom_line() +
    facet_wrap(~ tag_name)
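
    A slightly more compact way to write the same filter, in case you find it more readable:

    by_type_year_tag_filtered <- by_type_year_tag %>%
    filter(tag_name %in% c("dplyr", "ggplot2"))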
    

    4.14 Congratulations!

  • R language: attach, detach, and with


    In R, when working with the data in a list or a data frame, you can use attach() or with() to avoid repeatedly typing the object name.

    1. attach()

    Suppose a data frame contains the columns name and age.

    After attach(onedata.frame), you can refer to the elements of onedata.frame directly, for example:

    (1) Create a test data frame

    > name<-c("Zhangshan","Lisi","Wangwu","Zhaoliu")
    > age<-c(20,30,40,50)
    > onedata.frame<-data.frame(name,age)
    > onedata.frame
           name age
    1 Zhangshan  20
    2      Lisi  30
    3    Wangwu  40
    4   Zhaoliu  50

    (2) Test attach()

    > attach(onedata.frame)
    The following objects are masked _by_ .GlobalEnv:

        age, name

    > age
    [1] 20 30 40 50
    > name
    [1] "Zhangshan" "Lisi"      "Wangwu"    "Zhaoliu"  
    > detach(onedata.frame)

    > name
    Error: object 'name' not found
    As you can see, the columns of the data frame are only accessible between the attach() and detach() calls.
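
    The masking warning above is worth a closer look: because name and age also exist in the global environment, they shadow the attached columns. A quick illustration with throwaway values:

    > age <- c(1, 2, 3, 4)    # overwrite the global age
    > attach(onedata.frame)
    > age                     # the global object wins, not the column
    [1] 1 2 3 4
    > onedata.frame$age       # the column itself is untouched
    [1] 20 30 40 50
    > detach(onedata.frame)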

    2. with()

    with() is similar to JavaScript's with: the data frame's names are only valid inside the expression. For example:

    > with(onedata.frame, {
        name
    })

    [1] Zhangshan Lisi      Wangwu    Zhaoliu  
    Levels: Lisi Wangwu Zhangshan Zhaoliu

    One problem with with() is that variables assigned inside it cannot be accessed outside:

    > with(onedata.frame, {name1 <- name})
    > name1
    Error: object 'name1' not found


    One workaround is the <<- assignment operator, for example:

    > with(onedata.frame, {name1 <<- name})
    > name1
    [1] Zhangshan Lisi      Wangwu    Zhaoliu  
    Levels: Lisi Wangwu Zhangshan Zhaoliu
    >
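
    A cleaner alternative to <<-: with() returns the value of its last expression, so you can simply assign the result outside:

    > name1 <- with(onedata.frame, name)
    > name1
    [1] Zhangshan Lisi      Wangwu    Zhaoliu  
    Levels: Lisi Wangwu Zhangshan Zhaoliu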




  • Attention and Memory in Deep Learning and NLP

    A recent trend in Deep Learning is Attention Mechanisms. In an interview, Ilya Sutskever, now the research director of OpenAI, mentioned that Attention Mechanisms are one of the most exciting advancements, and that they are here to stay. That sounds exciting. But what are Attention Mechanisms?

    Attention Mechanisms in Neural Networks are (very) loosely based on the visual attention mechanism found in humans. Human visual attention is well-studied and while there exist different models, all of them essentially come down to being able to focus on a certain region of an image with “high resolution” while perceiving the surrounding image in “low resolution”, and then adjusting the focal point over time.

    Attention in Neural Networks has a long history, particularly in image recognition. Examples include Learning to combine foveal glimpses with a third-order Boltzmann machine or Learning where to Attend with Deep Architectures for Image Tracking. But only recently have attention mechanisms made their way into the recurrent neural network architectures that are typically used in NLP (and increasingly also in vision). That’s what we’ll focus on in this post.

    What problem does Attention solve?

    To understand what attention can do for us, let’s use Neural Machine Translation (NMT) as an example. Traditional Machine Translation systems typically rely on sophisticated feature engineering based on the statistical properties of text. In short, these systems are complex, and a lot of engineering effort goes into building them. Neural Machine Translation systems work a bit differently. In NMT, we map the meaning of a sentence into a fixed-length vector representation and then generate a translation based on that vector. By not relying on things like n-gram counts and instead trying to capture the higher-level meaning of a text, NMT systems generalize to new sentences better than many other approaches. Perhaps more importantly, NMT systems are much easier to build and train, and they don’t require any manual feature engineering. In fact, a simple implementation in Tensorflow is no more than a few hundred lines of code.

    Most NMT systems work by encoding the source sentence (e.g. a German sentence) into a vector using a Recurrent Neural Network, and then decoding an English sentence based on that vector, also using an RNN.

    RNN for Machine Translation

    In the picture above, the words “Echt”, “Dicke” and “Kiste” are fed into an encoder, and after a special signal (not shown) the decoder starts producing a translated sentence. The decoder keeps generating words until a special end-of-sentence token is produced. Here, the h vectors represent the internal state of the encoder.

    If you look closely, you can see that the decoder is supposed to generate a translation solely based on the last hidden state (h_3 above) from the encoder. This h_3 vector must encode everything we need to know about the source sentence. It must fully capture its meaning. In more technical terms, that vector is a sentence embedding. In fact, if you plot the embeddings of different sentences in a low-dimensional space using PCA or t-SNE for dimensionality reduction, you can see that semantically similar phrases end up close to each other. That’s pretty amazing.

    Still, it seems somewhat unreasonable to assume that we can encode all information about a potentially very long sentence into a single vector and then have the decoder produce a good translation based on only that. Let’s say your source sentence is 50 words long. The first word of the English translation is probably highly correlated with the first word of the source sentence. But that means the decoder has to consider information from 50 steps ago, and that information needs to be somehow encoded in the vector. Recurrent Neural Networks are known to have problems dealing with such long-range dependencies. In theory, architectures like LSTMs should be able to deal with this, but in practice long-range dependencies are still problematic. For example, researchers have found that reversing the source sequence (feeding it backwards into the encoder) produces significantly better results because it shortens the path from the decoder to the relevant parts of the encoder. Similarly, feeding an input sequence twice also seems to help a network to better memorize things.

    I consider the approach of reversing a sentence a “hack”. It makes things work better in practice, but it’s not a principled solution. Most translation benchmarks are done on languages like French and German, which are quite similar to English (even Chinese word order is quite similar to English). But there are languages (like Japanese) where the last word of a sentence could be highly predictive of the first word in an English translation. In that case, reversing the input would make things worse. So, what’s an alternative? Attention Mechanisms.

    With an attention mechanism we no longer try to encode the full source sentence into a fixed-length vector. Rather, we allow the decoder to “attend” to different parts of the source sentence at each step of the output generation. Importantly, we let the model learn what to attend to based on the input sentence and what it has produced so far. So, in languages that are pretty well aligned (like English and German) the decoder would probably choose to attend to things sequentially: attending to the first word when producing the first English word, and so on. That’s what was done in Neural Machine Translation by Jointly Learning to Align and Translate, and it looks as follows:

    NMT Attention

    Here, the y‘s are our translated words produced by the decoder, and the x‘s are our source sentence words. The above illustration uses a bidirectional recurrent network, but that’s not important and you can just ignore the inverse direction. The important part is that each decoder output word y_t now depends on a weighted combination of all the input states, not just the last state. The a‘s are weights that define how much of each input state should be considered for each output. So, if a_{3,2} is a large number, this would mean that the decoder pays a lot of attention to the second state in the source sentence while producing the third word of the target sentence. The a‘s are typically normalized to sum to 1 (so they are a distribution over the input states).
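
    As a rough numerical sketch (in R, with made-up numbers, not any real model), the context used for one output step is just a softmax-weighted sum of the encoder states:

    # Three invented encoder states, one per row, each 4-dimensional
    h <- matrix(c(0.1, 0.2, 0.3, 0.4,
                  0.5, 0.1, 0.0, 0.2,
                  0.3, 0.3, 0.1, 0.1), nrow = 3, byrow = TRUE)

    scores <- c(1.0, 3.0, 0.5)           # invented alignment scores for this step
    a <- exp(scores) / sum(exp(scores))  # softmax: the weights sum to 1
    context <- colSums(a * h)            # weighted combination of all input states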

    A big advantage of attention is that it gives us the ability to interpret and visualize what the model is doing. For example, by visualizing the attention weight matrix a when a sentence is translated, we can understand how the model is translating:

    NMT Attention Matrix

    Here we see that while translating from French to English, the network attends sequentially to each input state, but sometimes it attends to two words at a time while producing an output, as when translating “la Syrie” to “Syria”, for example.

    The Cost of Attention

    If we look a bit more closely at the equation for attention, we can see that attention comes at a cost. We need to calculate an attention value for each combination of input and output word. If you have a 50-word input sequence and generate a 50-word output sequence, that would be 2,500 attention values. That’s not too bad, but if you do character-level computations and deal with sequences consisting of hundreds of tokens, the above attention mechanisms can become prohibitively expensive.

    Actually, that’s quite counterintuitive. Human attention is something that’s supposed to save computational resources. By focusing on one thing, we can neglect many other things. But that’s not really what we’re doing in the above model. We’re essentially looking at everything in detail before deciding what to focus on. Intuitively, that’s equivalent to outputting a translated word, and then going back through all of your internal memory of the text in order to decide which word to produce next. That seems like a waste, and not at all what humans are doing. In fact, it’s more akin to memory access, not attention, which in my opinion makes the name somewhat of a misnomer (more on that below). Still, that hasn’t stopped attention mechanisms from becoming quite popular and performing well on many tasks.

    An alternative approach to attention is to use Reinforcement Learning to predict an approximate location to focus on. That sounds a lot more like human attention, and that’s what’s done in Recurrent Models of Visual Attention.

    Attention beyond Machine Translation

    So far we’ve looked at attention applied to Machine Translation. But the same attention mechanism from above can be applied to any recurrent model. So let’s look at a few more examples.

    In Show, Attend and Tell the authors apply attention mechanisms to the problem of generating image descriptions. They use a Convolutional Neural Network to “encode” the image, and a Recurrent Neural Network with attention mechanisms to generate a description. By visualizing the attention weights (just like in the translation example), we can interpret what the model is looking at while generating a word:

    Show, Attend and Tell Attention Visualization

    In Grammar as a Foreign Language, the authors use a Recurrent Neural Network with an attention mechanism to generate sentence parse trees. The visualized attention matrix gives insight into how the network generates those trees:

    Grammar as a Foreign Language attention matrix

    In Teaching Machines to Read and Comprehend, the authors use an RNN to read a text, read a (synthetically generated) question, and then produce an answer. By visualizing the attention matrix we can see where the network “looks” while it tries to find the answer to the question:

    Teaching Machines to Read And Comprehend Attention

    Attention = (Fuzzy) Memory?

    The basic problem that the attention mechanism solves is that it allows the network to refer back to the input sequence, instead of forcing it to encode all information into one fixed-length vector. As I mentioned above, I think that attention is somewhat of a misnomer. Interpreted another way, the attention mechanism is simply giving the network access to its internal memory, which is the hidden state of the encoder. In this interpretation, instead of choosing what to “attend” to, the network chooses what to retrieve from memory. Unlike typical memory, the memory access mechanism here is soft, which means that the network retrieves a weighted combination of all memory locations, not a value from a single discrete location. Making the memory access soft has the benefit that we can easily train the network end-to-end using backpropagation (though there have been non-fuzzy approaches where the gradients are calculated using sampling methods instead of backpropagation).

    Memory Mechanisms themselves have a much longer history. The hidden state of a standard Recurrent Neural Network is itself a type of internal memory. RNNs suffer from the vanishing gradient problem that prevents them from learning long-range dependencies. LSTMs improved upon this by using a gating mechanism that allows for explicit memory deletes and updates.

    The trend towards more complex memory structures is now continuing. End-to-End Memory Networks allow the network to read the same input sequence multiple times before producing an output, updating the memory contents at each step. For example, answering a question by making multiple reasoning steps over an input story. However, when the network’s parameter weights are tied in a certain way, the memory mechanism in End-to-End Memory Networks is identical to the attention mechanism presented here, except that it makes multiple hops over the memory (because it tries to integrate information from multiple sentences).

    Neural Turing Machines use a similar form of memory mechanism, but with a more sophisticated type of addressing that uses both content-based (as here) and location-based addressing, allowing the network to learn addressing patterns to execute simple computer programs, like sorting algorithms.

    It’s likely that in the future we will see a clearer distinction between memory and attention mechanisms, perhaps along the lines of Reinforcement Learning Neural Turing Machines, which try to learn access patterns to deal with external interfaces.

    from: http://www.wildml.com/2016/01/attention-and-memory-in-deep-learning-and-nlp/

