tidyr is a key component of the tidyverse, designed to tidy and reshape data into a consistent, “tidy” format where each variable is a column, each observation is a row, and each type of observational unit forms a table. It excels at handling missing values, reshaping data (e.g., pivoting), and separating or uniting columns, making it a go-to tool for cleaning messy datasets.
Key Concepts:
- Works with data frames or tibbles.
- Focuses on reshaping data and managing missingness explicitly or implicitly.
- Integrates seamlessly with the pipe operator (%>%) for chaining operations.
- Ideal for preparing data for analysis by ensuring it adheres to tidy principles.
Learning Path:
- Start with pivot_longer() and pivot_wider() to reshape data.
- Use drop_na() and replace_na() to handle missing values.
- Explore separate() and unite() for splitting or combining columns.
- Master fill() and complete() for dealing with implicit missingness.
Inducing Missing Values into mtcars
Since mtcars has no missing values by default, let’s introduce some programmatically to demonstrate tidyr’s capabilities. Here’s how we’ll modify mtcars to include NA values:
library(dplyr) # For manipulation
library(tidyr) # For tidying
# Set a seed for reproducibility
set.seed(123)
# Create a copy of mtcars with missing values
mtcars_missing <- mtcars
# Randomly replace 5 values in 'mpg' with NA
mtcars_missing$mpg[sample(1:nrow(mtcars_missing), 5)] <- NA
# Randomly replace 4 values in 'hp' with NA
mtcars_missing$hp[sample(1:nrow(mtcars_missing), 4)] <- NA
# Randomly replace 3 values in 'wt' with NA
mtcars_missing$wt[sample(1:nrow(mtcars_missing), 3)] <- NA
# View the first few rows to confirm missing values
head(mtcars_missing)
Now, mtcars_missing has NA values in mpg, hp, and wt, which we’ll use to showcase tidyr functions.
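A quick check of how many values are actually missing can be sketched as follows. The snippet rebuilds mtcars_missing with the same seed so it runs on its own, and uses only base R; the name na_counts is introduced here purely for illustration.

```r
# Rebuild the example data so this snippet is self-contained
set.seed(123)
mtcars_missing <- mtcars
mtcars_missing$mpg[sample(seq_len(nrow(mtcars_missing)), 5)] <- NA
mtcars_missing$hp[sample(seq_len(nrow(mtcars_missing)), 4)] <- NA
mtcars_missing$wt[sample(seq_len(nrow(mtcars_missing)), 3)] <- NA

# Count missing values per column (na_counts is an illustrative name)
na_counts <- colSums(is.na(mtcars_missing))
na_counts[c("mpg", "hp", "wt")]

# Number of rows with no missing values in any column
sum(complete.cases(mtcars_missing))
```

Because sample() draws without replacement, each column gets exactly the requested number of NAs; the number of fully complete rows depends on how much the three samples overlap.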
Comprehensive List of Key tidyr Functions with mtcars_missing
Here’s a list of key tidyr functions demonstrated with mtcars_missing, with the emphasis on how each one handles or addresses NAs.
1. drop_na() – Remove rows with any missing values. Drops all rows with NA in any column. Only complete cases remain (fewer than 32 rows).
mtcars_missing %>% drop_na()
2. replace_na() – Replace missing values with a specified value. Replaces NA in specified columns with a value (here, 0). NAs in mpg, hp, and wt become 0.
mtcars_missing %>% replace_na(list(mpg = 0, hp = 0, wt = 0))
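Replacing with 0 can distort summaries such as means, so a column statistic is sometimes substituted instead. The sketch below is one possible variant rather than part of the walkthrough: it assumes a small made-up tibble (toy) and uses the vector form of replace_na() inside dplyr::mutate() to fill in the column mean.

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data with one missing mpg value
toy <- tibble(mpg = c(10, NA, 30))

# Vector form: replace_na(x, replacement) operates on a single column
imputed <- toy %>%
  mutate(mpg = replace_na(mpg, mean(mpg, na.rm = TRUE)))

imputed$mpg
```

The replacement must be a single value of a compatible type, which is why the mean is computed with na.rm = TRUE before it is passed in.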
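3. fill() – Fill missing values with the previous non-missing value. Fills each NA in mpg with the last non-missing value above it in the current row order. Note that arrange(cyl) only sorts the rows: fill() does not stop at cylinder boundaries unless the data is also grouped with group_by(cyl), and an NA in the very first row stays NA because nothing precedes it.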
mtcars_missing %>% arrange(cyl) %>% fill(mpg, .direction = "down")
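The difference between sorting and grouping matters for fill(). As a minimal sketch, assuming a made-up four-row tibble (toy), the two behaviours can be compared side by side:

```r
library(dplyr)
library(tidyr)

# Hypothetical toy data: the first cyl == 6 row has a missing mpg
toy <- tibble(
  cyl = c(4, 4, 6, 6),
  mpg = c(30, NA, NA, 20)
)

# Ungrouped: the value 30 is carried across the cyl boundary
ungrouped <- toy %>%
  fill(mpg, .direction = "down")

# Grouped: fill() works within each cyl group, so the leading NA
# in the cyl == 6 group has nothing above it and stays NA
grouped <- toy %>%
  group_by(cyl) %>%
  fill(mpg, .direction = "down") %>%
  ungroup()

ungrouped$mpg
grouped$mpg
```

Which behaviour is appropriate depends on whether a value from a neighbouring group is a sensible stand-in; with .direction = "downup" a grouped fill would also cover leading NAs.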
4. pivot_longer() – Reshape wide data to long format, preserving NAs. Converts mpg, hp, and wt into a long format; NAs remain as NA.
mtcars_missing %>%
pivot_longer(cols = c(mpg, hp, wt), names_to = "variable", values_to = "value")
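If the NA rows are not wanted in the long format, pivot_longer() can drop them while reshaping through its values_drop_na argument. A minimal sketch, assuming a made-up two-row tibble (toy_wide):

```r
library(tidyr)

# Hypothetical toy data with one missing mpg value
toy_wide <- tibble(
  car = c("a", "b"),
  mpg = c(21, NA),
  hp  = c(110, 93)
)

# Default: the NA becomes a row with value = NA
kept <- toy_wide %>%
  pivot_longer(cols = c(mpg, hp), names_to = "variable", values_to = "value")

# values_drop_na = TRUE: rows whose value is NA are omitted
dropped <- toy_wide %>%
  pivot_longer(cols = c(mpg, hp), names_to = "variable", values_to = "value",
               values_drop_na = TRUE)

nrow(kept)
nrow(dropped)
```

Dropping the rows turns explicit missing values into implicit ones, so it is best reserved for cases where the NAs carry no information.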
5. pivot_wider() – Reshape long data to wide format, introducing NAs if needed. Reverses pivot_longer(), restoring the original values with NAs intact. The result matches mtcars_missing except that it is a tibble without row names, and mpg, hp, and wt move to the last columns.
# First, create a long version, then pivot back
long <- mtcars_missing %>%
  pivot_longer(cols = c(mpg, hp, wt), names_to = "variable", values_to = "value")
long %>% pivot_wider(names_from = "variable", values_from = "value")
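pivot_wider() introduces NAs when a combination is absent from the long data. The sketch below illustrates this on a made-up long tibble (toy_long) in which car "b" has no hp row, and shows the values_fill argument as one way to supply a different filler:

```r
library(tidyr)

# Hypothetical long data: car "b" has no "hp" row
toy_long <- tibble(
  car      = c("a", "a", "b"),
  variable = c("mpg", "hp", "mpg"),
  value    = c(21, 110, 30)
)

# The missing car/variable combination becomes NA in the wide result
wide <- toy_long %>%
  pivot_wider(names_from = variable, values_from = value)

# values_fill supplies a value for absent combinations instead of NA
wide_filled <- toy_long %>%
  pivot_wider(names_from = variable, values_from = value, values_fill = 0)

wide
wide_filled
```

Note that values_fill only affects combinations that were absent from the long data; values that were already NA in the value column are passed through unchanged.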
6. separate() – Split a column into multiple columns (example with a fabricated column). Splits a combined column; NAs in other columns are unaffected. Note that ‘mtcars’ doesn’t have a natural column to split, so we create one.
mtcars_missing %>%
mutate(combined = paste(cyl, gear, sep = "-")) %>%
separate(combined, into = c("cyl_new", "gear_new"), sep = "-")
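7. unite() – Combine columns into one. Combines mpg and hp into one column. With na.rm = FALSE (the default), a missing value is pasted in as the literal string “NA” (for example “NA_110”); with na.rm = TRUE, missing values are dropped before the columns are joined.
mtcars_missing %>%
  unite("perf_combo", mpg, hp, sep = "_", na.rm = FALSE)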
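How unite() treats missing values is easiest to see on a tiny example. This is a minimal sketch assuming a made-up two-row tibble (toy) and an illustrative output column named perf:

```r
library(tidyr)

# Hypothetical toy data with one missing mpg value
toy <- tibble(mpg = c(21, NA), hp = c(110, 93))

# na.rm = FALSE: the missing value is pasted as the text "NA"
kept <- toy %>%
  unite("perf", mpg, hp, sep = "_", na.rm = FALSE)

# na.rm = TRUE: the missing value is removed before pasting
dropped <- toy %>%
  unite("perf", mpg, hp, sep = "_", na.rm = TRUE)

kept$perf
dropped$perf
```

In both cases the united column is character and contains no real NA, so is.na() on it returns FALSE for every row.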
8. complete() – Make implicit missing combinations explicit. Ensures every combination of cyl and gear appears as a row, filling the other columns with NA for combinations that were absent. In mtcars only the 8-cylinder, 4-gear combination is absent, so exactly one row is added.
mtcars_missing %>% complete(cyl, gear)
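To see which combinations complete() adds, it helps to run it on the original mtcars, where any NA must come from complete() itself. A minimal sketch (the names completed and added are illustrative):

```r
library(dplyr)
library(tidyr)

# Use the original mtcars (no NAs) so every NA below was added by complete()
completed <- mtcars %>% complete(cyl, gear)

nrow(mtcars)     # 32 rows before
nrow(completed)  # 33 rows after

# The cyl/gear combination that was only implicitly missing
added <- completed %>%
  filter(is.na(mpg)) %>%
  select(cyl, gear)

added
```

The single added row is the 8-cylinder, 4-gear combination, which no car in mtcars has.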
9. nest() – Create nested data frames. Nests data into a list-column by cyl; NAs are preserved in nested tibbles. The result is a tibble with one row per cyl and a list-column named data.
mtcars_missing %>%
group_by(cyl) %>%
nest()
10. unnest() – Expand nested data frames. Reverses nest(), restoring the original structure with NAs intact.
# Follow-up to nest()
mtcars_missing %>%
group_by(cyl) %>%
nest() %>%
unnest(cols = data)
Putting It Together!
Here are three combined examples using multiple tidyr functions with mtcars_missing, focusing on missing value handling:
Example 1: Cleaning and Reshaping
library(tidyr)
mtcars_missing %>%
drop_na(mpg) %>% # Remove rows where mpg is NA
pivot_longer(cols = c(mpg, hp, wt), names_to = "metric", values_to = "value") %>%
replace_na(list(value = 0)) %>% # Replace remaining NAs with 0
pivot_wider(names_from = "metric", values_from = "value") # Back to wide format
Drops rows with missing mpg, reshapes to long format, replaces remaining NAs with 0, and reshapes back to wide.
Example 2: Filling Missing Values
library(dplyr) # arrange() and select() come from dplyr
library(tidyr)
mtcars_missing %>%
arrange(cyl) %>% # Sort by cylinders
fill(mpg, hp, .direction = "down") %>% # Fill NA with previous values
select(mpg, hp, cyl) %>% # Select key columns
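complete(cyl) # Adds no rows here: every observed cyl value is already present
Sorts by cyl, fills NAs in mpg and hp with the previous non-missing value in that row order (fill() is not grouped, so values can carry across cyl boundaries), and selects three columns. The final complete(cyl) adds no rows, because with a single column it only expands over values that already occur in the data.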
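One point worth checking is what complete() does with a single column: it expands only over values that already occur, so it cannot add a level that is absent from the data. To add such a level, the full set of values can be supplied explicitly. A minimal sketch, assuming a made-up tibble (toy) in which cyl = 6 never occurs:

```r
library(tidyr)

# Hypothetical toy data: no 6-cylinder rows
toy <- tibble(cyl = c(4, 8), mpg = c(30, 15))

# Single column, no explicit values: nothing to add
same <- toy %>% complete(cyl)

# Explicit values: a row for cyl = 6 is created with mpg = NA
expanded <- toy %>% complete(cyl = c(4, 6, 8))

nrow(same)
expanded
```

Supplying the values explicitly is a design choice that makes the expected set of levels part of the code rather than something inferred from the data.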
Example 3: Combining and Separating with Missingness
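library(dplyr) # For mutate(), across(), and na_if()
library(tidyr)
mtcars_missing %>%
  unite("power_weight", hp, wt, sep = "_", na.rm = FALSE) %>% # Combine hp and wt
  separate(power_weight, into = c("horsepower", "weight"), sep = "_") %>% # Split back
  mutate(across(c(horsepower, weight), ~ na_if(.x, "NA"))) %>% # Turn pasted "NA" strings back into real NAs
  replace_na(list(horsepower = "missing", weight = "missing")) # Replace NAs
Combines hp and wt into one column, splits them back into character columns, converts the literal “NA” strings that unite() pasted in back into real NAs, and replaces those with the string “missing” for clarity.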
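A caveat with this round trip is that unite() with na.rm = FALSE pastes missing values as the text “NA”, so after separate() the columns hold strings rather than real NAs. One alternative, sketched here on a made-up tibble (toy), is separate()'s convert = TRUE option, which runs type conversion on the new columns and turns “NA” strings back into missing values:

```r
library(tidyr)

# Hypothetical toy data with one missing hp value
toy <- tibble(hp = c(110, NA), wt = c(2.62, 3.44))

# Plain round trip: the missing hp comes back as the string "NA"
round_trip <- toy %>%
  unite("power_weight", hp, wt, sep = "_", na.rm = FALSE) %>%
  separate(power_weight, into = c("horsepower", "weight"), sep = "_")

# convert = TRUE: columns are type-converted and "NA" becomes a real NA
converted <- toy %>%
  unite("power_weight", hp, wt, sep = "_", na.rm = FALSE) %>%
  separate(power_weight, into = c("horsepower", "weight"), sep = "_",
           convert = TRUE)

round_trip$horsepower
is.na(converted$horsepower)
```

With convert = TRUE the columns come back as numbers, so a later replace_na() would need a numeric replacement rather than a string such as “missing”.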