Factors in R are a data type used to represent categorical variables: data that can take on a limited, fixed set of values (e.g., “low,” “medium,” “high” or “yes,” “no”). They are essential for statistical modeling, data visualization, and analysis because they explicitly define the categories and the order of their levels, unlike plain character vectors.
Factors are built on top of integers, with each level (category) assigned a numeric code, but they display as labels, making them both efficient and human-readable.
What Are Factors?
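A factor is an object that stores categorical data with predefined levels. Internally, a factor is an integer vector whose values index into a character vector of labels (the levels). Factors are created with factor() or as.factor(). Before R 4.0.0, data.frame() and read.csv() also converted character columns to factors by default (stringsAsFactors = TRUE); since 4.0.0 the default is FALSE, so the conversion must be requested explicitly.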
# Create a factor
x <- factor(c("low", "high", "medium", "low"))
x
# Output: [1] low    high   medium low
# Levels: high low medium (sorted alphabetically by default)
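The integer-to-label mapping described above can be inspected directly. The following is a minimal sketch using only base R; the variable names codes, level_labels, and recovered are chosen here for illustration.

```r
# Build the same factor as above
x <- factor(c("low", "high", "medium", "low"))

# Integer codes stored internally (positions in the level vector)
codes <- as.integer(x)

# Labels the codes point to (sorted alphabetically by default)
level_labels <- levels(x)

# Indexing the labels by the codes recovers the original values
recovered <- level_labels[codes]

print(codes)
print(level_labels)
print(recovered)
```

Here "low" is stored as 2 because it is the second level after alphabetical sorting, not because of where it first appears in the data.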
Key Functions for Working with Factors
R provides several functions to create, manipulate, and inspect factors. Here’s a rundown:
1. factor() – Create a factor. Converts a vector into a factor, optionally specifying the levels and their order.
sizes <- factor(c("small", "large", "medium", "small"), levels = c("small", "medium", "large"))
sizes
# Output: [1] small large medium small
# Levels: small medium large
2. levels() – Get or set factor levels. Retrieves the categories of a factor; assigning a new vector renames the levels by position.
levels(sizes)
# Output: [1] "small" "medium" "large"
levels(sizes) <- c("S", "M", "L")
sizes
# Output: [1] S L M S
# Levels: S M L
3. nlevels() – Count the number of levels. Returns the number of defined levels, including any that do not occur in the data.
nlevels(sizes) # Output: 3
4. as.factor() – Convert to factor. Coerces an object (e.g., a character or numeric vector) to a factor.
as.factor(c(1, 2, 1, 3))
# Output: [1] 1 2 1 3
# Levels: 1 2 3
5. is.factor() – Check whether an object is a factor. Returns TRUE or FALSE.
is.factor(sizes) # TRUE
is.factor(c("a", "b")) # FALSE
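One consequence of the integer storage is worth a short sketch: calling as.numeric() on a factor whose labels look like numbers returns the internal codes, not the labels. The example below uses only base R, and the values are made up for illustration.

```r
# A factor whose labels happen to be numbers (illustrative values)
f <- as.factor(c(10, 20, 10, 30))

# as.numeric() returns the internal codes, not the labels
codes <- as.numeric(f)

# Converting via character first recovers the original numbers
values <- as.numeric(as.character(f))

print(codes)
print(values)
```

Going through as.character() is the usual way to get the labelled values back when a numeric column has been stored as a factor.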
Factors in Data Frames (Using mtcars)
Every column of the built-in mtcars dataset is stored as numeric, but several of them (such as cyl, gear, and am) really represent categories. Let’s create and manipulate a factor from mtcars.
Example: Converting cyl to a Factor
# cyl (cylinders) is numeric by default
str(mtcars$cyl) # num [1:32] 6 6 4 6 8 ...
# Convert to factor
mtcars$cyl_factor <- factor(mtcars$cyl, levels = c(4, 6, 8))
str(mtcars$cyl_factor)
# Output: Factor w/ 3 levels "4","6","8": 2 2 1 2 3 ...
Why factorize? Storing cyl as a factor ensures R treats it as categorical (4, 6, or 8 cylinders) rather than as a continuous numeric variable, which matters for modeling and plotting.
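To see the difference in practice, compare how summary() treats the numeric column and a factor built from it. This is a small sketch on local copies, so it does not depend on the column added to mtcars above; cyl_num and cyl_f are names chosen here for illustration.

```r
# Work on plain vectors so mtcars itself is left untouched
cyl_num <- mtcars$cyl
cyl_f <- factor(cyl_num, levels = c(4, 6, 8))

# Numeric: summary() reports quantiles and the mean
print(summary(cyl_num))

# Factor: summary() reports how many cars fall in each category
counts <- summary(cyl_f)
print(counts)
```

The numeric summary includes a mean number of cylinders, which no car actually has, while the factor summary gives per-category counts.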
Ordered Factors
Factors can be unordered (the default) or ordered (for ranked categories like “low” < “medium” < “high”). To create an ordered factor, pass ordered = TRUE to factor().
ratings <- factor(c("low", "high", "medium", "low"), levels = c("low", "medium", "high"), ordered = TRUE)
ratings
# Output: [1] low high medium low
# Levels: low < medium < high
ratings[1] < ratings[2]
# Output: [1] TRUE (low < high)
Ordered factors enable comparisons (<, >), useful in ordinal data analysis.
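A few more operations that rely on the ordering are sketched below, using the same ratings factor and base R only. The result names (at_least_medium, sorted, highest) are illustrative.

```r
# Same ordered factor as above
ratings <- factor(c("low", "high", "medium", "low"),
                  levels = c("low", "medium", "high"),
                  ordered = TRUE)

# Compare every element against a level name
at_least_medium <- ratings >= "medium"

# Sorting and max() follow the level order, not the alphabet
sorted <- sort(ratings)
highest <- max(ratings)

print(at_least_medium)
print(as.character(sorted))
print(as.character(highest))
```

On an unordered factor, the comparison and max() would not be meaningful: R returns NA with a warning for the former and an error for the latter.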
Practical Examples with mtcars
Let’s apply factors to mtcars for real-world context.
1. Categorizing MPG
# Create a categorical mpg variable
mtcars$mpg_cat <- cut(mtcars$mpg, breaks = c(10, 20, 30, Inf), labels = c("low", "medium", "high"))
mtcars$mpg_cat
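cut() builds right-closed intervals by default, so with these breaks the bins are (10, 20], (20, 30], and (30, Inf]. The sketch below uses a few made-up values rather than mtcars to show how boundary values are assigned; the names mpg_values, default_bins, and closed_bins are illustrative.

```r
# Illustrative values chosen to sit on or near the break points
mpg_values <- c(10, 20, 20.1, 35)
breaks <- c(10, 20, 30, Inf)
bin_labels <- c("low", "medium", "high")

# Default: intervals are (10,20], (20,30], (30,Inf]; 10 falls outside
default_bins <- cut(mpg_values, breaks = breaks, labels = bin_labels)

# include.lowest = TRUE closes the first interval: [10,20]
closed_bins <- cut(mpg_values, breaks = breaks, labels = bin_labels,
                   include.lowest = TRUE)

print(default_bins)
print(closed_bins)
```

In mtcars the smallest mpg is 10.4, so the call on the real data produces no NA values, but a value of exactly 10 would need include.lowest = TRUE.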
2. Reordering Factor Levels
library(forcats) # Tidyverse package for factors
# Reorder cyl_factor by mean mpg
mtcars$cyl_factor <- fct_reorder(mtcars$cyl_factor, mtcars$mpg, .fun = mean)
levels(mtcars$cyl_factor)
# Output: [1] "8" "6" "4"
3. Collapsing Levels
# Collapse mpg_cat into fewer categories
mtcars$mpg_simple <- fct_collapse(mtcars$mpg_cat, "low_medium" = c("low", "medium"), "high" = "high")
mtcars$mpg_simple
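If forcats is not available, base R covers both reordering and collapsing. The sketch below swaps in reorder() from the stats package for fct_reorder() and a list assignment to levels() for fct_collapse(). It works on local vectors (cyl_f, mpg_cat, names chosen for illustration) rather than on the mtcars columns modified above.

```r
# Local copies so the example is self-contained
cyl_f <- factor(mtcars$cyl, levels = c(4, 6, 8))
mpg_cat <- cut(mtcars$mpg, breaks = c(10, 20, 30, Inf),
               labels = c("low", "medium", "high"))

# Reorder levels by mean mpg (ascending), similar to fct_reorder()
cyl_by_mpg <- reorder(cyl_f, mtcars$mpg, FUN = mean)
print(levels(cyl_by_mpg))

# Collapse levels: each list element maps a new name to old levels
mpg_simple <- mpg_cat
levels(mpg_simple) <- list(low_medium = c("low", "medium"), high = "high")
print(table(mpg_simple))
```

Only the level order and labels change in either step; the underlying observations stay in their original row order.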
Handling Factors in Analysis
Statistical Models: Functions like lm() or glm() treat factors as categorical predictors, creating dummy variables automatically.
lm(mpg ~ cyl_factor, data = mtcars)
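The dummy variables can be inspected with model.matrix(), which builds the same kind of design matrix that lm() constructs from the formula. This sketch rebuilds the factor locally with levels 4, 6, 8 (mt and cyl_f are illustrative names), so 4 is the reference category under R's default treatment contrasts.

```r
# Local copy with 4 cylinders as the reference level
mt <- mtcars
mt$cyl_f <- factor(mt$cyl, levels = c(4, 6, 8))

# Design matrix: an intercept plus one 0/1 column per non-reference level
design <- model.matrix(mpg ~ cyl_f, data = mt)
print(colnames(design))
print(head(design, 3))

# The fitted model has one coefficient per design-matrix column
fit <- lm(mpg ~ cyl_f, data = mt)
print(coef(fit))
```

With this coding the intercept is the mean mpg of 4-cylinder cars, and the other two coefficients are differences from that baseline. Changing the level order changes which category serves as the reference.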
Visualization: ggplot2 uses factor levels for grouping or ordering.
library(ggplot2)
ggplot(mtcars, aes(x = cyl_factor, y = mpg)) + geom_boxplot()
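Factors ensure that R recognizes a variable as a set of categories rather than as numbers or free-form strings, which is critical for correct statistical modeling. Because they are stored internally as integers, factors can also use less memory than character vectors when a few values repeat many times, although R’s shared string cache keeps the difference modest. Levels and their order give explicit control over how categories are treated in analysis and plots.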
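The memory point can be checked with object.size(). This is a rough sketch on synthetic data (100,000 draws from three labels, with illustrative names labels_chr and labels_fct). Exact byte counts depend on the platform and R version, so none are shown; on a typical 64-bit build the factor is expected to be the smaller of the two.

```r
# Synthetic data: many repeats of a few labels
set.seed(1)
labels_chr <- sample(c("low", "medium", "high"), 1e5, replace = TRUE)
labels_fct <- factor(labels_chr)

# Compare the reported sizes of the two representations
size_chr <- object.size(labels_chr)
size_fct <- object.size(labels_fct)

print(size_chr, units = "Kb")
print(size_fct, units = "Kb")
```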