Data frames are one of the most widely used data structures in R. A data frame is a two-dimensional, tabular structure that stores data in rows and columns, where each column can hold a different type (e.g., numeric, character, factor) but all columns have the same length. This makes data frames well suited to working with datasets. They are similar to a table in SQL or a DataFrame in pandas (Python).
Creating a data frame: Use data.frame() to create a data frame. You can also convert other objects (like lists or matrices) into data frames using as.data.frame().
Please note: In RStudio, select your code and press Control + Shift + A to reformat it. Consistent formatting makes longer scripts easier to read and work on. Alternatively, open the 'Code' menu in RStudio and click 'Reformat Code'.
Creating a data frame:
> friend_data <- data.frame(
  friend_id = 1:5,
  friend_name = c("Adam", "John", "Mary", "Rama", "Jessy"),
  friend_age = c(25, 30, 22, 35, 29),
  stringsAsFactors = FALSE # Prevent automatic factor conversion (already the default since R 4.0.0)
)
Displaying data frame:
> print(friend_data)
Convert a matrix to a data frame:
> matrix_data <- matrix(1:9, nrow = 3)
> df_from_matrix <- as.data.frame(matrix_data)
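The text above also mentions converting lists. As a minimal sketch (the names friend_list and df_from_list are made up for illustration), a list whose elements all have the same length converts column by column, and a matrix without column names receives the default names V1, V2, ... which you will usually want to replace:

```r
# Hypothetical list: each equal-length element becomes one column
friend_list <- list(
  friend_id = 1:3,
  friend_name = c("Adam", "John", "Mary")
)
df_from_list <- as.data.frame(friend_list)

# A matrix without column names gets default names V1, V2, V3
matrix_data <- matrix(1:9, nrow = 3)
df_from_matrix <- as.data.frame(matrix_data)
default_names <- colnames(df_from_matrix)

# Rename the columns to something more descriptive
colnames(df_from_matrix) <- c("a", "b", "c")

str(df_from_list)
print(df_from_matrix)
```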
Functions to explore the structure and metadata of a data frame:
> str(friend_data) # Displays the structure of the data frame
> class(friend_data) # "data.frame": class of the object
> dim(friend_data) # Dimensions – number of rows and columns
> nrow(friend_data) # Number of rows
> ncol(friend_data) # Number of columns
> colnames(friend_data) # Column names
> rownames(friend_data) # Row names
> attributes(friend_data) # Attributes of the data
Accessing Data in a Data Frame: By columns
> friend_data$friend_name # Access the "friend_name" column
> friend_data[, 2] # Using column indexing: access the second column
> friend_data[["friend_name"]] # Using column name indexing
> friend_data[, c("friend_name", "friend_age")] # Selecting multiple columns
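One detail worth knowing about these access styles is the type of the result: selecting a single column with [, ] returns a plain vector, while drop = FALSE or single-bracket name indexing keeps a one-column data frame. A small sketch on a hypothetical copy named friends:

```r
# Hypothetical small data frame, so the example does not touch friend_data
friends <- data.frame(
  friend_name = c("Adam", "John", "Mary"),
  friend_age = c(25, 30, 22),
  stringsAsFactors = FALSE
)

as_vector  <- friends[, "friend_name"]                # character vector
as_frame   <- friends[, "friend_name", drop = FALSE]  # one-column data frame
also_frame <- friends["friend_name"]                  # list-style indexing keeps the data frame

class(as_vector)
class(as_frame)
class(also_frame)
```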
Accessing Data in a Data Frame: By rows
> friend_data[1, ] # First row
> friend_data[c(1, 3), ] # First and third rows
> friend_data[-1, ] # All rows except the first
Subsetting Rows and Columns
> friend_data[friend_data$friend_age > 25, ] # Rows based on condition – Friends older than 25
> friend_data[1:2, c("friend_name", "friend_age")] # Specific rows and columns
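Conditions can also be combined. The sketch below, on a hypothetical copy named friends, uses & to require two conditions at once and shows subset() as an alternative way to write the same filter:

```r
# Hypothetical copy of the example data
friends <- data.frame(
  friend_name = c("Adam", "John", "Mary", "Rama", "Jessy"),
  friend_age = c(25, 30, 22, 35, 29),
  stringsAsFactors = FALSE
)

# Combine conditions with & (and) or | (or)
late_twenties <- friends[friends$friend_age > 25 & friends$friend_age < 30, ]

# subset() evaluates the condition inside the data frame, so no friends$ prefix is needed
same_rows <- subset(friends, friend_age > 25 & friend_age < 30)

print(late_twenties)
print(same_rows)
```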
Modifying Data Frames – Adding columns
> friend_data$friend_city <- c("NY", "LA", "SF", "Boston", "Chicago") # Adding new column
> new_col <- c("Single", "Married", "Single", "Divorced", "Single") # Bind a new column
> friend_data <- cbind(friend_data, friend_status = new_col)
Modifying Data Frames – Removing columns
> friend_data$friend_city <- NULL
> friend_no_name <- friend_data[, -2] # Exclude columns using negative indexing: drop the second column (stored in a new object so friend_data keeps friend_name for the later examples)
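Negative positions are fragile if the column order changes, so columns are often removed by name instead. A minimal sketch, assuming a hypothetical data frame friends and a hypothetical vector drop_cols listing the columns to remove:

```r
# Hypothetical copy of the example data
friends <- data.frame(
  friend_id = 1:3,
  friend_name = c("Adam", "John", "Mary"),
  friend_age = c(25, 30, 22),
  stringsAsFactors = FALSE
)

drop_cols <- c("friend_id")  # hypothetical list of columns to remove

# Keep every column whose name is not in drop_cols
trimmed <- friends[, !(names(friends) %in% drop_cols), drop = FALSE]

# Equivalent: compute the names to keep with setdiff()
kept <- friends[, setdiff(names(friends), drop_cols), drop = FALSE]

names(trimmed)
```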
Visualizing first and last few rows
> head(friend_data) # default six rows
> head(friend_data, 2) # First two rows
> tail(friend_data) # default six rows
> tail(friend_data, 2) # Last two rows
Summary of the data – basic descriptive statistics
> summary(friend_data)
Unique Values:
> unique(friend_data$friend_name)
Sort:
> friend_data <- friend_data[order(friend_data$friend_age), ] # Ascending by age
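order() also handles descending order and tie-breaking on a second column. The following sketch uses a hypothetical data frame friends in which two friends share the same age:

```r
# Hypothetical data with a tie on age
friends <- data.frame(
  friend_name = c("Adam", "John", "Mary", "Rama"),
  friend_age = c(25, 30, 25, 35),
  stringsAsFactors = FALSE
)

# Descending by age
oldest_first <- friends[order(friends$friend_age, decreasing = TRUE), ]

# Ascending by age, ties broken alphabetically by name
by_age_then_name <- friends[order(friends$friend_age, friends$friend_name), ]

# Descending age but ascending name: negate the numeric key
mixed <- friends[order(-friends$friend_age, friends$friend_name), ]

print(mixed)
```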
Checking for missing values:
> any(is.na(friend_data)) # TRUE/FALSE for any missing values
> which(is.na(friend_data), arr.ind = TRUE) # Row and column positions of missing values
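Handling missing data: Deleting or replacing (use one approach or the other)
> friend_data <- na.omit(friend_data) # Option 1: delete rows that contain any missing value
> friend_data$friend_age[is.na(friend_data$friend_age)] <- mean(friend_data$friend_age, na.rm = TRUE) # Option 2: replace missing ages with the mean age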
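Since friend_data as created above contains no missing values, the effect of these calls is easier to see on data that does. A minimal sketch with a hypothetical data frame friends, which also uses the median as one possible alternative to the mean:

```r
# Hypothetical data with two missing ages
friends <- data.frame(
  friend_name = c("Adam", "John", "Mary", "Rama"),
  friend_age = c(25, NA, 22, NA),
  stringsAsFactors = FALSE
)

# Count missing values per column
na_per_column <- colSums(is.na(friends))

# Keep only rows without any missing value (same rows that na.omit() keeps)
complete_rows <- friends[complete.cases(friends), ]

# Impute on a copy so the original stays available for comparison
imputed <- friends
imputed$friend_age[is.na(imputed$friend_age)] <- median(imputed$friend_age, na.rm = TRUE)

print(na_per_column)
print(imputed)
```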
Functions to aggregate:
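> num_cols <- sapply(friend_data, is.numeric) # Flag the numeric columns; these functions fail on character columns
> colMeans(friend_data[num_cols])
> colSums(friend_data[num_cols])
> rowMeans(friend_data[num_cols])
> rowSums(friend_data[num_cols])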
> table(friend_data$friend_status)
Merging and Binding:
> new_row <- data.frame(friend_id = 6, friend_name = "Kate", friend_age = 28, friend_status = "Married")
> friend_data <- rbind(friend_data, new_row) # Row binding
> new_col <- c(1, 0, 1, 1, 0, 1)
> friend_data <- cbind(friend_data, friend_employed = new_col) # Column binding
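rbind() and cbind() glue objects together by position. Merging, by contrast, matches rows on a key column, which base R does with merge(). A minimal sketch, assuming a hypothetical lookup table cities keyed by friend_id:

```r
# Hypothetical tables sharing the key column friend_id
friends <- data.frame(
  friend_id = 1:3,
  friend_name = c("Adam", "John", "Mary"),
  stringsAsFactors = FALSE
)
cities <- data.frame(
  friend_id = c(1L, 3L, 4L),
  friend_city = c("NY", "SF", "Boston"),
  stringsAsFactors = FALSE
)

# Inner join: only ids present in both tables
inner <- merge(friends, cities, by = "friend_id")

# Left join: keep every row of friends, fill unmatched cities with NA
left <- merge(friends, cities, by = "friend_id", all.x = TRUE)

print(inner)
print(left)
```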
Transposing data:
> t(friend_data) # Swaps rows and columns; returns a matrix (character here, because the columns have mixed types)
Adding Rank:
> friend_data$rank <- rank(friend_data$friend_age)
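By default rank() gives tied values the average of their ranks and ranks the smallest value first. The sketch below, on a hypothetical vector ages, shows the ties.method argument and one way to rank in descending order:

```r
# Hypothetical ages with a tie at 30
ages <- c(25, 30, 22, 30)

avg_rank  <- rank(ages)                        # 2.0 3.5 1.0 3.5 (ties share the average rank)
min_rank  <- rank(ages, ties.method = "min")   # 2 3 1 3 (ties share the lowest rank)
desc_rank <- rank(-ages, ties.method = "min")  # 3 1 4 1 (oldest gets rank 1)

print(avg_rank)
print(min_rank)
print(desc_rank)
```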
Key Functions for Aggregation
tapply() applies a function to subsets of a vector, split by a factor or a list of factors.
> data(mtcars)
> avg_mpg <- tapply(mtcars$mpg, mtcars$cyl, mean)
> print(avg_mpg)
by() applies a function to subsets of a data frame or matrix, split by factors.
> by(data = mtcars$mpg, INDICES = mtcars$cyl, FUN = summary)
aggregate() is a versatile function to compute summary statistics for a data frame, grouped by one or more factors.
> agg_data <- aggregate(cbind(mpg, hp) ~ cyl, data = mtcars, FUN = mean)
> print(agg_data)
Aggregation with Multiple Factors
> agg_multi <- aggregate(mpg ~ cyl + gear, data = mtcars, FUN = mean) # Avg mpg grouped by both cylinder & gear
> print(agg_multi)
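aggregate() can also be called without a formula by passing the columns to summarise and a named list of grouping variables. This sketch compares both styles on mtcars; the object names formula_style and list_style are illustrative only:

```r
data(mtcars)

# Formula interface, as used above
formula_style <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)

# Equivalent call with a grouping list; the list name becomes the group column name
list_style <- aggregate(mtcars["mpg"], by = list(cyl = mtcars$cyl), FUN = mean)

print(formula_style)
print(list_style)
```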
apply() operates on rows or columns of a matrix or data frame.
> col_sums <- apply(mtcars[, c("mpg", "hp", "wt")], 2, sum) # Columnwise sum
> print(col_sums)
> row_sums <- apply(mtcars[, c("mpg", "hp", "wt")], 1, sum) # Rowwise sum
> print(row_sums)
lapply() applies a function to each element of a list or data frame and returns a list.
> mean_mtcars <- lapply(mtcars, FUN = mean)
sapply() is similar to lapply() but simplifies the output to a vector or matrix when possible.
> col_means <- sapply(mtcars[, c("mpg", "hp", "wt")], mean) # Calculate mean for each column
> print(col_means)
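To make the difference in return types concrete, the sketch below runs the same columns through lapply() and sapply(), and adds vapply(), a stricter base R relative that requires you to declare the expected result type. Variable names are illustrative:

```r
data(mtcars)
cols <- mtcars[, c("mpg", "hp", "wt")]

as_list   <- lapply(cols, mean)              # named list, one element per column
as_vec    <- sapply(cols, mean)              # simplified to a named numeric vector
as_matrix <- sapply(cols, range)             # two values per column, so a 2 x 3 matrix
checked   <- vapply(cols, mean, numeric(1))  # like sapply(), but errors if a result is not one number

class(as_list)
class(as_vec)
dim(as_matrix)
```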
mapply() is a multivariate version of sapply() that applies a function to multiple arguments simultaneously.
> prod_mpg_hp <- mapply(`*`, mtcars$mpg, mtcars$hp)
> print(prod_mpg_hp)
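For arithmetic such as the product above, plain vectorised operators give the same result, so mapply() earns its keep mainly with functions that are not vectorised over all of their arguments. A small sketch, using a hypothetical helper named power_to_weight:

```r
data(mtcars)

# Hypothetical helper taking two arguments
power_to_weight <- function(hp, wt) hp / wt

# mapply() pairs up the i-th elements of each argument
ratio <- mapply(power_to_weight, mtcars$hp, mtcars$wt)

# The vectorised form gives the same numbers
vectorised <- mtcars$hp / mtcars$wt

# A case where the pairing matters: repeat each letter a different number of times
repeated <- mapply(function(n, value) paste(rep(value, n), collapse = ""),
                   1:3, c("a", "b", "c"))

print(repeated)
```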
Grouped Aggregation with split()
split() divides a vector or data frame into groups based on a factor; a function can then be applied to each group, for example with sapply().
> split_data <- split(mtcars, mtcars$cyl) # Split mtcars data by cylinder type
> avg_mpg <- sapply(split_data, function(group) mean(group$mpg)) # Calculate avg mpg for each group
> print(avg_mpg)
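When each group needs more than one statistic, a common base R pattern is split, then lapply, then combine with do.call(rbind, ...). A minimal sketch (the column names n and avg_mpg are chosen for illustration):

```r
data(mtcars)

split_data <- split(mtcars, mtcars$cyl)

# Build a one-row data frame per group
group_summary <- lapply(split_data, function(group) {
  data.frame(
    cyl = group$cyl[1],
    n = nrow(group),
    avg_mpg = mean(group$mpg)
  )
})

# Stack the per-group rows into a single data frame
summary_table <- do.call(rbind, group_summary)
print(summary_table)
```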
Table-Based Aggregation: table() and prop.table() create contingency tables for counts, proportions, and cross-tabulations.
> cyl_count <- table(mtcars$cyl) # Count of cars by cylinder type
> print(cyl_count)
> cyl_prop <- prop.table(cyl_count) # proportion of cars by cylinder type
> print(cyl_prop)
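The cross-tabulation case can be sketched by passing two variables to table(). The margin argument of prop.table() then controls whether proportions are computed within rows or within columns:

```r
data(mtcars)

# Two-way table: cylinders in rows, gears in columns
cyl_gear <- table(cyl = mtcars$cyl, gear = mtcars$gear)

row_share <- prop.table(cyl_gear, margin = 1)  # each row sums to 1
col_share <- prop.table(cyl_gear, margin = 2)  # each column sums to 1

print(cyl_gear)
print(round(row_share, 2))
```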
Aggregating missing data
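> data <- data.frame(
  group = c("A", "A", "B", "B", "C", "C"),
  value = c(10, NA, 20, 30, NA, 40)
) # Create a sample data frame with missing values
> agg_na <- aggregate(value ~ group, data = data, FUN = function(x) sum(x, na.rm = TRUE), na.action = na.pass) # Sum per group, ignoring NA; na.action = na.pass keeps the NA rows so that na.rm does the work (by default the formula interface drops them first)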
> print(agg_na)
Comparison of aggregation functions
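As a small illustration of how these functions relate, the sketch below computes the same statistic (mean mpg per cylinder count) with three of them. The numbers agree; what differs is the shape of the result. Object names are illustrative:

```r
data(mtcars)

via_tapply    <- tapply(mtcars$mpg, mtcars$cyl, mean)               # named one-dimensional array
via_aggregate <- aggregate(mpg ~ cyl, data = mtcars, FUN = mean)    # data frame with cyl and mpg columns
via_split     <- sapply(split(mtcars$mpg, mtcars$cyl), mean)        # named numeric vector

print(via_tapply)
print(via_aggregate)
print(via_split)
```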