R Data Reshaping
Merging Data Frames
In R, merging data frames is done using the merge() function.
The syntax for the merge() function is as follows:
# S3 method
merge(x, y, …)
# S3 method for data.frame
merge(x, y, by = intersect(names(x), names(y)),
by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
incomparables = NULL, …)
Common parameters explained:
x, y: data frames
by, by.x, by.y: specify the matching column names in the two data frames; by default, it uses the common column names.
all: logical value; all = L is shorthand for all.x = L and all.y = L, where L can be TRUE or FALSE.
all.x: logical value, default is FALSE. If TRUE, it shows rows in x even if there are no corresponding matches in y, with unmatched rows in y represented by NA.
all.y: logical value, default is FALSE. If TRUE, it shows rows in y even if there are no corresponding matches in x, with unmatched rows in x represented by NA.
sort: logical value, whether to sort the columns.
The merge() function is similar to SQL JOIN operations:
Natural join or INNER JOIN: returns rows if there is at least one match in the tables
Left outer join or LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table
Right outer join or RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table
Full outer join or FULL JOIN: returns rows if there is a match in one of the tables
Example
# data frame 1
df1 = data.frame(SiteId = c(1:6), Site = c("Google","tutorialpro","Taobao","Facebook","Zhihu","Weibo"))
# data frame 2
df2 = data.frame(SiteId = c(2, 4, 6, 7, 8), Country = c("CN","USA","CN","USA","IN"))
# INNER JOIN
df1 = merge(x=df1,y=df2,by="SiteId")
print("----- INNER JOIN -----")
print(df1)
# FULL JOIN
df2 = merge(x=df1,y=df2,by="SiteId",all=TRUE)
print("----- FULL JOIN -----")
print(df2)
# LEFT JOIN
df3 = merge(x=df1,y=df2,by="SiteId",all.x=TRUE)
print("----- LEFT JOIN -----")
print(df3)
# RIGHT JOIN
df4 = merge(x=df1,y=df2,by="SiteId",all.y=TRUE)
print("----- RIGHT JOIN -----")
print(df4)
Executing the above code produces the following output:
[1] "----- INNER JOIN -----"
SiteId Site Country
1 2 tutorialpro CN
2 4 Facebook USA
3 6 Weibo CN
[1] "----- FULL JOIN -----"
SiteId Site Country.x Country.y
1 2 tutorialpro CN CN
2 4 Facebook USA USA
3 6 Weibo CN CN
4 7 <NA> <NA> USA
5 8 <NA> <NA> IN
[1] "----- LEFT JOIN -----"
SiteId Site.x Country Site.y Country.x Country.y
1 2 tutorialpro CN tutorialpro CN CN 2 4 Facebook USA Facebook USA USA 3 6 Weibo CN Weibo CN CN [1] "----- RIGHT JOIN -----" SiteId Site.x Country Site.y Country.x Country.y 1 2 tutorialpro CN tutorialpro CN CN 2 4 Facebook USA Facebook USA USA 3 6 Weibo CN Weibo CN CN 4 7 <NA> <NA> <NA> <NA> USA 5 8 <NA> <NA> <NA> <NA> IN
### Data Integration and Splitting
R uses the **melt()** and **cast()** functions to integrate and split data.
- melt(): Converts wide-format data into long-format.
- cast(): Converts long-format data into wide-format.
The following diagram effectively demonstrates the functions of melt() and cast() (detailed examples will be provided later):
cast( data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data) )
Parameter Description:
- data: The combined data frame.
- formula: The format for reshaping the data, similar to x ~ y, where x is the row label and y is the column label.
- fun.aggregate: The aggregation function used to process the value.
- margins: A vector of variable names (can include "grand_col" and "grand_row") for calculating margins, setting TRUE calculates all margins.
- subset: Conditional filtering of the results, in the format **subset = .(variable=="length")**.
- drop: Whether to keep the default values.
- value.var: The field to be processed.
## Example
Load libraries
library(MASS) library(reshape2) library(reshape)
Create data frame
id <- c(1, 1, 2, 2) time <- c(1, 2, 1, 2) x1 <- c(5, 3, 6, 2) x2 <- c(6, 5, 1, 4) mydata <- data.frame(id, time, x1, x2)
Melt
md <- melt(mydata, id = c("id","time"))
Print recasted dataset using cast() function
cast.data <- cast(md, id~variable, mean)
print(cast.data)
cat("\n") time.cast <- cast(md, time~variable, mean) print(time.cast)
cat("\n") id.time <- cast(md, id~time, mean) print(id.time)
cat("\n") id.time.cast <- cast(md, id+time~variable) print(id.time.cast)
cat("\n") id.variable.time <- cast(md, id+variable~time) print(id.variable.time)
cat("\n") id.variable.time2 <- cast(md, id~variable+time) print(id.variable.time2)
Executing the above code outputs:
id x1 x2 1 1 4 5.5 2 2 4 2.5
time x1 x2 1 1 5.5 3.5 2 2 2.5 4.5
id 1 2 1 1 5.5 4 2 2 3.5 3
id time x1 x2 1 1 1 5 6 2 1 2 3 5 3 2 1 6 1 4 2 2 2 4
id variable 1 2 1 1 x1 5 3 2 1 x2 6 5 3 2 x1 6 2 4 2 x2 1 4
id x1_1 x1_2 x2_1 x2_2 1 1 5 3 6 5 2 2 6 2 1 4 ```