R Data Reshaping

Merging Data Frames

In R, merging data frames is done using the merge() function.

The syntax for the merge() function is as follows:

# S3 method
merge(x, y, …)

# S3 method for data.frame
merge(x, y, by = intersect(names(x), names(y)),
      by.x = by, by.y = by, all = FALSE, all.x = all, all.y = all,
      sort = TRUE, suffixes = c(".x",".y"), no.dups = TRUE,
      incomparables = NULL, …)

Common parameters explained:

x, y: data frames
by, by.x, by.y: specify the matching column names in the two data frames; by default, it uses the common column names.
all: logical value; all = L is shorthand for all.x = L and all.y = L, where L can be TRUE or FALSE.
all.x: logical value, default is FALSE. If TRUE, it shows rows in x even if there are no corresponding matches in y, with unmatched rows in y represented by NA.
all.y: logical value, default is FALSE. If TRUE, it shows rows in y even if there are no corresponding matches in x, with unmatched rows in x represented by NA.
sort: logical value, whether to sort the columns.

The merge() function is similar to SQL JOIN operations:

Natural join or INNER JOIN: returns rows if there is at least one match in the tables
Left outer join or LEFT JOIN: returns all rows from the left table, even if there are no matches in the right table
Right outer join or RIGHT JOIN: returns all rows from the right table, even if there are no matches in the left table
Full outer join or FULL JOIN: returns rows if there is a match in one of the tables

Example

# data frame 1
df1 = data.frame(SiteId = c(1:6), Site = c("Google","tutorialpro","Taobao","Facebook","Zhihu","Weibo"))

# data frame 2
df2 = data.frame(SiteId = c(2, 4, 6, 7, 8), Country = c("CN","USA","CN","USA","IN"))

# INNER JOIN
df1 = merge(x=df1,y=df2,by="SiteId")
print("----- INNER JOIN -----")
print(df1)

# FULL JOIN
df2 = merge(x=df1,y=df2,by="SiteId",all=TRUE)
print("----- FULL JOIN -----")
print(df2)

# LEFT JOIN
df3 = merge(x=df1,y=df2,by="SiteId",all.x=TRUE)
print("----- LEFT JOIN -----")
print(df3)

# RIGHT JOIN
df4 = merge(x=df1,y=df2,by="SiteId",all.y=TRUE)
print("----- RIGHT JOIN -----")
print(df4)

Executing the above code produces the following output:

[1] "----- INNER JOIN -----"
  SiteId     Site Country
1      2   tutorialpro      CN
2      4 Facebook     USA
3      6    Weibo      CN
[1] "----- FULL JOIN -----"
  SiteId     Site Country.x Country.y
1      2   tutorialpro        CN        CN
2      4 Facebook       USA       USA
3      6    Weibo        CN        CN
4      7     <NA>      <NA>       USA
5      8     <NA>      <NA>        IN
[1] "----- LEFT JOIN -----"
  SiteId   Site.x Country   Site.y Country.x Country.y

1 2 tutorialpro CN tutorialpro CN CN 2 4 Facebook USA Facebook USA USA 3 6 Weibo CN Weibo CN CN [1] "----- RIGHT JOIN -----" SiteId Site.x Country Site.y Country.x Country.y 1 2 tutorialpro CN tutorialpro CN CN 2 4 Facebook USA Facebook USA USA 3 6 Weibo CN Weibo CN CN 4 7 <NA> <NA> <NA> <NA> USA 5 8 <NA> <NA> <NA> <NA> IN


### Data Integration and Splitting

R uses the **melt()** and **cast()** functions to integrate and split data.

- melt(): Converts wide-format data into long-format.

- cast(): Converts long-format data into wide-format.

The following diagram effectively demonstrates the functions of melt() and cast() (detailed examples will be provided later):

cast( data, formula, fun.aggregate = NULL, ..., margins = NULL, subset = NULL, fill = NULL, drop = TRUE, value.var = guess_value(data) )


Parameter Description:

- data: The combined data frame.

- formula: The format for reshaping the data, similar to x ~ y, where x is the row label and y is the column label.

- fun.aggregate: The aggregation function used to process the value.

- margins: A vector of variable names (can include "grand_col" and "grand_row") for calculating margins, setting TRUE calculates all margins.

- subset: Conditional filtering of the results, in the format **subset = .(variable=="length")**.

- drop: Whether to keep the default values.

- value.var: The field to be processed.

## Example

Load libraries

library(MASS) library(reshape2) library(reshape)

Create data frame

id <- c(1, 1, 2, 2) time <- c(1, 2, 1, 2) x1 <- c(5, 3, 6, 2) x2 <- c(6, 5, 1, 4) mydata <- data.frame(id, time, x1, x2)

Melt

md <- melt(mydata, id = c("id","time"))

Print recasted dataset using cast() function

cast.data <- cast(md, id~variable, mean)

print(cast.data)

cat("\n") time.cast <- cast(md, time~variable, mean) print(time.cast)

cat("\n") id.time <- cast(md, id~time, mean) print(id.time)

cat("\n") id.time.cast <- cast(md, id+time~variable) print(id.time.cast)

cat("\n") id.variable.time <- cast(md, id+variable~time) print(id.variable.time)

cat("\n") id.variable.time2 <- cast(md, id~variable+time) print(id.variable.time2)


Executing the above code outputs:

id x1 x2 1 1 4 5.5 2 2 4 2.5

time x1 x2 1 1 5.5 3.5 2 2 2.5 4.5

id 1 2 1 1 5.5 4 2 2 3.5 3

id time x1 x2 1 1 1 5 6 2 1 2 3 5 3 2 1 6 1 4 2 2 2 4

id variable 1 2 1 1 x1 5 3 2 1 x2 6 5 3 2 x1 6 2 4 2 x2 1 4

id x1_1 x1_2 x2_1 x2_2 1 1 5 3 6 5 2 2 6 2 1 4 ```

❮ R Examples R Java ❯