R Data Types
Data types classify the values that variables and functions work with.
The type of a value determines how much storage it occupies and how the stored bit pattern is interpreted.
R has three basic data types:
- Numeric
- Logical
- Character
Numeric constants mainly come in two forms:
Form | Examples |
---|---|
General | 123 -0.125 |
Scientific Notation | 1.23e2 -1.25E-1 |
Logical values are called Booleans in many other programming languages. The only logical constants are TRUE and FALSE.
Note: R is case-sensitive, so true and True cannot stand in for TRUE.
The most intuitive data type is the character type, which holds text and is usually called a string in other languages. In R, text constants can be enclosed in either single or double quotes, and both forms produce the same value, for example:
Example
> 'tutorialpro' == "tutorialpro"
[1] TRUE
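As a quick check of the three basic types, the sketch below asks R for the class of one constant of each kind. It relies on class(), a base R function that the text above does not mention, and the variable names are made up for illustration.

```r
# Ask R which basic type each constant belongs to.
general_class    <- class(-0.125)   # "numeric" (general form)
scientific_class <- class(1.23e2)   # "numeric" (scientific notation)
logical_class    <- class(TRUE)     # "logical"
character_class  <- class('abc')    # "character" (single quotes work too)

print(c(general_class, scientific_class, logical_class, character_class))
```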
R does not require variables to be declared. Unlike statically typed languages, where a variable's name and data type must be set in advance, R defines a new variable whenever the assignment operator (= or <-) is used:
Example
a = 1        # numeric
b <- TRUE    # logical
b = "abc"    # b already exists, so it is rebound and now holds a character value
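Because assignment alone defines a variable, the same name can hold values of different types over time. The following minimal sketch illustrates this with class(), a base R function used here only to inspect the result. The variable names are hypothetical.

```r
x <- TRUE
class_before <- class(x)   # "logical"

x <- "abc"                 # same name, new value of a different type
class_after <- class(x)    # "character"

print(c(class_before, class_after))
```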
R objects come in the following six types (each is detailed later):
- Vector
- List
- Matrix
- Array
- Factor
- Data Frame
Vector
Many programming languages, such as Java, Rust, and C#, provide a vector type in their standard libraries, because vectors are indispensable in mathematical operations. The most familiar case is the two-dimensional vector used in the plane coordinate system.
From a data structure perspective, a vector is a linear list, comparable to a one-dimensional array.
In R, vectors are a built-in type, which makes vector operations straightforward:
Example
> a = c(3, 4)
> b = c(5, 0)
> a + b
[1] 8 4
c() is a function that creates a vector.
Here, two two-dimensional vectors are added to give a new two-dimensional vector (8, 4). Operating on a two-dimensional vector together with a three-dimensional vector has no mathematical meaning. R does not stop in that case: it reuses (recycles) the elements of the shorter vector and issues a warning, because the longer length is not a multiple of the shorter one.
It is best to avoid this situation as a matter of habit.
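To make the behavior with mismatched lengths concrete, here is a small sketch. It assumes base R's withCallingHandlers() to capture the warning text instead of letting it print. The variable names are illustrative only.

```r
# Lengths 4 and 2: the shorter vector is reused as 1, 2, 1, 2 and no warning is raised.
even_case <- c(10, 20, 30, 40) + c(1, 2)
print(even_case)   # 11 22 31 42

# Lengths 3 and 2: R still returns a result, but it signals a warning.
warning_text <- NULL
odd_case <- withCallingHandlers(
  c(10, 20, 30) + c(1, 2),
  warning = function(w) {
    warning_text <<- conditionMessage(w)   # remember the warning message
    invokeRestart("muffleWarning")         # let the addition finish
  }
)
print(odd_case)      # 11 22 31
print(warning_text)
```

The first case shows why the warning only appears for lengths that do not divide evenly: recycling itself is normal behavior in R.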
Each element in a vector can be individually retrieved by its index:
Example
> a = c(10, 20, 30, 40, 50)
> a[2]
[1] 20
Note: In R, an index is not an offset but a position, which means indexing starts from 1, not 0.
R can also easily retrieve part of a vector:
Example
> a[1:4] # Retrieve items 1 to 4, including items 1 and 4
[1] 10 20 30 40
> a[c(1, 3, 5)] # Retrieve items 1, 3, and 5
[1] 10 30 50
> a[c(-1, -5)] # Remove items 1 and 5
[1] 20 30 40
These three methods of partial retrieval are the most commonly used.
Vectors support arithmetic with scalars, applied to every element:
Example
> c(1.1, 1.2, 1.3) - 0.5
[1] 0.6 0.7 0.8
> a = c(1,2)
> a ^ 2
[1] 1 4
Common mathematical functions such as sqrt and exp also operate on vectors element by element.
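A short sketch of element-wise behavior follows. The input values are chosen only so that the results are easy to verify by hand.

```r
v <- c(1, 4, 9)

roots   <- sqrt(v)      # square root of each element: 1 2 3
shifted <- v * 2 + 1    # arithmetic is applied to each element: 3 9 19
ones    <- exp(c(0, 0)) # exp(0) is 1 for each element: 1 1

print(roots)
print(shifted)
print(ones)
```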
As a linear list structure, a vector needs the usual list-processing functions, and R provides them:
Vector sorting:
Example
> a = c(1, 3, 5, 2, 4, 6)
> sort(a)
[1] 1 2 3 4 5 6
> rev(a)
[1] 6 4 2 5 3 1
> order(a)
[1] 1 4 2 5 3 6
> a[order(a)]
[1] 1 2 3 4 5 6
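The order() function returns the indices that would put the vector in sorted order, which is why a[order(a)] gives the same result as sort(a). The rev() function reverses the vector without sorting it.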
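One common use of order() is to rearrange a second vector in step with the first. The sketch below assumes two hypothetical parallel vectors, scores and players, invented for illustration.

```r
scores  <- c(72, 95, 64)
players <- c("Ann", "Bob", "Cid")   # hypothetical names, one per score

idx <- order(scores)                # positions of the scores from smallest to largest
print(idx)                          # 3 1 2

by_score <- players[idx]            # reorder the names with the same indices
print(by_score)                     # "Cid" "Ann" "Bob"

descending <- sort(scores, decreasing = TRUE)
print(descending)                   # 95 72 64
```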
Vector Statistics
R has a very complete set of statistical functions:
Function Name | Meaning |
---|---|
sum | Sum |
mean | Average |
var | Variance (sample variance, n - 1 denominator) |
sd | Standard Deviation (sample, square root of var) |
min | Minimum |
max | Maximum |
range | Range (a two-element vector: minimum and maximum) |
Vector statistics example:
Example
> sum(1:5)
[1] 15
> sd(1:5)
[1] 1.581139
> range(1:5)
[1] 1 5
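The relationship between these functions can be checked by hand. This sketch recomputes the sample variance of 1:5 from its definition and compares it with var() and sd(). The helper variable names are illustrative.

```r
x <- 1:5

avg <- mean(x)                                        # 3
manual_var <- sum((x - avg)^2) / (length(x) - 1)      # 10 / 4 = 2.5

print(var(x))          # 2.5
print(sd(x)^2)         # also 2.5, because sd(x) is sqrt(var(x))
print(range(x))        # 1 5, the same as c(min(x), max(x))
```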
Vector Generation
Vectors can be built with the c() function, and the min:max operator generates a sequence of consecutive integers.
To generate a sequence with a step other than 1, use the seq function:
> seq(1, 9, 2)
[1] 1 3 5 7 9
The seq function can also generate an arithmetic sequence from m to n by specifying m, n, and the length of the sequence:
> seq(0, 1, length.out=3)
[1] 0.0 0.5 1.0
rep stands for "repeat" and generates a sequence by repeating values:
> rep(0, 5)
[1] 0 0 0 0 0
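A few more variations on seq and rep are sketched here. The named arguments by, times, and each are standard base R arguments that the examples above do not show, and the variable names are illustrative.

```r
countdown <- seq(10, 1, by = -3)          # a negative step counts down: 10 7 4 1
quarters  <- seq(0, 1, length.out = 5)    # 0.00 0.25 0.50 0.75 1.00

whole_vec <- rep(c(1, 2), times = 2)      # repeat the whole vector: 1 2 1 2
each_elem <- rep(c(1, 2), each = 2)       # repeat each element: 1 1 2 2

print(countdown)
print(quarters)
print(whole_vec)
print(each_elem)
```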
Vectors often involve NA and NULL. The two differ as follows:
NA represents "missing", while NULL represents "non-existence". NA is a placeholder: the value is absent, but its position exists. NULL signifies that there is no data at all.
Example:
> length(c(NA, NA, NULL))
[1] 2
> c(NA, NA, NULL, NA)
[1] NA NA NA
Clearly, NULL contributes nothing to a vector: c() drops it, so it occupies no position.
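The difference also shows up when testing for and computing with these values. The sketch below uses the base R functions is.na(), is.null(), and the na.rm argument of sum(), none of which appear in the text above. The vector v is made up for illustration.

```r
v <- c(1, NA, 3)

print(length(v))               # 3: the NA still occupies a position
print(is.na(v))                # FALSE TRUE FALSE
print(sum(v))                  # NA: a missing value makes the sum unknown
print(sum(v, na.rm = TRUE))    # 4: NA values are skipped

print(length(NULL))            # 0: NULL holds no data
print(is.null(NULL))           # TRUE
```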
Logical
Logical vectors mainly arise from comparison and logical operations on vectors, for example:
> c(11, 12, 13) > 12
[1] FALSE FALSE TRUE
The which function is commonly used with logical vectors. It returns the indices of the TRUE elements, so it can pick out the positions of the data we need:
> a = c(11, 12, 13)
> b = a > 12
> print(b)
[1] FALSE FALSE TRUE
> which(b)
[1] 3
For example, to select the values in a vector that are greater than or equal to 60 and less than 70:
> vector = c(10, 40, 78, 64, 53, 62, 69, 70)
> print(vector[which(vector >= 60 & vector < 70)])
[1] 64 62 69
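A logical vector can also be used as an index directly, without going through which(). The following sketch repeats the filter above under that assumption. The names scores and in_range are illustrative.

```r
scores <- c(10, 40, 78, 64, 53, 62, 69, 70)

in_range <- scores >= 60 & scores < 70   # one TRUE or FALSE per element

print(which(in_range))    # positions of the matches: 4 6 7
print(scores[in_range])   # the matching values: 64 62 69
print(sum(in_range))      # TRUE counts as 1, so this is the number of matches: 3
```

which() remains useful when the positions themselves are needed, for example to index a second vector of the same length.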
Similar functions include all and any:
> all(c(TRUE, TRUE, TRUE))
[1] TRUE
> all(c(TRUE, TRUE, FALSE))
[1] FALSE
> any(c(TRUE, FALSE, FALSE))
[1] TRUE
> any(c(FALSE, FALSE, FALSE))
[1] FALSE
all() checks whether all elements in a logical vector are TRUE, and any() checks whether at least one element is TRUE.
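In practice all() and any() are usually applied to the result of a comparison rather than to literal TRUE and FALSE values. A minimal sketch, using a made-up vector of scores:

```r
scores <- c(64, 62, 69)

everyone_passed <- all(scores >= 60)   # TRUE: every score is at least 60
someone_above68 <- any(scores > 68)    # TRUE: 69 is above 68
someone_above70 <- any(scores > 70)    # FALSE: no score is above 70

print(c(everyone_passed, someone_above68, someone_above70))
```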
Strings
The string data type itself is not complex, so this section focuses on string manipulation functions:
> toupper("tutorialpro") # Convert to uppercase
[1] "TUTORIALPRO"
> tolower("TUTORIALPRO") # Convert to lowercase
[1] "tutorialpro"
> nchar("中文", type="bytes") # Count byte length
[1] 4
> nchar("中文", type="char") # Count character length
[1] 2
> substr("123456789", 1, 5) # Substring from 1 to 5
[1] "12345"
> substring("1234567890", 5) # Substring from 5 to end
[1] "567890"
> as.numeric("12") # Convert string to number
[1] 12
> as.character(12.34) # Convert number to string
[1] "12.34"
> strsplit("2019;10;1", ";") # Split string by delimiter
[[1]]
[1] "2019" "10" "1"
> gsub("/", "-", "2019/10/1") # Replace string
[1] "2019-10-1"
The byte length of 4 shown above comes from a Windows computer using the GBK encoding, in which a Chinese character takes two bytes. On a UTF-8 system, a single Chinese character takes 3 bytes, so the same call returns 6.
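The UTF-8 case can be reproduced on any system by writing the string with Unicode escapes, which R stores as UTF-8. This is a sketch under that assumption, and the variable names are illustrative.

```r
# "\u4e2d\u6587" is the same two-character string as in the example above,
# written with Unicode escapes so that R stores it as UTF-8.
utf8_text <- "\u4e2d\u6587"

char_count <- nchar(utf8_text, type = "chars")   # 2 characters
byte_count <- nchar(utf8_text, type = "bytes")   # 6 bytes, 3 per character

print(c(char_count, byte_count))
```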
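R supports regular expressions. The default syntax is POSIX extended regular expressions, and Perl-compatible syntax is available by passing perl = TRUE: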
> gsub("[[:alpha:]]+", "$", "Two words")
[1] "$ $"
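The sketch below shows a few more regular expression calls. It uses the base R functions sub() and grepl() and the perl = TRUE argument, and the input strings are made up for illustration.

```r
# Backreferences \\1, \\2, \\3 in the replacement refer to the parenthesized groups.
reordered <- sub("([0-9]+)/([0-9]+)/([0-9]+)", "\\3.\\2.\\1", "2019/10/1")
print(reordered)    # "1.10.2019"

# grepl() reports, for each string, whether the pattern matches.
has_digit <- grepl("[[:digit:]]", c("abc", "a1"))
print(has_digit)    # FALSE TRUE

# perl = TRUE enables Perl-only features such as \\U (uppercase) in the replacement.
shouted <- gsub("(\\w+)", "\\U\\1", "two words", perl = TRUE)
print(shouted)      # "TWO WORDS"
```

Note that sub() replaces only the first match, while gsub() replaces every match.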
For more string content, refer to: R Language Strings Introduction.
Matrices
R provides a matrix type for linear algebra. It resembles a two-dimensional array in other languages, but R also supports matrix operations at the language level.
First, let's look at matrix generation:
> vector=c(1, 2, 3, 4, 5, 6)
> matrix(vector, 2, 3)
     [,1] [,2] [,3]
[1,]    1    3    5
[2,]    2    4    6
The matrix is initialized from a vector, and you specify the number of rows and columns.
The values from the vector are filled into the matrix column by column. To fill by row instead, set the byrow argument:
> matrix(vector, 2, 3, byrow=TRUE)
     [,1] [,2] [,3]
[1,]    1    2    3
[2,]    4    5    6
Each value in the matrix can be accessed directly:
Example
> m1 = matrix(vector, 2, 3, byrow=TRUE)
> m1[1,1] # 1st row, 1st column
[1] 1
> m1[1,3] # 1st row, 3rd column
[1] 3
Each column and row in an R matrix can be named, which is done in bulk via a character vector:
Example
> colnames(m1) = c("x", "y", "z")
> rownames(m1) = c("a", "b")
> m1
  x y z
a 1 2 3
b 4 5 6
> m1["a", ]
x y z
1 2 3
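Names and positions can be mixed freely when indexing. The sketch below rebuilds the same named matrix in one call using the dimnames argument of matrix(), a base R argument that the text does not show. The variable names are illustrative.

```r
m <- matrix(1:6, nrow = 2, ncol = 3, byrow = TRUE,
            dimnames = list(c("a", "b"), c("x", "y", "z")))

print(dim(m))        # number of rows and columns: 2 3

cell   <- m["b", "y"]   # a single value: 5
z_col  <- m[, "z"]      # a whole column as a named vector: a = 3, b = 6
second <- m[2, ]        # a whole row by position: x = 4, y = 5, z = 6

print(cell)
print(z_col)
print(second)
```

Leaving one side of the comma empty selects everything along that dimension.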
Matrix arithmetic works like vector arithmetic: an operation can combine a matrix with a scalar, or two matrices of the same size, in which case it is applied to corresponding positions.
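A minimal sketch of position-by-position arithmetic, using two made-up 2 x 2 matrices:

```r
A <- matrix(c(1, 2, 3, 4), 2, 2)       # columns (1, 2) and (3, 4)
B <- matrix(c(10, 20, 30, 40), 2, 2)   # columns (10, 20) and (30, 40)

scaled      <- A * 2    # every element doubled
summed      <- A + B    # elements added position by position
elementwise <- A * B    # elements multiplied position by position

print(scaled)
print(summed)
print(elementwise)
```

The * operator here multiplies corresponding elements. The matrix product is a different operation, written with %*%.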
Matrix multiplication uses the %*% operator:
Example
> m1 = matrix(c(1, 2), 1, 2)
> m2 = matrix(c(3, 4), 2, 1)
> m1 %*% m2
     [,1]
[1,]   11
Inverse matrix:
Example
> A = matrix(c(1, 3, 2, 4), 2, 2)
> solve(A)
     [,1] [,2]
[1,] -2.0  1.0
[2,]  1.5 -0.5
The solve() function solves systems of linear equations. Its basic usage is solve(A, b), where A is the coefficient matrix of the system and b is the vector or matrix of right-hand sides. When b is omitted, as in the example above, solve(A) returns the inverse of A.
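As an illustration of the two-argument form, the sketch below solves the system x + 2y = 5, 3x + 4y = 11. The right-hand side b is made up for this example, and the coefficient matrix is the same A as above.

```r
A <- matrix(c(1, 3, 2, 4), 2, 2)   # rows (1, 2) and (3, 4)
b <- c(5, 11)                      # hypothetical right-hand side

x <- solve(A, b)                   # solution of A %*% x = b
print(x)                           # approximately 1 2

check <- as.vector(A %*% x)        # multiplying back should reproduce b
print(check)
```

Results are compared approximately here because solve() works in floating-point arithmetic.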
The apply() function can operate on each row or column of the matrix as a vector:
Example
> (A = matrix(c(1, 3, 2, 4), 2, 2))
     [,1] [,2]
[1,]    1    2
[2,]    3    4
> apply(A, 1, sum) # Second argument 1: apply sum() to each row
[1] 3 7
> apply(A, 2, sum) # Second argument 2: apply sum() to each column
[1] 4 6
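apply() is not limited to built-in functions such as sum(). The sketch below passes an anonymous function that computes the spread (maximum minus minimum) of each row and column. The names row_spread and col_spread are illustrative, and rowSums() and colSums() are base R shortcuts that the text does not mention.

```r
A <- matrix(c(1, 3, 2, 4), 2, 2)   # rows (1, 2) and (3, 4)

# Each row or column arrives in the function as an ordinary vector.
row_spread <- apply(A, 1, function(r) max(r) - min(r))   # 1 1
col_spread <- apply(A, 2, function(k) max(k) - min(k))   # 2 2

print(row_spread)
print(col_spread)

# For plain sums, base R also offers dedicated functions.
print(rowSums(A))   # 3 7
print(colSums(A))   # 4 6
```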
For more matrix content, refer to: R Matrix.