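data.table is a popular R package for processing large datasets quickly. It is often faster than dplyr, and its compact syntax makes it easier to write neatly formatted code. This article covers some more advanced data.table techniques, a few of which you may not find even on the cheat sheet.
Before you start, install and load the dplyr and data.table packages.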
install.packages('dplyr')
install.packages('data.table')
library('dplyr')
library('data.table')
.GRP: adding an index to each group
dt <- data.table(C1 = c("A", "B", "C", "B", "A"),
C2 = c(1, 2, 3, 2, 1))
# := adds INDEX by reference, so dt itself is modified and dt2 refers to the same table
dt2 <- dt[, INDEX := .GRP, by = .(C1)]
C1 C2 INDEX
1: A 1 1
2: B 2 2
3: C 3 3
4: B 2 2
5: A 1 1
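Because `:=` updates a data.table in place, the original table also gains the new column. If you want to keep the original untouched, one option is to work on a copy. The following is a minimal sketch under that assumption; the name `dt_indexed` is only illustrative.

```r
library(data.table)

dt <- data.table(C1 = c("A", "B", "C", "B", "A"),
                 C2 = c(1, 2, 3, 2, 1))

# copy() creates an independent table, so := below does not touch dt
dt_indexed <- copy(dt)[, INDEX := .GRP, by = .(C1)]

names(dt)          # still only C1 and C2
names(dt_indexed)  # C1, C2 and the new INDEX column
```

Groups are numbered in order of first appearance, so A, B and C receive 1, 2 and 3 here.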
Subsetting by row index and column index
dt <- data.table(C1 = 1:3,
C2 = 101:103,
C3 = 901:903)
dt_row <- dt[1:2, ]
dt_col <- dt[, 2:3]
dt_row_col <- dt[1:2, 2:3]
> dt_row
C1 C2 C3
1: 1 101 901
2: 2 102 902
> dt_col
C2 C3
1: 101 901
2: 102 902
3: 103 903
> dt_row_col
C2 C3
1: 101 901
2: 102 902
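When the column positions are stored in a variable rather than typed directly, `with = FALSE` tells data.table to treat that variable as a column selector. A small sketch, where `cols` and `dt_sub` are names made up for this example:

```r
library(data.table)

dt <- data.table(C1 = 1:3,
                 C2 = 101:103,
                 C3 = 901:903)

# Column positions kept in a variable (hypothetical name)
cols <- 2:3

# with = FALSE makes j behave like a column selector instead of an expression
dt_sub <- dt[1:2, cols, with = FALSE]

names(dt_sub)  # the selected columns
nrow(dt_sub)   # the selected rows
```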
Converting a data.table object to a vector
Convert the table to a matrix first, then to a vector. The values come out column by column.
dt <- data.table(C1 = 1:3,
C2 = 101:103,
C3 = 901:903)
d2 <- dt %>%
as.matrix() %>%
as.vector()
> d2
[1] 1 2 3 101 102 103 901 902 903
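As an alternative that does not go through a matrix, base R's `unlist()` also flattens the columns in order, since a data.table is a list of columns. This is a sketch of that option rather than a replacement for the approach above; `use.names = FALSE` simply drops the element names that `unlist()` would otherwise generate.

```r
library(data.table)

dt <- data.table(C1 = 1:3,
                 C2 = 101:103,
                 C3 = 901:903)

# Flatten all columns into one vector, column by column, without names
v <- unlist(dt, use.names = FALSE)

# The matrix route from the article, for comparison
via_matrix <- as.vector(as.matrix(dt))

identical(v, via_matrix)
```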
Changing the default column names when reshaping with dcast
dt <- data.table(C1 = c(1, 2, 3, 4),
C2 = c(2, 3, 4, 5),
C3 = c(1, 3, 5, 6),
C4 = c(1, 2, 4, 7))
dt_dcast <- dcast(dt, C1 + C2 ~ paste0("NEW_", C3), value.var = c("C4"))
> dt_dcast
C1 C2 NEW_1 NEW_3 NEW_5 NEW_6
1: 1 2 1 NA NA NA
2: 2 3 NA 2 NA NA
3: 3 4 NA NA 4 NA
4: 4 5 NA NA NA 7
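To see what the `paste0()` call changes, it helps to compare against the default behaviour, where the new columns are named after the raw values of C3. The sketch below uses the same data; `default_wide` and `prefixed_wide` are illustrative names.

```r
library(data.table)

dt <- data.table(C1 = c(1, 2, 3, 4),
                 C2 = c(2, 3, 4, 5),
                 C3 = c(1, 3, 5, 6),
                 C4 = c(1, 2, 4, 7))

# Default: new columns are named after the values of C3
default_wide <- dcast(dt, C1 + C2 ~ C3, value.var = "C4")

# With paste0() in the formula, each new column name gets the NEW_ prefix
prefixed_wide <- dcast(dt, C1 + C2 ~ paste0("NEW_", C3), value.var = "C4")

names(default_wide)   # "C1" "C2" "1" "3" "5" "6"
names(prefixed_wide)  # "C1" "C2" "NEW_1" "NEW_3" "NEW_5" "NEW_6"
```

Purely numeric column names such as `1` need backticks when referenced later, which is one practical reason to add a prefix.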