rm(list=ls())
library(data.table)
原文来自data.table自带的一个Vignette: Using .SD for Data Analysis。
1 什么是.SD
简单理解,.SD是在数据分析中,用于捕获变量的缩写。.SD可以理解为一个子集、自身copy或者引用。除了不能用:=对它赋值外,.SD可以理解为一个data.table。
.SD主要有2个用途:
通过.SDCols提取列
针对分组数据进行清洗
2 准备使用的数据
使用来自Lahman数据库的棒球数据。需要提前安装Lahman R包。
然后加载Lahman包,获得球队数据,将其转为data.table类型。
library(Lahman)
Teams <- copy(Teams)
setDT(Teams)
Teams
## yearID lgID teamID franchID divID Rank G Ghome W L DivWin WCWin
## 1: 1871 NA BS1 BNA <NA> 3 31 NA 20 10 <NA> <NA>
## 2: 1871 NA CH1 CNA <NA> 2 28 NA 19 9 <NA> <NA>
## 3: 1871 NA CL1 CFC <NA> 8 29 NA 10 19 <NA> <NA>
## 4: 1871 NA FW1 KEK <NA> 7 19 NA 7 12 <NA> <NA>
## 5: 1871 NA NY2 NNA <NA> 5 33 NA 16 17 <NA> <NA>
## ---
## 2981: 2021 NL SLN STL C 2 162 81 90 72 N Y
## 2982: 2021 AL TBA TBD E 1 162 81 100 62 Y N
## 2983: 2021 AL TEX TEX W 5 162 81 60 102 N N
## 2984: 2021 AL TOR TOR E 4 162 80 91 71 N N
## 2985: 2021 NL WAS WSN E 5 162 81 65 97 N N
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
## 1: N <NA> 401 1372 426 70 37 3 60 19 73 16 NA NA 303 109 3.55
## 2: N <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76
## 3: N <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11
## 4: N <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17
## 5: N <NA> 302 1404 403 43 21 1 33 15 46 15 NA NA 313 121 3.72
## ---
## 2981: N N 706 5351 1303 261 22 198 478 1341 89 22 86 44 672 626 3.98
## 2982: N N 857 5507 1336 288 36 222 585 1542 88 42 72 41 651 593 3.67
## 2983: N N 625 5405 1254 225 24 167 433 1381 106 29 58 31 815 758 4.79
## 2984: N N 846 5476 1455 285 13 262 496 1218 81 20 51 35 663 610 3.91
## 2985: N N 724 5385 1388 272 20 182 573 1303 56 26 84 31 820 743 4.80
## CG SHO SV IPouts HA HRA BBA SOA E DP FP name
## 1: 22 1 3 828 367 2 42 23 243 24 0.834 Boston Red Stockings
## 2: 25 0 1 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3: 23 0 0 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4: 19 1 0 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5: 32 1 0 879 373 7 42 22 235 14 0.840 New York Mutuals
## ---
## 2981: 3 15 50 4251 1234 152 608 1225 84 137 0.986 St. Louis Cardinals
## 2982: 1 13 42 4367 1264 184 436 1478 80 130 0.986 Tampa Bay Rays
## 2983: 0 3 31 4273 1402 232 513 1239 83 146 0.986 Texas Rangers
## 2984: 1 14 34 4216 1257 209 473 1468 90 122 0.984 Toronto Blue Jays
## 2985: 1 8 36 4183 1364 247 548 1346 96 116 0.983 Washington Nationals
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1: South End Grounds I NA 103 98 BOS BS1
## 2: Union Base-Ball Grounds NA 104 102 CHI CH1
## 3: National Association Grounds NA 96 100 CLE CL1
## 4: Hamilton Field NA 101 107 KEK FW1
## 5: Union Grounds (Brooklyn) NA 90 88 NYU NY2
## ---
## 2981: Busch Stadium III 2102530 92 92 STL SLN
## 2982: Tropicana Field 761072 92 91 TBR TBA
## 2983: Globe Life Field 2110258 99 101 TEX TEX
## 2984: Sahlen Field 805901 102 101 TOR TOR
## 2985: Nationals Park 1465543 95 96 WSN MON
## teamIDretro
## 1: BS1
## 2: CH1
## 3: CL1
## 4: FW1
## 5: NY2
## ---
## 2981: SLN
## 2982: TBA
## 2983: TEX
## 2984: TOR
## 2985: WAS
获得投手Pitcher的数据集
Pitching <- copy(Pitching)
setDT(Pitching)
Pitching
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER
## 1: bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23
## 2: brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132
## 3: fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3
## 4: fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103
## 5: fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10
## ---
## 49426: zeuchtj01 2021 1 TOR AL 0 2 5 3 0 0 0 45 21 11
## 49427: zimmebr02 2021 1 BAL AL 4 5 14 13 0 0 0 193 75 36
## 49428: zimmejo02 2021 1 MIL NL 0 0 2 0 0 0 0 17 8 5
## 49429: zimmeky01 2021 1 KCA AL 4 1 52 2 0 0 2 162 46 29
## 49430: zuberty01 2021 1 KCA AL 0 3 31 0 0 0 0 82 26 19
## HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 1: 0 11 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA
## 2: 4 37 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA
## 3: 0 0 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA
## 4: 3 31 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA
## 5: 0 3 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA
## ---
## 49426: 6 9 8 0.323 6.60 0 0 0 0 74 0 16 0 0 2
## 49427: 14 22 56 0.291 5.04 0 0 2 1 285 0 37 1 2 3
## 49428: 1 2 0 0.348 7.94 0 0 1 0 26 1 5 0 0 1
## 49429: 7 30 46 0.243 4.83 1 9 0 0 223 9 32 0 4 13
## 49430: 6 17 25 0.250 6.26 1 2 1 0 123 6 20 0 1 2
球队记录了给定年份的一些统计数据,而投球记录了给定年份的给定投手的统计数据。数据的定义,参加这个文档。
3 针对未分组数据的.SD操作
查看投手的成绩,两种方式:
第一种:
Pitching
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER
## 1: bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23
## 2: brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132
## 3: fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3
## 4: fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103
## 5: fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10
## ---
## 49426: zeuchtj01 2021 1 TOR AL 0 2 5 3 0 0 0 45 21 11
## 49427: zimmebr02 2021 1 BAL AL 4 5 14 13 0 0 0 193 75 36
## 49428: zimmejo02 2021 1 MIL NL 0 0 2 0 0 0 0 17 8 5
## 49429: zimmeky01 2021 1 KCA AL 4 1 52 2 0 0 2 162 46 29
## 49430: zuberty01 2021 1 KCA AL 0 3 31 0 0 0 0 82 26 19
## HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 1: 0 11 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA
## 2: 4 37 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA
## 3: 0 0 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA
## 4: 3 31 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA
## 5: 0 3 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA
## ---
## 49426: 6 9 8 0.323 6.60 0 0 0 0 74 0 16 0 0 2
## 49427: 14 22 56 0.291 5.04 0 0 2 1 285 0 37 1 2 3
## 49428: 1 2 0 0.348 7.94 0 0 1 0 26 1 5 0 0 1
## 49429: 7 30 46 0.243 4.83 1 9 0 0 223 9 32 0 4 13
## 49430: 6 17 25 0.250 6.26 1 2 1 0 123 6 20 0 1 2
第二种:
Pitching[, .SD] #返回整个data.table
## playerID yearID stint teamID lgID W L G GS CG SHO SV IPouts H ER
## 1: bechtge01 1871 1 PH1 NA 1 2 3 3 2 0 0 78 43 23
## 2: brainas01 1871 1 WS3 NA 12 15 30 30 30 0 0 792 361 132
## 3: fergubo01 1871 1 NY2 NA 0 0 1 0 0 0 0 3 8 3
## 4: fishech01 1871 1 RC1 NA 4 16 24 24 22 1 0 639 295 103
## 5: fleetfr01 1871 1 NY2 NA 0 1 1 1 1 0 0 27 20 10
## ---
## 49426: zeuchtj01 2021 1 TOR AL 0 2 5 3 0 0 0 45 21 11
## 49427: zimmebr02 2021 1 BAL AL 4 5 14 13 0 0 0 193 75 36
## 49428: zimmejo02 2021 1 MIL NL 0 0 2 0 0 0 0 17 8 5
## 49429: zimmeky01 2021 1 KCA AL 4 1 52 2 0 0 2 162 46 29
## 49430: zuberty01 2021 1 KCA AL 0 3 31 0 0 0 0 82 26 19
## HR BB SO BAOpp ERA IBB WP HBP BK BFP GF R SH SF GIDP
## 1: 0 11 1 NA 7.96 NA 7 NA 0 146 0 42 NA NA NA
## 2: 4 37 13 NA 4.50 NA 7 NA 0 1291 0 292 NA NA NA
## 3: 0 0 0 NA 27.00 NA 2 NA 0 14 0 9 NA NA NA
## 4: 3 31 15 NA 4.35 NA 20 NA 0 1080 1 257 NA NA NA
## 5: 0 3 0 NA 10.00 NA 0 NA 0 57 0 21 NA NA NA
## ---
## 49426: 6 9 8 0.323 6.60 0 0 0 0 74 0 16 0 0 2
## 49427: 14 22 56 0.291 5.04 0 0 2 1 285 0 37 1 2 3
## 49428: 1 2 0 0.348 7.94 0 0 1 0 26 1 5 0 0 1
## 49429: 7 30 46 0.243 4.83 1 9 0 0 223 9 32 0 4 13
## 49430: 6 17 25 0.250 6.26 1 2 1 0 123 6 20 0 1 2
两种方式是等价的:
identical(Pitching, Pitching[, .SD])
## [1] TRUE
如果从子集的角度来理解,. sd仍然是数据的子集,它只是一个不令人注意的子集,即集合本身。
3.1 配合.SDcols提取
如果我们只想从data.table中提取特定的多列,可以使用.SDcols配合操作。譬如,提取投手的接球成绩信息。
# W: Wins; L: Losses; G: Games
Pitching[, .SD, .SDcols=c("W","L","G")]
## W L G
## 1: 1 2 3
## 2: 12 15 30
## 3: 0 0 1
## 4: 4 16 24
## 5: 0 1 1
## ---
## 49426: 0 2 5
## 49427: 4 5 14
## 49428: 0 0 2
## 49429: 4 1 52
## 49430: 0 3 31
上边这样做,其实没有多大意义,完全可以用Pitching[,c("W","L","G")]类似语句
代替。我们关注的重点还是对.SD进一步处理。
3.2 列类型转换
譬如,后续分析需要把多列从字符类型转为因子类型。首先查看列类型
# teamIDBR: Team ID used by Baseball Reference website
# teamIDlahman45: Team ID used in Lahman database version 4.5
# teamIDretro: Team ID used by Retrosheet
fkt = c('teamIDBR', 'teamIDlahman45', 'teamIDretro')
# confirm that they're stored as `character`
Teams[ , sapply(.SD, is.character), .SDcols = fkt]
## teamIDBR teamIDlahman45 teamIDretro
## TRUE TRUE TRUE
用基础的R语法也可以实现:
setDF(Teams)
sapply(Teams[ , fkt], is.character) #
## teamIDBR teamIDlahman45 teamIDretro
## TRUE TRUE TRUE
setDT(Teams)
sapply返回的是向量,矩阵,或者是数组(当simplify=TRUE时)。
转换的思路是这样的:理解这种语法的关键是要记住,可以将data.table(以及data.Frame)视为一个列表,列表中每个元素对应data.table中的一列,因此,sapply/lapply将FUN参数(在本例中为is.character)应用到data.table中的每一列,并像sapply/lapply通常所做的那样返回结果。
Teams[ , (fkt):=lapply(.SD,factor), .SDcols=fkt]
head(unique(Teams[[fkt[1L]]]))
## [1] BOS CHI CLE KEK NYU ATH
## 101 Levels: ALT ANA ARI ATH ATL BAL BLA BLN BLU BOS BRA BRG BRO BSN BTT ... WSN
需要特别注意:
lapply返回的是列表,每一列的结果,作为一个列表元素存储。在上边代码中,将返回的列表中的每一个元素,赋值给fkt变量中命名的一列。
请注意,我们必须将fkt放在圆括号()中,以强制data.table将其解释为列名,而不是尝试分配名为’fkt’的列。
实际上,.SDcols参数可以非常灵活的指定待提取的列:
可以通过一个包括列名的字符变量指定待提取的列(如上所示)
也可以通过列的位置(整数)指定待提取的列
也可以通过一个逻辑变量TRUE/FALSE来定义是否包括某一列
也可基于正则表达式的模式匹配
下边例子,获得因子类型列的位置,然后将其转为字符类型。
# while .SDcols accepts a logical vector,
# := does not, so we need to convert to column
# positions with which()
fkt_idx = which(sapply(Teams, is.factor))
Teams[ , (fkt_idx) := lapply(.SD, as.character), .SDcols = fkt_idx]
head(unique(Teams[[fkt_idx[1L]]]))
## [1] "NA" "NL" "AA" "UA" "PL" "AL"
.SDcols也可以被赋值正则表达式,譬如提取包括team单词的列,共计有4列。
# while .SDcols accepts a logical vector,
# := does not, so we need to convert to column
# positions with which()
Teams[ , .SD, .SDcols = patterns("team")]
## teamID teamIDBR teamIDlahman45 teamIDretro
## 1: BS1 BOS BS1 BS1
## 2: CH1 CHI CH1 CH1
## 3: CL1 CLE CL1 CL1
## 4: FW1 KEK FW1 FW1
## 5: NY2 NYU NY2 NY2
## ---
## 2981: SLN STL SLN SLN
## 2982: TBA TBR TBA TBA
## 2983: TEX TEX TEX TEX
## 2984: TOR TOR TOR TOR
## 2985: WAS WSN MON WAS
貌似存在问题,在列计算时,:=是不接受右手边的逻辑类型的,因此当要对列进行计算时,必须将其转为整数位置。
team_idx <- grep("team", names(Teams), value = TRUE)
Teams[ , (team_idx):=lapply(.SD, factor), .SDcols=team_idx]
如果设置grep中的参数value=FALSE,将返回每个列的整数位置,但是作者不建议这样做,通过整数位置来提取列,一旦列位置变化,会带来潜在的危险。
3.3 控制模型的右手边
构建可变的模型是稳健统计分析的核心特征。让我们试着使用投球表中的一小部分协变量来预测投手的ERA(自责失分,自责分,一种衡量表现的指标)。W ( wins )与ERA之间的(线性)关系如何,取决于设定中包含哪些其他的协变量?
下面是一个利用.SD的强大功能的简短脚本,探讨解决这个问题:
首先生成要分析的协变量的所有组合,是利用combn函数实现的,0L:length(extra_var)实际上对应了combn中的m参数。
extra_var = c('yearID', 'teamID', 'G', 'L')
models = unlist(
lapply(0L:length(extra_var), combn, x = extra_var, simplify = FALSE),
recursive = FALSE
) #消除一层列表
定义画图需要的常用颜色,这个不错,作图可以参考。
# here are 16 visually distinct colors, taken from the list of 20 here:
# https://sashat.me/2017/01/11/list-of-20-simple-distinct-colors/
col16 = c('#e6194b', '#3cb44b', '#ffe119', '#0082c8',
'#f58231', '#911eb4', '#46f0f0', '#f032e6',
'#d2f53c', '#fabebe', '#008080', '#e6beff',
'#aa6e28', '#fffac8', '#800000', '#aaffc3')
准备执行回归分析,提取W的回归系数。这里巧妙的利用.SDcols来生成不同的数据,提供给.SD,供lm调用。非常巧妙的构思,利用了data.table强大的列计算能力。代码虽然短,但是分析了16个lm模型,值的学习。
par(oma = c(2, 0, 0, 0))
lm_coef = sapply(models, function(rhs) {
# using ERA ~ . and data = .SD, then varying which
# columns are included in .SD allows us to perform this
# iteration over 16 models succinctly.
# coef(.)['W'] extracts the W coefficient from each model fit
Pitching[ , coef(lm(ERA ~ ., data = .SD))['W'], .SDcols = c('W', rhs)]
})
barplot(lm_coef, names.arg = sapply(models, paste, collapse = '/'),
main = 'Wins Coefficient\nWith Various Covariates',
col = col16, las = 2L, cex.names = .8)
4 .SD 分组操作
这应该是.SD最常用的场景。我们经常需要在组水平上执行操作。在data.table中,通过by =或者keyby = 切出多个数据组,每一个数据组,可以认为是一个data.table子集。这样,实际上.SD包括了多个切出的data.table子集。每一次针对其中一个data.table子集操作。
4.2 提取分组子集
提取每个棒球队的数据集,并且返回每个棒球队的最后一条记录。.N表示每个棒球队的记录总数。
Teams[ , .SD[.N], by=teamID]
## teamID yearID lgID franchID divID Rank G Ghome W L DivWin WCWin
## 1: BS1 1875 NA BNA <NA> 1 82 NA 71 8 <NA> <NA>
## 2: CH1 1871 NA CNA <NA> 2 28 NA 19 9 <NA> <NA>
## 3: CL1 1872 NA CFC <NA> 7 22 NA 6 16 <NA> <NA>
## 4: FW1 1871 NA KEK <NA> 7 19 NA 7 12 <NA> <NA>
## 5: NY2 1875 NA NNA <NA> 6 71 NA 30 38 <NA> <NA>
## ---
## 145: ANA 2004 AL ANA W 1 162 81 92 70 Y N
## 146: ARI 2021 NL ARI W 5 162 81 52 110 N N
## 147: MIL 2021 NL MIL C 1 162 81 95 67 Y N
## 148: TBA 2021 AL TBD E 1 162 81 100 62 Y N
## 149: MIA 2021 NL FLA E 4 162 81 67 95 N N
## LgWin WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA
## 1: Y <NA> 831 3515 1128 167 51 15 33 52 93 37 NA NA 343 152 1.87
## 2: N <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76
## 3: N <NA> 174 943 272 28 5 0 17 13 12 3 NA NA 254 126 5.70
## 4: N <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17
## 5: N <NA> 328 2685 633 82 21 7 19 47 20 24 NA NA 425 174 2.46
## ---
## 145: N N 836 5675 1603 272 37 162 450 942 143 46 73 41 734 692 4.28
## 146: N N 679 5489 1297 308 31 144 537 1465 43 16 54 28 893 804 5.11
## 147: N N 738 5362 1251 255 18 194 586 1465 82 21 88 35 623 558 3.50
## 148: N N 857 5507 1336 288 36 222 585 1542 88 42 72 41 651 593 3.67
## 149: N N 623 5348 1244 226 23 158 450 1553 106 29 65 30 701 622 3.96
## CG SHO SV IPouts HA HRA BBA SOA E DP FP name
## 1: 60 10 17 2196 751 2 33 110 483 56 0.870 Boston Red Stockings
## 2: 25 0 1 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3: 15 0 0 597 285 6 24 11 184 17 0.816 Cleveland Forest Citys
## 4: 19 1 0 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5: 70 3 0 1910 718 4 21 77 526 30 0.838 New York Mutuals
## ---
## 145: 2 11 50 4363 1476 170 502 1164 90 126 0.985 Anaheim Angels
## 146: 3 4 22 4252 1480 232 555 1238 100 113 0.983 Arizona Diamondbacks
## 147: 2 19 44 4308 1156 168 537 1618 94 102 0.984 Milwaukee Brewers
## 148: 1 13 42 4367 1264 184 436 1478 80 130 0.986 Tampa Bay Rays
## 149: 1 8 33 4245 1282 162 529 1381 122 146 0.979 Miami Marlins
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1: South End Grounds I NA 103 96 BOS BS1
## 2: Union Base-Ball Grounds NA 104 102 CHI CH1
## 3: National Association Grounds NA 96 100 CLE CL1
## 4: Hamilton Field NA 101 107 KEK FW1
## 5: Union Grounds (Brooklyn) NA 99 100 NYU NY2
## ---
## 145: Angels Stadium of Anaheim 3375677 97 97 ANA ANA
## 146: Chase Field 1043010 99 101 ARI ARI
## 147: Miller Park 1824282 101 101 MIL ML4
## 148: Tropicana Field 761072 92 91 TBR TBA
## 149: Marlins Park 642617 98 99 MIA FLO
## teamIDretro
## 1: BS1
## 2: CH1
## 3: CL1
## 4: FW1
## 5: NY2
## ---
## 145: ANA
## 146: ARI
## 147: MIL
## 148: TBA
## 149: MIA
如果想显示每个棒球队的第一条记录,可以将.SD[.N]替换为.SD[1L]。
4.2 组优化
如果我们想知道每个球队total number of runs(数据集中是R字段)最大的年份,可以用以下代码实现:
Teams[ , .SD[which.max(R)], by=teamID]
## teamID yearID lgID franchID divID Rank G Ghome W L DivWin WCWin LgWin
## 1: BS1 1875 NA BNA <NA> 1 82 NA 71 8 <NA> <NA> Y
## 2: CH1 1871 NA CNA <NA> 2 28 NA 19 9 <NA> <NA> N
## 3: CL1 1871 NA CFC <NA> 8 29 NA 10 19 <NA> <NA> N
## 4: FW1 1871 NA KEK <NA> 7 19 NA 7 12 <NA> <NA> N
## 5: NY2 1872 NA NNA <NA> 3 56 NA 34 20 <NA> <NA> N
## ---
## 145: ANA 2000 AL ANA W 3 162 81 82 80 N N N
## 146: ARI 1999 NL ARI W 1 162 81 100 62 Y N N
## 147: MIL 1999 NL MIL C 5 161 80 74 87 N N N
## 148: TBA 2021 AL TBD E 1 162 81 100 62 Y N N
## 149: MIA 2017 NL FLA E 2 162 78 77 85 N N N
## WSWin R AB H X2B X3B HR BB SO SB CS HBP SF RA ER ERA CG SHO
## 1: <NA> 831 3515 1128 167 51 15 33 52 93 37 NA NA 343 152 1.87 60 10
## 2: <NA> 302 1196 323 52 21 10 60 22 69 21 NA NA 241 77 2.76 25 0
## 3: <NA> 249 1186 328 35 40 7 26 25 18 8 NA NA 341 116 4.11 23 0
## 4: <NA> 137 746 178 19 8 2 33 9 16 4 NA NA 243 97 5.17 19 1
## 5: <NA> 523 2426 670 87 14 4 58 52 59 22 NA NA 362 172 3.02 54 3
## ---
## 145: N 864 5628 1574 309 34 236 608 1024 93 52 47 43 869 805 5.00 5 3
## 146: N 908 5658 1566 289 46 216 588 1045 137 39 48 60 676 615 3.77 16 9
## 147: N 815 5582 1524 299 30 165 658 1065 81 33 55 51 886 813 5.07 2 5
## 148: N 857 5507 1336 288 36 222 585 1542 88 42 72 41 651 593 3.67 1 13
## 149: N 778 5602 1497 271 31 194 486 1282 91 30 67 41 822 772 4.82 1 7
## SV IPouts HA HRA BBA SOA E DP FP name
## 1: 17 2196 751 2 33 110 483 56 0.870 Boston Red Stockings
## 2: 1 753 308 6 28 22 229 16 0.829 Chicago White Stockings
## 3: 0 762 346 13 53 34 234 15 0.818 Cleveland Forest Citys
## 4: 0 507 261 5 21 17 163 8 0.803 Fort Wayne Kekiongas
## 5: 1 1536 622 2 33 46 323 33 0.868 New York Mutuals
## ---
## 145: 46 4344 1534 228 662 846 134 182 0.978 Anaheim Angels
## 146: 42 4402 1387 176 543 1198 104 132 0.983 Arizona Diamondbacks
## 147: 40 4328 1618 213 616 987 127 146 0.979 Milwaukee Brewers
## 148: 42 4367 1264 184 436 1478 80 130 0.986 Tampa Bay Rays
## 149: 34 4328 1450 193 627 1202 73 156 0.988 Miami Marlins
## park attendance BPF PPF teamIDBR teamIDlahman45
## 1: South End Grounds I NA 103 96 BOS BS1
## 2: Union Base-Ball Grounds NA 104 102 CHI CH1
## 3: National Association Grounds NA 96 100 CLE CL1
## 4: Hamilton Field NA 101 107 KEK FW1
## 5: Union Grounds (Brooklyn) NA 93 92 NYU NY2
## ---
## 145: Edison International Field 2066982 102 103 ANA ANA
## 146: Bank One Ballpark 3019654 101 101 ARI ARI
## 147: County Stadium 1701796 99 99 MIL ML4
## 148: Tropicana Field 761072 92 91 TBR TBA
## 149: Marlins Park 1583014 93 93 MIA FLO
## teamIDretro
## 1: BS1
## 2: CH1
## 3: CL1
## 4: FW1
## 5: NY2
## ---
## 145: ANA
## 146: ARI
## 147: MIL
## 148: TBA
## 149: MIA
如果只想返回某些字段信息,可以通过.SDcols定制展现的字段,譬如只想显示年份和R:
Teams[ , .SD[which.max(R)], by=teamID, .SDcols=c("R","yearID")]
## teamID R yearID
## 1: BS1 831 1875
## 2: CH1 302 1871
## 3: CL1 249 1871
## 4: FW1 137 1871
## 5: NY2 523 1872
## ---
## 145: ANA 864 2000
## 146: ARI 908 1999
## 147: MIL 815 1999
## 148: TBA 857 2021
## 149: MIA 778 2017
4.3 组水平上的回归分析
统计每个球队ERA回归到W上的系数是多少。
#统计所有球队的整体水平
overall_coef = Pitching[ , coef(lm(ERA ~ W))['W']]
#筛选记录数上大于20的球队,进行回归分析
Pitching[ , if (.N > 20L) .(w_coef = coef(lm(ERA ~ W, data=.SD))['W']), by = teamID
][ , hist(w_coef, 20L, las = 1L,
xlab = 'Fitted Coefficient on W',
ylab = 'Number of Teams', col = 'darkgreen',
main = 'Team-Level Distribution\nWin Coefficients on ERA')]
## $breaks
## [1] -0.46 -0.44 -0.42 -0.40 -0.38 -0.36 -0.34 -0.32 -0.30 -0.28 -0.26 -0.24
## [13] -0.22 -0.20 -0.18 -0.16 -0.14 -0.12 -0.10 -0.08 -0.06 -0.04 -0.02 0.00
##
## $counts
## [1] 1 1 1 2 0 3 2 1 3 3 7 10 8 6 4 5 4 2 3 1 1 1 1
##
## $density
## [1] 0.7142857 0.7142857 0.7142857 1.4285714 0.0000000 2.1428571 1.4285714
## [8] 0.7142857 2.1428571 2.1428571 5.0000000 7.1428571 5.7142857 4.2857143
## [15] 2.8571429 3.5714286 2.8571429 1.4285714 2.1428571 0.7142857 0.7142857
## [22] 0.7142857 0.7142857
##
## $mids
## [1] -0.45 -0.43 -0.41 -0.39 -0.37 -0.35 -0.33 -0.31 -0.29 -0.27 -0.25 -0.23
## [13] -0.21 -0.19 -0.17 -0.15 -0.13 -0.11 -0.09 -0.07 -0.05 -0.03 -0.01
##
## $xname
## [1] "w_coef"
##
## $equidist
## [1] TRUE
##
## attr(,"class")
## [1] "histogram"
abline(v = overall_coef, lty = 2L, col = 'red')