R语言25-R语法: tapply

2019-05-07

R语言

tapply

tapply 对向量的子集执行批处理操作。

> str(tapply)
function (X, INDEX, FUN = NULL, ..., simplify = TRUE)

X 是一个向量
INDEX 因子或因子的列表（或与因子相关）
FUN 批处理的函数
… 其他传递给 FUN 函数
simplify 是否简化结果

分组取平均值。

> x <- c(rnorm(10), runif(10), rnorm(10, 1))
> f <- gl(3, 10)
> f
 [1] 1 1 1 1 1 1 1 1 1 1 2 2 2 2 2 2 2 2 2 2 3 3 3 3 3 3 3 3 3 3
Levels: 1 2 3
> tapply(x, f, mean)
        1         2         3 
0.1045255 0.4867243 0.9131191 

不简化分组取平均值的结果

> tapply(x, f, mean, simplify = FALSE)
$`1`
[1] 0.1045255

$`2`
[1] 0.4867243

$`3`
[1] 0.9131191

找到每组的数值范围。

> tapply(x, f, range)
$`1`
[1] -0.8040998  1.0022698

$`2`
[1] 0.04577595 0.95238798

$`3`
[1] -0.4422177  2.3863979

split

split 根据因子向量或因子列表分组。

> str(split)
function (x, f, drop = FALSE, ...)  

x 可以是向量、列表、数据框
f 因子或因子列表
drop 是否去除因子水平为空的结果

> split(x, f)
$`1`
 [1]  0.06417511  0.77601085  1.66855356  1.38744423
 [5] -0.90908770  0.39727163 -2.13528805  0.29087121
 [9]  0.82936584  0.53773723

$`2`
 [1] 0.6646064 0.4408925 0.3199122 0.2156969 0.8358507
 [6] 0.1408568 0.4088236 0.2258691 0.9606134 0.7945027

$`3`
 [1]  0.65276220  2.46645556  2.72756544  1.77246304
 [5]  2.94941952  0.11977102 -0.04283368  2.36610370
 [9]  0.44573942  2.31295594

lapply 和 split 配合使用的例子。

> lapply(split(x, f), mean)
$`1`
[1] 0.2907054

$`2`
[1] 0.5007624

$`3`
[1] 1.57704

拆分数据框

> library(datasets)
> head(airquality)
  Ozone Solar.R Wind Temp Month Day
1    41     190  7.4   67     5   1
2    36     118  8.0   72     5   2
3    12     149 12.6   74     5   3
4    18     313 11.5   62     5   4
5    NA      NA 14.3   56     5   5
6    28      NA 14.9   66     5   6

> s <- split(airquality, airquality$Month)
> lapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")]))
$`5`
   Ozone  Solar.R     Wind 
      NA       NA 11.62258 

$`6`
    Ozone   Solar.R      Wind 
       NA 190.16667  10.26667 

$`7`
     Ozone    Solar.R       Wind 
        NA 216.483871   8.941935 

$`8`
   Ozone  Solar.R     Wind 
      NA       NA 8.793548 

$`9`
   Ozone  Solar.R     Wind 
      NA 167.4333  10.1800 

> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")])) 
               5         6          7        8        9
Ozone         NA        NA         NA       NA       NA
Solar.R       NA 190.16667 216.483871       NA 167.4333
Wind    11.62258  10.26667   8.941935 8.793548  10.1800
> sapply(s, function(x) colMeans(x[, c("Ozone", "Solar.R", "Wind")], na.rm = TRUE))
                5         6          7          8         9
Ozone    23.61538  29.44444  59.115385  59.961538  31.44828
Solar.R 181.29630 190.16667 216.483871 171.857143 167.43333
Wind     11.62258  10.26667   8.941935   8.793548  10.18000

将数据分割到多个水平

> x <- rnorm(10)
> f1 <- gl(2, 5)
> f2 <- gl(5, 2)
> f1
 [1] 1 1 1 1 1 2 2 2 2 2
Levels: 1 2
> f2
 [1] 1 1 2 2 3 3 4 4 5 5
Levels: 1 2 3 4 5
> interaction(f1, f2)
 [1] 1.1 1.1 1.2 1.2 1.3 2.3 2.4 2.4 2.5 2.5
Levels: 1.1 2.1 1.2 2.2 1.3 2.3 1.4 2.4 1.5 2.5

interaction 函数可以创建没有值的水平。

> str(split(x, list(f1, f2)))
List of 10
 $ 1.1: num [1:2] -1.073 -0.572
 $ 2.1: num(0) 
 $ 1.2: num [1:2] -0.435 -0.976
 $ 2.2: num(0) 
 $ 1.3: num 0.201
 $ 2.3: num 0.691
 $ 1.4: num(0) 
 $ 2.4: num [1:2] 0.498 0.333
 $ 1.5: num(0) 
 $ 2.5: num [1:2] -0.00797 -0.1332

去除无值的分组。

> str(split(x, list(f1, f2), drop = TRUE))
List of 6
 $ 1.1: num [1:2] -1.073 -0.572
 $ 1.2: num [1:2] -0.435 -0.976
 $ 1.3: num 0.201
 $ 2.3: num 0.691
 $ 2.4: num [1:2] 0.498 0.333
 $ 2.5: num [1:2] -0.00797 -0.1332