R for Data Science exercise solutions: Chapter 3 (Part 1)

3.2 Exercises

(1) Find all flights that meet the following conditions.

a. Had an arrival delay of two or more hours.

head(filter(flights, arr_delay >= 120))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1      811            630       101     1047            830
## 2 2013     1     1      848           1835       853     1001           1950
## 3 2013     1     1      957            733       144     1056            853
## 4 2013     1     1     1114            900       134     1447           1222
## 5 2013     1     1     1505           1310       115     1638           1431
## 6 2013     1     1     1525           1340       105     1831           1626
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

b. Flew to Houston (IAH or HOU).

#method1
head(filter(flights, dest == "IAH"|dest == "HOU"))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1      517            515         2      830            819
## 2 2013     1     1      533            529         4      850            830
## 3 2013     1     1      623            627        -4      933            932
## 4 2013     1     1      728            732        -4     1041           1038
## 5 2013     1     1      739            739         0     1104           1038
## 6 2013     1     1      908            908         0     1228           1219
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

#method2
head(filter(flights, dest %in% c( "IAH", "HOU")))

c. Were operated by United, American, or Delta.

#method1
head( filter(flights, carrier == "UA"|carrier == "AA"|carrier == "DL"))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1      517            515         2      830            819
## 2 2013     1     1      533            529         4      850            830
## 3 2013     1     1      542            540         2      923            850
## 4 2013     1     1      554            600        -6      812            837
## 5 2013     1     1      554            558        -4      740            728
## 6 2013     1     1      558            600        -2      753            745
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

#method2
head(filter(flights, carrier %in% c( "UA", "AA", "DL")))

d. Departed in summer (July, August, and September).

#method1
head(filter(flights, month == 7|month == 8|month == 9))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     7     1        1           2029       212      236           2359
## 2 2013     7     1        2           2359         3      344            344
## 3 2013     7     1       29           2245       104      151              1
## 4 2013     7     1       43           2130       193      322             14
## 5 2013     7     1       44           2150       174      300            100
## 6 2013     7     1       46           2051       235      304           2358
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

#method2
head(filter(flights, month %in% 7:9))

e. Arrived more than two hours late, but didn't leave late.

head(filter(flights, arr_delay > 120 & dep_delay <= 0))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1    27     1419           1420        -1     1754           1550
## 2 2013    10     7     1350           1350         0     1736           1526
## 3 2013    10     7     1357           1359        -2     1858           1654
## 4 2013    10    16      657            700        -3     1258           1056
## 5 2013    11     1      658            700        -2     1329           1015
## 6 2013     3    18     1844           1847        -3       39           2219
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

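f. Were delayed by at least an hour, but made up over 30 minutes in flight.

# "Delayed" is taken here to mean a departure delay of at least one hour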
head(filter(flights, dep_delay >= 60 & dep_delay - arr_delay > 30))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1     2205           1720       285       46           2040
## 2 2013     1     1     2326           2130       116      131             18
## 3 2013     1     3     1503           1221       162     1803           1555
## 4 2013     1     3     1839           1700        99     2056           1950
## 5 2013     1     3     1850           1745        65     2148           2120
## 6 2013     1     3     1941           1759       102     2246           2139
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

g. Departed between midnight and 6am (inclusive).

# Inspect the distribution of departure times
summary(flights$dep_time)

##    Min. 1st Qu.  Median    Mean 3rd Qu.    Max.    NA's
##       1     907    1401    1349    1744    2400    8255

# Midnight is recorded as 2400, not 0
# Method 1
head(filter(flights, dep_time <= 600 | dep_time == 2400))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1      517            515         2      830            819
## 2 2013     1     1      533            529         4      850            830
## 3 2013     1     1      542            540         2      923            850
## 4 2013     1     1      544            545        -1     1004           1022
## 5 2013     1     1      554            600        -6      812            837
## 6 2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

#method2
head(filter(flights, dep_time %% 2400 <= 600))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1      517            515         2      830            819
## 2 2013     1     1      533            529         4      850            830
## 3 2013     1     1      542            540         2      923            850
## 4 2013     1     1      544            545        -1     1004           1022
## 5 2013     1     1      554            600        -6      812            837
## 6 2013     1     1      554            558        -4      740            728
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour
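
Method 2 relies on the modulo operator: `%% 2400` maps the value 2400 back to 0 and leaves every other clock time unchanged, so a single comparison covers both midnight and the early morning. A minimal sketch on a few hand-picked values (the vector `times` is made up for illustration and is not taken from flights):

```r
# Hypothetical departure times in HHMM form
times <- c(1, 600, 601, 2359, 2400)

# 2400 wraps around to 0; all other values stay as they are
wrapped <- times %% 2400

# TRUE for departures between midnight and 6am, inclusive
early <- wrapped <= 600

print(data.frame(times, wrapped, early))
```

Only 601 and 2359 fall outside the window, which matches what Method 1 expresses with two separate conditions.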

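(2) Another useful dplyr filtering helper is between(). What does it do? Can you use it to simplify the code needed to answer the previous exercises?

# between(x, left, right) is equivalent to x >= left & x <= right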
head(filter(flights, between(month, 7, 9)))
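
Both bounds of between() are inclusive. The sketch below checks that on a small made-up vector (`months` is a hypothetical stand-in for the month column) and compares the result with the hand-written comparison:

```r
library(dplyr)

# Hypothetical month values around the summer boundary
months <- c(6, 7, 8, 9, 10)

# between() keeps values from 7 to 9, endpoints included
in_summer <- between(months, 7, 9)

# The explicit comparison should give exactly the same logical vector
same_as_manual <- identical(in_summer, months >= 7 & months <= 9)

print(in_summer)
print(same_as_manual)
```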

(3) How many flights have a missing dep_time? What other variables are missing? What might these rows represent?

result <- filter(flights, is.na(dep_time))
nrow(result)

## [1] 8255

head(result)

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     1       NA           1630        NA       NA           1815
## 2 2013     1     1       NA           1935        NA       NA           2240
## 3 2013     1     1       NA           1500        NA       NA           1825
## 4 2013     1     1       NA            600        NA       NA            901
## 5 2013     1     2       NA           1540        NA       NA           1747
## 6 2013     1     2       NA           1620        NA       NA           1746
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

# dep_delay and arr_time are missing as well, so these flights were most likely cancelled
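
To see how missing values are spread over the other variables, one option is to count the NA entries per column. This is a sketch on a small made-up data frame (`toy` and its values are hypothetical); the same `colSums(is.na(...))` call could be applied to flights itself:

```r
# Hypothetical stand-in for a few flights columns; NA marks a missing value
toy <- data.frame(
  dep_time  = c(517, NA, 542, NA),
  dep_delay = c(2, NA, 2, NA),
  arr_time  = c(830, NA, 923, NA),
  carrier   = c("UA", "AA", "AA", "B6")
)

# is.na() returns a logical matrix; colSums() counts the TRUE values per column
missing_per_column <- colSums(is.na(toy))

print(missing_per_column)
```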

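(4) Why is NA ^ 0 not missing? Why is NA | TRUE not missing? Why is FALSE & NA not missing? Can you figure out the general rule? (NA * 0 is a tricky counterexample!)

# Any number raised to the power 0 is 1, so the result does not depend on the missing value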
NA ^ 0

## [1] 1

# Anything OR TRUE is TRUE
NA | TRUE

## [1] TRUE

# Anything AND FALSE is FALSE
FALSE & NA

## [1] FALSE

# NA is unknown, so the result could be either TRUE or FALSE and therefore stays NA
NA | FALSE

## [1] NA

NA * 0

## [1] NA

# x * 0 is not 0 for every x: Inf * 0 and -Inf * 0 are undefined (NaN), so NA * 0 cannot be assumed to be 0
Inf * 0

## [1] NaN

-Inf * 0

## [1] NaN
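
The general rule can be read as: an operation on NA returns a definite value only when every possible value of the unknown would give the same answer. A rough way to check this is to substitute a handful of candidate values for the NA (the candidate vectors below are illustrative, not exhaustive):

```r
# Try a range of stand-in values for an unknown number
numbers <- c(-Inf, -1, 0, 1, Inf)
power_zero <- numbers ^ 0   # always 1, so NA ^ 0 is 1
times_zero <- numbers * 0   # NaN for the infinite values, so NA * 0 stays NA

# Try both possible values for an unknown logical
logicals <- c(TRUE, FALSE)
or_true   <- logicals | TRUE    # always TRUE, so NA | TRUE is TRUE
and_false <- logicals & FALSE   # always FALSE, so FALSE & NA is FALSE
or_false  <- logicals | FALSE   # results differ, so NA | FALSE stays NA

print(power_zero)
print(times_zero)
print(or_false)
```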

3.3 Exercises

(1) How could you use arrange() to sort all missing values to the start? (Hint: use is.na().)

head(arrange(flights, desc(is.na(dep_delay)), dep_delay))
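
arrange() places NA at the end by default. The call above works because is.na() turns the column into TRUE/FALSE, and desc() sorts TRUE (missing) ahead of FALSE; the second key then orders the remaining rows. A small sketch on made-up data (`toy` is hypothetical):

```r
library(dplyr)

# Hypothetical delays with two missing entries
toy <- tibble(id = 1:4, dep_delay = c(5, NA, -2, NA))

# Missing rows first, then the rest in ascending order of dep_delay
na_first <- arrange(toy, desc(is.na(dep_delay)), dep_delay)

print(na_first)
```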

(2) Sort flights to find the most delayed flights. Find the flights that left earliest.

head(arrange(flights, desc(dep_delay)))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1     9      641            900      1301     1242           1530
## 2 2013     6    15     1432           1935      1137     1607           2120
## 3 2013     1    10     1121           1635      1126     1239           1810
## 4 2013     9    20     1139           1845      1014     1457           2210
## 5 2013     7    22      845           1600      1005     1044           1815
## 6 2013     4    10     1100           1900       960     1342           2211
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

# "Left earliest" is taken here to mean departing furthest ahead of schedule
head(arrange(flights, dep_delay))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013    12     7     2040           2123       -43       40           2352
## 2 2013     2     3     2022           2055       -33     2240           2338
## 3 2013    11    10     1408           1440       -32     1549           1559
## 4 2013     1    11     1900           1930       -30     2233           2243
## 5 2013     1    29     1703           1730       -27     1947           1957
## 6 2013     8     9      729            755       -26     1002            955
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

(3) Sort flights to find the fastest flights.

head(arrange(flights, desc(distance / air_time)))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     5    25     1709           1700         9     1923           1937
## 2 2013     7     2     1558           1513        45     1745           1719
## 3 2013     5    13     2040           2025        15     2225           2226
## 4 2013     3    23     1914           1910         4     2045           2043
## 5 2013     1    12     1559           1600        -1     1849           1917
## 6 2013    11    17      650            655        -5     1059           1150
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour
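
In nycflights13, distance is recorded in miles and air_time in minutes, so distance / air_time is a speed in miles per minute. Multiplying by 60 gives miles per hour without changing the ordering. A sketch with made-up numbers (the vectors below are hypothetical):

```r
# Hypothetical flights: distance in miles, air time in minutes
distance <- c(762, 1400, 200)
air_time <- c(65, 210, 50)

# Average speed in miles per hour
mph <- distance / air_time * 60

# Row indices ordered from fastest to slowest
fastest_first <- order(mph, decreasing = TRUE)

print(round(mph, 1))
print(fastest_first)
```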

(4) Which flights had the longest flight time? Which had the shortest?

head(arrange(flights, desc(air_time)))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     3    17     1337           1335         2     1937           1836
## 2 2013     2     6      853            900        -7     1542           1540
## 3 2013     3    15     1001           1000         1     1551           1530
## 4 2013     3    17     1006           1000         6     1607           1530
## 5 2013     3    16     1001           1000         1     1544           1530
## 6 2013     2     5      900            900         0     1555           1540
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

head(arrange(flights, air_time))

## # A tibble: 6 x 19
##    year month   day dep_time sched_dep_time dep_delay arr_time sched_arr_time
##
## 1 2013     1    16     1355           1315        40     1442           1411
## 2 2013     4    13      537            527        10      622            628
## 3 2013    12     6      922            851        31     1021            954
## 4 2013     2     3     2153           2129        24     2247           2224
## 5 2013     2     5     1303           1315       -12     1342           1411
## 6 2013     2    12     2123           2130        -7     2211           2225
## # ... with 11 more variables: arr_delay, carrier, flight,
## #   tailnum, origin, dest, air_time, distance,
## #   hour, minute, time_hour

3.4 Exercises

(1) Brainstorm as many ways as possible to select dep_time, dep_delay, arr_time, and arr_delay from flights.

head(select(flights, dep_time, dep_delay, arr_time, arr_delay))

## # A tibble: 6 x 4
##   dep_time dep_delay arr_time arr_delay
##
## 1      517         2      830        11
## 2      533         4      850        20
## 3      542         2      923        33
## 4      544        -1     1004       -18
## 5      554        -6      812       -25
## 6      554        -4      740        12

head(flights[c("dep_time", "dep_delay", "arr_time", "arr_delay")])

head(select(flights, 4, 6, 7, 9))

head(flights[c(4, 6, 7, 9)])

head(select(flights, all_of(c("dep_time", "dep_delay", "arr_time", "arr_delay"))))

head(select(flights, any_of(c("dep_time", "dep_delay", "arr_time", "arr_delay"))))

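(2) What happens if you include the name of a variable multiple times in a select() call?

select() ignores the duplicates: a variable that is named more than once appears only once in the result, at the position of its first mention. No error, warning, or message is raised.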
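
A quick check on a made-up tibble (`toy` and its columns are hypothetical) shows the behaviour:

```r
library(dplyr)

# Hypothetical three-column tibble
toy <- tibble(a = 1:2, b = 3:4, c = 5:6)

# b is named twice; it is kept once, at its first position
picked <- select(toy, b, a, b)

print(names(picked))
```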

Because duplicates are ignored, select() can be combined with everything() to change the column order easily, without having to name every column.

head(select(flights, carrier, everything()))

## # A tibble: 6 x 19
##   carrier year month   day dep_time sched_dep_time dep_delay arr_time
##
## 1 UA       2013     1     1      517            515         2      830
## 2 UA       2013     1     1      533            529         4      850
## 3 AA       2013     1     1      542            540         2      923
## 4 B6       2013     1     1      544            545        -1     1004
## 5 DL       2013     1     1      554            600        -6      812
## 6 UA       2013     1     1      554            558        -4      740
## # ... with 11 more variables: sched_arr_time, arr_delay,
## #   flight, tailnum, origin, dest, air_time,
## #   distance, hour, minute, time_hour

(3) What does the one_of() function do? Why might it be helpful in conjunction with this vector?

vars <- c(
  "years", "month", "day", "dep_delay", "arr_delay"
)

head(select(flights, one_of(vars)))

## Warning: Unknown columns: `years`

## # A tibble: 6 x 4
##   month   day dep_delay arr_delay
##
## 1     1     1         2        11
## 2     1     1         4        20
## 3     1     1         2        33
## 4     1     1        -1       -18
## 5     1     1        -6       -25
## 6     1     1        -4        12

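one_of() selects the columns whose names appear in a character vector, and it only warns about names that do not exist (here, years) instead of failing. In recent versions of dplyr, one_of() has been superseded by all_of() and any_of(). When every variable exists in the data frame, these functions behave alike.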
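
The two replacements differ when a name is missing: any_of() silently skips it, while all_of() raises an error. A minimal sketch on a made-up tibble (`toy` is hypothetical, and the error is caught only so the script keeps running):

```r
library(dplyr)

# Hypothetical tibble without a "years" column
toy <- tibble(month = 1:2, day = 3:4, dep_delay = c(2, 4))
vars <- c("years", "month", "day")

# any_of() keeps the names that exist and ignores the rest
lenient <- select(toy, any_of(vars))

# all_of() fails on the unknown name; record that an error occurred
strict <- tryCatch(
  select(toy, all_of(vars)),
  error = function(e) "error"
)

print(names(lenient))
print(strict)
```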

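(4) Does the result of running the following code surprise you? How do the select helpers deal with case by default? How can you change that default?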

select(flights, contains("TIME"))

## # A tibble: 336,776 x 6
##    dep_time sched_dep_time arr_time sched_arr_time air_time time_hour
##
## 1      517            515      830            819      227 2013-01-01 05:00:00
## 2      533            529      850            830      227 2013-01-01 05:00:00
## 3      542            540      923            850      160 2013-01-01 05:00:00
## 4      544            545     1004           1022      183 2013-01-01 05:00:00
## 5      554            600      812            837      116 2013-01-01 06:00:00
## 6      554            558      740            728      150 2013-01-01 05:00:00
## 7      555            600      913            854      158 2013-01-01 06:00:00
## 8      557            600      709            723       53 2013-01-01 06:00:00
## 9      557            600      838            846      140 2013-01-01 06:00:00
## 10      558            600      753            745      138 2013-01-01 06:00:00
## # ... with 336,766 more rows

By default, contains() ignores case.

# To change this behaviour, set ignore.case = FALSE:
select(flights, contains("TIME", ignore.case = FALSE))

## # A tibble: 336,776 x 0
