1. Data Transformation
So far we have covered the rearrangement of data. Another important class of operations involves filtering, cleaning, and other transformations.
2. Removing Duplicates
Duplicate rows frequently appear in a DataFrame. Here is an example:
In [4]: data = pd.DataFrame({'k1': ['one'] * 3 + ['two'] * 4,
   ...:                      'k2': [1, 1, 2, 3, 3, 4, 4]})

In [5]: data
Out[5]:
    k1  k2
0  one   1
1  one   1
2  one   2
3  two   3
4  two   3
5  two   4
6  two   4

[7 rows x 2 columns]

In [6]: data.duplicated()
Out[6]:
0    False
1     True
2    False
3    False
4     True
5    False
6     True
dtype: bool

In [7]: data.drop_duplicates()
Out[7]:
    k1  k2
0  one   1
2  one   2
3  two   3
5  two   4

[4 rows x 2 columns]

In [8]: data['v1'] = range(7)

In [9]: data
Out[9]:
    k1  k2  v1
0  one   1   0
1  one   1   1
2  one   2   2
3  two   3   3
4  two   3   4
5  two   4   5
6  two   4   6

[7 rows x 3 columns]

In [10]: data.drop_duplicates(['k1'])
Out[10]:
    k1  k2  v1
0  one   1   0
3  two   3   3

[2 rows x 3 columns]

In [11]: data.drop_duplicates(['k1', 'k2'], take_last=True)
Out[11]:
    k1  k2  v1
1  one   1   1
2  one   2   2
4  two   3   4
6  two   4   6

[4 rows x 3 columns]
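Note that take_last=True reflects the older pandas API captured in this session; in later pandas versions the keyword was replaced by keep. A minimal equivalent under a current pandas:

# Modern spelling: keep the last observation in each duplicate group
data.drop_duplicates(['k1', 'k2'], keep='last')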
3. Transforming Data with a Function or Mapping
When transforming a dataset, you often want to perform the transformation based on the values in an array, Series, or DataFrame column. Consider the following data about various kinds of meat:
In [12]: data = pd.DataFrame({'food': ['bacon', 'pulled pork', 'bacon', 'Pastrami',
   ....:                               'corned beef', 'Bacon', 'pastrami',
   ....:                               'honey ham', 'nova lox'],
   ....:                      'ounces': [4, 3, 12, 6, 7.5, 8, 3, 5, 6]})

In [13]: data
Out[13]:
          food  ounces
0        bacon     4.0
1  pulled pork     3.0
2        bacon    12.0
3     Pastrami     6.0
4  corned beef     7.5
5        Bacon     8.0
6     pastrami     3.0
7    honey ham     5.0
8     nova lox     6.0

[9 rows x 2 columns]

In [14]: meat_to_animal = {
   ....:     'bacon': 'pig',
   ....:     'pulled pork': 'pig',
   ....:     'pastrami': 'cow',
   ....:     'corned beef': 'cow',
   ....:     'honey ham': 'pig',
   ....:     'nova lox': 'salmon'
   ....: }

In [15]: data['animal'] = data['food'].map(str.lower).map(meat_to_animal)

In [16]: data
Out[16]:
          food  ounces  animal
0        bacon     4.0     pig
1  pulled pork     3.0     pig
2        bacon    12.0     pig
3     Pastrami     6.0     cow
4  corned beef     7.5     cow
5        Bacon     8.0     pig
6     pastrami     3.0     cow
7    honey ham     5.0     pig
8     nova lox     6.0  salmon

[9 rows x 3 columns]

In [17]: data['food'].map(lambda x: meat_to_animal[x.lower()])
Out[17]:
0       pig
1       pig
2       pig
3       cow
4       cow
5       pig
6       cow
7       pig
8    salmon
Name: food, dtype: object
Note:
Using map is a convenient way to perform element-wise transformations and other data cleaning operations.
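In current pandas versions, the .str accessor offers an equivalent spelling of the lowercase-then-map chain above (a sketch, assuming a reasonably recent pandas):

# Vectorized lowercasing via the .str accessor, then the dict lookup
data['food'].str.lower().map(meat_to_animal)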
4. Replacing Values
Filling in missing data with the fillna method can be viewed as a special case of more general value replacement. While the map method discussed above can be used to modify a subset of values in an object, replace provides a simpler, more flexible way to do so. Consider this Series:
In [18]: data = pd.Series([1., -999, 2., -999, -1000., 3.])

In [19]: data
Out[19]:
0       1
1    -999
2       2
3    -999
4   -1000
5       3
dtype: float64

In [20]: data.replace(-999, np.nan)
Out[20]:
0       1
1     NaN
2       2
3     NaN
4   -1000
5       3
dtype: float64

In [21]: data.replace([-999, -1000], np.nan)
Out[21]:
0     1
1   NaN
2     2
3   NaN
4   NaN
5     3
dtype: float64

In [22]: data.replace([-999, -1000], [np.nan, 0])
Out[22]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64

In [23]: data.replace({-999: np.nan, -1000: 0})
Out[23]:
0     1
1   NaN
2     2
3   NaN
4     0
5     3
dtype: float64
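replace works on DataFrames as well. One pattern worth knowing is the nested-dict form, which restricts a replacement to a particular column; a sketch, assuming a hypothetical DataFrame df whose 'score' column uses -999 as a sentinel:

# Replace -999 with NaN in the 'score' column only, leaving other columns untouched
df.replace({'score': {-999: np.nan}})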
5. Renaming Axis Indexes
Like values in a Series, axis labels can be transformed by a function or a mapping to produce new, differently labeled objects. The axes can also be modified in place without creating a new data structure. Here is a simple example:
In [24]: data = pd.DataFrame(np.arange(12).reshape((3, 4)),
   ....:                     index=['Ohio', 'Colorado', 'New York'],
   ....:                     columns=['one', 'two', 'three', 'four'])

In [25]: data.index.map(str.upper)
Out[25]: array(['OHIO', 'COLORADO', 'NEW YORK'], dtype=object)

In [26]: data.index = data.index.map(str.upper)

In [27]: data
Out[27]:
          one  two  three  four
OHIO        0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

[3 rows x 4 columns]

In [28]: data.rename(index=str.title, columns=str.upper)
Out[28]:
          ONE  TWO  THREE  FOUR
Ohio        0    1      2     3
Colorado    4    5      6     7
New York    8    9     10    11

[3 rows x 4 columns]

In [31]: data.rename(index={'OHIO': 'INDIANA'},
   ....:             columns={'three': 'peekaboo'})
Out[31]:
          one  two  peekaboo  four
INDIANA     0    1         2     3
COLORADO    4    5         6     7
NEW YORK    8    9        10    11

[3 rows x 4 columns]

In [32]: # always returns a reference to the DataFrame

In [33]: _ = data.rename(index={'OHIO': 'INDIANA'}, inplace=True)

In [34]: data
Out[34]:
          one  two  three  four
INDIANA     0    1      2     3
COLORADO    4    5      6     7
NEW YORK    8    9     10    11

[3 rows x 4 columns]
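The "always returns a reference" comment reflects older pandas behavior; in current versions rename returns None when inplace=True, so capturing the result in _ is no longer meaningful. A sketch of the two idioms under a modern pandas:

# Preferred: rename returns a new DataFrame, original unchanged
renamed = data.rename(index={'OHIO': 'INDIANA'})

# In-place: mutates data directly and returns None
data.rename(index={'OHIO': 'INDIANA'}, inplace=True)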
6. Discretization and Binning
Continuous data is often discretized or otherwise separated into "bins" for analysis. Suppose you have data about a group of people, and you want to group them into discrete age buckets:
In [35]: ages = [20, 22, 25, 27, 21, 23, 37, 31, 61, 45, 41, 32]

In [36]: bins = [18, 25, 35, 60, 100]

In [37]: cats = pd.cut(ages, bins)

In [38]: cats
Out[38]:
(18, 25]
(18, 25]
(18, 25]
(25, 35]
(18, 25]
(18, 25]
(35, 60]
(25, 35]
(60, 100]
(35, 60]
(35, 60]
(25, 35]
Levels (4): Index(['(18, 25]', '(25, 35]', '(35, 60]', '(60, 100]'], dtype=object)

In [39]: cats.labels
Out[39]: array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1])

In [40]: cats.levels
Out[40]: Index([u'(18, 25]', u'(25, 35]', u'(35, 60]', u'(60, 100]'], dtype='object')

In [41]: pd.value_counts(cats)
Out[41]:
(18, 25]     5
(35, 60]     3
(25, 35]     3
(60, 100]    1
dtype: int64
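The labels and levels attributes shown here belong to the older Categorical API; in modern pandas they are called codes and categories. A sketch, assuming a current version:

# Integer bin codes and the bin labels under the modern Categorical API
cats.codes        # array([0, 0, 0, 1, 0, 0, 2, 1, 3, 2, 2, 1], dtype=int8)
cats.categories   # IntervalIndex of the four bins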
In [42]: pd.cut(ages, [18, 26, 36, 61, 100], right=False)
Out[42]:
[18, 26)
[18, 26)
[18, 26)
[26, 36)
[18, 26)
[18, 26)
[36, 61)
[26, 36)
[61, 100)
[36, 61)
[36, 61)
[26, 36)
Levels (4): Index(['[18, 26)', '[26, 36)', '[36, 61)', '[61, 100)'], dtype=object)

In [43]: group_names = ['Youth', 'YoungAdult', 'MiddleAged', 'Senior']

In [44]: pd.cut(ages, bins, labels=group_names)
Out[44]:
Youth
Youth
Youth
YoungAdult
Youth
Youth
MiddleAged
YoungAdult
Senior
MiddleAged
MiddleAged
YoungAdult
Levels (4): Index(['Youth', 'YoungAdult', 'MiddleAged', 'Senior'], dtype=object)

In [45]: data = np.random.rand(20)

In [46]: pd.cut(data, 4, precision=2)
Out[46]:
(0.037, 0.26]
(0.037, 0.26]
(0.48, 0.7]
(0.7, 0.92]
(0.037, 0.26]
(0.037, 0.26]
(0.7, 0.92]
(0.7, 0.92]
(0.037, 0.26]
(0.26, 0.48]
(0.26, 0.48]
(0.26, 0.48]
(0.037, 0.26]
(0.26, 0.48]
(0.48, 0.7]
(0.7, 0.92]
(0.037, 0.26]
(0.7, 0.92]
(0.037, 0.26]
(0.037, 0.26]
Levels (4): Index(['(0.037, 0.26]', '(0.26, 0.48]', '(0.48, 0.7]', '(0.7, 0.92]'], dtype=object)
In [48]: data = np.random.randn(1000)  # normally distributed

In [49]: cats = pd.qcut(data, 4)  # cut into quartiles

In [50]: cats
Out[50]:
[-3.636, -0.717]
(0.647, 3.531]
[-3.636, -0.717]
[-3.636, -0.717]
[-3.636, -0.717]
(0.647, 3.531]
[-3.636, -0.717]
(-0.717, -0.0323]
(-0.717, -0.0323]
(0.647, 3.531]
[-3.636, -0.717]
(-0.717, -0.0323]
(0.647, 3.531]
...
[-3.636, -0.717]
[-3.636, -0.717]
(0.647, 3.531]
(-0.717, -0.0323]
(0.647, 3.531]
[-3.636, -0.717]
[-3.636, -0.717]
(-0.0323, 0.647]
[-3.636, -0.717]
(-0.717, -0.0323]
(-0.717, -0.0323]
(-0.0323, 0.647]
(0.647, 3.531]
Levels (4): Index(['[-3.636, -0.717]', '(-0.717, -0.0323]', '(-0.0323, 0.647]', '(0.647, 3.531]'], dtype=object)
Length: 1000

In [51]: pd.value_counts(cats)
Out[51]:
(-0.717, -0.0323]    250
(-0.0323, 0.647]     250
(0.647, 3.531]       250
[-3.636, -0.717]     250
dtype: int64

In [52]: pd.qcut(data, [0, 0.1, 0.5, 0.9, 1.])
Out[52]:
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
[-3.636, -1.323]
[-3.636, -1.323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-1.323, -0.0323]
(1.234, 3.531]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-0.0323, 1.234]
...
[-3.636, -1.323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-0.0323, 1.234]
[-3.636, -1.323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-1.323, -0.0323]
(-0.0323, 1.234]
(-0.0323, 1.234]
Levels (4): Index(['[-3.636, -1.323]', '(-1.323, -0.0323]', '(-0.0323, 1.234]', '(1.234, 3.531]'], dtype=object)
Length: 1000
Note:
We will return to cut and qcut later when discussing aggregation and group operations, as these discretization functions are very important for quantile and group analysis.
7. Detecting and Filtering Outliers
Filtering or transforming outliers is largely a matter of applying array operations. Consider a DataFrame with some normally distributed data:
In [53]: np.random.seed(12345)

In [54]: data = pd.DataFrame(np.random.randn(1000, 4))

In [55]: data.describe()
Out[55]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067684     0.067924     0.025598    -0.002298
std       0.998035     0.992106     1.006835     0.996794
min      -3.428254    -3.548824    -3.184377    -3.745356
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.366626     2.653656     3.260383     3.927528

[8 rows x 4 columns]

In [56]: col = data[3]

In [57]: col[np.abs(col) > 3]
Out[57]:
97     3.927528
305   -3.399312
400   -3.745356
Name: 3, dtype: float64

In [58]: data[(np.abs(data) > 3).any(1)]
Out[58]:
            0         1         2         3
5   -0.539741  0.476985  3.248944 -1.021228
97  -0.774363  0.552936  0.106061  3.927528
102 -0.655054 -0.565230  3.176873  0.959533
305 -2.315555  0.457246 -0.025907 -3.399312
324  0.050188  1.951312  3.260383  0.963301
400  0.146326  0.508391 -0.196713 -3.745356
499 -0.293333 -0.242459 -3.056990  1.918403
523 -3.428254 -0.296336 -0.439938 -0.867165
586  0.275144  1.179227 -3.184377  1.369891
808 -0.362528 -3.548824  1.553205 -2.186301
900  3.366626 -2.372214  0.851010  1.332846

[11 rows x 4 columns]
In [59]: data[np.abs(data) > 3] = np.sign(data) * 3

In [60]: data.describe()
Out[60]:
                 0            1            2            3
count  1000.000000  1000.000000  1000.000000  1000.000000
mean     -0.067623     0.068473     0.025153    -0.002081
std       0.995485     0.990253     1.003977     0.989736
min      -3.000000    -3.000000    -3.000000    -3.000000
25%      -0.774890    -0.591841    -0.641675    -0.644144
50%      -0.116401     0.101143     0.002073    -0.013611
75%       0.616366     0.780282     0.680391     0.654328
max       3.000000     2.653656     3.000000     3.000000

[8 rows x 4 columns]
Note:
The ufunc np.sign returns an array of 1s and -1s indicating the sign of each original value.
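A one-line alternative to the cap-by-sign assignment above is the clip method, which bounds values to a range (a minimal sketch, assuming a modern pandas):

# Cap every value to the interval [-3, 3]; returns a new DataFrame
capped = data.clip(-3, 3)
capped.describe()  # min/max are now bounded at +/-3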
8. Permutation and Random Sampling
Permuting (randomly reordering) a Series or the rows of a DataFrame is easy to do using the numpy.random.permutation function. Calling permutation with the length of the axis you want to permute produces an array of integers indicating the new ordering:
In [61]: df = pd.DataFrame(np.arange(5 * 4).reshape(5, 4))

In [62]: sampler = np.random.permutation(5)

In [63]: sampler
Out[63]: array([1, 0, 2, 3, 4])

In [64]: df
Out[64]:
    0   1   2   3
0   0   1   2   3
1   4   5   6   7
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

[5 rows x 4 columns]

In [65]: df.take(sampler)
Out[65]:
    0   1   2   3
1   4   5   6   7
0   0   1   2   3
2   8   9  10  11
3  12  13  14  15
4  16  17  18  19

[5 rows x 4 columns]

In [66]: df.take(np.random.permutation(len(df))[:3])
Out[66]:
    0   1   2   3
1   4   5   6   7
3  12  13  14  15
4  16  17  18  19

[3 rows x 4 columns]

In [67]: bag = np.array([5, 7, -1, 6, 4])

In [68]: sampler = np.random.randint(0, len(bag), size=10)

In [69]: sampler
Out[69]: array([4, 4, 2, 2, 2, 0, 3, 0, 4, 1])

In [70]: draws = bag.take(sampler)

In [71]: draws
Out[71]: array([ 4,  4, -1, -1, -1,  5,  6,  5,  4,  7])
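Modern pandas wraps both of these patterns in the sample method, covering sampling without and with replacement (a sketch, assuming a current pandas):

# Three rows without replacement; ten draws with replacement
df.sample(n=3)
df.sample(n=10, replace=True)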
9. Computing Indicator/Dummy Variables
Another type of transformation for statistical modeling or machine learning applications is converting a categorical variable into a "dummy matrix" or "indicator matrix". If a column in a DataFrame has k distinct values, you can derive a matrix or DataFrame with k columns whose values are all 1s and 0s. pandas has a get_dummies function for doing this, though devising one yourself is not difficult (a hand-rolled sketch appears after the next two examples). Returning to an earlier example:
In [72]: df = pd.DataFrame({'key': ['b', 'b', 'a', 'c', 'a', 'b'],
   ....:                    'data1': range(6)})

In [73]: pd.get_dummies(df['key'])
Out[73]:
   a  b  c
0  0  1  0
1  0  1  0
2  1  0  0
3  0  0  1
4  1  0  0
5  0  1  0

[6 rows x 3 columns]
In [74]: dummies = pd.get_dummies(df['key'], prefix='key')

In [75]: df_with_dummy = df[['data1']].join(dummies)

In [76]: df_with_dummy
Out[76]:
   data1  key_a  key_b  key_c
0      0      0      1      0
1      1      0      1      0
2      2      1      0      0
3      3      0      0      1
4      4      1      0      0
5      5      0      1      0

[6 rows x 4 columns]
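As promised above, rolling your own single-column dummy encoder is not hard. Here is a minimal sketch (an illustration of the idea, not how pandas implements it; my_get_dummies is a hypothetical name):

import pandas as pd

def my_get_dummies(s, prefix=None):
    # One indicator column per distinct value; row i of column v is 1 when s[i] == v
    values = sorted(s.unique())
    out = pd.DataFrame({v: (s == v).astype(int) for v in values}, index=s.index)
    if prefix is not None:
        out.columns = ['%s_%s' % (prefix, v) for v in values]
    return out

# Usage: reproduces the shape of the key_a/key_b/key_c frame above
my_get_dummies(df['key'], prefix='key')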
If a row in a DataFrame belongs to multiple categories, things are a bit more involved. Consider the MovieLens movies.dat file, where each movie carries a pipe-delimited list of genres:

In [77]: mnames = ['movie_id', 'title', 'genres']

In [78]: movies = pd.read_table('movies.dat', sep='::', header=None,
   .....:                       names=mnames)

In [79]: movies[:10]
Out[79]:
   movie_id                               title                        genres
0         1                    Toy Story (1995)   Animation|Children's|Comedy
1         2                      Jumanji (1995)  Adventure|Children's|Fantasy
2         3             Grumpier Old Men (1995)                Comedy|Romance
3         4            Waiting to Exhale (1995)                  Comedy|Drama
4         5  Father of the Bride Part II (1995)                        Comedy
5         6                         Heat (1995)         Action|Crime|Thriller
6         7                      Sabrina (1995)                Comedy|Romance
7         8                 Tom and Huck (1995)          Adventure|Children's
8         9                 Sudden Death (1995)                        Action
9        10                    GoldenEye (1995)     Action|Adventure|Thriller
In [80]: genre_iter = (set(x.split('|')) for x in movies.genres)

In [81]: genres = sorted(set.union(*genre_iter))

In [82]: dummies = pd.DataFrame(np.zeros((len(movies), len(genres))), columns=genres)

In [83]: for i, gen in enumerate(movies.genres):
   .....:     dummies.ix[i, gen.split('|')] = 1

In [84]: movies_windic = movies.join(dummies.add_prefix('Genre_'))

In [85]: movies_windic.ix[0]
Out[85]:
movie_id                                   1
title                       Toy Story (1995)
genres           Animation|Children's|Comedy
Genre_Action                               0
Genre_Adventure                            0
Genre_Animation                            1
Genre_Children's                           1
Genre_Comedy                               1
Genre_Crime                                0
Genre_Documentary                          0
Genre_Drama                                0
Genre_Fantasy                              0
Genre_Film-Noir                            0
Genre_Horror                               0
Genre_Musical                              0
Genre_Mystery                              0
Genre_Romance                              0
Genre_Sci-Fi                               0
Genre_Thriller                             0
Genre_War                                  0
Genre_Western                              0
Name: 0
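The .ix indexer used in this session was later deprecated and removed from pandas; under a current version the equivalents are explicit label and position indexers (a sketch):

# Modern replacements for the two .ix calls above
dummies.loc[i, gen.split('|')] = 1
movies_windic.iloc[0]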
Caution:
For much larger data, this method of constructing indicator variables with multiple memberships is not especially fast; you would want to write a lower-level function that exploits the DataFrame's internals (see the vectorized sketch below).
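Short of writing such a function, current pandas versions ship a vectorized helper that covers this exact multi-membership case (a sketch, assuming the str accessor's get_dummies is available, as it is in modern pandas):

# Splits each genres string on '|' and builds the full 0/1 matrix in one call
dummies = movies.genres.str.get_dummies('|')
movies_windic = movies.join(dummies.add_prefix('Genre_'))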
A useful recipe for statistical applications is to combine get_dummies with a discretization function like cut:
In [86]: values = np.random.rand(10)

In [87]: values
Out[87]:
array([ 0.75603383,  0.90830844,  0.96588737,  0.17373658,  0.87592824,
        0.75415641,  0.163486  ,  0.23784062,  0.85564381,  0.58743194])

In [88]: bins = [0, 0.2, 0.4, 0.6, 0.8, 1]

In [89]: pd.get_dummies(pd.cut(values, bins))
Out[89]:
   (0, 0.2]  (0.2, 0.4]  (0.4, 0.6]  (0.6, 0.8]  (0.8, 1]
0         0           0           0           1         0
1         0           0           0           0         1
2         0           0           0           0         1
3         1           0           0           0         0
4         0           0           0           0         1
5         0           0           0           1         0
6         1           0           0           0         0
7         0           1           0           0         0
8         0           0           0           0         1
9         0           0           1           0         0

[10 rows x 5 columns]