Feature Extraction and Automatic Labeling for Financial Data

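In traditional machine learning, one important piece of work is "feature engineering", which means extracting "features" from the raw data. In image recognition, for example, images need to be converted to grayscale, binarized, and so on.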

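Financial time series lend themselves to the same kind of feature extraction: returns, volatility, and technical indicators such as moving averages and momentum. Fundamental features such as PE, PB, and ROA work too. These features correspond to the "factors" of traditional financial analysis, so we can start from traditional alpha factors and let a machine learning model look for relationships among them.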
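
To make the feature types listed above concrete, here is a minimal sketch that computes a return, a moving average, a momentum value, and a rolling volatility with pandas. The prices are synthetic and the column names (`return_1d`, `ma_3`, `momentum_3`, `volatility_3`) are made up for illustration; they are not part of the extractor described in this post.

```python
import pandas as pd

# Synthetic closing prices, for illustration only (not real market data).
close = pd.Series([100.0, 102.0, 101.0, 105.0, 107.0, 106.0, 110.0], name='close')

daily_return = close / close.shift(1) - 1

features = pd.DataFrame({
    'return_1d': daily_return,                      # 1-day simple return
    'ma_3': close.rolling(3).mean(),                # 3-day moving average
    'momentum_3': close / close.shift(3) - 1,       # 3-day price momentum
    'volatility_3': daily_return.rolling(3).std(),  # rolling std of daily returns
})
print(features)
```

The first rows of each column are NaN because the look-back window is not yet full.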

So, on top of the raw data, we implement automated feature extraction and data labeling.

Automatic Feature Extraction for Financial Data

Import the required packages and modules

from engine.common.mongo_utils import mongo
import pandas as pd
from datetime import datetime

Query the data from MongoDB and preprocess it

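def feature_extractor(instrument,features,start_date='',end_date='',benchmark='000300_index'):
    # Note: benchmark is accepted but not used yet.
    # $gt / $lt make both date bounds exclusive.
    items = mongo.query_docs('astock_daily_quotes',{'code':instrument,
                                            'date':{'$gt':start_date,'$lt':end_date}},
                             )

    df = pd.DataFrame(list(items))
    df = df[['open','high','low','close','date','code']]
    # Use the date as a sorted index (dropping it as a column) so shift() follows time order.
    df = df.set_index('date').sort_index()

    # Add one column per requested feature.
    for feature in features:
        df = parse_feature(df,feature)
    return df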

Parse the requested features

def parse_feature(df,feature):
    # Feature names follow '<name>_<param>', e.g. 'return_4'; param defaults to 0.
    features_support = ['return']

    feature_name, _, param = feature.partition('_')
    param = int(param) if param else 0

    # Unsupported features are skipped and df is returned unchanged.
    if feature_name not in features_support:
        return df

    if feature_name == 'return':
        # Simple return over param+1 trading days:
        # 'return_0' is the 1-day return, 'return_4' the 5-day return.
        df[feature] = df['close'] / df['close'].shift(param+1) - 1
    return df
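
The `features_support` list suggests that more feature types are meant to be added over time. One possible way to do that, shown here only as a sketch, is a lookup table from feature name to a function. The names `FEATURE_FUNCS` and `parse_feature_sketch` and the `ma` feature are hypothetical and do not appear in the original code.

```python
import pandas as pd

# Hypothetical registry: feature name -> function(df, param) returning a Series.
FEATURE_FUNCS = {
    'return': lambda df, n: df['close'] / df['close'].shift(n + 1) - 1,
    'ma': lambda df, n: df['close'].rolling(n).mean(),
}

def parse_feature_sketch(df, feature):
    """Add one column named `feature` (e.g. 'return_4', 'ma_3') if it is supported."""
    name, _, raw_param = feature.partition('_')
    func = FEATURE_FUNCS.get(name)
    if func is None:
        return df  # unsupported names are ignored, as in parse_feature
    df[feature] = func(df, int(raw_param) if raw_param else 0)
    return df

# Toy prices, for illustration only.
demo = pd.DataFrame({'close': [10.0, 11.0, 12.0, 13.0]})
for feature in ['return_0', 'ma_2', 'unknown_1']:
    demo = parse_feature_sketch(demo, feature)
print(demo)
```

With a table like this, supporting a new feature means adding one entry rather than another `if` branch.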

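Let's read the data for Kweichow Moutai (600519) between 2017-01-01 and 2017-01-31 and extract two features: the 1-day return (return_0) and the 5-day return (return_4).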

features = ['return_0','return_4']
start = datetime(2017,1,1)
end = datetime(2017,1,31)
print(feature_extractor('600519',features,start_date=start,end_date=end))

As the result below shows, we not only read the basic daily OHLC bars but also computed return_0 and return_4 automatically.

              open    high     low   close    code  return_0  return_4
date
2017-01-03  334.28  337.00  332.81  334.56  600519       NaN       NaN
2017-01-04  334.62  352.17  334.60  351.91  600519  0.051859       NaN
2017-01-05  350.00  351.45  345.44  346.74  600519 -0.014691       NaN
2017-01-06  346.64  359.78  346.10  350.76  600519  0.011594       NaN
2017-01-09  347.80  352.88  346.54  348.51  600519 -0.006415       NaN
2017-01-10  348.45  352.00  346.60  349.00  600519  0.001406  0.043161
2017-01-11  348.00  348.00  343.50  345.45  600519 -0.010172 -0.018357
2017-01-12  346.55  347.40  344.51  347.05  600519  0.004632  0.000894
2017-01-13  346.98  347.39  343.88  344.87  600519 -0.006282 -0.016792
2017-01-16  344.13  344.80  338.80  341.47  600519 -0.009859 -0.020200
2017-01-17  342.60  351.50  342.00  349.13  600519  0.022432  0.000372
2017-01-18  348.88  356.77  347.21  355.08  600519  0.017042  0.027877
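
As a quick sanity check of the naming convention, the sketch below recomputes two cells of the table above from its closing prices. It only assumes that `return_N` is computed as `close / close.shift(N + 1) - 1`, as in `parse_feature`.

```python
import pandas as pd

# Closing prices copied from the first six rows of the table above.
close = pd.Series(
    [334.56, 351.91, 346.74, 350.76, 348.51, 349.00],
    index=pd.to_datetime(['2017-01-03', '2017-01-04', '2017-01-05',
                          '2017-01-06', '2017-01-09', '2017-01-10']),
)

return_0 = close / close.shift(1) - 1   # param = 0 -> shift(0 + 1)
return_4 = close / close.shift(5) - 1   # param = 4 -> shift(4 + 1)

print(round(return_0.iloc[1], 6))   # row 2017-01-04
print(round(return_4.iloc[5], 6))   # row 2017-01-10
```

Both values agree with the table, which confirms that `return_4` spans five trading days, not four.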

Automatic Labeling for Financial Data

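Data labeling is the most important data-preparation task in supervised learning. Supervised learning essentially learns the statistical relationship between sample features and their labels. The familiar saying "garbage in, garbage out" stresses exactly this point: the quality of the labels matters.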

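Modern deep learning builds on large data samples and the powerful compute of GPUs. Labeling that data is very expensive: many companies, for example in autonomous driving, hire people across borders to do the labeling.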

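Financial time series are comparatively easy to label, and to a certain extent the labeling can be automated. From a backtesting point of view, every trade has already happened and is on record. Standing at a point in the past, we know the "future" price path over the next few days or months, together with the corresponding returns, volatility, and so on. These quantities can serve as the labels of the samples.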

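As described earlier, feature extraction "looks back at history": for example, the return over the last 5 days becomes a feature of the current day. Labeling looks forward instead: given today's features, what is the return over, say, the next 5 days?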
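
The difference between looking back and looking forward comes down to the sign of the shift. Here is a minimal sketch on toy prices; the names `past_return` and `future_return` are chosen for illustration.

```python
import pandas as pd

# Toy prices rising 10% per step, for illustration only.
close = pd.Series([10.0, 11.0, 12.1, 13.31, 14.641])

n = 2
past_return = close / close.shift(n) - 1      # feature: compares row t with row t-n
future_return = close.shift(-n) / close - 1   # label: compares row t+n with row t

frame = pd.DataFrame({'close': close,
                      'past_return': past_return,
                      'future_return': future_return})
print(frame)
```

The feature is undefined for the first `n` rows and the label for the last `n` rows, because the needed past or future price does not exist there.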

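The target is the return over the next 5 days.
Rank the samples by their future 5-day return and use the Series quantiles at 0.2, 0.4, 0.6, and 0.8 to split the whole series into 5 parts, labeled as the five classes 0-4.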

import numpy as np

# Label the data automatically
def auto_labeler(df,label,hold_days):
    if label != 'return':
        raise ValueError('unsupported label: ' + str(label))

    # Forward return over the next hold_days rows; the last hold_days rows are NaN.
    label_name = 'label_return_'+str(hold_days)
    df[label_name] = df['close'].shift(-hold_days)/df['close'] - 1

    rank20 = df[label_name].quantile(0.2)
    rank40 = df[label_name].quantile(0.4)
    rank60 = df[label_name].quantile(0.6)
    rank80 = df[label_name].quantile(0.8)
    # <= keeps a value exactly equal to the 0.2 quantile in class 0.
    # Rows whose forward return is NaN keep the label None.
    df['label'] = np.where(df[label_name] <= rank20, 0, None)
    df['label'] = np.where(df[label_name] > rank20, 1, df['label'])
    df['label'] = np.where(df[label_name] > rank40, 2, df['label'])
    df['label'] = np.where(df[label_name] > rank60, 3, df['label'])
    df['label'] = np.where(df[label_name] > rank80, 4, df['label'])
    return df
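
As an alternative to the chain of `np.where` calls, pandas ships `pd.qcut`, which does equal-frequency bucketing in one call. This is a sketch of the same five-way split, not the code used in this post. The helper name `quantile_labels` is hypothetical, and edge handling at the quantile boundaries may differ slightly from `auto_labeler`. Note that `pd.qcut` raises a `ValueError` when two quantile edges coincide, unless `duplicates='drop'` is passed.

```python
import pandas as pd

def quantile_labels(future_return, n_classes=5):
    """Bucket a return Series into equal-frequency classes 0..n_classes-1."""
    # labels=False returns integer bin codes; NaN inputs stay NaN.
    return pd.qcut(future_return, q=n_classes, labels=False)

# Made-up forward returns; the trailing None stands for a row with no future price.
returns = pd.Series([-0.03, -0.02, -0.01, 0.0, 0.01,
                     0.02, 0.03, 0.04, 0.05, 0.06, None])
labels = quantile_labels(returns)
print(labels.tolist())
```

Each class receives two of the ten valid samples, and the missing value stays unlabeled.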

To call it, first extract the basic features, then label the data on top of them.

start = datetime(2017,1,1)
end = datetime(2017,1,31)
df = feature_extractor('600519',features,start_date=start,end_date=end)
df = auto_labeler(df,'return',5)
print(df.head(10))

The result is as follows:

              open    high     low   close    code  return_0  return_4
date
2017-01-03  334.28  337.00  332.81  334.56  600519       NaN       NaN   
2017-01-04  334.62  352.17  334.60  351.91  600519  0.051859       NaN   
2017-01-05  350.00  351.45  345.44  346.74  600519 -0.014691       NaN   
2017-01-06  346.64  359.78  346.10  350.76  600519  0.011594       NaN   
2017-01-09  347.80  352.88  346.54  348.51  600519 -0.006415       NaN   
2017-01-10  348.45  352.00  346.60  349.00  600519  0.001406  0.043161   
2017-01-11  348.00  348.00  343.50  345.45  600519 -0.010172 -0.018357   
2017-01-12  346.55  347.40  344.51  347.05  600519  0.004632  0.000894   
2017-01-13  346.98  347.39  343.88  344.87  600519 -0.006282 -0.016792   
2017-01-16  344.13  344.80  338.80  341.47  600519 -0.009859 -0.020200   

            label_return_5 label  
date                              
2017-01-03        0.043161     4  
2017-01-04       -0.018357     1  
2017-01-05        0.000894     2  
2017-01-06       -0.016792     1  
2017-01-09       -0.020200     0  
2017-01-10        0.000372     2  
2017-01-11        0.027877     3  
2017-01-12        0.022101     3  
2017-01-13        0.029344     4  
2017-01-16        0.028553     4
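
Rows at the start of the frame have NaN features and rows at the end have no label, so a model cannot train on them directly. One possible clean-up step is sketched below on a made-up frame that only mimics the `return_0` and `label` columns. The variable names `feature_cols`, `X`, and `y` are illustrative assumptions.

```python
import pandas as pd

# Toy frame mimicking the columns produced above (values are made up).
df = pd.DataFrame({
    'return_0': [None, 0.01, -0.02, 0.03, 0.00, 0.01],
    'label':    [4, 1, 2, 0, None, None],   # last rows: no future prices yet
})

feature_cols = ['return_0']
# Keep only rows where every feature and the label are present.
clean = df.dropna(subset=feature_cols + ['label'])
X = clean[feature_cols]
y = clean['label'].astype(int)
print(X.shape, y.tolist())
```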

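About the author: Wei Jiabin (魏佳斌), internet product/technology director, MBA from the Guanghua School of Management at Peking University, Chartered Financial Analyst (CFA), senior product manager and coder. Fond of Python; closely follows internet trends, artificial intelligence, and AI-driven quantitative finance. Committed to using the most advanced cognitive technologies to understand this complex world. AI quant open-source project: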
