How to avoid producing a large number of small files in Pig

Reference link: How do I force PigStorage to output a few large files instead of thousands of tiny files?

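Pig creates one mapper per input file (more precisely, per input split, so a small file gets a mapper of its own). If the job has no reduce phase, each mapper writes its own output file, so there are as many output files as input files ("If you have thousands of input files, you have thousands of output files.").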

To merge the output files, you can set the following parameters so that small input files are combined (official documentation: http://pig.apache.org/docs/latest/perf.html#combine-files):

pig.maxCombinedSplitSize: Specifies the size, in bytes, of data to be processed by a single map. Smaller files are combined until this size is reached.
pig.splitCombination: Turns combine split files on or off (set to "true" by default).

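However, this method is only guaranteed to work when the data is loaded with PigStorage; according to the Pig documentation, custom loaders have to meet additional requirements before their splits can be combined.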
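As a minimal sketch of the settings above, assuming a map-only copy job: the paths 'input_dir' and 'output_dir' and the 256 MB target are hypothetical values chosen for illustration, not values from the referenced answer.

```pig
-- Split combination is on by default; it is set here only to make the intent explicit.
SET pig.splitCombination 'true';
-- Let a single map process up to 256 MB (268435456 bytes) of combined small files.
SET pig.maxCombinedSplitSize 268435456;

-- 'input_dir' and 'output_dir' are hypothetical paths.
raw = LOAD 'input_dir' USING PigStorage('\t');
STORE raw INTO 'output_dir' USING PigStorage('\t');
```

With these settings each mapper should read several small files, so the map-only job writes roughly (total input size / 256 MB) output files instead of one per input file.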

Alternatively, you can use an operation that runs a reduce phase to merge the small files.

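The Pig operators that trigger a reduce phase are COGROUP, CROSS, DISTINCT, GROUP, JOIN (inner), JOIN (outer), and ORDER BY.
Add PARALLEL num to a statement that contains one of these operators; num is then the number of final output files. (If the data volume is large and num is too small, the job may fail at run time; this happened when I ran it on my company's cluster.)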
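The PARALLEL approach can be sketched as follows. This is only an illustration under stated assumptions: the paths, the two-column schema, and the choice of 10 reducers are hypothetical, and GROUP followed by FLATTEN is just one way to pass the records through a reduce phase unchanged.

```pig
-- 'input_dir', 'output_dir' and the (key, value) schema are hypothetical.
raw = LOAD 'input_dir' USING PigStorage('\t') AS (key:chararray, value:chararray);

-- GROUP triggers a reduce phase; PARALLEL 10 requests 10 reducers,
-- so the job writes at most 10 part files.
grouped = GROUP raw BY key PARALLEL 10;

-- Unnest the grouped bags to get the original (key, value) records back.
flat = FOREACH grouped GENERATE FLATTEN(raw);

STORE flat INTO 'output_dir' USING PigStorage('\t');
```

The extra shuffle costs time, and a heavily skewed key can overload a single reducer, so the number after PARALLEL should be chosen with the data volume in mind.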

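With GROUP ALL the number of reducers is fixed at 1 and PARALLEL has no effect, so the job may fail when the data volume is large. For a workaround, see the article "Pig 处理大量的小文件" (handling a large number of small files in Pig)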

or the answers at the link given at the top of this post.

P.S.:

1. Pig's default reducer settings:

2. PARALLEL only affects the number of reducers; the number of mappers is determined by the input files.

The parallel features only affect the number of reduce tasks. Map parallelism is determined by the input file, one map for each HDFS block.
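For the reducer settings mentioned in point 1, the following is a hedged sketch based on the properties described in the Pig performance documentation, not a reconstruction of what the original post showed. The value 20 is hypothetical; the other two values are the documented defaults as far as I know.

```pig
-- Script-wide default number of reducers, used by every reduce-side
-- operator that has no explicit PARALLEL clause (20 is a hypothetical value).
SET default_parallel 20;

-- When neither PARALLEL nor default_parallel is set, Pig estimates the count as
--   MIN(pig.exec.reducers.max, total input size in bytes / bytes per reducer).
SET pig.exec.reducers.bytes.per.reducer 1000000000;
SET pig.exec.reducers.max 999;
```

An explicit PARALLEL clause on a statement takes precedence over default_parallel, and the estimate is only used when neither is given.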
