The reshape command: everything in one article!

1. The sreshape command: a faster reshape
2. The fastreshape command: efficient reshape for large datasets
3. The xpose command: Stata's official data transposition command
4. References

Authors: Hu Shiliang (Hohai University); Liu Xinyan (The Chinese University of Hong Kong)
E-mail: hushiliang2018@hhu.edu.cn; liuxinyan@link.cuhk.edu.hk

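This article picks up from the first half, "Data processing: the reshape command in one article", written by He Qinghong, and is still being refined.

In what follows, Hu Shiliang and I introduce several other commands: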

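1. The sreshape command: a faster reshape

We start with installing sreshape. This user-written command cannot be installed with the usual ssc install. Instead, type findit reshape in the Stata Command window, locate the sreshape entry in the search results, and click its install link.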

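How does sreshape differ from reshape? According to its documentation, the two commands are nearly identical; the advantages of sreshape are that it is "speedier" and "sparser", which is also where the "s" prefix comes from. On the first point, the documentation suggests trying sreshape when you work with a large dataset and reshape keeps you waiting: it can be roughly 4 to 5 times faster. We leave it to readers to try this themselves (the basic syntax is the same as that of reshape).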
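Readers who want to try the speed comparison could start from a sketch like the one below. It is only an illustration: the simulated dataset (100,000 ids and 20 yearly columns), the seed, and the timer numbers are assumptions chosen for this example, and sreshape must already be installed. Timings will differ across machines and Stata versions.

```stata
* Sketch: time reshape and sreshape on the same simulated wide dataset
clear
set seed 12345
set obs 100000
generate long id = _n
forvalues t = 1990/2009 {
    generate double y`t' = runiform()
}

timer clear

* official reshape
preserve
timer on 1
reshape long y, i(id) j(year)
timer off 1
restore

* sreshape (user-written; assumed to be installed)
preserve
timer on 2
sreshape long y, i(id) j(year)
timer off 2
restore

timer list   // timer 1 = reshape, timer 2 = sreshape
```

Only the reshape call sits between timer on and timer off, so the timings are not mixed with the cost of listing or displaying results.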

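The second advantage of sreshape is that it handles sparser data: when your data contain many missing values, sreshape is worth considering. Here is an example. We first look at the result from the ordinary reshape: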

clear
input id y1990  y1991  y1992
      1  55  .   33
      2   .  26   .
      3  78  52   .
      4  23  .   25
end

reshape long y, i(id) j(year)

list, clean noobs // the result is as follows
id   year    y  
 1   1990   55  
 1   1991    .  
 1   1992   33  
 2   1990    .  
 2   1991   26  
 2   1992    .  
 3   1990   78  
 3   1991   52  
 3   1992    .  
 4   1990   23  
 4   1991    .  
 4   1992   25

Converting to long form leaves many missing values. Now let us see what sreshape gives:

sreshape long y, i(id) j(year) missing(drop)

list, clean noobs // the result is as follows
id	year	y
1	1990	55
1	1992	33
2	1991	26
3	1990	78
3	1991	52
4	1990	23
4	1992	25

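As we can see, sreshape offers a way to drop missing values during the reshape. It also provides the missing(drop all) and widevars(nonmissing) options, which likewise discard missing values quickly and make the reshape more efficient. We therefore suggest considering sreshape when you work with large datasets that contain many missing values.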
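For a single stub variable such as y, the result of missing(drop) shown above can also be reproduced with official commands alone: reshape first, then drop the rows whose y is missing. The sketch below reuses the example data. It illustrates the outcome only, not how sreshape works internally.

```stata
clear
input id y1990 y1991 y1992
1 55  . 33
2  . 26  .
3 78 52  .
4 23  . 25
end

reshape long y, i(id) j(year)
drop if missing(y)      // remove rows that carry no value for y
list, clean noobs
```

Seven observations remain, the same rows as in the sreshape output.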

2. The fastreshape command: efficient reshape for large datasets

First, fastreshape is a user-written command; install it by typing ssc install fastreshape, replace in the Command window.

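What advantage does fastreshape have over reshape? As the name suggests, it processes data faster. According to its documentation, fastreshape provides a faster way to reshape large datasets.

The syntax of fastreshape is essentially the same as that of reshape, but it adds an advanced option, fast, which speeds up the reshape by not sorting the reshaped dataset. The fast option is off by default. Sorting the reshaped dataset is time-consuming and, in many cases, unnecessary.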

Here we demonstrate with the official example dataset reshape1.dta:


cls
clear all

webuse "reshape1.dta", clear

list, clean noobs // produces the following result

/*
id   sex   inc80   inc81   inc82   ue80   ue81   ue82  
 1     0    5000    5500    6000      0      1      0  
 2     1    2000    2200    3300      1      0      0  
 3     0    3000    2000    1000      0      0      1  
*/ 


* (1) Using the reshape command

preserve

timer clear 1
timer on 1

reshape long inc ue, i(id) j(year)

list, clean noobs   // produces the following result

/*
id   year   sex    inc   ue  
 1     80     0   5000    0  
 1     81     0   5500    1  
 1     82     0   6000    0  
 2     80     1   2000    1  
 2     81     1   2200    0  
 2     82     1   3300    0  
 3     80     0   3000    0  
 3     81     0   2000    0  
 3     82     0   1000    1  
*/

timer off 1

timer list 1

global t1 = r(t1)

disp in y `"reshape took $t1 seconds"'

restore


* (2) Using the sreshape command

preserve

timer clear 2
timer on 2

sreshape long inc ue, i(id) j(year)

list, clean noobs   // produces the following result

/*
id   year   sex    inc   ue  
 1     80     0   5000    0  
 1     81     0   5500    1  
 1     82     0   6000    0  
 2     80     1   2000    1  
 2     81     1   2200    0  
 2     82     1   3300    0  
 3     80     0   3000    0  
 3     81     0   2000    0  
 3     82     0   1000    1  
*/

timer off 2

timer list 2

global t2 = r(t2)

disp in y `"sreshape took $t2 seconds"'

restore


* (3) Using the fastreshape command without the fast option

preserve

timer clear 3
timer on 3

fastreshape long inc ue, i(id) j(year)

list, clean noobs   // produces the following result

/*
id   year   sex    inc   ue  
 1     80     0   5000    0  
 1     81     0   5500    1  
 1     82     0   6000    0  
 2     80     1   2000    1  
 2     81     1   2200    0  
 2     82     1   3300    0  
 3     80     0   3000    0  
 3     81     0   2000    0  
 3     82     0   1000    1  
*/

timer off 3

timer list 3

global t3 = r(t3)

disp in y `"fastreshape without the fast option took $t3 seconds"'

restore


* (4) Using the fastreshape command with the fast option

preserve

timer clear 4
timer on 4

fastreshape long inc ue, i(id) j(year) fast

list, clean noobs   // produces the following result

/*
id   year   sex    inc   ue  
 1     80     0   5000    0  
 2     80     1   2000    1  
 3     80     0   3000    0  
 1     81     0   5500    1  
 2     81     1   2200    0  
 3     81     0   2000    0  
 1     82     0   6000    0  
 2     82     1   3300    0  
 3     82     0   1000    1  

*/

timer off 4

timer list 4

global t4 = r(t4)

disp in y `"fastreshape with the fast option took $t4 seconds"'

restore


* Summary of the time each command took

disp in y `"reshape took $t1 seconds"'
disp in y `"sreshape took $t2 seconds"'
disp in y `"fastreshape without the fast option took $t3 seconds"'
disp in y `"fastreshape with the fast option took $t4 seconds"'

/*
reshape took .232 seconds
sreshape took .134 seconds
fastreshape without the fast option took .139 seconds
fastreshape with the fast option took .138 seconds
*/

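In terms of speed, the ranking in this run is: sreshape is fastest, followed by fastreshape with the fast option, then fastreshape without the fast option, and the traditional reshape is slowest. Note that each timed block also includes the list call, and the three user-written variants are only a few milliseconds apart, so this single run on three observations says little about their relative order.

Because reshape1.dta is very small, fastreshape with the fast option shows no clear advantage over fastreshape without it. For very large datasets, however, fastreshape with the fast option is the better fit.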
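As output (4) shows, the fast option leaves the long data ordered by year and then id, not by id and then year. If a later step needs the usual order, one option is to sort explicitly afterwards, as in this sketch. It assumes fastreshape is installed; sort and isid are official commands.

```stata
webuse "reshape1.dta", clear
fastreshape long inc ue, i(id) j(year) fast

sort id year      // restore the order shown in output (1)
isid id year      // check that id and year uniquely identify each row
list, clean noobs
```

Whether the fast option still pays off once this sort is added depends on the data, so it is most useful when the row order does not matter downstream.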

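3. The xpose command: Stata's official data transposition command

Here we also include xpose, Stata's official data transposition command, in this post and introduce it briefly.

Transposing data means swapping its rows and columns, as shown below: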


/*     -- d_long.txt --
year	investment	income	consumption
2000	317	      766	      679
2001	314	      779	      686
2002	306	      808	      697
2003	304	      785	      688
2004	292	      794	      704
2005	275	      799	      699
                   
       -- d_wide.txt -- 
2000	2001	2002	2003	2004	2005
 317	 314	 306	 304	 292	 275
 766	 779	 808	 785	 794	 799
 679	 686	 697	 688	 704	 699
*/

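Typing help xpose shows that the syntax of xpose is:

xpose, clear [options]

The clear option is required. It serves as a reminder that the untransposed dataset in memory will be lost unless it has been saved beforehand.

The other options are:

Option	Meaning
format	use the largest numeric display format from the untransposed data
format(%fmt)	apply the specified format to all variables in the transposed data
varname	add the variable _varname, which contains the original variable names
promote	use the most compact data type that preserves numeric accuracy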
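To see what promote changes, consider integer values that the float type cannot hold exactly. The sketch below uses two made-up long variables, code1 and code2 (hypothetical names and values). Without promote, the transposed variables take the default storage type (float, unless changed with set type), so comparing the describe output of the two calls should show the difference.

```stata
clear
input long code1 long code2
123456789 987654321
end

preserve
xpose, clear             // no promote: default storage type
describe
restore

xpose, clear promote     // promote: compact type that preserves accuracy
describe
list, clean noobs
```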

Next, we demonstrate with xposexmpl.dta, an example dataset that Stata provides online.


clear all

webuse "xposexmpl.dta" , clear // load the data

list, clean noobs   // the data before transposing

/*
    county   year1   year2   year3  
         1    57.2    11.3    19.5  
         2    12.5     8.2    28.9  
         3      18    14.2    33.2 
*/

xpose, clear varname // transpose and add a variable holding the original variable names

list, clean noobs   // the data after transposing

/*
           v1          v2          v3   _varname  
            1           2           3     county  
    57.200001        12.5          18      year1  
         11.3   8.1999998        14.2      year2  
         19.5        28.9   33.200001      year3  
*/

xpose, clear // restore the original data

list, clean noobs   // the restored data; note that the display format has changed

/*
    county       year1       year2       year3  
         1   57.200001        11.3        19.5  
         2        12.5   8.1999998        28.9  
         3          18        14.2   33.200001 
*/

xpose, clear varname format(%6.2f) // transpose again, add the original variable names, and set the display format

list, clean noobs   // the result of transposing again

/*
       v1      v2      v3   _varname  
     1.00    2.00    3.00     county  
    57.20   12.50   18.00      year1  
    11.30    8.20   14.20      year2  
    19.50   28.90   33.20      year3  
*/

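Note that xpose can only transpose numeric variables. To transpose string variables, use the user-written command sxpose, whose basic usage is similar to that of xpose. Install it with ssc install sxpose, replace, then type help sxpose to explore it.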
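A minimal sketch of sxpose on an all-string dataset follows. The variables name and city and their values are invented for illustration, and the call assumes sxpose has been installed from SSC. Only the clear option is used here; see help sxpose for the other options.

```stata
clear
input str5 name str5 city
"Ann" "Paris"
"Bob" "Rome"
end

sxpose, clear      // transpose the string variables; replaces the data in memory
list, clean noobs
```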

4. References

Mitchell, M. N. (2010). Data management using Stata: A practical handbook. College Station, TX: Stata Press.
Baum, C. F., & Cox, N. J. (2007). Stata tip 45: Getting those data into shape. The Stata Journal, 7(2), 268-271.
Simons, K. L. (2016). A sparser, speedier reshape. The Stata Journal, 16(3), 632-649.
Newson, R. (2012). XREWIDE: Stata module to extend reshape wide command.
