When a Crawler Meets JavaScript

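1 Preface

This time my goal is to write a crawler that fetches the Remic Prospectuses files published each month on the GNMA website. The URL for a specific year and month is: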

http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/prospectuses/Pages/remic_prospectuses.aspx?YearDropDown=2013&MonthDropDown=March

If the listing spans several pages, the other pages only become visible after a click, which makes JavaScript regenerate the page. For example:

javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');

The program should accept year and month parameters, fetch the Remic list automatically, and also handle pages that only appear after a click.

2 Analyzing the target page

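The target page URL follows a regular pattern: the year is a number matching \d+, and the month is the full English month name with its first letter capitalized.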
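The URL pattern described above can be sketched in Python. This is an illustration only: the names BASE_URL, MONTHS and build_url are hypothetical and not taken from grspider.py, and the base URL is simply the example URL with its query string removed.

```python
import re
from urllib.parse import urlencode

# Base address taken from the example URL in this article (query string removed).
BASE_URL = ("http://www.ginniemae.gov/doing_business_with_ginniemae/"
            "investor_resources/prospectuses/Pages/remic_prospectuses.aspx")

MONTHS = ("January", "February", "March", "April", "May", "June", "July",
          "August", "September", "October", "November", "December")


def build_url(year, month):
    """Return the listing URL for a year (digits) and a full English month name."""
    year = str(year)
    if not re.fullmatch(r"\d+", year):
        raise ValueError("year must consist of digits only: %r" % year)
    month = month.capitalize()  # "march" -> "March"
    if month not in MONTHS:
        raise ValueError("month must be a full English month name: %r" % month)
    query = urlencode({"YearDropDown": year, "MonthDropDown": month})
    return BASE_URL + "?" + query
```

Calling build_url(2013, "march") reproduces the example URL for March 2013.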

On a given page, the source for a Remic file entry looks like this:

<a href="/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2013Mar21-037.pdf" target="_blank">2013-037 - Dated March 21, 2013</a>
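One way to turn such anchors into an identifier-to-URL mapping is a regular expression plus urljoin. The sketch below assumes that every entry follows the "NNNN-XXX - Dated ..." link text shown above. LINK_RE, PAGE_URL and extract_remics are hypothetical names, not the author's code.

```python
import re
from urllib.parse import urljoin

# Page address used to resolve the site-relative href values.
PAGE_URL = ("http://www.ginniemae.gov/doing_business_with_ginniemae/"
            "investor_resources/prospectuses/Pages/remic_prospectuses.aspx")

# Captures the PDF path and the identifier that precedes " - Dated" in the link text.
LINK_RE = re.compile(
    r'<a\s+href="(?P<href>[^"]+\.pdf)"[^>]*>\s*(?P<id>\d{4}-\w+)\s+-\s+Dated',
    re.IGNORECASE)


def extract_remics(page_html, page_url=PAGE_URL):
    """Map each Remic identifier found in the HTML to its absolute PDF URL."""
    return {m.group("id"): urljoin(page_url, m.group("href"))
            for m in LINK_RE.finditer(page_html)}
```

A regular expression is enough here because the anchors are machine-generated and uniform. A real HTML parser would be the safer choice if the markup varied.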

The corresponding JavaScript paging links, once tidied up, look like this:

<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={11};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={1};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={21};dvt_startposition={}');">
<a href="javascript: __doPostBack('ctl00$m$g_01259610_86e5_4c21_9a20_5a2b7d94350f','dvt_firstrow={31};dvt_startposition={}');">
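The two arguments of each __doPostBack call are what the crawler needs for paging. The following sketch pulls them out of the page source, assuming the calls always use single-quoted string arguments as in the links above. find_postbacks and first_row are hypothetical helper names.

```python
import html
import re

# Matches __doPostBack('target','argument') with single-quoted arguments.
POSTBACK_RE = re.compile(r"__doPostBack\('([^']*)'\s*,\s*'([^']*)'\)")
FIRSTROW_RE = re.compile(r"dvt_firstrow=\{(\d+)\}")


def find_postbacks(page_html):
    """Return the distinct (event target, event argument) pairs, in page order."""
    text = html.unescape(page_html)  # in case quotes are entity-encoded in attributes
    calls = []
    for target, argument in POSTBACK_RE.findall(text):
        if (target, argument) not in calls:
            calls.append((target, argument))
    return calls


def first_row(argument):
    """Return the row number inside dvt_firstrow={N}, or None if it is absent."""
    match = FIRSTROW_RE.search(argument)
    return int(match.group(1)) if match else None
```

Sorting the pairs with key=lambda call: first_row(call[1]) visits the pages in row order (1, 11, 21, 31 for the links shown).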

Fortunately, the definition of __doPostBack can be found in the page source:

<script type="text/javascript">
//<![CDATA[
var theForm = document.forms['aspnetForm'];
if (!theForm) {
    theForm = document.aspnetForm;
}
function __doPostBack(eventTarget, eventArgument) {
    if (!theForm.onsubmit || (theForm.onsubmit() != false)) {
        theForm.__EVENTTARGET.value = eventTarget;
        theForm.__EVENTARGUMENT.value = eventArgument;
        theForm.submit();
    }
}
//]]>
</script>

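In other words, __doPostBack only fills in the form's __EVENTTARGET and __EVENTARGUMENT fields and submits the aspnetForm form. So the crawler has to read the form's fields from the page and submit them itself with a POST request.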
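A minimal sketch of building such a POST body with the Python standard library follows. It assumes that the state the server expects (for example __VIEWSTATE, a field name that does not appear in this article) lives in hidden input elements of the page. HiddenInputParser and build_postback_data are hypothetical names.

```python
from html.parser import HTMLParser
from urllib.parse import urlencode


class HiddenInputParser(HTMLParser):
    """Collect name/value pairs of every <input type="hidden"> in a page."""

    def __init__(self):
        super().__init__()
        self.fields = {}

    def handle_starttag(self, tag, attrs):
        if tag != "input":
            return
        attrs = dict(attrs)
        if (attrs.get("type") or "").lower() == "hidden" and attrs.get("name"):
            self.fields[attrs["name"]] = attrs.get("value") or ""


def build_postback_data(page_html, event_target, event_argument):
    """Return the urlencoded POST body that mimics __doPostBack(target, argument)."""
    parser = HiddenInputParser()
    parser.feed(page_html)
    fields = dict(parser.fields)
    # Same two assignments that the page's __doPostBack performs before submit().
    fields["__EVENTTARGET"] = event_target
    fields["__EVENTARGUMENT"] = event_argument
    return urlencode(fields).encode("ascii")
```

The sketch collects hidden inputs from the whole page rather than from aspnetForm alone, which is a simplification that holds only if the page has a single form.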

3 grspider.py

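Firebug can produce a cURL command line that downloads the corresponding page, but working that way gets messy, so the script simulates the form's POST submission instead.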
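As an illustration of the POST step only (not the author's grspider.py), here is a minimal sketch using urllib.request from the Python 3 standard library. The function name post_form and its opener parameter are assumptions. The opener defaults to urllib.request.urlopen and exists so the function can be exercised without network access.

```python
import urllib.request


def post_form(url, body, opener=urllib.request.urlopen, timeout=30):
    """POST an already-encoded form body (bytes) and return the response HTML."""
    # Passing data= makes urllib send the request as POST rather than GET.
    request = urllib.request.Request(
        url,
        data=body,
        headers={"Content-Type": "application/x-www-form-urlencoded"},
    )
    with opener(request, timeout=timeout) as response:
        charset = response.headers.get_content_charset() or "utf-8"
        return response.read().decode(charset, errors="replace")
```

Each paging link then costs one post_form call to the same page URL, with a body that carries the form fields plus that link's event target and event argument.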

Run:

$ python grspider.py -y 2014
http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/prospectuses/Pages/remic_prospectuses.aspx?YearDropDown=2014&MonthDropDown=Month
$ cat gnma_remic.json
{
    "2014-001": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001.pdf",
    "2014-001O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-001O.pdf",
    "2014-002": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002.pdf",
    "2014-002O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-002O.pdf",
    "2014-003": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-003.pdf",
    "2014-003O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-003O.pdf",
    "2014-004": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Jan23-004.pdf",
    "2014-004O": "http://www.ginniemae.gov/doing_business_with_ginniemae/investor_resources/Prospectuses/ProspectusesLib/2014Feb24-004O.pdf",
    ....
}
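In the run above only -y is passed and the printed URL carries MonthDropDown=Month, which suggests that the month falls back to the placeholder value Month when it is omitted. Below is a minimal sketch of a command line and JSON output consistent with that run. The -m option, the long option names and the function names parse_args and to_json are assumptions, not taken from grspider.py.

```python
import argparse
import json


def parse_args(argv=None):
    """Parse the year (required) and month (optional) options."""
    parser = argparse.ArgumentParser(description="Fetch the GNMA Remic prospectus list.")
    parser.add_argument("-y", "--year", required=True, type=int,
                        help="year, e.g. 2014")
    # Assumed default: the placeholder seen in the URL printed by the sample run.
    parser.add_argument("-m", "--month", default="Month",
                        help="full English month name, e.g. March")
    return parser.parse_args(argv)


def to_json(remics):
    """Serialize the identifier-to-URL mapping, sorted by key, 4-space indented."""
    return json.dumps(remics, indent=4, sort_keys=True)
```

Writing to_json(remics) to gnma_remic.json would give a file laid out like the one shown by cat.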

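4 Afterword

For pages driven by JavaScript, it is still best to inspect the requests with Firebug or a similar tool, work out the pattern, and then simulate the form submission.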
