
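I take the subway to work every day, and there is no mobile signal at all underground. I still want to read some news during the ride, so I wrote the news scraper below.

I never intended to build a polished application, so I only finished a prototype that covers my most basic needs. The idea is simple:

Find a news source;
Fetch the news with Python;
Parse the HTML with BeautifulSoup and extract the content;
Convert it into an easy-to-read format and send it by email.

The sections below describe each part in detail.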

▌News source: Reddit

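Reddit lets users submit news links and vote on them, which makes it a good news source. The next question is: how do we get the most popular news of the day?

Before thinking about scraping, we should check whether the target site offers an API. Using an API is entirely legitimate and, more importantly, it returns machine-readable data, so there is no HTML to parse.

Fortunately, Reddit provides an API. In the API list (https://www.reddit.com/dev/api/) we can find the feature we need: /top. It returns the most popular posts on Reddit or on a given subreddit.

The next question is: how do we use this API?

After reading Reddit's documentation (https://github.com/reddit-archive/reddit/wiki/OAuth2) carefully, I found the most efficient way to use it.

Step one: create an application on Reddit. After logging in, go to the "preferences → apps" page; at the bottom there is a button named "create another app...". Click it and create an application of type "script". We do not need to provide an "about url" or a "redirect url", because this application is not open to the public and nobody else will use it.

Once the application is created, its App ID and Secret are shown in the application details.

The next question is how to use the App ID and Secret. Since we only need the most popular posts from a given subreddit and never touch user-related information, in theory we should not have to supply any personal information such as a username or password. Reddit offers "Application Only OAuth" (https://github.com/reddit-archive/reddit/wiki/OAuth2#application-only-oauth), which lets an application access public information anonymously. Run the following command: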

$ curl -X POST -H 'User-Agent: myawesomeapp/1.0' -d grant_type=client_credentials --user 'OUR_CLIENT_ID:OUR_CLIENT_SECRET' https://www.reddit.com/api/v1/access_token

The command returns an access token as JSON:

{"access_token": "ABCDEFabcdef0123456789", "token_type": "bearer", "expires_in": 3600, "scope": "*"}

Great! With the access token in hand, we can get to work.
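For readers who prefer Python to curl, the same token request can be sketched with the standard library alone. This is a minimal sketch, not the article's own code: the function names `build_token_request` and `fetch_access_token` are illustrative, and the client ID and secret are placeholders for the values from the app page.

```python
import base64
import json
import urllib.parse
import urllib.request

TOKEN_URL = 'https://www.reddit.com/api/v1/access_token'


def build_token_request(client_id, client_secret, user_agent='myawesomeapp/1.0'):
    """Build (but do not send) the POST request that the curl command issues."""
    # curl's --user option sends the credentials as HTTP Basic authentication.
    credentials = f'{client_id}:{client_secret}'.encode('utf-8')
    headers = {
        'Authorization': 'Basic ' + base64.b64encode(credentials).decode('ascii'),
        'User-Agent': user_agent,
    }
    # curl's -d option sends a form-encoded request body.
    body = urllib.parse.urlencode({'grant_type': 'client_credentials'}).encode('ascii')
    return urllib.request.Request(TOKEN_URL, data=body, headers=headers, method='POST')


def fetch_access_token(client_id, client_secret):
    """Send the request and return the 'access_token' field of the JSON reply."""
    request = build_token_request(client_id, client_secret)
    with urllib.request.urlopen(request) as response:
        return json.load(response)['access_token']
```

Calling `fetch_access_token('OUR_CLIENT_ID', 'OUR_CLIENT_SECRET')` performs the network request; building the request separately makes it easy to inspect before anything is sent.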

Finally, if you would rather not write the API access code yourself, you can use the Python client PRAW: https://github.com/praw-dev/praw
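For comparison, hand-written access code for /top might look roughly like the sketch below. It assumes the token is sent as a bearer token to oauth.reddit.com and that the reply is a Reddit "Listing" object whose posts sit under `data.children`; the helper names `build_top_request` and `parse_listing` are illustrative and appear in no library.

```python
import urllib.parse
import urllib.request


def build_top_request(subreddit, token, limit=5, user_agent='myawesomeapp/1.0'):
    """Build (but do not send) a GET request for a subreddit's top posts."""
    query = urllib.parse.urlencode({'limit': limit})
    url = f'https://oauth.reddit.com/r/{subreddit}/top?{query}'
    headers = {'Authorization': f'bearer {token}', 'User-Agent': user_agent}
    return urllib.request.Request(url, headers=headers)


def parse_listing(payload):
    """Extract (score, title, url) tuples from a decoded Listing response."""
    return [(child['data']['score'], child['data']['title'], child['data']['url'])
            for child in payload['data']['children']]
```

The request can be sent with `urllib.request.urlopen` and the reply decoded with `json.load` before it is passed to `parse_listing`. PRAW hides all of this behind `reddit.subreddit(...).top(...)`.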

As a quick test with PRAW, fetch the 5 most popular posts from /r/Python:

>>> import praw
>>> import pprint
>>> reddit = praw.Reddit(client_id='OUR_CLIENT_ID',
...                      client_secret='OUR_SECRET',
...                      grant_type='client_credentials',
...                      user_agent='mytestscript/1.0')
>>> subs = reddit.subreddit('Python').top(limit=5)
>>> pprint.pprint([(s.score, s.title) for s in subs])
[(6555, 'Automate the boring stuff with python - tinder'),
 (4548,
  'MS is considering official Python integration with Excel, and is asking for '
  'input'),
 (4102, 'Python Cheet Sheet for begineers'),
 (3285,
  'We started late, but we managed to leave Python footprint on r/place!'),
 (2899, "Python Section at Foyle's, London")]

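It works!

▌Fetching the news pages

The next task is to fetch the news pages, which is quite simple. The previous step gives us Submission objects, whose url attribute is the address of the news item. We can also use the domain attribute to filter out URLs that belong to Reddit itself: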

subs = [sub for sub in subs if not sub.domain.startswith('self.')]
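The filter relies on Reddit's convention that text-only "self" posts report a domain of the form `self.<subreddit>`. The sketch below shows its effect on stand-in objects; `SimpleNamespace` is used here only to imitate the `domain` and `url` attributes of a Submission, and `drop_self_posts` is an illustrative name.

```python
from types import SimpleNamespace


def drop_self_posts(submissions):
    """Keep only submissions that link to an external site."""
    return [s for s in submissions if not s.domain.startswith('self.')]


# Stand-ins for PRAW Submission objects (hypothetical data).
posts = [
    SimpleNamespace(domain='self.Python', url='https://www.reddit.com/r/Python/comments/abc/'),
    SimpleNamespace(domain='thenextweb.com', url='https://thenextweb.com/some-article/'),
]
external = drop_self_posts(posts)
print([p.domain for p in external])
```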

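All we have to do is fetch that URL, which is easy with Requests:

import requests

for sub in subs:
    res = requests.get(sub.url)
    # keep only successful responses whose content type is HTML
    if (res.status_code == 200 and 'content-type' in res.headers and
            res.headers.get('content-type').startswith('text/html')):
        html = res.text

Here we skip any news URL whose content type is not text/html, because Reddit users sometimes submit links that point directly to images, and we do not want those.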
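The content-type test can be pulled out into a small helper, which makes it easy to check without touching the network. This is only a sketch: `is_html_response` is an illustrative name, and it assumes a plain dict with lowercase header names (the headers object returned by Requests is case-insensitive, so the same lookup works there).

```python
def is_html_response(status_code, headers):
    """Return True for a successful response whose body is an HTML page."""
    # The header often carries a charset suffix, e.g. 'text/html; charset=utf-8',
    # so only the beginning of the value is compared.
    content_type = headers.get('content-type', '')
    return status_code == 200 and content_type.lower().startswith('text/html')


print(is_html_response(200, {'content-type': 'text/html; charset=utf-8'}))
print(is_html_response(200, {'content-type': 'image/png'}))
```

With a Requests response this would be called as `is_html_response(res.status_code, res.headers)`.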

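▌Extracting the news content

The next step is to extract the content from the HTML. The goal is to extract the title and body of the article while ignoring everything we do not need to read, such as the header, footer and sidebar.

This is hard, and there is no perfect general solution. BeautifulSoup can extract the text for us, but it extracts the header and footer along with it.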

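Fortunately, I found that websites are structured much better today than they used to be. There are no table layouts and no legacy presentational markup; article pages clearly mark the title and each paragraph with <h1> and <p>. Most sites also put the title and the body in the same container element, like this: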

<header>Site Navigation</header>
<div id='#main'>
  <section>
    <h1 class='title'>Page Title</h1>
  </section>
  <section>
    <p>Paragraph 1</p>
    <p>Paragraph 2</p>
  </section>
</div>
<aside>Sidebar</aside>
<footer>Copyright...</footer>

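In this example the top-level <div id='#main'> is the container for the title and the body. So the body can be located with the following algorithm:

1. Find the <h1> and treat it as the title. For SEO reasons, a page usually has only one <h1>;
2. Find the parent of the <h1> and check whether that parent contains enough <p> elements;
3. Repeat step 2 on each successive parent until you find one that contains enough <p> elements, or until you reach the <body> element. If a parent with enough <p> elements is found, it is the body container. If <body> is reached before enough <p> elements are found, the page contains nothing worth reading.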

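This algorithm is very crude and takes no semantic information into account, but it works well enough. After all, when it fails we simply skip that article, and missing one article is no great loss. Of course, you could do better by also parsing semantic markup.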

With this algorithm the parsing code is easy to write:

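from bs4 import BeautifulSoup

def extract_content(text):
    # Wrapped in a function (the name is illustrative) so that `return None` is valid.
    soup = BeautifulSoup(text, 'html.parser')
    # find the article title; give up if the page has no <body> or no <h1>
    h1 = soup.body.find('h1') if soup.body else None
    if h1 is None:
        return None
    # find the common parent for <h1> and all <p>s.
    root = h1
    while root.name != 'body' and len(root.find_all('p')) < 5:
        root = root.parent
    if len(root.find_all('p')) < 5:
        return None
    # find all the content elements.
    ps = root.find_all(['h2', 'h3', 'h4', 'h5', 'h6', 'p', 'pre'])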

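Here I use len(root.find_all('p')) < 5 as the test for the body container: an element with fewer than five paragraphs is not treated as the article body.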
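To experiment with the threshold without installing anything, the same climb can be sketched on top of the standard library's `html.parser` instead of BeautifulSoup. This is a swapped-in, simplified stand-in and not the article's code: `Node`, `TreeBuilder`, `find_container` and the `min_paragraphs` parameter are all illustrative, and the tiny tree builder only copes with well-formed markup that has no void elements such as <br> or <img>.

```python
from html.parser import HTMLParser


class Node:
    """A minimal element node: tag name, parent link and children."""

    def __init__(self, name, parent=None):
        self.name = name
        self.parent = parent
        self.children = []

    def find_all(self, name):
        """Return all descendants with the given tag name, in document order."""
        found = []
        for child in self.children:
            if child.name == name:
                found.append(child)
            found.extend(child.find_all(name))
        return found


class TreeBuilder(HTMLParser):
    """Build a Node tree from well-formed HTML (void elements are not handled)."""

    def __init__(self):
        super().__init__()
        self.root = Node('[document]')
        self.current = self.root

    def handle_starttag(self, tag, attrs):
        node = Node(tag, self.current)
        self.current.children.append(node)
        self.current = node

    def handle_endtag(self, tag):
        if self.current.parent is not None:
            self.current = self.current.parent


def find_container(html, min_paragraphs=5):
    """Climb from the first <h1> to the nearest ancestor with enough <p> elements."""
    builder = TreeBuilder()
    builder.feed(html)
    builder.close()
    headings = builder.root.find_all('h1')
    if not headings:
        return None
    root = headings[0]
    while (root.name != 'body' and root.parent is not None
           and len(root.find_all('p')) < min_paragraphs):
        root = root.parent
    if len(root.find_all('p')) < min_paragraphs:
        return None
    return root


# The sample page from this section, wrapped in <html> and <body>.
SAMPLE_PAGE = (
    "<html><body><header>Site Navigation</header>"
    "<div id='#main'><section><h1 class='title'>Page Title</h1></section>"
    "<section><p>Paragraph 1</p><p>Paragraph 2</p></section></div>"
    "<aside>Sidebar</aside><footer>Copyright...</footer></body></html>"
)
container = find_container(SAMPLE_PAGE, min_paragraphs=2)
print(container.name)
```

With a threshold of 2 the climb stops at the <div>. With the default of 5 the same page is rejected, because even <body> holds only two paragraphs.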

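▌Converting to an easy-to-read format

The last step is to convert the extracted content into an easy-to-read format. I chose Markdown, though you could write a better converter.

In this example I only extract headings, <p> and <pre>, so a simple function is enough: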

def tag2md(tag):
    # Convert one extracted tag to Markdown; any other tag name yields None.
    if tag.name == 'p':
        return tag.text
    elif tag.name == 'h1':
        return f'{tag.text}\n{"=" * len(tag.text)}'
    elif tag.name == 'h2':
        return f'{tag.text}\n{"-" * len(tag.text)}'
    elif tag.name in ['h3', 'h4', 'h5', 'h6']:
        return f'{"#" * int(tag.name[1:])} {tag.text}'
    elif tag.name == 'pre':
        return f'```\n{tag.text}\n```'

# tag2md is defined first so that the calls below run top to bottom.
ps = root.find_all(['h2', 'h3', 'h4', 'h5', 'h6', 'p', 'pre'])
ps.insert(0, h1)  # add the title
content = [tag2md(p) for p in ps]
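To see what the converter produces, here is a self-contained sketch that feeds it a few stand-in tags. `SimpleNamespace` objects imitate only the `name` and `text` attributes of BeautifulSoup tags, the sample texts are made up, and the `FENCE` constant is an illustrative way of spelling the three backticks.

```python
from types import SimpleNamespace

FENCE = '`' * 3  # the Markdown code fence, three backticks


def tag2md(tag):
    """Convert one tag-like object to Markdown; unknown tag names yield None."""
    if tag.name == 'p':
        return tag.text
    elif tag.name == 'h1':
        return f'{tag.text}\n{"=" * len(tag.text)}'
    elif tag.name == 'h2':
        return f'{tag.text}\n{"-" * len(tag.text)}'
    elif tag.name in ['h3', 'h4', 'h5', 'h6']:
        return f'{"#" * int(tag.name[1:])} {tag.text}'
    elif tag.name == 'pre':
        return f'{FENCE}\n{tag.text}\n{FENCE}'


tags = [
    SimpleNamespace(name='h1', text='Page Title'),
    SimpleNamespace(name='p', text='Paragraph 1'),
    SimpleNamespace(name='h3', text='Usage'),
    SimpleNamespace(name='pre', text='x = 1'),
]
# Blank lines between blocks keep the Markdown paragraphs separate.
document = '\n\n'.join(tag2md(t) for t in tags)
print(document)
```

The <h1> and <h2> cases use Markdown's underlined heading style, while <h3> to <h6> use the `#` prefix style with one `#` per level.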

▌The complete code

I have shared the complete code on GitHub:

https://gist.github.com/charlee/bc865ba8aac295dd997691310514e515

It is exactly 100 lines. Let's give it a run:

Scraping /r/Python...
- Retrieving https://imgs.xkcd.com/comics/python_environment.png
x fail or not html
- Retrieving https://thenextweb.com/dd/2017/04/24/universities-finally-realize-java-bad-introductory-programming-language/#.tnw_PLAz3rbJ
=> done, title = 'Universities finally realize that Java is a bad introductory programming language'
- Retrieving https://github.com/numpy/numpy/blob/master/doc/neps/dropping-python2.7-proposal.rst
x fail or not html
- Retrieving http://www.thedurkweb.com/sms-spoofing-with-python-for-good-and-evil/
=> done, title = 'SMS Spoofing with Python for Good and Evil'
...

[Figure: the scraped news files]

The last thing to do is put this script on a server, set up a cron job to run it once a day, and have the generated files sent to my mailbox.
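The mailing step is not part of the shared script, so the following is only one possible sketch using the standard library. The addresses, the subject line and the SMTP host are hypothetical placeholders, and `build_digest_email` and `send_digest` are illustrative names; it assumes a mail server that accepts unauthenticated mail on the given host and port.

```python
import smtplib
from email.message import EmailMessage


def build_digest_email(markdown_text, sender, recipient, subject='Daily news digest'):
    """Wrap the generated Markdown in a plain-text email message."""
    msg = EmailMessage()
    msg['From'] = sender
    msg['To'] = recipient
    msg['Subject'] = subject
    msg.set_content(markdown_text)
    return msg


def send_digest(msg, host='localhost', port=25):
    """Hand the message to an SMTP server (host and port are assumptions)."""
    with smtplib.SMTP(host, port) as smtp:
        smtp.send_message(msg)


# Hypothetical addresses; nothing is sent until send_digest() is called.
message = build_digest_email('Page Title\n==========\n\nParagraph 1\n',
                             'scraper@example.com', 'me@example.com')
print(message['Subject'])
```

A daily cron job would then run the scraper and call `send_digest(message)` once the files have been generated.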

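I did not spend much time on the details, so the script still has plenty of room for improvement. If you are interested, you can keep adding features, such as extracting images.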
