A Simple Look at the Scrapy Crawler Framework

1. Simple configuration: fetching the content of a single page

(1) Create a Scrapy project

```bash
scrapy startproject getblog
```
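
This creates the standard project skeleton. The layout below is roughly what Scrapy 0.24 (the version used later in this post) generates; newer releases also add a middlewares.py:

```text
getblog/
    scrapy.cfg            # deploy configuration
    getblog/              # the project's Python package
        __init__.py
        items.py          # item definitions
        pipelines.py      # item pipelines
        settings.py       # project settings
        spiders/          # spider modules live here
            __init__.py
```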

(2) Edit items.py

```python
# -*- coding: utf-8 -*-

# Define here the models for your scraped items
#
# See documentation in:
# http://doc.scrapy.org/en/latest/topics/items.html

from scrapy.item import Item, Field


class BlogItem(Item):
    title = Field()
    desc = Field()
```
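
Item subclasses behave much like dictionaries, so the spider below fills in the declared fields by key. A quick illustration (not part of the original post):

```python
item = BlogItem()
item['title'] = ['Some post title']   # only declared fields ('title', 'desc') are accepted
item['desc'] = ['A short summary']
print(dict(item))                     # {'title': ['Some post title'], 'desc': ['A short summary']}
```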

(3) In the spiders folder, create blog_spider.py

You need some familiarity with XPath selectors. They feel a lot like jQuery selectors, although not quite as comfortable to use (the W3Schools XPath tutorial covers the syntax).
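
For orientation, here is a rough side-by-side of a few XPath expressions and their jQuery counterparts, run through Scrapy's Selector; this is an illustrative sketch with made-up sample HTML, not part of the original post:

```python
from scrapy.selector import Selector

html = ('<div class="post_item"><div><h3><a href="/p/1">Hello</a></h3>'
        '<p class="post_item_summary">Hi</p></div></div>')
sel = Selector(text=html)

# XPath                                                     rough jQuery analogue
print(sel.xpath('//h3/a/text()').extract())                 # $('h3 a').text()
print(sel.xpath('//a/@href').extract())                     # $('a').attr('href')
print(sel.xpath('//p[@class="post_item_summary"]/text()').extract())  # $('p.post_item_summary').text()
```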

```python
# coding=utf-8
from scrapy.spider import Spider
from scrapy.selector import Selector

from getblog.items import BlogItem


class BlogSpider(Spider):
    # Name used to identify the spider with "scrapy crawl"
    name = 'blog'
    # Start URL(s)
    start_urls = ['http://www.cnblogs.com/']

    def parse(self, response):
        sel = Selector(response)  # XPath selector
        # Select the second inner div of every div whose class attribute is "post_item"
        sites = sel.xpath('//div[@class="post_item"]/div[2]')
        items = []
        for site in sites:
            item = BlogItem()
            # Text content of the a tag inside the h3 tag
            item['title'] = site.xpath('h3/a/text()').extract()
            # Likewise, the text content of the p tag with class "post_item_summary"
            item['desc'] = site.xpath('p[@class="post_item_summary"]/text()').extract()
            items.append(item)
        return items
```

(4) Run the spider

```bash
scrapy crawl blog   # that's all it takes
```
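
If the crawl command complains about an unknown spider, listing the spiders registered in the project is a quick sanity check (not part of the original post):

```bash
scrapy list   # prints the name of every spider in the project, e.g. "blog"
```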

(5) Writing the output to a file

Configure the feed output in settings.py:

```python
# Output file location
FEED_URI = 'blog.xml'
# Output format: can be json, xml, or csv
FEED_FORMAT = 'xml'
```

The output file is written to the project's root folder.
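
Note that newer Scrapy releases (2.1 and later) deprecate FEED_URI and FEED_FORMAT in favour of the FEEDS setting; an equivalent configuration would look roughly like this:

```python
# settings.py, Scrapy >= 2.1: equivalent to the FEED_URI / FEED_FORMAT pair above
FEEDS = {
    'blog.xml': {'format': 'xml'},
}
```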

2. The basics: scrapy.spider.Spider

(1) Using the interactive shell

```text
dizzy@dizzy-pc:~$ scrapy shell "http://www.baidu.com/"
2014-08-21 04:09:11+0800 [scrapy] INFO: Scrapy 0.24.4 started (bot: scrapybot)
2014-08-21 04:09:11+0800 [scrapy] INFO: Optional features available: ssl, http11, django
2014-08-21 04:09:11+0800 [scrapy] INFO: Overridden settings: {'LOGSTATS_INTERVAL': 0}
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled extensions: TelnetConsole, CloseSpider, WebService, CoreStats, SpiderState
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled downloader middlewares: HttpAuthMiddleware, DownloadTimeoutMiddleware, UserAgentMiddleware, RetryMiddleware, DefaultHeadersMiddleware, MetaRefreshMiddleware, HttpCompressionMiddleware, RedirectMiddleware, CookiesMiddleware, ChunkedTransferMiddleware, DownloaderStats
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled spider middlewares: HttpErrorMiddleware, OffsiteMiddleware, RefererMiddleware, UrlLengthMiddleware, DepthMiddleware
2014-08-21 04:09:11+0800 [scrapy] INFO: Enabled item pipelines:
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Telnet console listening on 127.0.0.1:6024
2014-08-21 04:09:11+0800 [scrapy] DEBUG: Web service listening on 127.0.0.1:6081
2014-08-21 04:09:11+0800 [default] INFO: Spider opened
2014-08-21 04:09:12+0800 [default] DEBUG: Crawled (200) <GET http://www.baidu.com/> (referer: None)
[s] Available Scrapy objects:
[s]   crawler    <scrapy.crawler.Crawler object at 0xa483cec>
[s]   item       {}
[s]   request    <GET http://www.baidu.com/>
[s]   response   <200 http://www.baidu.com/>
[s]   settings   <scrapy.settings.Settings object at 0xa0de78c>
[s]   spider     <Spider 'default' at 0xa78086c>
[s] Useful shortcuts:
[s]   shelp()           Shell help (print this help)
[s]   fetch(req_or_url) Fetch request (or URL) and update local objects
[s]   view(response)    View response in a browser
>>>
# response.body             -- the full body of the downloaded response
# response.xpath('//ul/li') -- test any XPath expression against the page
```

As the Scrapy documentation puts it: "More important, if you type response.selector you will access a selector object you can use to query the response, and convenient shortcuts like response.xpath() and response.css() mapping to response.selector.xpath() and response.selector.css()."

In other words, the shell is a convenient way to check interactively whether an XPath selection is correct. I used to pick selectors with Firefox's F12 developer tools, but that does not guarantee the right content gets selected every time.

You can also run the shell without log output:

```bash
scrapy shell 'http://scrapy.org' --nolog
# the --nolog flag suppresses the log output
```
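
For example, inside the shell you can try out the selectors used by the cnblogs spider from Part 1 before writing them into parse(). An illustrative session (output omitted):

```python
>>> fetch('http://www.cnblogs.com/')
>>> response.xpath('//div[@class="post_item"]/div[2]/h3/a/text()').extract()   # list of post titles
>>> response.xpath('//div[@class="post_item"]/div[2]/p[@class="post_item_summary"]/text()').extract()
>>> view(response)   # open the downloaded page in a browser
```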

(2) Example

```python
from scrapy import Spider

from scrapy_test.items import DmozItem


class DmozSpider(Spider):
    name = 'dmoz'
    allowed_domains = ['dmoz.org']
    start_urls = [
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Books/',
        'http://www.dmoz.org/Computers/Programming/Languages/Python/Resources/',
    ]

    def parse(self, response):
        for sel in response.xpath('//ul/li'):
            item = DmozItem()
            item['title'] = sel.xpath('a/text()').extract()
            item['link'] = sel.xpath('a/@href').extract()
            item['desc'] = sel.xpath('text()').extract()
            yield item
```
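
The spider imports DmozItem from scrapy_test.items; that file is not shown in the original post, but a minimal definition matching the fields used above would look like this (assumed, by analogy with BlogItem):

```python
# scrapy_test/items.py -- assumed definition, mirroring the fields the spider fills in
from scrapy.item import Item, Field


class DmozItem(Item):
    title = Field()
    link = Field()
    desc = Field()
```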

(3) Saving the results to a file

You can save the scraped items straight to a file; supported formats include json, xml, and csv:

```bash
scrapy crawl dmoz -o a.json -t json
```
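
In recent Scrapy versions the -t flag can usually be dropped, since the exporter is inferred from the file extension:

```bash
scrapy crawl dmoz -o a.csv   # exporter chosen from the .csv extension
```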

(4) Creating a spider from a template

```python
# Running "scrapy genspider baidu baidu.com" creates spiders/baidu.py
# from the default template, with the following content:

# -*- coding: utf-8 -*-
import scrapy


class BaiduSpider(scrapy.Spider):
    name = "baidu"
    allowed_domains = ["baidu.com"]
    start_urls = (
        'http://www.baidu.com/',
    )

    def parse(self, response):
        pass
```
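
genspider ships with several templates besides the default basic one; you can list them and request one explicitly (flags as in current Scrapy, and they should behave the same way in 0.24):

```bash
scrapy genspider -l                          # list templates: basic, crawl, csvfeed, xmlfeed
scrapy genspider -t crawl news example.com   # generate a CrawlSpider-based spider named "news"
```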
