Scrapy: run using TOR and multiple agents

Scrapy is a brilliant and well-documented crawler written in Python. Though it is not as scalable as Apache Nutch, it can easily handle thousands of sites. You can get up and running very quickly using the official documentation. Tor gives you the power to keep your privacy and security: it can hide your identity so that websites cannot track you. You may read more about Tor on its official site. Note, however, that Tor only works for TCP streams and can be used by any application with SOCKS support.

When we combine Scrapy with Tor, we gain more control over our crawler's privacy. We already know that Scrapy can work with a proxy server; however, since Scrapy doesn't speak SOCKS directly, things work out if we introduce an HTTP proxy server as an intermediary between Scrapy and Tor, one that can talk to Tor over SOCKS. SOCKS is a lower-level protocol than HTTP, and it is more transparent in the sense that it doesn't add extra information such as HTTP headers. We are going to use polipo, a tiny and fast proxy server. Polipo can talk to Tor using the SOCKS protocol, so all three together can create an anonymous crawler. Alright, let's get started.

  • I am going to assume that you have already installed Scrapy on your system.
  • Install Tor as per the instructions in the official documentation. On my Mac I used MacPorts to install Tor.
  • Start Tor.
  • Install polipo using MacPorts.
  • Uncomment the following lines in the /etc/polipo/config or /opt/local/etc/polipo/config file:

socksParentProxy = localhost:9050
diskCacheRoot = ""
disableLocalInterface = ""

  • Start polipo. By default polipo listens on port 8123 and Tor on port 9050. If you want, you may change these ports and adjust the settings in the config files accordingly.
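On a Mac with MacPorts, the steps above might look roughly like this (a sketch, not a definitive script: package names, the config path under /opt/local, and running the daemons in the foreground with & are assumptions that vary by platform and version):

```shell
# Install Tor and polipo via MacPorts (assumes MacPorts is already installed)
sudo port install tor polipo

# Start Tor; by default it accepts SOCKS connections on port 9050
tor &

# Point polipo at Tor's SOCKS port and disable its disk cache and web UI
sudo tee -a /opt/local/etc/polipo/config <<'EOF'
socksParentProxy = localhost:9050
diskCacheRoot = ""
disableLocalInterface = ""
EOF

# Start polipo; by default it accepts HTTP proxy requests on port 8123
polipo -c /opt/local/etc/polipo/config &
```

On Linux the config typically lives at /etc/polipo/config and the daemons are managed by the init system instead.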

Now, to verify that everything is working fine:

  • Change your browser proxy settings to point to localhost, port 8123.
  • Visit the Tor check page. If everything is configured correctly, this page should tell you that you are using Tor.
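If you prefer the command line, the same check can be done with curl through the proxy (this assumes Tor and polipo are running locally; the check URL is the one the Tor Project serves):

```shell
# Fetch the Tor check page through polipo's HTTP proxy on port 8123;
# the page contains "Congratulations" when your traffic is going via Tor
curl --proxy http://127.0.0.1:8123 https://check.torproject.org/ | grep -i congratulations
```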

Now that the basic setup is complete, let's add middleware code in Scrapy to make use of this proxy. Add a new file called middlewares.py to your project with the following code. See this Gist.

middlewares.py
import random

from scrapy.conf import settings


class RandomUserAgentMiddleware(object):
    # Pick a random User-Agent from settings for every outgoing request
    def process_request(self, request, spider):
        ua = random.choice(settings.get('USER_AGENT_LIST'))
        if ua:
            request.headers.setdefault('User-Agent', ua)


class ProxyMiddleware(object):
    # Route every request through the HTTP proxy (polipo) defined in settings
    def process_request(self, request, spider):
        request.meta['proxy'] = settings.get('HTTP_PROXY')
settings.py
### More comprehensive list can be found at
### http://techpatterns.com/forums/about304.html
USER_AGENT_LIST = [
    'Mozilla/5.0 (Windows NT 6.1; WOW64) AppleWebKit/535.7 (KHTML, like Gecko) Chrome/16.0.912.36 Safari/535.7',
    'Mozilla/5.0 (Windows NT 6.2; Win64; x64; rv:16.0) Gecko/16.0 Firefox/16.0',
    'Mozilla/5.0 (Macintosh; Intel Mac OS X 10_7_3) AppleWebKit/534.55.3 (KHTML, like Gecko) Version/5.1.3 Safari/534.53.10',
]
HTTP_PROXY = 'http://127.0.0.1:8123'
DOWNLOADER_MIDDLEWARES = {
    'myproject.middlewares.RandomUserAgentMiddleware': 400,
    'myproject.middlewares.ProxyMiddleware': 410,
    # Disable Scrapy's built-in UserAgentMiddleware so our random one takes over
    'scrapy.contrib.downloadermiddleware.useragent.UserAgentMiddleware': None,
}
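To see what the two middlewares actually do to a request, here is a minimal, self-contained sketch of their logic outside Scrapy (FakeRequest and the SETTINGS dict are illustrative stand-ins, not Scrapy APIs):

```python
import random

# Illustrative stand-ins for Scrapy's settings object and Request class
SETTINGS = {
    'USER_AGENT_LIST': ['agent-a', 'agent-b', 'agent-c'],
    'HTTP_PROXY': 'http://127.0.0.1:8123',
}

class FakeRequest:
    def __init__(self):
        self.headers = {}
        self.meta = {}

def apply_random_user_agent(request):
    # Same logic as RandomUserAgentMiddleware.process_request
    ua = random.choice(SETTINGS['USER_AGENT_LIST'])
    if ua:
        request.headers.setdefault('User-Agent', ua)

def apply_proxy(request):
    # Same logic as ProxyMiddleware.process_request
    request.meta['proxy'] = SETTINGS['HTTP_PROXY']

req = FakeRequest()
apply_random_user_agent(req)
apply_proxy(req)
print(req.headers['User-Agent'] in SETTINGS['USER_AGENT_LIST'])  # True
print(req.meta['proxy'])  # http://127.0.0.1:8123
```

Each downloader middleware only mutates the request in place; Scrapy then sends the request with the chosen User-Agent header through the proxy named in request.meta.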

You are all set to crawl now. Make sure you use the crawler responsibly, with sufficient delay, and follow each website's terms and conditions along with its robots.txt rules.
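Scrapy's own settings cover the politeness points above; a sketch of the relevant options for settings.py (the values shown are examples to tune per site, not recommendations):

```python
# Politeness settings for settings.py (values are illustrative)
DOWNLOAD_DELAY = 2                   # seconds to wait between requests to the same site
RANDOMIZE_DOWNLOAD_DELAY = True      # jitter the delay to look less mechanical
ROBOTSTXT_OBEY = True                # respect each site's robots.txt rules
CONCURRENT_REQUESTS_PER_DOMAIN = 1   # one in-flight request per domain at a time
```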

Next time I am going to set things up on Ubuntu, and I will update this article accordingly. Hope this helps!
