An introduction to Scrapy

Scrapy framework overview

Basic workflow

Scrapy framework (architecture diagram)

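1. The Engine opens a domain, locates the Spider that handles that domain, and asks the Spider for the first URLs to crawl.
2. The Engine gets the URLs, wraps them in Requests, and schedules them in the Scheduler.
3. The Engine asks the Scheduler for the next URLs to crawl, and the Scheduler returns them to the Engine.
4. The Engine passes them to the Downloader, going through the Downloader Middleware (request direction).
5. Once a page finishes downloading, the Downloader generates a Response object for the downloaded page and sends it back to the Engine, again through the Downloader Middleware (response direction).
6. The Engine receives the Response from the Downloader and passes it to the Spider for processing, through the Spider Middleware (input direction).
7. The Spider processes the Response and returns the scraped items and new Requests to the Engine.
8. The Engine sends the items to the Item Pipelines and the Requests back to the Scheduler.
9. The process repeats from step 3 until the Scheduler has no more Requests, and then the Engine closes.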
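The loop described above can be imitated in a few lines of plain Python. This is only a toy model, not Scrapy code: FAKE_SITE, download, parse and crawl are hypothetical names, a URL string stands in for a Request, and a dict stands in for a Response.

```python
from collections import deque

# Hypothetical stand-in for a website: URL -> (page text, outgoing links).
FAKE_SITE = {
    "http://example.com/": ("home", ["http://example.com/a", "http://example.com/b"]),
    "http://example.com/a": ("page a", ["http://example.com/b"]),
    "http://example.com/b": ("page b", []),
}


def download(url):
    """Downloader: turn a request (here just a URL) into a response dict."""
    text, links = FAKE_SITE[url]
    return {"url": url, "text": text, "links": links}


def parse(response):
    """Spider callback: yield one item, then a new request per link."""
    yield {"url": response["url"], "text": response["text"]}
    for link in response["links"]:
        yield link  # a plain str stands for a new Request


def crawl(start_urls):
    scheduler = deque(start_urls)  # the initial requests are scheduled
    seen = set(start_urls)         # duplicate filter
    items = []
    while scheduler:               # stop when the Scheduler is empty
        url = scheduler.popleft()  # ask the Scheduler for the next request
        response = download(url)   # the Downloader produces a Response
        for result in parse(response):  # the Spider processes the Response
            if isinstance(result, dict):
                items.append(result)      # items go to the pipeline
            elif result not in seen:
                seen.add(result)
                scheduler.append(result)  # new requests go back to the Scheduler
    return items
```

Calling crawl(["http://example.com/"]) visits each of the three fake pages once, because the duplicate filter drops the second link to page b.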

Spider

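A Spider is a user-defined class that inherits from Spider or CrawlSpider. It defines how a site or a group of sites is crawled, and it has two main jobs: finding the links to keep crawling, and extracting structured data.

CrawlSpider is the more commonly used base class, and it supports custom Rules (TODO: describe how to use them).

A Rule specifies which links to extract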
Link Extractors

start_urls

The initial Requests are created by the start_requests method, which wraps each URL in the start_urls list in a Request object. You can override start_requests, for example when the Requests need cookies to be set.

parse

parse is the callback of a Request and parses the content of the Response. The Response object wraps a Selector, so content can be extracted with XPath or CSS expressions.

Common XPath usage (the literal "下一页" in the last expression means "next page"):

response.xpath('//a[contains(@href, "image")]/img/@src').extract()
response.xpath('//a[@href="image"]/h1/text()').extract()
response.xpath(u'//a[re:test(text(), "下一页")]/text()').extract()

extract() returns a list, which may be empty; extract_first() returns a str, or None if nothing matches.

Note: use scrapy shell html_path to check that an XPath expression is correct.

See: Scrapy Selector Tutorial XPath

Scheduler

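The Scheduler stores the pending Requests (the queue) and keeps track of the requests already seen (the dupefilter). Scrapy ships memory-based and disk-based implementations.

The memory-based option is not persistent: if the job is interrupted, its state is lost
The disk-based option keeps reading and writing files, which is inefficient

To run efficiently, persistently, and in a distributed fashion, we implement a Scheduler on top of Redis:

__init__(): initialize the Redis parameters, queue_key and dupefilter_key
open(): initialize variables
enqueue_request(): if not dupefilter.request_seen(request): queue.push(request)
close(): depending on whether persistence is configured, decide whether to clear the Redis keys

The queue can be a plain queue or a priority queue; Requests have a priority attribute.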
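A minimal sketch of the four methods listed above, with in-memory structures standing in for Redis so that it runs on its own. It is not the actual implementation: SketchScheduler, the Request namedtuple, next_request and the URL-hash fingerprint are all assumptions made for illustration. With Redis, the heap would presumably become a sorted set and the seen-set a Redis set.

```python
import hashlib
import heapq
import itertools
from collections import namedtuple

# Hypothetical stand-in for scrapy.Request, reduced to what the sketch needs.
Request = namedtuple("Request", ["url", "priority"])


def fingerprint(request):
    """Identify a request by a hash of its URL (an assumption of this sketch)."""
    return hashlib.sha1(request.url.encode("utf-8")).hexdigest()


class SketchScheduler:
    def __init__(self, persist=False):
        self.persist = persist             # keep state after close()?
        self.queue = None                  # priority queue of pending requests
        self.seen = None                   # dupefilter: fingerprints already seen
        self._counter = itertools.count()  # tie-breaker for equal priorities

    def open(self):
        # Only create fresh state if nothing was persisted from a previous run.
        if self.queue is None:
            self.queue, self.seen = [], set()

    def enqueue_request(self, request):
        fp = fingerprint(request)
        if fp in self.seen:
            return False                   # duplicate: drop it
        self.seen.add(fp)
        # heapq is a min-heap, so negate priority: higher priority pops first.
        heapq.heappush(self.queue, (-request.priority, next(self._counter), request))
        return True

    def next_request(self):
        if not self.queue:
            return None
        return heapq.heappop(self.queue)[2]

    def close(self):
        if not self.persist:
            self.queue, self.seen = None, None
```

Negating the priority keeps the Scrapy convention that a Request with a higher priority value is processed earlier.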

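Downloader Middleware

Downloader Middleware is a set of hooks for processing Requests and Responses.

Middlewares can be enabled or disabled, and their execution order set, in the settings:

    DOWNLOADER_MIDDLEWARES = {
        'myproject.middlewares.CustomDownloaderMiddleware': 543,
    }

A middleware can define handlers for both directions: process_request and process_response. Some commonly used middlewares are described below; for details, see the Downloader Middleware documentation.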

rotate_proxy

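Scrapy implements a simple HttpProxyMiddleware that reads the HTTP proxy from the environment variables set on *nix systems and then sets it in request.meta. Because a crawler usually works through many proxy IPs, we designed a custom RotateProxyMiddleware: for each Request it takes one proxy from the pool of candidate proxy IPs and sets the proxy key in request.meta.

The IP pool is stored in Redis. The current selection rule is round-robin: a proxy is taken from the head of the queue and, after use, put back at the tail. Other strategies are possible: for example, if a Request gets a 302 redirect, it is fairly safe to conclude that the IP has been blocked, so its weight can be lowered.

How to obtain the proxy IP pool is described below.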
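The round-robin rule can be sketched as a downloader middleware. This is an illustration rather than the notes' actual code: a collections.deque stands in for the Redis list, and the class body and constructor argument are assumptions. A downloader middleware is a plain class, so nothing from Scrapy needs to be imported here.

```python
from collections import deque


class RotateProxyMiddleware:
    """Round-robin proxy rotation; a deque stands in for the Redis list."""

    def __init__(self, proxies):
        self.proxies = deque(proxies)

    def process_request(self, request, spider):
        if not self.proxies:
            return None                  # empty pool: send the request directly
        proxy = self.proxies.popleft()   # take a proxy from the head of the queue
        self.proxies.append(proxy)       # and put it back at the tail
        request.meta["proxy"] = proxy    # the meta key Scrapy uses for the proxy
        return None                      # None means "continue processing"
```

With a real Redis list, the pop-and-push-back pair could probably be replaced by one atomic rotate command, which matters once several crawler processes share the pool.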

rotate_useragent

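To keep a site from identifying the program as a crawler, requests need a User-Agent header. Scrapy implements a simple middleware that sets it on requests from the settings. A single User-Agent does not provide enough variety, though; a large number of UA strings can be collected from user-agent, and UA rotation is implemented the same way as rotate_proxy.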
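A possible shape for the User-Agent rotation, again as a plain class. The class name and constructor are assumptions, and itertools.cycle replaces the Redis-backed queue for brevity.

```python
import itertools


class RotateUserAgentMiddleware:
    """Set a different User-Agent on each request, cycling through a list."""

    def __init__(self, user_agents):
        if not user_agents:
            raise ValueError("at least one User-Agent string is required")
        self._cycle = itertools.cycle(user_agents)

    def process_request(self, request, spider):
        # Overwrite any User-Agent that is already set on the request.
        request.headers["User-Agent"] = next(self._cycle)
        return None
```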

httpcache

This middleware caches Requests and Responses. Only the filesystem-based cache is used at present; a DBM-based cache could be used later.

What is cached:

request_body - the plain request body
request_headers - the request headers (in raw HTTP format)
response_body - the plain response body
response_headers - the response headers (in raw HTTP format)
meta - some metadata of this cache resource in Python repr() format (grep-friendly format)
pickled_meta - the same metadata in meta but pickled for more efficient deserialization

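What is useful:

meta: basic information, including the HTTP code and refer base
request_headers: for analyzing the properties of the request, including headers and cookies
response_body: for inspecting the actual content of the downloaded page; it is usually saved gzip-compressed, so it has to be read with gzip as well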
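For inspecting a cached page by hand, something like the following could be used. It is a sketch under assumptions: read_response_body is a hypothetical helper, entry_dir is assumed to be the directory of one cache entry containing the files listed above, and compression is detected from the gzip magic bytes rather than from the cache settings.

```python
import gzip
from pathlib import Path


def read_response_body(entry_dir):
    """Return the response body stored in one cache entry directory."""
    raw = (Path(entry_dir) / "response_body").read_bytes()
    if raw[:2] == b"\x1f\x8b":        # gzip magic number
        return gzip.decompress(raw)
    return raw                        # stored uncompressed
```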

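Related settings:

HTTPCACHE_ENABLED: whether the cache is enabled
HTTPCACHE_EXPIRATION_SECS: cache expiration time in seconds; the default, 0, means cached pages never expire
HTTPCACHE_IGNORE_HTTP_CODES: HTTP codes that should not be cached, useful for error pages (e.g. [404, 520, ...])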
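In a project's settings.py these options might look as follows. The values are illustrative only, not the ones used in this project.

```python
# settings.py (illustrative values)
HTTPCACHE_ENABLED = True
HTTPCACHE_EXPIRATION_SECS = 0             # 0: cached pages never expire
HTTPCACHE_IGNORE_HTTP_CODES = [404, 520]  # do not cache these error responses
```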

cookies

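On sites that can only be crawled while logged in, use CookiesMiddleware and multiple cookie sessions.

XXX: COOKIES_ENABLED vs dont_merge_cookies

To keep an excess of cookies from exposing the crawler, set the dont_merge_cookies key so that the Set-Cookie headers of responses have no effect.

XXX: what role do merged cookies play when a site tries to identify crawlers?

In practice the Cookie set on the Request did not take effect either. The key is not set for now, so cookies are still merged. To be confirmed: is this a Scrapy bug?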

download timeout middleware

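Set a page download timeout. It should not be too long; otherwise slow requests occupy concurrency slots without ever producing a result, and they skew the AutoThrottle algorithm.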

redirect

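In normal browsing, when a URL has moved, redirecting through the response's Location header is expected. In crawling, however, most redirects lead to a login or verification page. In that case the crawler should not follow the redirect; instead, the Request should be put back into the Scheduler and fetched again with another IP or session.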
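One way this could be done is a downloader middleware that returns the original Request when it sees a redirect status. The sketch below is an assumption-laden illustration: the class name, the reschedule_times meta key, the status list and the retry cap are all made up here. It also assumes the built-in redirect handling is switched off (REDIRECT_ENABLED = False), so that redirect responses actually reach this middleware.

```python
class RescheduleOnRedirectMiddleware:
    """Re-queue a redirected request instead of following the redirect."""

    REDIRECT_CODES = (301, 302)
    MAX_RESCHEDULES = 3  # hypothetical cap so a request cannot loop forever

    def process_response(self, request, response, spider):
        if response.status not in self.REDIRECT_CODES:
            return response  # not a redirect: pass it on unchanged
        tries = request.meta.get("reschedule_times", 0) + 1
        if tries > self.MAX_RESCHEDULES:
            return response  # give up and hand the redirect response on
        meta = dict(request.meta, reschedule_times=tries)
        # Returning a Request from process_response sends it back to the
        # Scheduler; dont_filter=True keeps the dupefilter from dropping it.
        return request.replace(meta=meta, dont_filter=True)
```

Combined with a rotating proxy middleware, the re-queued Request would go out through a different IP on its next attempt.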

retry

Many factors can cause download errors while crawling, and such requests should be retried. Scrapy implements RetryMiddleware. The basic settings:

RETRY_ENABLED
RETRY_TIMES
RETRY_HTTP_CODES
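For illustration, these settings could be written like this in settings.py. The values are examples chosen for this sketch, not recommendations from the notes.

```python
# settings.py (illustrative values)
RETRY_ENABLED = True
RETRY_TIMES = 3                                # retries in addition to the first attempt
RETRY_HTTP_CODES = [500, 502, 503, 504, 408]   # responses worth retrying
```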

Spider Middleware

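Spider Middleware processes the Responses sent to the Spider and the Requests and items that the Spider sends back to the Engine.

Middlewares implemented by Scrapy:

httperror: filters out responses whose HTTP status is outside the 200-300 range; to handle such responses in the Spider, add their codes to handle_httpstatus_list
referer: sets the Referer header of Requests, which avoids access failures caused by hotlink protection
depth: limits the crawl depth
urllength: filters out Requests whose URL is too long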

Item Pipeline

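An Item holds structured data, which is then persisted through a pipeline.

item loaders

The current way of extracting items from a response is not robust.

XXX: this should be generalized into a loader, because one field may correspond to several XPaths

mysql

Items are persisted to a MySQL database; the current implementation uses Twisted's asynchronous adbapi. Storing them as JSON in Mongo would be a better fit. Note: when adbapi runs a database insert or update, adding addErrback() is essential for detecting errors.

XXX: the difference between the asynchronous and the synchronous approach has not been measured yet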

Downloader

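Last but not least.

DOWNLOAD_DELAY: the time to wait before each new download is started
CONCURRENT_REQUESTS: the maximum number of concurrent Requests performed by the Scrapy Downloader
CONCURRENT_REQUESTS_PER_DOMAIN: limits the concurrency per domain
CONCURRENT_REQUESTS_PER_IP: when domains resolve to different IPs, limits the concurrency per target IP

XXX: needs a more detailed analysis

If all requests go out through a single IP, reaching N concurrent requests requires delay = latency / N. The downloader currently waits 2s before each new download! With a 2s delay there is essentially no concurrency, which keeps each IP safe.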
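The relation delay = latency / N can be checked with a little arithmetic: if a new download starts every delay seconds and each one takes latency seconds, about latency / delay downloads are in flight at once. The two helper functions below are hypothetical and exist only to make that arithmetic explicit.

```python
def delay_for_concurrency(latency, n):
    """Delay between downloads needed to keep about n requests in flight."""
    if n <= 0:
        raise ValueError("n must be positive")
    return latency / n


def expected_concurrency(latency, delay):
    """Approximate number of requests in flight for a given delay."""
    if delay <= 0:
        raise ValueError("delay must be positive")
    return latency / delay
```

For example, with a latency of 2 seconds, 4 concurrent requests need a delay of 0.5 seconds; with a latency of 1 second and a delay of 2 seconds, expected_concurrency gives 0.5, which is the "essentially no concurrency" case.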

AutoThrottle extension

Adjusts the delay according to the downloader latency.
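A settings fragment that turns the extension on might look like this. The values are illustrative assumptions; as far as I know, AutoThrottle never lowers the delay below DOWNLOAD_DELAY, so that setting acts as a floor.

```python
# settings.py (illustrative values)
AUTOTHROTTLE_ENABLED = True
AUTOTHROTTLE_START_DELAY = 5.0   # initial download delay, in seconds
AUTOTHROTTLE_MAX_DELAY = 60.0    # upper bound for the delay under high latency
DOWNLOAD_DELAY = 2               # lower bound: the delay is not reduced below this
```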

How to avoid getting banned

Here are some tips to keep in mind when dealing with these kind of sites:

rotate your user agent from a pool of well-known ones from browsers (google around to get a list of them)
disable cookies (see COOKIES_ENABLED) as some sites may use cookies to spot bot behaviour
use download delays (2 or higher). See DOWNLOAD_DELAY setting.
if possible, use Google cache to fetch pages, instead of hitting the sites directly
use a pool of rotating IPs. For example, the free Tor project or paid services like ProxyMesh
use a highly distributed downloader that circumvents bans internally, so you can just focus on parsing clean pages.

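See: Avoiding getting banned

Obtaining proxy IPs

Sources

freeproxylists and xici are the more reliable free proxy IP sources found so far; freeproxylists has about 500

XXX: automate fetching and refreshing the proxy IPs

Proxy testing

Concurrent testing in Python: use multiple threads to check whether each proxy is usable

Speed test with curl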

curl -y 2 -Y 51200 -m 3 -x $line -o $prefix$index www.baidu.com

TODO: redo the curl speed test above in Python
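A rough Python counterpart of the threaded check and the curl speed test could look like the sketch below. It only approximates the curl flags: the timeout applies to each socket operation rather than to the whole transfer (unlike -m 3), and the speed is averaged over the download instead of being watched for 2 seconds (unlike -y 2 -Y 51200). The function names are hypothetical, and proxies are assumed to be given as "http://host:port".

```python
import time
import urllib.request
from concurrent.futures import ThreadPoolExecutor


def fetch_via_proxy(proxy, url="http://www.baidu.com", timeout=3):
    """Download url through proxy and return the body (raises on failure)."""
    handler = urllib.request.ProxyHandler({"http": proxy})
    opener = urllib.request.build_opener(handler)
    with opener.open(url, timeout=timeout) as resp:
        return resp.read()


def check_proxy(proxy, fetch=fetch_via_proxy, min_speed=51200):
    """Return (proxy, bytes_per_second), or (proxy, None) if unusable or slow."""
    start = time.monotonic()
    try:
        body = fetch(proxy)
    except Exception:
        return proxy, None               # connection failed or timed out
    elapsed = max(time.monotonic() - start, 1e-6)
    speed = len(body) / elapsed
    return proxy, (speed if speed >= min_speed else None)


def check_all(proxies, fetch=fetch_via_proxy, workers=20):
    """Check all proxies in parallel and keep the ones that pass."""
    with ThreadPoolExecutor(max_workers=workers) as pool:
        results = pool.map(lambda p: check_proxy(p, fetch), proxies)
        return {proxy: speed for proxy, speed in results if speed is not None}
```

The fetch parameter is injectable so that the checking logic can be exercised without network access; in real use the default fetch_via_proxy would be kept.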

References

Scrapy Architecture