Scrapy Tutorial

跟踪链接 假设您不仅想抓取 https://quotes.toscrape.com 前两页的内容，还想抓取网站中所有页面的引言。现在您知道如何从页面中提取数据了，让...

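Following links

Suppose that, instead of scraping only the first two pages of https://quotes.toscrape.com, you want the quotes from every page on the site.

Now that you know how to extract data from a page, let's see how to follow links from it.

The first step is to extract the link to the page we want to follow. Inspecting the page, we can see a link to the next page: an a element inside an li element with the class next, whose href attribute is /page/2/.

We can try extracting it in the shell: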

>>> response.css('li.next a').get()
'<a href="/page/2/">Next <span aria-hidden="true">→</span></a>'

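This returns the whole anchor element, but we want its href attribute. For that, Scrapy supports a CSS extension that selects attribute contents:

>>> response.css("li.next a::attr(href)").get()
'/page/2/'

There is also an attrib property available (see Selecting element attributes for more):

>>> response.css("li.next a").attrib["href"]
'/page/2/'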

Now let's look at our spider, modified to recursively follow the link to the next page and extract data from it:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("small.author::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            next_page = response.urljoin(next_page)
            yield scrapy.Request(next_page, callback=self.parse)

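Now, after extracting the data, the parse() method looks for the link to the next page, builds a full absolute URL with the urljoin() method (since the link can be relative), and yields a new request for the next page, registering itself as the callback that handles data extraction for that page. This keeps the crawl going through all the pages.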
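
To see what building an absolute URL involves, here is a small sketch using the standard library's urljoin. It assumes response.urljoin() resolves a link against the page's own URL following the same rules; the base URL and the variable names are illustrative.

```python
from urllib.parse import urljoin

# The URL of the page the link was found on.
base = "https://quotes.toscrape.com/page/1/"

# A root-relative href replaces the whole path of the base URL.
root_relative = urljoin(base, "/page/2/")

# A path-relative href is resolved against the base URL's directory.
path_relative = urljoin(base, "../2/")

# An absolute href is returned unchanged.
absolute = urljoin(base, "https://quotes.toscrape.com/page/2/")

print(root_relative)
print(path_relative)
print(absolute)
```

All three calls produce https://quotes.toscrape.com/page/2/, which is why the spider can pass whatever href it finds through urljoin() without checking its form first.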

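What you see here is Scrapy's mechanism for following links: when you yield a Request in a callback method, Scrapy schedules that request to be sent and registers a callback method to be executed when the request finishes.

Using this, you can build complex crawlers that follow links according to rules you define and extract different kinds of data depending on the page being visited.

In our example, it creates a sort of loop, following every link to the next page until it finds none, which is handy for crawling blogs, forums, and other sites with pagination.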
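
The loop can be illustrated with a toy model in plain Python. This is only a sketch of the idea, not Scrapy's engine: PAGES is a made-up in-memory site, a (url, callback) tuple stands in for a Request, and crawl is a hypothetical stand-in for the scheduler.

```python
from collections import deque

# Hypothetical in-memory "site": each page has quotes and an optional next link.
PAGES = {
    "/page/1/": {"quotes": ["q1", "q2"], "next": "/page/2/"},
    "/page/2/": {"quotes": ["q3"], "next": "/page/3/"},
    "/page/3/": {"quotes": ["q4"], "next": None},
}


def parse(url, page):
    """Callback: yield items, then at most one follow-up request."""
    for text in page["quotes"]:
        yield {"text": text, "url": url}
    if page["next"] is not None:
        # A (url, callback) tuple stands in for a Request object.
        yield (page["next"], parse)


def crawl(start_url, callback):
    """Toy engine: run callbacks until no more requests are queued."""
    queue = deque([(start_url, callback)])
    items = []
    while queue:
        url, cb = queue.popleft()
        for result in cb(url, PAGES[url]):
            if isinstance(result, tuple):
                queue.append(result)  # schedule the new request
            else:
                items.append(result)  # collect the scraped item
    return items


items = crawl("/page/1/", parse)
print([item["text"] for item in items])
```

The crawl ends on its own once the last page yields no follow-up request, which mirrors how the spider stops when no next link is found.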

A shortcut for creating Requests

As a shortcut for creating Request objects, you can use response.follow:

import scrapy


class QuotesSpider(scrapy.Spider):
    name = "quotes"
    start_urls = [
        "https://quotes.toscrape.com/page/1/",
    ]

    def parse(self, response):
        for quote in response.css("div.quote"):
            yield {
                "text": quote.css("span.text::text").get(),
                "author": quote.css("span small::text").get(),
                "tags": quote.css("div.tags a.tag::text").getall(),
            }

        next_page = response.css("li.next a::attr(href)").get()
        if next_page is not None:
            yield response.follow(next_page, callback=self.parse)

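Unlike scrapy.Request, response.follow supports relative URLs directly, so there is no need to call urljoin. Note that response.follow only returns a Request instance; you still have to yield this Request.

You can also pass a selector to response.follow instead of a string; this selector should extract the necessary attributes: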

for href in response.css("ul.pager a::attr(href)"):
    yield response.follow(href, callback=self.parse)

For <a> elements there is a shortcut: response.follow uses their href attribute automatically. So the code can be shortened further:

for a in response.css("ul.pager a"):
    yield response.follow(a, callback=self.parse)

To create multiple requests from an iterable, you can use response.follow_all instead:

anchors = response.css("ul.pager a")
yield from response.follow_all(anchors, callback=self.parse)

Or, shortening it further:

yield from response.follow_all(css="ul.pager a", callback=self.parse)

More examples and patterns

Here is another spider that illustrates callbacks and following links, this time for scraping author information:

import scrapy


class AuthorSpider(scrapy.Spider):
    name = "author"

    start_urls = ["https://quotes.toscrape.com/"]

    def parse(self, response):
        author_page_links = response.css(".author + a")
        yield from response.follow_all(author_page_links, self.parse_author)

        pagination_links = response.css("li.next a")
        yield from response.follow_all(pagination_links, self.parse)

    def parse_author(self, response):
        def extract_with_css(query):
            return response.css(query).get(default="").strip()

        yield {
            "name": extract_with_css("h3.author-title::text"),
            "birthdate": extract_with_css(".author-born-date::text"),
            "bio": extract_with_css(".author-description::text"),
        }

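This spider starts from the main page. It follows every link to an author page, calling the parse_author callback for each one, and also follows the pagination links with the parse callback, as we saw before.

Here we pass the callbacks to response.follow_all as positional arguments to make the code shorter; this also works for Request.

The parse_author callback defines a helper function that extracts and cleans up the data from a CSS query, and yields a Python dict with the author data.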

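Another interesting thing this spider demonstrates is that, even if there are many quotes from the same author, we don't need to worry about visiting the same author page multiple times. By default, Scrapy filters out duplicate requests to URLs it has already visited, which avoids hitting servers too much because of a programming mistake. This can be configured through the DUPEFILTER_CLASS setting.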
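
The idea behind duplicate filtering can be sketched in a few lines of plain Python. This is a toy model, not Scrapy's filter: SeenFilter and its methods are hypothetical, and it assumes that the HTTP method plus the URL is enough to identify a request.

```python
import hashlib


class SeenFilter:
    """Toy duplicate filter: remembers a fingerprint of each request."""

    def __init__(self):
        self.seen = set()

    def fingerprint(self, method, url):
        # Assumption: method + URL identify a request for this sketch.
        return hashlib.sha1(f"{method} {url}".encode("utf-8")).hexdigest()

    def request_seen(self, method, url):
        """Return True if this request was already recorded."""
        fp = self.fingerprint(method, url)
        if fp in self.seen:
            return True
        self.seen.add(fp)
        return False


dupefilter = SeenFilter()
author_links = [
    "/author/Albert-Einstein",
    "/author/J-K-Rowling",
    "/author/Albert-Einstein",  # same author quoted again
]

# Only requests that have not been seen before get scheduled.
scheduled = [
    url for url in author_links if not dupefilter.request_seen("GET", url)
]
print(scheduled)
```

In Scrapy itself, a single request can opt out of duplicate filtering by passing dont_filter=True to Request.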

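Hopefully by now you have a good understanding of how to use Scrapy's mechanism for following links and callbacks.

For another example spider that leverages this mechanism, check out the CrawlSpider class, a generic spider that implements a small rules engine you can use to write crawlers on top of.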

Also, a common pattern is to build an item with data from more than one page, using the mechanism for passing additional data to callbacks.