Scraping Feeds with XPath in FreshRSS

Posted on 2024-05-27 | Category: Original

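For websites that do not offer an RSS feed, FreshRSS's built-in scraper supports XPath-based extraction, but the official documentation has no tutorial on it. After some trial and error I have mostly solved my "red feed" problem (feeds that FreshRSS flags as failing). Below is a short summary of what I learned, for reference.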

Ground Rules

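Before you start, I strongly recommend testing the site's homepage URL in an RSS app such as Inoreader to see whether it can detect a feed directly. RSS formats vary wildly these days; Atom no longer rules everything.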

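Once you have the URL to scrape (usually the site's homepage), first determine whether the source is HTML or XML; the matching feed types are HTML + XPath (Web scraping) and XML + XPath, respectively. Note that a page that looks like HTML may actually be XML with a CSS stylesheet applied, such as this one; to be safe, right-click and view the page source to confirm. For XPath syntax, see this reference.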
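
The "view source" check can also be roughly automated. Below is a minimal sketch that looks at the beginning of the raw page source; the function name `guess_feed_type` and its heuristics are my own assumptions, not anything FreshRSS provides, and an XHTML page that starts with an XML declaration would be misclassified.

```python
def guess_feed_type(source: str) -> str:
    """Guess which FreshRSS feed type fits, based on the raw page source."""
    # Only inspect the start of the document, ignoring a BOM and whitespace.
    head = source.lstrip("\ufeff \t\r\n")[:1000].lower()
    # An XML declaration or a feed root element suggests XML,
    # even when the browser renders the page with a stylesheet.
    if head.startswith("<?xml") or "<rss" in head or "<feed" in head:
        return "XML + XPath"
    if "<!doctype html" in head or "<html" in head:
        return "HTML + XPath (Web scraping)"
    return "unknown"
```

If the result is "unknown", checking the source by hand is still the safest option.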

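Next, find a handy online XPath tester, such as this one (Chinese) or this one (English; access from mainland China may be restricted). Use it to verify your expressions before saving. I also suggest keeping a strtotime tester at hand, such as this one, to check that article dates parse correctly.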
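
For a quick offline sanity check of date strings, something like the following may help. It is only a rough stand-in for PHP's strtotime, written in Python under the assumption that dates come either in RFC 2822 form (typical for `pubDate`) or ISO 8601 form (typical for `datetime` attributes). The helper name `parse_feed_date` is hypothetical, and the check is stricter than strtotime, which also accepts many free-form formats.

```python
from datetime import datetime
from email.utils import parsedate_to_datetime
from typing import Optional


def parse_feed_date(text: str) -> Optional[datetime]:
    """Try RFC 2822 first, then ISO 8601; return None if neither matches."""
    text = text.strip()
    try:
        return parsedate_to_datetime(text)
    except (TypeError, ValueError):
        # Older Python versions raise TypeError, newer ones ValueError.
        pass
    try:
        return datetime.fromisoformat(text)
    except ValueError:
        return None


print(parse_feed_date("Mon, 27 May 2024 10:00:00 +0800"))
print(parse_feed_date("2024-05-27"))
```

A `None` result here does not prove that FreshRSS will fail on the string, only that it deserves a closer look in a real strtotime tester.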

In Practice

Here are some configurations that passed testing:

| Source | Item selector | Title | Content | Link | Date |
| --- | --- | --- | --- | --- | --- |
| Airing | `//item` | `descendant::title` | `descendant::*[name()='content:encoded']` | `descendant::link` | `descendant::*[name()='pubdate']` |
| Lv. MAX | `//div[@class="post-list"]` | `descendant::h3` | `descendant::p[@class="description"]` | `descendant::a/@href` | `descendant::p[@class="date"]` |
| foundryvtt | `//nav[@id="news"]/div/figure` | `//nav[@id="news"]/div/figure/descendant::a/@title` | `//nav[@id="news"]/div/figure/descendant::a/@title` | `//nav[@id="news"]/div/figure/descendant::a/img/@src` | |
| 葉子 | `//article[@class="post-block"]` | `descendant::header/h2` | `descendant::div[@class="post-body"]` | `descendant::header/h2/a/@href` | `descendant::header/div/span/time/@datetime` |
| 皮益侠 | `//article` | `descendant::h2` | `descendant::h3` | `descendant::h2/a/@href` | `descendant::time` |
| 清北 | `//main/div[@class="card"]` | `descendant::h1` | `descendant::div[@class="content"]` | `descendant::h1/a/@href` | `descendant::time/@datetime` |
| armsword | `//div[@class="article-inner"]` | `descendant::h1` | `descendant::div[@class="article-entry"]` | `descendant::h1/a/@href` | `descendant::time/@datetime` |
| lyz | `//div[@class="post-block"]` | `descendant::h2[@class="post-title"]` | `descendant::div[@class="post-body"]` | `descendant::h2[@class="post-title"]/a/@href` | `descendant::time/@datetime` |
| K.I.S.S | `//item` | `descendant::title` | `descendant::description` | `descendant::link` | `descendant::pubDate` |
| 小Lee说 | `//section[@class="post-item"]` | `descendant::h2` | `descendant::div[@class="post-abstract"]` | `descendant::a/@href` | `descendant::div[@class="post-info"]/span[position()=1]` |
| 小骨GT | `//div[contains(@class, 'c-post-card')]` | `descendant::h3` | | `descendant::h3/a/@href` | `descendant::time/@datetime` |
| Summer | `//section[contains(@class, 'post-item')]` | `descendant::h2` | `descendant::div[contains(@class, 'post-abstract')]` | `descendant::div[@class="content"]/a/@href` | `descendant::div[contains(@class, 'text-xs')]/@title` |
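
To make the roles of the columns concrete, here is a small sketch of how one row (the K.I.S.S configuration) might be applied: the item selector picks out each article node, and the remaining expressions are evaluated relative to that node. This is only a local approximation in Python, not how FreshRSS is implemented. The sample feed, `FIELDS`, and `extract_items` are all made up for illustration, and because Python's `xml.etree.ElementTree` supports only a subset of XPath, `.//item` stands in for `//item` and `.//title` for `descendant::title`.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample feed, not taken from any of the sites listed above.
SAMPLE = """<?xml version="1.0"?>
<rss version="2.0"><channel>
  <title>Example</title>
  <item>
    <title>First post</title>
    <description>Hello</description>
    <link>https://example.com/1</link>
    <pubDate>Mon, 27 May 2024 10:00:00 +0800</pubDate>
  </item>
  <item>
    <title>Second post</title>
    <description>World</description>
    <link>https://example.com/2</link>
    <pubDate>Tue, 28 May 2024 10:00:00 +0800</pubDate>
  </item>
</channel></rss>"""

# Per-item expressions, rewritten in ElementTree's limited XPath dialect.
FIELDS = {
    "title": ".//title",
    "content": ".//description",
    "link": ".//link",
    "date": ".//pubDate",
}


def extract_items(xml_text):
    """Select every item node, then evaluate each field relative to it."""
    root = ET.fromstring(xml_text)
    items = []
    for node in root.findall(".//item"):
        items.append({name: node.findtext(path, default="") for name, path in FIELDS.items()})
    return items


for item in extract_items(SAMPLE):
    print(item["title"], item["link"], item["date"])
```

Because the field expressions are relative to each item, the channel-level `<title>` is never picked up as an article title.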