Blocking malicious spiders
Main list:
if ($http_user_agent ~ "hubspot|CCBot|VelenPublicWebCrawler|Konturbot|my-tiny-bot|eiki|webmeup|ExtLinksBot|Go-http-client|Python|ZoominfoBot|MegaIndex\.ru|GPTBot|MauiBot|Amazonbot|ds-robot|intelx\.io|coccocbot|FeedDemon|Indy Library|Alexa Toolbar|AskTbFXTV|AhrefsBot|CrawlDaddy|CoolpadWebkit|Applebot|Java|Barkrowler|Feedly|UniversalFeedParser|ApacheBench|Microsoft URL Control|Swiftbot|DuckDuckGo|ClaudeBot|ZmEu|oBot|jaunty|Python-urllib|lightDeckReports Bot|YYSpider|DigExt|MJ12bot|DotBot|heritrix|Html5plus|Bytespider|BLEXBot|serpstatbot|Ezooms|JikeSpider|InfoTigerBot|SemrushBot|DuckDuckGo-Favicons-Bot|ImagesiftBot|^$") {
    return 403;
}
# Temporary block; can be removed later
if ($http_user_agent ~ "hubspot|rwth-aachen\.de|^$") {
    return 403;
}
HttpClient is sometimes malicious, but blocking it can sometimes have side effects.
Minor spiders:
if ($http_user_agent ~ "Phpzhanqun|HostHarvest|python-requests|^$") {
    return 403;
}
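The three `if` blocks above overlap (the empty User-Agent check `^$` and `hubspot` appear more than once). One possible way to keep everything in a single place is a `map` in the `http` context. This is only a sketch, not the configuration these notes were written against: the variable name `$block_ua`, the server name `example.com`, and the shortened pattern lists are assumptions made for illustration.

```nginx
# http context. $block_ua is a hypothetical variable name.
map $http_user_agent $block_ua {
    default 0;

    # Empty or missing User-Agent (replaces the repeated ^$ alternative).
    ""      1;

    # Main list (abbreviated excerpt; extend with the remaining names).
    "~(hubspot|CCBot|GPTBot|ClaudeBot|Bytespider|Amazonbot|Go-http-client)" 1;

    # Temporary block; can be removed later.
    "~rwth-aachen\.de" 1;

    # Minor spiders.
    "~(Phpzhanqun|HostHarvest|python-requests)" 1;
}

server {
    listen 80;
    server_name example.com;   # placeholder

    # A single check replaces the three separate regex tests.
    if ($block_ua) {
        return 403;
    }
}
```

Patterns starting with `~` are case-sensitive, like the `~` operator in the original `if` blocks; `~*` would make them case-insensitive.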
Amazonbot: per Amazon, "Amazonbot is Amazon's web crawler used to improve our services, such as enabling Alexa to answer even more questions for customers. Amazonbot respects standard robots.txt rules." Can be blocked.
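Because Amazonbot is stated to respect robots.txt, a robots.txt rule is an alternative to returning 403. A minimal sketch, assuming the site has no robots.txt file on disk and that serving one straight from nginx is acceptable (the rule content here is an illustration, not part of the original setup):

```nginx
# server context. Serves a small robots.txt directly from the config.
location = /robots.txt {
    default_type text/plain;
    # Asks Amazonbot to stay away from the whole site.
    return 200 "User-agent: Amazonbot\nDisallow: /\n";
}
```

This only helps with crawlers that honor robots.txt; the User-Agent block is still needed for the ones that do not.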
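Go-http-client: this may be the spider that Alibaba Cloud (or Tencent Cloud) whole-site acceleration uses to determine the optimal route, or it may be an HTTP client written in Go, possibly used by some other program for scraping (https://www.cnblogs.com/rxbook/p/15167301.html). It is not a normal browser, so it is blocked for now.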
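Bytespider: ByteDance's spider. It crawls far too frequently, possibly to build up its database quickly. Its overseas market share is low, so it is blocked for now; it should be unblocked later.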
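When the time comes to unblock Bytespider, rate limiting may be a softer option than an outright 403, since the complaint is crawl frequency. The following is a sketch under assumptions: Bytespider has been removed from the block list, and the zone name `bots`, the variable `$limit_bot_key`, and the rate of 1 request per second are all illustrative values, not tested settings.

```nginx
# http context. Only requests whose User-Agent matches get a non-empty key;
# nginx does not account requests with an empty key, so normal visitors
# are not rate limited.
map $http_user_agent $limit_bot_key {
    default        "";
    "~*Bytespider" $binary_remote_addr;
}

limit_req_zone $limit_bot_key zone=bots:10m rate=1r/s;

server {
    listen 80;
    server_name example.com;   # placeholder

    location / {
        # Allow short bursts of 5 requests, reject the rest.
        limit_req zone=bots burst=5 nodelay;
        limit_req_status 429;
    }
}
```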
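Pro Sitemaps Generator: a sitemap generation tool from pro-sitemaps.com that adds load to the site. Tools like this do not all need to be listed in advance; add one when it shows up.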
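Added 2024.5.25:
ImagesiftBot: a spider that crawls images for AI use.
researchscan.comsys.rwth-aachen.de: a website security research scan run by a German university (temporary block; can be removed later).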
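GPTBot: ChatGPT's spider. Blocked!
ClaudeBot: an AI spider that collects data at scale. It crawls aggressively with no regard for etiquette, so block it outright. Ideally, also block the IPs of their servers.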
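Blocking by IP, as suggested for ClaudeBot, can be done with the `deny` directive. The ranges below are placeholders from the reserved documentation address blocks, not the crawler's real addresses; the actual ranges would have to come from the site's own access logs or from whatever the operator publishes.

```nginx
# http or server context.
deny 203.0.113.0/24;   # placeholder IPv4 range, replace with observed addresses
deny 2001:db8::/32;    # placeholder IPv6 range, replace with observed addresses
allow all;             # everyone else is still allowed
```

Rules are checked in order and the first match wins, so the `deny` lines must come before `allow all`.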