The other day I saw a post on Mastodon about blocking bots that hit a web server.

I asked the author of the post how he managed it, but I was solemnly ignored. That's the Internet for you.

So I decided to take a look at the logs of the server that hosts this very site, and I wrote a Perl program for it. For old times' sake, and to shake off the rust.
This was the program:

    #!/usr/bin/env perl
    use strict;
    use warnings;
    use IO::Zlib;

    # Hardcoded for Apache on Debian-like systems.
    my $LOGDIR = "/var/log/apache2";

    my %ip_addrs;      # hits per client IP (collected, but not reported)
    my %bot_agent;     # hits per bot User-Agent
    my @gzip_files;    # rotated, compressed logs, read in a second pass

    opendir(my $dir, $LOGDIR) or die "Cannot read directory $LOGDIR: $!\n";
    foreach my $filename (readdir $dir) {
        next if $filename !~ /access/;
        # Set the gzip files aside for now.
        if ($filename =~ /\.gz$/) {
            push(@gzip_files, "$LOGDIR/$filename");
            next;
        }
        print("$LOGDIR/$filename\n");
        open(my $fd, '<', "$LOGDIR/$filename") or die "Cannot read file: $!\n";
        while (my $line = <$fd>) {
            next if $line !~ /bot/;    # case-sensitive on purpose, as before
            parse_log_line($line);
        }
        close($fd);
    }
    closedir($dir);

    for my $filename (@gzip_files) {
        print("$filename\n");
        my $fh = IO::Zlib->new($filename, "rb")
            or die "Cannot read gzip file: $!\n";
        while (my $line = <$fh>) {
            next if $line !~ /bot/;
            parse_log_line($line);
        }
        $fh->close;
    }

    # Report the User-Agents, most frequent first.
    print("result:\n");
    foreach my $bot (sort { $bot_agent{$b} <=> $bot_agent{$a} } keys %bot_agent) {
        print("$bot => $bot_agent{$bot}\n");
    }

    # Count the client IP and, if it mentions "bot", the User-Agent of one line.
    sub parse_log_line {
        my ($line) = @_;
        my ($ip) = split(/ /, $line);
        $ip_addrs{$ip}++;
        # The User-Agent is the last quoted field of the log line.
        if ($line =~ /"([^"]*)"\s*$/) {
            my $agent = $1;
            $bot_agent{$agent}++ if $agent =~ /bot/;
        }
    }
And the result, as a table:

Bot User-Agent | Hits |
---|---|
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) | 30130 |
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) | 22203 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; | 16981 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 | 15979 |
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) | 12112 |
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) | 10744 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 | 9704 |
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; | 7434 |
Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html) | 5042 |
Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot) | 3104 |
Linguee Bot (http://www.linguee.com/bot; | 1957 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) | 1450 |
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/5741de8 warc/v0.8.85 | 751 |
Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html) | 651 |
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) | 567 |
Mozilla/5.0 (compatible; MJ12bot/v2.0.2; http://mj12bot.com/) | 538 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.183 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 531 |
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 411 |
Blogtrottr/2.1 (+https://blogtrottr.com/robot) | 340 |
Googlebot-Image/1.0 | 315 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/100.0.4896.127 Safari/537.36 | 300 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) | 261 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot | 240 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.7151.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 207 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/136.0.0.0 Safari/537.36 | 191 |
ZoominfoBot (zoominfobot at zoominfo dot com) | 164 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot | 150 |
Mozilla/5.0 (compatible; SeznamBot/4.0; +https://o-seznam.cz/napoveda/vyhledavani/en/seznambot-crawler/) | 124 |
Mozilla/5.0 (compatible; MojeekBot/0.11; +https://www.mojeek.com/bot.html) | 107 |
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/a7797cb warc/v0.8.78 | 89 |
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) | 82 |
Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot) | 80 |
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) | 70 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) | 62 |
Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; | 61 |
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine) | 56 |
Mozilla/5.0 (compatible; wpbot/1.3; +https://forms.gle/ajBaxygz9jSR8p8G9) | 41 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/137.0.7151.119 Safari/537.36 | 37 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.168 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 29 |
Twitterbot/1.0 | 29 |
AdsBot-Google (+http://www.google.com/adsbot.html) | 22 |
Mozilla/5.0 (compatible; intelx.io_bot +https://intelx.io) | 21 |
BufferLinkPreviewBot/1.0 (+https://scraper.buffer.com/about/bots/link-preview-bot) | 19 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 14 |
Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.) | 13 |
Mozilla/5.0 (compatible; MJ12bot/v2.0.4; http://mj12bot.com/) | 13 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.183 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) | 9 |
Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html) | 8 |
Mozilla/5.0 (Windows NT 10.0; Win64; x64; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20100101 Firefox/125.0 | 7 |
Googlebot/2.1 (+http://www.google.com/bot.html) | 5 |
Mozilla/5.0 (compatible; SurdotlyBot/1.0; +http://sur.ly/bot.html) | 5 |
Mozilla/5.0 (compatible; Website-info.net-Robot; https://website-info.net/robot) | 4 |
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine) | 4 |
Pandalytics/2.0 (https://domainsbot.com/pandalytics/) | 4 |
Mozilla/5.0 (compatible; IbouBot/1.0; | 4 |
DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot) | 4 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 3 |
Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com) | 3 |
serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; | 3 |
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0 | 3 |
yacybot (-global; amd64 Linux 5.15.161; java 11.0.26-internal; America/en) http://yacy.net/bot.html | 2 |
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/120.0.6099.199 Safari/537.36 | 2 |
Mozilla/5.0 (compatible; Qwantbot/1.0_4396629; +https://help.qwant.com/bot/) | 2 |
DomCopBot (https://www.domcop.com/bot) | 2 |
Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots) | 2 |
yacybot (/global; amd64 Linux 5.15.0-140-generic; java 11.0.27; America/en) http://yacy.net/bot.html | 2 |
Synapse (bot; +https://github.com/matrix-org/synapse) | 2 |
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot) | 1 |
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) | 1 |
Mozilla/5.0 (compatible; Qwantbot/1.0; +https://help.qwant.com/bot/) | 1 |
Facebot | 1 |
Mozilla/5.0 (compatible; GetHPinfo.com-Bot/0.1; +http://www.gethpinfo.com/bot/ | 1 |
Slack-ImgProxy (+https://api.slack.com/robots) | 1 |
Googlebot-Video/1.0 | 1 |
yacybot (/global; amd64 Linux 6.12.38+deb13-amd64; java 21.0.8; Europe/fr) http://yacy.net/bot.html | 1 |
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot) | 1 |
Mastodon/4.5.0-alpha.1+chuckya (http.rb/5.3.1; +https://lab.wheelsbot.dev/) | 1 |

The output is plain text; I only reformatted it (using "sd") to make it easier to read here. The log location is hardcoded to where apache2 keeps its logs on Debian-like systems.

I confess the results surprised me. That really is a lot of traffic coming from bots, including several I had never heard of.
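
Several rows in the table are the same bot under different version strings (MJ12bot, bingbot, Googlebot). A rough sketch of how those rows could be merged is below. The `bot_name` and `group_by_bot` helpers are hypothetical, not part of the original program, and the heuristic (take the first token containing "bot", ignoring case) is only an assumption that happens to work for most of the entries above.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Hypothetical helper: reduce a full User-Agent string to a short bot name.
# It returns the first token containing "bot" (any case), without the version.
sub bot_name {
    my ($agent) = @_;
    return $1 if $agent =~ m{([A-Za-z0-9_.-]*bot[A-Za-z0-9_-]*)}i;
    return $agent;    # no recognizable token: keep the string as is
}

# Sum hits per short bot name. $counts maps full User-Agent => hits.
sub group_by_bot {
    my ($counts) = @_;
    my %grouped;
    $grouped{ bot_name($_) } += $counts->{$_} for keys %$counts;
    return \%grouped;
}

# A few rows copied from the table above.
my %sample = (
    'Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)' => 30130,
    'Mozilla/5.0 (compatible; MJ12bot/v2.0.2; http://mj12bot.com/)' => 538,
    'Mozilla/5.0 (compatible; MJ12bot/v2.0.4; http://mj12bot.com/)' => 13,
    'Googlebot-Image/1.0'                                           => 315,
);

my $grouped = group_by_bot(\%sample);
for my $bot (sort { $grouped->{$b} <=> $grouped->{$a} } keys %$grouped) {
    print("$bot => $grouped->{$bot}\n");
}
```

With these sample rows, the three MJ12bot versions collapse into a single entry. Odd cases such as "Linguee Bot" would need special handling.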
As for blocking or not: for now I haven't touched anything and the bots keep accessing everything. One reason is that I haven't done any qualitative analysis yet, to find out whether they are getting 200 (OK) or something else, such as 404 (Not Found).
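
That qualitative check could look something like the sketch below, which counts HTTP status codes per bot User-Agent. It assumes the logs use Apache's "combined" format; the `tally_status` name and the sample lines are made up for illustration and do not come from the real logs.

```perl
#!/usr/bin/env perl
use strict;
use warnings;

# Assumption: Apache "combined" log format. Lines that do not match are skipped.
# Captures: (1) status code, (2) User-Agent.
my $COMBINED = qr{^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"};

# Count responses per User-Agent and status code, for agents mentioning "bot".
sub tally_status {
    my (@lines) = @_;
    my %status;    # $status{$agent}{$code} = hits
    for my $line (@lines) {
        next unless $line =~ $COMBINED;
        my ($code, $agent) = ($1, $2);
        next unless $agent =~ /bot/;
        $status{$agent}{$code}++;
    }
    return \%status;
}

# Made-up sample lines (documentation IP range), not real log data.
my @sample = (
    '192.0.2.1 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 512 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:02 +0000] "GET /missing HTTP/1.1" 404 196 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '192.0.2.1 - - [01/Jan/2025:00:00:03 +0000] "GET /feed HTTP/1.1" 200 2048 "-" "ExampleBot/1.0 (+http://example.com/bot)"',
    '192.0.2.7 - - [01/Jan/2025:00:00:04 +0000] "GET / HTTP/1.1" 200 512 "-" "Mozilla/5.0 (X11; Linux x86_64) Firefox/128.0"',
);

my $status = tally_status(@sample);
for my $agent (sort keys %$status) {
    for my $code (sort keys %{ $status->{$agent} }) {
        print("$agent $code => $status->{$agent}{$code}\n");
    }
}
```

In the real program, the same regex could replace the User-Agent extraction inside `parse_log_line`, so that both counts come out of a single pass over the logs.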
But if I do decide to block them, I found a quite interesting project on GitHub that already curates lists of "good" and "bad" bots.