The other day I saw a post on Mastodon about blocking bots that hit a web server.

I asked the author how he managed it, but I was solemnly ignored. Internet things.

So I decided to take a look at the logs of the server, the one that hosts this very site. And I wrote a Perl program for that. For old times' sake. And to shake off the rust.

And this was the program:

  
#!/usr/bin/env perl
# Count requests per bot User-Agent across the Apache access logs.
use strict;
use warnings;
use IO::Zlib;

my $LOGDIR = "/var/log/apache2";

opendir(my $dh, $LOGDIR) or die "Impossible to read from directory: $!\n";

my %ip_addrs;     # requests per client IP (collected, but not printed)
my %bot_agent;    # requests per bot User-Agent

my @gzip_files;

foreach my $filename (readdir $dh) {
        next if $filename !~ /access/;
        # skip gz right now, they are read in a second pass
        if ($filename =~ /\.gz$/) {
                push(@gzip_files, "$LOGDIR/$filename");
                next;
        }

        print("$LOGDIR/$filename\n");
        open(my $fd, '<', "$LOGDIR/$filename") or die "Impossible to read file: $!\n";
        while (my $line = <$fd>) {
                next if $line !~ /bot/;
                parse_log_line($line);
        }
        close($fd);
}
closedir($dh);

# second pass: the rotated, gzip-compressed logs
for my $filename (@gzip_files) {
        print("$filename\n");
        my $fh = IO::Zlib->new;
        $fh->open($filename, "rb") or die "impossible to read gzip file: $!\n";
        while (my $line = <$fh>) {
                next if $line !~ /bot/;
                parse_log_line($line);
        }
        $fh->close;
}

print("result:\n");

# most active bots first
foreach my $bot (sort { $bot_agent{$b} <=> $bot_agent{$a} } keys %bot_agent) {
        print("$bot => $bot_agent{$bot}\n");
}

sub parse_log_line {
        my ($line) = @_;

        # the client IP is the first space-separated field
        my @params = split(/ /, $line);
        my $ip = $params[0];
        $ip_addrs{$ip}++;

        # keep only the last quoted field, which is the User-Agent
        $line =~ s/.*]//;
        $line =~ s/\"$//;
        $line =~ s/.*\"//;
        chomp($line);
        if ($line =~ m/bot/) {
                $bot_agent{$line}++;
        }
}
  

And the result, in table form:

Bot User-Agent | Number of requests
Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/) 30130
Mozilla/5.0 (compatible; SemrushBot/7~bl; +http://www.semrush.com/bot.html) 22203
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; ClaudeBot/1.0; [email address obscured]) 16981
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/116.0.1938.76 Safari/537.36 15979
Mozilla/5.0 (compatible; AhrefsBot/7.0; +http://ahrefs.com/robot/) 12112
Mozilla/5.0 (Linux; Android 7.0;) AppleWebKit/537.36 (KHTML, like Gecko) Mobile Safari/537.36 (compatible; PetalBot;+https://webmaster.petalsearch.com/site/petalbot) 10744
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Amazonbot/0.1; +https://developer.amazon.com/support/amazonbot) Chrome/119.0.6045.214 Safari/537.36 9704
Mozilla/5.0 (compatible; DotBot/1.2; +https://opensiteexplorer.org/dotbot; [email address obscured]) 7434
Mozilla/5.0 (compatible; AwarioBot/1.0; +https://awario.com/bots.html) 5042
Mozilla/5.0 (compatible; DataForSeoBot/1.0; +https://dataforseo.com/dataforseo-bot) 3104
Linguee Bot (http://www.linguee.com/bot; [email address obscured]) 1957
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/605.1.15 (KHTML, like Gecko) Version/17.4 Safari/605.1.15 (Applebot/0.1; +http://www.apple.com/go/applebot) 1450
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/5741de8 warc/v0.8.85 751
Mozilla/5.0 (compatible; SemrushBot-BA; +http://www.semrush.com/bot.html) 651
Mozilla/5.0 (compatible; YandexBot/3.0; +http://yandex.com/bots) 567
Mozilla/5.0 (compatible; MJ12bot/v2.0.2; http://mj12bot.com/) 538
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.183 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 531
Mozilla/5.0 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 411
Blogtrottr/2.1 (+https://blogtrottr.com/robot) 340
Googlebot-Image/1.0 315
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/100.0.4896.127 Safari/537.36 300
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; GPTBot/1.2; +https://openai.com/gptbot) 261
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_15_7) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/131.0.0.0 Safari/537.36; compatible; OAI-SearchBot/1.0; +https://openai.com/searchbot 240
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/137.0.7151.119 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 207
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; bingbot/2.0; +http://www.bing.com/bingbot.htm) Chrome/136.0.0.0 Safari/537.36 191
ZoominfoBot (zoominfobot at zoominfo dot com) 164
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko); compatible; ChatGPT-User/1.0; +https://openai.com/bot 150
Mozilla/5.0 (compatible; SeznamBot/4.0; +https://o-seznam.cz/napoveda/vyhledavani/en/seznambot-crawler/) 124
Mozilla/5.0 (compatible; MojeekBot/0.11; +https://www.mojeek.com/bot.html) 107
Mozilla/5.0 (compatible; archive.org_bot +http://archive.org/details/archive.org_bot) Zeno/a7797cb warc/v0.8.78 89
DuckDuckBot/1.1; (+http://duckduckgo.com/duckduckbot.html) 82
Mozilla/5.0 (compatible;PetalBot;+https://webmaster.petalsearch.com/site/petalbot) 80
Mozilla/5.0 (compatible; YandexImages/3.0; +http://yandex.com/bots) 70
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; PerplexityBot/1.0; +https://perplexity.ai/perplexitybot) 62
Mozilla/5.0 (compatible; YaK/1.0; http://linkfluence.com/; [email address obscured]) 61
Mozilla/5.0 (compatible; coccocbot-image/1.0; +http://help.coccoc.com/searchengine) 56
Mozilla/5.0 (compatible; wpbot/1.3; +https://forms.gle/ajBaxygz9jSR8p8G9) 41
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/137.0.7151.119 Safari/537.36 37
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.168 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 29
Twitterbot/1.0 29
AdsBot-Google (+http://www.google.com/adsbot.html) 22
Mozilla/5.0 (compatible; intelx.io_bot +https://intelx.io) 21
BufferLinkPreviewBot/1.0 (+https://scraper.buffer.com/about/bots/link-preview-bot) 19
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/99.0.4844.84 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 14
Mozilla/5.0 (compatible; Thinkbot/0.5.8; +In_the_test_phase,_if_the_Thinkbot_brings_you_trouble,_please_block_its_IP_address._Thank_you.) 13
Mozilla/5.0 (compatible; MJ12bot/v2.0.4; http://mj12bot.com/) 13
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/138.0.7204.183 Mobile Safari/537.36 (compatible; AdsBot-Google-Mobile; +http://www.google.com/mobile/adsbot.html) 9
Mozilla/4.0 (compatible; fluid/0.0; +http://www.leak.info/bot.html) 8
Mozilla/5.0 (Windows NT 10.0; Win64; x64; trendictionbot0.5.0; trendiction search; http://www.trendiction.de/bot; please let us know of any problems; web at trendiction.com) Gecko/20100101 Firefox/125.0 7
Googlebot/2.1 (+http://www.google.com/bot.html) 5
Mozilla/5.0 (compatible; SurdotlyBot/1.0; +http://sur.ly/bot.html) 5
Mozilla/5.0 (compatible; Website-info.net-Robot; https://website-info.net/robot) 4
Mozilla/5.0 (compatible; coccocbot-web/1.0; +http://help.coccoc.com/searchengine) 4
Pandalytics/2.0 (https://domainsbot.com/pandalytics/) 4
Mozilla/5.0 (compatible; IbouBot/1.0; [email address obscured]; +https://ibou.io/iboubot.html) 4
DomainStatsBot/1.0 (https://domainstats.com/pages/our-bot) 4
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/W.X.Y.Z Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 3
Mozilla/5.0 (compatible; SeekportBot; +https://bot.seekport.com) 3
serpstatbot/2.1 (advanced backlink tracking bot; https://serpstatbot.com/; [email address obscured]) 3
Mozilla/5.0 (Macintosh; Intel Mac OS X 10_11_1) AppleWebKit/601.2.4 (KHTML, like Gecko) Version/9.0.1 Safari/601.2.4 facebookexternalhit/1.1 Facebot Twitterbot/1.0 3
yacybot (-global; amd64 Linux 5.15.161; java 11.0.26-internal; America/en) http://yacy.net/bot.html 2
Mozilla/5.0 AppleWebKit/537.36 (KHTML, like Gecko; compatible; Googlebot/2.1; +http://www.google.com/bot.html) Chrome/120.0.6099.199 Safari/537.36 2
Mozilla/5.0 (compatible; Qwantbot/1.0_4396629; +https://help.qwant.com/bot/) 2
DomCopBot (https://www.domcop.com/bot) 2
Mozilla/5.0 (compatible; YandexFavicons/1.0; +http://yandex.com/bots) 2
yacybot (/global; amd64 Linux 5.15.0-140-generic; java 11.0.27; America/en) http://yacy.net/bot.html 2
Synapse (bot; +https://github.com/matrix-org/synapse) 2
Mozilla/5.0 (compatible; archive.org_bot; Wayback Machine Live Record; +http://archive.org/details/archive.org_bot) 1
Mozilla/5.0 (Linux; Android 6.0.1; Nexus 5X Build/MMB29P) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/117.0.0.0 Mobile Safari/537.36 (compatible; Googlebot/2.1; +http://www.google.com/bot.html) 1
Mozilla/5.0 (compatible; Qwantbot/1.0; +https://help.qwant.com/bot/) 1
Facebot 1
Mozilla/5.0 (compatible; GetHPinfo.com-Bot/0.1; +http://www.gethpinfo.com/bot/ 1
Slack-ImgProxy (+https://api.slack.com/robots) 1
Googlebot-Video/1.0 1
yacybot (/global; amd64 Linux 6.12.38+deb13-amd64; java 21.0.8; Europe/fr) http://yacy.net/bot.html 1
Mozilla/5.0 (iPhone; CPU iPhone OS 6_0 like Mac OS X) AppleWebKit/536.26 (KHTML, like Gecko) Version/6.0 Mobile/10A5376e Safari/8536.25 (compatible; SMTBot/1.0; +http://www.similartech.com/smtbot) 1
Mastodon/4.5.0-alpha.1+chuckya (http.rb/5.3.1; +https://lab.wheelsbot.dev/) 1

The output is plain text. I only formatted it to make it easier to read here (using "sd" for that). And the script is hardcoded to look for the apache2 logs where Debian-like systems keep them.
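Many rows in the table differ only in a version number (MJ12bot shows up three times, for instance), so summing by bot name gives a more compact view. The following is a minimal sketch of that idea, not part of the original script: `bot_name` and `group_by_bot` are hypothetical helpers, and picking "the first word containing bot" is only a heuristic that will mislabel some agents (for example, those where "bot" appears only in a URL).

```perl
use strict;
use warnings;

# Hypothetical helper: return the first word in the User-Agent that
# contains "bot" (case-insensitive), or undef when there is none.
sub bot_name {
    my ($ua) = @_;
    return $ua =~ /([\w.-]*bot[\w.-]*)/i ? $1 : undef;
}

# Hypothetical helper: sum per-User-Agent counts into per-bot counts.
sub group_by_bot {
    my (%per_agent) = @_;
    my %per_bot;
    while (my ($ua, $count) = each %per_agent) {
        my $name = bot_name($ua) // 'unknown';
        $per_bot{$name} += $count;
    }
    return %per_bot;
}

# A few rows copied from the table above.
my %sample = (
    'Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)' => 30130,
    'Mozilla/5.0 (compatible; MJ12bot/v2.0.2; http://mj12bot.com/)' => 538,
    'Mozilla/5.0 (compatible; MJ12bot/v2.0.4; http://mj12bot.com/)' => 13,
    'Googlebot-Image/1.0'                                           => 315,
);

my %per_bot = group_by_bot(%sample);
for my $name (sort { $per_bot{$b} <=> $per_bot{$a} } keys %per_bot) {
    print "$name => $per_bot{$name}\n";
}
```

In the full script, the same grouping could be applied to `%bot_agent` just before printing.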

The results, I confess, surprised me. That really is a lot of traffic coming from bots. And several of them I had never heard of.

As for blocking or not, for now I haven't touched anything and the bots keep accessing everything. Partly because I haven't done any qualitative analysis to find out whether they are getting 200 (OK) or something else, like 404 (Not Found).
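That qualitative check could be sketched roughly as follows. It assumes the logs use Apache's "combined" format (IP, identity, user, timestamp, request, status, size, referer, User-Agent); `parse_combined` and `count_status` are hypothetical names, and the sample lines are made up for illustration, with addresses from the documentation range 192.0.2.0/24. Requests containing escaped quotes would not match this simple pattern.

```perl
use strict;
use warnings;

# Assumed Apache "combined" layout:
# IP ident user [time] "request" status bytes "referer" "user-agent"
my $COMBINED = qr/^\S+ \S+ \S+ \[[^\]]+\] "[^"]*" (\d{3}) \S+ "[^"]*" "([^"]*)"/;

# Returns (status, user_agent), or an empty list if the line does not match.
sub parse_combined {
    my ($line) = @_;
    return $line =~ $COMBINED ? ($1, $2) : ();
}

# Builds { user_agent => { status => count } } for bot requests only.
sub count_status {
    my (@lines) = @_;
    my %status_by_agent;
    for my $line (@lines) {
        my ($status, $ua) = parse_combined($line) or next;
        next if $ua !~ /bot/;    # same filter as the original script
        $status_by_agent{$ua}{$status}++;
    }
    return \%status_by_agent;
}

# Made-up sample lines, for illustration only.
my @sample_lines = (
    '192.0.2.10 - - [01/Jan/2025:00:00:01 +0000] "GET / HTTP/1.1" 200 1234 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '192.0.2.10 - - [01/Jan/2025:00:00:02 +0000] "GET /missing HTTP/1.1" 404 321 "-" "Mozilla/5.0 (compatible; MJ12bot/v1.4.8; http://mj12bot.com/)"',
    '192.0.2.11 - - [01/Jan/2025:00:00:03 +0000] "GET / HTTP/1.1" 200 1234 "-" "Googlebot-Image/1.0"',
);

my $counts = count_status(@sample_lines);
for my $ua (sort keys %$counts) {
    print "$ua\n";
    print "  $_ => $counts->{$ua}{$_}\n" for sort keys %{ $counts->{$ua} };
}
```

A bot that mostly collects 404s is probing for pages that do not exist, which is a different situation from one that is crawling real content.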

But in case I do decide to block, I found a very interesting project on GitHub that already curates lists of "good" and "bad" bots.
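One low-effort way to act on such a list is a robots.txt file, keeping in mind that it is only a request and that badly behaved crawlers can simply ignore it. The sketch below is an illustration, not what that project does: `robots_txt_for` is a hypothetical helper, and the three names are picked from the top of the table purely as an example.

```perl
use strict;
use warnings;

# Hypothetical helper: build robots.txt groups asking each named crawler
# to stay away from the whole site.
sub robots_txt_for {
    my (@bots) = @_;
    return join("\n", map { "User-agent: $_\nDisallow: /\n" } @bots);
}

# Example selection only; which bots count as unwanted is a judgment call.
my @unwanted = ('MJ12bot', 'SemrushBot', 'AhrefsBot');
print robots_txt_for(@unwanted);
```

The output would go in a file named robots.txt at the root of the site. Bots that ignore it would need to be blocked at the web server instead.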
