this post was submitted on 07 Jul 2025

554 points (98.1% liked)

The Open-Source Software Saving the Internet From AI Bot Scrapers (www.404media.co)

submitted 1 week ago by fattyfoods@feddit.nl to c/opensource@lemmy.ml

108 comments

[–] bdonvr@thelemmy.club 29 points 1 week ago (7 children)

Ooh can this work with Lemmy without affecting federation?

[–] beyond@linkage.ds8.zone 30 points 1 week ago (1 children)

Yes.

Source: I use it on my instance and federation works fine

[–] bdonvr@thelemmy.club 16 points 1 week ago (1 children)

Thanks. Anything special configuring it?

[–] beyond@linkage.ds8.zone 20 points 1 week ago* (last edited 1 week ago)

I keep my server config in a public git repo, but I don't think you have to do anything really special to make it work with lemmy. Since I use Traefik I followed the guide for setting up Anubis with Traefik.

I don't expect to run into issues, as Anubis specifically looks for user-agent strings that look like human users (i.e. they contain the word "Mozilla", as most graphical web browsers do). Any request clearly coming from a bot that identifies itself is left alone, and Lemmy identifies itself as "Lemmy/{version} +{hostname}" in requests.
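To make that concrete, here is a minimal sketch of the user-agent check as described in this comment. It is not Anubis's actual code or policy format; the function name `needs_challenge` and the example version and hostname are made up for illustration.

```python
def needs_challenge(user_agent: str) -> bool:
    """Return True when the client looks like a graphical browser.

    Assumption: "looks like a browser" is reduced to the single rule
    mentioned above, namely that the user agent contains "Mozilla".
    """
    return "Mozilla" in user_agent


# A typical browser user agent, and a federation request in the
# "Lemmy/{version} +{hostname}" shape (version and hostname are hypothetical).
browser_ua = "Mozilla/5.0 (X11; Linux x86_64; rv:128.0) Gecko/20100101 Firefox/128.0"
lemmy_ua = "Lemmy/0.19.11 +lemmy.example.org"

print(needs_challenge(browser_ua))  # True: gets the challenge page
print(needs_challenge(lemmy_ua))    # False: passed straight through
```

A scraper that pretends to be a browser gets challenged, while a federating instance that identifies itself honestly does not.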

[–] deadcade@lemmy.deadca.de 11 points 1 week ago (1 children)

"Yes", for any bits the user sees. The frontend UI can be behind Anubis without issues. The API, including both user and federation, cannot. We expect "bots" to use an API, so you can't put human verification in front of it. These "bots* also include applications that aren't aware of Anubis, or unable to pass it, like all third party Lemmy apps.

That does stop almost all generic AI scraping, though it does not prevent targeted abuse.
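A rough sketch of that split, assuming a reverse proxy that decides per request whether to show the challenge. The path prefixes and the `route` helper are illustrative assumptions, not a tested Lemmy or Anubis configuration; check them against your own proxy setup.

```python
# Assumed prefixes for machine-facing endpoints (illustrative only).
API_PREFIXES = ("/api/", "/pictrs/", "/feeds/", "/nodeinfo/", "/.well-known/")

# Content types that ActivityPub federation traffic asks for.
ACTIVITYPUB_TYPES = ("application/activity+json", "application/ld+json")


def route(path: str, method: str = "GET", accept: str = "") -> str:
    """Return "backend" for API/federation traffic, "challenge" for the web UI."""
    if path.startswith(API_PREFIXES):
        return "backend"  # client API, used by third-party apps
    if method == "POST":
        return "backend"  # e.g. federation inbox deliveries
    if any(t in accept for t in ACTIVITYPUB_TYPES):
        return "backend"  # another instance fetching an object
    return "challenge"    # everything a human sees in a browser


print(route("/post/123", accept="text/html"))                             # challenge
print(route("/api/v3/post/list"))                                         # backend
print(route("/inbox", method="POST", accept="application/activity+json")) # backend
```

The unprotected paths are exactly why targeted abuse is still possible: anything routed to "backend" never sees the challenge.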

[–] interdimensionalmeme@lemmy.ml 8 points 1 week ago (2 children)

Yes, it would make Lemmy as unsearchable as Discord, instead of as unsearchable as Pinterest.

[–] infinitesunrise@slrpnk.net 5 points 1 week ago

Yeah, it's already deployed on slrpnk.net. I see it momentarily every time I load the site.

[–] seang96@spgrn.com 4 points 1 week ago

As long as it's not configured improperly. When the Forgejo devs added it, it broke downloading images with Kubernetes for a moment. Basically, you would need to make sure the user agent header used for federation is allowed.

[–] medem@lemmy.wtf 24 points 1 week ago (7 children)

What advantage does this software provide over simply banning bots via robots.txt?

[–] irotsoma@lemmy.blahaj.zone 27 points 1 week ago

TL;DR: You should have both due to the explicit breaking of the robots.txt contract by AI companies.

AI generally doesn't obey robots.txt. That file just notifies scrapers what they shouldn't scrape, and it relies on the good faith of the scrapers. Many AI companies have explicitly chosen not to comply with robots.txt, thus breaking the contract, so this is a system that causes those scrapers that are not willing to comply to get stuck in a black hole of junk and waste their time. This is a countermeasure, but not a solution. It's just way less complex than other options that just block these connections, but then make you get pounded with retries. This way the scraper bot gets stuck for a while and doesn't waste as many of your resources blocking them over and over again.
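To show why robots.txt only works on good faith: the check happens entirely on the crawler's side. A small sketch using Python's standard `urllib.robotparser`; the bot name `ExampleAIBot` and the URL are made up.

```python
from urllib.robotparser import RobotFileParser

# A robots.txt that asks one (hypothetical) AI crawler to stay out entirely.
ROBOTS_TXT = """\
User-agent: ExampleAIBot
Disallow: /

User-agent: *
Allow: /
"""

parser = RobotFileParser()
parser.parse(ROBOTS_TXT.splitlines())


def polite_fetch_allowed(agent: str, url: str) -> bool:
    """What a well-behaved crawler asks itself before requesting a URL."""
    return parser.can_fetch(agent, url)


url = "https://git.example.org/repo/commit/abc"
print(polite_fetch_allowed("ExampleAIBot", url))  # False
print(polite_fetch_allowed("SomeBrowser", url))   # True
```

Nothing on the server enforces the `False`. A scraper that never calls `can_fetch`, or that changes its user agent, simply sends the request anyway.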

[–] thingsiplay@beehaw.org 13 points 1 week ago

The difference is:

robots.txt is a promise without a door
Anubis is a physical closed door, that opens up after some time

[–] Mwa@thelemmy.club 8 points 1 week ago

The problem is AI doesn't follow robots.txt, so Cloudflare and Anubis developed a solution.

[–] refalo@programming.dev 21 points 1 week ago* (last edited 1 week ago) (4 children)

I don't understand how/why this got so popular out of nowhere... the same solution has already existed for years in the form of haproxy-protection and a couple others... but nobody seems to care about those.

[–] Kazumara@discuss.tchncs.de 12 points 1 week ago (1 children)

Just recently there was a guy on the NANOG list ranting that Anubis is the wrong approach and that people should just cache properly; then their servers would handle thousands of users and the bots wouldn't matter. Anyone who puts git online has no one to blame but themselves, e-commerce should just be made cacheable, etc. Seemed a bit idealistic, a bit detached from the current reality.

Ah found it, here

[–] deadcade@lemmy.deadca.de 14 points 1 week ago (1 children)

Someone making an argument like that clearly does not understand the situation. Just 4 years ago, a robots.txt was enough to keep most bots away, and hosting personal git on the web required very little resources. With AI companies actively profiting off stealing everything, a robots.txt doesn't mean anything. Now, even a relatively small git web host takes an insane amount of resources. I'd know - I host a Forgejo instance. Caching doesn't matter, because diffs between two random commits are likely unique. Ratelimiting doesn't matter, they will use different IP (ranges) and user agents. It would also heavily impact actual users "because the site is busy".

A proof-of-work solution like Anubis is the best we have currently. The least possible impact to end users, while keeping most (if not all) AI scrapers off the site.

[–] interdimensionalmeme@lemmy.ml 2 points 1 week ago (1 children)

This would not be a problem if one bot scraped once, and the result was then mirrored to all on Big Tech's dime (cloudflare, tailscale) but since they are all competing now, they think their edge is going to be their own more better scraper setup and they won't share.

Maybe there should just be a web-to-torrent bridge, so the data is pushed out once by the server and the swarm does the heavy lifting as a cache.

[–] deadcade@lemmy.deadca.de 2 points 1 week ago (1 children)

No, it'd still be a problem; every diff between commits is expensive to render to web, even if "only one company" is scraping it, "only one time". Many of these applications are designed for humans, not scrapers.
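Some back-of-the-envelope arithmetic on why those diff pages add up, assuming a forge that exposes a compare view for any pair of commits. The helper name and the commit counts are made up for illustration.

```python
from math import comb


def distinct_diff_pages(commits: int) -> int:
    """Number of unordered commit pairs, each of which can be its own diff URL."""
    return comb(commits, 2)


for n in (100, 1_000, 10_000):
    print(n, distinct_diff_pages(n))
```

A repository with 1,000 commits already has 499,500 possible commit pairs, so a crawler walking those links keeps asking for pages nobody has rendered before.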

[–] RedSnt@feddit.dk 10 points 1 week ago

Brodie interviewed the creator of Anubis a little while back, it's pretty good.

[–] interdimensionalmeme@lemmy.ml 8 points 1 week ago

Open source is also the AI scraper bots AND the internet itself, it is every character in the story.

[–] not_amm@lemmy.ml 7 points 1 week ago

I had seen that prompt, but never searched about it. I found it a little annoying, mostly because I didn't know what it was for, but now I won't mind. I hope more solutions are developed :D

[–] DrunkAnRoot@sh.itjust.works 2 points 1 week ago

It won't protect more than one subdomain, I think.
