this post was submitted on 29 May 2024

Consider https://arstechnica.com/robots.txt or https://www.nytimes.com/robots.txt and how they block all the stupid AI models from being able to scrape for free.

[–] farting_weedman@hexbear.net 1 points 5 months ago

No, robots.txt doesn’t solve this problem. Scrapers just ignore it. The idea behind robots.txt was to be nice to the poor Google web crawlers and steer them away from useless stuff that was a waste to index.

A crawler could still be fastidious and follow every link; it would just be ignoring the “nothing to see here” signs.
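
To make the "asking nicely" part concrete, here's a rough sketch using Python's standard `urllib.robotparser`. The robots.txt text, the `example.com` URLs, and the `is_allowed` and `urls_to_fetch` helpers are made up for illustration; only `RobotFileParser`, `parse`, and `can_fetch` come from the standard library. The thing to notice is that the check is something the crawler chooses to run on its own side.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt, loosely modeled on the rules news sites publish.
ROBOTS_TXT = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /search
"""


def is_allowed(user_agent: str, url: str) -> bool:
    """Return what robots.txt *requests* for this user agent and URL."""
    parser = RobotFileParser()
    parser.parse(ROBOTS_TXT.splitlines())
    return parser.can_fetch(user_agent, url)


def urls_to_fetch(user_agent: str, urls: list[str], respect_robots: bool) -> list[str]:
    """A polite crawler filters its queue; a rude one skips the check entirely."""
    if not respect_robots:
        return list(urls)
    return [u for u in urls if is_allowed(user_agent, u)]


queue = ["https://example.com/article", "https://example.com/search?q=x"]
polite = urls_to_fetch("GPTBot", queue, respect_robots=True)
rude = urls_to_fetch("GPTBot", queue, respect_robots=False)
```

With these made-up rules, `polite` ends up empty and `rude` is the whole queue. Nothing on the server side enforced either outcome.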

You beat scrapers with recursive loops of links that start from 4pt black-on-black divs and lead to pages whose content isn’t easily told apart from useful human-created content.
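
A minimal sketch of that kind of link maze, under a pile of assumptions: the `/archive/` prefix, the function names, and the filler text are all hypothetical, and a real trap would need page text that actually passes for human writing. Each trap page derives its outgoing links by hashing its own path, so every page leads to more trap pages and the server doesn't have to store any of them.

```python
import hashlib

TRAP_PREFIX = "/archive/"  # hypothetical URL prefix reserved for trap pages


def child_paths(path: str, fanout: int = 3) -> list[str]:
    """Derive deterministic child links from a path, so every page links deeper."""
    paths = []
    for i in range(fanout):
        digest = hashlib.sha256(f"{path}#{i}".encode()).hexdigest()[:12]
        paths.append(f"{TRAP_PREFIX}{digest}")
    return paths


def entry_link_html() -> str:
    """Tiny black-on-black link to drop into real pages; people won't see it."""
    return (
        '<div style="font-size:4pt;color:#000;background:#000">'
        f'<a href="{TRAP_PREFIX}start">more</a></div>'
    )


def trap_page_html(path: str) -> str:
    """Render one trap page whose only links lead to further trap pages."""
    links = "".join(f'<li><a href="{p}">{p}</a></li>' for p in child_paths(path))
    # Placeholder body text; this is the part that would need to look human-written.
    return f"<html><body><p>Filler text for {path}</p><ul>{links}</ul></body></html>"


page = trap_page_html(TRAP_PREFIX + "start")
```

Because the links are computed from the path, a scraper that dedupes visited URLs still never runs out of new ones to follow. You'd also want `Disallow: /archive/` in robots.txt so well-behaved crawlers stay out and only the ones ignoring the signs fall in.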

Traps and poison, not asking nicely.