Author Topic: multiple AI companies ignore robots.txt  (Read 1730 times)


Offline madiresTopic starter

  • Super Contributor
  • ***
  • Posts: 8829
  • Country: de
  • A qualified hobbyist ;)
multiple AI companies ignore robots.txt
« on: June 23, 2024, 01:38:55 pm »
If you're involved in running web servers or websites, you know that you can use robots.txt to control which parts of a site web crawlers may scan. Some AI companies ignore this well-established standard:
- Exclusive-Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says (https://finance.yahoo.com/news/exclusive-multiple-ai-companies-bypassing-143742513.html)
- Perplexity Is a Bullshit Machine (https://www.wired.com/story/perplexity-is-a-bullshit-machine/)
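For anyone who hasn't dealt with it: robots.txt is just a plain text file at the site root listing which paths each crawler (identified by its user-agent string) may fetch. A minimal example that asks a couple of the commonly documented AI training crawlers to stay away looks like this (check each company's docs for their current user-agent strings):

Code:
# Block some of the documented AI training crawlers entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: stay out of /private/
User-agent: *
Disallow: /private/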
 
The following users thanked this post: SiliconWizard

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3958
  • Country: de
Re: multiple AI companies ignore robots.txt
« Reply #1 on: June 23, 2024, 02:09:26 pm »
This standard has been rather pointless, and ignored by malicious bots of all kinds, for as long as it has existed, because it relies entirely on the bot behaving nicely. Which malicious bots, spam bots and unscrupulous AI firms' bots don't do - by design. It is only marginally more useful than the Do Not Track flag in browsers: everyone is free to disregard it completely and you can't do anything about it.
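To illustrate how "voluntary" it is, here is a rough Python sketch of what a polite crawler does before fetching anything (the site URL and bot name are made up); a scraper that doesn't care simply skips the check:

Code:
# Rough sketch of the "honour system": a polite crawler checks robots.txt first.
# A scraper that doesn't care just fetches the page directly - nothing stops it.
from urllib import robotparser
import urllib.request

SITE = "https://example.com"        # placeholder site
USER_AGENT = "ExampleCrawler/1.0"   # made-up bot name

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetch and parse the rules

def polite_fetch(path):
    """Fetch a page only if robots.txt allows it for our user agent."""
    if not rp.can_fetch(USER_AGENT, SITE + path):
        return None  # respect the Disallow rule
    req = urllib.request.Request(SITE + path, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()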

Why anyone would expect various content scrapers (including the well-known search engines) to always respect it is rather beyond me. I think the real point of this "article" is this:

Quote
Multiple artificial intelligence companies are circumventing a common web standard used by publishers to block the scraping of their content for use in generative AI systems, content licensing startup TollBit has told publishers.
(emphasis mine)

The article is less about anything newsworthy than about a company rep trying to scaremonger and sell their "anti-AI scraping bot" snake-oil analytics/detection scheme to publishers worried about their precious content being summarized, IMO.
« Last Edit: June 23, 2024, 02:13:12 pm by janoc »
 
The following users thanked this post: SeanB

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16391
  • Country: za
Re: multiple AI companies ignore robots.txt
« Reply #2 on: June 23, 2024, 07:03:18 pm »
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.
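Rough sketch of what the serving side could look like, using only the Python standard library (the path, response size and port are arbitrary choices): one handler that answers anything under the disallowed directory with freshly generated junk and a current timestamp, so every request looks like a new file:

Code:
# Rough sketch of the tarpit: anything requested under /trap/ (the path you
# Disallow in robots.txt) gets freshly generated junk with a current
# Last-Modified timestamp, so every request looks like a new file.
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/trap/"):
            self.send_error(404)
            return
        junk = os.urandom(64 * 1024)  # 64 kB of random bytes per request
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")  # lie about the type
        self.send_header("Last-Modified", formatdate(usegmt=True))  # always "fresh"
        self.send_header("Content-Length", str(len(junk)))
        self.end_headers()
        self.wfile.write(junk)

if __name__ == "__main__":
    HTTPServer(("", 8080), TrapHandler).serve_forever()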
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3958
  • Country: de
Re: multiple AI companies ignore robots.txt
« Reply #3 on: June 23, 2024, 09:09:36 pm »
Quote from: SeanB on June 23, 2024, 07:03:18 pm
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start massively downloading this garbage. Also, most bots will abort the connection after they receive more than a certain amount of data, so at best you will only slow them down a bit - while paying the bills for the traffic that your real customers can't use.


 :-//
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17059
  • Country: fr
Re: multiple AI companies ignore robots.txt
« Reply #4 on: June 23, 2024, 10:58:56 pm »
Quote from: janoc on June 23, 2024, 09:09:36 pm
Quote from: SeanB on June 23, 2024, 07:03:18 pm
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start massively downloading this garbage. Also, most bots will abort the connection after they receive more than a certain amount of data, so at best you will only slow them down a bit - while paying the bills for the traffic that your real customers can't use.


 :-//

Yes, fighting bots by saturating them with bogus data would work if storing and streaming said data were free. It's not.
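Back-of-envelope, with purely illustrative numbers: at a typical public-cloud egress rate of roughly $0.09/GB, handing each scraper visit a couple of GB of junk gets expensive fast:

Code:
# Back-of-envelope egress cost for feeding junk to scrapers.
# All numbers are illustrative assumptions, not measurements.
price_per_gb = 0.09    # USD per GB, typical public-cloud egress rate
gb_per_visit = 2.0     # junk served per scraper visit before it gives up
visits_per_day = 500   # scraper hits on the trap per day

monthly_cost = price_per_gb * gb_per_visit * visits_per_day * 30
print(f"~${monthly_cost:,.0f} per month just to feed the bots")  # ~$2,700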
 

