
multiple AI companies ignore robots.txt


madires:
If you're involved in running web servers or websites, you know that you can use robots.txt to control what web crawlers will and won't scan. Some AI companies ignore this well-established standard:
- Exclusive-Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says (https://finance.yahoo.com/news/exclusive-multiple-ai-companies-bypassing-143742513.html)
- Perplexity Is a Bullshit Machine (https://www.wired.com/story/perplexity-is-a-bullshit-machine/)
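
For reference, a minimal robots.txt that tries to keep the AI crawlers out looks something like this - GPTBot (OpenAI) and CCBot (Common Crawl) are the user agents those crawlers advertise, while the /private/ path is just an illustration:

--- Code: ---User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

User-agent: *
Disallow: /private/
--- End code ---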

janoc:
This standard has been rather pointless and ignored by malicious bots of all kinds for as long as it has existed, because it relies entirely on the bot behaving nicely - which malicious bots, spam bots and unscrupulous AI firms' bots don't do, by design. It is only marginally more useful than the Do Not Track flag in browsers: everyone is free to disregard it completely, and you can't do anything about it.
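
To make the "behaving nicely" part concrete, here is a minimal sketch of what a compliant crawler does, using Python's standard urllib.robotparser with some toy rules - the whole check happens on the client's side, so a scraper that simply never runs it sees no obstacle at all:

--- Code: ---from urllib import robotparser

# Toy rules; GPTBot is a real crawler user agent, the paths are made up.
RULES = """\
User-agent: GPTBot
Disallow: /

User-agent: *
Disallow: /private/
"""

rp = robotparser.RobotFileParser()
rp.parse(RULES.splitlines())

# A well-behaved crawler asks before every fetch...
print(rp.can_fetch("GPTBot", "https://example.com/articles/1"))   # False
print(rp.can_fetch("SomeBot", "https://example.com/articles/1"))  # True

# ...but nothing enforces the answer: skipping the check, or lying about
# the user agent string, costs the scraper nothing.
--- End code ---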

Why anyone would expect that various content scrapers (including the well-known search engines) would always respect it is rather beyond me. I think the real point of this "article" is rather this:


--- Quote ---Multiple artificial intelligence companies are circumventing a common web standard used by publishers to block the scraping of their content for use in generative AI systems, content licensing startup TollBit has told publishers.
--- End quote ---
(emphasis mine)

The article is rather more about a company rep trying to scaremonger and sell their "anti-AI scraping bot" snake-oil analytics/detection scheme to publishers worried about their precious content being summarized than about anything newsworthy, IMO.

SeanB:
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, and fill it with a few GB of random text spread across a few thousand files, along with some blobs of random data served as an Excel file, a PDF or something similar - all small random files that will break the AI's input parser. If enough companies do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file reports the current date and time and filename requests are answered with random files, something like a canary does.
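
Something along the lines of this rough sketch (Python standard library only; the /decoy/ path, the size range and the port are placeholders - the same path would be the one you list in robots.txt as off-limits):

--- Code: ---import os
import random
from datetime import datetime, timezone
from email.utils import format_datetime
from http.server import BaseHTTPRequestHandler, HTTPServer


class DecoyHandler(BaseHTTPRequestHandler):
    # File types we pretend to serve, picked at random per request.
    CONTENT_TYPES = [
        "text/plain",
        "application/pdf",
        "application/vnd.ms-excel",
    ]

    def do_GET(self):
        if not self.path.startswith("/decoy/"):
            self.send_error(404)
            return
        # Every filename gets a fresh random blob and a current timestamp,
        # so the same URL never returns the same "document" twice.
        body = os.urandom(random.randint(1_000, 50_000))
        self.send_response(200)
        self.send_header("Content-Type", random.choice(self.CONTENT_TYPES))
        self.send_header("Content-Length", str(len(body)))
        self.send_header(
            "Last-Modified",
            format_datetime(datetime.now(timezone.utc), usegmt=True),
        )
        self.end_headers()
        self.wfile.write(body)


if __name__ == "__main__":
    HTTPServer(("0.0.0.0", 8080), DecoyHandler).serve_forever()
--- End code ---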

janoc:

--- Quote from: SeanB on June 23, 2024, 07:03:18 pm ---The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, and fill it with a few GB of random text spread across a few thousand files, along with some blobs of random data served as an Excel file, a PDF or something similar - all small random files that will break the AI's input parser. If enough companies do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file reports the current date and time and filename requests are answered with random files, something like a canary does.

--- End quote ---

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start downloading this garbage en masse. Also, most bots will abort the connection after they have received more than a certain amount of data, so at best you will maybe slow them down - while paying the bills for the traffic that your real customers can't use.
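
For a sense of scale, a quick back-of-envelope - every number below is an assumption picked purely for illustration, not a measurement:

--- Code: ---# Back-of-envelope egress cost for the decoy scheme. All numbers here are
# assumptions for illustration, not measurements.
decoy_size_gb = 5          # size of the random-garbage directory
scrapers = 20              # distinct bots hitting the site
fetches_per_day = 3        # full crawls per bot per day
egress_usd_per_gb = 0.09   # assumed cloud egress price

gb_per_month = decoy_size_gb * scrapers * fetches_per_day * 30
cost_per_month = gb_per_month * egress_usd_per_gb

print(f"{gb_per_month:,} GB/month of garbage served")    # 9,000 GB/month
print(f"~${cost_per_month:,.0f}/month in egress alone")  # ~$810/month
--- End code ---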


 :-//

SiliconWizard:

--- Quote from: janoc on June 23, 2024, 09:09:36 pm ---
--- Quote from: SeanB on June 23, 2024, 07:03:18 pm ---The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, and fill it with a few GB of random text spread across a few thousand files, along with some blobs of random data served as an Excel file, a PDF or something similar - all small random files that will break the AI's input parser. If enough companies do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file reports the current date and time and filename requests are answered with random files, something like a canary does.

--- End quote ---

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start downloading this garbage en masse. Also, most bots will abort the connection after they have received more than a certain amount of data, so at best you will maybe slow them down - while paying the bills for the traffic that your real customers can't use.


 :-//

--- End quote ---

Yes, fighting bots by saturating them with bogus data would work if storing and streaming said data were "free". It's not.
