Author Topic: multiple AI companies ignore robots.txt  (Read 1730 times)


Offline madiresTopic starter

  • Super Contributor
  • ***
  • Posts: 8829
  • Country: de
  • A qualified hobbyist ;)
multiple AI companies ignore robots.txt
« on: June 23, 2024, 01:38:55 pm »
If you're involved in running web servers or websites, you know that you can use robots.txt to control which parts of a site web crawlers may scan. Some AI companies ignore this well-established standard:
- Exclusive-Multiple AI companies bypassing web standard to scrape publisher sites, licensing firm says (https://finance.yahoo.com/news/exclusive-multiple-ai-companies-bypassing-143742513.html)
- Perplexity Is a Bullshit Machine (https://www.wired.com/story/perplexity-is-a-bullshit-machine/)
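For anyone who hasn't dealt with it: robots.txt is just a plain text file at the site root listing which paths each crawler (identified by its user-agent string) may fetch. A minimal example that asks a couple of the commonly documented AI training crawlers to stay away looks like this (check each company's docs for their current user-agent strings):

Code:
# Block some of the documented AI training crawlers entirely
User-agent: GPTBot
Disallow: /

User-agent: CCBot
Disallow: /

# Everyone else: stay out of /private/
User-agent: *
Disallow: /private/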
 
The following users thanked this post: SiliconWizard

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3958
  • Country: de
Re: multiple AI companies ignore robots.txt
« Reply #1 on: June 23, 2024, 02:09:26 pm »
This standard has been rather pointless, and ignored by malicious bots of all kinds, for as long as it has existed, because it relies entirely on the bot behaving nicely. Which malicious bots, spam bots and unscrupulous AI firms' bots don't do - by design. It is only marginally more useful than the Do Not Track flag in browsers: everyone is free to disregard it completely and you can't do anything about it.
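To illustrate how "voluntary" it is, here is a rough Python sketch of what a polite crawler does before fetching anything (the site URL and bot name are made up); a scraper that doesn't care simply skips the check:

Code:
# Rough sketch of the "honour system": a polite crawler checks robots.txt first.
# A scraper that doesn't care just fetches the page directly - nothing stops it.
from urllib import robotparser
import urllib.request

SITE = "https://example.com"        # placeholder site
USER_AGENT = "ExampleCrawler/1.0"   # made-up bot name

rp = robotparser.RobotFileParser()
rp.set_url(SITE + "/robots.txt")
rp.read()  # fetch and parse the rules

def polite_fetch(path):
    """Fetch a page only if robots.txt allows it for our user agent."""
    if not rp.can_fetch(USER_AGENT, SITE + path):
        return None  # respect the Disallow rule
    req = urllib.request.Request(SITE + path, headers={"User-Agent": USER_AGENT})
    with urllib.request.urlopen(req) as resp:
        return resp.read()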

Why anyone would expect various content scrapers (including the well-known search engines) to always respect it is rather beyond me. I think the real point of this "article" is this:

Quote
Multiple artificial intelligence companies are circumventing a common web standard used by publishers to block the scraping of their content for use in generative AI systems, content licensing startup TollBit has told publishers.
(emphasis mine)

The article is less about anything newsworthy than about a company rep trying to scaremonger and sell their "anti-AI scraping bot" snake-oil analytics/detection scheme to publishers worried about their precious content being summarized, IMO.
« Last Edit: June 23, 2024, 02:13:12 pm by janoc »
 
The following users thanked this post: SeanB

Offline SeanB

  • Super Contributor
  • ***
  • Posts: 16391
  • Country: za
Re: multiple AI companies ignore robots.txt
« Reply #2 on: June 23, 2024, 07:03:18 pm »
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.
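Rough sketch of what the serving side could look like, using only the Python standard library (the path, response size and port are arbitrary choices): one handler that answers anything under the disallowed directory with freshly generated junk and a current timestamp, so every request looks like a new file:

Code:
# Rough sketch of the tarpit: anything requested under /trap/ (the path you
# Disallow in robots.txt) gets freshly generated junk with a current
# Last-Modified timestamp, so every request looks like a new file.
import os
from email.utils import formatdate
from http.server import BaseHTTPRequestHandler, HTTPServer

class TrapHandler(BaseHTTPRequestHandler):
    def do_GET(self):
        if not self.path.startswith("/trap/"):
            self.send_error(404)
            return
        junk = os.urandom(64 * 1024)  # 64 kB of random bytes per request
        self.send_response(200)
        self.send_header("Content-Type", "application/pdf")  # lie about the type
        self.send_header("Last-Modified", formatdate(usegmt=True))  # always "fresh"
        self.send_header("Content-Length", str(len(junk)))
        self.end_headers()
        self.wfile.write(junk)

if __name__ == "__main__":
    HTTPServer(("", 8080), TrapHandler).serve_forever()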
 

Offline janoc

  • Super Contributor
  • ***
  • Posts: 3958
  • Country: de
Re: multiple AI companies ignore robots.txt
« Reply #3 on: June 23, 2024, 09:09:36 pm »
Quote from: SeanB on June 23, 2024, 07:03:18 pm
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start massively downloading this garbage. Also, most bots will abort the connection after they receive more than a certain amount of data, so at best you will only slow them down a bit - while paying the bills for the traffic that your real customers can't use.


 :-//
 

Online SiliconWizard

  • Super Contributor
  • ***
  • Posts: 17059
  • Country: fr
Re: multiple AI companies ignore robots.txt
« Reply #4 on: June 23, 2024, 10:58:56 pm »
Quote from: janoc on June 23, 2024, 09:09:36 pm
Quote from: SeanB on June 23, 2024, 07:03:18 pm
The simplest cure is probably to add a directory to robots.txt that you ask crawlers to ignore, then fill that directory with a few GB of random text spread across a few thousand files, along with some blobs of random data marked as Excel files, PDFs or the like - all small random files that will break the AI's input parser. If enough sites do that, AI scrapers will start to respect the limits. Yes, it means a bit of heavy traffic, and you will probably want to do a little web server work so that each file is served with the current date and time, and so that random content is returned for whatever filename is requested - something like a canary does.

That's a good way to either DDoS yourself or blow your egress traffic charges out of the water once the bots start massively downloading this garbage. Also, most bots will abort the connection after they receive more than a certain amount of data, so at best you will only slow them down a bit - while paying the bills for the traffic that your real customers can't use.


 :-//

Yes, fighting bots by saturating them with bogus data would work if storing and streaming said data were free. It's not.
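Back-of-envelope, with purely illustrative numbers: at a typical public-cloud egress rate of roughly $0.09/GB, handing each scraper visit a couple of GB of junk gets expensive fast:

Code:
# Back-of-envelope egress cost for feeding junk to scrapers.
# All numbers are illustrative assumptions, not measurements.
price_per_gb = 0.09    # USD per GB, typical public-cloud egress rate
gb_per_visit = 2.0     # junk served per scraper visit before it gives up
visits_per_day = 500   # scraper hits on the trap per day

monthly_cost = price_per_gb * gb_per_visit * visits_per_day * 30
print(f"~${monthly_cost:,.0f} per month just to feed the bots")  # ~$2,700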
 

