Author Topic: Common Crawl  (Read 198 times)

SiliconWizard (Topic starter)
Common Crawl
« on: February 21, 2025, 11:25:21 pm »
If you're wondering where many AI companies and research labs get their data from:

https://commoncrawl.org/

Of course, the bigger ones like OpenAI have their own web crawling tools, and they don't just use the web but also torrents, as a recent post discussed.

But if you were curious, and if you want to experiment with large data sets yourself, that's a starting point.
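For anyone who wants to poke at it, here is a rough sketch of how one could look up a page in the Common Crawl URL index and pull a single archived record, rather than downloading whole archive files. Treat it as a starting point under assumptions, not an official client: the crawl ID "CC-MAIN-2024-10" is just an example (pick a current one from the site), the User-Agent string and all function names are made up for illustration, and the record fields used ("filename", "offset", "length") are what the index is expected to return.

```python
import gzip
import json
from urllib.parse import urlencode
from urllib.request import Request, urlopen

INDEX_HOST = "https://index.commoncrawl.org"
DATA_HOST = "https://data.commoncrawl.org"
# Hypothetical User-Agent; replace with something that identifies you.
HEADERS = {"User-Agent": "cc-forum-example/0.1"}


def index_query_url(crawl_id, url):
    """Build an index query URL for one crawl, asking for JSON output."""
    params = urlencode({"url": url, "output": "json"})
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_index_lines(text):
    """The index is expected to answer with one JSON object per line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def byte_range(record):
    """HTTP Range header value covering one archived record (inclusive end)."""
    start = int(record["offset"])
    end = start + int(record["length"]) - 1
    return f"bytes={start}-{end}"


def fetch_record(record):
    """Download and gunzip a single record (needs network access)."""
    headers = dict(HEADERS, Range=byte_range(record))
    req = Request(f"{DATA_HOST}/{record['filename']}", headers=headers)
    with urlopen(req) as resp:
        return gzip.decompress(resp.read())


def demo(crawl_id="CC-MAIN-2024-10", url="commoncrawl.org/"):
    """Look up captures of `url` and print the start of the first record."""
    req = Request(index_query_url(crawl_id, url), headers=HEADERS)
    with urlopen(req) as resp:
        records = parse_index_lines(resp.read().decode("utf-8"))
    for rec in records:
        print(rec["url"], rec["filename"], byte_range(rec))
    if records:
        print(fetch_record(records[0])[:500].decode("utf-8", "replace"))
```

Calling demo() hits the network; the helper functions can be tried offline. The byte-range trick is the point here: each record is stored as its own gzip chunk, so you only download the few kilobytes you actually want.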

Common Crawl is a non-profit organization. Even for a non-profit, the ethics of web crawling are still debatable.
 

