If you're wondering where many AI companies and research labs get their data from:
https://commoncrawl.org/
Of course, the bigger ones like OpenAI run their own web crawlers, and they don't rely on the web alone: as a recent post discussed, torrents are a source too.
But if you're curious and want to experiment with large data sets yourself, that's a starting point.
Common Crawl is a non-profit organization. Even so, the ethics of web crawling remain debatable.
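
For anyone who wants to try the "experiment yourself" part, here is a rough sketch of how you might look up captures in Common Crawl's URL index without downloading a whole crawl. The crawl ID `CC-MAIN-2024-10`, the function names, and the User-Agent string are my own placeholders, not anything official; the list of currently available crawl IDs is published at https://index.commoncrawl.org/collinfo.json, so check there first.

```python
import json
import urllib.parse
import urllib.request

# Public index server for Common Crawl (one index per crawl).
INDEX_HOST = "https://index.commoncrawl.org"


def build_index_query(crawl_id, url_pattern, limit=5):
    """Build a query URL for one crawl's index.

    crawl_id is an assumption here (e.g. "CC-MAIN-2024-10");
    url_pattern may end in a wildcard such as "example.com/*".
    """
    params = urllib.parse.urlencode(
        {"url": url_pattern, "output": "json", "limit": limit}
    )
    return f"{INDEX_HOST}/{crawl_id}-index?{params}"


def parse_cdx_lines(text):
    """Parse the response body: one JSON object per non-empty line."""
    return [json.loads(line) for line in text.splitlines() if line.strip()]


def fetch_captures(crawl_id, url_pattern, limit=5):
    """Query the index over the network and return the parsed records."""
    request = urllib.request.Request(
        build_index_query(crawl_id, url_pattern, limit),
        headers={"User-Agent": "cc-index-example/0.1"},  # hypothetical UA
    )
    with urllib.request.urlopen(request, timeout=30) as response:
        return parse_cdx_lines(response.read().decode("utf-8"))
```

Calling `fetch_captures("CC-MAIN-2024-10", "example.com/*")` should give you a handful of records, each pointing at a file name and byte offset inside the archive, so you can pull individual pages instead of terabytes. Keep `limit` small and be gentle with the server.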