Author Topic: How does the Wayback Machine access all those pages?  (Read 5606 times)


Offline Analog Kid (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4381
  • Country: us
  • DANDY fan (Discretes Are Not Dead Yet)
How does the Wayback Machine access all those pages?
« on: February 16, 2025, 02:47:39 am »
I just used the Wayback Machine (archive.org) to access an article that was behind a paywall.
Original article: www.wired.com/story/andrew-degraff-mapping-literary-journeys
Archived article: https://web.archive.org/web/20250216023626/https://www.wired.com/story/andrew-degraff-mapping-literary-journeys/

(I was actually the first one to ask for this page at Wayback, so they archived it while I waited)

My question is, how do they manage to access content that's behind someone's paywall, as here?
Do they have legit subscriptions to every website that has a paywall?
Is there some sneaky backdoor way they can extract this content?

Any more, when I run into an article that's paywalled (an increasingly likely situation), I just hop on over to archive.org and invite myself in.
All those publishers must know about this.

What's the deal here?
 

Online ledtester

  • Super Contributor
  • ***
  • Posts: 4005
  • Country: us
Re: How does the Wayback Machine access all those pages?
« Reply #1 on: February 16, 2025, 04:16:57 am »
WIRED allows a certain number of complimentary articles per month. I was able to use the archive.org link a few times and then got this message:

Quote
You’ve read your last complimentary article this month. Subscribe Now. If you're already a subscriber sign in.

But the full article text had already been loaded in the page; some JavaScript then put up the above message in its place.

You can see what happens when you disable JavaScript by using 12ft.io:

https://12ft.io/http://www.wired.com/story/andrew-degraff-mapping-literary-journeys

and you can see that you get all of the text.

So WIRED is giving you the entire article in this case and then using JavaScript to enforce the paywall.
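To illustrate the point, here is a minimal Python sketch of what a client that never runs scripts would see. The sample page, the `ParagraphText` class and the `article_paragraphs` helper are all made up for this example; this is not WIRED's actual markup, only an assumed page where the text is sent in the HTML and a script swaps it out afterwards.

```python
from html.parser import HTMLParser


class ParagraphText(HTMLParser):
    """Collect the text of <p> elements; script bodies are never executed."""

    def __init__(self):
        super().__init__()
        self.in_p = False
        self.paragraphs = []
        self._buf = []

    def handle_starttag(self, tag, attrs):
        if tag == "p":
            self.in_p = True
            self._buf = []

    def handle_endtag(self, tag):
        if tag == "p" and self.in_p:
            self.in_p = False
            text = "".join(self._buf).strip()
            if text:
                self.paragraphs.append(text)

    def handle_data(self, data):
        # Script content arrives here as plain data, but outside any <p>,
        # so it is ignored.
        if self.in_p:
            self._buf.append(data)


def article_paragraphs(html):
    """Return the paragraph text present in the raw HTML."""
    parser = ParagraphText()
    parser.feed(html)
    parser.close()
    return parser.paragraphs


# Hypothetical page: the article is in the HTML, and a script would
# replace it with a subscription notice in a normal browser.
SAMPLE_PAGE = """
<html><body>
<article>
<p>First paragraph of the article.</p>
<p>Second paragraph of the article.</p>
</article>
<script>
document.querySelector("article").innerHTML = "<p>Subscribe now.</p>";
</script>
</body></html>
"""

if __name__ == "__main__":
    for paragraph in article_paragraphs(SAMPLE_PAGE):
        print(paragraph)
```

Since nothing here executes the script, the "Subscribe now." replacement never happens and both article paragraphs come out. With a server-side paywall the paragraphs would simply not be in the HTML to begin with.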
 

Offline Analog Kid (Topic starter)

  • Super Contributor
  • ***
  • Posts: 4381
  • Country: us
  • DANDY fan (Discretes Are Not Dead Yet)
Re: How does the Wayback Machine access all those pages?
« Reply #2 on: February 16, 2025, 04:59:24 am »
WIRED allows a certain number of complimentary articles per month. I was able to use the archive.org link a few times and then got this message:

Quote
You’ve read your last complimentary article this month. Subscribe Now. If you're already a subscriber sign in.

Hmmm; that's exactly what I got the very first time I tried to access the Wired article (and I hadn't accessed Wired at all before this), whereas I was able to access the Wayback version right away. So do you mean that if I try to access the Wayback version a certain number of times, the paywall will come up?

And does that mean that I can avoid paywalls by simply disabling JavaScript?

This stuff makes my head hurt.
« Last Edit: February 16, 2025, 05:01:13 am by Analog Kid »
 

Offline Whales

  • Super Contributor
  • ***
  • Posts: 2627
  • Country: au
    • Halestrom
Re: How does the Wayback Machine access all those pages?
« Reply #3 on: February 16, 2025, 06:04:29 am »
Some paywalls are JavaScript + CSS only, so they still send you a copy of the article but then try to hide it.  Others are implemented server-side, so they don't send you the full article unless you're logged in.  Whatever they felt like doing.

uBlock Origin's advanced features can be used to disable JavaScript on a website; I use it all the time: https://addons.mozilla.org/en-US/firefox/addon/ublock-origin/

This addon lets you disable all page styles, which works sometimes:  https://addons.mozilla.org/en-US/firefox/addon/margin_annihilator/
 

Offline golden_labels

  • Super Contributor
  • ***
  • Posts: 2305
  • Country: pl
Re: How does the Wayback Machine access all those pages?
« Reply #4 on: February 16, 2025, 08:14:28 am »
My question is, how do they manage to access content that's behind someone's paywall, as here?
Paywalls are complicated and there is no single answer. In particular, I can’t say anything about that specific pair (Wayback Machine + Wired).

For a start, not all clients are served the same content. It’s often determined dynamically. Legal limitations, providing samples to lure the target, geodiscrimination, whatever information they already have about you, blocking the unwelcome, political affiliations, agreements with 3rd parties: pretty much anything can affect whether you see a paywall or not. For example, I have never seen a paywall on Wired, and I can read the linked article with no issues either. So it’s possible that this alone determined that the crawler was able to access the content.

Many paywalled sites intentionally provide full content to search engine crawlers, to lure an audience or to limit the visibility of the competition. While the Wayback Machine isn’t a search engine, it ends up marked as one in many classifiers. Most companies slurp those lists indiscriminately or delegate the task to other parties. It’s not unexpected for a crawler to be served as if it were a search engine even though it isn’t one.
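As a rough sketch of that mechanism, here is how a server-side decision based on a crawler list might look in Python. Everything in it is an assumption made for illustration: the token list stands in for a third-party classifier, `example-archive-bot` is a made-up name, and `is_treated_as_crawler` and `select_body` are hypothetical helpers. It doesn't describe how Wired or any other site actually decides.

```python
# Stand-in for a third-party "search engine crawler" list. The last entry
# is a hypothetical archive crawler lumped in with the search engines.
CRAWLER_TOKENS = ("googlebot", "bingbot", "example-archive-bot")


def is_treated_as_crawler(user_agent, tokens=CRAWLER_TOKENS):
    """Return True if the User-Agent string matches any listed token."""
    ua = user_agent.lower()
    return any(token in ua for token in tokens)


def select_body(user_agent, full_text, teaser_len=40):
    """Serve the full text to listed crawlers and a teaser to everyone else."""
    if is_treated_as_crawler(user_agent):
        return full_text
    return full_text[:teaser_len] + "..."


if __name__ == "__main__":
    article = "Full article text. " * 10
    print(len(select_body("Mozilla/5.0 (compatible; example-archive-bot)", article)))
    print(len(select_body("Mozilla/5.0 (X11; Linux x86_64) Firefox", article)))
```

The point is that the site never has to decide anything about the archive specifically. If the list it consumes includes the archive's crawler, the crawler gets the same treatment as a search engine.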

All those publishers must know about this. What's the deal here?
Must they? Don’t overestimate general population’s knowledge and abilities. :)

Ignoring that and assuming they know: the knowledge does nothing by itself. Action is needed. But what would the action be? The Archive doesn’t own assets big enough to make suing them profitable. Not big enough for a business with $2bn/yr revenue, and even less so when expenses and risks are taken into account. DMCA-ing entries? That requires hiring people, which would only be viable if the losses were considerable. And what losses can they have from the 0.01% of outliers going to the Wayback Machine, outliers who are unlikely to ever subscribe anyway? Blocking the Wayback Machine? The same problem. And imagine all the meetings needed, all the decisions at each management level, consultations regarding brand image impact, determining how much financial harm lower exposure on Wikipedia would bring, hiring experts to determine the security and reliability impact of changing access policies, having to pay somebody to maintain that change. C’mon, be reasonable.

Finally: companies are not conscious beings. They have legal personhood, but they aren’t literal persons. They don’t have personality, they don’t have thoughts, beliefs, fears or desires. It may be a convenient mental shortcut to say “they know” or “they may do this or that.” But in the end, who are the they? The they, who have actual thoughts, beliefs, fears and desires? It’s not the organisation. It’s the people hired. Do you think some Conde Nast manager, facing a paywall, is subscribing? Or, if they can use computers well enough, do they go to the Wayback Machine just like you? Unless forced to by work obligations, it’s not in their interest to block anything.

« Last Edit: February 16, 2025, 08:29:11 am by golden_labels »
Why 📎 | We live in times when half of people have IQ below 100.
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6747
  • Country: au
Re: How does the Wayback Machine access all those pages?
« Reply #5 on: February 16, 2025, 09:21:01 am »
Some paywalls are easily bypassed by not executing scripts (for example), leaving the underlying content perfectly accessible.

There are also other mechanisms to read paywalled sites, such as how https://12ft.io works.
 

