Author Topic: Wikipedia download  (Read 1490 times)


Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Wikipedia download
« on: March 19, 2022, 11:07:02 am »
I want to download a copy of Wikipedia, primarily for the EN, FR and RO languages and, depending on the size, maybe for the DE and HU languages, too.  I've searched for how to download Wikipedia, but it's not clear to me how to do it, what to save, and in what format:  https://dumps.wikimedia.org/

These are the must-haves for a copy:
  • Working 100% offline, with no Internet connection available
  • All formulas, diagrams, pics and multimedia included
  • Searchable

Other nice-to-haves would be:
  • OS independent
  • Incremental updates on request only, after diffing the offline version against the online one (a rough sketch of the idea follows this list)
  • Automated data access (counting words, statistics, ML, etc.)
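
For the update/diff part, this is roughly the kind of check I have in mind, just a sketch against the public MediaWiki API (the page title and the stored timestamp below are made up for illustration):
Code: [Select]
import json
import urllib.parse
import urllib.request

# Ask the MediaWiki API for the latest revision of a page (revision id + timestamp).
def latest_online_revision(title, lang="en"):
    params = urllib.parse.urlencode({
        "action": "query",
        "prop": "revisions",
        "titles": title,
        "rvprop": "ids|timestamp",
        "format": "json",
        "formatversion": "2",
    })
    url = "https://%s.wikipedia.org/w/api.php?%s" % (lang, params)
    req = urllib.request.Request(url, headers={"User-Agent": "offline-wiki-check/0.1"})
    with urllib.request.urlopen(req) as resp:
        data = json.load(resp)
    rev = data["query"]["pages"][0]["revisions"][0]
    return rev["revid"], rev["timestamp"]

# Timestamp of the copy in the offline dump (made-up value for illustration).
offline_timestamp = "2017-03-01T00:00:00Z"

revid, online_timestamp = latest_online_revision("Transmission line")
if online_timestamp > offline_timestamp:   # ISO 8601 strings compare chronologically
    print("changed online, rev", revid, "at", online_timestamp, "-> re-download this page")
else:
    print("offline copy is still current")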



So far, my understanding is that the available formats are:
- HTML version (but very old; the latest Wikipedia available as static HTML is from 2008)
- MySQL version
- XML version (a download sketch follows the next list)
- Wikipedia backup??
- Wikipedia for mirroring??

Then there is:
- with talk and user pages - no idea what these are
- text only
- current
- current with versioning (full edit history)
- only removed articles
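
For the XML route, my guess is that grabbing a dump boils down to something like the sketch below.  I'm assuming the '<lang>wiki-latest-pages-articles.xml.bz2' naming I've seen in the dumps.wikimedia.org listings (current revisions only, no talk/user pages), and I haven't verified the exact file names or sizes:
Code: [Select]
import shutil
import urllib.request

# Fetch the current-revisions, articles-only dump for one wiki.
# File naming assumed from the dumps.wikimedia.org listings; check before relying on it.
def download_dump(wiki, dest):
    url = ("https://dumps.wikimedia.org/%s/latest/%s-latest-pages-articles.xml.bz2"
           % (wiki, wiki))
    req = urllib.request.Request(url, headers={"User-Agent": "offline-wiki-fetch/0.1"})
    with urllib.request.urlopen(req) as resp, open(dest, "wb") as out:
        shutil.copyfileobj(resp, out)   # stream to disk instead of loading it all in RAM
    print("saved", dest)

for wiki in ("rowiki", "frwiki", "enwiki"):   # RO, FR, EN; the EN one is tens of GB
    download_dump(wiki, wiki + "-latest-pages-articles.xml.bz2")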

Since I've read that in recent years many stubs were created by bots, in an automated way, I thought I might want to get 2-3 versions, say 5 years apart, just in case some damage slipped in unnoticed during the past years.  I'm interested mostly in science and engineering pages.

Any guidance would be appreciated.

What format to download?
What total size to expect?

Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Re: Wikipedia download
« Reply #1 on: March 19, 2022, 11:51:46 am »
I've downloaded the static HTML version (2008) for the Romanian language only, because it was smaller (~300 MB, expands to about 4.5 GB), but it is text only and includes no images.    :-//
https://dumps.wikimedia.org/other/static_html_dumps/

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: Wikipedia download
« Reply #2 on: March 19, 2022, 12:41:11 pm »
Talking about the HTML version:

For something like this, 'wget' and 'curl' aren't enough, so I've developed a little tool that can parse whatever comes from a URL on the web, find the links and images, and download them recursively with a little name manipulation (renaming the files).

The problem is that nowadays HTML5 is everywhere and websites (including some wiki-like ones) are full of crap, JavaScript and related stuff, so basic HTML (what we used back in the 2000s) isn't enough.

Code: [Select]
# myNET-get-files-from-url-v9 url
list_preparing, getting /index.html ... success
preparing /list ... done
list_preparing, getting /index.html ... success
preparing /list ... done
/auth/wiki_course_new/ is ignored
downloading [/2021_Summary_MIT.pdf] ... success
preparing index /wiki_course_unit_1 ... success
downloading [/wiki_course_unit_1.htm] ... success
preparing folder /wiki_course_unit_1_files ... success
downloading [/wiki_course_unit_1_files/filelist.xml] ... success
downloading [/wiki_course_unit_1_annotated.pdf] ... success
list_preparing, getting /wiki_course_unit_1_files/index.html ... success
preparing /wiki_course_unit_1_files/list ... done
/auth/wiki_course_new/lectures/ is ignored
downloading [/wiki_course_unit.html] ... success
downloading [/wiki_course_unit_1_files/colorschememapping.xml] ... success
downloading [/wiki_course_unit_1_files/image001.png] ... success
downloading [/wiki_course_unit_1_files/image002.jpg] ... success
downloading [/wiki_course_unit_1_files/image002.png] ... success
downloading [/wiki_course_unit_1_files/image003.png] ... success
downloading [/wiki_course_unit_1_files/image004.jpg] ... success
downloading [/wiki_course_unit_1_files/image004.png] ... success
downloading [/wiki_course_unit_1_files/image005.png] ... success
downloading [/wiki_course_unit_1_files/image006.jpg] ... success
downloading [/wiki_course_unit_1_files/image006.png] ... success
downloading [/wiki_course_unit_1_files/image007.png] ... success
downloading [/wiki_course_unit_1_files/image008.jpg] ... success
downloading [/wiki_course_unit_1_files/image008.png] ...

To download images and PDFs, I had to add a small JavaScript interpreter that can figure out how to pass the image URL to 'wget', because the site handles images with thumbnails ("drag the mouse over the image to enlarge it") or with scripts that check where you are located and react accordingly.  Seriously, all of that is handled by your browser as a piece of JavaScript code, and it's usually difficult to parse correctly unless you have a full JS interpreter.

And there is other bullshit around, like banners and pop-up menus that just ask you to click "OK, I accept cookies and everything", and again you have to add code to handle it.

-

This tool does this:
Code: [Select]
         modern HTML + JS -----> converted into pure HTML(2000s)
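
Just to give an idea of the core loop, here is a stripped-down toy (written in Python for brevity, it is not my actual tool; it only follows plain href/src links on one host and does none of the JS handling):
Code: [Select]
import urllib.parse
import urllib.request
from html.parser import HTMLParser

# Collect href/src attributes from a page (plain HTML only, no JavaScript handling).
class LinkCollector(HTMLParser):
    def __init__(self):
        super().__init__()
        self.links = []
    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if name in ("href", "src") and value:
                self.links.append(value)

def mirror(url, seen, root_host):
    if url in seen:
        return
    seen.add(url)
    try:
        with urllib.request.urlopen(url) as resp:
            ctype = resp.headers.get_content_type()
            data = resp.read()
    except OSError as err:
        print("failed:", url, err)
        return
    # crude name manipulation: flatten the URL path into a local file name
    path = urllib.parse.urlparse(url).path.strip("/") or "index.html"
    local = path.replace("/", "_")
    with open(local, "wb") as out:
        out.write(data)
    print("downloaded [%s] ... success" % url)
    if ctype == "text/html":                      # only recurse into HTML pages
        parser = LinkCollector()
        parser.feed(data.decode("utf-8", errors="replace"))
        for link in parser.links:
            target = urllib.parse.urljoin(url, link)
            if urllib.parse.urlparse(target).netloc == root_host:  # stay on one host
                mirror(target, seen, root_host)

start = "https://example.org/index.html"          # made-up starting point
mirror(start, set(), urllib.parse.urlparse(start).netloc)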
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Re: Wikipedia download
« Reply #3 on: March 19, 2022, 01:43:12 pm »
My understanding is that there are ways to download everything, legally and without web-crawling, but I don't yet get where to download it from, or even what exactly to download.

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 4217
  • Country: gb
Re: Wikipedia download
« Reply #4 on: March 19, 2022, 03:16:03 pm »
what exactly to download.

Plus, the copyright terms should tell you what you can do: whether it's only for your private needs and the info has to stay offline, or whether you can somehow mirror it on your own website.

I don't know  :-//

My tool is a toy for my personal use and needs, and things will perpetually stay offline.
The opposite of courage is not cowardice, it is conformity. Even a dead fish can go with the flow
 

Offline TERRA Operative

  • Super Contributor
  • ***
  • Posts: 3050
  • Country: jp
  • Voider of warranties
    • Near Far Media Youtube
Re: Wikipedia download
« Reply #5 on: March 19, 2022, 03:27:28 pm »
I think if you want all the images etc., you need to download Wikimedia Commons, or at least a good portion of it?  I think you might need to invest in some hard drives for all that..
Where does all this test equipment keep coming from?!?

https://www.youtube.com/NearFarMedia/
 

Offline NottheDan

  • Frequent Contributor
  • **
  • Posts: 281
  • Country: gb
Re: Wikipedia download
« Reply #6 on: March 19, 2022, 03:44:07 pm »
My understanding is that there are ways to download everything, legally and without web-crawling, but I don't get it yet where from, or even what exactly to download.
Wikipedia itself tells you: https://en.wikipedia.org/wiki/Wikipedia:Database_download
And Kiwix makes it even easier: https://library.kiwix.org/?lang=&category=
 
The following users thanked this post: edavid

Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Re: Wikipedia download
« Reply #7 on: March 20, 2022, 10:58:53 am »
Wikipedia itself tells you: https://en.wikipedia.org/wiki/Wikipedia:Database_download

That's the page all the questions in the OP arose from.  :)
Will try KIWIX, too.

So far I've tried XOWA (on Kubuntu 20.04 LTS):  https://github.com/gnosygnu/xowa/releases
It needs Java to run, and it throws all kinds of errors.  I could only import a March 2017 Wikipedia version uploaded by the XOWA team.  Importing from other places didn't work; it gives a page-long error message starting with
Code: [Select]
[err 0] send_json was not acknowledged ...
Browsing the offline Wikipedia with XOWA seems to be working, though it has some minor rendering issues.



The XOWA dump of the 2017 EN Wikipedia, with pictures and everything else included, is about 13 GB to download and about 90 GB on disk after unpacking.  After that, it all works offline in its own XOWA browser.

Don't know yet if there is a way for XOWA to act as a web server for the stored Wikipedia pages, so that they can be viewed with some other browser of choice.
« Last Edit: March 20, 2022, 11:47:43 am by RoGeorge »
 

Offline 50ShadesOfDirt

  • Regular Contributor
  • *
  • Posts: 111
  • Country: us
Re: Wikipedia download
« Reply #8 on: March 20, 2022, 05:22:44 pm »
I've used the Kiwix solution, and it's pretty good ... it uses a (browser) client front-end to read the offline files, from Wikipedia and other sites. It takes a decent pipe to d/l the files themselves ...

https://www.kiwix.org/en/

Many portions of Wikipedia have already been packaged for Kiwix ... you can grab specific areas.
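
If you'd rather script the downloads than click through the library page, you can pull the file list straight off the mirror ... rough sketch below; I'm assuming download.kiwix.org/zim/wikipedia/ still serves a plain directory listing and that the "en_all_maxi" naming still means "English, all articles, with pictures" - the file names change with every release, so check first:
Code: [Select]
import re
import urllib.request

# List the English "all articles, with pictures" Wikipedia ZIMs from the Kiwix mirror.
# Both the URL and the "wikipedia_en_all_maxi" naming are assumptions; check the
# listing in a browser before scripting against it.
listing_url = "https://download.kiwix.org/zim/wikipedia/"
req = urllib.request.Request(listing_url, headers={"User-Agent": "zim-list/0.1"})
with urllib.request.urlopen(req) as resp:
    page = resp.read().decode("utf-8", errors="replace")

for name in re.findall(r'href="([^"]+\.zim)"', page):
    if name.startswith("wikipedia_en_all_maxi"):
        print(listing_url + name)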
 

Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Re: Wikipedia download
« Reply #9 on: March 20, 2022, 06:07:14 pm »
I've downloaded the 2017 Wikipedia XOWA dump with XOWA, then wanted to test the images, so I searched for "transmission line".  To my surprise, one of the offline search results was "Transmission of COVID-19"  :-//

I've looked in the online Wikipedia, and that page was created in 2020, so I don't know what happened.  Apparently, either the date of the offline dump was wrong and it's not really from 2017, or maybe it automatically/silently updated itself to the latest/current database.

Either way, that's not what I wanted, nor what I expected.

Offline SiliconWizard

  • Super Contributor
  • ***
  • Posts: 15168
  • Country: fr
Re: Wikipedia download
« Reply #10 on: March 20, 2022, 06:37:45 pm »
To my surprise, one of the offline search results was "transmission of COVID-19"  :-//

Anything links to Covid-19 these days. :-DD
 

Online RoGeorge (Topic starter)

  • Super Contributor
  • ***
  • Posts: 6634
  • Country: ro
Re: Wikipedia download
« Reply #11 on: March 20, 2022, 07:53:47 pm »
Kiwix is trivial to set up and use.  :-+

- has many contributors, very active project
- for Linux, there is a single flatpak file that just runs, so no dependencies
     - can work as a local webserver
     - has NO settings, which is annoying, not even a dark theme
- for the browser, there is an add-on that can be used from inside the preferred browser (Firefox, Chrome, Edge)
     - has dark mode and a decent amount of settings
     - apparently, the web addon searches in the titles only

Kiwix has all the Wikimedia projects already available for download from their repos (Wikipedia, Wikibooks, Wikihow, Wikisource, Wiktionary, Wikiquote, Wikinews, Wikiversity, Wikivoyage and a few other websites from other projects), with local and very fast mirrors.  From there, it downloaded 87 GiB of the current Wikipedia (with pics) at an average speed of 50 MiB/s!  :o

Not sure yet if much older versions of Wikipedia are still available from their ZIM repos.

