EEVblog Electronics Community Forum

Products => Computers => Programming => Topic started by: Jackster on December 15, 2022, 11:27:39 pm

Title: How to clone an Apache file directory from a website?
Post by: Jackster on December 15, 2022, 11:27:39 pm
Doing a bit of file archival work and we have a few public-facing Apache file directories like in this screenshot.

(https://tecadmin.net/wp-content/uploads/2019/05/apache-directory-listing.png)

The directory structure is pretty simple but there are many of them.
The files are pretty standard but some are text files and some are binary.

What I would like to do is basically clone those directories onto my local storage.


I have tried wget so far with some success.
Code:
 wget -m -p -E -k -K -np "http://domain.ltd/dir/"
I have also tried an assortment of other flags.
But I am finding that it does not download the txt files correctly, and sometimes the binary files are downloaded incorrectly as well.

Does anyone have a better solution?
I don't need to generate the index pages either.

Title: Re: How to clone an Apache file directory from a website?
Post by: Jackster on December 15, 2022, 11:41:36 pm
A few solutions have popped up, such as https://github.com/ArchiveBox/ArchiveBox and Archive Team's self-hosted Warrior.
Title: Re: How to clone an Apache file directory from a website?
Post by: retiredfeline on December 15, 2022, 11:48:40 pm
It's been years since I had to do anything like this but I think I used httrack.
Title: Re: How to clone an Apache file directory from a website?
Post by: Jackster on December 16, 2022, 12:07:06 am
Quote from: retiredfeline on December 15, 2022, 11:48:40 pm
It's been years since I had to do anything like this but I think I used httrack.

That is what I used first. Tried and tested bit of software. But it also had issues with the binary files and text files.
Title: Re: How to clone an Apache file directory from a website?
Post by: Whales on December 16, 2022, 12:43:06 am
What exactly are the issues with the files you have downloaded?  Please give as much detail as you can.

Are you trying to clone that site's code assets, including .php files, so you can rehost it yourself?  You probably can't do that from a web browser: the Apache instance will likely "run" the PHP pages when you try to access them (rather than serve you the raw file contents).  You will need a different form of access to the server (e.g. a shared hosting control panel or ssh/scp/sftp).

Title: Re: How to clone an Apache file directory from a website?
Post by: retiredfeline on December 16, 2022, 12:45:03 am
What sorts of issues? Linux doesn't have attributes for binary or text files, but you can configure the Apache server at your end to associate MIME types with particular file extensions. On the origin server that association is part of its configuration, which isn't exported to the world.

I notice now that it looks like a WordPress site. Dynamic sites like that, where content is programmatically generated, cannot be cloned from the website. You need access to the underlying WordPress installation.
Title: Re: How to clone an Apache file directory from a website?
Post by: AndyBeez on December 16, 2022, 12:59:30 am
Gain root access, pack all of the files into a zip or tar archive, and pull it down over FTP / SSH.
Title: Re: How to clone an Apache file directory from a website?
Post by: Jackster on December 16, 2022, 02:31:36 am
I should have added more info.

It is a publicly accessible directory for a game's downloadable mods and maps.
The files are compiled data files and txt configs.

I don't have access to the actual server's backend.
Title: Re: How to clone an Apache file directory from a website?
Post by: golden_labels on December 16, 2022, 04:09:21 am
Quote from: Jackster on December 16, 2022, 02:31:36 am
It is a publicly accessible directory for a game's downloadable mods and maps.
The files are compiled data files and txt configs.

I don't have access to the actual server's backend.

You still did not answer Whales's question (https://www.eevblog.com/forum/programming/how-to-clone-an-apache-file-directory-from-a-website/msg4583953/#msg4583953) about the problems you are experiencing. We do not know what you downloaded. Nor do we know what you consider “incorrect” and, most importantly, what your reasoning is for considering something “incorrect”.

On a separate note: while dumping websites, please consider using wget’s --limit-rate, --wait and --random-wait. That caps and spreads out the load you put on the server, so it affects normal users less.
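
A rough sketch of how those three options could be combined. The function name polite_mirror and the specific values (500k, 2 seconds) are illustrative assumptions, not figures from this thread. With --random-wait, wget varies each pause between 0.5 and 1.5 times the --wait value.

```shell
#!/bin/sh
# Throttled mirror sketch. Rate and wait values are assumptions; tune them
# to what the server operator would consider reasonable.
polite_mirror() {
    url=$1
    wget --mirror --no-parent \
         --limit-rate=500k \
         --wait=2 \
         --random-wait \
         "$url"
}

# Usage (placeholder URL from the first post):
# polite_mirror "http://domain.ltd/dir/"
```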



Title: Re: How to clone an Apache file directory from a website?
Post by: Nominal Animal on December 16, 2022, 12:43:12 pm
The only problem I can think of wrt. plain text files is that the server supplies a Content-Type: text/plain; charset=encoding header with a different encoding than the one you use on your local machine.  (Typically, the server provides UTF-8, and you use e.g. Windows Western European aka win-1252 aka Codepage 1252.)
The effect is that all ASCII characters are fine, but non-ASCII characters are mangled.
One other quirk is that you might use a text editor that requires the Windows CR-LF newline convention, whereas Linux, BSDs, and Macs use LF-only.

There is neither a "fix" nor an "error".

What you need to do is view the file in your browser and change the encoding to what the server used.
(Browser developers are stupid, and don't have any way of setting local text files' encoding to UTF-8, "because it is not a legacy encoding and is therefore unsuitable as a default".  Yeah, it is one of my pet peeves.)
Because you intend to archive the files, I recommend against recoding them with e.g. iconv.
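
A minimal sketch of read-only checks for the two quirks described above, assuming GNU iconv and grep are available. The function names are hypothetical. iconv is used here only as a validator: its output is discarded, so the archived file is never recoded. Note that a pure-ASCII file also passes the UTF-8 check.

```shell
#!/bin/sh
# Read-only checks; neither function modifies the file it inspects.

# Succeeds if the file decodes cleanly as UTF-8 (output is thrown away).
is_valid_utf8() {
    iconv -f UTF-8 -t UTF-8 "$1" >/dev/null 2>&1
}

# Succeeds if the file contains at least one carriage return,
# which suggests the Windows CR-LF newline convention.
has_crlf() {
    grep -q "$(printf '\r')" "$1"
}

# Usage (hypothetical file name):
# is_valid_utf8 server.cfg && echo "valid UTF-8"
# has_crlf server.cfg && echo "contains CR-LF newlines"
```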
Title: Re: How to clone an Apache file directory from a website?
Post by: Jackster on December 16, 2022, 01:34:29 pm
Thanks to a suggestion by someone on ArchiveTeam's IRC, I am now using OpenDirectoryDownloader to index the URLs and then wget to download all the files.
It is working well and not messing up the config files.

https://github.com/KoalaBear84/OpenDirectoryDownloader
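
A minimal sketch of the download step only, assuming the indexer has already produced a plain text file with one URL per line. The names fetch_url_list, urls.txt and mirror are hypothetical, and the exact wget invocation used in this thread was not posted. This sketch passes neither -E nor -k, so wget does not rename the files or rewrite links inside them.

```shell
#!/bin/sh
# Download every URL in a list, recreating the remote directory layout
# under a local destination directory.
fetch_url_list() {
    list=$1   # text file, one URL per line (assumed format)
    dest=$2   # local directory to download into
    wget --input-file="$list" \
         --force-directories \
         --no-host-directories \
         --directory-prefix="$dest" \
         --no-clobber
}

# Usage (hypothetical names):
# fetch_url_list urls.txt mirror
```

--no-clobber skips files that already exist locally, so an interrupted run can be restarted without fetching everything again.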