Author Topic: Scanning Tech Manuals - discussion of methods (Read 6423 times)

TerraHertz · « **on:** December 10, 2012, 08:44:37 am »

Over the years I've acquired a bit of a collection of old service manuals, and a few other rare works. A lot of these are already available from various sources in electronic form, but some of them are not. I've always thought it would be helpful to others if I scanned some of my collection, and put it online. Whether through existing archive sites, or my own.

So far, the scanning equipment and tools I have are crude, slow, and not really feasible for larger manuals. But I've been making a few experimental attempts. Some examples are here:
http://everist.org/archives/scans/

Any comments appreciated, especially suggestions for improvement.

Has anyone else here tried this, and what setup did you use?

The method I'm currently using to present the scans I've done is a bit unusual. I don't much like PDFs (more on that later) and the best alternative I've found is a scheme known as RAR-books. It's a bit clumsy and requires a non-freeware utility that probably most people don't have - WinRAR, which is an alternative to WinZip and other file compression tools.

The key feature of WinRAR is that it can open an archive in which the actual archive does not begin at the start of the file. It just skips over whatever it finds until it gets to the archive header. One side effect of this is that you can concatenate two files - the first a plain JPG image, and the second the RAR archive. The result will open and display as a JPG image (giving you a 'front cover' for the file) and WinRAR can extract the remaining contents.
You can tell if you're looking at a RAR-book, as the JPG image will appear complete very quickly (it will only be a few K), but the browser will still be downloading the rest - possibly many megabytes. If the file name is something.jpg, the image is smallish, but the file is unexpectedly big, it's probably a RAR-book.
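
For anyone who wants to try the concatenation trick described above, here is a minimal sketch in Python. The function names are made up for illustration, and the check only looks at the leading bytes of each part (the JPEG start-of-image marker and the "Rar!" signature prefix); it is not how WinRAR itself builds or validates anything.

```python
# Sketch: build a "RAR-book" by appending a RAR archive to a JPEG cover image.
# Function names are hypothetical; only the byte signatures are real.

JPEG_SOI = b"\xff\xd8"            # JPEG start-of-image marker
RAR_PREFIX = b"Rar!\x1a\x07"      # common prefix of RAR 4.x and 5.x signatures


def make_rar_book(cover_jpg: bytes, rar_archive: bytes) -> bytes:
    """Return cover image bytes followed directly by the archive bytes."""
    if not cover_jpg.startswith(JPEG_SOI):
        raise ValueError("cover does not start with the JPEG SOI marker")
    if not rar_archive.startswith(RAR_PREFIX):
        raise ValueError("archive does not start with a RAR signature")
    return cover_jpg + rar_archive


def looks_like_rar_book(data: bytes) -> bool:
    """True if the data starts like a JPEG and has a RAR signature further in."""
    return data.startswith(JPEG_SOI) and data.find(RAR_PREFIX, len(JPEG_SOI)) != -1
```

In use, you would read the two files as bytes, call `make_rar_book`, and write the result out with a `.jpg` name. Image viewers stop at the end of the JPEG data, while an archiver that scans for the signature picks up the rest.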

This technique is very popular among people free-sharing scanned/OCR'd novels and other literature. It was common for a while on 4chan's 'Lit-thursday', before 4chan decided that even for that den of villainy, the amount of copyright infringement going on was getting excessive. You did realize that copyright infringement is far, far worse than rape, murder, sedition, every type of porn, and so on, didn't you?

For the moment I'm using the RAR-book form of wrapper, with page images strung together with simple html. It allows me complete freedom to optimize image compression for each page. I'm very much concerned with trying to cleanly retain the original appearance of historic documents. (Which these technical manuals are.) So that degree of image quality control is essential, I believe.

However I'm not totally happy with the RAR-book form since apart from the non-freeware utility, it also doesn't allow inclusion of an accessible plain text version of the contents, that can be searched, selected and transferred to an editor. This is a crucial feature. I'm thinking maybe I can incorporate that via the html, but not yet sure how to do it so the image and plain text are associated, word for word.
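
One possible way to tie the text to the image word for word is to lay invisible text spans over the page image, positioned at each word's bounding box. The sketch below only illustrates that idea: the function name, the `(text, x, y, w, h)` word tuples, and the assumption that some OCR step supplies pixel coordinates are all hypothetical.

```python
# Sketch: generate an HTML fragment with transparent, selectable text
# positioned over a page scan. Word boxes are assumed to come from an OCR step.
from html import escape


def overlay_html(image_src, page_w, page_h, words):
    """words: iterable of (text, x, y, w, h) tuples in image pixels."""
    spans = []
    for text, x, y, w, h in words:
        spans.append(
            f'<span style="position:absolute;left:{x}px;top:{y}px;'
            f'width:{w}px;height:{h}px;font-size:{h}px;color:transparent">'
            f"{escape(text)}</span>"
        )
    return (
        f'<div style="position:relative;width:{page_w}px;height:{page_h}px">'
        f'<img src="{escape(image_src, quote=True)}" '
        f'width="{page_w}" height="{page_h}" alt="">'
        + "".join(spans)
        + "</div>"
    )
```

The scan stays untouched as the visible layer, while the browser's own find and select functions work on the hidden text.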

For scanning, I generally find that 400 dpi grayscale is adequate for text, but must be increased to 600 dpi if there are any screened images on the page. And colour if the page has colours, obviously. These both result in fairly huge scan files, typically 20 to 80 MB per page. Then the choice of final resolution and pixel-coding scheme depends on the content. B&W text works OK with 16-level grayscale, giving 4 bits/pixel. It's really necessary to retain some gray levels, or the horrible ragged 'FAX effect' occurs making character outlines look nasty. It can be quite tricky to choose the grayscale transfer curve to get a flat white background (if that's wanted) while retaining shading on character edges.
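
To get a feel for the file sizes involved, here is a rough calculation of uncompressed scan size. The 8.5 x 11 inch page is an assumption (the actual page sizes and scanner bit depths aren't stated above), and the function names are made up for this sketch.

```python
# Sketch: uncompressed size of a scanned page, and reduction to 16 gray levels.
# Page dimensions below are an assumed 8.5 x 11 inch sheet.


def raw_scan_bytes(width_in, height_in, dpi, bits_per_pixel):
    """Bytes needed for an uncompressed scan of the given page."""
    width_px = round(width_in * dpi)
    height_px = round(height_in * dpi)
    return (width_px * height_px * bits_per_pixel + 7) // 8


def quantize_to_16_levels(gray8):
    """Map an 8-bit gray value (0..255) to a 4-bit level (0..15)."""
    return gray8 >> 4


print(raw_scan_bytes(8.5, 11, 400, 8))    # 400 dpi, 8-bit gray: 14960000
print(raw_scan_bytes(8.5, 11, 600, 24))   # 600 dpi, 24-bit colour: 100980000
print(raw_scan_bytes(8.5, 11, 400, 4))    # 400 dpi, 16-level gray: 7480000
```

Dropping from 8 to 4 bits per pixel halves the raw size before any compression, while still keeping intermediate gray levels on character edges.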

The first book-scan I did was Vinge's 'True Names'. It's a SF classic - the first VR novel. Then it was out of print for over a decade. So... That one I OCR'd, but the software was less than ideal. I don't currently have a contemporary OCR tool.

The GR 900-LB Slotted Line manual I'm trying now is the most ambitious one yet. It's about 80 pages, with many diagrams.

If that turns out acceptably, I think I'll try something with foldout schematics next.
And if I ever get some scanning equipment that can cycle pages a lot faster than my present setup, there's a very large, rare, historic book of about 400 pages with many engravings, that I'd love to dump free on the net. Just to piss off a certain group of arseholes.

Why not PDF? Sigh, where to begin? I'll leave that for another comment.

amyk · « **Reply #1 on:** December 10, 2012, 09:10:53 am »

You can use 7-zip if you want to use free software; it works well with concatenated archives too.

Digital cameras are faster than traditional scanners; that's what Google uses, and several commercial bookscanner products too. They apply image postprocessing to remove distortion, then OCR.

kripton2035 · « **Reply #2 on:** December 10, 2012, 09:16:38 am »

for me 150dpi for texts and 300dpi for pictures are enough.
maybe that's why you think your system is slow.
also I'm waiting for the "no pdf" comments as I don't understand why...

I tried your link, I see a jpg picture, and the rar book is not accessible; you only see the first page
your system is perhaps good for you, but some people (like me) won't be able to see your work !
regards,

jancumps · « **Reply #3 on:** December 10, 2012, 09:24:11 am »

Only see the image on my iPad. Haven't checked yet if there is an unrarrer available in the App Store.

TerraHertz · « **Reply #4 on:** December 10, 2012, 09:53:00 am »

The image you see when you fetch the file is NOT the first page. It's just a 'cover' image, a kind of thumbnail, and at much lower resolution than the contents. In fact the book cover is typically present again within, at full resolution.

Unlike pdf, no computers will be able to automatically open the document. People who've used RAR-books before (and those who've had to deal with stripping DRM off ebooks) will be familiar with the 'download and save it somewhere, use archive tool to extract files to a folder, *then* you can read the document' cycle.
Once you have the de-archiver, and understand it, it's just a couple of extra clicks. No big deal.

I'll look into 7-zip.

Digital cameras - hmm... that's an interesting idea. I suppose the commercial systems include a glass face on top of the document, and tangent lighting. Not impossible to experiment with at home. The distortion processing would be a problem though, if I can't find freeware tools.

"for me 150dpi for texts and 300dpi for pictures are enough."
They're 'readable', but that's not my objective. I'm treating this more as a historical conservation exercise. Also, as a way to get fairly hi-res images of the docs online, so others can convert to pdf or whatever they like.
So many instances I've come across where people go to the trouble of scanning things, but only to 'just good enough' quality, so you end up struggling to make out fine details on schematics etc. Another instance was where a friend of mine (now deceased) scanned his archive of early issues of a zine produced by a group we were associated with. He 'saved space' by scanning at the minimum readable resolution. Maybe in those days (quite a while ago) it made sense. Now it doesn't, and someone will have to redo all that work.

jucole · « **Reply #5 on:** December 10, 2012, 01:08:39 pm »

As a historical conservation of printed documents I think the process is good. I come from a graphic designer background and when I've had to create multi-page documents I've scanned the pages, cleaned and segmented all the graphical elements using Photoshop, OCR'd the text, then faithfully reconstructed the pages in Adobe Indesign / Illustrator at print resolution. Then you can simply create PDFs at screen / print resolution from the original source files.

One thing I've noticed with scanning is that if you scan at the scanner's hardware preset resolutions you can squeeze a little more sharpness out of the images than if you have the software interpolate to a non-common resolution.

To me PDF is only as good as the person who created it; you see a range of varying quality efforts around on the web but I've never come across a true Historical Document Conservation file format; maybe there is something out there.

TerraHertz · « **Reply #6 on:** December 10, 2012, 02:56:57 pm »

Quote

I've never come across a true Historical Document Conservation file format; maybe there is something out there.

It's strange that, isn't it? So many different document capture formats, and not one even attempts to convey an accurate representation of the original work, in every detail. Some years ago I was studying for a degree in Networking, and planned to take it through to a Masters. Had to quit due to personal life dramas, but had begun work on a thesis. Was going to be a study of information coding schemes, their historical development, and where things went wrong. I'm pretty sure there were some very serious fundamental mistakes made in the early days. For instance the absence of any concept of command/data channel separation in ASCII. (Among many other things it's missing, in hindsight.) With the unfortunate result that ever since, every page layout system has to overload some of the elements of the character coding scheme to get that ability. html's <> for example. Or go completely binary, as with all the Word-style representations. Postscript was a small step in the right direction, but then it got Corporatised, and perverted into pdf.

It's really sad, that you see massive amounts of effort being put into things like Project Gutenberg, and all the libraries being converted to pdfs and ebooks, yet all of them are inadequate in a historical archiving sense. An electronic copy that looks as identical to the original as possible, including blemishes etc, even a physical model of the original, yet which still has the added features of an electronic form - search and extraction of the information content. There's simply nothing, that I've been able to find.
Meanwhile, the majority of books printed in the last century are all falling apart due to acidic paper.

This dichotomy - the so called highly advanced computing science we have, and yet still the total absence of any standard for preserving the cultural works of humankind, really makes me shake my head. We're such stupid creatures, with crazy short-sighted priorities.

What OCR tool were you using? Comments?
Scanning sharpness - yes. I always use the native best format on the scanner. Another thing is to never reduce screened images by anything other than half/quarter scales, to not wreck the screen pattern with moires.
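
As an illustration of what a clean half-scale reduction means, here is a 2x2 box average written in plain Python. It assumes the image is a nested list of 8-bit gray values, which is only for the sake of a self-contained sketch; a real workflow would use an image editor or library.

```python
# Sketch: reduce a grayscale image to exactly half scale by averaging
# each 2x2 block of pixels. Every output pixel covers a whole number of
# input pixels, so no fractional resampling is involved.


def halve(gray):
    """gray: list of rows of ints (0..255). Returns the half-scale image."""
    height = len(gray) // 2 * 2      # ignore a trailing odd row
    width = len(gray[0]) // 2 * 2    # ignore a trailing odd column
    return [
        [
            (gray[y][x] + gray[y][x + 1] + gray[y + 1][x] + gray[y + 1][x + 1] + 2) // 4
            for x in range(0, width, 2)
        ]
        for y in range(0, height, 2)
    ]
```

Applying it twice gives quarter scale. Whether a given halftone screen survives the reduction without visible moire still depends on the screen pitch relative to the scan resolution.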

I've no intention of doing large numbers of documents with my current methods. I'm just doing some as an exploration, to learn the pitfalls, while thinking about the results I'd like to achieve.

kripton2035 · « **Reply #7 on:** December 10, 2012, 03:16:41 pm »

Quote from: jancumps on December 10, 2012, 09:24:11 am

Only see the image on my iPad. Haven't checked yet if there is an unrarrer available in the App Store.

also I can only see the image on my Mac - no unrar Mac app seems able to open it
so I doubt iOS can open it !

AlfBaz · « **Reply #8 on:** December 10, 2012, 03:48:42 pm »

Ah, ok... So you save the image to your PC, and it acts just like a jpg (properties and all) then change the extension to rar, extract all and the book is in html format, pictures and all. Very, very good!

I can see the dilemma, it must have taken a lot of work to get to that stage

jucole · « **Reply #9 on:** December 10, 2012, 04:02:36 pm »

Perhaps you could come up with your own format that better suited your requirements. There are so many free tools and code libraries out there nowadays it just makes things so much easier for anybody to write their own software. I'd start with a document folder structure to hold the resources and create a reader with a text search and mapping from the text to the actual scan page that highlighted the text of the searched paragraph. All the resources would be added to a zip file, the zip file would be renamed to something like .hdf "historical document format"; you could even add technical information such as the physical properties and pictures of the book etc. Other people could add PDF translations of the original source scans if they so wished.
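
A rough sketch of that container idea, using only the Python standard library. The folder layout (`scans/`, `text/`, `manifest.json`), the function names, and the metadata fields are all invented here for illustration; nothing like this is an existing standard.

```python
# Sketch: a zip-based "historical document format" container with scans,
# per-page plain text, and a manifest that maps text back to scan pages.
import io
import json
import zipfile


def build_hdf(pages, metadata):
    """pages: list of (image_name, image_bytes, page_text). Returns zip bytes."""
    buf = io.BytesIO()
    manifest = {"metadata": metadata, "pages": []}
    with zipfile.ZipFile(buf, "w", zipfile.ZIP_DEFLATED) as zf:
        for number, (name, image, text) in enumerate(pages, start=1):
            scan_path = f"scans/{name}"
            text_path = f"text/{number:04d}.txt"
            # Scans are usually already compressed, so store them as-is.
            zf.writestr(scan_path, image, compress_type=zipfile.ZIP_STORED)
            zf.writestr(text_path, text)
            manifest["pages"].append(
                {"number": number, "scan": scan_path, "text": text_path}
            )
        zf.writestr("manifest.json", json.dumps(manifest, indent=2))
    return buf.getvalue()


def search(hdf_bytes, needle):
    """Return the page numbers whose text contains needle (case-insensitive)."""
    hits = []
    with zipfile.ZipFile(io.BytesIO(hdf_bytes)) as zf:
        manifest = json.loads(zf.read("manifest.json"))
        for page in manifest["pages"]:
            text = zf.read(page["text"]).decode("utf-8")
            if needle.lower() in text.lower():
                hits.append(page["number"])
    return hits
```

A reader program would use the manifest to jump from a search hit to the matching scan; highlighting the exact paragraph would additionally need position data per word or line.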

Gall · « **Reply #10 on:** December 10, 2012, 04:08:56 pm »

Why not use DJVU?
They're compact, well-suited for scanned texts, may combine OCR and scanned image so that you can select scanned text with mouse and the OCRed text goes to the clipboard. The format is specially designed for book scanning.

And it's free and open-source.

SeanB · « **Reply #11 on:** December 10, 2012, 04:14:56 pm »

Just rename it to .rar, then it will open. This is probably an issue with the GUI wrapper not knowing that it must use unrar as an extract method for the file.

kripton2035 · « **Reply #12 on:** December 10, 2012, 05:53:41 pm »

Quote from: SeanB on December 10, 2012, 04:14:56 pm

Just rename it to .rar, then it will open. This is probably an issue with the GUI wrapper not knowing that it must use unrar as an extract method for the file.

already done it (I know the trick..) but it still does not open on a mac.

SeanB · « **Reply #13 on:** December 10, 2012, 06:57:55 pm »

Can it handle a standard RAR? I had to add all the uncompress methods to the Ubuntu archive viewer to get some to work well. Have you tried 7zip? It often works where others give up on corrupt files.

As an aside when scanning use the descreen and set it to 133 or 72, depending on whether the original is halftoned for newsprint or litho. It removes all the moire from it even at 1200dpi scanning. You can then reduce the resolution later on for the final product.

kripton2035 · « **Reply #14 on:** December 10, 2012, 07:38:31 pm »

yes I opened rar, zip, 7zip, gzip, and a lot of other formats without any problem.
but this rar behind jpg cannot be opened

mariush · « **Reply #15 on:** December 10, 2012, 08:22:43 pm »

Appending a rar file to the end of a jpg picture sounds really ugly.

ABBYY FineReader and Xerox TextBridge (or whatever it's been renamed to nowadays) are quite capable of recognizing pictures, tables etc. and separating them from text. They can export to HTML, PDF, DOC and other formats, so maybe that's an option.

See if you can export to HTML and get the layout preserved... otherwise Adobe Acrobat Pro has an "export to html" feature which works reasonably well. But Adobe Acrobat Pro is kind of expensive.

I did scan some magazines in my past and back then Textbridge was better at recognizing tables, charts, patterns and had a neat trick of placing a lower res picture of the page as background of the pdf page, then put over the text and pictures it segmented at higher resolution.

But in the end I worked with ABBYY FineReader because it did the character recognition better from the start so I wasted less time correcting and training it.

I found 300 dpi to be quite enough to preserve documents.

bitwelder · « **Reply #16 on:** December 10, 2012, 10:29:36 pm »

Quote from: mariush on December 10, 2012, 08:22:43 pm

I found 300 dpi to be quite enough to preserve documents.

I was having a look at the code used by the Google Books project (they recently open-sourced the machinery to scan an entire book automatically, a clever contraption BTW), and it seems for them 300 dpi is enough:

Code:

    scan() {
        cat $SERIAL_DEVICE | scanimage --batch=%06d.pnm --batch-prompt \
            --page-height 355 -y 355 \
            --source "ADF Duplex" --mode Color --resolution 300 $@
    }

TerraHertz · « **Reply #17 on:** December 10, 2012, 11:56:05 pm »

Actually you should not have to bother changing the filename extension from jpg to rar.
If you start WinRAR yourself, then tell it to open the JPG file, it just works. That's a nice thing about WinRAR, it doesn't care about the file extension. It will search for an archive embedded in any file you tell it to.
So, you leave the file saved on disk as a JPG. If you just do a 'view file' (double click on Windows, or whatever your OS requires), you see the small 'cover' image. Only when you want to actually read the thing, you unpack it.

Ha ha... some people getting so dependent on the OS knowing 'how to do things', they're forgetting they are in charge.

I guess that's actually one of the advantages to RAR-books, for some people. That OS's don't know how to open them, so to naive users the file appears just like a simple JPG. Useful for hiding stuff you shouldn't have. It's like a very simple & weak version of this: http://www.truecrypt.org

Quote from: jucole on December 10, 2012, 04:02:36 pm

Perhaps you could come up with your own format that better suited your requirements. There are so many free tools and code libraries out there nowadays it just makes things so much easier for anybody to write their own software. I'd start with a document folder structure to hold the resources and create a reader with a text search and mapping from the text to the actual scan page that highlighted the text of the searched paragraph. All the resources would be added to a zip file, the zip file would be renamed to something like .hdf "historical document format"; you could even add technical information such as the physical properties and pictures of the book etc. Other people could add PDF translations of the original source scans if they so wished.

This is the ultimate intention. But the whole exercise is more than 'just another document format'. It's an examination of how our current information representation methods, OS and GUI practices would be different (hopefully better) if some very fundamental legacy concepts/practices had been done differently. I just use the case of document capture as (one way) of forcing myself to deal with practical issues as well as the abstract.

The way I'm encapsulating documents now isn't even an attempt to do it the way I'd like to. It's just a kind of warmup exercise.

Getting overwhelmed with information is one problem, but unavoidable. It's very useful to me to be told about things like DJVU, Finereader, Textbridge etc (thanks for those!) since just finding these things exist takes time, and it's easy to miss things.

Incidentally, filename extensions, external file attributes, the file vs folder dichotomy, and even the existence of any kind of separate filesystem allocation/indexing data tables, are all legacy concepts that we use just because we do, and could be eliminated in a future OS.

mariush · « **Reply #18 on:** December 11, 2012, 02:50:17 am »

That's because the rar file format has a signature... it starts with "Rar!"

Winrar simply searches for these 4 characters and if they're found, it does further checks on the next few bytes and then if everything's OK it lists the contents of the archive.
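
The search-then-verify approach described above can be sketched in a few lines of Python. This is only an illustration of the idea, not WinRAR's actual code; the function name is made up, and the "further checks" are reduced here to matching the full published signature (7 bytes for RAR 4.x, 8 bytes for RAR 5.x).

```python
# Sketch: locate an embedded RAR archive by scanning for "Rar!" and then
# checking the bytes that follow against the full RAR 4.x / 5.x signatures.

RAR4_SIG = b"Rar!\x1a\x07\x00"
RAR5_SIG = b"Rar!\x1a\x07\x01\x00"


def find_rar_offset(data: bytes) -> int:
    """Return the offset of the first full RAR signature, or -1 if none."""
    pos = data.find(b"Rar!")
    while pos != -1:
        candidate = data[pos:pos + len(RAR5_SIG)]
        if candidate.startswith(RAR4_SIG) or candidate == RAR5_SIG:
            return pos
        # The four characters matched by chance; keep looking.
        pos = data.find(b"Rar!", pos + 1)
    return -1
```

Everything from the returned offset to the end of the file can then be treated as the archive, which is why the leading image data does no harm.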

Just the same, Zip files start with "PK", 7zip archives start with "7z", PNG pictures have the signature "‰PNG" and a new line character right after that...

It's highly unlikely that the sequence "Rar!" would appear in the picture binary stream, so that's why you can just append a rar archive to the end of the picture file and it will work. In contrast, "PK" or "7z" would be quite likely to appear, especially in a larger picture.
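
A back-of-envelope check of that argument, assuming the compressed image data behaves like uniformly random bytes (a simplification). The function name is made up for this sketch.

```python
# Sketch: expected number of chance matches of a fixed byte sequence in
# n_bytes of uniformly random data.


def expected_false_hits(n_bytes: int, sig_len: int) -> float:
    """Expected count of accidental matches of one specific sig_len-byte pattern."""
    positions = max(n_bytes - sig_len + 1, 0)   # possible start offsets
    return positions / 256 ** sig_len


one_mib = 1 << 20
print(expected_false_hits(one_mib, 2))   # two-byte pattern such as "PK": about 16
print(expected_false_hits(one_mib, 4))   # four-byte pattern such as "Rar!": about 0.00024
```

Under that assumption a two-byte pattern turns up around 16 times per megabyte, while a four-byte one is very unlikely to turn up at all. The full Zip local file header signature ("PK" plus bytes 0x03 0x04) and the full 7z signature ("7z" plus bytes 0xBC 0xAF 0x27 0x1C) are longer than two bytes, so a tool that checks the whole signature has a better chance than the two-character version suggests.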

If you want, I guess you could rename the rar files to .cbr and then you could say it's a comic book:

http://en.wikipedia.org/wiki/Comic_book_archive

but it's a silly format, it doesn't give you the ability to search the text inside the book, it's just plain pictures.

I suppose the safest and most future-proof format would be to just dump the book as a html file, with the images embedded inside the html and a simple javascript code to do pagination (add previous, next with suitable scrolling), but the size of the html files would be huge, 4-6 times the size of an archive.
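
A minimal sketch of the single-file idea, embedding each page as a Base64 `data:` URI. The function names are made up, PNG pages are assumed, and the pagination script is left out to keep it short.

```python
# Sketch: build one self-contained HTML file with page images embedded
# as Base64 data URIs.
import base64


def data_uri(image_bytes: bytes, mime: str = "image/png") -> str:
    """Encode image bytes as a data: URI usable in an <img src=...>."""
    return f"data:{mime};base64,{base64.b64encode(image_bytes).decode('ascii')}"


def single_file_book(pages) -> str:
    """pages: list of image bytes (PNG assumed). Returns the HTML document."""
    imgs = "\n".join(
        f'<img class="page" src="{data_uri(page)}" alt="page {number}">'
        for number, page in enumerate(pages, start=1)
    )
    return "<!DOCTYPE html>\n<html><body>\n" + imgs + "\n</body></html>\n"


def base64_overhead(n_bytes: int) -> float:
    """Ratio of Base64 text length to the original byte length."""
    return 4 * ((n_bytes + 2) // 3) / n_bytes
```

Base64 turns every 3 bytes into 4 characters, so the encoding step by itself accounts for roughly a one-third increase over the raw image files; any growth beyond that would have to come from somewhere else.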

jucole · « **Reply #19 on:** December 11, 2012, 11:27:32 am »

Quote from: TerraHertz on December 10, 2012, 11:56:05 pm

This is the ultimate intention. But the whole exercise is more than 'just another document format'. It's an examination of how our current information representation methods, OS and GUI practices would be different (hopefully better) if some very fundamental legacy concepts/practices had been done differently. I just use the case of document capture as (one way) of forcing myself to deal with practical issues as well as the abstract.

The way I'm encapsulating documents now isn't even an attempt to do it the way I'd like to. It's just a kind of warmup exercise.

Incidentally, filename extensions, external file attributes, the file vs folder dichotomy, and even the existence of any kind of separate filesystem allocation/indexing data tables, are all legacy concepts that we use just because we do, and could be eliminated in a future OS.

I thought you were asking for comments on your documentation process; not writing an essay on how you should represent files on a computer system ? ;-)

The Apple iPads have a nice clean way of representing files on its OS, but not sure it's scalable for lots of files.

TerraHertz · « **Reply #20 on:** December 11, 2012, 01:25:21 pm »

It's pretty, isn't it? I don't have an iPad, but the net says the iPad supports the following eBook formats:
* ePub which is open ebook format
* Amazon's Kindle
* Barnes & Noble
* iBooks from the iBooks Store
* PDF from third-party apps

What *does* it do, when you have a few hundred titles?

Sadly, formats designed for pretty presentation of works produced for (or cleaned up for) eBook publication, don't seem to have features needed for historical doc capture. Stains, hand written notes, typographic flaws, etc, nope.
Most of their development effort seems to be going into DRM-enforcement schemes. May they all die in fire.

Quote

I thought you were asking for comments on your documentation process; not writing an essay on how you should represent files on a computer system ? ;-)

Yeah. I thought so too. Hard to not get off topic on this, by mentioning the underlying purpose. It's not really possible to have a discussion here on the deeper aspects anyway - too complex. Even if it was of interest to anyone.

