Author Topic: Whats the best "free" program to OCR a scanned doc into pdf ?  (Read 3844 times)

Offline BravoV (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7549
  • Country: 00
  • +++ ATH1
As the title says, I have a few old original T&M paper manuals that are not available online, and I want to convert them into PDF using my flatbed scanner.

It's just that I'd prefer the pages to be converted to text, or at least to get the best OCR results I can, because saving them as pages of "pictures" in a PDF is kinda lame.

No online OCR services please; I'd prefer a standalone PC program, and free of course.  ::)

Offline G7PSK

  • Super Contributor
  • ***
  • Posts: 3878
  • Country: gb
  • It is hot until proved not.
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #1 on: July 22, 2015, 09:58:45 am »
LibreOffice can do it via a roundabout route; someone wrote a script to automate it here:
http://askubuntu.com/questions/240011/how-to-convert-pdf-file-to-an-odt-file
 

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6066
  • Country: au
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #2 on: July 22, 2015, 10:04:21 am »
I'm happy to create the PDFs for you if you want to ZIP up the images and throw them in Dropbox somewhere. I have Acrobat Professional on this machine. That's about the only free thing I can suggest.
« Last Edit: July 22, 2015, 10:07:42 am by Halcyon »
 

Offline firewalker

  • Super Contributor
  • ***
  • Posts: 2452
  • Country: gr
Become a realist, stay a dreamer.

 

Offline XOIIO

  • Super Contributor
  • ***
  • Posts: 1625
  • Country: ca
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #4 on: July 22, 2015, 10:34:11 am »
I haven't tried many, but ABBYY's OCR software is fantastic. I found it "free" through a certain website to test it out; if I had needed it for much I would definitely have bought it. There's a super handy screenshot tool that lets you capture a specific area, great for translating stuff like a datasheet perhaps, and if I recall it wasn't too bad at handwriting either.

I had FineReader as well and that's $230, but the Screenshot Reader is only $40, pretty good if you ask me (and if you don't like using "free" software).

Offline Whales

  • Super Contributor
  • ***
  • Posts: 2096
  • Country: au
    • Halestrom
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #5 on: July 22, 2015, 11:54:04 am »
Before you delve into OCR, have you done a triage of how easy the text is for OCR algorithms to read?  Any of these things immediately make 1:1 copies nigh impossible without lots of human effort making corrections:
  • Age markings.  Dark lines, spots, fading
  • Use of different font styles and types.  A single consistent typeface is best, but even then errors occur.
  • Line graphics.  Underlines of titles, borders.
  • Columns, text boxes or other simple formatting complexities where the OCR software has to work out (if it can) where a line of text properly wraps to (only an issue for searchability and pure text output)

Whatever you do I recommend you keep all of the scanned originals.  Don't scan, OCR and delete in batches.  Only OCR once you have everything, and then still don't delete everything unless what you get is what you want.

The DjVu file format is useful for scanned text.
« Last Edit: July 22, 2015, 11:59:21 am by Whales »
 

Offline amyk

  • Super Contributor
  • ***
  • Posts: 8488
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #6 on: July 22, 2015, 11:57:42 am »
Don't use lossy compression either, see near the bottom of this page with some sample manual scans that degraded horribly: http://everist.org/NobLog/20131122_an_actual_knob.htm
 

Offline Whales

  • Super Contributor
  • ***
  • Posts: 2096
  • Country: au
    • Halestrom
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #7 on: July 22, 2015, 12:00:35 pm »
Quote from: amyk on July 22, 2015, 11:57:42 am
Don't use lossy compression either, see near the bottom of this page with some sample manual scans that degraded horribly: http://everist.org/NobLog/20131122_an_actual_knob.htm
Link to exact section: http://everist.org/NobLog/20131122_an_actual_knob.htm#jbig2

Offline Halcyon

  • Global Moderator
  • *****
  • Posts: 6066
  • Country: au
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #8 on: July 22, 2015, 07:19:45 pm »
If you can create a PDF using Adobe's 'ClearScan' method, it combines the best of OCR with the original image, in that the original scan is preserved but is still searchable via a text overlay. Works quite well.
 

Offline codeboy2k

  • Super Contributor
  • ***
  • Posts: 1836
  • Country: ca
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #9 on: July 22, 2015, 11:58:38 pm »
Tesseract.

https://code.google.com/p/tesseract-ocr/

Alexander.

The last time I tried to use tesseract, I didn't find it that easy to set up the language learning part.  (One had to "train it" -- maybe that's changed in recent years)

However, once I got past that it was easy and worked well.  For a front-end, I now use gscan2pdf and scan at 600dpi. gscan2pdf will call out to tesseract to OCR the scanned image, and then it will make a PDF with the graphics and text embedded.
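
For anyone who would rather script the OCR step than go through a front-end, here is a minimal Python sketch of calling Tesseract directly. It assumes a `tesseract` binary on the PATH whose install includes the `pdf` output config (which writes a PDF containing the page image plus a searchable text layer) and the `eng` language data. The function names and paths are made up for illustration, and this is not necessarily how gscan2pdf invokes Tesseract internally.

```python
import shutil
import subprocess
from pathlib import Path


def build_tesseract_cmd(image: Path, out_base: Path, lang: str = "eng") -> list[str]:
    """Build: tesseract <image> <outputbase> -l <lang> pdf

    Tesseract appends ".pdf" to <outputbase> itself when the "pdf" config is used.
    """
    return ["tesseract", str(image), str(out_base), "-l", lang, "pdf"]


def ocr_to_pdf(image: Path, out_dir: Path, lang: str = "eng") -> Path:
    """OCR one scanned page into out_dir, leaving the original scan untouched."""
    if shutil.which("tesseract") is None:
        raise RuntimeError("tesseract was not found on PATH")
    out_dir.mkdir(parents=True, exist_ok=True)
    out_base = out_dir / image.stem
    # check=True raises CalledProcessError if tesseract exits with an error.
    subprocess.run(build_tesseract_cmd(image, out_base, lang), check=True)
    return out_base.parent / (out_base.name + ".pdf")
```

Under those assumptions, `ocr_to_pdf(Path("scans/page_0001.png"), Path("ocr_out"))` would write `ocr_out/page_0001.pdf` and never modify anything in `scans/`, which fits the earlier advice about keeping the scanned originals.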
« Last Edit: July 23, 2015, 12:03:05 am by codeboy2k »
 

Offline BravoV (Topic starter)

  • Super Contributor
  • ***
  • Posts: 7549
  • Country: 00
  • +++ ATH1
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #10 on: July 23, 2015, 10:50:56 am »
Thanks for the input and comments, and Halcyon, thanks for the offer; I will consider it seriously.  :-+

Btw, I have not scanned anything yet, and lots of my manuals have long fold-out pages, like schematics, that need to be stitched together digitally (as in those old Tek manuals), which is quite time consuming to do.

Regarding the suggestions on keeping the original scans in a lossless format: yes, that is definitely the first thing to do, and then the OCR. I never had any intention of discarding the original scans. Yes, I'm aware of badly scanned documents, especially the ones with blurry circuit diagrams or illustrations; I really hate that.  :palm:

As I don't use OCR often, at least not since many, many years ago back in the Win 3.1 days  ::), I was kind of expecting that someone with actual experience scanning old T&M manuals would simply point out the perfect program for this. I guess OCR technology hasn't advanced much since the Win 3.1 era.
« Last Edit: July 23, 2015, 10:52:42 am by BravoV »
 

Offline mariush

  • Super Contributor
  • ***
  • Posts: 5162
  • Country: ro
  • .
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #11 on: July 23, 2015, 11:38:45 am »
When I used to scan (a bit over a year ago), I used IrfanView's Batch Scanning feature. Basically, after previewing the first page (and testing various filters to see what works best with the content: removing see-through text, auto-adjusting contrast and colors, descreening, etc.), you just press Alt+S, or whatever the shortcut for the Scan button is, and IrfanView scans the page, automatically saves it into the folder you want with the compression quality you want, and auto-increments a number at the end of the filename for you.
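
The auto-incrementing filename part is easy to reproduce if you script your own scanning. Here is a minimal Python sketch, assuming files named like `page_0001.png`; the prefix, zero padding and extension are choices made for illustration and are not IrfanView's actual naming scheme.

```python
import re
from pathlib import Path


def next_scan_path(folder: Path, prefix: str = "page_", ext: str = ".png") -> Path:
    """Return the next free numbered path, one above the highest number in folder."""
    # Match names of the form <prefix><digits><ext>, e.g. page_0007.png
    pattern = re.compile(re.escape(prefix) + r"(\d+)" + re.escape(ext) + r"$")
    highest = 0
    for entry in folder.iterdir():
        match = pattern.match(entry.name)
        if match:
            highest = max(highest, int(match.group(1)))
    return folder / f"{prefix}{highest + 1:04d}{ext}"
```

Numbering from the highest existing file, rather than from a count of files, means a deleted bad scan never causes a later page to be overwritten.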

I scanned at 300 dpi and saved to JPG with quality 92-95, resulting in 4-10 MB files for each A4 page. Above 300 dpi, any decent OCR software won't achieve better results, so it made no sense for me to spend 3-4x the time needed for a 300 dpi scan.
Also, I don't know how it is now, but about a year ago when I still did this, ABBYY FineReader basically converted anything you shoved into it to a sort of grayscale and then performed OCR, so it's pointless to scan a B&W manual in full color, unless you really want to.
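
To put rough numbers on the resolution trade-off, here is a small Python sketch that works out the pixel dimensions and uncompressed size of an A4 page at a given dpi. The A4 size (210 x 297 mm) is standard; the function names and the 8-bit grayscale / 24-bit color figures are assumptions for illustration, and real file sizes depend heavily on the compression used.

```python
A4_MM = (210.0, 297.0)  # ISO A4 width and height in millimetres
MM_PER_INCH = 25.4


def page_pixels(dpi: int, size_mm: tuple[float, float] = A4_MM) -> tuple[int, int]:
    """Pixel width and height of a page scanned at the given resolution."""
    w_mm, h_mm = size_mm
    return (round(w_mm / MM_PER_INCH * dpi), round(h_mm / MM_PER_INCH * dpi))


def raw_bytes(dpi: int, bytes_per_pixel: int) -> int:
    """Uncompressed size: 1 byte/pixel for 8-bit gray, 3 for 24-bit color."""
    w, h = page_pixels(dpi)
    return w * h * bytes_per_pixel


if __name__ == "__main__":
    for dpi in (300, 600):
        w, h = page_pixels(dpi)
        print(f"{dpi} dpi: {w} x {h} px, "
              f"gray {raw_bytes(dpi, 1) / 1e6:.1f} MB, "
              f"color {raw_bytes(dpi, 3) / 1e6:.1f} MB uncompressed")
```

At 300 dpi an A4 page comes out at 2480 x 3508 pixels, about 8.7 MB as raw grayscale or 26.1 MB as raw 24-bit color. A 600 dpi scan has roughly four times as many pixels, which is in line with the extra scanning time mentioned above.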

Back then, ABBYY FineReader was the best... it could do that mixed PDF thing Halcyon mentioned: basically the background of each PDF page was the actual scanned page (a picture at 100-150 dpi) and the OCR'ed text was overlaid on the picture, so you could copy and paste text and it would look very nice (but resulted in larger PDF files of course).

If I remember correctly, Xerox TextBridge did a great job at actual character recognition but didn't do the layout and text formatting to my liking (I was scanning old IT magazines with columns and small pictures, and the software screwed up the formatting)... I also tested OmniPage Professional but for some reason didn't like it.

If you want a copy of FineReader or OmniPage for testing, PM me and I'll upload it for you somewhere.
 

Offline eas

  • Frequent Contributor
  • **
  • Posts: 601
  • Country: us
    • Tech Obsessed
Re: Whats the best "free" program to OCR a scanned doc into pdf ?
« Reply #12 on: July 30, 2015, 10:51:23 pm »
OCR for search on old manuals/documents = awesome. OCR for reading, not so much.

Very interested in the answers here. I've been using the FineReader that came with my scanner to make searchable PDFs of old papers, but it doesn't really want to work on stuff from another source and I'd rather not try and hack around the limitations.

I found a nice cache of old electronics-related docs on Archive.org. Many of them have OCRed versions, but about half of those are afflicted with over-compressed scans and I'd like to start over with the higher-quality versions.
 

