Google world brain *trying* to digitize every paper book = FAIL
thm_w:
--- Quote from: BrianHG on August 27, 2021, 01:14:06 am --- :palm: This is Google. They shouldn't be the ones using garbage run-of-the-mill OCR software. They should be using one of their neural-net-based deep learning systems powered by a few hundred GPUs or dedicated processors, with full language-, context-, and layout-aware interpreters.
Like Beamin said, FAIL on a pitiful level for a company of this tech caliber.
--- End quote ---
We don't know if the problem is the source or the conversion to ebook reader unless OP provides an actual link...
CatalinaWOW:
I have never encountered a problem on this scale, but I only use a desktop or laptop under Windows, and I have downloaded only a couple of dozen or so books and portions thereof. I would suspect either a translation problem in the conversion to Nook or possibly a poor communication protocol that doesn't recognize/correct errors.
My guess would be that Google sticks with, or at least emphasizes, the OCR versions just to minimize bandwidth on the transmission end. They surely have the resources to retain the original scanned images, though perhaps with methods that don't provide high speed and random access.
My personal experience with OCR is that the original needs to be pretty bad before you get the level of errors described. And that is with the fairly garden-variety OCR sold at retail by Abbyy, running on a not very high-end desktop computer. It can bog down on technical texts that have complex math formulas with glyphs from several languages, but that is usually handled well by just making an image out of the equations. In all but the very densest of textbooks, that still dramatically reduces the image storage required.
Beamin:
--- Quote from: ejeffrey on August 26, 2021, 11:26:25 pm ---Example? Link?
--- End quote ---
The only way I know to see it is through the Nook, which is missing right now, sorry.
I think the fonts are built into the Nook, so maybe on Google's end they see the old books fine, but once it's in the Nook it's garbage. I don't know what the current B&W e-ink reader is, but go into free books and try downloading stuff from around 1900; you will see A LOT of books that are not readable.
TimFox:
When Oxford University Press wanted to digitize the Oxford English Dictionary, they used 128 typists for 18 months.
CatalinaWOW:
I did a bit of browsing, and on my system all of the books turned up as images and were quite readable. An English translation of Fourier's Theory of Heat was very legible, but dense with equations, not just as separate lines but embedded in the text. It is one that I think OCR would have a lot of trouble with.
As others have suggested, a specific link would be useful.