Book archives
Here in the early years of the digital age, all kinds of archive material are being digitized for broad distribution, consumption, and storage. Because we are in the early stages of this conversion, the reality sometimes falls short of what's billed in theory. Take e-books, for example. There are thousands of public domain titles available as text-based, digitized e-books. Lots of sources will provide you with the great (and not so great) literature of the 19th century. Unfortunately, the transcoding from printed page to digital content is often a little rough. Mistakes abound. So-called "optical character recognition" is sometimes closer to "optical character guessing." The scanned page is optically correct, but the software that converts the pixels in the image to characters of digital text is not flawless. The scanned page may clearly read "Party of the first part" but might emerge from character recognition as "|°artg of tne fust bart" until a human reader fixes the mistakes manually, usually in a volunteer effort. Humans can and often do correct the mistakes, but not perfectly and often not even competently.
Another problem is simple punctuation conventions. For example, here in America, we mark speech with double quotes, hence "I am speaking and you know it because of the punctuation." In British English it is common to mark speech with single quotes, hence 'I am speaking and you know it because of the punctuation.' It's a small thing, but one that can be mildly jarring when one is trained differently.
I've recently found an interesting work-around for some of this that you may not be aware of. Check out www.archive.org. There you will find gazillions of books (and other materials) that have been scanned and are available for free download. In particular, you'll find that lots of material is available as image-based PDFs made from the original scans. It's like having a picture of a page rather than a digitally transcoded version of the text. The file sizes are much larger, but not too bad, unless you need hundreds of books at a time on your portable device. I'm currently reading Dombey and Son by Charles Dickens from a volume originally published (and printed) in 1896 by Chapman & Hall of London. All the transcoded versions I tried to read were simply horrible, with lots of errors. Because the optically based PDF consists simply of scans of the pages, it is error free, or at least as error free as the original paper publication. The optical PDF includes all the original illustrations, too. The scans of the book pages also preserve all the typography of the original as well as the design elements and flourish marks. I even enjoy the look of the aged paper.
It's not like reading a "normal" e-book in that I can't change the font size or typeface, but I can't do that when reading a paper book either. At least with the optical PDF I can zoom in with a pinch gesture if I need to. To make the file a little more "reader friendly," I've added bookmarks to the PDF so I can jump to chapters as needed. Using the Adobe Reader app on my Android tablet, I can still make marginal notes, highlight passages, etc.
Why would I choose to read a book this way? Well, here's a real-world example. When I was recently down on the Oregon coast slumming in a used bookstore, I found an old and broken copy of a book by John Galsworthy that I'd never heard of. Because I'd just watched The Forsyte Saga on video, I had become interested in learning more about this author. The book is titled Caravan and consists of a number of Galsworthy short stories. It looked interesting, but it was in such disrepair that pages were missing. When I next had a chance, I jumped onto www.archive.org and found the book, originally published in 1925, not scanned but photographed (interestingly enough, with a Canon 5D Mk II) into the digital archive by the National Federation of the Blind in April of 2012. In a couple of minutes, I had the 48 MB PDF downloaded to my computer, transferred to my Dropbox, and accessible on my Nexus 7 tablet, and I was reading away, amazed by the technology. So far, I haven't found this obscure book in any of the standard e-book locations like Gutenberg.org or Manybooks.net. Nonetheless, I'm enjoying it in this photographed PDF archive format: completely error free and a joy to read.
Just for fun, try a search for photography at www.archive.org. Very interesting.

Just to note: you won't find many post-1922 books at gutenberg.org because they're not likely to be in the Public Domain in the US. Project Gutenberg is very careful to vet works that are in the Public Domain so as to avoid nasty lawsuits. Post-1922 works are either very carefully researched (e.g. the Science Fiction short stories) or are made available by the copyright holder.
Also, as a volunteer proofreader (through Distributed Proofreaders at pgdp.net), I recommend that you look for Project Gutenberg books later than about 10,000 if you're looking for well-converted text versions. The early ones are quite... idiosyncratic.
Posted by: logista | 10/31/2012 at 07:06 PM
As someone who has used optical character "guessing", I completely relate to the often abysmal transcription that takes place. It's often more efficient for me to type-transcribe the document myself! Thanks so much for sharing these new sites where I can truly find a "copy" (closer to a photocopy) of some of these original books. And BTW, the Forsyte Saga is even better in the printed version than the BBC video series.
Posted by: Kathy Eyster | 11/01/2012 at 08:21 AM