Nick Hodson

"Review of the Year"


A great deal has happened in the field of book transcription over the past five years, and indeed there is scarcely a single part of the work that has not been affected.

Traditionally one obtained a copy of the book to be transcribed, scanned it, OCRed the scans, edited the resultant text, and then disposed of it to some corner of the computer whence it could be retrieved when desired.

While working from a copy of the book is still the number one method, during the last few years some rare books that exist only in the great university copyright libraries have been scanned by the libraries and/or by Google, and the collections of scans made available to the world on the Internet Archive. This has meant that I personally have been able to transcribe and post to Project Gutenberg, for anyone in the world to read freely, a great many books that either were never available from booksellers via Abebooks, or that never came up on eBay.

The quality of these scans varies greatly, from a very poor standard of scans done at a certain University Library, to a generally high standard of scans coming from the University of Illinois at Urbana-Champaign. A scan can be poorly done with one or more of the following faults: pages missing; pages out of focus; pictures of hands or fingers instead of part of a page; camera misaligned, so that only the top or the bottom half of a page is seen; some pages scanned at the wrong dpi, typically four times smaller; illustrations missing.

When working with scans or other images the best program to use is the free and quite excellent IrfanView 4.23. For those working in this field this program is the industry standard. Do not be put off by salesmen trying to persuade you that some disgustingly expensive program is the industry standard, because it isn’t.

But there is a conflict in the minds of people of the managerial class. They are often paid on the basis of the expenditure they are “responsible” for. Therefore a perfect and free program carries for them less merit than an imperfect and expensive program. I am sorry to say that salesmen are trained to recognise this, and take advantage of it.

Ideally several different scans of a book turn up on the Internet Archive, and at the time of writing, the end of 2010, this is just beginning to happen. If you can find two good sets of scans of the same edition of a book the odds are very high that you will be able to make up a complete set of good scans for that book.

I have several scanners, of which the Plustek OpticBook 3600 scanner is the most useful. You can either scan a page at a time, for books that don’t or shouldn’t be opened out fully, or you can use it as a flat-bed scanner for books that can be opened out fully without damaging them.

For books that are in very poor condition I have a small scanner that can scan single sheets of paper, reading both sides at once. This is the Visioneer Strobe XP 300. It runs quite quickly, and is very portable. It can be powered from the USB port, but I prefer to power it from the mains, via its own converter.

After some years of experimentation I have found that, for text pages and grey-scale pages such as illustrations, the optimum setting is 300 dpi, scanning into a jpg. The actual brightness and contrast settings must be found for any scanner that you use.

No matter how I obtain the scans of a book I pass them through a miraculously wonderful program, written by someone in Russia and called Scan Tailor, that takes one-page or two-page scans of a book, and goes by itself through the various processes necessary to generate a set of perfect scans of each page.

I used to generate from the scans a set of small-scale images that could be used to check what was actually on each page against whatever my transcription was showing at the point being questioned by my editing program. Later I moved on to having a good quality pdf of the book. Nowadays I generate a DjVu of the book, and use that. This has three great advantages: one, that the DjVu is easy to navigate; two, that it is typically seven times smaller than the same set of images made into a pdf; three, that there are some eReaders that can view a DjVu, thus enabling those with that type of eReader to read the book exactly as it was printed, of which more anon.

The free DjVu Solo 3.1 is used to generate the DjVu file of a book from the cleaned-up scans of each page, and the free WinDjView is used to look at the book.

For moving from scans to text we use an OCR program, of which currently many people use ABBYY FineReader 10. This program has a number of advantages over some of its predecessors in the FineReader series. One, it can be set to identify more characters than just the 26 letters a to z. By telling it that some of the text will be in French it correctly identifies even common words like “café” and “tête-à-tête”, and by adding German and Greek to this list it will identify most of the characters found in English texts. Two, it can be set to remove headers and footers from the page, which it does with near perfection (it just occasionally makes a mistake here, but these are easy to detect and remove). Three, it is very good at producing the actual text of the book. Four, where it does make mistakes, they are such characteristic ones that they are easy to deal with. Mostly, anyway.

In addition I have made many improvements to my own editing program, not so much in its basic module “et_prevu.com” as in its peripheral programs which provide tests of the data for various classes of error. Starting with the newly OCRed text of a book, you can run through these pre-tests in a couple of hours. Having done these you can use et_prevu.com to rattle through the book chapter by chapter.

On completing the chapters, there are still a few tests that really only work on a book that is almost ready to publish. One of these is the Rare Word test, which I have used for about ten years now. In many cases a wrongly transcribed word that has escaped the spelling test, and the tests for frequently misread words, is a rarely used word. This test is quite productive, and enables nearly all the misreads still in the text to be detected and dealt with.
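
The principle of a rare-word test can be sketched in a few lines of Python. This is only an illustration and not the test as implemented in the Athelstane System: the function name rare_words, the definition of a word as a run of letters with optional internal apostrophes, and the threshold max_count are all assumptions made for the example.

```python
import re
from collections import Counter

def rare_words(text, max_count=1):
    """Return the words that occur at most max_count times, sorted alphabetically."""
    # Assumption: a word is a run of letters, optionally with internal apostrophes.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    counts = Counter(words)
    return sorted(word for word, n in counts.items() if n <= max_count)

# "eat" is a typical OCR misread of "cat"; it passes a spelling test
# but stands out because it occurs only once.
sample = "The cat sat on the mat. The eat sat on the mat with the cat."
print(rare_words(sample))
```

On a whole book the list will of course be much longer, and most of the words in it will be perfectly good ones; the point is that the list is short enough to read through by eye.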

Finally there are two processes that will help with the final cleaning up of the text. The first is to have the book read aloud by a text-to-speech program. It is surprising how good the human brain is at detecting misreads, when it hears them read out aloud, even if your mind is supposed to be doing something else at the same time.

The second is of course to actually read the book. But there is an intermediate process which enables you to make sure that the paragraphing of the book is in good order. You set up the computer screen into two windows. The DjVu of the book is in the window on the right, and the text of the book is in Notepad (yes, Notepad) in a window on the left. You check the beginning and end of each of the paragraphs, and you read the text between them if you want to. When this process is complete you have what I call a Level Three text.

If you want to read the book before publishing it there is an excellent program called YBook, written by the Australian computer wizard Simon Haynes and published with his Spacejock software. This makes the book you have prepared appear on the screen just like a book, with a left-hand and a right-hand page.

Publishing the book takes three stages. In order to publish to Project Gutenberg you must first obtain a copyright clearance, which is provided by a hard-working lady called Juliet Sutherland, who not only deals with requests for copyright clearance but who also prepares books for editing by the project’s unique system of collective editing. Juliet obtains old books that may be ready for scrapping, scans them and sends them on to the people who deal with editing them. This is not the place to deal with the pros and cons of this methodology, but just to say that participants are allocated a scanned page from the actual book, plus the OCRed text from that page. What they have to do is to check that these two agree. What happens after that is out of their hands, and they can just opt to do another page, or to leave off for the day.

To obtain a copyright clearance you have to submit certain information about the book, together with scans of the title page, and of the page following the title page. If I think more than that is necessary to prove the date I am claiming for the book I scan the relevant page from the Copac database of books that are in the British copyright libraries, and some British University libraries, and correlate the book with that Copac scan by sending Juliet a scan of the last page of the book. I nearly always use Copac, but it is also possible to find books in the Library of Congress catalogue.

Once you have got your book ready for publication, in order to get it onto Project Gutenberg, you need to produce it in two forms, plain text and html, along with certain metadata about the book and about the person who has done the preparation of the book. You need to check the xhtml using three tests. One, to check that the book is in good correct xhtml. Two, to check that all its links are correct, i.e. those between chapters, and those relating to the images contained within the book. Three, to check that its css attributes are all allowable. Once that is done you can combine all these four files, and the images if any, into a single zip file, which is to be made available for an assessor to check that all is well. For this task I use David Widger, a retired paediatrician, and he has served me well for over 560 books. He quite often finds mistakes that I have made, but more often than not he finds that what I have prepared is ready to go straight onto Project Gutenberg.
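
To give an idea of what the second of these tests involves, here is a minimal sketch in Python using only the standard library. It is not the checker used for Project Gutenberg submissions, and it covers only part of the job: it lists links of the form href="#..." that point at no id or anchor name in the same file, and collects the image references so that they can be compared with the image files. The names LinkCollector and broken_internal_links are invented for the example.

```python
from html.parser import HTMLParser

class LinkCollector(HTMLParser):
    """Collect anchor targets, in-file links and image references from one page."""

    def __init__(self):
        super().__init__()
        self.anchors = set()   # id values (and <a name="..."> values) that can be linked to
        self.internal = []     # targets of href="#..." links within the file
        self.images = []       # src values of <img> elements

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        if attrs.get("id"):
            self.anchors.add(attrs["id"])
        if tag == "a" and attrs.get("name"):
            self.anchors.add(attrs["name"])
        href = attrs.get("href") or ""
        if tag == "a" and href.startswith("#"):
            self.internal.append(href[1:])
        if tag == "img" and attrs.get("src"):
            self.images.append(attrs["src"])

def broken_internal_links(html_text):
    """Return the in-file link targets that have no matching id or anchor name."""
    collector = LinkCollector()
    collector.feed(html_text)
    return [target for target in collector.internal if target not in collector.anchors]

page = ('<h2 id="ch1">One</h2><a href="#ch1">1</a>'
        '<a href="#ch2">2</a><img src="images/p1.jpg" />')
print(broken_internal_links(page))
```

A full check would also confirm that every file named in an image reference actually exists in the zip file, and an established validator does a great deal more than this.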

But for my own use I like to prepare the book for reading on my Bebook device. I have the earliest of the Bebook eReaders, but the operating system that I use is one that has been made available by the Ukrainian Tigran (living in London, I believe), and by Tirwal, who have made enormous improvements in the display of DjVu files. You remember that these are made up directly from the scans of the pages of the book.

One of the advantages of the DjVu method of displaying a book is that it does not matter what language the book is in, nor what script is being used. A few years ago I did a display page on my website illustrating how DjVu can be used for producing books in scripts other than European ones, using several books written in Urdu as display material. In actual fact Urdu can be represented as a machine-generated script, but it was the idea that I was trying to get across.

Books displayed as DjVu files can easily be embellished with contents pages. This means that you can go straight to a given chapter, and even to a given page within that chapter. For instance my Classical Greek New Testament DjVu is easily navigable because the Testament is divided into Books, and the Books into Chapters. It could have a contents file that could take you to a given verse, but I do not think this is worth it. You might have it take you to the nearest verse in multiples of ten or twenty. The same applies to my Latin New Testament, and then also to my Gaelic one, and to the ones in several other languages.

The other method I like to use on the Bebook is to display the book as an fb2 file. This is a Russian format that is really quite easy to generate. It also holds the images, which have first to be coded up using a method called Base64 (B64). Here 64 different alphabetic and numeric characters are used (a-z, A-Z, 0-9, and two others, + and /). Each of these characters encodes 6 bits, so that three bytes, which is 24 bits, can be represented by four of these characters. This is easy to encode and decode. When the book has been prepared as an fb2 file, it can be zipped, which reduces its size considerably, and the Bebook reads it very efficiently from that zipped file.
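
The three-bytes-to-four-characters arithmetic is easily seen with the base64 module in Python's standard library. The sketch below only demonstrates the encoding itself; the function name encode_image_bytes is invented for the example, and the step of placing the encoded text inside the fb2 file is not shown.

```python
import base64

def encode_image_bytes(data: bytes) -> str:
    """Encode raw bytes as Base64 text, ready to be placed in a text-based file."""
    return base64.b64encode(data).decode("ascii")

# Three bytes (24 bits) become four characters of 6 bits each.
print(encode_image_bytes(b"Man"))           # TWFu

# 300 bytes become 400 characters, an increase of one third.
print(len(encode_image_bytes(bytes(300))))  # 400

# When the data is not a multiple of three bytes the output is padded with "=".
print(encode_image_bytes(b"Ma"))            # TWE=

# Decoding recovers the original bytes exactly.
assert base64.b64decode("TWFu") == b"Man"
```

The one-third increase in size is one reason why zipping the finished fb2 file is worthwhile.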

The Bebook can display books in quite a number of different formats, and theoretically it can display 7000 pages on a single charge. However there are some formats that are rather more expensive to display. pdf is one of these, and you often see people writing into the Bebook user group website complaining that they are not getting anywhere near 7000 pages on a charge. Obviously they are not using one of the economical methods of storing their books. I once tested the assertion that you can get 7000 pages on a charge, displaying a sequence of several books stored in rtf format, and found that you could get rather more than 7000 pages, over 8000 in fact.

There are a number of different formats in which books can be displayed. Gutenberg makes books available in several of these. Some years ago the Microsoft “lit” format was very popular. Then we had the Mobipocket “prc” or “mobi” format, which was useful for reading books with an iPAQ device. The three best formats for the Bebook are: plain text; rtf (rich text format); and fb2, which is my favourite.

Nowadays many of the eReader devices take books in epub format, and many of the companies selling eBooks do so in that format. One advantage to the publisher of it is that it is easy to tie the book up with DRM, meaning that the book can be sold for display on a limited number of devices. DRM stands for “Digital Rights Management.”

I was preparing some books for a friend, and found that he wanted them in pdf format. This again is quite easy to make, and there are many very good programs for making these pdf books. You need to understand that there are two quite different formats called pdf. One is made directly from the scans of the pages of the book. The other is made using a text-editing program that prints using a filter that converts the text into a form that makes it look like the pages of a book. One clue to the difference between these two kinds of pdf is that the latter is searchable.

There are very strict rules that must be obeyed by a book in epub format. Some of the programs that purport to generate an epub version of a book produce epubs that do not obey these rules. There is a website that can validate an epub. You just tell it whereabouts on your computer it can find the epub you wish to check, and it checks it. I did at first find the diagnostics it sent out rather difficult to understand, but once I got the hang of them and took notice of them, I found I could generate valid epubs every time. I generate an html file of the book, just as I would if I were making a prc file for the Mobipocket bookshelf. This file then goes through the excellent free program “Calibre” and emerges as a nearly correct epub. There is another excellent and free program called “Sigil” that can in little more than a minute convert that epub to one that will pass cleanly through the epub validation website.

So far I have said nothing about producing the eBook as an audiobook. The general principle is simple. You can use NextUp TextAloud version 2 or 3 to produce the mp3 files of the book. To do this it needs a “voice”, which must be suited to the language the book is written in, and to the audience that will be listening to it. Obviously I am only interested in voices that read in a variety of English that is similar to the one I speak myself. I tend to use these two: Acapela Peter and Cerevoice William. I tend to record at 24 kbps, and with a sampling rate of 16 kHz. I think this is the lowest useful setting, and it would be easy to select a higher setting if you prefer.
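
As a rough guide to what a bit rate means for storage, the size of a constant-bit-rate recording follows directly from the rate and the playing time. The sketch below is simple arithmetic, assuming that 1 kbps means 1000 bits per second and that a megabyte means a million bytes; the function name is purely illustrative.

```python
def mp3_size_megabytes(hours, kbps=24):
    """Approximate size of constant-bit-rate audio, in megabytes of 10**6 bytes."""
    bytes_per_second = kbps * 1000 / 8      # 24 kbps is 3000 bytes per second
    return hours * 3600 * bytes_per_second / 1_000_000

print(mp3_size_megabytes(1))            # one hour at 24 kbps
print(mp3_size_megabytes(1, kbps=64))   # the same hour at a higher setting
```

At 24 kbps an hour of speech comes to about 10.8 megabytes, and the size rises in direct proportion if a higher bit rate is chosen.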

TextAloud recently issued their Version 3, but I cannot see that it is more useful than Version 2, so I have re-established Version 2 on my computer.

This leaves you with a series of mp3 files. On some mp3-player devices you need to set up a directory structure, by author, and then by book. You put the various chapters and other components of the book, already so-named that they display in the correct order, into this latter sub-folder. An example of an mp3 player of this type is the TrekStor L. This device has a slot for a micro-SD card, so that you could travel around with quite a large library if you so desired, just by having a collection of micro-SD cards, each with different books on it. But micro-SD cards are quite expensive, though the TrekStor is actually quite inexpensive.
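
One way of naming the files so that they display in the correct order is to give each a zero-padded number as a prefix. The following Python sketch builds the author/book/file paths without touching the disk. The function name player_paths, the underscore separator and the placeholder names are all assumptions for the example; a particular player may have its own requirements.

```python
from pathlib import PurePosixPath

def player_paths(author, book, chapter_titles):
    """Build author/book/NN_title.mp3 paths whose names sort in reading order."""
    # Zero-pad the number so that 10 sorts after 9 rather than after 1.
    width = max(2, len(str(len(chapter_titles))))
    folder = PurePosixPath(author) / book
    return [
        folder / f"{number:0{width}d}_{title}.mp3"
        for number, title in enumerate(chapter_titles, start=1)
    ]

paths = player_paths("Author Name", "Book Title", ["Preface", "Chapter 1", "Chapter 2"])
for path in paths:
    print(path)
```

Without the padding a device that sorts by name would play chapter 10 immediately after chapter 1.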

In order to put your home-made audiobooks onto your iPod, you need to convert them to the m4b format. Here all the mp3 files are put into a single m4b file, using the free “MP3 to Ipod Audio Book Converter”, and then you add bookmarks so that you can jump to the start of any desired chapter, using the inexpensive “Chapter Master”. An audiobook made in this way works perfectly on an iPod. It can also be played on the PC using the QuickTime player, or of course using iTunes.

While I was writing this article I was shown the Sonos device for maintaining and playing a collection of music. I should think it would also work well for a collection of audiobooks. My friend had a hand-held device with which you could choose which band and which of their songs you wanted to hear. All the music was held on a terabyte disk somewhere else in the house. The music you wanted to hear was played on hi-fi devices throughout the house. All these devices were communicating with one another wirelessly. On looking up the Wikipedia article about Sonos I saw that it can handle aac files, which is one of the audiobook formats, but I did not see m4b, which is the format used for home-made audiobooks.

In all the above I have been writing about advances over the past five years in scanning, editing and otherwise handling eBooks. But I feel I must mention Audible, which offers for sale audiobooks that are very well read and presented. I have a contract with them whereby I pay them a small sum of money each month, in exchange for which I can download one of their audiobooks, no matter what its nominal price may be. I find this to be a very agreeable arrangement, because there is usually only one audiobook I wish to download from them in a given month, and that will be the book we are to read soon with the book-reading group I belong to.

My own software, the Athelstane System, has been available for free for some years now, but the latest update is slow to appear, because so many small changes have been made over the past year that every time I think I ought to re-publish it, I instantly think of yet another way in which it could be improved.

Another area of the work in which many changes have recently taken place is in making eBooks easily accessible to the blind or partially sighted. I wrote an article about this in February 2010, and I do not have any further information to add to that article.

Some eReader devices have a feature whereby they hold the books in a text format, typically a third to half a megabyte, and have software that can read the book aloud, thus avoiding the need to store the very much larger set of mp3 files or the m4b file for the book, which will very often approach 200 megabytes in size. I have not checked out all of these devices, but those that I have seen had a major fault: they took no notice of the blank line between paragraphs, thus instantly reducing their reading into an incomprehensible mess. Some years ago a version of Fonix iSpeak was available for the iPAQ. With a very small amount of extra programming it could have been almost perfect, for it could have been made not only to observe the space between paragraphs, but also to use the standard iSpeak markup, which enables the TTS (text-to-speech) program to pronounce words correctly. Unfortunately that program is no longer on the market.
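
Taking notice of the blank line is not a difficult thing to program. As a minimal sketch in Python, assuming a plain text file in which paragraphs are separated by one or more blank lines, the text can be divided into paragraphs before it is passed to the speech engine, which can then pause between them. The function name paragraphs is invented for the example, and no particular device or TTS program is being described.

```python
import re

def paragraphs(text):
    """Split plain text into paragraphs at blank lines, joining wrapped lines."""
    blocks = re.split(r"\n\s*\n", text.strip())
    # Collapse the line breaks inside each paragraph into single spaces.
    return [" ".join(block.split()) for block in blocks if block.strip()]

sample = "The first paragraph runs\nover two lines.\n\nThe second paragraph.\n"
for number, paragraph in enumerate(paragraphs(sample), start=1):
    print(number, paragraph)
```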