Full process for scanning books at 600 dpi
Nick Hodson

Athelstane eBooks method of scanning public domain books, in preparation for posting on the website.

Herewith a brief description of the way I do the scanning, or at least as brief as I can make it while leaving nothing out. I also describe how to create a nice pdf of the book, and an archive set of tiffs and jpegs of the book, and, furthermore, how to create a small size set of images that will be useful for editing, or comfortable reading of the book.

External programs used (Windows 98 or xp):

  • Software associated with Plustek OpticBook 3600.
  • ABBYY FineReader version 7 or 8.
  • ImageToPDF version 2.4.0.
  • Advanced Batch Converter version 3.9.
  • IrfanView.
  • Paint Shop Pro Version 3.11. Later version not required.

Dos (settings vary between Windows 95, 98 and xp).

For each author there is a folder.

For each book there is a sub-folder of its author’s folder.

Since I try to work as much as possible so that my programs will run under early versions of dos, the author’s folder, and the book’s folder have 6 to 8 characters.

Each author has a letter assigned, for instance M is Marryat, B is Ballantyne. If necessary two letters are assigned, for instance TW is Reverend T P Wilson. This assignment is controlled through a file called UP.TXT and an executable called MAKE_UP.COM.

UP.TXT is edited to include a "new" author, and is then processed thus:
MAKE_UP UP.TXT
This creates a batch utility called UP.BAT which is on the path. ("PATH" is a reserved word in dos.)
Thereafter typing UP M (upper or lower case) will jump to the Marryat folder.
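
The letter-to-folder lookup can be illustrated with a minimal Python sketch. It is not MAKE_UP.COM itself: the layout assumed for UP.TXT (one "CODE FOLDER" pair per line) and the folder names in the sample are hypothetical, chosen only to show the case-insensitive lookup.

```python
def parse_up_table(text):
    """Parse lines of the assumed form 'CODE FOLDER' into a dict.

    Codes are stored in upper case so that lookups can ignore case.
    """
    table = {}
    for line in text.splitlines():
        parts = line.split()
        if len(parts) != 2:
            continue  # skip blank or malformed lines
        code, folder = parts
        table[code.upper()] = folder
    return table


def folder_for(table, code):
    """Return the author folder for a one- or two-letter code, in any case."""
    return table[code.upper()]


# Hypothetical sample content; the real UP.TXT layout is not documented here.
SAMPLE_UP_TXT = "M MARRYAT\nB BALLANT\nTW TPWILSON\n"
table = parse_up_table(SAMPLE_UP_TXT)
```

With this table, folder_for(table, "m") and folder_for(table, "M") both give the Marryat folder.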

As regards the book, when its folder is created, I also think of a 5-character name which is a corruption of the book’s name. I avoid the temptation of using a meaningful-looking corruption, such as "peter". The aim is that for all the books I process this 5-character name will be unique. The reason for this is clear when you consider bulk updating of all my books on the web, which sometimes has to take place. This is because the chapter files are named with this 5-character version of the book name. So also were the individual pages. We do not want to exceed the old-dos limit of 8 characters. For instance "MSHEL01.HTM" will be chapter 1 of "The Master of the Shell".
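
The naming rule can be sketched in Python as follows. The function names are my own for this illustration, and the two-digit chapter number is inferred from the "MSHEL01.HTM" example rather than stated as a rule.

```python
from collections import Counter


def chapter_filename(book_code, chapter, ext="HTM"):
    """Build a chapter file name such as MSHEL01.HTM.

    A 5-character code plus two digits gives 7 characters,
    which stays inside the old 8-character limit.
    """
    if len(book_code) != 5:
        raise ValueError("book code must have exactly 5 characters")
    if not 0 <= chapter <= 99:
        raise ValueError("chapter number must fit in two digits")
    return f"{book_code.upper()}{chapter:02d}.{ext}"


def duplicate_codes(codes):
    """Return any book codes used more than once, ignoring case."""
    counts = Counter(code.upper() for code in codes)
    return sorted(code for code, n in counts.items() if n > 1)
```

Running duplicate_codes over the codes of every book processed so far is one way to confirm that a newly invented code really is unique.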

Now I link the 8-character author name, the 8-character book name and the 5-character book name using a batch utility called SET_BOOK. This makes use of debug and creates a new version of a batch file called JUMP. Until it is reset, for instance by starting work on another book, executing JUMP causes a jump to the folder of the current book. It can be reset either by going to another book’s folder and executing a local batch file called BOOK.BAT or in any folder by executing a global batch file called JMP.BAT. It can also be reset by executing NAV which can be used to go to any author and book without knowing the UP-codes.

The next step is to take note of the important values of the book, for instance its actual name, the author’s name, the publisher, the illustrator, the edition used, and the date of the first edition. The program is called STARTNEW. It takes note of where the page numbers are, whether dog-ears are used for primary speech, and whether a new page is started for each chapter. It also records the number of chapters, the numbers of the first and last bodytext pages of the book, the page number at which each chapter begins, and how many pages precede the bodytext and how many follow it.

From this a list is created of the first and last page numbers for each chapter (CHAPTERS.INI) which has to be checked carefully. Since the information for it probably came from the Contents pages of the book, there is no guarantee that they are right.
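
How such a list can be derived from the chapter start pages is sketched below in Python. This is an illustration under my own assumptions, not the code of STARTNEW: in particular I assume that when chapters do not start on a new page, a chapter ends on the same page on which the next one begins.

```python
def chapter_ranges(start_pages, last_page, new_page_per_chapter=True):
    """Return a (first, last) page pair for each chapter.

    start_pages: page number at which each chapter begins, in order.
    last_page:   last bodytext page of the book.
    """
    ranges = []
    for i, first in enumerate(start_pages):
        if i + 1 < len(start_pages):
            next_start = start_pages[i + 1]
            # Assumption: a shared page belongs to both chapters.
            last = next_start - 1 if new_page_per_chapter else next_start
        else:
            last = last_page
        ranges.append((first, last))
    return ranges
```

A range whose last page is lower than its first is a sure sign that a start page was copied wrongly from the Contents pages.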

Another screen then comes up into which a few more details need to be entered, much to do with the number of images, and also estimating the length of the book. This is done by entering the number of lines on a full page, and also five consecutive lines of the book, from which the average length of a line is deduced. A constant figure of 85% is applied to get the approximate length of the book in bytes. Applying some arbitrary but roughly accurate costing factors an idea of the effort required to produce the eBook is made, and expressed as a value in UK pounds. The file thus created is SCANINFO.INI.
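
The length estimate can be sketched in Python as below. The exact formula used by my program is not given above, so the product of pages, lines per page, average line length and the 85% factor is an assumption, and the costing in pounds is left out because its factors are not stated.

```python
def average_line_length(sample_lines):
    """Average number of characters in the sample lines (five in practice)."""
    return sum(len(line) for line in sample_lines) / len(sample_lines)


def estimate_book_bytes(pages, lines_per_page, sample_lines, fill=0.85):
    """Rough size of the book text in bytes.

    fill is the constant 85% figure: not every line on every page is full.
    """
    full_size = pages * lines_per_page * average_line_length(sample_lines)
    return round(full_size * fill)
```

For instance, 300 pages of 30 lines, with sample lines averaging 50 characters, come out at roughly 380,000 bytes.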

Now we can get down to the business of scanning.

We need to scan a few pages of the book at various brightness settings and with the name of the scans set to "test". The other settings will be 600 dpi, and "save as tiff" into the book’s folder. From these few scans we run FineReader, and see for what brightness the scans are best. It normally comes out at 15, but it has been known to be 20, 30, 40 and even 50 in exceptional cases. Sometimes it even comes out as the default setting of 0.

Next I scan the bodytext at the chosen brightness setting. We start at the first bodytext page, probably numbered 7, or some other odd number near that. I do not use the setting that is available of inverting all even-numbered scans. This is because there is an obvious and elementary mistake in the OpticBook’s software, in connection with this setting. The scan names are to have the value "ppp".

After the first page I scan each opening of the book as follows.

  • Left hand page on the scanner, upside down.
  • Double tap on the grey scan-text button.
  • When left hand page is done, quickly rotate the book
    and place it on the scanner with the right hand page right way up.
    Since a double tap had been given there will be no delay in the scan beginning,
    and there is only just enough time to rotate the book and position it.
  • If necessary the page currently being scanned can be aborted,
    and also the page last completed can be deleted.

Using this technique scanning proceeds comfortably at 150 pages an hour. Could be faster if I used a faster computer.

The scans of course come into hand numbered 0001, 0002, 0003, ... up to the last.

They all have a space in their name, a ridiculous mistake in the OpticBook’s software. This is at once remedied by running a program called "SPACE". At this point we also run a little program that analyses how long we have taken to scan each page. Since the times stored under Windows and Dos are rounded to the nearest two seconds, we can’t get an accurate figure for any one scan, but by grouping them into those done fairly quickly, those that took a little longer, and those where I obviously left the job for a meal or such-like, and then averaging, we can get a very fair estimate.
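
The timing analysis can be sketched in Python like this. The two thresholds of 30 and 120 seconds are hypothetical values picked for the illustration, and the grouping is only my reading of the description above, not the code of the actual program.

```python
def classify_gaps(timestamps, quick_max=30, slow_max=120):
    """Group the gaps between consecutive file times, given in seconds.

    quick:  scanned without delay
    slower: took a little longer
    break:  the job was obviously left for a while
    """
    groups = {"quick": [], "slower": [], "break": []}
    for earlier, later in zip(timestamps, timestamps[1:]):
        gap = later - earlier
        if gap <= quick_max:
            groups["quick"].append(gap)
        elif gap <= slow_max:
            groups["slower"].append(gap)
        else:
            groups["break"].append(gap)
    return groups


def average_scan_seconds(groups):
    """Average time per scan, leaving the breaks out of the reckoning."""
    working = groups["quick"] + groups["slower"]
    return sum(working) / len(working) if working else None
```

Because each file time is only good to two seconds, a single gap means little, but the average over a few hundred scans is a fair figure. An average of 24 seconds corresponds to 150 pages an hour.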

The next problem to remedy is that the tiffs have been saved uncompressed, and also that every even-numbered one is upside down. All this is dealt with by the use of TIF_PACK.COM for the odd-numbered scans and TIFINVRT.COM for the even-numbered scans.
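
The odd and even handling can be pictured with a small Python sketch. It treats a scan as a list of pixel rows and assumes that turning an upside-down page the right way up means a rotation through 180 degrees; it does not read or compress tiff files, which the two .COM programs do.

```python
def rotate_180(rows):
    """Turn a bitmap (a list of pixel rows) through 180 degrees.

    Both the order of the rows and the order of the pixels within
    each row are reversed, so the text is not mirrored.
    """
    return [list(reversed(row)) for row in reversed(rows)]


def tool_for_scan(scan_number):
    """Name of the program applied to a scan, by its sequence number."""
    return "TIF_PACK" if scan_number % 2 == 1 else "TIFINVRT"
```

Applying rotate_180 twice gives back the original bitmap, which is a handy check.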

Of course all this takes place quite automatically. All I do is to type STAGE_1C. I retire to a sofa for a rest. The progress is indicated by an enormous-sized display of the program number and the scan number. When done I rise back to my chair and carry on as follows.

First I check that I haven’t missed one out, by looking at every tenth. I remedy any mistakes (see next paragraph). Then I have a really nifty little program (PAGE_NUM.COM) that assigns to each scan its correct name. Some books start at page 1, some at page 7 (the most usual), some at page 9 and I once even came across one that started at page 13.

When you were looking at every tenth file you used a program called TIF_MONO.COM. Because of the settings you loaded with STARTNEW, it knows where on the page to look for the page-number, and this should be what you see. I note all these down as they appear. Sometimes the first page of a chapter does not have a page number or has it in a different place. You have to be aware of this. The Escape key causes an exit from TIF_MONO, and you might care to remember that control-[ is also Escape.

PAGE_NUM.COM creates a batch file called TEMP.BAT, and you will need to execute this to effect the name-changes.
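
The kind of rename list involved can be sketched in Python. The file-name patterns are assumptions made for the illustration: scans named like ppp0001.tif once the space has been removed, and page files named with the 5-character book code followed by a three-digit page number.

```python
def rename_commands(book_code, first_page, scan_count, series="ppp", ext="tif"):
    """Return one REN line per scan, mapping scan order to page numbers.

    Scan 0001 is taken to be the first bodytext page, and every
    later scan is taken to be the next page in sequence.
    """
    commands = []
    for index in range(1, scan_count + 1):
        page = first_page + index - 1
        old_name = f"{series}{index:04d}.{ext}"
        new_name = f"{book_code}{page:03d}.{ext}"
        commands.append(f"REN {old_name} {new_name}")
    return commands
```

Writing the renames to a batch file rather than doing them at once gives a chance to look the list over before anything is changed. The mapping only holds if no page was missed in the scanning, which is why every tenth scan is checked first.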

For the actual scanning I create four series of scans. The bodytext pages are originally named "ppp". Everything before the bodytext is named "aaa", everything after it is named "xxx", and all images are in the "ccc" series. Corrections then go into the "qqq" series, but are immediately renamed manually to their correct names, of which the alphabetic part is the 5-character corruption of the book name. This enables "PAGE_NUM.COM" to work correctly.

One of the subfolders created by STARTNEW is called "straight". I fire up FineReader, and load into it the "aaa" series. This of course does not take long, as there are usually only about six or seven pages in that series, unless there is a lengthy Preface or some such. Highlight all these files by highlighting one of them and then hitting control-A. I then save these now straightened files into "straight". Take care that you are saving them as separate files rather than the default multi-tiff file. Use the root name "aa" not "aaa".

Now start a new FineReader batch, discarding the earlier one, and load up the "xxx" files, if any. These were the publisher’s booklist at the back of the book. Save these into "straight" as separate files with root name "xx" not "xxx".

Now start another new FineReader batch, discarding the earlier one, and load up the bodytext pages. Again, click on one to highlight that one, then hit control-A to highlight all of them. Then save them all into "straight" just as you did the "aaa" and the "xxx" sequences, but this time use the root-name "ppp" once again.

Now you can tell FineReader to start reading all the highlighted page-scans, and while it is doing this go to your dos window and type STAGE_2C. This will rename the aa series, the xx series and the ppp series to their correct names. It will also clean up nearly all the scuff on the edges of the scans, and it will widen the images, which is a help if one of the scans was very close to the edge (that is, text almost off the scan). It then goes on after various evolutions that use some of the 300-dpi software, to create a set of cropped images into a subfolder of the book’s main folder, called "cropped". Retire to the sofa for another rest, a long one, while all this is happening.

Rise once again from your sofa when you see on the screen a static display of a reduced-size page image, right-rotated. There will be some fine dotted lines. The page should be numbered 17, as this is normally one of the pages that carry printer’s marks. Using the up, down, left, right keys, you can manoeuvre the box enclosed by the dotted lines, and using control up, control down, control left and control right you can affect the size of the box. The object is to nicely contain the text from the top of the page to the printer’s mark. Once set you should not need to set the box size again. Hitting enter twice you will be presented with the next page, normally 33, and so on. When the printer’s marks sequence is done it will start on the first and last pages of each chapter. When that is done it will look through all the page images located in "cropped", and will decide which pages need a manual treatment as well as those already done. It will tell you to enter the word "CHECK" if it finds any. It is this last part that does not work all that well with an xp. It just runs slower, that’s all.

When you first do the above setting, take note of the width and depth figures appearing on the top of the page. These apply actually to the half-sized images at 300-dpi that were created for you while you were on the sofa, so the actual figures it is using on the full-sized images are twice this.

When it is satisfied with the bodytext you should enter "MC_ALL aaa" and then when that has completed "MC_ALL xxx". MC stands for "manual crop". Use lower-case if you like.

Now there are two ways of checking that the pages have all been nicely cropped. One is to use the Windows program IrfanView to look through all the pages, having set the screen to display the images so as to fill the screen. If you see anything you don’t like you can reprocess it with MC nn1 nn2 nn3..., where nn1,... are the page-numbers needing work. The other is to create a pdf of the book, to do which you need to move the jpg scan(s) of the book’s images into "cropped". Fire up ImageToPDF and load the four sequences. Click on the bar at the top right of the screen to make sure that the scans are in the correct order. Click on one of the files, and then highlight all by hitting control-A. Then set the output pdf by clicking on the middle one of the five icons appearing on the right of the screen. Then enter the book details by clicking on the second from right of the icons appearing on the bottom of the screen. Finally click on the rightmost icon, which will cause the creation of the pdf to commence. You can then inspect the images, and remedy anything that needs doing. There is no need to exit from ImageToPDF, because it may not be long before you need it again.

If necessary use an old version of Paint Shop Pro to clean the cropped image.

You may like to use my function DESKEW nn1 nn2 nn3, which again controls fine dotted lines with the arrow keys and control arrow keys to work out exactly what angle to rotate through. This program uses sines and cosines to rotate the image about its centre point, just as it should do. When you have got the dotted lines just right hit Escape to initiate the action. It is rare for more than three or four images to benefit from this treatment.
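
The arithmetic of rotating about a centre point can be shown in a few lines of Python. This is the standard rotation formula and a sketch only, not the code of DESKEW; the function names are mine.

```python
import math


def skew_angle_degrees(x1, y1, x2, y2):
    """Angle of the line from (x1, y1) to (x2, y2) against the horizontal."""
    return math.degrees(math.atan2(y2 - y1, x2 - x1))


def rotate_about_centre(x, y, cx, cy, degrees):
    """Rotate the point (x, y) about the centre (cx, cy).

    The point is shifted so that the centre is the origin, rotated
    with sines and cosines, and then shifted back again.
    """
    angle = math.radians(degrees)
    dx, dy = x - cx, y - cy
    new_x = cx + dx * math.cos(angle) - dy * math.sin(angle)
    new_y = cy + dx * math.sin(angle) + dy * math.cos(angle)
    return new_x, new_y
```

To straighten a page, each pixel would be rotated through the negative of the measured skew angle. In image coordinates, where y increases downwards, a positive angle in this formula appears on the screen as a clockwise turn.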

When you have got all the images in "cropped" to appear correctly, cleaned up of any nasty blobs, and so forth, you can make all the page-scans the same length. This is not strictly necessary but it improves the reading of the book with the pdf. To do this enter the following:

cropped (this will move you into the "cropped" folder.)
extend abook depth (where abook is the 5-character code for the book, and depth is the reported modal depth of the pages, a number like 3900.)

Remember that the aaa sequence and the xxx sequence are already so extended.
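
What is meant by the modal depth, and by extending a page to it, can be sketched in Python. A page is treated here as a list of pixel rows. Padding with white rows at the bottom, and leaving longer pages alone, are assumptions made for the illustration rather than a description of what EXTEND does internally.

```python
from statistics import mode


def modal_depth(depths):
    """The depth, in pixel rows, that occurs most often among the pages."""
    return mode(depths)


def extend_page(rows, depth, white=0):
    """Pad a page with white rows at the bottom until it has `depth` rows.

    Pages that are already that deep, or deeper, are returned unchanged.
    """
    if len(rows) >= depth:
        return rows
    width = len(rows[0]) if rows else 0
    padding = [[white] * width for _ in range(depth - len(rows))]
    return rows + padding
```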

Now you can create the pdf again.

When that is done, you can fire up Advanced Batch Converter, which you will need for two purposes.

The first one is to create a set of scaled down images that will be used for reference during the editing. This can be done quickly but it is far better to run with the following settings which will slow it down somewhat, but will produce a really useful set of images, that reflect very well what is to be seen on the actual pages.

1. Select only the bodytext pages.

2. With "Use Advanced Option" set pages to be converted to a max of 640 wide and 940 deep. Scaling to be done with Lanczos-3 method. Anti-aliasing to be ON.

3. Output folder to be the subfolder of the book’s main folder, called "tof". Name as source images.

4. Output format to be PackBits. This gets converted by the next function into Group-3. See below.

5. Set dpi to be 100 or 150, whichever you like.
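
The effect of the 640 by 940 limit on the size of a page can be worked out with a short Python sketch. It assumes that the proportions of the page are kept and that small images are not enlarged, which I believe is what happens but is not stated above. The Lanczos-3 resampling itself is not shown.

```python
def fit_within(width, height, max_width=640, max_height=940):
    """Return the scaled (width, height) that fits inside the limits.

    The aspect ratio is preserved; an image already inside the
    limits is returned at its original size.
    """
    if width <= max_width and height <= max_height:
        return width, height
    # Compare the two scale factors by cross-multiplying, to stay in integers.
    if width * max_height > height * max_width:
        # The width is the tighter constraint.
        return max_width, round(height * max_width / width)
    # The height is the tighter constraint.
    return round(width * max_height / height), max_height
```

A cropped page of 4000 by 6000 pixels, for example, comes down to 627 by 940.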

After this has processed go back into the dos window and run "INTO_TOF". When this has completed you will see the book appear in a normally very legible form. You see the top or the bottom halves of the pages. Space-key goes forward, backspace-key goes backward, and there is a good set of other keys that have various effects.

For some reason it is now necessary to exit from Advanced Batch Converter, and then go back into it again. This is because of some obscure bug in the program which stops it from running properly if you don’t do that. When it does fire up, unclick "Use Advanced Options", and set the output to be into the book folder’s subfolder called "big". Set the output files to be Group-4 tiffs. Set dpi to be 600.

When that has run you have a nice set of Group-4 tiffs of the whole book in the "big" folder. Move the original jpg files of the pictures into this folder, and create a zip file ABOOKBIG.ZIP (where ABOOK represents the 5-character name of the book), containing the individual tiffs and the jpegs of the book. You can then delete tiffs and jpegs from the following folders: book’s home folder, and subfolders "backups", "straight", "little", "cropped" and "big".

The first four of those subfolders can be emptied by executing

  • DONEWITH BACKUPS
  • DONEWITH STRAIGHT
  • DONEWITH LITTLE
  • DONEWITH CROPPED

I would strongly advise archiving the pdf file and the above zip file before you start to delete anything.

I must mention a bug in the ImageToPDF program. This may cause pages in the main bodytext sequence that also contain images to display as white-on-black instead of black-on-white. It does this if there would be more black than white on that page, and it is pretty mindless. You can promote that page to greyscale which usually gets rid of the problem, or just use Paint Shop Pro to remove some or all of the image. There is always a way round this if you look hard enough.
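
Pages likely to be caught by this can be picked out beforehand with a check along the following lines, sketched in Python. It treats a page as a list of pixel rows with 1 for black, and the one-half threshold is simply my reading of "more black than white"; the exact rule inside ImageToPDF is not known to me.

```python
def black_fraction(rows, black=1):
    """Fraction of the pixels on the page that are black."""
    total = sum(len(row) for row in rows)
    if total == 0:
        return 0.0
    return sum(row.count(black) for row in rows) / total


def at_risk_of_inversion(rows):
    """True if the page has more black than white on it."""
    return black_fraction(rows) > 0.5
```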

There is another bug in ImageToPDF that causes AnyToDjVu to hesitate a bit, but does not cause it to abort.

Our last task is to obtain smaller size versions of the colour and/or grey images, that will be suitable for displaying from within the html files. There are various ways of doing this, but the first thing is to know on what pages of the book you want them to appear. What I do currently is to have the pdf display the image you are to work on. Then I hit the "Print Screen" key, which copies what is on the screen into the clipboard. Then I have IrfanView with a clear screen, and hit control-V, which causes the screen image to appear. Next I crop the picture I am interested in, by drawing a box round the image with the cursor, and then hitting control-Y. Straighten if necessary with control-U. Save using S into the book’s subfolder "q" with the name abooknnn.jpg, where abook is the book’s five-character name and nnn is the required page number. As I wrote elsewhere:

The reduced size "ccc" images are then placed into a sub-folder called "q" and given names in a similar fashion to the tiffs, ie the 5-character corruption of the book name, and the page number to which the image refers. I don’t need to do anything more than this, as right at the last knockings, just as the output xhtml file is being created, the images are posted to their correct places in the chapter file.

I sometimes produce two-up versions of the pdf file of the book, but this is not within the scope of this article, and they are not always very useful. It just depends on the book.

N.