Detailed Description of how the Athelstane System works

by NH

The Athelstane system of creating e-Books has been built over the years since 1997. By sticking to a defined file structure and to several other conventions, we can be sure that the e-Books will come out at the end of the process with a minimum number of errors. We can’t claim zero, especially as what one person will call an error another will call correct. Our object is to produce books that can be read or listened-to with a minimum of annoyance arising from perceived anomalies in the text.

We strive to detect and correct typos that were in the original book. By “book” here of course we mean the edition of the book that was used. Perhaps another publisher did not have those typos in his edition. The sort of typos that we mean are missing commas or stops, often at the bottom of the page, where the movable type was hammered flat too roughly. Sometimes we find quotation marks within a text at inappropriate places. The system will find these, but it takes a literate editor to correct them.

From the time editing commences, you will always have access to a display of the page you are editing, so that any possible deviations from the original text in the book can be checked.

This process mostly takes place in a DOS environment. This is available within W98 and W95, and also in XP, but the latter has some eccentricities that might be construed as bugs, and we have had to devise workarounds for them. All the DOS part of the system will run even on an old Windows 3.1 machine, so it is quite universal.

There are a number of stages in the creation of an e-book. The initial stage is to define the book itself, and establish whereabouts in the computer’s hard disk you will be working. You do this by creating the folder which will be the book’s home folder. The folder above that must be the author’s home folder. The folder above that must be the website’s home folder, or analogous to it, and the folder above that must be called “C:\wavefile”. Several of our important files that we call upon during the editing process will reside in C:\wavefile. In actual fact this hierarchy of files will probably not be the hierarchy that is used to maintain your website. You can either use a different part of this computer, or an analogous part on the computer you use for working on the internet.
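To make the folder hierarchy concrete, here is a small sketch in Python (not the language of our own programs). Only C:\wavefile comes from the description above; the function name and the three folder names in the example are made up for illustration.

```python
from pathlib import PureWindowsPath

# The top folder is fixed by the description above; everything else is illustrative.
ROOT = PureWindowsPath(r"C:\wavefile")

def book_home(site, author, book):
    # website folder, then the author's home folder, then the book's home folder
    return ROOT / site / author / book

home = book_home("website", "author01", "book0001")
print(home)             # the book's home folder
print(home.parent)      # the author's home folder
print(home.parents[2])  # three levels up, which should be the top folder
```

Checking that the third parent is the top folder is a quick way of confirming that a working folder sits at the right depth.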

To declare a book’s home page we use a batch file called SET_BOOK. But first you will need to have decided what to call it. The author must have a 6-8 byte abbreviated name, and so must the book. Mostly we use 8-byte names for authors and 8-byte names for books, though we can get away with six or seven bytes. But we MUST have a second name for the book, of exactly five bytes. All of the chapter files, be they text, be they html or be they asc or audio files will use this 5-byte string as part of their name. Whole book files will mostly use the eight-byte version, though there are some exceptions. So SET_BOOK is followed by three strings: 8-byte author, 5-byte book, and 8-byte book. What it then does is to set up a way of entering these strings as environment variables, and also of always returning to this working folder for the book by typing JUMP. In fact if you look at JUMP.BAT you can see this is exactly what it does.
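The naming rules can be checked before SET_BOOK is ever run. The following is only a sketch in Python of such a check, under the assumption that one character is one byte; the function name and the sample names are hypothetical and are not part of SET_BOOK itself.

```python
def check_set_book_args(author, book5, book8):
    """Return a list of complaints; an empty list means the names are usable."""
    problems = []
    # Author: 6 to 8 bytes, usually 8.
    if not 6 <= len(author) <= 8:
        problems.append("author name must be 6 to 8 bytes")
    # Short book name: exactly 5 bytes, used in every chapter file name.
    if len(book5) != 5:
        problems.append("short book name must be exactly 5 bytes")
    # Long book name: 6 to 8 bytes, used for whole-book files.
    if not 6 <= len(book8) <= 8:
        problems.append("long book name must be 6 to 8 bytes")
    return problems

print(check_set_book_args("author01", "bk001", "book0001"))
print(check_set_book_args("auth", "bk1", "averylongbookname"))
```

The arguments are given in the same order as SET_BOOK takes them: author, 5-byte book, 8-byte book.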

When you load the system into your computer you must put all the dictionary files and other lists onto C:\WAVEFILE, along with all the screen files. Then we must put all the program files (that is, bat, com and exe files) into a folder that either is already on the path environment variable, or we must add something in autoexec to put it there. To save trouble we just use the same directory as all the computer’s own operating system files are on, but you may not like doing this.

The home page of PCN - you can select different kinds of activity from here

We are going to run a program called PCN.BAT, which will guide us through the different stages of creating and editing our e-book. It will stay loaded all the time.

The PCN screen for starting a new e-Book

The last of the PCN screens is called “S”, which stands for startup. The first of its activities is called “Load Starting Data.” We shall call this stage “Pre_Scan.” In this stage you will load the page numbers of the start and end of each chapter, as well as the names of the author, the publisher and the illustrator, the dates of this edition and of the first edition, and you will answer “yes” or “no” to three important questions about the layout of the book: does each chapter start on a new page? does the book obey the usual conventions about quotation marks? and does it obey the conventions about what appears on the top and bottom of each page?

We now come to the real business. Stage One is to use the Plustek scanner to produce the tiff images of the page. At the end of this stage you will have one packed tiff image at 300 dpi for each page. You will also have the tiff images for such text pages as the title page, the list of chapters, the list of illustrations, and for any other prefaces, introductions etc the author may have included. You will also have created a pdf of the book, from title page to the end of the text. You will have looked through this pdf to see if any pages need to be re-scanned because as they are they are unusable. Do not worry about pages that need deskewing, despeckling or cropping. Do something about it only if the scan is too faint, or if part of the text is missing, or if there is anything else that could be removed by re-scanning. We give full instructions later on.

In Stage Two we strive to clean up the tiffs. This is very largely an automatic process, whereby straightening (deskewing) and despeckling are followed by cropping, which is automatic on this pass. If there is any garbage in the margins of the text the cropping will take it out if it is minor, but will crop outside it if it is major. This enables the next stage to detect the pages that still have garbage. Deskewing, despeckling and cropping are an important part of getting tiff images that will OCR correctly. Any garbage on the page may appear as garbage in your texts, and be a nuisance.

In Stage Three the computer looks at all the cropped files, and does a statistical analysis of their widths and depths. If all the pages were devoid of any garbage in the margins they would all come out with almost the same widths and heights, except of course for the first and last pages of each chapter, which would be shorter. In the analysis the program looks at all the depths and decides what is likely to be the depth for a normal clean page, and then it decides what is likely to be the width. You will see the information this is based on, as there will be two displays of the list of files with their depths and widths, one sorted by descending depth, and the other sorted by descending width. The program selects the pages that come outside these two limits, and gets you to do a manual crop for each of them. The manual cropping program is very intuitive and easy to use. In all cases you can see exactly what garbage has appeared in the margins and needs to be cropped.
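The idea behind this analysis can be sketched in a few lines of Python. This is not our program: we assume here that the "normal" width and depth are the medians, and that a page is deviant if it exceeds either by more than a small tolerance. The function name, the 2% tolerance and the sample figures are all invented for the illustration.

```python
from statistics import median

def deviant_pages(sizes, tolerance=0.02):
    """sizes maps a page name to (width, depth) in pixels after automatic cropping."""
    normal_width = median(w for w, _ in sizes.values())
    normal_depth = median(d for _, d in sizes.values())
    width_limit = normal_width * (1 + tolerance)
    depth_limit = normal_depth * (1 + tolerance)
    # Short pages (chapter starts and ends) are fine; only oversize pages are suspect.
    return sorted(name for name, (w, d) in sizes.items()
                  if w > width_limit or d > depth_limit)

sizes = {
    "p001": (1500, 2400),
    "p002": (1502, 2398),
    "p003": (1498, 1200),   # last page of a chapter, short but clean
    "p004": (1700, 2401),   # garbage in a side margin
    "p005": (1501, 2600),   # garbage below the text
}
print(deviant_pages(sizes))
```

Using the median rather than the average means that a handful of very wide or very deep pages does not drag the "normal" figure upwards.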

You won’t have to do anything more than just obey the computer’s instructions to do this. It actually creates a batch file called CHECK and by running that you will be cropping the right pages.

After you’ve finished Stage Three, you can run it again. This time there should not be any deviant pages, although usually there are just a few with nothing much wrong with them except maybe a skew of a fraction of a degree. So this time when you see the sorted list you can see how deep the deepest page is. To make the displays in the pdf, and also in what you see when you are editing, we need to make all the pages the same depth.

Stage Four. Once you’ve told the program how deep you want all the pages to be (minimally) it will go ahead and add a few blank rasters at the bottom of each page, to make them come to that measurement. At this stage we create the displays for each page, that we will be using in editing. When we did the scanning we put each page image in the book’s home page. When we did the straightening we put those page images in a sub-folder called “straight”. And when we did the cropping we put the images in a further sub-folder of “straight”, called “cropped”. So now we move the contents of “cropped” back up two levels to the book’s home page, and delete the two folders “straight” and “cropped”. You should remake the pdf of the book, now.
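The arithmetic of the padding is simple enough to show in a short Python sketch. The function name and the figures are hypothetical; the only rule taken from the text is that the chosen depth cannot be less than the deepest page.

```python
def blank_rasters_needed(depths, target=None):
    """depths maps a page name to its depth in raster lines.

    Returns the number of blank rasters to add at the bottom of each page.
    """
    deepest = max(depths.values())
    if target is None:
        target = deepest          # the minimum depth that fits every page
    if target < deepest:
        raise ValueError("target depth is less than the deepest page")
    return {page: target - depth for page, depth in depths.items()}

print(blank_rasters_needed({"p001": 2400, "p002": 2385, "p003": 1200}))
```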

We are now ready to begin OCR, because we have made sure our page scans are as good as they could possibly be.

We start running ABBYY FineReader, and we load the files for the book. We start the OCR process, which will take a little time.

When it ends we need to save each page’s html file separately. We explain how to do this in the detailed instructions for each section. The next stage also runs automatically, and ends up with good, though not yet perfect text versions for each chapter. There will also be an audio version that can be played using Fonix ISpeak, though we advise not starting this until we have done at least the first chapter’s editing. As we do the editing and as each chapter is finished, so the audio files are updated, so that your ISpeak program is always reading something that’s nearly right.

So here we are with our html pages, but no text pages or chapters ready. We are working from our PCN menu section “S”, and we select “G: Create Chapters from Pages.”

Here’s what now happens:

1. The program counts the pages, and confirms that they are correct in number.

2. It deals with certain hyphenation issues in the data.

3. It creates a text version of each html page.

4. It makes sure that there is one and only one line between paragraphs.

5. Knowing the start and end page numbers of each chapter it assembles the pages into chapters.

6. It inserts a header at the start of each chapter, consisting of the book name, the author name, and the chapter number (in Romans).

7. If the non-standard convention for double quotation marks has been used it corrects for this.

8. A number of small changes to the output are made. For example it would replace “space question mark” with question mark only. This program recycles for each chapter until it reports no changes, usually taking three cycles.

9. The program looks through each text to see if there is anything that can’t pass the spell checker, and if there is it looks in a list of known fixes, and selects one that is appropriate. We have sometimes found that some of the fixes are not appropriate, so we have deleted these from the improvements database. This bit of the process is not so necessary now we have FineReader, but it was vital with older OCR programs. We harvested many thousands of substitutions, but those days are gone now.

10. It then creates a first stab at the ISpeak audio files.
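Steps 4, 5, 6 and 8 of the list above can be sketched together in Python. This is an illustration under assumptions, not our own coding: the layout of the header, the function names and the two fix-ups shown are invented, and the real list of small changes is much longer.

```python
import re

ROMAN = [(1000, "M"), (900, "CM"), (500, "D"), (400, "CD"), (100, "C"), (90, "XC"),
         (50, "L"), (40, "XL"), (10, "X"), (9, "IX"), (5, "V"), (4, "IV"), (1, "I")]

def to_roman(n):
    out = []
    for value, letters in ROMAN:
        while n >= value:
            out.append(letters)
            n -= value
    return "".join(out)

# Hypothetical fix-ups, applied over and over until nothing changes.
FIXES = [
    (re.compile(r" +\?"), "?"),       # "space question mark" becomes "?"
    (re.compile(r"\n{3,}"), "\n\n"),  # one and only one blank line between paragraphs
]

def tidy(text):
    """Recycle through the fixes until a cycle reports no changes."""
    cycles = 0
    while True:
        cycles += 1
        new = text
        for pattern, replacement in FIXES:
            new = pattern.sub(replacement, new)
        if new == text:
            return text, cycles
        text = new

def build_chapter(pages, first, last, book, author, number):
    """pages maps a page number to that page's text."""
    body = "\n".join(pages[p] for p in range(first, last + 1))
    header = f"{book}, by {author}. Chapter {to_roman(number)}"
    text, _ = tidy(header + "\n\n" + body)
    return text

print(build_chapter({1: "Who goes there ?", 2: "A friend."}, 1, 2, "Book", "Author", 4))
```

The last cycle always changes nothing, which is how the loop knows it may stop.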

We have now completed the PCN Menu Startup section S.

The PCN screen for the first checks on a new e-Book

Proceed to PCN Page 2, entitled “Sundry Tests on the Whole Book.”

The first of these is labelled “A: Edit all chapters”, and that’s when the business begins. Because this is so important we will leave the detailed description till the end of this article.

The next one is labelled “B: Find missing stop after T.” Most of these missing stops will have been found when editing the chapters, but a few may have escaped notice. Basically what happens is that sometimes the OCR program reads “t.” as just “t”. Of course the give-away is that the next character will be a capital letter. This task lists all cases where a t is followed by a space and a capital letter, or of course a space quote capital-letter. The literate editor will easily be able to discern which of the cases on the screen need fixing, and which do not, and of course you have the image of that page to help decide.
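In Python the search described here could be sketched with a regular expression. The function name and the amount of context shown are our own choices for the illustration, and we assume that the quote may be either a straight or a curly opening double quote.

```python
import re

# A "t", a space, an optional opening double quote, then a capital letter.
T_THEN_CAPITAL = re.compile(r't ["\u201c]?[A-Z]')

def missing_stop_candidates(text, context=20):
    """Return (position, snippet) for every place worth a look."""
    hits = []
    for m in T_THEN_CAPITAL.finditer(text):
        start = max(0, m.start() - context)
        hits.append((m.start(), text[start:m.end() + context]))
    return hits

for position, snippet in missing_stop_candidates("He went out The rain fell. She met Tom."):
    print(position, repr(snippet))
```

The second hit in the example, "met Tom", is perfectly valid, which is exactly why the list is shown to the editor rather than fixed automatically.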

Then we have “C: Find I anomalies.” This lists out cases where it thinks that the word I in the text might really be an exclamation mark. There are usually a few of these in a book.

Next “D: Find V anomalies.” Most of these will already have been found and dealt with, but we are looking for cases where there is a V that might be a Y or even a W.

Then “E: Find funnies.” These are not jokes, but stray characters that get into a text at OCR-time. Some of course are valid, and have been put by the editor, but they all bear looking at.

“F: Find places where Cap is needed.” In the nineteenth century exclamation marks and query signs were usually followed by a lower case letter, but these days we tend to use a capital after these signs. What are listed here are all the places that could perhaps do with changing a lower case letter to a big one. And the next option, “G” fixes them all.
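Options F and G can be sketched as a find and a fix sharing one pattern. This is a Python illustration with invented function names; it assumes the sign is followed directly by a single space, so that a sign followed by a closing quote is left alone.

```python
import re

# An exclamation or question mark, one space, then a lower case letter.
CAP_NEEDED = re.compile(r"([!?] )([a-z])")

def find_cap_needed(text):
    """Option F: positions of the lower case letters that might need a capital."""
    return [m.start(2) for m in CAP_NEEDED.finditer(text)]

def fix_cap_needed(text):
    """Option G: change them all."""
    return CAP_NEEDED.sub(lambda m: m.group(1) + m.group(2).upper(), text)

sample = 'Alas! what now? nothing. "Oh!" said he.'
print(find_cap_needed(sample))
print(fix_cap_needed(sample))
```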

“H: Overall spelling check” goes back over the spelling again throughout the book, because sometimes in editing, corrections are made that might be wrongly spelt, or might be introducing a word that is not in either of your word-lists.

“I: Check Page numbers” just checks that every chapter has all its pages, and no more, and that all the pages for each chapter are there. If you are nervous about this one it is a good plan to run it first, before the editor at Item “A.”

“J: Numerals report” lists all numerals that appear in the book, aside from page numbers. Most of these will be valid, like dates, or quantities, but it’s not unusual to find an I masquerading as a 1.

“K: test all punctuation tokens.” A punctuation token is what comes between words. Mostly the token consists of just a space. The program reads the whole book, looking at all the tokens, and comparing them to a valid list. It reports any that aren’t in the list, but sometimes these are just rare tokens, that it’s not worth putting into the list.
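A punctuation token test might look like this in Python. It is a sketch only: the valid list here is a tiny made-up one, the function name is ours, and we treat any run of characters that are not letters or digits as a token.

```python
import re
from collections import Counter

# A hypothetical valid list; the real one is far longer.
VALID_TOKENS = {" ", ", ", ". ", "; ", ": ", "? ", "! ", ".", "\n"}

def unknown_tokens(text, valid=VALID_TOKENS):
    """Count every between-words token that is not in the valid list."""
    tokens = re.findall(r"[^A-Za-z0-9]+", text)
    return Counter(token for token in tokens if token not in valid)

print(unknown_tokens("One, two .three four"))
```

Counting the unknown tokens, rather than just listing them, makes it easier to tell a rare but valid token from a one-off OCR slip.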

“L: Rare word test.” The program looks right through the whole book, indexing every occurrence of every word in it, with its location in the book by chapter number and absolute position in the chapter. It then merges the indexes for each chapter, which will have been sorted, and it now knows the frequency of every word that has been used. It then lists out, with its context, every occurrence of every 2-letter word that has been used four times, every 3-letter word used three times, every 4-letter word used twice, and every 5-letter word used once only. Again the literate human reading these reports can easily spot the ones where a typo has got this far, and it is easy to correct it.
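The indexing and the rare-word selection can be sketched in Python as below. The function names are invented, and we have assumed that the figures are upper limits (a 2-letter word is listed if it occurs four times or fewer, and so on); the real program may apply them differently.

```python
import re
from collections import defaultdict

# Word length -> greatest number of uses that still counts as rare (assumed).
LIMITS = {2: 4, 3: 3, 4: 2, 5: 1}

def index_words(chapters):
    """chapters maps a chapter number to its text.

    Returns word -> list of (chapter number, position in chapter).
    """
    index = defaultdict(list)
    for number, text in sorted(chapters.items()):
        for m in re.finditer(r"[A-Za-z]+", text):
            index[m.group().lower()].append((number, m.start()))
    return index

def rare_words(chapters):
    index = index_words(chapters)
    return {word: places for word, places in sorted(index.items())
            if len(places) <= LIMITS.get(len(word), 0)}

chapters = {1: "the cat and the dog", 2: "the tbe the"}
for word, places in rare_words(chapters).items():
    print(word, places)
```

In the example "the" occurs four times and is left out, while the misreading "tbe" is reported with its chapter and position.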

The PCN screen for further tests on a new e-Book

Next section in your PCN screen, “Page Three, Further tests on whole book.”

Some of the items on this page just list out all occurrences of commas or other punctuation. We would advise just skimming through these reports in case anything stands out.

The PCN screen for initiating the Hyphens test on a new e-Book

“Page Seven: Words Index, Hyphens, Finder.” The Hyphens check is the most important here. It makes sure that there is a degree of consistency between the spelling of hyphenated words. For example “foot-stool” might appear as “foot stool” or “footstool”. The activities on this page report on all such variants, and allow you to select the ones you want to adjust. It uses the same indexing method that we described under the Rare Word test.
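A much simplified version of the Hyphens check is sketched here in Python. It does not use our indexing method; it simply takes every hyphenated word it finds and looks for the same two parts joined up or separated by a space. The function name is hypothetical, and variants where no hyphenated form occurs at all would be missed.

```python
import re

def hyphen_variants(text):
    """Report hyphenated words that also appear joined up or as two words."""
    lower = text.lower()
    words = set(re.findall(r"[a-z]+", lower))
    report = {}
    for left, right in sorted(set(re.findall(r"\b([a-z]+)-([a-z]+)\b", lower))):
        found = [f"{left}-{right}"]
        if left + right in words:
            found.append(left + right)
        if re.search(rf"\b{left} {right}\b", lower):
            found.append(f"{left} {right}")
        if len(found) > 1:
            report[f"{left}-{right}"] = found
    return report

sample = ("A foot-stool stood there. The footstool was old. "
          "Her foot stool broke. A well-worn path.")
print(hyphen_variants(sample))
```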

The PCN screen for final tests on a new e-Book

On “Page Eight: Watermarking Actions” we have some more tests. “A” and “D” are important and may well show up some errors. “G: Final check on names” is really looking for cases where the same person has been spelt differently in different parts of the book. Believe me, this does happen. Sometimes, though, it is two entities with similar names. But it is worth being sure.

“K: Check HTML codes balance.” It rarely throws anything up, but sometimes one will have made a mistake and lost part of an html markup.
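A balance check of this kind is usually done with a stack. Here is a minimal sketch in Python, with an invented function name; it assumes that br, hr and img are the only tags without a closing partner, which may not match what our own check does.

```python
import re

TAG = re.compile(r"<(/?)([A-Za-z][A-Za-z0-9]*)[^>]*>")
NO_CLOSING_TAG = {"br", "hr", "img"}   # assumed list

def unbalanced_tags(html):
    """Return (tag name, position) for every tag that has no proper partner."""
    stack, problems = [], []
    for m in TAG.finditer(html):
        closing, name = m.group(1) == "/", m.group(2).lower()
        if name in NO_CLOSING_TAG:
            continue
        if not closing:
            stack.append((name, m.start()))
        elif stack and stack[-1][0] == name:
            stack.pop()
        else:
            problems.append((name, m.start()))
    return problems + stack   # anything left open is also a problem

print(unbalanced_tags("<p>One <i>two</i></p>"))
print(unbalanced_tags("<p>One <i>two</p>"))
```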

The PCN screen for publishing a new e-Book

“Page Nine, Making the HTM and ASC Files.” Nearly the end, and if you don’t want to make any MP3 audio files, which to us is the object of the whole exercise, it is the end.


The Editor Screen

Now we come to a description of the actual editing. There are many keystrokes, mostly with the alt key held down, that enable you to make most kinds of correction with a single action. But the first concept to understand is that you can either be working within a line or you can be outside the text altogether. The tests form a sequence that are always initiated from outside the text, but of course the work they do will be within the text. For example you can’t start the spelling checker from inside the text, but the work you do with it is done within the text. When working in the text you will always be able to see two text lines above the line you are working on. Of course they are blank lines if you are at the top of the text. The way to know whether you are inside or outside the text is by the arrow that points from the left to the current line. If you are inside the text it is made up with equals signs. If you are outside the text it is made up with minus signs.

Another thing that you may find rather strange is that you always (or nearly always) initiate tests from the bottom of the text. In other words, a test of the chapter is done, and any corrections are made. So now we are at the bottom of the text again, and ready to do the next test. However all tests can be initiated from anywhere within the text, and go from there to the end. But tests initiated at the end of the text always go to the top before running. The title line is omitted for some tests.

The first test is a search for the page numbers. You do this after you have looked at the first few lines of the chapter, especially where the original text had drop caps. Once you’ve got that right you start to look at the page numbers, which are at this time on a line of their own between pages. There might be an unwanted blank line below the page number. Mostly the work done automatically when the pages were set up will have taken this out, but if it begins with a capital letter, the blank line before it might need to be removed. Alt-4 is a macro, and it does this. Similarly the macro alt-3 will remove an unwanted blank line before the page number. There may be a line with printer’s marks before the page number, and this can be removed with alt-1. Even more wonderful is that if the page ended with a hyphenated word, alt-5 will pick it up, and stick it below the page number line. To check on what you are doing alt-u will always bring up the screen image of that actual original page, which is quite wonderful, too. In all cases except when you have used alt-5, when you are happy with that page number line just hitting carriage return (aka enter) will take you to the next page number line. If you had used alt-5, you still hit carriage return but follow it with alt-n. And so on till you reach the end of the chapter.

If the book was one of those where chapters do not always start on a page of their own, you may have to remove the beginning of the next chapter. Bring the first line of that chapter to that editing line, the third one down, and hit alt-o, alt-e, alt-i, and the offending text disappears, and you are “At End” ready to do the next check. If you had a similar task to do at the top of the chapter, editing out a few lines from the previous chapter, position the first line of that block as the current line, and hit alt-o. Then make the last line of the block be the current line, and hit alt-p, then alt-e. Again the work is done.

Navigating around the page. In the left hand column of the page there is a series of letters of the alphabet, a to v. If you hit one of these letters it will make the line adjacent to that letter be the current line. Alt-b goes down to the next blank line, and alt-v goes up to the previous one. If you have now got the line you want to work on to be the current line, you can move the cursor at once to any point you like by hitting one of the numbers 0 to 9, as indicated on the second row of the red banner at the top of the page.

So now you’ve got your chapter in pretty good shape, and on the top right of screen is “At End”, so you are ready for the next test. Hit alt-s, which will find any unwanted spaces in the text. They disappear if you just hit enter, and then it goes and looks for the next one.

Now for the spelling check. This is control-up-arrow. You need to know that there are two dictionaries in use. One of them resides in c:\wavefile, and contains all the words that we have encountered since 1997, which is nowhere near the number other people have in their dictionaries, even though we’ve got some German, French, Latin, Italian, Spanish and Portuguese words in it. The other dictionary is local, and resides in the book’s home folder. It contains proper names encountered in the book, and words that, though not proper names, we don’t want to get into the main dictionary. We’ll tell you later how to steer these words into the local dictionary as opposed to the main one.

The spelling checker will not get off a line until all the words in it have been passed as correct. The special case is when there is a very long word, which we don’t accept in the dictionary, so to bypass this word, hit control-z, then enter, then go to the next line, which is labelled ‘d’ and enter control-down. This very rarely happens but you need to know how to do it, because it does happen sometimes.

When the spelling check is complete you will be “At End” again and now you do the Special Word test, initiated by alt-t. This looks for words that have often been found before to be misread. It also looks for other things that may indicate a typo. With the older OCR programs we needed well over 600 words in the list of special words, but with FineReader it is scarcely 200. We are looking for such things as “hut”, “tile”, “out”, “1” that may be wrong or may indicate a typo.

You can run alt-t a second time, which is a similar test but with fewer words to find, so that you might catch what you missed on the first pass.

The next two tests are alt-‘ and alt-˜, looking for ‘ and ˜ respectively. Then we come to alt-h. This works a little differently from the other tests, as you remain outside the text for the most part while using it. Hitting alt-h brings the next line with a hyphen in it to be the current line, and your literate editor, assisted by the alt-u key to bring up a display of the actual original page, has to decide whether it is a genuine hyphen, a genuine dash, or the result of a speck on the page or the scanner. Please note that while still at the editing stage, m-dashes appear as “space hyphen space”, a situation which is corrected when the html and the ascii files are eventually created. Remember, the working text is NOT the one that is published. There are various keystrokes associated with hyphens, but the help-files deal with them, and you just need to know that alt-8 removes a hyphen. FineReader nearly always gets the hyphens and m-dashes exactly right, but there was often a lot of trouble when dealing with them with the older OCR programs.

We now come to alt-r, which actually does find lots of things that otherwise might be missed. Basically it looks for unusual punctuation tokens between words. An example would be “space dot” between words. It also finds places where it thinks there ought to be a full stop, because the next word is capitalised but isn’t known to be a proper noun. We have included here a zip file with the coding for alt-r and also that for a procedure that it uses, check_line. In very simple terms, alt-r causes the computer to look at each line in the data in turn, creating a two-line entity by adding a space to the end of the current line, and then adding the next line onto that. It then puts this long line into check_line, which looks for all sorts of things it might not like. If it finds one it returns out of check_line with the position in the line of the offending item, jumps into in-text mode, and waits for you to edit or accept what it has found. If it finds a new proper-name you can add this to the list by hitting alt-12, but that doesn’t take effect till after you have finished editing the chapter.
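The shape of this test, a two-line entity passed to a routine that returns a position, can be sketched in Python. What follows is not the coding in the zip file. The two checks inside check_line are just the two examples mentioned above, the function alt_r and the sample proper-name list are invented, and a return value of -1 for "nothing found" is our assumption.

```python
import re

SPACE_DOT = re.compile(r" \.")
# A capitalised word following a lower case letter or comma, with no stop between.
CAPITAL_WITHOUT_STOP = re.compile(r"[a-z,] ([A-Z][a-z]+)")

def check_line(line, proper_names):
    """Return the position of the first thing we don't like, or -1."""
    hits = [m.start() for m in SPACE_DOT.finditer(line)]
    hits += [m.start(1) for m in CAPITAL_WITHOUT_STOP.finditer(line)
             if m.group(1) not in proper_names]
    return min(hits) if hits else -1

def alt_r(lines, proper_names=frozenset()):
    """Return (line number, position) for each line that needs the editor."""
    found = []
    for i, line in enumerate(lines):
        following = lines[i + 1] if i + 1 < len(lines) else ""
        # The two-line entity: current line, a space, then the next line.
        position = check_line(line + " " + following, proper_names)
        # Anything further into the next line will be caught on its own turn.
        if 0 <= position <= len(line) + 1:
            found.append((i, position))
    return found

lines = ["He came home late", "The door was shut .", "Then Tom spoke."]
print(alt_r(lines, proper_names={"Tom"}))
```

Joining each line to the next is what lets the missing stop at the end of the first line be noticed, since the capitalised "The" only appears on the line after it.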

The next test is alt-’. This goes through the entire chapter looking for the apostrophe symbol, except where it is used as a singular possessive or otherwise, inside a word. Don’t forget that this character is rather over-used. We are going to mask out the plural possessives (though we don’t have to) and the use of it to indicate that the speaker has dropped some letters from the end of a word. We do this by hitting the hash key, which turns the apostrophe into something else. All we actually want at this stage is for this character to appear in the screen only where it is matched by a previous left single quote. In the rare cases where a speech enclosed by double quotes carries on to the next paragraph, carrying with it a subsidiary speech enclosed by single quotes, then we end the subsidiary speech with alt-9 and the main speech with alt-0. These actually print in our text as quite different characters, alt-numpad-253 and alt-numpad-252, that the next test, alt-q, recognises as dummy right single quote and dummy right double quote.

At the moment this program does not accept speech in any other way except double quotes for main speech and single quotes for subsidiary speech. For some people this is unacceptable and I will have to see what I can do about it.

The final test is alt-q, and I always think of this that it’s like running a compiler on a well-checked program. What it does is to run through the chapter paragraph by paragraph. Wherever you start out from it always backs up to the start of the current paragraph. It checks for entities that should come in pairs, double-quotes, single-quotes, left and right parentheses, left and right square brackets, left and right curly brackets, less than and greater than (if it doesn’t do that I meant it to!), and it also checks for acceptable beginnings and ends to the paragraphs. It also checks when there is a second level of subsidiary speech. There is one place where it falls down, and that is when a speaker breaks out into song or verse.
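The pairing part of this test can be sketched in Python, paragraph by paragraph. This is only an illustration with invented function names: it checks the four kinds of bracket with a stack, and it assumes the double quote is one straight character that must occur an even number of times in a paragraph. Single quotes, the dummy right quotes, and the checks on how a paragraph begins and ends are left out.

```python
PAIRS = {")": "(", "]": "[", "}": "{", ">": "<"}
OPENERS = set(PAIRS.values())

def check_paragraph(paragraph):
    """Return the position of the first unpaired character, or -1."""
    stack = []
    for position, ch in enumerate(paragraph):
        if ch in OPENERS:
            stack.append((ch, position))
        elif ch in PAIRS:
            if not stack or stack[-1][0] != PAIRS[ch]:
                return position          # a closer with no matching opener
            stack.pop()
    if stack:
        return stack[0][1]               # an opener that was never closed
    if paragraph.count('"') % 2:
        return paragraph.rfind('"')      # odd number of double quotes
    return -1

def alt_q(text):
    """Return (paragraph number, position) for each paragraph that fails."""
    failures = []
    for number, paragraph in enumerate(text.split("\n\n")):
        position = check_paragraph(paragraph)
        if position != -1:
            failures.append((number, position))
    return failures

print(alt_q('"Yes," he said (quietly).\n\nHe [left.'))
```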

Of course there are other things to know about, like fast keys for setting up blockquotes or verse, but these are to be found in the help files that run with the programs.

When you have finished editing the chapter, which often takes only ten or fifteen minutes, you hit escape, and confirm that you want to end the editing and go on to the next stage.

Editing the special words list

This will be dealing with the new words you have found. The word-list editor will appear with all the new words (if any) in sorted order. You navigate by page-up, page-down, etc, or by hitting the letter of the alphabet found against the word you want to work on. The work could be deleting it (alt-d), treating it as a verb and getting its other parts of speech (alt-v), editing it (alt-e), inserting a line below it (alt-i), inserting a line above it (alt-b), converting it to lower case (control-l), to start with a capital (control-i or tab, same thing), to be all in upper-case (control-u). To finish, hit alt-x, and confirm that you want to exit. The program now posts all words that didn’t have a capital letter in them to the main dictionary, and all those that did have a capital letter in them to the local word-list. So if there is a word, apart from a proper noun, that you want for that book, but not for other books, just hit tab when that word is the current word, and the deed is done.
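The posting rule at the end is easily shown in a few lines of Python. The function name and the sample words are hypothetical; the rule itself is the one just described, that any word containing a capital letter goes to the local word-list and the rest go to the main dictionary.

```python
def post_words(words):
    """Split the edited list into (main dictionary words, local word-list words)."""
    main = sorted(w for w in words if w == w.lower())
    local = sorted(w for w in words if w != w.lower())
    return main, local

# Capitalising "Taffrail" (the tab key in the editor) keeps it out of the main dictionary.
main, local = post_words(["binnacle", "Athelstane", "Taffrail"])
print("main: ", main)
print("local:", local)
```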

The help page for the word-list editor

You will get a second bite of the cherry as far as main dictionary words goes, because if there were any that you wanted to send there you will see the word editor again, and you can confirm that you want them in the permanent dictionary at that time.

Immediately you finish that chapter the next one comes up, and you must edit that. If you have set the switch for the ISpeak file for a chapter to be remade after editing the chapter, you can listen to a pretty good version of the book, even while you are editing it.

Download the coding for the two main tests in the editor. Alt-r and Alt-q


An Essay by the Webmaster of Athelstane E-Books