There are several useful programs for getting the computer to read text aloud. Latterly they have become very good indeed. I still have minidisks with books I recorded with the best available text-to-speech programs eight or nine years ago, and they are not nearly as good as you can get today, but at least the book could be listened to and enjoyed in the absence of anything better.
We will name here three voices (using two programs) that we find pleasing. The two programs are those produced by Fonix and by NextUp. Fonix ISpeak has several annoying faults, but they can all be worked around. Left to itself, it guesses wrongly at how a word should be pronounced in a surprisingly large number of cases. But over perhaps four years we have painstakingly built a database of over twenty thousand such words, correcting each one with Fonix’s excellent phonetic method. We also have a list of nearly seventy thousand words that it pronounces well, or well enough. The result is that we can be listening to a very acceptable version of the book almost as soon as it emerges from the OCR process. We just need to run the first chapter through our editor, and start playing it as soon as that chapter has been finished. This may be as little as ten minutes after the whole book has emerged from the OCR sequence of programs.
The Fonix ISpeak version can be checked for any new words that are in neither the list of well-pronounced words nor the list of words that have been phonetically fixed. These words are spoken by ISpeak while we listen attentively; well-pronounced words are merged into the appropriate database, while ill-pronounced words are provided with the appropriate phonetic correction. This sounds a bore, but typically a book will yield fifty words that have not been used in any other of the books on the Athelstane website, and a third of these will need phonetic fixes, the rest being OK. So that doesn’t amount to more than a few minutes with any new book.
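The check for new words described above amounts to a set difference. Here is a minimal sketch in Python, assuming the two word lists are held as sets of lowercase words. The function names and the rule for splitting text into words are illustrative only, and are not taken from the tools actually used.

```python
import re

def words_in(text):
    """Return the set of distinct lowercase words in the text."""
    # Letters with optional internal apostrophes (straight or curly), e.g. "don't".
    return set(re.findall(r"[a-z]+(?:['’][a-z]+)*", text.lower()))

def new_words(text, well_pronounced, phonetically_fixed):
    """Words found in neither list, sorted ready for the listening pass."""
    return sorted(words_in(text) - well_pronounced - phonetically_fixed)

# Tiny made-up lists standing in for the real databases.
well = {"the", "ship", "sailed", "from"}
fixed = {"boatswain"}
print(new_words("The boatswain sailed the ship from Leigh.", well, fixed))
```

Each word reported is then listened to once and filed in one list or the other, so the same word never needs checking in a later book.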
However, although we are so used to the ISpeak voice now that it sounds almost natural to us, none of our friends agree. They think it sounds too robotic.
We must mention that there is a version of Fonix ISpeak that runs on a Pocket PC such as the iPAQ. The disadvantage of this version is that it will not accept any markup whatever (if there is a version 2, that may have been fixed), so you have to put up with its mispronunciations, and its lack of pauses after sentences and at the end of paragraphs.
There is a program called “TextAloud MP3” published by an outfit called NextUp. This program can drive a number of different “voices” emanating from various providers, such as AT&T, Lernout & Hauspie, Cepstral, NeoSpeech, ScanSoft, and several others.
A year or so ago a series of voices was published by NeoSpeech. We like “Paul”, an American speaker, from this series, and we have created CDs of many books using this voice. However, we are in the business of providing audiobooks for UK listeners, and that means we are really looking for a good UK English voice. The ones that have been offered in the past, naming no names, have been more annoying than pleasurable, and certainly have not been modelled using native speakers of the best UK English.
But in December 2005 ScanSoft published a set of voices called, generically, “RealSpeak”. There are voices in this series in several different languages. The voice that is in UK English is really very acceptable indeed, and is not far off “The King’s English”, the dialect spoken by educated or aspiring people in Britain. This voice is called “ScanSoft Daniel”, or just “Daniel” for short.
However, nothing is perfect, and the pronunciation of English is a bit quirky, or perhaps we should say that the spelling is quirky. We are collecting a list of words “Daniel” does not speak well, and providing a list of replacements for these words, which is a bit time-consuming at the current stage. We can easily check the pronunciation of two or three thousand words a day, taking up an hour or two, but this means that such studies need to be carried on relentlessly for a month, two hours a day, including re-compiling daily the databases of well-pronounced and of ill-pronounced words. However, books of which all the words have been checked are a real pleasure to listen to. Realistically we find that 2500 words generate from 50 to 70 words that need to be provided with pronunciation instructions. This is very good when you consider that ISpeak would have yielded 800 badly pronounced words.
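Applying a list of replacements of this kind can be sketched as a whole-word substitution. The Python below is only an illustration: it assumes each replacement is a plain respelling keyed by a lowercase word, and the two sample respellings are invented for the example. The format TextAloud actually expects is not described here.

```python
import re

def apply_replacements(text, replacements):
    """Substitute a respelling for each whole word found in the replacement list.

    Keys of `replacements` are assumed to be lowercase words.
    """
    if not replacements:
        return text
    keys = sorted(replacements, key=len, reverse=True)
    # \b on both sides means "boatswain" is matched but "boatswains" is not.
    pattern = re.compile(r"\b(" + "|".join(map(re.escape, keys)) + r")\b",
                         re.IGNORECASE)
    return pattern.sub(lambda m: replacements[m.group(0).lower()], text)

# Hypothetical respellings, for illustration only.
replacements = {"boatswain": "bosun", "forecastle": "foke-sul"}
print(apply_replacements("The boatswain stood on the forecastle.", replacements))
```

Matching whole words only means that plurals and other inflected forms need their own entries, which is one reason such a list grows steadily as more books are checked.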
Another task that has to be seen to with each book is to mark up every instance of every word that has two or more different pronunciations with an indication of which one to use. Such words are “read”, “lead”, “bow” and over 120 others. There are normally 300 to 500 instances of such words in a novel. In many cases the verb has one pronunciation and the noun another. Think of “object” and “subject”. In such cases software can easily decide whether we have the noun or the verb.
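As an illustration of how software might make the noun-or-verb decision, here is a deliberately crude sketch in Python that looks only at the preceding word. The cue-word lists and the function name are assumptions made for this example, not the method actually used, and a real tool would need many more rules. Instances it cannot decide are reported as "unknown" so that they can be marked up by hand.

```python
import re

# Words pronounced differently as noun and as verb (a tiny sample).
NOUN_VERB_WORDS = {"object", "subject"}

# Assumed cue words: a determiner suggests a noun follows,
# "to", a subject pronoun or a modal suggests a verb follows.
DETERMINERS = {"the", "a", "an", "this", "that", "his", "her", "its",
               "my", "our", "their", "your"}
VERB_CUES = {"to", "i", "we", "you", "they", "will", "would", "shall",
             "should", "must", "may", "might", "can", "could"}

def classify(text):
    """Return (token index, word, 'noun' | 'verb' | 'unknown') for each instance."""
    tokens = re.findall(r"[A-Za-z]+", text)
    results = []
    for i, tok in enumerate(tokens):
        if tok.lower() not in NOUN_VERB_WORDS:
            continue
        prev = tokens[i - 1].lower() if i > 0 else ""
        if prev in DETERMINERS:
            kind = "noun"
        elif prev in VERB_CUES:
            kind = "verb"
        else:
            kind = "unknown"
        results.append((i, tok, kind))
    return results

print(classify("I object to the object of your visit."))
# prints [(1, 'object', 'verb'), (4, 'object', 'noun')]
```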
Books that have been marked up in this way, and all of whose words have been checked, make excellent audiobooks, with a commercial value. In the not too distant future we will have enough information on our website to enable anyone to make their own first-class audiobooks with little effort, though a small amount of effort will move the product from “very good” to “excellent”. The books produced will be nearly at Level 3, as defined in the next paragraph.
We make CDs of the MP3 speech files of the chapters of a book, marking them with a Level number, indicating the effort that has gone into making the audiobook as good as possible. Level 1 indicates that the book has been downloaded, split into chapters, and converted to MP3 with no further work. Using “Daniel” these results are quite good. I have listened to all six of Jane Austen’s major works generated at this level within a few days of “Daniel” being made available to the world. Level 2 indicates that the words are well-pronounced as far as we know, but that no intensive testing has been done to prove this. We can only set this level with Fonix ISpeak, because we know there will be only about 17 ill-pronounced words in most books, even if we do no extra work. Level 3.1 is the highest we normally attain, where all the words with two or three possible pronunciations have been carefully marked up, and all the words have been checked for well or ill pronunciation, and suitable markups made.
There is actually a possible Level 3.2, which means that a Level 3.1 version was produced and listened to attentively, with all matters noted being attended to. The result is a very good and very saleable audiobook.
How we save and listen to the audiobooks, whether on CD, in an MP3 player, or on minidisk, is a subject for another article.
In practice, we generate speech files from the raw text files by passing each chapter through two stages. In the first stage, markup not required for speech, such as page numbers, is removed. Various other small changes that have been observed to be useful are made as well. For instance, occurrences of “Mr.” are replaced by “Mr”. In the second stage, markup required by the text-to-speech speaker is added. The program, Fonix ISpeak or NextUp TextAloud MP3, whichever is appropriate, is then run to produce MP3 files of each chapter. Finally we run a little program which adds the various tags required at the end of the MP3 files, indicating the name of the book, the name of the author, and other information. At the same time an analysis is created of how long each chapter of the book will take to play and of how many megabytes it will take up, and a playlist is generated so that the book is easily played on the computer by Windows Media Player.
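The first stage and the final playlist step can be sketched in Python as follows. This is a rough illustration under stated assumptions, not the programs actually used. The page-number markup is assumed to look like "[Page 12]", the MP3 files are assumed to be constant-bitrate so that playing time follows from file size, and the playlist is written in the common extended M3U form. All function names are invented for the example.

```python
import re

def clean_for_speech(chapter_text):
    """Stage one: remove markup not wanted in speech and apply small fixes."""
    # Assumed page-number markup such as "[Page 12]"; the real format is not given.
    text = re.sub(r"\[Page \d+\]\s*", "", chapter_text)
    # "Mr." becomes "Mr"; "Mrs." is left alone because "Mr" is not followed by ".".
    text = re.sub(r"\bMr\.", "Mr", text)
    return text

def playing_seconds(size_bytes, bitrate_kbps):
    """Playing time of a constant-bitrate MP3 file of the given size."""
    return size_bytes * 8 / (bitrate_kbps * 1000)

def playlist(chapters, bitrate_kbps):
    """Build extended M3U text from (filename, size_bytes) pairs."""
    lines = ["#EXTM3U"]
    for name, size in chapters:
        secs = round(playing_seconds(size, bitrate_kbps))
        lines.append(f"#EXTINF:{secs},{name}")
        lines.append(name)
    return "\n".join(lines) + "\n"

print(clean_for_speech("[Page 12] Mr. Darcy bowed."))
print(playlist([("chap01.mp3", 4_800_000)], bitrate_kbps=64))
```

The second stage and the tag-writing step are left out of the sketch, since they depend on the markup each speech program accepts and on the tag format chosen.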
NH, 8th January 2006.