So I've been playing around with speech synthesis software tonight. eSpeak looks a lot nicer than Festival, just in that it's much easier to adjust its speed, correct its pronunciation, and play with variations: whisper, different accents, pitch, word spacing, creaky voice. I got to thinking, what would a logical policy for updating its lexicon look like? I thought the results I came up with were interesting. Maybe some other people will be interested too.
eSpeak gets “neuroscience” and “pseudoscience” wrong,
pronouncing them with a [[s,i@ns]]
rather than a [[s'aI@ns]].
It also gets “omniscience” and “prescience” wrong,
or at least pronounces them
rather differently than I would:
$ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "The
science of neuroscience is not a scientific or quasiscientific
pseudoscience. Conscientiously pursue omniscience and prescience."
D@2 s'aI@ns Vv n'3:r-@s,i@ns I2z n,0t#@ saI@nt'IfIk _:_:O@ kw,eIzaIsi@nt'IfIk sj'u:d@s,i@ns
k,0nsI2;'EnS@sli p3sj'u: '0mnIs,i@ns _:_:and pr'i:si@ns
I would pronounce the “science” in “omniscience” and “prescience” as
[[S@ns]]
and put the accent on another syllable.
There’s a special rule for “scien” beginning a word, and for “conscience”:
en_list:conscience k0nS@ns
en_rules: _sc) ie (n aI@
en_rules:?8 _sc) ie (n aIa2
However, Jonathan Duddington has said
he wants to keep the eSpeak distribution small,
so he “wouldn’t want to include too many unusual or specialist words”.
(See http://sourceforge.net/forum/forum.php?thread_id=1700280&forum_id=538920
where he talks about why he doesn’t want to import the Festival lexicon.)
Already, espeak-data/en_dict is 80KB,
which is half the size of the speak binary.
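There are several possible strategies that a maintainer could adopt in order to improve the coverage of their special-case word files without letting them get large. Suppose that there is a scalar metric of “goodness” that can be applied independently to each special case. Here are three plausible strategies, ordered from least to most stringent.
Policy “C-”: add any special case that is better than the worst one already in the file, and never remove anything.
Policy “C+”: add any special case that is better than the worst one already in the file, but remove the worst one each time to compensate, so the file stays the same size.
Policy “A”: add only special cases that are better than the average of those already in the file, so that each addition raises the average.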
I am going to approximate “goodness” with “frequency”, on the theory that mispronouncing a rare word is always better than mispronouncing a common one.
Policy “A” is analogous to Google’s famous hiring policy of hiring only candidates who would raise the company’s average ability.
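Here is a minimal Python sketch of the three policies (“C-”, “C+”, and “A”) as they are applied later in this note. The function names and the sample scores are made up for illustration, and “average” is taken to be the arithmetic mean, which is an assumption; swapping in the median is a one-line change.

```python
from statistics import mean

def admit_c_minus(goodness, candidate):
    """Policy "C-": accept anything better than the worst current entry."""
    return candidate > min(goodness)

def admit_c_plus(goodness, candidate):
    """Policy "C+": same test as "C-", but evict the worst entry on success.

    Returns the new list. Its size never changes, so the threshold rises.
    """
    if candidate <= min(goodness):
        return list(goodness)
    kept = sorted(goodness)[1:]  # drop the single worst entry
    return kept + [candidate]

def admit_a(goodness, candidate):
    """Policy "A": accept only what raises the average goodness."""
    return candidate > mean(goodness)

# Made-up goodness scores (think: corpus frequencies), not real en_list data.
current = [8, 11, 52, 455, 1200, 7999]
candidate = 53

print(admit_c_minus(current, candidate))
print(admit_c_plus(current, candidate))
print(admit_a(current, candidate))
```

With these numbers the candidate passes “C-” and “C+” but fails “A”, because it beats the worst entry without reaching the mean.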
Are these “science” words significant enough to include?
en_list only contains 2869 lines, maybe 2400 of which are words.
So maybe only the top 2400 or so exceptions
to the normal rules of pronunciation
are currently considered for inclusion.
Some time ago, I tabulated the frequencies of words in the British National Corpus and put the results online at http://canonical.org/~kragen/sw/wordlist. It has 109557 lines, ordered from the most common words (“the”, “of”, and “and”, each occurring millions of times) to the least common (with a cutoff of 5 occurrences, because most of the words with fewer were actually misspellings).
I selected 20 lines at random from en_list
with the following results:
kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ ~/bin/unsort < en_list | head -20
this %DIs $nounf $strend $verbsf
barbeque bA@b@kju:
con k0n
?5 thu TIR // Thursday
_: koUl@n
Ukraine ju:kr'eIn
peculiar pI2kju:lI3
unread Vnr'Ed $only
inference Inf@r@ns
José hoUs'eI
unsure VnS'U@
survey $verb
ë $accent
epistle I2pIs@L
Munich mju:nIk
scenic si:nIk
synthesise sInT@saIz
corps kO@ $only
rajah rA:dZA:
transports transpo@t|s $nounf
Where do these special cases appear in the British National Corpus tabulation? Here are some results, edited for readability:
kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -niE ' (this|barbeque
|con|thu|ukraine|peculiar|unread|inference|José|unsure|survey|epistle|munich
|scenic|synthesise|corps|rajah|transports)$' /home/kragen/devel/wordlist
22:463240 this
1178:7999 survey
5102:1441 peculiar
5831:1200 corps
7165:888 ukraine
8977:634 munich
9045:627 unsure
10552:494 inference
11134:455 con
15127:275 scenic
29899:82 epistle
31386:74 transports
34270:62 synthesise
37255:52 unread
73679:11 thu
74154:11 rajah
87737:8 barbeque
The 50th percentile of the sample of 20
(of which two weren't words,
and a third wasn't found)
seems to be line 11134,
with the word “con”.
That is,
the exceptions in en_list
are mostly drawn
from the most frequently used
eleven thousand words in the language.
(Maybe words like “barbeque”, “rajah”, and “unread”
should be dropped.)
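The median can be checked mechanically from the grep output above. This is a small sketch; the dictionary below just copies the 17 line numbers that were found, and the variable names are my own.

```python
from statistics import median

# Line number in the word list for each sampled entry that grep found.
ranks = {
    "this": 22, "survey": 1178, "peculiar": 5102, "corps": 5831,
    "ukraine": 7165, "munich": 8977, "unsure": 9045, "inference": 10552,
    "con": 11134, "scenic": 15127, "epistle": 29899, "transports": 31386,
    "synthesise": 34270, "unread": 37255, "thu": 73679, "rajah": 74154,
    "barbeque": 87737,
}

middle = median(ranks.values())  # 17 values, so this is the 9th smallest
median_word = [w for w, r in ranks.items() if r == middle][0]
rarest_word = max(ranks, key=ranks.get)

print(middle, median_word, rarest_word)
```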
So under the policies “C+” and “C-”, any word that is more common than “barbeque”, at position 87737 in the British National Corpus tabulation (or maybe some word even a bit rarer than that), should be added to the file. (Under policy “C+”, some word would be removed to compensate, raising the threshold.) Under the policy “A”, the threshold would be “con”, at position 11134.
Unfortunately, José is missing. I think I excluded accented characters when I tabulated the frequencies initially.
Anyway, that gives us a way to compare the “science” words:
kragen@thrifty:~/pkgs/espeak-1.37-source/dictsource$ grep -n scien[tc]
/home/kragen/devel/wordlist
870:10597 science
1614:5922 scientific
2584:3547 scientists
3865:2088 sciences
3977:2005 scientist
5342:1355 conscience
13365:338 conscientious
16976:227 scientifically
25757:109 consciences
26015:107 conscientiously
27861:93 unscientific
37040:53 omniscient
44349:36 prescient
49031:29 neuroscience
49706:28 prescience
50457:27 scientificity
50587:27 omniscience
53155:24 scientism
62346:17 geoscience
66943:14 scientia
67285:14 neuroscientists
68176:14 conscientiousness
82060:9 geoscientists
84433:8 scientology
84434:8 scienter
86513:8 geosciences
90235:7 neurosciences
93073:7 biosciences
93074:7 bioscience
95039:6 scientifique
95591:6 pseudoscience
103190:5 presciently
103191:5 prescientific
Of these, only those more common than “conscience”
seem to deserve a place in en_list.
How does eSpeak do now?
$ ~/pkgs/espeak-1.37-source/src/speak -v en/en-r+f2 -s 250 -x "Science is
scientific and done by scientists, who work in the sciences. A
scientist with a conscience may be conscientious. Those with
scientifically-minded consciences will conscientiously avoid
unscientific claims of omniscient beings or prescient prophets."
s'aI@ns I2z saI@nt'IfIk _:_:and d'Vn baI s'aI@nt#Ists
_:_:h,u: w'3:k I2nD@2 s'aI@nsI2z
a2 s'aI@nt#Ist wI2D a2 k'0nS@ns m'eI bi: k,0nsI2;'EnS@s
DoUz wI2D saI@nt'IfIkli m'aIndI2d k'0nS@nsI2z wIl k,0nsI2;'EnS@sli; a2v'OId
VnsaI@nt'IfIk kl'eImz Vv '0mnIs,i@nt b'i:;INz _:_:O@ pr'i:si@nt pr'0fIts
It pronounces everything correctly until it gets to "omniscient" and "prescient", and maybe its pronunciations for those are correct, but at least they’re not the pronunciations I would use.
Under policy “A”,
those words are not common enough to add to en_list,
because they would lower
the average frequency of words in en_list
unless you removed a less common word to compensate.
Under policies “C+” and “C-”, not only “omniscient” and “prescient” qualify, but so do “neuroscience”, “geoscience”, and “neuroscientists”, which eSpeak currently mispronounces.
(Including all the exceptions that are as rare as “prescient”
might quadruple the size of en_list,
and perhaps en_dict as a result,
if arbitrary spellings were as common among rare words
as they are among common words.
Think of that as an upper bound.
Including all the exceptions as rare as “neuroscientists”
might multiply its size by seven.
This is the downside of policy “C-”,
but it does not happen with policy “C+”.
On the other hand,
under policy “C+”,
even “prescient” might not survive long after being added.)
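One rough way to get estimates of this kind is to divide a word's rank in the word list by the median rank of the sample (11134, for “con”). That method is an assumption here, not something stated above, and the names in the sketch are made up. Under it, “prescient” comes out at about four and “neuroscientists” at about six.

```python
MEDIAN_RANK = 11134  # rank of "con", the median of the en_list sample

def growth_factor(rank, median_rank=MEDIAN_RANK):
    """Hypothetical estimate: how far past the current median the cutoff moves."""
    return rank / median_rank

for word, rank in [("prescient", 44349), ("neuroscientists", 67285)]:
    print(word, round(growth_factor(rank), 2))
```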
There is a better solution
than adding a bunch of one-word special cases
to en_list.
Probably
in this case
the solution is to change the special case
for "conscience"
to a special case
for "conscien..."
and change the "scien..." rule
to a "...scien..." rule;
that covers all the words
except for "omniscien..."
and "prescien...".
Covering those two
takes only two more rules
in en_rules,
if it's considered worthwhile;
"conscience" is ten times as common
as both of those together,
and "con" three times as common,
while "barbeque" is 18 times less common.
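Those ratios can be checked against the counts listed earlier. This sketch assumes that “both of those together” means the four words “omniscient”, “prescient”, “prescience”, and “omniscience”; that reading is a guess, though it reproduces the figure of 18 exactly.

```python
# Occurrence counts copied from the word-list output earlier in this note.
counts = {
    "conscience": 1355, "con": 455, "barbeque": 8,
    "omniscient": 53, "prescient": 36, "prescience": 28, "omniscience": 27,
}

# Assumption: these four words are the ones being added together.
both = sum(counts[w] for w in
           ("omniscient", "prescient", "prescience", "omniscience"))

print(both)
print(round(counts["conscience"] / both, 1))  # "ten times as common"
print(round(counts["con"] / both, 1))         # "three times as common"
print(both / counts["barbeque"])              # "18 times less common"
```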
I think
there is a need for a larger en_list and en_rules
to be available,
even if they aren't part of the standard distribution.
eSpeak’s current footprint
for a single language
is about 160KB for the executable
and 80KB for the dictionary.
But it would be useful in many cases
even if its dictionary were 800KB
(as perhaps it would be with the Festival lexicon)
or 8MB.
There is also a need for a better user interface
for making changes to the dictionary,
and especially en_rules,
since currently it's hard to know
what words you're changing the pronunciation of
when you change en_rules,
and you have to master a phonological orthography system
to make any contribution at all.
And then there's no git-like infrastructure
for sharing your changes,
and even learning git
is a pretty big barrier to contributions.
If, instead, you could twist a knob to jog back to the last mispronounced word, then hold down a button and say its correct pronunciation, the barrier to contributions would be much lower. You would need a reasonable phonological analysis system (like in a speech-to-text system) to turn the spoken word into the string of phonemes. Then, if you could share your accumulated corrections with all other users of the software with the push of a button, the process of coming up with the tens of thousands of special cases would be a lot quicker.
The above is about eSpeak 1.37 from perhaps 2008. I currently have eSpeak 1.48.03 from 2014 installed, and en_dict is now 116K instead of 80K. The en/en-r voice used above doesn’t exist any more, but the en-us voice is a fairly close equivalent:
$ espeak -v en-us+f2 -s 250 -x "The science of neuroscience is not a scientific
or quasiscientific pseudoscience. Conscientiously pursue
omniscience and prescience."
D@2 s'aI@ns Vv n'U@r@s,aI@ns Iz n,0t#@ saI@nt'IfIk_:_: O@ kw,eIzaIsaI@nt'IfIk s'u:doUs,aI@ns
k,0nsI2;'EnS@sli p3s'u: 0mn'IsI;@ns_:_: and pr'i:si@ns
It now pronounces “neuroscience” and “pseudoscience” correctly. The
relevant part of en_rules is as follows:
sc) ie (nc aI@
ie (ntiC aI@
_sc) ie (n aI@
?8 _sc) ie (n aIa#
I think that means that now the “ie” in any instance of “scienc” will be pronounced as “aI@”, regardless of whether it’s at the beginning of the word. The “_” in the last two entries marks the beginning of a word, as explained in docs/dictionary.html in the eSpeak source code.
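To see which words each of these rules could touch, here is a toy matcher in Python. It is not eSpeak's matching algorithm: it handles only two special symbols, “_” for a word boundary and “C” for any consonant (my understanding of the rule syntax, so treat it as an assumption), it skips the conditional “?8” rule, and it does nothing to choose between rules when several match.

```python
import re

CONSONANT = "[b-df-hj-np-tv-z]"  # assumption: lowercase ASCII letters only

# (pre-context, letters, post-context, phonemes), copied from the rules above.
RULES = [
    ("sc", "ie", "nc", "aI@"),
    ("", "ie", "ntiC", "aI@"),
    ("_sc", "ie", "n", "aI@"),
]

def context_to_regex(context):
    """Translate a rule context into a regular-expression fragment."""
    parts = []
    for ch in context:
        if ch == "_":
            parts.append(r"\b")       # word boundary
        elif ch == "C":
            parts.append(CONSONANT)   # any consonant
        else:
            parts.append(re.escape(ch))
    return "".join(parts)

def rule_applies(rule, word):
    pre, letters, post, _phonemes = rule
    pattern = context_to_regex(pre) + re.escape(letters) + context_to_regex(post)
    return re.search(pattern, word) is not None

def label(rule):
    """Format a rule the way it is written in en_rules."""
    pre, letters, post, _phonemes = rule
    return "%s) %s (%s" % (pre, letters, post) if pre else "%s (%s" % (letters, post)

def rules_for(word):
    return [label(r) for r in RULES if rule_applies(r, word)]

for w in ["science", "neuroscience", "quasiscientific", "omniscient", "prescient"]:
    print(w, rules_for(w))
```

In this toy version none of the three rules match “omniscient” or “prescient”, which fits their being the words that still come out differently from the rest.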
My other example now renders as follows:
$ espeak -v en-us+f2 -s 250 -x "Science is
scientific and done by scientists, who work in the sciences. A
scientist with a conscience may be conscientious. Those with
scientifically-minded consciences will conscientiously avoid
unscientific claims of omniscient beings or prescient prophets."
s'aI@ns Iz saI@nt'IfIk_:_: and d'Vn baI s'aI@ntIsts
h,u: w'3:k InD@2 s'aI@nsI#z
a# s'aI@ntIst wID a# k'0nS@ns m'eI bi: k,0nsI2;'EnS@s
DoUz wID saI@nt'IfIklim'aIndI#d k'0nS@nsI#z wI2l k,0nsI2;'EnS@sli; a#v'OId
(line break inserted)
VnsaI@nt'IfIk kl'eImz Vv 0mn'IS@nt b'i:;I2Nz_:_: O@ pr'i:si@nt pr'0fI2ts
This is different in several details from the above, but overall it
doesn’t seem to be worse in any way. Also, eSpeak now has an --ipa
option, which produces the following output instead:
sˈaɪəns ɪz saɪəntˈɪfɪk ænd dˈʌn baɪ sˈaɪəntɪsts
hˌuː wˈɜːk ɪnðə sˈaɪənsᵻz
ɐ sˈaɪəntɪst wɪð ɐ kˈɑːnʃəns mˈeɪ biː kˌɑːnsɪˈɛnʃəs
ðoʊz wɪð saɪəntˈɪfɪklimˈaɪndᵻd kˈɑːnʃənsᵻz wɪl kˌɑːnsɪˈɛnʃəsli ɐvˈɔɪd ʌnsaɪəntˈɪfɪk klˈeɪmz ʌv ɑːmnˈɪʃənt bˈiːɪŋz ɔːɹ pɹˈiːsiənt pɹˈɑːfɪts
To me, this is dramatically more readable, but it is omitting some
details that are important to at least eSpeak’s pronunciation; for
example, the _:_: pause above doesn’t seem to appear, nor does the
distinction between I (stressed) and I2 (unstressed, but not reduced
like the undocumented I#). You can use --ipa to translate from eSpeak’s
internal format to IPA by enclosing the phonemes in [[]]:
$ espeak -v en-us+f2 -s 250 --ipa "[[h,u: w'3:k InD@2 s'aInsI#z]]"
hˌuː wˈɜːk ɪnðə sˈaɪnsᵻz
This makes it easy to compare the old and new pronunciations simultaneously by ear and by reading the IPA transcription, which reveals a few different improvements:
$ espeak -v en-us+f2 -s 250 --ipa "[[D@2 s'aI@ns Vv n'3:r-@s,i@ns I2z n,0t#@ saI@nt'IfIk _:_:
O@ kw,eIzaIsi@nt'IfIk sj'u:d@s,i@ns]].
[[k,0nsI2;'EnS@sli p3sj'u: '0mnIs,i@ns _:_:and pr'i:si@ns]].
The science of neuroscience is not a scientific or quasiscientific pseudoscience.
Conscientiously pursue omniscience and prescience."
ðə sˈaɪəns ʌv nˈɜːɹəsˌiəns ɪz nˌɑːɾə saɪəntˈɪfɪk ɔːɹ kwˌeɪzaɪsiəntˈɪfɪk sjˈuːdəsˌiəns
kˌɑːnsɪˈɛnʃəsli pɚsjˈuː ˈɑːmnɪsˌiəns ænd pɹˈiːsiəns
ðə sˈaɪəns ʌv nˈʊɹɹəsˌaɪəns ɪz nˌɑːɾə saɪəntˈɪfɪk ɔːɹ kwˌeɪzaɪsaɪəntˈɪfɪk sˈuːdoʊsˌaɪəns
kˌɑːnsɪˈɛnʃəsli pɚsˈuː ɑːmnˈɪsɪəns ænd pɹˈiːsiəns
This also means you can use it with -q
as a fairly reliable converter
from standard English orthography to IPA:
ðɪs ˈɑːlsoʊ mˈiːnz juː kæn jˈuːz ɪt wɪðkjˈuː æz ɐ ɹˈæpɪd ænd fˈɛɹli ɹɪlˈaɪəbəl kənvˈɜːɾɚ fɹʌm stˈændɚd ˈɪŋɡlɪʃ ɔːɹθˈɑːɡɹəfi tʊ ˌaɪpˌiːˈeɪ
It’s a little slow for use in this mode; converting the first 83955 words of the KJV took 1'57" on my laptop, which is only 718 words per second, roughly two hundred times faster than speech. But this speed is sufficient for solving many problems. The particular problem that made me update this note tonight is that of finding sets of minimal pairs of English words, so that ESL learners can learn to distinguish the phonemes in them. The hard part of this for a computer is finding out what the pronunciations of the English words are; the following command lines generated a decent pronouncing dictionary in just over 5 minutes:
$ espeak -v en-us+f2 --ipa -q < /usr/share/dict/words > words-ipa-2
$ paste /usr/share/dict/words words-ipa-2 > pronunciation-dictionary
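Given a file like pronunciation-dictionary, with one tab-separated word and transcription per line, finding minimal pairs might look like the following sketch. It treats every character of the transcription as one segment, which is a simplification (length marks and diacritics are separate characters), and the sample lines are hand-written for illustration rather than real eSpeak output. The function names are my own.

```python
from collections import defaultdict
from itertools import combinations

def load(lines):
    """Parse 'word<TAB>transcription' lines into a dict."""
    result = {}
    for line in lines:
        word, sep, ipa = line.rstrip("\n").partition("\t")
        if sep and ipa.strip():
            result[word] = ipa.strip()
    return result

def minimal_pairs(pronunciations):
    """Return word pairs whose transcriptions differ in exactly one position."""
    # Words that agree everywhere except position i share the key
    # (text before i, text after i).
    buckets = defaultdict(list)
    for word, ipa in pronunciations.items():
        for i in range(len(ipa)):
            buckets[(ipa[:i], ipa[i + 1:])].append(word)
    pairs = set()
    for words in buckets.values():
        for a, b in combinations(sorted(words), 2):
            if pronunciations[a] != pronunciations[b]:  # skip homophones
                pairs.add((a, b))
    return sorted(pairs)

# Hand-written sample data in the same layout as the paste output.
sample = [
    "bat\t bˈæt\n", "pat\t pˈæt\n", "bit\t bˈɪt\n",
    "bet\t bˈɛt\n", "pit\t pˈɪt\n",
]
pairs = minimal_pairs(load(sample))
for a, b in pairs:
    print(a, b)
```

Bucketing by “everything except one position” avoids comparing every word against every other word, which matters for a dictionary with on the order of a hundred thousand entries.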