Sunday, December 31, 2017

English pronunciation (1): Hou tu pranownse Inglish

Foreword: "English pronunciation" is a new series of blog entries which includes articles and other resources that I have collected from the Internet about the topic of English pronunciation for those who need to learn more about it. My target audience is Vietnamese teachers and learners of English, but the blog entries in this series will hopefully be useful for anyone learning English.

As any English language learner will agree, English pronunciation is so difficult for non-native speakers because English spelling is such a mess. And, for the Vietnamese learner, English pronunciation is even more difficult because the two languages are so different from each other. Therefore, most of the entries in this series will focus on those two areas: linking English pronunciation and its spelling rules; and comparing English vs Vietnamese pronunciation systems. 

Below is the first entry, written by a foreigner about the English language and why it is so difficult for English language learners compared to, for example, German. You may or may not agree with this claim (that English is more difficult than German), but you should find the article useful for its many examples of irregular English spellings.


Hou tu pranownse Inglish

© 2000 by Mark Rosenfelder

Everybody agrees that English spelling is horrible.
There have been almost as many proposals for spelling reform as there are rewrites of Esperanto. (Tellingly, there has been precisely one success in each category-- Noah Webster and Ido-- and neither caught on universally.) Most of these proposals spend their energy fixing what isn't broken. For instance, they search hard for clever new ways of spelling the ch sound-- even though ch does the job just fine in hundreds of languages. Or, they insist on 'correcting' the Great Vowel Shift, using Italian values for the vowels.
Whenever the subject comes up, someone is sure to bring up all the words in -ough, or George Bernard Shaw's ghoti-- a word which illustrates only Shaw's wiseacre ignorance. English spelling may be a nightmare, but it does have rules, and by those rules, ghoti can only be pronounced like goatee.
The purpose of this page is to describe those rules-- to explain the system behind English spelling, the rules that tell you how to pronounce a written word correctly over 85% of the time.
Many people expect the opposite as well-- to be able to predict the spelling from the pronunciation-- not realizing that few orthographies meet this goal. It's far from true of Spanish, for instance, which is often held up as an example of a good orthography. I stopped fervently admiring Spanish orthography when I saw a sign in a Mexican bakery with about one spelling mistake every third word.
Several different types of people might be interested in this page:
  • foreign learners of English
  • native speakers who never quite mastered English spelling
  • spelling reformers who care to understand the system they want to replace
  • linguists interested in how an inadequate alphabet is manhandled to fit an unruly language.
I've also included a sample lexicon and a set of spelling rules which you can use with my Sound Change Applier to automatically derive the pronunciation.

Thanks to Éamonn McManus, Aaron J. Dinkin, Dennis Paul Himes, Geoff Eddy, Hirofumi Nagamura, and John Cowan for useful comments and ideas, which I've tried to incorporate here.

The sounds of General American

If we're discussing spelling, we have to discuss sounds as well; and this means choosing a reference dialect. I'll use my own, of course-- a version of General American that's unexcitingly close to the standard. I'll call it GA below.
Here's the vowels and consonants of my dialect. For each I give the IPA, the representation in the eccentric phonemic transcription I use in this document, and a couple of sample words.
The IPA is given in Unicode; if it doesn't look right you have a nasty old non-Unicode-compliant browser.


Vowels, semivowels, and syllabics:

 IPA   Transcription   Examples
 e     ä               rate
 æ     â               rat
 i     ë               meet, machine
 ɛ     ê               met, dread
 aj    ï               bite, cycle
 ɪ     î               bit, lick
 o     ö               note, sow
 a     ô               not, clock
 ju    ü               cute, you
 ʌ     û               cut, come
 u     u               coot
 ɔ     ò               caught, dog
 ʊ     ù               cook, put
 ə     @               above, cynic, until
 aw    ôw              crowd, loud
 oj    öy              boy, droid
 j     y               you, million
 w     w               wait, cow
 ɚ     @r              search, manor, bird
 n̩     @n              button, happen
 l̩     @l              battle, final

Consonants:

 IPA   Transcription   Examples
 p     p               paper
 b     b               book
 t     t               take
 d     d               dead
 g     g               get
 k     k               cape, talk, quite
 m     m               moon
 n     n               new
 ŋ     ñ               sing, think
 f     f               four, physics
 v     v               vine
 θ     +               thin
 ð     +               this
 s     s               so
 z     z               zoo
 ʃ     $               shack
 ʒ     $               measure
 tʃ    ç               chew
 dʒ    j               judge
 r     r               ran
 l     l               late
 h     h               hang

Who cares about dialects?

Ideally you shouldn't have to worry about my dialect at all: you could simply take (say) ê to represent whatever you pronounce as the vowel in met. Unfortunately, English dialects are not uniform enough to share a single phonology. There are many words that are not only pronounced differently in different dialects-- that is, they have a distinct phonetic realization-- but also have their own phonemic representation.
Some examples:
  • GA is rhotic-- we pronounce the post-vocalic r's-- while other important dialects are not, notably the British standard, RP.
  • I distinguish cot and caught, Don and Dawn; these vowels (ô, ò) merge in the US West.
  • On the other hand, I merge the vowel sounds in Mary, merry, and marry, which are distinguished in Eastern US dialects and in RP.
  • I pronounce w and wh the same.

Notational conventions

Spellings are in teal italics; pronunciations are in blue Courier. This convention avoids cluttering the text with brackets and quotation marks.
Thus g refers to the letter <g>, while g refers to the sound /g/, and I will write that laugh is pronounced lâf.
Linguists can take the 'pronunciations' as phonemic; e.g. I haven't attempted to indicate aspiration, the flapping of medial t and d, the appearance of clear and dark l, etc. I indicate some but not all vowel reductions (basically, those that are reduced in all forms of the morpheme).
# represents the beginning or end of a word. For instance, #rh represents an rh that begins a word; g# refers to a final g.
Capital letters represent variables; e.g. V represents any vowel.

The computer simulation

Along with this explanatory page, I've put up the sample lexicon and the set of spelling rules for the Sound Change Applier.
The lexicon includes the target pronunciation in GA; I modified the program to compare the results of the rule application with the target. The results:
  • 3079 (or 59%) of the pronunciations are generated perfectly.
  • 4389 (or 85%) are generated perfectly or with only minor errors: vowel length errors, failure to reduce vowels to @, or failure to voice an s.
This is impressive; but it understates the systematicity of English spelling:
  • Many of the errors are off in only one segment. (E.g. the rules predict everything about bachelor except the loss of the middle vowel. Shouldn't they get some credit for getting six segments correct?)
  • Many of the pronunciations are really predictable using rules beyond the scope of the Sound Change Applier. I haven't by any means found every possible rule, or stated them in the best, most general form.
  • The worst offenders in the language are already included in the sample; a larger vocabulary would include a higher percentage of well-behaved spellings.
There is a fuller discussion of the mispredictions at the end of the document.
The odd phonetic transcription, by the way, derives from the dual need to easily represent sounds both in html and in the sound change file. I'm restricted to characters that html supports; and I can't use capital letters, because I need them for variable definitions in the rules. As a mnemonic, think of the umlauts as colons, so that ö is short for o:, 'long o'.
The wacky spellings I used for the vowels, however, are inherent in the logic of English spelling. It would only obscure how the system works if I represented the long and short vowels with IPA forms.

The rules

The bulk of this page is basically a human-readable restatement of the rules in the sound change file.
The order of the rules is important. The rules can be thought of as a recipe: to pronounce a word, you go down the list of rules, seeing if each one in turn applies, and applying it if it does.
The result is sometimes a little backwards in terms of explaining the system, because exceptions come first, before the general rules. That's the best way to teach the computer; but humans tend to do best by learning the most general rule first.
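To make the recipe idea concrete, here is a minimal Python sketch of an ordered rule list. It is not the Sound Change Applier itself: the `RULES` list and the `apply_rules` helper are illustrative assumptions, and the two rules shown are just the wh replacements from rule 1, with the exception (wh before o) placed ahead of the general case.

```python
import re

# Hypothetical two-rule fragment. The exception (wh before o) is listed
# ahead of the general rule (wh elsewhere), as the recipe requires.
RULES = [
    (r"wh(?=o)", "h"),
    (r"wh", "w"),
]

def apply_rules(word, rules):
    """Apply each (pattern, replacement) pair in order; later rules see earlier output."""
    for pattern, replacement in rules:
        word = re.sub(pattern, replacement, word)
    return word

print(apply_rules("whole", RULES))        # hole
print(apply_rules("which", RULES))        # wich
print(apply_rules("whole", RULES[::-1]))  # wole (general rule ran first)
```

Reversing the list lets the general rule consume the wh before the exception can see it, which is why the exceptions have to come first.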
I'll warn you: some of these rules are going to seem mondo obscure. That's because I've tried to find every regularity I could, even if it only explains half a dozen words. The yield of some rules may be small enough that some people would rather just learn the affected words as irregularities. But if anything I'm more interested in the minor regularities; they're puzzles, often unfamiliar ones, and many are the fossils of minor sound changes.
To head off another likely reaction: yes, you can find exceptions to the rules. I'm perfectly aware that ough is not always pronounced ö. The point is, what follows are the default rules that work 85% of the time. Think of ö as the default pronunciation of ough; any other pronunciation of ough is an irregularity.
And finally: I'm aware that some linguists (e.g. Edward Carney) have also worked on these problems; unfortunately, I've only seen their work in summaries. I've tried to be careful and linguistically informed, but I don't claim to have committed a work of scholarship.

Some rewrites

English has more phonemes than the alphabet has available symbols; the usual expedient of the orthography for solving this problem is to use digraphs. (Both the problem and the solution are inherited from Latin, which had hardly finished tossing out the Greek letters it didn't think it needed when it started to borrow Greek words that needed them.)
1. Make the following unconditional replacements:
 ch      ç    
 sh $
 ph f
 th +
 qu kw
 wr r
 wh w
 xh x
 rh r
Before an o, replace wh with h instead: who, whore, whole.
If you're one of those fossils who still use a voiceless w or another strange contortion to distinguish wh and w, you'd modify this rule.
We can do significantly better than the program if we don't do these substitutions when the digraph spans a morpheme boundary. In other words, we shouldn't do the replacement in compound words like bosshood, flathead, uphill, or perhaps.
We can also do better if we replace ch with k in words of Greek and Hebrew origin-- that is, in two-dollar words like archaism or trochaic or Malachi.
The program actually replaces only initial rh, since medial rh is so likely to be found in a compound (and it doesn't occur finally in the sample lexicon).
(xh isn't really a digraph; the rule just reflects the fact that an initial h isn't pronounced after a prefix ending in x, as in exhibit.)
2. Replace x with ks; but after e and before another vowel, use gz instead. (This is not an allophonic rule: compare the near-minimal pair exist and excite.)
3. Ignore apostrophes (can't, cop's, o'clock). Hyphens can however be treated as word separators (mother-in-law is pronounced like mother in law).
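Rules 2 and 3 are simple enough to sketch directly. The Python below is an illustration only, with hypothetical function names; it assumes rule 1 has already run, so that exhibit arrives as exibit.

```python
import re

def apply_rule_2(word):
    """Rule 2: x is gz after e and before a vowel, ks everywhere else."""
    word = re.sub(r"(?<=e)x(?=[aeiou])", "gz", word)
    return word.replace("x", "ks")

def apply_rule_3(text):
    """Rule 3: drop apostrophes and treat hyphens as word separators."""
    return text.replace("'", "").replace("-", " ").split()

print(apply_rule_2("exist"))          # egzist
print(apply_rule_2("excite"))         # ekscite (the c is handled by a later rule)
print(apply_rule_3("mother-in-law"))  # ['mother', 'in', 'law']
```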

The notorious gh

4. Before a vowel, gh becomes g: ghost = göst.
5. gh turns a preceding single vowel long: right = rït.
6. aught and ought become òt: daughter = dòt@r, sought = sòt.
7. Any other ough becomes ö: dough = dö.
8. Elsewhere, gh is simply dropped: freight = frät.
People usually trot out gh when they bitch about English spelling. The culprit is sound change: gh used to do nicely for the x sound (now usually represented kh when we transcribe foreign words), but the sound disappeared in everything but Scots. It usually went quietly, but sometimes, word-finally (laugh, cough, enough, rough, tough, and not much more) it was transformed to f instead.
ough is also notorious, but the usual sound (as seen in rule 7) is ö. Through is a notable exception.
Initial gh is sometimes used to keep the g from softening (ghetto); but generally it's a meaningless variant on g, said to be introduced by Dutch typesetters in the early days of printing. In any case it's no problem, since it's always g. This is one reason Shaw's ghoti is such a fraud: initial gh can never be pronounced f.
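The five gh rules can be sketched as one small function. This is an illustration under stated assumptions, not the rule file: the name `apply_gh_rules` is hypothetical, only gh is handled (the remaining letters are left for later rules), and "single vowel" is read as "a vowel letter not preceded by another vowel letter".

```python
import re

LONG = {"a": "ä", "e": "ë", "i": "ï", "o": "ö", "u": "ü"}

def apply_gh_rules(word):
    # Rule 4: gh before a vowel becomes g.
    word = re.sub(r"gh(?=[aeiou])", "g", word)
    # Rule 5: a single vowel (not the tail of a digraph) before gh becomes long.
    word = re.sub(r"(?<![aeiou])([aeiou])(?=gh)",
                  lambda m: LONG[m.group(1)], word)
    # Rule 6: aught and ought become òt.
    word = re.sub(r"(?:au|ou)ght", "òt", word)
    # Rule 7: any other ough becomes ö.
    word = word.replace("ough", "ö")
    # Rule 8: elsewhere, gh is simply dropped.
    return word.replace("gh", "")

for w in ("ghost", "right", "daughter", "sought", "dough", "freight", "ghoti"):
    print(w, "->", apply_gh_rules(w))
```

The outputs are intermediate forms (gost, rït, dòter, sòt, dö, freit, goti); the vowel rules later in the list finish the job. Note that ghoti comes out with an initial g.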

Unpronounceable initials

9. In initial gn, kn, mn, pt, ps, tm, pronounce the second letter only: gnostic = nôstîk, psycho = sïkö, knight = nït.
Most of these are Greek borrowings-- Greek is much freer with initial clusters than English is-- but kn derives from Old English.

Replacing y

10. Replace y with ï if it ends a one-syllable word: ply = plï.
11. ey is pronounced ë; ay is ä; and oy is öy: say, monkey boy = sä mûnkë böy.
12. Replace y with i if it's not adjacent to a vowel-- we'll worry later about how to pronounce the i.
Thus, system = sîst@m but you, where the y adjoins a vowel, is yu.

Simplification of stl

13. The t in stl is lost before a final vowel: bustle = bûs@l, bristly = brîslë.
This could perhaps be generalized; but in slow speech I leave the t in (say) coastline or Christlike. I'm also tempted to generalize to all stops, but the only instance in the sample lexicon is muscle, and it's pretty silly to have a rule that applies to a single word.

(Af)frication before i

14. ci or ti becomes $ before a vowel: gracious = grä$@s, nation = nä$@n.
15. tu becomes çu before a vowel, or before a liquid (r, l) followed by a vowel: mutual = müçu@l, mature = m@çur.
16. s becomes $ (or its voiced counterpart, the sound in measure, if it's preceded by a vowel):
  • before io-- passion = pâ$@n, vision = vî$@n. Note that the i is lost.
  • before ur-- assure = @$ur, leisure = lë$@r.
  • after k and before a vowel: sexual = sêk$u@l.
At some point English affricated a number of consonants before an i or y that preceded another vowel, including the [y] sound that begins ü. Sometimes the y has been lost since. This process seems to be no longer productive-- compare costume, Casio. (Or is it? In quick speech I do say kôsçùm.)
Rule 14 shows another reason ghoti is a fraud: ti only fricativizes when it's followed by a vowel.

Voicing of s

17. s is voiced between two vowels (amuse, design, prison), except after a (base, parasite).
It's easy to find exceptions to this rule: disagree, opposite, analysis-- there's even words where the rule applies only for verbs (abuse, house). The rule as stated has more successes than failures, and I haven't been able to find merely lexical rules that do much better. A better rule might take the language of origin into account: the voicing tends to occur in French and Latin words (resent, please, reason, miserable), but not if they're from Greek (analysis, isosceles) or more exotic languages (papoose, Osaka).
The voicing of s is so almost predictable that there are orthographic conventions (borrowed from French) to indicate that we really do want an s: double the s (cf. Moses vs. mosses), or use c instead (race vs. rase). Annoyingly, there are a few cases of unexpectedly voiced ss (dessert, dissolve).
As a corollary of this rule, the American use of -ize for British -ise was unnecessary, although of course it is more foolproof.

You know me, al

18. al is pronounced òl before r, s, m, a dental stop, or final ll: also, already, wall, bald, although, almost.
19. alk becomes òk, except initially: walk = wòk.
I suspect this is a sound change, obscured by later borrowings like alcohol.

Softening of velars

20. c becomes s before a front vowel, k elsewhere: cell = sêl, acid = âsîd, but cow = kôw, backer = bâk@r, clear = klër.
21. Similarly, g becomes j before a front vowel, g elsewhere: gel = jêl, turgid = t@rjîd, but got = gôt, twig = twîg, gleam = glëm.
22. If the g doesn't begin the word, and the triggering e precedes o or a, the e is lost: changeable = çänj@b@l, dungeon = dûnj@n (but geology = jëôl@jë).
23. Initial gu or final gue is pronounced g: guest = gêst, plague = pläg. (Medially, it tends to be gw instead: language, anguish.)
Front vowels are i and e; note that y was changed to i by rule 12. We owe these rules to a sound change, and not even our own-- it derives from the history of French.
The last two rules allow g to be used for two sounds:
  • ga ge gi go gu can be written ga gue gui go gu
  • ja je ji jo ju can be written gea ge gi geo geu.
The inserted e or u are orthographic only; they make sure rule 21 applies or doesn't apply, as desired.
In French, there's a parallel with c:
  • ka ke ki ko ku can be written ca que qui co cu
  • sa se si so su can be written cea ce ci ceo ceu (but it's more usual to write ça ce ci ço çu)
but it doesn't work so well in English, since our qu is still kw. The inserted e is found in just a few words (e.g. placeable), due to compounding.
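Rules 20 and 21 amount to a lookahead on the next letter. The Python sketch below is illustrative only (`soften_velars` is a hypothetical name); it assumes ch has already been rewritten as ç and y as i, and it leaves the vowels untouched.

```python
import re

def soften_velars(word):
    """Rules 20-21: c and g soften before the front vowels e and i."""
    word = re.sub(r"c(?=[ei])", "s", word)  # rule 20: c before a front vowel
    word = word.replace("c", "k")           # rule 20: c elsewhere
    word = re.sub(r"g(?=[ei])", "j", word)  # rule 21: g before a front vowel
    return word                             # rule 21: any other g stays g

for w in ("cell", "acid", "cow", "clear", "gel", "turgid", "got", "twig", "get"):
    print(w, "->", soften_velars(w))
```

The last word comes out as jet: the rule is doing what it says, and get is simply one of the hard-g irregularities.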

Untangle reverse-written final liquids

24. le and re (after a consonant, and ending the word) should be rewritten @l, @r.
To be precise, they become syllabic consonants: the final sound in bottle is a prolonged dark l. I think this is an allophonic detail, however: if you like, just add a rule at the end to turn all instances of @r into syllabic r.

Short and long vowels

OK, listen up, because these are the two most important rules of English spelling.
25. Vowels are pronounced long before an intervocalic consonant (rate, mete, fine, rote, cute = rät mët fïn röt küt).
26. They're short before two consonants (baffle, held, children, rotten, butler), or before a final consonant (pat, pet, pit, pot, but = pât pêt pît pôt bût).
English has a dozen or so vowel phonemes, and this silly alphabet we inherited from the Romans has just five vowel symbols (y is sometimes used as a vowel, but as we've seen, it pointlessly duplicates i). The five symbols can represent ten sounds, thanks to these rules.
Each vowel letter has two basic interpretations, which by convention are called long and short. (Phonetically they're not distinguished by length; tense and lax would be more accurate. But I think the more familiar terms will be more readable, and remind readers that their old English teachers were onto something after all.)
In my transcription, long vowels are marked with a diaeresis, since html doesn't supply a macron (äëïöü), and short vowels with a circumflex (âêîôû). Now you can see why I chose those odd representations-- they come from the basic logic of English spelling. (Think of the diaeresis as the IPA : long mark.)
Note that the names of the letters A E I O U are simply the 'long' vowels.
And where did that come from?
  • The spelling of the long vowels is the fault of the Great Vowel Shift of early modern times. Middle English pronounced the vowels with their 'proper' values, so that (say) mate would have been pronounced môt@.
  • The short vowels are simply laxed versions of the original sounds of the long vowels. ê, for instance, is a lazy version of ä (the original sound of long e)-- closer to the muddy center of the vowel space.
The above rules work in conjunction with rule 54, which means that doubling a consonant changes a medial vowel from long to short: later/latter, Peter/petter, biter/bitter, hoping/hopping, cuter/cutter.
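As a rough illustration of rules 25 and 26, the Python sketch below marks each lone vowel letter long or short by counting the consonant letters that follow it. The function name and the simplifications are assumptions: vowel digraphs are skipped, y counts as a consonant, and silent e and double consonants are left for the later rules (33 and 54) to clean up.

```python
VOWELS = set("aeiou")
LONG = dict(zip("aeiou", "äëïöü"))
SHORT = dict(zip("aeiou", "âêîôû"))

def mark_length(word):
    out = []
    for i, ch in enumerate(word):
        prev_vowel = i > 0 and word[i - 1] in VOWELS
        next_vowel = i + 1 < len(word) and word[i + 1] in VOWELS
        if ch not in VOWELS or prev_vowel or next_vowel:
            out.append(ch)  # consonant, or part of a vowel digraph
            continue
        # Count the consonant letters directly after this vowel.
        j = i + 1
        while j < len(word) and word[j] not in VOWELS:
            j += 1
        n_cons = j - (i + 1)
        if n_cons == 1 and j < len(word):
            out.append(LONG[ch])   # rule 25: one consonant, then a vowel
        elif n_cons >= 2 or (n_cons == 1 and j == len(word)):
            out.append(SHORT[ch])  # rule 26: two consonants, or a final one
        else:
            out.append(ch)         # word-final vowel: left for later rules
    return "".join(out)

for w in ("rate", "pat", "hoping", "hopping", "later", "latter"):
    print(w, "->", mark_length(w))
```

Hoping and hopping come out as höpîng and hôppîng: the doubled p is the only thing that flips the first vowel from long to short.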

Exceptions, but general ones

27. Final ind is ïnd, final oss is òs; final og is òg: mind, boss, dog = mïnd bòs dòg.
28. o also becomes ò before f and another consonant (offer = òf@r, soften = sòf@n).
29. wa is pronounced wô before a dental or alveolar consonant (t d n s +): want, wander, swan, Rwanda, swat, wad, wasp; and a is pronounced ò between w and (t)$: wash, squash, watch = wò$ skwò$ wòç.
29a. u is pronounced ù before l, or after a labial stop (p b) and before a sibilant (s $ ç): adult, push, butch. (This doesn't apply if the u is long: mule.)
I don't think I ever noticed these generalizations till I started working out the rules for this page. At least some of these, such as 29a, are sound changes from Shakespeare's time.
Rules such as 6, 18, 19, 27, 28, and 51 introduce ò, a vowel which (as signalled by the odd diacritic in my transcription) doesn't fit well into English phonology. The fact that a velar occurs in many of the rule conditions suggests that it was originally an allophonic variant of /ô/ and /â/ in this environment-- compare dog, ought, long, walk with dot, out, lot, wad. But it's now phonemic in GA, as can be seen in the minimum triad caught, cot, cat. These rules would have to be modified (and some could be eliminated) in dialects that merge ò and ô.
For some speakers, rule 29a only applies after labials, so that pull and dull don't rhyme.

Softening of gn

30. Except before a vowel, the vowel in ign or igm lengthens, and the g is lost: alignment, paradigm = @lïnm@nt, pär@dïm, but igneous = îgnë@s.
31. The g is simply lost in eign: feign = fän.

Handling of -ous

32. Except before a vowel, ous reduces to @s: jealous = jêl@s.
I'm ambivalent about rules that relate to a particular suffix, since arguably the pronunciation is simply a fact about the suffix in the mental lexicon. But a suffix can apply to dozens of words, so there was a large gain from including some such rules in the file.
Note the importance of order: this rule has to be ordered before silent e deletion, or it will apply to words like arouse.
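The ordering point can be checked with a toy version of the two rules involved. This is a sketch with hypothetical helper names, covering only rule 32 and the silent e rule (33); all other letters are left as spelled.

```python
import re

def reduce_ous(word):
    """Rule 32: ous becomes @s unless a vowel letter follows."""
    return re.sub(r"ous(?![aeiou])", "@s", word)

def drop_silent_e(word):
    """Rule 33: remove a final e unless it is the word's only vowel."""
    if word.endswith("e") and any(ch in "aeiou" for ch in word[:-1]):
        return word[:-1]
    return word

for word in ("jealous", "arouse", "he"):
    right = drop_silent_e(reduce_ous(word))   # rule 32 first, as in the list
    wrong = reduce_ous(drop_silent_e(word))   # silent e removed too early
    print(word, right, wrong)
```

With the rules in the listed order, arouse keeps its ous (arous); with the order swapped, the e is gone by the time rule 32 looks, and the word is wrongly reduced to ar@s.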

Removal of silent e

33. Remove final e: rate mike cute = rät mïk küt (unless it's the only vowel in the word, as in he).
This and rules 25 and 26 (on long and short vowels) are the guts of the English spelling system. They allow the five vowel symbols to represent ten vowel phonemes.
English orthography tends to preserve the spelling of morphemes in derived words, including their final e. The program is too stupid to handle this, since it has no way of recognizing compounds. But of course in words like safety, lovely, changeable, careful, warehouse, jukebox, placement, placeholder the e in the first morpheme should be deleted by this rule.
People pay tribute to these rules every time they make up words-- whether for marketing purposes (Nite-Lite, Cold-Eeze, Unix), slang (reefer, dweeb, doofus), a created world (hobbit, Leela, Oz, Alley Oop, Naboo, Mr. Magoo, Morlock), or for borrowings (thuggee, kangaroo, tycoon, igloo, tepee). Words that don't fit the pattern, like Linux, can cause confusion.

Add shortening; stir

Some vowels that are orthographically long are pronounced short, and frankly I haven't put my finger on the pattern. In the file I did add this rule:
34. Shorten a vowel that precedes a simple, final CV syllable (and is not the first syllable in the word).
This handles words like anomaly, cinema, sanity, biology, century; but it fails on other words, like patina, tuxedo, agora. Obviously the shortened vowels are all unstressed; but the idea here is to predict pronunciations from the spelling, and the spelling doesn't indicate the stress.
(We've already removed silent e, so this rule isn't triggered by words like phoneme.)
Somewhere I read that long vowels can't occur earlier than the antepenult; but obvious counterexamples are isolating or unification. I'll see if I can improve the generalization, however.

Vowel digraphs

Besides the long/short trick, English expands its repertoire of vowel representations with digraphs. Quite a few of these are redundant, and there are lots of exceptions-- this, and not ch or ough, is the real weak point of English spelling.
35. iV (that is, i plus another vowel) becomes ï@ in the initial syllable: bias, diagram = bï@s, dï@grâm.
36. Exceptions to the following rule:
  • Final ow is pronounced ö: slow, rainbow, overthrow.
  • oo is pronounced ù before a k: book, crook, look.
  • ei is pronounced ë after s: perceive, ceiling, seize.
  • ie is pronounced ï finally: dye, necktie.
  • oul becomes ù before a final d.
37. Make the following substitutions:
 eau      ö     
 ai ä
 au, aw ò
 ee ë
 ea ë
 ei ä
 eo ë@
 eu, ew ü
 ie ë
 iV ë@
 oa ö
 oe ö
 oo u
 ou, ow ôw
 oi öy
 ua ü@
 ue u
 ui u
Again, the program is not smart enough to recognize when the digraph spans a morpheme boundary, and thus should be treated as two separate vowels: goer = gö@r, coaxial = köâksë@l.
Annoyingly, some of these digraphs have at least two values: cf. wool, fool; mead, dread; fief, friend; reign, seize; ground, group. The values in the table are those that occur most often. (The alternatives are generally just a step or two apart phonetically, e.g. u/ù, ë/ê, ä/ë.)
For ease of exposition I've put the final ie rule here, but it really goes before rule 14 (affrication); otherwise terrible things happen to words like untie.

Those pesky final syllabics

38. Any vowel reduces to @ before final l: battle, final, hovel, evil, symbol.
39. Any short vowel reduces to @ before a final n: human, frighten, cabin, button.
These rules don't apply to monosyllables (pal, can), nor to vowels that have already been assigned a particular value by an earlier rule (e.g. meal to mël by rule 37).
These rules could probably be refined; they don't apply to stressed finals, but again, the orthography doesn't indicate stress.
You can take @l as a phonemic representation, or add a rule at the end to replace it with vocalic l. Ditto for @n.

Suffix simplifications

40. The following suffixes are reduced as follows:
 -able, -ible      @b@l     
 -lion ly@n
 -nion ny@n
Again, we really shouldn't have 'rules' for single lexical entries. But these suffixes are common, so the rule has a large yield.

Unpronounceable finals

41. A final b or n is not pronounced if preceded by an m: damn bomb = dâm bôm.

Final vowel coloration

42. Pronounce any remaining final vowel as follows:
 -a      @     
 -i ë
 -o ö
 -u u
A final vowel is usually the mark of a foreign word, which is why final vowels tend to have the 'continental' values: sushi, cello, haiku. Earlier borrowings were nativized, meaning that final vowels had to be written as diphthongs (e.g. Munsee, Hindoo).
Since final -e is already in use, we used to mark one that was supposed to be pronounced (Chloë = klöë), or, if we were borrowing from French, we retained the accent (café = kâfä). But English seems to be so allergic to diacritics that these helpful conventions have largely been lost.

Vowels before r

r is hell on English vowels; it tends to color the vowels, and in many dialects, disappear. In GA there are 12 monophthongal vowels, but only 6 can appear before r-- ä ë ô ö ò u-- plus @r, which is really just a prolonged vocalic r.
43. An ôw, ô, or ò resulting from the previous rules changes to ö before an r: course = körs, for = för.
44. war is pronounced wör, except before a vowel: warlock, war, dwarf = wörlôk, wör, dwörf; and wor is pronounced w@r: word, worst, worry.
45. ê or â before a double r (and ê before ri) become ä: terror, marry, merit = tär@r, märë, märît.
46. â before any other r becomes ô: mark, star = môrk, stôr.
47. ê, î, û before r are reduced to schwa: perk, fir, fur = p@rk, f@r, f@r.
Thanks to the infamous rule 45, I pronounce Mary, merry, marry the same. If you left this rule out, it would probably correctly predict the pronunciation of Easterners and Britons who distinguish them.

The velar nasal ng

The careful reader may wonder why ng was not handled earlier, with the other consonantal digraphs. The reason is that orthographically, it acts as a double consonant-- e.g. singer has a short not a long i. But now it's time to handle it.
For lack of an eng, I represent the velar nasal as ñ; don't confuse it with a palatalized ny.
48. ng becomes ñg before a liquid (r, l) or semivowel (y, w): angry, England, singular, anguish = äñgrë, îñglând, sîñgül@r, äñgwî$.
49. ng becomes ñ finally, or before another consonant: hung = hûñ, length = läñ+.
50. n becomes ñ before a velar stop (k, g): anger = äñg@r, think = +îñk.
51. ô becomes ò, and â becomes ä before ñ: song = sòñ, hang = häñ.
Note that rule 50 doesn't apply to words like hung, because rule 49 already removed the g in those words.
50 is arguably merely allophonic, but since it's completely consistent I treated it as a spelling rule. You could certainly say that a word like ungrateful 'really' has an underlying /ng/, because it's composed of un plus grateful; then this, as in most languages, will get pronounced ñg. But if you go that route, you can't actually show that English allows /ñg/ as well as /ng/-- how do we know that wrong isn't actually /ròng/, modified by the allophonic rule? The important thing is not to pretend that we have a contrast of /ng/ and /ñg/.

Voicing of s

52. s is voiced finally, after a voiced oral stop: dogs = dògz.
53. It's also voiced before final m: prism = prîzm.
The first of these rules is really morphophonemic: the plural, possessive, and 3p singular inflections of English are spelled s even when, by assimilation, they're pronounced z. This rule is not phonological, as can be seen by a word like chance = çâns; compare fans = fânz.

Double consonants

54. A double consonant is pronounced singly: dinner, buzzard, hassle = dîn@r, bûz@rd, hâs@l.
55. A t disappears before ç, and a d before j: batch = bâç, judge = jûj.
56. An s disappears before $: pressure = prê$@r.
Rule 54 works hand in hand with rule 25: a consonant is doubled to show that the preceding vowel is short: redder = rêd@r (compare red, where the d doesn't need to be doubled because a vowel preceding a final consonant is already short).
Rule 55 is something of a corollary: to 'double' ç, we write tch rather than chch; and to double a j, we write dg rather than jj or gg.
Rule 56 goes with rule 16, which changed s to $ before some instances of u.

Almost but not quite regular

In the rule list there's almost a rule that changes o to û before certain fricatives or nasals. Here's a list of affected words, as well as counterexamples:
 _v   Affected: above, cover, dove, glove, govern, hovel, hover, love, oven, shovel, of
      Counterexamples: clover, prove, drover, jovial, move, novel, over, poverty, proverb, province, sovereign, stove, bovine
 _l   Affected: color
      Counterexamples: apology, polo
 _+   Affected: other, another, mother, brother, nothing
      Counterexamples: both, bother, broth, brothel, cloth, clothes, moth
 _n   Affected: onion, none, money, monk, monkey, month, wonder, front, son, sponge, honey, Monday, one
      Counterexamples: alone, bone, honest, honor, tonight, pond, beyond, conk
 _m   Affected: come, become, from, some, stomach
      Counterexamples: bomb, comb, dome, home, gnome, Mom, whom, womb
Most of these turn out to be due to an orthographic or even a calligraphic rule: medieval English scribes wrote o instead of u before m, n, v, apparently because in the medieval hand, the verticals of the u ran confusingly together with those of the following consonant.

So what's irregular?

The biggest source of errors is the group I considered near-misses: instances where the rules get the length of a vowel wrong, or don't predict a reduction to schwa, or don't predict a voiced s.
The first two of these are a feature not a bug, since they make word roots recognizable, despite predictable differences in pronunciation. For instance, the root pedant is spelled identically in pedant (pêd@nt) and pedantic (p@dântîk). This underlines the relationship between the two words, despite the fact that neither root vowel is pronounced the same. Similarly, sanity has a short a (sânîtë), although a vowel preceding a single consonant is normally long; this is an 'error', but it keeps the same spelling of the root as in sane.
Putting these near-misses aside, my program gets 791 words wrong in a 5180-word sample vocabulary.
Many of these are really stupidities of the program, not the language. There are:
  • 188 simple variations of other errors-- e.g. since busy is wrongly predicted to have a ü, so is business
  • 52 borrowings using foreign spelling conventions (e.g. aficionado, bourgeois, cello, stein). Borrowings are common enough in English that writers can learn the patterns for each source language.
  • 18 instances of final -ed taken as êd
  • 45 words (mostly Greek) where ch = k not ç
  • 45 silent e's not recognized as such due to compounding
  • 20 over-enthusiastic vowel reductions (usually due to stress falling where, statistically, it doesn't occur much: amen, violin; or to vowels that unexpectedly don't turn to schwa before r: mirror, sergeant).
  • 6 instances of consonant combinations taken as single sounds despite crossing a morpheme boundary (e.g. dishonor, shepherd)
That leaves about 420 words wrong, less than 10%; the major categories are as follows:
  • 195 misinterpretations of diphthongs; some of these are genuine ambiguities in English spelling (cf. dead, mead, real; die, sieve, science, fief); others are due to insufficient analysis (e.g. poet is mispredicted simply because I didn't provide a rule for oe-- it wasn't worth it, it occurred too rarely in the lexicon).
  • 37 examples of the o to û change discussed above.
  • 26 indefensible vowel spellings (e.g. pretty, women, resin, English, lose, swamp, water, bury, lawyer).
  • 17 consonant clusters not simplified enough (e.g. half, folks, listen, mortgage, raspberry).
  • 17 instances of an unexpected (or mispredicted) ò; e.g. cloth, frost, chocolate.
  • 18 instances of final -y being ï rather than ë.
  • 13 annoying cases where g before a front vowel is hard (e.g. get, give); there are also 4 cases where gg + front vowel was taken incorrectly as gj-- which it should be, dammit (suggest) but often isn't (stagger).
  • 8 instances of an unexpected ù; e.g. put, wolf, woman. (These all begin with labials-- these may be related to rule 29a.)
  • 10 unexpected (af)frications (e.g. educate, ocean, righteous, sure); there's also an instance of an unexpected lack of frication (absurd)
  • 8 more instances of er becoming är (besides those noted in the rules-- e.g. era, there, herald, very)
  • 6 instances of vowels unexpectedly dropping (e.g. bachelor, vegetable, Wednesday)
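The "about 420" figure can be checked by tallying the counts quoted above. This is only a minimal sketch: the numbers come from the text, while the dictionary labels and variable names are shorthand invented for the example.

```python
# Counts quoted in the text; labels and names are illustrative assumptions.
SAMPLE_SIZE = 5180   # words in the sample vocabulary
TOTAL_WRONG = 791    # words mispredicted, near-misses already set aside

# Errors blamed on the program rather than on the language.
program_faults = {
    "variations of other errors": 188,
    "foreign spelling conventions": 52,
    "final -ed taken as êd": 18,
    "ch = k (mostly Greek)": 45,
    "silent e hidden by compounding": 45,
    "over-enthusiastic vowel reduction": 20,
    "clusters across a morpheme boundary": 6,
}

excused = sum(program_faults.values())
remaining = TOTAL_WRONG - excused

print(excused)                               # 374
print(remaining)                             # 417, i.e. "about 420"
print(f"{TOTAL_WRONG / SAMPLE_SIZE:.1%}")    # 15.3%
print(f"{remaining / SAMPLE_SIZE:.1%}")      # 8.1%, "less than 10%"
```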

Generating spellings from pronunciation

Can you reverse these rules to get instructions on how to spell a word given its pronunciation? Not really, since there are too many alternative spellings. However, the following table can be taken as a first approximation. For each GA phoneme, I list the spellings referred to in the rules above. Caveats:
  • Remember the long/short vowel rules (25,26).
    • To ensure a short pronunciation, double the following consonant.
    • To ensure a long pronunciation:
      • at the end of a word, add a silent e
      • elsewhere in the word, use a diphthong instead.
  • Remember the softening of velars; see rules 20-23 for a discussion of how to spell s/k/g/j before various vowels.
  • Parenthesized characters represent the environment where you can use a spelling. Examples:
    • (V)ss(V) under s means that you can spell it ss between two vowels
    • a(ng) under ä means that you can spell it a before ng.
  • # represents the end or beginning of a word:
    • i# under ï means that this spelling occurs word-finally.
  • ks (or intervocalic gz) can be written x.
  • It's preferable to spell a word the same way across all morphological changes, even if it means slight violations of the rules (e.g. 'silent final e' in the middle of a word).
  • Likewise: write reduced vowels with the full vowel in a morphologically related word. E.g. the second vowel in parent is e because we have a full ê in parental.
ä   a, ay, ai, ei, e(r), a(ng)                              p   p
â   a                                                       b   b
ë   e, ee, ea, ey, (c)ei, e(V), i#, y#                      t   t
ê   e, ea                                                   d   d
ï   i, y, ie, igh, ig(n), i(V)                              g   g, gh(i/e/y)
î   i, y                                                    k   k, c(a/o/u), q(u), ck#
ö   o, oa, oe, ough, o#, ow#, eau                           m   m
ô   o, (w)a(n/s/t/d), a(r)                                  n   n
ü   u, eu, ew                                               ñ   ng, n(k,g)
û   u                                                       f   f, ph
                                                            v   v
u   oo, ue, ui, u#                                          +   th
ò   au, aw, augh(t), a(l), (w)a(sh,ch), o(ss#, g#, fC, ng)  +   th
ù   oo, u                                                   s   s, (V)ss(V), c(i/e/y), ce(a/o/u)
@   V, a#                                                   z   z, (V)s(V)
                                                            $   sh, ci(V), ti(V); rule 16 situations: s, ss
ôw  ou, ow                                                  $   s, zh
öy  oy, oi                                                  ç   ch, (doubled) tch, t(u)
                                                            j   j, (doubled) dg, g(i/e/y), ge(a/o/u)
y   y; yu can be u                                          r   r, #wr, rh
w   w, #wh, u(V)                                            l   l
h   h
@r  Vr, re#
@n  Vn
@l  Vl, le#
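The environment notation used in the table (parentheses for context, # for a word boundary) is regular enough to read mechanically. What follows is a rough sketch only, not the program discussed earlier: the names Spelling, parse_spelling, SAMPLE_ROWS and word_final_spellings are invented for illustration, only three rows are copied in, and entries written with capital letters (such as V on its own) are deliberately left unsupported.

```python
import re
from typing import NamedTuple


class Spelling(NamedTuple):
    before: str   # required left context: "" for none, "#" for word start
    letters: str  # the letters actually written
    after: str    # required right context: "" for none, "#" for word end


# Optional "(context)" or "#" on either side of a run of lowercase letters.
_ENTRY = re.compile(
    r"(?:\((?P<before>[^)]*)\)|(?P<start>#))?"
    r"(?P<letters>[a-z]+)"
    r"(?:\((?P<after>[^)]*)\)|(?P<end>#))?"
)


def parse_spelling(entry: str) -> Spelling:
    """Split one table entry such as '(V)ss(V)' or 'ck#' into its parts."""
    match = _ENTRY.fullmatch(entry.strip())
    if match is None:
        raise ValueError(f"unsupported table entry: {entry!r}")
    before = match.group("before") or match.group("start") or ""
    after = match.group("after") or match.group("end") or ""
    return Spelling(before, match.group("letters"), after)


# Three rows copied from the table; the rest is omitted for brevity.
SAMPLE_ROWS = {
    "s": ["s", "(V)ss(V)", "c(i/e/y)", "ce(a/o/u)"],
    "k": ["k", "c(a/o/u)", "q(u)", "ck#"],
    "r": ["r", "#wr", "rh"],
}


def word_final_spellings(phoneme: str) -> list[str]:
    """Spellings of a phoneme that the table restricts to the end of a word."""
    parsed = (parse_spelling(entry) for entry in SAMPLE_ROWS[phoneme])
    return [p.letters for p in parsed if p.after == "#"]


print(parse_spelling("(V)ss(V)"))   # Spelling(before='V', letters='ss', after='V')
print(word_final_spellings("k"))    # ['ck']
```

Keeping the context separate from the letters makes it easy to ask questions such as "which spellings of k are only allowed word-finally?" without touching the table itself.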

Spelling reform by regularization

You could use the above table as the basis for a really useful and minimal spelling reform.
For instance, here's Percy Bysshe Shelley's Ozymandias in regularized spelling. To minimize the barbarity, I exempt one- and two-letter words from reform.
I met a traveller from an anteke land hu sed Tue vast and trunkless legs of stone stand in the desert. Near them, on the sand, haff sunk, a shattered visage lies, huse frown, and wrinkled lip, and sneer of cold cummand tell that its sculptor well those passions read, which yet remain, stamped on these lifeless things-- the hand that mocked them, and the hart that fed. And on the peddestal these words are carved: 'My name is Ozzymandias, king of kings! Look on my works, ye mighty, and despair!' Nuthing beside remains. Round the decay of that colossal wreck, boundless and bare, the lone and levvel sands stretch far away.
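Several respellings in the poem (levvel, peddestal, haff, anteke, huse) apply the length caveats listed with the table: double the following consonant to keep a vowel short, and add a silent e at the end of a word to keep it long. As a minimal sketch of just those two devices, and not the procedure actually used to produce the text, the helper names and the intermediate forms antek and hus below are assumptions made for illustration:

```python
VOWELS = set("aeiou")


def mark_short(word: str, vowel_index: int) -> str:
    """Double the consonant right after the vowel at vowel_index."""
    nxt = vowel_index + 1
    if word[vowel_index] not in VOWELS:
        raise ValueError("vowel_index must point at a vowel")
    if nxt >= len(word) or word[nxt] in VOWELS:
        raise ValueError("no following consonant to double")
    if nxt + 1 < len(word) and word[nxt + 1] == word[nxt]:
        return word  # consonant is already doubled
    return word[:nxt] + word[nxt] + word[nxt:]


def mark_long_final(word: str) -> str:
    """Add a silent e after a word-final vowel + consonant."""
    if len(word) >= 2 and word[-1] not in VOWELS and word[-2] in VOWELS:
        return word + "e"
    return word


print(mark_short("level", 1))      # levvel
print(mark_short("pedestal", 1))   # peddestal
print(mark_long_final("antek"))    # anteke
print(mark_long_final("hus"))      # huse
```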
Or of course we could just hang it up and use Chinese-style syllabograms instead.

So how horrible is English spelling really?

I doubt that this page will convince anyone that English spelling is a good system. There's too many oddities.
  • Vowel combinations are a mess-- often the best you can do is give the two most likely sounds (realm, reap), and even those will be overruled in the fairly frequent cases where two vowels really adjoin (reality).
  • There's too many quirky rules that derive from odd sound changes. We may not be able to get away from the Romance c/g softening or the Great Vowel Shift, but does our spelling need to preserve old forms of feign or walk?
  • There was a period when busybodies did their best to make English look like Latin. This was bad enough when we distorted perfectly good French loans like dette into debt, but we're also stuck with false etymologies like island (in place of the older, and regular, iland).
  • And the modern custom of borrowing instead of adapting spellings, though nice for etymology, plays havoc with the orthography, especially as we start to borrow from more exotic languages and forget where they're from. I've heard well-meaning idiots pronouncing a Russian z as ts, as if it were German; and people like to pronounce words like Sarajevo as if they were Spanish. And why spell gyros as if it were classical instead of modern Greek (inviting the pronunciation jïröz in place of yërös)?
  • While we're at it, could we please fix the word ginkgo, which is not only difficult and irregular, but doesn't reflect any proper Japanese word? The Japanese characters (銀杏) can be read two ways: as icho:, they refer to the tree; as ginnan, to the fruit. The second character can be read kyo: in other words, so someone misread the combination as ginkyo:, and someone else mangled this into ginkgo.
What I hope to have shown, however, is that beneath all the pitfalls, there's a rather clever and fairly regular mechanism at work, and one which still gets the vast majority of words pretty much correct. It's not to modern tastes, but by no means as broken as people think.
