Sunday, February 10, 2013

Identifying and Fixing Transcription Errors in Large Corpuses

"Underwood 11 Typewriter", by Alex Kerhead.
This is the third post in my series on the Old Bailey Online (OBO) corpus. In previous posts I looked at the impact of courtroom reporters and editors on the vocabulary used in the Old Bailey trial transcripts, and at ways of measuring the diversity of immigration in London between the 1680s and 1830s.

Since I'm dealing with a huge amount of text (51 million words, 100,000 trials), I thought I'd turn my attention to the accuracy of the transcription. For such a large corpus, the OBO is remarkably accurate. The 51 million words in the set of records between 1674 and 1834 were transcribed entirely manually by two independent typists. The two typists' transcriptions were then compared and any discrepancies were corrected by a third person. Since it is unlikely that two independent professional typists would make the same mistakes, this process, known as “double rekeying”, goes a long way towards ensuring the accuracy of the finished text.

But typists do make mistakes, as do we all. How often? By my best guess, about once every 4,000 words, or about 15,000-20,000 total transcription errors across 51 million words. How do I know that, and what can we do about it?

Well, as you may have read in the previous posts, I ran each unique string of characters in the corpus through a series of four English-language dictionaries containing roughly 80,000 words, as well as a list of 60,000 surnames known to be present in the London area by the mid-nineteenth century. Any word that appears in neither the dictionaries nor the surname list has been put into a third list (which I've called the “unidentified list”). This unidentified list contains 43,000 unique “words” and is, I believe, the best place to look for transcription errors.
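As a rough illustration of that filtering step (a sketch, not the original script), the idea can be expressed in a few lines of Python. The function name `build_unidentified_list`, the tokenizing regular expression, and the tiny stand-in word lists are all assumptions made for the example; the real lists hold roughly 80,000 words and 60,000 surnames.

```python
import re
from collections import Counter

def build_unidentified_list(text, dictionary_words, surnames):
    """Count every token that is in neither the dictionaries nor the surname list."""
    known = {w.lower() for w in dictionary_words} | {s.lower() for s in surnames}
    # Assumption: a "word" is any unbroken run of letters, compared in lower case.
    tokens = re.findall(r"[a-z]+", text.lower())
    return Counter(t for t in tokens if t not in known)

# Tiny stand-in lists, purely for illustration.
dictionary_words = {"the", "was", "stolen", "by", "influence", "silver", "spoon"}
surnames = {"Smith"}
sample = "The sivler spoon was stolen by Smith the shopman"

unidentified = build_unidentified_list(sample, dictionary_words, surnames)
print(unidentified)
```

Here both "sivler" (a genuine error) and "shopman" (a legitimate compound word) end up on the unidentified list, which is exactly why the list needs further sorting.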

Not all of the words on the unidentified list are in fact errors. Many are archaic verb conjugations or spellings (catched – 1,657 uses or forraign – 1 use), compound words (shopman – 4,036 or watchhouse – 2,661), London place names (Houndsditch – 877), uncommon names that had not been marked up as such during the XML tagging process (Woolnock – 1), Latin words (paena – 1), or abbreviations (knt – 1,921) – short for “knight”, a title used by many gentlemen in the eighteenth century.

On the other hand, many of these words are clearly errors. We see mistyped letters as in “insluence” instead of “influence” or “doughter” instead of “daughter”. We also see transposed letters as in “sivler” instead of “silver”. And there are missing letters: “Wlliam” instead of “William”. Finding the difference between the real words such as “watchhouse” and the errors such as “Wlliam” amongst the 43,000 terms on the unidentified list is the real challenge.

Checking manually is impractical as these terms appear nearly 200,000 times in the corpus. Correcting every single error might not be worth the effort. However, to get an idea of the types of errors that appear and in what proportions, I checked every entry on the unidentified list against the image of the original scanned record for a single session of the court: January 1800. The unidentified words fell into the categories seen in Figure 1.

Figure 1: January 1800 Old Bailey Online transcription errors and the type of error.
The most surprising category here for me is the purple section, which shows three instances that I would have categorized as typos by the transcribers, but which were actually typos in the original source. This compounds the problem because it means we must acknowledge that in some instances the error is not the OBO team's but in fact reflects the content of the contemporary document. A person searching the database for a particular keyword may be frustrated by the original error. On the other hand, for those who want to be true to the original, that mistake should be preserved. I won't weigh in on that particular issue here, but it is something anyone working to correct transcription errors should consider.

With this in mind we can begin to look at the other categories, and by the looks of things approximately 40% of entries can in theory be corrected if we can figure out the intended word. Admittedly, I only looked at a single session of the trials, and this may not be representative - particularly since earlier trials, written in Early Modern English, are more likely to have archaic non-standardized spellings. If, however, the session from 1800 is roughly representative of a typical session, then we should expect to find somewhere in the neighbourhood of 15,000-20,000 errors.

What can we do about it?

How can we automatically find and correct those errors? Given the fail-safes put in place by the double rekeying process, it's already incredibly unlikely that we will find typing errors by the transcribers. That means that when such an error does occur, it is likely to occur only once or twice, so most errors are probably words that appear only once or twice in the corpus and that do not appear on either the dictionary list or the surname list.
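Narrowing the unidentified list down to those rare words is a simple frequency filter. The snippet below is a minimal sketch under assumed names (`rare_words`, `max_uses`); the counts for "shopman" and "watchhouse" come from the list above, while the counts for the two misspellings are invented for the example.

```python
from collections import Counter

def rare_words(unidentified_counts, max_uses=2):
    """Return unidentified words used at most `max_uses` times, alphabetically."""
    return sorted(w for w, n in unidentified_counts.items() if n <= max_uses)

counts = Counter({
    "shopman": 4036,     # legitimate compound word, used often
    "watchhouse": 2661,  # legitimate compound word, used often
    "wlliam": 1,         # invented count for illustration
    "sivler": 2,         # invented count for illustration
})

likely_errors = rare_words(counts)
print(likely_errors)
```

Frequent terms such as "shopman" drop out, leaving the one-off oddities as the most promising candidates for a closer look.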

That's not to say, of course, that a word has been transcribed correctly just because it appears in the dictionary; however, at this stage it is much easier to identify those errors that are not recognized words. Unfortunately there are over 30,000 unique words on the unidentified list that appear only once, meaning this is still impractical to explore manually. Luckily, the double rekeying means that any mistake is more likely to be a case of the transcriber interpreting the marks on the page differently than we might have liked than a case of fat fingers hitting the wrong key.

The early modern “long S” is a perfect example. In the early modern era, up to about 1820, it was entirely common to find the letter S represented as what we might think looks like a lower-case “f”. This is the “suck” vs “fuck” problem that the Google N-Grams viewer runs into, as a slew of esses are interpreted as efs. When viewing the result one might be tempted to conclude people had quite a potty mouth on them in the early nineteenth century, as can be seen in Figure 2. Though not necessarily an incorrect assumption, it wouldn't be wise to make the assumption on this particular evidence.

Figure 2: Google N-Gram results for "suck" and "fuck" in the early nineteenth century

When we look through many of the words on the unidentified list it becomes clear that the long S is a substantial problem. We find examples of the following:
  • abufes
  • afcertained
  • assaffin
  • affaulting
  • affize 
Or, the other way around:
  • assair
  • assixed
  • assluent
  • asorethought
  • artisice 
By writing a Python program that changed the letter F to an S and vice versa, I was able to check whether making such a change created a word that was in fact an English word. When I did this I was pointed to several thousand possible typos. As I inspected the list further I noticed there were other common errors, probably caused by the very high contrast scans of the original documents. These original documents often had letters with parts missing, difficult to read words, or little bits of dirt or smudges that made interpreting the marks more challenging.
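For readers who want to try something similar, here is a minimal sketch of the F / S switch. It is not the original program: the function names are hypothetical, and the choice to try every combination of swapped positions (rather than swapping all letters at once) is an assumption, made so that a word with a mix of correct and incorrect letters can still be matched.

```python
from itertools import product

def swap_candidates(word, a="f", b="s"):
    """Yield every spelling of `word` with one or more a/b letters exchanged."""
    options = []
    for ch in word:
        if ch == a:
            options.append((a, b))   # keep the letter, or swap it
        elif ch == b:
            options.append((b, a))
        else:
            options.append((ch,))    # all other letters stay as they are
    for combo in product(*options):
        candidate = "".join(combo)
        if candidate != word:
            yield candidate

def suggest(word, dictionary, a="f", b="s"):
    """Return the swapped spellings that are real dictionary words."""
    return sorted({c for c in swap_candidates(word, a, b) if c in dictionary})

# Stand-in dictionary for illustration only.
dictionary = {"abuses", "assassin", "affair", "artifice"}
for misspelling in ["abufes", "assaffin", "assair", "artisice"]:
    print(misspelling, "->", suggest(misspelling, dictionary))
```

The number of combinations doubles with each F or S in the word, which is fine for ordinary words but would need a cap for very long strings.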

Some of the most obvious switches were:
  • F / S 
  • I / L 
  • U / N
  • C / E
  • A / O
  • S / Z
  • V / U 
Why these particular switches appeared again and again I'm not entirely sure. Some of them are easy to understand: the lower-case C and lower-case E are easy to mix up, especially when a fleck of dirt shows up in just the right spot on the scan. Others are a bit more difficult to explain, as with U and N, which we wouldn't expect an automated optical character recognition program to have trouble with, but which seem to have stumped the human transcribers repeatedly.

By running these seven sets of letters through the program and testing the results against the English dictionaries I was able to come up with 2,780 suggested corrections. If these are all correct, that simple switching would correct 9,503 typos in the OBO corpus. The results of these changes broken down by letter-pair can be seen in Figure 3.
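The difference between the two figures (distinct suggested corrections versus total typos corrected) comes from weighting each misspelling by how often it appears. The sketch below shows one way such a per-pair tally could be produced. The helper names, the stand-in dictionary, and the usage counts are all assumptions for illustration, not the original code or data.

```python
from collections import Counter
from itertools import product

# The seven letter pairs listed above, in lower case.
PAIRS = [("f", "s"), ("i", "l"), ("u", "n"), ("c", "e"),
         ("a", "o"), ("s", "z"), ("v", "u")]

def variants(word, a, b):
    """All spellings of `word` with one or more a/b letters exchanged."""
    swap = {a: b, b: a}
    options = [(ch, swap[ch]) if ch in swap else (ch,) for ch in word]
    return {"".join(combo) for combo in product(*options)} - {word}

def tally_by_pair(unidentified_counts, dictionary):
    """Count suggestions and the corpus occurrences they cover, per letter pair."""
    suggestions = Counter()  # distinct misspellings with a dictionary match
    occurrences = Counter()  # how many times those misspellings appear
    for word, uses in unidentified_counts.items():
        for a, b in PAIRS:
            if variants(word, a, b) & dictionary:
                label = f"{a.upper()} / {b.upper()}"
                suggestions[label] += 1
                occurrences[label] += uses
    return suggestions, occurrences

# Invented counts and a stand-in dictionary, for illustration only.
counts = Counter({"abufes": 3, "doughter": 1, "faucy": 2})
dictionary = {"abuses", "daughter", "saucy", "fancy"}
suggestions, occurrences = tally_by_pair(counts, dictionary)
print(suggestions)
print(occurrences)
```

Each pair is tested on its own, so a word like "faucy" is counted once under F / S and once under U / N.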

Figure 3: The number of suggested corrections in the OBO corpus by switching letter pair combinations in misspelled words.
I say suggested corrections because in some cases the switch is actually wrong, or may be wrong. The English dictionaries missed "popery", a common term used to refer to Roman Catholicism in the eighteenth century, so the program has instead suggested the unlikely "papery" as an alternative. In 86 cases the switching has come up with two possible suggestions, both of which are English words, at least one of which is obviously incorrect. The unidentified word "faucy" could be "saucy" or "fancy". Turns out it's saucy, referring to the behaviour of a Peter Dayley - that naughty boy.
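One way to keep these cases from slipping through is to sort the suggestions before anyone acts on them. The following is a hedged sketch only: the `triage` function and the idea of a hand-built list of period terms (`period_terms`) are assumptions for the example, not part of the process described here.

```python
def triage(suggestions, period_terms=frozenset()):
    """Split suggestions into single-candidate fixes and ambiguous ones."""
    single, ambiguous = {}, {}
    for word, candidates in suggestions.items():
        if word in period_terms:
            continue  # a legitimate historical word the dictionaries missed
        if len(candidates) == 1:
            single[word] = candidates[0]
        else:
            ambiguous[word] = sorted(candidates)
    return single, ambiguous

# Hypothetical output from the letter-switching step.
suggestions = {
    "abufes": ["abuses"],
    "faucy": ["saucy", "fancy"],
    "popery": ["papery"],
}
single, ambiguous = triage(suggestions, period_terms={"popery"})
print(single)
print(ambiguous)
```

Even the single-candidate fixes deserve a human glance, but the ambiguous group is where checking against the scanned page is unavoidable.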

This switcheroo method will not solve all problems. It cannot fix transposed letters, as with sivler and silver; Levenshtein distance is likely needed for that. It does nothing for missing letters as in Wlliam. But it does take us well along the path to making some rather dramatic improvements with a very reasonable amount of effort and, I would argue, could be an economical way to improve the accuracy of projects which have already been transcribed but which suffer from accuracy issues. As with all great things in life this algorithm still requires a human's careful eye, but at least it has pointed that eye in the right direction. And when you're looking at 51 million words of text, that's nine-tenths of the battle.
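For anyone curious about the Levenshtein route, here is a small self-contained sketch of the standard dynamic-programming version, with a hypothetical `closest_words` helper and a stand-in dictionary. Note that plain Levenshtein distance counts a transposition as two edits; the Damerau-Levenshtein variant, not shown here, counts it as one.

```python
def levenshtein(a, b):
    """Minimum number of single-letter insertions, deletions, or substitutions."""
    previous = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        current = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            current.append(min(
                previous[j] + 1,         # delete a letter from `a`
                current[j - 1] + 1,      # insert a letter into `a`
                previous[j - 1] + cost,  # substitute (or keep) a letter
            ))
        previous = current
    return previous[-1]

def closest_words(word, dictionary, max_distance=2):
    """Dictionary words within `max_distance` edits, nearest first."""
    scored = [(levenshtein(word, w), w) for w in dictionary]
    return sorted((d, w) for d, w in scored if d <= max_distance)

# Stand-in dictionary for illustration only.
dictionary = {"silver", "william", "daughter"}
print(closest_words("sivler", dictionary))
print(closest_words("wlliam", dictionary))
```

Comparing every rare word against a full dictionary this way is much slower than the letter switching, so it probably makes most sense as a second pass over whatever the switching leaves behind.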

If you're working on a project that could use some accuracy improvements, or have explored other ways of achieving similar results, I'd be very happy to hear from you.