In a previous Substack article we compared the top 10 and top 200 most frequently used words in various Scots dialects with the top 10 and top 200 words in standard English, using a Jaccard similarity percentage. On my corpus of 21st century Scots texts website there is a utility for doing these comparisons yourself.
There used to be a handful of pages that compared either single authors or single dialects, but I thought I could combine them all into a single PHP script to make maintenance easier.
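For the curious, the comparison itself is just set overlap on two top-N word lists. A minimal sketch of the calculation, assuming the classic intersection-over-union definition of Jaccard similarity (this is an illustration, not the actual dialcomp.php code):

```php
<?php
// Sketch of the overlap calculation, not the actual dialcomp.php code.
// Takes two arrays of top-N words and returns the Jaccard similarity
// as a percentage: |intersection| / |union| * 100.
function jaccard_percent(array $wordsA, array $wordsB): float
{
    $setA = array_unique(array_map('mb_strtolower', $wordsA));
    $setB = array_unique(array_map('mb_strtolower', $wordsB));

    $intersection = array_intersect($setA, $setB);
    $union        = array_unique(array_merge($setA, $setB));

    return count($union) === 0 ? 0.0 : 100.0 * count($intersection) / count($union);
}

// Usage: $percent = jaccard_percent($dialectTop200, $englishTop200);
```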
The Scots-o-meter
If you just go to dialcomp.php it opens as the Scots-o-meter, where you can paste up to 10,000 characters of text. It compares the text to all the dialects in the corpus and gives you the overlap percentage for each dialect. It also keeps a log of all the submitted text.
10,000 characters is about 2,000 words, which isn’t really enough to give a good reading. It will give a decent indication of which dialect the text is most similar to, but it’s not going to give very high percentages compared to the other comparison functions.
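Under the hood the Scots-o-meter presumably just tokenises the pasted text, builds its own top word list, and scores it against each dialect’s list. A rough sketch of that loop, reusing the jaccard_percent() sketch above (the $dialectLists array and the tokenising rules are assumptions, not the real implementation):

```php
<?php
// Rough sketch of the Scots-o-meter loop; the $dialectLists array and the
// tokenising rules are assumptions, not the real dialcomp.php code.
function top_words(string $text, int $top = 200): array
{
    // Lower-case, then split on anything that isn't a letter or apostrophe.
    $tokens = preg_split("/[^\\p{L}']+/u", mb_strtolower($text), -1, PREG_SPLIT_NO_EMPTY);
    $counts = array_count_values($tokens);
    arsort($counts);                              // most frequent first
    return array_slice(array_keys($counts), 0, $top);
}

function rank_dialects(string $pasted, array $dialectLists): array
{
    $pasted   = mb_substr($pasted, 0, 10000);     // the 10,000 character limit
    $topWords = top_words($pasted);

    $scores = [];
    foreach ($dialectLists as $code => $words) {  // e.g. 'ABN' => [...top 200 words...]
        $scores[$code] = jaccard_percent($topWords, $words);
    }
    arsort($scores);                              // best match first
    return $scores;
}
```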
Here’s what it looks like, about to consume a Colin Burnett short story:
And here are the results, identifying the text as most similar to the Central dialect.
I’d always pegged Colin Burnett as writing in an Edinburgh urban dialect, but the Scots-o-meter reckons the writing is more like the West Central dialect at 41%, although the Glasgow urban dialect scores highly too at 40%.
Author comparison
If you go to dialcomp.php?a=100007 with an author number as a GET parameter, it will compare that author’s top 200 words with all the large regional dialects, all the small regional dialects, and English, and display the percentages.
The author numbers can be obtained from the writers page on the corpus by looking at the URLs. I’ve got to figure out a better way of doing this.
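Since everything now runs through one script, the different modes are really decided by which GET parameters are present and whether a= looks like an author number or a dialect code. A hypothetical dispatch sketch (the helper function names are placeholders, not the real code):

```php
<?php
// Hypothetical sketch of how a single dialcomp.php might dispatch on its
// GET parameters; the helper function names are placeholders.
$a   = $_GET['a']   ?? null;
$b   = $_GET['b']   ?? null;
$top = (int)($_GET['top'] ?? 200);

if ($a === null) {
    show_scotsometer_form();                        // no parameters: the paste-a-text Scots-o-meter
} elseif (ctype_digit($a) && $b === null) {
    compare_author_to_all((int)$a, $top);           // ?a=100007
} elseif (ctype_digit($a)) {
    compare_author_to_dialect((int)$a, $b, $top);   // ?a=100007&b=SHD
} elseif ($b !== null) {
    compare_two_dialects($a, $b, $top);             // ?a=SEC&b=MNA (or b=ENG/English)
} else {
    compare_dialect_to_all($a, $top);               // ?a=ABN
}
```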
Top fifty words
By default the utility looks at the top 200 most frequently used words, or fewer if there isn’t enough text. With a top=50 GET parameter, dialcomp.php?a=100007&top=50, it will instead look at the top 50 most frequently used words. You can set the top= value to anything up to 1,000.
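Handling the top= parameter is just a matter of reading it, clamping it to a sensible range and slicing the sorted frequency list. A short sketch (the 1,000 ceiling matches what the page accepts; the rest is assumption):

```php
<?php
// Sketch of handling the top= parameter: clamp it between 1 and 1,000,
// defaulting to 200, then slice the (already sorted) frequency list.
$top = (int)($_GET['top'] ?? 200);
$top = max(1, min($top, 1000));

// $frequencies is assumed to be word => count, sorted most frequent first.
$topWords = array_slice(array_keys($frequencies), 0, $top);
```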
I dunno if there’s any merit in looking further down the frequency list; even beyond the top 50 words it becomes more biased towards a specific writer’s subject matter, and less meaningful.
In general the most common words are pronouns, articles and prepositions, and then some of the more common verbs like to do, to be, to go and to have. We start to see more nouns, adjectives and adverbs after the top 50 words.
There are probably papers that examine the dispersion of parts of speech in frequency lists; that could be really interesting.
Comparing writers with specific dialects and words
If you click on any of the dialect codes in an author comparison, dialcomp.php?a=100007&b=SHD, it should open again comparing the same author to the selected dialect, displaying the actual top 200 words and indicating which words overlap with the dialect and which don’t.
Using the top= GET parameter here also works to override the top 200 default.
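Marking which words overlap and which don’t is just a pair of set operations on the two lists. A small sketch of the idea (not the real rendering code):

```php
<?php
// Sketch of splitting an author's top words into those shared with a
// dialect's top words and those unique to the author; not the real
// rendering code, just the set operations behind it.
function split_overlap(array $authorWords, array $dialectWords): array
{
    return [
        'shared' => array_values(array_intersect($authorWords, $dialectWords)),
        'unique' => array_values(array_diff($authorWords, $dialectWords)),
    ];
}

// The page could then, say, highlight the 'shared' words and leave the
// 'unique' ones in plain text.
```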
Looking at just one dialect
The dialcomp.php page will also accept the small dialect codes as a single GET parameter. In this case it will compare the top 200 words in the chosen dialect with the top 200 words in each of the other dialects. In this way we can see which dialects are most similar or closest to one another.
For example, if we look at the Aberdeen dialect, dialcomp.php?a=ABN, we can see that it is closest to the Doric group of dialects (71.5%) and most similar to the individual Mid Northern A (MNA) dialect (73.5%). Also, the Aberdeen dialect in the corpus is only 38.5% similar to English.
It is tempting to go through every dialect and compile the similarity percentages into a huge matrix, perhaps with colours to indicate similarity and black boxes to indicate the groupings.
However, this doesn’t work very well and doesn’t tell us much. Some of the dialects don’t have much text in the corpus and so have columns and rows that stand out unexpectedly.
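Compiling the matrix itself is straightforward; it’s the interpretation that falls flat. A sketch of the loop, reusing the earlier jaccard_percent() sketch and assuming the same $dialectLists array of code => top-200 words:

```php
<?php
// Sketch of compiling the full similarity matrix, assuming $dialectLists
// maps each dialect code to its top-200 word list (as in earlier sketches).
function similarity_matrix(array $dialectLists): array
{
    $matrix = [];
    foreach ($dialectLists as $rowCode => $rowWords) {
        foreach ($dialectLists as $colCode => $colWords) {
            $matrix[$rowCode][$colCode] = round(jaccard_percent($rowWords, $colWords), 1);
        }
    }
    return $matrix;   // e.g. $matrix['ABN']['MNA'] would come out around 73.5
}
```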
A slightly more fun way to display the information is with a hierarchical network diagram.
If ABN is 73.5% similar to MNA and MNA is 68% similar to DOR we could present it like this:-
And to cover all of the Doric-related dialects we could do a network like this:-
A similar network covering the Central dialects would look like this:-
Or, with the various percentages that represent similarity, we could draw the network like this:-
There’s probably some software out there that can do such a network diagram natively with the various lines behaving like elastic bands depending on the strength of the relationships. I just knocked this up in PowerPoint.
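Graphviz is one option: its neato layout engine does spring-model layouts, which is more or less the elastic-band behaviour I’m after. A hedged sketch of a little script that writes a DOT file from the handful of similarity figures quoted above (the edge length formula is just a guess at something sensible):

```php
<?php
// Sketch: write a Graphviz DOT file whose preferred edge lengths reflect
// similarity, then lay it out with the spring-model engine, e.g.
//   neato -Tpng network.dot -o network.png
// The edge list is just the pairs quoted in this article; the length
// formula is a guess at something sensible.
$edges = [
    ['ABN', 'MNA', 73.5],
    ['MNA', 'DOR', 68.0],
];

$dot = "graph scots {\n";
foreach ($edges as [$from, $to, $percent]) {
    $len  = round(100 / $percent, 2);   // more similar => shorter edge
    $dot .= "  $from -- $to [label=\"$percent%\", len=$len];\n";
}
$dot .= "}\n";

file_put_contents('network.dot', $dot);
```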
At this point I wonder if perhaps Fife should have its own dialect code, outwith the South East Central dialect. I don’t have enough field information to tell if there is a Fife dialect that’s significantly different; which writers would claim to write in the Fife dialect?
Comparing two dialects
If we want to compare two dialects and actually see the top 200 words used and which they have in common, we can do that.
Adding them to the a= and b= GET parameters should work, dialcomp.php?a=SEC&b=MNA, and it ought to accept any combination of major and minor dialect codes. This is kind of neat for showing which words the dialects have in common and which are more idiosyncratic.
Here we can clearly see differences between the East Central dialect and the Mid Northern A dialect. We can pick out corresponding alternate spelling preferences in the two dialects -
ti - te
whit - fit
wes - wiz
when - fin
wey - wye
guid - gweed
ane - een
Comparing with English
Finally, if you want to compare anything with English, dialcomp.php?a=MNA&b=English, the URL will accept ENG or English as a dialect code. Instead of compiling English word frequencies from the corpus, it uses a lookup table.
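The lookup table can be as simple as a two-column file of word and count. A sketch of loading it, assuming a tab-separated file already sorted by frequency (the filename and format are guesses, not the actual file):

```php
<?php
// Sketch of loading an English frequency lookup table; the filename and the
// tab-separated word<TAB>count format are assumptions, not the real file.
function english_top_words(string $file, int $top = 200): array
{
    $words = [];
    foreach (file($file, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES) as $line) {
        $words[] = mb_strtolower(explode("\t", $line)[0]);
        if (count($words) >= $top) {
            break;
        }
    }
    return $words;
}

// e.g. english_top_words('english_frequencies.tsv') instead of a corpus-built list.
```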
Again, we can clearly see corresponding pairs of spelling preferences:-
nae - not
wi - with
hae - have
wee - little
oan - one
The English wordlist
Originally I put together an English word frequency list from the SUBTLEX-UK corpus of British TV subtitles, but it included things like “n’t” and “‘ll” as discrete words, whereas my corpus counts them differently.
As an alternative I put together an English word frequency list from half a dozen English literature classics, using my corpus software to do the word-counting natively.
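Building that kind of list is only a few lines of code anyway. A sketch of counting word frequencies across a folder of plain-text classics (the folder name is a placeholder, and the real list was built with the corpus’s own word-counting rather than this script):

```php
<?php
// Sketch of building a word frequency list from a folder of plain-text
// books; the folder name is a placeholder, and the real list was built
// with the corpus's own word-counting rather than this script.
$counts = [];
foreach (glob('english_classics/*.txt') as $file) {
    $text   = mb_strtolower(file_get_contents($file));
    $tokens = preg_split("/[^\\p{L}']+/u", $text, -1, PREG_SPLIT_NO_EMPTY);
    foreach ($tokens as $token) {
        $counts[$token] = ($counts[$token] ?? 0) + 1;
    }
}
arsort($counts);
$englishTop200 = array_slice(array_keys($counts), 0, 200);
```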
The two different English word lists didn’t make much of a difference to the dialect comparison results, around 1% either way. As long as it’s consistent it doesn’t really matter; there’s no ideal or optimum English word list.
Proper nouns
I’ve created a lookup file which lists proper nouns, and the utility sometimes ignores proper nouns when doing comparisons. There are about five translations of “Alice in Wonderland” in the corpus, so “Alice” is a really common word here when usually it isn’t out in the real world. The option to ignore proper nouns or not should probably be a GET variable; I just haven’t figured out how to do it justice.
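The filtering itself is the easy bit; deciding what goes in the lookup file is the hard part. A sketch of the mechanics (the lookup filename and the ignore_proper=1 GET parameter are hypothetical):

```php
<?php
// Sketch of filtering proper nouns out of a word list before comparing.
// The lookup filename and the ignore_proper=1 GET parameter are hypothetical.
function strip_proper_nouns(array $words, string $lookupFile): array
{
    $proper = array_map('mb_strtolower',
        file($lookupFile, FILE_IGNORE_NEW_LINES | FILE_SKIP_EMPTY_LINES));
    return array_values(array_diff($words, $proper));
}

if (($_GET['ignore_proper'] ?? '0') === '1') {
    $topWords = strip_proper_nouns($topWords, 'proper_nouns.txt');
}
```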
Similarly, there are degrees of proper nouns: do you want to ignore place names as well as “Alice”? If we ignore the word “Scotland”, should we also ignore the words “Scots” and “Scottish”?
These words, “Scots” and “Scottish”, are used disproportionately more frequently in Scots writing compared to English writing. Is it fair to include them in a comparison?
Many proper nouns are also common nouns, “Ruby” and “Bob” for example. The utility isn’t smart enough to consider parts of speech.
Do I need to create two or three different proper noun lookup files, depending on how strict we want to be (proper nouns, geographic nouns, ambiguous nouns)? This isn’t important, is it?
And I keep spotting obvious proper nouns that are missing from the list, which makes me feel like a chump for not spotting them sooner.