Comparisons of Englishishness between Scots language dialects and writers
To what extent do these lexica vary?
For the past two years or so, partly as a covid-lockdown hobby, I have been compiling a corpus of 21st century Scots texts. This corpus is up to around 2,500,000 words now.
This is larger than the Helsinki Corpus of Older Scots (830,000 words) but not quite as big as Glasgow University’s Corpus of Modern Scottish Writing (CMSW, 5,500,000 words) or the SCOTS corpus (4,600,000 words).
The Helsinki corpus covers the years 1450 to 1700, the CMSW covers 1700 to 1945 and SCOTS covers 1945 to 2010. My corpus is the largest that exclusively covers the language as it is currently used in the 21st century.
In terms of size and up-to-dated-ness my corpus is pretty good. When folk on Twitter ask “what is Scots?”, my corpus has the answer.
I’m not a linguist; I’m a manufacturing engineer and stock controller. I can use my computery skills to count words, to count word frequencies, to categorise writers and then see what insights I can squeeze out of the data.
Top tens
The top ten most frequently used words on the corpus for each of the main dialect groups are as follows:-
We can see that for any given dialect, about half the words aren’t standard English, and we can see that there are several non-English words that are consistent across all the Scots language dialects.
The top 200 words can be found on this page which is updated on the fly.
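Counting word frequencies of this kind needs very little code. What follows is a minimal sketch in Python, not the script actually used for the corpus: the function name `top_words`, the tokenising pattern and the sample sentences are all assumptions made for illustration.

```python
import re
from collections import Counter

# Assumed tokeniser: lowercase letters plus straight and curly apostrophes.
# The real tokenising rules behind the corpus are not given in the post.
TOKEN_RE = re.compile(r"[a-z'\u2019]+")

def top_words(texts, n=10):
    """Return the n most frequent words across an iterable of strings."""
    counts = Counter()
    for text in texts:
        counts.update(TOKEN_RE.findall(text.lower()))
    return [word for word, _ in counts.most_common(n)]

# Tiny made-up sample, only to show the call shape (not corpus data).
sample = ["The cat sat on the mat.", "The dug and the cat."]
print(top_words(sample, n=2))  # ['the', 'cat']
```

Running the same function once per dialect group, with `n=10` or `n=200`, would give lists of the kind discussed here.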
Do you sometimes feel that we need to be objective?
We can also compare these top ten words with the most frequent words in British English. I used the SUBTLEX-UK corpus of UK TV broadcast subtitle text to generate a word frequency list. This corpus contains 201.3 million words, including 160,022 unique words.
Very roughly we can count how much the top ten most frequently used words in each dialect overlap with the top ten English words:-
I believe this is called the “Jaccard similarity” between the lexicon of each dialect and that of British English.
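For two top-N lists there are two closely related measures. The simple overlap proportion divides the number of shared words by N, while the Jaccard similarity divides it by the size of the union of the two lists. The post does not show which calculation produced its percentages, so the sketch below gives both. The function names and the five-word lists are assumptions for illustration, not corpus results.

```python
def overlap_proportion(a, b):
    """Share of the words in list a that also appear in list b."""
    a, b = set(a), set(b)
    return len(a & b) / len(a)

def jaccard(a, b):
    """Jaccard similarity: shared words divided by all distinct words."""
    a, b = set(a), set(b)
    return len(a & b) / len(a | b)

# Illustrative lists only; these are not the real top-ten tables.
scots_top = ["the", "a", "tae", "o", "an"]
english_top = ["the", "a", "to", "of", "and"]

print(overlap_proportion(scots_top, english_top))  # 0.4  (2 shared out of 5)
print(jaccard(scots_top, english_top))             # 0.25 (2 shared out of 8 distinct)
```

The two numbers move together, but Jaccard is always the smaller one unless the lists are identical or share nothing, so it matters which is being quoted.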
As we look at those top ten words we might note that in Doric, “he” is the tenth most common word, whereas in English it’s the 16th. Perhaps we should look a little further, at the top 200 or top 1000 most common words, and compare how the different dialects overlap with English.
Plotted on a graph with a logarithmic x-axis, we can see to what extent the overlap proportions wiggle or are consistent.
Once we look beyond the top 200 words, the lines seem to wiggle less.
The divergence and downward slope between the figures when we look at the top 100, top 200 and top 1000 is perhaps an artefact of written subject matter. Whilst the top ten words are usually prepositions, pronouns and determiners, which hardly vary depending on what is being written about, the top 1000 words are likely to be mostly nouns, verbs and adjectives, which vary greatly depending on the subject of the writing.
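The comparison at different list lengths can be sketched as a loop over cut-offs. This is a hedged illustration rather than the method used for the graph: `overlap_by_cutoff` is a hypothetical helper, it assumes each input is a list of distinct words already ranked from most to least frequent, and it uses the simple overlap proportion.

```python
def overlap_by_cutoff(ranked_a, ranked_b, cutoffs=(10, 100, 200, 1000)):
    """Overlap proportion of the top-n words of two ranked lists, per cut-off."""
    results = {}
    shortest = min(len(ranked_a), len(ranked_b))
    for n in cutoffs:
        if n > shortest:
            continue  # skip cut-offs that either list cannot fill
        shared = set(ranked_a[:n]) & set(ranked_b[:n])
        results[n] = len(shared) / n
    return results

# Made-up ranked lists, far shorter than real ones, to show the behaviour.
ranked_scots = ["the", "a", "tae", "o"]
ranked_english = ["the", "to", "a", "of"]
print(overlap_by_cutoff(ranked_scots, ranked_english, cutoffs=(1, 2, 4, 10)))
# {1: 1.0, 2: 0.5, 4: 0.5}
```

Plotting the resulting values against the cut-off on a logarithmic axis would give one line per dialect.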
To ensure consistency going forward in this exercise, we will focus on the top 200 words.
We can display the way lexicons overlap as a Venn diagram, here showing how the Central and Doric dialects overlap.
We can see that the Central Scots and Doric dialects overlap more with each other than they do with English, with an overlap of 68.5%.
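The three regions of a two-set Venn diagram are just set differences and an intersection. A minimal sketch follows; `venn_counts` and the word lists are hypothetical. If the 68.5% figure is a share of a 200-word list it would correspond to 137 shared words, though the post does not say exactly how it was derived.

```python
def venn_counts(a, b):
    """Sizes of the three regions of a two-set Venn diagram."""
    a, b = set(a), set(b)
    return {"only_a": len(a - b), "both": len(a & b), "only_b": len(b - a)}

# Illustrative five-word lists, not the real top-200 lists.
central = ["the", "a", "tae", "o", "wis"]
doric = ["the", "a", "tae", "o", "fit"]
print(venn_counts(central, doric))  # {'only_a': 1, 'both': 4, 'only_b': 1}
```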
Individual writers
With this process described we can compare the lexica of many individual writers in the corpus with English, reducing each writer’s entire body of writing to a single number representing how similar it is to English.
From this small sample of writers in the corpus we see that the average overlap with English is 32.75% with a standard deviation of 10.5%.
Emma Grae and Iain WD Forde represent two outliers, respectively writing in more Englishish Scots and broader Scots than the other writers.
As a sanity check we can insert some English literature writers into the corpus and validate how similar they are to the English word frequency list.
The standardisation of the English language is apparent here with the very narrow standard deviation of just 1.5%.
From this we can state that any writer or text with an overlap with English of 65% is probably writing in English, and anyone with an overlap of around 33% is probably writing in Scots.
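The averages, standard deviations and the rule of thumb above can be sketched with the Python standard library. Several things here are assumptions: the scores are invented, the post does not say whether a sample or population standard deviation was used (the sketch uses the sample version), and labelling a text by whichever reference value it sits closer to is only one possible reading of the rule. The defaults of 65 and 33 are the percentages quoted above.

```python
from statistics import mean, stdev

def summarise(scores):
    """Return (mean, sample standard deviation) of a list of overlap scores."""
    return mean(scores), stdev(scores)

def likely_language(overlap, english_ref=65.0, scots_ref=33.0):
    """Label a score by the reference value it is closer to (an assumed rule)."""
    if abs(overlap - english_ref) < abs(overlap - scots_ref):
        return "English"
    return "Scots"

# Invented scores for three imaginary writers, not corpus results.
made_up_scores = [20.0, 30.0, 40.0]
print(summarise(made_up_scores))   # (30.0, 10.0)
print(likely_language(60.0))       # English
print(likely_language(40.0))       # Scots
```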
If we look at the top 200 most prolific writers in the Scots Corpus and find their individual Englishishness scores, we can plot them as a histogram.
With this larger sample size the average English overlap percentage moves to 29.3% with a standard deviation of 8.4%.
Scots language writers who match this average 29.3% Englishishness score, and so can be characterised as writing in typical Scots, neither too broad nor too Englishish, include Billy Kay, Sandy Fleemin and Robert Fairnie.
The histogram is roughly bell-shaped, like a normal distribution; that is to say, about 95% of the writers are within two standard deviations of the mean.
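The two-standard-deviation claim is easy to check directly against a list of scores. The helper below, `share_within`, is a hypothetical name and the numbers are invented. For normally distributed data the share within two standard deviations should come out at roughly 0.95.

```python
from statistics import mean, stdev

def share_within(scores, k=2.0):
    """Fraction of scores lying within k sample standard deviations of the mean."""
    centre, spread = mean(scores), stdev(scores)
    inside = sum(abs(x - centre) <= k * spread for x in scores)
    return inside / len(scores)

# Invented values, only to show the calculation.
made_up_scores = [1, 2, 3, 4, 5]
print(share_within(made_up_scores, k=1))  # 0.6
print(share_within(made_up_scores, k=2))  # 1.0
```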
At this point I will refrain from naming the outlier Scots writers. There’s a whole ethical discussion to be had that I’m not equipped to convey.
Small regional dialects
All the texts in the corpus are tagged into smaller regional dialect groups.
Orkney
Shetland
North Northern B (Caithness)
North Northern A (Black Isle)
Mid Northern A
Mid Northern B
South Northern
Aberdeen
General Northern / Doric
North East Central
(South) East Central
West Central
Dundee
Edinburgh
Glasgow
Ayrshire
General Central
South East (Borders)
South West (Galloway)
Donegal (East Donegal)
West Ulster (Letterkenny / L'Derry)
Coleraine Ulster (North Antrim)
Ballymena Ulster (Mid Antrim)
South Antrim (Between Sixmilewater and Belfast)
Eastern Ulster (Belfast)
Peninsular Ulster (Ards)
East Antrim (Larne)
General Ulster
Synthetic Ulster (no region)
Some of these groups, however, don’t have enough writers in the corpus to be representative.
We can use a box and whisker plot to show how much the Englishishness of writers varies within each regional dialect.
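A box and whisker plot is drawn from five numbers per group: minimum, lower quartile, median, upper quartile and maximum. Here is a hedged sketch of computing them per dialect tag. The function names, the placeholder dialect labels and the scores are all assumptions, and the choice of the "inclusive" quartile method is mine; plotting tools may use slightly different conventions.

```python
from statistics import quantiles

def five_number_summary(scores):
    """Return (min, lower quartile, median, upper quartile, max)."""
    q1, median, q3 = quantiles(scores, n=4, method="inclusive")
    return min(scores), q1, median, q3, max(scores)

def summarise_by_dialect(writer_scores):
    """Group (dialect, score) pairs and summarise each group of two or more."""
    groups = {}
    for dialect, score in writer_scores:
        groups.setdefault(dialect, []).append(score)
    # Groups with a single writer are skipped: too few to summarise.
    return {d: five_number_summary(s) for d, s in groups.items() if len(s) >= 2}

# Placeholder dialect labels and invented scores, not corpus data.
pairs = [("A", 1), ("A", 2), ("A", 3), ("A", 4), ("A", 5), ("B", 30)]
print(summarise_by_dialect(pairs))  # {'A': (1, 2.0, 3.0, 4.0, 5)}
```

Skipping the one-writer group mirrors the caution that some dialect groups are too small to be representative.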
First we can look at the Central Scots dialects.
It’s difficult to say that the Glaswegian dialect is the most Englishish. Whilst the Glasgow dialect on average is closest to English, some Glasgow Scots writers are more Englishish than others.
I think the personal spelling and vernacular tastes of writers are more significant than the regional dialect. Some Glaswegian writers write broad Glasgow Scots and some write more Englishish Glasgow Scots.
Likewise in Doric Scots.
There’s nothing inherent in the Aberdeen Doric dialect that makes it more Englishish than the more rural Mid Northern A dialect; it’s purely down to personal taste.
With Insular Scots, it’s a bit different. The Orkney Scots dialect is objectively more Englishish than the Shetland variety.
With the Ulster-Scots dialects, there is a far broader range of Englishishness between different writers, although generally Ulster-Scots is the least Englishish of all the Scots dialects.
When a writer writes Scots…
I’m not one to judge whether something is written in Scots or not. For the most part the texts in the corpus are Scots because the writer has themselves declared that it is not English, it is Scots!
The idea of the Scots language is an individual thing that resides in the minds of each writer, and indeed in the minds of the 1.8 million people who reported they understood Scots in the 2011 Scottish census.
This idea of Scots can be shown to be distinct from the idea of English, with a clear gap in terms of lexicon. English uses one set of words and Scots uses a different set, one that overlaps with English to a relatively uniform degree of around 29.3%.