The other week I published an article here on measuring the Englishishness of writing in Scots. We compile a word frequency list from a target Scots text, perhaps by a single writer, and compare it with a general British English word frequency list. Looking at just the top 200 words in each list, the percentage of words that appear in both the author’s list and the English list gives a score of Englishishness.
A similar exercise can be carried out comparing an author’s word frequencies with a frequency list generated from the entire corpus of Scots texts.
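The scoring described above can be sketched in a few lines of Python. This is a minimal illustration, not the code actually used for these articles: the function names `top_words` and `overlap_score`, the simple regex tokeniser and the choice of denominator are all assumptions.

```python
import re
from collections import Counter

def top_words(text, n=200):
    """Return the n most frequent words in text, lowercased."""
    # Assumed tokeniser: runs of letters, optionally joined by apostrophes.
    words = re.findall(r"[a-z]+(?:'[a-z]+)*", text.lower())
    return [word for word, _ in Counter(words).most_common(n)]

def overlap_score(author_top, reference_top):
    """Percentage of the author's top words that also appear in the reference list."""
    if not author_top:
        return 0.0
    shared = set(author_top) & set(reference_top)
    # Assumption: the score is taken relative to the length of the author's list.
    return 100.0 * len(shared) / len(author_top)
```

Calling `overlap_score` with an author's top 200 and the English top 200 would give an Englishishness score; swapping in a Scots reference list would give the Scotsishness score.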
Regionality
The Scots language is widely accepted as falling into about six dialect groups, namely Central, Northern (Doric), Shetland, Orkney, Southern and Ulster-Scots. In terms of lexis and syntax, these dialect groups have more in common with each other than they do with English. However, there are a handful of significant spelling differences between the dialects.
Ulster-Scots has a preference for spelling the definite article THA, whilst the others generally use THE.
Doric has a preference for spelling question words with F instead of WH, for example FIT - WHIT, FAN - WHEN, FOO - HOW.
Shetland has a preference for using a D sound instead of TH or Y for some words, for example DA - THE, DIS - THIS, DAT - THAT, DU - YOU.
It would be unfair to “score” Doric or Shetlandic writers based on the Central Scots dialect, so instead we classify each writer by their closest dialect group.
Identifying writers by dialect
Doric writer Sheena Blackhall is one of the most prolific in the corpus and could be characterised as a typical writer of Doric. We can compare her ten most common words with the top tens from the Ulster-Scots and Doric texts in the corpus.
From this top ten we can see that Sheena has more in common with other Doric writers than with Ulster-Scots writers.
If we look further, at the top 200 most common words using the dialect comparison utility, we can see that this alignment with Doric over Ulster-Scots (and the other dialects) holds up even more clearly.
So, just from the top 200 words, we could say that Sheena Blackhall’s writing is 58.5% Doric Scotsish and 26.5% Englishish.
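Picking the closest dialect could be sketched as below. This is only an illustration of the idea, not the dialect comparison utility itself: the function name `closest_dialect` is hypothetical, and the short word lists in the usage example are toy stand-ins for real top-200 lists.

```python
def closest_dialect(author_top, dialect_tops):
    """Return (best_dialect, scores) for an author's list of top words.

    dialect_tops maps a dialect name to that dialect's top-word list and
    must contain at least one dialect. scores maps each dialect name to
    the percentage of the author's words found in that dialect's list.
    """
    author_set = set(author_top)
    scores = {}
    for name, top in dialect_tops.items():
        shared = author_set & set(top)
        scores[name] = 100.0 * len(shared) / len(author_set) if author_set else 0.0
    # The closest dialect is simply the one with the highest overlap.
    best = max(scores, key=scores.get)
    return best, scores

# Toy lists built from the spelling markers mentioned earlier (illustrative only).
toy_dialects = {
    "Doric": ["the", "fit", "fan", "foo"],
    "Ulster-Scots": ["tha", "whit", "when", "how"],
    "Shetland": ["da", "dis", "dat", "du"],
}
best, scores = closest_dialect(["the", "fit", "fan", "whit"], toy_dialects)
```

With these toy lists the author shares three of four words with the Doric list and one with the Ulster-Scots list, so `best` is `"Doric"`.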
Other writers
If we carry out the same exercise on other prolific writers in the corpus, we get the following Scotsishness and Englishishness scores.
They average 35.7% Englishishness and 62.1% Scotsishness in their respective dialects.
It pains me to reduce a writer’s entire artistic output to just two numbers, when they may have agonised over each word, each spelling, characterisation and nuance. It does, however, allow us to plot pretty graphs.
Pretty graphs
It’s important to understand what this graph represents and where its limitations lie.
Because the English and Scots top 200 words overlap by about 38%, it’s impossible for a writer to score both 100% Englishish and more than 38% Scotsish, or 100% Scotsish and more than 38% Englishish. This limitation describes the areas of impossibility on the graph.
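The limits can be worked out by simple counting. The sketch below assumes all three lists (author, English, Scots) contain exactly 200 distinct words, and treats "about 38%" as 76 shared words; the function name `feasible_scots_range` is hypothetical.

```python
def feasible_scots_range(english_hits, reference_overlap, n=200):
    """Bounds on how many of an author's top n words can be in the Scots top n.

    english_hits: how many of the author's top n are in the English top n.
    reference_overlap: how many words the English and Scots top n share.
    """
    # Only n - reference_overlap English words lie outside the Scots list,
    # so any English hits beyond that number must also be Scots hits.
    lowest = max(0, english_hits + reference_overlap - n)
    # At most reference_overlap of the English hits can also be Scots, and
    # the n - english_hits remaining words could all be Scots-only words.
    highest = min(n, n - english_hits + reference_overlap)
    return lowest, highest
```

Under these assumptions, a writer whose top 200 matched the English list exactly (`english_hits=200`) would be pinned at 76 Scots hits, which is 38%, while a writer with 100 English hits could land anywhere from 0 to 176.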
If we go through the top two hundred most prolific writers in the corpus, and a small handful of classic English writers, we can plot a huge cloud of their Englishishness and Scotsishness scores on a single graph.
We can see how there is a clear “air gap” between standard English writers and 21st century Scots writers.
The Scots - English continuum
Writer and translator Ashley Douglas has done an excellent job of explaining the Scots-English continuum: because English and Scots are mutually intelligible, speakers and writers find themselves code-switching between the two languages to greater or lesser extents.
On our graph we might see this continuum as a diagonal line from top left to bottom right, with the more Scotsish writers to the left and the more Englishish writers to the right.
Looking at the writers along this continuum, we can see that it roughly holds up: writers on the upper left edge are traditional, establishment Scots writers like Stephen Pacitti, Thomas Clark and Sandy Flemming, while writers on the lower right edge are newer, younger, urban writers like Colin Burnett, Peter Bennett and Emma Grae.
If there was an easy way to transform the x, y coordinates by rotating them 45 degrees anticlockwise, this might provide a single number representing where a piece of text lies on the Scots-English continuum.
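A standard two-dimensional rotation does this. The sketch below assumes Englishishness is plotted on the x axis and Scotsishness on the y axis, which is inferred from the description of the graph and not stated outright; the function name `rotate` is just for illustration.

```python
import math

def rotate(x, y, degrees=45.0):
    """Rotate the point (x, y) anticlockwise about the origin."""
    t = math.radians(degrees)
    return (x * math.cos(t) - y * math.sin(t),
            x * math.sin(t) + y * math.cos(t))
```

At 45 degrees the first rotated coordinate works out as (Englishishness - Scotsishness) / √2, so more negative values are more Scotsish and more positive values more Englishish. For Sheena Blackhall's scores, `rotate(26.5, 58.5)` gives roughly (-22.6, 60.1).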
What stands out more than the Scots-English continuum is the more significant eigenvector, running from bottom left to top right. This seems to represent a continuum from non-standard (or specialist) spellings on the left to standard (frequently used) spellings on the right.
The articles in the corpus representing these non-standard (or specialist) spellings seem to be government / political writing, science writing, and exceptionally broad dialect Scots writing, where either the subject matter or the spellings aren’t likely to feature in lists of most frequent words.
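One way to check which direction really dominates the cloud is to compute its principal axis, the direction of greatest variance, from the covariance of the points. The sketch below uses the closed-form angle for the two-dimensional case; the function name `principal_axis_angle` is hypothetical, and it has not been run against the actual corpus scores.

```python
import math

def principal_axis_angle(points):
    """Angle in degrees, anticlockwise from the x axis, of the direction
    of greatest variance in a non-empty list of (x, y) points."""
    n = len(points)
    mean_x = sum(x for x, _ in points) / n
    mean_y = sum(y for _, y in points) / n
    var_x = sum((x - mean_x) ** 2 for x, _ in points) / n
    var_y = sum((y - mean_y) ** 2 for _, y in points) / n
    cov_xy = sum((x - mean_x) * (y - mean_y) for x, y in points) / n
    # Orientation of the leading eigenvector of the 2x2 covariance matrix.
    return math.degrees(0.5 * math.atan2(2 * cov_xy, var_x - var_y))
```

An angle near 45 degrees would mean the bottom-left to top-right direction carries most of the spread, while an angle near -45 degrees would point to the Scots-English diagonal instead.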
Here’s the graph rotated by 45 degrees, and scaled so that the Scots-English continuum spans the whole width of the chart. A handful of names have been labelled.
I’m not sure if this adds anything. At some point we’re just manipulating the data into meaninglessness. With the un-rotated graph we know where the information has come from; with the rotated graph I’m not sure if I’m trying to make a point or anything.
Regionality
In these first few graphs the different dialects are all mixed together in the swarm of dots representing Scots writers.
However, we can also display each dialect in a different colour:
This graph is a little difficult to make out, so instead I’ve drawn shapes round each dialect group.
From this we can surmise that Ulster-Scots is less standardised than the other dialects and the Shetland dialect is most cohesive.
The Central and Doric blobs are large, representing a wide variety of standard and non-standard writings and a broad range of more Englishish and less Englishish writers.
More interestingly, we can look at the finer dialect groups within the Central Scots dialect.
If we imagine the two continuums previously mentioned, each colour grouping displays the standard - non-standard continuum slope, but here we can see that the Glasgow dialect (in orange) tends towards the right, indicating it’s more Englishish than the other dialects. The East Central dialect (yellow) tends towards the left, indicating it is less Englishish than the other dialects.