Emergent dialects among Scots Writers
If we didn't know that Doric and Ulster-Scots existed, could we identify those dialects?
In previous Substack posts we established that Scots language writers write distinctly differently to Standard English writers. By comparing the top 200 words in a corpus of 21st-century Scots texts with the top 200 words in Standard English, we could see that on average Scots writers share around 30% to 45% of their vocabulary with English (compared with English writers, who share more than 60% of their vocab amongst themselves).
In a subsequent Substack post we looked at how Scots writers adhere to predetermined Scots dialects, with each writer sharing between 40% and 70% of their vocabulary with either Central, Doric, Shetland, Orkney or Ulster-Scots.
This all pre-supposes that Standard, Central, Doric etc. already exist as a priori established languages and dialects. But what if we could determine what dialects exist, just by comparing writers’ similarities with each other?
Surely writers of the same dialect would have more lexical similarity with each other than with writers of other dialects?
Similar writers
Matthew Fitt and James Robertson are mainstream Scots language writers. They share a publisher, both write Scots translations of established children’s books, and I like to imagine they share an office. If we compare their top 200 most frequently used words, we can see that they have 75.5% in common.
Perhaps a regular occurrence is that James calls across the desks “Matthew, how are you spelling ‘stammygastert’?”
“Oh, I’m spelling it ‘stammygastert’!”
“Aye, me too”
On the other hand, Shetlandic writer Christine De Luca is a long way away from their office and probably doesn’t compare spellings with them.
Her top 200 words and James Robertson’s have only 33% in common.
Similarly, Matthew Fitt and Christine De Luca have only 33.5% of their vocab in common.
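For anyone who wants to try the comparison themselves, here is a minimal sketch of how a "top 200 words in common" figure could be worked out. It is an assumption about the method, not the exact procedure behind the numbers above: the function names, the simple tokeniser and the choice to divide by the list size are all mine.

```python
from collections import Counter
import re

def top_words(text, n=200):
    """Return the set of the n most frequent words in a text."""
    # Assumed tokeniser: runs of letters, allowing apostrophes inside a word.
    words = re.findall(r"[^\W\d_]+(?:['’][^\W\d_]+)*", text.lower())
    return {word for word, _ in Counter(words).most_common(n)}

def shared_percent(text_a, text_b, n=200):
    """Percentage of the top-n list that two texts have in common."""
    shared = top_words(text_a, n) & top_words(text_b, n)
    return 100 * len(shared) / n

# Toy example with a top-4 list rather than a top-200 list.
print(shared_percent("the wee dug ran tae the hoose",
                     "the peerie dug ran tae the hoose", n=4))
```

On this reading, 75.5% in common means 151 of the 200 words appear in both lists, and 33.5% means 67 of them.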
We could plot this data as a Network Graph, showing that James Robertson and Matthew Fitt are close together and Christine De Luca is far away.
How many Scots Writers
According to the British Labour Survey, about 1 in 1,000 people in the UK are professional writers. According to the 2011 Scottish census, there are about 1.5 million people who can speak Scots in Scotland, so there should be around 1,500 Scots language writers, give or take a few hundred.
In my corpus of 21st-century Scots texts there are just under 500 writers listed, and only a small fraction of these are professional writers. My point is that the corpus is pretty representative, in the best case sampling a third of the writer population.
I’ve taken pains to cast my net as wide as possible to find writers of all genres and ages in all geographic locations, and I’m keen to fill as many gaps as I can, like some kind of linguistic comic book collector. Who have I missed?
If we go through and compare each writer to every other writer, it will take me ages. There would be about 125,000 comparisons in total. Instead I will start with the most prolific writers in the corpus and work my way down.
It starts like this:-
Getting the first 60 writers means 1,770 datapoints: 1,770 webpages to check. I’ve honed my procedures and can rattle through about 10 per minute, so it’s about three hours’ work, more or less.
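The comparison counts are just "how many pairs can you make from n writers", which is n × (n − 1) / 2. A quick sketch to check the arithmetic (the function names are mine, and the 10-per-minute rate is the rough figure quoted above):

```python
from math import comb

def pair_count(writers):
    """Number of distinct writer-to-writer comparisons."""
    return comb(writers, 2)

def hours_needed(pairs, per_minute=10):
    """Rough time to check every pair at a steady rate."""
    return pairs / per_minute / 60

print(pair_count(500))                         # 124750, i.e. about 125,000
print(pair_count(60))                          # 1770
print(round(hours_needed(pair_count(60)), 2))  # 2.95 hours
```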
Network Graphs
You might have seen network graphs in popular infographics. They’re pretty cool. You might even have played around with them, dragging points around in multidimensional space. I know I have. But I’ve never before had any data of my own to use.
Google Fusion Tables used to have a good Network Graph creator, but Google shut it down a few years ago. The R statistical language also has a function for generating the graphs. I spent an afternoon playing around with it and found it too time-consuming to get a good-looking graph. Finally I found Flourish.Studio, which seems to be easy to use and looks great.
And with the data for the first 60 writers sampled we can plot a Network Graph.
At this point we realise that our network graph software doesn’t actually do anything with links of different strengths. It’s just a big blob of writers who are all linked to each other.
We can soldier on.
If we go through the list and remove weak links, disconnecting any that are less than 30%, the tarball starts to change.
Whilst this doesn’t look like much, those names mean something to me: the names on the left are Ulster-Scots writers and the names on the right are Shetlandic and Orcadian writers.
I know that we were looking for emergent dialects, but let’s colour-code the datapoints by the island that the writers are from, just to help out a little bit.
So there’s something going on, and we’re on the right track.
Progressively breaking weak links until only 35% and stronger lexical connections remain gives this network.
Here we can see that Ulster-Scots is breaking away at the left, Shetlandic breaks away at the east, and Orkney in the north east.
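Breaking weak links is simple enough to sketch in code. The version below is an assumption about one way to do it, not what Flourish does internally: it drops every link under a threshold and then lists the groups of writers still joined by some chain of links. That is cruder than watching a graph layout drift apart, and the function names are mine. The only data is the three percentages quoted earlier in this post.

```python
def strong_links(links, threshold):
    """Keep only the pairs whose shared vocabulary meets the threshold."""
    return {pair: pct for pair, pct in links.items() if pct >= threshold}

def groups(writers, links):
    """Groups of writers still joined, directly or via other writers."""
    neighbours = {w: set() for w in writers}
    for a, b in links:
        neighbours[a].add(b)
        neighbours[b].add(a)
    seen, found = set(), []
    for start in writers:
        if start in seen:
            continue
        stack, group = [start], set()
        while stack:
            w = stack.pop()
            if w in group:
                continue
            group.add(w)
            stack.extend(neighbours[w] - group)
        seen |= group
        found.append(group)
    return found

writers = ["Matthew Fitt", "James Robertson", "Christine De Luca"]
links = {
    ("Matthew Fitt", "James Robertson"): 75.5,
    # 33% is read here as the Robertson / De Luca figure (an assumption).
    ("James Robertson", "Christine De Luca"): 33.0,
    ("Matthew Fitt", "Christine De Luca"): 33.5,
}

for threshold in (30, 35):
    print(threshold, groups(writers, strong_links(links, threshold)))
```

At 30% all three writers are still in one group. At 35% Christine De Luca is left on her own.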
When we continue breaking weak links, so that only 40% and stronger remain, our graph continues to undulate and ruminate.
We have successfully identified Shetlandic as a distinct dialect of Scots: its writers are more similar to each other than they are to the rest of the Scots writing sphere.
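"More similar to each other than to the rest" can be put as a number by comparing the average similarity inside a group with the average similarity between the group and everyone else. This is a sketch of that idea with hypothetical function names, not something the graph software does. It uses only the three percentages quoted earlier in this post, so the group here is the two office-mates rather than the Shetlandic writers.

```python
from itertools import combinations
from statistics import mean

def pct(links, a, b):
    """Look up a pair's shared-vocabulary percentage in either order."""
    return links.get((a, b), links.get((b, a)))

def cohesion(group, everyone, links):
    """Mean similarity inside a group, and from the group to outsiders."""
    outsiders = [w for w in everyone if w not in group]
    inside = [pct(links, a, b) for a, b in combinations(group, 2)]
    outside = [pct(links, a, b) for a in group for b in outsiders]
    # Assumes every pair has been measured; a missing pair would be None.
    return mean(inside), mean(outside)

everyone = ["Matthew Fitt", "James Robertson", "Christine De Luca"]
links = {
    ("Matthew Fitt", "James Robertson"): 75.5,
    # 33% is read here as the Robertson / De Luca figure (an assumption).
    ("James Robertson", "Christine De Luca"): 33.0,
    ("Matthew Fitt", "Christine De Luca"): 33.5,
}

print(cohesion(["Matthew Fitt", "James Robertson"], everyone, links))
```

A group whose first number is well above its second is hanging together. A group where the two numbers are close is not distinguishable from its neighbours by this measure.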
We could argue that Orkney is also a distinct dialect at this point, apart from Hazel Parkins and Josie Giles, who don’t seem to be as connected to the rest of Orkney, each in a different way. Hazel has lexical similarity connections with loads of writers all across Scotland.
Conversely, Josie Giles only has lexical similarity connections to three other writers.
Is this because their writing is disproportionately science-fiction?
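Counting how many connections each writer has left at a given threshold is a quick way to spot both the well-connected writers and the loners. A minimal sketch, with a function name of my own and only the three percentages quoted earlier in this post as data:

```python
from collections import Counter

def connection_counts(links, threshold):
    """Count, per writer, the links at or above the threshold."""
    counts = Counter()
    for (a, b), pct in links.items():
        if pct >= threshold:
            counts[a] += 1
            counts[b] += 1
    return counts

links = {
    ("Matthew Fitt", "James Robertson"): 75.5,
    # 33% is read here as the Robertson / De Luca figure (an assumption).
    ("James Robertson", "Christine De Luca"): 33.0,
    ("Matthew Fitt", "Christine De Luca"): 33.5,
}

counts = connection_counts(links, 40)
for writer in ("Matthew Fitt", "James Robertson", "Christine De Luca"):
    print(writer, counts[writer])  # a Counter gives 0 for writers with no links
```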
Mainland Scots dialects
Moving on, let’s focus exclusively on mainland Scotland Scots writers. We can colour-code writers known to write in the Doric and Northern dialects.
The Doric writers do cluster together to some extent, but still seem close to the other mainland writers. At this point we might raise the water level of lexical similarity and remove any links less than 42%.
To be honest, it doesn’t make much of a difference. They are still closely linked to the other mainland writers.
We might note that Catherine Byrne, whose sole work of literature in the corpus is a Caithness dialect translation of Alice in Wonderland, doesn’t have any lexical connections with other Northern/Doric writers at this point.
If we remove Northern writers from our analysis at this point, some of the names on the right are familiar: they represent youngish authors who have published books or risen to prominence in the last two years. I’ll characterise them as “New Wave” writers.
In the analyses so far this group has always stuck close together in terms of lexical similarity, although it includes both Glasgow and Edinburgh writers. Perhaps it’s the urban-ness that makes them a cohesive group.
Finer Regional Dialects
The other day I was showing off these network diagrams on Twitter and Left Peggers asked about the Ayrshire dialect.
Now, I’m just a poor Englishman, living in England. It’s been twenty-five years since I last went to Ayrshire and about ten years since I was in Dumfries or Galloway. I don’t know what the Ayrshire dialect sounds like. I can only buy books and look at graphs, and try to draw conclusions.
One of the first books in my collection was Chuckies Fir the Cairn, a selection of poetry by a handful of writers from the Dumfries and Galloway region. If we colour-code the writers on the graph, we might be able to see some kind of grouping.
Although they are writers from the same region, they don’t naturally hang together in terms of lexical similarity, not in the way that the Insular Scots writers do, or the Doric writers, or even our New Wave writers.
If there is a distinct Dumfries and Galloway dialect it isn’t detectable by this method.
Do you want to play?
It occurs to me that whilst this has been a fascinating experiment and play around with data, you might want to have a go too.
Here’s a link to the visualisation
I think it’s also possible to faff about with the data behind it.
Let me know if you find anything interesting.