Introducing: a Parallel Corpus of Government Texts in Scots and English
In which I describe a cool new website, if you think Scots and corpus linguistics are cool.
Here is a new website I have put together.
It displays English and Scots versions of government documents that I have found, side by side, with corresponding sentences aligned.
There is a very basic search function that makes it display only the lines containing the search term, with the search term in bold text.
Currently there are only around seven documents in the corpus, or 800 sentences / lines. More will be added in due course.
Copyright
The documents included so far seem to be generally covered by similar Government Licences - we’re free to copy, publish, distribute, transmit, adapt, exploit, combine them with other material, and include them in our own products, but with the proviso that links are provided to the licences.
I need to figure out how to provide such links. I guess each text will have a tag relating to which copyright regime / licence it’s held under, and then just a wee link on each and every line.
Do any government branches actually care? Are they not delighted that someone, anyone, is actively reading these documents that have been translated at great expense?
The Universal Declaration of Human Rights is kind of in the public domain, outwith even the government licence. But some websites claim copyright over it.
This sort of thing is rife - organisations like museums and government agencies claim copyright over books, photos and documents that are hundreds of years old. Even Wikimedia occasionally slips Creative Commons licences onto works they have no rights to.
When posting this image, according to Wikimedia, I have to give credit to a Greek potter and painter called Exekias who died about 2,500 years ago.
Introducing this sort of ambiguity into copyright undermines the entire system. Do I really need to go to Greece, study ancient Greek law, and try to find out who owns the copyright on an image from literally the start of recorded history? Or should I just use the image, and keep it on this Substack article until Exekias himself rises from Halcyon swamps and quietly asks me to take it down?
Parallel corpora
It needs to be stated very early on that this corpus, and parallel corpora in general, are not Artificial Intelligence. They are not AI. A parallel corpus is merely an effective way of displaying translated text.
Parallel corpora have existed for centuries, or millennia if we consider the Rosetta Stone. Having the texts side by side, aligned by sentence, with corresponding words marked up in a similar manner is a neat way of examining translations in context.
Rather than using a translation dictionary on a word-by-word basis, parallel corpora allow you to look at how the words are used, and compare meaning in context.
Government documents
Over the years I have seen a small variety of government texts translated into Scots. Mostly the translation exercise is carried out by nameless translators.
Within Scottish Scots some translators are really good, like Ashley Douglas and Dauvit Horsbroch, who are talented writers in their own right, and other translators are garbage who either spell words like they’ve never read written Scots before, or just rummage through the Dictionary of Older Scots for archaic words that no one has used for centuries.
We must recognise the fact that whilst these government documents exist in Scots, the politicians and legislators are most likely using the English version, and unlikely to even read the Scots versions.
I read them. It’s a secret world of esoteric nuance.
Like all government documents, the prose does not represent how people normally speak the language, or even how people typically write the language. It’s a kind of formal register, with legal overtones.
Facebook
The other day over on that Facebook, Steve Byrne posted a link to a Westminster government document written in Scots - translated from an English version - it was the executive summary of a consultation about the 2003 UNESCO Convention for the Safeguarding of the Intangible Cultural Heritage.
As soon as I found the English version, I copied and pasted both documents into Excel, side by side. The two documents aligned perfectly. I thought “you know who I have to share this with - Scots corpus linguists!”
The website is really simple. It’s a PHP script that grabs an XML file consisting of the parallel texts marked up with useful metadata, and displays the texts in a table. (I prepare the XML file offline, laboriously, line by line.)
Then there’s a little function that lets you search for specific words, and it only displays lines containing those words.
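For anyone curious what that amounts to, here is a rough sketch of the same idea. To be clear, this is not the site's actual code: the real thing is PHP, and this is Python purely for illustration. The XML layout (corpus, text, line, en, sco), the function names and the two sample sentences are all invented for the example, not taken from the corpus.

```python
import html
import re
import xml.etree.ElementTree as ET

# Hypothetical markup: the real file's element and attribute names are not
# described in the post. The sentences are invented toy examples.
SAMPLE_XML = """<corpus>
  <text id="demo" licence="example-licence">
    <line n="1"><en>We must protect our heritage.</en><sco>We maun proteck oor heritage.</sco></line>
    <line n="2"><en>This is about language.</en><sco>This is aboot leid.</sco></line>
  </text>
</corpus>"""


def load_lines(xml_text):
    """Parse the XML into a list of aligned English/Scots rows."""
    root = ET.fromstring(xml_text)
    rows = []
    for text in root.iter("text"):
        for line in text.iter("line"):
            rows.append({
                "text_id": text.get("id"),
                "n": line.get("n"),
                "en": line.findtext("en", default=""),
                "sco": line.findtext("sco", default=""),
            })
    return rows


def highlight(sentence, term):
    """HTML-escape the sentence and wrap each match of term in <b> tags."""
    pattern = re.compile(re.escape(html.escape(term)), re.IGNORECASE)
    return pattern.sub(lambda m: "<b>" + m.group(0) + "</b>", html.escape(sentence))


def search(rows, term):
    """Return (english, scots) pairs for rows where either side contains term."""
    needle = term.lower()
    hits = []
    for row in rows:
        if needle in row["en"].lower() or needle in row["sco"].lower():
            hits.append((highlight(row["en"], term), highlight(row["sco"], term)))
    return hits


rows = load_lines(SAMPLE_XML)
for english, scots in search(rows, "maun"):
    print(english, "|", scots)
```

Note that this does plain substring matching, so a search for AN would also hit AND and ANENT. Whether the real site matches whole words is not something this sketch tries to reproduce.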
This isn’t rocket science.
As soon as I broke the task down into small bitesize steps, I realised that it would only take an hour or two to set up. There’s no magic or AI going on, no analysis or word counting this time.
Style
About a year ago, there was a fascinating thread on Twitter, and an accompanying academic paper, by Professor Jack Grieve et al., about corpus design.
What I took from it was that instead of just chucking more and more random texts at a corpus of Scots writing, it is better to focus on a single specific type of text. The more specific, the better.
So this corpus specifically covers translated government documents written in the mainland Scots central dialect - not Ulster-Scots or Doric, not the Orkney or Shetland varieties.
Analysis
We must keep in mind that translation is something of an art. There’s no objective, innate “one true translation”. Note that there are literally hundreds of English translations of the Bible, and deciding one is correct and all the others are wrong is how wars start.
But we can still recreationally note differences in translations. Here are a few things I’ve found over the last day or so. Feel free to have a play around to find anything else of interest - let me know in the comments and, as the kids say, don’t forget to click like and subscribe.
MUST / MAUN / SHOULD
Here we see that in the 2003 UNESCO document, the word MAUN is used for MUST.
However, the same document also uses MAUN for SHOULD.
This ambiguity would be problematic if the document were legislation - MUST and SHOULD have different meanings in law.
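If you wanted to check this sort of thing across the whole corpus rather than by eye, a crude tally would do. The sketch below is Python, and everything in it is an assumption for illustration: the function name, the list-of-pairs data shape and the three toy sentence pairs are invented, not taken from the UNESCO document. It also only works at sentence level. It counts aligned lines where the Scots side contains a word and the English side contains one of the candidate words, which is not the same thing as proper word alignment.

```python
import re
from collections import Counter


def tally_renderings(pairs, scots_word, english_candidates):
    """Count which candidate English words appear in lines whose Scots side
    contains scots_word. Sentence-level co-occurrence only."""
    scots_re = re.compile(r"\b" + re.escape(scots_word) + r"\b", re.IGNORECASE)
    counts = Counter()
    for english, scots in pairs:
        if not scots_re.search(scots):
            continue  # the Scots word is not on this line
        matched = False
        for candidate in english_candidates:
            candidate_re = r"\b" + re.escape(candidate) + r"\b"
            if re.search(candidate_re, english, re.IGNORECASE):
                counts[candidate.upper()] += 1
                matched = True
        if not matched:
            counts["OTHER"] += 1  # the English side uses some other word
    return counts


# Invented toy pairs (English, Scots), purely to show the shape of the data.
TOY_PAIRS = [
    ("Members must register.", "Members maun register."),
    ("Members should attend.", "Members maun attend."),
    ("Members may vote.", "Members can vote."),
]

print(tally_renderings(TOY_PAIRS, "maun", ["must", "should"]))
```

A line that contains both MUST and SHOULD in English would be counted under both, so any real numbers from this would need checking by hand.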
ABOUT / ABOOT / ANENT
I noticed that some translators were using ANENT as the Scots translation of ABOUT, whereas I would have used ABOOT.
A commentator on BlueSky pointed out that ANENT is used in a formal register. I wouldn’t have known this, not being a native speaker. But now I’m keeping a figurative post-it note on my desk to remind me.
AND / AN
In a lot of contemporary Scots writing, AN is used instead of the English AND. However, in the Scots government documents, AND is used more often.
I guess this could be a register thing again, or personal choice by the translators. It doesn’t feel like there should be a marked difference between natural and formal usage, but I’m not a translator (I’m not a writer, I’m not a reporter, I’m not a linguist, etc.).
KNOWLEDGE / KENNIN / HERITAGE
The translation KENNIN for KNOWLEDGE appears quite often in the UNESCO document.
KENNIN is also used for HERITAGE and RECOGNITION. This triple usage makes sense, but seems like a narrow use of vocabulary.
KEN-HOO
Here we have an example of the translator using the single lexical item KEN-HOO to translate KNOW-HOW.
Whilst it kind of makes sense, this is the first time I’ve come across this phrase in Scots. The Dictionary of the Scots Language has many occurrences of KNOW HOW as part of a phrase, as in knowing how to do something, but not as a single lexical item.
I have no other way of checking whether this usage has wider currency.
HAMELT / INDIGENOUS / DOMESTIC
Here we see HAMELT used as a synonym for DOMESTIC and INDIGENOUS.
Whilst the meaning is clear, I’ve always used HAMELT as an adjective like HOMELY - relating to an individual’s house, rather than a national or homeland meaning - maybe that’s just me.