Monday, September 3, 2018

A Country Without a Language: Constructing and using language tags

Tír gan teanga, tír gan anam (a country without a language is a country without a soul).
One of the tools that everyone who develops web applications or writes web pages uses is Best Current Practice 47 (BCP 47), which defines what's commonly called a language tag (langtag) or locale code. BCP 47, published by the Internet Engineering Task Force (IETF), sets out how to specify a language using internationally standardized codes.

Since every engineer working with web technology should be using this specification, it helps to have a solid understanding of what is a somewhat arcane and sometimes confusing document. What follows is a simplified explanation aimed at just that.

The Basics

The langtag is constructed by combining a language code, called the "language subtag", with optional, more specific subtags for the script used, the region in which the localization is used, and the variant, each separated by a hyphen. The language code typically comes from ISO 639-1:2002, Codes for the representation of names of languages -- Part 1: Alpha-2 code; languages with no two-letter code use a three-letter code from a later part of ISO 639. Although there are additional subtags, these four are more than enough to denote the language used. In fact, the common practice within BCP 47 is to compose a langtag that is only verbose enough to uniquely identify the language, and typically langtags are composed of only the language and region subtags. Let's see how this works using a few examples.
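The composition rule above can be sketched in a few lines of Python. This is only an illustration, not part of BCP 47 itself: the function name `make_langtag` and its parameters are made up for this post, and nothing is checked against the IANA registry. Langtags are case-insensitive, but the sketch applies the casing convention recommended in the specification (lowercase language, title-case script, uppercase region, lowercase variants).

```python
def make_langtag(language, script=None, region=None, variants=()):
    """Join subtags with hyphens in the order language-script-region-variant.

    Hypothetical helper for illustration; it does not validate the subtags.
    """
    parts = [language.lower()]
    if script:
        parts.append(script.title())   # e.g. "latn" -> "Latn"
    if region:
        parts.append(region.upper())   # e.g. "us" -> "US"
    parts.extend(v.lower() for v in variants)
    return "-".join(parts)


print(make_langtag("en"))                # en
print(make_langtag("en", region="us"))   # en-US
print(make_langtag("ga", "latn", "ie"))  # ga-Latn-IE
```

Leaving out the optional arguments gives the shortest tag, which matches the "only as verbose as necessary" practice described above.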

A Basic Langtag

Although the most basic langtag is one that consists solely of the language code, e.g., "en" for English, the most common langtag pattern is a language code followed by a region. Since George Bernard Shaw famously said that "England and America are two countries divided by a common language", we'll use this as an example and construct basic langtags which show that, although the script used is the same, English as written and spoken in the US is different from that used in England.

The "common language" is identified by its ISO language code, "en", and each region by its ISO 3166-1 country code: "US" for the United States and "GB" for the United Kingdom (there is no separate country code for England alone). Combining these two basic subtags results in two langtags, "en-US" and "en-GB", for the US and England, respectively.

Of course, we could have added the script subtag to each of these langtags to identify the alphabet used, but since the script is the same in GB English as it is in US English, it would add no meaning, so it is left out. In fact, the script subtag is relatively rare. To see how it works, we'll look at a language that can be written in different scripts, or alphabet sets: Irish (or Gaeilge).

A Different Script

Advert in Gaeilge using the Latin Gaelic alphabet
There is a relatively famous advertisement for Guinness Stout that carries the slogan "Ní féidir an dubh cur ina bhán air", which is written in Irish, or Gaeilge. The Irish language is usually written today using a Latin alphabet, much like English, with liberal use of accented vowels like the one in the word "Ní". Prior to the middle of the 20th century, however, lenition was marked with a dot above the letter rather than with a following "h". This practice means the same phrase in the older Gaelic variant of the Latin script, Latg, would be "Ní féidir an duḃ cur ina ḃán air", as shown in the advertisement pictured.

This difference would give us two langtags: "ga-Latn-IE" and "ga-Latg-IE" (langtags are case-insensitive, but by convention the script subtag is written in title case). Since the standard written form of Gaeilge now uses the Latn script, and we want to be only verbose enough to identify the language, the "ga-Latn-IE" langtag would commonly be shortened to "ga-IE".
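As a rough sketch of that shortening step, the Python below drops a script subtag when it matches the script assumed by default for the language. The `DEFAULT_SCRIPT` table and the `shorten` function are hypothetical and hand-written for this post; in practice the Suppress-Script field of the IANA language subtag registry plays this role. The sketch also assumes a simple tag in which the script, if present, is the second subtag.

```python
# Hand-written for illustration; NOT loaded from the IANA registry.
DEFAULT_SCRIPT = {"en": "Latn", "ga": "Latn"}


def shorten(tag):
    """Drop the script subtag when it is the assumed default for the language."""
    parts = tag.split("-")
    language = parts[0].lower()
    # In a simple tag, a script subtag is the second subtag and is exactly
    # four letters (four-character variants must start with a digit).
    if len(parts) > 1 and len(parts[1]) == 4 and parts[1].isalpha():
        if parts[1].title() == DEFAULT_SCRIPT.get(language):
            del parts[1]
    return "-".join(parts)


print(shorten("ga-Latn-IE"))  # ga-IE
print(shorten("ga-Latg-IE"))  # ga-Latg-IE
```

The Latg tag is left alone because the script there carries real information, which is exactly the case where the extra subtag earns its place.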

A Variant or Two

Now let's turn our attention to the variant subtag.

Variants are seldom used in common practice: there are only around 100 registered with the IANA, and any single variant typically has few regular users. Variants are often used to denote archaic uses and intermingled languages like the mix of English and Spanish commonly called "Spanglish".

If we turn our attention back to the langtag for the US, we might also want to include a regional variant for the Northeast or Southern US, especially given the differences in second-person word choice (where the common choice for the second-person pronoun is "you" for both singular and plural, in the southern vernacular the singular and plural are "y'all" and "all y'all", respectively). Subregional variants such as this are quite common in speech, even if they do not reach the status of a dialect. They are not often registered with the IANA, however, and registration is a requirement for a variant to be used as a subtag.

One exception to the pattern of unregistered variants is Boontling, a variant of English that is tied to Boonville, California. Since a variant subtag for Boontling, "boont", is listed in the IANA language subtag registry as a variant of English, its langtag would be "en-boont", or "en-US-boont" or "en-Latn-US-boont" if you wanted the more verbose forms, which we don't.

It's also possible for a langtag to have multiple variants. The only example I know of for this would be a variant of English spoken in Scotland that uses the variant subtag "scotland", combined with the "ulster" variant for the form spoken in Ulster, Northern Ireland, which would make the langtag "en-scotland-ulster" or "en-GB-scotland-ulster". Note, though, that the registry lists "ulster" with the prefix "sco" (Scots, the language typically referred to in Ulster as Braid Scots or Ullans), so "sco-ulster" is the combination the registry itself recommends.
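To see how the four subtags discussed so far sit inside a tag, here is a minimal Python sketch that splits a simple langtag back into its parts. It only checks the shape of each subtag (length and character class) as I understand the BCP 47 grammar; it does not consult the IANA registry, and it deliberately ignores the additional subtag types (extended language, extensions, private use). The names `LANGTAG` and `parse_langtag` are made up for this example.

```python
import re

# Shape only: 2-3 letter language, optional 4-letter script, optional region
# (2 letters or 3 digits), then any number of variants (5-8 alphanumerics,
# or 4 characters starting with a digit).
LANGTAG = re.compile(
    r"^(?P<language>[a-z]{2,3})"
    r"(?:-(?P<script>[a-z]{4}))?"
    r"(?:-(?P<region>[a-z]{2}|[0-9]{3}))?"
    r"(?P<variants>(?:-(?:[a-z0-9]{5,8}|[0-9][a-z0-9]{3}))*)$",
    re.IGNORECASE,
)


def parse_langtag(tag):
    """Split a simple langtag into language, script, region and variants."""
    match = LANGTAG.match(tag)
    if match is None:
        raise ValueError(f"not a simple langtag: {tag!r}")
    script = match.group("script")
    region = match.group("region")
    variants = [v.lower() for v in match.group("variants").split("-") if v]
    return {
        "language": match.group("language").lower(),
        "script": script.title() if script else None,
        "region": region.upper() if region else None,
        "variants": variants,
    }


print(parse_langtag("en-latn-US-boont"))
# {'language': 'en', 'script': 'Latn', 'region': 'US', 'variants': ['boont']}
print(parse_langtag("en-GB-scotland-ulster")["variants"])
# ['scotland', 'ulster']
```

Because the pattern knows nothing about the registry, it will happily accept a well-formed tag built from unregistered subtags; a real validator would need the registry data as well.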

BCP 47 Implementation

The way the specification is written can make implementing langtags a little confusing. Sometimes variants are widely used enough that they become regional, and sometimes variants even become recognized as their own language. One instance of this is the two primary variants of Norwegian, Norwegian Bokmål and Norwegian Nynorsk. Although linguistically these are two variants or dialects, the ISO considers them languages in themselves, which means there are three valid language subtags that can be used to construct langtags for Norwegian in Norway: "no-NO", representing Norwegian in Norway; "nb-NO", representing Norwegian Bokmål in Norway; and "nn-NO", representing Norwegian Nynorsk in Norway.

Difficulties like this aside, however, one of the rules of accessibility (a11y), under the "understandable" principle, requires us to declare a document's language by putting a langtag in the lang attribute. The inclusion of the language allows assistive technology, like screen readers, to announce words and phrases properly and allows user agents to offer dynamic translation.

As anyone who has read authors that sprinkle phrases in multiple languages throughout their work knows, even though a root document has a language specified, there may be portions in other languages. Those portions also need to be spoken correctly and the user may benefit from dynamic translation of them as well. To help with this process, the folks writing the HTML spec made the lang attribute a global attribute, not just an attribute on the document, meaning it can be applied to any HTML element.
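As a quick illustration, the Python sketch below scans a scrap of HTML and lists every element that declares a language, using the standard library's `html.parser` module. The sample page and the names `LangCollector` and `declared_languages` are invented for this post; the markup simply shows a document-level lang attribute plus one inline phrase marked as Irish.

```python
from html.parser import HTMLParser


class LangCollector(HTMLParser):
    """Record the lang attribute of every element that declares one."""

    def __init__(self):
        super().__init__()
        self.langs = []  # (tag name, lang value) pairs in document order

    def handle_starttag(self, tag, attrs):
        lang = dict(attrs).get("lang")
        if lang:
            self.langs.append((tag, lang))


def declared_languages(html):
    """Return the (element, langtag) pairs found in an HTML string."""
    collector = LangCollector()
    collector.feed(html)
    collector.close()
    return collector.langs


page = """<html lang="en-US"><body>
<p>The proverb <span lang="ga-IE">Tír gan teanga, tír gan anam</span> is Irish.</p>
</body></html>"""

print(declared_languages(page))
# [('html', 'en-US'), ('span', 'ga-IE')]
```

An empty result for a page is a hint that the document language has not been declared at all.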

So, if you're concerned about the usability of your pages, include the langtag on the document (e.g., <html lang="en-US">) and anywhere else it's appropriate. Even if you're not concerned about general usability, adding the langtag to the document, and to any passages in another language, will help you meet the accessibility guidelines (WCAG 2.1, Guideline 3.1, Success Criteria 3.1.1 and 3.1.2) - and we all want that.

Happy coding.
