by Adnane Guettaf - July 28, 2022
Eden AI aims to simplify the use and deployment of AI technologies by providing a single API that connects to the best possible AI engines. These engines can be used for different purposes, e.g. face detection, OCR (receipt, invoice, table, etc.), keyword extraction, sentiment analysis, and much more. Eden AI's main goal is to provide its clients with the AI engine best suited to their projects, in order to keep AI lightweight and easy for any developer. To do so, a lot of processing happens behind the scenes, so that users do not have to deal with provider constraints, including language constraints.
Even today, the way languages are encoded in some applications may be confusing. Different standardized tags or code representations of language names are published and available. However, the correspondence between different language names, especially when accompanied by different identification codes (e.g. English may be coded with "en" or "eng"), can be cumbersome for our customers to manage.
Therefore, in this article, we describe how language mapping is handled by our service. In other words, we explain how the matching between the input language provided by the user and the languages supported by the providers is performed to facilitate the user's access to one of our AI engines.
Language name standardization is the process of encoding, standardizing and maintaining language names according to international standards. A well-known organization that develops and publishes standards for all technical and nontechnical fields, except electrical and electronic engineering, is the International Organization for Standardization, also known by the acronym ISO. In the remainder of this section, we focus on the standards concerning language name codes.
Substituting the name of a language with an equivalent code (either 2 or 3 letters long) can be very useful for library catalogs and other bibliographic purposes, for information management, for identifying languages in computer systems, and for representing different language versions of web applications.
The ISO 639 standard is currently composed of five parts, each maintained by an agency that handles revisions, especially when new codes need to be added. A code defined in one part means the same thing in any other part; however, not all languages are available in all parts. Each of the five parts of ISO 639 is briefly described below:
- ISO 639-1. Alpha-2 code. It provides a two-letter language code element to identify the names of languages. It covers the world's major languages with 184 codes registered to this day and is suitable for computerized systems. E.g. "en" can be used to identify the English language, or "fr" to identify the French language.
- ISO 639-2. Alpha-3 code. Because ISO 639-1 could not accommodate a sufficient number of languages, ISO 639-2, which uses a three-letter identifier to represent the names of languages, provides more possible combinations and covers more languages.
- ISO 639-3. Alpha-3 code. With the same three-letter identifiers, ISO 639-3 extends ISO 639-1 and ISO 639-2 in order to cover all known natural languages. It provides an almost complete enumeration of languages with nearly 8000 entries and is designed to be used in a wide range of applications.
- ISO 639-4. General principles of language coding and guidelines for use. It gives general principles for language coding using the codes specified in the other parts of ISO 639 and their combinations with other codes. It also provides guidelines for using any combination of ISO 639 parts.
- ISO 639-5. Language families and groups. It enumerates approximately 115 collective codes that identify language families and groups. However, languages designed exclusively for machine use are not included in this part.
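To make the relationship between two-letter and three-letter codes concrete, here is a minimal sketch of a code conversion. The table is a tiny hand-written excerpt chosen for illustration (a real system would load the full ISO 639 registry or rely on a library), and the function names `to_alpha3` and `to_alpha2` are hypothetical.

```python
from typing import Optional

# Tiny hand-written excerpt of ISO 639-1 (alpha-2) -> ISO 639-3 (alpha-3) codes.
# Illustration only: the real standards contain far more entries.
ALPHA2_TO_ALPHA3 = {
    "en": "eng",  # English
    "fr": "fra",  # French
    "it": "ita",  # Italian
    "ar": "ara",  # Arabic
}
# Reverse lookup, built from the same excerpt.
ALPHA3_TO_ALPHA2 = {a3: a2 for a2, a3 in ALPHA2_TO_ALPHA3.items()}


def to_alpha3(code: str) -> Optional[str]:
    """Return the three-letter code for a two- or three-letter code, or None if unknown."""
    code = code.strip().lower()
    if code in ALPHA3_TO_ALPHA2:
        return code  # already a known three-letter code
    return ALPHA2_TO_ALPHA3.get(code)


def to_alpha2(code: str) -> Optional[str]:
    """Return the two-letter code for a two- or three-letter code, or None if unknown."""
    code = code.strip().lower()
    if code in ALPHA2_TO_ALPHA3:
        return code  # already a known two-letter code
    return ALPHA3_TO_ALPHA2.get(code)


print(to_alpha3("en"))   # eng
print(to_alpha2("ENG"))  # en
```

Returning `None` for unknown codes, rather than raising, reflects the fact that not every language has a code in every part of the standard.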
IETF BCP 47 defines standardized codes, or tags, used to identify human languages on the Internet. The tag structure is standardized by the Internet Engineering Task Force (IETF). BCP stands for "Best Current Practice", and BCP 47 is a persistent name for a series of RFCs that describe language tag syntax. The latest one is RFC 5646, which allows for a number of additional subtags. The subtags are managed in the IANA Language Subtag Registry and can be of various types, e.g. language-extlang-script-region-variant-extension-privateuse. Figure 2 below shows an example of tags.
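As an illustration of this tag structure, the sketch below splits a tag into its most common subtags. It is a deliberate simplification and not a full RFC 5646 parser: it only recognizes the language, script and region subtags, and keeps anything else (extlang, variants, extensions, private use) as unparsed leftovers. The names `LanguageTag` and `parse_tag` are hypothetical.

```python
import re
from typing import NamedTuple, Optional, Tuple

# Simplified pattern: language (2-3 letters), optional script (4 letters),
# optional region (2 letters or 3 digits), then any remaining subtags.
_TAG_RE = re.compile(
    r"^(?P<language>[A-Za-z]{2,3})"
    r"(?:-(?P<script>[A-Za-z]{4}))?"
    r"(?:-(?P<region>[A-Za-z]{2}|[0-9]{3}))?"
    r"(?P<rest>(?:-[A-Za-z0-9]{1,8})*)$"
)


class LanguageTag(NamedTuple):
    language: str
    script: Optional[str]
    region: Optional[str]
    rest: Tuple[str, ...]  # subtags this sketch does not interpret


def parse_tag(tag: str) -> LanguageTag:
    """Split a language tag into subtags, using conventional casing."""
    match = _TAG_RE.match(tag.strip().replace("_", "-"))
    if match is None:
        raise ValueError(f"not a well-formed tag for this sketch: {tag!r}")
    script = match["script"]
    region = match["region"]
    return LanguageTag(
        language=match["language"].lower(),
        script=script.title() if script else None,
        region=region.upper() if region else None,
        rest=tuple(part for part in match["rest"].split("-") if part),
    )


print(parse_tag("zh-Hant-TW"))
# LanguageTag(language='zh', script='Hant', region='TW', rest=())
```

Accepting underscores and mixed case on input is a convenience for user-supplied values such as "en_us"; the output always uses the conventional casing.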
Since Eden AI aims to make AI easy for developers, users relying on our API should not have to worry about the language constraints imposed by each of our vendors. Users simply select the language they wish to use, regardless of how they write the language name or which standard or tag format they use. The mapping between user input and provider constraints is done by our service.
To better illustrate the issue, we will present a simplified use case. A technology (we will consider invoice processing with OCR for the purpose of this example) is provided by three different vendors: A, B and C, each using a different set of language constraints. Vendor A handles processing for the following list of language tags: ["en-US", "fr-FR", "fr-CA"]. Vendor B handles the processing for ["it", "en-GB", "ceb-CB", "fr-BE"], and Vendor C handles the following list of language tags ["fr-FR", "it", "ar"].
All three providers are available through our service. A user called Mike wants to use invoice processing for an invoice written in English. However, Mike does not know whether the English used in the invoice is American English, British English or some other type of spoken English. For simplicity, Mike will try to process the document by simply specifying that it is written in English ("en").
Our service will match the language selected by Mike ("en" in this example) with the languages supported by each vendor, choosing the language tag that best matches Mike's choice even if there is no exact match. Vendor A will use American English ("en-US") and Vendor B will use British English ("en-GB") as the processing language, while Vendor C will not return any result because none of its supported languages matches Mike's request.
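The outcome of this use case can be reproduced with a short sketch. Note that this is not our production implementation: it uses a deliberately naive rule (exact match first, then the first supported tag sharing the same primary language subtag), and the names `VENDOR_LANGUAGES`, `naive_match` and `match_all_vendors` are hypothetical.

```python
from typing import Dict, Optional, Sequence

# Supported language tags of the three vendors from the example above.
VENDOR_LANGUAGES = {
    "A": ["en-US", "fr-FR", "fr-CA"],
    "B": ["it", "en-GB", "ceb-CB", "fr-BE"],
    "C": ["fr-FR", "it", "ar"],
}


def primary_subtag(tag: str) -> str:
    """Return the lowercase primary language subtag, e.g. 'en' for 'en-GB'."""
    return tag.replace("_", "-").split("-")[0].lower()


def naive_match(requested: str, supported: Sequence[str]) -> Optional[str]:
    """Pick a supported tag for the requested language, or None if nothing fits."""
    wanted = requested.replace("_", "-").lower()
    # 1. Prefer an exact (case-insensitive) match.
    for tag in supported:
        if tag.lower() == wanted:
            return tag
    # 2. Otherwise fall back to the first tag with the same primary language.
    for tag in supported:
        if primary_subtag(tag) == primary_subtag(requested):
            return tag
    return None


def match_all_vendors(requested: str) -> Dict[str, Optional[str]]:
    """Resolve the requested language against every vendor."""
    return {
        vendor: naive_match(requested, tags)
        for vendor, tags in VENDOR_LANGUAGES.items()
    }


print(match_all_vendors("en"))
# {'A': 'en-US', 'B': 'en-GB', 'C': None}
```

A rule this simple ignores how close two regional variants really are, which is one reason a dedicated language matching library is preferable in practice.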
In order to perform the appropriate language mapping between the user-selected language and the language tags supported by the provider, different approaches have been investigated and tested. We started by understanding how language names are standardized according to internationally approved standards. Then we explored current best practices for identifying human languages on the Internet and for web applications. We found third-party Python libraries that provide tools for handling ISO 639 language name codes. We list some of them:
- iso639-lang. Lightweight library that handles the ISO 639 series of international standards for language codes and makes it easy to switch from one language code to another. For example, it allows switching from a two-letter language identifier (ISO 639-1) to a three-letter language identifier (ISO 639-2 or ISO 639-3).
- pycountry. Library that provides ISO databases for each of the following standards: ISO 639-3 languages, ISO 3166 countries, ISO 3166-3 deleted countries, ISO 3166-2 subdivisions of countries, ISO 4217 currencies, and ISO 15924 scripts.
- langcodes. Provides intuitive and lightweight tools for handling language codes. The library offers multiple features, e.g. standardizing language tags, checking the validity of language tags, or getting demographic language data. One feature that comes in handy for language matching is the search for the best matching language: it lets you choose the correct language even if there is no exact match, and returns the tag of the best supported language.
For the design and implementation of our solution, we chose to rely on the latter library, as it allows us to select the best match between the languages supported by the provider and the input language provided by the user.
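The sketch below shows how such a match could look with langcodes. It assumes a version of the library that provides `closest_supported_match` (3.3 or later, to our knowledge); the wrapper function `resolve_language` is hypothetical and is not a description of our internal code.

```python
from typing import Optional, Sequence

import langcodes


def resolve_language(user_input: str, supported: Sequence[str]) -> Optional[str]:
    """Normalize the user's tag, then pick the closest tag the provider supports."""
    # standardize_tag accepts variants such as "eng" or "en_US" and returns a
    # normalized tag; it raises an error when the input is not well-formed.
    normalized = langcodes.standardize_tag(user_input)
    # closest_supported_match returns None when no supported tag is close enough.
    return langcodes.closest_supported_match(normalized, list(supported))


# Vendor A from the use case above, queried with a three-letter code.
print(resolve_language("eng", ["en-US", "fr-FR", "fr-CA"]))
```

Normalizing first means that users can write "en", "eng" or "en_US" and still be matched against the same provider list.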
It is important for us to respect users' preferences down to the smallest detail while managing the great complexity of languages (more than 200 languages are supported at the moment). It is equally important for Eden AI to let users benefit from all the possibilities offered by the various integrated providers.
We believe that in the future a truly robust standard will emerge and will be followed by all service providers. We are working to contribute to this at our own scale. Do not hesitate to contact us if you are interested in this topic.