Language identification

This functionality is in technical preview and may be changed or removed in a future release. Elastic will work to fix any issues, but features in technical preview are not subject to the support SLA of official GA features.

Language identification is a trained model that you can use to determine the language of text. You can reference the language identification model in an inference processor of an ingest pipeline by using its model ID (lang_ident_model_1). The input field name is text. If you want to run language identification on a field with a different name, you must map your field name to text in the ingest processor settings.
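As a rough illustration of the field mapping described above, the following Python sketch builds a simulate request body for a document whose text lives in a field called message instead of text. The helper name build_simulate_body and the field name message are hypothetical and chosen only for this example. The field_mappings key follows the example request later on this page; depending on your Elasticsearch version the option may be named differently (for example, field_map), so check the inference processor reference before relying on it.

```python
import json


def build_simulate_body(source_field, texts, num_top_classes=3):
    """Build a body for POST _ingest/pipeline/_simulate (hypothetical helper).

    source_field is the document field that holds the text to identify.
    """
    inference = {
        "model_id": "lang_ident_model_1",
        "inference_config": {
            "classification": {"num_top_classes": num_top_classes}
        },
        # Key: field name in the document. Value: input field the model expects.
        # Assumption: the option is called field_mappings, as in this page's example.
        "field_mappings": {source_field: "text"},
    }
    return {
        "pipeline": {"processors": [{"inference": inference}]},
        "docs": [{"_source": {source_field: text}} for text in texts],
    }


body = build_simulate_body("message", ["Sziasztok! Ez egy rövid magyar szöveg."])
print(json.dumps(body, ensure_ascii=False, indent=2))
```

The printed JSON can be sent as the body of a simulate pipeline request with any HTTP client. No request is made by the sketch itself.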

The longer the text passed into the language identification model, the more accurately the model can identify the language. For certain languages, it is fairly accurate even on short samples (for example, 50-character strings), but languages that are similar to each other are harder to tell apart based on a short sample.

Language identification takes into account Unicode boundaries when the feature set is built. If the text has diacritical marks, then the model uses that information for identifying the language of the text. In certain cases, the model can detect the source language even if it is not written in the script that the language traditionally uses. These languages are marked in the supported languages table (see below) with the Latn subtag. Language identification supports Unicode input.

Supported languages

The table below contains the ISO codes and the English names of the languages that language identification supports. If a language has a 2-letter ISO 639-1 code, the table contains that identifier. Otherwise, the 3-letter ISO 639-2 code is used. The ‘Latn’ subtag indicates that the language is transliterated into Latin script.

Code     Language             Code     Language             Code     Language
af       Afrikaans            hr       Croatian             pa       Punjabi
am       Amharic              ht       Haitian              pl       Polish
ar       Arabic               hu       Hungarian            ps       Pashto
az       Azerbaijani          hy       Armenian             pt       Portuguese
be       Belarusian           id       Indonesian           ro       Romanian
bg       Bulgarian            ig       Igbo                 ru       Russian
bg-Latn  Bulgarian            is       Icelandic            ru-Latn  Russian
bn       Bengali              it       Italian              sd       Sindhi
bs       Bosnian              iw       Hebrew               si       Sinhala
ca       Catalan              ja       Japanese             sk       Slovak
ceb      Cebuano              ja-Latn  Japanese             sl       Slovenian
co       Corsican             jv       Javanese             sm       Samoan
cs       Czech                ka       Georgian             sn       Shona
cy       Welsh                kk       Kazakh               so       Somali
da       Danish               km       Central Khmer        sq       Albanian
de       German               kn       Kannada              sr       Serbian
el       Greek, modern        ko       Korean               st       Southern Sotho
el-Latn  Greek, modern        ku       Kurdish              su       Sundanese
en       English              ky       Kirghiz              sv       Swedish
eo       Esperanto            la       Latin                sw       Swahili
es       Spanish, Castilian   lb       Luxembourgish        ta       Tamil
et       Estonian             lo       Lao                  te       Telugu
eu       Basque               lt       Lithuanian           tg       Tajik
fa       Persian              lv       Latvian              th       Thai
fi       Finnish              mg       Malagasy             tr       Turkish
fil      Filipino             mi       Maori                uk       Ukrainian
fr       French               mk       Macedonian           ur       Urdu
fy       Western Frisian      ml       Malayalam            uz       Uzbek
ga       Irish                mn       Mongolian            vi       Vietnamese
gd       Gaelic               mr       Marathi              xh       Xhosa
gl       Galician             ms       Malay                yi       Yiddish
gu       Gujarati             mt       Maltese              yo       Yoruba
ha       Hausa                my       Burmese              zh       Chinese
haw      Hawaiian             ne       Nepali               zh-Latn  Chinese
hi       Hindi                nl       Dutch, Flemish       zu       Zulu
hi-Latn  Hindi                no       Norwegian
hmn      Hmong                ny       Chichewa

Example of language identification

In the following example, we feed the language identification trained model a short Hungarian text that contains diacritics and a couple of English words. The model identifies the text correctly as Hungarian with high probability.

POST _ingest/pipeline/_simulate
{
   "pipeline":{
      "processors":[
         {
            "inference":{
               "model_id":"lang_ident_model_1", 
               "inference_config":{
                  "classification":{
                     "num_top_classes":5 
                  }
               },
               "field_mappings":{

               }
            }
         }
      ]
   },
   "docs":[
      {
         "_source":{ 
            "text":"Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz."
         }
      }
   ]
}

model_id: The ID of the language identification trained model.

num_top_classes: The number of classes (in this case, languages) to report. In this example, only the five languages with the highest probability are reported.

_source: The source object that contains the text to identify.

The request returns the following response:

{
  "docs" : [
    {
      "doc" : {
        "_index" : "_index",
        "_type" : "_doc",
        "_id" : "_id",
        "_source" : {
          "text" : "Sziasztok! Ez egy rövid magyar szöveg. Nézzük, vajon sikerül-e azonosítania a language identification funkciónak? Annak ellenére is sikerülni fog, hogy a szöveg két angol szót is tartalmaz.",
          "ml" : {
            "inference" : {
              "top_classes" : [ 
                {
                  "class_name" : "hu",
                  "class_probability" : 0.9999936063740517,
                  "class_score" : 0.9999936063740517
                },
                {
                  "class_name" : "lv",
                  "class_probability" : 2.5020248433413966E-6,
                  "class_score" : 2.5020248433413966E-6
                },
                {
                  "class_name" : "is",
                  "class_probability" : 1.0150420723037688E-6,
                  "class_score" : 1.0150420723037688E-6
                },
                {
                  "class_name" : "ga",
                  "class_probability" : 6.67935962773335E-7,
                  "class_score" : 6.67935962773335E-7
                },
                {
                  "class_name" : "tr",
                  "class_probability" : 5.591166324774555E-7,
                  "class_score" : 5.591166324774555E-7
                }
              ],
              "predicted_value" : "hu", 
              "model_id" : "lang_ident_model_1"
            }
          }
        },
        "_ingest" : {
          "timestamp" : "2020-01-22T14:25:14.644912Z"
        }
      }
    }
  ]
}

top_classes: Contains scores for the most probable languages. The number of reported languages is defined by num_top_classes.

predicted_value: The ISO identifier of the language with the highest probability.
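To show how the response might be consumed, the following Python sketch reads the top class for each simulated document and accepts it only if its probability reaches a threshold. The function name predicted_languages and the 0.5 default threshold are assumptions made for this illustration, not part of the API. The sample response is an abbreviated copy of the one shown above, and the sketch assumes num_top_classes is at least 1 so that top_classes is present.

```python
def predicted_languages(response, min_probability=0.5):
    """Return one language code per document, or None when the best
    candidate is missing or below min_probability (hypothetical helper)."""
    results = []
    for item in response["docs"]:
        inference = item["doc"]["_source"]["ml"]["inference"]
        top_classes = inference.get("top_classes", [])
        # Pick the candidate with the highest reported probability.
        best = max(top_classes, key=lambda c: c["class_probability"], default=None)
        if best is not None and best["class_probability"] >= min_probability:
            results.append(best["class_name"])
        else:
            results.append(None)
    return results


# Abbreviated copy of the response above (only the first two classes kept).
sample_response = {
    "docs": [
        {
            "doc": {
                "_source": {
                    "ml": {
                        "inference": {
                            "top_classes": [
                                {"class_name": "hu", "class_probability": 0.9999936063740517},
                                {"class_name": "lv", "class_probability": 2.5020248433413966e-06},
                            ],
                            "predicted_value": "hu",
                            "model_id": "lang_ident_model_1",
                        }
                    }
                }
            }
        }
    ]
}

print(predicted_languages(sample_response))
```

A threshold like this can be useful for short or mixed-language inputs, where the top probability tends to be lower. The right value depends on your data.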