Identify the language of text with ML Kit on Android

You can use ML Kit to identify the language of a string of text. You can get the string's most likely language as well as confidence scores for all of the string's possible languages.

ML Kit recognizes text in more than 100 different languages in their native scripts. In addition, romanized text can be recognized for Arabic, Bulgarian, Chinese, Greek, Hindi, Japanese, and Russian. See the complete list of supported languages and scripts.

There are two ways to integrate language identification: by bundling the model as part of your app, or by using an unbundled model that depends on Google Play Services. If you select the unbundled model, your app will be smaller. See the table below for details.

                     Bundled                                                Unbundled
Library name         com.google.mlkit:language-id                           com.google.android.gms:play-services-mlkit-language-id
Implementation       Model is statically linked to your app at build time.  Model is dynamically downloaded via Google Play Services.
App size impact      About 1.3 MB size increase.                            About 500 KB size increase.
Initialization time  Model is available immediately.                        Might have to wait for model to download before first use.
API lifecycle stage  General Availability (GA)                              Beta

  • Play around with the sample app to see an example usage of this API.

Before you begin

  1. In your project-level build.gradle file, make sure to include Google's Maven repository in both your buildscript and allprojects sections.
  2. Add the dependencies for the ML Kit Android libraries to your module's app-level Gradle file, which is usually app/build.gradle. Choose one of the following dependencies based on your needs:

    For bundling the model with your app:

        dependencies {
          // ...
          // Use this dependency to bundle the model with your app
          implementation 'com.google.mlkit:language-id:17.0.1'
        }
        

    For using the model in Google Play Services:

        dependencies {
          // ...
          // Use this dependency to use the dynamically downloaded model in Google Play Services
          implementation 'com.google.android.gms:play-services-mlkit-language-id:16.0.0-beta2'
        }
        
  3. If you choose to use the model in Google Play Services, you can configure your app to automatically download the model to the device after your app is installed from the Play Store. To do so, add the following declaration to your app's AndroidManifest.xml file:

          <application ...>
              ...
              <meta-data
                  android:name="com.google.mlkit.vision.DEPENDENCIES"
                  android:value="langid" />
              <!-- To use multiple models: android:value="langid,model2,model3" -->
          </application>
        
    If you don't enable install-time model downloads, the model is downloaded the first time you run the identifier. Requests you make before the download has completed produce no results.

Identify the language of a string

To identify the language of a string, call LanguageIdentification.getClient() to get an instance of LanguageIdentifier, and then pass the string to the identifyLanguage() method of LanguageIdentifier.

For example:

Kotlin

val languageIdentifier = LanguageIdentification.getClient()
languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener { languageCode ->
            if (languageCode == "und") {
                Log.i(TAG, "Can't identify language.")
            } else {
                Log.i(TAG, "Language: $languageCode")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }

Java

LanguageIdentifier languageIdentifier =
        LanguageIdentification.getClient();
languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener(
                new OnSuccessListener<String>() {
                    @Override
                    public void onSuccess(@Nullable String languageCode) {
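                        // Constant-first comparison avoids a NullPointerException,
                        // because languageCode is declared @Nullable.
                        if ("und".equals(languageCode)) {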
                            Log.i(TAG, "Can't identify language.");
                        } else {
                            Log.i(TAG, "Language: " + languageCode);
                        }
                    }
                })
        .addOnFailureListener(
                new OnFailureListener() {
                    @Override
                    public void onFailure(@NonNull Exception e) {
                        // Model couldn’t be loaded or other internal error.
                        // ...
                    }
                });

If the call succeeds, a BCP-47 language code is passed to the success listener, indicating the language of the text. If no language is confidently detected, the code und (undetermined) is passed.
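
Because the success listener receives a BCP-47 code rather than a readable name, an app will often want to map it to something displayable. The following is a minimal sketch of one way to do that with the standard java.util.Locale class. The class name LanguageNames, the method describe(), and the "Unknown" label are hypothetical and are not part of ML Kit.

```java
import java.util.Locale;

/** Hypothetical helper for illustration; not part of ML Kit. */
final class LanguageNames {
    private LanguageNames() {}

    /** Returns an English display name for a BCP-47 code, or "Unknown" for "und". */
    static String describe(String languageCode) {
        // "und" is the code passed when no language is confidently detected.
        if (languageCode == null || "und".equals(languageCode)) {
            return "Unknown";
        }
        // Locale.forLanguageTag parses BCP-47 tags such as "en" or "zh-Latn".
        String name = Locale.forLanguageTag(languageCode).getDisplayLanguage(Locale.ENGLISH);
        // An ill-formed tag yields an empty language, so fall back to the raw code.
        return name.isEmpty() ? languageCode : name;
    }

    public static void main(String[] args) {
        System.out.println(describe("fr"));
        System.out.println(describe("und"));
    }
}
```

In an app, describe(languageCode) could be called from inside the success listener in place of logging the raw code.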

By default, ML Kit returns a value other than und only when it identifies the language with a confidence value of at least 0.5. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():

Kotlin

val languageIdentifier = LanguageIdentification
        .getClient(LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.34f)
                .build())

Java

LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
        new LanguageIdentificationOptions.Builder()
                .setConfidenceThreshold(0.34f)
                .build());

Get the possible languages of a string

To get the confidence values of a string's most likely languages, get an instance of LanguageIdentifier and then pass the string to the identifyPossibleLanguages() method.

For example:

Kotlin

val languageIdentifier = LanguageIdentification.getClient()
languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener { identifiedLanguages ->
            for (identifiedLanguage in identifiedLanguages) {
                val language = identifiedLanguage.languageTag
                val confidence = identifiedLanguage.confidence
                Log.i(TAG, "$language ($confidence)")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }

Java

LanguageIdentifier languageIdentifier =
        LanguageIdentification.getClient();
languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener(new OnSuccessListener<List<IdentifiedLanguage>>() {
            @Override
            public void onSuccess(List<IdentifiedLanguage> identifiedLanguages) {
                for (IdentifiedLanguage identifiedLanguage : identifiedLanguages) {
                    String language = identifiedLanguage.getLanguageTag();
                    float confidence = identifiedLanguage.getConfidence();
                    Log.i(TAG, language + " (" + confidence + ")");
                }
            }
        })
        .addOnFailureListener(
                new OnFailureListener() {
                    @Override
                    public void onFailure(@NonNull Exception e) {
                        // Model couldn’t be loaded or other internal error.
                        // ...
                    }
                });

If the call succeeds, a list of IdentifiedLanguage objects is passed to the success listener. From each object, you can get the language's BCP-47 code and the confidence that the string is in that language. Note that these values indicate the confidence that the entire string is in the given language; ML Kit doesn't identify multiple languages in a single string.
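
Since each entry scores the whole string, a common follow-up is to pick the single best candidate from the list. The sketch below shows one way to do that in plain Java. Candidate is a hypothetical stand-in for IdentifiedLanguage so that the example runs without Android, and mostLikely() is an illustrative name. The sketch makes no assumption about the order in which ML Kit returns the list.

```java
import java.util.Arrays;
import java.util.List;

/** Hypothetical helper for illustration; not part of ML Kit. */
final class TopLanguage {
    private TopLanguage() {}

    /** Stand-in for IdentifiedLanguage: a BCP-47 tag plus a confidence value. */
    static final class Candidate {
        final String languageTag;
        final float confidence;

        Candidate(String languageTag, float confidence) {
            this.languageTag = languageTag;
            this.confidence = confidence;
        }
    }

    /** Returns the highest-confidence candidate, or null if the list is empty. */
    static Candidate mostLikely(List<Candidate> candidates) {
        Candidate best = null;
        for (Candidate candidate : candidates) {
            // Keep the first candidate seen, then replace it only on a strictly higher score.
            if (best == null || candidate.confidence > best.confidence) {
                best = candidate;
            }
        }
        return best;
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("en", 0.12f),
                new Candidate("fr", 0.80f),
                new Candidate("es", 0.05f));
        System.out.println(mostLikely(candidates).languageTag); // prints "fr"
    }
}
```

With the real API, the same loop would read getLanguageTag() and getConfidence() from each IdentifiedLanguage.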

By default, ML Kit returns only languages with confidence values of at least 0.01. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():

Kotlin

val languageIdentifier = LanguageIdentification
      .getClient(LanguageIdentificationOptions.Builder()
              .setConfidenceThreshold(0.5f)
              .build())

Java

LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
      new LanguageIdentificationOptions.Builder()
              .setConfidenceThreshold(0.5f)
              .build());

If no language meets this threshold, the list has one item, with the value und.
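
Callers may want to check for this single-item und result before using the list. The following minimal sketch shows such a check on a list of language tags, assuming the tags have already been read from the IdentifiedLanguage objects. The class UndeterminedCheck and the method isUndetermined() are hypothetical names, not ML Kit APIs.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

/** Hypothetical helper for illustration; not part of ML Kit. */
final class UndeterminedCheck {
    private UndeterminedCheck() {}

    /** The code reported when no language meets the confidence threshold. */
    static final String UNDETERMINED = "und";

    /** Returns true only when the list holds exactly one tag and that tag is "und". */
    static boolean isUndetermined(List<String> languageTags) {
        return languageTags.size() == 1 && UNDETERMINED.equals(languageTags.get(0));
    }

    public static void main(String[] args) {
        System.out.println(isUndetermined(Arrays.asList("und")));          // prints "true"
        System.out.println(isUndetermined(Arrays.asList("en", "fr")));     // prints "false"
        System.out.println(isUndetermined(Collections.<String>emptyList())); // prints "false"
    }
}
```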