You can use ML Kit to identify the language of a string of text. You can get the string's most likely language as well as confidence scores for all of the string's possible languages.
ML Kit recognizes text in more than 100 different languages in their native scripts. In addition, romanized text can be recognized for Arabic, Bulgarian, Chinese, Greek, Hindi, Japanese, and Russian. See the complete list of supported languages and scripts.
See the ML Kit quickstart sample on GitHub for an example of this API in use.
Before you begin
- In your project-level build.gradle file, make sure to include Google's Maven repository in both your buildscript and allprojects sections.
- Add the dependencies for the ML Kit Android libraries to your module's app-level gradle file, which is usually app/build.gradle:
    dependencies {
      // ...
      implementation 'com.google.mlkit:language-id:16.1.1'
    }
Identify the language of a string
To identify the language of a string, call LanguageIdentification.getClient() to get an instance of LanguageIdentifier, and then pass the string to the identifyLanguage() method of LanguageIdentifier.
For example:
Kotlin
    val languageIdentifier = LanguageIdentification.getClient()
    languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener { languageCode ->
            if (languageCode == "und") {
                Log.i(TAG, "Can't identify language.")
            } else {
                Log.i(TAG, "Language: $languageCode")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }
Java
    LanguageIdentifier languageIdentifier = LanguageIdentification.getClient();
    languageIdentifier.identifyLanguage(text)
        .addOnSuccessListener(
            new OnSuccessListener<String>() {
                @Override
                public void onSuccess(@Nullable String languageCode) {
                    // Compare from the constant side: the parameter is @Nullable.
                    if ("und".equals(languageCode)) {
                        Log.i(TAG, "Can't identify language.");
                    } else {
                        Log.i(TAG, "Language: " + languageCode);
                    }
                }
            })
        .addOnFailureListener(
            new OnFailureListener() {
                @Override
                public void onFailure(@NonNull Exception e) {
                    // Model couldn’t be loaded or other internal error.
                    // ...
                }
            });
If the call succeeds, a BCP-47 language code is passed to the success listener, indicating the language of the text. If no language is confidently detected, the code und (undetermined) is passed.
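Once the success listener has the code, an app usually needs to turn it into something readable. The following is a minimal sketch of one way to do that with the JDK's Locale class. The LanguageLabels class and its describe() method are hypothetical names used for illustration and are not part of ML Kit.

```java
import java.util.Locale;

// Hypothetical helper (not an ML Kit class): turns the BCP-47 code passed to
// the success listener into a label suitable for logging or display.
final class LanguageLabels {
    // The code that is passed when no language is confidently detected.
    static final String UNDETERMINED = "und";

    static String describe(String languageCode) {
        // Treat a null code the same way as "und".
        if (languageCode == null || UNDETERMINED.equals(languageCode)) {
            return "Undetermined";
        }
        // Locale.forLanguageTag parses a BCP-47 tag; getDisplayLanguage
        // returns the English name of its language subtag when one is known.
        String name = Locale.forLanguageTag(languageCode).getDisplayLanguage(Locale.ENGLISH);
        // Fall back to the raw code if the JDK has no name for it.
        return name.isEmpty() ? languageCode : name;
    }

    public static void main(String[] args) {
        System.out.println(describe("fr"));
        System.out.println(describe("und"));
    }
}
```

In an app, describe(languageCode) could be called from inside onSuccess() in place of the raw log statement.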
By default, ML Kit returns a value other than und only when it identifies the language with a confidence value of at least 0.5. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():
Kotlin
    val languageIdentifier = LanguageIdentification
        .getClient(LanguageIdentificationOptions.Builder()
            .setConfidenceThreshold(0.34f)
            .build())
Java
    LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
        new LanguageIdentificationOptions.Builder()
            .setConfidenceThreshold(0.34f)
            .build());
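To make the effect of the threshold concrete, the sketch below models the rule described above: the top language is reported only if its confidence is at least the threshold, and und is reported otherwise. It illustrates the documented behavior and is not ML Kit's internal implementation. ThresholdSketch and resolve() are hypothetical names, and the "at least" comparison is taken from the wording of the text.

```java
// Hypothetical model of the documented threshold rule (not ML Kit code).
final class ThresholdSketch {
    // Default threshold stated in the text above for identifyLanguage().
    static final float DEFAULT_THRESHOLD = 0.5f;

    // Returns the top language only when its confidence is at least the
    // threshold; otherwise returns "und" (undetermined).
    static String resolve(String topLanguage, float topConfidence, float threshold) {
        return topConfidence >= threshold ? topLanguage : "und";
    }

    public static void main(String[] args) {
        // The same guess, evaluated under the default and a lowered threshold.
        System.out.println(resolve("es", 0.4f, DEFAULT_THRESHOLD));
        System.out.println(resolve("es", 0.4f, 0.34f));
    }
}
```

Under this model, a guess with confidence 0.4 is reported as und at the default threshold but as its own code once the threshold is lowered to 0.34. Lowering the threshold trades fewer und results for more low-confidence answers.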
Get the possible languages of a string
To get the confidence values of a string's most likely languages, get an instance of LanguageIdentifier and then pass the string to the identifyPossibleLanguages() method.
For example:
Kotlin
    val languageIdentifier = LanguageIdentification.getClient()
    languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener { identifiedLanguages ->
            for (identifiedLanguage in identifiedLanguages) {
                val language = identifiedLanguage.languageTag
                val confidence = identifiedLanguage.confidence
                Log.i(TAG, "$language $confidence")
            }
        }
        .addOnFailureListener {
            // Model couldn’t be loaded or other internal error.
            // ...
        }
Java
    LanguageIdentifier languageIdentifier = LanguageIdentification.getClient();
    languageIdentifier.identifyPossibleLanguages(text)
        .addOnSuccessListener(new OnSuccessListener<List<IdentifiedLanguage>>() {
            @Override
            public void onSuccess(List<IdentifiedLanguage> identifiedLanguages) {
                for (IdentifiedLanguage identifiedLanguage : identifiedLanguages) {
                    String language = identifiedLanguage.getLanguageTag();
                    float confidence = identifiedLanguage.getConfidence();
                    Log.i(TAG, language + " (" + confidence + ")");
                }
            }
        })
        .addOnFailureListener(
            new OnFailureListener() {
                @Override
                public void onFailure(@NonNull Exception e) {
                    // Model couldn’t be loaded or other internal error.
                    // ...
                }
            });
If the call succeeds, a list of IdentifiedLanguage objects is passed to the success listener. From each object, you can get the language's BCP-47 code and the confidence that the string is in that language. Note that these values indicate the confidence that the entire string is in the given language; ML Kit doesn't identify multiple languages in a single string.
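A common follow-up is to pick the single most likely candidate from the list. The sketch below does this without assuming the list arrives in any particular order. To stay runnable outside Android, it uses a hypothetical Candidate class that mirrors the getLanguageTag() and getConfidence() accessors shown above. PossibleLanguages and mostLikely() are also illustrative names, not ML Kit APIs.

```java
import java.util.Arrays;
import java.util.Comparator;
import java.util.List;
import java.util.Optional;

// Hypothetical sketch (not ML Kit code): choose the highest-confidence entry.
final class PossibleLanguages {
    // Stand-in for IdentifiedLanguage, exposing the same two accessors.
    static final class Candidate {
        private final String languageTag;
        private final float confidence;

        Candidate(String languageTag, float confidence) {
            this.languageTag = languageTag;
            this.confidence = confidence;
        }

        String getLanguageTag() { return languageTag; }
        float getConfidence() { return confidence; }
    }

    // Returns the candidate with the greatest confidence, or an empty
    // Optional when the list itself is empty.
    static Optional<Candidate> mostLikely(List<Candidate> candidates) {
        return candidates.stream()
                .max(Comparator.comparingDouble(Candidate::getConfidence));
    }

    public static void main(String[] args) {
        List<Candidate> candidates = Arrays.asList(
                new Candidate("en", 0.2f),
                new Candidate("de", 0.7f));
        mostLikely(candidates)
                .ifPresent(c -> System.out.println(c.getLanguageTag()));
    }
}
```

Because each confidence applies to the whole string, the top candidate is a guess about the string as a whole, not about any part of it.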
By default, ML Kit returns only languages with confidence values of at least 0.01. You can change this threshold by passing a LanguageIdentificationOptions object to getClient():
Kotlin
    val languageIdentifier = LanguageIdentification
        .getClient(LanguageIdentificationOptions.Builder()
            .setConfidenceThreshold(0.5f)
            .build())
Java
    LanguageIdentifier languageIdentifier = LanguageIdentification.getClient(
        new LanguageIdentificationOptions.Builder()
            .setConfidenceThreshold(0.5f)
            .build());
If no language meets this threshold, the list has one item, with the value und.
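Callers may want to detect this case before using the results. The sketch below checks for a one-item list whose only tag is und. It operates on plain language tags, which a caller would first extract with getLanguageTag(). UndeterminedCheck and isUndetermined() are hypothetical names for illustration.

```java
import java.util.Arrays;
import java.util.Collections;
import java.util.List;

// Hypothetical helper (not ML Kit code): recognizes the "no language met the
// threshold" result described above from a list of language tags.
final class UndeterminedCheck {
    // True only for a list with exactly one item whose tag is "und".
    static boolean isUndetermined(List<String> languageTags) {
        return languageTags.size() == 1 && "und".equals(languageTags.get(0));
    }

    public static void main(String[] args) {
        System.out.println(isUndetermined(Collections.singletonList("und")));
        System.out.println(isUndetermined(Arrays.asList("en", "de")));
    }
}
```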