Identify the language of text with ML Kit on iOS

You can use ML Kit to identify the language of a string of text. You can get the string's most likely language as well as confidence scores for all of the string's possible languages.

ML Kit recognizes text in more than 100 different languages in their native scripts. In addition, romanized text can be recognized for Arabic, Bulgarian, Chinese, Greek, Hindi, Japanese, and Russian. See the complete list of supported languages and scripts.

See the ML Kit quickstart sample on GitHub for an example of this API in use.

Before you begin

  1. Include the following ML Kit libraries in your Podfile:
    pod 'GoogleMLKit/LanguageID'
    
  2. After you install or update your project's Pods, open your Xcode project using its .xcworkspace. ML Kit is supported in Xcode version 11.3.1 or higher.

Identify the language of a string

To identify the language of a string, get an instance of LanguageIdentification, and then pass the string to the identifyLanguage(for:) method.

For example:

Swift

let languageId = NaturalLanguage.languageIdentification()

languageId.identifyLanguage(for: text) { (languageCode, error) in
  if let error = error {
    print("Failed with error: \(error)")
    return
  }
  if let languageCode = languageCode, languageCode != "und" {
    print("Identified Language: \(languageCode)")
  } else {
    print("No language was identified")
  }
}

Objective-C

MLKLanguageIdentification *languageId = [MLKLanguageIdentification languageIdentification];

[languageId identifyLanguageForText:text
                         completion:^(NSString * _Nullable languageCode,
                                      NSError * _Nullable error) {
                           if (error != nil) {
                             NSLog(@"Failed with error: %@", error.localizedDescription);
                             return;
                           }
                           if (![languageCode isEqualToString:@"und"] ) {
                             NSLog(@"Identified Language: %@", languageCode);
                           } else {
                             NSLog(@"No language was identified");
                           }
                         }];

If the call succeeds, a BCP-47 language code is passed to the completion handler, indicating the language of the text. If no language could be confidently detected, the code und (undetermined) is passed.

By default, ML Kit returns a non-und value only when it identifies the language with a confidence value of at least 0.5. You can change this threshold by passing a LanguageIdentificationOptions object to languageIdentification(options:):

Swift

let options = LanguageIdentificationOptions(confidenceThreshold: 0.4)
let languageId = NaturalLanguage.languageIdentification(options: options)

Objective-C

MLKLanguageIdentificationOptions *options =
    [[MLKLanguageIdentificationOptions alloc] initWithConfidenceThreshold:0.4];
MLKLanguageIdentification *languageId =
    [MLKLanguageIdentification languageIdentificationWithOptions:options];

Get the possible languages of a string

To get the confidence values of a string's most likely languages, get an instance of LanguageIdentification and then pass the string to the identifyPossibleLanguages(for:) method.

For example:

Swift

let languageId = NaturalLanguage.languageIdentification()

languageId.identifyPossibleLanguages(for: text) { (identifiedLanguages, error) in
  if let error = error {
    print("Failed with error: \(error)")
    return
  }
  guard let identifiedLanguages = identifiedLanguages,
    !identifiedLanguages.isEmpty,
    identifiedLanguages[0].languageCode != "und"
  else {
    print("No language was identified")
    return
  }

  print("Identified Languages:\n" +
    identifiedLanguages.map {
      String(format: "(%@, %.2f)", $0.languageCode, $0.confidence)
      }.joined(separator: "\n"))
}

Objective-C

MLKLanguageIdentification *languageId = [MLKLanguageIdentification languageIdentification];

[languageId identifyPossibleLanguagesForText:text
                                  completion:^(NSArray * _Nonnull identifiedLanguages,
                                               NSError * _Nullable error) {
  if (error != nil) {
    NSLog(@"Failed with error: %@", error.localizedDescription);
    return;
  }
  if (identifiedLanguages.count == 1
      && [identifiedLanguages[0].languageCode isEqualToString:@"und"] ) {
    NSLog(@"No language was identified");
    return;
  }
  NSMutableString *outputText = [NSMutableString stringWithFormat:@"Identified Languages:"];
  for (MLKIdentifiedLanguage *language in identifiedLanguages) {
    [outputText appendFormat:@"\n(%@, %.2f)", language.languageCode, language.confidence];
  }
  NSLog(outputText);
}];

If the call succeeds, a list of IdentifiedLanguage objects is passed to the continuation handler. From each object, you can get the language's BCP-47 code and the confidence that the string is in that language. Note that these values indicate the confidence that the entire string is in the given language; ML Kit doesn't identify multiple languages in a single string.

By default, ML Kit returns only languages with confidence values of at least 0.01. You can change this threshold by passing a LanguageIdentificationOptions object to languageIdentification(options:):

Swift

let options = LanguageIdentificationOptions(confidenceThreshold: 0.4)
let languageId = NaturalLanguage.languageIdentification(options: options)

Objective-C

MLKLanguageIdentificationOptions *options =
    [[MLKLanguageIdentificationOptions alloc] initWithConfidenceThreshold:0.4];
MLKLanguageIdentification *languageId =
    [MLKLanguageIdentification languageIdentificationWithOptions:options];

If no language meets this threshold, the list has one item, with the value und.

Next steps

See the ML Kit quickstart sample on GitHub for an example of this API in use.