Recognize text in images with ML Kit on iOS

You can use ML Kit to recognize text in images, such as the text of a street sign.

See the ML Kit quickstart sample on GitHub for an example of this API in use, or try the codelab.

Before you begin

  1. Include the following ML Kit libraries in your Podfile:
    pod 'GoogleMLKit/TextRecognition'
    
  2. After you install or update your project's Pods, open your Xcode project using its .xcworkspace. ML Kit is supported in Xcode version 11.3.1 or higher.

Now you are ready to start recognizing text in images.

Input image guidelines

  • For ML Kit to accurately recognize text, input images must contain text that is represented by sufficient pixel data. Ideally, each character should be at least 16x16 pixels. There is generally no accuracy benefit for characters to be larger than 24x24 pixels.

    So, for example, a 640x480 image might work well to scan a business card that occupies the full width of the image. To scan a document printed on letter-sized paper, a 720x1280 pixel image might be required.

  • Poor image focus can affect text recognition accuracy. If you aren't getting acceptable results, try asking the user to recapture the image.

  • If you are recognizing text in a real-time application, you should consider the overall dimensions of the input images. Smaller images can be processed faster. To reduce latency, ensure that the text occupies as much of the image as possible, and capture images at lower resolutions (keeping in mind the accuracy requirements mentioned above). For more information, see Tips to improve real-time performance.


Recognize text in images

To recognize text in an image, run the text recognizer as described below.

1. Prepare the input image

Pass the image as a UIImage or a CMSampleBufferRef to the TextRecognizer's process(_:completion:) method:

Create a VisionImage object using a UIImage or a CMSampleBufferRef.

If you use a UIImage, follow these steps:

  • Create a VisionImage object with the UIImage. Make sure to specify the correct .orientation.

    Swift

    let image = VisionImage(image: UIImage)
    visionImage.orientation = image.imageOrientation

    Objective-C

    MLKVisionImage *visionImage = [[MLKVisionImage alloc] initWithImage:image];
    visionImage.orientation = image.imageOrientation;

If you use a CMSampleBufferRef, follow these steps:

  • Specify the orientation of the image data contained in the CMSampleBufferRef buffer.

    To get the image orientation:

    Swift

    func imageOrientation(
      deviceOrientation: UIDeviceOrientation,
      cameraPosition: AVCaptureDevice.Position
    ) -> UIImage.Orientation {
      switch deviceOrientation {
      case .portrait:
        return cameraPosition == .front ? .leftMirrored : .right
      case .landscapeLeft:
        return cameraPosition == .front ? .downMirrored : .up
      case .portraitUpsideDown:
        return cameraPosition == .front ? .rightMirrored : .left
      case .landscapeRight:
        return cameraPosition == .front ? .upMirrored : .down
      case .faceDown, .faceUp, .unknown:
        return .up
      }
    }
          

    Objective-C

    - (UIImageOrientation)
      imageOrientationFromDeviceOrientation:(UIDeviceOrientation)deviceOrientation
                             cameraPosition:(AVCaptureDevicePosition)cameraPosition {
      switch (deviceOrientation) {
        case UIDeviceOrientationPortrait:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationLeftMirrored
                                                                : UIImageOrientationRight;
    
        case UIDeviceOrientationLandscapeLeft:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationDownMirrored
                                                                : UIImageOrientationUp;
        case UIDeviceOrientationPortraitUpsideDown:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationRightMirrored
                                                                : UIImageOrientationLeft;
        case UIDeviceOrientationLandscapeRight:
          return cameraPosition == AVCaptureDevicePositionFront ? UIImageOrientationUpMirrored
                                                                : UIImageOrientationDown;
        case UIDeviceOrientationUnknown:
        case UIDeviceOrientationFaceUp:
        case UIDeviceOrientationFaceDown:
          return UIImageOrientationUp;
      }
    }
          
  • Create a VisionImage object using the CMSampleBufferRef object and orientation:

    Swift

    let image = VisionImage(buffer: sampleBuffer)
    image.orientation = imageOrientation(
      deviceOrientation: UIDevice.current.orientation,
      cameraPosition: cameraPosition)

    Objective-C

     MLKVisionImage *image = [[MLKVisionImage alloc] initWithBuffer:sampleBuffer];
     image.orientation =
       [self imageOrientationFromDeviceOrientation:UIDevice.currentDevice.orientation
                                    cameraPosition:cameraPosition];

2. Get an instance of TextRecognizer

Get an instance of TextRecognizer by calling textRecognizer:

Swift

let textRecognizer = TextRecognizer.textRecognizer()
      

Objective-C

Objective-C

MLKTextRecognizer *textRecognizer = [MLKTextRecognizer textRecognizer];
      

3. Process the image

Then, pass the image to the process(_:completion:) method:

Swift

textRecognizer.process(visionImage) { result, error in
  guard error == nil, let result = result else {
    // Error handling
    return
  }
  // Recognized text
}

Objective-C

[textRecognizer processImage:image
                  completion:^(MLKText *_Nullable result,
                               NSError *_Nullable error) {
  if (error != nil || result == nil) {
    // Error handling
    return;
  }
  // Recognized text
}];

4. Extract text from blocks of recognized text

If the text recognition operation succeeds, it returns a Text object. A Text object contains the full text recognized in the image and zero or more TextBlock objects.

Each TextBlock represents a rectangular block of text, which contain zero or more TextLine objects. Each TextLine object contains zero or more TextElement objects, which represent words and word-like entities such as dates and numbers.

For each TextBlock, TextLine, and TextElement object, you can get the text recognized in the region and the bounding coordinates of the region.

For example:

Swift

let resultText = result.text
for block in result.blocks {
    let blockText = block.text
    let blockLanguages = block.recognizedLanguages
    let blockCornerPoints = block.cornerPoints
    let blockFrame = block.frame
    for line in block.lines {
        let lineText = line.text
        let lineLanguages = line.recognizedLanguages
        let lineCornerPoints = line.cornerPoints
        let lineFrame = line.frame
        for element in line.elements {
            let elementText = element.text
            let elementLanguages = element.recognizedLanguages
            let elementCornerPoints = element.cornerPoints
            let elementFrame = element.frame
        }
    }
}

Objective-C

NSString *resultText = result.text;
for (MLKTextBlock *block in result.blocks) {
  NSString *blockText = block.text;
  NSArray<MLKTextRecognizedLanguage *> *blockLanguages = block.recognizedLanguages;
  NSArray<NSValue *> *blockCornerPoints = block.cornerPoints;
  CGRect blockFrame = block.frame;
  for (MLKTextLine *line in block.lines) {
    NSString *lineText = line.text;
    NSArray<MLKTextRecognizedLanguage *> *lineLanguages = line.recognizedLanguages;
    NSArray<NSValue *> *lineCornerPoints = line.cornerPoints;
    CGRect lineFrame = line.frame;
    for (MLKTextElement *element in line.elements) {
      NSString *elementText = element.text;
      NSArray<MLKTextRecognizedLanguage *> *elementLanguages = element.recognizedLanguages;
      NSArray<NSValue *> *elementCornerPoints = element.cornerPoints;
      CGRect elementFrame = element.frame;
    }
  }
}

Tips to improve real-time performance

To recognize text in a real-time application, follow these guidelines to achieve the best framerates:

  • For processing video frames, use the results(in:) synchronous API of the text recognizer. Call this method from the AVCaptureVideoDataOutputSampleBufferDelegate's captureOutput(_, didOutput:from:) function to synchronously get results from the given video frame. Keep AVCaptureVideoDataOutput's alwaysDiscardsLateVideoFrames as true to throttle calls to the text recognizer. If a new video frame becomes available while the text recognizer is running, it will be dropped.
  • If you use the output of the text recognizer to overlay graphics on the input image, first get the result from ML Kit, then render the image and overlay in a single step. By doing so, you render to the display surface only once for each processed input frame. See the previewOverlayView and MLKDetectionOverlayView classes in the showcase sample app for an example.
  • Consider capturing images at a lower resolution. However, also keep in mind this API's image dimension requirements.

Next steps