Supported file types for text extraction

Cloud Search indexes every item that is sent to it, regardless of file type (MIME or content type). It always indexes a file's metadata and, if the file type is supported, its content. Content indexing is supported for the following file types:

  • Microsoft Word (DOC)
  • Microsoft Word (DOCX)
  • Microsoft Excel (XLS)
  • Microsoft Excel (XLSX)
  • Microsoft PowerPoint (PPT)
  • Microsoft PowerPoint (PPTX)
  • Adobe’s Portable Document Format (PDF)
  • Rich Text Format (RTF)
  • Text Format (TXT)
  • Hypertext Markup Language (HTML)
  • Extensible Markup Language (XML)

Additionally, Google Cloud Search uses Optical Character Recognition (OCR) to extract text from the following file types:

  • Joint Photographic Experts Group (JPG)
  • Graphics Interchange Format (GIF)
  • Tagged Image File Format (TIFF)
  • Scalable Vector Graphics (SVG)
  • PostScript Image Format (PS)

Google Cloud Search uses OCR only on files that are 50 MB or smaller.

In addition to these file types, Cloud Search indexes the content of any plain text file.
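
The rules above can be summarized as a small local pre-check, for example to estimate before upload whether a file is likely to have its content indexed. The following is a minimal sketch, not part of the Cloud Search API: the function name `expected_extraction`, the return labels, and the extension-based lookup are all hypothetical, and it assumes that "50 MB" means 50 * 1024 * 1024 bytes. It only recognizes the extensions listed on this page, so it does not detect plain text files with other extensions or common variants such as `.jpeg`.

```python
from pathlib import Path

# Extensions taken from the lists on this page; the lookup itself is illustrative.
CONTENT_EXTENSIONS = {
    ".doc", ".docx", ".xls", ".xlsx", ".ppt", ".pptx",
    ".pdf", ".rtf", ".txt", ".html", ".xml",
}
OCR_EXTENSIONS = {".jpg", ".gif", ".tiff", ".svg", ".ps"}

# Assumption: "50 MB" is interpreted as 50 * 1024 * 1024 bytes.
OCR_MAX_BYTES = 50 * 1024 * 1024


def expected_extraction(filename: str, size_bytes: int) -> str:
    """Return "content", "ocr", or "metadata-only" for a file (hypothetical helper)."""
    ext = Path(filename).suffix.lower()
    if ext in CONTENT_EXTENSIONS:
        return "content"
    if ext in OCR_EXTENSIONS:
        # OCR applies only to files at or under the size limit.
        return "ocr" if size_bytes <= OCR_MAX_BYTES else "metadata-only"
    # Unknown extension: metadata is still indexed. A plain text file with an
    # unlisted extension may still be content indexed; this sketch cannot tell.
    return "metadata-only"
```

For instance, `expected_extraction("scan.tiff", 10_000_000)` yields `"ocr"`, while the same file at 60 MB would fall back to `"metadata-only"`.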