Google App Engine

Standard Mapreduce Input Readers and Output Writers

Experimental!

Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce. We will inform the community when this feature is no longer experimental.
 


The App Engine Mapreduce library provides these standard input readers and output writers:

  1. BlobstoreLineInputReader
  2. BlobstoreZipInputReader
  3. BlobstoreZipLineInputReader
  4. BlobstoreOutputWriter
  5. DatastoreInputReader
  6. DatastoreKeyInputReader
  7. FileOutputWriter
  8. NamespaceInputReader
  9. RecordsReader

About Readers and Writers

The standard input readers read data from a specific storage system, such as Blobstore or the datastore, and supply that data to the mapper function. The standard output writers write data from the reducer function to a specific storage system, such as the datastore or Blobstore. You don't instantiate, invoke, or write to the input readers or output writers yourself; the MapreducePipeline object handles all interaction with them. You only tell your MapreducePipeline object which reader and which output writer to use, and supply it with the reader and writer parameters.

The illustration below represents a MapreducePipeline object whose constructor specifies a word count job, the corresponding mapper and reducer functions, and the input reader and output writer to be used. Notice the "mapper_params" and "reducer_params" arguments: despite their names, they hold the parameters for the reader and the writer, respectively. Notice also that the reader and writer are specified by name, using the Mapreduce library.
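As a rough sketch of the constructor arguments described above, the snippet below collects them in a plain dictionary instead of calling the library, so it runs without the App Engine SDK. The helper name, the keyword names, the dotted paths for the mapper, reducer, reader, and writer, and the blob key value are all assumptions made for illustration; check them against your copy of the Mapreduce library.

```python
def build_word_count_args(blob_key):
    """Collect the arguments a word count MapreducePipeline might receive.

    Every dotted path and keyword name here is assumed for illustration.
    """
    return {
        "job_name": "word_count",
        "mapper_spec": "main.word_count_map",      # hypothetical mapper path
        "reducer_spec": "main.word_count_reduce",  # hypothetical reducer path
        "input_reader_spec":
            "mapreduce.input_readers.BlobstoreZipInputReader",
        "output_writer_spec":
            "mapreduce.output_writers.BlobstoreOutputWriter",
        # Despite the name, these are the input reader's parameters.
        "mapper_params": {"blob_key": blob_key},
        # Despite the name, these are the output writer's parameters.
        "reducer_params": {"mime_type": "text/plain"},
    }


args = build_word_count_args("example-blob-key")
```

Keeping the reader parameters under "mapper_params" and the writer parameters under "reducer_params" mirrors the naming the text points out.
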

Standard Input Readers and Output Writers

The following table describes the standard input readers and output writers and the parameters each one accepts.

Each entry gives the reader or writer name, a description, and its parameters.

BlobstoreLineInputReader
Reads a line (\n) delimited text file from Blobstore one line at a time. It calls the mapper function once for each line, passing a tuple consisting of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example: (byte_offset, line_value).
Parameters:
  • blob_keys Either a string containing the blob key, or a list containing multiple blob keys, specifying the data to be read by the reader.

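To make the (byte_offset, line_value) tuple concrete, here is a small sketch that imitates it locally. The helper iter_offset_lines is a hypothetical stand-in for the reader, not part of the library, and decoding each line to text is a convenience assumed here so the mapper can split words.

```python
def iter_offset_lines(data):
    """Yield (byte_offset, line_value) tuples for newline-delimited bytes."""
    offset = 0
    while offset < len(data):
        end = data.find(b"\n", offset)
        if end == -1:
            end = len(data)  # last line has no trailing newline
        yield offset, data[offset:end].decode("utf-8")
        offset = end + 1  # skip past the newline


def word_count_map(data):
    """Emit a (word, "") pair for every word in the line."""
    byte_offset, line_value = data
    for word in line_value.split():
        yield word.lower(), ""


tuples = list(iter_offset_lines(b"the cat\nsat\n"))
pairs = [pair for item in tuples for pair in word_count_map(item)]
```

The offset of the second line is 8 because the first line occupies bytes 0 to 6 and the newline is byte 7.
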
BlobstoreZipInputReader
Iterates over all of the compressed files within the specified zipfile in Blobstore. It calls the mapper function once for each file, passing a tuple consisting of the zipfile.ZipInfo entry for the file and a parameterless function that your mapper calls to get the complete body of the file as a string. For example: (zipinfo, file_callable). The following snippet shows how your mapper might extract each file's data in each iteration:
def word_count_map(data):
  """Word count map function."""
  # data is the (zipinfo, file_callable) tuple supplied by the reader.
  (entry, text_fn) = data
  # Calling text_fn() returns the complete body of the file as a string.
  text = text_fn()
Parameters:
  • blob_key A string containing the blob key specifying the zip file data to be read by the reader.

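The mapper above can be exercised locally with an in-memory zip file. In this sketch, iter_zip_members is a hypothetical stand-in that imitates the (zipinfo, file_callable) tuples described for the reader; the file names and contents are made up, and the word-splitting logic is one possible way to finish the mapper.

```python
import io
import zipfile


def word_count_map(data):
    """Word count map function for (zipinfo, file_callable) input."""
    entry, text_fn = data
    text = text_fn()  # loads the complete file body on demand
    if isinstance(text, bytes):
        text = text.decode("utf-8")
    for word in text.split():
        yield word.lower(), ""


def iter_zip_members(zip_bytes):
    """Stand-in for the reader: yield (zipinfo, file_callable) per member."""
    archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    for info in archive.infolist():
        # Bind info as a default argument so each callable reads its own file.
        yield info, (lambda info=info: archive.read(info))


# Build a small zip archive in memory to act as the input blob.
zip_buffer = io.BytesIO()
with zipfile.ZipFile(zip_buffer, "w") as out:
    out.writestr("a.txt", "Hello world")
    out.writestr("b.txt", "hello again")

pairs = [pair
         for item in iter_zip_members(zip_buffer.getvalue())
         for pair in word_count_map(item)]
```

Passing a callable rather than the file body lets a mapper inspect the ZipInfo entry first and skip files it does not need without decompressing them.
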
BlobstoreZipLineInputReader
Iterates over all of the compressed files, each of which must contain line (\n) delimited data, within the specified zipfile in Blobstore. It calls the mapper function once for each line in each file, passing a tuple consisting of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example: (byte_offset, line_value).
Parameters:
  • blob_keys Either a string containing the blob key, or a list containing multiple blob keys, specifying the zip file data to be read by the reader.

BlobstoreOutputWriter
Writes data from the reducer function to Blobstore, automatically assigning a filename. To retrieve the filename, you must use the completed mapreduce pipeline, as demonstrated by the StoreOutput function in the Mapreduce Made Easy demo.
Parameters:
  • mime_type MIME content type of the output blob. For example, "text/plain".

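As an illustration of the kind of data a reducer might hand to this writer, the sketch below shows a word count reducer that yields one formatted text line per key. The function name and output format are assumptions; here the yielded strings are simply joined in memory to stand in for the blob contents.

```python
def word_count_reduce(key, values):
    """Yield one 'word: count' line; each value marks one occurrence."""
    yield "%s: %d\n" % (key, len(values))


# Stand-in for the written blob: concatenate everything the reducer yields.
grouped = [("hello", ["", ""]), ("world", [""])]
output = "".join(line
                 for key, values in grouped
                 for line in word_count_reduce(key, values))
```

With a mime_type of "text/plain", output of this shape would read as a plain text file with one word per line.
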
DatastoreInputReader
Iterates over all entities of the specified kind (entity_kind) in the datastore, automatically advancing to the next unread entities. Each iteration returns the number of entities specified by the batch_size parameter. This reader does no filtering: do any required filtering in your mapper.
Parameters:
  • entity_kind The datastore kind to map over.
  • namespace The namespace that will be searched for entity_kinds.
  • batch_size The number of entities to read from the datastore with each batch get. Default is 50.

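Because the reader applies no filter, the mapper has to discard unwanted entities itself. The following is a minimal sketch of that pattern, assuming the mapper is called with one entity at a time. Greeting is a hypothetical stand-in for a datastore entity, and the entity_kind value in the parameter dictionary is likewise made up.

```python
import collections

# Hypothetical stand-in for a datastore entity of kind "Greeting".
Greeting = collections.namedtuple("Greeting", ["author", "content"])

# Reader parameters as they might appear in mapper_params (values assumed).
reader_params = {
    "entity_kind": "main.Greeting",  # hypothetical kind
    "batch_size": 50,                # the documented default
}


def non_empty_greeting_map(entity):
    """Skip entities with blank content, then emit (author, length)."""
    if not entity.content.strip():
        return  # filtered out here, since the reader applies no filter
    yield entity.author, len(entity.content)


entities = [Greeting("ann", "   "), Greeting("bob", "hi")]
pairs = [pair for e in entities for pair in non_empty_greeting_map(e)]
```

Filtering in the mapper still reads every entity of the kind, so the cost of the read is paid even for entities that are skipped.
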
DatastoreKeyInputReader
Iterates over the keys of all entities of the specified kind (entity_kind) in the datastore, automatically advancing to the next unread keys. Each iteration returns the number of keys specified by the batch_size parameter. This reader does no filtering: do any required filtering in your mapper.
Parameters:
  • entity_kind The datastore kind whose keys are to be returned.
  • namespace The namespace that will be searched for entity_kinds.
  • batch_size The number of keys to read from the datastore with each batch get. Default is 50.

FileOutputWriter
Writes output data to Blobstore or Google Cloud Storage, automatically assigning a filename. To retrieve the filename, you must use the completed MapreducePipeline, as demonstrated by the StoreOutput function in main.py of the Mapreduce Made Easy demo.
Parameters:
  • filesystem The type of output storage: blobstore or gs.
  • mime_type The MIME content type of the written data. For example, text/plain.
  • gs_bucket_name For a gs filesystem, the bucket name and directory. For example, mybucket/dir1/dir2.
  • output_sharding Controls the number of output files. Only input is supported, which means the number of output files equals the number of input shards.

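The parameters above could be assembled with a small helper like the one below. This is only a sketch: the helper name is hypothetical, and treating gs_bucket_name as required for the gs filesystem is an assumption drawn from the parameter description rather than a documented rule.

```python
def file_output_writer_params(filesystem, mime_type="text/plain",
                              gs_bucket_name=None):
    """Build a parameter dictionary for FileOutputWriter (illustrative)."""
    if filesystem not in ("blobstore", "gs"):
        raise ValueError("filesystem must be 'blobstore' or 'gs'")
    params = {
        "filesystem": filesystem,
        "mime_type": mime_type,
        # Only "input" is described as supported: one file per input shard.
        "output_sharding": "input",
    }
    if filesystem == "gs":
        # Assumption: a bucket name is needed whenever output goes to gs.
        if not gs_bucket_name:
            raise ValueError("gs_bucket_name is needed for the gs filesystem")
        params["gs_bucket_name"] = gs_bucket_name
    return params


gs_params = file_output_writer_params("gs", gs_bucket_name="mybucket/dir1/dir2")
blob_params = file_output_writer_params("blobstore")
```

A dictionary like this would be passed as the writer parameters of the pipeline.
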
NamespaceInputReader
Iterates over and returns the available namespaces.
Parameters:
  • namespace_range The range of namespaces that will be iterated over.
  • batch_size The number of namespaces to return in each iteration. Default is 10.

RecordsReader
Reads a list of files obtained via the Files API in records format, yielding each record as a string in each iteration.
Parameters:
  • files Either a string containing the file to be read or a list containing multiple strings of files to be read.

About Customized Readers and Writers

The standard input readers and output writers should suffice for most use cases. If you need a reader that handles a different input source and format or a writer that writes to a different location and output format than the standard ones, contact Google to determine whether Google can add these to the standard readers and writers.

Alternatively, if you want to write your own reader or writer, look at the open source code for the standard readers and writers to see how they are implemented.
