Mapreduce is an experimental, innovative, and rapidly changing new feature for Google App Engine. Unfortunately, being on the bleeding edge means that we may make backwards-incompatible changes to Mapreduce. We will inform the community when this feature is no longer experimental.
The App Engine Mapreduce library provides a set of standard input readers and output writers, described in the sections below.
About Readers and Writers
The standard input readers read data from a specific storage service, such as Blobstore or the datastore, and supply that data to the mapper function. The standard output writers write data from the reducer function to a specific storage service, such as the datastore or Blobstore. You don't instantiate, invoke, or write to the input readers or output writers yourself; the MapreducePipeline object handles all interaction with them. You tell your MapreducePipeline object which reader and which output writer to use, and you provide it with the reader and writer parameters.
The illustration below is a representation of a MapreducePipeline object with its constructor specifying a word count job, corresponding mapper and reducer functions, and the input reader and output writer to be used. Notice the "mapper_params" and "reducer_params". Those parameters are actually for the reader and writer, respectively. Notice also how the reader and writer are specified, using the Mapreduce library.
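The constructor arguments described above can be sketched roughly in plain Python. The helper below is hypothetical and does not call the Mapreduce library; it only collects the arguments in a dictionary. The keyword names (job_name, mapper_spec, and so on), the main.* function paths, the reader and writer module paths, and the blob_key and mime_type parameter names are assumptions made for illustration and are not confirmed by this page.

```python
def word_count_job_args(blob_key, shards=16):
    """Collect the arguments for a word count job (hypothetical helper).

    All keyword names and dotted paths here are assumptions for illustration.
    """
    return {
        "job_name": "word_count",
        # Dotted paths to your own mapper and reducer functions (assumed names).
        "mapper_spec": "main.word_count_map",
        "reducer_spec": "main.word_count_reduce",
        # The reader and writer are named by dotted path, not instantiated.
        "input_reader_spec": "mapreduce.input_readers.BlobstoreZipInputReader",
        "output_writer_spec": "mapreduce.output_writers.BlobstoreOutputWriter",
        # Despite the name, mapper_params are consumed by the input reader.
        "mapper_params": {"blob_key": blob_key},
        # Likewise, reducer_params are consumed by the output writer.
        "reducer_params": {"mime_type": "text/plain"},
        "shards": shards,
    }


if __name__ == "__main__":
    for name, value in word_count_job_args("example-blob-key").items():
        print(name, "=", value)
```

With the library installed, a dictionary like this would presumably be unpacked into the MapreducePipeline constructor; check the library source for the exact signature before relying on these names.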
Standard Input Readers and Output Writers
The following table describes the standard input readers and output writers and their parameters.
|Reader or Writer Name||Description||Parameters|
|BlobstoreLineInputReader||Reads a newline-delimited (\n) text file from Blobstore one line at a time. It calls the mapper function once for each line, passing a tuple consisting of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example: (byte_offset, line_value).||
|BlobstoreZipInputReader||Iterates over all of the compressed files within the specified zipfile in Blobstore. It calls the mapper function once for each file, passing a tuple consisting of the zipfile.ZipInfo entry for the file and a parameterless function that your mapper calls to get the complete body of the file as a string. For example: (zipinfo, file_callable). The following snippet shows how your mapper might extract each file's data in each iteration:
    def word_count_map(data):
        """Word count map function."""
        (entry, text_fn) = data
        text = text_fn()
|BlobstoreZipLineInputReader||Iterates over all of the compressed files within the specified zipfile in Blobstore; each file must contain newline-delimited (\n) data. It calls the mapper function once for each line in each file, passing a tuple consisting of the byte offset in the file of the first character in the line and the line as a string, not including the trailing newline. For example: (byte_offset, line_value).||
|BlobstoreOutputWriter||Writes data from the reducer function to Blobstore, automatically assigning a filename. To retrieve the filename, you must use the completed mapreduce pipeline, as demonstrated by the StoreOutput function in the Mapreduce Made Easy demo.||
|DatastoreInputReader||Iterates over and returns all instances of the specified entity kind (entity_kind) from the datastore, automatically advancing to the next unread entities. Each iteration returns the number of entities specified by the batch_size parameter. This reader does no filtering; do any required filtering in your mapper.||This reader has several parameters:
|DatastoreKeyInputReader||Iterates over and returns the keys of all entities of the specified entity kind (entity_kind) in the datastore, automatically advancing to the next unread keys. Each iteration returns the number of keys specified by the batch_size parameter. This reader does no filtering; do any required filtering in your mapper.||This reader has several parameters:
|FileOutputWriter||Writes output data to Blobstore or Google Cloud Storage, automatically assigning a filename. To retrieve the filename, you must use the completed MapreducePipeline, as demonstrated by the StoreOutput function in main.py, which is part of the Mapreduce Made Easy demo.||This writer has several parameters:
|NamespaceInputReader||Iterates over and returns the available namespaces.||This reader has several parameters:
|RecordsReader||Reads a list of files obtained via the Files API in records format, yielding each record as a string in each iteration.||
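To make the tuple shapes in the table concrete, here is a minimal sketch in plain Python that imitates what the line and zip readers are described as passing to a mapper. The simulate_* functions are hypothetical stand-ins, not the library's readers; they work on in-memory bytes instead of Blobstore. The two mappers and the choice to emit (word, 1) pairs are also assumptions for illustration.

```python
import io
import zipfile


def simulate_line_reader(data):
    """Yield (byte_offset, line_value) tuples from a bytes buffer.

    Hypothetical stand-in for the tuples BlobstoreLineInputReader is
    described as passing: offset of the line's first byte, and the line
    without its trailing newline.
    """
    offset = 0
    for raw in io.BytesIO(data):
        yield offset, raw.rstrip(b"\n").decode("utf-8")
        offset += len(raw)


def simulate_zip_reader(zip_bytes):
    """Yield (zipinfo, file_callable) tuples from an in-memory zip archive.

    Hypothetical stand-in for the tuples BlobstoreZipInputReader is
    described as passing.
    """
    archive = zipfile.ZipFile(io.BytesIO(zip_bytes))
    for info in archive.infolist():
        # Bind info as a default argument so each callable reads its own file.
        yield info, (lambda info=info: archive.read(info).decode("utf-8"))


def line_word_count_map(data):
    """Mapper for (byte_offset, line_value) input; emits (word, 1) pairs."""
    _byte_offset, line = data
    for word in line.split():
        yield word.lower(), 1


def zip_word_count_map(data):
    """Mapper for (zipinfo, file_callable) input; emits (word, 1) pairs."""
    _entry, text_fn = data
    text = text_fn()  # The file body is only read when the mapper asks for it.
    for word in text.split():
        yield word.lower(), 1


if __name__ == "__main__":
    for item in simulate_line_reader(b"Hello world\nhello again\n"):
        print(item, list(line_word_count_map(item)))
```

In a real job the MapreducePipeline drives the reader and calls the mapper for you; only the mapper functions would be your own code.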
About Customized Readers and Writers
The standard input readers and output writers should suffice for most use cases. If you need a reader that handles a different input source and format or a writer that writes to a different location and output format than the standard ones, contact Google to determine whether Google can add these to the standard readers and writers.