Google App Engine

Uploading and Downloading Data in Python

Note: Bulk upload and download is not supported for apps that use Federated (OpenID) authentication.

The bulk loader tool can upload and download data to and from your application's datastore. With just a little bit of setup, you can upload new datastore entities from CSV and XML files, and download entity data into CSV, XML, and text files. Most spreadsheet applications can export CSV files, making it easy for non-developers and other applications to produce data that can be imported into your app. You can customize the upload and download logic to use different kinds of files, or do other data processing.

You can use the bulk loader tool to download and upload all datastore entities in a special format suitable for backup and restore, without any additional code or configuration. For other data formats, you configure the bulk loader with a configuration file that specifies how uploaded and downloaded data is transformed. You can use the bulk loader itself to automatically generate a configuration file based on your app's datastore, and you can then edit that configuration file to suit your needs exactly.

The bulk loader is available via the appcfg.py command.

  1. Setting up remote_api
  2. Downloading and uploading all data
  3. Configuring the bulk loader
  4. Creating loader classes
  5. Preparing your data
  6. Uploading the data to App Engine
  7. Loading data into the development server
  8. Creating exporter classes
  9. Downloading data from App Engine
  10. Command-line arguments

Setting up remote_api

The bulk loader tool communicates with your application running on App Engine using remote_api, a request handler included with the App Engine runtime environment that allows remote applications with the proper credentials to access the datastore remotely.

There are two ways to install remote_api: automatically, using the builtins directive, or manually, using the url directive.

Installing remote_api using the builtins directive

If your application is using the High Replication Datastore, add an explicit s~ prefix (or e~ prefix if your application is located in the European Union) to the app id in your app.yaml:

application: s~your-app-id

Edit your app.yaml and add the following:

builtins:
- remote_api: on

This directive finds the include.yaml file for remote_api and maps the request handler to /_ah/remote_api. Only administrators of the application can access this URL.

After adding the builtins directive, you need to update your app with the revised app.yaml:

appcfg.py update <app-directory>

Installing remote_api using the url directive

If you want to install remote_api in a custom path, don't use builtins. Instead, specify that path with the url handler as follows:

- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin

This maps the remote_api request handler to the URL /remote_api for your application. You can use any URL you like. Access to this URL is restricted to administrators for the application.
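
For context, here is a minimal app.yaml sketch with the handler installed; the application ID, version, and runtime settings are placeholders you would replace with your own:

application: your-app-id
version: 1
runtime: python
api_version: 1

handlers:
- url: /remote_api
  script: $PYTHON_LIB/google/appengine/ext/remote_api/handler.py
  login: admin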

After adding the url directive, you need to update your app to install the new app.yaml and the remote_api URL:

appcfg.py update <app-directory>

Downloading and uploading all data

If your app uses the master/slave datastore, you can download and upload every entity of a kind in a format suitable for backup and restore, all without writing any additional code or configuration. If your app uses the High Replication datastore, downloads are less reliable: if you attempt to download data, you'll see a high_replication_warning error in the Admin Console, and the downloaded data might not include recently saved entities.

To download all entities of all kinds from an app's master/slave datastore, run the following command:

appcfg.py download_data --url=http://your_app_id.appspot.com/_ah/remote_api --filename=<data-filename>

You can also use the --kind=... argument to download all entities of a specific kind:

appcfg.py download_data --kind=<kind> --url=http://your_app_id.appspot.com/_ah/remote_api --filename=<data-filename>

To upload data to the app's datastore from a file created by appcfg.py download_data, run the following command:

appcfg.py upload_data --url=http://your_app_id.appspot.com/_ah/remote_api --kind=<kind> --filename=<data-filename>

When data is downloaded, the entities are stored along with their original keys. When the data is uploaded, the original keys are used. If an entity exists in the datastore with the same key as an entity being uploaded, the entity in the datastore is replaced.

You can use upload_data to replace the data in the app from which it was dumped, or you can use it to upload the data to a different application. Entities with numeric system IDs will be uploaded with the same IDs, and reference properties will be preserved.

Configuring the bulk loader

The bulk loader uses configuration files to describe the data you're uploading or downloading. You can use the bulk loader itself to automatically generate these configuration files. To generate a configuration file for an existing app, you call the bulk loader with the create_bulkloader_config action. After the configuration file is generated, you'll then edit some details in the file before using it.

Using automatic configuration

The bulk loader uses a bulkloader.yaml file to describe how your data should be transformed when uploaded or downloaded. This file includes a header, followed by a list of transforms. Each transform describes two stages of transformation: between external data and an intermediate format, and between the intermediate format and a datastore entity.

When you import data, one transform reads data from an external source, such as a CSV or XML file, and converts it to an intermediate format (a Python dictionary) that represents the contents of the file. A second transform converts the data from the intermediate format to App Engine datastore entities. When you export data, the process is reversed. First, entities are transformed to an intermediate format, then from that format to the export format.
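
As a rough illustration, suppose you import a CSV file whose header row names two hypothetical columns, title and artist. A data row such as:

Thriller,Michael Jackson

is first converted to the intermediate Python dictionary:

{'title': 'Thriller', 'artist': 'Michael Jackson'}

The property map's import transforms then turn each dictionary value into a datastore property; on export, the same two stages run in reverse.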

When you run the bulk loader to automatically generate the bulkloader.yaml file, the bulk loader examines your datastore statistics and creates transforms based on the kinds and properties of your app's data. Note that your datastore statistics can be up to 24 hours old, so if you change your schema, the generated file might not reflect the changes right away.

To automatically generate the bulkloader.yaml file based on your datastore statistics, run the bulk loader with the create_bulkloader_config action:

appcfg.py create_bulkloader_config --filename=bulkloader.yaml --url=http://your_app_id.appspot.com/_ah/remote_api

You'll use the generated file as input to the bulk loader tool when you run it again to perform an import or export. Below is an example of the output that appears when you run the bulk loader with the create_bulkloader_config action:

[INFO    ] Logging to bulkloader-log-20100516.144319
[INFO    ] Throttling transfers:
[INFO    ] Bandwidth: 250000 bytes/second
[INFO    ] HTTP connections: 8/second
[INFO    ] Entities inserted/fetched/modified: 20/second
[INFO    ] Batch Size: 100
[INFO    ] Opening database: bulkloader-progress-20100516.144319.sql3
[INFO    ] Opening database: bulkloader-results-20100516.144319.sql3
[INFO    ] Connecting to your_app_id.appspot.com/_ah/remote_api
No handlers could be found for logger "google.appengine.tools.appengine_rpc"
[INFO    ] Downloading kinds: ['__Stat_PropertyType_PropertyName_Kind__']
.
[INFO    ] Have 64 entities, 0 previously transferred
[INFO    ] 64 entities (23986 bytes) transferred in 1.9 seconds

Now let's look at the generated bulkloader.yaml file, along with descriptions of each section. The first section:

# Autogenerated bulkloader.yaml file.
# You must edit this file before using it. TODO: Remove this line when done.
# At a minimum address the items marked with TODO:
#  * Fill in connector and connector_options
#  * Review the property map
#    - Ensure the 'external_name' matches the name of your CSV column,
#      XML tag, etc.
#    - Check that __key__ property is what you want. Its value will become
#      the key name on import, and on export the value will be the Key
#      object. If you would like automatic key generation on import and
#      omitting the key on export, you can remove the entire __key__
#      property from the property map.

You'll need to edit the generated file before you can use it with your data. These instructions at the top of the file remind you of items in the file that you should address.

The next section lists Python modules to be imported:

# If you have module(s) with your model classes, add them here. Also
# change the kind properties to model_class.
python_preamble:
- import: google.appengine.ext.bulkload.transform
- import: google.appengine.ext.db
- import: re
- import: base64

You probably won't have to edit this section, unless you want to import any additional Python modules when doing the bulk loader import or export.

The next section of the bulkloader.yaml file provides details on how the data should be transformed upon input and output:

transformers:
- kind: Permission
  connector: # TODO: Choose a connector here: csv, simplexml, etc...
  connector_options:
    # TODO: Add connector options here--these are specific to each connector.
  property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: account
      external_name: account
      # Type: Key Stats: 119 properties of this type in this kind.
      import_transform: transform.create_foreign_key('TODO: fill in Kind name')
      export_transform: transform.key_id_or_name_as_string

    - property: invite_nonce
      external_name: invite_nonce
      # Type: String Stats: 19 properties of this type in this kind.

    - property: role
      external_name: role
      # Type: Integer Stats: 119 properties of this type in this kind.
      import_transform: transform.none_if_empty(int)

    - property: user
      external_name: user
      # Type: Key Stats: 119 properties of this type in this kind.
      import_transform: transform.create_foreign_key('TODO: fill in Kind name')
      export_transform: transform.key_id_or_name_as_string

The bulkloader.yaml file contains one set of transforms for each kind you want to process. The generated file contains a set for each kind in your datastore. This example includes the following: one kind, Permission; a connector, which you'll fill in when you edit the file, specifying the external format of the data; optional connector_options, also to be added, that specify various settings and flags for the connector; and a property_map describing all the properties in the data.

Editing the configuration file

The first step in editing the bulkloader.yaml file is to specify the connector and connector options. The bulk loader supports CSV and XML connectors (represented by csv and simplexml, respectively) for data import and export, and a simple text connector (simpletext) for export only. In this sample, we'll set up the CSV connector to read the data on import and write it on export. We'll use the default options of the CSV connector, which are to read the column names from the first row of the CSV file, and to write them there on export.

- kind: Permission
  connector: csv
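
If your file deviates from the CSV connector's defaults, you can also fill in connector_options. As a hedged sketch drawing on the options listed in the reference section below, the following reads a tab-separated file in a different encoding:

- kind: Permission
  connector: csv
  connector_options:
    encoding: windows-1252
    import_options:
      dialect: excel-tab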

The next section describes the properties of the data. Each property entry specifies how to transform a particular property on import and export. The auto-generated file includes the four properties identified by the bulk loader, plus the __key__ pseudo-property. Each property has an external name, and optional transforms for input and output that specify how to convert data between the datastore and the external representation. We must also add the kind names for the two reference properties, replacing the TODO strings in the original file. Here's the edited properties section:

property_map:
    - property: __key__
      external_name: key
      export_transform: transform.key_id_or_name_as_string

    - property: account
      external_name: account
      import_transform: transform.create_foreign_key('Account')
      export_transform: transform.key_id_or_name_as_string

    - property: invite_nonce
      external_name: invite_nonce

    - property: role
      external_name: role
      import_transform: transform.none_if_empty(int)

    - property: user
      external_name: user
      import_transform: transform.create_foreign_key('User')
      export_transform: transform.key_id_or_name_as_string

With the bulkloader.yaml file complete, we can now import data from an external CSV file to the datastore:

appcfg.py upload_data --config_file=bulkloader.yaml --filename=users.csv --kind=Permission --url=http://your_app_id.appspot.com/_ah/remote_api
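
For reference, a users.csv file matching the property map above might look like the following (the values are hypothetical; the first row supplies the column names, per the CSV connector's defaults):

key,account,invite_nonce,role,user
perm1,account1,abc123,2,alice
perm2,account1,,1,bob

Note that invite_nonce is left empty in the second row; transforms such as none_if_empty(int) store None for empty values rather than failing.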

You can use the same bulkloader.yaml file to export data, as in the following example command line:

appcfg.py download_data --config_file=bulkloader.yaml --filename=users.csv --kind=Permission --url=http://your_app_id.appspot.com/_ah/remote_api

In this invocation of appcfg.py, data from your app's datastore is exported to a CSV file named users.csv.

Configuration file reference details

This section contains details of the format of bulkloader.yaml files and options for the appcfg.py tool.

The file begins with a header containing information that applies to the entire file. This section is used for specifying Python modules to be imported.

The Transformers section lists entity kinds and transform information. Each entry begins by specifying a kind; you can specify a model class instead of a kind. For each kind, the file specifies a connector (csv, simplexml, or simpletext, the last for export only), optional connector options, and a property map. Within the property map, each property is specified, along with an external name and transforms for import and export, if required.

The connector options are as follows:

csv connector
encoding
Any Python standard encoding format, such as utf-8 (the default) or windows-1252.
column_list
Use the sequence of names specified here as the columns on import and export. If not specified, the first row of data is used to determine the external_name of each column, and data is read or written starting with the second row.
skip_import_header_row
If true, header line will be ignored on import.
print_export_header_row
If true, header line will be printed on export.
import_options
Additional keyword arguments for the Python CSV module on import. Use dialect: excel-tab for a TSV file.
export_options
Additional keyword arguments for the Python CSV module on export.
simplexml connector
xpath_to_nodes
An XPath expression that specifies the nodes to be read. Basic paths of the form /node1/node2 are supported; if another form is specified, import may still work but export is disabled. Namespaces are not well supported.
style
Possible values are element_centric and attribute_centric. The children of the nodes found with xpath_to_nodes are converted into the intermediate format; the style argument determines whether the attributes of the found node (attribute_centric) or its child elements (element_centric) are used. The entire node is also passed in as __node__. A sketch of both styles appears after this list.
simpletext connector
template
A Python dict interpolation string used for each exported record.
prolog (optional)
Written before the per-record output.
epilog (optional)
Written after the per-record output.
mode (optional)
text (default)
Text file mode. Newlines are written between records.
nonewline
Text file mode. No newlines are added.
binary
Binary file mode. No newlines are added.
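
To make the simplexml styles concrete, here is a hedged sketch with hypothetical element names. Given an XML file like:

<albums>
  <album>
    <title>Thriller</title>
    <artist>Michael Jackson</artist>
  </album>
</albums>

an element_centric configuration reads the title and artist child elements of each matched node:

connector: simplexml
connector_options:
  xpath_to_nodes: /albums/album
  style: element_centric

With style: attribute_centric, the same values would instead be read from attributes of the matched node, as in <album title="Thriller" artist="Michael Jackson" />.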

The property map section defines the details of the transform between the entity and the intermediate format. The elements of the property map:

property
The name of the property, as defined in the entity or model.
external_name
Maps a single property, such as a single CSV column, to a single entry in the intermediate dictionary.
import_template
Specifies multiple dictionary items for a single property, using Python string interpolation.
import_transform
A function that returns the correctly typed value for the property, based on the external_name or import_template strings. This can be a single-argument function: a built-in Python conversion operator (such as float), one of the helper functions provided in transform (such as get_date_time or generate_foreign_key), a function provided in your own library, or an in-line lambda function. Alternatively, it can be a two-argument function with the keyword argument bulkload_state, which contains useful information about the entity: bulkload_state.current_entity, the current entity being processed; bulkload_state.current_dictionary, the current export dictionary; and bulkload_state.filename, the --filename argument that was passed to appcfg.py. A sketch of both forms appears after this list.
export_transform
Like import_transform, except performed on export.
export
Like import_template, except performed on export and specified as a sequence of external_name/export_transform values.
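
Here is a hedged sketch of both import_transform forms; normalize_price and tag_with_source are hypothetical helpers you would define in a module imported in python_preamble:

def normalize_price(value):
    # One-argument form: receives the external string value and
    # returns the typed datastore value (here, a number of cents).
    return int(float(value) * 100)

def tag_with_source(value, bulkload_state=None):
    # Two-argument form: bulkload_state carries the current entity,
    # the current dictionary, and the --filename passed to appcfg.py.
    return '%s (from %s)' % (value, bulkload_state.filename)

You would then reference one in a property entry, for example import_transform: mymodule.normalize_price.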

Each entity created on import has a key. If you don't specify a key, it will be generated automatically by the datastore. If you want to use or calculate a key from the import data, specify a key using the same syntax as the property map; that is, external_name, import_template, and so on.

If you want to do additional processing on data that can't be easily described in a property map, you can specify a function to modify the entity in arbitrary ways, or even return multiple entities on import. To use this feature, add one or both of the following to your transform entry:

post_import_function: functionName, where the function is called as functionName(input_dict, instance, bulkload_state_copy)

Your function must return one of the following: None, which means to skip importing this record; a single entity (usually the instance argument that was passed in); or a list of multiple entities to be imported.

post_export_function: functionName, where the function is called as functionName(instance, export_dict, bulkload_state)

Your function must return one of the following: None, which means this result should be skipped; or a dict (typically the export_dict argument that was passed in) containing the data to be written.
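
As a hedged sketch, a post-import hook that drops records flagged in the source file (the disabled column and the skip_disabled name are hypothetical) might look like this:

def skip_disabled(input_dict, instance, bulkload_state_copy):
    # Returning None skips the record entirely.
    if input_dict.get('disabled') == 'true':
        return None
    # Returning the instance imports it unchanged; a list of
    # entities could be returned instead to import several.
    return instance

You would reference it in the transform entry as post_import_function: mymodule.skip_disabled, with mymodule imported in python_preamble.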

Creating loader classes

To upload data, you must tell appcfg.py how to translate each row in the data file to a datastore entity. You do this using a file of Python code. The file imports or defines the Model classes for the entities being created, defines a loader class for each kind you wish to import, and declares the available loader classes in a global variable.

For example, say you have a Model class named Album defined in a file named models.py (which is in your PYTHONPATH, such as the directory where you'll run the tool) that resembles the following:

from google.appengine.ext import db

class Album(db.Model):
    artist = db.StringProperty()
    title = db.StringProperty()
    publication_date = db.DateProperty()
    length_in_minutes = db.IntegerProperty()

You wish to import a CSV file that has the columns in the following order: title, artist, publication date, and length in minutes. The CSV file contains string representations for each of these values.

The following is a loader class file for this data:

import datetime
from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Album',
                                   [('title', str),
                                    ('artist', str),
                                    ('publication_date',
                                     lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
                                    ('length_in_minutes', int)
                                   ])

loaders = [AlbumLoader]

The bulk loader tool looks for a global variable in the loader class file named loaders, whose value is a list of loader classes to use. In this example, the tool loads the AlbumLoader class for loading entities of the kind Album.

The loader class defines an __init__() method that calls the __init__() on the Loader class. The first argument is self, the loader class instance. The second argument is the name of the datastore kind as a string, in this case 'Album'. The third argument is a sequence of tuples, where each tuple contains the name of the property (a string) and a conversion function. The conversion function must take a string and return one of the datastore value types.

In this example, the 'title' and 'artist' properties both take string values, so the conversion function is str, the string constructor. The 'length_in_minutes' property takes an integer, and the int constructor takes a string and converts it to an integer value.

For the 'publication_date' property, the model needs a datetime value. In this case, we know (from our data file) that publication dates are represented in the form mm/dd/yyyy. The conversion function is a Python lambda expression (a short function) that takes a string, then passes it to datetime.datetime.strptime() with a pattern to parse the value to a datetime.datetime, then calls its date() method to get the final datetime.date value.

If any conversion function raises an exception or fails to return a value that meets the requirements of the Model class, processing of the data file stops and an error is reported. If you are using a progress file (which requires sqlite3, as described below), you can fix the CSV file (or the code, as needed), then re-run the upload to continue from the row where the error occurred.

If the name of the file containing the loader class definition is album_loader.py, you would give appcfg.py upload_data the following argument: --config_file=album_loader.py

Preparing your data

appcfg.py upload_data accepts data in the form of a file in the CSV (Comma Separated Values) format, a simple text file representing a table of values, with one row per line and columns delimited by commas (,). If a value contains one or more commas, the value is surrounded by double quotes ("). If a value contains a double quote, the double quote appears twice (""). The tool uses the csv module from the Python standard library to parse the data file.
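
For example, a data file for the Album loader described above might contain rows like these (the values are hypothetical; note how the comma in the first title is handled by quoting):

"Live, In Concert",The Example Band,06/15/2001,74
Greatest Hits,Another Artist,01/01/1999,62

There is no header row here; if your file has one, pass --has_header as described below.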

The file must use the UTF-8 encoding for the text data. UTF-8 encoding is compatible with ASCII encoding, so any spreadsheet application that saves CSV as ASCII will also work. If your CSV file contains non-ASCII characters, you will need to decode them in your Loader class. Here is a version of AlbumLoader that supports UTF-8 titles and artists:

import datetime
from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumLoader(bulkloader.Loader):
    def __init__(self):
        bulkloader.Loader.__init__(self, 'Album',
                                   [('title', lambda x: x.decode('utf-8')),
                                    ('artist', lambda x: x.decode('utf-8')),
                                    ('publication_date',
                                     lambda x: datetime.datetime.strptime(x, '%m/%d/%Y').date()),
                                    ('length_in_minutes', int)
                                   ])

loaders = [AlbumLoader]

Most spreadsheet applications can export a sheet as a CSV file. To export a sheet as CSV from Google Spreadsheets, select the File menu > Export > .csv Sheet Only, then save the file from the browser window that opens.

If the name of the CSV file to load is album_data.csv, you would give appcfg.py upload_data the following argument: --filename=album_data.csv

If the CSV file's first row is not data (such as if it's a header row), use the following option to skip the first row: --has_header

Uploading the data to App Engine

To start the data upload, run appcfg.py upload_data with the appropriate arguments:

appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://your_app_id.appspot.com/_ah/remote_api

If you are using a Google Apps domain name and need appcfg.py to sign in using an account on that domain, you must specify the --auth_domain=... option, whose value is your domain name.

If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... argument. The value is the name of the progress file created by the tool, which is either a name you provided with the --db_filename argument when you started the transfer, or a default name that includes a timestamp. This assumes you have sqlite3 installed, and did not disable the progress file with --db_filename=skip.
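
For example, to resume an interrupted upload using the default timestamped progress file (the timestamp here is illustrative):

appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --db_filename=bulkloader-progress-20100516.144319.sql3 --url=http://your_app_id.appspot.com/_ah/remote_api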

Loading data into the development server

If you'd like to test how your data works with the app before uploading it, you can load it into the development server. Use the --url option to point the tool at the development server URL. For example:

appcfg.py upload_data --config_file=album_loader.py --filename=album_data.csv --kind=Album --url=http://localhost:8080/_ah/remote_api <app-directory>

Creating exporter classes

To download data, you must tell appcfg.py how to export datastore entities to a file. Similar to uploading and loader classes, you do this with a file of Python code. The file imports or defines the Model classes, defines an exporter class for each kind you wish to export, and declares the available exporter classes in a global variable.

Continuing the example with the Album Model class defined above, the following file defines an exporter that produces a CSV file in the same format used by the loader class:

from google.appengine.ext import db
from google.appengine.tools import bulkloader
import models

class AlbumExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'Album',
                                     [('title', str, None),
                                      ('artist', str, None),
                                      ('publication_date', str, None),
                                      ('length_in_minutes', str, None)
                                     ])

exporters = [AlbumExporter]

The exporter class is similar to the loader class. The exporter class defines an __init__() method that calls the __init__() on the Exporter class. The first argument is self, the exporter instance. The second argument is the name of the datastore kind being exported, a string. The third argument is a sequence of tuples, one for each entity property being exported.

Each tuple has 3 elements: the property name, a function that takes the property value and converts it to a str, and a default value if the entity does not have the property set. If the default value is None, the exporter raises an exception if it attempts to export an entity that does not have the property.

The CSV file created by the exporter includes one row for each entity, and one column for each property mentioned in the sequence. The columns are in the order they appear in the sequence.
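
As a hedged variation on the exporter above, you could format the date back into the loader's mm/dd/yyyy form and supply a default for entities missing a length:

class AlbumExporter(bulkloader.Exporter):
    def __init__(self):
        bulkloader.Exporter.__init__(self, 'Album',
                                     [('title', str, None),
                                      ('artist', str, None),
                                      # Render the DateProperty as mm/dd/yyyy.
                                      ('publication_date',
                                       lambda d: d.strftime('%m/%d/%Y'), None),
                                      # Export '0' when the property is unset.
                                      ('length_in_minutes', str, '0')
                                     ])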

Downloading data from App Engine

To start a data download, run appcfg.py download_data with the appropriate arguments:

appcfg.py download_data --config_file=album_loader.py --filename=album_data_archive.csv --kind=Album <app-directory>

If you are using a Google Apps domain name and need appcfg.py to sign in using an account on that domain, you must specify the --auth_domain=... option, whose value is your domain name.

If the transfer is interrupted, you can resume the transfer from where it left off using the --db_filename=... and --result_db_filename=... arguments. These arguments are the names of the progress file and the results file created by the tool, which are either names you provided with the arguments when you started the transfer, or default names that include a timestamp. This assumes you have sqlite3 installed, and did not disable progress files with --db_filename=skip.

Command-line arguments

The appcfg.py upload_data command accepts the following arguments. See also the other options appcfg.py accepts for all actions, listed in Uploading and Managing a Python App: Command-Line Arguments.

appcfg.py upload_data [options] <app-directory>
--filename=...
Required. The path to the CSV data file to load.
--kind=...
Required. The name of the datastore kind to use for creating new entities.
--config_file=...
Required. A Python source file that imports or defines Model classes for the kinds of entities that an upload might create, as well as Loader classes for each kind. appcfg.py upload_data provides the Loader base class in the local namespace when evaluating this file.
--loader_opts=...
An option to pass to the Loader class's initialize() method. You can implement this method to pass arguments to your Loader classes.
--log_file=...
The name of the file in which to write logging information about the upload. The default is to create a file named bulkloader-log-timestamp in the current working directory (where timestamp is the time the tool is run).
--auth_domain=...
The name of the authorization domain of the account to use to contact remote_api. If you're using a Google Apps domain and need appcfg.py to sign in using a Google Apps account, specify your domain name with this option.
--num_threads=#
The number of threads to spawn to upload new entities in parallel. The default is 10.
--batch_size=#
The number of entities to create with each remote_api call. For large entities, use a small batch size to limit the amount of data posted per batch. The default is 10.
--bandwidth_limit=#
The maximum total number of bytes per second that all threads should send. Bursts may exceed this maximum, but the overall bandwidth will be less than this amount. The default is 250,000 bytes per second.
--rps_limit=#
The maximum total number of records per second that all threads should send. The default is 20.
--http_limit=#
The maximum total number of HTTP requests per second that all threads should send. The default is 7.5 per second (15 per 2 seconds).
--db_filename=...
The filename to use for the progress file for this run. If not specified, this is named bulkloader-progress-timestamp, where timestamp represents the time the command is run. If this argument is specified with the value skip, the upload will not use a progress file.
--has_header
If present, skips the first row in the CSV file, assuming it's a header row.
--application=...
The application ID, if different from the application ID specified in the app's app.yaml file.
--url=...
The URL of the remote_api handler to use to connect to the datastore. The bulk loader can determine the application ID from this URL.
--dry_run
Don't actually call the datastore; just verify that everything else would work.
app_directory
Read the app.yaml file from app_directory to determine the application ID and URL. Using --url is preferred.
appcfg.py download_data [options]

The arguments for download_data are the same as those for upload_data. The --filename argument specifies the file to write to, instead of the file to read from. This file is overwritten when starting a new download, and appended to when continuing an interrupted download.

download_data also supports the following argument:

--result_db_filename=...
The filename to use for the results file for this run. The results file stores the exported data until the export is finished, so the final output file can be sorted by key. If not specified, this is named bulkloader-results-timestamp, where timestamp represents the time the command is run. If this argument is specified with the value skip, the download will not use a results file.
