Batch ingestion

Your data feeds let you make your restaurant, services, and menu available in Ordering End-to-End.

This document covers how to host your sandbox and production inventories and use batch ingestion to update your inventory in Ordering End-to-End.

Data feed environments

There are two data feed environments available for your integration development:

  • Sandbox: The test environment for your feed development. Batch ingestion is required.
  • Production: The production environment for the inventory that you want to launch. Batch ingestion is required.

Hosting data feeds

For Ordering End-to-End to process your sandbox and production data feeds through batch ingestion, you must host your data feed files on Google Cloud Storage, Amazon S3, or an HTTPS server with a sitemap.

We recommend that you host the data feeds for your sandbox and production environments separately. This approach lets you do development and testing in your sandbox feed environment before you deploy the changes to production.

For example, if you use Google Cloud Storage as a hosting option, you would have the following paths:

  • Sandbox Feed: gs://foorestaurant-google-feed-sandbox/
  • Production Feed: gs://foorestaurant-google-feed-prod/

To host your inventory, do the following:

  1. Generate your data feed files.
  2. Choose a hosting solution.
  3. Host your data feeds.
  4. Ensure that your data feed files are updated regularly. Production data feeds must be updated daily.

For details on how to create an inventory feed, see the documentation for the Restaurant, Service, and Menu entities, as well as the Create a data feed section.
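
For illustration, here's a minimal sketch of step 1: writing entities to an NDJSON feed file, one JSON entity per line. The entity payloads are hypothetical placeholders, not the full Restaurant, Service, or Menu schema:

Python

import json

# Hypothetical placeholder entities; see the Restaurant, Service, and Menu
# entity documentation for the actual required fields.
entities = [
    {"@type": "Restaurant", "@id": "restaurant_1", "name": "Foo Restaurant"},
    {"@type": "Menu", "@id": "menu_1", "name": "Dinner menu"},
]

# NDJSON: one serialized entity per line, as in the restaurant_1.ndjson
# files referenced by the example sitemaps later in this document.
with open("restaurant_1.ndjson", "w", encoding="utf-8") as feed_file:
    for entity in entities:
        feed_file.write(json.dumps(entity, ensure_ascii=False) + "\n")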

Guidelines on data feed files

Each file can contain multiple entities, but no file can exceed 200 MB. Each top-level entity (Restaurant, Service, or Menu), together with its child entities, must not exceed 4 MB in total.
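
Here's a minimal sketch of a pre-upload check for these limits. It assumes each NDJSON line holds one top-level entity together with its children, which is an assumption about how you group entities:

Python

import os

MAX_FILE_BYTES = 200 * 1024 * 1024  # 200 MB per feed file
MAX_ENTITY_BYTES = 4 * 1024 * 1024  # 4 MB per top-level entity plus children

def check_feed_file(path):
    # Enforce the 200 MB per-file limit.
    size = os.path.getsize(path)
    if size > MAX_FILE_BYTES:
        raise ValueError(f"{path} is {size} bytes, which exceeds 200 MB")
    # Assumption: one top-level entity and its children per NDJSON line,
    # so line length approximates the 4 MB entity-group limit.
    with open(path, "rb") as feed_file:
        for line_number, line in enumerate(feed_file, start=1):
            if len(line) > MAX_ENTITY_BYTES:
                raise ValueError(f"{path}:{line_number} exceeds the 4 MB limit")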

Choose a hosting solution

The following options are available for hosting your data feeds. Each option covers the credentials and access Google needs, how Google knows which files to fetch and when they're ready, and the file limits:

Amazon S3

Credentials and access: Provide Google with the following information:

  • Access key ID.
  • Secret access key.
  • The paths to your production and sandbox S3 directories and marker.txt file. The paths must begin with s3://.

The S3 bucket must include the following:

  • Feed files for your inventory.
  • marker.txt, which contains a timestamp used for fetching.

Example marker.txt file: 2018-12-03T08:30:42.694Z

How Google knows which files need to be fetched: Directory listing of all files in the bucket.

How Google knows that files are ready to fetch: After you finish generating your data feeds, update the marker.txt file with the latest timestamp.

File limits: You must have fewer than 100,000 files total in your Amazon S3 bucket.

Google Cloud Storage

Credentials and access: Provide Google with the paths to your production and sandbox bucket directories and marker.txt file. The paths must begin with gs://. Add the service account provided by your Google consultant as a reader of your Google Cloud Storage (GCS) bucket. For more information on how to control access for GCS, see Google Cloud Platform Console: Setting bucket permissions.

The GCS bucket must include the following:

  • Feed files for your inventory.
  • marker.txt, which contains a timestamp used for fetching.

Example marker.txt file: 2018-12-03T08:30:42.694Z

How Google knows which files need to be fetched: Directory listing of all files in the bucket.

How Google knows that files are ready to fetch: After you finish generating your data feeds, update the marker.txt file with the latest timestamp.

File limits: You must have fewer than 100,000 files total in your Google Cloud Storage bucket.

HTTPS with a sitemap

Credentials and access: Provide Google with the following information:

  • Credentials for basic authentication.
  • The paths to your production and sandbox sitemaps. The paths must begin with https://.
  • Protocol: You must make your feed files available through HTTPS, not HTTP.
  • Security: Google strongly recommends that you protect your hosted feed files with Basic Authentication.

How Google knows which files need to be fetched: Individual URLs of files listed in the sitemap.

How Google knows that files are ready to fetch: After you finish generating your data feeds, update the last-modified response header of your sitemap.xml with the latest timestamp.

File limits: The number of file paths within your sitemap XML file must be less than 100,000.
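
For the two bucket options, readiness is signaled the same way: rewrite marker.txt once every feed file is in place. Here's a minimal sketch using boto3 and google-cloud-storage, with the example bucket names from this document:

Python

from datetime import datetime, timezone

import boto3
from google.cloud import storage

# UTC timestamp in the same format as the example marker.txt above.
timestamp = datetime.now(timezone.utc).isoformat(timespec="milliseconds")
timestamp = timestamp.replace("+00:00", "Z")

# Amazon S3: overwrite marker.txt only after all feed files are uploaded.
boto3.client("s3").put_object(
    Bucket="foorestaurant-google-feed-prod",
    Key="marker.txt",
    Body=timestamp.encode("utf-8"))

# Google Cloud Storage: the equivalent call with the GCS client.
storage.Client().bucket("foorestaurant-google-feed-prod").blob(
    "marker.txt").upload_from_string(timestamp)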

Connect your data feeds for batch ingestion

After you host your feeds, connect them to your project in the Actions Center. The initial configuration of production feeds is done on the Onboarding Tasks page. Afterwards, any portal user with an administrative role can update the production and sandbox feed configuration at any time from the Configuration > Feeds page. The sandbox environment is used for development and testing, while the production feeds are shown to users.

If you host your data feeds with Amazon S3

  1. In the Actions Center, go to Configuration > Feeds.
  2. Click Edit and fill out the Update Feed form:

    • Feed delivery method: Set to Amazon S3.
    • Marker File: Provide the URL of the marker.txt file.
    • Data Files: Provide the URL to the S3 bucket that contains the data feeds.
    • Access ID: Enter the IAM access key ID with permissions to read from S3 resources.
    • Access Key: Enter the IAM secret access key with permissions to read from S3 resources.
  3. Click Submit.
  4. After one to two hours, check whether batch ingestion fetches your feed files.
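
Before you submit the form, you can sanity-check that the credentials you're handing to Google can list the bucket and read marker.txt. Here's a minimal sketch with boto3; the exact operations batch ingestion performs aren't documented, so listing and reading are an assumption:

Python

import boto3

BUCKET = "foorestaurant-google-feed-sandbox"  # example bucket from this document

s3 = boto3.client(
    "s3",
    aws_access_key_id="AKIA...",      # the access key ID you provide to Google
    aws_secret_access_key="...",      # the secret access key you provide to Google
)

# Confirm the key can list the bucket and read the marker file.
listing = s3.list_objects_v2(Bucket=BUCKET, MaxKeys=10)
print([item["Key"] for item in listing.get("Contents", [])])
marker = s3.get_object(Bucket=BUCKET, Key="marker.txt")
print(marker["Body"].read().decode("utf-8"))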

If you host your data feeds with Google Cloud Storage

  1. In the Actions Center, go to Configuration > Feeds.
  2. Click Edit and fill out the Update Feed form:

    • Feed delivery method: Set to Google Cloud Storage.
    • Marker File: Provide the URL of the marker.txt file.
    • Data Files: Provide the URL to the GCS bucket that contains the data feeds.
  3. Click Submit.
  4. A service account is created to access your GCS bucket. You can find the account name in Configuration > Feeds after the onboarding tasks are complete. Grant this service account the Storage Legacy Object Reader role (roles/storage.legacyObjectReader), either on the IAM page of the Google Cloud console or programmatically, as in the sketch after these steps.
  5. After one to two hours, check whether batch ingestion fetches your feed files.
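
If you prefer to grant the role programmatically instead of through the console, here's a minimal sketch with the google-cloud-storage client. The service account address is a placeholder; use the one shown in Configuration > Feeds:

Python

from google.cloud import storage

BUCKET = "foorestaurant-google-feed-prod"  # example bucket from this document
# Placeholder; copy the real account name from Configuration > Feeds.
MEMBER = "serviceAccount:feeds-reader@example.iam.gserviceaccount.com"

client = storage.Client()
bucket = client.bucket(BUCKET)

# Grant the Storage Legacy Object Reader role on the bucket.
policy = bucket.get_iam_policy(requested_policy_version=3)
policy.bindings.append({
    "role": "roles/storage.legacyObjectReader",
    "members": {MEMBER},
})
bucket.set_iam_policy(policy)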

If you host your data feeds with HTTPS

  1. In the Actions Center, go to Configuration > Feeds.
  2. Click Edit and fill out the Update Feed form:

    • Feed delivery method: Set to HTTPS.
    • Sitemap File: Provide the URL of the sitemap.xml file.
    • Username: Enter the username credentials to access the HTTPS server.
    • Password: Enter the password to access the HTTPS server.
  3. Click Submit.
  4. After one to two hours, check whether batch ingestion fetches your feed files.
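
You can also verify the same access Google will use: fetch the sitemap with your Basic Authentication credentials and confirm that the last-modified response header is present. A minimal sketch with the requests library, using the example sitemap URL from this document and placeholder credentials:

Python

import requests

SITEMAP_URL = "https://sandbox-foorestaurant.com/sitemap.xml"  # example path

response = requests.get(SITEMAP_URL, auth=("feed-user", "change-me"), timeout=30)
response.raise_for_status()
# Google reads this header to decide whether the feeds are ready to fetch.
print(response.headers.get("Last-Modified"))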

Example paths

The following table contains example paths for each of the hosting options:

Amazon S3 Google Cloud Storage HTTPS with a sitemap
Path s3://foorestaurant-google-feed-sandbox/ gs://foorestaurant-google-feed-sandbox/ https://sandbox-foorestaurant.com/sitemap.xml
Marker file s3://foorestaurant-google-feed-sandbox/marker.txt gs://foorestaurant-google-feed-sandbox/marker.txt Not applicable

Sitemaps for HTTPS hosting

Use the following guidelines when you define sitemaps:

  • Links in your sitemap must point to the files themselves.
  • If your sitemap includes references to a cloud provider instead of your own domain name, ensure that the start of each URL, such as https://www.yourcloudprovider.com/your_id, is stable and unique to your batch job.
  • Be careful not to upload partial sitemaps, for example after a partial data upload. Google would then ingest only the files listed in the sitemap, which causes your inventory levels to drop and might result in your feed ingestion being blocked.
  • Ensure that the paths to the files referenced in the sitemap don't change. For example, don't have your sitemap reference https://www.yourcloudprovider.com/your_id/10000.json today but then reference https://www.yourcloudprovider.com/your_id/20000.json tomorrow.
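
To avoid partial sitemaps, generate the file in one pass from the complete list of uploaded feed URLs. Here's a minimal sketch using the Python standard library:

Python

import xml.etree.ElementTree as ET
from datetime import datetime, timezone

def write_sitemap(feed_urls, out_path="sitemap.xml"):
    # Build the sitemap from the complete list of uploaded feed URLs,
    # so a partial data upload never yields a partial sitemap.
    lastmod = datetime.now(timezone.utc).isoformat(timespec="seconds")
    urlset = ET.Element("urlset", xmlns="http://www.sitemaps.org/schemas/sitemap/0.9")
    for feed_url in feed_urls:
        url = ET.SubElement(urlset, "url")
        ET.SubElement(url, "loc").text = feed_url  # stable path, unique per batch job
        ET.SubElement(url, "lastmod").text = lastmod
    ET.ElementTree(urlset).write(out_path, encoding="UTF-8", xml_declaration=True)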

Example sitemap

Here's an example sitemap.xml file that serves data feed files:

Example 1: Entities grouped by merchant (recommended).

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
   <loc>https://your_fulfillment_url.com/restaurant_1.ndjson</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
 <url>
   <loc>https://your_fulfillment_url.com/restaurant_2.ndjson</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
 <url>
   <loc>https://your_fulfillment_url.com/restaurant_3.ndjson</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
</urlset>

Example 2: Entities grouped by type.

XML

<?xml version="1.0" encoding="UTF-8"?>
<urlset xmlns="http://www.sitemaps.org/schemas/sitemap/0.9">
 <url>
   <loc>https://your_fulfillment_url.com/restaurant.json</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
 <url>
   <loc>https://your_fulfillment_url.com/menu.json</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
 <url>
   <loc>https://your_fulfillment_url.com/service.json</loc>
   <lastmod>2018-06-11T10:46:43+05:30</lastmod>
 </url>
</urlset>
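
For HTTPS hosting, the server must protect the files with Basic Authentication and expose a last-modified response header on sitemap.xml. Here's a minimal sketch using Flask (an assumption; any HTTPS-capable server works), with placeholder credentials. Run it behind TLS:

Python

import os
from datetime import datetime, timezone
from email.utils import format_datetime

from flask import Flask, Response, request

app = Flask(__name__)

USERNAME = "feed-user"   # placeholder Basic Authentication credentials
PASSWORD = "change-me"

@app.route("/sitemap.xml")
def sitemap():
    # Reject requests that lack the expected Basic Authentication credentials.
    auth = request.authorization
    if not auth or auth.username != USERNAME or auth.password != PASSWORD:
        return Response(status=401,
                        headers={"WWW-Authenticate": 'Basic realm="feeds"'})
    with open("sitemap.xml", "rb") as sitemap_file:
        body = sitemap_file.read()
    # Google reads last-modified to decide whether the feeds are ready to fetch.
    mtime = datetime.fromtimestamp(os.path.getmtime("sitemap.xml"), tz=timezone.utc)
    return Response(body, mimetype="application/xml",
                    headers={"Last-Modified": format_datetime(mtime, usegmt=True)})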

Update your data feeds

After your data feeds are connected, Google checks for updates once each hour but ingests all data feeds only when the marker.txt or sitemap.xml file has been modified. Update your data feeds once a day to prevent stale inventory.

To indicate that the data feeds have been modified and are ready for batch ingestion, update the last-modified object metadata field of the marker.txt file (for Google Cloud Storage and Amazon S3) or the last-modified response header of the sitemap.xml file (for HTTPS). Google uses these values to determine how fresh a data feed is.

As the batch feed is ingested:

  • New entities that don't exist in your current Ordering End-to-End inventory and have no errors are inserted.
  • Entities already present in the inventory are updated if they have no ingestion errors and either their dateModified is more recent than the current entry's or, when they have no dateModified, the feed ingestion start time is more recent than the current entry's. Otherwise, they are marked as stale.
  • Entities that were part of a previous feed but are no longer included in the batch feed being processed are deleted, provided there are no file-level errors in the feed.
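
The update rule is dense, so here's an illustrative restatement of the decision logic in Python. This is a sketch of the rules as described above, not Google's implementation; entities absent from the feed are handled by the deletion rule instead:

Python

def ingest_action(in_inventory, has_errors, date_modified,
                  current_entry_time, ingestion_start_time):
    # Illustrative restatement of the rules above, not Google's implementation.
    if has_errors:
        return "skip"        # erroneous entities are not applied
    if not in_inventory:
        return "insert"      # new, error-free entities are inserted
    # Fall back to the feed ingestion start time when dateModified is absent.
    effective_time = date_modified or ingestion_start_time
    return "update" if effective_time > current_entry_time else "stale"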

Update the timestamp or the last-modified response header only after all of your data feed files are generated and updated. Limit the batch jobs that update your data feeds to run only once a day, or leave a gap of at least three hours between batch jobs. If you don't take these steps, Google might fetch stale files.