Depending on your inventory, sharding (or breaking up feeds into multiple files) may be necessary.
When to use sharding
- Feed exceeds 200 MB for 1 file (after gzip compression); see the size-check sketch after this list.
  - Example: A generated availability feed is 1 GB. This should be sharded into 5+ separate files (or shards).
- Partner inventory is distributed across systems and/or regions, making it difficult to reconcile into a single feed.
  - Example: Partner has US and EU inventory that live in separate systems. The feed may be generated with 2 files (or shards), 1 for US and 1 for EU, with the same nonce and generation_timestamp.
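If you want to automate the size check, the minimal Python sketch below estimates a feed's gzip-compressed size against the 200 MB limit. The file name and helper are illustrative assumptions, not part of any feed tooling.

```python
import gzip

# Illustrative file name; the 200 MB limit applies per file after gzip compression.
FEED_PATH = "event.feeddata.v1_1728306001.json"
MAX_GZIPPED_BYTES = 200 * 1024 * 1024

def gzipped_size(path: str) -> int:
    """Return the file's size after gzip compression, in bytes."""
    with open(path, "rb") as src:
        return len(gzip.compress(src.read()))

size = gzipped_size(FEED_PATH)
if size > MAX_GZIPPED_BYTES:
    # Ceiling division gives a rough lower bound on the number of shards needed.
    shards_needed = -(-size // MAX_GZIPPED_BYTES)
    print(f"Shard this feed into at least {shards_needed} files ({size} bytes gzipped)")
else:
    print(f"No sharding required ({size} bytes gzipped)")
```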
General rules
- Each shard cannot exceed 200 MB for 1 file (after gzip compression).
- We recommend no more than 20 shards per feed. If you have a business justification that requires more than that amount, please contact support for further instruction.
- Individual records (one Merchant object, for example) must be sent in one shard; they cannot be split across multiple shards. However, they don't have to be sent in the shard with the same shard_number in future feeds.
- For better performance, your data should be split evenly among the shards so that all sharded files are similar in size, as sketched below.
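One simple way to honor the last two rules (whole records per shard, evenly sized shards) is a round-robin split. The helper below is an illustrative Python sketch under that assumption, not part of any SDK.

```python
from typing import Any, Dict, List

def split_evenly(records: List[Dict[str, Any]], shard_count: int) -> List[List[Dict[str, Any]]]:
    """Round-robin whole records into shard_count buckets.

    Each record lands in exactly one shard, and shard sizes differ by at
    most one record, so the resulting files are similar in size (assuming
    individual records are of comparable size).
    """
    shards: List[List[Dict[str, Any]]] = [[] for _ in range(shard_count)]
    for index, record in enumerate(records):
        shards[index % shard_count].append(record)
    return shards

# Illustrative data: five Merchant-like records spread across two shards.
merchants = [{"id": f"merchant-{n}"} for n in range(1, 6)]
for shard_number, shard in enumerate(split_evenly(merchants, 2)):
    print(shard_number, [record["id"] for record in shard])
```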
How to shard feeds
You can shard the events feed by splitting a single JSON file into separate JSON files with non-overlapping events and updating the file descriptor JSON with the list of JSON file names.
Recommended: For each file (or shard), set the filename to indicate the feed type, the timestamp and the shard number. Shards should be roughly equal in size and are processed once all shards are uploaded.
Sharded example
File descriptor - event.feeddata.v1_1728306001.filedescriptor.json
{ "generation_timestamp": 1728306001, "name": "event.feeddata.v1", "data_file": [ "event.feeddata.v1_1728306001_001.json", "event.feeddata.v1_1728306001_002.json" ] }
Shard 0 - event.feeddata.v1_1728306001_001.json
{ "data": [ { "id": "event-1", ... }, { "id": "event-2", ... } ] }
Shard 1 - event.feeddata.v1_1728306001_002.json
{ "data": [ { "id": "event-3", ... }, { "id": "event-4", ... } ] }
Shards for partner distributed inventory
It can be challenging for partners to consolidate inventory distributed across multiple systems and/or regions into a single feed. Sharding can resolve these reconciliation challenges by setting each shard to match each distributed system's inventory set.
For example, say a partner's inventory is separated into 2 regions (US and EU inventory), which live in 2 separate systems.
The partner can break each feed into 2 files (or shards), 1 for US inventory and 1 for EU inventory.
Use the following steps to ensure the feeds are properly processed:
- Decide on an upload schedule, and configure each instance of inventory to follow the schedule.
- Assign unique shard numbers for each instance (e.g. US = N, EU = N + 1). Set total_shards to the total number of shards.
- At each scheduled upload time, decide on a generation_timestamp and nonce. Set all shards to hold the same values for these two fields, and list all expected file names in the descriptor file. generation_timestamp should be current or in the recent past (ideally, the partner's read-at database timestamp).
- After all shards are uploaded, Google groups the shards using generation_timestamp and nonce.
Google will process the feed as one even though each shard represents a different region of the partner's inventory and could be uploaded at a different time of the day, as long as the generation_timestamp is the same across all shards.
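As a sketch of this coordination, the Python below shows two regional systems agreeing on one set of upload values and deriving their own shard file names from it. The helper names and the dictionary layout are assumptions for illustration; where nonce and total_shards are carried depends on your feed schema.

```python
import time
import uuid

FEED_NAME = "event.feeddata.v1"

def shared_upload_values(total_shards: int) -> dict:
    """Values every regional system must share for one upload cycle.

    How the values are distributed (shared scheduler, config store, etc.)
    is up to the partner; the field names mirror the ones discussed above
    rather than a specific SDK.
    """
    return {
        "generation_timestamp": int(time.time()),  # current or recent past
        "nonce": uuid.uuid4().hex,                  # identical across all shards
        "total_shards": total_shards,
    }

def shard_file_name(shared: dict, shard_number: int) -> str:
    """Recommended naming: feed name, generation timestamp, shard number."""
    return f"{FEED_NAME}_{shared['generation_timestamp']}_{shard_number:03d}.json"

shared = shared_upload_values(total_shards=2)
us_file = shard_file_name(shared, 1)  # the US system writes and uploads this shard
eu_file = shard_file_name(shared, 2)  # the EU system writes and uploads this shard
# Both shards carry the same generation_timestamp and nonce, so Google can
# group them into one feed even if they are uploaded at different times of day.
print(us_file, eu_file)
```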