About Population Dynamics Insights data

Understand the Data

While the embeddings are available for multiple countries, the schema remains consistent across all datasets. The embeddings are organized into separate BigQuery listings for each country.

Anatomy of the Embedding Vector

The features column is a 330-dimensional vector (stored as a REPEATED FLOAT array in BigQuery). Each section of the array corresponds to a specific data signal extracted by the Population Dynamics model.

Knowing this layout lets you perform feature ablation: for instance, measuring how much of a sales model's predictive power comes from search behavior as opposed to weather.

Vector indices | Data source | Description
0 – 127 | Aggregated Search Trends | Captures regional interests and concerns (for instance, searches for "gym," "flu symptoms," "luxury goods").
128 – 255 | Maps and Busyness | Captures the built environment (POIs like hospitals, parks, schools) and human activity density.
256 – 329 | Weather and Air Quality | Captures environmental context (Temperature, Precipitation, AQI, Wind).
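
The index ranges in the table can be written as NumPy slices. The following is a minimal sketch, assuming the embeddings are already loaded into a 2-D array with one row per region. The names SEGMENTS, split_segments, and ablate are illustrative and not part of the dataset or any Google library, and the random matrix is only a stand-in for real embeddings.

```python
import numpy as np

# Index ranges from the table above (Python slices exclude the end index).
SEGMENTS = {
    "search_trends": slice(0, 128),
    "maps_busyness": slice(128, 256),
    "weather_air_quality": slice(256, 330),
}

def split_segments(embeddings):
    """Return a dict mapping each data source to its block of columns."""
    return {name: embeddings[:, s] for name, s in SEGMENTS.items()}

def ablate(embeddings, segment):
    """Return a copy of the matrix with one segment's columns set to zero."""
    out = embeddings.copy()
    out[:, SEGMENTS[segment]] = 0.0
    return out

# Placeholder data: 4 regions x 330 dimensions (not real embeddings).
rng = np.random.default_rng(0)
X = rng.normal(size=(4, 330))

parts = split_segments(X)
X_no_weather = ablate(X, "weather_air_quality")
```

Training one model on X and another on X_no_weather, then comparing held-out error, gives a rough estimate of how much the weather segment contributes. Zeroing columns is only one way to ablate; dropping the columns entirely is an equally reasonable choice.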

Key columns and metadata

The embeddings table contains spatial metadata enabling geospatial analysis, filtering, and interoperability with other Google Maps Platform services.

  • geo_id: The primary identifier for the region. For S2 cell datasets, this is the S2 cell token represented as a hexadecimal string (for instance, '80ead45'). Use this as your primary join key.
  • geo_name: The human-readable name for the region. Note: S2 cells are geometric constructs without standard names, so for S2 grid datasets this column contains the same token as geo_id. This is by design, to keep the column structure consistent across all Population Dynamics offerings.
  • administrative_area_level_1_id: The unique Google Maps Place ID for the top-level administrative boundary (for instance, State or Province).
  • administrative_area_level_1_name: The human-readable name for the top-level boundary (for instance, 'California').
  • administrative_area_level_2_id: The unique Google Maps Place ID for the secondary administrative boundary (for instance, County or District).
  • administrative_area_level_2_name: The human-readable name for the secondary boundary (for instance, 'Tulare County').
  • features: The core 330-dimensional embedding vector, stored natively as an ARRAY<FLOAT64>. To work with it in pandas, expand the array into one column per dimension or convert it to a NumPy matrix.
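
The conversion of the features column can be sketched as follows. This assumes the column arrives in pandas as one list or array of 330 floats per row. The DataFrame here is built by hand with made-up values, and the column names f0 through f329 are an illustrative choice, not part of the dataset.

```python
import numpy as np
import pandas as pd

# Placeholder rows shaped like the embeddings table (values are made up).
df = pd.DataFrame({
    "geo_id": ["80ead45", "80ead47"],
    "features": [[0.0] * 330, [1.0] * 330],
})

# Stack the per-row arrays into one (n_regions, 330) float matrix.
X = np.array(df["features"].tolist(), dtype=np.float64)

# Optionally expand into one column per dimension, indexed by geo_id.
wide = pd.DataFrame(
    X,
    index=df["geo_id"],
    columns=[f"f{i}" for i in range(X.shape[1])],
)
```

The matrix form suits most modeling libraries, while the wide DataFrame keeps geo_id attached for joins with your own data.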

Frequently asked questions (FAQ)

Can I access the raw input data (for example, specific search queries or mobility traces)?

No. The Population Dynamics Insights embeddings are generated from aggregated, privacy-preserving signals. To ensure user privacy, we don't provide specific user traces, individual search histories, or raw movement patterns. The embeddings provide a latent representation of these behaviors, optimized for modeling and prediction, rather than raw analytics.

Are the vector dimensions interpretable (for example, is Dimension 5 "Coffee")?

The vectors are latent representations, meaning they capture abstract patterns rather than specific, human-readable labels. While we know that indices 0–127 derive from Search Trends, a specific index (like index 5) does not map one-to-one to a single keyword like "Coffee." Instead, it represents a complex feature of search behavior learned by the model.

Does the dataset include polygon boundaries (Shapefiles)?

No. The dataset provides S2 cell IDs (geo_id) and Place IDs for the level 1 and level 2 administrative regions, but it does not include polygon geometry (WKT or shapefiles) for the regions.

  • For Visualization: You can plot the centroids directly using tools like BigQuery GeoViz, or use geometry libraries to calculate the S2 polygon from the hex token.
  • For Spatial Joins: If you need precise boundary operations (for example, ST_CONTAINS), we recommend joining this dataset with public boundary datasets (available in BigQuery Public Data).