Developer Documentation
Product Description
Population Dynamics Insights (PDI) is an embeddings dataset that distills data about human behavior and our interaction with the environment into concise, analysis-ready embeddings (or "digital fingerprints") at specific locations.
These embeddings capture patterns in aggregated data such as search trends, busyness trends, and environmental conditions (maps, air quality, weather), providing a rich, location-specific snapshot of how populations engage with their surroundings. Aggregated over space and time, these embeddings ensure privacy while enabling nuanced spatial analysis and prediction for applications ranging from public health to socioeconomic modeling.
Product Overview
Population dynamics embeddings are generated using a purpose-built machine learning model, trained on a rich set of features and converted into a condensed vector representation. These embeddings are trained on and generated from:
- Aggregated Search Trends: Regional interests and concerns reflected in search data.
- Aggregated Maps Data (including busyness): Amenities, services and businesses in the regions along with local visitation trends.
- Aggregated Weather and Air Quality: Climate-related metrics, including temperature and air quality.
These features are aggregated at the postal code level to generate localized, context-aware embeddings that preserve privacy. PDI is an ongoing, time-series dataset, with new data slices processed and partitioned monthly. Data is refreshed and appended to the data table by the last day of the following calendar month (for example, February data published no later than March 31).
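The refresh rule above can be turned into a small date calculation. This is a minimal sketch based only on the schedule stated here; the function name `publish_deadline` is hypothetical and not part of any PDI API.

```python
import calendar
from datetime import date


def publish_deadline(data_month: date) -> date:
    """Return the last day of the calendar month that follows data_month."""
    # Roll over to January of the next year when the data month is December.
    if data_month.month == 12:
        year, month = data_month.year + 1, 1
    else:
        year, month = data_month.year, data_month.month + 1
    last_day = calendar.monthrange(year, month)[1]
    return date(year, month, last_day)


# February 2026 data is expected no later than March 31, 2026.
print(publish_deadline(date(2026, 2, 1)))  # 2026-03-31
```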
Prerequisites
To access Population Dynamics embeddings, you must be granted access. If you don't have access, reach out to your sales or customer engineering representative.
- Enable Analytics Hub API in Cloud Console.
- Enable BigQuery API in Cloud Console.
- Have working knowledge of the BigQuery product.
- Ensure that your account has the Analytics Hub Subscription Owner (roles/analyticshub.subscriptionOwner) role to perform subscriber tasks.
- Ensure that your account has the BigQuery User (roles/bigquery.user) role to create datasets.
Recommended Training
If you are new to working with embeddings or BigQuery Machine Learning, we highly recommend completing the following training materials before beginning your analysis:
- Machine Learning Crash Course: Embeddings: A foundational, fast-paced overview of how machine learning models use embeddings to translate high-dimensional data into lower-dimensional space while preserving semantic relationships.
- Getting Started with Vector Search and Embeddings: A practical Google Cloud Skills Boost lab that introduces vector embeddings, semantic similarity, and how to utilize embeddings within the broader Google Cloud ecosystem.
- BigQuery Machine Learning (BQML) Tutorials: Because the PDI dataset is hosted in BigQuery, BQML lets you train and run machine learning models directly on the embeddings using standard SQL, without exporting the data.
Use the Embeddings
Understand the Data
Before starting your analysis, take a moment to review the schema structure.
Dataset Organization
The embeddings are organized into separate BigQuery tables for each country or test region.
Anatomy of the Embedding Vector
The features column is a 330-dimensional vector (stored as a REPEATED FLOAT
array in BigQuery). Each section of the array corresponds to a specific data
signal extracted by the Population Dynamics model.
Understanding this structure allows for feature ablation (for example, determining how much Search behavior predicts sales compared to Weather).
| Vector Indices | Data Source | Description |
|---|---|---|
| 0 – 127 | Aggregated Search Trends | Captures regional interests and concerns (for example, searches for "gym," "flu symptoms," "luxury goods"). |
| 128 – 255 | Maps & Busyness | Captures the built environment (POIs like hospitals, parks, schools) and human visitation to show places of interest. |
| 256 – 329 | Weather & Air Quality | Captures environmental context (Temperature, Precipitation, Air Quality). |
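The index ranges in the table can be applied as plain slices. The sketch below splits one vector into its three blocks and drops a block for an ablation run. The names `BLOCKS`, `split_blocks`, and `ablate` are hypothetical helpers, and the input vector is a stand-in rather than real PDI data.

```python
# Index ranges taken from the table above; Python slices exclude the end index.
BLOCKS = {
    "search": slice(0, 128),
    "maps_busyness": slice(128, 256),
    "weather_air_quality": slice(256, 330),
}


def split_blocks(features):
    """Split one 330-dimensional embedding into its three source blocks."""
    if len(features) != 330:
        raise ValueError(f"expected 330 dimensions, got {len(features)}")
    return {name: list(features[block]) for name, block in BLOCKS.items()}


def ablate(features, drop):
    """Return the vector with one named block removed."""
    return [
        value
        for name, block in BLOCKS.items()
        if name != drop
        for value in features[block]
    ]


vector = [float(i) for i in range(330)]  # stand-in for a real `features` value
parts = split_blocks(vector)
print({name: len(values) for name, values in parts.items()})
# {'search': 128, 'maps_busyness': 128, 'weather_air_quality': 74}
print(len(ablate(vector, "weather_air_quality")))  # 256
```

Training one model on the full vector and another on the ablated vector, then comparing their accuracy, gives a rough estimate of how much the dropped signal contributes.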
Key Columns & Metadata
The embeddings table contains spatial and temporal metadata enabling geospatial analysis, filtering, and interoperability with other services.
Because a single postal code can occasionally cross administrative boundaries (like county lines), the administrative area fields are provided as arrays.
- geo_id: The unique Place ID associated with this postal code.
- geo_name: The postal code string for the region (for example, '90210').
- administrative_area_level_1_names: A list (ARRAY<STRING>) of human-readable names for the top-level boundaries (for example, ['California']).
- administrative_area_level_1_ids: A list (ARRAY<STRING>) of unique Place IDs for the top-level administrative boundaries this postal code intersects (for example, State or Province).
- administrative_area_level_2_names: A list (ARRAY<STRING>) of human-readable names for the secondary boundaries (for example, ['Los Angeles County']).
- administrative_area_level_2_ids: A list (ARRAY<STRING>) of unique Place IDs for the secondary administrative boundaries this postal code intersects (for example, County or District).
- features: The core 330-dimensional embedding vector, stored natively as an ARRAY<FLOAT64>. Loading this into Pandas using Python requires flattening or converting to a NumPy matrix.
- snapshot_date: A DATE formatted as YYYY-MM-DD, standardized to only use the first day of the month. Represents the specific monthly time slice from which the input features were aggregated to generate the embeddings. For example, data from April 2026 is formatted as 2026-04-01.
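Because snapshot_date always falls on the first day of a month, it can help to normalize dates in your own data to the same form before filtering or joining. A minimal sketch; the helper name `to_snapshot_date` is hypothetical.

```python
from datetime import date


def to_snapshot_date(d: date) -> str:
    """Map any calendar date to the first-of-month string used by snapshot_date."""
    return d.replace(day=1).isoformat()


print(to_snapshot_date(date(2026, 4, 17)))  # 2026-04-01
```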
Prepare Ground Truth Data
To use Population Dynamics embeddings, your ground truth data must be aggregated to a supported geographic boundary (postal codes).
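As an illustration of that aggregation step, the sketch below sums record-level values into one total per postal code. The record layout, the sample values, and the function name `aggregate_by_postal_code` are assumptions for this example. Depending on the metric, a mean or a rate may be more appropriate than a sum.

```python
from collections import defaultdict


def aggregate_by_postal_code(records):
    """Sum a numeric value per postal code; records are (postal_code, value) pairs."""
    totals = defaultdict(float)
    for postal_code, value in records:
        # Store keys as strings so they match the string-typed geo_name column.
        totals[str(postal_code)] += value
    return dict(totals)


# Hypothetical store-level revenue; two stores share postal code 90210.
records = [("90210", 120.0), ("94043", 80.0), ("90210", 30.0)]
print(aggregate_by_postal_code(records))
# {'90210': 150.0, '94043': 80.0}
```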
Option 1: Incorporate Embeddings into an Existing Model
- Prepare Existing Model-Based Ground Truth: Use the embeddings as geospatial covariates to enhance an existing model.
- Train an Error Correction Model: Improve an existing model by integrating the embeddings into a model that takes the original model output, the expected value or ground truth, and the embeddings to learn a new error correction model.
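One way to set up the error correction step is to train a second model on the residual (ground truth minus the original prediction), using the original prediction and the embedding as inputs. The sketch below only prepares the training rows. Targeting the residual is an assumption of this example rather than a documented PDI requirement, and the function names and toy values are hypothetical.

```python
def build_correction_rows(base_predictions, ground_truth, embeddings):
    """Build (inputs, residual targets) for an error-correction model.

    All three arguments are dicts keyed by postal code.
    """
    inputs, targets = [], []
    for postal_code, truth in ground_truth.items():
        if postal_code not in base_predictions or postal_code not in embeddings:
            continue  # skip postal codes that lack a prediction or an embedding
        base = base_predictions[postal_code]
        inputs.append([base] + list(embeddings[postal_code]))
        targets.append(truth - base)  # assumption: the model learns the residual
    return inputs, targets


def apply_correction(base_prediction, predicted_residual):
    """Combine the original prediction with the learned correction."""
    return base_prediction + predicted_residual


base = {"90210": 10.0, "94043": 7.0}
truth = {"90210": 12.5, "94043": 6.0, "10001": 3.0}
emb = {"90210": [0.1, 0.2], "94043": [0.3, 0.4]}  # toy 2-dim vectors; real ones have 330
X, y = build_correction_rows(base, truth, emb)
print(X)  # [[10.0, 0.1, 0.2], [7.0, 0.3, 0.4]]
print(y)  # [2.5, -1.0]
```

The resulting `X` and `y` can be passed to any regressor. At prediction time, `apply_correction` adds the predicted residual back to the original model output.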
Option 2: Tune for Specific Use Cases
- Choose a Prediction Model Type: Any model, such as GBDT, MLP, or linear, can be used for predictions.
- Use Embeddings for Prediction: Use Population Dynamics embeddings as input features, alongside other contextual data, to improve prediction accuracy.
Quickstart Code Snippets
Use these snippets to verify your access and understand the data format.
1. SQL: Fetching Embeddings for a Specific Month
Because PDI is a time-series dataset, you should typically filter by
snapshot_date so you don't return duplicate postal codes across multiple
months. The day must always be set to -01.
SELECT
snapshot_date,
geo_name AS postal_code,
geo_id AS place_id,
features -- The 330-dim vector
FROM
`your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table`
WHERE
snapshot_date = '2025-10-01' -- You must use the first of the month ('-01')
LIMIT 10;
2. SQL: Filtering by Administrative Area (Unnesting Arrays)
Because postal codes can span multiple administrative boundaries, the
administrative_area_* fields are stored as arrays. To filter for all postal
codes within a specific state (for example, 'California'), you must use
BigQuery's UNNEST() function.
SELECT
snapshot_date,
geo_name AS postal_code,
admin1_name
FROM
`your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table`,
UNNEST(administrative_area_level_1_names) AS admin1_name
WHERE
-- On or after October 2025
snapshot_date >= '2025-10-01' -- You must use the first of the month ('-01')
AND admin1_name = 'California'
LIMIT 10;
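To make the effect of UNNEST concrete, the following sketch reproduces the same flattening in plain Python on a few made-up rows. The postal code '00000' and its two-state membership are hypothetical, chosen only to show that a postal code spanning two areas produces two rows.

```python
rows = [
    {"geo_name": "90210", "administrative_area_level_1_names": ["California"]},
    # Hypothetical postal code that crosses a state line.
    {"geo_name": "00000", "administrative_area_level_1_names": ["California", "Nevada"]},
    {"geo_name": "10001", "administrative_area_level_1_names": ["New York"]},
]

# Equivalent of `FROM table, UNNEST(names) AS admin1_name`:
# one output row per (postal code, admin name) pair.
unnested = [
    (row["geo_name"], name)
    for row in rows
    for name in row["administrative_area_level_1_names"]
]

california = [geo for geo, name in unnested if name == "California"]
print(len(unnested))  # 4
print(california)     # ['90210', '00000']
```

Without the filter on the admin name, a postal code that spans several areas appears once per area, so counts taken over the unnested rows can exceed the number of postal codes.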
3. SQL: Finding Similar Locations
This query identifies behaviorally similar locations without requiring external
data. It uses the ML.DISTANCE function with the 'COSINE' option to compute
cosine distance, converts it to cosine similarity (1 - distance), and returns
the top matches for a target postal code.
WITH TargetLocation AS (
SELECT features AS target_vector
FROM `your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table`
WHERE snapshot_date = '2025-10-01' -- You must use the first of the month ('-01')
AND geo_name = '90210' -- Replace with your target postal code
LIMIT 1
)
SELECT
t.geo_name AS postal_code,
-- Cosine similarity: 1.0 means the vectors point the same way; lower values mean less similar
(1 - ML.DISTANCE(t.features, p.target_vector, 'COSINE')) AS similarity_score
FROM
`your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table` t,
TargetLocation p
WHERE
t.snapshot_date = '2025-10-01' -- You must use the first of the month ('-01')
AND t.geo_name != '90210' -- Exclude the target itself
ORDER BY
similarity_score DESC
LIMIT 20;
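For reference, the score computed by the query is standard cosine similarity, which can be reproduced outside BigQuery. This is a minimal pure-Python sketch; the function name is hypothetical and the two-dimensional vectors are toy inputs. In general the score ranges from -1.0 to 1.0.

```python
import math


def cosine_similarity(a, b):
    """Cosine similarity of two equal-length, non-zero vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b)


print(cosine_similarity([1.0, 0.0], [1.0, 0.0]))   # 1.0  (same direction)
print(cosine_similarity([1.0, 0.0], [0.0, 1.0]))   # 0.0  (orthogonal)
print(cosine_similarity([1.0, 0.0], [-1.0, 0.0]))  # -1.0 (opposite direction)
```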
4. SQL: Joining Customer Data
This example demonstrates how to enrich your own internal data (for example, a Store Performance table) with behavioral embeddings by joining on the postal code.
SELECT
store.store_id,
store.postal_code,
store.total_revenue,
embeddings.features AS pdi_vector
FROM
`your-project.internal_data.store_performance` AS store
JOIN
`your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table` AS embeddings
ON
store.postal_code = embeddings.geo_name
WHERE
embeddings.snapshot_date = '2025-10-01' -- You must use the first of the month ('-01')
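A join on postal code matches only when both sides use the same string format. The sketch below normalizes US ZIP codes before loading them into the internal table. It assumes geo_name holds five-character strings for US postal codes, as in the '90210' example above; the helper name is hypothetical.

```python
def normalize_us_postal_code(value) -> str:
    """Return a 5-character, zero-padded string for a US ZIP code."""
    text = str(value).strip()
    # Keep only the 5-digit part of ZIP+4 values such as "90210-1234".
    text = text.split("-")[0]
    # Restore leading zeros lost when a ZIP code was stored as an integer.
    return text.zfill(5)


print(normalize_us_postal_code(2139))          # 02139
print(normalize_us_postal_code("90210-1234"))  # 90210
```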
5. Python: Loading Data for Machine Learning
The embeddings are stored as BigQuery Arrays. To use them in ML libraries, you must convert the column into a NumPy matrix.
from google.cloud import bigquery
import numpy as np
import pandas as pd
client = bigquery.Client()
query = """
SELECT
geo_name,
features -- Returns as a list of floats
FROM
`your-project.population_dynamics___us___domestic.v1_postal_code.embeddings_table`
WHERE
snapshot_date = '2025-10-01' -- You must use the first of the month ('-01')
LIMIT 1000
"""
# 1. Load data into DataFrame
df = client.query(query).to_dataframe()
# 2. Convert the 'features' column (Series of Lists) into a Matrix (2D Array)
X_matrix = np.stack(df['features'].values)
print(f"Data Loaded. Matrix Shape: {X_matrix.shape}")
# Output: Data Loaded. Matrix Shape: (1000, 330)
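Stacking into a matrix works only when every row holds a vector of the same length. As a defensive step, the sketch below separates usable rows from rows with a missing or short vector before stacking. Whether such rows occur in PDI is not stated in this document, so treat the check as optional. The function name and sample rows are hypothetical.

```python
EXPECTED_DIM = 330  # dimensionality stated in this document


def split_valid_rows(rows):
    """Separate rows with a full-length vector from rows that would break stacking."""
    valid, rejected = [], []
    for geo_name, features in rows:
        if features is not None and len(features) == EXPECTED_DIM:
            valid.append((geo_name, features))
        else:
            rejected.append(geo_name)
    return valid, rejected


rows = [
    ("90210", [0.0] * 330),
    ("94043", None),          # missing vector
    ("10001", [0.0] * 12),    # wrong length
]
valid, rejected = split_valid_rows(rows)
print(len(valid), rejected)  # 1 ['94043', '10001']
```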
Frequently Asked Questions (FAQ)
Can I access the raw input data (for example, specific search queries or mobility traces)?
No. The Population Dynamics embeddings are generated from aggregated, privacy-preserving signals. To ensure user privacy, we don't provide specific user traces, individual search histories, or raw movement patterns. The embeddings provide a latent representation of these behaviors, optimized for modeling and prediction, rather than raw analytics.
How do you select the search terms used to generate the embeddings?
We use Knowledge Graph (KG) entities rather than raw search queries. For example, queries like "taylor swift boyfriend" and "kc tight end" would both map to the same underlying KG entity ("Travis Kelce"). This approach is language-agnostic, captures broader semantic categories, and significantly enhances privacy.
Are the vector dimensions interpretable (for example, is dimension 5 "Coffee")?
No, the vectors are latent representations. Because the features are learned by the machine learning model, there is no simple semantic mapping or one-to-one translation from a final vector index to a specific source input. While we know which blocks of indices derive from which datasets (for example, indices 0–127 represent Search Trends), a specific index like Index 5 does not map to a single keyword. Instead, it represents a complex, abstract feature learned by the model.
Does the dataset include polygon boundaries (Shapefiles)?
No. The dataset provides postal codes (geo_name) and their associated Place IDs
(geo_id), but it doesn't include raw polygon geometries (such as WKT).
Depending on your use case, we recommend the following approaches:
- For Visualization on Google Maps: You can use the Place IDs provided in geo_id to style and render the boundaries directly on a map using Data-driven Styling. While these boundaries are ideal for visual display, they cannot be exported as raw geometry files.
- For Spatial Joins & Analysis: If you need raw spatial polygons, we recommend joining this dataset with public boundary datasets (such as those available in BigQuery public datasets) using the geo_name postal code.
What is the time window of the embeddings dataset?
The PDI embeddings are updated monthly with each new month appended to the
dataset. Data is represented using the snapshot_date column (formatted as
YYYY-MM-DD), providing a stable baseline that reflects the behavioral and
physical fingerprint of a location for that given month.