Prepare Ground Truth Data
To use Population Dynamics embeddings, your ground truth data must be aggregated to a supported geographic boundary. Because administrative boundary types vary globally, you can align your data using either universal mathematical grid systems (like S2 cells) or local administrative regions (such as counties or districts, depending on the specific country dataset).
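For example, if your ground truth arrives as point observations, you can roll it up to one row per S2 cell before joining against the embeddings. A minimal pandas sketch, assuming a hypothetical `s2_token` column has already been assigned to each point (for instance with the s2sphere library, or BigQuery's S2_CELLIDFROMPOINT function):

```python
import pandas as pd

# Hypothetical point-level ground truth, already tagged with S2 cell tokens.
points = pd.DataFrame({
    "s2_token": ["80ead45", "80ead45", "80ead47"],
    "observed_value": [10.0, 14.0, 8.0],
})

# Aggregate to one row per S2 cell so the table can join against the embeddings.
ground_truth = (
    points.groupby("s2_token", as_index=False)["observed_value"]
          .mean()
          .rename(columns={"observed_value": "label"})
)
print(ground_truth)
```

The resulting one-row-per-cell table can then be joined to the embeddings table on the token string.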
Option 1: Incorporate Embeddings into an Existing Model
- Prepare existing model-based ground truth: Use the embeddings as geospatial covariates to enrich the feature set of a model you already run.
- Train an error correction model: Improve an existing model by training a secondary model that takes the original model's output and the embeddings as inputs, and learns to correct the residual between that output and the ground truth.
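The error correction approach can be sketched as follows, using synthetic data and a plain least-squares correction (all array names here are illustrative, and a GBDT or MLP could stand in for the linear fit):

```python
import numpy as np

rng = np.random.default_rng(0)

# Synthetic stand-ins (illustrative names): an existing model's per-cell
# predictions, truncated 16-dim "embeddings", and the ground truth values.
n, d = 200, 16
embeddings = rng.normal(size=(n, d))
base_pred = rng.normal(size=n)
ground_truth = base_pred + 0.5 * embeddings[:, 0] + rng.normal(scale=0.1, size=n)

# Error correction: fit the residual (truth - base prediction) from the
# embeddings, then add the learned correction back onto the base model.
residual = ground_truth - base_pred
X = np.column_stack([np.ones(n), embeddings])        # bias + embedding features
coef, *_ = np.linalg.lstsq(X, residual, rcond=None)  # linear correction model
corrected = base_pred + X @ coef

base_rmse = np.sqrt(np.mean((ground_truth - base_pred) ** 2))
corrected_rmse = np.sqrt(np.mean((ground_truth - corrected) ** 2))
print(f"base RMSE: {base_rmse:.3f}, corrected RMSE: {corrected_rmse:.3f}")
```

Because the correction model only has to explain what the base model misses, it can be much simpler than retraining the base model from scratch.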
Option 2: Tune for Specific Use Cases
- Choose a prediction model: Any model, such as a gradient-boosted decision tree (GBDT), a multilayer perceptron (MLP), or a linear model, can be used for prediction.
- Use embeddings for prediction: Use Population Dynamics embeddings as input features, alongside other contextual data, to improve prediction accuracy.
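As a minimal sketch of the linear option, assuming a NumPy matrix of embeddings stacked alongside one contextual covariate (all data and names here are synthetic and illustrative):

```python
import numpy as np

rng = np.random.default_rng(1)

# Synthetic stand-ins: 16 dims stand in for the 330-dim embedding vectors,
# plus one hypothetical contextual covariate.
n, d = 300, 16
embeddings = rng.normal(size=(n, d))
median_income = rng.normal(size=n)
y = 2.0 * median_income + embeddings @ rng.normal(size=d) + rng.normal(scale=0.1, size=n)

# Stack embeddings alongside contextual data as one feature matrix.
X = np.column_stack([np.ones(n), median_income, embeddings])

# Ridge-regularized linear model (closed form): w = (X'X + lam*I)^-1 X'y
lam = 1e-3
w = np.linalg.solve(X.T @ X + lam * np.eye(X.shape[1]), X.T @ y)
pred = X @ w
rmse = np.sqrt(np.mean((y - pred) ** 2))
print(f"train RMSE: {rmse:.3f}")
```

In practice you would evaluate on a held-out split of cells rather than on the training rows, but the feature-stacking pattern is the same.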
Query examples
Replace `your-project.your_dataset.embeddings_table` in the examples below with your actual project, dataset, and table name.
SQL: Fetch Embeddings
This query retrieves the embedding vector and administrative metadata for the S2 cells in your provisioned dataset.
```sql
SELECT
  geo_id,
  administrative_area_level_1_name AS state,
  administrative_area_level_2_name AS county,
  features  -- The 330-dim vector
FROM `your-project.your_dataset.embeddings_table`
LIMIT 10;
```
SQL: Find Similar Locations
This query identifies behaviorally similar locations without requiring external data. It uses the ML.DISTANCE function to compute cosine distance, converts it to a similarity score, and returns the top matches for a target S2 cell. This approach supports expansion planning scenarios, such as determining where to open a new store based on the profile of a successful existing location.
```sql
WITH TargetLocation AS (
  SELECT features AS target_vector
  FROM `your-project.your_dataset.embeddings_table`
  -- Replace with your target S2 hex token (e.g., '80ead45')
  WHERE geo_id = 'YOUR_TARGET_S2_TOKEN'
)
SELECT
  t.geo_id,
  t.administrative_area_level_1_name AS state,
  t.administrative_area_level_2_name AS county,
  -- Convert cosine distance to similarity (1.0 is identical, 0.0 is dissimilar)
  (1 - ML.DISTANCE(t.features, p.target_vector, 'COSINE')) AS similarity_score
FROM `your-project.your_dataset.embeddings_table` t, TargetLocation p
WHERE t.geo_id != 'YOUR_TARGET_S2_TOKEN'  -- Exclude the target itself
ORDER BY similarity_score DESC
LIMIT 20;
```
SQL: Join Customer Data
This example demonstrates how to enrich your own internal data (for instance, a store performance table) with behavioral embeddings. Ensure your internal data includes matching S2 cell tokens (hex strings).
```sql
SELECT
  store.store_id,
  store.s2_token,
  store.total_revenue,
  embeddings.features AS pdfm_vector
FROM `your-project.internal_data.store_performance` AS store
JOIN `your-project.your_dataset.embeddings_table` AS embeddings
  -- Join based on the S2 hex token string
  ON store.s2_token = embeddings.geo_id;
```
Python: Load Data for Machine Learning
The embeddings are stored as BigQuery arrays. To use them in ML libraries, you must convert the `features` column (a Series of lists) into a 2-D NumPy matrix.
```python
from google.cloud import bigquery
import numpy as np
import pandas as pd

client = bigquery.Client()

query = """
SELECT
  geo_id,
  features  -- Returned as a list of floats
FROM `your-project.your_dataset.embeddings_table`
LIMIT 1000
"""

# 1. Load data into a DataFrame
df = client.query(query).to_dataframe()

# 2. Convert the 'features' column (Series of lists) into a matrix (2-D array)
X_matrix = np.stack(df['features'].values)

print(f"Data Loaded. Matrix Shape: {X_matrix.shape}")
# Output: Data Loaded. Matrix Shape: (1000, 330)
```
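Once loaded, the same similarity search shown in the ML.DISTANCE query can be reproduced in NumPy. A small sketch with a toy matrix standing in for `X_matrix` (the values are made up for illustration):

```python
import numpy as np

# Toy stand-in for the X_matrix loaded above (4 cells, 5-dim vectors).
X_matrix = np.array([
    [1.0, 0.0, 0.0, 1.0, 0.0],
    [0.9, 0.1, 0.0, 1.0, 0.0],
    [0.0, 1.0, 1.0, 0.0, 1.0],
    [0.5, 0.5, 0.5, 0.5, 0.5],
])

# Cosine similarity of every row against a target row (index 0),
# mirroring the ML.DISTANCE query in pure NumPy.
target = X_matrix[0]
norms = np.linalg.norm(X_matrix, axis=1) * np.linalg.norm(target)
similarity = (X_matrix @ target) / norms

# Rank the other rows by similarity (most similar first), excluding the target.
order = np.argsort(-similarity)
order = order[order != 0]
print(order)  # row 1 is most similar to row 0
```

This is convenient for quick local experiments; at full dataset scale, keeping the similarity computation in BigQuery avoids moving the whole table.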