Use the Population Dynamics Insights embeddings

Prepare Ground Truth Data

To use Population Dynamics embeddings, your ground truth data must be aggregated to a supported geographic boundary. Because administrative boundary types vary globally, you can align your data using either universal mathematical grid systems (like S2 cells) or local administrative regions (such as counties or districts, depending on the specific country dataset).
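As a minimal sketch of this aggregation step, the snippet below assumes you already have point-level ground truth tagged with the S2 cell token containing each point (the tokens and values here are placeholders), and rolls it up to one row per cell with pandas so it can later be joined against the embeddings table:

```python
import pandas as pd

# Hypothetical point-level ground truth, already tagged with the S2 cell
# token that contains each point (tokens and values are placeholders).
points = pd.DataFrame({
    "s2_token": ["89c25", "89c25", "89c27"],
    "value": [10.0, 14.0, 8.0],
})

# Aggregate to one row per S2 cell so the data aligns with the embeddings.
ground_truth = (
    points.groupby("s2_token", as_index=False)["value"]
    .mean()
    .rename(columns={"value": "mean_value"})
)
print(ground_truth)
```

Whether you aggregate by mean, sum, or count depends on your metric; the key requirement is one row per geographic boundary.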

Option 1: Incorporate Embeddings into an Existing Model

  • Prepare existing model-based ground truth: Use the embeddings as geospatial covariates to enhance an existing model.
  • Train an error correction model: Improve an existing model by training a second model that takes the original model's output and the embeddings as inputs, with the ground truth as the target, and learns to correct the original model's errors.
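The error-correction idea can be sketched with NumPy alone. This is a toy illustration on synthetic data (the 8-dimensional vectors, the base model's outputs, and the linear error structure are all made up for the example); it fits the residual between ground truth and the base prediction from the base output plus the embeddings via least squares:

```python
import numpy as np

rng = np.random.default_rng(0)

n, d = 200, 8                      # 8-dim stand-in for the 330-dim embeddings
embeddings = rng.normal(size=(n, d))
base_pred = rng.normal(size=n)     # existing model's output (synthetic)

# Synthetic ground truth: base prediction plus an embedding-driven error
true_error = embeddings @ rng.normal(size=d)
ground_truth = base_pred + true_error

# Learn the residual from [base prediction, embeddings] via least squares
X = np.column_stack([base_pred, embeddings])
residual = ground_truth - base_pred
coef, *_ = np.linalg.lstsq(X, residual, rcond=None)

# Corrected prediction = base prediction + learned error estimate
corrected = base_pred + X @ coef
rmse_before = np.sqrt(np.mean((ground_truth - base_pred) ** 2))
rmse_after = np.sqrt(np.mean((ground_truth - corrected) ** 2))
print(f"RMSE before: {rmse_before:.3f}, after: {rmse_after:.3f}")
```

In practice the error-correction model can be any regressor (GBDT, MLP); least squares is used here only to keep the sketch dependency-free.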

Option 2: Tune for Specific Use Cases

  • Choose a prediction model: Any regression model, such as gradient-boosted decision trees (GBDT), a multilayer perceptron (MLP), or a linear model, can be used for predictions.
  • Use embeddings for prediction: Use Population Dynamics embeddings as input features, alongside other contextual data, to improve prediction accuracy.
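A minimal sketch of this workflow, again on synthetic data: the embeddings (here an 8-dimensional stand-in) are stacked alongside a hypothetical contextual feature (`median_income`, invented for the example) and fed to a closed-form ridge regression:

```python
import numpy as np

rng = np.random.default_rng(1)

n, d = 300, 8                        # small stand-in for the 330-dim embeddings
embeddings = rng.normal(size=(n, d))
median_income = rng.normal(size=n)   # hypothetical extra contextual feature
y = embeddings @ rng.normal(size=d) + 0.5 * median_income \
    + rng.normal(scale=0.1, size=n)  # synthetic target

# Embeddings alongside other contextual data as input features
X = np.column_stack([embeddings, median_income])

# Ridge regression, closed form: w = (X'X + alpha*I)^-1 X'y
alpha = 1.0
w = np.linalg.solve(X.T @ X + alpha * np.eye(X.shape[1]), X.T @ y)
pred = X @ w
r2 = 1 - np.var(y - pred) / np.var(y)
print(f"Training R^2: {r2:.3f}")
```

Swap the ridge step for any model you prefer; the pattern of concatenating embeddings with your own features is the same.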

Query examples

Replace your-project.your_dataset.embeddings_table with your actual project, dataset, and target table name.

SQL: Fetch Embeddings

This query retrieves the embedding vector and administrative metadata for the S2 cells in your provisioned dataset.

SELECT
  geo_id,
  administrative_area_level_1_name AS state,
  administrative_area_level_2_name AS county,
  features -- The 330-dim vector
FROM
  `your-project.your_dataset.embeddings_table`
LIMIT 10;

SQL: Find Similar Locations

This query identifies behaviorally similar locations without requiring external data.

It uses the ML.DISTANCE function to compute cosine distance, then converts it to a similarity score (1 - distance), returning the top matches for a target S2 cell. This approach supports expansion planning scenarios, such as determining where to open a new store based on the profile of a successful existing location.

WITH TargetLocation AS (
  SELECT features AS target_vector
  FROM `your-project.your_dataset.embeddings_table`
  -- Replace with your target S2 hex token (e.g., '80ead45')
  WHERE geo_id = 'YOUR_TARGET_S2_TOKEN'
)

SELECT
  t.geo_id,
  t.administrative_area_level_1_name AS state,
  t.administrative_area_level_2_name AS county,
  -- Similarity score (1.0 = identical; lower values = less similar)
  (1 - ML.DISTANCE(t.features, p.target_vector, 'COSINE')) AS similarity_score
FROM
  `your-project.your_dataset.embeddings_table` t,
  TargetLocation p
WHERE
  t.geo_id != 'YOUR_TARGET_S2_TOKEN' -- Exclude the target itself
ORDER BY
  similarity_score DESC
LIMIT 20;
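If you have already pulled the embeddings into Python, the same cosine ranking can be computed client-side with NumPy. The 3-dimensional matrix below is a toy stand-in for the 330-dimensional `features` column:

```python
import numpy as np

# Toy embedding matrix standing in for the 'features' column (rows = locations)
features = np.array([
    [1.0, 0.0, 0.0],
    [0.9, 0.1, 0.0],
    [0.0, 1.0, 0.0],
])
target = features[0]  # pretend row 0 is the target S2 cell

# Cosine similarity of every row against the target (1.0 = identical direction)
norms = np.linalg.norm(features, axis=1) * np.linalg.norm(target)
similarity = features @ target / norms

# Rank locations by similarity, excluding the target itself (row 0)
order = np.argsort(-similarity)
order = order[order != 0]
print(order, similarity[order])
```

This mirrors the `1 - ML.DISTANCE(..., 'COSINE')` expression in the query above, just evaluated locally.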

SQL: Join Customer Data

This example demonstrates how to enrich your own internal data (for instance, a store performance table) with behavioral embeddings. Ensure your internal data includes matching S2 cell tokens (hex strings).

SELECT
  store.store_id,
  store.s2_token,
  store.total_revenue,
  embeddings.features AS pdfm_vector
FROM
  `your-project.internal_data.store_performance` AS store
JOIN
  `your-project.your_dataset.embeddings_table` AS embeddings
ON
  -- Join based on the S2 hex token string
  store.s2_token = embeddings.geo_id;
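The same enrichment can be done in pandas once both tables are loaded as DataFrames. The store table and token values below are hypothetical; the inner merge on the S2 hex token mirrors the SQL join above:

```python
import pandas as pd

# Hypothetical internal store table (token and revenue values are made up)
stores = pd.DataFrame({
    "store_id": [1, 2],
    "s2_token": ["89c25", "89c2f"],
    "total_revenue": [125000.0, 98000.0],
})

# Hypothetical slice of the embeddings table (short vectors for illustration)
embeddings = pd.DataFrame({
    "geo_id": ["89c25", "89c2f", "89c31"],
    "features": [[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],
})

# Inner join on the S2 hex token string, matching the SQL ON clause
enriched = stores.merge(embeddings, left_on="s2_token",
                        right_on="geo_id", how="inner")
print(enriched[["store_id", "total_revenue", "features"]])
```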

Python: Load Data for Machine Learning

The embeddings are stored as BigQuery ARRAY&lt;FLOAT64&gt; values. To use them in ML libraries, convert the column into a NumPy matrix.

from google.cloud import bigquery
import numpy as np
import pandas as pd

client = bigquery.Client()

query = """
    SELECT
        geo_id,
        features -- Returns as a list of floats
    FROM
        `your-project.your_dataset.embeddings_table`
    LIMIT 1000
"""

# 1. Load data into DataFrame
df = client.query(query).to_dataframe()

# 2. Convert the 'features' column (Series of Lists) into a Matrix (2D Array)
X_matrix = np.stack(df['features'].values)

print(f"Data Loaded. Matrix Shape: {X_matrix.shape}")
# Output: Data Loaded. Matrix Shape: (1000, 330)
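Once the matrix is assembled, a common next step before cosine-based methods (similarity search, clustering) is to L2-normalize each row so that dot products become cosine similarities. A small sketch with a 2-column toy matrix standing in for the real (n_rows, 330) one:

```python
import numpy as np

# Toy stand-in for X_matrix (in practice: shape (n_rows, 330))
X_matrix = np.array([[3.0, 4.0],
                     [0.0, 5.0]])

# L2-normalize each row so dot products become cosine similarities
norms = np.linalg.norm(X_matrix, axis=1, keepdims=True)
X_unit = X_matrix / norms

# Pairwise cosine similarity matrix (diagonal = 1.0)
sim = X_unit @ X_unit.T
print(sim)
```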