The gs://alphaearth_foundations GCS bucket contains COG (Cloud Optimized
GeoTIFF) files that together make up the AlphaEarth Foundations annual
Satellite Embedding dataset. It contains the annual embeddings for the years
2017 to 2024, inclusive.
License
This dataset is licensed under CC-BY 4.0 and requires the following attribution text: "The AlphaEarth Foundations Satellite Embedding dataset is produced by Google and Google DeepMind."
This bucket is set up as "requester pays", so downloading data may incur egress and other charges.
Directory structure
The files are divided into directories by year; each year's directory is divided
into 120 subdirectories, one per UTM zone and hemisphere, whose names reflect
the zone number and hemisphere (N or S).
Within each directory are a number of COG files. These files contain all the pixel data for that UTM zone.
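The layout above can be sketched as a small path builder. This is only an illustration: the base prefix and the unpadded zone naming (1S, 10N) are taken from the example paths in this document, while the function names are hypothetical.

```python
# Base prefix as it appears in the example paths in this document.
BASE = "gs://alphaearth_foundations/satellite_embedding/v1/annual"


def utm_zone_names():
    """Return the 120 zone directory names: 1N, 1S, 2N, ..., 60S."""
    return [f"{zone}{hemi}" for zone in range(1, 61) for hemi in ("N", "S")]


def zone_directory(year, utm_zone):
    """Return the directory prefix for one year and one UTM zone."""
    if not 2017 <= year <= 2024:
        raise ValueError(f"year {year} is outside the 2017-2024 range")
    if utm_zone not in utm_zone_names():
        raise ValueError(f"unknown UTM zone name: {utm_zone!r}")
    return f"{BASE}/{year}/{utm_zone}/"


print(zone_directory(2019, "1S"))
```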
File structure
Each file is 8192x8192 pixels, with 64 channels. Each pixel's 64-channel vector, after the de-quantization mapping has been applied (see below), has been normalized so that it has a Euclidean length of 1.
The files contain overview layers at 4096x4096 pixels, 2048x2048 pixels, and so on down to a 1x1 top-level overview layer. These overview layers are constructed so that each overview pixel is the mean of the highest-resolution pixels under that overview pixel, where the mean's magnitude has been normalized to have length 1.
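As a rough illustration of the overview construction described above, the following NumPy sketch averages blocks of full-resolution pixels and rescales each mean vector to unit length. It assumes de-quantized float input with no masked pixels; how the actual files treat masked pixels and quantization in overviews is not stated here, and the function names are hypothetical.

```python
import numpy as np


def overview_sizes(full_size=8192):
    """Edge lengths of the overview layers, from full_size / 2 down to 1."""
    sizes = []
    size = full_size // 2
    while size >= 1:
        sizes.append(size)
        size //= 2
    return sizes


def overview(embeddings, factor):
    """Unit-length mean of each factor x factor block of full-resolution pixels.

    embeddings: float array of shape (channels, height, width).
    """
    channels, height, width = embeddings.shape
    assert height % factor == 0 and width % factor == 0
    blocks = embeddings.reshape(
        channels, height // factor, factor, width // factor, factor
    )
    mean = blocks.mean(axis=(2, 4))
    norm = np.linalg.norm(mean, axis=0, keepdims=True)
    # Leave all-zero means as zero rather than dividing by zero.
    return np.divide(mean, norm, out=np.zeros_like(mean), where=norm > 0)
```

Each level is computed directly from the full-resolution pixels (factor 2, 4, 8, ...) rather than from the previous overview, since a normalized mean of normalized means is generally not the same vector.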
The channels correspond, in order, to the A00 through A63 axes of the
Satellite Embedding dataset. The COGs also contain this naming for the channels.
Each pixel's value for each channel is a signed 8-bit integer. The way in which these values are mapped to the native values (in the range [-1, 1]) of the embeddings is described below.
The value -128 corresponds to a masked pixel. If it is present in one channel,
it will be present in all channels. The COGs reflect this (i.e., they have the
NoData value set to -128).
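Because a masked pixel is masked in every channel, a validity mask can be derived from a single channel. A minimal sketch, assuming the raw data has been read into an int8 array of shape (channels, height, width); the function names are illustrative only.

```python
import numpy as np

NODATA = -128  # NoData value stated above


def valid_mask(raw):
    """Boolean (height, width) array, True where the pixel is not masked."""
    # One channel suffices, since masking applies to all channels at once.
    return raw[0] != NODATA


def mask_is_consistent(raw):
    """Check that every pixel is masked in all channels or in none."""
    masked = raw == NODATA
    return bool(np.all(masked.any(axis=0) == masked.all(axis=0)))
```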
The name of each file also carries some information. For example, consider the
file named
gs://alphaearth_foundations/satellite_embedding/v1/annual/2019/1S/x8qqwcsisbgygl2ry-0000008192-0000000000.tiff.
As described above, this file is part of the 2019 annual embedding, and is in
UTM zone 1S (zone 1, southern hemisphere). The base filename,
x8qqwcsisbgygl2ry-0000008192-0000000000, serves to link this file to the
corresponding Earth Engine Satellite Embedding Image name. In this example, this
file corresponds to a portion of the Earth Engine image
GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL/x8qqwcsisbgygl2ry. The two decimal parts
of the filename specify where this COG's values are relative to that Earth
Engine Image, as an offset in Y followed by an offset in X. In this case, the
COG's pixel origin is at (x, y) = (0, 8192) relative to the Earth Engine Image's
origin. This is because it was necessary to subdivide each Earth Engine Image
(which is 16384x16384 pixels) so that the resulting COGs would not be too
unwieldy.
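The naming scheme above can be unpacked with a short parser. This sketch assumes that image IDs contain no hyphens or slashes and that paths follow the pattern of the example; the class and function names are hypothetical.

```python
import re
from typing import NamedTuple


class CogName(NamedTuple):
    year: int
    utm_zone: str
    image_id: str
    y_offset: int  # first decimal part of the filename
    x_offset: int  # second decimal part of the filename


_PATTERN = re.compile(
    r"/annual/(?P<year>\d{4})/(?P<zone>\d{1,2}[NS])/"
    r"(?P<image_id>[^/-]+)-(?P<y>\d+)-(?P<x>\d+)\.tiff$"
)


def parse_cog_path(path):
    """Split a COG path into year, UTM zone, image ID, and pixel offsets."""
    match = _PATTERN.search(path)
    if match is None:
        raise ValueError(f"unrecognized COG path: {path}")
    return CogName(
        year=int(match["year"]),
        utm_zone=match["zone"],
        image_id=match["image_id"],
        y_offset=int(match["y"]),
        x_offset=int(match["x"]),
    )


def earth_engine_image_name(name):
    """Name of the Earth Engine image that this COG is a portion of."""
    return f"GOOGLE/SATELLITE_EMBEDDING/V1/ANNUAL/{name.image_id}"


example = parse_cog_path(
    "gs://alphaearth_foundations/satellite_embedding/v1/annual/2019/1S/"
    "x8qqwcsisbgygl2ry-0000008192-0000000000.tiff"
)
print(example, earth_engine_image_name(example))
```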
De-quantization
Raw values are signed 8-bit integers between -127 and 127 inclusive (-128 is reserved as the "no data" value). To transform the raw value in each channel of each pixel to the analysis-ready floating-point value (which will be between -1 and 1), the mapping to perform is
- divide by 127.5
- square
- multiply by the sign of the original value
This would be expressed in NumPy as
import numpy as np
# values is a NumPy array of raw pixel values
# (pixels equal to the NoData value, -128, need to be masked separately)
de_quantized_values = ((values / 127.5) ** 2) * np.sign(values)
In Earth Engine, the corresponding operation would be
var de_quantized_values = values.divide(127.5).pow(2).multiply(values.signum());
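For convenience, the same mapping can be wrapped in a function that also handles the NoData value. This is a sketch, not a reference implementation: representing masked pixels as NaN and returning float32 are choices made here, not something the dataset prescribes.

```python
import numpy as np

NODATA = -128  # reserved "no data" value


def de_quantize(raw):
    """Map raw int8 values to floats in [-1, 1]; masked pixels become NaN."""
    values = raw.astype(np.float32)
    out = np.sign(values) * (values / 127.5) ** 2
    out[raw == NODATA] = np.nan
    return out


raw = np.array([-127, -64, 0, 64, 127, -128], dtype=np.int8)
print(de_quantize(raw))
```

After de-quantization, the 64-channel vector of each valid pixel should have a Euclidean length close to 1, which can serve as a sanity check when reading real files.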
Manifest and index
A list of the files in this dataset can be found in
gs://alphaearth_foundations/satellite_embedding/v1/annual/manifest.txt.
As it is not possible to determine from the file names what area of the world
they cover, an index has also been provided, in three forms (GeoParquet,
GeoPackage, and CSV) in the files
gs://alphaearth_foundations/satellite_embedding/v1/annual/aef_index.parquet,
gs://alphaearth_foundations/satellite_embedding/v1/annual/aef_index.gpkg, and
gs://alphaearth_foundations/satellite_embedding/v1/annual/aef_index.csv. This
index contains one entry for each file in the dataset. The information provided
for each file is
- WKT: The geometry of the file as a WGS84 (i.e., EPSG:4326) polygon. In the CSV
  form, this is in the column WKT. See below for details on how this geometry is
  computed.
- crs: The CRS of the UTM zone this image belongs to as an EPSG code, like EPSG:32610.
- year: The year that the image covers.
- utm_zone: The UTM zone of the image, like 10N.
- utm_west, utm_south, utm_east, utm_north: The UTM bounds of the raw pixel array. This does not reflect any geometry processing, and includes all pixels whether or not they are valid.
- wgs84_west, wgs84_south, wgs84_east, wgs84_north: The min/max longitude and latitude of the WGS84 geometry.
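One way to use the CSV form of the index is a coarse bounding-box lookup on the wgs84_* columns. The sketch below works on rows read with csv.DictReader; the sample rows are made up for illustration and are not real index entries, and only a subset of the columns is shown. A bounding-box match is only a prefilter, so an exact test would still need the polygon in the WKT column.

```python
import csv
import io


def rows_covering_point(rows, lon, lat, year=None):
    """Return index rows whose WGS84 bounding box contains (lon, lat)."""
    matches = []
    for row in rows:
        if year is not None and int(row["year"]) != year:
            continue
        in_lon = float(row["wgs84_west"]) <= lon <= float(row["wgs84_east"])
        in_lat = float(row["wgs84_south"]) <= lat <= float(row["wgs84_north"])
        if in_lon and in_lat:
            matches.append(row)
    return matches


# Made-up sample rows, for illustration only.
SAMPLE = """year,utm_zone,wgs84_west,wgs84_south,wgs84_east,wgs84_north
2019,10N,-123.0,37.0,-122.0,38.0
2020,10N,-123.0,37.0,-122.0,38.0
2019,11N,-120.0,37.0,-119.0,38.0
"""

rows = list(csv.DictReader(io.StringIO(SAMPLE)))
hits = rows_covering_point(rows, lon=-122.4, lat=37.8, year=2019)
print(hits)
```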
Geometry processing
The pixel array is natively in some UTM zone, so in that UTM zone the bounding box of the pixel array is a simple rectangle. That bounding box is transformed into a polygon in WGS84. This polygon includes a number of extra points so that its edges closely follow the curved lines in WGS84 that the straight lines in UTM transform into. This polygon does not take into account the validity/invalidity of pixels in the image, just the bounds of the image's pixel array.
The polygon is then clipped to the minimum and maximum longitude of the image's UTM zone. In practice, this may cause it to not include a few valid pixels that hang over the edge of the UTM zone. Omitting these pixels from the index shouldn't cause any problems: some image from the neighbouring UTM zone should cover that area.
Note that clipping to the min/max longitude of the UTM zone means that no polygon crosses the antimeridian, which should make processing this file a little simpler.
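Two pieces of this processing can be sketched without a projection library: the longitude range of a UTM zone, and the insertion of extra points along the edges of a rectangle. The sketch assumes the regular 6-degree zone layout and ignores the irregular zones around Norway and Svalbard, since this document does not say how those are handled. The function names and the number of points per edge are illustrative, and the UTM-to-WGS84 transform itself is omitted.

```python
def utm_zone_longitude_bounds(utm_zone):
    """(west, east) longitude in degrees of a regular UTM zone such as "10N"."""
    zone, hemisphere = int(utm_zone[:-1]), utm_zone[-1]
    if not 1 <= zone <= 60 or hemisphere not in ("N", "S"):
        raise ValueError(f"invalid UTM zone name: {utm_zone!r}")
    west = -180 + 6 * (zone - 1)
    return west, west + 6


def densify_rectangle(west, south, east, north, points_per_edge=10):
    """Closed ring around the rectangle with evenly spaced points per edge."""
    corners = [(west, south), (east, south), (east, north), (west, north)]
    ring = []
    for (x0, y0), (x1, y1) in zip(corners, corners[1:] + corners[:1]):
        for i in range(points_per_edge):
            t = i / points_per_edge
            ring.append((x0 + t * (x1 - x0), y0 + t * (y1 - y0)))
    ring.append(ring[0])  # close the ring
    return ring


print(utm_zone_longitude_bounds("10N"))
```

In a full pipeline, each vertex of the densified ring would be transformed from the zone's UTM CRS to WGS84, and the resulting polygon clipped to the longitude bounds of the zone.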