Camera Motion Metadata Spec

This page describes a specification that allows MP4 files to embed metadata about camera motion during video capture. Devices that capture video typically have sensors that can provide additional information about capture. For example:

  • Mobile phones typically have sensors for gyroscope, accelerometer, magnetometer, and GPS.
  • Sensor fusion can be used to track the 3 degrees of freedom (3DoF) pose of devices.
  • Simultaneous localization and mapping (SLAM) can be used to track the 6 degrees of freedom (6DoF) pose of the device (for example, Tango).
  • Exposure information can be used to interpolate per-scanline motion.

This metadata can be saved in the video for advanced post-processing in various applications. For example:

  • Frame-level rotation information can be used to stabilize videos, and scanline-level motion data can be used to reduce rolling shutter effects.
  • IMU readings and derived 3DoF poses can be used to evaluate time alignment and geometric alignment between IMU and the camera.

The sections below specify the CAmera Motion Metadata (CAMM) track, which includes a new sample entry that indicates the existence of the track and the data format of track samples.

Sample entry

The video file should contain the following sample entry box to indicate the custom metadata track, and the subComponentType of the track should be set to meta.

Camera Motion Metadata Sample Entry (camm)

Box Type: camm
Container: stsd
A sample entry indicating the data track that saves the camera motion.

aligned(8) class CameraMotionMetadataSampleEntry extends SampleEntry('camm') {
}
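Because the class adds no fields beyond the plain SampleEntry, the serialized box is small. As a sketch (the helper name is hypothetical), the box can be built with Python's struct module, assuming the standard ISO BMFF SampleEntry layout of six reserved bytes followed by a 16-bit data_reference_index; note box headers are big-endian, unlike the little-endian sample payloads below:

```python
import struct

def camm_sample_entry() -> bytes:
    """Build the 'camm' sample entry box that goes inside stsd.

    CameraMotionMetadataSampleEntry adds nothing beyond the base
    SampleEntry fields (ISO/IEC 14496-12): 6 reserved bytes plus a
    16-bit data_reference_index.
    """
    body = bytes(6) + struct.pack(">H", 1)  # reserved, data_reference_index = 1
    return struct.pack(">I", 8 + len(body)) + b"camm" + body

entry = camm_sample_entry()   # 16 bytes total: 8-byte box header + 8-byte body
```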

Data format

The metadata track contains a stream of metadata samples that are formatted as follows.

uint16 reserved;
Reserved. Should be 0.
uint16 type;
The type of the data packet (see below). Each packet has one type of data.
switch (type) {
  case 0:
    float angle_axis[3];

Angle axis orientation in radians representing the rotation from local camera coordinates to a world coordinate system. The world coordinate system is defined by applications.

Let M be the 3x3 rotation matrix corresponding to the angle axis vector. For any ray X in the local coordinate system, the ray direction in the world coordinate is M * X.

This information can be obtained by running 3DoF sensor fusion on the device. After integrating the IMU readings, only the integrated global orientation needs to be recorded.
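As an illustration of the M * X convention above, the angle-axis vector can be converted to a rotation matrix with Rodrigues' formula. This is a sketch, not part of the spec; the function names are hypothetical:

```python
import math

def angle_axis_to_matrix(aa):
    """Convert an angle-axis vector (radians) to a 3x3 rotation matrix
    via Rodrigues' formula. The vector's norm is the rotation angle and
    its direction is the rotation axis."""
    x, y, z = aa
    theta = math.sqrt(x * x + y * y + z * z)
    if theta < 1e-12:                       # near-zero rotation -> identity
        return [[1.0, 0.0, 0.0], [0.0, 1.0, 0.0], [0.0, 0.0, 1.0]]
    kx, ky, kz = x / theta, y / theta, z / theta
    c, s, v = math.cos(theta), math.sin(theta), 1.0 - math.cos(theta)
    return [
        [c + kx * kx * v,      kx * ky * v - kz * s, kx * kz * v + ky * s],
        [ky * kx * v + kz * s, c + ky * ky * v,      ky * kz * v - kx * s],
        [kz * kx * v - ky * s, kz * ky * v + kx * s, c + kz * kz * v],
    ]

def rotate(m, x):
    """World-space direction of local ray x, i.e. M * X."""
    return [sum(m[i][j] * x[j] for j in range(3)) for i in range(3)]
```

For example, a rotation of pi/2 about the Z axis maps the local X axis to the world Y axis.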

  case 1:
    int32 pixel_exposure_time_ns;
    int32 rolling_shutter_skew_time_ns;

This metadata is per video frame. The presentation timestamp (PTS) of this metadata should be the start of the exposure of the first-used scanline in the video frame.

pixel_exposure_time_ns is the exposure time for a single pixel in nanoseconds and rolling_shutter_skew_time_ns is the delay between the exposure of the first-used scanline and the last-used scanline. They can be used to interpolate per-scanline metadata.

The PTS of the corresponding video frame should fall within the range [pts_of_this_metadata, pts_of_this_metadata + pixel_exposure_time_ns + rolling_shutter_skew_time_ns].

When this information is not saved, the device should make a best-effort attempt to adjust the PTS of the video frame to be at the center of the frame exposure.
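The per-scanline interpolation these fields enable can be sketched as follows. This assumes the rolling-shutter skew is spread evenly across the used scanlines, which is the usual model but is not mandated by the spec; the function name is hypothetical:

```python
def scanline_center_time_ns(pts_ns, pixel_exposure_time_ns,
                            rolling_shutter_skew_time_ns,
                            scanline, num_used_scanlines):
    """Mid-exposure timestamp of a given used scanline (0-based).

    pts_ns is the PTS of the type-1 metadata packet, i.e. the exposure
    start of the first used scanline. Exposure starts are assumed to be
    linearly skewed across the used scanlines.
    """
    if num_used_scanlines > 1:
        skew = rolling_shutter_skew_time_ns * scanline / (num_used_scanlines - 1)
    else:
        skew = 0.0
    return pts_ns + skew + pixel_exposure_time_ns / 2.0
```

The first scanline's center is then pts + exposure/2, and the last scanline's center is pts + skew + exposure/2, consistent with the PTS range given above.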

  case 2:
    float gyro[3];

Gyroscope signal in radians/second around the XYZ axes of the camera. Rotation is positive in the counterclockwise direction.

Applications define the relationship between the IMU coordinate system and the camera coordinate system. We recommend aligning them if possible.

Note that initial gyro readings are in the IMU coordinate system defined by its driver, and proper transformation is required to convert it to the camera coordinate system.

Refer to Android Sensor.TYPE_GYROSCOPE.
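The IMU-to-camera transformation mentioned above is a fixed extrinsic rotation applied to each raw reading. A sketch, with a hypothetical matrix (the real one comes from the device's calibration):

```python
# Hypothetical extrinsic rotation from the IMU frame to the camera frame,
# e.g. for an IMU mounted with X/Y swapped and Z flipped relative to the
# camera. It must be a proper rotation (determinant +1).
IMU_TO_CAMERA = [
    [0.0, 1.0, 0.0],
    [1.0, 0.0, 0.0],
    [0.0, 0.0, -1.0],
]

def imu_to_camera(v):
    """Rotate a raw IMU reading (gyro or accelerometer) into camera
    coordinates: v_camera = R * v_imu."""
    return [sum(IMU_TO_CAMERA[i][j] * v[j] for j in range(3))
            for i in range(3)]
```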

  case 3:
    float acceleration[3];

Accelerometer reading in meters/second^2 along XYZ axes of the camera.

Applications define the relationship between the IMU coordinate system and the camera coordinate system. We recommend aligning them if possible.

Refer to Android Sensor.TYPE_ACCELEROMETER.

  case 4:
    float position[3];

3D position of the camera. The 3D position and the angle-axis rotation together define the 6DoF pose of the camera, and both are expressed in a common application-defined coordinate system.

You can get this information by running 6DoF tracking on the device.
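As a sketch of how the two packet types combine into a 6DoF pose: given the rotation matrix M from the type-0 angle axis and the type-4 position, a point maps from camera to world coordinates as X_world = M * X + position (the function name is hypothetical):

```python
def camera_point_to_world(m, position, x):
    """Map point x from camera coordinates to the application-defined
    world frame: X_world = M * X + position, where m is the 3x3 rotation
    from the type-0 angle-axis packet and position is the type-4 packet."""
    return [sum(m[i][j] * x[j] for j in range(3)) + position[i]
            for i in range(3)]
```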

  case 5:
    double latitude;
    double longitude;
    double altitude;

Minimal GPS coordinates of the sample (latitude, longitude, altitude), in contrast to the full GPS fix in case 6.

  case 6:
    double time_gps_epoch;
    int32 gps_fix_type;
    double latitude;
    double longitude;
    float altitude;
    float horizontal_accuracy;
    float vertical_accuracy;
    float velocity_east;
    float velocity_north;
    float velocity_up;
    float speed_accuracy;

  • time_gps_epoch - Time since the GPS epoch when the measurement was taken, in seconds
  • gps_fix_type - 0 (no fix), 2 (2D fix), 3 (3D fix)
  • latitude - Latitude in degrees
  • longitude - Longitude in degrees
  • altitude - Height above the WGS-84 ellipsoid, in meters
  • horizontal_accuracy - Horizontal (lat/long) accuracy, in meters
  • vertical_accuracy - Vertical (altitude) accuracy, in meters
  • velocity_east - Velocity in the east direction, in meters/second
  • velocity_north - Velocity in the north direction, in meters/second
  • velocity_up - Velocity in the up direction, in meters/second
  • speed_accuracy - Speed accuracy, in meters/second
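As a sketch, the case 6 payload can be packed and unpacked with Python's struct module. With the "<" prefix, struct inserts no implicit padding, matching the little-endian, tightly packed layout (double, int32, double, double, then seven 32-bit floats = 56 bytes):

```python
import struct

# Little-endian case 6 payload: d=float64, i=int32, 7f = seven float32s.
GPS_FMT = struct.Struct("<didd7f")

FIELDS = ("time_gps_epoch", "gps_fix_type", "latitude", "longitude",
          "altitude", "horizontal_accuracy", "vertical_accuracy",
          "velocity_east", "velocity_north", "velocity_up", "speed_accuracy")

def parse_gps_packet(payload: bytes) -> dict:
    """Unpack the 56-byte case 6 payload into named fields."""
    return dict(zip(FIELDS, GPS_FMT.unpack(payload)))
```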

  case 7:
    float magnetic_field[3];
}

Ambient magnetic field in microtesla (μT).

Refer to Android Sensor.TYPE_MAGNETIC_FIELD.
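Putting the cases together, a decoder for one metadata sample can be sketched as follows. The format strings follow the field declarations above (little-endian, 4-byte reserved/type header); the type labels and function name are illustrative, not part of the spec:

```python
import struct

# Payload layout per packet type, after the 4-byte header.
PAYLOADS = {
    0: ("angle_axis", "<3f"),
    1: ("exposure", "<2i"),
    2: ("gyro", "<3f"),
    3: ("acceleration", "<3f"),
    4: ("position", "<3f"),
    5: ("gps_minimal", "<3d"),
    6: ("gps", "<didd7f"),
    7: ("magnetic_field", "<3f"),
}

def parse_camm_sample(sample: bytes):
    """Decode one CAMM sample: uint16 reserved, uint16 type, then the
    type-specific payload. Returns (label, tuple_of_values)."""
    reserved, ptype = struct.unpack_from("<HH", sample, 0)
    if reserved != 0:
        raise ValueError("reserved field must be 0")
    name, fmt = PAYLOADS[ptype]
    return name, struct.unpack_from(fmt, sample, 4)
```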


Notes

  • There should be only one CAMM track per MP4 file, which contains all of the above data types by muxing them together.
  • GPS samples in cases 5 and 6 must be raw values generated by sensors. They must not be interpolated or repeated when there is no new GPS reading.
  • The coordinate systems are right-handed. The camera coordinate system is defined as X pointing right, Y pointing downward, and Z pointing forward. The Y-axis of the global coordinate system should point down along the gravity vector.
  • IMU readings are typically in a separate IMU coordinate system, and rotation is needed to map them to the camera coordinate system if the two coordinate systems are different.
  • All fields are little-endian (least significant byte first), and the 32-bit floating points are of IEEE 754-1985 format.
  • To accurately synchronize the video frame and the metadata, the PTS of the video frame should be at the center of its exposure (this can also be inferred from exposure metadata).
  • The application muxing this data should choose a large enough time scale to get an accurate PTS.
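The timescale point above can be made concrete with a small conversion helper (the function name is hypothetical):

```python
def ns_to_track_ticks(t_ns: int, timescale: int) -> int:
    """Convert a nanosecond timestamp to ticks of the track's media
    timescale. round() keeps the worst-case error at half a tick."""
    return round(t_ns * timescale / 1_000_000_000)

# A coarse video-style timescale of 600 yields ticks of ~1.67 ms, which
# swamps sub-millisecond sensor timing; a microsecond timescale
# (1_000_000) keeps the rounding error at 0.5 us.
```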

Potential issues

  • This design only allows one packet per data sample. Embedded devices may have issues writing very high frequency packets because it increases I/O pressure, as well as the header size (for example, the stsc and stsz atoms) if the packet size varies.
  • Mixing different types of data with different delays can cause the PTS to go both forward and backward as packets are written to the file. However, this can be overcome by buffering packets and writing them in monotonic order.
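The buffering approach in the second bullet can be sketched with a small priority queue. The class and parameter names are hypothetical, and the safe-to-emit rule assumes the muxer knows an upper bound on sensor pipeline latency:

```python
import heapq

class MonotonicMuxer:
    """Buffer packets that arrive out of PTS order and release them in
    monotonic PTS order. max_delay_ns bounds the worst-case sensor
    pipeline latency: once the newest PTS seen is more than max_delay_ns
    ahead of a buffered packet, that packet can be emitted safely."""

    def __init__(self, max_delay_ns: int):
        self.max_delay_ns = max_delay_ns
        self._heap = []   # (pts_ns, seq, packet) min-heap
        self._seq = 0     # tie-breaker so equal-PTS packets never compare
        self._newest = None

    def push(self, pts_ns, packet):
        """Add a packet; return packets now safe to write, oldest first."""
        heapq.heappush(self._heap, (pts_ns, self._seq, packet))
        self._seq += 1
        if self._newest is None or pts_ns > self._newest:
            self._newest = pts_ns
        ready = []
        while self._heap and self._heap[0][0] <= self._newest - self.max_delay_ns:
            pts, _, pkt = heapq.heappop(self._heap)
            ready.append((pts, pkt))
        return ready

    def flush(self):
        """Drain remaining packets at the end of recording."""
        ready = []
        while self._heap:
            pts, _, pkt = heapq.heappop(self._heap)
            ready.append((pts, pkt))
        return ready
```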