Google Cloud Storage

cp - Copy files and objects

Synopsis

gsutil cp [OPTION]... src_uri dst_uri
gsutil cp [OPTION]... src_uri... dst_uri
gsutil cp [OPTION]... -I dst_uri

Description

The gsutil cp command allows you to copy data between your local file system and the cloud, copy data within the cloud, and copy data between cloud storage providers. For example, to copy all text files from the local directory to a bucket you could do:

gsutil cp *.txt gs://my_bucket

Similarly, you can download text files from a bucket by doing:

gsutil cp gs://my_bucket/*.txt .

If you want to copy an entire directory tree you need to use the -R option:

gsutil cp -R dir gs://my_bucket

If you have a large number of files to upload you might want to use the gsutil -m option, to perform a parallel (multi-threaded/multi-processing) copy:

gsutil -m cp -R dir gs://my_bucket

You can pass a list of URIs to copy on STDIN instead of as command line arguments by using the -I option. This allows you to use gsutil in a pipeline to copy files and objects as generated by a program, such as:

some_program | gsutil -m cp -I gs://my_bucket

The contents of STDIN can name files, cloud URIs, and wildcards of files and cloud URIs.
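
As an illustration of the kind of program that could feed the -I option, the bash sketch below prints one local file path per line. The function name list_txt_files and the ./reports directory are hypothetical; the sketch assumes only the standard find and sort utilities and the cp -I form shown above.

```shell
# Hypothetical helper (not part of gsutil): prints the path of every .txt file
# under directory $1, one per line, in a form suitable for "gsutil cp -I".
list_txt_files() {
  find "$1" -type f -name '*.txt' | sort
}

# Intended use, left as a comment so that the sketch runs without gsutil:
#   list_txt_files ./reports | gsutil -m cp -I gs://my_bucket
```

Any program that writes one file name, cloud URI, or wildcard per line could take the place of list_txt_files.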

How Names Are Constructed

The gsutil cp command strives to name objects in a way consistent with how Linux cp works, which causes names to be constructed in varying ways depending on whether you’re performing a recursive directory copy or copying individually named objects; and whether you’re copying to an existing or non-existent directory.

When performing recursive directory copies, object names are constructed that mirror the source directory structure starting at the point of recursive processing. For example, the command:

gsutil cp -R dir1/dir2 gs://my_bucket

will create objects named like gs://my_bucket/dir2/a/b/c, assuming dir1/dir2 contains the file a/b/c.

In contrast, copying individually named files will result in objects named by the final path component of the source files. For example, the command:

gsutil cp dir1/dir2/** gs://my_bucket

will create objects named like gs://my_bucket/c.

The same rules apply for downloads: recursive copies of buckets and bucket subdirectories produce a mirrored filename structure, while copying individually (or wildcard) named objects produces flatly named files.

Note that in the above example the ‘**’ wildcard matches all names anywhere under dir1/dir2. The wildcard ‘*’ will match names just one level deep. For more details see gsutil help wildcards.

There’s an additional wrinkle when working with subdirectories: the resulting names depend on whether the destination subdirectory exists. For example, if gs://my_bucket/subdir exists as a subdirectory, the command:

gsutil cp -R dir1/dir2 gs://my_bucket/subdir

will create objects named like gs://my_bucket/subdir/dir2/a/b/c. In contrast, if gs://my_bucket/subdir does not exist, this same gsutil cp command will create objects named like gs://my_bucket/subdir/a/b/c.
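
The naming rules in this section can be summarized as a small bash function. This is only a model of the behavior described above, not code taken from gsutil; the function name predict_name and its argument layout are hypothetical.

```shell
# Hypothetical model of the naming rules described above (not gsutil code).
# usage: predict_name MODE SRC_DIR REL_FILE DST DST_EXISTS
#   MODE        "recursive" (cp -R SRC_DIR DST) or "flat" (cp SRC_DIR/** DST)
#   REL_FILE    path of the file relative to SRC_DIR
#   DST_EXISTS  "yes" if DST is a bucket or an existing subdirectory
predict_name() {
  local mode=$1 src_dir=$2 rel_file=$3 dst=${4%/} dst_exists=$5
  if [ "$mode" = "flat" ]; then
    # Individually named files keep only their final path component.
    echo "$dst/$(basename "$rel_file")"
  elif [ "$dst_exists" = "yes" ]; then
    # Existing destination: the last component of the source directory is kept.
    echo "$dst/$(basename "$src_dir")/$rel_file"
  else
    # Non-existent destination subdirectory: contents land directly under it.
    echo "$dst/$rel_file"
  fi
}

predict_name recursive dir1/dir2 a/b/c gs://my_bucket yes
predict_name flat      dir1/dir2 a/b/c gs://my_bucket yes
predict_name recursive dir1/dir2 a/b/c gs://my_bucket/subdir yes
predict_name recursive dir1/dir2 a/b/c gs://my_bucket/subdir no
```

The four calls correspond, in order, to the four cases worked through in the text.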

Copying To/from Subdirectories; Distributing Transfers Across Machines

You can use gsutil to copy to and from subdirectories by using a command like:

gsutil cp -R dir gs://my_bucket/data

This will cause dir and all of its files and nested subdirectories to be copied under the specified destination, resulting in objects with names like gs://my_bucket/data/dir/a/b/c. Similarly you can download from bucket subdirectories by using a command like:

gsutil cp -R gs://my_bucket/data dir

This will cause everything nested under gs://my_bucket/data to be downloaded into dir, resulting in files with names like dir/data/a/b/c.

Copying subdirectories is useful if you want to add data to an existing bucket directory structure over time. It’s also useful if you want to parallelize uploads and downloads across multiple machines (often reducing overall transfer time compared with simply running gsutil -m cp on one machine). For example, if your bucket contains this structure:

gs://my_bucket/data/result_set_01/
gs://my_bucket/data/result_set_02/
...
gs://my_bucket/data/result_set_99/

you could perform concurrent downloads across 3 machines by running these commands on each machine, respectively:

gsutil -m cp -R gs://my_bucket/data/result_set_[0-3]* dir
gsutil -m cp -R gs://my_bucket/data/result_set_[4-6]* dir
gsutil -m cp -R gs://my_bucket/data/result_set_[7-9]* dir

Note that dir could be a local directory on each machine, or it could be a directory mounted off of a shared file server; whether the latter performs acceptably may depend on a number of things, so we recommend you experiment and find out what works best for you.
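
If the split across machines changes often, the per-machine commands can be generated rather than typed by hand. The bash sketch below only prints the commands; print_shard_commands is a hypothetical helper, and the digit ranges are the same ones used in the example above.

```shell
# Hypothetical helper: prints one download command per machine.
# usage: print_shard_commands URI_PREFIX DEST RANGE...
# Each RANGE is a bracket-expression body such as 0-3.
print_shard_commands() {
  local prefix=$1 dest=$2 range
  shift 2
  for range in "$@"; do
    printf 'gsutil -m cp -R %s_[%s]* %s\n' "$prefix" "$range" "$dest"
  done
}

print_shard_commands gs://my_bucket/data/result_set dir 0-3 4-6 7-9
```

Each printed line is meant to be run on a different machine; the ranges should be chosen so that together they cover every leading digit exactly once.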

Copying In The Cloud And Metadata Preservation

If both the source and destination URI are cloud URIs from the same provider, gsutil copies data “in the cloud” (i.e., without downloading to and uploading from the machine where you run gsutil). In addition to the performance and cost advantages of doing this, copying in the cloud preserves metadata (like Content-Type and Cache-Control). In contrast, when you download data from the cloud it ends up in a file, which has no associated metadata. Thus, unless you have some way to hold on to or re-create that metadata, downloading to a file will not retain the metadata.

Note that by default, the gsutil cp command does not copy the object ACL to the new object, and instead will use the default bucket ACL (see gsutil help defacl). You can override this behavior with the -p option (see OPTIONS below).

Resumable Transfers

gsutil automatically uses the Google Cloud Storage resumable upload feature whenever you use the cp command to upload an object that is larger than 2 MB. You do not need to specify any special command line options to make this happen. If your upload is interrupted you can restart the upload by running the same cp command that you ran to start the upload.

Similarly, gsutil automatically performs resumable downloads (using HTTP standard Range GET operations) whenever you use the cp command to download an object larger than 2 MB.

Resumable uploads and downloads store some state information in a file in ~/.gsutil named by the destination object or file. If you attempt to resume a transfer from a machine with a different ~/.gsutil directory (and therefore without that state file), the transfer will start over from scratch.

See also gsutil help prod for details on using resumable transfers in production.
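
Because an interrupted transfer resumes when the same cp command is run again, a small retry wrapper is one way to automate the restart. The bash sketch below is an illustration only; retry and RETRY_DELAY are hypothetical names, not gsutil features.

```shell
# Hypothetical wrapper: reruns a command until it succeeds or MAX attempts
# have been made. usage: retry MAX COMMAND [ARG...]
retry() {
  local max=$1 attempt=1
  shift
  until "$@"; do
    if [ "$attempt" -ge "$max" ]; then
      return 1                      # give up after MAX failed attempts
    fi
    attempt=$((attempt + 1))
    sleep "${RETRY_DELAY:-5}"       # pause between attempts, in seconds
  done
}

# Intended use (commented out so the sketch runs without gsutil):
#   retry 5 gsutil cp big_file gs://my_bucket
```

Bounding the number of attempts keeps a permanent failure, such as a missing source file, from looping forever.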

Streaming Transfers

Use ‘-’ in place of src_uri or dst_uri to perform a streaming transfer. For example:

long_running_computation | gsutil cp - gs://my_bucket/obj

Streaming transfers do not support resumable uploads/downloads. (The Google resumable transfer protocol has a way to support streaming transfers, but gsutil doesn’t currently implement support for this.)
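
One thing to watch for in a pipeline like the one above is that, in bash, the exit status of a pipeline is normally that of its last command, so a failure in the producing program can go unnoticed. The sketch below shows one way to surface it with bash's pipefail option. The names stream_to_object and GSUTIL are hypothetical; the GSUTIL variable exists only so that a different binary or a stub can be substituted.

```shell
GSUTIL=${GSUTIL:-gsutil}   # hypothetical override hook; defaults to gsutil

# Runs a command and streams its standard output to the object named by $1.
# usage: stream_to_object DST_URI COMMAND [ARG...]
stream_to_object() {
  local dst_uri=$1
  shift
  # The subshell keeps pipefail from leaking into the caller's shell.
  ( set -o pipefail; "$@" | "$GSUTIL" cp - "$dst_uri" )
}

# Intended use (commented out so the sketch runs without gsutil):
#   stream_to_object gs://my_bucket/obj long_running_computation
```

With pipefail set, the function returns a non-zero status if either the producing command or the copy fails, so the caller can decide whether to keep the resulting object.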

Parallel Composite Uploads

gsutil automatically uses object composition to perform uploads in parallel for large, local files being uploaded to Google Cloud Storage. This means that, by default, a large file will be split into component pieces that will be uploaded in parallel. Those components will then be composed in the cloud, and the temporary components in the cloud will be deleted after successful composition. No additional local disk space is required for this operation.

Any file whose size exceeds the “parallel_composite_upload_threshold” config variable will trigger this feature by default. The ideal size of a component can also be set with the “parallel_composite_upload_component_size” config variable. See the .boto config file for details about how these values are used.

If the transfer fails prior to composition, running the command again will take advantage of resumable uploads for those components that failed, and the component objects will be deleted after the first successful attempt. Any temporary objects that were uploaded successfully before gsutil failed will still exist until the upload is completed successfully. The temporary objects will be named in the following fashion: <random ID>/gsutil/tmp/parallel_composite_uploads/for_details_see/gsutil_help_cp/<hash> where <random ID> is some numerical value, and <hash> is an MD5 hash (not related to the hash of the contents of the file or object).

One important caveat is that files uploaded in this fashion are still subject to the maximum number of components limit. For example, if you upload a large file that gets split into 10 components, and try to compose it with another object with 1015 components, the operation will fail because it exceeds the 1024 component limit. If you wish to compose an object later and the component limit is a concern, it is recommended that you disable parallel composite uploads for that transfer.

Also note that an object uploaded using this feature will have a CRC32C hash, but it will not have an MD5 hash. For details see gsutil help crc32c.

Note that this feature can be completely disabled by setting the “parallel_composite_upload_threshold” variable in the .boto config file to 0.
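
The component-limit caveat above is simple arithmetic, which the following bash sketch spells out. The function can_compose is hypothetical and only adds up the counts you pass it; the limit of 1024 is the figure quoted in the text.

```shell
MAX_COMPONENTS=1024   # component limit quoted in the text above

# Hypothetical check: succeeds if the given component counts, added together,
# stay within the limit. usage: can_compose COUNT...
can_compose() {
  local total=0 n
  for n in "$@"; do
    total=$((total + n))
  done
  [ "$total" -le "$MAX_COMPONENTS" ]
}

# The example from the text: 10 + 1015 = 1025, which exceeds 1024.
if can_compose 10 1015; then
  echo "within the limit"
else
  echo "too many components"
fi
```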

Changing Temp Directories

gsutil writes data to a temporary directory in several cases:

  • when compressing data to be uploaded (see the -z option)
  • when decompressing data being downloaded (when the data has Content-Encoding:gzip, e.g., as happens when uploaded using gsutil cp -z)
  • when running integration tests (using the gsutil test command)

In these cases it’s possible the temp file location on your system that gsutil selects by default may not have enough space. If you find that gsutil runs out of space during one of these operations (e.g., raising “CommandException: Inadequate temp space available to compress <your file>” during a gsutil cp -z operation), you can change where it writes these temp files by setting the TMPDIR environment variable. On Linux and MacOS you can do this either by running gsutil this way:

TMPDIR=/some/directory gsutil cp ...

or by adding this line to your ~/.bashrc file and then restarting the shell before running gsutil:

export TMPDIR=/some/directory

On Windows 7 you can change the TMPDIR environment variable from Start -> Computer -> System -> Advanced System Settings -> Environment Variables. You need to reboot after making this change for it to take effect. (Rebooting is not necessary after running the export command on Linux and MacOS.)
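
Before pointing TMPDIR at a new location, it can help to confirm that the location has enough free space. The bash sketch below is a rough check under a stated assumption: that the temporary copy needs about as much space as the source file. The function names free_kib and has_room_for are hypothetical, and the check relies only on the standard df, awk, and wc utilities.

```shell
# Prints the free space, in KiB, on the filesystem that holds directory $1.
free_kib() {
  df -Pk "$1" | awk 'NR == 2 { print $4 }'
}

# Succeeds if directory $1 has at least as many free KiB as file $2 occupies.
# Assumption: the temporary copy is roughly no larger than the source file.
has_room_for() {
  local dir=$1 file=$2 need_kib
  need_kib=$(( ($(wc -c < "$file") + 1023) / 1024 ))
  [ "$(free_kib "$dir")" -ge "$need_kib" ]
}

# Intended use (commented out so the sketch runs without gsutil):
#   has_room_for /some/directory big.html &&
#     TMPDIR=/some/directory gsutil cp -z html big.html gs://my_bucket
```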

Options

-a canned_acl Sets the named canned_acl on uploaded objects when they are created. See ‘gsutil help acls’ for further details.
-c If an error occurs, continue to attempt to copy the remaining files. Note that this option is always true when running “gsutil -m cp”.
-D

Copy in “daisy chain” mode, i.e., copying between two buckets by hooking a download to an upload, via the machine where gsutil is run. By default, data are copied between two buckets “in the cloud”, i.e., without needing to copy via the machine where gsutil runs.

By default, a “copy in the cloud” when the source is a composite object will retain the composite nature of the object. However, daisy chain mode can be used to change a composite object into a non-composite object. For example:

gsutil cp -D -p gs://bucket/obj gs://bucket/obj_tmp
gsutil mv -p gs://bucket/obj_tmp gs://bucket/obj

Note: Daisy chain mode is automatically used when copying between providers (e.g., to copy data from Google Cloud Storage to another provider).

-e Exclude symlinks. When specified, symbolic links will not be copied.
-L <file>

Outputs a manifest log file with detailed information about each item that was copied. This manifest contains the following information for each item:

  • Source path.
  • Destination path.
  • Source size.
  • Bytes transferred.
  • MD5 hash.
  • UTC date and time transfer was started in ISO 8601 format.
  • UTC date and time transfer was completed in ISO 8601 format.
  • Upload id, if a resumable upload was performed.
  • Final result of the attempted transfer, success or failure.
  • Failure details, if any.

If the log file already exists, gsutil will use the file as an input to the copy process, and will also append log items to the existing file. Files/objects that are marked in the existing log file as having been successfully copied (or skipped) will be ignored. Files/objects without entries will be copied and ones previously marked as unsuccessful will be retried. This can be used in conjunction with the -c option to build a script that copies a large number of objects reliably, using a bash script like the following:

status=1
while [ $status -ne 0 ] ; do
    gsutil cp -c -L cp.log -R ./dir gs://bucket
    status=$?
done

The -c option will cause copying to continue after failures occur, and the -L option will allow gsutil to pick up where it left off without duplicating work. The loop will continue running as long as gsutil exits with a non-zero status (such a status indicates there was at least one failure during the gsutil run).

-n No-clobber. When specified, existing files or objects at the destination will not be overwritten. Any items that are skipped by this option will be reported as being skipped. This option will perform an additional HEAD request to check if an item exists before attempting to upload the data. This will save retransmitting data, but the additional HTTP requests may make small object transfers slower and more expensive.
-p

Causes source ACLs to be preserved when copying in the cloud. Note that this option has performance and cost implications, because it is essentially performing three requests (‘acl get’, cp, ‘acl set’). (The performance issue can be mitigated to some degree by using gsutil -m cp to cause parallel copying.)

If you want all objects in the destination bucket to end up with the same ACL, you can avoid the additional performance and cost overhead of cp -p by setting a default ACL on that bucket instead. See “gsutil help defacl”.

Note that it’s not valid to specify both the -a and -p options together.

-q Deprecated. Please use gsutil -q cp ... instead.
-R, -r Causes directories, buckets, and bucket subdirectories to be copied recursively. If you neglect to use this option for an upload, gsutil will copy any files it finds and skip any directories. Similarly, neglecting to specify -R for a download will cause gsutil to copy any objects at the current bucket directory level, and skip any subdirectories.
-v Requests that the version-specific URI for each uploaded object be printed. Given this URI you can make future upload requests that are safe in the face of concurrent updates, because Google Cloud Storage will refuse to perform the update if the current object version doesn’t match the version-specific URI. See gsutil help versions for more details.
-z <ext,...>

Applies gzip content-encoding to file uploads with the given extensions. This is useful when uploading files with compressible content (such as .js, .css, or .html files) because it saves network bandwidth and space in Google Cloud Storage, which in turn reduces storage costs.

When you specify the -z option, the data from your files is compressed before it is uploaded, but your actual files are left uncompressed on the local disk. The uploaded objects retain the Content-Type and name of the original files but are given a Content-Encoding header with the value “gzip” to indicate that the object data stored are compressed on the Google Cloud Storage servers.

For example, the following command:

gsutil cp -z html -a public-read cattypes.html gs://mycats

will do all of the following:

  • Upload as the object gs://mycats/cattypes.html (cp command)
  • Set the Content-Type to text/html (based on file extension)
  • Compress the data in the file cattypes.html (-z option)
  • Set the Content-Encoding to gzip (-z option)
  • Set the ACL to public-read (-a option)

If a user then tries to view cattypes.html in a browser, the browser will know to uncompress the data based on the Content-Encoding header, and to render it as HTML based on the Content-Type header.
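
To get a feel for how much a given file might shrink before deciding to use -z, you can compress it locally and compare sizes. The bash sketch below uses the standard gzip and wc utilities; gzip_sizes is a hypothetical helper, and the result is only an estimate, since it is an assumption that gsutil's compression yields the same size.

```shell
# Hypothetical helper: prints "<original bytes> <gzip bytes>" for file $1.
# The file itself is left untouched; gzip -c writes to standard output.
gzip_sizes() {
  local orig comp
  orig=$(wc -c < "$1")
  comp=$(gzip -c "$1" | wc -c)
  # Arithmetic expansion strips any padding that wc adds on some systems.
  echo "$((orig)) $((comp))"
}

# Intended use:
#   gzip_sizes cattypes.html
```

Text formats such as HTML, CSS, and JavaScript usually compress well, while already-compressed formats tend to show little or no gain.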
