Hadoop on Google Cloud Platform

Google Cloud Storage Connector for Hadoop Overview

The Google Cloud Storage connector for Hadoop lets you run MapReduce jobs directly on data in Google Cloud Storage, and offers a number of benefits over choosing Hadoop Distributed File System (HDFS) as your default file system.

Benefits of using the connector

When you first set up a Hadoop cluster, you choose between two default file systems: HDFS and Google Cloud Storage. Choosing Google Cloud Storage, together with the supplied connector, has several benefits.

  • Direct data access.

    Store your data in Google Cloud Storage and access it directly, with no need to transfer it into HDFS first.

  • HDFS compatibility.

    You can still store data in HDFS in addition to Google Cloud Storage, and access each by using a different file path.

  • Interoperability.

    Storing data in Google Cloud Storage enables seamless interoperability between Hadoop and other Google services.

  • Data accessibility.

    When you shut down a Hadoop cluster, you still have access to your data in Google Cloud Storage, unlike data stored in HDFS.

  • High data availability.

    Data stored in Google Cloud Storage is highly available and globally replicated without a performance hit.

  • No storage management overhead.

    Unlike HDFS, Google Cloud Storage requires no routine maintenance, such as checking the file system, or upgrading or rolling back to a previous version of the file system.

  • Quick startup.

    With HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take anywhere from a few seconds to many minutes depending on the size and state of your data. If you use HDFS, you are charged for the time your Google Compute Engine instances spend waiting for the NameNode to exit safe mode. With Google Cloud Storage, you can start your job as soon as the task nodes start, which can lead to significant cost savings over time.
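As a rough illustration of the "different file path" idea in the list above, the sketch below classifies a path by its URI scheme. It assumes the conventional gs:// and hdfs:// prefixes and assumes that a path with no scheme resolves against the cluster's default file system. The storage_backend helper, the bucket name, and the host name are hypothetical and are not part of the connector.

```python
from urllib.parse import urlparse

# URI schemes mapped to a readable backend name. "gs" and "hdfs" are the
# conventional schemes; the bucket and host names used below are made up.
BACKENDS = {"gs": "Google Cloud Storage", "hdfs": "HDFS"}


def storage_backend(path, default_scheme="gs"):
    """Return the backend a Hadoop-style path would be expected to use.

    A path with no scheme falls back to default_scheme, mirroring the way
    a scheme-less path is resolved against the default file system.
    """
    scheme = urlparse(path).scheme or default_scheme
    if scheme not in BACKENDS:
        raise ValueError(f"unsupported scheme: {scheme!r}")
    return BACKENDS[scheme]


if __name__ == "__main__":
    for p in (
        "gs://example-bucket/input/logs.txt",
        "hdfs://namenode:8020/user/hadoop/input/logs.txt",
        "/user/hadoop/input/logs.txt",
    ):
        print(p, "->", storage_backend(p))
```

With Google Cloud Storage as the default file system, only data kept in HDFS would need the explicit hdfs:// prefix in this model.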

Getting the connector

The Google Cloud Storage connector for Hadoop is included as part of the setup scripts, and is installed automatically when you unzip the archive and run the scripts.

Configuring the connector

When you set up a Hadoop cluster by following the steps in "Setting up a Hadoop cluster", the cluster is automatically configured for optimal use with the connector. There is typically no need to configure the connector further.
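If you want to check which default file system a cluster ended up with, one option is to inspect its Hadoop configuration. The following is a minimal sketch, not the setup scripts' own method. It assumes a core-site.xml style document, the standard Hadoop property names fs.default.name (older releases) and fs.defaultFS (newer releases), and the connector's fs.gs.impl property. The sample values, including the bucket name, are hypothetical, and the configuration the setup scripts actually write may differ.

```python
import xml.etree.ElementTree as ET

# Hypothetical sample in the shape of a Hadoop core-site.xml file.
# The bucket name is made up for illustration.
SAMPLE_CORE_SITE = """<configuration>
  <property>
    <name>fs.default.name</name>
    <value>gs://example-bucket/</value>
  </property>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
  </property>
</configuration>"""


def read_properties(xml_text):
    """Parse <property> entries into a {name: value} dictionary."""
    root = ET.fromstring(xml_text)
    props = {}
    for prop in root.findall("property"):
        name = prop.findtext("name")
        value = prop.findtext("value")
        if name is not None:
            props[name.strip()] = (value or "").strip()
    return props


def default_file_system(props):
    """Return the default file system URI, or None if it is not set."""
    # Prefer the newer property name, then fall back to the older one.
    return props.get("fs.defaultFS") or props.get("fs.default.name")


if __name__ == "__main__":
    props = read_properties(SAMPLE_CORE_SITE)
    print("default file system:", default_file_system(props))
    print("gs implementation:", props.get("fs.gs.impl"))
```

A default file system URI that starts with gs:// would indicate that Google Cloud Storage, rather than HDFS, is the default.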
