The Google Cloud Storage connector for Hadoop lets you run MapReduce jobs directly on data in Google Cloud Storage, and offers a number of benefits over choosing Hadoop Distributed File System (HDFS) as your default file system.
Benefits of using the connector
- Direct data access.
Store your data in Google Cloud Storage and access it directly, with no need to transfer it into HDFS first.
- HDFS compatibility.
You can still store data in HDFS alongside Google Cloud Storage, and access it through the connector-enabled cluster simply by using a different file system prefix in the path (hdfs:// instead of gs://).
Storing data in Google Cloud Storage enables seamless interoperability between Hadoop and other Google services.
- Data accessibility.
When you shut down a Hadoop cluster, your data in Google Cloud Storage remains accessible, unlike data in HDFS, which goes away with the cluster.
- High data availability.
Data stored in Google Cloud Storage is highly available and globally replicated without a performance hit.
- No storage management overhead.
Unlike HDFS, Google Cloud Storage requires no routine maintenance, such as checking the file system or upgrading and rolling back file system versions.
- Quick startup.
In HDFS, a MapReduce job can't start until the NameNode is out of safe mode, a process that can take anywhere from a few seconds to many minutes depending on the size and state of your data, and you are billed for the Google Compute Engine cycles spent waiting for the NameNode to exit safe mode. With Google Cloud Storage, you can start your job as soon as the task nodes start, leading to significant cost savings over time.
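The direct-access and HDFS-compatibility benefits above come down to the path scheme: the same Hadoop commands and jobs work against either file system. A sketch of what that looks like in practice, assuming the connector is installed and configured; the bucket name mybucket, the NameNode host namenode, and the example jar are placeholders:

```shell
# Listing data in HDFS (requires the cluster's NameNode to be up):
hadoop fs -ls hdfs://namenode/logs/

# The same listing against Google Cloud Storage, via the
# connector's gs:// scheme -- no NameNode involved:
hadoop fs -ls gs://mybucket/logs/

# Running a MapReduce job directly on Cloud Storage data,
# with no copy into HDFS first:
hadoop jar hadoop-examples.jar wordcount \
    gs://mybucket/logs/input gs://mybucket/logs/output
```

Note that only the scheme and authority portion of each path changes; the job code itself is unaware of which file system backs it.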
Getting the connector
The Google Cloud Storage connector for Hadoop is included as part of the setup scripts, and is installed automatically when you unzip the archive and run the scripts.
Configuring the connector
When you set up a Hadoop cluster by following the steps in Setting up a Hadoop cluster, the cluster is automatically configured for optimal use with the connector. There is typically no need to configure the connector further.
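If you do want to inspect or adjust what the setup scripts configured, the connector's core settings live in the cluster's core-site.xml. A minimal sketch of the relevant properties, with my-project and my-bucket as placeholder values:

```xml
<!-- core-site.xml: key Google Cloud Storage connector settings. -->
<configuration>
  <property>
    <name>fs.gs.impl</name>
    <value>com.google.cloud.hadoop.fs.gcs.GoogleHadoopFileSystem</value>
    <description>Registers the connector to handle gs:// paths.</description>
  </property>
  <property>
    <name>fs.gs.project.id</name>
    <value>my-project</value>
    <description>Google Cloud project associated with the bucket.</description>
  </property>
  <property>
    <name>fs.default.name</name>
    <value>gs://my-bucket/</value>
    <description>Optional: makes Cloud Storage the default file system,
    so unqualified paths resolve to gs:// rather than hdfs://.</description>
  </property>
</configuration>
```

Leaving fs.default.name pointed at HDFS while still setting fs.gs.impl also works; in that case you reach Cloud Storage by writing fully qualified gs:// paths.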