Designing a robust system helps mitigate instance downtime and prepares you for times when your instances fall into a maintenance window or suffer an unexpected failure.
- Maintenance Windows
- Types of Failures
- How to Design Robust Systems
Maintenance Windows
Google periodically performs scheduled maintenance on its infrastructure: patching systems with the latest software, performing routine tests and preventative maintenance, and generally ensuring that our infrastructure is as fast and efficient as possible.
There are currently two types of scheduled maintenance:
- Transparent Maintenance
Transparent maintenance affects only a small piece of the infrastructure in a given zone and Google Compute Engine automatically moves your instances elsewhere in the zone, out of the way of the maintenance work. For more information, see Transparent Maintenance.
- Scheduled Zone Maintenance Windows
For scheduled zone maintenance windows, Google takes an entire zone offline for roughly two weeks to perform various disruptive maintenance tasks. For more information, see Scheduled Zone Maintenance Windows.
The type of scheduled maintenance your instances experience depends on the zone they are running in. Currently, only US and Asia zones support transparent maintenance. All other zones still have scheduled maintenance windows where the whole zone is taken offline for roughly two weeks. We are in the process of planning and rolling out the hardware and software required to support transparent maintenance for all of our zones, so check back periodically for updates. For more information, see Maintenance Events.
Types of Failures
At some point, one or more of your instances will be lost due to system or hardware failures, or due to a scheduled zone maintenance window. Some of the failures you may experience include:
- Unexpected Single Instance Failure
Unexpected single instance losses can be due to hardware or system failure. We are working to make these losses as rare as possible, but you should expect a higher level of single instance losses during the preview period. To mitigate these events, use persistent disks and startup scripts.
- Unexpected Single Instance Reboot
At some point, you will experience an unexpected single instance failure and reboot. Unlike an unexpected single instance loss, in this case your instance fails and is automatically rebooted by the Google Compute Engine service. To help mitigate these events, back up your data, and use persistent disks and startup scripts.
- Zone maintenance and failures
- Zone failures - Zone failures are rare, unexpected failures within a zone that can cause your instances to go down.
- Zone maintenance - Scheduled zone maintenance windows are planned periods where a zone is taken offline for servicing. In these cases, you will receive prior notification of the maintenance window. For zones that support transparent scheduled maintenance events, you can keep your instances running through maintenance events. You can do this by configuring instance scheduling options so that Google Compute Engine automatically migrates your instances away from maintenance events. For more information, see Setting Instance Scheduling Options.
To mitigate zone failures and maintenance windows, create diversity across zones and implement load balancing. You should also back up your data or migrate your persistent disk data to another zone.
How to Design Robust Systems
To help mitigate instance failures, you should design your application on the Google Compute Engine service to be robust against failures, network interruptions, and unexpected disasters. A robust system should be able to gracefully handle failures, including redirecting traffic from a downed instance to a live instance or automating tasks on reboot.
Here are some general tips to help you design a system that is robust against failures.
Distribute your instances
Create instances across many zones so that you have alternative VM instances to point to if a zone containing one of your instances is taken down for maintenance or fails. If you host all your instances in the same zone, you won’t be able to access any of these instances if that zone is unreachable.
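As a rough sketch of this idea, the following shell function assigns instance names to zones in round-robin order and prints one gcutil addinstance command per instance. It is a dry run only: nothing is created. The function name, the instance prefix, the zone names, and the use of a --zone flag are assumptions made for illustration, so check them against your own gcutil version and project before running any printed command.

```shell
#!/bin/bash
# Sketch: spread a number of instances across a list of zones, round-robin.
# This only PRINTS commands; it does not create anything.
# Assumptions: zone names, the instance prefix, and the --zone flag are
# illustrative and should be verified against your gcutil version.

plan_instances() {
  # Usage: plan_instances <prefix> <count> <project> <zone> [<zone> ...]
  local prefix="$1" count="$2" project="$3"
  shift 3
  if [ "$#" -eq 0 ]; then
    echo "plan_instances: at least one zone is required" >&2
    return 1
  fi
  local zones=("$@")
  local i zone
  for ((i = 0; i < count; i++)); do
    # Pick the next zone in the list, wrapping around at the end.
    zone="${zones[i % ${#zones[@]}]}"
    echo "gcutil addinstance ${prefix}-${i} --zone=${zone} --project=${project}"
  done
}

# Example: four web servers over two hypothetical zones.
plan_instances web 4 my-project us-central1-a us-central1-b
```

Review the printed commands before running them. With instances alternating between zones, losing one zone still leaves about half of them reachable.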
Use Google Compute Engine load balancing
Google Compute Engine offers a load balancing service that helps you support periods of heavy traffic so that you don't overload your instances. With the load balancing service, you can pick a region with multiple zones and deploy your application on instances within these zones. Then, you can configure a forwarding rule that can spread traffic across all virtual machine instances in all zones within the region. Each forwarding rule can define one entry point to your application using an external IP address.
Lastly, when a zone maintenance window approaches, you can add more virtual machines in the available zones to prepare for an increased load. While the zone is offline during the maintenance window, the load balancing service automatically directs traffic away from the terminated instances and uses instances in healthy zones instead. Once the maintenance window is over and the zone is back online, you can choose to migrate your virtual machines back, or keep them in the new zone. In this way, your external clients can access your application without any service disruptions, even if some of your instances are taken offline during maintenance windows. The load balancing service also offers instance health checking, which helps detect and handle instance failures.
Alternatively, if you have already replicated your instances across many zones and many regions, you can create a forwarding rule for each of the regions and use it as the entry point for instances in that region. Then, you can use DNS-based load balancing to distribute the load over these entry points into each region. When a maintenance window approaches, you can adjust your DNS settings or increase the number of virtual machine instances in other zones in the same region. Once the maintenance window is over, you can recreate the virtual machines in the old zone, re-adjust the DNS setting, and tear down any backup virtual machines that are no longer needed.
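If you manage DNS-based load balancing yourself, you need some way to decide which regional entry points are currently usable. The sketch below is one possible approach and is not how the Google Compute Engine load balancing service works internally: it runs a probe command against each address and prints only the addresses that pass. The function names, the HTTP probe on port 80, and the example addresses are all assumptions for illustration.

```shell
#!/bin/bash
# Sketch: print the entry points that pass a health probe.
# The probe is any command or function that takes one address and
# exits 0 when that address is healthy.

http_probe() {
  # Assumption: the application answers plain HTTP on port 80.
  curl --silent --fail --max-time 2 --output /dev/null "http://$1/"
}

healthy_backends() {
  # Usage: healthy_backends <probe> <address> [<address> ...]
  local probe="$1"
  shift
  local addr
  for addr in "$@"; do
    if "$probe" "$addr"; then
      echo "$addr"
    fi
  done
}

# Usage (placeholder addresses):
#   healthy_backends http_probe 203.0.113.10 203.0.113.11
```

The printed list can then drive whatever DNS update mechanism you use. Passing the probe as a parameter lets you swap in a different check without changing the loop.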
Use startup scripts
Startup scripts are an efficient and invaluable way to bootstrap your instances. If an instance fails, it can use its startup script to bring itself back up and to install and access the appropriate resources as if it had never gone down. Instead of configuring your VM instances via custom images, it can be beneficial to configure them using startup scripts. Startup scripts run whenever the VM is rebooted or restarted due to failures, and can be used to install software and updates, and to ensure that services are running within the VM. Codifying the changes to configure a VM in a startup script is easier than figuring out what files or bytes have changed on a custom image.
You can run startup scripts using the gcutil tool by specifying the --metadata_from_file=startup-script:<script> flag with the gcutil addinstance command:
$ gcutil addinstance simple-apache --metadata_from_file=startup-script:install-apache.sh --project=my-project
For more information, see startup scripts.
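For illustration, here is what a minimal install-apache.sh might contain. This is a sketch, not the script from the official examples: it assumes a Debian-based image where apt-get and the service command are available, and the package and service name apache2 are assumptions tied to that image family. The commands below write the script to the current directory and check its syntax locally before you pass it to gcutil addinstance.

```shell
# Write a hypothetical install-apache.sh.
# Assumption: the instance boots a Debian-based image with apt-get.
cat > install-apache.sh <<'EOF'
#!/bin/bash
# This script runs on every boot, so each step must be safe to repeat.
apt-get update
# Installing an already-installed package is a no-op.
apt-get install -y apache2
# Make sure the web server is running after a reboot.
service apache2 start
EOF

# Syntax-check the script locally before attaching it to an instance.
bash -n install-apache.sh && echo "install-apache.sh parsed without syntax errors"
```

Because the script runs again after every reboot, keep each step idempotent so that a restarted instance ends up in the same state as a freshly created one.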
Back up your data
If you need access to data on a VM instance or persistent disk that is in a zone scheduled to be taken offline, you can back up your files to Google Cloud Storage or to your local computer, or migrate your data to a persistent disk in another zone.
To copy files from a VM instance to Google Cloud Storage:
- Log in to your instance using gcutil:
$ gcutil ssh my-first-instance --project=my-project
- If you have never used gsutil on this VM instance, set up your credentials.
$ gsutil config
Alternatively, if you have set up your instance to use a service account with a Google Cloud Storage scope, you can skip this and the next step.
- Follow the instructions to authenticate to Google Cloud Storage.
- Copy your data to Google Cloud Storage by using the following command:
$ gsutil cp <file1> <file2> <file3> ... gs://<your bucket>
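The copy step above can be wrapped in a small script if you want to back up a whole directory at once. The following is a sketch under stated assumptions: the function names, the archive naming scheme, the example directory, and the bucket name are all hypothetical, and the upload step requires gsutil to be configured as described in the steps above.

```shell
#!/bin/bash
# Sketch: archive a directory, then copy the archive to a bucket.
# Directory, bucket, and archive naming are placeholders.

make_archive() {
  # Usage: make_archive <source-dir> <output-dir>
  # Prints the path of the archive it created.
  local src="$1" out_dir="$2"
  local stamp archive
  stamp="$(date -u +%Y%m%dT%H%M%SZ)"
  archive="${out_dir}/backup-${stamp}.tar.gz"
  # -C keeps paths in the archive relative to the parent directory.
  tar -czf "$archive" -C "$(dirname "$src")" "$(basename "$src")" \
    && echo "$archive"
}

upload_archive() {
  # Usage: upload_archive <archive> <bucket>
  # Requires gsutil credentials on this machine.
  gsutil cp "$1" "gs://$2"
}

# Example, run on the instance (placeholder names):
#   archive="$(make_archive /var/www /tmp)" && upload_archive "$archive" my-bucket
```

The timestamp in the archive name keeps successive backups from overwriting each other in the bucket.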
You can also use the gcutil tool to copy files to a local computer. For more information, see Copying Files To/From an Instance.