Lifecycle of an incident
Google Maps Platform adheres to the Google Cloud Platform Incident Management framework.
When an outage or service degradation occurs, the product engineering team and the Google Maps Platform Support team work together to resolve the incident and communicate it to you.
Google uses internal and black box monitoring to detect incidents and trigger alerts to our engineers for investigation. For more information, see Chapter 6 of the Site Reliability Engineering book.
When Google detects an incident, the Support team leads communication with you. Initial notification of an incident is often sparse, frequently only mentioning the product in question along with key symptoms. This is because we prioritize fast notification over detail. As we learn more, additional details are provided in subsequent updates.
Incident communication channels
To provide the appropriate amount of information, the Google Maps Platform Support team offers different incident communication channels, depending on the scope and severity of an issue:
The Maps Public Status Dashboard is the first place to check when you discover that an issue is affecting you. The dashboard shows incidents that affect many customers, so if you see an incident listed, it is likely related to your problem. To indicate severity, the status dashboard marks incidents as either a disruption or an outage. Issues that are minor but still widespread are posted as informational incidents.
The Google Maps Platform Notifications Group is a public Google group where all widespread outages are reported, in addition to other technical updates about Google Maps Platform APIs. All group members receive an email notification when an outage is first detected, followed by updates until the issue is resolved.
The Support Banner is an informational message that appears in the Maps Support section of the Cloud Console when there's an active incident. The message identifies the affected product and includes a link to the Issue Tracker.
The Issue Tracker contains a reference list of all known incidents. You can view open incidents, follow their progress by subscribing to them, and add comments to help our teams investigate. You can also find the link to the public issue tracker in the Google Maps Platform support documentation.
Support cases are used if the issue might be isolated to your projects or impacts a limited number of customers. If no incident has been declared, but you are still experiencing an issue, go to the Google Maps Platform Support page (in Cloud Console) and create a new support case.
Product engineering teams are responsible for investigating the root cause of incidents. Incident management is often done by Site Reliability Engineers but might be done by software engineers or others, depending on the situation and product. For more information, see Chapter 12 of the Site Reliability Engineering book.
An issue is considered fixed only when changes have been made that Google is confident will end the impact indefinitely. For example, the fix could be rolling back a change that triggered an incident.
While an incident is in progress, the Support and Product teams will attempt to mitigate the issue. Mitigation occurs when the impact or scope of an issue can be reduced, for example by temporarily providing additional resources to a service suffering overload.
If no mitigation has been found, the Support team will, when possible, find and communicate workarounds. Workarounds are steps that you can take to solve the underlying need despite the incident. For example, a workaround might be to use different settings for an API call to avoid a problematic code path.
While an incident is ongoing, the Support team provides regular updates. Updates typically provide:
- More information about the incident, such as error messages, which features are affected, and how widespread it is.
- Progress towards mitigation, including any workarounds.
- Timelines for communication, tailored to the incident.
- Changes in status, such as when an incident is fixed.
All incidents result in an internal postmortem (post-incident analysis) to fully understand the incident and to identify reliability improvements that Google can make. These improvements are then tracked and implemented. For more information on postmortems at Google, see Chapter 15 of the Site Reliability Engineering book.
When incidents have very wide and serious impact, Google provides incident reports that outline the symptoms, impact, root cause, remediation, and future prevention of incidents. As with postmortems, we pay particular attention to the steps that we take to learn from the issue and improve reliability. Google's goal in writing and releasing postmortems is to be transparent and demonstrate our commitment to building stable services for our customers.
I want to get notified when there’s an ongoing outage. What should I do?
- Join the Google Maps Platform Notifications group to get notified of ongoing issues and to follow the progress of the incident in real-time. This group will also help you stay up to date with product and platform announcements.
- Use the RSS Feed or JSON History links at the bottom of the Maps Public Status Dashboard to view a feed of current and past incidents. Every post to the Dashboard will trigger a post to the feed. To keep you updated, each post to the feed will include all the messages and updates pertaining to the corresponding Dashboard event. That way you won't need to dig through your feed history to piece together how things are progressing. RSS feeds are published in XML format. Browser extensions such as RSS Subscription Extension (by Google) allow you to preview the feed content and subscribe through your favorite RSS reader. JSON History is a JSON Web Feed of past incidents. A range of software libraries and web frameworks support content syndication via JSON Feed.
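As a rough illustration of consuming the JSON History feed, the sketch below filters a parsed history for incidents that are still open. The sample payload and the field names (`id`, `begin`, `end`, `external_desc`) are assumptions made for this example and are not confirmed by this page; check the actual feed before relying on them. The helper name `ongoing_incidents` is hypothetical.

```python
import json
from typing import Any

# Hypothetical sample payload. The real JSON History schema may differ,
# so treat every field name here as an assumption.
SAMPLE_HISTORY = """
[
  {"id": "a1",
   "begin": "2024-01-10T08:00:00+00:00",
   "end": "2024-01-10T09:30:00+00:00",
   "external_desc": "Elevated error rates for one API"},
  {"id": "b2",
   "begin": "2024-02-02T14:00:00+00:00",
   "end": null,
   "external_desc": "Increased latency for another API"}
]
"""


def ongoing_incidents(history: list[dict[str, Any]]) -> list[dict[str, Any]]:
    """Return incidents with no end timestamp (assumed to mean 'still open')."""
    return [incident for incident in history if not incident.get("end")]


incidents = json.loads(SAMPLE_HISTORY)
open_ids = [incident["id"] for incident in ongoing_incidents(incidents)]
print(open_ids)
```

In practice you would replace `SAMPLE_HISTORY` with the content fetched from the JSON History link on the dashboard and adjust the field names to match what the feed actually returns.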
What type of status information can I find on the dashboard home page?
The Google Maps Public Status Dashboard provides status information on services that are part of Google Maps Platform. Each service shows one of the following status indicators:
- Service Outage: A production system or service is down. Workaround is not available or is not easily implemented.
- Service Disruption: A production system or service is partially impaired and/or does not work as expected. Workaround exists.
- Minor Incident: Low-impact issue provided for informational purposes. Service is still generally available.
- Available: Service is fully functional and working as expected.
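If you track these indicators in your own monitoring, one possible approach is to model the four indicators above as an ordered enumeration and summarize several services by their most severe status. This is only a sketch: the names `ServiceStatus` and `worst_status`, and the numeric severity ordering, are assumptions for illustration and are not part of any Google Maps Platform API.

```python
from enum import IntEnum
from typing import Iterable


class ServiceStatus(IntEnum):
    """Dashboard indicators, ordered from least to most severe (assumed ordering)."""
    AVAILABLE = 0
    MINOR_INCIDENT = 1
    SERVICE_DISRUPTION = 2
    SERVICE_OUTAGE = 3


def worst_status(statuses: Iterable[ServiceStatus]) -> ServiceStatus:
    """Return the most severe status, or AVAILABLE when nothing is reported."""
    return max(statuses, default=ServiceStatus.AVAILABLE)


# Example: summarize the statuses of the services your application depends on.
summary = worst_status([
    ServiceStatus.AVAILABLE,
    ServiceStatus.SERVICE_DISRUPTION,
    ServiceStatus.MINOR_INCIDENT,
])
print(summary.name)
```

An ordered enumeration keeps the comparison logic in one place, so an alerting rule such as "notify on a disruption or worse" becomes a single comparison against `ServiceStatus.SERVICE_DISRUPTION`.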
Where can I find information about past service disruptions and outages?
The History page in the Maps Public Status Dashboard is a repository of disruptions and outages from the past 365 days. Click an incident to review the posts about the incident while it was ongoing, as well as any incident reports published by the Support team.
Who updates the dashboard?
The global Google Maps Platform Support team monitors the status of services using many different types of signals and updates the dashboard in the event of a widespread issue. If needed, they will also post a detailed analysis report after an incident has been resolved.
What is the difference between an "incident" and an "outage"?
Although these terms are often used interchangeably, the Maps Public Status Dashboard and our external communications use "incident" to refer to any period of degraded service and "outage" to refer only to the most serious impairment, where a service is so impaired that it is effectively unusable for our customers.