Sampling in Google Analytics, or in any web analytics software, refers to the practice of selecting a subset of data from your website traffic. Sampling is widely used in statistical analysis because analyzing a representative subset of data gives results similar to analyzing all of the data. In addition, sampling speeds up report processing when the volume of data is large enough to slow down report queries.
If your website has many millions of pageviews per month, sampling the traffic data collection for your site means that you will get good report results in a reasonable amount of time. Even if the data for your site is not sampled when it is collected, certain types of reports will contain sampled results, due to the nature of the query. For more information, see the Wikipedia article on sampling.
This document describes the two kinds of sampling that can occur in Analytics.
Report Data Sampling
Regardless of how your traffic data is collected (sampled or unsampled), Analytics may examine only a portion of the collected data when calculating the result for a report. This type of sampling is called report sampling. It occurs automatically when you query for data that is not available in aggregate.
For example, suppose you query a Content Detail report for your top page, which received 80,000 pageviews over the past month. That information has been automatically compiled in the Analytics database, so the report can quickly display the actual pageview number. However, if you then query that same page for pageviews by browser, you are requesting data that has not been automatically compiled. A special query is needed to do the calculation, and this can trigger sampling by Analytics. Analytics indicates that a report is sampled with a notification in yellow at the top of the screen. It provides further information about the sampled metrics, as described below.
If either of the following thresholds is met, Analytics samples data accordingly:
- 1,000,000 maximum unique dimension combinations for any type
What does this mean? Suppose you request a content report for your site, which has visits to 1,000,000 unique URLs for the requested date range. In such a situation, your report would take a very long time to load in order to display all the unique URLs for that date range. To avoid this, Google Analytics retrieves a maximum of 1,000,000 unique URLs (or any other dimension value) for a given request, divided by the number of days in the request. For example:
- A report for the past 30 days would display approximately 33,000 unique URLs (i.e. 1,000,000/30).
- A report for the past 60 days would display a maximum of approximately 16,600 unique URLs (i.e. 1,000,000/60).
- 500,000 maximum sessions for special queries where the data is not already stored.
In many of the reports, the list of dimensions is fixed, so Analytics can store this data. This enables Analytics to deliver timely reporting information for large data sets. However, if you request an ad hoc set of dimensions, that information is not stored and Analytics needs to perform the calculation at the time of the request. In this case, only 500,000 sessions are processed in order to improve the response time. Your report query might easily exceed 500,000 sessions if you request an ad hoc dimension over an expanded date range. To get a sense of how many sessions might appear in your request, you can use the visits metric over the date range you intend to query. This maximum of 500,000 sessions applies per web property.
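The two thresholds above can be sketched as a quick back-of-the-envelope check. This is only an illustration of the arithmetic described in this document, not how Analytics is implemented: the function names are hypothetical, and the sketch assumes that sampling applies once a query goes strictly over 500,000 sessions.

```python
# Limits quoted in this document; all names below are illustrative only.
MAX_DIMENSION_COMBINATIONS = 1_000_000
MAX_AD_HOC_SESSIONS = 500_000


def unique_values_cap(days_in_range: int) -> int:
    """Approximate number of unique dimension values (e.g. URLs) a report
    would display: the 1,000,000 limit divided by the days requested."""
    if days_in_range < 1:
        raise ValueError("the date range must cover at least one day")
    return MAX_DIMENSION_COMBINATIONS // days_in_range


def ad_hoc_query_exceeds_limit(sessions_in_range: int) -> bool:
    """True when an ad hoc query covers more sessions than the limit
    (assumption: 'more than 500,000' is what triggers sampling)."""
    return sessions_in_range > MAX_AD_HOC_SESSIONS


print(unique_values_cap(30))                # 33333
print(unique_values_cap(60))                # 16666
print(ad_hoc_query_exceeds_limit(750_000))  # True
```

To use the second check, read the visits metric for the date range you intend to query and pass that number in as `sessions_in_range`.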
For data sampling, keep in mind that the larger the data set being sampled, the more reliable the estimate, and vice versa.
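One way to make this concrete is the textbook standard error of a proportion, which shrinks as the number of sampled sessions grows. The sketch below assumes a simple random sample, which this document does not confirm is the method Analytics uses, and the function name is hypothetical.

```python
import math


def proportion_standard_error(p: float, sample_size: int) -> float:
    """Standard error of an estimated proportion p from a simple random
    sample of the given size (assumption: simple random sampling)."""
    if not 0.0 <= p <= 1.0:
        raise ValueError("p must be between 0 and 1")
    if sample_size < 1:
        raise ValueError("sample_size must be at least 1")
    return math.sqrt(p * (1.0 - p) / sample_size)


# Estimating a 50% share from 10,000 versus 500,000 sampled sessions.
small = proportion_standard_error(0.5, 10_000)
large = proportion_standard_error(0.5, 500_000)
print(small > large)  # True: the larger sample gives the tighter estimate
```

Under this assumption, multiplying the sample size by 100 cuts the standard error by a factor of 10.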