Safe Browsing Update API (v3)

The Update API (previously referred to in the Protocol v3.0 Developer's Guide as the Safe Browsing API) is an experimental API that allows client applications to check URLs against Google's constantly updated blacklists of suspected phishing, malware, and unwanted software pages. Your client application can use the API to download an encrypted table for local, client-side lookups of URLs.

This document describes the capabilities of the Update API and provides code samples for interacting with the API by sending HTTP messages to download lists or perform lookups. It also includes information and examples on how to perform client-side lookups in a downloaded list.

Audience

This document is intended for programmers who want to use Google anti-phishing and anti-malware data to protect users from potentially malicious websites. It provides examples of basic data API interactions, guidelines for how the data may be accessed, and information on how the data can be used.

Overview

Google publishes phishing, malware, and unwanted software data in three separate blacklists (googpub-phish-shavar, goog-malware-shavar, and goog-unwanted-shavar). Each is a list of SHA-256 hash values that are usually truncated to a 4-byte hash prefix. The client MUST be able to handle both 4-byte prefixes and full 32-byte hashes.

The client should keep a local copy of the lists and consult them for every URL that is to be scanned or visited. The client should store the lists as it receives them and make no attempt at converting a hashed list to plaintext.

Clients MUST follow the protocol's requirements for update frequency, which are designed to prevent clients from overwhelming the service when demand is excessive or when the service is recovering from a failure. Clients MAY check for data less frequently than specified (e.g. for research purposes), but restrictions apply on how data may be used based on its age. Specifically, your application is not permitted to show warnings to end users unless they are based on current data, as defined in the Age of Data, Usage section. Also, if you do show warnings to end users, you must adhere to Google's guidelines for the warning text and provide appropriate attribution as discussed in End User Visible Warnings.

Key Differences From Previous Versions

Differences Between Version 2.2 and 3.0

Changes since version 2.2:

  • shavar chunk data is now encoded using Protocol Buffers for improved efficiency.
  • shavar chunks no longer include host keys. Clients should request full-length hashes anytime a hash prefix matches the URL, subject to the caching behavior described below.
  • The HTTP Response for Full-Length Hashes now includes optional metadata associated with each full hash.
  • The goog-malware-shavar list uses the new metadata functionality to distinguish between types of sites and allow for more informative warnings. See metadata contents and how they apply to warnings.
  • The caching semantics for full hashes have changed:
    • The HTTP Response for Full-Length Hashes now includes an expiration time in the response. As a result of this change, these requests will no longer return a 204 response.
    • Clients must clear cached full-length hashes each time they send an update request.
    • Clients migrating from version 2.2 to version 3.0 need to make sure that any previously-received full hashes follow these guidelines. These clients may need to clear their database if there is no other way to accomplish this.
    • Full-length hashes obtained in shavar add chunks must also be verified via a full-length hash request prior to showing a warning. As with hash prefixes, the response will include an expiration time that specifies how long it may be cached.
  • Message Authentication Code (MAC) support has been removed in favor of HTTPS. Requesting a MAC is not allowed when using pver=3.0, though it is still supported for older protocol versions. HTTPS is required for pver 3.0.
  • The format of API keys has changed. API keys are now managed through the Google Developers Console, as described in Getting Started. Note that the CGI parameter is now called key.
  • The URLs in the Reporting Incorrect Data section have been updated. Note that we now recommend using HTTPS for these URLs.

Differences Between Version 1 and Version 2

Version 1 of the update protocol is inefficient and not scalable. Caveats for version 1 of the protocol include:

  • It does not support partial list updates unless a client has a recent version of the list already fully downloaded. A new client must download the entire list of phishing entries at once or else it will never get any data. As a result, some clients using slow connections take a very long time to download the full list, the request times out, and they never download anything.
  • It sends phishing data to the client in oldest to newest order, which is inefficient for phishing sites since they have a very short lifetime.
  • Expiring old entries requires listing them in updates, which actually consumes bandwidth.
  • Clients only rarely find a match with any given listed pattern, so sending all the data is somewhat wasteful.

To address the above concerns, we have implemented a new version of the protocol, v2. The key differences are:

  • The list consists of a series of "chunks" rather than a single versioned list.
  • Updating now involves sending the list of chunks you have, and getting back a list of URLs that you should fetch for more data.
  • The majority of data in the chunks are 32-bit truncated hashes (the first 32 bits of a SHA-256 hash). When you find a match, you send this 32-bit fragment to Google and get back a full list of 256-bit hashes.

Getting Started

To interact with the Safe Browsing service, you need to enable the Safe Browsing API in the Google Developers Console and get an API key to authenticate as an API user. You pass this key as a CGI parameter in your HTTP requests to the lookup server:

https://safebrowsing.google.com/safebrowsing/...&key=SIzaVyOm19mrXxv-z80s-nC-G2XYH1-3hAtNlGh&...

Complete the following steps to enable the API and get an API key:
  1. Open the Google Developers Console API Library.
  2. From the project drop-down, select a project or create a new one.
  3. In the Google APIs tab, search for and select the Safe Browsing API, then click Enable API.
  4. Next, in the sidebar on the left select Credentials.
  5. Select the Create credentials drop-down, then choose API key.
  6. Depending on your application, from the Create a new key pop-up, select Browser key or Server key.
  7. Enter a name for the key, set up the optional referrers or IP addresses, then click Create. Your key is created and displayed in a pop-up window. The key is also listed on the Credentials page.

If you need more help, check out the Google Developers Console Help Center.

See HTTP Request for Data for information on how to start downloading Safe Browsing updates.

Protocol Basics

Version 3 of the update protocol has the following characteristics:

  • Each list type has one canonical list divided into chunks. Each chunk is assigned a unique identifier and describes entries to be added or removed from the blacklist.
  • Clients can recommend a preferred download size, but that request is not guaranteed to be honored by the server.
  • Clients inherently perform partial updates each time they connect, and the server will send the most valuable data to a client first. This could, for example, be the most recent data.
  • The chunk structure is determined by the list type. Currently, all of the lists contain hashed expressions.
  • Chunks that contain hash values do not necessarily contain the full hash; they are often only a prefix for that hash. A second request (a gethash request) can be issued to get the list of full-length hashes that start with the prefix.
  • Within each chunk, all hash prefixes are the same length, but different chunks may contain prefixes of different lengths.

As with the previous protocol, the v3 protocol supports many different blacklists or whitelists. List names are in the form "provider-type-format", such as "googpub-phish-shavar". Each item in a list will represent an expression that will match a malicious URL, but the exact format depends on the list type, and how the content is used is application-specific. Note that the rest of the specification will generally talk about lists in terms of blacklists, but the protocol itself is agnostic to the contents of the list. (See List Contents for details.)

The lists are divided into chunks, the smallest unit of data that will be sent to the client. This allows for supporting partial updates to all users, including new users, and allows for more flexibility in choosing which data to send the client. The actual chunk size is determined by the server.

There are two kinds of chunks:

  • "add" chunks contain new entries for the list.
  • "sub" chunks contain entries that need to be removed from the client's list.

Chunks are assigned a number, which is a sequence number for chunks of the same type. For example, for a given list, there will be:

  • "Add" chunk #1, "add" chunk #2,..., "add" chunk #N.
  • "Sub" chunk #1, "sub" chunk #2,..., "sub" chunk #M.
  • The total number of "add" and "sub" chunks will generally be different.
  • There is no chunk number 0. Chunk numbers start with 1.
  • Chunk numbers within the same chunk type are assigned in increasing order.

For a blacklist, "add" chunks contain the new hashes to add to the blacklist and "sub" chunks contain the false positives that need to be removed from the client's blacklist.

The server does not explicitly list all hashes that need to be removed. Instead, to save bandwidth, the server indicates which chunks need to be deleted by specifying a previously-seen "add" or "sub" chunk number.

Protocol Specification

The client-server exchange uses a simple pull model: the client connects regularly to the server and pulls updates. The data exchange can be summarized as follows:

  • The client sends an HTTP POST request to the server that specifies which lists it wants to download, which chunks it already has, and, optionally, the desired download size.
  • The server replies with an HTTP status code and an HTTP response. If there is any data, the response contains the chunk data URLs for the various requested lists.

Besides the data exchange, the server provides a way for the client to discover which lists are available.

R-BNF

This document uses an R-BNF notation to specify the format of requests and responses. This notation is a mix of Extended BNF and PCRE-style regular expressions:

  • Rules are in the form: name = definition. Rule names are referenced as-is in the definition. Angle brackets may be used to help facilitate discerning the use of rule names.
  • Literals are surrounded by quotation marks: "literal".
  • Sequences: (rule1 rule2) or simply rule1 rule2.
  • Alternative groups: (rule1 | rule2).
  • Optional groups: [rule].
  • Repetition: rule* means 0 or more of this rule or this group.
  • Repetition: rule+ means 1 or more of this rule or this group.

The following basic rules that describe the US-ASCII character set are also used as defined in RFC 2616:

  • UPALPHA = <any US-ASCII uppercase letter "A".."Z">
  • LOALPHA = <any US-ASCII lowercase letter "a".."z">
  • ALPHA = UPALPHA | LOALPHA
  • DIGIT = <any US-ASCII digit "0".."9">
  • UNRESERVED = ALPHA | DIGIT | "-" | "_" | "." | "!" | "~" | "*" | "'" | "(" | ")"
  • CR = <US-ASCII CR, carriage return (13)>
  • LF = <US-ASCII LF, line-feed (10)>
  • <"> = <US-ASCII double-quote mark (34)>
  • EOF = End of File / End of Stream

HTTP Request for List

Clients use this to discover the available list types.

Request's URL

The client performs a request by sending an HTTP POST request to the URL:

https://safebrowsing.google.com/safebrowsing/list?client=api&key=APIKEY&appver=CLIENTVER&pver=PVER

Required CGI parameters:

  • The client parameter indicates the type of client. Unless we tell you otherwise, this should be "api".
  • The key parameter indicates your API key.
  • The appver parameter specifies the version of the client, such as "1.5.2".
  • The pver parameter indicates the protocol version that the client supports. Currently, this should be "3.0". The format is "major.minor", which indicates the scope of changes from the previous version. When new major or minor revisions are announced, they may be accompanied by turndown dates for older revisions of the protocol.

Formal R-BNF description:

APIKEY    = UNRESERVED+
CLIENTVER = DIGIT ["." DIGIT]*
PVER      = DIGIT "." DIGIT

Example:

https://safebrowsing.google.com/safebrowsing/list?client=api&key=12345&appver=1.5.2&pver=3.0
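The request URL above can be assembled from its four required CGI parameters. The following is a minimal Python sketch, not an official client: the function name build_request_url and the BASE_URL constant are hypothetical, and the base URL is simply copied from the example in this document.

```python
from urllib.parse import urlencode

# Base URL copied from the example above (assumed to be unchanged).
BASE_URL = "https://safebrowsing.google.com/safebrowsing"


def build_request_url(endpoint, api_key, app_version, pver="3.0", client="api"):
    """Return the request URL for an endpoint such as "list"."""
    # A list of pairs keeps the parameters in the order used by this document.
    # urlencode percent-escapes any character that is not URL-safe.
    params = [
        ("client", client),
        ("key", api_key),
        ("appver", app_version),
        ("pver", pver),
    ]
    return f"{BASE_URL}/{endpoint}?{urlencode(params)}"


print(build_request_url("list", "12345", "1.5.2"))
```

The call at the end reproduces the example URL. Passing the endpoint as an argument is a design choice of this sketch, so that one helper can serve every request that takes the same parameters.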

Request Body

There is no body content for this request—any body data will be ignored by the server.

HTTP Response for List

The server reply consists of an HTTP status code and a response body.

Response Code

The server generates the following HTTP status codes:

  • 200: OK—Data is available in the HTTP response body.
  • 400: Bad Request—The HTTP request was not correctly formed. The client did not provide all required CGI parameters.
  • 401: Not Authorized—The client id is invalid.
  • 503: Service Unavailable—The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
  • 505: HTTP Version Not Supported—The server CANNOT handle the requested protocol major version.

Response Body

There is no data in the response body for codes in 3xx, 4xx, and 5xx.

The response body may be empty. When present, the response body contains the name of each list that this client can access. Formal R-BNF description of the response body:

BODY     = (LISTNAME LF)* EOF
LISTNAME = (LOALPHA | DIGIT)+ "-" LOALPHA+ "-" (LOALPHA | DIGIT)+

Example:

googpub-phish-shavar
goog-malware-shavar
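As an illustration, the response body can be split on LF and each line checked against the LISTNAME rule. This is a sketch under the assumption that the body has already been decoded to text; parse_list_response is a hypothetical helper name.

```python
import re

# LISTNAME = (LOALPHA | DIGIT)+ "-" LOALPHA+ "-" (LOALPHA | DIGIT)+
LISTNAME_RE = re.compile(r"[a-z0-9]+-[a-z]+-[a-z0-9]+")


def parse_list_response(body):
    """Return the list names found in a list response body."""
    names = []
    for line in body.split("\n"):
        if not line:
            # The body may be empty, and the final LF leaves an empty string.
            continue
        if LISTNAME_RE.fullmatch(line) is None:
            raise ValueError(f"unexpected list name: {line!r}")
        names.append(line)
    return names


print(parse_list_response("googpub-phish-shavar\ngoog-malware-shavar\n"))
```

Raising on a malformed name is a choice made for this sketch; a client could also skip such lines.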

HTTP Request for Data

Clients use this to get new data for known list types.

Request URL

The client performs a data request by sending an HTTP POST request to the URL:

https://safebrowsing.google.com/safebrowsing/downloads?client=api&key=APIKEY&appver=CLIENTVER&pver=PVER

Required CGI parameters:

  • The client parameter indicates the type of client. Unless we tell you otherwise, this should be "api".
  • The key parameter indicates your API key.
  • The appver parameter specifies the version of the client, such as "1.5.2".
  • The pver parameter indicates the protocol version that the client supports. Currently this should be "3.0". The format is "major.minor", which indicates the scope of changes from the previous version. When new major or minor revisions are announced, they may be accompanied by turndown dates for older revisions of the protocol.

Formal R-BNF description:

APIKEY    = UNRESERVED+
CLIENTVER = DIGIT ["." DIGIT]*
PVER      = DIGIT "." DIGIT

Example:

https://safebrowsing.google.com/safebrowsing/downloads?client=api&key=12345&appver=1.5.2&pver=3.0

Request Body

The request body is used to specify what the client has and wants:

  • The client optionally specifies the maximum size of the download it wants to retrieve.
  • The client specifies which lists it wants to retrieve.
  • For each list, the client specifies the chunk numbers it already has.

The format of the body is line-oriented. Lines are separated by LF. Lines that cannot be understood are ignored by the server.

Formal R-BNF description of the request body:

BODY      = [SIZE LF] (LIST LF)+ EOF
SIZE      = "s;" DIGIT+                            # Optional size, in kilobytes and >= 1
LIST      = LISTNAME ";" LISTINFO (":" LISTINFO)*
LISTINFO  = CHUNKTYPE ":" CHUNKLIST
LISTNAME  = (LOALPHA | DIGIT)+ "-" LOALPHA+ "-" (LOALPHA | DIGIT)+
CHUNKTYPE = "a" | "s"                              # 'Add' or 'Sub' chunks
CHUNKLIST = (RANGE | NUMBER) ["," CHUNKLIST]
NUMBER    = DIGIT+                                 # Chunk number >= 1
RANGE     = NUMBER "-" NUMBER

Note that the last line of the body MUST have a trailing line-feed.

Clients must collapse consecutive chunk numbers into a RANGE to reduce the request size.

The size request is optional. If present, the number indicates the ideal maximum response size, in kilobytes, that the server should return. The size is used as a hint by the server; the actual reply size may vary and could be larger or smaller than the ideal size specified by the client.

We strongly recommend that clients omit the size field unless they have a special need to limit the response size. Clients who are operating on a small bandwidth, such as a modem, may want to use the size field to limit the response size. However, doing so may cause the client to permanently lag behind. If unsure, clients should omit the size field and let the server decide the appropriate response size.

Example 1:

googpub-phish-shavar;a:1-3,5,8:s:4-5
acme-white-shavar;a:1-7:s:1-2

In this example, the client requests data for two lists. It then lists the chunks it already has for each list type.

Example 2:

s;200
googpub-phish-shavar;a:1-3,5,8:s:4-5
acme-white-shavar;a:1-7:s:1-2

In this example, the client requests a response size of 200 kilobytes for the two given lists. It then lists the chunks it already has for each list type.

Note that at first, the client has no data, so it has no chunk numbers to report. If a client does not have any chunks of a type, it should not list the corresponding chunk type. Example (inline comments start after a # and are not part of the protocol):

googpub-phish-shavar;a:1-5      # The client has 'add' chunks but no 'sub' chunks

acme-malware-shavar;           # The client has no data for this list.

Examples of good chunk lists:

googpub-phish-shavar;a:1-5,10,12:s:3-8
googpub-phish-shavar;a:1-5,10,12,15,16
googpub-phish-shavar;a:16-20,3-5,1

Examples of bad chunk lists:

googpub-phish-shavar              # Missing ; at end of list name
googpub-phish-shavar;a:1,2,3,4,5,10,12,15,16  # Chunk range is not collapsed
googpub-phish-shavar;5-7,16-10    # Missing 'a:' or 's:' for chunk type
googpub-phish-shavar;a:5-4,16-10  # Invalid range (high < low)
googpub-phish-shavar;a:5-7:s:     # Missing chunk numbers for 's:'
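The rules illustrated by the good and bad examples above (collapsed ranges, omitted empty chunk types, a trailing line-feed) can be sketched in Python as follows. The helper names format_chunklist and build_data_request_body, and the shape of the lists argument, are assumptions made for this illustration.

```python
def format_chunklist(numbers):
    """Collapse chunk numbers into a CHUNKLIST string, e.g. "1-3,5,8"."""
    nums = sorted(set(numbers))
    parts = []
    i = 0
    while i < len(nums):
        j = i
        # Extend the run while the next number is consecutive.
        while j + 1 < len(nums) and nums[j + 1] == nums[j] + 1:
            j += 1
        parts.append(str(nums[i]) if i == j else f"{nums[i]}-{nums[j]}")
        i = j + 1
    return ",".join(parts)


def build_data_request_body(lists, size_kb=None):
    """Build the request body.

    lists maps a list name to a pair (add_chunks, sub_chunks), each a
    sequence of chunk numbers the client already has.
    """
    if not lists:
        raise ValueError("the client must request at least one list")
    lines = []
    if size_kb is not None:
        lines.append(f"s;{size_kb}")
    for name, (adds, subs) in lists.items():
        infos = []
        # A chunk type with no chunks is left out entirely.
        if adds:
            infos.append("a:" + format_chunklist(adds))
        if subs:
            infos.append("s:" + format_chunklist(subs))
        lines.append(f"{name};" + ":".join(infos))
    # Every line, including the last one, ends with LF.
    return "".join(line + "\n" for line in lines)


body = build_data_request_body(
    {
        "googpub-phish-shavar": ([1, 2, 3, 5, 8], [4, 5]),
        "acme-white-shavar": ([1, 2, 3, 4, 5, 6, 7], [1, 2]),
    },
    size_kb=200,
)
print(body, end="")
```

The printed body matches Example 2 above. A list for which the client has no data is written as the list name followed by a bare semicolon.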

Server Behavior:

  • The server MUST reject a request with an empty body.
  • The server MUST ignore ill-formatted lines and MUST reply to the correctly formatted ones.
  • The server SHALL try to accommodate the desired response size. The requested size takes into account only chunk data, not any metadata.
  • However, if the desired size is smaller than a single chunk, the server MUST still send at least one chunk.

Client Behavior:

  • The client MUST request at least one list.

HTTP Response for Data

The server reply consists of an HTTP status code and a response body.

Response Code

The server generates the following HTTP status codes:

  • 200: OK—Data is available in the HTTP response body.
  • 400: Bad Request—The HTTP request was not correctly formed. The client did not provide all required CGI parameters or the body did not contain any meaningful entries.
  • 403: Forbidden—The client id or API key is invalid or unauthorized.
  • 503: Service Unavailable—The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
  • 505: HTTP Version Not Supported—The server CANNOT handle the requested protocol major version.

Response Body

The response body will not be present for codes in 4xx and 5xx.

When present, the response body contains the following information:

  • The next polling interval to use; that is, the number of seconds before the client should contact the server again.
  • For each list, its name followed by redirect URLs containing chunk data.

The response body is line-oriented. Formal R-BNF description of the response body:

BODY      = NEXT LF (RESET | (LIST LF)+) EOF
NEXT      = "n:" DIGIT+                               # Minimum delay before polling again in seconds
RESET     = "r:pleasereset"
LIST      = "i:" LISTNAME LF LISTDATA
LISTNAME  = (LOALPHA | DIGIT | "-")+                  # e.g. "googpub-phish-shavar"
LISTDATA  = ((REDIRECT_URL | ADDDEL-HEAD | SUBDEL-HEAD) LF)*
REDIRECT_URL = "u:" URL
URL       = Defined in RFC 1738 (the scheme is omitted; see below)
ADDDEL-HEAD  = "ad:" CHUNKLIST
SUBDEL-HEAD  = "sd:" CHUNKLIST
CHUNKLIST = (RANGE | NUMBER) ["," CHUNKLIST]
NUMBER    = DIGIT+                                    # Chunk number >= 1
RANGE     = NUMBER "-" NUMBER

A reset response from the server tells the client to clear out all current data in its database before sending another request.

The response doesn't actually contain the data associated with the lists; instead, it tells you where to find the data via redirect URLs. These URLs should be visited in the order that they are given, and if an error is encountered fetching any of the URLs, then the client must NOT fetch any URL after that. Parallel fetching is NOT allowed.

REDIRECT_URL, ADDDEL-HEAD, and SUBDEL-HEAD can be presented in any order. They can even be intermixed. Clients MUST NOT rely on any particular ordering.

The adddel and subdel chunks are used to expire previous add and sub chunks. Consequently, they have no associated chunk data. More than one chunk can be specified, either by listing each number, using a range, or a combination of both. When an add chunk is deleted, the client can delete the data associated with that chunk. Clients should no longer report that they have received that chunk. When a sub chunk is deleted, the client no longer needs to keep track of the removals for any unreceived add chunks, and no longer reports that it received that sub chunk in the past.

There may not be any chunks of a given type. In this case, no redirect URLs will contain the given chunk type, and there will be no adddels or subdels in the response.

The format for each redirect URL is a host and path, for example "example.com/redirect/123". The client should use HTTPS to fetch the URL.
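The line-oriented response body described above could be parsed along the following lines. This is a sketch, not a reference implementation: the function names and the layout of the returned dictionary are assumptions. Lines with unknown keywords are skipped, and a malformed chunk list raises ValueError so that the caller can discard the whole response.

```python
def parse_chunklist(text):
    """Expand a CHUNKLIST such as "1-2,4-5,7" into a list of chunk numbers."""
    numbers = []
    for part in text.split(","):
        if "-" in part:
            low, high = part.split("-")
            numbers.extend(range(int(low), int(high) + 1))
        else:
            numbers.append(int(part))
    return numbers


def parse_data_response(body):
    """Parse the response body (already decoded to text) into a dictionary."""
    result = {"next": None, "reset": False, "lists": {}}
    current = None
    for line in body.split("\n"):
        # Split at the first colon only, since URLs may contain colons.
        keyword, _, value = line.partition(":")
        if keyword == "n":
            result["next"] = int(value)
        elif keyword == "r" and value == "pleasereset":
            result["reset"] = True
        elif keyword == "i":
            current = {"urls": [], "adddel": [], "subdel": []}
            result["lists"][value] = current
        elif keyword == "u" and current is not None:
            # The scheme is omitted in the response; the client uses HTTPS.
            current["urls"].append("https://" + value)
        elif keyword == "ad" and current is not None:
            current["adddel"].extend(parse_chunklist(value))
        elif keyword == "sd" and current is not None:
            current["subdel"].extend(parse_chunklist(value))
        # Any other line is ignored.
    return result


sample = (
    "n:1200\n"
    "i:googpub-phish-shavar\n"
    "u:cache.google.com/first_redirect_example\n"
    "sd:1,2\n"
)
print(parse_data_response(sample))
```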

Formal R-BNF description of redirect response:

BODY      = (UINT32 CHUNKDATA)+
UINT32    = Unsigned 32-bit integer in network byte order.
CHUNKDATA = Encoded ChunkData protocol message, see below.

The format for add and sub chunks is exactly the same. A length is given first, which specifies the size of the ChunkData message that immediately follows.

Following this length is an encoded ChunkData protocol message, which is defined as follows:

// Chunk data encoding format for the shavar-proto list format.
message ChunkData {
  required int32 chunk_number = 1;

  // The chunk type is either an add or sub chunk.
  enum ChunkType {
    ADD = 0;
    SUB = 1;
  }
  optional ChunkType chunk_type = 2 [default = ADD];

  // Prefix type which currently is either 4B or 32B.  The default is set
  // to the prefix length, so it doesn't have to be set at all for most
  // chunks.
  enum PrefixType {
    PREFIX_4B = 0;
    FULL_32B = 1;
  }
  optional PrefixType prefix_type = 3 [default = PREFIX_4B];
  // Stores all SHA256 add or sub prefixes or full-length hashes. The number
  // of hashes can be inferred from the length of the hashes string and the
  // prefix type above.
  optional bytes hashes = 4;

  // Sub chunks also encode one add chunk number for every hash stored above.
  repeated int32 add_numbers = 5 [packed = true];
}

See the protocol buffers documentation for details on how to generate language-specific bindings from this message definition, which can be used to parse the message.
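Before the generated bindings can parse anything, the redirect response has to be split into its length-prefixed messages. The sketch below covers only that framing step and the splitting of a decoded hashes field into fixed-size entries. Decoding the ChunkData message itself is left to the protocol buffer bindings, and both helper names are hypothetical.

```python
import struct


def split_chunk_messages(body):
    """Split a redirect response into the raw ChunkData messages it contains."""
    messages = []
    offset = 0
    while offset < len(body):
        if offset + 4 > len(body):
            raise ValueError("truncated length field")
        # UINT32 in network byte order (big-endian).
        (length,) = struct.unpack_from(">I", body, offset)
        offset += 4
        if offset + length > len(body):
            raise ValueError("truncated ChunkData message")
        messages.append(body[offset:offset + length])
        offset += length
    return messages


def split_hashes(hashes, prefix_len):
    """Split a hashes field into entries of prefix_len bytes (4 or 32)."""
    if len(hashes) % prefix_len != 0:
        raise ValueError("hashes length is not a multiple of the prefix length")
    return [hashes[i:i + prefix_len] for i in range(0, len(hashes), prefix_len)]


# Two framed messages with placeholder payloads (not real ChunkData).
framed = struct.pack(">I", 3) + b"abc" + struct.pack(">I", 2) + b"xy"
print(split_chunk_messages(framed))
print(split_hashes(bytes.fromhex("1122334455667788"), 4))
```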

Add and sub chunks can be presented in any order. They can even be intermixed. The order of the chunks depends on the implementation of the server and the clients MUST NOT rely on any empirical behavior. Moreover, the sequence order in which chunks of the same type are present in the stream is not guaranteed.

A chunk's hashes may be empty. In this case, the prefix size will still be set, but will have no meaning. Chunks may be given this way to prevent fragmentation of chunk numbers and reduce request size.

In the case of an empty add chunk, it's possible that the client has or will receive a sub chunk that contains an expression that points to the empty add. In this case, the client is allowed to drop the sub expression.

The client may receive an empty chunk after previously receiving a non-empty version of the same chunk number. In this situation, no action is needed by the client. The prefix size of the empty chunk may not match the originally received chunk.

A sub chunk may refer to an add chunk that the client has not yet received. In this situation, the client must keep track of the pending removal, and apply it if the referenced add chunk is received in the future.
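The handling of pending removals and empty chunks described above can be illustrated with a small in-memory store. This is a simplified sketch under several assumptions: the PrefixStore class and its methods are hypothetical, chunks are assumed to be already decoded, and expiring chunks through adddel and subdel is not covered.

```python
class PrefixStore:
    """In-memory store of add prefixes that tracks pending sub entries."""

    def __init__(self):
        self.add_chunks = {}        # add chunk number -> set of prefixes
        self.sub_chunks = set()     # sub chunk numbers received so far
        self.pending_subs = set()   # (add chunk number, prefix) not yet applied

    def apply_add(self, chunk_number, prefixes):
        if chunk_number in self.add_chunks:
            # Already received (for example, an empty repeat): no action.
            return
        # Skip entries that an earlier sub chunk already removed.
        kept = {p for p in prefixes if (chunk_number, p) not in self.pending_subs}
        self.add_chunks[chunk_number] = kept
        # Pending entries for this add chunk are now applied or moot
        # (the latter when the add chunk is empty), so drop them.
        self.pending_subs = {
            key for key in self.pending_subs if key[0] != chunk_number
        }

    def apply_sub(self, chunk_number, prefixes, add_numbers):
        self.sub_chunks.add(chunk_number)
        # add_numbers holds one add chunk number for every prefix.
        for prefix, add_number in zip(prefixes, add_numbers):
            if add_number in self.add_chunks:
                self.add_chunks[add_number].discard(prefix)
            else:
                # The add chunk has not arrived yet; remember the removal.
                self.pending_subs.add((add_number, prefix))

    def contains(self, prefix):
        return any(prefix in chunk for chunk in self.add_chunks.values())


store = PrefixStore()
first, second = bytes.fromhex("11223344"), bytes.fromhex("55667788")
store.apply_sub(3, [first], [7])       # refers to add chunk 7, not yet received
store.apply_add(7, [first, second])    # the pending removal is applied here
print(store.contains(first), store.contains(second))
```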

Example:

n:1200
i:googpub-phish-shavar
u:cache.google.com/first_redirect_example
sd:1,2
i:acme-white-shavar
u:cache.google.com/second_redirect_example
ad:1-2,4-5,7
sd:2-6

Contents of first_redirect_example (contents shown in text format for demonstration purposes):

ChunkData <
  chunk_number: 4
  // chunk_type not set, default value of ADD
  // prefix_type not set, default value of PREFIX_4B
  hashes: 0x1122334455667788  // 2 4-byte hash prefixes
>
ChunkData <
  chunk_number: 3
  chunk_type: SUB
  // prefix_type not set, default value of PREFIX_4B
  hashes: 0x1212343445456767  // 2 4-byte hash prefixes
  add_numbers: 3 4            // an add chunk number for each prefix
>
ChunkData <
  chunk_number: 6
  // chunk_type not set, default value of ADD
  // prefix_type not set, default value of PREFIX_4B
  // empty hashes
>

Contents of second_redirect_example:

ChunkData <
  chunk_number: 10
  // chunk_type not set, default value of ADD
  // prefix_type not set, default value of PREFIX_4B
  hashes: 0x0011998800119977  // 2 4-byte hash prefixes
>

In this example, there are no adddel chunks for the "googpub-phish-shavar" list, and there are no sub chunks for the "acme-white-shavar" redirect response.

Server Behavior:

  • The server CAN change the "next" value (i.e. "n:" line) for each response.

Client Behavior:

  • The client MUST respect the "next" value and not contact the server again until the specified delay has expired. See the Request Frequency section for more information on how often the server can be contacted after replying with an HTTP error code.
  • The client MUST ignore a line starting with a keyword that it doesn't understand.
  • If a redirect request returns an error code, the client MUST perform backoff behavior as indicated in the Request Frequency section.
  • A client MUST perform a download request again if a redirect request returns an error.
  • The client SHOULD keep all data delivered prior to a bad request.
  • The client MUST refuse to use the whole response if any of the adddel and subdel metadata headers, or the encoded chunk data, cannot be parsed successfully.
  • Upon successful decoding of the entire response and all the binary data, the client MUST update its lists in an atomic fashion.
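The fetching rules for redirect URLs (in order, one at a time, stopping at the first error) can be sketched as follows. The fetch argument is a hypothetical caller-supplied function that returns the response body or raises an exception; the HTTP transport and the backoff timing are outside this sketch.

```python
def fetch_redirects(urls, fetch):
    """Fetch redirect URLs one at a time, in the order given.

    Returns (bodies, error): the bodies fetched before the first failure, and
    the exception that stopped the sequence (None if every fetch succeeded).
    """
    bodies = []
    for url in urls:
        try:
            bodies.append(fetch(url))
        except Exception as error:
            # Stop here: no URL after a failed one may be fetched.
            return bodies, error
    return bodies, None


def fake_fetch(url):
    """Stand-in for a real HTTPS fetch, used only for this demonstration."""
    if url.endswith("/bad"):
        raise IOError("simulated HTTP error")
    return b"data for " + url.encode("ascii")


bodies, error = fetch_redirects(
    ["https://example.com/a", "https://example.com/bad", "https://example.com/c"],
    fake_fetch,
)
print(len(bodies), error)
```

When an error is returned, the caller would enter backoff and later repeat the download request, while the bodies fetched before the failure remain available.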

List Contents

The content of each chunk depends on its list type. Currently, the possible lists are:

  • googpub-phish-shavar: A list of hashed suffix/prefix expressions representing sites that should be blocked, because they are hosting or redirecting to phishing pages.
  • goog-malware-shavar: A list of hashed suffix/prefix expressions representing sites that should be blocked, because they are hosting or redirecting to malware.
  • goog-unwanted-shavar: A list of hashed suffix/prefix expressions representing sites that should be blocked, because they are hosting or redirecting to unwanted software pages.

The "shavar" (short for "Variable-length SHA256") list type relies on suffix/prefix expressions. Each of the suffix/prefix expressions consists of a host suffix (or full host) and a path prefix (or full path). The path prefix consists of full path components. If the expression contains the full path, there may optionally be query parameters appended to the path.

Examples:

Suffix/prefix expression        Equivalent regular expression
a.b/mypath/                     http\:\/\/.*\.a\.b\/mypath\/.*
c.d/full/path.html?myparam=a    http\:\/\/.*\.c\.d\/full\/path\.html\?myparam=a

For a more complete description of suffix/prefix expressions, see the Suffix/Prefix Expression Lookup section.

shavar List Format

For the "shavar" list format, hash prefixes are used to reduce bandwidth. A hash prefix is some number of the most significant bytes of a full-length, 256-bit hash. Each ChunkData message contains zero or more hash prefixes, and indicates the length of the hash prefixes in that chunk.

Examples of "shavar" hashes based on the examples from FIPS-180-2:

  • Example B1:
    • Input is "abc".
    • SHA-256 digest is ba7816bf 8f01cfea 414140de 5dae2223 b00361a3 96177a9c b410ff61 f20015ad.
    • The 32-bit hash prefix is ba7816bf.
  • Example B2:
    • Input is "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq".
    • SHA-256 digest is 248d6a61 d20638b8 e5c02693 0c3e6039 a33ce459 64ff2167 f6ecedd4 19db06c1.
    • The 48-bit hash prefix is 248d6a61 d206.

Unit test you can use to validate the hash prefix computation (in C++; TruncatedSha256Prefix is the function under test, and the signature declared here is an assumption):

#include <cassert>
#include <cstddef>
#include <string>
using std::string;

// Function under test, supplied by your client: returns the first
// (bits / 8) bytes of the SHA-256 digest of input.
string TruncatedSha256Prefix(const string& input, int bits);

// Compares byte by byte as unsigned values. Plain char may be signed, in
// which case bytes >= 0x80 would never equal the expected values.
static void CheckPrefix(const string& actual, const unsigned char* expected,
                        size_t size) {
  assert(actual.size() == size);
  for (size_t i = 0; i < size; i++)
    assert((unsigned char)actual[i] == expected[i]);
}

int main() {
  // Example B1 from FIPS-180-2
  const unsigned char expected1[] = { 0xba, 0x78, 0x16, 0xbf };
  CheckPrefix(TruncatedSha256Prefix("abc", 32), expected1, 4);  // 4 bytes == 32 bits

  // Example B2 from FIPS-180-2
  string input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
  const unsigned char expected2[] = { 0x24, 0x8d, 0x6a, 0x61, 0xd2, 0x06 };
  CheckPrefix(TruncatedSha256Prefix(input2, 48), expected2, 6);

  // Example B3 from FIPS-180-2
  string input3(1000000, 'a');  // 'a' repeated a million times
  const unsigned char expected3[] = { 0xcd, 0xc7, 0x6e, 0x5c, 0x99, 0x14, 0xfb, 0x92,
                                      0x81, 0xa1, 0xc7, 0xe2 };
  CheckPrefix(TruncatedSha256Prefix(input3, 96), expected3, 12);
  return 0;
}
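For comparison, one possible implementation of the prefix computation in Python is shown below, using the standard hashlib module. The function name truncated_sha256_prefix is hypothetical, and the sketch assumes that prefix lengths are whole bytes.

```python
import hashlib


def truncated_sha256_prefix(data, bits):
    """Return the most significant bits // 8 bytes of SHA-256(data)."""
    if bits % 8 != 0 or not 0 < bits <= 256:
        raise ValueError("bits must be a multiple of 8 between 8 and 256")
    return hashlib.sha256(data).digest()[: bits // 8]


# Example B1 from FIPS-180-2: the 32-bit prefix of SHA-256("abc").
print(truncated_sha256_prefix(b"abc", 32).hex())
```

The input is passed as bytes, so a caller working with text has to encode it first.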

HTTP Request for Full-Length Hashes

A client may request the list of full-length hashes for a hash prefix. This usually occurs when a client is about to download content from a URL whose calculated hash starts with a prefix listed in a blacklist. See the Lookup section for details.

Request URL

The client performs a data request by sending an HTTP POST request to the URL:

https://safebrowsing.google.com/safebrowsing/gethash?client=api&key=APIKEY&appver=CLIENTVER&pver=PVER

Required CGI parameters:

  • The client parameter indicates the type of client. Unless we tell you otherwise, this should be "api".
  • The key parameter indicates your API key.
  • The appver parameter specifies the version of the client, such as "1.5.2".
  • The pver parameter indicates the protocol version that the client supports. Currently this should be "3.0". The format is "major.minor", which indicates the scope of changes from the previous version. When new major or minor revisions are announced, they may be accompanied by turndown dates for older revisions of the protocol.

Formal R-BNF description:

APIKEY    = UNRESERVED+
CLIENTVER = DIGIT ["." DIGIT]*
PVER      = DIGIT "." DIGIT

Example:

https://safebrowsing.google.com/safebrowsing/gethash?client=api&key=12345&appver=1.5.2&pver=3.0

Client Behavior:

  • The client MUST specify the client, appver, and pver CGI parameters. If client is "api" (it should be unless we explicitly tell you otherwise), you MUST also specify the key parameter.

Request Body

The request body specifies the list of hash prefixes for which the client should receive full-length hashes.

Formal R-BNF description of the request body:

BODY       = HEADER LF PREFIXES EOF
HEADER     = PREFIXSIZE ":" LENGTH
PREFIXSIZE = DIGIT+         # Size of each prefix in bytes
LENGTH     = DIGIT+         # Size of PREFIXES in bytes
PREFIXES   = <LENGTH number of unsigned bytes>  # PREFIXSIZE prefixes in binary

PREFIXES is the concatenation of the requested hash prefixes, each PREFIXSIZE bytes long, so LENGTH is a multiple of PREFIXSIZE. Note that the server returns the full-length hashes for every prefix in the request that matches. There may be 0 or more matches for each prefix given.
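The request body grammar above can be serialized along the following lines. This is a sketch only: BuildGetHashBody is a hypothetical helper name, and the prefixes are assumed to be passed as raw binary strings of equal length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

// Serializes hash prefixes into a gethash request body:
//   PREFIXSIZE ":" LENGTH LF PREFIXES
// Each prefix holds raw bytes and must be exactly prefix_size bytes long.
std::string BuildGetHashBody(const std::vector<std::string>& prefixes,
                             std::size_t prefix_size) {
  std::string payload;
  for (const std::string& prefix : prefixes) {
    if (prefix.size() != prefix_size) {
      throw std::invalid_argument("prefix has the wrong size");
    }
    payload += prefix;
  }
  // LENGTH is the total byte count of PREFIXES, not the number of prefixes.
  return std::to_string(prefix_size) + ":" + std::to_string(payload.size()) +
         "\n" + payload;
}
```

For two 4-byte prefixes the header line is "4:8", followed by an LF and the 8 raw prefix bytes.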

HTTP Response for Full-Length Hashes

The server replies using the status code and response body of the HTTP response. No specific HTTP headers are set by the server—some HTTP headers MAY be present but are not authoritative.

Response Code

The server generates the following HTTP status codes:

  • 200: OK—Data is available in the HTTP response body.
  • 400: Bad Request—The HTTP request was not correctly formed. The client did not provide all required CGI parameters.
  • 403: Forbidden—The client id is invalid.
  • 503: Service Unavailable—The server cannot handle the request. Clients MUST follow the backoff behavior specified in the Request Frequency section.
  • 505: HTTP Version Not Supported—The server CANNOT handle the requested protocol major version.

Response Body

The response body is not present for 4xx and 5xx status codes.

When present, the response body contains the following information:

  • The cache lifetime of the response.
  • The number of hash entries starting with the requested prefix in decimal.
  • The matching full-length hashes.
  • Optional metadata for the full-length hashes.

Formal R-BNF description of the response body:

BODY          = CACHELIFETIME LF HASHENTRY* EOF
CACHELIFETIME = DIGIT+
HASHENTRY     = LISTNAME ":" HASHSIZE ":" NUMRESPONSES [":m"] LF HASHDATA (METADATALEN LF METADATA)*
HASHSIZE      = DIGIT+                          # Length of each full hash
NUMRESPONSES  = DIGIT+                          # Number of full hashes in HASHDATA
HASHDATA      = <HASHSIZE*NUMRESPONSES number of unsigned bytes>  # Full length hashes in binary
METADATALEN   = DIGIT+                          # Length of METADATA
METADATA      = <METADATALEN number of unsigned bytes>  # Parsing depends on list. See below.

HASHDATA is grouped by LISTNAME. Clients must ignore any unrecognized list names in the response.

CACHELIFETIME specifies for how many seconds this response is valid.

Each hash in the response MUST be cached as a full-length hash for the requested prefix in the list indicated in the response, until either the cache lifetime elapses or the client restarts.

If there is a valid cached response for a hash prefix, then the client MUST NOT send any further full-length hash requests for that prefix.

Metadata contains additional list-specific information. If ":m" is present in the response header, then a metadata entry will be included for each full hash in HASHDATA. The metadata entries are provided in the same order as the corresponding full hashes; for example, the second metadata entry corresponds to the second full hash. The metadata entries are variable-sized, so each metadata entry includes its length (METADATALEN), followed by a newline, then the metadata payload. See Full Hash Metadata for a further description of how to interpret the metadata. Metadata MUST be cached along with the full-length hashes in the response.

If no hashes start with the requested prefix, the response body will contain only the cache lifetime. This is expected and may occur if a client has not yet downloaded an update to a list that deletes the requested prefix. In this situation, the client should still cache the response and refrain from sending further full-length hash requests for this prefix until the cache lifetime has expired.

Example Responses (within each example, line breaks indicate an LF byte):

600
googpub-phish-shavar:32:1
01234567890123456789012345678901

This response contains a single 32-byte full hash (01234567890123456789012345678901) in googpub-phish-shavar, with no metadata. The cache lifetime is 10 minutes.

900
goog-malware-shavar:32:2:m
01234567890123456789012345678901987654321098765432109876543210982
AA3
BBBgoogpub-phish-shavar:32:1
01234567890123456789012345678901

This response contains 2 32-byte full hashes (01234567890123456789012345678901 and 98765432109876543210987654321098) in goog-malware-shavar. The first entry has metadata "AA", the second has metadata "BBB". Because HASHDATA and METADATA are length-delimited rather than LF-terminated, the first metadata length ("2") directly follows the second hash, the second metadata length ("3") directly follows "AA", and the next list header directly follows "BBB". The response also contains a single 32-byte full hash (01234567890123456789012345678901) in googpub-phish-shavar, with no metadata. The cache lifetime for all entries is 15 minutes.

900

This response indicates that no full hashes matched the given prefix. The empty result should be cached for 15 minutes.
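The response grammar can be parsed with a small length-driven reader. The sketch below is one possible approach, not a reference implementation; the type and function names (FullHash, GetHashResponse, ParseGetHashResponse) are hypothetical. It returns entries for every list in the response, so the caller is still expected to drop unrecognized list names.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>
#include <vector>

struct FullHash {
  std::string list_name;
  std::string hash;      // HASHSIZE raw bytes
  std::string metadata;  // empty unless the entry carried the ":m" flag
};

struct GetHashResponse {
  unsigned long cache_lifetime_seconds;
  std::vector<FullHash> hashes;
};

// Returns the text up to the next LF and moves pos past that LF.
static std::string ReadLine(const std::string& body, std::size_t& pos) {
  const std::size_t lf = body.find('\n', pos);
  if (lf == std::string::npos) throw std::runtime_error("missing LF");
  std::string line = body.substr(pos, lf - pos);
  pos = lf + 1;
  return line;
}

// Returns exactly n raw bytes and moves pos past them.
static std::string ReadBytes(const std::string& body, std::size_t& pos,
                             std::size_t n) {
  if (n > body.size() - pos) throw std::runtime_error("truncated body");
  std::string bytes = body.substr(pos, n);
  pos += n;
  return bytes;
}

GetHashResponse ParseGetHashResponse(const std::string& body) {
  GetHashResponse response;
  std::size_t pos = 0;
  response.cache_lifetime_seconds = std::stoul(ReadLine(body, pos));
  while (pos < body.size()) {
    // Header: LISTNAME ":" HASHSIZE ":" NUMRESPONSES [":m"]
    const std::string header = ReadLine(body, pos);
    const std::size_t c1 = header.find(':');
    const std::size_t c2 = header.find(':', c1 + 1);
    if (c1 == std::string::npos || c2 == std::string::npos) {
      throw std::runtime_error("malformed entry header");
    }
    const std::size_t c3 = header.find(':', c2 + 1);
    const std::string list_name = header.substr(0, c1);
    const std::size_t hash_size =
        std::stoul(header.substr(c1 + 1, c2 - c1 - 1));
    // When there is no third colon, substr clamps the count to the line end.
    const std::size_t count = std::stoul(header.substr(c2 + 1, c3 - c2 - 1));
    const bool has_metadata =
        c3 != std::string::npos && header.substr(c3) == ":m";
    const std::size_t first = response.hashes.size();
    for (std::size_t i = 0; i < count; ++i) {
      response.hashes.push_back(
          {list_name, ReadBytes(body, pos, hash_size), ""});
    }
    if (has_metadata) {
      // One METADATALEN LF METADATA record per hash, in the same order.
      for (std::size_t i = 0; i < count; ++i) {
        const std::size_t length = std::stoul(ReadLine(body, pos));
        response.hashes[first + i].metadata = ReadBytes(body, pos, length);
      }
    }
  }
  return response;
}
```

Applied to the second example response, this yields three entries: two for goog-malware-shavar with metadata "AA" and "BBB", and one for googpub-phish-shavar with empty metadata.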

Full Hash Metadata

This section describes the Safe Browsing lists that include accompanying metadata, and how to parse that metadata. Any lists not mentioned here do not currently return metadata.

goog-malware-shavar

The metadata for the goog-malware-shavar list is an encoded protocol buffer, as follows:

message MalwarePatternType {
  enum PATTERN_TYPE {
    LANDING = 1;
    DISTRIBUTION = 2;
  }

  required PATTERN_TYPE pattern_type = 1;
}

PATTERN_TYPE is used to target end-user warnings more precisely. See User Visible Warnings.
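A production client would normally decode this metadata with code generated by the protocol buffer compiler. Purely as an illustration of what the payload looks like on the wire, the sketch below hand-decodes the simplest encoding of the message: field 1 with the varint wire type is the tag byte 0x08, followed by the enum value. The names PatternType and DecodePatternType are hypothetical, and any other encoding (for example, one with extra fields) is reported as unknown.

```cpp
#include <string>

enum class PatternType { kUnknown = 0, kLanding = 1, kDistribution = 2 };

// Decodes the two-byte encoding of MalwarePatternType: tag 0x08 (field 1,
// varint wire type) followed by the enum value. Anything else is kUnknown.
PatternType DecodePatternType(const std::string& metadata) {
  if (metadata.size() == 2 &&
      static_cast<unsigned char>(metadata[0]) == 0x08) {
    switch (static_cast<unsigned char>(metadata[1])) {
      case 1: return PatternType::kLanding;
      case 2: return PatternType::kDistribution;
      default: break;
    }
  }
  return PatternType::kUnknown;
}
```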

Request Frequency

In order to ensure high availability of the API, Google limits the frequency of client requests. This is handled differently depending on the type of request.

HTTP Request for Data

When requesting a download of data from the server, two mechanisms are available to control request frequency:

  • In its response, the server gives an update interval; that is, the delay in seconds before the next connection attempt should occur.
  • The client watches for timeouts or HTTP errors (specifically HTTP response codes 3xx, 4xx or 5xx) from the server. If too many errors occur, it increases the time between requests. For example, a request returning an error code may be repeated 2 times in 2 minutes, and then not again for 30-60 minutes.

Client Behavior:

  • The first request for data MUST happen at a random interval between 0 and 5 minutes after the client starts.
  • After that, each update MUST happen at the update interval last specified by the server.

Client Behavior on error or timeout:

  • If the client receives an error during update, it MUST try again in one minute.
  • If it receives two errors in a row, it MUST continue to skip updates for a period of time defined by the following formula: 30 minutes * (rand + 1), where rand is a random number between 0 and 1. Thus, depending on the value of rand, the client will skip updates for 30-60 minutes.
  • If it receives another (3rd) error, it MUST skip updates for double the length of time. Thus, depending on the value of rand, the client will skip updates for 60-120 minutes.
  • If it receives another (4th) error, it MUST skip updates for double the length of time. Thus, depending on the value of rand, the client will skip updates for 120-240 minutes.
  • If it then receives another (5th) error, it MUST skip updates for double the length of time. Thus, depending on the value of rand, the client will skip updates for 240-480 minutes.
  • For every error after that, it SHOULD continue to check once every 480 minutes until the server responds with a success message.
  • Once the client receives a successful HTTP reply, the error count is reset.
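The schedule above can be expressed as a function of the number of consecutive errors. This is a sketch under stated assumptions: UpdateBackoffMinutes is a hypothetical name, rand01 is a caller-supplied random number between 0 and 1, and a return value of 0 simply means "no backoff, use the server's update interval".

```cpp
// Returns how many minutes to wait before the next update attempt after
// `consecutive_errors` failed attempts in a row.
double UpdateBackoffMinutes(int consecutive_errors, double rand01) {
  if (consecutive_errors <= 0) return 0.0;    // no error: no backoff
  if (consecutive_errors == 1) return 1.0;    // first error: retry in 1 minute
  if (consecutive_errors >= 6) return 480.0;  // later errors: every 480 minutes
  double minutes = 30.0 * (rand01 + 1.0);     // second error: 30-60 minutes
  for (int i = 2; i < consecutive_errors; ++i) {
    minutes *= 2.0;                           // errors 3, 4, 5: double each time
  }
  return minutes;
}
```

With rand01 = 0.5, errors 2 through 5 give 45, 90, 180, and 360 minutes. The counter would be reset to zero on the first successful reply.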

HTTP Request for Full-Length Hashes

Clients should follow the caching requirements described in HTTP Response for Full-Length Hashes - Response Body. In addition, clients should handle errors and timeouts as follows:

  • If a client receives 2 errors within 5 minutes, it enters backoff mode.
  • After this point, if the client receives one non-error response, or the last error occurred at least 8 hours ago, it exits backoff mode.
  • While in backoff mode, the client MUST NOT ping for at least a certain amount of time from the last error. This time grows exponentially up to a maximum of 2 hours.
  • When the client receives the first error, it MUST NOT ping for at least 30 minutes from the last error.
  • If it receives another error, the client MUST NOT ping for at least 1 hour.
  • If it receives another error, the client MUST NOT ping for at least 2 hours.
  • After that, the client MUST wait at least 2 hours between pings.
  • The client has two options for tracking the granularity of errors. It can treat any error during a request for a full length hash equally, triggering backoff mode as specified above. Or it can track errors separately by unique hash prefix; that is, only gethash requests for that particular hash prefix should be skipped for the length of time specified above, extending with each additional error as specified.
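One way to track this state is sketched below. It reflects one reading of the rules and makes several assumptions: the class name GetHashBackoff and its methods are hypothetical, time is passed in as minutes from an arbitrary clock, the error that triggers backoff mode is counted as the "first error" (30-minute wait), and a successful response also clears the record of the last error. A client that tracks errors per hash prefix could keep one such object per prefix.

```cpp
#include <algorithm>

class GetHashBackoff {
 public:
  // Records an error observed at time now_min (in minutes).
  void OnError(long now_min) {
    if (in_backoff_) {
      ++backoff_errors_;
    } else if (last_error_min_ >= 0 && now_min - last_error_min_ <= 5) {
      in_backoff_ = true;  // second error within 5 minutes
      backoff_errors_ = 1;
    }
    last_error_min_ = now_min;
  }

  // Records a non-error response, which ends backoff mode.
  void OnSuccess() {
    in_backoff_ = false;
    backoff_errors_ = 0;
    last_error_min_ = -1;
  }

  // Returns true if a gethash request may be sent at time now_min.
  bool MayRequest(long now_min) {
    if (!in_backoff_) return true;
    const long since_last_error = now_min - last_error_min_;
    if (since_last_error >= 8 * 60) {  // last error at least 8 hours ago
      OnSuccess();
      return true;
    }
    return since_last_error >= WaitMinutes();
  }

 private:
  // 30 minutes, then 1 hour, then capped at 2 hours.
  long WaitMinutes() const {
    const int doublings = std::min(backoff_errors_ - 1, 2);
    return 30L << doublings;
  }

  bool in_backoff_ = false;
  int backoff_errors_ = 0;
  long last_error_min_ = -1;
};
```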

Performing Lookups

Canonicalization

Before lookup in any list, the URL must be canonicalized.

We assume that the client has parsed the URL and made it valid according to RFC 2396. If the URL uses an internationalized domain name (IDN), it should be converted to the ASCII Punycode representation. The URL must include a path component; that is, a URL with an empty path must end with a slash (for example, 'http://google.com/').

First, remove tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Do not remove escape sequences for these characters (e.g. '%0a').

If the URL ends in a fragment, remove the fragment. For example, shorten 'http://google.com/#frag' to 'http://google.com/'.

Next, repeatedly percent-unescape the URL until it has no more percent-escapes.
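The repeated unescaping step can be sketched as a single-pass decoder applied until the string stops changing. The function names are hypothetical. A '%' that is not followed by two hex digits is left as a literal character, which is what the canonicalization tests later in this section appear to require for inputs such as "%%".

```cpp
#include <cstddef>
#include <string>

static int HexValue(char c) {
  if (c >= '0' && c <= '9') return c - '0';
  if (c >= 'a' && c <= 'f') return c - 'a' + 10;
  if (c >= 'A' && c <= 'F') return c - 'A' + 10;
  return -1;
}

// Decodes every %XX escape once; malformed escapes are copied unchanged.
static std::string UnescapeOnce(const std::string& in) {
  std::string out;
  for (std::size_t i = 0; i < in.size(); ++i) {
    if (in[i] == '%' && i + 2 < in.size()) {
      const int high = HexValue(in[i + 1]);
      const int low = HexValue(in[i + 2]);
      if (high >= 0 && low >= 0) {
        out += static_cast<char>(high * 16 + low);
        i += 2;
        continue;
      }
    }
    out += in[i];
  }
  return out;
}

// Unescapes until a pass changes nothing. Every pass that changes the
// string also shortens it, so the loop terminates.
std::string UnescapeFully(std::string url) {
  for (;;) {
    std::string next = UnescapeOnce(url);
    if (next == url) return url;
    url = next;
  }
}
```

For example, "%25%32%35" becomes "%25" after the first pass and "%" after the second; the final escaping step later turns that back into "%25".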

To canonicalize the hostname:

Extract the hostname from the URL and then:

  1. Remove all leading and trailing dots.
  2. Replace consecutive dots with a single dot.
  3. If the hostname can be parsed as an IP address, normalize it to 4 dot-separated decimal values. The client should handle any legal IP address encoding, including octal, hex, and fewer than 4 components.
  4. Lowercase the whole string.
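The four hostname steps might be implemented as follows. This is a sketch, with hypothetical function names, that handles IPv4 only and assumes the conventional interpretation of short forms, in which the last component fills all remaining bytes (so "1.2" means 1.0.0.2). Lowercasing is done before the IP check here, which gives the same result as the order listed above.

```cpp
#include <cctype>
#include <cstddef>
#include <cstdint>
#include <cstdlib>
#include <string>
#include <vector>

// Parses one IP component written in decimal, octal (leading "0"), or
// hexadecimal (leading "0x"). Returns false if it is not such a number.
static bool ParseIpComponent(const std::string& text, std::uint64_t& value) {
  if (text.empty() || !std::isdigit(static_cast<unsigned char>(text[0]))) {
    return false;
  }
  char* end = nullptr;
  value = std::strtoull(text.c_str(), &end, 0);  // base 0 detects the radix
  return *end == '\0';
}

// Returns the dotted-decimal form of host if it is a legal IPv4 encoding
// with 1 to 4 components, or an empty string otherwise.
static std::string NormalizeIp(const std::string& host) {
  std::vector<std::string> parts;
  std::size_t start = 0;
  for (;;) {
    const std::size_t dot = host.find('.', start);
    parts.push_back(host.substr(start, dot - start));
    if (dot == std::string::npos) break;
    start = dot + 1;
  }
  if (parts.size() > 4) return "";
  std::uint64_t address = 0;
  for (std::size_t i = 0; i < parts.size(); ++i) {
    std::uint64_t value = 0;
    if (!ParseIpComponent(parts[i], value)) return "";
    // Every component but the last is one byte; the last fills the rest.
    const bool last = (i + 1 == parts.size());
    const unsigned bits = last ? 8 * (4 - static_cast<unsigned>(i)) : 8;
    if (value >= (std::uint64_t{1} << bits)) return "";
    address = (address << bits) | value;
  }
  std::string out;
  for (int shift = 24; shift >= 0; shift -= 8) {
    if (!out.empty()) out += '.';
    out += std::to_string((address >> shift) & 0xFF);
  }
  return out;
}

std::string CanonicalizeHost(const std::string& host) {
  std::string out;
  for (char c : host) {
    // Skipping a dot at the start or after another dot removes leading
    // dots and collapses consecutive dots in one pass.
    if (c == '.' && (out.empty() || out.back() == '.')) continue;
    out += static_cast<char>(std::tolower(static_cast<unsigned char>(c)));
  }
  if (!out.empty() && out.back() == '.') out.pop_back();  // trailing dot
  const std::string ip = NormalizeIp(out);
  return ip.empty() ? out : ip;
}
```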

To canonicalize the path:

  • The sequences "/../" and "/./" in the path should be resolved by replacing "/./" with "/", and removing "/../" along with the preceding path component.
  • Replace runs of consecutive slashes with a single slash character.

Do not apply these path canonicalizations to the query parameters.
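The path rules can be sketched by splitting the path into segments. CanonicalizePath is a hypothetical name; the function expects the path without its query string, and it treats a trailing "/.." or "/." like "/../" or "/./", which matches the test case for "http://www.google.com/blah/.." later in this section.

```cpp
#include <cstddef>
#include <string>
#include <vector>

// Canonicalizes a URL path (query string excluded): resolves "." and ".."
// segments and collapses runs of slashes.
std::string CanonicalizePath(const std::string& path) {
  std::vector<std::string> kept;
  bool ends_with_slash = false;
  std::size_t start = 0;
  while (start <= path.size()) {
    std::size_t slash = path.find('/', start);
    if (slash == std::string::npos) slash = path.size();
    const std::string segment = path.substr(start, slash - start);
    if (segment.empty() || segment == ".") {
      ends_with_slash = true;  // empty segments come from "//" or a final "/"
    } else if (segment == "..") {
      if (!kept.empty()) kept.pop_back();  // drop the preceding component
      ends_with_slash = true;
    } else {
      kept.push_back(segment);
      ends_with_slash = false;
    }
    start = slash + 1;
  }
  std::string out = "/";
  for (std::size_t i = 0; i < kept.size(); ++i) {
    if (i > 0) out += '/';
    out += kept[i];
  }
  if (ends_with_slash && !kept.empty()) out += '/';
  return out;
}
```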

In the URL, percent-escape all characters that are <= ASCII 32, >= 127, "#", or "%". The escapes should use uppercase hex characters.
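The final escaping rule is a simple byte-wise transformation. A minimal sketch, with the hypothetical name EscapeUrl, applied to the already unescaped URL:

```cpp
#include <string>

// Percent-escapes bytes <= 32, bytes >= 127, '#', and '%', using uppercase
// hex digits. All other bytes are copied unchanged.
std::string EscapeUrl(const std::string& url) {
  static const char kHex[] = "0123456789ABCDEF";
  std::string out;
  for (unsigned char c : url) {
    if (c <= 32 || c >= 127 || c == '#' || c == '%') {
      out += '%';
      out += kHex[c >> 4];
      out += kHex[c & 15];
    } else {
      out += static_cast<char>(c);
    }
  }
  return out;
}
```

Because the fragment has been removed and the URL fully unescaped by this point, any remaining '#' or '%' is a literal character and is escaped as "%23" or "%25".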

Below are tests to help validate a canonicalization implementation.

Canonicalize("http://host/%25%32%35") = "http://host/%25";
Canonicalize("http://host/%25%32%35%25%32%35") = "http://host/%25%25";
Canonicalize("http://host/%2525252525252525") = "http://host/%25";
Canonicalize("http://host/asdf%25%32%35asd") = "http://host/asdf%25asd";
Canonicalize("http://host/%%%25%32%35asd%%") = "http://host/%25%25%25asd%25%25";
Canonicalize("http://www.google.com/") = "http://www.google.com/";
Canonicalize("http://%31%36%38%2e%31%38%38%2e%39%39%2e%32%36/%2E%73%65%63%75%72%65/%77%77%77%2E%65%62%61%79%2E%63%6F%6D/") = "http://168.188.99.26/.secure/www.ebay.com/";
Canonicalize("http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/") = "http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/";
Canonicalize("http://host%23.com/%257Ea%2521b%2540c%2523d%2524e%25f%255E00%252611%252A22%252833%252944_55%252B") = "http://host%23.com/~a!b@c%23d$e%25f^00&11*22(33)44_55+";
Canonicalize("http://3279880203/blah") = "http://195.127.0.11/blah";
Canonicalize("http://www.google.com/blah/..") = "http://www.google.com/";
Canonicalize("www.google.com/") = "http://www.google.com/";
Canonicalize("www.google.com") = "http://www.google.com/";
Canonicalize("http://www.evil.com/blah#frag") = "http://www.evil.com/blah";
Canonicalize("http://www.GOOgle.com/") = "http://www.google.com/";
Canonicalize("http://www.google.com.../") = "http://www.google.com/";
Canonicalize("http://www.google.com/foo\tbar\rbaz\n2") = "http://www.google.com/foobarbaz2";
Canonicalize("http://www.google.com/q?") = "http://www.google.com/q?";
Canonicalize("http://www.google.com/q?r?") = "http://www.google.com/q?r?";
Canonicalize("http://www.google.com/q?r?s") = "http://www.google.com/q?r?s";
Canonicalize("http://evil.com/foo#bar#baz") = "http://evil.com/foo";
Canonicalize("http://evil.com/foo;") = "http://evil.com/foo;";
Canonicalize("http://evil.com/foo?bar;") = "http://evil.com/foo?bar;";
Canonicalize("http://\x01\x80.com/") = "http://%01%80.com/";
Canonicalize("http://notrailingslash.com") = "http://notrailingslash.com/";
Canonicalize("http://www.gotaport.com:1234/") = "http://www.gotaport.com:1234/";
Canonicalize("  http://www.google.com/  ") = "http://www.google.com/";
Canonicalize("http:// leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("http://%20leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("%20leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("https://www.securesite.com/") = "https://www.securesite.com/";
Canonicalize("http://host.com/ab%23cd") = "http://host.com/ab%23cd";
Canonicalize("http://host.com//twoslashes?more//slashes") = "http://host.com/twoslashes?more//slashes";

Suffix/Prefix Expression Lookup

Currently, all valid list types rely on suffix/prefix expressions, as described in List Contents. To perform a lookup for a given URL, the client will try to form different possible host suffix and path prefix combinations and see whether they match each list. Depending on the list type, the suffix/prefix combination may be hashed before lookup. These lookups only use the host and path components of the URL. The scheme, username, password, and port are disregarded. If the URL includes query parameters, the client will include a lookup with the full path and query parameters.

For the hostname, the client will try at most 5 different strings. They are:

  • the exact hostname in the URL
  • up to 4 hostnames formed by starting with the last 5 components and successively removing the leading component. The top-level domain can be skipped. These additional hostnames should not be checked if the host is an IP address.

For the path, the client will also try at most 6 different strings. They are:

  • the exact path of the URL, including query parameters
  • the exact path of the URL, without query parameters
  • the 4 paths formed by starting at the root (/) and successively appending path components, including a trailing slash.

The following examples illustrate the lookup behavior:

For the URL http://a.b.c/1/2.html?param=1, the client will try these possible strings:

a.b.c/1/2.html?param=1
a.b.c/1/2.html
a.b.c/
a.b.c/1/
b.c/1/2.html?param=1
b.c/1/2.html
b.c/
b.c/1/

For the URL http://a.b.c.d.e.f.g/1.html, the client will try these possible strings:

a.b.c.d.e.f.g/1.html
a.b.c.d.e.f.g/
(Note: skip b.c.d.e.f.g, since we'll take only the last 5 hostname components, and the full hostname)
c.d.e.f.g/1.html
c.d.e.f.g/
d.e.f.g/1.html
d.e.f.g/
e.f.g/1.html
e.f.g/
f.g/1.html
f.g/

For the URL http://1.2.3.4/1/, the client will try these possible strings:

1.2.3.4/1/
1.2.3.4/
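The enumeration illustrated by these examples can be sketched as follows. The function names are hypothetical; the host, path, and query are assumed to be already canonicalized, the query is passed with its leading '?' (or as an empty string), and the caller says whether the host is an IP address. Duplicate paths are emitted only once.

```cpp
#include <algorithm>
#include <cstddef>
#include <string>
#include <vector>

// The exact host, plus up to 4 suffixes taken from the last 5 components.
// A suffix always keeps at least 2 components, so the bare TLD is skipped.
std::vector<std::string> HostSuffixes(const std::string& host, bool is_ip) {
  std::vector<std::string> out{host};
  if (is_ip) return out;
  std::vector<std::size_t> starts{0};  // offset of each component
  for (std::size_t i = 0; i < host.size(); ++i) {
    if (host[i] == '.') starts.push_back(i + 1);
  }
  const std::size_t n = starts.size();
  for (std::size_t i = (n > 5 ? n - 5 : 1); i + 2 <= n; ++i) {
    out.push_back(host.substr(starts[i]));
  }
  return out;
}

// The exact path with and without the query, plus up to 4 directory
// prefixes starting at the root.
std::vector<std::string> PathPrefixes(const std::string& path,
                                      const std::string& query) {
  std::vector<std::string> out;
  auto add = [&out](const std::string& s) {
    if (std::find(out.begin(), out.end(), s) == out.end()) out.push_back(s);
  };
  if (!query.empty()) add(path + query);
  add(path);
  std::size_t pos = 0;
  for (int i = 0; i < 4; ++i) {
    const std::size_t slash = path.find('/', pos);
    if (slash == std::string::npos) break;
    add(path.substr(0, slash + 1));
    pos = slash + 1;
  }
  return out;
}

// Every host suffix combined with every path prefix.
std::vector<std::string> LookupExpressions(const std::string& host,
                                           const std::string& path,
                                           const std::string& query,
                                           bool is_ip) {
  std::vector<std::string> out;
  for (const std::string& h : HostSuffixes(host, is_ip)) {
    for (const std::string& p : PathPrefixes(path, query)) {
      out.push_back(h + p);
    }
  }
  return out;
}
```

LookupExpressions("a.b.c", "/1/2.html", "?param=1", false) produces the eight strings of the first example, in the order listed.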

Age of Data, Usage

Applications that retrieve data using the API must never use data older than what is specified by the service. Specifically, a warning can only be shown if a URL matches a full-length hash obtained in a response to an HTTP Request for Full-Length Hashes, and the cached response is still valid as described in HTTP Response for Full-Length Hashes - Response Body at the time a warning is to be shown.

Important: Under no other circumstances may a warning be shown.

Acceptable Usage in Clients

Please note that if you violate the requirements detailed in Acceptable Usage in Clients, your key may be disabled for a period of time.

Usage Restrictions

A single API key can make requests for up to 10,000 clients per 24-hour period.

We limit the number of different clients you can support with a single API key. If you expect that more than 10,000 distinct clients per day will request updates, you must contact us to have your API key provisioned for additional capacity. We want to make sure that we have contact information for large users that may potentially affect the service and its availability. At the present time there is no cost for this. For further questions about large deployments, contact antiphish-malware-cap-req@google.com.

User Visible Warnings

If you use the Update API to warn users about risks from particular webpages, we require that you follow certain guidelines. These guidelines help protect both you and Google from misunderstandings by making clear that the page is not known with 100% certainty to be a phishing site or a distributor of malware or unwanted software, and that the warnings merely identify possible risk.

  • In your user visible warning, you may not lead users to believe that the page in question is, without a doubt, a phishing page or a page that distributes malware or unwanted software. When you refer to the page being identified or the potential risks it may pose to users, you must qualify the warning using terms such as: suspected, potentially, possible, likely, may be.
  • Your warning must enable the user to learn more by reviewing information at http://www.antiphishing.org/ (for phishing warnings), http://www.stopbadware.org/ (for malware warnings), or https://www.google.com/about/company/unwanted-software-policy.html (for unwanted software warnings).
  • When you show warnings for pages identified as risky by the Update API, you must give attribution to Google by including the line "Advisory provided by Google," with a link to the Safe Browsing Advisory. If your product also shows warnings based on other sources, you may not include the Google attribution in warnings derived from non-Google data.

Types of Malware sites

As of API version 3.0, goog-malware-shavar lists two different types of Malware sites: Landing sites and Distribution sites. Landing sites are gateways to malware. They are often hacked sites that include iframes, scripts, or redirects that load content from other sites that launch the actual attacks. Distribution sites are the sites that launch the attacks.

An API client that uses the Update API to show warnings in browsers can leverage this data to tailor which warnings to show, and when to show them. Such clients should show warnings in the following circumstances:

  • A user browses directly to a page on a site of either type.
  • A user browses to a page that includes any resource from a Distribution site.
  • A user browses to a page that uses frames to include content from a Landing site.

Unwanted Software sites

As of our update on March 9, 2015, goog-unwanted-shavar lists landing sites for unwanted software. Landing sites are gateways as described above. For more information on unwanted software, please see our Unwanted Software Policy.

Suggested warning language

We encourage you to copy this warning language into your product, or to modify it slightly to fit your product.

Warning—Suspected phishing page. This page may be a forgery or imitation of another website, designed to trick users into sharing personal or financial information. Entering any personal information on this page may result in identity theft or other abuse. You can find out more about phishing from www.antiphishing.org.

Warning—Visiting this web site may harm your computer. This page appears to contain malicious code that could be downloaded to your computer without your consent. You can learn more about harmful web content including viruses and other malicious code and how to protect your computer at StopBadware.org.

Warning—The site ahead may contain harmful programs. Attackers might attempt to trick you into installing programs that harm your browsing experience (for example, by changing your homepage or showing extra ads on sites you visit). You can learn more about unwanted software at our Unwanted Software Policy.

Notice to Users About Phishing, Malware, and Unwanted Software Protection

Our Terms of Service require that if you indicate to users that your service provides malware, phishing, or unwanted software protection, you must also let them know that the protection is not perfect. This notice must be visible to them before they enable the protection, and it must let them know that there is a chance of both false positives (safe sites flagged as risky) and false negatives (risky sites not flagged). We suggest using the following language:

Google works to provide the most accurate and up-to-date phishing, malware, and unwanted software information. However, Google cannot guarantee that its information is comprehensive and error-free: some risky sites may not be identified, and some safe sites may be identified in error.

References