URLs and Hashing

Page Summary

Safe Browsing clients need to compute hash prefixes of URLs to check against Safe Browsing lists by canonicalizing the URL, creating suffix/prefix expressions, computing full-length hashes, and then computing hash prefixes.
Canonicalization involves steps like removing fragments, percent-unescaping, normalizing hostnames and paths, and percent-escaping specific characters.
Suffix/prefix expressions are generated from host and path combinations, with up to 5 host variations and 6 path variations.
Full-length SHA256 hashes are calculated for each suffix/prefix expression, and can be validated using provided unit tests.
Hash prefixes are the most significant 4-32 bytes of the full-length SHA256 hash, used for efficient lookups in Safe Browsing lists.

This document applies to the following method: Update API (v4): fullHashes.find.

Overview

The Safe Browsing lists consist of variable length SHA256 hashes (see List Contents). To check a URL against a Safe Browsing list (either locally or on the server), clients must first compute the hash prefix of that URL.

To compute the hash prefix of a URL, follow these steps:

Canonicalize the URL (see Canonicalization).
Create the suffix/prefix expressions for the URL (see Suffix/Prefix Expressions).
Compute the full-length hash for each suffix/prefix expression (see Hash Computations).
Compute the hash prefix for each full-length hash (see Hash Prefix Computations).

Note that these steps mirror the process the Safe Browsing server uses to maintain the Safe Browsing lists.

Canonicalization

To begin, we assume that the client has parsed the URL and made it valid according to RFC 2396. If the URL uses an internationalized domain name (IDN), the client should convert the URL to the ASCII Punycode representation. The URL must include a path component; that is, it must have a trailing slash ("http://google.com/").

First, remove tab (0x09), CR (0x0d), and LF (0x0a) characters from the URL. Do not remove escape sequences for these characters (e.g. '%0a').

Second, if the URL ends in a fragment, remove the fragment. For example, shorten "http://google.com/#frag" to "http://google.com/".

Third, repeatedly percent-unescape the URL until it has no more percent-escapes.

To canonicalize the hostname:

Extract the hostname from the URL and then:

Remove all leading and trailing dots.
Replace consecutive dots with a single dot.
If the hostname can be parsed as an IP address, normalize it to 4 dot-separated decimal values. The client should handle any legal IP-address encoding, including octal, hex, and fewer than four components.
Lowercase the whole string.

To canonicalize the path:

Resolve the sequences "/../" and "/./" in the path by replacing "/./" with "/", and removing "/../" along with the preceding path component.
Replace runs of consecutive slashes with a single slash character.

Do not apply these path canonicalizations to the query parameters.

In the URL, percent-escape all characters that are <= ASCII 32, >= 127, "#", or "%". The escapes should use uppercase hex characters.

Below are tests to help validate a canonicalization implementation.

Canonicalize("http://host/%25%32%35") = "http://host/%25";
Canonicalize("http://host/%25%32%35%25%32%35") = "http://host/%25%25";
Canonicalize("http://host/%2525252525252525") = "http://host/%25";
Canonicalize("http://host/asdf%25%32%35asd") = "http://host/asdf%25asd";
Canonicalize("http://host/%%%25%32%35asd%%") = "http://host/%25%25%25asd%25%25";
Canonicalize("http://www.google.com/") = "http://www.google.com/";
Canonicalize("http://%31%36%38%2e%31%38%38%2e%39%39%2e%32%36/%2E%73%65%63%75%72%65/%77%77%77%2E%65%62%61%79%2E%63%6F%6D/") = "http://168.188.99.26/.secure/www.ebay.com/";
Canonicalize("http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/") = "http://195.127.0.11/uploads/%20%20%20%20/.verify/.eBaysecure=updateuserdataxplimnbqmn-xplmvalidateinfoswqpcmlx=hgplmcx/";
Canonicalize("http://host%23.com/%257Ea%2521b%2540c%2523d%2524e%25f%255E00%252611%252A22%252833%252944_55%252B") = "http://host%23.com/~a!b@c%23d$e%25f^00&11*22(33)44_55+";
Canonicalize("http://3279880203/blah") = "http://195.127.0.11/blah";
Canonicalize("http://www.google.com/blah/..") = "http://www.google.com/";
Canonicalize("www.google.com/") = "http://www.google.com/";
Canonicalize("www.google.com") = "http://www.google.com/";
Canonicalize("http://www.evil.com/blah#frag") = "http://www.evil.com/blah";
Canonicalize("http://www.GOOgle.com/") = "http://www.google.com/";
Canonicalize("http://www.google.com.../") = "http://www.google.com/";
Canonicalize("http://www.google.com/foo\tbar\rbaz\n2") ="http://www.google.com/foobarbaz2";
Canonicalize("http://www.google.com/q?") = "http://www.google.com/q?";
Canonicalize("http://www.google.com/q?r?") = "http://www.google.com/q?r?";
Canonicalize("http://www.google.com/q?r?s") = "http://www.google.com/q?r?s";
Canonicalize("http://evil.com/foo#bar#baz") = "http://evil.com/foo";
Canonicalize("http://evil.com/foo;") = "http://evil.com/foo;";
Canonicalize("http://evil.com/foo?bar;") = "http://evil.com/foo?bar;";
Canonicalize("http://\x01\x80.com/") = "http://%01%80.com/";
Canonicalize("http://notrailingslash.com") = "http://notrailingslash.com/";
Canonicalize("http://www.gotaport.com:1234/") = "http://www.gotaport.com/";
Canonicalize("  http://www.google.com/  ") = "http://www.google.com/";
Canonicalize("http:// leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("http://%20leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("%20leadingspace.com/") = "http://%20leadingspace.com/";
Canonicalize("https://www.securesite.com/") = "https://www.securesite.com/";
Canonicalize("http://host.com/ab%23cd") = "http://host.com/ab%23cd";
Canonicalize("http://host.com//twoslashes?more//slashes") = "http://host.com/twoslashes?more//slashes";

Suffix/prefix expressions

Once the URL is canonicalized, the next step is to create the suffix/prefix expressions. Each suffix/prefix expression consists of a host suffix (or full host) and a path prefix (or full path) as shown in these examples.

Suffix/Prefix Expression	Equivalent Regular Expression
a.b/mypath/	http\:\/\/.\.a\.b\/mypath\/.
c.d/full/path.html?myparam=a	http\:\/\/.*.c\.d\/full\/path\.html?myparam=a

The client will form up to 30 different possible host suffix and path prefix combinations. These combinations use only the host and path components of the URL. The scheme, username, password, and port are discarded. If the URL includes query parameters, then at least one combination will include the full path and query parameters.

For the host, the client will try at most five different strings. They are:

The exact hostname in the URL.
Up to four hostnames formed by starting with the last five components and successively removing the leading component. The top-level domain can be skipped. These additional hostnames should not be checked if the host is an IP address.

For the path, the client will try at most six different strings. They are:

The exact path of the URL, including query parameters.
The exact path of the URL, without query parameters.
The four paths formed by starting at the root (/) and successively appending path components, including a trailing slash.

The following examples illustrate the check behavior:

For the URL http://a.b.c/1/2.html?param=1, the client will try these possible strings:

a.b.c/1/2.html?param=1
a.b.c/1/2.html
a.b.c/
a.b.c/1/
b.c/1/2.html?param=1
b.c/1/2.html
b.c/
b.c/1/

For the URL http://a.b.c.d.e.f.g/1.html, the client will try these possible strings:

a.b.c.d.e.f.g/1.html
a.b.c.d.e.f.g/
(Note: skip b.c.d.e.f.g, since we'll take only the last five hostname components, and the full hostname)
c.d.e.f.g/1.html
c.d.e.f.g/
d.e.f.g/1.html
d.e.f.g/
e.f.g/1.html
e.f.g/
f.g/1.html
f.g/

For the URL http://1.2.3.4/1/, the client will try these possible strings:

1.2.3.4/1/
1.2.3.4/

Hash computations

Once the set of suffix/prefix expressions has been created, the next step is to compute the full-length SHA256 hash for each expression. A unit test (in pseudo-C) you can use to validate your hash computations is provided below.

Examples from FIPS-180-2:

Unit Test (in pseudo-C)

// Example B1 from FIPS-180-2
string input1 = "abc";
string output1 = TruncatedSha256Prefix(input1, 32);
int expected1[] = { 0xba, 0x78, 0x16, 0xbf };
assert(output1.size() == 4);  // 4 bytes == 32 bits
for (int i = 0; i < output1.size(); i++) assert(output1[i] == expected1[i]);

// Example B2 from FIPS-180-2
string input2 = "abcdbcdecdefdefgefghfghighijhijkijkljklmklmnlmnomnopnopq";
string output2 = TruncatedSha256Prefix(input2, 48);
int expected2[] = { 0x24, 0x8d, 0x6a, 0x61, 0xd2, 0x06 };
assert(output2.size() == 6);
for (int i = 0; i < output2.size(); i++) assert(output2[i] == expected2[i]);

// Example B3 from FIPS-180-2
string input3(1000000, 'a');  // 'a' repeated a million times
string output3 = TruncatedSha256Prefix(input3, 96);
int expected3[] = { 0xcd, 0xc7, 0x6e, 0x5c, 0x99, 0x14, 0xfb, 0x92,
                    0x81, 0xa1, 0xc7, 0xe2 };
assert(output3.size() == 12);
for (int i = 0; i < output3.size(); i++) assert(output3[i] == expected3[i]);

Hash prefix computations

Finally, the client needs to compute the hash prefix for each full-length SHA256 hash. For Safe Browsing, a hash prefix consists of the most significant 4-32 bytes of a SHA256 hash.