Friday, June 3, 2022
This week, we introduced an algorithmic improvement that identifies documents where the title element is written in a different language or script from its content, and chooses a title that matches the language and script of the document. This is based on the general principle that a document’s title should be written in the language or script of its primary contents. It's one of the reasons why we might go beyond title elements for web result titles.
Multilingual titles repeat the same phrase in two different languages or scripts. The most common pattern is to append an English version to the original title text.
गीतांजलि की जीवनी - Geetanjali Biography in Hindi
In this example, the title consists of two parts (divided by a hyphen) that express the same content in different languages (Hindi and English). While the title is in both languages, the document itself is written only in Hindi. Our system detects such an inconsistency and might use only the Hindi headline text, like:
गीतांजलि की जीवनी
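To see what a bilingual title like this looks like programmatically, here is a minimal Python sketch for checking your own pages. It is not how our systems work internally. The function names (`dominant_script`, `pick_title_segments`), the `" - "` separator, and the stand-in page text are all assumptions made for illustration. The script label is a rough heuristic based on the first word of each character's Unicode name, because Python's standard library does not expose the Unicode Script property directly.

```python
import unicodedata
from collections import Counter


def dominant_script(text):
    """Return a rough script label (e.g. 'LATIN', 'DEVANAGARI'), or None."""
    counts = Counter()
    for ch in text:
        # Count only letters (L*) and combining marks (M*);
        # digits, spaces, and punctuation are skipped.
        if unicodedata.category(ch)[0] not in ("L", "M"):
            continue
        name = unicodedata.name(ch, "")
        if name:
            # Heuristic: the first word of the character name
            # approximates the script.
            counts[name.split(" ")[0]] += 1
    return counts.most_common(1)[0][0] if counts else None


def pick_title_segments(title, page_text, separator=" - "):
    """Keep only the title segments whose script matches the page's."""
    page_script = dominant_script(page_text)
    parts = [p.strip() for p in title.split(separator) if p.strip()]
    matching = [p for p in parts if dominant_script(p) == page_script]
    # Fall back to the original title when no segment matches
    # or when every segment already matches.
    if not matching or len(matching) == len(parts):
        return title
    return separator.join(matching)


title = "गीतांजलि की जीवनी - Geetanjali Biography in Hindi"
page_text = "जिस देश में होली खेली जाती है"  # stand-in for a Hindi-only page body
print(pick_title_segments(title, page_text))
```

Note that this sketch compares scripts only. It cannot tell apart two languages that share a script, such as English and French.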
Latin scripted titles
Transliteration is when content in one language is written using the script or alphabet of a different language. For example, consider a page title for a song written in Hindi but transliterated to use Latin characters rather than Hindi’s native Devanagari script:
jis desh me holi kheli jati hai
In such a case, our system tries to find an alternative title using the script that’s predominant on the page, which in this case could be:
जिस देश में होली खेली जाती है
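If you want to audit your own pages for this kind of mismatch, a rough check might look like the following Python sketch. It is an illustration under stated assumptions, not a description of our systems: `script_counts`, `title_script_mismatch`, and the `0.5` threshold are hypothetical, and the script label is approximated from the first word of each character's Unicode name.

```python
import unicodedata
from collections import Counter


def script_counts(text):
    """Count letters and combining marks per approximate script label."""
    counts = Counter()
    for ch in text:
        if unicodedata.category(ch)[0] in ("L", "M"):
            name = unicodedata.name(ch, "")
            if name:
                # Heuristic: first word of the Unicode name, e.g. 'LATIN'.
                counts[name.split(" ")[0]] += 1
    return counts


def title_script_mismatch(title, page_text, threshold=0.5):
    """Return True when under `threshold` of the title uses the page's main script."""
    page = script_counts(page_text)
    head = script_counts(title)
    if not page or not head:
        return False  # nothing to compare
    main_script = page.most_common(1)[0][0]
    share = head[main_script] / sum(head.values())
    return share < threshold


page_text = "जिस देश में होली खेली जाती है"  # stand-in for the page's main content
print(title_script_mismatch("jis desh me holi kheli jati hai", page_text))
print(title_script_mismatch("जिस देश में होली खेली जाती है", page_text))
```

The first call flags the transliterated Latin title as a mismatch, while the second, with the Devanagari title, does not.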
In general, our systems tend to use the title element of the page. In cases with multilingual or transliterated titles, our systems may seek alternatives that match the predominant language of the page. This is why it's good practice to provide a title that matches the language and/or the script of the page's main content.