Improve search quality

Search quality refers to the quality of search results in terms of ranking and recall as perceived by the user making the search query.

Ranking refers to the ordering of items and recall refers to the number of relevant items retrieved. An item (also referred to as a document) is any piece of digital content that Google Cloud Search can index. Types of items include Microsoft Office documents, PDF files, rows in a database, unique URLs, and so on. An item consists of:

  • Structured metadata
  • Indexable content
  • ACLs

Cloud Search uses a variety of signals to retrieve and rank search query results (the items returned for a search query). You can influence these signals through settings in the schema, the item's content and metadata (during indexing), and the search application. The goal of this document is to help you improve search quality by modifying these signal influencers.

For a summary of recommended and optional settings, refer to Summary of recommended and optional search quality settings.

Influence topicality score

Topicality refers to the relevance of a search result to the original query terms. Topicality of an item is calculated based on the following criteria:

  • The importance of each query term.
  • The number of hits (the number of times a query term appears in the item’s content or metadata).
  • The type of match that the query terms, and their variants, have with an item indexed in Cloud Search.

To influence a text property's topicality score, define the RetrievalImportance on the text property in your schema. A match on a property with high RetrievalImportance results in a higher score compared to a match on a property with low RetrievalImportance.

For example, suppose you have a data source with the following characteristics:

  • The data source is used to store history for software bugs.
  • Each bug has a name, description, and priority.

Most users would query this data source using the bug name, so you would set the RetrievalImportance on the name to HIGHEST in the schema.

Conversely, most users are unlikely to query this data source using the bug description, so set the RetrievalImportance on the description to DEFAULT. Following is a sample schema containing RetrievalImportance settings; the bug name is stored in the summary property:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "description",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "label",
          "isRepeatable": true,
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "comments",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "project",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGH"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        ...
      ]
    }
  ]
}
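
As a rough illustration, a schema like the one above could be assembled programmatically before it is uploaded. The sketch below is only one possible way to organize this: the helper names text_property and build_schema are hypothetical, and the set of allowed importance values is limited to those that appear in this document (the API may accept others).

```python
import json

# Importance values that appear in this document; the API may accept others.
KNOWN_IMPORTANCE = {"HIGHEST", "HIGH", "DEFAULT"}


def text_property(name, importance="DEFAULT", repeatable=False):
    """Build one text property definition with a retrievalImportance setting."""
    if importance not in KNOWN_IMPORTANCE:
        raise ValueError(f"unexpected importance: {importance}")
    prop = {
        "name": name,
        "textPropertyOptions": {
            "retrievalImportance": {"importance": importance}
        },
    }
    if repeatable:
        prop["isRepeatable"] = True
    return prop


def build_schema(object_name, properties):
    """Wrap property definitions in a single object definition."""
    return {
        "objectDefinitions": [
            {"name": object_name, "propertyDefinitions": properties}
        ]
    }


schema = build_schema(
    "issues",
    [
        text_property("summary", "HIGHEST"),
        text_property("description"),
        text_property("label", repeatable=True),
        text_property("project", "HIGH"),
    ],
)
print(json.dumps(schema, indent=2))
```

Serializing with json.dumps guarantees that enum values such as HIGHEST are emitted as quoted strings.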

In the case of HTML documents, tags such as <title> and <h1>, along with formatting settings such as font size and bolding, are used to determine the importance of various terms. If the ContentFormat is TEXT, ItemContent has DEFAULT retrieval importance; if it is HTML, its retrieval importance is determined by these HTML properties.

Influence freshness

Freshness measures how recently an item has been modified and is determined by the createTime and updateTime properties in the ItemMetadata. Fresher items get a ranking boost.

It is possible to influence how freshness is computed for an object by adjusting the freshnessProperty and freshnessDuration of FreshnessOptions in the schema.

The freshnessProperty allows you to use a date or timestamp property for computing freshness instead of the default updateTime.

In our previous example of a software bug tracking system, the due date could be used as a freshnessProperty such that items with a due date closest to the current date are considered “fresher” and obtain a ranking boost. Following is a sample schema containing freshnessProperty settings:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "options": {
        "freshnessOptions": {
          "freshnessProperty": "duedate"
        }
      },
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        ...
      ]
    }
  ]
}

Use the freshnessDuration to identify when an item is considered out-of-date. For example, you may have a data source that is not indexed regularly or for which you do not want freshness to influence the ranking. You can achieve this goal by specifying a high value for freshnessDuration.

Suppose you have a data source with employee profile information. In this scenario, you might want a high freshnessDuration because changes to employee information are often not relevant to the ranking of the employee. Following is a sample schema that sets freshnessDuration to 315360000s (3,650 days, or about 10 years):

{
  "objectDefinitions": [
    {
      "name": "people",
      "options": {
        "freshnessOptions": {
          "freshnessDuration": "315360000s"
        }
      }
    }
  ]
}

You can also set freshnessDuration to a very small value for data sources whose content changes rapidly, such as a data source containing news articles. In this scenario, the most recently created or modified documents are most relevant. Following is a sample schema that sets freshnessDuration to 259200s (3 days) for a data source containing rapidly changing content:

{
  "objectDefinitions": [
    {
      "name": "news",
      "options": {
        "freshnessOptions": {
          "freshnessDuration": "259200s"
        }
      }
    }
  ]
}
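
The freshnessDuration values in these samples are whole seconds followed by the letter s. A small helper can make such values easier to read and less error-prone to compute; the function name freshness_duration below is hypothetical and the sketch only formats the string.

```python
from datetime import timedelta


def freshness_duration(delta):
    """Format a timedelta as a whole-second duration string such as "259200s"."""
    return f"{int(delta.total_seconds())}s"


three_days = freshness_duration(timedelta(days=3))
long_lived = freshness_duration(timedelta(days=3650))
print(three_days)
print(long_lived)
```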

Influence quality

Quality is a measurement of the accuracy and usefulness of an item. A data source can contain multiple semantically similar documents, each with a different level of quality. You can specify a quality value between 0 and 1 using SearchQualityMetadata. Items with higher values receive a ranking boost relative to items with lower values. Use this setting only if you need to influence or boost the quality of an item outside of the information provided to Cloud Search.

For example, suppose you have a data source containing employee benefits documents. You might use SearchQualityMetadata to boost the ranking of documents authored by Human Resources employees over documents authored by other employees.

Following are sample items containing SearchQualityMetadata settings for issues in a bug tracking system:

{
  "name": "datasources/.../items/issue1",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 1",
    "objectType": "issues"
  },
  ...
}

{
  "name": "datasources/.../items/issue2",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 2",
    "objectType": "issues",
    "searchQualityMetadata": {
      "quality": 0.5
    }
  },
  ...
}

{
  "name": "datasources/.../items/issue3",
  "acl": {
    ...
  },
  "metadata": {
    "title": "Issue 3",
    "objectType": "issues",
    "searchQualityMetadata": {
      "quality": 1
    }
  },
  ...
}

Given these items, when a user searches using the search term “issue,” Issue 3 (quality of 1) is ranked higher than Issue 2 (quality of 0.5) and Issue 1 (if nothing is specified, the default quality is 0).
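
The effect of the quality value can be pictured with a toy model. The sketch below is not Cloud Search's ranking function; it only assumes that all other signals are equal and orders items by their quality value, treating a missing value as 0 as described above. The helper name quality_of is hypothetical.

```python
def quality_of(item):
    """Return the item's quality, falling back to the default of 0."""
    metadata = item.get("metadata", {})
    return metadata.get("searchQualityMetadata", {}).get("quality", 0)


items = [
    {"metadata": {"title": "Issue 1", "objectType": "issues"}},
    {
        "metadata": {
            "title": "Issue 2",
            "objectType": "issues",
            "searchQualityMetadata": {"quality": 0.5},
        }
    },
    {
        "metadata": {
            "title": "Issue 3",
            "objectType": "issues",
            "searchQualityMetadata": {"quality": 1},
        }
    },
]

# Highest quality first, all other ranking signals being equal.
ranked = sorted(items, key=quality_of, reverse=True)
titles = [item["metadata"]["title"] for item in ranked]
print(titles)
```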

Influence using field type

Cloud Search allows you to influence ranking based on the value of enum or integer properties. For each integer or enum property, an OrderedRanking can be specified. This setting has the following values:

  • NO_ORDER (default): The property does not affect ranking.
  • ASCENDING: Items with higher values of this integer or enum property receive a ranking boost compared to items with lower values.
  • DESCENDING: Items with lower values of the integer or enum property receive a ranking boost compared to items with higher values.

For example, suppose each bug in a bug tracking system has an enum property for storing the priority of the bug as either HIGH (1), MEDIUM (2), or LOW (3). In this scenario, setting an OrderedRanking of DESCENDING provides a ranking boost to HIGH priority bugs in comparison to LOW priority bugs. Following is a sample schema containing OrderedRanking settings for issues in a bug tracking system:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "options": {
        "freshnessOptions": {
          "freshnessProperty": "duedate"
        }
      },
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "duedate",
          "datePropertyOptions": {
          }
        },
        {
          "name": "priority",
          "enumPropertyOptions": {
            "possibleValues": [
              {
                "stringValue": "HIGH",
                "integerValue": 1
              },
              {
                "stringValue": "MEDIUM",
                "integerValue": 2
              },
              {
                "stringValue": "LOW",
                "integerValue": 3
              }
            ],
            "orderedRanking": "DESCENDING"
          }
        },

        ...
      ]
    }
  ]
}

A bug tracking system could also have an integer property called votes used to gather feedback from users on the relative importance of a bug. You could use the votes property to influence ranking by giving higher importance to the bugs with the most votes. In this case, you could specify OrderedRanking as ASCENDING for the votes property so that issues with the most votes receive a ranking boost. Following is a sample schema containing OrderedRanking settings for issues in a bug tracking system:

{
  "objectDefinitions": [
    {
      "name": "issues",
      "propertyDefinitions": [
        {
          "name": "summary",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "HIGHEST"
            }
          }
        },
        {
          "name": "description",
          "textPropertyOptions": {
            "retrievalImportance": {
              "importance": "DEFAULT"
            }
          }
        },
        {
          "name": "votes",
          "integerPropertyOptions": {
            "orderedRanking": "ASCENDING",
            "minimumValue": 0,
            "maximumValue": 1000
          }
        },

        ...
      ]
    }
  ]
}
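
To make the direction of each OrderedRanking value concrete, the following toy model orders items purely by one integer property. It is a simplified sketch, not the actual ranking algorithm: it assumes every other signal is equal, and the function name boost_order is hypothetical.

```python
def boost_order(items, prop, ordered_ranking):
    """Order items so that boosted items come first, all else being equal."""
    if ordered_ranking == "NO_ORDER":
        return list(items)  # the property does not affect ranking
    if ordered_ranking == "ASCENDING":
        # Higher values receive the boost, so they sort first.
        return sorted(items, key=lambda item: item[prop], reverse=True)
    if ordered_ranking == "DESCENDING":
        # Lower values receive the boost, so they sort first.
        return sorted(items, key=lambda item: item[prop])
    raise ValueError(f"unknown ordered ranking: {ordered_ranking}")


bugs = [
    {"id": "bug-a", "priority": 3, "votes": 12},   # LOW priority
    {"id": "bug-b", "priority": 1, "votes": 4},    # HIGH priority
    {"id": "bug-c", "priority": 2, "votes": 250},  # MEDIUM priority
]

by_priority = [b["id"] for b in boost_order(bugs, "priority", "DESCENDING")]
by_votes = [b["id"] for b in boost_order(bugs, "votes", "ASCENDING")]
print(by_priority)
print(by_votes)
```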

Influence ranking through query expansion

Query expansion refers to expanding the terms in the query, using synonyms and spelling, to retrieve better results.

Use synonyms to influence search results

Cloud Search utilizes synonyms inferred from public web content to expand the query terms. You can also define custom synonyms to capture organization-specific terminology, such as common acronyms used within an organization or industry-specific terminology.

Custom synonyms can be defined within a data source or as a separate data source. Custom synonyms are applied across all search applications in a domain, regardless of where the synonym is defined. For information on defining custom synonyms, refer to Define synonyms.

Use spelling to influence search results

Cloud Search provides spelling suggestions based on models built using the public Google Search data. If Cloud Search detects a misspelling in the context of a query, it returns the suggested query in the SpellResult. The suggested spelling can be displayed to the user as a suggestion. For example, the user might misspell the query term “employe” and could receive the suggestion “Did you mean employee?”

Cloud Search also uses spell corrections as synonyms to help retrieve documents that may otherwise be missed due to a spelling error.
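
A search interface might surface the suggestion along these lines. This is a minimal sketch that assumes the search response has already been parsed into a dictionary and that suggestions arrive in a spellResults list whose entries carry a suggestedQuery field; both field names and the helper spelling_suggestion are assumptions to verify against the API reference.

```python
def spelling_suggestion(response):
    """Return a "Did you mean ...?" prompt, or None when nothing is suggested."""
    for result in response.get("spellResults", []):
        suggested = result.get("suggestedQuery")
        if suggested:
            return f"Did you mean {suggested}?"
    return None


# Hypothetical parsed response for the misspelled query "employe".
sample_response = {"spellResults": [{"suggestedQuery": "employee"}]}
prompt = spelling_suggestion(sample_response)
print(prompt)
```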

Influencing ranking through search application settings

As mentioned in the Introduction to Google Cloud Search, a Search Application is a group of settings that, when associated with a search interface, provide contextual information about searches. The following configurations allow you to influence ranking through the search application:

  • Scoring configuration
  • Source configuration

The following two sections explain how these configurations are useful in influencing ranking.

Adjust the scoring configuration

For each search application, you can specify a ScoringConfig used for controlling the application of some signals during ranking. Currently, you can disable freshness and personalization.

If freshness is disabled, it is disabled for all data sources listed in the search application, regardless of the freshness options specified in the schema for the data source. Similarly, if personalization is disabled, owner boost and interaction boost don’t affect the ranking.

For step-by-step instructions on configuring this setting, refer to Create a custom search experience.
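
For orientation, the scoring configuration of a search application might be expressed as the fragment built below. The field names disableFreshness and disablePersonalization are assumptions about the ScoringConfig resource and should be checked against the API reference; the helper scoring_config is hypothetical.

```python
import json


def scoring_config(disable_freshness=False, disable_personalization=False):
    """Build a scoringConfig fragment for a search application (assumed shape)."""
    return {
        "scoringConfig": {
            "disableFreshness": disable_freshness,
            "disablePersonalization": disable_personalization,
        }
    }


# Turn off freshness for every data source in the search application.
fragment = scoring_config(disable_freshness=True)
print(json.dumps(fragment, indent=2))
```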

Adjust the source configuration

The source configuration allows you to specify data source-level settings in a search application. The following settings are supported:

  • Source importance
  • Crowding

Set source importance

Source importance refers to the relative importance of a data source within a search application. This setting can be specified in the SourceImportance field inside SourceScoringConfig. Items from a data source with HIGH source importance receive a ranking boost compared to items from a data source with a DEFAULT or a LOW source importance. Use this setting to influence ranking when you believe users would prefer results from certain data sources.

For example, suppose you have a product support portal containing external and internal troubleshooting data. In this scenario, you might want to configure your search application to prioritize results from the internal data source.

For step-by-step instructions on configuring this setting, refer to Create a custom search experience.

Set crowding

Crowding refers to the maximum number of results that can be returned from a data source in a search application. You can control this value using the numResults field in SourceCrowdingConfig. The value defaults to 3, which means that after Cloud Search has shown 3 results from a data source, it starts presenting results from other data sources. Items from the first data source are reconsidered only if all data sources have reached their crowding limit or there are no more results from other data sources.

This setting is helpful in ensuring diversity of the search results and preventing one data source from dominating the search result page.
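
The crowding behavior described above can be approximated with a small simulation. This is a toy model under stated assumptions, not Cloud Search's implementation: results arrive as (source, item) pairs already in ranking order, every source shares the same limit, and items over the limit are deferred to a later pass. The function name apply_crowding is hypothetical.

```python
from collections import Counter


def apply_crowding(ranked, num_results=3):
    """Reorder (source, item) pairs so no source exceeds num_results per pass."""
    if num_results < 1:
        raise ValueError("num_results must be at least 1")
    ordered = []
    pending = list(ranked)
    while pending:
        shown = Counter()  # results shown per source in this pass
        deferred = []
        for source, item in pending:
            if shown[source] < num_results:
                shown[source] += 1
                ordered.append((source, item))
            else:
                # Over the limit: reconsider once other sources are exhausted.
                deferred.append((source, item))
        pending = deferred
    return ordered


ranked = [
    ("drive", "d1"), ("drive", "d2"), ("drive", "d3"),
    ("drive", "d4"), ("wiki", "w1"), ("drive", "d5"),
]
crowded = [item for _, item in apply_crowding(ranked, num_results=3)]
print(crowded)
```

In this example the fourth and fifth drive results drop below the wiki result, which would otherwise have been pushed further down the page.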

For step-by-step instructions on configuring this setting, refer to Create a custom search experience.

Influencing ranking through personalization

Personalization refers to the presentation of personalized search results based on the individual user accessing the result. You can influence ranking by prioritizing items based on the following criteria:

  • Item ownership
  • Item interaction
  • Item language

The following three sections address how to influence search quality based on these criteria.

Influence ranking based on item ownership

Item ownership refers to providing a ranking boost to items owned by the user performing the search query. Each item has an ItemAcl with an owners field. If the user executing a query is the owner of an item, then, by default, that item receives a ranking boost. You can turn off personalization in the search application.

Increase ranking based on item interaction

Item interaction refers to providing a ranking boost to items that the user performing the search query has interacted with (viewed, commented on, edited, and so on).

Item interaction signals are automatically obtained for G Suite products such as Drive, Gmail, and so on. For other products, you can provide item-level interaction data, including the type of interaction (view, edit), the timestamp of the interaction, and the principal (user who interacted with the item). Note that items with recent interactions obtain a higher ranking boost.

Influence ranking through query interpretation

Cloud Search’s query interpretation feature automatically interprets the operators and filters in a user’s query, and converts those elements into a structured, operator-based query. Query interpretation uses operators defined in the schema, together with the indexed documents, to deduce what the user's query means. This feature allows a user to search with minimal keywords, yet still obtain precise results. For further information, refer to Structure a schema for optimal query interpretation.

Increase ranking based on item language

Item language refers to demoting items whose language does not match the language of the query. The following factors affect the ranking of items based on language:

  • The languageCode specified in the RequestOptions.
  • The auto-detected language of the search query.
  • The language of the item (contentLanguage in ItemMetadata or the auto-detected language, in that order).

If the language of the query and item match, no language demotion is applied. If these settings do not match, then the item is demoted.
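
The matching rule can be sketched as follows. This is a simplified illustration rather than the actual ranking logic: it takes the query language as a single already-resolved value, uses contentLanguage before the auto-detected item language as described above, and arbitrarily treats an item with no known language as not demoted. Both function names are hypothetical.

```python
def effective_item_language(content_language=None, detected_language=None):
    """Prefer the declared contentLanguage, then the auto-detected language."""
    return content_language or detected_language


def is_language_demoted(query_language, content_language=None,
                        detected_language=None):
    """Return True when the item's language differs from the query's."""
    item_language = effective_item_language(content_language, detected_language)
    if item_language is None:
        # Assumption made for this sketch only: unknown language, no demotion.
        return False
    return item_language.lower() != query_language.lower()


print(is_language_demoted("en", content_language="en"))
print(is_language_demoted("en", content_language="fr", detected_language="en"))
```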

Summary of recommended and optional search quality settings

The following table lists all of the recommended and optional search quality settings. These recommendations should help you achieve the most benefit from Cloud Search's ranking models.

Setting | Location | Recommended/optional | Details
--- | --- | --- | ---
Schema settings | | |
ItemContent field | ItemContent | Recommended | When creating or updating your schema, populate the unstructured content of an item. This field is used for generating snippets.
RetrievalImportance field | RetrievalImportance | Recommended | When creating or updating a schema, set for text properties that are clearly important or topical.
FreshnessOptions | FreshnessOptions | Optional | When creating or updating a schema, set to ensure that items aren't demoted because of incorrect data or cases when data is missing.
Indexing settings | | |
createTime/updateTime | ItemMetadata | Recommended | Populate during indexing of an item.
owners field | ItemAcl | Recommended | Populate during indexing of an item.
Custom synonyms | _dictionaryEntry schema | Recommended | Define at the data source level or as a separate data source during indexing.
quality field | SearchQualityMetadata | Optional | To provide a base quality boost compared to other semantically similar items, set quality during indexing. Setting this field to the same value for all items in a data source nullifies its effect.
Item-level interaction data | interaction | Optional | If the data source records and provides access to users' interactions, populate the interactions for each item during indexing.
integer/enum properties | OrderedRanking | Optional | When the order of items is relevant, specify the ordered ranking for integer and enum properties during indexing.
Search application settings | | |
Personalization=false | ScoringConfig or using CloudSearch admin UI | Recommended | Set when creating or updating the search application. Ensure you provide the correct owner information as described in Influencing ranking through personalization.
SourceImportance field | SourceScoringConfig | Optional | To bias the results from certain data sources, set this field.
numResults field | SourceCrowdingConfig | Optional | To control the diversity of results, set this field.

Next Steps

Here are a few next steps you might take:

  1. Structure a schema for optimal query interpretation.

  2. Learn how to leverage the _dictionaryEntry schema to define synonyms for terms commonly used in your company. To use the _dictionaryEntry schema, refer to Define synonyms.
