Use the reverse token filter to allow suffix matching queries in OpenSearch

lohitnath.453

September 18, 2023

OpenSearch is an open-source RESTful search engine built on top of the Apache Lucene library. OpenSearch full-text search is fast and can return the results of complex queries within a fraction of a second. With OpenSearch, you can convert unstructured text into structured text using different text analyzers, tokenizers, and filters to improve search. OpenSearch uses a default analyzer, called the standard analyzer, which works well for most use cases out of the box. But for some use cases, it may not work best, and you need to use a specific analyzer.

In this post, we show how you can implement a suffix-based search. To find a document with the movie title “saving private ryan”, for example, you can use the prefix “saving” with a prefix-based query. Occasionally, you also want to match suffixes, such as matching “Harry Potter Goblet of Fire” with the suffix “Fire”. To do that, first reverse each token of the title with the reverse token filter, so that “Fire” is indexed as “eriF”, then query for the prefix “eriF”.
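The idea behind this trick can be checked outside OpenSearch. The following is a minimal Python sketch, not part of the OpenSearch API: the helper name has_suffix_via_prefix is hypothetical and only illustrates that a suffix test on a word is the same as a prefix test on the reversed word.

```python
def has_suffix_via_prefix(word: str, suffix: str) -> bool:
    """Check a suffix by reversing both strings and testing for a prefix."""
    return word[::-1].startswith(suffix[::-1])


print("Fire"[::-1])                            # eriF
print(has_suffix_via_prefix("Goblet", "let"))  # True
print(has_suffix_via_prefix("Goblet", "Gob"))  # False
```

The index stores the reversed form once at indexing time, so the search side only needs a regular prefix lookup.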

Solution overview

Text analysis involves transforming unstructured text, such as the content of an email or a product description, into a structured format that’s finely tuned for effective searching. An analyzer enables the implementation of full-text search using tokenization, which involves breaking down a text into smaller fragments known as tokens, with these tokens commonly representing individual words.

To implement a reversed field search, the analyzer processes text in the following order:

Use a character filter to replace - with _. For example, from “My Driving License Number Is 123-456-789” to “My Driving License Number Is 123_456_789.”
The standard tokenizer splits text into tokens. For example, from “My Driving License Number Is 123_456_789” to “[ my, driving, license, number, is, 123, 456, 789 ].”
The reverse token filter reverses each token in a stream. For example, from [ my, driving, license, number, is, 123, 456, 789 ] to [ ym, gnivird, esnecil, rebmun, si, 321, 654, 987 ].

The standard analyzer (default analyzer) breaks down input strings into tokens based on word boundaries and removes most punctuation marks. For more information about analyzers, refer to Built-in analyzers.

Indexing and searching

Every document is a collection of fields, each having its own specific data type. When you create a mapping for your data, you create a mapping definition, which contains a list of fields that are pertinent to the document. To know more about index mappings, refer to index mapping.

Let’s take the example of an analyzer with the reverse token filter applied on the text field.

First, create an index with mappings as shown in the following code. The new field ‘reverse_title’ is derived from the ‘title’ field for suffix search, and the original field ‘title’ is used for normal search.

PUT movies
{
  "settings" : {
    "analysis" : {
      "analyzer" : {
        "whitespace_reverse" : {
          "tokenizer" : "whitespace",
          "filter" : ["reverse"]
        }
      }
    }
  },
  "mappings": {
    "properties": {
      "title": {
        "type": "text",
        "analyzer": "standard",
        "copy_to": "reverse_title"
      },
      "reverse_title": {
        "type": "text",
        "analyzer": "whitespace_reverse"
      }
    }
  }
}
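To get a feel for what ends up in the ‘reverse_title’ field, the following Python sketch approximates the whitespace_reverse analyzer defined above. It is an illustration only, not Lucene's implementation: the function name is hypothetical, and it assumes the whitespace tokenizer splits on whitespace and the reverse filter reverses each token without changing case.

```python
def whitespace_reverse(text: str) -> list[str]:
    """Approximate the custom analyzer: split on whitespace, then reverse each token."""
    return [token[::-1] for token in text.split()]


tokens = whitespace_reverse("Harry Potter Goblet of Fire")
print(tokens)  # ['yrraH', 'rettoP', 'telboG', 'fo', 'eriF']
```

Because this analyzer has no lowercase filter, the reversed tokens keep their original casing, so a prefix such as “eriF” must be written with the same case as the indexed title.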

Insert some documents into the index:

POST _bulk
{ "index" : { "_index" : "movies", "_id" : "1" } }
{ "title": "Harry Potter Goblet of Fire" }
{ "index" : { "_index" : "movies", "_id" : "2" } }
{ "title": "Lord of the rings" }
{ "index" : { "_index" : "movies", "_id" : "3" } }
{ "title": "Saving Private Ryan" }
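If you generate the bulk request from application code rather than typing it by hand, the body has to be newline-delimited JSON that ends with a newline. The following is a minimal sketch using only the Python standard library; the helper name bulk_body and the sequential IDs are assumptions for illustration, and sending the body to the cluster is left out.

```python
import json


def bulk_body(index: str, titles: list[str]) -> str:
    """Build a newline-delimited _bulk body: one action line, then one document line."""
    lines = []
    for doc_id, title in enumerate(titles, start=1):
        lines.append(json.dumps({"index": {"_index": index, "_id": str(doc_id)}}))
        lines.append(json.dumps({"title": title}))
    # The bulk body must end with a trailing newline.
    return "\n".join(lines) + "\n"


body = bulk_body(
    "movies",
    ["Harry Potter Goblet of Fire", "Lord of the rings", "Saving Private Ryan"],
)
print(body)
```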

Run the following query to perform a suffix/reverse search on the derived field ‘reverse_title’ for “Fire”:

GET movies/_search
{
  "query": {
    "prefix": {
      "reverse_title": {
        "value": "eriF"
      }
    }
  }
}

The following code shows our results:

{
  "_index": "movies",
  "_id": "1",
  "_score": 1,
  "_source": {
    "title": "Harry Potter Goblet of Fire"
  }
}
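Note that the value passed to the prefix query is already reversed. Term-level queries such as prefix generally do not run the search text through the field's analyzer, so the caller reverses the suffix before sending it. A minimal Python sketch of that step follows; the helper name suffix_query is hypothetical, and the returned dictionary simply mirrors the request body used above.

```python
import json


def suffix_query(field: str, suffix: str) -> dict:
    """Build a prefix query body whose value is the reversed suffix."""
    return {"query": {"prefix": {field: {"value": suffix[::-1]}}}}


print(json.dumps(suffix_query("reverse_title", "Fire")))
# {"query": {"prefix": {"reverse_title": {"value": "eriF"}}}}
```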

For a non-reverse search, you can use the original field ‘title’.

GET movies/_search
{
  "query": {
    "match": {
      "title": "Fire"
    }
  }
}

The following code shows our result.

{
  "_index": "movies",
  "_id": "1",
  "_score": 0.2876821,
  "_source": {
    "title": "Harry Potter Goblet of Fire"
  }
}

The query returns a document with the movie title “Harry Potter Goblet of Fire”.
If you’re curious to know how search works at a high level, refer to A query, or There and Back Again.

Conclusion

In this post, you walked through how text analysis works in OpenSearch and how to implement suffix-based search using a reverse token filter effectively.

If you have feedback about this post, submit your comments in the comments section.

About the Authors

Bharav Patel is a Specialist Solution Architect, Analytics at Amazon Web Services. He primarily works on Amazon OpenSearch Service and helps customers with key concepts and design principles of running OpenSearch workloads on the cloud. Bharav likes to explore new places and try out different cuisines.
