Case insensitive exact matches in Elasticsearch
My application had a use case where I needed to support case-insensitive exact matches, i.e. the strings "maruti", "Maruti" and "MARUTI" should be treated as the same, while "Maruti suzuki", "New maruti", "maruti car" etc. should not be returned when searching for "maruti", because they are not exact matches.
As an example, suppose we have a list of show titles collected in a single index. Consider the following documents in the index titles with type default -
PUT titles/default/1
{
  "title" : "The BIg BaNg Theory"
}
PUT titles/default/2
{
  "title" : "the big bang"
}
PUT titles/default/3
{
  "title" : "big Theory"
}
If you look at document 1, some of the letters are uppercase and some are lowercase. Assume the standard analyzer is used for this field. The title of document 1 will be analyzed into ["the","big","bang","theory"], document 2 into ["the","big","bang"] and document 3 into ["big","theory"]. Now suppose someone doesn't know the casing of the title, searches for "The Big Bang Theory" and wants only document 1 in the result. How do we implement that search?
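First, it's worth confirming what the standard analyzer actually produces. A quick sketch with the _analyze API, using the built-in standard analyzer:
GET _analyze
{
  "analyzer": "standard",
  "text": "The BIg BaNg Theory"
}
It returns the four lowercased tokens the, big, bang and theory - exactly what gets indexed for document 1. With that in mind, let's try a few methods here -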
- Match query
GET titles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title": "The Big Bang Theory"
          }
        }
      ]
    }
  }
}
The standard analyzer will analyze the query string into the tokens ["the", "big", "bang", "theory"] and look for documents containing any of them. Since every document contains one or more words from this list, all three documents are returned. The match query is rejected because we are looking for exact matches, while a match query looks for similar documents. We could always query "title.keyword" to look for an exact match, but a keyword field is not analyzed, so we wouldn't get case insensitivity (a sketch of such a query follows the terms query below). Hence, a plain match query won't work.
- Terms query
GET titles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "title": [
              "The Big Bang Theory"
            ]
          }
        }
      ]
    }
  }
}
A terms query doesn't analyze the given strings; it looks for exactly the values it is given. So this query won't match any document, as none of the documents contains the exact term "The Big Bang Theory".
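What about the title.keyword idea mentioned under the match query? A sketch, assuming dynamic mapping created the usual keyword sub-field for title - a term query on it matches only the stored value with its original casing:
GET titles/_search
{
  "query": {
    "term": {
      "title.keyword": "The Big Bang Theory"
    }
  }
}
Document 1 is stored as "The BIg BaNg Theory", so this returns no hits either.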
Hmmm...interesting, nothing works! Should we create another field with lowercase values? That would require extra effort to convert all the values to lowercase ourselves. We don't want that....
Let's see what we need -
- Case insensitivity - something that converts the letters to lowercase (or uppercase) both while indexing and while searching.
- Exact match - assuming we have something that lowercases the text, we also need something that doesn't break the string into tokens; then we can run a terms query or a match query on that single-token field.
Elasticsearch 5.2 introduced normalizers, which are similar to analyzers except that they guarantee the analysis produces a single token. A normalizer is a property of keyword fields and can't be applied to text fields. Cool!.. Now we can use a terms query, as long as the analysis lowercases the whole text at both index time and search time! For this use case we can define a custom normalizer with Elasticsearch's lowercase filter. It does two things - it converts everything to lowercase, and it doesn't split the string into multiple tokens; a single token is produced.
Unfortunately, we can't do this on an existing index, because it requires changing the index settings and the existing field mapping, which no version of Elasticsearch allows yet (current version 6.4). We'd need to create a new index with custom settings and reindex all the documents into the new index. Let's see how to do this -
- Create index with custom settings -
PUT titles
{
  "settings": {
    "analysis": {
      "normalizer": {
        "my_normalizer": {
          "type": "custom",
          "filter": ["lowercase"]
        }
      }
    }
  },
  "mappings": {
    "default": {
      "properties": {
        "title": {
          "type": "text",
          "fields": {
            "normalize": {
              "type": "keyword",
              "normalizer": "my_normalizer"
            },
            "keyword": {
              "type": "keyword"
            }
          }
        }
      }
    }
  }
}
We've defined a custom normalizer (my_normalizer) with the lowercase filter; the lowercase filter ensures that all letters are converted to lowercase both when a document is indexed and when the field is searched. Now you can search for "The Big Bang Theory" in two ways - either a terms query or a match query on title.normalize.
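Before querying, though, the old documents have to be copied into the new index. A minimal sketch with the _reindex API, assuming the existing data lives in an index named titles_old (a hypothetical name - adjust it to your setup):
POST _reindex
{
  "source": {
    "index": "titles_old"
  },
  "dest": {
    "index": "titles"
  }
}
With the data in place, here are the two queries -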
GET titles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "terms": {
            "title.normalize": [
              "The Big Bang Theory"
            ]
          }
        }
      ]
    }
  }
}
GET titles/_search
{
  "query": {
    "bool": {
      "must": [
        {
          "match": {
            "title.normalize": "The Big Bang Theory"
          }
        }
      ]
    }
  }
}
The terms query looks for an exact match, but because title.normalize has the normalizer attached, the query value is first converted to lowercase and the search is done for the token the big bang theory. That finds document 1.
The match query is run on title.normalize, so the only analysis applied to the query string is the one done by my_normalizer, i.e. lowercasing every letter.
This works!
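If you want to double-check what the normalizer does to a string, the _analyze API also accepts a normalizer parameter; a quick sketch against the index we just created:
GET titles/_analyze
{
  "normalizer": "my_normalizer",
  "text": "The Big Bang Theory"
}
It returns a single token, the big bang theory.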
Things to remember -
- normalizer is a property of keyword fields and won't work on text fields.
- The normalizer is applied before the keyword is indexed, which means that if you aggregate on the normalized field you'll see all the values in lowercase.
GET titles/_search
{
  "size": 0,
  "aggs": {
    "foo_terms": {
      "terms": {
        "field": "title.normalize"
      }
    }
  }
}
{
  "took": 0,
  "timed_out": false,
  "_shards": {
    "total": 5,
    "successful": 5,
    "skipped": 0,
    "failed": 0
  },
  "hits": {
    "total": 3,
    "max_score": 0,
    "hits": []
  },
  "aggregations": {
    "foo_terms": {
      "doc_count_error_upper_bound": 0,
      "sum_other_doc_count": 0,
      "buckets": [
        {
          "key": "big theory",
          "doc_count": 1
        },
        {
          "key": "the big bang",
          "doc_count": 1
        },
        {
          "key": "the big bang theory",
          "doc_count": 1
        }
      ]
    }
  }
}
You won't see the original titles - the normalizer has been applied to all of them. So don't aggregate on the normalized field if you need the original values; in this case you can aggregate on title.keyword to get them back.
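For reference, a sketch of that aggregation on the un-normalized sub-field, using the title.keyword mapping defined above (the aggregation name original_titles is arbitrary):
GET titles/_search
{
  "size": 0,
  "aggs": {
    "original_titles": {
      "terms": {
        "field": "title.keyword"
      }
    }
  }
}
The buckets then contain the titles exactly as they were indexed, e.g. "The BIg BaNg Theory".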
You can read more about analyzers and normalizers in the Elasticsearch documentation.
Thank You!
Hi Mehul, I have followed the above steps
"FileName": {
  "type": "text",
  "fields": {
    "keyword": {
      "type": "keyword",
      "ignore_above": 256
    },
    "normalize": {
      "type": "keyword",
      "normalizer": "lowercase_normalizer"
    }
  }
}
My mapping ^^
"analysis": {
  "normalizer": {
    "lowercase_normalizer": {
      "filter": [
        "lowercase"
      ],
      "type": "custom"
    }
  }
},
My index setting
{
  "query_string": {
    "fields": [
      "FileName.normalize"
    ],
    "query": "*fry\ *"
  }
},
I have a document which has "Fry", and it doesn't come up when I search with lowercase letters.
Please help me by pointing out where I am going wrong. Thanks
very helpful!! thanks a lot!
If the field that I want to search is a keyword field, do I still have to use the normalizer to achieve this? Or is there an easier way?
You cannot do a case-insensitive search on a keyword field. Remember - keyword fields are stored as-is and the search is on exact terms, not on analyzed ones.
I have a keyword field which I am storing in uppercase, but when I search using the same value in uppercase it doesn't return the document; it only returns it if I put the value in lowercase in my search request. Looks like it does a case-sensitive search, not an exact-value search. Have you tried this?