On a project I work on, messages are stored in MongoDB. Sometimes we need to search within those messages. MongoDB’s full-text search (FTS) is the best tool for this.
If a message contains "blueberry", then FTS search terms such as "blueberry" and "blueberries" will find that message. You think "oh that's pretty smart!" Until I tell you that searching for "blue" returns no search results.
FTS is not too smart, to put it politely. It uses stemming (unless explicitly disabled), and then uses the stem words to look through the index. In this case, the stem word is “blueberry” and not “blue.”
In search of a solution for better full-text search
While Typesense (a better alternative to Meilisearch) is an option, I wanted to explore the possibility of doing a regex based search first. I suspected this would be slower than FTS, but FTS just didn't seem to cut it.
Here’s where the story gets interesting though.
Unexpected results from regex search in nested Objects
As you know, MongoDB is a document store and the documents within the same collection may have different schemas. In our case, older messages have a title
field that has a string value. Newer messages, on the other hand, have a title
Object whose keys include the respective language, so for example “title.en” would be the English string of the title.
Simple regex tests
Let's assume we're looking for all messages that contain the string "test" in the title, not necessarily at the beginning of it.
Part 1: A string value
{title: {$regex: /test/i}}
This returns the messages that have a title
key and “test” is contained within the string value. This was promising, good enough for me to already write a unit test for the more complex query with multiple words.
Part 2: An object without regex
{title: {en: "test"}}
This returns the messages that have a title
Object with an “en” key whose value exactly matches “test”.
Part 3: An object with regex
{title: {en: {$regex: /test/i}}}
This does NOT return the messages that have a title
Object with “en” key and a “test” is contained within the string value.
What’s going on?
The only thing the $regex documentation says about this is the opening:
Provides regular expression capabilities for pattern matching strings in queries.
It does say "strings," but sadly it mentions nothing about the inability to search through the strings nested in documents. MongoDB does support wildcard indexes like title.$**
, so I was left confused why this wasn't supported and doesn't produce some kind of error, simply returning no query results instead!
It turns out, there is a solution. The devil is in the details. Note how the nested key is composed and double-quoted:
{"title.en": {$regex: /test/i}}
I hope the explanation of this detail saves you some time trying to figure out the same thing in the future.
Protip: regex searches anchored to the start with a ^ regex symbol are faster than non-anchored searches. You can think of anchored searches as SQL's LIKE 'test%'
clause (not LIKE '%test%'
). Anchored searches can use an index, while the non-anchored ones will do a full key scan.
Member discussion