Engineering Architecture

14 Jun 2019 4 min read

The hidden complexities of search

by Dieter Vanden Eynde

Search is one of those life-changing innovations that we have started to take for granted. Companies like Google make it seem effortless to find whatever you are looking for in a heartbeat. However, it takes a lot of effort and thought to design and build the right search algorithm for a specific context.

Document vs relational search

Search engines like Google are document based: they search through large blobs of text without really understanding the meaning of that text (although Google is trying very hard to understand it). Put simply, they scan every document for your search query and show you all the matches.

When building a custom search engine for a company or product, we usually deal with very structured and relational data. Those relations have value and we should take those into consideration when searching for the most relevant result. For example, when searching for the text “concrete” in a list of invoices, we probably want to give a different meaning to an invoice from a company called “Concrete Energy” than to an invoice with a description “Pouring concrete for new construction”.
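To make the "Concrete Energy" example more tangible, here is a minimal sketch of field-aware scoring. The `Invoice` type, the `FIELD_WEIGHTS` table and the weight values are all assumptions made up for illustration; a real search engine would tune these per use case.

```python
from dataclasses import dataclass


@dataclass
class Invoice:
    number: str
    company: str
    description: str


# Hypothetical weights: a hit on the company name is treated as more
# meaningful than a hit somewhere in the free-text description.
FIELD_WEIGHTS = {"company": 3.0, "description": 1.0}


def score(invoice: Invoice, query: str) -> float:
    """Sum the weights of every field that contains the query."""
    needle = query.lower()
    total = 0.0
    for field_name, weight in FIELD_WEIGHTS.items():
        if needle in getattr(invoice, field_name).lower():
            total += weight
    return total


def search(invoices: list[Invoice], query: str) -> list[Invoice]:
    """Return matching invoices, most relevant first."""
    scored = [(score(invoice, query), invoice) for invoice in invoices]
    hits = [pair for pair in scored if pair[0] > 0]
    hits.sort(key=lambda pair: pair[0], reverse=True)
    return [invoice for _, invoice in hits]
```

With these weights, a search for "concrete" ranks the invoice from "Concrete Energy" above the invoice that only mentions concrete in its description, and drops invoices that do not mention it at all.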

In a relational search, there might also be added security or role complexities. A user might not be able to see all the invoices but only invoices from company X. This is another example of a relationship between the invoice and the user that needs to be considered in your search.
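One way to picture this is to apply the permission check before any text matching happens, so that invoices the user may not see can never leak into the results. The sketch below assumes a hypothetical `User` with a set of `allowed_companies`; real role models are usually more involved.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Invoice:
    number: str
    company: str
    description: str


@dataclass(frozen=True)
class User:
    name: str
    allowed_companies: frozenset  # hypothetical permission model


def search_for_user(user: User, invoices: list[Invoice], query: str) -> list[Invoice]:
    """Search only within the invoices this user is allowed to see."""
    needle = query.lower()
    # Step 1: narrow down to what the user may see at all.
    visible = [inv for inv in invoices if inv.company in user.allowed_companies]
    # Step 2: only then look at the text.
    return [inv for inv in visible if needle in inv.description.lower()]
```

Filtering first also keeps things like result counts honest: the user never sees "12 results" when only 3 of them are visible to them.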

Deciding how to search your data

After considering what the value of a relationship is, we also need to consider how to interpret that data. For example, the recipient email and the invoice number are both text but should be interpreted very differently. If we were to interpret them equally, then a search for “1999” would return both an invoice with number “19601999” and an invoice with email “randomlovebird1999@gmail.com”, and they would be considered equally relevant as both contain the search term.

If the user was viewing an invoice list screen, they probably did not intend to find partial matches on email addresses, and since we know how those two fields relate to an invoice we can account for that. We can expand our logic to match email addresses only when they match exactly, while still allowing partial matches on invoice numbers. If your situation requires it, you could still show partially matching email addresses but consider them less relevant.
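As a rough sketch of that logic, the function below gives each field its own matching rule. The dictionary keys (`number`, `email`) and the relevance values `1.0` and `0.2` are assumptions chosen for the example, not recommended values.

```python
def match_score(invoice: dict, query: str) -> float:
    """Score one invoice for a query, using a different rule per field."""
    needle = query.strip().lower()
    if not needle:
        return 0.0

    email = invoice["email"].lower()
    number = invoice["number"].lower()

    # Email addresses: only an exact match counts as fully relevant.
    if email == needle:
        return 1.0
    # Invoice numbers: partial matches are fine.
    if needle in number:
        return 1.0
    # Optional: keep partial email matches, but rank them much lower.
    if needle in email:
        return 0.2
    return 0.0
```

Searching for "1999" now scores the invoice numbered "19601999" at 1.0, while the invoice that only matches through "randomlovebird1999@gmail.com" gets 0.2 and ends up further down the list.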

There are a lot of different possibilities for how to interpret data and they are all very specific to a use case. Even in the same application and with the same invoice data, you could have very different interpretations of that data. It all depends on the context in which you are implementing it. The clearest example is the difference between searching in an admin panel and searching in the user application: both search the same data, but the intent is very different.

The deepest pitfall: search for everything

If you brainstorm about your new search engine, you will probably hear suggestions like spelling correction, nearby matches, partial matches, etc. All of those are very good suggestions that might add value to your search engine, but together they can become a lethal combination. It becomes very hard to combine all these things for a large number of data fields, especially because the user will no longer understand why certain results are showing up. A match that only exists because of a spelling correction, for instance, is rarely recognisable as such.

For example, our search for “1999” will also return an invoice which has a description with “2019/19/12” in it. Because we chose to allow partial matching and do some limited spelling corrections, that invoice will match. Spelling corrections are usually a Levenshtein distance calculation, and in this example the fragment “19/19” is only a distance of 2 from “1999” (or 1 if the separator is ignored), which is close enough to count as a partial match.
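For readers who want to see that calculation, here is a minimal sketch of the standard Levenshtein distance together with a naive fuzzy partial match. The `fuzzy_contains` helper and its `max_distance` parameter are assumptions for illustration; production search engines use far more efficient approaches. Note that the exact distance depends on how separators are handled: with the slash dropped, "1919" is one edit away from "1999", while the raw fragment "19/19" is two edits away.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character inserts, deletes or substitutions."""
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete from a
                current[j - 1] + 1,      # insert into a
                previous[j - 1] + cost,  # substitute (or keep)
            ))
        previous = current
    return previous[-1]


def fuzzy_contains(text: str, query: str, max_distance: int) -> bool:
    """True if any substring of text is within max_distance of query."""
    shortest = max(1, len(query) - max_distance)
    longest = len(query) + max_distance
    for size in range(shortest, longest + 1):
        for start in range(len(text) - size + 1):
            if levenshtein(query, text[start:start + size]) <= max_distance:
                return True
    return False


print(levenshtein("1999", "1919"))   # 1
print(levenshtein("1999", "19/19"))  # 2
```

The combination of "partial" and "fuzzy" is what makes this so permissive: every extra edit you tolerate multiplies the number of fragments that count as a hit.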

In another example, our search for “1999” will return an invoice with a total price of 1880 EUR because we allowed for a 10% deviation on the price field (which is a common practice as users generally don’t know exact prices).
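The arithmetic behind this is simple: 10% of 1999 is 199.9, and 1880 is only 119 away. A minimal sketch, where the function name `price_matches` and the `tolerance` parameter are assumptions for illustration:

```python
def price_matches(query: str, price: float, tolerance: float = 0.10) -> bool:
    """True if the query is a number within `tolerance` of the price."""
    try:
        target = float(query)
    except ValueError:
        # The query is not numeric, so it cannot match a price field.
        return False
    return abs(price - target) <= tolerance * abs(target)


print(price_matches("1999", 1880.0))  # True: |1880 - 1999| = 119 <= 199.9
print(price_matches("1999", 1700.0))  # False: |1700 - 1999| = 299 > 199.9
```

The rule is perfectly reasonable on its own. The trouble only starts when the same query is also being interpreted as a year, an invoice number and a piece of text at the same time.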

After implementing both examples the user might now see invoices in the search results that are only vaguely relevant. The intent of the user’s search query might be to find all invoices from the year 1999, yet invoices show up that don’t even have a literal mention of “1999”.

Filtering vs. search

Yes, a search engine in your application that finds everything via one input field might seem like a killer feature, but it is very hard to return the right results if the user cannot express their intent clearly. In a relational search, it makes a lot of sense to use filters instead of search or, ideally, a combination of the two. Allowing the user to select exactly which company they want to see invoices for will give them almost 100% accurate results.

A combination of the two is very powerful. Amazon, for example, allows you to enter any search term you want, but you can also specify a category (e.g. books, movies, etc.), which gives a lot of direction to the results that should be shown.
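Applied to the invoice example, combining a filter with a free-text search could look roughly like the sketch below. The `find_invoices` function and the dictionary keys are hypothetical; the point is only that the filter is exact and the text search runs on whatever is left.

```python
from typing import Optional


def find_invoices(invoices: list[dict], query: str = "",
                  company: Optional[str] = None) -> list[dict]:
    """Apply an exact company filter first, then a free-text search."""
    candidates = invoices
    if company is not None:
        # Filter: the user states their intent explicitly, no guessing needed.
        candidates = [inv for inv in candidates if inv["company"] == company]

    needle = query.strip().lower()
    if not needle:
        # No search term: the filter alone decides the result.
        return list(candidates)
    # Search: free text within the already narrowed-down set.
    return [inv for inv in candidates if needle in inv["description"].lower()]
```

Because the filter removes ambiguity up front, the text search can stay simple and its results remain easy for the user to understand.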

Smart search usability

There is no silver bullet for every application that needs search. It is crucial to spend some time understanding your data, but it is even more important to understand what your users will actually be searching for and what they expect as a result. As a product owner, you could talk to your users and try to understand what they are looking for. The better you can visualise and document how your search should behave, the faster you will spot the contradictions and overlapping complexities in it.

The challenge will probably be to start small: begin by searching only for basic things, and let your search engine grow as you start to understand the needs of your users better. The user never wants to find “everything”; they usually have a very specific need. And if you understand that need, you can give them the best results.
