Friday, October 11, 2013

Efficient Techniques for Fuzzy and Partial matching in mongoDB

Efficient Techniques for Fuzzy and Partial matching in mongoDB


This blogpost describes a number of techniques, in MongoDB, for efficiently finding documents that have a number of similar attributes to a supplied query whilst not being an exact match. This concept of "Fuzzy" searching allows users to avoid the risks of failing to find important information due to slight differences in how was entered.


Where users are able to enter data items manually, rather than choosing from a list of options, there exists a reasonable probability that multiple users will enter data in different forms. For names this may be misspelling or typographic errors, telephone numbers may be entered with or without spaces or international prefixes. When typing addresses people may enter sub-items in incorrect form boxes.

In an application aware of these issues then the data entry interface can be made somewhat smarter; telephone numbers can have an enforced format and addresses can be verified against a geo databases. Unfortunately other data elements such as names have no such safeguards and are therefore frequently entered in multiple formats.

Historical data may also form part of the data you are searching and this may well have been entered without a rigid data quality regime and without any form of data cleansing and verification after the fact.

Telephone numbers can be in the correct format but still be incorrect due to a mistake made in typing or mentally transposing numbers. This means they are also candidates for "Fuzzy" matching.

People may explicitly change they way they describe themselves for a variety of reasons: A change of name through marriage, divorce or decision to use a middle name as a first name. A change in the way an address is described, sometimes including a neighbourhood, sometimes not. In some circumstances people may deliberately make subtle changes to their identification details to avoid identification or at least to prevent themselves being linked to a previous data entry.

Finally, and somewhat ironically, too much information can prevent the discovery of matching. Where stored data contains fewer data points than are being used to query, a naïve matching technique can result in there being zero exact matches and therefore zero records found.

This document will arm the reader with four efficient techniques using mongoDB features to ensure user find the data they need - even where there are differences between their query and the stored data.

Business Benefits.

"Fuzzy" search techniques are especially useful in certain industries:

Any organization with a large number of customers and product lines, especially in a B2C situation, can have challenges around finding all existing customer details. Whilst some details such as an account reference number assist these fail where multiple disparate system are brought together, or where different attributes have been recorded. Businesses implementing a unified customer view can therefore benefit from using these methods.

Organisations investigating fraud – especially higher volume, lower value fraud such as insurance claims can benefit from being able to see through simple, seemingly innocent obfuscations used to thwart fraud detection.

Organisations working in Law Enforcement or National security can often have very poor data quality due to the nature of information gathering or very unreliable search criteria. These techniques assist in finding the relevant data regardless of source or quality.


It is important to define the concept of efficiency; Whilst hard numbers are provided as to the performance of the algorithms for specific tests in this whitepaper these need to be understood in a larger context. In nearly all cases it is slower to find partially matching documents than exact ones. An exact match can nearly always be completed in O(log(n)) time and with a very small constant overhead, as long as a suitable binary index exists – a partial match may however require additional computation.

The techniques described here are designed to perform significantly better than a naïve approach which could have O(n) or worse time. The generally make use of indexes to perform in  O(log(n)) but may have a larger constant overhead or multiple than is associated with a simple binary search. For example a fuzzy search may take ten times as long as a non fuzzy search, but will still scale with data growth.

In terms of storage efficiency – again there is a cost in terms of additional storage and storage processing to reap benefits at query time. In a typical case this is expected to be either only a small overhead relative to the size of the data being stored and retrieved or there is an expectation the overall data size is not large and relatively bounded.

Sample Data

These techniques were tested with a data sets of 5 and 20 million fictional individuals. Each person having a first and last name, one or more middle names, a gender, an address and multiple telephone numbers. In addition a small percentage of records are were corrupted to simulate typing mistakes on data entry.

The data generation script along with a python based web user interface for testing is available on Github in ( – the Fuzzgo web interface is the source of the screen shots used in this whitepaper.

The sample data is 8GB and the indexes are 6GB. In this data the proportion of ‘Fuzzy Searchable’ data to retrievable data is very high making the indexes proportionally far larger than they would be in production. In addition this data is indexed to show multiple techniques where not all are required for each use case.

Technique One – Find Similar Sounding Data

When typing a name you have only heard and not seen it can be quite difficult to correctly spell it. The bearer may spell names we have previously learned how to spell differently to our expectation.  Other names, especially those not native to our own language can be exceptionally difficult to transcribe, and not everyone will question the spelling when being told. Further when a name is being passed via a third party then important information regarding the spelling may be lost completely.

Even in common usage there are many common spellings of names, there are more than three dozen common spellings of the name Mohammed in English – although only one in Arabic.

One of the best solutions to this is to make use of one of the family of canonical sound-alike algorithms. These algorithms were first used manually in 1918 for analysing census data but have been improved greatly since the original ‘Soundex’ which whilst simple to apply on paper, gives poor results for many names.

Today we have access to a choice of algorithms: Metaphone, Double Metaphone, NYSIIS, Caverphone, Daitch-Mokotoff and Kölner Phonetik[1]. Which algorithm is most appropriate depends on the language you principally expect the data to be in and also potentially on any licensing costs. The technique to utilize them remains the same and for the purposes of this whitepaper we will use Double Metaphone, as this is included in Python and is a good general purpose algorithm.

A sound-alike algorithm takes a text string, often but not always representing a name and emits a second text string being a canonical version of it. The theory is that two inputs, which sound similar, will have the same output. The table below shows some examples.


From this we can see that the original Soundex algorithm, whilst relatively good fails to differentiate between Smith and Schmidt – whether this is correct is somewhat subjective and again language dependent.

While Metaphone is capable, double Metaphone returns two values per input, which can be compared independently. This in the case of Smith/Schmidt it is telling us there is some similarity – both match XMT but there are also differences.

In this whitepaper we are using the concatenated output of double Metaphone to ensure maximum similarity between terms deems as sounding alike.

Using a sound-alike algorithm in MongoDB is very simple; wherever you have a field and wish to query that field by it’s sound rather than it’s value you should include, at load time, an additional field containing the permuted version. We propose the convention _s_fieldname based on the original fieldname the underscore denoting a ‘special’ field and the s denoting sound-alike.[2]

For a document like this

      firstname: JOHN,
      lastname: SMITH,
      … Many other data fields

You would instead load a document containing

      firstname: JOHN,
      _s_firstname: JNAN,
      lastname: SMITH,
      _s_lastname: SM0XMT
      … Many other data fields

Indexing requirements for these fields should be determined in the same way as any other field – if you intend to use it as the foremost part of a query then it should probably be included as an index.

When querying against these fields then the query term should be first passed through the appropriate sound-alike algorithm as in the python example below.

#Python example – Javascript is VERY similar

From metaphone import doublemetaphone

name = "JOHN"
metaname = doublemetaphone(name)
queryname = metaname[0]+metaname[1]


If you index both the normal and sound-alike fields then you have the overhead of an additional index as well as the additional data. Depending on your use case it can be more efficient to forgo the index on the original field and always make use of the sound-alike index when querying – if you only want exact matches then further filter the query on the exact field.

For example if I want the exact name ‘John’ as the first name but I do not with the overhead of a second exact index I would index the _s_firstname field using :


then a query of the form.

db.people.find({ "_s_firstname":"JNAN","firstname":"JOHN"})

This will make use of the index to find documents that ‘sound-like’ john and then scan the documents themselves to reduce to the exact matches. Whether this is an appropriate choice will depend on your application and preference for more effort at data insertion and more storage or marginally faster data retrieval.

This is an example of searching the firstname field for ‘Maurice’ using the Sounds-like feature. It returns results for Morris, Maurice, Marisa, and Marcy on the first page.

Technique Two – Searching for typographical errors.

The sound-alike algorithm is useful where the user has misspelt the word or used the wrong form of it, however it uses a hashing function and therefore a very small change in the entered data, which is not related phonetically, can render it useless. For example accidentally changing smith to dmith, a simple typing error, will then return a totally different set of results not including smith.

A good way to find this is by using a simple typo algorithm allowing for a small variation in the word matched. A commonly used algorithm for typo matching is.

Allow any one extra character
OR       Allow any one missing character
OR       Allow any one modified character
OR       Allow any two characters to be transposed

This can be easily encoded into a regular expression based on the original word. A regular expression is a pattern for matching that amongst other things can use the character . (dot) to denote ANY character and | (pipe) to denote alternatives as well as ^ and $ to denote start and end of word.

Therefore – when looking for the word Joe – the following regular expression will find typo variations

db.people.find({firstname : /^.JOE$|^J.OE$|^JO.E$|^JOE.$|^OE$|

This is not however a scalable approach to searching. This queries resource cost linear with data size, multiplied by fields searched as can be seen in the screenshot below, over a significant data set it can take a very long time to complete.

This is actually one of a large family of problems where the cardinality of the field is low relative to the number of records, and where the largest part of the work can be done against a unique list of values. In this circumstance it may seem appropriate to create an index on the field in question and make use of the distinct() function in MongoDB, this approach has two issues.

Firstly – the most notable is there is no method supplied to filter distinct values, there is currently no way to request only those distinct entries matching a given pattern.

Secondly, the internal structure of the btree indexes in mongoDB means that when an index scan is performed it scans one index entry for every document in the collection. It does not scan unique values – and then get a list of matching documents.  For example – if you have two distinct values and 1000 documents which have one of each of those values. An index scan will scan all one thousand, not two even if the value is the same and I am retrieving only the index field.

The solution to this is to create a database term list in a secondary collection. This term list is a very small collection designed to efficiently process a unique list of terms – it is appropriate where the number of unique terms is a small fraction of the total number of terms, and where some kind of fuzzy matching is appropriate.

To do this – if you have for example a collections customers.persons – create an additional collection with those fields you wish to perform partial matching on. In the simplest case you wish to look at the firstname field – create customers.persons_firstname.

In this collection use the value of the name as _id and insert , when inserting into, or updating the principal collection. You should explicitly ignore any duplicate key errors.

So if you were performing

db.persons.insert({firstname: "john", secondname: "page", city: "Glasgow"})

At the same time do


This collection will be small – the entries are small and it has a single index – entries therefore have very insertion low cost.

When performing a complex matching operation – typically a regular expression however on such a small collection many options such as thesaurus lookups are also appropriate – perform it against the _fieldname collection, retrieve the list of results (which will be a unique list of names that exist in the database) and convert this to either a $or query or a $in query. Normally the list of values will be relatively small.

Now you have a query like {firstname: {$in : [john,jogn,johm,joohn]}} which you can use to query the main table using an normal index.

When using this technique with the typo algorithm above there is a further optimisation you can make; In this algorithm one of the first two letters is always one of the original first two letters although they can move position. Therefore in your _firstname collection you should add a secondary value, an array containing the first two letters of the word.

For example:

db.persons_firstname.insert({_id: john, f: ["j","o"]})

Then add a further index using


Now when performing a regular expression match against the term list for a garbled value you can do

db.persons_firstname.find({f:{$in:["a","b"]},{$id:/[regex for name starting ab]/)

This reduces the index search space considerably – theoretically to ¼ of the entries although this is dependant on the distribution of letters in your language.

Using this technique for the garbled search the time taken for a garbled search against a large data set can be reduced by three orders of magnitude times as seen in the screenshot below.

Technique 3 – Any Field Search

Where you have a set of related fields, such as parts of a name or an address, it is very common for this information to be entered incorrectly. This can be the result of operator carelessness, a lack of clarity as to how an address should be structured – for example if you live in a suburban town do you put the town name or the nearest city in the ‘city’ field on a form. Also cultural difference, especially regarding names where the western concept of a first name and family name do not match cultural norms from the far east where the family name comes first.

MongoDB includes a very simple and core feature to deal with this, the idea of array fields.

Simply put – in your data – add an additional, indexed field, in this place an array of all values within the group. So for names, where you are storing firstname, middle name #1, middle name #2 and lastname add an additional field allnames with an array of all assigned name values – there is no requirement to store null or empty values.

This is an example of making use of mongoDB features that are not normally available in the relational world.

As an Example

db.people.insert({firstname:"john",middlenameone: "Laing",lastname: "page, allnames(["john","laing","page"])

You should create an index on the allnames field. Now when querying for a name use allnames in your query rather than the individual name fields. This will enable the retrieval of a record regardless of what field the value was entered into as seen below.

It should be noted that MongoDB is case sensitive in its searches unless a case insensitive regular expression is used. There could be significant advantages to having a case insensitive index where all queries are converted to a canonical lower or upper case and all values in the index (but not the data) are stored as such. Whilst there are a number of gotchas, specifically regarding covered indexes and regexes regarding this is you feel it is useful you may wish to request this as a mongoDB server feature.

Equally there are potential advantages – again with caveats - to a multi field index so this can be performed using a ‘virtual’ field, which does not require to be added to the actual document. Again if you feel this is of benefit you should request this feature in mongoDB.


Technique 4 – Matching a subset of values and scoring results.

One of the core capabilities required for efficient fuzzy matching is the ability to match on a subset of attributes where which subset is not known in advance. The normal Boolean operators $and and $or prove limited for this as $or will oftentimes match too much and $and will only manage exact matches. One solution is to construct a quorum search where at least n data points need to match. This can be combined with the above techniques as well as sorting and weighting to provide very powerful matching capabilities indeed.

The key to doing this in mongoDB is to make use of the aggregation framework as a fuzzy query engine.

Once you have performed the required aggregation you can then take the list of _id’s output by the aggregation framework and treat them as a type of cursor – fetching the records as required by performing a $in query for a single record, a page of records or all records. In the example code I retrieve all records for simplicity –an example of this quorum search is shown below – finding records with 'at least 2' of the three values provided.

The mechanism is easiest explained using an example:

Suppose we want to match the name "John Andrew Cartwright Smith"

Our data has four name fields firstname,lastname, middlename #1 and middlename #2.

A $and query will only match persons called "John Andrew Cartwright Smith" not "John Andrew Smith" or "John Smith" or "John Andrew Cartwright". A $or Query will return all Johns, Andrews, Cartwright’s and Smiths resulting in far too much data.

Given that the first few steps of an aggregation pipeline are essentially a query and that the normal rules regarding querying and indexing apply we should select the larger set in our aggregation pipeline as a $or query – this will make use of multiple indexes if they exist and search in parallel to seed our pipeline.

"$match" : {
      "$or" : [
            {"firstname" : "John"},
            {"middlenameone" : "Andrew"},
            {"middlenametwo" : "Cartwright"},
            {"lastname" : "Smith"}

We will then project this – making use of the conditional operators within projection to effectively ‚score our query as to how many of the terms were present.

"$project" : {
      "c" : {"$add" : [
{"$cond" : [{"$eq" : ["$firstname","John"]},1,0]},
{"$cond" : [{"$eq" : ["$middleone","Andrew"]},1,0]},
{"$cond" : [{"$eq" : ["$middletwo","Cartwright"]},1,0]},
{"$cond" : [{"$eq" : ["$lastname","Smith"]},1,0]}

Now we will have one document for each match of the original $or, we will have the _id field as this is retained in $project by default as well as a value c which is a value between 1 and 4 depending on how many of the fields matched

We can then filter this to select, for example, only where two terms matched.


We are then retuned from the aggregation framework a set of small JSON objects containing _id and score, which we can use effectively as a cursor to retrieve those documents we wish to view. If we do not need the score we can use a second projection to filter out c.

By adding a sort to the aggregation pipeline we can retrieve the best matches first. You can also – where you have a sound-alike field as described in technique 1 give a higher score for an exact match than a ‘sounds’ match when calculating ‘c’ in the projection above.

If you wish to use this technique with an array of names so matches can be in any field then a subtly different approach is needed. Rather than simply calculate a sum you need to put each individual match into a set and verify the number of unique values you have. This is to avoid a scenario where a names like "james james smith" is seen as matching two terms despite only one name matching.

If names like ‘James James Smith’ seems improbable bear in mind that once you add in sound-alike or typo algorithms it is much more likely that two parts of a single name will match.

To calculate the length of a set in mongoDB 2.4 you need to unwind and group it – there is no match on set length operator.

Given a field called ‘allnames’ to query on the aggregation pipeline would look as follows.

#Find all candidate records using a $or, you can also do this if using a soundalike field or name variations from a termlist as well.

{"$match" : {"$or" : [ {"allnames" : "John"    },{"allnames" : "Andrew"},{"allnames" : "Cartwright"},{"allnames" : "Smith"}]}},]}}

Minimise the data set for performance reasons.

{"$project" : "allnames"}

Create a single document per name.

{"$unwind" : "$allnames"}

For each name – if it’s one we are interested then in keep it, otherwise drop it;
Even if matching on a soundalike or a termlist keep the original name. $concat is used to aggregate multiple string conditionals.

{"$project" : {
"c" : {"$concat" :
[{"$cond":[ {"$eq" : ["$allnames", "John" ]},"John",""]},
{"$cond":[ {"$eq" : ["$allnames", "Andrew"]},"Andrew",""]},
{"$cond":[ {"$eq" : "$allnames","Cartwright"]},"Cartwright",""]
{"$cond":[ {"$eq" : ["$allnames", "Smith"]},"Smith",""]}]}}}

Now we can group into a set of which distinct names matched.

{"$group" : {"c" : {"$addToSet" : "$c"},"_id" : "$_id"}}

Calculate the set length and store in c

{"$unwind" : "$c"},
{"$group" : {"c" : {"$sum" : 1},  "_id" : "$_id"}}

And match to retrieve those with multiple matching values. As you get null values from the earlier conditional the $gte is one large then the number of matches you want.

{"$match" : {"c" : {"$gte" : 3}}},
{"$project" : {"_id" : "$_id"}}


It is possible to perform many, sophisticated partial matching operations in mongoDB whilst staying within the bounds of indexed queries and avoiding whole table or even whole index scans over anything but the smallest of data sets.

[2] At this time the sample code on Github does not yet use this convention.


Jon Rangel said...

Great post, John. These are some really useful techniques.

Nagappan said...

Dear John,

You just blew me away with your awesomeness ! Looking forward to more posts from you.

Ewan R said...

Can you host this? I don't know how to use Python :(

Timothy Armes said...
This comment has been removed by the author.
Timothy Armes said...


There's a slight error in your code. In the case that all the search terms match a given document then there will be no 'null' ("") line item. Given that you then match for the number of require items + 1 you'll miss the perfect matches.

To solve this you can insert a $match before regrouping the items:

{ "$match" : { "c": { "$ne": "" } } }

Then, of course, match for the number of required term (don't add one).

Mazki516 said...

Thank you for sharing you amazing practical ideas !

Dubai Raju said...

it’s really nice and meanful. it’s really cool blog. Linking is very useful have really helped lots of people who visit blog and provide them usefull information.

Mongo DB Training in Hyderabad

Cognegic said...

The most effective method to Solve MongoDB Security Issue through MongoDB Technical Support |Cognegic|
In the event that you trusted your Database was hacked by some unapproved individual at that point don't stress simply contact to Cognegic's MongoDB Online Support or MongoDB Customer Support USA to secure your MongoDB database. Today, the programmers can without much of a stretch hack your any of the information and once your information gone then it winds up difficult to get back your hacked information. On the off chance that you are not kidding about your database at that point rapidly contact to Cognegic's Support for MongoDB Database Software.
For More Info:
Contact Number: 1-800-450-8670
Email Address-
Company’s Address- 507 Copper Square Drive Bethel Connecticut (USA) 06801

Radha Sai said...

Very informative. Keep updating Artificial Intelligence Online Training

Brain Carve said...

nice post..
Abacus Classes in chennai
vedic maths training chennai
abacus training classes in chennai