Sunday, June 22, 2014

A way to JOIN in MongoDB using the Aggregation framework

I want to start by saying, as you may have heard before, that if you get your schema design right you normally don't need to do any joins in MongoDB - a lot of typical uses of joins are handled by the flexibility of the document model, and asking how to design joins out of your schema should always be your first port of call.

I was working on an interesting big data problem last week which I will document in a future blog but I had the need to take a value in each document and project a second value looked up from another collection - essentially a lookup table.

In an RDBMS this is a great use of a JOIN, but for sound technical reasons MongoDB doesn't offer this: once you are working on a large data set distributed out on shards, the performance overhead of JOINs is bad. Traditionally the data you are joining needs to be located together, so your query results get pulled back to a central location and merged - not quick and not very scalable.

I hit upon the idea of pushing out the secondary table inside the aggregation query as a kind of case statement. This is easiest to show with a short example.

Imagine you are a foreign exchange company: you have customers who have made trades with you, and you are holding balances for them in a number of currencies. I will show this in its simplest form:

custrec = {
   customer: 'john',
   currency: 'USD',
   quantity: 250
}

Now in this case - the exchange rate of USD varies - so if I want to work out things for my customers relative to the current price, or some hypothetical price change, how do I do it? I'm assuming here that you plan to use the aggregation framework.

So you can do

usdprice = 1.65 

db.holdings.aggregate({$project : { customer:1, currency:1, quantity:1, price : { $cond : [ { $eq : [ '$currency','USD']}, usdprice, null ]}}})

And if currency equals USD you will have a price added to the document for the next pipeline stage - as a $project before any group or sort operations, this will happen on the remote shards.

The thing is there are a bunch of currencies so how do you extend this?

One aggregation operator that lets you extend this is $add (or $concat for strings), which allows you to construct an array of conditionals and sum the results - returning 0 rather than null from the non-matching branches so the sum stays numeric - like this:

usdprice = 1.65
gbpprice=2.2
eurprice=1.89

db.holdings.aggregate({$project : { customer:1, currency:1, quantity:1,
    price : { $add : [ { $cond : [ { $eq : [ '$currency','USD']}, usdprice, 0]},
                       { $cond : [ { $eq : [ '$currency','GBP']}, gbpprice, 0]},
                       { $cond : [ { $eq : [ '$currency','EUR']}, eurprice, 0]} ]
            }
    }})

This was my first attempt at joining - to be clear, you don't want to type these kinds of queries in by hand but rather use some code to construct the query for you from an array of objects. That's why MongoDB's object-as-query model rocks.
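For instance, here is a minimal sketch of building that projection from a table of prices rather than typing it out - the prices array and variable names are illustrative:

var prices = [ { symbol: "USD", price: 1.65 },
               { symbol: "GBP", price: 2.2  },
               { symbol: "EUR", price: 1.89 } ]

var conds = prices.map(function (p) {
    // 0 rather than null so the $add over non-matching branches stays numeric
    return { $cond : [ { $eq : [ "$currency", p.symbol ] }, p.price, 0 ] }
})

db.holdings.aggregate({ $project : { customer:1, currency:1, quantity:1,
                                     price : { $add : conds } } })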

Before you do that though - this method is really inefficient, as MongoDB will have to evaluate every single conditional for every document, and if you have tens of thousands of conditionals this takes forever.

I then wondered how to improve this. You could have it stop once it has found a match by nesting $cond statements - $cond has an if-then-else structure, so we can rewrite the above as

usdprice = 1.65
gbpprice=2.2
eurprice=1.89


db.holdings.aggregate({$project : { customer:1, currency:1, quantity:1,
    price : { $cond : [ { $eq : [ '$currency','USD']}, usdprice,
                { $cond : [ { $eq : [ '$currency','GBP']}, gbpprice,
                    { $cond : [ { $eq : [ '$currency','EUR']}, eurprice, null]}]}]}}})

This isn't great either - the number of evaluations is only halved on average; you have simply turned it into a list lookup that scans 50% of the list each time, and you have also increased the nesting depth - MongoDB has limits on how deep a BSON document can nest, and it's not thousands or even hundreds.

At this point I realised an efficient nested if-then-else lookup would be a binary tree - all I had to do was 'automatically' transform a table into a set of $cond statements with a minimal number of conditionals to reach any given value, keeping the depth down and the lookups optimal. I would need to use $eq for leaf nodes and $gt for branching nodes.


Here's an example complete with data to try it on.

First, generate our two data sets:


use fx
db.holdings.drop()

numcustomers = 100
numtrades = 100
currencies = ["AFN","ALL","DZD","AOA","XCD","XCD","ARS","AMD","AWG","AUD","AZN","BSD",
"BHD","BDT","BBD","BYR","BZD","XOF","BMD","BTN","INR","BOB","BOV","BAM",
"BWP","NOK","BRL","BND","BGN","XOF","BIF","KHR","XAF","CAD","CVE","KYD",
"XAF","XAF","CLF","CLP","CNY","AUD","AUD","COP","COU","KMF","XAF","CDF",
"NZD","CRC","XOF","HRK","CUC","CUP","ANG","CZK","DKK","DJF","XCD","DOP",
"EGP","SVC","XAF","ERN","ETB","FKP","DKK","FJD","XPF","XAF","GMD","GEL",
"GHS","GIP","DKK","XCD","GTQ","GBP","GNF","XOF","GYD","HTG","AUD","HNL",
"HKD","HUF","ISK","INR","IDR","XDR","IRR","IQD","GBP","ILS","JMD","JPY",
"GBP","JOD","KZT","KES","AUD","KPW","KRW","KWD","KGS","LAK","LBP","LSL",
"ZAR","LRD","LYD","CHF","LTL","MOP","MKD","MGA","MWK","MYR","MVR","XOF",
"MRO","MUR","XUA","MXN","MXV","MDL","MNT","XCD","MAD","MZN","MMK","NAD",
"ZAR","AUD","NPR","XPF","NZD","NIO","XOF","NGN","NZD","AUD","NOK","OMR",
"PKR","PAB","PGK","PYG","PEN","PHP","NZD","PLN","QAR","RON","RUB","RWF",
"SHP","XCD","XCD","XCD","WST","STD","SAR","XOF","RSD","SCR","SLL","SGD",
"ANG","XSU","SBD","SOS","ZAR","SSP","LKR","SDG","SRD","NOK","SZL","SEK",
"CHE","CHF","CHW","SYP","TWD","TJS","TZS","THB","XOF","NZD","TOP","TTD",
"TND","TRY","TMT","AUD","UGX","UAH","AED","GBP","USN","UYI","UYU","UZS",
"VUV","VEF","VND","XPF","MAD","YER","ZMW","ZWL","XBA","XBB","XBC","XBD",
"XTS","XXX","XAU","XPD","XPT","XAG"]

for(t=0;t<numtrades;t++){
    customerid = Math.floor(Math.random() * numcustomers) 
    currency = currencies[ Math.floor(Math.random() * currencies.length)]
    quantity = Math.floor(Math.random() * 10000)
    //Round to 3 decimal places
    buyprice = Math.floor( ((Math.random() * 3.0) + 0.5) * 1000) / 1000
    
    db.holdings.insert({_id:t,customerid:customerid, symbol: currency, quantity: quantity, buyprice: buyprice})
}

//Make a second table with 'todays' price

db.market.drop()

for(s=0;s<currencies.length;s++){
    currency = currencies[s]
    price = Math.floor(((Math.random() * 3.0) + 0.5) * 1000) / 1000
    db.market.insert({_id:currency,price:price})

}

Now we create our function, which can recursively add a new value to the MongoDB aggregation expression, building a tree as it goes.


addtotree = function ( tree, keyfield, key, value ) {
    if ( tree == null) {
        tree = { $cond : [ { $eq : [keyfield,key]}, value, null]}
        return tree
    }    else {
        isleaf = tree['$cond'][0]['$eq']
        if(isleaf)
        {
            nodekey = tree['$cond'][0]['$eq'][1]
            nodeval = tree['$cond'][1]
            left = {$cond:[{$eq:[keyfield,key]},value,null]}
            right = {$cond:[{$eq:[keyfield,nodekey]},nodeval,null]} 
            delete tree['$cond'][0]['$eq']
            if( key > nodekey )
            {         
                tree['$cond'][0]['$gt']=[keyfield,nodekey]
                tree['$cond'][1]=left
                tree['$cond'][2]=right
            } else {
                tree['$cond'][0]['$gt']=[keyfield,key]
                tree['$cond'][1]=right
                tree['$cond'][2]=left
            }
        } else {
            nodekey = tree['$cond'][0]['$gt'][1]
            if( key > nodekey )
            {
                addtotree(tree['$cond'][1],keyfield,key,value)
            } else {
                addtotree(tree['$cond'][2],keyfield,key,value)
            }    
        }
        return tree
    }

  }
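To illustrate what the function builds, here is a hand-run sketch for just the three currencies used earlier - the exact shape depends on insertion order:

t = null
t = addtotree(t, '$symbol', 'GBP', 2.2)
t = addtotree(t, '$symbol', 'USD', 1.65)
t = addtotree(t, '$symbol', 'EUR', 1.89)
// t is now a nested $cond expression, branching on $gt and matching with $eq:
// { $cond: [ { $gt: ['$symbol','GBP'] },
//            { $cond: [ { $eq: ['$symbol','USD'] }, 1.65, null ] },
//            { $cond: [ { $gt: ['$symbol','EUR'] },
//                       { $cond: [ { $eq: ['$symbol','GBP'] }, 2.2,  null ] },
//                       { $cond: [ { $eq: ['$symbol','EUR'] }, 1.89, null ] } ] } ] }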


Then pull all the current prices and put them in a query tree. Note you need to ensure they are not inserted in sorted key order: this is a plain binary tree, not a B-tree, and would end up completely unbalanced. I sorted by price, as that happens to be random relative to the symbol in this data, but normally a little array shuffling would do.

var marketPrices = null
var marketCursor = db.market.find({}).sort({price:1});
while(marketCursor.hasNext()){
    price = marketCursor.next()
    marketPrices = addtotree(marketPrices,'$symbol',price['_id'],price['price'])

}

I also pass in the key field from the primary table, '$symbol', as the query needs to know what it is matching on.

Finally I can run a nice, simple, fast aggregation to get the join (which can then be pipelined into calculations etc. in a further projection).


projects = { customerid: 1, symbol: 1, quantity: 1, buyprice: 1 }
projects['marketprice'] = marketPrices

db.holdings.aggregate({$project : projects})


> db.holdings.aggregate({$project : projects})
{ "_id" : 0, "customerid" : 312, "symbol" : "WST", "quantity" : 4895, "buyprice" : 0.899, "marketprice" : 3.221 }
{ "_id" : 1, "customerid" : 871, "symbol" : "IDR", "quantity" : 4027, "buyprice" : 2.03, "marketprice" : 1.362 }
{ "_id" : 2, "customerid" : 645, "symbol" : "PLN", "quantity" : 5941, "buyprice" : 2.799, "marketprice" : 2.722 }
{ "_id" : 3, "customerid" : 177, "symbol" : "ALL", "quantity" : 7155, "buyprice" : 1.318, "marketprice" : 1.322 }
{ "_id" : 4, "customerid" : 829, "symbol" : "PAB", "quantity" : 6049, "buyprice" : 1.7, "marketprice" : 2.754 }
{ "_id" : 5, "customerid" : 74, "symbol" : "NZD", "quantity" : 3040, "buyprice" : 1.404, "marketprice" : 2.142 }
{ "_id" : 6, "customerid" : 954, "symbol" : "MOP", "quantity" : 5260, "buyprice" : 3.366, "marketprice" : 0.883 }
{ "_id" : 7, "customerid" : 509, "symbol" : "XBC", "quantity" : 4600, "buyprice" : 2.458, "marketprice" : 1.503 }
{ "_id" : 8, "customerid" : 476, "symbol" : "EGP", "quantity" : 8831, "buyprice" : 0.85, "marketprice" : 2.083 }
{ "_id" : 9, "customerid" : 342, "symbol" : "PLN", "quantity" : 8070, "buyprice" : 1.245, "marketprice" : 2.722 }
{ "_id" : 10, "customerid" : 547, "symbol" : "XDR", "quantity" : 134, "buyprice" : 2.351, "marketprice" : 1.046 }
{ "_id" : 11, "customerid" : 743, "symbol" : "XBC", "quantity" : 9110, "buyprice" : 3.221, "marketprice" : 1.503 }
{ "_id" : 12, "customerid" : 933, "symbol" : "KYD", "quantity" : 6024, "buyprice" : 2.658, "marketprice" : 2.574 }
{ "_id" : 13, "customerid" : 652, "symbol" : "ZAR", "quantity" : 2902, "buyprice" : 2.385, "marketprice" : 1.181 }
{ "_id" : 14, "customerid" : 686, "symbol" : "ERN", "quantity" : 8642, "buyprice" : 1.355, "marketprice" : 2.31 }
{ "_id" : 15, "customerid" : 420, "symbol" : "XAF", "quantity" : 2897, "buyprice" : 1.002, "marketprice" : 2.31 }
{ "_id" : 16, "customerid" : 521, "symbol" : "MGA", "quantity" : 8122, "buyprice" : 2.393, "marketprice" : 2.495 }
{ "_id" : 17, "customerid" : 414, "symbol" : "MWK", "quantity" : 2793, "buyprice" : 0.637, "marketprice" : 2.646 }
{ "_id" : 18, "customerid" : 737, "symbol" : "XPF", "quantity" : 7817, "buyprice" : 0.827, "marketprice" : 3.462 }
{ "_id" : 19, "customerid" : 620, "symbol" : "PYG", "quantity" : 8110, "buyprice" : 2.223, "marketprice" : 1.697 }

The nice thing about this method is that it's efficient and it scales to large data sets. Given the 16MB limit on a MongoDB document you can build a pretty large lookup table - depending on your keys and values, tens of thousands of entries should be possible. I'll be pushing those limits with my big data blog next week.




Saturday, May 24, 2014

The case of the missing operator - How to use $sqrt in the MongoDB Aggregation Framework


The MongoDB Aggregation framework is a fast way of doing ad-hoc summary calculations and processing over many records. Unlike mapreduce it's declarative, akin to SELECT … GROUP BY .. HAVING in SQL but much more powerful.

It has a number of mathematical operators but, from a statistical processing point of view, one annoying omission is the lack of a square root operator - what if you have a series of values and need to calculate a standard deviation?

Fortunately – one of the earliest Apple developers – Sir Isaac Newton told us how to do this using the existing set of operators. I'm just translating his early blogs.

Newton's method says: guess, then average your guess with the number divided by your guess, and repeat - each step is a linear approximation. It's like the old playground game of guessing the number I'm thinking of, then improving based on whether you are too high or too low.
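To see the arithmetic on its own, here is the same iteration as a few lines of plain shell JavaScript - a sketch, separate from the aggregation pipeline below:

// Repeatedly replace r with the average of r and n/r - the same formula
// the 'refine' projection below encodes.
var n = 29, r = 1;
for (var i = 0; i < 8; i++) {
    r = ((n / r) + r) / 2;
}
print(r); // ~5.3851648...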

In the aggregation framework this is simple.

Lets say we have a document

{ n : 29 }

we want to add a 'root' to that to make it 

{ n : 29, r: ??? }

where r is the square root.

Let's put that in a collection

>use blogcode
>db.roots.insert({n:29})

First we need an aggregation pipeline stage to add our first guess – whilst I could get clever here I'm simply going to always start with a guess of 1. In the mongoDB shell I will declare this as an object.

> firstguess = { $project : { n : 1, r : { $literal : 1}}}

This shows off a handy addition in MongoDB 2.6: n:1 means copy n through from the previous step (or the original data) as it was, and to set r to the value 1 we use $literal - that's nicer than any of the 2.4 ways of doing this (like using $add:[0,1]).

Then we need a step to refine our answer -

> refine = { $project : { n: 1, r : { $divide : [ {$add : [{$divide : [ "$n","$r"]} , "$r" ]},2]}}}

Then run the pipeline

> db.roots.aggregate([firstguess,refine,refine,refine,refine, refine,refine,refine,refine])

> { "_id" : ObjectId("538062103439ddd3764ff791"), "n" : 29, "r" : 5.385164807134505 }

My Calculator says sqrt(29) = 5.385164807134504 so that's good enough!

How many times you refine it is up to you - it converges pretty fast. You can build that array of operators with a loop easily enough, as sketched below, but too much iteration takes too long for little value - a couple of dozen stages at most will suffice.
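For instance, a minimal sketch of building the pipeline in a loop (the iteration count is arbitrary):

// Build the pipeline programmatically instead of typing 'refine' repeatedly.
var iterations = 12;
var pipeline = [ firstguess ];
for (var i = 0; i < iterations; i++) {
    pipeline.push(refine);
}
db.roots.aggregate(pipeline);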


How fast is it? Let's put 1,000,000 values in to test.

> for(x=0;x<1000000;x++){ db.roots.insert({n:x}) }


Because I want them all calculated inside the timer I've added a final, pointless $group pipeline stage; otherwise you only get results as you pull them from the cursor.


use blogcode

var firstguess = { $project : { n : 1, r : { $literal : 1}}}
var refine = { $project : { n: 1, r : { $divide : [ {$add : [{$divide : [ "$n","$r"]} , "$r" ]},2]}}}

var start = new Date().getTime();

db.roots.aggregate([firstguess,refine,refine,refine,refine, refine,refine,refine,refine,{$group:{"_id":1}}])

var end = new Date().getTime(); 
var time = end - start;
print('1 Million roots in: ' + time + "ms");

On my Mac that takes 14 seconds. As this is done entirely in projections, all processing will be done on the individual shards for maximum parallelism, so using sharding or micro-sharding (or simply splitting the job into two pipelines) will increase speed pretty much linearly.

Square root in aggregation is also a JIRA Ticket ( SERVER-8568 ) feel free to vote for it.













Thursday, December 19, 2013

Mongo Aggregating Fractals.

Here's a holiday season special for you - can you do complex iterative calculations with the Aggregation Framework in MongoDB? Of course you can.

First, set the shell batch size so the whole output is displayed:


DBQuery.shellBatchSize = 500

Then paste the following into a Mongo shell, and set your terminal to a very tiny font to view the output.

db.img.drop()
iters = 1000

for (x=-1.5;x<1;x=x+0.03){
 for (y=-1.1;y<1;y=y+0.01){
    db.img.insert({'cr':x,'ci':y,'zr':x,'zi':y,'it':0})
  }
 }

zrsquare = {'$multiply' : [ '$zr','$zr' ]};
zisquare =  {'$subtract' : [ 0 , {'$multiply' : [ '$zi','$zi' ]}]};
zixzrx = { '$multiply' : [{'$multiply' : [ '$zr','$zi' ]},2]};
inciflow = { '$cond' : [ { '$lte' : [ '$zr' , 4 ]} , 1 , 0] };
itterate = { '$project' :
    { 
        'cr' : 1,
        'ci' : 1 ,
        'zr' : { '$add' : [ zrsquare , zisquare, '$cr']},
        'zi' : { '$add' : [ zixzrx, '$ci' ] } ,
        'it' : { '$add' : [ "$it" , inciflow]}
    } 
};

pipeline =[];
for(i=0;i<iters;i++){ pipeline.push(itterate);}
pipeline.push({ '$group' : { '_id' :'$ci','r' : { '$push' : { '$cond' : [{ '$lt' : [ '$it', iters]},88888,1]}}}});
pipeline.push({ '$sort' : { '_id' : 1 }});
pipeline.push({'$project':{ '_id' : 0, r : 1}});

db.img.aggregate(pipeline);


The output - well it looks a bit like this,







Friday, October 11, 2013

Efficient Techniques for Fuzzy and Partial matching in mongoDB


Abstract


This blog post describes a number of techniques, in MongoDB, for efficiently finding documents that have a number of attributes similar to a supplied query whilst not being an exact match. This concept of "fuzzy" searching allows users to avoid the risk of failing to find important information due to slight differences in how it was entered.

Introduction


Where users are able to enter data items manually, rather than choosing from a list of options, there is a reasonable probability that multiple users will enter the same data in different forms. For names this may mean misspellings or typographic errors; telephone numbers may be entered with or without spaces or international prefixes; when typing addresses, people may enter sub-items in the wrong form boxes.

In an application aware of these issues the data entry interface can be made somewhat smarter; telephone numbers can have an enforced format and addresses can be verified against a geo database. Unfortunately other data elements such as names have no such safeguards and are therefore frequently entered in multiple formats.

Historical data may also form part of the data you are searching and this may well have been entered without a rigid data quality regime and without any form of data cleansing and verification after the fact.

Telephone numbers can be in the correct format but still be incorrect due to a mistake made in typing or mentally transposing digits. This makes them candidates for "fuzzy" matching too.

People may explicitly change the way they describe themselves for a variety of reasons: a change of name through marriage, divorce or a decision to use a middle name as a first name; a change in the way an address is described, sometimes including a neighbourhood, sometimes not. In some circumstances people may deliberately make subtle changes to their identification details to avoid identification, or at least to avoid being linked to a previous data entry.

Finally, and somewhat ironically, too much information can prevent the discovery of matches. Where stored data contains fewer data points than are being used to query, a naïve matching technique can result in zero exact matches and therefore zero records found.

This document will arm the reader with four efficient techniques using MongoDB features to ensure users find the data they need - even where there are differences between their query and the stored data.

Business Benefits.


"Fuzzy" search techniques are especially useful in certain industries:

Any organisation with a large number of customers and product lines, especially in a B2C situation, can have challenges around finding all existing customer details. Whilst some details such as an account reference number help, these fail where multiple disparate systems are brought together, or where different attributes have been recorded. Businesses implementing a unified customer view can therefore benefit from using these methods.

Organisations investigating fraud – especially higher volume, lower value fraud such as insurance claims – can benefit from being able to see through the simple, seemingly innocent obfuscations used to thwart fraud detection.

Organisations working in Law Enforcement or National security can often have very poor data quality due to the nature of information gathering or very unreliable search criteria. These techniques assist in finding the relevant data regardless of source or quality.

Efficiency


It is important to define the concept of efficiency. Whilst hard numbers are provided for the performance of the algorithms in specific tests in this whitepaper, these need to be understood in a larger context. In nearly all cases it is slower to find partially matching documents than exact ones: an exact match can nearly always be completed in O(log(n)) time with a very small constant overhead, as long as a suitable index exists, whereas a partial match may require additional computation.

The techniques described here are designed to perform significantly better than a naïve approach, which could have O(n) or worse running time. They generally make use of indexes to stay at O(log(n)), but may carry a larger constant overhead or multiplier than a simple binary search. For example a fuzzy search may take ten times as long as a non-fuzzy search, but it will still scale with data growth.

In terms of storage efficiency, there is again a cost in additional storage and write-time processing in order to reap benefits at query time. In a typical case this is expected to be only a small overhead relative to the size of the data being stored and retrieved, or the overall data size is expected to be small and relatively bounded.

Sample Data


These techniques were tested with data sets of 5 and 20 million fictional individuals, each person having a first and last name, one or more middle names, a gender, an address and multiple telephone numbers. In addition, a small percentage of records were corrupted to simulate typing mistakes on data entry.

The data generation script along with a python based web user interface for testing is available on Github in (https://github.com/johnlpage/Fuzzgo) – the Fuzzgo web interface is the source of the screen shots used in this whitepaper.

The sample data is 8GB and the indexes are 6GB. In this data the proportion of ‘Fuzzy Searchable’ data to retrievable data is very high, making the indexes proportionally far larger than they would be in production. In addition this data is indexed to demonstrate multiple techniques, not all of which are required for each use case.




Technique One – Find Similar Sounding Data


When typing a name you have only heard and not seen, it can be quite difficult to spell it correctly. The bearer may spell a name we have previously learned differently to our expectation. Other names, especially those not native to our own language, can be exceptionally difficult to transcribe, and not everyone will question the spelling when being told it. Further, when a name is passed on via a third party, important information about the spelling may be lost completely.

Even in common usage there are many spellings of the same name; there are more than three dozen common spellings of the name Mohammed in English – although only one in Arabic.

One of the best solutions to this is to make use of one of the family of canonical sound-alike algorithms. These algorithms were first used manually in 1918 for analysing census data, but have been improved greatly since the original ‘Soundex’, which, whilst simple to apply on paper, gives poor results for many names.

Today we have access to a choice of algorithms: Metaphone, Double Metaphone, NYSIIS, Caverphone, Daitch-Mokotoff and Kölner Phonetik[1]. Which algorithm is most appropriate depends on the language you principally expect the data to be in and also potentially on any licensing costs. The technique to utilize them remains the same and for the purposes of this whitepaper we will use Double Metaphone, as this is included in Python and is a good general purpose algorithm.

A sound-alike algorithm takes a text string, often but not always representing a name, and emits a second text string which is a canonical version of it. The theory is that two inputs which sound similar will have the same output. The table below shows some examples.



Original    Soundex    Metaphone    DoubleMetaphone
John        J500       JN           JN-AN
James       J520       JMS          JMS-AMS
Jean        J500       JN           JN-AN
Sean        S500       SN           SN-SN
Smith       S530       SM0          SM0-XMT
Smythe      S530       SM0          SM0-XMT
Schmidt     S530       SXMTT        XMT-SMT
Shaun       S500       XN           XN-XN
Mary        M600       MR           MR-MR
Mariah      M600       MR           MR-MR
Marion      M650       MRN          MRN-MRN

From this we can see that the original Soundex algorithm, whilst relatively good, fails to differentiate between Smith and Schmidt – whether that is correct is somewhat subjective and, again, language dependent.

Whilst Metaphone is capable, Double Metaphone returns two values per input, which can be compared independently. Thus in the case of Smith/Schmidt it tells us there is some similarity – both match XMT – but there are also differences.

In this whitepaper we are using the concatenated output of Double Metaphone to ensure maximum similarity between terms deemed to sound alike.

Using a sound-alike algorithm in MongoDB is very simple; wherever you have a field you wish to query by its sound rather than its value, include, at load time, an additional field containing the permuted version. We propose the convention _s_fieldname based on the original fieldname, the underscore denoting a ‘special’ field and the s denoting sound-alike.[2]

For a document like this

{
      firstname: JOHN,
      lastname: SMITH,
      … Many other data fields
}

You would instead load a document containing

{
      firstname: JOHN,
      _s_firstname: JNAN,
      lastname: SMITH,
      _s_lastname: SM0XMT
      … Many other data fields
}

Indexing requirements for these fields should be determined in the same way as for any other field – if you intend to use the field as the foremost part of a query then it should probably be indexed.

When querying against these fields, the query term should first be passed through the appropriate sound-alike algorithm, as in the Python example below.

#Python example – Javascript is VERY similar

from metaphone import doublemetaphone

name = "JOHN"
metaname = doublemetaphone(name)
queryname = metaname[0]+metaname[1]

db.people.find({"_s_firstname":queryname})

If you index both the normal and sound-alike fields then you have the overhead of an additional index as well as the additional data. Depending on your use case it can be more efficient to forgo the index on the original field and always make use of the sound-alike index when querying – if you only want exact matches then further filter the query on the exact field.

For example, if I want the exact name ‘John’ as the first name but I do not want the overhead of a second exact index, I would index the _s_firstname field using:

db.people.ensureIndex({_s_firstname:1})

then run a query of the form:

db.people.find({ "_s_firstname":"JNAN","firstname":"JOHN"})

This will make use of the index to find documents that ‘sound like’ John and then scan those documents to reduce the set to exact matches. Whether this is an appropriate choice will depend on your application, and on your preference between more effort at data insertion plus more storage, or marginally faster data retrieval.

This is an example of searching the firstname field for ‘Maurice’ using the Sounds-like feature. It returns results for Morris, Maurice, Marisa, and Marcy on the first page.



Technique Two – Searching for typographical errors.


The sound-alike algorithm is useful where the user has misspelt the word or used the wrong form of it. However it is a hashing function, and so a very small change in the entered data which is not related phonetically can render it useless. For example accidentally changing smith to dmith, a simple typing error, will return a totally different set of results, not including smith.

A good way to find these is by using a simple typo algorithm allowing for a small variation in the word matched. A commonly used algorithm for typo matching is:

Allow any one extra character
OR       Allow any one missing character
OR       Allow any one modified character
OR       Allow any two characters to be transposed

This can easily be encoded into a regular expression based on the original word. A regular expression is a pattern for matching that, amongst other things, can use the character . (dot) to denote any character and | (pipe) to denote alternatives, as well as ^ and $ to denote the start and end of the word.

Therefore, when looking for the word Joe, the following regular expression will find typo variations:

db.people.find({firstname : /^.JOE$|^J.OE$|^JO.E$|^JOE.$|^OE$|
^JE$|^JO$|^.OE$|^J.E$|^JO.$|^OJE$|^JEO$/})
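Writing these patterns by hand quickly becomes tedious, so in practice you would generate them. Here is a minimal sketch - makeTypoRegex is an illustrative helper, not a MongoDB feature, and it assumes plain alphabetic input:

makeTypoRegex = function (word) {
    var parts = [];
    // one extra character in any position (including the end)
    for (var i = 0; i <= word.length; i++) {
        parts.push("^" + word.slice(0, i) + "." + word.slice(i) + "$");
    }
    // one missing character
    for (var i = 0; i < word.length; i++) {
        parts.push("^" + word.slice(0, i) + word.slice(i + 1) + "$");
    }
    // one modified character
    for (var i = 0; i < word.length; i++) {
        parts.push("^" + word.slice(0, i) + "." + word.slice(i + 1) + "$");
    }
    // adjacent characters transposed (as in the Joe example above)
    for (var i = 0; i < word.length - 1; i++) {
        parts.push("^" + word.slice(0, i) + word[i + 1] + word[i] + word.slice(i + 2) + "$");
    }
    return new RegExp(parts.join("|"));
}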

This is not, however, a scalable approach to searching. This query's resource cost is linear with data size, multiplied by the number of fields searched, as can be seen in the screenshot below; over a significant data set it can take a very long time to complete.


This is actually one of a large family of problems where the cardinality of the field is low relative to the number of records, and where the largest part of the work can be done against a unique list of values. In this circumstance it may seem appropriate to create an index on the field in question and make use of the distinct() function in MongoDB, but this approach has two issues.

Firstly, and most notably, there is no method supplied to filter distinct values; there is currently no way to request only those distinct entries matching a given pattern.

Secondly, the internal structure of the B-tree indexes in MongoDB means that when an index scan is performed it scans one index entry for every document in the collection; it does not scan unique values and then fetch a list of matching documents. For example, if you have two distinct values spread across 1000 documents, an index scan will scan all one thousand entries, not two, even when the value is the same and only the indexed field is being retrieved.

The solution to this is to create a database term list in a secondary collection. This term list is a very small collection designed to efficiently process a unique list of terms – it is appropriate where the number of unique terms is a small fraction of the total number of terms, and where some kind of fuzzy matching is appropriate.

To do this – if you have, for example, a collection customers.persons – create an additional collection for each field you wish to perform partial matching on. In the simplest case, where you wish to look at the firstname field, create customers.persons_firstname.

In this collection use the value of the name as _id, and insert into it whenever inserting into, or updating, the principal collection. You should explicitly ignore any duplicate key errors.

So if you were performing

db.persons.insert({firstname: "john", secondname: "page", city: "Glasgow"})

At the same time do

db.persons_firstname.insert({_id: "john"})
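Alternatively, if you would rather not handle the duplicate key error at all, an upsert achieves the same effect - a small sketch:

// Upserting the term makes maintaining the term list idempotent -
// no duplicate key error is ever raised.
db.persons_firstname.update({_id: "john"}, {_id: "john"}, {upsert: true})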

This collection will be small – the entries are small and it has a single index – so entries have a very low insertion cost.

When performing a complex matching operation – typically a regular expression, although on such a small collection many options such as thesaurus lookups are also appropriate – perform it against the term-list collection, retrieve the list of results (which will be a unique list of names that exist in the database) and convert this into either a $or or a $in query. Normally the list of values will be relatively small.

Now you have a query like {firstname: {$in : ["john","jogn","johm","joohn"]}} which you can run against the main collection using a normal index.
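Putting the two steps together, a minimal sketch (using the illustrative makeTypoRegex helper from above):

// Step 1: run the expensive regular expression against the small term list.
var typoRegex = makeTypoRegex("john");
var candidates = [];
db.persons_firstname.find({_id: typoRegex}).forEach(function (d) {
    candidates.push(d._id);
});
// Step 2: use the resulting unique names in an indexed $in query on the
// principal collection.
db.persons.find({firstname: {$in: candidates}});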

When using this technique with the typo algorithm above there is a further optimisation you can make. In this algorithm, one of the first two letters of the match is always one of the original first two letters, although they can swap position. Therefore in your _firstname collection you should add a secondary value: an array containing the first two letters of the word.

For example:

db.persons_firstname.insert({_id: "john", f: ["j","o"]})

Then add a further index using

db.persons_firstname.ensureIndex({f:1,_id:1})

Now when performing a regular expression match against the term list for a garbled value you can do

db.persons_firstname.find({f: {$in: ["a","b"]}, _id: /[regex for name starting ab]/})

This reduces the index search space considerably – theoretically to ¼ of the entries, although this is dependent on the distribution of letters in your language.

Using this technique, the time taken for a garbled search against a large data set can be reduced by three orders of magnitude, as seen in the screenshot below.




Technique 3 – Any Field Search


Where you have a set of related fields, such as the parts of a name or an address, it is very common for this information to be entered in the wrong field. This can be the result of operator carelessness or a lack of clarity as to how an address should be structured – for example, if you live in a suburban town do you put the town name or the nearest city in the ‘city’ field on a form? There are also cultural differences, especially regarding names, where the western concept of a first name and family name does not match norms from the far east, where the family name comes first.

MongoDB includes a very simple core feature to deal with this: array fields.

Simply put, add an additional indexed field to your data and in it place an array of all the values within the group. So for names, where you are storing firstname, middle name #1, middle name #2 and lastname, add an additional field allnames containing an array of all the assigned name values – there is no requirement to store null or empty values.

This is an example of making use of mongoDB features that are not normally available in the relational world.

As an Example

db.people.insert({firstname: "john", middlenameone: "Laing", lastname: "page", allnames: ["john", "laing", "page"]})

You should create an index on the allnames field. Now when querying for a name, use allnames in your query rather than the individual name fields. This will enable the retrieval of a record regardless of which field the value was entered into, as seen below.
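For example, a one-line sketch against the document above:

// Finds the record whichever name box "laing" was originally typed into.
db.people.find({allnames: "laing"})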



It should be noted that MongoDB is case sensitive in its searches unless a case-insensitive regular expression is used. There could be significant advantages to having a case-insensitive index, where all queries are converted to a canonical lower or upper case and all values in the index (but not the data) are stored as such. Whilst there are a number of gotchas, specifically regarding covered indexes and regexes, if you feel this would be useful you may wish to request it as a MongoDB server feature.

Equally there are potential advantages – again with caveats – to a multi-field index, so that this could be done with a ‘virtual’ field which does not need to be added to the actual document. Again, if you feel this would be of benefit you should request the feature in MongoDB.



 

Technique 4 – Matching a subset of values and scoring results.


One of the core capabilities required for efficient fuzzy matching is the ability to match on a subset of attributes, where which subset is not known in advance. The normal Boolean operators $and and $or prove limited for this: $or will often match too much and $and will only manage exact matches. One solution is to construct a quorum search where at least n data points need to match. This can be combined with the techniques above, as well as sorting and weighting, to provide very powerful matching capabilities indeed.

The key to doing this in mongoDB is to make use of the aggregation framework as a fuzzy query engine.

Once you have performed the required aggregation you can take the list of _ids output by the aggregation framework and treat them as a type of cursor – fetching the records as required by performing a $in query for a single record, a page of records or all records. In the example code I retrieve all records for simplicity – an example of this quorum search, finding records with at least 2 of the three values provided, is shown below.


The mechanism is easiest explained using an example:

Suppose we want to match the name "John Andrew Cartwright Smith"

Our data has four name fields firstname,lastname, middlename #1 and middlename #2.

A $and query will only match persons called "John Andrew Cartwright Smith", not "John Andrew Smith", "John Smith" or "John Andrew Cartwright". A $or query will return all Johns, Andrews, Cartwrights and Smiths, resulting in far too much data.

Given that the first few steps of an aggregation pipeline are essentially a query, and that the normal rules regarding querying and indexing apply, we should select the larger set in our aggregation pipeline with a $or query – this will make use of multiple indexes if they exist, searching in parallel to seed our pipeline.

"$match" : {
      "$or" : [
            {"firstname" : "John"},
            {"middlenameone" : "Andrew"},
            {"middlenametwo" : "Cartwright"},
            {"lastname" : "Smith"}
            ]
      }

We will then project this, making use of the conditional operators within the projection to effectively 'score' our query by how many of the terms were present.

"$project" : {
      "c" : {"$add" : [
{"$cond" : [{"$eq" : ["$firstname","John"]},1,0]},
{"$cond" : [{"$eq" : ["$middleone","Andrew"]},1,0]},
{"$cond" : [{"$eq" : ["$middletwo","Cartwright"]},1,0]},
{"$cond" : [{"$eq" : ["$lastname","Smith"]},1,0]}
            ]
      }

Now we will have one document for each match of the original $or; each has the _id field, as this is retained by $project by default, as well as a value c between 1 and 4 depending on how many of the fields matched.

We can then filter this to select, for example, only where two terms matched.

"$match":{"c":{"$gte":2}}

The aggregation framework then returns a set of small JSON objects containing _id and score, which we can use effectively as a cursor to retrieve those documents we wish to view. If we do not need the score we can use a second projection to filter out c.

By adding a sort to the aggregation pipeline we can retrieve the best matches first. You can also – where you have a sound-alike field as described in technique 1 – give a higher score to an exact match than to a ‘sounds-like’ match when calculating c in the projection above. A sketch of the assembled pipeline follows.
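This is a sketch only: field names are as in the example above, and it assumes a MongoDB 2.6+ shell where aggregate() returns a cursor.

// Quorum search: at least two of the four name terms must match.
var pipeline = [
    { $match: { $or: [ { firstname: "John" }, { middlenameone: "Andrew" },
                       { middlenametwo: "Cartwright" }, { lastname: "Smith" } ] } },
    { $project: { c: { $add: [
        { $cond: [ { $eq: [ "$firstname", "John" ] }, 1, 0 ] },
        { $cond: [ { $eq: [ "$middlenameone", "Andrew" ] }, 1, 0 ] },
        { $cond: [ { $eq: [ "$middlenametwo", "Cartwright" ] }, 1, 0 ] },
        { $cond: [ { $eq: [ "$lastname", "Smith" ] }, 1, 0 ] } ] } } },
    { $match: { c: { $gte: 2 } } },
    { $sort: { c: -1 } }     // best matches first
];
// Treat the _ids coming back as a cursor and fetch the full documents.
var ids = [];
db.people.aggregate(pipeline).forEach(function (d) { ids.push(d._id); });
db.people.find({ _id: { $in: ids } });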

If you wish to use this technique with an array of names, so matches can be in any field, then a subtly different approach is needed. Rather than simply calculating a sum you need to put each individual match into a set and count the number of unique values you have. This is to avoid a scenario where a name like "james james smith" is seen as matching two terms despite only one name matching.

If a name like ‘James James Smith’ seems improbable, bear in mind that once you add in sound-alike or typo algorithms it is much more likely that two parts of a single name will match the same term.

To calculate the length of a set in MongoDB 2.4 you need to unwind and group it – there is no operator to match on set length.

Given a field called ‘allnames’ to query on the aggregation pipeline would look as follows.

#Find all candidate records using a $or, you can also do this if using a soundalike field or name variations from a termlist as well.

{"$match" : {"$or" : [ {"allnames" : "John"    },{"allnames" : "Andrew"},{"allnames" : "Cartwright"},{"allnames" : "Smith"}]}},]}}

Minimise the data set for performance reasons.

{"$project" : "allnames"}

Create a single document per name.

{"$unwind" : "$allnames"}

For each name, if it’s one we are interested in then keep it, otherwise drop it.
Even if matching on a sound-alike or a term list, keep the original name. $concat is used to aggregate the multiple string conditionals.

{"$project" : {
"c" : {"$concat" :
[{"$cond":[ {"$eq" : ["$allnames", "John" ]},"John",""]},
{"$cond":[ {"$eq" : ["$allnames", "Andrew"]},"Andrew",""]},
{"$cond":[ {"$eq" : "$allnames","Cartwright"]},"Cartwright",""]
{"$cond":[ {"$eq" : ["$allnames", "Smith"]},"Smith",""]}]}}}


Now we can group into a set of which distinct names matched.

{"$group" : {"c" : {"$addToSet" : "$c"},"_id" : "$_id"}}

Calculate the set length and store in c

{"$unwind" : "$c"},
{"$group" : {"c" : {"$sum" : 1},  "_id" : "$_id"}}

And match to retrieve those with multiple matching values. As the empty string from the non-matching conditionals also ends up in the set, the $gte threshold is one larger than the number of matches you want.

{"$match" : {"c" : {"$gte" : 3}}},
{"$project" : {"_id" : "$_id"}}


Conclusion


It is possible to perform many sophisticated partial matching operations in MongoDB whilst staying within the bounds of indexed queries and avoiding whole collection, or even whole index, scans over anything but the smallest of data sets.




[1] http://en.wikipedia.org/wiki/Phonetic_algorithm
[2] At this time the sample code on Github does not yet use this convention.