Thursday, March 31, 2016

MongoDB is not a Javascript Database

Sometimes people refer to MongoDB as a Javascript/JSON database - it's not: its internal format is BSON, a serialised binary representation of objects. Unlike Javascript, everything in MongoDB is stored strongly typed and is quick to traverse.

That being said, people do access MongoDB from duck-typed languages like PHP, Perl, Python and, yes, Javascript, so it can certainly seem that way - and the Mongo shell is a Javascript REPL, so it all seems very Javascript focussed.

Why does this matter and what's this Blog post about?

Well, it matters because one of the things I want from a database is to know that dates are dates, integers are integers and floating point numbers aren't being stored as strings. Some 'document' databases just store JSON strings and deal with the types at runtime - ACID compliance is one debate, but simply knowing you stored a date as a date and not as a string matters too.
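For instance, in the shell (a minimal sketch using a throwaway collection I'm calling types here):

// A real Date round-trips as a BSON date; a string stays a string
db.types.insert({ asDate: new Date("2016-03-31"), asString: "2016-03-31" })
db.types.findOne().asDate instanceof Date        // true - stored as a BSON date
typeof db.types.findOne().asString               // "string"
db.types.find({ asDate: { $type: 9 } }).count()  // 1 - BSON type 9 is Date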

Actually I just wanted to get that off my chest before I get to the point of this post.

I want to talk about the fact that you can query MongoDB using Javascript - and like many things, this is a very useful piece of rope: you can secure things to things, you can tie things and pull things and climb things - but you can also hang yourself! So use the following information wisely.

In MongoDB, a normal query is sent to the database as an object - basically query by example - you make an object in the language of your choice and send it to a find function on a collection object to find matching objects.

// e.g. with the Java driver - assumes an existing MongoClient called connection
Document query = new Document("name", "john");
MongoCursor<Document> cursor = connection.getDatabase("myApp")
                                         .getCollection("myUsers")
                                         .find(query).iterator();
while (cursor.hasNext()) {
    Document record = cursor.next();
}

Or in the shell (Javascript)

c = db.getSiblingDB("myApp").users.find({"name","john"})

This is the MongoDB approach - rather than a query language aimed at humans writing strings, which is how SQL came about, it starts from the premise that queries will be constructed by code, from some form of UI - and it's actually a much better and safer fit than constructing SQL strings.
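As a minimal sketch of what I mean - the uiFilters object here is just a stand-in for whatever your form or UI hands you:

// Build the query as data, not by concatenating a string - nothing to escape, nothing to inject
var uiFilters = { name: "john", city: "Glasgow" };   // hypothetical values straight from a UI
var query = {};
for (var field in uiFilters) {
    query[field] = uiFilters[field];
}
db.getSiblingDB("myApp").users.find(query)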

What many don't know is you can also do this.

c = db.getSiblingDB("myApp").users.find('this.name == "john" ')

Wait - what? That's a query, as a string! Is it code, or some undocumented query language? What's that all about?

Well, that's the short version of:

c = db.getSiblingDB("myApp").usersfind({ $where: 'this.name == "john"' })

Each record in the database is converted, server side, from its serialised form into a Javascript object 'this', and then the supplied Javascript is run against it; whether that returns true or false determines whether the record is returned. It's magic and so powerful and ... slow.

The reason it's slow is twofold:

(1) Javascript - this is an interpreted/JIT language and it's never fast, despite what node.js fans tell you.

(2) It has to convert every record in the collection - no indexes are used here.

So IMPORTANT - DO NOT do this, don't use $where for your queries.

Except $where it makes sense (see what I did there)

What if you cannot express your query using the MongoDB query language? The Javascript interpreter in MongoDB is heavily sandboxed - you cannot call out to other services or even access other parts of the database, and you are limited to the one record at hand - but it does let you express, for example, WHERE a+b > 5 or even WHERE a > b AND b < 5.
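For example, both of these are legal - $where takes either a Javascript string or a function (a sketch against the collection I create below):

db.mine.find({ $where: "this.a + this.b > 5" })   // WHERE a + b > 5
db.mine.find({ $where: function () {              // WHERE a > b AND b < 5
    return this.a > this.b && this.b < 5;
} })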

Let's take that last one as an example and see what the speed is like.

First I'll make a million records in the shell.


for(x=0;x < 1000000 ; x++) {
db.mine.insert({ _id:x, a:Math.random()*1000, b:Math.random()*1000}) } 

Now let us see how long a simple unindexed normal search takes


var start = new Date().getTime();
db.mine.find({a:{$lt:1}}).count() // count() makes the server examine everything, not just return the first batch
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

384ms - Not bad in human terms but you wouldn't want to do very many of them.

Let's try that javascript query above now

var start = new Date().getTime();
db.mine.find("this.a > this.b && this.b < 5").count()
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

38,226 ms - yes, nearly 40 seconds. This is why you just don't do this! That database is only 38MB.

But I don't like to leave you without some positive news. MongoDB also allows a hybrid mode, thanks to $where, so I can do:

c = db.mine.find({b: {$lt: 5}, $where: 'this.a > this.b'})

This will use an index for b - which I'd better check exists first.

db.mine.ensureIndex({b:1})
var start = new Date().getTime();
db.mine.find({b:{$lt:5},$where:"this.a > this.b"}).count()
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

OK - that's 263 ms, about 150 times faster, and I get to do comparisons between fields.


But is it the best option? No, Javascript is rarely the best option. MongoDB offers something much faster and just as capable, the Aggregation Pipeline.

With aggregation, you can write the above query like this. I have to add the last stage to do the count.

matchQuery={$match:{b:{$lt:5}}}
compareVals = { if: { $gt:["$a","$b"] }, then:true ,else:false}
addBool = {$project: {r:"$$ROOT",k:{$cond: compareVals} } }
filterNot = {$match: {k:true} }
simCount={$group:{_id:1,count:{$sum:1}}}

var start = new Date().getTime();
db.mine.aggregate([matchQuery,addBool,filterNot,simCount])
var end = new Date().getTime();
var time = end - start;
print('Execution time: ' + time);

And that takes ... 27ms, over 1,400 times faster than the Javascript!!

The debate rages over whether the Aggregation Framework is Turing complete - but if you are using MongoDB and running ANY in-database Javascript, you probably aren't doing the right thing. MongoDB is NOT a Javascript database.








Friday, July 3, 2015

How Wired Tiger opens up the option of MongoDB for Big Data Analytics.


Introduction

Using the Wired Tiger engine for MongoDB brings big improvements in concurrency - although, as I blogged last time, not in every case unless you understand how to take advantage of it.

However, the data compression in Wired Tiger is another real win, with the cost of enterprise disk being so high and high-performance cloud storage and SSDs not far behind.

What, though, if you want to use MongoDB as a data analytics platform? What if you have a lot of data - I'm going to avoid calling it big data, because then everyone just discusses what 'big' means - but what if you have a lot of data, it's arriving quickly and it has an ad-hoc schema?

It turns out that Wired Tiger has a great solution to that.

Let's set out a real world example - a number of companies, including Amadeus, put data into MongoDB for ad-hoc analysis. The requirement is that a smallish number of users can perform any query or aggregation they want, or pull data into something else for further analysis.

To do this well they either have a huge quantity of RAM so all the data fits in memory, or they have Fusion-io or other ultra-fast flash storage, or both.

That's good, and one of them asked me what the implications of Wired Tiger were for them - so for the last few months I've been working that out. The answer is that it saves you a fortune in hardware.

I have also made sure that without too much difficulty you can repeat the results I get here – or at least the first and final results. Details at the end…

In short - the challenge is to take the MOT (vehicle inspection) reports from http://data.gov.uk, load them in, and then be able to perform any ad-hoc query and grouping over them. No pre-calculation, pre-aggregation or even indexes are allowed - after all, you may want to perform ANY computation. There are 290 million reports, so let's load them twice just to ensure we have at least half a billion.

Each report is simple, tabular, fixed data - exactly what MongoDB is NOT best at. In a later blog we can compare and contrast the document model and an RDBMS as we pull in the more specific defect reports; in this case let's keep it simple.

There are 28 GB of tab-delimited MOT files on data.gov.uk, so loaded twice that's 56 GB of raw data - and MongoDB will extend it with field names and BSON types and such, so in the simplest mode it gets larger still.
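To make that concrete, here is a sketch of roughly what one loaded MOT result might look like - the short field names (e, f, m) are the ones the aggregation below uses, and the values are invented:

// Every document carries its own field names and BSON type bytes,
// which is why even one-character keys add up over 580 million records.
{
    _id : 123456789,   // hypothetical
    e   : 1598,        // engine capacity in cc
    f   : "D",         // fuel type, 'D' for diesel
    m   : 87234        // recorded mileage
}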

To save time as I tried multiple configurations I wrote a custom loader in C - this allows me to load the data in parallel on a sharded cluster - and all the code for that is available too.


The tests were on an AWS c4.4xlarge instance with 30GB RAM and 16 CPUs - a typical 'Enterprise' machine, perhaps a little light on RAM.

All data was on EBS - non-PIOPS, spinning disk. This was deliberate as it is the slow, cheap option for disks, with only 150 IOPS max and a ~60MB/s max read speed. The idea was to use cheap local/SAN disk because, as we said, disk isn't cheap (unless it's slow like this stuff).

In these tests the query I chose, whilst a little contrived, is typical of the sorts of queries customer analysts run. In short - for each car, give it a greenness score based on its engine size and whether or not it's a diesel, then sum the total number of miles it has done. As each MOT records the car's total mileage, the calculation is wholly worthless - you count miles from MOTs in previous years multiple times. Don't get hung up on the query itself - the point is that it looks at all the data, calculates something from each record, then adds the results up and groups them.

Query summary

Choose a band from the following table based on Engine Capacity (cc)


Capacity
Band
 Less than 1000
1
1000-1199
2
1200-1599
3
1600-1999
4
2000-2499
5
2500-3499
6
More then 3500
7

If the Engine is Diesel – multiply by 1.5 – a 'Dirty' factor.
Now count how many vehicles in each band and the total number of miles.

In MongoDB's aggregation pipeline this is written as:

isover3500 = { $cond : [ { $gt : [ '$e',3500 ] }, 7, 6 ] }
isover2500 = { $cond : [ { $gt : [ '$e',2500 ] }, isover3500, 5 ] }
isover2000 = { $cond : [ { $gt : [ '$e',2000 ] }, isover2500, 4 ] }
isover1600 = { $cond : [ { $gt : [ '$e',1600 ] }, isover2000, 3 ] }
isover1200 = { $cond : [ { $gt : [ '$e',1200 ] }, isover1600, 2 ] }
getband = { $cond : [ { $gt : [ '$e',1000 ] }, isover1200, 1 ] }

isdirty = { $cond : [ { $eq : [ '$f' ,'D']}, 1.5 , 1 ]}
dirtypenalty = { $multiply : [ getband, isdirty ]}
banding = { $project : { b : dirtypenalty , m : 1}}
group = { $group : { _id : '$b' , count : { $sum : 1} , miles : { $sum : '$m' } } }
sort = {$sort:{_id:1}}
use vosa
db.mot_results.aggregate([banding,group,sort])


Environment Summary

Data

Records: 580,650,946
Volume (CSV) : 56 GB
Volume (BSON): 88.9GB
Typical Record:

Hardware

Host: Amazon EC2
CPUs: 16
RAM: 30GB
DISK: 512GB Spinning, non optimised SAN.

Software

MongoDB Enterprise 3.0.4

My Voyage of Discovery.

I started by simply running MongoDB with default parameters and mmap_v1 storage, to see what most folks' 'typical' results might be. Loading the data took me 120 minutes, and by the time I had finished I was using 170GB of disk space between indexes and BSON.

I then ran the query above a few times and got the following.

Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
Mmap         -                             169               120                48:47                2          1         3         95


49 minutes is a long time to wait for a query, and 169GB is a lot of disk space. Of course mmap isn't known for being frugal – one thing that mmap does by default is allocate the next 'power of two' storage size for each record. This results in an average 50% overhead but aids in avoiding fragmentation and re-allocating space. It also means that growing records only need to move on disk a minimal number of times.
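A quick back-of-the-envelope in the shell to illustrate the padding (both record sizes below are made up):

// powerOfTwo allocation: each record gets the next power-of-two sized slot
function nextPowerOfTwo(n) { return Math.pow(2, Math.ceil(Math.log(n) / Math.LN2)); }
[112, 140].forEach(function (size) {                 // two hypothetical record sizes, in bytes
    var allocated = nextPowerOfTwo(size);            // 128 and 256 respectively
    print(size + " byte record -> " + allocated + " byte slot (" +
          Math.round(100 * (allocated - size) / size) + "% padding)");
});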

In this case, though, I am slightly bounded by IO - there is some I/O wait in there - so maybe I can speed things up if I make the data smaller. These are tiny records, so the extra padding is being read from disk; it's not as if those blocks can be skipped when the disk reads a minimum of 8K at a time.

Therefore I turned off powerOfTwo sizing for the collection - technically noPadding=true in MongoDB 3.0 (a sketch of the command follows) - and got the results in the table below.
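For reference, this is roughly how you flip that switch on an existing MMAPv1 collection in 3.0 - a sketch, not necessarily exactly what my loader does:

// Disable power-of-two record allocation for the mot_results collection
db.runCommand({ collMod: "mot_results", noPadding: true })
// or create the collection without padding in the first place:
// db.createCollection("mot_results", { noPadding: true })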

Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
Mmap         noPadding                     127               105                38:17                3          0         3         95


Well there is a win there – 42GB less disk (about 25%) and consequent 25% improvement in speed. Of course if I were deleting records or editing them then I would have some issues but for my analytics platform this may be OK.

Aggregation pipelines are single-threaded per shard in MongoDB, so if I wanted to bring some more CPU to the party I needed to shard - not that this worries me: as long as my shard key is nicely distributed I should be able to parallelise the task. Therefore, still using mmap, I created some micro-shards all running on the same server to see how that helped (a sketch of the setup follows). I can of course add more servers too to keep scaling out.
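The shard setup itself is nothing exotic - several mongod instances on one box behind a mongos, and a well-distributed shard key. A sketch of the collection side of it (a hashed _id is one easy way to get even distribution; I'm not claiming it's exactly what my scripts do):

// Run against the mongos of the micro-sharded cluster
sh.enableSharding("vosa")
sh.shardCollection("vosa.mot_results", { _id: "hashed" })
sh.status()   // confirm chunks are spread across the shards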

Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
Mmap         noPadding, 2 shards           127               89                 25:44                4          2         4         90
Mmap         noPadding, 4 shards           127               90                 25:45                4          1         10        85

I can see I get a serious improvement by adding shards and bringing in more CPUs, but only up to two shards - at four I get no improvement. Why? Well, the clue is right there in the IO Wait column - that's me maxing out my disk IO, which isn't amazing, these being slow, cheap disks holding a lot of data.

This is where I can bring Wired Tiger into the mix – if I compress the data on disk, then I can read it faster at the expense of more CPU. So I rebuilt using WiredTiger and zlib compression to see what that would look like.
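Switching engines means a rebuild - reloading the data into mongod processes started with the WiredTiger engine and zlib block compression. A sketch of the relevant bits (option names as of MongoDB 3.0; treat the exact spellings as something to check against your version's docs):

// mongod started with, roughly:
//   mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib ...
// or, per collection, by passing a WiredTiger config string at creation time:
db.createCollection("mot_results", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})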

Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
WiredTiger   zlib                          28.5              88                 23:12                6          0         2         74

That's an improvement - not much, but then again I'm back to one CPU and that's my limiter. At this point, though, I noticed that my data is smaller than my RAM! The actual table data is only 24GB of the 30GB I have, so will it all fit in RAM as it did with mmap?

The answer is yes - and no. Unlike mmap, Wired Tiger doesn't work on the data in its native on-disk form; it uncompresses it into its own cache - and that cache is NOT compressed, it's plain old BSON, so it's large again. That said, the OS should still cache the disk files themselves. I saw that the IO wait was 2 on every run - what was going on? Shouldn't it be zero even if CPU was limiting me? iostat showed 12 MB/s coming from disk.

Then I remembered that Wired Tiger takes 50% of RAM for its uncompressed cache by default - I could turn that down to 1GB and, sure enough, now we were managing to run without touching the disk - now we can crank things right up.
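Turning the cache down is just a startup option - a sketch (flag name as of 3.0), plus one way to check what you actually got:

// mongod started with, roughly:
//   mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 ...
// then confirm the configured cache size from the shell:
db.serverStatus().wiredTiger.cache["maximum bytes configured"]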

I therefore tried a couple of shards - first clearing the OS page cache - and the results looked good: monitoring with iostat showed that after the first run no disk IO was happening at all. I got the following.


Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
WiredTiger   zlib, 2 shards, 1GB cache     28.5              68                 15:31                13         0         0         88

So that's looking good. Unfortunately, when I went to four shards I was being throttled by disk IO again - the four times 1GB of cache didn't leave me enough RAM to cache my data, and the MongoDB option didn't go lower than 1GB.

At this point I went to look at the source - and I found an unsupported answer: you CAN tune the cache much lower using the engineConfig cache_size parameter, although at that point support from MongoDB gets a little more negotiated. I kept going, loading in the data and then setting the cache to 2MB (yes, megabytes) - this gave each mongod instance a nice tiny footprint and let me crank the shards all the way up to 16.
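It looks something like this - a sketch only, and the exact flag spelling is an assumption, since this is the sort of under-documented knob that can change between versions:

// Passing a raw WiredTiger config string at startup (assumed flag spelling):
//   mongod --storageEngine wiredTiger --wiredTigerEngineConfigString "cache_size=2MB" ...
// then confirm from the shell that the cache really is tiny:
db.serverStatus().wiredTiger.cache["maximum bytes configured"]   // should now be ~2MB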

Engine       Options                       Disk Usage (GB)   Load Time (mins)   Query Time (mm:ss)   User CPU   Sys CPU   IO Wait   Idle CPU
WiredTiger   zlib, 2MB cache, 4 shards     -                 68                 8:32                 25         0         0         75
WiredTiger   zlib, 2MB cache, 8 shards     -                 67                 4:36                 50         0         0         50
WiredTiger   zlib, 2MB cache, 16 shards    28.5              66                 3:35                 100        0         0         0

I was disappointed that the final jump to 16 shards didn't halve it again - I'll probably investigate that more at some point; I suspect it's because I'm using some cores to decompress the data. But reducing an ad-hoc data processing job from over 48 minutes to 3.5 is a real case for using Wired Tiger in your analytics cluster. Conversely, you can buy about 80% less RAM and avoid expensive disks - that should answer my client's question.

If you want to try this at home, I threw (literally) all my materials into http://github.com/johnlpage/BigAggBlog - feel free to use those to get yourself up and running. Let me know if you want to try this seriously on your own data and I may be able to lend a hand.

Anyone want to save me the effort and create a similar single-server, ad-hoc query/analysis setup in Hadoop, Spark, Cassandra or Postgres? I'd do it, but I'm sure many of you could do it far better and I'd love to see a like-for-like comparison. As Mark Callaghan wisely observed, the value of a 'benchmark' is 1/numberOfSystemsTested - and no, I don't expect MongoDB to be the fastest analytic platform out there - that's SAS LASR! I'd like to know how far off the pace MongoDB is, though.