Friday, July 3, 2015

How WiredTiger opens up the option of MongoDB for Big Data analytics.


Introduction

Using the WiredTiger storage engine for MongoDB brings big improvements in concurrency, although, as I blogged last time, not in every case – you need to understand how to take advantage of it.

However, data compression in WiredTiger is another real win, given how expensive enterprise disk is – and high-performance cloud storage and SSDs are not far behind.

What, though, if you want to use MongoDB as a data analytics platform? What if you have a lot of data – I'm going to avoid calling it 'big data' because then everyone just discusses what 'big' means – data that's arriving quickly and has an ad-hoc schema?

It turns out that WiredTiger has a great solution to that.

Let's set out a real-world example – a number of companies, including Amadeus, put data into MongoDB for ad-hoc analysis. The requirement is that a smallish number of users can perform any query or aggregation they want, or pull the data into something else for further analysis.

To do this well they either have enough RAM to keep all the data in memory, or they have Fusion-io or other ultra-fast flash storage, or both.

That works, and one of them asked me what the implications of WiredTiger were for them – so for the last few months I've been working that out. The answer: it saves you a fortune in hardware.

I have also made sure that without too much difficulty you can repeat the results I get here – or at least the first and final results. Details at the end…

In short – the challenge is to take the MOT (vehicle inspection) reports from http://data.gov.uk, load them in, and then be able to perform any ad-hoc query and grouping over them. No pre-calculation, pre-aggregation or even indexes are allowed – after all, you may want to perform ANY computation. There are 290 million reports, so let's load them twice to ensure we have at least half a billion.

Each report is simple, tabular, fixed data – exactly what MongoDB is NOT best at. In a later blog we can compare and contrast the document model with an RDBMS as we pull in the more detailed defect reports; for now let's keep it simple.

There are 28 GB of tab-delimited MOT files on data.gov.uk, so loaded twice that's 56 GB of raw data – MongoDB will extend it with field names, BSON types and so on, so in the simplest configuration it gets larger still.

To save time as I tried multiple configurations I wrote a custom loader in C – this allows me to load the data in parallel into a sharded cluster – all the code for that is available too.


The tests were on an AWS c4.4xlarge instance with 30GB of RAM and 16 CPUs – a typical 'enterprise' machine, perhaps a little light on RAM.

All data was on EBS – non-PIOPS, spinning disk. This was deliberate: it is the slow, cheap option, with only 150 IOPS max and a ~60MB/s maximum read speed. The idea was to use cheap local/SAN-class disk because, as we said, disk isn't cheap (unless it's slow like this).

In these tests the query I chose, whilst a little contrived, is typical of the sort of query customer analysts run. In short: for each car, give it a 'greenness' score based on its engine size and whether or not it's a diesel, then sum the total number of miles it has done. As each MOT records the car's total mileage, the calculation is wholly worthless – you count miles from MOTs in previous years multiple times. Don't get hung up on the query itself – the point is that it looks at all the data, calculates something from each record, and then groups and sums the results.

Query summary

Choose a band from the following table based on Engine Capacity (cc)


Capacity        | Band
Less than 1000  | 1
1000-1199       | 2
1200-1599       | 3
1600-1999       | 4
2000-2499       | 5
2500-3499       | 6
More than 3500  | 7

If the engine is diesel, multiply the band by 1.5 – a 'dirty' factor.
Now count how many vehicles fall in each band and sum the total number of miles.

In MongoDB's aggregation pipeline this is written as:

// Band the engine capacity ('e', in cc) using nested $cond expressions
isover3500 = { $cond : [ { $gt : [ '$e',3500 ] }, 7, 6 ] }
isover2500 = { $cond : [ { $gt : [ '$e',2500 ] }, isover3500, 5 ] }
isover2000 = { $cond : [ { $gt : [ '$e',2000 ] }, isover2500, 4 ] }
isover1600 = { $cond : [ { $gt : [ '$e',1600 ] }, isover2000, 3 ] }
isover1200 = { $cond : [ { $gt : [ '$e',1200 ] }, isover1600, 2 ] }
getband = { $cond : [ { $gt : [ '$e',1000 ] }, isover1200, 1 ] }

// Diesel ('f' == 'D') attracts the 1.5x 'dirty' multiplier
isdirty = { $cond : [ { $eq : [ '$f' ,'D']}, 1.5 , 1 ]}
dirtypenalty = { $multiply : [ getband, isdirty ]}

// Project the band (b) and mileage (m), then count vehicles and sum miles per band
banding = { $project : { b : dirtypenalty , m : 1}}
group = { $group : { _id : '$b' , count : { $sum : 1} , miles : { $sum : '$m' } } }
sort = {$sort:{_id:1}}

use vosa
db.mot_results.aggregate([banding,group,sort])


Environment Summary

Data

Records: 580,650,946
Volume (raw TSV): 56 GB
Volume (BSON): 88.9GB
Typical Record:
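Based on the field names the aggregation above relies on – 'e' for engine capacity in cc, 'f' for fuel type and 'm' for recorded mileage – a stored document looks broadly like this. The values are invented purely for illustration, and the real documents also carry the other MOT columns from the source files:

// Illustrative shape only – values are made up and any field beyond
// 'e', 'f' and 'm' is an assumption rather than the exact schema used
{
    "_id" : ObjectId(),    // assigned automatically
    "e"   : 1598,          // engine capacity (cc)
    "f"   : "D",           // fuel type ('D' = diesel)
    "m"   : 87450          // recorded mileage at the time of the test
}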

Hardware

Host: Amazon EC2
CPUs: 16
RAM: 30GB
DISK: 512GB Spinning, non optimised SAN.

Software

MongoDB Enterprise 3.0.4

My Voyage of Discovery.

I started by simply running MongoDB with default parameters and MMAPv1 storage to see what most folks' 'typical' results might be. Loading the data took 120 minutes, and by the time it had finished I was using 170GB of disk space between indexes and BSON.
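If you're repeating this, those figures are easy to check from the mongo shell; a quick sketch (the scale argument simply reports the sizes in GB):

// Report collection sizes, scaled to GB
use vosa
db.mot_results.stats(1024 * 1024 * 1024)
// size           – uncompressed BSON data size
// storageSize    – space the collection occupies on disk
// totalIndexSize – space used by the _id (and any other) indexes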

I then ran the query above a few times and got the following results.

Engine  | Options  | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
MMAPv1  | defaults | 169             | 120              | 48:47              | 2        | 1       | 3       | 95

49 minutes is a long time to wait for a query, and 169GB is a lot of disk space. Of course MMAPv1 isn't known for being frugal – one thing it does by default is allocate the next power-of-two storage size for each record (a 300-byte record, for example, gets a 512-byte slot). This results in roughly 50% overhead on average, but it helps avoid fragmentation and the cost of re-allocating space, and it means growing records only need to move on disk a minimal number of times.

In this case, though, I am slightly bounded by IO – there is some I/O wait in there – so maybe I can speed things up by making the data smaller. These are tiny records, so the padding is being read from disk along with them – it's not as if those blocks can be skipped when the disk reads a minimum of 8K at a time.

Therefore I turned off powerOfTwo sizing for the collection – technically, noPadding=true in MongoDB 3.0 – and got the following.

Engine  | Options   | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
MMAPv1  | noPadding | 127             | 105              | 38:17              | 3        | 0       | 3       | 95

Well, there's a win there – 42GB less disk (about 25%) and a consequent ~25% improvement in speed. Of course, if I were deleting or editing records I would have some issues, but for my analytics platform this may be OK.
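For reference, padding is a per-collection setting; a sketch of the two ways to turn it off, using the collection name from above:

// Create the collection without power-of-two padding from the start...
db.createCollection("mot_results", { noPadding: true })

// ...or switch an existing collection over (only affects records written afterwards)
db.runCommand({ collMod: "mot_results", noPadding: true })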

Aggregation pipelines are single-threaded per shard in MongoDB, so if I wanted to bring some more CPU to the party I needed to shard. That doesn't worry me – as long as my shard key is nicely distributed I should be able to parallelise the task. So, still using MMAPv1, I created some 'micro-shards' all running on the same server to see how that helped. I could of course add more servers too, to keep scaling out.
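The exact shard key doesn't matter much here as long as it spreads the documents evenly; a hashed _id is one easy way to get that, so the setup is roughly this (the hashed _id key is my assumption, not necessarily what the original cluster used):

// Each 'micro-shard' is just another mongod on the same box behind one mongos;
// a hashed _id spreads the documents evenly without needing a natural key
sh.enableSharding("vosa")
sh.shardCollection("vosa.mot_results", { _id : "hashed" })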

Engine  | Options              | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
MMAPv1  | noPadding, 2 shards  | 127             | 89               | 25:44              | 4        | 2       | 4       | 90
MMAPv1  | noPadding, 4 shards  | 127             | 90               | 25:45              | 4        | 1       | 10      | 85

Now I see a serious improvement from adding shards and bringing in more CPUs – but only up to two shards; at four I get no further improvement. Why? The clue is right there in the IO Wait column – I'm maxing out my disk IO, which is hardly surprising given these are slow, cheap disks serving a lot of data.

This is where I can bring WiredTiger into the mix – if I compress the data on disk, I can read it faster at the expense of more CPU. So I rebuilt using WiredTiger and zlib compression to see what that would look like.
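zlib isn't WiredTiger's default block compressor (snappy is), so it has to be asked for. One way is per collection at creation time – a sketch, assuming the same collection name; the alternative is simply to start every mongod with zlib as its default collection compressor:

// Collection created with zlib block compression
// (mongod itself started with --storageEngine wiredTiger)
db.createCollection("mot_results", {
    storageEngine : { wiredTiger : { configString : "block_compressor=zlib" } }
})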

Engine      | Options | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger  | zlib    | 28.5            | 88               | 23:12              | 6        | 0       | 2       | 74

That's an improvement – not a huge one, but then again I'm back to a single shard and one CPU is my limiter. At this point, though, I noticed that my data is now smaller than my RAM! The actual table data is only 24GB of the 30GB I have – so will it all fit in RAM, as it would with MMAPv1?

The answer is yes – and no. Unlike MMAPv1, WiredTiger doesn't work on the data in its native form on disk; it uncompresses it into its own cache – and that cache is NOT compressed, it's plain old BSON, so it's large again. That said, the OS should still cache the compressed disk files themselves. Yet I saw an IO wait of 2 on every run – what was going on? Shouldn't it be zero, even if CPU was the limit? iostat showed 12 MB/s coming from disk.
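You can watch this from the shell – the WiredTiger section of serverStatus reports how full the (uncompressed) cache is compared with its configured ceiling; a rough sketch:

// How much of the WiredTiger cache is in use, versus its configured maximum
var c = db.serverStatus().wiredTiger.cache
print("cache used (GB): " + c["bytes currently in the cache"] / (1024*1024*1024))
print("cache max  (GB): " + c["maximum bytes configured"] / (1024*1024*1024))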

Then I remembered that WiredTiger takes 50% of RAM for its uncompressed cache by default. I could turn that down to 1GB, and sure enough we were now managing to run without touching the disk – time to crank it right up.
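The cache size is normally set when mongod starts rather than from the shell; a sketch of what that looks like, with the 1GB figure from above:

// mongod (re)started along the lines of:
//   mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 1 ...
// which can be confirmed afterwards from the shell:
db.adminCommand({ getCmdLineOpts: 1 }).parsed.storage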

I therefore tried a couple of shards – first clearing the OS page cache – and the results looked good: monitoring with iostat showed that after the first run no disk IO was happening at all. I got the following.


Engine      | Options                    | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger  | zlib, 2 shards, 1GB cache  | 28.5            | 68               | 15:31              | 13       | 0       | 0       | 88

So that's looking good. Unfortunately, when I went to four shards I was being throttled by disk IO again – four lots of 1GB WiredTiger cache didn't leave the OS enough free RAM to cache my ~28.5GB of compressed data files – and the supported MongoDB option doesn't go lower than 1GB.

At this point I went to look at the source – and found an unsupported answer: you CAN tune the cache much lower by passing the cache_size parameter through the WiredTiger engine config string, although at that point support from MongoDB gets a little more negotiated. I kept going down that route, loading the data and then setting the cache to 2MB (yes, megabytes) – this gave each mongod instance a nice tiny footprint and let me crank the shards all the way up to 16.
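For completeness – and with the support caveat above – this is roughly what that unsupported tweak looks like; the 2MB figure is the one from the text:

// Unsupported: push the WiredTiger cache right down via the raw engine config
// string when starting each micro-shard's mongod, e.g.
//   mongod --storageEngine wiredTiger --wiredTigerEngineConfigString "cache_size=2MB" ...
// then check what WiredTiger actually allocated:
db.serverStatus().wiredTiger.cache["maximum bytes configured"]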

Engine      | Options                      | Disk Usage (GB) | Load Time (mins) | Query Time (mm:ss) | User CPU | Sys CPU | IO Wait | Idle CPU
WiredTiger  | zlib, 2MB cache, 4 shards    | –               | 68               | 8:32               | 25       | 0       | 0       | 75
WiredTiger  | zlib, 2MB cache, 8 shards    | –               | 67               | 4:36               | 50       | 0       | 0       | 50
WiredTiger  | zlib, 2MB cache, 16 shards   | 28.5            | 66               | 3:35               | 100      | 0       | 0       | 0

I was a little disappointed that the final jump to 16 shards didn't halve the time again – I'll probably investigate that more at some point; I suspect it's because some cores are now busy decompressing data. But reducing an ad-hoc data processing job from over 48 minutes to 3.5 is a real case for using WiredTiger in your analytics cluster. Put another way – you can buy about 80% less RAM and avoid expensive disks. That should answer my client's question.

If you want to try this at home, I threw (literally) all my materials into http://github.com/johnlpage/BigAggBlog – feel free to use them to get yourself up and running. Let me know if you want to try this seriously on your own data and I may be able to lend a hand.

Anyone want to save me the effort and create a similar single-server, ad-hoc query/analysis setup in Hadoop, Spark, Cassandra or Postgres? I'd do it myself, but I'm sure many of you could do it far better and I'd love to see a like-for-like comparison. As Mark Callaghan wisely observed, the value of a 'benchmark' is worth 1/numberOfSystemsTested – and no, I don't expect MongoDB to be the fastest analytics platform out there – that's SAS LASR! I'd like to know how far off the pace MongoDB is, though.