Thursday, March 19, 2015

Wired Tiger - how to reduce your MongoDB hosting costs 10x

Forgive me father, it’s been a long time since I last blogged  - I've been giving in to temptation and getting to grips with MongoDB 3.0 and Wired Tiger Storage (WT)  - and I've learned some things I'd like to share.

Much has been said about WT's ingestion speed and the fine-grained concurrency that lets it rocket through mixed workloads. Playing with the POCLoader ( http://github.com/johnlpage/POCDriver ) I've seen some incredible throughput - occasionally into seven figures of operations per second on a single server. However, I'm leaving the speed posts to others - I want to talk about compression.

The Pledge


WT compresses data on disk. As a headline it uses snappy or zlib compression and officially gets 7-10 times compression - in practice I've seen higher. That's great: it reduces the disk space your documents take at the cost of decompressing them when you fetch them - a cost which is nominal with snappy and not much higher with zlib.
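If you want to check the ratio on your own data, the collection stats show both the uncompressed BSON size and the on-disk size - here using the POCCOLL collection from later in this post as the example:

// Compare uncompressed BSON size with on-disk (block compressed) size, scaled to MB
var s = db.POCCOLL.stats(1024 * 1024)
print("BSON data size (MB): " + s.size)
print("Storage size (MB):   " + s.storageSize)
print("Approx compression:  " + (s.size / s.storageSize).toFixed(1) + "x")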

What's more interesting is index compression. Indexes, unlike documents, normally need to be accessed in a fairly random manner, so database performance rule number one is: if your workload is reasonably heavy, ALL INDEXES SHOULD BE HELD IN RAM - IF THEY AREN'T, ADD MORE RAM OR SHARDS WITH MORE RAM. In fact the number one reason for adding shards to MongoDB is that it's cheaper to add a whole new server with X RAM than to take an existing server to 2X (have you seen the price of a TB of RAM?!).
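As an aside, the shell will tell you how much RAM that rule is going to cost you for any given collection:

// Total size of all indexes on the collection, scaled to MB
db.POCCOLL.totalIndexSize() / (1024 * 1024)

// Per-index breakdown, also scaled to MB
db.POCCOLL.stats(1024 * 1024).indexSizes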

This concerned me! The way WT compression works is that there is an uncompressed cache: disk blocks are decompressed and temporarily held there. If a block isn't in the WT cache then the filesystem cache is checked (as it still is with the MongoDB MMAPv1 engine), and if it isn't there either it's a read from disk. The uncompressed WT cache is by default 50% of the RAM on the server.

So my thinking was: if I have a 2GB index that decompresses to 14GB, then to keep it in RAM I need 28GB of RAM on the server with the default cache setting - otherwise it's going to be constantly read from disk and decompressed. Of course I could turn the cache up to 15GB, but even then I would still need as much RAM, and as many shards, as I always did.
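(For reference, the WT cache size is set at startup - the 15GB below is just the hypothetical figure from my example, and the dbpath is a placeholder.)

# Start mongod with Wired Tiger and a 15GB uncompressed cache
mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 15 --dbpath /data/wt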

When I set out to investigate this I was astonished and delighted by what I found.

Firstly, WIRED TIGER INDEXES ARE NOT COMPRESSED USING ZLIB OR SNAPPY. I knew WT used something called index prefix compression, but I thought that was in addition to zlib or snappy - it turns out indexes use only index prefix compression.

Index prefix compression, however, is a domain-specific way of compressing the data - with a hint of old-school run-length encoding - that gets incredible compression on the indexes and does not require you to decompress them to use them. So you get compressed indexes that need only a compressed amount of RAM to hold them, reducing the footprint overall.

Here's a quick and simple explanation of index prefix compression - as I understand it.

Imagine you have 50 records in your database, each with a gender field of Male or Female. The index - which is a tree - if flattened out and walked left to right would contain the following data:

Female:1, Female:4, Female:9, ... Female:50, Male:2, Male:3, ... Male:49

The value of the field is repeated throughout the index; the number is the location of the record (which WT also holds in a B-tree - for MMAP it's a disk location).

With WT index prefix compression, each key appears only once per disk block, so that index block looks like this:

Female: 1,4,9...,50,Male: 2,3,...,49

This can be a massive saving - blocks are, I believe, 4KB - so this alone can give really, really good compression.

Even better, this is prefix compression: if a previous key in the block is a prefix of the current key, the current key can simply refer back to it.

Ann:1, Ann:4, Ann:9, ... Ann:50, Annabelle:2, Annabelle:3, ... Annabelle:49

becomes

Ann: 1,4,9...,50,*abelle: 2,3,...,49

(These are simplified examples to give you the idea)

This gives even better compression in some circumstances - remembering that the data in the tree is fundamentally sorted anyway.
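To make that concrete, here's a toy JavaScript sketch of the idea - not WT's actual on-disk format, just an illustration of writing each key as a reuse-count plus the suffix that differs from the previous key:

// Toy illustration: "compress" a sorted list of (key, recordId) index entries
// by storing, for each key, how much of the previous key it shares plus the new suffix.
function prefixCompress(entries) {          // entries are already sorted by key
  var out = [];
  var prevKey = "";
  entries.forEach(function (e) {
    var common = 0;
    while (common < prevKey.length &&
           common < e.key.length &&
           prevKey[common] === e.key[common]) {
      common++;
    }
    out.push({ reuse: common, suffix: e.key.slice(common), id: e.id });
    prevKey = e.key;
  });
  return out;
}

prefixCompress([
  { key: "Ann", id: 1 }, { key: "Ann", id: 4 },
  { key: "Annabelle", id: 2 }, { key: "Annabelle", id: 3 }
]);
// [ {reuse:0, suffix:"Ann", id:1}, {reuse:3, suffix:"", id:4},
//   {reuse:3, suffix:"abelle", id:2}, {reuse:9, suffix:"", id:3} ]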

The Turn


Armed with an understanding of how the index was stored - and that it didn't inflate to something huge in the cache - I set out to measure what size the indexes actually were, and that got even more interesting.

I generated a sample set of data using POCLoader. I took the defaults and asked for 5 secondary indexes using the -x flag, then exported the collection with mongoexport so I could reload it into different setups.
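For anyone wanting to reproduce this, the commands looked roughly like the following - the jar name, database name and output file are my own examples rather than anything significant:

# Generate test data, with 5 extra secondary indexes via POCDriver's -x flag
java -jar POCDriver.jar -x 5

# Export the generated collection so it can be reloaded into each setup
mongoexport -d POCDB -c POCCOLL -o POCCOLL.json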

Each record looked something like this:

> db.POCCOLL.findOne()
{
    _id : {
        w : 0,
        i : 0
    },
    fld0 : NumberLong(648744),
    fld1 : ISODate("2013-06-12T18:21:35.611Z"),
    fld2 : "ipsum dolor sit amet. Lorem ip",
    fld3 : "tempor invidunt ut labore etd",
    fld4 : "sadipscing elitr, sed diam non",
    fld5 : ISODate("2004-06-10T17:36:08.667Z"),
    fld6 : NumberLong(781557),
    fld7 : "no sea takimata sanctus est Lo",
    fld8 : "diam nonumy eirmod tempor invi",
    fld9 : NumberLong(203632)
}

I then created the indexes by hand and loaded the export with mongoimport into four setups: MMAP, Wired Tiger with snappy and prefix compression, Wired Tiger with zlib and prefix compression, and Wired Tiger with zlib and no index prefix compression.
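For reference, each storage setup was started with options along these lines - the paths are placeholders and the option names are the MongoDB 3.0 ones as I remember them, so double-check against your version:

# MMAPv1 (the 3.0 default)
mongod --dbpath /data/mmap

# Wired Tiger with snappy block compression and prefix-compressed indexes (the WT defaults)
mongod --storageEngine wiredTiger --dbpath /data/wt-snappy

# Wired Tiger with zlib block compression and prefix-compressed indexes
mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib --dbpath /data/wt-zlib

# Wired Tiger with zlib block compression and index prefix compression turned off
mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib --wiredTigerIndexPrefixCompression false --dbpath /data/wt-zlib-noprefix

With the data and indexes in place, I got the following results: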


                      MMAP       WT-Snappy-Prefix   WT-Zlib-Prefix   WT-Zlib-NoPrefix
Number of Records     8,584,192  8,584,192          8,584,192        8,584,192
JSON Size (MB)        3,374      3,374              3,374            3,374
Record Storage (MB)   5,052      988                430              427
_id Index Size (MB)   606        127                121              220
fld0 Index Size (MB)  285        161                168              222
fld1 Index Size (MB)  287        168                176              257
fld2 Index Size (MB)  766        32                 32               316
fld3 Index Size (MB)  742        32                 32               316
fld4 Index Size (MB)  782        32                 32               316
Total Index (MB)      3,468      550                562              1,647

This told me a number of interesting things.

Firstly, the MMAP BSON data was larger than the original JSON - this was a surprise; however, in MongoDB 2.6 and higher the default is to store data with padding to allow for in-place growth. You can turn that off to get approximately a 30% improvement in disk utilisation if your data doesn't get edited or deleted. The MMAP indexes, at 3.4GB, are pretty large - although we are indexing a lot of small fields, so it's not inappropriate.
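If your documents really are write-once, I believe the 3.0 shell syntax for switching the padding off per collection is along these lines (check it against your version):

// Create an MMAPv1 collection without padding (only sensible for write-once data)
db.createCollection("POCCOLL", { noPadding: true })

// Or turn it off on an existing collection
db.runCommand({ collMod: "POCCOLL", noPadding: true })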

This also confirmed that WT-Snappy gets more than 5x compression and WT-Zlib more than 10x compared with the MMAP storage size, on some fairly ordinary records - awesome in its own right when we all pay for disk by the GB these days.

It also confirmed my assertion above - and why I did this in the first place - that the WT block compression algorithm does not fundamentally affect index size; indexes aren't block compressed.

What was awesome to see was that even without prefix compression, the storage of WT B-tree indexes was 50% better than MMAP - I have to assume that's down to the amount of free space in each index block and the fact that B-trees typically grow a layer at a time, so can make big jumps. Most of the indexes here - apart from _id - contain randomly ordered data of differing cardinality.

With prefix compression we are down to 550MB of indexes versus 3.5GB - and remember, that translates directly into the RAM requirement of the system. Just think how much hardware cost can be saved here!

The Prestige


I would have been happy to stop there - only I accidentally discovered something even more astonishing along the way. Remember I said I loaded the data into each setup with mongoimport, creating the indexes by hand first? Originally I used mongodump and mongorestore, and something happened that I didn't expect.

When you use mongorestore, or set up a new replica, or run reIndex() on a collection, the indexes are recreated using a 'bulk' method. It takes a write lock on the whole database for as long as it takes, so on a live system do this only on a new replica. The bulk method, however, does some sorting and ordering, and especially in conjunction with prefix compression and lower-cardinality fields it gives some real benefits.
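The rebuild itself is nothing exotic - on a test box it's just the following (remembering the warning about the write lock):

// Rebuild every index on POCCOLL using the bulk builder
// (takes a database-level write lock - do it on a new replica, not a busy primary)
db.POCCOLL.reIndex()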

With MMAP all the rebuild does is make better use of the disk blocks, especially where the insertion order was incrementing and therefore right-leaning. Still, 3.5GB down to 2.0GB is a nice improvement.


                      MMAP       MMAP-Rebuild
Number of Records     8,584,192  8,584,192
JSON Size (MB)        3,374      3,374
Record Storage (MB)   5,052      5,052
_id Index Size (MB)   606        424
fld0 Index Size (MB)  285        206
fld1 Index Size (MB)  287        206
fld2 Index Size (MB)  766        398
fld3 Index Size (MB)  742        398
fld4 Index Size (MB)  782        398
Total Index (MB)      3,468      2,031

With Wired Tiger and no index prefix compression it's not as dramatic - still around a 17% reduction though.


                      WT-Zlib-NoPrefix  WT-Zlib-NoPrefix-Rebuild
Number of Records     8,584,192         8,584,192
JSON Size (MB)        3,374             3,374
Record Storage (MB)   427               432
_id Index Size (MB)   220               179
fld0 Index Size (MB)  222               98
fld1 Index Size (MB)  257               131
fld2 Index Size (MB)  316               321
fld3 Index Size (MB)  316               321
fld4 Index Size (MB)  316               321
Total Index (MB)      1,647             1,373

With index prefix compression, however, the results are really dramatic - the rebuild gives a further 33% improvement, and what was 3.5GB of indexes in MMAP is now 376MB in Wired Tiger. That's worth the effort.


                      WT-Zlib-Prefix  WT-Zlib-reIndex
Number of Records     8,584,192       8,584,192
JSON Size (MB)        3,374           3,374
Record Storage (MB)   430             430
_id Index Size (MB)   121             101
fld0 Index Size (MB)  168             71
fld1 Index Size (MB)  176             87
fld2 Index Size (MB)  32              39
fld3 Index Size (MB)  32              39
fld4 Index Size (MB)  32              39
Total Index (MB)      562             376

Epilogue


So when choosing hardware for MongoDB - not only can you now squeeze 100% from your CPU cores in Wired Tiger - but with some planning you need only 10% of the RAM - and if that doesn't cut your hosting bill, nothing will.