Thursday, April 2, 2015

MongoDB Full Text Search just got 66% more usable.

Last time I blogged, I spoke about index compression. When you use the Wired Tiger storage engine in MongoDB 3.0 you get compressed indexes - and, better yet, indexes that don't need to be decompressed in RAM - they stay compressed and reduce the RAM footprint.

Why does this matter? Because for a database system to work well, you need enough RAM to hold your indexes.

MongoDB 2.4 added a beta Full Text Search (FTS) capability, and in 2.6 it became a GA release. Whilst it doesn't have all the bells and whistles of a dedicated FTS engine like Elasticsearch, it has enough to make it suitable for many use cases. MongoDB's thinking is that some people will find it easier and cheaper if they don't need to add third-party text indexing to their MongoDB cluster, along with all the associated extra hardware and management.

Unfortunately, MongoDB FTS in 2.6 had a small issue; to understand it you need to look at how FTS works.

When you specify one or more fields as being text searchable in MongoDB using collection.createIndex({ title: "text", article: "text" }), what the database server does, during each update or insert, is take the contents of those fields, parse them into words, stem them - reducing shopper and shopping to shop - and then treat the unique list of words as though it were an array in MongoDB, with one index entry for each.
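In the shell that looks something like this (the collection name here is just for illustration; $text/$search is the standard 2.6 query operator):

// build a single text index covering both fields
db.articles.createIndex({ title: "text", article: "text" })

// search it - the query term "shopping" is stemmed to "shop",
// just as the indexed words were
db.articles.find({ $text: { $search: "shopping" } })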

If you assume that each document is 30% unique words once stopwords are removed, that can be a lot of index entries, each with the word repeated AND an eight-byte record locator. MongoDB MMAP indexes are fairly simple and the word is repeated in the index for every document it appears in.

Now, as I explained last week, WiredTiger does NOT repeat a key in the index within a block - so for indexes like these, where the same word is repeated in many documents, and where the next word lexicographically may share some of the start of this one, index prefix compression is a fantastic solution.

So what does this mean in reality? To test, I loaded the English edition of Wikipedia into MongoDB and built a full text index on the title and the article text.

With MongoDB 2.6 the stats look like this:


{
    "ns" : "wikipedia.records",
    "count" : 4335341,
    "size" : 13774842288,
    "avgObjSize" : 3177,
    "numExtents" : 27,
    "storageSize" : 15124713408,      <- 15.1GB
    "nindexes" : 2,
    "totalIndexSize" : 16727360160,   <- 16.7GB
    "indexSizes" : {
        "_id_" : 123817344,
        "text_text_title_text" : 16603542816
    },
    "ok" : 1
}


With MongoDB 3.0 WiredTiger (Snappy) they look like this:

> db.records.stats()
{
    "ns" : "wikipedia.records",
    "count" : 4335341,
    "size" : 9526419045,
    "avgObjSize" : 2197,
    "storageSize" : 6302990336,       <- 6.3GB
    "nindexes" : 2,
    "totalIndexSize" : 4646367232,    <- 4.6GB
    "indexSizes" : {
        "_id_" : 45441024,
        "text_text_title_text" : 4600926208
    },
    "ok" : 1
}

That's huge: over 60% compression, and with an index smaller than the data - not larger. Suddenly MongoDB FTS looks like a much more viable option, and without all the additional setup or a separate indexing cluster. Unless you need the extra functionality of course - but do you? And have you really accepted that it's OK to have the text index update out of sync with the records?

Of course it's only fair to see how much space Elasticsearch takes for the same data. To do that I loaded the same data set into Elasticsearch, reading from MongoDB and inserting using the excellent Python APIs for both.


 curl localhost:9200/wikipedia/_stats | python -m json.tool

 "indices": {
        "wikipedia": {
            "primaries": {
                "completion": {
                    "size_in_bytes": 0
                },
                "docs": {
                    "count": 4335341,
                    "deleted": 0
                },
                "segments": {
                    "count": 134,
                    "index_writer_memory_in_bytes": 0,
                    "memory_in_bytes": 62915448,
                    "version_map_memory_in_bytes": 0
                },
                "store": {
                    "size_in_bytes": 14342148088, <-14GB
                    "throttle_time_in_millis": 946194
                },
    
            }

The answer ... the index in ES was 14GB, 3 times larger than the MongoDB index - and of course you still need to retain your original copy, since all you are getting back from your search is a key.

This is in no way a criticism of Elasticsearch or any other indexing technology (although none hold a candle to the original "Memex Information Engine"). Rich FTS is a category of data technology all of its own and specialists will, and should, always exist. I'm just pointing out that in the same way the music player in your phone can technically replace your home HiFi, converged search technology has come of age. Search is a commodity - it should just be there as a database feature.

Are you using MongoDB FTS? Did you consider and reject it already? I'd love to hear why in the comments.




Thursday, March 19, 2015

Wired Tiger - how to reduce your MongoDB hosting costs 10x

Forgive me father, it’s been a long time since I last blogged - I've been giving in to temptation and getting to grips with MongoDB 3.0 and Wired Tiger storage (WT) - and I've learned some things I'd like to share.

Much has been said about WT's ingestion speed and the fine-grained concurrency that allows it to rocket through mixed workloads. Playing with POCLoader ( http://github.com/johnlpage/POCDriver ) I've seen some incredible throughput - occasionally into seven figures of transactions per second per server. However I'm leaving all the speed posts to others - I want to talk about compression.

The Pledge


WT compresses data on disk. As a headline, it uses snappy or zlib compression and officially gets 7-10 times compression - in practice I've seen higher. That's great in that it reduces the disk space for your documents, at the cost of decompressing them when you fetch them - a cost which is nominal with snappy and not really very much more with zlib.
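If you want zlib rather than the default snappy, you can set the block compressor per collection when you create it - a quick sketch, with the collection name purely as an example:

// create a collection whose documents are block-compressed with zlib
// instead of the default snappy
db.createCollection("records", {
    storageEngine: { wiredTiger: { configString: "block_compressor=zlib" } }
})

// or set the default for new collections at startup:
//   mongod --storageEngine wiredTiger --wiredTigerCollectionBlockCompressor zlib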

What's more interesting is index compression. Indexes, unlike documents, normally need to be accessed in a fairly random manner, and therefore database performance rule number one is: if your workload is reasonably heavy, ALL INDEXES SHOULD BE HELD IN RAM - IF THEY AREN'T, ADD MORE RAM OR SHARDS WITH MORE RAM. In fact the number one reason for adding more shards to MongoDB is that it's cheaper to add a whole new server with X RAM than to increase the RAM in an existing server to 2X (have you seen the price of a TB of RAM?).
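A quick way to see what that rule means for your own system is to add up the index sizes in a database - a rough sketch using the standard shell helpers, nothing specific to my test data:

// total up the index sizes for every collection in the current database
var totalBytes = 0;
db.getCollectionNames().forEach(function (name) {
    totalBytes += db.getCollection(name).stats().totalIndexSize;
});
print("total index size (GB): " + (totalBytes / (1024 * 1024 * 1024)).toFixed(2));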

This concerned me! The way WT compression works is that there is an uncompressed cache: disk blocks are decompressed and temporarily held there. If a block isn't in the WT cache then the OS disk cache is checked (as it still is with the MongoDB MMAPv1 engine), and if it's not in there either it's a read from disk. The uncompressed WT cache is by default 50% of the RAM on the server.

So my thinking was: if I have a 2GB index that decompresses to 14GB, then to keep it in RAM I need 28GB of RAM on the server - otherwise it's going to be constantly read from disk and decompressed. Of course I could turn the cache up to 15GB, but even then I would still need as much RAM and as many shards as I always did.
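You can watch what the cache is actually doing from serverStatus() - a sketch; the stat names are the WiredTiger ones as I recall them, and the cache size itself is a startup option:

// how big is the WT cache, and how much of it is in use right now?
var c = db.serverStatus().wiredTiger.cache;
print("configured (GB): " + (c["maximum bytes configured"] / (1024 * 1024 * 1024)).toFixed(1));
print("in use (GB)    : " + (c["bytes currently in the cache"] / (1024 * 1024 * 1024)).toFixed(1));

// the cache size is set at startup, e.g.
//   mongod --storageEngine wiredTiger --wiredTigerCacheSizeGB 15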

When I set out to investigate this I was astonished and delighted by what I found.

Firstly, WIRED TIGER INDEXES ARE NOT COMPRESSED USING ZLIB OR SNAPPY. I knew WT used something called index prefix compression, but I thought that was in addition to zlib or snappy - it turns out it's only index prefix compression.

Index prefix compression, however, is a domain-specific way of compressing the data - with a hint of old-school run length encoding - that actually gets incredible compression on the indexes and does not require you to decompress them to use them. So you get compressed indexes that take a correspondingly compressed amount of RAM to hold, reducing the footprint overall.

Here's a quick and simple explanation of index prefix compression - as I understand it.

Imagine you have 50 records in your database, and each one has a gender field, Male or Female. The index - which is a tree - if flattened out and walked left to right would contain the following data:

Female: 1, Female: 4, Female: 9, ... Female:50, Male : 2, Male:3...Male:49

The value of the field is repeated throughout the index; the number is the location of the record (which WT holds in a B-Tree too; for MMAP it's a disk location).

When you use WT index prefix compression, each key appears only once per disk block - so that index block looks like this:

Female: 1,4,9...,50,Male: 2,3,...,49

This can be a massive saving - blocks are, I believe, 4KB - and so this can give really, really good compression.

Even better, we have index prefix compression - so if a previous key in the block is a prefix of this key, we can refer back to the previous one.

Ann: 1, Ann: 4, Ann: 9, ... Ann:50, Annabelle : 2, Annabelle:3... Annabelle:49

becomes

Ann: 1,4,9...,50,*abelle: 2,3,...,49

(These are simplified examples to give you the idea)

This gives even better compression in some circumstances - remembering that the data in the tree is fundamentally sorted anyway.
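To make that concrete, here's a toy sketch of the idea in JavaScript - this is not WiredTiger's actual on-disk format, just the principle of storing each key once per block and replacing a shared prefix with a reference back to the previous key:

function compressBlock(entries) {
    // entries: the sorted (key, record id) pairs that would land in one block
    var out = [];
    var prevKey = null;
    entries.forEach(function (e) {
        if (e.key === prevKey) {
            // same key as the previous entry: store only the record locator
            out[out.length - 1].ids.push(e.id);
            return;
        }
        // count how many leading characters this key shares with the previous one
        var shared = 0;
        while (prevKey !== null && shared < prevKey.length &&
               shared < e.key.length && e.key[shared] === prevKey[shared]) {
            shared++;
        }
        out.push({ sharedPrefix: shared, suffix: e.key.slice(shared), ids: [e.id] });
        prevKey = e.key;
    });
    return out;
}

// compressBlock([{key:"Ann",id:1},{key:"Ann",id:4},{key:"Annabelle",id:2}])
// -> [ { sharedPrefix: 0, suffix: "Ann",    ids: [1, 4] },
//      { sharedPrefix: 3, suffix: "abelle", ids: [2]    } ]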

The Turn


Armed with the understanding of how the index was stored, and that it didn't inflate to something huge in the cache, I set out to measure what size the indexes actually were - and that got even more interesting.

I generated a sample set of data using POCLoader - I took the defaults and asked for 5 secondary indexes using the -x flag. I then exported that with mongoexport to allow me to reload it into different setups.
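The numbers in the tables below come straight out of the collection stats - something like this sketch, where the 1024*1024 scale argument makes stats() report everything in MB:

var s = db.POCCOLL.stats(1024 * 1024);
print("records            : " + s.count);
print("record storage (MB): " + s.storageSize);
for (var idx in s.indexSizes) {
    print(idx + " (MB): " + s.indexSizes[idx]);
}
print("total index (MB)   : " + s.totalIndexSize);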

Each record looked something like this:

> db.POCCOLL.findOne()
{
    _id : {
        w : 0,
        i : 0
    },
    fld0 : NumberLong(648744),
    fld1 : ISODate("2013-06-12T18:21:35.611Z"),
    fld2 : "ipsum dolor sit amet. Lorem ip",
    fld3 : "tempor invidunt ut labore etd",
    fld4 : "sadipscing elitr, sed diam non",
    fld5 : ISODate("2004-06-10T17:36:08.667Z"),
    fld6 : NumberLong(781557),
    fld7 : "no sea takimata sanctus est Lo",
    fld8 : "diam nonumy eirmod tempor invi",
    fld9 : NumberLong(203632)
}

I then loaded it into MMAP, Wired Tiger with Snappy and prefix compression, Wired Tiger with Zlib and prefix compression, and Wired Tiger with Zlib and no index compression, and got the following results.


                        MMAP         WT-Snappy-Prefix   WT-Zlib-Prefix   WT-Zlib-NoPrefix
Number of Records       8,584,192    8,584,192          8,584,192        8,584,192
JSON Size (MB)          3,374        3,374              3,374            3,374
Record Storage (MB)     5,052        988                430              427
_id Index Size (MB)     606          127                121              220
fld0 Index Size (MB)    285          161                168              222
fld1 Index Size (MB)    287          168                176              257
fld2 Index Size (MB)    766          32                 32               316
fld3 Index Size (MB)    742          32                 32               316
fld4 Index Size (MB)    782          32                 32               316
Total Index (MB)        3,468        550                562              1,647

This told me a number of interesting things.

Firstly, the MMAP BSON data was larger than the original JSON - this was a surprise, however in MongoDB 2.6 and higher the default is to store data with padding to allow for in-place growth; you can turn that off to get approximately a 30% improvement in disk utilisation if your data doesn't get edited or deleted. The MMAP indexes, however, at 3.4GB are pretty large - although we are indexing a lot of small fields, so it's not inappropriate.
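(If your data really is write-once, turning the padding off is a collMod command - shown as a sketch here, with the collection name from this test; noPadding is the 3.0 name for the switch, usePowerOf2Sizes the older one.)

// MMAPv1 only: stop allocating padded record space for POCCOLL
db.runCommand({ collMod: "POCCOLL", noPadding: true })

// the 2.6 equivalent:
// db.runCommand({ collMod: "POCCOLL", usePowerOf2Sizes: false })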

This also confirmed that WT-Snappy gets more than 5x compression and WT-Zlib more than 10x on some reasonable records - that's awesome in its own right when we all pay for disk by the GB these days.

I also confirmed my assertion above - and the reason I did this in the first place - that the WT block compression algorithm does not fundamentally affect index size, so indexes aren't block compressed.

What was awesome to see was that even without prefix compression the storage of WT B-tree indexes was 50% better than MMAP - I have to assume that's down to the amount of free space in each index block and the fact that B-trees typically grow a layer at a time, so can make big jumps. Most of the indexes here - apart from _id - have randomly ordered data of differing cardinality.

With prefix compression we are down to indexes of 550MB versus 3.5GB - and remember that translates directly to the RAM requirement of the system. Just think how much hardware cost can be saved here!

The Prestige


I would have been happy there - only I accidentally discovered something even more astonishing along the way. Remember I said I loaded this into each setup using mongoimport, first manually creating the indexes? Originally I used mongodump and mongorestore, however something happened that was different to what I expected.

When you use mongorestore, or set up a new replica, or run reIndex() on a collection, the indexes are recreated using a 'Bulk' method. This takes a write lock on the whole DB for as long as it takes, so in a live system do this only on a new replica. The Bulk method, however, does some sorting and ordering, and especially in conjunction with prefix compression and lower-cardinality fields it gives some real benefits.
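In the shell the rebuild is a one-liner - with the caveat above about the write lock, so on a live system do it on a newly added replica instead:

// rebuild every index on the collection using the bulk index builder
db.POCCOLL.reIndex()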

With MMAP all the rebuild does is ensure better use of disk blocks, especially where the inserted keys were incrementing and the tree was therefore right-leaning. Still, 3.5GB down to 2.0GB is a nice improvement.


                        MMAP         MMAP-Rebuild
Number of Records       8,584,192    8,584,192
JSON Size (MB)          3,374        3,374
Record Storage (MB)     5,052        5,052
Index Size 1 (_id)      606          424
Index Size 2            285          206
Index Size 3            287          206
Index Size 4            766          398
Index Size 5            742          398
Index Size 6            782          398
Total Index (MB)        3,468        2,031

With Wired Tiger and no index prefix compression it's not as dramatic - nearly 20% though.


                        WT-Zlib-NoPrefix   WT-Zlib-NoPrefix-Rebuild
Number of Records       8,584,192          8,584,192
JSON Size (MB)          3,374              3,374
Record Storage (MB)     427                432
_id Index Size (MB)     220                179
fld0 Index Size (MB)    222                98
fld1 Index Size (MB)    257                131
fld2 Index Size (MB)    316                321
fld3 Index Size (MB)    316                321
fld4 Index Size (MB)    316                321
Total Index (MB)        1,647              1,373

With index prefix compression, however, the results are really dramatic - that's a 34% improvement - and what was a 3.5GB index in MMAP is now 376MB in Wired Tiger. That's worth the effort.


                        WT-Zlib-Prefix   WT-Zlib-reIndex
Number of Records       8,584,192        8,584,192
JSON Size (MB)          3,374            3,374
Record Storage (MB)     430              430
_id Index Size (MB)     121              101
fld0 Index Size (MB)    168              71
fld1 Index Size (MB)    176              87
fld2 Index Size (MB)    32               39
fld3 Index Size (MB)    32               39
fld4 Index Size (MB)    32               39
Total Index (MB)        562              376

Epilogue


So when choosing hardware for MongoDB - not only can you now squeeze 100% from your CPU cores in Wired Tiger - but with some planning you need only 10% of the RAM - and if that doesn't cut your hosting bill, nothing will.