Wednesday, October 11, 2017

Calculating Correlation inside MongoDB

I've been pondering recently the idea of a library of statistical and heuristic functions that run inside MongoDB using the aggregation Pipeline. After all if we can avoid pulling data out of the database that must help performance. As a little experiment, here is  the correlation co-efficient of two fields using Pearsons Rho. It's broken down into individual variables to make it easier to read rather than a huge piece of javascript. That's usually the best way to write pipelines.

//Pearsons Rho as a pipeline

testdata = [{x:1,y:2},
            {x:2,y:3},
            {x:3,y:6},
            {x:4,y:8}]
db=db.getSiblingDB("stats")
db.pearsons.drop();
db.pearsons.insertMany(testdata)

x = "$x"
y = "$y"

//This is a pipeline stage
sumcolumns = { $group : { _id: true,
             count: { $sum: 1 },
             sumx : { $sum : x},
             sumy : { $sum : y},
             sumxsquared : { $sum : { $multiply : [x,x] } },
             sumysquared : { $sum : { $multiply : [y,y] } },
             sumxy : { $sum : { $multiply : [x,y] } }
           }}

//This is building a pipeline stage from objects
multiply_sumx_sumy = { $multiply : [ "$sumx","$sumy"] }
multiply_sumxy_count = { $multiply : ["$sumxy","$count"]}
partone = { $subtract : [ multiply_sumxy_count, multiply_sumx_sumy ]}

multiply_sumxsquared_count = { $multiply : ["$sumxsquared","$count"]}
sumx_squared = { $multiply : ["$sumx","$sumx"]}
subparttwo = { $subtract : [ multiply_sumxsquared_count,sumx_squared  ]}


multiply_sumysquared_count = { $multiply : ["$sumysquared","$count"]}
sumy_squared = { $multiply : ["$sumy","$sumy"]}
subpartthree = { $subtract : [ multiply_sumysquared_count,sumy_squared  ]}

parttwo = { $sqrt : {$multiply : [ subparttwo,subpartthree ]}}

//Glue it all together
rho  = {$project : { rho:  {$divide : [partone,parttwo]}}}


pipeline = [sumcolumns,rho]
db.pearsons.aggregate(pipeline)

2 comments:

Anuj Singh said...

Nice blog. Thanks for sharing such great information.Inwizards offers Mongo database services for our Mongodb Client. Start mongodb development with our skilled and experienced mongodb developers. Intrested click here - Mongo Database Services

Teju Teju said...

Really nice blog post. provided a helpful information. I hope that you will post more updates like this Big Data Hadoop Online Course Bangalore