## Wednesday, October 11, 2017

### Calculating Correlation inside MongoDB

I've been pondering recently the idea of a library of statistical and heuristic functions that run inside MongoDB using the aggregation Pipeline. After all if we can avoid pulling data out of the database that must help performance. As a little experiment, here is  the correlation co-efficient of two fields using Pearsons Rho. It's broken down into individual variables to make it easier to read rather than a huge piece of javascript. That's usually the best way to write pipelines.

//Pearsons Rho as a pipeline

testdata = [{x:1,y:2},
{x:2,y:3},
{x:3,y:6},
{x:4,y:8}]
db=db.getSiblingDB("stats")
db.pearsons.drop();
db.pearsons.insertMany(testdata)

x = "\$x"
y = "\$y"

//This is a pipeline stage
sumcolumns = { \$group : { _id: true,
count: { \$sum: 1 },
sumx : { \$sum : x},
sumy : { \$sum : y},
sumxsquared : { \$sum : { \$multiply : [x,x] } },
sumysquared : { \$sum : { \$multiply : [y,y] } },
sumxy : { \$sum : { \$multiply : [x,y] } }
}}

//This is building a pipeline stage from objects
multiply_sumx_sumy = { \$multiply : [ "\$sumx","\$sumy"] }
multiply_sumxy_count = { \$multiply : ["\$sumxy","\$count"]}
partone = { \$subtract : [ multiply_sumxy_count, multiply_sumx_sumy ]}

multiply_sumxsquared_count = { \$multiply : ["\$sumxsquared","\$count"]}
sumx_squared = { \$multiply : ["\$sumx","\$sumx"]}
subparttwo = { \$subtract : [ multiply_sumxsquared_count,sumx_squared  ]}

multiply_sumysquared_count = { \$multiply : ["\$sumysquared","\$count"]}
sumy_squared = { \$multiply : ["\$sumy","\$sumy"]}
subpartthree = { \$subtract : [ multiply_sumysquared_count,sumy_squared  ]}

parttwo = { \$sqrt : {\$multiply : [ subparttwo,subpartthree ]}}

//Glue it all together
rho  = {\$project : { rho:  {\$divide : [partone,parttwo]}}}

pipeline = [sumcolumns,rho]
db.pearsons.aggregate(pipeline)