I've been pondering recently the idea of a library of statistical and heuristic functions that run inside MongoDB using the aggregation pipeline. After all, if we can avoid pulling data out of the database, that should help performance. As a small experiment, here is the correlation coefficient of two fields using Pearson's Rho. It's broken down into individual variables rather than one huge piece of JavaScript to make it easier to read; that's usually the best way to write pipelines.
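For reference, this is the standard computational form of Pearson's coefficient that the pipeline implements, built entirely from sums that a single pass over the data can accumulate:

```latex
r = \frac{n\sum xy \;-\; \sum x \sum y}
         {\sqrt{\bigl(n\sum x^{2} - (\sum x)^{2}\bigr)\,\bigl(n\sum y^{2} - (\sum y)^{2}\bigr)}}
```

The `$group` stage below gathers those sums, and the numerator and denominator are then assembled from them.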

```javascript
// Pearson's Rho as a pipeline
testdata = [ { x: 1, y: 2 }, { x: 2, y: 3 }, { x: 3, y: 6 }, { x: 4, y: 8 } ]
db = db.getSiblingDB("stats")
db.pearsons.drop()
db.pearsons.insertMany(testdata)
```

```javascript
// Shorthand for the two field paths
x = "$x"
y = "$y"
```

```javascript
// This is a pipeline stage
sumcolumns = { $group: { _id: true,
    count: { $sum: 1 },
    sumx: { $sum: x },
    sumy: { $sum: y },
    sumxsquared: { $sum: { $multiply: [x, x] } },
    sumysquared: { $sum: { $multiply: [y, y] } },
    sumxy: { $sum: { $multiply: [x, y] } }
} }
```
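To see what that `$group` stage produces, here is a plain JavaScript sketch (runnable in Node, no MongoDB required) that accumulates the same six fields over the test data; the variable names are mine, chosen to mirror the stage's output:

```javascript
const docs = [{ x: 1, y: 2 }, { x: 2, y: 3 }, { x: 3, y: 6 }, { x: 4, y: 8 }];

// Accumulate the same fields the $group stage emits
const sums = docs.reduce((acc, d) => ({
  count: acc.count + 1,
  sumx: acc.sumx + d.x,
  sumy: acc.sumy + d.y,
  sumxsquared: acc.sumxsquared + d.x * d.x,
  sumysquared: acc.sumysquared + d.y * d.y,
  sumxy: acc.sumxy + d.x * d.y,
}), { count: 0, sumx: 0, sumy: 0, sumxsquared: 0, sumysquared: 0, sumxy: 0 });

console.log(sums);
// → { count: 4, sumx: 10, sumy: 19, sumxsquared: 30, sumysquared: 113, sumxy: 58 }
```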

```javascript
// This is building a pipeline stage from objects
multiply_sumx_sumy = { $multiply: ["$sumx", "$sumy"] }
multiply_sumxy_count = { $multiply: ["$sumxy", "$count"] }
partone = { $subtract: [multiply_sumxy_count, multiply_sumx_sumy] }
```

```javascript
multiply_sumxsquared_count = { $multiply: ["$sumxsquared", "$count"] }
sumx_squared = { $multiply: ["$sumx", "$sumx"] }
subparttwo = { $subtract: [multiply_sumxsquared_count, sumx_squared] }
```

```javascript
multiply_sumysquared_count = { $multiply: ["$sumysquared", "$count"] }
sumy_squared = { $multiply: ["$sumy", "$sumy"] }
subpartthree = { $subtract: [multiply_sumysquared_count, sumy_squared] }
```

```javascript
parttwo = { $sqrt: { $multiply: [subparttwo, subpartthree] } }
```

```javascript
// Glue it all together
rho = { $project: { rho: { $divide: [partone, parttwo] } } }
```

```javascript
pipeline = [sumcolumns, rho]
db.pearsons.aggregate(pipeline)
```
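As a sanity check, the same formula can be evaluated directly in plain JavaScript over the test data (a standalone sketch; the intermediate names echo the pipeline's `partone`/`parttwo` but are not part of it):

```javascript
const data = [{ x: 1, y: 2 }, { x: 2, y: 3 }, { x: 3, y: 6 }, { x: 4, y: 8 }];

const n = data.length;
const sumx = data.reduce((s, d) => s + d.x, 0);
const sumy = data.reduce((s, d) => s + d.y, 0);
const sumxy = data.reduce((s, d) => s + d.x * d.y, 0);
const sumxsq = data.reduce((s, d) => s + d.x * d.x, 0);
const sumysq = data.reduce((s, d) => s + d.y * d.y, 0);

const partone = n * sumxy - sumx * sumy;                // numerator
const parttwo = Math.sqrt((n * sumxsq - sumx * sumx) *
                          (n * sumysq - sumy * sumy));  // denominator
const rho = partone / parttwo;

console.log(rho); // ≈ 0.9845
```

The pipeline's `rho` field should agree with this value for the same four documents.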