This summer, I hired an undergraduate student to look into calibrating our CSP supernovae. I wanted to do some more sophisticated models and so we looked into a few MCMC modules for python: pystan, pymc, and emcee.
Each has its pluses and minuses. In fact, I found myself wanting to merge all three into one system. So, in case anyone out there is wondering which of these to delve into, I’m posting a summary of the three. Also, I’ve been having fun with IPython notebooks for teaching this to my students and have made three versions of the same tutorial: fitting a linear model to data. You can look at each one here:
PyMC
This was the first MCMC module for python I ever tried. It’s got a somewhat steep learning curve because the authors have very craftily created a system in which one defines the model hierarchically, but in ordinary python code. But it comes with a wealth of pre-made statistical distributions you can use “out of the box”.
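To give a flavor of this, here’s a minimal sketch of the linear-model tutorial written pymc-style (against the pymc 2 interface; the data and prior choices are placeholders I made up):

```python
import numpy as np
import pymc

# Made-up data: a straight line with Gaussian noise
x = np.linspace(0, 10, 20)
y_obs = 2.0 * x + 1.0 + np.random.normal(0, 0.5, size=x.size)

# Priors on slope and intercept (pymc parameterizes Normals with tau = 1/sigma^2)
m = pymc.Normal('m', mu=0.0, tau=1e-4)
b = pymc.Normal('b', mu=0.0, tau=1e-4)

# The deterministic part of the model: the line itself
@pymc.deterministic
def mu(m=m, b=b):
    return m * x + b

# The likelihood: the observed points scatter about the line
y = pymc.Normal('y', mu=mu, tau=1.0 / 0.5**2, value=y_obs, observed=True)

# Sample with the default (Metropolis-Hastings) machinery
M = pymc.MCMC([m, b, mu, y])
M.sample(iter=20000, burn=5000)
print(M.stats()['m']['mean'], M.stats()['b']['mean'])
```

The nice part is that the model reads top-to-bottom like the hierarchy it represents: priors, then deterministic pieces, then the likelihood.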
While fun to use, the biggest problem I had with pymc was getting the sampler to be efficient. The default sampler is Metropolis-Hastings (MH), which is awful when your parameters have strong covariances (see this excellent blog post for examples). The adaptive MH sampler is better, but still wicked finicky. I’m probably not being as intelligent setting things up as I should be, but I’ve got science to do (and a student to keep up with), and fiddling with this just got too painful.
PyMC 3 is supposed to have the much more efficient Hamiltonian Monte Carlo (HMC) sampler, but it’s still in beta, so I’m going to wait on that.
Emcee
The next module I worked with. Unlike pymc, emcee is all about the sampler. There are no built-in statistical distributions to use, no fancy-shmancy system for building hierarchical models. You have to write python functions to define your priors and likelihoods. From scratch.
After working with pymc, I have to say this seemed like a major step backwards. As complex as the pymc system was, once you got used to it, it was almost fun to build up complicated models. Not so with emcee.
Where emcee shines is its Ensemble Sampler. I’m not going to go into details, but you basically run many samplers (called walkers) in parallel, and together they “feel out” the shape of the parameter space. After MH, this almost seemed like magic in its efficiency. Plus, multi-processing (and MPI) support is built in and ready to go, making this ideal if you’ve got a cluster.
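For comparison, here’s a minimal sketch of the same straight-line fit, emcee-style (this follows the emcee 2-era interface; the data, priors, and walker settings are placeholders I made up):

```python
import numpy as np
import emcee

# Made-up data: the same straight line with Gaussian noise
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.5, size=x.size)
yerr = 0.5 * np.ones_like(x)

# Everything below is hand-written: no built-in distributions here
def lnprior(theta):
    m, b = theta
    if -10 < m < 10 and -100 < b < 100:
        return 0.0               # flat prior inside the bounds
    return -np.inf               # zero probability outside

def lnlike(theta, x, y, yerr):
    m, b = theta
    return -0.5 * np.sum(((y - (m * x + b)) / yerr) ** 2)

def lnprob(theta, x, y, yerr):
    lp = lnprior(theta)
    if not np.isfinite(lp):
        return -np.inf
    return lp + lnlike(theta, x, y, yerr)

# 50 walkers exploring a 2D parameter space (slope, intercept)
ndim, nwalkers = 2, 50
p0 = [np.array([2.0, 1.0]) + 1e-3 * np.random.randn(ndim)
      for _ in range(nwalkers)]

sampler = emcee.EnsembleSampler(nwalkers, ndim, lnprob, args=(x, y, yerr))
sampler.run_mcmc(p0, 1000)
samples = sampler.flatchain      # (nwalkers * nsteps, ndim) posterior samples
```

Writing lnprior and lnlike by hand is the “from scratch” part I mentioned above; the payoff is that those functions can call anything in python, scipy included.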
PyStan
This is the most recent module I’ve used. And I have to say it is very, very impressive in both its design and efficiency. Like pymc, pystan defines its statistical model in terms of hierarchical assignments, though you have to write them in the Stan language. But that’s not so bad, especially if you’re used to thinking that way. Pystan uses the HMC sampler, compiles your model into C++ code, and can run multiple chains in parallel, making it super efficient.
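Here’s the same linear model one more time, now written in the Stan language and run through pystan (a minimal sketch against the pystan 2 interface; the data are again made-up placeholders):

```python
import numpy as np
import pystan

# The straight-line model, written in the Stan language
model_code = """
data {
    int<lower=1> N;
    vector[N] x;
    vector[N] y;
}
parameters {
    real m;
    real b;
    real<lower=0> sigma;
}
model {
    y ~ normal(m * x + b, sigma);
}
"""

# Made-up data, as before
x = np.linspace(0, 10, 20)
y = 2.0 * x + 1.0 + np.random.normal(0, 0.5, size=x.size)

# pystan compiles the model into C++ the first time around
model = pystan.StanModel(model_code=model_code)

# Four chains, run in parallel, with the HMC-family sampler
fit = model.sampling(data={'N': len(x), 'x': x, 'y': y},
                     iter=2000, chains=4)
print(fit)
```

Note that everything between the triple quotes is Stan, not python, which is exactly the gotcha described below.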
There’s just one big gotcha: the entire model has to be written in Stan, which means no python call-backs. Unfortunately for me, a lot of my supernova-fitting relies heavily on scipy functions and my own kriging methods, and these cannot be called from Stan (at least not without hacking C++). The reason (I think) is that HMC needs to compute gradients, and arbitrary python functions don’t carry that information. One would think that pystan could use finite-differencing methods.
So the bottom line for me is: I use emcee for anything that needs python functions and PyStan for everything else. And I’m anxiously waiting to see what PyMC 3 brings to the table.