`PyStan` is a python interface to `STAN`, a C++ library for building Bayesian models and sampling them with Markov Chain Monte Carlo (MCMC). To install `pystan`, you'll need to install `cython`. Then download the `pystan` package from:

https://github.com/stan-dev/pystan

You could also try to install via `pip` or `easy_install`, but I had problems using those and could only compile from source.

Similar to previous examples, we start by making some fake data. In this case, we're going to do have a linear regression model with two predictor variables (x,y) and one outcome variable (z). The relation between them is: \[Z_n = \alpha + \beta_x x_n + \beta_y y_n + \epsilon\] where \(\epsilon\) is a Gaussian noise term.

In [1]:

```
from numpy import *
Nobs = 20
x_true = random.uniform(0,10, size=Nobs)
y_true = random.uniform(-1,1, size=Nobs)
alpha_true = 0.5
beta_x_true = 1.0
beta_y_true = 10.0
eps_true = 0.5
z_true = alpha_true + beta_x_true*x_true + beta_y_true*y_true
z_obs = z_true + random.normal(0, eps_true, size=Nobs)
```

In [2]:

```
%matplotlib inline
from matplotlib import pyplot as plt
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.scatter(x_true, z_obs, c=y_true, marker='o')
plt.colorbar()
plt.xlabel('X')
plt.ylabel('Z')
plt.subplot(1,2,2)
plt.scatter(y_true, z_obs, c=x_true, marker='o')
plt.colorbar()
plt.xlabel('Y')
plt.ylabel('Z')
```

Out[2]:

Now we build the `STAN` model that we are going to use for this data. `STAN` is a different language (similar to C), so we have to put it in a string, or put it in a file.

In [3]:

```
model = """
data {
int<lower=4> N; // Number of data points
real x[N]; // the 1st predictor
real y[N]; // the 2nd predictor
real z[N]; // the outcome
}
parameters {
real alpha; // intercept
real betax; // x-slope
real betay; // y-slope
real<lower=0> eps; // dispersion
}
model {
for (i in 1:N)
z[i] ~ normal(alpha + betax * x[i] + betay * y[i], eps);
}"""
```

The `STAN` model has three sections in this case. The first defines the `data` that will be fit. Data types can be integers (`int`) or floating-point numbers (`real`). They can also have contraints. Above, we specify that N > 0. Next, we define the `parameters` of the model. Then, we define the `model` section that defines the likelihood. In this case, it is assumed that each data point was drawn from a normal distribution with mean equal to the regression model and standard deviation equal to `eps`. We put it in a for loop over all the data points.

With this model, we can then fit the data using `pystan`. But first, we need to put the data into a form that `pystan` can use. We do this by making a dictionary with key equal to the variable name and value equal to the data.

In [4]:

```
data = {'N':Nobs, 'x':x_true, 'y':y_true, 'z':z_obs}
```

In [5]:

```
import pystan
fit = pystan.stan(model_code=model, data=data, iter=1000, chains=4)
```

We give `stan` the model string and the data. We ask for 1000 MCMC iterations each run on 4 parallel chains. The first time you fit a model, `pystan` converts the `STAN` model into C++ code (making it fast), so it make take a while to start up. Once it's done, we can get a summary of the results:

In [6]:

```
print fit
```

This will show you some summary statistics. We can always perform our own using numpy functions. We can also plot out the chains and posterior probability distributions.

In [7]:

```
p = fit.plot()
```

You can also run the MCMC sampler for longer. This should be faster because the model is already set up.

In [8]:

```
fit2 = pystan.stan(fit=fit, data=data, iter=10000, chains=4)
temp = fit2.plot()
```

And of course, you can always plot the results to see how they look. But this time, we can correct for the other predictor's effect. First, let's get the median of each variable from the samples. There are two ways to extract the chains: 1) using `permuted=True` to get a dictionary with each key corresponding to a parameter, or 2) using `permuted=False`, which returns a 3D array with indeces `(i,c,p)`, corresponding to iteration `i` in chain `c` for parameter `p`.

In [9]:

```
samples = fit2.extract(permuted=True)
alpha = median(samples['alpha'])
beta_x = median(samples['betax'])
beta_y = median(samples['betay'])
eps = median(samples['eps'])
```

The median is more robust toward non-symmetric distributions (such as the one for `eps`). Now we can plot the results, this time correcting for the `y` predictor when plotting versus `x` and vice versa. That way we can make sure the correct trend has been found. We can also plot a shaded region denoting the intrinsic dispersion.

In [10]:

```
plt.figure(figsize=(12,6))
plt.subplot(1,2,1)
plt.plot(x_true, z_obs-alpha-beta_y*y_true, 'o')
plt.xlabel('X')
plt.ylabel('Z - alpha - beta_y y')
# Now plot the model
xx = array([x_true.min(), x_true.max()])
plt.plot(xx, xx*beta_x)
plt.plot(xx, xx*beta_x + eps, '--', color='k')
plt.plot(xx, xx*beta_x - eps, '--', color='k')
plt.subplot(1,2,2)
plt.plot(y_true, z_obs-alpha-beta_x*x_true, 'o')
plt.xlabel('Y')
plt.ylabel('Z - alpha - beta_x x')
yy = array([y_true.min(), y_true.max()])
plt.plot(yy, yy*beta_y)
plt.plot(yy, yy*beta_y + eps, '--', color='k')
plt.plot(yy, yy*beta_y - eps, '--', color='k')
```

Out[10]:

If you have the `triangle_plot` plotting package, it's an easy way to look at the covariance of the various parameters. For this to work, we need the Markov chains in a 2D array indexed by `[i,p]`, so we need to reshape the array to combine the 4 chains. `pystan` also includes a 5th "parameter", which is the log-probability at each step, so we can leave that out.

In [11]:

```
samples = array([samples['alpha'],samples['betax'], samples['betay'], samples['eps']]).T
print samples.shape
import triangle
tmp = triangle.corner(samples[:,:], labels=['alpha','betax','betay','eps'],
truths=[alpha_true, beta_x_true, beta_y_true, eps_true])
```