# Generating distributions with the Stick Breaking version of the Dirichlet Process

Posted on 28 Jun 2018

The Dirichlet Process is a prior over distributions. The basic definition relays on measure theory and has some obscure definitions for most of people. Intuitevily, imagine we want to draw a distribution $G$ over the line of real numbers using some prior distribution $P$. We say that $P$ is a Dirichlet Process if, for any possible partition of the line of real numbers, $(A_1, ..., A_m)$, the probabilities over the partitions are distributed according to a Dirichlet distribution:

There are different ways to generate a distribution G. One of the most popular is the Stick Breaking construction. The trick is to realize that

where $\pi_k$ weights sum up to 1, and the deltas are distributed around the line of the real numbers according to some distribution. The stick breaking construction goes as follows:

1. Take a stick of length 1 and make infinite random breaks. Each break removes a percentage $v_k$ of the remaining stick

The length of the removed break is

where $\prod_{l=1}^{k-1}(1-v_l)$ is the remaining length of the stick at step $k$. At the end of the process we have a collection of segments whose length sum is 1.

2. Draw infinite samples from a base distribution $G_0$:

where $G_0$ can be any distribution such as a Gaussian, a Poisson or a Multinomial.

3. We build our distribution as

Here is the R code to sample from a Dirichlet Process using the Stick Breaking construction:

library(dplyr)
library(ggplot2)

#'@title Stick Breaking construction of a Dirichlet Process
#'@param n number of samples
#'@param alpha concentration parameter
#'@param G0 base measure
DP_stick_breaking <- function(n, alpha, G0) {

# Stick breaking
v  <- rbeta(n, 1, alpha)

pi <- numeric(n)
pi[1] <- v[1]
pi[2:n] <- sapply(2:n, function(i) v[i] * prod(1 - v[1:(i-1)]))

# Atoms from the base measure
y <- G0(n)
return(list(atoms = y,
pis = pi))
# Sample from the Dirichlet Process G
#theta <- sample(y, prob = pi, replace = TRUE)
#return(theta)
}

# Choose a base distribution
G0 <- function(n) rpois(n, 7)       # Poisson
G0 <- function(n) rnorm(n, 0, 1)   # Gaussian

# Concentration parameter around the base distribution
alphas <- c(1, 10, 100, 1000)
n <- 1000
df.G <- as.data.frame(matrix(NA, 0, 4))
names(df.G) <- c("xp", "i", "atom", "pi")

df.samples <- as.data.frame(matrix(NA, 0, 4))
names(df.samples) <- c("xp", "i", "alpha", "samples")

for(alpha in alphas){
for(xp in 1:3){
cat("\n", alpha)

# Generate a distribution from the DP(alpha, G0)
G <- DP_stick_breaking(n, alpha, G0)
df.G <- bind_rows(df.G,
list(xp = rep(xp, n),
alpha= rep(alpha, n),
i=1:n,
atom = G$atoms, pi=G$pis))

# Generate samples from the DP
samples <- sample(G$atoms, prob = G$pis, replace = TRUE)
df.samples <- bind_rows(df.samples,
list(xp = rep(xp, n),
alpha= rep(alpha, n),
i=1:n,
sample=samples))
}
}

# Plot the distributon drawn by the DP
df.G$alpha <- factor(df.G$alpha)
p <- ggplot(df.G, aes(x=atom, y=pi)) +
geom_col(width=0.01)+
facet_grid(alpha ~ xp, scales = "free") +
theme_bw()
print(p)
ggsave(p, filename = paste0("fig15_DP.png"),
height=16, width=16, units='cm')

# Plot samples from the distribution
df.samples$alpha <- factor(df.samples$alpha)
p <- ggplot(df.samples, aes(x=sample)) +
#geom_density() +
#geom_histogram(bins=100, aes(y=..count../sum(..count..))) +
geom_histogram(bins=1000) +
facet_grid(alpha ~ xp, scales = "free") +
theme_bw()
print(p)
ggsave(p, filename = paste0("fig16_DPsamples.png"),
height=16, width=16, units='cm')


And here are the results of Dirichlet Process $G$ generated with the Stick Breaking and different concentration $\alpha$, and samples obtained from these $G$: