GitHub source for this project
For most statistical models, you can verify that they work by simulating data from a set of parameter values, then fitting the model to the simulated data and seeing how well it recovers those values. With artificial neural networks it is common to have many local solutions and a stochastic learning algorithm, so while an ANN may find a good solution on the simulated data, it is far less likely to match the data-generating parameters. I was surprised to read about a generative network called the Helmholtz Machine (HM), for which this appears not to be the case.
The HM is one of a succession of binary networks that includes the Hopfield network, the Boltzmann machine, and the restricted Boltzmann machine (RBM). Each model in some way learns the joint probability distribution of many binary variables, with the latter three using binary hidden nodes. The main difference between the RBM and the HM is that the former computes a gradient for the weights by a kind of Gibbs sampling of the hidden nodes, whereas the latter learns by “dreaming” fake data from a separate generative network. In the HM, the recognition and generative networks are used to alternately adjust each other’s weights. The solution of an HM is expected to minimize the Kullback-Leibler divergence between the real data distribution and the generative distribution. In a way that I don’t fully understand yet, this leads to learning deeper hidden structure that better matches the data-generating structure.
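For reference, that divergence is the usual Kullback-Leibler divergence between the data distribution $p$ and the model's generative distribution $q$ (my notation, not from the original papers):

$$D_{\mathrm{KL}}(p \,\|\, q) \;=\; \sum_{x} p(x)\,\log\frac{p(x)}{q(x)}$$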
The idea of an ANN that better learns parsimonious, data-generating structures is appealing. Do GANs that learn to generate, say, photos of faces like this accurately infer the genetically driven axes of facial structure? Would we find nodes in the network that accurately guess the underlying genes? Something like a GWAS of hidden-layer scores could answer that. On a different note, does this help explain how the various generative behaviors of humans (talking, imitating, drawing) help us come up with theoretical models of our experience? I want to come back to these questions later.
For now, I was just interested in experimenting with the RBM and HM. There aren't many implementations of either in R, but I did find this tutorial.
My goal was to learn about both models by coding them from scratch in R. Then I could test the difference for myself.
As the example code that I found was not abstracted to take arbitrary model dimensions (number of layers, nodes per layer), I had to think of a way to do this that would allow quick and convenient manipulation.
It seemed easiest to set up a list object L containing the name and location of each part of the model:
> L
[[1]]
[1] "bias"
[[2]]
[1] "x1" "x2" "x3" "x4" "x5" "x6" "x7" "x8" "x9" "x10" "x11" "x12" "x13" "x14" "x15" "x16" "x17" "x18" "x19" "x20"
[[3]]
[1] "h1_1" "h1_2" "h1_3" "h1_4" "h1_5" "h1_6" "h1_7" "h1_8"
The top-level indices (1-3) are the layers, and the strings within each are the names of the nodes.
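For reference, a list like that can be built in a couple of lines (a sketch; the actual construction in my script may differ), along with Q, which I use below for the number of layers:
# layers: bias node, 20 observed nodes, one hidden layer of 8 nodes
L <- list("bias", paste0("x", 1:20), paste0("h1_", 1:8))
Q <- length(L)   # number of layers; the loops below run over 2:Q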
With this, I could set up the network as one big adjacency matrix with the contents of L as dimnames (inspired by the RAM notation of structural equation models). Then I could code the entire thing with simple matrix multiplications, using L to index the resulting block sub-matrices that represent the expectations of each layer conditional on those above or below it.
This approach is slow because it is in R and it involves a lot of unnecessary computation over all the zero entries, but it seemed like a reasonable trade-off for keeping the code abstract and concise.
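As a rough sketch of that setup (the 0.1 initialization scale and the loop are my reconstruction; only the names Wg, Wr, L, Q, and lgt carry over to the code below):
nodes <- unlist(L)
P <- length(nodes)
Wg <- Wr <- matrix(0, P, P, dimnames = list(nodes, nodes))

for (i in 2:Q) {
  Wg[L[[i]], L[[1]]] <- rnorm(length(L[[i]]), 0, 0.1)   # bias into layer i (generative)
  Wr[L[[i]], L[[1]]] <- rnorm(length(L[[i]]), 0, 0.1)   # bias into layer i (recognition)
  if (i < Q) Wg[L[[i]], L[[i + 1]]] <- rnorm(length(L[[i]]) * length(L[[i + 1]]), 0, 0.1)  # layer above feeds layer i downward
  if (i > 2) Wr[L[[i]], L[[i - 1]]] <- rnorm(length(L[[i]]) * length(L[[i - 1]]), 0, 0.1)  # layer below feeds layer i upward
}

# inverse-logit, used throughout to turn linear predictors into Bernoulli probabilities
lgt <- function(x) 1 / (1 + exp(-x))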
Data for either a Boltzmann or Helmholtz machine can be generated like this:
rRBM <- function(n, Wg) {
  # start with an all-zero data matrix; column 1 is the always-on bias node
  X <- matrix(0, n, ncol(Wg), dimnames = list(NULL, colnames(Wg)))
  X[, 1] <- 1
  # work down from the top layer (Q) to the observed layer (2)
  for (i in Q:2) {
    X[, L[[i]]] <- (X %*% t(Wg))[, L[[i]]]   # linear predictor from the bias and the layer above
    X[, L[[i]]] <- matrix(rbinom(prod(dim(X[, L[[i]]])), size = 1, c(lgt(X[, L[[i]]]))),
                          dim(X[, L[[i]]])[1], dim(X[, L[[i]]])[2])   # Bernoulli draw at probability lgt(.)
  }
  return(X)
}
where Wg is an adjacency matrix for the entire network, including both observed and hidden nodes. Simulation starts from the top of the network and propagates random samples downward.
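For example, with the layout above, a training set could be drawn from a network that plays the role of the data-generating model (Wg_true is my name for it, not one from the original code):
Wg_true <- Wg                       # keep a copy to treat as the data-generating weights
X <- rRBM(n = 5000, Wg = Wg_true)   # 5000 rows of bias, observed, and hidden node values
colMeans(X[, L[[2]]])               # marginal frequencies of the observed nodes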
The process of calculating the gradient of Wg is similar and uses a few variations on the same loop above. Starting with just the RBM for now, this code obtains the gradient by generating random node values up the network conditional on the lower layers, then back down to the data conditional on the upper layers.
calcGrad <- function(X, Wr, Wg) {
  Gr <- Gr2 <- Wr
  # positive phase: sample hidden layers upward from the data with the recognition
  # weights Wr, accumulating cross-products of each layer with the bias and the layer below
  for (i in 3:Q) {
    X[, L[[i]]] <- (X %*% t(Wr))[, L[[i]]]
    X[, L[[i]]] <- matrix(rbinom(prod(dim(X[, L[[i]]])), size = 1, c(lgt(X[, L[[i]]]))),
                          dim(X[, L[[i]]])[1], dim(X[, L[[i]]])[2])
    Gr[L[[i]], unlist(L[c(1, i - 1)])] <- (t(X) %*% X / nrow(X))[L[[i]], unlist(L[c(1, i - 1)])]
  }
  Gr[L[[2]], L[[1]]] <- (t(X) %*% X / nrow(X))[L[[2]], L[[1]]]

  # negative phase: reconstruct downward with the generative weights Wg...
  X2 <- X
  for (i in (Q - 1):2) {
    X2[, L[[i]]] <- (X2 %*% t(Wg))[, L[[i]]]
    X2[, L[[i]]] <- matrix(rbinom(prod(dim(X2[, L[[i]]])), size = 1, c(lgt(X2[, L[[i]]]))),
                           dim(X2[, L[[i]]])[1], dim(X2[, L[[i]]])[2])
  }
  # ...then go back up to collect the same cross-products under the model
  for (i in 3:Q) {
    X2[, L[[i]]] <- lgt((X2 %*% t(Wr))[, L[[i]]])
    if (i < Q) X2[, L[[i]]] <- matrix(rbinom(prod(dim(X2[, L[[i]]])), size = 1, c(X2[, L[[i]]])),
                                      dim(X2[, L[[i]]])[1], dim(X2[, L[[i]]])[2])
    Gr2[L[[i]], unlist(L[c(1, i - 1)])] <- (t(X2) %*% X2 / nrow(X))[L[[i]], unlist(L[c(1, i - 1)])]
  }
  Gr2[L[[2]], L[[1]]] <- (t(X2) %*% X2 / nrow(X))[L[[2]], L[[1]]]

  # gradient is the difference between the data-driven and model-driven statistics
  dG <- Gr - Gr2
  return(dG)
}
I regret that R lists are subscripted with double brackets [[]], so they kind of dominate the code. I guess the rest of it is pretty ugly too.
This next bit is the essential loop for updating.
In the RBM, the weights for going up and down the network are not actually different. You can see here that I'm training the generative weight matrix Wg by just taking the transposed gradient of Wr.
for (i in 1:nIter) {
  # gradient on a random minibatch of 500 rows
  Gr <- calcGrad(X[sample(1:nrow(X), size = 500, replace = FALSE), ], Wr, Wg)
  # generative gradient is the recognition gradient transposed (bias row and column kept in place)
  Gg <- Gr
  Gg[-1, -1] <- t(Gr)[-1, -1]
  # momentum-smoothed updates, applied only to the nonzero (allowed) connections
  MSr <- MSr * momentum + Gr
  MSg <- MSg * momentum + Gg
  Wr[Wr != 0] <- Wr[Wr != 0] + MSr[Wr != 0] * lrate
  Wg[Wg != 0] <- Wg[Wg != 0] + MSg[Wg != 0] * lrate
  ...
For the RBM with one layer, I could simply take the correlation matrix of the data-generating node weights with those estimated (redacted to show |r| > .7 only):
h1_1 h1_2 h1_3 h1_4 h1_5 h1_6 h1_7 h1_8
h1_1 0.99
h1_2 0.98
h1_3 -0.98
h1_4
h1_5 -0.97
h1_6 0.98
h1_7 0.98
h1_8 0.98
Iteration 7100
This was the result after running for a few seconds. Not bad, but I noticed that it seems to miss one node on most attempts. And without any a priori constraints or priors in the model, we should expect the nodes to be permuted and flipped like that.
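For reference, that check is just a thresholded correlation of the hidden-node weight vectors (a sketch; Wg_true again stands for the saved data-generating weights):
# columns of the visible-from-hidden block are each hidden node's weight vector
r <- cor(Wg_true[L[[2]], L[[3]]], Wg[L[[2]], L[[3]]])
r[abs(r) < 0.7] <- NA               # redact to |r| > .7, as in the printout above
print(round(r, 2), na.print = "")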
A second run yielded better results:
h1_1 h1_2 h1_3 h1_4 h1_5 h1_6 h1_7 h1_8
h1_1 0.99
h1_2 0.99
h1_3 0.99
h1_4 -0.98
h1_5 0.99
h1_6 0.98
h1_7 0.98
h1_8 0.99
Iteration 3325
However, including any additional layers results in different solutions at the first layer and I lose these correlations. In another post I will go through my HM code, which gave much better results.