Could we evolve entirely unforeseen kinds of AI from mathematical expressions? Engineering works within the narrow bounds of what reason can foresee. Evolution, though blind and inefficient, often discovers systems no one could have predicted from first principles.
In 2024, I briefly dived into genetic programming and produced code for mutating and recombining tree-shaped genomes of mathematical expressions. The idea was that, while we did our best to engineer algorithms, some dedicated computing cluster could asynchronously churn out algorithms for fundamental problems, like localized learning rules, new dynamical systems for memory and computation, and model architectures. It isn’t a realistic goal without an extremely large, parallel computing environment, but I had just gotten my head around JAX and thought it worth exploring.
The idea behind binary tree–structured genomes is to define a data structure for mathematical expressions that remains closed under mutation, recombination, and deletion. For example, a simple neural network looks like this: $$Y=W_1\text{ReLU}(W_0X+b_0)+b_1$$
flowchart TD
+ --- MatMul
+ --- b
MatMul---ReLU
MatMul---W1[W]
ReLU---+0[+]
+0[+]---MatMul0[MatMul]
+0[+]---b0[b]
MatMul0[MatMul]---W0[W]
MatMul0[MatMul]---X
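To make the closure property concrete, here is a minimal sketch of how such a genome might be encoded and evaluated: nodes as nested `(op, children...)` tuples, leaves as named tensors bound in an environment. The tuple encoding, the `OPS` table, and the helper names are illustrative assumptions, not the actual pipeline's data structures.

```python
import jax.numpy as jnp

# Genome for Y = W1 ReLU(W0 X + b0) + b1 as a nested (op, children...) tuple;
# string leaves name tensors bound in an environment.
genome = ("add",
          ("matmul", "W1", ("relu", ("add", ("matmul", "W0", "X"), "b0"))),
          "b1")

OPS = {
    "add":    lambda a, b: a + b,
    "matmul": lambda a, b: a @ b,
    "relu":   lambda a: jnp.maximum(a, 0.0),
}

def evaluate(node, env):
    """Recursively evaluate a tree genome given leaf bindings in env."""
    if isinstance(node, str):                 # leaf: named tensor or input X
        return env[node]
    op, *children = node
    return OPS[op](*(evaluate(c, env) for c in children))

env = {
    "X":  jnp.ones((4, 1)),
    "W0": 0.1 * jnp.ones((8, 4)),
    "b0": jnp.zeros((8, 1)),
    "W1": 0.1 * jnp.ones((2, 8)),
    "b1": jnp.zeros((2, 1)),
}
Y = evaluate(genome, env)                     # shape (2, 1)
```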
The multilayer feed-forward network needs no new elements, just a single recombination that replaces $X$ with a copy of its (great-grand-)parent branch: $$Y = W_2\text{ReLU}(W_1\text{ReLU}(W_0X+b_0)+b_1)+b_2$$
flowchart TD
+3[+] --- MatMul3[MatMul]
+3 --- b3[b]
MatMul3---ReLU3[ReLU]
MatMul3---W3[W]
ReLU3---+
+---MatMul
+---b1[b]
MatMul---W1[W]
MatMul---2ReLU
2ReLU[ReLU]---+2
+2[+]---MM2[MatMul]
+2---b2[b]
MM2---W
MM2---X
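The grafting itself is a small recursive operation. The sketch below assumes the same tuple encoding as above and shows $X$ being replaced by a copy of its hidden-layer ancestor branch; the copied branch reuses `W0` and `b0`, whereas the real pipeline would presumably relabel the copied leaves so the layers get independent weights.

```python
# Hypothetical recombination helper on the illustrative tuple encoding.
hidden = ("relu", ("add", ("matmul", "W0", "X"), "b0"))
genome = ("add", ("matmul", "W1", hidden), "b1")   # Y = W1 ReLU(W0 X + b0) + b1

def replace_leaf(node, leaf, subtree):
    """Return a copy of node with every occurrence of leaf swapped for subtree."""
    if node == leaf:
        return subtree
    if isinstance(node, str):
        return node
    op, *children = node
    return (op, *(replace_leaf(c, leaf, subtree) for c in children))

deeper = replace_leaf(genome, "X", hidden)
# deeper now encodes Y = W1 ReLU(W0 ReLU(W0 X + b0) + b0) + b1
```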
We can formulate a simple recurrent network if the leaf node H is treated as stateful: $$Y=W_2 H+b_1$$ $$H=\text{tanh}(W_1 X + W_0 H + b_0)$$
flowchart TD
M0[MatMul] --- W0
M0[MatMul] --- H
M1[MatMul] --- W1
M1[MatMul] --- X
+0[+] --- M0
+0[+] --- b0
+1[+] --- +0
+1[+] --- M1
tanh --- +1
H0[H] --- tanh
M2[MatMul] --- H0
M2[MatMul] --- W2
+2[+] --- M2
+2[+] --- b1
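Written directly in JAX rather than as a tree, the stateful reading of $H$ amounts to carrying the hidden vector across steps, for example with `jax.lax.scan`. Shapes and names below are illustrative assumptions.

```python
import jax
import jax.numpy as jnp

def rnn_step(params, h, x):
    W0, W1, W2, b0, b1 = params
    h_new = jnp.tanh(W1 @ x + W0 @ h + b0)    # H = tanh(W1 X + W0 H + b0)
    y = W2 @ h_new + b1                        # Y = W2 H + b1
    return h_new, y

def run(params, h0, xs):
    # jax.lax.scan threads the stateful leaf H across the sequence
    return jax.lax.scan(lambda h, x: rnn_step(params, h, x), h0, xs)

d_in, d_h, d_out, T = 3, 5, 2, 7
params = (jnp.zeros((d_h, d_h)),      # W0
          jnp.zeros((d_h, d_in)),     # W1
          jnp.zeros((d_out, d_h)),    # W2
          jnp.zeros((d_h,)),          # b0
          jnp.zeros((d_out,)))        # b1
xs = jax.random.normal(jax.random.PRNGKey(0), (T, d_in))
final_h, ys = run(params, jnp.zeros((d_h,)), xs)   # ys: (T, d_out)
```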
A learning algorithm, namely the delta rule on a linear model, looks like this:
$$Y=W_0X$$
$$W_0 = W_0 + \text{LR}\cdot X(Y-W_0X)$$
flowchart TD
mm1[MatMul] --- W[W0]
W---+0[+]
+0 --- W0
+0 --- *0[*]
*0 --- LR
*0 --- mm3
mm3 --- X2[X]
mm3[MatMul]--- -0[-]
-0 --- Y
-0 --- mm2[MatMul]
mm2 --- W2[W0]
mm2 --- X
mm1---X1[X]
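For reference, here is the same delta rule written directly in JAX; the learning rate, shapes, and the outer-product reading of the $X(Y-W_0X)$ term are my assumptions for a weight-matrix update, not the pipeline's exact code.

```python
import jax
import jax.numpy as jnp

def delta_step(W0, X, Y, lr):
    # W0 = W0 + LR * (Y - W0 X) X^T, an outer-product form of the tree above
    err = Y - W0 @ X
    return W0 + lr * jnp.outer(err, X)

key = jax.random.PRNGKey(0)
W_true = jnp.array([[3.0]])                  # data generated from Y = 3X
W0 = jnp.zeros((1, 1))
for _ in range(500):
    key, sub = jax.random.split(key)
    X = jax.random.normal(sub, (1,))
    W0 = delta_step(W0, X, W_true @ X, lr=0.1)
# W0 converges toward [[3.0]]
```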
That’s all I’m going to do by hand for now. I was able to code the entire evolution pipeline this way. It required some special treatment of $X$ and $Y$ to create a consistent input-output processing loop, as well as string notations for nodes and leaves. Leaf nodes can be vectors (V), matrices (M), scalars (S), or the special vectors $X$ and $Y$. Scalars are defined by strings in the format “S_<idx>_<base>e<power>”, though they often mutate into variables. Resulting genomes look like this sample output:
Run time: 25.57770299911499
BEST_FITNESS: -7.6178203
Best model:
V_5=(gp_vector(jnp.add(jnp.multiply(jnp.multiply(V_2,gp_matrix(gp_logabs(gp_abspower(gp_identity(S_4_5em3),gp_abspower(Y,V_2))))),gp_divide(Y,V_6)),M_1)))
V_0=(gp_vector(gp_divide(M_11,gp_logabs(V_1))))
V_Y=(gp_vector(V_7))
V_6=(gp_vector(X))
M_0=(gp_matrix(gp_divide(S_8_5e1,jnp.multiply(gp_divide(X,gp_matrix(gp_divide(X,M_5))),gp_relu(X)))))
V_Y=(gp_vector(S_4_5em3))
GENERATION: 298
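For illustration, here is one way the scalar leaf strings above could be decoded, reading the trailing `m` in the exponent as a minus sign (so `S_4_5em3` is scalar 4 with value 5e-3). This is my reading of the format, not the pipeline's actual parser.

```python
def parse_scalar(token):
    """Decode a scalar leaf string 'S_<idx>_<base>e<power>' into (idx, value)."""
    _, idx, value = token.split("_")
    base, power = value.split("e")
    sign = -1 if power.startswith("m") else 1   # assumption: 'm' marks a negative power
    power = power.lstrip("m")
    return int(idx), float(base) * 10.0 ** (sign * int(power))

print(parse_scalar("S_4_5em3"))   # (4, 0.005)
print(parse_scalar("S_8_5e1"))    # (8, 50.0)
```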
A pleasing observation about these outputs is that they behave exactly like genomes should. If the problem is to predict data generated from $Y=3X$, the system is less likely to mutate a leaf node that is exactly 3 and more likely to mutate something like $$\log_{2}\left(\log_{2}\left(1e10^{2}\left(\log_{2}\left(2e10^{2}\right)\right)\right)\right)\approx 3. $$ This seems ridiculous, but the rationale is simply different: the scalar is built from several reusable parts. Through such redundant “junk DNA”, the evolutionary process can shape itself to produce, on average, finer- or coarser-grained increments when the problem domain tends to demand it.
I chose JAX because the idea was to leverage GPU parallelism to make this practical. The result was less performant than originally hoped: you can JIT-compile each new genome, which represents a system of tensor and vector operations, and then evaluate that member of the population very quickly. But because each member is a unique program of unpredictable shape, you cannot run several members asynchronously on a single GPU, and the per-member JIT compilation ends up accounting for the majority of the compute time.
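A small timing sketch of the effect, with toy genomes and made-up shapes: each freshly jitted genome lowers to its own XLA program, so the first call pays its own compilation, and that first call is where most of the wall-clock time goes.

```python
import time
import jax
import jax.numpy as jnp

def time_genome(fn, env):
    """Time the first (compile + run) and second (run only) call of a jitted genome."""
    jitted = jax.jit(fn)
    t0 = time.time(); jitted(env).block_until_ready(); compile_and_run = time.time() - t0
    t0 = time.time(); jitted(env).block_until_ready(); run_only = time.time() - t0
    return compile_and_run, run_only

# Two toy "genomes" with different structure: each triggers its own compilation.
genome_a = lambda env: env["W"] @ jnp.maximum(env["X"], 0.0)
genome_b = lambda env: jnp.tanh(env["W"] @ env["X"]) + jnp.log(jnp.abs(env["X"]) + 1e-6)

env = {"W": jnp.ones((256, 256)), "X": jnp.ones((256, 128))}
print(time_genome(genome_a, env))   # first element (compile + run) dwarfs the second
print(time_genome(genome_b, env))
```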
I cannot think of a way to structure this so that the whole process is compiled once and then executed entirely on the GPU without imposing structural bounds on the genomes, and imposing such bounds would defeat the purpose of genetic programming. For bounded structures, it is more sensible to parameterize a fixed system and use an evolutionary strategy to tune the parameters.
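For contrast, a fixed-structure population can be stacked and scored in a single compiled call via `jax.vmap`, since only the parameters vary. The sketch below is a toy evolution-strategy-style fitness evaluation with assumed shapes and random data, not part of the genetic programming pipeline.

```python
import jax
import jax.numpy as jnp

def model(params, X):
    W0, b0, W1, b1 = params
    return W1 @ jnp.maximum(W0 @ X + b0, 0.0) + b1

def fitness(params, X, Y):
    return -jnp.mean((model(params, X) - Y) ** 2)

# One compiled program scores the whole population at once.
pop_fitness = jax.jit(jax.vmap(fitness, in_axes=(0, None, None)))

pop, d_in, d_h, d_out, n = 256, 4, 8, 2, 16
k = jax.random.split(jax.random.PRNGKey(0), 4)
params = (jax.random.normal(k[0], (pop, d_h, d_in)),
          jnp.zeros((pop, d_h, 1)),
          jax.random.normal(k[1], (pop, d_out, d_h)),
          jnp.zeros((pop, d_out, 1)))
X = jax.random.normal(k[2], (d_in, n))
Y = jax.random.normal(k[3], (d_out, n))
scores = pop_fitness(params, X, Y)           # shape (pop,)
```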
In any case, genetic programming like this is an idea that can never be totally abandoned. Nothing else works quite like it or can, in principle, produce the same kind and extent of results. But of course, our hardware is simply not designed to simulate the flexibility, asynchronicity, and scale of such natural processes.