Optimizing the Particle Life: From 400 to 4 million particles (Part 1)

Why Particle Life?

For one of my programming classes this semester, my task was writing a program in Python. It was only an introductory course, so it didn’t have to be anything fancy. Whenever I get assignments like this one, my choice is simple: I enjoy tinkering around with cellular automata and particle simulations.

I have had an eye on Particle Life for quite a while now. The idea belongs to Jeffrey Ventrella, who called it "Clusters." As the name suggests, its simple rules result in the emergence of lifelike, cell-like structures interacting with one another. During my experiments, I was also amazed by the fact that applying the same rules on much larger scales gives us a resemblance to many different phenomena in nature, somewhat like electron clouds or cosmic structures. I, personally, found it mesmerizing, and I got completely absorbed by writing this program for the better part of a week, completely neglecting other things I needed to take care of.

But what do I even mean by emergence in the paragraph above?

Have you ever wondered how a flock of birds operates? How is it able to travel, forage, and respond to predators? An individual bird does not have a blueprint of the flock's final behavior and its part in it.

Instead, a singular bird operates based on simple rules guiding its behavior. For example, in boids, an algorithm simulating the flocking behavior of birds, an individual agent's interaction must adhere to (a simplified) set of rules:

  1. Stay separated to avoid crowding local flockmates.
  2. Align with the average direction of your local flockmates.
  3. Navigate towards the average position of your local flockmates to retain cohesion.

The complexity of the flock emerges from every agent in the system following these rules. Flock has distinct properties that arose from the behavior of its components, but there's no way for us to deduce these properties from the rules we applied to individual agents. This is why we can call a flock a complex system.

Human societies, the universe, climate, cities—all these are examples of complex systems. And the simulation we'll write is an example of it as well. You'll see for yourself how the structures we'll see on the screen are described nowhere in the code but emerge as a result of uncomplicated rules generated for every particle randomly each time we start the program.

There are multiple implementations of the algorithm available online. To contribute something new, I decided to focus on the optimization of the program. This is a deep-dive, several-piece series, but at the end, we'll render a million particles in real-time, which makes it completely worth the time spent on the project!

The rules

The first thing we need to specify is the number of classes of particles. We will denote these classes by their colors, and how particles interact with each other will depend on their respective colors. This force can be attractive or repulsive, depending on our color-attraction matrix. It doesn’t have to be symmetric: red particles can be attracted to green particles, but green particles can be repulsed by red ones.

To visualize this dependence, let’s see an example matrix with two classes of particles and forces ranging from -1.0 to 1.0:

RedBlue
Red.86-.15
Blue.33.24

In this case, red particles are strongly attracted to one another. They’re slightly repulsed by blue ones. Blue particles also attract each other, although in a weaker manner, and they’re also attracted to red particles.

This force works within a radius that we’ll specify. But within a fraction of this radius, we have another force, the universal repulsive force. Otherwise, particles that are attracted to each other would end up collapsing into one.

This is how we can visualize our force function, assuming the force value between two particles to be 1.0 and a repulsive radius of 1/4 of the max radius:

Of course, these rules are not set in stone. In the following parts, you will see how slight changes in force function affect our system:

So far, so good. You must admit that these rules are really beautifully simple. It's time to write a Python script to test if I can get it up and running.

After some time, and some frustrating errors originating from Qt, my particles came to life. Below, you can find the whole script. I tried to add explicit comments to every line so that there aren’t any confusing bits. A little note about system parameters: those are arbitrarily chosen values. You have to experiment for a bit to see what works and what doesn’t.

import numpy as np
import colorsys as cs
import random as rd
import sys
import math
from PySide6.QtCore import *
from PySide6.QtWidgets import *
from PySide6.QtGui import *

width = 800
height = 800

#function definitions
def buildAttractionMatrix(m):
    return (m*2)-1 # Multiple random between [0.0,1.0] by 2 and substract one to get random between [-1.0,1.0]

#With HSL, unlike RGB, to change color we have to update only one value [[H]SL -> Hue]. Hence, it will work perfectly
#for generating predefined no of colors, sufficiently distinct from one another. Since our GUI library accepts only RGB colors,
#we have to convert it RGB.
def createColors(m):
    colors = list()
    for i in range(m):
        hueValue = (i *( 360/(m - 1)))
        #hsvtorgb() returns values between [0.0,1.0] hence i*255
        color = tuple(round(i * 255) for i in cs.hsv_to_rgb(hueValue/100,1,1))
        colors.append(QColor(color[0],color[1],color[2]))
    return colors

def force(d, f): #d - ratio of particles distance and max distance , f - force according to attraction matrix
    repRadius = 0.25 #universal repulsive force that prevents our particles from collapsing into each other
    #25% is an arbitrary value I tested to behave the best - feel free to experiment, like with other system parameters.
    if (d<repRadius):
        #if d is within the range of universal repulsive we return negative value. As the distance approaches 0 returned value approaches -1 which is max value
        #for our repulsive force
        return (d/repRadius - 1)
    elif (repRadius < d and d<1): #our distance is within max radius for standard force defined by attraction matrix
        #the closer we get the particles together, the stronger their interaction is.
        #As the particle move apart from each other, the force dissipates to reach 0 as we
        #reach max radius
        return f * (1-d)
    else:
        return 0 #particles are indiffirent to each other

def updatePositions():
    #we initiate a loop to update all particles in our system
    for i in range(n):
        #every particle will accumulate total force value we'll get by evaluating it's attraction to every other particle in the system. Hence, our algorithm
        #has a time complexity of O(n^2) if we don't add any optimizations.
        totalForceX = 0
        totalForceY = 0
        for j in range(n):
            if (i!=j):#we're not calculating the particle's influence on itself
                #calculating distance between tested particle (i) to all other particles (j)
                x = positionsX[j]-positionsX[i]
                y = positionsY[j]-positionsY[i]
                d = math.sqrt(x * x + y* y)

                if (d<radius): #we're only proceeding with calculating distance if the distance between particle is less than max radius we specified for our system
                    #1st param: We have to apply force proportionally to the distance between particles instead of simply using the value from out matrix
                    #2nd param: To get the right value from the attraction matrix we need to get colors of particles we test
                    f = force(d/radius, attractionMatrix[colors.index(particlesColors[i])][colors.index(particlesColors[j])])
                    #lets add our force for this particle (j) to our total forces for particle (i)
                    totalForceX+=(x / d)*f
                    totalForceY+=(y / d)*f
        #total force we just calculated will increase the particle's velocity
        velocitiesX[i]+=totalForceX
        velocitiesY[i]+=totalForceY
        #we have to apply friction so that our particles don't accumulate velocity forever
        velocitiesX[i]*=fr
        velocitiesY[i]*=fr
        #now that we have the particles velocity we can know how to change its position
        positionsX[i]+=velocitiesX[i]
        positionsY[i]+=velocitiesY[i]

#system properties
n = 400 #number of particles
m = 5 #number of colors
radius = 0.20
fr = math.pow(.5, 10) #friction

#init arrays to hold our values
attractionMatrix = buildAttractionMatrix(np.random.random((m,m))) #create 2-d matrix based on number of colors
colors = createColors(m)
particlesColors = [rd.choice(colors) for _ in range(n)] #every particle is assigned random color from colors we created
#every particle is assigned random position
positionsX = [rd.random() for _ in range(n)]
positionsY = [rd.random() for _ in range(n)]
#every particle is assigned velocity = 0
velocitiesX = np.empty(n, dtype=float)
velocitiesY = np.empty(n, dtype=float)

class Window(QWidget):
    def __init__(self):
        super().__init__()

    def paintEvent(self, event):
        painter = QPainter(self)

        updatePositions()
        for i in range(n):
            x = positionsX[i] * width
            y = positionsY[i] * height
            painter.setPen(particlesColors[i])
            painter.drawEllipse(x, y, 1, 1)
            self.update()


app = QApplication(sys.argv)
window = Window()
window.height = height
window.width = width
window.show()

sys.exit(app.exec())

Choosing the right language for the job

There’s just one problem. My code is embarrassingly inefficient. I haven’t implemented any optimizations yet, but I didn’t expect it to be so bad! I work on a decent machine with an M2 processor, but I was barely able to run 500 particles before the animation stopped being smooth. It was my first time writing any Python code, so I suspected stepping on mine of using an inappropriate data structure or an unfortunate library function call (for example, later on, when I was implementing the same thing in javascript, calculating distance using Math.hypot() instead of Math.sqrt(xx+yy) turned out to be nearly 10x slower).

Unfortunately, after using line profiler I realized that it is basic arithmetic operations that take up most of the time. I know Python is not exactly the most efficient language out there, but I expected better.

At this stage, I decided that my Python script would be only a point of origin, and I would write my proper implementation in something else. I’d love to simply embed my work to a webpage, where everyone can play with it, but first I wanted to see how JS will perform with a large amount of arithmetics compared to low-level languages. I know that nowadays v8 is pretty performant, so I had a good feeling about this. I know, I know - you can use C modules for Python, but since I don’t plan to dive deep into Python anytime soon, this option didn’t interest me. Another thing I was curious about was how WebAssebly would perform compared to JS.

Before we start any benchmarking, we have to consider what we need to measure. What are the most costly operations of our algorithm?

First of all, let's stress the fact that to calculate the positions of every particle we have to iterate over every other particle in the system, which gives us O(N^2) complexity. So, it is the code inside the inner loop in the updatePositions() function that takes most of the time. What is this code responsible for? Simply speaking, it performs basic arithmetic operations on array elements.

It is important to realize you should never treat these types of benchmarks as an oracle guiding your development decisions. Here, it was useful. But there are many factors that influence code execution time besides language itself, and the first call should be always to optimize what you can with the language of your choice, unless it really is unreasonable, like using Python in this case.

Having this little disclaimer out of the way, let's see what we can expect from Python, C, JS, and WebAssembly for the tasks described above.

C:

int main(int argc, char * argv[]) {
  srand((unsigned int) time(NULL));

  int diff;
  int i;
  int j;
  int l = 10000;
  float t[l];

  clock_t start = clock();

  for (int i = 0; i < l; i++) {
    t[i] = 0.0;
    for (j = 0; j < l; j++) {
      t[i] += 0.02 * (float)j;
      t[i] *= 0.03 * (float)j;
      t[i] -= 0.04 * (float)j;
      t[i] /= 0.05 * (float)j+1.00;
    }
  }

  clock_t end = clock();
  int elapsed = (int)(end - start) / (CLOCKS_PER_SEC / 1000);
  printf("%d %s", elapsed, "ms elapsed");

  return 0;
}

JS:

const l = 10000;
const t = [];

const s = Date.now();

for (let i = 0; i < l; i++) {
  t[i] = 0;

  for (j = 0; j < l; j++) {
    t[i] += 0.02 * j;
    t[i] *= 0.03 * j;
    t[i] -= 0.04 * j;
    t[i] /= 0.05 * (j + 1);
  }
}

console.log(
  Date.now() - s,
  "ms elapsed"
);

Python:

import time
import random as rd
import numpy as np

l = 10000
t = np.empty(l, dtype=float)

s= time.time()
for i in range(l):
    t[i] = 0.0
    for j in range(l):
        t[i] += 0.02 * j
        t[i] *= 0.03 * j
        t[i] -= 0.04 * j
        t[i] /= 0.05 * (j+1)

print( (time.time()-s) * 1000, 'ms elapsed')

The results were interesting to say the least:

Few things require noting here:

  1. I didn't use any optimization flags when compiling my C code. My focus is on checking if JS is good enough for the job, not if JS is faster than C! This is a very important point to keep in mind.
  2. Node is using a V8 engine. When I ran the same file using Apple's JavascriptCore, the result was nearly twice as long. I can turn a blind eye to that when developing a hobby project, but for the real product, it would be quite an issue.

We need to remember that these benchmarks measure nothing but the efficiency of my little script. It would be unwise to draw any premature conclusions from it. Still, we should appreciate how optimized modern JS is. In the next part, you'll see that many of the go-to points developers usually go through when optimizing their code, give no advantage, because the V8 compiler is just smart enough to know to take care of it for us... If there's one lesson to be learned from the examples above, it would be that for things requiring performance, developers should never trust, and always verify. I was very surprised!

If JS performed so well compared to low-level C, you can probably already guess what my next point will be: using WebAssembly gives us no advantage in this case.

For the curious of you, WASM module compiled from this AssemblyScript:

export function updateParticles(): void {
  const l = 10000;
  const t: f32[] = [];

  const s = Date.now();

  for (let i = 0; i < l; i++) {
    t[i] = 0;

    for (let j = 0; j < l; j++) {
      t[i] += 0.02 * (j as f32);
      t[i] *= 0.03 * (j as f32);
      t[i] -= 0.04 * (j as f32);
      t[i] /= 0.05 * ((j + 1) as f32);
    }
  }

  console.log((Date.now() - s).toString() + " ms elapsed");
}

Yielded the following result. My research here was nearly non-existing, so I don't feel confident enough to make any further comments:

12878 ms elapsed

This is the end of the first part. In the next part, we’ll implement the same code in JS and use spatial partitioning methods to get rid of the scary O(N^2) complexity. In part 3, we’ll implement 3D life on GPU with the new WebGPU API and optimize our WGSL code. Spoiler alert: the goal is to reach one million particles!

To keep you interested, a few screenshots:

1000 2D particles - JS1000 2D particles - JS1 million 2D particles - GPU.js1 million 2D particles - GPU.js. First frames.50000 2D particles - vanilla JS optimized with quadtree.50000 2D particles - vanilla JS optimized with quadtree.
Playing around with webGPU implementation.256000 particles within interaction radius.256000 particles within interaction radius4 million particles.4 million particles. WebGPU.

See the next part