Authoring compute shaders for performance

When you write a custom compute material, you are inside the per-particle hot loop. Every instruction runs MaximumAmount times per frame per emitter, so poor habits compound fast. This page collects the practices that reliably matter in the Kanzi Particles compute shaders shipped with the asset package.

Keep per-particle work proportional to what the particle needs

The stock affector.glsl include runs your Affector() once per particle per frame. Anything inside that function executes for every particle in the pool, regardless of whether the particle is in the affector’s region of influence.

  • Early-out on spatial falloff. Every stock affector except Affector_Gravity uses Affector.Radius for a distance test. The test is cheap; the force computation behind it (noise sampling, trig, normalisation) is not. Read Radius first and return vec3(0.0) for out-of-range particles before touching the expensive path.

  • Early-out on state. A kill-zone or velocity-clamp affector that gates on a bool flag should read that flag first.

  • Read only what you use. particle is passed as inout. Writing a field back unchanged costs nothing in isolation, but every field you read is a load from global memory.
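
The early-out ordering described above can be sketched in plain GLSL. This is a minimal illustration, not the stock affector code: exampleRadialForce and its parameters are hypothetical names, and the real Affector() signature and the way Affector.Radius is read come from the affector.glsl include.

```glsl
// Hypothetical helper for illustration only; not part of the Kanzi includes.
// Returns a force pulling the particle toward `center`, or zero outside `radius`.
// Assumes radius > 0.
vec3 exampleRadialForce(vec3 particlePosition, vec3 center, float radius, float strength)
{
    vec3 toCenter = center - particlePosition;

    // Cheap test first: compare squared distances, so no sqrt is needed yet.
    float distanceSq = dot(toCenter, toCenter);
    if (distanceSq >= radius * radius || distanceSq <= 0.0)
    {
        return vec3(0.0);
    }

    // The more expensive path runs only for particles inside the radius.
    float dist = sqrt(distanceSq);
    vec3 direction = toCenter / dist;
    float falloff = 1.0 - smoothstep(0.0, radius, dist);
    return direction * (strength * falloff);
}
```

The squared-distance comparison keeps the rejected path down to one dot product and one multiply; the square root and division are paid only by particles that actually receive a force.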

Prefer built-ins over expanded math

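GPU built-ins map to native instructions or to short instruction sequences that the shader compiler already optimises well. Replacing dot(v, v) with v.x*v.x + v.y*v.y + v.z*v.z is at best no faster, can be slower if the compiler misses a fused multiply-add, and is harder to read.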

Built-ins worth calling out in particle work:

  • length() / distance() / dot(): never hand-roll.

  • smoothstep() for falloffs: cheaper than a pow chain and better-behaved at edges.

  • mix() for ramps: at least as cheap as a * (1.0 - t) + b * t, and it states the intent directly.

  • normalize(): use it when you need only the direction. If you need both the magnitude and the direction, compute length() once and divide by it.
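
As a small sketch of the last two points, the helpers below use only standard GLSL built-ins. Both function names are hypothetical and exist purely for illustration; they are not part of the Kanzi includes.

```glsl
// Hypothetical helpers for illustration only.

// Splits a vector into direction and magnitude with a single length() call,
// instead of calling normalize() and length() separately.
vec3 exampleDirectionAndMagnitude(vec3 v, out float magnitude)
{
    magnitude = length(v);
    // Guard against division by zero for a zero-length input.
    return magnitude > 0.0 ? v / magnitude : vec3(0.0);
}

// Linear ramp between two values; t is clamped so the result stays between a and b.
float exampleRamp(float a, float b, float t)
{
    return mix(a, b, clamp(t, 0.0, 1.0));
}
```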

Avoid divergent branches inside the hot loop

GPU warps execute in lockstep. A branch where different particles take different paths causes both paths to execute serially on the warp; the cost is the sum of both branches, not the average.

  • Prefer mix() or conditional assignment over if / else for value selection.

  • For rare cases (dead particles, out-of-zone particles) an early return is fine: warps whose lanes all return early skip the remainder of the shader.

  • step() and smoothstep() gate arithmetic without branching.
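
A minimal sketch of branch-free value selection with step() and mix() follows. The function and parameter names are hypothetical and chosen only for illustration.

```glsl
// Hypothetical helper for illustration only.
// Selects between two drag coefficients without an if / else.
float exampleSelectDrag(float height, float waterLevel, float dragInAir, float dragInWater)
{
    // step(edge, x) is 0.0 when x < edge and 1.0 otherwise.
    float aboveWater = step(waterLevel, height);

    // mix(x, y, a) returns x when a is 0.0 and y when a is 1.0.
    return mix(dragInWater, dragInAir, aboveWater);
}
```

This pattern suits cheap operands, because both inputs are always evaluated. For an expensive path that most particles skip, the early return described above is usually the better fit.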

Hash, do not sample

Particle shaders need per-particle randomness (jitter, spread, turbulence). The stock shaders use an integer hash (hash11, hash33 from the includes) keyed on emitIndex or particle.index. This is faster than sampling a noise texture, since it needs no texture fetch, and it is reproducible: the same key always yields the same value.

If you need a coherent noise field (curl, wind gusts), curlNoise3D is already provided by Particles/noise.glsl and is tuned for GPU cost. Do not hand-roll a Perlin or Simplex pipeline.
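
To show what keying an integer hash on a particle index looks like, here is a sketch that uses a PCG-style hash as a stand-in. It is not the stock implementation: the hash11 and hash33 functions in the Kanzi includes may use a different algorithm and signature, and every name below is hypothetical. The sketch assumes unsigned integer support and highp floats.

```glsl
// Hypothetical stand-in for illustration; not the hash shipped in the Kanzi includes.
// PCG-style integer hash: scrambles a 32-bit input into a 32-bit output.
uint examplePcgHash(uint value)
{
    uint state = value * 747796405u + 2891336453u;
    uint word = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// Maps the hash to a float in [0, 1). Only the top 24 bits are used,
// because a 32-bit float represents integers up to 2^24 exactly.
float exampleRandom01(uint particleIndex, uint salt)
{
    uint h = examplePcgHash(particleIndex ^ examplePcgHash(salt));
    return float(h >> 8u) * (1.0 / 16777216.0);
}
```

Passing a different salt for each use (one for jitter, another for spread) gives the same particle separate random values without any stored state.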

Buffer access: write only what you need to

The integrator and the particle-data readbacks are already optimised. In your own shader:

  • Do not write back unmodified fields. The engine’s own passes write position and velocity; your affector only needs to return a force.

  • Read prevPosition rarely. It is populated every frame by the Verlet integrator but is only meaningful if your shader needs the second derivative.

  • Treat buffers bound as readonly as such. Source-emitter buffers on a subemitter (Emission.SourceEventBuffer / SourceParticleDataBuffer) are read-only: writing to them is undefined behaviour.

Let Sync with Uniforms do the wiring

After editing a shader, press Sync with Uniforms in Kanzi Studio. It regenerates MaterialTypePropertyTypes and its binding entries automatically. Hand-editing the material type’s property list and trying to remember which uniforms are live is a reliable source of ghost bindings that silently read zero.

Measure before optimising

In Kanzi Studio Profiling or with a GPU profiler attached:

  • Time spent in affector / collider passes scales linearly with MaximumAmount. If an effect is GPU-bound, reducing MaximumAmount often beats micro-optimising the shader.

  • Depth and Reverse Depth sort cost scales as MaximumAmount × log(MaximumAmount). Sort type selection covers when that cost is worth paying.

  • A compute pass that takes 2 ms on a desktop GPU may take 20 ms on an embedded target: test on the real hardware.

See also