Authoring compute shaders for performance
When you write a custom compute material, you are inside the per-particle hot loop. Every instruction runs MaximumAmount times per frame per emitter; poor habits here compound fast. This page collects the practices that reliably matter in the Kanzi Particles compute shaders shipped by the asset package.
Keep per-particle work proportional to what the particle needs
The stock affector.glsl include runs your Affector() once per particle per frame. Anything inside that function executes for every particle in the pool, regardless of whether the particle is in the affector’s region of influence.
- Early-out on spatial falloff. Every stock affector except Affector_Gravity uses Affector.Radius for a distance test. The test is cheap; the force computation behind it (noise sampling, trig, normalisation) is not. Read Radius first and return vec3(0.0) before touching the expensive path.
- Early-out on state. A kill-zone or velocity-clamp affector that gates on a bool flag should read that flag first.
- Read only what you use. particle is passed inout. Writing to a field you did not intend to modify (even back to its original value) does not cost anything in isolation, but every field you read from pulls from global memory.
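The radius early-out described above can be sketched as follows. This is a minimal illustration, not the stock implementation: ExampleAffector, ExampleParticle, expensiveForce() and ExampleAffectorForce() are hypothetical names that stand in for whatever affector.glsl actually declares, and the struct fields are assumptions.

```glsl
// Hypothetical stand-ins for the structs the stock includes provide.
struct ExampleAffector {
    vec3  position;
    float radius;
    float strength;
};

struct ExampleParticle {
    vec3 position;
    vec3 velocity;
};

// Placeholder for the costly path (noise sampling, trig, normalisation).
vec3 expensiveForce(ExampleAffector affector, vec3 toParticle, float distSq)
{
    // inversesqrt yields 1/length directly; the max() avoids a division by zero.
    vec3 direction = toParticle * inversesqrt(max(distSq, 1e-8));
    return -direction * affector.strength;
}

vec3 ExampleAffectorForce(ExampleAffector affector, ExampleParticle particle)
{
    // Only particle.position is read; velocity is never touched.
    vec3 toParticle = particle.position - affector.position;

    // Compare squared distances so the test itself needs no square root.
    float distSq = dot(toParticle, toParticle);
    if (distSq > affector.radius * affector.radius)
    {
        return vec3(0.0); // outside the region of influence: skip the costly path
    }

    return expensiveForce(affector, toParticle, distSq);
}
```

The squared-distance comparison is a design choice of this sketch; it keeps the rejected path down to one dot product and one multiply.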
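Prefer built-ins over expanded math
GPU built-ins map to native instructions or to instruction sequences the shader compiler already optimises well. Replacing dot(v, v) with v.x*v.x + v.y*v.y + v.z*v.z is at best equally fast and can be slower, because it hides the intent from the compiler.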
Built-ins worth calling out in particle work:
- length() / distance() / dot(): never hand-roll.
- smoothstep() for falloffs: cheaper than a pow chain and better-behaved at edges.
- mix() for ramps: cheaper than a * (1.0 - t) + b * t.
- normalize(): use it if you need the direction only. If you need both the magnitude and the direction, compute length once and divide.
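As an illustration of the last two points, the sketch below computes the length once, reuses it for the direction, and shapes the falloff with smoothstep(). The function name and parameters are hypothetical and not part of the stock includes; it assumes radius is greater than zero, because smoothstep() is undefined when its two edges coincide.

```glsl
// Hypothetical helper: radial push away from `center` with a smooth falloff.
// Assumes radius > 0.0.
vec3 exampleRadialForce(vec3 particlePos, vec3 center, float radius, float strength)
{
    vec3 offset = particlePos - center;

    // One square root, shared by the direction and the falloff.
    float dist = length(offset);

    // Reuse dist instead of calling normalize(), which would recompute it.
    // The max() guards against a particle sitting exactly on the centre.
    vec3 direction = offset / max(dist, 1e-6);

    // 1.0 at the centre, easing to 0.0 at `radius` and beyond.
    float falloff = 1.0 - smoothstep(0.0, radius, dist);

    return direction * (strength * falloff);
}
```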
Avoid divergent branches inside the hot loop
GPU warps execute in lockstep. A branch where different particles take different paths causes both paths to execute serially on the warp; the cost is the sum of both branches, not the average.
- Prefer mix() or conditional assignment over if / else for value selection.
- For rare cases (dead particles, out-of-zone particles) an early return is fine: warps whose lanes all return early skip the remainder of the shader.
- step() and smoothstep() gate arithmetic without branching.
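A minimal sketch of branch-free value selection with step() and mix(). The scenario (a drag coefficient that differs above and below a water level) and every name in it are hypothetical, chosen only to show the pattern.

```glsl
// Hypothetical: pick a drag coefficient without an if / else.
float exampleSelectDrag(float height, float waterLevel, float airDrag, float waterDrag)
{
    // step(edge, x) returns 0.0 when x < edge and 1.0 otherwise,
    // so inAir is 1.0 at or above the water level and 0.0 below it.
    float inAir = step(waterLevel, height);

    // mix(a, b, t) returns a when t == 0.0 and b when t == 1.0.
    return mix(waterDrag, airDrag, inAir);
}

// Hypothetical: damp velocity using the selected coefficient.
vec3 exampleApplyDrag(vec3 velocity, float height, float waterLevel,
                      float airDrag, float waterDrag, float deltaTime)
{
    float drag = exampleSelectDrag(height, waterLevel, airDrag, waterDrag);

    // Clamp so a large drag * deltaTime can never reverse the velocity.
    return velocity * max(0.0, 1.0 - drag * deltaTime);
}
```

Every lane executes the same instructions here regardless of which side of the water level its particle is on, so there is nothing for the warp to serialise.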
Hash, do not sample
Particle shaders need per-particle randomness (jitter, spread, turbulence). The stock shaders use an integer hash (hash11, hash33 from the includes) keyed on emitIndex or particle.index. This is faster and more deterministic than sampling a noise texture.
If you need a coherent noise field (curl, wind gusts), curlNoise3D is already provided by Particles/noise.glsl and is tuned for GPU cost. Do not hand-roll a Perlin or Simplex pipeline.
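To show the general idea of index-keyed randomness, here is a small integer hash sketch. It is explicitly not the stock hash11 or hash33: the functions below are hypothetical, use a PCG-style mixing step swapped in for illustration, and carry an example prefix so they cannot be confused with the provided includes. In real shaders, prefer the stock functions.

```glsl
// Hypothetical PCG-style integer hash; NOT the stock hash11 / hash33.
uint examplePcgHash(uint v)
{
    uint state = v * 747796405u + 2891336453u;
    uint word  = ((state >> ((state >> 28u) + 4u)) ^ state) * 277803737u;
    return (word >> 22u) ^ word;
}

// Map a key to a float in [0, 1). The top 24 bits are used because a
// 32-bit float represents every integer below 2^24 exactly.
float exampleHashToUnit(uint key)
{
    return float(examplePcgHash(key) >> 8u) * (1.0 / 16777216.0);
}

// Per-particle jitter in [-amount, amount) on each axis.
// `particleIndex` stands in for a key such as emitIndex or particle.index.
vec3 exampleJitter(uint particleIndex, float amount)
{
    // A distinct key per axis keeps the three components uncorrelated.
    float x = exampleHashToUnit(particleIndex * 3u + 0u);
    float y = exampleHashToUnit(particleIndex * 3u + 1u);
    float z = exampleHashToUnit(particleIndex * 3u + 2u);
    return (vec3(x, y, z) * 2.0 - 1.0) * amount;
}
```

The same index always yields the same jitter, which is the determinism the text refers to: no texture fetch and no filtering state are involved.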
Buffer access: write only what you need to
The integrator and the particle-data readbacks are already optimised. In your shader:
- Do not write back unmodified fields. The engine’s own passes write position and velocity; your affector only needs to return a force.
- Read prevPosition rarely. It is populated every frame by the Verlet integrator but is only meaningful if your shader needs the second derivative.
- Treat buffers bound as readonly as such. Source-emitter buffers on a subemitter (Emission.SourceEventBuffer / SourceParticleDataBuffer) are read-only: writing to them is undefined behaviour.
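One way to protect yourself against accidental writes is to declare the block with the GLSL readonly memory qualifier, so that a conforming compiler should reject any assignment to it. The standalone compute shader below is only a sketch of that qualifier: the block names, struct layout, binding points, and workgroup size are all assumptions and do not reflect how Kanzi Particles actually binds its source-emitter buffers.

```glsl
#version 310 es
precision highp float;

layout(local_size_x = 64) in;

// Hypothetical layout; the real source-particle struct may differ.
struct ExampleSourceParticle {
    vec4 position;
    vec4 velocity;
};

// readonly: the shader may read this block but never assign to it.
layout(std430, binding = 0) readonly buffer ExampleSourceParticleData {
    ExampleSourceParticle sourceParticles[];
};

// Hypothetical output block that this shader is allowed to write.
layout(std430, binding = 1) buffer ExampleSpawnPositions {
    vec4 spawnPositions[];
};

void main()
{
    uint index = gl_GlobalInvocationID.x;

    // Stay inside both buffers; their sizes are not assumed to match.
    uint count = uint(min(sourceParticles.length(), spawnPositions.length()));
    if (index >= count)
    {
        return;
    }

    // Read one field from the source, write one field to the output.
    spawnPositions[index] = sourceParticles[index].position;

    // sourceParticles[index].position = vec4(0.0); // rejected: block is readonly
}
```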
Let Sync with Uniforms do the wiring
After editing a shader, press Sync with Uniforms in Kanzi Studio. It regenerates MaterialTypePropertyTypes and its binding entries automatically. Hand-editing the material type’s property list and trying to remember which uniforms are live is a reliable source of ghost bindings that silently read zero.
Measure before optimising
In Kanzi Studio Profiling or with a GPU profiler attached:
- Time spent in affector / collider passes scales linearly with MaximumAmount. If an effect is GPU-bound, reducing MaximumAmount often beats micro-optimising the shader.
- Depth and Reverse Depth sort cost scales with MaximumAmount × log(MaximumAmount). Sort type selection covers when to pay it.
- A compute pass that takes 2 ms on a desktop GPU may take 20 ms on an embedded target: test on the real hardware.