Depending on the underlying GPU architecture, the GPU can execute many rendering stages, such as vertex processing, fragment processing, and memory reads, in parallel for each draw call. A draw call completes only after all of its fragments are processed, so if the fragment shader runs slower than the vertex shader or other stages, those stages must wait for the fragment shader to finish.
You can optimize fragment shaders by decreasing the precision of calculations and by moving calculations from the fragment shader to the vertex shader.
If fragment shading is a performance bottleneck, decreasing precision so that an operation takes one GPU cycle instead of two can halve the fragment shader execution time.
To decrease the precision of a pixel shader, use the `precision` qualifier that matches the value range of the data:

- `lowp` for data such as colors (RGB data in the range [0..1]) and intensities in the range [0..1], but not, for example, for texture coordinates, which need more accurate precision. `lowp` supports the range [-2..2] with 8 bits of fractional precision.
- `mediump` for most of the rendering. Matrices need more accurate precision because the floating-point values are relatively small.
- `highp` for an accurate representation in 3D rendering, including matrices.

For example, this fragment shader uses `lowp` precision:

```glsl
uniform sampler2D Texture;
uniform lowp float BlendIntensity;

varying mediump vec2 vTexCoord;

void main()
{
    precision lowp float;

    vec4 color = texture2D(Texture, vTexCoord);
    gl_FragColor.rgba = color.rgba * BlendIntensity;
}
```

In comparison, the same shader with `mediump` precision takes roughly twice as many cycles:

```glsl
uniform sampler2D Texture;
// In comparison to lowp, mediump doubles the number of cycles.
uniform mediump float BlendIntensity;

varying mediump vec2 vTexCoord;

void main()
{
    // In comparison to lowp, mediump doubles the number of cycles.
    precision mediump float;

    vec4 color = texture2D(Texture, vTexCoord);
    gl_FragColor.rgba = color.rgba * BlendIntensity;
}
```
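In GLSL ES, `float` has no default precision in fragment shaders, and `highp` support in the fragment stage is optional. As a sketch, you can declare a default precision at the top of the shader and use the predefined `GL_FRAGMENT_PRECISION_HIGH` macro to fall back gracefully when `highp` is not available:

```glsl
// Default float precision for the whole fragment shader.
// The compiler predefines GL_FRAGMENT_PRECISION_HIGH when highp is
// supported in fragment shaders; otherwise fall back to mediump.
#ifdef GL_FRAGMENT_PRECISION_HIGH
precision highp float;
#else
precision mediump float;
#endif

uniform sampler2D Texture;
varying vec2 vTexCoord;

void main()
{
    gl_FragColor = texture2D(Texture, vTexCoord);
}
```

A default precision statement like this applies to every `float` in the shader, so individual declarations only need a qualifier when they deviate from the default.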
Use the vertex shader to calculate values that stay constant or are needed only a few times. Do the same for lighting calculations whose results can be interpolated from one vertex to the next without losing too much quality, because the number of vertices is usually much smaller than the number of fragments (except with very dense geometry).
For example, for a vertex shader use this code:
```glsl
attribute vec3 kzPosition;
attribute vec2 kzTextureCoordinate0;

uniform highp mat4 kzProjectionCameraWorldMatrix;
uniform mediump float kzTime;

varying mediump vec2 vTexCoord;
varying lowp vec4 vColor;

void main()
{
    precision mediump float;

    // The trigonometric operation is performed only for each vertex,
    // for example, for a quad 3 * 2 times (2 triangles containing
    // 3 vertices each).
    vColor = vec4(sin(kzTime));
    gl_Position = kzProjectionCameraWorldMatrix * vec4(kzPosition.xyz, 1.0);
}
```
For example, for a fragment shader use this code:
```glsl
varying lowp vec4 vColor;

void main()
{
    precision lowp float;

    // For each written fragment, only a constant interpolated assignment
    // with the same precision (lowp -> lowp) is applied. This should not
    // take more than one cycle on most GPUs.
    gl_FragColor.rgba = vColor;
}
```
For example, do not use this code for a vertex shader:
```glsl
attribute vec3 kzPosition;

uniform highp mat4 kzProjectionCameraWorldMatrix;

void main()
{
    precision mediump float;

    // The vertex shader only outputs the position and leaves the
    // calculation to the fragment shader, which is a bad idea when the
    // number of fragments exceeds the number of vertices.
    gl_Position = kzProjectionCameraWorldMatrix * vec4(kzPosition.xyz, 1.0);
}
```
For example, do not use this code for a fragment shader:
```glsl
uniform mediump float kzTime;

void main()
{
    precision lowp float;

    // For each written fragment, the trigonometric function sin() is
    // executed. Trigonometric functions are expensive: depending on the
    // GPU, they can take several cycles per fragment. The visual outcome
    // is the same as when storing the result in a varying in the
    // vertex shader.
    gl_FragColor.rgba = vec4(sin(kzTime));
}
```
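When a value such as sin(kzTime) is constant for the entire draw call, a further option, sketched here as an extension of the examples above, is to evaluate it once per frame on the CPU and upload the result as a uniform, so that neither the vertex nor the fragment shader performs any trigonometry. The `BlendColor` uniform below is a hypothetical name for that precomputed value:

```glsl
// BlendColor is assumed to be computed once per frame on the CPU side,
// for example as vec4(sin(time)), and uploaded as a uniform, so no
// per-vertex or per-fragment trigonometry is needed.
uniform lowp vec4 BlendColor;

void main()
{
    precision lowp float;

    gl_FragColor.rgba = BlendColor;
}
```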