Optimization using batching

Introduction

Game engines have to send a set of instructions to the GPU to tell it what to draw and where. These instructions are sent through standardized interfaces called graphics APIs. Examples of graphics APIs are OpenGL, OpenGL ES, and Vulkan.

Different APIs incur different costs when drawing objects. OpenGL handles a lot of work for the user in the GPU driver, at the cost of more expensive draw calls. As a result, applications can often be sped up by reducing the number of draw calls.

Note

2D batching is currently only supported when using the GLES2 renderer.

Draw calls

In 2D, we need to tell the GPU to render a series of primitives (rectangles, lines, polygons, etc.). The most obvious technique is to tell the GPU to render one primitive at a time, passing it some information such as the texture used, the material, the position, the size, etc., and then saying "Draw!" (this is called a draw call).

While this is conceptually simple from the engine side, GPUs operate very slowly when used in this manner. GPUs work much more efficiently if you tell them to draw a number of similar primitives all in one draw call, which we will call a "batch".

It turns out that they don't just work a bit faster when used in this manner; they work a lot faster.

As Godot is designed to be a general-purpose engine, the primitives coming into the Godot renderer can be in any order, sometimes similar, and sometimes dissimilar. To match Godot's general-purpose nature with the batching preferences of GPUs, Godot features an intermediate layer which can automatically group together primitives wherever possible and send these batches on to the GPU. This can give an increase in rendering performance while requiring few (if any) changes to your Godot project.

How it works

Instructions come into the renderer from your game in the form of a series of items, each of which can contain one or more commands. The items correspond to Nodes in the scene tree, and the commands correspond to primitives such as rectangles or polygons. Some items such as TileMaps and text can contain a large number of commands (tiles and glyphs respectively). Others, such as sprites, may only contain a single command (a rectangle).

The batcher uses two main techniques to group primitives together:

  • Consecutive items can be joined together.
  • Consecutive commands within an item can be joined to form a batch.

Breaking batching

Batching can only take place if the items or commands are similar enough to be rendered in one draw call. Certain changes (or techniques), by necessity, prevent the formation of a contiguous batch. This is referred to as "breaking batching".

Batching will be broken by (amongst other things):

  • Change of texture.
  • Change of material.
  • Change of primitive type (say, going from rectangles to lines).

Note

For example, if you draw a series of sprites each with a different texture, there is no way they can be batched.
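The grouping rule described above can be sketched in a few lines of Python. This is only an illustration of the idea, not Godot's actual implementation: the Command class and its fields are hypothetical, and only the three batch-breaking properties listed above are modeled.

```python
from dataclasses import dataclass
from itertools import groupby


@dataclass(frozen=True)
class Command:
    """One primitive to draw (hypothetical structure, for illustration only)."""
    kind: str      # primitive type, e.g. "rect" or "line"
    texture: str
    material: str


def batch_key(cmd):
    # A change in any of these values breaks the batch.
    return (cmd.kind, cmd.texture, cmd.material)


def build_batches(commands):
    """Group consecutive compatible commands; the draw order is never changed."""
    return [list(group) for _, group in groupby(commands, key=batch_key)]


commands = [
    Command("rect", "grass", "default"),
    Command("rect", "grass", "default"),
    Command("rect", "wood", "default"),   # texture change
    Command("line", "wood", "default"),   # primitive type change
    Command("rect", "grass", "default"),  # matches the first two, but is not consecutive
]
batches = build_batches(commands)
print(len(commands), "commands ->", len(batches), "draw calls")
```

Note that the last command cannot join the first batch even though it is compatible with it, because only consecutive commands are grouped.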

Determining the rendering order

This raises a question: if only similar items can be drawn together in a batch, why don't we look through all the items in a scene, group together all the similar items, and draw them together?

In 3D, this is often exactly how engines work. However, in Godot's 2D renderer, items are drawn in "painter's order", from back to front. This ensures that items at the front are drawn on top of earlier items when they overlap.

This also means that if we try and draw objects on a per-texture basis, then this painter's order may break and objects will be drawn in the wrong order.

In Godot, this back-to-front order is determined by:

  • The order of objects in the scene tree.
  • The Z index of objects.
  • The canvas layer.
  • YSort nodes.

Note

You can group similar objects together for easier batching. While doing so is not a requirement on your part, think of it as an optional approach that can improve performance in some cases. See the Diagnostics section to help you make this decision.

A trick

And now, a sleight of hand. Even though the idea of painter's order is that objects are rendered from back to front, consider 3 objects A, B and C, that contain 2 different textures: grass and wood.

../../_images/overlap1.png

In painter's order they are ordered:

A - wood
B - grass
C - wood

Because of the texture changes, they can't be batched and will be rendered in 3 draw calls.

However, painter's order is only needed on the assumption that the objects will be drawn on top of each other. If we relax that assumption, i.e. if none of these 3 objects overlap, there is no need to preserve painter's order. The rendered result will be the same. What if we could take advantage of this?

Item reordering

../../_images/overlap2.png

It turns out that we can reorder items. However, this can only be done if the items satisfy the conditions of an overlap test, which ensures that the end result is the same as if they had not been reordered. The overlap test is very cheap in performance terms, but not absolutely free, so there is a slight cost to looking ahead to decide whether items can be reordered. The number of items to look ahead for reordering can be set in the project settings (see below), in order to balance the costs and benefits in your project.

A - wood
C - wood
B - grass

Since the texture only changes once, we can render the above in only 2 draw calls.
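The reordering step can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine's code: the Item class, the axis-aligned rectangle overlap test, and the exact meaning of the lookahead parameter are all hypothetical choices made for the example.

```python
from dataclasses import dataclass


@dataclass
class Item:
    name: str
    texture: str
    rect: tuple  # (x, y, width, height), hypothetical bounding rectangle


def overlaps(a, b):
    """Axis-aligned rectangle overlap test."""
    ax, ay, aw, ah = a.rect
    bx, by, bw, bh = b.rect
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def reorder(items, lookahead):
    """Pull a matching later item forward when it overlaps nothing it skips."""
    items = list(items)
    for i in range(len(items) - 1):
        current = items[i]
        if items[i + 1].texture == current.texture:
            continue  # the next item already batches with this one
        last = min(len(items), i + 2 + lookahead)
        for j in range(i + 2, last):
            candidate = items[j]
            skipped = items[i + 1:j]
            if candidate.texture == current.texture and not any(
                overlaps(candidate, s) for s in skipped
            ):
                # Safe to move: the candidate is drawn over none of the skipped items.
                items.insert(i + 1, items.pop(j))
                break
    return items


def count_draw_calls(items):
    """One draw call per run of consecutive items sharing a texture."""
    return sum(
        1 for i, item in enumerate(items)
        if i == 0 or item.texture != items[i - 1].texture
    )


scene = [
    Item("A", "wood", (0, 0, 10, 10)),
    Item("B", "grass", (20, 0, 10, 10)),
    Item("C", "wood", (40, 0, 10, 10)),
]
reordered = reorder(scene, lookahead=1)
print([item.name for item in reordered], count_draw_calls(reordered))
```

If C were moved so that it overlapped B, the overlap test would fail and the original order (and its 3 draw calls) would be kept.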

Lights

Although the batching system's job is normally quite straightforward, it becomes considerably more complex when 2D lights are used. This is because lights are drawn using additional passes, one for each light affecting the primitive. Consider 2 sprites A and B, with identical texture and material. Without lights, they would be batched together and drawn in one draw call. But with 3 lights, they would be drawn as follows, each line being a draw call:

../../_images/lights_overlap.png
A
A - light 1
A - light 2
A - light 3
B
B - light 1
B - light 2
B - light 3

That is a lot of draw calls: 8 for only 2 sprites. Now, consider we are drawing 1,000 sprites. The number of draw calls quickly becomes astronomical and performance suffers. This is partly why lights have the potential to drastically slow down 2D rendering.

However, if you remember the trick from item reordering, it turns out we can use the same trick to get around painter's order for lights!

If A and B are not overlapping, we can render them together in a batch, so the drawing process is as follows:

../../_images/lights_separate.png
AB
AB - light 1
AB - light 2
AB - light 3

That is only 4 draw calls. Not bad, as that is a 2× reduction. However, consider that in a real game, you might be drawing closer to 1,000 sprites.

  • Before: 1000 × 4 = 4,000 draw calls.
  • After: 1 × 4 = 4 draw calls.

This is a 1000× decrease in draw calls, and should give a huge increase in performance.
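The arithmetic above can be written as a small helper. It is a sketch that assumes every light affects every sprite and that each batch needs one base pass plus one pass per light; the function name is made up for the example.

```python
def lit_draw_calls(num_batches, num_lights):
    """One base pass plus one additional pass per light, for every batch."""
    return num_batches * (1 + num_lights)


separate_pair = lit_draw_calls(2, 3)      # sprites A and B drawn separately
joined_pair = lit_draw_calls(1, 3)        # A and B joined into one batch
separate_many = lit_draw_calls(1000, 3)   # 1,000 sprites drawn separately
joined_many = lit_draw_calls(1, 3)        # 1,000 sprites joined into one batch

print(separate_pair, joined_pair, separate_many, joined_many)
```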

Overlap test

However, as with item reordering, things are not that simple. We must first perform the overlap test to determine whether we can join these primitives. This overlap test has a small cost. Again, you can choose the number of primitives to look ahead in the overlap test to balance the benefits against the cost. With lights, the benefits usually far outweigh the costs.

Also consider that depending on the arrangement of primitives in the viewport, the overlap test will sometimes fail (because the primitives overlap and therefore shouldn't be joined). In practice, the decrease in draw calls may be less dramatic than in a perfect situation with no overlapping at all. However, performance is usually far higher than without this lighting optimization.

Light scissoring

Batching can make it more difficult to cull out objects that are not affected or partially affected by a light. This can increase the fill rate requirements quite a bit and slow down rendering. Fill rate is the rate at which pixels are colored. It is another potential bottleneck unrelated to draw calls.

In order to counter this problem (and speed up lighting in general), batching introduces light scissoring. This enables the use of the OpenGL command glScissor(), which identifies an area outside of which the GPU won't render any pixels. We can greatly optimize fill rate by identifying the intersection area between a light and a primitive, and limit rendering the light to that area only.

Light scissoring is controlled with the scissor_area_threshold project setting. This value is between 1.0 and 0.0, with 1.0 being off (no scissoring), and 0.0 being scissoring in every circumstance. The reason for the setting is that there may be some small cost to scissoring on some hardware. That said, scissoring should usually result in performance gains when you're using 2D lighting.

The relationship between the threshold and whether a scissor operation takes place is not always straightforward. Generally, it represents the pixel area that is potentially "saved" by a scissor operation (i.e. the fill rate saved). At 1.0, the entire screen's pixels would need to be saved, which rarely (if ever) happens, so it is switched off. In practice, the useful values are close to 0.0, as only a small percentage of pixels need to be saved for the operation to be useful.

The exact relationship is probably not necessary for users to worry about, but is included in the appendix out of interest: Light scissoring threshold calculation

Light scissoring example diagram

Bottom right is a light, the red area is the pixels saved by the scissoring operation. Only the intersection needs to be rendered.
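As a rough illustration of the diagram, the snippet below computes the intersection rectangle that would be used as the scissor area. The rectangle values are invented, and treating the "saved" area as the primitive's area minus the intersection is an assumption made for this sketch, not a statement about the engine's exact calculation.

```python
def intersect(a, b):
    """Return the intersection of two (x, y, width, height) rects, or None."""
    x1 = max(a[0], b[0])
    y1 = max(a[1], b[1])
    x2 = min(a[0] + a[2], b[0] + b[2])
    y2 = min(a[1] + a[3], b[1] + b[3])
    if x2 <= x1 or y2 <= y1:
        return None  # the light does not touch the primitive at all
    return (x1, y1, x2 - x1, y2 - y1)


def area(rect):
    return rect[2] * rect[3]


primitive = (0, 0, 400, 300)    # hypothetical primitive covering 400x300 pixels
light = (300, 200, 400, 400)    # hypothetical light touching its bottom-right corner

scissor = intersect(primitive, light)
# Assumption: pixels outside the intersection are the ones scissoring saves.
saved_pixels = area(primitive) - area(scissor)
print(scissor, saved_pixels)
```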

Vertex baking

The GPU shader receives instructions on what to draw in 2 main ways:

  • Shader uniforms (e.g. modulate color, item transform).
  • Vertex attributes (vertex color, local transform).

However, within a single draw call (batch), we cannot change uniforms. This means that naively, we would not be able to batch together items or commands that change final_modulate or an item's transform. Unfortunately, that happens in an awful lot of cases. For instance, sprites are typically individual nodes with their own item transform, and they may have their own color modulate as well.

To get around this problem, the batching can "bake" some of the uniforms into the vertex attributes.

  • The item transform can be combined with the local transform and sent in a vertex attribute.
  • The final modulate color can be combined with the vertex colors and sent in a vertex attribute.

In most cases, this works fine, but this shortcut breaks down if a shader expects these values to be available individually rather than combined. This can happen in custom shaders.
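A minimal sketch of what baking means, assuming a 2D affine transform stored as two basis vectors plus an origin and colors stored as RGBA tuples. The function name and data layout are hypothetical and only illustrate how per-item values can be folded into per-vertex values.

```python
def bake_vertex(position, color, item_xform, modulate):
    """Fold the per-item transform and modulate color into one vertex.

    item_xform is ((xx, xy), (yx, yy), (ox, oy)): basis x, basis y, origin.
    This layout is an assumption made for the example.
    """
    (xx, xy), (yx, yy), (ox, oy) = item_xform
    px, py = position
    baked_position = (xx * px + yx * py + ox, xy * px + yy * py + oy)
    baked_color = tuple(c * m for c, m in zip(color, modulate))
    return baked_position, baked_color


# An item scaled by 2 and moved to (100, 50), tinted with a reddish modulate.
item_xform = ((2, 0), (0, 2), (100, 50))
modulate = (1.0, 0.5, 0.5, 1.0)

baked_position, baked_color = bake_vertex((1, 1), (1.0, 1.0, 1.0, 1.0), item_xform, modulate)
print(baked_position, baked_color)
```

After baking, the shader only sees the combined values, which is why a custom shader that needs the original vertex color or position on its own cannot use this shortcut.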

Custom shaders

As a result of the limitation described above, certain operations in custom shaders will prevent vertex baking and therefore decrease the potential for batching. While we are working to decrease these cases, the following caveats currently apply:

  • Reading or writing COLOR or MODULATE disables vertex color baking.
  • Reading VERTEX disables vertex position baking.

Project Settings

To fine-tune batching, a number of project settings are available. You can usually leave these at default during development, but it's a good idea to experiment to ensure you are getting maximum performance. Spending a little time tweaking parameters can often give considerable performance gains for very little effort. See the on-hover tooltips in the Project Settings for more information.

Rendering / Batching / Options

  • use_batching - Turns batching on or off.
  • use_batching_in_editor - Turns batching on or off in the Godot editor. This setting doesn't affect the running project in any way.
  • single_rect_fallback - This is a faster way of drawing unbatchable rectangles. However, it may lead to flicker on some hardware so it's not recommended.

Rendering / Batching / Parameters

  • max_join_item_commands - One of the most important ways of achieving batching is to join suitable adjacent items (nodes) together. However, they can only be joined if the commands they contain are compatible. The system must therefore look ahead through the commands in an item to determine whether it can be joined. This has a small cost per command, and items with a large number of commands are not worth joining, so the best value may be project dependent.
  • coloured_vertex_format_threshold - Baking colors into vertices results in a larger vertex format. This is not necessarily worth doing unless there are a lot of color changes going on within a joined item. This parameter is the ratio of commands containing color changes to the total number of commands, above which batching switches to baked colors.
  • batch_buffer_size - This determines the maximum size of a batch. It doesn't have a huge effect on performance, but it can be worth decreasing on mobile devices if RAM is at a premium.
  • item_reordering_lookahead - Item reordering can help especially with interleaved sprites using different textures. The lookahead for the overlap test has a small cost, so the best value may change per project.

Rendering / Batching / Lights

  • scissor_area_threshold - See light scissoring.
  • max_join_items - Joining items before lighting can significantly increase performance. This requires an overlap test, which has a small cost, so the costs and benefits, and hence the best value to use here, may be project dependent.

Rendering / Batching / Debug

  • flash_batching - This is purely a debugging feature to identify regressions between the batching and legacy renderer. When it is switched on, the batching and legacy renderer are used alternately on each frame. This will decrease performance, and should not be used for your final export, only for testing.
  • diagnose_frame - This will periodically print a diagnostic batching log to the Godot IDE / console.

Rendering / Batching / Precision

  • uv_contract - On some hardware (notably some Android devices), there have been reports of TileMap tiles drawing slightly outside their UV range, leading to edge artifacts such as lines around tiles. If you see this problem, try enabling UV contract. This applies a small contraction to the UV coordinates to compensate for precision errors on devices.
  • uv_contract_amount - Hopefully, the default amount should cure artifacts on most devices, but this value remains adjustable just in case.

Diagnostics

Although you can change parameters and examine the effect on frame rate, this can feel like working blindly, with no idea of what is going on under the hood. To help with this, batching offers a diagnostic mode, which will periodically print out (to the IDE or console) a list of the batches that are being processed. This can help pinpoint situations where batching isn't occurring as intended, and help you fix these situations to get the best possible performance.

Reading a diagnostic

canvas_begin FRAME 2604
items
    joined_item 1 refs
            batch D 0-0
            batch D 0-2 n n
            batch R 0-1 [0 - 0] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
canvas_end

This is a typical diagnostic.

  • joined_item: A joined item can contain 1 or more references to items (nodes). Generally, a few joined_items containing many references are preferable to many joined_items containing a single reference. Whether items can be joined is determined by their contents and their compatibility with the previous item.
  • batch R: A batch containing rectangles. The second number is the number of rects. The second number in square brackets is the Godot texture ID, and the numbers in curly braces are the color. If the batch contains more than one rect, MULTI is added to the line to make it easy to identify. Seeing MULTI is good as it indicates successful batching.
  • batch D: A default batch, containing everything else that is not currently batched.

Default batches

The second number following default batches is the number of commands in the batch, and it is followed by a brief summary of the contents:

l - line
PL - polyline
r - rect
n - ninepatch
PR - primitive
p - polygon
m - mesh
MM - multimesh
PA - particles
c - circle
t - transform
CI - clip_ignore

You may see "dummy" default batches containing no commands; you can ignore those.
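For long logs, it can help to summarize the output with a small script. The sketch below is not an official tool: it assumes the log lines look exactly like the sample shown earlier, and the regular expression and field names are inferred from that sample only.

```python
import re

# Matches e.g. "batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI"
BATCH_R = re.compile(r"batch R (\d+)-(\d+) \[(\d+) - (\d+)\]")


def summarize(log):
    """Count joined items, default batches, rect batches and MULTI batches."""
    summary = {"joined_items": 0, "default_batches": 0,
               "rect_batches": 0, "multi_batches": 0, "rects": 0}
    for raw_line in log.splitlines():
        line = raw_line.strip()
        if line.startswith("joined_item"):
            summary["joined_items"] += 1
        elif line.startswith("batch D"):
            summary["default_batches"] += 1
        else:
            match = BATCH_R.match(line)
            if match:
                summary["rect_batches"] += 1
                summary["rects"] += int(match.group(2))  # second number = rect count
                if line.endswith("MULTI"):
                    summary["multi_batches"] += 1
    return summary


sample = """canvas_begin FRAME 2604
items
    joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
canvas_end"""

summary = summarize(sample)
print(summary)
```

A low ratio of MULTI batches to rect batches in the summary suggests that batching is being broken more often than expected.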

Frequently asked questions

I don't get a large performance increase when enabling batching.

  • Try the diagnostics to see how much batching is occurring and whether it can be improved.
  • Try changing batching parameters in the Project Settings.
  • Consider that batching may not be your bottleneck (see bottlenecks).

I get a decrease in performance with batching.

  • Try the steps described above to increase the number of batching opportunities.
  • Try enabling single_rect_fallback.
  • The single rect fallback method is the default used without batching, and it is approximately twice as fast at drawing unbatchable rectangles. However, it can result in flickering on some hardware, so its use is discouraged.
  • If your scene still performs poorly after trying the steps above, consider turning off batching.

I use custom shaders and the items are not batching.

  • Custom shaders can be problematic for batching; see the custom shaders section.

I am seeing line artifacts appear on certain hardware.

  • See the uv_contract project setting, which can be used to solve this problem.

I use a large number of textures, so few items are being batched.

  • Consider using texture atlases. As well as allowing batching, these reduce the need for state changes associated with changing textures.

Appendix

Light scissoring threshold calculation

The actual proportion of screen pixel area used as the threshold is the scissor_area_threshold value to the power of 4.

For example, on a screen size of 1920×1080, there are 2,073,600 pixels.

At a threshold of 1,000 pixels, the proportion would be:

1000 / 2073600 = 0.00048225
0.00048225 ^ (1/4) = 0.14819

So a scissor_area_threshold of 0.15 would be a reasonable value to try.

Going the other way, for instance with a scissor_area_threshold of 0.5:

0.5 ^ 4 = 0.0625
0.0625 * 2073600 = 129600 pixels

If the number of pixels saved is greater than this threshold, the scissor is activated.
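The two conversions above can be reproduced with a short script. The helper names are invented for this example; only the power-of-4 relationship comes from the text.

```python
def threshold_to_pixels(scissor_area_threshold, width, height):
    """Pixel area that must be saved before the scissor is activated."""
    return scissor_area_threshold ** 4 * width * height


def pixels_to_threshold(pixels, width, height):
    """Inverse conversion: the setting value for a desired pixel area."""
    return (pixels / (width * height)) ** 0.25


pixels_at_half = threshold_to_pixels(0.5, 1920, 1080)
threshold_for_1000 = pixels_to_threshold(1000, 1920, 1080)

print(pixels_at_half)                 # pixel area for a threshold of 0.5
print(round(threshold_for_1000, 5))   # setting value for 1,000 pixels
```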