GPU optimization

Introduction

The demand for new graphics features and progress almost guarantees that you will encounter graphics bottlenecks. Some of these can be on the CPU side, for instance in calculations inside the Godot engine to prepare objects for rendering. Bottlenecks can also occur on the CPU in the graphics driver, which sorts instructions to pass to the GPU, and in the transfer of these instructions. And finally, bottlenecks also occur on the GPU itself.

Where bottlenecks occur in rendering is highly hardware-specific. Mobile GPUs in particular may struggle with scenes that run easily on desktop.

Understanding and investigating GPU bottlenecks is slightly different to the situation on the CPU. This is because, often, you can only change performance indirectly by changing the instructions you give to the GPU. Also, it may be more difficult to take measurements. In many cases, the only way of measuring performance is by examining changes in the time spent rendering each frame.
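Since frame time is often the only measurement available, it helps to compare changes in milliseconds per frame rather than in frames per second, because FPS is not linear in the amount of work done. The following is a minimal sketch in plain Python, not Godot API; the function names are illustrative.

```python
def frame_time_ms(fps: float) -> float:
    """Convert a frames-per-second reading to milliseconds per frame."""
    if fps <= 0:
        raise ValueError("fps must be positive")
    return 1000.0 / fps


def frame_cost_ms(fps_before: float, fps_after: float) -> float:
    """Extra milliseconds per frame after a change (negative means faster)."""
    return frame_time_ms(fps_after) - frame_time_ms(fps_before)


# The same 10 FPS drop corresponds to very different amounts of frame time.
print(round(frame_cost_ms(200, 190), 2))  # 0.26
print(round(frame_cost_ms(60, 50), 2))    # 3.33
```

A drop from 200 to 190 FPS costs about a quarter of a millisecond per frame, while a drop from 60 to 50 FPS costs more than three milliseconds.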

Draw calls, state changes, and APIs

Note

The following section is not relevant to end users, but it provides background information that is useful in later sections.

Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or Vulkan). The communication and driver activity involved can be quite costly, especially in OpenGL and OpenGL ES. If we can provide these instructions in a way that is preferred by the driver and GPU, we can greatly increase performance.

Nearly every API command in OpenGL requires a certain amount of validation to make sure the GPU is in the correct state. Even seemingly simple commands can lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to reduce these instructions to a bare minimum and group together similar objects as much as possible so they can be rendered together, or with the minimum number of these expensive state changes.

2D batching

In 2D, the costs of treating each item individually can be prohibitively high - there can easily be thousands of them on the screen. This is why 2D batching is used. Multiple similar items are grouped together and rendered in a batch, via a single draw call, rather than making a separate draw call for each item. In addition, this means state changes, material and texture changes can be kept to a minimum.
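The grouping idea can be sketched as follows. This is a simplified model in plain Python, not Godot's actual batching code. It assumes that draw order must be preserved, so only adjacent items with the same texture and material are merged. The `Item` type and its fields are hypothetical.

```python
from typing import List, NamedTuple


class Item(NamedTuple):
    name: str
    texture: str
    material: str


def batch_items(items: List[Item]) -> List[List[Item]]:
    """Merge runs of adjacent items that share texture and material."""
    batches: List[List[Item]] = []
    for item in items:
        if batches:
            first = batches[-1][0]
            if (first.texture, first.material) == (item.texture, item.material):
                batches[-1].append(item)  # same state: extend the current batch
                continue
        batches.append([item])  # state change: start a new batch
    return batches


items = [
    Item("coin_1", "atlas.png", "default"),
    Item("coin_2", "atlas.png", "default"),
    Item("player", "hero.png", "default"),
    Item("coin_3", "atlas.png", "default"),
]
print(len(batch_items(items)))  # 3 draw calls instead of 4
```

In this model, an item with a different texture in the middle of the list splits a batch in two, which is one reason why sharing a texture atlas between items tends to help.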

For more information on 2D batching, see Optimization using batching.

3D batching

In 3D, we still aim to minimize draw calls and state changes. However, it can be more difficult to batch together several objects into a single draw call. 3D meshes tend to comprise hundreds or thousands of triangles, and combining large meshes in real-time is prohibitively expensive. The cost of joining them quickly exceeds any benefit as the number of triangles per mesh grows. A much better alternative is to join meshes ahead of time (meshes that are static in relation to each other). This can either be done by artists, or programmatically within Godot.

There is also a cost to batching together objects in 3D. Several objects rendered as one cannot be individually culled. An entire city that is off-screen will still be rendered if it is joined to a single blade of grass that is on screen. Thus, you should always take objects' location and culling into account when attempting to batch 3D objects together. Despite this, the benefits of joining static objects often outweigh other considerations, especially for large numbers of distant or low-poly objects.
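The culling trade-off can be illustrated with bounding boxes. This is a minimal 2D sketch in plain Python, not Godot's culling code; the `Box` type and the coordinates are made up for illustration.

```python
from typing import NamedTuple


class Box(NamedTuple):
    min_x: float
    min_y: float
    max_x: float
    max_y: float


def intersects(a: Box, b: Box) -> bool:
    """True if the two axis-aligned boxes overlap."""
    return (a.min_x <= b.max_x and b.min_x <= a.max_x
            and a.min_y <= b.max_y and b.min_y <= a.max_y)


def merge(a: Box, b: Box) -> Box:
    """Bounding box of two objects joined into a single mesh."""
    return Box(min(a.min_x, b.min_x), min(a.min_y, b.min_y),
               max(a.max_x, b.max_x), max(a.max_y, b.max_y))


view = Box(0, 0, 10, 10)
grass = Box(4, 4, 5, 5)        # on screen
city = Box(100, 0, 300, 200)   # far off screen

print(intersects(view, city))                # False: the city alone is culled
print(intersects(view, merge(grass, city)))  # True: joined, the city is drawn
```

Once the two objects share one bounding box, the visibility test can no longer reject the city on its own.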

For more information on 3D-specific optimizations, see Optimizing 3D performance.

Reusing shaders and materials

The Godot renderer works a little differently from many other renderers. It's designed to minimize GPU state changes as much as possible. SpatialMaterial does a good job at reusing materials that need similar shaders. If custom shaders are used, make sure to reuse them as much as possible. Godot's priorities are:

  • Reusing Materials: The fewer different materials in the scene, the faster the rendering will be. If a scene has a huge amount of objects (in the hundreds or thousands), try reusing the materials. In the worst case, use atlases to decrease the amount of texture changes.
  • Reusing Shaders: If materials can't be reused, at least try to reuse shaders (or SpatialMaterials with different parameters but the same configuration).

For example, if a scene has 20,000 objects, each with a different material, rendering will be slow. If the same scene has 20,000 objects but only uses 100 materials, rendering will be much faster.
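To see why fewer materials help, one can count how often the material changes while walking a draw list. The sketch below is a simplified model in plain Python, not Godot's renderer; materials are represented by integer IDs, and the function name is illustrative.

```python
from typing import Iterable


def count_material_binds(draw_list: Iterable[int]) -> int:
    """Count how many times the active material changes along a draw list."""
    binds = 0
    current = None
    for material in draw_list:
        if material != current:
            binds += 1
            current = material
    return binds


unique = list(range(20000))               # every object has its own material
shared = [i % 100 for i in range(20000)]  # 100 materials, interleaved

print(count_material_binds(unique))          # 20000
print(count_material_binds(shared))          # 20000, no two neighbors match
print(count_material_binds(sorted(shared)))  # 100 once grouped by material
```

With only 100 materials, a renderer that groups objects by material needs at most 100 material changes in this model. With 20,000 distinct materials, no ordering can get below 20,000.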

Pixel cost versus vertex cost

You may have heard that the fewer polygons a model has, the faster it renders. This is really relative and depends on many factors.

On a modern PC and console, vertex cost is low. GPUs originally only rendered triangles. This meant that every frame:

  1. All vertices had to be transformed by the CPU (including clipping).
  2. All vertices had to be sent from main memory to GPU memory.

Nowadays, all this is handled inside the GPU, greatly increasing performance. 3D artists usually have the wrong feeling about polycount performance because 3D DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory for it to be edited, reducing actual performance. Game engines rely on the GPU more, so they can render many triangles much more efficiently.

On mobile devices, the story is different. PC and console GPUs are brute-force monsters that can pull as much electricity as they need from the power grid. Mobile GPUs are limited to a tiny battery, so they need to be a lot more power-efficient.

To be more efficient, mobile GPUs attempt to avoid overdraw. Overdraw occurs when the same pixel on the screen is rendered more than once. Imagine a town with several buildings. GPUs don't know what is visible and what is hidden until they draw it. For example, a house might be drawn and then another house in front of it (which means rendering happened twice for the same pixel). PC GPUs normally don't care much about this and simply add more pixel processors to the hardware to increase performance (which also increases power consumption).

Using more power is not an option on mobile so mobile devices use a technique called tile-based rendering which divides the screen into a grid. Each cell keeps the list of triangles drawn to it and sorts them by depth to minimize overdraw. This technique improves performance and reduces power consumption, but takes a toll on vertex performance. As a result, fewer vertices and triangles can be processed for drawing.

Additionally, tile-based rendering struggles when there are small objects with a lot of geometry within a small portion of the screen. This forces mobile GPUs to put a lot of strain on a single screen tile, which considerably decreases performance as all the other cells must wait for it to complete before displaying the frame.
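The effect of concentrated geometry can be sketched by counting triangles per screen tile. This is a rough model in plain Python: it bins each triangle by its centroid only, whereas real hardware bins by the tiles a triangle overlaps, and the 16-pixel tile size is an assumption chosen for illustration.

```python
from collections import Counter
from typing import Iterable, Tuple

Point = Tuple[float, float]


def triangles_per_tile(centroids: Iterable[Point], tile_size: int = 16) -> Counter:
    """Count triangles per tile, binning each triangle by its centroid."""
    counts: Counter = Counter()
    for x, y in centroids:
        counts[(int(x) // tile_size, int(y) // tile_size)] += 1
    return counts


def busiest_tile_load(centroids: Iterable[Point], tile_size: int = 16) -> int:
    """Number of triangles in the most heavily loaded tile."""
    counts = triangles_per_tile(centroids, tile_size)
    return max(counts.values()) if counts else 0


# 1000 triangles spread evenly over a 40x25 grid of tiles.
spread = [(16 * (i % 40), 16 * (i // 40)) for i in range(1000)]
# 1000 triangles packed into a 10x10 pixel area (a tiny, detailed object).
clustered = [(100 + i % 10, 100 + (i // 10) % 10) for i in range(1000)]

print(busiest_tile_load(spread))     # 1
print(busiest_tile_load(clustered))  # 1000
```

Both scenes contain the same number of triangles, but in the clustered case a single tile has to process all of them.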

To summarize, don't worry about vertex count on mobile, but avoid concentration of vertices in small parts of the screen. If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use a smaller level of detail (LOD) model. Even on desktop GPUs, it's preferable to avoid having triangles smaller than the size of a pixel on screen.
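Choosing a level of detail by distance can be as simple as the sketch below. This is plain Python, not a Godot API, and the distance thresholds are hypothetical values that would need tuning per project.

```python
from typing import Sequence


def select_lod(distance: float, thresholds: Sequence[float] = (10.0, 30.0, 80.0)) -> int:
    """Return the LOD index for a distance; 0 is the most detailed model."""
    for index, limit in enumerate(thresholds):
        if distance < limit:
            return index
    return len(thresholds)  # beyond the last threshold: coarsest model


print(select_lod(5.0))    # 0
print(select_lod(20.0))   # 1
print(select_lod(200.0))  # 3
```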

Pay attention to the additional vertex processing required when using:

  • Skinning (skeletal animation)
  • Morphs (shape keys)
  • Vertex-lit objects (common on mobile)

Pixel/fragment shaders and fill rate

In contrast to vertex processing, the costs of fragment (per-pixel) shading have increased dramatically over the years. Screen resolutions have increased (the area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA screen, that is 27x the area), but also the complexity of fragment shaders has exploded. Physically-based rendering requires complex calculations for each fragment.
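The resolution figures above can be checked with a few lines of Python. The 60 FPS target in the last line is an assumption added for illustration, and the result is a lower bound that ignores overdraw.

```python
def pixel_count(width: int, height: int) -> int:
    """Number of pixels on a screen of the given resolution."""
    return width * height


uhd = pixel_count(3840, 2160)
vga = pixel_count(640, 480)

print(uhd)         # 8294400
print(vga)         # 307200
print(uhd // vga)  # 27
print(uhd * 60)    # 497664000 fragments per second at 60 FPS, without overdraw
```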

You can test whether a project is fill rate-limited quite easily. Turn off V-Sync to prevent capping the frames per second, then compare the frames per second when running with a large window, to running with a very small window. You may also benefit from similarly reducing your shadow map size if using shadows. Usually, you will find the FPS increases quite a bit using a small window, which indicates you are to some extent fill rate-limited. On the other hand, if there is little to no increase in FPS, then your bottleneck lies elsewhere.
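The window-size test can be turned into a simple comparison of two FPS readings. This is a sketch in plain Python; the 1.2 threshold is an arbitrary assumption, not a value from Godot, and the readings would have to be taken by hand with V-Sync disabled.

```python
def fill_rate_speedup(fps_large_window: float, fps_small_window: float) -> float:
    """How many times faster the project runs in the small window."""
    if fps_large_window <= 0 or fps_small_window <= 0:
        raise ValueError("FPS readings must be positive")
    return fps_small_window / fps_large_window


def looks_fill_rate_limited(fps_large_window: float, fps_small_window: float,
                            threshold: float = 1.2) -> bool:
    """Heuristic: a clear speedup in a small window suggests a fill rate limit."""
    return fill_rate_speedup(fps_large_window, fps_small_window) >= threshold


print(looks_fill_rate_limited(60.0, 240.0))   # True: shrinking the window helps a lot
print(looks_fill_rate_limited(60.0, 62.0))    # False: the bottleneck lies elsewhere
```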

You can increase performance in a fill rate-limited project by reducing the amount of work the GPU has to do. You can do this by simplifying the shader (perhaps turn off expensive options if you are using a SpatialMaterial), or reducing the number and size of textures used.

When targeting mobile devices, consider using the simplest possible shaders you can reasonably afford to use.

Reading textures

The other factor in fragment shaders is the cost of reading textures. Reading textures is an expensive operation, especially when reading from several textures in a single fragment shader. Also, consider that filtering may slow it down further (trilinear filtering between mipmaps, and averaging). Reading textures is also expensive in terms of power usage, which is a big issue on mobiles.

If you use third-party shaders or write your own shaders, try to use algorithms that require as few texture reads as possible.

Texture compression

By default, Godot compresses textures of 3D models when they are imported, using video RAM (VRAM) compression. VRAM compression is not as efficient in size as PNG or JPG when stored, but it increases performance enormously when drawing large enough textures.

This is because the main goal of texture compression is bandwidth reduction between memory and the GPU.
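The bandwidth saving can be estimated from texture sizes in memory. The sketch below, in plain Python, compares uncompressed RGBA8 (4 bytes per pixel) with the block sizes of the common S3TC formats BC1/DXT1 (8 bytes per 4×4 block) and BC3/DXT5 (16 bytes per 4×4 block). Which format Godot selects on a given platform is not covered here, and mipmaps are ignored for simplicity.

```python
def uncompressed_bytes(width: int, height: int, bytes_per_pixel: int = 4) -> int:
    """Size of the base mip level of an uncompressed texture (RGBA8 by default)."""
    return width * height * bytes_per_pixel


def block_compressed_bytes(width: int, height: int, bytes_per_block: int) -> int:
    """Size of the base mip level of a texture stored in 4x4 pixel blocks."""
    blocks_x = (width + 3) // 4   # round up to whole blocks
    blocks_y = (height + 3) // 4
    return blocks_x * blocks_y * bytes_per_block


size = 2048
print(uncompressed_bytes(size, size))          # 16777216 (16 MiB)
print(block_compressed_bytes(size, size, 8))   # 2097152 (2 MiB, BC1/DXT1)
print(block_compressed_bytes(size, size, 16))  # 4194304 (4 MiB, BC3/DXT5)
```

Every texel read from the compressed texture moves a quarter to an eighth of the data of the uncompressed one.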

In 3D, the shapes of objects depend more on the geometry than on the texture, so compression is generally not noticeable. In 2D, shapes depend more on the content of the textures themselves, so the artifacts resulting from compression are more noticeable.

As a warning, most Android devices do not support texture compression of textures with transparency (only opaque), so keep this in mind.

Note

Even in 3D, "pixel art" textures should have VRAM compression disabled as it will negatively affect their appearance, without improving performance significantly due to their low resolution.

Post-processing and shadows

Post-processing effects and shadows can also be expensive in terms of fragment shading activity. Always test the impact of these on different hardware.

Reducing the size of shadowmaps can increase performance, both in terms of writing and reading the shadowmaps. On top of that, the best way to improve performance of shadows is to turn shadows off for as many lights and objects as possible. Smaller or distant OmniLights/SpotLights can often have their shadows disabled with only a small visual impact.

Transparency and blending

Transparent objects present particular problems for rendering efficiency. Opaque objects (especially in 3D) can be essentially rendered in any order and the Z-buffer will ensure that only the front most objects get shaded. Transparent or blended objects are different. In most cases, they cannot rely on the Z-buffer and must be rendered in "painter's order" (i.e. from back to front) to look correct.
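Painter's order amounts to sorting objects by their distance from the camera, farthest first. The following is a minimal sketch in plain Python; the `Drawable` type is hypothetical, and sorting by object origin is a simplification that can still give wrong results for large or intersecting objects.

```python
from typing import List, NamedTuple, Tuple

Vec3 = Tuple[float, float, float]


class Drawable(NamedTuple):
    name: str
    position: Vec3


def distance_squared(a: Vec3, b: Vec3) -> float:
    """Squared distance; enough for ordering, and avoids a square root."""
    return sum((p - q) ** 2 for p, q in zip(a, b))


def painters_order(objects: List[Drawable], camera: Vec3) -> List[Drawable]:
    """Sort objects back to front relative to the camera position."""
    return sorted(objects,
                  key=lambda o: distance_squared(o.position, camera),
                  reverse=True)


camera = (0.0, 0.0, 0.0)
scene = [
    Drawable("glass", (0.0, 0.0, -2.0)),
    Drawable("smoke", (0.0, 0.0, -10.0)),
    Drawable("water", (0.0, 0.0, -5.0)),
]
print([o.name for o in painters_order(scene, camera)])  # ['smoke', 'water', 'glass']
```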

Transparent objects are also particularly bad for fill rate, because every item has to be drawn even if other transparent objects will be drawn on top later on.

Opaque objects don't have to do this. They can usually take advantage of the Z-buffer by writing to the Z-buffer only first, then only running the fragment shader on the "winning" fragment, the object that is at the front at a particular pixel.

Transparency is particularly expensive where multiple transparent objects overlap. It is usually better to use transparent areas as small as possible to minimize these fill rate requirements, especially on mobile, where fill rate is very expensive. Indeed, in many situations, rendering more complex opaque geometry can end up being faster than using transparency to "cheat".
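The difference in fill rate cost can be modeled for a single pixel. The sketch below, in plain Python, assumes an ideal depth pre-pass: at most one opaque fragment is shaded, plus every transparent fragment in front of the nearest opaque one. Real GPUs and drivers may shade more fragments than this, so treat the numbers as a best case.

```python
from typing import List, Tuple

Fragment = Tuple[float, bool]  # (depth from the camera, is_transparent)


def shaded_fragments(fragments: List[Fragment]) -> int:
    """Fragment shader runs for one pixel under an ideal depth pre-pass."""
    opaque_depths = [depth for depth, transparent in fragments if not transparent]
    nearest_opaque = min(opaque_depths) if opaque_depths else float("inf")
    shaded = 1 if opaque_depths else 0  # only the front-most opaque fragment
    # Transparent fragments hidden behind opaque geometry fail the depth test.
    shaded += sum(1 for depth, transparent in fragments
                  if transparent and depth < nearest_opaque)
    return shaded


five_opaque = [(float(d), False) for d in range(1, 6)]
five_transparent = [(float(d), True) for d in range(1, 6)]

print(shaded_fragments(five_opaque))       # 1
print(shaded_fragments(five_transparent))  # 5
```

Five overlapping opaque layers cost one shader run at this pixel in the best case, while five overlapping transparent layers cost five.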

Multi-platform advice

If you are aiming to release on multiple platforms, test early and test often on all your platforms, especially mobile. Developing a game on desktop and then attempting to port it to mobile at the last minute usually ends in disaster.

In general, you should design your game for the lowest common denominator, then add optional enhancements for more powerful platforms. For example, you may want to use the GLES2 backend for both desktop and mobile platforms if you are targeting both.

Mobile/tiled renderers

As described above, GPUs on mobile devices work in dramatically different ways from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers split up the screen into regular-sized tiles that fit into super fast cache memory, which reduces the number of read/write operations to the main memory.

However, there are some downsides that can make certain techniques much more complicated and expensive to perform. Tiles that rely on the results of rendering in different tiles, or on the results of earlier operations being preserved, can be very slow. Be very careful to test the performance of shaders, viewport textures, and post-processing.