GPU optimization

Introduction

The demand for new graphics features and progress almost guarantees that you will encounter graphics bottlenecks. Some of these can be on the CPU side, for instance in calculations inside the Godot engine to prepare objects for rendering. Bottlenecks can also occur on the CPU in the graphics driver, which sorts instructions to pass to the GPU, and in the transfer of these instructions. And finally, bottlenecks also occur on the GPU itself.

Where bottlenecks occur in rendering is highly hardware-specific. Mobile GPUs in particular may struggle with scenes that run easily on desktop.

Understanding and investigating GPU bottlenecks is slightly different to the situation on the CPU. This is because, often, you can only change performance indirectly by changing the instructions you give to the GPU. Also, it may be more difficult to take measurements. In many cases, the only way of measuring performance is by examining changes in the time spent rendering each frame.

Draw calls, state changes, and APIs

Note

The following section is not directly relevant to end users, but it provides background information that later sections rely on.

Godot sends instructions to the GPU via a graphics API (OpenGL, OpenGL ES or Vulkan). The communication and driver activity involved can be quite costly, especially in OpenGL and OpenGL ES. If we can provide these instructions in a way that is preferred by the driver and GPU, we can greatly increase performance.

Nearly every API command in OpenGL requires a certain amount of validation to make sure the GPU is in the correct state. Even seemingly simple commands can lead to a flurry of behind-the-scenes housekeeping. Therefore, the goal is to reduce these instructions to a bare minimum and group together similar objects as much as possible so they can be rendered together, or with the minimum number of these expensive state changes.

2D batching

In 2D, the costs of treating each item individually can be prohibitively high - there can easily be thousands of them on the screen. This is why 2D batching is used. Multiple similar items are grouped together and rendered in a batch, via a single draw call, rather than making a separate draw call for each item. In addition, this means state changes, material and texture changes can be kept to a minimum.
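The grouping idea can be illustrated with a minimal Python sketch. This is not Godot's batching code. The `build_batches` helper, the item tuples and the texture names are hypothetical, and the sketch assumes that draw order must be preserved, so only consecutive items sharing the same texture can be merged into one draw call.

```python
def build_batches(items):
    """Group consecutive items sharing a texture into batches, keeping draw order.

    `items` is a list of (texture, rect) tuples; each batch stands for one draw call.
    """
    batches = []
    for texture, rect in items:
        if batches and batches[-1][0] == texture:
            batches[-1][1].append(rect)        # same state: extend the current batch
        else:
            batches.append((texture, [rect]))  # state change: start a new batch
    return batches


# Hypothetical draw list: one item with a different texture splits the run of tiles.
items = [
    ("tiles.png", (0, 0, 16, 16)),
    ("tiles.png", (16, 0, 16, 16)),
    ("tiles.png", (32, 0, 16, 16)),
    ("player.png", (8, 8, 32, 32)),
    ("tiles.png", (48, 0, 16, 16)),
]

batches = build_batches(items)
print(len(items), "items ->", len(batches), "draw calls")  # 5 items -> 3 draw calls
```

In this model, anything that interrupts a run of similar items (a different texture or material) costs an extra draw call, which is why keeping similar items adjacent in draw order matters.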

For more information on 2D batching, see Optimization using batching.

3D batching

In 3D, we still aim to minimize draw calls and state changes. However, it can be more difficult to batch several objects into a single draw call. 3D meshes tend to comprise hundreds or thousands of triangles, and combining large meshes in real time is prohibitively expensive. The cost of joining them quickly exceeds any benefit as the number of triangles per mesh grows. A much better alternative is to join meshes ahead of time (that is, meshes that are static in relation to each other). This can be done either by artists or programmatically within Godot.

There is also a cost to batching together objects in 3D. Several objects rendered as one cannot be individually culled. An entire city that is off-screen will still be rendered if it is joined to a single blade of grass that is on screen. Thus, you should always take objects' location and culling into account when attempting to batch 3D objects together. Despite this, the benefits of joining static objects often outweigh other considerations, especially for large numbers of distant or low-poly objects.
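The culling trade-off can be sketched in a few lines of Python. This is a simplified model, not engine code: culling is reduced to a one-dimensional overlap test, and the `Mesh` class, the helper functions and the triangle counts are all made up for illustration.

```python
from dataclasses import dataclass


@dataclass
class Mesh:
    name: str
    min_x: float
    max_x: float
    triangles: int


def visible(mesh, view_min, view_max):
    """1D stand-in for frustum culling: does the mesh overlap the view range?"""
    return mesh.max_x >= view_min and mesh.min_x <= view_max


def merge(meshes):
    """Join meshes into one: bounds grow to cover all of them, triangles add up."""
    return Mesh(
        name="+".join(m.name for m in meshes),
        min_x=min(m.min_x for m in meshes),
        max_x=max(m.max_x for m in meshes),
        triangles=sum(m.triangles for m in meshes),
    )


def submitted_triangles(meshes, view_min, view_max):
    """Triangles sent to the GPU after culling whole meshes."""
    return sum(m.triangles for m in meshes if visible(m, view_min, view_max))


grass = Mesh("grass", 0.0, 1.0, 2)
city = Mesh("city", 500.0, 900.0, 1_000_000)
view = (-10.0, 10.0)

separate = submitted_triangles([grass, city], *view)
merged = submitted_triangles([merge([grass, city])], *view)
print(separate, merged)  # 2 1000002
```

Kept separate, the off-screen city is culled and only the grass is drawn. Once merged, the combined bounds overlap the view, so every triangle is submitted.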

For more information on 3D-specific optimizations, see Optimizing 3D performance.

Reuse shaders and materials

The Godot renderer is a little different from what else is out there. It is designed to minimize GPU state changes as much as possible. SpatialMaterial does a good job of reusing materials that need similar shaders. If you use custom shaders, make sure to reuse them as much as possible. Godot's priorities are:

  • Reusing Materials: The fewer different materials there are in the scene, the faster rendering will be. If a scene has a huge number of objects (in the hundreds or thousands), try reusing the materials. In the worst case, use atlases to decrease the number of texture changes.

  • Reusing Shaders: If materials can't be reused, at least try to re-use shaders. Note: shaders are automatically reused between SpatialMaterials that share the same configuration (features that are enabled or disabled with a check box) even if they have different parameters.

If a scene has, for example, 20,000 objects that each use a different material (20,000 materials in total), rendering will be slow. If the same scene has 20,000 objects but uses only 100 materials, rendering will be much faster.
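A rough way to see the difference is to count how often the bound material changes while walking a draw list. The Python sketch below is only a model: `material_switches` is a hypothetical helper, materials are represented by integer IDs, and it assumes the renderer is free to sort opaque objects by material before drawing.

```python
def material_switches(draw_order):
    """Count how often the bound material changes while walking the draw list."""
    switches = 0
    current = None
    for material in draw_order:
        if material != current:
            switches += 1
            current = material
    return switches


OBJECTS = 20_000

unique = list(range(OBJECTS))               # every object has its own material
shared = [i % 100 for i in range(OBJECTS)]  # 100 materials shared by all objects

print(material_switches(unique))          # 20000
print(material_switches(shared))          # 20000 (unsorted, neighbours always differ)
print(material_switches(sorted(shared)))  # 100
```

With unique materials, no ordering can reduce the number of switches. With shared materials, sorting by material brings it down to one switch per material.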

Pixel cost versus vertex cost

You may have heard that the lower the number of polygons in a model, the faster it will be rendered. This is really relative and depends on many factors.

On a modern PC and console, vertex cost is low. GPUs originally only rendered triangles. This meant that every frame:

  1. All vertices had to be transformed by the CPU (including clipping).

  2. All vertices had to be sent to the GPU memory from the main RAM.

Nowadays, all this is handled inside the GPU, greatly increasing performance. 3D artists often have the wrong impression about polycount performance because 3D DCCs (such as Blender, Max, etc.) need to keep geometry in CPU memory so that it can be edited, which reduces actual performance. Game engines rely more on the GPU, so they can render many triangles much more efficiently.

On mobile devices, the story is different. PC and console GPUs are brute-force monsters that can pull as much electricity as they need from the power grid. Mobile GPUs are limited to a tiny battery, so they need to be a lot more power efficient.

To be more efficient, mobile GPUs attempt to avoid overdraw. Overdraw occurs when the same pixel on the screen is rendered more than once. Imagine a town with several buildings. GPUs don't know what is visible and what is hidden until they draw it. For example, a house might be drawn and then another house in front of it (which means rendering happened twice for the same pixel). PC GPUs normally don't care much about this and just add more pixel processors to the hardware to increase performance (which also increases power consumption).

Using more power is not an option on mobile, so mobile devices use a technique called tile-based rendering, which divides the screen into a grid. Each cell keeps the list of triangles drawn to it and sorts them by depth to minimize overdraw. This technique improves performance and reduces power consumption, but takes a toll on vertex performance. As a result, fewer vertices and triangles can be processed for drawing.

Additionally, tile-based rendering struggles when there are small objects with a lot of geometry within a small portion of the screen. This forces mobile GPUs to put a lot of strain on a single screen tile, which considerably decreases performance, as all the other cells must wait for it to complete before the frame can be displayed.
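The effect of concentrated geometry can be approximated by binning triangles into screen tiles and looking at the busiest tile. The Python sketch below is a toy model, not a description of any specific GPU: the 32-pixel tile size, the `tile_load` helper and both triangle distributions are assumptions made for illustration.

```python
from collections import Counter

TILE = 32  # assumed tile size in pixels; real hardware varies


def tile_load(triangle_centers, tile=TILE):
    """Count triangles per screen tile, using each triangle's center as a proxy."""
    return Counter((int(x) // tile, int(y) // tile) for x, y in triangle_centers)


# 10,000 triangles spread on a regular grid over a 1280x720 screen...
spread = [(x * 12.8 + 6, y * 7.2 + 3) for x in range(100) for y in range(100)]
# ...versus the same number packed into a 20x20 pixel area (a distant, dense model).
packed = [(608 + i % 20, 300 + (i // 20) % 20) for i in range(10_000)]

print(max(tile_load(spread).values()))  # 15
print(max(tile_load(packed).values()))  # 10000
```

Both scenes contain the same number of triangles, but in the packed case a single tile has to process all of them.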

To summarize, don't worry about vertex count on mobile, but avoid concentrating vertices in small parts of the screen. If a character, NPC, vehicle, etc. is far away (which means it looks tiny), use a lower level of detail (LOD) model. Even on desktop GPUs, it is preferable to avoid triangles that are smaller than a pixel on screen.
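One possible way to act on this advice is to estimate how large an object appears on screen and choose an LOD whose triangles stay above roughly one pixel each. The Python sketch below is only a heuristic under stated assumptions: a perspective camera with a 70 degree vertical field of view, a 1080-pixel-tall viewport, and hypothetical `projected_height_px` and `pick_lod` helpers with made-up triangle budgets. It is not an engine API.

```python
import math


def projected_height_px(object_height, distance, fov_degrees=70.0, screen_height=1080):
    """Approximate on-screen height in pixels of an object facing a perspective camera."""
    visible_height = 2.0 * distance * math.tan(math.radians(fov_degrees) / 2.0)
    return object_height / visible_height * screen_height


def pick_lod(object_height, distance, lod_triangles=(20_000, 5_000, 500)):
    """Pick the most detailed LOD whose triangles stay larger than about one pixel.

    Hypothetical heuristic: treat the object as a square with a side of
    projected_height_px and require at least one pixel of area per triangle.
    """
    area_px = projected_height_px(object_height, distance) ** 2
    for index, triangles in enumerate(lod_triangles):
        if area_px >= triangles:
            return index
    return len(lod_triangles) - 1  # nothing fits: fall back to the coarsest LOD


for distance in (5.0, 20.0, 50.0, 200.0):
    print(distance, pick_lod(2.0, distance))
```

With these assumed numbers, a 2-unit-tall character uses the full-detail model at 5 units, the middle LOD at 20 units, and the coarsest LOD from 50 units onward.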

Pay attention to the additional vertex processing required when using:

  • Skinning (skeletal animation)

  • Morphs (shape keys)

  • Vertex-lit objects (common on mobile)

Pixel/fragment shaders and fill rate

In contrast to vertex processing, the costs of fragment (per-pixel) shading have increased dramatically over the years. Screen resolutions have increased (the area of a 4K screen is 8,294,400 pixels, versus 307,200 for an old 640×480 VGA screen, that is 27x the area), but also the complexity of fragment shaders has exploded. Physically-based rendering requires complex calculations for each fragment.
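The resolution figures above can be checked with a short Python calculation. The 60 FPS figure in the last line is an assumption added for illustration, and the estimate ignores overdraw.

```python
def pixel_count(width, height):
    """Number of pixels on a screen of the given resolution."""
    return width * height


uhd = pixel_count(3840, 2160)
vga = pixel_count(640, 480)

print(uhd)         # 8294400
print(vga)         # 307200
print(uhd // vga)  # 27

# Minimum fragment shader invocations per second at an assumed 60 FPS,
# shading each pixel exactly once (no overdraw).
print(uhd * 60)    # 497664000
```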

You can test whether a project is fill rate-limited quite easily. Turn off V-Sync to prevent capping the frames per second, then compare the frames per second when running with a large window, to running with a very small window. You may also benefit from similarly reducing your shadow map size if using shadows. Usually, you will find the FPS increases quite a bit using a small window, which indicates you are to some extent fill rate-limited. On the other hand, if there is little to no increase in FPS, then your bottleneck lies elsewhere.
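The comparison can be written down as a small Python helper. This is a sketch for interpreting the two measurements, not a profiling tool: `fill_rate_report` is a hypothetical name, the FPS values are made up, and the 1.2 threshold is an arbitrary assumption that you should tune for your project.

```python
def fill_rate_report(fps_large, fps_small, threshold=1.2):
    """Compare FPS measured in a large and a small window (V-Sync off).

    `threshold` is an arbitrary assumption: a speedup at or above it is read
    as a sign that the project is at least partly fill rate-limited.
    """
    speedup = fps_small / fps_large
    return speedup, speedup >= threshold


# Hypothetical measurements.
print(fill_rate_report(60, 180))  # (3.0, True): strongly fill rate-limited
print(fill_rate_report(60, 63))   # (1.05, False): bottleneck lies elsewhere
```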

You can increase performance in a fill rate-limited project by reducing the amount of work the GPU has to do. You can do this by simplifying the shader (perhaps turn off expensive options if you are using a SpatialMaterial), or reducing the number and size of textures used.

When targeting mobile devices, consider using the simplest possible shaders you can reasonably afford to use.

Reading textures

The other factor in fragment shaders is the cost of reading textures. Reading textures is an expensive operation, especially when reading from several textures in a single fragment shader. Also, consider that filtering may slow it down further (trilinear filtering between mipmaps, and averaging). Reading textures is also expensive in terms of power usage, which is a big issue on mobiles.

If you use third-party shaders or write your own shaders, try to use algorithms that require as few texture reads as possible.

Texture compression

By default, Godot compresses the textures of 3D models when they are imported, using video RAM (VRAM) compression. VRAM compression is not as size-efficient as PNG or JPG for storage, but it greatly increases performance when drawing sufficiently large textures.

This is because the main goal of texture compression is to reduce bandwidth between memory and the GPU.
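To get a feel for the bandwidth savings, compare the in-memory size of a texture in an uncompressed format with two common block-compressed formats. The Python sketch below uses the standard bit rates of the S3TC family (4 bits per pixel for BC1/DXT1, 8 bits per pixel for BC3/DXT5). Which format Godot actually selects depends on the platform and import settings, so treat the format list as an illustrative assumption. `texture_bytes` is a hypothetical helper that ignores mipmaps.

```python
def texture_bytes(width, height, bits_per_pixel):
    """Size in bytes of a single mip level for a given storage format."""
    return width * height * bits_per_pixel // 8


# Bits per pixel of each format as stored in video memory.
FORMATS = {
    "RGBA8 (uncompressed)": 32,
    "BC3 / DXT5": 8,
    "BC1 / DXT1": 4,
}

for name, bpp in FORMATS.items():
    mib = texture_bytes(2048, 2048, bpp) / (1024 * 1024)
    print(f"{name}: {mib:.0f} MiB")
# RGBA8 (uncompressed): 16 MiB
# BC3 / DXT5: 4 MiB
# BC1 / DXT1: 2 MiB
```

Every texture read in a fragment shader pulls data through the same memory bus, so a format that is four to eight times smaller reduces the traffic accordingly.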

In 3D, the shapes of objects depend more on the geometry than on the texture, so compression is generally not noticeable. In 2D, the shapes depend more on what is inside the textures, so compression artifacts are more noticeable.

As a warning, most Android devices do not support compression of textures with transparency (only opaque textures), so keep this in mind.

Note

Even in 3D, "pixel art" textures should have VRAM compression disabled as it will negatively affect their appearance, without improving performance significantly due to their low resolution.

Post-processing and shadows

Post-processing effects and shadows can also be expensive in terms of fragment shading activity. Always test the impact of these on different hardware.

Reducing the size of shadowmaps can increase performance, both in terms of writing and reading the shadowmaps. On top of that, the best way to improve performance of shadows is to turn shadows off for as many lights and objects as possible. Smaller or distant OmniLights/SpotLights can often have their shadows disabled with only a small visual impact.

Transparency and blending

Transparent objects present particular problems for rendering efficiency. Opaque objects (especially in 3D) can essentially be rendered in any order, and the Z-buffer will ensure that only the frontmost objects get shaded. Transparent or blended objects are different. In most cases, they cannot rely on the Z-buffer and must be rendered in "painter's order" (i.e. from back to front) to look correct.
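Painter's order amounts to sorting transparent objects by their distance from the camera, farthest first. The Python sketch below shows the idea with a hypothetical `painters_order` helper and made-up object data. It sorts whole objects by their origin, which is a common simplification and can still give wrong results for large or intersecting objects.

```python
def painters_order(objects, camera_position):
    """Sort objects back to front by squared distance from the camera."""
    def distance_sq(obj):
        return sum((p - c) ** 2 for p, c in zip(obj["position"], camera_position))
    return sorted(objects, key=distance_sq, reverse=True)


camera = (0.0, 0.0, 0.0)
glass = [
    {"name": "near_window", "position": (0.0, 0.0, -2.0)},
    {"name": "far_window", "position": (0.0, 0.0, -10.0)},
    {"name": "mid_window", "position": (1.0, 0.0, -5.0)},
]

print([o["name"] for o in painters_order(glass, camera)])
# ['far_window', 'mid_window', 'near_window']
```

This sort has to be redone whenever the camera or the objects move, which is part of the extra cost of transparency.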

Transparent objects are also particularly bad for fill rate, because every item has to be drawn even if other transparent objects will be drawn on top later on.

Opaque objects don't have to do this. They can usually take advantage of the Z-buffer by first writing only to the Z-buffer, then running the fragment shader only on the "winning" fragment, that is, the object that is frontmost at a particular pixel.

Transparency is particularly expensive where multiple transparent objects overlap. It is usually better to use transparent areas as small as possible to minimize these fill rate requirements, especially on mobile, where fill rate is very expensive. Indeed, in many situations, rendering more complex opaque geometry can end up being faster than using transparency to "cheat".
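The fill rate difference between opaque and transparent layers can be put into numbers with a simplified model. The Python sketch below assumes full-screen layers at 1920×1080 and an ideal depth pre-pass for opaque geometry. `shaded_fragments` is a hypothetical helper, and real renderers will fall somewhere between the two figures.

```python
def shaded_fragments(pixels_covered, layers, transparent):
    """Fragment shader runs for `layers` overlapping surfaces covering the same pixels.

    Simplified model: with a depth pre-pass, opaque layers shade only the
    frontmost fragment per pixel; transparent layers must all be shaded and blended.
    """
    return pixels_covered * (layers if transparent else 1)


pixels = 1920 * 1080
print(shaded_fragments(pixels, 4, transparent=False))  # 2073600
print(shaded_fragments(pixels, 4, transparent=True))   # 8294400
```

In this model, four overlapping transparent layers cost four times the shading work of four opaque ones, which is why shrinking transparent areas pays off.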

Multi-platform advice

If you are aiming to release on multiple platforms, test early and test often on all your platforms, especially mobile. Developing a game on desktop but attempting to port it to mobile at the last minute is a recipe for disaster.

In general, you should design your game for the lowest common denominator, then add optional enhancements for more powerful platforms. For example, you may want to use the GLES2 backend for both desktop and mobile platforms where you target both.

Mobile/tiled renderers

As described above, GPUs on mobile devices work in dramatically different ways from GPUs on desktop. Most mobile devices use tile renderers. Tile renderers split up the screen into regular-sized tiles that fit into super fast cache memory, which reduces the number of read/write operations to the main memory.
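To see why tiles fit into fast on-chip memory, it helps to estimate how many tiles a screen is split into and how much storage one tile needs. The Python sketch below uses assumed values: a 32×32 pixel tile and 8 bytes per pixel (4 for color, 4 for depth/stencil). Actual tile sizes and formats differ between GPU vendors, and the helper names are hypothetical.

```python
import math


def tile_grid(width, height, tile=32):
    """Tiles per axis and in total; `tile` is an assumed tile size in pixels."""
    cols = math.ceil(width / tile)
    rows = math.ceil(height / tile)
    return cols, rows, cols * rows


def tile_bytes(tile=32, bytes_per_pixel=8):
    """On-chip storage for one tile, assuming 4 bytes color + 4 bytes depth/stencil."""
    return tile * tile * bytes_per_pixel


print(tile_grid(1920, 1080))  # (60, 34, 2040)
print(tile_bytes())           # 8192
```

Under these assumptions, a 1920×1080 screen becomes about two thousand tiles of 8 KiB each, small enough to be shaded in cache and written to main memory only once when the tile is finished.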

There are some downsides though. Tiled rendering can make certain techniques much more complicated and expensive to perform. Tiles that rely on the results of rendering in different tiles or on the results of earlier operations being preserved can be very slow. Be very careful to test the performance of shaders, viewport textures and post processing.