Optimization using batching

Introduction

Game engines have to send a set of instructions to the GPU to tell it what to draw and where. These instructions are sent through standardized programming interfaces called graphics APIs. Examples of graphics APIs are OpenGL, OpenGL ES, and Vulkan.

Different APIs incur different costs when drawing objects. OpenGL handles a lot of work for the user in the GPU driver at the cost of more expensive draw calls. As a result, applications can often be sped up by reducing the number of draw calls.

Draw calls

In 2D, we need to tell the GPU to render a series of primitives (rectangles, lines, polygons etc). The most obvious technique is to tell the GPU to render one primitive at a time, telling it some information such as the texture used, the material, the position, size, etc. then saying "Draw!" (this is called a draw call).

While this is conceptually simple from the engine side, GPUs operate very slowly when used in this manner. GPUs work much more efficiently if you tell them to draw a number of similar primitives all in one draw call, which we will call a "batch".

It turns out that they don't just work a bit faster when used in this manner; they work a lot faster.

Since Godot is designed to be a general-purpose engine, the primitives coming into the Godot renderer can be in any order, sometimes similar and sometimes dissimilar. To match Godot's general-purpose nature with the batching preferences of GPUs, Godot features an intermediate layer which can automatically group together primitives wherever possible and send these batches on to the GPU. This can increase rendering performance while requiring few (if any) changes to your Godot project.

How it works

Instructions come into the renderer from your game in the form of a series of items, each of which can contain one or more commands. The items correspond to nodes in the scene tree, and the commands correspond to primitives such as rectangles or polygons. Some items, such as TileMaps and text, can contain a large number of commands (tiles and glyphs respectively). Others, such as sprites, may contain only a single command (a rectangle).

The batcher uses two main techniques to group together primitives:

  • Consecutive items can be joined together.

  • Consecutive commands within an item can be joined to form a batch.

Breaking batching

Batching can only take place if the items or commands are similar enough to be rendered in one draw call. Certain changes (or techniques), by necessity, prevent the formation of a contiguous batch. This is referred to as "breaking batching".

Batching will be broken by (amongst other things):

  • Change of texture.

  • Change of material.

  • Change of primitive type (say, going from rectangles to lines).

Note

For example, if you draw a series of sprites each with a different texture, there is no way they can be batched.
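As a rough illustration of the rules above, the sketch below joins runs of consecutive primitives that share the same primitive type, texture, and material. It is not Godot's actual implementation: the Primitive type, its fields, and build_batches() are hypothetical names invented for this example.

```python
from itertools import groupby
from typing import NamedTuple


class Primitive(NamedTuple):
    # Hypothetical description of a primitive, for illustration only.
    name: str
    kind: str      # e.g. "rect" or "line"
    texture: str
    material: str


def batch_key(p):
    # Any change in one of these values breaks the current batch.
    return (p.kind, p.texture, p.material)


def build_batches(primitives):
    """Join runs of consecutive primitives that share a batch key."""
    return [[p.name for p in run] for _, run in groupby(primitives, key=batch_key)]


prims = [
    Primitive("a", "rect", "grass", "default"),
    Primitive("b", "rect", "grass", "default"),
    Primitive("c", "rect", "wood", "default"),   # texture change
    Primitive("d", "line", "wood", "default"),   # primitive type change
    Primitive("e", "line", "wood", "default"),
]
batches = build_batches(prims)
print(batches)  # [['a', 'b'], ['c'], ['d', 'e']]
```

Five primitives end up in three batches here. If every primitive used a different texture, each one would land in its own batch.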

Determining the rendering order

The question arises, if only similar items can be drawn together in a batch, why don't we look through all the items in a scene, group together all the similar items, and draw them together?

In 3D, this is often exactly how engines work. However, in Godot's 2D renderer, items are drawn in "painter's order", from back to front. This ensures that items at the front are drawn on top of earlier items when they overlap.

This also means that if we try to draw objects on a per-texture basis, this painter's order may break and objects will be drawn in the wrong order.

In Godot, this back-to-front order is determined by:

  • The order of objects in the scene tree.

  • The Z index of objects.

  • CanvasLayers.

  • YSort nodes.

Note

You can group similar objects together for easier batching. This is not a requirement; think of it as an optional approach that can improve performance in some cases. See the Diagnostics section to help you make this decision.

A trick

And now, a sleight of hand. Even though the idea of painter's order is that objects are rendered from back to front, consider 3 objects A, B and C, that contain 2 different textures: grass and wood.

../../_images/overlap1.png

In painter's order they are ordered:

A - wood
B - grass
C - wood

Because of the texture changes, they can't be batched and will be rendered in 3 draw calls.

However, painter's order is only needed on the assumption that they will be drawn on top of each other. If we relax that assumption, i.e. if none of these 3 objects are overlapping, there is no need to preserve painter's order. The rendered result will be the same. What if we could take advantage of this?

Item reordering

../../_images/overlap2.png

It turns out that we can reorder items. However, we can only do this if the items satisfy the conditions of an overlap test, to ensure that the end result will be the same as if they were not reordered. The overlap test is very cheap in performance terms, but not absolutely free, so there is a slight cost to looking ahead to decide whether items can be reordered. The number of items to look ahead for reordering can be set in the project settings (see below), in order to balance the costs and benefits in your project.

A - wood
C - wood
B - grass

Since the texture only changes once, we can render the above in only 2 draw calls.
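The reordering idea can be sketched as follows. This is a simplified illustration under stated assumptions, not the engine's code: items are reduced to an axis-aligned rectangle and a texture name, and the Item type, overlaps(), reorder(), and count_texture_runs() are hypothetical helpers. An item is only moved forward if it does not overlap any of the items it would jump over.

```python
from typing import NamedTuple


class Item(NamedTuple):
    # Hypothetical item: a name, a texture, and a rect (x, y, width, height).
    name: str
    texture: str
    rect: tuple


def overlaps(a, b):
    """Axis-aligned rectangle overlap test."""
    ax, ay, aw, ah = a.rect
    bx, by, bw, bh = b.rect
    return ax < bx + bw and bx < ax + aw and ay < by + bh and by < ay + ah


def reorder(items, lookahead):
    """Pull forward items that match the previous texture when it is safe."""
    remaining = list(items)
    ordered = []
    while remaining:
        pick = 0
        if ordered and remaining[0].texture != ordered[-1].texture:
            wanted = ordered[-1].texture
            last = min(lookahead, len(remaining) - 1)
            for j in range(1, last + 1):
                candidate = remaining[j]
                if candidate.texture != wanted:
                    continue
                # Safe only if the candidate overlaps none of the skipped items.
                if not any(overlaps(candidate, s) for s in remaining[:j]):
                    pick = j
                    break
        ordered.append(remaining.pop(pick))
    return ordered


def count_texture_runs(items):
    """Number of runs of identical textures, i.e. draw calls in this model."""
    runs, previous = 0, None
    for item in items:
        if item.texture != previous:
            runs += 1
            previous = item.texture
    return runs


scene = [
    Item("A", "wood", (0, 0, 10, 10)),
    Item("B", "grass", (20, 0, 10, 10)),
    Item("C", "wood", (40, 0, 10, 10)),
]
reordered = reorder(scene, lookahead=8)
print([i.name for i in reordered])                               # ['A', 'C', 'B']
print(count_texture_runs(scene), count_texture_runs(reordered))  # 3 2
```

If C overlapped B, the overlap test would fail and the original order A, B, C would be kept. Setting lookahead to 0 disables reordering in this sketch.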

Lights

Although the batching system's job is normally quite straightforward, it becomes considerably more complex when 2D lights are used. This is because lights are drawn using additional passes, one for each light affecting the primitive. Consider 2 sprites A and B, with identical texture and material. Without lights, they would be batched together and drawn in one draw call. But with 3 lights, they would be drawn as follows, each line being a draw call:

../../_images/lights_overlap.png
A
A - light 1
A - light 2
A - light 3
B
B - light 1
B - light 2
B - light 3

That is a lot of draw calls: 8 for only 2 sprites. Now, consider we are drawing 1,000 sprites. The number of draw calls quickly becomes astronomical and performance suffers. This is partly why lights have the potential to drastically slow down 2D rendering.

However, if you remember our magician's trick from item reordering, it turns out we can use the same trick to get around painter's order for lights!

If A and B are not overlapping, we can render them together in a batch, so the drawing process is as follows:

../../_images/lights_separate.png
AB
AB - light 1
AB - light 2
AB - light 3

That is only 4 draw calls. Not bad, as that is a 2× reduction. However, consider that in a real game, you might be drawing closer to 1,000 sprites.

  • Before: 1000 × 4 = 4,000 draw calls.

  • After: 1 × 4 = 4 draw calls.

That is a 1000× decrease in draw calls, and should give a huge increase in performance.
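The arithmetic above follows a simple pattern: each batch needs one base pass plus one pass per light. A minimal sketch, assuming every light affects every batch (lit_draw_calls() is a hypothetical helper, not an engine function):

```python
def lit_draw_calls(batches, lights):
    """One base pass plus one extra pass per light, for every batch."""
    return batches * (1 + lights)


print(lit_draw_calls(2, 3))     # 8: sprites A and B drawn separately
print(lit_draw_calls(1000, 3))  # 4000: 1,000 unbatched sprites
print(lit_draw_calls(1, 3))     # 4: everything joined into one batch
```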

Overlap test

However, as with item reordering, things are not that simple. We must first perform the overlap test to determine whether we can join these primitives. This overlap test has a small cost. Again, you can choose the number of primitives to look ahead in the overlap test to balance the benefits against the cost. With lights, the benefits usually far outweigh the costs.

Also consider that depending on the arrangement of primitives in the viewport, the overlap test will sometimes fail (because the primitives overlap and therefore shouldn't be joined). In practice, the decrease in draw calls may be less dramatic than in a perfect situation with no overlapping at all. However, performance is usually far higher than without this lighting optimization.

Light scissoring

Batching can make it more difficult to cull out objects that are not affected or partially affected by a light. This can increase the fill rate requirements quite a bit and slow down rendering. Fill rate is the rate at which pixels are colored. It is another potential bottleneck unrelated to draw calls.

To counter this problem (and speed up lighting in general), batching introduces light scissoring. This enables the use of the OpenGL command glScissor(), which identifies an area outside of which the GPU won't render any pixels. We can greatly optimize fill rate by identifying the intersection area between a light and a primitive, and limiting rendering of the light to that area only.

Light scissoring is controlled with the scissor_area_threshold project setting. This value is between 1.0 and 0.0, with 1.0 being off (no scissoring), and 0.0 being scissoring in every circumstance. The reason for the setting is that there may be some small cost to scissoring on some hardware. That said, scissoring should usually result in performance gains when you're using 2D lighting.

The relationship between the threshold and whether a scissor operation takes place is not always straightforward. Generally, it represents the pixel area that is potentially "saved" by a scissor operation (i.e. the fill rate saved). At 1.0, the entire screen's pixels would need to be saved, which rarely (if ever) happens, so scissoring is effectively switched off. In practice, the useful values are close to 0.0, as only a small percentage of pixels needs to be saved for the operation to be useful.

The exact relationship is probably not necessary for users to worry about, but is included in the appendix out of interest: Light scissoring threshold calculation

Light scissoring example diagram

Bottom right is a light, the red area is the pixels saved by the scissoring operation. Only the intersection needs to be rendered.
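The intersection described above can be sketched with plain rectangles. This is an illustrative model only: rectangles are (x, y, width, height) tuples, and both intersect() and pixels_saved() are hypothetical helpers. In particular, counting "saved" pixels as the primitive's area minus the intersection area is an assumption made for this example.

```python
def intersect(a, b):
    """Return the overlapping (x, y, width, height) of two rects, or None."""
    ax, ay, aw, ah = a
    bx, by, bw, bh = b
    x0, y0 = max(ax, bx), max(ay, by)
    x1, y1 = min(ax + aw, bx + bw), min(ay + ah, by + bh)
    if x1 <= x0 or y1 <= y0:
        return None  # the light does not touch the primitive at all
    return (x0, y0, x1 - x0, y1 - y0)


def pixels_saved(primitive, light):
    """Pixels of the primitive that the light pass can skip (assumed model)."""
    _, _, pw, ph = primitive
    region = intersect(primitive, light)
    lit_area = 0 if region is None else region[2] * region[3]
    return pw * ph - lit_area


primitive = (0, 0, 100, 100)
light = (60, 60, 100, 100)        # light towards the bottom right
print(intersect(primitive, light))     # (60, 60, 40, 40)
print(pixels_saved(primitive, light))  # 8400
```

Only the 40×40 intersection needs the light pass; the rectangle returned by intersect() is the kind of region that would be handed to a scissor call.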

Vertex baking

The GPU shader receives instructions on what to draw in 2 main ways:

  • Shader uniforms (e.g. modulate color, item transform).

  • Vertex attributes (vertex color, local transform).

However, within a single draw call (batch), we cannot change a uniform. This means that, naively, we would not be able to batch together items or commands that change final_modulate or an item's transform. Unfortunately, that happens in a great many cases. For instance, sprites are typically individual nodes with their own item transform, and they may have their own color modulate as well.

To get around this problem, the batching can "bake" some of the uniforms into the vertex attributes.

  • The item transform can be combined with the local transform and sent in a vertex attribute.

  • The final modulate color can be combined with the vertex colors, and sent in a vertex attribute.

In most cases, this works fine, but this shortcut breaks down if a shader expects these values to be available individually rather than combined. This can happen in custom shaders.
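To make the idea of baking concrete, the sketch below folds a per-item transform and modulate color into per-vertex data on the CPU, so that many items could share one set of uniforms. It is a simplified model rather than the renderer's code: the 2×3 affine transform layout and the apply_transform() and bake() helpers are assumptions made for this example.

```python
def apply_transform(transform, point):
    """Apply a 2x3 affine transform ((a, b, tx), (c, d, ty)) to a 2D point."""
    (a, b, tx), (c, d, ty) = transform
    x, y = point
    return (a * x + b * y + tx, c * x + d * y + ty)


def bake(vertices, colors, item_transform, modulate):
    """Fold per-item uniforms into per-vertex positions and colors."""
    baked_positions = [apply_transform(item_transform, v) for v in vertices]
    baked_colors = [
        tuple(channel * m for channel, m in zip(color, modulate))
        for color in colors
    ]
    return baked_positions, baked_colors


quad = [(0, 0), (10, 0), (10, 10), (0, 10)]
white = [(1.0, 1.0, 1.0, 1.0)] * 4
move_right_down = ((1, 0, 100), (0, 1, 50))  # translation by (100, 50)
reddish = (1.0, 0.5, 0.5, 1.0)               # item modulate color

positions, colors = bake(quad, white, move_right_down, reddish)
print(positions)  # [(100, 50), (110, 50), (110, 60), (100, 60)]
print(colors[0])  # (1.0, 0.5, 0.5, 1.0)
```

After baking, the shader only sees the combined values. That is why a custom shader that needs the original vertex color or the untransformed vertex position cannot use this shortcut.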

Custom shaders

As a result of the limitation described above, certain operations in custom shaders will prevent vertex baking and therefore decrease the potential for batching. While we are working to decrease these cases, the following caveats currently apply:

  • Reading or writing COLOR or MODULATE disables vertex color baking.

  • Reading VERTEX disables vertex position baking.

Project Settings

To fine-tune batching, a number of project settings are available. You can usually leave these at default during development, but it's a good idea to experiment to ensure you are getting maximum performance. Spending a little time tweaking parameters can often give considerable performance gains for very little effort. See the on-hover tooltips in the Project Settings for more information.

rendering/batching/options

  • use_batching - Turns batching on or off.

  • use_batching_in_editor - Turns batching on or off in the Godot editor. This setting doesn't affect the running project in any way.

  • single_rect_fallback - This is a faster way of drawing unbatchable rectangles. However, it may lead to flicker on some hardware so it's not recommended.

rendering/batching/parameters

  • max_join_item_commands - One of the most important ways of achieving batching is to join suitable adjacent items (nodes) together. However, they can only be joined if the commands they contain are compatible. The system must therefore look ahead through the commands in an item to determine whether it can be joined. This has a small cost per command, and items with a large number of commands are not worth joining, so the best value may be project dependent.

  • colored_vertex_format_threshold - Baking colors into vertices results in a larger vertex format. This is not necessarily worth doing unless there are a lot of color changes going on within a joined item. This parameter represents the proportion of commands containing color changes / the total commands, above which it switches to baked colors.

  • batch_buffer_size - This determines the maximum size of a batch. It doesn't have a huge effect on performance, but can be worth decreasing on mobile if RAM is at a premium.

  • item_reordering_lookahead - Item reordering can help especially with interleaved sprites using different textures. The lookahead for the overlap test has a small cost, so the best value may change per project.

rendering/batching/lights

  • scissor_area_threshold - See light scissoring.

  • max_join_items - Joining items before lighting can significantly increase performance. This requires an overlap test, which has a small cost, so the costs and benefits, and hence the best value to use here, may be project dependent.

rendering/batching/debug

  • flash_batching - This is purely a debugging feature to identify regressions between the batching and legacy renderer. When it is switched on, the batching and legacy renderer are used alternately on each frame. This will decrease performance, and should not be used for your final export, only for testing.

  • diagnose_frame - This will periodically print a diagnostic batching log to the Godot IDE / console.

rendering/batching/precision

  • uv_contract - On some hardware (notably some Android devices) there have been reports of tilemap tiles drawing slightly outside their UV range, leading to edge artifacts such as lines around tiles. If you see this problem, try enabling uv contract. This makes a small contraction in the UV coordinates to compensate for precision errors on devices.

  • uv_contract_amount - Hopefully, the default amount should cure artifacts on most devices, but this value remains adjustable just in case.

Diagnostics

Although you can change parameters and examine the effect on frame rate, this can feel like working blindly, with no idea of what is going on under the hood. To help with this, batching offers a diagnostic mode, which periodically prints (to the IDE or console) a list of the batches that are being processed. This can help pinpoint situations where batching isn't occurring as intended, and help you fix them to get the best possible performance.

Reading a diagnostic

canvas_begin FRAME 2604
items
    joined_item 1 refs
            batch D 0-0
            batch D 0-2 n n
            batch R 0-1 [0 - 0] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
canvas_end

This is a typical diagnostic.

  • joined_item: A joined item can contain 1 or more references to items (nodes). Generally, a few joined_items containing many references are preferable to many joined_items containing a single reference. Whether items can be joined is determined by their contents and their compatibility with the previous item.

  • batch R: A batch containing rectangles. The second number is the number of rects. The second number in square brackets is the Godot texture ID, and the numbers in curly braces are the color. If the batch contains more than one rect, MULTI is added to the line to make it easy to identify. Seeing MULTI is good, as it indicates successful batching.

  • batch D: A default batch, containing everything else that is not currently batched.

Default batches

The second number following default batches is the number of commands in the batch, and it is followed by a brief summary of the contents:

l - line
PL - polyline
r - rect
n - ninepatch
PR - primitive
p - polygon
m - mesh
MM - multimesh
PA - particles
c - circle
t - transform
CI - clip_ignore

You may see "dummy" default batches containing no commands; you can ignore those.
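If you want to compare diagnostics before and after a change, a small script can tally the log for you. The sketch below is a hypothetical helper, not part of Godot: it assumes the log text has the same shape as the sample above, and summarize() and the summary keys are names invented for this example.

```python
import re

# Matches lines such as "batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI".
BATCH_RE = re.compile(r"batch\s+(\w)\s+(\d+)-(\d+)")


def summarize(log):
    """Count joined items and batches in a diagnostic log string."""
    summary = {
        "joined_items": 0,
        "default_batches": 0,
        "rect_batches": 0,
        "multi_batches": 0,
        "rects": 0,
    }
    for raw_line in log.splitlines():
        line = raw_line.strip()
        if line.startswith("joined_item"):
            summary["joined_items"] += 1
            continue
        match = BATCH_RE.match(line)
        if match is None:
            continue
        kind, count = match.group(1), int(match.group(3))
        if kind == "R":
            summary["rect_batches"] += 1
            summary["rects"] += count  # second number is the rect count
            if line.endswith("MULTI"):
                summary["multi_batches"] += 1
        elif kind == "D":
            summary["default_batches"] += 1
    return summary


sample = """canvas_begin FRAME 2604
items
    joined_item 1 refs
            batch D 0-0
            batch R 0-1 [0 - 146] {255 255 255 255 }
    joined_item 1 refs
            batch D 0-0
            batch R 0-2560 [0 - 144] {158 193 0 104 } MULTI
canvas_end"""
result = summarize(sample)
print(result["rect_batches"], result["multi_batches"], result["rects"])  # 2 1 2561
```

A high ratio of rects to rect batches suggests that batching is working well for the frame.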

Frequently asked questions

I don't get a large performance increase when enabling batching.

  • Try the diagnostics to see how much batching is occurring and whether it can be improved.

  • Try changing batching parameters in the Project Settings.

  • Consider that batching may not be your bottleneck (see bottlenecks).

I get a decrease in performance with batching.

  • Try the steps described above to increase the number of batching opportunities.

  • Try enabling single_rect_fallback.

  • The single rect fallback method is the default used without batching, and it is approximately twice as fast. However, it can result in flickering on some hardware, so its use is discouraged.

  • After trying the above, if your scene is still performing worse, consider turning off batching.

I use custom shaders and the items are not batching.

  • Custom shaders can be problematic for batching. See the custom shaders section.

I am seeing line artifacts appear on certain hardware.

  • See the uv_contract project setting which can be used to solve this problem.

I use a large number of textures, so few items are being batched.

  • Consider using texture atlases. As well as allowing batching, these reduce the need for state changes associated with changing textures.

Appendix

Batched primitives

Not all primitives can be batched. Batching is not guaranteed either, especially with primitives using an antialiased border. The following primitive types are currently available:

  • RECT

  • NINEPATCH (depending on wrapping mode)

  • POLY

  • LINE

With non-batched primitives, you may be able to get better performance by drawing them manually with polys in a _draw() function. See Custom drawing in 2D for more information.

Light scissoring threshold calculation

The actual proportion of screen pixel area used as the threshold is the scissor_area_threshold value to the power of 4.

For example, on a screen size of 1920×1080, there are 2,073,600 pixels.

At a threshold of 1,000 pixels, the proportion would be:

1000 / 2073600 = 0.00048225
0.00048225 ^ (1/4) = 0.14819

So a scissor_area_threshold of 0.15 would be a reasonable value to try.

Going the other way, for instance with a scissor_area_threshold of 0.5:

0.5 ^ 4 = 0.0625
0.0625 * 2073600 = 129600 pixels

If the number of pixels saved is greater than this threshold, the scissor is activated.
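The two conversions above can be written as a pair of small functions. This is a sketch of the arithmetic only; threshold_to_pixels(), pixels_to_threshold(), and scissor_is_activated() are hypothetical names, not engine functions.

```python
def threshold_to_pixels(threshold, width, height):
    """Pixel area that must be saved before the scissor is activated."""
    return threshold ** 4 * width * height


def pixels_to_threshold(pixels, width, height):
    """scissor_area_threshold value that corresponds to a pixel area."""
    return (pixels / (width * height)) ** 0.25


def scissor_is_activated(pixels_saved, threshold, width, height):
    """The scissor is used when more pixels are saved than the threshold."""
    return pixels_saved > threshold_to_pixels(threshold, width, height)


print(threshold_to_pixels(0.5, 1920, 1080))   # 129600.0
print(pixels_to_threshold(1000, 1920, 1080))  # approximately 0.148
print(scissor_is_activated(200000, 0.5, 1920, 1080))  # True
```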