A Tiny OpenGL 2D Batch Renderer

A grid of 2D sprites

If you rather read code than words, the source is available on GitHub.

This article describes a small batch renderer written in C using modern OpenGL. The goal is to write the minimum amount to code to reduce draw calls. It's not going to be the fastest or the most efficient implementation, but it's probably one of the smallest, for learning purposes.

Below is an example of how the batch renderer can be used to draw a grid of sprites representing aliens:

typedef struct {
  // position
  float px, py;
  // texcoords
  float tx, ty, tw, th;
} Alien;

void draw_alien(BatchRenderer *renderer, Texture tex, Alien a) {
  r_texture(renderer, tex.id);

  float x1 = a.px;
  float y1 = a.py;
  float x2 = a.px + 48;
  float y2 = a.py + 48;

  float u1 = a.tx / tex.width;
  float v1 = a.ty / tex.height;
  float u2 = (a.tx + a.tw) / tex.width;
  float v2 = (a.ty + a.th) / tex.height;

  r_push_vertex(renderer, x1, y1, u1, v1);
  r_push_vertex(renderer, x2, y2, u2, v2);
  r_push_vertex(renderer, x1, y2, u1, v2);

  r_push_vertex(renderer, x1, y1, u1, v1);
  r_push_vertex(renderer, x2, y1, u2, v1);
  r_push_vertex(renderer, x2, y2, u2, v2);
}

int main(void) {
  // ...
  BatchRenderer renderer = create_renderer(6000);
  Texture tex_aliens = create_texture("aliens.png");

  struct {
    float x, y, w, h;
  } alien_uvs[] = {
    {2, 2, 24, 24},   {58, 2, 24, 24}, {114, 2, 24, 24},
    {170, 2, 24, 24}, {2, 30, 24, 24},
  };

  while (!glfwWindowShouldClose(window)) {
    // ...
    r_mvp(&renderer, mat_ortho(0, (float)width, (float)height, 0, -1.0f, 1.0f));

    float y = 0;
    for (int i = 0; i < 15; i++) {
      float x = 0;
      for (int j = 0; j < 4; j++) {
        for (int k = 0; k < 5; k++) {
          Alien ch = {
            .px = x,
            .py = y,
            .tx = alien_uvs[k].x,
            .ty = alien_uvs[k].y,
            .tw = alien_uvs[k].w,
            .th = alien_uvs[k].h,
          };
          draw_alien(&renderer, tex_aliens, ch);
          x += 48;
        }
      }
      y += 48;
    }

    r_flush(&renderer);
    // ...
  }
}

This (incomplete) program produces the image at the top of this article.

How Batch Rendering Works

Without batch rendering, this is how drawing a grid of aliens might look like:

void draw_alien() {
  // ...
  glBindVertexArray(vao);
  glDrawArrays(GL_TRIANGLES, 0, 6); // or glDrawElements()
}

int main() {
  while (...) {
    for (...) {
      draw_alien();
    }
  }
}

Each alien performs a draw call. Six vertices are drawn from the vertex array, which probably describes two triangles that make up a single quad.

A batch renderer avoids making draw calls by setting up a large dynamic vertex buffer object, where vertex data gets written to a buffer every frame and then the buffer gets used in a single draw call. The data itself may contain multiple quads with different vertex positions, texture coordinates, and perhaps tint colour.

A draw call should be performed:

When the vertex buffer reaches its capacity
When any uniform values need to change
When it's the end of the frame

After submitting a draw call, the vertex buffer is "flushed" to set up for the next draw call.

void draw_alien() {
  // nothing is actually drawn at this stage.
  // data is just being written to a buffer
  r_push_vertex(...); // x6
}

int main() {
  while (...) {
    for (...) {
      draw_alien();
    }
    r_flush(); // this is where glDrawArrays actually gets called
  }
}

In summary, instead of naively making a draw call for each alien:

draw_alien() writes to a buffer, and then r_flush() performs a single draw call.

By reducing the number of draw calls, we can increase the performance of our program.

Implementation

The BatchRenderer struct contains a shader program shader, a vertex array object vao, a vertex buffer object vbo, an array of vertices in CPU memory vertices, and values that get bound to uniforms texture and mvp.

typedef struct { float cols[4][4]; } Matrix;
typedef struct { float position[2]; float texcoord[2]; } Vertex;

typedef struct {
  GLuint shader;

  // vertex buffer data
  GLuint vao;
  GLuint vbo;
  int vertex_count;
  int vertex_capacity;
  Vertex *vertices;

  // uniform values
  GLuint texture;
  Matrix mvp;
} BatchRenderer;

The vertices buffer is needed to write vertex data somewhere in CPU memory, and the vertex buffer object vbo is needed to read vertex data in GPU memory. To avoid confusion, I'll be referring to the vertex buffer object as the GPU vertex buffer or VBO, and the array of vertices in CPU memory as the CPU vertex buffer.

The following function initializes a batch renderer with a given capacity for the CPU vertex buffer:

BatchRenderer create_renderer(int vertex_capacity) {
  GLuint vao;
  glGenVertexArrays(1, &vao);
  glBindVertexArray(vao);

  // create the dynamic vertex buffer
  GLuint vbo;
  glGenBuffers(1, &vbo);
  glBindBuffer(GL_ARRAY_BUFFER, vbo);
  glBufferData(GL_ARRAY_BUFFER, sizeof(Vertex) * vertex_capacity, NULL,
               GL_DYNAMIC_DRAW);

  glEnableVertexAttribArray(0);
  glVertexAttribPointer(0, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                        (void *)offsetof(Vertex, position));

  glEnableVertexAttribArray(1);
  glVertexAttribPointer(1, 2, GL_FLOAT, GL_FALSE, sizeof(Vertex),
                        (void *)offsetof(Vertex, texcoord));

  const char *vertex =
      "#version 330 core\n"
      "layout(location=0) in vec2 a_position;\n"
      "layout(location=1) in vec2 a_texindex;\n"
      "out vec2 v_texindex;\n"
      "uniform mat4 u_mvp;\n"
      "void main() {\n"
      "    gl_Position = u_mvp * vec4(a_position, 0.0, 1.0);\n"
      "    v_texindex = a_texindex;\n"
      "}\n";

  const char *fragment = "#version 330 core\n"
                         "in vec2 v_texindex;\n"
                         "out vec4 f_color;\n"
                         "uniform sampler2D u_texture;\n"
                         "void main() {\n"
                         "    f_color = texture(u_texture, v_texindex);\n"
                         "}\n";

  GLuint program = load_shader(vertex, fragment);

  return (BatchRenderer){
    .shader = program,
    .vao = vao,
    .vbo = vbo,
    .vertex_count = 0,
    .vertex_capacity = vertex_capacity,
    .vertices = malloc(sizeof(Vertex) * vertex_capacity),
    .texture = 0,
    .mvp = {0},
  };
}

The main takeaway is the call to glBufferData(). The usage pattern for the GPU vertex buffer is GL_DYNAMIC_DRAW instead of something like GL_STATIC_DRAW. This allows us to change the contents of the GPU vertex buffer in the future using the glBufferSubData() function. The data parameter is NULL, which means the VBO data living in VRAM is uninitialized.

I won't go into detail about the load_shader() function since it's nothing special. The function returns the result of glCreateProgram() after compiling the given vertex and fragment shaders. The shaders themselves consist of simple GLSL that would be expected from a basic 2D renderer.

The memory allocated for vertices should be the same size as the VBO, which is sizeof(Vertex) * vertex_capacity. This array of vertices living in CPU memory is the buffer that will be written to whenever something needs to be drawn, but it should not be mutated directly. r_push_vertex() should be used instead, which does some housekeeping by incrementing vertex_count by one and it checks when the CPU vertex buffer is at capacity.

void r_push_vertex(BatchRenderer *renderer, float x, float y, float u,
                   float v) {
  if (renderer->vertex_count == renderer->vertex_capacity) {
    r_flush(renderer);
  }

  renderer->vertices[renderer->vertex_count++] = (Vertex){
    .position = {x, y},
    .texcoord = {u, v},
  };
}

When the CPU vertex buffer reaches its capacity, then we have no more room to add another vertex. The solution is to submit a draw call and empty the CPU vertex buffer with r_flush().

void r_flush(BatchRenderer *renderer) {
  if (renderer->vertex_count == 0) {
    return;
  }

  glUseProgram(renderer->shader);

  glActiveTexture(GL_TEXTURE0);
  glBindTexture(GL_TEXTURE_2D, renderer->texture);

  glUniform1i(glGetUniformLocation(renderer->shader, "u_texture"), 0);
  glUniformMatrix4fv(glGetUniformLocation(renderer->shader, "u_mvp"), 1,
                     GL_FALSE, renderer->mvp.cols[0]);

  glBindBuffer(GL_ARRAY_BUFFER, renderer->vbo);
  glBufferSubData(GL_ARRAY_BUFFER, 0, sizeof(Vertex) * renderer->vertex_count,
                  renderer->vertices);

  glBindVertexArray(renderer->vao);
  glDrawArrays(GL_TRIANGLES, 0, renderer->vertex_count);

  renderer->vertex_count = 0;
}

This is where glBufferSubData() comes into play. glBufferSubData() is analogous to memcpy() except it copies from CPU memory into GPU memory instead of exclusively working in CPU memory. 0 is the offset where data is copied to (0 being the start of the VBO data). sizeof(Vertex) * renderer->vertex_count is the number of bytes to copy. renderer->vertices is the data to copy from.

After binding the uniform values, and copying the CPU vertex buffer data to GPU memory, a draw call is made with glDrawArrays(), and vertex_count is set to 0 to set up for the next call to r_flush.

Almost done. There's one case that has to be handled: whenever any uniform needs to change, the CPU vertex buffer needs to be flushed before setting the uniform. The following functions are created just for these cases:

void r_texture(BatchRenderer *renderer, GLuint id) {
  if (renderer->texture != id) {
    r_flush(renderer);
    renderer->texture = id;
  }
}

void r_mvp(BatchRenderer *renderer, Matrix mat) {
  if (memcmp(&renderer->mvp.cols, &mat.cols, sizeof(Matrix)) != 0) {
    r_flush(renderer);
    renderer->mvp = mat;
  }
}

That's it! To use the renderer, call create_renderer() during program initialization, r_push_vertex() when drawing in the main render loop, and r_flush() at the end of each frame.

The source code for the entire program is available on GitHub.

Improvements

Here is a list of changes that can be made to this renderer to improve its performance and efficiency:

Store uniform locations to avoid glGetUniformLocation() in r_flush().
Create an index buffer to improve efficiency and memory usage when exclusively drawing quads. Drawing a quad would need four calls to r_push_vertex() instead of six.
Create multiple buffers and group them based on texture to minimize state changes.