Optimizations in the GBA BIOS

2019-06-13 · Luna

Note: This was originally a Twitter thread, and has been lightly edited for presentation here, with some additional clarifications.

Looks like the prototype Game Boy Advance (GBA) BIOS has a slightly different memory layout for some of the audio-related structures. Here’s the disassembly from the prototype (left) and retail (right) versions. Look at the ldr/str instructions and notice how the offsets differ.

Screenshot of some disassembled ARM code, showing two functions side-by-side. The code is nearly identical between them, differing only in label names and the memory offsets in some load/store operations.

I was thinking about this on the way home from work last night, and I think I’ve worked out why this structure’s layout was changed before release. Buckle up, you’re gonna learn some ARM assembly for this one.

ARM actually has two instruction sets[1]: ARM and THUMB. ARM instructions each consist of 4 bytes, while THUMB instructions each consist of 2 bytes. As a result, THUMB offers a reduced feature set, but THUMB code takes significantly less program space.

The vast majority of the GBA’s BIOS is implemented in THUMB instructions, with only a few parts implemented in ARM. My understanding is that the GBA’s 16-bit bus means that THUMB code is more efficient on this platform.[2]

As a brief aside, for clarity in the rest of this article, a “halfword” consists of 2 bytes (16 bits), and a “word” consists of 4 bytes (32 bits).

We’ll look at the THUMB instruction for reading data from memory into a register. There are three variants, LDRB, LDRH, and LDR, which load a byte, halfword, and word, respectively. Here’s an example usage:

LDRB r1, [r0, #0x0]

This instruction says “there’s a memory address in register r0; offset it by 0, read a byte from that address, and load it into register r1”.

See that #0x0 on the end? That’s an offset, which gets added to the memory address; so if it were #0x5, and r0 contained the address 0x1000, then this instruction would read a byte from memory address 0x1005.

The THUMB instruction encoding allots 5 bits to this offset, so the encoded value can range from 0 to 31. However, that value is scaled by the size of what you’re reading. So for LDRB the byte offset ranges from 0 to 31, for LDRH it ranges from 0 to 62 in multiples of 2 bytes, and for LDR it ranges from 0 to 124 in multiples of 4.
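To make the 5-bit field and its scaling concrete, here’s a minimal Python sketch of how these three loads are encoded, based on my reading of the ARM7TDMI THUMB formats. The names encode_load and max_offset are made up for illustration and don’t come from any real assembler.

```python
# Sketch of the THUMB "load with immediate offset" encodings (ARM7TDMI).
# Function and table names are illustrative, not from any real toolchain.

# mnemonic -> (base opcode bits, access size in bytes)
_LOADS = {
    "LDRB": (0x7800, 1),  # byte load
    "LDRH": (0x8800, 2),  # halfword load
    "LDR": (0x6800, 4),   # word load
}


def max_offset(mnemonic):
    """Largest byte offset the 5-bit immediate field can express."""
    _, size = _LOADS[mnemonic]
    return 31 * size


def encode_load(mnemonic, rd, rb, offset):
    """Encode `mnemonic rd, [rb, #offset]` as a 16-bit THUMB instruction."""
    opcode, size = _LOADS[mnemonic]
    if not (0 <= rd <= 7 and 0 <= rb <= 7):
        raise ValueError("only the low registers r0-r7 are encodable here")
    if offset % size != 0:
        raise ValueError(f"offset must be a multiple of {size}")
    field = offset // size  # the value actually stored in the instruction
    if not 0 <= field <= 31:
        raise ValueError(f"offset {offset:#x} does not fit in 5 bits")
    return opcode | (field << 6) | (rb << 3) | rd


print(hex(encode_load("LDRB", rd=1, rb=0, offset=0x0)))    # 0x7801
print([max_offset(m) for m in ("LDRB", "LDRH", "LDR")])    # [31, 62, 124]
```

Asking for a byte load at an offset above 31 raises a ValueError here, which is the situation the rest of this article is about.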

So in a theoretical scenario, let’s say r0 contains a pointer to some structure. You want to read a single-byte member from this structure into r1. If that member is within the first 32 bytes of the structure, you can do this with a single THUMB instruction.

Similarly, if you want to read a halfword from the structure, this is trivial if it’s within the first 64 bytes of the structure (and aligned on a 2-byte boundary). But if it’s further into the structure than that, then you’re going to need to do more work.

The audio structures in the prototype GBA BIOS contain some byte-sized members further than 31 bytes from the start of the structure. In the disassembled code I saw a few places where there was manual pointer arithmetic going on to access these members.

Meanwhile in the retail BIOS, it appears that the structures have been rearranged to put all of the byte-sized members at the start of the structure, followed by the halfword-sized members, and then the word-sized members.
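Here’s a rough Python sketch of why that ordering helps. The member names and counts are entirely hypothetical (I haven’t mapped out the real structure); they’re just chosen so that the first byte-sized member starts out at offset 0x2e. The sketch assigns naturally aligned offsets, then lists the members that a single THUMB load with an immediate offset couldn’t reach.

```python
# Hypothetical structure members as (name, size_in_bytes).
# This is NOT the real BIOS layout, just an illustration.
MEMBERS = (
    [(f"word_{i}", 4) for i in range(10)]    # 40 bytes of word members
    + [(f"half_{i}", 2) for i in range(3)]   # then three halfwords
    + [(f"byte_{i}", 1) for i in range(4)]   # then four bytes
)


def layout(members):
    """Assign naturally aligned offsets to members in the given order."""
    offsets, cursor = {}, 0
    for name, size in members:
        cursor = (cursor + size - 1) // size * size  # round up to alignment
        offsets[name] = cursor
        cursor += size
    return offsets


def out_of_range(members):
    """Members whose offset exceeds the 5-bit scaled immediate (31 * size)."""
    offsets = layout(members)
    return [name for name, size in members if offsets[name] > 31 * size]


# Smallest members first: bytes, then halfwords, then words.
REORDERED = sorted(MEMBERS, key=lambda member: member[1])

print(hex(layout(MEMBERS)["byte_0"]))  # 0x2e
print(out_of_range(MEMBERS))           # ['byte_0', 'byte_1', 'byte_2', 'byte_3']
print(out_of_range(REORDERED))         # []
```

Putting the smallest members first works because they have the shortest reach (31 bytes), while word-sized members can sit as far out as offset 124.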

Compare this excerpt from the prototype (left) and retail (right) versions. The pointer’s in r7. The prototype has a byte-sized member at offset 0x2e (46), which is beyond the 31-byte reach of LDRB, so it uses add instructions to offset the pointer first. The retail version has moved this member to offset 0x8.

Screenshot of more ARM disassembly, again showing two functions side-by-side. The left function contains three highlighted instructions performing pointer arithmetic to perform a long-distance load, while the right function has just a single instruction highlighted, performing that same load.
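Since the screenshot doesn’t survive as text, here’s a small Python simulation of the idea. The instruction sequences are my own plausible reconstruction, not a transcription of the BIOS code: I’m assuming the prototype copies the pointer, adds 0x2e, and then loads with a zero offset, while the retail code loads directly at offset 0x8. The run function, the base address 0x100, and the value 0x5a are all invented for the example.

```python
def run(program, regs, memory):
    """Execute a tiny subset of THUMB-like operations on a register dict."""
    for op, *args in program:
        if op == "mov":      # mov rd, rs
            rd, rs = args
            regs[rd] = regs[rs]
        elif op == "add":    # add rd, #imm8
            rd, imm = args
            regs[rd] = (regs[rd] + imm) & 0xFFFFFFFF
        elif op == "ldrb":   # ldrb rd, [rb, #offset]
            rd, rb, offset = args
            if not 0 <= offset <= 31:
                raise ValueError("byte offset does not fit in 5 bits")
            regs[rd] = memory[regs[rb] + offset]
        else:
            raise ValueError(f"unknown operation {op}")
    return regs


BASE = 0x100  # made-up address of the structure, held in r7

# Assumed prototype sequence: the member lives at offset 0x2e.
prototype = [("mov", "r0", "r7"), ("add", "r0", 0x2E), ("ldrb", "r0", "r0", 0)]
proto_mem = bytearray(0x200)
proto_mem[BASE + 0x2E] = 0x5A

# Assumed retail sequence: the same member now lives at offset 0x8.
retail = [("ldrb", "r0", "r7", 0x8)]
retail_mem = bytearray(0x200)
retail_mem[BASE + 0x8] = 0x5A

proto_r0 = run(prototype, {"r0": 0, "r7": BASE}, proto_mem)["r0"]
retail_r0 = run(retail, {"r0": 0, "r7": BASE}, retail_mem)["r0"]

# Every THUMB instruction is 2 bytes, so code size is 2 * instruction count.
print(hex(proto_r0), 2 * len(prototype))   # 0x5a 6
print(hex(retail_r0), 2 * len(retail))     # 0x5a 2
```

Both versions end up with the same byte in r0, but under these assumptions the retail version does it in 2 bytes of code instead of 6.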

As a result of all this, the retail BIOS code is slightly smaller and faster, just by rearranging the memory layout of a few structures to be better optimized for the hardware’s instruction set. Neat, huh?

  1. Well, the ARMv4T architecture used by the GBA’s ARM7TDMI chip does, at least. 

  2. More specifically, the 16-bit data bus means that fetching a single 32-bit ARM instruction from the cartridge ROM takes at least two reads, while fetching a 16-bit THUMB instruction only takes one. Many games execute ARM code from the fast on-die RAM for specific routines such as software rasterizers, where the more sophisticated features of the ARM instruction set are required and performance is paramount, while running the rest of the game in THUMB mode.