Note: This was originally a Twitter thread, and has been lightly edited for presentation here, with some additional clarifications.
Looks like the prototype Game Boy Advance (GBA) BIOS has a slightly different memory layout for some of the audio-related structures. Disassembly from the prototype (left) and retail (right) versions. Look at the ldr/str instructions and notice how the offsets differ.
I was thinking about this on the way home from work last night, and I think I’ve worked out why this structure’s layout was changed before release. Buckle up, you’re gonna learn some ARM assembly for this one.
ARM actually has two instruction sets: ARM and THUMB. ARM instructions each consist of 4 bytes, while THUMB instructions each consist of 2 bytes. As a result, THUMB offers a more reduced feature set, but takes significantly less program space.
The vast majority of the GBA’s BIOS is implemented in THUMB instructions, with only a few parts implemented in ARM. My understanding is that the GBA’s 16-bit bus means that THUMB code is more efficient on this platform.
As a brief aside, for clarity in the rest of this article, a “halfword” consists of 2 bytes (16 bits), and a “word” consists of 4 bytes (32 bits).
We’ll look at the THUMB instructions for reading data from memory into a register. There are three variants, LDRB, LDRH, and LDR, which load a byte, halfword, and word, respectively. Here’s an example usage:
LDRB r1, [r0, #0x0]
This instruction says “there’s a memory address in register r0; offset it by 0, read a byte from that address, and load it into register r1”. See the #0x0 on the end? That’s an offset, which gets added to the memory address; so if r0 contained the address 0x1000, then this instruction would read a byte from memory address 0x1000 itself, and a non-zero offset would move the read that many bytes further on.
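To make the address arithmetic concrete, here is a minimal Python sketch that models what LDRB does. The `ldrb` helper, the memory image and the 0x4 offset are illustrative assumptions of mine, not anything taken from the BIOS.

```python
def ldrb(memory: bytes, base: int, offset: int) -> int:
    """Model LDRB rd, [rb, #offset]: read one byte at base + offset."""
    address = base + offset
    return memory[address]

# Hypothetical memory image with marker bytes at 0x1000 and 0x1004.
memory = bytearray(0x1010)
memory[0x1000] = 0xAA
memory[0x1004] = 0xBB

r0 = 0x1000
r1 = ldrb(memory, r0, 0x0)  # reads address 0x1000
r2 = ldrb(memory, r0, 0x4)  # reads address 0x1004
```

With the offset at 0 the read hits the address in r0 directly; with an offset of 4 it lands four bytes later.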
The THUMB instruction encoding allows 5 bits for this offset, so the encoded value can range from 0 to 31. However, it’s scaled by the size of what you’re reading. So for LDRB it can range from 0 to 31 bytes, for LDRH it can range from 0 to 62 in multiples of 2 bytes, and for LDR it can range from 0 to 124 in multiples of 4.
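As a rough illustration of where those limits come from, the sketch below packs an immediate-offset load into its 16-bit THUMB encoding. The bit layout is the ARMv4T one (5-bit offset in bits 6 to 10, base register in bits 3 to 5, destination in bits 0 to 2); the `encode_load` function and its error messages are my own naming, so treat it as a sketch rather than a real assembler.

```python
# Base opcode (load bit set, all fields zero) and access size in bytes.
# LDR/LDRB use the "load/store with immediate offset" format,
# LDRH the "load/store halfword" format.
LOADS = {
    "LDRB": (0x7800, 1),
    "LDRH": (0x8800, 2),
    "LDR": (0x6800, 4),
}

def encode_load(mnemonic: str, rd: int, rb: int, offset: int) -> int:
    """Encode e.g. LDRB rd, [rb, #offset] as a 16-bit THUMB instruction."""
    base_opcode, size = LOADS[mnemonic]
    if not (0 <= rd <= 7 and 0 <= rb <= 7):
        raise ValueError("only low registers r0-r7 fit in the 3-bit fields")
    if offset < 0 or offset % size != 0:
        raise ValueError(f"offset must be a non-negative multiple of {size}")
    imm5 = offset // size  # the offset is stored scaled by the access size
    if imm5 > 31:
        raise ValueError(f"offset {offset:#x} is beyond the maximum of {31 * size}")
    return base_opcode | (imm5 << 6) | (rb << 3) | rd

example = encode_load("LDRB", rd=1, rb=0, offset=0x0)  # 0x7801
```

Because the stored field is `offset // size`, the same 5 bits reach 31 bytes for LDRB, 62 for LDRH and 124 for LDR, and anything past that raises an error here.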
So in a theoretical scenario, let’s say r0 contains a pointer to some structure. You want to read a single-byte member from this structure into r1. If that member is within the first 32 bytes of the structure, you can do this with a single THUMB instruction.
Similarly, if you want to read a halfword from the structure, this is trivial if it’s within the first 64 bytes of the structure (and aligned on a 2-byte boundary). But if it’s further into the structure than that, then you’re going to need to do more work.
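The rule of thumb above can be written as a small predicate. This is only a sketch: `single_instruction_load` is a hypothetical helper name, and it considers the immediate-offset loads alone, ignoring other addressing modes.

```python
def single_instruction_load(offset: int, size: int) -> bool:
    """True if a member of `size` bytes (1, 2 or 4) at `offset` can be
    read with one immediate-offset THUMB load."""
    if size not in (1, 2, 4):
        raise ValueError("size must be 1, 2 or 4 bytes")
    aligned = offset % size == 0          # halfwords on 2, words on 4
    in_range = 0 <= offset <= 31 * size   # 5-bit field, scaled by size
    return aligned and in_range
```

So a byte at offset 31 is fine and one at offset 32 is not, while a halfword is reachable up to offset 62 as long as the offset is even.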
The audio structures in the prototype GBA BIOS contain some byte-sized members further than 31 bytes from the start of the structure. In the disassembled code I saw a few places where there was manual pointer arithmetic going on to access these members.
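To show what that extra work can look like, here is a sketch that emits either the single load or a pointer-arithmetic fallback. The fallback sequence (copy the pointer, add the offset, load from the result) and the scratch register are my assumptions for illustration; the actual prototype code may order or choose its instructions differently.

```python
def load_byte_sequence(rd: str, rb: str, offset: int, scratch: str = "r0"):
    """Return THUMB assembly lines that load the byte at [rb + offset] into rd."""
    if 0 <= offset <= 31:
        # Fits the 5-bit immediate: one instruction.
        return [f"ldrb {rd}, [{rb}, #{offset:#x}]"]
    if 31 < offset <= 255:
        # Out of range: build the address in a scratch register first.
        # (THUMB's add-immediate to a register takes an 8-bit value.)
        return [
            f"mov {scratch}, {rb}",
            f"add {scratch}, #{offset:#x}",
            f"ldrb {rd}, [{scratch}, #0x0]",
        ]
    raise ValueError("offset not handled by this sketch")

in_range = load_byte_sequence("r1", "r7", 0x1F)
out_of_range = load_byte_sequence("r1", "r7", 0x2E)
```

Every THUMB instruction is 2 bytes, so in this sketch the out-of-range access costs 6 bytes where the in-range one costs 2.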
Meanwhile in the retail BIOS, it appears that the structures have been rearranged to put all of the byte-sized members at the start of the structure, followed by the halfword-sized members, and then the word-sized members.
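The effect of that reordering can be sketched with a made-up structure. The member names and counts below are hypothetical (the real BIOS structures are not reproduced here), and `layout` simply assumes natural alignment for each member.

```python
def layout(members):
    """Assign naturally aligned offsets to (name, size) pairs, in order."""
    offsets, cursor = {}, 0
    for name, size in members:
        cursor = (cursor + size - 1) // size * size  # round up to alignment
        offsets[name] = (cursor, size)
        cursor += size
    return offsets

def out_of_range(offsets):
    """Names of members that one immediate-offset THUMB load cannot reach."""
    return [name for name, (off, size) in offsets.items() if off > 31 * size]

# Hypothetical structure: 12 words, then 8 halfwords, then 4 bytes.
declared = ([(f"word{i}", 4) for i in range(12)]
            + [(f"half{i}", 2) for i in range(8)]
            + [(f"byte{i}", 1) for i in range(4)])

# Same members, smallest first (sorted() is stable, so ties keep their order).
reordered = sorted(declared, key=lambda member: member[1])

before = out_of_range(layout(declared))
after = out_of_range(layout(reordered))
```

In the declared order the four byte members land at offsets 64 to 67 and need extra instructions; sorted smallest-first, every member is within reach and the structure is no larger. Sorting only helps when the small members actually fit in their windows, of course.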
Compare this excerpt from the prototype (left) and retail (right) versions. The pointer’s in r7. The prototype has a byte-sized member at offset 0x2e (out of range), so it uses add instructions to offset the pointer. The retail version has moved this member to a lower offset that a single load can reach.
As a result of all this, the retail BIOS code is slightly smaller and faster, just by rearranging the memory layout of a few structures to be better optimized for the hardware’s instruction set. Neat, huh?
Strictly speaking, it is the ARMv4T architecture used by the GBA’s ARM7TDMI chip that has these two instruction sets.
More specifically, the 16-bit data bus means that fetching a single 32-bit ARM instruction from the cartridge ROM takes at least two reads, while fetching a 16-bit THUMB instruction only takes one. Many games execute ARM code from the fast on-die RAM for specific routines such as software rasterizers, where the more sophisticated features of the ARM instruction set are required and performance is paramount, while running the rest of the game in THUMB mode.