Boot Sector Quine Golf

A quine is a program that prints its own source. Writing one should force us to think about the quotation versus the execution of code (see the use-mention distinction), touching upon ideas Quine considered. However, in BASIC, we get away with:

10 LIST

Let’s develop an x86 boot sector quine which cheats in the same way. We read the memory containing the boot sector and print out the bytes in hex. We also treat this as a round of code golf, that is, we’ll be mindful of the code size.


Although we consider xxd -r -p or basenc --base16 -d to be our compiler, rather than program directly in machine code by hand, we enlist the help of NASM, the Netwide Assembler, whose syntax resembles that used by DEBUG.COM from the DOS of yore.

An Ancient Ritual

My first "IBM PC compatible", as they used to say, was an Intel 8088, with two floppy disk drives. Like a game console, it would run whatever was on the disk in drive A, albeit only after an irritating delay due to a self-test that seemed pointless because it never failed!

Over years, I learned a lot about how my computer worked, but I still had no idea how to get it to run my code on boot. It wasn’t until a family holiday in Vancouver that I stumbled across a Computing Now! magazine that just happened to feature an article walking through the boot process. Considering how I turned out, it’s unsurprising that this was one my strongest memories of that vacation! But perhaps it’s surprising that the arcane details remain relevant decades later.

On boot, a modern PC behaves like my 8088 did (with some slight differences such as the effects of the AAA instruction). It loads the first 512-byte sector of a disk into RAM at location 0x7C00 if the last two bytes are 0x55 and 0xAA. Then the DL register holds the drive number of the boot device, and control is given to the instruction now at 0x7C00 by setting the CS and IP registers according to Intel’s scheme for referring to 20-bit address space via 16-bit registers, that is, we have CS:IP = 16*CS + IP = 0x7C00.

Modern operating systems soon switch the CPU from 16-bit real mode to 32-bit protected mode, and from there to 64-bit long mode. However, we’ll be keeping it real.

Apart from these conditions on DL, CS, and IP, the initial values of the registers and flags depend on the implementation. Though judging by boot sector examples I’ve seen, it seems reasonable to assume that DF (the direction flag) is clear and IF (the interrupt flag) is set, and that one of CS and IP is zero.

We begin by pointing DS:SI at the start of our code:

org 0x7c00
push 0x7c0
pop ds
xor si,si

We ought to set SS:SP to something reasonable as soon as possible, because an interrupt could happen at any time, and its handler may use the stack, which for all we know starts off in the middle of our code. We’ll chance it!

We prepare to LOOP for CX = 512 iterations:

xor bx,bx
mov cx,512
dump:

Every 32 iterations, we print CR and LF via BIOS interrupt 10h function 0xE, on display page BH = 0. Both are needed: the CR moves the cursor all the way to the left and the LF moves it to the next row, because I guess that’s what a teletype did.

test cl, 0x1f
jnz showbyte
mov ax,0x0e0d
int 0x10
mov al,0xa
int 0x10

An alternative is a single call to BIOS interrupt 10h function 2 to set the cursor position. This is less code, but relies on a clear screen. A BIOS splash screen is anything but.

We load the next byte into AL with LODSB. To extract its upper and lower nibble, we may instinctively reach for bitwise operations like right shifts and AND. But this is x86 assembly golf, and we should be aware of obscure opcodes for BCD arithmetic.

Consulting references for the AAM opcode, we find it only costs two bytes to compute the quotient and remainder of a division by a small constant (a feature that was originally undocumented!):

showbyte: lodsb
aam 16

We want to print the hex digit representing AH followed by the hex digit representing AL. Thus we wish to run a certain sequence of steps twice, so we prepare to loop. To switch between AH and AL, we store one of them in DL and start off the loop body with an XCHG.

mov dl, ah
hexit:
xchg ax, dx

To compute the corresponding ASCII code for a given AL value, we’d naturally write:

; cmp al, 10
; jl digit
; add al, 7
; digit: add al, 0x30

This is a straightforward way to add 48 to a member of [0..9], and 55 to a member of [10..15]. But once again, archaic opcodes shave a few bytes, and we save a little more by combining with the call to BIOS to print the character:

aam
aad 17
add ax, 0x0e30
int 0x10

Recall the inner hexit loop should run twice. Thus we XOR BL with AH = 0xE and check ZF (the zero flag): it is cleared the first time, and set the second time, because initially BL = 0.

xor bl, ah
jnz hexit

The outer loop is more idiomatic:

loop dump

Afterwards, we have nothing to do. The polite way to do nothing is to loop on a HLT instruction:

; zzz: hlt
; jmp zzz

However, to reduce code size, we spin wastefully:

jmp $

Lastly, we need magic values at the end of the 512-byte sector. We zero all intervening bytes.

times 510-($-$$) db 0
dw 0xaa55

I originally wrote a putc routine to print the character in AL, thinking I’d save space by reusing code. I also had a hexit routine that was called twice; its last action was to call putc, so applying tail-call optimization and placing the putc immediately after immediately after hexit saved a CALL and a RET.

However, it’s cheaper to inline each BIOS call. Unlike jumps, each CALL instruction costs 3 bytes, whereas INT 0x10 costs 2. Also, BIOS calls preserve unused registers, and our program is simple enough we can initialize BH to 0 once and for all. For similar reasons, it’s better to inline hexit and abuse ZF to run it twice.

We assemble and dump in hex to generate the source:

$ nasm boot.asm -o boot.bin
$ xxd -c 32 -p boot.bin

RISC versus CISC

Because I grew up on x86, my undergrad assembly class was a rude shock. They taught MIPS, and stressed it was a Reduced Instruction Set Computer (RISC). Every instruction was the same length, and although there were fewer operations available, there were more registers, and any operation could be done to any register. They said the simpler design would ultimately result in faster chips, so everything I had learned as a kid was obsolete.

Then years later I saw a post by Linux Torvalds on the strengths of x86. For example, opcodes that vary in length lead to a dense encoding, improving the effectiveness of the instruction cache. Muddying the waters further, ARM started as a disciple of RISC but eventually introduced short Thumb instructions to improve code density. Also Blem, Menon, and Sankaralingam suggest the instruction set architecture is irrelevant. Maybe it’s the RISC sales pitch I was fed that is obsolete!

One thing’s for sure: CISC beats RISC for boot sector golf. Variable-length instructions let us pack more into 512 bytes with some tricks that may be less familiar to RISC programmers. For example:

  • Zero a 16-bit register by XORing it with itself; a 16-bit zero constant already takes 2 bytes of a MOV instruction.

  • An XCHG involving AX is only one byte. Otherwise it takes two.

  • Operating only on the high or low byte of a register can shorten code.

  • Tiny instructions like LODSB or LOOP punch well above their weight.

Emulation

My younger self would have been amazed by today’s edit-compile-run cycle for boot sectors. QEMU is great:

$ qemu-system-i386 boot.bin

Nowadays, we can even test boot sectors in a browser! Go to https://copy.sh/v86/?profile=custom, choose the boot.bin file from above for the floppy disk image and click "Start Emulation".


Ben Lynn blynn@cs.stanford.edu 💡