
Text Encoding and Soda Fountains

2023-02-13 · Luna

Recently I found myself needing to encode a string into UTF-16, for a hobby project interoperating with some software from the late ’90s.

“Not a problem”, I thought naïvely, “I’m using Rust, where importing libraries is trivial.” (I’ve been begrudgingly using C++ for the past decade, so the idea of packages being plentiful and trivial to import is still novel.) So I reached for my usual tool of choice for this sort of thing: encoding_rs. I’ve used it on other hobby projects to {en,de}code Windows-1252 text, so I figure I can just use it as I have before, substituting the UTF-16 encoding for the Windows-1252 encoding, and everything will be fine. The docs for the crate even introduce it as follows (emphasis mine):

encoding_rs is a Gecko-oriented Free Software / Open Source implementation of the Encoding Standard in Rust. Gecko-oriented means that converting to and from UTF-16 is supported in addition to converting to and from UTF-8, that the performance and streamability goals are browser-oriented, and that FFI-friendliness is a goal.

This sounds perfect for the task, right?

Careful, Icarus..
— Geoff the Robot & Craig Ferguson, The Late Late Show, September 2nd 2011

In my earlier projects using the library I’d written code like this:

let bytes = encoding_rs::WINDOWS_1252.encode("Hello!").0;

When I first started using it, I read just far enough in the docs to figure out how to get the encoded data out, and then just copy-pasted between projects.

So I figure I can adapt what I already know, and write the following:

let bytes = encoding_rs::UTF_16LE.encode("Hello!").0;

This compiles successfully, and it doesn’t trap at runtime, so it’ll give me the bytes I want, right?

...right?


Let’s imagine a soda fountain. Here’s a picture of one, if that’s helpful.

A typical soda fountain. It features five dispensing nozzles, each labeled with a different brand of soft drink.

The soda fountain’s interface is simple enough: you place a cup under the nozzle for the drink you’d like, push the corresponding button, and the machine dispenses a drink into the cup.

This interface has an implied contract: that the dispensed drink will correspond to the button you pushed. Pressing the Coke button should dispense Coke, the Sprite button should dispense Sprite, and so on.

Let’s say you press the Fanta button, and the machine dispenses Coke instead. You might be quite upset. After all, you wanted Fanta; if you’d wanted Coke, you’d have pressed the Coke button.

It might surprise you if it turned out that, despite having the appearance of dispensing five distinct drinks, the machine feeds both the Coke and Fanta nozzles from a single tank. It might surprise you even more if this were part of the original manufacturing specification: that dispensing a drink was considered to be more important than dispensing the requested drink.

Maybe the store puts a sign next to the machine saying “The Fanta button dispenses Coke”. Will people read it? Surely making the button do what it looks like it does, or removing it entirely, would be less error-prone?

With that in mind, let’s get back to the code.


For the hobby project in question, I have some binary data with a well-defined layout that I’m serializing to/from structures in memory, and I figured it made sense to have some unit tests ensuring the conversions run correctly.

thread 'main' panicked at 'assertion failed: `(left == right)`
  left: `[72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0]`,
 right: `[72, 101, 108, 108, 111, 33]`',

Well, that doesn’t look right at all. I’m expecting UTF-16 but it’s outputting... is that ASCII? UTF-8? Certainly not the UTF-16 I asked for.
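For what it’s worth, the std library alone can answer the “ASCII or UTF-8?” question. This is only a small sketch; the helper name utf8_bytes is made up for illustration and isn’t part of encoding_rs or the original test:

```rust
/// Hypothetical helper: the UTF-8 bytes of a string, copied into a Vec.
fn utf8_bytes(s: &str) -> Vec<u8> {
    // A Rust &str is always valid UTF-8, so as_bytes() is its UTF-8 encoding.
    s.as_bytes().to_vec()
}

fn main() {
    // The right-hand side of the failed assertion.
    let observed = vec![72u8, 101, 108, 108, 111, 33];

    // Those are exactly the UTF-8 bytes of "Hello!"...
    assert_eq!(utf8_bytes("Hello!"), observed);
    // ...and since every character is ASCII, they are also its ASCII bytes.
    assert!("Hello!".is_ascii());
}
```

For pure-ASCII text the two encodings are byte-for-byte identical, so both guesses are right.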

So I looked at the docs.

This decode-only encoding uses 16-bit code units due to Unicode originally having been designed as a 16-bit reportoire. In the absence of a byte order mark the little endian byte order is assumed.

There is no corresponding encoder in this crate or in the Encoding Standard. The output encoding of this encoding is UTF-8.

(source)

What?


encoding_rs is a soda fountain. UTF-16 encoding is Fanta, and UTF-8 encoding is Coke. The Encoding Standard upon which encoding_rs is based, that’s the manufacturer’s spec.

The standard does not describe any UTF-16 encoders, so the library does not provide any. But the library still gives you a UTF-16 Encoder button, which dispenses UTF-8.

Why? Because the standard said so:

§ 4.3. Output encodings

To get an output encoding from an encoding encoding, run these steps:

  1. If encoding is replacement or UTF-16BE/LE, then return UTF-8.
  2. Return encoding.
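Those two steps are small enough to sketch directly. The Encoding enum below is a hypothetical, heavily trimmed stand-in for the standard’s list of encodings, and output_encoding is my own name for the function; this is not how encoding_rs is implemented internally, just the rule as quoted:

```rust
/// Hypothetical, trimmed-down list of encodings (the real standard names many more).
#[derive(Debug, Clone, Copy, PartialEq, Eq)]
enum Encoding {
    Utf8,
    Utf16Be,
    Utf16Le,
    Replacement,
    Windows1252,
}

/// "Get an output encoding", following the two quoted steps.
fn output_encoding(encoding: Encoding) -> Encoding {
    match encoding {
        // Step 1: replacement and UTF-16BE/LE all map to UTF-8.
        Encoding::Replacement | Encoding::Utf16Be | Encoding::Utf16Le => Encoding::Utf8,
        // Step 2: everything else is returned unchanged.
        other => other,
    }
}

fn main() {
    let all = [
        Encoding::Utf8,
        Encoding::Utf16Be,
        Encoding::Utf16Le,
        Encoding::Replacement,
        Encoding::Windows1252,
    ];
    for encoding in all {
        println!("{:?} -> {:?}", encoding, output_encoding(encoding));
    }
}
```

Seen this way, the Fanta button is wired to the Coke tank in step 1, before any encoding work even starts.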

(source)

I somehow managed to read about this four months ago in this article by ThePhD, completely forget, and proceed to walk straight into the exact same trap.

I understand the implementers following the spec to the letter. I don’t understand why the spec is written this way, but I assume there must’ve been a good reason. All I can say for certain is that from a user perspective, this interface is less than ideal. Give me a button that does what it says, or don’t give me a button at all.


The punchline to all of this?

I completely failed to notice the existence of str::encode_utf16 in std.

Whoops!
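For completeness, here’s a minimal sketch of getting little-endian UTF-16 bytes using only std. The helper name utf16le_bytes is mine, and I’m assuming little-endian with no byte order mark, which is what the expected bytes in my failing test imply:

```rust
/// Hypothetical helper: encode a string as UTF-16LE bytes, without a BOM.
fn utf16le_bytes(s: &str) -> Vec<u8> {
    s.encode_utf16()                          // iterator over u16 code units
        .flat_map(|unit| unit.to_le_bytes())  // each unit becomes 2 bytes, low byte first
        .collect()
}

fn main() {
    let bytes = utf16le_bytes("Hello!");
    // The bytes the unit test was expecting all along.
    assert_eq!(bytes, [72, 0, 101, 0, 108, 0, 108, 0, 111, 0, 33, 0]);
    println!("{:?}", bytes);
}
```

Swapping to_le_bytes for to_be_bytes would give the big-endian flavour instead.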