JavaScript string slice() considered harmful

Understanding UTF-16 and protobuf encodings to resolve a bug in our CSV importer.

James Mulholland on June 16, 2025

6 min read

Attio's CSV importer writes millions of records into customers' workspaces every month. Not only is there a large volume of data, but it's also often varied and unpredictable. Users of the importer are generally implementing custom processes that write data from outside our usual sources. When working with such processes, you'll eventually run into some edge cases.

The bug: gRPC errors

We spotted one such edge case when Attio's error monitoring service reported this issue coming from a gRPC call to Google’s Spanner, our primary database.

3 INVALID_ARGUMENT: Invalid request proto: an error was encountered during deserialization of the request proto.

The first step for any bug is to try and replicate it. After some quick log queries and a conversation with the customer in question, we had a small CSV file we could use to reproduce the bug.

I fired up my debugger and began stepping through the code. Although it didn't throw directly when called, this truncation function caught my eye.

const SORTABLE_TRUNCATION_LENGTH = 16

/**
 * Given a string value, truncate it to a fixed length value prior to indexing
 * it in the database
 */
function truncateSortableValue(rawValue: string): string {
    return rawValue.slice(0, SORTABLE_TRUNCATION_LENGTH)
}

Looking at the output of truncateSortableValue for my CSV file, I found an odd value that involved a flag character. When .slice() landed in the middle of a flag emoji, you sometimes ended up with unexpected results.

"A flag's odd 🇬🇧 when truncated".slice(0, 16)
// "A flag's odd 🇬\uD83C" (not "A flag's odd 🇬🇧 w")

Indeed, if I re-ran the import without the flag rows, I no longer saw the error. But what's really going on here? What is 🇬\uD83C?

Understanding JavaScript strings and UTF-16

On a day-to-day basis, you only need to interact with JavaScript strings at a surface level. Dig a little deeper, however, and you'll find there's lots more going on.

The first thing to understand is that JavaScript strings use Unicode. Unicode allows you to reference hundreds of thousands of characters and emojis using a unique number known as a code point. For example, 'a' has the code point U+0061 whereas '🙂' is U+1F642.
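You can inspect these code points directly from JavaScript:

```javascript
// codePointAt() exposes the numeric code point behind a character
const aCodePoint = "a".codePointAt(0).toString(16) // "61" → U+0061
const smileCodePoint = "🙂".codePointAt(0).toString(16) // "1f642" → U+1F642

// ...and String.fromCodePoint() goes the other way
const smile = String.fromCodePoint(0x1f642) // "🙂"
```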

More specifically, JavaScript strings use the UTF-16 encoding. An encoding (UTF-8 and UTF-32 also exist) determines how the numerical code points are encoded and decoded as bytes. The unit of storage for a particular encoding is known as a code unit. The 16 in UTF-16 refers to the fact that its code units are 16-bit values, and each code point is represented by one or more code units. UTF-16 is optimized so that the most common characters fit into a single 16-bit code unit; other characters, including most emoji, need two (a "surrogate pair").
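We can see the code units directly: .length counts them, and charCodeAt() reads them one at a time.

```javascript
// "a" fits in a single 16-bit code unit; "🙂" needs two (a surrogate pair)
const asciiUnits = "a".length // 1
const emojiUnits = "🙂".length // 2

// charCodeAt() reads individual code units: 🙂 is stored as 0xD83D 0xDE42
const high = "🙂".charCodeAt(0).toString(16) // "d83d" - high surrogate
const low = "🙂".charCodeAt(1).toString(16) // "de42" - low surrogate
```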

The last point to be aware of is that multiple Unicode characters (each with its own code point) can be joined together to make a single readable glyph known as a grapheme cluster. Emoji are a common example of grapheme clusters: many emoji variants are formed by combining multiple sub-emoji, usually with the zero-width joiner (U+200D) character. You also see grapheme clusters in accented characters (e.g. e + ´ = é).
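The family emoji makes the three levels easy to see: one glyph, five code points, eight code units. (Intl.Segmenter, used here to count grapheme clusters, is available in modern browsers and Node 16+.)

```javascript
// 👨‍👩‍👧 is man + ZWJ + woman + ZWJ + girl
const family = "👨\u200D👩\u200D👧"

const codeUnits = family.length // 8 - UTF-16 code units
const codePoints = [...family].length // 5 - code points (incl. two ZWJs)

// Counting grapheme clusters needs Intl.Segmenter
const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
const graphemes = [...segmenter.segment(family)].length // 1 - what a user "sees"
```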

Problems can occur when you're unclear about whether you're operating on grapheme clusters, code points, or code units.

When rendering human readable strings (e.g. for UI truncation), you probably want to think in terms of grapheme clusters. For example, if you separated '🇬🇧' in half you could end up rendering '🇬' (regional indicator symbol G) to a user.
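For that UI case, a grapheme-aware truncation could be sketched with Intl.Segmenter. (truncateGraphemes is a hypothetical helper for illustration, not something from our codebase.)

```javascript
// Truncate to the first n grapheme clusters, so flags, ZWJ emoji, and
// accented characters are never rendered half-broken
function truncateGraphemes(s, n) {
  const segmenter = new Intl.Segmenter("en", { granularity: "grapheme" })
  return [...segmenter.segment(s)]
    .slice(0, n)
    .map((seg) => seg.segment)
    .join("")
}

truncateGraphemes("🇬🇧 and beyond", 4) // "🇬🇧 an" - the flag counts as one
```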

For our database case, we're more concerned with not splitting a string in the middle of a code point before transmitting the data. Our gRPC call uses protobufs, which encode strings as UTF-8. If we slice a surrogate pair in half and try to encode the lone code unit into a protobuf message, we could hit difficulties on the other end when we try to decode.

Unfortunately, the string .slice() method we've been using thus far works in terms of code units, not code points. As a result, something like '🇬🇧'.slice(0, 1) produces a lone surrogate, and it's not entirely clear how downstream consumers will handle it.
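We can confirm that the slice really does produce a lone surrogate:

```javascript
const lone = "🇬🇧".slice(0, 1)

// A single code unit in the high-surrogate range (0xD800-0xDBFF), with no
// low surrogate to pair with
const unit = lone.charCodeAt(0)
const isLoneHighSurrogate = lone.length === 1 && unit >= 0xd800 && unit <= 0xdbff

// Newer runtimes (Node 20+) can check this directly:
// lone.isWellFormed() // false
```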

Our Spanner database runs fully on Google's servers and is closed source, meaning we can't use a step-through debugger to find the source of this error. However, we do know requests are encoded using the protobufjs library.

If we build up a toy example, we'll see that protobufjs and Node's native Buffer.from both attempt to encode the split string, but both fail to decode the value back into a meaningful string on the other end.

const protobuf = require("protobufjs")

// Split a problematic string along a code unit
const str = "🇬🇧".slice(0, 1)

// Node's Buffer.from() can't encode str properly so returns <Buffer ef bf bd>
// i.e. the UTF-8 encoding of "�"
const jsBuffer = Buffer.from(str, "utf-8") // Same as Buffer.from("�", "utf-8")

// If we try and decode that buffer back into a string, we don't get a value equal to what
// we put in
const decodedJSBuffer = jsBuffer.toString("utf-8")
decodedJSBuffer          // "�"
decodedJSBuffer === str  // false

// To encode with protobuf, we first create a simple message to test with
//
// message Message {
//  string content = 1;
// }
const Message = new protobuf.Type("Message").add(
  new protobuf.Field("content", 1, "string")
)

// Create with the problematic split string
const message = Message.create({ content: str })

// Protobufjs encoding differs from Buffer.from and encodes the string as
// <Buffer 0a 03 ed a0 bc>
//
// 0a       - protobuf header (field number 1, wire type 2 for string)
// 03       - length of the string (3 bytes)
// ed a0 bc - attempt at a UTF-8 encoding of the string
//
// A 3 byte UTF-8 encoding is of the form 1110xxxx 10xxxxxx 10xxxxxx
// Protobufjs tries to stuff the str value into this as best it can
//
// str in hex:         0xD83C
// str in binary:          1101   100000   111100
// 3 byte utf-8 char:  1110xxxx 10xxxxxx 10xxxxxx
// stuffed (binary):   11101101 10100000 10111100
// In hex:             0xEDA0BC
const protobufBuffer = Message.encode(message).finish() // <Buffer 0a 03 ed a0 bc>

// Protobufjs doesn't error when trying to decode but it can't correctly decode the string either
const decoded = Message.decode(protobufBuffer)
decoded.content         // "���"
decoded.content === str // false

While the Spanner decoder on the other end won't be using the JavaScript protobufjs library, this failure to decode is almost certainly the thing causing our error.
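Node's Buffer.from isn't doing anything unusual here, either: the WHATWG Encoding Standard's TextEncoder API, shared by browsers and Node, mandates the same behaviour, replacing unpaired surrogates with U+FFFD before producing UTF-8.

```javascript
const lone = "🇬🇧".slice(0, 1) // "\uD83C", a lone high surrogate

// TextEncoder replaces the unpaired surrogate with U+FFFD ("�"),
// whose UTF-8 encoding is EF BF BD
const bytes = new TextEncoder().encode(lone)

// Round-tripping gives back the replacement character, not the input
const roundTripped = new TextDecoder().decode(bytes) // "�"
```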

The fix

So, how do we fix this?

Well, it turns out that while indexing, .split(), and .slice() all work in code units, a string's [Symbol.iterator]() works in code points.

const units = "🇬🇧".slice(0, 1)       // '\uD83C' - a lone code unit
const points = [..."🇬🇧"].slice(0, 1) // ['🇬']   - a complete code point

With this knowledge in hand, adding a new function to safely get the first n code points of a string is trivial.

/**
 * Get the first n Unicode code points of a string, safely avoiding splitting
 * a code point in the middle.
 *
 * For example, `"🇬🇧".slice(0, 1)` is `'\uD83C'`, which is not a valid Unicode
 * code point. The 🇬🇧 character is made up of two code points, 🇬 and 🇧.
 * `'\uD83C'` is the first UTF-16 code *unit* of the 🇬 code point and is not a
 * valid Unicode character in isolation. `safeHead("🇬🇧", 1)` is `"🇬"`, a valid
 * code point.
 *
 * @see <https://tc39.es/ecma262/multipage/text-processing.html#sec-string.prototype-@@iterator>
 */
export function safeHead(s: string, length: number): string {
    return [...s].slice(0, length).join("")
}

function truncateSortableValue(rawValue: string): string {
    return safeHead(rawValue, SORTABLE_TRUNCATION_LENGTH)
}
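Running safeHead over the original failing value keeps the flag intact. One caveat worth noting: n code points can occupy more than n UTF-16 code units, so the truncated value's .length can exceed the old limit. That's fine for our indexing case, but worth knowing. (Sketched here in plain JavaScript.)

```javascript
function safeHead(s, length) {
  return [...s].slice(0, length).join("")
}

safeHead("🇬🇧", 1) // "🇬" - a whole code point, not a lone surrogate

// The flag survives truncation now
const truncated = safeHead("A flag's odd 🇬🇧 when truncated", 16)
// "A flag's odd 🇬🇧 "

// Note: these 16 code points take up 18 code units
truncated.length // 18
```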

One small pull request later and our importer is error-free.


Interested in redefining one of the world’s most important software categories? Check out our careers page.