12 min read

Zero Memory Allocations: Compile-Time Rendering


This is part 4 of a 12-part series called Distilling the Web to Zero. The web is far from done improving - particularly as it pertains to building rich web applications that are, to users, developers, and businesses alike, more desirable than their native-app counterparts. This series is prefaced by defining the biggest challenges on the road towards that goal and the 12 essays that follow explore potential solutions using concrete examples.

When best practices become antipatterns


You’re basically wasting your life.
– Artur Bergman

I love the Fastly story. It’s about how one change in hardware technology flipped a handful of software best practices into antipatterns. Back in 2011, Artur Bergman saw solid-state drives as a paradigm shift that would open up an opportunity to compete with the entrenched CDN juggernauts of the time, like Akamai. So he founded Fastly.

What made solid state drives (SSD) so interesting is that they weren’t just an incremental improvement. While hard disk drives (HDD) managed a random-access rate of roughly 450 IOPS, SSDs delivered 75,000 IOPS. You could simply swap the drive in your laptop and your boot time would transform from minutes into seconds. This is because HDDs work by physically moving an arm to read bytes from a spinning platter. The delay this adds is called seek time, and it tacks roughly 8 milliseconds onto every random access. SSDs have no moving parts and therefore no seek overhead. The way he describes the difference is worth a quick watch.

To take full advantage of this new paradigm Fastly maxed out “commodity” servers (for 2011) with 768GB of memory and 18TB of SSD storage. A customized Varnish installation was run on top of an OS configured to leverage the SSD storage as virtual memory thus giving Varnish 18TB of cacheable random access address space capable of caching 2+ billion objects per machine.

As Bergman explains, it doesn’t make any sense to handle this from the application layer. Writing your own LRU cache that purges individual objects from RAM to disk will be slower. The OS is already quite good at paging 4KB chunks of RAM to and from storage so just let it do that, use a fast drive, and enjoy having terabytes of cheap random access memory to work with.

Edge computing… any day now

There’s another paradigm shift waiting to happen but this time it’s not hardware, it’s infrastructure! Edge computing has been steadily growing over the past few years. In my city there are local zones for all 3 major cloud providers. Check your own latency. I can round trip from my house to an AWS EC2 instance in 4 milliseconds! That’s two round trips completed in the time it takes my backup drive to read the first byte.

To put that in context, a human blink takes about 250 milliseconds. A 60Hz monitor refreshes every 16 milliseconds. In Jakob Nielsen’s book Usability Engineering, he establishes 100 milliseconds as the limit for humans to perceive something as instant.

Where’s the killer app?

Just like HDDs, where the only way to break through the physical limitations of seek time was to forgo all moving parts, edge computing overcomes its physical limitations (the latency of the speed of light) by moving compute resources physically closer to its users. This general strategy makes sense, is easy to explain, is widely accepted and is validated by the billions of dollars invested into these data center build outs.

Unfortunately, edge computing still seems to be missing its killer app. In web dev communities, the only use cases that are gaining traction are around configuration and middleware routing. How very underwhelming! Just recently Vercel, a popular cloud provider, announced some shockingly unexpected backpedaling regarding their usage of the edge.


Why does the edge only work for serving static bits but not for rendering dynamic bits? The answer can be found in last month’s essay about Web2’s horizontal tax. In a nutshell, since Web2 cannot run its logic in the absence of data, the added physical distance between the two amplifies the horizontal tax to the point of negative return.

This raises a few questions: Is it possible for web servers to render UI in the absence of data? Why is it that native mobile apps can do this so easily? If webapps were built like native apps, what best practices would suddenly become new antipatterns? Why is static data so edge-friendly but dynamic data is not? What needs to change in order to make edge rendering work?

A CDN-inspired web framework

A CDN has the luxury of not needing to have any understanding of the bytes it’s sending over the network. It specializes in firing a contiguous chunk of bytes from memory down the wire like the gun-fire sounds of the Bellagio fountain. As a result, it can fully saturate the NIC while the CPU remains mostly idle. Fastly boasts that one of their machines (from 2013) can continuously push 25 Gbit/sec, 40k req/sec and return a result in 600 microseconds. These are the table stakes when working on the edge – roughly a 1000x difference compared to what’s typical for dynamic rendering upstream in the cloud region.

So what exactly would it look like to build an edge-first web framework? To approach that same question from the opposite angle, how could an old fashioned static caching engine like Varnish be evolved to support dynamic/reactive data?

There seem to be two standout characteristics about caching servers that differ from traditional web frameworks:

  1. Predictable, sustained, maximum throughput
  2. Non-blocking, submillisecond response times

For #1, predictable, sustained throughput, the first thing that comes to mind for video game devs is probably wrestling with the garbage collector. Last month’s essay already discussed building on top of a C#/.NET tech stack, which is a managed runtime and therefore uses a garbage collector. Good news! As part of that “renaissance” (mentioned in that essay), Microsoft (and its many outside contributors) have improved the .NET runtime by leaps and bounds. These days, writing C# code that is simultaneously managed AND allocates zero memory is not just possible but an increasingly common practice.

For #2, achieving non-blocking, submillisecond response times, the best option is to go the same route as the SSD and remove all the moving parts. Why not make rendering a compile-time step? A caching server has no need to construct a DOM tree; it just needs those bytes lined up in contiguous memory ready to fire over the network.

NOTE

Outside the scope of this essay is how exactly to render in the absence of data. This local-first mindset was covered briefly in last month’s essay about Web4. Concrete examples of this are coming later this year.

The xUI view engine

The vast majority of web pages are composed of thousands of tiny tags arranged in a giant tree. However, most of those tags never change. In fact, it’s not uncommon for the ratio of markup that changes to markup that never changes to be close to 1%. An edge-first view engine could exploit that ratio by leveraging its compiler to isolate the portions that never change and prepare them as a contiguous array of wire-ready bytes. This turns rendering into a compile-time expense, paid only once, ahead of time. Serving the webpage would closely resemble a static cache except, instead of sending just one contiguous chunk of bytes over the wire, it would alternate back and forth between large-static-chunk, small-dynamic-value, large-static-chunk, small-dynamic-value, etc. Each webpage would require 2n + 1 writes to the network, where n is the number of dynamic values the webpage contains.
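A minimal sketch of that interlacing, using hypothetical data and a StringBuilder standing in for the network (not the real xUI API):

```csharp
using System;
using System.Text;

// n dynamic values means n + 1 static chunks, so 2n + 1 writes total.
string[] statics  = { "<html><body><p>Count: ", "</p><p>User: ", "</p></body></html>" };
object[] dynamics = { 42, "alice" };                 // n = 2

var wire = new StringBuilder();
int writes = 0;
for (int i = 0; i < dynamics.Length; i++)
{
    wire.Append(statics[i]);  writes++;              // large static chunk
    wire.Append(dynamics[i]); writes++;              // small dynamic value
}
wire.Append(statics[^1]); writes++;                  // final static chunk

Console.WriteLine(writes);                           // 5 = 2n + 1
```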

Here are the specifics on how xUI accomplishes compile-time rendering with zero memory allocations.

Structs

Every declarative/reactive UI framework falls into one of two camps: object-oriented (where state is accessed via object fields) or functional (where state is accessed via hooks). iOS has SwiftUI, which is object-oriented, while Android has Compose, which is functional. React famously switched from object-oriented to preferring functional for a few cited reasons. While they don’t mention sustained speed as one of those reasons, I’m inclined to believe it was a factor. Since rendering is something that might occur many times per second, allocating thousands of tiny objects on every render pass creates a huge burden for the garbage collector, slowing the UI or even triggering excessive GC pauses. Using a function to call a function to call a function thousands of times over is a clever way around this constraint since function calls use the stack instead of the heap. If Android chose functional and React switched to functional, why, then, did Apple choose the object-oriented approach? Because the Swift language has the luxury of rendering with structs instead of classes, while JavaScript and Kotlin do not.

C# has structs too. In fact, they’ve been around since v1.0. They’re an incredibly useful tool for offloading work from the garbage collector. In many ways they look just like a class, except they reference a contiguous block of physical memory and they live on the stack instead of the heap. No system calls for allocating memory are necessary, there’s no memory fragmentation, and there’s no garbage to clean up. Like everything, structs come with their own trade-offs (which are outside the scope of this essay) but it’s interesting to point out the pattern where languages without complex value types (like structs) use hooks for state and functions for composability while languages with lower-level memory constructs do not.
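A quick illustration of the value semantics involved, using a built-in struct (ValueTuple) so no custom types are needed:

```csharp
using System;

// A struct lives inline on the stack; assignment copies the whole value.
// No heap allocation and nothing for the GC to track.
var a = (X: 1, Y: 2);   // System.ValueTuple<int,int> is a struct
var b = a;              // full copy, not a shared reference
b.X = 99;

Console.WriteLine(a.X); // 1: a and b are independent copies
```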

Raw string literals

String literals are used as a compiler optimization. In short, the compiler knows about them ahead of time and can make sure they’re only created once. This is especially helpful in tight loops and it alleviates pressure on the garbage collector. A string literal is basically anything in the source code that appears between two double quotes: var str = "I'm a string literal";. Other strings, like those typed into a form field or fetched from a database, cannot be known ahead of time and therefore live on the heap, possibly existing in memory multiple times.
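That “only created once” behavior is easy to observe directly:

```csharp
using System;
using System.Text;

// Identical string literals are interned by the runtime: two variables
// end up pointing at one shared string object.
string a = "<p>never changes</p>";
string b = "<p>never changes</p>";
Console.WriteLine(ReferenceEquals(a, b));       // True: one object in memory

// A string built at run time is a separate heap object, even if equal:
string runtime = new StringBuilder("<p>never changes</p>").ToString();
Console.WriteLine(ReferenceEquals(a, runtime)); // False
```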

Secondly, string literals are immutable. This is an important characteristic for xUI since it has a keen interest in knowing which parts of a template are dynamic and which are guaranteed to never change. Since HTML is a structural language, it’s not uncommon to see 99% of a document never change.

Thirdly, string literals can offer some incredible memory efficiencies which will be helpful since xUI does its diffing on the server. Imagine a server with 20,000 connected users all viewing a 100KB page containing 1000 mutable values each. To construct a virtual DOM (or tree of any kind) for each user would be like running 20,000 active browser tabs on the server. This would require potentially terabytes of RAM, and trillions of objects for the GC to manage. However, xUI does not construct a tree. Each user is represented by a tiny array, in this case, with a length of 2001 where every odd index holds a mutable value and every even index is a pointer to an immutable string literal. Each user’s array points to the same string literals! That’s 100KB of markup split across 1001 strings and most importantly never duplicated! From terabytes to megabytes, that’s 1,000,000x smaller.

Raw string literals are new in C# 11. They’re syntactic sugar on top of string literals. They can contain certain characters that are not allowed in regular string literals, including whitespace, new lines, embedded quotes, and other special characters without requiring escape sequences.
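A small example of the syntax, with hypothetical markup:

```csharp
using System;

// A C# 11 raw string literal: multi-line markup with embedded double
// quotes, no escape sequences required.
string button = """
    <button class="primary" onclick="increment()">
        Click me
    </button>
    """;

Console.WriteLine(button.Contains("class=\"primary\"")); // True
```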


Interpolated string handlers

Interpolated string handlers are new in C# 10. They’re C#’s analog to JavaScript’s tagged template literals. They enable you to customize how and when a string gets interpolated (or even forgo the process altogether). For example, the following two lines produce the same output:
"Hello " + name + "!"
$"Hello {name}!"

If you create an interpolated string handler the compiler will hand that string to you in pieces allowing you to interpolate it however best fits your use case.

One common use case is logging. You can delay construction of the string until you know it qualifies for output: no need to waste memory creating a whole string on the heap before passing it into the method if the log message’s priority level is configured to be ignored. This saves CPU time and GC pressure. Another example use case is sanitizing SQL inputs.
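The BCL itself uses this pattern: in .NET 6+, Debug.Assert accepts an interpolated string handler, so when the condition holds, the message is never built at all. A sketch:

```csharp
using System;
using System.Diagnostics;

// When the assertion is true, the handler short-circuits: the message is
// never constructed and Describe() is never even called.
int builds = 0;
string Describe() { builds++; return "expensive details"; }

Debug.Assert(true, $"assertion holds, message skipped: {Describe()}");

Console.WriteLine(builds); // 0: no string was built, no work was done
```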

xUI takes advantage of this feature to indicate which parts of the HTML are static and guaranteed to never change and which parts are dynamic. It hangs on to both types in an array instead of constructing a tree, tag by tag.

As an additional benefit, these static parts are treated as string literals and are never recreated (a helpful feature when trying to handle millions of requests per second).

Finally, isolating the dynamic values and storing them in an array interlaced with pointers to string literals of the static content makes implementing a diffing algorithm exceedingly simple – O(1) with respect to the number of elements. Adding more tags/elements increases the workload by zero; adding dynamic values increases the work, but only linearly. Compare that to diffing a virtual DOM tree, which has a default complexity of O(n³): 1000 elements would require on the order of one billion comparisons. While heuristics can help mitigate that complexity, it’s still vastly more work to build and diff two trees than to take an approach where there is no tree.

Here’s a deeper look at how the C# compiler lowers that syntactic sugar. This example shows two raw string literals, but one is assigned to the default string type while the other is assigned to a custom interpolated string handler of my own creation called NotReallyAString.

Below is the lowered code generated by the compiler. (Check out the SharpLab of this code if you want to poke at it yourself from your browser to explore what the compiler is doing under the hood.) For the first method, the main thing to notice is the last line where it calls ToStringAndClear() which does the work of allocating an array of bytes on the heap that is the properly calculated length and subsequently writing each chunk to it ready for use as a string.
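For reference, that first lowering can be sketched by hand with the BCL’s DefaultInterpolatedStringHandler, the type the compiler targets for an ordinary interpolated string:

```csharp
using System;
using System.Runtime.CompilerServices;

// Roughly what the compiler emits for: string s = $"<p>Count: {count}</p>";
int count = 42;

var handler = new DefaultInterpolatedStringHandler(literalLength: 14, formattedCount: 1);
handler.AppendLiteral("<p>Count: ");
handler.AppendFormatted(count);
handler.AppendLiteral("</p>");
string s = handler.ToStringAndClear();   // the single heap allocation

Console.WriteLine(s); // <p>Count: 42</p>
```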

The main thing to notice about the second method is that it never calls ToStringAndClear(). Instead, after it collects each chunk, it just hands you your custom handler so you can do whatever you want with the chunks – zero memory allocated (depending on how you collect your chunks)!

In the context of xUI, this is the perfect format for two things:

  • Zero memory allocations
    Skip the expensive string-building and write those raw chunks directly to the network. HTML pages are frequently over 100KB and composed of hundreds or thousands of these “chunks.” There’s no need for a single giant contiguous array of bytes sitting in the heap as a string object, and those string literals never need to be copied unnecessarily. Just treat them like a bandolier and rapid-fire them directly to the network.
  • Fast diffing
    If you have two of these NotReallyAString objects that were created from the same raw string literal, then it’s guaranteed to have the same number of chunks and, even better, it’s guaranteed that the order of each chunk never shifts either! Detecting if any of the dynamic values have changed (e.g. count) is a simple process of iterating through the appended chunks of two NotReallyAStrings, ignoring the string literals (because they’re immutable and cannot change) and performing a simple == operation at each index. Any index with unequal values can trigger a signal for HTTP/X to update the DOM.
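A toy version of that diff, with a hypothetical array layout standing in for the real NotReallyAString type:

```csharp
using System;
using System.Collections.Generic;

// Even indexes point at shared literals, odd indexes hold dynamic values,
// so only the odd indexes ever need comparing.
string[] statics = { "<p>Count: ", "</p>" };
object[] before  = { statics[0], 41, statics[1] };
object[] after   = { statics[0], 42, statics[1] };

var changed = new List<int>();
for (int i = 1; i < before.Length; i += 2)      // odd indexes only
    if (!Equals(before[i], after[i]))
        changed.Add(i);

Console.WriteLine(string.Join(",", changed));   // 1: only the count changed
```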

Source generators

It’s important to remember that few people will probably choose to template their HTML directly in xUI's fancy interpolated strings. While it does offer decent syntactic sugar, the primary “customer” is the compiler. Most humans will probably prefer to use ZeroScript which feels like writing a regular, old-fashioned .html file from the 90s. (Read the spec at ZeroScript.org or read January’s essay: Zero New Syntax to Learn: ZeroScript). To make a comparison, ZeroScript parallels JSX while the xUI view engine parallels the core React library.

C# 6 was a milestone in that it released Roslyn, the compiler as a service – the C# compiler is now written in C#! Roslyn is incredible, especially its support for source generators! These make it trivial to build a “preprocessor” for transpiling ZeroScript into C#.

Pipelines

Pipelines make high-performance I/O in .NET much easier. The abstraction makes handling concerns like back pressure a breeze with async/await, and buffer allocation is transparent and garbage-collector-friendly.
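A minimal producer/consumer sketch using the System.IO.Pipelines package, with a Pipe standing in for a real network connection:

```csharp
using System;
using System.IO.Pipelines;   // NuGet package: System.IO.Pipelines
using System.Text;

// Producer side: write chunks into the pipe, then signal completion.
var pipe = new Pipe();
await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes("<p>Count: "));
await pipe.Writer.WriteAsync(Encoding.UTF8.GetBytes("42</p>"));
await pipe.Writer.CompleteAsync();

// Consumer side: drain everything that was written.
var received = new StringBuilder();
while (true)
{
    ReadResult result = await pipe.Reader.ReadAsync();
    foreach (var segment in result.Buffer)               // zero-copy segments
        received.Append(Encoding.UTF8.GetString(segment.Span));
    pipe.Reader.AdvanceTo(result.Buffer.End);
    if (result.IsCompleted) break;
}

Console.WriteLine(received.ToString()); // <p>Count: 42</p>
```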

There was some experimentation with pre-converting each string literal from UTF16 (.NET’s default string encoding) into UTF8 (HTTP’s default transport encoding). In the end, this optimization didn’t improve performance much since, deep in the bowels of Pipelines and Streams, the GetBytes method is already highly vectorized.
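For illustration, C# 11 can even shift that conversion to compile time via UTF-8 string literals (the u8 suffix):

```csharp
using System;
using System.Text;

// The u8 suffix makes the compiler emit the UTF-8 bytes directly into the
// binary, so no UTF-16 to UTF-8 conversion happens at run time.
ReadOnlySpan<byte> compileTime = "<p>Hello</p>"u8;

// The equivalent run-time conversion, for comparison:
byte[] runTime = Encoding.UTF8.GetBytes("<p>Hello</p>");

Console.WriteLine(compileTime.SequenceEqual(runTime)); // True
```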

Results

Summary (on my Macbook Pro):
Latency: “Render” in xxx nanoseconds. Throughput: up to 7M req/sec or 36Gb/sec

Benchmark.net: Compose
0 dynamic values
Hello world: ??? ns, 0 allocated
10 KB, 100 elements
100 KB, 1000 elements
1 MB, 10,000 elements
10 dynamic values
Hello world: ??? ns, 0 allocated
10 KB, 100 elements
100 KB, 1000 elements
1 MB, 10,000 elements
100 dynamic values
Hello world: ??? ns, 0 allocated
10 KB, 100 elements
100 KB, 1000 elements
1 MB, 10,000 elements
wrk:
7M req/sec?

Artifacts