Zero Memory Allocations: Compile-Time Rendering Part 1 - Throughput

This is part 4 of a 12-part series called Distilling the Web to Zero. The web is far from done improving, particularly when it comes to building rich web applications that are more desirable to users, developers, and businesses alike than their native-app counterparts. The series opens by defining the biggest challenges on the road toward that goal, and the 12 essays that follow explore potential solutions using concrete examples.

When best practices become antipatterns


You’re basically wasting your life.
– Artur Bergman, founder and CEO of Fastly

I love the Fastly story. It’s about how one change in hardware technology flipped a handful of software best practices into antipatterns. Back in 2011, Artur Bergman saw solid-state drives as a paradigm shift that would open up an opportunity to compete with entrenched CDN juggernauts of the time like Akamai. So he founded Fastly.

What made solid-state drives (SSDs) so interesting is that they weren’t just an incremental improvement. While hard disk drives (HDDs) managed a random access rate of roughly 450 IOPS, SSDs delivered 75,000. You could simply swap the drive in your laptop and your boot time would drop from minutes to seconds. The difference is that HDDs work by physically moving an arm to read bytes from a spinning platter; that movement, called seek time, adds roughly 8 milliseconds to every random access. SSDs have no moving parts and therefore no seek overhead. The way he describes the difference is worth a quick watch.

To take full advantage of this new paradigm, Fastly maxed out “commodity” servers (for 2011) with 768GB of memory and 18TB of SSD storage. A customized Varnish installation ran on top of an OS configured to use the SSD storage as virtual memory, giving Varnish 18TB of cheap, low-latency memory space capable of caching 2+ billion objects per machine.

As Bergman explains, it doesn’t make any sense to handle this from the application layer. Writing your own LRU cache that purges individual objects from RAM to disk will be slower. The OS is already quite good at paging 4KB chunks of RAM to and from storage, so just let it do that, use a fast drive, and enjoy having terabytes of cheap random-access memory to work with.

Edge computing… any day now

There’s another paradigm shift waiting to happen, but this time it’s not hardware, it’s infrastructure! Edge computing has been steadily growing over the past few years. In my city there are local zones for all 3 major cloud providers. Check your own latency. I can round-trip from my house to an AWS EC2 instance in 4 milliseconds! That’s two round trips completed in the time it takes my backup drive to read its first byte.

To put that in context: blinking your eyes takes roughly 250 milliseconds, and a 60Hz monitor refreshes only once every 16 milliseconds. In his book Usability Engineering, Jakob Nielsen establishes 100 milliseconds as the threshold below which humans perceive a response as instant.

Where’s the killer app?

Just like HDDs, where the only way to break through the physical limitation of seek time was to forgo all moving parts, edge computing overcomes its own physical limitation (the speed of light) by moving compute resources geographically closer to users. This general strategy makes sense, is easy to explain, is widely accepted, and is validated by the billions of dollars invested in these data center build-outs.

Unfortunately, edge computing still seems to be missing its killer app. In web dev communities, the only use cases that are gaining traction are around configuration and middleware routing. How very underwhelming! Just recently Vercel, a popular cloud provider, announced some shockingly unexpected backpedaling regarding their usage of the edge.


Why does the edge only work for serving static bits but not for rendering dynamic bits? The answer can be found in last month’s essay about Web2’s horizontal tax. In a nutshell, since Web2 cannot run its logic in the absence of data, the added physical distance between the two amplifies the horizontal tax to the point of negative return.

This raises a few questions: Is it possible for web servers to render UI in the absence of data? Why is it that native mobile apps can do this so easily? If webapps were built like native apps, what best practices would suddenly become new antipatterns? Why is static data so edge-friendly while dynamic data is not? What needs to change in order to make edge rendering work?

A CDN-inspired web framework

A CDN has the luxury of not needing to understand the bytes it sends over the network. It specializes in firing a contiguous chunk of bytes from memory down the wire like the gun-fire sounds of the Bellagio fountain. As a result, it can fully saturate the NIC while the CPU remains mostly idle. Fastly boasts that one of their machines (from 2013) can continuously push 25 Gbit/sec and 40k req/sec while returning results in 600 microseconds. These are the table stakes when working on the edge – roughly a 1000x difference compared to what’s typical for dynamic rendering upstream in the cloud region.

So what exactly would it look like to build an edge-first web framework? To approach that same question from the opposite angle, how could an old-fashioned static caching engine like Varnish be evolved to support dynamic/reactive data?

There seem to be two standout characteristics about caching servers that differ from traditional web frameworks:

  1. Predictable, sustained, maximum throughput
  2. Non-blocking, submillisecond response times

For #1, predictable, sustained throughput, the first thing that comes to mind for video game devs is probably wrestling with the garbage collector. Last month’s essay already discussed building on top of a C#/.NET tech stack, which is a managed runtime and therefore uses a garbage collector. Good news! As part of the “renaissance” mentioned in that essay, Microsoft (and its many outside contributors) have improved the .NET runtime by leaps and bounds. These days, writing C# code that is simultaneously managed code AND allocates zero memory is not just possible but an increasingly common practice.
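
To make that concrete, here is a minimal sketch (assuming nothing about xUI’s internals) of what zero-allocation C# can look like: the scratch buffer is carved out of the stack with stackalloc, and Utf8Formatter writes the digits straight into it, so the heap is never touched and the GC has nothing to collect.

```csharp
using System;
using System.Buffers.Text;

class Program
{
    static void Main()
    {
        // Stack-allocated scratch space: freed automatically on return,
        // never seen by the garbage collector.
        Span<byte> buffer = stackalloc byte[32];

        // Format an integer directly as UTF-8 bytes -- no intermediate
        // string object is ever created.
        Utf8Formatter.TryFormat(12345, buffer, out int bytesWritten);

        Console.WriteLine(bytesWritten); // prints 5
    }
}
```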

For #2, non-blocking, submillisecond response times, the best option is to go the same route as the SSD and remove all the moving parts. Why not make rendering a compile-time step? A caching server has no need to construct a DOM tree; it just needs those bytes lined up in contiguous memory, ready to fire over the network.

NOTE

Outside the scope of this essay is how exactly to render in the absence of data. This local-first mindset was covered briefly in last month’s essay about Web4. Concrete examples of this are coming later this year.

The xUI view engine

The vast majority of web pages are composed of thousands of tiny tags arranged in a giant tree. However, most of those tags never change. In fact, it’s not uncommon to see a 100:1 ratio of markup that never changes to markup that does. An edge-first view engine could exploit that ratio by using its compiler to isolate the portions that never change and prepare them as contiguous arrays of wire-ready bytes. This moves rendering to a compile-time expense, paid only once and ahead of time. Serving the webpage would closely resemble a static cache except, instead of sending one contiguous chunk of bytes over the wire, it would alternate back and forth: large static chunk, small dynamic value, large static chunk, small dynamic value, and so on. Each webpage would require 2n + 1 writes to the network, where n is the number of dynamic values the webpage contains.
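
As a rough illustration, here is what that 2n + 1 write pattern might look like for a page with n = 2 dynamic values (the type and chunk names below are hypothetical, not xUI’s actual output):

```csharp
using System.IO.Pipelines;
using System.Text;
using System.Threading.Tasks;

static class InterleavedWriter
{
    // In the real engine these chunks would be prepared by the compiler;
    // they're shown inline here for clarity.
    static readonly byte[] Chunk1 = Encoding.UTF8.GetBytes("<html><body><h1>");
    static readonly byte[] Chunk2 = Encoding.UTF8.GetBytes("</h1><p>Temperature: ");
    static readonly byte[] Chunk3 = Encoding.UTF8.GetBytes("</p></body></html>");

    public static async Task WritePageAsync(PipeWriter output, string title, int temp)
    {
        await output.WriteAsync(Chunk1);                                  // static  (1)
        await output.WriteAsync(Encoding.UTF8.GetBytes(title));           // dynamic (2)
        await output.WriteAsync(Chunk2);                                  // static  (3)
        await output.WriteAsync(Encoding.UTF8.GetBytes(temp.ToString())); // dynamic (4)
        await output.WriteAsync(Chunk3);                                  // static  (5)
        // n = 2 dynamic values => 2n + 1 = 5 writes. A true zero-allocation
        // engine would format the dynamic values straight into the pipe's
        // buffer instead of calling GetBytes, which allocates.
    }
}
```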

Here are the specifics on how xUI accomplishes compile-time rendering with zero memory allocations.

Structs

Every declarative/reactive UI framework falls into one of two camps: object-oriented (where state is accessed via object fields) or functional (where state is accessed via hooks). iOS has SwiftUI, which is object-oriented, while Android has Compose, which is functional. React famously switched from object-oriented to preferring functional for a few cited reasons. While they don’t mention sustained speed among those reasons, I’m inclined to believe it was a factor. Since rendering might occur many times per second, allocating thousands of tiny objects on every render pass creates a huge burden for the garbage collector, slowing the UI or even triggering excessive GC pauses. Using a function to call a function to call a function thousands of times over is a clever way around this constraint since function calls use the stack instead of the heap. If Android chose functional and React switched to functional, why, then, did Apple choose the object-oriented approach? Because the Swift language has the luxury of rendering with structs instead of classes, while JavaScript and Kotlin do not.

C# has structs too. In fact, they’ve been around since v1.0. They’re an incredibly useful tool for offloading work from the garbage collector. In many ways they look just like a class, except they are value types: their fields occupy a contiguous block of memory, and as locals they live on the stack instead of the heap. No system calls for allocating memory are necessary, there’s no memory fragmentation, and there’s no garbage to clean up. Like everything, structs come with their own trade-offs (which are outside the scope of this essay), but it’s interesting to point out the pattern: languages without complex value types (like structs) use hooks for state and functions for composability, while languages with lower-level memory constructs do not.
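
As a loose sketch of the idea (the Counter type below is illustrative, not xUI’s API), UI state modeled as a struct composes and copies entirely on the stack:

```csharp
using System;

// A struct is a value type: its fields occupy one contiguous block of
// memory, and as a local variable it lives on the stack.
readonly struct Counter
{
    public readonly int Count;
    public Counter(int count) => Count = count;

    // "Updating" produces a new value -- a few bytes copied on the
    // stack, with no object allocation and no garbage.
    public Counter Increment() => new Counter(Count + 1);
}

class Program
{
    static void Main()
    {
        var counter = new Counter(0);
        counter = counter.Increment();
        Console.WriteLine(counter.Count); // prints 1
    }
}
```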

Raw string literals

xUI has a keen interest in knowing which parts of a template are dynamic and which are guaranteed to never change. Since HTML is a structural language, it’s not uncommon for 99% of a document to never change. String literals are a helpful tool for this.

Raw string literals are new in C# 11. They’re syntactic sugar on top of regular string literals, but they can contain characters that regular literals cannot, including whitespace, new lines, embedded quotes, and other special characters, all without escape sequences.
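
For example, a raw string literal can hold a chunk of HTML, embedded quotes, newlines, and all, with no escaping:

```csharp
using System;

// C# 11 raw string literal: triple quotes delimit the string, and the
// closing delimiter's indentation is stripped from every line.
var nav = """
    <nav class="main-nav">
        <a href="/home">Home</a>
        <a href="/about">About</a>
    </nav>
    """;

Console.WriteLine(nav); // prints the markup exactly, quotes and all
```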

Interpolated string handlers

Interpolated string handlers are the primary mechanism for separating immutable markup from dynamic values. They’re a powerful compiler service baked into C#’s Roslyn compiler-as-a-service. There’s a lot to them, so they’re covered in deeper detail in next month’s essay: Zero DOM.
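
To give a taste before that deeper dive, here is a heavily simplified, hypothetical handler (not xUI’s real one). The compiler lowers the interpolated string into one AppendLiteral call per compile-time-constant chunk and one AppendFormatted call per hole, handing static markup and dynamic values over as separate, strongly typed pieces:

```csharp
using System;
using System.Runtime.CompilerServices;

[InterpolatedStringHandler]
public ref struct HtmlHandler
{
    public HtmlHandler(int literalLength, int formattedCount) { }

    // Receives each chunk of markup that can never change.
    public void AppendLiteral(string s) => Console.WriteLine($"static:  \"{s}\"");

    // Receives each dynamic {expression}, still strongly typed.
    public void AppendFormatted<T>(T value) => Console.WriteLine($"dynamic: {value}");
}

class Demo
{
    // Because the parameter is a handler type, the compiler builds the
    // handler and calls its Append methods before this method runs.
    static void Render(HtmlHandler handler) { }

    static void Main()
    {
        int count = 42;
        Render($"<p>Count: {count}</p>");
        // static:  "<p>Count: "
        // dynamic: 42
        // static:  "</p>"
    }
}
```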

Source generators

It’s important to remember that few people will likely choose to template their HTML directly in xUI’s fancy interpolated strings. While they offer decent syntactic sugar, the primary “customer” is the compiler. Most humans will probably prefer to use ZeroScript, which feels like writing a regular, old-fashioned .html file from the 90s. (Read the spec at ZeroScript.org or read January’s essay: Zero New Syntax to Learn: ZeroScript.) To make a comparison, ZeroScript parallels JSX while the xUI view engine parallels the core React library.

C# 6 was a milestone in that it shipped Roslyn, the compiler as a service – the C# compiler is now written in C#! Roslyn is incredible, especially its support for source generators, which make it trivial to build a “preprocessor” for transpiling ZeroScript into C#.
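
As a bare-bones sketch (hypothetical, not the actual ZeroScript transpiler), an incremental source generator can pick up every .html file registered as an AdditionalFile and emit a C# class for it:

```csharp
using System.IO;
using Microsoft.CodeAnalysis;

[Generator]
public class ZeroScriptGenerator : IIncrementalGenerator
{
    public void Initialize(IncrementalGeneratorInitializationContext context)
    {
        // Watch every .html file the project registers as an AdditionalFile.
        var htmlFiles = context.AdditionalTextsProvider
            .Where(static file => file.Path.EndsWith(".html"));

        context.RegisterSourceOutput(htmlFiles, static (spc, file) =>
        {
            string name = Path.GetFileNameWithoutExtension(file.Path);
            string html = file.GetText()?.ToString() ?? "";

            // A real transpiler would parse the template and generate
            // rendering code; this sketch just embeds the raw markup.
            spc.AddSource($"{name}.g.cs",
                $"public static class {name}View\n" +
                "{\n" +
                $"    public const string Markup = @\"{html.Replace("\"", "\"\"")}\";\n" +
                "}\n");
        });
    }
}
```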

Pipelines

Pipelines make it easier to do high-performance I/O in .NET. The abstraction makes handling concerns like back pressure a breeze with async/await, and buffer allocation is transparent and garbage-collector-friendly.
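
Here is a minimal sketch of the model: the producer rents buffer space directly from the pipe, and FlushAsync transparently applies back pressure whenever the consumer falls behind.

```csharp
using System;
using System.IO.Pipelines;
using System.Threading.Tasks;

class Program
{
    static async Task Main()
    {
        var pipe = new Pipe();

        // Producer: write directly into pipe-managed memory.
        Memory<byte> buffer = pipe.Writer.GetMemory(3);
        "abc"u8.CopyTo(buffer.Span);        // UTF-8 literal, no string allocated
        pipe.Writer.Advance(3);
        await pipe.Writer.FlushAsync();     // suspends here if the pipe is full
        await pipe.Writer.CompleteAsync();

        // Consumer: read whatever is available, then release it.
        ReadResult result = await pipe.Reader.ReadAsync();
        Console.WriteLine(result.Buffer.Length); // prints 3
        pipe.Reader.AdvanceTo(result.Buffer.End);
        await pipe.Reader.CompleteAsync();
    }
}
```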

There was some experimentation with pre-converting each string literal from UTF-16 (.NET’s default string encoding) into UTF-8 (HTTP’s default transport encoding). In the end, this optimization didn’t improve performance by much since, deep in the bowels of Pipelines and Streams, the GetBytes method is already highly vectorized.

Results

Coming soon...

Artifacts