High Performance Computing (HPC) has always been interesting to me, ever since the first time I read a Beowulf cluster joke on Slashdot.
High performance interconnects, huge storage, hundreds of nodes all working as one, the Top 500 list, what is not to love?
I’ve only dabbled with the HPC world a little bit in my career, so this blog post is NOT written from the perspective of an HPC veteran.
But if you didn’t know anything about HPC, you might naively ask: don’t we make ALL computers “high performance”?
There are reasons, but let’s just imagine for a bit.
For the purposes of this blog post, I’m going to focus only on the high performance networking aspect of HPC.
Copying & Serialization Waste
I cringe when I think about all the energy humanity is wasting continuously encoding and decoding JSON.
Of course we can be a little faster by using protobuf. But even then, programs still need to make a copy of the data and deserialize. We can do even better with something like FlatBuffers or Cap’n Proto, where you don’t need a copy/decode step.
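To make the waste concrete, here is a minimal Python sketch of the difference. This is not real FlatBuffers, just the same idea with the stdlib `struct` module: the JSON path has to copy and decode the whole payload into fresh objects, while a fixed-layout buffer can be read in place through a view.

```python
import json
import struct

# JSON path: the receiver must parse the entire payload into new
# Python objects before it can touch a single field.
json_payload = json.dumps({"id": 42, "score": 3.5}).encode()
record = json.loads(json_payload)          # full decode pass, new objects
assert record["id"] == 42

# Zero-copy path (FlatBuffers-style, greatly simplified): fields live at
# known offsets, so the receiver reads them straight out of the buffer
# it already holds -- no intermediate object graph, no copy.
LAYOUT = struct.Struct("<id")              # int32 id, float64 score
flat_payload = LAYOUT.pack(42, 3.5)
view = memoryview(flat_payload)            # a view over the bytes, not a copy
rec_id, score = LAYOUT.unpack_from(view)   # reads fields in place
assert rec_id == 42 and score == 3.5
```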
Or, in the HPC world, why even bother with all that stuff when you can literally read and write your peers’ memory directly with RDMA!
I would love to see the industry evolve a bit here and standardize on an RDMA-capable FlatBuffers or Cap’n Proto RPC protocol.
Serialization Object Objections
I can see the case for using good old JSON over HTTP in datacenters because it matches what browsers use across the internet.
But this is silly.
For one, we don’t need to use JSON over HTTP for internet stuff; you can use gRPC-Web and run the same datacenter-class RPC in the browser!
Could we really take it to eleven and use the same HPC stuff all the way to a consumer endpoint? Could we really do high-performance RDMA-accelerated zero-copy RPC … from a website?
Well, maybe we could get there incrementally.
What It Would Take to Make Our Web Browsers “High Performance”
Right now the status quo for a “normal” web RPC looks like this: Browser → CDN edge proxy → Backend.
The first problem we’ll have here as we attempt to speed things up is that the interconnect between the Edge Proxy and your Backend usually is the internet.
In other words, we usually treat the CDN as optional anyway, and the CDN talks to us “over the internet” just like any other client1. Given that, it isn’t too surprising that we would speak the same language as if the CDN wasn’t there.
That is still OK! What if we continue to evolve incrementally, sorta like how we have evolved HTTP to operate over UDP now?
What would an HTTP/4 look like if it could operate in a zero-copy RDMA way?
A New Type of High Performance Protocol
When I talk about RDMA here, I’m really talking about building a new protocol stack on top of RDMA primitives.
RDMA itself is just the raw technology for writing data directly into remote memory. But we’d still need protocol layers on top to handle connection management, authentication, and request/response semantics - similar to what HTTP or gRPC provide today.
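As a rough illustration of what such a protocol layer might handle, here is a Python sketch of request/response framing over a preregistered memory region. A plain `bytearray` stands in for RDMA-exposed memory, and the header layout and function names are made up for this sketch; with real RDMA the remote peer’s NIC would write into the region directly.

```python
import struct

# A preregistered buffer standing in for RDMA-exposed memory. The header
# layout (sequence number + payload length) is hypothetical -- it models
# the framing a request/response layer on top of raw RDMA would need.
HEADER = struct.Struct("<II")
region = bytearray(4096)

def rdma_write(buf, seqno, payload):
    """Model of a one-sided write: place the payload first, then publish
    the header last, so a poller never observes a half-written message."""
    buf[HEADER.size:HEADER.size + len(payload)] = payload
    HEADER.pack_into(buf, 0, seqno, len(payload))

def poll(buf, expected_seqno):
    """Receiver side: check the header for a new message, read in place."""
    seqno, length = HEADER.unpack_from(buf, 0)
    if seqno != expected_seqno:
        return None
    return bytes(buf[HEADER.size:HEADER.size + length])

rdma_write(region, 1, b"GET /index")
assert poll(region, 1) == b"GET /index"
```

The real stack would also need registration, flow control, and teardown, but the point is that request/response semantics become a memory-layout convention rather than a byte-stream parser.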
The payload format would need to be zero-copy friendly, using something like FlatBuffers or Cap’n Proto where you can read the data structure directly from memory without parsing.
And then there’s encryption. Modern RDMA NICs already support hardware-accelerated encryption (via MACsec or IPsec). The NIC can decrypt incoming data before writing it to memory, so the application still receives plaintext at the target memory location, and the FlatBuffers zero-copy trick should still work.
If we are going to go all the way to our clients using web browsers, this is going to have to be an extension of HTTP. And if we are going over the internet, then we will have to absorb or integrate RoCE v2 somehow, to actually enable RDMA over Ethernet over a routable IP2.
For this blog post, I’m just going to call it “HTTP/4A” (A for accelerated) and you can imagine with me.
HTTP/4A: Some sort of insane combo of FlatBuffers + RDMA + IPsec + RoCEv2 that all our server software can speak if an RDMA-capable NIC is available.
HTTP/4A To The Origin
What if the CDN could be configured to use the BEST possible protocol we could offer, given our hardware and network constraints? If Cloudflare or CloudFront had an RDMA-enabled edge proxy that you could use, what would that look like?
I hear your objections about security, and yes, we would have to make it work in a secure way. We would need the equivalent of an IOMMU that could ensure we are doing zero-copy safely to this semi-untrusted origin.
Your next objection should be about hardware. HPC stuff that uses RDMA uses exotic NICs, right?
Well, all the major cloud providers actually offer RDMA-capable networking already:
- AWS: Elastic Fabric Adapter (EFA) - available on full instance sizes for security isolation
- Azure: InfiniBand on HPC VMs - available on HB, HC, and ND series VMs
- GCP: RDMA Network profiles - RDMA over Falcon or over RoCEv2 depending on the instance type
- Oracle Cloud: RDMA Cluster Network - Free on bare metal and HPC shapes using RoCEv2
Sorta like how AWS’s S3 endpoints are dropped into your local VPC, I think cloud providers would need to provide a “local” endpoint for their CDN to fetch from you.
AI’s need for a high speed interconnect for GPU training has forced cloud providers to up their RDMA game. We all could benefit if we can democratize the technology beyond the GPU/AI space.
HTTP/4A to the Client
Let’s say CDNs did invest in accelerating the edge, and we decided to up our game when it comes to being an origin server.
How in the world would we bring HPC interconnect tech all the way to clients (browsers)?
Well, in the pursuit of performance, I think things are going to have to get complex and messy to pull it off.
A potential architecture could look like this:
- Browser control: Use WebAssembly with SharedArrayBuffer to create zero-copy memory regions
- Browser RDMA Engine: This would have to be a very special part of the browser code that could do real RDMA in a safe way, and sync to the untrusted code via the SharedMem.
I would call it a control-plane & data-plane pattern. But I think this would be the fastest HTTP in the West.
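A rough single-machine analogy of that control-plane/data-plane split, using Python’s `multiprocessing.shared_memory` in place of a browser’s SharedArrayBuffer (the roles and names here are purely illustrative):

```python
from multiprocessing import shared_memory

# Data plane: one shared region that both the untrusted "WebAssembly"
# side and the trusted "browser RDMA engine" side can see without
# copying. The names and sizes here are made up for illustration.
shm = shared_memory.SharedMemory(create=True, size=1024)
try:
    # Trusted engine side: pretend the NIC just RDMA-wrote a response
    # directly into the shared region.
    shm.buf[:5] = b"hello"

    # Untrusted side: attach to the same region by name and read the
    # bytes in place -- this is the SharedArrayBuffer half of the
    # pattern; no copy ever crosses the trust boundary.
    peer = shared_memory.SharedMemory(name=shm.name)
    assert bytes(peer.buf[:5]) == b"hello"
    peer.close()
finally:
    shm.close()
    shm.unlink()
```

The control plane (setup, permissions, teardown) stays with the trusted engine; the untrusted code only ever touches the data-plane region.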
(Diagram: the WebAssembly code and the browser RDMA engine both have zero-copy access to a shared memory region; the browser talks to the CDN’s edge proxy over WebTransport/QUIC, while the RDMA engine talks to the edge proxy over RDMA + RoCEv2, and the edge proxy talks to the backend over RDMA + RoCEv2.)
If available, HTTP/4 would signal that it can receive bytes in this way (automatic HTTP/4A upgrade), and the browser would assist in ensuring that the bytes land in the right place in RAM in a safe way.
I Can Hear Your Security Concerns Already!
HPC clusters are generally high trust environments. MPI communication is usually unencrypted. The network is usually flat.
In the pursuit of performance in HPC, security takes a back seat and has to be implemented at the perimeter.
But the web is a hostile environment!
Can RDMA RPC be secured? I think we have some of the parts of the puzzle:
- RDMA over RoCEv2 can use MACsec or IPsec for encryption.
- Modern RDMA NICs have IOMMU support to enforce memory protection between processes.
It’s definitely more complex than today’s web stack. But we’ve solved harder problems before.
Better Hardware
To pull this off, our phones are now going to need RDMA-capable wifi chipsets.
This sounds insane, but you know, wifi is already kinda insane.
As an industry, we evolve to more and more “insane” things in the pursuit of performance.
Then they get abstracted away.
Conclusion: The Evolution of Compute Abstractions
This is kinda the way things work with computers.
Computers & HTTP have never been more complex than they are now. The complexity is astounding when you think about all the parts.
Yet, mobile phones and the general way the internet is used are extremely straightforward (you tap on things with your finger).
It is the same thing with compiler technology, IC fabrication, all tech really.
Over time, the complexity becomes a stable building block to build even more incredible things upon.
I had the same initial reaction when I first read about HTTP/2, or QUIC, or WebAssembly.
To everyone who says: “the tradeoffs here are not worth the complexity”; this may be true for you alone, but it is worth it to all of humanity.
Just like how WiFi 7 has abstracted over the fact that it can use multiple frequency bands with non-contiguous RF bandwidth over 8 MIMO streams. If you had to implement this yourself, you would probably find that your particular application doesn’t justify that insane engineering.
But that is why we need the standard protocol (HTTP/4A). Right now, in order to take advantage of HPC tech, you have to make a lot of tradeoffs and kinda build your own special supercomputer. In the future, perhaps secure, zero-copy, RDMA-enabled, instantly-deserializable payloads will become just part of the HTTP/4 spec, and we won’t have to make any tradeoffs to check the button that enables it?
Till then, I’ll keep trying to figure out ways to democratize RDMA tech. It just feels like an underused rocket ship that hasn’t trickled out of the HPC world yet.
1. Cloudflare Tunnel is an attempt to provide a VPN-like experience to your origin server. ↩︎
2. I don’t know if this is actually going to work. RoCEv2 uses UDP, so that is good, but it also sounds like you can’t just use this over the internet. I think that we might need a RoCEv3? ↩︎