
Scaling MCP adoption: Our reference architecture for simpler, safer and cheaper enterprise deployments of MCP

April 14, 2026, 10:00

We at Cloudflare have aggressively adopted the Model Context Protocol (MCP) as a core part of our AI strategy. This shift has moved well beyond our engineering organization: employees across product, sales, marketing, and finance teams now use agentic workflows to drive efficiency in their daily tasks. But the adoption of agentic workflows with MCP is not without security risks, which range from authorization sprawl and prompt injection to supply chain attacks. To secure this broad company-wide adoption, we have integrated a suite of security controls from both our Cloudflare One (SASE) platform and our Cloudflare Developer platform, allowing us to govern AI usage with MCP without slowing down our workforce.

In this blog we’ll walk through our own best practices for securing MCP workflows, putting different parts of our platform together to create a unified security architecture for the era of autonomous AI. We’ll also share two new concepts that support enterprise MCP deployments.

Below, we describe how our organization approached deploying MCP, and how we built our MCP security architecture using Cloudflare products including remote MCP servers, Cloudflare Access, MCP server portals, and AI Gateway.

Remote MCP servers provide better visibility and control

MCP is an open standard that enables developers to build a two-way connection between AI applications and the data sources they need to access. In this architecture, the MCP client is the integration point with the LLM or other AI agent, and the MCP server sits between the MCP client and the corporate resources.

The separation between MCP clients and MCP servers allows agents to autonomously pursue goals and take actions while maintaining a clear boundary between the AI (integrated at the MCP client) and the credentials and APIs of the corporate resource (integrated at the MCP server). 

Our workforce at Cloudflare is constantly using MCP servers to access information in various internal resources, including our project management platform, our internal wiki, documentation and code management platforms, and more. 

Very early on, we realized that locally-hosted MCP servers were a security liability. Local MCP server deployments may rely on unvetted software sources and versions, which increases the risk of supply chain or tool injection attacks. They also prevent IT and security administrators from managing these servers, leaving it to individual employees and developers to choose which MCP servers to run and how to keep them up to date. This is a losing game.

Instead, we have a centralized team at Cloudflare that manages our MCP server deployment across the enterprise. This team built a shared MCP platform inside our monorepo that provides governed infrastructure out of the box. When an employee wants to expose an internal resource via MCP, they first get approval from our AI governance team, and then they copy a template, write their tool definitions, and deploy, all the while inheriting default-deny write controls with audit logging, auto-generated CI/CD pipelines, and secrets management for free. This means standing up a new governed MCP server takes minutes of scaffolding. The governance is baked into the platform itself, which is what allowed adoption to spread so quickly.

Our CI/CD pipeline deploys them as remote MCP servers on custom domains on Cloudflare’s developer platform. This gives us visibility into which MCP servers are being used by our employees, while maintaining control over software sources. As an added bonus, every remote MCP server on the Cloudflare developer platform is automatically deployed across our global network of data centers, so MCP servers can be accessed by our employees with low latency, regardless of where they might be in the world.

Cloudflare Access provides authentication

Some of our MCP servers sit in front of public resources, like our Cloudflare documentation MCP server or Cloudflare Radar MCP server, and thus we want them to be accessible to anyone. But many of the MCP servers used by our workforce are sitting in front of our private corporate resources. These MCP servers require user authentication to ensure that they are off limits to everyone but authorized Cloudflare employees. To achieve this, our monorepo template for MCP servers integrates Cloudflare Access as the OAuth provider. Cloudflare Access secures login flows and issues access tokens to resources, while acting as an identity aggregator that verifies end user single-sign on (SSO), multifactor authentication (MFA), and a variety of contextual attributes such as IP addresses, location, or device certificates. 

MCP server portals centralize discovery and governance

MCP server portals unify governance and control for all AI activity.

As the number of our remote MCP servers grew, we hit a new wall: discovery. We wanted to make it easy for every employee (especially those who are new to MCP) to find and work with all the MCP servers available to them. Our MCP server portals product provided a convenient solution. The employee simply connects their MCP client to the MCP server portal, and the portal immediately reveals every internal and third-party MCP server they are authorized to use.

Beyond this, our MCP server portals provide centralized logging, consistent policy enforcement, and data loss prevention (DLP) guardrails. Our administrators can see who logged into which MCP portal and create DLP rules that prevent certain data, like personally identifiable information (PII), from being shared with specific MCP servers.

We can also create policies that control who has access to the portal itself, and which tools from each MCP server should be exposed. For example, we could set up one MCP server portal, accessible only to employees in our finance group, that exposes just the read-only tools of the MCP server in front of our internal code repository. Meanwhile, a different MCP server portal, accessible only to engineering-team employees on their corporate laptops, could expose the more powerful read/write tools of our code repository MCP server.
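To make the idea concrete, here is a sketch of such tool-exposure policies in JavaScript. The policy shape, group names, and tool names are all hypothetical; this is not the actual portal configuration API.

```javascript
// Hypothetical policy shape for illustration only: not the actual
// MCP server portal configuration API.
const portalPolicies = [
  {
    portal: "finance-portal",
    allowGroups: ["finance"],
    exposedTools: { code_repo: ["repo_read_file", "repo_search"] },
  },
  {
    portal: "eng-portal",
    allowGroups: ["engineering"],
    exposedTools: {
      code_repo: ["repo_read_file", "repo_search", "repo_write_file"],
    },
  },
];

// Is a user in `group` allowed to call `tool` on `server` via `portal`?
function toolAllowed(policies, portal, group, server, tool) {
  const p = policies.find((x) => x.portal === portal);
  if (!p || !p.allowGroups.includes(group)) return false;
  return (p.exposedTools[server] ?? []).includes(tool);
}

console.log(toolAllowed(portalPolicies, "finance-portal", "finance", "code_repo", "repo_write_file")); // false
```

The finance portal answers only read requests, while the engineering portal exposes write tools to the same upstream server.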

An overview of our MCP server portal architecture is shown above. The portal supports both remote MCP servers hosted on Cloudflare, and third-party MCP servers hosted anywhere else. What makes this architecture uniquely performant is that all these security and networking components run on the same physical machine within our global network. When an employee's request moves through the MCP server portal, a Cloudflare-hosted remote MCP server, and Cloudflare Access, their traffic never needs to leave the same physical machine. 

Code Mode with MCP server portals reduces costs

After months of high-volume MCP deployments, we’ve paid for our fair share of tokens. We’ve also started to think most people are doing MCP wrong.

The standard approach to MCP requires defining a separate tool for every API operation that is exposed via an MCP server. But this static, exhaustive approach quickly consumes an agent’s context window, especially for large platforms with thousands of endpoints.

We previously wrote about how we used server-side Code Mode to power Cloudflare’s MCP server, allowing us to expose the thousands of endpoints in the Cloudflare API while reducing token use by 99.9%. The Cloudflare MCP server exposes just two tools: a search tool lets the model write JavaScript to explore what’s available, and an execute tool lets it write JavaScript to call the tools it finds. The model discovers what it needs on demand, rather than receiving everything upfront.

We like this pattern so much, we had to make it available for everyone. So we have now launched the ability to use the “Code Mode” pattern with MCP server portals. Now you can front all of your MCP servers with a centralized portal that performs audit controls and progressive tool disclosure, in order to reduce token costs.

Here is how it works. Instead of exposing every tool definition to a client, all of your underlying MCP servers collapse into just two MCP portal tools: portal_codemode_search and portal_codemode_execute. The search tool gives the model access to a codemode.tools() function that returns all the tool definitions from every connected upstream MCP server. The model then writes JavaScript to filter and explore these definitions, finding exactly the tools it needs without every schema being loaded into context. The execute tool provides a codemode proxy object where each upstream tool is available as a callable function. The model writes JavaScript that calls these tools directly, chaining multiple operations, filtering results, and handling errors in code. All of this runs in a sandboxed environment on the MCP server portal powered by Dynamic Workers.

Here is an example of an agent that needs to find a Jira ticket and update it with information from Google Drive. It first searches for the right tools:

// portal_codemode_search
async () => {
 const tools = await codemode.tools();
 return tools
  .filter(t => t.name.includes("jira") || t.name.includes("drive"))
  .map(t => ({ name: t.name, params: Object.keys(t.inputSchema.properties || {}) }));
}

The model now knows the exact tool names and parameters it needs, without the full schemas of tools ever entering its context. It then writes a single execute call to chain the operations together:

// portal_codemode_execute
async () => {
  const tickets = await codemode.jira_search_jira_with_jql({
    jql: 'project = BLOG AND status = "In Progress"',
    fields: ["summary", "description"]
  });
  const doc = await codemode.google_workspace_drive_get_content({
    fileId: "1aBcDeFgHiJk"
  });
  await codemode.jira_update_jira_ticket({
    issueKey: tickets[0].key,
    fields: { description: tickets[0].description + "\n\n" + doc.content }
  });
  return { updated: tickets[0].key };
}

This is just two tool calls. The first discovers what's available, the second does the work. Without Code Mode, this same workflow would have required the model to receive the full schemas of every tool from both MCP servers upfront, and then make three separate tool invocations.

Let’s put the savings in perspective: when our internal MCP server portal is connected to just four of our internal MCP servers, it exposes 52 tools that consume approximately 9,400 tokens of context just for their definitions. With Code Mode enabled, those 52 tools collapse into 2 portal tools consuming roughly 600 tokens, a 94% reduction. And critically, this cost stays fixed. As we connect more MCP servers to the portal, the token cost of Code Mode doesn’t grow.
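The percentage quoted above follows directly from the token counts, which a quick calculation confirms:

```javascript
// Sanity-checking the quoted savings: 52 tool definitions (~9,400
// tokens of context) collapse into 2 portal tools (~600 tokens).
const tokensBefore = 9400;
const tokensAfter = 600;
const reductionPct = Math.round(((tokensBefore - tokensAfter) / tokensBefore) * 100);
console.log(reductionPct); // 94
```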

Code Mode can be activated on an MCP server portal by adding a query parameter to the URL. Instead of connecting to your portal over its usual URL (e.g. https://myportal.example.com/mcp), you attach ?codemode=search_and_execute to the URL (e.g. https://myportal.example.com/mcp?codemode=search_and_execute).
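For example, the Code Mode URL can be built from a portal's base MCP endpoint with the standard URL API (the hostname below is a placeholder):

```javascript
// Building the Code Mode URL from a portal's base MCP endpoint.
// "myportal.example.com" is a placeholder hostname.
const portalUrl = new URL("https://myportal.example.com/mcp");
portalUrl.searchParams.set("codemode", "search_and_execute");
console.log(portalUrl.toString());
// https://myportal.example.com/mcp?codemode=search_and_execute
```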

AI Gateway provides extensibility and cost controls

We aren’t done yet. We plug AI Gateway into our architecture by positioning it on the connection between the MCP client and the LLM. This allows us to quickly switch between various LLM providers (to prevent vendor lock-in) and to enforce cost controls (by limiting the number of tokens each employee can burn through). The full architecture is shown below.

Cloudflare Gateway discovers and blocks shadow MCP

Now that we’ve provided governed access to authorized MCP servers, let’s look into dealing with unauthorized MCP servers. We can perform shadow MCP discovery using Cloudflare Gateway. Cloudflare Gateway is our comprehensive secure web gateway that provides enterprise security teams with visibility and control over their employees’ Internet traffic.

We can use the Cloudflare Gateway API to perform a multi-layer scan to find remote MCP servers that are not being accessed via an MCP server portal. This is possible using a variety of existing Gateway and Data Loss Prevention (DLP) selectors, including:

  • Using the Gateway httpHost selector to scan for 

    • known MCP server hostnames (like mcp.stripe.com)

    • mcp.* subdomains using wildcard hostname patterns 

  • Using the Gateway httpRequestURI selector to scan for MCP-specific URL paths like /mcp and /mcp/sse 

  • Using DLP-based body inspection to find MCP traffic, even if that traffic uses URIs that do not contain the telltale mentions of mcp or sse. Specifically, we use the fact that MCP uses JSON-RPC over HTTP, which means every request contains a "method" field with values like "tools/call", "prompts/get", or "initialize". Here are some regex rules that can be used to detect MCP traffic in the HTTP body:

const DLP_REGEX_PATTERNS = [
  {
    name: "MCP Initialize Method",
    regex: '"method"\\s{0,5}:\\s{0,5}"initialize"',
  },
  {
    name: "MCP Tools Call",
    regex: '"method"\\s{0,5}:\\s{0,5}"tools/call"',
  },
  {
    name: "MCP Tools List",
    regex: '"method"\\s{0,5}:\\s{0,5}"tools/list"',
  },
  {
    name: "MCP Resources Read",
    regex: '"method"\\s{0,5}:\\s{0,5}"resources/read"',
  },
  {
    name: "MCP Resources List",
    regex: '"method"\\s{0,5}:\\s{0,5}"resources/list"',
  },
  {
    name: "MCP Prompts List",
    regex: '"method"\\s{0,5}:\\s{0,5}"prompts/(list|get)"',
  },
  {
    name: "MCP Sampling Create Message",
    regex: '"method"\\s{0,5}:\\s{0,5}"sampling/createMessage"',
  },
  {
    name: "MCP Protocol Version",
    regex: '"protocolVersion"\\s{0,5}:\\s{0,5}"202[4-9]',
  },
  {
    name: "MCP Notifications Initialized",
    regex: '"method"\\s{0,5}:\\s{0,5}"notifications/initialized"',
  },
  {
    name: "MCP Roots List",
    regex: '"method"\\s{0,5}:\\s{0,5}"roots/list"',
  },
];
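As a quick illustration, two of the patterns above can be exercised against a sample JSON-RPC body. The body below is a made-up tools/call request, but it has the compact shape MCP clients actually send:

```javascript
// Exercising two of the DLP patterns against a sample (made-up)
// MCP request body: compact JSON-RPC, as a client would send it.
const patterns = [
  '"method"\\s{0,5}:\\s{0,5}"initialize"',
  '"method"\\s{0,5}:\\s{0,5}"tools/call"',
];
const body =
  '{"jsonrpc":"2.0","id":1,"method":"tools/call","params":{"name":"search"}}';
const matched = patterns.filter((p) => new RegExp(p).test(body));
console.log(matched.length); // 1: only the "tools/call" pattern matches
```

Note the `\s{0,5}` allows for optional whitespace around the colon, so both compact and pretty-printed JSON bodies match.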

The Gateway API supports additional automation. For example, one can use the custom DLP profile defined above to block traffic, redirect it, or simply log and inspect MCP payloads. Put this all together, and Gateway provides comprehensive detection of unauthorized remote MCP servers accessed via an enterprise network.

For more information on how to build this out, see this tutorial.

Public-facing MCP Servers are protected with AI Security for Apps

So far, we’ve been focused on protecting our workforce’s access to our internal MCP servers. But, like many other organizations, we also have public-facing MCP servers that our customers can use to agentically administer and operate Cloudflare products. These MCP servers are hosted on Cloudflare’s developer platform. (You can find a list of individual MCPs for specific products here, or refer back to our new approach for providing more efficient access to the entire Cloudflare API using Code Mode.)

We believe that every organization should publish official, first-party MCP servers for their products. The alternative is that your customers source unvetted servers from public repositories where packages may contain dangerous trust assumptions, undisclosed data collection, and any range of unsanctioned behaviors. By publishing your own MCP servers, you control the code, update cadence, and security posture of the tools your customers use.

Since every remote MCP server is an HTTP endpoint, we can put it behind the Cloudflare Web Application Firewall (WAF). Customers can enable the AI Security for Apps feature within the WAF to automatically inspect inbound MCP traffic for prompt injection attempts, sensitive data leakage, and topic classification. Public-facing MCP servers are protected just like any other web API.

The future of MCP in the enterprise

We hope our experience, products, and reference architectures will be useful to other organizations as they continue along their own journey towards broad enterprise-wide adoption of MCP.

We’ve secured our own MCP workflows by: 

  • Offering our developers a templated framework for building and deploying remote MCP servers on our developer platform using Cloudflare Access for authentication

  • Ensuring secure, identity-based access to authorized MCP servers by connecting our entire workforce to MCP server portals

  • Controlling costs using AI Gateway to mediate access to the LLMs powering our workforce’s MCP clients, and using Code Mode in MCP server portals to reduce token consumption and context bloat

  • Discovering shadow MCP usage with Cloudflare Gateway

For organizations advancing on their own enterprise MCP journeys, we recommend starting by putting your existing remote and third-party MCP servers behind Cloudflare MCP server portals and enabling Code Mode, to start benefiting from cheaper, safer, and simpler enterprise deployments of MCP.

Acknowledgements: This reference architecture and blog represent the work of many people across many different roles and business units at Cloudflare. This is just a partial list of contributors: Ann Ming Samborski, Kate Reznykova, Mike Nomitch, James Royal, Liam Reese, Yumna Moazzam, Simon Thorpe, Rian van der Merwe, Rajesh Bhatia, Ayush Thakur, Gonzalo Chavarri, Maddy Onyehara, and Haley Campbell.

Building a serverless, post-quantum Matrix homeserver

January 27, 2026, 11:00

* This post was updated at 11:45 a.m. Pacific time to clarify that the use case described here is a proof of concept and a personal project. Some sections have been updated for clarity.

Matrix is the gold standard for decentralized, end-to-end encrypted communication. It powers government messaging systems, open-source communities, and privacy-focused organizations worldwide. 

For the individual developer, however, the appeal is often closer to home: bridging fragmented chat networks (like Discord and Slack) into a single inbox, or simply ensuring your conversation history lives on infrastructure you control. Functionally, Matrix operates as a decentralized, eventually consistent state machine. Instead of a central server pushing updates, homeservers exchange signed JSON events over HTTP, using a conflict resolution algorithm to merge these streams into a unified view of the room's history.

But there is a "tax" to running it. Traditionally, operating a Matrix homeserver has meant accepting a heavy operational burden. You have to provision virtual private servers (VPS), tune PostgreSQL for heavy write loads, manage Redis for caching, configure reverse proxies, and handle rotation for TLS certificates. It’s a stateful, heavy beast that demands to be fed time and money, whether you’re using it a lot or a little.

We wanted to see if we could eliminate that tax entirely.

Spoiler: We could. In this post, we’ll explain how we ported a Matrix homeserver to Cloudflare Workers. The resulting proof of concept is a serverless architecture where operations disappear, costs scale to zero when idle, and every connection is protected by post-quantum cryptography by default. You can view the source code and deploy your own instance directly from GitHub.

From Synapse to Workers

Our starting point was Synapse, the Python-based reference Matrix homeserver designed for traditional deployments: PostgreSQL for persistence, Redis for caching, a filesystem for media.

Porting it to Workers meant questioning every storage assumption we’d taken for granted.

The challenge was storage. Traditional homeservers assume strong consistency via a central SQL database. Cloudflare Durable Objects offers a powerful alternative. This primitive gives us the strong consistency and atomicity required for Matrix state resolution, while still allowing the application to run at the edge.

We ported the core Matrix protocol logic — event authorization, room state resolution, cryptographic verification — in TypeScript using the Hono framework. D1 replaces PostgreSQL, KV replaces Redis, R2 replaces the filesystem, and Durable Objects handle real-time coordination.

Here’s how the mapping worked out:

  • PostgreSQL → D1

  • Redis → KV

  • Filesystem → R2

  • Real-time coordination → Durable Objects

From monolith to serverless

Moving to Cloudflare Workers brings several advantages for a developer: simple deployment, lower costs, low latency, and built-in security.

Easy deployment: A traditional Matrix deployment requires server provisioning, PostgreSQL administration, Redis cluster management, TLS certificate renewal, load balancer configuration, monitoring infrastructure, and on-call rotations.

With Workers, deployment is simply: wrangler deploy. Workers handles TLS, load balancing, DDoS protection, and global distribution.

Usage-based costs: Traditional homeservers cost money whether anyone is using them or not. Workers pricing is request-based, so you pay when you’re using it, but costs drop to near zero when everyone’s asleep. 

Lower latency globally: A traditional Matrix homeserver in us-east-1 adds 200ms+ latency for users in Asia or Europe. Workers, meanwhile, run in 300+ locations worldwide. When a user in Tokyo sends a message, the Worker executes in Tokyo. 

Built-in security: Matrix homeservers can be high-value targets: They handle encrypted communications, store message history, and authenticate users. Traditional deployments require careful hardening: firewall configuration, rate limiting, DDoS mitigation, WAF rules, IP reputation filtering.

Workers provide all of this by default. 

Post-quantum protection 

Cloudflare deployed post-quantum hybrid key agreement across all TLS 1.3 connections in October 2022. Every connection to our Worker automatically negotiates X25519MLKEM768 — a hybrid combining classical X25519 with ML-KEM, the post-quantum algorithm standardized by NIST.

Classical cryptography relies on mathematical problems that are hard for traditional computers but trivial for quantum computers running Shor’s algorithm. ML-KEM is based on lattice problems that remain hard even for quantum computers. The hybrid approach means both algorithms must fail for the connection to be compromised.

Following a message through the system

Understanding where encryption happens matters for security architecture. When someone sends a message through our homeserver, here’s the actual path:

The sender’s client takes the plaintext message and encrypts it with Megolm — Matrix’s end-to-end encryption. This encrypted payload then gets wrapped in TLS for transport. On Cloudflare, that TLS connection uses X25519MLKEM768, making it quantum-resistant.

The Worker terminates TLS, but what it receives is still encrypted — the Megolm ciphertext. We store that ciphertext in D1, index it by room and timestamp, and deliver it to recipients. But we never see the plaintext. The message “Hello, world” exists only on the sender’s device and the recipient’s device.

When the recipient syncs, the process reverses. They receive the encrypted payload over another quantum-resistant TLS connection, then decrypt locally with their Megolm session keys.

Two layers, independent protection

This protects via two encryption layers that operate independently:

The transport layer (TLS) protects data in transit. It’s encrypted at the client and decrypted at the Cloudflare edge. With X25519MLKEM768, this layer is now post-quantum.

The application layer (Megolm E2EE) protects message content. It’s encrypted on the sender’s device and decrypted only on recipient devices. This uses classical Curve25519 cryptography.

Who sees what

Any Matrix homeserver operator — whether running Synapse on a VPS or this implementation on Workers — can see metadata: which rooms exist, who’s in them, when messages were sent. But no one in the infrastructure chain can see the message content, because the E2EE payload is encrypted on sender devices before it ever hits the network. Cloudflare terminates TLS and passes requests to your Worker, but both see only Megolm ciphertext. Media in encrypted rooms is encrypted client-side before upload, and private keys never leave user devices.

What traditional deployments would need

Achieving post-quantum TLS on a traditional Matrix deployment would require upgrading OpenSSL or BoringSSL to a version supporting ML-KEM, configuring cipher suite preferences correctly, testing client compatibility across all Matrix apps, monitoring for TLS negotiation failures, staying current as PQC standards evolve, and handling clients that don’t support PQC gracefully.

With Workers, it’s automatic. Chrome, Firefox, and Edge all support X25519MLKEM768. Mobile apps using platform TLS stacks inherit this support. The security posture improves as Cloudflare’s PQC deployment expands — no action required on our part.

The storage architecture that made it work

The key insight from porting Tuwunel was that different data needs different consistency guarantees. We use each Cloudflare primitive for what it does best.

D1 for the data model

D1 stores everything that needs to survive restarts and support queries: users, rooms, events, device keys. Over 25 tables covering the full Matrix data model.

CREATE TABLE events (
	event_id TEXT PRIMARY KEY,
	room_id TEXT NOT NULL,
	sender TEXT NOT NULL,
	event_type TEXT NOT NULL,
	state_key TEXT,
	content TEXT NOT NULL,
	origin_server_ts INTEGER NOT NULL,
	depth INTEGER NOT NULL
);

D1’s SQLite foundation meant we could port Tuwunel’s queries with minimal changes. Joins, indexes, and aggregations work as expected.

We learned one hard lesson: D1’s eventual consistency breaks foreign key constraints. A write to rooms might not be visible when a subsequent write to events checks the foreign key. We removed all foreign keys and enforce referential integrity in application code.
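The pattern is simple to sketch. The in-memory stand-ins below are illustrative (the real checks run against D1), but they show the shape of an application-level foreign key check:

```javascript
// Illustrative in-memory stand-ins for the D1 tables: the point is
// that the existence check a FOREIGN KEY would do is made explicit
// in application code before the dependent write.
const rooms = new Map();  // room_id -> room row
const events = new Map(); // event_id -> event row

function createRoom(roomId) {
  rooms.set(roomId, { room_id: roomId });
}

function insertEvent(event) {
  if (!rooms.has(event.room_id)) {
    throw new Error(`referential integrity: unknown room ${event.room_id}`);
  }
  events.set(event.event_id, event);
}

createRoom("!blog:example.com");
insertEvent({ event_id: "$1", room_id: "!blog:example.com" }); // ok
```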

KV for ephemeral state

OAuth authorization codes live for 10 minutes, while refresh tokens last for a session.

// Store OAuth code with 10-minute TTL
kv.put(&format!("oauth_code:{}", code), &token_data)?
	.expiration_ttl(600)
	.execute()
	.await?;

KV’s global distribution means OAuth flows work fast regardless of where users are located.

R2 for media

Matrix media maps directly to R2, so you can upload an image, get back a content-addressed URL – and egress is free.

Durable Objects for atomicity

Some operations can’t tolerate eventual consistency. When a client claims a one-time encryption key, that key must be atomically removed. If two clients claim the same key, encrypted session establishment fails.

Durable Objects provide single-threaded, strongly consistent storage:

#[durable_object]
pub struct UserKeysObject {
	state: State,
	env: Env,
}

impl UserKeysObject {
	async fn claim_otk(&self, algorithm: &str) -> Result<Option<Key>> {
    	// Atomic within single DO - no race conditions possible
    	let mut keys: Vec<Key> = self.state.storage()
        	.get("one_time_keys")
        	.await
        	.ok()
        	.flatten()
        	.unwrap_or_default();

    	if let Some(idx) = keys.iter().position(|k| k.algorithm == algorithm) {
        	let key = keys.remove(idx);
        	self.state.storage().put("one_time_keys", &keys).await?;
        	return Ok(Some(key));
    	}
    	Ok(None)
	}
}

We use UserKeysObject for E2EE key management, RoomObject for real-time room events like typing indicators and read receipts, and UserSyncObject for to-device message queues. The rest flows through D1.

Complete end-to-end encryption, complete OAuth

Our implementation supports the full Matrix E2EE stack: device keys, cross-signing keys, one-time keys, fallback keys, key backup, and dehydrated devices.

Modern Matrix clients use OAuth 2.0/OIDC instead of legacy password flows. We implemented a complete OAuth provider, with dynamic client registration, PKCE authorization, RS256-signed JWT tokens, token refresh with rotation, and standard OIDC discovery endpoints.

curl https://matrix.example.com/.well-known/openid-configuration
{
  "issuer": "https://matrix.example.com",
  "authorization_endpoint": "https://matrix.example.com/oauth/authorize",
  "token_endpoint": "https://matrix.example.com/oauth/token",
  "jwks_uri": "https://matrix.example.com/.well-known/jwks.json"
}

Point Element or any Matrix client at the domain, and it discovers everything automatically.

Sliding Sync for mobile

Traditional Matrix sync transfers megabytes of data on initial connection, draining mobile battery and data plans.

Sliding Sync lets clients request exactly what they need. Instead of downloading everything, clients get the 20 most recent rooms with minimal state. As users scroll, they request more ranges. The server tracks position and sends only deltas.

Combined with edge execution, mobile clients can connect and render their room list in under 500ms, even on slow networks.
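The windowing idea can be sketched in a few lines. This is an illustration of the concept, not the actual Sliding Sync wire protocol:

```javascript
// Illustration of the sliding-window idea, not the actual Sliding
// Sync wire protocol: the client asks for a range of rooms ordered
// by recency, and the server returns only that slice.
function slidingWindow(rooms, start, end) {
  return rooms
    .slice() // don't mutate the caller's list
    .sort((a, b) => b.lastEventTs - a.lastEventTs)
    .slice(start, end + 1)
    .map((r) => r.roomId);
}

const roomList = [
  { roomId: "!ops", lastEventTs: 300 },
  { roomId: "!general", lastEventTs: 500 },
  { roomId: "!random", lastEventTs: 100 },
];
console.log(slidingWindow(roomList, 0, 1)); // [ '!general', '!ops' ]
```

As the user scrolls, the client simply requests the next range, and the server sends only what changed since the last response.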

The comparison

For a homeserver serving a small team:

                        Traditional (VPS)    Workers
Monthly cost (idle)     $20-50               <$1
Monthly cost (active)   $20-50               $3-10
Global latency          100-300ms            20-50ms
Time to deploy          Hours                Seconds
Maintenance             Weekly               None
DDoS protection         Additional cost      Included
Post-quantum TLS        Complex setup        Automatic

*Based on public rates and metrics published by DigitalOcean, AWS Lightsail, and Linode as of January 15, 2026.

The economics improve further at scale. Traditional deployments require capacity planning and over-provisioning. Workers scale automatically.

The future of decentralized protocols

We started this as an experiment: could Matrix run on Workers? It can — and the approach can work for other stateful protocols, too.

By mapping traditional stateful components to Cloudflare’s primitives — Postgres to D1, Redis to KV, mutexes to Durable Objects — we can see that complex applications don't need complex infrastructure. We stripped away the operating system, the database management, and the network configuration, leaving only the application logic and the data itself.

Workers offers the sovereignty of owning your data, without the burden of owning the infrastructure.

I have been experimenting with the implementation and am excited for any contributions from others interested in this kind of service. 

Ready to build powerful, real-time applications on Workers? Get started with Cloudflare Workers and explore Durable Objects for your own stateful edge applications. Join our Discord community to connect with other developers building at the edge.


Astro is joining Cloudflare

January 16, 2026, 11:00

The Astro Technology Company, creators of the Astro web framework, is joining Cloudflare.

Astro is the web framework for building fast, content-driven websites. Over the past few years, we’ve seen an incredibly diverse range of developers and companies use Astro to build for the web. This ranges from established brands like Porsche and IKEA, to fast-growing AI companies like Opencode and OpenAI. Platforms that are built on Cloudflare, like Webflow Cloud and Wix Vibe, have chosen Astro to power the websites their customers build and deploy to their own platforms. At Cloudflare, we use Astro, too — for our developer docs, website, landing pages, blog, and more. Astro is used almost everywhere there is content on the Internet.

By joining forces with the Astro team, we are doubling down on making Astro the best framework for content-driven websites for many years to come. The best version of Astro — Astro 6 — is just around the corner, bringing a redesigned development server powered by Vite. The first public beta release of Astro 6 is now available, with GA coming in the weeks ahead.

We are excited to share this news and even more thrilled for what it means for developers building with Astro. If you haven’t yet tried Astro — give it a spin and run npm create astro@latest.

What this means for Astro

Astro will remain open source, MIT-licensed, and open to contributions, with a public roadmap and open governance. All full-time employees of The Astro Technology Company are now employees of Cloudflare, and will continue to work on Astro. We’re committed to Astro’s long-term success and eager to keep building.

Astro wouldn’t be what it is today without an incredibly strong community of open-source contributors. Cloudflare is also committed to continuing to support open-source contributions, via the Astro Ecosystem Fund, alongside industry partners including Webflow, Netlify, Wix, Sentry, Stainless and many more.

From day one, Astro has been a bet on the web and portability: Astro is built to run anywhere, across clouds and platforms. Nothing changes about that. You can deploy Astro to any platform or cloud, and we’re committed to supporting Astro developers everywhere.

There are many web frameworks out there — so why are developers choosing Astro?

Astro has been growing rapidly.

Why? Many web frameworks have come and gone trying to be everything to everyone, aiming to serve the needs of both content-driven websites and web applications.

The key to Astro’s success: Instead of trying to serve every use case, Astro has stayed focused on five design principles. Astro is…

  • Content-driven: Astro was designed to showcase your content.

  • Server-first: Websites run faster when they render HTML on the server.

  • Fast by default: It should be impossible to build a slow website in Astro.

  • Easy to use: You don’t need to be an expert to build something with Astro.

  • Developer-focused: You should have the resources you need to be successful.

Astro’s Islands Architecture is a core part of what makes all of this possible. The majority of each page can be fast, static HTML — fast and simple to build by default, oriented around rendering content. And when you need it, you can render a specific part of a page as a client island, using any client UI framework. You can even mix and match multiple frameworks on the same page, whether that’s React.js, Vue, Svelte, Solid, or anything else:
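A hypothetical page sketching the idea (the `Counter` component and its path are our illustrative assumptions, not from the Astro docs; the `client:load` directive is Astro's standard way to hydrate an island in the browser):

```astro
---
// Static page with one interactive island.
// Counter.jsx is an assumed React component in this project.
import Counter from "../components/Counter.jsx";
---
<article>
  <h1>Mostly static HTML, rendered on the server</h1>
  <p>This content ships zero JavaScript to the browser.</p>
  <!-- Only this island hydrates client-side -->
  <Counter client:load />
</article>
```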

Bringing back the joy in building websites

The more Astro and Cloudflare started talking, the clearer it became how much we have in common. Cloudflare’s mission is to help build a better Internet — and part of that is to help build a faster Internet. Almost all of us grew up building websites, and we want a world where people have fun building things on the Internet, where anyone can publish to a site that is truly their own.

When Astro first launched in 2021, it had become painful to build great websites — it felt like a fight with build tools and frameworks. It sounds strange to say it, with the coding agents and powerful LLMs of 2026, but in 2021 it was very hard to build an excellent and fast website without being a domain expert in JavaScript build tooling. So much has gotten better, both because of Astro and in the broader frontend ecosystem, that we take this almost for granted today.

The Astro project has spent the past five years working to simplify web development. So as LLMs, then vibe coding, and now true coding agents have come along and made it possible for truly anyone to build — Astro provided a foundation that was simple and fast by default. We’ve all seen how much better and faster agents get when building off the right foundation, in a well-structured codebase. More and more, we’ve seen both builders and platforms choose Astro as that foundation.

We’ve seen this most clearly through the platforms that both Cloudflare and Astro serve, that extend Cloudflare to their own customers in creative ways using Cloudflare for Platforms, and have chosen Astro as the framework that their customers build on. 

When you deploy to Webflow Cloud, your Astro site just works and is deployed across Cloudflare’s network. When you start a new project with Wix Vibe, behind the scenes you’re creating an Astro site, running on Cloudflare. And when you generate a developer docs site using Stainless, that generates an Astro project, running on Cloudflare, powered by Starlight — a framework built on Astro.

Each of these platforms is built for a different audience. But what they have in common — beyond their use of Cloudflare and Astro — is they make it fun to create and publish content to the Internet. In a world where everyone can be both a builder and content creator, we think there are still so many more platforms to build and people to reach.

Astro 6 — new local dev server, powered by Vite

Astro 6 is coming, and the first open beta release is now available. To be one of the first to try it out, run:

npm create astro@latest -- --ref next

Or to upgrade your existing Astro app, run:

npx @astrojs/upgrade beta

Astro 6 brings a brand new development server, built on the Vite Environments API, that runs your code locally using the same runtime that you deploy to. This means that when you run astro dev with the Cloudflare Vite plugin, your code runs in workerd, the open-source Cloudflare Workers runtime, and can use Durable Objects, D1, KV, Agents and more. This isn’t just a Cloudflare feature: Any JavaScript runtime with a plugin that uses the Vite Environments API can benefit from this new support, and ensure local dev runs in the same environment, with the same runtime APIs as production.
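As a rough sketch of the setup (based on the documented `@astrojs/cloudflare` adapter; consult the Astro docs for your exact configuration):

```javascript
// astro.config.mjs — a minimal sketch, assuming the @astrojs/cloudflare adapter
import { defineConfig } from "astro/config";
import cloudflare from "@astrojs/cloudflare";

export default defineConfig({
  // With this adapter, `astro dev` can run your code in workerd,
  // giving local access to bindings like KV, D1, and Durable Objects.
  adapter: cloudflare(),
});
```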

Live Content Collections in Astro are also stable in Astro 6 and out of beta. These content collections let you update data in real time, without requiring a rebuild of your site. This makes it easy to bring in content that changes often, such as the current inventory in a storefront, while still benefitting from the built-in validation and caching that come with Astro’s existing support for content collections.

There’s more to Astro 6, including Astro’s most upvoted feature request — first-class support for Content Security Policy (CSP) — as well as simpler APIs, an upgrade to Zod 4, and more.

Doubling down on Astro

We're thrilled to welcome the Astro team to Cloudflare. We’re excited to keep building, keep shipping, and keep making Astro the best way to build content-driven sites. We’re already thinking about what comes next beyond V6, and we’d love to hear from you.

To keep up with the latest, follow the Astro blog and join the Astro Discord. Tell us what you’re building!

How Workers VPC Services connects to your regional private networks from anywhere in the world

5 de Novembro de 2025, 11:00

In April, we shared our vision for a global virtual private cloud on Cloudflare, a way to unlock your applications from regionally constrained clouds and on-premise networks, enabling you to build truly cross-cloud applications.

Today, we’re announcing the first milestone of our Workers VPC initiative: VPC Services. VPC Services allow you to connect to your APIs, containers, virtual machines, serverless functions, databases and other services in regional private networks via Cloudflare Tunnels from your Workers running anywhere in the world. 

Once you set up a Tunnel in your desired network, you can register each service that you want to expose to Workers by configuring its host or IP address. Then, you can access the VPC Service as you would any other Workers service binding — Cloudflare’s network will automatically route to the VPC Service over Cloudflare’s network, regardless of where your Worker is executing:

export default {
  async fetch(request, env, ctx) {
    // Perform application logic in Workers here

    // Call an external API running in ECS on AWS using the binding
    const response = await env.AWS_VPC_ECS_API.fetch("http://internal-host.com");

    // Additional application logic in Workers
    return new Response();
  },
};

Workers VPC is now available to everyone using Workers, at no additional cost during the beta, as is Cloudflare Tunnels. Try it out now. And read on to learn more about how it works under the hood.

Connecting the networks you trust, securely

Your applications span multiple networks, whether they are on-premise or in external clouds. But it’s been difficult to connect from Workers to your APIs and databases locked behind private networks. 

We have previously described how traditional virtual private clouds and networks entrench you into traditional clouds. While they provide you with workload isolation and security, traditional virtual private clouds make it difficult to build across clouds, access your own applications, and choose the right technology for your stack.

A significant part of the cloud lock-in is the inherent complexity of building secure, distributed workloads. VPC peering requires you to configure routing tables, security groups and network access-control lists, since it relies on networking across clouds to ensure connectivity. In many organizations, this means weeks of discussions and many teams involved to get approvals. This lock-in is also reflected in the solutions invented to wrangle this complexity: Each cloud provider has their own bespoke version of a “Private Link” to facilitate cross-network connectivity, further restricting you to that cloud and the vendors that have integrated with it.

With Workers VPC, we’re simplifying that dramatically. You set up your Cloudflare Tunnel once, with the necessary permissions to access your private network. Then, you can configure Workers VPC Services, with the tunnel and hostname (or IP address and port) of the service you want to expose to Workers. Any request made to that VPC Service will use this configuration to route to the given service within the network.

{
  "type": "http",
  "name": "vpc-service-name",
  "http_port": 80,
  "https_port": 443,
  "host": {
    "hostname": "internally-resolvable-hostname.com",
    "resolver_network": {
      "tunnel_id": "0191dce4-9ab4-7fce-b660-8e5dec5172da"
    }
  }
}

This ensures that, once represented as a Workers VPC Service, a service in your private network is secured in the same way other Cloudflare bindings are, using the Workers binding model. Let’s take a look at a simple VPC Service binding example:

{
  "name": "WORKER-NAME",
  "main": "./src/index.js",
  "vpc_services": [
    {
      "binding": "AWS_VPC2_ECS_API",
      "service_id": "5634563546"
    }
  ]
}

Like other Workers bindings, when you deploy a Worker project that tries to connect to a VPC Service, the access permissions are verified at deploy time to ensure that the Worker has access to the service in question. And once deployed, the Worker can use the VPC Service binding to make requests to that VPC Service — and only that service within the network. 

That’s significant: Instead of exposing the entire network to the Worker, only the specific VPC Service can be accessed by the Worker. This access is verified at deploy time to provide a more explicit and transparent service access control than traditional networks and access-control lists do.

This is a key factor in the design of Workers bindings: de facto security with simpler management and making Workers immune to Server-Side Request Forgery (SSRF) attacks. We’ve gone deep on the binding security model in the past, and it becomes that much more critical when accessing your private networks. 

Notably, the binding model is also important when considering what Workers are: scripts running on Cloudflare’s global network. They are not, in contrast to traditional clouds, individual machines with IP addresses, and do not exist within networks. Bindings provide secure access to other resources within your Cloudflare account – and the same applies to Workers VPC Services.

A peek under the hood

So how do VPC Services and their bindings route network requests from Workers anywhere on Cloudflare’s global network to regional networks using tunnels? Let’s look at the lifecycle of a sample HTTP request made via a VPC Service’s dedicated fetch() method, represented here:

It all starts in the Worker code, where the .fetch() function of the desired VPC Service is called with a standard JavaScript Request (Step 1). The Workers runtime uses a Cap’n Proto remote procedure call to send the original HTTP request alongside additional context, as it does for many other Workers bindings.

The Binding Worker of the VPC Service System receives the HTTP request along with the binding context, in this case, the Service ID of the VPC Service being invoked. The Binding Worker proxies this information to the Iris Service within an HTTP CONNECT connection, a standard pattern across Cloudflare’s bindings that keeps the connection logic for Cloudflare’s edge services in Worker code rather than in the Workers runtime itself (Step 2).

The Iris Service is the main service for Workers VPC. Its responsibility is to accept requests for a VPC Service and route them to the network in which your VPC Service is located. It does this by integrating with Apollo, an internal service of Cloudflare One. Apollo provides a unified interface that abstracts away the complexity of securely connecting to networks and tunnels across various layers of networking.

To integrate with Apollo, Iris must complete two tasks. First, Iris parses the VPC Service ID from the metadata and fetches the associated tunnel’s information, including the tunnel ID and type, from our configuration store (Step 3). This is the information Iris needs to send the original request to the right tunnel.

Second, Iris will create the UDP datagrams containing DNS questions for the A and AAAA records of the VPC Service’s hostname. These datagrams will be sent first, via Apollo. Once DNS resolution is completed, the original request is sent along, with the resolved IP address and port (Step 4). That means that steps 4 through 7 happen in sequence twice for the first request: once for DNS resolution and a second time for the original HTTP Request. Subsequent requests benefit from Iris’ caching of DNS resolution information, minimizing request latency.

In Step 5, Apollo receives the metadata of the Cloudflare Tunnel that needs to be accessed, along with the DNS resolution UDP datagrams or the HTTP Request TCP packets. Using the tunnel ID, it determines which datacenter is connected to the Cloudflare Tunnel. This datacenter is in a region close to the Cloudflare Tunnel, and as such, Apollo will route the DNS resolution messages and the Original Request to the Tunnel Connector Service running in that datacenter (Step 5).

The Tunnel Connector Service is responsible for providing access to the Cloudflare Tunnel to the rest of Cloudflare’s network. It relays the DNS resolution questions, and subsequently the original request, to the tunnel over the QUIC protocol (Step 6).

Finally, the Cloudflare Tunnel will send the DNS resolution questions to the DNS resolver of the network it belongs to. It will then send the original HTTP Request from its own IP address to the destination IP and port (Step 7). The results of the request are then relayed all the way back to the original Worker, from the datacenter closest to the tunnel all the way to the original Cloudflare datacenter executing the Worker request.

What VPC Service allows you to build

This unlocks a whole new tranche of applications you can build on Cloudflare. For years, Workers have excelled at the edge, but they've largely been kept "outside" your core infrastructure. They could only call public endpoints, limiting their ability to interact with the most critical parts of your stack—like a private accounts API or an internal inventory database. Now, with VPC Services, Workers can securely access those private APIs, databases, and services, fundamentally changing what's possible.

This immediately enables true cross-cloud applications that span Cloudflare Workers and any other cloud like AWS, GCP or Azure. We’ve seen many customers adopt this pattern over the course of our private beta, establishing private connectivity between their external clouds and Cloudflare Workers. We’ve even done so ourselves, connecting our Workers to Kubernetes services in our core datacenters to power the control plane APIs for many of our services. Now, you can build the same powerful, distributed architectures, using Workers for global scale while keeping stateful backends in the network you already trust.

It also means you can connect to your on-premise networks from Workers, allowing you to modernize legacy applications with the performance and infinite scale of Workers. More interesting still are some emerging use cases for developer workflows. We’ve seen developers run cloudflared on their laptops to connect a deployed Worker back to their local machine for real-time debugging. The full flexibility of Cloudflare Tunnels is now a programmable primitive accessible directly from your Worker, opening up a world of possibilities.

The path ahead of us

VPC Services is the first milestone within the larger Workers VPC initiative, but we’re just getting started. Our goal is to make connecting to any service and any network, anywhere in the world, a seamless part of the Workers experience. Here’s what we’re working on next:

Deeper network integration. Starting with Cloudflare Tunnels was a deliberate choice. It's a highly available, flexible, and familiar solution, making it the perfect foundation to build upon. To provide more options for enterprise networking, we're going to be adding support for standard IPsec tunnels, Cloudflare Network Interconnect (CNI), and AWS Transit Gateway, giving you and your teams more choices and potential optimizations. Crucially, these connections will also become truly bidirectional, allowing your private services to initiate connections back to Cloudflare resources such as pushing events to Queues or fetching from R2.

Expanded protocol and service support. The next step beyond HTTP is enabling access to TCP services. This will first be achieved by integrating with Hyperdrive. We're evolving the previous Hyperdrive support for private databases to be simplified with VPC Services configuration, avoiding the need to add Cloudflare Access and manage security tokens. This creates a more native experience, complete with Hyperdrive's powerful connection pooling. Following this, we will add broader support for raw TCP connections, unlocking direct connectivity to services like Redis caches and message queues from Workers ‘connect()’.

Ecosystem compatibility. We want to make connecting to a private service feel as natural as connecting to a public one. To do so, we will be providing a unique autogenerated hostname for each Workers VPC Service, similar to Hyperdrive’s connection strings. This will make it easier to use Workers VPC with existing libraries and object-relational mapping libraries that may require a hostname (e.g., in a global ‘fetch()’ call or a MongoDB connection string). The Workers VPC Service hostname will automatically resolve and route to the correct VPC Service, just as the binding’s ‘fetch()’ method does.

Get started with Workers VPC

We’re excited to release Workers VPC Services into open beta today. We’ve spent months building out and testing our first milestone for Workers-to-private-network access. And we’ve refined it further based on feedback from both internal teams and customers during the closed beta.

Now, we’re looking forward to enabling everyone to build cross-cloud apps on Workers with Workers VPC, available for free during the open beta. With Workers VPC, you can bring your apps on private networks to region Earth, closer to your users and available to Workers across the globe.

Get started with Workers VPC Services for free now.

Keeping the Internet fast and secure: introducing Merkle Tree Certificates

The world is in a race to build its first quantum computer capable of solving practical problems not feasible on even the largest conventional supercomputers. While the quantum computing paradigm promises many benefits, it also threatens the security of the Internet by breaking much of the cryptography we have come to rely on.

To mitigate this threat, Cloudflare is helping to migrate the Internet to Post-Quantum (PQ) cryptography. Today, about 50% of traffic to Cloudflare's edge network is protected against the most urgent threat: an attacker who can intercept and store encrypted traffic today and then decrypt it in the future with the help of a quantum computer. This is referred to as the harvest now, decrypt later threat.

However, this is just one of the threats we need to address. A quantum computer can also be used to crack a server's TLS certificate, allowing an attacker to impersonate the server to unsuspecting clients. The good news is that we already have PQ algorithms we can use for quantum-safe authentication. The bad news is that adoption of these algorithms in TLS will require significant changes to one of the most complex and security-critical systems on the Internet: the Web Public-Key Infrastructure (WebPKI).

The central problem is the sheer size of these new algorithms: signatures for ML-DSA-44, one of the most performant PQ algorithms standardized by NIST, are 2,420 bytes long, compared to just 64 bytes for ECDSA-P256, the most popular non-PQ signature in use today; and its public keys are 1,312 bytes long, compared to just 64 bytes for ECDSA. That's a roughly 20-fold increase in size. Worse yet, the average TLS handshake includes a number of public keys and signatures, adding up to tens of kilobytes of overhead per handshake. This is enough to have a noticeable impact on the performance of TLS.

That makes drop-in PQ certificates a tough sell to enable today: they don’t bring any security benefit before Q-day — the day a cryptographically relevant quantum computer arrives — but they do degrade performance. We could sit and wait until Q-day is a year away, but that’s playing with fire. Migrations always take longer than expected, and by waiting we risk the security and privacy of the Internet, which is dear to us.

It's clear that we must find a way to make post-quantum certificates cheap enough to deploy today by default for everyone — not just those that can afford it. In this post, we'll introduce you to the plan we’ve brought to the IETF, together with industry partners, to redesign the WebPKI in order to allow a smooth transition to PQ authentication with no performance impact (and perhaps a performance improvement!). We'll provide an overview of one concrete proposal, called Merkle Tree Certificates (MTCs), whose goal is to whittle down the number of public keys and signatures in the TLS handshake to the bare minimum required.

But talk is cheap. We know from experience that, as with any change to the Internet, it's crucial to test early and often. Today we're announcing our intent to deploy MTCs on an experimental basis in collaboration with Chrome Security. In this post, we'll describe the scope of this experiment, what we hope to learn from it, and how we'll make sure it's done safely.

The WebPKI today — an old system with many patches

Why does the TLS handshake have so many public keys and signatures?

Let's start with Cryptography 101. When your browser connects to a website, it asks the server to authenticate itself to make sure it's talking to the real server and not an impersonator. This is usually achieved with a cryptographic primitive known as a digital signature scheme (e.g., ECDSA or ML-DSA). In TLS, the server signs the messages exchanged between the client and server using its secret key, and the client verifies the signature using the server's public key. In this way, the server confirms to the client that they've had the same conversation, since only the server could have produced a valid signature.

If the client already knows the server's public key, then only 1 signature is required to authenticate the server. In practice, however, this is not really an option. The web today is made up of around a billion TLS servers, so it would be unrealistic to provision every client with the public key of every server. What's more, the set of public keys will change over time as new servers come online and existing ones rotate their keys, so we would need some way of pushing these changes to clients.

This scaling problem is at the heart of the design of all PKIs.

Trust is transitive

Instead of expecting the client to know the server's public key in advance, the server might just send its public key during the TLS handshake. But how does the client know that the public key actually belongs to the server? This is the job of a certificate.

A certificate binds a public key to the identity of the server — usually its DNS name, e.g., cloudflareresearch.com. The certificate is signed by a Certification Authority (CA) whose public key is known to the client. In addition to verifying the server's handshake signature, the client verifies the signature of this certificate. This establishes a chain of trust: by accepting the certificate, the client is trusting that the CA verified that the public key actually belongs to the server with that identity.

Clients are typically configured to trust many CAs and must be provisioned with a public key for each. Things are much easier here, however, since there are only hundreds of CAs instead of billions of servers. In addition, new certificates can be created without having to update clients.

These efficiencies come at a relatively low cost: for those counting at home, that's +1 signature and +1 public key, for a total of 2 signatures and 1 public key per TLS handshake.

That's not the end of the story, however. As the WebPKI has evolved, these chains of trust have grown longer. These days it's common for a chain to consist of two or more certificates rather than just one. This is because CAs sometimes need to rotate their keys, just as servers do. But before they can start using the new key, they must distribute the corresponding public key to clients. This takes time, since it requires billions of clients to update their trust stores. To bridge the gap, the CA will sometimes use the old key to issue a certificate for the new one and append this certificate to the end of the chain.

That's +1 signature and +1 public key, which brings us to 3 signatures and 2 public keys. And we still have a little ways to go.

Trust but verify

The main job of a CA is to verify that a server has control over the domain for which it’s requesting a certificate. This process has evolved over the years from a high-touch, CA-specific process to a standardized, mostly automated process used for issuing most certificates on the web. (Not all CAs fully support automation, however.) This evolution is marked by a number of security incidents in which a certificate was mis-issued to a party other than the server, allowing that party to impersonate the server to any client that trusts the CA.

Automation helps, but attacks are still possible, and mistakes are almost inevitable. Earlier this year, several certificates for Cloudflare's encrypted 1.1.1.1 resolver were issued without our involvement or authorization. This apparently occurred by accident, but it nonetheless put users of 1.1.1.1 at risk. (The mis-issued certificates have since been revoked.)

Ensuring that mis-issuance is detectable is the job of the Certificate Transparency (CT) ecosystem. The basic idea is that each certificate issued by a CA gets added to a public log. Servers can audit these logs for certificates issued in their name. If a certificate they didn't request is ever issued, the server operator can prove the issuance happened, and the PKI ecosystem can take action to prevent the certificate from being trusted by clients.

Major browsers, including Chrome (and its derivatives), Safari, and Firefox, require certificates to be logged before they can be trusted: they will only accept the server's certificate if it appears in at least two logs the browser is configured to trust. This policy is easy to state, but tricky to implement in practice:

  1. Operating a CT log has historically been fairly expensive. Logs ingest billions of certificates over their lifetimes: when an incident happens, or even just under high load, it can take some time for a log to make a new entry available for auditors.

  2. Clients can't really audit logs themselves, since this would expose their browsing history (i.e., the servers they wanted to connect to) to the log operators.

The solution to both problems is to include a signature from the CT log along with the certificate. The signature is produced immediately in response to a request to log a certificate, and attests to the log's intent to include the certificate in the log within 24 hours.

Per browser policy, certificate transparency adds +2 signatures to the TLS handshake, one for each log. This brings us to a total of 5 signatures and 2 public keys in a typical handshake on the public web.

The future WebPKI

The WebPKI is a living, breathing, and highly distributed system. We've had to patch it a number of times over the years to keep it going, but on balance it has served our needs quite well — until now.

Previously, whenever we needed to update something in the WebPKI, we would tack on another signature. This strategy has worked because conventional cryptography is so cheap. But 5 signatures and 2 public keys on average for each TLS handshake is simply too much to cope with for the larger PQ signatures that are coming.
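Plugging the sizes quoted earlier into that tally gives a rough sense of the gap (back-of-the-envelope figures from this post's numbers, not measurements of real handshakes):

```javascript
// Sizes in bytes, from the figures quoted in this post.
const SIG = { mldsa44: 2420, ecdsaP256: 64 };
const PUB = { mldsa44: 1312, ecdsaP256: 64 };

// A typical handshake on the public web today:
// 5 signatures and 2 transmitted public keys.
const handshakeOverhead = (alg) => 5 * SIG[alg] + 2 * PUB[alg];

const classical = handshakeOverhead("ecdsaP256"); // 448 bytes
const postQuantum = handshakeOverhead("mldsa44"); // 14,724 bytes, ~14 KB
```

A drop-in swap to ML-DSA-44 inflates the handshake by more than 30x, which is why reducing the signature count matters more than shrinking any one signature.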

The good news is that by moving what we already have around in clever ways, we can drastically reduce the number of signatures we need.

Crash course on Merkle Tree Certificates

Merkle Tree Certificates (MTCs) is a proposal for the next generation of the WebPKI that we are implementing and plan to deploy on an experimental basis. Its key features are as follows:

  1. All the information a client needs to validate a Merkle Tree Certificate can be disseminated out-of-band. If the client is sufficiently up-to-date, then the TLS handshake needs just 1 signature, 1 public key, and 1 Merkle tree inclusion proof. This is quite small, even if we use post-quantum algorithms.

  2. The MTC specification makes certificate transparency a first class feature of the PKI by having each CA run its own log of exactly the certificates they issue.

Let's poke our head under the hood a little. Below we have an MTC generated by one of our internal tests. This would be transmitted from the server to the client in the TLS handshake:

-----BEGIN CERTIFICATE-----
MIICSzCCAUGgAwIBAgICAhMwDAYKKwYBBAGC2ksvADAcMRowGAYKKwYBBAGC2ksv
AQwKNDQzNjMuNDguMzAeFw0yNTEwMjExNTMzMjZaFw0yNTEwMjgxNTMzMjZaMCEx
HzAdBgNVBAMTFmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wWTATBgcqhkjOPQIBBggq
hkjOPQMBBwNCAARw7eGWh7Qi7/vcqc2cXO8enqsbbdcRdHt2yDyhX5Q3RZnYgONc
JE8oRrW/hGDY/OuCWsROM5DHszZRDJJtv4gno2wwajAOBgNVHQ8BAf8EBAMCB4Aw
EwYDVR0lBAwwCgYIKwYBBQUHAwEwQwYDVR0RBDwwOoIWY2xvdWRmbGFyZXJlc2Vh
cmNoLmNvbYIgc3RhdGljLWN0LmNsb3VkZmxhcmVyZXNlYXJjaC5jb20wDAYKKwYB
BAGC2ksvAAOB9QAAAAAAAAACAAAAAAAAAAJYAOBEvgOlvWq38p45d0wWTPgG5eFV
wJMhxnmDPN1b5leJwHWzTOx1igtToMocBwwakt3HfKIjXYMO5CNDOK9DIKhmRDSV
h+or8A8WUrvqZ2ceiTZPkNQFVYlG8be2aITTVzGuK8N5MYaFnSTtzyWkXP2P9nYU
Vd1nLt/WjCUNUkjI4/75fOalMFKltcc6iaXB9ktble9wuJH8YQ9tFt456aBZSSs0
cXwqFtrHr973AZQQxGLR9QCHveii9N87NXknDvzMQ+dgWt/fBujTfuuzv3slQw80
mibA021dDCi8h1hYFQAA
-----END CERTIFICATE-----

Looks like your average PEM encoded certificate. Let's decode it and look at the parameters:

$ openssl x509 -in merkle-tree-cert.pem -noout -text
Certificate:
    Data:
        Version: 3 (0x2)
        Serial Number: 531 (0x213)
        Signature Algorithm: 1.3.6.1.4.1.44363.47.0
        Issuer: 1.3.6.1.4.1.44363.47.1=44363.48.3
        Validity
            Not Before: Oct 21 15:33:26 2025 GMT
            Not After : Oct 28 15:33:26 2025 GMT
        Subject: CN=cloudflareresearch.com
        Subject Public Key Info:
            Public Key Algorithm: id-ecPublicKey
                Public-Key: (256 bit)
                pub:
                    04:70:ed:e1:96:87:b4:22:ef:fb:dc:a9:cd:9c:5c:
                    ef:1e:9e:ab:1b:6d:d7:11:74:7b:76:c8:3c:a1:5f:
                    94:37:45:99:d8:80:e3:5c:24:4f:28:46:b5:bf:84:
                    60:d8:fc:eb:82:5a:c4:4e:33:90:c7:b3:36:51:0c:
                    92:6d:bf:88:27
                ASN1 OID: prime256v1
                NIST CURVE: P-256
        X509v3 extensions:
            X509v3 Key Usage: critical
                Digital Signature
            X509v3 Extended Key Usage:
                TLS Web Server Authentication
            X509v3 Subject Alternative Name:
                DNS:cloudflareresearch.com, DNS:static-ct.cloudflareresearch.com
    Signature Algorithm: 1.3.6.1.4.1.44363.47.0
    Signature Value:
        00:00:00:00:00:00:02:00:00:00:00:00:00:00:02:58:00:e0:
        44:be:03:a5:bd:6a:b7:f2:9e:39:77:4c:16:4c:f8:06:e5:e1:
        55:c0:93:21:c6:79:83:3c:dd:5b:e6:57:89:c0:75:b3:4c:ec:
        75:8a:0b:53:a0:ca:1c:07:0c:1a:92:dd:c7:7c:a2:23:5d:83:
        0e:e4:23:43:38:af:43:20:a8:66:44:34:95:87:ea:2b:f0:0f:
        16:52:bb:ea:67:67:1e:89:36:4f:90:d4:05:55:89:46:f1:b7:
        b6:68:84:d3:57:31:ae:2b:c3:79:31:86:85:9d:24:ed:cf:25:
        a4:5c:fd:8f:f6:76:14:55:dd:67:2e:df:d6:8c:25:0d:52:48:
        c8:e3:fe:f9:7c:e6:a5:30:52:a5:b5:c7:3a:89:a5:c1:f6:4b:
        5b:95:ef:70:b8:91:fc:61:0f:6d:16:de:39:e9:a0:59:49:2b:
        34:71:7c:2a:16:da:c7:af:de:f7:01:94:10:c4:62:d1:f5:00:
        87:bd:e8:a2:f4:df:3b:35:79:27:0e:fc:cc:43:e7:60:5a:df:
        df:06:e8:d3:7e:eb:b3:bf:7b:25:43:0f:34:9a:26:c0:d3:6d:
        5d:0c:28:bc:87:58:58:15:00:00

While some of the parameters probably look familiar, others will look unusual. On the familiar side, the subject and public key are exactly what we might expect: the DNS name is cloudflareresearch.com and the public key is for a familiar signature algorithm, ECDSA-P256. This algorithm is not PQ, of course — in the future we would put ML-DSA-44 there instead.

On the unusual side, OpenSSL appears to not recognize the signature algorithm of the issuer and just prints the raw OID and bytes of the signature. There's a good reason for this: the MTC does not have a signature in it at all! So what exactly are we looking at?

The trick to leave out signatures is that a Merkle Tree Certification Authority (MTCA) produces its signatureless certificates in batches rather than individually. In place of a signature, the certificate has an inclusion proof of the certificate in a batch of certificates signed by the MTCA.

To understand how inclusion proofs work, let's think about a slightly simplified version of the MTC specification. To issue a batch, the MTCA arranges the unsigned certificates into a data structure called a Merkle tree that looks like this:

Each leaf of the tree corresponds to a certificate, and each inner node is equal to the hash of its children. To sign the batch, the MTCA uses its secret key to sign the head of the tree. The structure of the tree guarantees that each certificate in the batch was signed by the MTCA: if we tried to tweak the bits of any one of the certificates, the treehead would end up having a different value, which would cause the signature to fail.

An inclusion proof for a certificate consists of the hash of each sibling node along the path from the certificate to the treehead:

Given a validated treehead, this sequence of hashes is sufficient to prove inclusion of the certificate in the tree. This means that, in order to validate an MTC, the client also needs to obtain the signed treehead from the MTCA.

This is the key to MTC's efficiency:

  1. Signed treeheads can be disseminated to clients out-of-band and validated offline. Each validated treehead can then be used to validate any certificate in the corresponding batch, eliminating the need to obtain a signature for each server certificate.

  2. During the TLS handshake, the client tells the server which treeheads it has. If the server has a signatureless certificate covered by one of those treeheads, then it can use that certificate to authenticate itself. That's 1 signature, 1 public key, and 1 inclusion proof per handshake to authenticate the server.

Now, that's the simplified version. MTC proper has some more bells and whistles. To start, it doesn't create a separate Merkle tree for each batch, but grows a single large tree, which also improves transparency. As this tree grows, (sub)tree heads are periodically selected to be shipped to browsers; we call these landmarks. In the common case, browsers will be able to fetch the most recent landmarks, and servers can wait for batch issuance, but we need a fallback: MTC also supports certificates that can be issued immediately and don't require landmarks to be validated, but these are not as small. A server would provision both types of Merkle tree certificates, so that the common case is fast and the exceptional case is slow, but at least it'll work.

Experimental deployment

Ever since early designs for MTCs emerged, we’ve been eager to experiment with the idea. In line with the IETF principle of “running code”, it often takes implementing a protocol to work out kinks in the design. At the same time, we cannot risk the security of users. In this section, we describe our approach to experimenting with aspects of the Merkle Tree Certificates design without changing any trust relationships.

Let’s start with what we hope to learn. We have lots of questions whose answers can help to either validate the approach, or uncover pitfalls that require reshaping the protocol — in fact, an implementation of an early MTC draft by Maximilian Pohl and Mia Celeste did exactly this. We’d like to know:

What breaks? Protocol ossification (the tendency of implementation bugs to make it harder to change a protocol) is an ever-present issue with deploying protocol changes. For TLS in particular, despite having built-in flexibility, time after time we’ve found that if that flexibility is not regularly used, there will be buggy implementations and middleboxes that break when they see things they don’t recognize. TLS 1.3 deployment took years longer than we hoped for this very reason. And more recently, the rollout of PQ key exchange in TLS caused the Client Hello to be split over multiple TCP packets, something that many middleboxes weren't ready for.

What is the performance impact? In fact, we expect MTCs to reduce the size of the handshake, even compared to today's non-PQ certificates. They will also reduce CPU cost: ML-DSA signature verification is about as fast as ECDSA, and there will be far fewer signatures to verify. We therefore expect to see a reduction in latency. We would like to see if there is a measurable performance improvement.

What fraction of clients will stay up to date? Getting the performance benefit of MTCs requires the clients and servers to be roughly in sync with one another. We expect MTCs to have fairly short lifetimes, a week or so. This means that if the client's latest landmark is older than a week, the server would have to fall back to a larger certificate. Knowing how often this fallback happens will help us tune the parameters of the protocol to make fallbacks less likely.

In order to answer these questions, we are implementing MTC support in our TLS stack and in our certificate issuance infrastructure. For their part, Chrome is implementing MTC support in their own TLS stack and will stand up infrastructure to disseminate landmarks to their users.

As we've done in past experiments, we plan to enable MTCs for a subset of our free customers with enough traffic that we will be able to get useful measurements. Chrome will control the experimental rollout: they can ramp up slowly, measuring as they go and rolling back if and when bugs are found.

Which leaves us with one last question: who will run the Merkle Tree CA?

Bootstrapping trust from the existing WebPKI

Standing up a proper CA is no small task: it takes years to be trusted by major browsers. That’s why Cloudflare isn’t going to become a “real” CA for this experiment, and Chrome isn’t going to trust us directly.

Instead, to make progress on a reasonable timeframe, without sacrificing due diligence, we plan to "mock" the role of the MTCA. We will run an MTCA (on Workers based on our StaticCT logs), but for each MTC we issue, we also publish an existing certificate from a trusted CA that agrees with it. We call this the bootstrap certificate. When Chrome’s infrastructure pulls updates from our MTCA log, they will also pull these bootstrap certificates, and check whether they agree. Only if they do will they push the corresponding landmarks to Chrome clients. In other words, Cloudflare is effectively just “re-encoding” an existing certificate (with domain validation performed by a trusted CA) as an MTC, and Chrome is using certificate transparency to keep us honest.

Conclusion

With almost 50% of our traffic already protected by post-quantum encryption, we’re halfway to a fully post-quantum secure Internet. The second part of our journey, post-quantum certificates, is the hardest yet, though. A simple drop-in upgrade has a noticeable performance impact and no security benefit before Q-day, which makes it a hard sell to enable by default today. But here we are playing with fire: migrations always take longer than expected. If we want to keep a ubiquitously private and secure Internet, we need a post-quantum solution that’s performant enough to be enabled by default today.

Merkle Tree Certificates (MTCs) solve this problem by reducing the number of signatures and public keys to the bare minimum while maintaining the WebPKI's essential properties. We plan to roll out MTCs to a fraction of free accounts by early next year. This does not affect any visitors that are not part of the Chrome experiment. For those that are, thanks to the bootstrap certificates, there is no impact on security.

We’re excited to keep the Internet fast and secure, and will report back soon on the results of this experiment: watch this space! MTC is evolving as we speak; if you want to get involved, please join the IETF PLANTS mailing list.


15 years of helping build a better Internet: a look back at Birthday Week 2025

September 29, 2025, 11:00

Cloudflare launched fifteen years ago with a mission to help build a better Internet. Over that time the Internet has changed and so has what it needs from teams like ours.  In this year’s Founder’s Letter, Matthew and Michelle discussed the role we have played in the evolution of the Internet, from helping encryption grow from 10% to 95% of Internet traffic to more recent challenges like how people consume content. 

We spend Birthday Week every year releasing the products and capabilities we believe the Internet needs at this moment and around the corner. Previous Birthday Weeks saw the launch of IPv6 gateway in 2011, Universal SSL in 2014, Cloudflare Workers and unmetered DDoS protection in 2017, Cloudflare Radar in 2020, R2 Object Storage with zero egress fees in 2021, post-quantum upgrades for Cloudflare Tunnel in 2022, and Workers AI and Encrypted Client Hello in 2023. And those are just a sample of the launches.

This year’s themes focused on helping prepare the Internet for a new model of monetization that encourages great content to be published, fostering more opportunities to build community both inside and outside of Cloudflare, and evergreen missions like making more features available to everyone and constantly improving the speed and security of what we offer.

We shipped a lot of new things this year. In case you missed the dozens of blog posts, here is a breakdown of everything we announced during Birthday Week 2025. 

Monday, September 22

  • Help build the future: announcing Cloudflare’s goal to hire 1,111 interns in 2026
    To invest in the next generation of builders, we announced our most ambitious intern program yet with a goal to hire 1,111 interns in 2026.
  • Supporting the future of the open web: Cloudflare is sponsoring Ladybird and Omarchy
    To support a diverse and open Internet, we are now sponsoring Ladybird (an independent browser) and Omarchy (an open-source Linux distribution and developer environment).
  • Come build with us: Cloudflare’s new hubs for startups
    We are opening our office doors in four major cities (San Francisco, Austin, London, and Lisbon) as free hubs for startups to collaborate and connect with the builder community.
  • Free access to Cloudflare developer services for non-profit and civil society organizations
    We extended our Cloudflare for Startups program to non-profits and public-interest organizations, offering free credits for our developer tools.
  • Introducing free access to Cloudflare developer features for students
    We are removing cost as a barrier for the next generation by giving students with .edu emails 12 months of free access to our paid developer platform features.
  • Cap’n Web: a new RPC system for browsers and web servers
    We open-sourced Cap'n Web, a new JavaScript-native RPC protocol that simplifies powerful, schema-free communication for web applications.
  • A lookback at Workers Launchpad and a warm welcome to Cohort #6
    We announced Cohort #6 of the Workers Launchpad, our accelerator program for startups building on Cloudflare.

Tuesday, September 23

  • Building unique, per-customer defenses against advanced bot threats in the AI era
    New anomaly detection system that uses machine learning trained on each zone to build defenses against AI-driven bot attacks.
  • Why Cloudflare, Netlify, and Webflow are collaborating to support Open Source tools
    To support the open web, we joined forces with Webflow to sponsor Astro, and with Netlify to sponsor TanStack.
  • Launching the x402 Foundation with Coinbase, and support for x402 transactions
    We are partnering with Coinbase to create the x402 Foundation, encouraging the adoption of the x402 protocol to allow clients and services to exchange value on the web using a common language.
  • Helping protect journalists and local news from AI crawlers with Project Galileo
    We are extending our free Bot Management and AI Crawl Control services to journalists and news organizations through Project Galileo.
  • Cloudflare Confidence Scorecards - making AI safer for the Internet
    Automated evaluation of AI and SaaS tools, helping organizations to embrace AI without compromising security.

Wednesday, September 24

  • Automatically Secure: how we upgraded 6,000,000 domains by default
    Our Automatic SSL/TLS system has upgraded over 6 million domains to more secure encryption modes by default and will soon automatically enable post-quantum connections.
  • Giving users choice with Cloudflare’s new Content Signals Policy
    The Content Signals Policy is a new standard for robots.txt that lets creators express clear preferences for how AI can use their content.
  • To build a better Internet in the age of AI, we need responsible AI bot principles
    A proposed set of responsible AI bot principles to start a conversation around transparency and respect for content creators' preferences.
  • Securing data in SaaS to SaaS applications
    New security tools to give companies visibility and control over data flowing between SaaS applications.
  • Securing today for the quantum future: WARP client now supports post-quantum cryptography (PQC)
    Cloudflare’s WARP client now supports post-quantum cryptography, providing quantum-resistant encryption for traffic.
  • A simpler path to a safer Internet: an update to our CSAM scanning tool
    We made our CSAM Scanning Tool easier to adopt by removing the need to create and provide unique credentials, helping more site owners protect their platforms.

Thursday, September 25

  • Every Cloudflare feature, available to everyone
    We are making every Cloudflare feature, starting with Single Sign On (SSO), available for anyone to purchase on any plan.
  • Cloudflare's developer platform keeps getting better, faster, and more powerful
    Updates across Workers and beyond for a more powerful developer platform – such as support for larger and more concurrent Container images, support for external models from OpenAI and Anthropic in AI Search (previously AutoRAG), and more.
  • Partnering to make full-stack fast: deploy PlanetScale databases directly from Workers
    You can now connect Cloudflare Workers to PlanetScale databases directly, with connections automatically optimized by Hyperdrive.
  • Announcing the Cloudflare Data Platform
    A complete solution for ingesting, storing, and querying analytical data tables using open standards like Apache Iceberg.
  • R2 SQL: a deep dive into our new distributed query engine
    A technical deep dive on R2 SQL, a serverless query engine for petabyte-scale datasets in R2.
  • Safe in the sandbox: security hardening for Cloudflare Workers
    A deep-dive into how we’ve hardened the Workers runtime with new defense-in-depth security measures, including V8 sandboxes and hardware-assisted memory protection keys.
  • Choice: the path to AI sovereignty
    To champion AI sovereignty, we've added locally-developed open-source models from India, Japan, and Southeast Asia to our Workers AI platform.
  • Announcing Cloudflare Email Service’s private beta
    We announced the Cloudflare Email Service private beta, allowing developers to reliably send and receive transactional emails directly from Cloudflare Workers.
  • A year of improving Node.js compatibility in Cloudflare Workers
    There are hundreds of new Node.js APIs now available that make it easier to run existing Node.js code on our platform.

Friday, September 26

  • Cloudflare just got faster and more secure, powered by Rust
    We have re-engineered our core proxy with a new modular, Rust-based architecture, cutting median response time by 10ms for millions.
  • Introducing Observatory and Smart Shield
    New monitoring tools in the Cloudflare dashboard that provide actionable recommendations and one-click fixes for performance issues.
  • Monitoring AS-SETs and why they matter
    Cloudflare Radar now includes Internet Routing Registry (IRR) data, allowing network operators to monitor AS-SETs to help prevent route leaks.
  • An AI Index for all our customers
    We announced the private beta of AI Index, a new service that creates an AI-optimized search index for your domain that you control and can monetize.
  • Introducing new regional Internet traffic and Certificate Transparency insights on Cloudflare Radar
    Sub-national traffic insights and Certificate Transparency dashboards for TLS monitoring.
  • Eliminating Cold Starts 2: shard and conquer
    We have reduced Workers cold starts by 10x by implementing a new "worker sharding" system that routes requests to already-loaded Workers.
  • Network performance update: Birthday Week 2025
    The TCP Connection Time (Trimean) graph shows that we have the fastest TCP connection time in 40% of measured ISPs – and the fastest across the top networks.
  • How Cloudflare uses performance data to make the world’s fastest global network even faster
    We are using our network's vast performance data to tune congestion control algorithms, improving speeds by an average of 10% for QUIC traffic.
  • Code Mode: the better way to use MCP
    It turns out we've all been using MCP wrong. Most agents today use MCP by exposing the "tools" directly to the LLM. We tried something different: convert the MCP tools into a TypeScript API, and then ask an LLM to write code that calls that API. The results are striking.

Come build with us!

Helping build a better Internet has always been about more than just technology. Like the announcements about interns or working together in our offices, the community of people behind helping build a better Internet matters to its future. This week, we rolled out our most ambitious set of initiatives ever to support the builders, founders, and students who are creating the future.

For founders and startups, we are thrilled to welcome Cohort #6 to the Workers Launchpad, our accelerator program that gives early-stage companies the resources they need to scale. But we’re not stopping there. We’re opening our doors, literally, by launching new physical hubs for startups in our San Francisco, Austin, London, and Lisbon offices. These spaces will provide access to mentorship, resources, and a community of fellow builders.

We’re also investing in the next generation of talent. We announced free access to the Cloudflare developer platform for all students, giving them the tools to learn and experiment without limits. To provide a path from the classroom to the industry, we also announced our goal to hire 1,111 interns in 2026 — our biggest commitment yet to fostering future tech leaders.

And because a better Internet is for everyone, we’re extending our support to non-profits and public-interest organizations, offering them free access to our production-grade developer tools, so they can focus on their missions.

Whether you're a founder with a big idea, a student just getting started, or a team working for a cause you believe in, we want to help you succeed.

Until next year

Thank you to our customers, our community, and the millions of developers who trust us to help them build, secure, and accelerate the Internet. Your curiosity and feedback drive our innovation.

It’s been an incredible 15 years. And as always, we’re just getting started!

(Watch the full conversation about what we launched during Birthday Week 2025 on our show, ThisWeekinNET.com.)


Safe in the sandbox: security hardening for Cloudflare Workers

September 25, 2025, 11:00

As a serverless cloud provider, we run your code on our globally distributed infrastructure. Being able to run customer code on our network means that anyone can take advantage of our global presence and low latency. Workers isn’t just efficient, though; we also make it simple for our users. In short: You write code. We handle the rest.

Part of 'handling the rest' is making Workers as secure as possible. We have previously written about our security architecture. Making Workers secure is an interesting problem because the whole point of Workers is that we are running third-party code on our hardware. This is one of the hardest security problems there is: an attacker crafting an exploit has the full power of a programming language running on the victim's system at their disposal.

This is why we are constantly updating and improving the Workers Runtime to take advantage of the latest improvements in both hardware and software. This post shares some of the latest work we have been doing to keep Workers secure.

Some background first: Workers is built around the V8 JavaScript runtime, originally developed for Chromium-based browsers like Chrome. This gives us a head start, because V8 was forged in an adversarial environment, where it has always been under intense attack and scrutiny. Like Workers, Chromium is built to run adversarial code safely. That's why V8 is constantly being tested against the best fuzzers and sanitizers, and over the years, it has been hardened with new technologies like Oilpan/cppgc and improved static analysis.

We use V8 in a slightly different way, though, so we will be describing in this post how we have been making some changes to V8 to improve security in our use case.

Hardware-assisted security improvements from Memory Protection Keys

Modern CPUs from Intel, AMD, and ARM have support for memory protection keys, sometimes called PKU, Protection Keys for Userspace. This is a great security feature which increases the power of virtual memory and memory protection.

Traditionally, the memory protection features of the CPU in your PC or phone were mainly used to protect the kernel and to protect different processes from each other. Within each process, all threads had access to the same memory. Memory protection keys allow us to prevent specific threads from accessing memory regions they shouldn't have access to.

V8 already uses memory protection keys for the JIT compilers. The JIT compilers for a language like JavaScript generate optimized, specialized versions of your code as it runs. Typically, the compiler is running on its own thread, and needs to be able to write data to the code area in order to install its optimized code. However, the compiler thread doesn't need to be able to run this code. The regular execution thread, on the other hand, needs to be able to run, but not modify, the optimized code. Memory protection keys offer a way to give each thread the permissions it needs, but no more. And the V8 team in the Chromium project certainly aren't standing still. They describe some of their future plans for memory protection keys here.

In Workers, we have some different requirements than Chromium. The security architecture for Workers uses V8 isolates to separate different scripts that are running on our servers. (In addition, we have extra mitigations to harden the system against Spectre attacks). If V8 is working as intended, this should be enough, but we believe in defense in depth: multiple, overlapping layers of security controls.

That's why we have deployed internal modifications to V8 to use memory protection keys to isolate the isolates from each other. There are up to 15 different keys available on a modern x64 CPU and a few are used for other purposes in V8, so we have about 12 to work with. We give each isolate a random key which is used to protect its V8 heap data, the memory area containing the JavaScript objects a script creates as it runs. This means security bugs that might previously have allowed an attacker to read data from a different isolate would now hit a hardware trap in 92% of cases. (Assuming 12 keys, 92% is about 11/12.)

The illustration shows an attacker attempting to read from a different isolate. Most of the time this is detected by the mismatched memory protection key, which kills their script and notifies us, so we can investigate and remediate. The red arrow represents the case where the attacker got lucky by hitting an isolate with the same memory protection key, represented by the isolates having the same colors.

However, we can further improve on a 92% protection rate. In the last part of this blog post we'll explain how we can lift that to 100% for a particular common scenario. But first, let's look at a software hardening feature in V8 that we are taking advantage of.

The V8 sandbox, a software-based security boundary

Over the past few years, V8 has been gaining another defense in depth feature: the V8 sandbox. (Not to be confused with the layer 2 sandbox which Workers have been using since the beginning.) The V8 sandbox has been a multi-year project that has been gaining maturity for a while. The sandbox project stems from the observation that many V8 security vulnerabilities start by corrupting objects in the V8 heap memory. Attackers then leverage this corruption to reach other parts of the process, giving them the opportunity to escalate and gain more access to the victim's browser, or even the entire system.

V8's sandbox project is an ambitious software security mitigation that aims to thwart that escalation: to make it impossible for the attacker to progress from a corruption on the V8 heap to a compromise of the rest of the process. This means, among other things, removing all pointers from the heap. But first, let's explain, in as simple terms as possible, what a memory corruption attack is.

Memory corruption attacks

A memory corruption attack tricks a program into misusing its own memory. Computer memory is just a store of integers, where each integer is stored in a location. The locations each have an address, which is also just a number. Programs interpret the data in these locations in different ways, such as text, pixels, or pointers. Pointers are addresses that identify a different memory location, so they act as a sort of arrow that points to some other piece of data.

Here's a concrete example, which uses a buffer overflow. This is a form of attack that was historically common and relatively simple to understand: Imagine a program has a small buffer (like a 16-character text field) followed immediately by an 8-byte pointer to some ordinary data. An attacker might send the program a 24-character string, causing a "buffer overflow." Because of a vulnerability in the program, the first 16 characters fill the intended buffer, but the remaining 8 characters spill over and overwrite the adjacent pointer.

See below for how such an attack plays out.

Now the pointer has been redirected to point at sensitive data of the attacker's choosing, rather than the normal data it was originally meant to access. When the program tries to use what it believes is its normal pointer, it's actually accessing sensitive data chosen by the attacker.

This type of attack works in steps: first create a small confusion (like the buffer overflow), then use that confusion to create bigger problems, gaining access to data or capabilities the attacker shouldn't have. The attacker can then use the misdirection to either steal information or plant malicious data that the program will treat as legitimate.

This was a somewhat abstract description of memory corruption attacks using a buffer overflow, one of the simpler techniques. For some much more detailed and recent examples, see this description from Google, or this breakdown of a V8 vulnerability.

Compressed pointers in V8

Many attacks are based on corrupting pointers, so ideally we would remove all pointers from the memory of the program. Since an object-oriented language's heap is absolutely full of pointers, that would seem, on its face, to be a hopeless task, but removing them is made possible by an earlier development. Starting in 2020, V8 has offered the option of saving memory by using compressed pointers. This means that, on a 64-bit system, the heap uses only 32 bit offsets, relative to a base address. This limits the total heap to at most 4 GiB, a limitation that is acceptable for a browser, and also fine for individual scripts running in a V8 isolate on Cloudflare Workers.

An artificial object with various fields, showing how the layout differs in a compressed vs. an uncompressed heap. The boxes are 64 bits wide.

If the whole of the heap is in a single 4 GiB area then the first 32 bits of all pointers will be the same, and we don't need to store them in every pointer field in every object. In the diagram we can see that the object pointers all start with 0x12345678, which is therefore redundant and doesn't need to be stored. This means that object pointer fields and integer fields can be reduced from 64 to 32 bits.

We still need 64 bit fields for some values, like double precision floats and the sandbox offsets of buffers, which are typically used by the script for input and output data. See below for details.

Integers in an uncompressed heap are stored in the high 32 bits of a 64 bit field. In the compressed heap, the top 31 bits of a 32 bit field are used. In both cases the lowest bit is set to 0 to indicate integers (as opposed to pointers or offsets).
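The compressed-heap integer encoding can be sketched in a few lines. This is a simplified illustration, not V8's actual implementation; the function names are ours. An integer lives in the top 31 bits of a 32 bit field, and the lowest bit is 0 to distinguish it from pointers and offsets.

```c
#include <stdint.h>

/* Sketch of small-integer tagging in a compressed heap (simplified;
 * real V8 also performs range and overflow checks). */
static uint32_t tag_int(int32_t value)    { return (uint32_t)value << 1; }
static int32_t  untag_int(uint32_t field) { return (int32_t)field >> 1; }
static int      is_int(uint32_t field)    { return (field & 1) == 0; }
```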

Conceptually, we have two methods for compressing and decompressing, using a base address that is divisible by 4 GiB:

// The base address is the start of the pointer cage, divisible by 4 GiB.
static char* base;

// Decompress a 32 bit offset to a 64 bit pointer by adding the base address.
void* Decompress(uint32_t offset) { return base + offset; }

// Compress a 64 bit pointer to a 32 bit offset by discarding the high bits.
uint32_t Compress(void* pointer) { return (uint32_t)(uintptr_t)pointer; }

This pointer compression feature, originally primarily designed to save memory, can be used as the basis of a sandbox.

From compressed pointers to the sandbox

The biggest 32-bit unsigned integer is about 4 billion, so the Decompress() function cannot generate any pointer outside the range [base, base + 4 GiB). You could say the pointers are trapped in this area, so it is sometimes called the pointer cage. V8 can reserve 4 GiB of virtual address space for the pointer cage so that only V8 objects appear in this range. By eliminating all pointers from this range, and following some other strict rules, V8 can contain any memory corruption by an attacker to this cage. Even if an attacker corrupts a 32 bit offset within the cage, it is still only a 32 bit offset and can only be used to create new pointers that are still trapped within the pointer cage.

The buffer overflow attack from earlier no longer works because only the attacker's own data is available in the pointer cage.

To construct the sandbox, we take the 4 GiB pointer cage and add another 4 GiB for buffers and other data structures to make the 8 GiB sandbox. This is why the buffer offsets above are 33 bits, so they can reach buffers in the second half of the sandbox (40 bits in Chromium with larger sandboxes). V8 stores these buffer offsets in the high 33 bits and shifts down by 31 bits before use, in case an attacker corrupted the low bits.
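The shift-before-use scheme described above can be sketched as follows. This is our own simplified illustration (the function name is hypothetical): because the offset occupies the high 33 bits of a 64 bit field and is shifted down by 31 bits before use, even a fully attacker-controlled field yields an offset of at most 2^33 - 1, which stays inside the 8 GiB sandbox.

```c
#include <stdint.h>

/* Sketch: a sandbox buffer offset is kept in the top 33 bits of a 64 bit
 * field and shifted down by 31 bits before use, so corrupted low bits are
 * discarded and the result can never exceed the 8 GiB sandbox. */
static uint64_t load_buffer_offset(uint64_t field) {
    return field >> 31;
}
```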

Cloudflare Workers has made use of compressed pointers in V8 for a while, but to get the full power of the sandbox we had to make some changes. Until recently, all isolates in a process had to share one single sandbox if you were using the sandboxed configuration of V8. This would have limited the total size of all V8 heaps to less than 4 GiB, far too little for our architecture, which relies on serving thousands of scripts at once.

That's why we commissioned Igalia to add isolate groups to V8. Each isolate group has its own sandbox and can contain one or more isolates. Building on this change we have been able to start using the sandbox, eliminating a whole class of potential security issues in one stroke. Although we can place multiple isolates in the same sandbox, we are currently only putting a single isolate in each sandbox.

The layout of the sandbox. In the sandbox there can be more than one isolate, but all their heap pages must be in the pointer cage: the first 4 GiB of the sandbox. Instead of pointers between the objects, we use 32 bit offsets. The offsets for the buffers are 33 bits, so they can reach the whole sandbox, but not outside it.

Virtual memory isn't infinite, and there's a lot going on in a Linux process

At this point, we were not quite done, though. Each sandbox reserves 8 GiB of space in the virtual memory map of the process, and it must be 4 GiB aligned for efficiency. It uses much less physical memory, but the sandbox mechanism requires this much virtual space for its security properties. This presents us with a problem, since a Linux process 'only' has 128 TiB of virtual address space in a 4-level page table (another 128 TiB are reserved for the kernel, not available to user space).

At Cloudflare, we want to run Workers as efficiently as possible to keep costs and prices down, and to offer a generous free tier. That means that on each machine we have so many isolates running (one per sandbox) that it becomes hard to place them all in a 128 TiB space.

Knowing this, we have to place the sandboxes carefully in memory. Unfortunately, the Linux mmap syscall does not let us specify the alignment of an allocation, short of guessing a free location to request. To get an 8 GiB area that is 4 GiB aligned, we have to ask for 12 GiB, then find the 4 GiB-aligned 8 GiB area that must exist within that, and return the unused (hatched) edges to the OS:

If we allow the Linux kernel to place sandboxes randomly, we end up with a layout like this with gaps. Especially after running for a while, there can be both 8 GiB and 4 GiB gaps between sandboxes:

Sadly, because of our 12 GiB alignment trick, we can't even make use of the 8 GiB gaps. If we ask the OS for 12 GiB, it will never give us a gap like the 8 GiB gap between the green and blue sandboxes above. In addition, there are a host of other things going on in the virtual address space of a Linux process: the malloc implementation may want to grab pages at particular addresses, the executable and libraries are mapped at a random location by ASLR, and V8 has allocations outside the sandbox.

The latest generation of x64 CPUs supports a much bigger address space, which solves both problems, and Linux kernels are able to make use of the extra bits with five level page tables. A process has to opt into this, which is done by a single mmap call suggesting an address outside the 47 bit area. The reason this needs an opt-in is that some programs can't cope with such high addresses. Curiously, V8 is one of them.

This isn't hard to fix in V8, but not all of our fleet has been upgraded yet to have the necessary hardware. So for now, we need a solution that works with the existing hardware. We have modified V8 to be able to grab huge memory areas and then use mprotect syscalls to create tightly packed 8 GiB spaces for sandboxes, bypassing the inflexible mmap API.

Putting it all together

Taking control of the sandbox placement like this actually gives us a security benefit, but first we need to describe a particular threat model.

We assume for the purposes of this threat model that an attacker has an arbitrary way to corrupt data within the sandbox. This is historically the first step in many V8 exploits. So much so that there is a special tier in Google's V8 bug bounty program where you may assume you have this ability to corrupt memory, and they will pay out if you can leverage that to a more serious exploit.

However, we assume that the attacker does not have the ability to execute arbitrary machine code. If they did, they could disable memory protection keys. Having access to the in-sandbox memory only gives the attacker access to their own data. So the attacker must attempt to escalate, by corrupting data inside the sandbox to access data outside the sandbox.

You will recall that the compressed, sandboxed V8 heap only contains 32 bit offsets. Therefore, no corruption there can reach outside the pointer cage. But there are also arrays in the sandbox — vectors of data with a given size that can be accessed with an index. In our threat model, the attacker can modify the sizes recorded for those arrays and the indexes used to access elements in the arrays. That means an attacker could potentially turn an array in the sandbox into a tool for accessing memory incorrectly. For this reason, the V8 sandbox normally has guard regions around it: These are 32 GiB virtual address ranges that have no virtual-to-physical address mappings. This helps guard against the worst case scenario: Indexing an array where the elements are 8 bytes in size (e.g. an array of double precision floats) using a maximal 32 bit index. Such an access could reach a distance of up to 32 GiB outside the sandbox: 8 times the maximal 32 bit index of four billion.
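The arithmetic behind that worst case is worth making explicit. In this small illustration (our own, for exposition), a maximal 32 bit index into an array of 8-byte elements reaches just under 8 × 4 GiB = 32 GiB past the array's base.

```c
#include <stdint.h>

#define GiB (1ULL << 30)

/* Worst-case out-of-bounds reach: a corrupted, maximal 32 bit index
 * into an array of 8-byte elements (e.g. double precision floats). */
static uint64_t worst_case_reach(void) {
    uint64_t max_index = UINT32_MAX; /* corrupted 32 bit index       */
    uint64_t elem_size = 8;          /* double precision float       */
    return max_index * elem_size;    /* just under 32 GiB            */
}
```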

We want such accesses to trigger an alarm, rather than letting an attacker access nearby memory.  This happens automatically with guard regions, but we don't have space for conventional 32 GiB guard regions around every sandbox.

Instead of using conventional guard regions, we can make use of memory protection keys. By carefully controlling which isolate group uses which key, we can ensure that no sandbox within 32 GiB has the same protection key. Essentially, the sandboxes are acting as each other's guard regions, protected by memory protection keys. Now we only need a wasted 32 GiB guard region at the start and end of the huge packed sandbox areas.

With the new sandbox layout, we use strictly rotating memory protection keys. Because we are not using randomly chosen memory protection keys, for this threat model the 92% problem described above disappears. Any in-sandbox security issue is unable to reach a sandbox with the same memory protection key. In the diagram, we show that there is no memory within 32 GiB of a given sandbox that has the same memory protection key. Any attempt to access memory within 32 GiB of a sandbox will trigger an alarm, just like it would with unmapped guard regions.
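The strict rotation can be sketched in one line. NUM_KEYS here is a hypothetical value for illustration (x86 memory protection keys provide up to 16); the point is that with 8 GiB sandboxes packed back-to-back and keys assigned round-robin, the four sandboxes on either side of any given sandbox, i.e. all the memory within 32 GiB of it, carry a different key.

```c
/* Sketch of strict key rotation across packed sandbox slots
 * (NUM_KEYS is a hypothetical illustration value). */
#define NUM_KEYS 8

static int key_for_sandbox(int slot) {
    return slot % NUM_KEYS;
}
```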

The future

In a way, this whole blog post is about things our customers don't need to do. They don't need to upgrade their server software to get the latest patches; we do that for them. They don't need to worry whether they are using the most secure or efficient configuration. So there's no call to action here, except perhaps to sleep easy.

However, if you find work like this interesting, and especially if you have experience with the implementation of V8 or similar language runtimes, then you should consider coming to work for us. We are recruiting both in the US and in Europe. It's a great place to work, and Cloudflare is going from strength to strength.
