Canary releases at the edge with Cloudflare Workers

The idea of a canary release is to reduce risk by incrementally rolling out a new software version to production. First, a small subset of traffic gets the new release, and automated health checks compare the canary to the baseline release. If things are hosed, it's rolled back; if things are chill, the release is promoted to the next stage, with a larger subset of traffic. And so on.

There are a bunch of variations. The stages could be based on random samples (e.g. 10% of users, then 25%, then 50%, then 100%), geolocation, or user segments (e.g. internal users, then free-tier users, then everybody). The approach can treat each request independently, as in the case of weighted DNS records, or it can be "sticky", so that a given user consistently receives either the canary or the baseline.

Regardless of the details, the point is limiting the blast radius of new releases.

Both the App Store and Google Play support canary releases (under the names "phased releases" and "staged rollouts", respectively) and the practice is common in backend development.

But it is curiously uncommon in the world of static client-side apps that run in the browser and are delivered via CDN. Perhaps because traditionally, there haven't been great mechanisms for splitting traffic in this context.

UNTIL NOW.

This post shows how to use Cloudflare Workers to:

  • deploy a static site directly to the Cloudflare Workers KV store at edge locations (read: there is no origin server; no S3 bucket or instance anywhere).
  • execute highly configurable canary releases at the edge, without having to sacrifice the blazing speed of content delivery offered by CDNs.

Overview

Let's talk about each of the components above. (Or: skip straight to the GitHub repo).

KV store namespaces

Cloudflare Workers KV just entered general availability. What is it?

Workers KV is a highly distributed, eventually consistent, key-value store that spans Cloudflare's global edge. It allows you to store billions of key-value pairs and read them with ultra-low latency anywhere in the world. Now you can build entire applications with the performance of a CDN static cache.
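
For orientation: once a namespace is bound to a worker, it shows up as a global with a small async API. Here's a minimal sketch (GREETINGS is a made-up binding name, and these calls would live inside an async handler):

// Inside an async fetch handler, with a KV namespace bound as GREETINGS
// (a hypothetical binding name used only for illustration):
await GREETINGS.put("hello", "world");       // write a value
const value = await GREETINGS.get("hello");  // read it back; reads are eventually consistent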

We're going to create two namespaces under which we'll store key/value pairs:

  1. APP_DEPLOYS for app deployment artifacts (static files). The keys will be of the format $deployId/file/path/foo.html, where $deployId is something like a git commit SHA, and the values will be the file contents.
  2. RELEASE_STATE to represent the current state of releases, i.e., which deployment is currently released to production, and if there is a canary release in progress, its deploy id and the current stage.
{
  current: $deployId,
  next: $deployId | null,   
  stage: $stageName | null
}

Worker

The worker script receives requests and, based on the release state and attributes of the request, decides what to fetch from KV and return in the response.

First let's codify the concept of canary stages. These will be contrived. They could be pretty much anything, but one of the primary goals is to avoid incurring a bunch of latency here by calling out to external services on every request.

const isUserInternal = request => {
  const userId = parseInt(request.headers.get("X-User-Id"));
  return userId < 100;
};

const isUserFreeTier = request => {
  const userId = parseInt(request.headers.get("X-User-Id"));
  return userId >= 100 && userId < 200;
};

const stages = [
  {
    name: '1',
    criteria: isUserInternal
  },
  {
    name: '2',
    criteria: isUserFreeTier
  },
];

So we're saying that we want to first roll out to internal users, and then free-tier users, and then everybody.

We also need a method for selecting the correct deploy ID based on the request and current release state:

async function getDeployId(request) {
  const stateJSON = await RELEASE_STATE.get("state");
  const state = JSON.parse(stateJSON);

  if (!state.next) {
    return state.current;
  }

  const currentStageIndex = stages.findIndex(
    stage => stage.name === state.stage
  );

  const stagesLeadingUpToCurrentStage = stages.slice(0, currentStageIndex + 1);

  if (
    stagesLeadingUpToCurrentStage.some(stage => stage.criteria(request))
  ) {
    return state.next;
  }

  return state.current;
}

So if there is no canary release in progress, we just return the current release; if there is one in progress, we use the stage definitions and the request to figure out which deploy id to fetch the files from. Note that each stage is cumulative: it includes all of the stages before it.

Finally, we need to wire it all together:

async function handleRequest(request) {
  const deployId = await getDeployId(request);
  const originalPath = new URL(request.url).pathname.slice(1);
  const path = originalPath === '' ? 'index.html' : originalPath;
  const body = await APP_DEPLOYS.get(`${deployId}/${path}`);

  const extensionsToContentTypes = {
    'css': 'text/css',
    'html': 'text/html',
    'js': 'application/javascript'
  };

  const contentType = extensionsToContentTypes[path.split('.').pop()];

  return new Response(body, {
    headers: { "Content-Type": contentType }
  });
}

addEventListener("fetch", event => {
  event.respondWith(handleRequest(event.request));
});

My bias is toward using infrastructure-as-code tools to automate things from the start; I used serverless-cloudflare-workers to declaratively define and deploy the worker, the namespaces, and the bindings between them.
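
For reference, the serverless.yml ends up looking roughly like the sketch below. This is illustrative, not the exact file from the repo; in particular, the kv binding syntax under resources is my assumption based on the plugin's README and may differ between plugin versions.

# Illustrative sketch only; account/zone ids are placeholders, and the kv
# binding syntax should be checked against the serverless-cloudflare-workers docs.
service:
  name: canary-at-the-edge
  config:
    accountId: ${env:CLOUDFLARE_ACCOUNT_ID}
    zoneId: ${env:CLOUDFLARE_ZONE_ID}

provider:
  name: cloudflare

plugins:
  - serverless-cloudflare-workers

functions:
  router:
    name: router
    script: worker            # the worker script shown above
    resources:
      kv:
        - variable: APP_DEPLOYS
          namespace: app_deploys
        - variable: RELEASE_STATE
          namespace: release_state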

Deploy the app

Here is our "app":

$ cat app/current/index.html
<html>
  <head>
    <title>Current</title>
    <link rel="stylesheet" href="css/style.css" type="text/css" />
  </head>
  <body>
    <h1>Current</h1>
  </body>
</html>

$ cat app/current/css/style.css
body {
  background: blue;
}

And we'll make a "new" version of it so we have something new to release:

$ cat app/next/index.html
<html>
  <head>
    <title>Next</title>
    <link rel="stylesheet" href="css/style.css" type="text/css" />
  </head>
  <body>
    <h1>Next</h1>
  </body>
</html>

$ cat app/next/css/style.css
body {
  background: orange;
}

So we'll deploy both of those:

go run deploy.go -dir app/current/ -deploy-id current
go run deploy.go -dir app/next/ -deploy-id next

(current and next are ids for demonstration purposes. IRL use a git commit SHA instead!)

At this point, inside Workers KV under the APP_DEPLOYS namespace, we've set the following keys:

  • current/index.html
  • current/css/style.css
  • next/index.html
  • next/css/style.css

But until there is a release it won't be available in production.

[Sometimes deploy and release are used interchangeably, but treating them as two distinct steps enables lower-risk (and, indeed, better) workflows, so that is how we are rolling.]

Release the app

First, release the "current" version of the app to production:

go run release.go -deploy-id current

We've just set the release state to look like this:

{
  current: "current",
  next: null,   
  stage: null
}

And if we visit https://app.stuartsandine.com, where it's deployed, we'll see the original blue "Current" page.

Kick off stage 1 of a canary release

go run release.go -deploy-id next -stage 1

Now we've set the release state to look like this:

{
  current: "current",
  next: "next",   
  stage: "1"
}

Which means that if a request matches the criteria for stage 1, it'll receive the "next" deployment (the orange page), and otherwise it'll receive the original blue page.
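
You can spot-check this with curl by faking the X-User-Id header (the specific ids are arbitrary examples; recall that ids below 100 are "internal" and 100-199 are "free tier"):

curl -H "X-User-Id: 42" https://app.stuartsandine.com/     # internal user  -> orange "Next" page
curl -H "X-User-Id: 150" https://app.stuartsandine.com/    # free-tier user -> still the blue "Current" page
curl -H "X-User-Id: 4242" https://app.stuartsandine.com/   # everyone else  -> still the blue "Current" page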

Promote the canary release to stage 2

go run release.go -deploy-id next -stage 2

Now we've set the release state to look like this:

{
  current: "current",
  next: "next",   
  stage: "2"
}

Which means all requests matching stage 1 or stage 2 criteria will get the new orange page, and all others will get the original blue page.
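
Repeating the free-tier spot check from before now returns the new version:

curl -H "X-User-Id: 150" https://app.stuartsandine.com/    # free-tier user -> now the orange "Next" page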

Promote it to a full release

go run release.go -deploy-id next

Now we've set the release state to look like this:

{
  current: "next",
  next: null,   
  stage: null
}

And at this point all traffic will receive the new orange version.

To roll back, run the same command and pass in the deploy id to roll back to.
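
For example, having fully released next, rolling back to the previous version is just:

go run release.go -deploy-id current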

Discussion

By letting you hook into the request/response cycle at the edge with minimal latency, and providing key/value storage at the edge, Workers provide a powerful mechanism to do interesting things that were previously impractical or impossible. And it isn't just JavaScript – you can run WebAssembly modules at the edge too!

This is a basic example and a naive implementation, but I think it illustrates a big idea. Plugging this kind of system into a CI/CD pipeline facilitates continuous delivery by offering a reduced-risk method of automated release for static content like websites and JAMstack apps.

You could certainly build the same kind of system on top of AWS Lambda@Edge, but Workers introduce much less latency, and managing AWS CloudFront distributions is, in my experience, comparatively clunky and slow.

It's worth mentioning that Netlify offers something related called Split Testing, where you specify a set of branches, each with the percentage of traffic it should receive. This is really cool, but not as flexible as managing the traffic splitting via JavaScript.

And speaking of traffic splitting: what would be really sweet is using LaunchDarkly to manage the user segments and automatically syncing that data to KV, where the worker can use it to make traffic-splitting decisions at the edge. They provide tools to do similar things with Redis, DynamoDB, or Consul as the data store. KV support would be super rad, IMO.

To dig more into the implementation, check out the GitHub repo.
