Massively parallel web performance testing on a shoestring budget with Lighthouse and AWS Lambda

In a previous post, we walked through setting up Lighthouse in a CI/CD pipeline to prevent performance and accessibility regressions from getting merged in the first place. That setup assumed a project where testing a few pages, or maybe a few dozen, is sufficient.

But what if you need to test performance and accessibility across hundreds or thousands of pages, minimizing both time-to-completion and cost?

Functions-as-a-service such as AWS Lambda or Google Cloud Functions, with their inherent rapid scalability and pay-per-actual-usage pricing, seem like a natural fit for this use case. But driving thousands of headless web browsers in a resource-constrained and opaque environment over which we have limited control, with the goal of taking measurements that involve a great deal of nondeterminism and variance, also seems like it could be a recipe for a bona fide tire fire.

The purpose of this post is to step through my decision to address this use case with a serverless architecture on AWS, show how it works, and share results on timing, cost, and lighthouse performance scores.

You can also skip straight to the GitHub repo to check out the code and deploy this in your own AWS account.


Goal

Our goal is to test 5,000 pages in under a minute, to satisfy the use case of running lighthouse as a post-deployment test against a large site, keeping a tight feedback loop. We'll assume that it takes 10 seconds on average to test each page.
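That works out to roughly 14 hours of serial testing (5,000 × 10 seconds), so hitting the one-minute target means running on the order of a thousand or more tests concurrently.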

Survey of options

Run lighthouse N times sequentially

The simplest possible thing might be running lighthouse a bunch of times in a script, one after the other, inside a single VM/container/local machine. But even with "just" 1,000 pages, this is going to take almost three hours (1,000 pages × 10 seconds ≈ 2.8 hours). Next!

Multiple lighthouse instances per VM or container

The next thing to consider is splitting up the work to achieve concurrency. We could do that within a single highly memory-optimized VM or container, running a bunch of lighthouse instances at the same time as separate processes. Or across several somewhat-memory-optimized VMs or containers. With either route, we could map each lighthouse instance 1:1 to a browser instance, or have many lighthouse instances share each browser instance.

There are two reasons I'm not going to do this.

The first is that web browsers are complex and resource-hungry pieces of software. I can't overstate my aversion to the prospect of driving a large number of web browser instances concurrently in one VM or container, or trying to share browser instances and do multiple lighthouse runs for each browser instance. The lack of isolation feels like it is going to lead to misery (but I'd be happy to be wrong).

The second is that we either need to have the memory-optimized VM(s) always running for this to be fast, and that is going to be expensive, or we need to spin them up at the moment that we want to run tests, and that is going to instantly blow our budget of one minute. (Is it even possible to start a VM in less than ~ a minute?)

Example: on EC2, an x1.16xlarge instance has 976GB of memory and costs $6.669/hr on demand. I'm guessing that this is the minimum that'd be required to run 1,000 tests concurrently (and it might take significantly more), which at ten seconds a test, assuming the happiest of happy paths, could do 5,000 in under a minute. Even running during business hours only, that's over $1,000/month!
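(Assuming, say, eight hours a day for 21 business days, that's $6.669/hr × 8 × 21 ≈ $1,120.)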

Yeah but what about autoscaling? Can't we run a VM/container instance that manages a capped number of lighthouse processes, and let the cloud provider handle the scaling so it's usually a much lower cost (e.g. 1 instance) but then scales up when we want to run the tests (e.g. 20 instances)?

Autoscaling is great in theory but nuanced in practice. It takes time to scale, it's prone to over- and under-scaling, and the default metrics being used to decide whether to scale (e.g. CPU utilization, response latencies) might not make sense for your situation. And, you might not be able to change them. And, you might still have to always have one instance running.

For this project, this approach involves a set of problems that I honestly just don't want to think of as my problems.

One lighthouse instance per container

I like this idea: packaging the task in a container and mapping one container to one lighthouse instance could sidestep a bunch of the potential landmines discussed in the previous section, because each run is more isolated. And we could run it someplace that lets us specify the precise resource requirements for a single lighthouse run (ECS, kubernetes, etc.), which seems simpler than figuring out the resource requirements for doing multiple runs.

But I still couldn't see a good way that this approach was going to be fast and cheap compared to functions-as-a-service: we'd still need to have either the underlying VM capacity always-running, which would cost more, or spin it up on demand, which would be slower.

Functions-as-a-service

I'd been avoiding reaching for AWS Lambda (or similar) for a bunch of reasons. Even though the inherent rapid scalability and pay-per-actual-usage pricing were attractive, and the buzzwords would be great, there would be a bunch of downsides:

  • Constrained to a teeny set of specific operating systems and runtimes, for example nodejs 8.10 on Amazon Linux
  • Packaging the application/task in a zip file rather than a container. I see this as a major workflow regression.
  • The tooling around logging, monitoring, etc, is not as mature
  • Hard resource limits like 512MB of disk space, and you can only write to /tmp
  • Vendor-specific bullshit to deal with
  • Max memory of 3GB (this seems fine)
  • No GPU access (this seems maybe fine?)
  • No sudo access (this seems fine)
  • Conceptually somewhat of a black box compared to a container (e.g. how does the CPU allocation even work?)
  • This is still somewhat of a wild west situation, and the AWS docs will almost certainly tell me things that are not true, so I will probably need to read every website on the internet and the number of open tabs may cause my aging laptop to explode

The thought of driving thousands of web browsers concurrently under these conditions made me... uncomfortable. Even if it were to work, I was concerned that in such an environment it could be easy to accidentally think we are measuring the performance of our site when really we are mostly measuring the performance of AWS Lambda on a Tuesday.

But you might only live once, so what the heck. Despite the drawbacks it seemed like the most likely way to build something fast, cheap, and with a low operational burden (why take on the scaling part when we can get it for free?). I went forward with a serverless architecture on AWS to test it out and collect more information.

Architecture

I used terraform to define and deploy a prototype, with the Lambda functions written in node.js. You can snag the terraform in this project's repo and deploy the exact same thing in your AWS account in a few minutes.

We're using a fan-out/fan-in pattern which you can read more about in the context of AWS Lambda in this excellent post by Yan Cui. Here's a high level diagram of how it works:

Example usage

The primary way we interact with this service is through a command-line interface:

1000 parallel lighthouse runs in 12.3 seconds!

Details

Triggering a set of runs

Using a CLI provided in the project, we invoke the initializer lambda function, passing in a set of pages to run against, the number of times to test each page, and configuration flags to feed lighthouse. We pass those things into the initializer function so that we can change them without needing to re-deploy anything.
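
For a sense of what that invocation looks like, here's a minimal sketch that calls the initializer directly with the AWS SDK (the project's CLI wraps something like this). The payload field names (urls, runsPerUrl, lighthouseOptions) and the function name are illustrative assumptions, not the project's exact interface.

```js
// Sketch: invoke the initializer Lambda directly. Field names and the
// function name are assumptions for illustration, not the repo's exact schema.
const AWS = require('aws-sdk');
const lambda = new AWS.Lambda({ region: 'us-east-1' });

async function startJob() {
  const payload = {
    urls: ['https://example.com/', 'https://example.com/pricing'],
    runsPerUrl: 5,
    lighthouseOptions: { onlyCategories: ['performance', 'accessibility'] },
  };

  const res = await lambda.invoke({
    FunctionName: 'lighthouse-initializer', // hypothetical function name
    Payload: JSON.stringify(payload),
  }).promise();

  // The initializer returns a job id we can use to poll for status.
  const { jobId } = JSON.parse(res.Payload.toString());
  console.log('Started job', jobId);
}

startJob().catch(console.error);
```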

First Lambda function: initializer

The first lambda function publishes one message to an SNS topic for each desired lighthouse run (desired runs = number of URLs × runs per URL). It also creates a job item in dynamodb; this is where we track the overall progress of this collection of runs. The function returns the job's id to the caller (such as our CLI), which can use it to poll the job's status. This function and the SNS topic are the fan-out mechanism.
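
Here's a condensed sketch of what that amounts to. The table/topic environment variables and attribute names are assumptions for illustration; the real function in the repo differs in its details.

```js
// Sketch of the initializer: create the job record, then fan out one SNS
// message per desired run. Names here are assumptions, not the repo's exact code.
const AWS = require('aws-sdk');
const { v4: uuidv4 } = require('uuid');

const sns = new AWS.SNS();
const dynamo = new AWS.DynamoDB.DocumentClient();

exports.handler = async (event) => {
  const { urls, runsPerUrl, lighthouseOptions } = event;
  const jobId = uuidv4();
  const totalRuns = urls.length * runsPerUrl;

  // Job metadata item: this is where overall progress is tracked.
  await dynamo.put({
    TableName: process.env.JOBS_TABLE,
    Item: { jobId, totalRuns, completedRuns: 0, erroredRuns: 0, createdAt: Date.now() },
  }).promise();

  // Fan out: one message per (url, run) pair.
  const publishes = [];
  urls.forEach((url) => {
    for (let run = 0; run < runsPerUrl; run++) {
      publishes.push(sns.publish({
        TopicArn: process.env.RUNS_TOPIC_ARN,
        Message: JSON.stringify({ jobId, runId: `${url}#${run}`, url, lighthouseOptions }),
      }).promise());
    }
  });
  await Promise.all(publishes);

  // The caller polls job status using this id.
  return { jobId };
};
```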

Second Lambda function: lighthouse worker

The second lambda function subscribes to the SNS topic for test runs. For each message published to the SNS topic, SNS will invoke the second lambda function once (but we must code against the possibility that it happens more than once). The only thing limiting the number of concurrent invocations of the second function is the AWS Lambda concurrency limit, which defaults to 1,000 concurrent invocations account-wide. For this project, I asked AWS to bump mine to 5,000 and they were like "sure."

Upon invocation, the second Lambda function runs lighthouse against a specific URL. If the run completes successfully, the function uploads both the human- and machine-friendly lighthouse reports to S3, stores in dynamodb the fact that the run corresponding to this particular SNS message has been processed successfully, and increments an atomic counter in the job's metadata that tracks completed runs. Before running lighthouse, and again before updating the job metadata, the function ensures idempotency by doing a consistent read against dynamodb for this particular run to check whether it has already been completed, in which case it simply bails.
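
In code, the worker's control flow looks roughly like the sketch below. The table, bucket, and attribute names are assumptions carried over from the initializer sketch, the second idempotency check is elided, and actually launching Chrome inside Lambda requires a Lambda-compatible headless Chromium build, which is glossed over here.

```js
// Sketch of the worker's control flow; names are illustrative assumptions.
const AWS = require('aws-sdk');
const lighthouse = require('lighthouse');
const chromeLauncher = require('chrome-launcher');

const dynamo = new AWS.DynamoDB.DocumentClient();
const s3 = new AWS.S3();

exports.handler = async (event) => {
  const { jobId, runId, url, lighthouseOptions } = JSON.parse(event.Records[0].Sns.Message);

  // Idempotency: consistent read to see whether this run already completed,
  // since SNS may deliver (and Lambda may retry) a message more than once.
  const existing = await dynamo.get({
    TableName: process.env.RUNS_TABLE,
    Key: { jobId, runId },
    ConsistentRead: true,
  }).promise();
  if (existing.Item && existing.Item.status === 'complete') return;

  const chrome = await chromeLauncher.launch({ chromeFlags: ['--headless', '--no-sandbox'] });
  try {
    const result = await lighthouse(url, { port: chrome.port, output: ['html', 'json'], ...lighthouseOptions });

    // Store both the human-friendly (HTML) and machine-friendly (JSON) reports.
    await Promise.all([
      s3.putObject({ Bucket: process.env.REPORTS_BUCKET, Key: `${jobId}/${runId}.html`, Body: result.report[0] }).promise(),
      s3.putObject({ Bucket: process.env.REPORTS_BUCKET, Key: `${jobId}/${runId}.json`, Body: result.report[1] }).promise(),
    ]);

    // Mark this run complete and bump the job's atomic completed-runs counter.
    await dynamo.put({ TableName: process.env.RUNS_TABLE, Item: { jobId, runId, status: 'complete' } }).promise();
    await dynamo.update({
      TableName: process.env.JOBS_TABLE,
      Key: { jobId },
      UpdateExpression: 'ADD completedRuns :one',
      ExpressionAttributeValues: { ':one': 1 },
    }).promise();
  } finally {
    await chrome.kill();
  }
};
```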

If an exception is thrown while executing the function, or if it fails for any number of other reasons at a lower level than our code, or if it's throttled because we're at the concurrency limit, Lambda will automatically retry it a couple of times (with a delay). If it continues to fail, the message is re-routed to a separate SNS topic that serves as a dead letter queue. Maybe it's more accurate to call it a poison queue, because those messages are reprocessed by the second Lambda function, which updates the job status to reflect the error.
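
The error path then looks something like this fragment (again with assumed names): a message arriving via the dead-letter topic is recorded as an errored run rather than re-run.

```js
// Sketch of handling a dead-lettered run: record the failure and bump the
// job's errored-runs counter. Names are assumptions carried over from above.
const AWS = require('aws-sdk');
const dynamo = new AWS.DynamoDB.DocumentClient();

async function recordFailedRun({ jobId, runId }) {
  await dynamo.put({
    TableName: process.env.RUNS_TABLE,
    Item: { jobId, runId, status: 'error' },
  }).promise();

  await dynamo.update({
    TableName: process.env.JOBS_TABLE,
    Key: { jobId },
    UpdateExpression: 'ADD erroredRuns :one',
    ExpressionAttributeValues: { ':one': 1 },
  }).promise();
}
```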

Third Lambda function: post-processor

The third Lambda function subscribes to a dynamodb stream; it is effectively watching the job metadata table so that it can hook into the moment a job finishes (total desired runs == total complete runs + total error runs) and execute some code. Right now it doesn't do anything, but this is the place where you could, say, analyze the scores and send out alerts if they're bad.
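
A sketch of that stream handler, continuing the same assumed attribute names:

```js
// Sketch of the post-processor: watch the jobs table's stream and react when a
// job's counters add up to the desired total. Attribute names are assumptions.
const AWS = require('aws-sdk');

exports.handler = async (event) => {
  for (const record of event.Records) {
    if (record.eventName !== 'MODIFY') continue;

    const job = AWS.DynamoDB.Converter.unmarshall(record.dynamodb.NewImage);
    const finished =
      job.totalRuns === (job.completedRuns || 0) + (job.erroredRuns || 0);
    if (!finished) continue;

    // The job just finished: this is where score analysis / alerting could go.
    console.log(`Job ${job.jobId} finished: ${job.completedRuns} complete, ${job.erroredRuns || 0} errored`);
  }
};
```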

An aside on SNS vs SQS

AWS Lambda recently added support for SQS as an event source. That seemed like a maybe-pretty-good fit here (rather than SNS) so I started with it, but eventually ran into this gem: "Amazon Simple Queue Service supports an initial burst of 5 concurrent function invocations and increases concurrency by 60 concurrent invocations per minute." This was much slower, as you can imagine. And processing retries was way slower. And, the situation really wanted the lambda function to process messages in batches of at least ten. I decided that I didn't really need the benefits of SQS here, which might be nice if we wanted to limit concurrency and/or if it was really important to have more reliable reprocessing of busted messages. SNS ended up fitting the use case much better.

The numbers

Time to completion

I ran a bunch of jobs with 1,000 runs each, and recorded the times to completion of each job.

| Worker mem (MB) | Time  | Comment                        |
|-----------------|-------|--------------------------------|
| 768             | 23s   |                                |
| 768             | 1m38s | 999 finished in 16s; one retry |
| 768             | 1m30s | 999 finished in 21s; one retry |
| 768             | 18s   |                                |
| 768             | 1m34s | 999 finished in 15s; one retry |
| 1024            | 14s   |                                |
| 1024            | 17s   |                                |
| 1024            | 15s   |                                |
| 1024            | 16s   |                                |
| 1024            | 13s   |                                |
| 1536            | 1m32s | 999 finished in 18s; one retry |
| 1536            | 1m28s | 999 finished in 14s; one retry |
| 1536            | 15s   |                                |
| 1536            | 12s   |                                |
| 1536            | 1m27s | 999 finished in 14s; one retry |

This is looking pretty good! It seems like we get failures around once every couple thousand runs, but they're all retried after a delay and complete successfully. On average, 99.9% of each job's runs are complete within about 16 seconds.

Through CloudWatch's built-in metrics we can see the average and p99 durations of the worker function's individual invocations:

And we also see exactly six errors, which aligns with (and is the cause of) our observation of six retries:

I also ran a few tests with 5,000 runs:

| Worker mem (MB) | Time    | Comment                            |
|-----------------|---------|------------------------------------|
| 1024            | 1m49s   | 4996 finished in 40s; four retries |
| 1024            | 54s     |                                    |
| 1024            | 1m35s   |                                    |
| 3008            | 19s     |                                    |
| 3008            | 23s     |                                    |
| 3008            | 16s (!) |                                    |

Wahoo! We reached the goal (sometimes)!!! With the maximum Lambda memory setting of 3GB, on the happy path we are testing 5,000 pages in about 20 seconds.

Overall conclusions: if we are really focused on maximizing speed, we should probably do something about those retries, which seem to be necessary once every several thousand runs and which slow down completion. We could perhaps do our own retries (rather than relying on Lambda's delayed retry), or, once we've reached 99.9% of completions, time the job out if it doesn't finish within another N seconds.

But I'm happy with the timing. It is really fast! We are running thousands of browsers at the same time and we're still alive!

Performance scores

One of the killer features of lighthouse is that it gives back a single high-level performance score for your site that combines several weighted metrics like time to first contentful paint, time to first meaningful paint, time to interactive, and speed index. It also performs simulated and/or emulated throttling to try to give back a score that is maximally representative of your users' actual experience. The problem is that these performance scores involve an overwhelming amount of variability and non-determinism.

At first I was very confused, because 1024MB seemed like plenty of memory to execute the tests quickly, yet at that memory level, and even at 1536MB, I was seeing quite a lot of variability in performance scores. I tried a bunch of things, including different throttling methods and different memory allocations.

After tweaking various knobs and visualizing the results, I came to the conclusion that the most important factors affecting the variability of the performance scores I was collecting were the Lambda function's allocated memory (or really its allocated CPU, which can't be controlled directly but increases along with memory) and the lighthouse setting for CPU throttling (lighthouse simulates network and CPU throttling by default).

The box plot below shows the distribution of scores for groups of 1,000 runs with different configuration settings in terms of Lambda memory and lighthouse CPU throttling. C=1 denotes that the cpuSlowdownMultiplier has been set to 1, that is, CPU throttling is off. All runs were performed against a single page, https://kubernetesfordogs.com, which is a static site served from an S3 bucket with a CDN (CloudFront) in front of it.
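
For reference, the C=1 case corresponds to a custom lighthouse config roughly like the sketch below, passed as lighthouse's config argument; the repo wires this setting up in its own way.

```js
// Sketch of a custom lighthouse config that sets the CPU slowdown multiplier
// to 1 (i.e. no simulated CPU throttling).
module.exports = {
  extends: 'lighthouse:default',
  settings: {
    throttling: {
      cpuSlowdownMultiplier: 1,
    },
  },
};
```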

My general conclusion is that reasonably stable performance scores can be achieved with this setup with around 2GB of memory and no lighthouse CPU throttling. Though maybe that's just the point at which scores become stable but inflated; maybe around 2GB of memory with throttling is the point at which scores stabilize and are more representative of what users would actually see. I'm not sure how to decide exactly when it's appropriate to apply lighthouse's CPU throttling; if you have any bright ideas, let me know!

Even though there don't appear to be real gains from increasing the memory beyond 2GB, I feel like it's probably not a bad idea to use 3GB to leave a safety margin.

Cost

So if we wanted to run this system once per day against 5,000 pages similar to https://kubernetesfordogs.com, with the largest memory allocation (3008MB), which brought the average function duration down to 5 seconds, here's a ballpark estimate of the cost:

Lighthouse worker Lambda function cost: (5 seconds × 5,000 pages × 30 days − 136,170 free-tier seconds) × $0.000048975 per second ≈ $30.06 per month. (The per-second rate is Lambda's ~$0.0000167 per GB-second applied to a 3008MB allocation, and 136,170 seconds is what the 400,000 GB-second monthly free tier works out to at that size.)

The other two Lambda functions' cost is pretty much a rounding error.

Dynamodb, at the level we're using it, will maybe cost us a dollar or two a month, and the SNS usage will fall within the free tier. I'm not going to touch on the S3 buckets, because right now we're storing way more than we probably need to out of laziness (we'd be generating ~90GB of HTML and JSON reports per month as-is; those reports have a lot in them!). We could fix that by storing only the specific data we care about in S3 or a database, or by creating an object lifecycle policy to ship the full reports to an infrequent-access storage tier, or whatever.

Discussion

Overall, my conclusion is that AWS Lambda seems as fine a place as any to run this kind of task, and setting it up wasn't too painful. However: I learned that packet-level throttling, which is the most accurate way to simulate network conditions, is impossible inside a Lambda function because it requires sudo access, so if that's important Lambda is basically off the table as a viable option.

I'm skeptical of using these performance scores to make important decisions, because there are so many variables involved. I'm confused by things like the fact that WebPageTest, which I understand to be a kind of gold standard in performance testing, gives back a lighthouse performance score of 64 against my site, while my system is now basically never giving back a score that low.

As discussed in the lighthouse docs, it's a good idea to perform multiple runs against each page and extract a median. I've also become convinced that it's a good idea to thoroughly test your own setup and get an idea of the expected variability. And I think it could even be useful to have some kind of a control page that is never changed, has similar characteristics to pages on your site, and which you test over time to ensure that scores aren't drifting independently of the intentional changes to your site.
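
As a concrete example of the median idea, a small helper along these lines could reduce a page's N runs to a single score (Lighthouse 3+ reports category scores on a 0 to 1 scale in lhr.categories.performance.score):

```js
// Reduce a set of lighthouse results (lhr objects) for one page to the median
// performance score, on the familiar 0-100 scale.
function medianPerformanceScore(lhrs) {
  const scores = lhrs
    .map((lhr) => lhr.categories.performance.score * 100)
    .sort((a, b) => a - b);
  const mid = Math.floor(scores.length / 2);
  return scores.length % 2 ? scores[mid] : (scores[mid - 1] + scores[mid]) / 2;
}
```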

A huge caveat around both the timing and the performance scores is that these tests were run against a single page, and I can't say how well the conclusions will generalize to other sites. I felt that using a single page was a good starting point to reduce the number of variables and better understand the environment itself. Also, I didn't want to go around hammering other people's sites! The test results are also limited in that they were collected during a single brief window of time. If you deploy this and use it to test your site and come up with anything interesting in the numbers, I'd like to hear about it.

At this point, we've got a functioning basic prototype and a reasonable amount of confidence that it works, by some definition. But we're also left with a pile of reports and no useful way to visualize the data over time. Next steps on this project could include building out the infrastructure to run tests on a regular schedule or post-deployment and visualize the results, setting performance budgets and sending out alerts when they're blown, and putting the metrics we want to track over time in a more easily queryable place.

If you have thoughts, questions, or comments on this topic, or if there's anything important I've missed here, let me know!

The GitHub repo is here.
