AWS Lambda shouldn’t increase its timeout; we should get a new service instead

In 2018, AWS Lambda increased the maximum time a function can run from 5 minutes to 15 minutes. This was a great thing! Ever since then, people have been asking for another increase. And while I am fully on board with the need for serverless compute for longer workloads, I don’t think Lambda — or other FaaS platforms, like Google Cloud Functions — should increase its timeout much beyond its current limit. I’d rather get a new service to accomplish the goal.

The fundamental reason is this: systems make tradeoffs to achieve features. I believe that short-running and long-running compute probably need to be able to make different tradeoffs, and that trying to do both in the same service is going to make it hard for those tradeoffs to be made. I don’t think there’s one serverless compute model that can service all workloads; in the long term, I hope there’s a small handful of different models that are each the best for the different sorts of workloads out there. Note this is separate from also wanting different compute models that can help people earlier in their serverless journey; we need Google Cloud Functions for serverless-native architecture, and Google Cloud Run for providing serverless benefits to workloads that were built with servers in mind.

A good example of how different workloads need different tradeoffs is AWS Step Functions standard and express workflows. They are billed differently. They have different observability and durability. Standard workflows can have callback tasks. Express workflows can be synchronously invoked. Note that in this case, the code you provide (the state machine definition) can be the same, even though the execution model is different. That may be true of long-running compute jobs and Lambda as well!
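
To make the "same definition, different execution model" point concrete, here is a minimal sketch in Python. It only builds request parameters shaped like the Step Functions CreateStateMachine API (`name`, `definition`, `roleArn`, `type`) and makes no AWS call. The state machine, the names, and the role ARN are placeholders I made up for illustration.

```python
import json

# A tiny Amazon States Language definition with a single Pass state.
# It uses nothing that is specific to either workflow type.
definition = json.dumps({
    "StartAt": "Hello",
    "States": {
        "Hello": {"Type": "Pass", "Result": "Hello, world", "End": True},
    },
})


def state_machine_request(name, workflow_type):
    """Build parameters shaped like a CreateStateMachine request (no call is made)."""
    return {
        "name": name,
        "definition": definition,
        # Placeholder role ARN, for illustration only.
        "roleArn": "arn:aws:iam::123456789012:role/ExampleRole",
        "type": workflow_type,  # "STANDARD" or "EXPRESS"
    }


standard = state_machine_request("example-standard", "STANDARD")
express = state_machine_request("example-express", "EXPRESS")
```

The two requests differ only in `type`; the code I wrote (the definition) is identical. Definitions that rely on features only one type supports, such as callback tasks, would not carry over this cleanly.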

For this article, let’s take one hour as the running time for a “long-running job”. What are the differences in the needs of a one-hour compute task, as opposed to a 15-minute compute task? Often I care less about promptness (for example, a 5-minute delay is less of an impact), and more about cost. There’s more that can go wrong during an hour that I’d like to intercede in. There is often a lot more variability in what kind of computation is running over the life of the job.

What would the ideal serverless compute service for long-running jobs look like? I don’t know! But here are some features I would like to see in it:

  • Addressability: there should be an ARN for an execution (like with Step Functions standard workflows).
  • External suspend and resume: if I can address it, I should be able to control it as well.
  • Internal suspend and resume: if I’m waiting on something, I shouldn’t have to pay for that time. I’d love this in my short-running compute as well, but if I could get the feature sooner with tradeoffs that only fit with long-running compute, I’d happily settle for that.
  • Orchestration: if I’ve got a few different things that need to happen, with dependencies between them, the service should help make that simpler to deal with.
  • Variable resources: I shouldn’t need to pay for the same amount of compute the entire time, if I’m not using all that compute.
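
To put rough numbers on the variable resources point, here is a small worked example in Python. The job profile is entirely made up: a one-hour job that needs 8 "units" of compute for 10 minutes and 1 unit for the remaining 50. The "unit" is an abstract billing unit I'm assuming for illustration, not any particular service's pricing.

```python
from fractions import Fraction


def unit_hours(phases):
    """Sum unit-hours over a list of (units, minutes) phases."""
    return sum(Fraction(units * minutes, 60) for units, minutes in phases)


# Hypothetical job profile: (units needed, minutes spent at that level).
phases = [(8, 10), (1, 50)]

peak_units = max(units for units, _ in phases)
total_minutes = sum(minutes for _, minutes in phases)

# Fixed allocation: pay for the peak for the whole run.
fixed = Fraction(peak_units * total_minutes, 60)

# Variable allocation: pay only for what each phase uses.
variable = unit_hours(phases)

print(f"fixed: {float(fixed):.2f} unit-hours, variable: {float(variable):.2f} unit-hours")
```

With this profile, paying for the peak the whole time costs 8 unit-hours, while paying per phase costs 13/6 (about 2.17) unit-hours. The gap depends entirely on how spiky the job is.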

In my mind, the service would pull together the best aspects of several different services:

  • Declarative orchestration from Step Functions: when you can resolve your orchestration into a declarative description that needs zero maintenance, that’s a win.
  • Imperative orchestration from e.g. Spark: sometimes it’s better to have your orchestration mixed in with your code. You shouldn’t lose any features if you’re doing it this way.
  • Dependency-based orchestration from Airflow: you shouldn’t have to order things yourself; a dependency graph should let the service do it for you.
  • Queueing, job management, and scale-to-zero from AWS Batch: AWS Batch is already a great service for managing batch computation. Scale-to-zero should be table stakes.
  • Simplicity from CodeBuild: if what I’m doing is simple, maybe don’t even make me bring a whole container image or zip file. Don’t require a VPC, either.
  • Variable resource allocation from RoboMaker: the t2, t3, and t4g EC2 instance types have “unlimited mode” for burst, but RoboMaker takes this further and can dynamically scale your job up and down from 1 to 8 “units”, and you’re charged by the unit-hour.
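
As a rough illustration of the dependency-based style, here is a minimal sketch using Python's standard-library `graphlib` (Python 3.9+). This is not how Airflow itself is implemented or used; the step names are hypothetical. The point is only that I declare what depends on what, and the ordering (including what can run in parallel) falls out of the graph.

```python
from graphlib import TopologicalSorter

# Hypothetical job: each step maps to the set of steps it depends on.
dependencies = {
    "extract": set(),
    "transform": {"extract"},
    "validate": {"extract"},
    "load": {"transform", "validate"},
}

sorter = TopologicalSorter(dependencies)
sorter.prepare()

# Collect the steps in batches; everything in one batch could run in parallel.
batches = []
while sorter.is_active():
    ready = sorted(sorter.get_ready())  # steps whose dependencies are all done
    batches.append(ready)
    sorter.done(*ready)  # mark the batch finished so dependents become ready

for number, batch in enumerate(batches, start=1):
    print(number, batch)
```

Here `transform` and `validate` land in the same batch because both depend only on `extract`. I never wrote down an explicit order, which is the property I'd want the service to give me.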

Do I need any of this in a 15-minute job? I don’t think I do, other than the declarative orchestration, which I get with Step Functions express workflows. I don’t think I want imperative orchestration in the style of Azure Durable Functions, but that’s a subject for another post.

Fundamentally, we need to broaden our minds when it comes to serverless compute. I think a lot of people take the existing offerings and imagine stretching them around the workloads we can’t currently fit into them. But I’d like us to work backwards from the needs we have, trying to find the optimal grouping of those needs into a few different services (a bin packing problem!) that better serve those needs than one service trying to be everything to everyone.

As always, if you’ve got questions or comments, you can find me on Twitter.

Cloud Robotics Research Scientist at @iRobot