Lessons learned experimenting with an AWS Lambda orchestration engine

We need better orchestration for serverless workflows to make system design more straightforward and easier to implement

While exploring the missing pieces to achieve the vision for a loosely-coupled and high-performance serverless architecture, I recently began tinkering with writing a library to enable orchestration for AWS Lambda. I’ve written previously about the need for this, and why AWS Step Functions isn’t the right solution for all cases.

tl;dr: Proper orchestration is needed for FaaS-based serverless workflows. ClientContext needs to be supported for async Lambda invocations.

The team developing IBM Cloud Composer has described the trilemma they face in implementing their OpenWhisk orchestration engine. Any orchestrator must accept one of the following:

  1. Double-pay for executions: a coordinator function runs for the length of the orchestration
  2. Break “substitution”: don’t allow the follow-on action for a function to be set dynamically
  3. Break the black box model: have function code participate in the orchestration

The first option is bad with Lambda, since the coordinator’s maximum execution time caps how long an orchestration can run. The second choice is bad as well — we want a function to be written in a caller-agnostic way, able to participate in many different orchestration flows. So I started thinking about how to implement a library for Python Lambdas that would make #3 as painless as possible.

Breaking the black box model

The basic premise is that the API for invoking a Lambda function provides the ability to include a ClientContext — a dictionary of metadata separate from the payload.

ClientContext is used by the AWS Mobile SDK to include extra information about the client. But there’s no reason a library used in multiple Lambda functions couldn’t use it to pass information between invocations.
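To make that concrete, here is a minimal sketch (the helper names are mine, not a real library) of passing orchestration metadata through ClientContext with boto3. The Invoke API expects ClientContext as a base64-encoded JSON object, and keys under "custom" show up in the receiving handler as `context.client_context.custom`:

```python
import base64
import json


def make_client_context(custom):
    """Encode orchestration metadata as a Lambda ClientContext string.

    The ClientContext parameter of the Invoke API is a base64-encoded
    JSON object; the contents of "custom" are exposed to the invoked
    handler as context.client_context.custom.
    """
    return base64.b64encode(json.dumps({"custom": custom}).encode()).decode()


def invoke_with_metadata(function_name, payload, metadata):
    """Invoke a Lambda, carrying orchestration metadata in ClientContext.

    boto3 is imported lazily so the encoding helper above can be used
    without AWS credentials configured.
    """
    import boto3

    client = boto3.client("lambda")
    return client.invoke(
        FunctionName=function_name,
        InvocationType="RequestResponse",  # synchronous invocation
        Payload=json.dumps(payload),
        ClientContext=make_client_context(metadata),
    )
```

The invoked function’s handler would then read the metadata from `context.client_context.custom` and use it to decide what to do next, without the payload being touched.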

My first approach was to try to enable callbacks. The idea was that a client library would wrap the Lambda Invoke call and also take the name of a callback function, to which the output of the invoked Lambda would be passed.

I worked on this for a while, but I couldn’t settle on a good developer experience. In particular, nested callbacks are difficult to implement without making each function explicitly aware that it is involved in invoking callbacks.
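For illustration, here is a toy version of the callback idea. Every name here is hypothetical, and a local registry stands in for actual Lambda invocations; in the toy, `context` is a plain dict rather than the real Lambda context object. The point is that the wrapper, not the function body, is responsible for running the callback:

```python
# Local registry standing in for deployed Lambda functions; a real
# library would make a boto3 Lambda invoke call here instead.
REGISTRY = {}


def orchestrated(fn):
    """Wrap a handler so the library, not the handler, runs the callback."""
    def wrapper(event, context):
        result = fn(event, context)
        callback = (context or {}).get("callback")
        if callback:  # the function body never needs to know about this
            REGISTRY[callback](result, {})
        return result
    REGISTRY[fn.__name__] = wrapper
    return wrapper


def invoke_with_callback(function_name, payload, callback_name):
    """Invoke a function, naming a follow-on function for its output."""
    return REGISTRY[function_name](payload, {"callback": callback_name})


@orchestrated
def double(event, context):
    return {"value": event["value"] * 2}


results = []


@orchestrated
def record(event, context):
    results.append(event["value"])
```

Even in this toy, the trouble with nesting shows: chaining three functions means threading callback names through every hop, and the flow is scattered across the codebase rather than visible in one place.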

During my research, I came across an excellent presentation on event-driven architectures from Jonas Bonér, including his opinions about callbacks.

Additionally, in most languages where callbacks are common, the callbacks are defined in the calling code — creating a closure. With Lambda, the callback is probably a separate function, so it’s as far from a closure as you can get — the callback code is probably defined somewhere else in the codebase!

Next, I attempted a promise-like style of Lambda invocation, where the library allowed the function code to be defined with explicit state and “before” and “after” sections of code around the invocation. The library would then be responsible for stashing the state in an external store — and on the callback invocation, rehydrating the state and skipping over the “before” code.
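A sketch of what that might have looked like (all names hypothetical; an in-memory dict stands in for the external store such as DynamoDB or S3, and the resume token is hardcoded rather than generated):

```python
# Stand-in for an external state store.
STATE_STORE = {}


def run_step(event):
    """Dispatch: fresh invocations run "before"; callback invocations
    rehydrate the stashed state and skip straight to "after"."""
    token = event.get("resume_token")
    if token is None:
        state = before(event)
        STATE_STORE["token-1"] = state  # stash state before invoking
        # ...here the library would invoke the downstream Lambda,
        # passing "token-1" along so the callback can carry it back...
        return {"resume_token": "token-1"}
    state = STATE_STORE.pop(token)      # rehydrate; "before" is skipped
    return after(state, event["result"])


def before(event):
    return {"order_id": event["order_id"], "attempts": 1}


def after(state, result):
    return {"order_id": state["order_id"], "status": result}
```

The awkwardness is visible even here: the developer has to carve their handler into library-mandated pieces, which is exactly the loss of elegance described below.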

Despite my best efforts, I couldn’t find an elegant approach to this solution.

A Lightweight Step Functions

So, proper orchestration is needed — but not Step Functions. I was in luck: the specification for the States Language used to define Step Functions is documented. So I decided to try to create a ClientContext-based implementation of the States Language — a Lightweight Step Functions.

It turns out this was relatively straightforward! The States Language defines a number of different states. There’s one, Task, for user-provided work, and the others are all control states — terminal states, branching, delays, etc.
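For reference, a minimal States Language definition: one Task state (the Resource is a placeholder ARN) followed by a terminal Succeed state.

```json
{
  "StartAt": "Resize",
  "States": {
    "Resize": {
      "Type": "Task",
      "Resource": "arn:aws:lambda:us-east-1:123456789012:function:Resize",
      "Next": "Done"
    },
    "Done": { "Type": "Succeed" }
  }
}
```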

Since the Task states already cause Lambdas to run, I could piggyback on those Lambdas to process the control flow. There’d be a special “invoke” Lambda that the definition would be passed to — it would kick off the state machine, and everything from there would be handled by the Task Lambdas.
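Here is a toy interpreter showing the piggybacking idea (names hypothetical): each Task Lambda’s wrapper runs the user code, then walks the definition through any control states until it reaches the next Task or a terminal state. A local TASKS registry stands in for asynchronously invoking the next Task Lambda, which is where the definition would travel in ClientContext:

```python
# Maps Resource names to local callables, standing in for Lambdas.
TASKS = {}


def task(resource):
    """Register a local stand-in for a Task Lambda."""
    def register(fn):
        TASKS[resource] = fn
        return fn
    return register


def run_state(machine, state_name, data):
    """Execute a Task, then advance through states until a terminal one."""
    states = machine["States"]
    while True:
        state = states[state_name]
        if state["Type"] == "Task":
            # Run the user code locally; the real library would instead
            # invoke the next Task Lambda asynchronously at this point.
            data = TASKS[state["Resource"]](data)
        elif state["Type"] == "Succeed":
            return data
        # (A real implementation would also handle Choice, Wait,
        # Fail, Parallel, and so on.)
        if state.get("End"):
            return data
        state_name = state["Next"]


@task("resize")
def resize(data):
    return {**data, "resized": True}
```

Because the control-flow logic lives in a wrapper shared by every Task Lambda, no coordinator function runs for the length of the orchestration — the state machine hops from Task to Task.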

I got a little way into the implementation, finishing the Task and terminal states. I had local testing using threads for async dispatch, and was testing the Lambdas using synchronous invocation so I could easily view the results.

But then, tragedy struck.

I switched to async invocation — and nothing worked.

It turns out the ClientContext is ignored for Lambda invocations of the “Event” type (i.e., asynchronous). But synchronous invocation completely defeats the purpose of what I’m trying to accomplish — a loosely-coupled and high-performance serverless architecture.

So — all this work was for naught. Or, at least, until this functionality changes. I can’t recommend looking at the code, since it doesn’t achieve the stated goal, but I’ve put it on GitHub for now. Once ClientContext works with async invocations, I’ll pick it back up.

I’m still of the opinion that orchestration is superior to in-code direction of control flow for serverless architectures. Whether that orchestration should be declarative, like the States Language, or imperative, like IBM Cloud Composer, is an open question — and maybe a wash.

We need orchestration for workflows that must be auditable, durable, and reliable — and for those, Step Functions is a great solution. We also need orchestration for cheap, fast-and-loose workflows, simply to make our system design more straightforward and easier to implement and maintain — and this is where we need better platform support for orchestrating Lambda functions.

Questions or thoughts? Let me know in the comments below or on Twitter.

Cloud Robotics Research Scientist at @iRobot
