You’ve nailed it here. All the existing platforms only provide an update-in-place mechanism (even with stages and versioning — the update of the alias is in-place), which is a white-knuckle lever to pull at scale.
Because we deploy a full copy of the system and switch clients between endpoints, I’d actually prefer this functionality to exist in API Gateway, rather than in the Lambda functions. I wrote up how I think this could be effective a while ago: https://medium.com/@ben11kehoe/enabling-blue-green-deployments-and-split-testing-on-api-gateway-a-modest-proposal-2f3b6f2f7e1c
In absence of that functionality, we use a separate service discovery service that clients use to get the endpoint, and that service controls the rollout.