AWS CodeArtifact should be the place to store AWS code artifacts

There are myriad APIs in AWS services that allow services to accept large and/or binary content from you. Zip files for your Lambda functions, images for Rekognition, CloudFormation templates, etc. All of them have one thing in common: that content has to be provided as an S3 object. I think this leaves a lot to be desired.

First, accounts do not come with an S3 bucket to use for this purpose. There’s no “default” bucket you can use. If you use the CloudFormation console, your uploaded template file gets whisked away to an AWS-owned bucket. If you use AWS SAM CLI, it will create a CloudFormation stack with a bucket in it configured for this purpose.

And let’s talk about that: configuring S3 buckets for storing artifacts. S3 is an amazing service, but it serves a multitude of purposes, including making content available to the public. That is very useful functionality, but too often it is enabled unintentionally. Wouldn’t it be great if you never had to worry about the content you’re storing to transfer into AWS services being made public?

Objects in S3 can be versioned, and you can use ETags to check whether objects have changed. In practice, though, people rely on other mechanisms, like including a content hash in the object key, to avoid re-uploading large files that are already present. All of this is layered on top of S3’s capabilities by convention rather than baked into the service.
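
The hash-in-key convention described above can be sketched roughly as follows. This is only an illustration: a plain dict stands in for the bucket, and the "artifacts/" prefix, the ".zip" suffix, and the function names are assumptions made up for the example. A real implementation would check for the key in S3 before uploading.

```python
import hashlib
from typing import Tuple


def content_key(data: bytes, prefix: str = "artifacts/", suffix: str = ".zip") -> str:
    """Derive an object key from the SHA-256 of the content.

    This is a naming convention layered on top of S3, not an S3 feature.
    """
    return f"{prefix}{hashlib.sha256(data).hexdigest()}{suffix}"


def upload_if_missing(bucket: dict, data: bytes) -> Tuple[str, bool]:
    """Store `data` under its content-derived key unless that key already exists.

    `bucket` is a dict standing in for an S3 bucket (hypothetical).
    Returns (key, uploaded), where `uploaded` is False if the upload was skipped.
    """
    key = content_key(data)
    if key in bucket:
        return key, False  # identical content is already present
    bucket[key] = data
    return key, True
```

Because the key is a pure function of the content, a second upload of the same bytes is recognized and skipped, but nothing in S3 itself enforces or even knows about the convention.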

Now, what happens when you’ve told a service about some content you’ve got in S3 for it? AWS Lambda copies it over internally, where it is stored for free. This has three consequences: any metadata you had on the source artifact in S3 is not carried over, the original S3 location is not made available, and the total storage you’re provided within Lambda is limited.

A different model is AWS RoboMaker, which often involves large (multi-GB) application packages. You provide RoboMaker with the S3 location when defining a job, and only the reference is stored in RoboMaker; the artifact is loaded from the original S3 location whenever a job is started. While this solves the above problems with Lambda, it also means that you (or some colleague trying to be tidy) can delete the artifact from S3, without S3 complaining, and only later when RoboMaker tries to access it will you find out it was still required by a job.

Finally, the constraint that artifacts come from an S3 bucket in the same region is inconvenient. It pushes the complexity of cross-region replication onto users.

I’d like to propose a solution. AWS CodeArtifact exists to provide managed repositories for various language-specific package formats (Maven, PyPI, npm, etc.). CodeArtifact should create a new type of repository specifically for this AWS-native-artifact use case. It would look a lot like a (purposely) very simplified version of S3, but would also have features that S3 does not provide. I would put forward the following requirements:

  1. There is a default repository that always exists. Additional repositories may be created.
  2. An artifact has a hierarchical ID (like an S3 object key). Artifacts are always referred to by ARN.
  3. All uploads for an artifact are immutable and versioned. The ARN for an artifact by its ID does not reference any content; artifact content ARNs should be separate and inherently include a version.
  4. All artifact versions are content-addressable. Artifact versions can be referenced by an ARN using a content-addressable hash in addition to the ID+version ARN.
  5. On upload, a hash of the content can be provided in the request, and the service will short-circuit the upload if that content has already been uploaded.
  6. Other AWS services use artifacts by reference; they retain and display the artifact ARN that was provided to the service.
  7. Artifacts are reference-counted and linked as they are used by other services, so that artifact deletion and lifecycling can safeguard in-use artifacts.
  8. An artifact in region A can be accessed through the CodeArtifact API in region B, with CodeArtifact handling cross-region replication and caching behind the scenes.
  9. Cross-account access is supported through both resource policies and Resource Access Manager.
  10. A repository (except the default repository) can be backed by a user’s S3 bucket.
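
To make a few of these requirements concrete, here is a toy in-memory model of requirements 3, 4, 5, and 7. It is a sketch under stated assumptions, not a proposed API: the class, its method names, and the choice of SHA-256 as the content hash are all hypothetical, and ARNs, regions, and access control are left out entirely.

```python
import hashlib
from dataclasses import dataclass, field


@dataclass
class ArtifactRepository:
    """Toy in-memory model of the proposed repository type (not a real AWS API)."""

    blobs: dict = field(default_factory=dict)      # sha256 hex digest -> content bytes
    versions: dict = field(default_factory=dict)   # artifact ID -> digests; version N is index N-1
    consumers: dict = field(default_factory=dict)  # (artifact ID, version) -> set of consumer ARNs

    def upload(self, artifact_id, data=None, sha256=None):
        """Add a new immutable version and return its version number.

        If only `sha256` is given and that content is already stored,
        the bytes do not have to be sent again (requirement 5).
        """
        if data is not None:
            sha256 = hashlib.sha256(data).hexdigest()
            self.blobs.setdefault(sha256, data)
        elif sha256 not in self.blobs:
            raise KeyError("content not stored yet; upload the bytes")
        history = self.versions.setdefault(artifact_id, [])
        history.append(sha256)
        return len(history)

    def digest(self, artifact_id, version):
        """Resolve ID + version to its content digest (requirements 3 and 4)."""
        history = self.versions.get(artifact_id, [])
        if not 1 <= version <= len(history) or history[version - 1] is None:
            raise KeyError(f"no such version: {artifact_id} v{version}")
        return history[version - 1]

    def link(self, artifact_id, version, consumer_arn):
        """Record that another service's resource uses this version (requirement 7)."""
        self.digest(artifact_id, version)  # raises KeyError if the version is missing
        self.consumers.setdefault((artifact_id, version), set()).add(consumer_arn)

    def unlink(self, artifact_id, version, consumer_arn):
        """Drop a consumer's reference, e.g. when the consuming resource is deleted."""
        self.consumers.get((artifact_id, version), set()).discard(consumer_arn)

    def delete_version(self, artifact_id, version):
        """Delete a version, refusing while any consumer still references it."""
        self.digest(artifact_id, version)
        in_use = self.consumers.get((artifact_id, version))
        if in_use:
            raise RuntimeError(f"still referenced by {sorted(in_use)}")
        # Keep the slot so that version numbers are never reused.
        self.versions[artifact_id][version - 1] = None
```

In this model, the RoboMaker scenario described earlier cannot happen silently: a job that calls link() on its application package causes delete_version() to fail until the job is removed and unlink() is called.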

I would not expect it to support OCI images; OCI has a mature artifact ecosystem, including ECR.

I think it should probably include version aliases, but I think these should resolve into definite versions at time of use, perhaps in a way that keeps the alias use recorded. It’s not a good idea to tie deployed resources implicitly into changing source; that’s what your CI/CD pipeline is for. CodeArtifact should send service events to EventBridge for alias updates; users could then kick off CI/CD processes based on those events, for example checking where that alias is being referred to and directing those places to update to the new version the alias points to. So the functionality would be more like a “bookmark” than an “alias”, I guess.
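
A rough sketch of that “bookmark” behavior, assuming integer version numbers: resolving a bookmark yields a fixed version plus a record of which bookmark was used, and moving a bookmark only emits an event. The class names and event fields here are invented for illustration, and the events list is a stand-in for whatever CodeArtifact would actually send to EventBridge.

```python
from dataclasses import dataclass, field
from typing import Optional


@dataclass(frozen=True)
class PinnedReference:
    """What a consuming service would store: a definite version, plus provenance."""
    artifact_id: str
    version: int
    via_bookmark: Optional[str] = None


@dataclass
class Bookmarks:
    targets: dict = field(default_factory=dict)  # (artifact ID, bookmark name) -> version
    events: list = field(default_factory=list)   # stand-in for events sent to EventBridge

    def set(self, artifact_id, name, version):
        """Point a bookmark at a version and record an update event."""
        previous = self.targets.get((artifact_id, name))
        self.targets[(artifact_id, name)] = version
        self.events.append({
            "artifact_id": artifact_id,
            "bookmark": name,
            "previous_version": previous,
            "version": version,
        })

    def resolve(self, artifact_id, name):
        """Resolve at time of use; the result never changes afterwards."""
        return PinnedReference(artifact_id, self.targets[(artifact_id, name)], via_bookmark=name)
```

A resource that resolved “stable” while it pointed at version 3 stays on version 3 after the bookmark moves to version 4. The recorded via_bookmark field is what would let a CI/CD process, triggered by the update event, find that resource and redeploy it deliberately.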

I would want more services to make use of artifacts in this way. Today, the way different service APIs specify S3 objects is inconsistent, and there are lots of APIs that take a large amount of configuration only as API parameters. CodePipeline should use CodeArtifact by default to store a pipeline’s artifacts, but the pipeline definition should be able to be provided as an artifact as well (same with CodeBuild project definitions). aws cloudformation deploy and SAM CLI could have a generic mechanism for uploading files/zips for any resource that has a property that takes an artifact ARN.

An artifact repository with these requirements would address the limitations of S3 described above, as well as providing a common platform for improvements to artifact generation and governance. Code signing, for example, could be implemented in CodeArtifact and then provided for other AWS services to leverage.

Given the number of AWS services that we have to upload various kinds of artifacts to, and all the particulars of artifact management, we deserve something more purpose-built than S3 that doesn’t require any setup, doesn’t come with major opportunities for insecure misconfiguration, and can provide features dedicated to code artifact management.

Cloud Robotics Research Scientist at @iRobot