In the last few years, we’ve witnessed explosive growth in the role machine learning (ML) plays in technology. Making good predictions from data has always been important in our industry, but modern machine learning techniques allow us to be much more systematic. However, this wealth of new ML algorithms and services presents new challenges for software developers.
- Should we use managed ML services from Cloud providers or build our own?
- How can we architect our infrastructure to have the most flexibility while we refine our ML system?
- Do we train our ML algorithms in batch or perform streaming updates?
- How fast can we make predictions?
- How do we ensure our system scales as we add more users?
The Acceleration Research and Technology (ART) group faced these questions, and many others, in the course of building an ML service that predicts the structure of the internet. In this post, I’ll give a brief overview of Lift, an architecture we developed at ViaSat for building scalable ML services, and techniques we’ve found useful for managing those services.
Our Design Goals
The space of ML technologies is very diverse, with many different models of computation and performance guarantees. Before we started developing Lift, it was important to specify what we wanted to accomplish:
- Realtime: When fully deployed, our system will ingest thousands of records every minute. We want this new data to be immediately incorporated into our mathematical models.
- Scalable and Highly Available: We want our service to scale arbitrarily, serving requests from millions of users at the very least.
- Fast Predictions: In many applications, predictions are only useful if we can reply quickly (<10 milliseconds from receiving a request).
- Flexible: We’re always improving our algorithms so we want an architecture that lets us get our best discoveries into production quickly.
- Reusable: Many groups need tools to continuously extract predictions from vast flows of data (for example, detecting and eliminating malware that might be sent over our network). We want to save labor by building a platform where developers can focus on solving their core ML problem and leave the scale/service/infrastructure issues to us.
To meet these goals, we’ve developed Lift, an architecture for scalable, realtime ML services, and a collection of tools and techniques for creating and managing those services.
The Lift Architecture
Lift decomposes the machine learning process into three phases, Enrich, Train, and Infer, each supported by its own cluster of servers. These services have been implemented on AWS (Amazon Web Services) and leverage AWS’s managed services to achieve scale.
The Enrich cluster receives feedback records containing the raw data we’ll process in order to make predictions. Enrich is responsible for sanitizing feedback records, recording instrumentation for our real-time monitoring systems, and transforming the records in preparation for storage in S3. In this way, Enrich uses S3 to create a clean “history” of the data our system processes. This history is invaluable for analysts who need to backtest new algorithm ideas on historical data.
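To make the Enrich step concrete, here is a minimal sketch of what sanitizing and transforming a feedback record might look like. The field names and schema are hypothetical, not Lift’s actual record format:

```python
import json
from datetime import datetime, timezone
from typing import Optional

REQUIRED_FIELDS = {"user_id", "timestamp", "payload"}  # hypothetical schema

def sanitize(record: dict) -> Optional[dict]:
    """Drop malformed records and normalize fields for downstream storage."""
    if not REQUIRED_FIELDS.issubset(record):
        return None  # reject records missing a required field
    return {
        "user_id": str(record["user_id"]).strip(),
        "timestamp": record["timestamp"],
        "payload": record["payload"],
    }

def enrich(record: dict) -> Optional[str]:
    """Return a newline-delimited JSON row, ready to land in an S3 "history" object."""
    clean = sanitize(record)
    if clean is None:
        return None
    clean["ingested_at"] = datetime.now(timezone.utc).isoformat()
    return json.dumps(clean)
```

Rejecting bad records at this stage keeps the S3 history clean, so backtests downstream never have to re-validate raw input.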
The Train cluster is responsible for distilling Enrich’s output into models. A model is a serializable data structure that aggregates the important statistics we’ve gathered about a particular topic or domain. When new feedback records arrive from Enrich, Train pulls the relevant model from our Elastic File System (EFS), updates the statistics with the feedback’s information, and writes it back. This streaming approach means updates to our models are very fast.
The Infer cluster is responsible for making our predictions available to the outside world. It receives prediction requests, which record the details for the type of prediction the requester wants. In our application, we have to reply quickly to prediction requests to provide any benefit. That’s why models and EFS’s fast read times are so important: they allow Infer to quickly select the right model to address the prediction request, read the appropriate statistics to make a prediction, and reply.
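The key property of the Infer path is that no training work happens in it: a reply is assembled straight from precomputed statistics. A minimal sketch, assuming the toy count/mean model shape described above:

```python
def infer(model_stats: dict, request: dict) -> dict:
    """Answer a prediction request from precomputed statistics (hypothetical model shape)."""
    stats = model_stats.get(request.get("key"))
    if stats is None:
        return {"prediction": None, "reason": "no model for this key"}
    # No training work on the request path: reply straight from stored statistics.
    return {"prediction": stats["mean"], "based_on": stats["count"]}
```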
Finally, each Lift instance has a special test cluster called Loader. This cluster gives us the ability to simulate a population of users generating feedback and prediction requests, to load test the system. These tests are key to verifying the system can scale up (and scale down) depending on the level of user demand.
Decomposing our system this way has had many scalability benefits. By utilizing S3 and EFS for persistence, we can store unlimited amounts of data and avoid the storage limitations associated with most databases. Likewise, decomposing our service into train, infer, and enrich clusters makes it easy for each cluster to scale independently, depending on whatever resource (CPU, RAM, or Network) is most relevant to its work.
By scaling up our load cluster, we can simulate arbitrarily large populations of users and stress-test our builds before they ever reach customers. We can also confirm that our system scales down during less busy periods, so that we don’t pay for more servers than we need. Moreover, by running a scaled-down Loader in production, the cluster effectively doubles as an end-to-end test for our system, generating a small baseline of consistent traffic we can use to verify system health in real time.
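The Loader’s job of simulating a user population can be sketched like this. The event shapes and the 80/20 feedback-to-prediction mix are assumptions for illustration, not Lift’s actual traffic profile:

```python
import random

def generate_traffic(n_users, events_per_user, seed=0):
    """Simulate a user population emitting feedback records and prediction requests."""
    rng = random.Random(seed)  # seeded, so load tests are reproducible
    events = []
    for user in range(n_users):
        for _ in range(events_per_user):
            if rng.random() < 0.8:  # assumed 80/20 feedback/prediction mix
                events.append({"type": "feedback", "user": user, "value": rng.gauss(0, 1)})
            else:
                events.append({"type": "prediction", "user": user})
    rng.shuffle(events)  # interleave users, as real traffic would arrive
    return events
```

Scaling the simulated population up stresses new builds; dialing it down to a trickle gives the steady baseline traffic used for health checks in production.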
Lift doesn’t specify exactly what procedures Enrich, Train, and Infer will perform. This is intentional, so that developers who need to build an ML service can simply “plug in” their implementations of these procedures and use our infrastructure creation tools to build their Lift instance, with supporting servers for CI/CD and monitoring.
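One way to picture the “plug in” surface is as an abstract interface a developer implements, while the platform supplies clusters, CI/CD, and monitoring. This is a hypothetical sketch, not Lift’s actual API:

```python
from abc import ABC, abstractmethod

class LiftService(ABC):
    """Hypothetical plug-in surface: the developer supplies the three phases."""

    @abstractmethod
    def enrich(self, record: dict) -> dict: ...

    @abstractmethod
    def train(self, model: dict, record: dict) -> dict: ...

    @abstractmethod
    def infer(self, model: dict, request: dict) -> dict: ...

class CountingService(LiftService):
    """Toy implementation: counts feedback records per user."""

    def enrich(self, record):
        return {"user": record["user"]}

    def train(self, model, record):
        model[record["user"]] = model.get(record["user"], 0) + 1
        return model

    def infer(self, model, request):
        return {"count": model.get(request["user"], 0)}
```

Anything satisfying the interface can ride on the same infrastructure, which is what makes the platform reusable across teams.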
DevOps and Lift
The ART team is not large, but we support a large number of servers. To do that, we’ve recently adopted some DevOps practices that differ slightly from the ones we used to build CI/CD pipelines in the past.
Our implementation of Lift is 100% infrastructure as code, meaning we can create (or tear down) our entire service (including support servers) automatically, in a matter of hours. This has had several benefits.
- Decreased contention for test resources: Now, developers can easily set up their own scaled-down version of production.
- Testable infrastructure changes: We can now test ambitious infrastructure changes in our development accounts. Once a developer has a working change, they can be confident that change will yield working infrastructure in production.
- Simplified security analysis: Since our environment is 100% infrastructure as code, most security analysis can be performed by inspecting our codebase rather than stateful AWS consoles. This makes it much easier to verify our security groups are set up properly.
- Centralized Logging and Monitoring: It’s easy for developers to add an Elasticsearch cluster and Kibana instance to their test environment to collect instrumentation and debug problems. Our code adds a log transport daemon (Filebeat) to every cluster worker, so it’s easy to get instrumentation into the system. In production, that same instrumentation enables us to use Kibana as our primary tool for monitoring most aspects of system performance.
We also utilize immutable infrastructure techniques when we create our Enrich, Train, and Infer clusters. We call this system “blue-green deployment”. Whenever we have a new version to deploy, we stand up fresh clusters behind a private load balancer. We designate this collection of clusters a “green” version of our service, as opposed to the “blue” version exposed to customers on a public load balancer. We then apply a battery of acceptance tests (including load tests) to the green service. If they pass, the green service is promoted to blue, and we atomically cut over our public load balancer to point to our new clusters. The older version of the service is then retired, kept in reserve until our automation determines it is time to tear it down. This gives us one more defense against a service outage, because we can always fail back to the retired version of the system.
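Stripped of the AWS specifics, the promotion flow reduces to a small state machine. This sketch models versions as opaque labels; in reality the cutover is a load-balancer change, not a variable swap:

```python
class BlueGreenDeployer:
    """Sketch of blue-green promotion: test green privately, then cut over atomically."""

    def __init__(self, blue):
        self.blue = blue      # version currently serving public traffic
        self.retired = None   # previous blue, kept in reserve for fail-back

    def deploy(self, green, acceptance_tests):
        """Promote green to blue only if every acceptance test passes."""
        if not all(test(green) for test in acceptance_tests):
            raise RuntimeError("green failed acceptance; blue is untouched")
        self.retired, self.blue = self.blue, green  # the atomic cutover

    def fail_back(self):
        """Restore the retired version after a bad promotion."""
        if self.retired is None:
            raise RuntimeError("no retired version available")
        self.blue, self.retired = self.retired, self.blue
```

The invariant worth noting: a failed acceptance run never touches blue, so customers only ever see versions that have passed the full battery of tests.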
This deployment design has had many benefits for our CI/CD Pipeline. Despite having more strenuous acceptance tests than ever before, we can now put a new version into production in a matter of hours instead of days, with no downtime for deployment and robust defenses against failure.
Building an online machine learning service in the cloud isn’t easy, especially with a small team. Our Lift architecture, and the DevOps techniques that support it, have allowed us to greatly increase our scale while improving our security, developer productivity, and robustness to failure.