ViaSat Web Acceleration (VWA) is a product that provides terrestrial-like web performance for our Exede satellite internet customers. It is a merging and evolution of two ViaSat products: iPEP, a TCP accelerator, and AcceleNet, a web accelerator. Development of VWA started about four years ago as a project in our Acceleration and Research Technology (ART) center in Boston. Notably, it was ART's first project to experiment with continuous integration (CI). We used BuildBot, the same tool the open-source Chromium browser project uses to orchestrate CI builds and unit tests. The team quickly realized the benefits of CI, and we never turned back. At the same time, we started working on an automation framework for automated installs of VWA and automated tests in our lab. We scheduled these tests to run on a regular basis from BuildBot, and the first iteration of a VWA pipeline was born.
Two years ago, we started focusing much more of our effort on CI/CD. We adopted GoCD as our pipeline orchestration tool, primarily because its fan-in/fan-out feature fits perfectly with our pipeline design.
We collapsed all project code branches into one main branch so we could manage the pipeline more easily. We reduced our build time 8-fold in order to give developers fast feedback. We started working on push button deployments using Ansible, with the vision that the same set of deployment scripts would be used for deployments into lab environments as well as production environments. This allows us to test our deployments through the pipeline at all times. We architected the pipeline to include multiple lab environments, with each lab environment running different sets of tests based on the infrastructure, in order to gain broader test coverage.
However, we eventually learned some important lessons about the pipeline. First, building a stable pipeline takes time; in fact, our team is still fixing pipeline issues on a daily basis. Most of the issues are related to unstable labs, which brings me to the second point: running tests through labs that are not built and supported for CI/CD drains a tremendous amount of resources. Our developers spent many hours debugging lab issues instead of fixing actual bugs, because the labs were either unstable or had too many moving parts. To mitigate this, we have started migrating tests from the less stable labs to labs dedicated to the pipeline.
A year ago, VWA went live on Exede with an alpha release. We were one of the first application teams to deploy into Exede using a CI/CD/DevOps model. Needless to say, it was a major culture shift for both the VWA development team and our Exede service operations team in Denver. Working together, we created a "Fast Pass" that allows us to deploy new versions of VWA into Exede without going through the usual formal software release process. Back in the AcceleNet days, we would release our software to Ops and be done. With DevOps, the VWA DevOps team decides what to release, when to release it, and how to release it. There are no more gates imposed by the Ops team, which also means the VWA team is more accountable than ever for delivering quality software to customers quickly. Every deployment is automated with the push of a button. Our goal is to make releases and deployments boring: repeatable, with no surprises. We are not completely there yet, but we have made great strides toward that goal. We spent a lot of effort making sure deployments are truly push-button and resilient to external conditions such as intermittent network issues. We follow a strictly defined promotion policy for every release to ensure that quality software is being released. We have done over 200 deployments into production so far, a major accomplishment for the team.
| Task | Before CI/CD | After CI/CD |
| --- | --- | --- |
| Build | > 2 hr | 15 min |
| Release testing | weeks (mostly manual) | 24 hours (automated) |
| Deploy to one production server | hours (mostly manual, requires making a new image) | 10 min for a single-node cluster; 3 hr for an 18-node cluster |
| Release to production | weeks | 3 days (1 day on alpha, 1 day on beta, rest of production in 1 day) |
In order to support VWA end to end, we focused on making sure that the VWA team has the ability to monitor the production network, since this was the first time we needed high visibility into the production environment. We built an internal monitoring and alerting system that involved adding SNMP traps to VWA to signal critical issues, which would get funneled into Splunk and then trigger notifications to the team. As we continued to roll out VWA to the rest of the production network, we realized that an internal monitoring system alone was not sufficient: we failed to detect issues in production because VWA could get into a bad state where alerts were not being sent. We quickly pivoted and implemented an external monitoring and alerting system called Cluster Doctor that provides us with reliable and accurate data about the state of VWA. It monitors the health of all VWA clusters by periodically polling each node and looking for anomalies. Together with xMatters integration and a 24×7 on-call team, we are able to quickly detect and address critical issues that could impact customers.
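The external-poller approach can be sketched as follows. This is a minimal illustration of the idea, not Cluster Doctor's actual implementation; the node names, the `NodeStatus` shape, and the error-rate threshold are all hypothetical.

```python
# Hypothetical sketch of an external cluster poller: collect a status
# snapshot from each node and flag anything that looks unhealthy.
from dataclasses import dataclass

@dataclass
class NodeStatus:
    name: str
    responsive: bool     # did the node answer the poll at all?
    error_rate: float    # fraction of failed requests in the last poll window

def find_anomalies(statuses, max_error_rate=0.05):
    """Return (node, reason) pairs for nodes that look unhealthy:
    unresponsive, or serving errors above the threshold."""
    anomalies = []
    for s in statuses:
        if not s.responsive:
            anomalies.append((s.name, "unresponsive"))
        elif s.error_rate > max_error_rate:
            anomalies.append((s.name, f"error rate {s.error_rate:.0%}"))
    return anomalies

# Example poll results (illustrative data)
statuses = [
    NodeStatus("vwa-node-1", True, 0.01),
    NodeStatus("vwa-node-2", False, 0.0),
    NodeStatus("vwa-node-3", True, 0.20),
]
print(find_anomalies(statuses))
# → [('vwa-node-2', 'unresponsive'), ('vwa-node-3', 'error rate 20%')]
```

The key design point is that the poller runs outside the monitored system, so a wedged node that can no longer send its own alerts still gets caught.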
In addition to monitoring for critical issues, we also focused on making tools to help us find regression issues from build to build so we can make informed decisions about whether to roll out a build further into the network. We built a pipeline dashboard that gives us the ability to quickly identify a release candidate.
We built Grafana dashboards that allow the team to see the performance of VWA across the production network in real time.
We are tracking crash rates and Mean Time Between Failure (MTBF) for each build.
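As a concrete example of the metric, MTBF is total operating time divided by the number of failures. The function and sample figures below are illustrative only, not our actual production numbers.

```python
def mtbf_hours(total_node_hours, crash_count):
    """Mean Time Between Failures for a build: aggregate operating
    time across all nodes running it, divided by the crash count."""
    if crash_count == 0:
        return float("inf")  # no failures observed yet
    return total_node_hours / crash_count

# e.g. a build that accumulated 5,000 node-hours with 4 crashes
print(mtbf_hours(5000, 4))  # → 1250.0
```

Comparing this number across consecutive builds is what lets us spot a stability regression before rolling a build out further.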
We are processing crash reports in bulk and categorizing them automatically, which allows the team to prioritize the highest-impact issues first. The team is more hands-on than ever before, and this has led us to find unexpected issues as well. For example, by processing VWA crashes we identified that many fielded terminals needed to be replaced.
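Bulk categorization can be as simple as bucketing crashes by a signature such as the top stack frame and sorting the buckets by size. The sketch below is illustrative; the report format and frame names are hypothetical, not VWA's actual crash schema.

```python
# Illustrative crash bucketing: group reports by a signature
# (here, the top stack frame) and rank categories by frequency.
from collections import Counter

def categorize(crash_reports):
    """Return (signature, count) pairs, largest bucket first."""
    buckets = Counter(report["top_frame"] for report in crash_reports)
    return buckets.most_common()

# Hypothetical sample reports
reports = [
    {"top_frame": "tcp_accel::flush"},
    {"top_frame": "http_parser::parse"},
    {"top_frame": "tcp_accel::flush"},
]
print(categorize(reports))
# → [('tcp_accel::flush', 2), ('http_parser::parse', 1)]
```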
Being on call 24×7 is a major culture change for the team. Fortunately, xMatters made the transition less dramatic. We have four on-call teams, each with three members, covering around the clock. Teams alternate between weekdays and weekends; for example, team 1 covers weekdays during week 1 and then the weekend during week 3.
Rotation also happens within each team of three: the primary responder becomes the last responder the next day.
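The team-level rotation can be expressed as simple modular arithmetic. This is a sketch consistent with the example above (team 1: weekdays in week 1, weekend in week 3); the fixed two-week offset between a team's weekday and weekend shifts is an assumption on my part, not a documented rule.

```python
# Hypothetical model of the four-team rotation described above.
TEAMS = ["team 1", "team 2", "team 3", "team 4"]

def weekday_team(week):
    """Team covering weekdays in a given week (weeks are 1-indexed)."""
    return TEAMS[(week - 1) % len(TEAMS)]

def weekend_team(week):
    """Assumed rule: a team's weekend shift trails its weekday
    shift by two weeks (team 1: weekdays week 1, weekend week 3)."""
    return TEAMS[(week - 3) % len(TEAMS)]

print(weekday_team(1), weekend_team(3))  # → team 1 team 1
```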
The challenge is to keep the number of false alarms down so the team does not suffer alert fatigue. There have been many instances where the monitoring system signaled an issue and the on-call team was notified, only to find that the issue was not related to VWA and there was nothing we could have done; examples include neighboring subsystems being down, maintenance events, and weather outages. We constantly keep an eye on team burnout and the effectiveness of escalations by keeping an on-call journal for every incident, tracking metrics such as the number of escalations and the percentage of false alarms, performing regular retrospectives, and identifying and prioritizing areas of improvement.
We learned that in order to do DevOps effectively, communication between the development team and the service operations team must go both ways. For example, when the VWA team gets an escalation and plans to perform system recovery (e.g., resetting a node), the plan needs to be communicated to make sure it does not conflict with other activities. In the other direction, the ops team should communicate timely and reliable information about neighboring network conditions that could be causing VWA alerts.
CI/CD/DevOps is a continuous improvement process and the team is constantly adapting to new processes and new ways of thinking as we take our lessons learned along the way. I am confident that we will take the lessons learned from VWA and apply them to the rest of the organization as we transition into DevOps.