My name is Stephan Kemper. Today, I lead ViaSat’s Cloud Engineering team, called VICE. VICE is responsible for helping groups all over ViaSat learn what it means to develop for the cloud, as well as supporting and protecting the cloud networks that ViaSat operates. There are at least a dozen groups using public cloud platforms today, with more being added to the list all the time. By far the majority of this work takes place on the excellent Amazon Web Services platform, though we also use Microsoft Azure and OpenStack to some extent. To kick this blog off, I wanted to tell the story of how ViaSat got here, and the things we learned along the way.
Network One: Turning the (Arc)Light On
In September 2010 I was fresh out of college and a newly minted member of the ArcLight program. Our lead engineer, a recent refugee from Silicon Valley, had brought with him some crazy ideas about collecting reams of dense log messages from modems on the network and analyzing them to provide better insights into service and endpoint health.
This represented many terabytes of data, far too much for a relational database at the time. We ended up deciding to ship and store the logs using two relatively new Apache projects, Flume and Hadoop. The whole thing would be built out in AWS, to minimize risk and reduce our time-to-deployment; procuring the right hardware and installing it in a datacenter for an experimental idea didn’t make much sense.
This very first AWS network used what is now called EC2 Classic. For the machines, we built a custom machine image (AMI) based on a clean CentOS install, spun up some m1.large instances, attached a few EBS drives to them, and called it ready. To get the data out of the ArcLight network and into Amazon, we used OpenVPN software in each ground station to create private tunnels to a concentrator box in EC2.
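Each of those tunnels was a simple point-to-point link. As a rough illustration, here is what the concentrator side of an OpenVPN static-key tunnel can look like — a minimal sketch, with hypothetical tunnel addresses and key path, not our actual configuration:

```
# Concentrator side of a point-to-point tunnel (sketch; the tunnel
# addresses and key path below are hypothetical)
dev tun
proto udp
port 1194
# local / remote tunnel endpoint addresses
ifconfig 10.8.0.1 10.8.0.2
# pre-shared key, generated with: openvpn --genkey --secret static.key
secret /etc/openvpn/static.key
keepalive 10 60
persist-key
persist-tun
```

The ground-station side mirrors this, with the `ifconfig` addresses reversed and the concentrator’s address as its `remote`.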
Ah, the simple days…
As a proof of concept, this did the trick: we got the data we wanted, came up with some great analyses, and proved the value of our approach. In fact, the project was so successful we won approval to build something similar for the new satellite service ViaSat was launching in early 2012. We had only a few months to come up with a plan and get servers in place for a January go-live date!
Network Two: Exede-ic Boogaloo
Operating in a much higher-profile setting meant we had a lot of upgrades to do. First, we moved the entire network into a Virtual Private Cloud (VPC). This change gave us much better control over Security Groups, and better isolation from the Internet for most of our systems. At the same time, we dropped our OpenVPN setup in favor of Amazon’s VPN Connection service, linked to dedicated VPN appliances in each of the Exede network’s interconnects. The managed service made our connections much more stable, and greatly reduced our administrative burden. Finally, we added a VPN connection back to ViaSat’s corporate network, so engineers and data analysts could access the systems as well.
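A Security Group is essentially a stateful firewall on each instance’s virtual network interface, and moving into a VPC let us tighten its ingress rules considerably. As a sketch of what “better control” means in practice — using today’s boto3 API, with hypothetical ports and CIDR ranges — here is the `IpPermissions` structure that `authorize_security_group_ingress` accepts:

```python
# Sketch: build the IpPermissions list that boto3's
# authorize_security_group_ingress call accepts.
# The ports and CIDR range below are hypothetical examples.

def ingress_rule(protocol, port, cidr, description=""):
    """Allow `protocol` traffic to `port` from the `cidr` range."""
    return {
        "IpProtocol": protocol,
        "FromPort": port,
        "ToPort": port,
        "IpRanges": [{"CidrIp": cidr, "Description": description}],
    }

# Nothing open to 0.0.0.0/0; SSH and HTTPS only from the corporate VPN range.
rules = [
    ingress_rule("tcp", 22, "10.0.0.0/8", "SSH from corporate VPN"),
    ingress_rule("tcp", 443, "10.0.0.0/8", "HTTPS from corporate VPN"),
]

# With AWS credentials configured, this would be applied roughly as:
#   import boto3
#   ec2 = boto3.client("ec2")
#   ec2.authorize_security_group_ingress(GroupId="sg-...", IpPermissions=rules)
```

The same structure works for egress rules via `authorize_security_group_egress`.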
Another big focus was upgrading the operational security of the network. We built a centralized authentication system around Microsoft’s Active Directory, using a subdomain delegated to us by ViaSat’s IT department. With that in place, we could do things like give everyone their own account, enforce password complexity and age requirements, and use DNS in the network. It was a huge step up from everyone sharing a single SSH key and connecting to systems by IP address.
Sadly, all good things must come to an end. In late June that year, a major storm hit Virginia, where our servers were living inside AWS’s us-east-1 region. During the storm, a major failure in that region brought down at least one Availability Zone (AZ), which coincidentally contained most of our infrastructure. We had ignored AWS’s exhortations to split applications across multiple AZs, which cost us a lengthy data outage and many hours of recovery work.
Network Three: Going Multi-AZ
For the next several months, we turned our attention to ensuring all our applications, no matter how small, were distributed across multiple AZs. For most applications, we decided two AZs were enough; AD, Flume, and other support services were in this bucket. Hadoop we split across three AZs, for a specific reason: we wanted to make use of its rack-awareness features. With a data replication factor of 2, and Hadoop ensuring that no two replicas were in the same AZ, we could protect ourselves from data loss in the event an entire AZ went down.
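For anyone curious how the AZ-as-rack trick works: Hadoop’s rack awareness calls out to an executable (configured via `topology.script.file.name` in the Hadoop 1.x of that era), passing node IPs as arguments and reading one rack path per line from stdout. By reporting each AZ as a “rack,” the replica-placement rule above holds. A minimal sketch of such a script — the subnet-per-AZ layout here is hypothetical:

```python
#!/usr/bin/env python
# Sketch of a Hadoop topology script. Hadoop passes one or more node
# IPs as arguments and expects one rack path per line on stdout.
# The subnet-to-AZ mapping below is a hypothetical example.
import sys

SUBNET_TO_RACK = {
    "10.0.1": "/us-east-1a",
    "10.0.2": "/us-east-1b",
    "10.0.3": "/us-east-1c",
}

def rack_for(ip):
    """Map an instance IP to a 'rack' path named after its AZ."""
    subnet = ".".join(ip.split(".")[:3])  # first three octets
    return SUBNET_TO_RACK.get(subnet, "/default-rack")

if __name__ == "__main__":
    for ip in sys.argv[1:]:
        print(rack_for(ip))
```

With this in place, HDFS treats “different rack” as “different AZ” when it spreads replicas around.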
The monumental task of getting all this done also showed us our need for automation. We started writing Chef cookbooks to build DataNodes, TaskTrackers, and other servers. It took a while, but we eventually got all of our systems laid out in cookbooks.
Network Four: A New Road
Chef and “the DevOps way” eventually led us to an inescapable conclusion: Active Directory would have to go. Its model just didn’t fit in a cloud environment that was 99% Linux servers, with automated deployments happening all the time. So we designed and built a new DNS and authentication service around OpenLDAP and BIND. Both the authentication and DNS data sets would be stored in LDAP, making replication much simpler. A new DNS service required a new domain name, and viasat.io was born.
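For a flavor of what “DNS in LDAP” looks like: BIND can be pointed at LDAP through backends built around the dNSZone schema, where each record set is an ordinary LDAP entry. A hypothetical example entry — the DN layout, hostname, and address are illustrative, not our actual tree:

```
# Hypothetical dNSZone entry for host1.viasat.io -> 10.0.1.5
dn: relativeDomainName=host1,ou=dns,dc=viasat,dc=io
objectClass: dNSZone
zoneName: viasat.io
relativeDomainName: host1
dNSTTL: 300
aRecord: 10.0.1.5
```

Because records are just entries, the existing LDAP replication machinery carries DNS data along for free.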
Around the same time, a few other groups at ViaSat were starting to use AWS, and they wanted to hook into our system. We wanted to minimize the risk flaky VPN connections posed to these teams, so we set up a pair of master LDAP servers in a dedicated VPC and built slave servers in each constituent VPC. Each team also got its own DIT subtree, so they could manage their own LDAP groups, service accounts, and DNS entries.
I’d like to say switching systems was smooth sailing, but I’d be lying. It took us several years to finish the migration from AD to OpenLDAP; when it finally happened, our team threw a party, and brought in a big ice cream cake. In the meantime, though, the number of other groups using our OpenLDAP system had exploded.
Network Five: The viasat.io Platform
As more and more ViaSat groups moved out onto cloud platforms, the need for a self-service, extensible platform for network services like DNS and auth became intense. The plethora of programs brought with them an equally varied set of use cases, including the need to deploy services worldwide. We started looking at what it would take to make viasat.io truly ready for that, and decided to go for it.
Amazon’s platform had come a long way since we started. VPC peering was available, so we decided to place platform hubs in strategic AWS regions, which local VPCs could talk to through the peering connections. We also had to redesign our LDAP system to a truly multi-master model: each hub would need its own master server(s), which could replicate changes to its sister hubs.
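In OpenLDAP terms, multi-master means running syncrepl in mirror mode: each hub’s master carries a unique serverID and one syncrepl stanza per sister hub. A sketch of what one such slapd.conf fragment can look like — the hostnames, DNs, and credentials here are hypothetical:

```
# Hub 1 of an N-way multi-master setup (sketch; names are hypothetical).
# Each server gets a unique serverID and one syncrepl stanza per peer hub.
serverID 1
syncrepl rid=002
  provider=ldap://ldap-master.hub2.viasat.io
  type=refreshAndPersist
  searchbase="dc=viasat,dc=io"
  retry="30 +"
  bindmethod=simple
  binddn="cn=replicator,dc=viasat,dc=io"
  credentials=CHANGEME
mirrormode on
```

With a stanza like this per peer on every hub, a write accepted at any hub eventually replicates to all of its sisters.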
ViaSat’s internal systems had come a long way too: our IT department had built several OpenStack environments, and many people using that service wanted access to viasat.io. Supporting both platforms meant we couldn’t use some managed services that AWS offers, but being able to serve the whole company is definitely worth it.
Today, we’ve deployed three hubs, all in the United States. Teams large and small from all corners of ViaSat use the viasat.io Platform for their own needs, which in turn drive future development of the platform. For example, we’ll soon be expanding into overseas AWS regions to support ViaSat’s Wi-Fi service on El Al.
On the services side, we have moved our DNS data out of LDAP and into a global Cassandra database, which we hope will serve as the backbone for future services like PKI, security scanning applications, and centralized logging and monitoring. The team is also putting together a common web service for clients to manage their services, so we get ourselves more and more out of the way, and let ViaSat engineers do what they do best: come up with crazy ideas, experiment, and make something great.
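To illustrate one way DNS data can be laid out in Cassandra — this table definition is a hypothetical sketch, not our production schema — partitioning by zone and name keeps all of a host’s record sets together on the same replicas, while a NetworkTopologyStrategy keyspace replicates them across datacenters:

```
-- Hypothetical sketch of a DNS record layout (not the production schema).
-- Datacenter names in the replication map must match the cluster's.
CREATE KEYSPACE IF NOT EXISTS dns
  WITH replication = {'class': 'NetworkTopologyStrategy',
                      'us_west': 3, 'us_east': 3};

CREATE TABLE IF NOT EXISTS dns.records (
  zone   text,
  name   text,
  rrtype text,       -- A, AAAA, CNAME, ...
  ttl    int,
  rdata  set<text>,  -- one or more record values
  PRIMARY KEY ((zone, name), rrtype)
);
```

A lookup for one name then hits a single partition, regardless of how many record types it has.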
ViaSat now has a dedicated Cloud Engineering team that works on the viasat.io Platform. If any of this has tickled your fancy, check out our open positions!