Jeromy Carriere, Rick Buskens, and Jack Greenfield, Google
Evolving Mission-Critical “Legacy” Systems, Rick Buskens
Buskens’s team is a multisite team that works on a suite of projects; some focus on Google’s internal infrastructure, while others are external facing and cloud based. The infrastructure for running services at Google is built on Borg, a cluster-management system that runs hundreds of thousands of jobs, from thousands of applications, across clusters of tens of thousands of machines. Borg is an internal cloud infrastructure whose users have many different needs; a service configuration language called BCL (Borg Configuration Language) lets users tell Borg what those needs are. Buskens’s team works on BorgConfig, which interprets service configurations for Borg and manages the millions of jobs running each day, and on BorgCron, which handles scheduled and repeating tasks at Google scale.
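BCL itself is internal to Google, but a job specification in the style of published Borg examples might look like the following sketch (the field names, values, and path are illustrative, not actual BCL):

```
job hello_server = {
  // Which cell to run in and what binary to launch.
  runtime = { cell = 'example-cell' }
  binary  = '/path/to/hello_server'
  // Resource requirements Borg uses when scheduling the job.
  requirements = { ram = 100M, disk = 100M, cpu = 0.1 }
  // Borg keeps this many replicas running.
  replicas = 5
}
```

A specification like this is what BorgConfig interprets on the user’s behalf before handing the resulting job description to Borg.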
BorgConfig has been in production since 2004 and BorgCron since 2006. Both are huge and critical to Google’s business, and both change constantly to meet new user and platform needs and to improve performance and scalability. Both systems have outgrown their original use cases, changed owners several times, and amassed technical debt. Changes to these systems must be made in flight and may be directly visible to users. For BorgConfig, the language has confusing semantics, the library has an enormous API surface, and there is a commitment to infinite backward compatibility. For BorgCron, the components are tightly coupled. The systems can’t simply be rewritten; they must evolve in place.
Google has to plan carefully but move quickly. Buskens’s team must make the migration easy for 80% of users, provide as much guidance as possible for the remaining 20%, and leverage the Google engineering culture of code reviews and testing. Rolling out changes will not happen just once; it will be phased over many steps, and a rollback plan will be in place in case of emergency.
The Google team planned to make BorgConfig easier to change, with the goal of eliminating support for infinite backward compatibility. The easy part is designing and implementing the technical solution. The hard parts include convincing system owners to take this on (because the system is owned by different teams), seeking out executive sponsorship, and setting up support mechanisms.
To make BorgCron easier to change, the team planned to simplify its state management by replacing per-replica state management with shared storage. This is a complex change and rollout process, and the system cannot go down, so a good rollback plan must always be in place. BorgCron must be backward and forward compatible, as old and new binaries will need to communicate and share state with each other.
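The talk did not describe BorgCron’s wire format, but the compatibility requirement can be illustrated with a generic version-tolerant record: new writers may add fields, old readers ignore what they don’t recognize, and new readers supply defaults for fields old writers never wrote. All names here are invented for illustration:

```python
import json


def write_state(next_run_ms, extra=None):
    """Writer side: new binaries may add fields, but existing
    fields keep their meaning so old readers still work."""
    record = {"version": 2, "next_run_ms": next_run_ms}
    if extra:
        record.update(extra)  # forward-compatible additions
    return json.dumps(record)


def read_state(blob):
    """Reader side: ignore unknown fields (forward compatibility)
    and default missing ones (backward compatibility), so a v1
    record written by an old binary still parses correctly."""
    record = json.loads(blob)
    return {
        "next_run_ms": record.get("next_run_ms", 0),
        # v1 records have no 'jitter_ms'; the default preserves
        # the old behavior.
        "jitter_ms": record.get("jitter_ms", 0),
    }
```

With both directions handled, old and new binaries can read each other’s records from shared storage during a phased rollout, and a rollback never strands unreadable state.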
Making changes to mission-critical legacy systems is hard, but most of the time the hardest work is not technical.
Some Thoughts on Software and System Architecture, Jack Greenfield
DataSift is a real-time data-mining application for Twitter data. It takes in tweets, enriches them, runs filters, and distributes them to clients at a rate of 120,000 tweets per second, running on 936 CPU cores, among other scary characteristics. But software is only half the battle: distribution, availability, scalability, and disaster recovery are critical. Low market penetration and a high rate of new entrants suggest user dissatisfaction: customers are not fully comfortable.
Platform as a Service (PaaS) promises users an easier entry. The user writes the application, and the platform automates everything else. It’s a great vision, but implementations are too constraining: the underlying services are hidden from the user and can be replaced or customized only at predefined configuration points. PaaS is losing ground to DevOps, where users can assemble customized platforms from packaged services. Google calls this “breaking the glass,” as in “in emergency, break glass.”
The concept of immutable infrastructure offers some promise. You manage applications by replacing, not by modifying, components. It migrates the installation and configuration tasks from deployment time to build time, meaning that you can do the install and config in a controlled environment, run tests, and validate the configuration. Once you have a known good image, you can deploy it with much less concern about its success. It’s still all about the infrastructure, though; applications are side effects. Tools present lists of infrastructure components, while users have applications in their heads; the platform itself doesn’t understand the application.
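The replace-not-modify idea can be sketched in a few lines: build produces a sealed, content-addressed image that can be tested in a controlled environment, and deploy returns a whole new fleet running that image rather than editing instances in place. This is a minimal illustrative model, not any real deployment API:

```python
import hashlib


def build_image(app_files: dict) -> dict:
    """Build time: bake application and configuration into a sealed
    artifact. The content hash identifies a known-good image once it
    has passed testing."""
    digest = hashlib.sha256(
        repr(sorted(app_files.items())).encode()
    ).hexdigest()
    return {"id": digest[:12], "files": dict(app_files)}


def deploy(old_fleet: list, image: dict, replicas: int) -> list:
    """Deploy time: no in-place edits. Return a *new* fleet of
    instances running the sealed image; the old fleet is discarded
    (or kept briefly so a rollback is just switching back)."""
    return [{"instance": i, "image_id": image["id"]} for i in range(replicas)]
```

Because any change to the files yields a different image id, every deployed instance is traceable to exactly one tested artifact, which is what removes the deploy-time uncertainty.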
This changes when you bring in containers. Popularized by Docker and now challenged by Rocket, this specification layer will rapidly change the picture. Containers package namespaces and cgroups with overlay file systems, at an order of magnitude less overhead than virtual machines. They promote assembly of applications from independent microservices, shifting the focus from the infrastructure to the app. Of course, the shift brings new challenges, such as scheduling resources against containers and service discovery.
A new battlefront is scaling microservices architectures. Many players in this field are now working on scheduling engines to manage resources, orchestration frameworks to manage containers, and asynchronous communication to improve aggregate availability and throughput. The result is declarative modeling of applications: the components are microservices, endpoints and relationships are first-class objects, and none of them are tied to a physical infrastructure. You can schedule an app across a variety of configurations without changing the app specification.
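The decoupling described above can be made concrete with a toy declarative model: the application is pure data naming its microservices and their relationships, and a separate scheduler maps it onto whatever machines happen to exist. Everything here is invented for illustration, not any particular orchestration framework:

```python
# The application spec never mentions machines: services and their
# relationships are first-class data.
APP_SPEC = {
    "services": {
        "frontend": {"image": "web:1.4", "replicas": 2},
        "filter": {"image": "filter:2.0", "replicas": 4},
    },
    # Relationships are explicit: frontend depends on filter.
    "links": [("frontend", "filter")],
}


def schedule(spec: dict, machines: list) -> list:
    """Place replicas round-robin onto the available machines.
    Because the spec is infrastructure-free, the same spec can be
    scheduled onto 2 machines or 200 without changing it."""
    placements, i = [], 0
    for name, svc in sorted(spec["services"].items()):
        for _ in range(svc["replicas"]):
            placements.append((name, machines[i % len(machines)]))
            i += 1
    return placements
```

Running `schedule(APP_SPEC, ...)` against fleets of different sizes yields different placements from the identical, unchanged application specification, which is the point of the declarative model.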
Who Does Architecture in a Company with No Architects? Jeromy Carriere
No one at Google has the title “architect” in the product development organization, yet they have to architect at Google scale. How do they do it?
The architecture practice is a matter of making architectural decisions: decomposing functionality, allocating responsibilities to components, managing technical debt, and making quality-based tradeoffs. It’s really easy to write new languages (badly) and new storage systems (unnecessarily), but good architectural decisions include leveraging legacy systems when possible. Architectural decisions must also respect system-wide constraints.
Carriere also discussed the role of architectural decision-making in managing technical debt. “Technical debt is awesome! I can build a great system for the same reason that I can live in Manhattan: a mortgage.” Managing a portfolio and sustaining a long-term vision sometimes involves technical debt. But good architectural decisions consider debt consciously, rather than just letting it happen, and plan strategies for paying it back.
So who architects at Google? The manager, who is a master planner, gardener, tour guide, wizard, and bartender.
In the question/answer period, Linda Northrop asked the Google team whether they chose the examples they presented by their different relationships to technical debt. They answered that they make quarterly choices about how much technical debt to reduce. The infrastructure components discussed often created constraints that they couldn’t live with, so they needed to change them. “We don’t have anything scientific, such as ‘the interest on this debt is X higher than the interest on that debt,’ about which debt to pay off first; we use the ‘holy crap’ metric.”