David Chaiken, Chief Architect, Yahoo!
Architecture at Internet Scale: Lessons Learned from Serving Half a Billion Customers
Notes by Bill Pollak, Jack Chen, and Peter Foldes
Yahoo provides IT for 1/10 of the people on Earth, more than 680 million users. Billions of advertisements. Number one in email, messenger, and other media categories. Premier digital media company. 93 million Flickr photos uploaded each month. Provides media services in communication and search to consumers. Connects consumers to advertisers; that's how Yahoo makes money.
Goal is fast product cycles: release new functionality every three weeks. Flickr pushes code to production multiple times per day. Speed is reflected in products such as the Yahoo! home page. The goal is to customize each experience for each person, as well as for national cultures. The Yahoo home page regenerates millions of times a day with millions of different views.
Yahoo search. Move speedily to new ways to present results: “answers not links.” In Silicon Valley speed is important, as startups with good ideas flood the market. This speed motivates, for example, the optimization and personalization of the Yahoo! homepage. Working with Microsoft to improve the search engine, pushing out new versions fast. The Royal Wedding page was huge: a lot of people signed the guestbook, and many thousands wanted to see the first kiss.
Scale, speed, complexity. Yahoo has grown by acquisition, which has had a dramatic effect on the complexity of their legacy systems. Huge technical debt from legacy. Scale! Competition! Legacy! We as architects know how to deal with this. It’s what we’re trained to do.
There are timeless principles that we can apply as architects to address scale, speed, and complexity. We learn and relearn.
Principle 0: Limit Complexity. The architect “serves as a limit on system complexity” (Brooks, 1975) and provides conceptual integrity for the system. Complexity is the enemy of computer science; it behooves us as designers to minimize it. Becoming an architect means taking responsibility for reducing complexity.
Principle 1: Algorithms First. Don’t optimize prematurely, and don’t pessimize prematurely. Get data structures and algorithms into place. Use case: ad-serving latency. Latency matters to Yahoo customers and consumers. It was a painful process to get product, business development, and technologists on the same page. Needed to get the algorithm under control. Ad serving in a network of networks, with greater participation in ad serving, stressed the algorithm beyond its ability to adapt to internet scale. They came up with a more tractable algorithm, moving from exponential to polynomial complexity.
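The notes don't describe Yahoo's actual ad-serving algorithm, but the exponential-versus-polynomial contrast can be illustrated with a hypothetical toy: picking the most valuable ads for a page. Enumerating candidate sets grows combinatorially with the number of ads; when the objective is this simple, sorting once gives the same answer in polynomial time. All names and data here are invented for illustration.

```python
from itertools import combinations

# Hypothetical (ad_category, expected_value) pairs and slot count.
ads = [("sports", 0.9), ("travel", 0.7), ("finance", 0.8), ("autos", 0.5)]
SLOTS = 2

def best_exponential(ads, slots):
    """Enumerate every candidate set of ads -- combinatorial blowup as
    the ad pool grows, the kind of cost that can't survive internet scale."""
    best = max(combinations(ads, slots), key=lambda combo: sum(v for _, v in combo))
    return sorted(best, key=lambda a: -a[1])

def best_polynomial(ads, slots):
    """Sort once and take the top `slots` ads -- O(n log n), same answer
    for this simple additive objective."""
    return sorted(ads, key=lambda a: -a[1])[:slots]

# Both approaches agree; only their scaling behavior differs.
assert best_exponential(ads, SLOTS) == best_polynomial(ads, SLOTS)
```

The real algorithm surely had richer constraints; the point is the principle, not the objective function: get the complexity class right before tuning anything.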
“We need to understand the complexity of our systems and their fundamentals, and how they’re driven by business needs.”
Principle 2: Conway’s Law: systems image their design groups. Use case is media products and media modernization. Rozanski & Woods: every computer system has an architecture, whether or not it is documented and understood. Acquisitions determined what the architectures looked like: different architectures and different ways of representing them. Various waves of architects tried to consolidate them. Each wave failed until they reorganized the people underlying the silos.
Took friction out of system, made things a lot easier.
Principle 3: Look for Patterns. Use case: global products. Outage of the home page. When Yahoo has an outage, people notice. Looked for bug patterns. Post mortems look for the root cause and address it. The root cause was a cache within Apache not configured to be big enough: when the cache fills up, the machine locks up. A pattern-detection discussion among architects found patterns that apply not just to this incident and the home page but to all media products, and identified the real root causes. A core of architects looks for patterns to try to prevent future outages. The real problem was multi-tenancy of the data. A pattern was created based on this incident, specifying what teams should look for when setting up servers.
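The failure mode described, a cache that fills up and locks the machine, suggests a general defensive pattern: bound the cache and evict rather than grow or block. The notes don't give Yahoo's actual fix, so this is only a generic LRU sketch of the idea.

```python
from collections import OrderedDict

class BoundedCache:
    """LRU cache with a hard size limit: when full, evict the least
    recently used entry instead of growing without bound or blocking."""

    def __init__(self, max_entries):
        self.max_entries = max_entries
        self._data = OrderedDict()

    def get(self, key):
        if key in self._data:
            self._data.move_to_end(key)   # mark as recently used
            return self._data[key]
        return None

    def put(self, key, value):
        self._data[key] = value
        self._data.move_to_end(key)
        if len(self._data) > self.max_entries:
            self._data.popitem(last=False)  # evict least recently used
```

The pattern's payoff is the one the talk highlights: a full cache degrades gracefully (a miss) instead of catastrophically (a lockup).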
Principle 4: Hide Information. Use case is user data stores. The crown jewel at Yahoo is the data we get from our customers, things we can use to personalize the product. While abstraction is a fundamental part of software engineering, according to Parnas, most industrial systems don’t use it. Yahoo thrives on user data, and by introducing a broker Yahoo abstracts it away.
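The broker idea can be sketched minimally: clients ask the broker for a category of user data and never learn which store holds it, so storage engines can be swapped behind the interface. Every name here is hypothetical, not Yahoo's actual API.

```python
class UserDataBroker:
    """Single access point for user data: callers name the data they
    want; the broker decides which backing store serves it (sketch)."""

    def __init__(self, backends):
        self._backends = backends  # e.g. {"profile": <store>, "prefs": <store>}

    def get(self, category, user_id):
        # Swapping a storage engine touches only this mapping,
        # never the callers -- Parnas-style information hiding.
        return self._backends[category].get(user_id)

# In-memory stand-ins for the real user data stores.
profile_store = {"u42": {"name": "Ada", "locale": "en-US"}}
prefs_store = {"u42": {"theme": "dark"}}
broker = UserDataBroker({"profile": profile_store, "prefs": prefs_store})
```

A caller writes `broker.get("profile", "u42")` and stays insulated from whether profiles live in a SQL database, a key-value store, or anything else.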
Principle 0 revisited: Limit Complexity. Pattern: Build on Grid, Push to Production. Data gets processed on a grid, for example, loading ad data from the provisioning and distribution systems and the existing ad-system logs, where scientists in Yahoo Labs analyze it using machine-learning algorithms. They check whether ads are shown to the right people at the right time. The results are then reverse-indexed into production, used by the front end, and used for gathering additional data.
Since there is a separate research grid, research scientists can work with live data, in production, while systems engineers work around them.
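The "build on grid, push to production" flow can be sketched as an offline job that turns logs into a reverse index, the artifact the serving tier then reads. This is an illustrative toy, not Yahoo's actual grid code; the log format is invented.

```python
def build_reverse_index(ad_logs):
    """Offline 'grid' step (sketch): aggregate per-ad logs into a
    reverse index keyed by keyword, ready to push to production."""
    index = {}
    for ad_id, keywords in ad_logs:
        for kw in keywords:
            index.setdefault(kw, set()).add(ad_id)
    return index

# Hypothetical log records: (ad_id, keywords that triggered the ad).
logs = [("ad1", ["travel", "hotels"]), ("ad2", ["travel", "flights"])]
serving_index = build_reverse_index(logs)   # artifact pushed to the front end
```

The complexity-limiting point is the separation: heavy analysis happens offline on the grid, and production only ever sees the precomputed index.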
- Hide information
- Look for patterns
- Systems image their design groups (Conway’s Law)
- Algorithms first
- Limit complexity
Q: Are you still coding?
I do two kinds of coding. I try to stay off the critical path. I implemented some new test frameworks for the group and wrote organizational programs for internal conferences.
Q: How big are the architectural teams?
What we have chosen to do at Yahoo is to build a virtual technology organization. There is a technology council, 20-30 at the top tier, 90-100 next, 200 below that.
Q: What’s the architectural methodology followed at Yahoo?
Still in the process of standardizing. This is difficult due to the diversity: Flickr pushes code multiple times per day to respond to a very dynamic community, while the revenue-generating side must conform to regulatory bodies in different countries. Coming up with a single process for these diverse systems is a real challenge. We normalize the terminology, but different processes are tuned for different classes of products.
Q: What’s the process you use for bringing all of this information together?
Trying to consolidate all of our documentation repositories. One prevalent format is a wiki, to encourage documentation.
Q: How is scalability reflected within the architecture?
Talked a lot about latency and complexity (duals of each other). Scale means we need to be able to distribute transactions across a large number of servers to keep the systems available. Where scalability really comes in is cost.
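Distributing transactions across many servers usually starts with a stable partitioning function; the answer doesn't describe Yahoo's scheme, so this is only a generic hash-partitioning sketch with invented names.

```python
import hashlib

def shard_for(user_id: str, num_shards: int) -> int:
    """Stable hash partitioning (illustrative): the same user always
    maps to the same shard, so transactions spread across the fleet."""
    digest = hashlib.md5(user_id.encode("utf-8")).hexdigest()
    return int(digest, 16) % num_shards
```

The cost angle from the answer shows up here too: a good partitioning function keeps load even, so capacity (and therefore spend) tracks traffic rather than hot spots.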