How to Become Google, Amazon, or Facebook (in Theory)

We admire Google, Amazon, Azure, Netflix, Facebook, LinkedIn, Twitter, Swift. How many more of these are there? Hundreds? Perhaps thousands? No way.

Why? Because it's bloody hard to build and operate such a system. In particular, there are five things that are very difficult to get right in any large-scale distributed system built by a dev team larger than fifty:

DevOps. Dev is fun -- creating something out of nothing is, in my mind, the purest form of art. Ops is bloody. DevOps is great in theory; very difficult in practice. Your most creative minds create, not operate. Pager duty, yuck!

Scale. The classic iterative model fails miserably at scale. "Let us first build the system using an RDBMS, and later on we can swap the RDBMS for Cassandra." It does not work that way. First, the abstractions are never clean enough to be switchable. But more importantly, the application code that consumes the persistence layer comes to rely on the semantics of the underlying system. To scale, one must anticipate scale. That is very hard.
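To make the point about leaky persistence abstractions concrete, here is a toy sketch in Python. Everything in it is hypothetical: `KeyValueStore`, `StronglyConsistentStore`, `LaggingReplicaStore` and `register_user` are names invented for illustration, and the lagging replica merely stands in for any eventually consistent system; it is not a model of how Cassandra actually behaves. Both stores satisfy the same interface, yet the application code only works against one of them.

```python
from typing import Dict, Optional, Protocol


class KeyValueStore(Protocol):
    """The 'clean' abstraction that is supposed to make stores swappable."""

    def put(self, key: str, value: str) -> None: ...

    def get(self, key: str) -> Optional[str]: ...


class StronglyConsistentStore:
    """Toy store in which a read always observes the latest write."""

    def __init__(self) -> None:
        self._data: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._data[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._data.get(key)


class LaggingReplicaStore:
    """Toy eventually consistent store (an assumption for illustration).

    Writes land on a primary; reads are served from a replica that only
    catches up when sync() is called.
    """

    def __init__(self) -> None:
        self._primary: Dict[str, str] = {}
        self._replica: Dict[str, str] = {}

    def put(self, key: str, value: str) -> None:
        self._primary[key] = value

    def get(self, key: str) -> Optional[str]:
        return self._replica.get(key)

    def sync(self) -> None:
        # Simulate replication finally catching up with the primary.
        self._replica = dict(self._primary)


def register_user(store: KeyValueStore, user_id: str, email: str) -> bool:
    """Application code that silently relies on read-your-writes."""
    store.put(user_id, email)
    # This check is only valid if a read immediately sees the write.
    return store.get(user_id) == email
```

Calling `register_user` with a `StronglyConsistentStore` succeeds, while the same call with a `LaggingReplicaStore` reports failure until `sync()` has run. The interface did not change; the semantics underneath it did, and the application had quietly come to depend on them.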

Agility. Change at scale is not easy. One can lock down the system and thereby satisfy the safety property of systems, or so people think. In fact, there have been some excellent writeups arguing that locking down is actually antithetical to safety, but this is a difficult lesson to learn. Trial by fire is not easy. But the progress property is even more difficult. How does one innovate, and let one's customers innovate, and yet keep things running? At Apigee, we make hundreds of changes every month, our customers deploy thousands of new APIs every month, the infrastructure keeps changing underneath us -- and yet guaranteeing availability, scale and latency for our customers' APIs is something we have carefully learnt to do.

Cost. A common attitude among development teams is "let's focus on features and the cost will work out by itself." We have found a different approach helpful -- ask each team what the baseline cost is (for example, if one is building persistence on AWS, start with the AWS GB/month cost). Now do a layering, and tell us what each added piece of functionality costs. Without this, it is all jumbled, and that is no way to build a distributed system at scale.
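The layering exercise described above can be sketched in a few lines of Python. This is only an illustration: `CostLayer` and `cost_breakdown` are hypothetical names, and the dollar figures in the usage note are made-up placeholders, not real AWS prices.

```python
from dataclasses import dataclass
from typing import List, Tuple


@dataclass(frozen=True)
class CostLayer:
    """One layer of functionality and the incremental cost it adds."""

    name: str
    dollars_per_gb_month: float  # incremental cost, not cumulative


def cost_breakdown(
    baseline: CostLayer, layers: List[CostLayer]
) -> List[Tuple[str, float, float]]:
    """Return (name, incremental cost, cumulative cost) rows, baseline first."""
    rows: List[Tuple[str, float, float]] = []
    running_total = 0.0
    for layer in [baseline, *layers]:
        running_total += layer.dollars_per_gb_month
        rows.append((layer.name, layer.dollars_per_gb_month, running_total))
    return rows


def print_breakdown(rows: List[Tuple[str, float, float]]) -> None:
    """Print one line per layer so each feature's cost is visible."""
    for name, incremental, cumulative in rows:
        print(f"{name:<24} +{incremental:.3f}  total {cumulative:.3f} $/GB/month")
```

A team might then call it like this, with placeholder numbers: `print_breakdown(cost_breakdown(CostLayer("raw storage", 0.10), [CostLayer("replication", 0.20), CostLayer("backups", 0.05)]))`. The value is less in the arithmetic than in forcing every added piece of functionality to show up as its own line, measured against the baseline.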

Team independence. This, of course, is not technology, though technology helps. Clean APIs between team "micro-"services help. But dependencies naturally arise -- at the very least, when some teams are providing platform services -- such as core persistence services at Apigee. Furthermore, there are always tensions -- commonality of skills (can one team use MySQL and another Postgres?), common look and feel (do you want your surface area to look like Frankenstein?), etc. "Architectural Committees" are considered anathema.

There is no magic wand here -- it all depends on your culture and your skills. I am not going to recommend a preferred way. Suffice it to say, though, if you do not get this right, you cannot sustain your play.