James Hamilton's paper On Designing and Deploying Internet-Scale Services has some solid advice on building large-scale distributed systems. It was published in 2007, and much of its content has since become common knowledge in the software engineering community, but it's always valuable to collect good advice on a topic in one place.
All components of the service should target a commodity hardware slice. For example, storage-light servers will be dual socket, 2- to 4-core systems in the $1,000 to $2,500 range with a boot disk. Storage-heavy servers are similar servers with 16 to 24 disks.
Multi-system failures are common. Expect failures of many hosts at once (power, net switch, and rollout). Unfortunately, services with state will have to be topology-aware. Correlated failures remain a fact of life.
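One way a stateful service stays topology-aware is to place replicas in distinct fault domains, so that losing a rack (power feed, top-of-rack switch) or a whole rollout batch costs at most one copy. A minimal sketch of that idea, assuming a simple host-to-rack map and a hypothetical `place_replicas` policy (not from the paper):

```python
from collections import defaultdict

def place_replicas(hosts, rack_of, n_replicas=3):
    """Pick replica hosts in distinct racks so that a single
    correlated failure (rack power, switch) loses at most one copy.

    hosts:   list of host names
    rack_of: dict mapping host name -> rack name (the fault domain)
    """
    by_rack = defaultdict(list)
    for host in hosts:
        by_rack[rack_of[host]].append(host)
    if len(by_rack) < n_replicas:
        raise ValueError("not enough fault domains for requested replica count")
    # Hypothetical policy: prefer the racks with the most free hosts,
    # then take one host from each so every replica is in its own rack.
    racks = sorted(by_rack, key=lambda r: len(by_rack[r]), reverse=True)
    return [by_rack[r][0] for r in racks[:n_replicas]]
```

Real systems (HDFS, Cassandra, etc.) use far richer placement policies, but the invariant is the same: no two replicas share a fault domain.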
Fail services regularly. Take down data centers, shut down racks, and power off servers. Regular controlled brown-outs will go a long way toward exposing service, system, and network weaknesses. Those unwilling to test in production aren’t yet confident that the service will continue operating through failures. And without production testing, recovery won’t work when called upon.
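The escalating blast radii above (server, rack, data center) can be sketched as a failure-injection drill that picks a random target at a chosen scope. Everything here is hypothetical scaffolding (the `TOPOLOGY` map and `pick_target` are illustrative names, not a real tool); in practice the kill action would invoke real power or network controls:

```python
import random

# Hypothetical topology: data center -> rack -> hosts.
TOPOLOGY = {
    "dc-east": {"rack-1": ["host-a", "host-b"], "rack-2": ["host-c"]},
    "dc-west": {"rack-3": ["host-d", "host-e"]},
}

def pick_target(scope, rng=random):
    """Pick a random failure-injection target at the given blast radius."""
    if scope == "datacenter":
        return rng.choice(list(TOPOLOGY))
    if scope == "rack":
        racks = [rack for dc in TOPOLOGY.values() for rack in dc]
        return rng.choice(racks)
    if scope == "server":
        hosts = [h for dc in TOPOLOGY.values()
                 for rack in dc.values() for h in rack]
        return rng.choice(hosts)
    raise ValueError(f"unknown scope {scope!r}")
```

Running the drill on a schedule, with the scope widened gradually from a single server to a whole data center, is what turns "we think failover works" into observed behavior.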
Hardware Selection and Standardization: Use only standard SKUs. Having a single SKU or a small number of SKUs in production allows resources to be moved fluidly between services as needed. The most cost-effective model is to develop a standard service-hosting framework that includes automatic management and provisioning, hardware, and a standard set of shared services. Standard SKUs are a core requirement for achieving this goal.