Mega-list of articles from Google

This is an unofficial list of articles recommended to people who are starting work at Google in Site Reliability Engineer positions (some of the linked articles explain what that role is). It is not secret: all the articles are publicly available, and they are very useful reading for programmers as well. For anyone applying for an SRE position, it is pure gold! It is also for anyone curious about how Google organizes its processes to minimize the chance of a critical failure in production, where people cannot use search or, say, log in to their Gmail.

Important: If you have 5+ years of experience and are applying to Google (or planning to, or preparing), be sure to read the articles that are not about SRE. They contain a lot of good information on system design and on how Google builds large scale systems.

Engineering Reliability into Web Sites: Google SRE:
http://research.google.com/pubs/pub32583.html

Making an impact as a Site Reliability Engineering intern:
http://googleforstudents.blogspot.ie/2013/05/making-impact-as-site-reliability.html

Site Reliability Engineers: “solving the most interesting problems”:
http://googleresearch.blogspot.ie/2012/07/site-reliability-engineers-solving-most.html

Being an On-Call Engineer: A Google SRE Perspective:
http://research.google.com/pubs/pub44813.html

Inside Google datacenters:
http://www.wired.com/wiredenterprise/2012/10/ff-inside-google-data-center/all/

Kripa and Tom (SREs) on DiRT:
http://queue.acm.org/detail.cfm?id=2371516

Sabrina Farmer (SRE manager) interview:
http://www.techcentral.ie/19149/women-in-tech-must-overcome-the-impostor-syndrome

Andrew Widdowson (SRE) interview:
http://googleforstudents.blogspot.com/2012/06/site-reliability-engineers-worlds-most.html

How SRE solved the leap-second problem for Google:
http://googleblog.blogspot.ie/2011/09/time-technology-and-leaping-seconds.html

Google Engineering in Dublin video for fun, mostly SREs:
http://www.youtube.com/watch?v=zrb6edmE5Kg
– there are a bunch of videos like this

Here’s How Google Makes Sure It (Almost) Never Goes Down:
http://www.wired.com/2016/04/google-ensures-services-almost-never-go/

Research papers (http://research.google.com/) – there are 100s here, browse around; classics:
Spanner: http://research.google.com/archive/spanner.html (recent)
Backbone “B4” network: http://cseweb.ucsd.edu/~vahdat/papers/b4-sigcomm13.pdf
Cluster networking: http://static.googleusercontent.com/external_content/untrusted_dlcp/research.google.com/en/us/pubs/archive/36740.pdf
MapReduce: http://research.google.com/archive/mapreduce.html
Bigtable: http://research.google.com/archive/bigtable.html
Protocol Buffers: http://code.google.com/p/protobuf/

Anything Jeff Dean says (http://research.google.com/people/jeff/); particularly relevant to SRE are his presentations giving overviews of running large scale systems, design principles, and some real-life outages at Google:

Berkeley AMPLab Cloud Seminar talk, March, 2012: Achieving Rapid Response Times in Large Online Services
Stanford Computer Science Department Distinguished Computer Scientist Lecture, November, 2010: Building Software Systems at Google and Lessons Learned
Symposium on Cloud Computing (SOCC) keynote, June, 2010: Evolution and Future Directions of Large-scale Storage and Computation Systems at Google
Web Search and Data Mining Conference (WSDM) keynote, February, 2009: Challenges in Building Large-Scale Information Retrieval Systems
Google Faculty Summit talk, July, 2008: Some Potential Areas for Future Research
Stanford CS295 class lecture, Spring, 2007: Software Engineering Advice from Building Large-Scale Distributed Systems

O’Reilly Velocity conference papers / books
Anything available from the O’Reilly Velocity conferences: http://oreilly.com/velocity/
Not everything is available without registration, but you may find talk videos, presentations, papers and links; most of the topics are relevant to large scale software design and SRE.
The SRE Book: Site Reliability Engineering: How Google Runs Production Systems

Related to Ads:
High-Availability at Massive Scale: Building Google’s Data Infrastructure for Ads
F1: A Distributed SQL Database That Scales
Photon: Fault-tolerant and Scalable Joining of Continuous Data Streams