Technical Systems, Infrastructure & Machine Learning
Who we are
Technical Systems and Infrastructure encompasses a large ecosystem of teams that design, implement, and operate the hardware, software, ML, and systems infrastructure that powers both Google Cloud and our major internet services. We are a diverse group of software engineers, hardware engineers, test engineers, network engineers, electrical engineers, mechanical engineers, researchers, program managers, product managers, analysts, developers (and more...) who collaborate to ensure that our data center hardware, our software stack, our internal and global networks, and our applications work together efficiently to deliver secure and reliable services to billions of users.
What we do
Google was an early leader in inventing warehouse-scale computing [13] and cloud-scale software infrastructure [1, 2, 3, 4, 5, 6]. Our teams design and build all of the hardware and software infrastructure for Google. The engineering teams design security, efficiency, and reliability into Google’s planetary-scale systems, including its global network systems [7, 10], its world-leading Machine Learning (ML) infrastructure [8, 9, 11, 14], and its video accelerator for YouTube [12]. Talented staff across the organization participate in the invention and creation of world-leading future technology that will define the next generation of hyperscale computing.
Who we serve
Google’s computing infrastructure is among the largest and most technically sophisticated in the world. Our systems power all of Google’s services for both internal users (for example, Search, Maps, Gmail, and YouTube) and external customers (such as Google Cloud clients).
How you can make a difference
Our teams work on some of the most complex technologies in the world at global scale: computers, software, and networks that keep data centers operating efficiently and smoothly 24/7. We are seeking experienced people who can see the details and the big picture at the same time, and can understand how the pieces fit together. Your mission will be to collaborate with your peers to invent, design, and implement new concepts, designs, and technologies for Google’s applications and systems.
The topics we are interested in include – but are not limited to – operating systems, virtual machines, serverless computing, distributed systems, storage systems and databases, performance measurement and evaluation, data analytics, computer architecture and hardware design, datacenter and global networking, datacenter scheduling and management, security, and advanced machine learning hardware and software.
Impact
The following are examples of published research from Google that has been highly influential in computer systems, and in warehouse-scale computing in particular.
[1] Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2003.
[2] Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Proceedings of the 6th Conference on Symposium on Operating Systems Design & Implementation (OSDI), 2004, USENIX Association.
[3] Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), 2006.
[4] Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.
[5] James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford. Spanner: Google’s Globally-Distributed Database. Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.
[6] Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. Proceedings of the European Conference on Computer Systems (EuroSys), 2015.
[7] Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, Amin Vahdat. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network. Proceedings of ACM SIGCOMM, 2015.
[8] Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.
[9] Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-Luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.
[10] Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, Amin Vahdat. Snap: A Microkernel Approach to Host Networking. Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019.
[11] Norman Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. Ten lessons from three generations shaped Google’s TPUv4i. Proceedings of the 48th International Symposium on Computer Architecture (ISCA), 2021.
[12] Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela. Warehouse-Scale Video Acceleration. IEEE Micro, July/August 2022.
[13] Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition. Morgan & Claypool Publishers, 2019.
[14] Vaswani, A., Shazeer, N., Parmar, N., Uszkoreit, J., Jones, L., Gomez, A. N., Kaiser, Ł., and Polosukhin, I. Attention is all you need. In 31st Conference on Neural Information Processing Systems, (pp. 5998-6008), 2017.