Technical Systems, Infrastructure & Machine Learning

Delivering secure, reliable, and high-performance planet-scale systems, machine learning, and infrastructure that power Google’s Cloud and internet services
Who we are

Technical Systems and Infrastructure encompasses a large ecosystem of teams that design, implement, and operate the hardware, software, ML, and systems infrastructure that powers both Google Cloud and our major internet services. We are a diverse group of software engineers, hardware engineers, test engineers, network engineers, electrical engineers, mechanical engineers, researchers, program managers, product managers, analysts, developers (and more...) who collaborate to ensure that our data center hardware, our software stack, our internal and global networks, and our applications work together efficiently to deliver secure and reliable services to billions of users.

What we do

Google was an early leader in inventing warehouse-scale computing [13] and cloud-scale software infrastructure [1, 2, 3, 4, 5, 6]. Our teams design and build all of the hardware and software infrastructure for Google. The engineering teams design security, efficiency, and reliability into Google’s planetary-scale systems, including its global network systems [7, 10], its world-leading Machine Learning (ML) infrastructure [8, 9, 11, 14], and its video accelerator for YouTube [12]. Talented staff across the organization invent and build the technologies that will define the next generation of hyperscale computing.

Who we serve

Google’s computing infrastructure is among the largest and most technically sophisticated in the world. Our systems power all of Google’s services, both our own products (for example, Search, Maps, Gmail, and YouTube) and the services we provide to external customers through Google Cloud.

How you can make a difference

Our teams work on some of the most complex technologies in the world at global scale: computers, software, and networks that keep data centers operating efficiently and smoothly 24/7. We are seeking experienced people who can see the details and the big picture at the same time, and can understand how the pieces fit together. Your mission will be to collaborate with your peers to invent, design, and implement new concepts, designs, and technologies for Google’s applications and systems.

The topics we are interested in include, but are not limited to: operating systems, virtual machines, serverless computing, distributed systems, storage systems and databases, performance measurement and evaluation, data analytics, computer architecture and hardware design, datacenter and global networking, datacenter scheduling and management, security, and advanced machine learning hardware and software.
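
To make this concrete, here is a minimal, single-process sketch of the MapReduce programming model described in reference [2]: a map function that emits intermediate key/value pairs and a reduce function that aggregates the values collected for each key. The function names and the in-memory driver below are illustrative assumptions for this page, not Google’s implementation; the production system described in [2] runs the same two user-supplied functions across thousands of machines and handles partitioning, scheduling, data locality, and fault tolerance.

    # Illustrative word-count example in the MapReduce style of reference [2].
    # Single-process sketch only: run_mapreduce stands in for the distributed runtime.
    from collections import defaultdict

    def word_count_map(document):
        # map(key, value) -> stream of intermediate (key, value) pairs
        for word in document.split():
            yield word.lower(), 1

    def word_count_reduce(word, counts):
        # reduce(key, list of values) -> aggregated output for that key
        return word, sum(counts)

    def run_mapreduce(documents, map_fn, reduce_fn):
        intermediate = defaultdict(list)
        for doc in documents:                      # map phase
            for key, value in map_fn(doc):
                intermediate[key].append(value)    # group values by key
        # reduce phase: one reduce call per distinct intermediate key
        return dict(reduce_fn(key, values) for key, values in intermediate.items())

    counts = run_mapreduce(["the quick brown fox", "the lazy dog"],
                           word_count_map, word_count_reduce)
    print(counts)  # {'the': 2, 'quick': 1, 'brown': 1, 'fox': 1, 'lazy': 1, 'dog': 1}

The appeal of the model, as [2] describes, is that engineers write only the two functions; everything the driver stands in for here is handled by the infrastructure.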

Impact

The following are examples of published research papers from Google that have been highly influential in the field of computer systems, and in warehouse-scale computing in particular.

  1. Sanjay Ghemawat, Howard Gobioff, and Shun-Tak Leung. The Google file system. Proceedings of ACM Symposium on Operating Systems Principles (SOSP), 2003.

  2. Jeffrey Dean and Sanjay Ghemawat. MapReduce: simplified data processing on large clusters. Proceedings of the 6th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2004.

  3. Fay Chang, Jeffrey Dean, Sanjay Ghemawat, Wilson C. Hsieh, Deborah A. Wallach, Mike Burrows, Tushar Chandra, Andrew Fikes, and Robert E. Gruber. Bigtable: A Distributed Storage System for Structured Data. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI ’06), 2006.

  4. Mike Burrows. The Chubby lock service for loosely-coupled distributed systems. In Proceedings of the 7th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2006.

  5. James C. Corbett, Jeffrey Dean, Michael Epstein, Andrew Fikes, Christopher Frost, JJ Furman, Sanjay Ghemawat, Andrey Gubarev, Christopher Heiser, Peter Hochschild, Wilson Hsieh, Sebastian Kanthak, Eugene Kogan, Hongyi Li, Alexander Lloyd, Sergey Melnik, David Mwaura, David Nagle, Sean Quinlan, Rajesh Rao, Lindsay Rolig, Yasushi Saito, Michal Szymaniak, Christopher Taylor, Ruth Wang, Dale Woodford. Spanner: Google’s Globally-Distributed Database. Proceedings of the 10th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2012.

  6. Abhishek Verma, Luis Pedrosa, Madhukar Korupolu, David Oppenheimer, Eric Tune, and John Wilkes. Large-scale cluster management at Google with Borg. Proceedings of the European Conference on Computer Systems (EuroSys), 2015.

  7. Arjun Singh, Joon Ong, Amit Agarwal, Glen Anderson, Ashby Armistead, Roy Bannon, Seb Boving, Gaurav Desai, Bob Felderman, Paulie Germano, Anand Kanagala, Jeff Provost, Jason Simmons, Eiichi Tanda, Jim Wanderer, Urs Hölzle, Stephen Stuart, Amin Vahdat. Jupiter Rising: A Decade of Clos Topologies and Centralized Control in Google’s Datacenter Network. Proceedings of ACM SIGCOMM, 2015.

  8. Martín Abadi, Paul Barham, Jianmin Chen, Zhifeng Chen, Andy Davis, Jeffrey Dean, Matthieu Devin, Sanjay Ghemawat, Geoffrey Irving, Michael Isard, Manjunath Kudlur, Josh Levenberg, Rajat Monga, Sherry Moore, Derek G. Murray, Benoit Steiner, Paul Tucker, Vijay Vasudevan, Pete Warden, Martin Wicke, Yuan Yu, and Xiaoqiang Zheng. TensorFlow: A System for Large-Scale Machine Learning. In Proceedings of the 12th USENIX Symposium on Operating Systems Design and Implementation (OSDI), 2016.

  9. Norman P Jouppi, Cliff Young, Nishant Patil, David Patterson, Gaurav Agrawal, Raminder Bajwa, Sarah Bates, Suresh Bhatia, Nan Boden, Al Borchers, Rick Boyle, Pierre-Luc Cantin, Clifford Chao, Chris Clark, Jeremy Coriell, Mike Daley, Matt Dau, Jeffrey Dean, Ben Gelb, Tara Vazir Ghaemmaghami, Rajendra Gottipati, William Gulland, Robert Hagmann, C Richard Ho, Doug Hogberg, John Hu, Robert Hundt, Dan Hurt, Julian Ibarz, Aaron Jaffey, Alek Jaworski, Alexander Kaplan, Harshit Khaitan, Daniel Killebrew, Andy Koch, Naveen Kumar, Steve Lacy, James Laudon, James Law, Diemthu Le, Chris Leary, Zhuyuan Liu, Kyle Lucke, Alan Lundin, Gordon MacKean, Adriana Maggiore, Maire Mahony, Kieran Miller, Rahul Nagarajan, Ravi Narayanaswami, Ray Ni, Kathy Nix, Thomas Norrie, Mark Omernick, Narayana Penukonda, Andy Phelps, Jonathan Ross, Matt Ross, Amir Salek, Emad Samadiani, Chris Severn, Gregory Sizikov, Matthew Snelham, Jed Souter, Dan Steinberg, Andy Swing, Mercedes Tan, Gregory Thorson, Bo Tian, Horia Toma, Erick Tuttle, Vijay Vasudevan, Richard Walter, Walter Wang, Eric Wilcox, Doe Hyun Yoon. In-datacenter performance analysis of a tensor processing unit. Proceedings of the International Symposium on Computer Architecture (ISCA), 2017.

  10. Michael Marty, Marc de Kruijf, Jacob Adriaens, Christopher Alfeld, Sean Bauer, Carlo Contavalli, Michael Dalton, Nandita Dukkipati, William C Evans, Steve Gribble, Nicholas Kidd, Roman Kononov, Gautam Kumar, Carl Mauer, Emily Musick, Lena Olson, Erik Rubow, Michael Ryan, Kevin Springborn, Paul Turner, Valas Valancius, Xi Wang, Amin Vahdat. Snap: A Microkernel Approach to Host Networking. Proceedings of the 27th ACM Symposium on Operating Systems Principles (SOSP), 2019.

  11. Norman Jouppi, Doe Hyun Yoon, Matthew Ashcraft, Mark Gottscho, Thomas B. Jablin, George Kurian, James Laudon, Sheng Li, Peter Ma, Xiaoyu Ma, Thomas Norrie, Nishant Patil, Sushma Prasad, Cliff Young, Zongwei Zhou, and David Patterson. Ten lessons from three generations shaped Google’s TPUv4i. Proceedings of the 48th International Symposium on Computer Architecture (ISCA), 2021.

  12. Parthasarathy Ranganathan, Daniel Stodolsky, Jeff Calow, Jeremy Dorman, Marisabel Guevara, Clinton Wills Smullen IV, Aki Kuusela. Warehouse-Scale Video Acceleration. IEEE Micro, July/August 2022.

  13. Luiz André Barroso, Urs Hölzle, and Parthasarathy Ranganathan. The Datacenter as a Computer: Designing Warehouse-Scale Machines, Third Edition. Morgan & Claypool Publishers, 2019.

  14. Ashish Vaswani, Noam Shazeer, Niki Parmar, Jakob Uszkoreit, Llion Jones, Aidan N. Gomez, Łukasz Kaiser, and Illia Polosukhin. Attention is all you need. In Proceedings of the 31st Conference on Neural Information Processing Systems (NIPS), 2017.