Careers

AI Accelerator Reliability Uber Tech Lead

Google, Sunnyvale, CA, USA

Minimum qualifications:

  • Bachelor's degree in Computer Science, Electrical Engineering, Computer Engineering, a related technical field, or equivalent practical experience.
  • 8 years of experience in reliability engineering, systems engineering, hardware engineering, or software engineering with a focus on system-level reliability.
  • 5 years of experience in a technical leadership role, managing reliability for hardware/software systems.
  • Experience with reliability analysis techniques (e.g., FMEA, FTA, reliability prediction).
  • Experience with the product development lifecycle, from concept to production.

Preferred qualifications:

  • Master's degree or PhD in Computer Science, Electrical Engineering or a related technical field or equivalent practical experience.
  • 15 years of experience in reliability engineering, with significant experience in server/data centers.
  • Experience with data center operations, SRE practices, and designing for serviceability and maintainability at scale.
  • Proven track record of defining and implementing reliability strategies for novel, large-scale compute or accelerator systems.
  • Familiarity with AI/ML accelerator architectures and workloads.
  • Deep understanding of silicon, packaging, PCB, power, and thermal reliability failure mechanisms and mitigation techniques.

About the job

Google Cloud’s mission is to make every business successful through AI by combining cutting-edge technology, infrastructure, and talent. AI/ML software engineers in Cloud bridge the gap between pioneering models and a massive product vehicle reaching billions. Our talent density and AI-powered tools drive rapid development, rooted in a culture of empowerment and a bias to action. In this role, you aren’t just building technology; you’re shaping the frontier of enterprise and driving the evolution of advanced models.

As a Staff Technical Lead, you will own and drive the end-to-end reliability, availability, and serviceability (RAS) for a groundbreaking, next-generation AI accelerator system. This is a unique opportunity for you to lead the reliability engineering efforts for a complex, large-scale hardware/software co-designed platform that will power future critical AI workloads across Google. You will be responsible for defining the reliability strategy, establishing best practices, and influencing a large cross-functional team of hardware, software, and silicon engineers to ensure this new system meets Google's stringent production standards. Your leadership will be instrumental in delivering a robust, resilient, and maintainable platform from concept through to full-scale deployment.

The AI and Infrastructure team is redefining what’s possible. We empower Google customers with breakthrough capabilities and insights by delivering AI and Infrastructure at unparalleled scale, efficiency, reliability and velocity. Our customers include Googlers, Google Cloud customers, and billions of Google users worldwide.

We're the driving team behind Google's groundbreaking innovations, empowering the development of our cutting-edge AI models, delivering unparalleled computing power to global services, and providing the essential platforms that enable developers to build the future. From software to hardware, our teams are shaping the future of world-leading hyperscale computing, with key teams working on the development of our TPUs, Vertex AI for Google Cloud, Google Global Networking, Data Center operations, systems research, and much more.

Individual pay is determined by factors including job-related skills, experience, and relevant education or training.

US: $262,000 - $365,000 (USD) + 25% bonus target + bonus + equity + benefits

Learn more about benefits at Google.

Responsibilities

  • Define, own, and drive the end-to-end reliability, availability, and serviceability (RAS) strategy for a novel, large-scale AI accelerator system.
  • Establish and enforce reliability engineering principles, standards, and best practices across all components of the system, including custom ASICs, trays, racks, power, cooling, and the full software stack (firmware, system software, runtime, and orchestration).
  • Lead and influence cross-functional teams – including Hardware Engineering, Silicon Design, Software Engineering, Supply Chain, Manufacturing, and Site Reliability Engineering (SRE) – to ensure reliability is designed-in and validated throughout the entire product lifecycle.
  • Drive the design and implementation of fault injection testing, stress testing, and DiRT-style exercises to validate system behavior under failure conditions.
  • Define and oversee the development of robust error handling, monitoring, telemetry, and diagnostic capabilities to enable rapid detection, root cause analysis, and recovery from failures.

Information collected and processed as part of your Google Careers profile, and any job applications you choose to submit, is subject to Google's Applicant and Candidate Privacy Policy.

Google is proud to be an equal opportunity and affirmative action employer. We are committed to building a workforce that is representative of the users we serve, creating a culture of belonging, and providing an equal employment opportunity regardless of race, creed, color, religion, gender, sexual orientation, gender identity/expression, national origin, disability, age, genetic information, veteran status, marital status, pregnancy or related condition (including breastfeeding), expecting or parents-to-be, criminal histories consistent with legal requirements, or any other basis protected by law. See also Google's EEO Policy, Know your rights: workplace discrimination is illegal, Belonging at Google, and How we hire.

If you have a need that requires accommodation, please let us know by completing our Accommodations for Applicants form.

Google is a global company and, in order to facilitate efficient collaboration and communication globally, English proficiency is a requirement for all roles unless stated otherwise in the job posting.

To all recruitment agencies: Google does not accept agency resumes. Please do not forward resumes to our jobs alias, Google employees, or any other organization location. Google is not responsible for any fees related to unsolicited resumes.

Equity is granted exclusively and discretionarily by Alphabet Inc. on the basis of an agreement concluded between you and Alphabet Inc. Alphabet Inc. is your sole contractual partner with respect to equity grants. GSU grants are not guaranteed, are discretionary, are subject to approval by the Alphabet Inc. board of directors or its delegate, the terms of the relevant Alphabet Inc. stock plan, and your grant agreement. They have no impact on statutory payments. Current or past grants do not confer an acquired right.
