Machine Learning Engineer, Training

Waymo, a pioneering autonomous driving technology company, has a mission to become the most trusted driver. Originating as the Google Self-Driving Car Project in 2009, Waymo has been dedicated to developing The World's Most Experienced Driver™ (the Waymo Driver) to enhance mobility and save thousands of lives now lost to traffic incidents. The Waymo Driver powers Waymo One, a fully autonomous ride-hailing service, and is adaptable to various vehicle platforms and use cases. With over one million rider-only trips completed, the Waymo Driver has autonomously driven tens of millions of miles on public roads across 13+ U.S. states and tens of billions of miles in simulation.

About the Waymo ML Infrastructure Team

The Waymo ML Infrastructure Team collaborates closely with both Research and Production teams to advance models in Perception and Planning, essential to our autonomous driving software. Our solutions, developed in close partnership with teams at Google, support the entire model development lifecycle, specializing in scaling models and addressing unique ML challenges for autonomous driving.

Our Focus

We create libraries and tools to enhance TensorFlow and JAX, tackling scalability, reliability, and performance challenges. Key areas of focus include:

  • Training at scale and improving ML accelerator efficiency
  • Fine-tuning multimodal LLMs for autonomous driving tasks
  • Discovering hyper-parameters and retraining neural networks
  • Computing reliable and noiseless validation metrics
  • Validating newly trained DNNs in the onboard software stack

Your Role

This is a hybrid role reporting to the Technical Lead Manager of Machine Learning Training.

Responsibilities

  • Develop infrastructure for distributed training, including job scheduling, resource management, data distribution, and model synchronization
  • Implement automation for provisioning, deployment, monitoring, and scaling of training infrastructure
  • Monitor system health, diagnose and troubleshoot issues, and perform routine maintenance
  • Identify performance bottlenecks and optimization opportunities
  • Enhance the developer experience and performance of our scalable ML framework

Qualifications

Required

  • Bachelor's degree in Computer Science, Engineering, or a related field, or 2+ years of equivalent experience
  • Experience with distributed systems principles and building distributed systems for production environments
  • Proficiency in Python or C++
  • Experience with Machine Learning frameworks (e.g., TensorFlow, PyTorch) and distributed training algorithms
  • Ability to debug complex distributed systems issues
  • Excellent communication skills for keeping customers and partners updated and resolving issues with them

Preferred

  • Experience with ML accelerator profiling tools
  • Familiarity with cloud platforms (e.g., AWS, Azure, GCP) and managing distributed systems in cloud environments
  • Knowledge of optimization and deep learning algorithms

Compensation and Benefits

The expected base salary range for this full-time position across US locations is $158,000 to $200,000 USD. Actual starting pay will depend on job-related factors such as location, experience, education, and skills. During the hiring process, your recruiter can share the specific salary range for the role's location or, if the role can be performed remotely, for your preferred location.

Waymo also offers participation in its discretionary annual bonus program, equity incentive plan, and a generous benefits program, subject to eligibility requirements.