Senior AI and ML Infrastructure Engineer, Research Clusters

Join Our Team as a Senior AI and ML Infrastructure Engineer at NVIDIA, Santa Clara, CA

Are you passionate about AI and machine learning? NVIDIA in Santa Clara, CA, USA is looking for a skilled Senior AI and ML Infrastructure Engineer to enhance our cutting-edge technology solutions. This is an opportunity to contribute to a team at the forefront of AI/ML technology, driving innovations that impact the world.

Your Role and Impact

As a Senior AI and ML Infrastructure Engineer, your main task will be to boost productivity for our research teams by identifying and addressing infrastructure gaps. This includes designing and implementing solutions to improve scalability, reliability, and efficiency of our large-scale GPU clusters and other critical systems.

Key Responsibilities:

  • Understand the infrastructure needs of AI/ML research teams and translate them into concrete infrastructure improvements.
  • Design solutions for storage management, error attribution, and reliability issues within our GPU clusters.
  • Optimize AI/ML infrastructure performance and resource utilization through continuous monitoring and upgrades.
  • Develop automation tools and operational strategies to minimize manual tasks and simplify infrastructure management.
  • Collaborate with cross-functional teams to ensure a seamless and robust AI/ML infrastructure ecosystem.
  • Stay current with the latest advances in AI/ML technologies and incorporate them into NVIDIA's strategies.

What We Require:

We are looking for someone with a BS or equivalent (MS preferred) in Computer Science or a related field, backed by at least 12 years of relevant experience. You should have a strong background in software engineering and a solid understanding of high-scale distributed systems, preferably within AI/ML infrastructure.

Technical Skills Needed:

  • Proficiency in programming languages like Python, Go, or C++.
  • Familiarity with cloud platforms such as AWS, GCP, or Azure.
  • Experience with Docker, Kubernetes, Ansible, Terraform, Prometheus, Grafana, and other similar tools.
  • A deep understanding of AI/ML workflows from data processing to model training and inference.
  • Strong problem-solving skills and the ability to develop scalable solutions for complex systems.
  • Excellent communication and team collaboration skills.

Why Join NVIDIA?

At NVIDIA, we offer a competitive compensation package, including a base salary ranging from $220,000 to $419,750, depending on your experience and role within the company. You will also be eligible for equity and a comprehensive benefits package that supports health, well-being, and financial security.

Our team comprises some of the most talented professionals in the world, and we are experiencing unprecedented growth. If you are a creative and autonomous engineer with a genuine passion for technology, NVIDIA is your stage to shine.

NVIDIA is committed to fostering a diverse and inclusive work environment. We are proud to be an equal opportunity employer and value diversity in all forms. We do not discriminate based on any legally protected characteristics.

How to Apply

Ready to contribute to our extraordinary team at NVIDIA? We accept applications on an ongoing basis. Leverage your skills in a role where you can truly make a difference. Apply today!

Additional Information:

Job Title: Senior AI and ML Infrastructure Engineer, Research Clusters
Company: NVIDIA
Location: Santa Clara, CA, USA