Join the team developing software that will be used throughout the AI world. Collaborate with talented software engineers to implement large-scale tool sets that test deep learning models and frameworks on the most powerful computers. The ability to work in a multifaceted, fast-paced environment and strong interpersonal skills are essential. In this position, you will interact with internal partners, users, and members of the open source community to develop solutions for building, testing, integrating, and releasing NVIDIA AI Services and Deep Learning Frameworks on the most powerful, enterprise-grade GPU clusters capable of hundreds of petaFLOPS. This role spans multiple products, such as PyTorch, TensorFlow, JAX, and PaddlePaddle. You will work with internal engineering teams to deploy and operationalize AI models and services at large scale by promoting adoption of end-to-end Machine Learning and Deep Learning solutions in the cloud and on-premises.
We are looking for individuals who are passionate about helping us scale our AI and deep learning services, platforms, models, and internal tools. You will be responsible for implementing and maintaining the DevOps/MLOps practices, tools, and infrastructure that enable our teams to deliver high-quality software reliably and efficiently, while ensuring smooth management and deployment of releases. Are you up for this challenge?
What you’ll be doing:
Develop, maintain, and improve CI/CD tools for on-premises and cloud deployment of our software, enable sophisticated cross-platform build systems, and bring world-class release engineering to NVIDIA's platform and cloud deployment process.
Build and operate a self-service Deep Learning testing and benchmarking platform using industry-standard tools such as GitLab, GitHub, Jenkins, Docker, and Bash, alongside NVIDIA's proprietary tools. Own best practices and methodologies for building, testing, and releasing DL software, and support users of the platform.
Monitor and troubleshoot software development and deployment pipelines; identify and resolve issues related to build failures, test failures, code quality, and performance in collaboration with development, operations, and quality assurance teams (see the example following this list).
Develop documentation for proposed approaches, policies, data formats, test cases, and expected results within the scope of your projects, and evangelize these practices across teams.
Work alongside development, operations, and quality assurance teams to establish and maintain efficient and reliable DevOps practices, tools, and infrastructure that enable continuous integration, continuous delivery (CI/CD), and smooth software release management.
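For illustration, here is a minimal sketch of the kind of pipeline-monitoring automation described above: a small Python script that lists a project's recent failed pipelines through GitLab's REST API. The instance URL, project ID, and token environment variable are assumed placeholders, not actual NVIDIA infrastructure.

import os
import requests

GITLAB_URL = "https://gitlab.example.com"  # hypothetical GitLab instance
PROJECT_ID = 12345                         # hypothetical project ID

def list_failed_pipelines(per_page=20):
    """Return the project's most recent failed pipelines."""
    resp = requests.get(
        f"{GITLAB_URL}/api/v4/projects/{PROJECT_ID}/pipelines",
        headers={"PRIVATE-TOKEN": os.environ["GITLAB_TOKEN"]},
        params={"status": "failed", "per_page": per_page},
        timeout=30,
    )
    resp.raise_for_status()
    return resp.json()

if __name__ == "__main__":
    for p in list_failed_pipelines():
        # Each pipeline object carries its id, ref, and a link to the web UI.
        print(p["id"], p["ref"], p["web_url"])

In practice such a check would feed a dashboard or alerting hook so that failing builds are triaged before they block releases.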
What we need to see:
A BS or MS degree in Computer Science, Computer Architecture, or a related technical field, or equivalent experience.
5+ years of work experience in platform engineering/MLOps/DevOps.
Proficient Python and Bash programming skills.
Proficiency with popular CI/CD tools (e.g., GitLab CI, Jenkins), Git workflows (versioning, branching, merging, and tagging), and Linux system management, as well as experience with release management tools and processes.
Knowledge of Docker, REST API services, Kubernetes, Elasticsearch, HashiCorp Vault, and Ansible.
Experience working with cloud providers (AWS, OCI, GCP).
Strong experience setting up, maintaining, and automating continuous integration systems. Knowledge of and enthusiasm for DevOps/MLOps practices. Proficiency in contemporary CI/CD techniques, GitOps, and Infrastructure as Code.
A basic understanding of ML/DL training and inference concepts.
A strong understanding of software testing principles, including unit testing, integration testing, and end-to-end testing, and experience with automated testing frameworks and tools.
Good communication and documentation habits.
Ways to stand out from the crowd:
Experience creating integration, delivery, and deployment pipelines for ML/DL products, and/or experience working with Deep Learning models and services.
Familiarity with large-scale distributed computing systems and cloud platforms, or experience with HPC-based compute clusters and scheduling solutions such as Slurm.
A proven track record of delivering solutions to customers, a deep understanding of deployments at scale, and/or contributions to open source projects.
Relevant certifications (e.g., AWS Certified DevOps Engineer, Red Hat, Oracle) are a plus.
NVIDIA is widely considered one of the most sought-after employers in the tech industry. We are fortunate to have some of the most talented and creative employees in the world. If you're innovative and autonomous, we'd love to hear from you!
The base salary range is 144,000 USD - 270,250 USD. Your base salary will be determined based on your location, experience, and the salary of employees in similar positions.
You will also be eligible for equity and benefits. NVIDIA continuously accepts applications.