Join NVIDIA as a Senior HPC AI Engineer
NVIDIA is seeking a skilled Senior HPC AI Engineer to join our End-to-End Software Verification HPC/AI Infrastructure team. We specialize in building supercomputers and HPC clusters leveraging groundbreaking technologies. This is a unique opportunity to contribute to the latest advances in artificial intelligence and GPU computing by providing insights on at-scale system design and tuning mechanisms for large-scale compute runs.
Key Responsibilities
As a Senior HPC AI Engineer, you will:
- Design, implement, and maintain large-scale HPC/AI clusters with monitoring, logging, and alerting capabilities.
- Manage Linux job/workload schedulers and orchestration tools.
- Develop and maintain continuous integration and delivery pipelines.
- Develop automation tooling for deployment and management of large-scale infrastructure environments.
- Deploy monitoring solutions for servers, networking, and storage systems.
- Troubleshoot issues from the hardware level to the application layer.
- Serve as a technical resource to develop and document best practices.
- Support Research & Development activities and engage in POCs/POVs to drive future improvements.
Required Qualifications
We are looking for individuals who have:
- A degree in Computer Science, Engineering, or a related field.
- 5+ years of relevant experience in HPC and AI solution technologies.
- Experience with job scheduling and orchestration tools such as Slurm and Kubernetes (K8s).
- Excellent knowledge of Windows and Linux (Red Hat/CentOS and Ubuntu) operating systems and their internals, including networking protocols (TCP, DHCP, DNS), security, and firewall configurations.
- Hands-on experience with multiple storage solutions such as Lustre, GPFS, ZFS, and XFS.
- Expertise in Python programming and Bash scripting.
- Proficiency with automation and configuration management tools like Jenkins, Ansible, Puppet, and Chef.
- Deep knowledge of networking technologies, including InfiniBand and Ethernet.
- Experience with virtualization platforms such as VMware, Hyper-V, KVM, or Citrix.
Preferred Skills
Ways to stand out from the crowd:
- Familiarity with cloud computing platforms (e.g., AWS, Azure, Google Cloud).
- Knowledge of CPU and/or GPU architectures.
- Experience with GPU-focused hardware/software (DGX, CUDA).
- Background in RDMA (InfiniBand or RoCE) fabrics.
- Proficiency with Kubernetes and microservice container technologies.
Diversity and Inclusion
At NVIDIA, diversity is a driving force behind our innovation. We are an equal opportunity employer and value diversity at our company. We do not discriminate based on race, religion, color, national origin, sex, gender, gender expression, sexual orientation, age, marital status, veteran status, or disability status. We ensure reasonable accommodation for individuals with disabilities during the job application or interview process, performance of essential job functions, and within other benefits and privileges of employment. If you require accommodation, please contact us.
Additional Information
Company Name: NVIDIA
Job Title: Senior HPC AI Engineer