About Stability:
Stability AI is a mission-driven, open-source artificial intelligence company with a deep commitment to real-world applications. Our breakthroughs come from our diversity, which lets us work across multiple departments and fields. We are not afraid to challenge established norms and explore creativity. We strive to generate groundbreaking ideas and turn them into tangible solutions. Our vibrant communities include experts, leaders, and partners from around the world who are building cutting-edge open AI models for Image, Language, Audio, Video, 3D, and Biology.
About the role:
We're looking for a skilled ML Ops Engineer with a strong focus on High Performance Computing (HPC) to join our team. This role bridges the gap between our engineering teams, ensuring the seamless integration and operation of Machine Learning models in a High Performance Computing environment. The ideal candidate will oversee the deployment of serving and training tools for deep learning models and manage the changes to the hosting infrastructure needed to optimize performance.
Responsibilities:
- Work closely with the engineering teams to facilitate the seamless interaction and integration of Machine Learning model serving and training within the HPC setup.
- Oversee and optimize the deployment of training and inference tools, guaranteeing they function efficiently within the designated infrastructure.
- Implement necessary modifications in the hosting infrastructure to meet the specific needs of ML models, ensuring they operate effectively in both cloud and HPC settings.
- Ensure cloud services and HPC systems operate in harmony, functioning independently without interfering with one another.
- Integrate inference containers and resources so they can run concurrently in a unified manner.
- Actively participate in performance optimization in Deep Learning, leveraging a solid understanding of compilers and their role in enhancing efficiency.
- Provide technical expertise in Linux and Slurm, along with experience in AWS or GCP infrastructure, to optimize the environment for ML operations.
- Collaborate with the broader team to design, build, and maintain efficient and scalable systems to support the deployment and execution of Machine Learning models.
- Demonstrate proficiency in programming languages such as Python, C++, and TypeScript to develop and maintain various tools and integrations.
Requirements:
- Proficiency in programming languages such as Python, C/C++, and TypeScript.
- Experience working in cloud environments such as AWS, GCP, Cloudflare, etc.
- Experience with Linux and with HPC cluster management tools such as Slurm.
- Familiarity with GPUs and other accelerators like Gaudi2 and TPUs.
- Solid experience in managing and coordinating with cross-functional teams in a fast-paced environment.
- Ability to troubleshoot and resolve complex technical issues in an HPC setting, ensuring the continuous smooth operation of ML models.
- Proven track record in designing and implementing solutions for high availability, scalability, and performance.
- Strong communication skills and the ability to convey complex technical concepts to non-technical stakeholders.
- Familiarity with Agile methodologies, allowing for quick adaptation to evolving project requirements.
Equal Employment Opportunity:
We are an equal opportunity employer and do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability, or any other legally protected status.