Distributed ML Systems Engineer (Accelerated AI)

Join Together AI as a Distributed ML Systems Engineer

Are you passionate about designing scalable machine learning systems? Together AI is looking for a talented Distributed ML Systems Engineer to develop and optimize large-scale, fault-tolerant distributed systems. Be part of a team that's shaping the future of AI, working closely with our researchers and infrastructure teams to keep those systems robust and efficient.

Responsibilities

  • Design and build large-scale, distributed machine learning systems that are fault-tolerant and high-performance.
  • Develop and optimize distributed processing frameworks and storage systems.
  • Collaborate with researchers, engineers, and product managers to integrate ML systems into our infrastructure.
  • Conduct architecture and design reviews to ensure best practices in system design.
  • Implement robust monitoring and logging systems to ensure the health and performance of our ML systems.

Requirements

  • 3+ years of experience in building large-scale, fault-tolerant, high-performance distributed systems.
  • Strong programming skills in one or more of Python, Go, Rust, or C/C++.
  • Excellent understanding of low-level operating system concepts, including multi-threading, memory management, networking, storage, performance, and scale.
  • Experience with cloud computing platforms (AWS, GCP, Azure, etc.) and large-scale infrastructure.
  • Strong problem-solving skills and ability to work in a fast-paced environment.
  • Preferred: Experience with Kubernetes.
  • Preferred: Experience with PyTorch.

About Together AI

Together AI is a research-driven artificial intelligence company dedicated to creating open and transparent AI systems. We're on a mission to significantly lower the cost of modern AI systems by co-designing software, hardware, algorithms, and models. Our team has driven advancements in technologies such as FlashAttention, Hyena, FlexGen, and RedPajama. Join our passionate group of researchers and engineers in building next-generation AI infrastructure.

Compensation

We offer competitive compensation, startup equity, health insurance, and other benefits. The US base salary range for this full-time position is $160,000 - $220,000 plus equity and benefits. Our salary ranges are determined by location, level, and role. Individual compensation will be based on experience, skills, and job-related knowledge.

Equal Opportunity

Together AI is proud to be an Equal Opportunity Employer offering equal employment opportunities to all, regardless of race, color, ancestry, religion, sex, national origin, sexual orientation, age, citizenship, marital status, disability, gender identity, veteran status, and more.
