Data Engineer - Research

  • Full Time

About Stability:

Stability AI is a community- and mission-driven, open-source artificial intelligence company that cares deeply about real-world implications and applications. Our most significant advancements come from diverse teams working across groups and disciplines. We are not afraid to defy established norms and cultivate innovation. We are driven to generate breakthrough ideas and transform them into concrete solutions. Our dynamic communities consist of specialists, leaders and partners worldwide who are developing advanced open AI models for Image, Language, Audio, Video, 3D and Biology.

About the role:

We are looking for a talented Data Engineer who specializes in scaling efficient distributed workloads. You will collaborate with a growing multidisciplinary team of skilled research scientists and machine learning engineers to improve and scale the efficiency of our models. In this role, you will contribute to ambitious projects such as training the largest open language models, and you will be accountable for ensuring data is gathered, processed and used appropriately.

Responsibilities:

  • Clean, standardize, and preprocess data in a scalable, parallelizable manner to prepare it for ingestion into our machine learning model training pipelines, ensuring high data quality.
  • Build and maintain highly scalable distributed workloads.
  • Construct data pipelines to ingest and process data (e.g. images and text) for integration into ML models.
  • Manage AWS resources.
  • Stay current with methods for improving data quality and curating data for image, video, and LLM training.

Qualifications:

  • Proven background in large scale distributed workloads.
  • Experience with large scale data loading for machine learning training runs.
  • Experience with cloud storage and file systems. AWS (S3) is strongly preferred, but we are open to other cloud platforms.
  • Experience with Python and PyTorch.
  • Experience with multiprocess and multithreaded Python workloads.
  • Excellent communication skills for effective collaboration with users, resolving issues, and providing guidance.
  • Meticulous attention to detail and the ability to document processes and solutions effectively.
  • Strong interest in Generative AI.
  • Experience working on machine learning projects, ideally with some deep learning / computer vision knowledge.
  • Experience with the data loading stack (webdataset, torchdata, fsspec, AIStore) and parallel dataframe manipulation using PySpark/Ray is a plus.

Equal Employment Opportunity:

We are an equal opportunity employer and we do not discriminate on the basis of race, religion, national origin, gender, sexual orientation, age, veteran status, disability or other legally protected statuses.