Senior/Principal Data Scientist - NLP (Remote) | Veeva Systems
Join Veeva Systems, a mission-driven pioneer in industry cloud solutions, dedicated to accelerating therapy delivery for life sciences. As one of the fastest-growing SaaS companies, we achieved over $2B in revenue last fiscal year, with boundless growth opportunities ahead.
At Veeva, our values are paramount: Do the Right Thing, Customer Success, Employee Success, and Speed. In 2021, we became a public benefit corporation (PBC), legally committed to balancing the interests of customers, employees, society, and investors.
Embrace the freedom to work from anywhere—whether it's from home or the office—to excel in your ideal environment. Join us in transforming the life sciences industry, making a positive impact on our customers, employees, and communities.
The Role
Veeva is on a mission to streamline the market entry for products in Life Sciences and Regulated industries. Rooted in our core values—Do the Right Thing, Customer Success, Employee Success, and Speed—our teams create transformative cloud software, services, consulting, and data solutions to enhance efficiency and effectiveness for our customers.
Veeva is a Public Benefit Corporation, so you will be part of a company committed to positively impacting its customers, employees, and communities. Veeva's Link product is pivotal in our ecosystem, connecting life sciences with key individuals to boost research and healthcare.
Your role will involve developing LLM-based agents to extract detailed information about Key Opinion Leaders (KOLs) in the healthcare sector, utilizing cloud infrastructure for model development, and collaborating with a dedicated team to refine and deploy these models.
We aim to revolutionize industry standards with advanced ML models, supported by more than 2,000 curators, to ensure quality and scalability across regions, languages, and medical specialties.
Location: Remote in the Netherlands, the UK, or Spain. Candidates must reside in one of these countries and be legally authorized to work there; Veeva does not provide visa or relocation support for this role.
What You'll Do
- Adopt the latest NLP technologies and trends for our platform.
- Develop LLM-based agents for enhanced data interaction and retrieval.
- Leverage preference-alignment methods such as RLHF with Proximal Policy Optimization (PPO) and Direct Preference Optimization (DPO).
- Design an end-to-end pipeline for information extraction from large-scale, unstructured data.
- Create robust semantic search functionality to answer user queries effectively.
- Utilize named entity recognition, entity-linking, slot-filling, few-shot learning, and other techniques for information extraction.
- Analyze and interpret data models per source and region.
- Collaborate with data quality teams for qualitative and quantitative model evaluation.
- Utilize cloud infrastructure in model development, ensuring efficient deployment with software developers and DevOps engineers.
Requirements
- 4+ years as a Data Scientist (or 2+ years with a Ph.D.).
- Master's or Ph.D. in Computer Science, AI, Computational Linguistics, or related field.
- Strong theoretical knowledge of NLP, ML, and Deep Learning.
- Proven experience with LLMs and transformer architectures (e.g., GPT, BERT).
- Proficient in Python and NLP libraries (e.g., NLTK, spaCy, Hugging Face Transformers).
- Experience with big data frameworks (e.g., Ray, Spark) and deep learning frameworks (e.g., PyTorch, JAX).
- Experience with cloud infrastructure (AWS, GCP, Azure) and containerization (Docker, Kubernetes).
- Strong collaboration and communication skills with cross-functional teams.
- Startup experience, high energy, and an agile mindset.
Nice to Have
- Background in Medical NLP.
- Experience with LLM training, fine-tuning, and deployment.
- Life/health science industry experience, notably pharma.
- AI publication in a peer-reviewed journal.
- Production-grade development skills.
- Leadership and team growth skills.
- Experience with NoSQL databases, especially MongoDB.
- Familiarity with model registry solutions like MLflow.
- Familiarity with distributed computing platforms like Ray and Spark.