Data Scientist with Strong Python Expertise
Role Overview
We are seeking an experienced and innovative Data Scientist with 5 to 10 years of hands-on experience to drive our Artificial Intelligence and Machine Learning initiatives, with a focus on Large Language Models (LLMs), Natural Language Processing (NLP), and Retrieval-Augmented Generation (RAG) systems.
The ideal candidate has deep technical skills in Python programming, extensive experience in text processing, and advanced SQL proficiency. This role is critical for turning unstructured data into strategic assets and building production-ready generative AI applications.
Key Responsibilities
- Generative AI & LLM Development:
  - Design, develop, and implement end-to-end solutions utilizing pre-trained and custom Large Language Models (LLMs) for tasks such as summarization, question-answering, and content generation.
  - Apply techniques like fine-tuning, prompt engineering, and model distillation to optimize LLM performance and efficiency for domain-specific use cases.
- RAG System Architecture & Deployment:
  - Architect and build robust Retrieval-Augmented Generation (RAG) pipelines, integrating vector databases (e.g., Pinecone, ChromaDB, Milvus) and embedding models to ground LLM outputs in proprietary data, thereby mitigating hallucinations and improving accuracy.
  - Develop and manage the entire lifecycle of RAG systems, from document ingestion and chunking strategies to retrieval and re-ranking optimization.
- NLP & Text Processing:
  - Lead the development of advanced Natural Language Processing (NLP) models for core tasks including Named Entity Recognition (NER), sentiment analysis, topic modelling, and text classification.
  - Implement efficient text processing and feature engineering techniques on large, unstructured datasets.
- Programming & Data Management:
  - Demonstrate expert-level proficiency in Python and its data science ecosystem (e.g., PyTorch/TensorFlow, Hugging Face Transformers, NumPy, Pandas, Scikit-learn).
  - Write and optimize complex, performant SQL queries for data extraction, manipulation, and analysis from diverse data sources, including traditional data warehouses and NoSQL stores.
- MLOps & Deployment:
  - Collaborate with MLOps and Engineering teams to transition LLM/NLP models from proof-of-concept to scalable, high-performance production systems.
  - Develop model monitoring frameworks to track performance, drift, and user feedback in production.
Required Qualifications
- Experience: 5 to 10 years of progressive experience in Data Science, Machine Learning, or a related field.
- Specialized Expertise: 3+ years of hands-on experience developing and deploying solutions involving LLMs, RAG, and advanced NLP techniques.
- Technical Stack:
  - Expert Python programming skills for ML model development and production-level code.
  - Advanced SQL proficiency and experience working with large relational databases.
  - Hands-on experience with deep learning frameworks like PyTorch or TensorFlow.
  - In-depth familiarity with the Hugging Face ecosystem (Transformers, Datasets).
- Education: Master’s or Ph.D. in Computer Science, Computational Linguistics, AI, or a related quantitative field.
Preferred Qualifications
- Experience with cloud AI services (e.g., Azure OpenAI, Google Vertex AI, AWS Bedrock).
- Knowledge of distributed computing frameworks (e.g., Spark, Dask) for large-scale text processing.
- Experience with containerization (Docker, Kubernetes) and MLOps tools (e.g., MLflow).
- Track record of research publications or contributions to open-source NLP/Gen AI projects.