Role Overview:
We are seeking a Senior Data Scientist to build and deploy LLM-based capabilities for working with large, diverse datasets and documents relevant to growth analytics and bid strategy. This role emphasizes ingestion, document processing, information extraction, and retrieval methods to support analytics use cases in production. Experience with modern LLM tooling and Databricks is required; hands-on experience with advanced reasoning models and agentic/orchestration frameworks is a plus.
Key Responsibilities:
- Architect, build, and refine retrieval-grounded LLM systems, including basic and advanced RAG patterns, to deliver grounded, verifiable answers and insights.
- Design robust pipelines for ingestion, transformation, and normalization of public and internal data, including ETL, incremental processing, and data quality checks.
- Build and maintain document processing workflows across PDFs, HTML, and scanned content, including OCR, layout-aware parsing, table extraction, metadata enrichment, and document versioning.
- Develop information extraction pipelines using LLM methods and best practices, including schema design, structured outputs, validation, error handling, and accuracy evaluation.
- Own the retrieval stack end-to-end, including chunking strategies, embeddings, indexing, hybrid retrieval, reranking, filtering, and relevance tuning across a vector database or search platform.
- Implement web data acquisition where needed, including scraping, change detection, source quality checks, and operational safeguards like retries and rate limiting.
- Establish evaluation and monitoring practices for retrieval and extraction quality, including golden datasets, regression testing, groundedness checks, and production observability.
- Collaborate with subject matter experts to translate business needs into practical retrieval and extraction workflows and measurable success criteria.
- Communicate complex findings, tradeoffs, and recommendations to technical and business stakeholders, supporting data-driven forecasting and strategy.
- Ensure compliance with data governance and security standards when handling sensitive data and deploying systems to production environments.
Qualifications:
- Advanced degree in Computer Science, Data Science, Statistics, Engineering, or a related quantitative field.
- Minimum of 4 years of experience in data science or applied ML, with a focus on NLP and GenAI.
- Proficiency in Python and SQL, with strong engineering practices for maintainable, testable pipelines.
- Strong experience with Databricks for data processing and pipeline development, including Spark and common Lakehouse patterns.
- Demonstrated experience building retrieval-grounded LLM systems and/or LLM-based information extraction for real-world use cases.
- Experience with document ingestion and parsing, including OCR and handling messy, semi-structured content such as PDFs, tables, forms, and web pages.
- Familiarity with vector databases and retrieval concepts, including indexing, embeddings, hybrid retrieval, reranking, and performance and cost tuning.
- Strong understanding of best practices for reasoning models and techniques that improve reliability and reduce hallucinations, including grounding and attribution.
- Excellent communication skills, with a track record of partnering with stakeholders and turning ambiguous requests into adopted solutions.
Libraries and Tools:
- Proficiency with LLM and orchestration libraries such as OpenAI, Google GenAI, LangGraph, and LangChain.
- Experience with supporting tooling commonly used in production LLM systems, such as Pydantic for schema validation, tenacity for retries, beautifulsoup4 for HTML data extraction, and standard Python data tooling such as pandas and NumPy.
- Experience with retrieval and vector tooling, such as FAISS, Elasticsearch or OpenSearch, and vector database platforms (for example, Pinecone, Weaviate, Milvus, Chroma).
Preferred Qualifications:
- Exposure to agentic patterns and tool-calling for workflow automation.
- Experience working in regulated environments and implementing governance controls such as access control, auditability, and retention.