Develop and automate evaluation frameworks to benchmark Gemini and other LLMs across accuracy, grounding, hallucination rate, latency, and contextual coherence (an illustrative harness sketch follows this list).
Design, run, and analyze controlled experiments (A/B testing, prompt variations) to measure the performance impact of prompt tuning and parameter adjustments (see the paired-comparison sketch after this list).
Use Python (Pandas, NumPy, SciPy) to clean, transform, and analyze datasets; apply statistical testing, regression, and hypothesis validation for model comparisons.
Run reproducible experiments and maintain model evaluation pipelines using Google Colab, Jupyter, Gemini CLI, Vertex AI, and AI Studio.
Build dashboards and visual reports (Matplotlib, Plotly, Looker Studio) to communicate insights and performance trends effectively to stakeholders (see the trend-plot sketch below).
Work closely with ML engineers, architects, and AI product teams to interpret results, refine prompts, and guide model retraining strategies.
Maintain clear experiment logs, reproducible notebooks, and result repositories for internal validation and audit (see the logging sketch below).
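
A minimal sketch of the kind of evaluation harness described in the first responsibility, assuming a generic `call_model` callable and a small QA-style eval set (both hypothetical); latency and exact-match accuracy stand in here for the full metric suite (grounding, hallucination rate, coherence), which would need rubric- or retrieval-based scoring:

```python
import time
import statistics
from typing import Callable, Dict, List

def evaluate_model(call_model: Callable[[str], str],
                   dataset: List[Dict[str, str]]) -> Dict[str, float]:
    """Benchmark a model callable on a small QA-style eval set.

    `call_model` and the dataset schema ({"prompt": ..., "reference": ...})
    are placeholders for whatever client and eval set the team actually uses.
    """
    latencies, correct = [], 0
    for example in dataset:
        start = time.perf_counter()
        answer = call_model(example["prompt"])
        latencies.append(time.perf_counter() - start)
        # Naive exact-match accuracy; real grounding/hallucination checks
        # would replace this with rubric- or retrieval-based scoring.
        correct += int(answer.strip().lower() == example["reference"].strip().lower())
    return {
        "accuracy": correct / len(dataset),
        "latency_p50_s": statistics.median(latencies),
        "latency_mean_s": statistics.mean(latencies),
    }
```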
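A sketch of a paired A/B comparison of two prompt variants, assuming per-example quality scores produced by a harness like the one above (the score values shown are illustrative only), with SciPy's paired t-test used for hypothesis validation:

```python
import numpy as np
from scipy import stats

# Hypothetical per-example quality scores (0-1) for the same eval set
# under two prompt variants; in practice these come from the harness above.
scores_prompt_a = np.array([0.82, 0.75, 0.91, 0.68, 0.88, 0.79, 0.84, 0.73])
scores_prompt_b = np.array([0.86, 0.80, 0.93, 0.74, 0.90, 0.85, 0.83, 0.78])

# Paired t-test: the same examples are scored under both prompts,
# so each pair shares example-level difficulty.
result = stats.ttest_rel(scores_prompt_b, scores_prompt_a)
mean_lift = float(np.mean(scores_prompt_b - scores_prompt_a))

print(f"mean lift: {mean_lift:+.3f}, t = {result.statistic:.2f}, p = {result.pvalue:.3f}")
if result.pvalue < 0.05:
    print("Prompt B's improvement is statistically significant at alpha = 0.05.")
else:
    print("No significant difference detected; iterate further or collect more data.")
```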
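A sketch of a simple Matplotlib trend plot of the kind that could feed a stakeholder report or back a Looker Studio dashboard; the run data is illustrative and would normally be read from the results repository or BigQuery:

```python
import pandas as pd
import matplotlib.pyplot as plt

# Hypothetical weekly evaluation results.
runs = pd.DataFrame({
    "week": ["W1", "W2", "W3", "W4"],
    "accuracy": [0.78, 0.81, 0.84, 0.86],
    "hallucination_rate": [0.12, 0.10, 0.09, 0.07],
})

fig, ax = plt.subplots(figsize=(6, 3))
ax.plot(runs["week"], runs["accuracy"], marker="o", label="accuracy")
ax.plot(runs["week"], runs["hallucination_rate"], marker="s", label="hallucination rate")
ax.set_xlabel("evaluation run")
ax.set_ylabel("metric value")
ax.set_title("Model quality trend across evaluation runs")
ax.legend()
fig.tight_layout()
fig.savefig("quality_trend.png")  # or export into the stakeholder report
```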
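A sketch of a lightweight experiment-logging helper that writes one immutable JSON record per run for later validation and audit; the file layout and field names are assumptions, not an existing internal format:

```python
import json
import hashlib
from datetime import datetime, timezone
from pathlib import Path

def log_experiment(config: dict, metrics: dict,
                   log_dir: str = "experiment_logs") -> Path:
    """Append one immutable experiment record for later validation/audit.

    The record ID is a hash of the config, so identical configurations
    map to the same ID and reruns are easy to spot.
    """
    record = {
        "id": hashlib.sha256(json.dumps(config, sort_keys=True).encode()).hexdigest()[:12],
        "timestamp": datetime.now(timezone.utc).isoformat(),
        "config": config,     # model name, prompt version, parameters, seed...
        "metrics": metrics,   # output of the evaluation harness
    }
    path = Path(log_dir)
    path.mkdir(exist_ok=True)
    out_file = path / f"{record['id']}_{record['timestamp'][:10]}.json"
    out_file.write_text(json.dumps(record, indent=2))
    return out_file
```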
Ph.D. (preferred) or Master’s in Computer Science, Data Science, ML, or Applied Mathematics with 6–10 years of relevant experience.
Hands-on exposure to LLM evaluation, NLP benchmarking, or AI experimentation is essential.
Experience with Gemini models strongly preferred.
Familiarity with RAG frameworks, evaluation agents, or LLMOps; exposure to the GCP ecosystem (Vertex AI, BigQuery, AI Studio).
Strong research and analytical mindset, with clear communication of technical findings.