Personal details


Andrew C.

Data Scientist & Machine Learning Engineer
Based in: 🇫🇷 France
Timezone: Paris (UTC+2)

Summary

  • I have over 10 years of experience in Data Science & Analytics, with 7 years dedicated to Python & Machine Learning and 5 years to Deep Learning & Engineering.
  • My recent focus has been on NLP, where I have honed my skills in Transformers and LLMs.
  • I have a strong foundation in machine learning methodologies and statistics.
  • My expertise includes SQL, managing large datasets, API development, and deploying solutions on GCP/AWS.
  • I am proficient in communicating complex data insights and enjoy mentoring others.

Skills & Expertise

  • Proficient in data science tasks, encompassing cluster analysis, time series analysis, dimensionality reduction (PCA, UMAP), and anomaly detection.

  • Seasoned in machine learning tools, including scikit-learn, gradient boosting, SHAP values interpretation, and hyperparameter tuning (Optuna).

  • Skilled in deploying models, optimizing for scalability, and utilizing tools such as Luigi, MLflow, Docker, FastAPI/Flask, and pytest.

  • Well-versed in using Python libraries like NumPy and Pandas for data mining and manipulation, as well as Selenium for web scraping.

  • Proficient in data analysis techniques, employing data visualization tools such as Plotly, Seaborn, and Matplotlib.

  • Comfortable with statistics, with experience in SciPy, statistical analysis, hypothesis testing, Bayesian statistics, and probability theory.

  • Experienced in deep learning frameworks, including TensorFlow, PyTorch, and Keras.

  • Skilled in various NLP tasks, such as sentiment analysis, topic modeling, classification, and working with GenAI APIs.

  • Proficient in working with databases, including Google BigQuery, MySQL, PostgreSQL, PL/SQL, and Redis, and competent at handling Big Data.

  • Familiar with network analysis and graph theory.

  • Comfortable using Git for version control and conducting code reviews.

  • Skilled in cloud computing platforms, including Google Cloud Platform (GCP), Vertex AI, and AWS, and proficient in Linux shell scripting (Ubuntu, Debian).

  • Committed to software development best practices and experienced in agile methodologies, particularly Scrum.

  • Seasoned in interacting with stakeholders and providing detailed reports on project progress and outcomes.

Work Experience

Freelance Artificial Intelligence Engineer
Toptal | Mar 2020 - Present
Skills: Python, SQL, API, NumPy, Pandas, Machine Learning, SciPy, Cluster Analysis, Data Mining, Data Analysis, Big Data, Docker, Data Science, NLP (Natural Language Processing), Google Cloud Platform, Data Visualization, Neural Networks, Supervised Learning, Unsupervised Learning, Deep Learning, Exploratory Data Analysis, TensorFlow, Data Cleaning, PyTorch, Anomaly Detection, Predictive Modeling, Statistical Analysis, Time Series Analysis, Bayesian Statistics, AI (Artificial Intelligence), Scikit-learn, Large Language Models
  • Developed predictive models for race day outcomes, achieving up to 90% accuracy in single bets, addressing data limitations and integrating a betting strategy.
  • Formulated a BERT-based approach for Instagram profile categorization, achieving over 80% accuracy across 50+ groups and enhancing ad campaigns.
  • Employed a nearest neighbors model for Twitter account category estimation (~4×16 subclasses) using TFHub pre-trained embeddings.
  • Implemented audience enrichment with hypothesis testing and SQL, generating over 100k potential customer matches per company.
  • Engineered over 15 predictive models for 100+ million IDs via GCP using PL/SQL and regex, ensuring scalability and efficient logging.
  • Leveraged OpenAI API for data generation, standardizing job titles and estimating seniority/department with ~90% accuracy.
  • Applied random walk embeddings, UMAP, and HDBSCAN to customer interests, enriching a targeted ad subset by 25k users.
  • Designed a Scikit-Learn salary prediction pipeline with less than 10% error using unsupervised transformation.
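
The salary pipeline in the last bullet is proprietary, but its shape, an unsupervised transformation feeding a supervised regressor, can be sketched with scikit-learn. Everything below (the synthetic data, PCA as the unsupervised step, a gradient-boosted regressor in place of the production model) is an illustrative assumption, not the actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for profile features (the real features are not public).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :3].sum(axis=1) * 10_000 + 50_000 + rng.normal(scale=1_000, size=500)

# Unsupervised transformation (scaling + PCA) feeding a boosted regressor.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # unsupervised dimensionality reduction
    ("model", GradientBoostingRegressor(random_state=0)),
])
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

Wrapping the unsupervised step inside the `Pipeline` keeps it fit only on training folds during cross-validation, which avoids leakage when estimating the error rate.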
Data Scientist
CRED | Nov 2020 - Oct 2023
Skills: Python, SQL, API Development, NumPy, Pandas, Machine Learning, SciPy, Cluster Analysis, Data Mining, Data Analysis, Big Data, Docker, Data Science, Google Cloud Platform, Data Visualization, Supervised Learning, Unsupervised Learning, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Predictive Modeling, Statistical Analysis, Time Series Analysis, Bayesian Statistics, AI (Artificial Intelligence), Scikit-learn, LightGBM, Regression, Classification
  • Combined classification analysis and tree embeddings for in-game location data, identifying player roles with ~90% accuracy and improving clustering.
  • Assembled a 'team profile' algorithm using gradient boosting and statistics, achieving 85% accuracy in pinpointing team weaknesses within a league.
  • Deployed over 10 predictive models via REST API, managing over 100k predictions with rapid response times and robust data issue resolution.
  • Designed a dual-layer regression model for football market value forecasts, achieving short- and long-term predictions with <10% error.
  • Collaborated on a 25-page Streamlit dashboard featuring Plotly, Seaborn, and Matplotlib visuals, contributing to GitHub code reviews.
  • Created a 4-step data engineering pipeline for player stats (cleaning, scaling, imputation), enhancing models’ accuracies by ~75%.
  • Architected a LightGBM model with a 15% error rate, identifying promising young talents and suggesting optimal replacements.
  • Combined GBM model and SQL logic for football player-team recommendations, processing over 500 million combinations.
  • Applied regression analysis with XGBoost to predict footballer retirement within a 1-year margin, addressing data drift.
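
The tree-embedding approach to player roles in the first bullet can be sketched with scikit-learn's RandomTreesEmbedding. The synthetic location data, the two-role setup, and the SVD/KMeans post-processing are illustrative assumptions, not the production pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic stand-in for in-game location features (real data is proprietary).
rng = np.random.default_rng(42)
locations = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 6)),  # e.g. a defensive profile
    rng.normal(loc=5.0, scale=1.0, size=(100, 6)),  # e.g. an attacking profile
])

# Map raw coordinates into a sparse high-dimensional leaf-index space,
# then compress and cluster to recover candidate "roles".
embedding = RandomTreesEmbedding(n_estimators=50, random_state=0)
leaves = embedding.fit_transform(locations)  # sparse one-hot leaf codes
compressed = TruncatedSVD(n_components=10, random_state=0).fit_transform(leaves)
roles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(compressed)
```

The appeal of the tree embedding is that it captures non-linear, axis-aligned structure in the coordinates before clustering, which plain KMeans on raw positions can miss.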

Education

Peoples’ Friendship University of Russia
Master's degree, Applied Mathematics & Computer Science
Sep 2004 - Jul 2011

Personal Projects

ML Development & Engineering for Football Scouting Recommendation System
2023
Skills: Python, SQL, API, NumPy, Pandas, Machine Learning, Data Analysis, Big Data, Data Science, Google Cloud Platform, Data Visualization, Supervised Learning, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Predictive Modeling, Statistical Analysis, AI (Artificial Intelligence), Scikit-learn
Objective: Developed a machine learning algorithm to efficiently estimate player-team compatibility scores, utilizing advanced analytical techniques for accurate assessments.
  • Conducted comprehensive data reshaping, employing SciPy for anomaly detection; developed targets based on playing-time rates, using binary classification for goalkeepers and numerical values for the other 12 positions.
  • Implemented LightGBM models for each position, utilizing the SHAP library for iterative (RFECV-like) feature selection, which resulted in test error levels of 5-15%.
  • Refined the model outputs during the prediction pipeline phase by converting them to ranks, which reduced potential noise and made the results more intuitive to present.
Optimization & Outcome: The prediction API initially covered over 500 million team-player combinations; enriching the SQL pipeline with an affordability calculation reduced that number roughly tenfold, and prioritizing certain predictions ensured the most critical insights were delivered first.
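
The iterative, RFECV-like feature selection described above can be sketched with scikit-learn's RFECV. Here tree feature importances stand in for SHAP values and a GradientBoostingRegressor stands in for LightGBM, so the snippet illustrates the selection loop rather than the exact production setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in: 3 informative features plus 9 pure-noise columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))
y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=300)

# Recursive feature elimination with cross-validation; the estimator's
# feature importances play the role SHAP values do in the real pipeline.
selector = RFECV(
    estimator=GradientBoostingRegressor(random_state=0),
    step=1,                              # drop one feature per iteration
    cv=3,
    scoring="neg_mean_absolute_error",
)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of retained features
```

Ranking features by SHAP values instead of raw importances (as the project did) tends to be more robust when features are correlated, but the elimination loop is the same.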
Data Engineering & Analytics for Advanced Football Player Insights
2023
Skills: Python, SQL, NumPy, Pandas, Machine Learning, Data Mining, Data Analysis, Big Data, Google Cloud Platform, Data Visualization, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Statistical Analysis, Time Series Analysis
Objective: Focused on enhancing the quality, granularity, and applicability of football player data for diverse modeling and analytical purposes.
  • Performed correlation analysis and feature engineering to identify and select the top 30 influential features for each player position.
  • Developed a four-phase data engineering pipeline comprising initial cleaning, feature selection, multi-level imputation, and feature scaling based on score correlations.
  • Implemented biannual data segmentation to integrate player data from leagues with different schedules, deployed the pipeline on a GCP server with results stored in BigQuery, and reprocessed three large data sources simultaneously in under one day.
Outcome: Successfully enhanced player data quality, improving model quality by approximately 75% and streamlining the generation of time series features.
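
A compressed, single-level sketch of the four-phase pipeline is possible in scikit-learn. SimpleImputer stands in for the multi-level imputation described above, the data is synthetic, and imputation is moved ahead of selection because scikit-learn's selectors require complete data; the real pipeline's ordering and logic were more involved:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw player stats with missing entries.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 30))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stand-in for multi-level imputation
    ("select", SelectKBest(f_regression, k=10)),   # keep the most score-correlated features
    ("scale", StandardScaler()),
])
X_clean = pipeline.fit_transform(X, y)
```

Expressing the phases as one `Pipeline` object is what makes redeployment on a schedule (e.g. the biannual GCP runs described above) a single fit-transform call per data source.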