Personal details


Andrew C.

Data Scientist & Machine Learning Engineer
Based in: 🇫🇷 France
Timezone: Paris (UTC+2)

Summary

  • I have over 10 years of experience in Data Science & Analytics, with 7 years dedicated to Python & Machine Learning and 5 years to Deep Learning & Engineering.
  • My recent focus has been on NLP, where I have honed my skills in Transformers and LLMs.
  • I have a strong foundation in machine learning methodologies and statistics.
  • My expertise includes SQL, managing large datasets, API development, and deploying solutions on GCP/AWS.
  • I am proficient in communicating complex data insights and enjoy mentoring others.

Skills & Expertise

  • Proficient in data science tasks, encompassing cluster analysis, time series analysis, dimensionality reduction (PCA, UMAP), and anomaly detection.

  • Seasoned in machine learning tools, including scikit-learn, gradient boosting, SHAP values interpretation, and hyperparameter tuning (Optuna).

  • Skilled in deploying models, optimizing for scalability, and utilizing tools such as Luigi, MLflow, Docker, FastAPI/Flask, and pytest.

  • Well-versed in using Python libraries like NumPy and Pandas for data mining and manipulation, as well as Selenium for web scraping.

  • Proficient in data analysis techniques, employing data visualization tools such as Plotly, Seaborn, and Matplotlib.

  • Comfortable with statistics, with experience in SciPy, statistical analysis, hypothesis testing, Bayesian statistics, and probability theory.

  • Experienced in deep learning frameworks, including TensorFlow, PyTorch, and Keras.

  • Skilled in various NLP tasks, such as sentiment analysis, topic modeling, classification, and working with GenAI APIs.

  • Proficient in working with databases, including Google BigQuery, MySQL, PostgreSQL, PL/SQL, and Redis, and competent at handling Big Data.

  • Familiar with network analysis and graph theory.

  • Comfortable using Git for version control and conducting code reviews.

  • Skilled in cloud computing platforms, including Google Cloud Platform (GCP), Vertex AI, and AWS, and proficient in Linux shell scripting (Ubuntu, Debian).

  • Committed to software development best practices and experienced in agile methodologies, particularly Scrum.

  • Seasoned in interacting with stakeholders and providing detailed reports on project progress and outcomes.

Work Experience

Freelance Artificial Intelligence Engineer
Toptal | Mar 2020 - Present
Skills: Python, SQL, API, NumPy, Pandas, Machine Learning, SciPy, Cluster Analysis, Data Mining, Data Analysis, Big Data, Docker, Data Science, NLP (Natural Language Processing), Google Cloud Platform, Data Visualization, Neural Networks, Supervised Learning, Unsupervised Learning, Deep Learning, Exploratory Data Analysis, TensorFlow, Data Cleaning, PyTorch, Anomaly Detection, Predictive Modeling, Statistical Analysis, Time Series Analysis, Bayesian Statistics, AI (Artificial Intelligence), Scikit-learn, Large Language Models
  • Developed predictive models for race day outcomes, achieving up to 90% accuracy in single bets, addressing data limitations and integrating a betting strategy.
  • Formulated a BERT-based approach for Instagram profile categorization, achieving over 80% accuracy across 50+ groups and enhancing ad campaigns.
  • Employed a nearest neighbors model for Twitter account category estimation (~4×16 subclasses) using TFHub pre-trained embeddings.
  • Implemented audience enrichment with hypothesis testing and SQL, generating over 100k potential customer matches per company.
  • Engineered over 15 predictive models for 100+ million IDs via GCP using PL/SQL and regex, ensuring scalability and efficient logging.
  • Leveraged OpenAI API for data generation, standardizing job titles and estimating seniority/department with ~90% accuracy.
  • Applied random walk embeddings, UMAP, and HDBSCAN to customer interests, enriching a targeted ad subset by 25k users.
  • Designed a Scikit-Learn salary prediction pipeline with less than 10% error using unsupervised transformation.
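
The salary pipeline in the last bullet is proprietary, but its shape, an unsupervised transformation feeding a supervised regressor, can be sketched with scikit-learn. Everything below (the synthetic data, PCA as the unsupervised step, a gradient-boosted regressor in place of the production model) is an illustrative assumption, not the actual implementation:

```python
import numpy as np
from sklearn.decomposition import PCA
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for profile features (the real features are not public).
rng = np.random.default_rng(0)
X = rng.normal(size=(500, 20))
y = X[:, :3].sum(axis=1) * 10_000 + 50_000 + rng.normal(scale=1_000, size=500)

# Unsupervised transformation (scaling + PCA) feeding a boosted regressor.
pipeline = Pipeline([
    ("scale", StandardScaler()),
    ("pca", PCA(n_components=10)),  # unsupervised dimensionality reduction
    ("model", GradientBoostingRegressor(random_state=0)),
])
pipeline.fit(X, y)
preds = pipeline.predict(X)
```

Wrapping the unsupervised step inside the `Pipeline` keeps it fit only on training folds during cross-validation, which avoids leakage when estimating the error rate.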
Data Scientist
CRED | Nov 2020 - Oct 2023
Skills: Python, SQL, API Development, NumPy, Pandas, Machine Learning, SciPy, Cluster Analysis, Data Mining, Data Analysis, Big Data, Docker, Data Science, Google Cloud Platform, Data Visualization, Supervised Learning, Unsupervised Learning, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Predictive Modeling, Statistical Analysis, Time Series Analysis, Bayesian Statistics, AI (Artificial Intelligence), Scikit-learn, LightGBM, Regression, Classification
  • Combined classification analysis and tree embeddings for in-game location data, identifying player roles with ~90% accuracy and improving clustering.
  • Assembled a 'team profile' algorithm using gradient boosting and statistics, achieving 85% accuracy in pinpointing team weaknesses within a league.
  • Deployed over 10 predictive models via REST API, managing over 100k predictions with rapid response times and robust data issue resolution.
  • Designed a dual-layer regression model for football market value forecasts, achieving short- and long-term predictions with <10% error.
  • Collaborated on a 25-page Streamlit dashboard featuring Plotly, Seaborn, and Matplotlib visuals, contributing to GitHub code reviews.
  • Created a 4-step data engineering pipeline for player stats (cleaning, scaling, imputation), enhancing models’ accuracies by ~75%.
  • Architected a LightGBM model with a 15% error rate, identifying promising young talents and suggesting optimal replacements.
  • Combined GBM model and SQL logic for football player-team recommendations, processing over 500 million combinations.
  • Applied regression analysis with XGBoost to predict footballer retirement within a 1-year margin, addressing data drift.
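
The tree-embedding approach to player roles in the first bullet can be sketched with scikit-learn's RandomTreesEmbedding. The synthetic location data, the two-role setup, and the SVD/KMeans post-processing are illustrative assumptions, not the production pipeline:

```python
import numpy as np
from sklearn.cluster import KMeans
from sklearn.decomposition import TruncatedSVD
from sklearn.ensemble import RandomTreesEmbedding

# Synthetic stand-in for in-game location features (real data is proprietary).
rng = np.random.default_rng(42)
locations = np.vstack([
    rng.normal(loc=0.0, scale=1.0, size=(100, 6)),  # e.g. a defensive profile
    rng.normal(loc=5.0, scale=1.0, size=(100, 6)),  # e.g. an attacking profile
])

# Map raw coordinates into a sparse high-dimensional leaf-index space,
# then compress and cluster to recover candidate "roles".
embedding = RandomTreesEmbedding(n_estimators=50, random_state=0)
leaves = embedding.fit_transform(locations)  # sparse one-hot leaf codes
compressed = TruncatedSVD(n_components=10, random_state=0).fit_transform(leaves)
roles = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(compressed)
```

The appeal of the tree embedding is that it captures non-linear, axis-aligned structure in the coordinates before clustering, which plain KMeans on raw positions can miss.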

Education

Peoples’ Friendship University of Russia
Master's degree, Applied Mathematics & Computer Science
Sep 2004 - Jul 2011

Personal Projects

ML Development & Engineering for Football Scouting Recommendation System
2023
Skills: Python, SQL, API, NumPy, Pandas, Machine Learning, Data Analysis, Big Data, Data Science, Google Cloud Platform, Data Visualization, Supervised Learning, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Predictive Modeling, Statistical Analysis, AI (Artificial Intelligence), Scikit-learn
Objective: Developed a machine learning algorithm to efficiently estimate player-team compatibility scores, utilizing advanced analytical techniques for accurate assessments.
  • Conducted comprehensive data reshaping, employing SciPy for anomaly detection; developed targets based on playing-time rates, using binary classification for goalkeepers and numerical values for the other 12 positions.
  • Implemented LightGBM models for each position, utilizing the SHAP library for iterative (RFECV-like) feature selection, which resulted in test error levels of 5-15%.
  • Refined the model outputs during the prediction pipeline phase by converting them to ranks, which reduced potential noise and made the results more intuitive to present.
Optimization & Outcome: The prediction API initially covered over 500 million team-player combinations; enriching the SQL pipeline with an affordability calculation reduced that number roughly tenfold, and prioritizing certain predictions ensured the most critical insights were delivered first.
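
The iterative, RFECV-like feature selection described above can be sketched with scikit-learn's RFECV. Here tree feature importances stand in for SHAP values and a GradientBoostingRegressor stands in for LightGBM, so the snippet illustrates the selection loop rather than the exact production setup:

```python
import numpy as np
from sklearn.ensemble import GradientBoostingRegressor
from sklearn.feature_selection import RFECV

# Synthetic stand-in: 3 informative features plus 9 pure-noise columns.
rng = np.random.default_rng(1)
X = rng.normal(size=(300, 12))
y = 2 * X[:, 0] - 3 * X[:, 1] + X[:, 2] + rng.normal(scale=0.1, size=300)

# Recursive feature elimination with cross-validation; the estimator's
# feature importances play the role SHAP values do in the real pipeline.
selector = RFECV(
    estimator=GradientBoostingRegressor(random_state=0),
    step=1,                              # drop one feature per iteration
    cv=3,
    scoring="neg_mean_absolute_error",
)
selector.fit(X, y)
selected = np.flatnonzero(selector.support_)  # indices of retained features
```

Ranking features by SHAP values instead of raw importances (as the project did) tends to be more robust when features are correlated, but the elimination loop is the same.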
Data Engineering & Analytics for Advanced Football Player Insights
2023
Skills: Python, SQL, NumPy, Pandas, Machine Learning, Data Mining, Data Analysis, Big Data, Google Cloud Platform, Data Visualization, Exploratory Data Analysis, Data Cleaning, Anomaly Detection, Statistical Analysis, Time Series Analysis
Objective: Focused on enhancing the quality, granularity, and applicability of football player data for diverse modeling and analytical purposes.
  • Performed correlation analysis and feature engineering to identify and select the top 30 influential features for each player position.
  • Developed a four-phase data engineering pipeline comprising initial cleaning, feature selection, multi-level imputation, and feature scaling based on score correlations.
  • Implemented biannual data segmentation to integrate player data from leagues with different schedules, deployed the pipeline on a GCP server with results stored in BigQuery, and reprocessed three large data sources simultaneously in under one day.
Outcome: Successfully enhanced player data quality, improving model quality by approximately 75% and streamlining the generation of time series features.
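
A compressed, single-level sketch of the four-phase pipeline is possible in scikit-learn. SimpleImputer stands in for the multi-level imputation described above, the data is synthetic, and imputation is moved ahead of selection because scikit-learn's selectors require complete data; the real pipeline's ordering and logic were more involved:

```python
import numpy as np
from sklearn.feature_selection import SelectKBest, f_regression
from sklearn.impute import SimpleImputer
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for raw player stats with missing entries.
rng = np.random.default_rng(7)
X = rng.normal(size=(400, 30))
y = X[:, 0] + 0.5 * X[:, 1] + rng.normal(scale=0.2, size=400)
X[rng.random(X.shape) < 0.05] = np.nan  # ~5% missing values

pipeline = Pipeline([
    ("impute", SimpleImputer(strategy="median")),  # stand-in for multi-level imputation
    ("select", SelectKBest(f_regression, k=10)),   # keep the most score-correlated features
    ("scale", StandardScaler()),
])
X_clean = pipeline.fit_transform(X, y)
```

Expressing the phases as one `Pipeline` object is what makes redeployment on a schedule (e.g. the biannual GCP runs described above) a single fit-transform call per data source.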