Data Science Projects

Machine learning models and predictive analytics solutions

Data Cleaning Assistant

A Python-based automation tool that helps clean and preprocess datasets efficiently, reducing manual effort and ensuring consistency in data preparation for analysis.

Python, Data Cleaning, Automation, Pandas

This project provides a set of tools for cleaning messy datasets, including handling missing values, standardizing formats, removing duplicates, and transforming columns for analysis-ready data.

Features: Automated data cleaning functions, duplicate handling, normalization, and data validation.
Tools: Python (Pandas, NumPy), JSON for configuration and user settings.
Use Case: Prepares large datasets for machine learning, visualization, or reporting without manual intervention.
Outcome: Saves time and reduces errors in preprocessing, allowing data scientists to focus on analysis and modeling.
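As a rough illustration of how a JSON-configured cleaning pass like the one described above might look, here is a minimal sketch. The configuration keys (`strip_whitespace`, `lowercase`, `fill_missing`, `drop_duplicates`), the `clean` function, and the sample rows are all hypothetical and made up for this example; they are not taken from the project itself.

```python
import json

import pandas as pd

# Hypothetical configuration: the key names are assumptions for this sketch.
CONFIG_JSON = """
{
  "strip_whitespace": ["name", "city"],
  "lowercase": ["city"],
  "fill_missing": {"age": "median", "city": "unknown"},
  "drop_duplicates": true
}
"""


def clean(df: pd.DataFrame, config: dict) -> pd.DataFrame:
    """Apply the configured cleaning steps and return a new DataFrame."""
    out = df.copy()
    # Standardize text formats first so that near-duplicates become exact ones.
    for col in config.get("strip_whitespace", []):
        out[col] = out[col].str.strip()
    for col in config.get("lowercase", []):
        out[col] = out[col].str.lower()
    # Fill missing values with the column median or with a fixed constant.
    for col, strategy in config.get("fill_missing", {}).items():
        if strategy == "median":
            out[col] = out[col].fillna(out[col].median())
        else:
            out[col] = out[col].fillna(strategy)
    # Drop exact duplicate rows last, after formats and gaps are normalized.
    if config.get("drop_duplicates", False):
        out = out.drop_duplicates().reset_index(drop=True)
    return out


# Made-up sample data with stray whitespace, mixed case, gaps, and a duplicate.
raw = pd.DataFrame({
    "name": [" Ana ", "Ben", "Ben", "Cleo"],
    "city": ["Lisbon ", "OSLO", "OSLO", None],
    "age": [30.0, None, None, 40.0],
})
cleaned = clean(raw, json.loads(CONFIG_JSON))
print(cleaned)
```

Running the standardization and fill steps before `drop_duplicates` is a deliberate choice in this sketch: rows that differ only by whitespace, letter case, or a missing value collapse into exact duplicates and are removed together.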

View Application

Cyclistic Bike-Share Analysis

End-to-end data workflow using SQL for preparation and Python for EDA to compare how casual riders and annual members use the service.

Python, SQL, Google Capstone, Tableau

Follows the Google capstone's six-step framework (Ask-Prepare-Process-Analyze-Share-Act) to quantify behavioral differences between rider segments and translate findings into membership growth strategies.

Tools: PostgreSQL for data prep, Python (Pandas/Seaborn/Plotly) for analysis, Tableau for dashboards.
Focus: Ride duration, day-type patterns, bike types, station hotspots, and seasonal trends.
Outcome: Marketing and timing recommendations for converting casual riders to members.
Key Insights: Casual riders ride mostly on weekends and take longer trips, while members ride mainly for weekday commuting.
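The segment comparison described above could be sketched in Pandas roughly as follows. The column names (`member_casual`, `started_at`, `ended_at`) and the four rides are assumptions invented for illustration; the real analysis runs on the full trip data and its actual schema may differ.

```python
import pandas as pd

# Made-up rides for illustration only; column names are assumed.
rides = pd.DataFrame({
    "member_casual": ["casual", "casual", "member", "member"],
    "started_at": pd.to_datetime([
        "2023-07-01 10:00", "2023-07-03 18:00",
        "2023-07-03 08:00", "2023-07-01 09:00",
    ]),
    "ended_at": pd.to_datetime([
        "2023-07-01 10:45", "2023-07-03 18:20",
        "2023-07-03 08:12", "2023-07-01 09:15",
    ]),
})

# Ride duration in minutes, derived from the two timestamps.
rides["ride_minutes"] = (
    rides["ended_at"] - rides["started_at"]
).dt.total_seconds() / 60

# dayofweek is 0 for Monday through 6 for Sunday, so 5 and 6 are the weekend.
is_weekend = rides["started_at"].dt.dayofweek >= 5
rides["day_type"] = is_weekend.map({True: "weekend", False: "weekday"})

# Ride count and mean duration per rider segment and day type.
summary = rides.groupby(["member_casual", "day_type"])["ride_minutes"].agg(
    rides="count", mean_minutes="mean"
)
print(summary)
```

The same grouping extends to bike type, start station, or month by adding those columns to the `groupby` key.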

View Repository | Open Notebook

Salifort Motors - Employee Attrition

Classification workflow using HR data to predict which employees are likely to leave and to surface actionable drivers for retention.

Python, Machine Learning, HR Analytics, Classification

Builds and evaluates classification models on HR features (projects, tenure, hours, evaluations) to flag employees likely to leave and inform policy changes.

Business Goal: Reduce attrition cost and improve satisfaction through proactive retention strategies.
Methods: Data wrangling and modeling in Jupyter notebooks, using a supervised classification approach.
Key Features: Number of projects, years at company, monthly hours worked, evaluation scores.
Recommendations: Cap concurrent projects at 3-4 per employee (5 at most), consider promotion at around 4 years of tenure, clarify overtime expectations, hold open discussions about company culture, and reward employees in proportion to the work they do.
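A workflow of this shape might be sketched with scikit-learn as below. Everything here is an assumption made for illustration: the data is randomly generated, the column names are hypothetical, the `left` label follows a made-up rule, and a shallow decision tree stands in for whichever classifiers the project actually uses. The scores it prints say nothing about the real HR data.

```python
import numpy as np
import pandas as pd
from sklearn.metrics import classification_report
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in for the HR table; column names are assumed.
rng = np.random.default_rng(0)
n = 500
hr = pd.DataFrame({
    "number_project": rng.integers(2, 8, n),
    "tenure_years": rng.integers(1, 11, n),
    "average_monthly_hours": rng.integers(100, 311, n),
    "last_evaluation": rng.uniform(0.3, 1.0, n).round(2),
})
# Made-up labeling rule, only so the model has a pattern to learn.
hr["left"] = (
    (hr["number_project"] >= 6) & (hr["average_monthly_hours"] > 240)
).astype(int)

X = hr.drop(columns="left")
y = hr["left"]
# Stratify so both splits keep the same share of leavers.
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.25, stratify=y, random_state=0
)

# A shallow tree keeps the learned rules easy to read and explain.
model = DecisionTreeClassifier(max_depth=3, random_state=0)
model.fit(X_train, y_train)

# Rank features by how much each one contributed to the tree's splits.
importances = pd.Series(
    model.feature_importances_, index=X.columns
).sort_values(ascending=False)

print(classification_report(y_test, model.predict(X_test)))
print(importances)
```

Feature importances from a fitted model are one way to surface candidate attrition drivers; they indicate association within the model, not causation.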

View Repository

Automatidata - Predicting Taxi Gratuities

NYC Yellow Taxi tipping classification using trip features (duration, distance, fare) to predict generous tips (>20%) with machine learning models.

Python, Machine Learning, NYC Data, Random Forest

This project builds multiple models (including random forest) on NYC Yellow Taxi 2017 trip records to classify whether a rider will tip generously, focusing on interpretable trip features.

Data: 2017 Yellow Taxi trip records (~408k trips) with timestamps, locations, distance, fares, payment types.
Features: Trip duration, distance, total and fare amounts, vendor, and payment type; duration, distance, and cost ranked highest in feature importance.
Modeling: Random forest classifier reaching 86% accuracy and 72% precision when predicting generous tips.
Use Cases: Inform driver expectations for tip likelihood and explore pricing/route patterns correlating with higher gratuities.
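To make the target and the two reported metrics concrete, the sketch below derives a "generous" label from a 20% threshold and then computes accuracy and precision for a set of hypothetical predictions. It does not fit a model. The trips and predictions are invented, the column names are assumed, and measuring the tip against the pre-tip total is an assumption, since the exact denominator is not stated above.

```python
import pandas as pd
from sklearn.metrics import accuracy_score, precision_score

# Made-up trips for illustration; column names are assumed.
trips = pd.DataFrame({
    "fare_amount": [10.0, 20.0, 8.0, 30.0, 15.0],
    "tip_amount": [2.5, 2.0, 2.0, 3.0, 4.5],
    "total_amount": [13.3, 22.8, 10.8, 33.8, 20.3],
})

# Assumption: tip percentage is measured against the total before the tip.
pre_tip_total = trips["total_amount"] - trips["tip_amount"]
trips["tip_pct"] = trips["tip_amount"] / pre_tip_total

# Binary target: 1 when the tip exceeds 20%, otherwise 0.
trips["generous"] = (trips["tip_pct"] > 0.20).astype(int)

# Hypothetical model output, used only to show how the metrics differ.
predicted = [1, 1, 1, 0, 0]

# Accuracy: share of all trips labeled correctly.
accuracy = accuracy_score(trips["generous"], predicted)
# Precision: share of predicted-generous trips that really were generous.
precision = precision_score(trips["generous"], predicted)

print(trips[["tip_pct", "generous"]])
print(f"accuracy={accuracy:.2f} precision={precision:.2f}")
```

Accuracy and precision can diverge, as they do in the reported results: a model can label most trips correctly overall while still being wrong about a noticeable share of the trips it flags as generous.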

View Repository

Advanced ML Projects Coming Soon

Advanced machine learning projects featuring deep learning, NLP, and computer vision applications are currently in development.

Deep Learning, NLP, Computer Vision

Planned

Machine Learning Expertise

Specialized in end-to-end ML solutions from data preprocessing to model deployment

Supervised Learning

Classification and regression models using scikit-learn, XGBoost, and ensemble methods

Feature Engineering

Advanced data preprocessing, feature selection, and dimensionality reduction techniques

Model Optimization

Hyperparameter tuning, cross-validation, and performance evaluation strategies

Python Ecosystem

Pandas, NumPy, scikit-learn, Matplotlib, Seaborn, and Jupyter for data science workflows
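As one possible illustration of the hyperparameter tuning and cross-validation listed above, here is a minimal scikit-learn sketch. The dataset is synthetic, and the parameter grid, fold count, and scoring metric are arbitrary choices made for this example rather than settings taken from any of the projects.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic binary classification data, generated only for this sketch.
X, y = make_classification(
    n_samples=300, n_features=8, n_informative=4, random_state=0
)

# A small illustrative grid: 2 x 2 = 4 candidate settings.
param_grid = {"n_estimators": [25, 50], "max_depth": [3, None]}

# Each candidate is scored with 3-fold cross-validation on precision.
search = GridSearchCV(
    RandomForestClassifier(random_state=0),
    param_grid=param_grid,
    cv=3,
    scoring="precision",
)
search.fit(X, y)

best_params = search.best_params_
best_score = search.best_score_
print(best_params, round(best_score, 3))
```

Choosing the `scoring` metric is part of the evaluation strategy: tuning for precision, recall, or F1 can select different settings from the same grid.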

Tools & Technologies

Programming

Python, SQL, R

Machine Learning

scikit-learn, XGBoost, Random Forest

Data Analysis

Pandas, NumPy, Matplotlib, Seaborn

Platforms

Jupyter, Google Colab, GitHub

Interested in Data Science Solutions?

Looking for machine learning models, predictive analytics, or data science consulting? Let's discuss your project requirements.

Get in Touch