Visit all my Machine Learning projects 


Harnessing the power of machine learning, I have developed expertise in creating intelligent systems that can learn and adapt over time. My experience in machine learning encompasses a wide range of techniques and tools, allowing me to build and deploy models that drive innovation and efficiency. Here’s an overview of my capabilities:

  • Data Preparation and Cleaning: Ensuring the integrity and quality of data, which is the foundation of effective machine learning models.

  • Feature Engineering: Creating meaningful features that enhance the predictive power of models.

  • Model Selection and Training: Choosing the right algorithms and training models using techniques like regression, classification, clustering, and neural networks.

  • Hyperparameter Tuning: Optimizing model performance by fine-tuning hyperparameters.

  • Evaluation Metrics: Using metrics such as accuracy, precision, recall, F1 score, and ROC-AUC to evaluate model performance (illustrated in the sketch below).

  • Model Deployment: Implementing models into production environments using tools like TensorFlow, scikit-learn, and PyTorch.

  • Deep Learning: Designing and training complex neural networks for tasks such as image recognition, natural language processing, and predictive analytics.

  • Continuous Learning: Incorporating feedback loops to allow models to improve over time based on new data.

  • Visualization and Interpretation: Creating visualizations to interpret model outputs and insights for stakeholders.

These skills enable me to build robust machine-learning solutions that can transform data into actionable insights, drive strategic decisions, and innovate within various domains.
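As a small illustration of the evaluation-metrics step listed above, here is a minimal sketch using scikit-learn; the labels and scores are toy placeholders, not results from any of the projects below.

```python
# Minimal sketch of common classification metrics with scikit-learn.
# y_true and y_score are toy placeholders.
from sklearn.metrics import (accuracy_score, precision_score, recall_score,
                             f1_score, roc_auc_score)

y_true = [0, 1, 1, 0, 1, 1, 0, 0]                    # ground-truth labels
y_score = [0.2, 0.8, 0.6, 0.3, 0.9, 0.4, 0.1, 0.7]   # predicted probabilities
y_pred = [1 if s >= 0.5 else 0 for s in y_score]     # thresholded at 0.5

print("Accuracy :", accuracy_score(y_true, y_pred))
print("Precision:", precision_score(y_true, y_pred))
print("Recall   :", recall_score(y_true, y_pred))
print("F1 score :", f1_score(y_true, y_pred))
print("ROC-AUC  :", roc_auc_score(y_true, y_score))  # uses scores, not labels
```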

LendingClub Loan Repayment Prediction


Project Overview

This project focuses on leveraging historical loan data from LendingClub, the world’s largest peer-to-peer lending platform, to build a predictive model that determines whether a borrower will repay their loan. By analyzing key features and trends within the dataset, this model aims to aid in assessing the risk associated with potential borrowers, thereby enhancing decision-making processes for lending.


Objective

Predict whether a borrower will repay their loan to assist in risk assessment and decision-making processes for lenders.


Methods

  • Exploratory Data Analysis: Analyzing the distribution of loan statuses, loan amounts, and installments to uncover patterns and relationships.

  • Data Imbalance Handling: Addressing the significant imbalance in the dataset where most loans are "Fully Paid" compared to "Charged Off".

  • Correlation Analysis: Investigating the relationship between loan amount and installment to validate internal calculation methods.

  • Count Plots: Generating count plots for each grade and subgrade to understand the distribution and correlation with loan statuses.

  • Feature Engineering: Classifying job titles and handling employment duration and mortgage account data to improve model performance.

  • Model Evaluation: Using accuracy, precision, recall, and F1 score to assess model performance, with a focus on handling imbalanced data.
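The imbalance check, correlation analysis, and count plots described above can be sketched roughly as follows; the file name and column names ('loan_status', 'loan_amnt', 'installment', 'grade') are assumptions based on typical LendingClub exports, not confirmed details of this project.

```python
# Rough sketch of the EDA steps above. The file name and column names are
# assumptions based on typical LendingClub exports.
import pandas as pd
import seaborn as sns
import matplotlib.pyplot as plt

df = pd.read_csv('lending_club_loans.csv')  # hypothetical file name

# Class imbalance: share of "Fully Paid" vs. "Charged Off" loans.
print(df['loan_status'].value_counts(normalize=True))

# Correlation between loan amount and installment; a value near 1.0 suggests
# installments are computed directly from the loan amount.
print(df['loan_amnt'].corr(df['installment']))

# Count plot of loan status per grade.
sns.countplot(data=df, x='grade', hue='loan_status',
              order=sorted(df['grade'].unique()))
plt.show()
```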


Key Insights

  • Loan Status Distribution: A significant imbalance with more loans marked as "Fully Paid".

  • Loan Amount and Installment: A strong correlation, suggesting installments are computed from the loan amount by a fixed internal formula.

  • Grade Analysis: Grades further down the letter scale (e.g., F and G) show a higher proportion of "Charged Off" loans.

  • Subgrade Distribution: Predominant categories are A, B, and C, with varying charge-off rates.

  • Employment Duration: Minimal impact on charge-off rates, leading to the removal of this feature.

  • Mortgage Accounts: A notable portion of borrowers have no additional mortgage accounts, making careful handling of missing values crucial.


Conclusion and Model Interpretation

  • Performance Metrics: Accuracy, precision, recall, and F1 score were used for evaluation. The model achieves an accuracy of 89%, surpassing both chance-level predictions and the naive baseline of assuming every loan is repaid.

  • Focus on Imbalanced Labels: Ensuring precision and recall are adequately addressed due to the dataset's imbalance.


This project demonstrates the application of machine learning techniques to predict loan repayment, providing valuable insights for enhancing lending decisions.

Binary Classification of Yelp Reviews Using NLP


Project Overview

This project categorizes Yelp reviews into one-star or five-star ratings, determining user experiences as favorable or unfavorable. We utilized a dataset from Kaggle and employed a pipeline approach with CountVectorizer, TF-IDF Transformer, and MultinomialNB.


Objective

Classify Yelp reviews based on their content to understand customer sentiments.


Methods

  • CountVectorizer: Converts text data into a matrix of token counts.
  • TF-IDF Transformer: Weighs the importance of words in the reviews.
  • MultinomialNB: Applies the Naive Bayes algorithm for classification.
  • Pipeline: Integrates the above steps for streamlined processing.
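A minimal sketch of this pipeline, with two placeholder reviews standing in for the Kaggle dataset:

```python
# Minimal sketch of the text-classification pipeline described above.
# The two reviews below are placeholders for the Kaggle Yelp dataset.
from sklearn.pipeline import Pipeline
from sklearn.feature_extraction.text import CountVectorizer, TfidfTransformer
from sklearn.naive_bayes import MultinomialNB

X = ["Great food and friendly staff!", "Terrible service, never again."]
y = [5, 1]  # star ratings

pipeline = Pipeline([
    ('bow', CountVectorizer()),       # text -> token-count matrix
    ('tfidf', TfidfTransformer()),    # counts -> TF-IDF weights
    ('classifier', MultinomialNB()),  # Naive Bayes classifier
])

pipeline.fit(X, y)
print(pipeline.predict(["The pizza was amazing"]))
```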


Results

  • Accuracy: Achieved an overall accuracy of 81%.
  • Insights: The model excelled in identifying positive feedback (five-star reviews) with notable precision and recall, but struggled with negative feedback (one-star reviews), indicating the need for further optimization.

Conclusion

The pipeline shows promise in capturing positive feedback effectively. However, further refinements are needed to accurately classify negative reviews, ensuring a comprehensive understanding of customer sentiments.

University Classification Using K-means Clustering


Project Overview

In this project, we applied KMeans Clustering to categorize universities into two groups: Private and Public. Despite having labels for the dataset, we disregarded them to leverage the unsupervised nature of KMeans, later using the labels to assess performance. The dataset includes details about 777 institutions in the USA.


Objective

The aim was to create a KMeans model to classify colleges as private or public based purely on their features.


Methods

  • KMeans Clustering: Used to group universities into two clusters.
  • Performance Evaluation: Classification report and confusion matrix to assess accuracy.
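A minimal sketch of this setup; the file name and 'Private' column are assumptions based on the commonly used 777-college dataset:

```python
# Rough sketch of the K-means workflow described above. The file name and
# 'Private' column are assumptions based on the common 777-college dataset.
import pandas as pd
from sklearn.cluster import KMeans
from sklearn.metrics import classification_report, confusion_matrix

df = pd.read_csv('College_Data.csv')           # hypothetical file name
labels = (df['Private'] == 'Yes').astype(int)  # held out; used only for scoring
features = df.drop(columns=['Private']).select_dtypes('number')

# Fit K-means with two clusters, ignoring the labels during training.
km = KMeans(n_clusters=2, n_init=10, random_state=101)
km.fit(features)

# Note: cluster ids are arbitrary, so the 0/1 assignment may need flipping
# before comparing against the true labels.
print(confusion_matrix(labels, km.labels_))
print(classification_report(labels, km.labels_))
```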


Results

  • Accuracy: 78%, with better performance for private colleges (Class 1) over public colleges (Class 0).
  • Insights: The model excelled in classifying private institutions but struggled with public ones, indicating areas for improvement.


Conclusion

This project highlights the potential of KMeans Clustering in organizing unlabeled data into meaningful clusters, demonstrating the algorithm’s strengths and areas needing enhancement.

Random Forest Project: Predicting Loan Repayment




Project Overview

This project explores data from lendingclub.com to develop a predictive model determining whether borrowers will fully repay their loans. We analyzed loan records from 2007 to 2010, considering various borrower characteristics and loan details.


Objective

To forecast loan repayment outcomes and assist investors in making informed lending decisions.


Methods

  • Random Forest Classifier: Used for its ensemble learning capabilities.

  • Data Analysis: Examined features like credit policy, loan purpose, interest rates, and more.

  • Metrics: Evaluated model performance using precision, recall, and F1 scores.
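A minimal sketch of this setup; the file and column names ('loan_data.csv', 'purpose', 'not.fully.paid') are assumptions based on the commonly shared 2007–2010 LendingClub extract:

```python
# Rough sketch of the Random Forest setup described above. The file and column
# names are assumptions based on the commonly shared LendingClub extract.
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report

df = pd.read_csv('loan_data.csv')                              # hypothetical file name
df = pd.get_dummies(df, columns=['purpose'], drop_first=True)  # encode the categorical feature

X = df.drop(columns=['not.fully.paid'])
y = df['not.fully.paid']
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

rfc = RandomForestClassifier(n_estimators=300, random_state=101)
rfc.fit(X_train, y_train)
print(classification_report(y_test, rfc.predict(X_test)))
```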


Results

  • Overall Improvement: Precision, recall, and F1 scores improved across categories relative to the single Decision Tree baseline.

  • Random Forest vs. Decision Tree: The Random Forest performed better on average, though it underperformed in specific areas; notably, the Decision Tree achieved higher recall for Class 1.


Conclusion

The project demonstrated the strengths and limitations of the Random Forest model in predicting loan repayment. While it showed overall better performance, specific areas need optimization. The choice between Random Forest and Decision Tree depends on the preferred metric and business context.

Advertising Project: Predicting Ad Clicks Using Logistic Regression



Project Overview

This project analyzes advertising data to predict whether a user will click on an ad by examining their attributes. The goal is to help advertisers target their audience more effectively.


Objective

To build a predictive model determining if a user is likely to click on an ad based on features such as daily time spent on site, age, area income, daily internet usage, and more.


Methods

  • Logistic Regression: Used for binary classification to predict ad clicks.

  • Feature Analysis: Includes daily time spent on site, age, area income, daily internet usage, and other user attributes.
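A minimal sketch of this setup; the file and column names follow the commonly used advertising dataset and should be treated as assumptions:

```python
# Rough sketch of the logistic-regression setup described above. The file and
# column names follow the commonly used advertising dataset (assumptions).
import pandas as pd
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.metrics import classification_report, confusion_matrix

ad = pd.read_csv('advertising.csv')  # hypothetical file name
X = ad[['Daily Time Spent on Site', 'Age', 'Area Income',
        'Daily Internet Usage', 'Male']]  # assumed column names
y = ad['Clicked on Ad']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

logmodel = LogisticRegression(max_iter=1000)
logmodel.fit(X_train, y_train)

y_pred = logmodel.predict(X_test)
print(confusion_matrix(y_test, y_pred))
print(classification_report(y_test, y_pred))
```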


Results

  • Accuracy: Achieved approximately 93% across accuracy, precision, and recall.

  • Insights: The confusion matrix showed a few misclassifications, but overall, the model performed well. The pair plot indicated strong separation between the two classes (click and no click) for certain features.


Conclusion

The model demonstrated robust performance with a 93% accuracy rate, effectively predicting ad clicks based on user attributes. The project highlights the potential of logistic regression in targeted advertising and provides valuable insights for optimizing ad placements.


Recipe Traffic Prediction and Analysis


Project Overview

This project focused on predicting and analyzing website traffic for recipes, utilizing data validation, exploratory data analysis (EDA), and machine learning models to draw meaningful insights and enhance decision-making processes.


Objective

To build a predictive model to determine which recipes will lead to high website traffic based on features such as recipe type, nutritional content, servings, and more.


Methods

  • Data Validation and Cleaning: Validated and cleaned 947 rows and 8 columns, ensuring no rows were removed and all missing values were handled appropriately.

  • Exploratory Data Analysis (EDA): Applied EDA techniques to gain insights, visualize findings, and identify key relationships between features.

  • Logistic Regression: Chosen as the base model for its accuracy, simplicity, and efficiency.

  • AdaBoost Classifier: Used as a comparison model to evaluate performance differences.
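A minimal sketch of this base-vs-comparison evaluation; since the recipe dataset is not reproduced here, a synthetic stand-in with the same shape (947 rows, 8 columns) is generated instead:

```python
# Rough sketch of the base-vs-comparison model evaluation described above.
# make_classification stands in for the real recipe data (947 rows, 8 columns
# per the validation step); the real preprocessing is not reproduced here.
from sklearn.datasets import make_classification
from sklearn.ensemble import AdaBoostClassifier
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import f1_score, roc_auc_score
from sklearn.model_selection import train_test_split

X, y = make_classification(n_samples=947, n_features=8, random_state=42)
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=42)

for name, model in [('Logistic Regression (base)', LogisticRegression(max_iter=1000)),
                    ('AdaBoost (comparison)', AdaBoostClassifier(random_state=42))]:
    model.fit(X_train, y_train)
    proba = model.predict_proba(X_test)[:, 1]  # scores for ROC AUC
    preds = model.predict(X_test)
    print(f"{name}: ROC AUC={roc_auc_score(y_test, proba):.3f}, "
          f"F1={f1_score(y_test, preds):.3f}")
```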


Results

  • ROC AUC Score: Base model achieved an 84% ROC AUC score, surpassing the 80% requirement, while the comparison model scored 85%.

  • Precision and Recall: Both models performed well with values around 0.85.

  • F1-Score: Base model and comparison model reached 0.84 and 0.85, respectively.

  • Mean Squared Error (MSE): Base model and comparison model had MSE values of 0.151 and 0.158, indicating good performance.

  • Insights:

    • Popular recipes leading to high traffic included Vegetable, Potato, and Pork dishes.

    • Categories like 'Vegetable' had traffic rates as high as 99%, while 'Beverage' had the lowest at 5%.

    • Serving sizes did not show significant correlation with recipe popularity.


Conclusion

The Logistic Regression model demonstrated robust performance with an 84% ROC AUC score, effectively predicting high-traffic recipes based on various features. This project highlighted the potential of logistic regression in understanding recipe popularity and provided valuable insights for optimizing website traffic.


Recommendations

  • Enhanced Tracking: Use business metrics to monitor site traffic changes.

  • Data Categorization: Break down site traffic into more categories for improved predictions.

  • User Accounts: Implement user accounts to track activities and enhance models.

  • Data Expansion: Increase data volume to refine model accuracy.

  • User Feedback: Enable ratings and feedback on recipes to benefit model predictions.


Predicting Titanic Survival Using Logistic Regression


Project Overview

This project focuses on analyzing Titanic passenger data to predict whether a passenger survived the disaster by examining various attributes. The goal is to build a robust predictive model and identify key factors affecting survival rates.


Objective

To construct a predictive model that determines the likelihood of a passenger surviving based on features such as age, gender, passenger class, number of siblings/spouses, number of parents/children, and fare amount.


Methods

  • Logistic Regression: Employed for binary classification to predict survival.

  • Data Imputation: Filled missing age values with the mean age based on passenger class.

  • Data Cleaning: Handled missing values in the Age and Cabin columns, removing or transforming unusable data.

  • Feature Analysis: Analyzed key features such as age, passenger class, and gender to understand their impact on survival rates.

  • Data Visualization: Used Seaborn and Matplotlib to generate heatmaps, count plots, and other visual aids for data analysis.
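A minimal sketch of the class-based age imputation described above, assuming the standard Titanic training CSV with 'Age', 'Pclass', and 'Cabin' columns:

```python
# Rough sketch of the class-based age imputation described above, assuming the
# standard Titanic training CSV with 'Age', 'Pclass', and 'Cabin' columns.
import pandas as pd

train = pd.read_csv('titanic_train.csv')  # hypothetical file name

# Fill missing ages with the mean age of the passenger's class.
class_mean_age = train.groupby('Pclass')['Age'].transform('mean')
train['Age'] = train['Age'].fillna(class_mean_age)

# 'Cabin' is mostly missing, so drop it rather than impute.
train = train.drop(columns=['Cabin'])
print(train.isnull().sum())  # verify remaining missing values
```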


Results

  • Accuracy: Achieved an accuracy of 86% and an F1 score of 82%.

  • Insights: The model revealed significant survival predictors, with a clear distinction in survival rates based on gender and passenger class. The pair plots indicated strong separation between survived and not survived classes for certain features.


Conclusion

The model demonstrated robust performance with an 86% accuracy rate, effectively predicting survival based on passenger attributes. The project underscores the effectiveness of logistic regression in classification tasks and provides valuable insights into the factors influencing survival rates on the Titanic.

E-commerce Retail Company Analysis Using Linear Regression


Project Overview

This project focuses on analyzing data from an online retail company based in New York, which specializes in selling clothing both online and through brick-and-mortar stores. Customers have the option to place their orders using the company's mobile app or website, and can also participate in face-to-face consultations with fashion experts at the store. The company is debating whether to emphasize its website or its mobile application.


Objective

To build a predictive model determining if a customer is likely to spend more based on features such as average session length, time on the app, time on the website, and length of membership.


Methods

  • Libraries Used: Pandas, NumPy, Matplotlib, and Seaborn.

  • Data Preparation: The CSV file includes customer details such as Email, Address, Avatar color, and numerical columns like:

    • Average Session Length: Average duration of in-store style-advice sessions.

    • Time on App: Average time spent on the app in minutes.

    • Time on Website: Average time spent on the website in minutes.

    • Length of Membership: How many years the customer has been a member.

  • Visual Analysis: Seaborn was utilized to generate a joint plot for comparing Time on Website and Yearly Amount Spent. The scatter plot shows a correlation between these variables.

  • Data Splitting: The data was divided into training and testing sets using model_selection.train_test_split from sklearn with a test size of 0.3 and a random state of 101.

  • Model Training: Linear Regression from sklearn.linear_model was used to train the model with the training data. The model was initialized and assigned to the variable lm.
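A minimal sketch of the split and training steps described above; the file and column names are assumptions based on the Methods list:

```python
# Rough sketch of the split/train steps described above. The file and column
# names are assumptions based on the Methods list.
import pandas as pd
from sklearn.linear_model import LinearRegression
from sklearn.model_selection import train_test_split

customers = pd.read_csv('Ecommerce Customers.csv')  # hypothetical file name
X = customers[['Avg. Session Length', 'Time on App',
               'Time on Website', 'Length of Membership']]  # assumed names
y = customers['Yearly Amount Spent']

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.3, random_state=101)

lm = LinearRegression()
lm.fit(X_train, y_train)

# Each coefficient is the expected change in yearly spend per one-unit increase.
print(pd.DataFrame(lm.coef_, index=X.columns, columns=['Coefficient']))
```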


Results

  • Model Performance: The model's performance was assessed by comparing actual test outcomes with predicted values. A scatter plot revealed a clear alignment of data points along a straight line, indicating a high degree of precision in the model. The model was considered excellent with an R² value close to 1, as its variance score was 0.98.

  • Coefficients Insights:

    • A one-unit increase in average session length corresponds to roughly $26 more in yearly spending.

    • A one-unit increase in time on the app corresponds to roughly $38 more.

    • A one-unit increase in time on the website corresponds to only about $0.19 more.

    • An additional year of membership corresponds to a substantial increase of around $61.


Conclusion

The model demonstrated robust performance with an R² value of 0.98, effectively predicting customer spending based on various attributes. Analyzing the coefficients suggests that the company could either improve the website to match the app's performance or further invest in the app, which is already performing significantly better. The decision depends on various factors within the company, and investigating the relationship between length of membership and usage of the app or website could provide additional insights.