Manjit Singh

Data Science Portfolio

Education

Executive Post Graduate Program in Data Science & AI
IIIT Bangalore
Oct ’23 – Jan ’25

Bachelor of Commerce (B.Com)
Gauhati University
Jul ’20 – Jun ’23

Professional Experience

Associate - Data Analyst
Samhita | Remote | Nov 2024 – Present

At Samhita, I support data-driven decision-making for financial inclusion projects by managing datasets, uncovering insights, and automating processes across multiple programs. I work closely with program, tech, and MEL teams to ensure data flows efficiently from collection to actionable insight.

• Key Responsibilities:

🧹 Data Management & Reporting:

Maintain and update datasets across Access to Finance initiatives including Pre-Credit Score (PCS), Returnable Grants, and Formal Loans.
Consolidate repayment data from various sources and support data cleaning, validation, and standardization.

📊 Dashboarding & Visualization:

Assist in building and maintaining dashboards for internal monitoring and external reporting.
Support real-time updates and enhancements for evolving project needs.

🔍 Data Analysis & Insight Generation:

Conduct exploratory data analysis (EDA) to identify trends and anomalies.
Extract, transform, and analyze data from SQL databases to inform strategic decisions.

⚙️ Process Automation & GenAI Adoption:

Develop automation scripts using Python, SQL, and VBA to streamline recurring data workflows.
Contribute to identifying and piloting GenAI use cases to enhance team productivity.

🤝 Cross-Team Collaboration:

Coordinate with cross-functional teams to respond to data requests and support grant reports, donor updates, and tool development.
Assist in the design of data collection formats and tracking templates for improved monitoring.

Internships

Business Analyst Intern
Quest Global Technologies Ltd | Remote | Nov 2023 – May 2024

Led the automation of reporting workflows using Excel, SQL, and visualization tools such as Power BI, enabling real-time KPI tracking and reducing manual effort by 10%.
Designed and analyzed A/B testing strategies to evaluate marketing effectiveness, resulting in a 5% boost in conversion rates and improved ROI.
Applied data cleaning, segmentation, and exploratory analysis to uncover business patterns and improve customer targeting.
Collaborated with product and sales teams to deliver data-driven insights, enhancing operational decision-making and customer engagement.
Created reproducible reports, dashboards, and documentation to support strategic planning and ensure data transparency across teams.

Key Projects

🛍️ Sentiment-Based Product Recommendation System

Tech Stack: Python, Flask, Scikit-learn, Pandas, NLTK, HTML/CSS

Built an intelligent recommendation engine that combines collaborative filtering with sentiment analysis to generate more accurate and personalized product suggestions.

Problem Addressed: Traditional recommender systems often overlook the sentiment expressed in user reviews, which can lead to irrelevant recommendations.
Solution Overview:
- Performed text preprocessing using NLTK and extracted features using TF-IDF vectorization.
- Applied Logistic Regression to classify the sentiment (positive/negative) of product reviews.
- Enhanced collaborative filtering by integrating sentiment scores to refine recommendation relevance.
- Developed a Flask-based web app to interactively serve recommendations based on user input.
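
A minimal sketch of the sentiment-integration step, assuming the collaborative-filtering scores are already scaled to [0, 1]. The function name `rerank_by_sentiment`, the blending weight `alpha`, and the sample numbers are hypothetical and only illustrate the idea; they are not taken from the project code.

```python
def rerank_by_sentiment(cf_scores, positive_share, alpha=0.7, top_n=3):
    """Blend collaborative-filtering scores with review sentiment.

    cf_scores: product -> predicted rating scaled to [0, 1] (assumed input)
    positive_share: product -> fraction of reviews classified as positive
    alpha: weight on the collaborative-filtering score (assumed value)
    """
    blended = {
        product: alpha * score + (1 - alpha) * positive_share.get(product, 0.0)
        for product, score in cf_scores.items()
    }
    # Highest blended score first; ties broken alphabetically for determinism.
    ranked = sorted(blended.items(), key=lambda item: (-item[1], item[0]))
    return [product for product, _ in ranked[:top_n]]


# Made-up scores: the kettle rates well but its reviews are mostly negative.
cf_scores = {"kettle": 0.90, "toaster": 0.85, "blender": 0.80}
positive_share = {"kettle": 0.40, "toaster": 0.95, "blender": 0.90}
print(rerank_by_sentiment(cf_scores, positive_share, top_n=2))
```

With these toy inputs the kettle drops out of the top two even though it has the highest collaborative-filtering score, which is the behavior the sentiment adjustment is meant to produce.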

✅ Impact: Improved recommendation precision by aligning suggestions with user sentiment, resulting in a more engaging and tailored user experience.

📄 Helmate AI: Retrieval-Augmented Generation (RAG) System

Tech Stack: Python, LangChain, GROQ AI, SBERT, ChromaDB

Developed a cost-efficient, open-source RAG pipeline to accurately respond to user queries based on insurance policy documents, enhancing the accessibility and understanding of complex policy information.

Problem Addressed: Navigating large volumes of insurance documentation can be tedious for users looking for specific policy-related answers.
Solution Overview:
- Implemented semantic search using SBERT embeddings to retrieve contextually relevant chunks.
- Used ChromaDB as the vector store for efficient document retrieval.
- Integrated LangChain to orchestrate the pipeline and GROQ AI as the LLM for generating natural, context-aware responses.
- Optimized the workflow to ensure fast, accurate, and scalable information retrieval across diverse policy documents.
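
A dependency-free sketch of the retrieval step, assuming each document chunk already has an embedding. The three-dimensional vectors and policy sentences below are made up and stand in for real SBERT embeddings stored in ChromaDB; `retrieve` is a hypothetical helper, not a LangChain or ChromaDB API.

```python
import math


def cosine_similarity(a, b):
    """Cosine of the angle between two equal-length vectors."""
    dot = sum(x * y for x, y in zip(a, b))
    norm_a = math.sqrt(sum(x * x for x in a))
    norm_b = math.sqrt(sum(y * y for y in b))
    return dot / (norm_a * norm_b) if norm_a and norm_b else 0.0


def retrieve(query_vec, chunks, k=2):
    """Return the k chunk texts whose embeddings are closest to the query."""
    scored = [(cosine_similarity(query_vec, vec), text) for text, vec in chunks]
    scored.sort(key=lambda pair: pair[0], reverse=True)
    return [text for _, text in scored[:k]]


# Toy vectors standing in for real sentence embeddings.
chunks = [
    ("Claims must be filed within 30 days.", [0.9, 0.1, 0.0]),
    ("Premiums are due monthly.", [0.1, 0.9, 0.0]),
    ("Claim forms are available online.", [0.8, 0.2, 0.1]),
]
query_vec = [1.0, 0.0, 0.0]  # pretend embedding of "How do I file a claim?"
context = retrieve(query_vec, chunks, k=2)

# The retrieved chunks are then placed in the prompt sent to the LLM.
prompt = "Answer using only this context:\n" + "\n".join(context)
print(prompt)
```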

✅ Impact: Delivered a scalable and interpretable RAG system with improved query response accuracy—built entirely on open-source tools for maximum accessibility and zero API cost.

🧾 Automatic Ticket Classification System

Tech Stack: Python, Scikit-learn, NLTK, Pandas

Built an intelligent automated ticket classification system to streamline customer support by accurately routing tickets to the relevant departments.

Problem Addressed: Manual triaging of support tickets was time-consuming and error-prone, often delaying response times and affecting customer satisfaction.
Solution Overview:
- Applied text preprocessing and NLP techniques (tokenization, stopword removal, lemmatization) using NLTK.
- Performed topic modeling with Non-negative Matrix Factorization (NMF) to uncover hidden structures in unstructured ticket texts.
- Built and evaluated multiple classifiers using Scikit-learn for efficient ticket label prediction based on content.
- Developed a reproducible Jupyter Notebook pipeline, allowing for future scaling and integration.
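
To illustrate the topic-modeling step, here is a small hand-rolled NMF using the standard multiplicative update rules. It is a dependency-free stand-in for Scikit-learn's NMF, and the document-term counts and the choice of two topics are made up for the example.

```python
import random


def matmul(a, b):
    """Multiply two matrices stored as lists of rows."""
    return [[sum(x * y for x, y in zip(row, col)) for col in zip(*b)] for row in a]


def transpose(m):
    return [list(col) for col in zip(*m)]


def frobenius_error(v, w, h):
    """Squared reconstruction error ||V - WH||^2."""
    wh = matmul(w, h)
    return sum((v[i][j] - wh[i][j]) ** 2 for i in range(len(v)) for j in range(len(v[0])))


def nmf(v, n_topics, n_iter=200, seed=0, eps=1e-9):
    """Factor V (docs x terms) into non-negative W (docs x topics) and H (topics x terms)."""
    rng = random.Random(seed)
    n_docs, n_terms = len(v), len(v[0])
    w = [[rng.random() + 0.1 for _ in range(n_topics)] for _ in range(n_docs)]
    h = [[rng.random() + 0.1 for _ in range(n_terms)] for _ in range(n_topics)]
    for _ in range(n_iter):
        # H <- H * (W^T V) / (W^T W H), element-wise
        wt = transpose(w)
        num, den = matmul(wt, v), matmul(matmul(wt, w), h)
        h = [[h[k][j] * num[k][j] / (den[k][j] + eps) for j in range(n_terms)]
             for k in range(n_topics)]
        # W <- W * (V H^T) / (W H H^T), element-wise
        ht = transpose(h)
        num, den = matmul(v, ht), matmul(w, matmul(h, ht))
        w = [[w[i][k] * num[i][k] / (den[i][k] + eps) for k in range(n_topics)]
             for i in range(n_docs)]
    return w, h


# Toy counts for the terms ["card", "payment", "loan", "interest"].
doc_term = [[2, 1, 0, 0], [1, 2, 0, 0], [0, 0, 2, 1], [0, 0, 1, 2]]
w, h = nmf(doc_term, n_topics=2)
# Each ticket is assigned to the topic with the largest weight in W.
dominant_topic = [max(range(2), key=lambda k: row[k]) for row in w]
print(dominant_topic)
```

The dominant topic per ticket can then serve as a label for training the supervised classifiers.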

✅ Impact: Automated classification significantly reduced manual workload, improved response speed, and laid a foundation for intelligent support systems.

🧩 Customer Segmentation using Clustering

Tech Stack: Python, Scikit-learn, Pandas, Seaborn, Matplotlib

Implemented a robust unsupervised learning pipeline to segment customers based on purchasing behavior for targeted marketing and business strategy enhancement.

Objective: Help businesses uncover meaningful customer groups by analyzing transaction data and identifying key behavioral patterns.
Approach:
- Performed data cleaning, feature engineering, and RFM analysis (Recency, Frequency, Monetary) to derive meaningful metrics.
- Applied K-Means and Hierarchical Clustering to segment customers, and validated the clusters using silhouette scores and dendrograms.
- Visualized insights through heatmaps, cluster plots, and bar charts to help decision-makers understand segment characteristics.
- Delivered actionable recommendations for tailored engagement, retention strategies, and personalized offers.
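
A small sketch of the RFM derivation, assuming transactions arrive as `(customer_id, purchase_date, amount)` tuples. The helper name `rfm_table`, the snapshot date, and the sample rows are hypothetical.

```python
from collections import defaultdict
from datetime import date


def rfm_table(transactions, snapshot):
    """Compute (recency_days, frequency, monetary) per customer.

    transactions: iterable of (customer_id, purchase_date, amount)
    snapshot: reference date from which recency is measured
    """
    last_purchase = {}
    frequency = defaultdict(int)
    monetary = defaultdict(float)
    for customer, day, amount in transactions:
        last_purchase[customer] = max(day, last_purchase.get(customer, day))
        frequency[customer] += 1
        monetary[customer] += amount
    return {
        customer: (
            (snapshot - last_purchase[customer]).days,
            frequency[customer],
            monetary[customer],
        )
        for customer in last_purchase
    }


# Made-up transactions for two customers.
transactions = [
    ("C1", date(2024, 1, 5), 120.0),
    ("C1", date(2024, 3, 1), 80.0),
    ("C2", date(2023, 11, 20), 40.0),
]
rfm = rfm_table(transactions, snapshot=date(2024, 3, 31))
print(rfm)
```

The three metrics are typically scaled before being passed to K-Means, since monetary values would otherwise dominate the distance calculation.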

✅ Impact: Enabled data-driven customer targeting and improved marketing ROI through insight-backed segmentation.

🎬 RSVP Movies Analysis – SQL-Based Data Exploration

Tech Stack: MySQL, SQL Joins, Aggregations, Subqueries

Performed a comprehensive SQL analysis on the RSVP Movies dataset to derive actionable insights into movie performance, viewer behavior, and content trends.

Objective: Analyze a real-world movie dataset to uncover trends related to movie views, genres, languages, platforms, and user engagement.
Approach:
- Wrote optimized SQL queries to explore viewer demographics, most-watched genres/languages, and regional popularity.
- Used joins, window functions, subqueries, and aggregations to extract multi-dimensional insights across different entity tables.
- Investigated monthly trends in user ratings and view counts to identify periods of peak engagement.
- Segmented data by platform and genre to assist content curation decisions for OTT platforms.
- Delivered conclusions in a structured manner for use in dashboards or further analysis.
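
The query pattern can be sketched with Python's built-in `sqlite3` module as a stand-in for MySQL. The table names, columns, and rows below are illustrative assumptions rather than the actual RSVP Movies schema; the query combines a join, an aggregation, a subquery, and a window function.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
    CREATE TABLE movie (id INTEGER PRIMARY KEY, title TEXT);
    CREATE TABLE genre (movie_id INTEGER, genre TEXT);
    CREATE TABLE ratings (movie_id INTEGER, avg_rating REAL);
    INSERT INTO movie VALUES (1, 'A'), (2, 'B'), (3, 'C');
    INSERT INTO genre VALUES (1, 'Drama'), (2, 'Drama'), (3, 'Comedy');
    INSERT INTO ratings VALUES (1, 8.0), (2, 6.0), (3, 7.5);
""")

# Inner query: join the tables and aggregate per genre.
# Outer query: rank genres by mean rating with a window function.
query = """
    SELECT genre,
           movie_count,
           mean_rating,
           RANK() OVER (ORDER BY mean_rating DESC) AS rating_rank
    FROM (
        SELECT g.genre           AS genre,
               COUNT(*)          AS movie_count,
               AVG(r.avg_rating) AS mean_rating
        FROM movie AS m
        JOIN genre AS g ON g.movie_id = m.id
        JOIN ratings AS r ON r.movie_id = m.id
        GROUP BY g.genre
    ) AS per_genre
    ORDER BY rating_rank
"""
rows = conn.execute(query).fetchall()
for row in rows:
    print(row)
conn.close()
```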

✅ Impact: Enabled data-driven decisions for movie curation, marketing strategies, and platform-specific content optimization through insightful SQL reporting.

📉 Telecom Churn Prediction

Tech Stack: Python, Scikit-learn, Pandas, Seaborn, Matplotlib

Developed a predictive machine learning model to identify telecom customers likely to churn, enabling proactive retention strategies.

Objective: Enable telecom providers to reduce customer churn by predicting at-risk customers from historical usage and service data.
Approach:
- Conducted extensive EDA to uncover trends and patterns impacting churn (e.g., service issues, contract type, tenure).
- Handled missing values, encoded categorical features, and applied feature selection techniques to improve model performance.
- Trained and compared multiple classification models (Logistic Regression, Decision Trees, Random Forest, XGBoost), tuning hyperparameters for optimal results.
- Evaluated performance using a confusion matrix, ROC-AUC, precision, and recall to ensure model reliability and interpretability.
- Visualized key insights with bar plots, correlation heatmaps, and model comparison charts.
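
For reference, the evaluation metrics can be computed by hand as below. This is a plain-Python sketch rather than the Scikit-learn calls used in the project; the labels, scores, and the 0.5 threshold are made up.

```python
def precision_recall(y_true, y_pred):
    """Precision and recall for the positive (churn = 1) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 1)
    fp = sum(1 for t, p in zip(y_true, y_pred) if t == 0 and p == 1)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t == 1 and p == 0)
    precision = tp / (tp + fp) if tp + fp else 0.0
    recall = tp / (tp + fn) if tp + fn else 0.0
    return precision, recall


def roc_auc(y_true, scores):
    """Probability that a random churner outscores a random non-churner."""
    pos = [s for t, s in zip(y_true, scores) if t == 1]
    neg = [s for t, s in zip(y_true, scores) if t == 0]
    # Ties count as half a win, matching the usual AUC definition.
    wins = sum(1.0 if p > n else 0.5 if p == n else 0.0 for p in pos for n in neg)
    return wins / (len(pos) * len(neg))


# Made-up labels and predicted churn probabilities.
y_true = [1, 0, 1, 0, 1, 0]
scores = [0.9, 0.2, 0.6, 0.7, 0.4, 0.1]
y_pred = [1 if s >= 0.5 else 0 for s in scores]

precision, recall = precision_recall(y_true, y_pred)
auc = roc_auc(y_true, scores)
print(f"precision={precision:.2f} recall={recall:.2f} auc={auc:.2f}")
```

Recall matters most when missing a churner is costlier than contacting a loyal customer, while ROC-AUC summarizes ranking quality independently of the threshold.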

✅ Impact: Equipped stakeholders with a reliable churn prediction system, enabling targeted interventions and improved customer retention.

🎯 Lead Scoring Model

Tech Stack: Python, Scikit-learn, Pandas, Seaborn, Matplotlib

Built a machine learning model to identify and prioritize marketing leads with the highest potential to convert, helping businesses optimize their sales funnel.

Objective: Streamline the lead qualification process by scoring leads based on behavioral and demographic attributes to maximize conversion rates.
Approach:
- Performed thorough data cleaning, feature engineering, and outlier treatment to ensure data quality.
- Conducted EDA to uncover key patterns, such as lead source, total time spent, and page activity influencing conversion likelihood.
- Applied Logistic Regression and Random Forest classifiers with grid search-based hyperparameter tuning.
- Evaluated the model using ROC-AUC, F1-score, and a confusion matrix to balance precision and recall for business decisions.
- Presented key takeaways through visual dashboards and feature importance charts for actionable stakeholder insights.
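
A sketch of how a fitted logistic regression turns lead attributes into a 0-100 score. The feature names, coefficients, and intercept are hypothetical placeholders, not values from the trained model.

```python
import math


def lead_score(features, weights, intercept):
    """Map a lead's features to a 0-100 score via the logistic function."""
    z = intercept + sum(weights[name] * value for name, value in features.items())
    probability = 1.0 / (1.0 + math.exp(-z))
    return round(100 * probability)


# Hypothetical coefficients; a real model would learn these from data.
weights = {"time_on_site_min": 0.1, "pages_viewed": 0.2}
intercept = -2.0

leads = {
    "lead_a": {"time_on_site_min": 10, "pages_viewed": 5},
    "lead_b": {"time_on_site_min": 30, "pages_viewed": 10},
    "lead_c": {"time_on_site_min": 0, "pages_viewed": 0},
}
scored = {name: lead_score(feats, weights, intercept) for name, feats in leads.items()}

# Sales teams would work the list from the highest score down.
priority = sorted(scored, key=scored.get, reverse=True)
print(scored, priority)
```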

✅ Impact: Enabled targeted sales efforts by automating lead prioritization, thereby reducing time-to-conversion and improving ROI.

🚗 Vehicle Dataset - Advanced EDA & Insights

Tech Stack: Python, Pandas, Seaborn, Matplotlib, Plotly

Conducted an in-depth Exploratory Data Analysis (EDA) on a vehicle dataset to uncover trends, patterns, and actionable insights in the automotive space.

Objective: Explore and visualize key factors influencing vehicle pricing, fuel efficiency, and design to support data-driven decisions for buyers and sellers.
Approach:
- Handled missing values, duplicates, and formatted features for consistency.
- Applied univariate, bivariate, and multivariate analysis to understand the influence of attributes like engine size, brand, fuel type, and body style.
- Leveraged correlation matrices, pair plots, violin plots, and box plots to visually interpret data relationships.
- Used interactive Plotly dashboards for deeper insights and stakeholder-friendly visualizations.
- Summarized insights with clear narratives and business implications after each section to make the notebook self-explanatory.

✅ Impact: Revealed strong correlations between engine size, brand, and price; highlighted the impact of fuel type on efficiency; and created a foundation for future predictive modeling.

🚴‍♂️ Bike Sharing Demand Prediction – Linear Regression Model

Tech Stack: Python, Pandas, Seaborn, Matplotlib, Scikit-learn

Built a linear regression model to predict bike rental demand based on historical usage patterns and environmental conditions.

Objective: Analyze factors influencing daily bike rentals and develop a predictive model to estimate rental counts for a bike-sharing service.
Approach:
- Conducted EDA to understand seasonal trends, weather impact, and user behavior across weekdays vs. weekends.
- Applied feature engineering to extract relevant time-based and categorical features (e.g., working day, holiday, temperature bins).
- Built and evaluated a multiple linear regression model, interpreting coefficients to assess variable influence.
- Used R² score, RMSE, and residual plots to assess model accuracy and refine performance.
- Visualized relationships with scatter plots and heatmaps for better insight into dependencies.
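
The fit and its evaluation metrics can be sketched for a single predictor as below. The project used multiple regression in Scikit-learn; this closed-form one-feature version and its toy numbers are only an illustrative simplification.

```python
import math


def fit_line(xs, ys):
    """Ordinary least squares for y = intercept + slope * x."""
    n = len(xs)
    mean_x, mean_y = sum(xs) / n, sum(ys) / n
    covariance = sum((x - mean_x) * (y - mean_y) for x, y in zip(xs, ys))
    variance_x = sum((x - mean_x) ** 2 for x in xs)
    slope = covariance / variance_x
    return slope, mean_y - slope * mean_x


def r2_and_rmse(xs, ys, slope, intercept):
    """Coefficient of determination and root mean squared error."""
    mean_y = sum(ys) / len(ys)
    ss_res = sum((y - (intercept + slope * x)) ** 2 for x, y in zip(xs, ys))
    ss_tot = sum((y - mean_y) ** 2 for y in ys)
    return 1 - ss_res / ss_tot, math.sqrt(ss_res / len(ys))


# Toy data: temperature bin (1 = coldest) vs. daily rentals in hundreds.
temperature_bin = [1, 2, 3, 4, 5]
rentals_hundreds = [2, 4, 5, 4, 5]

slope, intercept = fit_line(temperature_bin, rentals_hundreds)
r2, rmse = r2_and_rmse(temperature_bin, rentals_hundreds, slope, intercept)
print(f"slope={slope:.2f} intercept={intercept:.2f} R2={r2:.2f} RMSE={rmse:.2f}")
```

The slope is read as the expected change in rentals per one-bin increase in temperature, which is the kind of coefficient interpretation described in the approach.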

✅ Impact: Provided an interpretable model that helps stakeholders anticipate rental demand and make informed operational decisions such as fleet planning and staff allocation.

💳 Credit Data Analysis – Deep Dive into Borrower Behavior

Tech Stack: Python, Pandas, Seaborn, Matplotlib, Plotly

Performed a comprehensive Exploratory Data Analysis (EDA) on a credit dataset to understand patterns related to credit risk, borrower profiles, and loan characteristics.

Objective: Investigate how factors like age, income, loan purpose, credit history, and employment status affect loan approval and creditworthiness.
Approach:
- Conducted data cleaning, feature formatting, and null value handling for analysis readiness.
- Performed univariate and bivariate analysis to explore distributions and identify key differentiators among defaulters and non-defaulters.
- Utilized heatmaps, bar plots, histograms, and interactive visualizations to reveal correlations and highlight risk indicators.
- Analyzed demographic and financial features in relation to loan status and provided clear, intuitive visual storytelling.
- Integrated domain-relevant business interpretations to derive real-world insights from patterns in the data.
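
As an illustration of the bivariate step, the sketch below computes the default rate per borrower segment. The field names (`employment`, `defaulted`) and the records are made up for the example.

```python
from collections import defaultdict


def default_rate_by(records, key):
    """Share of defaulted loans within each value of the given field."""
    totals = defaultdict(int)
    defaults = defaultdict(int)
    for record in records:
        totals[record[key]] += 1
        defaults[record[key]] += record["defaulted"]
    return {segment: defaults[segment] / totals[segment] for segment in totals}


# Made-up borrower records; defaulted is 1 for a default, 0 otherwise.
records = [
    {"employment": "salaried", "defaulted": 0},
    {"employment": "salaried", "defaulted": 0},
    {"employment": "salaried", "defaulted": 1},
    {"employment": "salaried", "defaulted": 0},
    {"employment": "self-employed", "defaulted": 1},
    {"employment": "self-employed", "defaulted": 0},
]
rates = default_rate_by(records, "employment")
print(rates)
```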

✅ Impact: Enabled a clearer understanding of high-risk segments, borrower behavior, and key attributes associated with loan defaults—laying groundwork for future predictive modeling.
