Rudra Prasad Bhuyan

Aspiring Data Scientist | ML Engineer

Electrical engineering background. Self-taught data science practitioner building end-to-end machine learning systems under real-world constraints.

Focused on practical, systems-thinking solutions, not just models. I solve real problems using data.

Resume GitHub LinkedIn

Skills & Expertise

Languages: Python, SQL

Data Analysis: Pandas, NumPy

Visualization: Matplotlib, Seaborn, Plotly

ML Frameworks: PyTorch, TensorFlow, Scikit-learn, XGBoost

MLOps & Tools: Docker, MLflow, Git, FastAPI

Work Experience

Junior Data Scientist Intern @ SBC Labs

Nov 2025 - Feb 2026

Developed a structured data analysis framework in Python for an India-wide open-source dataset with 400+ features, identifying high-impact variables and reducing feature redundancy.
Improved data quality through extensive cleaning and preprocessing in Python, turning messy raw data into model-ready datasets and making downstream analytics more efficient.
Built a state-level interactive dashboard with Python visualization libraries, giving stakeholders focused insights for one state and a clearer basis for decisions.
Created a comprehensive Feature Requirement & Documentation Sheet that reduced dataset confusion and standardized feature interpretation across the team.
Delivered structured weekly reports to technical and non-technical team members, translating complex data findings into actionable insights for project decisions.

Electrical Engineering Intern @ Tata Power

June 2025 - July 2025

Gained hands-on exposure to electricity distribution operations, including smart meter functionality, maintenance workflows, and field service processes.
Observed and analyzed the end-to-end digital complaint management system, including online application handling, user request processing, and new connection workflows.
Studied the field-to-system integration in which service updates and on-site photographic evidence were recorded and synchronized within Tata Power’s internal operations platform.

Open Source

show-file-tree

A small, fast CLI tool to display styled file/folder trees with rich options, colors, icons, and metadata.

PyPI Docs GitHub

find-my-joint

A utility to find potential join keys (matching columns) across multiple pandas DataFrames.

PyPI Docs GitHub
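The idea behind join-key discovery can be illustrated with a minimal sketch. This is not the find-my-joint API: the function name `candidate_join_keys`, the `min_overlap` parameter, and the sample frames are all hypothetical, and the sketch only considers columns that share a name.

```python
from itertools import combinations

import pandas as pd


def candidate_join_keys(frames, min_overlap=0.5):
    """Return (left, right, column, overlap) for same-named columns with shared values.

    Hypothetical helper for illustration only. `overlap` is the share of the
    smaller column's distinct values that also appear in the other column.
    """
    results = []
    for (left_name, left), (right_name, right) in combinations(frames.items(), 2):
        for col in left.columns.intersection(right.columns):
            left_vals = set(left[col].dropna().unique())
            right_vals = set(right[col].dropna().unique())
            if not left_vals or not right_vals:
                continue  # an all-null column cannot serve as a key
            overlap = len(left_vals & right_vals) / min(len(left_vals), len(right_vals))
            if overlap >= min_overlap:
                results.append((left_name, right_name, col, overlap))
    return results


# Made-up sample data
customers = pd.DataFrame({"customer_id": [1, 2, 3], "name": ["A", "B", "C"]})
orders = pd.DataFrame({"customer_id": [1, 1, 3], "amount": [10.0, 5.0, 7.5]})

matches = candidate_join_keys({"customers": customers, "orders": orders})
print(matches)
```

A fuller tool would presumably also compare differently named columns and check dtypes; this sketch leaves both out for brevity.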

Featured Projects

SQL Modern Data Warehouse

Problem

Raw ERP & CRM sales data was scattered across inconsistent CSV files and was not analytics-ready for business reporting.

Tools

PostgreSQL, SQL (ETL), Star Schema Modeling, EDA, Power BI

Built a 3-layer Medallion architecture (Bronze–Silver–Gold) in PostgreSQL, transforming raw CSV data into analytics-ready tables.
Developed end-to-end ETL pipelines using SQL, integrating ERP & CRM data into a scalable star schema model.
Optimized business reporting by creating analytical views and SQL reports for customer behavior, product performance, and sales trends.
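The Bronze–Silver–Gold flow described above can be sketched in a few statements. To keep the example self-contained it swaps PostgreSQL for Python's built-in `sqlite3` with an in-memory database; the table names, columns, and rows are hypothetical and not taken from the project.

```python
import sqlite3

conn = sqlite3.connect(":memory:")
cur = conn.cursor()

# Bronze: raw rows exactly as loaded from CSV (all text, no cleaning)
cur.execute(
    "CREATE TABLE bronze_crm_sales (order_id TEXT, customer TEXT, product TEXT, amount TEXT)"
)
cur.executemany(
    "INSERT INTO bronze_crm_sales VALUES (?, ?, ?, ?)",
    [
        ("1", " alice ", "Widget", "10.50"),
        ("2", "BOB", "widget", "20"),
        ("2", "BOB", "widget", "20"),    # exact duplicate
        ("3", "alice", "Gadget", None),  # missing amount
    ],
)

# Silver: trim, normalize case, cast types, deduplicate, drop unusable rows
cur.execute("""
    CREATE TABLE silver_sales AS
    SELECT DISTINCT
        CAST(order_id AS INTEGER) AS order_id,
        LOWER(TRIM(customer))     AS customer,
        LOWER(TRIM(product))      AS product,
        CAST(amount AS REAL)      AS amount
    FROM bronze_crm_sales
    WHERE amount IS NOT NULL
""")

# Gold: a small star schema with one dimension and one fact table
cur.execute(
    "CREATE TABLE gold_dim_customer (customer_key INTEGER PRIMARY KEY, customer TEXT UNIQUE)"
)
cur.execute(
    "INSERT INTO gold_dim_customer (customer) "
    "SELECT DISTINCT customer FROM silver_sales ORDER BY customer"
)
cur.execute("""
    CREATE TABLE gold_fact_sales AS
    SELECT s.order_id, c.customer_key, s.product, s.amount
    FROM silver_sales AS s
    JOIN gold_dim_customer AS c ON c.customer = s.customer
""")

# Reporting query over the gold layer: revenue per customer
report = cur.execute("""
    SELECT c.customer, SUM(f.amount)
    FROM gold_fact_sales AS f
    JOIN gold_dim_customer AS c ON c.customer_key = f.customer_key
    GROUP BY c.customer
    ORDER BY c.customer
""").fetchall()
print(report)
```

Each layer reads only from the one before it, so a cleaning rule can change in Silver without touching the raw Bronze load.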

Yelp Big Data Capstone

Problem

Traditional tools like Pandas struggle with large Yelp JSON datasets due to memory and performance limitations.

Tools

Python, Polars, JSON, Parquet, Jupyter Notebook

Built a high-performance data processing pipeline using Polars, handling large Yelp JSON files efficiently without memory bottlenecks.
Optimized storage and I/O by converting raw JSON data into Parquet format, improving processing speed and scalability.
Developed step-by-step analytical workflows (filtering, groupby, joins, rolling ops) to extract structured insights from large-scale business data.

Breast Cancer Prediction App

Problem

There was a need for an interactive, real-time tool that predicts whether a tumor is benign or malignant from features derived from medical images.

Tools

Python, Scikit-learn (Logistic Regression), Streamlit, Pandas, Plotly

Built an end-to-end ML pipeline using Logistic Regression (Scikit-learn) for tumor classification based on 30 diagnostic features.
Developed a real-time web application in Streamlit that returns instant predictions along with class probabilities as confidence scores.
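The modeling step can be sketched as follows. The project's data source is not stated, so this assumes the breast cancer dataset bundled with scikit-learn, which also has 30 features; the scaling step, split ratio, and `max_iter` value are illustrative choices rather than the project's actual settings.

```python
from sklearn.datasets import load_breast_cancer
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split
from sklearn.pipeline import make_pipeline
from sklearn.preprocessing import StandardScaler

# Assumed dataset: 569 samples, 30 numeric features; target 0 = malignant, 1 = benign
X, y = load_breast_cancer(return_X_y=True)

X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, stratify=y, random_state=0
)

# Scale features first so the logistic regression solver converges reliably
model = make_pipeline(StandardScaler(), LogisticRegression(max_iter=1000))
model.fit(X_train, y_train)

accuracy = model.score(X_test, y_test)

# Class probabilities for one sample, ordered as [malignant, benign]
proba = model.predict_proba(X_test[:1])[0]
print(f"test accuracy: {accuracy:.3f}")
print(f"P(malignant)={proba[0]:.3f}, P(benign)={proba[1]:.3f}")
```

In an app, the same fitted pipeline would be called with the 30 values entered by the user, and the two probabilities shown as the confidence score.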