I built a complete machine-learning pipeline to predict rainfall tomorrow (weather classification) using the Australian weather dataset. The project focuses on robust preprocessing and model comparison: I handled heavy missing data (column drop + KNN imputation + mode filling), extracted date features, encoded categoricals, removed outliers, and engineered a clean feature set. I then trained and tuned a variety of classical models — Logistic Regression, Decision Tree, Random Forest, Gradient Boosting, XGBoost, AdaBoost, and an MLP — and compared them using cross-validation and test metrics. For the best models I performed hyperparameter search (GridSearchCV), inspected feature importances, and validated results with confusion matrices and error analysis to choose a reliable production candidate.
What I delivered
End-to-end preprocessing: null-value analysis, column filtering (>60% null removal), KNN imputation for numeric gaps, mode filling for categoricals, date → month extraction, label encoding.
Data quality work: exploratory histograms, correlation heatmap, IQR outlier filtering, boxplots for feature distributions.
Multiple models & model selection: baseline Logistic Regression, tree-based ensembles (DecisionTree, RandomForest, GradientBoosting, XGBoost, AdaBoost), and an MLP pipeline with scaling and early stopping.
Model improvement: GridSearchCV for hyperparameter tuning, pipeline usage for scaling + MLP, ensemble strategies and model comparison via accuracy / confusion matrix / other classification metrics.
Reproducible notebook with saved preprocessed dataset and visualization assets.