Project Overview
The goal of this project is to analyze and predict the commercial success of video games on the Steam platform, using extensive data preprocessing, feature engineering, regression, and classification techniques. We investigated which game attributes (e.g., genres, reviews, discounts, platform support) influence copiesSold and sales tier classification (Bronze, Silver, Gold, Platinum).
Objectives
Integrate and clean data from multiple sources (base games, DLCs, demos, sales).
Handle missing values, correct data types, and engineer useful features.
Explore the relationship between features and user engagement/sales.
Use regression models to predict game copies sold.
Use classification models to categorize games into sales tiers.
Apply feature selection techniques to find the most influential variables.
Tune hyperparameters and evaluate models based on accuracy and efficiency.
Dataset Details
Combined from 5 sources: base info, DLCs, demos, sales data, and scraped data.
Final shape: 69,828 rows, 22 columns after merging and cleaning.
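A merge along these lines can be sketched with pandas. The frames and column names below are illustrative stand-ins for the actual sources; in particular, appid as the join key is an assumption:

```python
import pandas as pd

# Hypothetical frames standing in for two of the five sources;
# the real schema is not shown in this report.
base = pd.DataFrame({"appid": [10, 20, 30], "name": ["A", "B", "C"]})
sales = pd.DataFrame({"appid": [10, 20], "copiesSold": [1500, 300]})

# Left-merge so every base game survives even without a sales record.
merged = base.merge(sales, on="appid", how="left")
print(merged.shape)  # (3, 3)
```

A left join preserves base games that lack sales rows; those rows then surface as missing values, which the preprocessing steps below handle.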
Key Preprocessing Steps
Missing Value Handling: custom strategy per column (e.g., filling with 0, mean, or median).
Feature Engineering: extracted num_genres, price_after_discount, and months_since_release.
Encoding: Boolean → Integer, Label Encoding, Frequency Encoding.
Dropped Columns: removed irrelevant or redundant features such as appid and languages.
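The steps above can be sketched on a toy frame; the fill strategies and column values here are illustrative, not the project's exact pipeline:

```python
import pandas as pd

df = pd.DataFrame({
    "price": [10.0, None, 6.0],
    "discount": [0.5, 0.0, None],
    "genres": ["Action,Indie", "RPG", "Action"],
    "has_demo": [True, False, True],
})

# Per-column missing-value strategies: median for price, 0 for discount.
df["price"] = df["price"].fillna(df["price"].median())
df["discount"] = df["discount"].fillna(0)

# Engineered features from the table above.
df["num_genres"] = df["genres"].str.split(",").str.len()
df["price_after_discount"] = df["price"] * (1 - df["discount"])

# Boolean -> integer encoding.
df["has_demo"] = df["has_demo"].astype(int)
```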
Feature Selection Techniques
Pearson & Spearman Correlation
Mutual Information
Random Forest Importance
Lasso Regression
Recursive Feature Elimination (RFE)
Top features identified include: price_after_discount, positive_reviews, reviewScore, metacritic, num_languages, discount.
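Two of the listed techniques, mutual information and RFE, can be sketched with scikit-learn. The synthetic data below is a stand-in for the project's feature matrix, which is not shown here:

```python
import numpy as np
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 200 samples, 6 features, only 3 informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       random_state=0)

# Mutual information scores one value per feature (higher = more informative).
mi = mutual_info_regression(X, y, random_state=0)

# RFE recursively drops the weakest feature until 3 remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
selected = np.flatnonzero(rfe.support_)
```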
Regression Results
Model              R² (Train)  R² (Test)
Linear Regression  0.8437      0.6716
Decision Tree      0.9309      0.9291
Random Forest      0.9446      0.9145
Lasso CV           0.8437      0.6716
Random Forest achieved the best overall fit (highest training R² with strong test R²) and was saved as the final regression model.
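The train/test R² comparison can be reproduced in outline with scikit-learn; the synthetic data and 80/20 split below are assumptions, so the scores will not match the table above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the Steam feature matrix.
X, y = make_regression(n_samples=500, n_features=8, n_informative=6,
                       noise=10.0, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2, random_state=0)

# Report R² on both splits, as in the table above.
for model in (LinearRegression(), RandomForestRegressor(random_state=0)):
    model.fit(X_tr, y_tr)
    print(type(model).__name__,
          round(r2_score(y_tr, model.predict(X_tr)), 3),
          round(r2_score(y_te, model.predict(X_te)), 3))
```

Comparing the two columns side by side makes overfitting visible: a large gap between train and test R² (as for Linear Regression above) signals poor generalization.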
Classification Task
Target: games were classified into four sales tiers:
1: Bronze
2: Silver
3: Gold
4: Platinum
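The report does not state how the tier boundaries were chosen; one plausible scheme, shown purely as an illustration, is quartile binning of copiesSold:

```python
import pandas as pd

# Hypothetical sales figures; the thresholds below are derived from
# quartiles of this toy data, not the project's actual cut-offs.
copies = pd.Series([120, 950, 4_300, 18_000, 75_000, 300_000, 1_200_000, 40])

# Quartiles of copiesSold map to tiers 1 (Bronze) through 4 (Platinum).
tiers = pd.qcut(copies, q=4, labels=[1, 2, 3, 4])
print(tiers.tolist())
```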
Feature Selection Approaches:
Chi-Squared, Mutual Information for categorical features.
ANOVA F-test, Kendall Tau for numerical features.
Combined top features into sets like Anova_Chi, Kendall_MI.
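Combining selections from two tests (as in a set like Anova_Chi) might look like the sketch below; SelectKBest, k=4, and the non-negative shift for chi2 are illustrative choices, not the project's recorded settings:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic classification data standing in for the game features.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           random_state=0)

# ANOVA F-test scores the numerical features against the class label.
anova = SelectKBest(f_classif, k=4).fit(X, y)

# chi2 requires non-negative inputs, so shift each feature first.
X_nonneg = X - X.min(axis=0)
chi = SelectKBest(chi2, k=4).fit(X_nonneg, y)

# Union of the two selections, mirroring a combined set like Anova_Chi.
combined = np.flatnonzero(anova.get_support() | chi.get_support())
```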
Best Models & Accuracy
Model              Test Accuracy
Decision Tree      89.20%
Random Forest      89.84%
Gradient Boosting  90.74%
Best Trade-offs:
Best Accuracy: Gradient Boosting
Best Speed: Decision Tree (ideal for real-time deployment)
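The accuracy-versus-speed trade-off can be sketched by timing prediction for both models on synthetic data; sample sizes and model settings here are illustrative:

```python
import time
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

# Synthetic 4-class problem standing in for the four sales tiers.
X, y = make_classification(n_samples=1000, n_features=10, n_classes=4,
                           n_informative=6, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for model in (DecisionTreeClassifier(random_state=0),
              GradientBoostingClassifier(random_state=0)):
    model.fit(X_tr, y_tr)
    start = time.perf_counter()
    acc = model.score(X_te, y_te)
    elapsed = time.perf_counter() - start
    print(type(model).__name__, round(acc, 3), f"{elapsed * 1000:.1f} ms")
```

A single decision tree predicts with one pass down the tree, while gradient boosting must evaluate every tree in the ensemble, which is why the tree wins on inference speed even when boosting wins on accuracy.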
Key Insights
price_after_discount, reviewScore, and positive_reviews are top predictors.
Discounts are more influential than original price.
Ensemble models (e.g., Gradient Boosting) deliver best accuracy.
Hyperparameter tuning significantly improves model performance (up to +8%).
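A hyperparameter search of the kind described can be sketched with GridSearchCV; the grid below is illustrative, since the report does not list the actual search space:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

# Synthetic 4-class data standing in for the sales-tier problem.
X, y = make_classification(n_samples=400, n_features=8, n_classes=4,
                           n_informative=5, random_state=0)

# Cross-validated grid search over an illustrative parameter grid.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
)
grid.fit(X, y)
print(grid.best_params_, round(grid.best_score_, 3))
```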