Work Details

Project Overview

The goal of this project is to analyze and predict the commercial success of video games on the Steam platform, using extensive data preprocessing, feature engineering, regression, and classification techniques. We investigated which game attributes (e.g., genres, reviews, discounts, platform support) influence copiesSold and sales tier classification (Bronze, Silver, Gold, Platinum).

Objectives

Integrate and clean data from multiple sources (base games, DLCs, demos, sales).

Handle missing values, correct data types, and engineer useful features.

Explore the relationship between features and user engagement/sales.

Use regression models to predict game copies sold.

Use classification models to categorize games into sales tiers.

Apply feature selection techniques to find the most influential variables.

Tune hyperparameters and evaluate models based on accuracy and efficiency.

Dataset Details

Combined from 5 sources: base info, DLCs, demos, sales data, and scraped data.

Final shape: 69,828 rows, 22 columns after merging and cleaning.
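The merge of the five sources can be sketched with pandas left-joins keyed on the game's app id. This is a minimal illustration on hypothetical mini-frames, not the project's actual merge code:

```python
import pandas as pd

# Hypothetical mini-frames standing in for the five real sources.
base = pd.DataFrame({"appid": [1, 2, 3], "name": ["A", "B", "C"]})
sales = pd.DataFrame({"appid": [1, 2, 3], "copiesSold": [1000, 500, 25000]})
dlc_counts = pd.DataFrame({"appid": [1, 3], "num_dlc": [2, 5]})

# Left-join everything onto the base table, keyed on appid.
merged = (base.merge(sales, on="appid", how="left")
              .merge(dlc_counts, on="appid", how="left"))

# Games with no DLC rows come back as NaN; fill with 0 before modeling.
merged["num_dlc"] = merged["num_dlc"].fillna(0).astype(int)
print(merged.shape)  # (3, 4)
```

Left-joining onto the base table keeps every game even when a supplementary source (DLCs, demos) has no matching row.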

Key Preprocessing Steps

Task: Description

Missing Value Handling: custom strategy per column (e.g., filling with 0, the mean, or the median).

Feature Engineering: extracted num_genres, price_after_discount, months_since_release.

Encoding: Boolean → Integer, Label Encoding, Frequency Encoding.

Dropped Columns: removed irrelevant or redundant features such as appid and languages.
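The imputation and feature-engineering steps above can be sketched as follows; the frame, dates, and reference date are illustrative, not the project's real data:

```python
import numpy as np
import pandas as pd

df = pd.DataFrame({
    "price": [9.99, 19.99, np.nan],
    "discount": [0.5, np.nan, 0.25],
    "genres": ["Action,Indie", "RPG", "Action,RPG,Indie"],
    "release_date": pd.to_datetime(["2020-01-15", "2022-06-01", "2023-03-10"]),
})

# Per-column imputation strategies: median for price, 0 for discount.
df["price"] = df["price"].fillna(df["price"].median())
df["discount"] = df["discount"].fillna(0)

# Engineered features named in the write-up.
df["num_genres"] = df["genres"].str.split(",").str.len()
df["price_after_discount"] = df["price"] * (1 - df["discount"])
today = pd.Timestamp("2024-01-01")  # assumed reference date
df["months_since_release"] = (today - df["release_date"]).dt.days // 30
```

Choosing the fill value per column matters: a missing discount plausibly means "no discount" (0), while a missing price is better replaced by a typical value such as the median.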

Feature Selection Techniques

Pearson & Spearman Correlation

Mutual Information

Random Forest Importance

Lasso Regression

Recursive Feature Elimination (RFE)

Top features identified include: price_after_discount, positive_reviews, reviewScore, metacritic, num_languages, discount.
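Two of the listed techniques, mutual information and RFE, can be demonstrated with scikit-learn on a synthetic stand-in for the games table (the feature names are reused from the write-up; the data is generated, not real):

```python
from sklearn.datasets import make_regression
from sklearn.feature_selection import RFE, mutual_info_regression
from sklearn.linear_model import LinearRegression

# Synthetic stand-in: 200 games, 6 numeric features, 3 truly informative.
X, y = make_regression(n_samples=200, n_features=6, n_informative=3,
                       random_state=0)
names = ["price_after_discount", "positive_reviews", "reviewScore",
         "metacritic", "num_languages", "discount"]

# Mutual information: higher score = stronger dependence on the target.
mi = mutual_info_regression(X, y, random_state=0)

# RFE: recursively drop the weakest-coefficient feature until 3 remain.
rfe = RFE(LinearRegression(), n_features_to_select=3).fit(X, y)
selected = [n for n, keep in zip(names, rfe.support_) if keep]
```

Running several selectors and intersecting their top picks, as the project did, guards against any single method's bias (e.g., Pearson correlation only sees linear relationships, while mutual information also catches nonlinear ones).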

Regression Results

Model               R² (Train)   R² (Test)
Linear Regression   0.8437       0.6716
Decision Tree       0.9309       0.9291
Random Forest       0.9446       0.9145
Lasso CV            0.8437       0.6716

Random Forest delivered the strongest overall performance (highest training R², with test R² close behind the Decision Tree's) and was saved as the final regression model.
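The train/test R² comparison can be reproduced in outline with scikit-learn; the data here is synthetic, so the scores will not match the table above:

```python
from sklearn.datasets import make_regression
from sklearn.ensemble import RandomForestRegressor
from sklearn.metrics import r2_score
from sklearn.model_selection import train_test_split

# Synthetic regression data standing in for the engineered game features.
X, y = make_regression(n_samples=500, n_features=6, n_informative=6,
                       noise=10, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

model = RandomForestRegressor(n_estimators=100, random_state=0)
model.fit(X_tr, y_tr)

r2_train = r2_score(y_tr, model.predict(X_tr))
r2_test = r2_score(y_te, model.predict(X_te))
# The fitted model could then be persisted, e.g. with joblib.dump(...).
```

Comparing train against test R², as the table does, is what exposes overfitting: Linear Regression and Lasso drop sharply out of sample, while the tree ensembles generalize far better.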

Classification Task

Target: games were classified into four sales tiers:

1: Bronze

2: Silver

3: Gold

4: Platinum
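Mapping copies sold to tiers is a binning step; a sketch with pandas follows. The cut points are hypothetical, since the write-up does not state the real tier boundaries:

```python
import pandas as pd

copies = pd.Series([800, 15_000, 250_000, 3_000_000])

# Hypothetical tier boundaries; the actual thresholds are not given above.
tiers = pd.cut(copies,
               bins=[0, 10_000, 100_000, 1_000_000, float("inf")],
               labels=[1, 2, 3, 4])  # 1=Bronze ... 4=Platinum
print(tiers.tolist())  # [1, 2, 3, 4]
```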

Feature Selection Approaches:

Chi-Squared, Mutual Information for categorical features.

ANOVA F-test, Kendall Tau for numerical features.

Combined the top-ranked features into sets such as Anova_Chi and Kendall_MI.
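The ANOVA-plus-chi-squared combination can be sketched with scikit-learn's SelectKBest; the data is synthetic and k=4 is an arbitrary illustrative choice:

```python
import numpy as np
from sklearn.datasets import make_classification
from sklearn.feature_selection import SelectKBest, chi2, f_classif

# Synthetic 4-class problem standing in for the sales-tier target.
X, y = make_classification(n_samples=300, n_features=8, n_informative=4,
                           n_classes=4, random_state=0)

# ANOVA F-test suits continuous (possibly signed) numerical features.
anova_top = SelectKBest(f_classif, k=4).fit(X, y).get_support(indices=True)

# Chi-squared requires non-negative inputs, e.g. count-like features.
X_counts = np.abs(np.round(X))
chi_top = SelectKBest(chi2, k=4).fit(X_counts, y).get_support(indices=True)

# Union of the two rankings, analogous to the "Anova_Chi" feature set.
combined = sorted(set(anova_top) | set(chi_top))
```

Using a statistic matched to the feature type (chi-squared for categorical/count features, ANOVA or Kendall's tau for numerical ones) is what motivates running both selectors and combining their picks.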

Best Models & Accuracy

Model               Test Accuracy
Decision Tree       89.20%
Random Forest       89.84%
Gradient Boosting   90.74%

Best Trade-offs:

Best Accuracy: Gradient Boosting

Best Speed: Decision Tree (ideal for real-time deployment)
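The accuracy-versus-speed trade-off can be measured directly by timing each fit; this sketch uses synthetic data, so absolute numbers will differ from the project's:

```python
import time

from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import train_test_split
from sklearn.tree import DecisionTreeClassifier

X, y = make_classification(n_samples=1000, n_features=10, n_informative=5,
                           n_classes=4, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.2,
                                          random_state=0)

results = {}  # name -> (test accuracy, fit time in seconds)
for name, clf in [("Decision Tree", DecisionTreeClassifier(random_state=0)),
                  ("Gradient Boosting", GradientBoostingClassifier(random_state=0))]:
    start = time.perf_counter()
    clf.fit(X_tr, y_tr)
    results[name] = (clf.score(X_te, y_te), time.perf_counter() - start)
```

A single tree trains and predicts in a fraction of the boosted ensemble's time, which is why it is the better fit for real-time deployment even at a small accuracy cost.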

Key Insights

price_after_discount, reviewScore, and positive_reviews are top predictors.

Discounts are more influential than original price.

Ensemble models (e.g., Gradient Boosting) deliver the best accuracy.

Hyperparameter tuning significantly improves model performance (up to +8%).
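A standard way to run such tuning is a cross-validated grid search; the grid below is a small illustrative one on synthetic data, not the search space actually used:

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.model_selection import GridSearchCV

X, y = make_classification(n_samples=400, n_features=8, n_informative=4,
                           n_classes=4, random_state=0)

# Small illustrative grid; a real search space would be wider.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"n_estimators": [50, 100], "max_depth": [2, 3]},
    cv=3,
    n_jobs=-1,
).fit(X, y)

best_params = grid.best_params_
best_cv_accuracy = grid.best_score_
```

Each candidate is scored by cross-validation, so the reported gain comes from held-out folds rather than the training fit.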
