Work Details

Importing Libraries

The code starts by importing the libraries it needs (Pandas, Matplotlib, and NumPy) along with scikit-learn's LabelEncoder, RandomForestClassifier, and GridSearchCV, and then suppresses warnings.
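As a rough sketch, the import cell described above might look like the following. The exact aliases and the way warnings are suppressed are assumptions; the original code is not shown here.

```python
import warnings

import matplotlib.pyplot as plt
import numpy as np
import pandas as pd
from sklearn.ensemble import RandomForestClassifier
from sklearn.model_selection import GridSearchCV
from sklearn.preprocessing import LabelEncoder

# Assumption: warnings are silenced globally so notebook output stays readable.
warnings.filterwarnings("ignore")
```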

Data Wrangling

The wrangle function reads a dataset, performs some initial data exploration (printing the head, info, and summary statistics), assigns each passenger to an age category, creates a 'Family_Size' column, and flags whether a passenger is 'Alone'. It returns the modified DataFrame.
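A minimal sketch of the feature-engineering part of such a function is shown below. It takes a DataFrame rather than a file path so that it runs on its own. The age bins, the labels, the 'Age_Category' column name, and the rule that Family_Size counts the passenger too are illustrative assumptions, not the author's values. 'Age', 'SibSp', and 'Parch' are the standard Kaggle Titanic column names.

```python
import pandas as pd


def wrangle(df: pd.DataFrame) -> pd.DataFrame:
    """Add age-category, family-size, and alone flags to a Titanic-style frame."""
    df = df.copy()
    # Hypothetical age bins and labels, chosen only for illustration.
    bins = [0, 12, 19, 59, 120]
    labels = ["Child", "Teen", "Adult", "Senior"]
    df["Age_Category"] = pd.cut(df["Age"], bins=bins, labels=labels)
    # Assumed rule: siblings/spouses + parents/children + the passenger.
    df["Family_Size"] = df["SibSp"] + df["Parch"] + 1
    # A passenger is alone when the family consists of just that passenger.
    df["Alone"] = (df["Family_Size"] == 1).astype(int)
    return df


# Made-up rows standing in for the real training data.
sample = pd.DataFrame({"Age": [8, 30, 65], "SibSp": [1, 0, 0], "Parch": [2, 0, 1]})
wrangled = wrangle(sample)
print(wrangled)
```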

The train and test DataFrames are created using the wrangle function for the training and test datasets, respectively.

The code also separates male and female passengers into the male and female DataFrames.

Exploratory Data Analysis (EDA)

The code calculates and prints the percentage of survivors for females and males.
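The split by sex and the survival percentages can be sketched as follows. The six rows are made up for illustration, and the variable names pct_female and pct_male are assumptions.

```python
import pandas as pd

# Tiny made-up sample; the real code would use the Kaggle training DataFrame.
train = pd.DataFrame({
    "Sex": ["female", "female", "male", "male", "male", "female"],
    "Survived": [1, 1, 0, 1, 0, 0],
})

# Separate passengers by sex, as described in the Data Wrangling section.
female = train[train["Sex"] == "female"]
male = train[train["Sex"] == "male"]

# 'Survived' is 0/1, so its mean is the survival rate.
pct_female = female["Survived"].mean() * 100
pct_male = male["Survived"].mean() * 100

print(f"Female survivors: {pct_female:.1f}%")
print(f"Male survivors: {pct_male:.1f}%")
```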

It visualizes the number of passengers who survived or died by age category and sex using a bar chart.
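One way to produce such a chart is to count passengers per (age category, sex, outcome) group and plot the resulting table. This is a sketch on made-up rows; the 'Age_Category' column name and the 'Died'/'Survived' legend labels are assumptions.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows for illustration only.
df = pd.DataFrame({
    "Age_Category": ["Child", "Child", "Adult", "Adult", "Adult", "Adult"],
    "Sex": ["male", "female", "male", "male", "female", "female"],
    "Survived": [1, 1, 0, 0, 1, 0],
})

# Count rows per group, then spread the outcome (0/1) into columns.
counts = (
    df.groupby(["Age_Category", "Sex", "Survived"])
    .size()
    .unstack("Survived", fill_value=0)
    .rename(columns={0: "Died", 1: "Survived"})
)

ax = counts.plot(kind="bar")
ax.set_xlabel("Age category, sex")
ax.set_ylabel("Number of passengers")
plt.tight_layout()
```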

It plots a pie chart to show the distribution of survivors by ticket class (Pclass).

It creates a bar chart to display the number of survivors by family size.

Data Visualization

A heatmap of the correlation matrix for numeric columns in the dataset is plotted.
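Since the imports listed earlier include Matplotlib but no dedicated heatmap library, the sketch below draws the heatmap with Matplotlib's imshow. The author's actual plotting call is not known, and the rows here are made up.

```python
import matplotlib

matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up rows; 'Sex' is included to show that non-numeric columns are dropped.
df = pd.DataFrame({
    "Survived": [0, 1, 1, 0, 1],
    "Pclass": [3, 1, 2, 3, 1],
    "Fare": [7.25, 71.28, 13.0, 8.05, 53.1],
    "Sex": ["male", "female", "female", "male", "female"],
})

# Keep numeric columns only, then compute pairwise Pearson correlations.
corr = df.select_dtypes(include="number").corr()

fig, ax = plt.subplots()
im = ax.imshow(corr.to_numpy(), cmap="coolwarm", vmin=-1, vmax=1)
ax.set_xticks(range(len(corr.columns)))
ax.set_xticklabels(corr.columns, rotation=45, ha="right")
ax.set_yticks(range(len(corr.columns)))
ax.set_yticklabels(corr.columns)
fig.colorbar(im, ax=ax)
fig.tight_layout()
```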

Building the Model

A list of selected features is defined.

X_train and X_test are built by selecting the listed features and one-hot encoding the categorical variables; y_train holds the target column from the training data.
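The feature selection and one-hot encoding step might look like this sketch. The feature list and the rows are hypothetical, and 'Survived' is assumed to be the target column, as in the Kaggle Titanic data. The reindex call is an extra safeguard that the description does not mention: it keeps the test columns identical to the training columns when a category is missing from the test set.

```python
import pandas as pd

# Hypothetical feature list; the author's actual selection is not shown.
features = ["Pclass", "Sex", "Family_Size"]

# Made-up rows for illustration only.
train = pd.DataFrame({
    "Pclass": [1, 3, 2],
    "Sex": ["female", "male", "male"],
    "Family_Size": [1, 4, 2],
    "Survived": [1, 0, 0],
})
test = pd.DataFrame({
    "Pclass": [3, 1],
    "Sex": ["male", "male"],
    "Family_Size": [1, 2],
})

# One-hot encode the categorical column; numeric columns pass through unchanged.
X_train = pd.get_dummies(train[features], columns=["Sex"])
X_test = pd.get_dummies(test[features], columns=["Sex"])

# Safeguard (not in the description): match the test columns to the training columns.
X_test = X_test.reindex(columns=X_train.columns, fill_value=0)

y_train = train["Survived"]
```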

A RandomForestClassifier is initialized with a random state of 42.

Hyperparameters and their values to be tuned are defined in the param_grid dictionary.

GridSearchCV is used to perform hyperparameter tuning with 5-fold cross-validation.

The best hyperparameters and the best estimator (model) are printed.

The accuracy score of the model on the training data is printed.
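The tuning steps above can be sketched as follows. Synthetic data from make_classification stands in for the Titanic features, and the grid values are illustrative assumptions, since the contents of the original param_grid are not given.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import RandomForestClassifier
from sklearn.metrics import accuracy_score
from sklearn.model_selection import GridSearchCV

# Synthetic stand-in for X_train / y_train.
X, y = make_classification(n_samples=200, n_features=6, random_state=42)

rf = RandomForestClassifier(random_state=42)

# Hypothetical grid; the original hyperparameter values are not shown.
param_grid = {"n_estimators": [20, 50], "max_depth": [3, None]}

# 5-fold cross-validated search over every combination in the grid.
grid = GridSearchCV(rf, param_grid, cv=5, scoring="accuracy")
grid.fit(X, y)

best_model = grid.best_estimator_
print("Best hyperparameters:", grid.best_params_)
print("Best estimator:", best_model)

# Accuracy on the same data the model was fitted on.
train_acc = accuracy_score(y, best_model.predict(X))
print("Training accuracy:", train_acc)
```

Training accuracy is measured on data the model has already seen, so it tends to be optimistic. grid.best_score_ holds the mean cross-validated accuracy of the best parameter combination and is usually the more realistic figure.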

Finally, predictions are made on the test data using the best model, and the results are saved to a CSV file named 'submission.csv'.
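A sketch of the final step, assuming the Kaggle Titanic submission layout of a 'PassengerId' column followed by a 'Survived' column. The rows, the feature columns, and the passenger IDs are made up, and a small forest fitted directly stands in for the tuned best model.

```python
import pandas as pd
from sklearn.ensemble import RandomForestClassifier

# Made-up encoded features standing in for the real X_train / X_test.
X_train = pd.DataFrame({"Pclass": [1, 3, 3, 2, 1, 3], "Sex_male": [0, 1, 1, 0, 0, 1]})
y_train = pd.Series([1, 0, 0, 1, 1, 0])
X_test = pd.DataFrame({"Pclass": [3, 1], "Sex_male": [1, 0]})
passenger_ids = pd.Series([892, 893], name="PassengerId")  # hypothetical IDs

# Stand-in for the best estimator returned by the grid search.
model = RandomForestClassifier(n_estimators=20, random_state=42).fit(X_train, y_train)

submission = pd.DataFrame({
    "PassengerId": passenger_ids,
    "Survived": model.predict(X_test),
})

# index=False keeps the row index out of the file.
submission.to_csv("submission.csv", index=False)
```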

Conclusion

The code preprocesses the data, explores it, and builds a Random Forest model to predict whether a Titanic passenger survived, based on features such as ticket class, sex, and family size.

This code demonstrates a typical workflow for a machine learning competition where the goal is to predict outcomes on a test dataset using a trained model.

Work Card

Freelancer name: Mahmoud E.