Data Cleaning and Preprocessing Steps
Import Libraries and Load Data: The project began by importing the pandas, numpy, matplotlib, and seaborn libraries. The nba2k-full.csv file was then loaded into a DataFrame, and the first five rows were displayed to understand the data's content.
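The loading step can be sketched as follows. This is a minimal illustration, not the project's actual notebook code: the two rows below are made up and only mimic the raw column formats described in this document, and an in-memory buffer stands in for nba2k-full.csv so the snippet runs on its own. Only pandas is needed for this step.

```python
import io

import pandas as pd

# Made-up rows that mimic the raw formats described in this document;
# they stand in for the real nba2k-full.csv file.
sample_csv = io.StringIO(
    "full_name,rating,jersey,team,height,weight,salary\n"
    "Player A,97,#23,Team X,6-9 / 2.06,250 lbs. / 113.4 kg.,$37436858\n"
    "Player B,85,#7,Team Y,6-3 / 1.91,190 lbs. / 86.2 kg.,$1618520\n"
)

# In the project itself this would be pd.read_csv("nba2k-full.csv").
df = pd.read_csv(sample_csv)

# Display the first rows to inspect column names and raw value formats.
print(df.head())
```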
Data Cleaning:
The full_name and jersey columns were dropped as they were not deemed essential for the model.
Columns with mixed or non-numeric data types were processed:
The height column was converted from a format like 6-9 / 2.06 to a numeric value representing height in meters (e.g., 2.06).
The weight column, originally in a format like 250 lbs. / 113.4 kg., was converted to a numeric value representing weight in kilograms (e.g., 113.4).
The dollar sign $ was removed from the salary column, and the column was converted to a float data type.
The country column was simplified to categorize players as either USA or NOT USA.
A new age column was created by subtracting the player's birth year (b_year) from the game's version year (version). The original b_day, b_year, college, and version columns were then dropped.
Undrafted values in both the draft_round and draft_peak columns were replaced with 0, and the columns were converted to an integer data type.
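The cleaning steps above can be sketched roughly as below. The sample rows are made up, and the exact string operations are an assumption, since the document only describes the before and after formats. The sketch also assumes that b_year and version are already numeric years when the age is computed.

```python
import pandas as pd

# Made-up rows in the raw formats described above.
raw = pd.DataFrame({
    "full_name": ["Player A", "Player B"],
    "jersey": ["#23", "#7"],
    "height": ["6-9 / 2.06", "6-3 / 1.91"],
    "weight": ["250 lbs. / 113.4 kg.", "190 lbs. / 86.2 kg."],
    "salary": ["$37436858", "$1618520"],
    "country": ["USA", "Slovenia"],
    "b_day": ["12/30/84", "02/28/99"],
    "b_year": [1984, 1999],      # assumed to be numeric already
    "college": [None, None],
    "version": [2020, 2021],     # assumed to be numeric already
    "draft_round": ["1", "Undrafted"],
    "draft_peak": ["1", "Undrafted"],
})

# Drop columns not needed for the model.
df = raw.drop(columns=["full_name", "jersey"])

# Height: keep the metric part after the "/" separator (meters).
df["height"] = df["height"].str.split("/").str[1].str.strip().astype(float)

# Weight: keep the metric part and strip the "kg." suffix (kilograms).
df["weight"] = (
    df["weight"].str.split("/").str[1]
    .str.replace("kg.", "", regex=False)
    .str.strip()
    .astype(float)
)

# Salary: remove the dollar sign and convert to float.
df["salary"] = df["salary"].str.replace("$", "", regex=False).astype(float)

# Country: keep "USA", map everything else to "NOT USA".
df["country"] = df["country"].where(df["country"] == "USA", "NOT USA")

# Age: game version year minus birth year, then drop the source columns.
df["age"] = df["version"] - df["b_year"]
df = df.drop(columns=["b_day", "b_year", "college", "version"])

# Draft columns: treat "Undrafted" as 0 and convert to integers.
for col in ["draft_round", "draft_peak"]:
    df[col] = df[col].replace("Undrafted", "0").astype(int)

print(df)
```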
Handling Missing Values:
The dataset was checked for missing values.
The team column had 23 missing values, which were replaced with the placeholder value "no team".
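A minimal sketch of this check and fill, using a made-up three-row frame instead of the real data (so it shows one missing value rather than 23):

```python
import pandas as pd

# Made-up data with one missing team value.
df = pd.DataFrame({
    "team": ["Team X", None, "Team Y"],
    "rating": [97, 80, 85],
})

# Count missing values per column.
missing = df.isnull().sum()
print(missing)

# Replace missing team entries with a placeholder label.
df["team"] = df["team"].fillna("no team")
```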
Visualizations and Insights
Visualizations were used to explore the distribution of data in several key columns:
rating: A histogram showing the distribution of player ratings.
team: A bar chart displaying the number of players on each team.
position: A bar chart illustrating the distribution of players by their positions.
b_day: A bar chart showing the distribution of player birth dates.
height: A bar chart for the distribution of player heights.
weight: A bar chart for the distribution of player weights.
draft_year: A bar chart showing the distribution of player draft years.
draft_round and draft_peak: Bar charts that illustrate the distribution of draft rounds and draft pick positions.
age: A histogram showing the distribution of player ages.
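The two chart types listed above can be sketched as follows, with a histogram for a numeric column and a bar chart of value counts for a categorical one. The data is made up, and the plotting calls are an assumption: the document does not say which matplotlib or seaborn functions the project used, so plain matplotlib is shown here.

```python
import matplotlib
matplotlib.use("Agg")  # headless backend so the sketch runs without a display
import matplotlib.pyplot as plt
import pandas as pd

# Made-up data standing in for the cleaned DataFrame.
df = pd.DataFrame({
    "rating": [97, 85, 80, 78, 85, 90],
    "position": ["F", "G", "G", "C", "F", "G"],
})

fig, (ax_hist, ax_bar) = plt.subplots(1, 2, figsize=(10, 4))

# Numeric column: histogram of the value distribution.
ax_hist.hist(df["rating"], bins=5)
ax_hist.set_title("rating")
ax_hist.set_xlabel("rating")
ax_hist.set_ylabel("players")

# Categorical column: bar chart of how often each value occurs.
counts = df["position"].value_counts()
ax_bar.bar(list(counts.index), counts.values)
ax_bar.set_title("position")
ax_bar.set_xlabel("position")
ax_bar.set_ylabel("players")

fig.tight_layout()
```

In a notebook, the figure would be shown with plt.show() under an interactive backend; the same pattern applies to the other columns, with a histogram for age and value-count bar charts for team, draft_round, and draft_peak.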