Customer feedback is crucial for shaping business strategies and improving products/services. This analysis uses the Kaggle Customer Feedback Dataset, which includes diverse customer sentiments.
We apply Natural Language Processing (NLP) and Machine Learning (ML) techniques to perform sentiment analysis, using Python's nltk and scikit-learn libraries. Our focus is on efficient text preprocessing, feature selection, and classification model evaluation.
1. Data Preprocessing
Data Cleaning: Handle missing values, remove duplicates.
Text Processing: Lowercase conversion, punctuation removal, stopword removal, stemming (Porter Stemmer).
Tokenization: Count words and characters for exploratory analysis.
2. Feature Selection Techniques
Filter Methods: Statistical measures (e.g., Chi-square).
Wrapper Methods: Recursive Feature Elimination (RFE).
Dimensionality Reduction: PCA to retain key variance features.
3. Sentiment Labeling
Used VADER from NLTK to assign sentiment scores:
Positive, Neutral, Negative (based on compound score thresholds).
Model Training and Evaluation
Pipeline
Vectorization: TF-IDF.
Models Used: Multinomial Naive Bayes, Random Forest, SVM.
Metrics Evaluated: Accuracy, Precision, Recall, F1 Score.
Experimental Setup
Data split into training and testing sets.
Applied preprocessing and vectorization before classification.
Compared baseline model vs. Chi-Square vs. PCA approaches.