ATM Services Data Analysis and Encoding
A Python-based data analysis and encoding project that processes ATM transaction services data, standardizes service descriptions across multiple languages, and encodes them into a structured classification system.
Project Overview
This project focuses on cleaning and encoding ATM transaction service data from Egyptian ATMs. The dataset contains inconsistent service descriptions in both English and Arabic, with various formatting styles. The goal is to:
Extract and inspect the services column
Standardize multilingual service descriptions
Replace missing and invalid values with appropriate defaults
Encode services into a meaningful 3-level classification system
Problem Statement
ATM service descriptions in the dataset suffer from:
Inconsistent formatting: Mixed delimiters (dashes, hyphens, commas)
Multilingual entries: Both English and Arabic descriptions
Missing values: Empty or null entries
Capitalization inconsistencies: "Withdrawal", "withdrawal", "WITHDRAWAL"
Descriptive variations: "Cash Dispenser only", "withdrawal only", "سحب فقط"
Multiple services per ATM: Many ATMs offer withdrawal, deposit, and forex services
Solution Approach
Data Cleaning Pipeline
Step 1: Missing Value Handling
Replace all null/NaN values in the services column with 'withdrawal' (most common operation)
Step 2: Translation Mapping
Apply the following Arabic-to-English translations:
سحب فقط → withdrawal
سحب - إيداع - تغيير عملة → withdrawal, deposit, forex
سحب - إيداع - تغيير عملة - صرف من الإيداعات → withdrawal, deposit, forex, withdrawal from deposit
سحب → withdrawal
سحب وايداع → withdrawal, deposit
سحب وايداع وتغيير عملة → withdrawal, deposit, forex
GBRU سحب - إيداع - تغيير عملة - صرف من الإيداعات → withdrawal, deposit, forex, withdrawal from deposit
سحب - إيداع - تغيير عملة - صرف من الإيداعات GBRU → withdrawal, deposit, forex, withdrawal from deposit
سحب - إيداع - تغيير عملة - صرف من الإيداعات GBRU+ Barcode → withdrawal, deposit, forex, withdrawal from deposit
Step 3: Text Normalization
Convert to lowercase
Standardize separators (hyphens, dashes, slashes → commas)
Handle special cases: "cash dispenser only" → "withdrawal", "full function" → "withdrawal, deposit, forex"
Split into individual service tokens
Remove duplicates while preserving order
Step 4: Service Encoding (1/2/3 Classification)
The normalized services are encoded into three discrete categories:
Code Description Criteria
1 Withdrawal Only Only withdrawal capability
2 Deposit & Withdrawal Deposit + withdrawal (any combination)
3 Extended Services Forex, exchange, or additional functions
Encoding Logic
def encode_services(value):
normalized = normalize_service_text(value)
# Rule 1: Pure withdrawal
if normalized == ['withdrawal']:
return 1
# Rule 2: Deposit with or without withdrawal
if normalized == ['deposit'] or normalized in [['withdrawal', 'deposit'], ['deposit', 'withdrawal']]:
return 2
# Rule 3: Has forex or other extended services
if any(item not in {'withdrawal', 'deposit'} for item in normalized):
return 3
# Default to 3 (safety fallback)
return 3
Project Structure
d:\NTI-DataAnalysis\tasks\01\
├── README.md # This file
├── atms_encoded.csv # Original raw dataset
├── atms_encoded_cleaned.csv # Cleaned CSV (auto-produced by analysis script)
├── atms_encoded_cleaned_lean.csv # Final output with encoding (from notebook)
├── atm_encode.ipynb # Main Jupyter notebook (lean version)
├── analyze_services.py # Standalone analysis script
├── services_unique.json # All unique service values from original data
├── services_non_english.json # Non-English service entries found
└── .gitignore # Git ignore patterns
Dataset Statistics
Original Data
Total rows: 4379 ATM locations
Total unique service values: 113
Missing values: 57 (replaced with 'withdrawal')
After Cleaning & Encoding
Class 1 (Withdrawal Only): 2787 rows (63.6%)
Class 2 (Deposit & Withdrawal): 105 rows (2.4%)
Class 3 (Extended Services): 1487 rows (34.0%)
Key Statistics
Multilabel rows (multiple services per ATM): 865
Non-English unique entries: 9
Rows with 'Other' values replaced: 20
Files Description
Input Files
atms_encoded.csv: Original ATM dataset with raw service descriptions
Output Files
atms_encoded_cleaned.csv: Cleaned data with standardized service column
atms_encoded_cleaned_lean.csv: Final dataset with three new columns:
services: Cleaned/normalized service description
services_normalized: List of parsed service tokens
services_encoding: Integer encoding (1/2/3)
Analysis Artifacts
services_unique.json: Dictionary of all original unique values with counts
services_non_english.json: Non-English entries with counts (for manual review)
Installation & Usage
Prerequisites
python >= 3.7
pandas >= 1.0.0
Setup
# Clone the repository
git clone https://github.com/<yo...
cd NTI-DataAnalysis/tasks/01
# Install dependencies
pip install pandas
Running the Analysis
Option 1: Standalone Script (Fast Analysis)
python analyze_services.py
This produces:
atms_encoded_cleaned.csv (cleaned data)
services_unique.json (unique value report)
services_non_english.json (non-English entries report)
Console output with summary statistics
Option 2: Jupyter Notebook (Interactive Exploration)
jupyter notebook atm_encode.ipynb
The notebook provides:
Step-by-step data loading and cleaning
Interactive inspection of normalized services
Distribution analysis of encoding classes
CSV export with all three encoding columns
Key Code Functions
normalize_service_text(value: str) -> list
Normalizes a raw service string into a list of standardized tokens.
Input: "Withdrawal -Deposit – Forex" Output: ['withdrawal', 'deposit', 'forex']
encode_services(value: str) -> int
Encodes a service string into the 1/2/3 classification.
Examples:
"Withdrawal" → 1
"withdrawal,deposit" → 2
"Withdrawal -Deposit – Forex" → 3
Data Quality & Validation
Normalization Rules Applied
All text converted to lowercase
Multi-character separators normalized to commas
Whitespace trimmed and deduplicated
Unknown tokens mapped to fallback (withdrawal)
Duplicate tokens removed while preserving order
Language Support
English: Primary language
Arabic: Egyptian Arabic (فصحى) entries translated
Mixed: Entries with mixed Arabic/English/special codes handled
Encoding Rationale
The 1/2/3 encoding scheme reflects operational ATM capabilities:
1 (Withdrawal): Basic ATMs at remote or low-traffic locations
2 (Deposit & Withdrawal): Community-focused ATMs in populated areas
3 (Extended): Premium ATMs at banks/malls with forex and other services
This ordinal encoding preserves service hierarchy and is suitable for:
Categorical machine learning models (decision trees, random forests)
Ordinal regression models
Feature engineering (one-hot encoding if needed)
Geographical analysis of service distribution
Example Usage
import pandas as pd
# Load the cleaned and encoded data
df = pd.read_csv('atms_encoded_cleaned_lean.csv')
# View the encoding distribution
print(df['services_encoding'].value_counts().sort_index())
# Filter ATMs by service level
withdrawal_only = df[df['services_encoding'] == 1]
deposit_capable = df[df['services_encoding'] >= 2]
# Analyze by governorate
by_region = df.groupby('governate')['services_encoding'].agg(['mean', 'std', 'count'])
print(by_region)
Future Improvements
ML Classification: Train a classifier to auto-predict encoding from location/context
Service Availability: Track service outages and update encoding over time
Multilingual Expansion: Add support for more Arabic dialects
Confidence Scoring: Add confidence scores for normalized services
API Integration: Create REST API for real-time service lookup
Contributing
Contributions are welcome! Please:
Fork the repository
Create a feature branch (git checkout -b feature/your-feature)
Commit changes (git commit -m 'Add your feature')
Push to branch (git push origin feature/your-feature)
Open a Pull Request
License
This project is licensed under the MIT License - see LICENSE file for details.
Author
Created as part of NTI Data Analysis tasks.
Contact & Support
For questions or issues:
Open an issue on GitHub
Check existing issues for similar problems
Review the inline code comments for detailed logic
Changelog
v1.0.0 (2026-05-02)
Initial release
Core data cleaning and encoding pipeline
Support for Arabic/English multilingual data
Three-level service classification system
Jupyter notebook for interactive analysis