تفاصيل العمل

ATM Services Data Analysis and Encoding

A Python-based data analysis and encoding project that processes ATM transaction services data, standardizes service descriptions across multiple languages, and encodes them into a structured classification system.

Project Overview

This project focuses on cleaning and encoding ATM transaction service data from Egyptian ATMs. The dataset contains inconsistent service descriptions in both English and Arabic, with various formatting styles. The goal is to:

Extract and inspect the services column

Standardize multilingual service descriptions

Replace missing and invalid values with appropriate defaults

Encode services into a meaningful 3-level classification system

Problem Statement

ATM service descriptions in the dataset suffer from:

Inconsistent formatting: Mixed delimiters (dashes, hyphens, commas)

Multilingual entries: Both English and Arabic descriptions

Missing values: Empty or null entries

Capitalization inconsistencies: "Withdrawal", "withdrawal", "WITHDRAWAL"

Descriptive variations: "Cash Dispenser only", "withdrawal only", "سحب فقط"

Multiple services per ATM: Many ATMs offer withdrawal, deposit, and forex services

Solution Approach

Data Cleaning Pipeline

Step 1: Missing Value Handling

Replace all null/NaN values in the services column with 'withdrawal' (most common operation)

Step 2: Translation Mapping

Apply the following Arabic-to-English translations:

سحب فقط → withdrawal

سحب - إيداع - تغيير عملة → withdrawal, deposit, forex

سحب - إيداع - تغيير عملة - صرف من الإيداعات → withdrawal, deposit, forex, withdrawal from deposit

سحب → withdrawal

سحب وايداع → withdrawal, deposit

سحب وايداع وتغيير عملة → withdrawal, deposit, forex

GBRU سحب - إيداع - تغيير عملة - صرف من الإيداعات → withdrawal, deposit, forex, withdrawal from deposit

سحب - إيداع - تغيير عملة - صرف من الإيداعات GBRU → withdrawal, deposit, forex, withdrawal from deposit

سحب - إيداع - تغيير عملة - صرف من الإيداعات GBRU+ Barcode → withdrawal, deposit, forex, withdrawal from deposit

Step 3: Text Normalization

Convert to lowercase

Standardize separators (hyphens, dashes, slashes → commas)

Handle special cases: "cash dispenser only" → "withdrawal", "full function" → "withdrawal, deposit, forex"

Split into individual service tokens

Remove duplicates while preserving order

Step 4: Service Encoding (1/2/3 Classification)

The normalized services are encoded into three discrete categories:

Code Description Criteria

1 Withdrawal Only Only withdrawal capability

2 Deposit & Withdrawal Deposit + withdrawal (any combination)

3 Extended Services Forex, exchange, or additional functions

Encoding Logic

def encode_services(value):

normalized = normalize_service_text(value)

# Rule 1: Pure withdrawal

if normalized == ['withdrawal']:

return 1

# Rule 2: Deposit with or without withdrawal

if normalized == ['deposit'] or normalized in [['withdrawal', 'deposit'], ['deposit', 'withdrawal']]:

return 2

# Rule 3: Has forex or other extended services

if any(item not in {'withdrawal', 'deposit'} for item in normalized):

return 3

# Default to 3 (safety fallback)

return 3

Project Structure

d:\NTI-DataAnalysis\tasks\01\

├── README.md # This file

├── atms_encoded.csv # Original raw dataset

├── atms_encoded_cleaned.csv # Cleaned CSV (auto-produced by analysis script)

├── atms_encoded_cleaned_lean.csv # Final output with encoding (from notebook)

├── atm_encode.ipynb # Main Jupyter notebook (lean version)

├── analyze_services.py # Standalone analysis script

├── services_unique.json # All unique service values from original data

├── services_non_english.json # Non-English service entries found

└── .gitignore # Git ignore patterns

Dataset Statistics

Original Data

Total rows: 4379 ATM locations

Total unique service values: 113

Missing values: 57 (replaced with 'withdrawal')

After Cleaning & Encoding

Class 1 (Withdrawal Only): 2787 rows (63.6%)

Class 2 (Deposit & Withdrawal): 105 rows (2.4%)

Class 3 (Extended Services): 1487 rows (34.0%)

Key Statistics

Multilabel rows (multiple services per ATM): 865

Non-English unique entries: 9

Rows with 'Other' values replaced: 20

Files Description

Input Files

atms_encoded.csv: Original ATM dataset with raw service descriptions

Output Files

atms_encoded_cleaned.csv: Cleaned data with standardized service column

atms_encoded_cleaned_lean.csv: Final dataset with three new columns:

services: Cleaned/normalized service description

services_normalized: List of parsed service tokens

services_encoding: Integer encoding (1/2/3)

Analysis Artifacts

services_unique.json: Dictionary of all original unique values with counts

services_non_english.json: Non-English entries with counts (for manual review)

Installation & Usage

Prerequisites

python >= 3.7

pandas >= 1.0.0

Setup

# Clone the repository

git clone https://github.com/<yo...

cd NTI-DataAnalysis/tasks/01

# Install dependencies

pip install pandas

Running the Analysis

Option 1: Standalone Script (Fast Analysis)

python analyze_services.py

This produces:

atms_encoded_cleaned.csv (cleaned data)

services_unique.json (unique value report)

services_non_english.json (non-English entries report)

Console output with summary statistics

Option 2: Jupyter Notebook (Interactive Exploration)

jupyter notebook atm_encode.ipynb

The notebook provides:

Step-by-step data loading and cleaning

Interactive inspection of normalized services

Distribution analysis of encoding classes

CSV export with all three encoding columns

Key Code Functions

normalize_service_text(value: str) -> list

Normalizes a raw service string into a list of standardized tokens.

Input: "Withdrawal -Deposit – Forex" Output: ['withdrawal', 'deposit', 'forex']

encode_services(value: str) -> int

Encodes a service string into the 1/2/3 classification.

Examples:

"Withdrawal" → 1

"withdrawal,deposit" → 2

"Withdrawal -Deposit – Forex" → 3

Data Quality & Validation

Normalization Rules Applied

All text converted to lowercase

Multi-character separators normalized to commas

Whitespace trimmed and deduplicated

Unknown tokens mapped to fallback (withdrawal)

Duplicate tokens removed while preserving order

Language Support

English: Primary language

Arabic: Egyptian Arabic (فصحى) entries translated

Mixed: Entries with mixed Arabic/English/special codes handled

Encoding Rationale

The 1/2/3 encoding scheme reflects operational ATM capabilities:

1 (Withdrawal): Basic ATMs at remote or low-traffic locations

2 (Deposit & Withdrawal): Community-focused ATMs in populated areas

3 (Extended): Premium ATMs at banks/malls with forex and other services

This ordinal encoding preserves service hierarchy and is suitable for:

Categorical machine learning models (decision trees, random forests)

Ordinal regression models

Feature engineering (one-hot encoding if needed)

Geographical analysis of service distribution

Example Usage

import pandas as pd

# Load the cleaned and encoded data

df = pd.read_csv('atms_encoded_cleaned_lean.csv')

# View the encoding distribution

print(df['services_encoding'].value_counts().sort_index())

# Filter ATMs by service level

withdrawal_only = df[df['services_encoding'] == 1]

deposit_capable = df[df['services_encoding'] >= 2]

# Analyze by governorate

by_region = df.groupby('governate')['services_encoding'].agg(['mean', 'std', 'count'])

print(by_region)

Future Improvements

ML Classification: Train a classifier to auto-predict encoding from location/context

Service Availability: Track service outages and update encoding over time

Multilingual Expansion: Add support for more Arabic dialects

Confidence Scoring: Add confidence scores for normalized services

API Integration: Create REST API for real-time service lookup

Contributing

Contributions are welcome! Please:

Fork the repository

Create a feature branch (git checkout -b feature/your-feature)

Commit changes (git commit -m 'Add your feature')

Push to branch (git push origin feature/your-feature)

Open a Pull Request

License

This project is licensed under the MIT License - see LICENSE file for details.

Author

Created as part of NTI Data Analysis tasks.

Contact & Support

For questions or issues:

Open an issue on GitHub

Check existing issues for similar problems

Review the inline code comments for detailed logic

Changelog

v1.0.0 (2026-05-02)

Initial release

Core data cleaning and encoding pipeline

Support for Arabic/English multilingual data

Three-level service classification system

Jupyter notebook for interactive analysis

بطاقة العمل

اسم المستقل
عدد الإعجابات
0
عدد المشاهدات
4
تاريخ الإضافة