Project Details

This project reproduces and extends a highly cited CVPR 2016 paper, “A Hierarchical Deep Temporal Model for Group Activity Recognition”. The paper presents a deep learning approach for understanding group activities in videos by modeling both individual and collective behavior over time. My work revisits this architecture using modern tools (e.g., PyTorch) and enhances it with new insights, experiments, and implementation improvements.

The project builds a deep learning pipeline that first recognizes individual actions from video frames (e.g., spiking, standing), then aggregates these actions to infer group-level activities (e.g., right team winning a point), and finally uses temporal modeling (LSTM/GRU) to learn motion and behavior patterns over time.

The project uses the Volleyball Dataset introduced in the CVPR paper, which contains 55 volleyball videos collected from YouTube. From these, 4,830 hand-picked frames were manually annotated as follows:

1. 9 player actions: e.g., Waiting, Spiking, Digging, Blocking, Standing.

2. 8 team activities: e.g., Left Pass, Right Set, Left Spike, Right Winpoint.

3. Bounding boxes for all visible players in each frame.

The model employs a hierarchical temporal structure that captures both individual and group dynamics over time. First, spatial features are extracted from each player’s cropped bounding box using a pretrained ResNet-50. These features, collected across consecutive frames, are fed into an LSTM that models the temporal behavior of each player, followed by a classifier that predicts the player’s action at every frame. Next, the LSTM outputs for all players in a given frame are pooled (e.g., max-pooled) into a scene-level representation. To preserve spatial and team-specific information, players are split into two teams (left and right) and pooled per team. The concatenated team-level features are processed by a second LSTM that captures the temporal evolution of group behavior, and a final classifier predicts the overall group activity at each frame.

Baselines and Ablation Studies
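Pooling over players is the step that several of the configurations in this project vary: either all players in a frame are pooled together, or each team is pooled separately and the results are concatenated. The following is a minimal, dependency-free sketch of both variants, not the project's actual code. It assumes each player is given as a hypothetical `(x_center, feature)` pair, and that the left team is simply the half of the players with the smaller bounding-box x-centers; the real team assignment may differ.

```python
from typing import List, Sequence, Tuple

Feature = List[float]
Player = Tuple[float, Feature]  # (bounding-box x-center, feature vector); hypothetical layout


def max_pool(features: Sequence[Feature]) -> Feature:
    """Element-wise max over a non-empty set of equal-length feature vectors."""
    if not features:
        raise ValueError("max_pool needs at least one feature vector")
    return [max(values) for values in zip(*features)]


def scene_pool(players: Sequence[Player]) -> Feature:
    """Pool all players in a frame into a single scene-level vector."""
    return max_pool([feat for _, feat in players])


def team_pool(players: Sequence[Player]) -> Feature:
    """Pool the left and right teams separately, then concatenate.

    Assumption: the left team is the half of the players with the smaller
    x-centers. At least two players are required.
    """
    ordered = sorted(players, key=lambda p: p[0])
    half = len(ordered) // 2
    left = [feat for _, feat in ordered[:half]]
    right = [feat for _, feat in ordered[half:]]
    return max_pool(left) + max_pool(right)


players = [
    (0.8, [0.1, 0.9]),
    (0.2, [0.5, 0.3]),
    (0.3, [0.7, 0.1]),
    (0.9, [0.4, 0.2]),
]
print(scene_pool(players))  # [0.7, 0.9]
print(team_pool(players))   # [0.7, 0.3, 0.4, 0.9]
```

Scene-wide pooling yields one vector of the original feature size, while team-wise pooling doubles the size but keeps track of which side of the court a strong response came from. That is the information needed to tell a left-team activity from its right-team counterpart.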

To better understand the contribution of each model component, I implemented several baselines of increasing complexity, inspired by the CVPR paper and its follow-up work:

• B1: a naive method that uses a single frame per clip and fine-tunes a ResNet-50 to classify the frame into one of the 8 group activity classes, without any temporal or individual modeling.

• B3: trains a ResNet-50 for individual action classification; at inference, person-level features are extracted and max-pooled across all players to classify the overall scene activity.

• B4: adds temporal context by using sequences of 9 consecutive frames per clip, extracting image-level features with the B1 network, and feeding them to an LSTM.

• B5: applies an LSTM to the sequence of each player’s cropped bounding-box features, then represents the clip by max-pooling the player features for classification.

• B7: the full two-stage temporal model, with a player-level LSTM followed by frame-level pooling and a second LSTM for final scene classification.

• B8: extends B7 by pooling player features separately for each team (left/right) and concatenating the team features for scene-level modeling.

• B9: a unified model that jointly trains the individual and group-level outputs with shared parameters, replaces the LSTMs with GRUs, and uses a smaller ResNet-34 backbone to reduce model size and overfitting, which allows end-to-end backpropagation.

Quantitative evaluation across these baselines showed the benefits of temporal modeling, team-wise pooling, and a unified loss for group activity recognition.
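The two-stage design with team-wise pooling (B8), together with the joint training idea behind B9, can be sketched in PyTorch roughly as follows. This is an illustrative sketch under stated assumptions, not the project's implementation: it operates on precomputed per-player features rather than image crops, the class name `HierarchicalGroupModel`, the hidden size, the `joint_loss` helper and its `action_weight` parameter are all hypothetical, players are assumed to be pre-sorted left to right so that the first half forms the left team, and the joint loss is assumed to be a simple weighted sum of two cross-entropy terms.

```python
import torch
import torch.nn as nn


class HierarchicalGroupModel(nn.Module):
    """Player-level LSTM -> team-wise max pooling -> scene-level LSTM."""

    def __init__(self, feat_dim=2048, hidden=512, n_actions=9, n_activities=8):
        super().__init__()
        # feat_dim=2048 matches the pooled output of a ResNet-50 backbone.
        self.player_lstm = nn.LSTM(feat_dim, hidden, batch_first=True)
        self.action_head = nn.Linear(hidden, n_actions)
        # Two pooled team vectors are concatenated, hence 2 * hidden.
        self.scene_lstm = nn.LSTM(2 * hidden, hidden, batch_first=True)
        self.activity_head = nn.Linear(hidden, n_activities)

    def forward(self, feats):
        # feats: (batch, players, time, feat_dim), with at least 2 players.
        # Assumption: players are sorted left to right along dim 1.
        b, p, t, d = feats.shape
        player_out, _ = self.player_lstm(feats.reshape(b * p, t, d))
        player_out = player_out.reshape(b, p, t, -1)
        action_logits = self.action_head(player_out)     # (b, p, t, n_actions)

        half = p // 2
        left = player_out[:, :half].max(dim=1).values    # (b, t, hidden)
        right = player_out[:, half:].max(dim=1).values   # (b, t, hidden)
        scene_in = torch.cat([left, right], dim=-1)      # (b, t, 2 * hidden)

        scene_out, _ = self.scene_lstm(scene_in)
        activity_logits = self.activity_head(scene_out)  # (b, t, n_activities)
        return action_logits, activity_logits


def joint_loss(action_logits, activity_logits, action_labels, activity_labels,
               action_weight=1.0):
    """Weighted sum of per-player and per-frame cross-entropy (an assumption)."""
    ce = nn.functional.cross_entropy
    action_loss = ce(action_logits.flatten(0, 2), action_labels.flatten())
    activity_loss = ce(activity_logits.flatten(0, 1), activity_labels.flatten())
    return activity_loss + action_weight * action_loss


if __name__ == "__main__":
    model = HierarchicalGroupModel(feat_dim=16, hidden=8)
    feats = torch.randn(2, 12, 9, 16)  # 2 clips, 12 players, 9 frames
    actions, activities = model(feats)
    print(actions.shape, activities.shape)
```

Swapping `nn.LSTM` for `nn.GRU` requires no other changes here, since both return an `(output, state)` pair and only the output is used. Because both heads share the player-level recurrent layer, a single backward pass through `joint_loss` updates the individual and group branches together.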
