Named Entity Recognition

Natural Language Processing

Overview

This project implements and compares BiLSTM-based models for Named Entity Recognition (NER), trained on the CoNLL-2003 dataset. The work demonstrates comprehensive preprocessing, model design, training strategies, and performance evaluation for identifying named entities in text.

Repository

  • Technologies Used: PyTorch, Python, NumPy, Pandas, Scikit-learn, GloVe Embeddings
  • Dataset: CoNLL-2003 Shared Task Dataset
  • GitHub Repository: NagaHarshita/NamedEntityRecognition

Dataset & Preprocessing

The project utilizes the CoNLL-2003 dataset, a standard benchmark for Named Entity Recognition tasks. The dataset preprocessing includes:

  • Data Split: train, dev, and test sets
  • Cleaning: Removal of empty lines and normalization
  • Tokenization: Converting words and tags to indexed sequences
  • Entity Types: Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC)
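The word/tag indexing step above can be sketched as follows. This is an illustrative helper, not the repository's actual code; the function names, `<PAD>`/`<UNK>` conventions, and `min_freq` parameter are assumptions.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    """Map each word to an integer index, reserving 0 for <PAD> and 1 for <UNK>."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a tokenized sentence to a sequence of indices; unseen words map to <UNK>."""
    return [vocab.get(w, vocab["<UNK>"]) for w in sentence]

vocab = build_vocab([["EU", "rejects", "German", "call"]])
print(encode(["EU", "rejects", "French", "call"], vocab))  # → [2, 3, 1, 5]
```

The same pattern is applied to the tag sequence, with a separate (much smaller) tag-to-index map.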

Model Architecture

BiLSTM Base Model

BLSTM(
    embedding: Embedding(23700, 100)
    lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
    linear: Linear(512 → 128)
    elu: ELU(alpha=1.0)
    classifier: Linear(128 → 10)
)
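The printed summary above corresponds to a model along the following lines. This is a sketch reconstructed from the summary, not the repository's code; note that in PyTorch a `dropout` argument on a single-layer `nn.LSTM` has no effect (it only applies between stacked layers), so the sketch applies an explicit `nn.Dropout(0.33)` instead.

```python
import torch
import torch.nn as nn

class BLSTM(nn.Module):
    """BiLSTM tagger sketch; layer sizes taken from the module summary above."""
    def __init__(self, vocab_size=23700, emb_dim=100, hidden=256, num_tags=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.33)
        self.linear = nn.Linear(2 * hidden, 128)   # 512 → 128
        self.elu = nn.ELU(alpha=1.0)
        self.classifier = nn.Linear(128, num_tags)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))           # (B, T, 512)
        out = self.elu(self.linear(self.dropout(out)))  # (B, T, 128)
        return self.classifier(out)                     # (B, T, num_tags)

model = BLSTM()
logits = model(torch.randint(0, 23700, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10])
```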

BiLSTM with GloVe Embeddings

BLSTM_Glove(
    embedding: Embedding(400000, 100)
    lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
    linear: Linear(512 → 128)
    elu: ELU(alpha=1.0)
    classifier: Linear(128 → 9)
)
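Initializing the embedding layer from pretrained GloVe vectors might look like the sketch below. The file format matches the standard `glove.6B.100d.txt` release (one word followed by its vector per line); the helper names and the random-initialization fallback for out-of-vocabulary words are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, dim=100):
    """Parse a glove.6B-style text file into a word → vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding(vectors, vocab, dim=100):
    """Copy GloVe rows into an nn.Embedding aligned with the word index."""
    # Words without a pretrained vector keep a small random initialization.
    weight = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for word, idx in vocab.items():
        if word.lower() in vectors:   # GloVe 6B is lowercased
            weight[idx] = vectors[word.lower()]
    return nn.Embedding.from_pretrained(torch.from_numpy(weight), freeze=False)
```

Because GloVe 6B is lowercased, capitalization (often a strong NER cue) must be recovered separately, e.g. via a case feature, which is one reason pretrained embeddings need careful handling here.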

Training Configuration

The models were trained with the following hyperparameters:

  • Optimizer: SGD with learning rate 0.8
  • Learning Rate Scheduler: Dynamic adjustment during training
  • Epochs: 100
  • Batch Size: 32
  • Loss Function: Weighted cross-entropy to handle class imbalance
  • Padding Strategy: ignore_index = -1 for padded values
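Wiring these settings together looks roughly like the sketch below. The class weights shown are placeholders (the project's actual values are not listed), the stand-in model and the `ReduceLROnPlateau` scheduler choice are assumptions; only the learning rate, loss weighting, and `ignore_index=-1` padding convention come from the list above.

```python
import torch
import torch.nn as nn

NUM_TAGS = 10
class_weights = torch.ones(NUM_TAGS)   # placeholder; rarer tags would get larger weights
criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-1)

model = nn.Linear(100, NUM_TAGS)       # stand-in for the BiLSTM tagger
optimizer = torch.optim.SGD(model.parameters(), lr=0.8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")

# One step: flatten (batch, seq, tags) logits and (batch, seq) labels;
# padded positions carry label -1 and are skipped by the loss.
logits = model(torch.randn(32, 12, 100))      # (B, T, C)
labels = torch.randint(0, NUM_TAGS, (32, 12))
labels[:, -3:] = -1                           # simulated padding
loss = criterion(logits.view(-1, NUM_TAGS), labels.view(-1))
loss.backward()
optimizer.step()
scheduler.step(0.0)                           # would pass the dev F1 each epoch
```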

Results & Performance

Model            Accuracy   F1 Score   LOC F1   MISC F1   ORG F1   PER F1
BiLSTM Base      93.65%     65.15      79.93    68.67     58.87    52.81
BiLSTM + GloVe   93.31%     63.54      79.58    66.42     55.82    50.82

Key Insights

Performance Analysis:

  • Both models achieve high accuracy (>93%) on token-level classification
  • Location entities show the best F1 scores (~80), indicating clearer patterns
  • Person and Organization entities present greater challenges due to contextual complexity
  • Class imbalance significantly impacts performance, especially for PER and ORG entities

Technical Learnings:

  • GloVe embeddings provide rich semantic representations, but here they did not outperform the randomly initialized baseline, suggesting they require careful tuning
  • Proper padding and batch handling are crucial for consistent training
  • Weighted loss functions effectively address class imbalance issues
  • Bidirectional context significantly improves entity boundary detection