Named Entity Recognition

Natural Language Processing

Overview

This project implements and compares BiLSTM-based models for Named Entity Recognition (NER), trained on the CoNLL-2003 dataset. The work demonstrates comprehensive preprocessing, model design, training strategies, and performance evaluation for identifying named entities in text.

Repository

  • Technologies Used: PyTorch, Python, NumPy, Pandas, Scikit-learn, GloVe Embeddings
  • Dataset: CoNLL-2003 Shared Task Dataset
  • GitHub Repository: NagaHarshita/NamedEntityRecognition

Dataset & Preprocessing

The project utilizes the CoNLL-2003 dataset, a standard benchmark for Named Entity Recognition tasks. The dataset preprocessing includes:

  • Data Split: train, dev, and test sets
  • Cleaning: Removal of empty lines and normalization
  • Tokenization: Converting words and tags to indexed sequences
  • Entity Types: Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC)
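The word/tag indexing step above can be sketched as follows. This is an illustrative helper, not the repository's actual code; the function names, `<PAD>`/`<UNK>` conventions, and `min_freq` parameter are assumptions.

```python
from collections import Counter

def build_vocab(sentences, min_freq=1):
    """Map each word to an integer index, reserving 0 for <PAD> and 1 for <UNK>."""
    counts = Counter(w for sent in sentences for w in sent)
    vocab = {"<PAD>": 0, "<UNK>": 1}
    for word, freq in counts.items():
        if freq >= min_freq:
            vocab[word] = len(vocab)
    return vocab

def encode(sentence, vocab):
    """Convert a tokenized sentence to a sequence of indices; unseen words map to <UNK>."""
    return [vocab.get(w, vocab["<UNK>"]) for w in sentence]

vocab = build_vocab([["EU", "rejects", "German", "call"]])
print(encode(["EU", "rejects", "French", "call"], vocab))  # → [2, 3, 1, 5]
```

The same pattern is applied to the tag sequence, with a separate (much smaller) tag-to-index map.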

Model Architecture

BiLSTM Base Model

BLSTM(
    embedding: Embedding(23700, 100)
    lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
    linear: Linear(512 → 128)
    elu: ELU(alpha=1.0)
    classifier: Linear(128 → 10)
)
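The printed summary above corresponds to a model along the following lines. This is a sketch reconstructed from the summary, not the repository's code; note that in PyTorch a `dropout` argument on a single-layer `nn.LSTM` has no effect (it only applies between stacked layers), so the sketch applies an explicit `nn.Dropout(0.33)` instead.

```python
import torch
import torch.nn as nn

class BLSTM(nn.Module):
    """BiLSTM tagger sketch; layer sizes taken from the module summary above."""
    def __init__(self, vocab_size=23700, emb_dim=100, hidden=256, num_tags=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True, bidirectional=True)
        self.dropout = nn.Dropout(0.33)
        self.linear = nn.Linear(2 * hidden, 128)   # 512 → 128
        self.elu = nn.ELU(alpha=1.0)
        self.classifier = nn.Linear(128, num_tags)

    def forward(self, x):
        out, _ = self.lstm(self.embedding(x))           # (B, T, 512)
        out = self.elu(self.linear(self.dropout(out)))  # (B, T, 128)
        return self.classifier(out)                     # (B, T, num_tags)

model = BLSTM()
logits = model(torch.randint(0, 23700, (2, 5)))
print(logits.shape)  # torch.Size([2, 5, 10])
```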

BiLSTM with GloVe Embeddings

BLSTM_Glove(
    embedding: Embedding(400000, 100)
    lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
    linear: Linear(512 → 128)
    elu: ELU(alpha=1.0)
    classifier: Linear(128 → 9)
)
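Initializing the embedding layer from pretrained GloVe vectors might look like the sketch below. The file format matches the standard `glove.6B.100d.txt` release (one word followed by its vector per line); the helper names and the random-initialization fallback for out-of-vocabulary words are assumptions.

```python
import numpy as np
import torch
import torch.nn as nn

def load_glove(path, dim=100):
    """Parse a glove.6B-style text file into a word → vector dict."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding(vectors, vocab, dim=100):
    """Copy GloVe rows into an nn.Embedding aligned with the word index."""
    # Words without a pretrained vector keep a small random initialization.
    weight = np.random.normal(scale=0.1, size=(len(vocab), dim)).astype(np.float32)
    for word, idx in vocab.items():
        if word.lower() in vectors:   # GloVe 6B is lowercased
            weight[idx] = vectors[word.lower()]
    return nn.Embedding.from_pretrained(torch.from_numpy(weight), freeze=False)
```

Because GloVe 6B is lowercased, capitalization (often a strong NER cue) must be recovered separately, e.g. via a case feature, which is one reason pretrained embeddings need careful handling here.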

Training Configuration

The models were trained with the following hyperparameters:

  • Optimizer: SGD with learning rate 0.8
  • Learning Rate Scheduler: Dynamic adjustment during training
  • Epochs: 100
  • Batch Size: 32
  • Loss Function: Weighted cross-entropy to handle class imbalance
  • Padding Strategy: ignore_index = -1 for padded values
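Wiring these settings together looks roughly like the sketch below. The class weights shown are placeholders (the project's actual values are not listed), the stand-in model and the `ReduceLROnPlateau` scheduler choice are assumptions; only the learning rate, loss weighting, and `ignore_index=-1` padding convention come from the list above.

```python
import torch
import torch.nn as nn

NUM_TAGS = 10
class_weights = torch.ones(NUM_TAGS)   # placeholder; rarer tags would get larger weights
criterion = nn.CrossEntropyLoss(weight=class_weights, ignore_index=-1)

model = nn.Linear(100, NUM_TAGS)       # stand-in for the BiLSTM tagger
optimizer = torch.optim.SGD(model.parameters(), lr=0.8)
scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, mode="max")

# One step: flatten (batch, seq, tags) logits and (batch, seq) labels;
# padded positions carry label -1 and are skipped by the loss.
logits = model(torch.randn(32, 12, 100))      # (B, T, C)
labels = torch.randint(0, NUM_TAGS, (32, 12))
labels[:, -3:] = -1                           # simulated padding
loss = criterion(logits.view(-1, NUM_TAGS), labels.view(-1))
loss.backward()
optimizer.step()
scheduler.step(0.0)                           # would pass the dev F1 each epoch
```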

Results & Performance

Model            Accuracy   F1 Score   LOC F1   MISC F1   ORG F1   PER F1
BiLSTM Base      93.65%     65.15      79.93    68.67     58.87    52.81
BiLSTM + GloVe   93.31%     63.54      79.58    66.42     55.82    50.82

Key Insights

Performance Analysis:

  • Both models achieve high accuracy (>93%) on token-level classification
  • Location entities show the best F1 scores (~80), indicating clearer patterns
  • Person and Organization entities present greater challenges due to contextual complexity
  • Class imbalance significantly impacts performance, especially for PER and ORG entities

Technical Learnings:

  • GloVe embeddings provide rich semantic representations, but here they did not outperform the randomly initialized baseline, suggesting they require careful tuning
  • Proper padding and batch handling are crucial for consistent training
  • Weighted loss functions effectively address class imbalance issues
  • Bidirectional context significantly improves entity boundary detection