Named Entity Recognition
Natural Language Processing
Overview
This project implements and compares BiLSTM-based models for Named Entity Recognition (NER), trained on the CoNLL-2003 dataset. The work demonstrates comprehensive preprocessing, model design, training strategies, and performance evaluation for identifying named entities in text.
Repository
- Technologies Used: PyTorch, Python, NumPy, Pandas, Scikit-learn, GloVe Embeddings
- Dataset: CoNLL-2003 Shared Task Dataset
- GitHub Repository: NagaHarshita/NamedEntityRecognition
Dataset & Preprocessing
The project utilizes the CoNLL-2003 dataset, a standard benchmark for Named Entity Recognition tasks. The dataset preprocessing includes:
- Data Split: train, dev, and test sets
- Cleaning: Removal of empty lines and normalization
- Tokenization: Converting words and tags to indexed sequences
- Entity Types: Person (PER), Organization (ORG), Location (LOC), and Miscellaneous (MISC)
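The preprocessing steps above can be sketched in plain Python. This is a minimal reading of the CoNLL column format (word first, NER tag last, blank lines between sentences); the helper names `read_conll` and `build_vocab` are illustrative, not the project's actual function names.

```python
def read_conll(lines):
    """Group CoNLL-style lines into sentences of (word, tag) pairs.

    Assumes each non-empty line holds whitespace-separated columns with
    the word first and the NER tag last; blank lines (and -DOCSTART-
    markers) separate sentences.
    """
    sentences, current = [], []
    for line in lines:
        line = line.strip()
        if not line or line.startswith("-DOCSTART-"):
            if current:
                sentences.append(current)
                current = []
        else:
            cols = line.split()
            current.append((cols[0], cols[-1]))
    if current:
        sentences.append(current)
    return sentences

def build_vocab(sentences):
    """Map every word and tag to an integer index; index 0 is reserved
    for padding, matching the padding strategy described below."""
    word2idx, tag2idx = {"<pad>": 0}, {}
    for sent in sentences:
        for word, tag in sent:
            word2idx.setdefault(word, len(word2idx))
            tag2idx.setdefault(tag, len(tag2idx))
    return word2idx, tag2idx
```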
Model Architecture
BiLSTM Base Model
```
BLSTM(
  embedding: Embedding(23700, 100)
  lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
  linear: Linear(512 → 128)
  elu: ELU(alpha=1.0)
  classifier: Linear(128 → 10)
)
```
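The printout above can be reconstructed as a PyTorch module. This is a sketch inferred from the listed dimensions, not the project's exact source; in particular, `batch_first=True` and the forward-pass wiring are assumptions.

```python
import torch
import torch.nn as nn

class BLSTM(nn.Module):
    """BiLSTM tagger with the dimensions shown in the printout above."""
    def __init__(self, vocab_size=23700, emb_dim=100, hidden=256, num_tags=10):
        super().__init__()
        self.embedding = nn.Embedding(vocab_size, emb_dim)
        # Note: with a single LSTM layer, PyTorch applies no inter-layer
        # dropout and emits a warning; the printout still lists dropout=0.33.
        self.lstm = nn.LSTM(emb_dim, hidden, batch_first=True,
                            dropout=0.33, bidirectional=True)
        self.linear = nn.Linear(2 * hidden, 128)   # 512 -> 128
        self.elu = nn.ELU(alpha=1.0)
        self.classifier = nn.Linear(128, num_tags)

    def forward(self, x):                          # x: (batch, seq) word indices
        out, _ = self.lstm(self.embedding(x))      # (batch, seq, 512)
        return self.classifier(self.elu(self.linear(out)))
```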
BiLSTM with GloVe Embeddings
```
BLSTM_Glove(
  embedding: Embedding(400000, 100)
  lstm: LSTM(100, 256, dropout=0.33, bidirectional=True)
  linear: Linear(512 → 128)
  elu: ELU(alpha=1.0)
  classifier: Linear(128 → 9)
)
```
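The 400,000-row embedding table corresponds to the GloVe vocabulary. One way to initialize it, sketched below; `load_glove` and `build_embedding_matrix` are illustrative helpers, and the lowercase lookup and random fallback for out-of-vocabulary words are assumptions, not confirmed details of the project.

```python
import numpy as np
import torch

def load_glove(path, emb_dim=100):
    """Parse a GloVe text file (one 'word v1 v2 ...' record per line)."""
    vectors = {}
    with open(path, encoding="utf-8") as f:
        for line in f:
            parts = line.rstrip().split(" ")
            vectors[parts[0]] = np.asarray(parts[1:], dtype=np.float32)
    return vectors

def build_embedding_matrix(vectors, word2idx, emb_dim=100):
    """Copy pretrained vectors into rows of an embedding matrix.

    Words without a pretrained vector keep a small random initialization
    (an assumption; other schemes, e.g. zeros, are equally plausible).
    """
    matrix = np.random.normal(scale=0.1,
                              size=(len(word2idx), emb_dim)).astype(np.float32)
    for word, idx in word2idx.items():
        vec = vectors.get(word.lower())   # GloVe vocabularies are lowercase
        if vec is not None:
            matrix[idx] = vec
    return torch.from_numpy(matrix)
```

The resulting tensor can seed the layer via `nn.Embedding.from_pretrained(matrix, freeze=False)`.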
Training Configuration
The models were trained with carefully tuned hyperparameters:
- Optimizer: SGD with learning rate 0.8
- Learning Rate Scheduler: Dynamic adjustment during training
- Epochs: 100
- Batch Size: 32
- Loss Function: Weighted cross-entropy to handle class imbalance
- Padding Strategy: ignore_index = -1 for padded values
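The configuration above translates into a loop along these lines. The per-tag weight tensor and the choice of `ReduceLROnPlateau` for the "dynamic adjustment" scheduler are assumptions; the project does not state the exact scheduler used.

```python
import torch
import torch.nn as nn

def train(model, loader, tag_weights, epochs=100):
    """Training-loop sketch using the listed hyperparameters.

    tag_weights: 1-D tensor with one weight per tag class (assumed form
    of the class-imbalance weighting). Padded positions carry tag -1 and
    are skipped by the loss via ignore_index.
    """
    criterion = nn.CrossEntropyLoss(weight=tag_weights, ignore_index=-1)
    optimizer = torch.optim.SGD(model.parameters(), lr=0.8)
    scheduler = torch.optim.lr_scheduler.ReduceLROnPlateau(optimizer, factor=0.5)
    for epoch in range(epochs):
        total = 0.0
        for words, tags in loader:              # padded batches, shape (B, T)
            optimizer.zero_grad()
            logits = model(words)               # (B, T, num_tags)
            loss = criterion(logits.view(-1, logits.size(-1)), tags.view(-1))
            loss.backward()
            optimizer.step()
            total += loss.item()
        scheduler.step(total)                   # lower LR when loss plateaus
    return model
```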
Results & Performance
| Model | Accuracy | F1 Score | LOC F1 | MISC F1 | ORG F1 | PER F1 |
|---|---|---|---|---|---|---|
| BiLSTM Base | 93.65% | 65.15 | 79.93 | 68.67 | 58.87 | 52.81 |
| BiLSTM + GloVe | 93.31% | 63.54 | 79.58 | 66.42 | 55.82 | 50.82 |
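The gap between >93% accuracy and ~65 F1 comes from scoring: accuracy counts every token (mostly easy `O` tags), while entity F1 credits a prediction only when the type and exact span both match. A minimal pure-Python sketch of that exact-match scoring, assuming BIO tags (the helper names are illustrative):

```python
def extract_spans(tags):
    """Collect (type, start, end) entity spans from a BIO tag sequence."""
    spans, etype, start = [], None, None
    for i, tag in enumerate(tags):
        if tag.startswith("B-") or (tag.startswith("I-") and tag[2:] != etype):
            if etype is not None:
                spans.append((etype, start, i))
            etype, start = tag[2:], i
        elif tag == "O":
            if etype is not None:
                spans.append((etype, start, i))
            etype = start = None
    if etype is not None:
        spans.append((etype, start, len(tags)))
    return spans

def entity_f1(gold_tags, pred_tags):
    """Micro entity-level F1: a predicted entity counts only if its type
    and boundaries exactly match a gold entity."""
    gold, pred = set(extract_spans(gold_tags)), set(extract_spans(pred_tags))
    tp = len(gold & pred)
    if tp == 0:
        return 0.0
    precision, recall = tp / len(pred), tp / len(gold)
    return 2 * precision * recall / (precision + recall)
```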
Key Insights
Performance Analysis:
- Both models achieve high accuracy (>93%) on token-level classification
- Location entities show the best F1 (~80), suggesting their surface and contextual patterns are the easiest to learn
- Person and Organization entities present greater challenges due to contextual complexity
- Class imbalance significantly impacts performance, especially for PER and ORG entities
Technical Learnings:
- GloVe embeddings provide rich semantic representations, yet slightly underperformed the embeddings trained from scratch here, indicating they require careful tuning (e.g., freezing vs. fine-tuning)
- Proper padding and batch handling are crucial for consistent training
- Weighted loss functions effectively address class imbalance issues
- Bidirectional context significantly improves entity boundary detection