Learning from imbalanced data in text classification
Publisher
Πανεπιστήμιο Πελοποννήσου (University of the Peloponnese)
Abstract
This thesis investigates the performance of several standard and advanced machine
learning techniques for text classification in the context of imbalanced datasets.
The research focuses on four well-established algorithms—Decision Tree, Random
Forest, Support Vector Machine (SVM), and Logistic Regression—alongside two
advanced methods: the DynAmic self-Paced sampling enSemble (DAPS) algorithm
and Example-Dependent Cost-Sensitive Learning. These approaches are evaluated
across 20-Newsgroups and Clickbait datasets, under varying levels of class imbalance
and text representations.
Our goal is to assess whether the DAPS algorithm and Example-Dependent Cost-
Sensitive Learning can improve classification performance compared to standard
classifiers in scenarios with high class imbalance. The DAPS algorithm utilizes
dynamic sampling and instance weighting to address overlapping regions in the data,
while Example-Dependent Cost-Sensitive Learning incorporates the financial impact
of misclassifications into the learning process. To evaluate these methods, 32 dataset
variants were created by applying transformations such as TF-IDF, Bag-of-Words,
Word2Vec, and GloVe, and inducing different levels of class imbalance.
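The dataset-variant construction described above can be sketched as follows. This is an illustrative helper, not the thesis's actual pipeline: it downsamples the minority class of a labelled text corpus to a target imbalance ratio, producing one variant; a vectorizer (TF-IDF, Bag-of-Words, or averaged Word2Vec/GloVe vectors) would then be applied to the result. The function name and parameters are assumptions for illustration.

```python
import random

def induce_imbalance(texts, labels, minority_label, ratio, seed=0):
    """Keep all majority examples and roughly ratio * |majority|
    minority examples, yielding one imbalanced dataset variant."""
    rng = random.Random(seed)
    majority = [(t, y) for t, y in zip(texts, labels) if y != minority_label]
    minority = [(t, y) for t, y in zip(texts, labels) if y == minority_label]
    k = max(1, int(len(majority) * ratio))
    kept = rng.sample(minority, min(k, len(minority)))
    pairs = majority + kept
    rng.shuffle(pairs)
    new_texts, new_labels = zip(*pairs)
    return list(new_texts), list(new_labels)

# Example: induce a 10:1 imbalance in a balanced toy corpus.
texts = [f"doc {i}" for i in range(200)]
labels = [0] * 100 + [1] * 100
x, y = induce_imbalance(texts, labels, minority_label=1, ratio=0.1)
print(sum(1 for v in y if v == 0), sum(1 for v in y if v == 1))  # 100 10
```

Repeating this with different `ratio` values and different vectorizers is one way such a grid of dataset variants can be generated.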
Experimental results indicate that cost-sensitive methods, particularly when
paired with Random Forest, consistently outperform standard classifiers across a
range of imbalance ratios, especially with Word2Vec and GloVe embeddings. The
DAPS algorithm also demonstrated superior performance with Random Forest and
SVM classifiers, particularly in datasets with low imbalance ratios. However, its
effectiveness varied depending on the type of text representation. Both DAPS and
cost-sensitive methods underperformed with Bag-of-Words representations, where standard algorithms were more successful. Despite the resource-intensive nature
of cost-sensitive methods, their robustness in handling severe imbalances is a key
finding.
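The core idea behind cost-sensitive classification can be illustrated with the generic Bayes minimum-expected-cost decision rule: each example carries its own misclassification costs, and the predicted class is the one minimizing expected cost under the model's class probabilities. This is a minimal sketch of that general rule, not the thesis's specific method; the function name and the toy cost values are assumptions.

```python
def min_expected_cost_class(probs, cost_matrix):
    """probs[j] = P(true class j | x); cost_matrix[j][k] = cost of
    predicting class k when the true class is j (per-example values)."""
    n = len(probs)
    best_k, best_cost = 0, float("inf")
    for k in range(n):
        expected = sum(probs[j] * cost_matrix[j][k] for j in range(n))
        if expected < best_cost:
            best_k, best_cost = k, expected
    return best_k

# A rare but costly positive class: even at P(positive) = 0.2, predicting
# class 1 minimizes expected cost, because missing a true positive costs
# 50 while a false alarm costs only 1.
costs = [[0.0, 1.0],   # true class 0: correct = 0, false alarm = 1
         [50.0, 0.0]]  # true class 1: miss = 50, correct = 0
print(min_expected_cost_class([0.8, 0.2], costs))  # 1
```

This is why cost-sensitive methods remain robust under severe imbalance: the decision threshold shifts with the per-example costs rather than with raw class frequencies.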
The datasets created during this research and the corresponding code are made
available for future exploration and replication. The study concludes that while
advanced methods like DAPS and cost-sensitive learning significantly improve classification
in imbalanced text datasets, their effectiveness is influenced by the text
representation and computational resources available. Future research should explore
extending these methods to additional algorithms, reducing their computational cost,
and experimenting with a broader range of datasets and imbalance levels to further
optimize their application.
Description
Μ.Δ.Ε. (Master's thesis) 132
Creative Commons license
Except where otherwise noted, this item's license is described as Attribution-NonCommercial-NoDerivatives 3.0 Greece (Αναφορά Δημιουργού-Μη Εμπορική Χρήση-Όχι Παράγωγα Έργα 3.0 Ελλάδα)

