Learning from imbalanced data in text classification

dc.contributor.advisor: Ζαβιτσάνος, Ηλίας
dc.contributor.advisor: Γιαννακόπουλος, Γεώργιος
dc.contributor.author: Χαντζάρας, Αλέξανδρος
dc.contributor.committee: Γιαννακόπουλος, Γιώργος
dc.contributor.committee: Ζαβιτσάνος, Ηλίας
dc.contributor.committee: Βασιλάκης, Κωνσταντίνος
dc.contributor.department: Department of Informatics and Telecommunications (Τμήμα Πληροφορικής και Τηλεπικοινωνιών)
dc.contributor.faculty: School of Economics and Technology (Σχολή Οικονομίας και Τεχνολογίας)
dc.contributor.master: Data Science (Επιστήμη Δεδομένων)
dc.date.accessioned: 2025-05-12T09:21:11Z
dc.date.available: 2025-05-12T09:21:11Z
dc.date.issued: 2024-12-13
dc.description: M.Sc. thesis no. 132 (Μ.Δ.Ε. 132)
dc.description.abstract: This thesis investigates the performance of several standard and advanced machine learning techniques for text classification in the context of imbalanced datasets. The research focuses on four well-established algorithms—Decision Tree, Random Forest, Support Vector Machine (SVM), and Logistic Regression—alongside two advanced methods: the DynAmic self-Paced sampling enSemble (DAPS) algorithm and Example-Dependent Cost-Sensitive Learning. These approaches are evaluated on the 20-Newsgroups and Clickbait datasets, under varying levels of class imbalance and text representations. Our goal is to assess whether the DAPS algorithm and Example-Dependent Cost-Sensitive Learning can improve classification performance compared to standard classifiers in scenarios with high class imbalance. The DAPS algorithm utilizes dynamic sampling and instance weighting to address overlapping regions in the data, while Example-Dependent Cost-Sensitive Learning incorporates the financial impact of misclassifications into the learning process. To evaluate these methods, 32 dataset variants were created by applying transformations such as TF-IDF, Bag-of-Words, Word2Vec, and GloVe, and inducing different levels of class imbalance. Experimental results indicate that cost-sensitive methods, particularly when paired with Random Forest, consistently outperform standard classifiers across a range of imbalance ratios, especially with Word2Vec and GloVe embeddings. The DAPS algorithm also demonstrated superior performance with Random Forest and SVM classifiers, particularly in datasets with low imbalance ratios. However, its effectiveness varied depending on the type of text representation. Both DAPS and cost-sensitive methods underperformed with Bag-of-Words representations, where standard algorithms were more successful. Despite the resource-intensive nature of cost-sensitive methods, their robustness in handling severe imbalances is a key finding.
The datasets created during this research and the corresponding code are made available for future exploration and replication. The study concludes that while advanced methods like DAPS and cost-sensitive learning significantly improve classification in imbalanced text datasets, their effectiveness is influenced by the text representation and computational resources available. Future research should explore expanding these methods to other algorithms, refining resource consumption, and experimenting with a broader range of datasets and imbalance levels to further optimize their application.
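The example-dependent cost-sensitive approach summarized in the abstract can be illustrated with a minimal sketch: per-example misclassification costs passed to a Random Forest as sample weights over a TF-IDF representation. The toy corpus, cost values, and classifier settings below are hypothetical illustrations under scikit-learn, not the thesis's actual code or data.

```python
# Hedged sketch of example-dependent cost-sensitive learning:
# per-example costs supplied as sample weights (a common approximation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy imbalanced corpus: one minority (clickbait-like) example vs. three others.
texts = [
    "headline you won't believe what happened next",
    "central bank publishes quarterly economic report",
    "weather forecast predicts rain for tomorrow",
    "parliament passes new budget legislation",
]
labels = np.array([1, 0, 0, 0])

X = TfidfVectorizer().fit_transform(texts)

# Example-dependent costs (illustrative values): misclassifying a minority
# example is treated as ten times costlier than a majority example.
costs = np.where(labels == 1, 10.0, 1.0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels, sample_weight=costs)
print(clf.predict(X))
```

In a real setting, the cost vector would come from the application domain (e.g., the financial impact of each misclassification), which is what distinguishes example-dependent costs from a single per-class weight.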
dc.format.extent: 82
dc.identifier.uri: https://amitos.library.uop.gr/xmlui/handle/123456789/8843
dc.language.iso: en
dc.publisher: University of the Peloponnese (Πανεπιστήμιο Πελοποννήσου)
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Greece
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/gr/
dc.subject: Natural language processing
dc.subject: Φυσική Γλώσσα--Επεξεργασία
dc.subject: Classification--Computers
dc.subject: Ταξινόμηση--Υπολογιστές
dc.subject.keyword: Data Science, Text Classification, Natural Language Processing
dc.title: Learning from imbalanced data in text classification
dc.type: Master's thesis (Μεταπτυχιακή διπλωματική εργασία)

Files

Original bundle

Name: Chantzaras_2022202104022.pdf
Size: 1.16 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 933 B
Format: Item-specific license agreed upon to submission