Learning from imbalanced data in text classification

dc.contributor.advisor: Ζαβιτσάνος, Ηλίας
dc.contributor.advisor: Γιαννακόπουλος, Γεώργιος
dc.contributor.author: Χαντζάρας, Αλέξανδρος
dc.contributor.committee: Γιαννακόπουλος, Γιώργος
dc.contributor.committee: Ζαβιτσάνος, Ηλίας
dc.contributor.committee: Βασιλάκης, Κωνσταντίνος
dc.contributor.department: Department of Informatics and Telecommunications (Τμήμα Πληροφορικής και Τηλεπικοινωνιών)
dc.contributor.faculty: School of Economics and Technology (Σχολή Οικονομίας και Τεχνολογίας)
dc.contributor.master: Data Science (Επιστήμη Δεδομένων)
dc.date.accessioned: 2025-05-12T09:21:11Z
dc.date.available: 2025-05-12T09:21:11Z
dc.date.issued: 2024-12-13
dc.description: M.Sc. thesis no. 132 (Μ.Δ.Ε. 132)
dc.description.abstract: This thesis investigates the performance of several standard and advanced machine learning techniques for text classification in the context of imbalanced datasets. The research focuses on four well-established algorithms—Decision Tree, Random Forest, Support Vector Machine (SVM), and Logistic Regression—alongside two advanced methods: the DynAmic self-Paced sampling enSemble (DAPS) algorithm and Example-Dependent Cost-Sensitive Learning. These approaches are evaluated on the 20-Newsgroups and Clickbait datasets, under varying levels of class imbalance and text representations. Our goal is to assess whether the DAPS algorithm and Example-Dependent Cost-Sensitive Learning can improve classification performance compared to standard classifiers in scenarios with high class imbalance. The DAPS algorithm utilizes dynamic sampling and instance weighting to address overlapping regions in the data, while Example-Dependent Cost-Sensitive Learning incorporates the financial impact of misclassifications into the learning process. To evaluate these methods, 32 dataset variants were created by applying transformations such as TF-IDF, Bag-of-Words, Word2Vec, and GloVe, and inducing different levels of class imbalance. Experimental results indicate that cost-sensitive methods, particularly when paired with Random Forest, consistently outperform standard classifiers across a range of imbalance ratios, especially with Word2Vec and GloVe embeddings. The DAPS algorithm also demonstrated superior performance with Random Forest and SVM classifiers, particularly in datasets with low imbalance ratios. However, its effectiveness varied depending on the type of text representation. Both DAPS and cost-sensitive methods underperformed with Bag-of-Words representations, where standard algorithms were more successful. Despite the resource-intensive nature of cost-sensitive methods, their robustness in handling severe imbalances is a key finding.
The datasets created during this research and the corresponding code are made available for future exploration and replication. The study concludes that while advanced methods like DAPS and cost-sensitive learning significantly improve classification in imbalanced text datasets, their effectiveness is influenced by the text representation and computational resources available. Future research should explore expanding these methods to other algorithms, refining resource consumption, and experimenting with a broader range of datasets and imbalance levels to further optimize their application.
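The example-dependent cost-sensitive approach summarized in the abstract can be illustrated with a minimal sketch: per-example misclassification costs passed to a Random Forest as sample weights over a TF-IDF representation. The toy corpus, cost values, and classifier settings below are hypothetical illustrations under scikit-learn, not the thesis's actual code or data.

```python
# Hedged sketch of example-dependent cost-sensitive learning:
# per-example costs supplied as sample weights (a common approximation).
import numpy as np
from sklearn.ensemble import RandomForestClassifier
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy imbalanced corpus: one minority (clickbait-like) example vs. three others.
texts = [
    "headline you won't believe what happened next",
    "central bank publishes quarterly economic report",
    "weather forecast predicts rain for tomorrow",
    "parliament passes new budget legislation",
]
labels = np.array([1, 0, 0, 0])

X = TfidfVectorizer().fit_transform(texts)

# Example-dependent costs (illustrative values): misclassifying a minority
# example is treated as ten times costlier than a majority example.
costs = np.where(labels == 1, 10.0, 1.0)

clf = RandomForestClassifier(n_estimators=100, random_state=0)
clf.fit(X, labels, sample_weight=costs)
print(clf.predict(X))
```

In a real setting, the cost vector would come from the application domain (e.g., the financial impact of each misclassification), which is what distinguishes example-dependent costs from a single per-class weight.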
dc.format.extent: 82
dc.identifier.uri: https://amitos.library.uop.gr/xmlui/handle/123456789/8843
dc.language.iso: en
dc.publisher: University of the Peloponnese (Πανεπιστήμιο Πελοποννήσου)
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Greece
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/gr/
dc.subject: Natural language processing
dc.subject: Φυσική Γλώσσα--Επεξεργασία
dc.subject: Classification--Computers
dc.subject: Ταξινόμηση--Υπολογιστές
dc.subject.keyword: Data Science, Text Classification, Natural Language Processing
dc.title: Learning from imbalanced data in text classification
dc.type: Master's thesis (Μεταπτυχιακή διπλωματική εργασία)

Files

Original bundle

Name: Chantzaras_2022202104022.pdf
Size: 1.16 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 933 B
Format: Item-specific license agreed upon to submission