Document Data Analysis via Machine/Deep Learning techniques

dc.contributor.advisor: Πετάσης, Γεώργιος
dc.contributor.author: Σπυράτος, Άγγελος
dc.contributor.committee: Πετάσης, Γεώργιος
dc.contributor.committee: Αναστασία, Κριθαρά
dc.contributor.committee: Βασιλάκης, Κωνσταντίνος
dc.contributor.department: Department of Informatics and Telecommunications
dc.contributor.faculty: School of Economics and Technology
dc.contributor.master: Data Science
dc.date.accessioned: 2022-10-13T09:50:52Z
dc.date.available: 2022-10-13T09:50:52Z
dc.date.issued: 2021-03
dc.description: Master's thesis (Μ.Δ.Ε.) 86
dc.description.abstract: Job advert aggregators gather millions of adverts every day by scraping job boards and various other sources across the globe. They are visited daily by millions of active job seekers who want to find a job that matches their skills and field of study. With such a high volume of visitors, proper categorization of job adverts becomes a must-have feature for any aggregator that wants to give its users a smooth search experience. However, the proper classification of such data is a hard task, both because of the huge volume of data and because of the nature of the job adverts themselves: each job description can match multiple categories, and similar positions can be described in widely varying language. In this work, various machine learning, deep learning, data processing and data augmentation methods are used to classify job adverts into one of the twenty-nine categories of the Adzuna company. To this end, a real-world private dataset of about 234,000 job adverts from the United Kingdom, containing titles, descriptions and hand-crafted categories, was provided by the Adzuna company. Our main results show that Deep Learning models outperform conventional Machine Learning approaches such as Support Vector Classifiers, Multinomial Naïve Bayes and Decision Trees. In addition, training custom word2vec embeddings helps achieve higher accuracy metrics than using pretrained embeddings such as GloVe 100. However, model selection (choosing a Deep Learning model over a conventional Machine Learning model) has a greater impact on the metrics than the use of embeddings and word sequences.
The model that achieved the highest weighted average F1-Score (80%) and the highest testing accuracy (80.5%) was the Feedforward Neural Network trained on Bag of Words (TF-IDF) representations of lowercased and stemmed job descriptions. Specifically, this model achieved a weighted average Precision of 80% and a weighted average Recall of 81%.
dc.format.extent: 108
dc.identifier.uri: https://amitos.library.uop.gr/xmlui/handle/123456789/6836
dc.identifier.uri: http://dx.doi.org/10.26263/amitos-341
dc.language.iso: en
dc.publisher: University of the Peloponnese
dc.rights: Attribution-NonCommercial-NoDerivs 3.0 Greece
dc.rights.uri: http://creativecommons.org/licenses/by-nc-nd/3.0/gr/
dc.subject.keyword: Text Classification
dc.subject.keyword: Deep Learning
dc.subject.keyword: Machine Learning
dc.subject.keyword: Natural Language Processing
dc.title: Document Data Analysis via Machine/Deep Learning techniques
dc.type: Master's thesis
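
The abstract reports weighted average Precision, Recall and F1-Score. As a minimal sketch of how such support-weighted averages are usually computed for a single-label, multi-class task, the following pure-Python example may help. The function name `weighted_scores` and the toy labels are assumptions made for illustration; they are not taken from the thesis or from the Adzuna dataset.

```python
from collections import Counter


def weighted_scores(y_true, y_pred):
    """Return support-weighted (precision, recall, F1) over the classes in y_true.

    Hypothetical helper for illustration; not the thesis code.
    """
    support = Counter(y_true)          # number of true examples per class
    total = len(y_true)
    precision = recall = f1 = 0.0
    for label, n in support.items():
        # True positives: examples of this class that were predicted as this class.
        tp = sum(1 for t, p in zip(y_true, y_pred) if t == label and p == label)
        predicted = sum(1 for p in y_pred if p == label)
        p_c = tp / predicted if predicted else 0.0   # per-class precision
        r_c = tp / n                                 # per-class recall
        f_c = 2 * p_c * r_c / (p_c + r_c) if (p_c + r_c) else 0.0
        weight = n / total                           # class share of the test set
        precision += weight * p_c
        recall += weight * r_c
        f1 += weight * f_c
    return precision, recall, f1


# Toy labels, invented for illustration only (not from the thesis dataset).
y_true = ["IT", "IT", "IT", "Sales", "Sales", "Teaching"]
y_pred = ["IT", "IT", "Sales", "Sales", "Sales", "IT"]

precision, recall, f1 = weighted_scores(y_true, y_pred)
accuracy = sum(t == p for t, p in zip(y_true, y_pred)) / len(y_true)
print(f"precision={precision:.3f} recall={recall:.3f} f1={f1:.3f} accuracy={accuracy:.3f}")
```

Under this definition, the weighted average Recall equals the overall accuracy when every example has exactly one label, which is in line with the reported 81% Recall and 80.5% testing accuracy once rounding is taken into account.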

Files

Original bundle

Name: spyratos_17021.pdf
Size: 5.11 MB
Format: Adobe Portable Document Format

License bundle

Name: license.txt
Size: 933 B
Format: Item-specific license agreed upon to submission