|Description:||Most scientific and professional articles are catalogued and have a bibliographic record in the COBIB library catalogue, including one or more notations from the Universal Decimal Classification (UDC). By contrast, most articles available through the web portal of the Digital Library of Slovenia, which come mainly from the field of culture (older magazine and newspaper articles), usually have no such record. On the Digital Library of Slovenia website, documents can be searched only by full-text search. This is currently the best available tool for searching older texts, but it does not offer a satisfactory user experience, both because of various deficiencies (poor quality of text recognition in old newspapers and magazines, use of older forms of the Slovene language, etc.) and because it returns too many search results.
In the dissertation, we address the basic problem of supporting bibliographic processing, which is still carried out by human experts. We start from the thesis that machine learning methods make it possible to classify texts automatically under the appropriate UDC notation, and thereby to support librarians during the bibliographic processing of documents. For this purpose, following a planning and development approach, we developed a classification model and used it to classify old texts. Until now, these texts were mostly classified only indirectly, through the classification of the entire journal, for example "Newspapers. Printing. Journalism".
We developed a classification model using machine learning methods that can automatically classify any text according to the Universal Decimal Classification. We used both unsupervised and supervised machine learning. First, we applied unsupervised methods to a small set of 900 articles to test how closely the clusters built by the algorithm matched the UDC notations assigned by librarians. Next, we developed classification models over the entire corpus of scientific journals available via the Digital Library of Slovenia (more than 70,000 scientific texts), with an 80/20 split between the training and test sets. Once we had confirmed the performance of the classification models on scientific texts, we used them to classify more than 200,000 older texts. The classifiers used were Naive Bayes, Support Vector Machine, Multilayer Perceptron, Logistic Regression, and the k-nearest neighbors algorithm. The relevance of the classification of old texts was checked by human experts (librarians). We confirmed the assumption that in at least 80% of cases we can offer automatically determined UDC notations for older material that has not been bibliographically processed. It should be emphasized that this work involves human decision-making, testing with human experts, evaluation, and judgment, all of which can vary from one decision maker to another.
In addition to enriching older texts from the eighteenth, nineteenth, and first half of the twentieth centuries with UDC notations, the research has practical value in everyday use. To support the automatic classification of publications in the daily work of librarians, we see value in implementing the research in an information system that can offer a librarian computational suggestions in real time for determining the appropriate classifiers for the publication being processed. The librarian can thus obtain a "second opinion" from the machine when assigning UDC notations to a publication. At the same time, the methodology can be used in other fields, databases, and classification systems, not only for assigning UDC notations.|
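
The supervised step described above (train on labelled texts with an 80/20 split, then predict a UDC notation for an unlabelled text) can be illustrated with a minimal sketch. This is not the dissertation's implementation: it is a plain-Python multinomial Naive Bayes classifier, one of the five classifier families named in the abstract, written under stated assumptions. The toy corpus, the whitespace tokenizer, and the function names `tokenize`, `train_naive_bayes`, `predict`, and `split_80_20` are all hypothetical. The labels are top-level UDC classes (51 Mathematics, 61 Medical sciences, 78 Music), chosen only for illustration.

```python
import math
import random
from collections import Counter, defaultdict


def tokenize(text):
    """Lowercase and split on whitespace (assumption: real OCR text needs more cleanup)."""
    return text.lower().split()


def train_naive_bayes(docs, labels):
    """Fit a multinomial Naive Bayes model by counting words per class."""
    class_counts = Counter(labels)
    word_counts = defaultdict(Counter)
    vocab = set()
    for text, label in zip(docs, labels):
        tokens = tokenize(text)
        word_counts[label].update(tokens)
        vocab.update(tokens)
    return {"class_counts": class_counts, "word_counts": word_counts,
            "vocab": vocab, "n_docs": len(docs)}


def predict(model, text):
    """Return the label with the highest log-posterior (add-one smoothing)."""
    vocab_size = len(model["vocab"])
    best_label, best_score = None, -math.inf
    for label, n_label in model["class_counts"].items():
        counts = model["word_counts"][label]
        total = sum(counts.values())
        score = math.log(n_label / model["n_docs"])  # log prior
        for token in tokenize(text):
            if token in model["vocab"]:  # words never seen in training are skipped
                score += math.log((counts[token] + 1) / (total + vocab_size))
        if score > best_score:
            best_label, best_score = label, score
    return best_label


def split_80_20(docs, labels, seed=0):
    """Shuffle reproducibly and split into 80% training and 20% test data."""
    idx = list(range(len(docs)))
    random.Random(seed).shuffle(idx)
    cut = int(0.8 * len(idx))
    train, test = idx[:cut], idx[cut:]
    return ([docs[i] for i in train], [labels[i] for i in train],
            [docs[i] for i in test], [labels[i] for i in test])


# Hypothetical toy corpus: (text, top-level UDC class).
CORPUS = [
    ("theorem proof algebra equation", "51"),
    ("geometry theorem triangle proof", "51"),
    ("equation integral algebra function", "51"),
    ("orchestra symphony concert violin", "78"),
    ("opera concert singer choir", "78"),
    ("violin sonata symphony composer", "78"),
    ("choir composer opera orchestra", "78"),
    ("patient hospital disease treatment", "61"),
    ("doctor disease vaccine patient", "61"),
    ("surgery hospital treatment doctor", "61"),
]
DOCS = [text for text, _ in CORPUS]
LABELS = [label for _, label in CORPUS]

# Train on the whole toy corpus and suggest a notation for an unseen text.
model = train_naive_bayes(DOCS, LABELS)
print(predict(model, "symphony concert violin"))  # prints 78
```

In a setting like the one described, the predicted notation would be offered to the librarian as a suggestion rather than applied automatically. The held-out 20% returned by `split_80_20` would be used to estimate accuracy before the model is applied to unlabelled older texts.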