Title:Deduplication of metadata
Authors:Chuchurski, Martin (Author)
Ojsteršek, Milan (Mentor)
Files:UN_Chuchurski_Martin_2019.pdf (848,73 KB)
Language:English
Work type:Bachelor thesis/paper (mb11)
Typology:2.11 - Undergraduate Thesis
Organization:FERI - Faculty of Electrical Engineering and Computer Science
Abstract:Duplicates are redundant data that increase both the storage space required and the cost of serving it. They also strongly affect the quality of search results returned from the database. Detecting and eliminating redundant data is therefore crucial to restoring and maintaining the quality of the stored data and of the database itself. Various methods have been used to detect duplicates. The most widely used are pattern matching algorithms, more precisely phonetic string matching algorithms. Of the wide variety of algorithms available, we chose those that best suited our needs: the Jaccard, Jaro, Jaro-Winkler and Levenshtein distance algorithms were used in the development of our deduplication application. They were combined into a new hybrid approach for detecting duplicates in a metadata database. On a real database, the application showed promising results while maintaining relatively fast speeds and fairly low memory consumption.
Keywords:deduplication, metadata, text similarity metrics, duplicate
Year of publishing:2019
Source:Maribor
NUK URN:URN:SI:UM:DK:OMJCUOBS
License:CC BY-NC-ND 4.0
This work is available under this license: Creative Commons Attribution Non-Commercial No Derivatives 4.0 International
Categories:KTFMB - FERI
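
The abstract names four text similarity measures (Jaccard, Jaro, Jaro-Winkler, Levenshtein) and says they were combined into a hybrid approach, but it does not say how. The sketch below implements the four measures from their standard definitions and then combines them in one possible way. Everything about the combination is an assumption made for illustration and is not taken from the thesis: the names hybrid_score and is_duplicate, the unweighted mean, the 0.85 threshold, and the choice to compute Jaccard over lowercase word tokens.

```python
def levenshtein(a: str, b: str) -> int:
    """Edit distance: fewest insertions, deletions and substitutions."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(prev[j] + 1, curr[j - 1] + 1, prev[j - 1] + cost))
        prev = curr
    return prev[-1]


def levenshtein_similarity(a: str, b: str) -> float:
    """Edit distance rescaled to [0, 1], where 1.0 means identical."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - levenshtein(a, b) / longest


def jaccard(a: str, b: str) -> float:
    """Jaccard similarity over lowercase word tokens (tokenization assumed)."""
    sa, sb = set(a.lower().split()), set(b.lower().split())
    if not sa and not sb:
        return 1.0
    return len(sa & sb) / len(sa | sb)


def jaro(a: str, b: str) -> float:
    """Jaro similarity in [0, 1]."""
    if a == b:
        return 1.0
    la, lb = len(a), len(b)
    if la == 0 or lb == 0:
        return 0.0
    window = max(max(la, lb) // 2 - 1, 0)
    a_matched, b_matched = [False] * la, [False] * lb
    matches = 0
    for i in range(la):
        for j in range(max(0, i - window), min(lb, i + window + 1)):
            if not b_matched[j] and a[i] == b[j]:
                a_matched[i] = b_matched[j] = True
                matches += 1
                break
    if matches == 0:
        return 0.0
    # Walk the matched characters in order; out-of-order pairs are
    # half-transpositions.
    k = half_transpositions = 0
    for i in range(la):
        if a_matched[i]:
            while not b_matched[k]:
                k += 1
            if a[i] != b[k]:
                half_transpositions += 1
            k += 1
    t = half_transpositions / 2
    return (matches / la + matches / lb + (matches - t) / matches) / 3


def jaro_winkler(a: str, b: str, p: float = 0.1) -> float:
    """Jaro similarity boosted for a shared prefix of up to 4 characters."""
    j = jaro(a, b)
    prefix = 0
    for ca, cb in zip(a[:4], b[:4]):
        if ca != cb:
            break
        prefix += 1
    return j + prefix * p * (1 - j)


def hybrid_score(a: str, b: str) -> float:
    """Hypothetical combination: unweighted mean of three similarities."""
    parts = (jaccard(a, b), jaro_winkler(a, b), levenshtein_similarity(a, b))
    return sum(parts) / len(parts)


def is_duplicate(a: str, b: str, threshold: float = 0.85) -> bool:
    """Flag a pair as a likely duplicate; the threshold is an assumed value."""
    return hybrid_score(a, b) >= threshold
```

For example, levenshtein("kitten", "sitting") returns 3, and is_duplicate can be applied to pairs of title strings. A real deduplication tool would likely weight the measures, tune the threshold on labelled data, and avoid comparing every pair of records.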

Secondary language

Language:Slovenian
Title:Deduplikacija metapodatkov
Abstract:Duplicates are redundant data that increase the space needed for storage as well as the cost of service. They also have a large impact on the quality of search results for database queries, so detecting and eliminating redundant data is crucial for restoring and maintaining the quality of the stored data and of the database itself. Various methods are used to detect duplicates. The most commonly used are pattern matching algorithms, more precisely string matching algorithms. There are many different algorithms for detecting duplicates to choose from. We used text similarity metrics. The Jaccard, Jaro, Jaro-Winkler and Levenshtein distances were used in our practical solution. We created a new hybrid approach for detecting duplicates in a metadata database, which detects most duplicates and uses relatively little processor time and memory.
Keywords:deduplication, metadata, text similarity metrics, duplicate

