Title:NADOMEŠČANJE MANJKAJOČIH VREDNOSTI S POMOČJO ROTACIJSKEGA REGRESIJSKEGA GOZDA
Authors:Palfy, Miroslav (Author)
Kokol, Peter (Mentor)
Zorman, Milan (Co-mentor)
Files:DR_Palfy_Miroslav_1978.pdf (5.63 MB)
 
Language:Slovenian
Work type:Dissertation (m)
Organization:FERI - Faculty of Electrical Engineering and Computer Science
Abstract:Missing values are a common problem in building databases, whether the data are collected through surveys or obtained from designed experiments. No matter how much effort goes into ensuring that questionnaires are filled in completely or into carefully designing a scientific experiment, missing values often cannot be avoided. Depending on the proportion of missing values, incomplete data may be unsuitable for further analysis, while deleting the cases with missing values can be highly inappropriate, especially when their share is not small enough and those cases carry important information. Statistical analysis therefore uses a variety of methods for imputing missing values. To fill the gap between existing single-imputation methods and multiple-imputation models, which require a separate statistical analysis for each imputation cycle, we developed in this dissertation a new imputation procedure based on an ensemble approach to supervised machine learning. We used an ensemble called the rotation regression forest, a variant of the rotation forest (Rotation forest) developed by Rodríguez, Kuncheva and Alonso (Rodríguez, Kuncheva, & Alonso, 2006), in which we replaced the base method intended for classification problems with a model regression tree. We compared our imputation method with 9 other popular methods, measuring the accuracy of the methods and their ability to preserve variance after different proportions of missing values had been introduced. The measurements were carried out on 14 publicly available datasets and one artificially generated dataset, so that all the missingness mechanisms defined by Rubin (Rubin, 1976) were covered.
Based on the experiments we found that, on average, our method predicts the missing values in the selected datasets more accurately, regardless of the missingness mechanism. We also showed that, with an additional stochastic method for preserving variance, our rotation regression forest preserves variance better than all the other single-imputation methods while still surpassing all methods in accuracy. The introductory, theoretical chapters of the dissertation describe in more detail the problem of missing values and the existing methods most commonly used to impute them. We presented the rotation regression forest and the stochastic method for preserving variance. We devoted the most attention to the experimental results, from which we derived, in the conclusion, recommendations for using the rotation regression forest to impute missing values and outlined starting points for further work.
Keywords:machine learning, rotation forest, missing value imputation, regression tree, ensemble of regressors
Year of publishing:2009
Publisher:[M. Palfy]
Source:Maribor
UDC:004.89:004.9(043.3)
COBISS_ID:13737238
Views:2087
Downloads:220
Categories:KTFMB - FERI

Secondary language

Language:English
Title:Missing values imputation using a rotation regression forest
Abstract:Missing values are a common problem that plagues many databases, whether they are built from surveys and questionnaires or from designed experiments. No matter how carefully the surveys are conducted or how well the experiments are designed, missing values can occur. Depending on the amount of missing values, incomplete data can be unsuitable for further statistical analysis, while case deletion can be very inappropriate, especially when the amount of missing values is considerable. Different methods have therefore been developed to impute missing data. The main goal of this dissertation was to develop a new imputation method that would narrow the gap between single-imputation methods and multiple-imputation models, which require standard statistical analysis to be carried out on multiple imputed data sets. For this purpose we used an ensemble-based approach to supervised machine learning. We relied on a variation of the rotation forest ensemble developed by Rodríguez, Kuncheva and Alonso (Rodríguez, Kuncheva, & Alonso, 2006), which we named “rotation regression forest” because it uses a model regression tree as the base method instead of a method intended for classification. We selected 9 other popular imputation methods for comparison with our ensemble, measuring their accuracy as well as their ability to preserve the variance structure within data at different amounts of missing values. Measurements were carried out on 14 different publicly accessible datasets and one artificial dataset to account for each of the three missingness mechanisms described by Rubin (Rubin, 1976). Based on the results of these tests we concluded that, on average, our method is the most accurate among the selected methods, no matter which missingness mechanism is responsible for the missing values.
When an additional stochastic method for preservation of variance was used, our rotation regression forest preserved the variance structure within data better than any other single-imputation method while still outperforming all of them in accuracy. The introductory, more theoretical chapters of this dissertation deal with supervised machine learning, missing values and commonly used imputation methods. The rotation regression forest ensemble is then introduced, along with our stochastic method for preservation of variance. The bulk of the work focuses on the results of empirical experiments, which were used to formulate our recommendations concerning the use of the rotation regression forest ensemble for imputation of missing values and to identify starting points for possible future work.
Keywords:machine learning, rotation forest, missing value imputation, regression tree, ensemble of regressors
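
The evaluation setup described in the abstract (missing values introduced under Rubin's three mechanisms, then accuracy and variance preservation measured) can be illustrated with a minimal Python sketch. Everything in it is an assumption made for illustration and is not taken from the dissertation: the synthetic two-column data, the function names, the median-based MAR and MNAR rules, and the use of plain mean imputation as a stand-in imputer. The `stochastic_impute` function is a generic "add random noise to the fill value" technique, not the author's stochastic method, and no rotation regression forest is implemented here.

```python
import random
import statistics


def make_mask(xs, ys, mechanism, rate, rng):
    """Return booleans marking which y values go missing (rate <= 0.5).

    MCAR: every y has the same chance of being missing.
    MAR:  the chance depends only on the always-observed x.
    MNAR: the chance depends on the value of y itself.
    """
    if mechanism == "MCAR":
        return [rng.random() < rate for _ in ys]
    if mechanism == "MAR":
        cut = statistics.median(xs)
        return [x > cut and rng.random() < 2 * rate for x in xs]
    if mechanism == "MNAR":
        cut = statistics.median(ys)
        return [y > cut and rng.random() < 2 * rate for y in ys]
    raise ValueError(f"unknown mechanism: {mechanism}")


def mean_impute(ys, mask):
    """Stand-in single imputation: fill every gap with the observed mean."""
    observed = [y for y, m in zip(ys, mask) if not m]
    fill = statistics.fmean(observed)
    return [fill if m else y for y, m in zip(ys, mask)]


def stochastic_impute(ys, mask, rng):
    """Generic variant: observed mean plus Gaussian noise with observed spread."""
    observed = [y for y, m in zip(ys, mask) if not m]
    fill = statistics.fmean(observed)
    spread = statistics.pstdev(observed)
    return [fill + rng.gauss(0, spread) if m else y for y, m in zip(ys, mask)]


def rmse(truth, imputed, mask):
    """Root mean squared error over the imputed positions only."""
    errs = [(t - i) ** 2 for t, i, m in zip(truth, imputed, mask) if m]
    return (sum(errs) / len(errs)) ** 0.5


def variance_ratio(truth, imputed):
    """Variance after imputation relative to the complete data (1.0 = preserved)."""
    return statistics.pvariance(imputed) / statistics.pvariance(truth)


if __name__ == "__main__":
    rng = random.Random(0)
    xs = [rng.gauss(0, 1) for _ in range(2000)]
    ys = [2 * x + rng.gauss(0, 1) for x in xs]  # y is correlated with x
    for mech in ("MCAR", "MAR", "MNAR"):
        mask = make_mask(xs, ys, mech, 0.2, rng)
        for name, imputed in (
            ("mean", mean_impute(ys, mask)),
            ("stochastic", stochastic_impute(ys, mask, rng)),
        ):
            print(mech, name,
                  round(rmse(ys, imputed, mask), 3),
                  round(variance_ratio(ys, imputed), 3))
```

In this toy setting, mean imputation always shrinks the variance relative to the observed values, because every gap receives the same constant. Adding noise restores some of the spread at the cost of a larger error on individual values. That trade-off between accuracy and variance preservation is the one the abstract refers to, although the dissertation's own methods and results are not reproduced here.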