Title:Implementacija avtomatiziranega pristopa k analizi podatkov DNA sekvenciranja
Authors:Bjelić, Dragana (Author)
Gorenjak, Mario (Mentor)
Potočnik, Uroš (Co-mentor)
Files:.pdf MAG_Bjelic_Dragana_2020.pdf (7,38 MB)
MD5: F7C69F387B0A161D4E4FA63330F2646C
PID: 20.500.12556/dkum/421a6de3-c2d5-442c-8537-93a4674aeada
 
Language:Slovenian
Work type:Master's thesis/paper
Typology:2.09 - Master's Thesis
Organization:FZV - Faculty of Health Sciences
Abstract:Introduction: As DNA sequencing technology develops and the volume of data grows, so does the need for high-quality data analysis and interpretation. The speed and reliability of classifying individuals for a given genotype are equally important. In next-generation sequencing (NGS), this classification is based on variant calling, that is, the inference that the nucleotide at a given position differs from the reference nucleotide sequence. The raw data obtained from NGS analysis are provided in a VCF (variant call format) file, in which the Filter field of the table of potential variants, or candidate genotypes, often carries the PASS label for variants or genotypes for which the neural network classifier assigned a higher probability to a non-reference genotype call than to the reference, i.e. a reliable variant call. In this master's thesis, we aim to show the importance of software tool updates by comparing the number of called variants and PASS variants between the existing and the upgraded approach. Methods: In the empirical part, we implemented an automated approach to DNA sequencing data analysis that upgrades the existing analysis protocol available on the Illumina MiSeq instrument. In our upgraded protocol, instead of the GATK Variant Caller module from version v1.6 of the existing tool on the Illumina MiSeq instrument, we used the Haplotype Caller module from the GATK v3.8 software package. Haplotype Caller is more accurate because it discards the alignment information around a position where a variant is suspected and reassembles the reads in that region. We also upgraded the nucleotide sequence alignment algorithm from version 0.7.9 in the existing protocol to 0.7.12, which enables HLA typing. The protocol was further upgraded with prior trimming of technical nucleotide sequences. 
Finally, we evaluated the number of called variants and PASS variants between the two approaches in the R software environment using the Wilcoxon statistical test. Results: The Wilcoxon test showed a highly statistically significant difference in the number of called variants and PASS variants between the upgraded and the existing approach; on average, the upgraded approach detected 26 times more called variants and 33 times more PASS variants, of these 5 of 12 positive PASS variants relevant for diagnosis, i.e. 41.7 %. Discussion: We found that the upgraded pipeline for DNA nucleotide sequence analysis is more efficient, as it detects more called and PASS variants.
Keywords:NGS, bioinformatics, sequencing, Illumina MiSeq
Place of publishing:Maribor
Publisher:[D. Bjelić]
Year of publishing:2020
PID:20.500.12556/DKUM-77387
UDC:575.112(043.2)
COBISS.SI-ID:28695299
NUK URN:URN:SI:UM:DK:XMKPJVKG
Publication date in DKUM:21.09.2020
Views:1280
Downloads:239
Categories:FZV

Licences

License:CC BY-NC-ND 4.0, Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International
Link:http://creativecommons.org/licenses/by-nc-nd/4.0/
Description:The most restrictive Creative Commons license. This only allows people to download and share the work for no commercial gain and for no other purposes.
Licensing start date:28.08.2020

Secondary language

Language:English
Title:Implementation of an automatized approach to DNA sequencing data analysis
Abstract:Introduction: With the development of DNA sequencing technology and the growth of data, the need for high-quality analysis and interpretation of data is also increasing. The speed and reliability of classifying individuals for a particular genotype are also important. In next-generation sequencing (NGS), this classification is based on variant calling, which is the inference that the nucleotide at a particular site differs from the reference nucleotide sequence. Raw data obtained by NGS analysis are given in a VCF (variant call format) file, where the Filter variable in the table of potential variants, or candidate genotypes, often carries the PASS mark. The PASS mark is used for variants for which the neural network classifier gave a higher probability of a non-reference variant call than of the reference, i.e. a reliable variant call. The aim of this thesis is to show the importance of software tool updates by comparing the number of called variants and PASS variants between the existing and the upgraded approach. Methods: In the empirical part, we implemented an automated approach to DNA sequencing data analysis, which is an upgrade of the existing analysis protocol available on the Illumina MiSeq instrument. In our upgraded protocol, instead of the GATK VariantCaller module from version v1.6 of the existing tool on the Illumina MiSeq instrument, we used the HaplotypeCaller module from the GATK v3.8 software package. HaplotypeCaller is more accurate, as it discards the alignment information around a position where it suspects a variant and performs local reassembly of the reads in that region. We also upgraded the nucleotide sequence alignment algorithm from version 0.7.9 to 0.7.12, which enables HLA typing. The protocol was also upgraded with pre-trimming of the technical nucleotide sequences. 
Finally, the number of called variants and PASS variants obtained with the two approaches was compared in the R software environment using the Wilcoxon statistical test. Results: The Wilcoxon test showed a strong, statistically significant difference in the number of called variants and PASS variants between the upgraded and the existing approach, with the upgraded approach detecting on average 26-fold more called variants and 33-fold more PASS variants. Out of 12 variants relevant for diagnosis, 5 positive PASS variants (41.7 %) were missed by the existing protocol but not by our improved protocol. Conclusion: We conclude that the upgraded pipeline for DNA sequence analysis is more efficient, as it detects more called and PASS variants.
Keywords:NGS, bioinformatics, sequencing, Illumina MiSeq
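
The abstract compares two pipelines by the number of called variants and the number of PASS variants in their VCF output. As a rough illustration of what such a count involves, the following Python sketch tallies both numbers from VCF text. It is not the thesis code (the thesis reports using R for the evaluation); the function name `count_variants` and the three-record example data are hypothetical, and the sketch assumes only the standard VCF layout, in which FILTER is the seventh tab-separated column.

```python
import io


def count_variants(vcf_lines):
    """Return (called, passed) for an iterable of VCF text lines.

    called: number of data records (every non-header line)
    passed: number of records whose FILTER column equals "PASS"
    """
    called = 0
    passed = 0
    for line in vcf_lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines, meta-information and the header row
        fields = line.split("\t")
        if len(fields) < 7:
            continue  # incomplete record: no FILTER column to inspect
        called += 1
        if fields[6] == "PASS":  # FILTER is the 7th column in VCF
            passed += 1
    return called, passed


# Hypothetical example data, not taken from the thesis.
EXAMPLE_VCF = (
    "##fileformat=VCFv4.2\n"
    "#CHROM\tPOS\tID\tREF\tALT\tQUAL\tFILTER\tINFO\n"
    "chr1\t100\t.\tA\tG\t50\tPASS\t.\n"
    "chr1\t200\t.\tC\tT\t12\tLowQual\t.\n"
    "chr2\t300\t.\tG\tA\t99\tPASS\t.\n"
)

called, passed = count_variants(io.StringIO(EXAMPLE_VCF))
print(called, passed)  # prints: 3 2
```

Running the same function on the VCF files produced by the existing and the upgraded protocol for each sample would give the paired counts that a Wilcoxon test can then compare.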

