Razpoznava govorcev na mobilni platformi : magistrsko delo

Fartek, Jože


Title:	Razpoznava govorcev na mobilni platformi : magistrsko delo
Authors:	Fartek, Jože (Author); Holobar, Aleš (Mentor)
Files:	MAG_Fartek_Joze_2022.pdf (3,95 MB) MD5: 73F1637C5145DED5F26B80B7A97318B8 PID: 20.500.12556/dkum/2a30c972-2729-4ef1-86d3-0a69b822b4df
Language:	Slovenian
Work type:	Master's thesis/paper
Typology:	2.09 - Master's Thesis
Organization:	FERI - Faculty of Electrical Engineering and Computer Science
Abstract:	This master's thesis presents the basics of speaker recognition. To this end, we first describe the computation of vocal features. We present in more detail the method for computing mel-frequency cepstral coefficients (MFCC) and its advantages over other approaches. We also describe the training of speaker models and two newer methods based on supervectors. On this basis, we then developed an Android mobile application that recognizes speakers in real time. We limited recognition to only a few people. From the audio recordings of individual speakers we computed MFCC and used them to train a speaker model with a convolutional neural network. To optimize the parameters, we compared how different parameters affect the training of the speaker model. We compared how the length of the audio recordings, in the range 0.5–3 seconds, affects recognition performance. We found that model performance increases with recording length up to 1.5 seconds and then levels off. When comparing the number of MFCC between 16 and 128, model performance increases up to 48 MFCC and then levels off. When comparing dropout rates between 0 and 0.7, model accuracy improves as the dropout rate increases up to 0.5, after which performance begins to decline. Based on these comparisons, we trained the speaker model using audio recordings 1 second long, 32 computed MFCC, and a dropout rate of 0.4, obtaining a model accuracy of 88%. During recognition we found that speech rate affects recognition performance, whereas speech loudness does not. Testing was carried out on an LG G7 ThinQ mobile device. Computing the MFCC on the mobile device took 170 milliseconds on average, while recognition with the TensorFlow Lite model took only 8 milliseconds.
Keywords:	speaker recognition, mel-frequency cepstral coefficients, convolutional neural networks, Android
Place of publishing:	Maribor
Place of performance:	Maribor
Publisher:	[J. Fartek]
Year of publishing:	2021
Number of pages:	1 online resource (1 PDF file (X, 64 leaves))
PID:	20.500.12556/DKUM-81072
UDC:	004.934.8'1(043.2)
COBISS.SI-ID:	98851331
Publication date in DKUM:	31.01.2022
Views:	947
Downloads:	69
Categories:	KTFMB - FERI
Citation:	FARTEK, Jože, 2021, Razpoznava govorcev na mobilni platformi : magistrsko delo [online]. Master’s thesis. Maribor : J. Fartek. [Accessed 25 March 2025]. Retrieved from: https://dk.um.si/IzpisGradiva.php?lang=eng&id=81072
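
The timing figures in the abstract can be turned into a rough real-time budget. The sketch below is only an illustration, assuming clips are processed one after another without overlap (the record does not say how the application windows the audio); the constant and function names are hypothetical.

```python
# Averages reported in the abstract for the LG G7 ThinQ.
MFCC_MS = 170.0        # time to compute MFCC for one clip
INFERENCE_MS = 8.0     # TensorFlow Lite model inference time
CLIP_MS = 1000.0       # clip length chosen in the thesis (1 second)


def processing_ms(mfcc_ms: float, inference_ms: float) -> float:
    """Total processing time per clip, assuming the two steps run sequentially."""
    return mfcc_ms + inference_ms


def real_time_factor(processing: float, clip_ms: float) -> float:
    """Processing time divided by audio duration; below 1.0 keeps up with the input."""
    return processing / clip_ms


total = processing_ms(MFCC_MS, INFERENCE_MS)
rtf = real_time_factor(total, CLIP_MS)
print(f"processing per clip: {total:.0f} ms, real-time factor: {rtf:.3f}")
```

Under these assumptions, processing a one-second clip takes about 178 ms, so the pipeline would use under a fifth of the available time per clip.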


Similar works from our repository:

Ocenjevanje starosti osebe na osnovi digitalnih posnetkov z uporabo konvolucijskih nevronskih mrež
Razpoznavanje človeških emocij na digitalnih posnetkih s pomočjo konvolucijskih nevronskih mrež
Prepoznavanje aktivnosti osebe iz zaporedja slik s pomočjo konvolucijskih nevronskih mrež
Detekcija osebe v globinski sliki s pomočjo konvolucijskih nevronskih mrež
Razpoznavanje drevesnih značilnosti iz fotografije s pomočjo konvolucijskih nevronskih mrež


Licences

License:	CC BY-NC-ND 4.0, Creative Commons Attribution-NonCommercial-NoDerivatives 4.0 International

Link:	http://creativecommons.org/licenses/by-nc-nd/4.0/
Description:	The most restrictive Creative Commons license. This only allows people to download and share the work for no commercial gain and for no other purposes.
Licensing start date:	22.12.2021

Secondary language

Language:	English
Title:	Speaker recognition on mobile devices
Abstract:	In this master's thesis, we review the basics of speaker recognition. We describe how audio feature extraction works, looking in detail at how Mel-frequency Cepstral Coefficient (MFCC) extraction works and what its advantages are compared with other feature extraction methods. This is followed by an overview of speaker models and newer methods based on supervectors. Based on this, we developed a mobile application for the Android operating system that recognizes speakers in real time. We limited recognition to only a few people. MFCC were extracted from the audio recordings of individual speakers and used to train the speaker model with a convolutional neural network. To get better results in real-time recognition, we compared how different parameters affect the training of the speaker model. We compared how the length of the audio recording, between 0.5 and 3 seconds, affects recognition performance. We found that the performance of the speaker model increases with the length of the audio recording up to 1.5 seconds and then levels off. We compared speaker model performance while varying the number of MFCC between 16 and 128: performance increases up to 48 MFCC and then levels off. We also compared the effect of the neural network dropout rate between 0 and 0.7: speaker model performance increases up to a dropout rate of 0.5 and then begins to decline. Based on these comparisons, the implemented mobile application uses audio recordings one second long, 32 MFCC, and a dropout rate of 0.4. We achieved 88% accuracy with the speaker model. We measured how speech tempo and loudness affect recognition accuracy: accuracy decreases when speech is slower or faster, while loudness does not affect it. We performed testing on an LG G7 ThinQ mobile device and measured that computing the MFCC takes 170 milliseconds on average, while recognition with the TensorFlow Lite model takes only 8 milliseconds.
Keywords:	Speaker recognition, Mel-frequency Cepstral Coefficients, Convolutional neural network, Android
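
The MFCC feature extraction named in the abstract can be sketched as follows. This is a minimal NumPy illustration of the standard MFCC pipeline (pre-emphasis, framing, windowing, power spectrum, mel filterbank, log, DCT), not the thesis implementation. Only the 1-second clip length and the 32 coefficients come from the abstract; the 16 kHz sample rate, 25 ms frames, 10 ms hop, 512-point FFT, 40 mel filters, 0.97 pre-emphasis and all function names are assumptions.

```python
import numpy as np


def hz_to_mel(hz):
    return 2595.0 * np.log10(1.0 + hz / 700.0)


def mel_to_hz(mel):
    return 700.0 * (10.0 ** (mel / 2595.0) - 1.0)


def mel_filterbank(n_filters, n_fft, sample_rate):
    """Triangular filters evenly spaced on the mel scale, shape (n_filters, n_fft // 2 + 1)."""
    mel_points = np.linspace(hz_to_mel(0.0), hz_to_mel(sample_rate / 2.0), n_filters + 2)
    hz_points = mel_to_hz(mel_points)
    fft_freqs = np.linspace(0.0, sample_rate / 2.0, n_fft // 2 + 1)
    bank = np.zeros((n_filters, n_fft // 2 + 1))
    for i in range(n_filters):
        left, center, right = hz_points[i], hz_points[i + 1], hz_points[i + 2]
        rising = (fft_freqs - left) / (center - left)
        falling = (right - fft_freqs) / (right - center)
        bank[i] = np.maximum(0.0, np.minimum(rising, falling))
    return bank


def dct_matrix(n_coeffs, n_filters):
    """Orthonormal DCT-II basis, shape (n_coeffs, n_filters)."""
    n = np.arange(n_filters)
    k = np.arange(n_coeffs)[:, None]
    basis = np.cos(np.pi * k * (2 * n + 1) / (2 * n_filters)) * np.sqrt(2.0 / n_filters)
    basis[0] /= np.sqrt(2.0)
    return basis


def mfcc(signal, sample_rate=16000, n_mfcc=32, n_filters=40,
         frame_len=400, hop=160, n_fft=512):
    """Return an array of shape (n_frames, n_mfcc). All defaults except n_mfcc are assumed."""
    signal = np.asarray(signal, dtype=float)
    # Pre-emphasis boosts high frequencies (coefficient 0.97 is an assumption).
    signal = np.append(signal[0], signal[1:] - 0.97 * signal[:-1])
    # Slice the signal into overlapping frames and apply a Hamming window.
    n_frames = 1 + (len(signal) - frame_len) // hop
    idx = np.arange(frame_len)[None, :] + hop * np.arange(n_frames)[:, None]
    frames = signal[idx] * np.hamming(frame_len)
    # Power spectrum -> mel filterbank energies -> log -> DCT.
    power = np.abs(np.fft.rfft(frames, n=n_fft, axis=1)) ** 2 / n_fft
    mel_energy = power @ mel_filterbank(n_filters, n_fft, sample_rate).T
    log_mel = np.log(mel_energy + 1e-10)
    return log_mel @ dct_matrix(n_mfcc, n_filters).T


# Usage: a synthetic 1-second, 440 Hz tone stands in for a speech clip.
t = np.arange(16000) / 16000.0
features = mfcc(np.sin(2 * np.pi * 440.0 * t))
print(features.shape)
```

With these assumed settings a one-second clip yields a 98 x 32 feature matrix, which is the kind of two-dimensional input a convolutional network can consume. Requesting more coefficients than mel filters is not meaningful in this sketch, so `n_filters` would need to grow for the larger MFCC counts compared in the thesis.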
