Tarun Pruthi's Home Page
Latest Research
- Analysis, Vocal-tract Modeling and Automatic Detection of Vowel Nasalization (PhD Thesis Work).
The aim of this work is to clearly understand the salient features of nasalization and the sources of acoustic variability in nasalized vowels, and to suggest Acoustic Parameters (APs) for the automatic detection of vowel nasalization based on this knowledge. Possible applications in automatic speech recognition, speech enhancement, speaker recognition and clinical assessment of nasal speech quality have made the detection of vowel nasalization an important problem to study. Although several researchers in the past have found a number of acoustical and perceptual correlates of nasality, automatically extractable APs that work well in a speaker-independent manner are yet to be found. In this study, vocal tract area functions for one American English speaker, recorded using Magnetic Resonance Imaging, were used to simulate and analyze the acoustics of vowel nasalization, and to understand the variability due to velar coupling area, asymmetry of nasal passages, and the paranasal sinuses. Based on this understanding and an extensive survey of past literature, several automatically extractable APs were proposed to distinguish between oral and nasalized vowels. Nine APs with the best discrimination capability were selected from this set through Analysis of Variance. The performance of these APs was tested on several databases with different sampling rates, recording conditions and languages. Accuracies of 96.28%, 77.90% and 69.58% were obtained by using these APs on the StoryDB, TIMIT and WS96/97 databases, respectively, in a Support Vector Machine classifier framework. To my knowledge, these are the best results reported on this task. These APs were also tested in a cross-language task to distinguish between oral and nasalized vowels in Hindi. An overall accuracy of 63.72% was obtained on this task. Further, the accuracy for phonemically nasalized vowels, 73.40%, was found to be much higher than the accuracy of 53.48% for coarticulatorily nasalized vowels.
This result suggests not only that the same APs can be used to capture both phonemic and coarticulatory nasalization, but also that the duration of nasalization is much longer when vowels are phonemically nasalized. This language and category independence is very encouraging since it shows that these APs are really capturing relevant information.
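The selection of APs through Analysis of Variance mentioned above can be illustrated with a minimal sketch. It assumes each candidate AP has been measured on a set of oral and a set of nasalized vowel tokens, computes a one-way ANOVA F statistic per AP, and ranks the APs by that score. The function names, parameter names and toy values are hypothetical, and this is not the selection procedure actually used in the thesis, only one plausible reading of it.

```python
from statistics import mean


def anova_f(oral, nasal):
    """One-way ANOVA F statistic for one parameter measured in two groups."""
    groups = [oral, nasal]
    n_total = sum(len(g) for g in groups)
    grand_mean = mean([x for g in groups for x in g])
    # Between-group sum of squares (k - 1 degrees of freedom).
    ss_between = sum(len(g) * (mean(g) - grand_mean) ** 2 for g in groups)
    # Within-group sum of squares (N - k degrees of freedom).
    ss_within = sum((x - mean(g)) ** 2 for g in groups for x in g)
    df_between = len(groups) - 1
    df_within = n_total - len(groups)
    if ss_within == 0:
        # Degenerate case: no spread inside either group.
        return 0.0 if ss_between == 0 else float("inf")
    return (ss_between / df_between) / (ss_within / df_within)


def rank_parameters(oral_rows, nasal_rows, names):
    """Return parameter names sorted from most to least discriminative."""
    scores = {}
    for j, name in enumerate(names):
        oral_values = [row[j] for row in oral_rows]
        nasal_values = [row[j] for row in nasal_rows]
        scores[name] = anova_f(oral_values, nasal_values)
    return sorted(names, key=lambda n: scores[n], reverse=True)


# Toy data (hypothetical): one row per vowel token, one column per AP.
oral_rows = [[1, 5], [2, 6], [3, 4]]
nasal_rows = [[4, 5], [5, 4], [6, 6]]
ranking = rank_parameters(oral_rows, nasal_rows, ["ap_a", "ap_b"])
```

In this toy example the first parameter separates the two groups and the second does not, so keeping the top entries of the ranking corresponds to keeping the APs with the best discrimination capability.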
- Acoustic Parameters for Automatic Detection of Nasal Manner (MS Work).
Of all the sounds in any language, nasals are the only class of sounds with dominant speech output from the nasal cavity as opposed to the oral cavity. This gives the nasals some special properties including presence of zeros in the spectrum, concentration of energy at lower frequencies, higher formant density, higher losses and stability. This work included proposing acoustic correlates for the linguistic feature nasal. In particular, this project focused on the development of APs which can be extracted automatically and reliably in a speaker-independent way. These APs were tested in a classification experiment between nasals and semivowels, the two classes of sounds which together form the class of sonorant consonants. Using the proposed APs with a Support Vector Machine-based classifier, we were able to obtain classification accuracies of 89.53%, 95.80% and 87.82% for prevocalic, postvocalic and intervocalic sonorant consonants, respectively, on the TIMIT database. As additional evidence of the strength of these parameters, we compared the performance of a Hidden Markov Model (HMM) based system that included the APs for nasals as part of the front-end with an HMM system that did not. In this digit recognition experiment, we were able to obtain a 60% reduction in error rate on the TI46 database. Recently, another parameter based on the "scale" dimension of a model of the auditory cortex was added. This led to an 18.63% reduction in error rate.
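For readers unfamiliar with how such reductions in error rate are usually computed, here is a minimal sketch. It assumes the figures above are relative reductions with respect to a baseline system, which is the common convention but is not stated explicitly here. The function name and the example error rates are hypothetical and are not results from this work.

```python
def relative_error_reduction(baseline_error, new_error):
    """Percentage by which new_error is lower than baseline_error."""
    if baseline_error <= 0:
        raise ValueError("baseline_error must be positive")
    return 100.0 * (baseline_error - new_error) / baseline_error


# Hypothetical numbers: a baseline error rate of 5.0% that drops to 2.0%
# would correspond to a 60% relative reduction.
example = relative_error_reduction(5.0, 2.0)
```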
- Discrimination between speech and other environmental sounds (Minor research project).
This project focused on discriminating between speech and other environmental sounds as a first stage toward the final aim of recognizing speech in the presence of not just stationary noise, but also the non-stationary noise sources often present in our everyday environment (such as telephone rings, sirens, wind and vacuum cleaners). The environmental sounds were broadly classified into resonant sounds and noisy sounds. A resonance detector was proposed in this study to distinguish speech from highly resonant sounds. Other proposed parameters included onsets/offsets, monotonicity, and multiple pitch. The project also involved a survey of the various techniques used for recognizing speech in noisy surroundings. These included HMM Parallel Model Combination (PMC), Independent Component Analysis (ICA) and Computational Auditory Scene Analysis (CASA).
- An Experimental Evaluation of Linear and Kernel-Based Methods for Face Recognition (Course research project).
This project was a comparative study of linear and kernel-based methods for face recognition. The methods used for dimensionality reduction included PCA, Kernel PCA, LDA and Kernel Discriminant Analysis. The methods used for classification were Nearest Neighbor (NN) and Support Vector Machine (SVM).
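As an illustration of the simplest pipeline in this comparison, the sketch below combines linear PCA with a Nearest Neighbor classifier. It is a toy version under stated assumptions: only the leading principal component is kept, it is estimated by power iteration on the covariance matrix, and the data are made-up two-dimensional points rather than face images. All function names are hypothetical, and the kernel-based methods, LDA and SVM are not shown.

```python
import math


def mean_vector(rows):
    """Column-wise mean of a list of equal-length rows."""
    n = len(rows)
    return [sum(col) / n for col in zip(*rows)]


def top_component(rows, iterations=200):
    """Leading principal direction via power iteration on the covariance."""
    mu = mean_vector(rows)
    centered = [[x - m for x, m in zip(row, mu)] for row in rows]
    d = len(mu)
    cov = [[sum(r[i] * r[j] for r in centered) / (len(rows) - 1)
            for j in range(d)] for i in range(d)]
    # Start from the all-ones vector; this assumes it is not orthogonal
    # to the leading eigenvector, which holds for the toy data below.
    v = [1.0] * d
    for _ in range(iterations):
        w = [sum(cov[i][j] * v[j] for j in range(d)) for i in range(d)]
        norm = math.sqrt(sum(x * x for x in w))
        if norm == 0:
            break
        v = [x / norm for x in w]
    return mu, v


def project(row, mu, v):
    """Coordinate of a row along direction v after mean subtraction."""
    return sum((x - m) * c for x, m, c in zip(row, mu, v))


def nearest_neighbor(train_rows, train_labels, query, mu, v):
    """Label of the training row closest to the query in the PCA subspace."""
    q = project(query, mu, v)
    best = min(range(len(train_rows)),
               key=lambda i: abs(project(train_rows[i], mu, v) - q))
    return train_labels[best]


# Toy training set (hypothetical): two well-separated classes.
train_rows = [[0, 0], [1, 1], [10, 10], [11, 11]]
train_labels = ["a", "a", "b", "b"]
mu, v = top_component(train_rows)
label_low = nearest_neighbor(train_rows, train_labels, [2, 1], mu, v)
label_high = nearest_neighbor(train_rows, train_labels, [9, 10], mu, v)
```

A kernel-based variant would replace the covariance matrix with a kernel matrix computed between training samples, which is what lets it capture nonlinear structure that linear PCA cannot.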