GROUND TRUTH ESTIMATION OF SPOKEN ENGLISH FLUENCY SCORE
In this paper, we propose ground truth estimation of spoken English fluency scores using decorrelation penalized low-rank matrix factorization. Automatic spoken English fluency scoring is a general classification problem. The model parameters are trained to map input fluency features to corresponding ground truth scores, and then used to predict a score for an input utterance. Therefore, in order to estimate the model parameters to predict scores reliably, correct ground truth scores must be provided as target outputs. However, it is not simple to determine correct ground truth scores from human raters’ scores, as these include subjective biases. Therefore, ground truth scores are usually estimated from human raters’ scores, and two of the most common methods are averaging and voting. Although these methods are used successfully, questions remain about whether they effectively estimate ground truth scores by considering human raters’ subjective biases and the performance metric. To address these issues, we propose an approach based on low-rank matrix factorization penalized by decorrelation. The proposed method decomposes human raters’ scores into biases and latent scores, maximizing Pearson’s correlation. The effectiveness of the proposed approach was evaluated using human ratings of the Korean-Spoken English Corpus.
Recently, Computer Aided Language Learning (CALL) has received considerable attention as a method for improving the English speaking skills of non-native students. In order for CALL systems to provide useful tutoring feedback, an automated scoring system is required for evaluating pronunciation quality, fluency, and specific mistakes made by non-native students.
In general, the fluency scoring system is composed of automatic speech recognition, fluency feature extraction, and a scoring model. In fluency feature extraction, features assumed to be highly correlated to spoken English fluency are computed [1, 2, 3]. For example, long silence duration, number of words per second, and phone duration are some of the most common fluency features [1, 4]. The scoring model is a classifier in which the model parameters are trained to map input fluency features to corresponding ground truth scores, and then used to predict a score for an input utterance. The most common algorithms for scoring models are linear regression, support vector machines (SVM), and Gaussian processes.
Score modelling is a general supervised learning problem. Therefore, in order for the model to be trained reliably, correct ground truth scores must be provided as target outputs. However, it is not simple to obtain correct ground truth scores from human raters’ scores, as these include variability due to human raters’ subjective biases. For example, different human experts might assign different scores to the same utterance. Consequently, ground truth scores are estimated by neutralizing human raters’ subjective biases. The most common method is averaging, which estimates ground truth scores by averaging the biased scores [6, 7, 8, 9]; another is voting, which is based on majority opinions.
Although averaging and voting are successfully used in practice, questions remain about whether they produce reliable ground truth scores by considering human raters’ biases and a scoring model metric such as Pearson’s correlation. Therefore, we propose an estimation approach based on decorrelation penalized low-rank matrix factorization to take account of both human raters’ subjective biases and Pearson’s correlation.
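For illustration, the two baseline estimators discussed above, averaging and voting, together with Pearson’s correlation, can be sketched as follows. This is only a minimal sketch under stated assumptions, not the implementation used in this work: the rating matrix, the 1 to 5 score scale, the tie-breaking rule for voting, and the function names average_scores, vote_scores, and pearson are all hypothetical.

```python
import numpy as np

def average_scores(ratings):
    """Averaging baseline: mean of the raters' scores for each utterance."""
    return ratings.mean(axis=1)

def vote_scores(ratings):
    """Voting baseline: most frequent score for each utterance.

    Assumption: ties are broken in favour of the lowest score, because
    np.unique returns sorted values and np.argmax picks the first maximum.
    """
    voted = []
    for row in ratings:
        values, counts = np.unique(row, return_counts=True)
        voted.append(values[np.argmax(counts)])
    return np.array(voted)

def pearson(x, y):
    """Pearson's correlation coefficient between two score vectors."""
    return np.corrcoef(x, y)[0, 1]

# Hypothetical ratings: 5 utterances (rows) scored by 3 raters (columns).
ratings = np.array([
    [3, 4, 3],
    [2, 2, 3],
    [5, 4, 5],
    [1, 2, 2],
    [4, 4, 3],
])

avg = average_scores(ratings)
vote = vote_scores(ratings)

# Correlation of each rater's scores with the averaged estimate.
rater_corr = [pearson(ratings[:, j], avg) for j in range(ratings.shape[1])]
print("averaging:", avg)
print("voting:   ", vote)
print("rater vs. average correlation:", rater_corr)
```

Neither baseline models a per-rater bias explicitly, and neither is tied to the correlation metric used to evaluate the scoring model; these are the two points the proposed approach is intended to address.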
In spoken English fluency score modelling, the ground truth estimation problem is sometimes overlooked because scoring rubrics are designed and human raters are trained to maintain high correlation among their scores. Nevertheless, disagreement exists among raters’ scores, and a single score must be determined for each input feature to train a computational scoring model such as a DNN.