Automated analysis of non-verbal behaviour of schizophrenia patients
Date of Issue: 2019-04-12
Interdisciplinary Graduate School (IGS)
Institute for Media Innovation, Institute of Mental Health
In this thesis, I present a framework for the automated assessment of non-verbal behaviour in individuals with negative symptoms of schizophrenia, using speech and movement cues. Schizophrenia is a debilitating mental disorder that often begins in adolescence and runs a lifelong course. Its presentation is diverse, spanning positive symptoms (hallucinations and delusions), negative symptoms (avolition, asociality, blunted affect and alogia), and cognitive symptoms (attention and memory dysfunction). Negative symptoms are considered a significant unmet need in clinical research, since they are difficult to diagnose without expert clinician knowledge and have few to no effective drug treatments. The problem is even more acute in low-income countries with high patient-to-clinician ratios, where patients are not properly monitored for lack of trained experts. Individuals with negative symptoms of schizophrenia almost always have speech, emotion and motor impairments, and are muted in their display of non-verbal behaviour. These objective non-verbal cues, or their absence, can therefore be used to assess and monitor the severity of the negative symptoms non-invasively through audio and video recordings. We collaborated with the Institute of Mental Health, Singapore (IMH) to record the audio and Kinect video of 82 individuals (56 patients with negative symptoms, 26 healthy controls) in three separate sessions over a 12-week period. The individuals were interviewed by trained clinicians from IMH during the recording. The psychologists simultaneously rated the behaviour displayed by the patients during the interview on the Negative Symptoms Assessment (NSA) scale, a psychological rating instrument (questionnaire) with 17 items related to the individual's behaviour, especially their speech, emotion and movement. These expert subjective ratings were taken as the ground truth.
The speech audio of the patients was analysed for non-verbal audio cues related to conversation, emotion and the atypical prosody caused by speech impairments. The conversational features captured the natural turns, relative speech amounts, response times and mutual silences in the interview conversation between patient and psychologist. The speech-emotion features were based on the well-known openSMILE toolkit and consisted of a large number of low-level acoustic features over MFCC, LSF and pitch, while the atypical-prosody features comprised articulation, phonation and prosody cues from the more recent NeuroSpeech toolkit. Similarly, the movement features corresponded to the linear and angular speeds of the body-joint movements. The high-level conversational and movement features were first correlated with the NSA items for validation. Significant correlations (as strong as R = -0.59, p < 0.001) were obtained between the high-level features and the NSA items, which also implied a strong interconnection between the speech and movement features. The whole set of non-verbal cues was then employed to predict the subjective NSA ratings through supervised binary classifiers based on support vector machines (SVMs). A rigorous feature-selection method, based on ranking features with the Kruskal-Wallis test, was adopted, along with leave-one-person-out cross-validation to calculate the prediction accuracy. The classification was performed for the recordings of the three individual sessions, and also for a "combined session" in which data from the three sessions were pooled. Classification was first carried out on the individual feature sets, then on all audio features combined, and finally on the combined audio and video features.
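The prediction pipeline described above can be sketched as follows. This is a minimal illustration, not the thesis implementation: the data, feature dimensions, and the choice of a linear SVM kernel are placeholder assumptions, and only the overall structure (Kruskal-Wallis feature ranking, SVM classification, leave-one-person-out cross-validation) follows the text.

```python
# Sketch: Kruskal-Wallis feature ranking + SVM, evaluated with
# leave-one-person-out cross-validation. All data below is synthetic.
import numpy as np
from scipy.stats import kruskal
from sklearn.svm import SVC
from sklearn.preprocessing import StandardScaler
from sklearn.pipeline import make_pipeline
from sklearn.model_selection import LeaveOneGroupOut

rng = np.random.default_rng(0)

# Synthetic stand-in: 40 recordings x 20 non-verbal features, a binary
# NSA label per recording, and a subject id per recording (several
# recordings per person, so folds must leave out whole persons).
X = rng.normal(size=(40, 20))
y = np.tile([0, 0, 1, 1], 10)
subjects = np.repeat(np.arange(10), 4)
X[y == 1, :5] += 1.0  # make the first 5 features weakly informative

def kruskal_rank(X, y, k):
    """Rank features by the Kruskal-Wallis H statistic between label groups."""
    h = [kruskal(X[y == 0, j], X[y == 1, j]).statistic
         for j in range(X.shape[1])]
    return np.argsort(h)[::-1][:k]  # indices of the k highest-H features

logo = LeaveOneGroupOut()
correct, total = 0, 0
for train, test in logo.split(X, y, groups=subjects):
    # Rank and select features on the training fold only, to avoid leakage.
    top = kruskal_rank(X[train], y[train], k=5)
    clf = make_pipeline(StandardScaler(), SVC(kernel="linear"))
    clf.fit(X[train][:, top], y[train])
    correct += int((clf.predict(X[test][:, top]) == y[test]).sum())
    total += len(test)

accuracy = correct / total
print(f"leave-one-person-out accuracy: {accuracy:.2f}")
```

Grouping the cross-validation folds by subject rather than by recording is the key design point: with three sessions per person, a per-recording split would let the classifier recognise individuals instead of symptoms.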
The NSA items Restricted speech quantity, Affect: Reduced modulation of intensity, and Reduced expressive gestures were consistently classified with accuracies of 63-75%, 68-72% and 62-69% respectively, across all sessions and feature sets. Similarly, the patients and controls could be distinguished with an accuracy of 66-83%. These consistent classification accuracies, together with the high correlation of feature ranks between sessions, indicate a genuine and strong relationship between the subjective NSA ratings and the objective non-verbal signals. The mean values of the top-ranked features were also plotted on radar diagrams and showed distinctive distributions, confirming their effectiveness in classification. The low accuracies for the items Impoverished speech content and Emotion reduced range point towards classification based on natural language processing and facial emotion analysis, respectively. These non-verbal signals can be used to build an inexpensive, easy-to-use tool that analyses and monitors negative symptoms objectively. Such a tool can supplement clinicians as an assessment method for preliminary screening and for remote longitudinal monitoring during and after therapy. The framework can also be extended to other mental disorders, such as depression and Alzheimer's disease, making mental healthcare more accessible and personalized for those in need, especially in low- and middle-income countries.
DRNTU::Engineering::Computer science and engineering::Computer applications::Social and behavioral sciences