Title: An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study
Authors: Zhang, Xinxin
Lee, Jimmy
Goh, Wilson Wen Bin
Keywords: Science::Biological sciences
Issue Date: 2022
Source: Zhang, X., Lee, J. & Goh, W. W. B. (2022). An investigation of how normalisation and local modelling techniques confound machine learning performance in a mental health study. Heliyon, 8(5), e09502.
Project: NMRC/TCR/003/2008 
Journal: Heliyon 
Abstract: Machine learning (ML) is increasingly deployed in biomedical studies for biomarker development (feature selection) and diagnostic/prognostic technologies (classification). While different ML techniques produce different feature sets and classification performances, less understood is how upstream data processing methods (e.g., normalisation) impact downstream analyses. Using a clinical mental health dataset, we investigated the impact of different normalisation techniques on classification model performance. Gene Fuzzy Scoring (GFS), an in-house developed normalisation technique, is compared against widely used normalisation methods such as global quantile normalisation, class-specific quantile normalisation and surrogate variable analysis. We report that the choice of normalisation technique has a strong influence on feature selection, with GFS outperforming other techniques. Although GFS parameters are tuneable, good classification model performance (ROC-AUC > 0.90) is observed regardless of the GFS parameter settings. We also contrasted our results against local modelling, which is meant to improve the resolution and meaningfulness of classification models built on heterogeneous data. Local models, when derived from non-biologically meaningful subpopulations, perform worse than global models. A deep dive, however, revealed that the factors driving cluster formation have little to do with the phenotype-of-interest. This finding is critical, as local models are often seen as a superior means of clinical data modelling. We advise against such naivete. Additionally, we have developed a combinatorial reasoning approach using both global and local paradigms. This helped reveal potential data quality issues or underlying factors causing data heterogeneity that are often overlooked. It also helps to explain the model and provides directions for further improvement.
ISSN: 2405-8440
DOI: 10.1016/j.heliyon.2022.e09502
Schools: School of Biological Sciences 
Lee Kong Chian School of Medicine (LKCMedicine) 
Research Centres: Centre for Biomedical Informatics
Rights: © 2022 The Author(s). Published by Elsevier Ltd. This is an open access article under the CC BY-NC-ND license (
Fulltext Permission: open
Fulltext Availability: With Fulltext
Appears in Collections: LKCMedicine Journal Articles
SBS Journal Articles

Files in This Item:
File: PIIS2405844022007903.pdf
Size: 2.24 MB
Format: Adobe PDF

Items in DR-NTU are protected by copyright, with all rights reserved, unless otherwise indicated.