Improving the performance and interpretability on medical datasets using graphical ensemble feature selection
Keywords
Algorithms
Arthritis, Rheumatoid
Computational Biology/methods
Humans
Machine Learning
URI
https://hdl.handle.net/20.500.14018/27117
Abstract
Motivation: A major hindrance towards using Machine Learning (ML) on medical datasets is the discrepancy between a large number of variables and small sample sizes. While multiple feature selection techniques have been proposed to avoid the resulting overfitting, overall ensemble techniques offer the best selection robustness. Yet, current methods designed to combine different algorithms generally fail to leverage the dependencies identified by their components. Here, we propose Graphical Ensembling (GE), a graph-theory-based ensemble feature selection technique designed to improve the stability and relevance of the selected features.

Results: Relying on four datasets, we show that GE increases classification performance with fewer selected features. For example, on rheumatoid arthritis patient stratification, GE outperforms the baseline methods by 9% Balanced Accuracy while relying on fewer features. We use data on sub-cellular networks to show that the selected features (proteins) are closer to the known disease genes, and the uncovered biological mechanisms are more diversified. By successfully tackling the complex correlations between biological variables, we anticipate that GE will improve the medical applications of ML.
Type
Journal article
Date
2024-06-05
Identifiers
10.1093/bioinformatics/btae341