Computer Science > Machine Learning

arXiv:2410.01795 (cs)

[Submitted on 2 Oct 2024]

Title:Knowledge-Driven Feature Selection and Engineering for Genotype Data with Large Language Models

Authors:Joseph Lee, Shu Yang, Jae Young Baik, Xiaoxi Liu, Zhen Tan, Dawei Li, Zixuan Wen, Bojian Hou, Duy Duong-Tran, Tianlong Chen, Li Shen

Abstract:Predicting phenotypes with complex genetic bases from a small, interpretable set of variant features remains a challenging task. Conventionally, data-driven approaches are used for this task, yet the high-dimensional nature of genotype data makes analysis and prediction difficult. Motivated by the extensive knowledge encoded in pre-trained LLMs and their success in processing complex biomedical concepts, we set out to examine the ability of LLMs to perform feature selection and engineering for tabular genotype data, using a novel knowledge-driven framework. We develop FREEFORM (Free-flow Reasoning and Ensembling for Enhanced Feature Output and Robust Modeling), designed with chain-of-thought and ensembling principles, to select and engineer features with the intrinsic knowledge of LLMs. Evaluated on two distinct genotype-phenotype datasets, genetic ancestry and hereditary hearing loss, this framework outperforms several data-driven methods, particularly in low-shot regimes. FREEFORM is available as an open-source framework on GitHub: this https URL.

Subjects:	Machine Learning (cs.LG); Computation and Language (cs.CL); Genomics (q-bio.GN)
Cite as:	arXiv:2410.01795 [cs.LG]
	(or arXiv:2410.01795v1 [cs.LG] for this version)
	https://doi.org/10.48550/arXiv.2410.01795

Submission history

From: Shu Yang
[v1] Wed, 2 Oct 2024 17:53:08 UTC (242 KB)
