Novel algorithms for population structure inference

This technology is a suite of statistical and machine learning methods for population structure inference and translational biomedical applications.

Unmet Need: Methodology for capturing underlying population structures

The increasing availability of whole genome sequencing provides large datasets for population and medical genetics studies. Identifying latent population structure is crucial both to account for variation in allele frequencies between subpopulations and to avoid confounding when testing genetic associations with disease. Principal component analysis (PCA) is commonly applied to extract principal components (PCs) that capture the population structure, but existing methods have several limitations when applied to data at this scale.

The Technology: Novel algorithms for population structure inference from whole genome sequencing data

This technology, called ERStruct, is a software package for inferring the latent population structure of whole-genome datasets accurately and efficiently. ERStruct is a suite of statistical and machine learning methods, including Mendelian randomization, causal mediation analysis, statistical genetics, and deep learning. The algorithm runs in MATLAB and Python, estimates the number of top informative principal components, and processes ultra-dimensional whole human genome data in a computationally efficient way.
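To make the idea of "estimating the number of top informative principal components" concrete, the following is a minimal sketch in Python using only NumPy. It is not the ERStruct algorithm and does not use the ERStruct package API. It simulates a small genotype matrix and applies a simple heuristic that picks the number of PCs at the sharpest drop between consecutive eigenvalues. The function names `simulate_genotypes` and `estimate_num_pcs`, the simulation parameters, and the heuristic itself are all assumptions introduced for illustration.

```python
import numpy as np


def simulate_genotypes(n_per_pop=50, n_pops=3, n_snps=2000, drift_sd=0.1, seed=0):
    """Simulate 0/1/2 genotypes for several subpopulations (hypothetical toy model)."""
    rng = np.random.default_rng(seed)
    ancestral = rng.uniform(0.1, 0.9, size=n_snps)
    blocks = []
    for _ in range(n_pops):
        # Each subpopulation gets its own perturbed allele frequencies.
        freqs = np.clip(ancestral + rng.normal(0.0, drift_sd, n_snps), 0.05, 0.95)
        blocks.append(rng.binomial(2, freqs, size=(n_per_pop, n_snps)))
    return np.vstack(blocks).astype(float)


def estimate_num_pcs(genotypes, max_pcs=10):
    """Pick the number of PCs at the smallest ratio of consecutive eigenvalues.

    This is a plain illustrative heuristic, assumed here for demonstration only.
    """
    centered = genotypes - genotypes.mean(axis=0)
    sd = centered.std(axis=0)
    keep = sd > 0  # drop SNPs with no variation
    z = centered[:, keep] / sd[keep]
    # Work with the small n-by-n matrix instead of the p-by-p covariance,
    # since the number of SNPs far exceeds the number of individuals.
    gram = z @ z.T / z.shape[1]
    eigvals = np.linalg.eigvalsh(gram)[::-1]  # sorted in descending order
    ratios = eigvals[1:max_pcs + 1] / eigvals[:max_pcs]
    k = int(np.argmin(ratios)) + 1
    return k, eigvals


genotypes = simulate_genotypes()
k, eigvals = estimate_num_pcs(genotypes)
print("Estimated number of informative PCs:", k)
```

In this toy setting, three simulated subpopulations would be expected to produce two dominant eigenvalues, since centering removes one dimension. Real whole-genome data involve linkage disequilibrium, rare variants, and far larger dimensions, which is where a purpose-built method is needed.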

Applications:

Drug discovery
Drug target identification
Structure-based drug design
Causal protein biomarker validation
Biomarker discovery and validation
Personalized medicine
Genomics
Health AI
Clinical trial analytics
Fairness-aware AI in health
Real-world evidence analysis
Risk factor identification (e.g., COVID-19 severity, cardiometabolic disease)

Advantages:

Efficient and accurate structure inference
Can process ultra-dimensional whole genome data
Runs in MATLAB and Python environments
User-friendly software package
Outperforms traditional methods for estimating the number of informative principal components

Lead Inventor:

Zhonghua Liu, Sc.D.

Related Publications:

Yang J, Xu Y, Yao M, Wang G, Liu Z. “ERStruct: a fast Python package for inferring the number of top principal components from whole genome sequencing data.” BMC Bioinformatics. 2023 May 2;24(1):180.

Tech Ventures Reference:

IR CU25377
Licensing Contact: Joan Martinez