Open Access Article

Analysis of Regions of Homozygosity: Revisited Through New Bioinformatic Approaches

Susana Valente ¹,*, Mariana Ribeiro ¹, Jennifer Schnur ², Filipe Alves, Nuno Moniz ², Dominik Seelow ³,⁴, João Parente Freixo, Paulo Filipe Silva ¹,† and Jorge Oliveira ¹,⁵,⁶,†

¹ Centro de Genética Preditiva e Preventiva (CGPP), Instituto de Biologia Molecular e Celular (IBMC), Instituto de Investigação e Inovação em Saúde (i3S), Universidade do Porto, 4200-135 Porto, Portugal

² Lucy Family Institute for Data and Society, University of Notre Dame, Notre Dame, IN 46556, USA

³ Exploratory Diagnostic Sciences, Berliner Institut Für Gesundheitsforschung@Charité, Charitéplatz 1, 10117 Berlin, Germany

⁴ Institut Für Medizinische Genetik und Humangenetik, Charité—Universitätsmedizin Berlin, Augustenburger Platz 1, 13353 Berlin, Germany

⁵ Laboratory of Cell Biology, Department of Microscopy, ICBAS—Institute of Biomedical Sciences Abel Salazar; Universidade do Porto, 4050-313 Porto, Portugal

⁶ UMIB-Unit for Multidisciplinary Research in Biomedicine, ICBAS/ITR-Laboratory for Integrative and Translational Research in Population Health, Universidade do Porto, 4050-313 Porto, Portugal

* Author to whom correspondence should be addressed.

† These authors contributed equally to this work.

BioMedInformatics 2024, 4(4), 2374-2399; https://doi.org/10.3390/biomedinformatics4040128

Submission received: 4 October 2024 / Revised: 4 December 2024 / Accepted: 9 December 2024 / Published: 16 December 2024


Figure 1. Flowchart representing the automation of the creation of multigene panels based on ROHs (DB, database; DF, dataframe).
Figure 2. Flowchart to obtain the reference BED file.
Figure 3. The flowchart of the multigene panel lists: white, grey, and black.
Figure 4. Flowchart of the ROH and HPO multigene panel automation.
Figure 5. Overview of the results regarding processes of generating the multigene panel application in a case study, the first Portuguese ROH characterization, and the clustering model.
Figure 6. Pedigree depicting two affected sisters, daughters of a consanguineous couple.
Figure 7. Example of an input for the personalized multigene panels based on HPO term and ROHs.
Figure 8. IGV visualization of the reads mapped to the CSTB gene in both sisters (II:1 and II:2).
Figure 9. BAM visualization depicting the region of the dodecamer repeat expansion in a control sample (I), and in both sisters (II:1 and II:2). No reads are aligned in this region in both patients, suggesting that a possible expansion is biallelic (present in both CSTB alleles).
Figure 10. Histogram depicting the distribution of ROH length above 0.5 Mb in a Portuguese cohort of 3941 samples.
Figure 11. Geographical distribution per municipality of FROH > 0.5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.
Figure 12. Geographical distribution per municipality of FROH > 1.5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.
Figure 13. Geographical distribution per municipality of FROH > 5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.
Figure 14. Map of Portugal representing the consanguinity between 1980 and 1986 (/100,000) adapted from [89] (upper left) and the Portugal Mainland maps for the FROH calculated for ROHs of size above 0.5 Mb (upper right), 1.5 Mb (lower left), and 5 Mb (lower right).
Figure 15. Low-dimensional MDS representations of each “tier” dataset, where Tier 0 is training and validation results (A) and testing results (B); Tier 1 is training and validation results (C) and testing results (D); Tier 2 is training and validation results (E) and testing results (F). Data points are colored according to their consanguinity labels: White “unknown” points do not possess a ground truth label; green “NCON” points represent non-consanguineous samples; red “CON” points represent consanguineous samples; and purple “CON_ST” represent stringent consanguineous points. The red dashed circles represent the elliptic envelope’s outlier decision boundary (i.e., points falling outside of the envelope are predicted to be consanguineous, either stringent or non-stringent).


Abstract

Background: Runs of homozygosity (ROHs), continuous homozygous regions across the genome, are often linked to consanguinity, with their size and frequency reflecting shared parental ancestry. Homozygosity mapping (HM) leverages ROHs to identify genes associated with autosomal recessive diseases. Whole-exome sequencing (WES) improves HM by detecting ROHs and disease-causing variants. Methods: To streamline personalized multigene panel creation using WES and ROHs, we developed a methodology integrating the ROHMMCLI and HomozygosityMapper algorithms and, optionally, Human Phenotype Ontology (HPO) terms, implemented in a Django Web application. Drawing on a dataset of 12,167 WES samples, we performed the first ROH profiling of the Portuguese population. Clustering models were applied to predict consanguinity from ROH features. Results: These resources were applied for the genetic characterization of two siblings with epilepsy, myoclonus and dystonia, pinpointing the CSTB gene as disease-causing. Using the 2021 Census population distribution, we created a representative sample (3941 WES) and measured genome-wide autozygosity (F_ROH). Portalegre, Viseu, Bragança, Madeira, and Vila Real districts presented the highest F_ROH scores. Multidimensional scaling showed that ROH count and sum were key predictors of consanguinity, achieving a test F1-score of 0.96 with additional features. Conclusions: This study contributes new bioinformatics tools for ROH analysis in a clinical setting and provides unprecedented population-level ROH data for Portugal.

Keywords: regions of homozygosity; bioinformatic model; variant prioritization; whole-exome sequencing; consanguinity; multigene panels; recessive diseases

1. Introduction

Homozygosity refers to having two identical alleles of a gene, one inherited from each parent [1]; when it results from common ancestry between the parents, that is, from identity by descent (IBD), it is called autozygosity [2]. Runs of homozygosity (ROHs) are continuous segments of the genome that are identical in both copies of a chromosome pair (alleles), ranging from tens of kilobases to megabases [3]. ROHs can arise from consanguineous marriages (estimated to affect ~10% of people worldwide) [4], inbreeding [5,6], or the founder effect [7], increasing the risk of recessive diseases in the offspring. ROHs may also be “runs of hemizygosity”, when there is a deletion in one copy of a chromosome, leading to a loss of heterozygosity [8].

ROH patterns reflect the level of kinship and autozygosity, both reduced by people’s mobility and globalization [9]. Short ROHs are characteristic of admixed populations and reflect ancient parental relatedness, whereas longer ROHs reflect higher consanguinity levels and recent parental relatedness [10,11,12,13]. ROH patterns also reflect population and demographic history [14,15], including differences in consanguinity and number of ROHs between ethnic subgroups [10,16,17,18,19,20,21]. Understanding these patterns in diverse populations is essential for assessing disease risk and identifying disease-causing genetic variants, particularly in admixture isolates [11,22,23,24].

As consanguinity increases, so do the number and size of ROHs, raising the risk of autosomal recessive (AR) diseases. ROH analysis increases the diagnostic rate of recessive diseases, especially in consanguineous families, by helping to find candidate genes [25,26,27,28,29] and disease-causing homozygous variants [30], and it can corroborate the historical context of communities [31]. Furthermore, it is crucial for identifying candidate genes for specific recessive diseases [32,33,34,35,36], even in non-consanguineous families [37]. Biodemographic and genetic studies provide insights into population structure and its link to diseases by exploring the human genome’s significance in population history and consanguinity practices [14,38].

Homozygosity mapping (ROH detection) aids gene discovery by assuming that individuals with AR diseases likely have homozygous markers surrounding the disease locus, searching for and identifying regions harboring the affected gene. If other relatives also have the disease, the strategy includes identifying ROHs exclusive to affected individuals within the family [39]. It was first applied in 1987 by Lander and Botstein in consanguineous families affected by a recessive disease using restriction fragment length polymorphisms (RFLPs) [40]. Homozygosity mapping evolved to utilize Single-Nucleotide Polymorphism (SNP) array data, and with the advent of next-generation sequencing (NGS), software tools were designed to accommodate these sequencing data as input [39].

The introduction of NGS enables simultaneous homozygosity mapping and variant detection, generating vast data volumes surpassing previous technologies in speed and cost-effectiveness. Since 454 sequencing by Roche, NGS has evolved through second-generation (short-read) and third/fourth-generation (long-read) technologies [41]. Second-generation sequencing generates short DNA fragments (100–600 bp), with Illumina being widely used for genetic testing [41,42]. Third/fourth-generation sequencing achieves reads of over 10 kb, effectively detecting genome-wide repeats and structural variants, suitable for diagnostic and clinical applications [41,43]. The two main technologies are provided by Pacific Biosciences (PacBio) [44] and Oxford Nanopore (ONT) [45].

NGS applications include single genes, targeted multigene panels, whole-exome sequencing (WES), whole-genome sequencing (WGS), and transcriptomes (RNA sequencing), all effective for genetic testing [46]. WES, which targets protein-coding exons, where ~85% of the known Mendelian disease variants occur, has become a mainstream approach due to its cost-effectiveness and simplified data management [47,48,49]. WES can be performed individually or in trio (enhanced variant identification) [50]. Its limitations include sensitivity to GC-rich regions, reliance on Sanger sequencing to confirm low-quality variants, challenges with variants of uncertain significance (VUS), shared homology between genomic regions (segmental duplications/pseudogenes), and failure to genotype highly repetitive regions completely, especially in the presence of large repeats (expansions) [2,51,52].

Reanalyzing genomic data enhances diagnostic rates by uncovering novel gene–disease associations, improving bioinformatics techniques for CNV detection and variant calling, incorporating consanguinity assessment (ROH filter) to narrow down the list of candidate variants, and integrating the Human Phenotype Ontology (HPO) [51] terms. HPO terms describe human phenotypic information in a standardized way (used for supporting clinical diagnostics and genetic research). Estimating consanguinity through ROH analysis allows for an unbiased determination of parental or ancestral consanguinity, overcoming the limitations of self-reports or inferences based on family context [52].

Both adapted and new homozygosity mapping tools have emerged, enhancing diagnostic rates by integrating WES data with ROH analysis [39,53,54]. The software can be based on sliding-window or hidden Markov model (HMM) algorithms [39]. Sliding-window algorithms, originally designed for SNP array data analysis, move a fixed-size window along the chromosome to find stretches of consecutive homozygous SNPs [39]. PLINK [39] is widely used on its own [16,55,56,57,58,59,60,61,62,63,64,65], as a complementary analysis [66], or integrated into other algorithms [67]. Other software followed, such as Obelisc [68], GERMLINE [39], EX-HOM (EXome-HOMozygosity) [69], and HomozygosityMapper (HM) [70]. PLINK, GERMLINE and HomozygosityMapper (HM) were subsequently adapted for WES data [39,71]. Other tools developed since include HOMWES [72], GARLIC [73], HomSI [39,74], and Automap [75].

Hidden Markov models (HMMs) represent observed data as outputs generated by hidden states, modeled as a Markov chain [76]. In ROH detection, HMMs estimate the likelihood of a genotype (observation) being homozygous or heterozygous (hidden states) [77]. The software tools available are H₃M₂ [77,78]; IBDSeq and GIBDLD [78]; BEAGLE [79]; ROHMM and BCFtools/RoH [77,80]; and Python packages FILTUS and hapROH [81,82,83]. According to the literature, ROHMM demonstrates higher performance than sliding-window algorithms [80].

The accuracy of these tools can be influenced by many factors, such as the choice of algorithm used, sample sequencing depth and coverage, SNP density and sequence quality, the need for phased data, loss of short and medium-sized ROHs, and false positives [39,82]. These factors should be considered when selecting the appropriate software for a project [39,83].

This work presents new bioinformatics approaches to address the creation of personalized multigene panels based on WES data using ROH and/or HPO terms, integrated into a Django Web application. Its impact on diagnostics is illustrated by the genetic characterization of two siblings affected by a recessive disease. Analysis of ROHs at a genomic scale in a representative sample of 3941 patients advances ROH analysis using WES data, highlighting its diagnostic potential and significance in population genetics.

2. Materials and Methods

The dataset used in this work consisted of WES samples from patients who underwent genetic testing at the Center for Predictive and Preventive Genetics (CGPP), Portugal.

2.1. Creation of Personalized Multigene Panels Based on ROHs

Multigene panels based on the patient’s ROHs focus on the analysis of regions of the genome more likely to contain recessive disease-causing variants. By targeting genes within these ROHs, the panels are more likely to identify relevant genetic variants, particularly in a consanguinity context or shared ancestry.

The samples used for the creation of these panels were analyzed using two homozygosity mapping algorithms: HomozygosityMapper (HM), which uses a sliding-window algorithm, and ROHMM Command Line Interface (ROHMMCLI), which uses a hidden Markov model (HMM) algorithm. Each patient has a pseudo-anonymized ID without any personal information.

Both algorithms output data in different formats: HM outputs a raw data text file with chromosome, position, and score, while ROHMMCLI outputs a BED file. To generate the Uniform Resource Locator (URL), a connection to the HM database was initiated and the project number (project_no) was retrieved using the patient ID. With the URL generated, the data were collected and saved in a BED file. Then, the HM and ROHMMCLI BED files were merged using a shell script. This script takes the patient ID and the current date as inputs and is divided into four Linux commands, as follows:

Clean up the ROHMMCLI BED file to contain only chromosome and start and end positions.
Merge the HM and the cleaned ROHMMCLI BED files using bedtools merge with the -d option set to 1,000,000 bp, the maximum distance between ROHs to be merged.
Use bedtools intersect to find overlaps between the merged BED file and the coding sequence coordinate BED file, producing another BED file with the list of gene coordinates found within ROHs.
Create a text file with a list of gene Entrez IDs present in the identified ROHs.
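The merging and intersection steps above can be sketched in plain Python. This is an illustrative re-implementation of the interval logic only, not the authors' shell script: the function names, gene labels, and coordinates are hypothetical, and the merge rule (join intervals separated by at most `max_gap` bp) is assumed to mirror the behavior of `bedtools merge -d`.

```python
from collections import defaultdict


def merge_rohs(intervals, max_gap=1_000_000):
    """Merge (chrom, start, end) intervals whose gap is at most max_gap bp."""
    by_chrom = defaultdict(list)
    for chrom, start, end in intervals:
        by_chrom[chrom].append((start, end))

    merged = []
    for chrom in sorted(by_chrom):
        current_start, current_end = None, None
        for start, end in sorted(by_chrom[chrom]):
            if current_start is None:
                current_start, current_end = start, end
            elif start - current_end <= max_gap:
                # Close enough to the running interval: extend it.
                current_end = max(current_end, end)
            else:
                merged.append((chrom, current_start, current_end))
                current_start, current_end = start, end
        merged.append((chrom, current_start, current_end))
    return merged


def genes_in_rohs(rohs, cds_records):
    """Return sorted gene IDs whose CDS (chrom, start, end, gene_id) overlaps any ROH."""
    hits = set()
    for chrom, start, end, gene_id in cds_records:
        for roh_chrom, roh_start, roh_end in rohs:
            if chrom == roh_chrom and start < roh_end and end > roh_start:
                hits.add(gene_id)
                break
    return sorted(hits)


# Hypothetical ROH calls from the two algorithms (half-open BED coordinates).
hm_rohs = [("chr1", 1_000_000, 3_000_000)]
rohmm_rohs = [("chr1", 3_500_000, 6_000_000), ("chr2", 10_000_000, 12_000_000)]
merged = merge_rohs(hm_rohs + rohmm_rohs)

# Hypothetical coding-sequence records.
cds = [
    ("chr1", 3_200_000, 3_201_000, "GENE_A"),
    ("chr2", 20_000_000, 20_001_000, "GENE_B"),
]
panel_genes = genes_in_rohs(merged, cds)
```

In this toy input, GENE_A lies in the 500 kb gap between the two chromosome 1 calls, so it enters the gene list only because the two ROHs are merged.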

The process of obtaining the gene list is outlined in Figure 1.

The file containing all coding sequence coordinates was generated using two in-house-developed tools (gtf2tsv.py and tsv2bed.py). For this work, the file used was a GTF file named GCF_000001405.25_GRCh37.p13_genomic.gff.gz, representing the RefSeq annotation release version 105.20220307 of the human genome build GRCh37, and the “feature” column was filtered for “CDS” (available at https://ftp.ncbi.nlm.nih.gov/genomes/all/annotation_releases/9606/105.20220307/GCF_000001405.25_GRCh37.p13/, accessed on 16 September 2023) (Figure 2).

The final step to generate multigene panels consists of comparing the gene list with the coverage of the representative transcript of each gene, as described in Figure 3. Genes are divided into three lists based on the fraction of horizontal coverage at 20×: white (≥0.9), grey (between 0.1 and 0.9), and black (≤0.1). Only the white and grey genes are included in the multigene panel. A fourth list is generated for the genes that were not assigned to any of the previous lists.
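A minimal sketch of this classification is given below. The thresholds come from the text; the function names, the gene labels, and the rule that genes without coverage data go to the fourth ("unassigned") list are assumptions made for illustration.

```python
def classify_genes(gene_ids, coverage_20x):
    """Split genes into white/grey/black lists by fraction of bases covered at 20x.

    coverage_20x maps gene ID -> fraction in [0, 1]; genes missing from the
    mapping are placed in "unassigned" (an assumption of this sketch).
    """
    lists = {"white": [], "grey": [], "black": [], "unassigned": []}
    for gene in gene_ids:
        fraction = coverage_20x.get(gene)
        if fraction is None:
            lists["unassigned"].append(gene)
        elif fraction >= 0.9:
            lists["white"].append(gene)
        elif fraction <= 0.1:
            lists["black"].append(gene)
        else:
            lists["grey"].append(gene)
    return lists


def panel_from_lists(lists):
    """Only white and grey genes are kept in the final panel."""
    return lists["white"] + lists["grey"]


# Hypothetical coverage values.
coverage = {"GENE_A": 0.98, "GENE_B": 0.45, "GENE_C": 0.02}
lists = classify_genes(["GENE_A", "GENE_B", "GENE_C", "GENE_D"], coverage)
panel = panel_from_lists(lists)
```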

Copy Number Variations (CNVs), more specifically heterozygous deletions, can mimic ROHs, given that the single-nucleotide variants (SNVs) encompassed by the deletion cannot be heterozygous (and are in fact hemizygous). To incorporate such a possible impact in the multigene panel creation, the following steps were implemented:

Find the CNV results for the sample in analysis and filter by CNVs with a span above 500,000 bp and that are ‘Heterozygous Deletion’, resulting in a BED file with CNV genomic coordinates.
Filter by non-empty files, meaning files that contain CNVs.
Use the bedtools jaccard tool to calculate the Jaccard index for each CNV that intersects an ROH, using the merged ROH results and CNV BED files.

The Jaccard index (Equation (1)) is a single statistic that reflects the similarity of the two BED files based on the intersections between them, where a value of 0.0 indicates no overlap and 1.0 represents complete overlap.

$$\text{Jaccard index} = \frac{\text{intersection}}{(\text{ROH length} + \text{CNV length}) - \text{intersection}} \tag{1}$$

where the intersection is the length of the overlap between the ROH and the CNV (for example, the difference between the end of the ROH and the start of the CNV when the CNV starts inside the ROH and extends beyond its end); the ROH length is the difference between the end and start coordinates of the ROH; and the CNV length is the difference between the end and start coordinates of the CNV.
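As a worked illustration of Equation (1), the sketch below computes the index for a single ROH and a single CNV. It assumes both intervals lie on the same chromosome and have positive length; the function name and coordinates are hypothetical, and the overlap is computed in the general form so that it is zero for disjoint intervals.

```python
def jaccard_index(roh, cnv):
    """Jaccard index of two (start, end) intervals on the same chromosome."""
    roh_start, roh_end = roh
    cnv_start, cnv_end = cnv
    # Overlap length; zero when the intervals do not intersect.
    intersection = max(0, min(roh_end, cnv_end) - max(roh_start, cnv_start))
    roh_length = roh_end - roh_start
    cnv_length = cnv_end - cnv_start
    return intersection / ((roh_length + cnv_length) - intersection)


# A 4 Mb ROH and a 2 Mb heterozygous deletion sharing 1 Mb.
partial = jaccard_index((1_000_000, 5_000_000), (4_000_000, 6_000_000))
identical = jaccard_index((1_000_000, 5_000_000), (1_000_000, 5_000_000))
disjoint = jaccard_index((1_000_000, 2_000_000), (3_000_000, 4_000_000))
```

For the first pair, the intersection is 1 Mb and the denominator is (4 Mb + 2 Mb) − 1 Mb = 5 Mb, giving an index of 0.2.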

2.2. Creation of Personalized Multigene Panels Based on HPO Terms

Multigene panels based on HPO terms ensure that gene selection is targeted to the patient’s specific phenotype. These panels restrict the analysis to genes possibly associated with the phenotype, which are therefore the most likely to harbor the disease-causing variants responsible for it, increasing the accuracy and relevance of genetic testing. The selection of HPO terms occurs after the medical appointment when the clinician requests the genetic test, which may include specific HPO terms related to the patient’s phenotype. If the clinician provides no terms, the appropriate HPO terms are chosen by the laboratory from the phenotypic details described by the clinician in the test request.

A Python script that takes an HPO term as input was created to establish the connection to the HPO API and retrieve the list of Entrez IDs of the genes associated with the HPO identifier (https://hpo.jax.org/api/hpo/term/{hpoId}/genes, version 1.7.13, accessed on 4 May 2023). By parsing the JSON response returned by the HPO API, we obtained the Entrez ID and symbol of each gene. Then, the white, grey, and black lists for the gene panels were created, as previously described in Figure 3.
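The parsing step might look like the sketch below, which works on an in-memory JSON string rather than calling the API. The response layout is an assumption: the top-level key `genes` and the field names `entrezGeneId` and `entrezGeneSymbol` are not confirmed by the text and are exposed as parameters so they can be adjusted. The gene IDs and symbols in the sample are placeholders.

```python
import json


def parse_hpo_genes(payload_text, id_key="entrezGeneId", symbol_key="entrezGeneSymbol"):
    """Map Entrez gene ID -> gene symbol from an HPO gene-list JSON payload.

    The key names are assumptions about the response schema, not documented facts.
    """
    payload = json.loads(payload_text)
    genes = {}
    for record in payload.get("genes", []):
        genes[int(record[id_key])] = record[symbol_key]
    return genes


# Placeholder payload in the assumed layout.
sample_response = json.dumps(
    {
        "genes": [
            {"entrezGeneId": 1001, "entrezGeneSymbol": "GENE_A"},
            {"entrezGeneId": 1002, "entrezGeneSymbol": "GENE_B"},
        ]
    }
)
hpo_genes = parse_hpo_genes(sample_response)
```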

2.3. Creation of Personalized Multigene Panels Based on ROH and HPO Terms

The integration of ROH and HPO term analysis may offer an even more personalized approach. This method focuses on the individual’s ROH and examines whether the genes associated with the patient’s specific HPO terms are located within these regions.

A new script was written based on the previously described script for creating multigene panels. To obtain a BED file with the coordinates of the genes from the HPO terms’ gene list, the command-line tool tsv2bed.py was used. The merged ROH BED file and the HPO-term gene BED file are then combined to obtain the final list of HPO-term genes located within the ROHs. The corresponding flowchart is presented in Figure 4.

From the list of genes obtained, the process of generating the white, grey, and black lists for the gene panels is the same as previously described in Figure 3.

2.4. Django Web Application Development

The creation of personalized multigene panels based on ROHs, HPO terms, or both was implemented using Python 3 and shell scripting. To provide a more user-friendly interface, a Django Web application was developed. All HTML pages of this application (app) were styled using Cascading Style Sheets (CSS) for visual consistency.

The home page of this app contains a title (“Personalized multigene panels”), a small description of the app, and three buttons: “HPO term-based panels”, “ROH-based panels” and “HPO term and ROH-based panels”. Each button is linked to a different HTML file, with different HTML forms to submit the input data.

For the “HPO term-based panels” button, the form handles multiple HPO terms at the same time. For the “ROH-based panels” button, the form contains two different types of input, a text area to fit the multiple input lines, and a drop-down list of the possible HM threshold options.

For the “HPO term and ROH-based panels” button, the form is a combination of the ones from the previously described HTML files. The only difference is that the text area only allows the analysis of one sample at a time with multiple HPO terms.

2.5. Establishing the First Portuguese ROH Characterization on a Genomic Scale

To establish the first Portuguese ROH characterization on a genomic scale, the dataset initially consisted of over 12,000 WES samples. Since there were municipalities that were over-represented, normalization and down-sampling processes were automated. The final number of samples was 3941 WES samples (detailed process in Supplementary File S1).

The process of establishing the first Portuguese ROH characterization started with the assessment of the ROH levels through the genome-wide autozygosity measure from ROHs ($F_{ROH}$), calculated using Equation (2).

$$F_{ROH} = \frac{\sum L_{ROH}}{L_{auto}} \tag{2}$$

Here, $\sum L_{ROH}$ is the total length of all of an individual’s ROHs above a specified minimum length and $L_{auto}$ is the length of the autosomal genome covered by WES, after removing the telomeres, pericentric regions, and centromeres (which are excluded to prevent overestimating autozygosity).

For each sample, three $F_{ROH}$ values were calculated, using three ROH minimum size thresholds to calculate $\sum L_{ROH}$: 0.5, 1.5, and 5 Mb.

The calculation of $L_{auto}$ involved using the Integrative Genomics Viewer (IGV) to determine the genomic coordinates of the first and last genes on the p and q arms of each chromosome. These coordinates were used to calculate the size of each chromosomal arm using Equation (3):

$$\text{arm size} = \text{end of last gene} - \text{beginning of first gene} \tag{3}$$

The sum of both chromosomal arms’ sizes corresponds to the size of the autosome without the centromeres. The $L_{auto}$ value is the total size of all autosomes, calculated as 2638.813981 Mb.
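Equation (2) and the three size thresholds can be illustrated with a short sketch. The autosomal length is the value reported above; the function names and the example ROH lengths are hypothetical, and the strict "greater than" comparison is assumed from the "above a specified minimum length" wording.

```python
L_AUTO_MB = 2638.813981  # autosomal length reported in the text, in Mb


def f_roh(roh_lengths_mb, min_length_mb, l_auto_mb=L_AUTO_MB):
    """F_ROH: summed length of ROHs longer than min_length_mb over L_auto."""
    total = sum(length for length in roh_lengths_mb if length > min_length_mb)
    return total / l_auto_mb


def f_roh_profile(roh_lengths_mb, thresholds_mb=(0.5, 1.5, 5.0)):
    """One F_ROH value per minimum ROH size threshold."""
    return {threshold: f_roh(roh_lengths_mb, threshold) for threshold in thresholds_mb}


# Hypothetical ROH lengths (Mb) for one sample.
example_lengths = [0.8, 2.0, 6.5, 12.0]
profile = f_roh_profile(example_lengths)
```

With these lengths, the summed ROH length is 21.3 Mb at the 0.5 Mb threshold, 20.5 Mb at 1.5 Mb, and 18.5 Mb at 5 Mb, so the F_ROH value decreases as the threshold becomes more stringent.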

The patients’ address information was obtained from the internal database, consisting of patients’ ID, postcode, municipality, and district names. All patients’ VCF files were previously processed by HM and ROHMMCLI and both results were merged. Homozygosity mapping data were organized into standardized CSV (*.csv) files containing the detailed patient ROH profile (chromosome, ROH’s start and end position, ROH length (bp), ROH length (Mb), and ROH length/chromosome length), and the general patient’s profile (number of ROHs, number of ROHs > 1 Mb, and sum of ROHs (Mb)).

To calculate the F_ROH, the information needed was combined into separate CSV (*.csv) files:

One containing ROHs > 0.5 Mb;
One containing ROHs > 1.5 Mb;
One containing ROHs > 5 Mb.

Then, $\sum L_{ROH}$ was calculated per patient by grouping the ROHs in each CSV (*.csv) file by patient and summing their lengths.

The value of $F_{ROH}$ was calculated for each patient using the corresponding $\sum L_{ROH}$ value, for each minimum ROH size, resulting in a CSV (*.csv) file containing the patients’ ID and $F_{ROH}$. Portugal comprises 18 districts and 2 Autonomous Regions (Açores and Madeira), divided into a total of 308 municipalities. With this information, and the address information, patients were grouped by municipality, and the mean $F_{ROH}$ was calculated (per municipality) and used to create maps of the Portugal Mainland and Autonomous Regions (Açores and Madeira) per municipality.
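The grouping step can be sketched as follows, using plain dictionaries in place of the CSV files. The patient IDs and values are invented, and skipping patients without a known municipality is an assumption of this sketch rather than something stated in the text.

```python
from collections import defaultdict


def mean_f_roh_by_municipality(patient_f_roh, patient_municipality):
    """Average per-patient F_ROH values within each municipality."""
    grouped = defaultdict(list)
    for patient_id, value in patient_f_roh.items():
        municipality = patient_municipality.get(patient_id)
        if municipality is not None:  # patients without an address are skipped
            grouped[municipality].append(value)
    return {name: sum(values) / len(values) for name, values in grouped.items()}


# Hypothetical per-patient F_ROH values and addresses.
f_roh_values = {"P1": 0.01, "P2": 0.03, "P3": 0.02, "P4": 0.05}
addresses = {"P1": "Porto", "P2": "Porto", "P3": "Viseu"}
municipality_means = mean_f_roh_by_municipality(f_roh_values, addresses)
```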

Two types of maps, classical (static) and interactive, were created for each mean F_ROH calculated using minimum ROH sizes of 0.5, 1.5, and 5 Mb.

To create the maps, several data sources were required. The geographical data, at the municipality level (shapefile), were obtained from dados.gov (https://dados.gov.pt/en/datasets/concelhos-de-portugal/, accessed on 6 June 2023), the Portuguese Public Administration’s open data portal. Then, we established the connection to the internal SQL database to obtain the association between municipalities and their respective districts, the CSV (*.csv) files per municipality, and a CSV (*.csv) file containing the number of people per municipality and respective ratio, so that only sufficiently represented municipalities were used for the maps. Since we were dealing with geospatial data, the maps were created using the GeoPandas package and stored as PNG (*.png) files.

The interactive maps were created using the explore method of a GeoPandas GeoDataFrame and were saved as HTML files.

2.6. Consanguinity Classification Approach

The set of samples used to build the clustering model was meticulously chosen from the internal database based on the patients’ information concerning consanguinity. A total of 9160 WES samples were collected for the analysis. This included 9020 (98.5%) individuals with “unknown” consanguinity, 84 (0.92%) known non-consanguineous samples, and 56 (0.61%) known consanguineous samples. Of the 56 consanguineous samples, 34 (60.7%) were stringent (i.e., parents were first-degree cousins). Each sample was submitted to HM to find homozygous blocks of the exome and the raw data were provided in a text file containing the chromosome, position, and their corresponding homozygosity scores from HM.

2.6.1. Feature Extraction

For each sample, we generated features pertaining to the descriptive statistics of the ROHs embedded within each chromosome. For the purpose of this analysis, we considered ROHs to be at least 2 consecutive chromosome positions with homozygosity scores greater than or equal to 64 (i.e., 80% of the highest observed score, 80, as previously reported [71]). Specifically, the following features were generated for each sample with respect to each chromosome x:

Count_x: the number of ROHs in chromosome x.
Sum_x: the sum of ROH sizes in chromosome x.
Min_x: the minimum ROH size in chromosome x.
Max_x: the maximum ROH size in chromosome x.
Mean_x: the mean ROH size in chromosome x.
STD_x: the standard deviation of ROH size in chromosome x.

To make these features more concrete, we provide an illustrative example. Suppose an individual possesses three ROHs within chromosome 1 (x = 1), with sizes 3, 7, and 4 Mb. The following features are extracted from chromosome 1: Count_1 = 3; Sum_1 = 14; Min_1 = 3; Max_1 = 7; Mean_1 = 4.67; and STD_1 = 1.6997. This feature extraction process is then repeated for the individual’s remaining chromosomes.

Following feature extraction, to test the predictive quality of various feature sets in our experiments, we created three separate representations of the data, dictated by the following sets of features:

Tier 0: includes “Count_x” and “Sum_x” features only;
Tier 1: includes “Count_x”, “Sum_x”, “Min_x”, and “Max_x” features only;
Tier 2: includes “Count_x”, “Sum_x”, “Min_x”, “Max_x”, “Mean_x”, and “STD_x” features.
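The per-chromosome features and the three tiers can be reproduced with the standard library, as in the sketch below. It uses the population standard deviation, which matches the worked example (STD_1 = 1.6997); the function names are hypothetical, and returning zeros for a chromosome without ROHs is an assumption not stated in the text.

```python
import statistics

FEATURE_NAMES = ("Count", "Sum", "Min", "Max", "Mean", "STD")
TIERS = {0: FEATURE_NAMES[:2], 1: FEATURE_NAMES[:4], 2: FEATURE_NAMES}


def chromosome_features(roh_sizes_mb, chrom):
    """Descriptive statistics of the ROH sizes found on one chromosome."""
    if not roh_sizes_mb:
        # Assumption: a chromosome without ROHs contributes all-zero features.
        values = (0, 0, 0, 0, 0.0, 0.0)
    else:
        values = (
            len(roh_sizes_mb),
            sum(roh_sizes_mb),
            min(roh_sizes_mb),
            max(roh_sizes_mb),
            statistics.fmean(roh_sizes_mb),
            statistics.pstdev(roh_sizes_mb),  # population standard deviation
        )
    return {f"{name}_{chrom}": value for name, value in zip(FEATURE_NAMES, values)}


def select_tier(features, tier):
    """Keep only the feature families that belong to the requested tier."""
    allowed = TIERS[tier]
    return {key: value for key, value in features.items() if key.split("_")[0] in allowed}


# Worked example from the text: three ROHs of 3, 7, and 4 Mb on chromosome 1.
features_chr1 = chromosome_features([3, 7, 4], 1)
tier0_chr1 = select_tier(features_chr1, 0)
```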

2.6.2. Outlier Detection

We formulated the task of consanguinity classification as an outlier detection problem. For our experiments, we randomly selected 50% of all labeled data points (70 total data points) to be reserved for testing. Using the remaining 50% of labeled data points for validation and 100% of unlabeled data points for training, we proceeded to establish the semi-supervised outlier detection pipeline. First, we projected the data into a low-dimensional (2D) space using classical multidimensional scaling (MDS) [84], which is a manifold learning approach that aims to preserve, in the low-dimensional representation, the pairwise Euclidean distances between points in the high-dimensional representation. Following dimensionality reduction, we then fit an elliptic envelope [85] to the data with “unknown” consanguinity labels, validating the optimal contamination hyperparameter (i.e., the proportion of the data estimated to be outliers) using the remaining 50% of the labeled samples. Given the imbalanced class distribution, the F1-score was used to both optimize the contamination hyperparameter and evaluate the model on the reserved test set.
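The validation logic (fit on unlabeled points, choose the contamination value that maximizes F1 on labeled validation points) can be sketched without external libraries. Note the substitution: instead of an elliptic envelope, this sketch uses a much simpler circular boundary around the centroid of already-projected 2D points. It is a stand-in to show the selection loop, not the authors' model, and all names and data are hypothetical.

```python
import math


def f1_score(y_true, y_pred):
    """F1 for the positive (consanguineous / outlier) class."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    if tp == 0:
        return 0.0
    precision, recall = tp / (tp + fp), tp / (tp + fn)
    return 2 * precision * recall / (precision + recall)


def fit_circle(train_points, contamination):
    """Stand-in detector: radius leaving about `contamination` of training points outside."""
    n = len(train_points)
    center = (sum(x for x, _ in train_points) / n, sum(y for _, y in train_points) / n)
    distances = sorted(math.dist(point, center) for point in train_points)
    n_outliers = int(contamination * n)
    radius = distances[max(n - n_outliers - 1, 0)]
    return center, radius


def predict_outliers(points, center, radius):
    """True means the point falls outside the boundary (predicted consanguineous)."""
    return [math.dist(point, center) > radius for point in points]


def select_contamination(train_points, val_points, val_labels, grid):
    """Return (contamination, F1) with the best validation F1; ties keep the first."""
    best = None
    for contamination in grid:
        center, radius = fit_circle(train_points, contamination)
        score = f1_score(val_labels, predict_outliers(val_points, center, radius))
        if best is None or score > best[1]:
            best = (contamination, score)
    return best


# Toy 2D data standing in for MDS-projected samples.
train = [(0, 0), (1, 0), (-1, 0), (0, 1), (0, -1), (0, 2), (0, -2), (2, 0), (-2, 0), (0, 0)]
validation_points = [(0.5, 0.0), (5.0, 5.0)]
validation_labels = [False, True]
best_contamination, best_f1 = select_contamination(
    train, validation_points, validation_labels, grid=(0.1, 0.5)
)
```

F1 is used here for the same reason given in the text: with few positive samples, accuracy alone would reward a detector that flags nothing.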

3. Results

This section presents the results obtained; Figure 5 provides an overview.

3.1. Personalized Multigene Panels

Personalized multigene panels streamline diagnostics by narrowing down the number of genes analyzed, leading to a more targeted and efficient diagnosis. In cases of suspected AR diseases, or when the patient’s consanguinity status is known, ROH analysis can help in the selection of the appropriate multigene panel. These panels can also be further tailored to the patient’s specific phenotype by using HPO terms. Integrating this process into a web application provides a user-friendly interface, making it accessible to other professionals within the genetic testing center.

During the process of creating the web application for personalized multigene panels based on ROHs, HPO, and a combination of both, several tests were conducted.

For testing the personalized multigene panels based on ROHs, several randomly selected patients were used and the identified genes’ coordinates were checked to ensure they fell within the identified ROHs.

The testing process of the creation of personalized multigene panels based on HPO terms was divided into three parts:

The creation of 17 multigene panels based on a single HPO term: HP:0001627 (abnormal heart morphology); HP:0001047 (atopic dermatitis); HP:0005584 (renal cell carcinoma); HP:0001789 (hydrops fetalis); HP:0011842 (abnormal skeletal morphology); HP:0000846 (adrenal insufficiency); HP:0003155 (elevated circulating alkaline phosphatase concentration); HP:0000548 (cone/cone–rod dystrophy); HP:0011510 (drusen); HP:0000365 (hearing impairment); HP:0000925 (abnormality of the vertebral column); HP:0001949 (neoplasm of the gastrointestinal tract); HP:0007373 (motor neuron atrophy); HP:0006530 (abnormal pulmonary interstitial morphology); HP:0012211 (abnormal renal physiology); HP:0001733 (pancreatitis); HP:0000556 (retinal dystrophy);
The creation of three multigene panels based on multiple HPO terms: HP:0000077 (abnormality of the kidney), HP:0100243 (leiomyosarcoma), and HP:0100522 (thymoma); HP:0100574 (biliary tract neoplasm) and HP:0003003 (colon cancer); and HP:0003198 (myopathy) and HP:0003473 (fatigable weakness);
The creation of five personalized multigene panels, each based on a single HPO term for which a panel had previously been manually prepared and curated (HP:0000126 (hydronephrosis); HP:0001250 (seizure); HP:0010566 (hamartoma); HP:0012091 (abnormality of pancreas physiology); and HP:0012114 (endometrial carcinoma)), and comparison of the manual panels with the results obtained.

3.1.1. Output Obtained for Each Multigene Panel

The output for the multigene panels based on ROHs, for each input line, is a CSV (*.csv) file with the gene symbols that belong to each of the lists (white, grey, black). If CNVs’ results are available for the sample being analyzed, a BED file with the CNVs’ genomic coordinates is retrieved, as well as a text file with the Jaccard index of the overlap.

The output for the multigene panels based on HPO terms, for each input line, is a CSV (*.csv) file with the gene symbols that belong to each of the lists (white, grey, black). The output provided by the different multigene panels depends on the number of samples and on the number of HPO terms.

As for personalized multigene panels based simultaneously on ROHs and HPO terms, the output is a CSV (*.csv) file with the gene symbols that are associated with the HPO term(s) under analysis and located within the ROHs identified in the sample being analyzed. If CNV results are available for the sample under analysis, a BED file with the CNVs is retrieved, as well as a text file with the Jaccard index.

3.1.2. Application of New Bioinformatic Resources in a Clinical Case

The clinical case presented concerns two sisters, daughters of a consanguineous couple, with a phenotype of epilepsy, myoclonus, and dystonia with onset during infancy (Figure 6). Even after several genetic tests, including an analysis of the entire WES data, both remained genetically undiagnosed for several years.

The HPO term used was HP:0001336 (myoclonus). Myoclonus is characterized by involuntary, random muscular contractions that occur at rest, in response to a stimulus, or during voluntary movements. Figure 7 presents the HTML forms filled in with the HPO term used and the path to the VCF file of one of the sisters to create the personalized multigene panel.

The CSV (*.csv) output file contains the “Summary”, “White list”, “Grey list” and “Black list”. According to the HPO API, HP:0001336 (myoclonus) comprises 360 genes (accessed 25 September 2023). The results obtained consisted of 11 genes listed for sister II:1 and 3 genes listed for sister II:2. All genes identified in both sisters were from the white list; no genes were in the grey and black lists. The white-list genes selected for each sister, obtained by intersecting the genes within the identified ROHs with those associated with the HPO term used, are presented in Table 1. Of the 360 genes associated with HP:0001336, 3 genes were shared between the two sisters: CSTB, SIK1, and SLC32A1.
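
The intersection logic described above can be sketched in Python with plain sets. The function name is hypothetical, and the gene sets are deliberately tiny: only CSTB, SIK1, and SLC32A1 come from the text, while the `PLACEHOLDER_*` and `UNRELATED_*` symbols are invented stand-ins (the real inputs are the 360 HPO-associated genes and the genes within each sister's ROHs).

```python
def roh_hpo_panel(hpo_genes, genes_in_rohs):
    """Genes associated with the HPO term that also lie within the patient's ROHs."""
    return sorted(set(hpo_genes) & set(genes_in_rohs))


# Illustrative stand-in for the genes associated with HP:0001336.
hpo_genes = {"CSTB", "SIK1", "SLC32A1", "PLACEHOLDER_1", "PLACEHOLDER_2"}

# Illustrative stand-ins for the genes located within each sister's ROHs.
sister_1_roh_genes = {"CSTB", "SIK1", "SLC32A1", "PLACEHOLDER_1", "UNRELATED_1"}
sister_2_roh_genes = {"CSTB", "SIK1", "SLC32A1", "UNRELATED_2"}

panel_1 = roh_hpo_panel(hpo_genes, sister_1_roh_genes)
panel_2 = roh_hpo_panel(hpo_genes, sister_2_roh_genes)

# Candidates shared by both affected siblings.
shared = sorted(set(panel_1) & set(panel_2))
print(shared)
```

Restricting attention to genes shared by both affected siblings is a natural prioritization step under a recessive model, since both are expected to carry the same causal locus in a homozygous region.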

Assuming AR inheritance, and given the strong correlation between the phenotype associated with defects in the CSTB gene and the patients’ phenotypes, variant data were further inspected. Integrative Genomics Viewer (IGV), a visualization tool, was used in the genomic data analysis. The VCF files, BAM files, and respective index files were loaded into IGV, with the all_cds.bed file as reference. The analysis was conducted using the human reference genome version GRCh37. The visualization results for the CSTB gene are depicted in Figure 8. No disease-causing variants were identified in the genomic regions covered by WES data.

Considering that the mutational spectrum associated with disease-causing variants in the CSTB gene includes the expansion of a repetitive region [86], the 5′UTR region where the dodecamer repeat CCC-CGC-CCC-GCG is located was visually inspected (Figure 9). No reads aligned to this region in either patient, whereas reads were present in the control sample. This is compatible with a large expansion of the dodecamer repeat in both alleles of the gene. A biallelic expansion within the pathogenic range (≥30 repeats) was indeed confirmed by targeted conventional approaches (fragment analysis and long-range PCR), making this variant diagnostic for the disease.
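
To make the repeat threshold concrete, the sketch below counts the longest uninterrupted run of the dodecamer in a sequence and compares it with the ≥30-repeat pathogenic range quoted above. This is purely illustrative and is not how the expansion was confirmed in this case (that relied on fragment analysis and long-range PCR). It assumes a sequence that fully spans the repeat, which short WES reads would not provide for a large expansion; the function names are hypothetical.

```python
import re

DODECAMER = "CCCCGCCCCGCG"      # CCC-CGC-CCC-GCG without the separators
PATHOGENIC_MIN_REPEATS = 30     # threshold quoted in the text


def longest_repeat_run(sequence, unit=DODECAMER):
    """Largest number of consecutive, uninterrupted copies of `unit`."""
    runs = re.findall(f"(?:{re.escape(unit)})+", sequence.upper())
    return max((len(run) // len(unit) for run in runs), default=0)


def in_pathogenic_range(repeat_count):
    """True if the repeat count reaches the pathogenic range."""
    return repeat_count >= PATHOGENIC_MIN_REPEATS


# Hypothetical sequences: a short allele with 3 copies, an expanded one with 31.
short_allele = "AT" + DODECAMER * 3 + "GG"
expanded_allele = "AT" + DODECAMER * 31 + "GG"

for name, seq in [("short", short_allele), ("expanded", expanded_allele)]:
    n = longest_repeat_run(seq)
    print(name, n, in_pathogenic_range(n))
```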

3.2. First Portuguese ROH Characterization on a Genomic Scale

After the down-sampling process, the 3941 samples were submitted to ROH analysis to portray, for the first time, the Portuguese landscape of ROHs at a genomic level. The study presented in this section helps fill the gap in data on ROH distribution in Portugal and is of great interest for genetic testing and population genetics.

3.2.1. Distribution of ROHs per Length in Portugal

We began by analyzing the distribution of ROHs larger than 0.5 Mb, identifying a total of 19,407 ROHs. For an overview of the results, Figure 10 depicts the distribution of these ROHs across different length intervals in Mb.
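
The binning behind a length distribution such as this one can be sketched in Python as follows. The interval edges are assumptions chosen for illustration (the exact intervals of Figure 10 are not reproduced here), intervals are treated as closed on the left and open on the right, and `bin_label` and `length_histogram` are hypothetical helper names.

```python
from bisect import bisect_right
from collections import Counter

# Assumed interval edges in Mb; the last edge opens an unbounded final bin.
BIN_EDGES_MB = [0.5, 1.0, 1.5, 2.0, 3.0, 4.0, 5.0, 10.0, 15.0]


def bin_label(length_mb):
    """Label of the [low, high) interval containing length_mb, or None if below the minimum."""
    i = bisect_right(BIN_EDGES_MB, length_mb) - 1
    if i < 0:
        return None
    low = BIN_EDGES_MB[i]
    if i + 1 == len(BIN_EDGES_MB):
        return f">={low}"
    return f"{low}-{BIN_EDGES_MB[i + 1]}"


def length_histogram(lengths_mb):
    """Count ROHs per length interval, ignoring ROHs below the minimum length."""
    labels = (bin_label(length) for length in lengths_mb)
    return Counter(label for label in labels if label is not None)


# Hypothetical ROH lengths in Mb.
print(length_histogram([0.6, 0.7, 4.5, 0.2, 72.42]))
```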

3.2.2. Maps of Portugal and Respective Data for F_ROH > 0.5, 1.5 and 5 Mb

The results from the genome-wide autozygosity measure from ROH (F_ROH) are presented in this section. The F_ROH mean values per municipality, with ROHs of size greater than 0.5, 1.5, and 5 Mb, are presented in Supplementary File S2, and the details of the interactive maps, designed for more detailed and interactive navigation, are presented in Supplementary File S3.
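
As a reminder of how these values are obtained, F_ROH is commonly defined as the summed length of an individual's ROHs above a minimum size divided by the length of the autosomal genome covered by the assay (L_auto). The Python sketch below applies that definition and then averages per municipality. The value of `L_AUTO_MB`, the strict "greater than" comparison at the threshold, the function names, and the sample data are all assumptions made for illustration.

```python
from collections import defaultdict
from statistics import mean

THRESHOLDS_MB = (0.5, 1.5, 5.0)
L_AUTO_MB = 100.0  # hypothetical covered autosomal length, NOT the value used in this work


def f_roh(roh_lengths_mb, l_auto_mb, min_length_mb):
    """Fraction of the covered autosomal genome lying in ROHs longer than min_length_mb."""
    return sum(length for length in roh_lengths_mb if length > min_length_mb) / l_auto_mb


def mean_f_roh_by_municipality(samples, l_auto_mb, min_length_mb):
    """samples: iterable of (municipality, [ROH lengths in Mb]) pairs, one per individual."""
    groups = defaultdict(list)
    for municipality, roh_lengths in samples:
        groups[municipality].append(f_roh(roh_lengths, l_auto_mb, min_length_mb))
    return {municipality: mean(values) for municipality, values in groups.items()}


# Hypothetical individuals; an empty list means no ROHs were detected.
samples = [
    ("Municipality A", [0.6, 2.0, 6.0]),
    ("Municipality A", []),
    ("Municipality B", [1.0]),
]

for threshold in THRESHOLDS_MB:
    print(threshold, mean_f_roh_by_municipality(samples, L_AUTO_MB, threshold))
```

In this sketch, individuals with no qualifying ROHs contribute an F_ROH of zero to the municipality mean, which is why the mean falls as the threshold rises.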

Here, we present the resulting classical maps of the representative Portuguese sample of 3941 patients for the ROH characterization on a genomic scale. In Figure 11, the classical map of Portugal presents the geographical distribution of the mean F_ROH, for ROHs with size greater than 0.5 Mb (F_ROH > 0.5 Mb), per municipality. The mean value of F_ROH > 0.5 Mb is 0.004.

The municipality of Alter do Chão from Portalegre district is the municipality with the highest value of F_ROH > 0.5 Mb (0.088). The lowest value of F_ROH > 0.5 Mb (0.0004) is from the Manteigas municipality (Guarda district). The municipality of Machico has the highest value of F_ROH > 0.5 Mb (0.025) of the Autonomous Region of Madeira. In the Autonomous Region of Açores, the values of F_ROH > 0.5 Mb are not very high, with the highest one being 0.026 in Vila do Porto municipality.

The mean F_ROH > 0.5 Mb intervals and the corresponding number of individuals per interval are presented in Table 2. From the 3941 samples, 3760 have F_ROH > 0.5 Mb within the mean F_ROH > 0.5 Mb intervals, 170 have F_ROH > 0.5 Mb equal to zero, and 9 are above the mean F_ROH > 0.5 Mb intervals. This accounts for a total of 3939 samples represented in Figure 11, used to calculate the F_ROH > 0.5 Mb mean, from which 3769 presented F_ROH > 0.5 Mb different from zero. The first interval, from 0.000 to 0.004, contains the highest number of samples (3086). There is a drastic drop in the number of samples within the second interval (from 0.004 to 0.006) and the number decreases until the last interval (from 0.034 to 0.088).

In Figure 12, the classical map of Portugal for the F_ROH mean for ROHs with size higher than 1.5 Mb (F_ROH > 1.5 Mb), per municipality, is presented. The mean value of F_ROH > 1.5 Mb is 0.003.

The municipality of Alter do Chão from Portalegre district is still the municipality with the highest value of F_ROH > 1.5 Mb, with a value of 0.085, which is lower than the 0.088 from the previous F_ROH > 0.5 Mb. The lowest value of F_ROH > 1.5 Mb (0.0001) is from Felgueiras municipality from the district of Porto. The municipality of Machico still contains the highest value of F_ROH > 1.5 Mb (0.024) in the Autonomous Region of Madeira. The municipality of Vila do Porto with a mean F_ROH of 0.024 is the municipality with the highest value from the Autonomous Region of Açores.

Table 3 presents the corresponding number of samples that fit within each F_ROH > 1.5 Mb interval. The total number of samples with F_ROH > 1.5 Mb within the mean F_ROH > 1.5 Mb intervals is considerably lower, at 2076 of the 3941; 1825 samples have F_ROH > 1.5 Mb equal to zero and 10 samples have an F_ROH > 1.5 Mb higher than the mean intervals. This results in a total of 3911 samples represented in the map, used to calculate the F_ROH mean per municipality, of which 2086 presented F_ROH > 1.5 Mb different from zero. The distribution of samples across intervals is similar to the one observed in Table 2. The first interval, from 0.000 to 0.003, contains the highest number of samples (1418). There is a drastic drop in the number of people with F_ROH in the interval from 0.003 to 0.005, and the number decreases until the last interval (from 0.033 to 0.085).

In Figure 13, the classical map of Portugal for the F_ROH mean for ROHs with a size greater than 5 Mb (F_ROH > 5 Mb), per municipality, is presented. The mean value of F_ROH > 5 Mb is 0.002. The decreasing tendency of the mean F_ROH value is expected, since each increase in the minimum ROH size threshold increases the number of people whose F_ROH value equals zero.

The municipality of Alter do Chão from Portalegre district is still the municipality with the highest value of mean F_ROH > 5 Mb. The value of mean F_ROH is 0.074, lower than the 0.085 observed when F_ROH > 1.5 Mb. The lowest value of F_ROH > 5 Mb (0.0001) is from Santo Tirso from the district of Porto. The municipality of Machico is still the one with the highest value of mean F_ROH > 5 Mb (0.017) in the Autonomous Region of Madeira, but also showing a decreasing tendency, because the number of people with ROHs > 5 Mb is lower. The municipality of Vila do Porto with a mean F_ROH of 0.02 is the municipality with the highest value in the Autonomous Region of Açores.

Table 4 presents the corresponding number of samples that fit within each F_ROH > 5 Mb interval. The total number of samples with F_ROH > 5 Mb within the mean F_ROH > 5 Mb intervals is even lower than before, at 737 of the 3941; 2985 samples have F_ROH > 5 Mb equal to zero and 8 samples have an F_ROH > 5 Mb higher than the mean intervals. This results in a total of 3730 samples represented in the map, used to calculate the F_ROH mean per municipality, of which 745 presented F_ROH > 5 Mb different from zero. Contrary to the previous tables (Table 2 and Table 3), the first interval, in this case from 0.000 to 0.002, has the lowest number of samples (36) of all the intervals. The interval with the highest number of people is from 0.002 to 0.004, with 314 people; then, the distribution tendency is similar to the previous tables, showing a decrease up to the last interval.

The F_ROH mean distribution throughout Portugal is heterogeneous, as seen in the maps (Figure 11, Figure 12 and Figure 13). The number of municipalities with no sample data is 16. Overall, the ROH minimum size thresholds applied to calculate the F_ROH mean left some municipalities unrepresented. This effect is explained by the scarce number of individuals with ROHs of size greater than 1.5 Mb. Initially, there were 3941 samples; after applying the first threshold (considering only the ROHs with size above 0.5 Mb), the total sample size dropped to 3939, then to 3911 with the 1.5 Mb threshold, and finally to 3730 with the 5 Mb threshold.
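
The sample counts reported for Tables 2 to 4 can be cross-checked with a few lines of Python. All numbers below are taken from the text; only the variable names and the layout of the dictionary are choices made for this sketch.

```python
TOTAL_SAMPLES = 3941

# Per threshold (Mb): samples within the mean intervals, samples with F_ROH equal
# to zero, and samples above the mean intervals, as reported in the text.
counts = {
    0.5: {"within": 3760, "zero": 170, "above": 9},
    1.5: {"within": 2076, "zero": 1825, "above": 10},
    5.0: {"within": 737, "zero": 2985, "above": 8},
}

summary = {}
for threshold, c in counts.items():
    represented = c["within"] + c["zero"] + c["above"]  # samples shown on the map
    nonzero = c["within"] + c["above"]                  # samples with F_ROH different from zero
    not_represented = TOTAL_SAMPLES - represented       # samples dropped at this threshold
    summary[threshold] = (represented, nonzero, not_represented)
    print(threshold, represented, nonzero, not_represented)
```

The totals agree with the figures quoted above (3939, 3911, and 3730 represented samples, with 3769, 2086, and 745 non-zero values), leaving 2, 30, and 211 samples unrepresented at the three thresholds.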

3.2.3. Comparison with Other Studies

To compare our results, we used the reference values from a similar study developed using an insular population from the Orkney Isles in northern Scotland [87].

Table 5 contains the mean of all F_ROH values calculated for all samples, the mean of the means per municipality, and also the previously mentioned reference values for F_ROH [87].

The mean F_ROH across all samples is lower than the mean of the per-municipality F_ROH means for all ROH thresholds. When comparing the Portuguese mean F_ROH values with the reference values, the mean for F_ROH > 0.5 Mb is 0.0042, which is lower than the reference value of 0.0315. The Portuguese mean value for F_ROH > 1.5 Mb is 0.0033 and that for F_ROH > 5 Mb is 0.0020; both are above the reference values presented in [87], 0.0021 and 0.0001, respectively.
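
One plausible reason the two summaries differ is weighting: the overall mean weights every individual equally, whereas the mean of municipality means weights every municipality equally, so small municipalities with high F_ROH have more influence on the latter. The Python sketch below illustrates the effect with invented values; it is not derived from the study data.

```python
from statistics import mean

# Hypothetical F_ROH values: a large municipality with low values and a
# small municipality with high values.
f_roh_by_municipality = {
    "large_municipality": [0.001] * 8,
    "small_municipality": [0.010, 0.030],
}

# Mean over all individuals (each sample has the same weight).
pooled_mean = mean(
    value for values in f_roh_by_municipality.values() for value in values
)

# Mean of the municipality means (each municipality has the same weight).
mean_of_means = mean(mean(values) for values in f_roh_by_municipality.values())

print(pooled_mean, mean_of_means)
```

In this toy example the pooled mean is about 0.0048 while the mean of means is about 0.0105, the same direction of difference as reported in Table 5.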

To enrich this study, we compared our results with data from other populations available in the literature. In a European cohort of 11,919 Alzheimer’s disease cases and 9181 controls, the mean F_ROH > 1.5 Mb was 0.011 [88]. This value is higher than our calculated F_ROH > 1.5 Mb mean value, possibly because that cohort includes Alzheimer’s disease cases. A study using a cohort from the 1000 Genomes Project Phase 3 also revealed higher F_ROH > 1.5 Mb mean values in European populations [10]. In contrast, our study, which analyzes a considerably larger cohort of 3911 individuals for F_ROH > 1.5 Mb, reports a notably lower mean F_ROH > 1.5 Mb (0.0033). The F_ROH > 1.5 Mb mean values for the European populations in that cohort are presented in Supplementary File S5 [10].

We also compared our results with a study examining the prevalence of consanguineous marriages in Portugal between 1980 and 1986 [89]. The map from this study was colorized to align with the F_ROH mean for ROHs larger than 0.5, 1.5, and 5 Mb, and is presented in Figure 14 [89]. According to this figure, the Autonomous Region of Madeira exhibits the highest number of consanguineous marriages, closely followed by the Autonomous Region of the Açores. This observation can be attributed to the isolation of island populations, due to limited population mobility during the 1980s. However, with improved transportation infrastructure, population movement to and from the islands has since become easier. In Portugal Mainland, the district with the highest incidence of consanguineous marriages is Bragança [89]. Furthermore, the top five districts (as shown in Table 6) with the highest number of consanguineous marriages, listed in descending order, are Madeira, the Açores, Bragança, Viseu, and Vila Real.

Our findings, as presented in Table 6, reveal that the top five districts with the highest F_ROH mean, considering thresholds of 0.5, 1.5, and 5 Mb, remain consistent. The ranking from highest to lowest F_ROH mean value for 0.5 and 1.5 Mb thresholds is the following: Portalegre, Viseu, Bragança, Madeira, and Vila Real. Meanwhile, the ranking for the 5 Mb threshold is Portalegre, Bragança, Viseu, Madeira, and Vila Real.

Supplementary File S6 presents more detail about the Portuguese population distribution per district according to the Census from 2021, as well as the 3941 samples used and the sample containing F_ROH values different from zero calculated for the 0.5, 1.5, and 5 Mb thresholds per district.

The demographic origins of the ROHs [90] explored in our samples are presented in Supplementary File S4.

3.3. Consanguinity Classification Results

The results of the outlier detection consanguinity classification approach can be found in Table 7. Meanwhile, visualizations of the low-dimensional data, separated by training and validation vs. testing datasets, are shown in composite Figure 15.

The F1-scores associated with our outlier detection model remained in the range of 0.9412–0.9615 across all feature set tiers of the held-out testing data, which was comparable to the performance range of 0.9310–0.9655 on the validation set, indicating successful generalization of each model to unseen data points. The inclusion of additional descriptive statistic features (i.e., Min, Max, Mean, STD) provided only marginal predictive benefit on the held-out test set, as evidenced by an F1-score increase of only 0.0203, indicating that the “Count_x” (i.e., the number of ROHs in each chromosome) and “Sum_x” (i.e., the sum of ROH lengths in each chromosome) features provided the majority of the predictive power with respect to consanguinity classification in this framework. Moreover, the optimal contamination hyperparameters were observed in the range 0.0786–0.1190, which may approximate the true proportion of the population that would be labeled consanguineous.
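
The per-chromosome features and the F1-score can be sketched in Python as follows. Only the feature names (Count_x, Sum_x, Min, Max, Mean, STD) come from the text; restricting to the 22 autosomes, filling chromosomes without ROHs with zeros, using the population standard deviation, and the function names are assumptions. The outlier detection model itself is not reproduced here.

```python
from collections import defaultdict
from statistics import mean, pstdev

AUTOSOMES = [str(i) for i in range(1, 23)]  # assumption: autosomes only


def roh_features(rohs, extended=False):
    """Build per-chromosome features from (chromosome, length_mb) pairs.

    With extended=False only Count_x and Sum_x are produced; with
    extended=True the descriptive statistics are added as well.
    """
    lengths = defaultdict(list)
    for chrom, length_mb in rohs:
        lengths[chrom].append(length_mb)
    features = {}
    for chrom in AUTOSOMES:
        values = lengths.get(chrom, [])
        features[f"Count_{chrom}"] = len(values)
        features[f"Sum_{chrom}"] = sum(values)
        if extended:
            features[f"Min_{chrom}"] = min(values, default=0.0)
            features[f"Max_{chrom}"] = max(values, default=0.0)
            features[f"Mean_{chrom}"] = mean(values) if values else 0.0
            features[f"STD_{chrom}"] = pstdev(values) if values else 0.0
    return features


def f1_score(y_true, y_pred):
    """F1 = 2TP / (2TP + FP + FN), with the positive class encoded as a truthy label."""
    tp = sum(1 for t, p in zip(y_true, y_pred) if t and p)
    fp = sum(1 for t, p in zip(y_true, y_pred) if not t and p)
    fn = sum(1 for t, p in zip(y_true, y_pred) if t and not p)
    denominator = 2 * tp + fp + fn
    return 2 * tp / denominator if denominator else 0.0


# Hypothetical individual with three ROHs.
features = roh_features([("1", 2.0), ("1", 4.0), ("3", 1.5)])
print(features["Count_1"], features["Sum_1"], features["Count_2"])

# Hypothetical labels (1 = consanguineous) and predictions.
print(f1_score([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
```

The reported gain from the extended features corresponds to the difference between the two ends of the test range, 0.9615 − 0.9412 = 0.0203.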

4. Discussion

Clinical diagnostics is evolving towards more personalized approaches, demanding the development of new genetic tests and the adaptation of existing ones. There are platforms created to support virtual gene panel curation, such as Genomics England PanelApp [91]. This is a database for storing virtual gene panel information while gathering community feedback, helping to build a consensus on the evidence needed to establish a gene–disease association.

With this work, we automated the process of creating personalized multigene panels based on different scenarios. In general, all panels described in this work take less time to create, which frees professionals for other tasks and can increase the number of tests carried out, and they help narrow down the number of genes being analyzed. Since all the described tests are based on WES data, physicians can request a reanalysis of patient data without the need for additional sample sequencing, thereby conserving resources. In terms of storage, a single sequencing run per patient and a more personalized approach mean that less data are analyzed and, consequently, less space is allocated.

Multigene panels based only on the specific ROHs identified in a patient narrow the analysis to genes located within the identified ROHs, thereby increasing the likelihood of detecting recessive disease-causing variants in those genes. Since the CNV assessment was also included in these panels, the diagnostic technician analyzing the case knows where the CNVs overlap with ROHs, eliminating this confounding factor from ROH analysis.

The multigene panels based only on HPO terms take into consideration the possible phenotype or phenotypes that the patient presents. Resorting to a publicly available database, HPO, the panel is built only with the genes that are within the specified terms.

The use of HPO terms and ROH overlap results, simultaneously, is an even more personalized approach, by narrowing down the list of genes to create personalized multigene panels. In this case, only the genes associated with the HPO term(s) in analysis are taken into account and their presence is checked in the patient’s ROHs. This allows a higher level of personalization, proven to be useful in the clinical case of the two sisters previously presented.

After obtaining the gene lists from the two sisters’ case, the genes common to both were analyzed and visualized in IGV. The results were interpreted using the Online Mendelian Inheritance in Man (OMIM) database to find the phenotype associated with the genes. According to OMIM, the CSTB gene is associated with the phenotype “Epilepsy, progressive myoclonic 1A (Unverricht and Lundborg)” (gene MIM number 601145 and phenotype MIM number 254800), and the mode of inheritance is AR. This result is in accordance with the analysis performed, since the ROHs are used to target diseases with AR modes of inheritance. The gene SIK1, on the other hand, is associated with the phenotype “Developmental and epileptic encephalopathy 30”, with an autosomal dominant (AD) mode of inheritance. The gene SLC32A1 does not currently have an associated phenotype, but the gene encodes an amino acid transporter that loads gamma-aminobutyric acid (GABA) and glycine into synaptic vesicles. Even though no variants were identified in these genes, the diagnosis of epilepsy was only possible through this multigene panel approach, which led to the visualization of the biallelic expansion in the CSTB gene.

Knowing our cohort background in terms of ROHs is also important. In our study, we showed the ROH distribution in Portugal (Figure 10), where most ROHs (9358 ROHs) are within the size range of 0.5 to 1.0 Mb, followed by a decreasing tendency as the length of the ROHs increases until 4.0 Mb. Then, there is a small peak of 708 ROHs in the interval from 4.0 to 5.0 Mb, followed by a decreasing number of ROHs until the interval from 10.0 to 15.0 Mb and another decreasing tendency in the next intervals. The minimum value is 0.5 Mb, the maximum value is 72.42 Mb, and the mean value is 2.29 Mb. This is typical of a more ancient parental relatedness of the overall population.

Using the same sample with the 3941 exomes, and by calculating the F_ROH per individual using the thresholds 0.5, 1.5, and 5 Mb, it was possible to build three maps using the mean value for each municipality. Consequently, we found that the top five districts exhibiting higher F_ROH values were Portalegre, Viseu, Bragança, Madeira, and Vila Real. Furthermore, another notable finding from this study was the striking similarity between the patterns of admixed and consanguinity demographics observed in the Portuguese population when examining the number of ROHs versus the sum of ROHs, available in Supplementary File S4.

The overall mean value of F_ROH for the thresholds 0.5, 1.5, and 5 Mb decreased as the minimum threshold increased. As the minimum length increases, fewer samples contain qualifying ROHs, and each individual has fewer but larger ROHs. The municipality of Alter do Chão from Portalegre district has the highest mean F_ROH for all the presented minimum ROH size thresholds (0.5, 1.5, and 5 Mb), which might indicate more consanguinity.

The disparities observed in the data may derive from various factors, one significant contributor being the sample sizes used. In our study, we analyzed a sample of 3941 individuals, with 3769 exhibiting an F_ROH different from zero. In contrast, the comparative study only included 49 individuals. Notably, we compared populations from different geographic regions: an insular population from the Orkney Isles in northern Scotland [87] with a comprehensive Portuguese population encompassing individuals from Portugal Mainland as well as the Autonomous Regions of Madeira and the Açores. Another differing factor was the way L_auto was calculated: the reference study used the length of the autosomal genome covered by SNPs in an array, excluding the centromeres, whereas in our study we used the length of the autosomal genome covered by WES, excluding the centromeres.

According to the results from the study shown in Figure 14 [89], the Autonomous Region of Madeira exhibits the highest number of consanguineous marriages, closely followed by the Autonomous Region of the Açores. This observation can be attributed to the isolation of island populations due to limited population mobility during the 1980s. However, improved transportation infrastructure has since increased population mobility to and from the islands. In Portugal Mainland, the district with the highest incidence of consanguineous marriages is Bragança [89]. Furthermore, the top five districts (as shown in Table 6) with the highest number of consanguineous marriages, listed in descending order, are Madeira, the Açores, Bragança, Viseu, and Vila Real.

Our findings revealed that the top five districts with the highest F_ROH mean remain consistent, considering thresholds of 0.5, 1.5, and 5 Mb. The ranking from highest to lowest F_ROH mean value for 0.5 and 1.5 Mb thresholds is the following: Portalegre, Viseu, Bragança, Madeira, and Vila Real. Meanwhile, the ranking for the 5 Mb threshold is Portalegre, Bragança, Viseu, Madeira, and Vila Real.

Portalegre stands out with the highest F_ROH mean values across all three thresholds in our data, despite having fewer consanguineous marriages compared to other districts. This might be due to our sample including fewer individuals from Portalegre, which may suggest that those from this region who were referred for genetic testing were more likely to have been screened due to consanguinity. Surprisingly, the results show low F_ROH values across all three thresholds for the Autonomous Region of the Açores, which are not in accordance with the reference data on consanguineous marriages. This is possibly due to insufficient sample localization. Porto, the district with the lowest number of consanguineous marriages, also shows the lowest F_ROH mean across all thresholds. Additionally, we were unable to acquire information regarding the country of origin or birth of individuals included in the sample; this parameter was not used as an exclusion criterion. Moving forward, we should take this into consideration because certain countries present higher levels of consanguinity due to religious and cultural practices. We must also acknowledge the potential presence of samples from immigrants residing in our country, which could introduce biases into the data.

The presence of an admixed pattern denotes our country’s history, whilst the consanguineous pattern is a result of the marriages between cousins, leading to an increase in the sum of ROHs.

The data used for this work originated from the genetic testing activities performed at CGPP, thus having a higher likelihood of having individuals with genetic diseases (including autosomal recessive inheritance). To reduce possible biases in our representative sample, during the sample selection process, we prioritized healthy individuals (e.g., healthy parents when analyzing trios). We de-prioritized the number of male samples and excluded pre-natal diagnosis samples and individuals identified as consanguineous in our database. We also restricted our dataset to unrelated individuals (detailed process in Supplementary File S1).

Having a model to predict patient consanguinity based on ROH features is useful in clinical centers. Such a model can be used in the genetic test decision process and for assessing the risk of recessive diseases, given that consanguinity increases the risk of recessive genetic diseases. Tier 0 (count and sum of ROHs) of the model presented provided the majority of the predictive power for consanguinity classification (test F1-score of 0.94). A test F1-score of 0.96 was achieved with additional features.

Although ROH analysis is a powerful tool for identifying recessive disease-causing variants, its utility may be reduced in patients who are offspring of non-consanguineous parents. These individuals tend to have fewer and smaller ROHs, which may go undetected by some algorithms, thus complicating the identification of pathogenic variants causing recessive diseases. In such cases, smaller ROHs may be mistaken for population genetic artefacts, limiting the diagnostic yield of this approach.

Transitioning from WES to WGS might open some doors in terms of genetic testing, by adding insights into the non-coding regions of the genome. This will be of particular interest for undiagnosed patients and may accelerate diagnosis.

With this work, we demonstrated the applicability and utility of the newly developed resources and their impact on diagnostics, by solving the genetic etiology of a rare recessive disease. The representative sample of 3941 WES individuals used in this work allowed us to provide an extensive analysis of ROHs on a genomic scale for the first time ever in the Portuguese population. In summary, this research advances ROH analysis using WES data, highlighting its diagnostic potential and significance in population genetic characterization.

Supplementary Materials

The following supporting information can be downloaded at https://www.mdpi.com/article/10.3390/biomedinformatics4040128/s1: Supplementary File S1: Representative WES sample of the Portuguese population; Supplementary File S2: F_ROH for ROHs of size greater than 0.5, 1.5 and 5 Mb, per municipality; Supplementary File S3: Interactive maps; Supplementary File S4: Demographic origins of ROHs in Portugal; Supplementary File S5: Comparative study values; Supplementary File S6: Portugal population distribution.

Author Contributions

Conceptualization, P.F.S. and J.O.; methodology, S.V., M.R., F.A., J.S., N.M., P.F.S. and J.O.; software, S.V., M.R., J.S., F.A. and D.S.; writing—original draft preparation, S.V.; writing—critical review, J.O.; writing—review and editing, P.F.S., M.R., F.A., J.S., N.M., J.P.F. and D.S.; supervision, P.F.S. and J.O.; project administration, J.P.F. and J.O. All authors have read and agreed to the published version of the manuscript.

Funding

This research received no external funding.

Institutional Review Board Statement

This study was conducted in accordance with the Declaration of Helsinki; the generation of the WES sample representative of the Portuguese Population was carried out within the scope of a project approved by the Institutional Ethics Committee of i3S (protocol code 10/CECRI/2023; date of approval: 28 September 2023).

Informed Consent Statement

The project’s work involves the reanalysis of human sequencing genetic data. All patients involved in this study were part of the diagnostic process by their referring medical doctor. Clinicians obtained their patients’ written informed consent (or that of legal guardians if minors) to perform genetic studies. All data used are fully anonymized and kept according to the authorization of the CNPD (Portuguese Data Protection Authority) to CGPP. Data are stored in an encrypted database on a dedicated server at the IBMC/i3S data center, accessible only through the internal network by authorized personnel under confidentiality agreements. Data are fully anonymized and handled in an aggregated manner. Any personally identifiable information (including personal identifiers) will be removed, making re-identification virtually impossible. This approach adheres to the principle of confidentiality and respects the privacy rights of the individuals whose data are being analyzed. We also had a Data Protection Impact Assessment (DPIA) carried out with the active collaboration of the Data Protection Officer of our institution.

Data Availability Statement

The pseudocode used in this work will be available for consultation in the MSc dissertation, which will be accessible in November 2024 (http://hdl.handle.net/10773/39751, accessed on 21 November 2024). Please note that no public repository is currently available.

Acknowledgments

The authors express their gratitude for the initial conditions provided by Professor Emeritus Jorge Sequeiros (founder of CGPP and former UnIGENe research group PI) and the assistance of Victor Mendes (CGPP laboratory database manager).

Conflicts of Interest

The authors declare no conflicts of interest.

References

Oliveira, J.; Pereira, R.; Santos, R.; Sousa, M. Evaluating runs of homozygosity in exome sequencing data—Utility in disease inheritance model selection and variant filtering. Commun. Comput. Inf. Sci. 2018, 881, 268–288. [Google Scholar] [CrossRef]
Peripolli, E.; Munari, D.P.; Silva, M.V.G.B.; Lima, A.L.F.; Irgang, R.; Baldi, F. Runs of homozygosity: Current knowledge and applications in livestock. Anim. Genet. 2017, 48, 255–271. [Google Scholar] [CrossRef]
Magi, A.; Tattini, L.; Palombo, F.; Benelli, M.; Gialluisi, A.; Giusti, B.; Abbate, R.; Seri, M.; Gensini, G.F.R.; Romeo, G.; et al. H3M2: Detection of runs of homozygosity from whole-exome sequencing data. Bioinformatics 2014, 30, 2852–2859.
Oniya, O.; Neves, K.; Ahmed, B.; Konje, J.C. A review of the reproductive consequences of consanguinity. Eur. J. Obstet. Gynecol. Reprod. Biol. 2019, 232, 87–96.
Marchi, N.; Mennecier, P.; Georges, M.; Lafosse, S.; Hegay, T.; Dorzhu, C.; Chichlo, B.; Ségurel, L.; Heyer, E. Close inbreeding and low genetic diversity in Inner Asian human populations despite geographical exogamy. Sci. Rep. 2018, 8, 9397.
Yengo, L.; Wray, N.R.; Visscher, P.M. Extreme inbreeding in a European ancestry sample from the contemporary UK population. Nat. Commun. 2019, 10, 3719.
Slatkin, M. A Population-Genetic Test of Founder Effects and Implications for Ashkenazi Jewish Diseases. Am. J. Hum. Genet. 2004, 75, 282–293.
Dong, J.-T. Chromosomal deletions and tumor suppressor genes in prostate cancer. Cancer Metastasis Rev. 2001, 20, 173–193.
Nalls, M.A.; Simon-Sanchez, J.; Gibbs, J.R.; Paisan-Ruiz, C.; Bras, J.T.; Tanaka, T.; Matarin, M.; Scholz, S.; Weitz, C.; Harris, T.B.; et al. Measures of autozygosity in decline: Globalization, urbanization, and its implications for medical genetics. PLoS Genet. 2009, 5, e1000415.
Ceballos, F.C.; Hazelhurst, S.; Ramsay, M. Runs of homozygosity in sub-Saharan African populations provide insights into complex demographic histories. Hum. Genet. 2019, 138, 1123–1142.
Lemes, R.B.; Nunes, K.; Carnavalli, J.E.P.; Kimura, L.; Mingroni-Netto, R.C.; Meyer, D.; Otto, P.A. Inbreeding estimates in human populations: Applying new approaches to an admixed Brazilian isolate. PLoS ONE 2018, 13, e0196360.
Ben Halim, N.; Nagara, M.; Regnault, B.; Hsouna, S.; Lasram, K.; Kefi, R.; Azaiez, H.; Khemira, L.; Saidane, R.; Ammar, S.; et al. Estimation of Recent and Ancient Inbreeding in a Small Endogamous Tunisian Community Through Genomic Runs of Homozygosity. Ann. Hum. Genet. 2015, 79, 402–417.
Kang, J.T.L.; Goldberg, A.; Edge, M.D.; Behar, D.M.; Rosenberg, N.A. Consanguinity Rates Predict Long Runs of Homozygosity in Jewish Populations. Hum. Hered. 2017, 82, 87–102.
Pemberton, T.J.; Absher, D.; Feldman, M.W.; Myers, R.M.; Rosenberg, N.A.; Li, J.Z. Genomic patterns of homozygosity in worldwide human populations. Am. J. Hum. Genet. 2012, 91, 275–292.
Kirin, M.; McQuillan, R.; Franklin, C.S.; Campbell, H.; McKeigue, P.M. Genomic Runs of Homozygosity Record Population History and Consanguinity. PLoS ONE 2010, 5, e13996.
Hunter-Zinck, H.; Musharoff, S.; Salit, J.; Al-Ali, K.A.; Chouchane, L.; Gohar, A.; Matthews, R.; Butler, M.W.; Fuller, J.; Hackett, N.R.; et al. Population genetic structure of the people of Qatar. Am. J. Hum. Genet. 2010, 87, 17–25.
Mezzavilla, M.; Cocca, M.; Maisano Delser, P.; Badii, R.; Abbaszadeh, F.; Hadi, K.A.; Giorgia, G.; Gasparini, P. Ancestry-related distribution of Runs of homozygosity and functional variants in Qatari population. BMC Genom. Data 2022, 23, 73.
Scott, E.M.; Halees, A.; Itan, Y.; Spencer, E.G.; He, Y.; Azab, M.A.; Gabriel, S.B.; Belkadi, A.; Boisson, B.; Abel, L.; et al. Characterization of Greater Middle Eastern genetic variation for enhanced disease gene discovery. Nat. Genet. 2016, 48, 1071.
Yang, X.; Al-Bustan, S.; Feng, Q.; Guo, W.; Ma, Z.; Marafie, M.; Jacob, S.; Al-Mulla, F.; Xu, S. The influence of admixture and consanguinity on population genetic diversity in Middle East. J. Hum. Genet. 2014, 59, 615–622.
Ceballos, F.C.; Gürün, K.; Altınışık, N.E.; Gemici, H.C.; Karamurat, C.; Koptekin, D.; Vural, K.B.; Mapelli, I.; Sağlıcan, E.; Sürer, E.; et al. Human inbreeding has decreased in time through the Holocene. Curr. Biol. 2021, 31, 3925–3934.e8.
Kars, M.E.; Başak, A.N.; Onat, O.E.; Bilguvar, K.; Choi, J.; Itan, Y.; Ça, C.; Palvadeau, R.; Casanova, J.-L.; Cooper, D.N.; et al. The genetic structure of the Turkish population reveals high levels of variation and admixture. Proc. Natl. Acad. Sci. USA 2021, 118, e2026076118.
Binzer, S.; Imrell, K.; Binzer, M.; Kyvik, K.O.; Hillert, J.; Stenager, E. High inbreeding in the Faroe Islands does not appear to constitute a risk factor for multiple sclerosis. Mult. Scler. 2015, 21, 996–1002.
Karafet, T.M.; Bulayeva, K.B.; Bulayev, O.A.; Gurgenova, F.; Omarova, J.; Yepiskoposyan, L.; Savina, O.V.; Veeramah, K.R.; Hammer, M.F. Extensive genome-wide autozygosity in the population isolates of Daghestan. Eur. J. Hum. Genet. 2015, 23, 1405–1412.
McLaughlin, R.L.; Kenna, K.P.; Vajda, A.; Heverin, M.; Byrne, S.; Donaghy, C.G.; Cronin, S.; Bradley, D.G.; Hardiman, O. Homozygosity mapping in an Irish ALS case-control cohort describes local demographic phenomena and points towards potential recessive risk loci. Genomics 2015, 105, 237–241.
Alabdullatif, M.A.; Al Dhaibani, M.A.; Khassawneh, M.Y.; El-Hattab, A.W. Chromosomal microarray in a highly consanguineous population: Diagnostic yield, utility of regions of homozygosity, and novel mutations. Clin. Genet. 2017, 91, 616–622.
Wang, J.C.; Ross, L.; Mahon, L.W.; Owen, R.; Hemmat, M.; Wang, B.T.; El Naggar, M.; Kopita, K.A.; Randolph, L.M.; Chase, J.M.; et al. Regions of homozygosity identified by oligonucleotide SNP arrays: Evaluating the incidence and clinical utility. Eur. J. Hum. Genet. 2015, 23, 663–671.
Prasad, A.; Sdano, M.A.; Vanzo, R.J.; Mowery-Rushton, P.A.; Serrano, M.A.; Hensel, C.H.; Wassman, E.R. Clinical utility of exome sequencing in individuals with large homozygous regions detected by chromosomal microarray analysis. BMC Med. Genet. 2018, 19, 46.
Hengel, H.; Buchert, R.; Sturm, M.; Haack, T.B.; Schelling, Y.; Mahajnah, M.; Sharkia, R.; Azem, A.; Balousha, G.; Ghanem, Z.; et al. First-line exome sequencing in Palestinian and Israeli Arabs with neurological disorders is efficient and facilitates disease gene discovery. Eur. J. Hum. Genet. 2020, 28, 1034–1043.
Palombo, F.; Graziano, C.; Al Wardy, N.; Nouri, N.; Marconi, C.; Magini, P.; Severi, G.; La Morgia, C.; Cantalupo, G.; Cordelli, D.M.; et al. Autozygosity-driven genetic diagnosis in consanguineous families from Italy and the Greater Middle East. Hum. Genet. 2020, 139, 1429–1441.
Knopp, C.; Rudnik-Schöneborn, S.; Eggermann, T.; Bergmann, C.; Begemann, M.; Schoner, K.; Zerres, K.; Brüchle, N.O. Syndromic ciliopathies: From single gene to multi gene analysis by SNP arrays and next generation sequencing. Mol. Cell. Probes 2015, 29, 299–307.
de Farias, A.A.; Nunes, K.; Lemes, R.B.; Moura, R.; Fernandes, G.R.; Melo, U.S.; Zatz, M.; Kok, F.; Santos, S. Origin and age of the causative mutations in KLC2, IMPA1, MED25 and WNT7A unravelled through Brazilian admixed populations. Sci. Rep. 2018, 8, 16552.
Wakil, S.M.; Ramzan, K.; Abuthuraya, R.; Hagos, S.; Al-Dossari, H.; Al-Omar, R.; Murad, H.; Chedrawi, A.; Al-Hassnan, Z.N.; Finsterer, J.; et al. Infantile-onset ascending hereditary spastic paraplegia with bulbar involvement due to the novel ALS2 mutation c.2761C>T. Gene 2014, 536, 217–220.
Lobo-Prada, T.; Sticht, H.; Bogantes-Ledezma, S.; Ekici, A.; Uebe, S.; Reis, A.; Leal, A. A homozygous mutation in GPT2 associated with nonsyndromic intellectual disability in a consanguineous family from Costa Rica. JIMD Rep. 2017, 36, 59–66.
Guo, T.; Tan, Z.P.; Chen, H.M.; Zheng, D.Y.; Liu, L.; Huang, X.G.; Chen, P.; Luo, H.; Yang, Y.F. An effective combination of whole-exome sequencing and runs of homozygosity for the diagnosis of primary ciliary dyskinesia in consanguineous families. Sci. Rep. 2017, 7, 7905.
Costa, P.; Zanus, C.; Faletra, F.; Ventura, G.; di Marzio, G.M.; Cervesi, C.; Carrozzi, M. Epileptic encephalopathy with microcephaly in a patient with asparagine synthetase deficiency: A video-EEG report. Epileptic Disord. 2019, 21, 466–470.
Khan, R.; Shabbir, R.M.K.; Raza, I.; Abdullah, U.; Naeem, M.A.; Ahmed, A.; Malik, S.; Hu, Z.; Xia, K. A founder RDH5 splice site mutation leads to retinitis punctata albescens in two inbred Pakistani kindreds. Ophthalmic Genet. 2020, 41, 7–12.
Yu, W.; You, X.; Wang, D.; Dong, K.; Su, J.; Li, C.; Liu, J.; Zhang, Q.; You, F.; Wang, X.; et al. Microarray analysis unmasked two siblings with pure hereditary spastic paraplegia shared a run of homozygosity region on chromosome 3q28-q29. J. Neurol. Sci. 2015, 359, 351–355.
Calderón, R.; Hernández, C.L.; García-Varela, G.; Masciarelli, D.; Cuesta, P. Inbreeding in Southeastern Spain: The Impact of Geography and Demography on Marital Mobility and Marital Distance Patterns (1900–1969). Hum. Nat. 2018, 29, 45–64.
Pippucci, T.; Magi, A.; Gialluisi, A.; Romeo, G. Detection of runs of homozygosity from whole exome sequencing data: State of the art and perspectives for clinical, population and epidemiological studies. Hum. Hered. 2014, 77, 63–72.
Lander, E.S.; Botstein, D. Homozygosity Mapping: A Way to Map Human Recessive Traits with the DNA of Inbred Children. Science 1987, 236, 1567–1570.
Hu, T.; Chitnis, N.; Monos, D.; Dinh, A. Next-generation sequencing technologies: An overview. Hum. Immunol. 2021, 82, 801–811.
Pereira, R.; Oliveira, J.; Sousa, M. Bioinformatics and computational tools for next-generation sequencing analysis in clinical genetics. J. Clin. Med. 2020, 9, 132.
Thompson, J.F.; Milos, P.M. The properties and applications of single-molecule DNA sequencing. Genome Biol. 2011, 12, 217.
Rhoads, A.; Au, K.F. PacBio Sequencing and Its Applications. Genom. Proteom. Bioinform. 2015, 13, 278–289.
Zhang, L.; Chen, F.X.; Zeng, Z.; Xu, M.; Sun, F.; Yang, L.; Bi, X.; Lin, Y.; Gao, Y.J.; Hao, H.X.; et al. Advances in Metagenomics and Its Application in Environmental Microorganisms. Front. Microbiol. 2021, 12, 766364.
Qin, D. Next-generation sequencing and its clinical application. Cancer Biol. Med. 2019, 16, 4–10.
Barbitoff, Y.A.; Polev, D.E.; Glotov, A.S.; Serebryakova, E.A.; Shcherbakova, I.V.; Kiselev, A.M.; Kostareva, A.A.; Glotov, O.S.; Predeus, A.V. Systematic dissection of biases in whole-exome and whole-genome sequencing reveals major determinants of coding sequence coverage. Sci. Rep. 2020, 10, 2057.
Choi, M.; Scholl, U.I.; Ji, W.; Liu, T.; Tikhonova, I.R.; Zumbo, P.; Nayir, A.; Bakkaloğlu, A.; Ozen, S.; Sanjad, S.; et al. Genetic diagnosis by whole exome capture and massively parallel DNA sequencing. Proc. Natl. Acad. Sci. USA 2009, 106, 19096–19101.
Bartha, Á.; Győrffy, B. Comprehensive outline of whole exome sequencing data analysis tools available in clinical oncology. Cancers 2019, 11, 1725.
Warman Chardon, J.; Beaulieu, C.; Hartley, T.; Boycott, K.M.; Dyment, D.A. Axons to Exons: The Molecular Diagnosis of Rare Neurological Diseases by Next-Generation Sequencing. Curr. Neurol. Neurosci. Rep. 2015, 15, 64.
Gargano, M.A.; Matentzoglu, N.; Coleman, B.; Addo-Lartey, E.B.; Anagnostopoulos, A.V.; Anderton, J.; Avillach, P.; Bagley, A.M.; Bakštein, E.; Balhoff, J.P.; et al. The Human Phenotype Ontology in 2024: Phenotypes around the world. Nucleic Acids Res. 2024, 52, D1333–D1346.
Bullich, G.; Matalonga, L.; Pujadas, M.; Papakonstantinou, A.; Piscia, D.; Tonda, R.; Artuch, R.; Gallano, P.; Garrabou, G.; González, J.R.; et al. Systematic Collaborative Reanalysis of Genomic Data Improves Diagnostic Yield in Neurologic Rare Diseases. J. Mol. Diagn. 2022, 24, 529–542.
Matalonga, L.; Laurie, S.; Papakonstantinou, A.; Piscia, D.; Mereu, E.; Bullich, G.; Thompson, R.; Horvath, R.; Pérez-Jurado, L.; Riess, O.; et al. Improved Diagnosis of Rare Disease Patients through Systematic Detection of Runs of Homozygosity. J. Mol. Diagn. 2020, 22, 1205–1215.
Becker, J.; Semler, O.; Gilissen, C.; Li, Y.; Bolz, H.J.; Giunta, C.; Bergmann, C.; Rohrbach, M.; Koerber, F.; Zimmermann, K.; et al. Exome sequencing identifies truncating mutations in human SERPINF1 in autosomal-recessive osteogenesis imperfecta. Am. J. Hum. Genet. 2011, 88, 362–371.
Mezzavilla, M.; Vozzi, D.; Badii, R.; Khalifa Alkowari, M.; Abdulhadi, K.; Girotto, G.; Gasparini, P. Increased rate of deleterious variants in long runs of homozygosity of an inbred population from Qatar. Hum. Hered. 2015, 79, 14–19.
Yang, T.L.; Guo, Y.; Zhang, L.S.; Tian, Q.; Yan, H.; Papasian, C.J.; Recker, R.R.; Deng, H.W. Runs of homozygosity identify a recessive locus 12q21.31 for human adult height. J. Clin. Endocrinol. Metab. 2010, 95, 3777–3782.
Wang, L.S.; Hranilovic, D.; Wang, K.; Lindquist, I.E.; Yurcaba, L.; Petkovic, Z.B.; Gidaya, N.; Jernej, B.; Hakonarson, H.; Bucan, M. Population-based study of genetic variation in individuals with autism spectrum disorders from Croatia. BMC Med. Genet. 2010, 11, 134.
Gross, A.; Tönjes, A.; Kovacs, P.; Veeramah, K.R.; Ahnert, P.; Roshyara, N.R.; Gieger, C.; Rueckert, I.M.; Loeffler, M.; Stoneking, M.; et al. Population-genetic comparison of the Sorbian isolate population in Germany with the German KORA population using genome-wide SNP arrays. BMC Genet. 2011, 12, 67.
Ghani, M.; Sato, C.; Lee, J.H.; Reitz, C.; Moreno, D.; Mayeux, R.; George-Hyslop, P.S.; Rogaeva, E. Evidence of recessive Alzheimer disease loci in a Caribbean Hispanic data set: Genome-wide survey of runs of homozygosity. JAMA Neurol. 2013, 70, 1261–1267.
Yang, T.L.; Guo, Y.; Zhang, J.G.; Xu, C.; Tian, Q.; Deng, H.W. Genome-wide Survey of Runs of Homozygosity Identifies Recessive Loci for Bone Mineral Density in Caucasian and Chinese Populations. J. Bone Miner. Res. 2015, 30, 2119–2126.
Ghani, M.; Reitz, C.; Cheng, R.; Vardarajan, B.N.; Jun, G.; Sato, C.; Naj, A.; Rajbhandary, R.; Wang, L.S.; Valladares, O.; et al. Association of Long Runs of Homozygosity with Alzheimer Disease Among African American Individuals. JAMA Neurol. 2015, 72, 1313–1323.
Bandrés-Ciga, S.; Price, T.R.; Barrero, F.J.; Escamilla-Sevilla, F.; Pelegrina, J.; Arepalli, S.; Hernández, D.; Gutiérrez, B.; Cervilla, J.; Rivera, M.; et al. Genome-wide assessment of Parkinson’s disease in a Southern Spanish population. Neurobiol. Aging 2016, 45, 213.e3–213.e9.
Barbieri, C.; Barquera, R.; Arias, L.; Sandoval, J.R.; Acosta, O.; Zurita, C.; Aguilar-Campos, A.; Tito-Álvarez, A.M.; Serrano-Osuna, R.; Gray, R.D.; et al. The Current Genomic Landscape of Western South America: Andes, Amazonia, and Pacific Coast. Mol. Biol. Evol. 2019, 36, 2698–2713.
Font-Porterias, N.; Caro-Consuegra, R.; Lucas-Sánchez, M.; Lopez, M.; Giménez, A.; Carballo-Mesa, A.; Bosch, E.; Calafell, F.; Quintana-Murci, L.; Comas, D. The Counteracting Effects of Demography on Functional Genomic Variation: The Roma Paradigm. Mol. Biol. Evol. 2021, 38, 2804–2817.
Da Cruz, P.R.S.; Ananina, G.; Secolin, R.; Gil-Da-Silva-Lopes, V.L.; Lima, C.S.P.; de França, P.H.C.; Donatti, A.; Lourenço, G.J.; de Araujo, T.K.; Simioni, M.; et al. Demographic history differences between Hispanics and Brazilians imprint haplotype features. G3 2022, 12, jkac111.
Ruan, X.; Kocher, J.P.A.; Pommier, Y.; Liu, H.; Reinhold, W.C. Mass homozygotes accumulation in the NCI-60 cancer cell lines as compared to HapMap Trios, and relation to fragile site location. PLoS ONE 2012, 7, e31628.
Santoni, F.A.; Makrythanasis, P.; Antonarakis, S.E. CATCHing putative causative variants in consanguineous families. BMC Bioinform. 2015, 16, 310.
Sonehara, K.; Okada, Y. Obelisc: An identical-by-descent mapping tool based on SNP streak. Bioinformatics 2020, 36, 5567–5570.
Garone, C.; Pippucci, T.; Cordelli, D.M.; Zuntini, R.; Castegnaro, G.; Marconi, C.; Graziano, C.; Marchiani, V.; Verrotti, A.; Seri, M.; et al. FA2H-related disorders: A novel c.270+3A>T splice-site mutation leads to a complex neurodegenerative phenotype. Dev. Med. Child Neurol. 2011, 53, 958–961.
Seelow, D.; Schuelke, M. HomozygosityMapper2012-bridging the gap between homozygosity mapping and deep sequencing. Nucleic Acids Res. 2012, 40, W516–W520.
Seelow, D.; Schuelke, M.; Hildebrandt, F.; Nürnberg, P. HomozygosityMapper—An interactive approach to homozygosity mapping. Nucleic Acids Res. 2009, 37 (Suppl. S2), W593–W599.
Kancheva, D.; Atkinson, D.; De Rijk, P.; Zimon, M.; Chamova, T.; Mitev, V.; Yaramis, A.; Maria Fabrizi, G.; Topaloglu, H.; Tournev, I.; et al. Novel mutations in genes causing hereditary spastic paraplegia and Charcot-Marie-Tooth neuropathy identified by an optimized protocol for homozygosity mapping based on whole-exome sequencing. Genet. Med. 2016, 18, 600–607.
Szpiech, Z.A.; Blant, A.; Pemberton, T.J. GARLIC: Genomic Autozygosity Regions Likelihood-based Inference and Classification. Bioinformatics 2017, 33, 2059–2062.
Görmez, Z.; Bakir-Gungor, B.; Sağıroğlu, M.Ş. HomSI: A homozygous stretch identifier from next-generation sequencing data. Bioinformatics 2014, 30, 445–447.
Quinodoz, M.; Peter, V.G.; Bedoni, N.; Bertrand, B.R.; Cisarova, K.; Salmaninejad, A.; Sepahi, N.; Rodrigues, R.; Piran, M.; Mojarrad, M.; et al. AutoMap is a high performance homozygosity mapping tool using next-generation sequencing data. Nat. Commun. 2021, 12, 518.
Yoon, B.-J. Hidden Markov Models and their Applications in Biological Sequence Analysis. Curr. Genom. 2009, 10, 402–415.
Narasimhan, V.; Danecek, P.; Scally, A.; Xue, Y.; Tyler-Smith, C.; Durbin, R. BCFtools/RoH: A hidden Markov model approach for detecting autozygosity from next-generation sequencing data. Bioinformatics 2016, 32, 1749–1751.
Zhuang, Z.; Gusev, A.; Cho, J.; Pe’er, I. Detecting Identity by Descent and Homozygosity Mapping in Whole-Exome Sequencing Data. PLoS ONE 2012, 7, e47618.
Browning, S.R.; Browning, B.L. High-Resolution Detection of Identity by Descent in Unrelated Individuals. Am. J. Hum. Genet. 2010, 86, 526–539.
Çelik, G.; Tuncalı, T. ROHMM—A flexible hidden Markov model framework to detect runs of homozygosity from genotyping data. Hum. Mutat. 2022, 43, 158–168.
Vigeland, M.D.; Gjøtterud, K.S.; Selmer, K.K. FILTUS: A desktop GUI for fast and efficient detection of disease-causing variants, including a novel autozygosity detector. Bioinformatics 2016, 32, 1592–1594.
hapROH · PyPI. (n.d.). Retrieved 27 March 2023. Available online: https://pypi.org/project/hapROH/ (accessed on 6 June 2023).
Ringbauer, H.; Novembre, J.; Steinrücken, M. Parental relatedness through time revealed by runs of homozygosity in ancient DNA. Nat. Commun. 2021, 12, 5425.
Kruskal, J.B.; Hill, M. Multidimensional scaling by optimizing goodness of fit to a nonmetric hypothesis. Psychometrika 1964, 29, 1–27.
Rousseeuw, P.J.; Van Driessen, K. A fast algorithm for the minimum covariance determinant estimator. Technometrics 1999, 41, 212–223.
Lalioti, M.D.; Mirotsou, M.; Buresi, C.; Peitsch, M.C.; Rossier, C.; Ouazzani, R.; Baldy-Moulinier, M.; Bottani, A.; Malafosse, A.; Antonarakis, S.E. Identification of mutations in cystatin B, the gene responsible for the Unverricht-Lundborg type of progressive myoclonus epilepsy (EPM1). Am. J. Hum. Genet. 1997, 60, 342.
McQuillan, R.; Leutenegger, A.L.; Abdel-Rahman, R.; Franklin, C.S.; Pericic, M.; Barac-Lauc, L.; Smolej-Narancic, N.; Janicijevic, B.; Polasek, O.; Tenesa, A.; et al. Runs of Homozygosity in European Populations. Am. J. Hum. Genet. 2008, 83, 359.
Moreno-Grau, S.; Fernández, M.V.; de Rojas, I.; Garcia-González, P.; Hernández, I.; Farias, F.; Budde, J.P.; Quintela, I.; Madrid, L.; González-Pérez, A.; et al. Long runs of homozygosity are associated with Alzheimer’s disease. Transl. Psychiatry 2021, 11, 142.
Santos, H.G.; Dias, J.A.; Pimenta, Z.P. Incidência de Casamentos Consanguíneos na População Portuguesa-1980–1986. In Saúde em Números; 1988; Volume 3, pp. 41–48.
Ceballos, F.C.; Joshi, P.K.; Clark, D.W.; Ramsay, M.; Wilson, J.F. Runs of homozygosity: Windows into population history and trait architecture. Nat. Rev. Genet. 2018, 19, 220–234.
Martin, A.R.; Williams, E.; Foulger, R.E.; Leigh, S.; Daugherty, L.C.; Niblock, O.; Leong, I.U.S.; Smith, K.R.; Gerasimenko, O.; Haraldsdottir, E.; et al. PanelApp crowdsources expert knowledge to establish consensus diagnostic gene panels. Nat. Genet. 2019, 51, 1560–1565.

Figure 1. Flowchart representing the automation of the creation of multigene panels based on ROHs (DB—database; DF—dataframe).

Figure 2. Flowchart to obtain the reference BED file.

Figure 3. The flowchart of the multigene panel lists: white, grey, and black.

Figure 4. Flowchart of the ROH and HPO multigene panel automation.

Figure 5. Overview of the results regarding processes of generating the multigene panel application in a case study, the first Portuguese ROH characterization, and the clustering model.

Figure 6. Pedigree depicting two affected sisters, daughters of a consanguineous couple.

Figure 7. Example of an input for the personalized multigene panels based on HPO term and ROHs.

Figure 8. IGV visualization of the reads mapped to the CSTB gene in both sisters (II:1 and II:2).

Figure 9. BAM visualization depicting the region of the dodecamer repeat expansion in a control sample (I) and in both sisters (II:1 and II:2). No reads align to this region in either patient, suggesting that a possible expansion is biallelic (present in both CSTB alleles).

Figure 10. Histogram depicting the distribution of ROH length above 0.5 Mb in a Portuguese cohort of 3941 samples.

Figure 11. Geographical distribution per municipality of F_ROH > 0.5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.

Figure 12. Geographical distribution per municipality of F_ROH > 1.5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.

Figure 13. Geographical distribution per municipality of F_ROH > 5 Mb in Portugal Mainland, Autonomous Region of Açores, and Autonomous Region of Madeira.

Figure 14. Map of Portugal representing the consanguinity between 1980 and 1986 (/100,000) adapted from [89] (upper left) and the Portugal Mainland maps for the F_ROH calculated for ROHs of size above 0.5 Mb (upper right), 1.5 Mb (lower left), and 5 Mb (lower right).

Figure 15. Low-dimensional MDS representations of each “tier” dataset: Tier 0 training and validation results (A) and testing results (B); Tier 1 training and validation results (C) and testing results (D); Tier 2 training and validation results (E) and testing results (F). Data points are colored according to their consanguinity labels: white “unknown” points do not possess a ground truth label; green “NCON” points represent non-consanguineous samples; red “CON” points represent consanguineous samples; and purple “CON_ST” points represent stringent consanguineous samples. The red dashed circles represent the elliptic envelope’s outlier decision boundary (i.e., points falling outside of the envelope are predicted to be consanguineous, either stringent or non-stringent).

Table 1. Resulting list of genes for each sister (II:1 and II:2).

II:1	II:2
DHDDS	SIK1
HMGCL	CSTB
MERC	SLC32A1
SDHA
SIK1
CSTB
PIGV
SLC25A19
SLC32A1
TERT
TSEN54
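
The lists in Table 1 are consistent with a filter that keeps genes associated with the patient's HPO terms and located inside that patient's ROHs (Figures 4 and 7). The following is a minimal sketch of such a filter, assuming half-open BED-style coordinates. The function name `roh_gene_panel`, the gene names, and all coordinates are hypothetical and are not taken from the article's pipeline.

```python
from typing import NamedTuple


class Region(NamedTuple):
    chrom: str
    start: int  # 0-based, inclusive (BED convention)
    end: int    # exclusive


def overlaps(a: Region, b: Region) -> bool:
    """True when two half-open regions on the same chromosome share a base."""
    return a.chrom == b.chrom and a.start < b.end and b.start < a.end


def roh_gene_panel(rohs, gene_coords, hpo_genes):
    """Return HPO-associated genes that overlap any ROH, sorted by name."""
    return sorted(
        gene
        for gene, region in gene_coords.items()
        if gene in hpo_genes and any(overlaps(region, roh) for roh in rohs)
    )


# Toy data: gene names and coordinates are invented for illustration only.
gene_coords = {
    "GENE_A": Region("chr1", 1_000_000, 1_050_000),
    "GENE_B": Region("chr1", 9_000_000, 9_020_000),
    "GENE_C": Region("chr21", 500_000, 520_000),
}
rohs = [Region("chr1", 800_000, 2_500_000), Region("chr21", 100_000, 900_000)]
hpo_genes = {"GENE_A", "GENE_B"}  # genes annotated to the patient's HPO term(s)

panel = roh_gene_panel(rohs, gene_coords, hpo_genes)
print(panel)  # ['GENE_A']
```

In this toy case GENE_B is dropped because it lies outside every ROH, and GENE_C is dropped because it is not associated with the HPO terms. A sibling with fewer or shorter ROHs would yield a shorter list under the same filter.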

Table 2. Number of people with F_ROH > 0.5 Mb within each interval.

F_ROH > 0.5 Mb Intervals	Number of Samples
(0.000, 0.004]	3086
(0.004, 0.006]	205
(0.006, 0.010]	153
(0.010, 0.018]	105
(0.018, 0.034]	123
(0.034, 0.088]	88

Table 3. Number of people with F_ROH > 1.5 Mb within each interval.

F_ROH > 1.5 Mb Intervals	Number of Samples
(0.000, 0.003]	1418
(0.003, 0.005]	192
(0.005, 0.009]	160
(0.009, 0.017]	103
(0.017, 0.033]	126
(0.033, 0.085]	77

Table 4. Number of people with F_ROH > 5 Mb within each interval.

F_ROH > 5 Mb Intervals	Number of Samples
(0.000, 0.002]	36
(0.002, 0.004]	314
(0.004, 0.008]	144
(0.008, 0.016]	110
(0.016, 0.032]	90
(0.032, 0.074]	43
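
Tables 2 to 4 report F_ROH at three ROH size thresholds. As a rough illustration, F_ROH is commonly computed as the summed length of a sample's ROHs above a size threshold divided by the length of the genome considered. The sketch below assumes that definition; the denominator `GENOME_BP` is an approximate autosomal length chosen for illustration (the value used in the article may differ, for example for exome data), and the ROH lengths are invented.

```python
def f_roh(roh_lengths_bp, min_length_bp, genome_length_bp):
    """Fraction of the genome covered by ROHs longer than min_length_bp."""
    if genome_length_bp <= 0:
        raise ValueError("genome_length_bp must be positive")
    covered = sum(length for length in roh_lengths_bp if length > min_length_bp)
    return covered / genome_length_bp


MB = 1_000_000
GENOME_BP = 2_875 * MB  # assumption: approximate autosomal length, adjust to the data

# Invented ROH lengths for a single sample, in base pairs.
sample_rohs = [600_000, 1_200_000, 2_000_000, 7_500_000]

for threshold_mb in (0.5, 1.5, 5.0):
    value = f_roh(sample_rohs, threshold_mb * MB, GENOME_BP)
    print(f"F_ROH > {threshold_mb} Mb: {value:.5f}")
```

Raising the threshold can only remove segments from the sum, so for any one sample F_ROH is non-increasing as the threshold goes from 0.5 to 1.5 to 5 Mb.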

Table 5. F_ROH mean, F_ROH mean of means per municipality, and comparative values for the different ROH size thresholds (0.5, 1.5 and 5 Mb).

	Mean F_ROH	Mean F_ROH of Means per Municipality	F_ROH Comparative Values [87]
F_ROH > 0.5 Mb	0.0042	0.0057	0.0315
F_ROH > 1.5 Mb	0.0033	0.0049	0.0021
F_ROH > 5 Mb	0.0020	0.0039	0.0001
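
The first two columns of Table 5 average the same per-sample values in two different ways, which is why they need not agree. The sketch below contrasts a pooled mean over all samples with a mean of per-municipality means, using invented municipality names and F_ROH values. That sparsely sampled municipalities with high F_ROH account for the gap in Table 5 is only one possible reading and is an assumption here.

```python
from collections import defaultdict
from statistics import mean


def pooled_mean(samples):
    """Mean over all samples, so each sample has equal weight."""
    return mean(value for _, value in samples)


def mean_of_group_means(samples):
    """Mean of per-municipality means, so each municipality has equal weight."""
    groups = defaultdict(list)
    for municipality, value in samples:
        groups[municipality].append(value)
    return mean(mean(values) for values in groups.values())


# Invented data: a well-sampled town with low F_ROH and a village with one sample.
samples = [
    ("Town A", 0.002), ("Town A", 0.002), ("Town A", 0.002), ("Town A", 0.002),
    ("Village B", 0.010),
]

print(f"pooled mean:         {pooled_mean(samples):.4f}")
print(f"mean of group means: {mean_of_group_means(samples):.4f}")
```

With these toy values the pooled mean is about 0.0036 and the mean of group means is about 0.0060, because the single village sample carries as much weight as the whole town in the second average.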

Table 6. F_ROH calculated for the 0.5, 1.5, and 5 Mb thresholds per district, and data from a study estimating the number of consanguineous marriages per district [89].

District	F_ROH > 0.5 Mb	F_ROH > 1.5 Mb	F_ROH > 5.0 Mb	Number of Consanguineous Marriages (10,000) [89]
Açores	0.0046	0.0035	0.0023	78.7
Aveiro	0.0052	0.0041	0.0024	22.1
Beja	0.0048	0.0039	0.0022	22.8
Braga	0.0034	0.0025	0.0013	19.2
Bragança	0.0102	0.0090	0.0060	52.7
Castelo Branco	0.0066	0.0054	0.0034	19.9
Coimbra	0.0063	0.0053	0.0036	38.2
Évora	0.0039	0.0028	0.0016	34.5
Faro	0.0029	0.0019	0.0010	27.2
Guarda	0.0048	0.0038	0.0024	35.3
Leiria	0.0058	0.0047	0.0030	35.1
Lisboa	0.0039	0.0030	0.0017	20.2
Madeira	0.0077	0.0068	0.0041	133.6
Portalegre	0.0106	0.0092	0.0070	24.8
Porto	0.0026	0.0018	0.0010	14.4
Santarém	0.0056	0.0045	0.0032	27.6
Setúbal	0.0038	0.0029	0.0019	30.1
Viana do Castelo	0.0030	0.0021	0.0010	17.8
Vila Real	0.0070	0.0060	0.0037	38.3
Viseu	0.0105	0.0091	0.0059	38.7

Table 7. Outlier detection F1-score results, separated by feature set tier.

Dataset Tier (Feature Set)	Best Contamination Hyperparameter	Validation F1-Score	Test F1-Score
Tier 0 (Count_x, Sum_x)	0.0786	0.9310	0.9412
Tier 1 (Count_x, Sum_x, Min_x, Max_x)	0.1190	0.9655	0.9434
Tier 2 (Count_x, Sum_x, Min_x, Max_x, Mean_x, STD_x)	0.1061	0.9474	0.9615
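
Table 7 pairs a contamination hyperparameter with F1-scores for an elliptic-envelope outlier detector. The sketch below is a simplified stand-in for that idea on two features (ROH count and ROH sum, as in Tier 0). It uses the plain sample covariance rather than a robust minimum covariance determinant estimate, scores candidates on the same toy labelled points instead of a separate validation set, and all data and function names are invented for illustration.

```python
from statistics import mean


def mahalanobis_sq(p, center, inv):
    """Squared Mahalanobis distance in 2-D; inv = (a, b, c) for [[a, b], [b, c]]."""
    dx, dy = p[0] - center[0], p[1] - center[1]
    a, b, c = inv
    return a * dx * dx + 2 * b * dx * dy + c * dy * dy


def fit_envelope(points, contamination):
    """Fit a 2-D Gaussian envelope; return (center, inverse covariance, cutoff)."""
    xs, ys = zip(*points)
    center = (mean(xs), mean(ys))
    n = len(points)
    sxx = sum((x - center[0]) ** 2 for x in xs) / n
    syy = sum((y - center[1]) ** 2 for y in ys) / n
    sxy = sum((x - center[0]) * (y - center[1]) for x, y in points) / n
    det = sxx * syy - sxy ** 2
    if det <= 0:
        raise ValueError("covariance matrix is singular")
    inv = (syy / det, -sxy / det, sxx / det)
    dists = sorted(mahalanobis_sq(p, center, inv) for p in points)
    # Keep the (1 - contamination) fraction of closest training points inside.
    n_inliers = max(1, min(n, round(n * (1 - contamination))))
    return center, inv, dists[n_inliers - 1]


def predict_outlier(p, model):
    """True when the point falls outside the envelope."""
    center, inv, cutoff = model
    return mahalanobis_sq(p, center, inv) > cutoff


def f1_score(y_true, y_pred):
    """F1 for the positive (outlier) class."""
    tp = sum(t and p for t, p in zip(y_true, y_pred))
    fp = sum((not t) and p for t, p in zip(y_true, y_pred))
    fn = sum(t and (not p) for t, p in zip(y_true, y_pred))
    return 0.0 if tp == 0 else 2 * tp / (2 * tp + fp + fn)


def best_contamination(points, labels, candidates):
    """Pick the candidate contamination value with the highest F1-score."""
    def score(c):
        model = fit_envelope(points, c)
        return f1_score(labels, [predict_outlier(p, model) for p in points])
    return max(candidates, key=score)


# Invented (ROH count, ROH sum in Mb) pairs; True marks a consanguineous sample.
points = [(10, 8.0), (12, 9.5), (11, 9.0), (9, 7.5), (13, 10.0),
          (10, 8.5), (12, 9.0), (11, 8.5), (40, 120.0), (10, 9.0)]
labels = [False] * 8 + [True, False]

best = best_contamination(points, labels, (0.1, 0.2, 0.3))
print("best contamination:", best)
```

A larger contamination value shrinks the envelope and flags more samples as outliers, trading precision for recall, which is why the hyperparameter is tuned against F1 rather than fixed in advance.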

© 2024 by the authors. Licensee MDPI, Basel, Switzerland. This article is an open access article distributed under the terms and conditions of the Creative Commons Attribution (CC BY) license (https://creativecommons.org/licenses/by/4.0/).

Analysis of Regions of Homozygosity: Revisited Through New Bioinformatic Approaches

Abstract

1. Introduction