Expectation of no. of allele in a loci #67

NabilAsyraf98 · 2025-04-09T02:46:24Z

Hi there!
Thanks for making this tool!

I am looking at the locus chr10 129294244:129294295. The output BED file has one allele copy number, 15.1, but the output TSV file has multiple allele copy numbers: 12, 13, 14, 15, 16 and 17. In fact, 13 and 16 are supported by the majority of reads. Why does Straglr report 15.1 as the call? Is it possible to output 13 and 16 instead as the final allele calls in the output BED file? I have a capillary electrophoresis result that supports this call too. I have attached screenshots of the BED and TSV files as a reference. I can send a copy of the files if needed!

Thanks!

readmanchiu · 2025-04-13T01:14:53Z

Hi @NabilAsyraf98,
Thanks for trying Straglr. It's really challenging to ascertain that there are two alleles (13 and 16) because of the presence of 14.5 and 15.8. In this case, those two intermediate alleles tell Straglr to treat all the genotypes as one cluster (the differences being attributed to sequencing variability), and the median value is taken as the reported genotype.
But thanks for reporting this. I will check why the GMM does not take the frequencies into account and recognize this as a bimodal distribution, which seems pretty obvious. Perhaps another clustering algorithm is called for.

NabilAsyraf98 · 2025-04-16T02:36:59Z

Thanks!

ljohansson · 2025-05-19T11:31:55Z

We see the same issue quite regularly. Fortunately it mostly occurs at non-pathogenic read lengths, so it does not affect the conclusion, but it would be great if such cases were identified as heterozygous calls with different read lengths. You would expect the GMM to take frequencies into account, right? I am really trying to understand how this works.

How would the following parameters affect clustering?

"merge clusters separated by 10 (--min_cluster_d)" --> I read this as: all clusters closer than 10 to each other will be merged. But that is not what generally happens, because then non-expanded STRs would almost always be merged. I am curious what this setting means.
"merge singleton to clusters when the singleton is within 10% of closest member of cluster" --> Is the 10% counted as 10% of the closest value? In the above example, 14.5 is closest to 15.8, so is 15.8 - 1.58 = 14.22 the boundary for merging? This in turn would make 14.5 the closest value to the '13' cluster. It is not a singleton; however, 13 is more than 10% (1.45) away from 14.5.

readmanchiu · 2025-05-19T18:21:23Z

@ljohansson thanks for following up on this

Straglr has always clustered based on sizes, not copy numbers. You are right in your interpretation of -d, but you may want to adjust the default value of 10 if you are genotyping by copy numbers.
When merging a singleton into a neighboring cluster, it checks the closest values of the two neighboring clusters; if the difference is less than 10% of itself and the closest value, the singleton is assigned to the neighboring cluster with the smallest difference (in size).
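To make the 10% rule concrete with the copy numbers quoted earlier in this thread, here is a minimal sketch. It assumes the threshold is taken relative to the neighboring value, which is only one possible reading of the rule. `within_fraction` is a hypothetical helper and not a Straglr function, and Straglr itself compares sizes rather than copy numbers.

```python
def within_fraction(value, neighbour, fraction=0.10):
    """Return True if `value` lies within `fraction` of `neighbour`.

    Hypothetical helper for illustration only; not taken from Straglr.
    """
    return abs(value - neighbour) <= fraction * neighbour


# 14.5 vs 15.8: difference 1.3, allowed 1.58, so within 10%
print(within_fraction(14.5, 15.8))

# 13 vs 14.5: difference 1.5, allowed 1.45, so just outside 10%
print(within_fraction(13, 14.5))
```

Under this reading, 14.5 falls within 10% of 15.8, while 13 is just outside 10% of 14.5, which matches the arithmetic in the question.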
For the locus in the original report, since the sizes are pretty close, changing -d to 5 will yield the desired 2 clusters.
I think automatically adjusting -d based on the clustered sizes is called for; I will probably work on this.
Also, if you're still getting weird results after playing with -d, please show me the TSV so I can see what can be done.
I'm also working on using phasing results for assigning alleles, as it seems pretty standard in Nanopore analysis workflows nowadays.
Is phasing part of your workflow? I'm not sure if it is possible with targeted sequencing.
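As a rough illustration of why lowering -d can separate the two alleles, here is a minimal sketch of distance-based merging of cluster centres. It is not Straglr's code: `merge_close_clusters` is a hypothetical function, and the 3 bp motif length is an assumption made only to convert the copy numbers 13 and 16 into sizes, since the thread does not state the motif.

```python
def merge_close_clusters(centres, min_cluster_d):
    """Group sorted cluster centres, merging neighbours closer than min_cluster_d.

    Hypothetical sketch of a distance-based merge; Straglr's real logic may differ.
    """
    groups = []
    for centre in sorted(centres):
        if groups and centre - groups[-1][-1] < min_cluster_d:
            groups[-1].append(centre)  # close to the previous centre: merge
        else:
            groups.append([centre])    # far enough away: start a new cluster
    return groups


MOTIF_LEN = 3  # assumed motif length in bp, for illustration only
centres_bp = [13 * MOTIF_LEN, 16 * MOTIF_LEN]  # 39 bp and 48 bp

print(merge_close_clusters(centres_bp, 10))  # 9 bp apart, below 10: one cluster
print(merge_close_clusters(centres_bp, 5))   # 9 bp apart, not below 5: two clusters
```

With sizes only 9 bp apart, a threshold of 10 collapses them into one cluster, whereas a threshold of 5 keeps them separate. With a longer motif the sizes would be further apart and the default might already separate them.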
