Expectation of no. of allele in a loci · Issue #67 · bcgsc/straglr · GitHub

Expectation of no. of allele in a loci #67


Open
NabilAsyraf98 opened this issue Apr 9, 2025 · 4 comments

Comments

@NabilAsyraf98

Hi there!
Thanks for making this tool!

I am looking at the locus chr10 129294244:129294295. The output BED file has one allele copy number, 15.1, but the output TSV file has multiple allele copy numbers: 12, 13, 14, 15, 16 and 17. In fact, 13 and 16 are supported by the majority of reads. Why does Straglr give the output call as 15.1? Is it possible to output 13 and 16 instead as the final allele calls in the output BED file? I have a capillary electrophoresis result that supports this call too. I have attached screenshots of the BED and TSV files for reference. I can send a copy of the files if needed!

Image
Image

Thanks!

@readmanchiu
Collaborator

Hi @NabilAsyraf98,
Thanks for trying Straglr. It's really challenging to ascertain that there are two alleles (13 and 16) because of the presence of 14.5 and 15.8. In this case, the presence of those 2 intermediate alleles tells Straglr to treat all the genotypes as 1 cluster (the difference being attributed to sequencing variability), and the median value is taken as the reported genotype.
But thanks for reporting this. I will check why the GMM does not take the frequencies into account and recognize this as a bimodal distribution, which seems pretty obvious. Perhaps another clustering algorithm is called for.
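
To illustrate why a single-cluster call can land between two well-supported alleles, here is a minimal sketch. The read counts are hypothetical (the real per-read values are only in the attached screenshots), and the fixed cut point in `split_call` is an illustrative stand-in, not Straglr's GMM.

```python
from statistics import median

# Hypothetical per-read copy numbers (NOT the real data from this issue):
# two well-supported modes at 13 and 16 plus a few intermediate reads.
reads = [13.0] * 8 + [14.5] * 2 + [15.8] * 2 + [16.0] * 8

def single_cluster_call(copy_numbers):
    """Report one genotype: the median of all reads treated as one cluster."""
    return median(copy_numbers)

def split_call(copy_numbers, cut):
    """Report two genotypes by splitting the reads at a fixed cut point."""
    low = [c for c in copy_numbers if c < cut]
    high = [c for c in copy_numbers if c >= cut]
    return median(low), median(high)

one = single_cluster_call(reads)   # falls between the two modes
two = split_call(reads, cut=15.0)  # recovers the two modes
print(one, two)
```

With these made-up counts the single-cluster median sits between 14.5 and 15.8, while the split recovers 13 and 16, which is the behaviour described above.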

@NabilAsyraf98
Author

Thanks!

@ljohansson

We have the same issue quite regularly. Fortunately this is mostly in non-pathogenic read-lengths, so not affecting the conclusion, but it would be great if such cases would be identified as heterozygous calls with different read lengths. You would expect GMM to take into account frequencies, right? I am really trying to understand how this works.

How would the following parameters affect clustering?

  • "merge clusters separated by 10 (--min_cluster_d)" --> I read this as: all clusters closer than 10 to each other will be merged. But that is not what generally happens, because then non-expanded STRs would almost always be merged. I am curious what this setting means.
  • "merge singleton to clusters when the singleton is within 10% of closest member of cluster" --> Is the 10% counted as 10% of the closest value? In the above example 14.5 is closest to 15.8, so is 15.8-1.58=14.22 the boundary for merging? This in turn would make 14.5 the closest value to the '13' cluster. It is not a singleton; however, 13 is more than 10% (1.45) away from 14.5.
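
The arithmetic in the second bullet can be checked with a short sketch. The helper `within_fraction` is a hypothetical name used only for illustration; it encodes the two readings of "within 10%" discussed above and is not taken from Straglr's code.

```python
def within_fraction(a, b, reference, fraction=0.10):
    """True if |a - b| is smaller than `fraction` of `reference`."""
    return abs(a - b) < fraction * reference

# Boundary if the 10% is taken from the closest cluster member (15.8).
boundary = 15.8 - 0.10 * 15.8
print(round(boundary, 2))

# 14.5 vs 15.8: the gap is 1.3, inside 10% of either value (1.58 or 1.45).
print(within_fraction(14.5, 15.8, reference=15.8))
print(within_fraction(14.5, 15.8, reference=14.5))

# 14.5 vs 13: the gap is 1.5, outside 10% of 14.5 (1.45).
print(within_fraction(14.5, 13.0, reference=14.5))
```

Under either reading, 14.5 is within 10% of 15.8 but not of 13, which matches the numbers quoted in the bullet.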

@readmanchiu
Collaborator

@ljohansson thanks for following up on this

Straglr has always done the clustering based on sizes, not copy numbers. You are right in your interpretation of -d, but you may want to adjust the default value of 10 if you are genotyping by copy numbers.
For merging a singleton into neighboring clusters, it checks the closest values of the two neighboring clusters; if the difference is less than 10% of itself and the closest value, the singleton is assigned to the neighboring cluster with the smallest difference (in size).
For the above example, since the sizes are pretty close, changing -d to 5 will yield the desired 2 clusters.
I think automatically adjusting -d based on the clustered sizes is called for; I will probably work on this.
Also, if you're still getting weird results after playing with -d, please show me the TSV so I can see what can be done.
I'm also working on using phasing results for assigning alleles, as it seems pretty standard in Nanopore analysis workflows nowadays.
Is phasing part of your workflow? I'm not sure if it is possible with targeted sequencing.
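
A rough sketch of why -d matters when clustering is done on sizes rather than copy numbers. Several things here are assumptions: the motif length of 3 bp is not stated in the thread, `merge_by_distance` is a hypothetical helper, and comparing neighbouring cluster centres with a strict "less than" is only one possible reading of the option, not Straglr's actual implementation.

```python
MOTIF_LEN = 3  # assumed motif length in bp; not stated in the thread

def merge_by_distance(sizes, min_d):
    """Merge sorted values whose gap to the previous value is below min_d."""
    clusters = []
    for s in sorted(sizes):
        if clusters and s - clusters[-1][-1] < min_d:
            clusters[-1].append(s)   # too close: join the previous cluster
        else:
            clusters.append([s])     # far enough: start a new cluster
    return clusters

# Copy numbers 13 and 16 expressed as sizes in bp (39 and 48).
allele_sizes = [13 * MOTIF_LEN, 16 * MOTIF_LEN]

print(merge_by_distance(allele_sizes, min_d=10))  # gap of 9 bp is below 10
print(merge_by_distance(allele_sizes, min_d=5))   # gap of 9 bp is above 5
```

Under these assumptions the two alleles are 9 bp apart, so a threshold of 10 merges them and a threshold of 5 keeps them separate.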
