left join broken with ValueError and TypeError · Issue #342 · pyranges/pyranges · GitHub

left join broken with ValueError and TypeError #342

Closed
hardingnj opened this issue Jun 14, 2023 · 22 comments

@hardingnj
hardingnj commented Jun 14, 2023

Thanks for the package, and all the hard work that's gone into it. I came across this issue:

f1 = pr.from_dict(
  {'Chromosome': ['chr1', 'chr1', 'chr1'], 
   'Start': [3, 8, 12],
   'Strand': ['+', '-', '+'],
   'End': [4, 9, 13], 
   'Name': ['interval1', 'interval3', 'interval2']
}, int64=True)

f2 = pr.from_dict(
  {
    'Chromosome': ['chr1'], 
    'Start': [6], 
    'Strand': ['-'],
    'End': [10], 
    'GeneName': ['test']
}, int64=True)

I'm trying to annotate f1 with f2.

f3 = f1.join(f2, strandedness="same", slack=0)

This works as expected.

f3 = f1.join(f2, how="left", strandedness="same", slack=0)

This fails with an error:
TypeError: Cannot set a Categorical with another, without identical categories

Version is '0.0.125'.
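
For readers unfamiliar with what the failing call is meant to return, here is a minimal pure-Python sketch of a same-strand left join. It is a simplified model, not pyranges' implementation: the function name `left_join_same_strand` is hypothetical, intervals are assumed to be half-open, and the `-1` placeholder for unmatched rows simply mirrors the fill values shown in the output later in this thread.

```python
def left_join_same_strand(left, right):
    """Keep every row of `left`; attach overlapping same-strand rows of `right`.

    Rows are dicts with Chromosome, Start, End and Strand keys.
    Intervals are treated as half-open [Start, End) (an assumption).
    """
    rows = []
    for a in left:
        hits = [
            b for b in right
            if a["Chromosome"] == b["Chromosome"]
            and a["Strand"] == b["Strand"]
            and a["Start"] < b["End"]
            and b["Start"] < a["End"]
        ]
        if hits:
            for b in hits:
                rows.append({**a, "Start_b": b["Start"], "End_b": b["End"],
                             "GeneName": b["GeneName"]})
        else:
            # Left join: unmatched rows are kept and filled with placeholders.
            rows.append({**a, "Start_b": -1, "End_b": -1, "GeneName": "-1"})
    return rows


f1_rows = [
    {"Chromosome": "chr1", "Start": 3, "End": 4, "Strand": "+", "Name": "interval1"},
    {"Chromosome": "chr1", "Start": 8, "End": 9, "Strand": "-", "Name": "interval3"},
    {"Chromosome": "chr1", "Start": 12, "End": 13, "Strand": "+", "Name": "interval2"},
]
f2_rows = [
    {"Chromosome": "chr1", "Start": 6, "End": 10, "Strand": "-", "GeneName": "test"},
]

joined = left_join_same_strand(f1_rows, f2_rows)
for row in joined:
    print(row["Name"], row["Start_b"], row["End_b"], row["GeneName"])
```

Only interval3 overlaps the single `-` strand feature in f2, so the other two rows keep their placeholders.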

@hardingnj
Author
hardingnj commented Jun 14, 2023

Relevant:

f2 = pr.from_dict(
  {
    'Chromosome': ['chr1', 'chr1'],
    'Start': [6, 23],
    'Strand': ['-', '+'],
    'End': [10, 24],
    'GeneName': ['test', 'test2']
})

i.e. adding an extra row to f2 with a positive strand fixes the issue, because f2 then contains both "+" and "-". I guess somewhere there is a call to pd.Categorical that should explicitly set "+"/"-" as categories.
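
The error itself can be reproduced in plain pandas, without pyranges. The sketch below is an illustration of the pandas behaviour only; it makes no claim about where pyranges triggers it. Two categoricals built from different strand values get different inferred categories, and assigning one into the other raises a `TypeError`. Giving both an explicit shared `CategoricalDtype` avoids it.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories are inferred from the data, so the two dtypes differ.
left = pd.Categorical(["+", "-", "+"])   # categories: ['+', '-']
right = pd.Categorical(["-", "-", "-"])  # categories: ['-']

try:
    left[:] = right
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc
print(type(mismatch_error).__name__, mismatch_error)

# With one explicit dtype on both sides, the same assignment is accepted.
strand_dtype = CategoricalDtype(["+", "-"])
left_fixed = pd.Categorical(["+", "-", "+"], dtype=strand_dtype)
right_fixed = pd.Categorical(["-", "-", "-"], dtype=strand_dtype)
left_fixed[:] = right_fixed
print(list(left_fixed))
```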

@michaelJwilson
michaelJwilson commented Jun 14, 2023

Replacing:

"Strand": "category",

with

from pandas.api.types import CategoricalDtype

"Strand": CategoricalDtype(["-", "+"], ordered=True),

and

df["Strand"] = df.Strand.cat.remove_unused_categories()

with

# df["Strand"] = df.Strand.cat.remove_unused_categories()

is sufficient to produce the desired result:

>> f3 = f1.join(f2, how="left", strandedness="same", slack=0)

+--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------+
| Chromosome   |     Start | Strand       |       End | Name       |   Start_b | Strand_b     |     End_b | GeneName   |
| (category)   |   (int64) | (category)   |   (int64) | (object)   |   (int64) | (category)   |   (int64) | (object)   |
|--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------|
| chr1         |         3 | +            |         4 | interval1  |        -1 | +            |        -1 | -1         |
| chr1         |        12 | +            |        13 | interval2  |        -1 | +            |        -1 | -1         |
| chr1         |         8 | -            |         9 | interval3  |         6 | -            |        10 | test       |
+--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------+

Happy to create the PR if this is acceptable.
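
A small pandas-only sketch of why the second change matters (an illustration, not the pyranges code path): `remove_unused_categories()` shrinks the dtype to the values actually present, so a column that started with the fixed dtype no longer compares equal to it.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

strand_dtype = CategoricalDtype(["-", "+"], ordered=True)

# A frame that only contains "-" still carries both categories...
only_minus = pd.Series(["-"], dtype=strand_dtype)
print(list(only_minus.cat.categories))

# ...until unused categories are dropped, which changes the dtype.
trimmed = only_minus.cat.remove_unused_categories()
print(list(trimmed.cat.categories))
print(trimmed.dtype == strand_dtype)
```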

@hardingnj
Author

Thanks. This seems a reasonable solution @endrebak ?

@endrebak
Collaborator

I'll try to fix this today. Will have to check on different pandas versions :D

@michaelJwilson

Thanks!

@endrebak
Collaborator

Your suggestion seems to not work with bad strands like ".".

Will try to fix.

@michaelJwilson
"Strand": CategoricalDtype([".", "-", "+"], ordered=True),

Was this ruled out?
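
To make the "." problem concrete, here is a pandas-only sketch (it does not exercise pyranges): casting to a dtype that lists only "-" and "+" silently turns any other value into NaN, while the three-category dtype keeps ".".

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

narrow = CategoricalDtype(["-", "+"], ordered=True)
wide = CategoricalDtype([".", "-", "+"], ordered=True)

strands = pd.Series(["+", ".", "-"])

# Values outside the declared categories become NaN on cast.
as_narrow = strands.astype(narrow)
as_wide = strands.astype(wide)

print(as_narrow.isna().tolist())
print(as_wide.isna().tolist())
```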

endrebak pushed a commit that referenced this issue Jun 27, 2023
@endrebak
Collaborator

https://github.com/pyranges/pyranges/pull/new/342

The tests seem to pass and your example works?

Does the commit work for you?

Will have to wait for CI before uploading to PyPI.

endrebak pushed a commit that referenced this issue Jun 27, 2023
endrebak added a commit that referenced this issue Jun 27, 2023
Co-authored-by: endre bakken stovner <endrebakkenstovner@endres-MacBook-Air.local>
@endrebak
Collaborator

Started the pipeline to push the changes to PyPI: https://github.com/pyranges/pyranges/actions/runs/5391831179

@michaelJwilson

Sorry, this example still fails with v0.0.128. At this point, changing

dtypes["Strand"] = CategoricalDtype(categories=df["Strand"].drop_duplicates().to_list()) 

to

dtypes["Strand"] = CategoricalDtype([".", "-", "+"], ordered=True)

continues to pass. As suggested by the original error,

TypeError: Cannot set a Categorical with another, without identical categories

we basically need to force the Strand categories to be the same irrespective of
the passed data frame, otherwise this join will fail.

Thanks for your help!

@endrebak
Collaborator
endrebak commented Jun 28, 2023

Ah, thanks! I'm working on implementing genomicranges for polars so my pandas knowledge is slipping away.

What I do not like about your solution is that the Strand column might have other values besides (., +, -) and your solution seems to make only those three valid.
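
One possible way to reconcile the two positions, sketched in plain pandas under stated assumptions: build the shared dtype from the union of strand values observed in both frames, so no value is excluded but both sides still end up with identical categories. The helper name `shared_strand_dtype` is hypothetical and this is not what pyranges does.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype


def shared_strand_dtype(*frames):
    """Return one unordered CategoricalDtype covering every Strand value seen."""
    observed = pd.concat(
        [frame["Strand"].astype(object) for frame in frames], ignore_index=True
    )
    return CategoricalDtype(sorted(observed.dropna().unique()))


a = pd.DataFrame({"Strand": ["+", "-", "+"]})
b = pd.DataFrame({"Strand": ["-", "."]})

dtype = shared_strand_dtype(a, b)
a["Strand"] = a["Strand"].astype(dtype)
b["Strand"] = b["Strand"].astype(dtype)

print(list(dtype.categories))
print(a["Strand"].dtype == b["Strand"].dtype)
```

The trade-off is that the categories now depend on the pair of frames being joined, so the dtype has to be computed at join time rather than when each object is created.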

@endrebak
Collaborator

Perhaps we should replace all invalid values with . and warn the user?
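
A minimal sketch of that idea in plain pandas, assuming "+", "-" and "." are the only valid values. The function name `normalize_strand` and the warning text are hypothetical; this is not the change that was eventually made in pyranges.

```python
import warnings

import pandas as pd
from pandas.api.types import CategoricalDtype

VALID_STRANDS = ("+", "-", ".")


def normalize_strand(strand):
    """Replace anything outside VALID_STRANDS with '.' and warn about it."""
    strand = strand.astype(object)
    invalid = ~strand.isin(VALID_STRANDS)
    if invalid.any():
        warnings.warn(
            f"Replacing {int(invalid.sum())} invalid Strand value(s) with '.'",
            UserWarning,
            stacklevel=2,
        )
    cleaned = strand.where(~invalid, ".")
    # Every frame gets the same fixed categories, regardless of its content.
    return cleaned.astype(CategoricalDtype(list(VALID_STRANDS)))


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = normalize_strand(pd.Series(["+", "?", "-", "."]))

print(cleaned.tolist())
print([str(w.message) for w in caught])
```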

What do you think @marco-mariotti ?

@marco-mariotti
Member
marco-mariotti commented Jun 28, 2023 via email

@michaelJwilson
michaelJwilson commented Jun 29, 2023

That'd be great, thanks. Already spotted the rust version, looking forward to it!

@endrebak
Collaborator

#344

@xiucz
xiucz commented Sep 19, 2023

Hi, @marco-mariotti

I may have a similar problem:
[image: pic1]
[image: pic2]

When I try to annotate pic2 with pic1, it raises the error:

pic2.join(pic1, how = "left").drop(like="_b")
...

ValueError: Buffer dtype mismatch, expected 'const int64_t' but got 'int'

Can you give me some advice here? Thank you.
Best,
xiucz

@michaelJwilson

@xiucz I think you need pr.PyRanges(..., int64=True) when you initialise both tables.
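
The message says a compiled routine expected 64-bit integers. Whether that was the cause here is not established in the thread, but one way to rule it out at the pandas level is to cast the coordinate columns to int64 explicitly before building the ranges. The sketch below assumes coordinates parsed from a string column (the column name `Location` is hypothetical), which leaves `Start` and `End` as strings until cast.

```python
import pandas as pd

df = pd.DataFrame({"Location": ["15:31192889-32445405", "15:32019621-32445405"]})

# Splitting a string column yields string pieces, not integers.
df[["Chromosome", "Start", "End"]] = df["Location"].str.split(":|-", expand=True)

# Cast the coordinates to int64 explicitly.
df = df.astype({"Start": "int64", "End": "int64"})

print(df.dtypes[["Start", "End"]])
```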

@xiucz
xiucz commented Sep 20, 2023

@michaelJwilson
Thanks for your quick comment, it is really helpful!

@marco-mariotti
Member

Hi @xiucz, can you provide data to replicate the problem, and specify which version of pyranges you are using?
I think PyRanges now (0.0.129) has the type of all ints set to int64 (@endrebak can you confirm?), so I'd be surprised if @michaelJwilson's suggestion changed anything. The int64 argument is still there just for backward compatibility.

@xiucz
xiucz commented Sep 21, 2023

Hi, @marco-mariotti ,

My pyranges version is 0.0.120. Actually, I use the join method to annotate a pandas dataframe, so it may be a little hard to reproduce the data, but here is my code:

# Split one column into three columns: 'Chromosome', 'Start', 'End'
ClinGenrecurrentCNV[['Chromosome', 'Start', 'End']] = ClinGenrecurrentCNV['Location on GRCh37'].str.split(':|-', expand=True)
df1[['Chromosome', 'Start', 'End', 'Type']] =  df["AnnotSV_ID"].str.split('_', n=3, expand=True)

# Translate it into a PR object.
ClinGenrecurrentCNV_pr = pr.PyRanges(ClinGenrecurrentCNV, int64=True)
df1_pr = pr.PyRanges(df1, int64=True)

# ANNOTATE,  it works^_^.
df1 = df1_pr.join(ClinGenrecurrentCNV_pr, how = "left").drop(like="_b").as_df()

However, is it possible to collapse the result, i.e. merge rows and combine the annotation columns (something like a groupby followed by a combine)?


Chromosome | Start    | End      | anno1                | anno2                   | anno3
15         | 30430038 | 32895918 | 15:31192889-32445405 | 3 (Sufficient Evidence) | 1 (Little Evidence)
15         | 30430038 | 32895918 | 15:32019621-32445405 | 3 (Sufficient Evidence) | 40 (Dosage Sensitivity Unlikely)

#groupby and merge by ";"

Chromosome | Start    | End      | anno1                                     | anno2                                            | anno3
15         | 30430038 | 32895918 | 15:31192889-32445405;15:32019621-32445405 | 3 (Sufficient Evidence); 3 (Sufficient Evidence) | 1 (Little Evidence); 40 (Dosage Sensitivity Unlikely)

Best,
xiucz
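
The collapse described above can be sketched with a plain pandas groupby on the DataFrame returned by `as_df()`. This is a pandas-only suggestion under the assumption that the key columns are Chromosome, Start and End and that every other column should be joined with ";"; it does not use any pyranges-specific method.

```python
import pandas as pd

annotated = pd.DataFrame({
    "Chromosome": ["15", "15"],
    "Start": [30430038, 30430038],
    "End": [32895918, 32895918],
    "anno1": ["15:31192889-32445405", "15:32019621-32445405"],
    "anno2": ["3 (Sufficient Evidence)", "3 (Sufficient Evidence)"],
    "anno3": ["1 (Little Evidence)", "40 (Dosage Sensitivity Unlikely)"],
})

# One row per interval; each annotation column is joined with ";".
collapsed = (
    annotated
    .groupby(["Chromosome", "Start", "End"], as_index=False, sort=False)
    .agg(lambda col: ";".join(col.astype(str)))
)

print(collapsed.to_string(index=False))
```

To drop repeated annotations such as the duplicated anno2 value, the lambda could join `col.astype(str).unique()` instead.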

@marco-mariotti
Member

Please test this in the latest pyranges version, 0.0.129, and let us know if the problem persists.

@xiucz
xiucz commented Sep 21, 2023

@marco-mariotti
I updated to the latest version, and I can now join them without int64=True.

>>> pyranges.__version__
'0.0.129'

Best,
xiucz
