left join broken with ValueError and TypeError #342

hardingnj · 2023-06-14T08:45:44Z

Thanks for the package and all the hard work that's gone into it. I came across this issue:

f1 = pr.from_dict(
  {'Chromosome': ['chr1', 'chr1', 'chr1'], 
   'Start': [3, 8, 12],
   'Strand': ['+', '-', '+'],
   'End': [4, 9, 13], 
   'Name': ['interval1', 'interval3', 'interval2']
}, int64=True)

f2 = pr.from_dict(
  {
    'Chromosome': ['chr1'], 
    'Start': [6], 
    'Strand': ['-'],
    'End': [10], 
    'GeneName': ['test']
}, int64=True)

I'm trying to annotate f1 with f2.

f3 = f1.join(f2, strandedness="same", slack=0)

This works as expected.

f3 = f1.join(f2, how="left", strandedness="same", slack=0)

This fails with an error:
TypeError: Cannot set a Categorical with another, without identical categories

Version is 0.0.125.


hardingnj · 2023-06-14T08:52:45Z

Relevant:

f2 = pr.from_dict(
  {
    'Chromosome': ['chr1', 'chr1'],
    'Start': [6, 23],
    'Strand': ['-', '+'],
    'End': [10, 24],
    'GeneName': ['test', 'test2']
})

i.e. adding an extra row to f2 with a "+" strand, so that both strands are present, fixes the issue. I guess somewhere there is a call to pd.Categorical that should explicitly set "+"/"-" as categories.
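A minimal pandas-only sketch of the suspected cause, independent of pyranges internals (which are not shown in this thread): when categories are inferred from the data, a column holding only "-" gets a different categorical dtype from one holding both strands, and assigning one into the other raises a TypeError like the one above. Declaring the categories explicitly avoids it. All variable names are illustrative.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

# Categories inferred from the data: the two dtypes differ.
plus_minus = pd.Categorical(["+", "-", "+"])  # categories: '+', '-'
minus_only = pd.Categorical(["-"])            # categories: '-'

try:
    plus_minus[1:2] = minus_only  # categories do not match
    mismatch_error = None
except TypeError as exc:
    mismatch_error = exc

# Categories declared up front: both sides share one dtype.
strand_dtype = CategoricalDtype(["+", "-"])
left = pd.Categorical(["+", "-", "+"], dtype=strand_dtype)
right = pd.Categorical(["-"], dtype=strand_dtype)
left[1:2] = right  # succeeds because the dtypes are identical

print(type(mismatch_error).__name__)
print(left.dtype == right.dtype)
```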

michaelJwilson · 2023-06-14T10:03:16Z

Replacing:

pyranges/pyranges/methods/init.py

Line 15 in 4583971

"Strand": "category",

with

from pandas.api.types import CategoricalDtype

"Strand": CategoricalDtype(["-", "+"], ordered=True),

and

pyranges/pyranges/methods/init.py

Line 41 in 4583971

df["Strand"] = df.Strand.cat.remove_unused_categories()

with

# df["Strand"] = df.Strand.cat.remove_unused_categories()

is sufficient to produce the desired result:

>> f3 = f1.join(f2, how="left", strandedness="same", slack=0)

+--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------+
| Chromosome   |     Start | Strand       |       End | Name       |   Start_b | Strand_b     |     End_b | GeneName   |
| (category)   |   (int64) | (category)   |   (int64) | (object)   |   (int64) | (category)   |   (int64) | (object)   |
|--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------|
| chr1         |         3 | +            |         4 | interval1  |        -1 | +            |        -1 | -1         |
| chr1         |        12 | +            |        13 | interval2  |        -1 | +            |        -1 | -1         |
| chr1         |         8 | -            |         9 | interval3  |         6 | -            |        10 | test       |
+--------------+-----------+--------------+-----------+------------+-----------+--------------+-----------+------------+

Happy to create the PR if this is acceptable.
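For context on why both changes are needed together, here is a small sketch using only pandas (not the pyranges code itself): a fixed CategoricalDtype keeps both strands declared even when the data contains only one, while remove_unused_categories() prunes the declaration back to what occurs in the data. The frame below is a stand-in for f2.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

strand_dtype = CategoricalDtype(["-", "+"], ordered=True)

# A frame that happens to contain only the "-" strand, like f2 above.
df = pd.DataFrame({"Strand": ["-"]}).astype({"Strand": strand_dtype})

# With the fixed dtype, both strands remain declared categories.
fixed_categories = list(df["Strand"].cat.categories)

# Pruning unused categories leaves only the strand present in the data.
pruned_categories = list(
    df["Strand"].cat.remove_unused_categories().cat.categories
)

print(fixed_categories)
print(pruned_categories)
```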

hardingnj · 2023-06-23T14:22:28Z

Thanks. This seems a reasonable solution @endrebak ?

endrebak · 2023-06-26T10:28:35Z

I'll try to fix this today. Will have to check on different pandas versions :D

michaelJwilson · 2023-06-26T14:20:56Z

Thanks!

endrebak · 2023-06-27T13:57:42Z

Your suggestion seems to not work with bad strands like ".".

Will try to fix.
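One plausible reason, shown here as a pandas-only sketch and not confirmed against the pyranges code: casting to a CategoricalDtype that lists only "-" and "+" silently turns any other value, such as ".", into a missing value.

```python
import pandas as pd
from pandas.api.types import CategoricalDtype

two_strand_dtype = CategoricalDtype(["-", "+"], ordered=True)

# "." is not a declared category, so the cast replaces it with NaN.
strand = pd.Series(["+", ".", "-"]).astype(two_strand_dtype)
n_missing = int(strand.isna().sum())

print(strand.isna().tolist())
print(n_missing)
```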

michaelJwilson · 2023-06-27T14:02:30Z

"Strand": CategoricalDtype([".", "-", "+"], ordered=True),

was ruled out?

endrebak · 2023-06-27T14:07:31Z

https://github.com/pyranges/pyranges/pull/new/342

The tests seem to pass and your example works?

Does the commit work for you?

Will have to wait for C/I before uploading to PyPI.


endrebak · 2023-06-27T15:40:50Z

Started the pipeline to push the changes to pypi: https://github.com/pyranges/pyranges/actions/runs/5391831179

michaelJwilson · 2023-06-28T11:05:26Z

Sorry, this example still fails with v0.0.128. At this point, changing

dtypes["Strand"] = CategoricalDtype(categories=df["Strand"].drop_duplicates().to_list())

to

dtypes["Strand"] = CategoricalDtype([".", "-", "+"], ordered=True)

continues to pass. As suggested by the original error,

TypeError: Cannot set a Categorical with another, without identical categories

we basically need to force the Strand categories to be the same irrespective of
the passed data frame, otherwise this join will fail.

Thanks for your help!

endrebak · 2023-06-28T13:49:34Z

Ah, thanks! I'm working on implementing genomicranges for polars so my pandas knowledge is slipping away.

What I do not like about your solution is that the Strand column might have other values besides (., +, -) and your solution seems to make only those three valid.
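A possible alternative that keeps arbitrary Strand values, sketched with pandas only: derive one shared dtype from the categories present in both frames at join time, using pandas.api.types.union_categoricals. This is an illustration of the idea, not what pyranges does; the frames and names are made up.

```python
import pandas as pd
from pandas.api.types import union_categoricals

# Two frames whose Strand categories were inferred independently.
left = pd.DataFrame({"Strand": ["+", "-", "+"]}).astype({"Strand": "category"})
right = pd.DataFrame({"Strand": ["-", "."]}).astype({"Strand": "category"})

# Build a dtype holding every category seen on either side.
shared_dtype = union_categoricals([left["Strand"], right["Strand"]]).dtype

# Cast both columns to the shared dtype; values are preserved.
left["Strand"] = left["Strand"].astype(shared_dtype)
right["Strand"] = right["Strand"].astype(shared_dtype)

print(list(shared_dtype.categories))
print(left["Strand"].dtype == right["Strand"].dtype)
```

The trade-off is that the dtype depends on the pair of frames being joined, so it has to be recomputed for each join.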

endrebak · 2023-06-28T13:51:14Z

Perhaps we should replace all invalid values with . and warn the user?

What do you think @marco-mariotti ?
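A rough sketch of what that proposal could look like in plain pandas. The function name normalize_strand and the constant names are hypothetical and do not exist in pyranges; the set of valid values is assumed to be "+", "-" and ".".

```python
import warnings

import pandas as pd
from pandas.api.types import CategoricalDtype

VALID_STRANDS = ["+", "-", "."]  # assumed set of valid strand values
STRAND_DTYPE = CategoricalDtype(VALID_STRANDS)


def normalize_strand(strand: pd.Series) -> pd.Series:
    """Replace anything outside VALID_STRANDS with '.' and warn once."""
    strand = strand.astype(object)
    invalid = ~strand.isin(VALID_STRANDS)
    if invalid.any():
        bad = sorted(set(map(str, strand[invalid])))
        warnings.warn(
            f"Replacing invalid Strand values {bad} with '.'", UserWarning
        )
    # Keep valid values, substitute '.' elsewhere, then apply the fixed dtype.
    return strand.where(~invalid, ".").astype(STRAND_DTYPE)


with warnings.catch_warnings(record=True) as caught:
    warnings.simplefilter("always")
    cleaned = normalize_strand(pd.Series(["+", "?", "-"]))

print(cleaned.tolist())
print(len(caught))
```

Because every frame ends up with the same fixed dtype, the category mismatch cannot occur, at the cost of losing the original invalid values.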

marco-mariotti · 2023-06-28T15:41:00Z

I agree with Endre's proposal.


michaelJwilson · 2023-06-29T07:54:15Z

That'd be great, thanks. Already spotted the rust version, looking forward to it!

endrebak · 2023-06-29T14:14:23Z

#344

xiucz · 2023-09-19T08:27:24Z

Hi @marco-mariotti,

I may have a similar problem:

pic1

pic2

When I try to annotate pic2 with pic1, it returns the error:

pic2.join(pic1, how = "left").drop(like="_b")
...

ValueError: Buffer dtype mismatch, expected 'const int64_t' but got 'int'

Can you give me some advice here? Thank you.
Best,
xiucz

michaelJwilson · 2023-09-19T09:17:29Z

@xiucz I think you need pr.PyRanges(..., int64=True) when you initialise both tables.
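If the positions reach pyranges as a narrower integer type (an assumption; the thread does not show the input dtypes), the columns can also be cast explicitly in pandas before building the PyRanges object. A sketch with made-up data:

```python
import numpy as np
import pandas as pd

# Example frame whose coordinates are stored as 32-bit integers.
df = pd.DataFrame(
    {
        "Chromosome": ["15", "15"],
        "Start": np.array([30430038, 31192889], dtype="int32"),
        "End": np.array([32895918, 32445405], dtype="int32"),
    }
)

# Cast the coordinate columns to 64-bit integers.
df = df.astype({"Start": "int64", "End": "int64"})

print(df.dtypes["Start"], df.dtypes["End"])
```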

xiucz · 2023-09-20T03:08:34Z

@michaelJwilson
Thanks for your quick comment, it is really helpful!

marco-mariotti · 2023-09-20T12:19:01Z

Hi @xiucz, can you provide data to replicate the problem, and specify which version of pyranges you are using?
I think PyRanges now (0.0.129) sets the type of all ints to int64 (@endrebak, can you confirm?), so I'd be surprised if @michaelJwilson's suggestion changed anything. The int64 argument is still there just for backward compatibility.

xiucz · 2023-09-21T06:53:34Z

Hi @marco-mariotti,

My pyranges version is 0.0.120. Actually, I use the join method to annotate a pandas dataframe, so it may be a little hard to share the data, but here is my code:

# Split one column into three columns: 'Chromosome', 'Start', 'End'
ClinGenrecurrentCNV[['Chromosome', 'Start', 'End']] = ClinGenrecurrentCNV['Location on GRCh37'].str.split(':|-', expand=True)
df1[['Chromosome', 'Start', 'End', 'Type']] =  df["AnnotSV_ID"].str.split('_', n=3, expand=True)

# Translate it into a PR object.
ClinGenrecurrentCNV_pr = pr.PyRanges(ClinGenrecurrentCNV, int64=True)
df1_pr = pr.PyRanges(df1, int64=True)

# ANNOTATE,  it works^_^.
df1 = df1_pr.join(ClinGenrecurrentCNV_pr, how = "left").drop(like="_b").as_df()

However, is it possible to collapse the result (merge the rows and combine the annotation columns, something like a groupby followed by a join)?

Chromosome	Start	End	anno1	anno2	anno3
15	30430038	32895918	15:31192889-32445405	3 (Sufficient Evidence)	1 (Little Evidence)
15	30430038	32895918	15:32019621-32445405	3 (Sufficient Evidence)	40 (Dosage Sensitivity Unlikely)

# Desired result after groupby, merging values with ";"

Chromosome	Start	End	anno1	anno2	anno3
15	30430038	32895918	15:31192889-32445405;15:32019621-32445405	3 (Sufficient Evidence); 3 (Sufficient Evidence)	1 (Little Evidence); 40 (Dosage Sensitivity Unlikely)

Best,
xiucz
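The collapse asked about above can be sketched in plain pandas after calling as_df(), assuming the annotation columns are strings and the rows to merge share the same Chromosome, Start and End. The column names follow the example tables; the frame is rebuilt by hand here.

```python
import pandas as pd

# Stand-in for the data frame returned by .as_df() after the left join.
joined = pd.DataFrame(
    {
        "Chromosome": ["15", "15"],
        "Start": [30430038, 30430038],
        "End": [32895918, 32895918],
        "anno1": ["15:31192889-32445405", "15:32019621-32445405"],
        "anno2": ["3 (Sufficient Evidence)", "3 (Sufficient Evidence)"],
        "anno3": ["1 (Little Evidence)", "40 (Dosage Sensitivity Unlikely)"],
    }
)

# One row per interval; each annotation column is joined with ";".
collapsed = (
    joined
    .groupby(["Chromosome", "Start", "End"], as_index=False, sort=False)
    .agg(lambda values: ";".join(values.astype(str)))
)

print(collapsed.to_string(index=False))
```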

marco-mariotti · 2023-09-21T07:43:07Z

Please test this in the latest pyranges version, 0.0.129, and let us know if the problem persists.

xiucz · 2023-09-21T08:50:24Z

@marco-mariotti
I updated to the latest version; I can now join them without int64=True.

>>> pyranges.__version__
'0.0.129'

Best,
xiucz

endrebak pushed a commit that referenced this issue Jun 27, 2023

0.0.128 Fix left join category error #342

85254f1

endrebak pushed a commit that referenced this issue Jun 27, 2023

0.0.128 Fix left join category error #342

151912a

endrebak added a commit that referenced this issue Jun 27, 2023

0.0.128 Fix left join category error #342 (#343)

c1cb6ca

Co-authored-by: endre bakken stovner <endrebakkenstovner@endres-MacBook-Air.local>

marco-mariotti closed this as completed Sep 21, 2023

endrebak mentioned this issue Feb 17, 2024

Unexpected behaviour with strandedness of pyranges object #374

Open
