Multiprocessing is Experiencing Deadlocks in GraphFrame.filter · Issue #372 · hatchet/hatchet · GitHub

Multiprocessing is Experiencing Deadlocks in GraphFrame.filter #372


Open
JBlaschke opened this issue Apr 23, 2021 · 3 comments
Labels
area: graphframe — PRs and issues involving Hatchet's core GraphFrame data structure and associated classes
priority: high — High-priority issues and PRs
type: bug — Identifies bugs in issues and bug fixes in PRs
Milestone: v1.4.1

Comments

@JBlaschke
Contributor

Not quite sure why this is happening yet, but for some reason GraphFrame.filter is hanging here:

returned_frames.append(queue.get())

I see all processes putting their results in the queue and returning, but things are still hanging in the queue -- the problem seems to be a DataFrame pickling issue. After adding a sleep(1) at the end of parallel_apply, I can see the error:

  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/Users/blaschke/local/virtualenv/py3/lib/python3.8/site-packages/dill/_dill.py", line 941, in save_module_dict
    StockPickler.save_dict(pickler, obj)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 971, in save_dict
    self._batch_setitems(obj.items())
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 997, in _batch_setitems
    save(v)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 560, in save
    f(self, obj)  # Call unbound method with explicit self
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 855, in save_str
    self.write(SHORT_BINUNICODE + pack("<B", n) + encoded)
  File "/usr/local/opt/python@3.8/Frameworks/Python.framework/Versions/3.8/lib/python3.8/pickle.py", line 243, in write
    return self.current_frame.write(data)
RecursionError: maximum recursion depth exceeded while calling a Python object

Any ideas from anyone who knows more about DataFrames and pickling? I am testing on a Python cProfile profile.

@ilumsden
Contributor

@JBlaschke how large is the Graph that you're testing with? When Pandas serializes into any format (e.g., pickle, HDF5, SQL, etc.), it will serialize any cells with Object type data with pickle. This could become a problem with the index of the DataFrame because it contains nodes in the Graph. Under normal circumstances, each of these nodes will contain two lists: one of its parent nodes and one of its children nodes. So, when Pandas tries to pickle a node, it will recursively try to pickle each child and parent node. As a result, for large or highly connected graphs, it would be pretty easy to reach Python's (relatively small) recursion depth limit.
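The recursion blow-up is easy to reproduce with a toy node class (this Node is illustrative, not Hatchet's actual node type): each node holds references to its parents and children, so pickling one node drags in the whole connected component, several stack frames per node.

```python
import pickle
import sys

# Hypothetical stand-in for a graph node: parent and child references
# make pickling traverse the entire connected component recursively.
class Node:
    def __init__(self):
        self.parents = []
        self.children = []

def make_chain(length):
    """Build a parent->child chain `length` nodes deep."""
    head = Node()
    current = head
    for _ in range(length):
        child = Node()
        child.parents.append(current)
        current.children.append(child)
        current = child
    return head

# A chain much deeper than the recursion limit exhausts pickle's stack.
deep = make_chain(sys.getrecursionlimit() * 2)
try:
    pickle.dumps(deep)
    failed = False
except RecursionError:
    failed = True
print("RecursionError raised:", failed)
```

Pandas hits the same wall when the DataFrame index contains such nodes, which matches the RecursionError in the traceback above.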

This is why, in developing #272 for saving GraphFrames to the filesystem, I disconnect the graph and encode that info with integers in a copy of the DataFrame.
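A rough sketch of that idea (the Node class and id scheme here are illustrative, not the actual #272 implementation): replace node objects in the DataFrame index with integer ids and record the edges as integer pairs, so serialization never follows parent/child references.

```python
import pandas as pd

class Node:
    def __init__(self, name):
        self.name = name
        self.children = []

root = Node("main")
leaf = Node("solve")
root.children.append(leaf)

nodes = [root, leaf]
node_to_id = {node: i for i, node in enumerate(nodes)}

# Connectivity encoded as integer pairs instead of object references.
edges = [(node_to_id[p], node_to_id[c]) for p in nodes for c in p.children]

df = pd.DataFrame({"time": [1.5, 0.7]}, index=pd.Index(nodes, name="node"))

# A copy of the DataFrame with the object index swapped for integer ids;
# this copy pickles without any graph traversal.
flat = df.copy()
flat.index = pd.Index([node_to_id[n] for n in df.index], name="node")

print(edges)  # → [(0, 1)]
```

The graph can be reconstructed on load from the edge list plus the id-to-node mapping.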

@pearce8 @bhatele, we should probably discuss this during our next meeting.

@ilumsden
Contributor
ilumsden commented Apr 27, 2021

This library could be useful for fixing this issue: https://github.com/jmcarpenter2/swifter

It's a Pandas plugin that automatically decides if vectorization, Dask, or Pandas is faster and runs the apply function using that method. The syntax is pretty nice too. Instead of df.apply(func), Swifter's syntax is df.swifter.apply(func).

The only question is: do we want to add more dependencies?

@ilumsden
Contributor

Actually, we probably don't want to use that package. It requires Pandas 1.0.0 or higher, and that version of Pandas doesn't officially support Python 2.7 (it might work, but it's not guaranteed).

@ilumsden ilumsden added the bug label Apr 30, 2021
@slabasan slabasan added the area: graphframe, priority: high, and type: bug labels and removed the bug label Jan 2, 2022
@ocnkr ocnkr added this to the v1.4.0 milestone Nov 7, 2022
@bhatele bhatele modified the milestones: v1.4.0, v1.4.1 May 24, 2024