8000 Collect data resources together as a "data package" by d33bs · Pull Request #41 · WayScience/mitocheck_data · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Collect data resources together as a "data package" #41

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 8 commits into from
May 17, 2024

Conversation

d33bs
Copy link
Member
@d33bs d33bs commented Apr 5, 2024

This PR collects the data resources associated with this project as a "data package" for a unified way to query and explore the sources, intermediary data, or findings. This work focuses on adding the data from steps as tables within a LanceDB dataset. LanceDB is formed of one or many Lance tables and includes many other features which may be beneficial for research purposes.

During development, I struggled to recreate the Conda environment on my Mac, and found there might be additional configuration required. As a result, I created environment isolation with a pyproject.toml and experimented with a Dockerfile-based approach for image frame extraction as a way of demonstrating capabilities within Lance. I also found that suggested practices about using IDR have perhaps changed from Aspera to FTP, and made relevant changes here as well.

In order to store image data within a Lance table I used tifffile to parse the data as a numpy array and awkward to store the multidimensional array in an Arrow-compatible (and as a result, Lance compatible) format. I found awkward is one of few solutions which use custom data type classes to help work around existing limitations surrounding how multidimensional array values may be stored within Arrow data structures (tensors and other types seem to have challenges at the moment). I also explored huggingface/datasets which uses a similar approach, but opted for awkward due to greater personal familiarity (unsure if performance / other aspects might impact decisions here).

Copy link
Member
@jenna-tomkinson jenna-tomkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is a really cool PR! I added a lot of questions to clarify these changes more. My overall comment is I would recommend changing from just python files to notebooks converted into python scripts to follow the same formatting of this repo.

I added an approval but looks like Greg also needs to review. Let me know if you have any questions about my comments! 😄

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Some overall comments on this file:

  1. This file has a lot of functions defined in it. Does it make sense to create a separate utility file to make this file more simplified?
  2. Same comment as the other python file regarding use of notebooks. I also think it makes sense to use notebooks converted to python scripts since I do believe that is the same formatting done in the rest of the repo. From my perspective, it makes sense to keep to the same format unless there is a plan to refactor this repo.

Copy link
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for these comments. Replying below:

  1. This file has a lot of functions defined in it. Does it make sense to create a separate utility file to make this file more simplified?

These functions are only used in one spot and are specific to the structure of this project. If we move the functions into another file, we add complexity in the form of additional abstractions without the benefit of decoupled implementation. I feel it doesn't make sense for now (please advise if I interpreted this incorrectly).

  1. Same comment as the other python file regarding use of notebooks. I also think it makes sense to use notebooks converted to python scripts since I do believe that is the same formatting done in the rest of the repo. From my perspective, it makes sense to keep to the same format unless there is a plan to refactor this repo.

I didn't write these files with a Jupyter kernel and as a result we may be adding complexity where I feel there's little benefit (additional files increase the amount of work someone must do to understand things). The other directories include Python files and may rely on executing them, so I feel the work is still compatible with other areas. Could I ask for further thoughts surrounding the benefits of copying this code into a file which wasn't used for development or implementation?

Copy link
Member
@jenna-tomkinson jenna-tomkinson May 15, 2024

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thanks for the reply! I am thinking that my comments come from something we have discussed before of "Analysis versus Software" repos.

This mitocheck_data repository to me is an analysis repo where we are using software to develop pipelines. I think my confusion was that I assumed the changes being made were to improve IDR_stream the software, which now I am seeing that the goal of this is to collect the Mitocheck data after IDR_stream processing.

I think your README additions clarify this perfectly! Apologies for my misunderstanding!

@d33bs
Copy link
Member Author
d33bs commented May 15, 2024

Thanks @jenna-tomkinson for your great review and thoughts! I've made changes which I feel address the comments (or have made justifications as replies to your comments). When you have a moment, would you mind giving this another review to ensure this is merge-ready?

In my mind, this PR doesn't mean the work is completed but is one step towards "piecemeal growth". Next steps after a merge would I believe entail creating IC images based on the original frames extracted (and storing these in LanceDB), then scaling the work to the entirety of the frames/images (at the moment this is artificially limited to assist with prototyping). These would I feel be best served in separate PR's to help keep the focus isolated to just those aspects.

@d33bs d33bs requested a review from jenna-tomkinson May 15, 2024 16:48
Copy link
Member
@jenna-tomkinson jenna-tomkinson left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Thank you @d33bs for clarifying and blossoming this conversation! I now fully realize my misunderstanding when reviewing that this PR is focusing on merging the IDR_stream output into something that the audience can use for further analysis. I think my perspective on the Juypter notebooks has changed as it makes sense that this 5th module is not related to analysis so it doesn't have to follow the same format. Amazing job on clarifying all of this in the README! Looks ready to merge to me!

Copy link
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This README perfectly clarifies the goal, wonderful job!

I fully understand the point is not related to IDR_stream but to make the output of IDR_stream better accessible.

@d33bs
Copy link
Member Author
d33bs commented May 17, 2024

Thanks @jenna-tomkinson ! Merging this in with a focus towards #43 , #44, #45 as next steps (in this order).

@d33bs d33bs merged commit 71c68bc into WayScience:main May 17, 2024
@d33bs d33bs deleted the data-packaging branch May 17, 2024 17:03
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
Labels
None yet
Projects
None yet
Development

Successfully merging this pull request may close these issues.

2 participants
0