Collect data resources together as a "data package" #41

d33bs · 2024-04-05T22:25:47Z

This PR collects the data resources associated with this project as a "data package", providing a unified way to query and explore the sources, intermediary data, and findings. This work focuses on adding the data from each step as tables within a LanceDB dataset. A LanceDB dataset is formed of one or more Lance tables and includes many other features which may be beneficial for research purposes.

During development, I struggled to recreate the Conda environment on my Mac, and found there might be additional configuration required. As a result, I isolated the environment with a pyproject.toml and experimented with a Dockerfile-based approach for image frame extraction as a way of demonstrating capabilities within Lance. I also found that the suggested practice for using IDR has perhaps changed from Aspera to FTP, and made relevant changes here as well.

To store image data within a Lance table, I used tifffile to parse the data as a numpy array and awkward to store the multidimensional array in an Arrow-compatible (and as a result, Lance-compatible) format. I found awkward is one of the few solutions which use custom data type classes to help work around existing limitations in how multidimensional array values may be stored within Arrow data structures (tensors and other types seem to have challenges at the moment). I also explored huggingface/datasets, which uses a similar approach, but opted for awkward due to greater personal familiarity (I'm unsure whether performance or other aspects might impact decisions here).
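As a rough illustration of why multidimensional values need a workaround in a columnar format: one generic approach is to store each image as a flat, row-major list of pixel values alongside its shape, then rebuild the nested structure on read. The sketch below uses only the Python standard library. It is not the awkward-based approach used in this PR, and the function names and record fields (`to_record`, `from_record`, `name`, `shape`, `pixels`) are hypothetical.

```python
from math import prod
from typing import Any, Dict, List, Sequence


def flatten(nested: Any, shape: Sequence[int]) -> List[Any]:
    """Flatten a regular nested list with the given shape into row-major order."""
    if len(shape) == 1:
        return list(nested)
    flat: List[Any] = []
    for item in nested:
        flat.extend(flatten(item, shape[1:]))
    return flat


def unflatten(flat: Sequence[Any], shape: Sequence[int]) -> List[Any]:
    """Rebuild a nested list of the given shape from row-major flat values."""
    if len(shape) == 1:
        return list(flat)
    step = prod(shape[1:])  # number of values in each sub-array
    return [
        unflatten(flat[i * step:(i + 1) * step], shape[1:])
        for i in range(shape[0])
    ]


def to_record(name: str, frame: Any, shape: Sequence[int]) -> Dict[str, Any]:
    """Build a flat, column-friendly record (hypothetical layout) for one frame."""
    return {"name": name, "shape": list(shape), "pixels": flatten(frame, shape)}


def from_record(record: Dict[str, Any]) -> List[Any]:
    """Recover the nested frame from a record produced by to_record."""
    return unflatten(record["pixels"], record["shape"])


# A tiny 2x3 "image" round-trips through the flat representation.
frame = [[1, 2, 3], [4, 5, 6]]
record = to_record("frame_0", frame, (2, 3))
restored = from_record(record)
```

The trade-off with this kind of layout is that the reader must know to apply the shape on read, which is part of what custom data type classes aim to handle automatically.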

jenna-tomkinson

This is a really cool PR! I added a lot of questions to help clarify these changes. My overall comment is that I would recommend changing from plain Python files to notebooks converted into Python scripts, to follow the same formatting as the rest of this repo.

I added an approval, but it looks like Greg also needs to review. Let me know if you have any questions about my comments! 😄

1.idr_streams/stream_files/idr0013-screenA-plates-w-colnames.tsv

5.data_packaging/Dockerfile.bfconvert

5.data_packaging/README.md

5.data_packaging/constants.py

5.data_packaging/create_lancedb.py

jenna-tomkinson · 2024-04-29T15:30:05Z

5.data_packaging/gather_images.py

Some overall comments on this file:

This file has a lot of functions defined in it. Does it make sense to create a separate utility file to make this file more simplified?

Same comment as the other python file regarding use of notebooks. I also think it makes sense to use notebooks converted to python scripts since I do believe that is the same formatting done in the rest of the repo. From my perspective, it makes sense to keep to the same format unless there is a plan to refactor this repo.

Thanks for these comments. Replying below:

> This file has a lot of functions defined in it. Does it make sense to create a separate utility file to make this file more simplified?

These functions are only used in one spot and are specific to the structure of this project. If we move the functions into another file, we add complexity in the form of additional abstractions without the benefit of decoupled implementation. I feel it doesn't make sense for now (please advise if I interpreted this incorrectly).

> Same comment as the other python file regarding use of notebooks. I also think it makes sense to use notebooks converted to python scripts since I do believe that is the same formatting done in the rest of the repo. From my perspective, it makes sense to keep to the same format unless there is a plan to refactor this repo.

I didn't write these files with a Jupyter kernel and as a result we may be adding complexity where I feel there's little benefit (additional files increase the amount of work someone must do to understand things). The other directories include Python files and may rely on executing them, so I feel the work is still compatible with other areas. Could I ask for further thoughts surrounding the benefits of copying this code into a file which wasn't used for development or implementation?

Thanks for the reply! I am thinking that my comments come from something we have discussed before of "Analysis versus Software" repos.

To me, this mitocheck_data repository is an analysis repo where we are using software to develop pipelines. I think my confusion came from assuming that the changes were meant to improve the IDR_stream software; I now see that the goal is to collect the Mitocheck data after IDR_stream processing.

I think your README additions clarify this perfectly! Apologies for my misunderstanding!

5.data_packaging/infer_schema.py

5.data_packaging/schema/4.analyze_data.results.compiled_2D_umap_embeddings.csv.arrow.schema.txt

README.md

d33bs · 2024-05-15T16:48:18Z

Thanks @jenna-tomkinson for your great review and thoughts! I've made changes which I feel address the comments (or have made justifications as replies to your comments). When you have a moment, would you mind giving this another review to ensure this is merge-ready?

In my mind, this PR doesn't mean the work is completed but is one step towards "piecemeal growth". Next steps after a merge would, I believe, entail creating IC images based on the original frames extracted (and storing these in LanceDB), then scaling the work to the entirety of the frames/images (at the moment this is artificially limited to assist with prototyping). I feel these would be best served by separate PRs to help keep the focus isolated to just those aspects.

jenna-tomkinson

Thank you @d33bs for clarifying and expanding on this conversation! I now fully realize my misunderstanding when reviewing: this PR focuses on merging the IDR_stream output into something that the audience can use for further analysis. My perspective on the Jupyter notebooks has changed, as it makes sense that this 5th module is not related to analysis and so doesn't have to follow the same format. Amazing job on clarifying all of this in the README! Looks ready to merge to me!

jenna-tomkinson · 2024-05-15T20:14:05Z

5.data_packaging/README.md

This README perfectly clarifies the goal, wonderful job!

I fully understand now that the point is not to change IDR_stream itself but to make the output of IDR_stream more accessible.

d33bs · 2024-05-17T17:03:11Z

Thanks @jenna-tomkinson! Merging this in with a focus towards #43, #44, and #45 as next steps (in this order).

d33bs added 7 commits April 5, 2024 14:02

initial work towards step 5 · 90f0a3a

linting · 0da358d

add better context to readme · 5d19a79

spacing · 5428e10

initial movie frame image extraction work · 46f0021

migrate from dagger to docker · 9494d68

write tiff data to parquet and write to lancedb · 08fae69

d33bs requested review from gwaybio and jenna-tomkinson April 25, 2024 16:14

jenna-tomkinson approved these changes Apr 29, 2024

updates based on comments from review · ce73ad3

d33bs mentioned this pull request May 15, 2024

Generalize data packaging work for more flexible use #42

Open

d33bs requested a review from jenna-tomkinson May 15, 2024 16:48

jenna-tomkinson approved these changes May 15, 2024

d33bs marked this pull request as ready for review May 17, 2024 16:55

This was referenced May 17, 2024

Add IC image integration capabilities for LanceDB dataset #43

Closed

Scale LanceDB data ingest for entirety of data #44

Open

Handle CellPose mask export data as part of data packaging #45

Open

d33bs merged commit 71c68bc into WayScience:main May 17, 2024

d33bs deleted the data-packaging branch May 17, 2024 17:03
