Access to raw (non-normalized) data · Issue #50 · WayScience/mitocheck_data · GitHub

Access to raw (non-normalized) data #50


Open
gwaybio opened this issue Dec 13, 2024 · 7 comments

Comments

@gwaybio
Member
gwaybio commented Dec 13, 2024

@roshankern - would you have a way to provide access to the raw feature?

I believe these would be the datasets located in the following folder, which is currently ignored: https://github.com/WayScience/mitocheck_data/blob/main/.gitignore#L15

cc @hwarden162

@gwaybio
Member Author
gwaybio commented Dec 15, 2024

I've been in touch with @roshankern (Roshan, please feel free to clarify anything), and we do have access to this data 🎉 it is about 45GB. Therefore, @hwarden162 here's my proposal:

  • You go ahead and start working on the aggregated real-world example.
  • In the meantime, let us know exactly the transformations that you're applying. IIRC, you have implemented this in R, so perhaps this is as simple as pointing us to this function.
  • Roshan, Hugh (or someone else in the lab) can develop a crude Python implementation (this will be the first step toward a stable pycytominer-based implementation).
  • Roshan, with access to maple, can apply the feature transformation to the 45GB single-cell dataset.
  • Roshan, again through maple, can process the new, transformed data and generate a new dataset of the same format in https://github.com/WayScience/mitocheck_data/tree/main/3.normalize_data/normalized_data called training_data__ic__rotationalvariancetransform.csv.gz. This would be the final deliverable, Roshan, for you to be included as an author on the paper Hugh is working on. We only need to do this for the CellProfiler features.
  • [Optional] We will pass this new dataset through our LOIO evaluation. Roshan, you can also take this on. Otherwise, Greg will run this. We are testing the hypothesis that transforming the rotationally variant texture CP features improves LOIO performance.
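
A rough sketch of how the transformation step could be streamed over a file too large to hold in memory (such as the 45GB dataset). This is only an illustration under stated assumptions: the data is assumed to be a gzip-compressed CSV, CellProfiler columns are assumed to carry the CP__ prefix mentioned later in this thread, and `placeholder_transform` and `transform_file` are hypothetical names. The actual transformation is not defined here, so the placeholder returns values unchanged.

```python
import csv
import gzip


def placeholder_transform(cp_values: dict) -> dict:
    """Hypothetical stand-in for the real transformation.

    Receives the CP__ columns of one row (as strings) and must return
    values under the same column names. Here it changes nothing.
    """
    return cp_values


def transform_file(source_gz: str, destination_gz: str,
                   transform=placeholder_transform) -> int:
    """Stream a gzipped CSV row by row, apply `transform` to the CP__
    columns only, and write a new gzipped CSV with the same columns.

    Returns the number of data rows written.
    """
    n_rows = 0
    with gzip.open(source_gz, "rt", newline="") as src, \
            gzip.open(destination_gz, "wt", newline="") as dst:
        reader = csv.DictReader(src)
        writer = csv.DictWriter(dst, fieldnames=reader.fieldnames)
        writer.writeheader()
        for row in reader:
            # Assumption: CellProfiler features are prefixed with CP__.
            cp_values = {k: v for k, v in row.items() if k.startswith("CP__")}
            row.update(transform(cp_values))
            writer.writerow(row)
            n_rows += 1
    return n_rows
```

Processing one row at a time keeps memory use flat regardless of file size. A transformation that needs several rows or whole columns at once would need a different design.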

@roshankern
Member

This makes sense to me. @hwarden162 let me know if you are able to develop the crude python implementation or if some version of this already exists!

@gwaybio
Member Author
gwaybio commented Dec 16, 2024

For some additional context, we're hosting a rotation student next term, so ideally Roshan is able to perform his part before Feb 1

@hwarden162

This looks really good, thank you! I will write up the transformations and the extra features I am suggesting be blocklisted, and will then update you here.

@hwarden162
hwarden162 commented Dec 30, 2024

Back at work today after the Christmas break. Looking into this, is there a way for me to get a CSV of the first ~10 rows of the data (or, failing this, the column names and preferably their dtypes)? This would allow me to deliver code to transform the data with the minimum likelihood of errors.

As a side note, is there a preference on whether this transformation is in Python or R? I only ask as my data manipulation is a lot better in R, so it would be easier if it is the same for you, but I can give either if there are external factors that favor one over the other.

If I can get access to the head of the data then the turnaround on the code should be very fast. Thanks.
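
For reference, a preview like the one requested above can be produced without reading the whole file. A minimal sketch, assuming the data is stored as a gzip-compressed CSV; `write_csv_head` and its arguments are hypothetical names, not part of the repository:

```python
import csv
import gzip
from itertools import islice


def write_csv_head(source_gz: str, destination_csv: str, n_rows: int = 10) -> list:
    """Copy the header plus the first `n_rows` records of a gzipped CSV
    into a small uncompressed CSV. Returns the list of column names.

    The input is streamed, so only the requested rows are ever read.
    """
    with gzip.open(source_gz, "rt", newline="") as src, \
            open(destination_csv, "w", newline="") as dst:
        reader = csv.reader(src)
        writer = csv.writer(dst)
        header = next(reader)
        writer.writerow(header)
        writer.writerows(islice(reader, n_rows))
    return header
```

Using the `csv` reader rather than raw line slicing keeps quoted fields that contain line breaks intact.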

@roshankern
Member

> Looking into this, is there a way for me to get a csv of the first ~10 rows of the data (or failing this the column names and preferably their dtype)?

I have uploaded the relevant head of the illumination-corrected training data here. I believe that we will also need the transform applied to the illumination-corrected negative control data that we use to normalize the training data, so I have uploaded the head of that data here as well.

The data frame structures are nearly identical, but the training data also has the Mitocheck-assigned phenotypic class and object outlines. The general data frame column structures are:

  • Metadata columns (str, int)
  • CellProfiler feature columns prefixed with CP__ (int, float)
  • DeepProfiler feature columns prefixed with DP__ (float)

Given that the transform is only relevant for CellProfiler features, we can ignore the DeepProfiler features.
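
To illustrate the column structure above, here is a small sketch of separating the column groups by prefix. It assumes that every column without a CP__ or DP__ prefix is metadata; `split_columns` and the example column names in the usage note are hypothetical.

```python
def split_columns(columns):
    """Group column names into CellProfiler, DeepProfiler, and metadata.

    Assumption: anything without a CP__ or DP__ prefix is metadata.
    """
    groups = {"cp": [], "dp": [], "metadata": []}
    for name in columns:
        if name.startswith("CP__"):
            groups["cp"].append(name)
        elif name.startswith("DP__"):
            groups["dp"].append(name)
        else:
            groups["metadata"].append(name)
    return groups
```

For example, `split_columns(["Plate", "CP__feature_a", "DP__feature_b"])["cp"]` gives `["CP__feature_a"]`, which is the only group the transform would need to touch.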

> As a side note, is there a preference on if this transformation is in Python or R?

We would definitely prefer this transformation in Python for easier integration into this and the pycytominer project.

Thank you for your help with this! Let me know if anything is unclear or more resources would be helpful.

@roshankern
Member

Hello @hwarden162 👋. Any chance there is an update on this transformation? We would like to complete the analysis before the next rotational student starts (if possible). Thanks!
