Linking images to metadata #5

Open
shntnu opened this issue Oct 13, 2022 · 6 comments

Comments

@shntnu
Collaborator

shntnu commented Oct 13, 2022

Let's use this to discuss how we can link images to metadata

@dmikeando Presumably all that you are using as input right now is the images, but no other information about them.

To help you get started on how to link images to metadata, can you clarify how you get Source, Plate, Batch, Well, Site information from the images? Presumably from their paths?

@dmikeando
Collaborator

We currently use "load_data_with_illum.csv" to get image and flatfield filepaths and well/site metadata. Right now I can't access the path (it's within .../source/workspace/load_data_csv) within any of the sources. I'd like to see if I can get access and make sure that the same file is present across sources to use it as a consistent input.

@shntnu
Collaborator Author

shntnu commented Oct 13, 2022

@dmikeando That's great you're using load_data_with_illum.csv. Note that the paths will almost certainly need to be edited because the original locations were not on the S3 bucket.

Can you verify that?
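
As an illustration of the kind of path edit meant here, the sketch below swaps a leading prefix in every PathName_* column. Both prefixes and the sample row are hypothetical placeholders, not values from this thread; the real original locations and the bucket layout would need to be checked first.

```python
import csv
import io

# Hypothetical prefixes: neither value comes from this thread.
OLD_PREFIX = "/home/ubuntu/bucket/projects/example"
NEW_PREFIX = "s3://example-bucket/projects/example"


def rewrite_paths(rows, old_prefix, new_prefix):
    """Yield copies of rows with the prefix of every PathName_* column replaced."""
    for row in rows:
        fixed = dict(row)
        for column, value in row.items():
            if column.startswith("PathName_") and value.startswith(old_prefix):
                fixed[column] = new_prefix + value[len(old_prefix):]
        yield fixed


# Tiny in-memory stand-in for load_data_with_illum.csv (made-up values).
sample = io.StringIO(
    "Metadata_Plate,Metadata_Well,Metadata_Site,FileName_OrigDNA,PathName_OrigDNA\n"
    "P1,A01,1,img_dna.tiff,/home/ubuntu/bucket/projects/example/images/P1\n"
)
rewritten = list(rewrite_paths(csv.DictReader(sample), OLD_PREFIX, NEW_PREFIX))
print(rewritten[0]["PathName_OrigDNA"])
```

Only the PathName_* columns are touched; FileName_* and Metadata_* values pass through unchanged.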

@shntnu
Collaborator Author

shntnu commented Nov 8, 2022

In https://github.com/jump-cellpainting/datasets-private/issues/11#issuecomment-1304031422 we concluded that the embedder uses all of these columns from load_data_with_illum.csv:

Metadata_Plate
Metadata_Well
Metadata_Site
FileName_IllumAGP
FileName_IllumDNA
FileName_IllumER
FileName_IllumMito
FileName_IllumRNA
FileName_OrigAGP
FileName_OrigDNA
FileName_OrigER
FileName_OrigMito
FileName_OrigRNA
PathName_IllumAGP
PathName_IllumDNA
PathName_IllumER
PathName_IllumMito
PathName_IllumRNA
PathName_OrigAGP
PathName_OrigDNA
PathName_OrigER
PathName_OrigMito
PathName_OrigRNA

We now additionally include these two columns in the load_data_with_illum.csv files (https://github.com/jump-cellpainting/datasets-private/pull/23):

Metadata_Source
Metadata_Batch

What additional information do we need to link to cells?

I think we can get everything else the embedder needs by querying the SQLite files (and storing it as a parquet file). This is essentially what DeepProfiler does too.

backend_file=/Users/shsingh/work/projects/2015_Bray_GigaScience/workspace/backend/CDRP/25738/25738.sqlite
sqlite3 -header -csv ${backend_file} "select Image.Image_Metadata_Plate as Metadata_Plate,Image.Image_Metadata_Well as Metadata_Well,Image.Image_Metadata_Site as Metadata_Site,Nuclei.ObjectNumber,Nuclei.Nuclei_Location_Center_X,Nuclei.Nuclei_Location_Center_Y from Nuclei inner join Image on Nuclei.ImageNumber=Image.ImageNumber and Nuclei.TableNumber=Image.TableNumber limit 10"
Metadata_Plate,Metadata_Well,Metadata_Site,ObjectNumber,Nuclei_Location_Center_X,Nuclei_Location_Center_Y
25738,a01,1,1,505.621545403271,58.2637713855988
25738,a01,1,2,168.221237113402,152.262268041237
25738,a01,1,3,111.510355815189,178.495485926713
25738,a01,1,4,463.204137066444,254.365022983702
25738,a01,1,5,164.306306306306,290.675675675676
25738,a01,1,6,85.0708743971483,364.864856364018
25738,a01,2,1,137.892468787757,324.760471204188
25738,a01,2,2,317.639041437843,407.450490930271
25738,a01,2,3,381.467097170972,426.253536285363
25738,a01,3,1,417.722408026756,58.5016722408027

Rendered as a table:

| Metadata_Plate | Metadata_Well | Metadata_Site | ObjectNumber | Nuclei_Location_Center_X | Nuclei_Location_Center_Y |
|---|---|---|---|---|---|
| 25738 | a01 | 1 | 1 | 505.621545403271 | 58.2637713855988 |
| 25738 | a01 | 1 | 2 | 168.221237113402 | 152.262268041237 |
| 25738 | a01 | 1 | 3 | 111.510355815189 | 178.495485926713 |
| 25738 | a01 | 1 | 4 | 463.204137066444 | 254.365022983702 |
| 25738 | a01 | 1 | 5 | 164.306306306306 | 290.675675675676 |
| 25738 | a01 | 1 | 6 | 85.0708743971483 | 364.864856364018 |
| 25738 | a01 | 2 | 1 | 137.892468787757 | 324.760471204188 |
| 25738 | a01 | 2 | 2 | 317.639041437843 | 407.450490930271 |
| 25738 | a01 | 2 | 3 | 381.467097170972 | 426.253536285363 |
| 25738 | a01 | 3 | 1 | 417.722408026756 | 58.5016722408027 |
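
The same Image-Nuclei join can also be run from Python with the standard-library sqlite3 module. This is a minimal sketch, not an existing tool: the function name is hypothetical, and the in-memory tables are a toy stand-in holding only the columns the query touches, with rounded made-up coordinates.

```python
import sqlite3

# Same join as the sqlite3 command line above, without the LIMIT.
QUERY = """
SELECT
    Image.Image_Metadata_Plate AS Metadata_Plate,
    Image.Image_Metadata_Well  AS Metadata_Well,
    Image.Image_Metadata_Site  AS Metadata_Site,
    Nuclei.ObjectNumber,
    Nuclei.Nuclei_Location_Center_X,
    Nuclei.Nuclei_Location_Center_Y
FROM Nuclei
INNER JOIN Image
    ON Nuclei.ImageNumber = Image.ImageNumber
   AND Nuclei.TableNumber = Image.TableNumber
"""


def nuclei_locations(conn):
    """Return one dict per nucleus: plate/well/site keys plus centre coordinates."""
    cursor = conn.execute(QUERY)
    columns = [description[0] for description in cursor.description]
    return [dict(zip(columns, row)) for row in cursor]


# Toy in-memory backend (assumption: a real backend file has many more columns).
conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE Image (TableNumber, ImageNumber, Image_Metadata_Plate,
                    Image_Metadata_Well, Image_Metadata_Site);
CREATE TABLE Nuclei (TableNumber, ImageNumber, ObjectNumber,
                     Nuclei_Location_Center_X, Nuclei_Location_Center_Y);
INSERT INTO Image VALUES (1, 1, '25738', 'a01', 1), (1, 2, '25738', 'a01', 2);
INSERT INTO Nuclei VALUES (1, 1, 1, 505.6, 58.3), (1, 1, 2, 168.2, 152.3),
                          (1, 2, 1, 137.9, 324.8);
""")
cells = nuclei_locations(conn)
print(len(cells), sorted(cells[0]))
```

For a real plate, `sqlite3.connect(":memory:")` would be replaced by a connection to the backend .sqlite file, and the resulting rows written out in whatever columnar format is chosen.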

So we can join this parquet file (which we'd create using the query above) with load_data_with_illum.csv on (Metadata_Plate, Metadata_Well, Metadata_Site) and we are all set, right @dmikeando?

In other words, if we create a per-plate parquet file (maybe sharded across wells) with these columns, one per cell, that's all you really need?

Metadata_Plate
Metadata_Well
Metadata_Site
ObjectNumber
Nuclei_Location_Center_X
Nuclei_Location_Center_Y
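
A rough sketch of that join, using only the Python standard library: per-cell rows are matched to load-data rows on (Metadata_Plate, Metadata_Well, Metadata_Site). The helper names and the sample values are hypothetical. Keys are compared as strings, which is an assumption; note that the sample output above shows lowercase wells (a01), so if the CSV uses a different casing the key would need normalizing first.

```python
import csv
import io


def site_key(row):
    """Join key shared by both tables; values are compared as strings."""
    return (str(row["Metadata_Plate"]), str(row["Metadata_Well"]), str(row["Metadata_Site"]))


def join_cells_to_images(cells, load_data_rows):
    """Attach the load-data row of the matching site to every cell (inner join)."""
    by_site = {site_key(row): row for row in load_data_rows}
    joined = []
    for cell in cells:
        image_row = by_site.get(site_key(cell))
        if image_row is not None:
            joined.append({**image_row, **cell})
    return joined


# Made-up stand-ins for load_data_with_illum.csv and the per-cell table.
load_data = io.StringIO(
    "Metadata_Plate,Metadata_Well,Metadata_Site,FileName_OrigDNA\n"
    "25738,a01,1,site1_dna.tiff\n"
    "25738,a01,2,site2_dna.tiff\n"
)
cells = [
    {"Metadata_Plate": 25738, "Metadata_Well": "a01", "Metadata_Site": 1,
     "ObjectNumber": 1, "Nuclei_Location_Center_X": 505.6, "Nuclei_Location_Center_Y": 58.3},
    {"Metadata_Plate": 25738, "Metadata_Well": "a01", "Metadata_Site": 1,
     "ObjectNumber": 2, "Nuclei_Location_Center_X": 168.2, "Nuclei_Location_Center_Y": 152.3},
    {"Metadata_Plate": 25738, "Metadata_Well": "a01", "Metadata_Site": 2,
     "ObjectNumber": 1, "Nuclei_Location_Center_X": 137.9, "Nuclei_Location_Center_Y": 324.8},
]
joined = join_cells_to_images(cells, csv.DictReader(load_data))
print(len(joined), joined[0]["FileName_OrigDNA"])
```

Because the load data has one row per site and the cell table has many, indexing the load data by site and looking each cell up keeps the join linear in the number of cells.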

@dmikeando
Collaborator

Thanks @shntnu. Your analysis looks correct to me. As we discussed, some of the columns (e.g. the illum filepaths/names) will be very repetitive, so using a dictionary/enum/categorical type could save disk space and load time.

https://arrow.apache.org/docs/python/data.html#dictionary-arrays
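
To make the idea concrete, here is a library-free sketch of what dictionary encoding does to a repetitive column: each distinct string is stored once, and every row keeps only a small integer index. This illustrates the concept only and is not the Arrow implementation; in pyarrow the dictionary arrays described at the link above (for example via `Array.dictionary_encode()`) play this role. The file names are made up.

```python
def dictionary_encode(values):
    """Return (dictionary, indices): unique values in first-seen order, one index per input."""
    dictionary = []
    position = {}
    indices = []
    for value in values:
        if value not in position:
            position[value] = len(dictionary)
            dictionary.append(value)
        indices.append(position[value])
    return dictionary, indices


def dictionary_decode(dictionary, indices):
    """Rebuild the original column from the dictionary and the indices."""
    return [dictionary[i] for i in indices]


# Made-up column: the same illumination file name repeated for every site of a plate.
column = ["PlateA_IllumDNA.npy"] * 4 + ["PlateB_IllumDNA.npy"] * 2
dictionary, indices = dictionary_encode(column)
print(dictionary)  # ['PlateA_IllumDNA.npy', 'PlateB_IllumDNA.npy']
print(indices)     # [0, 0, 0, 0, 1, 1]
```

The saving grows with the number of repeats, which is why it suits the per-plate illum paths and names.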

@shntnu
Collaborator Author

shntnu commented Nov 8, 2022

Ah yes, I'm thinking we'd not actually save out the join with the load data but rather do that join on the fly. We'd only save out the Image-Nuclei join, and just those 6 columns I've listed below. Sorry if this is not clear (on my phone)

Metadata_Plate
Metadata_Well
Metadata_Site
ObjectNumber
Nuclei_Location_Center_X
Nuclei_Location_Center_Y

(Still, would be good to use enum for the first 3)

@shntnu
Collaborator Author

shntnu commented Feb 27, 2023

This is now being addressed in cytomining/pycytominer#257
