Linking images to metadata #5

shntnu · 2022-10-13T19:30:12Z

Let's use this issue to discuss how we can link images to metadata.

@dmikeando Presumably all that you are using as input right now is the images, but no other information about them.

To help you get started on how to link images to metadata, can you clarify how you get Source, Plate, Batch, Well, Site information from the images? Presumably from their paths?

dmikeando · 2022-10-13T20:33:25Z

We currently use "load_data_with_illum.csv" to get the image and flatfield file paths and the well/site metadata. Right now I can't access its location (.../source/workspace/load_data_csv) in any of the sources. I'd like to get access and confirm that the same file is present in every source, so that we can use it as a consistent input.

shntnu · 2022-10-13T23:12:34Z

@dmikeando That's great you're using load_data_with_illum.csv. Note that the paths will almost certainly need to be edited because the original locations were not on the S3 bucket.

You already have full access to everything in source_4
You now have access to load_data_csv of several partners. https://github.com/jump-cellpainting/cellpainting-gallery-config/pull/45/commits/9b6cff07dba919468dc3168ed5b51e00ed6dadfe

Can you verify that?
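
The path edit mentioned above could look roughly like the sketch below. Both prefixes are hypothetical placeholders (the thread does not give the original root or the bucket layout), and the file name in the sample row is made up. Only the `PathName_*` columns are touched.

```python
import pandas as pd

# Hypothetical prefixes: neither value comes from this thread.
OLD_PREFIX = "/original/local/root"
NEW_PREFIX = "s3://example-bucket/example-root"


def rewrite_path_columns(df: pd.DataFrame, old: str, new: str) -> pd.DataFrame:
    """Return a copy of df whose PathName_* values start with `new` instead of `old`."""
    out = df.copy()
    for col in out.columns:
        if col.startswith("PathName_"):
            # Swap only a leading occurrence of the old prefix; leave other values alone.
            out[col] = out[col].map(
                lambda p: new + p[len(old):] if p.startswith(old) else p
            )
    return out


# One made-up row standing in for load_data_with_illum.csv.
load_data = pd.DataFrame(
    {
        "Metadata_Plate": ["25738"],
        "FileName_OrigDNA": ["example_dna.tif"],
        "PathName_OrigDNA": ["/original/local/root/images/25738"],
    }
)
fixed = rewrite_path_columns(load_data, OLD_PREFIX, NEW_PREFIX)
print(fixed["PathName_OrigDNA"].iloc[0])
```

Working on a copy keeps the original table around, which makes it easy to diff old and new paths before overwriting anything.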

shntnu · 2022-11-08T23:14:35Z

In https://github.com/jump-cellpainting/datasets-private/issues/11#issuecomment-1304031422 we concluded that the embedder uses all of these columns from load_data_with_illum.csv:

Metadata_Plate
Metadata_Well
Metadata_Site
FileName_IllumAGP
FileName_IllumDNA
FileName_IllumER
FileName_IllumMito
FileName_IllumRNA
FileName_OrigAGP
FileName_OrigDNA
FileName_OrigER
FileName_OrigMito
FileName_OrigRNA
PathName_IllumAGP
PathName_IllumDNA
PathName_IllumER
PathName_IllumMito
PathName_IllumRNA
PathName_OrigAGP
PathName_OrigDNA
PathName_OrigER
PathName_OrigMito
PathName_OrigRNA

We now additionally include these two columns in the load_data_with_illum.csv files (https://github.com/jump-cellpainting/datasets-private/pull/23):

Metadata_Source
Metadata_Batch
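
As a sketch, the 25 columns listed above (23 used by the embedder plus the two new ones) can be requested explicitly when reading the file. The helper name `read_load_data` and the in-memory CSV are illustrative only; reading everything as text is an assumption made here so that plate and well identifiers are not reinterpreted as numbers.

```python
import io

import pandas as pd

CHANNELS = ["AGP", "DNA", "ER", "Mito", "RNA"]
METADATA_COLUMNS = [
    "Metadata_Source",
    "Metadata_Batch",
    "Metadata_Plate",
    "Metadata_Well",
    "Metadata_Site",
]
# FileName_/PathName_ x Illum/Orig x channel, as listed in the thread.
IMAGE_COLUMNS = [
    f"{kind}_{stage}{channel}"
    for kind in ("FileName", "PathName")
    for stage in ("Illum", "Orig")
    for channel in CHANNELS
]
EMBEDDER_COLUMNS = METADATA_COLUMNS + IMAGE_COLUMNS


def read_load_data(source) -> pd.DataFrame:
    """Read only the listed columns; pandas raises ValueError if any is missing."""
    return pd.read_csv(source, usecols=EMBEDDER_COLUMNS, dtype=str)


# Tiny in-memory stand-in for a real load_data_with_illum.csv.
header = EMBEDDER_COLUMNS + ["Some_Other_Column"]
csv_text = ",".join(header) + "\n" + ",".join(["x"] * len(header)) + "\n"
df = read_load_data(io.StringIO(csv_text))
print(len(df.columns))
```

Because `usecols` fails loudly on a missing column, the same call doubles as a check that a given source's file has the expected schema.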

What additional information do we need to link to cells?

I think we can get everything else the embedder needs by querying the SQLite files (and storing it as a parquet file). This is essentially what DeepProfiler does too.

backend_file=/Users/shsingh/work/projects/2015_Bray_GigaScience/workspace/backend/CDRP/25738/25738.sqlite
sqlite3 -header -csv "${backend_file}" "select Image.Image_Metadata_Plate as Metadata_Plate,Image.Image_Metadata_Well as Metadata_Well,Image.Image_Metadata_Site as Metadata_Site,Nuclei.ObjectNumber,Nuclei.Nuclei_Location_Center_X,Nuclei.Nuclei_Location_Center_Y from Nuclei inner join Image on Nuclei.ImageNumber=Image.ImageNumber and Nuclei.TableNumber=Image.TableNumber limit 10"

Metadata_Plate,Metadata_Well,Metadata_Site,ObjectNumber,Nuclei_Location_Center_X,Nuclei_Location_Center_Y
25738,a01,1,1,505.621545403271,58.2637713855988
25738,a01,1,2,168.221237113402,152.262268041237
25738,a01,1,3,111.510355815189,178.495485926713
25738,a01,1,4,463.204137066444,254.365022983702
25738,a01,1,5,164.306306306306,290.675675675676
25738,a01,1,6,85.0708743971483,364.864856364018
25738,a01,2,1,137.892468787757,324.760471204188
25738,a01,2,2,317.639041437843,407.450490930271
25738,a01,2,3,381.467097170972,426.253536285363
25738,a01,3,1,417.722408026756,58.5016722408027

Rendered as a table:

Metadata_Plate	Metadata_Well	Metadata_Site	ObjectNumber	Nuclei_Location_Center_X	Nuclei_Location_Center_Y
25738	a01	1	1	505.621545403271	58.2637713855988
25738	a01	1	2	168.221237113402	152.262268041237
25738	a01	1	3	111.510355815189	178.495485926713
25738	a01	1	4	463.204137066444	254.365022983702
25738	a01	1	5	164.306306306306	290.675675675676
25738	a01	1	6	85.0708743971483	364.864856364018
25738	a01	2	1	137.892468787757	324.760471204188
25738	a01	2	2	317.639041437843	407.450490930271
25738	a01	2	3	381.467097170972	426.253536285363
25738	a01	3	1	417.722408026756	58.5016722408027
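
The same query can be run from Python, which is one possible route to the parquet file. The sketch below uses a toy in-memory database whose schema is an assumption (only the columns the query touches, with made-up rows); a real backend file would be opened with `sqlite3.connect(path)` instead.

```python
import sqlite3

import pandas as pd

# Same join as the command-line query above, without the LIMIT.
QUERY = """
SELECT
    Image.Image_Metadata_Plate AS Metadata_Plate,
    Image.Image_Metadata_Well  AS Metadata_Well,
    Image.Image_Metadata_Site  AS Metadata_Site,
    Nuclei.ObjectNumber,
    Nuclei.Nuclei_Location_Center_X,
    Nuclei.Nuclei_Location_Center_Y
FROM Nuclei
INNER JOIN Image
    ON Nuclei.ImageNumber = Image.ImageNumber
   AND Nuclei.TableNumber = Image.TableNumber
"""


def nuclei_locations(conn: sqlite3.Connection) -> pd.DataFrame:
    """One row per nucleus, keyed by plate, well, and site."""
    return pd.read_sql_query(QUERY, conn)


# Toy database: column types and values here are illustrative assumptions.
conn = sqlite3.connect(":memory:")
conn.executescript(
    """
    CREATE TABLE Image (
        TableNumber TEXT, ImageNumber INTEGER,
        Image_Metadata_Plate TEXT, Image_Metadata_Well TEXT, Image_Metadata_Site TEXT
    );
    CREATE TABLE Nuclei (
        TableNumber TEXT, ImageNumber INTEGER, ObjectNumber INTEGER,
        Nuclei_Location_Center_X REAL, Nuclei_Location_Center_Y REAL
    );
    INSERT INTO Image VALUES ('t1', 1, '25738', 'a01', '1'), ('t1', 2, '25738', 'a01', '2');
    INSERT INTO Nuclei VALUES ('t1', 1, 1, 505.6, 58.3), ('t1', 1, 2, 168.2, 152.3),
                              ('t1', 2, 1, 137.9, 324.8);
    """
)
cells = nuclei_locations(conn)
conn.close()
print(cells)
# cells.to_parquet(...) would then write the file; that call needs pyarrow or fastparquet.
```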

So we can join this parquet file (which we'd create using the query above) with load_data_with_illum.csv on (Metadata_Plate, Metadata_Well, Metadata_Site) and we are all set, right @dmikeando?

In other words, if we create a per-plate parquet file (maybe sharded across wells) with these columns, one per cell, that's all you really need?

Metadata_Plate
Metadata_Well
Metadata_Site
ObjectNumber
Nuclei_Location_Center_X
Nuclei_Location_Center_Y
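
A minimal sketch of the join on (Metadata_Plate, Metadata_Well, Metadata_Site) proposed above, using pandas. The two small frames stand in for the per-cell parquet file and load_data_with_illum.csv; the file names are made up, and the helper name `attach_image_paths` is illustrative. The sketch assumes the key values are spelled identically in both tables apart from their type.

```python
import pandas as pd

KEYS = ["Metadata_Plate", "Metadata_Well", "Metadata_Site"]

# Stand-in for the per-cell table (one row per nucleus).
cells = pd.DataFrame(
    {
        "Metadata_Plate": [25738, 25738, 25738],
        "Metadata_Well": ["a01", "a01", "a01"],
        "Metadata_Site": [1, 1, 2],
        "ObjectNumber": [1, 2, 1],
        "Nuclei_Location_Center_X": [505.6, 168.2, 137.9],
        "Nuclei_Location_Center_Y": [58.3, 152.3, 324.8],
    }
)

# Stand-in for load_data_with_illum.csv (one row per site).
load_data = pd.DataFrame(
    {
        "Metadata_Plate": ["25738", "25738"],
        "Metadata_Well": ["a01", "a01"],
        "Metadata_Site": ["1", "2"],
        "FileName_OrigDNA": ["site1_dna.tif", "site2_dna.tif"],
    }
)


def attach_image_paths(cells: pd.DataFrame, load_data: pd.DataFrame) -> pd.DataFrame:
    """Left-join per-cell rows onto per-site image columns."""
    # Cast keys to text on both sides so that 1 and "1" match.
    left = cells.astype({k: str for k in KEYS})
    right = load_data.astype({k: str for k in KEYS})
    # many_to_one makes pandas raise if a site appears twice in load_data.
    return left.merge(right, on=KEYS, how="left", validate="many_to_one")


merged = attach_image_paths(cells, load_data)
print(merged[["Metadata_Site", "ObjectNumber", "FileName_OrigDNA"]])
```

A left join keeps every cell even when its site has no row in the load data, so unmatched cells show up as missing file names rather than disappearing.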

dmikeando · 2022-11-08T23:35:03Z

Thanks @shntnu. Your analysis looks correct to me. As we discussed, some of the columns (e.g. the illum file paths and names) will be very repetitive, so using a dictionary/enum/categorical type could save disk space and load time.

https://arrow.apache.org/docs/python/data.html#dictionary-arrays
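
To illustrate the saving, here is a small sketch using the pandas `category` dtype, which is the pandas counterpart of an Arrow dictionary array (to my understanding, pyarrow converts categorical columns to dictionary arrays when building a table from pandas). The column values are made up, and the exact byte counts depend on the pandas version, so none are quoted here.

```python
import pandas as pd

n = 10_000
# Highly repetitive columns, similar in spirit to plate, well, and illum paths.
df = pd.DataFrame(
    {
        "Metadata_Plate": ["25738"] * n,
        "Metadata_Well": ["a01"] * (n // 2) + ["a02"] * (n // 2),
        "PathName_IllumDNA": ["/hypothetical/illum/25738"] * n,
    }
)

# Categorical columns store each distinct value once plus a small integer code per row.
as_category_df = df.astype("category")

as_text = df.memory_usage(deep=True).sum()
as_category = as_category_df.memory_usage(deep=True).sum()
print(f"text: {as_text} bytes, categorical: {as_category} bytes")
```

The values themselves are unchanged by the conversion; only the in-memory representation differs.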

shntnu · 2022-11-08T23:53:08Z

Ah yes, I'm thinking we'd not actually save out the join with the load data but rather do that join on the fly. We'd only save out the Image-Nuclei join, and just those 6 columns I've listed below. Sorry if this is not clear (on my phone)

Metadata_Plate
Metadata_Well
Metadata_Site
ObjectNumber
Nuclei_Location_Center_X
Nuclei_Location_Center_Y

(Still, would be good to use enum for the first 3)

shntnu · 2023-02-27T12:07:17Z

This is now being addressed in cytomining/pycytominer#257

dmikeando closed this as completed Dec 1, 2022

shntnu mentioned this issue Dec 20, 2022: TableNumber and ImageNumber discrepancy #8 (Closed)

shntnu reopened this Feb 27, 2023
