make_plots dropping all ranks except one for plotting #97

DreRnc · 2025-02-15T11:54:01Z

Hi, I may be misunderstanding this, but it looks like the plots are made from only one rank per system (an arbitrary one), because the scores are deduplicated on the "reference" column, which is the reference PLINDER system_id.

plinder/src/plinder/eval/docking/make_plots.py

Lines 43 to 65 in d303084

    
    def from_scores_and_data_files(
        cls, score_file: Path, data_file: Path, output_dir: Path, top_n: int = 10
    ) -> "EvaluationResults":
        # use only one score per system when plotting aggregated scores
        scores_df = pd.read_parquet(score_file).drop_duplicates("reference")
        data_df = pd.read_parquet(data_file)
        merged_df = scores_df[scores_df["reference"].isin(data_df["system_id"])].merge(
            data_df, left_on="reference", right_on="system_id", how="left"
        )
        merged_df["bisy_rmsd_wave"] = merged_df["bisy_rmsd_wave"].fillna(
            np.nanmax(merged_df["bisy_rmsd_wave"])
        )
        merged_df["lddt_pli_wave"] = merged_df["lddt_pli_wave"].fillna(0)
        merged_df["success"] = merged_df["bisy_rmsd_wave"] <= 2
        merged_df["rank"] = merged_df["rank"].astype(int)
        output_dir.mkdir(exist_ok=True)
        merged_df.to_parquet(output_dir / "merged.parquet")
        data = cls(merged_df, output_dir)
        data.write_results_table(True, top_n)
        data.write_results_table(False, top_n)
        data.write_leakage_plots(True, top_n)
        data.write_leakage_plots(False, top_n)
        return data

scores_df keeps only one prediction per system_id because of the drop_duplicates call, and which rank survives is arbitrary: it is whichever row comes first in the score file. I checked, and merged_df indeed has shape (n_test_points, n_columns) instead of (n_test_points * n_ranks, n_columns).
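To make the effect concrete, here is a minimal sketch on a toy table. The system names and score values are made up for illustration; only the column names come from the snippet above. With pandas' default `keep="first"`, the surviving row is simply the first one encountered for each "reference" value.

```python
import pandas as pd

# Toy scores table: two systems, three ranked poses each (values are made up).
scores_df = pd.DataFrame(
    {
        "reference": ["sys_a"] * 3 + ["sys_b"] * 3,
        "rank": [1, 2, 3, 1, 2, 3],
        "bisy_rmsd_wave": [3.1, 1.4, 5.0, 0.9, 2.2, 4.8],
    }
)

# Same call as in from_scores_and_data_files: one row per system survives.
deduped = scores_df.drop_duplicates("reference")

print(scores_df.shape)           # (6, 3): n_systems * n_ranks rows
print(deduped.shape)             # (2, 3): one row per system
print(deduped["rank"].tolist())  # [1, 1]: whichever row came first
```

In this toy table the rows happen to be sorted by rank, so rank 1 survives; in an unsorted score file the retained rank would depend on row order.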

As a result, data is initialized from this merged_df. I don't think this is the intended behavior, since the data operations take a top_n argument, which doesn't make sense on a merged_df that has only one rank per reference system.

Let me know if I'm missing something! If this is a bug, I could quickly fix it, but I would first need to understand the reasoning behind the drop_duplicates operation.


OleinikovasV · 2025-02-17T10:09:46Z

Thanks @DreRnc for spotting this. It was indeed an unfortunate bug: the drop was meant to avoid multi-counting for aggregate scores in multi-ligand systems.

This PR should fix it: #98
I also added some extra comments to make the reasoning for the duplicate drop easier to understand!
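For readers following along, one way to avoid multi-counting while keeping every rank is to deduplicate on the system and the rank together. This is only a sketch under stated assumptions and is not taken from PR #98: it assumes the score file has one row per ligand, that the aggregate `*_wave` columns repeat across the ligands of a pose, and the `ligand_instance` column is hypothetical.

```python
import pandas as pd

# Toy multi-ligand system: two ranked poses, two ligands each.
# "ligand_instance" is a hypothetical column; the values are made up.
scores_df = pd.DataFrame(
    {
        "reference": ["sys_a"] * 4,
        "rank": [1, 1, 2, 2],
        "ligand_instance": [0, 1, 0, 1],
        "lddt_pli_wave": [0.8, 0.8, 0.5, 0.5],  # aggregate score repeated per ligand
    }
)

# Keep one row per (system, rank) instead of one row per system,
# so each pose is counted once and no rank is lost.
per_pose = scores_df.drop_duplicates(["reference", "rank"])

print(per_pose.shape)             # (2, 4)
print(per_pose["rank"].tolist())  # [1, 2]
```

With this subset, top_n filtering on "rank" still has all ranks to work with.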

OleinikovasV · 2025-02-17T12:23:32Z

PR with bugfix #98 has been merged

DreRnc changed the title ~~Plots dropping all except ranks except one for plotting~~ Plots dropping all ranks except one for plotting Feb 15, 2025

DreRnc changed the title ~~Plots dropping all ranks except one for plotting~~ make_plots dropping all ranks except one for plotting Feb 15, 2025

OleinikovasV mentioned this issue Feb 17, 2025

bugfix: one aggregate score per model + pin networkit==11.0.0 #98

Merged

OleinikovasV self-assigned this Feb 17, 2025

OleinikovasV added the bug Something isn't working label Feb 17, 2025

OleinikovasV closed this as completed Feb 17, 2025
