make_plots dropping all ranks except one for plotting · Issue #97 · plinder-org/plinder · GitHub

make_plots dropping all ranks except one for plotting #97


Closed
DreRnc opened this issue Feb 15, 2025 · 2 comments
Labels: bug (Something isn't working)

DreRnc (Contributor) commented Feb 15, 2025

Hi, I don't know if I'm understanding this properly, but it seems to me that the plots are made from only one rank per system, because the scores are de-duplicated on the "reference" column, which holds the reference PLINDER system_id. The surviving rank is not chosen deliberately: drop_duplicates keeps whichever row comes first for each reference.

def from_scores_and_data_files(
    cls, score_file: Path, data_file: Path, output_dir: Path, top_n: int = 10
) -> "EvaluationResults":
    # use only one score per system when plotting aggregated scores
    scores_df = pd.read_parquet(score_file).drop_duplicates("reference")
    data_df = pd.read_parquet(data_file)
    merged_df = scores_df[scores_df["reference"].isin(data_df["system_id"])].merge(
        data_df, left_on="reference", right_on="system_id", how="left"
    )
    merged_df["bisy_rmsd_wave"] = merged_df["bisy_rmsd_wave"].fillna(
        np.nanmax(merged_df["bisy_rmsd_wave"])
    )
    merged_df["lddt_pli_wave"] = merged_df["lddt_pli_wave"].fillna(0)
    merged_df["success"] = merged_df["bisy_rmsd_wave"] <= 2
    merged_df["rank"] = merged_df["rank"].astype(int)
    output_dir.mkdir(exist_ok=True)
    merged_df.to_parquet(output_dir / "merged.parquet")
    data = cls(merged_df, output_dir)
    data.write_results_table(True, top_n)
    data.write_results_table(False, top_n)
    data.write_leakage_plots(True, top_n)
    data.write_leakage_plots(False, top_n)
    return data

The scores_df keeps only one prediction per system_id (with an arbitrary rank: whichever row comes first) because of the drop_duplicates("reference") call. I checked, and the merged_df does in fact have shape (n_test_points, n_columns) instead of (n_test_points * n_ranks, n_columns).
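The collapse described above can be reproduced on a toy table. This is only a minimal sketch: the system names and score values are made up, and only the "reference", "rank" and "bisy_rmsd_wave" column names come from the snippet in this issue.

```python
import pandas as pd

# Toy scores table: 2 systems, 3 ranked predictions each (values are made up).
scores_df = pd.DataFrame(
    {
        "reference": ["sys_a"] * 3 + ["sys_b"] * 3,
        "rank": [2, 1, 3, 1, 2, 3],
        "bisy_rmsd_wave": [1.5, 3.0, 4.2, 0.8, 2.5, 6.0],
    }
)

# Same call as in from_scores_and_data_files: one row per reference survives.
deduped = scores_df.drop_duplicates("reference")

# drop_duplicates defaults to keep="first", so the surviving rank is simply
# whichever row happened to come first for each reference.
kept_ranks = deduped["rank"].tolist()

print(scores_df.shape)  # (6, 3)
print(deduped.shape)    # (2, 3)
print(kept_ranks)       # [2, 1]
```

Note that for sys_a the surviving row is rank 2, not rank 1, purely because of row order.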

As a result, data is initialized with this collapsed merged_df. I think this is not the intended behavior, since the data operations take a top_n argument, which doesn't make sense on a merged_df that has only one rank per reference system.
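To illustrate why top_n stops being meaningful, here is a small sketch. The helper top_n_success_rate is hypothetical (it is not PLINDER's implementation) and assumes a system counts as a success if any of its top_n ranked poses has an RMSD of at most 2, matching the "success" threshold in the snippet above. The data is made up.

```python
import pandas as pd


def top_n_success_rate(df: pd.DataFrame, top_n: int) -> float:
    """Hypothetical helper: fraction of systems with a successful pose in the top_n ranks."""
    top = df[df["rank"] <= top_n]
    per_system = top.groupby("reference")["success"].any()
    return float(per_system.mean())


# Toy data: 2 systems, 3 ranked poses each (values are made up).
scores_df = pd.DataFrame(
    {
        "reference": ["sys_a"] * 3 + ["sys_b"] * 3,
        "rank": [1, 2, 3, 1, 2, 3],
        "bisy_rmsd_wave": [3.0, 1.5, 4.2, 2.5, 6.0, 0.8],
    }
)
scores_df["success"] = scores_df["bisy_rmsd_wave"] <= 2

# With all ranks present, the success rate grows with top_n.
full = {n: top_n_success_rate(scores_df, n) for n in (1, 2, 3)}

# After drop_duplicates("reference"), only one pose per system is left,
# so every top_n gives the same number.
deduped_df = scores_df.drop_duplicates("reference")
collapsed = {n: top_n_success_rate(deduped_df, n) for n in (1, 2, 3)}

print(full)       # {1: 0.0, 2: 0.5, 3: 1.0}
print(collapsed)  # {1: 0.0, 2: 0.0, 3: 0.0}
```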

Let me know if I'm missing something! If this is a bug, I could quickly fix it, but I would first need to understand the reasoning behind the drop_duplicates operation.

DreRnc changed the title from "Plots dropping all except ranks except one for plotting" to "Plots dropping all ranks except one for plotting" Feb 15, 2025
DreRnc changed the title from "Plots dropping all ranks except one for plotting" to "make_plots dropping all ranks except one for plotting" Feb 15, 2025
OleinikovasV self-assigned this Feb 17, 2025
OleinikovasV added the bug (Something isn't working) label Feb 17, 2025
OleinikovasV (Member) commented

Thanks @DreRnc for spotting this! It was indeed an unfortunate bug: the de-duplication was meant to avoid multi-counting for aggregate scores in multi-ligand systems.

This PR should fix it: #98
I also added some extra comments to make the reasoning behind the duplicate drop easier to understand!
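For readers wondering how the two goals can coexist, one possible approach is to de-duplicate on the pair ("reference", "rank") instead of on "reference" alone. This is only a sketch under assumptions and not necessarily what PR #98 does: the "ligand" column and all values are hypothetical, and it assumes the system-level score is repeated once per ligand row.

```python
import pandas as pd

# Toy data: sys_a has two ligands, so each ranked pose appears twice with the
# same system-level score; sys_b has one ligand. All values are made up.
scores_df = pd.DataFrame(
    {
        "reference": ["sys_a"] * 4 + ["sys_b"] * 2,
        "rank": [1, 1, 2, 2, 1, 2],
        "ligand": ["L1", "L2", "L1", "L2", "L1", "L1"],  # hypothetical column
        "lddt_pli_wave": [0.7, 0.7, 0.4, 0.4, 0.9, 0.6],
    }
)

# De-duplicate per (system, rank): multi-ligand systems are no longer counted
# once per ligand, but every ranked pose is kept.
per_pose = scores_df.drop_duplicates(["reference", "rank"])

ranks_per_system = per_pose.groupby("reference")["rank"].nunique().tolist()

print(len(per_pose))     # 4
print(ranks_per_system)  # [2, 2]
```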

OleinikovasV (Member) commented

The PR with the bugfix, #98, has been merged.
