Updating SageAdapter with new functionality + compatibility with new Sage versions #7577

JohannesvKL · 2024-09-06T11:55:24Z

User description

Description

Added functionality to the SageAdapter tool by changing the SageAdapter.cpp and PercolatorInfile.cpp files. Now supports many different Sage features such as annotation, wide-window mode, chimera, FDR filtering among others. The tool is now also compliant with Sage's new variable modifications specifications and works on versions of Sage beyond v0.15.0.

SageAdapter now also provides PTM discovery functionality and additionally outputs a table of all discovered PTMs from the given data.

Checklist

Make sure that you are listed in the AUTHORS file
Add relevant changes and new features to the CHANGELOG file
I have commented my code, particularly in hard-to-understand areas
New and existing unit tests pass locally with my changes
Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can 8000 I get additional information on failed tests during CI

Click to expand

If your PR is failing you can check out

The details of the action statuses at the end of the PR or the "Checks" tab.
http://cdash.openms.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number.
If you click in the column that lists the failed tests you will get detailed error messages.

Advanced commands (admins / reviewer only)

Click to expand

/reformat (experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
commenting with rebuild jenkins will retrigger Jenkins-based CI builds

⚠️ Note: Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).

PR Type

enhancement, documentation

Description

Enhanced the SageAdapter tool with new functionalities including PTM discovery, wide-window mode, and chimera search.
Updated the tool to be compatible with Sage versions beyond v0.15.0, including changes to variable modification specifications.
Implemented FDR threshold filtering and additional metadata handling in PercolatorInfile.
Updated documentation and authorship to reflect new contributions and changes.

Changes walkthrough 📝

Relevant files

Enhancement

PercolatorInfile.cpp `Enhance PercolatorInfile with TSV processing and FDR filtering` src/openms/source/FORMAT/PercolatorInfile.cpp Added functionality to load and process additional TSV files. Implemented mapping of PSM IDs to peak annotations. Introduced FDR threshold filtering for peptide identifications. Enhanced metadata handling and annotation.	+147/-4
SageAdapter.cpp `Update SageAdapter for PTM discovery and Sage v0.15.0 compatibility` src/topp/SageAdapter.cpp Added new functionalities for PTM discovery and annotation. Implemented wide-window mode and chimera search options. Updated configuration handling for compatibility with Sage v0.15.0. Enhanced logging and debugging outputs.	+399/-19
PercolatorInfile.h `Update PercolatorInfile interface for FDR threshold` src/openms/include/OpenMS/FORMAT/PercolatorInfile.h Updated function signature to include FDR threshold parameter.	+2/-1

Documentation

CHANGELOG `Update CHANGELOG for SageAdapter enhancements` CHANGELOG Documented updates to SageAdapter and new features. Added notes on compatibility and new functionalities.	+59/-14
AUTHORS `Update AUTHORS file with new contributor` AUTHORS Added Johannes von Kleist to the list of authors.	+1/-0

💡 PR-Agent usage:
Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

…nted

…probem

…e v0.15.0

qodo-merge-pro · 2024-09-06T11:55:57Z

PR Reviewer Guide 🔍

⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
🧪 No relevant tests
🔒 No security concerns identified
⚡ Key issues to review Code Complexity The `mapDifftoMods` function is quite complex and may benefit from refactoring into smaller, more focused functions. Error Handling The error handling in the `mapDifftoMods` function could be improved, particularly around file operations. Performance Concern Multiple nested loops and map lookups in the load function may impact performance for large datasets.

qodo-merge-pro · 2024-09-06T11:56:25Z

PR Code Suggestions ✨

Category	Suggestion	Score
Performance	✅ Use a more efficient algorithm for finding the closest bucket in the histogram Suggestion Impact: The commit implemented a more efficient algorithm using binary search to find the closest bucket in the histogram, replacing the iteration through all buckets. code diff: + auto hist_it = hist.find(deltamass); + if (hist_it == hist.end()) + { + bool bucketcheck = true; + bool chargecheck = false; + + // Perform binary search on the sorted keys + auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), deltamass); + + if (it != sorted_keys.end() && std::abs(deltamass - it) < 0.0005 && (deltamass > 0.05 \|\| deltamass < -0.05)) + { + hist[it] += 1.0; + bucketcheck = false; + + auto& charges = charge_states[it]; + if (std::find(charges.begin(), charges.end(), charge) == charges.end()) + { + num_charges_at_mass[it] += 1.0; + charges.push_back(charge); + } + } + + if (bucketcheck && (deltamass > 0.05 \|\| deltamass < -0.05)) + { + // Insert new key in the sorted vector at the correct position + sorted_keys.insert(it, deltamass); + + hist[deltamass] = 1.0; + num_charges_at_mass[deltamass] = 1.0; + charge_states[deltamass].push_back(charge); + } + } + else + { + hist_it->second += 1.0; + + auto& charges = charge_states[deltamass]; + if (std::find(charges.begin(), charges.end(), charge) == charges.end()) + { + num_charges_at_mass[deltamass] += 1.0; + charges.push_back(charge); + } Consider using a more efficient algorithm for finding the closest bucket in the histogram, such as binary search or a hash table, instead of iterating through all buckets. src/topp/SageAdapter.cpp [141-147] -for (map<double, double>::const_iterator mit = hist.begin(); mit != hist.end(); ++mit) +auto it = std::lower_bound(hist.begin(), hist.end(), deltamass, + [](const auto& bucket, double value) { return bucket.first < value - 0.05; }); +if (it != hist.end() && std::abs(it->first - deltamass) <= 0.05 && ((0.05 < deltamass) \|\| (-0.05 > deltamass))) { - //Check with error tolerance in buckets if already present in histogram - if(deltamass < mit->first+0.05 && deltamass > mit->first-0.05 && bucketcheck && ((0.05 < deltamass) \|\| (-0.05 > deltamass))) - { - hist[mit->first] += 1.0; - bucketcheck = false; + it->second += 1.0; + bucketcheck = false; Apply this suggestion Suggestion importance[1-10]: 8 Why: Implementing a binary search or similar algorithm can significantly improve performance when searching for the closest bucket, especially if the histogram is large. This is a valuable optimization.	8
	✅ Replace std::map with std::unordered_map for better performance Suggestion Impact: The typedef for mapRatetoMass was changed from std::map to std::unordered_map, which aligns with the suggestion to improve performance by using a more efficient data structure. code diff: -typedef map<double, double> mapRatetoMass; Consider using a more efficient data structure, such as std::unordered_map, instead of std::map for `mapRatetoMass` to improve lookup performance. src/topp/SageAdapter.cpp [83] -typedef map<double, double> mapRatetoMass; +typedef std::unordered_map<double, double> mapRatetoMass; Apply this suggestion Suggestion importance[1-10]: 7 Why: Using std::unordered_map can improve lookup performance due to its average constant time complexity, which is beneficial for performance. However, the impact may be minimal if the map size is small or if order is important.	7
	✅ Use std::unordered_map instead of std::map for better lookup performance Suggestion Impact: The suggestion to use std::unordered_map instead of std::map for anno_mapping was implemented in the commit. The data structure for anno_mapping was changed to std::unordered_map to improve lookup performance. code diff: + // TODO: make it work also without tsv file or sage result + //Sage Variables, initialized in the following block if SageAnnotation is set + map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping; + CsvFile tsv; + CsvFile annos; + unordered_map<String, size_t> to_idx_t; + + if (SageAnnotation) // Block for special treatment of sage + { + String tsv_file_path = pin_file.substr(0, pin_file.size()-3); + tsv_file_path = tsv_file_path + "tsv"; + tsv = CsvFile(tsv_file_path,'\t'); + + String temp_diff = "results.sage.pin"; + String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length()); + anno_file_path = anno_file_path + "matched_fragments.sage.tsv"; + annos = CsvFile(anno_file_path, '\t'); - String tsv_file_path = pin_file.substr(0, pin_file.size()-3); - tsv_file_path = tsv_file_path + "tsv"; - CsvFile tsv(tsv_file_path, '\t'); - - - - String temp_diff = "results.sage.pin"; - String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length()); - anno_file_path = anno_file_path + "matched_fragments.sage.tsv"; - CsvFile annos(anno_file_path, '\t'); - - //map PSMID to vec of PeakAnnotation - - StringList header; - StringList fullheader; - StringList ann_header; - //TODO DANGEROUS! Our CSV reader does not support comment lines!! - csv.getRow(0, header); - tsv.getRow(0, fullheader); - annos.getRow(0, ann_header); + + //map PSMID to vec of PeakAnnotation + StringList sage_tsv_header; + tsv.getRow(0, sage_tsv_header); + to_idx_t; // map column name to column index, for full .tsv file + { + int idx_t{}; + for (const auto& h : sage_tsv_header) { to_idx_t[h] = idx_t++; } + } + + // processs annotation file + StringList sage_annotation_header; + annos.getRow(0, sage_annotation_header); + unordered_map<String, size_t> to_idx_a; // map column name to column index, for full annotation file file + { + int idx_a{}; + for (const auto& h : sage_annotation_header) { to_idx_a[h] = idx_a++; } + } + // map PSMs -> PeakAnnotation vector + auto num_rows = annos.rowCount(); + + for (size_t i = 1; i < num_rows; ++i) + { + StringList row; + annos.getRow(i, row); + + //Check if mapping already has PSM, if it does add + if (anno_mapping.find(row[to_idx_a.at("psm_id")].toInt()) == anno_mapping.end()) + { + //Make a new vector of annotations + PeptideHit::PeakAnnotation peak_temp; + + peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; + peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); + peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); + peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); + + vector<PeptideHit::PeakAnnotation> temp_anno_vec; + temp_anno_vec.push_back(peak_temp); + anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ] = temp_anno_vec; + } + else + { + //Add values to exisiting vector + PeptideHit::PeakAnnotation peak_temp; + + peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; + peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); + peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); + peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); + + anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ].push_back(peak_temp); + } Consider using a more efficient data structure, such as std::unordered_map, for `anno_mapping` to improve lookup performance. src/openms/source/FORMAT/PercolatorInfile.cpp [141] -map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping; +std::unordered_map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping; Apply this suggestion Suggestion importance[1-10]: 7 Why: Switching to std::unordered_map can enhance lookup performance due to its average constant time complexity, which is advantageous for performance, especially when the map grows large.	7
Maintainability	✅ Rename the 'threshhold' parameter to a more descriptive name Suggestion Impact: The commit changed the parameter name from 'threshhold' to 'threshold', which aligns with the suggestion's intent to improve readability, although it did not use the exact names suggested. code diff: - double threshhold = 0.01); + double threshold = 0.01); Consider using a more descriptive name for the `threshhold` parameter, such as `score_threshold` or `confidence_threshold`, to better convey its purpose. src/openms/include/OpenMS/FORMAT/PercolatorInfile.h [45-46] String decoy_prefix = "", -double threshhold = 0.01); +double score_threshold = 0.01); Apply this suggestion Suggestion importance[1-10]: 7 Why: The suggestion to rename 'threshhold' to a more descriptive name like 'score_threshold' improves code readability and maintainability by clearly conveying the parameter's purpose.	7
Enhancement	✅ Add an inline comment to explain the purpose of the 'threshhold' parameter Suggestion Impact: The commit added a detailed comment explaining the purpose of the 'threshold' parameter, including its role in filtering spectra based on the 'spectrum_q' value. code diff: + * @param decoy_prefix The prefix used to identify decoy protein accessions. Proteins with accessions starting with this prefix are marked as decoys. Otherwise, it assumes that the pin file already contains the correctly annotated decoy status. + * @param threshold A double value representing the threshold for the `spectrum_q` value. Only spectra with `spectrum_q` below this threshold are processed. + Implemented to allow prefiltering of Sage results. Consider adding a brief inline comment explaining the purpose of the `threshhold` parameter and its default value. src/openms/include/OpenMS/FORMAT/PercolatorInfile.h [45-46] String decoy_prefix = "", -double threshhold = 0.01); +double threshhold = 0.01); // Score threshold for filtering, default: 1% FDR Apply this suggestion Suggestion importance[1-10]: 6 Why: Adding a comment explaining the purpose of the 'threshhold' parameter enhances code clarity and helps future developers understand its role without needing to dive into the implementation details.	6
	Expand the description of SageAdapter updates with more details and an example Consider adding a brief explanation or example of the new functionality added to SageAdapter, to provide users with more context about the changes. CHANGELOG [26] -- SageAdapter received large updates including added functionality for PTM discovery + enabling features such as chimera seach, RT predicition, etc. +- SageAdapter received large updates including: + - Added functionality for PTM discovery + - Enabled features such as chimera search and RT prediction + - Improved compatibility with newer Sage versions (v0.15.0 and beyond) + Example: SageAdapter now supports automatic PTM discovery on peptides, enhancing its ability to identify modified proteins. Apply this suggestion Suggestion importance[1-10]: 5 Why: Providing more detailed information and an example about the SageAdapter updates in the changelog helps users understand the new functionalities and improvements, although it is not critical for functionality.	5
	Provide more specific version information for SageAdapter compatibility Consider adding version numbers or more specific information about the compatibility with new Sage versions to provide clearer guidance for users. CHANGELOG [58] -- SageAdapter now works with sage v0.15.0 and beyond +- SageAdapter now works with Sage v0.15.0 and later versions (tested up to v0.X.X), ensuring compatibility with the latest Sage releases Apply this suggestion Suggestion importance[1-10]: 4 Why: Adding specific version information for SageAdapter compatibility in the changelog offers clearer guidance for users, but it is a minor enhancement as the existing information is already somewhat informative.	4

timosachsenberg · 2024-09-09T08:43:39Z

hi can you check why there are conflicts in other parts? Maybe your branch is not up-to-date?

JohannesvKL · 2024-09-09T16:02:54Z

Ah yes, it appears the branch is not up-to-date, and also there are some random whitespaces in files that probably accidentally got added while I was looking at the files. Will have these fixed by Wednesday afternoon.

src/tests/topp/THIRDPARTY/SageAdapter_1_v3.2.0.ini

…eSearch into OpenSageSearch Updates current branch to fix preious merge issues.

src/topp/SageAdapter.cpp

src/openms/source/FORMAT/PercolatorInfile.cpp

…in percolator

Co-authored-by: Timo Sachsenberg <timo.sachsenberg@uni-tuebingen.de>

JohannesvKL and others added 23 commits May 19, 2024 12:56

First Attempt to cluster, data reading already works

1b49c72

Clustering test

977842b

Clustering with KDE works now, access to masses in unimod.obo impleme…

1ca3cff

…nted

CLustering functional, defect pushing mod-mass up during KDE still a …

7f3f4a4

…probem

Clustering (with workaround), mapping, and reading to idXML completed

584143b

Output as a .tsv file complete, sorting based 8000 on charge implemented

147e0aa

Added some bug checks to help debug streamlit app

afdcbb0

Added multiple masses being written to output

67ba2a3

Removed print-bloat

fc24370

Annotation meta values are now saved, FDR filtering implemented

52f6b00

Added annotation feature

89b15a5

Edited annotation notation

5b074a7

Fixed naming of output-table for PTMs

5d63a5e

Added wide-window option, improved annotation

c2eebdf

Updated variable modification specification to be compatible with sag…

9e3ff66

…e v0.15.0

Re-added some filtering, tweaked backend, wide window mode available now

fa98108

Removed smoothing to speed up code, removed some old print statements

715479f

Cleaned up code a bit, preparing for PR

1386972

Removed unnecessary functions

3c9ae0c

Refactored code to be compliant with OpenMS coding conventions

8de8d87

Also updated minor changes in percolator to be form-compliant

3fc0c81

Update AUTHORS

286fcad

Update CHANGELOG

112ce9c

qodo-merge-pro bot added enhancement Documentation Review effort [1-5]: 4 labels Sep 6, 2024

JohannesvKL and others added 2 commits September 23, 2024 21:10

Refactored code and made mapping less buggy, changed two thresholds

d458dd8

Merge branch 'develop' into OpenSageSearch

0364a25

timosachsenberg reviewed Sep 25, 2024

View reviewed changes

src/tests/topp/THIRDPARTY/SageAdapter_1_v3.2.0.ini Outdated Show resolved Hide resolved

JohannesvKL and others added 10 commits September 25, 2024 12:20

Removed old ini file and fixed white 8000 spac issues

92aad16

Merge branch 'OpenSageSearch' of github.com:JohannesvKL/OpenMSOpenSag…

7ee7a98

…eSearch into OpenSageSearch Updates current branch to fix preious merge issues.

Removed commented out code block

0b81b98

Added isotope check to mapping

20256f0

Changed description of smoothing option

533e18a

add doc

66157ff

todo

d56b7bd

Added parameter to check for Sage

62916b8

Merged

bf12ba2

Refactored and rewrote core parts, clarified names

bf2c146