8000 Updating SageAdapter with new functionality + compatibility with new Sage versions by JohannesvKL · Pull Request #7577 · OpenMS/OpenMS · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Updating SageAdapter with new functionality + compatibility with new Sage versions #7577

New issue

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 54 commits into from
Sep 30, 2024

Conversation

JohannesvKL
Copy link
Contributor
@JohannesvKL JohannesvKL commented Sep 6, 2024

User description

Description

Added functionality to the SageAdapter tool by changing the SageAdapter.cpp and PercolatorInfile.cpp files. Now supports many different Sage features such as annotation, wide-window mode, chimera, FDR filtering among others. The tool is now also compliant with Sage's new variable modifications specifications and works on versions of Sage beyond v0.15.0.

SageAdapter now also provides PTM discovery functionality and additionally outputs a table of all discovered PTMs from the given data.

Checklist

  • Make sure that you are listed in the AUTHORS file
  • Add relevant changes and new features to the CHANGELOG file
  • I have commented my code, particularly in hard-to-understand areas
  • New and existing unit tests pass locally with my changes
  • Updated or added python bindings for changed or new classes (Tick if no updates were necessary.)

How can 8000 I get additional information on failed tests during CI

Click to expand If your PR is failing you can check out
  • The details of the action statuses at the end of the PR or the "Checks" tab.
  • http://cdash.openms.de/index.php?project=OpenMS and look for your PR. Use the "Show filters" capability on the top right to search for your PR number.
    If you click in the column that lists the failed tests you will get detailed error messages.

Advanced commands (admins / reviewer only)

Click to expand
  • /reformat (experimental) applies the clang-format style changes as additional commit. Note: your branch must have a different name (e.g., yourrepo:feature/XYZ) than the receiving branch (e.g., OpenMS:develop). Otherwise, reformat fails to push.
  • setting the label "NoJenkins" will skip tests for this PR on jenkins (saves resources e.g., on edits that do not affect tests)
  • commenting with rebuild jenkins will retrigger Jenkins-based CI builds

⚠️ Note: Once you opened a PR try to minimize the number of pushes to it as every push will trigger CI (automated builds and test) and is rather heavy on our infrastructure (e.g., if several pushes per day are performed).


PR Type

enhancement, documentation


Description

  • Enhanced the SageAdapter tool with new functionalities including PTM discovery, wide-window mode, and chimera search.
  • Updated the tool to be compatible with Sage versions beyond v0.15.0, including changes to variable modification specifications.
  • Implemented FDR threshold filtering and additional metadata handling in PercolatorInfile.
  • Updated documentation and authorship to reflect new contributions and changes.

Changes walkthrough 📝

Relevant files
Enhancement
PercolatorInfile.cpp
Enhance PercolatorInfile with TSV processing and FDR filtering

src/openms/source/FORMAT/PercolatorInfile.cpp

  • Added functionality to load and process additional TSV files.
  • Implemented mapping of PSM IDs to peak annotations.
  • Introduced FDR threshold filtering for peptide identifications.
  • Enhanced metadata handling and annotation.
  • +147/-4 
    SageAdapter.cpp
    Update SageAdapter for PTM discovery and Sage v0.15.0 compatibility

    src/topp/SageAdapter.cpp

  • Added new functionalities for PTM discovery and annotation.
  • Implemented wide-window mode and chimera search options.
  • Updated configuration handling for compatibility with Sage v0.15.0.
  • Enhanced logging and debugging outputs.
  • +399/-19
    PercolatorInfile.h
    Update PercolatorInfile interface for FDR threshold           

    src/openms/include/OpenMS/FORMAT/PercolatorInfile.h

    • Updated function signature to include FDR threshold parameter.
    +2/-1     
    Documentation
    CHANGELOG
    Update CHANGELOG for SageAdapter enhancements                       

    CHANGELOG

  • Documented updates to SageAdapter and new features.
  • Added notes on compatibility and new functionalities.
  • +59/-14 
    AUTHORS
    Update AUTHORS file with new contributor                                 

    AUTHORS

    • Added Johannes von Kleist to the list of authors.
    +1/-0     

    💡 PR-Agent usage:
    Comment /help on the PR to get a list of all available PR-Agent tools and their descriptions

    Copy link
    Contributor
    qodo-merge-pro bot commented Sep 6, 2024

    PR Reviewer Guide 🔍

    ⏱️ Estimated effort to review: 4 🔵🔵🔵🔵⚪
    🧪 No relevant tests
    🔒 No security concerns identified
    ⚡ Key issues to review

    Code Complexity
    The mapDifftoMods function is quite complex and may benefit from refactoring into smaller, more focused functions.

    Error Handling
    The error handling in the mapDifftoMods function could be improved, particularly around file operations.

    Performance Concern
    Multiple nested loops and map lookups in the load function may impact performance for large datasets.

    Copy link
    Contributor
    qodo-merge-pro bot commented Sep 6, 2024

    PR Code Suggestions ✨

    CategorySuggestion                                                                                                                                    Score
    Performance
    ✅ Use a more efficient algorithm for finding the closest bucket in the histogram
    Suggestion Impact:The commit implemented a more efficient algorithm using binary search to find the closest bucket in the histogram, replacing the iteration through all buckets.

    code diff:

    +        auto hist_it = hist.find(deltamass);
    +        if (hist_it == hist.end())
    +        {
    +            bool bucketcheck = true;
    +            bool chargecheck = false;
    +
    +            // Perform binary search on the sorted keys
    +            auto it = std::lower_bound(sorted_keys.begin(), sorted_keys.end(), deltamass);
    +
    +            if (it != sorted_keys.end() && std::abs(deltamass - *it) < 0.0005 && (deltamass > 0.05 || deltamass < -0.05))
    +            {
    +                hist[*it] += 1.0;
    +                bucketcheck = false;
    +
    +                auto& charges = charge_states[*it];
    +                if (std::find(charges.begin(), charges.end(), charge) == charges.end())
    +                {
    +                    num_charges_at_mass[*it] += 1.0;
    +                    charges.push_back(charge);
    +                }
    +            }
    +
    +            if (bucketcheck && (deltamass > 0.05 || deltamass < -0.05))
    +            {
    +                // Insert new key in the sorted vector at the correct position
    +                sorted_keys.insert(it, deltamass);
    +
    +                hist[deltamass] = 1.0;
    +                num_charges_at_mass[deltamass] = 1.0;
    +                charge_states[deltamass].push_back(charge);
    +            }
    +        }
    +        else
    +        {
    +            hist_it->second += 1.0;
    +
    +            auto& charges = charge_states[deltamass];
    +            if (std::find(charges.begin(), charges.end(), charge) == charges.end())
    +            {
    +                num_charges_at_mass[deltamass] += 1.0;
    +                charges.push_back(charge);
    +            }

    Consider using a more efficient algorithm for finding the closest bucket in the
    histogram, such as binary search or a hash table, instead of iterating through all
    buckets.

    src/topp/SageAdapter.cpp [141-147]

    -for (map<double, double>::const_iterator mit = hist.begin(); mit != hist.end(); ++mit)
    +auto it = std::lower_bound(hist.begin(), hist.end(), deltamass, 
    +  [](const auto& bucket, double value) { return bucket.first < value - 0.05; });
    +if (it != hist.end() && std::abs(it->first - deltamass) <= 0.05 && ((0.05 < deltamass) || (-0.05 > deltamass)))
     {
    -  //Check with error tolerance in buckets if already present in histogram 
    -  if(deltamass <  mit->first+0.05 && deltamass >  mit->first-0.05 && bucketcheck && ((0.05 <  deltamass) || (-0.05 >  deltamass)))
    -  {
    -    hist[mit->first] += 1.0; 
    -    bucketcheck = false;
    +  it->second += 1.0;
    +  bucketcheck = false;
     
    • Apply this suggestion
    Suggestion importance[1-10]: 8

    Why: Implementing a binary search or similar algorithm can significantly improve performance when searching for the closest bucket, especially if the histogram is large. This is a valuable optimization.

    8
    ✅ Replace std::map with std::unordered_map for better performance
    Suggestion Impact:The typedef for mapRatetoMass was changed from std::map to std::unordered_map, which aligns with the suggestion to improve performance by using a more efficient data structure.

    code diff:

    -typedef map<double, double> mapRatetoMass;

    Consider using a more efficient data structure, such as std::unordered_map, instead
    of std::map for mapRatetoMass to improve lookup performance.

    src/topp/SageAdapter.cpp [83]

    -typedef map<double, double> mapRatetoMass;
    +typedef std::unordered_map<double, double> mapRatetoMass;
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Using std::unordered_map can improve lookup performance due to its average constant time complexity, which is beneficial for performance. However, the impact may be minimal if the map size is small or if order is important.

    7
    ✅ Use std::unordered_map instead of std::map for better lookup performance
    Suggestion Impact:The suggestion to use std::unordered_map instead of std::map for anno_mapping was implemented in the commit. The data structure for anno_mapping was changed to std::unordered_map to improve lookup performance.

    code diff:

    +    // TODO: make it work also without tsv file or sage result
    +    //Sage Variables, initialized in the following block if SageAnnotation is set 
    +    map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping; 
    +    CsvFile tsv; 
    +    CsvFile annos; 
    +    unordered_map<String, size_t> to_idx_t; 
    +
    +    if (SageAnnotation) // Block for special treatment of sage 
    +    {
    +      String tsv_file_path = pin_file.substr(0, pin_file.size()-3);
    +      tsv_file_path = tsv_file_path + "tsv"; 
    +      tsv = CsvFile(tsv_file_path,'\t'); 
    +
    +      String temp_diff = "results.sage.pin"; 
    +      String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length());
    +      anno_file_path = anno_file_path + "matched_fragments.sage.tsv"; 
    +      annos = CsvFile(anno_file_path, '\t'); 
           
    -    String tsv_file_path = pin_file.substr(0, pin_file.size()-3);
    -    tsv_file_path = tsv_file_path + "tsv"; 
    -    CsvFile tsv(tsv_file_path, '\t'); 
    -
    -
    -
    -    String temp_diff = "results.sage.pin"; 
    -    String anno_file_path = pin_file.substr(0, pin_file.size()-temp_diff.length());
    -    anno_file_path = anno_file_path + "matched_fragments.sage.tsv"; 
    -    CsvFile annos(anno_file_path, '\t'); 
    -
    -    //map PSMID to vec of PeakAnnotation 
    -
    -    StringList header;
    -    StringList fullheader; 
    -    StringList ann_header; 
    -    //TODO DANGEROUS! Our CSV reader does not support comment lines!!
    -    csv.getRow(0, header);
    -    tsv.getRow(0, fullheader); 
    -    annos.getRow(0, ann_header); 
    +
    +      //map PSMID to vec of PeakAnnotation 
    +      StringList sage_tsv_header;
    +      tsv.getRow(0, sage_tsv_header); 
    +      to_idx_t; // map column name to column index, for full .tsv file 
    +      {
    +        int idx_t{};
    +        for (const auto& h : sage_tsv_header) { to_idx_t[h] = idx_t++; }
    +      }
    +
    +      // processs annotation file
    +      StringList sage_annotation_header; 
    +      annos.getRow(0, sage_annotation_header);
    +      unordered_map<String, size_t> to_idx_a; // map column name to column index, for full annotation file file 
    +      {
    +        int idx_a{};
    +        for (const auto& h : sage_annotation_header) { to_idx_a[h] = idx_a++; }
    +      }
    +      // map PSMs -> PeakAnnotation vector
    +      auto num_rows = annos.rowCount(); 
    +
    +      for (size_t i = 1; i < num_rows; ++i)
    +      {
    +        StringList row;
    +        annos.getRow(i, row);
    +        
    +        //Check if mapping already has PSM, if it does add 
    +        if (anno_mapping.find(row[to_idx_a.at("psm_id")].toInt()) == anno_mapping.end())
    +        {                 
    +          //Make a new vector of annotations 
    +          PeptideHit::PeakAnnotation peak_temp; 
    +
    +          peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; 
    +          peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); 
    +          peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); 
    +          peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); 
    +
    +          vector<PeptideHit::PeakAnnotation> temp_anno_vec; 
    +          temp_anno_vec.push_back(peak_temp); 
    +          anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ] = temp_anno_vec;
    +        }
    +        else
    +        {
    +          //Add values to exisiting vector 
    +          PeptideHit::PeakAnnotation peak_temp; 
    +
    +          peak_temp.annotation = row[to_idx_a.at("fragment_type")] + row[to_idx_a.at("fragment_ordinals")]; 
    +          peak_temp.charge = row[to_idx_a.at("fragment_charge")].toInt(); 
    +          peak_temp.intensity = row[to_idx_a.at("fragment_intensity")].toDouble(); 
    +          peak_temp.mz = row[to_idx_a.at("fragment_mz_experimental")].toDouble(); 
    +
    +          anno_mapping[ row[to_idx_a.at("psm_id")].toInt() ].push_back(peak_temp);         
    +        }

    Consider using a more efficient data structure, such as std::unordered_map, for
    anno_mapping to improve lookup performance.

    src/openms/source/FORMAT/PercolatorInfile.cpp [141]

    -map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping;
    +std::unordered_map<int, vector<PeptideHit::PeakAnnotation>> anno_mapping;
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: Switching to std::unordered_map can enhance lookup performance due to its average constant time complexity, which is advantageous for performance, especially when the map grows large.

    7
    Maintainability
    ✅ Rename the 'threshhold' parameter to a more descriptive name
    Suggestion Impact:The commit changed the parameter name from 'threshhold' to 'threshold', which aligns with the suggestion's intent to improve readability, although it did not use the exact names suggested.

    code diff:

    -        double threshhold = 0.01);
    +        double threshold = 0.01);

    Consider using a more descriptive name for the threshhold parameter, such as
    score_threshold or confidence_threshold, to better convey its purpose.

    src/openms/include/OpenMS/FORMAT/PercolatorInfile.h [45-46]

     String decoy_prefix = "",
    -double threshhold = 0.01);
    +double score_threshold = 0.01);
     
    • Apply this suggestion
    Suggestion importance[1-10]: 7

    Why: The suggestion to rename 'threshhold' to a more descriptive name like 'score_threshold' improves code readability and maintainability by clearly conveying the parameter's purpose.

    7
    Enhancement
    ✅ Add an inline comment to explain the purpose of the 'threshhold' parameter
    Suggestion Impact:The commit added a detailed comment explaining the purpose of the 'threshold' parameter, including its role in filtering spectra based on the 'spectrum_q' value.

    code diff:

    +      * @param decoy_prefix The prefix used to identify decoy protein accessions. Proteins with accessions starting with this prefix are marked as decoys. Otherwise, it assumes that the pin file already contains the correctly annotated decoy status.
    +      * @param threshold A double value representing the threshold for the `spectrum_q` value. Only spectra with `spectrum_q` below this threshold are processed.
    +                         Implemented to allow prefiltering of Sage results.

    Consider adding a brief inline comment explaining the purpose of the threshhold
    parameter and its default value.

    src/openms/include/OpenMS/FORMAT/PercolatorInfile.h [45-46]

     String decoy_prefix = "",
    -double threshhold = 0.01);
    +double threshhold = 0.01); // Score threshold for filtering, default: 1% FDR
     
    • Apply this suggestion
    Suggestion importance[1-10]: 6

    Why: Adding a comment explaining the purpose of the 'threshhold' parameter enhances code clarity and helps future developers understand its role without needing to dive into the implementation details.

    6
    Expand the description of SageAdapter updates with more details and an example

    Consider adding a brief explanation or example of the new functionality added to
    SageAdapter, to provide users with more context about the changes.

    CHANGELOG [26]

    -- SageAdapter received large updates including added functionality for PTM discovery + enabling features such as chimera seach, RT predicition, etc.
    +- SageAdapter received large updates including:
    +  - Added functionality for PTM discovery
    +  - Enabled features such as chimera search and RT prediction
    +  - Improved compatibility with newer Sage versions (v0.15.0 and beyond)
    +  Example: SageAdapter now supports automatic PTM discovery on peptides, enhancing its ability to identify modified proteins.
     
    • Apply this suggestion
    Suggestion importance[1-10]: 5

    Why: Providing more detailed information and an example about the SageAdapter updates in the changelog helps users understand the new functionalities and improvements, although it is not critical for functionality.

    5
    Provide more specific version information for SageAdapter compatibility

    Consider adding version numbers or more specific information about the compatibility
    with new Sage versions to provide clearer guidance for users.

    CHANGELOG [58]

    -- SageAdapter now works with sage v0.15.0 and beyond
    +- SageAdapter now works with Sage v0.15.0 and later versions (tested up to v0.X.X), ensuring compatibility with the latest Sage releases
     
    • Apply this suggestion
    Suggestion importance[1-10]: 4

    Why: Adding specific version information for SageAdapter compatibility in the changelog offers clearer guidance for users, but it is a minor enhancement as the existing information is already somewhat informative.

    4

    @timosachsenberg
    Copy link
    Contributor

    hi can you check why there are conflicts in other parts? Maybe your branch is not up-to-date?

    @JohannesvKL
    Copy link
    Contributor Author

    Ah yes, it appears the branch is not up-to-date, and also there are some random whitespaces in files that probably accidentally got added while I was looking at the files. Will have these fixed by Wednesday afternoon.

    @timosachsenberg timosachsenberg merged commit 594447a into OpenMS:develop Sep 30, 2024
    12 checks passed
    Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment
    Projects
    None yet
    Development

    Successfully merging this pull request may close these issues.

    2 participants
    0