Use submodule to retrieve data files by GopiGugan · Pull Request #68 · PoonLab/sierra-local

Use submodule to retrieve data files #68

Merged: 7 commits, Jan 17, 2023
3 changes: 3 additions & 0 deletions .gitmodules
@@ -0,0 +1,3 @@
[submodule "sierralocal/data/hivfacts"]
path = sierralocal/data/hivfacts
url = https://github.com/hivdb/hivfacts.git
49 changes: 2 additions & 47 deletions README.md
@@ -25,6 +25,7 @@ On a Linux system, you can install *sierra-local* as follows:
```
git clone http://github.com/PoonLab/sierra-local
cd sierra-local
git submodule init; git submodule update
sudo python3 setup.py install
```
Note that you need super-user privileges to install the package by this method. For more detailed instructions, please refer to the document [INSTALL.md](INSTALL.md) that should be located in the root directory of this Python package.
@@ -107,6 +108,7 @@ If you have downloaded the package source to your computer, you can also run *si
```console
art@Jesry:~/git/sierra-local$ git clone http://github.com/PoonLab/sierra-local
art@Jesry:~/git/sierra-local$ cd sierra-local
art@Jesry:~/git/sierra-local$ git submodule init; git submodule update
art@Jesry:~/git/sierra-local$ python3
Python 3.6.6 (default, Sep 12 2018, 18:26:19)
[GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
@@ -131,53 +133,6 @@ python3 gui.py
```


## Updating the algorithm

The Stanford HIVdb database regularly updates its resistance genotyping algorithm and publishes the associated ASI2 XML file on their website. In previous versions of *sierra-local*, we used Python to automatically query this website and download the newest version if it was not already present on the user's computer. Subsequent changes to the Stanford HIVdb website, however, meant that users would have to install several additional dependencies in order for Python to locate the required files. As a result, we decided to make the `updater.py` script an optional step of the pipeline.

To run `updater.py`, you need to install the following requirements:
* [Google Chrome](https://www.google.com/chrome/)
* The appropriate [chromedriver](https://chromedriver.chromium.org/) for your operating system *and* [version of Google Chrome](https://chromedriver.chromium.org/downloads/version-selection).
* The Python module [Selenium](https://pypi.org/project/selenium/)

You'll also need to be working with a local clone or download of this code repository, because you will probably need to modify a line of the `updater.py` script.

Next, you need to locate your `chromedriver` binary. For the purpose of demonstration, I happened to leave this binary in my `~/Downloads` folder. Use this information to modify the following lines in `updater.py`:
```python
# this needs to be modified to point Python to your local chromedriver binary
mod_path = Path(os.path.dirname(__file__))
driver_path = mod_path.parent / 'bin/chromedriver'
```
Otherwise, Python will be unable to locate the Chrome browser binary and throws a `WebDriverException`:
```console
art@orolo:~/git/sierra-local$ python3 sierralocal/updater.py
...
Traceback (most recent call last):
File "sierralocal/updater.py", line 19, in <module>
browser = webdriver.Chrome(executable_path=str(driver_path), chrome_options=options)
File "/home/art/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
self.service.start()
File "/home/art/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
os.path.basename(self.path), self.start_error_message)
selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
```

For my system running Ubuntu, I modified the `updater.py` script as follows:
```python
#driver_path = mod_path.parent / 'bin/chromedriver'
driver_path = Path('/home/art/Downloads/chromedriver')
```

Manually running the script enabled me to grab the most recent versions of the ASI2 and APOBEC files from the HIVdb webserver:
```console
art@orolo:~/git/sierra-local$ sudo python3 sierralocal/updater.py
Updated HIVDB XML from https://hivdb.stanford.edu/assets/media/HIVDB_8.8.a126e04c.xml into sierralocal/data/HIVDB_8.8.a126e04c.xml
Updated APOBEC DRMs from https://hivdb.stanford.edu/assets/media/apobec-drms.221b0330.tsv into sierralocal/data/apobec.tsv
```

Now of course, it would be much simpler to manually point your browser to the Stanford HIVdb website and download these files yourself, but in some applications there may be a benefit to automating this step.


## About Us
This project was developed at the [Poon lab](http://github.com/PoonLab) within the [Department of Pathology and Laboratory Medicine](https://www.schulich.uwo.ca/pathol/), Schulich School of Medicine and Dentistry, [Western University](http://uwo.ca), London, Ontario. Development of *sierra-local* was supported in part by a grant from the [Canadian Institutes of Health Research](http://www.cihr-irsc.gc.ca/e/193.html) (PJT-156178).

4 changes: 3 additions & 1 deletion setup.py
@@ -52,7 +52,9 @@ def run(self):
'data/*Prevalences.tsv',
'data/*-comments.csv',
'data/HIVDB_8.8.a126e04c.xml',
'data/apobec-drms.221b0330.tsv'
'data/apobec-drms.221b0330.tsv',
'data/hivfacts/data/apobecs/apobec_*.json',
'data/hivfacts/data/algorithms/HIVDB_*'
]},
cmdclass={'install': OverrideInstall},
#include_package_data=True
Binary file modified sierralocal/bin/nucamino-darwin-386
Binary file not shown.
Binary file modified sierralocal/bin/nucamino-darwin-amd64
Binary file not shown.
Binary file modified sierralocal/bin/nucamino-linux-386
Binary file not shown.
Binary file modified sierralocal/bin/nucamino-linux-amd64
Binary file not shown.
Binary file modified sierralocal/bin/nucamino-windows-386.exe
Binary file not shown.
Binary file modified sierralocal/bin/nucamino-windows-amd64.exe
Binary file not shown.
1 change: 1 addition & 0 deletions sierralocal/data/hivfacts
Submodule hivfacts added at 3f7ecb
37 changes: 18 additions & 19 deletions sierralocal/hivdb.py
@@ -4,6 +4,7 @@
from pathlib import Path
import re
import sys
import subprocess


class HIVdb():
@@ -14,17 +15,19 @@ class HIVdb():
"""
def __init__(self, asi2=None, apobec=None, forceupdate=False):
self.xml_filename = None
self.tsv_filename = None
self.json_filename = None
self.BASE_URL = 'https://hivdb.stanford.edu'

if forceupdate:
# DEPRECATED, requires selenium, chrome and chromedriver
import sierralocal.updater as updater
self.xml_filename = updater.update_HIVDB()
self.tsv_filename = updater.updateAPOBEC()
print("Updating submodule to retrieve the latest data files")
# update submodules
try:
subprocess.check_call("git submodule foreach git pull origin main", shell=True)
except:
print("Could not update submodules")
else:
self.set_hivdb_xml(asi2)
self.set_apobec_tsv(apobec)
self.set_apobec_json(apobec)

# Set algorithm metadata
self.root = xml.parse(str(self.xml_filename)).getroot()
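
Review note: the `forceupdate` branch in the hunk above runs git through a single shell string and catches every exception. The following is only a sketch of the same call made with an argument list and narrower exception handling. The helper name `update_submodules` and its `repo_dir` and `git` parameters are hypothetical and do not appear in sierra-local; the git command itself is the one shown in the hunk.

```python
import subprocess


def update_submodules(repo_dir, git="git"):
    """Pull the latest commit in every submodule of repo_dir.

    Hypothetical helper (not part of sierra-local). Returns True on
    success, False if git cannot be started or exits with an error.
    """
    # Same command as in the diff, passed as a list so no shell is needed.
    cmd = [git, "submodule", "foreach", "git", "pull", "origin", "main"]
    try:
        subprocess.check_call(cmd, cwd=repo_dir)
    except FileNotFoundError:
        # Raised when the git executable (or repo_dir) does not exist.
        print("Could not update submodules: git or the directory was not found")
        return False
    except subprocess.CalledProcessError as err:
        # Raised when git runs but returns a non-zero exit status.
        print("Could not update submodules: exit status {}".format(err.returncode))
        return False
    return True
```

Returning a boolean lets the caller decide whether to continue with the data files already on disk.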
@@ -38,14 +41,14 @@ def set_hivdb_xml(self, path):
if path is None:
# If user has not specified XML path
# Iterate over possible HIVdb ASI files matching the glob pattern
dest = str(Path(os.path.dirname(__file__))/'data'/'HIVDB*.xml')
dest = str(Path(os.path.dirname(__file__))/'data'/'hivfacts'/'data'/'algorithms'/'HIVDB_*.xml')
print("searching path " + dest)
files = glob.glob(dest)

# find the newest XML that can be parsed
intermed = []
for file in files:
version = re.search("HIVDB_([0-9]\.[0-9.]+)\.", file).group(1)
version = re.search("HIVDB_([0-9]\.[0-9.-]+)\.", file).group(1)
intermed.append((version, file))
intermed.sort(reverse=True)

@@ -78,35 +81,31 @@ def set_hivdb_xml(self, path):
# Parseable XML file not found. Update from web
if not file_found:
print("Error: could not find local copy of HIVDB XML.")
print("Manually download from https://hivdb.stanford.edu/page/"
"release-notes/#algorithm.updates")
print("Please ensure that the submodule (https://github.com/hivdb/hivfacts/tree/) has been initialized and updated")
sys.exit()
# self.xml_filename = updater.update_HIVDB()

def set_apobec_tsv(self, path):
def set_apobec_json(self, path):
"""
Attempt to locate a local APOBEC DRM file (tsv format)
Attempt to locate a local APOBEC DRM file (json format)
"""
if path is None:
dest = str(Path(os.path.dirname(__file__))/'data'/'apobec*.tsv')
dest = str(Path(os.path.dirname(__file__))/'data'/'hivfacts'/'data'/'apobecs'/'apobec_*.json')
print("searching path {}".format(dest))
files = glob.glob(dest)
for file in files:
# no version numbering, take first hit
# TODO: some basic format check on TSV file
if os.path.isfile(file):
self.tsv_filename = file
self.json_filename = file
return
else:
self.tsv_filename = path
self.json_filename = path
return

# if we end up here, no local files found
print("Error: could not locate local APOBEC DRM data file.")
print("Manually download from https://hivdb.stanford.edu/page/"
"release-notes/#data.files")
print("Please ensure that the submodule (https://github.com/hivdb/hivfacts/tree/) has been initialized and updated")
sys.exit()
# self.tsv_filename = updater.updateAPOBEC()

def parse_definitions(self, root):
"""
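
Review note: `set_hivdb_xml` sorts `(version, file)` pairs with the version held as a string, so the comparison is lexicographic and a reverse sort ranks `9.4` above `9.10`. The version pattern in the hunk also expects a single-digit major version. A possible numeric alternative is sketched below, assuming file names of the form `HIVDB_<version>.xml` where the version parts are separated by `.` or `-`. The names `version_key` and `newest_first` and the sample file names are illustrative assumptions, not code from this pull request.

```python
import re

# Assumed naming scheme: HIVDB_<digits separated by '.' or '-'>.xml
VERSION_RE = re.compile(r"HIVDB_(\d+(?:[.-]\d+)*)\.xml$")


def version_key(filename):
    """Return the version as a tuple of ints, or None if the name does not match."""
    match = VERSION_RE.search(filename)
    if match is None:
        return None
    return tuple(int(part) for part in re.split(r"[.-]", match.group(1)))


def newest_first(filenames):
    """Order matching file names from newest to oldest version."""
    keyed = [(version_key(name), name) for name in filenames]
    # Skip names that do not follow the assumed scheme instead of raising.
    keyed = [(key, name) for key, name in keyed if key is not None]
    keyed.sort(reverse=True)
    return [name for _, name in keyed]


# Made-up file names, used only to show the ordering.
candidates = ["HIVDB_8.9-1.xml", "HIVDB_9.4.xml", "HIVDB_8.8.xml", "HIVDB_9.10.xml"]
ordered = newest_first(candidates)
```

Comparing integer tuples keeps `9.10` ahead of `9.4`, and unmatched names are skipped rather than causing an `AttributeError` on a failed match.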
13 changes: 6 additions & 7 deletions sierralocal/jsonwriter.py
@@ -23,10 +23,9 @@ def __init__(self, algorithm):
self.comments = self.algorithm.parse_comments(self.algorithm.root)

# Load comments files stored locally. These are distributed in the repo for now.
#dest = str(Path(os.path.dirname(__file__))/'data'/'apobec.tsv')
dest = algorithm.tsv_filename
dest = algorithm.json_filename
with open(dest,'r') as csvfile:
self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
self.ApobecDRMs = json.load(csvfile)

dest = str(Path(os.path.dirname(__file__))/'data'/'INSTI-comments.csv')
with open(dest, 'r') as INSTI_file:
@@ -258,11 +257,11 @@ def findComment(self, gene, mutation, comments, details):
return details[full_mut]['1']

def isApobecDRM(self, gene, consensus, position, AA):
ls = [row[0:3] for row in self.ApobecDRMs[1:]]
if [gene, consensus, str(position)] in ls:
i = ls.index([gene, consensus, str(position)])
ls = [[row['gene'], str(row['position'])] for row in self.ApobecDRMs]
if [gene, str(position)] in ls:
i = ls.index([gene, str(position)])
for aa in AA:
if aa in self.ApobecDRMs[1:][i][3]:
if aa in self.ApobecDRMs[i]['aa']:
return True
return False

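
Review note: the new `isApobecDRM` rebuilds a list on every call and uses `list.index`, which returns only the first record for a given gene and position. If the JSON file holds several records for the same position, the later ones are never checked. The sketch below groups the records once into a dictionary. It assumes each record has the `gene`, `position` and `aa` keys used in the hunk; the function names and the sample records are hypothetical.

```python
from collections import defaultdict


def build_apobec_index(records):
    """Map (gene, position) to the set of amino acids flagged at that site."""
    index = defaultdict(set)
    for row in records:
        # row['aa'] may be one letter or several; each letter is stored.
        index[(row["gene"], int(row["position"]))].update(row["aa"])
    return index


def is_apobec_drm(index, gene, position, amino_acids):
    """Return True if any of the amino acids is flagged at this site."""
    known = index.get((gene, int(position)), set())
    return any(aa in known for aa in amino_acids)


# Made-up records in the assumed format, for illustration only.
sample = [
    {"gene": "PR", "position": 30, "aa": "N"},
    {"gene": "PR", "position": 30, "aa": "K"},
    {"gene": "RT", "position": 67, "aa": "N"},
]
apobec_index = build_apobec_index(sample)
```

Building the index once, for example when the JSON file is loaded, turns each lookup into a dictionary access and covers every record for a site.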
10 changes: 5 additions & 5 deletions sierralocal/main.py
@@ -87,7 +87,7 @@ def scorefile(input_file, algorithm, do_subtype=False):
sequence_lengths, file_trims, subtypes


def sierralocal(fasta, outfile, xml=None, tsv=None, cleanup=False, forceupdate=False):
def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=False):
"""
Contains all initializing and processing calls.

@@ -104,7 +104,7 @@ def sierralocal(fasta, outfile, xml=None, tsv=None, cleanup=False, forceupdate=F

# initialize algorithm and jsonwriter
time0 = time.time()
algorithm = HIVdb(asi2=xml, apobec=tsv, forceupdate=forceupdate)
algorithm = HIVdb(asi2=xml, apobec=json, forceupdate=forceupdate)
writer = JSONWriter(algorithm)
time_elapsed = time.time() - time0

@@ -152,8 +152,8 @@ def parse_args():
parser.add_argument('-o', dest='outfile', default=None, type=str, help='Output filename.')
parser.add_argument('-xml', default=None,
help='<optional> Path to HIVdb ASI2 XML file')
parser.add_argument('-tsv', default=None,
help='<optional> Path to tab-delimited (tsv) HIVdb APOBEC DRM file')
parser.add_argument('-json', default=None,
help='<optional> Path to JSON HIVdb APOBEC DRM file')
parser.add_argument('--cleanup', action='store_true',
help='Deletes NucAmino alignment file after processing.')
parser.add_argument('--forceupdate', action='store_true',
@@ -175,7 +175,7 @@ def main():
sys.exit()

time_start = time.time()
count, time_elapsed = sierralocal(args.fasta, args.outfile, xml=args.xml, tsv=args.tsv,
count, time_elapsed = sierralocal(args.fasta, args.outfile, xml=args.xml, json=args.json,
cleanup=args.cleanup, forceupdate=args.forceupdate)
time_diff = time.time() - time_start

16 changes: 9 additions & 7 deletions sierralocal/nucaminohook.py
@@ -53,8 +53,9 @@ def __init__(self, algorithm, binary=None):
self.tripletTable = self.generateTable()

#with open(str(Path(os.path.dirname(__file__))/'data'/'apobec.tsv'), 'r') as csvfile:
with open(algorithm.tsv_filename) as csvfile:
self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
with open(algorithm.json_filename) as jsonfile:
# self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
self.ApobecDRMs = json.load(jsonfile)

self.PI_dict = self.prevalence_parser('PIPrevalences.tsv')
self.RTI_dict = self.prevalence_parser('RTIPrevalences.tsv')
@@ -169,10 +170,11 @@ def align_file(self, filename):

args = [
'{}'.format(self.nucamino_binary), # in case of byte-string
"align",
"hiv1b",
"pol",
"-q",
"-i", tf.name,
"-g=POL",
'--output-format', 'json',
]
p = subprocess.Popen(args, stdout=subprocess.PIPE) #, encoding='utf8')
@@ -514,11 +516,11 @@ def isStopCodon(self, triplet):
return ("*" in self.translateNATriplet(triplet))

def isApobecDRM(self, gene, consensus, position, AA):
ls = [row[0:3] for row in self.ApobecDRMs[1:]]
if [gene, consensus, str(position)] in ls:
i = ls.index([gene, consensus, str(position)])
ls = [[row['gene'], str(row['position'])] for row in self.ApobecDRMs]
if [gene, str(position)] in ls:
i = ls.index([gene, str(position)])
for aa in AA:
if aa in self.ApobecDRMs[1:][i][3]:
if aa in self.ApobecDRMs[i]['aa']:
return True
return False

77 changes: 0 additions & 77 deletions sierralocal/updater.py

This file was deleted.
