PoonLab · ArtPoon · Jan 17, 2023 · Aug 4, 2022 · Aug 7, 2022 · Sep 29, 2022
diff --git a/.gitmodules b/.gitmodules
@@ -0,0 +1,3 @@
+[submodule "sierralocal/data/hivfacts"]
+	path = sierralocal/data/hivfacts
+	url = https://github.com/hivdb/hivfacts.git
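
Note on the submodule: the data files now live in the `hivfacts` submodule, so a fresh clone has an empty `sierralocal/data/hivfacts` directory until the submodule is initialized. The following is a minimal sketch, not code from this changeset, of how that step could be driven from Python without `shell=True`. The function names `submodule_update_cmd` and `update_submodules` are hypothetical; the `--init` and `--remote` options are standard `git submodule update` flags.

```python
import subprocess


def submodule_update_cmd(remote=False):
    """Build the git command that initializes and checks out all submodules."""
    cmd = ["git", "submodule", "update", "--init"]
    if remote:
        # --remote moves each submodule to the tip of its tracked remote branch
        cmd.append("--remote")
    return cmd


def update_submodules(repo_dir, remote=False):
    """Run the command inside repo_dir; return True on success, False otherwise."""
    try:
        subprocess.run(submodule_update_cmd(remote), cwd=str(repo_dir), check=True)
    except (subprocess.CalledProcessError, OSError) as err:
        # CalledProcessError: git exited non-zero; OSError: git or repo_dir is missing
        print("Could not update submodules: {}".format(err))
        return False
    return True
```

Passing the command as a list avoids shell quoting issues, and catching only the two expected exception types keeps unrelated errors visible.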
diff --git a/README.md b/README.md
@@ -25,6 +25,7 @@ On a Linux system, you can install *sierra-local* as follows:
 ```
 git clone http://github.com/PoonLab/sierra-local
 cd sierra-local
+git submodule init; git submodule update
 sudo python3 setup.py install
 ```
 Note that you need super-user privileges to install the package by this method.  For more detailed instructions, please refer to the document [INSTALL.md](INSTALL.md) that should be located in the root directory of this Python package.
@@ -107,6 +108,7 @@ If you have downloaded the package source to your computer, you can also run *si
 ```console
 art@Jesry:~/git/sierra-local$ git clone http://github.com/PoonLab/sierra-local
 art@Jesry:~/git/sierra-local$ cd sierra-local
+art@Jesry:~/git/sierra-local$ git submodule init; git submodule update
 art@Jesry:~/git/sierra-local$ python3
 Python 3.6.6 (default, Sep 12 2018, 18:26:19) 
 [GCC 8.0.1 20180414 (experimental) [trunk revision 259383]] on linux
@@ -131,53 +133,6 @@ python3 gui.py
 ```
 
 
-## Updating the algorithm
-
-The Stanford HIVdb database regularly updates its resistance genotyping algorithm and publishes the associated ASI2 XML file on their website.  In previous versions of *sierra-local*, we used Python to automatically query this website and download the newest version if it was not already present on the user's computer.  Subsequent changes to the Stanford HIVdb website, however, meant that users would have to install several additional dependencies in order for Python to locate the required files.  As a result, we decided to make the `updater.py` script an optional step of the pipeline.
-
-To run `updater.py`, you need to install the following requirements:
-* [Google Chrome](https://www.google.com/chrome/)
-* The appropriate [chromedriver](https://chromedriver.chromium.org/) for your operating system *and* [version of Google Chrome](https://chromedriver.chromium.org/downloads/version-selection).
-* The Python module [Selenium](https://pypi.org/project/selenium/)
-
-You'll also need to be working with a local clone or download of this code repository, because you will probably need to modify a line of the `updater.py` script.
-
-Next, you need to locate your `chromedriver` binary.  For the purpose of demonstration, I happened to leave this binary in my `~/Downloads` folder.  Use this information to modify the following lines in `updater.py`:
-```python
-# this needs to be modified to point Python to your local chromedriver binary
-mod_path = Path(os.path.dirname(__file__))
-driver_path = mod_path.parent / 'bin/chromedriver'
-```
-Otherwise, Python will be unable to locate the Chrome browser binary and throws a `WebDriverException`:
-```console
-art@orolo:~/git/sierra-local$ python3 sierralocal/updater.py
-...
-Traceback (most recent call last):
-  File "sierralocal/updater.py", line 19, in <module>
-    browser = webdriver.Chrome(executable_path=str(driver_path), chrome_options=options)
-  File "/home/art/.local/lib/python3.6/site-packages/selenium/webdriver/chrome/webdriver.py", line 73, in __init__
-    self.service.start()
-  File "/home/art/.local/lib/python3.6/site-packages/selenium/webdriver/common/service.py", line 83, in start
-    os.path.basename(self.path), self.start_error_message)
-selenium.common.exceptions.WebDriverException: Message: 'chromedriver' executable needs to be in PATH. Please see https://sites.google.com/a/chromium.org/chromedriver/home
-```
-
-For my system running Ubuntu, I modified the `updater.py` script as follows:
-```python
-#driver_path = mod_path.parent / 'bin/chromedriver'
-driver_path = Path('/home/art/Downloads/chromedriver')
-```
-
-Manually running the script enabled me to grab the most recent versions of the ASI2 and APOBEC files from the HIVdb webserver:
-```console
-art@orolo:~/git/sierra-local$ sudo python3 sierralocal/updater.py 
-Updated HIVDB XML from https://hivdb.stanford.edu/assets/media/HIVDB_8.8.a126e04c.xml into sierralocal/data/HIVDB_8.8.a126e04c.xml
-Updated APOBEC DRMs from https://hivdb.stanford.edu/assets/media/apobec-drms.221b0330.tsv into sierralocal/data/apobec.tsv
-```
-
-Now of course, it would be much simpler to manually point your browser to the Stanford HIVdb website and download these files yourself, but in some applications there may be a benefit to automating this step.
-
-
 ## About Us
 This project was developed at the [Poon lab](http://github.com/PoonLab) within the [Department of Pathology and Laboratory Medicine](https://www.schulich.uwo.ca/pathol/), Schulich School of Medicine and Dentistry, [Western University](http://uwo.ca), London, Ontario.  Development of *sierra-local* was supported in part by a grant from the [Canadian Institutes of Health Research](http://www.cihr-irsc.gc.ca/e/193.html) (PJT-156178).
 

diff --git a/setup.py b/setup.py
@@ -52,7 +52,9 @@ def run(self):
             'data/*Prevalences.tsv',
             'data/*-comments.csv',
             'data/HIVDB_8.8.a126e04c.xml',
-            'data/apobec-drms.221b0330.tsv'
+            'data/apobec-drms.221b0330.tsv',
+            'data/hivfacts/data/apobecs/apobec_*.json',
+            'data/hivfacts/data/algorithms/HIVDB_*'
     ]},
     cmdclass={'install': OverrideInstall},
     #include_package_data=True

diff --git a/sierralocal/bin/nucamino-darwin-386 b/sierralocal/bin/nucamino-darwin-386
diff --git a/sierralocal/bin/nucamino-darwin-amd64 b/sierralocal/bin/nucamino-darwin-amd64
diff --git a/sierralocal/bin/nucamino-linux-386 b/sierralocal/bin/nucamino-linux-386
diff --git a/sierralocal/bin/nucamino-linux-amd64 b/sierralocal/bin/nucamino-linux-amd64
diff --git a/sierralocal/bin/nucamino-windows-386.exe b/sierralocal/bin/nucamino-windows-386.exe
diff --git a/sierralocal/bin/nucamino-windows-amd64.exe b/sierralocal/bin/nucamino-windows-amd64.exe
diff --git a/sierralocal/data/hivfacts b/sierralocal/data/hivfacts
diff --git a/sierralocal/hivdb.py b/sierralocal/hivdb.py
@@ -4,6 +4,7 @@
 from pathlib import Path
 import re
 import sys
+import subprocess
 
 
 class HIVdb():
@@ -14,17 +15,19 @@ class HIVdb():
     """
     def __init__(self, asi2=None, apobec=None, forceupdate=False):
         self.xml_filename = None
-        self.tsv_filename = None
+        self.json_filename = None
         self.BASE_URL = 'https://hivdb.stanford.edu'
 
         if forceupdate:
-            # DEPRECATED, requires selenium, chrome and chromedriver
-            import sierralocal.updater as updater
-            self.xml_filename = updater.update_HIVDB()
-            self.tsv_filename = updater.updateAPOBEC()
+            print("Updating submodule to retrieve the latest data files")
+            # update submodules
+            try:
+                subprocess.check_call("git submodule foreach git pull origin main", shell=True)
+            except subprocess.CalledProcessError:
+                print("Could not update submodules")
         else:
             self.set_hivdb_xml(asi2)
-            self.set_apobec_tsv(apobec)
+            self.set_apobec_json(apobec)
 
         # Set algorithm metadata
         self.root = xml.parse(str(self.xml_filename)).getroot()
@@ -38,14 +41,14 @@ def set_hivdb_xml(self, path):
         if path is None:
             # If user has not specified XML path
             # Iterate over possible HIVdb ASI files matching the glob pattern
-            dest = str(Path(os.path.dirname(__file__))/'data'/'HIVDB*.xml')
+            dest = str(Path(os.path.dirname(__file__))/'data'/'hivfacts'/'data'/'algorithms'/'HIVDB_*.xml')
             print("searching path " + dest)
             files = glob.glob(dest)
 
             # find the newest XML that can be parsed
             intermed = []
             for file in files:
-                version = re.search("HIVDB_([0-9]\.[0-9.]+)\.", file).group(1)
+                version = re.search(r"HIVDB_([0-9]\.[0-9.-]+)\.", file).group(1)
                 intermed.append((version, file))
             intermed.sort(reverse=True)
 
@@ -78,35 +81,31 @@ def set_hivdb_xml(self, path):
         # Parseable XML file not found. Update from web
         if not file_found:
             print("Error: could not find local copy of HIVDB XML.")
-            print("Manually download from https://hivdb.stanford.edu/page/"
-                  "release-notes/#algorithm.updates")
+            print("Please ensure that the submodule (https://github.com/hivdb/hivfacts/tree/) has been initialized and updated")
             sys.exit()
-            # self.xml_filename = updater.update_HIVDB()
 
-    def set_apobec_tsv(self, path):
+    def set_apobec_json(self, path):
         """
-        Attempt to locate a local APOBEC DRM file (tsv format)
+        Attempt to locate a local APOBEC DRM file (json format)
         """
         if path is None:
-            dest = str(Path(os.path.dirname(__file__))/'data'/'apobec*.tsv')
+            dest = str(Path(os.path.dirname(__file__))/'data'/'hivfacts'/'data'/'apobecs'/'apobec_*.json')
             print("searching path {}".format(dest))
             files = glob.glob(dest)
             for file in files:
                 # no version numbering, take first hit
                 # TODO: some basic format check on TSV file
                 if os.path.isfile(file):
-                    self.tsv_filename = file
+                    self.json_filename = file
                     return
         else:
-            self.tsv_filename = path
+            self.json_filename = path
             return
 
         # if we end up here, no local files found
         print("Error: could not locate local APOBEC DRM data file.")
-        print("Manually download from https://hivdb.stanford.edu/page/"
-              "release-notes/#data.files")
+        print("Please ensure that the submodule (https://github.com/hivdb/hivfacts/tree/) has been initialized and updated")
         sys.exit()
-        # self.tsv_filename = updater.updateAPOBEC()
 
     def parse_definitions(self, root):
         """

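Note on version ordering in `set_hivdb_xml`: the hunk above sorts the captured version strings lexicographically, so a hypothetical `8.10` release would rank below `8.9`, and the pattern only accepts a single-digit major version. The sketch below shows one way to compare versions numerically instead. It assumes file names of the form `HIVDB_<numbers separated by . or ->.xml`; names carrying a hash suffix are skipped here. The names `VERSION_RE`, `version_key` and `newest_first` are hypothetical and not part of *sierra-local*.

```python
import re

# Assumed naming scheme: HIVDB_9.4.xml, HIVDB_8.9-1.xml, ...
VERSION_RE = re.compile(r"HIVDB_(\d+(?:[.-]\d+)*)\.xml$")


def version_key(filename):
    """Return the version as a tuple of ints, or None if the name does not match."""
    match = VERSION_RE.search(filename)
    if match is None:
        return None
    return tuple(int(part) for part in re.split(r"[.-]", match.group(1)))


def newest_first(filenames):
    """Order matching files from newest to oldest, dropping non-matching names."""
    keyed = [(version_key(f), f) for f in filenames]
    keyed = [(key, f) for key, f in keyed if key is not None]
    return [f for key, f in sorted(keyed, reverse=True)]
```

The existing loop that tries to parse each candidate XML in turn could then iterate over `newest_first(glob.glob(dest))`.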
diff --git a/sierralocal/jsonwriter.py b/sierralocal/jsonwriter.py
@@ -23,10 +23,9 @@ def __init__(self, algorithm):
         self.comments = self.algorithm.parse_comments(self.algorithm.root)
 
         # Load comments files stored locally. These are distributed in the repo for now.
-        #dest = str(Path(os.path.dirname(__file__))/'data'/'apobec.tsv')
-        dest = algorithm.tsv_filename
+        dest = algorithm.json_filename
         with open(dest,'r') as csvfile:
-            self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
+            self.ApobecDRMs = json.load(csvfile)
 
         dest = str(Path(os.path.dirname(__file__))/'data'/'INSTI-comments.csv')
         with open(dest, 'r') as INSTI_file:
@@ -258,11 +257,11 @@ def findComment(self, gene, mutation, comments, details):
                         return details[full_mut]['1']
 
     def isApobecDRM(self, gene, consensus, position, AA):
-        ls = [row[0:3] for row in self.ApobecDRMs[1:]]
-        if [gene, consensus, str(position)] in ls:
-            i = ls.index([gene, consensus, str(position)])
+        ls = [[row['gene'], str(row['position'])] for row in self.ApobecDRMs]
+        if [gene, str(position)] in ls:
+            i = ls.index([gene, str(position)])
             for aa in AA:
-                if aa in self.ApobecDRMs[1:][i][3]:
+                if aa in self.ApobecDRMs[i]['aa']:
                     return True
         return False
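
Note on `isApobecDRM`: the new version rebuilds a list of `[gene, position]` pairs on every call and then inspects only the first matching record. If the JSON file holds more than one record for the same gene and position, the later ones would not be consulted. A minimal sketch of an alternative follows, assuming each record is an object with `gene`, `position` and `aa` keys as the hunk above implies. The helper names `load_apobec`, `build_apobec_index` and `is_apobec_drm` are hypothetical.

```python
import json


def load_apobec(path):
    """Read the APOBEC DRM records from a JSON file."""
    with open(path) as handle:
        return json.load(handle)


def build_apobec_index(records):
    """Map (gene, position) to the set of amino acids flagged at that site."""
    index = {}
    for row in records:
        key = (row["gene"], int(row["position"]))
        # update() merges every record at the same site, one character per amino acid
        index.setdefault(key, set()).update(row["aa"])
    return index


def is_apobec_drm(index, gene, position, aas):
    """Return True if any amino acid in aas is flagged at (gene, position)."""
    known = index.get((gene, int(position)), set())
    return any(aa in known for aa in aas)
```

The index would be built once, for example in `__init__` right after `json.load`, so that each lookup is a single dictionary access.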
 

diff --git a/sierralocal/main.py b/sierralocal/main.py
@@ -87,7 +87,7 @@ def scorefile(input_file, algorithm, do_subtype=False):
            sequence_lengths, file_trims, subtypes
 
 
-def sierralocal(fasta, outfile, xml=None, tsv=None, cleanup=False, forceupdate=False):
+def sierralocal(fasta, outfile, xml=None, json=None, cleanup=False, forceupdate=False):
     """
     Contains all initializing and processing calls.
 
@@ -104,7 +104,7 @@ def sierralocal(fasta, outfile, xml=None, tsv=None, cleanup=False, forceupdate=F
 
     # initialize algorithm and jsonwriter
     time0 = time.time()
-    algorithm = HIVdb(asi2=xml, apobec=tsv, forceupdate=forceupdate)
+    algorithm = HIVdb(asi2=xml, apobec=json, forceupdate=forceupdate)
     writer = JSONWriter(algorithm)
     time_elapsed = time.time() - time0
 
@@ -152,8 +152,8 @@ def parse_args():
     parser.add_argument('-o', dest='outfile', default=None, type=str, help='Output filename.')
     parser.add_argument('-xml', default=None,
                         help='<optional> Path to HIVdb ASI2 XML file')
-    parser.add_argument('-tsv', default=None,
-                        help='<optional> Path to tab-delimited (tsv) HIVdb APOBEC DRM file')
+    parser.add_argument('-json', default=None,
+                        help='<optional> Path to JSON HIVdb APOBEC DRM file')
     parser.add_argument('--cleanup', action='store_true',
                         help='Deletes NucAmino alignment file after processing.')
     parser.add_argument('--forceupdate', action='store_true',
@@ -175,7 +175,7 @@ def main():
             sys.exit()
 
     time_start = time.time()
-    count, time_elapsed = sierralocal(args.fasta, args.outfile, xml=args.xml, tsv=args.tsv,
+    count, time_elapsed = sierralocal(args.fasta, args.outfile, xml=args.xml, json=args.json,
                                       cleanup=args.cleanup, forceupdate=args.forceupdate)
     time_diff = time.time() - time_start
 

diff --git a/sierralocal/nucaminohook.py b/sierralocal/nucaminohook.py
@@ -53,8 +53,9 @@ def __init__(self, algorithm, binary=None):
         self.tripletTable = self.generateTable()
 
         #with open(str(Path(os.path.dirname(__file__))/'data'/'apobec.tsv'), 'r') as csvfile:
-        with open(algorithm.tsv_filename) as csvfile:
-            self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
+        with open(algorithm.json_filename) as jsonfile:
+            # self.ApobecDRMs = list(csv.reader(csvfile, delimiter='\t'))
+            self.ApobecDRMs = json.load(jsonfile)
 
         self.PI_dict = self.prevalence_parser('PIPrevalences.tsv')
         self.RTI_dict = self.prevalence_parser('RTIPrevalences.tsv')
@@ -169,10 +170,11 @@ def align_file(self, filename):
 
         args = [
             '{}'.format(self.nucamino_binary),  # in case of byte-string
+            "align",
             "hiv1b",
+            "pol",
             "-q",
             "-i", tf.name,
-            "-g=POL",
             '--output-format', 'json',
         ]
         p = subprocess.Popen(args, stdout=subprocess.PIPE)  #, encoding='utf8')
@@ -514,11 +516,11 @@ def isStopCodon(self, triplet):
         return ("*" in self.translateNATriplet(triplet))
 
     def isApobecDRM(self, gene, consensus, position, AA):
-        ls = [row[0:3] for row in self.ApobecDRMs[1:]]
-        if [gene, consensus, str(position)] in ls:
-            i = ls.index([gene, consensus, str(position)])
+        ls = [[row['gene'], str(row['position'])] for row in self.ApobecDRMs]
+        if [gene, str(position)] in ls:
+            i = ls.index([gene, str(position)])
             for aa in AA:
-                if aa in self.ApobecDRMs[1:][i][3]:
+                if aa in self.ApobecDRMs[i]['aa']:
                     return True
         return False
 

diff --git a/sierralocal/updater.py b/sierralocal/updater.py