Computational LOINC (in OWL).
- Python 3.11
- Clone repo: `git clone https://github.com/loinc/comp-loinc.git`
- Set up a virtual environment and activate it: `python -m venv venv`, then `source venv/bin/activate`
- Install Poetry: `pip install poetry`
- Install dependencies: `poetry install`
- Unzip downloaded inputs into the root directory of the repo.
  - a. Core developers: Download the latest `*_comploinc-build-sources.zip` from Google Drive, where `*` is a date in `YYYY-MM-DD` format.
  - b. Everyone else: Download releases from each source:
- Ensure that `comploinc_config.yaml` is updated to point by default to the versions of your choosing, and that the paths are correct. The config can accommodate whatever directory structure / folder names you choose, but below are some suggested conventions for each source (a combined layout sketch follows the list).
- LOINC: Unzip and place the folder (named `Loinc_2.80` or similar) into a `loinc_release` folder in the root directory of the repo.
- LOINC Tree: From this app, select from the "Hierarchy" menu at the top of the page. There are 7 options. When you select an option, select "Export". Extract the CSVs from each zip and put them into a single folder, using the following names: `class.csv`, `component.csv`, `document.csv`, `method.csv`, `panel.csv`, `system.csv`, `component_by_system.csv`. The name of this folder should reflect the current version number of LOINC as it shows on the LOINC download page. For example, if that page says "2.80", the folder name should be "2.80". Place this folder into a `loinc_trees` folder in the root directory of the repo.
- LOINC-SNOMED Ontology: Go to the website and fill out the form. You will get an email with a download link. Unzip the download, and place the unzipped folder into another folder named with the version number declared on that download page. Then place that folder into a `loinc_snomed_release` folder in the root directory of the repo.
- LOINC-SNOMED mappings: There is a mapping TSV file, e.g. `part-mappings_0.0.3.tsv`, which should be placed in the `loinc_snomed_release` directory at the root of the repo. However, this file is not downloadable online. To request it, find the contact email address in `pyproject.toml` and email us.
- SNOMED: Unzip and place the folder (named `SnomedCT_InternationalRF2_PRODUCTION_20240801T120000Z` or similar) into a `snomed_release` folder in the root directory of the repo.
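Assuming the suggested conventions above, the inputs at the root of the repo would look roughly like this (version numbers will vary):

```
comp-loinc/
├── loinc_release/
│   └── Loinc_2.80/
├── loinc_trees/
│   └── 2.80/
│       ├── class.csv
│       ├── component.csv
│       ├── component_by_system.csv
│       ├── document.csv
│       ├── method.csv
│       ├── panel.csv
│       └── system.csv
├── loinc_snomed_release/
│   ├── <versioned folder from the download>/
│   └── part-mappings_0.0.3.tsv
└── snomed_release/
    └── SnomedCT_InternationalRF2_PRODUCTION_20240801T120000Z/
```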
Contingencies

Apple Silicon users may need to run `export PYTHON_KEYRING_BACKEND=keyring.backends.null.Keyring` before running `poetry install`.
- `data/` - Static input files that don't need to be downloaded.
- `logs/` - Logs
- `owl-files/` - Contains some files to be merged together with build outputs.
- `src/comp_loinc/` - Uses a `loinclib` `networkx` graph to generate ontological outputs.
  - `builds/` - Build files
  - LinkML schema:
    - `datamodel/` - Generated Python LinkML datamodel
    - `schema/` - LinkML source schema
  - `cli.py` - Command line interface
  - `loinc_builder_steps.py` - LOINC builder steps
  - `module.py` - Instantiates and processes builder modules.
  - `runtime.py` - Manages the runtime environment. Allows sharing of data between modules.
  - `snomed_builder_steps.py` - SNOMED builder steps
- `src/loinclib/` - Uses inputs from LOINC and other sources to create a `networkx` graph.
  - `config.py` - Configuration
  - `graph.py` - `networkx` graph ops
  - `loinc_loader.py` - Loads LOINC release data
  - `loinc_schema.py` - Schema for LOINC
  - `loinc_snomed_loader.py` - Loads LOINC-SNOMED Ontology data
  - `loinc_snomed_schema.py` - Schema for the LOINC-SNOMED Ontology
  - `loinc_tree_loader.py` - Loads LOINC web app hierarchical data
  - `loinc_tree_schema.py` - Schema for LOINC web app hierarchical data
  - `snomed_loader.py` - Loads SNOMED release data
  - `snomed_schema_v2.py` - Schema for SNOMED release data
- `tests/` - Tests
- `comploinc_config.yaml` - Configuration (discussed further below)
If you just want to run a build of the default artefacts / options, run: `make all -B`.

The main part of the `make all` pipeline involves the building of modules (see the "outputs" section below). These are created through the `comploinc build` command.
```
Usage: comploinc build [OPTIONS] [BUILD_NAME]
```

Performs a build from a build file, as opposed to the "builder" command, which takes build steps.

Positional arguments:

`[BUILD_NAME]` - The build name or a path to a build file. The "default" build will build all outputs. [default: default]

Named arguments:

| Arg usage | Description |
|---|---|
| `--work-dir PATH` | CompLOINC work directory, defaults to current work directory. [default: (dynamic)] |
| `--config-file PATH` | Configuration file name. Defaults to "comploinc_config.yaml". [default: comploinc_config.yaml] |
| `-o, --out-dir PATH` | The output folder name. Defaults to "output". [default: output] |
| `--install-completion` | Install completion for the current shell. |
| `--show-completion` | Show completion for the current shell, to copy it or customize the installation. |
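For example, to run the default build with an explicit config file and output folder (using the flags documented above): `comploinc build --config-file comploinc_config.yaml -o output default`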
You can put together "builder" commands: lower-level steps that formulate the sub-commands of each `build` option, including what content is combined into the module, as well as IO, etc.

Documentation on this sub-command is pending. For now, it is best to reference the build files to see how builder commands are put together: `src/comp_loinc/builds/`
See: `comploinc_config.yaml`

If following the setup exactly, this configuration will not need to be modified.
- `group_components_systems.owl`
- `group_components.owl`
- `group_systems.owl`
- `loinc-part-hierarchy-all.owl`
- `loinc-part-list-all.owl`
- `loinc-snomed-equiv.owl`
- `loinc-term-primary-def.owl`
- `loinc-term-supplementary-def.owl`
- `loinc-terms-list-all.owl`
- `snomed-parts.owl`
There are a number of different ways in which these modules are merged in our analytical pipeline. See: more
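For illustration, two of the modules above could be combined with ROBOT: `robot merge --input loinc-term-primary-def.owl --input snomed-parts.owl --output merged.owl` (a hypothetical invocation; the pipeline's actual merges are defined in the build files).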
If there are errors related to `torch` while running CompLOINC, or `nlp_taxonomification.py` specifically, try changing the `torch` version to 2.1.0 in `pyproject.toml`.
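That is, set `torch = "2.1.0"` (assuming the dependency is declared under Poetry's `[tool.poetry.dependencies]` section; check where `torch` currently appears in the file).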
CompLOINC has some functionality for providing curator feedback on some of the inputs, which can be used to inform what content will or will not be included in the ontology.
NLP on dangling parts: `nlp-matches.sssom.tsv`

This file is the result of the semantic similarity process, which matches dangling part terms (those with no parent or child) against terms in the hierarchy to try to identify a good parent for them. For each dangling part, only the top match is included. Confidence is shown in the `similarity_score` column.
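Conceptually, the matching works like the following sketch (a minimal illustration assuming `sentence-transformers`; the model name and labels here are placeholders, not the pipeline's actual choices):

```python
from sentence_transformers import SentenceTransformer, util

model = SentenceTransformer("all-MiniLM-L6-v2")  # assumed model, for illustration

dangling = ["Example dangling part label"]                # parts with no parent/child
hierarchy = ["Candidate parent A", "Candidate parent B"]  # parts already in the hierarchy

# Embed both sets of labels and score every dangling part against the hierarchy.
d_emb = model.encode(dangling, convert_to_tensor=True)
h_emb = model.encode(hierarchy, convert_to_tensor=True)
scores = util.cos_sim(d_emb, h_emb)  # similarity matrix: dangling x hierarchy

# Keep only the top match per dangling part, as in nlp-matches.sssom.tsv.
for i, label in enumerate(dangling):
    best = int(scores[i].argmax())
    print(label, "->", hierarchy[best], f"(similarity_score={scores[i][best].item():.3f})")
```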
File location & related files

- `/curation/nlp-matches.sssom.tsv`: Committed. To be used by curators; will be re-read during build time.
- `/output/analysis/dangling/`: Not committed. Has several files related to `/curation/nlp-matches.sssom.tsv`.
This file adheres to the SSSOM standard. There are columns `subject_id`, `subject_label`, `object_id`, and `object_label`. The subjects are the dangling part terms, and the objects are the non-dangling part terms already in the hierarchy.
So where does curator input come into play? There is a `curator_approved` column. If its value is set to true (case insensitive) for a given row, the match will be included in the ontology. If it is set to false (case insensitive), the match will not be included. If it is empty, or holds some value other than true/false, the column will be ignored for that row and inclusion will be decided by the confidence threshold, which defaults to 0.5 and can be configured in `comploinc_config.yaml`. If the curator makes any judgements / edits to any rows, they should change the default `mapping_justification` from `semapv:SemanticSimilarityThresholdMatching` to `semapv:ManualMappingCuration`.
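The inclusion rule amounts to something like this (a minimal sketch assuming `pandas`; the real logic lives in the build pipeline, and the threshold is read from `comploinc_config.yaml`):

```python
import pandas as pd

THRESHOLD = 0.5  # default; configurable in comploinc_config.yaml

df = pd.read_csv("curation/nlp-matches.sssom.tsv", sep="\t", comment="#")

def include(row) -> bool:
    """Curator verdict wins; otherwise fall back to the confidence threshold."""
    verdict = str(row.get("curator_approved", "")).strip().lower()
    if verdict == "true":
        return True
    if verdict == "false":
        return False
    # Empty or unrecognized value: ignore the column for this row.
    return row["similarity_score"] >= THRESHOLD

included = df[df.apply(include, axis=1)]
```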
There are several columns in `nlp-matches.sssom.tsv` that are not part of the SSSOM specification. `curator_approved` is one of these, but there is also `PartTypeName`, representing the LOINC part type, and `subject_dangling` and `object_dangling`, boolean columns that indicate which of the subject or object in a given row is the dangling part and which is the part currently connected within the hierarchy.
This directory is created when the pipeline is run, and contains the following:
```
/output/analysis
├── chebi-subsets/  # Various intermediary files used to create the ChEBI-inspired hierarchy.
└── dangling/
    ├── cache/  # Cached word embeddings for dangling parts and hierarchical terms.
    ├── confidence_histogram.png
    ├── dangling.tsv  # The input file that generates nlp-matches.sssom.tsv. Shows all dangling part terms.
    └── nlp-matches.sssom_prop_analysis.tsv  # nlp-matches.sssom.tsv but with more columns; attempts to ascertain, for the confidence=1 cases, why subject and object have the same label by looking at their other properties.
```
This directory is not committed. `/output/analysis/dangling/` has several files related to `/curation/nlp-matches.sssom.tsv`.
Details

- `robot`
- Files in `output/build-default/fast-run/`
  - Can populate via: `comploinc --fast-run build default`

Run the tests with: `python -m unittest discover`
When any of the sources (e.g. LOINC release, LOINC tree web app, LOINC-SNOMED ontology, SNOMED release) are updated, we need to follow this procedure.
1. Download and unzip the source files into the desired / appropriate directories.
2. Update the config to point to these new paths.
3. Create a new `YYYY-MM-DD_comploinc-build-sources.zip` in the Google Drive folder. Ensure it has the correct structure (folder names and files at the right paths).
4. Make the link public: In the Google Drive folder, right-click the file, select "Share", and click "Share". At the bottom, under "General access", click the left dropdown and select "Anyone with the link", then click "Copy link".
5. Update `DL_LINK_ID` in GitHub: Go to the page for updating it. The value should be the ID found within the URL from step (4); e.g. if the link is "https://drive.google.com/file/d/1i9Ym1zJhC_l6P8egAMcj4Q1QtTGk7aST/view?usp=drive_link", the ID would be `1i9Ym1zJhC_l6P8egAMcj4Q1QtTGk7aST`. Paste this ID into the box and click "Update secret".
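If helpful, the ID can be pulled out of such a link programmatically (an illustrative helper, not part of the repo):

```python
import re

# Extract the Google Drive file ID from a sharing link (illustrative only).
link = "https://drive.google.com/file/d/1i9Ym1zJhC_l6P8egAMcj4Q1QtTGk7aST/view?usp=drive_link"
match = re.search(r"/file/d/([^/?]+)", link)
print(match.group(1))  # 1i9Ym1zJhC_l6P8egAMcj4Q1QtTGk7aST
```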