Bugfix: spec source_col parameter not working as expected · Issue #67 · Gilead-BioStats/gsm.mapping · GitHub
Expected Behavior
When defining source_col for a column in the spec YAML section, I would expect:
a function that uses the spec to rename the column, e.g. gsm.mapping::Ingest, or automated renaming
CheckSpec to check for the presence of source_col and issue a warning specific to source_col
CheckSpec to run its data type check on the renamed column
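To make the first expectation concrete, here is a minimal base R sketch of renaming driven by source_col. The helper `rename_from_spec()` and the toy data frame are assumptions made for illustration; they are not part of gsm.mapping or gsm.core, and only the spec structure is taken from the reprex below.

```r
# Hypothetical helper (NOT part of gsm.mapping): rename the columns of one
# data frame according to the `source_col` entries of its domain spec.
rename_from_spec <- function(df, domain_spec) {
  for (target in names(domain_spec)) {
    src <- domain_spec[[target]]$source_col
    # Rename only when a source_col is defined and present in the data.
    if (!is.null(src) && src %in% names(df)) {
      names(df)[names(df) == src] <- target
    }
  }
  df
}

# Toy stand-in for clindata::ctms_protdev (assumed values).
raw_pd <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No")
)

# Same structure as wflow_map$PD$spec$Raw_PD in the reprex.
spec_pd <- list(
  subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
  deemedimportant = list(type = "character")
)

mapped_pd <- rename_from_spec(raw_pd, spec_pd)
names(mapped_pd)  # the column names are now subjid and deemedimportant
```

Columns without a source_col are left untouched, which matches how deemedimportant is treated in the spec.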
Current Behavior
Ingest does not seem to work when it is defined in the steps section of the YAML
CheckSpec seems to ignore source_col and does not check whether the source columns are present
The CheckSpec data type check works, but only if the column is already present under its final name and does not need to be renamed
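For comparison, a source_col-aware presence check could look roughly like the sketch below. `missing_spec_cols()` is a hypothetical function written for this issue, not the actual CheckSpec implementation. It looks for source_col when one is defined and falls back to the spec column name otherwise.

```r
# Hypothetical check (NOT the actual CheckSpec): report spec columns that
# cannot be found in the data, looking for `source_col` when it is defined.
missing_spec_cols <- function(df, domain_spec) {
  expected <- vapply(names(domain_spec), function(target) {
    src <- domain_spec[[target]]$source_col
    if (is.null(src)) target else src
  }, character(1))
  missing <- unname(expected[!expected %in% names(df)])
  if (length(missing) > 0) {
    warning(
      "Columns (or their source_col) missing from data: ",
      paste(missing, collapse = ", ")
    )
  }
  missing
}

# Toy stand-in for the raw PD data (assumed values).
raw_pd <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No")
)

spec_ok <- list(
  subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
  deemedimportant = list(type = "character")
)
# Same spec with source_col changed to a column that does not exist.
spec_bad <- spec_ok
spec_bad$subjid$source_col <- "foo"

missing_ok <- missing_spec_cols(raw_pd, spec_ok)                     # nothing missing
missing_bad <- suppressWarnings(missing_spec_cols(raw_pd, spec_bad)) # reports "foo"
```

With a check like this, the warning would name the source column that is actually absent instead of Raw_PD$subjid.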
Possible Solution
I am not sure what the intent is here. Maybe the source_col argument could be deprecated and column renaming done in the steps via RunQuery. CheckSpec should then run after the steps, on the final output rather than on the input.
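The ordering matters because the spec describes the renamed columns. The sketch below uses a hypothetical `check_output()` function, not part of gsm.core, to show that the same spec fails against the raw input and passes against the output of the renaming step.

```r
# Hypothetical post-step check (NOT part of gsm.core): every spec column must
# exist in the data frame and have the declared type.
check_output <- function(df, domain_spec) {
  vapply(names(domain_spec), function(col) {
    col %in% names(df) && inherits(df[[col]], domain_spec[[col]]$type)
  }, logical(1))
}

# Toy stand-in for the raw PD data (assumed values).
raw_pd <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No")
)
spec_pd <- list(
  subjid = list(type = "character"),
  deemedimportant = list(type = "character")
)

# Stand-in for the renaming step (RunQuery does this via SQL in the reprex).
mapped_pd <- raw_pd
names(mapped_pd)[names(mapped_pd) == "subjectenrollmentnumber"] <- "subjid"

check_input <- check_output(raw_pd, spec_pd)     # subjid fails before renaming
check_final <- check_output(mapped_pd, spec_pd)  # all columns pass afterwards
```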
Steps to Reproduce
lData=list(
Raw_PD=clindata::ctms_protdev
)
wflow_map<-gsm.core::MakeWorkflowList(
strNames="PD",
strPath="workflow/1_mappings",
strPackage="gsm.mapping"
)
# spec implies that subjectenrollmentnumber should be renamed to subjid
wflow_map$PD$spec
#> $Raw_PD
#> $Raw_PD$subjid
#> $Raw_PD$subjid$type
#> [1] "character"
#>
#> $Raw_PD$subjid$source_col
#> [1] "subjectenrollmentnumber"
#>
#>
#> $Raw_PD$deemedimportant
#> $Raw_PD$deemedimportant$type
#> [1] "character"

# causes warning, specs as defined in yaml are not applied
gsm.core::RunWorkflows(lWorkflow=wflow_map, lData=lData )
#>
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#>
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#>
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#>
#> ── Workflow Step 1 of 1: `=` ──
#>
#> ── Evaluating 2 parameter(s) for `=`
#> ℹ lhs = Mapped_PD: No matching data found. Passing 'Mapped_PD' as a string.
#> ✔ rhs = Raw_PD: Passing lData$Raw_PD.
#>
#> ── Calling `=`
#>
#> ── 4646x4 data.frame saved as `lData$Mapped_PD`.
#>
#> ── Returning results from final step: 4646x4 data.frame`. ──
#>
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> $Mapped_PD
#> # A tibble: 4,646 × 4
#>    subjectenrollmentnumber deviationdate companycategory         deemedimportant
#>    <chr>                   <date>        <chr>                   <chr>
#>  1 0496                    NA            OTHER                   No
#>  2 1350                    NA            OTHER                   No
#>  3 1350                    NA            OTHER                   No
#>  4 1350                    NA            OTHER                   No
#>  5 1350                    NA            OTHER                   No
#>  6 1350                    NA            OTHER                   No
#>  7 0539                    NA            OTHER TREATMENT COMPLI… No
#>  8 0539                    NA            OTHER TREATMENT COMPLI… No
#>  9 0539                    NA            OTHER TREATMENT COMPLI… No
#> 10 0539                    NA            OTHER TREATMENT COMPLI… No
#> # ℹ 4,636 more rows

# this works
gsm.mapping::Ingest(lData, wflow_map[[1]]$spec) |>
str()
#> ℹ Ingesting data for PD.
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> List of 1
#>  $ Raw_PD:'data.frame': 4646 obs. of 2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# replace `=` with Ingest()
wflow_map$PD$steps<-list(list(
name="gsm.mapping::Ingest",
output="Mapped_PD",
params=list(
lSourceData="Raw_PD",
# RunStep() detects any parameter named lSpec and passes spec for step
lSpec="lSpec"
)
))
# it does not work because lSpec has one unneeded nesting layer; Ingest would need lSpec[1]
gsm.core::RunWorkflows(lWorkflow=wflow_map, lData=lData )
#>
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#>
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#>
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#>
#> ── Workflow Step 1 of 1: `gsm.mapping::Ingest` ──
#>
#> ── Evaluating 2 parameter(s) for `gsm.mapping::Ingest`
#> ✔ lSourceData = Raw_PD: Passing lData$Raw_PD.
#> ✔ lSpec = lSpec: Passing full lSpec object.
#>
#> ── Calling `gsm.mapping::Ingest`
#> ℹ Ingesting data for PD.
#> Error in `map2()`:
#> ℹ In index: 1.
#> ℹ With name: PD.
#> Caused by error in `layout()`:
#> ! Domain '*_PD' not found in source data.

# replace `=` with RunQuery()
wflow_map$PD$steps<-list(list(
name="gsm.core::RunQuery",
output="Mapped_PD",
params=list(
df="Raw_PD",
strQuery="SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df"
)
))
# this works but makes source_col for spec redundant and warning persists
gsm.core::RunWorkflows(lWorkflow=wflow_map, lData=lData ) |>
str()
#>
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#>
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#>
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#>
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#>
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#>
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#>
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#>
#> ── Returning results from final step: 4646x2 data.frame`. ──
#>
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame': 4646 obs. of 2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# CheckSpec does not check for presence of source_col
# change source_col to something else
wflow_map$PD$spec$Raw_PD$subjid$source_col<-"foo"

# workflow says it is checking spec and all columns are present
gsm.core::RunWorkflows(lWorkflow=wflow_map, lData=lData ) |>
str()
#>
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#>
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#>
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#>
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#>
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#>
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#>
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#>
#> ── Returning results from final step: 4646x2 data.frame`. ──
#>
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame': 4646 obs. of 2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...
sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.1
#>
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#>
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#>
#> time zone: Europe/Zurich
#> tzcode source: internal
#>
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base
#>
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_2.0.0     gtable_0.3.6       dplyr_1.1.4        compiler_4.4.1
#>  [5] tidyselect_1.2.1   reprex_2.1.1       stringr_1.5.1      xml2_1.3.8
#>  [9] clindata_1.0.5     tidyr_1.3.1        scales_1.4.0       yaml_2.3.10
#> [13] fastmap_1.2.0      ggplot2_3.5.2      R6_2.6.1           generics_0.1.4
#> [17] knitr_1.50         htmlwidgets_1.6.4  backports_1.5.0    tibble_3.2.1
#> [21] gsm.mapping_1.0.1  DBI_1.2.3          pillar_1.10.2      RColorBrewer_1.1-3
#> [25] rlang_1.1.6        utf8_1.2.5         stringi_1.8.7      broom_1.0.8
#> [29] gsm.core_1.1.0     xfun_0.52          fs_1.6.6           cli_3.6.5
#> [33] withr_3.0.2        magrittr_2.0.3     digest_0.6.37      grid_4.4.1
#> [37] rstudioapi_0.16.0  dbplyr_2.5.0       lifecycle_1.0.4    vctrs_0.6.5
#> [41] evaluate_1.0.3     glue_1.8.0         duckdb_1.2.2       farver_2.1.2
#> [45] log4r_0.4.4        gt_1.0.0           rmarkdown_2.29     purrr_1.0.4
#> [49] tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.8.1
@samussiah, idk if I went down the rabbit hole here, but I was confused by this when I was trying to run IMPALA-Consortium/gsm.simaerep#19 b/c the PD mapping yaml did not work with my scripts and I could not figure out what the best way to fix it is.
@erblast Perhaps this could use additional documentation, especially with regard to the source_col functionality of mapping. There is a function, gsm.mapping::CombineSpecs(), that gathers the respective spec objects from the wflow_map/mapping object and addresses the "unneeded nesting layer" issue you highlighted. Its result is then fed into Ingest(), which does the renaming step. Ideally the process would look something like this:
lData = list(
Raw_PD = clindata::ctms_protdev,
Raw_DATAENT = clindata::edc_data_pages,
Raw_SUBJ = clindata::rawplus_dm
)
# The Customizable Workflow of Raw/Source Data to Mapped Domains
wflow_map <- gsm.core::MakeWorkflowList(
strNames = c("SUBJ", "DATAENT", "PD"),
strPath = "workflow/1_mappings",
strPackage = "gsm.mapping"
)
library(magrittr) # provides %>% and the `.` placeholder
lData %>% # raw/source data
  gsm.mapping::Ingest(., gsm.mapping::CombineSpecs(wflow_map)) %>% # Ingest does the renaming here
  gsm.core::RunWorkflows(wflow_map, .) # then proceed to run the mapping
Created on 2025-05-15 with reprex v2.1.1