Bugfix: spec source_col parameter not working as expected #67

erblast · 2025-05-15T08:14:23Z

Expected Behavior

When defining source_col for a column in the spec section of the YAML, I would expect:

a function that uses the spec to rename the column, e.g. gsm.mapping::Ingest, or automated renaming
CheckSpec to check for the presence of source_col and issue a warning specific to source_col
CheckSpec to run its data type check on the renamed column
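
To make the first expectation concrete, here is a minimal base R sketch of spec-driven renaming. `rename_from_spec()` is a hypothetical helper written for illustration; it is not a function in gsm.core or gsm.mapping, and the toy data frame only mimics two columns of clindata::ctms_protdev.

```r
# Hypothetical helper (not part of gsm.core or gsm.mapping): rename the columns
# of one data frame according to the `source_col` entries of its spec.
rename_from_spec <- function(df, lDomainSpec) {
  for (strTarget in names(lDomainSpec)) {
    strSource <- lDomainSpec[[strTarget]]$source_col
    # Only rename when source_col is defined and present in the data.
    if (!is.null(strSource) && strSource %in% names(df)) {
      names(df)[names(df) == strSource] <- strTarget
    }
  }
  df
}

# Toy stand-in for the raw protocol deviation data.
dfRaw <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No"),
  stringsAsFactors = FALSE
)

# Same shape as the spec entry for Raw_PD printed in the reprex.
lDomainSpec <- list(
  subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
  deemedimportant = list(type = "character")
)

dfRenamed <- rename_from_spec(dfRaw, lDomainSpec)
names(dfRenamed)
```

Columns without a source_col are left untouched, so the helper can be applied to every column of a spec without special cases.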

Current Behavior

Ingest does not seem to work when it is defined as a step in the steps section of the YAML
CheckSpec seems to ignore source_col and does not check whether the source columns are present
The CheckSpec data type check works, but only if the column is already present and does not need to be renamed
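
The kind of source_col-aware presence check I have in mind could look roughly like the sketch below. `find_missing_source_cols()` is a hypothetical name and this is not how CheckSpec is implemented; it only illustrates looking for source_col (falling back to the target name) instead of the target name alone.

```r
# Hypothetical check (not the CheckSpec implementation): for each column in the
# spec, look for its source_col in the data, or for the column name itself when
# no source_col is given. Returns the spec columns that cannot be found.
find_missing_source_cols <- function(df, lDomainSpec) {
  lgMissing <- vapply(names(lDomainSpec), function(strTarget) {
    strSource <- lDomainSpec[[strTarget]]$source_col
    strExpected <- if (is.null(strSource)) strTarget else strSource
    !(strExpected %in% names(df))
  }, logical(1))
  names(lDomainSpec)[lgMissing]
}

# Toy stand-in for the raw protocol deviation data.
dfRaw <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No"),
  stringsAsFactors = FALSE
)

lSpecGood <- list(
  subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
  deemedimportant = list(type = "character")
)

# Same spec, but source_col points at a column that does not exist.
lSpecBad <- lSpecGood
lSpecBad$subjid$source_col <- "foo"

chrMissingGood <- find_missing_source_cols(dfRaw, lSpecGood)
chrMissingBad <- find_missing_source_cols(dfRaw, lSpecBad)

if (length(chrMissingBad) > 0) {
  message("source_col not found for: ", paste(chrMissingBad, collapse = ", "))
}
```

With a check like this, a misspelled source_col would be reported against the spec entry that declares it, rather than as a generic missing column.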

Possible Solution

I am not sure what the intent is here. Perhaps the source_col argument could be deprecated and column renaming done in steps via RunQuery; CheckSpec should then run after the steps, on the final output rather than on the input.
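
As a rough illustration of that ordering (transform first, check afterwards), assuming a hypothetical `check_output_against_spec()` that is not part of gsm.core:

```r
# Hypothetical post-step check: report spec columns that are missing from the
# data and spec columns whose class does not match the declared type.
check_output_against_spec <- function(df, lDomainSpec) {
  chrCols <- names(lDomainSpec)
  chrMissing <- setdiff(chrCols, names(df))
  chrPresent <- intersect(chrCols, names(df))
  lgTypeOk <- vapply(chrPresent, function(strCol) {
    lDomainSpec[[strCol]]$type %in% class(df[[strCol]])
  }, logical(1))
  list(missing = chrMissing, wrong_type = chrPresent[!lgTypeOk])
}

# Spec without source_col: it only describes the expected final output.
lOutputSpec <- list(
  subjid = list(type = "character"),
  deemedimportant = list(type = "character")
)

# Toy stand-in for the raw input.
dfRaw <- data.frame(
  subjectenrollmentnumber = c("0496", "1350"),
  deemedimportant = c("No", "No"),
  stringsAsFactors = FALSE
)

# The "step": rename in plain R here, standing in for a RunQuery step.
dfMapped <- dfRaw
names(dfMapped)[names(dfMapped) == "subjectenrollmentnumber"] <- "subjid"

lCheckInput <- check_output_against_spec(dfRaw, lOutputSpec)
lCheckOutput <- check_output_against_spec(dfMapped, lOutputSpec)
```

Checking the input flags subjid as missing, while checking the output of the step passes, which is the behavior the spec check would have if it ran on the final result.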

Steps to Reproduce

lData = list(
  Raw_PD = clindata::ctms_protdev
)

wflow_map <- gsm.core::MakeWorkflowList(
  strNames = "PD",
  strPath = "workflow/1_mappings",
  strPackage = "gsm.mapping"
)

# spec implies that subjectenrollmentnumber should be renamed to subjid
wflow_map$PD$spec
#> $Raw_PD
#> $Raw_PD$subjid
#> $Raw_PD$subjid$type
#> [1] "character"
#> 
#> $Raw_PD$subjid$source_col
#> [1] "subjectenrollmentnumber"
#> 
#> 
#> $Raw_PD$deemedimportant
#> $Raw_PD$deemedimportant$type
#> [1] "character"

# causes warning, specs as defined in yaml are not applied
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData )
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `=` ──
#> 
#> ── Evaluating 2 parameter(s) for `=`
#> ℹ lhs = Mapped_PD: No matching data found. Passing 'Mapped_PD' as a string.
#> ✔ rhs = Raw_PD: Passing lData$Raw_PD.
#> 
#> ── Calling `=`
#> 
#> ── 4646x4 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x4 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> $Mapped_PD
#> # A tibble: 4,646 × 4
#>    subjectenrollmentnumber deviationdate companycategory         deemedimportant
#>    <chr>                   <date>        <chr>                   <chr>          
#>  1 0496                    NA            OTHER                   No             
#>  2 1350                    NA            OTHER                   No             
#>  3 1350                    NA            OTHER                   No             
#>  4 1350                    NA            OTHER                   No             
#>  5 1350                    NA            OTHER                   No             
#>  6 1350                    NA            OTHER                   No             
#>  7 0539                    NA            OTHER TREATMENT COMPLI… No             
#>  8 0539                    NA            OTHER TREATMENT COMPLI… No             
#>  9 0539                    NA            OTHER TREATMENT COMPLI… No             
#> 10 0539                    NA            OTHER TREATMENT COMPLI… No             
#> # ℹ 4,636 more rows

# this works
gsm.mapping::Ingest(lData, wflow_map[[1]]$spec) |>
  str()
#> ℹ Ingesting data for PD.
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> List of 1
#>  $ Raw_PD:'data.frame':  4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# replace `=` with Ingest()
wflow_map$PD$steps <- list(list(
  name = "gsm.mapping::Ingest",
  output = "Mapped_PD",
  params = list(
    lSourceData = "Raw_PD",
    # RunStep() detects any parameter named lSpec and passes spec for step
    lSpec = "lSpec"
  )
))

# this does not work because lSpec has one unneeded nesting layer; Ingest would need lSpec[1]
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData )
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.mapping::Ingest` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.mapping::Ingest`
#> ✔ lSourceData = Raw_PD: Passing lData$Raw_PD.
#> ✔ lSpec = lSpec:  Passing full lSpec object.
#> 
#> ── Calling `gsm.mapping::Ingest`
#> ℹ Ingesting data for PD.
#> Error in `map2()`:
#> ℹ In index: 1.
#> ℹ With name: PD.
#> Caused by error in `layout()`:
#> ! Domain '*_PD' not found in source data.


# replace `=` with RunQuery()
wflow_map$PD$steps <- list(list(
  name = "gsm.core::RunQuery",
  output = "Mapped_PD",
  params = list(
    df = "Raw_PD",
    strQuery = "SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df"
  )
))

# this works but makes source_col for spec redundant and warning persists
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData) |>
  str()
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#> 
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> 
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x2 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame':   4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# CheckSpec does not check for presence of source_col

# change source_col to something else
wflow_map$PD$spec$Raw_PD$subjid$source_col <- "foo"

# workflow says it is checking the spec, but the output is unchanged: the nonexistent source_col `foo` is not flagged
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData ) |>
  str()
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#> 
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> 
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x2 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame':   4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...


sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_2.0.0     gtable_0.3.6       dplyr_1.1.4        compiler_4.4.1    
#>  [5] tidyselect_1.2.1   reprex_2.1.1       stringr_1.5.1      xml2_1.3.8        
#>  [9] clindata_1.0.5     tidyr_1.3.1        scales_1.4.0       yaml_2.3.10       
#> [13] fastmap_1.2.0      ggplot2_3.5.2      R6_2.6.1           generics_0.1.4    
#> [17] knitr_1.50         htmlwidgets_1.6.4  backports_1.5.0    tibble_3.2.1      
#> [21] gsm.mapping_1.0.1  DBI_1.2.3          pillar_1.10.2      RColorBrewer_1.1-3
#> [25] rlang_1.1.6        utf8_1.2.5         stringi_1.8.7      broom_1.0.8       
#> [29] gsm.core_1.1.0     xfun_0.52          fs_1.6.6           cli_3.6.5         
#> [33] withr_3.0.2        magrittr_2.0.3     digest_0.6.37      grid_4.4.1        
#> [37] rstudioapi_0.16.0  dbplyr_2.5.0       lifecycle_1.0.4    vctrs_0.6.5       
#> [41] evaluate_1.0.3     glue_1.8.0         duckdb_1.2.2       farver_2.1.2      
#> [45] log4r_0.4.4        gt_1.0.0           rmarkdown_2.29     purrr_1.0.4       
#> [49] tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.8.1

Created on 2025-05-15 with reprex v2.1.1

erblast · 2025-05-15T08:21:14Z

@samussiah, idk if I went down the rabbit hole here, but I was confused by this when I was trying to run IMPALA-Consortium/gsm.simaerep#19 b/c the PD mapping yaml did not work with my scripts and I could not figure out what the best way to fix it is.

zdz2101 · 2025-05-15T18:38:24Z

@erblast Perhaps this is something that could use additional documentation, especially with regard to the source_col functionality of mapping. There is a function, gsm.mapping::CombineSpecs(), that grabs the respective spec objects from the wflow_map/mapping object and addresses the "unneeded nesting layer" issue you highlighted. Its result is then fed into Ingest(), which does the renaming step. Ideally the process would look something like this:

lData = list(
  Raw_PD = clindata::ctms_protdev,
  Raw_DATAENT = clindata::edc_data_pages,
  Raw_SUBJ = clindata::rawplus_dm
)

# The Customizable Workflow of Raw/Source Data to Mapped Domains
wflow_map <- gsm.core::MakeWorkflowList(
  strNames = c("SUBJ", "DATAENT", "PD"),
  strPath = "workflow/1_mappings",
  strPackage = "gsm.mapping"
)

library(magrittr) # provides the %>% pipe used below

lData %>% # raw/source data
  gsm.mapping::Ingest(., gsm.mapping::CombineSpecs(wflow_map)) %>% # Ingest does the renaming here
  gsm.core::RunWorkflows(wflow_map, .) # then proceed to run the mapping
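
To show what "grabbing the respective spec objects" means in terms of list structure, here is a base R sketch. `combine_specs_sketch()` is a hypothetical stand-in and only a guess at the general idea; the real gsm.mapping::CombineSpecs() may merge entries differently. The SUBJ spec below is made up for the example.

```r
# Hypothetical stand-in for the idea behind CombineSpecs(): collect the `spec`
# element of every workflow into one list keyed by domain, merging column
# entries when several workflows refer to the same domain.
combine_specs_sketch <- function(lWorkflows) {
  lCombined <- list()
  for (lWorkflow in lWorkflows) {
    for (strDomain in names(lWorkflow$spec)) {
      lExisting <- lCombined[[strDomain]]
      if (is.null(lExisting)) lExisting <- list()
      lCombined[[strDomain]] <- utils::modifyList(
        lExisting,
        lWorkflow$spec[[strDomain]]
      )
    }
  }
  lCombined
}

# Toy workflow list: each workflow carries its own spec, one level deeper than
# the domain-keyed list that is wanted for ingestion.
lWorkflows <- list(
  PD = list(spec = list(Raw_PD = list(
    subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
    deemedimportant = list(type = "character")
  ))),
  SUBJ = list(spec = list(Raw_SUBJ = list(
    subjid = list(type = "character")
  )))
)

lSpec <- combine_specs_sketch(lWorkflows)
names(lSpec)
```

The combined list is keyed by domain (Raw_PD, Raw_SUBJ) rather than by workflow name, which is the same shape as the `wflow_map[[1]]$spec` object that Ingest() accepted in the reprex.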
