Bugfix: spec source_col parameter not working as expected · Issue #67 · Gilead-BioStats/gsm.mapping
erblast opened this issue May 15, 2025 · 2 comments
Comments

erblast commented May 15, 2025

Expected Behavior

When defining source_col for a column in the spec section of the YAML, I would expect:

  • a function that uses the spec to rename the column, e.g. gsm.mapping::Ingest, or automated renaming
  • CheckSpec to check that each source_col is present and to issue a warning specific to source_col
  • CheckSpec to run its data type check on the renamed column
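
As a minimal sketch of the first expectation, the base R snippet below renames columns from a spec that carries source_col. The helper rename_from_spec and the toy data are hypothetical and are not part of gsm.mapping; dropping columns that are not in the spec mirrors what the Ingest output in the reprex shows, but is otherwise an assumption.

```r
# Hypothetical helper, not part of gsm.mapping: rename columns according to a
# spec whose entries may carry a `source_col` field.
rename_from_spec <- function(df, spec) {
  for (target in names(spec)) {
    source <- spec[[target]]$source_col
    # Only rename when a source_col is given and it exists in the data
    if (!is.null(source) && source %in% names(df)) {
      names(df)[names(df) == source] <- target
    }
  }
  # Keep only the columns named in the spec, in spec order
  df[, intersect(names(spec), names(df)), drop = FALSE]
}

# Toy data standing in for clindata::ctms_protdev (values are made up)
raw_pd <- data.frame(
  subjectenrollmentnumber = c("0001", "0002"),
  deemedimportant = c("No", "Yes"),
  stringsAsFactors = FALSE
)

# Same shape as wflow_map$PD$spec$Raw_PD in the reprex
spec_pd <- list(
  subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
  deemedimportant = list(type = "character")
)

mapped_pd <- rename_from_spec(raw_pd, spec_pd)
names(mapped_pd)
#> [1] "subjid"          "deemedimportant"
```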

Current Behavior

  • Ingest does not seem to work when it is defined as a step in the steps section of the YAML
  • CheckSpec seems to ignore source_col and does not check whether the source columns are present
  • The CheckSpec data type check works, but only if the column is already present and does not need to be renamed
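
The second point could be addressed by a check along the following lines. This is only a sketch in base R: find_missing_source_cols is a hypothetical name, the toy data is made up, and none of it reflects how CheckSpec is actually implemented.

```r
# Hypothetical check, not the actual CheckSpec implementation: report
# source_col entries that are missing from the data, separately from
# target columns that are missing.
find_missing_source_cols <- function(df, spec) {
  missing <- character(0)
  for (target in names(spec)) {
    source <- spec[[target]]$source_col
    if (!is.null(source) && !(source %in% names(df))) {
      missing <- c(missing, paste0(target, " (source_col: ", source, ")"))
    }
  }
  if (length(missing) > 0) {
    warning(
      "source_col not found in data for: ", paste(missing, collapse = ", "),
      call. = FALSE
    )
  }
  invisible(missing)
}

# Toy data and a spec whose source_col points at a column that does not exist
raw_pd <- data.frame(subjectenrollmentnumber = "0001", deemedimportant = "No")
spec_pd <- list(
  subjid = list(type = "character", source_col = "foo"),
  deemedimportant = list(type = "character")
)

# Emits a warning that names both the target column and its source_col
missing_cols <- find_missing_source_cols(raw_pd, spec_pd)
```

A warning that names the source_col would make a typo in the YAML visible, instead of only reporting that the target column subjid is absent.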

Possible Solution

I am not sure what the intent is here. Maybe the source_col argument could be deprecated and column renaming done in steps via RunQuery; CheckSpec should then run after the steps, on the final output rather than on the input.

Steps to Reproduce

lData = list(
  Raw_PD = clindata::ctms_protdev
)

wflow_map <- gsm.core::MakeWorkflowList(
  strNames = "PD",
  strPath = "workflow/1_mappings",
  strPackage = "gsm.mapping"
)

# spec implies that subjectenrollmentnumber should be renamed to subjid
wflow_map$PD$spec
#> $Raw_PD
#> $Raw_PD$subjid
#> $Raw_PD$subjid$type
#> [1] "character"
#> 
#> $Raw_PD$subjid$source_col
#> [1] "subjectenrollmentnumber"
#> 
#> 
#> $Raw_PD$deemedimportant
#> $Raw_PD$deemedimportant$type
#> [1] "character"

# causes warning, specs as defined in yaml are not applied
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData )
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `=` ──
#> 
#> ── Evaluating 2 parameter(s) for `=`
#> ℹ lhs = Mapped_PD: No matching data found. Passing 'Mapped_PD' as a string.
#> ✔ rhs = Raw_PD: Passing lData$Raw_PD.
#> 
#> ── Calling `=`
#> 
#> ── 4646x4 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x4 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> $Mapped_PD
#> # A tibble: 4,646 × 4
#>    subjectenrollmentnumber deviationdate companycategory         deemedimportant
#>    <chr>                   <date>        <chr>                   <chr>          
#>  1 0496                    NA            OTHER                   No             
#>  2 1350                    NA            OTHER                   No             
#>  3 1350                    NA            OTHER                   No             
#>  4 1350                    NA            OTHER                   No             
#>  5 1350                    NA            OTHER                   No             
#>  6 1350                    NA            OTHER                   No             
#>  7 0539                    NA            OTHER TREATMENT COMPLI… No             
#>  8 0539                    NA            OTHER TREATMENT COMPLI… No             
#>  9 0539                    NA            OTHER TREATMENT COMPLI… No             
#> 10 0539                    NA            OTHER TREATMENT COMPLI… No             
#> # ℹ 4,636 more rows

# this works
gsm.mapping::Ingest(lData, wflow_map[[1]]$spec) |>
  str()
#> ℹ Ingesting data for PD.
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> List of 1
#>  $ Raw_PD:'data.frame':  4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# replace `=` with Ingest()
wflow_map$PD$steps <- list(list(
  name = "gsm.mapping::Ingest",
  output = "Mapped_PD",
  params = list(
    lSourceData = "Raw_PD",
    # RunStep() detects any parameter named lSpec and passes spec for step
    lSpec = "lSpec"
  )
))

# this does not work because lSpec has one unneeded nesting layer; Ingest would need lSpec[1]
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData )
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.mapping::Ingest` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.mapping::Ingest`
#> ✔ lSourceData = Raw_PD: Passing lData$Raw_PD.
#> ✔ lSpec = lSpec:  Passing full lSpec object.
#> 
#> ── Calling `gsm.mapping::Ingest`
#> ℹ Ingesting data for PD.
#> Error in `map2()`:
#> ℹ In index: 1.
#> ℹ With name: PD.
#> Caused by error in `layout()`:
#> ! Domain '*_PD' not found in source data.


# replace `=` with RunQuery()
wflow_map$PD$steps <- list(list(
  name = "gsm.core::RunQuery",
  output = "Mapped_PD",
  params = list(
    df = "Raw_PD",
    strQuery = "SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df"
  )
))

# this works, but it makes source_col in the spec redundant and the warning persists
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData ) |>
  str()
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#> 
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> 
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x2 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame':   4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...

# CheckSpec does not check for presence of source_col

# change source_col to something else
wflow_map$PD$spec$Raw_PD$subjid$source_col <- "foo"

# the workflow reports the same spec check as before; the missing source_col "foo" is not flagged
gsm.core::RunWorkflows(lWorkflow = wflow_map, lData = lData ) |>
  str()
#> 
#> ── Running 1 Workflows ─────────────────────────────────────────────────────────
#> 
#> ── Initializing `Mapped_PD` Workflow ───────────────────────────────────────────
#> 
#> ── Checking data against spec
#> → All 1 data.frame(s) in the spec are present in the data: Raw_PD
#> → All specified columns in Raw_PD are in the expected format
#> Warning: Not all specified columns in the spec are present in the data, missing columns
#> are: Raw_PD$subjid
#> 
#> ── Workflow Step 1 of 1: `gsm.core::RunQuery` ──
#> 
#> ── Evaluating 2 parameter(s) for `gsm.core::RunQuery`
#> ✔ df = Raw_PD: Passing lData$Raw_PD.
#> ℹ strQuery = SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df: No matching data found. Passing 'SELECT subjectenrollmentnumber AS subjid, deemedimportant FROM df' as a string.
#> 
#> ── Calling `gsm.core::RunQuery`
#> Creating a new temporary DuckDB connection.
#> ✔ SQL Query complete: 4646 rows returned.
#> Disconnected from temporary DuckDB connection.
#> 
#> ── 4646x2 data.frame saved as `lData$Mapped_PD`.
#> 
#> ── Returning results from final step: 4646x2 data.frame`. ──
#> 
#> ── Completed `Mapped_PD` Workflow ──────────────────────────────────────────────
#> List of 1
#>  $ Mapped_PD:'data.frame':   4646 obs. of  2 variables:
#>   ..$ subjid         : chr [1:4646] "0496" "1350" "1350" "1350" ...
#>   ..$ deemedimportant: chr [1:4646] "No" "No" "No" "No" ...


sessionInfo()
#> R version 4.4.1 (2024-06-14)
#> Platform: aarch64-apple-darwin20
#> Running under: macOS Sonoma 14.7.1
#> 
#> Matrix products: default
#> BLAS:   /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRblas.0.dylib 
#> LAPACK: /Library/Frameworks/R.framework/Versions/4.4-arm64/Resources/lib/libRlapack.dylib;  LAPACK version 3.12.0
#> 
#> locale:
#> [1] en_US.UTF-8/en_US.UTF-8/en_US.UTF-8/C/en_US.UTF-8/en_US.UTF-8
#> 
#> time zone: Europe/Zurich
#> tzcode source: internal
#> 
#> attached base packages:
#> [1] stats     graphics  grDevices utils     datasets  methods   base     
#> 
#> loaded via a namespace (and not attached):
#>  [1] jsonlite_2.0.0     gtable_0.3.6       dplyr_1.1.4        compiler_4.4.1    
#>  [5] tidyselect_1.2.1   reprex_2.1.1       stringr_1.5.1      xml2_1.3.8        
#>  [9] clindata_1.0.5     tidyr_1.3.1        scales_1.4.0       yaml_2.3.10       
#> [13] fastmap_1.2.0      ggplot2_3.5.2      R6_2.6.1           generics_0.1.4    
#> [17] knitr_1.50         htmlwidgets_1.6.4  backports_1.5.0    tibble_3.2.1      
#> [21] gsm.mapping_1.0.1  DBI_1.2.3          pillar_1.10.2      RColorBrewer_1.1-3
#> [25] rlang_1.1.6        utf8_1.2.5         stringi_1.8.7      broom_1.0.8       
#> [29] gsm.core_1.1.0     xfun_0.52          fs_1.6.6           cli_3.6.5         
#> [33] withr_3.0.2        magrittr_2.0.3     digest_0.6.37      grid_4.4.1        
#> [37] rstudioapi_0.16.0  dbplyr_2.5.0       lifecycle_1.0.4    vctrs_0.6.5       
#> [41] evaluate_1.0.3     glue_1.8.0         duckdb_1.2.2       farver_2.1.2      
#> [45] log4r_0.4.4        gt_1.0.0           rmarkdown_2.29     purrr_1.0.4       
#> [49] tools_4.4.1        pkgconfig_2.0.3    htmltools_0.5.8.1

Created on 2025-05-15 with reprex v2.1.1

erblast commented May 15, 2025

@samussiah, idk if I went down the rabbit hole here, but I was confused by this when I was trying to run IMPALA-Consortium/gsm.simaerep#19 b/c the PD mapping yaml did not work with my scripts and I could not figure out what the best way to fix it is.

zdz2101 commented May 15, 2025

> @samussiah, idk if I went down the rabbit hole here, but I was confused by this when I was trying to run IMPALA-Consortium/gsm.simaerep#19 b/c the PD mapping yaml did not work with my scripts and I could not figure out what the best way to fix it is.

@erblast Perhaps this could use additional documentation, especially regarding the source_col functionality of mapping. There is a function, gsm.mapping::CombineSpecs(), that collects the spec objects from the wflow_map/mapping object, which addresses the "unneeded nesting layer" issue you highlighted. Its output is then fed into Ingest(), which does the renaming step. Ideally the process would look something like this:

lData = list(
  Raw_PD = clindata::ctms_protdev,
  Raw_DATAENT = clindata::edc_data_pages,
  Raw_SUBJ = clindata::rawplus_dm
)

# The Customizable Workflow of Raw/Source Data to Mapped Domains
wflow_map <- gsm.core::MakeWorkflowList(
  strNames = c("SUBJ", "DATAENT", "PD"),
  strPath = "workflow/1_mappings",
  strPackage = "gsm.mapping"
)

library(magrittr) # provides the %>% pipe used below

lData %>% # raw/source data
  gsm.mapping::Ingest(., gsm.mapping::CombineSpecs(wflow_map)) %>% # Ingest does the renaming here
  gsm.core::RunWorkflows(wflow_map, .) # then proceed to run the mapping
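
To illustrate the idea of collecting specs across workflows, here is a rough base R sketch. It is not the CombineSpecs() implementation: combine_specs_sketch is a hypothetical name, the Raw_SUBJ spec is invented for the example, and the merge rule (later workflows overwrite duplicated column entries) is an assumption.

```r
# Hypothetical sketch, not the CombineSpecs() implementation: collect the
# `spec` element of every workflow and merge entries domain by domain.
combine_specs_sketch <- function(workflows) {
  combined <- list()
  for (workflow in workflows) {
    for (domain in names(workflow$spec)) {
      existing <- combined[[domain]]
      if (is.null(existing)) existing <- list()
      # modifyList() adds new columns and overwrites duplicated ones
      combined[[domain]] <- utils::modifyList(existing, workflow$spec[[domain]])
    }
  }
  combined
}

# Toy workflows mimicking the nesting shown by wflow_map$PD$spec;
# the SUBJ entry is made up for illustration
toy_workflows <- list(
  PD = list(spec = list(
    Raw_PD = list(
      subjid = list(type = "character", source_col = "subjectenrollmentnumber"),
      deemedimportant = list(type = "character")
    )
  )),
  SUBJ = list(spec = list(
    Raw_SUBJ = list(subjid = list(type = "character"))
  ))
)

combined <- combine_specs_sketch(toy_workflows)
names(combined)
#> [1] "Raw_PD"   "Raw_SUBJ"
```

The result is a single list keyed by raw domain, the same shape as the spec passed to Ingest() in the reprex above.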
