Important: This update includes a major change that may affect the reproducibility of some older pipelines, especially if split_by() was used on columns of type double. When re-running old code with LexOPS, take care to use a version prior to this release.
Update to split_by():
Simplified numeric splits in split_by(). This includes removing the use of the cut() method, and using the same method for double and integer types. The old method may have produced some unexpected behaviour when splitting by columns stored as double if the levels overlapped. See issue #6 for more details.
Another change with this new method is that, while splits can still be specified out of order (e.g., 4:5 ~ 1:3), the specified order is now preserved, whereas before an attempt was made to sort them. This means that A1 will now be 4:5, and A2 will be 1:3, whereas previous versions would have forced A1 to be the lower level of 1:3, and A2 to be the higher level of 4:5.
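To make the ordering change concrete, the base R sketch below labels values in the order the ranges are written. It is only an illustration of the behaviour described above, not the package's implementation; assign_levels() and its arguments are hypothetical names.

```r
# Hypothetical helper (not part of LexOPS): label each value with the
# level whose range contains it, keeping the order in which the ranges
# were written rather than sorting them.
assign_levels <- function(x, ranges, prefix = "A") {
  out <- rep(NA_character_, length(x))
  for (i in seq_along(ranges)) {
    lo <- min(ranges[[i]])
    hi <- max(ranges[[i]])
    out[x >= lo & x <= hi] <- paste0(prefix, i)
  }
  out
}

# The split 4:5 ~ 1:3, written as a list of ranges in the specified order
ranges <- list(4:5, 1:3)
word_length <- c(1, 3, 4, 5, 2.5)

levels_out <- assign_levels(word_length, ranges)
levels_out  # "A2" "A2" "A1" "A1" "A2"
```

Because the ranges are not sorted, 4:5 is labelled A1 and 1:3 is labelled A2. The double value 2.5 is handled by the same comparison as the integer values.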
Other Major Updates:
- Related to the change above, numeric splits can no longer be overlapping at all (e.g., 1:2 ~ 2:3 used to be acceptable, but will now produce an error, as it is unclear to which group a value of 2 would belong).
- Added equal_size argument to split_random(). Setting equal_size=TRUE will ensure that the split has equally (or as close to equal as possible) sized groups. This option will typically enable more candidate matches. This option was added in response to issue #4.
- The generate() function checks that the id_col uniquely identifies items, and gives an error if this is not the case. This avoids duplicate IDs causing incorrect matching. Addresses issue #5.
- run_shiny() now checks for stringdist package and will generate code to install if missing.
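Two of the ideas above can be sketched in base R: detecting overlapping numeric ranges, and randomly splitting items into groups of near-equal size. This is a rough illustration under assumed names, not how LexOPS implements either check; ranges_overlap() and random_equal_split() are hypothetical helpers.

```r
# Hypothetical helper (not part of LexOPS): TRUE if two numeric ranges
# share at least one value, so that group membership would be ambiguous.
ranges_overlap <- function(a, b) {
  min(a) <= max(b) && min(b) <= max(a)
}

ranges_overlap(1:2, 2:3)  # TRUE: a value of 2 would fall in both groups
ranges_overlap(1:3, 4:5)  # FALSE

# Hypothetical helper (not part of LexOPS): randomly assign n items to
# nlevels groups whose sizes differ by at most one.
random_equal_split <- function(n, nlevels, seed = NULL) {
  if (!is.null(seed)) set.seed(seed)
  # Recycle the labels 1..nlevels to length n, then shuffle their order
  labels <- rep_len(seq_len(nlevels), n)
  labels[sample.int(n)]
}

groups <- random_equal_split(n = 10, nlevels = 3, seed = 42)
table(groups)  # one group of 4 and two groups of 3
```

Recycling the labels before shuffling guarantees the group sizes up front, whereas sampling each label independently would leave the sizes to chance.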
Minor Updates:
- Updated to base R pipe, |>, in examples.
- Unnecessary dependencies (vwr, plyr) have been removed from the shiny app.
- All S3 methods now exported (previously only print.LexOPS_pipeline was exported).
Updates to Tests:
- Removed deprecated testthat argument.
- Now tests the equal_size argument of split_random().
- Now tests that duplicates in id_col give an error.
- Ensured that variables that undergo scale() in tests are stored explicitly as numeric vectors. This addressed a deprecation warning from dplyr::filter() about 1-column matrices that was produced from one test.
- Removed overlapping levels from all tests.
- Added tests for split_by() errors.
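The scale() point above comes from base R behaviour: scale() returns a one-column matrix even when given a plain vector. The snippet below is a small sketch of that behaviour and of one way to store the result as a numeric vector; the variable names are made up for illustration, and the actual test code may differ.

```r
x <- c(2, 4, 6, 8)

# scale() returns a 4 x 1 matrix carrying centering and scaling attributes
scaled_matrix <- scale(x)
is.matrix(scaled_matrix)  # TRUE

# as.numeric() drops the dimensions and attributes, leaving a plain vector
scaled_vector <- as.numeric(scale(x))
is.matrix(scaled_vector)  # FALSE
```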