Description
I'm evaluating MGARD-X for compressing seismic data on-GPU in an internal application. While learning the API and tuning parameters, I've run into a somewhat frustrating issue: many errors internal to the compression process are handled by calling exit(-1), causing the application (which is still doing productive things on many other threads and other GPUs) to vapourise.
A quick search showed a couple hundred locations where this can happen. It feels like bad practice for a library to do this, and it also makes debugging my own issues (occasionally malformed data volumes and overly optimistic error bounds) trickier.
For my internal testing I've been replacing many of these exit() calls with C++ exceptions and handling them at the level of GPUPipelines.hpp::compress_pipeline_gpu to convert them into new compress_status_type flags I've added.
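
As a rough illustration of the approach described above, here is a minimal sketch of throwing at the failure site and converting to a status flag at the pipeline boundary. Every name in it (compress_status_sketch, compression_error, internal_step, compress_pipeline_sketch) is a hypothetical stand-in, not MGARD-X's actual API or its real compress_status_type values.

```cpp
#include <stdexcept>
#include <string>

// Hypothetical stand-in for the library's status enum; these values are
// illustrative only and are not MGARD-X's real compress_status_type flags.
enum class compress_status_sketch { Success, InvalidInput, UnknownFailure };

// Hypothetical exception type, thrown where the code would otherwise exit(-1).
class compression_error : public std::runtime_error {
public:
  compression_error(compress_status_sketch status, const std::string &what)
      : std::runtime_error(what), status_(status) {}
  compress_status_sketch status() const noexcept { return status_; }

private:
  compress_status_sketch status_;
};

// Stand-in for an internal routine that previously terminated the process.
void internal_step(double error_bound) {
  if (error_bound <= 0.0) {
    throw compression_error(compress_status_sketch::InvalidInput,
                            "error bound must be positive");
  }
  // ... the real compression work would go here ...
}

// Stand-in for the pipeline entry point: exceptions stop here and are
// reported to the caller as status flags instead of killing the process.
compress_status_sketch compress_pipeline_sketch(double error_bound) noexcept {
  try {
    internal_step(error_bound);
    return compress_status_sketch::Success;
  } catch (const compression_error &e) {
    return e.status();
  } catch (...) {
    return compress_status_sketch::UnknownFailure;
  }
}
```

With this shape, a caller can check the returned flag and decide for itself whether to retry with different parameters, skip the volume, or shut down cleanly.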
While I certainly appreciate that certain failure modes can leave at least the GPU in use in an unknown state (e.g. certain invalid pointer dereferences inside kernels, for CUDA at least), I'd vastly prefer these errors to bubble up to the calling application so it can take action - even if that action is ultimately 'cleanly shut down the process and report the failure to the job scheduling system'.
From my perspective, adding C++ exceptions that cleanly report a catastrophic error to the calling application doesn't seem meaningfully worse than the current state of play, even if as a first pass they may incompletely clean up internal state (I've not done substantive testing on this). At worst, the default behaviour could be changed to an exit() at a single common location, logging the exception that was raised, with a parameter to disable it for client applications that wish to handle the error themselves.
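
The fallback mentioned above (one common exit point, with an opt-out) could look roughly like the following sketch. The error_policy_sketch struct, its exit_on_error field and run_guarded_sketch are all hypothetical names chosen for illustration; nothing here reflects an existing MGARD-X configuration option.

```cpp
#include <cstdlib>
#include <exception>
#include <iostream>
#include <stdexcept>
#include <string>

// Hypothetical configuration: the default keeps the current "exit on error"
// behaviour, and client applications can opt out.
struct error_policy_sketch {
  bool exit_on_error = true;
};

// Single common location where failures are logged and then either
// terminate the process or propagate to the caller.
template <typename Fn>
void run_guarded_sketch(const error_policy_sketch &policy, Fn &&fn) {
  try {
    fn();
  } catch (const std::exception &e) {
    std::cerr << "compression failed: " << e.what() << '\n';
    if (policy.exit_on_error) {
      std::exit(-1); // same outcome as today, but from one place only
    }
    throw; // opt-out: let the client application handle the error
  }
}
```

A client that wants to handle failures itself would set exit_on_error to false and wrap the call in its own try/catch.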
Would there be any interest if I put together a PR with these changes? Depending on how the testing plays out, and thus whether we use the library in production, I may or may not be able to contribute further in future, and I don't want to leave you folks with the maintenance burden of code you didn't ask for if you don't think it's necessary.
Thanks!