Related: DataLife, DaYu, FlowForecaster
About:
The combination of ever-growing scientific datasets and distributed workflow complexity creates I/O performance bottlenecks due to data volume, velocity, and variety. Although the increasing use of descriptive data formats (e.g., HDF5, netCDF) helps organize these datasets, it also creates obscure bottlenecks due to the need to translate high-level operations into file addresses and then into low-level I/O operations.
DaYu [1] is a framework for analyzing (a) semantic relationships between logical datasets and file addresses, (b) how dataset operations translate into I/O, and (c) the combination across entire workflows. DaYu's analysis and visualization enable identification of critical bottlenecks and support reasoning about remediation. With DaYu, one can extract workflow data patterns, develop insights into the behavior of data flows, and identify opportunities for both users and I/O libraries to optimize applications.
The DaYu framework comprises three primary components:
- Data Semantic Mapper, which maps semantic datasets to I/O statistics, capturing essential data-flow insights for analysis.
- Workflow Analyzer, which groups I/O statistics by high-level data semantics and visualizes the combination as semantic dataflow graphs, giving insight into holistic data dependences across I/O accesses.
- Data Flow Diagnostics, which generates visualizations of dataflow and I/O semantics to reveal potential I/O improvement opportunities; it has been demonstrated on three real-world scientific workflows from distinct domains, with optimizations suggested by DaYu's insights.
[1] DaYu (i.e., "Yu the Great") refers to a legendary Chinese king credited with taming floods through water-control projects.
Contacts: (firstname.lastname@pnnl.gov)
Contributors:
- Meng Tang (Illinois Institute of Technology)
- Lenny Guo
- Nathan R. Tallent (PNNL)
- Anthony Kougkas (Illinois Institute of Technology)
- Xian-He Sun (Illinois Institute of Technology)
DaYu's Tracker monitors HDF5 program I/O at both the Virtual Object Layer (VOL) level and the Virtual File Driver (VFD) level.
The VOL layer monitors object accesses during program execution; it is implemented as an HDF5 Passthrough VOL connector.
The VFD layer monitors POSIX I/O operations during program execution; it is implemented on top of HDF5's default sec2 (POSIX) driver.
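For reference, below is a minimal sketch of the kind of HDF5 program the tracker monitors. The h5py_write_read.py script used in the examples further down is not reproduced in this README, so this stand-in (run here as a shell heredoc) is an assumption:
python3 - <<'EOF'
# Stand-in for a tracked HDF5 program (assumed contents; the real
# h5py_write_read.py may differ). Object operations (file/dataset
# create, open, read) are seen at the VOL level; the resulting POSIX
# reads/writes are seen at the VFD level.
import h5py
import numpy as np

with h5py.File("test.h5", "w") as f:                 # VOL: file create
    f.create_dataset("data", data=np.arange(1000))   # VOL: dataset create -> VFD: POSIX writes

with h5py.File("test.h5", "r") as f:                 # VOL: file open
    print(f["data"][:5])                             # VOL: dataset read -> VFD: POSIX reads
EOF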
Example tracker statistics and their analysis are documented at:
https://github.com/candiceT233/dayu-tracker/blob/main/flow_analysis/example_stat/README.md
- HDF5 (1.14+; requires C, C++, and the HDF5 high-level libraries (HDF5_HL_LIBRARIES))
Install with Spack (version 0.20+ suggested):
spack install hdf5@1.14+cxx+hl~mpi
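After the Spack install completes, one way to make this HDF5 visible in the current shell (assuming a standard Spack setup) is:
spack load hdf5@1.14
which h5cc # should now resolve to the Spack-installed HDF5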
- h5py==3.8.0
YOUR_HDF5_PATH="`which h5cc | sed 's/.\{9\}$//'`" # strip the trailing "/bin/h5cc" to get the HDF5 install prefix
echo $YOUR_HDF5_PATH # make sure the path is correct
python3 -m pip uninstall h5py; HDF5_MPI="OFF" HDF5_DIR=$YOUR_HDF5_PATH python3 -m pip install --no-binary=h5py h5py==3.8.0
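To confirm that h5py was built against the intended HDF5, a quick check using h5py's version module:
python3 -c "import h5py; print(h5py.version.hdf5_version)" # expect 1.14.x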
git clone https://github.com/candiceT233/dayu-tracker.git
cd dayu-tracker
git submodule update --init --recursive
YOUR_INSTALLATION_PATH="`pwd`" # you can use your own path
mkdir build
cd build
ccmake -DCMAKE_INSTALL_PREFIX=$YOUR_INSTALLATION_PATH .. # configure interactively: press 'c' to configure, then 'g' to generate
make && make install # build and install the tracker plugins
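As a sanity check, the tracker plugin directories should now exist; the layout below is assumed from the HDF5_PLUGIN_PATH settings used later:
ls $YOUR_INSTALLATION_PATH/build/src/vfd $YOUR_INSTALLATION_PATH/build/src/vol # each should contain a tracker plugin library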
- Set up with a bash environment variable:
export CURR_TASK="my_program"
- Set up with task files in the /tmp directory (a combined helper sketch follows this list):
export WORKFLOW_NAME="my_program"
export PATH_FOR_TASK_FILES="/tmp/$USER/$WORKFLOW_NAME"
mkdir -p $PATH_FOR_TASK_FILES
> $PATH_FOR_TASK_FILES/${WORKFLOW_NAME}_vfd.curr_task # clear the file
> $PATH_FOR_TASK_FILES/${WORKFLOW_NAME}_vol.curr_task # clear the file
echo -n "$TASK_NAME" > $PATH_FOR_TASK_FILES/${WORKFLOW_NAME}_vfd.curr_task
echo -n "$TASK_NAME" > $PATH_FOR_TASK_FILES/${WORKFLOW_NAME}_vol.curr_task
Track with both the VOL and the VFD:
TRACKER_SRC_DIR="../build/src" # dayu-tracker build path
schema_file_path="`pwd`" # your path to store log files
export HDF5_VOL_CONNECTOR="tracker under_vol=0;under_info={};path=$schema_file_path;level=2;format=" # VOL connector info string
export HDF5_PLUGIN_PATH=$TRACKER_SRC_DIR/vfd:$TRACKER_SRC_DIR/vol
export HDF5_DRIVER=hdf5_tracker_vfd # VFD driver name
TRACKER_VFD_PAGE_SIZE=65536 # VFD tracking page size in bytes (example value, an assumption; set as appropriate)
export HDF5_DRIVER_CONFIG="${schema_file_path};${TRACKER_VFD_PAGE_SIZE}" # VFD info string
# Run your program
python h5py_write_read.py
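After the run, the tracker writes its logs under the configured path; exact file names may vary by version, so listing the directory is the simplest check:
ls -lt $schema_file_path # newly written tracker log files appear here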
Track with the VFD only:
TRACKER_SRC_DIR="../build/src" # dayu-tracker build path
schema_file_path="`pwd`" # your path to store log files
export HDF5_PLUGIN_PATH=$TRACKER_SRC_DIR/vfd
export HDF5_DRIVER=hdf5_tracker_vfd # VFD driver name
TRACKER_VFD_PAGE_SIZE=65536 # VFD tracking page size in bytes (example value, an assumption; set as appropriate)
export HDF5_DRIVER_CONFIG="${schema_file_path};${TRACKER_VFD_PAGE_SIZE}" # VFD info string
# Run your program
python h5py_write_read.py
Track with the VOL only:
TRACKER_SRC_DIR="../build/src" # dayu-tracker build path
schema_file_path="`pwd`" # your path to store log files
export HDF5_VOL_CONNECTOR="tracker under_vol=0;under_info={};path=$schema_file_path;level=2;format="
export HDF5_PLUGIN_PATH=$TRACKER_SRC_DIR/vol
python h5py_write_read.py
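When tracking is finished, unset the tracker variables so that subsequent HDF5 runs use the default (untracked) stack:
unset HDF5_VOL_CONNECTOR HDF5_PLUGIN_PATH HDF5_DRIVER HDF5_DRIVER_CONFIG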
- Jarvis-cd can be installed and initialized by following the steps in its documentation.
- Add dayu-tracker to Jarvis-cd:
jarvis repo add $YOUR_INSTALLATION_PATH/jarvis # adjust to the jarvis directory inside your dayu-tracker clone