KMDS Design Perspectives
KMDS is a tool meant to be used as part of data analysis. It is not a data analysis or modelling library like, say, scikit-learn. It is meant for data analysts and data scientists. A lot of work in data science is experimental and incremental, and a data scientist or data analyst rarely has an epiphany about the right modelling approach for a particular task. There is usually a path of principled experimentation and analysis to get to the right approach. Finding this path involves many tedious steps: figuring out the inherent structure in the data, assessing data quality, understanding the problem being modelled, and identifying a modelling approach that has a reasonable chance of getting us to the goal. Not capturing the key findings and their relationships usually leads to a lot of frustration and wasted effort in rediscovering facts. This is the problem KMDS addresses.
A data analyst or data scientist receives data, performs analysis and finally develops some models. Models have by-products, like embeddings and related metadata. This picture implies that data scientists need a way to receive data and to publish models and data by-products. Cloud-native approaches are the de facto model of development today. So the development context is that the analyst or data scientist spins up a cluster in the cloud, say a Dask cluster or a Ray cluster, or a JupyterLab instance. They need locations to receive data for analysis and locations to publish data by-products (embeddings, metadata, reports). The products of analysis are used by applications (models, reports), search tools (embeddings) and data catalogs (metadata).
S3 storage is a very common choice for sending and receiving data, and it is the operating model assumed here. Other options, such as an HDFS store, Databricks or WebHDFS, are possible. To work through illustrative examples of functionality, an infrastructure choice was needed, and I have used S3-based storage. See the MinIO example notebook for details of how this can be done. Webhooks can be added to buckets to let KMDS users know that data has been received for analysis. Analogously, webhooks can be used to let consumers of analysis know that results are available for review in a bucket.
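To make the webhook idea concrete, here is a minimal sketch of how a receiver might pick out newly arrived objects from a bucket notification. It assumes the webhook body follows the S3-style event notification JSON layout (a `Records` list with `eventName`, `s3.bucket.name` and `s3.object.key`). The function name `extract_new_objects`, the bucket name `kmds-incoming` and the object key are hypothetical and are not part of KMDS.

```python
import json
from typing import List, Tuple
from urllib.parse import unquote_plus


def extract_new_objects(payload: str) -> List[Tuple[str, str]]:
    """Return (bucket, key) pairs for object-created records in an S3-style event."""
    event = json.loads(payload)
    found = []
    for record in event.get("Records", []):
        # React only to uploads; skip deletes and other event types.
        # Matching on the substring tolerates an optional "s3:" prefix.
        if "ObjectCreated:" not in record.get("eventName", ""):
            continue
        bucket = record["s3"]["bucket"]["name"]
        # Object keys arrive URL-encoded in S3-style event notifications.
        key = unquote_plus(record["s3"]["object"]["key"])
        found.append((bucket, key))
    return found


# Hypothetical notification body, for illustration only.
sample = json.dumps({
    "Records": [
        {
            "eventName": "s3:ObjectCreated:Put",
            "s3": {
                "bucket": {"name": "kmds-incoming"},
                "object": {"key": "raw/sales+2024.csv"},
            },
        },
        {
            "eventName": "s3:ObjectRemoved:Delete",
            "s3": {
                "bucket": {"name": "kmds-incoming"},
                "object": {"key": "raw/old.csv"},
            },
        },
    ]
})

print(extract_new_objects(sample))  # [('kmds-incoming', 'raw/sales 2024.csv')]
```

The same parsing would serve the publishing side: a consumer of analysis results could listen on the output bucket and react to new reports or embeddings in the same way.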