From f36f47c09c22b819aa71df7caade1010c6844cc5 Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Sun, 4 Dec 2022 14:39:34 +0000 Subject: [PATCH 1/7] storage management design Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 245 ++++++++++++++++++++++++++++++++++++++ 1 file changed, 245 insertions(+) create mode 100644 docs/StorageManagement.md diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md new file mode 100644 index 000000000..02c4f13f9 --- /dev/null +++ b/docs/StorageManagement.md @@ -0,0 +1,245 @@ +# Storage Management by Fybrik + +## Background +In several use-cases Fybrik needs to allocate storage for data. One use case is implicit copy of a dataset in read scenarios made for performance, cost or governance sake. A second scenario is when a new dataset is created by the workload. In this case Fybrik allocates the storage and registers the new dataset in the data catalog. A third use case is explicit copy - i.e. the user indicates that a copy of an existing dataset should be made. As in the second use case, here too Fybrik allocates storage for the data and registers the new dataset in the data catalog. + +When we say that Fybrik allocates storage, we actually mean that Fybrik allocates a portion of an existing storage account for use by the given dataset. Fybrik must be informed what storage accounts are available, and how to access them. This information is currently provided via the FybrikStorageAccount CRD. + +Modules that write data to this storage, receive from Fybrik a connection that holds the relevant information about the storage (e.g. endpoint and write credentials). + +Currently, only S3 storage is supported. Both allocation and deletion of the storage (if temporary) is done using Datashim. + +## Gaps / Requirements + +- Support additional connection types (e.g. MySQL, googlesheets) + +- Business logic should not be hard-coded in Fybrik. 
+ +- A single connection taxonomy should be used by modules, catalog conector and storage manager. + +- Deployment of FybrikStorageAccount CRD should be configurable. + +- Fybrik manages storage life cycle of temporary data copies. + +- A clear error indication should be provided if the requested storage type is not supported, or an operation has failed. + +- IT admin should be able to express config policies based on storage dynamic attributes (e.g., cost) as well as storage properties such as type, geography and others. + +- Optimizer needs to ensure that the allocated storage type matches the connection that the module uses. + +- The selected storage should not necessarily match the source dataset connection in case of copying an existing asset. + +- [Future enhancement] The data user should be able to request a specific storage type, or to specify some of the connection properties (e.g., bucket name) inside FybrikApplication. + +- The data user should be able to leave a choice of a storage type to Fybrik and organization policies.   + +## Goals + +- Provide modules with a connection object for writing data + +- Share organization storage to store temporary or persistent data while hiding the details (credentials, server URLs, service endpoints, etc.) from the data user. This means that data users will store data in organization accounts created by IT administrators. + +- Govern the use of the shared storage for the given workload according to the compliance, capacity and other factors. + +- Optimize the shared storage for the given workload (by cost, latency to the workload) + +- Manage storage life cycle of the shared storage (e.g., delete temporary data after not being used, delete empty buckets, etc.) + +# High Level Design + +## Taxonomy and structures + +It is crucial that the modules and the storage allocation component have a base common connection structure. Otherwise, the components will not work together correctly when deployed. 
Thus, we propose the following: + +- **Connection** will be defined in **base** taxonomy within Fybrik repository. + +``` +struct Connection { + + name string // connection name + + category string // category, optional + + properties map[string]string // connection properties shared by multiple assets as well as asset specifics + +} +``` + +Additional properties will be preserved. (needs discussion) + +- Fybrik defines **layers** with schema definition for supported connections (in pkg/storage/layers). Quickstart deploys Fybrik using these layers. Users can add/replace layers when deploying Fybrik. + +- A module yaml can specify as a connection type either name or category. Optimizer will do the matching. + +### StorageAccount + +StorageAccount specifies the following: +``` +- type (connection name) + +- geography + +- reference to a secret + +- properties as key-value map +``` + +Dynamic information about performance, amount free, costs, etc., are detailed in the separate Infrastructure Attributes JSON file. + +## StorageManager + +StorageManager is responsible for allocating storage in known storage accounts (as declared in StorageAccount CRDs) and for freeing the allocated storage. + + + +### Architecture and interfaces + +StorageManager runs as a new container in the manager pod. Its docker image is specified in Fybrik `values.yaml` and can be replaced as long as the alternative implementation obeys the following APIs: + +#### AllocateStorage + +Storage is allocated after the appropriate storage account has been selected by the optimizer. + +`AllocateStorage` request includes properties of the selected storage account, asset name (and additional properties) defined in FybrikApplication, prefix for name generation based on application uuid, attributes defined by IT config policies, e.g., bucket_name. Upon a successful allocation, a connection object will be returned. 
+ +#### DeleteStorage + +The allocated storage is freed after FybrikApplication is deleted or a dataset is no longer required in the spec, and the storage is not persistent. + +`DeleteStorage` request receives the `Connection` object to be deleted, and configuration options defined by IT config policies, e.g., delete_empty_bucket. + +#### GetSupportedConnectionTypes + +Returns a list of supported connection types, to validate the storage accounts. + + +### Code base + +As the first step, storage management functionality will be defined in Fybrik repo under `pkg/storage`. +To consider moving to another repository. + +The folder will include: + +- taxonomy layers (yaml files) + +- StorageAccount types to generate the CRD, a separate `storageaccount-crd` helm chart. The main `fybrik` chart will declare the dependency of `storageaccount-crd` chart on `storageManager.enabled` value (see `Deployment configuration`). + +- Open-source implementation of StorageManager APIs + +### Architecture + +Architecture of StorageManager is based on [Design pattern](https://eli.thegreenplace.net/2019/design-patterns-in-gos-databasesql-package) + +It defines `main` that registers various connectors and `connector plugins` that implement the interface for `AllocateStorage`/`DeleteStorage`. Each connector plugin registers the connection type it supports in init(). + +A docker file is included to build the image of StorageManager. + + +## How to support a new connection type + +### StorageManager + +- Add a connector package with implementation of `AllocateStorage`/`DeleteStorage` and register it in the main process. + +- Create a new docker image of StorageManager + +### Fybrik core + +- Ensure existence of modules that are able to write/copy to this connection. Update the capabilities in module yamls accordingly. + +- Prepare StorageAccount resources with the shared storage information. + +- Update infrastructure attributes related to the storage accounts, e.g., cost. 
- Optionally update IT config policies to specify when the new storage can/should be selected

- Add a new layer describing the connection schema.

- Optionally extend catalog-connector to support the new connection.

- Deploy new/modified yamls and re-install the Fybrik release using the StorageManager image and the new taxonomy schema. No change to fybrik_crd is required.


## Deployment configuration

In values.yaml add another section `storageManager` with the following configurable properties:

- `enabled`

- `image`

- `imagePullPolicy`

StorageAccounts are deployed only if `storageManager.enabled` is set to `true`.


## Extending syntax of IT config policies

IT config policies support filtering of StorageAccounts by one or more of the following:

- type (e.g. allow S3 storage only)

- geography (copy data to the workload geography)

- infrastructure attributes (prefer cheaper cost)

- connection properties.


Currently, the policies only filter out options instead of providing instructions. Examples of instructions: a person with role Auditor must write data to a bucket named Audit; an empty bucket should be deleted if the dataset tags include "managed-by-fybrik". Supporting these requires a change to the policy syntax: adding `attributes` that provide additional attributes for modules or storage accounts.
```
config[{"capability": "write", "decision": decision}] {
    input.context.role == "Auditor"
    decision := {"attributes": {"storageaccounts": [{"bucket": "Audit"}]}}
}

config[{"capability": "delete", "decision": decision}] {
    input.request.dataset.tags["managed-by-fybrik"]
    decision := {"attributes": {"storageaccounts": [{"delete_empty_bucket": "true"}]}}
}
```

## Changes to Optimizer and storage type selection

Earlier, the only available storage type was S3. It was hard-coded inside manager as the default connection type used in the write flow of a new dataset.
Now, the following changes are required (both to optimizer and the manager naive algorithm): + +- add the constraint of storage type/category matching the module protocol + +- do not specify the desired connection type and determine it later from the selected storage type + + +## To consider in the future: + +- Changes to FybrikApplication (add requirements to the dataset entry) + +- Use information about the amount of storage available and amount of data to be written/copied to influence storage selection. + +## Development plan + +- Redefine `Connection` and provide layers for s3, db2, kafka, arrow-flight, what else? + +- Remove dependency on datashim + +- Changes to FybrikStorageAccount CR + +- Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well. + +- Add chart values and deploy CRs accordingly. The manager code should work even when storage CRs are not deployed. + +- Implement a standalone storage-allocator that will implement the defined API for s3 using minio sdk. + +- Support additional connection types - to be decided what types and what priorities.  + +- IT config policies for configuration options + +- Update documentation accordingly + +- Changes to Airbyte module, other modules + +- Changes to catalog connectors to align with taxonomy \ No newline at end of file From 1fa23c53d2da2cc382ff93cf684ada8cab0e273b Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Mon, 5 Dec 2022 19:37:45 +0000 Subject: [PATCH 2/7] changes to the document Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 173 ++++++++++++++++++-------------------- 1 file changed, 82 insertions(+), 91 deletions(-) diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md index 02c4f13f9..07c412e69 100644 --- a/docs/StorageManagement.md +++ b/docs/StorageManagement.md @@ -17,7 +17,7 @@ Currently, only S3 storage is supported. 
Both allocation and deletion of the sto - A single connection taxonomy should be used by modules, catalog conector and storage manager. -- Deployment of FybrikStorageAccount CRD should be configurable. +- Deployment of FybrikStorageAccount CRD should be configurable - to be discussed: [issue 1717](https://github.com/fybrik/fybrik/issues/1717) - Fybrik manages storage life cycle of temporary data copies. @@ -51,50 +51,81 @@ Currently, only S3 storage is supported. Both allocation and deletion of the sto It is crucial that the modules and the storage allocation component have a base common connection structure. Otherwise, the components will not work together correctly when deployed. Thus, we propose the following: -- **Connection** will be defined in **base** taxonomy within Fybrik repository. +- **Connection** is defined in **base** taxonomy within Fybrik repository, as it is done today: ``` -struct Connection { - - name string // connection name - - category string // category, optional - - properties map[string]string // connection properties shared by multiple assets as well as asset specifics - +// +kubebuilder:pruning:PreserveUnknownFields +// Name of the connection to the data source +type Connection struct { + // Name of the connection to the data source + Name ConnectionType `json:"name"` + AdditionalProperties serde.Properties `json:"-"` } ``` -Additional properties will be preserved. (needs discussion) +- Fybrik defines **taxonomy layers** with schema definition for supported connections (in pkg/taxonomy/layers). Quickstart deploys Fybrik using these layers. Users can add/replace layers when deploying Fybrik. -- Fybrik defines **layers** with schema definition for supported connections (in pkg/storage/layers). Quickstart deploys Fybrik using these layers. Users can add/replace layers when deploying Fybrik. +In **Phase1** we define layers for all connection types that are supported today by open-source modules, such as `s3`, `db2`, `kafka`, `arrow-flight`. 
(Revisit taxonomy layers used by Airbyte module.) -- A module yaml can specify as a connection type either name or category. Optimizer will do the matching. - -### StorageAccount - -StorageAccount specifies the following: +Example of a taxonomy layer for `s3`: +``` + s3: + description: Connection information for S3 compatible object store + type: object + properties: + bucket: + type: string + endpoint: + type: string + object_key: + type: string + region: + type: string + required: + - bucket + - endpoint + - object_key ``` -- type (connection name) +- Selection of modules is done based on connection `name`. -- geography +In **Phase2** we add an optional `Category` field to `Connection`. +A module yaml will specify as a connection type either name or category. Optimizer will do the matching. -- reference to a secret +### FybrikStorageAccount +Today, FybrikStorageAccount spec defines properties of s3 storage, e.g.: +``` +spec: + id: theshire-object-store + secretRef: credentials-theshire + region: theshire + endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud +``` -- properties as key-value map +We suggest to add `type` (connection name), `geography` and `properties` defining the appropriate connection properties. +Example: +``` +spec: + id: theshire-object-store + type: s3 + secretRef: credentials-theshire + geography: theshire + properties: + s3: + region: eu + endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud ``` Dynamic information about performance, amount free, costs, etc., are detailed in the separate Infrastructure Attributes JSON file. ## StorageManager -StorageManager is responsible for allocating storage in known storage accounts (as declared in StorageAccount CRDs) and for freeing the allocated storage. +StorageManager is responsible for allocating storage in known storage accounts (as declared in FybrikStorageAccount CRDs) and for freeing the allocated storage. 
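To make the contract concrete — a purely illustrative example, not a committed format — a successful allocation against the `theshire-object-store` account shown above might return a connection such as:

```
name: s3
s3:
  endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud
  region: eu
  bucket: fybrik-app-uuid-1234
  object_key: new-asset
```

The `bucket` and `object_key` values here are hypothetical names generated by StorageManager; modules receive this object together with the credentials referenced by the account's `secretRef`.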
### Architecture and interfaces -StorageManager runs as a new container in the manager pod. Its docker image is specified in Fybrik `values.yaml` and can be replaced as long as the alternative implementation obeys the following APIs: +StorageManager runs as a new container in the manager pod. A default Fybrik deployment uses its open-source implementation as a docker image specified in Fybrik `values.yaml`. This implementation can be replaced as long as the alternative obeys the following APIs: #### AllocateStorage @@ -110,51 +141,44 @@ The allocated storage is freed after FybrikApplication is deleted or a dataset i #### GetSupportedConnectionTypes -Returns a list of supported connection types, to validate the storage accounts. +Returns a list of supported connection types. Optimizer will use this list to constrain selection of storage accounts. -### Code base +## Architecture As the first step, storage management functionality will be defined in Fybrik repo under `pkg/storage`. -To consider moving to another repository. +To consider moving to another repository in the future. The folder will include: -- taxonomy layers (yaml files) - -- StorageAccount types to generate the CRD, a separate `storageaccount-crd` helm chart. The main `fybrik` chart will declare the dependency of `storageaccount-crd` chart on `storageManager.enabled` value (see `Deployment configuration`). - +- FybrikStorageAccount types to generate the CRD - Open-source implementation of StorageManager APIs -### Architecture - Architecture of StorageManager is based on [Design pattern](https://eli.thegreenplace.net/2019/design-patterns-in-gos-databasesql-package) -It defines `main` that registers various connectors and `connector plugins` that implement the interface for `AllocateStorage`/`DeleteStorage`. Each connector plugin registers the connection type it supports in init(). - -A docker file is included to build the image of StorageManager. 
+It defines `main` that registers various connection types and `plugins` that implement the interface for `AllocateStorage`/`DeleteStorage`. Each plugin registers the connection type it supports in init(). StorageManager invokes the appropriate plugin method based on the registered connection type. ## How to support a new connection type ### StorageManager -- Add a connector package with implementation of `AllocateStorage`/`DeleteStorage` and register it in the main process. +- Add a new plugin with implementation of `AllocateStorage`/`DeleteStorage` and register it in the main process. - Create a new docker image of StorageManager ### Fybrik core +- Add a new taxonomy layer describing the connection schema and compile `taxonomy.json`. + - Ensure existence of modules that are able to write/copy to this connection. Update the capabilities in module yamls accordingly. -- Prepare StorageAccount resources with the shared storage information. +- Prepare FybrikStorageAccount resources with the shared storage information. - Update infrastructure attributes related to the storage accounts, e.g., cost. - Optionally update IT config policies to specify when the new storage can/should be selected -- Add a new layer describing the connection schema. - - Optionally extend catalog-connector to support the new connection. - Deploy new/modified yamls and re-install Fybrik release using StorageManager image and the new taxonomy schema. No change to fybrik_crd is required. @@ -162,48 +186,8 @@ A docker file is included to build the image of StorageManager. ## Deployment configuration -In values.yaml add another section `storageManager` with the following configurable properties: - -- `enabled` - -- `image` - -- `imagePullPolicy` - -StorageAccounts are deployed only if `storageManager.enabled` is set to `true`. - - -## Extending syntax of IT config policies - -IT config policies support filtering of StorageAccounts by one or more of the following: - -- type (e.g. 
allow S3 storage only) - -- geography (copy data to the workload geography) - -- infrastructure attributes (prefer cheaper cost) +In values.yaml add another section `storageManager` with `image` of StorageManager. Modify manager deployment. -- connection properties. - - -Currently, the policies only filter out options instead of providing instructions. Instruction examples: a person with role Auditor needs to write data to a bucket named Audit, an empty bucket should be deleted if tags include “managed-by-fybrik”. This requires a change to the policy syntax - adding `attributes` to provide additional attributes for modules or storage accounts. -``` -config[{"capability": "write", "decision": decision}] { - - input.context.role == “Auditor” - - decision: = {"attributes": {"storageaccounts": [{“bucket”: “Audit”}]}} - - } - -config [{"capability": "delete", "decision": decision}] { - - input.request.dataset.tags[“managed-by-fybrik”] - - decision: = {"attributes": {"storageaccounts": [{“delete_empty_bucket”: “true”}]}} - - } -``` ## Changes to Optimizer and storage type selection @@ -220,26 +204,33 @@ Earlier, the only available storage type was S3. It was hard-coded inside manage - Use information about the amount of storage available and amount of data to be written/copied to influence storage selection. -## Development plan +- Extend IT config policies with options for storage management -- Redefine `Connection` and provide layers for s3, db2, kafka, arrow-flight, what else? +## Development plan -- Remove dependency on datashim +### Phase1 + +- Provide layers for s3, db2, kafka, arrow-flight, what else? - Changes to FybrikStorageAccount CR -- Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well. +- Implement StorageManager with the defined API for s3 using minio sdk. -- Add chart values and deploy CRs accordingly. The manager code should work even when storage CRs are not deployed. 
+- Remove dependency on datashim -- Implement a standalone storage-allocator that will implement the defined API for s3 using minio sdk. +- Update documentation accordingly -- Support additional connection types - to be decided what types and what priorities.  +- Changes to Airbyte module to adapt the suggested taxonomy -- IT config policies for configuration options -- Update documentation accordingly +### Phase2 -- Changes to Airbyte module, other modules +- Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well. + +### Phase3 + +- IT config policies for configuration options + +### Phase4 -- Changes to catalog connectors to align with taxonomy \ No newline at end of file +- Support additional connection types - to be decided what types and what priorities.  From d732746b89bda2044d331f00723ccaa8c8cc618c Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Mon, 5 Dec 2022 19:46:33 +0000 Subject: [PATCH 3/7] fixes Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 9 ++++----- 1 file changed, 4 insertions(+), 5 deletions(-) diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md index 07c412e69..eef43dded 100644 --- a/docs/StorageManagement.md +++ b/docs/StorageManagement.md @@ -101,7 +101,7 @@ spec: endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud ``` -We suggest to add `type` (connection name), `geography` and `properties` defining the appropriate connection properties. +We suggest to add `type` (connection name), `geography` and the appropriate connection properties. 
Example: ``` spec: @@ -109,10 +109,9 @@ spec: type: s3 secretRef: credentials-theshire geography: theshire - properties: - s3: - region: eu - endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud + s3: + region: eu + endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud ``` Dynamic information about performance, amount free, costs, etc., are detailed in the separate Infrastructure Attributes JSON file. From e2f8f85111ca52f3165d26f1c0a509b85e2a7a37 Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Wed, 7 Dec 2022 08:40:57 +0000 Subject: [PATCH 4/7] address some of the comments Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 102 +++++++++++++++++++++++++++----------- 1 file changed, 73 insertions(+), 29 deletions(-) diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md index eef43dded..0642025ca 100644 --- a/docs/StorageManagement.md +++ b/docs/StorageManagement.md @@ -63,29 +63,12 @@ type Connection struct { } ``` -- Fybrik defines **taxonomy layers** with schema definition for supported connections (in pkg/taxonomy/layers). Quickstart deploys Fybrik using these layers. Users can add/replace layers when deploying Fybrik. +- Fybrik defines **connection taxonomy layer** with schema definition for supported connections (in pkg/taxonomy/layers). Quickstart deploys Fybrik using this layer. Users can modify the connection layer when deploying Fybrik. -In **Phase1** we define layers for all connection types that are supported today by open-source modules, such as `s3`, `db2`, `kafka`, `arrow-flight`. (Revisit taxonomy layers used by Airbyte module.) +In **Phase1** we define a connection taxonomy layer for all connection types that are supported today by open-source modules, such as `s3`, `db2`, `kafka`, `arrow-flight`. (Revisit taxonomy definition used by Airbyte module.) 
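As an illustration of the matching, a module that copies data to S3 might declare the connection name in its capabilities. This is a hypothetical snippet — the exact field names depend on the FybrikModule schema:

```
capabilities:
  - capability: copy
    supportedInterfaces:
      - source:
          protocol: s3
        sink:
          protocol: s3
```

With the optional category (Phase2), the same field could name a wider family of connections instead of a single connection name.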
+ +See [connection taxonomy](https://github.com/fybrik/fybrik/blob/master/samples/taxonomy/example/catalog/connection.yaml) for an example of a connection taxonomy layer. -Example of a taxonomy layer for `s3`: -``` - s3: - description: Connection information for S3 compatible object store - type: object - properties: - bucket: - type: string - endpoint: - type: string - object_key: - type: string - region: - type: string - required: - - bucket - - endpoint - - object_key -``` - Selection of modules is done based on connection `name`. In **Phase2** we add an optional `Category` field to `Connection`. @@ -101,7 +84,7 @@ spec: endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud ``` -We suggest to add `type` (connection name), `geography` and the appropriate connection properties. +We suggest to add `type` (connection name), `geography` and the appropriate connection properties taken from the taxonomy. Example: ``` spec: @@ -145,30 +128,91 @@ Returns a list of supported connection types. Optimizer will use this list to co ## Architecture -As the first step, storage management functionality will be defined in Fybrik repo under `pkg/storage`. -To consider moving to another repository in the future. +Storage management functionality will be defined in Fybrik repo under `pkg/storage`. The folder will include: - FybrikStorageAccount types to generate the CRD -- Open-source implementation of StorageManager APIs +- Open-source implementation of StorageManager APIs based on a some / all of the connection types in the [taxonomy](#taxonomy-and-structures) Architecture of StorageManager is based on [Design pattern](https://eli.thegreenplace.net/2019/design-patterns-in-gos-databasesql-package) -It defines `main` that registers various connection types and `plugins` that implement the interface for `AllocateStorage`/`DeleteStorage`. Each plugin registers the connection type it supports in init(). 
StorageManager invokes the appropriate plugin method based on the registered connection type.

`agent` defines the interface for `AllocateStorage`/`DeleteStorage`:

```
package agent

// Agent is implemented once per supported connection type.
// Request/response types are sketched; exact shapes follow the taxonomy.
type Agent interface {
	AllocateStorage(request *AllocateStorageRequest) (*Connection, error)
	DeleteStorage(connection *Connection) error
}
```

`registrator` registers `agents` implementing the interface:
```
package registrator

import (
	"fmt"
	"sync"

	"registrator/agent"
)

var (
	agentsMu sync.RWMutex
	agents   = make(map[string]agent.Agent)
)

// Register makes an agent available under the given connection type name.
func Register(name string, worker agent.Agent) error {
	agentsMu.Lock()
	defer agentsMu.Unlock()
	if worker == nil {
		return fmt.Errorf("registrator: agent for %q is nil", name)
	}
	if _, dup := agents[name]; dup {
		return fmt.Errorf("registrator: duplicate agent for %q", name)
	}
	agents[name] = worker
	return nil
}
```

Each agent implements the interface and registers the connection type it supports in `init()`:
```
package s3agent

import (
	"registrator"
	"registrator/agent"
)

type S3Agent struct{}

func init() {
	registrator.Register("s3", &S3Agent{})
}

// AllocateStorage/DeleteStorage implement the agent interface for S3:
// func (a *S3Agent) AllocateStorage...
// func (a *S3Agent) DeleteStorage...
```

StorageManager invokes the appropriate agent based on the registered connection type:
```
package storagemanager

import (
	"registrator"

	// agents are imported for their registration side effect in init()
	_ "s3agent"
)

// Within AllocateStorage: look up the agent for the requested
// connection type and delegate to it.
// worker, err := registrator.GetAgent(connectionType)
// if err != nil { return nil, err }
// return worker.AllocateStorage(request)
```

## How to support a new connection type

### StorageManager

- Implement `AllocateStorage`/`DeleteStorage` for the new type and register it. (Depends on the definition of the appropriate schema by Fybrik.)

- Create a new docker image of StorageManager

### Fybrik core

- Add a new connection schema and compile `taxonomy.json`.

- Ensure existence of modules that are able to write/copy to this connection. Update the capabilities in module yamls accordingly.
@@ -209,7 +253,7 @@ Earlier, the only available storage type was S3. It was hard-coded inside manage ### Phase1 -- Provide layers for s3, db2, kafka, arrow-flight, what else? +- Provide connection taxonomy layer for s3, db2, kafka, arrow-flight, what else? - Changes to FybrikStorageAccount CR From de5a041c178d2b45a9182979cefe67f315de2abf Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Wed, 7 Dec 2022 08:55:21 +0000 Subject: [PATCH 5/7] address comments Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 10 ++++++---- 1 file changed, 6 insertions(+), 4 deletions(-) diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md index 0642025ca..5ec260e8b 100644 --- a/docs/StorageManagement.md +++ b/docs/StorageManagement.md @@ -227,14 +227,14 @@ agent.AllocateStorage... - Deploy new/modified yamls and re-install Fybrik release using StorageManager image and the new taxonomy schema. No change to fybrik_crd is required. -## Deployment configuration +## Fybrik deployment configuration -In values.yaml add another section `storageManager` with `image` of StorageManager. Modify manager deployment. +In [values.yaml](https://github.com/fybrik/fybrik/blob/master/charts/fybrik/values.yaml) add a section `storageManager` with `image` of StorageManager. Modify manager deployment to bring up a new container running in the manager pod. ## Changes to Optimizer and storage type selection -Earlier, the only available storage type was S3. It was hard-coded inside manager as the default connection type used in the write flow of a new dataset. Now, the following changes are required (both to optimizer and the manager naive algorithm): +Currently, the only available storage type is S3. It has been hard-coded inside manager as the default connection type used in the write flow of a new dataset. 
Now, the following changes are required (both to optimizer and the manager naive algorithm): - add the constraint of storage type/category matching the module protocol @@ -243,7 +243,7 @@ Earlier, the only available storage type was S3. It was hard-coded inside manage ## To consider in the future: -- Changes to FybrikApplication (add requirements to the dataset entry) +- Changes to FybrikApplication (add user requirements for storage type, connection details such as bucket name, etc. to the dataset entry) - Use information about the amount of storage available and amount of data to be written/copied to influence storage selection. @@ -270,6 +270,8 @@ Earlier, the only available storage type was S3. It was hard-coded inside manage - Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well. +- Support more storage types: db2, kafka, google sheets, mySQL + ### Phase3 - IT config policies for configuration options From fd7b745b2a23d3200350786bf8b6f3de5efa4d27 Mon Sep 17 00:00:00 2001 From: Shlomit Koyfman Date: Thu, 8 Dec 2022 15:34:09 +0000 Subject: [PATCH 6/7] address PR comments Signed-off-by: Shlomit Koyfman --- docs/StorageManagement.md | 54 ++++++++++++++++++++++----------------- 1 file changed, 31 insertions(+), 23 deletions(-) diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md index 5ec260e8b..e37da07e0 100644 --- a/docs/StorageManagement.md +++ b/docs/StorageManagement.md @@ -7,15 +7,16 @@ When we say that Fybrik allocates storage, we actually mean that Fybrik allocate Modules that write data to this storage, receive from Fybrik a connection that holds the relevant information about the storage (e.g. endpoint and write credentials). -Currently, only S3 storage is supported. Both allocation and deletion of the storage (if temporary) is done using Datashim. +Currently, only S3 storage is supported. 
Both allocation and deletion of the storage (if temporary) is done using [Datashim](https://datashim.io/).
+Business logic related to storage management is hard-coded in Fybrik.

## Gaps / Requirements

-- Support additional connection types (e.g. MySQL, googlesheets)
+- Support additional connection types (e.g. [MySQL](https://www.mysql.com/), [Google Sheets](https://learn.microsoft.com/en-us/connectors/googlesheet/))

- Business logic should not be hard-coded in Fybrik.

-- A single connection taxonomy should be used by modules, catalog conector and storage manager.
+- Storage manager should use the common connection taxonomy.

- Deployment of FybrikStorageAccount CRD should be configurable - to be discussed: [issue 1717](https://github.com/fybrik/fybrik/issues/1717)

@@ -23,15 +24,16 @@ Currently, only S3 storage is supported. Both allocation and deletion of the sto

- A clear error indication should be provided if the requested storage type is not supported, or an operation has failed.

-- IT admin should be able to express config policies based on storage dynamic attributes (e.g., cost) as well as storage properties such as type, geography and others.
+- IT admin should be able to express config policies related to storage allocation, based on storage dynamic attributes (e.g., cost) as well as storage properties such as type, geography and others.

- Optimizer needs to ensure that the allocated storage type matches the connection that the module uses.

- The selected storage should not necessarily match the source dataset connection in case of copying an existing asset.

+- The data user should be able to leave the choice of storage type to Fybrik and organization policies.
+
- [Future enhancement] The data user should be able to request a specific storage type, or to specify some of the connection properties (e.g., bucket name) inside FybrikApplication.

-- The data user should be able to leave a choice of a storage type to Fybrik and organization policies.   
## Goals @@ -71,8 +73,9 @@ See [connection taxonomy](https://github.com/fybrik/fybrik/blob/master/samples/t - Selection of modules is done based on connection `name`. -In **Phase2** we add an optional `Category` field to `Connection`. -A module yaml will specify as a connection type either name or category. Optimizer will do the matching. +In **Phase2** we add an optional `Category` field to `Connection`. Category represents a wider set of connection types that can be supported by a module. +For example, `Generic S3` can represent `AWS S3`, `IBM Cloud Object Storage`, and so on. +A module yaml will be able to specify either name or category as a connection type. Optimizer will do the matching. ### FybrikStorageAccount Today, FybrikStorageAccount spec defines properties of s3 storage, e.g.: @@ -105,7 +108,7 @@ StorageManager is responsible for allocating storage in known storage accounts ( -### Architecture and interfaces +### Interfaces StorageManager runs as a new container in the manager pod. A default Fybrik deployment uses its open-source implementation as a docker image specified in Fybrik `values.yaml`. This implementation can be replaced as long as the alternative obeys the following APIs: @@ -113,7 +116,9 @@ StorageManager runs as a new container in the manager pod. A default Fybrik depl Storage is allocated after the appropriate storage account has been selected by the optimizer. -`AllocateStorage` request includes properties of the selected storage account, asset name (and additional properties) defined in FybrikApplication, prefix for name generation based on application uuid, attributes defined by IT config policies, e.g., bucket_name. Upon a successful allocation, a connection object will be returned. 
+`AllocateStorage` request includes properties of the selected storage account, asset name (and additional properties) defined in FybrikApplication, prefix for name generation based on application uuid, attributes defined by IT config policies, e.g., bucket_name. +Upon a successful allocation, a connection object is returned. +In case of an error, a detailed error message is returned. Examples of errors: credentials are not provided, access to cloud object storage is forbidden. #### DeleteStorage @@ -121,6 +126,8 @@ The allocated storage is freed after FybrikApplication is deleted or a dataset i `DeleteStorage` request receives the `Connection` object to be deleted, and configuration options defined by IT config policies, e.g., delete_empty_bucket. +It returns the operation status (success/failure), and a detailed error message, e.g. access is denied, the specified bucket does not exist, etc. + #### GetSupportedConnectionTypes Returns a list of supported connection types. Optimizer will use this list to constrain selection of storage accounts. @@ -204,15 +211,19 @@ agent.AllocateStorage... ## How to support a new connection type -### StorageManager +### Development -- Implement `AllocateStorage`/`DeleteStorage` for the new type and register it. (Depends on the definition of the appropriate schema by Fybrik.) +- Add a new connection schema and compile `taxonomy.json`. + +- StorageManager: implement `AllocateStorage`/`DeleteStorage` for the new type and register it. - Create a new docker image of StorageManager -### Fybrik core +- Optionally extend catalog-connector to support the new connection. + +- Re-install Fybrik release using StorageManager image and the new taxonomy schema. No change to fybrik_crd is required. -- Add a new connection schema and compile `taxonomy.json`. +### Deployment - Ensure existence of modules that are able to write/copy to this connection. Update the capabilities in module yamls accordingly. @@ -222,21 +233,16 @@ agent.AllocateStorage... 
- Optionally update IT config policies to specify when the new storage can/should be selected

-- Optionally extend catalog-connector to support the new connection.
-
-- Deploy new/modified yamls and re-install Fybrik release using StorageManager image and the new taxonomy schema. No change to fybrik_crd is required.
-
-
## Fybrik deployment configuration

In [values.yaml](https://github.com/fybrik/fybrik/blob/master/charts/fybrik/values.yaml) add a section `storageManager` with `image` of StorageManager. Modify manager deployment to bring up a new container running in the manager pod.

-## Changes to Optimizer and storage type selection
+## Changes to Optimizer and storage selection

-Currently, the only available storage type is S3. It has been hard-coded inside manager as the default connection type used in the write flow of a new dataset. Now, the following changes are required (both to optimizer and the manager naive algorithm):
+Currently, the only available storage connection type is S3. It has been hard-coded inside manager as the default connection type used in the write flow of a new dataset. Now, the following changes are required (both to optimizer and the manager naive algorithm):

-- add the constraint of storage type/category matching the module protocol
+- add the constraint of connection type/category matching the module protocol

- do not specify the desired connection type and determine it later from the selected storage type

@@ -268,9 +274,11 @@ Currently, the only available storage type is S3. It has been hard-coded inside

### Phase2

+- Add `Category` field to `Connection`, modify matching criteria in the optimizer / non-CSP algorithm.
+
- Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well.

-- Support more storage types: db2, kafka, google sheets, mySQL
+- Support MySQL

### Phase3

- IT config policies for configuration options

@@ -278,4 +286,4 @@ Currently, the only available storage type is S3. 
It has been hard-coded inside
### Phase4

-- Support additional connection types - to be decided what types and what priorities. 
+- Support additional connection types - DB2, Kafka, Google Sheets, etc.

From f042355c55c630078646f4463bb3da97806d09e2 Mon Sep 17 00:00:00 2001
From: Shlomit Koyfman
Date: Thu, 15 Dec 2022 09:24:16 +0000
Subject: [PATCH 7/7] refer to backwards compatibility

Signed-off-by: Shlomit Koyfman
---
 docs/StorageManagement.md | 7 +++++++
 1 file changed, 7 insertions(+)

diff --git a/docs/StorageManagement.md b/docs/StorageManagement.md
index e37da07e0..9d6913168 100644
--- a/docs/StorageManagement.md
+++ b/docs/StorageManagement.md
@@ -246,6 +246,13 @@ Currently, the only available storage connection type is S3. It has been hard-co

- do not specify the desired connection type and determine it later from the selected storage type

+## Backwards compatibility
+
+- FybrikStorageAccount CRD will be changed without preserving backwards compatibility
+
+- No changes to connectors or connector APIs
+
+- Changes to the Airbyte module chart are required after the connection layer is defined; no changes to other modules are needed

## To consider in the future:
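For the deployment-configuration change (a `storageManager` section in values.yaml) and the generalized FybrikStorageAccount, a hypothetical sketch might look as follows. Every field name, image reference, and API version here is an illustrative assumption, not the final chart or CRD schema:

```yaml
# Hypothetical excerpt from charts/fybrik/values.yaml: StorageManager runs
# as an extra container in the manager pod, with a replaceable image.
storageManager:
  image: "ghcr.io/fybrik/storage-manager:latest"  # illustrative image reference
---
# Hypothetical generalized FybrikStorageAccount: a connection-typed section
# instead of hard-coded s3 fields, so new storage types only add a schema.
apiVersion: app.fybrik.io/v1beta1
kind: FybrikStorageAccount
metadata:
  name: storage-account-sample
spec:
  id: account-theshire
  secretRef: credentials-theshire   # credentials stay out of the CR itself
  geography: theshire
  type: s3                          # selects the connection schema below
  s3:
    endpoint: s3.example.cloud
```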
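To make the StorageManager contract described in these patches concrete, the following is a minimal Go sketch of the three operations (`AllocateStorage`, `DeleteStorage`, `GetSupportedConnectionTypes`) with a toy s3-only implementation. All names, signatures, and the `Connection` shape here are illustrative assumptions for this design discussion, not the actual Fybrik API:

```go
package main

import (
	"errors"
	"fmt"
)

// Connection is a minimal stand-in for the taxonomy connection object:
// a connection type name plus type-specific properties.
type Connection struct {
	Name       string
	Properties map[string]string
}

// StorageManager sketches the three operations described in the design.
type StorageManager interface {
	// AllocateStorage receives the selected storage account, an asset name,
	// and attributes set by IT config policies (e.g. bucket_name); it returns
	// a connection object on success or a detailed error.
	AllocateStorage(account Connection, assetName string, attrs map[string]string) (*Connection, error)
	// DeleteStorage frees the storage behind the given connection.
	DeleteStorage(conn Connection, attrs map[string]string) error
	// GetSupportedConnectionTypes lets the optimizer constrain
	// storage-account selection to types this manager supports.
	GetSupportedConnectionTypes() []string
}

// s3Manager is a toy implementation that supports only the s3 type.
type s3Manager struct{}

func (s3Manager) GetSupportedConnectionTypes() []string { return []string{"s3"} }

func (s3Manager) AllocateStorage(account Connection, assetName string, attrs map[string]string) (*Connection, error) {
	if account.Name != "s3" {
		// Clear error indication when the requested storage type is unsupported.
		return nil, fmt.Errorf("unsupported connection type: %q", account.Name)
	}
	bucket := attrs["bucket_name"] // attribute possibly set by IT config policies
	if bucket == "" {
		bucket = "fybrik-" + assetName // generated name based on the asset prefix
	}
	return &Connection{
		Name: "s3",
		Properties: map[string]string{
			"endpoint":   account.Properties["endpoint"],
			"bucket":     bucket,
			"object_key": assetName,
		},
	}, nil
}

func (s3Manager) DeleteStorage(conn Connection, attrs map[string]string) error {
	if conn.Name != "s3" {
		return errors.New("unsupported connection type")
	}
	// A real implementation would delete the stored objects, and optionally
	// the bucket itself when the delete_empty_bucket option is set.
	return nil
}

func main() {
	var mgr StorageManager = s3Manager{}
	acct := Connection{Name: "s3", Properties: map[string]string{"endpoint": "s3.example.com"}}
	conn, err := mgr.AllocateStorage(acct, "new-dataset", map[string]string{})
	if err != nil {
		panic(err)
	}
	fmt.Println(conn.Properties["endpoint"], conn.Properties["bucket"]) // s3.example.com fybrik-new-dataset
}
```

Registering one such implementation per connection type is what the "How to support a new connection type" steps amount to: a new type contributes its own `AllocateStorage`/`DeleteStorage` pair and advertises itself via `GetSupportedConnectionTypes`.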