8000 Storage management design by shlomitk1 · Pull Request #1837 · fybrik/fybrik · GitHub
[go: up one dir, main page]
More Web Proxy on the site http://driver.im/
Skip to content

Storage management design #1837

New issue 8000

Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.

By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.

Already on GitHub? Sign in to your account

Merged
merged 9 commits into from
Dec 18, 2022
Merged
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
296 changes: 296 additions & 0 deletions docs/StorageManagement.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,296 @@
# Storage Management by Fybrik

## Background
In several use-cases Fybrik needs to allocate storage for data. One use case is implicit copy of a dataset in read scenarios made for performance, cost or governance sake. A second scenario is when a new dataset is created by the workload. In this case Fybrik allocates the storage and registers the new dataset in the data catalog. A third use case is explicit copy - i.e. the user indicates that a copy of an existing dataset should be made. As in the second use case, here too Fybrik allocates storage for the data and registers the new dataset in the data catalog.

When we say that Fybrik allocates storage, we actually mean that Fybrik allocates a portion of an existing storage account for use by the given dataset. Fybrik must be informed what storage accounts are available, and how to access them. This information is currently provided via the FybrikStorageAccount CRD.

Modules that write data to this storage, receive from Fybrik a connection that holds the relevant information about the storage (e.g. endpoint and write credentials).

Currently, only S3 storage is supported. Both allocation and deletion of the storage (if temporary) is done using [Datashim](https://datashim.io/).
Business logic related to storage management is hard-coded in Fybrik.

## Gaps / Requirements

- Support additional connection types (e.g. [MySQL](https://www.mysql.com/), [Google Sheets](https://learn.microsoft.com/en-us/connectors/googlesheet/))

- Business logic should not be hard-coded in Fybrik.

- Storage manager should use the common connection taxonomy.

- Deployment of FybrikStorageAccount CRD should be configurable - to be discussed: [issue 1717](https://github.com/fybrik/fybrik/issues/1717)
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If FybrikStorageAccount is not deployed, does it mean that the Storage Manager will not been deployed too? Should we support optional deployment of the Storage Manager?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Optional deployment of StorageManager is equivalent to optional deployment of FybrikStorageAccount CRD, there is no meaning of deploying one without another. In this design both are deployed unconditionally.


- Fybrik manages storage life cycle of temporary data copies.

- A clear error indication should be provided if the requested storage type is not supported, or an operation has failed.

- IT admin should be able to express config policies related to storage allocation, based on storage dynamic attributes (e.g., cost) as well as storage properties such as type, geography and others.

- Optimizer needs to ensure that the allocated storage type matches the connection that the module uses.

- The selected storage should not necessarily match the source dataset connection in case of copying an existing asset.

- The data user should be able to leave a choice of a storage type to Fybrik and organization policies.  

- [Future enhancement] The data user should be able to request a specific storage type, or to specify some of the connection properties (e.g., bucket name) inside FybrikApplication.


## Goals

- Provide modules with a connection object for writing data

- Share organization storage to store temporary or persistent data while hiding the details (credentials, server URLs, service endpoints, etc.) from the data user. This means that data users will store data in organization accounts created by IT administrators.

- Govern the use of the shared storage for the given workload according to the compliance, capacity and other factors.

- Optimize the shared storage for the given workload (by cost, latency to the workload)

- Manage storage life cycle of the shared storage (e.g., delete temporary data after not being used, delete empty buckets, etc.)

# High Level Design

## Taxonomy and structures

It is crucial that the modules and the storage allocation component have a base common connection structure. Otherwise, the components will not work together correctly when deployed. Thus, we propose the following:

- **Connection** is defined in **base** taxonomy within Fybrik repository, as it is done today:

```
// +kubebuilder:pruning:PreserveUnknownFields
// Name of the connection to the data source
type Connection struct {
// Name of the connection to the data source
Name ConnectionType `json:"name"`
AdditionalProperties serde.Properties `json:"-"`
}
```

- Fybrik defines **connection taxonomy layer** with schema definition for supported connections (in pkg/taxonomy/layers). Quickstart deploys Fybrik using this layer. Users can modify the connection layer when deploying Fybrik.

In **Phase1** we define a connection taxonomy layer for all connection types that are supported today by open-source modules, such as `s3`, `db2`, `kafka`, `arrow-flight`. (Revisit taxonomy definition used by Airbyte module.)

See [connection taxonomy](https://github.com/fybrik/fybrik/b 8000 lob/master/samples/taxonomy/example/catalog/connection.yaml) for an example of a connection taxonomy layer.

- Selection of modules is done based on connection `name`.

In **Phase2** we add an optional `Category` field to `Connection`. Category represents a wider set of connection types that can be supported by a module.
For example, `Generic S3` can represent `AWS S3`, `IBM Cloud Object Storage`, and so on.
A module yaml will be able to specify either name or category as a connection type. Optimizer will do the matching.

### FybrikStorageAccount
Today, FybrikStorageAccount spec defines properties of s3 storage, e.g.:
```
spec:
id: theshire-object-store
secretRef: credentials-theshire
region: theshire
endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud
```

We suggest to add `type` (connection name), `geography` and the appropriate connection properties taken from the taxonomy.
Example:
```
spec:
id: theshire-object-store
type: s3
secretRef: credentials-theshire
geography: theshire
s3:
region: eu
endpoint: http://s3.eu.cloud-object-storage.appdomain.cloud
```

Dynamic information about performance, amount free, costs, etc., are detailed in the separate Infrastructure Attributes JSON file.

## StorageManager

StorageManager is responsible for allocating storage in known storage accounts (as declared in FybrikStorageAccount CRDs) and for freeing the allocated storage.



### Interfaces

StorageManager runs as a new container in the manager pod. A default Fybrik deployment uses its open-source implementation as a docker image specified in Fybrik `values.yaml`. This implementation can be replaced as long as the alternative obeys the following APIs:

#### AllocateStorage

Storage is allocated after the appropriate storage account has been selected by the optimizer.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

If we define interfaces (API) here we should define what will be return in the case of success and some common error cases.


`AllocateStorage` request includes properties of the selected storage account, asset name (and additional properties) defined in FybrikApplication, prefix for name generation based on application uuid, attributes defined by IT config policies, e.g., bucket_name.
Upon a successful allocation, a connection object is returned.
In case of an error, a detailed error message is returned. Examples of errors: credentials are not provided, access to cloud object storage is forbidden.

#### DeleteStorage

The allocated storage is freed after FybrikApplication is deleted or a dataset is no longer required in the spec, and the storage is not persistent.

`DeleteStorage` request receives the `Connection` object to be deleted, and configuration options defined by IT config policies, e.g., delete_empty_bucket.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

What will be behavior if the requested to delete object is not exist? Error? Success?

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Depends on the connection and the logic. Using S3 as an example. In allocation operation, StorageManager created a bucket X to store temporary data Y. Delete will receive connection attributes including X and Y. Let's say that that the logic is to delete all objects in X plus the bucket. It is OK that Y is not there, and the bucket is empty, maybe the copy job has failed. However, if the allocated bucket does not exist, it is an error.


It returns the operation status (success/failure), and a detailed error message, e.g. access is denied, the specified bucket does not exist, etc.

#### GetSupportedConnectionTypes

Returns a list of supported connection types. Optimizer will use this list to constrain selection of storage accounts.


## Architecture

Storage management functionality will be defined in Fybrik repo under `pkg/storage`.

The folder will include:

- FybrikStorageAccount types to generate the CRD
- Open-source implementation of StorageManager APIs based on a some / all of the connection types in the [taxonomy](#taxonomy-and-structures)

Architecture of StorageManager is based on [Design pattern](https://eli.thegreenplace.net/2019/design-patterns-in-gos-databasesql-package)

`agent` defines the interface for `AllocateStorage`/`DeleteStorage`:

```
package agent

type AgentInterface interface {
func AllocateStorage...
func DeleteStorage...
}
```

`registrator` registers `agents` implementing the interface:
```
package registrator

import "registrator/agent"

var (
agentsMu sync.RWMutex
agents = make(map[string]agent.Agent)
)

func Register(name string, worker agent.Agent) error {
agentsMu.Lock()
defer agentsMu.Unlock()
if worker == nil {
// return error
}
if _, dup := agents[name]; dup {
// return error
}
agents[name] = worker
}
```

Each agent implements the interface and registers the connection type it supports in init().
```
package s3Agent

import "registrator"
import "registrator/agent"

func init() {
registrator.Register("s3", &S3Agent{})
}

func AllocateStorage...
func DeleteStorage...
```

StorageManager invokes the appropriate agent based on the registered connection type.
```
package storageManager
// import registrator
import "registrator"
// import agents
import "s3Agent"

... within AllocateStorage
// get agent by type
agent = registrator.GetAgent(type)
// invoke the agent
agent.AllocateStorage...
```

## How to support a new connection type

### Development

- Add a new connection schema and compile `taxonomy.json`.

- StorageManager: implement `AllocateStorage`/`DeleteStorage` for the new type and register it.

- Create a new docker image of StorageManager

- Optionally extend catalog-connector to support the new connection.

- Re-install Fybrik release using StorageManager image and the new taxonomy schema. No change to fybrik_crd is required.

### Deployment

- Ensure existence of modules that are able to write/copy to this connection. Update the capabilities in module yamls accordingly.

- Prepare FybrikStorageAccount resources with the shared storage information.
Copy link
Collaborator
6D40

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think the bullet of "Prepare FybrikStorageAccount resources with the shared storage information." is not relevant to "How to support a new connection type" The bullet adds new instances of the type. It can be done later, independent to addition of the new type support.
En example of K8s, deploy a CRD, and create CR instances.

Copy link
Contributor Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Distinguished between Development and Deployment sections


- Update infrastructure attributes related to the storage accounts, e.g., cost.
Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

is it to update the attributes of the new shared resources, or define a new type specific attributes ? Please clarify. If it is the shared storage attributes, see my comment above.


- Optionally update IT config policies to specify when the new storage can/should be selected

## Fybrik deployment configuration

In [values.yaml](https://github.com/fybrik/fybrik/blob/master/charts/fybrik/values.yaml) add a section `storageManager` with `image` of StorageManager. Modify manager deployment to bring up a new container running in the manager pod.


## Changes to Optimizer and storage selection

Currently, the only available storage connection type is S3. It has been hard-coded inside manager as the default connection type used in the write flow of a new dataset. Now, the following changes are required (both to optimizer and the manager naive algorithm):

- add the constraint of connection type/category matching the module protocol

- do not specify the desired connection type and determine it later from the selected storage type

## Backwards compatibility:

- FybrikStorageAccount CRD will be changed without preserving backwards compatibility

- No changes to connectors or connector APIs

- Changes to AirByte module chart are required after the connection layer is defined, no changes to other modules

## To consider in the future:

- Changes to FybrikApplication (add user requirements for storage type, connection details such as bucket name, etc. to the dataset entry)

- Use information about the amount of storage available and amount of data to be written/copied to influence storage selection.

- Extend IT config policies with options for storage management

## Development plan

### Phase1

- Provide connection taxonomy layer for s3, db2, kafka, arrow-flight, what else?

- Changes to FybrikStorageAccount CR

- Implement StorageManager with the defined API for s3 using minio sdk.

- Remove dependency on datashim

- Update documentation accordingly

- Changes to Airbyte module to adapt the suggested taxonomy


### Phase2

- Add Category field to `Connection`, modify matching criteria in the optimizer/ non-CSP algorithm.

- Lift the requirement for the default S3 storage, add constraints to the optimizer. Change the non-CSP algorithm as well.

- Support MySQL

### Phase3

- IT config policies for configuration options

### Phase4

Copy link
Collaborator

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

why do you separate support connection types and storage types? Should they be implemented together ?

- Support additional connection types - DB2, Kafka, Google Sheets,
0