This repository contains code and other files to implement the 5GZORRO datalake.
The main datalake API functionality is provided in the directory python-flask-server, much of which was generated by swagger-codegen.
The API itself is specified in datalake_swagger.yaml.
This code is proof-of-concept.
The datalake server itself can run on a single VM (or bare metal) with the following resources:
- 2 vCPUs
- 4 GB RAM
- 10 GB storage
The datalake server was developed with Python 3.6.
The datalake server requires that there first be running:
- kubernetes
- kafka
- argo
- s3 object store (can be minio)
- postgres database
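Before starting the datalake server, it can help to confirm that the prerequisite services accept connections. The following is a minimal sketch, not part of the repository. The host names and ports in SERVICES are assumptions (the usual defaults for each service, plus the Argo server port 2746) and should be adjusted to the actual deployment. Kubernetes is left out because its API endpoint varies between installations.

```python
import socket

# Hypothetical host/port pairs. The ports are the usual defaults for each
# service and may differ in a given deployment.
SERVICES = {
    "kafka": ("localhost", 9092),
    "minio": ("localhost", 9000),
    "postgres": ("localhost", 5432),
    "argo-server": ("localhost", 2746),
}


def is_reachable(host, port, timeout=2.0):
    """Return True if a TCP connection to host:port can be opened."""
    try:
        with socket.create_connection((host, port), timeout=timeout):
            return True
    except OSError:
        # Covers refused connections, timeouts and name resolution errors.
        return False


def unreachable_services(services=SERVICES):
    """Return the names of services that do not accept TCP connections."""
    return [
        name
        for name, (host, port) in services.items()
        if not is_reachable(host, port)
    ]
```

Calling unreachable_services() returns a list of the service names that could not be reached; an empty list means every configured port accepted a connection. An open port only shows that something is listening there, not that the service is configured correctly.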
To set up minio:
wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
mkdir -p /minio/data
export MINIO_VOLUMES="/var/lib/minio"
export MINIO_ACCESS_KEY=user
export MINIO_SECRET_KEY=password
./minio server /minio/data
Kubernetes should use Docker container management (rather than containerd) for argo to work properly.
For kubernetes, it is possible to run a simulated minikube cluster. To install minikube see: https://minikube.sigs.k8s.io/docs/start/.
curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
To set up Argo and standard argo-events:
kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.12.0-rc3/manifests/install.yaml
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default
kubectl create namespace argo-events
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/manifests/install.yaml
kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml
In Argo, it is necessary to define the dl-argo-events namespace.
kubectl create namespace dl-argo-events
cd datalake/config
kubectl apply -f ./install.yaml
kubectl apply -n dl-argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml
To see the Argo GUI, run argo server at the command line, and then connect via a web browser to http://localhost:2746.
In Kubernetes, it is necessary to define the datalake namespace.
kubectl create namespace datalake
Run the script that periodically cleans up old datalake Argo jobs:
cd /datalake/experiments
nohup ./loop_argo_del.sh >/dev/null 2>&1 &
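The contents of loop_argo_del.sh are not reproduced here. As a rough illustration of the idea of periodic cleanup, the sketch below repeatedly calls the Argo CLI's delete command with its --older option, which removes completed workflows that finished before the given age. The function names, the namespace argument, the age threshold, and the interval are all hypothetical and are not taken from the script.

```python
import subprocess
import time


def build_delete_command(namespace, older_than="1d"):
    """Build an `argo delete` command for completed workflows older than
    the given age (for example "10m", "3h" or "1d")."""
    return ["argo", "delete", "-n", namespace, "--older", older_than]


def cleanup_loop(namespace, older_than="1d", interval_seconds=3600,
                 run=subprocess.run):
    """Run the delete command forever, sleeping between rounds.

    The namespace, age and interval are hypothetical defaults, not values
    taken from loop_argo_del.sh.
    """
    while True:
        # check=False: a failed round should not stop the loop.
        run(build_delete_command(namespace, older_than), check=False)
        time.sleep(interval_seconds)
```

For example, cleanup_loop("my-namespace") would delete workflows that finished more than a day ago, once an hour, in a namespace named my-namespace (a placeholder).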
To set up postgres, see instructions at https://www.postgresqltutorial.com/install-postgresql-linux/ and https://www.postgresql.org/download/linux/ubuntu/.
Allow access from outside servers by following the instructions in https://stackoverflow.com/questions/38466190/cant-connect-to-postgresql-on-port-5432.
Then perform the following:
sudo -i -u postgres
psql
\l
CREATE DATABASE datalake;
\c datalake
DROP TABLE IF EXISTS datalake_metrics;
CREATE TABLE datalake_metrics(
seq_id SERIAL PRIMARY KEY,
resourceID VARCHAR,
referenceID VARCHAR,
transactionID VARCHAR,
productID VARCHAR,
instanceID VARCHAR,
metricName VARCHAR,
metricValue VARCHAR,
timestamp VARCHAR,
storageLocation VARCHAR
);
create user datalake_user with encrypted password 'datalake_pw';
grant all privileges on database datalake to datalake_user;
grant usage on schema public to datalake_user;
grant all privileges on table datalake_metrics to datalake_user;
grant all privileges on sequence datalake_metrics_seq_id_seq to datalake_user;
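To illustrate how a record maps onto the datalake_metrics table defined above, here is a minimal sketch that builds a parameterized INSERT statement. It is not the code used by the datalake server. The function name build_metric_insert is hypothetical, and the %s placeholder style assumes a DB-API driver such as psycopg2, which the text above does not specify.

```python
# Column names follow the CREATE TABLE statement above. seq_id is omitted
# because the SERIAL type fills it in automatically.
METRIC_COLUMNS = (
    "resourceID",
    "referenceID",
    "transactionID",
    "productID",
    "instanceID",
    "metricName",
    "metricValue",
    "timestamp",
    "storageLocation",
)


def build_metric_insert(record):
    """Return (sql, params) for inserting one metrics record.

    Keys missing from `record` become None (SQL NULL). Values are passed as
    parameters instead of being formatted into the SQL string.
    """
    columns = ", ".join(METRIC_COLUMNS)
    placeholders = ", ".join(["%s"] * len(METRIC_COLUMNS))
    sql = "INSERT INTO datalake_metrics ({}) VALUES ({})".format(
        columns, placeholders
    )
    params = tuple(record.get(col) for col in METRIC_COLUMNS)
    return sql, params
```

With a DB-API cursor, the result would be used as cursor.execute(sql, params), connecting as datalake_user to the datalake database created above.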
Before bringing up the datalake python-flask-server:
- The ingest pipeline must be compiled and dockerized with a name of ingest.
- The metrics_index pipeline must be compiled and dockerized with a name of metrics_index.
- The catalog service must be compiled and dockerized with a name of dl_catalog_server.
The ingest, metrics_index, and dl_catalog_server containers are pulled from the 5gzorro/datalake repository. To enable access to them, supply the following secrets to Kubernetes:
cd datalake/config
kubectl apply -f ./docker-secret.yaml
This is a POC implementation. Authentication is not implemented.
TODO: Proper permissions have to be set up to use argo-events (argo-events-resource-admin-role).
In the python-flask-server directory, fill in the proper values in env and follow the instructions in the README file.
Kalman Meth - meth@il.ibm.com
This 5GZORRO component is published under Apache 2.0 license.