
5GZORRO Datalake

Introduction

This repository contains code and other files to implement the 5GZORRO datalake.

The main datalake API functionality is provided in the directory python-flask-server, much of which was generated by swagger-codegen.

The API itself is specified in datalake_swagger.yaml.
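
As an optional sanity check, the spec can be validated with a standard OpenAPI tool (swagger-cli is not part of this repository; any equivalent validator works):

npx swagger-cli validate datalake_swagger.yaml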

This code is proof-of-concept.

Prerequisites

System Requirements

The datalake server itself can run on a single VM (or bare metal) with the following resources.

  • 2 vCPUs
  • 4 GB RAM
  • 10 GB storage

The datalake server was developed with python3.6.
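
For example, a virtual environment for the server can be prepared as follows (the requirements.txt path assumes the layout produced by swagger-codegen and may differ):

python3.6 -m venv venv
source venv/bin/activate
pip install -r python-flask-server/requirements.txt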

Dependencies

The datalake server requires that the following services first be running (their setup is covered under Installation below):

  • minio
  • A Kubernetes cluster (minikube is sufficient) with Argo and Argo Events
  • postgres

Installation

To set up minio:

wget https://dl.min.io/server/minio/release/linux-amd64/minio
chmod +x minio
mkdir -p /minio/data
export MINIO_VOLUMES="/minio/data"
export MINIO_ACCESS_KEY=user
export MINIO_SECRET_KEY=password
./minio server /minio/data
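
To confirm that the server came up, its liveness endpoint can be probed (this assumes the default port 9000):

curl -I http://localhost:9000/minio/health/live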

Kubernetes should use Docker as its container runtime (rather than containerd) for Argo to work properly.

For Kubernetes, it is possible to run a local single-node minikube cluster. To install minikube, see: https://minikube.sigs.k8s.io/docs/start/.

curl -LO https://storage.googleapis.com/minikube/releases/latest/minikube-linux-amd64
sudo install minikube-linux-amd64 /usr/local/bin/minikube
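
The cluster can then be started with Docker as the container runtime, matching the requirement noted above:

minikube start --container-runtime=docker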

To set up Argo and standard argo-events:

kubectl create namespace argo
kubectl apply -n argo -f https://raw.githubusercontent.com/argoproj/argo/v2.12.0-rc3/manifests/install.yaml
kubectl create rolebinding default-admin --clusterrole=admin --serviceaccount=default:default
kubectl create namespace argo-events
kubectl apply -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/manifests/install.yaml
kubectl apply -n argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml
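
A quick check that the Argo and Argo Events pods are running:

kubectl get pods -n argo
kubectl get pods -n argo-events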

For the datalake's Argo Events resources, it is necessary to define the dl-argo-events namespace.

kubectl create namespace dl-argo-events
cd datalake/config
kubectl apply -f ./install.yaml
kubectl apply -n dl-argo-events -f https://raw.githubusercontent.com/argoproj/argo-events/v1.1.0/examples/eventbus/native.yaml
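
To confirm that the event bus was created in the new namespace:

kubectl get eventbus -n dl-argo-events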

To see the Argo GUI, run argo server at the command line, and then connect via a web browser to http://localhost:2746.
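
Workflows can also be listed from the command line, without the GUI:

kubectl get workflows --all-namespaces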

In Kubernetes, it is necessary to define the datalake namespace.

kubectl create namespace datalake

Run the following script to periodically clean up old datalake Argo jobs:

cd /datalake/experiments
nohup ./loop_argo_del.sh >/dev/null 2>&1 &
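
Completed workflows can also be deleted by hand; the command below illustrates the kind of cleanup the script performs (the dl-argo-events namespace is an assumption; the script's own logic may differ):

argo delete --completed -n dl-argo-events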

To set up postgres, see the instructions at https://www.postgresqltutorial.com/install-postgresql-linux/ and https://www.postgresql.org/download/linux/ubuntu/.

Allow access from outside servers by following the instructions at https://stackoverflow.com/questions/38466190/cant-connect-to-postgresql-on-port-5432.
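
Those instructions amount to two configuration edits (paths vary with the installed PostgreSQL version) followed by a restart:

# /etc/postgresql/<version>/main/postgresql.conf:  listen_addresses = '*'
# /etc/postgresql/<version>/main/pg_hba.conf:      host  all  all  0.0.0.0/0  md5
sudo systemctl restart postgresql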

Then perform the following:

sudo -i -u postgres
psql
\l
CREATE DATABASE datalake;
\c datalake
DROP TABLE IF EXISTS datalake_metrics;
CREATE TABLE datalake_metrics(
    seq_id SERIAL PRIMARY KEY,
    resourceID VARCHAR,
    referenceID VARCHAR,
    transactionID VARCHAR,
    productID VARCHAR,
    instanceID VARCHAR,
    metricName VARCHAR,
    metricValue VARCHAR,
    timestamp VARCHAR,
    storageLocation VARCHAR
);

create user datalake_user with encrypted password 'datalake_pw';
grant all privileges on database datalake to datalake_user;
grant usage on schema public to datalake_user;
grant all privileges on table datalake_metrics to datalake_user;
grant all privileges on sequence datalake_metrics_seq_id_seq to datalake_user;
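
To verify that the new role can connect and see the table (the password matches the example above):

psql "host=localhost port=5432 dbname=datalake user=datalake_user password=datalake_pw" -c '\dt'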

Before bringing up the datalake python-flask-server:

  • The ingest pipeline must be compiled and dockerized with the name ingest (example build commands are sketched after this list).
  • The metrics_index pipeline must be compiled and dockerized with the name metrics_index.
  • The catalog service must be compiled and dockerized with the name dl_catalog_server.
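
For example, the images might be built along these lines (the source directories are placeholders; the actual Dockerfile locations and any registry tagging depend on the pipeline code):

docker build -t ingest <path-to-ingest-pipeline>
docker build -t metrics_index <path-to-metrics_index-pipeline>
docker build -t dl_catalog_server <path-to-catalog-service>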

The ingest, metrics_index, and dl_catalog_server containers are pulled from the 5gzorro/datalake repository. To enable access to that repository, supply the following secret to Kubernetes.

cd datalake/config
kubectl apply -f ./docker-secret.yaml

This is a POC implementation. Authentication is not implemented.

TODO: Proper permissions have to be set up to use argo-events (argo-events-resource-admin-role).

Usage

In the python-flask-server directory, fill in the proper values in the env file and follow the instructions in the README file there.
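
A typical launch sequence looks like the following; the swagger_server module name reflects swagger-codegen's python-flask layout and is an assumption, so defer to the README in that directory:

cd python-flask-server
set -a; . ./env; set +a      # export the variables defined in env (assumes KEY=value lines)
python3 -m swagger_server    # module name per swagger-codegen's python-flask output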

Maintainers

Kalman Meth - meth@il.ibm.com

License

This 5GZORRO component is published under the Apache 2.0 license.
