This project is under development.

The `dbt-starrocks` package contains all the code to enable dbt to work with StarRocks.
This is an experimental plugin:
- It has not been tested extensively
- StarRocks version 2.5.0 or higher is required
- Version 3.1.x is recommended
- StarRocks versions 2.4 and below are no longer supported
This plugin can be installed via pip:

```shell
$ pip install dbt-starrocks
```
| StarRocks <= 2.5 | StarRocks 2.5 ~ 3.1 | StarRocks >= 3.1 | StarRocks >= 3.4 | Feature |
|:---:|:---:|:---:|:---:|---|
| ✅ | ✅ | ✅ | ✅ | Table materialization |
| ✅ | ✅ | ✅ | ✅ | View materialization |
| ❌ | ❌ | ✅ | ✅ | Materialized View materialization |
| ❌ | ✅ | ✅ | ✅ | Incremental materialization |
| ❌ | ✅ | ✅ | ✅ | Primary Key Model |
| ✅ | ✅ | ✅ | ✅ | Sources |
| ✅ | ✅ | ✅ | ✅ | Custom data tests |
| ✅ | ✅ | ✅ | ✅ | Docs generate |
| ❌ | ❌ | ✅ | ✅ | Expression Partition |
| ❌ | ❌ | ❌ | ❌ | Kafka |
| ❌ | ❌ | ❌ | ✅ | Dynamic Overwrite |
| ❌ | *4 | *4 | ✅ | Submit task |
1. When the StarRocks version is < 2.5, `create table as` can only set `engine='OLAP'` and `table_type='DUPLICATE'`.
2. When the StarRocks version is >= 2.5, `create table as` supports `table_type='PRIMARY'`.
3. When the StarRocks version is < 3.1, `distributed_by` is required.
4. Verify the specific `submit task` support for your version; see SUBMIT TASK.
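For example, on a cluster older than 3.1 a table model must declare `distributed_by` explicitly, since auto bucketing is unavailable. A minimal sketch (the model and column names below are illustrative, not from this project):

```sql
-- models/orders.sql (hypothetical model): on StarRocks < 3.1 the
-- distribution key must be set explicitly; auto bucketing is not available.
{{ config(
    materialized='table',
    engine='OLAP',
    table_type='DUPLICATE',
    distributed_by=['order_id'],
    buckets=3
) }}
select * from {{ ref('stg_orders') }}  -- hypothetical upstream model
```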
Example entry for `profiles.yml`:

```yaml
starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: localhost
      port: 9030
      schema: analytics
      username: your_starrocks_username
      password: your_starrocks_password
```
| Option | Description | Required? | Example |
|---|---|---|---|
| type | The specific adapter to use | Required | `starrocks` |
| host | The hostname to connect to | Required | `192.168.100.28` |
| port | The port to use | Required | `9030` |
| schema | Specify the schema (database) to build models into | Required | `analytics` |
| username | The username to use to connect to the server | Required | `dbt_admin` |
| password | The password to use for authenticating to the server | Required | `correct-horse-battery-staple` |
| version | Lets the plugin assume a specific StarRocks version for compatibility | Optional | `3.1.0` |
| use_pure | Set to `"true"` to use the pure-Python MySQL connector instead of the C extension | Optional | `true` |
| is_async | Set to `"true"` to submit suitable tasks as ETL tasks | Optional | `true` |
| async_query_timeout | Sets the `query_timeout` value when submitting a task to StarRocks | Optional | `300` |
More details about setting `use_pure` and other connection arguments can be found here.

More details about the use of `is_async` can be found here.
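Putting the optional settings together, a target might look like the following sketch (the host and credentials are placeholders):

```yaml
starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: starrocks-fe.example.com   # placeholder host
      port: 9030
      schema: analytics
      username: dbt_admin
      password: correct-horse-battery-staple
      version: 3.1.0   # optional: StarRocks version the plugin should assume
      use_pure: true   # optional: use the pure-Python MySQL connector
```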
The following StarRocks-specific model configurations are available:

```yaml
models:
  materialized: table        # table, view, materialized_view or incremental
  engine: 'OLAP'
  keys: ['id', 'name', 'some_date']
  table_type: 'PRIMARY'      # PRIMARY, DUPLICATE or UNIQUE
  distributed_by: ['id']
  buckets: 3                 # leave empty for auto bucketing
  indexs: [{'columns': 'idx_column'}]
  partition_by: ['some_date']
  partition_by_init: ["PARTITION p1 VALUES [('1971-01-01 00:00:00'), ('1991-01-01 00:00:00')),PARTITION p1972 VALUES [('1991-01-01 00:00:00'), ('1999-01-01 00:00:00'))"]
  order_by: ['some_column']  # only for PRIMARY table_type
  partition_type: 'RANGE'    # RANGE, LIST or Expr; must be used together with partition_by.
                             # Expr requires an expression (e.g. date_trunc) in partition_by.
  properties: {"replication_num": "1", "in_memory": "true"}
  refresh_method: 'async'    # only for materialized views; default is manual
  incremental_strategy: 'dynamic_overwrite'  # for materialized='incremental' in version >= 3.4;
                                             # supported values: ['default', 'insert_overwrite', 'dynamic_overwrite']
```
Example configurations:

```sql
{{ config(materialized='view') }}
{{ config(materialized='table', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='table', indexs=[{ 'columns': 'idx_column' }]) }}
{{ config(materialized='table', partition_by=['date_trunc("day", first_order)'], partition_type='Expr') }}
{{ config(materialized='table', table_type='PRIMARY', keys=['customer_id'], order_by=['first_name', 'last_name']) }}
{{ config(materialized='incremental', table_type='PRIMARY', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='incremental', partition_by=['my_partition_key'], partition_type='Expr', incremental_strategy='dynamic_overwrite') }}
{{ config(materialized='materialized_view') }}
{{ config(materialized='materialized_view', properties={"storage_medium":"SSD"}) }}
{{ config(materialized='materialized_view', refresh_method="ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)") }}
```
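Combining the partition options above, a RANGE-partitioned table might be configured like this sketch (partition names and date boundaries are illustrative):

```sql
{{ config(
    materialized='table',
    partition_by=['some_date'],
    partition_type='RANGE',
    partition_by_init=[
        "PARTITION p1 VALUES [('1971-01-01 00:00:00'), ('1991-01-01 00:00:00'))",
        "PARTITION p2 VALUES [('1991-01-01 00:00:00'), ('1999-01-01 00:00:00'))"
    ]
) }}
```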
Note: materialized views only support the `partition_by`, `buckets`, `distributed_by`, `properties`, and `refresh_method` configurations.
First, you need to add the catalog to StarRocks. The following is an example for Hive:

```sql
CREATE EXTERNAL CATALOG `hive_catalog`
PROPERTIES (
    "hive.metastore.uris" = "thrift://127.0.0.1:8087",
    "type" = "hive"
);
```
Instructions for adding other types of catalogs can be found in the documentation: https://docs.starrocks.io/en-us/latest/data_source/catalog/catalog_overview

Then write the sources.yaml file:
```yaml
sources:
  - name: external_example
    schema: hive_catalog.hive_db
    tables:
      - name: hive_table_name
```
Finally, you can reference the source in your models with the `source` macro:

```sql
{{ source('external_example', 'hive_table_name') }}
```
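For instance, a model materializing that external table into StarRocks might look like this sketch (the model name is hypothetical):

```sql
-- models/hive_table_copy.sql (hypothetical model name): build an
-- internal StarRocks table from the external Hive source above.
{{ config(materialized='table') }}
select *
from {{ source('external_example', 'hive_table_name') }}
```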
Add a new `incremental_strategy` property that supports the following values:
- `default` (or omitted): standard inserts without overwrite.
- `insert_overwrite`: applies overwrite with `dynamic_overwrite = false` to the inserts.
- `dynamic_overwrite`: applies overwrite with `dynamic_overwrite = true` to the inserts.
For more details on the different behaviors, see StarRocks' documentation for INSERT.
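As a rough sketch of the difference (illustrative SQL only; `dynamic_overwrite` is the session variable described in the StarRocks INSERT documentation, and the table name is a placeholder):

```sql
-- 'insert_overwrite' strategy: overwrite with dynamic partition
-- creation disabled.
SET dynamic_overwrite = false;
INSERT OVERWRITE my_table SELECT ...;

-- 'dynamic_overwrite' strategy (StarRocks >= 3.4): only the partitions
-- present in the incoming data are created or overwritten.
SET dynamic_overwrite = true;
INSERT OVERWRITE my_table SELECT ...;
```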
The implementation of the submittable ETL tasks is located in the `impl.py` file.
Setting `is_async: true` in your `profiles.yml` will enable submitting suitable ETL tasks using the `submit task` feature of StarRocks.
The wrapping is applied automatically to any statement that supports submission; writing the `submit task` wrapper manually is currently not supported by the adapter.
The following statements will be submitted automatically:
- `CREATE AS ... SELECT`
- `INSERT INTO|OVERWRITE`
- `CACHE SELECT ...`
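For example, with `is_async` enabled, a model's `CREATE AS ... SELECT` is wrapped roughly like this (a sketch; the adapter generates the actual statement, and the names are placeholders):

```sql
SUBMIT TASK AS
CREATE TABLE analytics.my_model AS
SELECT ...;
```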
Once the task has been submitted, the adapter will periodically poll StarRocks' `information_schema.task_runs` to retrieve the task status.
The polling is implemented using an exponential backoff, with a maximum delay of 10 minutes. The adapter's connection to the StarRocks cluster is not maintained during the waiting period; it is re-opened right before the next status polling phase.
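The status check amounts to a query along these lines (a sketch; the adapter's actual query and filters may differ):

```sql
SELECT STATE
FROM information_schema.task_runs
WHERE TASK_NAME = '<submitted_task_name>'  -- placeholder task name
ORDER BY CREATE_TIME DESC
LIMIT 1;
```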
Using the `async_query_timeout` property in the `profiles.yml` will control the value of `query_timeout` when submitting a task.
It is injected into the SQL query submitted to StarRocks:

```sql
submit /*+set_var(query_timeout={async_query_timeout})*/ task ...
```
```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: host
      port: 9030
      schema: schema
      username: username
      password: password
      is_async: true
      async_query_timeout: 3600  # 1 hour
```
Table pre-creation is useful for the following scenarios:
- Setting `AUTO_INCREMENT` columns.
- Setting explicit column data types.

`dbt run` doesn't support table pre-creation by default; this feature was added specifically to enable these StarRocks capabilities.
To configure tables for pre-creation, you need to do the following three things:

1. Add the `pre_create` configuration in your `dbt_project.yml` (this won't work if you set it up through Jinja config).

   Example `dbt_project.yml`:

   ```yaml
   models:
     # Set this at the top level of models
     pre_create:
       # Each model needing pre-creation should be declared separately
       my_model_to_pre_create:
         # List each column you want to include in the `insert` statement here
         insert_columns:
           - locus
     my_domain:
       # Add this in your domain configuration
       pre_create:
         # Disable the `pre_create` directory to prevent dbt from trying to create those models
         +enabled: false
   ```
2. Add a `pre_create` directory in your models directory with the pre-creation SQL statements. Each pre-creation `.sql` file should be prefixed with `template_`.
   Example dbt project structure with the `pre_create` directory:

   ```text
   dbt/
     macros/
     models/
       customers/
         pre_create/
           template_model_a.sql
           template_model_b.sql
           ...
         schema.yml
     tests/
     ...
   ```
3. Make sure your pre-creation SQL statements contain the `{relation_name}` placeholder as the relation name (e.g. `create table {relation_name} ...`).
   Example pre-creation `.sql` file:

   ```sql
   create table {relation_name} (
       id BIGINT AUTO_INCREMENT,
       value VARCHAR(64)
   )
   ```
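The model that pairs with a pre-created table then selects only the columns declared under `insert_columns`, letting StarRocks populate the `AUTO_INCREMENT` column. A sketch following the examples above (the upstream reference is hypothetical):

```sql
-- models/my_model_to_pre_create.sql: dbt inserts into the pre-created
-- table using only the declared insert_columns (here: locus).
select locus
from {{ ref('stg_loci') }}  -- hypothetical upstream model
```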
To run the functional tests, run the following:

```shell
python3 -m pytest tests/functional
```
We welcome you to contribute to dbt-starrocks. Please see the Contributing Guide for more information.