This project is under development.

The `dbt-starrocks` package contains all the code to enable dbt to work with StarRocks.
This is an experimental plugin:
- It has not been tested extensively
- StarRocks version 2.5.0 or higher is required
- Version 3.1.x is recommended
- StarRocks versions 2.4 and below are no longer supported
This plugin can be installed via pip:

```shell
$ pip install dbt-starrocks
```
| StarRocks <= 2.5 | StarRocks 2.5 ~ 3.1 | StarRocks >= 3.1 | StarRocks >= 3.4 | Feature |
|:---:|:---:|:---:|:---:|---|
| ✅ | ✅ | ✅ | ✅ | Table materialization |
| ✅ | ✅ | ✅ | ✅ | View materialization |
| ❌ | ❌ | ✅ | ✅ | Materialized View materialization |
| ❌ | ✅ | ✅ | ✅ | Incremental materialization |
| ❌ | ✅ | ✅ | ✅ | Primary Key Model |
| ✅ | ✅ | ✅ | ✅ | Sources |
| ✅ | ✅ | ✅ | ✅ | Custom data tests |
| ✅ | ✅ | ✅ | ✅ | Docs generate |
| ❌ | ❌ | ✅ | ✅ | Expression Partition |
| ❌ | ❌ | ❌ | ❌ | Kafka |
| ❌ | ❌ | ❌ | ✅ | Dynamic Overwrite |
| ❌ | *4 | *4 | ✅ | Submit task |
1. When the StarRocks version is < 2.5, `create table as` can only set `engine='OLAP'` and `table_type='DUPLICATE'`.
2. When the StarRocks version is >= 2.5, `create table as` supports `table_type='PRIMARY'`.
3. When the StarRocks version is < 3.1, `distributed_by` is required.
4. Verify the specific `submit task` support for your version; see SUBMIT TASK.
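For example, on a cluster older than 3.1 a table model must declare `distributed_by` explicitly, since auto bucketing is unavailable. A minimal sketch (the model and column names below are illustrative, not from this project):

```sql
-- models/orders.sql (hypothetical model): on StarRocks < 3.1 the
-- distribution key must be set explicitly; auto bucketing is not available.
{{ config(
    materialized='table',
    engine='OLAP',
    table_type='DUPLICATE',
    distributed_by=['order_id'],
    buckets=3
) }}
select * from {{ ref('stg_orders') }}  -- hypothetical upstream model
```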
Example entry for `profiles.yml`:

```yaml
starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: localhost
      port: 9030
      schema: analytics
      username: your_starrocks_username
      password: your_starrocks_password
```
| Option | Description | Required? | Example |
|---|---|---|---|
| type | The specific adapter to use | Required | `starrocks` |
| host | The hostname to connect to | Required | `192.168.100.28` |
| port | The port to use | Required | `9030` |
| schema | Specify the schema (database) to build models into | Required | `analytics` |
| username | The username to use to connect to the server | Required | `dbt_admin` |
| password | The password to use for authenticating to the server | Required | `correct-horse-battery-staple` |
| version | Lets the plugin assume a specific StarRocks version for compatibility | Optional | `3.1.0` |
| use_pure | Set to `"true"` to use the pure-Python MySQL connector instead of the C extension | Optional | `true` |
| is_async | Set to `"true"` to submit suitable tasks as ETL tasks | Optional | `true` |
| async_query_timeout | Sets the `query_timeout` value when submitting a task to StarRocks | Optional | `300` |
More details about setting `use_pure` and other connection arguments can be found here.

More details about the use of `is_async` can be found here.
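Putting the optional settings together, a target might look like the following sketch (the host and credentials are placeholders):

```yaml
starrocks:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: starrocks-fe.example.com   # placeholder host
      port: 9030
      schema: analytics
      username: dbt_admin
      password: correct-horse-battery-staple
      version: 3.1.0   # optional: StarRocks version the plugin should assume
      use_pure: true   # optional: use the pure-Python MySQL connector
```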
The following StarRocks-specific model configurations are available:

```yaml
models:
  materialized: table        # table, view, materialized_view or incremental
  engine: 'OLAP'
  keys: ['id', 'name', 'some_date']
  table_type: 'PRIMARY'      # PRIMARY, DUPLICATE or UNIQUE
  distributed_by: ['id']
  buckets: 3                 # leave empty for auto bucketing
  indexs: [{'columns': 'idx_column'}]
  partition_by: ['some_date']
  partition_by_init: ["PARTITION p1 VALUES [('1971-01-01 00:00:00'), ('1991-01-01 00:00:00')),PARTITION p1972 VALUES [('1991-01-01 00:00:00'), ('1999-01-01 00:00:00'))"]
  order_by: ['some_column']  # only for PRIMARY table_type
  partition_type: 'RANGE'    # RANGE, LIST or Expr; must be used together with partition_by.
                             # Expr requires an expression (e.g. date_trunc) in partition_by.
  properties: {"replication_num": "1", "in_memory": "true"}
  refresh_method: 'async'    # only for materialized views; default is manual
  incremental_strategy: 'dynamic_overwrite'  # for materialized='incremental' in version >= 3.4;
                                             # supported values: ['default', 'insert_overwrite', 'dynamic_overwrite']
```
Example configurations:

```sql
{{ config(materialized='view') }}
{{ config(materialized='table', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='table', indexs=[{ 'columns': 'idx_column' }]) }}
{{ config(materialized='table', partition_by=['date_trunc("day", first_order)'], partition_type='Expr') }}
{{ config(materialized='table', table_type='PRIMARY', keys=['customer_id'], order_by=['first_name', 'last_name']) }}
{{ config(materialized='incremental', table_type='PRIMARY', engine='OLAP', buckets=32, distributed_by=['id']) }}
{{ config(materialized='incremental', partition_by=['my_partition_key'], partition_type='Expr', incremental_strategy='dynamic_overwrite') }}
{{ config(materialized='materialized_view') }}
{{ config(materialized='materialized_view', properties={"storage_medium":"SSD"}) }}
{{ config(materialized='materialized_view', refresh_method="ASYNC START('2022-09-01 10:00:00') EVERY (interval 1 day)") }}
```
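Combining the partition options above, a RANGE-partitioned table might be configured like this sketch (partition names and date boundaries are illustrative):

```sql
{{ config(
    materialized='table',
    partition_by=['some_date'],
    partition_type='RANGE',
    partition_by_init=[
        "PARTITION p1 VALUES [('1971-01-01 00:00:00'), ('1991-01-01 00:00:00'))",
        "PARTITION p2 VALUES [('1991-01-01 00:00:00'), ('1999-01-01 00:00:00'))"
    ]
) }}
```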
Note: materialized views only support the `partition_by`, `buckets`, `distributed_by`, `properties`, and `refresh_method` configurations.
First, you need to add the catalog to StarRocks. The following is an example for Hive:

```sql
CREATE EXTERNAL CATALOG `hive_catalog`
PROPERTIES (
    "hive.metastore.uris" = "thrift://127.0.0.1:8087",
    "type" = "hive"
);
```
Instructions for adding other types of catalogs can be found in the documentation: https://docs.starrocks.io/en-us/latest/data_source/catalog/catalog_overview

Then write the sources.yaml file:
```yaml
sources:
  - name: external_example
    schema: hive_catalog.hive_db
    tables:
      - name: hive_table_name
```
Finally, you can reference the source in your models with the `source` macro:

```sql
{{ source('external_example', 'hive_table_name') }}
```
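For instance, a model materializing that external table into StarRocks might look like this sketch (the model name is hypothetical):

```sql
-- models/hive_table_copy.sql (hypothetical model name): build an
-- internal StarRocks table from the external Hive source above.
{{ config(materialized='table') }}
select *
from {{ source('external_example', 'hive_table_name') }}
```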
Add a new `incremental_strategy` property that supports the following values:
- `default` (or omitted): standard inserts without overwrite.
- `insert_overwrite`: applies overwrite with `dynamic_overwrite = false` to the inserts.
- `dynamic_overwrite`: applies overwrite with `dynamic_overwrite = true` to the inserts.
For more details on the different behaviors, see StarRocks' documentation for INSERT.
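As a rough sketch of the difference (illustrative SQL only; `dynamic_overwrite` is the session variable described in the StarRocks INSERT documentation, and the table name is a placeholder):

```sql
-- 'insert_overwrite' strategy: overwrite with dynamic partition
-- creation disabled.
SET dynamic_overwrite = false;
INSERT OVERWRITE my_table SELECT ...;

-- 'dynamic_overwrite' strategy (StarRocks >= 3.4): only the partitions
-- present in the incoming data are created or overwritten.
SET dynamic_overwrite = true;
INSERT OVERWRITE my_table SELECT ...;
```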
The implementation of the submittable ETL tasks is located in the `impl.py` file.
Setting `is_async: true` in your `profiles.yml` will enable submitting suitable ETL tasks using the `submit task` feature of StarRocks.
The wrapping is applied automatically to any statement that supports submission; writing the `submit task` wrapper manually is currently not supported by the adapter.
The following statements will be submitted automatically:
- `CREATE AS ... SELECT`
- `INSERT INTO|OVERWRITE`
- `CACHE SELECT ...`
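For example, with `is_async` enabled, a model's `CREATE AS ... SELECT` is wrapped roughly like this (a sketch; the adapter generates the actual statement, and the names are placeholders):

```sql
SUBMIT TASK AS
CREATE TABLE analytics.my_model AS
SELECT ...;
```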
Once the task has been submitted, the adapter will periodically poll StarRocks' `information_schema.task_runs` to retrieve the task status.
The polling is implemented using an exponential backoff, with a maximum delay of 10 minutes. The adapter's connection to the StarRocks cluster is not maintained during the waiting period; it is re-opened right before the next status polling phase.
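The status check amounts to a query along these lines (a sketch; the adapter's actual query and filters may differ):

```sql
SELECT STATE
FROM information_schema.task_runs
WHERE TASK_NAME = '<submitted_task_name>'  -- placeholder task name
ORDER BY CREATE_TIME DESC
LIMIT 1;
```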
Using the `async_query_timeout` property in the `profiles.yml` will control the value of `query_timeout` when submitting a task.
It is injected into the SQL query submitted to StarRocks:

```sql
submit /*+set_var(query_timeout={async_query_timeout})*/ task ...
```
```yaml
my_profile:
  target: dev
  outputs:
    dev:
      type: starrocks
      host: host
      port: 9030
      schema: schema
      username: username
      password: password
      is_async: true
      async_query_timeout: 3600  # 1 hour
```
Table pre-creation is useful for the following scenarios:
- Setting `AUTO_INCREMENT` columns.
- Setting explicit column data types.

`dbt run` doesn't support table pre-creation by default; this feature was added specifically to enable these StarRocks capabilities.
To configure tables for pre-creation, you need to do the following three things:

1. Add the `pre_create` configuration in your `dbt_project.yml` (this won't work if you set it up through Jinja config).

   Example `dbt_project.yml`:

   ```yaml
   models:
     # Set this at the top level of models
     pre_create:
       # Each model needing pre-creation should be declared separately
       my_model_to_pre_create:
         # List each column you want to include in the `insert` statement here
         insert_columns:
           - locus
     my_domain:
       # Add this in your domain configuration
       pre_create:
         # Disable the `pre_create` directory to prevent dbt from trying to create those models
         +enabled: false
   ```
2. Add a `pre_create` directory in your models directory with the pre-creation SQL statements. Each pre-creation `.sql` file should be prefixed with `template_`.
   Example dbt project structure with the `pre_create` directory:

   ```text
   dbt/
     macros/
     models/
       customers/
         pre_create/
           template_model_a.sql
           template_model_b.sql
           ...
         schema.yml
     tests/
     ...
   ```
3. Make sure your pre-creation SQL statements contain the `{relation_name}` placeholder as the relation name (e.g. `create table {relation_name} ...`).
   Example pre-creation `.sql` file:

   ```sql
   create table {relation_name} (
       id BIGINT AUTO_INCREMENT,
       value VARCHAR(64)
   )
   ```
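The model that pairs with a pre-created table then selects only the columns declared under `insert_columns`, letting StarRocks populate the `AUTO_INCREMENT` column. A sketch following the examples above (the upstream reference is hypothetical):

```sql
-- models/my_model_to_pre_create.sql: dbt inserts into the pre-created
-- table using only the declared insert_columns (here: locus).
select locus
from {{ ref('stg_loci') }}  -- hypothetical upstream model
```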
To run the functional tests, run the following:

```shell
python3 -m pytest tests/functional
```
We welcome you to contribute to dbt-starrocks. Please see the Contributing Guide for more information.