Releases: datahub-project/datahub
v1.1.0rc3
What's Changed
- dataset cli - add support for schema, round-tripping to yaml by @chakru-r in #12764
- feat(ingestion/superset): ownership info for charts, dashboards and datasets by @PeteMango in #12750
- feat(ingest): allowdenypattern for dashboard, chart, dataset in superset by @kevinkarchacryl in #12782
- feat(models): adds subtypes to most entities in the model by @shirshanka in #12783
- fix: fixes mypy complaints about pkgresources by @sgomezvillamor in #12790
- fix(ingestion): fixes producing some URNs with reserved characters by @sgomezvillamor in #12772
- feat(okta): custom properties for okta user by @sgomezvillamor in #12773
- feat(mssql): adds subtypes aspect for dataflow and datajobs by @sgomezvillamor in #12775
- feat(searchBarAutocomplete): add feature flag for search bar's autocomplete redesign by @v-tarasevich-blitz-brain in #12690
- fix(ingest): enable fuzzy case resolution for oracle sql by @hsheth2 in #12778
- style: update azure.md by removing extra word by @alexbransky in #12780
- fix(ui): change tags to properties in ml model view by @yoonhyejin in #12789
- fix(ui) Fix changing color and icon for domains in UI by @chriscollins3456 in #12792
- Support container in ML Model Group, Model and Deployment by @ryota-cloud in #12793
- docs: update mlflow ingestion docs to include new concept mappings by @yoonhyejin in #12791
- fix(web) move form entity sidebar to right to align with cloud by @jayacryl in #12796
- doc(iceberg): iceberg doc updates by @chakru-r in #12787
- docs: add exporting from source to write mcp guide by @yoonhyejin in #12800
- feat(ingest/redshift): support for datashares lineage by @mayurinehate in #12660
- feat(ingestion/business-glossary): Automatically generate predictable glossary term and node URNs when incompatible URL characters are specified in term and node names. by @acrylJonny in #12673
- fix(ingestion/oracle): Improved foreign key handling by @acrylJonny in #11867
- feat(ingest/iceberg): Introduce network problems resiliency for Iceberg source by @skrydal in #12804
- chore(postgres): bump version by @david-leifker in #12808
- chore(aws): bump aws libraries by @david-leifker in #12809
- feat(api): URN, Entity, and Aspect name Async Validation by @david-leifker in #12797
- feat(ingest): improve extract-sql-agg-log command by @hsheth2 in #12803
- fix(UI): Showing platform instances only once by @sakethvarma397 in #12806
- fix: search cache invalidation for iceberg entities by @chakru-r in #12805
- feat(docs): Release for DataHub Cloud 0.3.8.2 by @pedro93 in #12811
- refactor(graphql): simplify getLastIngestionRun method by @trialiya in #12706
- docs(ingest): update metadata-ingestion dev guide by @hsheth2 in #12779
- fix(ingest/oracle): refresh golden files by @hsheth2 in #12818
- fix(openapi): fix openapi timeseries async ingestion by @david-leifker in #12812
- docs(ingest/mode): update mode workspace docs by @hsheth2 in #12774
- fix(ingestion/superset): fixed iterate over int error for building urns by @PeteMango in #12807
- fix(doc): Disable Algolia search by @treff7es in #12831
- fix(build): build improvements to help with incremental builds by @chakru-r in #12823
- feat(docs) add perms req to ai docs by @jayacryl in #12819
- Add variable to show full title in lineage by default by @Blize in #12078
- fix(doc): re-enable Algolia search by @hsheth2 in #12834
- feat(ui): support all entities with display names in browse paths v2 by @Masterchen09 in #11657
- feat(ingestion/mlflow): improve mlflow connector to pull run and experiments by @yoonhyejin in #12587
- fix(workflows): Update pr-labeler by @asikowitz in #12835
- chore(ruff): enable some ignored rules by @sgomezvillamor in #12815
- feat(ingest/redshift): lineage for external schema created from redshift by @mayurinehate in #12826
- feat(openapi-ingestion): implement openapi ingestion by @david-leifker in #12757
- fix(ui) Hide default filters we want to hide from impact analysis by @chriscollins3456 in #12843
- fix(ui) Fix submitting when selecting replacement in deprecation modal by @chriscollins3456 in #12842
- fix(UI): Multiple data product delete modals by @sakethvarma397 in #12781
- fix(graphql/search): Remove schema field and data process instance from default search types by @asikowitz in #12845
- docs: clear remote executor docs by @anshbansal in #12839
- build(deps): bump @babel/runtime from 7.24.4 to 7.26.10 in /docs-website by @dependabot in #12844
- build(deps): bump @babel/runtime-corejs3 from 7.24.4 to 7.26.10 in /docs-website by @dependabot in #12846
- build(deps): bump @babel/helpers from 7.24.4 to 7.26.10 in /docs-website by @dependabot in #12847
- fix(jaas): fix jaas login by @david-leifker in #12848
- feat(gql) allow unsetting optional incident fields by @jayacryl in #12801
- fix(ingest/dynamodb): pass env to dataset urn function by @anshbansal in #12853
- feat(models): Add edges fields to data process instance relationship aspects by @asikowitz in #12860
- feat(ui): Update ExternalUrlButton to include self-hosted gitlab URLs by @k7ragav in #12734
- fix(ui) Support glossary nodes in autocomplete by @chriscollins3456 in #12858
- feat(ingest/mlflow): update dpi to use edge for lineage by @yoonhyejin in #12861
- fix(ge-profiler): catch TimeoutError by @sgomezvillamor in #12855
- fix(databricks): fixes profile median by @sgomezvillamor in #12856
- fix(ingest): fix error in deploy command by @hsheth2 in #12820
- docs(ingest): custom transformer remote executor by @anshbansal in #12864
- feat(restore-indices): createDefaultAspects argument by @david-leifker in #12859
- ci(tests):show cypress smoke tests in junit format for better reporting by @chakru-r in #12865
- feat(ingest/salesforce): include formula in field description by @mayurinehate in #12840
- feat(ingestion-tracing): implement ingestion with tracing api by @david-leifker in #12714
- hotfix(ui): Addressing assertions hotfixes by @jjoyce0510 in #12785
- feat(ingestion) Adding vertexAI ingestion source (v1 - model group and model) by @ryota-cloud in #12632
- feat(ingest/hive): identify partition columns in hive tables by @deepgarg-visa in #12833
- fix(api-tracing): handle corner case for historic by @david-leifker in #12870
- docs(website) update docusaurus config by @maggiehays in #12862
- feat(system-metrics): track api usage by user, client, api by @david-leifker in #12872
- fix(ingest/snowfla...
v1.1.0rc2
v1.1.0rc1
v1.0.0
DataHub v1.0.0
Release Highlights
DataHub v1.0.0 is packed with exciting updates, including:
- A completely redesigned user experience focused on simplified navigation and a visually stunning interface.
- Unified support for Data & AI, including AI Model Group Versions, AI Model Lineage, Model Stats, and Experiment/Run ingestion.
- DataHub Iceberg Catalog, allowing users to manage Iceberg tables directly from DataHub.
Read the blog post here!
Changelog
New User Interface: Putting Usability First
With a completely re-designed user interface, DataHub v1.0 represents a fundamental rethinking of how users interact with their metadata and data assets. The new experience includes:
- Intuitive Platform-Based Navigation - Hierarchically browse data by database and schema in Snowflake, BigQuery, Redshift, Databricks, and more. Combine hierarchical navigation with filtering by data owners, domain, tags, and glossary terms to find the right data fast.
- Seamless Lineage Exploration - Our reimagined lineage view features multi-level expansion, name-based search, and column-level visibility, making it easier than ever to understand data relationships and impact.
- Integrated Data Quality - Make confident decisions with deeply integrated quality signals throughout the platform, helping you quickly identify and trust reliable data assets.
DataHub Admins can enable the new UI for all users by setting the THEME_V2_DEFAULT environment variable to true; until then, Users can opt into the new experience by navigating to Settings > Appearance > Try New User Experience.
Comprehensive AI Asset Support: Unifying Data and AI
DataHub v1.0 treats AI assets as first-class citizens within the data ecosystem, allowing users to track their entire data-to-AI pipeline in one place.
- Unified Search and Discovery: Seamlessly search across models, model groups, and traditional data assets in one unified interface.
- Advanced Versioning System: Track multiple versions of datasets and ML models with detailed performance metrics and clear linkages between versions.
- Rich Model Statistics: Monitor key metrics across versions, understand performance trends, and make data-driven decisions about model deployment.
- End-to-End Lineage: Trace data flows from raw inputs through models to final outputs, with complete versioning support.
DataHub Iceberg REST Catalog Beta: Simplifying Data Lake Management
This release introduces an integration with Apache Iceberg, allowing users to manage Iceberg tables directly through DataHub, including:
- Create and manage Iceberg tables through DataHub
- Maintain consistent metadata across DataHub and Iceberg
- Facilitate data discovery by exposing Iceberg table metadata in DataHub
- Enable secure access to Iceberg tables through DataHub's permissions model
Read the docs here!
DataHub CLI
This release introduces the following improvements to our CLI:
- Added container command to apply tags, terms, and owners on all assets within the container. [#12418, #12436]
- Improved delete command to optionally reference a file with a list of URNs to be deleted. [#12247]
- Expanded ingest command to support ingesting MCPs from S3. [#12649]
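As a rough illustration of preparing a URN list for the delete command, the sketch below builds dataset URNs in DataHub's standard textual form and writes them to a file. The one-URN-per-line layout, the file name, and the helper names make_dataset_urn and write_urn_file are assumptions for this example; the release note does not specify the file format or the CLI option that reads it, so check the CLI documentation before relying on it.

```python
import tempfile
from pathlib import Path
from typing import Iterable


def make_dataset_urn(platform: str, name: str, env: str = "PROD") -> str:
    """Build a dataset URN in DataHub's standard textual form."""
    return f"urn:li:dataset:(urn:li:dataPlatform:{platform},{name},{env})"


def write_urn_file(urns: Iterable[str], path: Path) -> int:
    """Write de-duplicated URNs, one per line (assumed format); return the count."""
    # dict.fromkeys keeps the first occurrence of each URN and preserves order.
    unique = list(dict.fromkeys(u.strip() for u in urns if u.strip()))
    path.write_text("\n".join(unique) + "\n", encoding="utf-8")
    return len(unique)


if __name__ == "__main__":
    urns = [
        make_dataset_urn("snowflake", "analytics.public.orders"),
        make_dataset_urn("snowflake", "analytics.public.orders"),  # duplicate
        make_dataset_urn("bigquery", "proj.sales.customers", env="DEV"),
    ]
    with tempfile.TemporaryDirectory() as tmp:
        target = Path(tmp) / "urns_to_delete.txt"  # hypothetical file name
        count = write_urn_file(urns, target)
        print(f"wrote {count} URNs to {target.name}")
```

De-duplicating before writing keeps the file short and avoids asking the CLI to process the same entity twice.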
Metadata Ingestion
We’re continuously improving our integrations to add new capabilities and squash bugs.
- dbt: Added the parameter include_database_name to support including the database name in URN generation. [#12411]
- Iceberg: Alongside our new Iceberg Catalog API, we’ve made various improvements to our Iceberg integration. [#12744]
- MLFlow: Significantly revamped our MLFlow connector, adding support for tracking Model Group Versions and Model Stats; tracking Model lineage to underlying datasets; and capturing Experiments and Runs.
- MSSQL: Improved support for extracting stored procedures from MS SQL. [#12244, #12563]
- Oracle: Improved the accuracy of column-level lineage resolution.
- PowerBI: Improved lineage mapping so PowerBI Reports can now contain PowerBI Dashboards. [#12451]
- Redshift: Added support for data shares and external schemas, including automatic lineage resolution across Redshift namespaces.
- S3: Added functionality to the S3 ingestion process to ignore paths that do not match the specified depth, resolving warning messages triggered by mismatched paths. [#12326]
- Snowflake: Added support for Snowflake Streams and Hybrid Tables, and fixed a bug with lineage resolution across table renames. [#12318]
- Superset (community contribution!): Added support for Superset virtual datasets and lineage. [#12679]
Additionally, we’re working on a new integration with Vertex AI. Please reach out if you’re interested in joining the beta.
Of course, this only scratches the surface of changes. This release contains 100+ improvements across 25 different integrations.
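To make the dbt include_database_name parameter mentioned above more concrete, here is a minimal sketch of a recipe expressed as a Python dictionary. Only include_database_name comes from these release notes; the helper build_dbt_recipe is hypothetical, and the paths, target platform, and sink server are placeholder assumptions that should be checked against the dbt source documentation for your DataHub version.

```python
import json
from typing import Any, Dict


def build_dbt_recipe(
    manifest_path: str,
    catalog_path: str,
    target_platform: str,
    *,
    include_database_name: bool,
) -> Dict[str, Any]:
    """Return a recipe-shaped dict for the dbt source (illustrative only)."""
    return {
        "source": {
            "type": "dbt",
            "config": {
                "manifest_path": manifest_path,
                "catalog_path": catalog_path,
                "target_platform": target_platform,
                # Parameter named in the release notes (#12411): controls
                # whether the database name is part of generated URNs.
                "include_database_name": include_database_name,
            },
        },
        # Sink values are placeholders, not a recommendation.
        "sink": {
            "type": "datahub-rest",
            "config": {"server": "http://localhost:8080"},
        },
    }


if __name__ == "__main__":
    recipe = build_dbt_recipe(
        "target/manifest.json",
        "target/catalog.json",
        "snowflake",
        include_database_name=False,
    )
    print(json.dumps(recipe, indent=2))
```

The flag is keyword-only and has no default here, so the caller has to choose explicitly rather than inherit a value this sketch might get wrong.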
Thank You to our Contributors!
First-Time Contributors
@Bhadhri03 @brock-acryl @cccs-cat001 @davidebriscese @Deepalijain13 @dougbot01 @Haebuk @haon85 @josges @mihai103 @rajatgl17 @Rasnar @rharisi @samanthafigueredo5 @ttekampe
Repeat Contributors
@bda618 @deepgarg-visa @eagle-25 @jayasimhankv @ksrinath @llance @Masterchen09 @mayurinehate @mkamalas @PeteMango @pinakipb2 @remisalmon @sagar-salvi-apptware @svdimchenko @v-tarasevich-blitz-brain
Project Maintainers
@anshbansal @asikowitz @chakru-r @chriscollins3456 @david-leifker @gabe-lyons @hsheth2 @jayacryl @jjoyce0510 @kevinkarchacryl @pedro93 @RyanHolstien @ryota-cloud @sakethvarma397 @sgomezvillamor @shirshanka @skrydal @treff7es @yoonhyejin
View the full changelog: v0.15.0.1...v1.0.0
v1.0.0rc5
Full Changelog: v1.0.0rc4...v1.0.0rc5
v1.0.0rc4
Full Changelog: v1.0.0rc3...v1.0.0rc4
v1.0.0rc3
What's Changed
- fix(filters) Fix autocomplete for platforms and improve advanced search builder by @chriscollins3456 in #12560
- fix(ingest): handle groups in pattern_cleanup_ownership transformer by @cccs-cat001 in #12536
- tests(druid): integration tests for druid ingestion by @sgomezvillamor in #12717
- feat(api): let admins use granted privileges for actors by @anshbansal in #12718
- feat(build): use pull_request_target for datahub-wheels by @hsheth2 in #12722
- feat(ui): access management docs by @kevinkarchacryl in #12719
- fix(lineage): error message for edit lineage by @anshbansal in #12724
- docs: clarify limits on AI docs by @hsheth2 in #12728
- fix(urn-validation): additional test cases for urn validation by @david-leifker in #12727
- fix(ui) Fix NPE in pluralize function by @chriscollins3456 in #12629
- Fix platform instance support on Druid ingestion by @Rasnar in #12716
- ci(coverage): update patch coverage threshold by @chakru-r in #12733
- fix(ui) Fix bug with date dropdown in deprecation modal by @chriscollins3456 in #12633
- fix(ui) Fix group membership inconsistencies on group page by @chriscollins3456 in #12704
- fix(ui) Properly get display name when downloading search results by @chriscollins3456 in #12720
- fix(ingest): bump avro dep by @hsheth2 in #12729
- fix(ui) Filter healthy assets out of unhealthy upstreams component by @chriscollins3456 in #12705
- docs: update slack link by @hsheth2 in #12731
- fix(build): support datahub-wheels from forked PRs by @hsheth2 in #12730
- docs: add scarf integration by @hsheth2 in #12739
- fix(iceberg-cli): add missing filter for iceberg dataplatform by @chakru-r in #12732
- dev: immutable args remove by @anshbansal in #12735
- build(deps): bump dompurify from 2.5.4 to 3.2.4 in /datahub-web-react by @dependabot in #12643
- refactor(ui): Migrate to use the new Button component consistently by @jjoyce0510 in #12597
- docs(restore-indices): added best practices by @david-leifker in #12741
- feat(ui/lineageV2): Show version pill in lineage sidebar and node by @asikowitz in #12599
- chore(bump): Bump kafka-setup base by @david-leifker in #12743
- dev: enable ruff rule by @anshbansal in #12742
- revert(ci): revert datahub-wheel build changes by @hsheth2 in #12747
- feat: API key support in Metabase source by @rajatgl17 in #12711
- dev: enable ruff rule by @anshbansal in #12749
- refactor(ingest/s3): enhance readability by @eagle-25 in #12686
- feat(ingestion/superset): superset dataset lineage for metadata ingestion by @PeteMango in #12679
- chore(ci): avoid dep on confluent-kafka 2.8.1 by @hsheth2 in #12753
- feat(graphql): implement sort and facet for scroll by @david-leifker in #12746
- feat(ingest): improve error messages for unknown metadata objects by @hsheth2 in #12745
- fix(web) accurate error message for embeddedlistsearch by @jayacryl in #12622
- feat(ingestion/iceberg): Several improvements to iceberg connector by @skrydal in #12744
- fix(ingest): support pydantic v2 in file-based lineage by @hsheth2 in #12723
- feat(iceberg): improve concurrency control and resilience by @ksrinath in #12664
- docs(users+groups): show that you can set title via users YAML by @gabe-lyons in #12767
- feat(sdk): add search client by @hsheth2 in #12754
- feat(operations): ES and Kafka Operations Endpoints by @david-leifker in #12756
- feat(auth): support guest access by @chakru-r in #12619
- fix(iceberg): listnamespaces includes warehouse name as root by @chakru-r in #12761
- feat(UI): make searchbar centered and wider by @v-tarasevich-blitz-brain in #12666
- fix(ui) Fix order of parent containers on v2 autocomplete item by @chriscollins3456 in #12721
- fix(test): handle empty log by @david-leifker in #12768
- fix(lineage) Support views and sorting in impact analysis by @chriscollins3456 in #12769
- feat(versioning): Support entity versioning ingestion by @asikowitz in #12755
- fix(ui): add overflow wrap for dpi / model summary tab & add custom properties in mlmodelgroup queries by @yoonhyejin in #12771
- feat(sdk): add support for institutional memory links by @hsheth2 in #12770
New Contributors
- @Rasnar made their first contribution in #12716
- @rajatgl17 made their first contribution in #12711
- @PeteMango made their first contribution in #12679
- @v-tarasevich-blitz-brain made their first contribution in #12666
Full Changelog: v1.0.0rc2...v1.0.0rc3
v1.0.0rc2
What's Changed
- docs(ingest/mode): add details on authentication/permissions for mode by @hsheth2 in #12508
- fix(ingest/snowflake): Create all structured property templates before assignation by @treff7es in #12469
- docs: fix token to be not required in sample script by @yoonhyejin in #12511
- fix(mssql): adds missing containers and browsepathsv2 for dataflow and datajob by @sgomezvillamor in #12483
- fix(ingest/glue): change to warning on access denied by @anshbansal in #12519
- fix(ingest/mode): remove unused field by @anshbansal in #12520
- docs: fix link to executor helm chart by @anshbansal in #12522
- fix(ingest): add missing dep for gcs by @hsheth2 in #12505
- docs(entity-change-events): add docs for action request events by @gabe-lyons in #12493
- docs(ingest): script to add ERModelRelationship Entity by @sagar-salvi-apptware in #12473
- refactor(trace-model): refactor trace model package by @david-leifker in #12510
- fix(ci): run smoke tests on release by @chakru-r in #12518
- chore(bump): bump jmx version by @david-leifker in #12524
- fix(cli): avoid false positive cli upgrade suggestions by @hsheth2 in #12497
- fix(ingest/azure-ad): limit the size of the ingestion report by @hsheth2 in #12498
- feat(metadata-io): enable rollback transaction support by @david-leifker in #12509
- feat(snowflake): add missing pushdown_deny_usernames config to be used when use_queries_v2 by @sgomezvillamor in #12527
- fix(model): fixes DashboardContainsDashboard relationship in DashboardInfo aspect by @sgomezvillamor in #12433
- feat(restoreIndices): update restore indices args and docs by @RyanHolstien in #12529
- fix(businessAttribute): fix business Attribute related entities by @deepgarg-visa in #12537
- fix(ui): make data process instance visible in container in V2& fix model/modelgroup names by @yoonhyejin in #12513
- fix(ingest): avoid multiprocessing "fork" start method by @hsheth2 in #12543
- fix(ui): revert backend breaking changes to mau by @kevinkarchacryl in #12461
- tests(kafka-connect): fixes integration tests setup by @sgomezvillamor in #12531
- fix(ingest/unity): add row count in table profile of delta tables by @mayurinehate in #12480
- fix(ingest): use lossy collections by @anshbansal in #12523
- fix(misc-openapi): fix openlineage, platform events & swagger by @david-leifker in #12539
- fix(test): move reading env variable inside method by @anshbansal in #12549
- feat(versioning): Add V2 UI; make backend more synchronous; add to component library by @asikowitz in #12542
- docs(iceberg): add iceberg user guide by @chakru-r in #12533
- feat(ingestion/snowflake):adds streams as a new dataset with lineage and properties. by @brock-acryl in #12318
- feat(powerbi): Report to Dashboard lineage by @sgomezvillamor in #12451
- fix(no-rows-updated): fix no rows updated by @david-leifker in #12530
- ci(smoke): report smoke test results to codecov by @hsheth2 in #12556
- feat(UI): Confirmation before deleting Link by @pinakipb2 in #12162
- feat(ingestion/s3): ignore depth mismatched path by @eagle-25 in #12326
- feat(docs-site) adding case studies and updating banner by @jayacryl in #12525
- feat(ingestion/mongodb) re-order aggregation logic by @Haebuk in #12428
- docs(salesforce): add missing salesforce source to cli doc by @remisalmon in #12550
- feat(openapi): precondition exceptions return 412 by @david-leifker in #12552
- feat(openapi): point in time parameter (elasticsearch only) by @david-leifker in #12553
- fix(openapi-spec): fix openapi spec oneOf schema by @david-leifker in #12561
- fix(autocomplete): fix autocomplete duplicate field by @david-leifker in #12558
- build(deps): bump black from 23.7.0 to 24.3.0 in /metadata-service/iceberg-catalog by @dependabot in #12502
- feat(sdk): add scaffolding for sdk v2 by @hsheth2 in #12554
- doc(dbt): Add missing dbt extra requirement to cli doc by @remisalmon in #12568
- feat(docs): Add live secret reload docs to k8s remote executor page by @pedro93 in #12541
- fix(ingest): remove duplicate mcps,more typing by @mayurinehate in #12557
- doc: update doc of first release by @anshbansal in #12574
- fix(docs) explain need to restore indices when adding @searchable by @jayacryl in #12576
- fix(sdk): fix platform instance generation in the sdk by @hsheth2 in #12573
- fix(looker): sort user mapping for consistency by @hsheth2 in #12569
- fix(ingestion/teradata): teradata profiling fix for pooling by @brock-acryl in #12507
- fix(structuredProps) Add validation for allowedTypes and harden API for invalid types by @chriscollins3456 in #12578
- fix(ui): better experience for analytics charts by @kevinkarchacryl in #12462
- feat(ingest/mssql): improve stored procedure splitting by @hsheth2 in #12563
- docs: add page on metadata standards by @hsheth2 in #12584
- feat(gh-workflows) adding jayacryl to pr-labeler by @jayacryl in #12579
- fix(iceberg): delete associated platform resources when deleting warehouse by @chakru-r in #12564
- feat(ingest): add display name for dynamodb tables by @mayurinehate in #12534
- fix(ui) Show editable field info for fields based on exact fieldPath version by @chriscollins3456 in #12570
- fix(openapi-schema): fix openapi schema generator by @david-leifker in #12590
- feat(ingestion/dbt): Add include_database_name parameter for dbt core by @svdimchenko in #12411
- fix(web) ingestion page resets when filter updated by @jayacryl in #12589
- dev: update pre-commit config by @anshbansal in #12592
- feat(UI): add user location to user profile page by @samanthafigueredo5 in #12016
- fix(graphql): Skip schema fields with empty fieldPaths to prevent the dataset mapper from erroring out by @jayasimhankv in #12562
- feat(graphql,ui): Update ML system V2 UI by @asikowitz in #12598
- fix(url-encoding): fix regression in url encoding by @david-leifker in #12601
- fix(ingest/snowflake): order queries for queries_v2 by @hsheth2 in #12551
- feat(ci): add pytest hooks for updating golden files by @hsheth2 in #12581
- fix(ingest): pick topics from config for sink connector by @mayurinehate in #12535
- doc: add note for subscription by @anshbansal in #12607
- feat(okta): adds ingest_groups_users config parameter by @sgomezvillamor in #12371
- feat(urn-validation): Add UrnValidation PDL annotation by @david-leifker in #12572
- feat(search): include timestamp for entity metadata change by @deepgarg-visa in https://github.com/d...
v1.0.0rc1
fix(ci): disable ci telemetry modelDocUpload (#12504)
v0.15.0.1
DataHub v0.15.0.1 Release Notes
🎵 Listen to this release's theme song on Suno: Structured Flow
Shoutout to @DSchmidtDev for this genre inspo for this round!
Structured Properties
- Added comprehensive support for managing structured properties, including creation, editing, deletion, and display preferences. Introduced timestamps for tracking creation and modification. [#12100, #11419]
- Enhanced property display options with badge styling, custom column types, and configurable visibility settings in asset sidebars and schema fields. [#12111, #12052]
- Added structured property filtering in UI with improved aggregation logic and entity metadata display. Introduced new property validators and display settings. [#12097, #12099]
UI Enhancements
- Enhanced container organization with parent hierarchy labels. [#11705]
- Added support for markdown in incident descriptions, enabling rich formatting capabilities. [#11759]
- Improved ingestion reporting with better visibility of successful ingestions with warnings. Enhanced browse paths display for business attributes and schema fields. [#11704, #11585]
- Added support for timeseries aspects in OpenAPI and customizable date range fields for Analytics charts. [#12096, #11366]
Authorization & Authentication
Metadata Ingestion
Ingestion Framework Improvements
- Enhanced Data Source Support: Expanded ingestion capabilities for multiple platforms, including Superset (with dataset entities, schema fields, and column-level lineage), Feast (supporting tags and owners ingestion), Neo4j, and Cassandra. Added stateful ingestion support for file sources. [#11688, #11784, #11804, #11526, #11822]
- SQL Processing Improvements: Replaced vulnerable sqlparse dependency with an in-house SQL parser, optimized CLL generation with reduced memory usage, and added special handling for MSSQL case sensitivity. Enhanced multi-query lineage support for Snowflake temporary tables. [#11645, #11708, #11920, #12020]
- CLI Enhancements: Introduced new commands for managing ingestion, including listing source runs with filtering capabilities, undoing soft deletes with platform filtering, and listing structured properties. Added an offline flag to the SQL parser CLI. [#11740, #11980, #12012, #12283, #11635]
- Ownership and Metadata Management: Extended ownership transformer capabilities across entities, improved glossary sync to preserve custom ownership types, and added support for multiple ownership types in glossaries and terms. Enhanced Forms CLI with additional filters for subtypes, platform instances, owners, tags, and glossary terms. [#11700, #11545, #12050, #10979]
- Core Infrastructure Improvements: Implemented unique URN generation for all entities, added support for efficient entity ingestion through get_entity_as_mcps, improved empty field handling, and introduced progress reporting during ingestion. Added execution request cleanup job and support for dropping duplicate schema fields. [#11676, #11425, #11613, #12117, #11765, #12308]
Source-Specific Ingestion Improvements
Airflow
- Upgraded infrastructure with support for Airflow 2.10, deprecated versions below 2.3, and improved template handling with Jinja support. Added configuration options for dag patterns and environment variables. [#11300, #11371, #11472, #11537, #11579, #12056]
- Enhanced error handling and debugging with improved logging, fixed plugin stability issues on EMR, and added support for AthenaOperator lineage extraction. Introduced ability to disable plugin without restart. [#11857, #11877, #11880, #12098]
BigQuery
- Enhanced data modeling capabilities with support for foreign/primary keys, BigLake tables, and improved handling of external tables. Added support for region qualifiers and partition management. [#11686, #11728, #11874, #11940]
- Improved lineage tracking with GCS data source support and optimized query performance. Added platform resource entity generation from BigQuery labels. [#11442, #11492, #11534, #11602]
- Enhanced profiling and performance with better type handling and size limits. Fixed issues with tag synchronization and platform instance settings. [#11807, #12060]
Dagster
- Added support for skipping Asset ingestion, fixed input/output value formatting, and improved compatibility with latest Dagster versions (v1.9.6). Deprecated Python 3.8 support. [#11262, #11481, #12121, #12189]
dbt
- Improved performance and functionality with node_name_patterns for faster CLL processing, support for multiple test paths, and better handling of custom owner types. [#11450, #11460, #11848]
- Enhanced lineage handling by preventing cycles in SQL parsing and supporting multiple dataset assertions for tests. Added support for dbt Cloud's Explore page. [#11666, #11451, #12223]
Snowflake
- Expanded support for various table types, including secure, dynamic, and hybrid tables. Enhanced lineage capabilities for renames, swaps, and external tables. [#11600, #12039, #12094, #12179]
- Improved authentication with OAuth support and token management. Added incremental property processing and structured property support for tags. [#11888, #12048, #12080, #12285]
- Enhanced error handling and logging with better parse failure reporting and dot handling in table names. [#12105, #12110, #12153]
Tableau
- Enhanced project management with new path pattern filtering and improved handling of hidden assets. Added support for access roles and group permissions. [#10855, #11157, #11559]
- Improved API integration with retry logic for various error codes (502, 504), better authentication handling, and consistent page size application. [#12213, #12216, #12233]
- Enhanced reporting and debugging capabilities while maintaining efficient performance and proper permission handling. [#12015, #12024, #12175]
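The retry behavior described above for 502 and 504 responses can be pictured with a generic retry-with-backoff helper. This is a sketch of the general technique, not the Tableau connector's actual implementation: HttpError, call_with_retry, the attempt count, and the exponential delay schedule are all assumptions made for illustration.

```python
import time
from typing import Callable, TypeVar

T = TypeVar("T")

# Status codes named in the release notes as retried by the connector.
RETRYABLE_STATUS = (502, 504)


class HttpError(Exception):
    """Hypothetical error type carrying an HTTP status code."""

    def __init__(self, status: int) -> None:
        super().__init__(f"HTTP {status}")
        self.status = status


def call_with_retry(
    fn: Callable[[], T],
    max_attempts: int = 3,
    base_delay: float = 1.0,
    sleep: Callable[[float], None] = time.sleep,
) -> T:
    """Call fn, retrying on 502/504 with exponential backoff."""
    if max_attempts < 1:
        raise ValueError("max_attempts must be at least 1")
    attempt = 1
    while True:
        try:
            return fn()
        except HttpError as exc:
            # Give up on non-retryable codes or when attempts are exhausted.
            if exc.status not in RETRYABLE_STATUS or attempt >= max_attempts:
                raise
            sleep(base_delay * 2 ** (attempt - 1))
            attempt += 1


if __name__ == "__main__":
    state = {"calls": 0}

    def flaky() -> str:
        state["calls"] += 1
        if state["calls"] < 3:
            raise HttpError(502)
        return "ok"

    print(call_with_retry(flaky, base_delay=0.01), "after", state["calls"], "calls")
```

Passing sleep as a parameter keeps the helper easy to test without real delays.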
PowerBI
- Improved M-query parsing with support for comments, better handling of quotes, and DatabricksMultiCloud native query functionality. [#12177, #11743, #11756]
- Enhanced workspace management with cross-workspace dataset linking and app ingestion support. Added timeouts for M-query parsing. [#11560, #11629, #11753]
- Improved error reporting and performance optimization with reduced type casting and better organization of responsibilities. [#11763, #12004]
Developer Experience
- Entity Management: Introduced entity versioning for Datasets and ML Models, with support for version set linking. Improved timeline functionality with better handling of primary key changes and rename events. Added data transformation logic models to enhance data processing capabilities. [#11819, #11843, #12166, #12198]
- Enhanced Configuration Management: Added new customization options through environment variables and Helm charts, including editable dataset names and configurable garbage collection scheduling. The bootstrap process has been optimized to reduce latency during installation. [#11391, #11518]
- Development Environment Updates: Added Git support to the ingestion-base image, enabling better source control integration for ingestion workflows. [#11477]
- Security Logging Enhancement: Improved security audit trails by adding actor URN tracking for unauthorized access attempts. [#12030]
NEW: Garbage Collection
- Comprehensive Metadata Cleanup: Introduced a new ingestion source, DataHubGC, that functions as a garbage collector for managing dataflows, data jobs, and data process instances, with configurable retention policies and deletion parameters. Added dry run mode for testing cleanup operations. [#11102, #11413]
- Performance Optimizations: Reduced processing time from 1 hour to 15 minutes by implementing batch processing, optimizing queries, and removing unnecessary operations. Increased default hard delete limit from 10k to 25k entities. [#11809, #12093, #12238]
- Reliability Improvements: Enhanced garbage collection stability with additional validation checks, improved error handling, and better process visibility through ingestion stage reporting. Fixed issues with entity deletion logic and reference handling to preserve critical lineage relationships. [#12011, #12013, #12027, #12049, #12124, #12226]
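The combination of batch processing and a hard delete limit described above can be sketched generically as follows. This is not the DataHubGC implementation: the function name batched_with_limit and the batch size are assumptions, and only the 25k figure is taken from the release notes.

```python
from itertools import islice
from typing import Iterable, Iterator, List

# Default hard delete limit stated in the release notes.
DEFAULT_HARD_DELETE_LIMIT = 25_000


def batched_with_limit(
    urns: Iterable[str],
    batch_size: int,
    limit: int = DEFAULT_HARD_DELETE_LIMIT,
) -> Iterator[List[str]]:
    """Yield batches of URNs, never emitting more than limit URNs in total."""
    if batch_size < 1:
        raise ValueError("batch_size must be at least 1")
    if limit < 0:
        raise ValueError("limit must not be negative")
    capped = islice(urns, limit)  # stop consuming once the cap is reached
    while True:
        batch = list(islice(capped, batch_size))
        if not batch:
            return
        yield batch


if __name__ == "__main__":
    candidates = (f"urn:li:dataProcessInstance:{i}" for i in range(10))
    for batch in batched_with_limit(candidates, batch_size=4, limit=7):
        print(len(batch), "entities in this batch")
```

Capping the input before batching means the limit holds even when the candidate list is a lazy generator of unknown length.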
Thank You to Our Contributors!
First-Time Contributors
@AColocho, @alberttwong, @Alice-608, @Bumyu, @chakru-r, @chriscc2, @dejan2609, @donovan-acryl, @eagle-25, @hwmarkcheng, @k-bartlett, @kanavnarula, @kartikey-visa, @kevinkarchacryl, @kousiknandy, @kris48k, @llance, @margaridafernandes-trip, @mikeburke24, @raudzis, @ronybony1990, @ryota-cloud, @shepherd44, @siong-tcha, @ssidorenko, @tanguyantoine, @th0ger, @udays-visa, @udbhav-hbk, @vejeta
Repeat Contributors
@aviv-julienjehannet, @bda618, @bossenti, @darnaut, @deepgarg-visa, @DSchmidtDev, @dushayntAW, @eboneil, @ethan-cartwright, @feldjay, @githendrik, @haeniya, @Jorricks, @Masterchen09, @mkamalas, @Nbagga14, @nicholas-fwang, @noggi, @pankajmahato-visa, @pinakipb2, @rtekal, @sagar-salvi-apptware, @steffengr
DataHub Maintainers
@acrylJonny, @anshbansal, @asikowitz, @chriscollins3456, @david-leifker, @gabe-lyons, @hsheth2, @jayacryl, @jjoyce0510, @maggiehays, @mayurinehate, @pedro93, @RyanHolstien, @sakethvarma397, @sgomezvillamor, @shirshanka, @sid-acryl, @skrydal, @treff7es, @yoonhyejin...