RFC: Structured Tree Display Representation #4264
Closed
universalmind303 started this conversation in General
Replies: 1 comment
So I think I'll be moving forward with the alternative proposal of the DataFusion-style visitors. I think this will be the least invasive and quickest approach to implement. We can still decide to go ahead with this proposal later. The DataFusion style is self-contained enough that it shouldn't have any impact on our existing codebase.
Summary
This RFC proposes replacing the current unstructured `multiline_display` method with a type-safe structured representation system for generating contextual representations of any data type. This system enables multiple output formats (ASCII, Git-style, Mermaid diagrams, etc.) with context-driven rendering, applicable to logical plans, physical plans, expressions, and any other data structures requiring flexible display options.
Motivation
Our current approach to displaying plans relies mainly on the unstructured `multiline_display` method, which has become a significant limitation as our system grows more complex. This method:
To enable rich visualizations in the "dashboard" and other frontends, we need a structured output format that can provide precise control. We need a structured representation system that separates what information to display from how to display it. This will enable:
Problem Statement
Our current representation system has several deficiencies:
`multiline_display` makes adding new display formats difficult and gives limited control over the output
Requirements
The new system:
`name` and `id`),
The last requirement is particularly important as it creates consistency across implementations and helps ensure that relevant information is always available to renderers, though we recognize some flexibility may be needed for specialized node types.
Prior Art
Several query execution engines and database systems have implemented structured representations for query plans:
These approaches demonstrate the value of decoupling plan structure from rendering logic, which this RFC adopts and extends with structured output.
Proposed Solution
The proposed solution draws some inspiration from Zed's `SumTree` implementation[3], which uses a similar approach of separating data structures from their representation to enable flexible rendering and manipulation.
Replace the unstructured system with a trait-based approach using Rust's type system:
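A minimal sketch of what such a trait-based split could look like. The trait names `Item`, `Summary`, and `Display` come from this RFC, but every method signature, the `NodeSummary` and `PlanNode` types, and the two renderers are assumptions made for illustration, not the proposed API. Note that `Display` here is a local trait, unrelated to `std::fmt::Display`.

```rust
use std::fmt::Write;

/// Hypothetical `Summary`: the structured facts a renderer may show for a node.
pub trait Summary {
    fn id(&self) -> usize;
    fn name(&self) -> &str;
    fn extra_fields(&self) -> &[(String, String)];
}

/// Hypothetical `Item`: a tree node that can describe itself as a `Summary`.
pub trait Item {
    type Summary: Summary;
    fn summary(&self) -> Self::Summary;
    fn children(&self) -> Vec<&Self>;
}

/// Hypothetical `Display`: a renderer that only sees summaries, never node internals.
pub trait Display {
    fn render<T: Item>(&self, root: &T) -> String;
}

/// Indented ASCII renderer: one line per node, fields appended as key=value.
pub struct Ascii;

impl Display for Ascii {
    fn render<T: Item>(&self, root: &T) -> String {
        fn walk<T: Item>(node: &T, depth: usize, out: &mut String) {
            let s = node.summary();
            let _ = write!(out, "{}{}", "  ".repeat(depth), s.name());
            for (k, v) in s.extra_fields() {
                let _ = write!(out, " {k}={v}");
            }
            out.push('\n');
            for child in node.children() {
                walk(child, depth + 1, out);
            }
        }
        let mut out = String::new();
        walk(root, 0, &mut out);
        out
    }
}

/// Mermaid renderer: same summaries, different output format.
pub struct Mermaid;

impl Display for Mermaid {
    fn render<T: Item>(&self, root: &T) -> String {
        fn walk<T: Item>(node: &T, out: &mut String) {
            let s = node.summary();
            let _ = writeln!(out, "  n{}[\"{}\"]", s.id(), s.name());
            for child in node.children() {
                let _ = writeln!(out, "  n{} --> n{}", s.id(), child.summary().id());
                walk(child, out);
            }
        }
        let mut out = String::from("graph TD\n");
        walk(root, &mut out);
        out
    }
}

/// Toy summary and plan node used only to exercise the traits.
pub struct NodeSummary {
    id: usize,
    name: String,
    extra_fields: Vec<(String, String)>,
}

impl Summary for NodeSummary {
    fn id(&self) -> usize { self.id }
    fn name(&self) -> &str { &self.name }
    fn extra_fields(&self) -> &[(String, String)] { &self.extra_fields }
}

pub struct PlanNode {
    pub id: usize,
    pub name: String,
    pub fields: Vec<(String, String)>,
    pub children: Vec<PlanNode>,
}

impl Item for PlanNode {
    type Summary = NodeSummary;
    fn summary(&self) -> NodeSummary {
        NodeSummary { id: self.id, name: self.name.clone(), extra_fields: self.fields.clone() }
    }
    fn children(&self) -> Vec<&Self> { self.children.iter().collect() }
}
```

In this sketch, adding a new output format means adding one more `Display` implementation; no `Item` implementation has to change.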
We could also enhance this structure with field importance metadata to give renderers more information about what to display in different contexts:
This would allow the extra_fields to include importance indicators, enabling renderers to make smarter decisions about what to display based on the viewing context.
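As a rough illustration of importance-tagged fields, the sketch below lets a renderer pick a threshold and drop everything beneath it. The `Importance` levels, the `Field` struct, and both helper functions are hypothetical names chosen for this example.

```rust
/// Hypothetical importance levels; derived ordering follows declaration order,
/// so Low < Medium < High.
#[derive(Debug, Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
pub enum Importance {
    Low,
    Medium,
    High,
}

/// Hypothetical entry of `extra_fields`, carrying an importance tag.
#[derive(Debug, Clone, PartialEq)]
pub struct Field {
    pub key: String,
    pub value: String,
    pub importance: Importance,
}

/// Keep only the fields at or above the renderer's threshold.
pub fn visible_fields(fields: &[Field], min: Importance) -> Vec<&Field> {
    fields.iter().filter(|f| f.importance >= min).collect()
}

/// Render a single node line, showing only sufficiently important fields.
pub fn render_line(name: &str, fields: &[Field], min: Importance) -> String {
    let mut line = name.to_string();
    for f in visible_fields(fields, min) {
        line.push_str(&format!(" {}={}", f.key, f.value));
    }
    line
}
```

A compact view could pass `Importance::High`, while a verbose or debugging view could pass `Importance::Low` to show every field.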
Pros
`Item`, `Summary`) and rendering (`Display`)
`Display` type
Drawbacks
`Summary` objects
`Summary` types would require modifying all implementations.
Alternatives Considered
1. Current Implementation (`TreeDisplay` trait)
Our current implementation uses the `TreeDisplay` trait:
Pros:
Problems:
`multiline_display` function in most implementations
Alternative 1a: Enhanced TreeDisplay
An alternative approach could be enhancing the existing `TreeDisplay` trait rather than replacing it:
This is pretty similar to my proposal, and much of the work is the same. It just modifies the existing trait to get us closer, but I think it'd be cleaner to start with a fresh slate.
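One way to picture the enhanced-`TreeDisplay` alternative is a new structured method with a default implementation that falls back to the legacy output, so existing implementors keep compiling and can migrate one at a time. The trait shape below is an assumption made for this sketch (the real `TreeDisplay` may look different), and `fields`, `LegacyScan`, and `MigratedFilter` are hypothetical names.

```rust
/// Assumed, simplified shape of the existing trait; the real `TreeDisplay` may differ.
pub trait TreeDisplay {
    fn name(&self) -> String;

    /// Existing unstructured output: free-form lines.
    fn multiline_display(&self) -> Vec<String>;

    /// Hypothetical addition: structured key/value fields.
    /// The default wraps the legacy lines so existing impls keep compiling.
    fn fields(&self) -> Vec<(String, String)> {
        self.multiline_display()
            .into_iter()
            .enumerate()
            .map(|(i, line)| (format!("line{i}"), line))
            .collect()
    }
}

/// A node that has not been migrated: it only provides the legacy method.
pub struct LegacyScan;

impl TreeDisplay for LegacyScan {
    fn name(&self) -> String { "Scan".to_string() }
    fn multiline_display(&self) -> Vec<String> { vec!["table = t".to_string()] }
}

/// A migrated node: it overrides `fields` with real structure.
pub struct MigratedFilter {
    pub predicate: String,
}

impl TreeDisplay for MigratedFilter {
    fn name(&self) -> String { "Filter".to_string() }
    fn multiline_display(&self) -> Vec<String> {
        vec![format!("predicate = {}", self.predicate)]
    }
    fn fields(&self) -> Vec<(String, String)> {
        vec![("predicate".to_string(), self.predicate.clone())]
    }
}
```

The fallback keys (`line0`, `line1`, ...) carry no meaning, which shows the cost of this route: renderers get real structure only from nodes that have been migrated.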
2. Datafusion-style Visitor Pattern
The visitor pattern is another approach, similar to what DataFusion[4] implements with their `TreeNodeVisitor` trait. DataFusion has specific implementations like `IndentVisitor`, `GraphvizVisitor`, and `PgJsonVisitor`, each tightly coupled to its output format. This approach has several drawbacks:
There are also a few nice things about it, though:
If we wanted to implement something quickly, the DataFusion-style pattern would be by far the easiest to do, but as mentioned, it can be pretty inflexible and would result in a lot of duplicated code for each new representation.
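To make the trade-off concrete, here is a toy visitor setup in the same spirit. It does not use DataFusion's actual `TreeNodeVisitor` API; the `Plan` type, the `PlanVisitor` trait with `pre_visit`/`post_visit`, and both visitors are stand-ins written for this sketch.

```rust
/// Toy plan tree used only for this sketch.
pub struct Plan {
    pub name: String,
    pub children: Vec<Plan>,
}

/// Hypothetical visitor trait with hooks before and after a node's children.
pub trait PlanVisitor {
    fn pre_visit(&mut self, node: &Plan);
    fn post_visit(&mut self, node: &Plan);
}

impl Plan {
    /// Depth-first traversal that drives any visitor.
    pub fn accept<V: PlanVisitor>(&self, visitor: &mut V) {
        visitor.pre_visit(self);
        for child in &self.children {
            child.accept(visitor);
        }
        visitor.post_visit(self);
    }
}

/// One visitor per output format: this one writes an indented tree.
#[derive(Default)]
pub struct IndentVisitor {
    depth: usize,
    pub out: String,
}

impl PlanVisitor for IndentVisitor {
    fn pre_visit(&mut self, node: &Plan) {
        self.out.push_str(&"  ".repeat(self.depth));
        self.out.push_str(&node.name);
        self.out.push('\n');
        self.depth += 1;
    }
    fn post_visit(&mut self, _node: &Plan) {
        self.depth -= 1;
    }
}

/// A second format needs a second visitor that reads node data on its own.
#[derive(Default)]
pub struct SExprVisitor {
    pub out: String,
}

impl PlanVisitor for SExprVisitor {
    fn pre_visit(&mut self, node: &Plan) {
        self.out.push('(');
        self.out.push_str(&node.name);
    }
    fn post_visit(&mut self, _node: &Plan) {
        self.out.push(')');
    }
}
```

Each visitor is small and self-contained, which is why this route is quick to implement. The cost is that every new format repeats the logic for extracting what to show from a node.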
3. Calcite-style Writer Approach
This approach, inspired by Apache Calcite's implementation, uses a writer interface that nodes contribute attributes to:
It effectively separates data extraction from rendering by having nodes provide structured attributes to writers that handle formatting. This could be modified to include the necessary context.
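A small sketch of the writer idea, loosely modeled on the description above rather than on Calcite's actual Java interfaces. The `PlanWriter` and `Explain` traits, the `Scan` node, and both writers are hypothetical, and the JSON writer does no string escaping.

```rust
/// Hypothetical writer interface: nodes contribute attributes, writers format them.
pub trait PlanWriter {
    fn begin(&mut self, name: &str);
    fn item(&mut self, key: &str, value: &str);
    fn end(&mut self);
}

/// Hypothetical node-side trait: describe yourself to whatever writer is given.
pub trait Explain {
    fn explain(&self, w: &mut dyn PlanWriter);
}

pub struct Scan {
    pub table: String,
    pub limit: Option<usize>,
}

impl Explain for Scan {
    fn explain(&self, w: &mut dyn PlanWriter) {
        w.begin("Scan");
        w.item("table", &self.table);
        if let Some(n) = self.limit {
            w.item("limit", &n.to_string());
        }
        w.end();
    }
}

/// Plain-text writer: `Name: k=v, k=v`.
#[derive(Default)]
pub struct TextWriter {
    pub out: String,
    first: bool,
}

impl PlanWriter for TextWriter {
    fn begin(&mut self, name: &str) {
        self.out.push_str(name);
        self.first = true;
    }
    fn item(&mut self, key: &str, value: &str) {
        self.out.push_str(if self.first { ": " } else { ", " });
        self.out.push_str(&format!("{key}={value}"));
        self.first = false;
    }
    fn end(&mut self) {
        self.out.push('\n');
    }
}

/// JSON-like writer (no escaping; illustration only).
#[derive(Default)]
pub struct JsonWriter {
    pub out: String,
}

impl PlanWriter for JsonWriter {
    fn begin(&mut self, name: &str) {
        self.out.push_str(&format!("{{\"name\":\"{name}\""));
    }
    fn item(&mut self, key: &str, value: &str) {
        self.out.push_str(&format!(",\"{key}\":\"{value}\""));
    }
    fn end(&mut self) {
        self.out.push('}');
    }
}
```

The node code is written once and is unaware of the output format. Passing rendering context would mean extending the writer interface, for example with a context accessor.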
Pros:
Drawbacks:
Open Questions
Whether the system should support generating different summary types for different use cases beyond representation. This could be useful for deriving different kinds of summaries for T, but can be easily worked around via newtype structs.
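The newtype workaround could look roughly like this, assuming each type gets exactly one summary through an associated type. The `Summarize` trait, the `Scan` node, and the `WithStats` wrapper are hypothetical names used only for this sketch.

```rust
/// Hypothetical trait: one summary type per implementing type.
pub trait Summarize {
    type Summary;
    fn summary(&self) -> Self::Summary;
}

pub struct Scan {
    pub table: String,
    pub estimated_rows: u64,
}

/// Default summary used for display.
impl Summarize for Scan {
    type Summary = String;
    fn summary(&self) -> String {
        format!("Scan {}", self.table)
    }
}

/// Newtype wrapper that gives the same node a second, different summary.
pub struct WithStats<'a>(pub &'a Scan);

impl Summarize for WithStats<'_> {
    type Summary = (String, u64);
    fn summary(&self) -> (String, u64) {
        (self.0.table.clone(), self.0.estimated_rows)
    }
}
```

Wrapping a borrowed node keeps the wrapper cheap to construct, and callers choose the summary by choosing the wrapper.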
Scope Limitations
Multiple summary types per node are currently out of scope but may be considered for future extensions.
Performance optimizations for large plan trees are not addressed in this initial implementation.
This RFC does not specify how frontend integrations should consume the structured data, only how to produce it.
Migration Strategy
While this RFC proposes the new representation system, legacy code using `multiline_display` will need to be supported during a transition period. I propose implementing the new system alongside existing code, gradually migrating public APIs to the new system, and eventually deprecating and removing the old logic after an appropriate deprecation period.
User-facing APIs like `DataFrame.explain()` and SQL explain commands SHOULD maintain backward compatibility. This can be achieved by adding a `use_legacy` parameter that defaults to `True` initially. Once the new system is stable and well-tested, we can switch the default to `use_legacy=False` in a future release, with appropriate deprecation warnings.
References
[1] Apache Calcite: https://calcite.apache.org/ - A framework for building databases and data management systems, with a sophisticated cost-based optimizer.
[2] DataFusion: https://github.com/apache/arrow-datafusion - A query engine written in Rust that uses Apache Arrow as its in-memory format, providing SQL execution capabilities and a DataFrame API.
[3] Zed: https://github.com/zed-industries/zed/blob/main/crates/sum_tree/src/sum_tree.rs - Zed’s SumTree implementation
[4] DataFusion Visitor Pattern: https://github.com/apache/datafusion/blob/a7e71a71fba59a3a20c619b51b5fbec565bf0dcc/datafusion/expr/src/logical_plan/display.rs#L42 - DataFusion's visitor-based display implementation for logical plans.