RFC: Structured Tree Display Representation #4264

universalmind303 · 2025-04-28T19:00:23Z

universalmind303
Apr 28, 2025
Maintainer

Summary

This RFC proposes replacing the current unstructured multiline_display method with a type-safe structured representation system for generating contextual representations of any data type. This system enables multiple output formats (ASCII, Git-style, Mermaid diagrams, etc.) with context-driven rendering, applicable to logical plans, physical plans, expressions, and any other data structures requiring flexible display options.

Motivation

Our current approach to displaying plans relies mainly on the unstructured multiline_display method, which has become a significant limitation as our system grows more complex. This method:

Makes it difficult to highlight important aspects of a plan while hiding implementation details
Tightly couples the node implementation with its visual representation
Has no clear contract for what information should be displayed for each node type
Critically, prevents fine-grained control over visual elements needed for our dashboard and other front-ends

To enable rich visualizations in the dashboard and other front ends, we need a structured representation system that gives renderers precise control and separates what information to display from how to display it. This will enable:

Easier development of new visualization formats without modifying node implementations
Consistent display of important information across all formats
Clear separation of concerns between node implementation and rendering logic
Fine-grained control over visual elements for dashboard and other front-ends
Custom, tailored representations for specific front end needs

Problem Statement

Our current representation system has several deficiencies:

The current system produces only un-typed string outputs, preventing type-safe interactions and requiring error-prone parsing to extract structured data
There's no clear specification for what information should be included in a given node's representation, leading to inconsistent implementations and missing critical details
multiline_display offers limited control over output, which makes adding new display formats difficult
With no trait-level contract, node hierarchies and expected outputs are unclear
No separation between representation data and rendering logic
No standard way to control verbosity levels across different formats

Requirements

The new system:

MUST ensure all nodes of a type T share common required fields (such as name and id)
MUST allow context specific rendering
SHOULD separate data extraction from rendering logic
SHOULD define clear guidelines for what specific information each node type needs to include in its representation

The last requirement is particularly important as it creates consistency across implementations and helps ensure that relevant information is always available to renderers, though we recognize some flexibility may be needed for specialized node types.

Prior Art

Several query execution engines and database systems have implemented structured representations for query plans:

Apache Calcite[1] uses a visitor pattern with dedicated formatters for different output formats
DataFusion[2] uses a similar visitor pattern with dedicated formatters

These approaches demonstrate the value of decoupling plan structure from rendering logic, which this RFC adopts and extends with a structured output.

Proposed Solution

The proposed solution draws some inspiration from Zed's SumTree implementation[3], which uses a similar approach of separating data structures from their representation to enable flexible rendering and manipulation.

Replace the unstructured system with a trait-based approach using Rust's type system:

// an Item is anything that can be summarized.
trait Item {
    type Summary;
    fn summary(&self) -> Self::Summary;
}

// Display trait for context-specific formatting
trait Display<C> {
    fn format(&self, ctx: &C) -> String;
}

// a Summary is typically a struct of common fields across different implementations of `Item`
trait Summary {
    fn display<C>(&self, ctx: &C) -> String
    where
        Self: Display<C>;
}

// Blanket implementation for all types that implement Display<C>
impl<T> Summary for T {
    fn display<C>(&self, ctx: &C) -> String
    where
        Self: Display<C>,
    {
        self.format(ctx)
    }
}

#[derive(Debug, Clone)]
pub struct LogicalPlanSummary {
    name: String,
    schema: SchemaRef,
    stats: Option<PlanStats>,
    input_exprs: Option<Vec<ExprRef>>,
    extra_fields: Vec<(String, String)>,
}

We could also enhance this structure with field importance metadata to give renderers more information about what to display in different contexts:

enum FieldImportance {
    Critical,   // Always show
    Important,  // Show in default view
    Detail,     // Show only in detailed view
    Debug,      // Show only in debug view
}

pub struct LogicalPlanSummary {
    // ... other fields as in the definition above ...
    extra_fields: Vec<(String, String, FieldImportance)>,
}

This would allow the extra_fields to include importance indicators, enabling renderers to make smarter decisions about what to display based on the viewing context.
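As a rough illustration of how a renderer might use these indicators, the sketch below filters `extra_fields` by a maximum importance level. The helper `visible_fields` and the idea of ordering the enum by declaration order are assumptions made for this example, not part of the proposal.

```rust
// Importance levels as in the RFC. Deriving `Ord` orders variants by
// declaration, so Critical < Important < Detail < Debug.
#[derive(Clone, Copy, PartialEq, Eq, PartialOrd, Ord)]
enum FieldImportance {
    Critical,  // Always show
    Important, // Show in default view
    Detail,    // Show only in detailed view
    Debug,     // Show only in debug view
}

// Hypothetical helper: keep every field whose importance is at or above
// the cut-off chosen by the renderer, formatted as "key: value".
fn visible_fields(
    fields: &[(String, String, FieldImportance)],
    max_level: FieldImportance,
) -> Vec<String> {
    fields
        .iter()
        .filter(|(_, _, importance)| *importance <= max_level)
        .map(|(key, value, _)| format!("{key}: {value}"))
        .collect()
}
```

With fields tagged `Critical`, `Detail`, and `Debug`, calling `visible_fields(&fields, FieldImportance::Detail)` keeps the first two and drops the `Debug` one, so a renderer can map each of its views to a single cut-off.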

impl Item for LogicalPlan {
    // we summarize it into a `LogicalPlanSummary`
    type Summary = LogicalPlanSummary;
    fn summary(&self) -> Self::Summary {
        match self {
            Self::Source(source) => source.summary(),
            //...
        }
    }
}

// now that all logical plan nodes are converted into a `summary`, adding new reprs is straightforward.
// Example implementation for DisplayLevel context
impl Display<DisplayLevel> for LogicalPlanSummary {
    fn format(&self, ctx: &DisplayLevel) -> String {
        match ctx {
            DisplayLevel::Compact => self.name.clone(),
            DisplayLevel::Default => {
                // Format with default style
                let expr_str = match &self.input_exprs {
                    Some(exprs) => {
                        let joined = exprs.iter()
                            .map(|e| e.to_string())
                            .collect::<Vec<_>>()
                            .join(", ");
                        format!("({})", joined)
                    }
                    None => "".to_string(),
                };

                format!("{}: {}", self.name, expr_str)
            }
            // Other display levels...
        }
    }
}

// We can also render objects with custom contexts beyond `DisplayLevel` now.
// Since we have a common struct for all of the logical plan, we only need to define the rendering logic once, instead of in every node like before.

// Custom context for detailed rendering control
struct RenderContext {
    include_schema: bool,
    include_stats: bool,
    simplified: bool,
    max_width: usize,
}

// implementing Display for a custom context
impl Display<RenderContext> for LogicalPlanSummary {
    fn format(&self, ctx: &RenderContext) -> String {
        let mut result = if ctx.simplified {
            self.name.clone()
        } else {
            let expr_str = match &self.input_exprs {
                Some(exprs) => {
                    let joined = exprs.iter()
                        .map(|e| e.to_string())
                        .collect::<Vec<_>>()
                        .join(", ");
                    format!("({})", joined)
                }
                None => "".to_string(),
            };

            format!("{}: {}", self.name, expr_str)
        };

        // conditionally include schema based on context
        if ctx.include_schema {
            result.push_str(&format!("\nSchema: {}", self.schema.short_string()));
        }

        // include stats only when requested and available
        if ctx.include_stats {
            if let Some(stats) = &self.stats {
                result.push_str(&format!("\nStats: {}", stats));
            }
        }

        result
    }
}

// Example renderer that defines its own context
struct AsciiRenderer {
    context: RenderContext,
}

impl AsciiRenderer {
    fn render<I: Item>(&self, node: &I) -> String
    where
        I::Summary: Display<RenderContext>
    {
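        // Minimal body for a single node: summarize, then format with this
        // renderer's context. Tree traversal and ASCII layout are still to be done.
        node.summary().format(&self.context)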
    }
}
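To see the pieces fit together, here is a condensed, self-contained sketch of the same flow using simplified stand-in types. `Plan`, `PlanSummary`, `Level`, and `MermaidCtx` are hypothetical and much smaller than the real `LogicalPlan` and its contexts. The formatting trait is named `DisplayWith` here only to avoid confusion with `std::fmt::Display`.

```rust
// Anything that can be summarized (mirrors `Item` above).
trait Item {
    type Summary;
    fn summary(&self) -> Self::Summary;
}

// Context-specific formatting (mirrors `Display<C>` above).
trait DisplayWith<C> {
    fn format(&self, ctx: &C) -> String;
}

// Hypothetical, heavily simplified plan with two node kinds.
enum Plan {
    Source { table: String },
    Project { columns: Vec<String> },
}

// One summary struct shared by every node kind.
struct PlanSummary {
    name: String,
    input_exprs: Option<Vec<String>>,
}

impl Item for Plan {
    type Summary = PlanSummary;
    fn summary(&self) -> PlanSummary {
        match self {
            Plan::Source { table } => PlanSummary {
                name: format!("Source[{table}]"),
                input_exprs: None,
            },
            Plan::Project { columns } => PlanSummary {
                name: "Project".to_string(),
                input_exprs: Some(columns.clone()),
            },
        }
    }
}

// First context: a verbosity level.
enum Level {
    Compact,
    Default,
}

impl DisplayWith<Level> for PlanSummary {
    fn format(&self, ctx: &Level) -> String {
        match (ctx, &self.input_exprs) {
            (Level::Default, Some(exprs)) => format!("{}: ({})", self.name, exprs.join(", ")),
            _ => self.name.clone(),
        }
    }
}

// Second context: a Mermaid-like node label keyed by a node id.
struct MermaidCtx {
    node_id: usize,
}

impl DisplayWith<MermaidCtx> for PlanSummary {
    fn format(&self, ctx: &MermaidCtx) -> String {
        format!("n{}[\"{}\"]", ctx.node_id, self.name)
    }
}

// A renderer only needs `Item` plus the `DisplayWith` impl for its context.
fn render<I, C>(node: &I, ctx: &C) -> String
where
    I: Item,
    I::Summary: DisplayWith<C>,
{
    node.summary().format(ctx)
}
```

Adding the Mermaid-style output required one new `impl` on the summary type and no changes to `Plan`, which is the property the proposal is after.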

Pros

Strong type safety through Rust's type system with compile-time guarantees
Clear separation between data extraction (Item, Summary) and rendering (Display)
Context-sensitive rendering through the parameterized Display type
Flexible and extensible to new rendering formats without modifying node implementations
Enforces consistent node structure, leading to more uniform reprs.

Drawbacks

More boilerplate compared to some alternatives
Additional "translation" code needed between internal node state and Summary objects
Limited flexibility with heterogeneous data types, as the output type is statically associated with the input type rather than being dynamic
Adding new data to Summary types would require modifying all implementations

Alternatives Considered

1. Current Implementation (`TreeDisplay` trait)

Our current implementation uses the TreeDisplay trait:

trait TreeDisplay {
    fn display_as(&self, level: DisplayLevel) -> String;
    fn get_name(&self) -> String;
    fn id(&self) -> String;
    fn get_children(&self) -> Vec<&dyn TreeDisplay>;
}

Pros:

Simple, lightweight interface that's easy to implement
Clear separation of tree traversal from node display

Problems:

Returns only a single string representation per item with no structured data for renderers
Relies heavily on the unstructured multiline_display function in most implementations

Alternative 1a: Enhanced TreeDisplay

An alternative approach could be enhancing the existing TreeDisplay trait rather than replacing it:

trait EnhancedTreeDisplay: TreeDisplay {
    type NodeProperties;
    fn get_properties(&self) -> Self::NodeProperties;
}

impl EnhancedTreeDisplay for LogicalPlan {
    // ...
}

This is pretty similar to my proposal, and much of the work is the same. It just modifies the existing trait to get us closer, but I think it'd be cleaner to start with a fresh slate.
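For a sense of what an implementation could look like, here is a minimal sketch on a single node. `Limit`, `LimitProperties`, and the two-variant `DisplayLevel` are hypothetical stand-ins; the real enum and node types may differ.

```rust
// Assumed shape of the existing level enum; only two variants are shown.
#[derive(Clone, Copy)]
enum DisplayLevel {
    Compact,
    Default,
}

// Existing trait, as quoted above.
trait TreeDisplay {
    fn display_as(&self, level: DisplayLevel) -> String;
    fn get_name(&self) -> String;
    fn id(&self) -> String;
    fn get_children(&self) -> Vec<&dyn TreeDisplay>;
}

// Extension: structured properties alongside the string output.
trait EnhancedTreeDisplay: TreeDisplay {
    type NodeProperties;
    fn get_properties(&self) -> Self::NodeProperties;
}

// Hypothetical node and property types, for illustration only.
struct Limit {
    node_id: usize,
    limit: u64,
}

#[derive(Debug, PartialEq)]
struct LimitProperties {
    limit: u64,
}

impl TreeDisplay for Limit {
    fn display_as(&self, level: DisplayLevel) -> String {
        match level {
            DisplayLevel::Compact => self.get_name(),
            DisplayLevel::Default => format!("{}: limit = {}", self.get_name(), self.limit),
        }
    }
    fn get_name(&self) -> String {
        "Limit".to_string()
    }
    fn id(&self) -> String {
        format!("limit-{}", self.node_id)
    }
    fn get_children(&self) -> Vec<&dyn TreeDisplay> {
        Vec::new()
    }
}

impl EnhancedTreeDisplay for Limit {
    type NodeProperties = LimitProperties;
    fn get_properties(&self) -> LimitProperties {
        LimitProperties { limit: self.limit }
    }
}
```

One thing this sketch makes visible: `get_children` still returns `&dyn TreeDisplay`, so a renderer walking the tree only reaches the string API for child nodes unless the traversal method is changed as well.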

2. DataFusion-style Visitor Pattern

The visitor pattern is another approach, similar to what DataFusion[4] implements with their TreeNodeVisitor trait. DataFusion has specific implementations like IndentVisitor, GraphvizVisitor, and PgJsonVisitor, each tightly coupled to its output format. This approach has several drawbacks:

Tightly couples data extraction with rendering logic
Places the entire rendering burden on each visitor implementation
Requires duplicating logic across different output formats
Lacks type safety and does not guarantee any consistency for shared properties, so the burden falls on each visitor implementation, which could be more error-prone and need more thorough review and testing

There are also a few nice things about it, though:

Existing DataFusion implementations we can draw inspiration from
Provides more flexibility than my proposed solution
Super easy, albeit verbose, to implement new visitors

If we wanted to implement something quickly, the DataFusion-style pattern would by far be the easiest to do, but as mentioned, it can be pretty inflexible, since each visitor is tied to its output format, and would result in a lot of duplicated code for each new repr.
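The trade-off can be sketched with a deliberately simplified visitor. The `Node`, `Visitor`, and `walk` definitions below are hypothetical and are not DataFusion's actual `TreeNodeVisitor` API; they only illustrate how each output format ends up owning both traversal state and per-node formatting.

```rust
// Hypothetical minimal tree; a real logical plan is far richer.
struct Node {
    name: String,
    children: Vec<Node>,
}

// Simplified pre/post visitor. NOT DataFusion's real trait signature.
trait Visitor {
    fn pre_visit(&mut self, node: &Node);
    fn post_visit(&mut self, node: &Node);
}

// Depth-first traversal shared by all visitors.
fn walk(node: &Node, visitor: &mut dyn Visitor) {
    visitor.pre_visit(node);
    for child in &node.children {
        walk(child, visitor);
    }
    visitor.post_visit(node);
}

// One visitor per output format: this one writes an indented listing.
#[derive(Default)]
struct IndentVisitor {
    depth: usize,
    out: String,
}

impl Visitor for IndentVisitor {
    fn pre_visit(&mut self, node: &Node) {
        self.out.push_str(&"  ".repeat(self.depth));
        self.out.push_str(&node.name);
        self.out.push('\n');
        self.depth += 1;
    }
    fn post_visit(&mut self, _node: &Node) {
        self.depth -= 1;
    }
}

// A second format needs its own visitor and repeats the per-node extraction.
#[derive(Default)]
struct MermaidVisitor {
    stack: Vec<usize>, // ids of the ancestors of the current node
    next_id: usize,
    out: String,
}

impl Visitor for MermaidVisitor {
    fn pre_visit(&mut self, node: &Node) {
        let id = self.next_id;
        self.next_id += 1;
        self.out.push_str(&format!("n{id}[\"{}\"]\n", node.name));
        if let Some(parent) = self.stack.last() {
            self.out.push_str(&format!("n{parent} --> n{id}\n"));
        }
        self.stack.push(id);
    }
    fn post_visit(&mut self, _node: &Node) {
        self.stack.pop();
    }
}
```

Both visitors read `node.name` directly. Once nodes carry schemas, stats, and expressions, each visitor has to decide on its own which of those to extract and how, which is where the duplication comes from.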

3. Calcite-style Writer Approach

This approach, inspired by Apache Calcite's implementation, uses a writer interface that nodes contribute attributes to:

// Here `Display` is `std::fmt::Display`, not the `Display<C>` trait proposed above.
use std::fmt::Display;

trait RelWriter {
    // Returns a trait object rather than `&mut Self` so that `dyn RelWriter` stays usable.
    fn attribute(&mut self, name: &str, value: &dyn Display) -> &mut dyn RelWriter;
    fn explain(&mut self, rel: &dyn RelNode, attributes: Vec<(String, String)>);
}

trait RelNode {
    fn rel_type_name(&self) -> &str;
    fn inputs(&self) -> Vec<&dyn RelNode>;

    fn explain_terms(&self, writer: &mut dyn RelWriter) -> Result<()> {
        // Add common attributes
        writer.attribute("id", &self.rel_type_name());
        // Node-specific attributes added by implementation
        self.add_attributes(writer)?;
        // Process and continue traversal
        // ...
        Ok(())
    }

    fn add_attributes(&self, writer: &mut dyn RelWriter) -> Result<()>;
}

It effectively separates data extraction from rendering by having nodes provide structured attributes to writers that handle formatting. This could be modified to include the necessary context.

Pros:

Separates data extraction from rendering
Successfully used in production systems (Calcite)
More direct than the proposed solution - no intermediate summary objects
Handles common fields naturally through the base implementation

Drawbacks:

Less type safety compared to the proposed solution
Attributes are dynamically typed (strings and values)
Requires careful documentation of expected attributes
Less context sensitivity without additional parameters
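A trimmed-down, self-contained version of the writer idea is sketched below to make the trade-offs concrete. The `Limit` node, the `TextWriter`, and the `explain_text` helper are hypothetical, and the traits are reduced to the attribute path only (no error handling, no traversal).

```rust
use std::fmt::Display;

// Hypothetical, trimmed-down writer: nodes push named attributes into it.
trait RelWriter {
    fn attribute(&mut self, name: &str, value: &dyn Display);
}

trait RelNode {
    fn rel_type_name(&self) -> &str;

    // Node-specific attributes, supplied by each implementation.
    fn add_attributes(&self, writer: &mut dyn RelWriter);

    // Common attributes first, then node-specific ones.
    fn explain_terms(&self, writer: &mut dyn RelWriter) {
        writer.attribute("type", &self.rel_type_name());
        self.add_attributes(writer);
    }
}

// Hypothetical node used only for this example.
struct Limit {
    n: u64,
}

impl RelNode for Limit {
    fn rel_type_name(&self) -> &str {
        "Limit"
    }
    fn add_attributes(&self, writer: &mut dyn RelWriter) {
        writer.attribute("n", &self.n);
    }
}

// One writer per output format; this one produces `key=value` pairs.
#[derive(Default)]
struct TextWriter {
    parts: Vec<String>,
}

impl RelWriter for TextWriter {
    fn attribute(&mut self, name: &str, value: &dyn Display) {
        // The value's type is erased here: the writer only sees its string form.
        self.parts.push(format!("{name}={value}"));
    }
}

fn explain_text(node: &dyn RelNode) -> String {
    let mut writer = TextWriter::default();
    node.explain_terms(&mut writer);
    writer.parts.join(", ")
}
```

The writer receives every value as `&dyn Display`, which is the dynamic typing listed under the drawbacks: nothing stops two nodes from spelling the same attribute differently.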

Open Questions

Whether the system should support generating different summary types for different use cases beyond representation. This could be useful for deriving different kinds of summaries for T, but can be easily worked around via newtype structs.
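The newtype workaround mentioned above might look roughly like this. `Plan`, `DisplaySummary`, `CostSummary`, and `ForCost` are hypothetical names chosen for the example.

```rust
// Same shape as the `Item` trait in the proposal.
trait Item {
    type Summary;
    fn summary(&self) -> Self::Summary;
}

// Hypothetical node with just enough state for two kinds of summary.
struct Plan {
    name: String,
    rows: u64,
}

struct DisplaySummary {
    name: String,
}

struct CostSummary {
    estimated_rows: u64,
}

// The "primary" summary, attached directly to the node type.
impl Item for Plan {
    type Summary = DisplaySummary;
    fn summary(&self) -> DisplaySummary {
        DisplaySummary { name: self.name.clone() }
    }
}

// Newtype wrapper: borrowing the plan gives it a second, independent
// `Item` implementation with a different associated `Summary` type.
struct ForCost<'a>(&'a Plan);

impl Item for ForCost<'_> {
    type Summary = CostSummary;
    fn summary(&self) -> CostSummary {
        CostSummary { estimated_rows: self.0.rows }
    }
}
```

Callers pick the summary kind at the call site, for example `plan.summary()` versus `ForCost(&plan).summary()`, without the trait itself needing a second associated type.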

Scope Limitations

Multiple summary types per node are currently out of scope but may be considered for future extensions.

Performance optimizations for large plan trees are not addressed in this initial implementation.

This RFC does not specify how frontend integrations should consume the structured data, only how to produce it.

Migration Strategy

While this RFC proposes the new representation system, legacy code using multiline_display will need to be supported during a transition period. I propose implementing the new system alongside existing code, gradually migrating public APIs to the new system, and eventually deprecating and removing the old logic after an appropriate deprecation period.

User-facing APIs like DataFrame.explain() and SQL explain commands SHOULD maintain backward compatibility. This can be achieved by adding a use_legacy=True parameter that defaults to True initially. Once the new system is stable and well-tested, we can switch the default to use_legacy=False in a future release, with appropriate deprecation warnings.

References

[1] Apache Calcite: https://calcite.apache.org/ - A framework for building databases and data management systems, with a sophisticated cost-based optimizer.

[2] DataFusion: https://github.com/apache/arrow-datafusion - A query engine written in Rust that uses Apache Arrow as its in-memory format, providing SQL execution capabilities and a DataFrame API.

[3] Zed: https://github.com/zed-industries/zed/blob/main/crates/sum_tree/src/sum_tree.rs - Zed’s SumTree implementation

[4] DataFusion Visitor Pattern: https://github.com/apache/datafusion/blob/a7e71a71fba59a3a20c619b51b5fbec565bf0dcc/datafusion/expr/src/logical_plan/display.rs#L42 - DataFusion's visitor-based display implementation for logical plans.

universalmind303 · 2025-05-12T16:07:15Z

universalmind303
May 12, 2025
Maintainer Author

So I think I'll be moving forward with the alternative proposal of the DataFusion-style visitors. I think this will be the least invasive and quickest approach to implement. We can still decide to go ahead with this proposal later. The DataFusion-style approach is self-contained enough that it shouldn't have any impact on our existing codebase.
