Draft: RFC for cell ranges #109

felix-oq · 2023-01-19T11:01:47Z

Part of #85

rfc/0001-cell-ranges.md

georg-schwarz · 2023-01-19T12:27:58Z

rfc/0001-cell-ranges.md

+A2:B4 called MyRange;
+```
+
+#### Selecting multiple adjacent columns / rows


What about non-adjacent cols? Maybe separated by a comma, e.g. B,D,F?

Not sure whether this fits into the concept. How are non-adjacent columns treated semantically and why would a user want to do this? Even regular 2d cell ranges don't serve a purpose currently.

georg-schwarz · 2023-01-19T13:17:43Z

rfc/0001-cell-ranges.md

+
+### Cell ranges in layouts
+
+A cell range consists of a cell selector and a name for the range. The syntax of the selector follows the pattern `from:to` where from and to refer to a particular cell. A particular cell is selected via its column and its row in the sheet. The name of a range can later be used for referring to the cells in that cell range.


Maybe we can give "hints" that this can be refactored to xy as there is a collision to make it more transparent to users.

We could do that, but my intention with the overlaps was to simplify common scenarios (e.g. layouts for CSV files).
Version with overlaps:

layout CSVLayout { row 1 called Header; column A called ColumnA; column B called ColumnB; }

Version that avoids overlaps:

layout CSVLayout { row 1 called Header; A2:A called ColumnA; B2:B called ColumnB; }

The version with overlaps can be modeled using just syntactic sugar syntax whereas the other one needs to select partial columns. This would become even more complicated when a user wants to exclude a row in the middle of the data.

georg-schwarz · 2023-01-19T13:25:55Z

rfc/0001-cell-ranges.md

+
+block MyTableBuilder oftype TableBuilder {
+  sheetLayout: MyLayout;
+  columns: {


I think your idea to separate the range from the table format is a good idea. However, also a little non-intuitive referencing the ranges of the layout that is defined somewhere else (?)

A thought I had was to attach the layout to the sheet in the CSVFileExtractor. Subsequent blocks which process the sheet would then be able to access cell ranges from the attached layout.

Another idea would be to use the cell range syntax inline, without any layout:

block MyTableBuilder oftype TableBuilder { columns: { "column1": column A oftype text; "column2": B2:B, cell C1 oftype integer; } }

I don't think this is a good idea, it seems very inspired from the one use case of CSV data into tables. Imho, we should look how to model a table/structured data (probably by defining attributes and their types). And only then model a mapping between the data from a CSV file and it's layouts to that table structure.

That would make the table model easier:

you'd not have to reference layouts or table ranges

you'd also avoid the double meaning of column that occurs here (there is an attribute columns that refers to the columns of the table but also columns inside a layout are used).

It also means the table model is reusable for other, future, data, like actual DB loaders or saving entities from an object store as rows in a table.

For clarity, this could still mean we have a TableBuilder block, just that it maps from a CSV file with a layout to some definition of a table.

georg-schwarz · 2023-01-19T13:26:57Z

rfc/0001-cell-ranges.md

+
+# Drawbacks
+
+- Header rows can't be used for column names


Would be possible by a syntax like this:
cellvalue A1: MyRange oftype text;

Another possibility I thought of is an array-like syntax such as Header[1]: MyRange oftype text where Header is also a 1-dimensional cell range.
Or having an attribute columnNames: Header; in the TableBuilder block. Not sure how to define the columns attribute then, if the column names are already specified elsewhere.

georg-schwarz · 2023-01-19T13:27:11Z

rfc/0001-cell-ranges.md

+  sheetLayout: MyLayout;
+  columns: {
+    "column1": MyRange oftype text;
+    "column2": MyPartialRange, MyCell oftype integer;


"' or without?

The reason for choosing " was to indicate, that this is the column name that will be used when storing the table in a database and that the name does not have to follow the syntax for identifiers in Jayvee (alphanumeric with underscores). In case we want to reference table columns later on in the language, it might make sense to use identifiers and hence omit the ".

georg-schwarz · 2023-01-19T13:27:52Z

rfc/0001-cell-ranges.md

+
+- Header rows can't be used for column names
+- Unable to select cells using regular patterns (e.g. selecting every second row)
+- 2-dimensional ranges can be defined but currently serve no purpose


I think those ranges would get more interesting when using compound types or SI units or so. Nor sure, though, if we should think this far already?

Yeah, that's why I noted in the alternatives section that we could forbid 2-dimensional ranges for now. The potential interactions with compound types are still to be discussed in the future.

georg-schwarz · 2023-01-19T15:16:40Z

rfc/0001-cell-ranges.md

+block MyTableBuilder oftype TableBuilder {
+  sheetLayout: MyLayout;
+  columns: {
+    "column1": MyRange oftype text;


Here could a "column" be also mapped to a row. Offering reading tables line-based instead of column-based is excellent. I'm not sure, though, how this will look in the resulting table then if we have mixes of rows, columns, and single cells xD

67ED
With great power comes great responsibility ;)

rhazn

I left some comments. For the layout discussion, I think this is a good step but I'd consider something like a subselection of data from tables instead. I hope I could describe that adequately.

rhazn · 2023-01-20T09:22:01Z

rfc/0001-cell-ranges.md

+
+# Motivation
+
+Some sheet-like open data is badly structured. For example, see this [example from mobilithek.info](https://mobilithek.info/offers/-655945265921899037) which does not follow the usual CSV pattern (topmost row as header, followed by the actual data).


I think adding parts of the example (e.g. the rows that are non-conform + one row of data) would make this RFC easier to read and futureproof when the link stops working.

In general this would be easier to follow on an actual example, not made up data with MyRange etc.

rfc/0001-cell-ranges.md

rhazn · 2023-01-20T09:48:31Z

rfc/0001-cell-ranges.md

+
+block MyTableBuilder oftype TableBuilder {
+  sheetLayout: MyLayout;
+  columns: {


I don't think this is a good idea, it seems very inspired from the one use case of CSV data into tables. Imho, we should look how to model a table/structured data (probably by defining attributes and their types). And only then model a mapping between the data from a CSV file and it's layouts to that table structure.

That would make the table model easier:

you'd not have to reference layouts or table ranges

you'd also avoid the double meaning of column that occurs here (there is an attribute columns that refers to the columns of the table but also columns inside a layout are used).

It also means the table model is reusable for other, future, data, like actual DB loaders or saving entities from an object store as rows in a table.

rhazn · 2023-01-20T09:49:46Z

rfc/0001-cell-ranges.md

+
+block MyTableBuilder oftype TableBuilder {
+  sheetLayout: MyLayout;
+  columns: {


For clarity, this could still mean we have a TableBuilder block, just that it maps from a CSV file with a layout to some definition of a table.

rhazn · 2023-01-20T10:05:30Z

rfc/0001-cell-ranges.md

+#### Example for a common CSV file layout
+
+```jayvee
+layout CommonCSVLayout {


I am struggling with the concept of layouts vs. just considering partial selections from a CSV file as 2d data structures. If our internal datatypes are scalars, 1d (tuples with named scalars) and 2d (sheets with named columns and tuples as rows), why not map directly into those.

Consider describing a file with a header row, 2 useless rows and three columns, you only care/know about the type of column 1 and 2:

block MySelection oftype TableSelection { header: row 1; ignoredRows: row 2, *3:*3; column A typeof int; column C typeof string; }

You could put header to undefined to not have a header or add an array of strings there in the future. This keeps your syntax of describing cell ranges in a table but uses it differently. This also directly maps into a DB structure.

I think this has the elegance of just working on tabular data, emitting a new table. That way you can select parts that can overlap, chain selections etc. It also allows us to reuse this selection for all kinds of tabular data. If you'd ever want to use the first values in a row as a header for example, I think it is much more intuitive to just transpose a table than to define some cellrange you then set as header.

A problem I see with that syntax is, that the TableSelection block type no longer follows the usual pattern. The column statements are no regular block attributes (attributeName: attributeValue;), so we would need an exception for them in the grammar.

Another concern is the mapping between headers and columns. How does the block know which header belongs to which column? E.g. does A1 belong to column A and B1 to column C? How would you specify that C1 belongs to column C when you omit column B (as you did in the example)?

We somehow had this "multiple attributes make an array" syntax before in our prototypes, maybe we can do something like:

block MySelection oftype TableSelection { header: row 1; ignoredRows: row 2, *3:*3; column: A typeof int; column: C typeof string; }

The mapping would be very basic here, you take out of the table any header and ignored rows (so column A is just A3:A* here). Then A-Z are just indices into the columns, the header row becomes a tuple of names. So if the header is ["name1", "name2", "name3"] this would map name1 to the first column (A), name3 to the third (C). The second column would have the name name2 and no type (type "any"?).

This does not allow us to move columns to the left/right (change their index) but I think that could be another block, like transform or other reorganization of tabular data. This is basically a filter and selection of header, other blocks would be transforms.

Now this also implies, that only entire column selectors are permitted and that the syntactic sugar version needs to be used for the selection, right?

Yes, though note that this is explicitly not cell ranges, this is modeling a (sub-)table. That is why it only has headers and columns, that describes the table structure. The rows are assumed to be data. So this moves a bit away from the cell ranges concept into an alternative suggestion for the very free form layout you described.

Syntactic sugar being enforced I disagree, I have nothing against column: A1:A* typeof int;, I would consider that semantically equivalent. But I'd enforce the cell range there to be a column without an end, yes.

Your approach seems to me like the current language version with layouts, but in-lining them into a block and only permitting restrictive cell ranges (i.e. columns and rows without ends, which we are already capable of). The ability to mark ignored rows is definitely a helpful new feature though.

So why should we discuss and introduce fine-granular cell ranges if a user would not be able to use them for defining a table? In my perception, that's the whole point of this RFC.

Maybe the fine-granular cell range concept is not really necessary for CSV files due to their (usual) table-like structure? Maybe this is something to consider when we start working with files from actual sheet software (Excel / Google Sheets) where a table-like structure is not necessarily implied by the file format?

Just for shared understanding, I see sheet software as the same as CSV files since a common use case will be CSV files exported from sheet software (I assume). To your point, I would say both sheet software and CSV files can be in a structure that is not table-like, but we should assume the most common example is they are.

I think fundamentally, I am arguing for accepting that 'tables'/'tuples with names attributes' (2D structures with named columns and rows of data) are a fundamental building block in data (pandas, sheet software, csv, database tables).

To repeat my understanding: What we do in this RFC is naming collections of cells, not these 2D structures. We then map these into these 2D structures in the table builder block.

I think it would already be very helpful for me to adopt special rows in the layout. Instead of only being able to define cell ranges in a layout, let us also define a header row and ignored rows in it. That would remove the weird, implicit logic of ignoring the header row values when selecting columns from their first value. We can just clearly state: By applying a layout to a tabular file, we remove the header and ignored rows from the data and use the cell ranges to name areas.

Then I'd be semi-happy (to be clear, semi-happy because I have no better idea, not because I have more hidden feedback) with using that in the tablebuilder block to create tables as you suggested. Though I think we should try to get to a good (maybe syntactic-sugar only) solution to reuse the csv header row as column names.

As a generic feedback to the RFC process, I think this is becoming very hard to reason over with the existing comments. Maybe we can have a Status header for the RFC document, call it DRAFT, DISCUSSION, ACCEPTED, add the DISCUSSION to this one and merge it? Then we can reopen a new PR with a clean slate for more discussion?

rhazn · 2023-01-20T10:09:55Z

rfc/0001-cell-ranges.md

+block MyTableBuilder oftype TableBuilder {
+  sheetLayout: MyLayout;
+  columns: {
+    "column1": MyRange oftype text;


Explicitly modeling columns for types will run into problems if we allow complex valuetypes that might map to multiple columns. I think that should be fine (we can save them as "columnname_" for example. But something to consider.

rhazn · 2023-01-20T10:14:34Z

As a last point, maybe this would be an easier discussion if we split of the cell range syntax from the rest. I think in any case we need syntax to select things from a table, so the row, column, range and raw A1:B2/1:1/*1:*1 syntax might be good to already finalize. A common usecase could also be syntax for row and column ranges, e.g. I want to throw out row 1-3, column A-C.

georg-schwarz · 2023-01-23T14:08:47Z

From what I read we are discussing on two levels: conceptual and syntax. From syntax perspective, I'm pretty sure we will find a good solution. From conceptual perspective, I see two very different ideas.

Modelling the CSV File with cell ranges.
Modelling the range of interest/output of a selection that leads to a table design in the database in the end.

I'm on the go right now, so I didn't think the following idea through, but I still want to share it in its premature state:

Why not have both?

LayoutValidator block that checks if a layout with cell ranges applies. Could be part of the CsvExtractor block for syntactic sugar as well. This would be modelling structure of the CSV file. For each cell, we would have (a) no information, (b) one type information, or (c) multiple type information from overlapping. The output would be a 2dim data structure, each cell with a datatype calculated from the type information given in the layout.
TableSelector block that chooses a sub-selection of the modelled CSV file. Inputs would be just a content cell range and header information (array of names as hard-coded or from range). The type information per column would be calculated bottom-up from the cell types.

Example

layout MyLayout {
  row 1 oftype text called Header;
  range A3:A* A oftype integer;
  range B3:D* oftype text;
  column F oftype boolean;
}

block MySelection oftype TableSelection {
  header: range A1:D3, cell F;
  content: range A3:D*, range F3:F*
}

felix-oq · 2023-01-25T10:31:04Z

With #112 being accepted, I close this PR and open a new one with a reworked version of this RFC. I'll keep the status DRAFT because I'll introduce some new changes to this RFC that haven't been discussed yet.

Publish RFC for cell ranges

5a74e8a

felix-oq requested review from georg-schwarz and rhazn and removed request for georg-schwarz January 19, 2023 11:07

georg-schwarz reviewed Jan 19, 2023

View reviewed changes

rfc/0001-cell-ranges.md Outdated Show resolved Hide resolved

georg-schwarz reviewed Jan 19, 2023

View reviewed changes

rhazn reviewed Jan 20, 2023

View reviewed changes

felix-oq added 3 commits January 20, 2023 11:54

Add range keyword

3cf6c59

Include an excerpt of the motivating example

272b828

Add * for denoting the last row or column

C95D

bfec91e

felix-oq mentioned this pull request Jan 23, 2023

Accepted: RFC for cell range syntax #112

Merged

felix-oq changed the title ~~RFC for cell ranges~~ Draft: RFC for cell ranges Jan 23, 2023

felix-oq closed this Jan 25, 2023

github-actions bot locked and limited conversation to collaborators Jan 25, 2023


		### Cell ranges in layouts

		A cell range consists of a cell selector and a name for the range. The syntax of the selector follows the pattern `from:to` where from and to refer to a particular cell. A particular cell is selected via its column and its row in the sheet. The name of a range can later be used for referring to the cells in that cell range.


		# Motivation

		Some sheet-like open data is badly structured. For example, see this [example from mobilithek.info](https://mobilithek.info/offers/-655945265921899037) which does not follow the usual CSV pattern (topmost row as header, followed by the actual data).

Draft: RFC for cell ranges #109

Draft: RFC for cell ranges #109

Uh oh!

Conversation

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Choose a reason for hiding this comment

Uh oh!

Uh oh!

Example

Uh oh!

Uh oh!

Uh oh!