-
Notifications
You must be signed in to change notification settings - Fork 15
Draft: RFC for cell ranges #109
New issue
Have a question about this project? Sign up for a free GitHub account to open an issue and contact its maintainers and the community.
By clicking “Sign up for GitHub”, you agree to our terms of service and privacy statement. We’ll occasionally send you account related emails.
Already on GitHub? Sign in to your account
Conversation
A2:B4 called MyRange; | ||
``` | ||
|
||
#### Selecting multiple adjacent columns / rows |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
What about non-adjacent cols? Maybe separated by a comma, e.g. B,D,F
?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Not sure whether this fits into the concept. How are non-adjacent columns treated semantically and why would a user want to do this? Even regular 2d cell ranges don't serve a purpose currently.
rfc/0001-cell-ranges.md
Outdated
|
||
### Cell ranges in layouts | ||
|
||
A cell range consists of a cell selector and a name for the range. The syntax of the selector follows the pattern `from:to` where from and to refer to a particular cell. A particular cell is selected via its column and its row in the sheet. The name of a range can later be used for referring to the cells in that cell range. |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Maybe we can give "hints" that this can be refactored to xy as there is a collision to make it more transparent to users.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We could do that, but my intention with the overlaps was to simplify common scenarios (e.g. layouts for CSV files).
Version with overlaps:
layout CSVLayout {
row 1 called Header;
column A called ColumnA;
column B called ColumnB;
}
Version that avoids overlaps:
layout CSVLayout {
row 1 called Header;
A2:A called ColumnA;
B2:B called ColumnB;
}
The version with overlaps can be modeled using just syntactic sugar syntax whereas the other one needs to select partial columns. This would become even more complicated when a user wants to exclude a row in the middle of the data.
|
||
block MyTableBuilder oftype TableBuilder { | ||
sheetLayout: MyLayout; | ||
columns: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think your idea to separate the range from the table format is a good idea. However, also a little non-intuitive referencing the ranges of the layout that is defined somewhere else (?)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A thought I had was to attach the layout to the sheet in the CSVFileExtractor
. Subsequent blocks which process the sheet would then be able to access cell ranges from the attached layout.
Another idea would be to use the cell range syntax inline, without any layout:
block MyTableBuilder oftype TableBuilder {
columns: {
"column1": column A oftype text;
"column2": B2:B, cell C1 oftype integer;
}
}
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea, it seems very inspired from the one use case of CSV data into tables. Imho, we should look how to model a table/structured data (probably by defining attributes and their types). And only then model a mapping between the data from a CSV file and it's layouts to that table structure.
That would make the table model easier:
- you'd not have to reference layouts or table ranges
- you'd also avoid the double meaning of column that occurs here (there is an attribute columns that refers to the columns of the table but also columns inside a layout are used).
It also means the table model is reusable for other, future, data, like actual DB loaders or saving entities from an object store as rows in a table.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, this could still mean we have a TableBuilder block, just that it maps from a CSV file with a layout to some definition of a table.
|
||
# Drawbacks | ||
|
||
- Header rows can't be used for column names |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Would be possible by a syntax like this:
cellvalue A1: MyRange oftype text;
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Another possibility I thought of is an array-like syntax such as Header[1]: MyRange oftype text
where Header
is also a 1-dimensional cell range.
Or having an attribute columnNames: Header;
in the TableBuilder
block. Not sure how to define the columns
attribute then, if the column names are already specified elsewhere.
sheetLayout: MyLayout; | ||
columns: { | ||
"column1": MyRange oftype text; | ||
"column2": MyPartialRange, MyCell oftype integer; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
"' or without?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
The reason for choosing "
was to indicate, that this is the column name that will be used when storing the table in a database and that the name does not have to follow the syntax for identifiers in Jayvee (alphanumeric with underscores). In case we want to reference table columns later on in the language, it might make sense to use identifiers and hence omit the "
.
|
||
- Header rows can't be used for column names | ||
- Unable to select cells using regular patterns (e.g. selecting every second row) | ||
- 2-dimensional ranges can be defined but currently serve no purpose |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think those ranges would get more interesting when using compound types or SI units or so. Nor sure, though, if we should think this far already?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yeah, that's why I noted in the alternatives section that we could forbid 2-dimensional ranges for now. The potential interactions with compound types are still to be discussed in the future.
block MyTableBuilder oftype TableBuilder { | ||
sheetLayout: MyLayout; | ||
columns: { | ||
"column1": MyRange oftype text; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Here could a "column" be also mapped to a row. Offering reading tables line-based instead of column-based is excellent. I'm not sure, though, how this will look in the resulting table then if we have mixes of rows, columns, and single cells xD
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
With great power comes great responsibility ;)
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I left some comments. For the layout discussion, I think this is a good step but I'd consider something like a subselection of data from tables instead. I hope I could describe that adequately.
|
||
# Motivation | ||
|
||
Some sheet-like open data is badly structured. For example, see this [example from mobilithek.info](https://mobilithek.info/offers/-655945265921899037) which does not follow the usual CSV pattern (topmost row as header, followed by the actual data). |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I think adding parts of the example (e.g. the rows that are non-conform + one row of data) would make this RFC easier to read and futureproof when the link stops working.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
In general this would be easier to follow on an actual example, not made up data with MyRange
etc.
|
||
block MyTableBuilder oftype TableBuilder { | ||
sheetLayout: MyLayout; | ||
columns: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I don't think this is a good idea, it seems very inspired from the one use case of CSV data into tables. Imho, we should look how to model a table/structured data (probably by defining attributes and their types). And only then model a mapping between the data from a CSV file and it's layouts to that table structure.
That would make the table model easier:
- you'd not have to reference layouts or table ranges
- you'd also avoid the double meaning of column that occurs here (there is an attribute columns that refers to the columns of the table but also columns inside a layout are used).
It also means the table model is reusable for other, future, data, like actual DB loaders or saving entities from an object store as rows in a table.
F438 |
|
|
block MyTableBuilder oftype TableBuilder { | ||
sheetLayout: MyLayout; | ||
columns: { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
For clarity, this could still mean we have a TableBuilder block, just that it maps from a CSV file with a layout to some definition of a table.
#### Example for a common CSV file layout | ||
|
||
```jayvee | ||
layout CommonCSVLayout { |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
I am struggling with the concept of layouts vs. just considering partial selections from a CSV file as 2d data structures. If our internal datatypes are scalars, 1d (tuples with named scalars) and 2d (sheets with named columns and tuples as rows), why not map directly into those.
Consider describing a file with a header row, 2 useless rows and three columns, you only care/know about the type of column 1 and 2:
block MySelection oftype TableSelection {
header: row 1;
ignoredRows: row 2, *3:*3;
column A typeof int;
column C typeof string;
}
You could put header to undefined to not have a header or add an array of strings there in the future. This keeps your syntax of describing cell ranges in a table but uses it differently. This also directly maps into a DB structure.
I think this has the elegance of just working on tabular data, emitting a new table. That way you can select parts that can overlap, chain selections etc. It also allows us to reuse this selection for all kinds of tabular data. If you'd ever want to use the first values in a row as a header for example, I think it is much more intuitive to just transpose a table than to define some cellrange you then set as header.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
A problem I see with that syntax is, that the TableSelection
block type no longer follows the usual pattern. The column statements are no regular block attributes (attributeName: attributeValue;
), so we would need an exception for them in the grammar.
Another concern is the mapping between headers and columns. How does the block know which header belongs to which column? E.g. does A1 belong to column A and B1 to column C? How would you specify that C1 belongs to column C when you omit column B (as you did in the example)?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
We somehow had this "multiple attributes make an array" syntax before in our prototypes, maybe we can do something like:
block MySelection oftype TableSelection {
header: row 1;
ignoredRows: row 2, *3:*3;
column: A typeof int;
column: C typeof string;
}
The mapping would be very basic here, you take out of the table any header and ignored rows (so column A is just A3:A* here). Then A-Z are just indices into the columns, the header row becomes a tuple of names. So if the header is ["name1", "name2", "name3"] this would map name1 to the first column (A), name3 to the third (C). The second column would have the name name2 and no type (type "any"?).
This does not allow us to move columns to the left/right (change their index) but I think that could be another block, like transform or other reorganization of tabular data. This is basically a filter and selection of header, other blocks would be transforms.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Now this also implies, that only entire column selectors are permitted and that the syntactic sugar version needs to be used for the selection, right?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Yes, though note that this is explicitly not cell ranges, this is modeling a (sub-)table. That is why it only has headers and columns, that describes the table structure. The rows are assumed to be data. So this moves a bit away from the cell ranges concept into an alternative suggestion for the very free form layout you described.
Syntactic sugar being enforced I disagree, I have nothing against column: A1:A* typeof int;
, I would consider that semantically equivalent. But I'd enforce the cell range there to be a column without an end, yes.
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Your approach seems to me like the current language version with layouts, but in-lining them into a block and only permitting restrictive cell ranges (i.e. columns and rows without ends, which we are already capable of). The ability to mark ignored rows is definitely a helpful new feature though.
So why should we discuss and introduce fine-granular cell ranges if a user would not be able to use them for defining a table? In my perception, that's the whole point of this RFC.
Maybe the fine-granular cell range concept is not really necessary for CSV files due to their (usual) table-like structure? Maybe this is something to consider when we start working with files from actual sheet software (Excel / Google Sheets) where a table-like structure is not necessarily implied by the file format?
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Just for shared understanding, I see sheet software as the same as CSV files since a common use case will be CSV files exported from sheet software (I assume). To your point, I would say both sheet software and CSV files can be in a structure that is not table-like, but we should assume the most common example is they are.
I think fundamentally, I am arguing for accepting that 'tables'/'tuples with names attributes' (2D structures with named columns and rows of data) are a fundamental building block in data (pandas, sheet software, csv, database tables).
To repeat my understanding: What we do in this RFC is naming collections of cells, not these 2D structures. We then map these into these 2D structures in the table builder block.
I think it would already be very helpful for me to adopt special rows in the layout. Instead of only being able to define cell ranges in a layout, let us also define a header row and ignored rows in it. That would remove the weird, implicit logic of ignoring the header row values when selecting columns from their first value. We can just clearly state: By applying a layout to a tabular file, we remove the header and ignored rows from the data and use the cell ranges to name areas.
Then I'd be semi-happy (to be clear, semi-happy because I have no better idea, not because I have more hidden feedback) with using that in the tablebuilder block to create tables as you suggested. Though I think we should try to get to a good (maybe syntactic-sugar only) solution to reuse the csv header row as column names.
As a generic feedback to the RFC process, I think this is becoming very hard to reason over with the existing comments. Maybe we can have a Status
header for the RFC document, call it DRAFT
, DISCUSSION
, ACCEPTED
, add the DISCUSSION
to this one and merge it? Then we can reopen a new PR with a clean slate for more discussion?
block MyTableBuilder oftype TableBuilder { | ||
sheetLayout: MyLayout; | ||
columns: { | ||
"column1": MyRange oftype text; |
There was a problem hiding this comment.
Choose a reason for hiding this comment
The reason will be displayed to describe this comment to others. Learn more.
Explicitly modeling columns for types will run into problems if we allow complex valuetypes that might map to multiple columns. I think that should be fine (we can save them as "columnname_" for example. But something to consider.
As a last point, maybe this would be an easier discussion if we split of the cell range syntax from the rest. I think in any case we need syntax to select things from a table, so the |
From what I read we are discussing on two levels: conceptual and syntax. From syntax perspective, I'm pretty sure we will find a good solution. From conceptual perspective, I see two very different ideas.
I'm on the go right now, so I didn't think the following idea through, but I still want to share it in its premature state: Why not have both?
Example
|
With #112 being accepted, I close this PR and open a new one with a reworked version of this RFC. I'll keep the status |
Part of #85