
Little things that matter in language design

June 8, 2013

This article was contributed by Neil Brown

The designers of a new programming language are probably most interested in the big features — the things that just couldn't be done with whichever language they are trying to escape from. So they are probably thinking of the type system, the data model, the concurrency support, the approach to polymorphism, or whatever it is that they feel will affect the expressiveness of the language in the way they want.

There is a good chance they will also have a pet peeve about syntax, whether it relates to the exact meaning of the humble semicolon, or some abhorrent feature such as the C conditional expression which (they feel) should never be allowed to see the light of day again. However, designing a language requires more than just addressing the things you care about. It requires making a wide range of decisions concerning various sorts of abstractions, and making sure the choices all fit together into a coherent, and hopefully consistent, whole.

One might hope that, with over half a century of language development behind us, there would be some established norms which can be simply taken as "best practice" without further concern. While this is true to an extent, there appears to be plenty of room for languages to diverge even on apparently simple concepts.

Having begun an exploration of the relatively new languages Rust and Go and, in particular, having two languages to provide illuminating contrasts, it seems apropos to examine some of those language features that we might think should be uncontroversial to see just how uniform they have, or have not, become.

Comments

When first coming to C [PDF] from Pascal, the usage of braces can be a bit of a surprise. While Pascal sees them as one option for enclosing comments, C sees them as a means of grouping statements. This harsh conflict between the languages is bound to cause confusion, or at least a little friction, when moving from one language to the next, but fortunately appears to be a thing of the past.

One last vestige of this sort of confusion can be seen in the configuration files for BIND, the Berkeley Internet Name Daemon. In the BIND configuration files semicolons are used as statement terminators while in the database files they introduce comments.

When not hampered by standards conformance as these database files are, many languages have settled on C-style block comments:

   /* This is a comment */

and C++-style one-line comments:

   // This line has a comment

these having won over from the other Pascal option of:

   (* similar but different block comments *)

and Ada's:

   -- again a similar yet different single line comment.

The other popular alternative is to start comments with a "#" character, which is a style championed by the C-shell and Bourne shell, and consequently used by many scripting languages. Thankfully the idea of starting a comment with "COMMENT" and ending with "TNEMMOC" never really took off and may be entirely apocryphal.

Both Rust and Go have embraced these trends, though not as fully as BIND configuration files and other languages like Crack which allow all three (/* */, //, #). Rust and Go only support the C and C++ styles.

Go doesn't use the "#" character at all, allowing it only inside comments and string constants, so it is available as a comment character for a future revision, or maybe for something else.

Rust has another use for "#" which is slightly reminiscent of its use by the preprocessor in C. The construct:

  #[attribute....]

attaches arbitrary metadata to nearby parts of the program which can enable or disable compiler warnings, guide conditional compilation, specify a license, or any of various other things.

Identifiers

Identifiers are even more standard than comments. Any combination of letters, digits, and the underscore that does not start with a digit is usually acceptable as an identifier providing it hasn't already been claimed as a reserved word (like if or while).

With the increasing awareness of languages and writing systems other than English, UTF-8 is more broadly supported in programming languages these days. This extends the range of characters that can go into an identifier, though different languages extend it differently.

Unicode defines a category for every character, and Go simply extends the definition given above to allow "Unicode letter" (which has 5 sub-categories: uppercase, lowercase, titlecase, modifier, and other) and "Unicode decimal digit" (which is one of 3 sub-categories of "Number", the others being "Number,letter" and "Number,other") to be combined with the underscore. The Go FAQ suggests this definition may be extended depending on how standardization efforts progress.

Rust gives a hint of what these efforts may look like by delegating the task of determining valid identifiers to the Unicode standard. The Unicode Standard Annex #31 defines two character classes, "ID_Start" and "ID_Continue", that can be used to form identifiers in a standard way. The Annex offers these as a resource, rather than imposing them as a standard, and acknowledges that particular use cases may extend them in various ways. It particularly highlights that some languages like to allow identifiers to start with an underscore, which ID_Start does not contain. The particular rule used by Rust is to allow an identifier to start with an ASCII letter, underscore, or any ID_Start character, and to continue with ASCII letters, ASCII digits, underscores, or Unicode ID_Continue characters.

Allowing Unicode can introduce interesting issues if case is significant, as Unicode supports three cases (upper, lower, and title) and also supports characters without case. Most programming languages very sensibly have no understanding of case and treat two characters of different case as different characters, with no attempt to fold case or have a canonical representation. Go, however, does pay some attention to case.

In Go, identifiers where the first character is an uppercase letter are treated differently in terms of visibility between packages. A name defined in one package is only exported to other packages if it starts with an uppercase letter. This suggests that writing systems without case, such as Chinese, cannot be used to name exported identifiers without some sort of non-Chinese uppercase prefix. The Go FAQ acknowledges this weakness but shows a strong reluctance to give up the significance of case in exports.

Numbers

Numbers don't face any new issues with Unicode, though possibly that is just due to continued English parochialism, as Unicode does contain a complete set of Roman numerals as well as those from more current numeral systems. So you might think that numbers would be fairly well standardized by now. To a large extent they are, but there still seems to be wiggle room.

Numbers can be integers or, with a decimal point or exponent suffix (e.g. "1.0e10"), floating point. Integers can be expressed in decimal, octal with a leading "0", or hexadecimal with a leading "0x".

In C99 and D [PDF], floating point numbers can also be hexadecimal. The exponent suffix must then have a "p" rather than "e" and gives a power of two expressed in decimal. This allows precise specification of floating point numbers without any risk of conversion errors. D (and GCC, as an extension to C) also allows a "0b" prefix on integers to indicate a binary representation (e.g. "0b101010"), and D allows underscores to be sprinkled through numbers to improve readability, so 1_000_000_000 is clearly the same value as 1e9.

Neither Rust nor Go has included hexadecimal floats. While Rust has included binary integers and the underscore spacing character, Go has left these out.

Another subtlety is that while C, D, Go, and many other languages allow a floating point number to start with a period (e.g. ".314159e1"), Rust does not. All numbers in Rust must start with a digit. There does not appear to be any syntactic ambiguity that would arise if a leading period were permitted, so this is presumably due to personal preference or accident.

In the language Virgil-III this choice is much clearer. Virgil has a fairly rich "tuple" concept [PDF] which provides a useful shorthand for a list of values. Members of a tuple can be accessed with a syntax similar to structure field references, only with a number rather than a name. So in:

    var x:(int, int) = (3, 4);
    var w:int = x.1;

the variable "w" is assigned the value "4", as it is element one of the tuple "x". Supporting this syntax while also allowing ".1" to be a floating point number would require the tokenizer to know when to report two tokens ("dot" and "int") and when to report just one ("float"). While possible, this would be clumsy.

Many fractional numbers (e.g. 0.75) will start with a zero even in languages which allow a leading period (.75). Unlike the case with integers, the leading zero does not mean these numbers are interpreted in base eight. For 0.75 this is unlikely to cause confusion. For 0777.0 it might. Best practice for programmers would be to avoid the unnecessary digit in these cases and it would be nice if the language required that.

As well as prefixes, many languages allow suffixes on numbers, with a couple of different meanings. Those few languages which have "complex" as a built-in type need a syntax for specifying "imaginary" constants. Go, like D, uses an "i" suffix. Python uses "j". Spreadsheets like LibreOffice Calc or Microsoft Excel allow either "i" or "j". It is a pity more languages don't take that approach. Rust doesn't support native complex numbers, so it doesn't need to choose.

The other meaning of a suffix is to indicate the "size" of the value - how many bytes are expected to be used to store it. C and D allow u, l, ll, or f for unsigned, long, long long, and float, with a few combinations permitted. Rust allows u, u8, u16, u32, u64, i8, i16, i32, i64, f32, and f64 which cover much the same set of sizes, but are more explicit. Perhaps fortunately, i is not a permitted suffix, so there is room to add imaginary numbers in the future if that turned out to be useful.

Go takes a completely different approach to the sizing of constants. The language specification talks about "untyped" constants though this seems to be some strange usage of the word "untyped" that I wasn't previously aware of. There are in fact "untyped integer" constants, "untyped floating point" constants, and even "untyped boolean" constants, which seem like they are untyped types. A more accurate term might be "unsized constants with unnamed types" though that is a little cumbersome.

These "untyped" constants have two particular properties. They are calculated using high precision with overflow forbidden, and they can be transparently converted to a different type provided that the exact value can be represented in the target type. So "1e15" is an untyped floating point constant which can be used where an int64 is expected, but not where an int32 is expected, as it requires 50 bits to store as an integer.

The specification states that "Constant expressions are always evaluated exactly"; however, some edge cases are to be expected:

    print((1 + 1/1e130)-1, "\n")
    print(1/1e130, "\n")

results in:

     +9.016581e-131
     +1.000000e-130

so there does seem to be some limit to precision. Maintaining high precision and forbidding overflow means that there really is no need for size suffixes.

Strings

Everyone knows that strings are enclosed in single or double quotes. Or maybe backquotes (`) or triple quotes ('''). And that while they used to contain ASCII characters, UTF-8 is preferred these days. Except when it isn't, and UTF-16 or UTF-32 are needed.

Both Rust and Go, like C and others, use single quotes for characters and double quotes for strings, both with the standard set of escape sequences (though Rust inexplicably excludes \b, \v, \a, and \f). This set includes \uXXXX and \UXXXXXXXX so that all Unicode code-points can be expressed using pure ASCII program text.

Go chooses to refer to character constants as "runes" and provides the built-in type "rune" to store them. In C and related languages "char" is used both for ASCII characters and 8-bit values. It appears that the Go developers wanted a clean break with that and do not provide a char type at all. rune (presumably more aesthetic than wchar) stores (32-bit) Unicode characters while byte or uint8 stores 8-bit values.

Rust keeps the name char for 32-bit Unicode characters and introduces u8 for 8-bit values.

The modern trend seems to be to disallow literal newlines inside quoted strings, so that missing quote characters can be quickly detected by the compiler or interpreter. Go follows this trend and, like D, uses the back quote (rather than the Python triple-quote) to surround "raw" strings in which escapes are not recognized and newlines are permitted. Rust bucks the trend by allowing literal newlines in strings and does not provide for uninterpreted strings at all.

Both Rust and Go assume UTF-8. They do not support the prefixes of C (U"this is a string of 32bit characters") or the suffixes of D ("another string of 32bit chars"d), to declare a string to be a multibyte string.

Semicolons and expressions

The phrase "missing semicolon" still brings back memories from first-year computer science and learning Pascal. It was a running joke that whenever the lecturer asked "What does this code fragment do?" someone would call out "missing semicolon", and they were right more often than you would think.

In Pascal, a semicolon separates statements while in C it terminates some statements — if, for, while, switch and compound statements do not require a semicolon. Neither rule is particularly difficult to get used to, but both often require semicolons at the end of lines that can look unnecessary.

Go follows Pascal in that semicolons separate statements — every pair of statements must be separated. A semicolon is not needed before the "}" at the end of a block, though it is permitted there. Go also follows the pattern seen in Python and JavaScript where the semicolon is sometimes assumed at the end of a line (when a newline character is seen). The details of this "sometimes" are quite different between languages.

In Go, the insertion of semicolons happens during "lexical analysis", which is the step of language processing that breaks the stream of characters into a stream of tokens (i.e. a tokenizer). If a newline is detected on a non-empty line and the last token on the line was one of:

  • an identifier,
  • one of the keywords break, continue, fallthrough, or return,
  • a numeric, rune, or string literal, or
  • one of the tokens ++, --, ), ], or }

then a semicolon is inserted at the location of the newline.

This imposes some style choices on the programmer such that:

   if some_test
   {
   	some_statement
   }

is not legal (the open brace must go on the same line as the condition), and:

   a = c
     + d
     + e

is not legal — the operation (+) must go at the end of the first line, not the start of the second.

In contrast to this, JavaScript waits until the "parsing" step of language processing when the sequence of tokens is gathered into syntactic units (statements, expressions, etc.) following a context free grammar. JavaScript will insert a semicolon, provided that semicolon would serve to terminate a non-empty statement, if:

  • it finds a newline in a location where the grammar forbids a newline, such as after the word "break" or before the postfix operator "++";
  • it finds a "}" or end-of-file that is not expected by the grammar; or
  • it finds any token that is not expected, which was separated from the previous token by at least one newline.

This often works well but brings its own share of style choices including the interesting suggestion to sometimes use a semicolon to start a statement.

While both of these approaches are workable, neither really seems ideal. They both force style choices which are rather arbitrary and seem designed to make life easy for the compiler rather than for the programmer.

Rust takes a very different approach to semicolons than Go or JavaScript or many other languages. Rather than making them less important and often unnecessary they are more important and have a significant semantic meaning.

One use involves the attributes mentioned earlier. When followed by a semicolon:

  #[some_attribute];

the attribute applies to the entity (e.g. the function or module) that the attribute appears within. When not followed by a semicolon, the attribute applies to the entity that follows it. A missing semicolon could certainly make a big difference here.

The primary use of semicolons in Rust is much like that in C — they are used to terminate expressions by turning the expressions into statements, discarding any result. The effect is really quite different from C because of a related difference: many things that C considers to be statements, Rust considers to be expressions. A simple example is the if expression.

    a = if b == c  { 4 } else { 5 };

Here the if expression returns either "4" or "5", which is stored in "a".

A block, enclosed in braces ({ }), typically includes a sequence of expressions with semicolons separating them. If the last expression is also followed by a semicolon, then the block-expression as a whole does not have a value — that last semicolon discards the final value. If the last expression is not followed by a semicolon, then the value of the block is the value of the last expression.

If this completely summed up the use of semicolons it would produce some undesirable requirements.

    if condition {
        expression1;
    } else {
        expression2;
    }
    expression3;

This would not be permitted as there is no semicolon to discard the value of the if expression before expression3. Having a semicolon after the last closing brace would be ugly, and that if expression doesn't actually return a value anyway (both internal expressions are terminated with a semicolon) so the language does not require the ugly semicolon and the above is valid Rust code. If the internal expression did return a value, for example if the internal semicolons were missing, then a semicolon would be required before expression3.

Following this line of reasoning leads to an interesting result.

    if condition {
    	function1()
    } else {
    	function2()
    }
    expression3;

Is this code correct or is there a missing semicolon? To know the answer you need to know the types of the functions. If they do not return a value, then the code is correct. If they do, a semicolon is needed, either one at the end of the whole "if" expression, or one after each function call. So in Rust, we need to evaluate the types of expressions before we can be sure of correct semicolon usage in every case.

Now the above is probably just a silly example, and no one would ever write code like that, at least not deliberately. But the rules do seem to add an unnecessary complexity to the language, and the task of programming is complex enough as it is — adding more complexity through subtle language rules is not likely to help.

Possibly a bigger problem is that any tool that wishes to accurately analyze the syntax of a program needs to perform a complete type analysis. It is a known problem that the correct parsing of C code requires you to know which identifiers are typedefs and which are not. Rust isn't quite that bad as missing type information wouldn't lead to an incorrect parse, but at the very least it is a potential source of confusion.

Return

A final example of divergence on the little issues, though perhaps not quite so little as the others, can be found in returning values from functions using a return statement. Both Rust and Go support the traditional return and both allow multiple values to be returned: Go by simply allowing a list of return types, Rust through the "tuple" type which allows easy anonymous structures. Each language has its own variation on this theme.

If we look at the half million return statements in the Linux kernel, nearly 35,000 of them return a variable called "ret", "retval", "retn", or similar, and a further 20,000 return "err", "error", or similar. Together these account for more than 10% of the uses of return in the kernel. This suggests that there is often a need to declare a variable to hold the intended result of a function, rather than to just return a result as soon as it is known.

Go acknowledges this need by allowing the signature of a function to give names to the return values as well as the parameter values:

    func open(filename string, flags int) (fd int, err int)

Here the (hypothetical) open() function returns two integers named fd (the file descriptor) and err. This provides useful documentation of the meaning of the return values (assuming programmers can be more creative than "retval") and also declares variables with the given names. These can be set whenever convenient in the code of the function and a simple:

    return

with no expressions listed will use the values in those variables. Go requires that this return be present, even if it lists no values and is at the end of the function, which seems a little unnecessary, but isn't too burdensome.

There is evidence [YouTube] that some Go developers are not completely comfortable with this feature, though it isn't clear whether the feature itself is a problem, or rather the interplay with other features of Go.

Rust's variation on this theme we have already glimpsed with the observation that Rust has "expressions" in preference to "statements". The whole body of a function can be viewed as an expression and, provided it doesn't end with a semicolon, the value produced by that expression is the value returned from the function. The word return is not needed at all, though it is available and an explicit return expression within the function body will cause an early return with the given value.

Conclusion

There are many other little details, but this survey provides a good sampling of the many decisions that a language designer needs to make even after they have made the important decisions that shape the general utility of the language. There certainly are standards that are appearing and broadly being adhered to, such as for comments and identifiers, but it is a little disappointing that there is still such variability concerning the available representations of numbers and strings.

The story of semicolons and statement separation is clearly not a story we've heard the end of yet. While it is good to see language designers exploring the options, none of the approaches explored above seem entirely satisfactory. Treating a line-break as distinct from other kinds of white space is a clear acknowledgment that the two-dimensional appearance of the code has relevance for parsing it. It is therefore a little surprising that we don't see the line indent playing a bigger role in the interpretation of code. The particular rules used by Python may not be to everyone's liking, but the principle of making use of this very obvious aspect of a program seems sound.

We cannot expect ever to converge on a single language that suits every programmer and every task, but the more uniformity we can find on the little details, the easier it will be for programmers to move from language to language and maximize their productivity.


Index entries for this article
Guest Articles: Brown, Neil



Perl numeric constants

Posted Jun 8, 2013 1:22 UTC (Sat) by dskoll (subscriber, #1630) (7 responses)

I realize Perl is no longer cool and wasn't mentioned in the article, but it has a fairly nice extension for numeric constants. You can write big numbers like 5429874625 as 5_429_874_625 which makes them significantly more pleasant for humans to parse.

Perl numeric constants

Posted Jun 8, 2013 8:21 UTC (Sat) by rahulsundaram (subscriber, #21946) (1 response)

Java can do this too without any extensions.

http://docs.oracle.com/javase/tutorial/java/nutsandbolts/...

Perl numeric constants

Posted Jun 10, 2013 11:09 UTC (Mon) by niner (subscriber, #26151)

To be clear: it's not an extension in Perl either. Perl has supported underscores in number literals since version 5.000 in the year 1994.

Perl numeric constants

Posted Jun 8, 2013 8:41 UTC (Sat) by eru (subscriber, #2753) (2 responses)

Several old languages allowed such "noise characters" to be inserted arbitrarily for supposed readability. For example, in PL/M you can insert a $ into identifiers and numbers, and it is ignored (100$000 = 100000, FO$O = FOO). I have always wondered why the designers of the language picked $, which looks a lot like a letter.

Perl numeric constants

Posted Jun 8, 2013 12:53 UTC (Sat) by dark (guest, #8483) (1 response)

I've worked on a variant of Pascal that allowed underscores freely in identifiers, and the underscores were ignored. So, similar to what you describe except not usable in numbers.

It was actually a pain to work with since I could never just grep for an identifier and be sure I got all uses. I considered implementing an --ignore-underscore flag for GNU grep but after the project was done I no longer felt the need :)

Perl numeric constants

Posted Jun 8, 2013 15:02 UTC (Sat) by dskoll (subscriber, #1630)

Ignored underscores in identifiers is a terrible idea for the reason you mentioned (non-greppableness). However, in large numbers I like it; you are unlikely to want to grep a number and typically you'd use it only in one place like this:

use constant SOME_NUMBER => 1_234_345_837;

Perl numeric constants

Posted Jun 8, 2013 10:38 UTC (Sat) by andreasb (guest, #80258) (1 response)

Perl was not mentioned, but the underscore-in-literals thing was mentioned as being supported by D.

Ada also allows underscores in numeric literals, while we're at it.

Perl numeric constants

Posted Jun 8, 2013 18:11 UTC (Sat) by dvdeug (guest, #10998)

The underscores in numeric literals is one of the things I miss from Ada when I program in other languages; it's hard to tell how big 12345678 is at a glance, but 12_345_678 is entirely clear.

Little things that matter in language design

Posted Jun 8, 2013 2:24 UTC (Sat) by Richard_J_Neill (subscriber, #23093) (29 responses)

Please could we have an alternate form for octal? In the same way that "0x" means hex and (in some languages) "0b" means binary, perhaps "0o" (letter o) could be the new octal specifier. Then over 5 years we could transition to the new form when octal is specifically required, and eventually deprecate the interpretation of leading-zero as octal. Imho, treating leading-zero as octal is confusing, conflicts with the mathematical approach (leading 0's are ignored), is bug-bait, and is almost never useful; I'd like to see it trigger a compiler warning.

It would also be wonderful if compilers could track-back with error messages. For example, a missing } somewhere in the middle of the program will usually only throw an error on the last line of the file. It would be far more helpful to report the line number of the opening { that wasn't ever closed.

Little things that matter in language design

Posted Jun 8, 2013 2:46 UTC (Sat) by nlucas (guest, #33793)

+1

I learned octal much later than hexadecimal, and the only use I have for it is when creating files.

Even though I have coded in C since 1990, once or twice a year I still get bitten by "octal bugs". My brain forgets that C numbers and math numbers are not the same.

OTOH, trigraph warnings occur more or less with the same frequency. If they haven't been disabled by now, this is a lost cause...

Little things that matter in language design

Posted Jun 8, 2013 3:15 UTC (Sat) by geofft (subscriber, #59789) (17 responses)

The trouble with that is old source code; doing that for C seems like a huge pain.

Of course, that's an easy change to make right now in a new language. Python 3 (which is effectively a new language, for this purpose, since it explicitly disclaims syntax compatibility) considers a leading 0 a parse error, and requires 0o, just as you describe.

Language UNparseability

Posted Jun 8, 2013 6:37 UTC (Sat) by smurf (subscriber, #17840) (16 responses)

IMHO the fact that a language cannot be parsed without semantic analysis automatically disqualifies it for anything I would want to do – again, didn't we learn from Perl or C (and no, I'm not just talking about the Obfuscated C|Perl Contests) why that is a bad idea?

Also, given the plethora of bugs C's 0-denotes-octal design mistake has caused, IMHO new languages should learn from that mistake and require 0o.

Language UNparseability

Posted Jun 8, 2013 7:19 UTC (Sat) by jzbiciak (guest, #5246)

I'm with you on the octal thing. Does modern GCC have a -Woctal that warns when you use a leading-zero octal constant? If so, I think I'll make it part of my standard compiler warning flag set.

Language UNparseability

Posted Jun 8, 2013 10:21 UTC (Sat) by oever (guest, #987) (11 responses)

I would go further and say that a language that does not have a runtime model which can easily be used to investigate the source code is disqualified as a serious programming language.

If I want to list all functions in a collection of source code, the language tools should make this easy. Using grep for this is imprecise and error-prone.

Most language design centers on silly things like serialization and syntactic sugar. The way the instructions are shown on screen is not nearly as important as the conceptual cleanness of the language and the ability to automatically reason about the software.

Language UNparseability

Posted Jun 8, 2013 18:28 UTC (Sat) by khim (subscriber, #9252) (7 responses)

The way the instructions are shown on screen is not nearly as important as the conceptual cleanness of the language and the ability to automatically reason about the software.

If this is true, then why is C++ so popular and why are LISP dialects so rarely used?

Language UNparseability

Posted Jun 8, 2013 22:37 UTC (Sat) by oever (guest, #987) (6 responses)

If C++ is so awesome, why is everybody writing apps in JavaScript?

Ok, let's not be snarky. C++ is still popular because it is low level and can be used to create fast code. There is also a network effect: people will be trained in languages that are used a lot. The network effect is also apparent in the details of programming languages, as the parent article points out; the way comments are written is often similar across programming languages to make it easier for programmers to read code in a new language.

One of the names for the field in which programmers work is 'automation'. Automation of repetitive tasks, such as performing a multitude of logical tests on source code, is something that should come naturally to workers in the field of automation. And yet the very thing we work on most, the source code, is not easy to automate at all. The syntax rules of most programming languages are quite intricate and take a while to learn, and there is usually no algorithm library to help parse and automate tasks on the source code. C and C++ rank high in the list of offenders because combinations of macros and includes make it impossible to parse a source code file without knowing the include directories.

Language UNparseability

Posted Jun 9, 2013 1:18 UTC (Sun) by khim (subscriber, #9252) (3 responses)

If C++ is so awesome, why is everybody writing apps in JavaScript?

Because C++ programs don't run in the browser, obviously. If JavaScript were indeed a better way to write programs, and not just a more convenient way to deliver them to the end user, then we would not see so many projects which try to somehow make a sane language from that abomination (starting from CoffeeScript/TypeScript and ending with Emscripten/asm.js).

And yet, the very thing we work on most, the source code, is not easy to automate at all.

And that is a good thing. There are some languages where such automation is possible (Lisp dialects and C#/Java are among them). C#/Java tend to provide the tools you so crave, which leads to utter disaster: pointless churn quickly produces programs which no one understands (the original code was transformed so many times by "automated tasks" that the original meaning was mixed with so many changes that it's basically impossible to understand what goes on where in the code; your only hope is unit tests, which help to produce something sensible, but the idea that you can actually find and fix all the errors in the code is considered blasphemy). Lisp development does not favor such tricks at all: instead it transforms text written by the developer (which is considered sacred) at runtime. Works much better. And guess what? For that style of work it does not really matter if your language is easily parseable/automatable: lex, yacc, and other such tools are happy with any language.

C and C++ rank high in the list of offenders because combinations of macros and includes make it impossible to parse a source code file without knowing the include directories.

Right - but why is that a bad thing? If you feel the need to perform some kind of automatic surgery on the "source code" then it just means that you've chosen badly, and some pieces which should be either kept in a database or generated at compile time or runtime are stored in the sources instead.

Early on, both C and C++ quickly evolved to make sure the tools you talk about would not ever be needed - and that was a good thing. But in the last decades they have been essentially frozen, which has made them inadequate for today's requirements. Well, perhaps it's time to do something about that and not try to add bandaids upon bandaids?

Language UNparseability

Posted Jun 9, 2013 2:47 UTC (Sun) by rahulsundaram (subscriber, #21946) [Link]

C++ is hardly frozen anymore. C++11 has been implemented just recently by both GCC and Clang and C++14 is expected to finish off some leftover parts of C++11 and C++17 is planned after that. If you haven't looked at C++ in a while, now would be a great time to do it.

Language UNparseability

Posted Jun 9, 2013 9:52 UTC (Sun) by oever (guest, #987) [Link]

The main uses for an API accompanying a programming language are not so much transformation and refactoring, although on-the-fly adaptation of whitespace to the reader's preferences, while storing the code in a project-specific canonical serialization, would be nice.

Analysis and generation of code are the main uses. If custom checks for a rule in code, e.g. "only use RAII, and no isolated new/malloc" or "always use 'const' on variables that are not changed", can be written easily, then keeping project code clean, readable, and fault-free is simpler.

Generating code is a common use-case. Consider for example an application that accesses an SQL database. Instead of hand-writing type-unsafe SQL statements everywhere in the code, one could generate type-safe classes from which the SQL statements are generated at compile time (C++11 templates cannot concatenate strings at compile time easily, so a separate code-generation step is preferred). In such a scenario, having an API is much more comfortable and less error-prone than writing strings.

Language UNparseability

Posted Jun 10, 2013 0:10 UTC (Mon) by marcH (subscriber, #57642) [Link]

> > And yet, the very thing we work on most, the source code, is not easy to automate at all.

> And that is good thing. There are some languages where such automation is possible (lisp dialects and C#/Java are among them). C#/Java tend to provide the tools you so crave which leads to utter disaster: pointless churn quickly produces programs which noone understands

What are you on?

Language UNparseability

Posted Jun 20, 2013 22:02 UTC (Thu) by VITTUIX-MAN (guest, #82895) [Link] (1 responses)

You mentioned the word automation.

You know, in the industrial automation field we have a whole bunch of languages that are both unique and don't really shine in their design (or what do I know; surely all the features must be well justified), but uniformity is sadly not among those features.

First of all there's the venerable IEC 61131-3, which defines 5 languages (originally 3), the originals being "function block trees", "ladder diagrams", and a macro assembler for a two-accumulator machine that carefully avoids indirect access to variables. There is simply no means of getting the address of some variable, or getting a value by address, save defining an array (which the language supports) as big as the whole memory, though readability would suffer a bit...

Beyond those languages, we have a whole bunch of different BASIC dialects, especially in robotics, meaning there are the lovable control structures such as the multi-label ON GOTO and ON GOSUB, the FOR ... TO ... STEP ... NEXT loop, and so on, and if one is really lucky, variables can be defined with DIM. No safe subroutines; what would one even do with them when the program can only be 999 lines long? The FOR loop has an interesting restriction: using a GOTO statement to escape it is forbidden, for reasons unknown.

One idiom that seems to be common in that kind of environment is that the variables that can be used are quite limited, and one is expected to manually assign variables to corresponding memory addresses, though the IDE does most of it automatically these days. One gets a total of maybe a few tens of kilobytes of memory, even if the said system runs on XP Embedded!

In SLIM by Nachi-Fujikoshi the variables are arranged in a particularly cool way: V$1 is a global variable (V) of string type ($) number 1 (out of 50). However, one may also write V$[1], and 1 may be another variable, thus allowing indexing through the string segment. Brilliant. L$1 would be a string from the local per-process string segment (another 50 strings right there!), though one has to keep in mind that these don't work with all the commands. For example, the socket-related commands only allow global variables as arguments. The command reference of course does not mention which commands require the use of global segments, leaving the joy of discovery to the user.

There are also variables (not power-failure safe) available via the DIM statement, but there is a catch: they don't work with any statements at all! All "DIM variables" are good for is arithmetic.

If one wants to have named variables, they are usually supported by a find-and-replace operation performed by a preprocessor, meaning there's a separate variable include file that assigns names corresponding to SLIM-style variable literals and constants. As I said, it's a search and replace, so one has to be careful not to define a variable that is part of some longer identifier, such as a command name. Say one has a constant definition "SOCK,9600" and there is a command SOCKBIND; it just got replaced and is now 9600BIND, making the compiler halt and die. Beautiful.

Language UNparseability

Posted Jun 25, 2013 20:18 UTC (Tue) by nix (subscriber, #2304) [Link]

Is there a language version of thedailywtf? 'cos if there is, this needs to go there, stat.

Language UNparseability

Posted Jun 8, 2013 18:43 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (2 responses)

Most language design centers on silly things like the serialization, the syntactic sugar.

Naturally. Programming languages are designed for the convenience of meatbags, not the convenience of boxes of thinking rock.

Language UNparseability

Posted Jun 9, 2013 6:22 UTC (Sun) by smurf (subscriber, #17840) [Link] (1 responses)

I beg to differ; so far, the rocks decline to actually think. Not as we understand that word.

Which is the crux of the problem, because if the language is not easily parseable by both human and silicon processing, the meatbags will too easily assume that the code means something other than what the rocks interpret it as.

Language UNparseability

Posted Jun 11, 2013 21:03 UTC (Tue) by brouhaha (subscriber, #1698) [Link]

Perhaps we'll have to agree to disagree. I routinely use computers to solve some problems that I could solve myself by thinking, but which would take much longer that way. When the computer solves a problem using the same algorithm by which I would solve it by "thinking", then I think it's fair to say that the computer is doing some of my thinking for me.

The alternative would be to claim that if a computer solves a problem via a particular algorithm but is not thinking, then if I solve the same problem using the same algorithm I couldn't be said to be thinking either.

I certainly won't claim that everything the computer does is thinking, nor that the computer can do all the kinds of thinking that I can.

Octal numbers

Posted Jun 9, 2013 23:56 UTC (Sun) by hpa (guest, #48575) [Link] (2 responses)

NASM, an x86 assembler, supports two syntaxes for specifying the base: either NNNNNNa ("assembly style") or 0aNNNNNN ("C style"). 0NNNNNN does NOT specify octal; that is specified with a base specifier of "o" or "q". It works for both integers and floating point. Hex is "h" or "x", binary is "b" or "y", and decimal is "d" or "t". The fact that there are two letters for each base is to accommodate a number of existing conventions; by sheer coincidence there are exactly two for each base.
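The two NASM spellings can be mimicked with a small parser. Here is a rough Python sketch (the function name is my own, and NASM details such as underscores, the leading-digit rule for hex, and floating-point literals are ignored):

```python
# Map NASM base letters to bases; each base has two letters by convention.
SUFFIX_BASES = {"h": 16, "x": 16, "o": 8, "q": 8,
                "b": 2, "y": 2, "d": 10, "t": 10}

def parse_nasm_int(token):
    """Parse an integer written NASM-style (hypothetical helper)."""
    t = token.lower()
    # "Assembly style": trailing base letter, e.g. 0ffh, 755q, 1010y.
    base = SUFFIX_BASES.get(t[-1])
    if base is not None:
        return int(t[:-1], base)
    # "C style": 0 plus a base letter as prefix, e.g. 0x1f, 0o755, 0b1010.
    if len(t) > 2 and t[0] == "0" and t[1] in SUFFIX_BASES:
        return int(t[2:], SUFFIX_BASES[t[1]])
    # No marker: plain decimal (a leading 0 does NOT mean octal here).
    return int(t, 10)
```

Note how the last branch encodes the point above: with an explicit base letter always available, a bare leading zero never has to mean octal.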

Octal numbers

Posted Jun 10, 2013 5:04 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

What is the 'y' derived from for binary?

Octal numbers

Posted Jun 10, 2013 11:26 UTC (Mon) by cortana (subscriber, #24596) [Link]

It's the last letter of 'binary', as 'x' is the last letter of 'hex'. But I'm just guessing, I haven't seen 'y' used anywhere else like this.

Little things that matter in language design

Posted Jun 8, 2013 7:30 UTC (Sat) by renox (guest, #23785) [Link]

D deprecated the leading-0 octal notation: it may be a standard, but it is still a very bad idea!

Little things that matter in language design

Posted Jun 8, 2013 13:14 UTC (Sat) by cmrx64 (guest, #89304) [Link] (2 responses)

Yes, I proposed, and it's generally agreed, that if Rust were to have octal literals, it would use the 0o prefix. Leading 0 for octal is a definite anti-pattern that we want to avoid.

But Rust has macros and syntax extensions, so octal literals can be supported outside the core language, e.g. o!(566) or oct!(755). Not as pretty, I think.

Rust doesn't have octal literals right now because they're seen as only occasionally useful, and nobody has bothered adding them yet.

Little things that matter in language design

Posted Jun 8, 2013 21:44 UTC (Sat) by Tobu (subscriber, #24111) [Link] (1 responses)

As long as the umask-style functions take symbols/strings and refuse integers, there's not much point in bothering with octal. Less syntax, and one source of bugs removed.
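As a sketch of what "symbols/strings instead of integers" could look like, here is a hypothetical Python helper (the name is made up) that turns an ls-style permission string into mode bits; it handles only rwx, ignoring setuid/setgid/sticky:

```python
def mode_from_string(perms):
    """Convert e.g. "rwxr-xr-x" to numeric mode bits (hypothetical helper)."""
    if len(perms) != 9:
        raise ValueError("expected 9 permission characters")
    bits = 0
    for ch, expected in zip(perms, "rwxrwxrwx"):
        bits <<= 1           # make room for the next permission bit
        if ch == expected:
            bits |= 1        # permission granted
        elif ch != "-":
            raise ValueError("unexpected character: " + ch)
    return bits
```

With this, mode_from_string("rwxr-xr-x") produces the same value as the octal literal 0o755, without any octal syntax appearing in the caller.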

Little things that matter in language design

Posted Jun 9, 2013 16:27 UTC (Sun) by alankila (guest, #47141) [Link]

Honestly, adding a whole number class just to support the single use case of chmod/umask seems incredibly extravagant to me. In my opinion octal numbers suck even for that purpose, so supporting this 04755 style of writing permissions is actually a negative service rather than a positive one.

Little things that matter in language design

Posted Jun 8, 2013 21:53 UTC (Sat) by edeloget (subscriber, #88392) [Link] (1 responses)

> It would also be wonderful if compilers could track-back with
> error messages. For example, a missing } somewhere in the middle
> of the program will usually only throw an error on the last
> line of the file. It would be far more helpful to report the
> line number of the opening { that wasn't ever closed.

Finding the opening { that matches a missing } is not that simple. There is a good chance of pointing the programmer to the wrong opening {, meaning that the information is now wrong (as opposed to merely not useful).

Little things that matter in language design

Posted Jun 8, 2013 22:14 UTC (Sat) by hummassa (guest, #307) [Link]

> Finding the opening { that matches a missing } is not that simple. There is a good chance of pointing the programmer to the wrong opening {, meaning that the information is now wrong (as opposed to not useful).

Just a correction: finding the opening { that matches a missing } is not that simple IF THE PROGRAM IS NOT INDENTED CORRECTLY. It is about time some IDE could see exactly *where* the runaway block began and where it *should* end, because IDEs often have (at least rudimentary) parsers for the language.
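The simplest version of this idea doesn't even need a full parser; a naive Python sketch that tracks opening braces on a stack (it ignores braces inside strings and comments, so a real tool would need the language's lexer):

```python
def unclosed_braces(source):
    """Report unmatched braces along with the line each one appeared on."""
    stack = []       # line numbers of currently open '{'
    problems = []
    for lineno, line in enumerate(source.splitlines(), start=1):
        for ch in line:
            if ch == "{":
                stack.append(lineno)
            elif ch == "}":
                if stack:
                    stack.pop()
                else:
                    problems.append(("extra '}'", lineno))
    # Anything left on the stack was opened but never closed.
    problems.extend(("unclosed '{'", lineno) for lineno in stack)
    return problems
```

For a file whose inner block is closed but whose outer one is not, this points at the line of the outer opening brace instead of the end of the file. It inherits the weakness discussed above, though: if a brace is misplaced rather than missing, the reported line can still be the wrong one.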

Little things that matter in language design

Posted Jun 9, 2013 23:26 UTC (Sun) by tjc (guest, #137) [Link]

It would also be wonderful if compilers could track back with error messages. For example, a missing } somewhere in the middle of the program will usually only throw an error on the last line of the file. It would be far more helpful to report the line number of the opening { that wasn't ever closed.

Error reporting is difficult with bottom-up shift-reduce parsing. Such parsers perform a rightmost derivation, which is why line numbers are sometimes misreported. LALR parsers generated by Bison have this problem.

Top-down parsing (recursive descent parsing, for example) is better in this respect, which is probably one of the reasons the GCC C compiler got a new parser a few years ago.

Little things that matter in language design

Posted Jun 20, 2013 10:10 UTC (Thu) by moltonel (guest, #45207) [Link] (2 responses)

I really like Erlang's syntax for this: B#N means the number N written in base B.

So you have 10 = 16#a = 8#12 = 2#1010 = 10#10 = 36#a = 5#20.

This works for any base between 2 and 36; it's clear and it's consistent. There is similar functionality when printing or parsing a string: any base can be used.
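Python doesn't have Erlang's B#N literal syntax, but its int() builtin accepts the same range of bases, so the equalities above can be checked with a tiny parser for the notation (the function name is my own):

```python
def parse_based(literal):
    """Parse an Erlang-style B#N literal, e.g. "16#a" -> 10 (sketch)."""
    base_text, _, digits = literal.partition("#")
    base = int(base_text)
    if not 2 <= base <= 36:
        raise ValueError("base must be between 2 and 36")
    # int() accepts exactly Erlang's range of bases, 2 through 36.
    return int(digits, base)
```

All of "16#a", "8#12", "2#1010", "10#10", "36#a", and "5#20" parse to 10, matching the Erlang examples above.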

Little things that matter in language design

Posted Jun 20, 2013 11:07 UTC (Thu) by renox (guest, #23785) [Link] (1 responses)

Ada has a similar syntax, but I'm not sure that this is really a gain: is there really a need for anything other than decimal, binary (0b), hexadecimal (0x), and octal (0o, for old Unix compatibility)?

Little things that matter in language design

Posted Jun 20, 2013 13:13 UTC (Thu) by anselm (subscriber, #2796) [Link]

Ada, at least, was hyped as the programming language to make all other programming languages obsolete. Such a programming language would naturally have to cater to the preferences of, e.g., six-fingered space aliens, too.

Little things that matter in language design

Posted Jun 8, 2013 2:34 UTC (Sat) by nlucas (guest, #33793) [Link] (19 responses)

Allowing Unicode identifiers while not allowing decimal values represented in the programmer's locale (e.g. 1,23 instead of 1.23) seems incoherent to me.

Of course, that would be impossible to accomplish with simple text files as source (we would need meta-data that doesn't get deleted by mistake).

Little things that matter in language design

Posted Jun 8, 2013 2:58 UTC (Sat) by neilbrown (subscriber, #359) [Link] (2 responses)

Not impossible. We could use Rust's attributes:

#[decimal_comma];

or maybe

#pragma decimal_commas

I don't think I would recommend that though.

Interesting problem - thanks for mentioning it.

Little things that matter in language design

Posted Jun 8, 2013 16:46 UTC (Sat) by nlucas (guest, #33793) [Link] (1 responses)

The problem is that if we implement locale-based decimals, there could be cases where at first glance there are no differences.

For example, suppose we have a simple file with constants:

const double CONST1 = 123,456;
const double CONST2 = 123.456;

If someone deletes the meta-data indicating the locale, how do you parse it? There is no way to know what the original values were if you don't know the original locale.

You could fix this by making the thousands separator an invalid character (by only allowing '_' or space as the thousands separator), but with so many locales out there, could this really be fixed on a global scale?

Localization is hard, and should never be taken lightly. For example, my country uses ',' as the decimal separator, but the keyboard numpad has '.', not ','. So it's usual for applications to accept both as decimal separators on input. No standard libraries I know of support this case, which means most applications have to implement (or filter) their input functions instead of relying on the standard libraries.

Little things that matter in language design

Posted Jun 9, 2013 16:32 UTC (Sun) by alankila (guest, #47141) [Link]

I recently hid myself in a dark cave, fearing the wrath of God, after I implemented a heuristic number parser for an application I wrote for a customer, just so that it would recognize and support various ways of writing money: 1.234,45; 1 234.45; 1 234,45; etc. In general I guess that the last dot/comma I see is the decimal separator, and all other characters that are not numbers are ignored.
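The "last dot or comma wins" heuristic described above can be sketched in a few lines of Python (my own reconstruction of the idea, not the original code):

```python
import re

def parse_money(text):
    """Guess the decimal separator: the last '.' or ',' wins;
    every other non-digit character is ignored."""
    separators = [i for i, ch in enumerate(text) if ch in ".,"]
    if not separators:
        # No separator at all: treat the digits as a whole number.
        return float(re.sub(r"\D", "", text))
    last = separators[-1]
    whole = re.sub(r"\D", "", text[:last])
    fraction = re.sub(r"\D", "", text[last + 1:])
    return float(whole + "." + fraction)
```

With this, "1.234,45", "1 234.45", and "1 234,45" all come out as 1234.45, but the heuristic will happily misread genuinely ambiguous input such as "1,234" (one-and-a-bit, or one thousand?), which is rather the point of the comment.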

I wrote a very similar horror for date parsing, trying to support yyyy-mm-dd, dd.mm.yyyy and dd/mm/yy OR mm/dd/yy based on whether user is expected to reside in UK or US.

I hate people and their myriad conventions.

Little things that matter in language design

Posted Jun 8, 2013 7:17 UTC (Sat) by jzbiciak (guest, #5246) [Link] (15 responses)

I don't know enough about Go or Rust to comment on them specifically, but more generally, allowing decimal commas seems more disruptive than allowing Unicode identifiers, unless the decimal comma has a different Unicode code point. (Does it? I honestly don't know.)

That is, allowing αβγδ = 3; in a C or C++ program (or in most other languages) doesn't change the meaning of any program that doesn't use that flexibility. But, allowing the programmer to select the meaning of "1,23" is far more disruptive, because it changes the meaning of the ubiquitous comma.

This problem arises because just about every language I've programmed in my 30 years as a programmer uses a comma for an argument separator if it uses a separator at all. Allowing a decimal comma gives the comma two very distinct roles in the same context. If "argument separator comma" and "decimal separator comma" are the same Unicode code point, then you need to use whitespace to disambiguate "1,23" from "1, 23". Ugly.

I suppose you could design a programming language that didn't use commas as C, C++, Perl, and 100s of other languages do. In that case, the decimal comma would never introduce surprises.

In any case, my point is that it should be easy to see why supporting decimal comma is a harder problem than supporting Unicode identifiers.

Little things that matter in language design

Posted Jun 8, 2013 11:57 UTC (Sat) by l0b0 (guest, #80670) [Link] (14 responses)

*nix shells use literal space instead of comma as an argument separator. But that's just trading one syntax problem for another.

Little things that matter in language design

Posted Jun 8, 2013 14:01 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (11 responses)

As do Haskell and Lisps. Shell has a problem because it parses argument separators *after* variable expansion. It has its uses, but I don't think it has been worth the trouble it has caused.

Little things that matter in language design

Posted Jun 8, 2013 15:18 UTC (Sat) by jzbiciak (guest, #5246) [Link] (10 responses)

As I recall, LISP bullies its way out of that with an explosion of parentheses. Haskell looks like it largely avoids that, just glancing at some Haskell code on the net. But I don't know Haskell really at all, so I don't know how it addresses, say, sending the arguments 1, -2 to a function. Without a comma, does that look like the expression "1 - 2" or are there other rules you have to be aware of?

Little things that matter in language design

Posted Jun 8, 2013 16:00 UTC (Sat) by SLi (subscriber, #53131) [Link] (1 responses)

In Haskell, negative numbers have to be parenthesized:

foo 1 (-2) is parsed as ((foo 1) (-2))

(in Haskell all functions really take exactly one argument; here foo would take an integer and return a function taking an integer, i.e. the type would be Integer -> Integer -> a)

foo 1 -2 would get parsed as (foo 1) - 2, so foo needs to be a function of the type Integer -> Integer.

Little things that matter in language design

Posted Jun 10, 2013 0:32 UTC (Mon) by marcH (subscriber, #57642) [Link]

This page, for instance, shows a good few interesting syntax examples in quite a short space:

http://www2.lib.uchicago.edu/keith/ocaml-class/functions....

Little things that matter in language design

Posted Jun 8, 2013 16:15 UTC (Sat) by mathstuf (subscriber, #69389) [Link] (6 responses)

No, it uses spaces for separation of arguments (e.g., (+ 4 5) *not* (+, 4, 5)). The parentheses are Lisp's way of avoiding meaningful indentation, braces, and semicolons. BTW, Lisp gets around the negation problem by only using prefix notation, so (+ 1 -2) is unambiguous since the '-' cannot possibly be a function call here (and "(-2)" is a call to a function named "-2", so that's not an issue either).

Little things that matter in language design

Posted Jun 8, 2013 18:45 UTC (Sat) by jzbiciak (guest, #5246) [Link] (1 responses)

Well, what I was getting at with my LISP comment is that an expression such as a + b * c - d, which needs no parentheses and is completely unambiguous in a C-like language, ends up being (+ a (- (* b c) d)) in prefix notation. I went from 0 parentheses to 3 pairs of parentheses.

Now, C has its own problems with its umpteen levels of precedence, problems that tend to lead to excessive parentheses, but that's really a different conversation.

Little things that matter in language design

Posted Jun 9, 2013 23:56 UTC (Sun) by tjc (guest, #137) [Link]

That's a feature. :)

(Or it would be, if the "bitwise" AND/XOR/OR operators were at a higher precedence level, just below the bit-shift operators.)

Little things that matter in language design

Posted Jun 17, 2013 10:44 UTC (Mon) by erich (guest, #7127) [Link] (3 responses)

Oh wow. Avoiding curly braces by using parentheses. That is a great feature.
I wonder if Lisp programmers can read their own code a month later...
And finding a misplaced ) in Lisp code is a masochist's job.

http://readable.sourceforge.net/

Now if we could come up with a language that requires S-expression parentheses, C-style braces, *and* Python indentation (maybe also mix in Brainfuck/Whitespace and some Visual Basic for Applications), then we could finally build the ultimate programming language of hell.

Little things that matter in language design

Posted Jun 17, 2013 20:03 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (1 responses)

Don't forget Malbolge's semantics, MUMPS's whitespace[1] and abbreviations, and INTERCAL's "PLEASE" keyword!

[1] Arguments to functions use exactly one space between them (two spaces passes as the second argument).

Little things that matter in language design

Posted Jun 20, 2013 13:02 UTC (Thu) by jzbiciak (guest, #5246) [Link]

... and INTERCAL's "PLEASE" keyword!

I would have thought COME FROM would be a better choice for such a language. That, and ABSTAIN to allow for conditional COME FROM.

MUMPS's whitespace[1]

[1] Arguments to functions use exactly one space between them (two spaces passes as the second argument).

I forgot how evil MUMPS was, despite multiple articles on TheDailyWTF about it.

Little things that matter in language design

Posted Jun 20, 2013 6:12 UTC (Thu) by dakas (guest, #88146) [Link]

And finding a misplaced ) in Lisp code is a masochist's job.

It's rather an editor's job. Lisp does not have a program syntax, it has a read syntax: programs are just "evaluated" standard data structures (lists, usually).

The read syntax is simple enough to be amenable to a lot of automated processing. Emacs has something like M-x check-parens RET to find file-wide problems, but paren matching and indentation is also quite helpful. Even non-LISP aware editors like vi at least offer paren matching via %.

Now this is a language design choice: using macros is so much more dependable, powerful and coherent than with C/C++ that it is not funny.

Evaluating a macro call means taking the unevaluated arguments, calling the macro on it, and evaluating the result. Evaluating a function call means evaluating the arguments and calling the function on it.

Orthogonal, straightforward, powerful. There is no technical "parser" barrier between input and code. Instead, there is a cultural barrier between code and programmer, as humans are used to plenty of punctuation and semi-graphical representations (one of the reasons people prefer mathematical formulas in typeset form rather than computer-language versions of them).

It is a tradeoff at a language conceptual level. It's not actually giving in to the machine (programming in assembly or machine language is that) but rather finding a common expressive ground easily manipulated by programs themselves.

Making it human-accessible involves optical patterning via programming styles and proper indentation. It's not the same as punctuation, but then punctuation without proper indentation does not really work all too well, either, and the worst case for programs generated by programs is meaning-carrying whitespace like in Python: you can't just write the program elements without knowing indentation context, running an indenter on the result if you need nice human readability.

LISP/Scheme is a superb environment for writing code that generates and/or analyzes code, because programs are not represented by a grammar but rather directly by their parse tree which has a computer- and tolerably human-readable and -writeable representation.

Little things that matter in language design

Posted Jun 9, 2013 14:37 UTC (Sun) by joey (guest, #328) [Link]

Haskell has a neat trick to avoid needing to close nested parens: (foo $ bar $ baz xyzzy plugh) is the same as (foo (bar (baz xyzzy plugh))). The $ operator is trivially defined as f $ x = f x, with an especially low precedence.

Haskell code often also avoids parens via other means. For example, the function

f x = foo (bar (baz x))

could be written as

f x = foo $ bar $ baz x

but is more likely to be written in point-free style as

f = foo . bar . baz

Incidentally, something very like the Virgil-II tuple access syntax mentioned in the article is also available in Haskell via the lens library. Haskell's syntax is well-suited to defining really interesting and useful operators. For example:

ghci> _1 .~ "hello" $ ("","world")
("hello", "world")

Little things that matter in language design

Posted Jun 8, 2013 15:18 UTC (Sat) by jzbiciak (guest, #5246) [Link]

Another, potentially much larger, problem, isn't it? If you use whitespace as the argument separator, then you need some other grouping construct to group together terms in expressions if you also want to allow whitespace in expressions.

(more comment below, replying to mathstuf directly.)

Little things that matter in language design

Posted Jun 8, 2013 16:29 UTC (Sat) by nlucas (guest, #33793) [Link]

I'm not advocating the use of the programmer's locale in a programming language. Excel, with its "locale-aware" macro functions, showed us that that is a very bad thing!

But just for the sake of discussion: in countries where the decimal separator is a comma, people just use ';' as the list separator (at least in my country; I don't know about others). E.g. instead of func(1.2,1.3), one writes func(1,2;1,3). It's the same as what is done in mathematics (e.g. a range [-1.2,+1.3] would be [-1,2;+1,3]).

Rust semicolon handling is risky

Posted Jun 8, 2013 7:39 UTC (Sat) by renox (guest, #23785) [Link] (2 responses)

Even though I want to like Rust, as they are trying to solve interesting issues, I'm not at all sure that their weird usage of the semicolon won't make the language unreadable.
If they wanted so much to avoid the return keyword, they should have used the Smalltalk ^ operator.

Rust semicolon handling is risky

Posted Jun 8, 2013 13:19 UTC (Sat) by cmrx64 (guest, #89304) [Link]

It's really quite manageable and not nearly as complex as the article makes it out to be (at least not in practice).

See https://github.com/cmr/terminfo-rs/blob/master/searcher.rs or any other rust code, for example.

The return keyword is *encouraged* to be used; it's not to be avoided at all. But it doesn't make sense to use it as the result of an expression, because then you can't return from the function!

POV

Posted Jun 9, 2013 2:15 UTC (Sun) by ofranja (subscriber, #11084) [Link]

Weird for whom? Depending on your frame of reference, having a separation of statements and expressions, as most imperative languages do, could look even stranger.

The semicolon idea is not new - ML-derived languages have had that syntax for decades. Don't think of it as implicit behaviour, but uniform behaviour: everything is an expression. The semicolon is just shorthand for grouping expressions when you don't care about the result. And if you forget one, your code simply does not compile anymore.

I find it very simple, and easier to follow - especially in large code bases.

Little things that matter in language design

Posted Jun 8, 2013 8:15 UTC (Sat) by mgedmin (subscriber, #34497) [Link] (35 responses)

Didn't Plan 9 have a "rune" type for 32-bit Unicode characters? Go was created by the same people who created Plan 9, wasn't it?

Little things that matter in language design

Posted Jun 8, 2013 10:22 UTC (Sat) by lsl (subscriber, #86508) [Link]

Yes, that's where the rune type comes from. Also, the gc toolchain for Go is a direct descendant of the Plan 9 compilers.

Little things that matter in language design

Posted Jun 9, 2013 12:15 UTC (Sun) by tialaramex (subscriber, #21167) [Link] (33 responses)

The article doesn't mention this, but allowing Unicode beyond ASCII in identifiers means you need to do a bunch of extra work that might tempt you to throw in case insensitivity while you're at it.

Unicode is fairly insistent that, e.g., although it provides two separate ways to "spell" the e-acute in café for compatibility reasons, these two spellings are equivalent and an equality test between the two should pass. For this purpose it provides UAX #15, which specifies four distinct normalisation forms, each of which results in equivalent strings becoming codepoint-identical.

If you don't do this normalisation step you can end up with a confusing situation where when the programmer types a symbol (in their text editor which happens to emit pre-combined characters) the toolchain can't match it to a visually and lexicographically identical character mentioned in another file which happened to be written with separate combining characters. This would obviously be very frustrating.

On the other hand, to completely fulfil Unicode's intentions, either your language runtime or any binary you compile that does a string comparison needs to embed many kilobytes (perhaps megabytes) of Unicode tables in order to perform the normalisation steps correctly.
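The two "spellings" of é are easy to demonstrate with Python's unicodedata module, which wraps exactly the kind of Unicode tables mentioned above:

```python
import unicodedata

precomposed = "caf\u00e9"   # 'é' as one code point, U+00E9
decomposed = "cafe\u0301"   # 'e' followed by U+0301 COMBINING ACUTE ACCENT

# The strings render identically but compare unequal codepoint-by-codepoint...
assert precomposed != decomposed

# ...until they are normalised to a common form (NFC or NFD here):
assert unicodedata.normalize("NFC", decomposed) == precomposed
assert unicodedata.normalize("NFD", precomposed) == decomposed
```

Python itself normalises identifiers to NFKC for exactly this reason (PEP 3131), so the two spellings name the same variable even though the source bytes differ.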

Little things that matter in language design

Posted Jun 9, 2013 12:32 UTC (Sun) by mpr22 (subscriber, #60784) [Link] (20 responses)

Case-insensitivity, Unicode, interoperation between Turks and non-Turks. Pick two.

Little things that matter in language design

Posted Jun 10, 2013 0:23 UTC (Mon) by dvdeug (guest, #10998) [Link] (19 responses)

How do you get case-insensitivity and interoperation between Turks and non-Turks? It's not a Unicode problem; Turks want i (ordinary i) to uppercase to İ (I with a dot), and non-Turks don't. Short of making a special Turkish i and I, which comes with its own problems and which nobody does, that's going to be a problem.
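The conflict is visible in any language whose case mappings follow Unicode's locale-independent defaults; Python, for example:

```python
# Unicode's default (non-Turkish) case mappings:
assert "i".upper() == "I"   # a Turk would expect İ (I with a dot)
assert "I".lower() == "i"   # a Turk would expect ı (dotless i)

# The Turkish capital İ (U+0130) doesn't round-trip cleanly either:
# per Unicode's SpecialCasing rules it lowercases to 'i' followed by
# U+0307 COMBINING DOT ABOVE, not to a plain 'i'.
assert "İ".lower() == "i\u0307"
```

So a case-insensitive language that folds identifiers with the default mappings would equate i/I the non-Turkish way, exactly the interoperation problem described above.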

Little things that matter in language design

Posted Jun 11, 2013 9:56 UTC (Tue) by khim (subscriber, #9252) [Link] (18 responses)

Since the offer was "pick two" and you've decided to throw Unicode out, the solution is obvious.

Short of making a special Turkish i and I, which comes with its own problems and nobody does, that's going to be a problem.

Sure, but it is a way to achieve case-insensitivity and interoperation between Turks and non-Turks.

Little things that matter in language design

Posted Jun 11, 2013 20:04 UTC (Tue) by dvdeug (guest, #10998) [Link] (17 responses)

Just because someone tells you that you can have cheap, fast, or good, pick two, doesn't mean it's true.

Creating a new character set only achieves interoperation in a theoretical way, since nobody is using it. You've not thrown out just Unicode; you've thrown out any character set that has seen actual use for Turkish.

Even if you do, and get everyone to use it, how much bad data is going to get created? Imagine a keyboard with three i keys; we'd get a pile of data with the wrong i or the wrong I. You've also created a whole new set of spoofing characters; Microsoft had better race to register Microsoft.com (with a Turkish i), as should everyone else with an i in their name.

Little things that matter in language design

Posted Jun 11, 2013 22:04 UTC (Tue) by Cyberax (✭ supporter ✭, #52523) [Link] (16 responses)

Ukrainian has its own letter 'i' which is distinct from ASCII 'i'. It works just fine.

It would be another story if dotless 'i' was the only unique letter in Turkish, but it's not. There also are: Ç, Ğ, I, İ, Ö, Ş, and Ü.

Little things that matter in language design

Posted Jun 12, 2013 5:12 UTC (Wed) by dvdeug (guest, #10998) [Link] (15 responses)

That's because, unlike Turkish, Ukrainian also has a complete alphabet that's distinct from ASCII. People who use languages written in Cyrillic have to switch their whole keyboard back and forth to write in English, unlike people who type in Turkish.

Even then, it doesn't work just fine. There are rules against registering mixed-script domain names, and web browsers will display microsoft.com differently from mіcrosoft.com, because they detect the mixed script. Other places without that special code will provide no hint that the two aren't the same.

Having different characters with the same glyphs in the same script is even more problematic, because that special code won't work; there's no way a program could tell that microsoft.com (with a Turkish i) was a spoofing attempt.

Little things that matter in language design

Posted Jun 12, 2013 13:57 UTC (Wed) by khim (subscriber, #9252) [Link] (14 responses)

I don't really see your point: you are still trying to explain why dropping Unicode for the sake of keeping case-insensitivity and interoperation between Turks and non-Turks is a dumb choice. Yes, it's dumb; people usually pick some other pair. But that does not change the fact that it may work just fine (for a certain definition of "just").

Little things that matter in language design

Posted Jun 12, 2013 19:01 UTC (Wed) by dvdeug (guest, #10998) [Link] (13 responses)

I'm explaining that it's not a real option for anyone that doesn't control their own universe. It's not Unicode; it's every Turkish character set ever. It's what Turkish keyboards give you. Nobody ever picks that pair because it's not a real option.

Little things that matter in language design

Posted Jun 14, 2013 21:40 UTC (Fri) by khim (subscriber, #9252) [Link] (12 responses)

Yet this is what used to solve the problem for Russian. Early computers in the USSR only had Russian letters, which were distinct from Latin ones. And they, too, had this upcase problem (the upcase of Russian "у" was "У", and of Latin "y" was "Y"). It's not clear why the Turks cannot adopt the same solution. Well, "for historical reasons", probably; but that's still a "Unicode" choice.

Little things that matter in language design

Posted Jun 14, 2013 23:33 UTC (Fri) by dvdeug (guest, #10998) [Link] (11 responses)

I don't think you can call choosing any currently existing Turkish character set a "Unicode" choice. If we're going to dismiss history and how Turks currently use their computers, we could go further and change their whole writing system.

Russian is written in the Cyrillic alphabet, unlike Turkish which is written in the Latin alphabet. It's not written in the Latin alphabet by accident; it was changed from the Arabic alphabet in 1927 in an attempt to modernize the country and attach themselves politically and culturally to the successful West. Separating the Turkish alphabet from the Latin is not a neutral act, particularly when you don't do the same to the French or Romanian.

Little things that matter in language design

Posted Jun 15, 2013 14:19 UTC (Sat) by khim (subscriber, #9252) [Link] (10 responses)

Separating the Turkish alphabet from the Latin is not a neutral act, particularly when you don't do the same to the French or Romanian.

Sure. But this is what Unicode is all about. Unicode didn't happen in one step. Early character encodings were... strange (from today's point of view). Not just on Russian computers; on US-based computers, too (think EBCDIC and all those strange symbols used by APL). Eventually some groups of symbols were put together and some other symbols were separated. Not just Cyrillic, but Greek (a script as closely related to Cyrillic as Turkish is related to Romanian), etc. Why are Telugu and Kannada separated while Chinese and Japanese Han characters are merged? If we want to make upcase/lowercase functions locale-independent, we can do with Turkish (French, Romanian, etc.) what was done with Telugu and Kannada.

Little things that matter in language design

Posted Jun 15, 2013 14:52 UTC (Sat) by mpr22 (subscriber, #60784) [Link]

The relationship between the Turkish variant of the Latin alphabet and some other random European variant of the Latin alphabet more closely resembles the relationship between the Serbian and Russian variants of the Cyrillic alphabet than the relationship between the Cyrillic alphabet and the Greek alphabet.

Little things that matter in language design

Posted Jun 15, 2013 22:52 UTC (Sat) by dvdeug (guest, #10998) [Link] (6 responses)

"Unicode didn't happen in one step" is blaming Unicode for the entire history of computing.

If you don't care whether the Turks are going to use your character set, go ahead and tell them to use ASCII. If you choose to separate their alphabet from the Latin one, you're going to have a problem: they consider their alphabet part of the extended Latin alphabet, and they're not going to find that an acceptable solution. If you choose to separate out the alphabets of thousands of languages (even though the English alphabet is a superset of the French and Latin), you might mollify the Turks, but nobody is going to use your character set.

In reality, Turkish support requires locale-sensitive casing functions, because every other solution has serious technical and often political problems, as well as being incompatible with existing systems, including keyboards.

Little things that matter in language design

Posted Jun 16, 2013 3:30 UTC (Sun) by hummassa (guest, #307) [Link] (2 responses)

...
> In reality, Turkish support requires locale-sensitive casing functions
...

Let's be plain: there are no "casing functions" that are not locale-sensitive. The Turkish dotted "i"s are one example, the German vs. Austrian "ß" is another, etc. And don't get me started on collation order. If anyone is going to try to simplify computation by giving each locale its own alphabet, I wish them good luck with their newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D

Little things that matter in language design

Posted Jun 16, 2013 8:21 UTC (Sun) by khim (subscriber, #9252) [Link] (1 responses)

If one is going to try to facilitate computations by separating each locale to an alphabet, I wish good luck with its newnicode. The real Unicode thankfully does not work that way. Usually, at least. :-D

Well, that's certainly a pity: Unicode was developed to fit in 16 bits and thus merged many scripts (it assumed language would be indicated "on the side" and/or would be less important than the glyphs themselves). They have failed (today there are over 90,000 glyphs in Unicode), yet as a result we cannot properly work with English+Turkish (or even German+Austrian) texts, as you've correctly pointed out.

Today we are stuck: yes, it's not perfect, and this decision certainly made life harder, not easier, but it'll be hard to replace it with anything else at this point. It's a similar story to QWERTY: the numerous problems stemming from that old decision are considered minor enough that it'll be hard to switch. But note that the most popular OS does exactly that for CJK. It's slowly but surely being replaced by Unicode-based OSes (such as Android), so in the end Unicode is probably inevitable; but that does not mean you cannot achieve interoperability with Turkish users and working upcase/lowercase simultaneously. You can; Unicode prevents that, nothing else.

Little things that matter in language design

Posted Jun 16, 2013 10:33 UTC (Sun) by dvdeug (guest, #10998) [Link]

Let's note that you want every one of 5,000 different languages to have its own code page; your comment about German+Austrian implies that you want every subdialect to have its own code page. And that's not approaching the question of how you want to deal with sometimes wildly different orthographies for one language.

"this decision certainly made life harder, not easier"

There's no "certainly" about it. To type "mv Das_Boot_German.avi Boata_filmoj" in your system you'd have to change keyboards several times: from whatever language mv is in, to German, to English, possibly to whatever language you count avi as, then to Esperanto. Right now, you can type that from any keyboard that supports the ISO standard 26-letter alphabet. You couldn't search a document for Bremen without knowing whether someone considered it a German word or an English word, and e = mc², originally written by a German speaker but understood worldwide, would get an arbitrary language tag. While there are some Cyrillic and Greek look-alikes for Latin-script words, you would explode that; "go" could be encoded any number of ways, and any non-English speaker would have to switch their keyboard to go to lwn.net or google.com or any other English-named site.

"note that the most popular OS does exactly that for CJK."

Note that the article you link to does not say Tron is the most popular OS, and that it does not do exactly that for CJK, because Chinese is not one language; it's a rather messy collection of languages. Tron forces Cantonese to be written in the same script as Mandarin and Jinyu. Note also that Tron treats Turkish exactly the same way Unicode does, as it's a copy of Unicode everywhere but the Han characters.

"You can - Unicode prevents that, nothing else."

If by Unicode, you mean every character set ever used for Turkish (including Tron). I've never seen a fully worked-out draft of a character set that fits your specifications. It's never very impressive when someone claims that something would clearly be easier, yet it's never been tried.

Little things that matter in language design

Posted Jun 16, 2013 6:35 UTC (Sun) by micka (subscriber, #38720) [Link] (2 responses)

> even though the English alphabet is a superset of the French and Latin

I suppose you mean "subset"? As in, the English alphabet is strictly included in the French alphabet (without é, è, à, ...) and the Latin alphabet (I see no difference)?

Little things that matter in language design

Posted Jun 16, 2013 9:17 UTC (Sun) by dvdeug (guest, #10998) [Link] (1 responses)

English uses a lot of diacritics on characters if you look hard enough. Façade and résumé are completely standard spellings; coöperate is still used, by the New Yorker, for example. I don't know that there are any words where ÿ is used, so superset might be too strong, but it's certainly not a subset.

(If we're strictly speaking of the alphabet, neither of them count accents, so both French and English have the same 26 letters for the alphabet.)

Little things that matter in language design

Posted Jun 16, 2013 10:49 UTC (Sun) by micka (subscriber, #38720) [Link]

Depending on your sources, the French alphabet has either 26 letters, leaving out diacritics, or 42 letters, counting diacritics and the ligatures (œ and æ) separately (the same as ß, I suppose). Even the French and English versions of the "French alphabet" article on Wikipedia give different counts (an error, or a cultural difference in specialist terminology? I know, for example, that "ring" in mathematics has related but different definitions for French and American mathematicians).

The Spanish alphabet is more consistently considered to have 27 letters, even though ñ could be considered an n with a diacritic. And in the past, even some combinations of letters (from the point of view of the Latin alphabet) were considered separate letters.

And that's without even getting into http://en.wikipedia.org/wiki/Alphabet_%28computer_science%29 (where each diacritic variant would be considered a different letter).

Little things that matter in language design

Posted Jun 16, 2013 5:36 UTC (Sun) by viro (subscriber, #7872) [Link] (1 responses)

You can easily have a text in English with quoted sentences in French or in Turkish, using the same font. Try the same with e.g. Russian and Greek and see if you will be able to read the result[1]. The Turkish and French alphabets are Latin with some diacritics added; current Cyrillic is much more distant from Greek than that, as you bloody well know.

[1] lowercase glyphs aside, (И, Н) and (Η, Ν) alone are enough to render the result unreadable (a shift circa the 16th century, IIRC; at some point both the Eta and Nu counterparts got the slant of their middle strokes changed in the same way, turning 'Ν' into 'Н' and 'Η' into 'И')

Little things that matter in language design

Posted Jun 16, 2013 7:58 UTC (Sun) by khim (subscriber, #9252) [Link]

You can easily have a text in English with quoted sentences in French or in Turkish, using the same font. Try the same with e.g. Russian and Greek and see if you will be able to read the result[1].

Of course you could. What's the problem? You'll probably be forced to read the Greek letter by letter, but an English-speaking person will mangle French or Turkish, too. It's not as if mere resemblance between letters of the alphabets is what matters here: English and French may use similar-looking characters, but they use them to encode radically different consonants, vowels and words.

[1] lowercase glyphs aside, (И, Н) and (Η, Ν) alone are enough to render the result unreadable (shift circa 16th century, IIRC; at some point both Eta and Nu conterparts got the slant of the middle strokes changed in the same way, turning 'Ν' into 'Н' and 'Η' into 'И')

If you don't know which language is being used, you cannot read a word, period. Identical-looking words in French and Turkish will have radically different pronunciations and will be, in fact, different words.

Little things that matter in language design

Posted Jun 9, 2013 20:09 UTC (Sun) by khim (subscriber, #9252) [Link] (11 responses)

For this purpose it provides UAX #15

Which nobody uses in programming languages, for performance reasons.

If you don't do this normalisation step you can end up with a confusing situation where when the programmer types a symbol (in their text editor which happens to emit pre-combined characters) the toolchain can't match it to a visually and lexicographically identical character mentioned in another file which happened to be written with separate combining characters. This would obviously be very frustrating.

It's not as frustrating as you think. They don't type ı followed by ˙; they just type i. And the same with the other cases. Any other approach is crazy. Why? Well, because many programming languages will show ı combined with ˙ as "ı˙", not as "ı̇".

You may say that ı˙ is not the canonical representation of "i". Ok. "и" plus " ̆" is the canonical representation of "й". Try this for size:
$ cat test.c
#include <stdio.h>

int main() {
  printf("%c%c%c%c == %c%c\n", 0xD0, 0xB8, 0xCC, 0x86, 0xD0, 0xB9);
}
$ gcc test.c -o test
$ ./test | tee test.txt
й == й
Not sure about you, but on my system these two symbols only look similar when copy-pasted into a browser, and then only in the main window (if I copy-paste them into the "location" line they suddenly look different!). And of course these two symbols look different in GNOME terminal, gEdit, Emacs and other tools!

Thus, in the end you have two choices:

  1. Compare strings as sequences of bytes. Result: simple, clean, robust code, but the toolchain can't match [symbol] to a visually and lexicographically identical character mentioned in another file.
  2. Compare strings as UAX #15 says. Result: a huge pile of complicated code, and the toolchain can match a symbol to a visually and lexicographically different character mentioned in another file.

Frankly, I don't see the second alternative as superior.
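The two alternatives can be put side by side in a small sketch (Python's unicodedata module, used purely for illustration):

```python
import unicodedata

composed = "\u0439"            # й, CYRILLIC SMALL LETTER SHORT I
decomposed = "\u0438\u0306"    # и followed by U+0306 COMBINING BREVE

def nfc(s):
    """UAX #15 Normalization Form C."""
    return unicodedata.normalize("NFC", s)

# Choice 1: compare as byte (or codepoint) sequences; they differ.
assert composed != decomposed
assert composed.encode("utf-8") != decomposed.encode("utf-8")

# Choice 2: normalise first, as UAX #15 intends; they compare equal.
assert nfc(composed) == nfc(decomposed)
```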

Little things that matter in language design

Posted Jun 9, 2013 20:48 UTC (Sun) by hummassa (guest, #307) [Link] (7 responses)

> Which nobody uses in programming languages because of performance reason.

(UAX-15). I use it. Perl offers NFC, NFD, NFKC, NFKD without a huge perceivable (to me) performance penalty. AFAICT MySQL uses it, too.

> It's not as frustrating as you think. They don't type ı followed by ˙, they just type i. And the same with other cases. Any other approach is crazy. Why? Well, because many programming languages will show ı combined with ˙ as "ı˙", not as "ı̇".

This silly example tells me you don't have diacritics in your name, do you? Sometimes the "ã" in my last name is on one of the Alt-Gr keys. Sometimes I have to enter it via vi digraphs, either as "a" + "~" or "~" + "a". Sometimes I click "a", "combine", "~", or "~", "combine", "a". Or "~" (it's combining by default on my current keyboard, so that if I want to type a plain tilde I have to follow it with a space or type it twice) followed by "a".

> й == й
> Not sure about you but on my system these two symbols only look similar when copy-pasted in browser - and then only in the main window (if I copy-paste them to "location" line they suddenly looks differently!). And of course these two symbols are different in GNOME terminal, gEdit, Emacs and other tools!

it seems to me that your system is misconfigured. I could not see the difference between "й" and "й" in my computer, be it in Chrome's main window, location bar, gvim, or in yakuake's konsole window.

> Frankly I don't see second alternative as superior.

UAX15 is important. People sometimes type their names with or without diacritics (André versus Andre). Some names are in different databases with variant -- and database/time/platform dependent -- spellings. On some keyboards, a "ç" c-cedilla is a single character; on others, you punch first the cedilla dead key and then "c"; and on others you type, for instance, the acute dead key followed by "c" (as on the keyboard I'm typing on right now). Sometimes you have to say your name over the phone, and the person on the other side of the call must be able to search the database by the perceived name. Someone could have entered "fi" and another person is searching by "fi".

So, sometimes your "second alternative" is the only viable alternative. Anyway, the programming language should support "compare bytes" and "compare runes/characters" as two different use cases.
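The kind of matching being described might be sketched like this (illustrative only: the search_key helper is invented for this example, and a real system would use proper Unicode collation rather than this crude accent-stripping):

```python
import unicodedata

def search_key(name):
    """Crude fuzzy key: NFKD-normalise, drop combining marks, casefold.

    NFKD is a compatibility decomposition, so it also expands the
    U+FB01 "fi" ligature into the plain letters f-i.
    """
    decomposed = unicodedata.normalize("NFKD", name)
    stripped = "".join(c for c in decomposed
                       if unicodedata.category(c) != "Mn")
    return stripped.casefold()

# André and Andre produce the same key...
assert search_key("Andr\u00e9") == search_key("Andre")
# ...as do the "fi" ligature and the two letters f-i.
assert search_key("\ufb01le") == search_key("file")
```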

Little things that matter in language design

Posted Jun 9, 2013 21:14 UTC (Sun) by khim (subscriber, #9252) [Link] (2 responses)

Anyway, the programming language should support "compare bytes" and "compare runes/characters" as two different use cases.

I may be mistaken, but it looks like you are discussing a completely different problem. Both tialaramex and I are talking about programming languages themselves.

(UAX-15). I use it. Perl offers NFC, NFD, NFKC, NFKD without a huge perceivable (to me) performance penalty.

Really? Let me check:
$ cat test.pl
use utf8;

$й="This is test";

print "Combined version works: \"$й\"\n";
print "Decomposed version does not work: \"$й\"\n";
$ perl test.pl
Combined version works: "This is test"
Decomposed version does not work: ""

Am I missing something? What should I add to my program to make sure I can refer to $й as $й?

it seems to me that your system is misconfigured. I could not see the difference between "й" and "й" in my computer, be it in Chrome's main window, location bar, gvim, or in yakuake's konsole window.

Of course not! You've replaced all occurrences of "й" with "й" - of course there will be no difference! I'm not sure why you did that (perhaps your browser did it for you?), but if you do a "view source" on my message you'll see a difference; if you do the same with your message, both cases are byte-for-byte identical. It would be a little strange to see different symbols in such a case.

UAX15 is important.

Sure. In databases, search systems and so on (where fuzzy matching is better than no matching) it's important. In programming languages? Not so much. Most of the time when a language tries to save programmers from themselves, it just makes them miserable in the long (and even medium) term.

Little things that matter in language design

Posted Jun 10, 2013 16:12 UTC (Mon) by jzbiciak (guest, #5246) [Link]

Wow... Abusing the difference between й and й (and other cases of such fun) would make for some great obfuscated code. Or better yet, subtly malicious code.

Little things that matter in language design

Posted Jun 10, 2013 17:27 UTC (Mon) by hummassa (guest, #307) [Link]

> I may be mistaken, but it looks like you are discussion completely different problem. Both tialaramex and me are talking about programming langauges themselves.

You are right about this and I apologize for any confusion.

Little things that matter in language design

Posted Jun 9, 2013 23:38 UTC (Sun) by wahern (subscriber, #37304) [Link] (3 responses)

Perl6 also has NFG, which is probably the best normalization form out of all of them, although non-standard. It's not really even just a normalization form, but addresses issues of representation and comparison at the implementation level.

Using NFG solves all the low-level problems, including identifiers in source code, by getting rid of combining sequences altogether. Frankly I don't understand why it hasn't become more common. Maybe because most people just don't care about Unicode. Every individual has come to terms with the little issues with their locale. It's only when you look at all of them from 10,000 feet that you can see the cluster f*ck of problems. But few people look at it from 10,000 feet.

Little things that matter in language design

Posted Jun 11, 2013 1:07 UTC (Tue) by dvdeug (guest, #10998) [Link] (2 responses)

NFG isn't a normalization form at all. It doesn't get rid of combining sequences at all; it just invents dummy characters to hide combining sequences from the user. It's not that hard to generate a billion different combining sequences and potentially DoS any system using NFG. Ultimately, it's a lot of complexity for most systems that doesn't gain you that much over NFC.

Little things that matter in language design

Posted Jun 13, 2013 1:19 UTC (Thu) by wahern (subscriber, #37304) [Link] (1 responses)

You can DoS any system that doesn't use the correct algorithms. There are ways of implementing NFG that don't require storing every cluster ever encountered.

And it's not like existing systems don't have their own issues. The nice thing about NFG is that all the complexity is placed at the edges, in the I/O layers. All the other code, including the rapidly developed code that is usually poorly scrutinized for errors, is provided a much safer and more convenient interface for manipulation of I18N text. NFG isn't more complex to implement than any other system that provides absolute grapheme indexing. It's grapheme indexing that is the most intuitive, because it's the model everybody has been using for generations.

But most languages merely aim for half measures, and are content leaving applications to deal w/ all the corner cases. This is why UTF-8 is so popular. And it is the best solution when your goal is pushing all the complexity onto the application.

Little things that matter in language design

Posted Jun 14, 2013 0:22 UTC (Fri) by dvdeug (guest, #10998) [Link]

This is the 21st century; for the most part, I don't index anything. I have iterators to do that work for me, and arbitrary cursors when I need a location. If I want to work with graphemes, I can step between graphemes. If I want to work with words, I can step between words.

Grapheme indexing is not what everybody has been using for generations. In the 60 years of computing history, people working with scripts more complex than ASCII or Chinese have handled them in a number of ways, including character sets that explicitly encoded combining characters (like ISO/IEC 6937) and the use of BS with ASCII to overstrike characters like ^ over the previous character.

UTF-8 is so popular because for many purposes it's a quarter of the size of UTF-32, and for normal data never worse than 3/4 the size. And as long as you're messing with ASCII, you can generally ignore the differences. If people want UTF-32, it's easy to find.
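The size arithmetic is easy to check (Python used purely for illustration; utf-32-le is chosen so the byte-order mark isn't counted):

```python
# ASCII text: UTF-8 uses one byte per character, UTF-32 four.
s = "hello, world"
assert len(s.encode("utf-8")) == 12       # 1/4 the size of UTF-32
assert len(s.encode("utf-32-le")) == 48

# BMP characters such as CJK take three bytes in UTF-8, still
# beating UTF-32's four (the "never worse than 3/4" case):
cjk = "\u4f60\u597d"                      # 你好
assert len(cjk.encode("utf-8")) == 6
assert len(cjk.encode("utf-32-le")) == 8
```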

Little things that matter in language design

Posted Jun 10, 2013 16:58 UTC (Mon) by tialaramex (subscriber, #21167) [Link] (2 responses)

The idea that programming languages don't use UAX #15 for symbol matching due to performance problems would be an easier sell if UAX #15 came anywhere near the difficulty of something like C++ symbol mangling.

You seem to be suffering some quite serious display problems with non-ASCII text on your system; I don't know what to suggest other than finding someone to help figure out what you did wrong, or upgrading to something a bit more modern. I've seen glitches like those you describe, but mostly quite some years ago. Your example program displays two visually identical characters on my system; I can believe your system doesn't do this, but I would point out that that's /a bug/.

Even allowing for that, your last paragraph is hard to understand. Are you claiming that, because on your system some symbols are rendered incorrectly depending on how they were encoded, those symbols are _different_ lexicographically, and everybody else (who can't see these erroneous display differences) should accept that?

Little things that matter in language design

Posted Jun 11, 2013 9:07 UTC (Tue) by etienne (guest, #25256) [Link] (1 responses)

Just a $0.02:
> You seem to be suffering some quite serious display problems with non-ASCII text on your system

It seems (some) people want to use a fixed-width font to write programs, mostly because some Quality Enhancement Program declared the TAB character obsolete, and the SPACE character's width is not constant in editors using variable-width fonts.
Most programming languages need indentation...
With non-ASCII chars in a fixed-width font, if you even get the glyph in the font you are using, the only solution is probably to draw each char every N (constant) pixels and let the ends of wide chars overlap the beginning of the next char...

Little things that matter in language design

Posted Jun 11, 2013 10:13 UTC (Tue) by mpr22 (subscriber, #60784) [Link]

I use a fixed-width font to write code chiefly out of pure inertia: most of my coding is done in text editors running in character-cell terminals. Code written in Inform 7 is an exception (the Inform 7 IDE's editor uses a proportional font by default, and the IDE is so well-adapted to the needs of typical Inform 7 programming that not using it is silly), but Inform 7 statements look like (somewhat stilted) English prose so I don't mind so much.

Little things that matter in language design

Posted Jun 8, 2013 13:19 UTC (Sat) by bokr (subscriber, #58369) [Link] (11 responses)

Thank you for a nice article on stuff I have also
been thinking about!

I am working on a language of my own (isn't everyone? ;-)
and would be interested in your reactions to what I am
planning for comment and string syntax.

My language's comments follow either # or ##.
The latter is the traditional rest-of-line comment prefix,
and the single # only comments out the immediately following expression.

This means e.g. you can comment out a string of whatever length,
multi-line or not, and #( expression ).another_part(more) foo
comments out from # to foo.

Which brings me to string syntax. I have a sort of purist gag reflex
against non-nestability, e.g. re xml's <![CDATA[ ... ]]>, but even
a little purist angst re python's practical ''' and """ ;-)

My strings are always quoted with single double-quote delimiters,
but to make nested quoting work, I took a hint from MIME boundaries
and here-docs, and optionally delimit strings like
<identifier><double quote><content><double quote><identifier>
e.g., foo"Can quote bar"the bar content"bar without a problem"foo
so long as <double quote><identifier> does not occur in the content
that was started by <identifier><double quote>.

Incidentally, this makes an easy way of commenting out blocks of code,
using the single expression comment prefix #
#unique_ignore_delimiter"
... arbitrary stuff here
"unique_ignore_delimiter

BTW, the preceding usage is not line oriented, strings begin exactly
after the prefixed delimiter and end with the postfixed delimiter.
foo""foo is a zero length string just like _""_ and "".

String content is always raw, and whether to convert at read time or
run time according to C escaping or something else is specified
by a postfix notation whose exact syntax and semantics is for
another time ;-)
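As a sketch of how such tagged strings might be recognised (illustrative Python, not any real implementation; a full lexer would also need to check that the closing tag isn't followed by further identifier characters):

```python
import re

def parse_tagged_string(src):
    """Recognise <identifier>"content"<identifier> at the start of src.

    The tag is optional; the content runs until '"' followed by the
    same tag, so the content may freely contain other quotes.
    """
    m = re.match(r'([A-Za-z_]\w*)?"', src)
    if m is None:
        return None
    tag = m.group(1) or ""
    end = src.find('"' + tag, m.end())
    if end == -1:
        return None                       # unterminated string
    return src[m.end():end]               # content is always raw

# The example from above:
assert (parse_tagged_string('foo"Can quote bar"the bar content"bar without a problem"foo')
        == 'Can quote bar"the bar content"bar without a problem')
# Zero-length strings:
assert parse_tagged_string('""') == ""
assert parse_tagged_string('foo""foo') == ""
```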

Regards,
Bengt Richter

Little things that matter in language design

Posted Jun 8, 2013 13:44 UTC (Sat) by cmrx64 (guest, #89304) [Link] (4 responses)

Honestly, the little things matter for actually using the language, but aren't important characteristics of the language itself.

Little things that matter in language design

Posted Jun 8, 2013 16:47 UTC (Sat) by bokr (subscriber, #58369) [Link] (3 responses)

Could you clarify for me your concept of "important characteristics",
as opposed to "little things"?

(Naturally, I want my language to have good "important characteristics"
(hm, are there bad "important characteristics"?) ;-)

Little things that matter in language design

Posted Jun 8, 2013 19:04 UTC (Sat) by ncm (guest, #165) [Link] (2 responses)

Every language is packed to the gills with bad important characteristics. It's invidious to dwell on them, so it's often better to promote the good bits and hope the bad are forgotten.

Little things that matter in language design

Posted Jun 9, 2013 1:01 UTC (Sun) by mathstuf (subscriber, #69389) [Link] (1 responses)

> Every language is packed to the gills with bad important characteristics. It's invidious to dwell on them, so it's often better to promote the good bits and hope the bad are forgotten.

As one who implements code in languages with such characteristics, I'd *rather* focus on those. Those are the things that are going to have me ripping my hair out for a week tracking down some simple bug. One particularly nasty one I had to track down in C++ recently was a case where classes change size based on preprocessor defines typically given on the command line (such as NDEBUG). Not much helps you in this case until you notice that the 'this' pointer in the inlined constructor of one of the members of a derived class (the size-shifting class was a member of the base class) is not the same as &this->member.

Little things that matter in language design

Posted Jun 10, 2013 23:50 UTC (Mon) by ncm (guest, #165) [Link]

By "the bad", I meant the languages, almost all of which would be best forgotten if we didn't need them as object lessons.

Little things that matter in language design

Posted Jun 10, 2013 0:13 UTC (Mon) by tjc (guest, #137) [Link] (1 responses)

> My language's comments follow either # or ##.
> The latter is the traditional rest-of-line comment prefix,
> and the single # only comments out the immediately following expression.

I think I would probably flip those around, since # is already widely used for line comments. You will do yourself no favors by breaking common conventions.

Little things that matter in language design

Posted Jun 11, 2013 2:00 UTC (Tue) by bokr (subscriber, #58369) [Link]

Thanks for your comment.
I will try that and see how it works out.

Little things that matter in language design

Posted Jun 10, 2013 20:46 UTC (Mon) by kleptog (subscriber, #1183) [Link] (3 responses)

I find the idea of commenting out an individual expression odd; I can't recall ever needing to do that, but perhaps I would use it if it were possible, who knows.

However, for your string expressions, I'd suggest looking at perl's generalised quoting. It's well thought out and works really well. You don't necessarily need to include qx(), but qw() and qr// are IMHO useful ideas.

Little things that matter in language design

Posted Jun 12, 2013 15:55 UTC (Wed) by bokr (subscriber, #58369) [Link] (1 responses)

Thank you for your comment. It sent me to perl docs,
and made me wonder how much subconscious plagiarizing I am doing vs reinventing ;-/

Haven't played with perl since python became my most fluent pl,
about the time I decided for fun to create a chomsky.py from chomsky.pl[1] ;-)

The qq/qr/qx/qw functionalities are certainly useful.
I can do all those in various ways, including bootstrapping
by defining in terms of my language's more primitive ops.

The question is how built-in to make them, and what to make optional import,
and/or fiddle with startup configuration with invocation options.

At this point I am trying to get the primitives right ;-)

[1] http://www-personal.umich.edu/~jlawler/fogcode.html

Little things that matter in language design

Posted Jun 12, 2013 16:46 UTC (Wed) by bokr (subscriber, #58369) [Link]

Oops, actually I didn't use a perl source. It was a lisp source and it was 1999
and I wrote a perl chomsky.pl based on the lisp original, appending the latter
to the perl script as DATA, and scraping the good stuff without editing the original ;-)

I thought I did python as well, but can't find it on this box. Let's see if google
can find a copy of the lisp .. yup ..

http://www-personal.umich.edu/~jlawler/foggy.lsp

Anyone know who originally wrote it?
[sorry for getting a bit off topic]

Little things that matter in language design

Posted Jun 14, 2013 11:04 UTC (Fri) by bokr (subscriber, #58369) [Link]

The commenting out of individual expressions is completed at the tokenizing phase
if the expression is a name or string, so in that case it can be available for
output at both compile time and run time.

The token gets saved along with source line and char position, as a kind of
source-anchor token. I anticipate debugging use something like (switching as
suggested to ## for this comment prefix),

##"speed m/s"
speed ##meters $(dist) ##seconds time # this ##is all just #-comment to eol
;{; # example syntax error if never closed with }

e.g. might fail because of time instead of $time or $(time) not producing a number,
or at the syntax error if speed can handle a name instead of a number.

The inter-expression position allows error messages to use the anchors to locate
errors more precisely even if full tracebacks are not available, and hopefully
can also locate the last anchor passed for syntax errors, e.g. in case of a runaway
bracket or quote.

This is preliminary musing ;-)

Decimals used by some of my European colleagues

Posted Jun 8, 2013 14:07 UTC (Sat) by jhhaller (guest, #56103) [Link] (3 responses)

Some of my European colleagues transpose the usage of comma and period from what I grew up with in the US, such that three thousand and four hundredths would be represented as 3.000,04. So far, I have not seen this convention baked into a programming language, although it's frequently available for formatting based on the local language settings. It's a bad idea for a program to compile code differently based on the local language settings at compile time, but is there any common convention for specifying a comma as the decimal point in the code?

Decimals used by some of my European colleagues

Posted Jun 8, 2013 14:27 UTC (Sat) by hummassa (guest, #307) [Link]

COBOL had

ENVIRONMENT DIVISION.
DECIMAL-POINT IS COMMA.

:-D

Decimals used by some of my European colleagues

Posted Jun 13, 2013 5:17 UTC (Thu) by tnoo (subscriber, #20427) [Link] (1 responses)

> So far, I have not seen this convention baked into a programming language, although it's frequently available for formatting based on the local language settings.

Microsoft Excel excels at that. Which is a complete nightmare when opening a German spreadsheet in an English version of Excel.

Decimals used by some of my European colleagues

Posted Jun 13, 2013 10:53 UTC (Thu) by storner (subscriber, #119) [Link]

Amen. It also means that a "comma separated values" file (.csv in Excel) is actually a "semi-colon separated values" file if you export it from some non-US versions. Great fun when exchanging stuff between different locales.

Little things that matter in language design

Posted Jun 8, 2013 16:21 UTC (Sat) by mpr22 (subscriber, #60784) [Link] (2 responses)

Curiously, I find Python's indentation-based block structure quite annoying in Python, but not at all annoying in Inform 7. (Probably because Inform 7 doesn't have a REPL, which is where Python's block structure system manages to fail "no sharp edges".)

Little things that matter in language design

Posted Jun 8, 2013 21:56 UTC (Sat) by Tobu (subscriber, #24111) [Link] (1 responses)

IPython has a %paste function that deals with this intelligently. It would be even better if it did this by default, just by looking at how indented the first line is. For more complicated editing, like combining several fragments, you can go with %edit. An inline editor would be great, but that might be considered feature creep, and the IPython notebook provides sort-of the same thing.

Little things that matter in language design

Posted Jun 8, 2013 22:01 UTC (Sat) by Tobu (subscriber, #24111) [Link]

Anyway, copy-pasting into the interpreter is annoying because you have to retrace the initialisations that make the block of code work. It's far more practical to insert import IPython; IPython.embed(), switch to prototyping, and paste back into your editor.

Missed opportunities

Posted Jun 8, 2013 18:51 UTC (Sat) by ncm (guest, #165) [Link] (16 responses)

A new industrial language catches on so infrequently that it's almost tragic when opportunities for real syntactic improvements are passed up, almost as much so as when features inherited from earlier languages are misunderstood and corrupted.

One such opportunity was almost touched on in the article. In western languages, flyspecks such as commas and semicolons are put at the end of a sequence, but they really introduce what comes after. Programming practice mimics this usage, but the usage interacts poorly with revision management systems that present a text line as the unit of change. Python elegantly sidesteps the problem at some cost(*). Go institutionalizes it. In C++, we sometimes see

Ob::Ob
(   int a 
,   int b 
,   int c
)
:   _x( a + b - c)
,   _y( a - b + c)
,   _z(-a + b + c)
{}

enum T 
{ T1
, T2
, T3
};
which, while practical, can be jarring.

The missed opportunity is to prefer markers that do not look odd preceding each item, so that lines can be added at top, bottom, or the middle with no confusing diffs resulting. Regular punctuation does not offer many alternatives, but ":", "*", "+" and "|" have worked well in various contexts. Usually, though, such preceders have been chosen to be deliberately jarring, as in assembly languages and TROFF that use "." to mark meta-directives.

Another common missed opportunity is to eliminate the preceding "*" pointer dereference operator. Pascal's postfix "^" was extremely practical, perhaps the only real virtue in the language. It fell away along with Pascal. In C-influenced languages "^" is too useful for other roles, but "@" would serve admirably. Curiously "@" is rarely used in programming languages, and remains eminently available for such a use in C++1x. "@" as both a unary postfix operator and as a binary array or map indexing operator would free up "[]" brackets for much better uses.

(*) The cost to Python users is that mis-indented lines often cannot be recognized as such. When cutting and pasting code into different contexts, finding the right indentation for each fragment is tedious and (therefore) error-prone.

Missed opportunities

Posted Jun 8, 2013 21:18 UTC (Sat) by eru (subscriber, #2753) [Link]

Pascal's postfix "^" was extremely practical, perhaps the only real virtue in the language

I would say another feature that should be borrowed from Pascal and its relatives is the declaration grammar that allows unambiguous parsing using simple techniques and without requiring feedback from the symbol table. The way C treats typedef names and C++ classes complicates the compiler, and also makes diagnostics worse: C and C++ compilers really cannot tell bare syntax errors apart from missing or mis-spelled declarations.

Missed opportunities

Posted Jun 10, 2013 0:32 UTC (Mon) by tjc (guest, #137) [Link] (8 responses)

> Pascal's postfix "^" was extremely practical, perhaps the only real virtue in the language.

I agree. The Unix signal function declaration, for example, would look a lot nicer with a postfix pointer declarator. But having a corresponding postfix indirection operator causes other problems, most notably with type casts, since postfix operators have higher precedence than prefix operators. You end up with something that looks like this:

((T@)p)@ // where 'p' is a pointer and 'T' is a type

Unless you make type casts postfix as well, but that looks even more unfamiliar:

p(T@)@

It might be best to break the rules and have a postfix pointer declarator while retaining a prefix indirection operator, like this:

*(T^)p

That looks more "normal" to me.

Missed opportunities

Posted Jun 10, 2013 13:35 UTC (Mon) by renox (guest, #23785) [Link] (7 responses)

> most notably with type casts

Which have an AWFUL notation in C-like languages: let's make a very *dangerous* operation non-greppable, yeah fun!

That said, your issue isn't too difficult to fix when you realize that a cast is in fact a two-parameter operation: the type name and the object, so this syntax fixes your issue, I think: cast(T@,p)
or better(?) in C++-like notation: cast<T@>(p)

IMHO this is a better way to solve the issue.

Missed opportunities

Posted Jun 10, 2013 16:06 UTC (Mon) by tjc (guest, #137) [Link] (6 responses)

Thanks for the suggestions.

What I like about cast(T@,p):

  • The operator is outfix with respect to its operands, so there is no precedence problem with the indirection operator, or any other pre- or postfix unary operator.
  • It's "greppable," as you say.

I don't like the "angle brackets" in the second form, since these lexemes are already commonly overloaded as operators: <, <=, <<, etc.

And I'm not crazy about the comma in the first form. It makes it look like a function call, but a cast is not much like a function call. A function call has operands that are at the same lexical level, but a cast has one operand that acts on the other. I think the syntax should reflect this difference in semantics. ':' might be a better separator.

One alternative is cast(p T@), since 'p' is an identifier and never contains spaces. It reads nicely too: "cast p to whatever." If the first operand is a complex expression, then things get messy again, and extra parentheses are required. But at least the common case is clean. There's always something. :)

Missed opportunities

Posted Jun 10, 2013 22:05 UTC (Mon) by mathstuf (subscriber, #69389) [Link] (5 responses)

> cast(T@,p)

Not to mention that if T is something like std::map<K, T>, the comma is ambiguous. Related, GCC bug #35 (or so) where this is a parse error:

> void foo(std::map<std::string, int> const& map = std::map<std::string, int>())

because "int>()" is not a valid parameter declaration. You need parentheses (or a typedef) to get GCC to accept it.
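For what it's worth, the typedef workaround mentioned above looks like this; a minimal sketch, with invented names:

```cpp
#include <map>
#include <string>

// A typedef (or, in C++11, a using-alias) removes the bare comma from the
// default argument, so the parameter declaration parses without extra
// parentheses.
typedef std::map<std::string, int> StringIntMap;

bool is_empty(StringIntMap const& m = StringIntMap()) {
    return m.empty();
}
```

The alias also spares readers the `int>()` puzzle the compiler choked on.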

Missed opportunities

Posted Jun 14, 2013 10:53 UTC (Fri) by pjm (guest, #2080) [Link] (4 responses)

Does anyone here know how the unparsable but more mathematically conventional < > angle brackets won out in C++ over, say, [ ], as in vector[int] ? Cf PolyJ, one of the early Java extensions for template parameters, and its FAQ entry on this issue: http://www.pmg.csail.mit.edu/polyj/comparison.html#angle . I thought it a shame that in the end Java too opted for unparsable < >.

Missed opportunities

Posted Jun 14, 2013 11:59 UTC (Fri) by mpr22 (subscriber, #60784) [Link] (3 responses)

Regardless of whether [] would be easier for the machine to parse, the decision to allow values (rather than just types) as template parameters means that [] would make life harder for humans trying to parse the code.

Missed opportunities

Posted Jun 16, 2013 3:20 UTC (Sun) by tjc (guest, #137) [Link] (2 responses)

Why is that?

Missed opportunities

Posted Jun 16, 2013 3:31 UTC (Sun) by hummassa (guest, #307) [Link] (1 responses)

because you couldn't tell at a glance whether
a[3]
is an array dereference or a template instantiation.

Missed opportunities

Posted Jun 16, 2013 12:19 UTC (Sun) by pjm (guest, #2080) [Link]

Can you expand on that? So far I'm not convinced:
  • It seems no worse than the fact that one can't tell whether ‘a’ by itself is a variable or a type (or a macro), or that one can't tell whether or not ‘a[3]’ involves a function call without knowing the type of a.
  • As has been discussed in this thread, angle brackets cause their own share of difficulty and "embarrassment" (in the word of the later proposal for introducing even more parsing complexity to avoid the need for a space in ‘set<set<int> >’).
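For reference, the space in question; a sketch assuming a C++11 compiler for the second declaration:

```cpp
#include <set>

// Pre-C++11, ">>" at the end of a nested template-argument list lexed as a
// right-shift operator, so the space was mandatory; C++11 added a special
// rule to treat it as closing two argument lists instead.
std::set<std::set<int> > old_style;  // always legal
std::set<std::set<int>> new_style;   // legal since C++11
```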

Missed opportunities

Posted Jun 10, 2013 8:11 UTC (Mon) by jezuch (subscriber, #52988) [Link] (1 responses)

> we sometimes see
>
> Ob::Ob
> ( int a
> , int b
> , int c
> )
> : _x( a + b - c)
> , _y( a - b + c)
> , _z(-a + b + c)
> {}
>
> enum T
> { T1
> , T2
> , T3
> };

I've seen it in some projects and I agree it's ugly as heck, even though I understand the intention behind it. In Java the parser allows "extra" commas after the last element in some places, like array initializers and enum declarations. I'm not sure if this was an accidental omission or intentional, but it's quite useful, e.g.:

Object[] arr = new Object[] {
obj1,
obj2,
obj3,
}

enum Test {
TEST1,
TEST2,
TEST3,
;
}

But it's not allowed in parameter lists, alas.

Missed opportunities

Posted Jun 10, 2013 11:03 UTC (Mon) by sorpigal (guest, #36106) [Link]

This is one of those little conveniences that I like about Perl: trailing commas are ignored (more or less), so you can say


my %map = (
    'one'   => 1,
    'two'   => 2,
    'three' => 3,
);

or in an argument list


sub bar{
    return join(',',@_);
}

print bar(
    1,
    2,
    3,
);
This is worth it even if only because the addition of a line to the map creates a one-line diff and not a two-line one.

Missed opportunities

Posted Jun 11, 2013 3:14 UTC (Tue) by ceswiedler (guest, #24638) [Link] (3 responses)

I've often thought that the best notations would be:

Declare a pointer: Foo^ (Foo plus a pointy thing)
Dereference an address: @Foo (what's at Foo?)
Take an address: &Foo (address-of Foo--I guess I'm just used to this one)

Missed opportunities

Posted Jun 11, 2013 14:34 UTC (Tue) by tjc (guest, #137) [Link] (2 responses)

Yeah, I like that too. There's no requirement that a postfix declarator has to have a matching postfix operator.

Another thing that might work well is implicit indirection, similar to Algol 68, but with more familiar syntax. That would result in a lot of if (&node == &head) to suppress indirection in some cases, but the common case would be clean. The problem is, one would have to write a compiler and then write a lot of code to see how well this works in practice.

Missed opportunities

Posted Jun 11, 2013 15:17 UTC (Tue) by viro (subscriber, #7872) [Link] (1 responses)

<sarcasm> yes, because A68 is such a stellar success... </sarcasm>

The trouble with that approach is the shitload of hard to spot bugs happening when the programmer's idea of how the expression will be interpreted is different from what the Revised Report says (not to mention the places where compiler's idea of how it should be interpreted differs from either). And the rules are appallingly convoluted, exactly because it tries hard to DWIM. With usual nastiness following from that..

C is actually on a sweetspot between A68-level opaque attempt at DWIM (6 kinds of contexts, etc.) and things like BLISS where you have to spell *all* dereferences out - i = j + 1 is spelled i = .j + 1 (and yes, they went and used . for dereference operator, leading to no end of joy when trying to RTFS, especially when it's a lineprinter-produced listing).

Missed opportunities

Posted Jun 11, 2013 17:38 UTC (Tue) by tjc (guest, #137) [Link]

I'm not an expert on Algol 68 (Adriaan van Wijngaarden was probably the first, one of few, and last), but I think implicit indirection only worked in the language because it restricted the things you could do with pointers. Something like *p-- in C, for example — I don't know how that could be expressed without an explicit indirection operator.

Little things that matter in language design

Posted Jun 9, 2013 15:33 UTC (Sun) by deepfire (guest, #26138) [Link] (2 responses)

I find it sad that the collective emotional response to this kind of stuff drowns out language semantics.

Really, we should read more of http://www.lambda-the-ultimate.org/

Little things that matter in language design

Posted Jun 10, 2013 0:40 UTC (Mon) by dvdeug (guest, #10998) [Link] (1 responses)

There's a lot of bikeshedding, but every time you paste some Python text onto a message board that messes up your indentation, or get an incoherent error message from C++, that's this problem. If all that mattered was language semantics, why would a language developer ever waste time using any syntax other than Lisp's?

Little things that matter in language design

Posted Jun 10, 2013 9:46 UTC (Mon) by eru (subscriber, #2753) [Link]

Exactly. Language design is not only a computer-science problem, it is also an ergonomic problem.

Back around 1980, before I really knew anything about programming languages, I recall coming across an advert by IBM in some magazine (probably Scientific American), where it highlighted its research. It quoted one IBM researcher as saying something like "programming language design is like designing traffic signs: the meaning must be clear". For some reason that stuck in my mind. A pretty good insight for an advertisement.

Little things that matter in language design

Posted Jun 10, 2013 8:05 UTC (Mon) by grahame (guest, #5823) [Link]

This post dovetails nicely with a blog post by Joe Armstrong just a week or so ago - Armstrong is one of the people behind Erlang. The post discusses Elixir, a language that runs on top of the Erlang VM, and his programming language design insights.

http://joearms.github.io/2013/05/31/a-week-with-elixir.html

Little things that matter in language design: preprocessor support?

Posted Jun 10, 2013 13:13 UTC (Mon) by etienne (guest, #25256) [Link] (13 responses)

If $LANGUAGE supports a preprocessor like CPP for C/C++, you can do quite a lot of things, like generating DEBUG versions which:
- add/modify source lines to be executed, like printf()
- increase the size of arrays to add testable cases
- create variables to help problem finding (and pass them to sub-functions)
- check some state for integrity (conditional commenting)
All that while keeping the same source code in your source control system.
I know people who would love it in VHDL, because managing sub-branches or un-commenting a lot of non-consecutive lines just to debug is very complex.

Also, using digit separators is really needed when dealing with 64-bit numbers: 1,152,921,504,606,785,809 - 2,921,504,000,000,000 != 1,474,154,769 (obvious truncation to 32 bits, not enough commas); it should be 1,150,000,000,606,785,809.
Digit separators are used every 4 digits in hexadecimal; I am not sure it should be the same "digit separator" as for decimal.
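As it happens, the C++14 draft settled on the single quote as its digit separator, which sidesteps the comma/period question entirely. A sketch of the check above, assuming a C++14 compiler:

```cpp
#include <cstdint>

// C++14 digit separators: ' may appear between any two digits and is
// ignored by the compiler, so the grouping is purely for the reader.
const std::int64_t a    = 1'152'921'504'606'785'809LL;
const std::int64_t b    =     2'921'504'000'000'000LL;
const std::int64_t diff = a - b;
```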

Little things that matter in language design: preprocessor support?

Posted Jun 10, 2013 13:31 UTC (Mon) by micka (subscriber, #38720) [Link]

> Also, using digit separators is really needed when dealing with 64 bits numbers

OK, I agree with that, but please, not the comma, it renders numbers unreadable for the part of the world population that use comma as decimal comma.

Little things that matter in language design: preprocessor support?

Posted Jun 10, 2013 14:08 UTC (Mon) by oever (guest, #987) [Link] (11 responses)

Using preprocessor macros in C++ is strongly discouraged by Stroustrup. He says macros should only be used for include guards, something for which C++ has no other mechanism. For other cases one can use constexpr, templates and …

Macros allow any word (int, void, static, etc) in the code to be redefined which makes parsing the code impossible without knowing the macro definitions. I'd hate to see preprocessor use become more common.

Little things that matter in language design: preprocessor support?

Posted Jun 10, 2013 22:00 UTC (Mon) by mathstuf (subscriber, #69389) [Link]

Well, when I can reduce 100 lines of redundant code down to 10 with macros, I will take it without hesitation. Macros work where other constructs do not, such as in class declarations, stringification of symbols, and more. As an example, the implementation of enum <-> string functions is best written as macros to avoid typos between the case value and the actual string.

I will grant that there are times and situations where using the preprocessor is ugly and unnecessary, but that does not mean that it is always a worse solution.
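A minimal sketch of the enum/string case (the list and names here are invented for illustration), often called the "X-macro" pattern:

```cpp
#include <cstring>

// One list, expanded twice: the enumerators and the string table cannot
// drift apart, because both come from the same COLOR_LIST definition.
#define COLOR_LIST(X) X(Red) X(Green) X(Blue)

enum Color {
#define X(name) name,
    COLOR_LIST(X)
#undef X
};

const char* color_name(Color c) {
    switch (c) {
#define X(name) case name: return #name;
        COLOR_LIST(X)
#undef X
    }
    return "unknown";
}
```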

Little things that matter in language design: preprocessor support?

Posted Jun 11, 2013 17:22 UTC (Tue) by cesarb (subscriber, #6266) [Link] (3 responses)

> macros should only be used for include guards, something for which C++ has no other mechanism.

The C++ standard does not have it, but every relevant implementation (even MSVC) has it: #pragma once (https://en.wikipedia.org/wiki/pragma_once).

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 14:16 UTC (Wed) by khim (subscriber, #9252) [Link] (2 responses)

MSVC actually introduced it... and it does not work.

It only works if you only ever have one project, never copy headers around, and thus never have two versions of the same header. In practice GCC will actually compare files, which will generate many nice debugging hours if you use a VCS (which tends to mess with file dates).

Now it works:
$ mkdir lib
$ echo $'#pragma once\nint a;' > lib/test.h
$ mkdir installed
$ cp -a lib/test.h installed/test.h
$ echo $'#include "lib/test.h"\n#include "installed/test.h"' > test.c
$ gcc -E test.c -I. -o-
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "test.c"
# 1 "lib/test.h" 1

int a;
# 2 "test.c" 2

And now it does not:
$ touch installed/test.h
$ gcc -E test.c -I. -o-
# 1 "test.c"
# 1 "<built-in>"
# 1 "<command-line>"
# 1 "test.c"
# 1 "lib/test.h" 1

int a;
# 2 "test.c" 2
# 1 "installed/test.h" 1

int a;
# 2 "test.c" 2

Please, don't use #pragma once - it's not worth it.

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 19:48 UTC (Wed) by dvdeug (guest, #10998) [Link] (1 responses)

If you have two different copies of the same header in the same project, you're in deep trouble. Standard include guards will just cause you to fail in different ways whenever you hit the differences.

Little things that matter in language design: preprocessor support?

Posted Jun 14, 2013 21:47 UTC (Fri) by khim (subscriber, #9252) [Link]

Why would you fail? If newer versions of components are backward-compatible (and they should be backward-compatible if they are separate components) then you just need to copy headers in the proper order... which happens automatically: first, the updated header is in the new component itself (and its headers are always included before headers from other components), then you update the next component in the dependency DAG, etc.

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 11:39 UTC (Wed) by etienne (guest, #25256) [Link] (5 responses)

> For other cases can use constexpr, templates and ??

C++ (without CPP) has no way to self-reference names, I mean:
printf ("Entering %s\n", __FUNCTION__);

C++ (without CPP) has no way to print/read each of the fields of a struct, the only (dirty) way is:
#undef FIELD_DEF
#define FIELD_LIST() \
FIELD_DEF(char, fieldname1, "0x%X", "%hhx") \
FIELD_DEF(unsigned, fieldname2, "0x%X", "%i")

struct mystruct {
#define FIELD_DEF(type, name, howto_print, howto_scan) type name;
FIELD_LIST()
#undef FIELD_DEF
}

void printstruct(const struct mystruct *str)
{
#define FIELD_DEF(type, name, howto_print, howto_scan) \
printf (#name howto_print "\n", str->name);
FIELD_LIST()
#undef FIELD_DEF
}

int scanstruct(struct mystruct *str, char *inputline)
{
static const char *scanf_format =
#define FIELD_DEF(type, name, howto_print, howto_scan) #name " " howto_scan " "
FIELD_LIST();
#undef FIELD_DEF

static const int scanf_nb = 0
#define FIELD_DEF(type, name, howto_print, howto_scan) + 1
FIELD_LIST();
#undef FIELD_DEF

return scanf_nb == sscanf(inputline, scanf_format,
#define FIELD_DEF(type, name, howto_print, howto_scan) &str->name,
FIELD_LIST()
#undef FIELD_DEF
);
}

C++ (without CPP) has no way to conditionally comment out part of the code at compilation time (make DEBUG=1 or gcc -DDEBUG=1) so that the exact same file is kept in your source management system (no special tree for debug).
(Obviously this is for methodologies which do allow bugs to enter the source management system; others don't need special stuff, as bugs are fully denied.)

C++ (without CPP) cannot manage a simple special exception, like a new field in a (memory-mapped) structure only when generating for a particular piece of hardware.

C++ (without CPP) does not have automatic tools to remove a "conditional comment" from source code, like "man unifdef".

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 14:34 UTC (Wed) by khim (subscriber, #9252) [Link] (1 responses)

C++ (without CPP) has no way to self reference names, I mean:
printf ("Entering %s\n", __FUNCTION__);

Works fine here:
$ cat test.cc
#include <stdio.h>

int main() {
  printf ("Entering %s\n", __func__);
}
$ gcc test.cc -o test
$ ./test
Entering main

The other examples look like a classic case for boost::serialization or other metaprogramming tricks, except for the requirement to use the preprocessor without the preprocessor. I mean: you can't use the -D directive, which is preprocessor-specific... well, duh - that's a directive for CPP, not for the compiler! With C++ you implement your special cases as template specializations and then just construct the version you actually need from the main program. make DEBUG=1 works while gcc -DDEBUG=1, of course, does not.

If anything your examples support Stroustrup's position, not contradict it.

The fact that most languages out there work just fine without a CPP (even low-level ones used to interact with hardware and write standalone OSes!) says something, after all.

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 16:21 UTC (Wed) by nybble41 (subscriber, #55106) [Link]

Actually, you're still using CPP. The "#include" line should have been a clue, since that's a preprocessor directive and not valid C or C++ syntax. Try renaming the file to a ".ii" extension or passing "-x c++-cpp-output" to skip the preprocessing step.

In this case, however, the original sample would actually work, because __FUNCTION__ (and __func__) are handled by the compiler rather than CPP. The preprocessor doesn't parse the code, and consequently doesn't have any idea what the current function's name is. The __FILE__ and __LINE__ macros would be an entirely different matter.

Little things that matter in language design: preprocessor support?

Posted Jun 12, 2013 19:01 UTC (Wed) by daglwn (guest, #65432) [Link] (1 responses)

> C++ (without CPP) has no way to self reference names, I mean:
> printf ("Entering %s\n", __FUNCTION__);

True. __LINE__ is one of the few reasons I use the preprocessor.

> C++ (without CPP) has no way to print/read each of the fields of a struct,
> the only (dirty) way is:

Wow, that's totally unreadable. Lots of people want introspection and I think we'll get it soon in C++.

> C++ (without CPP) has no way to conditionally comment part of the code at
> compilation time (make DEBUG=1 or gcc -DDEBUG=1) so that the exact same
> file is kept in your source management system (no special tree for debug).

Yes, but not quite in the way you think. I prefer:

#ifdef DEBUG
const int debugEnabled = true;
#else
const int debugEnabled = false;
#endif

if (debugEnabled) { ... }

Using the preprocessor to hide code has bitten me so many times (different results with DEBUG on/off, etc.) that I just don't want to do it anymore.

> C++ (without CPP) cannot manage simple special exception like a new
> field in a (memory mapped) structure only when generating for this
> special hardware.

Not true. Template metaprogramming.

Ok, you might need one #define TARGET, but that's it.
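A minimal sketch of what that specialization approach can look like (the target tag and field names here are invented):

```cpp
// A single TARGET selection (one #define, or a constexpr) picks the
// specialization; the extra field exists only in the special-hardware layout.
enum class Target { Generic, SpecialHw };

template <Target T>
struct Regs {
    unsigned ctrl;
    unsigned status;
};

template <>
struct Regs<Target::SpecialHw> {
    unsigned ctrl;
    unsigned status;
    unsigned extra;  // memory-mapped field present only on this hardware
};
```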

> C++ (without CPP) do not have automatic tools to remove a "conditional
> comment" from source code like "man unifdef"

I don't have that tool and can't imagine what I'd need it for. Can you give an example?

Little things that matter in language design: preprocessor support?

Posted Jun 14, 2013 11:47 UTC (Fri) by etienne (guest, #25256) [Link]

> > C++ (without CPP) has no way to print/read each of the fields of a struct,
> > the only (dirty) way is: ...
> Wow, that's totally unreadable. Lots of people want introspection
> and I think we'll get it soon in C++.

In those few cases, I wanted more of a simple database (30 elements with 10 properties) than complex introspection.
Some targets I have have a very small memory (256 Kbytes of total internal memory on the processor before the DDR-RAM is initialised, or a soft-processor (written in VHDL inside an FPGA) with 96 Kbytes of RAM); I cannot afford indirections and data hiding.
These macros enabled me to reduce the number of lines of source code to maintain, while keeping total control of the structure for the tens of different exceptions where you cannot use the database.

> > C++ (without CPP) do not have automatic tools to remove a "conditional
> > comment" from source code like "man unifdef"
>
> I don't have that tool and can't imagine what I'd need it for.
> Can you give an example?

If you manage big and complex software with a decades-long lifespan, you will have some code which is no longer valid because the hardware is no longer in use.
At some point nobody you know remembers why this #ifdef was added, and when you try to compile with the #ifdef enabled it does not compile (and hasn't for the last 3 years).
That is the right time to run "unifdef" to remove that part of the code automatically from all your sources.
Sometimes these parts of code are extremely dirty hacks, made to handle a bug of an external company (don't ask, you can't get it fixed), and you really do not want to alter your design to handle that possible bug (only when you sell box A to a third party which has box B).
Lucky you are if you are not forced to eat some other company's dog food for years at a time...

Little things that matter in language design: preprocessor support?

Posted Jun 13, 2013 0:01 UTC (Thu) by mathstuf (subscriber, #69389) [Link]

The pattern I typically use for something like[1] this would be:

#define mystruct_members(call) \
    call(char, fieldname1, "0x%X", "%hhx") \
    call(unsigned, fieldname2, "0x%X", "%i")

struct mystruct {
#define declare_member(type, name, print, scan) type name;
    mystruct_members(declare_member)
#undef declare_member
};

If you use this extensively enough, declare_member and such could be factored out into a separate header so that the same expansion for FIELD_DEF isn't used dozens of times.

Maybe it doesn't work on older compilers (passing macro names as arguments and all), but I don't see much reason not to use this pattern where it's available, and I haven't yet run into a compiler that doesn't support it (granted, that covers MSVC, newer GCC, and LLVM for the projects which use this).

[1] Because C++ has a couple of different contexts in which things like this can be expanded, the actual meta-macro takes a "ctx" parameter as well, which is then used as "BEG(ctx) call(...) SEP(ctx) call(...) SEP(ctx) call(...) END(ctx)" so that stray semicolons are avoided and the macro can be expanded as part of an initializer list or argument list if needed.

Little things that matter in language design

Posted Jun 12, 2013 7:58 UTC (Wed) by baberuth (guest, #15655) [Link]

Nice article with a different view on things.

It seems most new languages take a revolutionary, rather than an evolutionary, approach. This might indicate the maturity level of the ecosystem as a whole.

The C2 language tries to be an evolutionary step of C, instead of designing a completely new language. Many of its syntax decisions are still open and will be determined by online polls among programmers.

Little things that matter in language design: make it do what it looks like it does

Posted Jun 13, 2013 13:28 UTC (Thu) by NRArnot (subscriber, #3033) [Link] (12 responses)

I was surprised that Python didn't get a look in. By making the indentation define the statement grouping, it eliminates a class of error caused by code that "obviously" does something because of the (insignificant) whitespace that a human interprets as significant, and which actually parses as something different.

I find the thought of Unicode identifiers horrifying. It's hard enough trying to read code written by a programmer whose main (natural) language is not yours, and whose variable names therefore convey rather fewer hints of meaning to you than they might. But at least they are strings of the 63 or so glyphs familiar to all of today's programmers. Trying to recognise strings of unfamiliar glyphs from an "alphabet" of 60,000 or more patterns, most of which one has never seen before, would be, to me, rather harder than parsing machine code dumped in hexadecimal.

Of course a (say) Arabic-world programmer might wish to write his variable names in Arabic, but in that case isn't the logical progression also to replace the language's reserved words with Arabic equivalents and (of course) to switch left and right on the page or screen? This would fragment programming the same way multiple natural languages fragment human discourse. Programming wasn't fragmented to start with, so shouldn't we keep it that way?

Little things that matter in language design: make it do what it looks like it does

Posted Jun 13, 2013 14:47 UTC (Thu) by renox (guest, #23785) [Link]

I agree that indentation is a "little thing that matters", especially since I remember a teacher finding that his classes (of beginner programmers) learned much more quickly when he made his home-made language indentation-sensitive (à la Python).
That said, gofmt (or similar tools) is another way to make sure that indentation is correct without having the n-th discussion about whether you should use tabs or spaces to indent your code, and how to configure your editor correctly.

Little things that matter in language design: make it do what it looks like it does

Posted Jun 20, 2013 13:08 UTC (Thu) by Otus (subscriber, #67685) [Link] (10 responses)

> I was surprised that Python didn't get a look in. By making the indentation define the statement grouping, it eliminates a class of error caused by code that "obviously" does something because of the (insignificant) whitespace that a human interprets as significant, and which actually parses as something different.

I used to think Python's significant whitespace was awful, but since using it more I find it actually rather pleasant to work with. However, I now find I hate the colon. Since the indent already tells you where a block starts, why is it needed? Google tells me it's because "explicit is better than implicit", but in that case why is the *end* of a block implicit? Makes no sense to me...

Little things that matter in language design: make it do what it looks like it does

Posted Jun 20, 2013 21:32 UTC (Thu) by neilbrown (subscriber, #359) [Link] (8 responses)

I actually like the colon. It clearly marks the end of the condition and the start of the statement block.

I don't think a newline-followed-by-indent clearly marks an end so well. I can write:

  if something and (otherthing
                    or  whatever) :
       statements
and the first "newline-followed-by-indent" doesn't mark the end of anything.

So the requirement of a colon causes unbalanced brackets in the condition to be easily detected. Without it the compiler might not notice until much later.

A "dedent" (I think that is what Python calls the opposite of an indent), on the other hand, always clearly marks the end of something. There is no uncertainty, so no need for extra syntax.

Little things that matter in language design: make it do what it looks like it does

Posted Dec 28, 2014 8:29 UTC (Sun) by maryjmcdermott57 (guest, #100380) [Link] (7 responses)

Coming back to your article a little: I want to ask one question about Rust.

First of all, I want to say thank you for your article. It answers a question I had (in the "semicolons & expression" section): when should I add a ; at the end of a whole if/else expression, since it is also an expression? I asked that question in many forums, but most people either paid no attention or treated it as a silly question.

But your answer leads to more questions.
As you said, I assume that in Rust, if an expression returns the unit type (), then we don't need to add a ; at the end of that expression.

Likewise, if what is inside the if/else expression (e.g. a function call) returns the unit type (), then we don't need a ; at the end of either each internal function call or the whole if/else expression.

So what about a function call outside an if/else expression?
The rule doesn't seem to hold anymore.
I mean, I have this code:

fn main() {
    println!("hello") // I think no ; is needed here, but that's wrong
    println!("world") // I think no ; is needed here, but that's wrong
}

When I run this code, the compiler raises an error. Why does that happen, when println!() returns the unit type ()? Under the assumption above, this code should work.

And can I ask you one more question? Is a function declaration an expression or not? I don't see a ; after the closing } of a function declaration. I mean, if it were an expression, it should look like this:

fn foo() {
    // do something
}; // a ; should go here, but in practice it's not there

Little things that matter in language design: make it do what it looks like it does

Posted Dec 28, 2014 9:15 UTC (Sun) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

This works just fine:

fn main()
{
    println!("Hello, world!")
}

Here you have a 'fn main()' returning the result of 'println!' invocation (a macro).

Little things that matter in language design: make it do what it looks like it does

Posted Dec 28, 2014 13:16 UTC (Sun) by maryjmcdermott57 (guest, #100380) [Link]

Yes, I already know that if there is just one function call, as in your example, I can omit the ; at the end of println!().

But what I want to know is why, although both println!("hello") and println!("world") return the unit type (), we still need to separate them with a ;.

Because, as you saw in the article, if the value inside the if/else is the unit type (), there is no need to add a ; at the end of the whole if/else expression. Otherwise we have to.

Little things that matter in language design: make it do what it looks like it does

Posted Dec 29, 2014 9:11 UTC (Mon) by jem (subscriber, #24231) [Link] (4 responses)

"As you said, I assume that in Rust, if the expression return the unit-() type then we wouldn't require add ; at the end of that expression."

No, it's the other way around: if you add a semicolon at the end of an expression, it turns the expression into a statement. Doing this throws away the value of the expression and returns () instead.

Rust does not allow an expression to follow another expression; you have to turn all but the last of a sequence of expressions into statements by inserting semicolons after them. This means that you are only interested in the side effects of all but the last expression, and you return the value of the last expression. If you wish, you can put a semicolon after the last expression too, if its value is not useful.

Little things that matter in language design: make it do what it looks like it does

Posted Dec 29, 2014 10:26 UTC (Mon) by maryjmcdermott57 (guest, #100380) [Link] (3 responses)

With your answer, I summed it up like this:
- with the last expression in a block, adding a ; is up to me (depending on my intention)

But it also leaves me more confused. So can you explain this code a little more?

if condition {
    function1()
} else {
    function2()
}
expression3;

With the code above, the if/else is not the last expression. And if both functions return no value, we don't need to add a ; after the closing } of the whole if/else. That makes sense (as the author said, the language doesn't require that ; if the whole if/else has no value). With that knowledge, why do we have to add a ; at the end of println!("hello")? println!("hello") doesn't return a value either:

fn main() {
    println!("hello") // why do we have to add a ; here?
    println!("world")
}

And if I can, please answer me one more question. What about other blocks like match, loop, struct, and function declarations? Are these also expressions as a whole?

Because I have seen function declarations next to each other without a ; between them, like this:

fn foo() {
    // ...
} // no ; here
fn bar() {
    // ...
}

If they are expressions, we have to separate them with a ; for the code to compile correctly, right?

Please bear with me if my questions are annoying. I'm very confused about all this; I have asked in other places too but didn't get a good answer, or people just ignored the question and voted to close the topic.

Little things that matter in language design: make it do what it looks like it does

Posted Dec 29, 2014 10:51 UTC (Mon) by Cyberax (✭ supporter ✭, #52523) [Link] (1 responses)

>fn main() {
>    println!("hello") // why we have to add ; at here
>    println!("world")
>}

You need the ';' because you can't just combine two expressions together. What operation would be used for that?

(Never mind the fact that unit values can't be used in _any_ operation.)

Little things that matter in language design: make it do what it looks like it does

Posted Dec 29, 2014 11:21 UTC (Mon) by maryjmcdermott57 (guest, #100380) [Link]

Yes, of course I can't combine two expressions together. So what about the if/else and expression3? Both are expressions, right? But we can omit the ; after the if/else expression in some cases. It makes no sense that we can omit the ; after an if/else expression but can't do the same with println!("hello"). For consistency, Rust should force us to add a ; after the if/else expression (after the closing } of the else) too.

Little things that matter in language design: make it do what it looks like it does

Posted Dec 29, 2014 12:50 UTC (Mon) by jem (subscriber, #24231) [Link]

"With this code above, the if/else is not last expression. And if both function don't return value so we don't need add ; at the end of } of whole if/else. That makes sense (as author said that the language doesn't require that ; if whole if/else doesn't have value)."

You don't put a semicolon after the closing } of the whole if/else. This has nothing to do with whether the if should return a value or not – you never put a semicolon there. The if/else returns a value if the last expressions in the if and else branches do not end with a semicolon.

"What about other blocks like match, loop, struct, function declaration? These are also expression as whole or not."

Just like in the if/else case, you don't put a semicolon there. As with the if case, this does not mean these constructs have a value. Rust (mostly) borrows this syntax from C.

I think the best way of thinking about this is not to try to minimize the amount of semicolons. Do not focus all the time on whether you can leave out a semicolon, but instead think "do I want to return the value of the last expression in this block as the value of the block?". If that is the case, then you should leave out the semicolon.
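This advice can be shown in a small sketch (the function name `pick` is made up for illustration):

```rust
// Whether an if/else yields a value is decided inside the branches,
// never by a semicolon after its closing brace.
fn pick(condition: bool) -> i32 {
    // Unit-valued if/else in statement position: each branch ends in
    // a () expression, so the whole if/else is () and stands alone,
    // with no semicolon after the closing brace.
    if condition {
        println!("taken")
    } else {
        println!("not taken")
    }
    // Value-producing if/else: the branches' last expressions have no
    // trailing semicolons, so the if/else yields 1 or 2. The `;` here
    // ends the let statement, not the if/else itself.
    let n = if condition { 1 } else { 2 };
    n  // last expression: the function's return value
}

fn main() {
    assert_eq!(pick(true), 1);
    assert_eq!(pick(false), 2);
}
```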

Little things that matter in language design: make it do what it looks like it does

Posted Jun 21, 2013 8:41 UTC (Fri) by renox (guest, #23785) [Link]

> "explicit is better than implicit"

Bah, this is very selectively applied in Python. For example, other languages distinguish variable declaration from variable assignment:
var x = ... (declare x and assign a value to it)
x = ... (assignment only)
In Python you only have "x = ...": the first assignment implicitly declares x.

Little things that matter in language design

Posted Jun 26, 2013 13:30 UTC (Wed) by nye (subscriber, #51576) [Link]

I feel like the designers of Go sat around a very large conference table, and had a brainstorming session in which they listed every misfeature of every language ever, so they could be sure to implement all of them.

There is no 0b prefix in C11

Posted Sep 30, 2014 7:47 UTC (Tue) by stefanct (guest, #89200) [Link] (1 responses)

The standard allows for compiler-defined extensions (in C99 too, by the way), and GCC supports a 0b binary prefix, but it is not mandatory and '0b' is not even mentioned explicitly in the standard (at least in the latest draft).

There is no 0b prefix in C11

Posted Sep 30, 2014 14:49 UTC (Tue) by jwakely (subscriber, #60262) [Link]

Indeed. 0b prefixes are standard in C++14 though, and so are literals with digit-separators such as 1'000'000'000 (the other characters that could have been used were even worse).


Copyright © 2013, Eklektix, Inc.
Comments and public postings are copyrighted by their creators.
Linux is a registered trademark of Linus Torvalds