EEVblog Electronics Community Forum
Products => Computers => Programming => Topic started by: hamster_nz on October 20, 2020, 02:54:31 am
-
Can you see what is wrong with this, without compiling it?
#include <stdio.h>
int main(int argc,char *argv[]) {
printf("The next number after 0xE is 0x%X\n", 0xE+1);
}
-
It may think E+1 is the exponent of a float (one that isn't actually there)
-
Just compile it.
-
Just compile it.
Indeed!
It's interesting because it works as expected for any hex value except 0xE: 0xF+1 or 0xA+1, for example, are fine.
But the letter E is special, meaning that the compiler seems to be parsing it from the right, detecting "E<+|-><number>" as a token (I'm assuming it's for the floating-point exponent), then parsing from the left to find the 0x denoting an integer constant (in hex representation), then erroring out because these two things cannot coexist.
The takeaway seems to be that whitespace should be used to separate arithmetic operators from their operands, even if omitting it works 99.99% of the time.
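A quick test matrix to illustrate (behaviour as seen with GCC; just a sketch, other compilers may differ):

#include <stdio.h>

int main(void) {
    printf("0x%X\n", 0xA+1);    /* fine: 'A' can't start an exponent; prints 0xB */
    printf("0x%X\n", 0xF+1);    /* fine: prints 0x10 */
    printf("0x%X\n", 0xE + 1);  /* fine with whitespace: prints 0xF */
    /* printf("0x%X\n", 0xE+1); */  /* error: invalid suffix "+1" on integer constant */
    return 0;
}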
-
This looks odd at first, and I had to take the standard out to figure out whether this is a compiler-specific quirk (I checked only with GCC) or whether it makes sense from the standard itself.
So let's see in the "Constants" section.
I think this behavior is explained by something I more or less knew about but have NEVER used myself (and, I think, never seen in actual code, YMMV): hexadecimal floating constants.
Since the token begins with "0x", you'd expect the compiler to look for a hexadecimal constant following it, and the parser most likely does exactly that. But it doesn't know yet what kind to expect: it could be an integer hex constant OR a "floating" hex constant.
As per the standard, "hexadecimal floating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent would be indicated with the letter "E" in this context too, whereas "E" is obviously also a hex digit. It's a recipe for potential parsing issues, as you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.
-
I'm too lazy to dig through the C ISO/IEC standard, but the problem seems to be that it actually allows hexadecimal float constants (and thus also exponents).
I wasn't even aware that there's also the character "p" to introduce the binary exponent:
From ISO/IEC 9899:1999:
FLT_MIN 0X1P-126F // hex constant
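To see those in action, a tiny example (the values in the comments are my own working): the digits before 'p' are hex, and the exponent after 'p' is a decimal power of two:

#include <stdio.h>

int main(void) {
    double a = 0x1p3;    /* 1.0  * 2^3 = 8.0  */
    double b = 0x1.8p1;  /* 1.5  * 2^1 = 3.0  */
    double c = 0xEp0;    /* 14.0 * 2^0 = 14.0 ('E' is a plain hex digit here) */
    printf("%g %g %g\n", a, b, c);  /* prints: 8 3 14 */
    return 0;
}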
-
It is compiler or C standard dependent; on Visual Studio 2017 it just works.
-
looks ok and works on my compiler but I would normally add a leading zero when doing hexadecimal, e.g. 0x0e
-
It is compiler or C standard dependent; on Visual Studio 2017 it just works.
Exactly! It is compiler or C standard dependent. Even within a vendor, compiler versions and standards evolve.
Even spoken language evolves. Fluke Corporation first introduced the "ScopeMeter" in 1987. Had this forum existed before then, a "Fluke" would likely have been a reference to a blood worm, which is what the word fluke means. But "Fluke" here now probably refers to a DMM instead of a blood worm.
-
As per the standard, "hexadecimal floating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent would be indicated with the letter "E" in this context too, whereas "E" is obviously also a hex digit. It's a recipe for potential parsing issues, as you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.
According to the information below, the exponent should be indicated with the letter "p" or "P", not "E":
http://books.gigatux.nl/mirror/cinanutshell/0596006977/cinanut-CHP-3-SECT-2.html#:~:text=A%20hexadecimal%20floating%2Dpoint%20constant,the%20letter%20p%20or%20P.
So perhaps this could be a compiler bug?
-
As per the standard, "hexadecimal floating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent would be indicated with the letter "E" in this context too, whereas "E" is obviously also a hex digit. It's a recipe for potential parsing issues, as you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.
According to the information below, the exponent should be indicated with the letter "p" or "P", not "E":
http://books.gigatux.nl/mirror/cinanutshell/0596006977/cinanut-CHP-3-SECT-2.html#:~:text=A%20hexadecimal%20floating%2Dpoint%20constant,the%20letter%20p%20or%20P.
So perhaps this could be a compiler bug?
You're right! Still not familiar with those hex floating constants. I re-read the standard more closely.
The "exponent" part of "hexadecimal-floating-constant" is "binary-exponent-part", and not just "exponent-part" (as I assumed), and the "binary-exponent-part" starts with a 'P' or 'p' instead of 'E' or 'e'.
(Cf. "6.4.4.2 Floating constants" of C99)
So yeah. Definitely looks like a compiler bug. Since it only seems to appear with GCC (I tried the latest, GCC 10.2), this would warrant a bug report IMO.
-
looks ok and works on my compiler but I would normally add a leading zero when doing hexadecimal, e.g. 0x0e
This is just a matter of style. It doesn't change anything: even with a leading zero, the same error is triggered with GCC (I tried).
-
Interesting! Never knew about the float literals either.
What version of GCC are y'all finding this compiles in? The oldest version I have on my box is 4.7.3 and it fails with (as does the latest I have, 8.2.0):
test.c: In function ‘main’:
test.c:4:49: error: invalid suffix "+1" on integer constant
Same thing with clang 9 and 10.
I checked on Godbolt (https://godbolt.org/z/rPMExf) and I can't get any of the gcc versions available there to compile this code (including 10.2 and the oldest 4.1.2). MSVC compiles it, but uses 15 as 'expected'. Old versions of ICC do the same; newer ones don't compile.
-
Does it still fail if you write it as 0xE + 1, since normal coding guidelines usually require spaces around operators?
-
I would define any 8-bit byte as 0x0e and printf it with %02X to zero-fill.
printf("The next number after 0x0E is %02X\n", ( 0x0e + 1 ));
Or your code normalised...
printf("The next number after 0x0E is 0x0F\n");
-
It's a possible bug in the language definition or it's a bug in the compiler implementation. Well, not a 'bug' strictly speaking, but poor specification in the light of knowing what grammars would be used to put a compiler for the language together. Specifically, the grammar looks to be ambiguous - something that a parser for a "Context Free Grammar*" cannot handle deterministically. I'd have to dig into the real details of the standard to be sure that the specification is at fault, rather than the implementation(s). If the standard disambiguates this, then it's the implementation(s).
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit**' as part of a 'hex constant' or to an 'exponent'. There are two possible parsings of that expression; the compiler has to decide what it believes the "E" represents, and to break the ambiguity (for most common parsing algorithms) the compiler writer has to indulge in a trick that prioritises one interpretation over the other. However, if you include the space, "0xE +1", there's no longer an ambiguity, because the space implies the end of the 'hex constant'.

There's an oddity here: in theory spaces are not significant in C expressions outside of string constants; however, the parser still has to deal with their presence, so although not 'significant', a space can, by not being an allowed part of something, indicate that the 'something' has finished and the parser can treat it as recognised. So in practice spaces are significant in some contexts.
* A term of art from linguistics and the branch of mathematics/computer science that deals with parsing languages.
**I'm just making up the names for the terminals and non-terminals of the grammar - these aren't the official ones used in any of the C standards.
-
Next question...
Given int x = 011+1, what is the value of x?
-
Next question...
Given int x = 011+1, what is the value of x?
x = 10 ?
-
Next question...
Given int x = 011+1, what is the value of x?
10 or 012, depending on your mood, or even 0xA.
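For those checking at home, a small sketch:

#include <stdio.h>

int main(void) {
    int x = 011+1;        /* 011 is octal, i.e. 9 decimal, so x == 10;
                             unlike 0xE+1 this lexes fine - '1' is not an
                             exponent letter, so the '+' ends the constant */
    printf("%d\n", x);    /* prints 10  */
    printf("0%o\n", x);   /* prints 012 */
    printf("0x%X\n", x);  /* prints 0xA */
    return 0;
}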
-
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit**' as part of a 'hex constant' or to an 'exponent'.
I don't agree. There is no ambiguity IMO if you stick to what the standard defines.
Any constant starting with "0x" or "0X" IS to be parsed, by definition, as a hex constant. Period. "E", in the context of a hex constant token, can't be anything other than a hex digit (since the correct exponent in this context would start with a "P"). So I can't see this behavior as anything other than a parser bug.
Parsers are not easy to implement though. I've written parsers for C (and other languages) before, so I do know how hairy that can get. (Never dealt with those hex floating consts though, so that didn't ring a bell.) I see this one example as a "tokenization" issue.
A basic tokenizer may first get "0xE+1" (with no whitespace) as a single token. Now extracting tokens just based on whitespace is a first step, but far from sufficient, obviously. There needs to be additional tokenization steps. And that's where things get hairy, and with MANY particular cases. Other tokenizers may, OTOH, split the above as "0xE", "+", and "1". In this case, for actual exponents (correct ones, such as "1E+1"), the parser would need to aggregate the tokens based on context.
In the OP's case, I'm guessing some parsers (such as GCC's one) may have just ONE case for handling the tokenization for the exponent part of constants, and so detects the sequence "E+..." as an exponent part, whereas it clearly can't be an exponent part in the context of a hex floating constant! This bug is probably a long-standing one, dating back to the first introduction of hex floating constants, so when they started implementing C99.
I can't see the ambiguity from the grammar's POV. But I can definitely see how that would make parsers harder to implement.
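To make that concrete, here's a toy sketch (my own illustration, not GCC's actual code) of a scanner applying the kind of greedy "number-like token" rule I mean - note how it swallows a '+' that follows an 'E':

#include <ctype.h>
#include <stdio.h>
#include <string.h>

/* Toy sketch of C's preprocessing-number rule: starting at a digit, keep
   consuming digits, letters, '.' and '_', plus a '+' or '-' ONLY when it
   immediately follows e/E/p/P. Not GCC's real code, just an illustration. */
static size_t scan_ppnumber(const char *s) {
    size_t i = 1; /* caller guarantees s[0] is a digit */
    while (s[i]) {
        if (isalnum((unsigned char)s[i]) || s[i] == '.' || s[i] == '_')
            i++;
        else if ((s[i] == '+' || s[i] == '-') && strchr("eEpP", s[i - 1]))
            i++; /* the greedy step: a sign glued onto an exponent letter */
        else
            break;
    }
    return i;
}

int main(void) {
    const char *examples[] = { "0xE+1", "0xF+1", "0xE +1", "011+1" };
    for (int k = 0; k < 4; k++) {
        size_t n = scan_ppnumber(examples[k]);
        printf("\"%s\" -> number token \"%.*s\"\n", examples[k], (int)n, examples[k]);
    }
    return 0;
}

Run it and "0xE+1" comes out as the single token "0xE+1", while "0xF+1" stops at "0xF".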
-
One point not fully addressed in this thread is the difference between C and C++ languages and standards. Just because something is in the C99 standard doesn't automatically mean it will be supported by a C++ compiler. It may be excluded, or it may require a special option to enable it.
So in this thread, it is important to state and verify whether the test is being performed with C++ or with C. The result could be different. What works in one could be a syntax error in the other.
-
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit**' as part of a 'hex constant' or to an 'exponent'.
I don't agree. There is no ambiguity IMO if you stick to what the standard defines.
As I said, I hadn't [at the time] divined whether this is standard or compiler implementation and left the 'blame' open, but it's now clear that it's the compiler(s) in question. The cause is almost certainly using an "ambiguous grammar*", and an incorrect one to boot. It's pretty clear, just from the context of the error, that there is, in the compiler's implementation of its own grammar for the language, an ambiguity between recognising an expression with a hex constant on the left and recognising a constant with an exponent part when a hex E immediately precedes the operator.
It is rare, in most production compilers, to use the grammar exactly as published for any given programming language, usually to accommodate some sort of syntax-directed translation scheme, or because you want to use an LL(1) grammar and the published one is LALR(k) or similar.
Just had a quick look at the source for GCC 10.2.0 and 4.1.2. No chance of tracking it down in short order, but I thought I'd have a look just in case it was easy to find. GCC was never easy to find your way around the insides of, and I'm afraid that it hasn't got any easier since I last dived into the bowels of it (which would have been some flavour of 4.x).
Amazingly I just partially tracked it down. In expr.c, when it has determined it has an integer constant, it calls interpret_int_suffix. interpret_int_suffix looks ahead in the token; if the token has ended, it returns; if the token hasn't ended and it finds any of the u/U/l/L etc. suffixes, it sets things up appropriately; if it finds anything else, it returns an error and spits out the "invalid suffix \"%.*s\" on integer constant". The problem here is that an earlier stage has tokenized the "+" into the lexeme instead of treating it as an operator token in its own right - haven't tracked that down. There isn't an explicit written grammar here (as there would have been if this was the classic lex/yacc combo) but this genuinely is an ambiguous grammar (otherwise it wouldn't get the version without an E right).
*Again, a term of art. The grammar:
<expression> ::= <expression> + <expression> | <expression> * <expression> | identifier ;
is ambiguous. This variant is a classic: the absence of any priority implicit in the grammar means that a+b*c can equally well be parsed as (a+b)*c or a+(b*c), so it is ambiguous.
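For comparison, the classic fix is to encode the priority into the grammar itself, along these lines:
<expression> ::= <expression> + <term> | <term> ;
<term> ::= <term> * <factor> | <factor> ;
<factor> ::= identifier ;
With that stratification a+b*c has exactly one parse, a+(b*c).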
-
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit**' as part of a 'hex constant' or to an 'exponent'.
I don't agree. There is no ambiguity IMO if you stick to what the standard defines.
As I said, I hadn't [at the time] divined whether this is standard or compiler implementation and left the 'blame' open, but it's now clear that it's the compiler(s) in question. The cause is almost certainly using an "ambiguous grammar*", and an incorrect one to boot. It's pretty clear, just from the context of the error, that there is, in the compiler's implementation of its own grammar for the language, an ambiguity between recognising an expression with a hex constant on the left and recognising a constant with an exponent part when a hex E immediately precedes the operator.
Oh, if you meant an ambiguity in the grammar the compiler actually uses, then obviously I agree.
I thought you meant this was ambiguous from the standard itself, which I don't agree with... with a "but". I still have to admit that the C grammar, as exposed in the C standard(s) (this hasn't really improved much), is not that clearly defined. I wish they would add some kind of clear BNF, or something in that vein. There isn't, and (at least IMHO) I think the grammar is exposed pretty "loosely" in the standard itself. That should be addressed in future revisions IMO.
That said, I still find the way this compiler implements parsing (thanks for finding the "offending" part in the source code!) a bit questionable. The C grammar is *heavily* context-dependent (which makes parsing it correctly not really a fun experience), and in this example, I do think they don't use the context properly. Parsing the token(s) possibly representing an exponent part independently of the context (which is a hex constant, due to the prefix) is, IMO, a parsing error.

Now to be fair, I admit the parser still has to make a choice regarding the context. When it encounters "0xE+1", did the programmer actually mean "0xE + 1", or was the "0x" actually not intended, and the programmer meant a decimal floating constant instead? Or maybe they meant a hex floating constant, but used "E" instead of "P"? My take on approaching this kind of stuff when writing a parser is pretty simple: 1/ use common sense, 2/ parse it as the "most likely" (here, if "0xE" was written, the probability of meaning anything other than a hex constant is pretty low, unless you want your parser to outsmart the people using it, which never works), and 3/ determine the context.

Finally, the parser should try its best *not* to trigger an *error* if the written code has any possibility of being actually correct. If it encounters something that looks like a typical mistake, it should issue a warning, not throw an error. For instance, when GCC encounters "if (x = y)", it throws a warning, because this is a common source of bugs. This is nice.
But I know this is definitely not easy. C is a bitch to parse, because not only is it heavily context-dependent, but the context is sometimes determined starting from the right first, and sometimes starting from the left (due to the "reversed" nature of many C constructs, such as declarations.) Wirthian languages (Pascal, Modula, Oberon, Ada and many more derivatives) are definitely much, much nicer to parse.
-
I'd disagree slightly, but it's all about semantics (in the everyday sense). C and the whole Wirthian set of languages are equally easy to parse, but C is harder to translate. C was clearly designed with an expectation that it would be parsed by recursive descent; the Wirthian languages were most likely written with the assumption that a parser generator tool would be to hand, or at least more powerful parsing techniques than recursive descent.
When you're trying to do the parsing and the translation in the same pass by recursive descent, it's natural to want to know the type by the time you reach the actual variable declaration when you're doing a left-to-right sweep through the code, so the <typename> <variable list> construction is "more natural". If you're using a shift-reduce parser then it's more natural to do the whole thing the other way around and apply the semantics as you do the final reduce of the declaration construct - so you don't have to carry the semantic information across several layers of the parse tree.
If you're doing parsing and translation in separate passes then it's equally easy or hard to translate either style. The differences between them boil down to what order you do a (sub-)tree walk of the abstract syntax tree.
The moan about the clarity of the formal grammar in the C standard can be expanded, in my opinion, to almost every language standard published. The one clearly notable exception is "The Revised Report on Algol 68", which has the most rigorous grammar ever published for a programming language. Unfortunately the grammar is in Van Wijngaarden Form (W-grammar), so there are only a few thousand people in the whole world capable of reading it, and only a couple of hundred capable of understanding it.
-
I'd disagree slightly, but it's all about semantics (in the everyday sense). C and the whole Wirthian set of languages are equally easy to parse, but C is harder to translate.
That depends on what you include in parsing and what you mean by "translating"; please elaborate, getting into academic semantic subtleties was not the point. To me parsing is the whole thing from tokenization to getting some kind of tree which makes sense of the source code from a syntactic POV.
(As defined on Wikipedia for instance: A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step.)
Again, with C, a recursive descent approach or whatever you choose is NOT sufficient, because of the heavy dependency on context. The whole idea is that the more context-dependent a given grammar is, the more difficult it is to parse. (And obviously the more difficult it is to correctly define the grammar itself to begin with.)
Have you actually written a complete C parser? It IS a bitch. I'll give you just one example: C declarations. Try writing a parser that correctly identifies all parts of a C declaration in the general case, and make it correct. Now do the same with Pascal (or Modula, Oberon...) Tell me what you think. I can guarantee you that C declarations are fun. As I said earlier, correctly parsing a C declaration involves taking the context into account in a non-trivial way. You may think that (as can be read sometimes) C declarations are hard to read for humans, but very easy to parse for a machine. It's not completely false, but a very naive way of seeing it. Take a complex C declaration and see how well your parser does for identifying the correct type and whatever is actually the identifier being declared (a task that is merely syntactic and that I would definitely include in a parser's job even with a strict definition of parsing.) It's impossible to do correctly without properly identifying the context, whereas parsing a - say - Pascal declaration can be done almost context-free.
To play with C declarations there's a fun site: https://cdecl.org/
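For a taste, two classics (the comments give the cdecl-style reading; signal() is from the standard library, and both lines compile as-is at file scope):

/* "declare fp as pointer to function (int) returning pointer to array 10 of int" */
int (*(*fp)(int))[10];

/* "declare signal as function (int, pointer to function (int) returning void)
   returning pointer to function (int) returning void" */
void (*signal(int sig, void (*func)(int)))(int);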
-
I'd disagree slightly, but it's all about semantics (in the everyday sense). C and the whole Wirthian set of languages are equally easy to parse, but C is harder to translate.
That depends on what you include in parsing and what you mean by "translating"; please elaborate, getting into academic semantic subtleties was not the point. To me parsing is the whole thing from tokenization to getting some kind of tree which makes sense of the source code from a syntactic POV.
(As defined on Wikipedia for instance: A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step.)
Again, with C, a recursive descent approach or whatever you choose is NOT sufficient, because of the heavy dependency on context. The whole idea is that the more context-dependent a given grammar is, the more difficult it is to parse. (And obviously the more difficult it is to correctly define the grammar itself to begin with.)
The grammar for C is context-free: it has a context-free grammar (https://en.wikipedia.org/wiki/Context-free_grammar). The semantics are not "context free" (in a more informal sense). If you think the grammar of C is "context-dependent" then you are confusing the grammar of the language with the semantics of the language. Parsing context-sensitive grammars is not merely hard; some problems in parsing context-sensitive grammars are provably undecidable. If C had a context-sensitive grammar, I suspect we would still be waiting for people to write the first ever compiler for it.
Have you actually written a complete C parser? It IS a bitch. I'll give you just one example: C declarations. Try writing a parser that correctly identifies all parts of a C declaration in the general case, and make it correct. Now do the same with Pascal (or Modula, Oberon...) Tell me what you think. I can guarantee you that C declarations are fun. As I said earlier, correctly parsing a C declaration involves taking the context into account in a non-trivial way. You may think that (as can be read sometimes) C declarations are hard to read for humans, but very easy to parse for a machine. It's not completely false, but a very naive way of seeing it. Take a complex C declaration and see how well your parser does for identifying the correct type and whatever is actually the identifier being declared (a task that is merely syntactic and that I would definitely include in a parser's job even with a strict definition of parsing.) It's impossible to do correctly without properly identifying the context, whereas parsing a - say - Pascal declaration can be done almost context-free.
To play with C declarations there's a fun site: https://cdecl.org/
Parsing = syntax and just syntax; all the rest is semantics (in the computer science sense).
void fred()
{
    Marshmallow bert = charlie (7);
}
is syntactically correct C. That will parse completely happily; what it wouldn't do is compile, or indeed pass the subsequent tests for being a valid C program. The parsing is easy; figuring out that the semantics are correct is relatively hard (e.g. that "Marshmallow" ought to be a type and that the type declaration is missing). You can't do it from syntax alone (for CFLs), therefore you can't do it from parsing alone, at least not with any parser that computer science currently knows how to construct automatically [from the grammar of the language]. I may be wrong on the final point about automatic construction; I haven't kept up with the state of the art on producing parsers from 2-level grammars, and informally (i.e. I haven't proven it) a 2-level grammar like VWF could represent the semantics of a C declaration in full. VWF certainly could represent the semantics of Algol 68 and, as we all know, C is partially descended from Algol 68.
At some point in trying to compile that, the compiler has a complete, valid [abstract] syntax tree. Only then can it find the errors by running a semantic pass on the parse tree. Although it is possible to combine the parsing and semantic passes into one, it is not necessary, and personally I dislike that style; it dates from when compilers tried to be single-pass, largely owing to restrictions on program size. It's much cleaner to separate a compiler into lots of passes with clean, well-defined interfaces between them.
Am I being pedantic by treating parsing as just the grammar bit? No, because if there's one field where precision is everything, and being pedantic is virtually a necessity, it's programming. A parser generator like Yacc will deliver you a parser; it won't directly help you with validating the semantics of the language you're trying to parse.
Have I written a complete C parser? Yes, many years ago, as an exercise, for K&R C. Have I written a complete C compiler? No. Have I written complete compilers for other languages? Yes, both for personal fun and commercially, and worked commercially on parts of compilers with other people.
-
Reminds me of that beauty in PHP. Not intentional — it was a bug. But still amazing: Bug #61095 PHP can't add hex numbers (https://bugs.php.net/bug.php?id=61095).
-
Reminds me of that beauty in PHP. Not intentional — it was a bug. But still amazing: Bug #61095 PHP can't add hex numbers (https://bugs.php.net/bug.php?id=61095).
Nice find! Yes it looks sort of similar.
Note: admittedly the OP's case is not as bad, as it spits out an error. Giving a bogus result, as with the PHP bug, is way worse.
-
@Cerebus: I think this is sort of getting fruitless and largely outside of the scope of this thread.
I'll just once again make my point: to me parsing = syntax analysis. That's again the general definition: https://en.wikipedia.org/wiki/Parsing
Syntax analysis isn't just about checking the syntax is correct - it's also about splitting a given text into basic elements, each with an identified function. Nothing more, agreed, but nothing less. If you think it's not what parsing is about, then fine, we just don't agree.
I gave the C declaration example as it's notoriously tricky to handle properly. As I said: take any C declaration. Now automatically analyze it so that you get the various elements of the declaration and their function. In particular, identify what is actually the identifier being declared (I'm of course not talking about checking whether the identifiers used in the declaration refer to existing ones). Would you consider that not part of syntax analysis? I do. It's easy with ultra-basic examples, but with complex ones, it's definitely not.
Now if, for this example, you're still certain that identifying the identifier in a declaration is not part of syntax analysis, then we won't agree, and that's OK. You just won't convince me. If you agree, I'm just saying that even this basic task IS difficult to implement properly for C, and much more so than for a Wirthian language. For examples of difficult-to-handle C declarations, I gave a web site, but you can find myriads all over the place, including in source code for actual projects.
Back to the OP's topic: to *me*, "0xE+1" interpreted as a single constant (which is then flagged as having a syntax error - correct, if it is considered a single constant) IS a parsing error, as I consider correctly identifying the "function" of each separate text element to be part of parsing. The simple fact that the compiler spits out what clearly falls in the "syntax error" category makes it hard not to say it's related to the parsing itself. Now we can nitpick to no end if anyone feels like it.
As to the fact C has a context-free grammar or not, here is an article I personally second: https://eli.thegreenplace.net/2007/11/24/the-context-sensitivity-of-cs-grammar
Seems to be a nice source of debate and nitpicking overall!
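The classic illustration from that article, as a small sketch (hypothetical one-letter names, my own working in the comments):

typedef int T;  /* whether "T * p" below declares or multiplies hinges on this line */

int main(void) {
    int x = 42;
    T * p = &x;  /* with the typedef in scope, this DECLARES p as pointer to int */
    /* if T were instead a variable (say "int T;"), the same token sequence
       "T * p" would have to be parsed as a multiplication expression -
       the parser must consult the symbol table to decide */
    return *p - 42;  /* exit status 0 */
}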
So anyway, I still think it would be worth reporting this to the GCC team. At least see how *they* consider this issue. (And for people tempted to reply that GCC devs are dumbfucks that never listen to bug reports: I remember a thread a few months ago when I found an alignment bug in GCC for AArch64 targets; many told me the GCC team would reject it, and it turned out they actually took it into account and fixed it. So I'm not too worried, unless they can come up with a good rationale.)
-
@Cerebus: I think this is sort of getting fruitless and largely outside of the scope of this thread.
Agreed.
I'll just sign off by saying that I think you're still conflating syntax analysis - "this is an identifier, the string for which is 'fred', and it's part of a nonterminal arbitrarily called 'declaration'" - (which is most definitely 'parsing', determining language sentence structure) with semantic analysis - "this is an integer declaration of 'fred'" (determining language meaning, which is not part of parsing proper).
Anyway, just at this precise moment I don't care what anybody thinks of anything - I have a hot BLT sandwich in my other hand and all is right with the world. :)
-
C was clearly designed with an expectation that it would be parsed by recursive descent; the Wirthian languages were most likely written with the assumption that a parser generator tool would be to hand, or at least more powerful parsing techniques than recursive descent.
That's exactly backwards.
Wirthian languages are designed to be parsed by very simple recursive-descent parsers; C assumes the use of a table-driven LALR parser.
-
C assumes the use of a table-driven LALR parser.
Which is rooted in C's origins, specifically lex (https://en.wikipedia.org/wiki/Lex_programming_tool) and yacc (https://en.wikipedia.org/wiki/Yacc). Even today, a lot of C compiler/interpreter projects use flex (https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)) and bison (https://en.wikipedia.org/wiki/GNU_Bison), so one could say LALR parsers are kinda-sorta built-in to the C ecosystem; it is therefore not surprising that the standard (implicitly?) assumes one is used to parse C itself.
-
C assumes the use of a table-driven LALR parser.
Which is rooted in C's origins, specifically lex (https://en.wikipedia.org/wiki/Lex_programming_tool) and yacc (https://en.wikipedia.org/wiki/Yacc). Even today, a lot of C compiler/interpreter projects use flex (https://en.wikipedia.org/wiki/Flex_(lexical_analyser_generator)) and bison (https://en.wikipedia.org/wiki/GNU_Bison), so one could say LALR parsers are kinda-sorta built-in to the C ecosystem; it is therefore not surprising that the standard (implicitly?) assumes one is used to parse C itself.
An understandable mistake because of the association of them all with Unix, but there exists here (Ritchie's description) (https://www.bell-labs.com/usr/dmr/www/primevalC.html) and here (just source) (https://github.com/mortdeus/legacy-cc) a "primeval" C compiler from circa 1972, the earliest working C compiler to be preserved, written by Dennis Ritchie, with a hand-written parser and no trace of lex and yacc anywhere in it. Lex was written by Mike Lesk circa 1975, yacc by Johnson also circa 1975. Lex and Yacc were used for pcc, the portable C compiler, but that came later. The original version of yacc was written in B, only later to be translated to C, so it was available to be used in bootstrapping a C compiler if required, but wasn't.
(Frank Deremer wrote the paper "Practical Translators for LR(k) Languages" in 1969 while working on Project MAC. Also working on Project MAC at the same time were all the key people from Bell Labs who came to be the Unix team, so they ought to have been aware of the possibilities, but didn't actually come to exploit them until later. Strangely, it wasn't that association, but a conversation with Alfred Aho* (https://en.wikipedia.org/wiki/Stephen_C._Johnson#Bell_Labs_and_AT&T), also at Bell Labs, suggesting that Johnson look at Knuth's work on LR grammars, that led Johnson to use LR grammars as the basis for yacc.)
The nearest to a canonical account of the development of C that exists can be found here: The Development of the C Language - Dennis M. Ritchie (http://heim.ifi.uio.no/inf2270/programmer/historien-om-C.pdf)
*Contrary to your personal Finnish expectations he pronounces it "Ay-Ho", I know this because we once corresponded by email** and I asked him explicitly.
**Gone are the days when a stranger on the Internet could get the email address of a person who is one of "the names" in his field, write to them, and actually get a reply; not merely that, but have a civilised conversation about that field with them.
-
An understandable mistake because of the association of them all with Unix, but there exists here (Ritchie's description) (https://www.bell-labs.com/usr/dmr/www/primevalC.html) and here (just source) (https://github.com/mortdeus/legacy-cc) a "primeval" C compiler from circa 1972, the earliest working C compiler to be preserved, written by Dennis Ritchie, with a hand-written parser and no trace of lex and yacc anywhere in it. Lex was written by Mike Lesk circa 1975, yacc by Johnson also circa 1975. Lex and Yacc were used for pcc, the portable C compiler, but that came later. The original version of yacc was written in B, only later to be translated to C, so it was available to be used in bootstrapping a C compiler if required, but wasn't.
Right. However, when C started gaining traction, lex and yacc were commonly used to implement C compilers, which affected how the C standard (ANSI C) came to be. This was the "origin" I was referring to.
*Contrary to your personal Finnish expectations he pronounces it "Ay-Ho", I know this because we once corresponded by email** and I asked him explicitly.
Contrary to your personal British supremacist expectations, I don't expect people with Finnish surnames and origin, but a different native language, to pronounce their names like Finns do. Most Finns don't. They often say how that name would be pronounced if the person was Finnish, but do not expect them to pronounce it that way. Weird, eh?
-
*Contrary to your personal Finnish expectations he pronounces it "Ay-Ho", I know this because we once corresponded by email** and I asked him explicitly.
Contrary to your personal British supremacist expectations, I don't expect people with Finnish surnames and origin, but a different native language, to pronounce their names like Finns do. Most Finns don't. They often say how that name would be pronounced if the person was Finnish, but do not expect them to pronounce it that way. Weird, eh?
<Sarcasm on>Well I am so sorry for taking your perspective into consideration.<Sarcasm off> Generally, once one gets past the "angry potato" facade you come across as thoughtful, but you can be a real dick at times. Having worked with a fair number of Finns (I used to be operations manager for the UK arm of Saunalahti/Jippi), when I see "Aho" I expect the Finnish pronunciation - is it so foolish of me to expect a Finn would too? And how the hell do you get to "British supremacist expectations" from me expecting a Finnish name to be pronounced in the Finnish fashion - surely "British supremacist expectations" would be "You'll bloody well pronounce it the way we do, foreign scum!" rather than the other way around. Is that a chip I see on your shoulder?
Now, as you obviously want me to be a condescending Brit: Chip on one's Shoulder - "To have a chip on one's shoulder refers to the act of holding a grudge or grievance that readily provokes disputation." (https://en.wikipedia.org/wiki/Chip_on_shoulder). There, feel better now? Prejudices suitably reinforced?
-
when I see "Aho" I expect the Finnish pronunciation - is it so foolish of me to expect a Finn would too? And how the hell do you get to "British supremacist expectations"
Exactly that. It's not foolish, just completely wrong. I believe you think so because English is the lingua franca on the internet, and you naturally assume it extends to culture (general individual behaviour) as well. It does not.
There are threads on this forum that deal explicitly with how something is pronounced in English, and the one constant in those threads is the various English-speaking people claiming theirs is the right way, and everybody else is wrong. Jokingly in most instances, but still.
In Finland, kids learn English in school. They are taught to emulate British or American English pronunciation. It is so pervasive that many Finns who can speak Rally English just fine (think of things like directions to the nearest post office/store/pharmacy/bus stop) refuse to speak a word, because they're afraid of being laughed at for their accent. This is more common among the elderly. Indeed, when Finns pick nicknames used in completely Finnish environments (as in Finnish web sites), they usually pick an English one; the younger the person, the more often!
It took me over thirty years to get over that myself, and realize that Rally English is easier to understand than the mangled emulation of American or British English I was taught to go for.
Now, this has nothing to do with Finnish per se, as in "Finnish is so different"; no. It is just that the pervasiveness of English has masked the variety in cultures, so much so that people tend to assume cultures don't differ that much – especially among those who speak English. Yet, people differ, a lot. (Even in Finland, you have a quite noticeable cultural difference between the "metropolitan" urban Finns and the Finns in smaller towns and rural areas. So much so, that it is becoming a political problem – as in, one side claiming the other must move to cities, because their desire to live outside urban environments is not only unreasonable and unfathomable, but also too costly.)
Chip on one's Shoulder
Yes, and it's a bad one (I mean literally, something I wish I didn't have): I vastly overreact when somebody claims I did or thought something, when I definitely did not. :--
I used "supremacist" only to get a rise out of you, because "stuck up brit" is one of those stereotypes you often see in media, but in real life, is damned rare. (The ones I've met personally are definitely not, and calling them one would hurt them.)
I don't actually believe you are, if it matters.
If you knew your Finnish colleagues well enough, you'd realize that deep down, if they were told to welcome somebody named Alfred Aho on a tour of the company (or something similar), they'd worry most about their own pronunciation of Aho, and whether it would offend the prof or not. The concept of "but pronouncing it like a Finnish name is the correct one" would never occur to a typical Finn.
In some ways, it is funny (because it is so different to major cultures); in some ways, it is sad, because I actually think it is a lovely quirk. But I'm biased, of course.
-
when I see "Aho" I expect the Finnish pronunciation - is it so foolish of me to expect a Finn would too? And how the hell do you get to "British supremacist expectations"
Exactly that. It's not foolish, just completely wrong. I believe you think so because English is the lingua franca on the internet, and you naturally assume it extends to culture (general individual behaviour) as well. It does not.
I get what you're saying, overall. But that bit above still seems arse about face - the "everybody speaks English" assumption, were one to apply Anglophone cultural imperialism, would be to assume that everybody would use the anglicised pronunciations (e.g. Aho => Ay-Ho, Braun => Brawn, Porsche => Poorsh - score one for the Americans, who actually pronounce it like Ferdinand Porsche did), not the other way around, making the effort to get the native pronunciation right.
The assumption that a native of a given language would pronounce a name from that language in the native fashion isn't some form of British cultural imperialism born out of the [typical] British inability to get 'foreign' names pronounced properly and therefore an assumption that the rest of the world would be equally crap at it, it's just a reasonable working assumption that a speaker of any language would make about natives of another country and its language.
-
when I see "Aho" I expect the Finnish pronunciation - is it so foolish of me to expect a Finn would too? And how the hell do you get to "British supremacist expectations"
Exactly that. It's not foolish, just completely wrong. I believe you think so because English is the lingua franca on the internet, and you naturally assume it extends to culture (general individual behaviour) as well. It does not.
I get what you're saying, overall. But that bit above still seems arse about face - the "everybody speaks English" assumption, were one to apply Anglophone cultural imperialism, would be to assume that everybody would use the anglicised pronunciations (e.g. Aho => Ay-Ho, Braun => Brawn, Porsche => Poorsh - score one for the Americans, who actually pronounce it like Ferdinand Porsche did), not the other way around, making the effort to get the native pronunciation right.
The assumption that a native of a given language would pronounce a name from that language in the native fashion isn't some form of British cultural imperialism born out of the [typical] British inability to get 'foreign' names pronounced properly and therefore an assumption that the rest of the world would be equally crap at it, it's just a reasonable working assumption that a speaker of any language would make about natives of another country and its language.
I think that is plausible, although I did not see it that way. I rather saw it as "because you speak my native language, you obviously react to things just like I do". (As to arse about face, yeah, that happens to me often, even sober. The Finnish term for rock solid drunk is ass-over-the-shoulder, but I don't drink much nowadays.)
Consider it an overblown reaction, then; and that I wouldn't, nor would the majority of Finns I know, assume that a Canadian surname Aho should be pronounced Aho, or is "properly" pronounced Aho, just because the person has Finnish roots.
-
Consider it an overblown reaction, then; and that I wouldn't, nor would the majority of Finns I know, assume that a Canadian surname Aho should be pronounced Aho, or is "properly" pronounced Aho, just because the person has Finnish roots.
So noted. Alfred did give me quite a good potted history of the Aho family. From memory, so this may not be accurate (the emails are lost to about 3 mail server upgrades since), the family originally settled in Minnesota USA.
-
For those still playing at home...
When MAXIMAL MUNCH attacks. "Due to maximal munch, hexadecimal integer literals ending in e and E, when followed by the operators + or -, must be separated from the operator with whitespace or parentheses in the source"
https://en.cppreference.com/w/cpp/language/translation_phases#maximal_munch
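In code, the workarounds the quote describes (a sketch; the commented-out line is the ill-formed one):

#include <stdio.h>

int main(void) {
    int a = 0xE + 1;      /* whitespace ends the pp-number after 0xE */
    int b = (0xE)+1;      /* a parenthesis ends it too */
    /* int c = 0xE+1; */  /* ill-formed: maximal munch lexes "0xE+1" as one pp-number */
    printf("%d %d\n", a, b);  /* prints: 15 15 */
    return 0;
}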
-
Consider it an overblown reaction, then; and that I wouldn't, nor would the majority of Finns I know, assume that a Canadian surname Aho should be pronounced Aho, or is "properly" pronounced Aho, just because the person has Finnish roots.
So noted. Alfred did give me quite a good potted history of the Aho family. From memory, so this may not be accurate (the emails are lost to about 3 mail server upgrades since), the family originally settled in Minnesota USA.
Makes sense; quite a lot of Finns emigrated to the Minnesota Iron Range (https://www.loc.gov/rr/european/FinnsAmer/finchro.html) to become miners there. They were even recruited to emigrate from northern Norway by Michigan copper mining companies. Between 1870 and 1920, approximately 320,000 Finns emigrated to the United States, many working in mining and logging, the rest in farming. About 16% of people in the Upper Peninsula of Michigan have Finnish ancestry.
Of the Finnish mindset, a good example is Saint Urho's Day (https://en.wikipedia.org/wiki/Saint_Urho), a completely fictitious character, to be celebrated the day before St. Patrick's Day, to extend the festivities (invented by the emigrants, not celebrated in Finland). The closest Finns have to a saint is Lalli (https://en.wikipedia.org/wiki/Lalli), an anti-hero of sorts, who killed Bishop Henry in 1156.
When MAXIMAL MUNCH attacks. "Due to maximal munch, hexadecimal integer literals ending in e and E, when followed by the operators + or -, must be separated from the operator with whitespace or parentheses in the source"
I do believe that is a direct result of the typical lexical rules used with e.g. lex/flex when separating the input into preprocessing tokens.
-
Can you see what is wrong with this, without compiling it?
#include <stdio.h>
int main(int argc,char *argv[]) {
printf("The next number after 0xE is 0x%X\n", 0xE+1);
}
No return value! ;D
-
No return value! ;D
While this is a code smell, in C this is a well-defined function returning 0. main is a very special beast in C (and even weirder in C++). If main returns int, reaching the terminating } is equivalent to calling exit(0).(1)
____
(1) 5.1.2.2.3§1
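A minimal demonstration (the shell session is an assumption; any POSIX-ish shell will do):

#include <stdio.h>

int main(void) {
    printf("hello\n");
    /* no return statement: under C99 5.1.2.2.3, reaching the closing
       brace here is equivalent to return 0; */
}

/* Checking the exit status:
   $ cc demo.c && ./a.out; echo $?
   hello
   0
*/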
-
No return value! ;D
While this is a code smell, in C this is a well-defined function returning 0. main is a very special beast in C (and even weirder in C++). If main returns int, reaching the terminating } is equivalent to calling exit(0).(1)
____
(1) 5.1.2.2.3§1
As of C99 this is true. Previously having no specified return value would give an undefined termination status.
-
If we are talking about such archaic versions, this is even more complicated. While 9899:1990 indeed makes the value undefined(1), Ritchie's reference from 1975 has no return in main (p. 20). :D
____
(1) At least the final draft. I do not have access to the final 1990 publication.