Author Topic: C programming puzzle (Read 8740 times)

hamster_nz · « **on:** October 20, 2020, 02:54:31 am »

Can you see what is wrong with this, without compiling it?

#include <stdio.h>

int main(int argc,char *argv[]) {
  printf("The next number after 0xE is 0x%X\n", 0xE+1);
}

oPossum · « **Reply #1 on:** October 20, 2020, 03:06:28 am »

It may think E+1 is the exponent for a float (that isn't present)

ivaylo · « **Reply #2 on:** October 20, 2020, 04:58:43 am »

Just compile it.

Siwastaja · « **Reply #3 on:** October 20, 2020, 02:38:57 pm »

Quote from: ivaylo on October 20, 2020, 04:58:43 am

Just compile it.

Indeed!

It's interesting because it's working as expected for any other hex value except 0xE. 0xF+1, or 0xA+1, for example, work as expected.

But the letter E is special, meaning that the compiler seems to be parsing it from the right, detecting "E<+|-><number>" as a token (I'm assuming it's for floating point exponent), then parsing from the left to find 0x denoting integer constant (of hex representation), then erroring out because these two things can not coexist.

The takeaway seems to be, whitespace should be used to separate arithmetic operators from the operands, even if not doing that works out 99.99% of the time.

SiliconWizard · « **Reply #4 on:** October 20, 2020, 02:56:44 pm »

This looks odd at first, and I had to take the standard out to try and figure out if this is just a compiler's specific implementation (I checked only with GCC) or if it made sense from the standard itself.
So let's see in the "Constants" section.

I think this behavior is explained by something I more or less knew about but have NEVER used myself (and, I think, never seen in actual code, YMMV): hexadecimal ﬂoating constants.
Since the token begins with "0x", you'd expect the compiler to expect a hexadecimal constant following it, and the parser most likely exactly does that. But, it doesn't know what to expect yet: it could be an integer hex constant OR a "floating" hex constant.

As per the standard, "hexadecimal ﬂoating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent be indicated with an "E" letter also in this context, whereas it's obviously also an hex digit. It's a recipe for potential parsing issues. As you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.

0xdeadbeef · « **Reply #5 on:** October 20, 2020, 03:22:47 pm »

I'm too lazy to dig through the C ISO/IEC standard, but the problem seems to be that it actually allows hexadecimal float constants (and thus also exponents).
I wasn't even aware that there's also the character "p" to define the precision:
From the ISO/IEC 1999:

Code: [Select]

FLT_MIN 0X1P-126F // hex constant

Kjelt · « **Reply #6 on:** October 20, 2020, 06:12:58 pm »

It is compiler or C standard dependend, on Visual Studio 2017 it just works.

boz · « **Reply #7 on:** October 20, 2020, 06:47:08 pm »

looks ok and works on my compiler but I would normally add a leading zero when doing hexidecimal eg 0x0e

Rick Law · « **Reply #8 on:** October 20, 2020, 06:50:52 pm »

Quote from: Kjelt on October 20, 2020, 06:12:58 pm

It is compiler or C standard dependend, on Visual Studio 2017 it just works.

Exactly! it is compiler or C standard depended. Even within vendor, Compiler versions and standards evolves.

Even spoken language evolves. Fluke Corporation first introduced "scopemeter" in 1987. Had this forum existed before then, a Fluke would likely be a reference to a blood worm, that is what the word fluke means. But Fluke here now probably refers to a DMM instead of blood worm.

IanB · « **Reply #9 on:** October 20, 2020, 06:54:26 pm »

Quote from: SiliconWizard on October 20, 2020, 02:56:44 pm

As per the standard, "hexadecimal ﬂoating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent be indicated with an "E" letter also in this context, whereas it's obviously also an hex digit. It's a recipe for potential parsing issues. As you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.

According to the information below, the exponent should be indicated with the letter "p" or "P", not "E":

http://books.gigatux.nl/mirror/cinanutshell/0596006977/cinanut-CHP-3-SECT-2.html#:~:text=A%20hexadecimal%20floating%2Dpoint%20constant,the%20letter%20p%20or%20P.

So perhaps this could be a compiler bug?

SiliconWizard · « **Reply #10 on:** October 20, 2020, 07:24:23 pm »

Quote from: IanB on October 20, 2020, 06:54:26 pm

Quote from: SiliconWizard on October 20, 2020, 02:56:44 pm
As per the standard, "hexadecimal ﬂoating constants" can include, you guessed it, an exponent part. It's unfortunate that this exponent be indicated with an "E" letter also in this context, whereas it's obviously also an hex digit. It's a recipe for potential parsing issues. As you just saw. I'm sure we can figure out tons of other weird hex constants on this basis.

According to the information below, the exponent should be indicated with the letter "p" or "P", not "E":

http://books.gigatux.nl/mirror/cinanutshell/0596006977/cinanut-CHP-3-SECT-2.html#:~:text=A%20hexadecimal%20floating%2Dpoint%20constant,the%20letter%20p%20or%20P.

So perhaps this could be a compiler bug?

You're right! Still not familiar with those hex floating constants. I re-read the standard more closely.

The "exponent" part of "hexadecimal-ﬂoating-constant" is "binary-exponent-part", and not just "exponent-part" (as I assumed), and the "binary-exponent-part" starts with a 'P' or 'p' instead of 'E' or 'e'.
(Cf. "6.4.4.2 Floating constants" of C99)

So yeah. Definitely looks like a compiler bug. Since it only seems to appear with GCC (I tried latest GCC 10.2), this would warrant a bug report IMO.

SiliconWizard · « **Reply #11 on:** October 20, 2020, 07:25:54 pm »

Quote from: boz on October 20, 2020, 06:47:08 pm

looks ok and works on my compiler but I would normally add a leading zero when doing hexidecimal eg 0x0e

This is just a matter of style. This doesn't change anything, and even with a leading zero, with GCC, the same bug is triggered (I tried.)

ve7xen · « **Reply #12 on:** October 20, 2020, 11:52:56 pm »

Interesting! Never knew about the float literals either.

What version of GCC are y'all finding this compiles in? The oldest version I have on my box is 4.7.3 and it fails with (as does the latest I have, 8.2.0):

Code: [Select]

test.c: In function ‘main’:
test.c:4:49: error: invalid suffix "+1" on integer constant

Same thing with clang 9 and 10.

I checked on Godbolt and I can't get any of the gcc versions available there to compile this code (including 10.2 and the oldest 4.1.2). MSVC compiles it, but uses 15 as 'expected'. Old versions of ICC do the same; newer ones don't compile.

Kjelt · « **Reply #13 on:** October 21, 2020, 06:27:31 am »

Does it still fail if you write it as 0xE + 1 ?
As normal coding guidelines usually require to add spaces between operands ?

Syntax Error · « **Reply #14 on:** October 21, 2020, 11:02:43 am »

I would define any 8 bit byte as 0x0e and printf as %02x to zero fill.

Code: [Select]

printf("The next number after 0x0E is %02X\n", ( 0x0e + 1 ));
Or your code normalised...

Code: [Select]

printf("The next number after 0x0E is 0x0F\n");

Cerebus · « **Reply #15 on:** October 21, 2020, 12:44:03 pm »

It's a possible bug in the language definition or it's a bug in the compiler implementation. Well, not a 'bug' strictly speaking, but poor specification in the light of knowing what grammars would be used to put a compiler for the language together. Specifically, the grammar looks to be ambiguous - something that a "Context Free Grammar^*" cannot parse. I'd have to dig into the real details of the standard to be sure that the specification is at fault, rather than the implementation(s). If the standard disambiguates this, then it's the implementation(s).

There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit^**' as part of a 'hex constant' or to an 'exponent'. There are two possible parsings of that expression, the compiler has to decide what it believes the "E" represents, and to break the ambiguity (for most common parsing algorithms) the compiler writer has to indulge in a trick to remove the ambiguity by prioritising one interpretation over the other. However, if you include the space "0xE +1" there's no longer an ambiguity because the space implies the end of the 'hex constant'. There's an oddity here: in theory spaces are not significant in C expressions outside of string constants, however the parser still has to deal with their presence, so although not 'significant' a space can, by not being an allowed part of something, indicate that the 'something' has finished and the parser can treat it as recognised. So in practice spaces are significant in some contexts.

^* A term of art from linguistics and the branch of mathematics/computer science that deals with parsing languages.
^**I'm just making up the names for the terminals and non-terminals of the grammar - these aren't the official ones used in any of the C standards.

tggzzz · « **Reply #16 on:** October 21, 2020, 12:47:23 pm »

Next question...

Given int x = 011+1, what is the value of x?

IanB · « **Reply #17 on:** October 21, 2020, 02:09:48 pm »

Quote from: tggzzz on October 21, 2020, 12:47:23 pm

Next question...

Given int x = 011+1, what is the value of x?

x = 10 ?

Cerebus · « **Reply #18 on:** October 21, 2020, 03:17:06 pm »

Quote from: tggzzz on October 21, 2020, 12:47:23 pm

Next question...

Given int x = 011+1, what is the value of x?

10 or 012 depending on your mood, or even 0xA.

SiliconWizard · « **Reply #19 on:** October 21, 2020, 04:34:42 pm »

Quote from: Cerebus on October 21, 2020, 12:44:03 pm

There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit^**' as part of a 'hex constant' or to an 'exponent'.

I don't agree. There is no ambiguity IMO if you stick to what the standard defines.

Any constant starting with "0x" or "0X" IS to be parsed, by definition, as a hex constant. Period. "E", in the context of a hex constant token, can't be anything else than a hex digit (since the correct exponent in this context would stard with a "P"). So I can't see this behavior as anything else than a parser bug.

Parsers are not easy to implement though. I've written C (and other languages) parsers before, so I do now how hairy that can get. (Never dealt with those hex floating consts though, so that didn't ring a bell.) I see this one example as a "tokenization" issue.

A basic tokenizer may first get "0xE+1" (with no whitespace) as a single token. Now extracting tokens just based on whitespace is a first step, but far from sufficient, obviously. There needs to be additional tokenization steps. And that's where things get hairy, and with MANY particular cases. Other tokenizers may, OTOH, split the above as "0xE", "+", and "1". In this case, for actual exponents (correct ones, such as "1E+1"), the parser would need to aggregate the tokens based on context.

In the OP's case, I'm guessing some parsers (such as GCC's one) may have just ONE case for handling the tokenization for the exponent part of constants, and so detects the sequence "E+..." as an exponent part, whereas it clearly can't be an exponent part in the context of a hex floating constant! This bug is probably a long-standing one, dating back to the first introduction of hex floating constants, so when they started implementing C99.

I can't see the ambiguity from the grammar's POV. But I can definitely see how that would make parsers harder to implement.

IanB · « **Reply #20 on:** October 21, 2020, 06:02:00 pm »

One point not fully addressed in this thread is the difference between C and C++ languages and standards. Just because something is in the C99 standard doesn't automatically mean it will be supported by a C++ compiler. It may be excluded, or it may require a special option to enable it.

So in this thread, it is important to state and verify whether the test is being performed with C++ or with C. The result could be different. What works in one could be a syntax error in the other.

Cerebus · « **Reply #21 on:** October 21, 2020, 06:50:04 pm »

Quote from: SiliconWizard on October 21, 2020, 04:34:42 pm

Quote from: Cerebus on October 21, 2020, 12:44:03 pm
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit^**' as part of a 'hex constant' or to an 'exponent'.

I don't agree. There is no ambiguity IMO if you stick to what the standard defines.

As I said, I hadn't [at the time] divined whether this is standard or compiler implementation and left the 'blame' open, but it's now clear that it's the compiler(s) in question. The cause is almost certainly using an "ambiguous grammar^*", and an incorrect one to boot. It's pretty clear, just from the context of the error, that there is in the compiler's implementation of it's own grammar for the language an ambiguity between recognising an expression with a hex constant on the left and recognising a constant with an exponent part when a hex E immediately precedes the operator.

It is rare, in most production compilers, to use the grammar exactly as published for any given programming language, usually to accommodate some sort of syntax directed translation scheme/translation., or because you want to use an LL(1) grammar and the published on is LALR(k) or similar.

Just had a quick look at the source for GCC 10.2.0 and 4.1.2. No chance of tracking it down in short order but I thought I'd have a look just in case it was easy to find. GCC was never easy to find your way around the insides of and I'm afraid that it hasn't got any easier some I last dived into the bowels if it (which would have been some flavour of 4.x).

Amazingly I just partially tracked it down. In expr.c when it has determined it has an integer constant it calls interpret_int_suffix. interpret_int_suffix looks ahead in the token, if the token has ended, it returns, if the token hasn't ended and finds any of the u/U/l/L etc suffices it sets things up appropriately, if it finds anything else it returns an error and spits out the "invalid suffix \"%.*s\" on integer constant". The problem here is that an earlier stage has tokenized the "+" into the lexeme instead of treating it as an operator token in its own right - haven't tracked that down. There isn't an explicit written grammar here (as there would have been if this was the classic lex/yacc combo) but this genuinely is an ambiguous grammar (otherwise it wouldn't get the version without a E right).

^*Again, a term of art. The grammar:

<expression> ::= <expression> + <expression> | <expression> * <expression> | identifier ;

is ambiguous. This variant is a classic, the absence of any priority implicit in the grammar means that a+b*c can equally well be be parsed as (a+b)*c or a+(b*c) so it is ambiguous.

SiliconWizard · « **Reply #22 on:** October 22, 2020, 01:25:31 am »

Quote from: Cerebus on October 21, 2020, 06:50:04 pm

Quote from: SiliconWizard on October 21, 2020, 04:34:42 pm
Quote from: Cerebus on October 21, 2020, 12:44:03 pm
There is an ambiguity. When parsing "0xE+1" there's an ambiguity as to whether the "E" belongs to a 'hex digit^**' as part of a 'hex constant' or to an 'exponent'.

I don't agree. There is no ambiguity IMO if you stick to what the standard defines.

As I said, I hadn't [at the time] divined whether this is standard or compiler implementation and left the 'blame' open, but it's now clear that it's the compiler(s) in question. The cause is almost certainly using an "ambiguous grammar^*", and an incorrect one to boot. It's pretty clear, just from the context of the error, that there is in the compiler's implementation of it's own grammar for the language an ambiguity between recognising an expression with a hex constant on the left and recognising a constant with an exponent part when a hex E immediately precedes the operator.

Oh, if you meant an ambiguity in the grammar the compiler actually uses, then obviously I agree.

I thought you meant this was ambiguous from the standard itself, which I don't agree with... with a "but". I still have to admit that the C grammar, as exposed in the C standard(s) (this hasn't really improved much), is not that clearly defined. I wish they would add some kind of clear BNF, or something in that vein. There isn't, and (at least IMHO) I think the grammar is exposed pretty "loosely" in the standard itself. That should be addressed in future revisions IMO.

That said, I still see the way this compiler implements parsing (thanks for finding the "offending" part in the source code!) a bit questionable. The C grammar is *heavily* context-dependent (which makes parsing it correctly not really a fun experience), and in this example, I do think they don't use the context properly. Parsing the token(s) possibly representing an exponent part independently of the context (which is a hex constant, due to the prefix) is, IMO, a parsing error. Now to be fair, I admit the parser still has to make a choice regarding the context. When it encounters "0xE+1", did the programmer actually mean "0xE + 1", or was the "0x" actually not intended, and the programmer meant a decimal floating constant instead? Or maybe they meant a hex floating constant, but used "E" instead of "P"? My take on approaching this kind of stuff when writing a parser is pretty simple: 1/ use common sense, 2/ parse it as the "most likely" (here, if "0xE" was written, the probability of meaning anything else than a hex constant is pretty low, unless you want your parser to outsmart people using it, which never works) and finally 3/ determine the context. Finally, the parser should try its best *not* to trigger an *error* if the written code has any possibility of being actually correct. If it encounters something that may look like a typical possible mistake, it should then issue a warning, but not throw an error. Such as when GCC encounters "if (x = y)", it will throw a warning, because this is a common source of bug. This is nice.

But I know this is definitely not easy. C is a bitch to parse, because not only is it heavily context-dependent, but the context is sometimes determined starting from the right first, and sometimes starting from the left (due to the "reversed" nature of many C constructs, such as declarations.) Wirthian languages (Pascal, Modula, Oberon, Ada and many more derivatives) are definitely much, much nicer to parse.

Cerebus · « **Reply #23 on:** October 22, 2020, 02:22:16 pm »

I'd disagree slightly, but it's all about semantics (in the everyday sense). C and the whole Wurthian set of languages are equally easy to parse, but C is harder to translate. C was clearly designed with an expectation that it would be parsed by recursive descent, the Wurthian languages were most likely written with the assumption that a parser generator tool would be to hand, or at least more powerful parsing techniques than recursive descent.

When you're trying to do the parsing and the translation in the same pass by recursive descent, it's natural to want to know the type by the time you reach the actual variable declaration when you're doing a left to right sweep through the code, so the <typename> <variable list> construction is "more natural". If you're using a shift-reduce parser then it's more natural to do the whole thing the other way around, apply the semantics as you do the final reduce of the declaration construct - so you don't have carry the semantic information across several layers of the parse tree.

If you're doing parsing and translation in separate passes then it's equally easy or hard to translate either style. The differences between them boil down to what order you do a (sub-)tree walk of the abstract syntax tree.

The moan about the clarity of the formal grammar in the C standard can be expanded, in my opinion, to almost every language standard published. The one clearly notable exception is "The Revised Report on Algol 68", which has the most rigorous grammar ever published for a programming language. Unfortunately the grammar is in Van Wijngaarden Form (W-grammar) so there only a few thousand people in the whole world capable or reading it, and only a couple of hundred people in the world capable of understanding it.

SiliconWizard · « **Reply #24 on:** October 22, 2020, 03:40:44 pm »

Quote from: Cerebus on October 22, 2020, 02:22:16 pm

I'd disagree slightly, but it's all about semantics (in the everyday sense). C and the whole Wurthian set of languages are equally easy to parse, but C is harder to translate.

Whatever you include in parsing and what you mean by "translating". Please elaborate, getting into academic semantic subtleties was not the point. To me parsing is the whole thing from tokenization to getting some kind of tree which makes sense of the source code from a syntactic POV.
(As defined on Wikipedia for instance:

Quote

A parser is a software component that takes input data (frequently text) and builds a data structure – often some kind of parse tree, abstract syntax tree or other hierarchical structure, giving a structural representation of the input while checking for correct syntax. The parsing may be preceded or followed by other steps, or these may be combined into a single step.

)

Again, with C, a recursive descent approach or whatever you choose is NOT sufficient, because of the heavy dependency on context. The whole idea is that the more context-dependent a given grammar is, the more difficult it is to parse. (And obviously the more difficult it is to correctly define the grammar itself to begin with.)

Have you actually written a complete C parser? It IS a bitch. I'll give you just one example: C declarations. Try writing a parser that correctly identifies all parts of a C declaration in the general case, and make it correct. Now do the same with Pascal (or Modula, Oberon...) Tell me what you think. I can guarantee you that C declarations are fun. As I said earlier, correctly parsing a C declaration involves taking the context into account in a non-trivial way. You may think that (as can be read sometimes) C declarations are hard to read for humans, but very easy to parse for a machine. It's not completely false, but a very naive way of seeing it. Take a complex C declaration and see how well your parser does for identifying the correct type and whatever is actually the identifier being declared (a task that is merely syntactic and that I would definitely include in a parser's job even with a strict definition of parsing.) It's impossible to do correctly without properly identifying the context, whereas parsing a - say - Pascal declaration can be done almost context-free.

To play with C declarations there's a fun site: https://cdecl.org/


EEVblog Main Site	EEVblog on Youtube	EEVblog on Twitter	EEVblog on Facebook	EEVblog on Odysee

Author Topic: C programming puzzle (Read 8740 times)

Share me