I notice you assume each token is separated by whitespace; how sure are you about that? I would not expect that to always be the case.
Yes, I do. I haven't seen a G-code variant that doesn't require this, but maybe there is one that does not.
I asked because the "optional skip operator", /, is optionally followed by a digit, and the examples on the web do not have a space between it and the following token. Then again, I'm not exactly sure whether it should skip just that single token or the entire line.
If the whitespace between tokens is optional, then you have quite a task on your hands: what if, say, a substring of an SD card filename or a message to be displayed to the operator matches a valid G-code instruction? Yikes... That wouldn't be fun to parse at all.
It isn't as nasty as it may sound.
The lexer (the code that uses the compiled regular expressions _command and _comment) can emulate how a typical G-code parser parses the code. That is, it basically just needs to understand what the next token is, and extract it from the line.
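To illustrate, here is a minimal sketch of such a lexer. The token and comment patterns below are my own assumptions for the example, not the actual _command and _comment expressions from the reviewed code:

```python
import re

# Illustrative patterns only; the real _command/_comment may differ.
_command = re.compile(r'\s*(?P<letter>[A-Za-z*])\s*(?P<number>[-+]?\d+(?:\.\d*)?)')
_comment = re.compile(r'\s*(?:;.*|\((?P<text>[^)]*)\))')

def tokens(line):
    """Yield (letter, number) tokens from one G-code line, skipping comments."""
    pos = 0
    while pos < len(line):
        m = _comment.match(line, pos)
        if m:
            pos = m.end()   # skip over the comment
            continue
        m = _command.match(line, pos)
        if not m:
            break           # end of line, or something unrecognized
        yield (m.group('letter').upper(), float(m.group('number')))
        pos = m.end()

print(list(tokens("G1 X10.5 Y-3 ; move")))
# → [('G', 1.0), ('X', 10.5), ('Y', -3.0)]
```

The point is just that the lexer always asks "what is the next token here?" and advances past it, rather than trying to split the whole line at once.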
It is perfectly fine to just keep those tokens as strings. I personally like to convert them to suitable tuple subclasses, for both ease of use and efficiency -- a tuple is compact in memory, and a tuple of tuples (describing the tokens on a G-code line) is faster to process than lists or dicts. By ease of use, I mean helper methods on the G-code line tuple for locating desired tokens.
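As a sketch of what I mean (all the names here are hypothetical, chosen just for this example):

```python
from collections import namedtuple

# Hypothetical names, just to sketch the tuple-of-tuples idea.
Token = namedtuple('Token', ('letter', 'number'))

class GCodeLine(tuple):
    """A tuple of Tokens, with helpers for locating desired tokens."""
    __slots__ = ()  # no per-instance dict; stays as compact as a plain tuple

    def first(self, letter, default=None):
        """Return the number of the first token with the given letter."""
        for token in self:
            if token.letter == letter:
                return token.number
        return default

line = GCodeLine((Token('G', 1.0), Token('X', 10.5), Token('Y', -3.0)))
print(line.first('X'))             # 10.5
print(line.first('Z', default=0))  # 0
```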
I've used this approach to parse HPGL (i.e., lexically separate the input, then convert to tuples), and although Python I/O is not fast, it runs at quite satisfactory speed.
I have also used awk to parse HPGL; awk can split each input line (record) into string fields using regular expressions. It works, but having to convert the string fields to values at each point of use is a lot of repeated work. (Awk is also a bit funny in that it has no local variables, so helper functions often need funky extra parameter names to avoid overwriting variables used elsewhere.)
In C, I would use structures with common initial members and a type tag, possibly with a pointer to the exact field contents. (This does involve at least one extra memory copy, since NUL characters would be inserted between tokens in the input line, but even reading via standard I/O character by character, it'll likely be faster than Python. The implicit conversions Python does between bytearray and str really slow it down.)
How can it be a byte array?
Like I wrote, if you obtain the G-code data via a socket (a TCP/IP connection, a Unix domain socket, or even a character device), you almost always need to open it in binary mode. Instead of forcing the user to remember to convert the bytearray to str via decode(), I added the two lines I thought would avoid that misstep.
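Something along these lines; the actual two lines are in the reviewed code, so this constructor, and my choice of errors='replace', are only a sketch of the idea:

```python
class GCodeLine:
    def __init__(self, data):
        # Accept bytes/bytearray (e.g. read from a socket opened in
        # binary mode) as well as str, so the caller need not remember
        # to call .decode() first. G-code is defined as ASCII.
        if isinstance(data, (bytes, bytearray)):
            data = data.decode('ascii', errors='replace')
        self.line = str(data)

print(GCodeLine(b"G1 X10").line)              # 'G1 X10'
print(GCodeLine(bytearray(b"M104 S0")).line)  # 'M104 S0'
print(GCodeLine("G28").line)                  # 'G28'
```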
Look. You need to think about what kind of code others will write based on yours. Consider the case where a user supplies "a random object" to the GCodeLine() constructor. The only use case that makes sense is when the user intends that data to be treated as an ASCII string and then parsed as G-code.
This is my assumption, based on my experience of how others use the awk and Python code I've written to mangle HPGL. I could be wrong, but that is the basis for that initial choice. Nothing you have written thus far has been a convincing argument against it, assuming the code is used as a basis for development, and not as the Holy Word on How Things Shall Be Done.
While the files may be defined as ASCII, Python strings are Unicode.
When ASCII text is converted to Unicode, the set of possible whitespace characters is exactly those six code points. So you are basically introducing a potential bug, and using more verbose code to do it instead of a simple strip(), to boot.
Why so hostile? "Potential bug."
The problem is that Python converts text input from the character set of the user's locale to Unicode. For example, if I have a file containing the byte \xA4, my Python code will see it in a string as U+20AC (€) if my locale uses ISO-8859-15, but as U+00A4 (¤) if my locale uses Windows-1252; with UTF-8, Python will raise a UnicodeDecodeError. Because of this, I wanted the code to strip only those Unicode characters that correspond to ASCII whitespace.
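Both points are easy to verify. The byte value and charsets below are exactly the example above, and the NO-BREAK SPACE is just one character (of my choosing, for illustration) that a plain strip() would eat:

```python
raw = b"\xa4"
print(raw.decode("iso-8859-15"))    # '€' (U+20AC)
print(raw.decode("windows-1252"))   # '¤' (U+00A4)
try:
    raw.decode("utf-8")
except UnicodeDecodeError:
    print("UTF-8 raises UnicodeDecodeError")

# Plain str.strip() removes *all* Unicode whitespace, including
# e.g. NO-BREAK SPACE (U+00A0); passing the six ASCII whitespace
# characters explicitly leaves anything outside ASCII intact.
ASCII_WHITESPACE = " \t\n\r\f\v"
line = "\u00a0G1 X10\r\n"
print(repr(line.strip(ASCII_WHITESPACE)))  # '\xa0G1 X10' (NBSP preserved)
print(repr(line.strip()))                  # 'G1 X10' (NBSP eaten too)
```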
I am interested in discussing what kinds of choices make sense, but honestly, I'm getting pretty pissed off at those choices being called "bugs" even when I've already explained their rationale. Instead of discussing that, you keep calling the code "buggy" and "overly complicated". OP is doing this to learn, not just to catch tool changes!
I've tried being civil and tried to get something constructive going, but nothing seems to work with you, so I'll just ignore you from now on.