Author Topic: Python text file processing - extracting data based on data found (Read 6305 times)

rx8pilot · « **on:** October 13, 2019, 07:25:16 pm »

SETUP:
Python is fairly new to me. I have a considerable amount of experience in lower-level languages, generally for embedded designs.

PROBLEM:
Extract pieces of data from a text based G-Code program. In general, looking for certain strings followed by gathering other elements before and after that instance. Hoping for an overview of typical options to do this in Python with reasonable elegance.

EFFORT SO FAR:
For the past couple of days, I have been doing a general education effort in Python and trying to get my head separated from my usual C/C++ thoughts. I have opened a few sample text files and used some standard library methods like find, index, and regular expressions. That has given me some ideas on how to iterate through a text file in a basic way, find things in a basic way, etc.

Some sample data:

Code: [Select]

O0401( PGM-M05-610-0325 SB2 V-LOCK H-PLATE B-V2C S1 )
( DATE - SEP. 04 2018 )
( TIME - 9:24 PM )
G20
G0 G17 G40 G80 G90 G94 G98
/M31
G0 G28 G91 Z0.
(  TOOL-2  3/8 FLAT ENDMILL VIPER    D OFF-2  LEN-2  DIA-.375 )
( HAAS SUPER MINI MILL 3 AXIS MILL )
( MACHINE GROUP-1 )
T2 M6
M01
G0 G90 G54 X-2.8392 Y1.6562 S15000 M3
G43 H2 Z.24 /M8
Z.19
G1 Z-.02 F150.
X2.725 F200.
Y1.4818

In this example, I would be looking for the string "M6", get its index, and search forward/backward until I find the 'T' followed by a number, that number would become an element needed later.

G-Code has flexibility and programmers have various styles. M6 calls a tool change on a CNC mill. T is the number of the tool. It can be written in a few ways.
T6 M6
T6M6
T06M6
M6T06
M6 T6
....are all valid commands to call a tool change for the #6 tool.

Ultimately, I would be extracting a lot more of the individual elements of the g-code for analysis but if I can find the M6 and the T6 close to it, I can deal with all the rest. There is a lot of repetition, so in general, I would be looking for an event and then examining the data surrounding that event.

SOLUTION OPTIONS:
Regular Expressions seem to be a powerful way to filter and search text files. Not sure if they are the best for this type of application. The learning curve on regular expressions is not trivial - the powerful nature ensures a long list of syntax rules.

I also used what appears to be a simpler method, .find(), where I could get the index position of a string and presumably walk around that index until I find what I am looking for.

I feel like my experience in C has me over-thinking this in Python terms where there are libraries galore to deal with this sort of problem. Since I am trying to drastically improve my Python skill, the goal here is NOT to have someone code it for me, but rather point me in the right direction. I need to learn this, not just copy/paste code that I do not understand.

Grateful in advance for any guidance.

DimitriP · « **Reply #1 on:** October 13, 2019, 08:17:21 pm »

Python is all well and good, and I don't know the exact end goal but .....
egrep "M6|M06" filenamehere will give you a list of all of 'em

( egrep -n "M6|M06" filenamehere will show line numbers)

Useless for enhancing your Python skills, but you do end up with every M6 and M06 in your file in case you have work to do
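
For comparison, roughly the same listing can be produced in Python. This is only a minimal sketch: the function name grep_lines and the sample lines are made up for illustration, and lines are numbered from 1 to mimic egrep -n. Like the egrep pattern, it would also report codes such as M60.

```python
import re

def grep_lines(lines, pattern):
    """Return (line_number, line) pairs for lines matching pattern, numbered from 1."""
    regex = re.compile(pattern)
    return [(no, line.rstrip("\n"))
            for no, line in enumerate(lines, start=1)
            if regex.search(line)]

# Hypothetical sample input; in practice this could be an open file object
sample = ["G0 G28 G91 Z0.\n", "T2 M6\n", "M01\n", "T03 M06\n"]

for no, line in grep_lines(sample, r"M6|M06"):
    print(f"{no}:{line}")
```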

DimitriP · « **Reply #2 on:** October 13, 2019, 08:26:54 pm »

I might be one of the few that doesn't do python so don't expect any code coming your way, but I have screwed around with strings.
What it sounds like you want to do is find out if a line string contains one of the tool-change strings you are looking for, and if it does, subtract the tool-change string from the line string to leave you with the tool-selected string

If that's what you are after ...

don't take the link personally

but these should get you going : https://www.dummies.com/programming/python/how-to-search-within-a-string-in-python/

janoc · « **Reply #3 on:** October 13, 2019, 09:26:18 pm »

Here is a small example for Python 3 (make sure you have a recent version due to the use of f-strings!) that I have concocted:

Code: [Select]

import re

# Add new patterns as required here
regexs = [re.compile(r"T(\d+)\s*?M6"),      # T followed by a number, followed by optional whitespace and M6
          re.compile(r"M6\s*?T(\d+)")]      # M6 followed by optional whitespace, then T and a number
                                            # (raw strings keep Python from treating \d and \s as string escapes)


with open("test.gcode") as f:
    lines = f.readlines()       # Let's be lazy and read the entire file into memory
                                # It can obviously be done line by line too, since we are analyzing
                                # the content by lines anyway

    for line_no, line in enumerate(lines):
        for r in regexs:
            match = r.search(line)      # Search through the string to see if there is a match for our regex

            if match is not None:
                # assume that there is only a single match on a line - group #1
                group = match.group(1)        # Group 0 - entire thing matched by the regex, groups 1-n are content of the parentheses
                number = int(group)           # Get rid of any leading zeroes and converts it to number

                print(f"{line_no}: Match at characters {match.start(1)}:{match.end(1)}, found: {number}")

(if the formatting/whitespace gets mangled by the forum, here is a better copy: http://dpaste.com/3HCEW7Z )

It is not the only way to do it, likely not the most efficient either, and there is zero error checking for brevity, but it does what you are after. It assumes your GCode is in a file "test.gcode" in the same directory. If I run it on your example GCode, it prints:

Code: [Select]

$ python3 extract.py
10: Match at characters 1:2, found: 2

(10 is the line number, counted from zero; 1:2 is the character position of the match on that line; and finally the number following the T)

It uses regular expressions, they are probably the easiest way to code this if you have multiple ways the things could be written - different order, optional whitespace, leading zeroes or not, etc. Doing this manually by searching through the string would get really really painful fast, even though Python has good facilities to do that.

The script has the patterns in a list and runs through them, the idea is that you could have a lot of different patterns so a single regex with alternatives would get unwieldy really fast. Instead of printing you can then call some function to process the data or even modify the string and output a modified version - up to you.

Regular expressions are not that complex if you keep it to small patterns. I have found this tool really useful for testing stuff out quickly:
https://regex101.com/#python

If you try the expressions I have used there you will get a detailed explanation of what they do as well.

rx8pilot · « **Reply #4 on:** October 13, 2019, 10:05:50 pm »

Quote from: janoc on October 13, 2019, 09:26:18 pm

Here is a small example for Python 3 (make sure you have a recent version due to the use of f-strings!) that I have concocted:

Wow, thanks.....trying this example. RegEx looks like it is worth the cost of learning for this type of data extraction. I ran my full file through it and it works well. The code is concise.

Since it returns the line number, I can use that to look for additional data before and after the M6. It also looks like RegEx is an easy way to ignore g-code comments that are encapsulated in parentheses: (comment with M6 T6)

...continuing to experiment.
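
One way to ignore parenthesised comments is to strip them before searching. The following is a rough sketch under a few assumptions: comments are not nested, a tool change is written as in the examples from the first post, and the names COMMENT, TOOL_CHANGE and tool_change are made up for illustration.

```python
import re

COMMENT = re.compile(r"\([^)]*\)")                      # "(" up to the next ")", no nesting assumed
TOOL_CHANGE = re.compile(r"T(\d+)\s*M6|M6\s*T(\d+)")    # both orderings in one pattern

def tool_change(line):
    """Return the tool number if the line, with comments removed, has a tool change; else None."""
    code = COMMENT.sub("", line)
    match = TOOL_CHANGE.search(code)
    if match is None:
        return None
    # Only one of the two groups is set, depending on which alternative matched
    return int(match.group(1) or match.group(2))

print(tool_change("T2 M6"))                 # 2
print(tool_change("(comment with M6 T6)"))  # None
print(tool_change("M6T06 ( TOOL-6 )"))      # 6
```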

rx8pilot · « **Reply #5 on:** October 14, 2019, 01:17:31 am »

Quote from: janoc on October 13, 2019, 09:26:18 pm

Regular expressions are not that complex if you keep it to small patterns. I have found this tool really useful for testing stuff out quickly:
https://regex101.com/#python

If you try the expressions I have used there you will get a detailed explanation of what they do as well.

RegEx is quite a tool set. After experimenting with your sample code and doing some reading I can see that it covers a ton of territory.
I have been able to find various pieces of data and have the location of that data in the file - excellent.

Trying to understand the syntax in this line:

Code: [Select]

for line_no, line in enumerate(lines):

I have been using the for loop like this:

Code: [Select]

for x in range(5):

Not sure what/how 'line_no, line' works

The next step is to find data relative to other data. For example, it is common to have a description of the tools in a comment a few lines ahead of the command to execute the tool change. I am thinking that once I find the 'M6' and have the tool number taken from the 'T' code - I would just scan through the last 5-6 lines looking for comment delimiters and associate that data with the tool number.

IanB · « **Reply #6 on:** October 14, 2019, 01:32:31 am »

If it were me, I wouldn't look at the file as text. I would look at it as a program that can be interpreted. So to do that I would look at the definition of G-Code and how to read it and write it. Then I would read in the file and store it in memory as symbolic instructions in a standardized form. Having done that I would scan the program in its standardized form for the information I was looking for.

I realize this is slightly more complex to do than searching through a text file, but ultimately the effort of taking a more rigorous approach will pay off.

rx8pilot · « **Reply #7 on:** October 14, 2019, 02:10:18 am »

Quote from: IanB on October 14, 2019, 01:32:31 am

If it were me, I wouldn't look at the file as text. I would look at it as a program that can be interpreted. So to do that I would look at the definition of G-Code and how to read it and write it. Then I would read in the file and store it in memory as symbolic instructions in a standardized form. Having done that I would scan the program in its standardized form for the information I was looking for.

I realize this is slightly more complex to do than searching through a text file, but ultimately the effort of taking a more rigorous approach will pay off.

The good news is that I can write g-code in my sleep, the bad news is that this project is an exercise that is devised as a personal skill builder in Python. Your suggestion may overwhelm my fragile little mind, lol.

The exercise is largely to gain an understanding of how to scan some data and extract the 'highlight reel'. While this is g-code, I am hoping to use the skills gained to do similar data scanners for all the systems I use in manufacturing. Most of them have some sort of data that I can understand well enough to get a sense of what is happening. Full interpretation could be helpful, but it will have to wait until I have room in my brain and my schedule to pull it off.

westfw · « **Reply #8 on:** October 14, 2019, 08:33:04 am »

Quote

RegEx looks like it is worth the cost of learning for this type of data extraction.

Assuming that there isn't already a GCODE library for Python, I agree that a typical Python programmer would probably use regular expressions for this sort of thing. Once "built", the parsing of a regular expression is quite fast (comparable to simple string matching, IIRC.)

I can recommend this class: https://www.coursera.org/learn/python-network-data ( Using Python to Access Web Data )
They cover regular expressions first, along with HTML, JSON, and XML (all of which are popular data formats that it's worth having a passing familiarity with.)

The Python code that you have to write to do something "trivial" with data in one of those formats is just about as trivial as it ought to be, which is pretty amazing.

janoc · « **Reply #9 on:** October 14, 2019, 03:57:29 pm »

Quote from: rx8pilot on October 14, 2019, 01:17:31 am

Quote from: janoc on October 13, 2019, 09:26:18 pm
Regular expressions are not that complex if you keep it to small patterns. I have found this tool really useful for testing stuff out quickly:
https://regex101.com/#python

If you try the expressions I have used there you will get a detailed explanation of what they do as well.

RegEx is quite a tool set. After experimenting with your sample code and doing some reading I can see that it covers a ton of territory.
I have been able to find various pieces of data and have the location of that data in the file - excellent.

Trying to understand the syntax in this line:
Code: [Select]
for line_no, line in enumerate(lines):
I have been using the for loop like this:
Code: [Select]
for x in range(5):
Not sure what/how 'line_no, line' works

enumerate() runs over a sequence (list, tuple, numpy array ...) and yields a tuple (index, value) for each element, where index is the position of the element in the sequence and value is the element itself. Writing "for line_no, line in ..." unpacks each of those tuples into the two names.

range(5) generates a sequence too.
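
A small illustration of the tuple unpacking, using a made-up three-line list:

```python
lines = ["G20", "T2 M6", "M01"]

# enumerate() yields (index, value) tuples ...
pairs = list(enumerate(lines))
print(pairs)            # [(0, 'G20'), (1, 'T2 M6'), (2, 'M01')]

# ... and "for a, b in ..." unpacks each tuple into two names
for line_no, line in enumerate(lines):
    print(line_no, line)

# A more C-like equivalent using range() and indexing
for i in range(len(lines)):
    print(i, lines[i])
```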

The difference between Python and e.g. C/C++ is that the for loop in Python is more similar to foreach or range loops in C++, not the C-like for loop. I.e.:

Code: [Select]

for x in sequence:
    print(x)

the loop will on each iteration retrieve the next value from the sequence/iterator and put it in x. So you can open a file and iterate over it like this:

Code: [Select]

with open("somefile.txt") as f:
    for line in f:              # a file object yields one line per iteration
        print(line, end="")     # each line already ends with a newline

Quote from: rx8pilot on October 14, 2019, 01:17:31 am

The next step is to find data relative to other data. For example, it is common to have a description of the tools in a comment a few lines ahead of the command to execute the tool change. I am thinking that once I find the 'M6' and have the tool number taken from the 'T' code - I would just scan through the last 5-6 lines looking for comment delimiters and associate that data with the tool number.

Well, you could do that or you can write a multi-line regexp (that scans multiple lines) and will extract the comment if it sees the M6 line. It would be a bit complicated, though. Or remember the last comment you have seen and analyze it when you find the M6.
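
The "remember the last comment" idea could look roughly like this. It is a sketch, not a finished tool: it assumes tool descriptions are comments whose text starts with "TOOL" (as in the sample from the first post, where other comments sit between the description and the M6), and the function name tools_with_comments is made up.

```python
import re

COMMENT = re.compile(r"\(([^)]*)\)")
TOOL_CHANGE = re.compile(r"T(\d+)\s*M6|M6\s*T(\d+)")

def tools_with_comments(lines):
    """Map each tool number to the last TOOL comment seen before its tool change."""
    tools = {}
    last_comment = None
    for line in lines:
        comment = COMMENT.search(line)
        if comment:
            text = comment.group(1).strip()
            if text.startswith("TOOL"):      # assumption: tool descriptions start with "TOOL"
                last_comment = text
        code = COMMENT.sub("", line)         # search only the part outside comments
        match = TOOL_CHANGE.search(code)
        if match:
            number = int(match.group(1) or match.group(2))
            tools[number] = last_comment
            last_comment = None              # do not reuse it for the next tool change
    return tools

sample = [
    "G0 G28 G91 Z0.",
    "(  TOOL-2  3/8 FLAT ENDMILL VIPER    D OFF-2  LEN-2  DIA-.375 )",
    "( HAAS SUPER MINI MILL 3 AXIS MILL )",
    "( MACHINE GROUP-1 )",
    "T2 M6",
]
print(tools_with_comments(sample))
```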

janoc · « **Reply #10 on:** October 14, 2019, 04:03:46 pm »

Quote from: IanB on October 14, 2019, 01:32:31 am

If it were me, I wouldn't look at the file as text. I would look at it as a program that can be interpreted. So to do that I would look at the definition of G-Code and how to read it and write it. Then I would read in the file and store it in memory as symbolic instructions in a standardized form. Having done that I would scan the program in its standardized form for the information I was looking for.

I realize this is slightly more complex to do than searching through a text file, but ultimately the effort of taking a more rigorous approach will pay off.

That works but if the goal is to only replace/extract few values (and not to simulate/interpret the code), this would be an extreme overkill. GCode isn't really a structured language, think more assembler than e.g. C. Each line is one instruction and that's it.

Worse, there isn't a "definition" of GCode (like a grammar). It is a very good example of an ad-hoc industrial standard that isn't really standard - every single vendor of a CNC controller has their own dialect, implementing different commands or the same commands have different meaning. Some of it has to do with the machine configuration (X,Y,Z means a very different thing on a lathe, mill and 3D printer), some of it has to do simply with the way the vendor has decided to implement things.

rx8pilot · « **Reply #11 on:** October 14, 2019, 05:39:08 pm »

Quote from: westfw on October 14, 2019, 08:33:04 am

Quote
RegEx looks like it is worth the cost of learning for this type of data extraction.
Assuming that there isn't already a GCODE library for Python, I agree that a typical Python programmer would probably use regular expressions for this sort of thing. Once "built", the parsing of a regular expression is quite fast (comparable to simple string matching, IIRC.)

G-Code is chosen as my learning challenge, primarily because I know G-code very well. The long game is to be able to dig through all sorts of control and configuration data - most of which is non-standard. The goal of gathering this data is to make global changes that impact numerous machines/processes/controllers, etc or for composite analysis of a process that includes many operationally isolated systems.

Quote from: janoc on October 14, 2019, 04:03:46 pm

Worse, there isn't a "definition" of GCode (like a grammar). It is a very good example of an ad-hoc industrial standard that isn't really standard - every single vendor of a CNC controller has their own dialect, implementing different commands or the same commands have different meaning. Some of it has to do with the machine configuration (X,Y,Z means a very different thing on a lathe, mill and 3D printer), some of it has to do simply with the way the vendor has decided to implement things.

True and painful.

RegEx question:
When looking for code M6, it can be presented in a number of ways.
m6
m06
M6
M06

Easy [mM]0?6 is good for all of those. The problem is that there can also be M67, M60, etc which this expression will also grab. How do you ignore any integer following the '6'?

SiliconWizard · « **Reply #12 on:** October 14, 2019, 06:26:07 pm »

Quote from: rx8pilot on October 14, 2019, 05:39:08 pm

Easy [mM]0?6 is good for all of those. The problem is that there can also be M67, M60, etc which this expression will also grab. How do you ignore any integer following the '6'?

I wouldn't do that. I would rather use a regex to grab all tokens starting with an 'm' or 'M', which would be an "M" command, and then convert the numeric constant right after the command letter as an integer. Now you have all that's needed to decode M commands.

You can do this for all commands actually, of the form: letter+numeric value.

rx8pilot · « **Reply #13 on:** October 14, 2019, 07:08:39 pm »

Quote from: SiliconWizard on October 14, 2019, 06:26:07 pm

Quote from: rx8pilot on October 14, 2019, 05:39:08 pm
Easy [mM]0?6 is good for all of those. The problem is that there can also be M67, M60, etc which this expression will also grab. How do you ignore any integer following the '6'?

I wouldn't do that. I would rather use a regex to grab all tokens starting with an 'm' or 'M', which would be an "M" command, and then convert the numeric constant right after the command letter as an integer. Now you have all that's needed to decode M commands.

You can do this for all commands actually, of the form: letter+numeric value.

So you suggest RegEx to find the 'M' and then use the string position returned from RegEx to step through the next 3 characters to see if it is a valid number - ignoring anything other than 0-9? Or, perhaps a nested RegEx that is looking for 1-3 decimals?

SiliconWizard · « **Reply #14 on:** October 14, 2019, 08:11:18 pm »

A decent regex library should be able to iterate through tokens of the form letter+numeric value, and extract the letter and the numeric value for you for each. At least this is easily done with Lua and the built-in regex capabilities.

I don't use Python, so don't know how to do it in Python.
But in Lua, that would look something like: (bonus: it ignores comments as well)

Code: [Select]

function Str_Gcode(Str)
    return string.gmatch(Str, "([%a])([%d%-%.]+)")
end

function Str_RemoveComments(Str)
    return string.gsub(Str, "[%(].+[%)]", "")
end

function Parse_Gcode(FilePath)
    local LastTool
    
    for Line in io.lines(FilePath) do
        Line = Str_RemoveComments(Line)
        
        for Code, Arg in Str_Gcode(Line) do
            local ArgNum = tonumber(Arg)
            
            Code = string.upper(Code)

            if Code == "M" then
                if ArgNum == 6 then
                    print("Tool Change")
                end
            elseif Code == "T" then
                print("Tool number: " .. ArgNum)
                LastTool = ArgNum
            end
        end
    end
end
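
A rough Python counterpart of the Lua snippet above, for comparison. It is a sketch under the same assumptions (every word is a letter followed by a number, comments are in parentheses); the names WORD, parse_words and find_tool_changes are made up for illustration.

```python
import re

COMMENT = re.compile(r"\([^)]*\)")
# One letter followed by a number such as 6, 06, -2.8392, 150. or .24
WORD = re.compile(r"([A-Za-z])([-+]?(?:\d+\.?\d*|\.\d+))")

def parse_words(line):
    """Split one G-code line into (LETTER, number) pairs, ignoring comments."""
    code = COMMENT.sub("", line)
    words = []
    for letter, arg in WORD.findall(code):
        value = float(arg) if "." in arg else int(arg)
        words.append((letter.upper(), value))
    return words

def find_tool_changes(lines):
    """Return (line_index, tool_number) for every line that contains M6."""
    changes = []
    last_tool = None
    for index, line in enumerate(lines):
        words = parse_words(line)
        for letter, value in words:
            if letter == "T":
                last_tool = value            # remember the most recent T word
        if ("M", 6) in words:                # M6 and M06 both parse to ("M", 6)
            changes.append((index, last_tool))
    return changes

print(parse_words("G43 H2 Z.24 /M8"))
print(find_tool_changes(["G20", "T2 M6", "M6T06", "M60", "(M6 T9)", "m6 t3"]))
```

Because the number is converted before the comparison, M60 or M67 never compare equal to 6, so no special regex is needed to exclude them.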

DimitriP · « **Reply #15 on:** October 14, 2019, 10:01:34 pm »

Quote

...and the built-in regex capabilities

I wish I had a clear idea of what the end result of this is supposed to be other than honing python skills.

Opening the file in Notepad++ and using a regex for M6 M06 will find everything, and all you have to do is look "above" to see what is going on.

If the purpose is to read comments that may or may not appear directly or close to the tool change, I'm completely oblivious to how a list of lines and character positions is useful.

..."other than honing python skills"

rx8pilot · « **Reply #16 on:** October 15, 2019, 03:43:03 am »

Quote from: DimitriP on October 14, 2019, 10:01:34 pm

Quote
...and the built-in regex capabilities
I wish I had a clear idea of what the end result of this is supposed to be other than honing python skills.

Primary goal is to hone Python skills related to searching text data for analysis.

Using a sample of G-code simply to illustrate the type of data extraction I am hoping to do with all sorts of text data that is unrelated to G-code. Finding a keyword or command and then examining the other data that surrounds it.

janoc · « **Reply #17 on:** October 15, 2019, 06:54:51 pm »

Quote from: rx8pilot on October 14, 2019, 07:08:39 pm

Quote from: SiliconWizard on October 14, 2019, 06:26:07 pm
Quote from: rx8pilot on October 14, 2019, 05:39:08 pm
Easy [mM]0?6 is good for all of those. The problem is that there can also be M67, M60, etc which this expression will also grab. How do you ignore any integer following the '6'?

I wouldn't do that. I would rather use a regex to grab all tokens starting with an 'm' or 'M', which would be an "M" command, and then convert the numeric constant right after the command letter as an integer. Now you have all that's needed to decode M commands.

You can do this for all commands actually, of the form: letter+numeric value.

So you suggest RegEx to find the 'M' and then use the string position returned from RegEx to step through the next 3 characters to see if it is a valid number - ignoring anything other than 0-9? Or, perhaps a nested RegEx that is looking for 1-3 decimals?

I agree with what SiliconWizard said above - it is better to find all M commands, then look at the extracted number value and decide whether it is one you care about or not using normal if statement(s). Basically use the regular expression to split the line into tokens, do not try to interpret them. That is better done separately.

While it is possible to create a regular expression that would grab only the ones that end with 6 and ignore everything else it will be needlessly complex and it is unlikely you will only ever be interested in the M6 command. If you are going to look for M1, M4 or others later, you will have to define specific regexp for each = pain in the butt to write, slow (matching regexps is fairly expensive) and not maintainable code, with a ton of regexps that do almost the same thing, differing only in the value they are searching for.
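
For reference, the narrow "M6 but not M60" case on its own can be expressed with a negative lookahead. A minimal sketch, reusing the [mM]0?6 pattern from the question; the candidate strings are made up:

```python
import re

# [mM]0?6 : M or m, an optional leading zero, then 6
# (?!\d)  : negative lookahead, the 6 must not be followed by another digit
M6 = re.compile(r"[mM]0?6(?!\d)")

candidates = ["M6", "m06", "M6T2", "M60", "M67", "M16"]
hits = [text for text in candidates if M6.search(text)]
print(hits)     # ['M6', 'm06', 'M6T2']
```

This still needs one pattern per code of interest, which is the maintenance problem described above.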

rx8pilot · « **Reply #18 on:** October 15, 2019, 07:41:56 pm »

Quote from: janoc on October 15, 2019, 06:54:51 pm

I agree with what SiliconWizard said above - it is better to find all M commands, then look at the extracted number value and decide whether it is one you care about or not using normal if statement(s). Basically use the regular expression to split the line into tokens, do not try to interpret them. That is better done separately.

While it is possible to create a regular expression that would grab only the ones that end with 6 and ignore everything else it will be needlessly complex and it is unlikely you will only ever be interested in the M6 command. If you are going to look for M1, M4 or others later, you will have to define specific regexp for each = pain in the butt to write, slow (matching regexps is fairly expensive) and not maintainable code, with a ton of regexps that do almost the same thing, differing only in the value they are searching for.

ok - I can see the flexibility and simplicity. I will code a few experiments on a larger file with the goal of taking a user input 'M' or 'G' code and having the program count the instances and return the lines with line numbers. That should be a good exercise to ensure that I can find anything and know its location if I need to find related data surrounding it by searching a smaller section of the data.
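
A possible starting point for that exercise, as a sketch only: the function name find_code is made up, comments are assumed to be in parentheses, and only integer codes (no decimal values such as X2.725) are considered.

```python
import re

COMMENT = re.compile(r"\([^)]*\)")
# A letter and an integer that is not the start of a decimal value like X2.725
WORD = re.compile(r"([A-Za-z])(\d+)(?![\d.])")

def find_code(lines, code):
    """Return (line_number, line) for every line containing the code, e.g. "M6" or "g01"."""
    letter, number = code[0].upper(), int(code[1:])
    hits = []
    for line_no, line in enumerate(lines, start=1):     # count lines from 1
        stripped = COMMENT.sub("", line)
        for found_letter, found_number in WORD.findall(stripped):
            if found_letter.upper() == letter and int(found_number) == number:
                hits.append((line_no, line.rstrip("\n")))
                break                                   # report each line once
    return hits

sample = [
    "G0 G90 G54 X-2.8392 Y1.6562 S15000 M3",
    "T2 M6",
    "M01",
    "G1 Z-.02 F150.",
    "( M6 in a comment )",
    "M06 T3",
    "M60",
]
matches = find_code(sample, "M6")
print(len(matches), matches)
```

The code to search for could come from input() or a command-line argument; it is passed as a plain string here to keep the sketch self-contained.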

Thanks @SiliconWizard and @janoc

PS: I am coding in C all week during the day just to ensure that my brain will be scrambled eggs after I dive into Python each night.

Nominal Animal · « **Reply #19 on:** October 15, 2019, 07:56:49 pm »

Consider the following Python3 code:

Code: [Select]

#!/usr/bin/python3
import re
from sys import stdin, stdout


class GCodeParameter(tuple):
    """A tuple subclass describing a single GCode parameter"""

    def __new__(cls, name, value):
        # Ensure name part is a string
        if not isinstance(name, str):
            name = str(name, encoding='ascii')

        # Convert name to uppercase
        name = name.upper()

        # Convert value to a numeric type
        if not isinstance(value, (int, float)):
            if not isinstance(value, str):
                value = str(value, encoding='ascii')
            if '.' in value:
                value = float(value)
            else:
                value = int(value, base=10)

        return tuple.__new__(GCodeParameter, (name, value))

    @property
    def name(self):
        "Parameter name (str)"
        return self[0]

    @property
    def value(self):
        "Parameter value (int or float)"
        return self[1]

    @property
    def valuestr(self):
        "Parameter value as a string"
        if isinstance(self[1], float):
            return ("%.9f" % self[1]).rstrip('0')
        else:
            return "%d" % self[1]

    def __str__(self):
        if isinstance(self[1], float):
            return ("%s%.9f" % self).rstrip('0')
        else:
            return "%s%d" % self


class GCodeLine(tuple):
    """A tuple subclass describing a single GCode line"""

    _comment = re.compile(r'\s*\([^)]*\)\s*')
    _command = re.compile(r'\s*(/?[A-Za-z])([-+]?[0-9]*\.[0-9.]*|[-+]?[0-9]+)\s*')

    def __new__(cls, line):
        # Ensure line is a string.
        if not isinstance(line, str):
            line = str(line, encoding='ascii')

        # Remove comments and leading and trailing ASCII whitespace.
        line = GCodeLine._comment.sub('', line).strip('\t\n\v\f\r ')

        params = []
        while True:

            # Try to extract next parameter.
            match = GCodeLine._command.match(line)
            if match is None:
                break

            # Add to parameter list,
            params += [ GCodeParameter(match.group(1), match.group(2)) ]

            # and remove from the line.
            line = line[match.end(0):]

        # If the complete line was not parsed, raise a ValueError.
        if len(line) > 0:
            raise ValueError("Invalid GCode: %s" % line)

        # Construct the tuple subclass instance and return it.
        return tuple.__new__(GCodeLine, params)

    def __str__(self):
        return ' '.join([ str(param) for param in self ])

    def has(self, name, value):
        """Count the number of occurrences of the parameter on this GCode line."""

        if not isinstance(name, str):
            uppername = str(name, encoding='ascii').upper()
        else:
            uppername = name.upper()

        count = 0
        for param in self:
            if param.name == uppername and param.value == value:
                count = count + 1

        return count
        
    def valueof(self, name, default=None):
        """Return the value of the first matching parameter on this GCode line."""

        if not isinstance(name, str):
            uppername = str(name, encoding='ascii').upper()
        else:
            uppername = name.upper()

        for param in self:
            if param.name == uppername:
                return param.value

        return default


if __name__ == '__main__':
    for line in stdin:
        gcode = GCodeLine(line)
        if len(gcode) > 0:

            print(gcode)

            if gcode.has('M', 6):
                print("    (Changed to tool %d)" % gcode.valueof('T', -1))

The GCodeParameter class is a subclass of tuple. It describes a single Gcode parameter, like ('G', 0) (for G0) or ('Z',-0.02) (for Z-0.02).
If p is an instance of that class, you can extract the parameter name using p.name, value using p.value, and the value as a string using p.valuestr.
Implicit or explicit conversion to string uses the __str__() method, which reconstructs the parameter as best it can. In particular, if the value contained a decimal point, the value type will be float (and the string version will also always contain the decimal point); otherwise the value is an int.

The GCodeLine class is also a subclass of tuple. When you create an instance, you supply a line of GCode as a parameter. The __new__() method will parse it into individual GCodeParameters, and return them as a list (actually, tuple; but near enough the same thing, except that tuples cannot be modified). The __str__() method reconstructs the entire string, inserting spaces between each parameter. Comments and extra whitespace are discarded.
The has(name, value) method can be used to check how many instances of a specific Gcode parameter and value pair this line has. For example, if the line was "G0 G5 G2 G0", then has("G", 0)==2 and has("T", 2)==0.
The valueof(name) method can be used to find the value of the first parameter of that name on that line. By default, it returns None if there is no such parameter on the line; but you can supply that value as a second parameter if you want a different return value in the not-found case.

The example main reads gcode from standard input. The line
for line in stdin:
is a loop, where line will contain each consecutive line of input. The loop will end when there is no more data to read.

The next line,
gcode = GCodeLine(line)
constructs a new instance of the GCodeLine class, using the line of input as the data.
When the line contains only whitespace or comments, gcode will be an empty tuple, ().
Note that len(gcode) will contain the number of GCodeParameters parsed from that line.

The line
print(gcode)
reconstructs the gcode (using the two __str__() methods), and prints it to standard output.

The next two lines,
if gcode.has('M', 6):
    print("    (Changed to tool %d)" % gcode.valueof('T', -1))
checks if the line has an M6 Gcode parameter, and if so, prints the value of the first T parameter on the same line. If there is no T parameter on that line, it prints -1 as the tool.

Note that if you save the above as say example.py, you can use pydoc3 example to see the help/description/usage for the two classes. It is basically automagic documentation that uses the docstrings (literal strings following the class, method, or function definition); I recommend you make it a habit to write those.
For me, they are indispensable, because it is much easier to run the pydoc3 command and read the description, than look at the source code and try to remember what the heck the thing is, a few months down the line.

Run the above with e.g.
python3 example.py < input.gcode
If the file input.gcode contains the snippet from the first post, the output is
O401
G20
G0 G17 G40 G80 G90 G94 G98
/M31
G0 G28 G91 Z0.
T2 M6
    (Changed to tool 2)
M1
G0 G90 G54 X-2.8392 Y1.6562 S15000 M3
G43 H2 Z0.24 /M8
Z0.19
G1 Z-0.02 F150.

Questions? Comments?

rx8pilot · « **Reply #20 on:** October 16, 2019, 05:12:48 pm »

Quote from: Nominal Animal on October 15, 2019, 07:56:49 pm

Consider the following Python3 code:

Questions? Comments?

Looking forward to getting back to my computer now.......

janoc · « **Reply #21 on:** October 16, 2019, 07:20:12 pm »

Quote from: Nominal Animal on October 15, 2019, 07:56:49 pm

Consider the following Python3 code:
...
Questions? Comments?

What exactly is the advantage of subclassing a tuple like this instead of just taking a line and doing something like:

Code: [Select]

my_tokens = line.strip().upper().split() ?

It gives you exactly the same thing without the pointless and, worse, confusing complexity. Confusing, because you are assigning semantic meaning ("name", "value", "parameter" ...) to things that don't really work like that.

E.g. if I type:

G0 X0 Y0 Z100 F300
The command is G0, the arguments are X0 Y0 Z100 and F300, all optional. There is no G0 "Parameter" as your class is named.

Or this (select a file from SD card in the Marlin 3D printer firmware) - won't be meaningfully parsed:
M23 /musicg~1/shav~1.gco

Also some instructions have letter only arguments (there is no convention that it has to always be <letter><number>):
M27 C
(reports currently open filename - again Marlin)

Comments can also be inline, delimited by a semicolon:
G28 X Z ; Home the X and Z axes

As I wrote earlier, GCode is not a formalized language with a consistent grammar. Don't try to overthink it because you will spend more time fixing (tons of) special cases and dialect differences than doing useful work and the result will be an unmaintainable mess.

If you really wanted to do a custom type, then a much better way is to do a high level class representing a single instruction. Give it fields for name, list of arguments, etc., including possible normalization and a function to parse the line. Then you can have subclasses for different instructions where you can verify the arguments or do whatever. That's much more useful than this low level approach which effectively achieves exactly the same thing as the split() function above but takes an entire page worth of code to do it.
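
A minimal sketch of what such a high-level instruction class could look like. Everything here is an assumption made for illustration: the name Instruction, its fields, whitespace-separated tokens, and ';' as the only comment delimiter (parenthesized comments are not handled).

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Instruction:
    """One G-code instruction: a command name plus its raw argument strings."""
    name: str
    args: List[str] = field(default_factory=list)
    comment: str = ""

    @classmethod
    def parse(cls, line):
        """Parse one line; return None for blank or comment-only lines."""
        code, _, comment = line.partition(";")
        tokens = code.split()
        if not tokens:
            return None
        # Only the command name is normalized; arguments such as file
        # names are kept exactly as written.
        return cls(tokens[0].upper(), tokens[1:], comment.strip())


home = Instruction.parse("G28 X Z ; Home the X and Z axes")
select = Instruction.parse("M23 /musicg~1/shav~1.gco")
print(home)
print(select)
```

Subclasses per instruction could then validate or convert args as needed; that part is left out here.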

However, that's an enormous overkill for just extracting some data from a GCode file. I would go to such length only if I actually wanted to interpret the code for some reason (e.g. because I want to visualize/simulate it).

BTW:

Code: [Select]

# Ensure line is a string.
if not isinstance(line, str):
    line = str(line, encoding='ascii')

This is very strange. First, what else can you get there if not a string? A byte buffer? But then why did you open the file in binary mode? (And byte buffers are converted using the xxx.decode() function.)

Second, this will actually do weird things if you pass some unexpected object there. It will get silently converted to string using its __str__() or __repr__() functions and you will keep parsing that nonsense instead of throwing an error. That's a really terrible idiom, IMO.
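
For reference, the behaviour of the two forms of str() can be checked directly. In CPython 3, str() with an encoding argument accepts only bytes-like objects and raises TypeError for anything else; it is the one-argument form that falls back to __str__() or __repr__(). The helper name describe_str_call is made up for this sketch.

```python
def describe_str_call(obj):
    """Return str(obj, encoding='ascii'), or the name of the exception raised."""
    try:
        return str(obj, encoding="ascii")
    except (TypeError, UnicodeDecodeError) as error:
        return type(error).__name__


results = {
    "bytes": describe_str_call(b"T2 M6"),
    "bytearray": describe_str_call(bytearray(b"T2 M6")),
    "int": describe_str_call(42),
    "tuple": describe_str_call(("T", 2)),
    "non-ASCII bytes": describe_str_call(b"\xa4"),
}

# The one-argument form converts any object via __str__()/__repr__().
one_arg = str(42)

for label, outcome in results.items():
    print(label, "->", outcome)
```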

Code: [Select]

something.strip('\t\n\v\f\r ')

This is not a good way to strip whitespace. Python's strings are Unicode by default, so by doing this you strip only those 6 characters and ignore everything else that is also considered whitespace and could have been added, e.g., by some Windows software. Unless you have a specific reason to strip only those 6 characters, it is much safer to use strip() (without arguments), which removes everything that is classed as whitespace (including the Unicode stuff).
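
A short sketch of the difference between the two calls. The sample line is invented: it is padded with a no-break space (U+00A0) and an em space (U+2003), the kind of thing an editor might leave behind.

```python
ASCII_WS = "\t\n\v\f\r "

line = "\u00a0G20\u2003\r\n"

only_ascii = line.strip(ASCII_WS)  # removes just the trailing "\r\n"
everything = line.strip()          # removes all Unicode whitespace

print(repr(only_ascii))
print(repr(everything))
```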

Nominal Animal · « **Reply #22 on:** October 17, 2019, 04:49:52 am »

Quote from: janoc on October 16, 2019, 07:20:12 pm

What exactly is the advantage of subclassing a tuple like this instead of just taking a line and doing something like:
Code: [Select]
my_tokens = line.strip().upper().split() ?

That won't split G0X-2Y0.6 correctly. And if you use regular expressions on the line, something like (FAT2 CAT) M6 T4 will ruin your day anyway; you'll find T2 and not T4.
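
To make both points concrete, here is a rough sketch, not the classes posted earlier. It assumes comments are parenthesized and do not nest, that every word is one letter followed by a number, and it ignores ';' comments. The names COMMENT, WORD, words and tool_change are made up for this example.

```python
import re

# Parenthesized comments, and one letter followed by a number.
COMMENT = re.compile(r"\([^)]*\)")
WORD = re.compile(r"([A-Z])\s*([-+]?(?:\d+\.?\d*|\.\d+))")


def words(line):
    """Return (letter, number-text) pairs found outside comments."""
    code = COMMENT.sub(" ", line.upper())
    return WORD.findall(code)


def tool_change(line):
    """Return the T number if the line has an M6 word, else None."""
    found = words(line)
    if not any(letter == "M" and float(value) == 6 for letter, value in found):
        return None
    for letter, value in found:
        if letter == "T":
            return int(float(value))
    return -1  # M6 present, but no T word on this line


print("G0X-2Y0.6".split())             # split() leaves it as one token
print(words("G0X-2Y0.6"))
print(tool_change("(FAT2 CAT) M6 T4"))
```

Removing the comments before matching is what keeps the T2 inside (FAT2 CAT) from being picked up.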

Using a class to describe each logical token (Ha! should have named it GCodeToken!) makes it much easier to examine the tokens.
I haven't really done this with g-code, but the approach works very well for HPGL.

Tuples are compact, and "faster" compared to generic objects. This only matters when you have lots of them, but then, the difference is significant. It is well worth the slight added complexity, in my experience.

Quote from: janoc on October 16, 2019, 07:20:12 pm

Confusing, because you are assigning semantic meaning ("name", "value", "parameter" ...) to things that don't really work like that.

So hostile... Call it GCodeToken or GCodeItem, then.

Sure, the classes can be improved a lot; I did not provide a ready implementation, just something to start with.

It might make sense to convert comments to GCodeComment instances (subclasses of GCodeToken), and use other subclasses for different types of commands.
In particular, the skip thingy (/ or /number) should likely be a different type of subclass.

If a token has an optional value, just make that value None when unspecified.

Quote from: janoc on October 16, 2019, 07:20:12 pm

Or this (select a file from SD card in the Marlin 3D printer firmware) - won't be meaningfully parsed:
M23 /musicg~1/shav~1.gco

It isn't difficult to add, though; it is just a matter of deciding what kind of syntax one uses.

I notice that you assume each token is separated by whitespace yourself; how sure are you about that? I would not expect that to always be the case.

Quote from: janoc on October 16, 2019, 07:20:12 pm

Code: [Select]
# Ensure line is a string.
if not isinstance(line, str):
    line = str(line, encoding='ascii')

This is very strange. First what else can you get there if not a string? A byte buffer? But then why did you open the file in binary mode? (and byte buffers are converted using xxx.decode() function).

The point is to convert non-strings to strings, intentionally.

If line is a bytearray, it does get converted to a string, from the ASCII character set. In some cases, for example when communicating over sockets, you need to use a bytearray buffer. Even in that case, G-code is still ASCII, not Unicode.

Another option would be to use
if isinstance(line, (bytearray, GCodeLine)):
    line = str(line, encoding='ascii')
if not isinstance(line, str):
    raise ValueError("Cannot parse a %s as a g-code line" % type(line))

Perhaps you prefer the inane
if isinstance(line, bytearray):
    line = line.decode(encoding='ascii')
if isinstance(line, GCodeLine):
    line = str(line, encoding='ascii')
if not isinstance(line, str):
    raise ValueError("Cannot parse a %s as a g-code line" % type(line))
instead, which does the same thing, but is more better and Enterprisey, because it has more lines.

Quote from: janoc on October 16, 2019, 07:20:12 pm

This is not a good way to strip whitespace.

If you cared to read the comment above that line, it says "ASCII whitespace". G-code files are ASCII, not Unicode.

janoc · « **Reply #23 on:** October 17, 2019, 02:00:41 pm »

Quote from: Nominal Animal on October 17, 2019, 04:49:52 am

Quote from: janoc on October 16, 2019, 07:20:12 pm
What exactly is the advantage of subclassing a tuple like this instead of just taking a line and doing something like:
Code: [Select]
my_tokens = line.strip().upper().split() ?
That won't split G0X-2Y0.6 correctly. And if you use regular expressions on the line, something like (FAT2 CAT) M6 T4 will ruin your day anyway; you'll find T2 and not T4.

That likely wouldn't be parsed by the GCode interpreter in the first place; usually you must have whitespace between the command and the arguments, and between the arguments themselves. However, maybe there is a GCode variant that allows this. In that case you are pretty much screwed with simple regexps, though ...

The example I made wouldn't match T2 in the FAT2 because it looks explicitly for the M6 right before or right after. So it won't match.

Quote from: Nominal Animal on October 17, 2019, 04:49:52 am

I notice that you assume each token is separated by whitespace yourself; how sure are you about that? I would not expect that to always be the case.

Yes, I do. I haven't seen a GCode variant that doesn't require this, but maybe there is one that does not. If the whitespace between the tokens is optional, then you have quite a task on your hands - what if e.g. a substring of an SD card filename or a message to be displayed to the operator matches a valid GCode instruction? Yikes ... That wouldn't be fun to parse at all.

Quote

The point is to convert non-strings to strings, intentionally.

If line is a bytearray, it does get converted to a string, from the ASCII character set. In some cases, for example when communicating over sockets, you need to use a bytearray buffer. Even in that case, G-code is still ASCII, not Unicode.

How can it be a byte array? Are you reading text files (GCode is a text file) in binary mode? If yes, why? And even then, the proper way to decode a byte buffer into a string is using:

Code: [Select]

line = buffer.decode("utf-8", "strict")

or if you explicitly want to reject unicode characters:

Code: [Select]

line = buffer.decode("ascii", "strict")

That will fail if it finds anything that can't be converted to plain ASCII, e.g. accented characters in comments as well. For that reason it may be more practical to use "ignore" or "replace" instead of "strict".
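
A quick sketch of how the error handlers differ. The sample bytes are invented: a comment containing a UTF-8 encoded accented letter, followed by a tool change.

```python
raw = b"( caf\xc3\xa9 ) T2 M6"

try:
    raw.decode("ascii", "strict")
    strict_result = "decoded"
except UnicodeDecodeError:
    strict_result = "UnicodeDecodeError"

ignored = raw.decode("ascii", "ignore")    # drops the offending bytes
replaced = raw.decode("ascii", "replace")  # U+FFFD for each offending byte
as_utf8 = raw.decode("utf-8", "strict")    # succeeds and keeps the accent

print(strict_result)
print(repr(ignored))
print(repr(replaced))
print(repr(as_utf8))
```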

By using decode() you are making it explicit that you want to decode a buffer into a string instead of a conversion of some (potentially completely arbitrary) object (that's what str() is intended for). Decode will explicitly fail if you try to invoke it on something that isn't of appropriate "decodable" type, str() will not.

Quote from: Nominal Animal on October 17, 2019, 04:49:52 am

Another option would be to use

if isinstance(line, (bytearray, GCodeLine)):
    line = str(line, encoding='ascii')
if not isinstance(line, str):
    raise ValueError("Cannot parse a %s as a g-code line" % type(line))

Perhaps you prefer the inane
if isinstance(line, bytearray):
    line = line.decode(encoding='ascii')
if isinstance(line, GCodeLine):
    line = str(line, encoding='ascii')
if not isinstance(line, str):
    raise ValueError("Cannot parse a %s as a g-code line" % type(line))
instead, which does the same thing, but is more better and Enterprisey, because it has more lines.

No, you have completely missed the point.

That code is simply wrong - there is no way you can get anything but a string there if you read the GCode file correctly (plus your original code introduces a nasty bug that hides when an invalid object gets passed, silently converting it to string instead).

Moreover, this style with type checking is unpythonic - if you really wanted to do it like this for whatever reason, the preferred way is to check whether you can do the operations you need (even byte buffers support strip(), split(), etc. for ex!), not whether something is a certain type. That allows for better reusability of the code. Python uses duck typing - it is better to check whether something can quack if you need it to quack than to check whether it is a duck (more things than ducks can quack).

This concept is used all over Python, e.g. a lot of functions work on sequences - which is an abstract thing (there is no type "sequence" in Python) satisfying a certain interface. The consequence is that those functions will work "automagically" on lists, tuples, strings, byte arrays, numpy arrays, etc ... The entire concept of protocols in Python is built on this idea. If you explicitly type check, none of that will work.

Also Python code is written in the EAFP style ("easier to ask for forgiveness than permission") - you simply try to do the operation you are trying to do and handle the eventual error (exception) instead of checking first. Again, duck typing and code reuse/generalization are the reasons. This is different than what C/C++ programmers are used to, because your program could crash, corrupt data, blow hardware up if you try to do something wrong there. In Python you only get an exception and handle it.
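
One way the EAFP idea could be applied to the str-versus-bytes question, as a sketch only. The function name as_text is hypothetical, and the choice to pass through anything without a decode() method is an assumption of this example.

```python
def as_text(line):
    """Return line as text, decoding it as ASCII if it offers decode()."""
    try:
        decode = line.decode
    except AttributeError:
        # No decode(): assume it is already text. If it is not, the first
        # str-only operation applied later will raise.
        return line
    return decode("ascii")


print(as_text(b"T2 M6").split())
print(as_text(bytearray(b"T2 M6")).split())
print(as_text("T2 M6").split())
```

Nothing here checks types up front; an unsuitable object such as an int fails at the first .split() with AttributeError.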

See a discussion on this here:
https://www.reddit.com/r/Python/comments/26irhg/why_is_type_checking_not_pythonic/

Quote from: Nominal Animal on October 17, 2019, 04:49:52 am

Quote from: janoc on October 16, 2019, 07:20:12 pm
This is not a good way to strip whitespace.
If you cared to read the comment above that line, it says "ASCII whitespace". G-code files are ASCII, not Unicode.

You have missed the point again.

While the files may be defined as ASCII, Python strings are Unicode. A lot of Windows software will write files that at first glance look like ASCII but aren't - e.g. because there is a byte order mark at the start of the file (some text editors do that). Or you may get Unicode characters in comments and strings - e.g. accented letters (just ask any French users ...). The machine will ignore that because it is in a comment but your parser won't. Also the fact that the files are supposed to be ASCII-only (and that e.g. UTF-8 isn't accepted) isn't defined anywhere - remember, there is no formal standard for GCode!

So you are basically introducing a potential bug and using more verbose code to do it instead of a simple strip() to boot.
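
On the byte order mark specifically, a small sketch with invented sample bytes. In Python 3, U+FEFF is not classed as whitespace, so neither form of strip() removes it; the standard "utf-8-sig" codec does.

```python
raw = b"\xef\xbb\xbfG20\r\n"   # UTF-8 byte order mark, then the first line

plain = raw.decode("utf-8")      # the mark survives as U+FEFF
sig = raw.decode("utf-8-sig")    # the codec drops the mark

print(repr(plain.strip()))
print(repr(sig.strip()))
```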

Nominal Animal · « **Reply #24 on:** October 17, 2019, 07:16:27 pm »

Quote from: janoc on October 17, 2019, 02:00:41 pm

Quote from: Nominal Animal on October 17, 2019, 04:49:52 am
I notice that you assume each token is separated by whitespace yourself; how sure are you about that? I would not expect that to always be the case.
Yes, I do. I haven't seen GCode variant that doesn't require this but maybe there is one that does not.

I asked, because the "optional skip operator", / , is optionally followed by a digit, and the examples on the web do not have a space between it and the following token. Then again, I'm not exactly sure if it should skip just that single token, or the entire line.

Quote from: janoc on October 17, 2019, 02:00:41 pm

If the whitespace between the tokens is optional, then you have quite a task on your hands - what if e.g. substring of an SD card filename or a message to be displayed to the operator matches a valid GCode instruction? Yikes ... That wouldn't be fun to parse at all.

It isn't as nasty as it may sound.

The lexer (the code that uses the compiled regular expressions _command and _comment) can emulate how a typical G-code parser parses the code. That is, it basically just needs to understand what the next token is, and extract it from the line.

It is perfectly fine to just keep those tokens as strings. I personally like to convert them to suitable tuple subclasses, for both ease of use and efficiency -- a tuple is compact in memory, and a tuple of tuples (to describe the tokens on a G-code line) is faster than lists or dicts. For ease of use, I am referring to helper methods in the G-code line tuple, locating desired tokens.

I've used this approach to parse (i.e., lexically separate, then convert to tuples) HPGL, and although Python I/O is not fast, it has quite satisfactory speed.

I have also used awk to parse HPGL, which can split each input line (record) into string fields using regular expressions. It works, but having to convert the string fields to values at each point of use, is a lot of repeated work. (Awk is also a bit funny in that it has no local variables, so helper functions often need to have funky names to not overwrite variables used elsewhere.)

In C, I would use structures with common initial members and a type tag, possibly with a pointer to the exact field contents. (This does involve at least one extra memory copy operation, since nul characters would be inserted between tokens in the input line, but even if using standard I/O character by character, it'll likely be faster than Python. The implicit conversions Python does between bytearray and str really slow it down.)

Quote from: janoc on October 17, 2019, 02:00:41 pm

How can it be a byte array?

Like I wrote, if you obtain the G-code data via a socket (a TCP/IP connection, a Unix domain socket, or even a character device), you almost always need to open that in binary mode. Instead of forcing the user to remember to convert the bytearray to str via decode(), I added the two lines I thought would avoid that misstep.
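
A sketch of the socket case using a local socket pair, so it runs without a network. The helper names read_lines and demo are made up. It shows that a binary-mode stream yields bytes, and that socket.makefile() can also wrap the socket in text mode with an explicit encoding.

```python
import socket


def read_lines(sock, text):
    """Read all lines from sock, as str if text is true, else as bytes."""
    if text:
        stream = sock.makefile("r", encoding="ascii", newline="\n")
    else:
        stream = sock.makefile("rb")
    with stream:
        return list(stream)


def demo(text):
    """Send two G-code lines through a local socket pair and read them back."""
    sender, receiver = socket.socketpair()
    with sender, receiver:
        sender.sendall(b"T2 M6\nM01\n")
        sender.shutdown(socket.SHUT_WR)  # signal end of data
        return read_lines(receiver, text)


raw_lines = demo(text=False)
text_lines = demo(text=True)
print(raw_lines)
print(text_lines)
```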

Look. You need to think about what kind of code others will write based on your code. Consider the case when a user supplies "a random object" to the GCodeLine() constructor. The only use case that makes sense, is when the user intends that data to be treated as an ASCII string, then parsed as G-code.
This is my assumption based on my experience on how others use awk and python code I've written to mangle HPGL.
I could be wrong, but that is the basis for that initial choice.

Nothing you have written thus far has been a convincing argument against that, assuming the code is used as a basis for development, and not the Holy Word on How Things Shall Be Done.

Quote from: janoc on October 17, 2019, 02:00:41 pm

While the files may be defined as ASCII, Python strings are Unicode.

When ASCII text is converted to Unicode, the set of possible whitespace is exactly those six code points.
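
This can be checked by listing which ASCII code points each type treats as whitespace. In CPython 3 there is a small wrinkle: str.isspace() (and so str.strip() without arguments) also accepts the four separator control characters 0x1C to 0x1F, while bytes.isspace() accepts exactly the six listed. The variable names are made up for this sketch.

```python
EXPLICIT = set("\t\n\v\f\r ")

# ASCII code points that str.isspace() treats as whitespace.
str_ws = {chr(i) for i in range(128) if chr(i).isspace()}

# ASCII code points that bytes.isspace() treats as whitespace.
bytes_ws = {chr(i) for i in range(128) if bytes([i]).isspace()}

extra = sorted(str_ws - EXPLICIT)
print([hex(ord(c)) for c in extra])
print(bytes_ws == EXPLICIT)
```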

Quote from: janoc on October 17, 2019, 02:00:41 pm

So you are basically introducing a potential bug and using more verbose code to do it instead of a simple strip() to boot.

Why so hostile? "Potential bug."

The problem is that Python converts text input from the character set used by the user's locale to Unicode. For example, if I have a file with \xA4 in it, my Python code will provide it in a string as U+20AC (€) if my locale uses ISO-8859-15, but as U+00A4 if my locale uses Windows-1252. Using UTF-8, Python will raise UnicodeDecodeError. Because of this, I wanted the code to strip only those Unicode characters that correspond to ASCII whitespace.
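
The same byte decoded under the three character sets mentioned, as a quick check using explicit codec names instead of the locale:

```python
raw = b"\xa4"

latin9 = raw.decode("iso-8859-15")  # U+20AC, euro sign
cp1252 = raw.decode("cp1252")       # U+00A4, currency sign

try:
    raw.decode("utf-8")
    utf8 = "decoded"
except UnicodeDecodeError:
    utf8 = "UnicodeDecodeError"

print(hex(ord(latin9)), hex(ord(cp1252)), utf8)
```

Passing encoding="ascii" (or another explicit codec) to open() avoids depending on the locale at all.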

I am interested in discussing what kind of choices make sense, but honestly, I'm getting pretty pissed off at those choices being called "bugs" even when I've already explained their rationale. Instead of discussing that, you keep calling the code "buggy" and "overly complicated". OP is doing this to learn, not to just catch tool changes!

I've tried being civil, and try to get something constructive going, but nothing seems to work with you, so I'll just ignore you from now on.

