Author Topic: Python text file processing - extracting data based on data found  (Read 6275 times)


Offline eliocor

Re: Python text file processing - extracting data based on data found
« Reply #25 on: October 17, 2019, 09:44:02 pm »
Maybe this is too simple a solution, but let's try it.
Install the pygcode library (https://github.com/fragmuffin/pygcode):
pip install pygcode
and use this simple script:
Code: [Select]
from pygcode import Line

# read gcodes from the file 'part.gcode':
with open('part.gcode', 'r') as fh:
    commands = []  # empty list to collect the gcode commands

    # scan the whole file for gcode commands
    for line_text in fh:
        line = Line(line_text)  # decode one line of gcode

        # print(line)              # would print the line (with cosmetic changes)
        # line.block.gcodes        # the list of gcodes on this line
        # line.block.modal_params  # parameters not assigned to a gcode,
        #                          # assumed to be motion modal parameters

        if line.block.words:  # the list of gcode words is not empty
            commands.extend(line.block.words)  # add the words to the list

        # if line.block.modal_params:
        #     print("M ", line.block.modal_params)
        # if line.block.gcodes:
        #     print("G ", line.block.gcodes)

        if line.comment:
            # line.comment.text holds the comment text itself
            commands.append(line.comment)  # add the comment to the list

    # scan is done: print the whole list of commands, one per line:
    print(*commands, sep='\n')

It will scan/parse the 'part.gcode' file and build a list in which each element is a separate gcode command.
You can then walk that list, check (using a regexp) whether a token matches what you are searching for, and (as you requested) scan backward/forward through the list to see if your other token (Mxx, Ty) is nearby.
Not an elegant/efficient solution, but a workable one.
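For example, here is a rough sketch of that backward/forward scan over the commands list built above. The exact string form of each entry depends on how pygcode renders its Word and Comment objects, so treat the patterns as illustrative:
Code: [Select]
import re

# illustrative patterns: a tool change (M6/M06/...) and a tool select (T<number>)
tool_change = re.compile(r'^M0*6$')
tool_select = re.compile(r'^T(\d+)$')

for i, cmd in enumerate(commands):
    if tool_change.match(str(cmd)):
        # look a few entries backward and forward for the matching T word
        for j in range(max(0, i - 3), min(len(commands), i + 4)):
            m = tool_select.match(str(commands[j]))
            if m:
                print('tool change to T%s near list index %d' % (m.group(1), i))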
 
P.S.: the parser is smarter than my script shows: you can uncomment some of the commented lines to get some hints.
 

Offline Nominal Animal

Re: Python text file processing - extracting data based on data found
« Reply #26 on: October 18, 2019, 12:56:33 pm »
Good point, eliocor.  If one were to parse, say, CSV or XML files in Python, one should definitely use the existing libraries too.
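As a minimal sketch of why, using only the standard library (the data here is made up): the csv module already copes with quoting and embedded delimiters that a naive line.split(',') would get wrong:
Code: [Select]
import csv, io

data = io.StringIO('code,comment\nM6,"tool change, spindle stop"\n')
for row in csv.reader(data):
    print(row)
# ['code', 'comment']
# ['M6', 'tool change, spindle stop']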

Even if one wants to write a parser from scratch, say for learning purposes, or just to get a better understanding of g-code, looking at existing code, and especially the issues others have found, is always useful.

The comments in the library's code also reveal a lot about different g-code dialects, and how different machines parse and process g-code.  Reading them is quite informative even if you never write a parser of your own.
 

Offline eliocor

Re: Python text file processing - extracting data based on data found
« Reply #27 on: October 18, 2019, 04:17:47 pm »
BTW, the library will normalize the gcode, converting it from e.g. 'M6' to 'M06', which eases the token analysis!
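A quick way to see this for yourself, reusing the Line class from the script above (the exact rendering depends on the pygcode version, so treat the output as indicative):
Code: [Select]
from pygcode import Line

line = Line('T2 M6 (tool change)')
print(' '.join(str(w) for w in line.block.words))  # words, with e.g. 'M6' normalized to 'M06'
print(line)  # the whole line, re-rendered in normalized form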
 

Offline janoc

Re: Python text file processing - extracting data based on data found
« Reply #28 on: October 18, 2019, 04:35:23 pm »
Quote
It isn't as nasty as it may sound.

The lexer (the code that uses the compiled regular expressions _command and _comment) can emulate how a typical G-code parser parses the code.  That is, it basically just needs to understand what the next token is, and extract it from the line.
...

Sure, I have never said it can't be done. However, if I faced a task like that where this was possible, I would completely forget about futzing around with regexps and splitting strings, and would instead use a proper parser library to write the lexer, e.g. PLY: https://www.dabeaz.com/ply/ply.html, textX: https://github.com/textX/textX, or maybe pyparsing: https://github.com/pyparsing/pyparsing. It would be much cleaner and more maintainable.
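For a flavor of what that looks like, here is a tiny sketch with pyparsing 3.x. The grammar is deliberately simplistic, only meant to show the idea, not to cover real G-code dialects:
Code: [Select]
import pyparsing as pp

# one G-code "word": a single address letter immediately followed by a number
letter = pp.Char('GMTNFSXYZIJKR')
number = pp.Regex(r'[+-]?\d+(\.\d+)?')
word = pp.Combine(letter + number)

# comments in parentheses or after a semicolon are suppressed
comment = pp.Suppress(pp.Regex(r'\(.*?\)|;.*'))

gcode_line = pp.ZeroOrMore(comment | word)

print(gcode_line.parse_string('N10 T2 M6 (change tool)').as_list())
# ['N10', 'T2', 'M6']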

Or look at that gcode parsing library someone mentioned above.

However, that's getting waaay too far ahead of what the OP was after.

Quote
How can it be a byte array?
Like I wrote, if you obtain the G-code data via a socket (a TCP/IP connection, an Unix domain socket, or even a character device), you almost always need to open that in the binary mode.  Instead of forcing the user to remember to convert the bytearray to str via decode(), I added the two lines I thought would avoid that misstep.

Look.  You need to think about what kind of code others will write based on your code.  Consider the case when a user supplies "a random object" to the GCodeLine() constructor.  The only use case that makes sense, is when the user intends that data to be treated as an ASCII string, then parsed as G-code.
This is my assumption based on my experience on how others use awk and python code I've written to mangle HPGL.
I could be wrong, but that is the basis for that initial choice.


The problem is that Python isn't awk. If someone gets a binary buffer from a socket then the proper way is to run decode("ascii") on it and be done with it. Not blindly convert random objects to strings all over the place. The issue is not the user intentionally passing e.g. an object to the GCodeLine constructor but doing it by mistake (easy to do, with Python being dynamically typed and function arguments not having types declared). Instead of getting an immediate exception your code will happily convert it and continue running - making for a ton of head scratching later when trying to figure out why it isn't doing what it is supposed to do. Imagine the "fun" of finding a bug like that where only 2-3 lines of a huge file get corrupted/misparsed like this. That's what makes your approach a really terrible example.
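To make that concrete, here is a minimal sketch of the defensive alternative. GCodeLine is just an illustrative stand-in here, not the actual class discussed earlier in the thread:
Code: [Select]
class GCodeLine:
    def __init__(self, data):
        # accept only str and bytes-like input; fail fast on anything else
        if isinstance(data, (bytes, bytearray)):
            data = data.decode('ascii', 'replace')
        elif not isinstance(data, str):
            raise TypeError('expected str or bytes, got %s' % type(data).__name__)
        self.text = data.strip()

GCodeLine(b'G01 X2\n')  # fine: decoded and parsed
GCodeLine('M6 T2')      # fine: already a string

try:
    GCodeLine(['M6', 'T2'])  # a mistake: passing a list
except TypeError as exc:
    print('caught:', exc)    # fails fast instead of silently parsing "['M6', 'T2']"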

Quote
Problem is that Python converts text input from the character set used by the user's locale to Unicode.  For example, if I have a file with \xA4 in it, my Python code will provide it in a string as U+20AC (€) if my locale uses ISO-8859-15, but as U+00A4 if my locale uses Windows-1252.  Using UTF-8, Python will raise UnicodeDecodeError.  Because of this, I wanted the code to strip only those Unicode characters that correspond to ASCII whitespace.

If you had done the conversion correctly, i.e. used decode("ascii", "replace"), then you wouldn't get a decoding error on the characters that aren't valid ASCII (it would replace them with the U+FFFD replacement character, or delete them if you use "ignore" instead of "replace"), and you wouldn't need to work around one bug by introducing a potential second one.

Heck, even your approach with str() can do it, because str() takes the errors argument too, so str(buffer, "ascii", "replace") works as well - with the caveat that it will happily convert arbitrary objects (not just the intended buffers) and hide bugs in the code, as pointed out elsewhere.
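A quick demonstration of the difference between the error handlers (the byte \xa4 is just an arbitrary non-ASCII example):
Code: [Select]
raw = bytearray(b'G01 X2 \xa4 M6\n')

print(raw.decode('ascii', 'replace'))  # bad byte becomes U+FFFD (the replacement character)
print(raw.decode('ascii', 'ignore'))   # bad byte is silently dropped
print(str(raw, 'ascii', 'replace'))    # same result via the str() form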

Quote
I am interested in discussing what kind of choices make sense, but honestly, I'm getting pretty pissed off at those choices being called "bugs" even when I've already explained their rationale.  Instead of discussing that, you keep calling the code "buggy" and "overly complicated".  OP is doing this to learn, not to just catch tool changes!

I've tried being civil, and try to get something constructive going, but nothing seems to work with you, so I'll just ignore you from now on.  :-+

I am sorry? You do realize that explaining the rationale for something doesn't make the code any less incorrect, right?  I am trying to be constructive here, explaining my reasoning at length and giving examples of how it should actually be done instead. I guess you missed that part, being busy getting offended. But what do I know, having only used Python professionally for some 18-something years ...

I rest my case, I hope the OP got what they needed. rx8pilot, feel free to PM me if you have any other Python-related questions.
 

Offline eliocor

Re: Python text file processing - extracting data based on data found
« Reply #29 on: October 18, 2019, 07:48:01 pm »
Quote
Or look at that gcode parsing library someone mentioned above. 
However, that's getting waaay too far ahead of what the OP was after.

 
I'm sorry, but it is EXACTLY what the OP asked for:
 
Quote
I also used what appears to be a simpler method of .find() where I could get the index position of a string and presumable walk around that index until I find what I am looking for.

 
Maybe not elegant, but good enough, without any ranting about it!
 

Offline rx8pilotTopic starter

Re: Python text file processing - extracting data based on data found
« Reply #30 on: October 18, 2019, 11:33:59 pm »
Quote
Or look at that gcode parsing library someone mentioned above. 
However, that's getting waaay too far ahead of what the OP was after.

 
I'm sorry, but it is EXACTLY what the OP asked for:

For clarity's sake, parsing G-code is the exercise I chose to learn how to parse just about any text data. The primary end product is:

Code: [Select]
from eevblog import skills

skills.python(rx8pilot)

At the end of the day, I am looking to have a solid toolbox to develop code to parse all sorts of data where g-code is simply the first step. I had found the pygcode library but was not able to fully understand it at first. After spending a number of hours studying general Python syntax and libraries, I think I can follow along now and learn something from how the author approached the task. Others have already suggested stepping through other parsing libraries for various formats - I agree that is a good way to expose some of the key concepts and how they are practically implemented in a Python environment.

Now that the weekend is here..... I can leave all the C coding at the office and dive back into Python. This conversation has warmed up nicely and I look forward to some learning experiments.
Factory400 - the worlds smallest factory. https://www.youtube.com/c/Factory400
 

Offline SparkyFX

Re: Python text file processing - extracting data based on data found
« Reply #31 on: October 18, 2019, 11:43:47 pm »
Quote
While it is possible to create a regular expression that would grab only the ones that end with 6 and ignore everything else, it will be needlessly complex, and it is unlikely you will only ever be interested in the M6 command. If you are going to look for M1, M4 or others later, you will have to define a specific regexp for each = pain in the butt to write, slow (matching regexps is fairly expensive) and unmaintainable code, with a ton of regexps that do almost the same thing, differing only in the value they are searching for.
What you are searching for is the regex "word boundary", "\b": the zero-width position between a word character ("\w", i.e. alphanumerics and underscore) and a non-word character. It splits the text exactly where you want (at non-alphanumeric characters).

So lines with M6 or M06 in it would be found using this regex:
Code: [Select]
/\bM[0]*6\b/

Of course it is also possible to build the pattern from a variable, as in Perl's /\b$str\b/, where $str can itself be a regex of its own.
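Since this thread is about Python, here is the same idea in Python syntax. The /.../ delimiters above are Perl-style and not part of the pattern itself, and the file name is taken from the earlier examples:
Code: [Select]
import re

tool_change = re.compile(r'\bM0*6\b')  # matches M6, M06, M006, ... but not M16 or M60

with open('part.gcode') as fh:
    for line in fh:
        if tool_change.search(line):
            print(line.rstrip())

# building the pattern from a variable, the Python equivalent of /\b$str\b/:
code = 'M6'
pattern = re.compile(r'\b' + re.escape(code) + r'\b')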
« Last Edit: October 18, 2019, 11:47:21 pm by SparkyFX »
Support your local planet.
 

