Author Topic: bash-scripts, how to handle files with spaces in the names? (Read 2902 times)

DiTBho · « **on:** November 04, 2021, 04:29:08 pm »

I'm having trouble with my bash scripts because some filenames contain spaces

like these

Borland Makefiles.rst
CodeBlocks.rst
CodeLite.rst
Eclipse CDT4.rst
Green Hills MULTI.rst <------- e.g. ---------- Green + space + Hills + space + MULTI.rst
Kate.rst
MSYS Makefiles.rst
MinGW Makefiles.rst
NMake Makefiles JOM.rst
NMake Makefiles.rst
Ninja Multi-Config.rst
Ninja.rst
Sublime Text 2.rst
Unix Makefiles.rst
VS_TOOLSET_HOST_ARCH.txt
Visual Studio 10 2010.rst
Visual Studio 11 2012.rst
Visual Studio 12 2013.rst
Visual Studio 14 2015.rst
Visual Studio 15 2017.rst
Visual Studio 16 2019.rst
Visual Studio 6.rst
Visual Studio 7 .NET 2003.rst
Visual Studio 7.rst
Visual Studio 8 2005.rst
Visual Studio 9 2008.rst
Watcom WMake.rst
Xcode.rst

I am working with a python text-based database where each line is a db-entry and looks like this

Code: [Select]

object package path/filename md5sum size

so for me

kind = line_get_arg1(line) (it can be object, symbolic link, etc)
package = line_get_arg2(line)
pfile = line_get_arg3(line) <------------ this only works if the filename doesn't contain spaces

The database was created by a Python engine, which is somehow able to handle files with spaces in the names. I cannot use the Python engine on this machine, though; I can only use bash or a program written in C.

borjam · « **Reply #1 on:** November 04, 2021, 04:31:44 pm »

You can surround the names with quotes or you can prepend a backslash to each space.

PKTKS · « **Reply #2 on:** November 04, 2021, 04:55:41 pm »

In PERL it is already built in and has at least 5 ways ..

ps please fix that ... this for quoted - arbitrary
~~qq ('fo%o$ bar not a easy method. $$$ ');~~

Use this operator instead for proper escaping
quotemeta('fo%o$ bar not a easy method. $$$ ');

In Python it requires importing the "re" module (part of the standard library)..

Code: [Select]

 import re
 re.escape('fo%o$ bar not a easy method. $$$ ')

Paul

PS alas if not entirely clear you can use that in a PIPE over bash...

Benta · « **Reply #3 on:** November 04, 2021, 06:45:06 pm »

In bash, file names with spaces are surrounded by single (or double) quotes.

DiTBho · « **Reply #4 on:** November 04, 2021, 07:35:08 pm »

yes, but ... I have to "extract" the filename from a single line of text read from a file.

DiTBho · « **Reply #5 on:** November 04, 2021, 07:38:42 pm »

Code: [Select]

object package path/filename md5sum size
object package path/filename md5sum size
object package path/filename md5sum size
...
object package path/filename md5sum size
object package path/filename md5sum size
object package path/filename md5sum size

Code: [Select]

   while IFS= read -r linea
   do
       linea_get_arg1 $linea
       kind="$ans"

       linea_get_arg2 $linea
       package="$ans"

       linea_get_arg3 $linea
       pfile="$ans"  # this only works if the filename doesn't contain spaces

   done < "$source"

Ian.M · « **Reply #6 on:** November 04, 2021, 07:39:56 pm »

If you can't modify the Python database engine to output the file with the filenames quoted, or with the spaces escaped in a way that bash can handle, then you'll have to deal with the file as-is. Fortunately it isn't ambiguous, as the filename is always followed by two more parameters, so the algorithm required would be to take the whole line and locate the separators before and after the filename, working back from the end of the line for the latter, then slice out the string you need.

How to do that in bash is another matter. I suspect it would be better to use an external utility such as grep or sed rather than trying to wrangle it with bash's native string handling.
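
For what it's worth, the slicing described above can also be sketched with plain bash parameter expansion. This is only a minimal sketch: it assumes fields are separated by single spaces and that the md5sum and size fields never contain spaces, and split_entry and its variable names are made up for illustration.

```bash
#!/usr/bin/env bash
# Hypothetical helper: split "kind package path md5sum size" into globals.
# Assumes single-space separators and no spaces in the last two fields.
split_entry() {
    local rest="$1"
    kind="${rest%% *}"      # text before the first space
    rest="${rest#* }"       # drop the first field
    package="${rest%% *}"   # text before the next space
    rest="${rest#* }"       # drop the second field
    size="${rest##* }"      # text after the last space
    rest="${rest% *}"       # drop the last field
    md5sum="${rest##* }"    # text after the (new) last space
    pfile="${rest% *}"      # whatever remains is the file name
}

split_entry 'object package-x Green Hills MULTI.rst 7d99181a2b9b9e533c1470958e343d16 1029394'
printf 'kind=[%s] pack=[%s] file=[%s] md5=[%s] size=[%s]\n' \
    "$kind" "$package" "$pfile" "$md5sum" "$size"
```

Because the last two fields are peeled off from the end, any spaces left in the middle stay part of the file name.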

golden_labels · « **Reply #7 on:** November 04, 2021, 07:41:42 pm »

Bash doesn’t care if there are spaces in file names. In fact it doesn’t care about any byte at all, except for the byte 0, which couldn’t be handled by utilities based on C strings anyway. You may even have newlines in a file name. If there are any issues with spaces, it’s nearly always because of improper variable expansion (without surrounding double quotes) or using the wrong tools (like ls to list files).

Therefore please provide the actual, minimal code example, that depicts your issue. So far the answer from me would be “it works here”.

The data format you have mentioned is already broken by design, though, as it will not be able to handle newlines. But with the constraints given, there is no reasonable way to address that, so I am only mentioning the problem.
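
The point about variable expansion can be illustrated with a small sketch. count_args is a hypothetical helper that only reports how many arguments it received, and the default IFS is assumed.

```bash
#!/usr/bin/env bash
# count_args is a hypothetical helper that reports how many arguments it got.
count_args() { echo "$#"; }

name='Green Hills MULTI.rst'
unquoted=$(count_args $name)     # word splitting: three separate words
quoted=$(count_args "$name")     # quoted: one argument, spaces preserved
echo "unquoted=$unquoted quoted=$quoted"
```

The file name itself is fine in both cases; only the unquoted expansion breaks it apart.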

DiTBho · « **Reply #8 on:** November 04, 2021, 07:55:28 pm »

Quote from: golden_labels on November 04, 2021, 07:41:42 pm

Therefore please provide the actual, minimal code example, that depicts your issue. So far the answer from me would be “it works here”.

the above code

on a text file like this

Code: [Select]

object package-x Green Hills MULTI.rst md5sum size
object package-x NMake Makefiles JOM.rst md5sum size
object package-x Visual_Studio_14_2015.rst md5sum size
object package-x Visual Studio 14 2015.rst md5sum size

(source.txt)

Full minimal code with service functions

Code: [Select]

function line_get_arg1()
{
   ans="$1"
}

function line_get_arg2()
{
   ans="$2"
}

function line_get_arg3()
{
   ans="$3"
}

function line_get_arg4()
{
   ans="$4"
}

source="source.txt"
while IFS= read -r line
do
    line_get_arg1 $line
    kind="$ans"

    line_get_arg2 $line
    package="$ans"

    line_get_arg3 $line
    pfile="$ans"  # this only works if the filename doesn't contain spaces

    echo "--------------------------"
    echo "kind = [$kind] "
    echo "pack = [$package] "
    echo "file = [$pfile] "

done < "$source"

it outputs

Code: [Select]

--------------------------
kind = [object]
pack = [package-x]
file = [Green] <-------------------------- wrong!!!!
--------------------------
kind = [object]
pack = [package-x]
file = [NMake] <-------------------------- wrong!!!!
--------------------------
kind = [object]
pack = [package-x]
file = [Visual_Studio_14_2015.rst] <-------------------------- correct!
--------------------------
kind = [object]
pack = [package-x]
file = [Visual]

golden_labels · « **Reply #9 on:** November 04, 2021, 08:03:34 pm »

Code: [Select]

 $ cat readEntries 
#!/usr/bin/env bash

while read object package tail; do
    unset fname
    hash=
    size=
    for next in $tail; do
        if [[ -z "$fname" ]]; then
            fname="$hash"
        else
            fname+=" $hash"
        fi
        hash="$size"
        size="$next"
    done
    echo '=== Entry ==='
    printf ' Object:    “%s”\n' "$object"
    printf ' Package:   “%s”\n' "$package"
    printf ' File name: “%s”\n' "$fname"
    printf ' Hash:      %s\n' "$hash"
    printf ' Size:      %s\n\n' "$size"
done

 $ cat inputs 
obj#9 pkg-1 Borland Makefiles.rst cdd7ae77e9686e441054887668690f3f 123004
obj#2 pkg-3 CodeBlocks.rst d8cb14f07836f07874f2ae11b27e095e 98941
obj#4 pkg-4 Green Hills MULTI.rst <------- e.g. ---------- Green + space + Hills + space + MULTI.rst 7d99181a2b9b9e533c1470958e343d16 1029394
obj#7 pkg-12 VS_TOOLSET_HOST_ARCH.txt ed60c6dc919f27ebd107041a3b16915d 149951   

 $ ./readEntries <inputs
=== Entry ===
 Object:    “obj#9”
 Package:   “pkg-1”
 File name: “Borland Makefiles.rst”
 Hash:      cdd7ae77e9686e441054887668690f3f
 Size:      123004

=== Entry ===
 Object:    “obj#2”
 Package:   “pkg-3”
 File name: “CodeBlocks.rst”
 Hash:      d8cb14f07836f07874f2ae11b27e095e
 Size:      98941

=== Entry ===
 Object:    “obj#4”
 Package:   “pkg-4”
 File name: “Green Hills MULTI.rst <------- e.g. ---------- Green + space + Hills + space + MULTI.rst”
 Hash:      7d99181a2b9b9e533c1470958e343d16
 Size:      1029394

=== Entry ===
 Object:    “obj#7”
 Package:   “pkg-12”
 File name: “VS_TOOLSET_HOST_ARCH.txt”
 Hash:      ed60c6dc919f27ebd107041a3b16915d
 Size:      149951

Note that this can’t deal with the aforementioned shortcomings of the data format. It will fail for some inputs. If possible, I would strongly recommend:

Using a properly structured data storage format, like JSON or XML.
Using proper extraction/manipulation tools, like jq or XML processors.
Switching to some sane language, like Python or Groovy. Bash is unsuitable for data processing.

DiTBho · « **Reply #10 on:** November 04, 2021, 09:09:16 pm »

@golden_labels
Perfect! Your approach worked so well that I converted the entire database and I also learned new things!
Thank you very much

PKTKS · « **Reply #11 on:** November 05, 2021, 10:55:28 am »

Quote from: golden_labels on November 04, 2021, 08:03:34 pm

(..)
Note that this can’t deal with the aforementioned shortcomings of the data format. It will fail for some inputs. If possible, I would strongly recommend:
Using a properly structured data storage format, like JSON or XML.
Using proper extraction/manipulation tools, like jq or XML processors.
Switching to some sane language, like Python or Groovy. Bash is unsuitable for data processing.

Nice bashing..
but true that it will fail when characters other than those of the code page being used are present...

Had a ton of issues converting from LATIN charset to plain ASCII using not only bash but also SQL itself. Usually the quotemeta operator solves the problem - just need to really quote the nasty chars UTF or whatever present.

JSON is not a problem as long as you decode it properly escaped

Code: [Select]


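# my simple way to load a properly escaped JSON
# (FLE is assumed to be an already opened filehandle)

use JSON qw( decode_json );
my @lines = <FLE>;
my $decoded = decode_json(join '', @lines);
# apply quotemeta() to individual strings taken from $decoded as needed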

Paul

Nominal Animal · « **Reply #12 on:** November 05, 2021, 12:08:42 pm »

If your Bash script is not localized, add
export LANG=C LC_ALL=C
near the beginning, so that input is considered a sequence of bytes, instead of sequence of characters. This way there is no illegal input.
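
As a small illustration of the byte-versus-character point (a sketch, not part of the original script): in the C locale the length of a string is counted in bytes, so a two-byte UTF-8 sequence has length 2.

```bash
#!/usr/bin/env bash
export LANG=C LC_ALL=C
s=$'\xc3\xa9'        # the two bytes that encode "é" in UTF-8
echo "${#s}"         # in the C locale the length is counted in bytes
```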

If the input may contain leading or trailing whitespace, or multiple consecutive spaces as field separators, or use tabs instead of spaces or even a mix, then I'd suggest a more complicated, but also much more generic variant:

Code: [Select]

Separator=$'[ \t]'

# Strip VarName
function Strip() {
    local -n dst="$1"
    local old="$dst" new
    while true; do
        new="${old#$Separator}"     # drop one leading separator, if any
        new="${new%$Separator}"     # drop one trailing separator, if any
        [[ ${#new} == ${#old} ]] && break
        old="$new"
    done
    dst="$new"
}

# FirstField SrcVarName DstVarName
function FirstField() {
    local -n src="$1" dst="$2"
    dst="${src%%$Separator*}"
    if [[ ${#dst} == ${#src} ]]; then
        src=""
    else
        src="${src#*$Separator}"
        local newsrc="${src##$Separator}"
        while [[ ${#newsrc} != ${#src} ]]; do
            src="$newsrc"
            newsrc="${src##$Separator}"
        done
    fi
}

# LastField SrcVarName DstVarName
function LastField() {
    local -n src="$1" dst="$2"
    dst="${src##*$Separator}"
    if [[ ${#dst} == ${#src} ]]; then
        src=""
    else
        src="${src%$Separator*}"
        local newsrc="${src%%$Separator}"
        while [[ ${#newsrc} != ${#src} ]]; do
            src="$newsrc"
            newsrc="${src%%$Separator}"
        done
    fi
}

The Separator is either a single separator, or the list of acceptable separators in square brackets.
Given a variable name, Strip removes any leading or trailing Separators.
Given a source and destination variable names, FirstField and LastField split off the field from the source variable into the destination variable, and update the source variable.

In the Bash functions, the local -n foo="$1" makes foo a nameref-attributed variable: referring to it or modifying it actually refers to or modifies the variable named in the first parameter to the function. Nifty.
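
A tiny stand-alone sketch of the nameref mechanism, assuming Bash 4.3 or later; append_suffix is a hypothetical function used only for illustration.

```bash
#!/usr/bin/env bash
# Requires Bash 4.3 or later for namerefs.
# append_suffix is a hypothetical function used only for illustration.
append_suffix() {
    local -n target="$1"    # target now aliases the caller's variable
    target+="$2"            # so this modifies the caller's variable
}

filename='Green Hills MULTI'
append_suffix filename '.rst'
echo "$filename"
```

Strip, FirstField and LastField rely on the same mechanism to update the variables whose names they are given.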

Using the above, the loop is

Code: [Select]

while read Record ; do
    Strip Record
    FirstField Record Object
    FirstField Record Package
    LastField Record Size
    LastField Record MD5Sum
    Pathname="$Record"
    printf 'Object=%s, Package=%s, Pathname=%s, MD5Sum=%s, Size=%s\n' "$Object" "$Package" "$Pathname" "$MD5Sum" "$Size"
done

For example, given (using \t to denote an embedded tab),
\tobj#4 pkg-4 \t Green Hills MULTI.rst\te.g. Green + space + Hills + space + MULTI.rst 7d99181a2b9b9e533c1470958e343d16 1029394
the above will output
Object=obj#4, Package=pkg-4, Pathname=Green Hills MULTI.rst\te.g. Green + space + Hills + space + MULTI.rst, MD5Sum=7d99181a2b9b9e533c1470958e343d16, Size=1029394
i.e., handle correctly all linear whitespace (listed in $Separator) as a single field separator.

In the general case, you use FirstField and LastField to "prune" the non-whitespace-containing fields off the input record, so that whatever is left is stored in the possibly-whitespace-containing field.

The easier option, however, and the one I would recommend, is to simply prepend each tab or space that is not a separator with a backslash. For example, given
Object Package Path\ or\ file\ name MD5Sum Size
a simple
IFS=$' \t'
while read Object Package Pathname MD5Sum Size ; do
    printf 'Object=%s, Package=%s, Pathname=%s, MD5Sum=%s, Size=%s\n' "$Object" "$Package" "$Pathname" "$MD5Sum" "$Size"
done
outputs
Object=Object, Package=Package, Pathname=Path or file name, MD5Sum=MD5Sum, Size=Size
because by default, the Bash read built-in supports escaping tabs and spaces by prepending a backslash to them. (Prepending a backslash before a newline is a line continuation: the Bash read built-in includes the continued lines in the record, but removes the backslash-newline pairs.)
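
The difference between the default behaviour and read -r can be seen in a short sketch (the variable names here are only illustrative, and the default IFS is assumed):

```bash
#!/usr/bin/env bash
entry='Object Package Path\ or\ file\ name MD5Sum Size'

# Without -r, a backslash protects the following space from field splitting.
read a b c d e <<< "$entry"
echo "escaped: [$c]"

# With -r, backslashes are literal and the spaces split the fields.
read -r a2 b2 c2 d2 e2 <<< "$entry"
echo "raw:     [$c2]"
```

So the escaped format only works as long as the reading side does not use -r.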

If the pathnames are generated by a Bash script, you can use the following function to add those backslashes to any shell variables; just call the function with the names of those variables as parameters:

Code: [Select]

function Backslashes() {
    local tab=$'\t'
    while [[ $# -gt 0 ]]; do
        local -n dst="$1"
        dst="${dst//\\/\\\\}"
        dst="${dst// /\\ }"
        dst="${dst//$tab/\\$tab}"
        shift 1
    done
}

For example, if you have FOO="Foo has spaces" and BAR=$'Bar\thas\t\ttabs', then Backslashes FOO BAR modifies FOO to Foo\ has\ spaces and BAR to $'Bar\\\thas\\\t\\\ttabs' (i.e., adds backslashes before spaces and tabs). Note that newlines cannot be escaped using this method, unfortunately.

For full support of actual backslash escape sequences (\0 \a \b \t \n \v \f \r \\ \' \" \ooo \xHH \uHHHH \UHHHHHH) and quoting (single- and double-quotes, with backslash escapes only parsed outside quotes and within double quotes), you can either use a slow and complicated Bash function, or preprocess the data using e.g. awk, into a stream of nul-separated fields. This does require a fixed number of output fields per record, but that way all possible paths and file names in Linux are supported, and your Bash script can handle all possible C strings (i.e., everything except strings with embedded NUL bytes) as fields.

For example, if the input contains five nul-separated fields (Object Package Pathname MD5Sum Size) per record, the following Bash loop,

Code: [Select]

# IFS= keeps leading and trailing whitespace in each field
while IFS= read -rd "" Object ; do
    IFS= read -rd "" Package || break
    IFS= read -rd "" Pathname || break
    IFS= read -rd "" MD5Sum || break
    IFS= read -rd "" Size || break

    # Object Package Pathname MD5Sum Size

done

will read them very efficiently, and is not affected by IFS.
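
A quick round trip through such a stream might look like the sketch below. write_record is a hypothetical helper (not part of the original post) that emits each argument followed by a NUL byte; the path name deliberately contains a newline and a trailing space.

```bash
#!/usr/bin/env bash
# write_record is a hypothetical helper: it prints each argument followed by
# a NUL byte, which is the stream layout the reading loop expects.
write_record() { printf '%s\0' "$@"; }

# IFS= is set per read so that leading/trailing whitespace is kept as-is.
while IFS= read -rd "" Object ; do
    IFS= read -rd "" Package || break
    IFS= read -rd "" Pathname || break
    IFS= read -rd "" MD5Sum || break
    IFS= read -rd "" Size || break
    printf 'Pathname=%q Size=%s\n' "$Pathname" "$Size"
done < <(write_record 'obj#4' 'pkg-4' $'Green Hills\nMULTI.rst ' \
         7d99181a2b9b9e533c1470958e343d16 1029394)
```

The %q format prints the path name in a shell-quoted form, which makes the embedded newline and trailing space visible.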

A more flexible option (not requiring a fixed number of fields per record) is to use a filter, say written in C, that converts the input into a stream of records. I suggest using ASCII US (31, unit separator) for field separator, ASCII RS (30, record separator) for record separator, and ASCII SUB (26, substitute) followed by a single digit ('0' for NUL, '2' for SUB, '6' for RS, and '7' for US) for substitutes ("escape sequences"), so that the following Bash loop can parse such records very efficiently:

Code: [Select]

oldIFS="$IFS" ; IFS=$'\037'
while read -rd $'\036' -a Fields ; do
    for ((i = ${#Fields[@]} - 1; i >= 0; i--)) ; do
        temp=${Fields[i]//$'\0326'/$'\036'} # Record separators
        temp=${temp//$'\0327'/$'\037'} # Field separators
        temp=${temp//$'\0320'/} # Skip embedded NULs since Bash can't handle them anyway
        Fields[i]=${temp//$'\0322'/$'\032'} # Substitutes
    done
    # Use Fields ... 
done
IFS="$oldIFS"

Except for embedded NUL bytes, this provides you with an array corresponding to the exact fields in each record.
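
For testing the loop without writing the C filter first, the encoding side of the US/RS/SUB scheme can be sketched in Bash as well. encode_field and write_record are hypothetical names, and NUL bytes are not handled because Bash variables cannot hold them.

```bash
#!/usr/bin/env bash
SUB=$'\032' RS=$'\036' US=$'\037'

# encode_field VarName: hypothetical helper that escapes one field in place.
encode_field() {
    local -n field="$1"
    field="${field//$SUB/${SUB}2}"   # must come first: escape SUB itself
    field="${field//$RS/${SUB}6}"    # record separator inside the data
    field="${field//$US/${SUB}7}"    # field separator inside the data
}

# write_record FIELD...: emit the encoded fields joined by US, ended by RS.
write_record() {
    local f sep=""
    for f in "$@"; do
        encode_field f
        printf '%s%s' "$sep" "$f"
        sep="$US"
    done
    printf '%s' "$RS"
}
```

SUB has to be escaped before RS and US, otherwise the SUB bytes introduced by the later replacements would be escaped a second time. Calling write_record once per record and concatenating the output gives a stream in the layout described above.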

golden_labels · « **Reply #13 on:** November 05, 2021, 08:08:51 pm »

Quote from: PKTKS on November 05, 2021, 10:55:28 am

but true that it will fail when characters other than those of the code page being used are present...

It will fail in much simpler situations: newlines, U+000D, U+0009, anything interpretable as a glob pattern, possibly also specific file names containing a very large number of space-separated components. I have probably missed half of the other cases. One may attempt to solve each of those issues, but then new edge cases appear and we end up with an unmaintainable monstrosity that can’t even be easily verified to actually work. Which is why I haven’t attempted fixing it and, instead of addressing the shortcomings, resolved to add a clear warning and a suggestion to use tools that can actually deal with the problem properly.
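
The glob pattern case is easy to reproduce. The sketch below works in a scratch directory created just for the demonstration (the file names are made up) and shows that set -f suppresses this one failure mode; it does nothing for the others listed above.

```bash
#!/usr/bin/env bash
# Demonstrates the glob hazard of an unquoted expansion in a scratch directory.
scratch=$(mktemp -d)
cd "$scratch" || exit 1
touch a.rst b.rst

tail='*.rst cdd7ae77 123'      # entry for a file literally named "*.rst"

expanded=()
for next in $tail; do expanded+=("$next"); done     # "*.rst" matches files

set -f                          # disable pathname expansion
literal=()
for next in $tail; do literal+=("$next"); done      # "*.rst" stays as-is
set +f

echo "with globbing: ${#expanded[@]} words, without: ${#literal[@]} words"
cd / && rm -rf "$scratch"
```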

DiTBho · « **Reply #14 on:** November 05, 2021, 08:33:00 pm »

With your help, guys, yesterday I converted all the files(1) with a filter that replaces ' ' with '_', and the database was updated accordingly.

Problem solved (for now)

(1) 2194 files
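
The actual filter was not posted, so the following is only a guess at what such a renaming step could look like. rename_spaces is a hypothetical helper; it handles the files directly inside one directory, skips names that would collide, and does not touch the database entries.

```bash
#!/usr/bin/env bash
# rename_spaces DIR: hypothetical helper that renames files directly inside
# DIR, replacing every space in the file name with an underscore.
rename_spaces() {
    local dir="$1" path name new
    for path in "$dir"/*; do
        [[ -f "$path" ]] || continue            # regular files only
        name="${path##*/}"                      # strip the directory part
        new="${name// /_}"                      # spaces -> underscores
        [[ "$name" == "$new" ]] && continue     # nothing to rename
        [[ -e "$dir/$new" ]] && continue        # never overwrite a file
        mv -- "$path" "$dir/$new"
    done
}
```

It would be called as rename_spaces followed by the directory that holds the .rst files.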


EEVblog Electronics Community Forum
