If your Bash script is not localized, add
export LANG=C LC_ALL=C
near the beginning, so that input is treated as a sequence of bytes instead of a sequence of characters. This way there is no illegal input: every byte sequence is valid.
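To illustrate the effect, here is a minimal sketch (the variable names s and byte_len are only for illustration). In the C locale, string length is counted in bytes, so a two-byte UTF-8 sequence has length 2:

```bash
export LANG=C LC_ALL=C

s=$'\xc3\xa9'        # the two bytes that encode "é" in UTF-8
byte_len=${#s}       # in the C locale, this counts bytes, not characters
printf '%d\n' "$byte_len"    # prints 2
```

In a UTF-8 locale the same string would normally be counted as a single character, and a stray \xc3 byte on its own would be an invalid sequence.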
If the input may contain leading or trailing whitespace, or multiple consecutive spaces as field separators, or use tabs instead of spaces or even a mix, then I'd suggest a more complicated, but also much more generic variant:
Separator=$'[ \t]'
# Strip VarName
function Strip() {
    local -n dst="$1"
    local olddst="${dst##$Separator}"
    local newdst="${olddst%%$Separator}"
    while [[ ${#newdst} != ${#olddst} ]]; do
        olddst="${newdst##$Separator}"
        newdst="${olddst%%$Separator}"
    done
    dst="$newdst"
}
# FirstField SrcVarName DstVarName
function FirstField() {
    local -n src="$1" dst="$2"
    dst="${src%%$Separator*}"
    if [[ ${#dst} == ${#src} ]]; then
        src=""
    else
        src="${src#*$Separator}"
        local newsrc="${src##$Separator}"
        while [[ ${#newsrc} != ${#src} ]]; do
            src="$newsrc"
            newsrc="${src##$Separator}"
        done
    fi
}
# LastField SrcVarName DstVarName
function LastField() {
    local -n src="$1" dst="$2"
    dst="${src##*$Separator}"
    if [[ ${#dst} == ${#src} ]]; then
        src=""
    else
        src="${src%$Separator*}"
        local newsrc="${src%%$Separator}"
        while [[ ${#newsrc} != ${#src} ]]; do
            src="$newsrc"
            newsrc="${src%%$Separator}"
        done
    fi
}
The Separator is either a single separator, or the list of acceptable separators in square brackets.
Given a variable name, Strip removes any leading or trailing Separators.
Given source and destination variable names, FirstField and LastField split off the first or last field from the source variable into the destination variable, and update the source variable to hold the remainder.
In the Bash functions, the local -n foo="$1" makes foo a nameref-attributed variable: referring to it or modifying it actually refers to or modifies the variable named in the first parameter to the function. Nifty. (Namerefs require Bash 4.3 or later.)
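As a minimal sketch of the nameref mechanism on its own (Append and target are hypothetical names, not part of the functions above):

```bash
# Append VarName Suffix
# Hypothetical helper: appends Suffix to the variable whose NAME is given.
function Append() {
    local -n target="$1"     # target now aliases the caller's variable
    target+="$2"             # this modifies the caller's variable
}

Greeting="hello"
Append Greeting ", world"
printf '%s\n' "$Greeting"    # prints: hello, world
```

One caveat: the caller's variable must not have the same name as the nameref itself (here target), or Bash reports a circular name reference. The same applies to the names src, dst, and newsrc used in the functions above.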
Using Strip, FirstField, and LastField, the loop is
while IFS= read -r Record ; do
    Strip Record
    FirstField Record Object
    FirstField Record Package
    LastField Record Size
    LastField Record MD5Sum
    Pathname="$Record"
    printf 'Object=%s, Package=%s, Pathname=%s, MD5Sum=%s, Size=%s\n' "$Object" "$Package" "$Pathname" "$MD5Sum" "$Size"
done
For example, given (using \t to denote an embedded tab),
\tobj#4 pkg-4 \t Green Hills MULTI.rst\te.g. Green + space + Hills + space + MULTI.rst 7d99181a2b9b9e533c1470958e343d16 1029394
the above will output
Object=obj#4, Package=pkg-4, Pathname=Green Hills MULTI.rst\te.g. Green + space + Hills + space + MULTI.rst, MD5Sum=7d99181a2b9b9e533c1470958e343d16, Size=1029394
i.e., it correctly treats any run of linear whitespace (the characters listed in $Separator) as a single field separator.
In the general case, you use FirstField and LastField to "prune" the non-whitespace-containing fields off the input record, so that whatever is left is stored in the possibly-whitespace-containing field.
The easier option, however, and the one I would recommend, is to simply escape each tab or space that is not a separator by prepending a backslash to it. For example, given
Object Package Path\ or\ file\ name MD5Sum Size
a simple
IFS=$' \t'
while read Object Package Pathname MD5Sum Size ; do
    printf 'Object=%s, Package=%s, Pathname=%s, MD5Sum=%s, Size=%s\n' "$Object" "$Package" "$Pathname" "$MD5Sum" "$Size"
done
outputs
Object=Object, Package=Package, Pathname=Path or file name, MD5Sum=MD5Sum, Size=Size
because by default, the Bash read built-in supports escaping tabs and spaces by prepending a backslash to them. (Prepending a backslash before a newline is a line continuation: the Bash read built-in includes the continued lines in the record, but removes the backslash-newline pairs.)
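The difference is easy to see by reading the same line with and without -r. This is only a sketch; the Raw* and Joined variable names are made up for the comparison:

```bash
line='Object Package Path\ or\ file\ name MD5Sum Size'

# Without -r, a backslash escapes the following space, so it does not split fields.
IFS=$' \t' read Object Package Pathname MD5Sum Size <<< "$line"
printf 'Pathname=%s\n' "$Pathname"         # prints: Pathname=Path or file name

# With -r, backslashes are ordinary characters, so the path is split at its spaces.
IFS=$' \t' read -r RawObject RawPackage RawPathname RawRest <<< "$line"
printf 'RawPathname=%s\n' "$RawPathname"   # prints: RawPathname=Path\

# Backslash-newline is a line continuation: both physical lines form one record.
read Joined <<< $'first\\\nsecond'
printf 'Joined=%s\n' "$Joined"             # prints: Joined=firstsecond
```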
If the pathnames are generated by a Bash script, you can use the following function to add those backslashes to any shell variables; just call the function with the names of those variables as parameters:
function Backslashes() {
    local tab=$'\t'
    while [[ $# -gt 0 ]]; do
        local -n dst="$1"
        dst="${dst//\\/\\\\}"
        dst="${dst// /\\ }"
        dst="${dst//$tab/\\$tab}"
        shift 1
    done
}
For example, if you have FOO="Foo has spaces" and BAR=$'Bar\thas\t\ttabs', then Backslashes FOO BAR modifies FOO to Foo\ has\ spaces and BAR to $'Bar\\\thas\\\t\\\ttabs' (i.e., it adds a backslash before each space and tab, and doubles any existing backslashes). Newlines cannot be escaped using this method, unfortunately, because read treats a backslash-newline pair as a line continuation and removes it.
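A quick round trip shows that the escaped value survives being read back. This is a sketch only: the function is repeated so the snippet runs on its own, and Original, Escaped, and the sample record are made up for the test:

```bash
function Backslashes() {
    local tab=$'\t'
    while [[ $# -gt 0 ]]; do
        local -n dst="$1"
        dst="${dst//\\/\\\\}"       # double existing backslashes first
        dst="${dst// /\\ }"         # then escape spaces
        dst="${dst//$tab/\\$tab}"   # and tabs
        shift 1
    done
}

Original=$'My file\tname\\v2'       # contains a space, a tab, and a backslash
Escaped="$Original"
Backslashes Escaped

# Embed the escaped path in a record and parse it with a plain read (no -r).
IFS=$' \t' read Object Pathname Size <<< "obj#1 $Escaped 42"

if [[ "$Pathname" == "$Original" ]]; then
    printf 'round trip OK\n'
fi
```

Escaping the backslashes first matters: done last, it would also double the backslashes just added in front of spaces and tabs.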
For full support of actual backslash escape sequences (\0 \a \b \t \n \v \f \r \\ \' \" \ooo \xHH \uHHHH \UHHHHHHHH) and quoting (single and double quotes, with backslash escapes parsed only outside quotes and within double quotes), you can either use a slow and complicated Bash function, or preprocess the data using e.g. awk into a stream of NUL-separated fields. This does require a fixed number of output fields per record, but that way all possible paths and file names in Linux are supported, and your Bash script can handle all possible C strings (i.e., everything except strings with embedded NUL bytes) as fields.
For example, if the input contains five nul-separated fields (Object Package Pathname MD5Sum Size) per record, the following Bash loop,
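while IFS= read -rd "" Object ; do
    IFS= read -rd "" Package || break
    IFS= read -rd "" Pathname || break
    IFS= read -rd "" MD5Sum || break
    IFS= read -rd "" Size || break
    # Use Object Package Pathname MD5Sum Size ...
done
will read them very efficiently. The IFS= prefix on each read keeps leading and trailing whitespace in the fields intact, so the loop is not affected by the current value of IFS.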
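To try such a loop without a real preprocessor, the Bash printf built-in can stand in as the producer. This is a sketch with made-up sample values. It prefixes each read with IFS= so that leading and trailing whitespace in a field is kept:

```bash
Paths=()
Sizes=()
while IFS= read -rd "" Object ; do
    IFS= read -rd "" Package  || break
    IFS= read -rd "" Pathname || break
    IFS= read -rd "" MD5Sum   || break
    IFS= read -rd "" Size     || break
    Paths+=("$Pathname")
    Sizes+=("$Size")
done < <(printf '%s\0' \
    'obj#1' 'pkg-1' $'dir/with\nnewline and  spaces ' '7d99181a2b9b9e533c1470958e343d16' '1029394' \
    'obj#2' 'pkg-2' 'plain.txt' '00000000000000000000000000000000' '0')

printf 'Read %d records\n' "${#Paths[@]}"    # prints: Read 2 records
```

Process substitution (< <(...)) is used rather than a pipe so that the loop runs in the current shell and the arrays are still set afterwards.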
A more flexible option (not requiring a fixed number of fields per record) is to use a filter, say written in C, that converts the input into a stream of records. I suggest using ASCII US (31, unit separator) as the field separator, ASCII RS (30, record separator) as the record separator, and ASCII SUB (26, substitute) followed by a single digit ('0' for NUL, '2' for SUB, '6' for RS, and '7' for US) for substitutes ("escape sequences"), so that the following Bash loop can parse such records very efficiently:
SUB=$'\032' RS=$'\036' US=$'\037'
oldIFS="$IFS" ; IFS="$US"
while read -rd "$RS" -a Fields ; do
    for ((i = ${#Fields[@]} - 1; i >= 0; i--)) ; do
        temp="${Fields[i]//${SUB}6/$RS}"    # Record separators
        temp="${temp//${SUB}7/$US}"         # Field separators
        temp="${temp//${SUB}0/}"            # Skip embedded NULs since Bash can't handle them anyway
        Fields[i]="${temp//${SUB}2/$SUB}"   # Substitutes
    done
    # Use Fields ...
done
IFS="$oldIFS"
Except for embedded NUL bytes, this provides you with an array corresponding to the exact fields in each record.
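For testing the decoder without writing the C filter first, the encoding side can be sketched in Bash as well. EncodeRecord is a hypothetical helper (not the suggested filter), and the sample fields are made up. It escapes SUB first, so that the SUB bytes it adds for RS and US are not escaped again:

```bash
SUB=$'\032' RS=$'\036' US=$'\037'

# EncodeRecord Field...
# Hypothetical helper: writes one record, fields joined by US, terminated by RS.
function EncodeRecord() {
    local field out="" sep=""
    for field in "$@"; do
        field="${field//$SUB/${SUB}2}"   # literal SUB  ->  SUB '2' (must be first)
        field="${field//$RS/${SUB}6}"    # literal RS   ->  SUB '6'
        field="${field//$US/${SUB}7}"    # literal US   ->  SUB '7'
        out+="$sep$field"
        sep="$US"
    done
    printf '%s%s' "$out" "$RS"
}

Decoded=()
Counts=()
while IFS="$US" read -rd "$RS" -a Fields ; do
    for ((i = 0; i < ${#Fields[@]}; i++)) ; do
        temp="${Fields[i]//${SUB}6/$RS}"
        temp="${temp//${SUB}7/$US}"
        temp="${temp//${SUB}0/}"
        Fields[i]="${temp//${SUB}2/$SUB}"
    done
    Counts+=("${#Fields[@]}")
    Decoded+=("${Fields[@]}")
done < <(
    EncodeRecord $'a b\tc' "x${US}y${RS}z" "lit${SUB}6"
    EncodeRecord "only one field"
)

printf '%q\n' "${Decoded[@]}"    # show the decoded fields in shell-quoted form
```

Note that read drops a trailing empty field, so a record whose last field is empty arrives with one field fewer.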