To list the line count and the name of each script, I used
find /usr/bin /usr/sbin /bin /sbin -maxdepth 1 -type f -perm /o+x -print0 | (T=0; while IFS= read -r -d '' NAME ; do T=$((T+1)); file "$NAME" | grep -qe 'shell script' || continue ; wc -l "$NAME" ; done ; printf '%d files total\n' "$T" >&2 ) | awk ' NF>1 { L=L+$1 ; n=n+1 ; print } END { printf "%d scripts containing %d lines\n", n, L }'
And that's proof that bash sucks for anything other than basic stuff.
Well, it does use find, bash, file, grep, wc, and awk to get the work done.
You need to first construct a list of executable files. In Linux, all bytes except NUL and the slash are valid in file names, so I used -print0 to print the path to each file, NUL-separated.
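As a minimal sketch of the NUL-safe reading side (the single directory here is just an example, and bash is assumed for the process substitution), the shell end of such a pipe looks like this:

```shell
#!/bin/bash
# Read NUL-separated names safely: IFS= keeps leading/trailing
# whitespace, -r keeps backslashes, and -d '' splits at NULs.
count=0
while IFS= read -r -d '' name; do
    count=$((count + 1))
done < <(find /usr/bin -maxdepth 1 -type f -print0)
printf '%d files\n' "$count"
```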
Next, you need to pick out the files that are shell scripts: just the task for file and grep -qe 'shell script', as the former's output includes 'shell script' for both POSIX ('POSIX shell script') and Bash ('Bourne-Again shell script') scripts, and the latter returns success if the pattern is found, and failure otherwise. I used a Bash/POSIX shell loop, with continue skipping all non-matching file types. For shell scripts, we want the line count, which wc -l was designed to report; its output is one line per file, with the line count first and then the file name or path. The loop also counts the total number of files seen, and prints that count to standard error. For the count to survive the loop, the entire loop section must run in a single subshell.
Finally, the awk part counts the number of shell-script files and sums up their line counts, printing each input record as it goes.
Because any file name with a newline in it will make the wc output confuse awk, it does not actually work for all possible file names, though.
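To see the failure mode concretely, create a file whose name contains a newline (in a throwaway directory from mktemp); wc then emits what looks like two records to any line-oriented reader:

```shell
# A file name containing a newline splits the wc output across two
# lines, so a line-oriented awk reader miscounts records.
dir=$(mktemp -d)
printf 'echo hi\n' > "$dir/bad
name.sh"
wc -l "$dir/bad
name.sh"        # the count and the name now span two output lines
rm -rf "$dir"
```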
This is better:
find /usr/bin /usr/sbin /bin /sbin -maxdepth 1 -type f -perm /o+x -print0 | xargs -r0 file -00 | gawk 'BEGIN { RS=FS="\0" } { files++; name=$0; getline; if ($0 !~ /shell script/) next; scripts++; n=0; RS="\n"; while ((getline < name) > 0) n++; close(name); RS="\0"; lines+=n; printf "%9d %s\n", n, name } END { printf "%d files, of which %d shell scripts, containing %d lines\n", files, scripts, lines }'
or, split into logical lines:
find /usr/bin /usr/sbin /bin /sbin -maxdepth 1 -type f -perm /o+x -print0 \
| xargs -r0 file -00 \
| gawk '
    BEGIN {
        RS = FS = "\0"
    }
    {
        files++
        name = $0
        getline
        if ($0 !~ /shell script/)
            next
        scripts++
        n = 0
        RS = "\n"
        while ((getline < name) > 0)
            n++
        close(name)
        RS = "\0"
        lines += n
        printf "%9d %s\n", n, name
    }
    END {
        printf "%d files, of which %d shell scripts, containing %d lines\n", files, scripts, lines
    }'
The "find /usr/bin /usr/sbin /bin /sbin -maxdepth 1 -type f -perm /o+x -print0 | xargs -r0 file -00" part of the command produces a sequence of "filename\0type\0", i.e. file name and type as text, separated by NULs. It does this by piping the NUL-separated file names or paths (-print0) to xargs, which splits them back apart (-0 means NUL-separated, and -r says not to run the command if there are no parameters to supply) and executes file -00 with each file name or path as a separate parameter, as many as fit at a time. The -00 option tells file to output a single NUL after each file name, and another after the type of the preceding file.
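You can inspect that filename\0type\0 stream by translating the NULs to newlines; /bin/sh is assumed to exist here, which holds on virtually every Unix system:

```shell
# Each file -00 record is two NUL-terminated fields: the file name,
# then the type description. tr makes the boundaries visible.
file -00 /bin/sh | tr '\0' '\n'
```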
Other than executing and piping the commands together, it does not use Bash at all.
gawk can handle NUL-separated records by simply setting RS="\0". I also set the field separator to the same, because we don't want gawk to waste time splitting records (or lines) into fields. The main rule is applied to each executable file found in those directories. If file did not report it as a shell script, it is only counted (as a file, not as a script). For shell scripts, we count the number of lines by reading the file line by line (temporarily using newline as the record separator), print the per-script count, and update the tallies. The END rule is applied after all input has been processed, and it prints the summary.
In POSIX C, we could use nftw() to walk the trees, or scandir() to obtain the list of files in each directory. After checking the stats of each file (to make sure we only look at regular, other-executable files, matching the -type f -perm /o+x tests above), we can read the first line using getline(), and if it looks like a valid shebang line for bash, dash, ash, or sh, count the number of lines in the file (by reading each line); otherwise, ignore it. Count the number of files, the number of scripts, and the number of lines in script files, print them, and you're done.
Do note that while the shell script stanza is concise, it also uses more than one process at a time. xargs will buffer file names so that we execute file the fewest number of times; this is faster than executing it once for each file (which you can do with xargs -r0 -n 1 file -00 instead). The piped processes run in parallel, all at the same time, which means that on multi-core machines at least three cores are used, if available. To do the same in C, you'd need to use threads, since otherwise a single core (at a time) performs all the tasks. (You could use e.g. popen(), but executing anything once for each file found will be slow.)
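The batching itself is easy to observe; in this small sketch, sh -c 'echo "$#"' simply reports how many arguments each invocation received:

```shell
# Without -n, xargs packs all three names into one invocation:
printf 'a\0b\0c\0' | xargs -r0 sh -c 'echo "$#"' sh      # prints 3
# With -n 1, the command runs once per name:
printf 'a\0b\0c\0' | xargs -r0 -n 1 sh -c 'echo "$#"' sh # prints 1 three times
```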