Last Monday I developed a program to compare text files.
Even in C, there are multiple ways to read lines from a stream (a stream in the C sense: a file, pipe, socket, character device, or similar).
Most people learn to do this kind of low-level stuff with fgets(), opendir(), readdir(), and closedir(), but on Linux and other POSIXy systems that's the wrong approach: fgets() imposes a fixed maximum line length, and plain readdir() traversal is prone to issues if the directory tree is modified during the traversal.
One gets much better results with POSIX getline()/getdelim(), nftw(), scandir(), and glob(), and with the fts functions that originated in 4.4BSD but are nowadays available in many standard C libraries, including on Linux. With these, you get dynamic line length support and the most robust filesystem traversal. (With getdelim(&line, &size, '\0', stream) you can read nul-delimited streams, which is especially useful for handling file names and paths provided by a subprocess or utility, in the style of the find ... -print0 and xargs -0 utilities.)
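As a concrete illustration, here is a minimal sketch of the nul-delimited case: it reads file names from standard input (as produced by e.g. find . -print0) using getdelim(), and simply prints each one. Error handling is kept to the bare minimum, and the file name read0.c is just an example.

```c
/* Read NUL-delimited file names from standard input using POSIX getdelim().
   Compile with e.g. gcc -Wall -O2 read0.c -o read0, then run
   find . -print0 | ./read0 */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

int main(void)
{
    char   *name = NULL;   /* getdelim() allocates and grows this buffer */
    size_t  size = 0;
    ssize_t len;

    while ((len = getdelim(&name, &size, '\0', stdin)) != -1) {
        /* len includes the trailing NUL delimiter, if present; drop it. */
        if (len > 0 && name[len - 1] == '\0')
            len--;
        printf("%zd bytes: %s\n", len, name);
    }

    free(name);
    return 0;
}
```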
When random access to each line is required, it is more efficient to read or memory-map the entire file into one contiguous chunk of memory (via <unistd.h> open()/read()/close(), or more portably using <stdio.h> fopen()/fread()/fclose(); I've shown an example here), then separately terminate and trim each line, storing a pointer to each line in a separate array. This keeps the data contiguous in memory, and that data locality can really help cache behaviour when doing e.g. diff-type work. (Personally, I use the actual diff utility to do diffs, though, and read its output via a pipe or a socket.)
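To make the whole-file approach concrete, here is a rough sketch of the portable stdio variant: it slurps the file into one contiguous buffer (growing it in chunks, so it also works for non-seekable streams like pipes), then terminates each line in place and records a pointer to each line in a separate array. Buffer growth and error handling are deliberately simplistic.

```c
/* Read an entire file into one contiguous buffer, then index its lines. */
#include <stdio.h>
#include <stdlib.h>

int main(int argc, char *argv[])
{
    if (argc != 2) {
        fprintf(stderr, "Usage: %s FILE\n", argv[0]);
        return EXIT_FAILURE;
    }

    FILE *in = fopen(argv[1], "rb");
    if (!in) {
        perror(argv[1]);
        return EXIT_FAILURE;
    }

    /* Grow the buffer in 64k chunks; works for pipes as well as regular files. */
    char  *data = NULL;
    size_t used = 0, allocated = 0;
    for (;;) {
        if (used + 65536 > allocated) {
            allocated = used + 65536;
            data = realloc(data, allocated);
            if (!data) { perror("realloc"); return EXIT_FAILURE; }
        }
        size_t n = fread(data + used, 1, allocated - used, in);
        if (n == 0)
            break;
        used += n;
    }
    fclose(in);
    data[used] = '\0';   /* make sure the final line is terminated, too */

    /* Build the array of line pointers; the line data itself stays contiguous. */
    char  **line  = NULL;
    size_t  lines = 0;
    for (size_t i = 0; i < used; ) {
        line = realloc(line, (lines + 1) * sizeof *line);  /* crude growth, fine for a sketch */
        if (!line) { perror("realloc"); return EXIT_FAILURE; }
        line[lines++] = data + i;
        while (i < used && data[i] != '\n')
            i++;
        if (i < used)
            data[i++] = '\0';   /* terminate this line in place */
    }

    printf("%zu lines read.\n", lines);
    free(line);
    free(data);
    return EXIT_SUCCESS;
}
```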
Yet, all of that pales in comparison to the effect of properly choosing the algorithm one implements.
My favourite example (one I've described at least three times on these forums already) is the traditional
sort utility implementation in C.
The task is to read a source file, and output the lines in sorted order (ascending or descending, depending on command-line options).
The first question an implementer should ask is whether the utility will be used in background processes or interactively by humans, because the two have different definitions of "efficiency". For processes that run in the background, possibly only when the machine is otherwise idle, you want to minimize the CPU time and memory needed to accomplish the task. For human-facing tools, you want to minimize the wall-clock time used.
In a background-task utility, you'd read or memory-map the file, build an array of pointers to each line, sort that array of pointers using a suitable sorting function, and finally emit the lines in the sorted order.
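Assuming the line[]/lines pair built as in the earlier sketch, the batch version then boils down to a comparison function and a call to qsort(). (compare_lines() and emit_sorted() are just illustrative names.)

```c
/* Sort the array of line pointers and emit the lines in ascending order.
   Assumes char **line and size_t lines were built as in the previous sketch. */
#include <stdio.h>
#include <stdlib.h>
#include <string.h>

static int compare_lines(const void *a, const void *b)
{
    const char *const *sa = a;
    const char *const *sb = b;
    return strcmp(*sa, *sb);
}

static void emit_sorted(char **line, size_t lines)
{
    qsort(line, lines, sizeof line[0], compare_lines);
    for (size_t i = 0; i < lines; i++)
        puts(line[i]);
}
```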
To minimize the wall-clock time used by a sort utility, you trade a bit of CPU time to accomplish the task in a shorter elapsed real-world time. You do this by realizing that I/O, reading the data from storage, is the main cause of latency: real-world delay during which no "work" is done. So, instead of reading the whole file first, you read it line by line, and each line you obtain you stick into a self-sorting data structure. I use a binary heap, because the code is simple, and it yields a very efficient data structure for this even in the worst cases. (For uniform random input, each addition involves approximately e≃2.718 swaps in the tree, in the percolate-up stage after insertion.)
Effectively, by the time the last line has been read, nearly all of the sorting work has already been done, and the lines can be drained from the heap and output immediately. The binary heap (or whatever self-sorting structure you use) takes slightly more CPU time than an offline sort function, so you do end up using more CPU time overall; but that CPU time is spent while the process would otherwise be waiting for data to arrive from storage, so the real-world time taken by the entire process is shorter.
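Here is a rough sketch of that idea, using getline() to read and a simple binary min-heap of strings: each line is percolated up as soon as it arrives, and the heap is drained once input ends. Growth and error handling are simplified, and for descending order you would simply flip the comparisons.

```c
/* Read lines from standard input into a binary min-heap, then drain it
   to produce the lines in ascending order. */
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <string.h>
#include <sys/types.h>

static char  **heap = NULL;
static size_t  heap_used = 0, heap_size = 0;

static void heap_push(char *s)
{
    if (heap_used >= heap_size) {
        heap_size = heap_size ? 2 * heap_size : 1024;
        heap = realloc(heap, heap_size * sizeof heap[0]);
        if (!heap) { perror("realloc"); exit(EXIT_FAILURE); }
    }
    /* Percolate up: for random input this is only a few swaps on average. */
    size_t i = heap_used++;
    while (i > 0 && strcmp(s, heap[(i - 1) / 2]) < 0) {
        heap[i] = heap[(i - 1) / 2];
        i = (i - 1) / 2;
    }
    heap[i] = s;
}

static char *heap_pop(void)
{
    char   *top  = heap[0];
    char   *last = heap[--heap_used];
    size_t  i    = 0;
    /* Percolate down from the root. */
    while (2 * i + 1 < heap_used) {
        size_t child = 2 * i + 1;
        if (child + 1 < heap_used && strcmp(heap[child + 1], heap[child]) < 0)
            child++;
        if (strcmp(last, heap[child]) <= 0)
            break;
        heap[i] = heap[child];
        i = child;
    }
    heap[i] = last;
    return top;
}

int main(void)
{
    char   *line = NULL;
    size_t  size = 0;
    ssize_t len;

    /* Insert each line into the heap as soon as it arrives from the (slow) input. */
    while ((len = getline(&line, &size, stdin)) != -1) {
        if (len > 0 && line[len - 1] == '\n')
            line[len - 1] = '\0';
        char *copy = strdup(line);
        if (!copy) { perror("strdup"); return EXIT_FAILURE; }
        heap_push(copy);
    }
    free(line);

    /* Drain the heap: lines come out in ascending order. */
    while (heap_used > 0) {
        char *s = heap_pop();
        puts(s);
        free(s);
    }
    free(heap);
    return EXIT_SUCCESS;
}
```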
This kind of complete change in approach is
easier in higher-level scripting languages like Python, Ruby, or Lua, because there is less code, and therefore less "inertia": less perceived "cost" in trying something different, just to see if it works better. Converting code from one language to another usually yields poor results, because every language has its own "native" approach, its own paradigm for how things are done, "how things oughta look like". So don't do that either.
(I've mentioned I write a LOT of C code. Most of that is because I experiment with stuff, just to see if it works better. I don't always do it just for my own interest or for technical interest; I've also looked at e.g. what a linear algebra library API should look like in pure C, to let beginners in C write efficient code. In my observations on unsuspecting victims, er, test users, I found that using a scripting language with lightweight but efficient bindings to the native library yields better results, unless the point is to help them become better programmers, as opposed to just enhancing their workflow with customized tools.)
For the above
sort utility, I haven't checked what a Python version would look like, because as I said, Python I/O is relatively slow, so it isn't the proper language for implementing this. For scripting over text files/streams, I often reach for sed, gawk, or mawk, if there isn't a command-line tool already available to do the job. (mawk tends to be faster than gawk when you do not need the GNU awk extensions.) Some of the utilities, like
diff, implement surprisingly efficient algorithms, and are hard to beat in terms of speed and efficiency; others (like
sort) are more geared towards multifunction use, and can easily be beaten in specific scenarios using specific comparison metrics.