Author Topic: Books on Basic and Advanced Algorithms to program in C/C ++  (Read 11666 times)


Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #75 on: July 11, 2018, 08:38:31 pm »
Whether new languages will gradually supplant C for systems programming is yet to be determined.
I'm mostly interested in systems programming (that is, kernels and low level libraries), and HPC (molecular dynamics simulations with classical potentials).

Since we are talking about algorithms, one of my core interests is related to Verlet lists (or neighbour lists); in particular, enhancing them for both vectorization and cache locality.  The key point about neighbour/Verlet lists is that rather than calculating all particle pairs, you maintain a list of neighbours for each particle. The classical potentials have a cut-off distance, above which a pair does not interact. If you (re)construct the neighbour/Verlet list to include all neighbours within the cut-off distance plus twice some delta, you can keep using the same list until at least one particle has moved by delta.

The simplest enhancement I've developed is to allow reordering the neighbours in the list, while keeping a parallel array of distances (and (unit) direction vectors to the neighbours within interaction distance). The list is not kept sorted by distance; instead, the maximum interaction distance acts like the pivot in Quicksort. Interactions are then evaluated only for the initial section, i.e. the particles that are within interaction distance.  This allows increasing the delta (at the cost of increasing the neighbour list size), while calculating the actual interactions only for truly interacting particle pairs, rather than for all those in the neighbour list.  (So, you calculate the distance, or rather the distance squared, for all neighbours, but the actual interaction only for the neighbours the particle really interacts with.)
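To make the partitioning pass concrete, here is a minimal C sketch of the idea, under assumptions of my own choosing (a flat xyz coordinate array, one neighbour index list per particle); the function name and data layout are made up for illustration, not taken from any particular MD code:
Code: [Select]
#include <stddef.h>

/* Partition one particle's neighbour list so that the neighbours within the
 * interaction cut-off occupy the initial section; essentially one Quicksort
 * partitioning pass with rcut2 (cut-off squared) as the pivot.
 * pos[] holds coordinates as pos[3*k+0..2]; nbr[] is this particle's
 * neighbour index list; dist2[] is a parallel scratch array.
 * Returns the number of interacting neighbours (length of the initial section). */
static size_t partition_neighbours(const double *pos, size_t i,
                                   size_t *nbr, double *dist2,
                                   size_t count, double rcut2)
{
    const double xi = pos[3*i+0], yi = pos[3*i+1], zi = pos[3*i+2];
    size_t n = 0;  /* end of the interacting section */

    for (size_t k = 0; k < count; k++) {
        const size_t j = nbr[k];
        const double dx = pos[3*j+0] - xi;
        const double dy = pos[3*j+1] - yi;
        const double dz = pos[3*j+2] - zi;
        const double r2 = dx*dx + dy*dy + dz*dz;

        if (r2 <= rcut2) {
            nbr[k] = nbr[n];   /* swap neighbour k into the interacting section */
            nbr[n] = j;
            dist2[n] = r2;     /* keep the squared distance in the parallel array */
            n++;
        }
    }
    return n;  /* evaluate the actual interactions for nbr[0 .. n-1] only */
}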

In theory, that's all there is to that algorithm. It's not "new" or "useful" enough to get published. Implementing it in Fortran is relatively straightforward, although the extra arrays needed (for keeping the distances and/or delta vectors) are a bit annoying.  Implementing it in C opens up a whole lot of new possibilities, like using an ASIC/FPGA to calculate the distances at double precision, and another to look up the potential and force using piecewise cubic polynomials with very large (but constant) coefficient tables... perfect for asymmetric multiprocessing.



Looks exactly the same to me. You know as they say: " ... don't multiply entities beyond necessity".
They are normal C constructs. Using intrinsics in C code is very different from, say, inlining assembly code.

Or, put another way, you can teach a new C programmer to use the intrinsics and vectorize computation, without teaching them assembly.

Depends on what you're doing. May be rare for some people and frequent for others. It's silly to assign labels without reviewing the particular situation.
It was an opinion/estimate based on the C sources of dozens, if not hundreds, of open-source projects.  I do believe I've written well over half a million lines of C code myself, too.

Besides, the two examples you had - calculating the vector length and multiplying matrices - are not that far from the sum of the array in terms of complexity.
And yet, no C compiler can properly vectorize such code, which was my point.  Writing the vectorization explicitly, via functions that operate on specific types that contain multiple data values, yields a significant speedup.  In my opinion, that is because the model underlying C does not allow, and cannot allow, the transformations necessary. Even when C compilers can do the transformation, it typically requires a very specific C pattern to be used; essentially, the C compiler detects the pattern, and implements it using a pre-prepared code template. That is like translating between languages with a phrasebook: you are limited to the phrases in the book.
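To show what writing the vectorization explicitly looks like, here is a small sum-of-doubles routine using SSE2 intrinsics; a sketch only (unaligned loads, no unrolling, no AVX). Note that accumulating in two lanes changes the order of the additions, which is exactly the reordering a conforming C compiler may not do on its own:
Code: [Select]
#include <emmintrin.h>  /* SSE2 intrinsics; x86/x86-64 only */
#include <stddef.h>

/* Sum of an array of doubles, two lanes at a time. */
double sum_sse2(const double *x, size_t n)
{
    __m128d acc = _mm_setzero_pd();
    size_t i = 0;

    for (; i + 2 <= n; i += 2)
        acc = _mm_add_pd(acc, _mm_loadu_pd(&x[i]));

    /* Horizontal reduction of the two lanes. */
    double sum = _mm_cvtsd_f64(acc)
               + _mm_cvtsd_f64(_mm_unpackhi_pd(acc, acc));

    for (; i < n; i++)  /* scalar tail */
        sum += x[i];

    return sum;
}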

Compiler options that tell the compiler to break C rules just emphasize that point, in my opinion.

Say you need to know distances with 1mm precision and your biggest distance is 100 km
Then I'd use 32-bit integers, not floating-point types: 100 km is 10^8 mm, which fits comfortably in a signed 32-bit integer. Just like you do not use floating-point types for financial calculations either.

A better example is when you are told to e.g. calculate the area of a non-self-intersecting 2D polygon to within some number of significant figures. For a polygon with n vertices, this is a sum of 2n products. If you use floating-point numbers, and the polygon is located far from the origin, you'll lose most of the precision in domain cancellation errors; that is, each pair of products contains values very similar in magnitude, but of different signs.

There are several ways of fixing that. One is to find the center of the polygon, and subtract it from all vertices before calculating the products. Another is to use a Kahan sum to calculate the sum of the products. Yet another is to save the products in an array and sort them, so you can calculate the sum in descending order of magnitude of the summands.  Of these, the Kahan sum is the cheapest to implement (essentially free on current architectures), and very likely to provide sufficient accuracy.
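For reference, the Kahan (compensated) sum is only a few lines of C; this is the generic textbook version, nothing specific to the polygon example:
Code: [Select]
#include <stddef.h>

/* Kahan summation: the compensation term c captures the low-order bits lost
 * in each addition.  Note that a compiler allowed to apply real-number
 * algebra would simplify the correction away entirely, which is why C's
 * strict floating-point evaluation rules matter here. */
double kahan_sum(const double *x, size_t n)
{
    double sum = 0.0, c = 0.0;
    for (size_t i = 0; i < n; i++) {
        double y = x[i] - c;   /* corrected next term */
        double t = sum + y;    /* low-order bits of y are lost here... */
        c = (t - sum) - y;     /* ...and recovered here */
        sum = t;
    }
    return sum;
}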

I had in mind the likes of 1000-dimensional vectors.
Those tend to be rare in my experience, unless the vectors are sparse (as in sets of (dimension, value) pairs). Operations on them tend to be memory-bandwidth limited.

The square-rooting takes real long.
On AMD Ryzen, a gathering load into a vector register takes longer than a vectorized square root; the square root takes roughly the same time as two vectorized divisions.

Obviously, it varies a lot between implementations, but in general terms, square root is just a couple of times to a few times as expensive as a division, with addition and multiplication being only limited by memory bandwidth.

I would guess the compiler would figure 3-dimensional vectors out.
Why would you guess that? They haven't yet, even though we've had SSE and support for 4-component floats in a single vector register for almost two decades now.

BTW: Intel already had horizontal addition/subtraction in SSE3.
HADD[PS][SD] is strictly speaking half or quarter of a horizontal addition; similarly for [V]HSUB[PS][SD]. On most implementations, it is as slow as vectorized division, much slower than vectorized addition or multiplication.  This is not just a quirk of the implementations; it is very hard to make it fast.

C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.
Exactly.  Among the results required by the abstract machine are sequence points: points in time at which the side effects of each operation must be visible in memory.  This is a severe limitation, because there is no way of specifying that two or more operations (or sequences of operations) can be done in any order.

There are two solutions - allow compiler to re-arrange floats, or use integers wherever possible.
A much better solution than either of those is to use a construct similar to a C99 cast, and specify that without such casts, the compiler is free to simplify expressions using ordinary arithmetic rules. In C99 and later, (type)(expression) restricts the value of expression to the range and precision of the type type.

This means that in such a programming language, (B - A) + A is equivalent to B, but (double)(B - A) + A requires the compiler to ensure that the difference, before the addition, is limited to double range and precision.  The notation could be better, however: the cast expressions seem to be pretty hard for humans to parse correctly.

For some reason you quoted my post where I was talking about Intel (although you did cut it off right before the "Intel" word).
I apologize if I caused confusion: my only intent is to help here.  (Please, do not read anything into my "tone" either.  I strive for clarity, precision, and usefulness; that is all.  Consider me autistic in my communications that way.)

I probably should explain that I view the SSE2/3/4/AVX/AVX512 "extensions" to the x86-64 hardware architecture as a separate arithmetic-logic unit, one which is pretty accurately modeled as a helper coprocessor, rather than an integral part of the CPU.

Especially if you look at kernel-level C code, it becomes a useful way of considering things.  The SSE/AVX register state alone is huge, and storing it and loading it back takes a considerable number of cycles.  One reason the Linux kernel in particular avoids it is that it makes context switches (between userspace and kernel space) faster: if the kernel does not touch the SSE/AVX state, the state only needs to be stored/restored when switching from one userspace thread/process to another.
« Last Edit: July 12, 2018, 05:23:15 am by Nominal Animal »
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #76 on: July 11, 2018, 11:50:35 pm »
A better example is when you are told to e.g. calculate the area of a non-self-intersecting 2D polygon to within some number of significant figures. For a polygon with n vertices, this is a sum of 2n products. If you use floating-point numbers, and the polygon is located far from the origin, you'll lose most of the precision in domain cancellation errors; that is, each pair of products contain values very similar in magnitude, but of different signs.

There are several ways of fixing that. One is to find the centroid of the polygon, and subtract it from all vertices before calculating the products. Another is to use Kahan sum to calculate the sum of the products. Yet another is to save the products in an array and sort them, so you can calculate the sum in descending order of magnitude of the summands.  Of these, the Kahan sum is the cheapest to implement (essentially free on current architectures), and very likely to provide sufficient accuracy.

First, if you have a polygon with an area of a few square millimetres located 100 km from the origin, and the coordinates of the vertices are stored in 32-bit floats, the information is already lost, and there's absolutely no way to recover the precision for calculating the area of the polygon.

Then, you do not have to calculate the position of the centroid. You can just calculate everything relative to the position of any (arbitrary) vertex. Or, if you do want something "central", then instead of the centroid, you can simply calculate the average vertex coordinate - sum all the X-coordinates up and divide by the number of points, then do the same with the Y-coordinates  (but be careful - this approach is rare and will make your program horribly inefficient  :-DD).

I had in mind the likes of 1000-dimensional vectors.
Those tend to be rare in my experience, unless the vectors are sparse (as in sets of (dimension, value) pairs). Operations on them tend to be memory-bandwidth limited.

It depends on your application. 30 years ago I used to work with really big multi-dimensional arrays and matrices. Would be so easy now :)

I would guess the compiler would figure 3-dimensional vectors out.
Why would you guess that?

I don't know. I always thought that one of the SIMD targets is graphics. I may be wrong, of course.

C standard operates on an abstract machine. The goal of the compiler is to produce the same result as the abstract machine. As soon as this holds, the compiler is free to do whatever it pleases.
Exactly.  One of the results required by the abstract machine are sequence points: points in time at which side effects of each operation must be visible in memory.  This is a severe limitation, because there is no way of specifying that two or more operations (or sequences of operations) can be done in any order.

Memory is not a concern - until you access a variable, it doesn't matter what it contains or even if it exists at all. That's why they have "volatile" - to tell the compiler that the variable may be accessed by external processes which the compiler doesn't know anything about.
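A minimal illustration (the register address and bit are made up for the example): without volatile, the compiler would be free to hoist the read out of the loop and spin forever on a stale value.
Code: [Select]
#include <stdint.h>

/* Hypothetical memory-mapped status register of some peripheral. */
#define STATUS_REG (*(volatile uint32_t *)0x40001000u)
#define READY_BIT  (1u << 0)

void wait_until_ready(void)
{
    /* volatile forces an actual read of the register on every iteration */
    while (!(STATUS_REG & READY_BIT))
        ;
}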

(double)(B - A) + A

if A and B are "float" then this cast doesn't prevent big A from cancelling small B.

I probably should explain that I view the SSE2/3/4/AVX/AVX512 "extensions" to the x86-64 hardware architecture as a separate arithmetic-logic unit, one which is pretty accurately modeled as a helper coprocessor, rather than an integral part of the CPU.

This is how it was historically - the 8087 was physically a separate unit. Then they took it in, the 8087 donated its registers to MMX, then they replicated the MMX functionality with SSE and so on. But these things are still optional. I have a small fanless mini-server running Linux on an AMD processor. It is relatively new, but it doesn't have SSE.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #77 on: July 12, 2018, 07:16:22 am »
First, if you have a polygon with the area of few mm located 100 km from the origin
You don't need to.  Consider a triangle with vertices at (1000, 1000), (1002, 1000), and (1000,1002). The area is computed as
Code: [Select]
A = 0.5*( 1000*1000 - 1000*1002 + 1002*1002 - 1000*1000 + 1000*1000 - 1002*1000 )
  = 0.5*( 1000000 - 1002000 + 1004004 - 1000000 + 1000000 - 1002000 )
  = 2
If your coordinate type has four significant digits, you need eight significant digits to describe each product. Domain cancellation occurs when you sum such products and some of the terms cancel each other. It is a very well known numerical problem, and as I already explained, it is easily avoided by removing the common bias before calculating the products.

Also, I corrected a mistake I made in my previous post: I did mean center when I referred to the centroid; i.e. the mean of vertex coordinates. (It was late, and I was careless. Sorry.) 

With regards to floating-point numbers, precision refers to significant figures; i.e. relative precision. Absolute precision is usually referred to as accuracy, or since ISO 5725, "trueness".

It depends on your application. 30 years ago I used to work with really big multi-dimensional arrays and matrices. Would be so easy now :)
It is cheaper to throw more hardware at a computational problem, yes.  But to truly take advantage of the hardware, you cannot rely on the C compiler.  Yes yes, I know the argument: the time it takes me to write and run my computation is worth more than the resource usage wasted by that inefficient code.  (In most cases, yes.  That is why it is perfectly okay to use e.g. Matlab for numerical computations on your own machine.)

That is not at all the point here.  The point is that programming languages, even low-level ones, have not kept up with the possibilities provided by the hardware.  We could do better.

I always thought that one of the SIMD targets is graphics.
I'm not sure what they are targeted at, really.  I do know that graphics cards' approach is very, very different. Both CUDA and OpenCL (C-like languages allowing one to do general purpose numerical computing with graphics cards) are optimized for four-component vectors, but usually limited to single precision.

if A and B are "float" then this cast doesn't prevent big A from cancelling small B.
Right; and that was my point. If we had a language where cancellation is allowed (because the compiler can optimize the entire mathematical expression in a way that drops the reference to A altogether), we would only need a way to specify that some sub-expression must be evaluated at some specific precision and range.  That is exactly how e.g. Kahan summation works.

It would mean that if you wrote A + B - A, some compilers would evaluate it as B, and others might evaluate it as if it was written (double)(A + B) - A or as A + (double)(B - A). The point is, the compiler would be allowed to choose, unless you used a specific notation (the cast here is just an example) to restrict the rules. Lazily written formulae would not reproduce exact same results on all compilers, but there would always be a way to specify the formula exactly (and you only need the precision-and-range restriction operator to do that with floating-point types).

This is how it was historically - the 8087 was physically a separate unit.
Yes. My point was that asymmetric multiprocessing (using different types of "cores" simultaneously) is nothing new, but it is cropping up everywhere nowadays, much more so than symmetric multiprocessing.

(I would love to have an AMD Ryzen Threadripper, though, just to do scientific computation at home. But HPC is kind of a not-so-common target; most devices nowadays are phones, tablets, and laptops, or desktop machines of comparable performance.)

(SSE2 is included in the x86-64 instruction set. If you run on a 64-bit Intel/AMD x86 architecture processor, you always have SSE2.)



Getting back to the topic at hand, none of the current programming languages have a natural way of implementing asymmetric multiprocessing.  Only some high-level languages support coroutines with human-friendly expressions, although they are quite easy to implement at the hardware level. Our low-level languages (compiled languages without runtime binaries, supporting freestanding environments like kernels) do not have easy, natural ways to express these constructs at all.  Heck, even atomic access and the memory model are a recent (less than a decade old) addition to C, and arguably look like a tacked-on bulge, rather than a natural part of the language.

When studying algorithms and their implementation, it is (in my opinion) important to also realize how the language used affects the implementation, behaviour, and efficiency of the algorithm.  To do that, you do need to understand the benefits and limitations of the language used.  (You could say you need to perceive the box before you can think outside the box, but I'm not sure if that is an accurate way of putting it.)

For example, the filesystem access in C is very rudimentary. You can open files, creating them if they do not exist; read and write those files; rename files; and delete files.  There is no way to e.g. list the contents of a directory in standard C.  You need to either use operating system specific extensions, or rely on another standard, usually POSIX.1.  However, because POSIX.1 is not taught as a complete spec on its own, developers are often taught to use opendir()/readdir() etc. to scan directory contents, which is problematic on current systems where the contents of directories may change mid-scan.  POSIX.1 provides nftw(), scandir(), and glob(), which are perfect for scanning directory contents (as in they should behave in a predictable manner when the directory contents change during scanning), and aside from the DOS-era Borland C library, are available whenever opendir()/readdir()/etc. are.
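For example, a minimal POSIX.1 directory listing using scandir() with the standard alphasort() comparator looks like this (error handling kept to the bare minimum):
Code: [Select]
#define _POSIX_C_SOURCE 200809L
#include <dirent.h>
#include <stdio.h>
#include <stdlib.h>

/* List the entries of a directory in alphabetical order.  scandir() returns
 * a snapshot as an allocated array, so iterating over the result is not
 * affected by later changes to the directory. */
int list_directory(const char *path)
{
    struct dirent **names;
    int n = scandir(path, &names, NULL, alphasort);  /* NULL filter = all entries */
    if (n < 0) {
        perror("scandir");
        return -1;
    }
    for (int i = 0; i < n; i++) {
        printf("%s\n", names[i]->d_name);
        free(names[i]);  /* each entry is individually allocated */
    }
    free(names);
    return 0;
}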

Even the standard C I/O is quite deficient; especially how the scanf() family of functions behaves on incorrect numerical input. (I still don't get how Microsoft got their "safe" function versions into an annex to C11, with no mention of getline() anywhere.)
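For comparison, the POSIX.1 getline() idiom reads lines of arbitrary length into a single buffer that it grows as needed; a small sketch:
Code: [Select]
#define _POSIX_C_SOURCE 200809L
#include <stdio.h>
#include <stdlib.h>
#include <sys/types.h>

/* Copy input to output line by line; getline() reallocates the buffer
 * whenever a longer line is encountered. */
void echo_lines(FILE *in)
{
    char *line = NULL;
    size_t size = 0;
    ssize_t len;

    while ((len = getline(&line, &size, in)) != -1)
        fwrite(line, 1, (size_t)len, stdout);

    free(line);  /* one buffer, reused and grown by getline() */
}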
« Last Edit: July 12, 2018, 07:18:15 am by Nominal Animal »
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19447
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #78 on: July 12, 2018, 07:53:33 am »
When studying algorithms and their implementation, it is (in my opinion) important to also realize how the language used affects the implementation, behaviour, and efficiency of the algorithm.  To do that, you do need to understand the benefits and limitations of the language used.  (You could say you need to perceive the box before you can think outside the box, but I'm not sure if that is an accurate way of putting it.)

Even if not strictly accurate, it is pithy, thought provoking - and no worse than the wretched "think outside the box" mantra!

More importantly, the first part of that paragraph is accurate.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8632
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #79 on: July 12, 2018, 10:17:05 am »
I don't know. I always thought that one of the SIMD targets is graphics. I may be wrong, of course.
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high-throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension), but only seemed to support some limited image processing (something rather different from graphics) reasonably well. They did a really poor job with anything else considered multi-media at the time, as Intel's own example code showed clearly. The early SSE stuff seemed so focussed on building upon the original integer MMX that the only really effective thing it brought was an efficient replacement for the x87 for SISD computing. It took them years to respond to the needs of most people doing a lot of computation, and to get the alignment issues under control. I don't remember seeing any code for early versions of SSE where I felt that the algorithm was what the hardware designers had really tried to do well. I don't think these people were ever in touch with their inner product.  :)
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #80 on: July 12, 2018, 10:40:24 am »
If I recall correctly, one of the use cases for x86 SIMD extensions was speeding up discrete cosine transforms and inverse discrete cosine transforms; both 1D (used in audio compression) and 2D (used in video compression and JPEG images).
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #81 on: July 12, 2018, 01:31:54 pm »
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension) ...

In AMD, the first SIMD extension was called "3DNow".
 

Offline NorthGuy

  • Super Contributor
  • ***
  • Posts: 3138
  • Country: ca
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #82 on: July 12, 2018, 05:15:39 pm »
Also, I corrected a mistake I made in my previous post: I did mean center when I referred to the centroid; i.e. the mean of vertex coordinates. (It was late, and I was careless. Sorry.)

Center is the one requiring the sum, which you frown upon :)

Getting back to the topic at hand, none of the current programming languages have a natural way of implementing asymmetric multiprocessing.  Only some high-level languages support coroutines with human-friendly expressions, although they are quite easy to implement at the hardware level. Our low-level languages (compiled languages without runtime binaries, supporting freestanding environments like kernels) do not have easy, natural ways to express these constructs at all.  Heck, even atomic access and memory model is a recent (less than a decade old) addition to C, and arguably looks like a tacked-on bulge, rather than a natural part of the language.

A long time ago people wrote in assembler. It is quite tedious. So people who did that a lot started using macros. If you looked at their assembler programs, you would see more macros than commands. But macros were not that flexible in some areas. For example, it was still very tedious to calculate expressions. So, someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike other languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.

All the high-level concepts may be traced back to simple things. For example, C++ objects are structures containing a pointer to a table of virtual functions. But there may be other representations - such as Objective-C, where each object provides a set of functions to access and manage its member functions. Even though C++ and Objective-C are both object-oriented, the underlying structure is very different. If you use C++, you must subscribe to the C++ object model. C gives you more flexibility because you are not tied to any particular model and can implement your objects as you please. For example, look at the Linux driver code. Is it really wise to force a certain model on the programmer through language extensions, such as C++?
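To show how little machinery that actually is, here is the same idea hand-rolled in plain C: the object carries a pointer to a table of function pointers, which is essentially what a C++ compiler generates behind the scenes. Every name below is made up for the illustration:
Code: [Select]
#include <stdio.h>

struct shape;  /* forward declaration */

struct shape_vtable {                        /* the "virtual function table" */
    double (*area)(const struct shape *);
    const char *(*name)(const struct shape *);
};

struct shape {                               /* base "class" */
    const struct shape_vtable *vtable;
};

struct circle {                              /* derived "class" */
    struct shape base;                       /* base must be the first member */
    double radius;
};

static double circle_area(const struct shape *s)
{
    const struct circle *c = (const struct circle *)s;
    return 3.141592653589793 * c->radius * c->radius;
}

static const char *circle_name(const struct shape *s)
{
    (void)s;
    return "circle";
}

static const struct shape_vtable circle_vtable = { circle_area, circle_name };

int main(void)
{
    struct circle c = { { &circle_vtable }, 2.0 };
    const struct shape *s = &c.base;

    /* "Virtual" dispatch: the function is looked up through the object. */
    printf("%s area = %g\n", s->vtable->name(s), s->vtable->area(s));
    return 0;
}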

Co-routines may be represented as structures containing an execution pointer and a set of local variables. But there may be other representations, special needs, etc. Is it wise to force a particular model upon the programmer by making co-routines a part of the language?
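One well-known way to get exactly that structure-based representation in plain C is the switch-on-line-number trick used by Simon Tatham's coroutine macros and by protothreads; the sketch below is a stripped-down illustration of the technique, not any particular library:
Code: [Select]
#include <stdio.h>

/* The coroutine is pure data: the "execution pointer" is a resume point
 * stored in the struct, and the "local variables" live there as well. */
struct counter_coro {
    int resume;  /* 0 = start from the beginning */
    int i;       /* a "local" that must survive across yields */
};

#define CORO_BEGIN(c)    switch ((c)->resume) { case 0:
#define CORO_YIELD(c, v) do { (c)->resume = __LINE__; return (v); case __LINE__:; } while (0)
#define CORO_END(c)      } (c)->resume = 0; return -1

/* Yields 0, 1, 2, then -1 (finished), one value per call. */
static int count_to_three(struct counter_coro *c)
{
    CORO_BEGIN(c);
    for (c->i = 0; c->i < 3; c->i++)
        CORO_YIELD(c, c->i);
    CORO_END(c);
}

int main(void)
{
    struct counter_coro c = { 0, 0 };
    int v;
    while ((v = count_to_three(&c)) != -1)
        printf("%d\n", v);
    return 0;
}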
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8632
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #83 on: July 12, 2018, 06:02:24 pm »
SIMD in things like GPUs is obviously targeted at graphics, but I don't think the SIMD in most CPUs ever was. In the PPC world the additions seemed to be targeted at general high throughput computing. In the x86 world I have no idea what the original integer SIMD additions were trying to achieve. They were called MMX (multi-media extension) ...

In AMD, the first SIMD extension was called "3DNow".
True, but was that anything but a good name for publicity? They put two 32-bit reals in a 64-bit word, and made a fairly general instruction set for computing with them. I used 3DNow for a little while, and I don't recall anything that looked application-specific. I do recall that it was a heck of a lot easier to get some performance from 3DNow than it was later on to get some performance out of the four-reals-per-word SSE scheme.
 

Offline tggzzz

  • Super Contributor
  • ***
  • Posts: 19447
  • Country: gb
  • Numbers, not adjectives
    • Having fun doing more, with less
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #84 on: July 12, 2018, 07:56:20 pm »
A long time ago people wrote in assembler. It is quite tedious. So people who did that a lot started using macros. If you looked at their assembler programs, you would see more macros than commands. But macros were not that flexible in some areas. For example, it was still very tedious to calculate expressions. So, someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike other languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.

There are many mistakes in that, but I'll merely note
  • C did not come from assembler, it came from BCPL
  • while C was once almost a low-level language, that stopped being true a generation ago. For further information, start with https://queue.acm.org/detail.cfm?id=3212479
  • C is popular because it was given away for free with UNIX, at a time when other language implementations cost real money. That ensured it was taught to students, and the rest is history

Quote
Co-routines may be represented as structures containing execution pointer and a set of local variables. But there may be other presentations, special needs etc. Is it wise to force a particular model upon the programmer by making co-routines a part of the language?

In modern multicore systems, it is wise to allow the programmer to use coroutines if they simplify the application. Coroutines are impossible in pure C, unless you accept significant limitations, e.g. as in FreeRTOS.
There are lies, damned lies, statistics - and ADC/DAC specs.
Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
Having fun doing more, with less
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 6227
  • Country: fi
    • My home page and email address
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #85 on: July 13, 2018, 08:29:59 am »
Center is the one requiring the sum, which you frown upon :)
Yes, and it is a pretty good example of why.  Calculating the center usually requires a loop over all vertices.  However, you can achieve the same result by calculating the area using Kahan summation, without an extra pass over the vertices; and on current machines (especially x86 and x86-64 architectures) the extra work done in the Kahan sum is basically free anyway, because the area calculation loop is limited by memory bandwidth.
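Concretely, the whole computation is then a single pass: the shoelace formula, with the Kahan compensation folded into the accumulation loop. A sketch:
Code: [Select]
#include <stddef.h>

/* Signed area of a simple polygon with vertices (x[i], y[i]), i = 0 .. n-1,
 * using the shoelace formula and a Kahan-compensated sum.  One pass over
 * the vertices, no centering pass. */
double polygon_area(const double *x, const double *y, size_t n)
{
    double sum = 0.0, c = 0.0;

    for (size_t i = 0; i < n; i++) {
        const size_t j = (i + 1 < n) ? i + 1 : 0;  /* next vertex, wrapping */
        const double term = x[i]*y[j] - x[j]*y[i];

        /* Kahan update, as in the earlier snippet */
        const double yk = term - c;
        const double t  = sum + yk;
        c   = (t - sum) - yk;
        sum = t;
    }
    return 0.5 * sum;
}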

Sure, there are cases where you want a straight sum over some data.  My point is that if you do that, you'll want to take a step backwards, look at the task at hand as a whole, and see if there are algorithmic improvements you can make, or whether you can combine the sum with some other work.  It is not evil or bad in itself; it's just one of those things I've learned to see as an indicator, like a headache in summer heat is often a sign of dehydration.

Long time ago people wrote in assembler.
My very first paid programming job, in the very early nineties, was to incorporate a serial number (for tracking, in case copying became an issue) into a DOS binary I was not given the source code for.  I used Turbo Pascal to write a simple TUI that incorporated the serial number into the code, and combined it with my own relocator code (which extracted a copy of the same serial number from the relocation data).  Sneaky stuff.

I wrote the EXE relocator in DEBUG.EXE, the debugger/assembler that came with MS-DOS.  It had neither macro nor label support... Good times.

If you look at freestanding C, it is a surprisingly simple language: a very small number of keywords, operators, and core data types.  (It is a common exercise to write a parser for C source code using e.g. flex and bison.)  It is also very procedural.  It could be that C simply hit the sweet spot between human readability, simplicity, and the massive complexity it allows if the programmer can handle it.

For example, look at the Linux drivers code.
The Linux kernel is written in GNU C, which is basically C99 with GNU extensions.  The freestanding environment (without the standard C library) is very different from the hosted environment (where the standard C library is available). A similar difference exists when programming microcontrollers in e.g. the Arduino environment; the compiler supports C++ syntax, but only a subset of C++ features is available (and almost none of the standard C++ library).

To me, this is a good indicator that we are missing a low-level systems programming language that exposes the capabilities of the hardware, but has a minimal footprint and no runtime (say, one capable of running in a freestanding environment).

Co-routines may be represented as structures containing execution pointer and a set of local variables.
A better low-level implementation, in my opinion, is to have coroutines run on their own stacks.  The tricky bit is how to express (in human-readable language) the two scopes used simultaneously: the coroutine-local and the caller-local.  At the hardware level, the switch to/from a coroutine is then as cheap as a normal function call, and because the local variable references go to the stack, there is no indirect access overhead (which can be significant otherwise).

The yield idiom (returning a value, but not exiting from the coroutine; i.e. the next "receive"/"get" continues execution in the coroutine at that position) is then simple to implement, too, as the coroutine state is completely described by the pair (top of coroutine stack, instruction pointer in coroutine). During coroutine execution, only the corresponding pair (top of caller stack, instruction pointer in caller) needs to be maintained (and it can simply be kept on the coroutine stack, if the coroutine stack is per-thread).
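On POSIX systems you can sketch this with the (obsolescent, but still widely available) <ucontext.h> interface: a ucontext_t holds exactly that (stack pointer, instruction pointer) pair, and swapcontext() performs the switch. A toy example, passing the yielded value through a file-scope variable for brevity:
Code: [Select]
#include <stdio.h>
#include <ucontext.h>

static ucontext_t caller_ctx, coro_ctx;      /* saved (stack, instruction) pairs */
static unsigned char coro_stack[64 * 1024];  /* the coroutine's own stack */
static int yielded_value;
static int coro_done;

static void coro_yield(int value)
{
    yielded_value = value;
    swapcontext(&coro_ctx, &caller_ctx);     /* suspend coroutine, resume caller */
}

static void coro_body(void)
{
    for (int i = 0; i < 3; i++)
        coro_yield(i);                       /* i lives on the coroutine stack */
    coro_done = 1;
}

int main(void)
{
    getcontext(&coro_ctx);
    coro_ctx.uc_stack.ss_sp   = coro_stack;
    coro_ctx.uc_stack.ss_size = sizeof coro_stack;
    coro_ctx.uc_link          = &caller_ctx; /* return here when coro_body ends */
    makecontext(&coro_ctx, coro_body, 0);

    while (!coro_done) {
        swapcontext(&caller_ctx, &coro_ctx); /* resume the coroutine */
        if (!coro_done)
            printf("yielded %d\n", yielded_value);
    }
    return 0;
}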

Is it wise to force a particular model upon the programmer by making co-routines a part of the language?
Good question.  I claim that every language "forces" a particular model upon the programmer anyway.

I definitely want a low-level or systems programming language that does provide coroutines (or generators) as a part of the language.  I know of quite a few practical examples of how they would be extremely useful, both in low-level libraries (almost all I/O, really) and in microcontroller firmware (especially for I/O there, too).
 

Offline coppice

  • Super Contributor
  • ***
  • Posts: 8632
  • Country: gb
Re: Books on Basic and Advanced Algorithms to program in C/C ++
« Reply #86 on: July 13, 2018, 09:59:40 am »
    A long time ago people wrote in assembler. It is quite tedious. So people who did that a lot started using macros. If you looked at their assembler programs, you would see more macros than commands. But macros were not that flexible in some areas. For example, it was still very tedious to calculate expressions. So, someone came up with the idea of writing a compiler which would allow more complex macros. The compiler would read your macro file and literally write the assembler for you. Of course, you could insert raw assembler where needed. This is how C came to life. C might be considered a high-level language, but it still has the ebb and flow of assembler, unlike other languages such as Algol, FORTRAN, PL/1, and Ada, which started from abstract models.  Because of that flow, C gives you the possibility to write practically anything. This explains C's popularity.
    There are many mistakes in that, but I'll merely note
    • C did not come from assembler, it came from BCPL
    C was explicitly described as a derivative of BCPL, as were a number of other languages targeted at systems programming. I used Coral extensively (a mostly UK language) before I used C. Coral came from the same school as BCPL, and the transition to C was almost seamless. Coral did support fixed-point numbers, though. C never quite got them. :)
    "Almost" is right. As soon as you have basic things like a function call model, you've started to move away from the low level.
    • C is popular because it was given away for free with UNIX, at a time when other language implementations cost real money. That ensured it was taught to students, and the rest is history
    I don't agree with that. PL/1 derivatives were pushed by several of the early MPU makers (Motorola, Intel and others), as they were seen by some as a great basis for systems programming on bare metal. Most users didn't entirely agree. Some Pascal compilers appeared with extensions to make them more flexible (notably Turbo Pascal) and became big hits, because for a while they were the best of a poor bunch of options. Then C compilers appeared for MPUs, and seemed to be a great fit for what people were actually doing with MPUs. C became a huge hit at that point, quite separate from Unix. In the 80s and 90s the vast majority of people programming in C had never touched a Unix machine.
    « Last Edit: July 13, 2018, 10:01:28 am by coppice »
     

    Offline GeorgeOfTheJungle

    • Super Contributor
    • ***
    • Posts: 2699
    • Country: tr
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #87 on: July 13, 2018, 10:48:29 am »
    C is low level imo, as low level as any good macro assembler. Perhaps the abundance of C libraries is what makes people think otherwise, but C libraries are not the C language.
    « Last Edit: July 13, 2018, 08:47:24 pm by GeorgeOfTheJungle »
    The further a society drifts from truth, the more it will hate those who speak it.
     

    Offline tggzzz

    • Super Contributor
    • ***
    • Posts: 19447
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #88 on: July 13, 2018, 11:11:10 am »
    Co-routines may be represented as structures containing execution pointer and a set of local variables.
    A better low-level implementation, in my opinion, is to have coroutines run on their own stacks.  The tricky bit is how to express (in human-readable language) the two scopes used simultaneously; the coroutine-local and other-local.  At the hardware level, the switch to/from a coroutine is then as cheap as a normal function call, but as the local variable references are to stack, there is no indirect access overhead (which can be significant otherwise).

    The yield idiom (returning a value, but not exiting from the coroutine; i.e. next "receive"/"get" continues execution in the coroutine at that position) is then simple to implement, too, as the coroutine state is then completely described using the pair (top of coroutine stack, instruction pointer in coroutine). During coroutine execution, only the corresponding pair (top of caller stack, instruction pointer in caller) need to be maintained (and can be simply kept in the coroutine stack, if the coroutine stack is per-thread).

    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

    Offline coppice

    • Super Contributor
    • ***
    • Posts: 8632
    • Country: gb
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #89 on: July 13, 2018, 03:48:31 pm »
    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    It sounds like you have really been working with these xCore processors. I find them interesting, just as I found the preceding lineage, like the Transputer, interesting. The snag always seems to be finding a good fit with an application where xCore makes sense. It feels like their approach is always going to remain a small niche, although when the match is good they should really shine. What has your experience been?
     

    Offline tggzzz

    • Super Contributor
    • ***
    • Posts: 19447
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #90 on: July 13, 2018, 05:38:20 pm »
    While not coroutines per se, xC manages all that by:
    • at compile time preventing multiple access by different cores
    • extra keywords for parallel execution
    • a select keyword, like a case statement except the core suspends until the condition (timeout, input, output, message) occurs
    • extra keywords for input/output of messages to other cores or i/o ports, including timestamps

    The resulting code is simple to read and write, is explicit, and can be analysed to determine the precise execution time.

    Very impressive.
    It sounds like you have really been working with these xCore processors. I find them interesting, just as I found the preceding lineage, like the Transputer, interesting. The snag always seems to be finding a good fit with an application where xCore makes sense. It feels like their approach is always going to remain a small niche, although when the match is good they should really shine. What has your experience been?

    My experience is limited, but far more positive than I was expecting. They are really easy to use for hard realtime embedded systems. I regard that as a significant niche, and one that other processors can't touch. Effectively they are eating into territory that normally requires FPGAs.

    I was extremely impressed with the language and tools, traditionally the weak point of previous multicore systems. To that they had added excellent simple i/o support that was neatly tied into the language.

    Overall it enables high-level and low-level concepts to be expressed simply and in a unified fashion. And you don't need an RTOS: those functions are all in hardware.

    If it reminds you of Transputers, then it might not come as a surprise that Prof David May is a prime instigator in both Transputers and xCORE. He's learned from experience and improved the system - whereas most other systems are a simple boring derivation of stuff that I was doing in the early 80s. (I'm horrified at how little has changed since then!)

    The PDF at https://www.xmos.com/published/xmos-programming-guide is easy to read and understand. I have the same good feelings as I had when I read the Java whitepaper in 1995.
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

    Offline bd139

    • Super Contributor
    • ***
    • Posts: 23018
    • Country: gb
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #91 on: July 13, 2018, 06:46:51 pm »
    The only problem vs FPGAs is latency. A state machine can have somewhat tighter timings.

    Hence why TCP offload exists.

    Edit: I had good feelings when I saw Java the first time. Then I got my hands dirty on Java EE ;)
     

    Offline tggzzz

    • Super Contributor
    • ***
    • Posts: 19447
    • Country: gb
    • Numbers, not adjectives
      • Having fun doing more, with less
    Re: Books on Basic and Advanced Algorithms to program in C/C ++
    « Reply #92 on: July 13, 2018, 08:22:54 pm »
    Only problem vs FPGAs is latency. State machine can have somewhat tighter timings.

    Hence why TCP offload exists.

    Edit: I had good feelings when I saw Java the first time. Then I got my hands dirty on Java EE ;)

    Sometimes latency can be tolerated, provided it is consistent. Otherwise you are correct.

    TCP usually isn't the bottleneck; Van Jacobson demonstrated that in '88/'89.

    TCP offload has a very chequered history, especially with asymmetric "big-little" multiprocessor systems. Offload often doesn't remove a bottleneck. I first saw that in the late 80s, when an Appalling Domain DN10K processor had a 68020 as a network processor. The comms between the DN10K and the 68020 were of equivalent complexity/timing to networking protocols; the real win is arranging it so that there are zero memory copies as the data goes up the stack to the application.

    Java EE: I managed to avoid that, might have used Rod Johnson's Spring, but ended up using the telecom equivalent of EE: JAIN SLEE.

    I demonstrated a prototype of a telecom system using plain Java plus Tangosol's Coherence as a distributed replicated in-memory database. I really liked Cameron Purdy's mentality, but Tangosol was later borged by Oracle.
    « Last Edit: July 13, 2018, 08:27:28 pm by tggzzz »
    There are lies, damned lies, statistics - and ADC/DAC specs.
    Glider pilot's aphorism: "there is no substitute for span". Retort: "There is a substitute: skill+imagination. But you can buy span".
    Having fun doing more, with less
     

