See, that's my point, you're communicating something with the diagram, but what? It's a language thing -- can you rephrase it, or describe it some other way, or give something more concrete like a preferred implementation (granted that it's "in progress" and all)? That's all.
As for the description of the problem, "a pairwise balancing mechanism", yes, that is clear -- for a chained architecture, it will accumulate errors.
More generally, say you have N independent current sources, independent in that they're doing their own thing while operating at the same given setpoint. There are N^2 - N possible (directed) connections between them. Presumably there would be a way to connect them such that the error between any given pair is no greater than the error of a single correction circuit, and we would probably want to find connections which minimize the number of correction circuits for a given error.
So, if we draw the graph, plotting a node for each current source, and an edge between any two that are correcting in this way (and note that it's a directed graph when our correction circuit only works one way, from a reference to a destination), we can see that the chain architecture has the greatest maximum distance between any two nodes on the graph -- it's actually the worst possible!
If we rearrange instead for a star topology, we can use one node as the reference for the other N-1 nodes, and each node incurs only one correction error at most relative to that reference (two at worst between any two non-reference nodes). We still need N-1 correctors, and there isn't any obvious way to do better without leaving any nodes completely untouched. This falls short of a rigorous proof, but I'm comfortable with it.
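To put numbers on the chain-vs-star comparison, here is a rough Python sketch. The function names and the choice of node 0 as the reference are my own for illustration; it simply counts how many corrector stages sit between the reference and the farthest source, following the directed edges.

```python
from collections import deque

def directed_connection_count(n):
    """Number of possible directed connections among n sources: N^2 - N."""
    return n * n - n

def chain_edges(n):
    """Chain: each source is the reference for the next one, 0 -> 1 -> ... -> n-1."""
    return [(i, i + 1) for i in range(n - 1)]

def star_edges(n):
    """Star: node 0 is the reference for every other source."""
    return [(0, i) for i in range(1, n)]

def max_hops_from_reference(n, edges, ref=0):
    """Most corrector stages between the reference and any source (directed BFS)."""
    successors = {i: [] for i in range(n)}
    for src, dst in edges:
        successors[src].append(dst)
    hops = {ref: 0}
    queue = deque([ref])
    while queue:
        node = queue.popleft()
        for nxt in successors[node]:
            if nxt not in hops:
                hops[nxt] = hops[node] + 1
                queue.append(nxt)
    if len(hops) != n:
        raise ValueError("some sources are never corrected from the reference")
    return max(hops.values())

n = 8  # illustrative source count, not from any particular design
print(directed_connection_count(n))                 # 56
print(len(chain_edges(n)), len(star_edges(n)))      # 7 7  (same corrector count)
print(max_hops_from_reference(n, chain_edges(n)))   # 7    (errors stack up)
print(max_hops_from_reference(n, star_edges(n)))    # 1
```

Same number of correctors either way; the only thing that changes is how many of them stack up in series.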
Here's the trick though: say we declare that the reference node is an abstract signal, not actually a current source; we add another current source to maintain the same total current, so now we have N correctors for N current sources with 1 reference. We can suck each corrector inside its respective current source, say integrating it into the driver circuitry. There may be some savings in component count / cost this way.
This now describes just any regular old opamp-per-FET architecture, wired up as needed to maintain accuracy. Mind, there might be two wires for setpoint and setpoint ref, to avoid ground loop errors where connections between shunt resistors can introduce errors. We might end up needing two or three opamps per stage to resolve the differential voltages (ref, shunt and gate drive), but that's fine.
We can use the same components in each unit, achieving arbitrary accuracy -- 1% is trivial and probably doesn't even need differential sense; at 0.1% you need costly resistors and probably differential sense; at 0.01% or better you'll probably need to calibrate each section individually, and perhaps add temperature correction as well.
So -- if this follows the scheme you were considering -- this should show that we can consider a superset of possible architectures, and more or less proves that the canonical architecture is already best.
Lastly, balancing... simply isn't an interesting problem. If you require better than 10% matching, your thermal management is woefully, and dangerously, inadequate! This should only ever be a design consideration, never an operating requirement!
You can add one more op-amp, outside of all the stages in parallel, to maintain total current equal to ref. This amp can be arbitrarily precise, while the imbalance between stages can be gross (but managed) -- this could greatly simplify the circuit, say by using relatively large source balancing resistors and no per-FET driver at all, as long as the higher minimum (saturation) voltage drop is acceptable. (Perfect for a high voltage CCS; for low voltages, you'd still want amp-per-FET.)
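A crude numerical sketch of that outer-loop idea, in Python. Everything here is an assumption for illustration: the linear stage model I = (Vg - Vth) / Rs (which ignores 1/gm and the real FET transfer curve), the threshold values, and the function names. The "op-amp" is just a bisection search on the shared gate voltage until the total current equals ref.

```python
def stage_currents(v_gate, v_th, r_source):
    """Per-stage current under the assumed linear model; a stage below threshold is off."""
    return [max(0.0, (v_gate - vt) / r_source) for vt in v_th]

def servo_gate_voltage(i_ref, v_th, r_source, iterations=60):
    """Stand-in for the outer op-amp: find the shared gate voltage giving total = i_ref."""
    lo = min(v_th)                       # no stage conducts here, total is 0
    hi = max(v_th) + i_ref * r_source    # every stage carries at least i_ref here
    for _ in range(iterations):
        mid = 0.5 * (lo + hi)
        if sum(stage_currents(mid, v_th, r_source)) < i_ref:
            lo = mid
        else:
            hi = mid
    return 0.5 * (lo + hi)

def balance_report(i_ref, v_th, r_source):
    """Return (total current, max-min spread between stages) after the loop settles."""
    currents = stage_currents(servo_gate_voltage(i_ref, v_th, r_source), v_th, r_source)
    return sum(currents), max(currents) - min(currents)

# Hypothetical threshold spread of 0.4 V across four FETs, 4 A total.
v_th_example = [3.0, 3.2, 3.4, 3.1]
for r_source in (1.0, 0.1):
    total, spread = balance_report(4.0, v_th_example, r_source)
    print(f"Rs = {r_source} ohm: total = {total:.3f} A, spread = {spread:.3f} A")
```

In this toy model the total lands on ref regardless of Rs, while the stage-to-stage spread is set by the threshold mismatch divided by Rs: with 1 ohm the 0.4 V mismatch gives 0.4 A of spread, and with 0.1 ohm the worst FET shuts off entirely. That's the trade: bigger source resistors buy balance at the cost of voltage drop.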
And you can always add thermistors to adjust local gate voltage, say; this would be a good idea if independent heatsinks are used. Wire it up so the thermistor dominates over the FET's natural tempco. Thermistors should be provided anyway to measure heatsink temp, and throttle down or disable operation if any one gets into a dangerous range.
Or, if the outputs aren't wired together, but feed independent loads -- again, simply build each stage to the desired accuracy and that's that.

HTH,
Tim