If anyone is interested, that segmented memory access model that itches my brain is still bubbling like a fetid pool of industrial waste.
It has now split into two, where one is fundamentally incompatible with standard C (with the low k bits of pointers used to identify the segment, and all composite objects and arrays aligned to 2^k bytes); in particular, wrt. the address-of operator, &, when applied to members or elements of objects (or basically anything smaller than 2^k bytes, or with smaller alignment requirement, particularly char and its signed/unsigned variants).
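A minimal sketch of that first model, assuming k = 4 (an arbitrary value chosen purely for illustration): the low k bits of a pointer name the segment, and the object base is the pointer with those bits cleared — which is exactly why every composite object must be aligned to 2^k bytes, and why & on a char member cannot produce a representable pointer.

```c
#include <assert.h>
#include <stdint.h>

enum { K = 4 };  /* hypothetical: low K bits of a pointer identify the segment */

/* Segment number: the low K bits of the pointer value. */
static unsigned segment_of(uintptr_t p)
{
    return (unsigned)(p & ((1u << K) - 1));
}

/* Base address of the object: the pointer with the segment bits cleared.
   Only works because objects are aligned to 2^K bytes. */
static uintptr_t object_base(uintptr_t p)
{
    return p & ~(uintptr_t)((1u << K) - 1);
}
```

Note that nothing in this scheme can address a sub-object at a finer granularity than 2^K bytes, which is the incompatibility with standard C's address-of semantics.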
The other boils down to having a paged MMU with 2^k page tables, each starting at a virtual address that is a multiple of 2^(N-k), or equivalently a paged MMU that has the same overhead regardless of the virtual address. Then, the virtual address space can be split into 2^k "segments", where the page table entries cover a contiguous virtual address range up to the "segment limit". Stacks need to grow upwards, though, and therefore you really need addressing modes where an offset is subtracted from the stack or stack frame pointer; especially with a small literal constant (the most common use, referring to a local variable).
It took me a couple of days of thinking to see how it really boils down to these two, even though they both provide some really nice features.
Furthermore, the first one can be emulated on any architecture, by using a composite reference, (base object address, offset), instead of scalar pointers. For arrays, a quad (base object address, step delta, first index, count) specifies a slice, just like in e.g. Fortran or Python, and is valid if and only if both step delta × first index and step delta × (first index + count − 1) are within the base object. (Of course, using first offset = step delta × first index instead of first index, and end offset = step delta × (first index + count − 1) instead of count, is better, but it is important that the human-readable code does not directly provide first offset and end offset, because any off-by-one errors would completely confuse it.)
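The slice quad and its validity rule can be sketched in C; the struct layout and names here are illustrative, not a fixed ABI:

```c
#include <assert.h>
#include <stdbool.h>
#include <stddef.h>

/* A slice: (base object address, step delta, first index, count),
   with the base object's size carried along for the validity check. */
struct slice {
    unsigned char *base;   /* start of the base object */
    size_t         size;   /* size of the base object, in bytes */
    ptrdiff_t      delta;  /* step delta in bytes; may be negative */
    ptrdiff_t      first;  /* first index */
    size_t         count;  /* number of elements */
};

/* Valid iff both delta*first and delta*(first + count - 1)
   fall within the base object. */
static bool slice_valid(const struct slice *s)
{
    if (s->count == 0)
        return true;
    ptrdiff_t lo = s->delta * s->first;
    ptrdiff_t hi = s->delta * (s->first + (ptrdiff_t)s->count - 1);
    if (lo > hi) {               /* negative delta: endpoints swap */
        ptrdiff_t t = lo; lo = hi; hi = t;
    }
    return lo >= 0 && (size_t)hi < s->size;
}
```

Checking only the two endpoints suffices because every intermediate offset lies between them.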
I'm finding it funny how different it is to think in terms of "composite objects" or "structures" – (base, offset) pairs – with direct pointers to small scalar members forbidden/impossible, compared to how one views the memory address space in C.
Having a hardware comparison instruction that checks if 0 ≤ offset < limit, with offset a register operand, and limit either a register or a memory operand, would help a lot, because then the bounds check would be a single instruction. Better yet, a three-operand "within bounds" instruction, where the offset being checked is always a register operand, the lower bound either a register or a (small) literal unsigned integer, and the upper bound either a register or a memory address containing the bound.
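As a software stand-in for that instruction: the 0 ≤ offset < limit check already collapses to a single unsigned comparison on existing hardware, because reinterpreting a signed offset as unsigned maps every negative value above any sane limit. A small sketch:

```c
#include <assert.h>
#include <stdbool.h>
#include <stdint.h>

/* 0 <= offset < limit as one unsigned compare: a negative offset,
   viewed as unsigned, wraps to a huge value and fails the < test. */
static bool in_bounds(int64_t offset, uint64_t limit)
{
    return (uint64_t)offset < limit;
}
```

The proposed three-operand form would extend this to nonzero lower bounds without the extra subtract that trick currently requires.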
Also, it would be useful to be able to swap the greater-than and less-than comparison status flags based on the highest bit of a named register; a sort of a dedicated XOR operation. Then, a slicing loop would not need to care whether step delta is positive or negative: just compare if current offset is less than end offset, and swap the greater/less flags if step delta is negative. Having this in one instruction would be even nicer: compare offset > limit if delta is negative (highest bit set), offset < limit otherwise. The result condition is a single bit, true or false, with the next instruction expected to be a branch based on this comparison.
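In portable C the flag swap has to become an explicit choice of comparison on the sign of the delta; the proposed instruction would fold the two loops below into one. A sketch, with delta and the offsets in element units (delta must be nonzero):

```c
#include <assert.h>
#include <stddef.h>

/* Sum a slice given as (first offset, end offset, step delta).
   The direction of the loop comparison depends on the sign of delta,
   which is exactly what the flag-swapping instruction would hide. */
static long slice_sum(const int *base, ptrdiff_t delta,
                      ptrdiff_t first_off, ptrdiff_t end_off)
{
    long sum = 0;
    if (delta > 0)
        for (ptrdiff_t off = first_off; off <= end_off; off += delta)
            sum += base[off];
    else
        for (ptrdiff_t off = first_off; off >= end_off; off += delta)
            sum += base[off];
    return sum;
}
```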
Similarly, an N×N = N multiplication instruction really needs a flag showing if the result is exact, or just the low N bits. Basically a sticky-carry flag, that is only cleared by a specific clear-sticky-flag instruction and by multiplication, and gets set if an add, sub, or shift overflowed even temporarily (say, an x<<6 shifted any nonzero bits out). This would make it trivial to detect size calculation overflow. (Making this into a C built-in or an inline assembly function that returns 0 if an overflow occurs works fine; something like ulfma(x, y, z) == x·y + z, or zero if the result overflows.)
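On current compilers that built-in can be approximated with the GCC/Clang checked-arithmetic intrinsics standing in for the sticky flag; the name ulfma is from the text above, not an existing API:

```c
#include <assert.h>
#include <stddef.h>

/* ulfma(x, y, z) == x*y + z, or 0 if either step overflows.
   __builtin_mul_overflow / __builtin_add_overflow are GCC/Clang
   extensions that report the overflow the sticky flag would latch. */
static size_t ulfma(size_t x, size_t y, size_t z)
{
    size_t p, r;
    if (__builtin_mul_overflow(x, y, &p))
        return 0;
    if (__builtin_add_overflow(p, z, &r))
        return 0;
    return r;
}
```

(The zero return is unambiguous as an error value in size calculations, since a zero-sized allocation needs no memory anyway.)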
A "sticky" overflow bit makes sense in general, because then one can omit the rarely-taken branch-if-overflow instructions, and just have one at the end of the calculation.
I'm rather dejected that after that much mental itching, it turns out to be this mundane.
On the other hand, it really clarified to me how much the ABI and the programming language affect the way the hardware gets used, and how relatively small things (like the paged MMU implementation details) can provide a completely different paradigm ("segmented" memory, through pointers with the k high bits indicating the "segment", but address space limited to objects at most 2^(N-k) bytes long).