Sorry, but I seem to be in "too long; do not read" mode.
As it happens I have the processor manuals for both the R2K and the 88K.
There are at least 3 editions of Patterson & Hennessy since the R2K was mentioned. Those editions came out not because the authors wanted to sell books, but because the world had changed so drastically that the previous edition was no longer relevant to current products. The discussion in the 1st ed is dominated by the branch delay slot. Then came speculative execution and branch prediction. I have no idea where we are now. I've got the 4th ed on my shelf, but haven't had an HPC project, so not a lot of point in wading through all that detail. It's probably entirely different now. And it will certainly change drastically after the discovery of the speculative execution vulnerabilities.
Linear algebra lives or dies on memory bandwidth. IIRC the Intel i860 was touted as an 80 MFLOPS part. I did an analysis of the cycle time to do a vector multiply-add and found that absolute peak performance was more like 10 MFLOPS for vectors of 12 KB, which was a typical seismic trace length at the time. Trace lengths are roughly double that now.
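To put numbers on that kind of analysis, here is a back-of-the-envelope sketch; the bandwidth figure is an assumption picked for illustration, not a measured i860 spec:

program bw_bound
  implicit none
  ! Illustrative figures only: the bandwidth number is an assumption,
  ! not a measured i860 spec.  A single-precision multiply-add over
  ! vectors (y = y + a*x) does 2 flops per element but moves 12 bytes:
  ! load x, load y, store y.
  real :: bw, flops_per_elem, bytes_per_elem, bound
  bw             = 80.0e6    ! assumed sustained memory bandwidth, bytes/s
  flops_per_elem = 2.0
  bytes_per_elem = 12.0
  bound = bw / bytes_per_elem * flops_per_elem
  print *, 'bandwidth-limited rate (MFLOPS):', bound / 1.0e6
end program bw_bound

With those assumed numbers the memory system caps you at roughly 13 MFLOPS no matter what the FPU can theoretically retire.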
You can have large memory or you can have fast memory, but you cannot have large, fast memory. It is physically impossible: as the memory gets larger, the capacitance of the address lines increases, the RC time constant gets larger, and the cycle time for an access gets longer. So I am *not* going overboard. This is the cold hard truth I have dealt with for 30 years in a work environment where running 10,000 cores for 7-10 days and feeding them 10-12 TB of data is routine. And power and cooling limitations prevent having more than about 50,000 cores at one site, so processing companies have multiple installations scattered all over Houston.
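A toy model of that scaling, under the simplifying assumption that address-line capacitance grows linearly with the number of cells hanging off the line (both constants are made up, only the trend matters):

program rc_scaling
  implicit none
  ! Toy model, not device physics: if the capacitance of an address line
  ! grows linearly with the number of cells attached to it, the RC delay,
  ! and with it the minimum access time, grows with memory size.
  ! Both constants below are assumed; only the trend matters.
  real, parameter :: r = 100.0            ! assumed driver resistance, ohms
  real, parameter :: c_per_cell = 1.0e-15 ! assumed capacitance per cell, farads
  integer :: k
  real :: cells, delay
  do k = 10, 30, 5
     cells = 2.0**k                       ! cells sharing the line
     delay = r * c_per_cell * cells       ! RC time constant, seconds
     print '(a,i3,a,es10.3,a)', '2**', k, ' cells -> RC ~ ', delay, ' s'
  end do
end program rc_scaling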
The number of cores allocated to a particular job depends on how urgent it is and how fast the person doing it can do the QC on a run and update the velocity model before running the whole thing again. Typically a seismic processor will be working on two of these, so that while one is on the machine they are doing the QC on the other.
The only people doing anything comparable face a long vacation in a federal prison for discussing such matters. So one never does. A statement as simple as "there is a mathematical identity that can be used at a certain point in the calculation" can land everyone involved a very long prison visit.
Generally this is not a big deal for small problems of a few million matrix entries. But when each of the 3 dimensions measures in the 10^6 to 10^8 range, it gets *really* serious.
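If those are the three dimensions of a matrix product, then even at the low end and with round assumed numbers the scale looks like this:

program gemm_scale
  implicit none
  ! Back-of-the-envelope with assumed round numbers: the three dimensions
  ! of a matrix product C = A*B, each taken at the low end, 10**6.
  real(kind=8) :: m, n, k, bytes_c, flops
  m = 1.0d6
  n = 1.0d6
  k = 1.0d6
  bytes_c = m * n * 4.0d0       ! one single-precision result matrix
  flops   = 2.0d0 * m * n * k   ! multiply-adds in the product
  print *, 'result matrix, TB:      ', bytes_c / 1.0d12
  print *, 'flops for the multiply: ', flops
end program gemm_scale

That is about 4 TB for a single result matrix and on the order of 10^18 flops for one multiply, which is why the memory system and not the FPU is the problem.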
I have no recommendation other than to do exactly what I would do, a lot of reading.
If you're not familiar with the issue, look at the implications of particular strides and cache associativities.
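A sketch of the worst case (the cache geometry numbers are assumptions; substitute the real ones for your part): a stride equal to the cache size divided by the associativity puts every access in the same set.

program stride_sets
  implicit none
  ! Cache geometry is assumed; substitute the real numbers for your part.
  integer, parameter :: cache_bytes = 32768  ! assumed 32 KB cache
  integer, parameter :: line_bytes  = 64     ! assumed 64-byte lines
  integer, parameter :: assoc       = 4      ! assumed 4-way set associative
  integer, parameter :: nsets       = cache_bytes / (line_bytes * assoc)
  integer :: stride_bytes, i, addr, set
  ! A stride that is a multiple of nsets*line_bytes (here 8192 bytes) puts
  ! every access in the same set; touch more than 'assoc' such lines
  ! repeatedly and they keep evicting each other.
  stride_bytes = nsets * line_bytes
  do i = 0, 7
     addr = i * stride_bytes
     set  = mod(addr / line_bytes, nsets)
     print '(a,i2,a,i6,a,i4)', 'access ', i, '  addr ', addr, '  -> set ', set
  end do
end program stride_sets

The usual cure is to pad the leading array dimension so the stride is no longer an exact multiple of that figure.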
I have written *one* computed GOTO in my life. On the Alpha, the code was 10% faster if I used a stride of 2 and two temporary variables. The killer in the code (Kirchhoff pre-stack time migration) was computing square roots. I observed that a linear approximation was good enough, so I only recomputed the square root at intervals and used a linear approximation in between. But it meant that I had to continually recompute where to calculate the square root and call a different subroutine to do that segment of the integration, which was most cleanly implemented with a computed GOTO.
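A minimal sketch of that square-root trick (not the original code; the names, the segment length, and the integrand are all illustrative, and the computed-GOTO dispatch to a separate routine is folded into a single loop here):

subroutine segmented_sum(n, nseg, t2, w, acc)
  ! Sketch only: take the exact sqrt at segment boundaries and use a
  ! linear approximation in between, so no sqrt appears in the hot loop.
  implicit none
  integer, intent(in)  :: n, nseg   ! samples, and samples per segment
  real,    intent(in)  :: t2(n)     ! values whose square roots are needed
  real,    intent(in)  :: w(n)      ! weights being summed
  real,    intent(out) :: acc
  integer :: i, j, k
  real :: s0, s1, ds, s
  acc = 0.0
  do i = 1, n - 1, nseg
     j  = min(i + nseg, n)
     s0 = sqrt(t2(i))               ! exact values at the segment ends
     s1 = sqrt(t2(j))
     ds = (s1 - s0) / real(j - i)   ! slope of the linear approximation
     s  = s0
     do k = i, j - 1
        acc = acc + w(k) * s        ! no sqrt inside the hot loop
        s   = s + ds
     end do
  end do
  acc = acc + w(n) * sqrt(t2(n))    ! last sample, exactly
end subroutine segmented_sum

The segment length trades accuracy against speed: shorter segments mean more exact square roots and less of the win.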
The point of all this is quite simple:
If you need to do high performance linear algebra you have to count cycles for *every* operation from the prefetch to the write and make use of whatever parallelism the hardware provides.
I had an interview in 2006 with a startup called SiCortex which had a nifty design for a very interesting processor based on the MIPS. After we had scheduled my interview I got a bunch of documentation to read. The interview turned into $2-3K of free consulting. However, I did get to spend the weekend with a friend who lived nearby.
I gave an hour-long talk in which I explained the data movement of all the migration algorithms and why their design could not handle it well. Later I had a cycle-by-cycle discussion with the chief architect and a discussion with the CEO in which I explained why what they had designed, despite being quite marvelous, was not suitable for seismic processing. The friend who had given my name to the head hunter was doing hands-on evaluations of a prototype and came to precisely the same conclusions after several months of testing. Needless to say, I did not get hired. The company went under a few years later.