Which compilers? gcc-arm didn't optimize it "at all" (for CM4), nor does Xcode's LLVM :-(
00000000 <unpack_u32le>:
0: e5d01002 ldrb r1, [r0, #2]
4: e5d0c001 ldrb ip, [r0, #1]
8: e5d02000 ldrb r2, [r0]
c: e1a03801 lsl r3, r1, #16
10: e183340c orr r3, r3, ip, lsl #8
14: e1830002 orr r0, r3, r2
18: e1800c01 orr r0, r0, r1, lsl #24
1c: e12fff1e bx lr
00000020 <unpack_32be>:
20: e5d02001 ldrb r2, [r0, #1]
24: e5d03000 ldrb r3, [r0]
28: e5d01003 ldrb r1, [r0, #3]
2c: e5d00002 ldrb r0, [r0, #2]
30: e1a02802 lsl r2, r2, #16
34: e1823c03 orr r3, r2, r3, lsl #24
38: e1833001 orr r3, r3, r1
3c: e1830400 orr r0, r3, r0, lsl #8
40: e12fff1e bx lr
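(For reference, the byte-by-byte C these listings presumably came from would be something along these lines — the function names are guessed from the symbols, so treat this as a reconstruction:)

```c
#include <stdint.h>

/* Presumed source for the listings above: assemble a 32-bit value
 * one byte at a time, so alignment and host byte order don't matter. */
uint32_t unpack_u32le(const unsigned char *src)
{
    return (uint32_t)src[0]
         | ((uint32_t)src[1] << 8)
         | ((uint32_t)src[2] << 16)
         | ((uint32_t)src[3] << 24);
}

uint32_t unpack_u32be(const unsigned char *src)
{
    return ((uint32_t)src[0] << 24)
         | ((uint32_t)src[1] << 16)
         | ((uint32_t)src[2] << 8)
         | (uint32_t)src[3];
}
```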
which is not optimal, sure, but not absolutely horrible either. get_native_u32:
ldr r0, [r0]
bx lr
get_byteswap_u32:
ldr r0, [r0]
rev r0, r0
bx lr
since ldr on Cortex-M4 can handle unaligned accesses just fine; but you need something like
static inline uint32_t get_native_u32(const unsigned char *const src)
{
return *(const uint32_t *)src;
}
static inline uint32_t get_byteswap_u32(const unsigned char *const src)
{
uint32_t result = *(const uint32_t *)src;
result = ((result & 0x0000FFFFu) << 16) | ((result >> 16) & 0x0000FFFFu);
result = ((result & 0x00FF00FFu) << 8) | ((result >> 8) & 0x00FF00FFu);
return result;
}
#if __BYTE_ORDER__ == __ORDER_LITTLE_ENDIAN__
#define get_u32le(src) get_native_u32(src)
#define get_u32be(src) get_byteswap_u32(src)
#elif __BYTE_ORDER__ == __ORDER_BIG_ENDIAN__
#define get_u32le(src) get_byteswap_u32(src)
#define get_u32be(src) get_native_u32(src)
#else
#error Unsupported byte order
#endif
to get ARM-GCC to emit that. Personally, I prefer the first one for readability, but will switch to the latter if it makes a measurable difference at run time.
Using || instead of | isn't going to help.
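One caveat with the snippet above: dereferencing src through a cast uint32_t pointer technically violates C's strict-aliasing rules (and assumes the target tolerates unaligned loads). A memcpy-based variant is well-defined C, and GCC and Clang at -O2 typically fold it into the same single ldr — a sketch:

```c
#include <stdint.h>
#include <string.h>

static inline uint32_t get_native_u32(const unsigned char *const src)
{
    uint32_t result;
    /* Well-defined for any alignment; compilers fold this into one load. */
    memcpy(&result, src, sizeof result);
    return result;
}
```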
Does byte order make any difference in the complexity of the VHDL/Verilog code? Especially the ALU, or when loading/storing unaligned multibyte values?
unsigned int get_bit64(const uint64_t *map, const size_t bit)
{
return !!(map[bit/64] & ((uint64_t)1 << (bit & 63)));
}
unsigned int get_bit8(const uint8_t *map, const size_t bit)
{
return !!(map[bit / 8] & (1 << (bit & 7)));
}
then you always have get_bit64(map, i) == get_bit8(map, i). (Ignore any typos in the above code, if you find one.) Not so with big-endian byte order, where you must use a specific word size to access the bit map. Granted, it only matters in some rather odd cases, like when different operations wish to access the binary data in different-sized chunks.
Does byte order make any difference in the complexity of the VHDL/Verilog code? Especially the ALU, or when loading/storing unaligned multibyte values?
Zero difference.
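(As an aside, the get_bit64/get_bit8 equivalence claimed above is easy to check with a quick loop. The helper below only reports agreement on a little-endian host, which is exactly the point:)

```c
#include <stddef.h>
#include <stdint.h>

static unsigned int get_bit64(const uint64_t *map, const size_t bit)
{
    return !!(map[bit / 64] & ((uint64_t)1 << (bit & 63)));
}

static unsigned int get_bit8(const uint8_t *map, const size_t bit)
{
    return !!(map[bit / 8] & (1 << (bit & 7)));
}

/* Returns 1 when both accessors agree on every bit of a 128-bit map.
 * True on little-endian; on big-endian the byte view disagrees. */
static int bits_agree(const uint64_t *map)
{
    const uint8_t *bytes = (const uint8_t *)map; /* char-type view may alias */
    for (size_t i = 0; i < 128; i++)
        if (get_bit64(map, i) != get_bit8(bytes, i))
            return 0;
    return 1;
}
```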
But it is not about the convenience of the implementer. It is about the convenience of the programmer.
Does byte order make any difference in the complexity of the VHDL/Verilog code? Especially the ALU, or when loading/storing unaligned multibyte values?
All the things described above are only problems if you think that BE is better. In that case
Xeons are the highest-performing general-purpose processors on the market right now, so why doesn't CERN use them?
U54-MC
The SiFive U54-MC Standard Core is the world’s first RISC-V application processor, capable of supporting full-featured operating systems such as Linux.
The U54-MC has 4x 64-bit U5 cores and 1x 64-bit S5 core—providing high performance with maximum efficiency. This core is an ideal choice for low-cost Linux applications such as IoT nodes and gateways, storage, and networking.
RISC-V system emulator supporting the RV128IMAFDQC base ISA (user level ISA version 2.2, privileged architecture version 1.10) including:
32/64/128 bit integer registers
32/64/128 bit floating point instructions
4) Technical notes
------------------
4.1) 128 bit support
The RISC-V specification does not define all the instruction encodings for the 128-bit integer and floating point operations. The missing ones were interpolated from the 32- and 64-bit ones.
Unfortunately there is no RISC-V 128-bit toolchain or OS yet (volunteers for the Linux port?), so rv128test.bin may be the first 128-bit code for RISC-V!
Is there a document about all the requirements and specs for making Linux run on RISC-V?
Is there a RISC-V board with PCIe or PCI?
So it's completely ... experimental. But I still wonder WHO needs 128-bit registers, and for what.
The Linux kernel supports RISC-V. There are I suppose at least half a dozen Linux distributions that support RISC-V, the most heavily used probably being Debian, Fedora, and Buildroot.
The main requirement for a board maker is to implement a bootloader and the SBI (Supervisor Binary Interface). The most commonly used bootloader at the moment is BBL (Berkeley Boot Loader, which also implements the SBI), but it's pretty crude, so there is a lot of work going into others such as Das U-Boot and coreboot.