Author Topic: Testing the CH32L103  (Read 1463 times)

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 16310
  • Country: fr
Testing the CH32L103
« on: March 25, 2025, 11:00:54 pm »
So, I wanted to test the CH32L103, a lower-power variant of the WCH family with USB-PD support. Barely more expensive than the CH224K that I've used for basic USB-PD functionality, and it's a full MCU with 64KB of Flash and 20KB of RAM.

First testing it on an eval board: CH32L103C8T6-EVT-R0. Docs & SDK: https://github.com/openwch/ch32l103

It can be flashed using a WCH Link-E programming adapter. I have one from a third party (Muse Lab) which works well and is compatible with the original WCH firmware. Note that the Link-E needs the most recent firmware version to be able to connect to a CH32L103. I had bought a couple of those Link-E adapters last year, and of course they didn't work with the CH32L103, so I had to update them. This tool makes updating the firmware very easy, and it works flawlessly: https://www.wch.cn/downloads/WCH-LinkUtility_ZIP.html . It's Windows-only, but it works with Windows 7. (There's apparently a way to update from Linux, but I didn't bother.)

For the first test, I ported the test firmware I did for the CH32V307 with Bruce's "primes" benchmark.

Interestingly, the result is kinda disastrous.

CH32L103C8T6 (QingKe V4C core, RV32IMAC) @ 96 MHz : 92.9 billion cycles. That's about twice what I got with the CH32V307, and the core is close enough not to make a real difference here.

I figured out the reason quite quickly: the CH32V307 (and maybe the CH32V2xx series too, I haven't checked, but if anyone knows...), as we had discussed earlier, loads the entire Flash content in RAM before starting execution, so that you always have zero wait state. This isn't the case for the CH32L103, which is much cheaper and has apparently no extra internal RAM at all apart from the 20KB available for data. So, it directly executes from Flash, and it doesn't seem to have any kind of cache either, not even a small one. At 96 MHz, it has 2 wait states. So, there you go. That's not very pretty, but that's the price of low price. The wait states are: 0 up to 40 MHz, 1 up to 72 MHz and then 2 up to 96 MHz.
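For reference, the wait-state thresholds quoted above can be written as a small helper. This is only a sketch: the function name is made up for illustration, and the limits are just the figures from this post, not taken from a vendor header.

```c
#include <stdint.h>

/* Flash wait states for a given system clock, using the limits quoted
 * in the post: 0 up to 40 MHz, 1 up to 72 MHz, 2 up to 96 MHz.
 * Returns -1 for clocks above 96 MHz.
 * flash_wait_states() is a hypothetical name, not an SDK function. */
int flash_wait_states(uint32_t sysclk_hz)
{
    if (sysclk_hz <= 40000000u) return 0;
    if (sysclk_hz <= 72000000u) return 1;
    if (sysclk_hz <= 96000000u) return 2;
    return -1;
}
```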

I tried at lower frequencies and the number of cycles was much lower and consistent with the wait states. So, that's something to know. When running from Flash, there may be no benefit whatsoever in running at higher clock frequencies; quite the contrary.

What I'm curious to try next is to run the primes function from RAM at 96 MHz. It should run at zero wait state and get us back to the results I got with the CH32V307, which was about half the number of cycles.

Still a pretty cool small MCU with a full USB-PD controller, which should be great for implementing all kinds of USB-powered devices.
« Last Edit: March 25, 2025, 11:02:29 pm by SiliconWizard »
 
The following users thanked this post: tellurium

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5086
  • Country: nz
Re: Testing the CH32L103
« Reply #1 on: March 26, 2025, 04:22:53 am »
You could manually copy (position-independent, which most naturally is) speed-critical code from ROM into RAM, but if you don't have much of it (and don't need to overlay different things at different times) then it's easy enough to get the linker to treat code in a certain section the same as initialised data: linked for RAM address, stored in ROM, auto-magically copied to RAM on startup. I'm sure you know that :-)

But 40 MHz is probably fine for many applications anyway. And lower power.

It would be interesting to try to measure the power consumption difference between running from RAM and running from flash.

 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 16310
  • Country: fr
Re: Testing the CH32L103
« Reply #2 on: March 26, 2025, 01:24:59 pm »
Yes I'm going to test running from RAM, although I'm pretty sure it will get us the same CPI as from Flash with <= 40 MHz.

Just thought I would mention it, because as it is, the higher clock frequencies are almost useless. (I guess I could try setting a lower number of wait states at 96 MHz and maybe tune the core voltage - which this MCU allows, mainly to optimize power consumption - but that would be "out of spec"). Not, admittedly, that it is a real problem, but I was just a bit surprised. As I said, it apparently has no instruction cache at all, while even just like 256 bytes of cache would have made a big difference, but yes, I suppose they didn't think it was worth the trouble for this chip's target.

WCH actually provides examples of running from RAM, but it's mainly for getting lower power. And in terms of low power, it's not the lowest-power MCU out there, but it has a lot of parameters to optimize it (like setting the core voltage, putting Flash & RAM in low-power modes, and so on). The wakeup time from Stop modes is also very short (just a few µs max).

I'll measure power consumption to get actual figures. This eval board has pretty much only the MCU connected to the 3V3 rail, except for the power LED which we can easily desolder (or desolder its series resistor). So I can power it from its Vdd headers directly. There's an LDO fed from the USB VBUS, but they have added a switch to disconnect the output of the LDO, so that's convenient.

Regarding USB-PD, I had a first look and while it seems feature-complete as of PD 3.0, it is pretty barebones, meaning that a lot has to be handled in software. They provide an example of it and just the code to handle USB PD is over 800 LOCs, and it's apparently not even covering 100%. Cool but not for the faint of heart. I have written my own support for USB HS on the CH32V307, and looking at USB-PD on the L103, I expect a similar amount of work, oddly enough. But USB-PD has become quite complex. Of course it also supports USB FS host & device. So as I said, cool to implement a lot of stuff such as programmable power supplies, soldering irons, custom lighting, the list can be endless.

And this chip is about $0.30 per 100, so that's quite amazing, and if you're already familiar with RISC-V and the CH32V307, it's very easy to reuse your knowledge and even the tools/code you have already written for the latter.
 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5086
  • Country: nz
Re: Testing the CH32L103
« Reply #3 on: March 26, 2025, 10:37:27 pm »
Quote
Yes I'm going to test running from RAM, although I'm pretty sure it will get us the same CPI as from Flash with <= 40 MHz.

I would expect so, yes.

Quote
as it is, the higher clock frequencies are almost useless.

I'm not sure that follows. It's going to be at least slightly faster, is it not?

Also, is it not like the '003 where there is a wait state at 48 MHz, but that's per 4 bytes of code, so if you're using 2-byte C instructions the wait state is spread over two instructions not one?

And, of course, running from RAM is a perfectly normal technique, not something exotic and hard.

As an aside, I see that Olimex are now stocking WCH-LinkE and 10-packs of '003s at IIRC 18 eurocents each for 8 pin up to 25 eurocents for 20 pin. That might be useful for people in Europe.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 16310
  • Country: fr
Re: Testing the CH32L103
« Reply #4 on: March 27, 2025, 02:34:40 am »
Quote
I'm not sure that follows. It's going to be at least slightly faster, is it not?

Not your primes code, at least.
At 72 MHz, so only 1 wait state, it takes 62.5 billion cycles vs. 92.9 billion @ 96 MHz. So, that's about 14.5 min @72 MHz and about 16 min @96 MHz.
So it would require at least 107 MHz to make it just even with 72 MHz. Conversely, 96 MHz would be as fast as ~ 64 MHz.
The whole range above 72 MHz is thus useless, at least, again, with your primes code. Haven't looked at the compiled code on this to see how many instructions are compressed in it.
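The arithmetic behind those figures, as a quick sketch (the helper names are made up for illustration; the cycle counts are the measured ones above):

```c
/* Run time in seconds for a workload of `cycles` at `clock_hz`. */
double run_time_s(double cycles, double clock_hz)
{
    return cycles / clock_hz;
}

/* Clock needed for a run taking `cycles_b` cycles to finish in the same
 * time as a run taking `cycles_a` cycles at `clock_a_hz`. */
double break_even_hz(double cycles_a, double clock_a_hz, double cycles_b)
{
    return clock_a_hz * cycles_b / cycles_a;
}
```

run_time_s(62.5e9, 72e6) gives about 868 s (14.5 min) and run_time_s(92.9e9, 96e6) about 968 s (16.1 min). break_even_hz(62.5e9, 72e6, 92.9e9) gives about 107 MHz, and going the other way, break_even_hz(92.9e9, 96e6, 62.5e9) gives roughly 64.6 MHz.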

Now as I said, it may run properly above 72 MHz at 1 wait state (or maybe even 0) from Flash, which I may try just for fun, but it would be out of spec.

And yes we can run code from RAM, but with only 20 KB, that would have to be strictly limited.

Not that this particular MCU is made for speed, but again it's worth a mention: users may expect some runtime benefit from clocking it at 96 MHz, but there is none when running from Flash. This is something I have rarely seen in other MCUs, as most that run from Flash have at least a small amount of cache between Flash and core. That's the case for most STM32 MCUs, for instance.

 

Offline brucehoult

  • Super Contributor
  • ***
  • Posts: 5086
  • Country: nz
Re: Testing the CH32L103
« Reply #5 on: March 27, 2025, 05:32:52 am »
Quote
At 72 MHz, so only 1 wait state, it takes 62.5 billion cycles vs. 92.9 billion @ 96 MHz. So, that's about 14.5 min @72 MHz and about 16 min @96 MHz.
So it would require at least 107 MHz to make it just even with 72 MHz. Conversely, 96 MHz would be as fast as ~ 64 MHz.
The whole range above 72 MHz is thus useless, at least, again, with your primes code. Haven't looked at the compiled code on this to see how many instructions are compressed in it.

Oh I see, yes.

Twice as many cycles for only 1.5x higher clock speed, and three times as many cycles for only 2x clock speed clearly doesn't make any sense.

At least on '003 it's twice as many cycles for 2x higher clock speed, so you can never lose. With 2-byte instructions at 24 MHz you can fetch two instructions per cycle from flash but only execute one. At 48 MHz with 2-byte instructions you can fetch two instructions every two cycles, and also execute two. So 4-byte instructions run at 24 MIPS at either clock speed, but 2-byte instructions run at 24 MIPS at 24 MHz but 48 MIPS at 48 MHz. So your effective speedup is the proportion of all instructions that are 2-byte, which is generally 50% to 60%, but can reach near 100% for carefully hand-written loops/functions.

I'd be surprised if the CH32L103 CPU core is dumber than the '003 one so I'd hope that C-heavy code would get some speedup, but non-C can run more slowly, yeah.

Quote
Now as I said, it may run properly above 72 MHz at 1 wait state (or maybe even 0) from Flash, which I may try just for fun, but it would be out of spec.

Well, yeah, I wouldn't depend on that!

Quote
And yes we can run code from RAM, but with only 20 KB, that would have to be strictly limited.

countPrimes() is under 200 bytes on RV32IMC (depending on exact compiler version) so you could RAM-load up to 100 such functions, depending on how much stack/globals you need.

I mean ... anything that can run at all on an '003 can fit entirely in RAM on your 'L103, with 2k to spare.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 16310
  • Country: fr
Re: Testing the CH32L103
« Reply #6 on: March 27, 2025, 02:19:03 pm »
Quote
countPrimes() is under 200 bytes on RV32IMC (depending on exact compiler version) so you could RAM-load up to 100 such functions, depending on how much stack/globals you need.
I mean ... anything that can run at all on an '003 can fit entirely in RAM on your 'L103, with 2k to spare.

Sure, of course. To be restricted to timing-critical code anyway (or possibly code that needs to run at very low power, as it should also lower consumption a bit), so a few KB reserved for that should be plenty.

I'm gonna modify my linker script and add a RAM section for code, place some functions in this section, and also modify the startup code to load those RAM functions automatically. That's the simplest approach not requiring dealing with PIC.
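A rough sketch of what that could look like on the C side. The section name `.ramcode` and the `_ramcode_*` symbols are hypothetical: they would have to match whatever the modified linker script defines (an output section linked at a RAM address but stored in Flash). This is not WCH's startup code.

```c
#include <stdint.h>

/* Hypothetical attribute macro: puts a function into a dedicated section
 * that the linker script links for RAM but stores in Flash. */
#define RAMFUNC __attribute__((section(".ramcode"), noinline))

/* Word-by-word copy of a section image from its load address (Flash)
 * to its run address (RAM). Copies nothing if dst >= dst_end. */
void copy_section(uint32_t *dst, const uint32_t *dst_end, const uint32_t *src)
{
    while (dst < dst_end)
        *dst++ = *src++;
}

/* In the startup code, before main(), assuming the linker script exports
 * these (hypothetical) symbols:
 *
 *   extern uint32_t _ramcode_vma[], _ramcode_end[];
 *   extern const uint32_t _ramcode_lma[];
 *   copy_section(_ramcode_vma, _ramcode_end, _ramcode_lma);
 */
```

This is the same kind of loop startup code typically uses to initialise .data, just pointed at a different pair of addresses.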
 

Offline HwAoRrDk

  • Super Contributor
  • ***
  • Posts: 1650
  • Country: gb
Re: Testing the CH32L103
« Reply #7 on: March 27, 2025, 08:38:53 pm »
Quote
I'm gonna modify my linker script and add a RAM section for code, place some functions in this section, and also modify the startup code to load those RAM functions automatically. That's the simplest approach not requiring dealing with PIC.

You don't particularly need to modify the linker script and startup code. When I was doing my experiment a while ago on the CH32V003, measuring the speed difference between executing code from flash and from RAM, I didn't do so. Literally all I did was add an attribute to place the function in a .data subsection, like so:

Code:
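#include <stdint.h>

/* Placed in a .data subsection, so the standard startup code copies it to RAM. */
__attribute__((section(".data.test_func_ram"), noinline))
static void test_func_ram(const uint32_t iters) {
    /* etc... */
}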

I think this also still satisfies not having to explicitly deal with PIC, because the compiler/linker still knows that the code's in RAM, in the 0x20000000 range, and so any references to it will be pointing there, and any absolute jumps, etc. within these functions should also be correct.
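A quick sanity check that a function really ended up in RAM could look like the sketch below. The names are made up for illustration, and it assumes RAM starts at 0x20000000 with the 20KB size of the CH32L103 mentioned earlier in the thread; adjust both for another part.

```c
#include <stdbool.h>
#include <stdint.h>

#define RAM_BASE_ADDR 0x20000000u      /* assumed RAM base (the range mentioned above) */
#define RAM_SIZE_BYTES (20u * 1024u)   /* 20KB, CH32L103 figure from this thread */

/* True if `addr` lies inside the assumed RAM window. */
bool addr_in_ram(uintptr_t addr)
{
    return addr >= RAM_BASE_ADDR && (addr - RAM_BASE_ADDR) < RAM_SIZE_BYTES;
}
```

On the target, something like addr_in_ram((uintptr_t)&test_func_ram) should then return true for a function placed this way.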

But if you want to separate things for organisational reasons, of course it's your prerogative. :) But personally I would value the simplicity of using a "standard" linker script and startup code.
 

Online SiliconWizard (Topic starter)

  • Super Contributor
  • ***
  • Posts: 16310
  • Country: fr
Re: Testing the CH32L103
« Reply #8 on: March 27, 2025, 09:30:52 pm »
I've already modified the linker script compared to the one provided by WCH, so no big deal.
I actually like creating a dedicated memory area, not just a section in RAM, for that. Benefits are that I do set the "x" flag only for this area and not for "data" RAM, which is only "rw". Not that it makes a real difference for MCUs in general, but just to make things tidier. That also allows the linker to give me the exact RAM code size taken, which you can't see directly in the memory report if you're putting code in the data section. (You can by looking at the .map file, but that requires going through it.)

Anyway, I'm gonna quickly test it first with your approach to get a quick figure for the number of cycles when executed from RAM @96 MHz. I'll deal with the linker script/startup code later.

Ok, so just tested. countPrimes() running from RAM @ 96 MHz: cycles = 45265266400 (~ 45.3 billion). That's indeed almost identical to what I got with the CH32V307 in terms of cycles.

Quick note for the above way of putting functions in the "data" section: it works but gives a nasty assembler warning:

/tmp/ccySqUdn.s: Assembler messages:
/tmp/ccySqUdn.s:64: Warning: setting incorrect section attributes for .data.code_countPrimes

(I think that's because it conflicts with what the assembler expects from something put in the data section, which is a "reserved" section.)
So, still works, but that's one more point in favor of defining a dedicated section for this.
« Last Edit: March 27, 2025, 10:13:57 pm by SiliconWizard »
 

Offline HwAoRrDk

  • Super Contributor
  • ***
  • Posts: 1650
  • Country: gb
Re: Testing the CH32L103
« Reply #9 on: March 27, 2025, 10:30:22 pm »
Quote
Quick note for the above way of putting functions in the "data" section: it works but gives a nasty assembler warning:

/tmp/ccySqUdn.s: Assembler messages:
/tmp/ccySqUdn.s:64: Warning: setting incorrect section attributes for .data.code_countPrimes

(I think that's because it conflicts with what the assembler expects from something put in the data section, which is a "reserved" section.)
So, still works, but that's one more point in favor of defining a dedicated section for this.

Oh yeah, I forgot about that. Gives a "warning: foo.elf has a LOAD segment with RWX permissions" message from the linker too. I didn't bother to look too closely into it at the time, but I figured it was something to do with what you say and could be safely ignored.
 

Offline prosper

  • Regular Contributor
  • *
  • Posts: 120
  • Country: ca
Re: Testing the CH32L103
« Reply #10 on: April 02, 2025, 03:52:31 pm »
Quote
as it is, the higher clock frequencies are almost useless.

Well, yes and no. They won't result in much additional processing power, true. But they do enable your peripherals to use higher clock speeds. So SPI, for example, could be clocked faster (usually sysclk/2). Or your timers could tick faster, enabling higher precision counting.

It's not all about executing code, but enabling more flexibility in the tasks you might want to do.
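To put rough numbers on that, here is a small sketch. It assumes the peripherals are clocked directly at sysclk, the fastest SPI clock is sysclk/2 as mentioned above, and the timer runs with no prescaler. The helper names are made up; the actual clock tree settings would need checking against the reference manual.

```c
#include <stdint.h>

/* Fastest SPI clock, assuming a minimum divider of 2 from sysclk. */
uint32_t spi_max_hz(uint32_t sysclk_hz)
{
    return sysclk_hz / 2u;
}

/* Duration of one timer tick in nanoseconds, assuming the timer counts
 * at sysclk with no prescaler. */
double timer_tick_ns(uint32_t sysclk_hz)
{
    return 1e9 / (double)sysclk_hz;
}
```

Under those assumptions, going from 72 MHz to 96 MHz raises the maximum SPI clock from 36 MHz to 48 MHz and shortens the timer tick from about 13.9 ns to about 10.4 ns.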
 

