Author Topic: NVIDIA Releases Open-Source GPU Kernel Modules  (Read 1705 times)


Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 2210
  • Country: gb
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #25 on: May 14, 2022, 11:04:42 am »
Not sure what you are talking about.

Keyword[]={ GSP-RISC-V }

Every GPU loads a piece of firmware, but this one is hefty: 34 MByte, with ~900 functions implemented! Hence we can say that with NVIDIA, a *good portion* of the features that AMD and Intel drivers implement in the kernel is instead provided by a binary blob inside the GPU, and this blob runs on the GSP, which is a RISC-V core(1).

You actually have a programmable GPU! We have already read good articles like this(2), but this NVIDIA solution looks more like one I saw in 2001, where a RISC-like CPU was paired with dual-port RAM and a DAC to create a 2D-accelerated video card.

It was a CPU, running software, re-programmable, not dedicated hardware.

(1) only available on Turing and newer GPUs.
(2) General-purpose computing on graphics processing units
 

Offline DiTBho

  • Super Contributor
  • ***
  • Posts: 2210
  • Country: gb
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #26 on: May 14, 2022, 01:06:25 pm »
NVIDIA Turing : RISC-V = Rendition Vérité 1000 : V1K-RISC

The V1000 was just a slow V1K-RISC CPU @ 25 MHz. It had a one-cycle 32x32 multiplication (occupying a solid part of the chip), a one-cycle instruction for computing an approximate reciprocal (effectively a two-cycle approximate integer division), and the usual set of RISC instructions, but with a custom "V1K-RISC" encoding. Oh, and another "bilinear load" instruction that read a 2x2 block from linear memory and performed bilinear filtering based on the fractional u and v values passed to the instruction. The texture map had a tiny cache, apparently only 4 pixels, so whenever a perfectly matching 2x2 block came up again, you got a reduction in the memory bandwidth load.

A very old product, rare to see, and nobody remembers the open-source drivers ever really working with X11.

There are ~20 years between NVIDIA Turing and the Vérité 1000, but they really look similar: no open-source drivers, everything achieved by reverse engineering the PCI binary-only drivers.

Oh, and when it somehow works, well, it's always a bit bumpy  :-//
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 4437
  • Country: fi
    • My home page and email address
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #27 on: May 14, 2022, 02:30:09 pm »
Closed-source Linux kernel drivers make kernel issues undebuggable, because of the structure of the Linux kernel: a driver can do anything, and crashes usually occur somewhere else completely.  Use-after-frees, stale (incorrect) addresses scribbling over unrelated kernel memory, and so on, are typical examples of this.  (Many end users have a difficult time believing this, and insist that "my [closed source] drivers cannot be the cause, as other users would have reported it also, so it must be [code I'm responsible for]!" Yet, if the closed source drivers are never loaded in the first place, the bug does not occur... It is unbelievably frustrating for someone like me just trying to help.)

Closed firmware running on a peripheral communicating via a Linux kernel controlled bus (including address translation and DMA engines), on the other hand, poses no problems in this respect.  Besides, a typical kernel developer does not have the necessary documentation and/or hardware tools to safely develop custom firmware for arbitrary devices.

Because of this, I personally do not care much what internal firmware a graphics card might run, as long as its access to any other hardware and CPU-accessible memory is completely controlled by the Linux kernel.  As long as all code running on the CPU is open source, I can at least try to debug and fix issues, and that's good enough for me.
 
The following users thanked this post: DiTBho

Offline madires

  • Super Contributor
  • ***
  • Posts: 6956
  • Country: de
  • A qualified hobbyist ;)
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #28 on: May 14, 2022, 02:51:44 pm »
One problem with the infamous proprietary binary blob is long term support, especially in case of larger changes in the linux kernel. Manufacturers can drop support for their binary blob any time for any reason. When that happens and the driver requires some significant changes due to kernel changes you're stuck because you can't modify the binary blob. You could try to add some compatibility/translation layer for the outdated blob, but this would be just :horse:.
 
The following users thanked this post: DiTBho

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 4437
  • Country: fi
    • My home page and email address
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #29 on: May 14, 2022, 04:47:29 pm »
Yep, what madires wrote.

A major reason the Linux kernel is so darned versatile –– I mean, being used from phones to routers to desktops to servers to HPC clusters –– is that it is internally modular, with no stable internal APIs.  The userspace interface (syscalls, the /proc and /sys pseudo-filesystems) should be stable, and mostly is (with limited exceptions), but internally everything is subject to change and refactoring.  The amount of refactoring-like code churn in it is stupefying.  I mean, it is no longer possible for a single human to keep track of all the changes by themselves.  (That's why Linus delegates subsystems to proven capable developers.)

(All that said, I haven't taken a look at the sources to find out exactly what NVidia is doing; whether they open sourced just a shim, or a full kernel driver with any closed source blobs running on the GPU only.  I don't have any NVidia hardware, and prefer OpenCL over CUDA anyway, so it does not affect me right now.)

It would be darned nice to have a computing ASIC, preferably one with a large number of cores able to do 4-component Euclidean and homogeneous coordinate vectors and basic vector algebra (dealing with anything 3D) and basic real and complex algebra using double-precision floating-point at full speed, and a memory model with extremely fast read-only "global" memory access (data lookup) and read-write "local" memory accesses; perhaps some kind of "page" per core.  Forget about trigonometric and special functions, just super-fast, basic mathy cores in parallel.  That is what could give us a new order of magnitude of efficiency in many simulations (MD, FEM, most 3D non-field/QM stuff).  Do note that graphics doesn't need double precision; single precision suffices for human vision related stuff, but not really for the kinds of simulations I'm talking about.
 
The following users thanked this post: DiTBho

Offline nctnico

  • Super Contributor
  • ***
  • Posts: 24141
  • Country: nl
    • NCT Developments
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #30 on: May 14, 2022, 06:38:37 pm »
Quote from: madires on May 14, 2022, 02:51:44 pm
One problem with the infamous proprietary binary blob is long term support, especially in case of larger changes in the linux kernel. Manufacturers can drop support for their binary blob any time for any reason. When that happens and the driver requires some significant changes due to kernel changes you're stuck because you can't modify the binary blob. You could try to add some compatibility/translation layer for the outdated blob, but this would be just :horse:.
That is mainly a problem caused by the Linux kernel developers changing internal structures at will without really thinking things through. From a software-maintenance perspective the Linux kernel is a hot mess, so I am very grateful to any commercial hardware vendor for providing Linux support at all. It is a thankless job to keep shooting at a moving target.

To be honest, it wouldn't surprise me if every new kernel version introduces as many new bugs as the number of bugs fixed / features added. One example I ran into many years ago: I had a problem with a SoC that wouldn't always come out of a system reboot (the reboot command). It turned out that the code to reset the power management to a voltage level where the processor could run at its default speed had been removed in the kernel version I happened to be using (some developers thought it was a good optimisation). The voltage regulator could get set to a lower voltage just before the processor was reset through software, leaving the processor stuck (voltage too low for the frequency). It turned out this bug actually affected a few PC platforms as well.
« Last Edit: May 14, 2022, 06:42:21 pm by nctnico »
There are small lies, big lies and then there is what is on the screen of your oscilloscope.
 

Offline Nominal Animal

  • Super Contributor
  • ***
  • Posts: 4437
  • Country: fi
    • My home page and email address
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #31 on: May 14, 2022, 08:52:12 pm »
Quote from: nctnico on May 14, 2022, 06:38:37 pm
That is mainly a problem caused by the Linux kernel developers by changing internal structures at will without really thinking things through.
As I just explained, "without really thinking things through" != reality.  It is actually a very common, very silly misconception.

If the Linux kernel developers did not change internal structures as needed, it would not have the capabilities or support it does have today.  Simply put, the lack of any rigid internal structure is the cost of being so versatile.  Feel free to disagree, but the above is the actual reason among Linux kernel developers; the people who actually do this stuff day in day out.

As to SBCs and Linux running on SoCs, I still haven't seen a clean, well-ordered toolchain/kernel SDK from a commercial vendor.  They look more like Linux newbie development trees, not something put together with a sensible design.  Routers are a perfect example.  Just compare a "plain" OpenWRT installation to, say, Asus router images or the Realtek SDK to see for yourself.  The latter are laughably messy, like something put together by a high-schooler.  Which is also why I wish more people –– especially the ones who might become integrators, building such images, at some point –– would learn LinuxFromScratch, git, and historical stuff like the Linux Standard Base and the Filesystem Hierarchy Standard, the Unix philosophy and software minimalism, and the technical reasons why so many core developers dislike systemd.  It is not that hard to do Linux integration properly; it is just that newbies (with a proprietary software background) make the same mistakes again and again.

Kernel bugs do occur, of course –– but that is because most developers today, Linux kernel developers included, are more interested in adding new functionality and features than fixing bugs and making things robust.  Userspace-breaking stuff gets reverted –– and the language Linus used to use to berate authors making such breaking changes is what so many people complained about! Pity, I liked the sharpness! ––, so it turns out that for least maintenance work, you want to be able to upgrade to newer vanilla kernels, but recommend LTS kernels you test and help maintain yourself.  It is minimal work compared to testing and maintaining your own kernel fork.

It might be that NVidia is feeling pressure on this front from the HPC field.  It is well known nowadays that if you do e.g. CUDA stuff, bug-wise you're on your own; nobody outside NVidia can actually help you.  (There is a related tale about the "tainted" flag the Linux kernel sets when a binary-only driver has been loaded, and how the users who think Linux kernel developers should be telepathic and clairvoyant, able to fix bugs even when the kernel data structures have been accessed by unknown code, believed that it should not apply to NVidia drivers, and is "just offensive"...  :palm:  See the first paragraph describing this in the Linux kernel documentation, and consider the precise and gentle language used.  Heh.)
 
The following users thanked this post: madires, bitwelder, MK14

Offline magic

  • Super Contributor
  • ***
  • Posts: 5149
  • Country: pl
Re: NVIDIA Releases Open-Source GPU Kernel Modules
« Reply #32 on: May 15, 2022, 07:42:43 am »
Quote from: Nominal Animal on May 14, 2022, 08:52:12 pm
It might be that NVidia is feeling pressure on this front from the HPC field.  It is well known nowadays that if you do e.g. CUDA stuff, bug-wise you're on your own; nobody outside NVidia can actually help you.
The fact that desktop support is "alpha quality" is a big hint ;)

I think it's not only HPC but also big "machine learning" farms, particularly "cloud" ones made available for hire. They must feel at least a little uneasy about security implications of allowing network-facing applications or applications written by total strangers to call into a proprietary blob in the kernel. And of course, maintaining a tainted kernel is an extra headache too.
 

