EEVblog Electronics Community Forum
Products => Computers => Programming => Topic started by: PlainName on July 01, 2022, 05:44:57 pm
-
Open source bods want people to quit Github because Github/Microsoft abuse open source licences to sell proprietary software without attribution or respecting copyleft licensing. There's a surprise!
https://www.theregister.com/2022/06/30/software_freedom_conservancy_quits_github/
The referenced URL in the article ( https://sfconservancy.org/GiveUpGitHub/ ) is worth a read if you're not sure what this is about.
-
All my code is under the BSD-3 license, so they are free to legally steal it all they want. They won't be in compliance with the attribution clause, but whatever.
For me GitHub is a convenient place to give away my code. Any code I don't want to be used by others is not going to be published anywhere.
-
Ah, some here already know what I think of github.
This is a nail in the coffin.
-
I closed my personal GitHub account about 3 months ago.
Stuck with it on a corporate account. It’s not going down well though: constant Actions outages, poor security controls, and the UI is dog shit.
-
Could someone please help me understand what the bone of contention is? It seems to be that the "Copilot" AI model has been trained on 3rd party open source software, right?
Does that really constitute a use of the 3rd party software which violates the respective licenses? Software is protected by copyright -- which conveys the right to the copyright holder to make that software available to others under a license of their choosing.
But the copyright only stops others from copying the software (or non-trivial pieces thereof) verbatim, right? I can still study the software, I can learn from it, I can create my own implementations by taking cues or (rather small) parts from it.
Isn't that just what the AI does, and isn't that therefore perfectly acceptable use of the 3rd party software? Why would that constitute a copyright violation, and why would it require permission under whichever type of license was used?
-
Yes. It has been demonstrated to write restricted open source code verbatim. This is a massive risk from a liability and legal point of view. It attracts litigation, so people want to distance themselves from it.
If you’re doing reverse engineering, you only ever observe the black-box abstraction or API, for exactly these reasons.
-
Does that really constitute a use of the 3rd party software which violates the respective licenses?
Copilot is known to spit out significant chunks of code verbatim, including typos from the original. Whether those chunks are enough to call it an IP violation I personally don't know. It probably depends.
But just having that as an unanswered question is reason enough not to use Copilot for anything serious. You are just inviting lawsuits.
-
"While we will not mandate our existing member projects to move at this time, we will no longer accept new member projects that do not have a long-term plan to migrate away from GitHub," said Gingerich and Kuhn.
Source: https://www.theregister.com/2022/06/30/software_freedom_conservancy_quits_github/
Never heard of "SFC" before, but this sounds like hypocrisy, IMHO. :-//
(and I'm no fan of Microsoft, only sick of "mandate"-ers and cancelots)
About GitHub Copilot: it happens that I looked into AI code helpers just a few days ago. GitHub Copilot is built on technology recently licensed from OpenAI. OpenAI is a startup from the dudes that came up with GPT-1 and GPT-2, then later GPT-3 (GPT-3 is not FOSS, the other two GPTs are). Codex is the OpenAI product: a GPT-3-type AI that was specialized to generate software starting from plain English.
Codex/Copilot was trained on pretty much all the available sources, including GitHub and Stack Overflow. It's hard to tell whether copyrighted software was used for training, and Copilot-generated code is not verbatim (GPT-3 is a generative type of AI, so it mixes everything it learned; it's not copy-paste code retrieval), so it's hard to prove license infringement.
At first, GH Copilot was free for all to use (like some sort of beta); starting this year, Microsoft sells Copilot access as a subscription for about $10 a month or $100/year. Apparently the free beta was a big success, very productive and almost addictive for many. However, Copilot and future similar products are not expected to be free: they require a lot of hardware to run inference (for example, the minimum hardware for GPT-NeoX - a similar but smaller AI than GH Copilot - is about 50 GB of RAM and at least 2 GPUs with 50 GB of VRAM to answer in 1-3 seconds). Training needs big clusters of GPUs and CPUs, plus piles and piles of source code to train on. This is way more demanding than hosting some projects for free, so I don't think GitHub's Copilot AI will be offered for free.
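As a rough sanity check on those memory numbers (assuming GPT-NeoX's roughly 20 billion parameters stored as 16-bit floats - both figures are my assumptions, not from the article):

params = 20e9            # assumed parameter count (GPT-NeoX-20B)
bytes_per_param = 2      # assumed fp16 weight storage
weights_gb = params * bytes_per_param / 1e9
print(f"{weights_gb:.0f} GB for the weights alone")  # ~40 GB, before activations and working buffers

So the quoted ~50 GB of combined VRAM is at least plausible for inference alone.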
-
This and anything OpenAI builds is cancer for society.
All it does is fuck up the signal to noise ratio.
-
It has been demonstrated to write restricted open source code verbatim.
Copilot is known to spit out significant chunks of code verbatim, including typos from the original.
[...] the Copilot generated code is not verbatim
Now I'm confused. ???
-
It is definitely 100% verbatim. We repro'ed it in house.
-
Isn't that just what the AI does, and isn't that therefore perfectly acceptable use of the 3rd party software? Why would that constitute a copyright violation, and why would it require permission under whichever type of license was used?
Actually there are two things to consider: copyright and license. In some countries, copyright law allows publishing a brief excerpt of a copyrighted work (AKA a quote; in the US: fair use), possibly with a requirement to note the source. Many software licenses talk about a 'derived work', which is a broad term and would arguably cover Copilot's output. The question is how large an excerpt may be while still being acceptable under copyright law.
-
It has been demonstrated to write restricted open source code verbatim.
Copilot is known to spit out significant chunks of code verbatim, including typos from the original.
[...] the Copilot generated code is not verbatim
Now I'm confused. ???
My bad, it's not always verbatim, but it can be sometimes.
It also usually generates different code when the same request is repeated. It also depends on how big the training chunks of code were (maybe they retrained it with smaller chunks during that one-year free beta; maybe they removed from training those projects with more demanding licenses). And there is an adjustable "temperature" parameter for inference queries, something similar to "creativity" in humans (see the sketch below).
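For the curious, here is a minimal sketch of what that temperature knob does during sampling. The model scores (logits) are made up; only the sampling math itself is standard:

import math, random

def sample_with_temperature(logits, temperature=1.0):
    # Scale raw model scores by 1/temperature, softmax them, draw one token index.
    scaled = [x / temperature for x in logits]
    peak = max(scaled)                       # subtract the max for numeric stability
    exps = [math.exp(s - peak) for s in scaled]
    total = sum(exps)
    probs = [e / total for e in exps]
    return random.choices(range(len(logits)), weights=probs, k=1)[0]

logits = [2.0, 1.0, 0.1]                     # three hypothetical candidate tokens
print(sample_with_temperature(logits, 0.1))  # low temperature: nearly always token 0 - closer to verbatim recall
print(sample_with_temperature(logits, 2.0))  # high temperature: probability spreads out - more "creative", less repeatable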
It's not easy to prove license infringement against an army of Microsoft lawyers, but yes, AI helpers can copy typos or full lines, then mix that (or not) with something else, and so on.
The AI behaves like a kid learning to speak. Sometimes kids use the exact words or idioms they hear, but often they combine words in new ways.
-
Isn't that just what the AI does, and isn't that therefore perfectly acceptable use of the 3rd party software? Why would that constitute a copyright violation, and why would it require permission under whichever type of license was used?
Actually there are two things to consider: copyright and license. In some countries, copyright law allows publishing a brief excerpt of a copyrighted work (AKA a quote; in the US: fair use), possibly with a requirement to note the source. Many software licenses talk about a 'derived work', which is a broad term and would arguably cover Copilot's output. The question is how large an excerpt may be while still being acceptable under copyright law.
I think we are talking about the same thing here. Of course copyright and licenses are different things. But use of software in a way which is allowed under copyright (e.g. short quotes, paraphrasing of ideas in a different form...) cannot be restricted by the license.
Hence, if Copilot did not reproduce verbatim the software it was trained on, but only paraphrased it, that should not be an issue. But there seems to be a broad consensus that Copilot does produce chunks of verbatim copies, and at least widespread concern that these chunks are substantial enough to constitute copyright violations (and hence license violations, unless the license is very generous).
-
It seems to be that the "Copilot" AI model has been trained on 3rd party open source software, right?
Does that really constitute a use of the 3rd party software which violates the respective licenses? Software is protected by copyright -- which conveys the right to the copyright holder to make that software available to others under a license of their choosing.
Some of the code it used was copylefted, so anything derived from it must at least give proper attribution. Some of the code even goes as far as to prevent use in a server scenario.
There is an argument (pushed by GitHub, of course) that the Copilot stuff is a bit like a compiler, and you can't (well, you could, but no one really does) restrict what, say, gcc does. But that misses the point that the vast code database they snaffled is essentially equivalent to the source input to the compiler, and the output certainly does retain the licensing restrictions placed on the source.
The main beef seems to be that GitHub, whose raison d'être is to manage open source, uses that source to produce a closed-source app they then flog to make lots of wonga off, without giving a shit about any license terms that might be involved. Even if you ignore the legal aspect, it's pretty shit morally. Is that really who you want to be the de facto controller of open source? Because that's what they are: although the point of git was to do away with centralised control, GitHub is exactly that centralised repository and controller, and managed by Microsoft to boot!
-
Even if it didn't spit out code *verbatim*, automatically feeding AI with tons of source code under various licenses, ultimately for being reused (even if in some "digested" form) in projects with incompatible licenses, is already a very serious and questionable matter IMO.
It's not just about using tricks to *barely* get away from copyright infringement, which I have no doubt the "AI" used here will manage over time. It won't be too hard to make it learn to rewrite pieces of code in a manner that makes said code hard for humans to recognize. After a few nasty lawsuits, engineers working on this "AI" will no doubt be tasked with exactly that.
Can some automated transformation of copyrighted material be considered original work? Uh, yeah. That can of worms looks pretty nice.
But this will likely be an endless race - unless politics starts to meddle - so over time, as ML gets used more and more, you can also expect ML to be used to spot copyright infringements that would otherwise be hard for humans alone to spot. (A crude sketch of the simplest end of that tooling follows.)
There's a lot to consider here - I'm just scratching the surface.
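To make the detection side concrete, here is the simplest possible non-ML sketch, using only Python's standard library; both snippets are invented examples. Real detectors would compare token streams or ASTs (and increasingly use ML) to survive heavier rewriting:

from difflib import SequenceMatcher

# An "original" FOSS snippet (made up for illustration).
original = """
def checksum(data):
    total = 0
    for byte in data:
        total = (total + byte) & 0xFF
    return total
"""

# The same logic with the function and parameter renamed - the kind of
# superficial rewriting discussed above.
suspect = """
def calc_sum(buf):
    total = 0
    for byte in buf:
        total = (total + byte) & 0xFF
    return total
"""

# ratio() returns 0.0-1.0; near-verbatim copies still score high despite renames.
print(f"similarity: {SequenceMatcher(None, original, suspect).ratio():.2f}")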
-
Bloke (https://twitter.com/WAHa_06x36/status/1410185682887270401?s=20&t=YlNknDT16au3c_OT8ZKnYQ) (or blokess, or I guess just a blok) on twitter has pointed out their two-facedness:
This would seem to imply that it's OK to take leaked Windows source code and train an ML model on it, and release that to the public.
Is that your interpretation, and if not, how does it differ from doing the same for code under the GPL and stripping the GPL?
-
Bloke (https://twitter.com/WAHa_06x36/status/1410185682887270401?s=20&t=YlNknDT16au3c_OT8ZKnYQ) (or blokess, or I guess just a blok) on twitter has pointed out their two-facedness:
This would seem to imply that it's OK to take leaked Windows source code and train an ML model on it, and release that to the public.
Is that your interpretation, and if not, how does it differ from doing the same for code under the GPL and stripping the GPL?
Well they probably scanned that too :-DD
https://github.com/onein528/NT5.1
https://github.com/ZoloZiak/WinNT4
Has been useful having that code to refer to.
-
About the ethics: useful AI applications are pretty new, so not yet regulated by law. For now, whoever has the deepest pockets wins.
In fact, there is at least one AI regulation already: the law allowing autonomous AI weapons. :palm:
Not kidding, that's a fact. Combine this with mass surveillance of individuals, and you get a very dystopian future of obedience.
The law passed in the USA about a year ago, and was justified as a necessity, a "zero-sum game" against similar autonomous AI weapons presumably being developed by other nations. Understandable. If a new technology appears, sooner or later it will be used as a weapon.
-
There's money, and then there is being honest. When I wrote my dissertation in 1970, I gave credit to chemists who published in the 1890s. No money or plagiarism involved. But had I omitted those references, I am sure my review committee would have, at a minimum, raised eyebrows. Make attribution a requirement with clearly defined consequences.
In the US, the way to do that is to establish a reasonable, enforceable "liquidated damages" clause of, say, $10,000 per violation for non-attribution. Make the amount enough to discourage the practice of plagiarism, but not ridiculous. Period.
-
The code generated by Copilot currently seems not to be a work, due to how the Berne Convention works. That's in line with earlier US Copyright Office opinions and US court verdicts, and was confirmed in the most recent case.(1)(2) US law is not binding on courts elsewhere, but:
- For the matter discussed, the copyright law of the United States is the primary concern, as this is where “everything happens”.
- That is tightly bound to the very core of the Berne Convention, which applies virtually everywhere on this planet.
If that is not a work, I find the entire discussion about license breaches by the generated code moot.
What in my opinion is more important is whether a trained model is considered a derived work. I see that as much more interesting: if only a binary answer is possible, I can conceive of three options:
- A trained model is not a work at all.
  Implies: under the existing copyright regime it has no protection. A new set of IP laws may be forged, but that takes time and gives an opportunity to influence them considerably more easily than is possible with copyright.
- A trained model is a work, and is a derived work.
  Implies: a requirement to abide by the licensing terms, including both attribution and granting various rights to the licensees.
- A trained model is a work, and is not a derived work.
  Implies: opposition on these grounds is not possible, and creators of later models may use works for training in a similar fashion.
One of the contention points with Copilot is also that it was trained on FOSS sources and data leeched from the community, but avoided touching the proprietary works to which GitHub also has access. Just because that’s possibly legal doesn’t mean it is perceived as acceptable. My own gripe with Copilot is of a different nature. Machine learning at that scale is a relatively new subject facing many philosophical challenges. I think it is still too early to say definitively that Copilot differs significantly, in qualitative terms, from a programmer acquiring knowledge by reading sources. The pain, which I find understated in the opposition, is the possibility of making software development dependent on such tools. You may say “it’s like a calculator to a mathematician”. But it becomes a problem if - in order to be able to compete with other developers - you are forced to use that calculator, and it’s almost guaranteed there will only be a few calculator manufacturers, which will use their position to push abusive licensing terms.
Returning to the first paragraph, that situation gives rise to another interesting and very complex question. If Copilot’s output is not a copyrightable work, any program written with it would seem to contain fragments that can’t be protected by copyright. What could that imply if the owner of the entire program claims infringement, but it turns out the infringement applies only to such a fragment? What if the defendant could prove that? If that’s a possibility, whose obligation is it to provide the proof, and what should it look like? Though hypothetical and a thought experiment in nature, the extreme cases of that problem are pretty intriguing.
(1) https://www.theverge.com/2022/2/21/22944335/us-copyright-office-reject-ai-generated-art-recent-entrance-to-paradise
(2) Second Request for Reconsideration for Refusal to Register A Recent Entrance to Paradise, US Copyright Office (February 2022) (https://www.copyright.gov/rulings-filings/review-board/docs/a-recent-entrance-to-paradise.pdf)
-
It doesn't matter whether or not it meets the legal definition of a work.
What matters is how much of your capital you burn when someone accuses you of a license violation. This opens up a whole new world of litigation opportunities, so the only option with respect to utilising what the tool outputs is to ban it under policy.
As I mentioned earlier, it also has an impact on quality and on the normal functioning of society.
I'm not suggesting we go all Dune on this thing's ass, but ML should be relegated to assistance only (inference only), never to creating new works of any kind. That includes GPT-n derivatives, this garbage, and image generation. All we end up doing is cheaply corrupting society with a new decay cycle.
-
I understand that concern, but AI code companions are not much different from an online search followed by copying/adapting the found code.
I guess some years from now, people will wonder how it was even possible to write code manually by just reading the docs. ;D
"How did programmers survive prior to AI companions era?"
( https://www.eevblog.com/forum/microcontrollers/how-did-you-survive-prior-to-the-internet-making-information-easy-to-find/ )
-
It doesn't matter whether or not it meets the legal definition of a work. What matters is how much of your capital you burn when someone accuses you of a license violation. This opens up a whole new world of litigation opportunities, so the only option with respect to utilising what the tool outputs is to ban it under policy.
And of those litigants, around 1 in 1000 understands what software actually is. I bailed on **ithub back before Covid. The whole space was becoming an overloaded code dumpster. Version control, wot dat?
Copilot is clever. Which is why it's the GOSUB for programmers who wish to graduate their homework from JavaScript to Python. But for a pro developer, there's always the feeling that the solution requires a brain, not a farmyard of GPUs/CPUs. Organic programmers are cleverer.
As for the problem of code plagiarism: recently, YouTuber Fran Blanche, a.k.a. FRANLAB, received a copyright strike claiming the sound of the wind on the public-domain film she aired was owned by Dizony. It wasn't, but she couldn't prove otherwise. So what happens when some dumb A.I. bot finds a conditional, collection or object using the same variable name as code from one of their clients' code dung heaps? Is **itHub asked to remove the offending code?
So is this code plagiarism?
printf "Hello World"
-
Well, I dumped GitHub because it has been overrun by ticket-closing bots. It's impossible to get anything fixed or to get any attention, and getting steamrolled by vendors was all too common. They close or delete the tickets. I found a large configuration bug in .NET Core which led to telemetry being leaked out, and my ticket was shitcanned :-//. Then this...
A good secondary point there. This is going to be weaponised. The ML (I refuse to use the term AI here) code writers will be pitted against the ML litigation finders.
The best answer is to keep your code inside your security perimeter.
Edit: mini rant. Oh, and most of the actually valuable open source projects appear to be run by slightly more deranged versions of Comic Book Guy from The Simpsons. Decided I'd stick to commercial Unix variants after that, as I know enough people at the company in question to be able to slide stuff in through the back door.
-
I understand that concern, but AI code companions are not much different from an online search followed by copying/adapting the found code.
I guess some years from now, people will wonder how it was even possible to write code manually by just reading the docs. ;D
"How did programmers survive prior to AI companions era?"
( https://www.eevblog.com/forum/microcontrollers/how-did-you-survive-prior-to-the-internet-making-information-easy-to-find/ )
Not technically, no. But does the AI code companion tell you the license of the code snippet it just generated for you, which it obviously derived from FOSS training data? It does not. If you find the source code yourself, you know what you copied and which license is attached to it.
Since an AI is not "creative", everything it spits out is "derived work". It's like copying FOSS code and then obfuscating it to obscure its origin.
-
You can't rationalise the source of the inferred output because the network is non-deterministic over time. You'd have to ship the entire corpus and input data with every single request to get a deterministic account of how you got from A to Z. So your 16 kB of source code might have 50 TB of inference data to drag around. Obviously another stupid flaw in ML.
We had a similar thing a few years back in financial modelling. The model occasionally failed spectacularly, including issuing seven-figure credit caps to bankrupt people and convicted criminals. It went back to basic gated rule and risk calculation almost immediately, as no one could work out how it had come up with a weighted outcome that bad. Fortunately, the model ran alongside the original credit-profiling system and never got to the stage where it was allowed to make autonomous decisions.
-
does the AI code companion tell you the license of the code snippet it just generated for you, which it obviously derived from FOSS training data?
Depends on the AI companion. I don't know about the Microsoft one (GitHub Copilot), but some do show the license. For example, the free "Codeon" extension (https://github.com/sdpmas/Codeon) in VSCodium shows each snippet's license.
-
Edit: mini rant. Oh, and most of the actually valuable open source projects appear to be run by slightly more deranged versions of Comic Book Guy from The Simpsons.
I know that 'forking fat git' only too well. Not ONE line of his code is an undocumentable clusterf*ck of memory leaks. Not ever. So no tickets, as they'll be marked complete by the author. He uses the special Jedi 'feel the force' licence.
-
Open source bods want people to quit Github because Github/Microsoft abuse open source licences to sell proprietary software without attribution or respecting copyleft licensing. There's a surprise!
https://www.theregister.com/2022/06/30/software_freedom_conservancy_quits_github/
The referenced URL in the article ( https://sfconservancy.org/GiveUpGitHub/ ) is worth a read if you're not sure what this is about.
What a surprise... what do people expect from MS? OSS love and money... ???
Anyone not old enough to have seen MS's plain old M.O. in action will be surprised.
They have entered step 2 of their M.O. algorithm: E.E.E.
- EMBRACE: they have been trying hard to do so with OSS.
- EXTEND: putting that amount of money into GitHub (OSS by default) never made sense otherwise...
- EXTINGUISH: as soon as they manage to extend it enough to be unmanageable, insecure, and unreliable, they will just drop it all... and promote their own products.
They have done that countless times with hundreds of other, smaller competitors.
This time the target is OSS... but the algorithm is still the same.
Soon the 3rd step will be visible... we can already see the disruption and the results all over.
Paul
-
This is all corporate interest. Red Hat are just as bad.
-
This is all corporate interest. Red Hat are just as bad.
Red Hat made it, and was extinguished by the 2000s.
I used it for some deployments in the late 90s... a completely nice and safe thing.
Once they managed to land large contracts, they became no different from all the other players...
Just in it for the money - dumping on the ordinary Joe consumers and terminal kiddos...
Red Hat is 100% gone as OSS... nevertheless the GPL, clever as it is, keeps them on the leash...
otherwise the new owner would just drop a huge brownie on our heads...
Paul
-
Um, RH still employs a big chunk of Linux core engineering. systemd came from there.
I’ve got 850 RHEL boxes in production too.
-
Um, RH still employs a big chunk of Linux core engineering. systemd came from there.
I’ve got 850 RHEL boxes in production too.
They have locked in the big contracts...
They are not in the game for OSS any more...
The POTTERIX twisted thingo and the several dozen other things linked to it are just there to promote their own agenda and keep the contracts coming, as expected... with something that can be dropped more or less like the MS wonderland corporate kiosk...
They are just leveraging corporate kiosks as the core business...
Again... the GPL and BSD (and other) licences keep them tied to the leash...
otherwise... otherwise they would just tell everybody to FKO!
Paul
-
Red Hat charge for support, not the code. When they make or contribute stuff it is open source. They also properly attribute code. Essentially, they are no different from some IT bloke you hire to install a web server and keep it up to date and properly managing the code (except they'll know a hell of a lot more about it).
-
Exactly that. To be fair they do a decent job.
-
There are two sides to this coin.
First, they are bound by the GPL/LGPL etc. nature of their whole ecosystem,
so they are required to do that either way.
Second, the size of their pockets...
We just cannot put Red Hat in a side-by-side comparison...
let's try SuSE... and even that is hard...
Their pockets allowed them to manage, twist, and bias the whole thing...
toward, of course, their own agenda...
Nothing wrong with making money...
But classifying Red Hat today as the same thing it was in 2000 is not possible anymore...
In particular... the total fork of *NIX into the POTTERIX realm...
kind of a corporate kiosk agenda - required for that kind of use...
They are gone as OSS, but they just cannot drop the leash.
Thanks, GPL/LGPL.
Paul