Open Source AI Definition – Weekly update June 17

Explaining the concept of Data information

After much debate regarding training data, @stefano published a summary of the positions expressed and some clarifications about the terminology included in draft v.0.0.8. You can read the rationale about it and share your thoughts on the forum.
Initial thoughts:
- @Senficon (Felix Reda) adds that while the discussion has highlighted the case for data information, it’s crucial to understand the implications of copyright law on AI, particularly concerning access to training data. Open Source software relies on a legal element (copyright licenses) and an access element (availability of source code). However, this framework does not seamlessly apply to AI, as different copyright regimes allow text and data mining (TDM) for AI training but not the redistribution of datasets. This discrepancy means that requiring the publication of training datasets would make Open Source AI models illegal, despite TDM exceptions that facilitate AI development. Also, public domain status is not consistent internationally, complicating the creation of legally publishable datasets. Consequently, a definition of Open Source AI that imposes releasing datasets would impede collaborative improvements and limit practical significance. Emphasizing data innovation can help maintain Open Source principles without legal pitfalls.

@amcasari expresses concern about the usability and neutrality of the “Model Openness Framework” (MOF) for identifying AI systems, suggesting it doesn’t align well with current industry practices and isn’t ready for practical application without further feedback and iteration.
@shujisado points out that the MOF’s classification of components doesn’t depend on the specific IP laws applied, but rather on a general legal framework, and highlights that Japan’s IP law system differs from the US and EU, yet finds discussions based on the OSD consistent.
@stefano emphasizes the importance of having well-thought-out, timeless principles in the Open Source AI Definition document, while viewing the Checklist as a more frequently updated working document. He also supports the call to see practical examples of the framework in use and proposes separating the Checklist from the main document to reduce confusion.

Reviews of eleven different AI systems have been published. We do these review to check existing systems compatibility with our current definition. These are the systems in question: Arctic, BLOOM, Falcon, Grok, Llama 2, Mistral, OLMo, OpenCV, Phy-2, Pythia, and T5.
- @mer has set up a review sheet for the Viking model upon request from @merlijn-sebrechts.
- @anatta8538 asks if MLOps is considered within the topic of the Model Openness Framework and whether CLIP, an LMM, would be consistent with the OSAID.
- @nick clarifies that the evaluation focuses on components as described in the Model Openness Framework, which includes development and deployment aspects but does not cover MLOps as a whole.

@Alek_Tarkowski agrees that certification of open-source AI will be crucial under the AI Act and highlights the importance of defining what constitutes an Open Source license. He points out the confusion surrounding terms like “free and open source license” and suggests that the issue of responsible AI licensing as a form of Open Source licensing needs resolution. Notes that some restrictive licenses are gaining traction and may need consideration for exemption from regulation, thus urging for a consensus.

Slides and the recording of our previous townhall meeting can be found here.