Open source AI between enablement, transparency and reproducibility

Part of the Deep Dive: AI Webinar Series

Open source AI is a misnomer. AI, notably in the form of machine learning (ML), is not programmed to perform a task but to learn a task on the basis of available data. The learned model is simply a new algorithm trained to perform a specific task, but it is not a computer program proper and does not fit squarely into the protectable subject matter scope of most open source software licences. Making available the training script or the model’s ‘source code’ (eg, neural weights), therefore, does not guarantee compliance with the OSI definition of open source as it stands because AI is a collection of data artefacts spread across the ML pipeline.
The ML pipeline is formed by processes and artifacts that focus on and reflect the extraction of patterns, trends and correlations from billions of data points. Unlike conventional software, where the emphasis is on the unfettered downstream availability of source code, in ML it is transparency about the mechanics of this pipeline that takes centre stage.
Transparency is instrumental for promoting use maximisation and mitigating the risk of closure as fundamental tenets of the OSS definition. Instead of focusing on single computational artefacts (eg, the training and testing data sets, or the machine learning model), a definition of open source AI should zoom in on the ‘recipe’, ie the process of making a reproducible model. Open source AI should be less interested in the specific implementations protected by the underlying copyright in source code and much more engaged with promoting public disclosure of details about the process of ‘AI-making’.
The definition of open source software has been difficult to apply to other subject matter, so it is not surprising that AI, as a fundamentally different form of software, may similarly require another definition. In our view, any definition of open source AI should therefore focus not solely on releasing neural network weights, training script source code, or training data, important as they may be, but on the functioning of the whole pipeline such that the process becomes reproducible. To this end, we propose a definition of open source AI which is inspired by the written description and enablement requirement in patent law. Under that definition, to qualify as open source AI, the public release should disclose details about the process of making AI that are sufficiently clear and complete for it to be carried out by a person skilled in machine learning.
This definition is obviously subject to further development and refinement in light of the features of the process that may have to be released (eg, model architecture, optimisation procedure, training data etc.). Some of these artefacts may be covered by exclusive IP rights (notably, copyright), others may not. This creates a fundamental challenge with licensing AI in a single package.
One way to deal with this conundrum is to apply the unitary approach known from the European case law on video games (eg, the ECJ Nintendo case) whereby if we can identify one expressive element that attracts copyright protection (originality), this element would allow us to extend protection to the work as a whole. Alternatively, we can adopt the more pragmatic and technically correct approach to AI as a process embedding a heterogenous collection of artefacts. In this case, any release on open source terms that ensures enablement, reproducibility and downstream availability would have to take the form of a hybrid licence which grants cumulatively enabling rights over code, data, and documentation.
In this session, we discuss these different approaches and how the way we define open source AI and the objectives pursued with this definition may predetermine which licensing approach should apply.

Webinar summary

In this webinar hosted by the Open Source Initiative as a part of the “Deep Dive: Defining Open Source AI” series, Ivo Emanuilov and Jutta Suksi discuss open source AI and its components with a focus on understanding the various phases and layers involved in AI development, particularly with regard to copyright issues. Their presentation highlights the hybrid nature of AI, blending data, software, and other components, leading to complex questions about intellectual property rights. It explores the idea of replicating open source principles for AI by emphasizing transparency, enablement, and reproducibility as key principles. Additionally, Ivo and Jutta examine existing AI licenses and their focus on data, restrictions, weights, and combinations of data, executable models, and source code. This presentation emphasizes the rapidly evolving nature of AI licenses and the importance of considering regulatory requirements in the development of open source AI.