Data Cooperatives and Open Source AI

Part of the Deep Dive: AI Webinar Series

Data cooperatives have been proposed as a possible remedy for the current power disparity between the citizens and internet users who generate data and the corporations that process it. But these cooperatives may also evolve to develop their own AI models based on the pooled data. The move to develop machine learning may be driven by a need to make the cooperative sustainable or to address a need of the people pooling the data. The cooperative may consider ‘opening’ its machine learning model even if the data is not open. In this talk we will use Uli, our ongoing project to respond to gendered abuse in Indian languages, as a case study to describe the interplay between community-pooled data and open source AI. Uli relies on instances of abuse annotated by activists and researchers at the receiving end of gendered abuse. This crowdsourced data has been used to train a machine learning model to detect abuse in Indian languages. While the data and the machine learning model were made open source for the beta release, in subsequent iterations the team is considering limiting the data that is opened. This is, in part, a recognition that the project is compensating for the lack of adequate attention to non-Anglophone languages by trust and safety teams across platforms. This talk will explore the different models the team is considering for licensing the data and the machine learning models built on it, and the tradeoffs between economic sustainability and public good creation in each.

Webinar Summary

In this webinar, hosted by the Open Source Initiative as part of the “Deep Dive: Defining Open Source AI” series, Tarunima Prabhakar and Siddharth Manohar from the Tattle Civic Tech organization in India discuss the challenges of reconciling open source principles with the realities of data cooperatives and machine learning in the context of addressing online abuse, particularly abuse targeting marginalized genders in India. They explore two potential scenarios for differential access to data: one based on the purpose of use (distinguishing between academic, nonprofit, and commercial users) and the other based on the timeliness and relevance of the data. They also consider the implications of these data licensing scenarios for machine learning model licensing, highlighting the balance between openness, transparency, and preventing abuse. The webinar emphasizes data justice, the importance of recognizing the labor involved in data annotation, and the need to protect against misuse of data while enabling legitimate use cases.