Federated Learning: A Paradigm Shift for Secure and Private Data Analysis

Part of the Deep Dive: AI Webinar Series

In many situations, the data relevant to a machine learning problem are distributed across multiple locations and cannot be shared for regulatory, competitive, security, or privacy reasons. Federated Learning (FL) is a promising approach for learning a joint machine learning model over all the available data across silos without transferring the data to a centralized location. Federated Learning was originally introduced by Google in 2017 for next-word prediction on edge devices [1]. Since then, it has seen broad applicability across multiple disciplines, especially in healthcare [2], finance, and manufacturing.

Federated Learning Training
Typically, a federated environment consists of a centralized server and a set of participating devices. Instead of sending their raw data to the central server, devices only send the local model parameters trained over their private data. This computational approach fundamentally changes how machine learning and deep learning models are traditionally trained. Whereas centralized machine learning requires data to be aggregated in a single location, Federated Learning allows data to remain at its original location, improving data security and reducing associated data privacy risks. When Federated Learning is used to train models across multiple edge devices, e.g., mobile phones, sensors, and the like, it is known as cross-device FL; when applied across organizations, it is known as cross-silo FL.
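The training loop described above can be sketched in a few lines. The following toy example illustrates one round of federated averaging (FedAvg [1]) for a one-parameter linear model; the client data, learning rate, and function names are illustrative assumptions, not part of any specific framework, and a real deployment would train full neural networks on-device.

```python
# Toy sketch of federated averaging (FedAvg): clients train locally on
# private data and send only model parameters; the server aggregates them.

def local_update(global_w, data, lr=0.1):
    """Each client refines the global parameter on its private data.
    Here: one gradient step on a 1-D least-squares model y = w * x."""
    grad = sum(2 * (global_w * x - y) * x for x, y in data) / len(data)
    return global_w - lr * grad

def fedavg(global_w, client_datasets):
    """Server averages local parameters, weighted by local dataset size.
    Only parameters travel to the server; raw data never leaves a client."""
    updates = [(local_update(global_w, d), len(d)) for d in client_datasets]
    total = sum(n for _, n in updates)
    return sum(w * n for w, n in updates) / total

# Two clients whose private data follow y = 2x; the server never sees (x, y).
clients = [[(1.0, 2.0), (2.0, 4.0)], [(3.0, 6.0)]]
w = 0.0
for _ in range(50):  # federation rounds
    w = fedavg(w, clients)
print(round(w, 2))  # converges toward the true slope, w = 2.0
```

The weighting by dataset size matters in practice: clients typically hold different amounts of data, and an unweighted average would bias the global model toward small clients.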

Secure and Private Federated Learning
Federated Learning addresses some data privacy concerns by ensuring that sensitive data never leaves the user’s device. Individual data remains secure and private, significantly reducing the risk of data leakage, while users actively participate in the data analysis process and maintain complete control over their personal information. However, Federated Learning is not always secure and private out of the box. The federated model can still leak sensitive information if not adequately protected [3], and an eavesdropping adversary can still observe the federated training procedure through the communication channels. To mitigate these risks, Federated Learning has to be combined with privacy-preserving and secure data analysis mechanisms, such as Differential Privacy [4] and Secure Aggregation protocols [5]. Differential Privacy ensures that sensitive personal information remains protected even under unauthorized access to the model, while Secure Aggregation protocols enable model aggregation even under collusion attacks.
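The core idea behind Secure Aggregation [5] can be illustrated with pairwise masking: every pair of clients agrees on a shared random mask, one client adds it and the other subtracts it, so all masks cancel in the server's sum while each individual masked update looks random on its own. This is a simplified sketch under strong assumptions (a shared seed standing in for cryptographic key agreement, and no client dropouts, both of which the real protocol handles):

```python
import random

def masked_updates(updates, seed=0):
    """Add a pairwise-canceling random mask to each client's update.
    For each pair (i, j), client i adds a shared mask and client j
    subtracts it, so the masks vanish in the aggregate sum."""
    rng = random.Random(seed)  # stand-in for pairwise key agreement
    masked = list(updates)
    n = len(masked)
    for i in range(n):
        for j in range(i + 1, n):
            m = rng.uniform(-100, 100)  # mask shared by clients i and j
            masked[i] += m
            masked[j] -= m
    return masked

updates = [0.5, 1.25, -0.75]     # clients' private model updates
masked = masked_updates(updates)
# Each masked value looks random, yet the server's sum is exact:
print(sum(masked), sum(updates))
```

Because the server only ever needs the sum of updates, it learns the aggregate model without learning any individual client's contribution, which is exactly what makes the scheme robust to a curious server colluding with a subset of clients.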

In a data-driven world, prioritizing data privacy and secure data analysis is not just a responsibility but a necessity. Federated Learning emerges as a game-changer in this domain, empowering organizations to gain insights from decentralized data sources while safeguarding data privacy. By embracing Federated Learning, we can build a future where data analysis and privacy coexist harmoniously, unlocking the full potential of data-driven innovations while respecting the fundamental rights of privacy.

[1] McMahan, Brendan, Eider Moore, Daniel Ramage, Seth Hampson, and Blaise Aguera y Arcas. “Communication-efficient learning of deep networks from decentralized data.” In Artificial intelligence and statistics, pp. 1273-1282. PMLR, 2017.

[2] Rieke, Nicola, Jonny Hancox, Wenqi Li, Fausto Milletari, Holger R. Roth, Shadi Albarqouni, Spyridon Bakas et al. “The future of digital health with federated learning.” NPJ digital medicine 3, no. 1 (2020): 119.

[3] Gupta, Umang, Dimitris Stripelis, Pradeep K. Lam, Paul Thompson, Jose Luis Ambite, and Greg Ver Steeg. “Membership inference attacks on deep regression models for neuroimaging.” In Medical Imaging with Deep Learning, pp. 228-251. PMLR, 2021.

[4] Dwork, Cynthia. “Differential privacy.” In International colloquium on automata, languages, and programming, pp. 1-12. Berlin, Heidelberg: Springer Berlin Heidelberg, 2006.

[5] Bonawitz, Keith, Vladimir Ivanov, Ben Kreuter, Antonio Marcedone, H. Brendan McMahan, Sarvar Patel, Daniel Ramage, Aaron Segal, and Karn Seth. “Practical secure aggregation for privacy-preserving machine learning.” In proceedings of the 2017 ACM SIGSAC Conference on Computer and Communications Security, pp. 1175-1191. 2017.

Webinar summary

In this webinar hosted by the Open Source Initiative as part of the “Deep Dive: Defining Open Source AI” series, Dimitris Stripelis, a research scientist at the Information Sciences Institute at the University of Southern California, discusses federated learning as a promising approach to secure and private data analysis. Traditional centralized machine learning faces challenges due to data sharing restrictions imposed by various data regulations worldwide. Federated learning offers a solution in which data remains decentralized and only local model parameters are shared. This approach has applications in healthcare, data engineering, mobile/IoT devices, pharmaceuticals, finance, and more. Challenges in federated learning include computational and statistical heterogeneity, semantic heterogeneity, and data fragmentation, all of which impact model performance. Security and privacy concerns necessitate encryption methods, such as fully homomorphic encryption, to protect sensitive data. Federated learning frameworks like Metis are emerging, but standardization and benchmarking efforts are still ongoing. The goal is for federated learning to become the standard for distributed AI model training across various domains.