Frameworks for managing sensitive data
The challenges of inter-organisational collaborative R&D are compounded when access is constrained by sensitivities in the data or models themselves.
While a range of techniques have been adopted for collaborative data science involving sensitive data or models, collaboration remains problematic when access to operational data or models is prohibited for legal, security, ethical, or practical reasons. In our case, the initial work in developing CSAM classifiers brought this issue to the foreground, as Australian law makes any direct access to CSAM illegal for researchers outside of law enforcement.
This situation led directly to the commissioning of infrastructure and protocols that facilitate such collaboration without the possibility of breaching the legal restrictions placed on the data. Originally developed by AiLECS researchers in conjunction with CSIRO/Data 61, this infrastructure was dubbed ‘Data Airlock’ to emphasise the separation of workloads from restricted data: in our case, the CSAM data sets used to train, validate, and test models during development.
Our experience in deploying and using the first iteration of the Data Airlock has exposed a number of assumptions and shortcomings in our implementation, leading to an additional set of requirements for interoperability and scalability to support a distributed and heterogeneous research community. The design and implementation of a second version of the Data Airlock is currently underway. It will open up possibilities across research domains: running or comparing models without disclosing their technical detail, and applying different restricted-data ‘recipes’ based on access and trust criteria.
Such a highly secure, federated platform will enable controlled ‘eyes-off’ access to large, sensitive data sets, facilitating collaboration between disparate data holders and researchers across a variety of problem domains.