The Data Airlock

One of the challenges in the AiLECS lab is finding ways to deeply collaborate with our AFP colleagues in machine learning model development. It’s one thing to sit around a (perhaps virtual) whiteboard, spitballing architectures and algorithms and talking in generalities about data. But when it comes to actually working together to develop and tune models using real-world operational data, as shown in Figure 1, things rapidly become tricky: this data cannot ordinarily be shared.

Sensitive and Restricted Data

AFP internal stakeholders generate and reference operational data as part of their day-to-day workflow. This sensitive, internal data can take a multitude of forms, including structured data, images, text, video, audio, and binary (e.g. device images), along with contextual metadata. These internal domain experts rely on technical staff to develop and enhance intelligence and investigative capabilities. The technical staff, in turn, are data experts with an understanding of the formats, volumes, and content of this data, together with an appreciation of the technical and social challenges of working with it. The problem arises when those involved in capability development wish to collaborate outside the organisation – for example, with the AiLECS lab.

Figure 1: The collaboration environment

Of course, we can go some way towards granting access by implementing various social measures: not only by granting security clearances to lab members, but also through non-disclosure agreements and procedural or legal dispensations that facilitate or broaden data sharing.

Similarly, we can impose Environmental Constraints that aim to limit the physical contexts of access, preventing the (internal) inadvertent exposure or intentional sharing, or (external) theft, of sensitive material beyond the limits imposed by social measures. For example, material may be encrypted both during transmission to the access context and at rest when not in active use. More significant has been the development of Data Safe Havens: secure analytics environments that comprise “appropriate technical and governance controls which are effectively audited and are viewed as trustworthy by diverse stakeholders”. Examples of such data safe havens include the UK’s Secure Anonymised Information Linkage (SAIL) Databank of health and other public service data, and the Australian E-Research Institutional Cloud Architecture (ERICA).
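To make the encryption measure concrete, the sketch below shows the at-rest half of the picture using the Fernet recipe from Python’s cryptography package. The file names are hypothetical, and key management (the hard part in practice) is deliberately out of scope.

```python
# Minimal at-rest encryption sketch using the 'cryptography' package.
# Assumes the key is generated and held by the data custodian; file
# names are illustrative only.
from cryptography.fernet import Fernet

key = Fernet.generate_key()            # in practice: custodian-held, never shared
cipher = Fernet(key)

with open("records.csv", "rb") as f:   # hypothetical sensitive dataset
    token = cipher.encrypt(f.read())

with open("records.csv.enc", "wb") as f:
    f.write(token)                     # only ciphertext is stored or transmitted
```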

Additionally, Anonymisation Methods can sometimes be employed where the identity of data subjects cannot be disclosed, even though the bulk of the material may be shared. At the most basic level, data may be filtered, with individual identifying fields or metadata removed or masked in order to anonymise records. In some contexts, datasets can be permuted or ‘sliced’, numerically summarised, or perturbed so that a complete view of the material is not disclosed – as with Differential Privacy, which “addresses the paradox of learning nothing about an individual while learning useful information about a population”.
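As a rough illustration of these two measures, the Python sketch below drops a direct identifier, coarsens a quasi-identifier, and releases a Laplace-noised count (the basic mechanism behind differential privacy). The field names and the epsilon value are hypothetical.

```python
# Masking and perturbation sketch; records and fields are invented.
import numpy as np

records = [
    {"name": "A. Person", "postcode": "3800", "age": 34},
    {"name": "B. Person", "postcode": "3801", "age": 41},
]

# Masking: drop direct identifiers, coarsen quasi-identifiers.
masked = [{"postcode": r["postcode"][:2] + "**", "age": r["age"]} for r in records]

def dp_count(values, epsilon=1.0):
    # Laplace mechanism: the sensitivity of a count query is 1,
    # so noise is drawn with scale 1/epsilon.
    return len(values) + np.random.laplace(scale=1.0 / epsilon)

print(masked)
print(dp_count([r for r in records if r["age"] > 35]))
```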

‘Eyes Off’ Access

But such measures are predicated on the assumption that at least some of the data may be shared in the first place. How can we proceed if the data cannot be shared, in any form, outside of Law Enforcement organisations (or indeed organisational units)? For example, in the case of our main project developing automated classifiers for Child Sexual Abuse Material (CSAM), the data (and metadata) is subject to a legislative prohibition of the possession of, or access to, any “material that depicts or describes activity relating to child sexual abuse”. How can we work together with such material without the possibility of breaching the legal restrictions placed on the data?

To be sure, there are technical components that attempt to address this restricted-data problem. For example: Data Diodes (hardware devices that physically enforce a one-way flow of data between nodes or networks); Trusted Execution Environments (TEEs) (with secure memory and computational partitions, enabling isolation of specific workloads); and cryptographic approaches (such as Provably Secure Protocols or Homomorphic Encryption) have all been floated as possible solutions. However, none of these satisfactorily addresses training workloads using restricted data. For example, TEEs have no inherent separation between a trusted workload and its data (so the risk of data exfiltration remains) and impose significant performance overheads for complex datasets. Similar performance issues are currently characteristic of cryptographic approaches for all but the simplest of models and data types.
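To make the homomorphic idea concrete, here is a toy Paillier sketch in Python (tiny parameters, not remotely secure) showing addition carried out directly on ciphertexts. The performance gap mentioned above comes from scaling exactly this kind of modular arithmetic to real models and data.

```python
# Toy Paillier cryptosystem: multiplying ciphertexts decrypts to the sum
# of the plaintexts. Parameters are far too small to be secure; this is
# purely to illustrate the additive homomorphic property.
import math
import random

p, q = 61, 53                       # toy primes, illustration only
n, n2 = p * q, (p * q) ** 2
g = n + 1
lam = math.lcm(p - 1, q - 1)
mu = pow(lam, -1, n)                # valid shortcut because g = n + 1

def encrypt(m):
    r = random.randrange(1, n)
    while math.gcd(r, n) != 1:
        r = random.randrange(1, n)
    return (pow(g, m, n2) * pow(r, n, n2)) % n2

def decrypt(c):
    return ((pow(c, lam, n2) - 1) // n) * mu % n

a, b = encrypt(12), encrypt(30)
assert decrypt((a * b) % n2) == 42  # the sum, computed without decrypting
```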

Our solution was to develop infrastructure that we dubbed the ‘Data Airlock’ to emphasise the separation of workloads from restricted data – in our case, the CSAM data sets used to train, validate, and test models as they underwent development. As successful models would be deployed into law enforcement production, there was no requirement for this infrastructure to handle the inference case (although model validation was required in addition to training).

The high-level architecture for a first iteration of the airlock, shown in Figure 2, comprises software and hardware components located in a secure data centre. This infrastructure shares some similarities with a data safe haven, inasmuch as it is a separate and partitioned computing platform accessed remotely by users and administrators. However, the secure connectivity and interaction between the isolating partitions, as well as the treatment of the restricted data are markedly different.

Figure 2: The Data Airlock Architecture

The architecture is divided into three logical zones: a Public Zone accessed by external R&D collaborators and workflow administrators; the high-performance Secure Zone in which vetted workloads run against restricted data; and the Restricted Zone where encrypted sensitive data is stored. The three zones are linked by one-way queues and operate under different security and access models. In our implementation, the public zone nodes were virtualised on a single server. The secure zone is necessarily located on its own high-performance server. The restricted zone runs on a general-purpose computer with high-performance, encrypted storage.
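As a software-level sketch of the one-way queues (the real enforcement sits at the network and hardware level), the snippet below gives each zone a handle exposing only the direction it is permitted to use. The broker host, queue name, and the use of Redis are assumptions for illustration, not the production configuration.

```python
# One-way queue sketch: each zone receives a handle that can only send
# or only receive, assuming a broker sits on the zone boundary.
import redis  # third-party client: pip install redis

class SendOnly:
    def __init__(self, client, queue):
        self._client, self._queue = client, queue
    def put(self, payload: bytes):
        self._client.rpush(self._queue, payload)

class ReceiveOnly:
    def __init__(self, client, queue):
        self._client, self._queue = client, queue
    def get(self, timeout=0):
        _, payload = self._client.blpop(self._queue, timeout=timeout)
        return payload

broker = redis.Redis(host="broker.internal")      # hypothetical boundary host
public_zone = SendOnly(broker, "jobs:vetted")     # public zone can only enqueue
secure_zone = ReceiveOnly(broker, "jobs:vetted")  # secure zone can only dequeue
```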

Jobs are enqueued from the public zone and subsequently vetted and cryptographically signed by a workflow administrator to ensure the integrity of code executed in the secure zone. As no private keys are held on the infrastructure, the actual signing takes place offline on a vetter’s private system. Once signed, the code is enqueued for execution in the secure zone, where jobs are dequeued and checked using public vetter keys. If verification succeeds, one-time credentials for data access are created and the restricted data is mounted.

The restricted zone Data Vault provides secure storage for sensitive data that is physically loaded on-site by data custodians into volumes encrypted with a manual boot-time password. The data vault dequeues secure zone requests for access to restricted data and returns one-time credentials for that access. At the completion of the job, the restricted data is unmounted; the resulting model, any processing results, and output logs are enqueued for vetting before being returned to the job submitter.
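A minimal sketch of the sign-then-verify step, using Ed25519 signatures from Python’s cryptography package. The split mirrors the workflow above: the private key lives only on the vetter’s offline machine, and the secure zone holds nothing but public keys. File names are illustrative.

```python
# Vet-sign-verify sketch: signing happens offline; the secure zone
# verifies against a public key before any data is mounted.
from cryptography.hazmat.primitives.asymmetric.ed25519 import Ed25519PrivateKey
from cryptography.exceptions import InvalidSignature

# --- offline, on the vetter's private system ---
private_key = Ed25519PrivateKey.generate()
job_code = open("job.tar", "rb").read()     # the vetted workload bundle
signature = private_key.sign(job_code)

# --- in the secure zone, before execution ---
public_key = private_key.public_key()       # distributed out-of-band in practice
try:
    public_key.verify(signature, job_code)  # raises if the code was tampered with
    # only now are one-time credentials minted and restricted data mounted
except InvalidSignature:
    pass  # reject the job; nothing is mounted
```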

Next Steps

The experience with Data Airlock 1.0 demonstrated a number of areas for improvement which are now being considered for a second version of the infrastructure. For example, code signing could be better supported through the use of tamper-resistant hardware for key management (such as a hardware security module), which would reduce complexity and improve security. Additionally, this first implementation was not designed to support a plurality of collaborators and data custodians from multiple organisations. A federated platform for controlled and configurable eyes-off access to restricted data would enable deep and broad collaboration between a range of disparate data-holders and researchers for the training, testing, and comparison of models against data that is held elsewhere.

Such an infrastructure would need to incorporate federated authentication, authorisation, workflow management, and audit. Data custodians would create catalogue entries of pre-vetted jobs, tasks, and data recipes that run against their secure zone compute resources and restricted zone data, and then grant access to these in much the same manner that vetting is currently performed (a sketch of such an entry follows below). From a collaborator perspective, these pre-vetted jobs, tasks, and data recipes could be combined to run across multiple organisations’ datasets and models in a federated, standardised, and secure manner. Finally, it should be noted that the workload scheduler in this initial architecture was single-threaded, allowing one isolated airlock at a time to execute with the full complement of compute resources (e.g. CPU/GPU/RAM) available on the secure zone server. A generalised architecture should allow for more granular and parallel execution of isolated airlocks.
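What might such a catalogue entry look like? The sketch below is one plausible shape, assuming one record per pre-vetted job or data recipe; every field name here is hypothetical rather than a committed design.

```python
# Hypothetical federated catalogue entry: custodians publish these and
# grant access, rather than vetting each submission anew.
from dataclasses import dataclass, field

@dataclass
class CatalogueEntry:
    entry_id: str                 # stable identifier for the job/recipe
    custodian: str                # organisation holding the restricted data
    job_digest: str               # hash of the pre-vetted, signed workload
    data_recipe: str              # named view/slice of restricted data
    granted_to: list[str] = field(default_factory=list)  # collaborator IDs

entry = CatalogueEntry(
    entry_id="csam-classifier-train-v2",
    custodian="afp",
    job_digest="sha256:…",
    data_recipe="training-split-2022",
    granted_to=["ailecs"],
)
```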

The design and implementation of a second version of the Data Airlock is currently underway. It will facilitate collaboration between disparate data-holders and researchers across the Law Enforcement sector, opening up possibilities across research domains: models can be run and compared without disclosing their technical detail, and different restricted data ‘recipes’ can be applied based on access and trust criteria.

Note: This blog post is a condensed version of a technical report published in the resources section of this web site and summarised in a pre-print article published to the arXiv academic paper service.