The VALID (Veracity, Agency, Longevity, Integrity and Dignity) in datasets project is developing frameworks to improve data quality, ethical accountability, and public trust in machine learning projects directed toward law enforcement and community safety outcomes.
The first dataset to be constructed under VALID will consist of benign, legal images of children. Images of children in everyday contexts are important assets for training and evaluating AI models intended to combat child exploitation. They provide inputs from which AI models can learn features that depict children in ‘safe’ contexts. In combination with other datasets, they can be used to produce machine learning models that differentiate benign images of children from exploitative ones. Such models sit behind tools that can classify digital image collections at speed and scale, tagging items likely to contain child sexual abuse material (CSAM), including ‘new’ material that would not be detected by hash matching against databases of previously known images.
AiLECS is thus creating a first-of-its-kind dataset of benign ‘in-the-wild’ images of children, acquired for research use with the knowledge and consent of the children depicted. We are building this dataset through a global crowdsourcing initiative, asking people who are now over the age of 18 to contribute photographs of themselves as children (data collection commencing May 2022). Importantly, we have developed comprehensive strategies for the safe storage and use of this data to preserve the privacy of those depicted, as well as processes for the ongoing management of consent.
This project has been approved by the Monash University Human Research Ethics Committee (project ID #31436).