Curating ethical datasets
The performance of our machine learning tools depends directly on the quality of the training data available to us. However, researchers consistently report that datasets of ‘in-the-wild’ images of people are compiled from content harvested from the open web, or from aggregations of such questionable content.
For us, the use of such datasets presents an uncomfortable dissonance: technology intended to counter child exploitation and address other community safety concerns would itself depend on exploitative practices, namely the collection and use of images without the knowledge or consent of the individuals depicted.
In response, we are going back to first principles, creating and managing datasets of benign ‘in-the-wild’ material for research use with the knowledge and consent of those depicted. In particular, we are concerned with how the data is captured in the first place and how consent for its use is managed over time.
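To make the idea of consent managed over time more concrete, the following is a minimal sketch in Python, not a description of our actual system. It assumes that each image carries one consent record per person depicted, and that consent can expire or be withdrawn. The names `ConsentRecord`, `is_active` and `usable_images` are hypothetical and used only for illustration.

```python
from dataclasses import dataclass
from datetime import datetime, timezone
from typing import Optional


@dataclass
class ConsentRecord:
    """Hypothetical record linking one image to one depicted person's consent."""
    image_id: str
    subject_id: str
    granted_at: datetime
    expires_at: Optional[datetime] = None    # None: no fixed end date
    withdrawn_at: Optional[datetime] = None  # None: consent not withdrawn

    def is_active(self, at: datetime) -> bool:
        """Return True if this consent covers the moment `at`."""
        if at < self.granted_at:
            return False
        if self.withdrawn_at is not None and at >= self.withdrawn_at:
            return False
        if self.expires_at is not None and at >= self.expires_at:
            return False
        return True


def usable_images(records: list[ConsentRecord], at: datetime) -> list[str]:
    """Return IDs of images for which every recorded subject consents at `at`.

    Assumption: an image with no consent record at all never appears in
    `records`, so it can never be returned as usable.
    """
    by_image: dict[str, list[ConsentRecord]] = {}
    for record in records:
        by_image.setdefault(record.image_id, []).append(record)
    return sorted(
        image_id
        for image_id, recs in by_image.items()
        if all(r.is_active(at) for r in recs)
    )


# Illustrative data only: one person later withdraws consent for "img-2".
granted = datetime(2024, 1, 1, tzinfo=timezone.utc)
withdrawn = datetime(2024, 6, 1, tzinfo=timezone.utc)
records = [
    ConsentRecord("img-1", "person-a", granted),
    ConsentRecord("img-2", "person-b", granted, withdrawn_at=withdrawn),
    ConsentRecord("img-2", "person-c", granted),
]
```

In this sketch, an image is treated as usable only while every person depicted in it has active consent, so a single withdrawal removes the image from any dataset assembled after that date. Whether withdrawal should also trigger retraining or deletion of derived artefacts is a policy question the sketch does not address.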