Laying foundations for effective machine learning in law enforcement. Majura – a labelling schema for child exploitation materials.
Dalins, J., Tyshetskiy, Y., Wilson, C., Carman, M. J., & Boudry, D. (2018). Laying foundations for effective machine learning in law enforcement. Majura – a labelling schema for child exploitation materials. Digital Investigation, 26, 40-54. https://doi.org/10.1016/j.diin.2018.05.004
The health impacts of repeated exposure to distressing concepts such as child exploitation materials (CEM, aka ‘child pornography’) have become a major concern to law enforcement agencies and associated entities. Existing methods for ‘flagging’ materials largely rely upon prior knowledge, whilst predictive methods are unreliable, particularly when compared with equivalent tools used for detecting ‘lawful’ pornography. In this paper we detail the design and implementation of a deep-learning based CEM classifier, leveraging existing pornography detection methods to overcome infrastructure and corpora limitations in this field. Specifically, we further existing research through direct access to numerous contemporary, real-world, annotated cases taken from Australian Federal Police holdings, demonstrating the dangers of overfitting due to the influence of individual users’ proclivities. We quantify the performance of skin tone analysis in CEM cases, showing it to be of limited use. We assess the performance of our classifier and show it to be sufficient for use in forensic triage and ‘early warning’ of CEM, but of limited efficacy for categorising against existing scales for measuring child abuse severity. We identify limitations currently faced by researchers and practitioners in this field, whose restricted access to training material is exacerbated by inconsistent and unsuitable annotation schemas.
Whilst adequate for their intended use, we show existing schemas to be unsuitable for training machine learning (ML) models, and introduce a new, flexible, objective, and tested annotation schema specifically designed for cross-jurisdictional collaborative use. This work, combined with a world-first ‘illicit data airlock’ project currently under construction, has the potential to bring a ‘ground truth’ dataset and processing facilities to researchers worldwide without compromising quality, safety, ethics and legality.