Medical research plays a crucial role within scientific research. Technological advancements, especially those related to the rise of machine learning, pave the way for the exploration of medical issues that were once beyond reach. Unstructured textual data, such as correspondence between doctors and operative reports, often serve as a starting point for many medical applications.
However, for privacy reasons, researchers are not legally permitted to access these documents as long as they contain sensitive data, as defined by regulations such as the GDPR (General Data Protection Regulation) or HIPAA (Health Insurance Portability and Accountability Act). De-identification, meaning the detection and then removal or substitution of all sensitive information, is therefore a necessary step to enable the sharing of these data between the medical field and research. Over the past decade, various approaches have been proposed to de-identify medical textual data. However, while entity detection is a well-known task in natural language processing, it presents specific challenges in the medical context. Moreover, existing substitution methods proposed in the literature often pay little attention to the medical relevance of de-identified data or are not very resilient to attacks.
This paper addresses these challenges. First, an efficient system was implemented for detecting sensitive entities in French medical data and then accurately substituting them. Second, robust strategies were provided for generating substitutes that account for the medical utility of the data, thereby minimizing the difference in utility between the original and de-identified data, and that mathematically ensure privacy protection. Third, the utility of the de-identification system was evaluated in the context of ICD-10 code association. Finally, various systems developed to tackle ICD-10 code association were presented, together with a state-of-the-art model for French.