Objective
Recent years have seen a rapid rise in the development of artificial intelligence (AI) systems for use in healthcare, including those that qualify as a medical device (known as AI as a medical device, AIaMD). This has been enabled by increasing use of electronic health records, accompanied by curation of large-scale health datasets. This Analysis aims to explore existing standards, frameworks and best practices that improve data diversity in health datasets in the context of AIaMD.
Setting
A systematic review and a scoping survey of expert stakeholders.
Design
Systematic searches of the Embase, MEDLINE, Scopus, and Cochrane CENTRAL databases.
Participants
All 30 included records were published between July 2015 and February 2022. Searches were conducted in October 2021.
Methods
This research was conducted in compliance with all relevant ethical regulations, including informed consent from all participants.
Results
Database searches yielded 10,646 unique records, of which 100 remained after title and abstract screening. After full-text screening, 30 relevant records were included.
Conclusions
The systematic review provided insight into existing guidelines for data collection, handling of missing data and data labelling. A key theme was the need for transparency in how datasets are prepared, including who is included in or excluded from the dataset, how missing data are handled and how data are labelled. Greater transparency in these areas allows better understanding of the context and limitations of a dataset, which in turn provides a guide to the potential limitations of any inferences or innovations derived from that dataset.