June 17, 2025
A general framework for governing marketed AI/ML medical devices

As described further in Methods, we collected data from the FDA’s medical device adverse event reporting database (the “Manufacturer and User Facility Device Experience,” or MAUDE, database). Our final dataset (which we have made publicly available, together with all the associated code, through the links at the end of this manuscript) comprises 823 unique 510(k)-cleared devices that could be linked to a total of 943 subsequent adverse event reports (medical device reports, or MDRs) submitted between 2010 and 2023. The vast majority of AI/ML device MDRs come from two products (see Fig. 1). The first is Biomerieux’s Mass Spectrometry Microbial Identification System (product code PEX), an automated mass spectrometry system that uses matrix-assisted laser technology to identify microorganisms cultured from human specimens. The system is designed to provide rapid and accurate identification of a wide range of microorganisms and to assist healthcare professionals in diagnosing infections and guiding appropriate treatment plans. The second is DarioHealth’s Dario Blood Glucose Monitoring System (product code NBW), a direct-to-consumer software product that produces blood glucose level readings through its smart app.
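
To make the dataset construction concrete, the following is a minimal Python sketch of how a list of 510(k)-cleared AI/ML-enabled devices might be linked to MAUDE adverse event reports. It is illustrative only and is not the analysis code we have made available; the file names and column names (k_number, product_code, manufacturer) are assumptions, and a real linkage would require additional cleaning.

```python
import pandas as pd

# Hypothetical inputs: a list of 510(k)-cleared AI/ML-enabled devices and an
# extract of MAUDE medical device reports (MDRs) filed between 2010 and 2023.
aiml_devices = pd.read_csv("aiml_device_list.csv")      # assumed file name
mdrs = pd.read_csv("maude_mdrs_2010_2023.csv")          # assumed file name

# Link each MDR to a cleared AI/ML device. Here the join is by product code
# and manufacturer name; in practice substantial name cleaning is needed.
linked = mdrs.merge(
    aiml_devices[["k_number", "product_code", "manufacturer"]].drop_duplicates(),
    on=["product_code", "manufacturer"],
    how="inner",
)

print(f"{aiml_devices['k_number'].nunique()} unique 510(k)-cleared AI/ML devices")
print(f"{len(linked)} linked adverse event reports (MDRs)")
```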

Fig. 1: Distribution of adverse events by product code.

This figure presents the number of reported adverse events (MDRs) linked to each of the 20 product codes identified in our merged dataset of FDA-cleared AI/ML-enabled devices and adverse event reports. The product code PEX corresponds to Biomerieux’s Mass Spectrometry Microbial Identification System, and NBW refers to Dario’s Glucose Monitoring System.

A first surprising finding is the extremely high concentration of MDRs in such a small number of AI/ML devices. While one also observes concentration in MDRs for medical devices without AI/ML functions, that concentration is not as extreme. In the Supplementary Information, we provide a figure (Supplementary Fig. 1) comparing the market concentration, so to speak, of adverse events in AI/ML devices versus non-AI/ML devices, as well as an associated figure (Supplementary Fig. 2) comparing event types across the two groups. We find that more than 98% of adverse events associated with AI/ML devices are borne by fewer than five devices; for non-AI/ML devices, the corresponding figure is about 85%. Meanwhile, with respect to the “event type,” 90.88% of AI/ML device reports are recorded as malfunctions, whereas for non-AI/ML devices, the corresponding figure is 77.05%. Under both measures, therefore, the concentration for AI/ML devices is particularly extreme.
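
For readers who wish to reproduce this comparison, the concentration measure can be computed along the following lines. This is a schematic sketch rather than our released code, and the column names (k_number, event_type) are placeholders for the corresponding MAUDE fields.

```python
import pandas as pd

def top_k_share(reports: pd.DataFrame, device_col: str, k: int = 5) -> float:
    """Fraction of all reports borne by the k devices with the most reports."""
    counts = reports[device_col].value_counts()
    return counts.head(k).sum() / counts.sum()

# `linked` is the merged dataframe from the previous sketch, one row per MDR.
aiml_top5_share = top_k_share(linked, device_col="k_number")        # reported above as > 98%
malfunction_share = (linked["event_type"] == "Malfunction").mean()  # reported above as 90.88%

# Applying top_k_share to an extract of MDRs for non-AI/ML devices yields the
# ~85% comparison figure shown in Supplementary Fig. 1.
```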

Most MDRs associated with the Mass Spectrometry Microbial Identification System (PEX) are reported as misidentifications of microorganisms. Many of these issues appear to stem from limitations in the system’s knowledge base, which identifies microorganisms by comparing test results to known profiles. It is difficult to gauge the true severity of these problems from the available data. Misidentification of microorganisms can be very dangerous, even life threatening. But from the information provided in regulatory databases, it is hard to decipher whether the reported events rise to this level of severity and, if they do, whether the AI/ML device is responsible for the event. This is emblematic of more general issues, discussed below, as to why the current reporting structure is not fit for purpose for evaluating AI/ML devices.

Meanwhile, for DarioHealth, the main issues reported are associated with incorrect blood glucose level readings, at least some of which could be interpreted as false positives. We do not intend to single out these products as poorly performing—while they are overrepresented in the database, this does not allow for any general conclusion about their quality. Indeed, in the absence of data about the overall frequency of a device’s use and clarity on the salience of problems with the device relative to other devices, it is difficult to speculate about a device’s relative performance or safety. It may even be the case that these products are overrepresented because the relevant manufacturers are particularly diligent about their reporting duties and/or quality control.

In any case, we highlight these examples to illustrate the general character of the adverse event reports most represented among AI/ML device MDRs: the kinds of malfunctions and product issues that would traditionally be important for non-AI/ML devices. Such malfunctions are not always the most salient concerns for AI/ML devices, so the value of these reports is limited in this context, and the burden imposed on manufacturers may be unnecessary and/or unequal. We explain the shortcomings of the current system in more detail below.

There are several important limitations of the current MDR system which make it suboptimal for tracking, understanding, and correcting safety issues that arise with AI/ML medical devices. Consistent with the patterns described above, we now further describe the main issues identified in the data.

Missing data

A significant concern with the current state of the MAUDE database is simply the sheer extent of missing data within MDRs, and this is even before one considers selection issues associated with whether adverse events are reported at all. The problem of missing data for FDA-cleared AI/ML devices has been raised elsewhere18, but to our knowledge no one has systematically investigated the missing data in the MAUDE reports for AI/ML devices.

Missing data entries within the formal MDRs make it difficult to study AI/ML medical device safety effectively from a quantitative perspective. Figure 2, below, presents the extent of missing information for four important indicators in the MAUDE database. Note that each of the color-coded categories below represents a different way in which data can be missing; while they are coded differently, the result is the same.

Fig. 2: Proportion of missing values across key fields in medical device reports.

This chart illustrates the extent of missing information across four key fields in a total of 943 medical device reports (MDRs) associated with AI/ML-enabled devices. “Missing” includes blank entries or those marked as “No information,” whereas entries marked “Not applicable” were treated as populated.

In the analysis sample of 943 adverse event reports, data completeness varied significantly across key variables. We treat blank or “No information” entries as missing data, and completed or “Not applicable” entries as populated data. Event Location is missing for all MDRs in the sample (n = 369 entries are blank and n = 574 are marked as “No information”). Similarly, 73% of the reports lack information about whether the reporter was a health professional (n = 509 blank, n = 101 “No information”). Event Date is missing in 32% of the reports (n = 298), while Reporter Occupation is absent in 30% of cases (n = 283 blank, n = 1 “Not applicable”). The limited availability of data on key contextual features of adverse events highlights potential gaps in the reporting process. Yet such information is especially important for AI/ML medical devices, whose performance is known to be contextually sensitive: such devices’ ability to perform as intended can deteriorate significantly in a different sub-population or when used by different parties.
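
The tabulation behind Fig. 2 can be sketched as follows. The field names are illustrative stand-ins for the corresponding MAUDE columns; the key point is the coding rule, with blank and “No information” entries treated as missing and “Not applicable” treated as populated.

```python
import pandas as pd

MISSING_TOKENS = {"", "No information"}   # "Not applicable" counts as populated

def missing_share(df: pd.DataFrame, field: str) -> float:
    """Share of reports whose entry for `field` is blank or marked 'No information'."""
    values = df[field].fillna("").astype(str).str.strip()
    return values.isin(MISSING_TOKENS).mean()

# `linked` is the AI/ML MDR dataframe sketched earlier; field names are assumed.
for field in ["event_location", "health_professional", "event_date", "reporter_occupation"]:
    print(f"{field}: {missing_share(linked, field):.1%} missing")

# Running the same function on an all-device MAUDE extract gives the comparison
# figures discussed next (e.g., 73% vs. 43% for the health-professional field).
```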

Most importantly, the extent of missing data was significantly higher in the AI/ML sample than in MDRs for other medical devices. For instance, information about whether the reporter was a health professional was missing 73% of the time in the AI/ML sample, but only 43% of the time in the overall sample. Similarly, Event Date and Reporter Occupation were more likely to be missing in the AI/ML sample (32% vs. 21.9% and 30% vs. 12.7%, respectively). Event Location was entirely missing (100%) in the AI/ML sample, compared to 90.1% in the general device sample.

Above all, these data deficiencies create difficulties for policymakers, scholars, and manufacturers in accessing and investigating the specific causes of adverse events related to AI/ML devices. Especially for reports submitted by non-health professionals, the absence of key event details such as timing, location, and reporter information would require manufacturers to spend more time on follow-ups to gather additional details. Hence, the overall result is an incomplete picture of the safety issues that the database is intended to capture.

Indeed, a closer look at the MDRs reveals that in many cases, even after self-reported extensive follow-up efforts, manufacturers struggled to obtain more information. For example, in an NBW device report (i.e., DarioHealth’s Dario Blood Glucose Monitoring System) where both the event location and reporter information were missing, the manufacturer DarioHealth mentioned that “the user refused to troubleshoot with Dario’s representatives. There is not enough information available regarding Dario meters to investigate”19. In another report with significant information gaps, it was similarly noted that, “multiple attempts to follow up with the user were made, however, no response has been received to date”20. With frequent missing information in the reports, it becomes challenging for manufacturers to fully understand the context of device issues and malfunctions, thereby limiting their ability to conduct thorough device assessments and provide resolutions. Moreover, the inability to determine the specifics of device failures, particularly for medical events involving serious injury or death, complicates the assignment of responsibility. These data gaps challenge the ability of both manufacturers and regulators to validate the credibility of the reports and to determine the reporter’s level of expertise.

In light of these challenges, it is necessary to strengthen the standardization and completeness of the current MDR data collection process. This is of course true more generally, but in the relatively new context of AI/ML devices, where little regulatory history exists, it is likely to be particularly valuable for quality improvement efforts and, therefore, for protecting the public’s health. Furthermore, because there are unique concerns related to transferability (how a model applies across contexts), domain adaptation (how the model adapts to new contexts), and cross-dataset evaluation of AI/ML models (the model’s robustness across different datasets), understanding the performance of such algorithms in context is especially important. Such improvements in data collection would not only help manufacturers obtain more complete event information but would also facilitate effective quantitative analysis of the reports database and enable regulators to implement corrective measures more efficiently.
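
To illustrate what cross-dataset evaluation involves, the following schematic Python example, using entirely synthetic data rather than any actual device’s algorithm, fits a simple model on data from one simulated site and evaluates it on another; the drop in discrimination at the second site is the kind of context-dependent performance degradation that current MDRs cannot capture.

```python
import numpy as np
from sklearn.linear_model import LogisticRegression
from sklearn.metrics import roc_auc_score

rng = np.random.default_rng(0)

def make_site(n, weights):
    """Synthetic patients; different `weights` simulate a different clinical context."""
    X = rng.normal(size=(n, 5))
    p = 1.0 / (1.0 + np.exp(-X @ weights))
    y = (rng.random(n) < p).astype(int)
    return X, y

# The outcome depends on the features differently at the two simulated sites.
X_dev, y_dev = make_site(2000, np.array([1.0, -0.5, 0.8, 0.0, 0.3]))   # development data
X_new, y_new = make_site(2000, np.array([0.2, 0.9, -0.6, 0.7, 0.0]))   # new deployment site

model = LogisticRegression().fit(X_dev, y_dev)
auc_dev = roc_auc_score(y_dev, model.predict_proba(X_dev)[:, 1])
auc_new = roc_auc_score(y_new, model.predict_proba(X_new)[:, 1])
print(f"AUC at development site: {auc_dev:.2f}; AUC at the new site: {auc_new:.2f}")
```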

Inadequate event classification

The current database contains a significant proportion of inadequate event classifications: events that are recorded one way in the database even though the accompanying qualitative description does not match that classification. This reflects a disconnect between the actual challenges arising in practice and the categorical constraints of the current reporting system. According to the latest MDR guidelines, all submitted MDRs are classified into one of three categories, Malfunction (M), Injury (I), or Death (D), which are recorded under Event Type in the MAUDE database.

As seen in Table 1, the majority (91%) of device reports are categorized as ‘Malfunction’, while only two events (0.21%) are classified as ‘Death’. Both ‘Death’ events came from DarioHealth’s Dario Blood Glucose Monitoring System (product code NBW). The FDA stipulates that an MDR should be classified as a death only when the reporter believes that the patient’s cause of death was or may have been attributable to the device. However, in both death-classified reports from DarioHealth’s glucose monitoring system, the death was reported not to be related to the device. In one such report, on November 30, 2018, the spouse of a Dario Blood device user called to report her husband’s death, but it was subsequently clarified that the death was unrelated to the device21. Similarly, on February 6, 2019, another Dario Blood user’s husband contacted Dario to report the user’s death; however, the caller noted that the user had many health complications and there was no indication the device was involved22. As such, both “deaths” appear to be inaccurate classifications in the data, as we cannot conclude on this basis that the deaths are causally attributable to the device.

Table 1 Breakdown of adverse event reports by product code and event type

While extreme, these cases illustrate the difficulty of accurately reporting adverse events for medical AI/ML devices. Doing so requires sorting out thorny issues of causality because death while using a device is clearly quite different from death due to the device (and its malfunctioning). Moreover, even if death could be attributed to use of the device, that would still be different from death due to the device’s AI/ML system. Thus, given how far removed the event is from the AI/ML functionality, there are few if any conclusions that one can reliably draw about AI/ML device safety on the basis of what is reported.

Similar and likely far more common inadequate event classification issues are also evident in non-death adverse event reports. On January 18, 2019, a patient underwent a Heart Flow Analysis (product code PJA) at a hospital and received a negative result. Later, the patient experienced serious chest pain and received a CT scan, which revealed that the earlier result had been a false negative. The patient was then urgently referred for cardiac catheterization. This event was reported as an Injury. However, subsequent investigations by Heart Flow revealed that the false negative result was due to an analyst’s mistake rather than a problem with the device software itself23. The upshot here is the same as above.

Recording the event as an “Injury” does not seem accurate, as the cause of injury was not the device itself but rather, in this case, an analyst’s mistake. While such misattributions may also occur for other types of medical devices, the complexity of AI/ML product use opens up a new dimension for such user errors. What is particularly interesting is that while one might have predicted that the ambiguity of attributing responsibility could lead to underreporting, here it appears to have led to overreporting (i.e., Heart Flow reporting a false negative that was not due to a malfunction of its device). Yet there is very little oversight of reporting activities, so the best that regulators and researchers can do is take reports at face value. It would be very useful to have an independent investigation of serious injuries or deaths before the reports are filed, so that we could conclude with more confidence whether the device or its AI/ML functionality was related to the event. In any case, such clear retroactive discrepancies suggest that the MDR system might be tagging device events based on patient outcomes rather than on outcomes directly caused or potentially caused by the devices.

Relying on the event type to determine the nature of events might lead to the misperception that the medical AI/ML device caused a death, when in fact establishing that would require a much more thorough investigation. Such challenges also occur in the context of traditional devices, but in the case of AI/ML devices they are particularly salient: AI/ML is unique because the error can be hard to detect, and its source even harder to identify. Moreover, whether or not something is an AI/ML error to begin with can be challenging to determine. These facets, together with the novelty of the AI/ML products themselves and medical professionals’ lack of familiarity with them, may lead to more user errors and/or to a specific type of inadequate classification of MDRs due to lack of product knowledge. The current system, therefore, likely generates an overall inaccurate picture of the safety of these devices.

The preceding event classification problems revolve around explicitly inadequate or inaccurate statements in a report (for example, an injury is attributed to a device when in reality the injury occurred due to an analyst’s error). But the current environment is also susceptible to inadequate event reporting due to what we might call errors of omission; the extent of that problem is difficult to estimate. To explain, we provide a hypothetical example. Suppose that an AI/ML system is incorporated into a certain piece of clinical decision support software, and suppose further that the entire product is deployed in a triage environment. Now suppose the product is not working well and is not useful, in the sense that attending physicians do not find it helpful for optimizing, or even improving, their workflow relative to their performance without the product. In this case, the attending physicians may simply ignore the product. But ignoring it does not mean they will report it to the manufacturer as a malfunction, nor does it mean that the manufacturer will file an official report. Indeed, lack of usability is not ordinarily thought of as a reportable issue. As a result, this example depicts a situation where the environment does not incentivize anyone to create a report that would allow researchers and regulators to become aware of a real problem. This is the medical analogue of a phenomenon commonly observed in the context of criminal recidivism prediction: we can only observe arrest rates, not offense rates (i.e., we do not observe committed crimes for which the offenders are not caught)24,25. Likewise, we do not observe problems that are never reported.

In addition to inadequate event classification, the MAUDE database also contains multiple types of incomplete or inaccurate classifications that go beyond the event field.

Severity of risk unclear or unknown

There are no indicators beyond Event Type in the MAUDE database that precisely define the severity of reported device events. The current database only allows an assessment of the event’s outcome but, as noted above, it remains unclear whether and to what extent the event is related to the specific AI/ML device. While qualitative analysis of the report narratives is imaginable, such a project would require the use of fit-for-purpose text analysis and is not currently feasible on a large scale. As such, a data-based assessment of overall safety remains difficult to implement.
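
For illustration only, the following sketch shows what a naive keyword screen over report narratives might look like; the patterns and the mdr_text field are hypothetical, and a crude screen of this kind is precisely what would need to be replaced by fit-for-purpose text analysis before any large-scale severity assessment could be trusted.

```python
import re

# Illustrative keyword patterns; a credible severity assessment would require
# far more careful, validated text analysis than a screen like this.
USER_ERROR_PATTERNS = [r"user error", r"misuse of the device", r"expired"]
NO_HARM_PATTERNS = [r"not related to the device", r"no indication the device"]

def flag_narrative(text: str) -> dict:
    """Flag a single MDR narrative for mentions of user error or explicit denials of harm."""
    text = text.lower()
    return {
        "possible_user_error": any(re.search(p, text) for p in USER_ERROR_PATTERNS),
        "harm_explicitly_denied": any(re.search(p, text) for p in NO_HARM_PATTERNS),
    }

# flags = [flag_narrative(t) for t in linked["mdr_text"]]   # the "mdr_text" column is assumed
```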

Moreover, the frequency of MDRs associated with a specific medical AI/ML device is not necessarily an accurate indicator of the level of risk or device failure. One might suspect that more reports imply a more problematic device, but that is not always the case. For example, DarioHealth’s Dario Blood Glucose Monitoring System (i.e., NBW) accounts for a total of 275 adverse events, according to Table 1. Of these, 231 were labeled as Malfunction, but many were attributed to user error rather than actual device failure. Specifically, in one of the NBW malfunction reports, the user mentioned that he used an expired glucose strip cartridge, which led to a false negative result26. Another report similarly states that “the user realized that it was his own mistake … [t]herefore, it can be determined that there was misuse of the device.”27 For our purposes, these additional examples of user error suggest that there is not always a strong positive relationship between the number of reports and the risk associated with a device. Separately, but relatedly, the salience of adverse events may differ across device types and user profiles, both of which might drive differences in the propensity to report problems in the first place. Finally, without an overall “denominator” for the frequency of a medical device’s use, it is not possible to talk about the relative safety of one product versus another28. This challenge also exists for other medical devices, but the interaction of known AI/ML capabilities with heterogeneous user profiles may lead to differential non-representativeness of MDRs for these products.
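
The denominator problem can be illustrated with a toy calculation in which every number is invented; MAUDE provides no usage volumes, which is precisely the gap at issue. With hypothetical denominators attached, the device with more reports can turn out to have the lower per-use report rate.

```python
# All numbers below are invented for illustration; MAUDE contains no usage data.
reports = {"Device A": 300, "Device B": 3}
assumed_uses = {"Device A": 5_000_000, "Device B": 500}

for device, n_reports in reports.items():
    rate = n_reports / assumed_uses[device]
    print(f"{device}: {n_reports} reports, {rate:.4%} reports per use")
# Device A: 300 reports, 0.0060% reports per use
# Device B: 3 reports, 0.6000% reports per use
```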

As a corollary, a smaller number of reports (or their absence) does not always imply lower product risk or severity. The WAVE Clinical Platform’s heart rate monitors (coded as MWI) are used by healthcare professionals to monitor patients’ waveforms, alarms, and results remotely in real time. According to Table 1, MWI has only two event reports, both marked as Malfunction. In both reports, the users noted that the MWI system failed to send emergency alerts to the intended recipients. Although these incidents did not result in severe patient outcomes, the examples emphasize that a recurrence could result in serious injury or death. Again, our intention here is not to evaluate the risk or safety of specific devices per se, but rather to point out why and how it might be difficult for regulators and researchers to detect device-related safety issues by quantitatively analyzing the database’s indicators without reading through specific report narratives, and in the absence of technology-appropriate categories for reporting problems.

Problems without malfunctions are not tracked

According to the FDA, a product malfunction is an MDR-reportable event if it results in the failure of the device to perform as intended in a way that could cause or contribute to a death or serious injury7. There are a few other triggering conditions, but in general they are tied to a device’s contribution to death or serious injury. Yet many, perhaps most, problems that occur with AI/ML devices will not rise to this level of individual injury, either in practice or as a possibility. Indeed, many problems caused by medical AI/ML devices may not even occur as a result of malfunction, in the ordinary sense of that word. For example, suppose that a device predicts an 80% chance of a certain disease for a given patient. If the patient does not actually have the disease, is that a malfunction? Such a question is not possible to answer at the individual patient level. After all, an 80% chance of disease implies a 20% chance of its absence.
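
To make the aggregate nature of this question concrete, the following minimal sketch uses the hypothetical cohort developed in the next paragraph (100 patients, 30 of whom have the disease, with the device predicting an 80% probability for everyone); the miscalibration is visible only when predictions are pooled.

```python
import numpy as np

# Hypothetical cohort from the next paragraph: 30 of 100 patients have the
# disease, yet the device outputs an 80% probability for every one of them.
y_true = np.array([1] * 30 + [0] * 70)
y_prob = np.full(100, 0.80)

predicted_rate = y_prob.mean()    # 0.80: average predicted risk
observed_rate = y_true.mean()     # 0.30: fraction of patients actually diseased

# No single prediction is identifiably "wrong", but the gap between predicted
# and observed rates becomes apparent once predictions are pooled.
print(f"predicted {predicted_rate:.0%} vs. observed {observed_rate:.0%} "
      f"(calibration gap {predicted_rate - observed_rate:.0%})")
```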

The only way to identify probabilistic predictions as malfunctioning (or not) is to look at performance across a large group of patients. For example, if we have 100 patients, and 30 of them have a certain disease, but the device produces an 80% probability for every single one of them, then one might say the device is malfunctioning: it appears miscalibrated. But this kind of aggregate-level malfunction would not make its way into local, individual-level adverse event reports. The adverse event reporting system is designed at a patient/case level to identify local, individual issues, but many problems associated with AI/ML devices will present only at an aggregate or global level; that is, they can be identified only by analyzing a large number of patient data points and comparing algorithmic performance to “true” diagnoses, outcomes, or base rates. Meanwhile, even if we could observe sub-population performance, a further problem would be non-random differences in the distributions of device errors; for example, error rates may differ across age, gender, ethnicity, race, etc.29. This is likewise not something that can be tracked or ascertained from the existing MAUDE reports or MDR categories, but it represents a known performance challenge for AI/ML devices. While this is not necessarily a defect of the MAUDE database, since the MDR system was not originally designed to track such problems, it does point to the need to rethink how we should approach postmarket surveillance of AI/ML devices, to which we now turn.
