Accuracy matters. But how do you evaluate a model? It depends on the model's task. A QSAR model's accuracy is measured by comparing experimental and predicted values, which means you cannot judge accuracy without experimental values. So the first step is to prepare the answer key: the observed (experimental) values. Principle 4 of the OECD QSAR validation principles says to transparently disclose the accuracy of the model. This principle is important, yet surprisingly little known. If you look at QSAR papers published in the late 1900s or early 2000s, you will find that model accuracy was often not stated at all. So how do you validate a model?
Let's take a look at the QMRF documentation for VEGA's liver toxicity (hepatotoxicity) prediction model. The validation section is divided into internal validation and external validation. What do these mean, and why are there two? Internal validation means the model is validated on the same data used to train it, i.e., the prediction accuracy on the training data; it is called "internal" because the model has already seen that data during training. External validation means the model is validated on data that was never used during training; such data is called test data. Why check both? Occasionally, very rarely, a model shows unusually high accuracy on the test data but not on the training data; in that case it is hard to argue that the training actually produced a good model. Much more often, a model achieves high accuracy on the training data but not on the validation data, so you need to check both cases.
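The split between internal and external validation can be sketched in a few lines. This is a minimal illustration with a hypothetical toy dataset (labels only: 1 = hepatotoxic, 0 = not) and a trivial majority-class "model"; nothing here is VEGA's actual algorithm.

```python
# Hypothetical toy labels for 100 compounds (1 = hepatotoxic, 0 = not).
labels = [1, 1, 1, 0, 1, 0, 1, 1, 0, 0] * 10

# 80/20 split, as described in section 6.6 of the QMRF.
train, test = labels[:80], labels[80:]

# "Train" a trivial model: always predict the majority class of the training data.
majority = 1 if sum(train) >= len(train) / 2 else 0

# Internal validation: accuracy on the data the model was fit to.
internal_acc = sum(y == majority for y in train) / len(train)
# External validation: accuracy on data the model never saw.
external_acc = sum(y == majority for y in test) / len(test)
print(f"internal: {internal_acc:.2f}, external: {external_acc:.2f}")
```

With a real model, a large gap between the two numbers (high internal, low external) is the usual sign of overfitting.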
https://www.vegahub.eu/vegahub-dwn/qmrf/QMRF_HEPA_IRFMN.pdf
The VEGA QMRF documentation starts by explaining how the data was prepared; from this point, please follow along with the documentation at the link above. The description in section 6.6 explains that two datasets were aggregated: human data, 950 substances in total, all organic, single-component compounds. (I think a few non-organic compounds were actually included. Organic compounds are those composed primarily of carbon and hydrogen. The document specifies that only single-component substances were collected because some entries are mixtures. For a mixture of two or three substances, it is not feasible to determine which component has the dominant effect, or whether an effect arises from synergy between the components, so mixture results are usually excluded from the data.) The data was split randomly, with 80% used as training data and 20% as validation data.
Section 6.7 describes how the performance of the model was evaluated. First, the accuracy reported in the source research article is described. In the article, internal validation was performed on the training set of 760 substances (80% of the full dataset). Of these, 263 substances were true positives. A true positive is a substance known to be positive that is predicted positive; here, a toxic substance correctly predicted to be toxic. Another 144 substances were true negatives: non-toxic substances correctly predicted non-toxic. The rest were incorrect: 72 substances were not hepatotoxic but were predicted to be toxic (false positives), and 18 substances were hepatotoxic but were predicted not to be (false negatives). The reported accuracy is 81%. But when you check the numbers, something doesn't add up. Accuracy is (number of correctly predicted substances) / (total substances), which here is (263 + 144) / 760 = 53.55%. Something is wrong with the math.
The next part describes the model as implemented in VEGA. Out of 760 substances, 265 were not predicted (classified as unknown), so those 265 should be excluded when evaluating accuracy. There are 261 true positives and 144 true negatives. Recalculating the accuracy of the VEGA implementation, correct predictions (261 + 144) divided by predicted substances (760 − 265) gives 81.81%, since the unpredicted substances must be excluded. After accuracy come sensitivity and specificity. Sensitivity measures how many of the actually positive substances are predicted correctly; specificity measures how many of the actually negative substances are predicted correctly. Why do we evaluate accuracy separately for toxic and for non-toxic substances?
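The arithmetic above is worth checking directly. The counts below are the ones quoted in the QMRF (263 true positives in the source article, 261 in the VEGA implementation, 144 true negatives, 760 substances, 265 left unpredicted by VEGA):

```python
# Counts reported in the QMRF, section 6.7.
paper_tp, vega_tp, tn = 263, 261, 144
total, unknown = 760, 265

# Accuracy over all 760 substances does not reproduce the reported 81%:
paper_acc = (paper_tp + tn) / total            # 407 / 760 ≈ 53.55%
# Excluding the 265 unpredicted substances recovers the reported figure:
vega_acc = (vega_tp + tn) / (total - unknown)  # 405 / 495 ≈ 81.81%
print(f"naive: {paper_acc:.2%}, excluding unknowns: {vega_acc:.2%}")
```

So the 81% figure only makes sense once the unpredicted (unknown) substances are removed from the denominator.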
Because the accuracy value alone doesn't give an accurate picture of model performance. Suppose you have 1,000 data points, 800 non-toxic and 200 toxic. A model trained on this data can reach 80% accuracy simply by predicting that every substance is non-toxic: it is only wrong about the 200 toxic ones. But its sensitivity would be 0% and its specificity 100%; the model gives the same answer no matter what you feed it. So don't be fooled by accuracy alone: the model should be evaluated for each label (toxicity class) separately.
In the VEGA model there are 261 true positives, 144 true negatives, 72 false positives, and 18 false negatives. What is the total number of positives, i.e., toxic compounds, in this data? 261 + 18 = 279. A true positive is a hepatotoxic substance correctly predicted hepatotoxic, and a false negative is a hepatotoxic substance incorrectly predicted non-hepatotoxic; in both cases the experimental value is hepatotoxic, so 279 substances are toxic (positive). Conversely, the total number of negatives, non-toxic substances, is 144 + 72 = 216: 144 substances correctly predicted non-toxic, plus 72 non-toxic substances incorrectly predicted toxic. Adding 279 and 216 gives 495, which equals 760 − 265, the total data minus the unpredicted (unknown) substances. Based on the internal validation results, the model is good at finding hepatotoxic substances, but not so good at identifying substances that are not hepatotoxic.
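Rebuilding the class totals from the four counts, and computing the per-class rates, shows the asymmetry described above (the exact sensitivity and specificity values here are derived from the counts, not quoted from the QMRF):

```python
# VEGA internal-validation counts from the QMRF.
tp, tn, fp, fn = 261, 144, 72, 18

positives = tp + fn   # experimentally hepatotoxic: 279
negatives = tn + fp   # experimentally non-hepatotoxic: 216
# Sanity check: the two classes together cover all predicted substances.
assert positives + negatives == 760 - 265  # 495

sensitivity = tp / positives  # strong at catching hepatotoxic compounds
specificity = tn / negatives  # noticeably weaker on non-hepatotoxic ones
print(f"sensitivity: {sensitivity:.2%}, specificity: {specificity:.2%}")
```

A sensitivity near 94% against a specificity near 67% is exactly the "good at positives, weaker at negatives" pattern the text describes.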
The external validation results follow the same pattern, though the accuracy is lower on the external data. I have personally validated the VEGA hepatotoxicity model, and my results were similar to the external validation figures of 63–66% reported in the QMRF. Is this good? It's not a great model, but the performance of many hepatotoxicity prediction models is broadly similar; hepatotoxicity is genuinely difficult to predict. The model's descriptors, introduced in section 4.3, show which structural patterns indicate hepatotoxicity and which do not: of 13 structural patterns, 11 flag hepatotoxicity and only 2 flag non-hepatotoxicity. The model was clearly designed from the start to focus on finding hepatotoxic substances, which is why the validation results show high accuracy on hepatotoxic substances (high sensitivity). Considering the scenario in which the model is used, its performance is not bad. The goal of a hepatotoxicity prediction model is to identify and drop substances likely to cause hepatotoxicity problems before clinical trials, so to be useful it needs above all to be good at catching hepatotoxic substances. Accurately predicting non-hepatotoxic substances would be nice, but that is not the model's goal. So I think it's a good model to use under a fail-early strategy.
Conclusion: the appropriate development and validation methods depend on the strategy. How a model's performance is validated is closely tied to how the model will be used in practice.