As data volumes and computing power continue to grow, machine learning is gradually permeating every industry. Computer vision, natural language processing, robotics, and related fields are now largely dominated by machine learning algorithms, and the technology is expanding into traditional sectors such as education, banking, and medicine. For more on how machine learning can change the traditional education model, see the blogger's article "Using AR, AI, and Big Data to Reform the Education System - Creating a Private, Custom Learning Path for Every Student." In banking there is currently a great deal of AI speculation; most banks are taking a wait-and-see attitude and will not replace most staff with AI in the short term. AI applications in medicine are also booming, from detecting cancer to driving new drug discovery engines and genetic testing. Sepsis is a common and serious medical complication, and this article uses machine learning to predict outcomes for sepsis patients after discharge.
Sepsis is a systemic inflammatory response syndrome caused by infection. In severe cases it can lead to organ dysfunction or circulatory failure. It is a common complication of severe trauma, burns, shock, infection, and major surgery. Because its symptoms, such as fever and hypotension, closely resemble those of other common diseases, it is difficult to detect early. Left untreated, it can progress to septic shock, for which in-hospital mortality exceeds 40%.
Knowing which sepsis patients are at the highest risk of death helps clinicians prioritize care. The team worked with researchers at the Geisinger health care system to build a model from historical electronic health record (EHR) data that predicts all-cause mortality for hospitalized sepsis patients, either during hospitalization or within 90 days of discharge. The model guides the medical team to monitor patients predicted to have a high probability of death more closely and to take effective preventive measures.
Data science environment
IBM Data Science Experience provides the data and programming environment: three popular programming languages (Python, Scala, and R) and two notebook-style analysis tools (Jupyter and Zeppelin). In addition, IBM Data Science Experience supports operationalizing models by scoring them in real time or in batches from business applications, and by integrating feedback loops for continuous model monitoring and retraining.
Collect and preprocess data
Geisinger provided records for more than 10,000 patients diagnosed with sepsis between 2006 and 2016, including demographics, inpatient and outpatient encounters, surgeries, medical history, medications, transfers between hospital units, and laboratory results.
For each patient, the most recent and most relevant hospitalization was selected, including specific information from the hospital stay, such as the type of surgery and culture site (for bacteria). In addition, summary information from before admission was derived, such as the number of surgical operations in the 30 days before hospitalization; no post-discharge data were used. Figure 1 illustrates these time-based data decisions:
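The study's actual EHR schema is not published, so the field names below are hypothetical, but a pre-admission summary feature such as "number of surgeries in the 30 days before hospitalization" can be sketched with nothing more than the standard library:

```python
from datetime import date, timedelta

def surgeries_in_window(admission_date, surgery_dates, window_days=30):
    """Count surgeries in the window before admission.

    Events on or after the admission date are excluded, matching the
    rule that no in-stay or post-discharge data leak into the feature.
    """
    start = admission_date - timedelta(days=window_days)
    return sum(1 for d in surgery_dates if start <= d < admission_date)

admission = date(2016, 5, 1)
surgeries = [date(2016, 4, 20), date(2016, 4, 2), date(2016, 3, 25), date(2016, 5, 3)]
print(surgeries_in_window(admission, surgeries))  # 2: only Apr 20 and Apr 2 fall in the window
```

The same windowing idea applies to any pre-admission count (medications, transfers, lab orders): fix the admission date, derive the window, and aggregate only events inside it.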
After merging the provided data sets, the resulting data set consisted of 10,599 rows with 199 attributes (features) per patient.
After data cleansing and feature selection, the task was defined as a binary classification problem: predicting whether a sepsis patient will die within 90 days of discharge.
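The 90-day label itself can be sketched as follows; the function name and the `None`-for-survivors convention are assumptions for illustration, not the study's actual encoding:

```python
from datetime import date, timedelta

def label_90_day_mortality(discharge_date, death_date, window_days=90):
    """Return 1 if the patient died on or before discharge + 90 days, else 0.

    `death_date=None` means the patient is not known to have died.
    """
    if death_date is None:
        return 0
    return int(death_date <= discharge_date + timedelta(days=window_days))

print(label_90_day_mortality(date(2016, 1, 10), date(2016, 3, 1)))  # 1: died 51 days after discharge
print(label_90_day_mortality(date(2016, 1, 10), None))              # 0: survivor
```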
The chosen algorithm is Gradient Boosted Trees (GBT), implemented with the XGBoost package. Thanks to its execution speed and robustness, XGBoost has long been a popular algorithm in machine learning competitions. Another motivation for using XGBoost is the ability to fine-tune hyperparameters to improve model performance. On the training data, parameters were selected iteratively using ten-fold cross-validation and grid search (GridSearchCV) to maximize the area under the ROC curve (AUC). An example on IBM Data Science Experience can be seen here.
The data set was split into a training set (60%) and a test set (40%). The model was fit on the training set, and the trained model was then applied to the test set. Its performance is shown in Figure 2:
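A minimal sketch of this pipeline is below. The real EHR data are not public, so a synthetic table stands in, and scikit-learn's `GradientBoostingClassifier` is used so the sketch runs without XGBoost installed; `xgboost.XGBClassifier` implements the same estimator API and could be dropped in directly.

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier
from sklearn.metrics import roc_auc_score
from sklearn.model_selection import GridSearchCV, train_test_split

# Synthetic stand-in for the 10,599-row, 199-feature patient table.
X, y = make_classification(n_samples=600, n_features=20, random_state=0)

# 60% training / 40% test split, as described in the article.
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.4, random_state=0)

# Ten-fold cross-validated grid search that maximizes AUC;
# the parameter grid here is a tiny illustration, not the study's grid.
grid = GridSearchCV(
    GradientBoostingClassifier(random_state=0),
    param_grid={"max_depth": [2, 3], "n_estimators": [50, 100]},
    scoring="roc_auc",
    cv=10,
)
grid.fit(X_tr, y_tr)

# Apply the tuned model to the held-out test set.
auc = roc_auc_score(y_te, grid.predict_proba(X_te)[:, 1])
print(f"best params: {grid.best_params_}, test AUC: {auc:.3f}")
```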
Figure 2 reports performance metrics such as the AUC score. The closer the AUC is to 1, the better the model separates true positives from false positives. The test-set AUC of 0.8561 indicates that the model can identify most sepsis patients who will die within 90 days, so patients predicted to die can be given appropriately targeted care.
For precision and recall, the closer each number is to 1, the better the model. The values shown in Figure 2 are close to 0.80, with a deliberate bias toward high recall: the goal is to minimize the number of patients the model misses who may ultimately die of sepsis.
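The two metrics reduce to simple ratios of confusion-matrix counts. In the example below, the TP count comes from Figure 3 but the FP and FN counts are hypothetical, chosen only so the ratios land near the article's reported 0.80:

```python
def precision_recall(tp, fp, fn):
    """Precision: of patients predicted to die, the fraction who did.
    Recall: of patients who died, the fraction the model flagged."""
    precision = tp / (tp + fp)
    recall = tp / (tp + fn)
    return precision, recall

# TP = 1190 is from Figure 3; FP and FN here are hypothetical.
p, r = precision_recall(tp=1190, fp=300, fn=290)
print(f"precision={p:.3f}, recall={r:.3f}")  # precision=0.799, recall=0.804
```

Trading a little precision for recall (e.g., by lowering the decision threshold) is the concrete mechanism behind "minimizing missed patients."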
As a further check on accuracy, bootstrapping was used to generate 1,000 resampled variants of the training and test data; the XGBoost model was run on each, yielding an accuracy score for every run. With 95% probability the accuracy falls between 0.77 and 0.79, meaning the model correctly classifies more than three-quarters of cases.
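The article refits the model on each of the 1,000 resamples; for brevity, the standard-library sketch below instead bootstraps the accuracy of a fixed set of predictions, which illustrates the same interval mechanics:

```python
import random

def bootstrap_accuracy_interval(y_true, y_pred, n_boot=1000, alpha=0.05, seed=0):
    """Resample (truth, prediction) pairs with replacement and return the
    central 95% interval of the bootstrap accuracy distribution."""
    rng = random.Random(seed)
    n = len(y_true)
    accs = sorted(
        sum(y_true[i] == y_pred[i] for i in (rng.randrange(n) for _ in range(n))) / n
        for _ in range(n_boot)
    )
    return accs[int(n_boot * alpha / 2)], accs[int(n_boot * (1 - alpha / 2)) - 1]

# Toy predictions that are right about 78% of the time.
rng = random.Random(1)
y_true = [rng.randint(0, 1) for _ in range(500)]
y_pred = [y if rng.random() < 0.78 else 1 - y for y in y_true]
lo, hi = bootstrap_accuracy_interval(y_true, y_pred)
print(f"95% bootstrap interval for accuracy: [{lo:.3f}, {hi:.3f}]")
```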
In addition to the metrics above, the model's confusion matrix is shown in Figure 3. On the test data, the model produced 1,190 true positives (sepsis patients who died and were predicted to die) and 2,087 true negatives (patients who survived and were predicted to survive).
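Tabulating such a matrix from label/prediction pairs is a one-liner over the four (truth, prediction) combinations; a minimal sketch:

```python
from collections import Counter

def confusion_counts(y_true, y_pred):
    """Return TP/FP/TN/FN counts for binary labels (1 = died within 90 days)."""
    c = Counter(zip(y_true, y_pred))
    return {
        "TP": c[(1, 1)], "FP": c[(0, 1)],
        "TN": c[(0, 0)], "FN": c[(1, 0)],
    }

print(confusion_counts([1, 1, 0, 0, 1], [1, 0, 0, 1, 1]))
# {'TP': 2, 'FP': 1, 'TN': 1, 'FN': 1}
```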
XGBoost can also report feature importance. The importance scores do not say whether a feature points toward death or survival, but they are still very useful because they reveal which features the model relies on when predicting death. As shown in Figure 4, the "admission age" feature accounted for 29.5% of the model's feature importance.
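Extracting importances is straightforward. The sketch below uses synthetic data and hypothetical feature names echoing those mentioned in the article, with scikit-learn's `GradientBoostingClassifier` standing in for XGBoost (whose `XGBClassifier` exposes the same `feature_importances_` attribute):

```python
from sklearn.datasets import make_classification
from sklearn.ensemble import GradientBoostingClassifier

# Synthetic data; these feature names are illustrative, not the study's EHR fields.
X, y = make_classification(n_samples=400, n_features=5, n_informative=3, random_state=0)
names = ["admission_age", "vasopressor_hours", "num_transfers", "lab_lactate", "prior_surgeries"]

model = GradientBoostingClassifier(random_state=0).fit(X, y)

# Importances are normalized to sum to 1, so each reads as a share of the total.
for name, imp in sorted(zip(names, model.feature_importances_), key=lambda t: -t[1]):
    print(f"{name}: {imp:.1%}")
```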
The features were explored further to see how they correspond to the death outcome. Although such plots help visualize the relationship between individual features and the outcome, it is important to remember that XGBoost trains an ensemble of decision trees; a feature that is important in the XGBoost model therefore may not show a strong direct correlation with the outcome variable.
As shown in Figure 5, the "admission age" feature may indicate that older patients die at a higher rate than younger patients, and the "vasopressor use time" feature may indicate that patients who received vasopressors for longer have higher mortality, although those deaths may also reflect the patients' poorer underlying health.
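This kind of marginal check, e.g., death rate by age band, is a simple group-by; the records below are invented for illustration and are not the study's data:

```python
from collections import defaultdict

def death_rate_by_age_band(records, band_width=20):
    """Group (age, died) records into age bands and compute each band's death rate."""
    totals = defaultdict(lambda: [0, 0])  # band start -> [deaths, patients]
    for age, died in records:
        band = (age // band_width) * band_width
        totals[band][0] += died
        totals[band][1] += 1
    return {f"{b}-{b + band_width - 1}": d / n for b, (d, n) in sorted(totals.items())}

# Hypothetical (age, died-within-90-days) records.
records = [(45, 0), (52, 0), (58, 1), (63, 1), (71, 1), (76, 1), (79, 0), (83, 1)]
print(death_rate_by_age_band(records))
```

A rising rate across bands would support the age interpretation, while a flat one would suggest age matters mainly through interactions inside the trees.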
The decision tree rules output by XGBoost can help doctors refine treatment plans. For example, because elderly patients face a higher risk of death, the medical team can pay special attention to them, monitor the duration of vasopressor use, and minimize the number of transfers between units to reduce the impact on susceptible patients.
Predicting all-cause mortality in sepsis patients can guide health providers to monitor proactively and take preventive measures that improve patient survival. The model selected important features believed to be associated with death in sepsis patients; in other words, machine learning models can help identify variables associated with sepsis mortality. As more data accumulate, additional key features will be added to improve the model. The same method can also be applied to predicting other conditions, with the hope of producing more actionable models that raise the overall level of care.