Abstract:
Road traffic accidents occur frequently, yet the distribution of records across traditional accident severity classes is often imbalanced. To explore the coupling effects of multidimensional factors on severe traffic accidents under sample imbalance, this study proposes an analytical framework that integrates the Adaptive Synthetic Sampling (ADASYN) algorithm, a Stacking ensemble learning model, and the Apriori algorithm. Using data from the U.S. Department of Transportation's Fatality Analysis Reporting System (FARS) for 2017 to 2021, fifteen candidate feature variables are selected across four dimensions (human, vehicle, road, and environment) to analyze how the coupling of multidimensional factors affects the occurrence of severe accidents. The ADASYN algorithm is employed to address sample imbalance. Four classical machine learning models, namely random forest (RF), categorical boosting (CatBoost), extreme gradient boosting (XGBoost), and gradient boosting decision tree (GBDT), are selected as base learners. Five meta-learners, namely logistic regression, Gaussian Naive Bayes, support vector machine (SVM), light gradient boosting machine (LightGBM), and multilayer perceptron (MLP), are compared to identify the Stacking ensemble model with the strongest generalization performance. Based on the optimal model, a feature importance ranking is obtained to determine the key influencing factors. The Apriori algorithm is then applied for multidimensional coupling analysis, which explores the impact of five-dimensional factor coupling on the rate of severe accidents.
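The oversampling step named above can be illustrated with a minimal sketch of the ADASYN idea. This is not the study's implementation: it assumes purely numeric features and Euclidean distance, the function names `k_nearest` and `adasyn`, the parameters `k` and `beta`, and the toy points are all hypothetical, and returning an empty list when the classes do not overlap is a simplification chosen for this sketch.

```python
import math
import random


def k_nearest(point, candidates, k):
    """Return the k (point, label) pairs closest to `point` (Euclidean distance)."""
    return sorted(candidates, key=lambda c: math.dist(point, c[0]))[:k]


def adasyn(minority, majority, k=3, beta=1.0, seed=0):
    """Create synthetic minority points, concentrating on hard-to-learn regions."""
    rng = random.Random(seed)
    labelled = [(p, 1) for p in minority] + [(p, 0) for p in majority]
    total = round((len(majority) - len(minority)) * beta)  # points to generate

    # r_i: share of majority points among the k nearest neighbours of x_i.
    ratios = []
    for i, x in enumerate(minority):
        others = labelled[:i] + labelled[i + 1:]  # minority comes first, so this drops x itself
        neighbours = k_nearest(x, others, k)
        ratios.append(sum(1 for _, label in neighbours if label == 0) / k)

    norm = sum(ratios)
    if norm == 0:  # assumption of this sketch: no overlap means nothing to generate
        return []

    synthetic = []
    for i, x in enumerate(minority):
        n_new = round(ratios[i] / norm * total)  # more points where r_i is larger
        mates = k_nearest(x, [(p, 1) for j, p in enumerate(minority) if j != i], k)
        for _ in range(n_new):
            z = rng.choice(mates)[0]   # a nearby minority point
            lam = rng.random()         # interpolation weight in [0, 1)
            synthetic.append(tuple(a + lam * (b - a) for a, b in zip(x, z)))
    return synthetic


# Toy data only (not FARS records): 3 minority points, 7 majority points.
minority = [(1.0, 1.0), (1.2, 0.9), (3.0, 3.0)]
majority = [(3.1, 3.2), (2.9, 3.1), (3.3, 2.8), (5.0, 5.0),
            (5.2, 4.9), (4.8, 5.1), (6.0, 6.0)]
new_points = adasyn(minority, majority, k=2)
```

In this toy set only the minority point surrounded by majority points receives synthetic neighbours, which is the adaptive behaviour that distinguishes ADASYN from uniform oversampling. With categorical crash attributes, the interpolation step would need an encoding suited to such data, which this sketch does not address.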
The results indicate that: ① The Stacking ensemble model with logistic regression as the meta-learner and RF, CatBoost, XGBoost, and GBDT as base learners achieves the best overall performance, with a recall of 0.80; ② Five factors (road type, season, collision type, lighting conditions at the time of the collision, and driver alcohol consumption) account for 53.2% of the total importance of all factors, substantially more than the other variables. Among them, collisions involving "impact with trees or other pole-like objects" exhibit the highest severe accident rate, at 86.2%, and the severe accident rate under illuminated conditions is 13.5% higher than under non-illuminated conditions; ③ Multidimensional factor coupling analysis reveals that the probability of a severe crash is highest when the following factors coexist: municipal roads, sober drivers, a transition between unlit and lit lighting conditions at the time of the collision, and the autumn season. Under this coupled condition, the rule confidence reaches 89.0%, which challenges the conventional perception that non-drinking is a low-risk factor.
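The confidence reported for a coupled condition is the share of records matching that condition that are also severe. The sketch below shows how support, confidence, and lift of a single rule can be computed. It covers only the rule metrics, not the frequent-itemset search that Apriori performs, and the function name `rule_metrics`, the item labels, and the ten toy records are hypothetical illustrations rather than FARS data or results from the study.

```python
def rule_metrics(records, antecedent, consequent):
    """Return (support, confidence, lift) of the rule antecedent -> consequent."""
    a, c = set(antecedent), set(consequent)
    n = len(records)
    n_a = sum(1 for r in records if a <= r)         # records matching the condition
    n_c = sum(1 for r in records if c <= r)         # records matching the outcome
    n_ac = sum(1 for r in records if (a | c) <= r)  # records matching both

    support = n_ac / n
    confidence = n_ac / n_a if n_a else 0.0
    lift = confidence / (n_c / n) if n_c else 0.0
    return support, confidence, lift


# Toy records only: each crash is a set of "attribute=value" items.
records = [
    {"road=municipal", "season=autumn", "severity=severe"},
    {"road=municipal", "season=autumn", "severity=severe"},
    {"road=municipal", "season=autumn", "severity=severe"},
    {"road=municipal", "season=autumn", "severity=severe"},
    {"road=municipal", "season=autumn", "severity=minor"},
    {"road=municipal", "season=spring", "severity=severe"},
    {"road=highway", "season=autumn", "severity=severe"},
    {"road=highway", "season=spring", "severity=minor"},
    {"road=highway", "season=summer", "severity=minor"},
    {"road=municipal", "season=winter", "severity=minor"},
]

support, confidence, lift = rule_metrics(
    records,
    antecedent={"road=municipal", "season=autumn"},
    consequent={"severity=severe"},
)
```

A lift above 1 indicates that the coupled condition is associated with severe outcomes more often than the overall severe rate would suggest, which is why confidence is usually read together with lift rather than on its own.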