Abstract

This study is centered around the COVID-19 pandemic which has posed a global health concern for over three years. It emphasizes the importance of effectively utilizing epidemic simulation models for informed decision-making concerning epidemic control. The challenge lies in appropriately choosing, adapting, and interpreting these models. The research constructs three statistical machine learning models to predict the spread of COVID-19 in specific regions and evaluates their performance using real COVID-19 incidence data. The paper presents short-term (3, 7, 14, 21, and 30 days) forecasts of COVID-19 morbidity and mortality for Germany, Japan, South Korea, and Ukraine. The precision of each model was scrutinized based on the type of input data used. Recommendations are provided on how various data sources can enhance the interpretation quality of machine learning models predicting infectious disease dynamics. The initial findings suggest the need for the comprehensive utilization of all available data, favoring cumulative data during holiday-rich periods and daily data otherwise. To minimize the absolute error, databases should be compiled using daily morbidity and mortality rates.

1. Introduction

The COVID-19 pandemic, caused by the spread of the SARS-CoV-2 coronavirus, has been a threat to global public health for almost three years. At the end of 2022, more than 640 million cases were registered worldwide, of which more than 6.6 million were fatal [1].

The global crisis caused by the pandemic has shown the critical role of information technology. The world has accelerated the digitalization of most areas of activity, including healthcare systems [2]. Research related to data-driven medicine is aimed at solving such problems as automated diagnostics [3], analysis of medical [4] and nonmedical interventions [5] to reduce the dynamics of morbidity, analysis of medical images [6], analysis of medical data [7], and modeling the dynamics of the epidemic process [8].

One of the essential tools for controlling the COVID-19 pandemic and other infectious diseases is modeling its dynamics, including forecasting. Forecasting the epidemic process dynamics allows us to predict how the incidence will develop and to conduct experimental studies to evaluate the effectiveness of various preventive measures.

Therefore, this study is aimed at building three statistical machine learning models for predicting the dynamics of COVID-19 in certain areas and at studying the performance of these models using experiments with actual COVID-19 incidence data.

To achieve the goal, the following tasks were formulated:
(1) To analyze models and methods for modeling the epidemic process of COVID-19
(2) To develop a predictive model for COVID-19 dynamics based on the logistic regression method
(3) To develop a predictive model of COVID-19 dynamics based on the decision tree method
(4) To develop a predictive model for COVID-19 dynamics based on the support vector regression method
(5) To evaluate the results of predicting the dynamics of COVID-19 using the developed models for data in various territories
(6) To compare the accuracy and adequacy of the developed models performed with the databases of different countries
(7) To analyze the performance of the developed models

The expected contribution of the research is threefold. Firstly, developing predictive models based on statistical machine learning methods will make it possible to analyze their effectiveness for modeling the epidemic process of COVID-19 and other infectious diseases. Secondly, such models can be used in public health practice in resource-limited settings to support decision-making on control measures to contain the dynamics of the COVID-19 pandemic. Thirdly, the analysis of models in terms of input data on morbidity will allow future research on modeling epidemic processes to be adjusted and the models to be applied more effectively.

The rest of the paper is structured as follows. Section 2 provides an overview of models and methods of COVID-19 epidemic process simulation. Section 3 describes three regression approaches to COVID-19 morbidity forecasting (logistic regression, decision tree, and support vector regression) and the metrics used for evaluating the models’ performance. Section 4 presents the models’ performance and the estimation of their adequacy and forecasting accuracy. Section 5 discusses the prospective use of the models and their limitations and analyzes the effectiveness of using different input data for forecasting. The conclusion describes the outcomes of the research.

This research is part of a complex intelligent information system for epidemiological diagnostics, the concept of which is discussed in [9].

Preliminary research has been done for other statistical machine learning methods for modeling COVID-19: linear regression, lasso regression, ridge regression [10], random forest, K-nearest neighbors, and gradient boosting [11]. This study also explores the problem of input data in modeling epidemic processes.

2. Current Research Analysis

Epidemic process models have been used for over a century to control infectious disease dynamics, study disease behavior, and develop effective interventions to prevent epidemics. The global COVID-19 pandemic has stimulated a new round of research in this direction.

Compartmental models of the dynamics of the new coronavirus remain the most popular for practical application.

The authors of [12] study the theoretical foundations of the simplest SIR (susceptible-infected-recovered) model for modeling a new coronavirus. The authors explore the temporal evolution of different populations and track various significant parameters of the spread of the disease in different communities. However, the forecasts obtained in that work are not sufficiently accurate. The work [13] presents a model for early prediction of COVID-19 based on the SIR structure, which allows predicting the situation for 700 days. The authors model the outbreak and possible scenarios for its termination with various types of control measures. The forecast presented by the authors can be retrospectively assessed as unreliable. In addition, one of the scenarios the authors investigate assumes a specialized treatment for COVID-19, which does not exist to date.

The study [14] is devoted to modeling COVID-19 in Canada using various models, including the SIR model. The constructed model does not account for asymptomatic cases; this assumption is not valid, since asymptomatic carriage is an important characteristic that stimulates the spread of infectious diseases. The work [15] presents the SIR model for modeling the dynamics of COVID-19. The study calculates disease-free and endemic equilibrium, with global persistence calculated using the construction of the Lyapunov function and local persistence determined using the Jacobian matrix. The authors conclude that the nature of COVID-19 coincides with SARS, which is not valid.

The authors of [16] explore the dynamics of the classical SIR model as applied to COVID-19. The model considers a nonlinear removal rate, which depends on the hospital bed ratio. The authors conclude that the epidemic declines when the value of the basic reproductive number is less than one, but this is a general epidemiological rule rather than a new result. In the study [17], the authors apply a modified SIR model to study the spread of COVID-19 in China. As a result, the authors argue that the increase in the number of control measures by the state has a positive effect on reducing the dynamics of COVID-19. However, the presented model does not allow such conclusions to be drawn, since social factors and the impact of such state control on other external factors that influence the development of the disease are not investigated.

In [18], the authors apply an implicit time-discrete SIR model that tracks transmission and recovery rates to predict the dynamics of COVID-19 in Fiji. The model does not take into account many factors that play an important role in the dynamics of infectious diseases, including the incubation period, the impact of control measures already taken, and the heterogeneity and openness of the population, as well as the difference between registered and real cases of the disease. The authors of the article [19] extend the standard SIR model with the global dynamics of the COVID-19 pandemic. The proposed model was parameterized using a two-stage model fitting algorithm on data from six randomly selected US cities. Despite the increase in the accuracy of the model compared to the classical SIR model, it does not consider many factors that affect the dynamics of morbidity.

The study [20] discusses the numerical solution of the SIR model of the spread of COVID-19 using the Taylor matrix and the collocation method for Turkey. The model does not consider the dynamics of external factors, so the solutions obtained using the model are difficult to update to a changing situation. The paper [21] proposes an extension of the classical SIR model, the adaptive susceptible-infected-removed-vaccinated model with time-dependent transmission and removal rates. The authors propose a numerical solution to the inverse problem using the variational embedding method, which reduces the inverse problem to the problem of minimizing a well-formed functional to obtain the desired values. The model and its numerical solution are complex, making it difficult to introduce actual changes in disease dynamics into the model, such as changes in virulence and control measures.

Some researchers have extended the classic SIR model for modeling COVID-19 by adding new compartments. The authors of [22] extend the model with the exposed compartment. The model was applied to the early phase of the pandemic in Italy and was analyzed for sensitivity to determine the most critical parameters that have the most significant impact on the basic reproduction number. The article [23] describes the SEIAR model of COVID-19 with five compartments (susceptible-exposed-symptomatic-asymptomatic-recovered/removed). As a result, the authors conclude that the virus is highly contagious for people after the age of 45 years and has low susceptibility to the virus up to 14 years of age. The authors of [24] expand the SEIR model by introducing the characteristics of age groups, symptomatic and asymptomatic disease development, and vaccinated and unvaccinated population. The results show that, despite the high level of detail, the model cannot predict changes in epidemic dynamics caused by the emergence of new strains or the introduction of new control measures.

The work [25] proposes an extended specialized SEIR model for COVID-19 modeling called SEAHIR (susceptible-exposed-asymptomatic-hospitalized-isolated-removed). In the proposed model, the “infected” compartment is divided into “asymptomatic,” “isolated,” and “hospitalized.” The model also considers the impact of nonpharmaceutical interventions such as physical distancing and different testing strategies. The paper [26] presents a hybrid compartmental model for studying the evolution of the COVID-19 pandemic in Italy. The proposed SEIRDV model includes six compartments, considering the vaccinated population. At the same time, infection is represented both as a linear and as an exponential piecewise continuous function. The results show that different levels of vaccination give similar infection curves.

All the models described using the compartmental approach have several disadvantages, including modeling for homogeneous and closed populations, the impossibility of taking into account all the factors influencing the dynamics of the epidemic process and the complexity of systems of differential equations describing the system, and the difficulty of making changes to the model, adapting it to reality. These shortcomings affect the adequacy and accuracy of the model, which does not simulate the actual situation with the incidence of COVID-19 effectively.
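To make the compartmental structure discussed above concrete, the following minimal sketch integrates the classical SIR equations for a closed, homogeneous population. It is an illustration only and does not reproduce any of the cited works; the function name `simulate_sir`, the forward-Euler scheme, and all parameter values are assumptions chosen for the example.

```python
def simulate_sir(beta, gamma, s0, i0, r0, days, dt=0.1):
    """Integrate the classical SIR equations with a forward-Euler scheme.

    beta  -- transmission rate (assumed constant over time)
    gamma -- removal rate (assumed constant over time)
    Returns a list of (S, I, R) tuples, one entry per whole day.
    """
    n = s0 + i0 + r0  # closed population: the total never changes
    s, i, r = float(s0), float(i0), float(r0)
    steps_per_day = round(1 / dt)
    trajectory = [(s, i, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            new_infections = beta * s * i / n * dt
            new_removals = gamma * i * dt
            s -= new_infections
            i += new_infections - new_removals
            r += new_removals
        trajectory.append((s, i, r))
    return trajectory


# Hypothetical parameters; the basic reproduction number is beta / gamma.
growing = simulate_sir(beta=0.3, gamma=0.1, s0=9990, i0=10, r0=0, days=30)
declining = simulate_sir(beta=0.05, gamma=0.1, s0=9990, i0=10, r0=0, days=30)
```

With a basic reproduction number above one the infected compartment grows, and below one it shrinks. The fixed `beta` and `gamma` also show why such a model is hard to adapt to new strains or control measures without re-estimating its parameters.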

Higher accuracy is shown by predictive models based on machine learning methods.

The paper [27] presents a predictive model for COVID-19 in India based on an artificial neural network with a long short-term memory (LSTM) architecture. The model predicts the total number of cases, recoveries, and deaths of COVID-19 over 80 days. The model showed an accuracy of 95.46%. The authors of [28] propose models based on recurrent neural networks such as LSTM, bidirectional LSTM, and encoder-decoder LSTM models for multistep (short term) COVID-19 infection forecasting. Using the presented models, a forecast for two months ahead is built based on data on the first and second waves of incidence. However, the authors note the difficulties with modeling associated with the unreliability of the data and the difficulty of considering factors such as population density, logistics, social aspects, and lifestyle of the studied population.

The authors of [29] proposed a deep learning approach that includes recurrent neural networks and LSTM networks for predicting the probable numbers of COVID-19 cases. For a pilot study, data from the European Centre for Disease Prevention and Control on the incidence of COVID-19 in Malaysia, Morocco, and Saudi Arabia were used. The results showed an accuracy of 98.58% for the LSTM model and 93.45% for the RNN model in predicting new COVID-19 cases over seven days.

The authors of [30] employed Bayesian optimization to tune the Gaussian process regression (GPR) hyperparameters to develop an efficient GPR-based model for forecasting the recovered and confirmed COVID-19 cases. The authors show the superiority of the proposed approach in comparison with other time series forecasting models. However, only one dataset was used for forecasting, so the model’s performance may differ depending on the area where the simulation is carried out. The authors of [31] propose three deep learning models, including CNN, LSTM, and the CNN-LSTM, to predict the dynamics of COVID-19 in Brazil, India, and Russia. The authors note that various socioeconomic, geographic, and political reasons may influence public policy in implementing control measures to contain the epidemic dynamics.

Despite the high accuracy of COVID-19 predictive models based on deep learning, they cannot always be applied in resource-limited settings. The requirements for high computing power that such models impose are difficult to meet directly in public health institutions. Therefore, this paper proposes an analysis of COVID-19 predictive models based on statistical machine learning methods.

3. Materials and Methods

As part of this study, three machine learning models were built to predict the dynamics of COVID-19. The models are based on regression methods: logistic regression, decision tree, and support vector regression.

Regression analysis is an analytical method of statistical machine learning that calculates the estimated relationship between a dependent variable and one or more independent variables [32]. Regression analysis finds the model relationships between selected variables and model-based predictive values. Regression analysis uses the chosen estimation method, the dependent variable, and one or more independent variables to create an equation that estimates the values of the dependent variable.

3.1. Logistic Regression

Logistic regression is a data analysis technique that allows finding the relationship between two data factors [33]. This relationship is used to predict the value of one of these factors based on the other. To do this, a dependent variable $y$ is introduced, which takes the values 0 and 1, and a set of independent variables (features) $f_1, \dots, f_n$. Based on these values, it is necessary to calculate the probability of accepting one or another value of the dependent variable.

Let objects be defined by $n$ numerical features $f_j: X \to \mathbb{R}$, $j = 1, \dots, n$, and let $X$ be the space of feature descriptions; in this case,

$$X = \mathbb{R}^n.$$

$Y$ is a finite set of class labels, and a training set of “object-factor” pairs is given as follows:

$$X^m = \{(x_1, y_1), \dots, (x_m, y_m)\}.$$

Consider the case of two classes, encoded as $Y = \{-1, +1\}$. In logistic regression, a linear classification algorithm $a: X \to Y$ of the following kind is built:

$$a(x, w) = \operatorname{sign}\left(\sum_{j=1}^{n} w_j f_j(x) - w_0\right) = \operatorname{sign}\langle x, w \rangle,$$

where $w_j$ is the weight of the $j$-th feature, $w_0$ is the decision threshold, $w = (w_0, w_1, \dots, w_n)$ is the weight vector, and $\langle x, w \rangle$ is the scalar product of the feature description of the object and the weight vector. At the same time, it is assumed that a zero feature is artificially introduced:

$$f_0(x) = -1.$$

The task of training a linear classifier is to adjust the weight vector $w$ based on the sample $X^m$. In logistic regression, for this, the problem of minimizing empirical risk with a loss function of a special type is solved:

$$Q(w) = \sum_{i=1}^{m} \ln\left(1 + \exp\left(-y_i \langle x_i, w \rangle\right)\right) \to \min_{w}.$$

After finding the solution $w$, it becomes possible to estimate the posterior probabilities of an object $x$ belonging to the classes:

$$P(y \mid x) = \sigma\left(y \langle x, w \rangle\right), \quad y \in Y,$$

where

$$\sigma(z) = \frac{1}{1 + e^{-z}}.$$

The following are the advantages of the logistic regression method:
(i) Logistic regression models are mathematically less complex than other machine learning methods, which also makes troubleshooting easier
(ii) Logistic regression models allow developers to better understand internal processes than other machine learning methods
(iii) Logistic regression models can process large amounts of data at high speed because they require less computing power

The following are the disadvantages of the logistic regression method:
(i) The model handles a large number of categorical variables poorly
(ii) For the model to work, it is necessary to transform nonlinear functions
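The training procedure described in this subsection can be sketched as follows. This is a minimal illustration under stated assumptions, not the implementation used in the study: class labels are encoded as -1 and +1, the empirical risk is the sum of ln(1 + exp(-y⟨x, w⟩)), and it is minimized by plain gradient descent. The function names, the learning rate, the number of epochs, and the toy sample are all hypothetical.

```python
import math


def sigmoid(z):
    """Sigmoid function, written to avoid overflow for large |z|."""
    if z >= 0:
        return 1.0 / (1.0 + math.exp(-z))
    e = math.exp(z)
    return e / (1.0 + e)


def dot(x, w):
    """Scalar product <x, w>."""
    return sum(xj * wj for xj, wj in zip(x, w))


def logistic_loss(xs, ys, w):
    """Empirical risk: sum of ln(1 + exp(-y * <x, w>)) over the sample."""
    total = 0.0
    for x, y in zip(xs, ys):
        margin = y * dot(x, w)
        # ln(1 + exp(-margin)), computed in a numerically stable way
        total += max(-margin, 0.0) + math.log1p(math.exp(-abs(margin)))
    return total


def fit(xs, ys, learning_rate=0.1, epochs=2000):
    """Plain gradient descent on the averaged logistic loss."""
    w = [0.0] * len(xs[0])
    for _ in range(epochs):
        grad = [0.0] * len(w)
        for x, y in zip(xs, ys):
            # Gradient of ln(1 + exp(-y<x,w>)) is -y * sigmoid(-y<x,w>) * x
            coeff = -y * sigmoid(-y * dot(x, w))
            for j, xj in enumerate(x):
                grad[j] += coeff * xj
        w = [wj - learning_rate * gj / len(xs) for wj, gj in zip(w, grad)]
    return w


def posterior(x, w, y):
    """Posterior probability P(y | x) = sigmoid(y * <x, w>), y in {-1, +1}."""
    return sigmoid(y * dot(x, w))


# Toy one-feature sample (hypothetical). The leading -1 is the artificial
# constant feature, so w[0] plays the role of the decision threshold.
xs = [[-1, 0.0], [-1, 1.0], [-1, 2.0], [-1, 3.0], [-1, 4.0], [-1, 5.0]]
ys = [-1, -1, -1, 1, 1, 1]
w = fit(xs, ys)
```

Calling `posterior([-1, 5.0], w, 1)` then gives the estimated probability that an object with feature value 5 belongs to the positive class; the two class probabilities for any object sum to one.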

3.2. Decision Tree

Decision trees are a nonparametric supervised learning method used for classification and regression [34]. The goal of the method is to create a model that predicts the value of the target variable by learning simple decision rules derived from the characteristics of the data. If the target variable has continuous values, decision trees allow for establishing the dependence of the target variable on independent variables.

A decision tree is a hierarchical tree structure consisting of “if-then” decision rules that can be formulated in natural language.

The method recursively divides the original dataset into subsets that become more and more homogeneous concerning certain features, resulting in a tree-like hierarchical structure. The division is carried out based on traditional logical rules in the form “If $A$, then $B$”, where $A$ is some logical condition and $B$ is the procedure for dividing the subset into two parts, for one of which condition $A$ is true, and for the other, it is false.

To construct a tree, it is necessary to set the quality functional based on which the sample is split at each step. Let $R_m$ be the set of objects that fall into the vertex split at this step, and let $R_l$ and $R_r$ be the objects that fall into the left and right subtrees for a given predicate. Then, we will use the following functional:

$$Q(R_m) = \frac{|R_l|}{|R_m|} H(R_l) + \frac{|R_r|}{|R_m|} H(R_r),$$

where $H(R)$ is the informativeness criterion that evaluates the distribution quality of the target variable among the objects of the set $R$. The smaller the diversity of the target variable, the lower the value of the informativeness criterion should be.

In each leaf, the tree will produce a real number $c$. Based on this, it is possible to evaluate the quality of a set of objects $R$:

$$H(R) = \min_{c} \frac{1}{|R|} \sum_{(x_i, y_i) \in R} L(y_i, c),$$

where $L(y, c)$ is some loss function.

To build a regression model, we choose the square of the deviation as a loss function. In this case, the informativeness criterion will look like this:

$$H(R) = \min_{c} \frac{1}{|R|} \sum_{(x_i, y_i) \in R} (y_i - c)^2.$$

The following are the advantages of the decision tree method:
(i) Easy interpretability and visualization capability
(ii) Only a little data preparation is required
(iii) The cost of using a tree is logarithmic in the number of data points used to train the tree
(iv) The decision tree model can handle both numerical and categorical data

The following are the disadvantages of the decision tree method:
(i) The model can create overly complex trees that do not generalize well
(ii) Decision trees can be unstable, and small changes in the data can lead to a completely different tree
(iii) The optimal decision tree learning problem is NP-complete in terms of several aspects of optimality and even for simple concepts. Therefore, practical algorithms for learning decision trees are based on heuristic algorithms, such as the greedy algorithm, in which locally optimal decisions are made at each node
(iv) It is recommended that the dataset be balanced before fitting to the decision tree
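The greedy splitting step described in this subsection can be illustrated with a short sketch. It assumes the squared deviation as the loss function, so the informativeness criterion of a node is the variance of its target values, and it scores a split by the size-weighted criterion of the two children (smaller is better). The function names and the toy data are hypothetical, and only a single feature is searched.

```python
def variance_impurity(targets):
    """Informativeness criterion for squared loss: the mean squared
    deviation from the best constant prediction, which is the mean."""
    if not targets:
        return 0.0
    mean = sum(targets) / len(targets)
    return sum((y - mean) ** 2 for y in targets) / len(targets)


def split_quality(left, right):
    """Size-weighted criterion of the two child nodes; smaller is better."""
    total = len(left) + len(right)
    return (len(left) / total) * variance_impurity(left) \
        + (len(right) / total) * variance_impurity(right)


def best_split(xs, ys):
    """Exhaustively search thresholds t for the rule 'if x <= t'."""
    pairs = sorted(zip(xs, ys))
    best_t, best_q = None, float("inf")
    for k in range(1, len(pairs)):
        if pairs[k - 1][0] == pairs[k][0]:
            continue  # equal feature values cannot be separated
        t = (pairs[k - 1][0] + pairs[k][0]) / 2
        left = [y for x, y in pairs if x <= t]
        right = [y for x, y in pairs if x > t]
        q = split_quality(left, right)
        if q < best_q:
            best_t, best_q = t, q
    return best_t, best_q


# Hypothetical data with an obvious jump between x = 3 and x = 4
threshold, quality = best_split([1, 2, 3, 4, 5, 6], [10, 11, 10, 50, 51, 50])
```

A full regression tree applies `best_split` recursively to each child until a stopping rule is met, which is where the risk of overly complex trees mentioned above arises.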

3.3. Support Vector Regression

The basis of the support vector machine for regression problems is the search for a hyperplane, in which the risk in a multidimensional space will be minimal [35].

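The support vector machine estimates the coefficients by solving a quadratic optimization problem with an $\varepsilon$-insensitive loss. If the predicted value falls within a tube of half-width $\varepsilon$ around the regression hyperplane, then the loss is zero. Otherwise, the loss equals the amount by which the absolute difference between the predicted and actual values exceeds $\varepsilon$.

In the support vector machine for the regression problem, it is necessary to evaluate the functional dependence of the dependent variable $y$ on the set of independent variables $x$. To do this, the relationship between the independent and dependent variables is determined by a deterministic function $f$ and the addition of additive noise:

$$y = f(x) + \text{noise}.$$

In this case, it is necessary to find a functional form for $f$ that can correctly predict new values. Functional dependence is sought by training the model on a sample population. In this study, we determined the error function by the formula:

$$\frac{1}{2} w^{T} w + C \sum_{i=1}^{N} \xi_i + C \sum_{i=1}^{N} \xi_i^{*},$$

where $w$ is the coefficient vector, $C$ is the regularization constant, $\xi_i$ and $\xi_i^{*}$ are slack variables, and $N$ is the size of the training sample.

The function is minimized under the condition:

$$w^{T} \phi(x_i) + b - y_i \le \varepsilon + \xi_i^{*}, \quad y_i - w^{T} \phi(x_i) - b \le \varepsilon + \xi_i, \quad \xi_i, \xi_i^{*} \ge 0, \quad i = 1, \dots, N,$$

where $\phi$ is the feature map and $b$ is the bias term.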

Advantages of the support vector machine are as follows:
(i) The principle of the optimal separating hyperplane leads to the maximization of the width of the separating strip
(ii) The support vector machine is equivalent to a two-layer neural network in which the number of neurons in the hidden layer is determined automatically as the number of support vectors
(iii) The convex quadratic programming problem is well-studied and has a unique solution

Disadvantages of the support vector machine are as follows:
(i) There is no feature selection in the method
(ii) The regularization constant $C$ must be selected using cross-validation
(iii) Outliers in the initial data become support vectors that violate the margin and directly affect the construction of the separating hyperplane
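The loss behavior of support vector regression can be illustrated with a small sketch. It assumes a linear model and the common ε-insensitive formulation, in which the sum of slack variables at the optimum equals the total amount by which residuals exceed ε; the objective is therefore evaluated in its equivalent unconstrained form. The function names, the constants `c` and `eps`, and the data are hypothetical, and the snippet only evaluates the objective rather than solving the quadratic program.

```python
def dot(x, w):
    """Scalar product of a feature vector and a coefficient vector."""
    return sum(xj * wj for xj, wj in zip(x, w))


def epsilon_insensitive(residual, eps):
    """Zero inside the tube of half-width eps, linear outside it."""
    return max(0.0, abs(residual) - eps)


def svr_objective(w, b, xs, ys, c, eps):
    """0.5 * ||w||^2 + C * (total epsilon-insensitive loss)."""
    regularizer = 0.5 * sum(wj * wj for wj in w)
    total_slack = sum(
        epsilon_insensitive(y - (dot(x, w) + b), eps) for x, y in zip(xs, ys)
    )
    return regularizer + c * total_slack


# Hypothetical one-feature data: two points inside the tube, one outlier
xs = [[1.0], [2.0], [3.0]]
ys = [2.05, 4.0, 7.0]
value = svr_objective(w=[2.0], b=0.0, xs=xs, ys=ys, c=1.0, eps=0.1)
```

Only the third point contributes to the loss here, which mirrors the remark above that outliers end up among the support vectors; increasing `c` makes such points weigh more heavily against the regularizer.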

3.4. Models’ Performance Evaluation Metrics

We used the following metrics to evaluate models’ performance.

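Mean absolute error (MAE) is a measure of errors between paired observations expressing the same phenomenon:

$$\mathrm{MAE} = \frac{1}{n} \sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|,$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value, and $n$ is the number of observations.

Relative absolute error (RAE) is expressed as a ratio, comparing a mean error to errors produced by a trivial or naïve model:

$$\mathrm{RAE} = \frac{\sum_{i=1}^{n} \left| \hat{y}_i - y_i \right|}{\sum_{i=1}^{n} \left| \bar{y} - y_i \right|},$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value, $\bar{y}$ is the average of the observed values (the prediction of the naïve model), and $n$ is the number of observations.

Mean absolute percentage error (MAPE) is a measure of prediction accuracy, which expresses the accuracy as a ratio defined by the formula:

$$\mathrm{MAPE} = \frac{100\%}{n} \sum_{i=1}^{n} \left| \frac{y_i - \hat{y}_i}{y_i} \right|,$$

where $\hat{y}_i$ is the predicted value, $y_i$ is the observed value, and $n$ is the number of observations.

As an accuracy metric, we used the difference of MAPE from 100%:

$$\mathrm{Accuracy} = 100\% - \mathrm{MAPE}.$$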
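The metrics of this subsection can be computed with a few lines of code. The sketch below is illustrative: the function names and the sample values are hypothetical, and the naïve baseline in RAE is taken to be the mean of the observed values, which is the common convention and an assumption here.

```python
def mae(observed, predicted):
    """Mean absolute error."""
    return sum(abs(p - y) for y, p in zip(observed, predicted)) / len(observed)


def rae(observed, predicted):
    """Relative absolute error against a naive mean-of-observations model."""
    mean_observed = sum(observed) / len(observed)
    model_error = sum(abs(p - y) for y, p in zip(observed, predicted))
    naive_error = sum(abs(mean_observed - y) for y in observed)
    return model_error / naive_error


def mape(observed, predicted):
    """Mean absolute percentage error, in percent.
    Observed values must be nonzero, otherwise the ratio is undefined."""
    return 100.0 * sum(
        abs((y - p) / y) for y, p in zip(observed, predicted)
    ) / len(observed)


def accuracy(observed, predicted):
    """Accuracy as the difference of MAPE from 100%."""
    return 100.0 - mape(observed, predicted)


# Hypothetical observed and predicted case counts
observed = [100, 200, 400]
predicted = [110, 190, 380]
```

Because MAPE divides by the observed value, days with zero registered cases (which can occur in daily mortality series) need to be excluded or handled separately before this metric is applied.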

4. Results

The models were implemented in the Python programming language. An experimental investigation of the models was carried out on four types of data provided by the World Health Organization Coronavirus Dashboard [35]: daily new cases, daily fatal cases, cumulative new cases, and cumulative fatal cases. We used data for Germany, Japan, South Korea, and Ukraine. These countries were selected due to the different nature of the dynamics of the epidemic process, various control measures implemented by governments, and various social factors influencing the dynamics of COVID-19. The forecast was calculated for 3, 7, 14, 21, and 30 days. For Germany, Japan, and South Korea, the forecasting period was from August 1, 2022, to August 30, 2022, and for Ukraine, from January 25, 2022, to February 23, 2022. The earlier period for Ukraine was chosen because of the full-scale Russian invasion of Ukraine, which affected the dynamics of the epidemic process of infectious diseases, including COVID-19.
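The daily and cumulative series used as input are two views of the same data, and one can be derived from the other. The following sketch shows the conversion in both directions; the function names and the numbers are hypothetical, and it assumes the series starts at the first reported day, so that the first cumulative value equals the first daily value.

```python
from itertools import accumulate


def to_cumulative(daily):
    """Running total of a daily series."""
    return list(accumulate(daily))


def to_daily(cumulative):
    """Day-to-day differences of a cumulative series."""
    if not cumulative:
        return []
    return [cumulative[0]] + [
        later - earlier for earlier, later in zip(cumulative, cumulative[1:])
    ]


# Hypothetical daily counts with a zero-report day
daily_cases = [5, 0, 12, 7]
cumulative_cases = to_cumulative(daily_cases)
```

The cumulative series is monotone and absorbs reporting gaps, whereas the daily series exposes them directly, which is one reason the two input types can lead to different forecasting errors.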

Historical morbidity and mortality data available before the forecast periods were utilized to train the machine learning models. This ensured that the models were well-acquainted with the past trends and patterns of the disease’s spread in the respective countries. The forecast period, on the other hand, was exclusively reserved for testing the models’ predictions. This approach maintained a clear demarcation between training and testing data, ensuring the integrity and validity of the model evaluation process.
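The separation between training and testing data described above can be sketched as a chronological split. This is a minimal illustration, not the study's code: the function names are hypothetical, the series is synthetic, and the series is assumed to be a list of (date, value) pairs sorted by date.

```python
from datetime import date, timedelta

HORIZONS = (3, 7, 14, 21, 30)  # forecast lengths in days


def chronological_split(series, forecast_start, max_horizon=30):
    """Everything before forecast_start is training data; the next
    max_horizon days are reserved exclusively for testing."""
    forecast_end = forecast_start + timedelta(days=max_horizon - 1)
    train = [(d, v) for d, v in series if d < forecast_start]
    test = [(d, v) for d, v in series if forecast_start <= d <= forecast_end]
    return train, test


def horizon_slices(test, horizons=HORIZONS):
    """Test observations for each forecast horizon, counted from day 1."""
    return {h: test[:h] for h in horizons}


# Synthetic series from July 1 to September 10, 2022 (values are placeholders)
first_day = date(2022, 7, 1)
series = [(first_day + timedelta(days=k), k) for k in range(72)]
train, test = chronological_split(series, date(2022, 8, 1))
slices = horizon_slices(test)
```

With a forecast start of August 1, 2022, the 30-day test window ends on August 30, 2022, and no observation from that window appears in the training set.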

The forecasting results show the retrospectively forecasted dynamics of the COVID-19 epidemic process in the selected areas.

4.1. Forecasting Results Using a Logistic Regression Model

Figures 1, 2, 3, and 4 show the logistic regression forecasts of cumulative new cases, daily new cases, cumulative fatal cases, and daily fatal cases of COVID-19, respectively, in the selected areas.

4.2. Forecasting Results Using a Decision Tree Model

Figures 5, 6, 7, and 8 show the decision tree forecasts of cumulative new cases, daily new cases, cumulative fatal cases, and daily fatal cases of COVID-19, respectively, in the selected areas.

4.3. Forecasting Results Using a Support Vector Regression Model

Figures 9, 10, 11, and 12 show the support vector regression forecasts of cumulative new cases, daily new cases, cumulative fatal cases, and daily fatal cases of COVID-19, respectively, in the selected areas.

4.4. Performance of Logistic Regression Model

Table 1 shows MAE values of logistic regression models for confirmed cases of COVID-19 in selected territories.

Table 2 shows MAE values of logistic regression models for fatal cases of COVID-19 in selected territories.

Table 3 shows RAE values of logistic regression models for confirmed cases of COVID-19 in selected territories.

Table 4 shows RAE values of logistic regression models for fatal cases of COVID-19 in selected territories.

Table 5 shows MAPE values of logistic regression models for confirmed cases of COVID-19 in selected territories.

Table 6 shows MAPE values of logistic regression models for fatal cases of COVID-19 in selected territories.

4.5. Performance of Decision Tree Model

Table 7 shows MAE values of decision tree models for confirmed cases of COVID-19 in selected territories.

Table 8 shows MAE values of decision tree models for fatal cases of COVID-19 in selected territories.

Table 9 shows RAE values of decision tree models for confirmed cases of COVID-19 in selected territories.

Table 10 shows RAE values of decision tree models for fatal cases of COVID-19 in selected territories.

Table 11 shows MAPE values of decision tree models for confirmed cases of COVID-19 in selected territories.

Table 12 shows MAPE values of decision tree models for fatal cases of COVID-19 in selected territories.

4.6. Performance of Support Vector Regression Model

Table 13 shows MAE values of support vector regression models for confirmed cases of COVID-19 in selected territories.

Table 14 shows MAE values of support vector regression models for fatal cases of COVID-19 in selected territories.

Table 15 shows RAE values of support vector regression models for confirmed cases of COVID-19 in selected territories.

Table 16 shows RAE values of support vector regression models for fatal cases of COVID-19 in selected territories.

Table 17 shows MAPE values of support vector regression models for confirmed cases of COVID-19 in selected territories.

Table 18 shows MAPE values of support vector regression models for fatal cases of COVID-19 in selected territories.

5. Discussion

The emerging virus SARS-CoV-2, which humanity learned about in December 2019, quickly spread around the globe, causing the COVID-19 pandemic. During the three years of the pandemic, the disease has claimed more than 6.5 million lives, and more than 663 million cases have been registered worldwide. Each country has chosen its own tactics in the fight against COVID-19. The measures included isolation and treatment of patients, wearing masks in crowded places, and physical distancing. Effective vaccines produced using various technologies were developed and rolled out relatively quickly. However, despite this, vaccination coverage against COVID-19 among the population of different countries has still not reached the required level, which has not allowed the circulation of the pathogen among the population to be stopped. The proportion of people vaccinated with one, two, and three doses differs significantly between countries; low vaccination coverage creates conditions for the selection of new strains of the virus with new mutations and makes it difficult to fight the infection [36].

Mathematical models have become a good tool for predicting the development of the COVID-19 pandemic and helping to make adequate management decisions to contain the pandemic. Various models have been developed [37–39]. However, each of them has some drawbacks, such as the impossibility of taking into account all the factors that negatively or positively affect the development of the COVID-19 epidemic process, the heterogeneity of the human population, and differences in the structure of the population in different territories.

We have built three models based on machine learning to predict the dynamics of the spread of COVID-19: logistic regression, decision tree, and support vector regression. The forecast was calculated for 3, 7, 14, 21, and 30 days. The timing of the forecast was not chosen by chance. If a sharp deterioration in the epidemic situation, that is, an increase in morbidity and mortality from COVID-19, is predicted for day 30, preventive measures must be corrected as soon as possible. Intermediate forecasts for 3, 7, 14, and 21 days make it possible to control the adequacy of the tactics for preventing the incidence and containing the pathogen’s spread. In addition, the weekly interval makes it possible to smooth out fluctuations in the number of registered cases associated with fewer people seeking medical care on weekends and holidays and a sharp increase in case registration immediately after the weekend. Forecasting for 3 days will show the trend in the dynamics of the epidemic process but will not reflect the changes associated with introducing additional preventive measures.
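The weekly reporting pattern mentioned above can be illustrated with a trailing seven-day average. This is only a sketch of why week-long intervals cancel the weekend effect; a moving average is an assumption introduced for illustration and is not stated to be part of the models, and the function name and the numbers are hypothetical.

```python
def rolling_mean(values, window=7):
    """Trailing moving average; the first window - 1 days are skipped
    because a full window is not yet available for them."""
    if window <= 0:
        raise ValueError("window must be positive")
    return [
        sum(values[k - window + 1:k + 1]) / window
        for k in range(window - 1, len(values))
    ]


# Hypothetical series: stable weekday reporting with a drop on weekends
reported = [100, 100, 100, 100, 100, 30, 30] * 3
smoothed = rolling_mean(reported)
```

Every seven-day window contains exactly one full reporting cycle, so the smoothed series is flat even though the raw series oscillates.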

For the analysis of models, countries with different cultures, medical care organization, surveillance, chosen tactics to combat the COVID-19 pandemic, and other factors influencing the development of the pandemic were selected. We chose four countries: Germany, Japan, South Korea, and Ukraine. For the first three countries, the forecast was built for the period from August 1, 2022, to August 30, 2022, and for Ukraine, from January 25, 2022, to February 23, 2022. For Ukraine, it is impossible to check the accuracy of a forecast for August because the full-scale Russian invasion led to the damage and destruction of hospitals, a decrease in the number of medical personnel, and limited access to medical care for the population; in addition, part of the territory of Ukraine was occupied, which did not allow registering the incidence in these territories.

In our analysis, a noteworthy discrepancy emerged in the accuracy of the forecast data for Japan when juxtaposed with that of Germany and South Korea. Several factors underpin this observed variation.

Firstly, the healthcare infrastructure and reporting mechanisms differ across countries. Germany and South Korea have been globally recognized for their robust healthcare systems and efficient disease surveillance mechanisms. Their rapid response to the pandemic, extensive testing, and meticulous data recording contributed to a more consistent and comprehensive dataset, facilitating more accurate forecasting.

Conversely, Japan, while having a sophisticated healthcare system, faced challenges in its initial response to the pandemic. The country experienced periods of underreporting, potentially due to limited testing capacities in the early stages and specific administrative bottlenecks. Such inconsistencies in data collection can introduce noise into the dataset, making it more challenging for machine learning models to discern underlying patterns and make accurate predictions.

Furthermore, sociocultural factors played a role. Japan’s unique societal norms, including its densely populated urban centers and specific public health communication strategies, influenced the dynamics of the disease’s spread in ways that diverged from patterns observed in Germany and South Korea.

Finally, it is essential to consider the potential influence of viral strains. Different regions might have been affected by different strains of the virus at different times, each with its own transmission dynamics. If Japan was predominantly affected by a strain with different transmission characteristics during the forecast period, this could have introduced additional variability into the predictions.

Table 19 shows the accuracy of all models of COVID-19 in the selected territories for 30 days, depending on the type of input data.

The support vector regression model shows the highest accuracy for all datasets. At the same time, all models show their highest accuracy on the data from Germany and South Korea. This suggests more complete testing and registration of a higher percentage of actual incidence in these countries than in Japan and Ukraine.

The model analysis showed that using cumulative case and death data as input increased the accuracy of the models. At first glance this is attractive and may lead to the misconception that data on daily new cases and deaths are unnecessary. However, evaluation with the mean absolute error (MAE) shows a much smaller absolute error for daily data. Based on the data obtained, we conclude that models should be built on the entire set of available data, both daily and cumulative, giving preference to cumulative data during periods full of holidays and weekends and to daily data in other periods. To reduce the absolute error, databases should be formed from daily morbidity and mortality.
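One possible reading of this contrast can be sketched with synthetic numbers. The example below assumes that accuracy is computed as 100 minus MAPE and uses a hypothetical forecast that overestimates daily cases by 10%; the values are illustrative and are not the study's data. Expressed on the cumulative scale, the same forecast has a very small percentage error, because the denominator is the large running total, while its absolute error accumulates from day to day.

```python
from itertools import accumulate
from typing import Sequence


def mae(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute error, in cases."""
    return sum(abs(a - p) for a, p in zip(actual, predicted)) / len(actual)


def mape(actual: Sequence[float], predicted: Sequence[float]) -> float:
    """Mean absolute percentage error, in percent."""
    return 100.0 * sum(abs(a - p) / abs(a) for a, p in zip(actual, predicted)) / len(actual)


# Synthetic illustration only: 1,000,000 cases registered before the forecast
# window, 1,000 true new cases per day, and a forecast that is 10% too high.
prior_total = 1_000_000.0
daily_actual = [1000.0] * 30
daily_pred = [1100.0] * 30

# The same series expressed as cumulative totals.
cum_actual = [prior_total + s for s in accumulate(daily_actual)]
cum_pred = [prior_total + s for s in accumulate(daily_pred)]

daily_accuracy = 100.0 - mape(daily_actual, daily_pred)  # assumed accuracy definition
cum_accuracy = 100.0 - mape(cum_actual, cum_pred)
daily_mae = mae(daily_actual, daily_pred)
cum_mae = mae(cum_actual, cum_pred)

print(f"daily:      accuracy {daily_accuracy:.2f}%, MAE {daily_mae:.0f} cases")
print(f"cumulative: accuracy {cum_accuracy:.2f}%, MAE {cum_mae:.0f} cases")
```

Under these assumptions the cumulative series scores higher on percentage accuracy but lower on MAE, which is why the two metrics should be read together.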

The intricate relationship between machine learning and the available data forms the bedrock of our research endeavors. At its core, machine learning thrives on data; the quality, granularity, and comprehensiveness of this data directly influence the efficacy of the predictive models [40]. In the context of our study, the data sourced from the World Health Organization Coronavirus Dashboard served as the empirical foundation upon which our models were trained, validated, and tested.

Implementing a forecasting system for a phenomenon as dynamic and multifaceted as the COVID-19 pandemic presents a unique set of challenges distinct from conventional forecasting endeavors. Traditional forecasting models often rely on stable, predictable patterns. In contrast, the COVID-19 pandemic, influenced by many sociopolitical, environmental, and biological factors, exhibits a level of volatility that demands a more adaptive and nuanced modeling approach [41]. Our machine learning models, particularly the support vector regression, were designed to navigate this volatility, learning from the intricacies of the data to make robust predictions.

Looking ahead, the field of COVID-19 forecasting is poised to encounter several challenges. The emergence of new viral strains, changing vaccination rates, and evolving public health measures can introduce unforeseen complexities into the data. By emphasizing the importance of daily and cumulative data, our research offers a blueprint for addressing some of these challenges. By ensuring that our models are trained on comprehensive datasets that capture the full spectrum of the pandemic’s dynamics, we enhance their adaptability and resilience against future uncertainties [42].

Our study contributes significantly to the broader discourse on the role of machine learning in healthcare systems. By demonstrating the potential of machine learning models to make accurate short-term forecasts in the context of a global pandemic, we underscore the transformative potential of these technologies in public health decision-making. As healthcare systems worldwide grapple with the challenges of the 21st century, from pandemics to chronic diseases, the integration of machine learning tools, as evidenced by our research, will be pivotal in driving innovation, efficiency, and improved patient outcomes [43].

The salient observation from our research underscores the differential impact of accumulated holiday data versus daily data during weekdays on the predictive accuracy of our models. While evident in our results, this distinction warrants a more in-depth exploration to elucidate the underlying mechanisms that contribute to this phenomenon.

One plausible hypothesis is that during holidays and weekends, there is a marked reduction in the number of individuals seeking medical attention, leading to the potential underreporting of cases. This underreporting can introduce noise into the data, making daily figures during these periods less reliable. By accumulating data over such periods, we might mitigate this noise’s effects, thereby enhancing the model’s predictive capabilities. Conversely, on regular weekdays, when medical facilities operate at their usual capacity and individuals are more likely to seek medical care, daily data provides a more granular and accurate representation of the disease’s spread.
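The hypothesis can be illustrated with a small synthetic sketch. It assumes a hypothetical reporting pattern in which half of the weekend cases are registered only on the following Monday; the function names, the delay share, and the case counts are assumptions made for illustration, not values estimated from the data. Daily figures then deviate strongly from the true incidence, while totals accumulated over each seven-day block do not.

```python
from typing import List, Sequence


def report_with_weekend_delay(
    true_daily: Sequence[float],
    first_weekday: int = 5,      # 0 = Monday, ..., 5 = Saturday, 6 = Sunday
    weekend_share: float = 0.5,  # assumed share of weekend cases registered late
) -> List[float]:
    """Shift a share of Saturday and Sunday cases to the next working day."""
    reported, backlog = [], 0.0
    for i, cases in enumerate(true_daily):
        weekday = (first_weekday + i) % 7
        if weekday >= 5:  # weekend: part of the cases is not registered yet
            delayed = cases * weekend_share
            reported.append(cases - delayed)
            backlog += delayed
        else:  # working day: the accumulated backlog is registered
            reported.append(cases + backlog)
            backlog = 0.0
    return reported


def block_sums(series: Sequence[float], size: int = 7) -> List[float]:
    """Totals accumulated over consecutive blocks of `size` days."""
    return [sum(series[i:i + size]) for i in range(0, len(series), size)]


# Synthetic illustration only: four weeks, Saturday to Friday, 700 true cases per day.
true_daily = [700.0] * 28
reported_daily = report_with_weekend_delay(true_daily)

max_daily_dev = max(abs(r - t) / t for r, t in zip(reported_daily, true_daily))
weekly_true = block_sums(true_daily)
weekly_reported = block_sums(reported_daily)

print(f"largest daily deviation from true incidence: {100 * max_daily_dev:.0f}%")
print(f"weekly totals match: {weekly_reported == weekly_true}")
```

Each block here runs from Saturday to Friday so that the delayed cases are registered within the same block; with a different block alignment a small residual would carry over into the next week.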

Furthermore, the implications of this observation extend beyond the realm of academic interest. In practical terms, understanding the nuances of data collection and its impact on model accuracy can significantly influence how healthcare systems approach data-driven decision-making. For instance, policymakers and healthcare administrators could prioritize the collection of cumulative data during holiday-rich periods and place greater emphasis on daily data during regular operational days. This tailored approach to data collection, driven by our findings, could potentially enhance the accuracy of future predictive models, leading to more informed and effective epidemic control measures.

Moreover, while our study has shed light on this particular aspect of data utilization, it also underscores the broader need for a holistic approach to model development in healthcare. It is not merely about selecting the right algorithm or having vast amounts of data; it is about understanding the intricacies of the data, the context in which it is collected, and the myriad factors that can influence its quality and reliability. Only by addressing these nuances can we truly harness the power of machine learning in the service of public health.

6. Conclusions

The article describes the results of an experimental study of three models for predicting the dynamics of COVID-19 based on statistical machine learning methods: logistic regression, decision tree, and support vector regression. For the experiments, data on the incidence and mortality of COVID-19 in Germany, Japan, South Korea, and Ukraine, provided by the World Health Organization COVID-19 Dashboard, were used.

All developed models have shown sufficient accuracy for use in healthcare practice for the development and implementation of control measures to curb the spread of infectious diseases.

The prediction accuracy of the logistic regression model ranged from 96.78% to 99.95% for morbidity and from 98.11% to 99.99% for fatal cases. The accuracy of the decision tree model ranged from 96.78% to 99.98% for morbidity and from 97.39% to 99.99% for lethal cases. The accuracy of the support vector regression model ranged from 99.65% to 99.99% for morbidity and from 95.65% to 99.99% for lethal cases.

At the same time, the analysis of model indicators for all data showed that the most accurate model is the one based on the support vector regression method. The model analysis also showed that using cumulative case and death data as input increased the accuracy of the models. At first glance this is attractive and may lead to the misconception that data on daily new cases and deaths are unnecessary. However, evaluation with the mean absolute error (MAE) shows a much smaller absolute error for daily data.

It should be concluded that to build models, it is necessary to use the entire set of available data, both daily and cumulative, giving preference to cumulative data during periods full of holidays and weekends and daily data in other periods. To reduce the absolute error, it is necessary to form databases based on daily morbidity and mortality.

The scientific novelty of the research lies in the development of COVID-19 predictive models based on statistical machine learning methods. Unlike other studies, the article analyzes model performance across different forecasting periods and with different types of input data (cumulative and daily). The results of the analysis will increase the effectiveness of machine learning models of infectious diseases in healthcare systems.

Data Availability

The initial data used in this research are publicly available in the World Health Organization (WHO) COVID-19 Dashboard (https://covid19.who.int/) (accessed on 25 May 2023).

Conflicts of Interest

The authors declare that they have no financial or nonfinancial competing interests.

Authors’ Contributions

D.C. was responsible for the conceptualization. D.C. was responsible for the methodology. D.C. and T.D. were responsible for the software. D.C., T.D., S.Y., and T.C. were responsible for the validation. D.C. and T.C. were responsible for the formal analysis. D.C. and S.Y. were responsible for the investigation. D.C. was responsible for the resources. D.C. and T.D. were responsible for the data curation. D.C., T.D., and T.C. were responsible for the writing—original draft preparation. S.Y. was responsible for the writing—review and editing. D.C. and T.D. were responsible for the visualization. D.C. was responsible for the supervision. S.Y. was responsible for the project administration. D.C. was responsible for the funding acquisition. All authors have read and agreed to the published version of the manuscript.

Acknowledgments

We would like to thank the Armed Forces of Ukraine for providing security to perform this work. This work has become possible only because of the resilience and courage of the Ukrainian Army and people. The study was funded by the National Research Foundation of Ukraine in the framework of the research project 2020.02/0404 on the topic “Development of intelligent technologies for assessing the epidemic situation to support decision-making within the population biosafety management.”