Abstract

Understanding the factors that influence COVID-19 transmission is essential in assessing and mitigating the spread of the pandemic. This study focuses on modeling the impact of air pollution and meteorological parameters on the risk of COVID-19 transmission in Western Cape Province, South Africa. The data used in this study consist of air pollution parameters, meteorological variables, and COVID-19 incidence observed for 262 days from April 26, 2020, to January 12, 2021. Lagged data were prepared for modeling based on a 6-day incubation period for COVID-19 disease. Based on the overdispersion property of the incidence, negative binomial (NB) and generalised Poisson (GP) regression models were fitted. Stepwise regression was used to select the significant predictors in both models based on the Akaike information criterion (AIC). The residuals of both NB and GP regression models were autocorrelated. An autoregressive integrated moving average (ARIMA) model was fitted to the residuals of both models. ARIMA (7, 1, 5) was fitted to the residuals of the NB model while ARIMA (1, 1, 6) was fitted to the residuals of the GP model. NB + ARIMA (7, 1, 5) and GP + ARIMA (1, 1, 6) models were tested for performance using root mean square error (RMSE). GP + ARIMA (1, 1, 6) was selected as the optimal model. The results from the optimal model suggest that minimum temperature, ambient relative humidity, ambient wind speed, , and at various lags are positively associated with COVID-19 incidence while maximum relative humidity, minimum relative humidity, solar radiation, maximum temperature, NO, PM load, , , and at various lags have a negative association with COVID-19 incidence. Ambient wind direction and temperature showed a nonsignificant association with COVID-19 at all lags. This study suggests that meteorological and pollution parameters play a vital independent role in the transmission of the SARS-CoV-2 virus.

1. Introduction

1.1. Background Information

It has been more than two years since the first COVID-19 case was reported in late December 2019 in Wuhan, China [1]. On January 30, 2020, the International Committee on Taxonomy of Viruses (ICTV) recognized the Severe Acute Respiratory Syndrome Coronavirus 2 (SARS-COV-2) as the responsible causative agent of the disease [2] and on March 11, 2020, the World Health Organisation (WHO) declared it an international public health emergency due to the rapid infections by the virus [3]. This disease is transmitted from one person to another through close contact droplets generated by an infected person when coughing, sneezing, or speaking in closed setups [4].

Dry cough, loss of smell, fever, and fatigue are some of the most common symptoms. Breathing problems, chest pain or pressure, shortness of breath, and loss of speech or movement are some of the more serious symptoms [5]. Patients with minor symptoms have mostly been treated effectively, whereas serious patients required intensive hospitalization and respiratory ventilation [6]. According to the Johns Hopkins viral dashboard, as of April 12, 2022, there are 445,504,399 confirmed cases, 378,505,653 recoveries, and 6,015,762 fatalities globally [7].

In Africa, the first reported case of COVID-19 was on February 14, 2020, in Egypt, followed by one on February 27, 2020, in Nigeria [8], almost two months after the first reported case in China. In the early phase of the pandemic, most of the reported cases in African countries were imported. As the pandemic progressed, local transmission surpassed imported cases, and the doubling time shortened [9]. Currently, almost all new COVID-19 cases in Africa arise from community transmission, and there are 11,631,795 confirmed cases, 10,785,146 recoveries, and 251,749 deaths in Africa, as stated by the Johns Hopkins virus dashboard [7].

Previous studies have shown that meteorological and pollution parameters affect the spread and survival of several viruses [10]; for example, ambient temperature and relative humidity have shown an inverse association with the influenza infection rate in Japan [11]. The authors in [12] found that the SARS-CoV-2 virus has a seasonal oscillation of outbreak, suggesting a strong link between climatic conditions and virus transmission, and the authors in [13] showed that the spread of SARS-CoV-2 is higher in winter than in summer.

Exposure to air pollution is considered the cause of several diseases and deaths around the globe [14]; therefore, it is of great relevance to investigate its impact on COVID-19 transmission. Studies of SARS-CoV-1 have shown that air pollution can facilitate the spread of the virus and increase its persistence in the atmosphere [15]. In the United States, a study provided evidence that long-term exposure to a high concentration of particulate matter with an aerodynamic diameter of less than 2.5 microns (PM2.5) increases the risk of mortality [16].

The authors in [17] sampled aerosols (air pollutants) from various locations to evaluate the aerodynamic characteristics of SARS-CoV-2. According to their findings, high concentrations of viral ribonucleic acid (RNA) were found in submicron aerosols, particularly in the intensive care unit (ICU) rooms of a Wuhan hospital. Long-term air-quality data were found to be significantly associated with incidences of COVID-19 in a study of 71 Italian provinces [5]. Another study found that long-term exposure to air pollution is associated with a variety of negative health outcomes, including greater fatality rates and hospital admissions [18].

Air pollution has been found to play an important role in the spread of infectious diseases. Poor air quality, for instance, has been attributed to the spread of severe acute respiratory syndrome diseases [19] and hence respiratory disorders such as asthma, chronic obstructive pulmonary disease (COPD), and lung cancer [20]. Despite the fact that there is substantial evidence to suggest pollution parameters play a crucial role in COVID-19 transmission, the number of data-driven studies investigating the association between air pollution parameters and COVID-19 transmission is still limited. It is therefore crucial to identify the key risk factors that influence COVID-19 transmission.

Zhang et al. [21] explored the correlation between meteorological factors and SARS-CoV-2 transmission. According to their findings, precipitation, humidity, wind speed, and temperature all play a role in the propagation of the SARS-CoV-2 virus. As a result, a better understanding of the effects of meteorological factors on the propagation and survival of the SARS-CoV-2 virus could be crucial in guiding the COVID-19 pandemic response.

COVID-19 severity in African countries has differed significantly from that in China, Europe, and other parts of the world due to diverse factors, spanning from demographical, epidemiologic, socioeconomic, and environmental implications [22, 23]. Since these parameters drive COVID-19’s evolution, recent studies have focused on determining how each of these factors influences the virus’s local transmission. Some studies have suggested that environmental factors, particularly meteorological and pollutant parameters, influence COVID-19 transmission [24].

The possible role of meteorological variables in COVID-19 transmission was investigated by Diouf et al. [25] in 16 nations in West and North Africa, covering three climatic regions: the Sahel, the Maghreb, and the Gulf of Guinea. The Kendall nonlinear rank test and the Spearman rank correlation test were utilized. The findings revealed a statistically significant negative association between COVID-19-confirmed cases and temperature in the Maghreb and the Gulf of Guinea, whereas positive associations were found across the Sahel. Positive associations with specific humidity were reported over the Sahel and the Gulf of Guinea, whereas negative associations were found over the Maghreb.

The role of meteorological variables and pollutants in the transmission of COVID-19 during the harmattan season in equatorial Africa was examined by Ogunjo et al. [26]. They studied the link between meteorological factors and air pollutants with COVID-19 incidence in seven Nigerian areas using Spearman and Pearson tests. COVID-19 incidence instances were shown to be highly associated with meteorological variables, according to the findings. In several provinces, temperature and humidity had a negative association with daily COVID-19 incidences. Their impact on COVID-19 transmission, however, was less than that of particulate matter. The COVID-19 incidence had a positive association with particulate matter, but a negative association with relative humidity.

In Dhaka, Bangladesh, Islam et al. [20] explored the correlation between COVID-19, air quality, and meteorological variables using the Spearman correlation test. The association between COVID-19 incidence and the covariates was also assessed using a generalised additive model (GAM) and a multiple linear regression (MLR) model. Particulate matter (), carbon dioxide (CO2), and ozone (O3) had a strong negative association with daily COVID-19 incidence. However, there was no significant association between COVID-19 incidence and nitrogen dioxide (NO2). Some meteorological variables, on the other hand, were found to have a substantial association with COVID-19 incidence. Relative humidity was shown to have a significant positive association, while atmospheric pressure was found to have a significant negative association. These results were consistent with the results of the authors in [27–29].

Jiang et al. [30] focused on investigating the effect of ambient air pollutants and meteorological variables on COVID-19 incidence in four cities in China. The study integrated both multivariate Poisson regression and time series analysis to understand the correlation of the variables with COVID-19 cases. It was shown that the particulate matter () and relative humidity were substantially associated with an increased risk of COVID-19 while particulate matter () and temperature were substantially associated with a decrease in the risk of COVID-19. These results are similar to those of a study conducted by Liang et al. [31] on the association between human influenza cases and particulate matter () concentrations.

Lolli et al. [5] conducted a study to identify the impact of climate and air pollution on COVID-19 transmission in Italy. They used nonlinear Spearman and Kendall rank correlation tests to investigate how climatic and air pollution parameters were related to COVID-19 transmission in Milan and Florence, two important urban centers in Northern Italy. The study’s major findings suggested that virus transmission is negatively associated with relative humidity and temperature.

In five Indian cities, the authors in [32] undertook an exploratory study to look into the association between meteorological factors and ambient air pollution, with SARS-COV-2 transmission and fatality rates. They used Spearman and Kendall rank correlation at 0.01 level of significance. The study’s main findings showed that particulate matter was positively associated with COVID-19 incidence. It also backed up the theory that particulate matter above a certain threshold increases the likelihood of SARS-COV-2 transmission and mortality.

The impact of meteorological factors on the dynamics of the COVID-19 pandemic in Poland was investigated by the authors in [33]. The goal of this study was to investigate the association between COVID-19 dynamics and meteorological factors (relative humidity, temperature, sunshine duration, and wind speed) in Poland. The methods used were cross-correlation function, principal component analysis, and random forests. The results revealed that maximum temperature, relative humidity, sunshine duration, and mean daily temperature variability had a positive association with COVID-19 incidence.

In the case of the Western Cape, there is a dearth of research focusing on modeling the relationship between meteorological factors and pollution parameters with COVID-19 transmission; therefore, this study aims to address the uncertainty surrounding the transmission of COVID-19 by investigating the potential influence of ambient air pollutants and meteorological parameters, considering the unique context of Western Cape Province, South Africa, where high infection rates have been recorded [34]. The overarching objective is to develop a comprehensive model that encompasses both meteorological and air pollution factors and their impact on COVID-19 transmission in this region. Specifically, the study seeks to determine if COVID-19 cases in the Western Cape exhibit overdispersion and discern any significant monotonic trends, develop a predictive model to estimate COVID-19 incidence based on these variables, and utilize the model to analyze the association between COVID-19 incidence and meteorological variables and air pollutants in Western Cape. This research endeavor holds the potential to inform critical public health policies and provide insights that can be extrapolated to other regions with similar conditions and perhaps even extended to the study of other infectious diseases.

2. Data and Methods

2.1. Study Area

The Western Cape is a province in South Africa’s southwestern region, bordering both the Indian and Atlantic oceans. Figure 1 below shows an overview of the Western Cape Province in South Africa. The Western Cape Province is located at a longitude of approximately 21.86°E and a latitude of about 33.23°S. It is the fourth-largest of South Africa’s nine provinces, covering a total area of 129,449 km² and with a population of approximately 7 million people. This region is characterized by mountainous terrain and features a Mediterranean climate, with hot, dry summers and moderate, wet winters, and minimal summer rainfall along the coastline. The average annual precipitation is around 380 mm, with some areas in the northwest receiving as little as 125 mm. The average temperature is around 23 °C, making it the coldest region in South Africa. The primary economic activities in the Western Cape include transportation and industrialization, as documented by Wikipedia [36]. As of April 12, 2022, there have been 673,698 confirmed cases of COVID-19 in the Western Cape, resulting in 21,903 deaths, 647,671 recoveries, and 15,213 reinfections, as reported in [37].

2.2. Data

The data analyzed in this study comprised 19 variables, among which there are ten meteorological variables and seven pollution variables. The COVID-19 data were accessed from the COVID-19 (2019-nCoV) Data Repository [38]. The data consist of daily cases, recoveries, and associated deaths from April 26, 2020, to January 12, 2021. The choice of the study period was influenced by data availability. The meteorological and air pollution data were retrieved from the South African Weather Service air quality information system [39]. Table 1 below gives a detailed description of the variables in the datasets.

2.3. Dealing with Missing Values

The data used in this study contained missing values for the variable from July 10, 2020, to July 21, 2020. , NO, and variables contained missing values from September 23, 2020, to October 8, 2020. , , NO, and variables also recorded missing values from October 10, 2020, to November 26, 2020, with the largest number of missing values observed from December 31, 2020, to January 12, 2021, for all other variables except four out of eighteen variables. Missing values of the meteorological and pollution variables were imputed with random forests within the multivariate imputation by chained equations (MICE) framework. MICE uses a mean matching method to estimate missing values for continuous data and logistic regression to estimate missing values for binary data, using random draws from independent normal distributions centered on means predicted from random forests. In the case of the data used in this study, all variables were continuous. The imputation of missing values was performed using R version 4.2.0.

2.4. Hypothesis Tests
2.4.1. Overdispersion Cameron and Trivedi (CT) Test

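Cameron and Trivedi (1990) presented the CT test for detecting overdispersion in count data such as COVID-19 incidence. The null hypothesis of equidispersion, $\operatorname{Var}(y_i) = \lambda_i$, is tested against the alternative $\operatorname{Var}(y_i) = \lambda_i + \alpha\, g(\lambda_i)$, with $g(\lambda_i) = \lambda_i^2$ for the negative binomial model and $g(\lambda_i) = \lambda_i$ for the generalised Poisson model. A Poisson regression model should be estimated a priori in order to detect overdispersion in COVID-19 incidence at a specified level of significance. The Poisson regression model is defined as follows:

$$\ln(\lambda_i) = \beta_0 + \beta_1 x_{1i} + \beta_2 x_{2i} + \cdots + \beta_k x_{ki},$$

where $\lambda_i$ is the expected count, $\beta_0, \beta_1, \ldots, \beta_k$ are parameters to be determined, and $x_{1i}, x_{2i}, \ldots, x_{ki}$ are the predictor variables. To perform the CT test, the following steps are used.

Step 1. Formulate the null and alternative hypotheses: $H_0$: the data are equidispersed ($\alpha = 0$) versus $H_1$: the data are overdispersed ($\alpha > 0$).

Step 2. Estimate the auxiliary OLS regression model without the intercept. The fitted values $\hat{\lambda}_i$ of the first developed Poisson regression model are used to compute the dependent variable as follows:

$$z_i = \frac{(y_i - \hat{\lambda}_i)^2 - y_i}{\hat{\lambda}_i},$$

where $z_i$ is the dependent variable, $y_i$ is the observed value, and $\hat{\lambda}_i$ is the fitted value. The auxiliary model, which sets $\hat{\lambda}_i$ as its single predictor variable, is as follows:

$$z_i = \alpha \hat{\lambda}_i + u_i, \tag{3}$$

where $u_i$ is the error term.

Step 3. The coefficient of the predictor variable $\hat{\lambda}_i$ in (3) is then examined using Student’s $t$-test. When the $p$ value of the test exceeds the chosen level of significance, it is assumed that the data are equidispersed. In contrast, if the $p$ value falls below it, overdispersion is verified at the given level of significance [40].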
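The three steps can be illustrated with a minimal Python sketch. It is not the code used in the study (which was carried out in R); the function name `ct_overdispersion_test`, the toy counts, and the use of an intercept-only Poisson fit (whose fitted value is simply the sample mean) are assumptions made purely for illustration.

```python
import math

def ct_overdispersion_test(y, lam_hat):
    """Auxiliary no-intercept OLS of z on the fitted Poisson means.

    Returns the slope estimate and its t statistic.
    """
    # Step 2: dependent variable z_i = ((y_i - lam_i)^2 - y_i) / lam_i
    z = [((yi - li) ** 2 - yi) / li for yi, li in zip(y, lam_hat)]
    # No-intercept OLS slope: sum(lam * z) / sum(lam^2)
    sxx = sum(li * li for li in lam_hat)
    alpha_hat = sum(li * zi for li, zi in zip(lam_hat, z)) / sxx
    # Residual variance with one estimated coefficient
    resid = [zi - alpha_hat * li for zi, li in zip(z, lam_hat)]
    s2 = sum(r * r for r in resid) / (len(y) - 1)
    # Step 3: t statistic of the slope
    t_stat = alpha_hat / math.sqrt(s2 / sxx)
    return alpha_hat, t_stat

# Hypothetical daily counts (not the study data)
y = [2, 15, 0, 40, 7, 1, 22, 3, 60, 5]
# Assumption: an intercept-only Poisson model, so every fitted value is the mean
lam_hat = [sum(y) / len(y)] * len(y)

alpha_hat, t_stat = ct_overdispersion_test(y, lam_hat)
print(f"alpha_hat = {alpha_hat:.4f}, t = {t_stat:.3f}")
```

A positive slope with a large t statistic points towards overdispersion; the p value would then be read from Student's t distribution with n − 1 degrees of freedom.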

2.4.2. Ljung–Box Test

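The Ljung–Box test assesses the overall randomness of a series by testing for the absence of autocorrelation up to a specified number of lags. To perform the Ljung–Box test, the following procedure is used:

Step 1. Define the null and the alternative hypotheses as

$$H_0: \rho_1 = \rho_2 = \cdots = \rho_h = 0 \quad \text{versus} \quad H_1: \rho_k \neq 0 \ \text{for at least one } k \in \{1, \ldots, h\}.$$

Step 2. Compute the Ljung–Box test statistic defined by

$$Q = n(n+2) \sum_{k=1}^{h} \frac{\hat{\rho}_k^{2}}{n-k},$$

where $n$ is the sample size, $\hat{\rho}_k$ is the sample autocorrelation at lag $k$, $h$ is the number of lags, and $Q$ is the Ljung–Box test statistic.

Step 3. At a significance level $\alpha$, the null hypothesis is rejected if

$$Q > \chi^2_{1-\alpha,\,h},$$

where $\chi^2_{1-\alpha,\,h}$ is the percentage point function (the $(1-\alpha)$ quantile) of the chi-square distribution with $h$ degrees of freedom [41].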
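As a hedged illustration of the procedure, the sketch below computes the Ljung–Box statistic from first principles on a hypothetical, strongly alternating series. The function name `ljung_box` and the toy series are assumptions; the critical value 3.841 is the standard 95% quantile of the chi-square distribution with one degree of freedom.

```python
def ljung_box(x, h):
    """Ljung-Box statistic Q = n(n+2) * sum_{k=1..h} rho_k^2 / (n - k)."""
    n = len(x)
    mean = sum(x) / n
    dev = [v - mean for v in x]
    c0 = sum(d * d for d in dev)  # lag-0 sum of squares
    q = 0.0
    for k in range(1, h + 1):
        ck = sum(dev[t] * dev[t - k] for t in range(k, n))
        rho_k = ck / c0  # sample autocorrelation at lag k
        q += rho_k ** 2 / (n - k)
    return n * (n + 2) * q

# Hypothetical residual series that alternates in sign (clearly autocorrelated)
series = [1, -1] * 10
q_stat = ljung_box(series, h=1)

CHI2_95_DF1 = 3.841  # 95% quantile of chi-square with 1 degree of freedom
reject_h0 = q_stat > CHI2_95_DF1
print(f"Q = {q_stat:.2f}, reject H0: {reject_h0}")
```

For this series the lag-1 autocorrelation is −0.95, so the statistic is far above the critical value and the hypothesis of no autocorrelation is rejected.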

2.5. Modeling

Negative binomial regression and generalised Poisson regression models were used in this study. These models are used for count data that are overdispersed, that is, when the variance of the response variable exceeds the mean. The expected value of the response count is expressed as a linear combination of all other predictor variables.

2.5.1. Negative Binomial (NB) Regression Model

The negative binomial regression model, which is based on the Poisson–Gamma mixture distribution, is defined by the probability function

$$P(Y = y_i \mid \mu_i, \alpha) = \frac{\Gamma(y_i + \alpha^{-1})}{\Gamma(y_i + 1)\,\Gamma(\alpha^{-1})} \left(\frac{1}{1 + \alpha\mu_i}\right)^{\alpha^{-1}} \left(\frac{\alpha\mu_i}{1 + \alpha\mu_i}\right)^{y_i},$$

where $\mu_i = t_i\mu$ and $\alpha$ is the dispersion parameter. The parameter $\mu$ represents the average incidence rate of $y$ per unit time of exposure $t$ and can be interpreted as the risk of a new occurrence of the event during a given exposure. The mean of $y$ is determined by the exposure time $t$ and a set of $k$ regressors in a negative binomial regression model and is defined by

$$\mu_i = \exp\left(\ln(t_i) + \beta_0 + \beta_1 x_{1i} + \cdots + \beta_k x_{ki}\right),$$

where the exponent is a linear function of the predictor variables $x_1, x_2, \ldots, x_k$, $\beta_0$ is the intercept, and $\beta_1, \ldots, \beta_k$ are unknown parameters to be determined. The conditional mean function is the same as that of the Poisson distribution,

$$E(y_i \mid x_i) = \mu_i,$$

and the variance is $\operatorname{Var}(y_i \mid x_i) = \mu_i(1 + \alpha\mu_i)$ [42]. The dispersion factor is denoted by $\alpha$. The distribution is a Poisson distribution with parameter $\mu_i$ and is equidispersed when $\alpha = 0$. When $\alpha < 0$, the variance is smaller than the mean, and the count variable is underdispersed, and when $\alpha > 0$, the count variable is overdispersed.

In this study, the negative binomial (NB) model is of the following form:

$$\ln(\mu_t) = \beta_0 + \sum_{j=1}^{m} \sum_{k} \beta_{jk}\, x_{j,t-k} + \varepsilon_t, \tag{11}$$

where $\mu_t$ is the expected COVID-19 incidence on day $t$, $\beta_0$ is the intercept, $\beta_{jk}$ is the coefficient of variable $x_j$ at lag $k$ (with $k$ ranging over the lags considered, up to 6 days), $x_1, x_2, \ldots, x_m$ are the pollution and meteorological variables, and $\varepsilon_t$ is the residual of the model. This model takes into consideration the lagged values of air pollution and meteorological variables. The average incubation period for COVID-19 is 6 days [43]. The incubation period may, however, exceed 6 days; in this study, the maximum length of the incubation period was taken to be 6 days.

A stepwise model selection procedure based on the Akaike information criterion (AIC) was employed to discard candidate models with the highest AIC values in the fitted negative binomial regression model [44]. This was performed by systematically dropping single variables in order to identify the variables with the least relevance and exclude them from further analysis.
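The construction of lagged predictors can be sketched as follows. This is only an illustration under assumptions: the helper `make_lagged` is hypothetical, the series is a toy sequence, and the lags are assumed to run from 1 to 6 days, which matches the 6-day incubation period but is not stated explicitly in the text.

```python
def make_lagged(series, max_lag=6):
    """Build rows [x_{t-1}, x_{t-2}, ..., x_{t-max_lag}] for every day t
    that has a full history of max_lag earlier observations."""
    rows = []
    for t in range(max_lag, len(series)):
        rows.append([series[t - k] for k in range(1, max_lag + 1)])
    return rows

# Hypothetical daily readings of one pollutant or weather variable
daily_values = list(range(10))  # days 0..9

lagged = make_lagged(daily_values, max_lag=6)
# The first usable day is day 6; its predictors are the values of days 5..0
print(lagged[0])    # [5, 4, 3, 2, 1, 0]
print(len(lagged))  # 4 usable days out of 10
```

The first `max_lag` days are lost because they lack a complete lag history, so the response series has to be trimmed to the same days before a regression model is fitted.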

2.5.2. Generalised Poisson (GP) Regression Model

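A generalised Poisson (GP) regression model is a form of generalised linear model used to model count data which are random, underdispersed, overdispersed, or equidispersed [45]. A random variable $y$ has a generalised Poisson distribution with parameters $\mu$ and $\alpha$ if it takes values $y = 0, 1, 2, \ldots$ with probability

$$P(Y = y) = \left(\frac{\mu}{1 + \alpha\mu}\right)^{y} \frac{(1 + \alpha y)^{y-1}}{y!} \exp\left(-\frac{\mu(1 + \alpha y)}{1 + \alpha\mu}\right).$$

The dispersion factor is denoted by $\alpha$. The distribution is a Poisson distribution with parameter $\mu$ and is equidispersed when $\alpha = 0$. When $\alpha < 0$, the variance is smaller than the mean, and the count variable is underdispersed, and when $\alpha > 0$, the count variable is overdispersed. The mean and variance of the generalised Poisson distribution are calculated as $E(y) = \mu$ and $\operatorname{Var}(y) = \mu(1 + \alpha\mu)^2$, respectively. To develop the generalised Poisson regression model, first, the canonical link function is defined as follows:

$$g(\mu_t) = \ln(\mu_t).$$

The linear predictor is then defined as

$$\eta_t = \beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt},$$

so that

$$\ln(\mu_t) = \eta_t.$$

Therefore, the expected count is expressed as

$$\mu_t = \exp\left(\beta_0 + \beta_1 x_{1t} + \beta_2 x_{2t} + \cdots + \beta_k x_{kt}\right),$$

where $\beta_0$ is the intercept, $x_{1t}, x_{2t}, \ldots, x_{kt}$ are the values of the covariates at time $t$, and $\beta_1, \beta_2, \ldots, \beta_k$ are the coefficients of the predictor variables [46]. The generalised Poisson model fitted in this study is of the form

$$\ln(\mu_t) = \beta_0 + \sum_{j=1}^{m} \sum_{k} \beta_{jk}\, x_{j,t-k} + \varepsilon_t, \tag{17}$$

where, as in (11), $\mu_t$ is the expected COVID-19 incidence, $\beta_{jk}$ is the coefficient of variable $x_j$ at lag $k$, and $\varepsilon_t$ is the residual of the model.

One of the assumptions of NB and GP models is that the residuals of the fitted models must not be autocorrelated [47]. The Ljung–Box test is used to check the residuals of the models for serial autocorrelation. If they show autocorrelation, an autoregressive integrated moving average (ARIMA) model is fitted to the residuals of that model to capture the unexplained patterns exhibited in the residuals. The optimum ARIMA model for the residuals is chosen based on the AIC.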
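The behaviour of the dispersion factor can be checked numerically. The sketch below assumes one common mean-parameterised form of the generalised Poisson probability function (mean $\mu$, dispersion $\alpha$); whether this is the exact parameterisation used in the study is an assumption, as are the function name `gp_pmf` and the parameter values.

```python
import math

def gp_pmf(y, mu, alpha):
    """Generalised Poisson probability (assumed mean parameterisation).

    Evaluated on the log scale for numerical stability. Valid as written
    for alpha >= 0; negative alpha needs a truncated support.
    """
    log_p = (y * math.log(mu / (1 + alpha * mu))
             + (y - 1) * math.log(1 + alpha * y)
             - math.lgamma(y + 1)
             - mu * (1 + alpha * y) / (1 + alpha * mu))
    return math.exp(log_p)

mu, alpha = 3.0, 0.1  # hypothetical parameter values
support = range(200)  # the tail beyond 200 is negligible for these values
probs = [gp_pmf(y, mu, alpha) for y in support]

total = sum(probs)
mean = sum(y * p for y, p in zip(support, probs))
var = sum((y - mean) ** 2 * p for y, p in zip(support, probs))

print(f"sum = {total:.6f}, mean = {mean:.4f}, variance = {var:.4f}")
print(f"mu * (1 + alpha * mu)^2 = {mu * (1 + alpha * mu) ** 2:.4f}")
```

With a positive dispersion factor the variance exceeds the mean, which is the overdispersed case; setting `alpha = 0` reduces the function to the ordinary Poisson probability.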

2.5.3. Autoregressive Integrated Moving Average (ARIMA (p, d, q)) for the Residuals

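After the Ljung–Box test of the residuals of the models, ARIMA (p, d, q) models for the residuals were fitted. The ARIMA (p, d, q) model is a time series approach that combines autoregression, integration, and moving average. Autoregression means that the time series is regressed on its lagged values as follows:

$$y_t = c + \phi_1 y_{t-1} + \phi_2 y_{t-2} + \cdots + \phi_p y_{t-p} + \varepsilon_t,$$

where $c$ is the intercept term, $\phi_1, \phi_2, \ldots, \phi_p$ are parameters to be determined, and $\varepsilon_t$ is white noise with mean zero and constant variance $\sigma^2$.

Integrated I(d) means differencing taken $d$ times until the original series becomes stationary. A stationary time series has properties that are independent of the time at which it was observed. The first order of difference is provided by

$$y'_t = y_t - y_{t-1},$$

and the general form of the difference of order $d$ is

$$\nabla^{d} y_t = (1 - B)^{d} y_t,$$

where $B$ is the backshift operator, $B y_t = y_{t-1}$.

The moving average (q) takes the present value as a linear combination of lagged forecast errors, where the error terms are the errors of the autoregression models of the respective lags. Equation (21) shows the MA model

$$y_t = c + \varepsilon_t + \theta_1 \varepsilon_{t-1} + \theta_2 \varepsilon_{t-2} + \cdots + \theta_q \varepsilon_{t-q}, \tag{21}$$

where $c$ is a constant and $\theta_1, \theta_2, \ldots, \theta_q$ are parameters to be determined. The complete ARIMA (p, d, q) model is given by

$$y'_t = c + \phi_1 y'_{t-1} + \cdots + \phi_p y'_{t-p} + \theta_1 \varepsilon_{t-1} + \cdots + \theta_q \varepsilon_{t-q} + \varepsilon_t,$$

where $y'_t$ denotes the series after differencing $d$ times.

The optimal values for $p$ and $q$ are those for which the AIC of the corresponding ARIMA (p, d, q) model is the least. The parameters of ARIMA (p, d, q) models are obtained by maximum likelihood estimation [47].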
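The integration step can be illustrated with a short sketch. The helper `difference` and the toy series are assumptions made for illustration; the example applies the first difference once and then twice.

```python
def difference(x, d=1):
    """Apply the first difference y_t - y_{t-1} a total of d times."""
    for _ in range(d):
        x = [curr - prev for prev, curr in zip(x, x[1:])]
    return x

# Hypothetical series with a quadratic trend (not stationary)
series = [1, 4, 9, 16, 25]

first_diff = difference(series, d=1)   # removes a linear trend component
second_diff = difference(series, d=2)  # constant, so the trend is gone

print(first_diff)   # [3, 5, 7, 9]
print(second_diff)  # [2, 2, 2]
```

Each round of differencing shortens the series by one observation, which is why d is kept as small as possible; for the residual series in this study a single difference (d = 1) was reported to be sufficient.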

2.6. Model Selection
2.6.1. Using AIC Criterion

In the fitted negative binomial and generalised Poisson models, a stepwise model selection approach based on the Akaike information criterion was used to exclude models with the highest AIC values. This was achieved by systematically dropping single variables in order to identify the variables with the least importance and eliminate them from further analysis. For the stepwise regression of the NB and GP models and for the residual ARIMA (p, d, q) models, the AIC was used to choose the best-fitted model. The stepwise regression and residual models with the lowest AIC values were chosen as the most appropriate models from the available candidates. The AIC is defined as follows:

$$\mathrm{AIC} = -2\hat{\ell} + 2k,$$

where $\hat{\ell}$ represents the maximum value of the log-likelihood function and $k$ represents the number of estimated parameters, determined here by the number of independent variables [48].
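A minimal sketch of AIC-based selection is given below. The function `aic` and the illustrative log-likelihood are assumptions; the two AIC values in the dictionary are the ones reported for the full and reduced NB models in Section 3.2.1.

```python
def aic(log_likelihood, n_params):
    """Akaike information criterion: -2 * log-likelihood + 2 * parameters."""
    return -2.0 * log_likelihood + 2.0 * n_params

# Hypothetical model: maximised log-likelihood of -100 with 3 parameters
example_aic = aic(-100.0, 3)
print(example_aic)  # 206.0

# AIC values reported in the text for the NB regression models
reported_aic = {"NB full": 3770.92, "NB reduced": 3695.919}

# The candidate with the smallest AIC is preferred
best_model = min(reported_aic, key=reported_aic.get)
print(best_model)  # NB reduced
```

The penalty term 2k is what allows a reduced model to win: dropping a weak predictor lowers the log-likelihood only slightly while removing two units of penalty.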

2.6.2. Root Mean Square Error (RMSE)

The RMSE is used to examine the model performance by determining how far the fitted values fall from the observed values using the Euclidean distance. The RMSE was used to determine which of the models NB + ARIMA (p, d, q) and GP + ARIMA (p, d, q) was optimal. The RMSE value is defined by

$$\mathrm{RMSE} = \sqrt{\frac{1}{n} \sum_{i=1}^{n} \left(y_i - \hat{y}_i\right)^2},$$

where $y_i$ is the observed value, $\hat{y}_i$ is the fitted value, and $n$ is the number of observations [49].
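The sketch below shows how the RMSE of a hybrid fit could be computed, where the hybrid fitted value is the sum of the regression fit and the ARIMA fit of its residuals. All numbers are hypothetical toy values, and the names `rmse`, `glm_fit`, and `resid_fit` are assumptions for illustration.

```python
import math

def rmse(observed, fitted):
    """Root mean square error between observed and fitted values."""
    n = len(observed)
    return math.sqrt(sum((o - f) ** 2 for o, f in zip(observed, fitted)) / n)

# Hypothetical observed counts and two fitted components
observed = [10, 20, 30]
glm_fit = [11, 19, 31]   # fitted values of a count regression model
resid_fit = [1, -1, 2]   # ARIMA fitted values of that model's residuals

# Hybrid fit: regression fit plus the ARIMA fit of its residuals
hybrid_fit = [g + r for g, r in zip(glm_fit, resid_fit)]

print(f"RMSE (regression only) = {rmse(observed, glm_fit):.4f}")
print(f"RMSE (hybrid)          = {rmse(observed, hybrid_fit):.4f}")
```

In this toy case the hybrid fit happens to be worse than the regression fit alone, which shows that adding a residual model does not guarantee a lower RMSE and that the comparison has to be made on the actual data.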

3. Results and Discussion

3.1. Distribution of the Response Variable

Figure 2 illustrates the distribution of COVID-19 incidence during the study period. It can be observed from Table 2 that the response variable is skewed to the right with a mean of 932 and a variance of 894,840. With the variance being about 960 times the mean, the incidence count variable is overdispersed. The overdispersion was further confirmed by the overdispersion test (Lambda t-test score = 10.882, with a $p$ value below the chosen level of significance). Based on this property of the response variable, negative binomial and generalised Poisson models were considered in this study.
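The variance-to-mean ratio quoted above follows directly from the two summary statistics reported in the text; the short check below uses only those two numbers.

```python
# Summary statistics of daily COVID-19 incidence as reported in the text
mean_incidence = 932
variance_incidence = 894_840

# A ratio of 1 would indicate equidispersion (the Poisson assumption)
dispersion_ratio = variance_incidence / mean_incidence
print(f"variance / mean = {dispersion_ratio:.2f}")
```

A ratio this far above 1 is the informal signal of overdispersion that the formal test then confirms.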

3.2. Results from NB and GP Models
3.2.1. NB Model

Stepwise regression was performed on the NB model in (11) to drop the nonsignificant predictors. The RMSE, which measures the deviation between the observed and fitted counts, was 470.86 for the full model and 456.585 for the reduced model. Furthermore, the AIC was 3695.919 for the reduced model while it was 3770.92 for the full NB model. As a result, the reduced model was preferred for modeling the COVID-19 incidence in Western Cape. The NB regression model fits and the observed COVID-19 incidence are shown in Figure 3. The results suggest that the NB model failed to capture well the behavior of COVID-19 incidence during the first days of the study period.

Autocorrelation and partial autocorrelation functions of the residuals of the fitted NB regression model are shown in Figure 4. It can be observed that the residuals are not white noise and that some patterns remain in the series. The results were further confirmed by the Ljung–Box test (Ljung–Box test statistic = 283.03, with a $p$ value below the chosen level of significance). To capture the unexplained patterns exhibited in the residuals, an ARIMA (p, d, q) model was fitted to the residuals. The augmented Dickey–Fuller (ADF) test indicated nonstationarity in the residuals of the negative binomial regression model (Dickey–Fuller test statistic = −3.3922, $p$ value = 0.05627). The time series of the residuals was made stationary through first differencing. The ACF and PACF of the differenced residuals are shown in Figure 5.

The possible parameters for the autoregressive (AR) and moving average (MA) components of the ARIMA model were identified from the ACF and PACF of the differenced residuals and by trying various combinations of AR and MA components. Several ARIMA models for the residuals were tested. Based on the AIC, the ARIMA (7, 1, 5) was selected as the optimal model.

Diagnostic analysis of residuals of ARIMA (7, 1, 5) was done. The residuals were normally distributed, random, and had a constant mean of 0. All the assumptions of residuals were met by the optimal model as shown in Figure 6. The fitted values for ARIMA (7, 1, 5) were then added to the fitted values of the NB regression model. A plot of the fitted values of NB + ARIMA (7, 1, 5) and the observed counts of the COVID-19 incidence is shown in Figure 7. The fitted series from the NB and NB + ARIMA (7, 1, 5) model can capture well most of the patterns in the actual series. However, both models failed to properly capture the behavior of COVID-19 incidence during the first days of the study period.

3.2.2. GP Model

Stepwise regression was performed on the GP model in (17) to drop the nonsignificant predictors. The RMSE was 317.68 for the full model and 311.1374 for the reduced model. Furthermore, the AIC of the full GP model was 3732.411 while it was 3655.563 for the reduced model. As a result, the reduced model was preferred for modeling the COVID-19 incidence in Western Cape. The GP regression model fits the observed COVID-19 incidence well, as shown in Figure 8. It can be observed that the model fits approximately through the mean of COVID-19 incidence. The GP regression model was able to capture well the behavior of COVID-19 incidence during the study period.

Autocorrelation and partial autocorrelation functions of the residuals of the fitted GP model are shown in Figure 9. It can be observed that the residuals are autocorrelated and that some patterns remain in the series. The results were further confirmed by the Ljung–Box test (Ljung–Box test statistic = 148.79, with a $p$ value below the chosen level of significance). To capture the unexplained patterns exhibited in the residuals, an ARIMA (p, d, q) model was fitted to the residuals. The augmented Dickey–Fuller (ADF) test indicated that the residuals of the generalised Poisson regression model are nonstationary (Dickey–Fuller test statistic = −4.3922, $p$ value = 0.06627). The time series of the residuals was made stationary through first differencing. The ACF and PACF of the differenced residuals are shown in Figure 10.

The possible parameters for the autoregressive (AR) and moving average (MA) components of the ARIMA (p, d, q) model were identified, and several ARIMA (p, d, q) models for the residuals were tested by trying various combinations of AR and MA components. Based on the AIC, the ARIMA (1, 1, 6) was selected as the optimal model.

Then, the selected best model, ARIMA (1, 1, 6), was checked for the validity of the assumptions. A diagnostic analysis of the residuals was carried out. All the assumptions of residuals were met as shown in Figure 11, and the Ljung–Box test confirmed the absence of autocorrelation at earlier lags ($p$ value = 0.8985, Ljung–Box test statistic = 4.881). The fitted values for ARIMA (1, 1, 6) were then added to the fitted values of the GP regression model. A plot of the fitted values of GP + ARIMA (1, 1, 6) and the observed COVID-19 incidence is shown in Figure 12. The fitted series from the GP and GP + ARIMA (1, 1, 6) models can capture uniformly most of the patterns in the actual series of COVID-19 incidence.

The RMSE was used to identify the optimal model between NB + ARIMA (7, 1, 5) and GP + ARIMA (1, 1, 6). The RMSE of the NB + ARIMA (7, 1, 5) is 456.9698 while that of GP + ARIMA (1, 1, 6) is 447.121. As a result, GP + ARIMA (1, 1, 6) was selected as the optimal model with the minimum RMSE. Figure 13 shows a plot for the comparison of observed incidence counts, fitted values of NB + ARIMA (7, 1, 5), and fitted values of GP + ARIMA (1, 1, 6). By observing this plot, it can be seen that both models fit the incidence well. However, the GP + ARIMA (1, 1, 6) seems to be more consistent with the observed incidence in Western Cape Province.

3.3. Discussion

The impact of pollution and meteorological variables on COVID-19 transmission was examined using NB and GP models. Overall, the GP + ARIMA (1, 1, 6) suggested that daily COVID-19 incidence was positively associated with minimum temperature, , , ambient relative humidity, and ambient wind speed at various lags. A negative association was observed between COVID-19 incidence and maximum humidity, minimum humidity, maximum temperature, , PM load, solar radiation, , and at various lags as shown in Table 3. Care must be taken while interpreting the impact of air pollution and meteorological variables on COVID-19 incidence. A positive coefficient indicates that an increase of that variable at lag k corresponds to an increase in COVID-19 incidence, holding all other variables constant, while a negative coefficient indicates that an increase of the variable at lag k corresponds to a decrease in COVID-19 incidence, holding all other variables constant.

The findings have shown that at lag 1, lag 3, lag 4, and lag 6 suggested a positive association with incidence, while and PM load suggested a negative association with incidence at lag 1, lag 3, lag 4, and lag 6, respectively. These findings contradict those of the authors in [50], which showed is negatively associated with COVID-19 incidence, and those of the authors in [51], which showed that and PM load were positively associated with COVID-19 incidence. Maximum temperature at lag 3 and lag 4 showed a negative association with incidence, while minimum temperature at lag 2, lag 3, and lag 6 showed a positive association with incidence. These findings are not in line with those of the authors in [33, 52], which showed that maximum temperature and minimum temperature were positively associated with COVID-19 incidence. Average temperature showed no association with COVID-19 incidence. This is not in line with the findings of the authors in [53], which suggested that average temperature was positively associated with COVID-19 incidence.

Moreover, maximum relative humidity at lags 1, 2, and 5 and minimum relative humidity at lags 2, 3, and 6 exhibited a negative association with COVID-19 incidence, while ambient relative humidity at lags 2 and 5 showed a positive association with COVID-19 incidence. These results are consistent with those of the authors in [54], which reported that maximum relative humidity was negatively associated with COVID-19 incidence, and the authors in [20], which showed that ambient relative humidity was positively associated with COVID-19 incidence. With regard to NO, , , and , this study showed that NO suggested a negative association with incidence at lag 4 and suggested a positive association with COVID-19 incidence at lag 5 while, and showed a negative association with COVID-19 incidence at lag 5, lag 1, and lag 6, respectively. These findings are in line with those of the authors in [24], which showed that was positively associated with COVID-19 incidence, and in contradiction with which showed a positive association with COVID-19 incidence. In addition, ambient wind speed showed a positive association with COVID-19 incidence. This is consistent with the results of the authors in [24], which showed that wind speed was positively associated with COVID-19 incidence. Solar radiation suggested a negative association with COVID-19 incidence at lag 4 and lag 6. This is in line with the findings of the authors in [55, 56].

4. Conclusion

This study explored the impact of pollution factors and meteorological variables on daily COVID-19 incidence in the Western Cape Province of South Africa. By taking into consideration lags based on the incubation period of COVID-19, the study modeled COVID-19 incidence counts using negative binomial and generalised Poisson regression models. The residuals of both fitted regression models were autocorrelated. To capture the unexplained patterns that remained in the residuals, the ARIMA method was used. The residuals of the negative binomial regression model were fitted with an ARIMA (7, 1, 5), while the residuals of the generalised Poisson regression model were fitted with an ARIMA (1, 1, 6). The two models, NB + ARIMA (7, 1, 5) and GP + ARIMA (1, 1, 6), were able to capture some of the trends found in the original incidence series throughout the study period.

Based on the minimum RMSE, GP + ARIMA (1, 1, 6) was selected as the optimal model for COVID-19 incidence and was used to investigate the association between COVID-19 incidence and pollution and meteorological variables. The results revealed that , minimum temperature, , ambient relative humidity, and ambient wind speed at various lags were positively associated with COVID-19 incidence, while maximum relative humidity, , , minimum relative humidity, maximum temperature, NO, PM load, solar radiation, and at various lags suggested a negative association with COVID-19 incidence. Ambient wind direction and temperature showed a nonsignificant association with COVID-19 at all lags. The positively associated variables may increase the risk of COVID-19 transmission, while the negatively associated variables may reduce it. This study supports the hypothesis that air pollution and meteorological variables impact COVID-19 transmission. Moreover, the findings of this study might be useful for future studies in other provinces and countries with similar meteorological and air pollution conditions.

This study, however, has some shortcomings. First, it does not account for confounding factors including population movements, population density, potential seasonal impacts, virus mutation, and public health interventions. This limited the ability to accurately measure the impact of meteorological variables and air pollution parameters on COVID-19 transmission. Second, performing stepwise regression, especially with a large number of lagged variables, has its limitations. Stepwise regression involves iteratively adding or removing variables based on certain criteria, such as significance levels or model fit statistics, which can lead to variable selection bias and inflated type I error rates. This process may bias regression estimates, particularly if not conducted with caution. Stepwise regression tends to select variables that best fit the sample data, potentially leading to overfitting and poor out-of-sample prediction performance. Stepwise procedures can also be sensitive to outliers and multicollinearity, potentially biasing coefficient estimates and inflating standard errors. In addition, stepwise regression does not account for the uncertainty introduced by variable selection, leading to overly optimistic assessments of model performance and coefficient significance. Consequently, while stepwise regression can aid in model simplification, it is essential to interpret the results cautiously.
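To make the iterative procedure concrete, the following minimal sketch shows forward stepwise selection driven by AIC. It is not the authors' implementation: the direction of the search (forward only), the function names, the variable names, and the toy AIC function are all assumptions made for illustration. In practice, `aic_of` would refit the count regression on the chosen variables and return its AIC.

```python
def forward_stepwise(candidates, aic_of):
    """Greedy forward selection: add the variable that lowers AIC most,
    and stop when no remaining variable improves the AIC."""
    selected = []
    best_aic = aic_of(selected)
    remaining = list(candidates)
    while remaining:
        # Score every model obtained by adding one more variable.
        trial = {v: aic_of(selected + [v]) for v in remaining}
        best_var = min(trial, key=trial.get)
        if trial[best_var] >= best_aic:
            break  # no candidate improves the criterion
        selected.append(best_var)
        remaining.remove(best_var)
        best_aic = trial[best_var]
    return selected, best_aic


# Hypothetical stand-in for a model fit: each variable lowers the deviance
# by a fixed amount, and AIC adds a penalty of 2 per included variable.
toy_gain = {"temp_lag3": 10, "rh_lag2": 5, "wind_lag1": 1}


def toy_aic(variables):
    return 100 - sum(toy_gain[v] for v in variables) + 2 * len(variables)


chosen, final_aic = forward_stepwise(toy_gain, toy_aic)
```

Because each step keeps whichever variable fits the sample best, the selected set depends on the data at hand, which is the source of the selection bias and overfitting discussed above.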

This study was conducted by analyzing data from the initial phase of the COVID-19 pandemic, which was influenced by changes in people’s behavior and government-imposed containment measures. As a result, all important factors that may influence COVID-19 transmission should be identified through a comprehensive study and incorporated into the model to reduce any inconsistencies between the actual and fitted series. Another study should be undertaken utilizing the most current data and the results compared with the results of this study.

Data Availability

The data used to support the findings of this study are available on request.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors extend their heartfelt gratitude to the faculty and staff at AIMS Rwanda for their invaluable support and guidance throughout this research. This research, titled “Modeling the Impact of Air Pollution and Meteorological Variables on COVID-19 Transmission in Western Cape, South Africa,” was presented as the culmination of the authors' academic journeys at the African Institute for Mathematical Sciences, AIMS Rwanda, funded by the Next Einstein Initiative Scholarship [57]. The authors extend their sincere appreciation to their thesis advisors for their mentorship and insightful feedback, which contributed significantly to the development and refinement of this research. The academic environment at AIMS Rwanda provided the authors with an enriching experience. The authors are thankful for the opportunities to share and discuss their findings with the academic community. As the authors embark on the journey to publish this paper, they acknowledge the pivotal role AIMS Rwanda played in shaping their research skills and fostering a passion for scientific inquiry. This study was funded by the African Institute for Mathematical Sciences.