Abstract

Materials discovery commonly relies on high-throughput computational screening. Costly and complex direct density functional theory (DFT) simulations have typically been used to determine subtle trends in the spin-state ordering and bonding of inorganic materials and, more generally, to predict the electronic structure properties of transition metal complexes. Here, a Gaussian process regression (GPR) framework with four kernel functions is introduced to estimate spin-state splitting from empirical inputs appropriate to inorganic chemistry. To this end, the present study drew on an extensive set of data from earlier works. According to statistical analysis, the GPR models performed very well. The coefficient of determination was 0.986 for both the exponential and Matern kernel functions, the highest among the kernels tested. Moreover, the sensitivity of the output to the inputs was measured. The artificial intelligence (AI) approach accurately predicted the target values across various input ranges.

1. Introduction

Novel compounds, catalysts [1], and materials [2] are routinely discovered via high-throughput computational screening [3, 4]. Many screening and discovery studies still rely on first-principles simulation, but its high computational cost means that only a narrow subset of chemical space can be explored [5, 6]. Lower levels of theory, such as machine-learning models, have emerged as alternatives to traditional methods for efficiently evaluating new candidate materials and speeding up exploration [7]. Computational chemists have recently found a broad range of uses for artificial neural networks (ANNs) [8–10]. The applicability of machine-learning methods to potential energy surfaces and, therefore, to force field development was recognized first [9, 11–13]. Beyond molecular or heterogeneous catalyst and materials discovery, such methods have lately been explored for exchange-correlation functional development [8, 14], general solutions of the Schrödinger equation [15], orbital-free density functional theory [16, 17], many-body expansions [18], acceleration of dynamics [19, 20], and band-gap estimation [21, 22], among others.

An essential challenge for ANNs to replace direct first-principles calculation is the proper identification of broadly applicable descriptors that allow the ANN to be used flexibly beyond the molecules in the training set, e.g., for larger molecules or those with different chemistry. So far, beyond proof-of-concept demonstrations, ANNs have been most successful in developing force fields for well-defined compositions, such as water [23, 24]. For energetic predictions in organic chemistry, structural descriptors such as the Coulomb matrix [25] or local descriptors of the chemical environment and bonding [26, 27] have been helpful when only a small number of elements (e.g., C, H, N, and O) are considered. Molecular similarity, force field development, quantitative structure-activity relationships [28], and group additivity theories have all been successfully addressed with cheminformatics in the past. Only a handful of force fields [29] for transition metal complexes cover the whole spectrum of inorganic chemical bonding [30]. Because spin state and coordination environment influence bonding [31], more careful construction of descriptors is needed to accurately predict the properties of open-shell transition metal complexes.

Likewise, descriptors that were effective for organic molecules are ineffective for inorganic crystalline materials [32]. In transition metal complexes, it is well recognized [33, 34] that the sensitivity of electronic properties (such as spin-state splitting) correlates strongly with the ligand-atom connection and the ligand-field strength [35, 36]. A distant substitution (e.g., tetraphenyl porphyrin for base porphin) has a limited effect, whereas ligands with the same metal-bonding atom can have vastly different ligand-field strengths (for example, C for both weak-field CH3CN and strong-field CO). Therefore, a descriptor set for transition metal complexes must carefully balance metal-proximal and metal-distant descriptors. A second challenge concerns the first-principles properties that ANNs are trained to predict in transition metal chemistry and related inorganic materials. Transition metal complexes cannot readily benefit from efficient correlated wave function methods (e.g., MP2), because best practices for such complexes remain an open question [37]. Although promising routes for ANNs involve mapping lower-level theory results to a higher level of theory (e.g., from semiempirical theory) [38], as has been shown for atomization energies [39] and, more recently, reaction barriers [40], the appropriate levels of theory for such extrapolation are less clear in transition metal chemistry. The amount of exact (Hartree–Fock, HF) exchange to include in the study of transition metal complexes is also unclear. Despite the delocalization errors of approximate DFT for transition metal complexes [35, 41, 42], recommendations range from no exchange to alternately low or high fractions of exact exchange, chosen in a system-dependent manner.

Indeed, quantifying the uncertainty that functional choice introduces into energetic predictions, particularly the sensitivity of predictions to the inclusion of exact exchange, has attracted considerable attention lately. To obtain a direct value and understand how the exchange fraction [33, 34] affects spin-state splitting, one must first determine how sensitive the splitting is to exchange. A machine-learning model that predicts spin-state ordering across exchange fractions would be helpful for translating existing predictions or for providing measures of accuracy on computed data.

As a general rule, any application of artificial intelligence in inorganic chemistry, such as the rapid identification of novel spin-crossover complexes [43, 44], dye sensitizers for solar cells [45], or the rapid assessment of spin-state ordering to determine the reactivity of open-shell catalysts, should meet two requirements: (i) the descriptors must combine metal-proximal and metal-distant properties and (ii) the model must predict spin-state ordering as the exchange-correlation mixing varies. In this study, cheminformatics-inspired tools for building transition metal complex structures help us make progress toward both goals. To train GPR, as a new method, to predict transition metal complex properties, we also developed structure-functional sensitivity relationships for transition metal complexes. Various analyses were used to evaluate the proposed models. Our goal is to provide a model that predicts the target parameter with high accuracy.

2. GPR Model

The present work adopted GPR, a machine-learning method that treats uncertainty in a probabilistic (Bayesian) manner [46, 47]. This approach can handle complicated problems in a straightforward way. Nonlinear GPR techniques can be employed with small training datasets and can integrate new evidence as the number of data points grows [48]. Overfitting is largely avoided because only a few hyperparameters are optimized in the training phase. The model parameters are determined from the GPR training dataset [49, 50]. Prior information is incorporated into the process along with empirical data to construct the GPR model. Unlike traditional machine-learning algorithms, GPR computes a posterior distribution rather than seeking the single fit most consistent with the empirical data [51].

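Let $x$ be the input and $y$ be the output. Also, $D_T = \{X_T, Y_T\}$ denotes a random testing dataset, and $D_L = \{X_L, Y_L\}$ is a random training dataset. GPR begins with [52]

$$Y_L = f(X_L) + \varepsilon, \qquad \varepsilon \sim \mathcal{N}(0, \sigma_n^2 I), \tag{1}$$

where $X_L$ and $Y_L$ are the independent variable and target of the training dataset, respectively. Furthermore, $\varepsilon$ denotes the observation noise, $\sigma_n^2$ is the noise variance, and $I$ is the identity matrix. As a result, the Gaussian noise model connects the $y$ values to $f(x)$. $f$ is assumed to be a random function completely defined by its mean and covariance functions [53]. Similarly,

$$Y_T = f(X_T) + \varepsilon, \tag{2}$$

where $X_T$ and $Y_T$ are the independent variable and target of the testing dataset, respectively, and $f(x)$ follows a Gaussian process whose kernel function is $k(x, x')$ and whose mean function is $m(x)$ [54]. Thus,

$$f(x) \sim \mathcal{GP}\big(m(x), k(x, x')\big). \tag{3}$$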

Explicit basis functions (BFs) could be employed to determine $m(x)$. It should be noted that $m(x)$ is typically assumed to be zero for simplification purposes, since a constant $m(x)$ is difficult to find [55]. Therefore,

$$f(x) \sim \mathcal{GP}\big(0, k(x, x')\big). \tag{4}$$

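The integration of (1) and (4) gives the $y$ distribution as [56]

$$Y \sim \mathcal{N}\big(0, K(X, X) + \sigma_n^2 I\big), \tag{5}$$

where $K(X, X)$ is the covariance matrix with entries $k(x_i, x_j)$.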

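Based on the aforementioned parameters [57], the training and testing targets are distributed as

$$Y_L \sim \mathcal{N}\big(0, K(X_L, X_L) + \sigma_n^2 I\big), \tag{6}$$

$$Y_T \sim \mathcal{N}\big(0, K(X_T, X_T) + \sigma_n^2 I\big). \tag{7}$$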

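A joint Gaussian expression is derived by combining (6) and (7) [58]:

$$\begin{bmatrix} Y_L \\ Y_T \end{bmatrix} \sim \mathcal{N}\left(0, \begin{bmatrix} K(X_L, X_L) + \sigma_n^2 I & K(X_L, X_T) \\ K(X_T, X_L) & K(X_T, X_T) + \sigma_n^2 I \end{bmatrix}\right), \tag{8}$$

where $K(X_T, X_L) = K(X_L, X_T)^{\top}$ is the cross-covariance between the testing and training inputs.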

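The Gaussian conditioning rule is used to obtain the $Y_T$ distribution (where $\Sigma_T$ is the covariance, and $\mu_T$ is the mean) [59]:

$$Y_T \mid Y_L \sim \mathcal{N}(\mu_T, \Sigma_T), \qquad \begin{aligned} \mu_T &= K(X_T, X_L)\big[K(X_L, X_L) + \sigma_n^2 I\big]^{-1} Y_L, \\ \Sigma_T &= K(X_T, X_T) + \sigma_n^2 I - K(X_T, X_L)\big[K(X_L, X_L) + \sigma_n^2 I\big]^{-1} K(X_L, X_T). \end{aligned} \tag{9}$$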
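The conditioning step can be illustrated with a minimal sketch. It is not the implementation used in this study: the squared exponential kernel, the scalar inputs, the toy sine data, and all function names (`sq_exp_kernel`, `cholesky`, `chol_solve`, `gp_posterior`) are assumptions made for illustration, and only the Python standard library is used so that the example stays self-contained. It computes the posterior mean and the diagonal of the posterior covariance of the noisy test targets under a zero prior mean.

```python
import math

def sq_exp_kernel(a, b, length_scale=1.0, amplitude=1.0):
    """Squared exponential covariance between two scalar inputs (assumed kernel)."""
    return amplitude ** 2 * math.exp(-0.5 * ((a - b) / length_scale) ** 2)

def cholesky(A):
    """Lower-triangular Cholesky factor of a symmetric positive definite matrix."""
    n = len(A)
    L = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for j in range(i + 1):
            s = sum(L[i][k] * L[j][k] for k in range(j))
            if i == j:
                L[i][j] = math.sqrt(A[i][i] - s)
            else:
                L[i][j] = (A[i][j] - s) / L[j][j]
    return L

def chol_solve(L, b):
    """Solve (L L^T) x = b by forward and then backward substitution."""
    n = len(b)
    z = [0.0] * n
    for i in range(n):
        z[i] = (b[i] - sum(L[i][k] * z[k] for k in range(i))) / L[i][i]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (z[i] - sum(L[k][i] * x[k] for k in range(i + 1, n))) / L[i][i]
    return x

def gp_posterior(x_train, y_train, x_test, noise_var=1e-2, kernel=sq_exp_kernel):
    """Posterior mean and variance of the noisy target at each test input."""
    n = len(x_train)
    # Training covariance with the noise variance added on the diagonal.
    K = [[kernel(x_train[i], x_train[j]) + (noise_var if i == j else 0.0)
          for j in range(n)] for i in range(n)]
    L = cholesky(K)
    alpha = chol_solve(L, y_train)  # (K + noise_var * I)^-1 y_train
    means, variances = [], []
    for xs in x_test:
        k_star = [kernel(xs, xi) for xi in x_train]  # cross-covariance row
        means.append(sum(k * a for k, a in zip(k_star, alpha)))
        w = chol_solve(L, k_star)
        reduction = sum(k * wi for k, wi in zip(k_star, w))
        variances.append(kernel(xs, xs) + noise_var - reduction)
    return means, variances

# Toy data (hypothetical): noise-free samples of sin(x).
x_train = [0.0, 1.0, 2.0, 3.0]
y_train = [math.sin(v) for v in x_train]
mean, var = gp_posterior(x_train, y_train, [1.5, 10.0])
print(mean, var)
```

Far from the training inputs the predicted mean falls back to the zero prior and the variance returns to the prior variance plus noise, which is the behaviour expected from the conditioning rule.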

The output estimate for the testing dataset can thus be obtained from the testing inputs and the training dataset. The kernel function, which yields a symmetric, invertible covariance matrix in the training phase, strongly influences GPR predictive performance. The present study compared the Matern, exponential, squared exponential, and rational quadratic kernel functions to identify the most effective one [60, 61].

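The Matern kernel is given by

$$k(x, x') = \sigma^2 \, \frac{2^{1-\nu}}{\Gamma(\nu)} \left(\frac{\sqrt{2\nu}\, r}{\alpha}\right)^{\nu} K_\nu\!\left(\frac{\sqrt{2\nu}\, r}{\alpha}\right), \qquad r = \lVert x - x' \rVert,$$

where $\alpha > 0$ is the length scale, $\nu > 0$ is the smoothness parameter, $\sigma$ denotes the amplitude, and $\sigma^2$ is the variance. Moreover, $K_\nu$ is the modified Bessel function of the second kind, $r$ is the nonnegative distance between the two inputs, and $\Gamma$ stands for the gamma function. For $\nu = 0.5$, the Matern kernel reduces to the exponential kernel function, whereas in the limit $\nu \to \infty$ it converges to the squared exponential kernel function (two particular cases of the Matern kernel) [62, 63].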
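As a rough illustration of how the four kernel families relate, the sketch below uses the well-known closed forms of the Matern kernel for the half-integer smoothness values 0.5, 1.5, and 2.5, which avoid the Bessel function. The function and parameter names (`matern`, `length_scale`, `mixture`, and so on) and the chosen distances are assumptions for illustration, not the settings of this study.

```python
import math

def matern(r, length_scale=1.0, amplitude=1.0, nu=1.5):
    """Matern covariance at distance r >= 0 for nu in {0.5, 1.5, 2.5}."""
    s = math.sqrt(2.0 * nu) * r / length_scale
    if nu == 0.5:
        poly = 1.0
    elif nu == 1.5:
        poly = 1.0 + s
    elif nu == 2.5:
        poly = 1.0 + s + s * s / 3.0
    else:
        raise ValueError("closed form implemented only for nu = 0.5, 1.5, 2.5")
    return amplitude ** 2 * poly * math.exp(-s)

def exponential(r, length_scale=1.0, amplitude=1.0):
    """Exponential kernel, identical to the Matern kernel with nu = 0.5."""
    return amplitude ** 2 * math.exp(-r / length_scale)

def squared_exponential(r, length_scale=1.0, amplitude=1.0):
    """Squared exponential kernel, the nu -> infinity limit of the Matern kernel."""
    return amplitude ** 2 * math.exp(-0.5 * (r / length_scale) ** 2)

def rational_quadratic(r, length_scale=1.0, amplitude=1.0, mixture=1.0):
    """Rational quadratic kernel with scale-mixture parameter `mixture` > 0."""
    base = 1.0 + r * r / (2.0 * mixture * length_scale ** 2)
    return amplitude ** 2 * base ** (-mixture)

# Compare the kernels at one (arbitrary) distance.
r = 0.5
for nu in (0.5, 1.5, 2.5):
    print("Matern nu =", nu, matern(r, nu=nu))
print("exponential", exponential(r))
print("squared exponential", squared_exponential(r))
print("rational quadratic", rational_quadratic(r))
```

As the smoothness value increases, the Matern values move toward the squared exponential value, and the rational quadratic kernel approaches the same limit as its mixture parameter grows.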

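To maximize model accuracy, one-fifth of the data was held out as the testing dataset to measure model validity, while the remaining data formed the training dataset for spin-state splitting evaluation. Details of the data are given elsewhere [64]. Performance evaluation was carried out using MSE, R2, STD, MRE, and RMSE. In their usual forms, these statistical indices are calculated as [65–68]

$$\mathrm{MSE} = \frac{1}{N}\sum_{i=1}^{N}\left(y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}}\right)^2, \qquad \mathrm{RMSE} = \sqrt{\mathrm{MSE}},$$

$$R^2 = 1 - \frac{\sum_{i=1}^{N}\left(y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}}\right)^2}{\sum_{i=1}^{N}\left(y_i^{\mathrm{exp}} - \bar{y}^{\mathrm{exp}}\right)^2}, \qquad \mathrm{MRE}\,(\%) = \frac{100}{N}\sum_{i=1}^{N}\left|\frac{y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}}}{y_i^{\mathrm{exp}}}\right|,$$

$$\mathrm{STD} = \sqrt{\frac{1}{N-1}\sum_{i=1}^{N}\left(e_i - \bar{e}\right)^2}, \qquad e_i = y_i^{\mathrm{pred}} - y_i^{\mathrm{exp}},$$

where $N$ is the number of data points, $y_i^{\mathrm{exp}}$ and $y_i^{\mathrm{pred}}$ are the empirical and predicted values, $\bar{y}^{\mathrm{exp}}$ is the mean empirical value, and $\bar{e}$ is the mean error.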
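The split and the indices can be sketched as follows. This is an illustrative sketch only: the helper names `train_test_split` and `regression_metrics`, the random seed, and the toy numbers are hypothetical, and the definitions of MRE (mean absolute relative error in percent) and STD (sample standard deviation of the errors) are the common forms, assumed here rather than taken from the cited works.

```python
import math
import random

def train_test_split(rows, test_fraction=0.2, seed=0):
    """Shuffle the rows and hold out `test_fraction` of them for testing."""
    rng = random.Random(seed)
    shuffled = list(rows)
    rng.shuffle(shuffled)
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def regression_metrics(y_true, y_pred):
    """Return MSE, RMSE, R2, MRE (%) and STD for paired observations."""
    n = len(y_true)
    errors = [p - t for t, p in zip(y_true, y_pred)]
    sse = sum(e * e for e in errors)
    mse = sse / n
    mean_true = sum(y_true) / n
    ss_tot = sum((t - mean_true) ** 2 for t in y_true)
    mean_err = sum(errors) / n
    return {
        "MSE": mse,
        "RMSE": math.sqrt(mse),
        "R2": 1.0 - sse / ss_tot,
        # Relative error is undefined when an observed value is exactly zero.
        "MRE": 100.0 / n * sum(abs(e) / abs(t) for e, t in zip(errors, y_true)),
        "STD": math.sqrt(sum((e - mean_err) ** 2 for e in errors) / (n - 1)),
    }

# Hypothetical observed and predicted values.
observed = [1.0, 2.0, 3.0, 4.0]
predicted = [1.1, 1.9, 3.2, 3.8]
print(regression_metrics(observed, predicted))

train, test = train_test_split(list(range(10)))
print(len(train), len(test))
```

Because the relative error divides by the observed value, MRE can become very large when observed values lie close to zero, which is worth keeping in mind when reading percentage deviations.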

3. Accuracy Estimation

A portion of the data may be inconsistent with the rest of the dataset and is therefore suspect. Such data points mostly reflect experimental errors [69, 70]. It is necessary to identify suspect data points since they would diminish predictive performance [71]. To detect suspect (outlier) data, the present study adopted the leverage approach, in which outliers are identified using the hat matrix $H$ and the critical leverage limit $H^*$ [72]:

$$H = U\left(U^{\top}U\right)^{-1}U^{\top}, \qquad H^* = \frac{3(i + 1)}{j},$$

where $U$ is a $j \times i$ matrix, $i$ denotes the number of parameters, and $j$ stands for the number of training data points [73, 74]. Figure 1 shows the Williams plot of the standardized residuals versus the hat values, used to evaluate the accuracy of the spin-state splitting data. The reliable region is bounded by the critical leverage limit and by standardized residuals between −3 and +3. As shown, the dataset is satisfactory for the model training and testing phases.
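A minimal sketch of the leverage check follows, under simplifying assumptions that are not taken from this study: a single descriptor plus an intercept, for which the diagonal of the hat matrix has a simple closed form, residuals standardized by their sample standard deviation, and hypothetical toy values. The function names are illustrative.

```python
import math

def hat_values(x):
    """Diagonal of H = U (U^T U)^-1 U^T for U = [1, x] (intercept + one descriptor)."""
    n = len(x)
    mean_x = sum(x) / n
    sxx = sum((v - mean_x) ** 2 for v in x)
    return [1.0 / n + (v - mean_x) ** 2 / sxx for v in x]

def critical_leverage(n_params, n_points):
    """Critical leverage limit H* = 3 (i + 1) / j."""
    return 3.0 * (n_params + 1) / n_points

def standardized(residuals):
    """Residuals centred and scaled by their sample standard deviation (assumed form)."""
    n = len(residuals)
    mean_r = sum(residuals) / n
    s = math.sqrt(sum((r - mean_r) ** 2 for r in residuals) / (n - 1))
    return [(r - mean_r) / s for r in residuals]

def outliers(x, residuals, n_params=1, limit=3.0):
    """Indices of points outside the reliable region of the Williams plot."""
    h_star = critical_leverage(n_params, len(x))
    pairs = zip(hat_values(x), standardized(residuals))
    return [k for k, (h, sr) in enumerate(pairs) if h > h_star or abs(sr) > limit]

# Hypothetical data: the last point lies far from the others in descriptor space.
x = [1, 2, 3, 4, 5, 6, 7, 8, 9, 30]
residuals = [0.1, -0.2, 0.1, 0.0, -0.1, 0.2, -0.1, 0.1, -0.1, 0.0]
print(outliers(x, residuals))
```

A point is flagged either because its hat value exceeds the critical leverage (it lies far from the bulk of the inputs) or because its standardized residual falls outside the ±3 band (the model fits it poorly).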

4. Results and Discussion

To measure the performance of the models, the present work used statistical parameters to evaluate the consistency between the empirical data and the model estimates. Table 1 compares the estimates with the empirical data. The coefficient of determination was 0.985, 0.984, 0.986, and 0.986 for the rational quadratic, squared exponential, Matern, and exponential kernels, respectively. According to the STD, RMSE, MSE, and MRE values, the GPR models showed satisfactory training performance. The models must also predict spin-state splitting accurately for unseen data; hence, testing data were used to evaluate them. The GPR models with the exponential and Matern kernels showed the highest spin-state splitting prediction performance.

Figure 2 compares the empirical data with the model estimates. As can be seen, the model estimates agreed well with the empirical spin-state splitting data, suggesting high accuracy for the proposed models. The GPR models therefore perform very well in spin-state splitting estimation.

Figure 3 plots the model predictions against the empirical data. The fits of the predictions to the corresponding empirical data points had correlation coefficients above 0.9816. The fit lines lie close to the bisector line (45°), which serves as the measure of model accuracy. The models with the exponential and Matern kernel functions showed the largest correlation and thus the highest performance.

The relative deviations between the empirical data and the estimates are shown in Figure 4. The absolute deviations of the Matern, rational quadratic, and squared exponential kernels were below 2000%, whereas the exponential kernel showed an absolute deviation below 1500%.

The GPR models were found to be efficient and effective in the estimation of spin-state splitting. To put their spin-state splitting estimation performance in context, the models were compared with earlier studies. Janet and coworkers used the RMSE statistical parameter to compare LASSO, KRR, SVR, and ANN models in predicting this parameter [64]. Comparing their results with those given in Table 1 of this study indicates that the proposed models predict the target data more accurately.

5. Conclusion

The present study developed GPR models using four kernel functions, i.e., the rational quadratic, Matern, exponential, and squared exponential kernels, to evaluate spin-state splitting. As they showed good agreement with the empirical spin-state splitting data, the proposed models were concluded to have high performance. The GPR models with the exponential and Matern kernels showed the highest performance. Moreover, a comparison with earlier works in the literature indicated that the proposed GPR models outperformed earlier models.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.