Abstract

Chronic kidney disease (CKD) is a progressive condition characterized by the gradual deterioration of kidney function, potentially leading to kidney failure if not promptly diagnosed and treated. Machine learning (ML) algorithms have shown significant promise in disease diagnosis, but clinical data pose challenges, such as missing values, noisy inputs, and redundant features, that affect early-stage CKD prediction. This study therefore presents a novel, fully automated ML approach that tackles these complexities by incorporating feature selection (FS) and feature space reduction (FSR) techniques, leading to a substantial enhancement of the model’s performance. A data balancing technique is also employed during preprocessing to address the data imbalance commonly encountered in clinical contexts. Finally, for reliable CKD classification, classifiers with ensemble characteristics are recommended. The effectiveness of the approach is validated on multiple datasets, and its clinical relevance is evaluated on real-world therapeutic data collected from Bangladeshi patients. The study establishes the dominance of the adaptive boosting, logistic regression, and passive aggressive classifiers, which reach 96.48% accuracy in forecasting unseen therapeutic CKD data, particularly in early-stage cases. Furthermore, the FSR technique is shown to reduce prediction time significantly. This performance demonstrates the effectiveness of the proposed model in addressing the complexity of healthcare CKD data by incorporating the FS and FSR techniques, and highlights its potential as a computer-aided diagnosis tool for doctors, enabling early interventions and improving patient outcomes.

1. Introduction

The kidneys filter about 120 to 150 quarts of blood per day to generate approximately 1 to 2 quarts of urine [1, 2]. The primary function of the kidneys is to remove waste from the body’s fluids via urine. CKD starts with unexpected metabolic disorders that gradually lead to the loss of endocrine, excretory, and metabolic functions in the kidneys [3]. These abnormalities appear as the signs and symptoms of renal damage. Although the underlying cause of the disorder remains unspecified in many patients, the most common causes are diabetes, hypertension, interstitial diseases, systemic inflammatory disorders, glomerular diseases, congenital conditions, and renovascular abnormalities [4].

In the absence of timely treatment, kidney disease progresses to end-stage renal failure (ESRF), which can cause coma and even death [5]. According to [6], approximately 750,000 patients are affected by renal failure annually in the United States, an estimated 2 million people suffer from kidney failure globally, and the number of diagnosed patients rises by 5–7% per year. Over the past decade, the overall CKD mortality rate has increased substantially, by 31.7% [7]. Studies show that CKD is a more significant burden in low- and middle-income countries than in high-income countries [7–10]. The proportion of people diagnosed with renal disease in South Asian cities is 7.2%–17.2% [11]. In Dhaka city, a prevalence of 13% has been reported among people aged 15 years or older [12]. About one-third of Bangladesh’s rural population is at risk of incurable renal failure, according to another community-based report [13]. Hence, CKD poses a serious threat in a developing country like Bangladesh.

A computer-aided diagnosis process can support effective CKD diagnosis through accurate detection at an early stage. ML is now one of the most important and promising areas in healthcare for analyzing and predicting different diseases and their stages [14]. ML models gain knowledge by exploring large datasets and their features and patterns. In data analysis, the FS strategy is used to select a subset of the most relevant features in the dataset to improve the performance and interpretability of the ML models, while the FSR technique simplifies the feature representation and the overall complexity of the dataset by extracting the principal components [15].

Previous research has shown that choosing the most relevant and useful features can improve early-stage CKD detection. Some researchers have used FS techniques and others have used FSR techniques. However, combining both has not been fully explored, which limits the achievable accuracy while preserving the ML model’s generalization capabilities for clinical CKD diagnosis. Moreover, analyzing tabular healthcare data related to CKD is challenging due to missing or null attributes and categorical values in the dataset. A data encoding methodology is generally well suited for categorical values, but missing or null attribute values require a strategy that takes the dataset’s random nature into consideration. Although existing studies have used various methods to overcome these issues, their effectiveness in dealing with unseen clinical data has not been fully established. In addition, a number of issues remain to be resolved, such as the lack of standardization of CKD and the models’ interpretability, generalizability, and fairness, to ensure their safe use in routine clinical trials [16, 17].

Therefore, this work aims to extend renal disease diagnosis in a clinical setting by effectively utilizing computer intelligence. To achieve this goal, both the FS and FSR techniques are employed in the preprocessing phase. In addition, a data balancing strategy, as well as data encoding and cleaning, is used to account for clinically unseen data that is imbalanced, missing, or noisy. Finally, multiple classification models are incorporated, with adaptive boosting, logistic regression, and passive aggressive being the recommended ML models for CKD analysis due to their ensemble capabilities.

The effectiveness of the proposed intelligent diagnostic system is evaluated on multiple datasets separately. Finally, the clinical CKD detection performance is evaluated on unseen healthcare data collected from Bangladeshi patients. This study improves model performance in clinical CKD detection by effectively handling missing values, imbalanced data, data encoding, feature selection, and dimension reduction. To sum up, the most significant contributions of this work are as follows:
(1) The datasets are analyzed to ensure that no data loss occurs, even in the case of missing values.
(2) The dimension reduction methodology is investigated in order to reduce the feature space; as a result, the model training and testing time can be reduced while simultaneously improving the overall results.
(3) A generalized intelligent diagnostic system is presented to analyze and predict renal disease at an early stage with unseen healthcare data. To the best of our knowledge, this is the first work on CKD prediction with unseen clinical data.
(4) A comprehensive analysis is performed on four different datasets to find the best ML models for CKD analysis.
(5) Adaptive boosting, logistic regression, and passive aggressive techniques are the recommended classifiers for CKD analysis on unseen real-life data due to their robust ensemble capabilities.

The rest of the paper is organized as follows: The related literature is reviewed in Section 2. The proposed methodology is divided into subsections and briefly discussed in Section 3. Data encoding, balancing, cleaning, feature selection, and dimension reduction techniques are discussed in Section 3.1. The experimental analysis is presented in Section 4. Dataset collection and dataset descriptions are given in Section 4.1. In Section 4.3, performance evaluation metrics and experimental results are discussed for the different methods and datasets. Finally, the discussion and conclusion are delivered in Sections 4.4 and 5, respectively.

2. Literature Review

For effective disease classification and prediction, various methodologies are designed and explored. The study [18] examined 12 ML classifiers across four distinct datasets: breast cancer, liver disorders, wine quality, and Indian liver patients. The evaluation primarily focused on accuracy and prediction speed. They concluded that the classifier’s performances are disease specific. However, this study did not elaborate on how the data complexity was handled, and the clinical relevancy was not discussed. As CKD is among the life-threatening diseases that necessitate early detection to enhance patient outcomes, researchers have explored numerous ML algorithms coupled with preprocessing techniques for efficient CKD prediction. A synthetic minority oversampling technique (SMOTE) is employed in [19] to balance the CKD-15 dataset. The authors tested three different FS methods including correlation-based feature selection (CFS) as a filter method, forward feature selection (FFS) as a wrapper method, and the least absolute shrinkage and selection operator (LASSO) feature selection as an embedded feature selection method. The data balancing with SMOTE and FS with LASSO resulted in an increase of 1.39% accuracy compared to using a linear support vector machine (LSVM) with the original dataset. The authors in [20] performed an FS strategy using a genetic algorithm (GA). They achieved the highest accuracy of 99.75% from the multilayer perceptron (MLP) classifier. Different feature-based prediction models were suggested in [21] for detecting kidney disease in which logistic regression with a Chi-square test-based model showed the highest accuracy (98.75%). Similar ML-based models but different applications were analyzed by the authors in [22, 23]. In their work, the gradient boosting-based model was utilized in which their major finding was to utilize the FS and sampling techniques (SMOTE, OneR, etc.) for achieving favorable accuracy. 
A fuzzy-based intelligent system that incorporated fuzzification, implication, and defuzzification was proposed in [24] for CKD analysis. The authors used IF-THEN rules to develop the knowledge base for a fuzzy inference system. A summary of data imbalance analysis was presented by the authors of [25]. The study investigated 23 class imbalance techniques (resampling and hybrid systems) with three ML classifiers, namely random forest (RF), logistic regression (LR), and linear support vector classifier (LinearSVC), to identify the most suitable imbalance handling method for the medical dataset. They found that class imbalance learning can significantly improve classification, with random oversampling (ROS) and RF delivering the best results.

Several other FS methods have been explored to identify the most relevant features. The L1-regularized FS technique has been explored in [26] to classify microarray cancer data with improved performance. The authors in [27] applied L1-norm-based and chi-square-based FS strategies to classify breast cancer. In other CKD studies [28–30], principal component analysis (PCA) is utilized to extract noteworthy features from the dataset. The authors of [28] extracted 19 features using PCA and achieved the highest accuracy of 98% using the support vector machine (SVM) classifier. Other classifiers such as LR, naive Bayes (NB), and k-nearest neighbor (KNN) also demonstrated noteworthy performance. The study [31] utilizes PCA, discriminant analysis (DA), and LR to extract features from the breast cancer dataset. While achieving notable accuracy with a hybrid feature extraction technique, discriminant logistic (DA-LR), the study failed to discuss the data complexity, such as data balancing and cleaning issues. The authors in [32] performed their experimental analysis on the CKD-15 dataset without employing a feature optimization strategy. Despite this, they were able to attain an accuracy of 97.25% using MLP as the classifier, 96.5% using LR, and 95.75% using NB. The highest accuracy of 98.25% was achieved using SVM as the classifier. Other studies [30, 33] handled nominal attributes and examined the effect of the feature selection strategy on performance. The nominal attributes were transformed into binary attributes, and a best-fit feature selection (BFFS) method was then applied. According to their findings, SVM and KNN outperformed LR and decision tree (DT) classifiers, with accuracy rates of 98.3% and 98.1%, respectively. Non-numerical data of the CKD-15 dataset were transformed into binary data in the study [34]. 
The authors aimed to identify the most significant clinical test attributes by using SHapley Additive exPlanations (SHAP) values and reducing the number of attributes to a minimum for optimal clinical testing and high CKD detection accuracy. Among the tested classifiers, the RF achieved the highest accuracy of 99.5%, while gradient boosting (GB), extreme gradient boosting (XGB), LR, and SVM also performed well with high accuracy.

To handle the missing values in the CKD dataset, the authors in [28, 30, 34, 35] replaced the missing values with the mean value. In [3], missing values were handled with the mean, median, and mode values of the attributes, and null values were also dropped. The authors in [36] utilized mutual information measures (MIMs) for feature selection and replaced missing values through multiple imputations while analyzing kidney disease. The authors in [37] used the median technique to replace the missing values. Other studies [38, 39] replaced the missing values with 0. With an arbitrary selection of 14 attributes, the top accuracy of 99.1% was achieved by decision forest (DF), while a neural network (NN) reached 97.5% [38]. Other authors [35] selected 13 out of 24 attributes for classification, and the results showed that adaptive boosting (ADAB) achieved a prediction accuracy of 99% while the extra-tree classifier (ETC) obtained 98% accuracy. The authors in [39] considered 21 attributes from the CKD-15 dataset. During the classification phase, the DF achieved the highest prediction accuracy (99.17%) in predicting three different potassium zones, compared with 89.17% for LR and 82.15% for NN. The authors in [28] handled the categorical variables by converting them to corresponding numerical values using the one-hot encoding technique. They found the best performance of 98.0% accuracy using an SVM classifier. In a study [1], the attributes with more than 20% missing values were removed from the dataset, and the remaining missing values were filled using KNN imputation. The authors then selected features based on statistical significance, medical importance, and test data availability. Eleven ML algorithms were evaluated, and four classifiers (DT, RF, ETC, and ADAB) showed 100% accuracy.

The comprehensive literature review highlights various techniques and approaches employed in disease prediction, particularly early CKD diagnosis, and reveals common data preprocessing techniques such as nominal-to-binary transformation and one-hot encoding for categorical variables. Missing data are handled with methods such as mean, median, or mode imputation, multiple imputation, or replacement with 0. Existing studies often focus solely on either feature selection or feature reduction. For CKD prediction, popular methods included CFS, FFS, LASSO, GA, Chi-square, BestFit, SHAP, MIM, and PCA, while L1-regularized and L1-norm-based feature selection were used for cancer classification. Although these studies demonstrated high accuracy on the datasets that were split for training and testing, a critical gap emerged. None of them combined feature selection and reduction methods to improve model performance, particularly in better handling clinical CKD data complexity. In addition, they did not assess the performance of their models on real-world, unseen clinical CKD data to provide patient-centric CKD solutions at the initial phase. These gaps raise the need for an improved automated diagnostic system for CKD detection. This study aims to address them by introducing a novel methodology tailored to enhance CKD diagnostic accuracy and handle data complexity in a patient-centric manner.

3. Proposed Methodology

Preprocessing and classification are the two parts of the proposed methodology. In the preprocessing step, data encoding, balancing, cleaning, feature selection, and dimension reduction approaches were implemented to properly train the ML algorithms. The entire block diagram of the proposed methodology is shown in Figure 1.

3.1. Preprocessing

The datasets contain a mixture of numerical, categorical, nominal, and missing values, so the data are preprocessed to address the issues with categorical, nominal, and missing data. Before the preprocessing phase, the “Affected” attribute is manually removed from the data so that the processed features cannot be affected by the class variable.

3.1.1. Categorical Variable

Variables with two or more categories but without intrinsic ordering of the categories are known as categorical variables, often also called nominal variables [40]. Categorical variables are data that may be divided into groups. Examples of categorical variables include age group, sex, race, and educational level.

3.1.2. Data Encoding

Data encoding is the process of converting data or a given sequence of characters, symbols, alphabets, etc., into a specific format that can be processed by a computer system or application. The purpose of data encoding is to transform the data into a standard format. This study utilized the label encoding or ordinal encoding technique to complete this task. All the non-numerical (nominal categorical variables) labels are mapped to numerical labels using this encoding (Table 1).
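As a minimal sketch of the label encoding step described above, the following Python snippet maps each distinct nominal label to an integer code. The helper names (`fit_label_encoder`, `encode`) and the toy "appetite" column are illustrative assumptions, not the study's actual code or the mapping in Table 1.

```python
def fit_label_encoder(values):
    """Assign an integer code to each distinct label, in sorted order."""
    return {label: code for code, label in enumerate(sorted(set(values)))}


def encode(values, mapping):
    """Replace every label with its integer code."""
    return [mapping[value] for value in values]


# Hypothetical nominal column, used only for illustration.
appetite = ["good", "poor", "good", "good", "poor"]
mapping = fit_label_encoder(appetite)
encoded = encode(appetite, mapping)
print(mapping, encoded)
```

Sorting the labels before numbering them keeps the mapping reproducible across runs, which matters when the same encoding has to be reapplied to unseen data.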

3.1.3. Data Balancing

Data balancing is a procedure in which the number of instances per class is equalized using a data balancing technique. This analysis used two datasets, CKD-15 and CKD-21, both of which were imbalanced. CKD-15 contains 250 CKD and 150 non-CKD instances, and CKD-21 contains 78 CKD and 122 non-CKD instances. To address this imbalance and prevent potential bias and poor model generalization, the ROS technique was employed to increase the number of minority class instances. ROS duplicates minority class examples randomly, ensuring an equal representation of CKD and non-CKD instances in both datasets.

Table 2 shows the data imbalance for both datasets. Table 3 shows the amount of data after balancing the data using the sampling technique.
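Random oversampling as described above can be sketched in a few lines of Python. This is an illustrative implementation under simple assumptions (in-memory lists, sampling with replacement, a fixed seed); the function name `random_oversample` and the toy class counts are hypothetical and do not reproduce the figures in Tables 2 and 3.

```python
import random


def random_oversample(rows, labels, seed=0):
    """Duplicate randomly chosen minority rows until all classes are equal in size."""
    rng = random.Random(seed)
    by_class = {}
    for row, label in zip(rows, labels):
        by_class.setdefault(label, []).append(row)
    target = max(len(members) for members in by_class.values())
    out_rows, out_labels = [], []
    for label, members in by_class.items():
        # Draw the missing number of rows with replacement from this class.
        extra = [rng.choice(members) for _ in range(target - len(members))]
        for row in members + extra:
            out_rows.append(row)
            out_labels.append(label)
    return out_rows, out_labels


# Toy data: 5 majority rows (label 1) and 3 minority rows (label 0).
rows = [[i] for i in range(8)]
labels = [1, 1, 1, 1, 1, 0, 0, 0]
balanced_rows, balanced_labels = random_oversample(rows, labels)
print(balanced_labels.count(1), balanced_labels.count(0))
```

Because the added rows are exact duplicates, oversampling is normally applied to training data only, so that copies of one record do not end up on both sides of a train/test split.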

3.1.4. Data Cleaning

Missing entries are common in clinical CKD data because medical assistants must handle a large number of CKD patients within a limited time. However, simply removing instances with missing data can hinder accurate classification by ML models. In addition, to ensure accurate and reliable outcomes, it is crucial to avoid bias and data distortion caused by incomplete or erroneous data. This necessitates employing a data imputation technique tailored to the specific disease characteristics. Here, because the missing values were distributed randomly, the study filled each missing entry with the mean value of the corresponding attribute. This preserves the statistical properties of the dataset while supporting accuracy and reliability in subsequent analyses.
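Mean imputation of this kind can be illustrated with a short Python sketch. It assumes missing entries are represented as `None` and that every column has at least one observed value; the function name `impute_column_means` and the toy rows are hypothetical.

```python
def impute_column_means(rows):
    """Fill each missing entry (None) with the mean of the observed values in its column."""
    n_cols = len(rows[0])
    means = []
    for j in range(n_cols):
        observed = [row[j] for row in rows if row[j] is not None]
        means.append(sum(observed) / len(observed))
    return [
        [means[j] if row[j] is None else row[j] for j in range(n_cols)]
        for row in rows
    ]


rows = [[1.0, 10.0], [None, 20.0], [3.0, None]]
filled = impute_column_means(rows)
print(filled)
```

No row is dropped, so the number of instances is unchanged after cleaning.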

3.1.5. Feature Selection

Because a growing number of features creates computational overhead and increases the risk of model overfitting, FS offers a solution [41]. The FS strategy reduces the input variables by using only relevant data and eliminating unnecessary and noisy data [42]. It is an automatic process for choosing relevant features. A significant advantage of this technique is that it reduces overfitting [43]. Regularization is a useful technique for reducing model complexity and for feature selection [26]. The penalty “L1” (Lasso regularization) and solver “liblinear” are used here with the “LogisticRegression” method to select essential features based on the importance weights. The L1 penalty shrinks the coefficients by penalizing their absolute values in the cost function. To minimize the cost function, the model sets the weights of some features to zero, and a total of 13 features are chosen for CKD-15, CKD-21, hybrid, and unseen clinical data.
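The L1-based selection described above might be sketched with scikit-learn as follows. The synthetic data, the regularization strength `C=0.1`, and the use of `SelectFromModel` are assumptions made for illustration; the study's exact settings beyond the “L1” penalty and the “liblinear” solver are not stated here.

```python
import numpy as np
from sklearn.feature_selection import SelectFromModel
from sklearn.linear_model import LogisticRegression
from sklearn.preprocessing import StandardScaler

# Synthetic stand-in for an encoded dataset: 200 samples, 24 features,
# where only the first three features influence the label.
rng = np.random.default_rng(0)
X = rng.normal(size=(200, 24))
y = (X[:, 0] + 2.0 * X[:, 1] - 1.5 * X[:, 2] > 0).astype(int)

X_scaled = StandardScaler().fit_transform(X)

# The L1 penalty drives the coefficients of uninformative features to zero.
l1_model = LogisticRegression(penalty="l1", solver="liblinear", C=0.1)
selector = SelectFromModel(l1_model).fit(X_scaled, y)

selected = np.flatnonzero(selector.get_support())
X_selected = selector.transform(X_scaled)
print("kept feature indices:", selected)
```

Lowering `C` strengthens the penalty and keeps fewer features, so in practice the value would be tuned until the retained subset has the desired size.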

3.1.6. Dimension Reduction

Dimension reduction is a process that reduces the feature space to the most relevant feature space [15, 44] while preserving the maximum amount of relevant information from the actual data. This technique can enormously reduce the time complexity of the ML algorithm’s training phase, and it does not degrade ML model performance [45]. Among other dimension reduction techniques including PCA, singular value decomposition (SVD), linear discriminant analysis (LDA), and generalized discriminant analysis (GDA), an unsupervised ML technique, PCA, is employed here due to its effectiveness and popularity in feature reduction particularly in CKD analysis. PCA employs mathematical principles to reduce a large number of potentially correlated variables to a smaller number of variables (lower dimension), which are referred to as principal components [46]. This investigation utilized PCA as the dimension reduction strategy in four ways to prepare 4 categories of datasets.

For effective PCA analysis, the features of the datapoints “X” are standardized through mean removal and scaling to unit variance using equation (1) to ensure equal feature scaling:

\[ Z = \frac{X - \mu}{\sigma} \tag{1} \]

Here, \(\mu\) and \(\sigma\) are the mean and standard deviation of each feature.

To determine the direction in which the features are most correlated, the covariance matrix “\(C\)” is calculated using equation (2). It is a square matrix with dimensions equal to the number of features, and each element in the matrix represents the covariance between two features, indicating their linear association.

\[ C = \frac{1}{N-1} Z^{T} Z \tag{2} \]

Here, N = number of samples in the dataset.

The eigenvectors and eigenvalues of the “\(C\)” matrix are computed from equation (3). They determine the directions in which the features are most varied and the amount of variance explained by each component. The eigenvalues and their corresponding eigenvectors are sorted in descending order, and the eigenvectors with the largest eigenvalues are taken as the principal components for projecting the data onto a lower-dimensional space.

\[ (C - \lambda I)\,v = 0 \tag{3} \]

Here, \(\lambda\) is the eigenvalue, I is the identity matrix, and \(v\) is the eigenvector.

Then, the “k” largest eigenvalues are chosen, and their corresponding eigenvectors form the matrix used in the projection step to reduce the data dimensionality, as in equation (4):

\[ W = [\, v_1 \;\; v_2 \;\; \cdots \;\; v_k \,] \tag{4} \]

In the experiment, the study used k = 2 for CKD-15, k = 7 for CKD-21, k = 3 for hybrid data, and k = 10 for clinical unseen data.

Finally, the new feature vectors are calculated from equation (5). In the experimental analysis, the “k” values are chosen such that they are minimal and outperform the existing models.

\[ Y = Z\,W \tag{5} \]
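The PCA steps above (standardize, compute the covariance matrix, take its eigen-decomposition, keep the top k eigenvectors, project) can be sketched with NumPy as follows. This is an illustrative implementation on synthetic data, not the study's code; the function name `pca_project` is hypothetical, and `np.cov` is assumed to be an acceptable covariance estimator (it divides by N − 1).

```python
import numpy as np


def pca_project(X, k):
    """Project X onto its first k principal components."""
    # Step 1: mean removal and scaling to unit variance.
    Z = (X - X.mean(axis=0)) / X.std(axis=0)
    # Step 2: covariance matrix of the standardized features.
    C = np.cov(Z, rowvar=False)
    # Step 3: eigen-decomposition (eigh is suited to symmetric matrices).
    eigvals, eigvecs = np.linalg.eigh(C)
    order = np.argsort(eigvals)[::-1]  # sort in descending order
    # Step 4: keep the eigenvectors of the k largest eigenvalues.
    W = eigvecs[:, order[:k]]
    # Step 5: project the standardized data.
    return Z @ W, eigvals[order]


rng = np.random.default_rng(0)
X = rng.normal(size=(100, 5))
X[:, 1] = 2.0 * X[:, 0] + 0.1 * rng.normal(size=100)  # make two features correlated
Y, eigvals = pca_project(X, k=2)
print(Y.shape, eigvals.round(3))
```

The ratio of each sorted eigenvalue to their sum gives the share of variance explained by that component, which is one common way to judge how small k can be.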

3.2. Classification

The advantage of the ML algorithm is that it can adapt to various cases by observing them. Throughout this paper, twelve supervised classification ML algorithms are picked and compared to detect CKD early on in different scenarios.

3.2.1. Logistic Regression

The logistic regression model estimates the possibility of an event within a particular class [47].

LR is commonly used for binary classification, although its name includes “regression.” A decision threshold is set on the predicted probability to assign the data class. The sigmoid activation function is used here to compute this classification probability. The mathematical model of the algorithm can be denoted as in equation (6):

\[ p_i = \frac{1}{1 + e^{-\left(\beta_0 + \sum_{j=1}^{M} \beta_j x_{ij}\right)}} \tag{6} \]

where i = 1 to N (number of observations), j = 1 to M (number of individual variables), \(p_i\) = probability of “1” at observation i, \(\beta_0\) = intercept, \(\beta_j\) = regression coefficient, and \(x_{ij}\) = the jth variable at observation i.
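To make the role of the sigmoid and the decision threshold concrete, the following Python sketch computes the probability of class “1” for a single observation and then thresholds it. The coefficient values, the intercept, and the 0.5 threshold are made-up illustrations, not fitted parameters from the study.

```python
import math


def predict_proba(x, coefficients, intercept):
    """Sigmoid of the linear combination of the features."""
    z = intercept + sum(b * v for b, v in zip(coefficients, x))
    return 1.0 / (1.0 + math.exp(-z))


def predict_class(x, coefficients, intercept, threshold=0.5):
    """Assign class 1 when the probability reaches the decision threshold."""
    return int(predict_proba(x, coefficients, intercept) >= threshold)


# Hypothetical coefficients for a two-feature model.
coefficients = [0.5, -0.25]
p = predict_proba([1.0, 2.0], coefficients, intercept=0.0)
print(p, predict_class([0.0, 4.0], coefficients, intercept=0.0))
```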

3.2.2. Decision Tree

The basic goal of the decision tree algorithm is to generate a prediction model from a set of training data sets to predict classes or values of target variables. The DT algorithm is structured like a tree, with leaves, branches, and roots. When compared to other classification algorithms, the DT algorithm is simple to grasp.

3.2.3. Random Forest

This algorithm creates multiple decision trees during training and outputs the class that is the mode of the classes predicted by the individual trees [48]. The method decorrelates the trees by building a large number of decision trees on bootstrapped samples from the training dataset, considering only a few of the feature columns for each bootstrapped sample. Gini impurity is used in the experiment, with a maximum tree depth of ten and a maximum of ten leaf nodes per tree. Predictions for unknown data after training can be defined as in equation (7):

\[ \hat{f} = \frac{1}{B} \sum_{b=1}^{B} f_b(x') \tag{7} \]

where B = optimal number of trees and \(f_b(x')\) = prediction from the decision tree b for the unknown sample \(x'\).

Also, the uncertainty of the prediction is defined by equation (8):

\[ \sigma = \sqrt{\frac{\sum_{b=1}^{B} \left( f_b(x') - \hat{f} \right)^2}{B - 1}} \tag{8} \]
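As a small worked illustration of aggregating per-tree outputs, the sketch below averages the predictions of individual trees and reports their sample standard deviation as a measure of disagreement. Treating the uncertainty as a standard deviation over the trees is an assumption of this sketch, and the four tree outputs are made up.

```python
import statistics


def forest_predict(tree_predictions):
    """Return the mean prediction over the trees and the spread between them."""
    mean = statistics.fmean(tree_predictions)
    spread = statistics.stdev(tree_predictions)  # divides by (number of trees - 1)
    return mean, spread


# Hypothetical outputs of four trees for one unseen sample (1 = CKD, 0 = not CKD).
mean, spread = forest_predict([1, 1, 0, 1])
print(mean, spread)
```

Here three of four trees vote for class 1, so the averaged prediction is 0.75, and the nonzero spread signals that the trees do not fully agree.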

3.2.4. Passive Aggressive Classifier

The passive aggressive classifier (PAC) is one of the online learning algorithms in ML. It responds passively to correct classifications and aggressively to any misclassification. It is generally well suited to large-scale learning. In contrast to batch learning, where the entire training dataset is utilized at once, input in online ML algorithms is received sequentially, and the ML model is gradually updated. Fifty passes over the training data are used in the experiment.
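The passive and aggressive behaviour can be seen in a minimal sketch of the basic passive-aggressive update with hinge loss: the weights are left untouched when a sample is classified with sufficient margin and are otherwise moved just enough to correct it. This is the plain variant without an aggressiveness cap, written for illustration only; labels are assumed to be −1 or +1, and the toy data and function names are hypothetical.

```python
def pa_fit(samples, labels, n_passes=50):
    """Basic passive-aggressive training; labels must be -1 or +1."""
    w = [0.0] * len(samples[0])
    for _ in range(n_passes):
        for x, y in zip(samples, labels):
            margin = y * sum(wi * xi for wi, xi in zip(w, x))
            loss = max(0.0, 1.0 - margin)  # hinge loss
            sq_norm = sum(xi * xi for xi in x)
            if loss > 0.0 and sq_norm > 0.0:
                # Aggressive step: smallest change that removes the loss.
                tau = loss / sq_norm
                w = [wi + tau * y * xi for wi, xi in zip(w, x)]
            # Otherwise the step is passive: w is unchanged.
    return w


def pa_predict(w, x):
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0.0 else -1


samples = [[2.0, 1.0], [1.0, 2.0], [-1.0, -2.0], [-2.0, -1.0]]
labels = [1, 1, -1, -1]
w = pa_fit(samples, labels)
print(w, [pa_predict(w, x) for x in samples])
```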

3.2.5. Support Vector Machines

The SVM is built upon a statistical learning framework, providing solutions for both regression and classification problems [49]. SVM can categorize both linear and nonlinear datasets by using the kernel trick. Because the decision function uses only a subset of the training points (called support vectors), it is also computationally efficient. The prediction function of an SVM classifier can be described by equation (9). The “rbf” kernel and regularization parameter “1” are used in the experiment.

\[ f(x) = \sum_{i \in S} \alpha_i K(x_i, x) + b \tag{9} \]

where x = new data point, \(b\) = bias, S = set of support vectors, \(\alpha_i\) = corresponding weights of the training data \(x_i\), \(K\) = kernel function, and \(x_i\) are support vectors in the training data.

3.2.6. K-Nearest Neighbor

KNN is the most straightforward supervised ML algorithm [50]. A distance is calculated to determine similarities with other instances. For example, the closest data point to the point under observation is thought to be the most appropriate for the data point. There are numerous distance metrics for calculating the nearest point, such as Euclidean, Hamming, Manhattan, Cosine, Jaccard, and Minkowski distances. In the experiment, 7 neighbors are used with the Euclidean distance given in equation (10):

\[ d(p, q) = \sqrt{\sum_{i=1}^{n} (q_i - p_i)^2} \tag{10} \]

Here, \(p\) and \(q\) are two points in the space, \(p_i\) and \(q_i\) are the ith coordinates of points p and q, and n is the number of dimensions.
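A compact Python sketch of KNN classification with the Euclidean distance and 7 neighbors, as used in the experiment, is given below. The two toy clusters and the function names are illustrative assumptions; ties between classes are not handled specially.

```python
import math
from collections import Counter


def euclidean(p, q):
    """Euclidean distance between two points of equal dimension."""
    return math.sqrt(sum((qi - pi) ** 2 for pi, qi in zip(p, q)))


def knn_predict(train_points, train_labels, query, k=7):
    """Majority vote among the k training points closest to the query."""
    nearest = sorted(
        zip(train_points, train_labels),
        key=lambda pair: euclidean(pair[0], query),
    )[:k]
    votes = Counter(label for _, label in nearest)
    return votes.most_common(1)[0][0]


# Two toy clusters: label 0 near the origin, label 1 near (5, 5).
train_points = [(0, 0), (0, 1), (1, 0), (1, 1), (0.5, 0.2),
                (5, 5), (5, 6), (6, 5), (6, 6), (5.5, 5.2)]
train_labels = [0] * 5 + [1] * 5
print(knn_predict(train_points, train_labels, (0.4, 0.6)))
```

With k = 7 and five points per cluster, the vote near the origin is five against two, so the query is assigned label 0.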

3.2.7. Gradient Boosting

This boosting classifier [51] is also used to estimate the prediction performance. The primary stages of a GB classifier are computing the error residual and fitting a regression predictor that learns to predict that residual. Additive models are usually utilized, and weak learners are added to optimize the loss function. For weak learners, decision trees (regression trees) are employed.

3.2.8. Naive Bayes

The NB is a probabilistic supervised algorithm that assumes independence of the features when classifying data [52]. The method works effectively for datasets with a significant number of input variables. It takes all available features, including weak features, into account in the final prediction. The probabilistic naive Bayes ML model can be stated as equation (11), where A and B are two events:

\[ P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)} \tag{11} \]

3.2.9. Stochastic Gradient Descent

The word “stochastic” denotes a system or process connected with a random probability. Hence, in stochastic gradient descent (SGD), samples are selected randomly for each iteration instead of using the whole dataset. To perform each iteration, SGD uses only a single sample, i.e., a batch size of one. The samples are shuffled randomly, and one is picked for executing the iteration. To train the model, L1 regularization and 20 epochs are used.

3.2.10. Multilayer Perceptron

A multilayer perceptron is considered to be the most significant class of feed-forward artificial neural networks (ANNs) and is made up of several layers of perceptrons [52]. The network contains at least three layers: an input layer, an output layer, and at least one hidden layer. This experiment used the sigmoid activation function and the “lbfgs” solver, which is an optimizer in the family of quasi-Newton methods.

3.2.11. Adaptive Boosting

The adaptive boosting algorithm, also known as AdaBoost, is an ensemble ML technique that merges a number of weak classifiers to form a stronger classifier and increase the classification performance [53]. The model is improved by fitting additional copies of the classifier on the same dataset, with the weights of incorrectly classified samples adjusted so that subsequent classifiers focus on them; the weighted classifiers together form the final output of the boosted classifier.

3.2.12. Extreme Learning Machine

An extreme learning machine (ELM) is a single-hidden-layer feed-forward neural network that solves problems by finding the minimum norm least-square (MNLS) solution of a linear system [54]. It provides good generalization performance and trains extremely fast because the solution is obtained in a single iteration. The Moore–Penrose generalized inverse is used to set its output weights. In this experiment, 150 hidden nodes with the sigmoid activation function are used. The output of this model is calculated using equation (12):

\[ f(x) = \sum_{i=1}^{L} \beta_i\, h_i(x) \tag{12} \]

Here, \(x\) represents the input feature vector, \(L\) is the number of hidden nodes, and the prediction is made by summing the product of the weights “\(\beta_i\)” and the activation function “\(h_i(x)\)” for each hidden node “i” in the hidden layer.
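The single-step training of an ELM can be sketched with NumPy as follows: the hidden layer weights are drawn at random and kept fixed, and only the output weights are solved for with the Moore–Penrose pseudoinverse. The random initialization scheme, the synthetic data, and the function names are assumptions of this sketch; only the 150 hidden nodes and the sigmoid activation come from the text.

```python
import numpy as np


def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))


def elm_fit(X, y, n_hidden=150, seed=0):
    """Random hidden layer; output weights from the pseudoinverse."""
    rng = np.random.default_rng(seed)
    W = rng.normal(size=(X.shape[1], n_hidden))  # fixed random input weights
    b = rng.normal(size=n_hidden)                # fixed random biases
    H = sigmoid(X @ W + b)                       # hidden layer outputs
    beta = np.linalg.pinv(H) @ y                 # minimum norm least-squares solution
    return W, b, beta


def elm_predict(X, W, b, beta):
    return sigmoid(X @ W + b) @ beta


# Synthetic binary problem used only to exercise the code.
rng = np.random.default_rng(1)
X = rng.normal(size=(200, 4))
y = (X[:, 0] + X[:, 1] > 0).astype(float)

W, b, beta = elm_fit(X, y)
scores = elm_predict(X, W, b, beta)
accuracy = np.mean((scores >= 0.5) == (y == 1.0))
print("training accuracy:", accuracy)
```

The accuracy printed here is measured on the training data and therefore says nothing about generalization; it only confirms that the least-squares fit works.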

4. Performance Evaluation

Identifying the best ML classifier on statistical grounds alone is difficult because it depends on the type of application and the data format. Therefore, this work focuses on experimentally validating all ML models for CKD analysis. Conclusions about the most effective models for the application are drawn from results on both balanced and imbalanced data.

4.1. Dataset Description

To substantiate the clinical relevancy of this study and demonstrate the effectiveness of ML techniques enhanced by feature selection and reduction, this work employed two distinct datasets: the chronic kidney disease dataset, 2015 (CKD-15) [55] and the chronic kidney disease dataset, 2021 (CKD-21) [56]. These datasets represent diverse patient clinical data and were obtained from the “UCI Machine Learning Repository.”

4.1.1. CKD-15 Dataset

The CKD-15 dataset [55] comprises clinical data collected from the southern part of India, with patient ages ranging between 2 and 90 years. This dataset encompasses 400 instances, which are classified into two distinct categories: “ckd” (chronic kidney disease) and “notckd” (without chronic kidney disease). Each sample has 25 attributes (11 numeric and 14 nominal), of which 24 are predictive variables and one is the class label. Notably, the dataset exhibits a significant class imbalance, with 250 instances classified as CKD and 150 as not-CKD.

4.1.2. CKD-21 Dataset

The CKD-21 dataset [56] comprises real-world patient data collected from Enam Medical College, Savar, Dhaka, Bangladesh. It consists of 200 samples, including 78 cases classified as “ckd” and 122 cases as “notckd.” The dataset contains a total of 29 attributes, which are of three types: (i) numerical values, (ii) categorical values, and (iii) nominal values. Within these 29 feature sets, the target values are represented in two specific features, denoted as “class” and “affected.”

Both datasets have a significant number of missing values, especially in the CKD-15 dataset. Tables 4 and 5 contain a description of the attributes with the necessary information for the CKD-15 and CKD-21 datasets, respectively. To apply ML algorithms, data must be well structured and reliable.

4.2. Training and Testing

To train and test the proposed model, two datasets, namely, CKD-15 and the real-world clinical dataset CKD-21 are used in four ways. The extensive experimentation on different datasets ((i) CKD-15, (ii) CKD-21, (iii) hybrid, and (iv) unseen clinical cases) with the combination of multiple evaluation metrics strengthens the validity of the work and demonstrates the proposed model’s generalization capability for early CKD prediction in a clinical setting. Furthermore, validating the model on clinically unseen data highlights its clinical relevance for CKD detection. In the CKD-21 dataset, “Affected” and “Class” attributes have the same meaning.

As the two datasets have different dimensions, PCA is used to bring them to the same number of dimensions for the hybrid and unseen cases.

(1) For both CKD-15 and CKD-21, the model was trained with 70% of the data and tested with the remaining 30%, as depicted in Figure 2(a).

(2) A hybrid dataset is created from both the CKD-15 and CKD-21 datasets. To build it, the datasets must share the same feature space. As they do not, this analysis transformed both datasets to a common dimension using PCA, configured to retain 3 components for each dataset. The transformed CKD-15 and CKD-21 datasets are then concatenated vertically (row-wise) to create a new dataset. The diversity inherent in hybrid datasets significantly enhances the generalization capabilities of ML models, which is a crucial aspect when tackling real-world applications. The ML models are trained on 70% of the sample data and tested on the remaining 30%, as shown in Figure 2(a).

(3) To evaluate the ML models on clinically unknown patient data, the study transformed the feature space of both datasets to 10 components using PCA. As Figure 2(b) shows, the model is trained on the CKD-15 dataset (i.e., 503 samples) and tested on the real clinical CKD-21 dataset (i.e., 256 samples) for clinical analysis of the unseen data.
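
The construction of the hybrid dataset can be sketched as follows. This is an illustrative sketch rather than the study's implementation: it assumes that PCA is fitted separately on each dataset, it uses NumPy's SVD in place of whatever PCA routine was actually used, and it substitutes random synthetic arrays for the real data. The feature counts (24 and 27) and the helper name `pca_project` are placeholders.

```python
import numpy as np


def pca_project(X, n_components):
    """Project X onto its top principal components via SVD."""
    X_centered = X - X.mean(axis=0)
    # Rows of Vt are the principal axes, ordered by explained variance.
    _, _, Vt = np.linalg.svd(X_centered, full_matrices=False)
    return X_centered @ Vt[:n_components].T


rng = np.random.default_rng(0)

# Synthetic stand-ins with the sample counts from Section 4.1; the feature
# counts (24 and 27) are placeholders, not the preprocessed feature sets.
X15, y15 = rng.normal(size=(400, 24)), rng.integers(0, 2, size=400)
X21, y21 = rng.normal(size=(200, 27)), rng.integers(0, 2, size=200)

# Reduce each dataset separately to 3 components, then stack row-wise.
Z15 = pca_project(X15, n_components=3)
Z21 = pca_project(X21, n_components=3)
X_hybrid = np.vstack([Z15, Z21])
y_hybrid = np.concatenate([y15, y21])

print(X_hybrid.shape, y_hybrid.shape)
```

Because each PCA is fitted independently in this sketch, the k-th component of one dataset need not correspond to the k-th component of the other; the code only mirrors the dimensional alignment described in the text.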

All three datasets (CKD-15, CKD-21, and hybrid) were additionally split using a random state argument to ensure a nonoverlapping and unbiased evaluation of the proposed approach on all datasets. This approach helps maintain the integrity of the testing process and ensures the generalizability of the model’s performance.
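
A seeded 70/30 split of this kind can be sketched with the standard library alone. The seed value 42 and the helper name `split_indices` are assumptions made for illustration; the study does not report which random state or splitting routine was used.

```python
import random


def split_indices(n_samples, test_fraction=0.3, random_state=42):
    """Shuffle indices with a fixed seed and split them into train/test."""
    indices = list(range(n_samples))
    random.Random(random_state).shuffle(indices)
    n_test = round(n_samples * test_fraction)
    return indices[n_test:], indices[:n_test]


# 400 is the CKD-15 instance count from Section 4.1.1.
train_idx, test_idx = split_indices(400)
print(len(train_idx), len(test_idx))
print(set(train_idx).isdisjoint(test_idx))
```

Fixing the seed makes the partition reproducible across runs, and drawing both subsets from a single shuffled index list guarantees that no sample appears in both the training and the test set.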

4.3. Experimental Analysis

This work used PCA as a dimension reduction technique to mitigate overfitting in ML models, improve computational efficiency, and enhance the model’s generalization capability, thereby reinforcing its clinical relevance. The use of multiple metrics and datasets provides a holistic assessment and reduces the likelihood of biased results. Previous studies suggest evaluating multiple classifiers comprehensively on multiple datasets using a broad set of evaluation metrics: recall, true negative rate (TNR), positive predictive value (PPV), F1-score, area under the receiver operating characteristic curve (ROC-AUC), and accuracy. These metrics are commonly used in medical and healthcare-related studies to understand each classifier’s performance from different aspects, and they are particularly relevant to early-stage CKD detection, where sensitivity, specificity, and diagnostic accuracy are critical. Though cross-validation is a common and widely used technique for evaluating ML models, it may not be suitable for the unseen and hybrid datasets due to their unique characteristics. For instance, cross-validation on the clinical unseen dataset might not provide meaningful insights, as this experiment aims to simulate real-world clinical scenarios by testing the model on entirely unseen data. Similarly, for the hybrid dataset, it may introduce biases due to the combination of datasets with varying characteristics. All experiments were run on Google’s cloud platform using a “Colab Notebook.”
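
The confusion-matrix metrics named above follow directly from the counts of true and false positives and negatives. The sketch below uses their standard definitions; the toy labels are invented for illustration and are not taken from the study's results. ROC-AUC is omitted because it requires predicted scores rather than hard labels.

```python
def binary_metrics(y_true, y_pred, positive=1):
    """Compute the confusion-matrix metrics named in Section 4.3."""
    pairs = list(zip(y_true, y_pred))
    tp = sum(t == positive and p == positive for t, p in pairs)
    tn = sum(t != positive and p != positive for t, p in pairs)
    fp = sum(t != positive and p == positive for t, p in pairs)
    fn = sum(t == positive and p != positive for t, p in pairs)

    recall = tp / (tp + fn) if tp + fn else 0.0   # sensitivity
    tnr = tn / (tn + fp) if tn + fp else 0.0      # specificity
    ppv = tp / (tp + fp) if tp + fp else 0.0      # precision
    f1 = 2 * ppv * recall / (ppv + recall) if ppv + recall else 0.0
    accuracy = (tp + tn) / len(pairs)
    return {"recall": recall, "tnr": tnr, "ppv": ppv, "f1": f1,
            "accuracy": accuracy}


# Toy labels (1 = ckd, 0 = notckd), invented purely for illustration.
y_true = [1, 1, 1, 1, 0, 0, 0, 0, 0, 0]
y_pred = [1, 1, 1, 0, 0, 0, 0, 0, 0, 1]
print(binary_metrics(y_true, y_pred))
```

In this toy case one CKD patient is missed and one healthy patient is flagged, so accuracy is 0.8 while recall is 0.75, which illustrates why accuracy alone can hide missed diagnoses.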

As Table 6 shows, eleven ML models (i.e., ADAB, DT, ELM, GB, KNN, LR, MLP, PAC, RF, SGD, and SVM) with PCA achieved 100% test accuracy and a ROC-AUC value of exactly 1 on the CKD-15 dataset. Although three ML algorithms (ADAB, DT, and RF) achieve 100% accuracy without PCA, the performance of the other nine ML models degrades. The weakest results without PCA are observed for the KNN, PAC, SGD, and SVM classifiers, with test accuracy ranging from 53.64% to 70.2%, where PAC shows the lowest test accuracy of 53.64%. For the CKD-21 dataset, Table 7 shows seven ML models (i.e., ADAB, DT, ELM, GB, KNN, SVM, and RF) with PCA achieving 100% accuracy on test data, with the lowest accuracy (94.81%) obtained by the PAC classifier. Though the SGD model’s accuracy is not perfect, it has a perfect ROC-AUC value for the CKD-21 dataset, whereas the ELM classifier’s ROC-AUC value is degraded to 0.729. The proposed model without PCA could not fit the ELM, SVM, and PAC learning models well; hence, the overall model’s performance degraded, with test accuracy ranging from 49.35% to 54.55%. The MLP model performs the best for the hybrid dataset. Table 8 reports the best test accuracy of 99.12% for the MLP model and the best ROC-AUC value of 0.996 for the LR model, though the latter achieves a test accuracy of 96.93%. The ML classifiers DT, GB, RF, and SGD show an equal test accuracy of 97.81% and ROC-AUC of 0.978. In the clinical unseen dataset, the ADAB, LR, and PAC classifiers achieve the highest accuracy of 96.48%, and the best ROC-AUC of 0.984 is achieved by the LR model. Among the other ML models, DT and GB produce the weakest results (92.97% test accuracy and 0.93 ROC-AUC), as shown in Table 9. Though the NB model’s accuracy is not the best, its ROC-AUC value of 0.981 was the closest to the LR’s ROC-AUC value, establishing it as the second-best fitted model, whereas, with 96.01% accuracy, RF and SGD are the second-best performing models for unseen clinical data.

The proposed model with a dimension reduction technique (PCA) produces the final predicted value for a classifier in an average of 1.93 seconds for the CKD-15 dataset and 6.57 seconds for the CKD-21 dataset. The model without PCA takes 2.33 seconds for the CKD-15 dataset and 15.9 seconds for the CKD-21 dataset, as shown in Table 10. The hybrid dataset model takes 7.7 seconds, while the unseen dataset model takes 7.95 seconds. Creating the hybrid dataset requires both datasets to have the same dimension, and so does training on one dataset and testing on the other. Hence, the average required time without PCA could not be calculated for the hybrid and unseen cases.
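
The time saved by PCA follows from the averages quoted above. The sketch below recomputes the absolute and relative savings from those four figures only; the helper name `pca_saving` is illustrative.

```python
# Average prediction times in seconds, as reported in Section 4.3 (Table 10).
AVG_SECONDS = {
    "CKD-15": {"with_pca": 1.93, "without_pca": 2.33},
    "CKD-21": {"with_pca": 6.57, "without_pca": 15.9},
}


def pca_saving(times):
    """Return (absolute saving in seconds, relative saving in percent)."""
    saved = times["without_pca"] - times["with_pca"]
    return round(saved, 2), round(100 * saved / times["without_pca"], 1)


for name, times in AVG_SECONDS.items():
    seconds, percent = pca_saving(times)
    print(f"{name}: {seconds:.2f} s saved ({percent:.1f}% less time)")
```

On these figures, PCA saves 0.40 seconds (about 17.2%) on CKD-15 and 9.33 seconds (about 58.7%) on CKD-21, so the relative benefit is considerably larger for the CKD-21 dataset.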

4.4. Results and Discussion

Our innovative machine learning approach is fully automated and integrates feature selection through L1 regularization and feature space reduction using PCA during the preprocessing phase. These techniques were specifically chosen to address the complexities of therapeutic data in CKD diagnosis, with a primary focus on enhancing early-stage prediction accuracy. Consequently, the intelligent model surpasses the existing methodologies on the CKD-15 dataset. A comprehensive analysis of the performance metrics for the four types of datasets, namely CKD-15, CKD-21, hybrid, and unseen clinical cases, is presented in Tables 6 to 9. The datasets (CKD-21, hybrid, and unseen clinical cases) employed here are novel and unique for early-stage CKD detection, and no previous works that used these datasets for CKD diagnosis were available. While a direct comparison with state-of-the-art methods for these specific datasets (CKD-21, hybrid, and unseen clinical cases) was not feasible, this study thoroughly evaluated the proposed approach on the CKD-15 dataset in Table 11. The outcomes demonstrated the superiority of our approach over previous works by a wide margin for the CKD-15 dataset, establishing its effectiveness in CKD detection.

Table 6 shows that, overall, the ADAB, DT, and RF classifiers achieve better performance than other models regardless of PCA usage, while the GB classifier performs better when PCA is utilized. The performance of the other models decreased when PCA was not used.

The four ML models (i.e., ELM, SVM, SGD, and PAC) perform worst without PCA for the CKD-21 dataset, as shown in Table 7. To the best of our knowledge, this is the first work on this dataset to distinguish the CKD class from the non-CKD class. A few works have been conducted on the CKD-21 dataset, but they are limited to identifying renal disease risk factors only. Furthermore, no works on the hybrid and clinical unseen data were found to compare with our model outputs.

Figure 3 shows the ROC-AUC curves, and Figure 4 compares the results of all 12 ML models on the four types of datasets (i.e., CKD-15, CKD-21, hybrid, and unseen case) using PCA. For the CKD-15 dataset, all models achieve 100% accuracy except for the NB model. The PAC model performs worst for the CKD-21 dataset. In aggregate, the ADAB, DT, ELM, GB, KNN, SVM, and RF models perform best for both datasets.

MLP performs best for the hybrid dataset (99.12% test accuracy), and NB performs worst (95.18% test accuracy). The average performance on the hybrid dataset was relatively good and stable, whereas for the unseen clinical data, test accuracy ranged from 96.48% (ADAB, LR, and PAC) down to 92.97% (DT and GB).
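
The unseen-case accuracies can be translated into counts of correctly classified patients. The sketch below assumes that the unseen test set contains the 256 samples stated in Section 4.2 and that each reported accuracy is a rounded ratio of whole counts; the helper name `correct_predictions` is illustrative.

```python
def correct_predictions(accuracy_percent, n_samples):
    """Nearest whole number of correct predictions for a reported accuracy."""
    return round(accuracy_percent / 100 * n_samples)


N_UNSEEN = 256  # unseen-case test size stated in Section 4.2

for label, acc in [("ADAB, LR, PAC", 96.48), ("DT, GB", 92.97)]:
    hits = correct_predictions(acc, N_UNSEEN)
    print(f"{label}: {hits}/{N_UNSEEN} correct, {N_UNSEEN - hits} errors, "
          f"{100 * hits / N_UNSEEN:.2f}% accuracy")
```

Under that assumption, 96.48% corresponds to 247 of 256 patients classified correctly (9 errors) and 92.97% to 238 of 256 (18 errors), so the gap between the best and weakest classifiers on unseen data amounts to 9 patients.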

To evaluate the overall performance of the ML models on the four types of datasets, this study presents the average performance analysis in Table 12 and plots the average train-test accuracy and ROC-AUC value in Figures 5 and 6, respectively. In Figure 6, the ELM, LR, MLP, and NB show the best ROC-AUC performance at 0.99, whereas DT shows the lowest ROC-AUC value at 0.91, and the other classifiers achieve a ROC-AUC of 0.98. Figure 5 shows that, on average across the CKD datasets, the RF classifier performs the best (98.48% test accuracy and 0.98 average ROC-AUC), and ADAB takes second place with 98.46% test accuracy. Though the train-test accuracy gaps are smaller for the SVM, KNN, and ELM classifiers, overall, in kidney disease prediction, the RF model could be the best choice for CKD-15, CKD-21, and unseen clinical data considering the accuracy and ROC-AUC performances, and the MLP model could be the best model for hybrid renal data.

5. Conclusion

Kidney failure causes diseases ranging from mild to severe, has significant health implications, and demands accurate diagnosis, especially in rural areas of developing countries where specialists are limited. To address these issues, this work suggests an intelligent diagnostic system for early CKD detection in a clinical environment with high accuracy and in a time-efficient manner. The suggested model was evaluated from four distinct perspectives to enhance its real-life clinical performance and credibility. To optimize the model’s performance, necessary corrections were made to the datasets. It outperforms previous studies for the CKD-15 dataset and exhibits impressive accuracy for test data. This positions it as a valuable novel solution and establishes its validity. Kidney disease prediction has been improved effectively by employing logistic regression with the “L1” penalty for feature selection and PCA for feature space reduction, alongside an ensemble characteristic-based classifier. The model also shows notable performance for the CKD-21 and hybrid datasets.

To validate the significance and clinical relevancy of this study, the proposed intelligent diagnostic system was finally evaluated on clinically unseen complex data and achieved impressive performance, demonstrating its potential as a valuable patient-centric solution for early CKD diagnosis in clinical practice. The implementation of the model in local healthcare systems would allow for a swift assessment of patients for early-stage CKD identification. Incorporating PCA into the model improved the CKD detection performance and significantly decreased the analysis time, specifically by 0.4 seconds for the CKD-15 dataset and 9.33 seconds for the CKD-21 dataset. This supports the real-life clinical applicability of the suggested model.

A future investigation might include performing statistical tests on more patient-centric data. The study acknowledges the necessity of more work on clinical benchmark data to facilitate thorough comparisons with state-of-the-art methods, especially for novel datasets like CKD-21, hybrid, and unseen clinical cases. Furthermore, validation by domain experts is a necessary step prior to clinical implementation.

Data Availability

The data used to support the findings of this study are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

The authors wish to thank the Department of Computer Science and Engineering of Dhaka University of Engineering & Technology, Gazipur, for supporting this research work.