Abstract

Abnormal haemoglobin in the human body is the fundamental cause of thalassemia. Thalassemia is a common genetic blood condition that has been investigated extensively in medical research worldwide. As an inherited disorder, it is passed down to children from their parents. If both parents are beta-thalassemia carriers, 25% of their children will have beta-thalassemia intermedia or major, which can be fatal. An efficient way of preventing beta-thalassemia is prenatal screening after couples have received counselling. Identifying thalassemia carriers involves a costly, time-consuming, and specialized test using quantifiable blood features. Cost-effective and rapid screening methods must therefore be developed. The mortality rate due to thalassemia is very high around the globe. It can be reduced by following the proper procedure early; otherwise, the disease significantly impacts the body. This study proposes a machine learning-based late fusion model for detecting beta-thalassemia carriers by analyzing red blood cells. The late fusion technique combines four machine learning algorithms. For identifying beta-thalassemia carriers, logistic regression, Naïve Bayes, decision tree, and neural network achieved accuracies of 94.01%, 93.15%, 97.93%, and 98.07%, respectively, on the feature-based dataset. The late fusion-based ML model achieved an overall accuracy of 96% for detecting beta-thalassemia carriers. The proposed late fusion model performs better than previously published approaches regarding efficiency, reliability, and precision.

1. Introduction

Thalassemia comes from the Greek terms “Thalassa” and “Haima.” “Thalassa” means “the ocean,” and “Haima” means “the blood.” Thalassemia is a genetic blood disorder characterized by insufficient production of haemoglobin [1]. Haemoglobin plays a crucial role in the human body by transporting oxygen from the lungs to the rest of the body and returning carbon dioxide to the lungs [2].

Thalassemia is one of the world’s most frequent diseases, particularly in the Mediterranean. Many countries are currently dealing with the high and rising incidence of thalassemia, which has become a primary public health concern and a significant source of disability and mortality around the world. Early detection of thalassemia can aid in the reduction of death rates. As a result, healthcare practitioners are responsible for making the right decisions when distinguishing between healthy people and patients who are carriers of diseases, especially genetic disorders such as thalassemia [3].

Thalassemia is divided into two types according to the two polypeptide chains in haemoglobin: alpha-thalassemia (α) and beta-thalassemia (β). Alpha-thalassemia is caused by an abnormality in the alpha polypeptide gene of haemoglobin, whereas beta-thalassemia is caused by a disturbance in the beta polypeptide gene. The development of either alpha- or beta-thalassemia leads to low or abnormal haemoglobin production in the body [4]. The red blood cells are affected by the inadequate haemoglobin [5].

Thalassemia is classified into three stages: major, intermedia, and minor. Thalassemia major is the most severe stage of the disease, in which the patient needs continuous blood transfusions to survive. Thalassemia intermedia is the middle stage of the condition, in which the patient occasionally needs a blood transfusion. It is also known as mild or moderate anaemia. A patient with thalassemia minor looks physically fit and healthy. Such patients do not need blood transfusions and can manage the condition through diet and a healthy lifestyle [6].

According to the World Health Organization (WHO), 5.1% of people worldwide are beta-thalassemia carriers [7]. Many tests are required to differentiate between iron-deficiency anaemia and beta-thalassemia. These tests include serum iron level, complete blood count, high-performance liquid chromatography, iron-binding capacity, and the measurement of ferritin and HbA2. However, these tests are expensive and not available everywhere [8].

Machine learning approaches produce results very efficiently in many research disciplines. They make data management and analysis easier and play a significant role in the health sector. A computer-based system can be developed to identify thalassemia with improved accuracy, better results, and a more affordable cost. Various machine learning algorithms have offered effective solutions for various biomedical problems. Many models have been presented to analyze the data of other diseases [9, 10] like brain tumours [11], kidney diseases [12], lung disorders [13], and iron deficiency anaemia by using machine learning techniques [14–16], including support vector machine [17], K-nearest neighbour [18], fuzzy logic [19–21], deep extreme machine learning [22], and deep neural network [23–25].

Logistic regression models the probability of a discrete outcome given one or more input variables. The most common form of logistic regression models a binary outcome (true/false, yes/no, etc.). Logistic regression is a helpful analysis tool for classification problems.
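As a minimal sketch of the binary case, the snippet below maps a linear score to a probability with the logistic (sigmoid) function and thresholds it at 0.5. The weights, bias, and feature values are hypothetical placeholders chosen for illustration; they are not fitted parameters from this study.

```python
import math

def sigmoid(z: float) -> float:
    """Logistic function: maps any real score to the interval (0, 1)."""
    return 1.0 / (1.0 + math.exp(-z))

def predict_proba(features, weights, bias):
    """Probability of the positive class for one record."""
    score = bias + sum(w * x for w, x in zip(weights, features))
    return sigmoid(score)

def predict(features, weights, bias, threshold=0.5):
    """Binary decision: 1 if the probability reaches the threshold, else 0."""
    return 1 if predict_proba(features, weights, bias) >= threshold else 0

# Hypothetical weights and one hypothetical, already-normalized record.
weights = [1.5, -2.0]
bias = 0.25
record = [0.8, 0.1]
print(predict_proba(record, weights, bias), predict(record, weights, bias))
```

In practice the weights would be estimated from the training data; only the scoring step is shown here.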

Naïve Bayes is a simple learning algorithm that uses the Bayes rule and assumes that attributes are conditionally independent given the class. Due to its processing efficiency and other benefits, Naïve Bayes is commonly used in practice [26].
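The Bayes rule with the independence assumption can be sketched as follows: the posterior of each class is proportional to its prior multiplied by the per-feature likelihoods. All probabilities, class names, and feature names below are hypothetical values for illustration only and were not estimated from any dataset.

```python
def naive_bayes_posterior(priors, likelihoods, observed):
    """Posterior P(class | observed) under the naive independence assumption.

    priors:      {class: P(class)}
    likelihoods: {class: {feature: {value: P(value | class)}}}
    observed:    {feature: value}
    """
    unnormalized = {}
    for cls, prior in priors.items():
        score = prior
        for feature, value in observed.items():
            # Independence assumption: multiply the per-feature likelihoods.
            score *= likelihoods[cls][feature][value]
        unnormalized[cls] = score
    total = sum(unnormalized.values())
    return {cls: score / total for cls, score in unnormalized.items()}

# Hypothetical probabilities (illustration only).
priors = {"carrier": 0.4, "noncarrier": 0.6}
likelihoods = {
    "carrier": {"mcv": {"low": 0.9, "normal": 0.1},
                "mch": {"low": 0.8, "normal": 0.2}},
    "noncarrier": {"mcv": {"low": 0.2, "normal": 0.8},
                   "mch": {"low": 0.3, "normal": 0.7}},
}
posterior = naive_bayes_posterior(priors, likelihoods, {"mcv": "low", "mch": "low"})
print(posterior)
```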

Trees have many analogies in real life and have inspired machine learning methods for both classification and regression. A decision tree can represent the decision analysis process visually and explicitly.

Feature-based data can be handled very effectively by neural network algorithms. Neural networks are computing systems inspired by human brain neural networks [2, 12].

Although machine learning algorithms are currently helpful for identifying illnesses, earlier research models were less accurate because they mainly concentrated on preprocessing methods, data balancing, and other supervised and semi-supervised learning models. A late fusion technique is needed to fuse the outputs of several machine learning algorithms while maintaining high disease detection accuracy. This study proposes a late fusion-based ML model that implements logistic regression, Naïve Bayes, decision tree, and neural network for data analysis. The system uses a feature-based dataset of thalassemia reports.

This paper highlights the importance of accurately predicting β-thalassemia carriers to enable early intervention and genetic counselling. The limitations of existing prediction models and the need for an improved approach are discussed. The objectives of the paper are as follows:
(1) To develop a fuzzy-based fusion model that combines multiple machine learning algorithms for β-thalassemia carrier prediction.
(2) To evaluate the performance of the proposed model using relevant performance metrics and compare it with existing approaches.
(3) To analyze the effectiveness of fuzzy logic in improving the accuracy and reliability of β-thalassemia carrier prediction.

The results of four different machine learning algorithms were combined through fuzzification to decide on beta-thalassemia carrier identification. The outcomes demonstrate that the proposed approach is more precise and effective than existing solutions.

The goal of the research is to identify thalassemia early. Hirimutugoda and Wijayarathna [2] implemented a three-layer artificial neural network to detect and differentiate malaria and thalassemia, both of which are life-threatening global health issues. Visual inspection of blood images taken with a light microscope is a well-known technique for detecting malaria and thalassemia, but it is time-consuming and expensive. The model combined three- and four-layer ANNs with image analysis methods to assess the accuracy and effectiveness of classifying images according to the morphological features of erythrocytes. The study reported that the three-layer ANN approach achieved an accuracy of 84.54%.

Ayyıldız and Arslan Tuncer [5] performed a decision-based diagnosis to identify and discriminate between iron deficiency anaemia (IDA) and beta-thalassemia (β). They used red blood cell indices and two effective machine learning techniques: support vector machines and k-nearest neighbour. Various parameters of the complete blood count were used to differentiate between IDA and β-thalassemia. Using RBC indices improved the efficiency of the diagnostic model. However, the system becomes complicated as the number of features increases.

Das et al. [23] employed a decision-based system that used decision trees, ANN, and a Naïve Bayes classifier to discriminate β-thalassemia carriers from healthy people. The dataset was gathered at the Postgraduate Institute of Medical Education and Research in Chandigarh, India. Both screening scores were reported to be completely sensitive. The reported scores were 79.25% for screening of beta-thalassemia trait (BTT) and 91.74% for the combined screening of BTT and HbE. Although the mechanism differentiates two main haemoglobin variants, it still requires validation with datasets collected from different countries for implementation and unification.

Egejuru et al. [27] implemented a prediction model for identifying the risk of thalassemia. The model used supervised machine learning approaches to analyze data collected through questionnaires and from medical personnel. The Waikato Environment for Knowledge Analysis (WEKA) tool was used for data simulation. Identification variables included demographics (age, marital status, gender, social class, and ethnicity) and clinical variables (spleen enlargement, family history, urine colour, diabetes, and inherited disease status). The dataset consisted of 57% disease carriers and 43% non-carriers. The models implemented in the study are the multilayer perceptron (MLP) and the Naïve Bayes classifier. The study results show that the MLP is a more effective and reliable mechanism for identifying the risk of thalassemia in patients in Nigeria.

Sadiq et al. [28] constructed an ensemble classifier model using a random forest, a support vector machine, and a gradient boosting machine to identify patients with thalassemia from complete blood count (CBC) test data. The model was implemented on a dataset of CBC reports of 5066 patients collected from the Punjab Thalassemia Prevention Program (PTPP). The input parameters used for this study are red blood cells, haemoglobin, hematocrit, mean cell volume, mean cell haemoglobin concentration, mean cell haemoglobin, RBC distribution width, platelet count, and white blood cells. The study achieved an accuracy of 93% in identifying β-thalassemia carriers.

Akhtar et al. [29] implemented a linear discriminant analysis (LDA) classifier to classify patients with thalassemia using various parameters of a complete blood count report. The parameters used in the study are ferritin, HB, RBC, WBC, HCT, and platelets. The study also used mathematical formulas to discriminate between patients with thalassemia and those with iron deficiency anaemia. The accuracy achieved was 78% for females and 75% for males.

A fuzzy-based model was developed by Susanto et al. [30] to classify thalassemia. The haemoglobin, MCV, and MCH levels obtained from the CBC examination were used to determine the type of thalassemia. The model has four output classes: major, intermedia, minor, and not thalassemia. The model’s predictions were compared with doctors’ assessments on four datasets. Additional data are needed to further test the model’s accuracy.

Jahan et al. [26] investigated red cell indices using machine learning techniques, such as an artificial neural network (ANN), to detect beta-thalassemia trait (BTT) in pregnant women. The optimal cutoff for each index and the BTT detection test characteristics were determined using receiver operating characteristic (ROC) curve analysis. The C4.5 and Naive Bayes (NB) classifiers and a back-propagation type ANN were constructed and tested over 3947 patients using the red cell indices. The study emphasizes that no single red cell index is sufficient to detect BTT. However, an ANN using a mixture of all red cell indices exhibited good sensitivity and specificity for this use. Further neural network development might produce a valuable tool for thalassemia screening in remote areas.

Mohammed and Al-Tuwaijari [31] presented various artificial intelligence-based methods and machine learning techniques for classifying and detecting thalassemia using CBC test variables such as RBC, HGB, MCV, HCT, and HB. The system was developed to identify patients with alpha-thalassemia minor and beta-thalassemia major. The classification methods are decision tree, Naive Bayes, and support vector machine.

Tyas et al. [32] examined a multilayer perceptron for classifying the erythrocytes present in thalassemia cases. It combined morphological features with texture and colour features to increase the accuracy of erythrocyte classification. The experimental results on 7108 erythrocytes indicated an accuracy of 98.11% for training and 93.77% for testing based on the combination of features. The system’s effectiveness was assessed using images captured at various magnifications and on different scanning platforms. The minimum number of red cells to image for analysis was determined using Poisson modelling, and the results were validated using image sets. Table 1 shows a comparative analysis of the accuracy of past works on thalassemia detection.

The aims and objectives of the paper are as follows:
(1) To highlight the importance of identifying beta-thalassemia carriers and their impact on reducing the mortality rate associated with the disease.
(2) To identify the limitations of current screening methods and propose developing cost-effective and speedy screening methods.
(3) To develop a machine learning-based late fusion model for detecting beta-thalassemia carriers by analyzing red blood cells.
(4) To compare the proposed late fusion model’s accuracy, efficiency, reliability, and precision with previously published approaches.
(5) To explore the use of machine learning in medical research to detect beta-thalassemia carriers.
(6) To evaluate the performance of four machine learning algorithms, including logistic regression, Naïve Bayes, decision tree, and neural network.
(7) To use a feature-based dataset for the development of the late fusion model.

2.1. Proposed β-Thalassemia Prediction Model

A late fusion model based on machine learning is proposed for predicting β-thalassemia carriers. The system uses a feature-based dataset of thalassemia reports obtained from Internet of Medical Things (IoMT) enabled devices. The feature dataset was collected from the Punjab Thalassemia Prevention Program (PTPP) database. Table 2 presents a complete overview of the features.

PTPP is an initiative of the Punjab government of Pakistan to protect people from thalassemia. This platform supports thalassemia patients through β-thalassemia carrier screening. Initially, the dataset is divided into training and testing sets: 70% of the records are used for training and 30% for testing.

The proposed model consists of training and validation phases and of several layers that help to diagnose beta-thalassemia: data acquisition, preprocessing, and application. The first layer is the data acquisition layer, which collects the dataset from PTPP based on IoMT devices [28]. The dataset consists of twelve variables and a total of 5066 instances. The output is classified into two categories. The first is β-thalassemia noncarriers, which contains 3051 records, and the second is β-thalassemia carriers, with 2015 patient records. The sex distribution is 53% male and 47% female.

The unprocessed data may contain missing or noisy values. Normalization of the data and treatment of missing values are carried out in the preprocessing layer. Normalization is used to handle noisy data, whereas missing values are imputed using the mean and moving averages of the existing values.
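The preprocessing step can be sketched as below. The text does not state which normalization is used, so min-max scaling is an assumption here, and only mean imputation is shown (the moving-average variant is omitted). The function names and the sample column are hypothetical.

```python
def impute_mean(values):
    """Replace missing entries (None) with the mean of the observed entries."""
    observed = [v for v in values if v is not None]
    if not observed:
        raise ValueError("column has no observed values")
    mean = sum(observed) / len(observed)
    return [mean if v is None else v for v in values]

def min_max_normalize(values):
    """Rescale a column to [0, 1]; a constant column maps to all zeros."""
    lo, hi = min(values), max(values)
    if hi == lo:
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

# Hypothetical haemoglobin column with one missing value.
column = [10.0, None, 14.0, 12.0]
filled = impute_mean(column)        # missing entry becomes the mean, 12.0
scaled = min_max_normalize(filled)  # values rescaled to the range [0, 1]
print(filled)
print(scaled)
```

Imputing before scaling keeps the filled value inside the observed range, so it cannot distort the minimum or maximum used for normalization.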

In the training phase, the third layer of the model is the application layer, which predicts thalassemia using four different machine learning algorithms: Logistic Regression (LR), Naïve Bayes (NB), Decision Tree (DT), and Neural Network (NN).

The LR, NB, DT, and NN results are passed to the evaluation phase, which calculates the accuracy and miss rate for the target class. The target class takes the values 0 and 1, where 0 denotes a β-thalassemia noncarrier and 1 denotes a β-thalassemia carrier. The trained model is sent to the cloud if the learning criteria are satisfied. Otherwise, it is retrained, as shown in Figure 1.

In the following stage, the results of the four techniques are combined using a fuzzy inference system to increase the performance of the proposed beta-thalassemia carrier model.

The validation phase utilized the 30% records of the thalassemia dataset to validate the model. The trained fusion-based model is imported from the cloud to predict thalassemia. The model discards the value if a beta-thalassemia non-carrier is found. If a beta-thalassemia carrier is found, the patient is referred to the hospital for additional treatment, as shown in Figure 1.

The following conditions (if-then rules) are employed in the fuzzy logic of the suggested late fusion model to identify beta-thalassemia carriers:
(1) IF (LR is carrier and NB is carrier and DT is carrier and NN is carrier) THEN (Thalassemia is beta carrier).
(2) IF (LR is carrier and NB is carrier and DT is carrier and NN is noncarrier) THEN (Thalassemia is beta carrier).
(3) IF (LR is carrier and NB is carrier and DT is noncarrier and NN is carrier) THEN (Thalassemia is beta carrier).
(4) IF (LR is carrier and NB is carrier and DT is noncarrier and NN is noncarrier) THEN (Thalassemia is beta carrier).
(5) IF (LR is carrier and NB is noncarrier and DT is carrier and NN is carrier) THEN (Thalassemia is beta carrier).
(6) IF (LR is carrier and NB is noncarrier and DT is carrier and NN is noncarrier) THEN (Thalassemia is beta carrier).
(7) IF (LR is carrier and NB is noncarrier and DT is noncarrier and NN is carrier) THEN (Thalassemia is beta carrier).
(8) IF (LR is carrier and NB is noncarrier and DT is noncarrier and NN is noncarrier) THEN (Thalassemia is beta noncarrier).
(9) IF (LR is noncarrier and NB is carrier and DT is carrier and NN is carrier) THEN (Thalassemia is beta carrier).
(10) IF (LR is noncarrier and NB is carrier and DT is carrier and NN is noncarrier) THEN (Thalassemia is beta noncarrier).
(11) IF (LR is noncarrier and NB is carrier and DT is noncarrier and NN is carrier) THEN (Thalassemia is beta noncarrier).
(12) IF (LR is noncarrier and NB is carrier and DT is noncarrier and NN is noncarrier) THEN (Thalassemia is beta noncarrier).
(13) IF (LR is noncarrier and NB is noncarrier and DT is carrier and NN is carrier) THEN (Thalassemia is beta noncarrier).
(14) IF (LR is noncarrier and NB is noncarrier and DT is carrier and NN is noncarrier) THEN (Thalassemia is beta noncarrier).
(15) IF (LR is noncarrier and NB is noncarrier and DT is noncarrier and NN is carrier) THEN (Thalassemia is beta noncarrier).
(16) IF (LR is noncarrier and NB is noncarrier and DT is noncarrier and NN is noncarrier) THEN (Thalassemia is beta noncarrier).

The generated fuzzy rules show that the suggested late fusion-based technique predicts the outcome (either beta-thalassemia carrier or beta-thalassemia noncarrier) on which at least three classification strategies agree. When the four classifiers split two against two, the rules as listed follow the LR prediction.
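The sixteen rules listed above can be restated compactly, as the following sketch illustrates. It is a crisp (non-fuzzy) reading of the rule table, not the fuzzy inference system used in the study: the table is transcribed with C for carrier and N for noncarrier in the order LR, NB, DT, NN, and the helper `fuse` (a hypothetical name) encodes the observation that the rules amount to a majority vote with two-against-two ties decided by LR.

```python
from itertools import product

# Rule table transcribed from the text; keys are (LR, NB, DT, NN) votes.
RULES = {
    "CCCC": "C", "CCCN": "C", "CCNC": "C", "CCNN": "C",
    "CNCC": "C", "CNCN": "C", "CNNC": "C", "CNNN": "N",
    "NCCC": "C", "NCCN": "N", "NCNC": "N", "NCNN": "N",
    "NNCC": "N", "NNCN": "N", "NNNC": "N", "NNNN": "N",
}

def fuse(lr, nb, dt, nn):
    """Majority vote over four classifiers; a 2-2 tie follows LR."""
    votes = [lr, nb, dt, nn].count("C")
    if votes >= 3:
        return "C"
    if votes <= 1:
        return "N"
    return lr  # tie: two carrier votes, two noncarrier votes

# Collect every combination where the compact form disagrees with the table.
mismatches = [
    "".join(combo)
    for combo in product("CN", repeat=4)
    if fuse(*combo) != RULES["".join(combo)]
]
print(mismatches)  # an empty list means the two forms agree on all 16 rules
```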

The rule surface of the proposed late fusion technique for predicting beta-thalassemia carriers based on NB and LR is shown in Figure 2. If both classification methods indicate that “beta-thalassemia = carrier” is the outcome, then the suggested technique will also indicate that “beta-thalassemia = carrier” is the outcome. If both methods indicate that “beta-thalassemia = noncarrier” is the outcome, then the proposed technique will also indicate that “beta-thalassemia = noncarrier” is the outcome.

Figure 3 demonstrates that the suggested late fusion technique will also predict “beta-thalassemia = carrier” if NB, DT, and NN make this prediction “beta-thalassemia = carrier.”

Figure 4 shows that if LR and NB show “beta-thalassemia = noncarrier,” even if DT and NN show “beta-thalassemia = carrier,” the proposed method will still show “beta-thalassemia = noncarrier.”

Table 3 shows the membership functions based on the fuzzy rules. The system testing layer predicts beta-thalassemia carriers. A fuzzy-based cloud model is used to produce the outcome and to store real-time patient data for evaluation.

3. Results and Simulation

A late fusion-based model is proposed for the early prediction of beta-thalassemia carriers. The results were obtained using MATLAB 2022. The proposed model comprises four machine learning techniques (LR, NB, DT, and NN), which are applied to 5066 records. For all methods, 70% of the samples were used for training and the remaining 30% for validation. The proposed model classifies each record as a beta-thalassemia carrier or a beta-thalassemia noncarrier. The statistical metrics used to evaluate the predictive performance of the suggested late fusion model and the other classification methods are explained below. They are defined in terms of four counts: carriers correctly predicted, noncarriers correctly predicted, noncarriers falsely predicted as carriers, and carriers falsely predicted as noncarriers.

Accuracy is the number of correctly labelled cases out of the total number of cases.

The percentage of real positives and negatives missed during an experiment is known as the miss rate.

Sensitivity measures the capacity of the proposed model to identify positive cases.

The positive and negative predictive values are the proportions of positive and negative predictions, respectively, that are correct.
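The metric definitions above can be sketched in code. The symbols of the original formulas are not available in this text, so the conventional names TP, TN, FP, and FN (true/false positives and negatives, with "carrier" as the positive class) are assumed. The miss rate is taken as the complement of accuracy, which matches the values reported in this section. The example uses the LR testing counts reported later in this section (713 of 763 carriers and 716 of 757 noncarriers correctly classified).

```python
def confusion_metrics(tp, tn, fp, fn):
    """Standard metrics from confusion-matrix counts (carrier = positive)."""
    total = tp + tn + fp + fn
    accuracy = (tp + tn) / total
    return {
        "accuracy": accuracy,
        "miss_rate": 1.0 - accuracy,      # overall misclassification rate
        "sensitivity": tp / (tp + fn),    # carriers correctly detected
        "specificity": tn / (tn + fp),    # noncarriers correctly detected
        "ppv": tp / (tp + fp),            # positive predictive value
        "npv": tn / (tn + fn),            # negative predictive value
    }

# LR testing counts: 713/763 carriers and 716/757 noncarriers correct.
lr_test = confusion_metrics(tp=713, tn=716, fp=757 - 716, fn=763 - 713)
for name, value in lr_test.items():
    print(f"{name}: {100 * value:.2f}%")
```

With these counts the computed accuracy and miss rate round to the 94.01% and 5.99% reported for LR testing.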

The dataset contains 5066 instances. 70% of the dataset (3546 records) is used for training, while the remaining 30% (1520 records) is used for testing.
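A 70/30 split of this size can be sketched as follows. The text does not say whether records were shuffled or stratified before splitting, so the seeded random shuffle and the helper name `split_indices` are assumptions made for illustration.

```python
import random

def split_indices(n_records, train_fraction=0.7, seed=0):
    """Shuffle record indices and split them into training and testing parts."""
    indices = list(range(n_records))
    random.Random(seed).shuffle(indices)  # seeded for reproducibility
    n_train = round(n_records * train_fraction)
    return indices[:n_train], indices[n_train:]

train_idx, test_idx = split_indices(5066)
print(len(train_idx), len(test_idx))  # sizes of the two parts
```

For 5066 records this yields 3546 training and 1520 testing indices, matching the counts stated above.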

Of the 3546 records used for training with the LR approach, 1715 were beta-thalassemia noncarriers and 1831 were beta-thalassemia carriers. When trained with LR, 1623 of the 1715 noncarriers and 1717 of the 1831 carriers were correctly identified. Table 4 compares the actual and predicted classes during training. The training accuracy was 94.2%, with a miss rate of 5.8%.

During LR testing, 716 of 757 records were correctly identified as noncarriers, and 713 of 763 records were correctly classified as carriers, as shown in Table 5. The LR testing accuracy was 94.01%, with a miss rate of 5.99%.

Of the 3546 records used for training with the NB approach, 1715 were beta-thalassemia noncarriers and 1831 were beta-thalassemia carriers. When trained with NB, 1618 of the 1715 noncarriers and 1658 of the 1831 carriers were correctly identified. Table 6 compares the actual and predicted classes during training. The training accuracy was 92.4%, with a miss rate of 7.6%.

During NB testing, 721 of 757 records were correctly identified as noncarriers, and 695 of 763 records were correctly classified as carriers, as shown in Table 7. The NB testing accuracy was 93.15%, with a miss rate of 6.85%.

Of the 3546 records used for training with the DT approach, 1715 were beta-thalassemia noncarriers and 1831 were beta-thalassemia carriers. When trained with DT, 1703 of the 1715 noncarriers and 1813 of the 1831 carriers were correctly identified. Table 8 compares the actual and predicted classes during training. The training accuracy was 99.15%, with a miss rate of 0.85%.

During DT testing, 756 of 757 records were correctly identified as noncarriers, and 763 of 763 records were correctly classified as carriers, as shown in Table 9. The DT testing accuracy was 99.93%, with a miss rate of 0.07%.

Of the 3546 records used for training with the NN approach, 1715 were beta-thalassemia noncarriers and 1831 were beta-thalassemia carriers. When trained with NN, 1700 of the 1715 noncarriers and 1824 of the 1831 carriers were correctly identified. Table 10 compares the actual and predicted classes during training. The training accuracy was 99.4%, with a miss rate of 0.6%.

During NN testing, 757 of 757 records were correctly identified as noncarriers, and 763 of 763 records were correctly classified as carriers, as shown in Table 11. The NN testing accuracy was 100%.

Table 12 displays detailed validation results for all the classification techniques used (LR, NB, DT, and NN). All four machine learning techniques performed well, achieving an average accuracy of 96.77% and an average miss rate of 3.23%.
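As a check on the arithmetic, the average can be reproduced from the four testing accuracies reported above for LR, NB, DT, and NN, with the miss rate taken as the complement of accuracy:

```latex
\[
\overline{\text{accuracy}}
  = \frac{94.01 + 93.15 + 99.93 + 100}{4}
  = \frac{387.09}{4}
  \approx 96.77\%,
\qquad
\overline{\text{miss rate}} = 100\% - 96.77\% = 3.23\%.
\]
```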

Finally, the outputs of the four machine learning techniques are provided to the fuzzy system as input for the final prediction. The inputs to the fuzzy system are the LR, NB, DT, and NN classifier outputs, and the output is the beta-thalassemia carrier class. By employing fuzzy rules, the suggested machine learning late fusion-based fuzzy system attained an accuracy of 96% and a miss rate of 4%. The fuzzy system takes twenty-five randomly chosen input ranges for generating the fusion-based results. Based on the fuzzy rules, 12 outputs were correctly predicted as beta-thalassemia carriers and 12 as noncarriers. The remaining output fell between the carrier and noncarrier stages and represents the system’s error, so 24 of the 25 outputs (96%) were correct.

Table 13 compares the suggested fused machine learning model with the thalassemia prediction methods described in the literature: RF [28], SVM [28], GBM [28], FIS [30], NB [26], and CNN [32]. The proposed late fusion model outperformed these approaches. The suggested machine learning fusion-based system can be included in intelligent healthcare systems for early and accurate beta-thalassemia carrier prediction. The proposed model achieved an accuracy of 96% for beta-thalassemia carrier prediction.

4. Conclusions

The main aim of this study is to develop a system for analyzing beta-thalassemia carrier patients using the late fusion-based ML model. The system is simple and accessible to both medical experts and nonexperts. Hence, any person can examine thalassemia status just by feeding in the required input data. The overall accuracy of the proposed late fusion-based ML model is 96%. The presented framework can be enhanced in the future by utilizing different methods, including federated learning. The study can also be extended by applying long short-term memory (LSTM) and other ML algorithms and by diagnosing the other stages of thalassemia, such as alpha-thalassemia major and minor and beta-thalassemia major and minor.

Data Availability

Data will be provided on request.

Conflicts of Interest

The authors declare that there are no conflicts of interest.

Acknowledgments

Thanks to Gapico PVT, who provided us with the simulation platform.