Abstract

Improvements in hyperspectral imaging technology, the diversification of acquisition methods, and falling costs have made hyperspectral data increasingly convenient to acquire. However, because of their large number of bands and high redundancy, hyperspectral data remain complex to process. Two feature extraction algorithms, the autoencoder (AE) and the restricted Boltzmann machine (RBM), were used to optimize the classification model parameters. The optimal classification model was obtained by comparing a stacked autoencoder (SAE) and a deep belief network (DBN). Finally, the SAE was further optimized by adding sparse representation constraints and GPU parallel computing to improve classification accuracy and speed. The results show that the deep-learning-based SAE is superior to traditional feature extraction algorithms. The optimal deep-learning-based classification model, the stacked sparse autoencoder, achieved 93.41% and 94.92% classification accuracy on the two experimental datasets. Parallel computing increased the model’s training speed by more than seven times, overcoming the model’s lengthy training time limitation.

1. Introduction

Hyperspectral imaging spectrometers, which are now carried on a variety of platforms, such as satellites, aerospace aircraft, uncrewed aerial vehicles, and ground vehicles, collect rich surface reflectance information [1–3]. This kind of remote-sensing imaging technology, with tens or hundreds of spectral bands, is used in geological and mineral detection, environmental investigations [4–7], vegetation monitoring [8–10], and marine research [11]. Correction and compression of hyperspectral image data, object detection, and feature classification are important research areas of hyperspectral remote sensing [12, 13]. Supervised classification algorithms commonly used in low-dimensional spaces perform poorly on hyperspectral data because of the Hughes phenomenon [14, 15]. For classification purposes, it is crucial to retain valuable information from high-dimensional data while reducing their dimensionality [16–18]. An urgent issue is reducing the dimensionality of the data while using a suitable feature extraction method to derive, from the image pixels, the nonlinear structure information that is more useful for classification [19, 20].

Pixel features that are more useful for classification can be obtained by unsupervised feature extraction. Commonly used feature extraction methods for hyperspectral data include linear and nonlinear dimensionality reduction approaches. The main linear dimensionality reduction methods used for hyperspectral images are independent component analysis (ICA) [21, 22], principal component analysis (PCA) [23–25], linear discriminant analysis (LDA) [22, 26], and local feature analysis (LFA) [27], among others. However, traditional linear dimensionality reduction methods cannot capture the nonlinear features inside hyperspectral data, leading to lower final classification accuracy; consequently, recent studies of hyperspectral feature extraction have paid less attention to linear dimensionality reduction algorithms. The commonly used nonlinear dimensionality reduction methods are based on kernel functions and eigenvalues. The kernel-based methods include kernel discriminant analysis (KDA) [28–30] and kernel principal component analysis (KPCA) [31].

Nonlinear dimensionality reduction methods are used increasingly in remote sensing thanks to the rapid advancement of deep learning theory and technology [32]. Because of its advantages, deep learning is now regularly applied to the feature extraction and classification of hyperspectral image data [33–36]; it is particularly suitable for feature extraction from high-dimensional nonlinear data [37, 38]. Numerous studies have combined deep learning methods with different classifiers to achieve better classification results on hyperspectral remote-sensing images [39–42], including stacked autoencoders (SAEs), deep belief networks (DBNs), and convolutional neural networks (CNNs) [43]. Deep learning algorithms combined with spatial context information can extract high-quality spectral and spatial features and achieve high classification accuracy with a small number of training samples and simple classifiers [44, 45]. Multiscale deep learning can also be combined to classify hyperspectral images [46]. For example, a new hyperspectral image classification model based on deep learning was constructed by combining stacked autoencoders (SAEs) and deep convolutional neural networks [47], and spatial pyramid pooling in deep convolutional neural networks has obtained excellent classification performance [48, 49]. The stacked autoencoder (SAE), which offers effective data dimensionality reduction, is used for feature extraction in hyperspectral remote sensing, reducing processing complexity and thus improving the efficiency of data abstraction and the accuracy of data classification [50]. Moreover, combined with the classification advantages of CNNs [51, 52], a fusion network for image classification can be constructed based on an optimized SAE, improving classification performance compared with traditional data processing [53, 54]. A semisupervised classification algorithm based on multilabeled samples and deep learning [55], with labels drawn from both neighborhood information and training samples [56, 57] and unlabeled samples obtained through self-taught learning, yields an effective semisupervised hyperspectral image classification method [58, 59]. Numerous classification experiments with deep learning algorithms on a variety of hyperspectral data have found that deep learning algorithms are the optimal classification algorithms in most cases [60–66].

This paper’s objective was to apply deep learning theory to hyperspectral image classification, investigate hyperspectral image classification models combined with deep learning algorithms, and obtain a preferred method that improves classification accuracy by addressing the Hughes phenomenon and the challenge of extracting nonlinear features from within image elements. To this end, we developed and tested classification models built on two deep learning components, autoencoders (AEs) and restricted Boltzmann machines (RBMs), and verified the applicability of deep learning models to hyperspectral image classification. We obtained the final classification models by connecting these two feature extraction algorithms to classifiers suitable for hyperspectral image classification and analyzing the model building blocks. Through the experiments, we analyzed the effects of the number of hidden layer neurons, the number of hidden layers, the choice of classifier, and other factors on model performance and obtained an optimal classification model with optimal parameters. Finally, we proposed optimization strategies to improve classification accuracy and speed.

2. Methods

2.1. Research Overview

This paper is based on hyperspectral image data; it introduces deep learning algorithms and discusses the SAE and DBN models. Two feature extraction algorithms were connected to classifiers suitable for hyperspectral image classification to obtain the final classification models, and two sets of experimental data were used to verify the two models. Through experiments, we analyzed the influence of the number of hidden layer neurons, the number of hidden layers, the choice of classifier, and other factors on model performance and obtained an optimal classification model with optimal parameters. Finally, an optimization strategy was proposed to improve the model’s classification accuracy and speed. The workflow is shown in Figure 1.

2.2. Stacked Autoencoder (SAE)
2.2.1. Autoencoder (AE)

The autoencoder (AE) consists of a feedforward neural network with an input layer, a hidden layer, and an output layer (Figure 2). The AE assumes approximate equality between the decoder output features and the input features for training, allowing a large amount of unlabeled training sample data to be applied to the model’s training process. As a result, overfitting and local extremes brought on by too many parameters and insufficient labeled training samples are avoided.

The training process first maps the input vector $x \in \mathbb{R}^{n_1}$ (where $n_1$ denotes the dimensionality of the input data, i.e., the number of neurons in the input layer) to the hidden representation $y \in \mathbb{R}^{n_2}$ (where $n_2$ denotes the number of neurons in the hidden layer) through a linear function with trainable parameters $W_1$ and $B_1$ followed by an activation function $f(\cdot)$ (equation (1)); a generic choice of activation is the sigmoid function of equation (4). This step is performed by the encoder part of the neural network and is called encoding. The hidden representation $y$ is then mapped to the output layer by a linear function with trainable parameters $W_2$ and $B_2$ and an activation function to produce $z$ (equation (2)), so that the input $x$ approximates $z$. This step is performed by the decoder and is called reconstruction. Sometimes a linear-decoding approach is used instead, removing the activation function, i.e., equation (3):

$$y = f(W_1 x + B_1), \tag{1}$$

$$z = f(W_2 y + B_2), \tag{2}$$

$$z = W_2 y + B_2, \tag{3}$$

$$f(u) = \frac{1}{1 + e^{-u}}, \tag{4}$$

where $W_1$ and $W_2$ denote the input-to-hidden and hidden-to-output weights, respectively, and $B_1$ and $B_2$ denote the biases.
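As a concrete illustration of equations (1)–(3), the following minimal NumPy sketch implements the encode and decode mappings and the reconstruction error; the layer sizes and random initialization are illustrative assumptions, not the settings used in the experiments.

```python
import numpy as np

def sigmoid(u):
    return 1.0 / (1.0 + np.exp(-u))

def encode(x, W1, B1):
    # Equation (1): map the n1-dimensional input to n2 hidden activations.
    return sigmoid(W1 @ x + B1)

def decode(y, W2, B2):
    # Equation (2): reconstruct the input from the hidden representation.
    return sigmoid(W2 @ y + B2)

def reconstruction_error(x, W1, B1, W2, B2):
    # AE training minimizes this quantity so that z approximates x.
    z = decode(encode(x, W1, B1), W2, B2)
    return 0.5 * np.sum((x - z) ** 2)

# Assumed sizes for illustration: n1 = 103 input bands, n2 = 40 hidden neurons.
rng = np.random.default_rng(0)
n1, n2 = 103, 40
W1, B1 = rng.normal(0, 0.01, (n2, n1)), np.zeros(n2)
W2, B2 = rng.normal(0, 0.01, (n1, n2)), np.zeros(n1)
x = rng.random(n1)  # one pixel spectrum scaled to [0, 1]
print(reconstruction_error(x, W1, B1, W2, B2))
```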

2.2.2. Stacked Autoencoder (SAE)

The training process of a stacked autoencoder (SAE) is essentially unsupervised learning. The model is trained by layer-wise pretraining: the parameters of the first layer are trained and propagated forward to obtain the first hidden layer, which is then used as the input layer for training the parameters of the second layer. Training proceeds iteratively in this way, layer by layer, until the training parameters of every layer in the SAE are obtained. Finally, fine-tuning is performed through the connected classifier. Fine-tuning treats all layers of the SAE as a single model and optimizes all weights in the network by an iterative algorithm using a labeled training sample set. The backpropagation algorithm is applied to update the weights, and because backpropagation applies to networks of any depth, the approach can be extended to as many layers as desired (Algorithms 1 and 2).

2.2.3. Stacked Autoencoder (SAE) Algorithm Flow
Algorithm 1 (AE training):
Step 1: start
Step 2: given the set of training samples, the numbers of units in the visible and hidden layers, the number of iterations, the learning rate, the initialized training parameter weight matrix W, the bias vectors b and c, and the regularization terms
Step 3: using a restricted Newton’s method, update the training parameters until the algorithm converges
Step 4: end

Algorithm 2 (SAE training):
Step 1: start
Step 2: given the parameters in Algorithm 1 and the number of hidden layers
Step 3: train the first AE layer using Algorithm 1
Step 4: use the hidden layer of the first AE as the input layer of the second AE, and train layer by layer in turn until the last AE layer
Step 5: connect the last output layer to the classifier and complete fine-tuning
Step 6: end
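A minimal sketch of Algorithms 1 and 2 follows; plain batch gradient descent on the reconstruction error stands in for the restricted Newton update of Algorithm 1, and the epoch count, learning rate, and initialization are illustrative assumptions.

```python
import numpy as np

def train_ae(X, n_hidden, epochs=50, lr=0.1):
    """Algorithm 1 (sketch): train one AE on X (n_samples x n_features) and
    return the encoder parameters plus the hidden representation of X."""
    rng = np.random.default_rng(0)
    n_in = X.shape[1]
    W1, B1 = rng.normal(0, 0.01, (n_in, n_hidden)), np.zeros(n_hidden)
    W2, B2 = rng.normal(0, 0.01, (n_hidden, n_in)), np.zeros(n_in)
    for _ in range(epochs):
        H = 1 / (1 + np.exp(-(X @ W1 + B1)))   # encode, equation (1)
        Z = H @ W2 + B2                        # linear decode, equation (3)
        D = (Z - X) / len(X)                   # gradient of the reconstruction error
        DH = (D @ W2.T) * H * (1 - H)          # backpropagated through the sigmoid
        W2 -= lr * H.T @ D;  B2 -= lr * D.sum(axis=0)
        W1 -= lr * X.T @ DH; B1 -= lr * DH.sum(axis=0)
    return W1, B1, 1 / (1 + np.exp(-(X @ W1 + B1)))

def pretrain_sae(X, layer_sizes):
    """Algorithm 2 (sketch): greedy layer-wise pretraining; each AE is trained
    on the hidden representation produced by the previous one."""
    params, H = [], X
    for n_hidden in layer_sizes:
        W, B, H = train_ae(H, n_hidden)
        params.append((W, B))
    return params, H  # H feeds the classifier; fine-tuning then uses backpropagation
```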
2.3. Deep Belief Network
2.3.1. Restricted Boltzmann Machines

An RBM is a stochastic neural network that learns the probability distribution of the original data. It is a typical energy-based model whose basic structure is a bipartite graph with one layer serving as the data input (visible) layer and one layer as the hidden layer; the nodes within each layer are not connected, while the nodes between the layers are fully connected. All nodes take only the values 0 and 1; they are random binary variables, and the joint probability distribution satisfies the Boltzmann distribution. This model is the RBM (equation (5)):

$$E(v, h) = -b^{T}v - c^{T}h - v^{T}Wh, \qquad P(v, h) = \frac{e^{-E(v, h)}}{\sum_{v, h} e^{-E(v, h)}}, \tag{5}$$

where $W$ denotes the connection weights between the visible and hidden layers, and $b$ and $c$ represent the respective biases of the visible and hidden layers.
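Because of the bipartite structure, the hidden units are conditionally independent given the visible units (and vice versa), which makes inference tractable. The following minimal NumPy sketch evaluates the energy of equation (5) and the two conditional distributions; the layer sizes and weights are illustrative assumptions.

```python
import numpy as np

sigmoid = lambda u: 1 / (1 + np.exp(-u))

def energy(v, h, W, b, c):
    # Energy of a joint configuration (v, h), as in equation (5).
    return -(b @ v) - (c @ h) - (v @ W @ h)

def p_h_given_v(v, W, c):
    # Hidden units are conditionally independent given the visible layer,
    # so each activation probability is an independent sigmoid.
    return sigmoid(v @ W + c)

def p_v_given_h(h, W, b):
    return sigmoid(W @ h + b)

# Tiny assumed example: 4 visible and 3 hidden binary units.
rng = np.random.default_rng(0)
W = rng.normal(0, 0.1, (4, 3))
b, c = np.zeros(4), np.zeros(3)
v = np.array([1.0, 0.0, 1.0, 1.0])
h = np.array([0.0, 1.0, 0.0])
print(energy(v, h, W, b, c), p_h_given_v(v, W, c))
```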

2.3.2. Deep Belief Network

The DBN consists of multiple stacked RBM layers that extract deep features of the original data (Figure 3). The joint probability distribution between the input data $x = h^0$ in the visible layer and the $l$ hidden layers $h^1, \ldots, h^l$ is shown in equation (6). The weights are obtained by unsupervised greedy layer-wise training: the first RBM is trained and its parameters are fixed, the output of its hidden layer is used as the input of the second RBM, and the parameters are trained layer by layer in turn. The last hidden layer is connected to the classifier, and fine-tuning is completed by a supervised gradient descent (GD) algorithm, which proceeds as follows: except for the top-level RBM, the weights of the remaining RBMs are split into upward recognition weights and downward generative weights. The generative pass forms the bottom-layer states from the top-layer representation through the downward weights, while the upward weights between the layers are refined, finally yielding the DBN-based classification model:

$$P(x, h^1, \ldots, h^l) = \left( \prod_{k=0}^{l-2} P(h^k \mid h^{k+1}) \right) P(h^{l-1}, h^l), \tag{6}$$

where $P(h^{l-1}, h^l)$ is the joint probability distribution between the visible and hidden layers of the topmost RBM (Algorithms 3 and 4).

2.3.3. DBN Algorithm Flow
Algorithm 3 (RBM training):
Step 1: start
Step 2: given the set of training samples, the numbers of units in the visible and hidden layers, the number of iterations, the learning rate, the initialized training parameter weight matrix W, and the bias vectors b and c
Step 3: using the contrastive divergence algorithm, update the training parameters until the algorithm converges
Step 4: end

Algorithm 4 (DBN training):
Step 1: start
Step 2: given the parameters in Algorithm 3 and the number of hidden layers
Step 3: train the first RBM layer using Algorithm 3
Step 4: use the hidden layer of the first RBM as the input layer of the second RBM, and train layer by layer in turn until the last RBM layer
Step 5: connect the final output layer to the classifier and complete fine-tuning
Step 6: end
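A minimal sketch of Algorithms 3 and 4, using one-step contrastive divergence (CD-1) as the weight update; the epoch count, learning rate, and initialization are illustrative assumptions.

```python
import numpy as np

rng = np.random.default_rng(0)
sigmoid = lambda u: 1 / (1 + np.exp(-u))

def pretrain_dbn(X, layer_sizes, epochs=50, lr=0.1):
    """Greedy layer-wise DBN pretraining (sketch): each RBM is trained with
    CD-1, and its hidden probabilities become the next RBM's input."""
    params, V = [], X
    for n_hidden in layer_sizes:
        W = rng.normal(0, 0.01, (V.shape[1], n_hidden))
        b, c = np.zeros(V.shape[1]), np.zeros(n_hidden)
        for _ in range(epochs):
            ph0 = sigmoid(V @ W + c)                       # upward pass
            h0 = (rng.random(ph0.shape) < ph0).astype(float)
            pv1 = sigmoid(h0 @ W.T + b)                    # reconstruction
            ph1 = sigmoid(pv1 @ W + c)
            W += lr * (V.T @ ph0 - pv1.T @ ph1) / len(V)   # CD-1 update
            b += lr * (V - pv1).mean(axis=0)
            c += lr * (ph0 - ph1).mean(axis=0)
        params.append((W, c))
        V = sigmoid(V @ W + c)  # hidden probabilities feed the next layer
    return params, V            # V goes to the classifier for fine-tuning
```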
2.4. Support Vector Machines (SVMs)

SVMs are supervised learning algorithms commonly used for classification and regression analysis. An SVM is a linear classifier that finds the maximum-margin separating hyperplane in the feature space. Its goal is to maximize the margin, which turns the optimization problem into a convex quadratic programming problem. In addition to linear classification, support vector machines can perform nonlinear classification by mapping the vectors into a high-dimensional space in which a maximum-margin hyperplane is established. Concretely, on either side of the hyperplane separating the data lie two parallel hyperplanes, and the separating hyperplane maximizes the distance between them; the farther apart these two hyperplanes are, the smaller the overall classifier error.

Given a training sample set $(x_i, y_i)$, where $y_i \in \{-1, 1\}$ is the label of $x_i$ and each $x_i$ is a $p$-dimensional real vector, a maximum-margin hyperplane is required between the set of points with $y_i = -1$ and the set of points with $y_i = 1$, such that the distance from the hyperplane to the nearest points $x_i$ is maximized. Any such hyperplane can be written as $w^{T}x - b = 0$. If the training data are linearly separable, two parallel hyperplanes can be chosen to separate the two classes of data, and the distance between them can be maximized.

The support vector machine is extended to linearly inseparable data using the hinge loss function (equation (7)); the objective function is given in equation (8). The parameter $\lambda$ trades off increasing the margin size against ensuring that each $x_i$ lies on the correct side of the margin. Thus, for sufficiently small values of $\lambda$, the data are treated as if they were linearly classifiable even when they are not, and a feasible classification rule can still be learned. Support vector machines are generalized linear classifiers, and as such, the choice of kernel function and kernel parameters significantly impacts their performance:

$$\ell(x_i, y_i) = \max\left(0,\ 1 - y_i (w^{T} x_i - b)\right), \tag{7}$$

$$\min_{w, b}\ \frac{1}{n} \sum_{i=1}^{n} \max\left(0,\ 1 - y_i (w^{T} x_i - b)\right) + \lambda \lVert w \rVert^{2}. \tag{8}$$
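In the experiments below the SVM is used with a radial basis kernel through a standard solver; purely to illustrate the objective of equations (7) and (8), here is a minimal linear sketch with a subgradient-descent step (the learning rate and labels in {−1, +1} are assumptions of the sketch).

```python
import numpy as np

def hinge_objective(w, b, X, y, lam):
    # Equation (8): mean hinge loss plus lambda * ||w||^2; y is in {-1, +1}.
    margins = y * (X @ w - b)
    return np.mean(np.maximum(0, 1 - margins)) + lam * (w @ w)

def subgradient_step(w, b, X, y, lam, lr=0.01):
    # Samples with margin < 1 contribute a subgradient; the rest contribute none.
    margins = y * (X @ w - b)
    active = margins < 1
    n = len(y)
    gw = 2 * lam * w - (X[active].T @ y[active]) / n
    gb = y[active].sum() / n
    return w - lr * gw, b - lr * gb
```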

2.5. Softmax Regression Classifier

The softmax regression classifier is the extension of the logistic regression classifier to multiclass problems. For a test input x, a hypothesis function estimates the probability value for each category j.

The objective function is given in equation (9), where $1\{\cdot\}$ is the indicator function: $1\{\cdot\} = 1$ when the expression is true and $1\{\cdot\} = 0$ when it is false. The probability that $x$ is classified as class $j$ in softmax regression is given in equation (10). The objective function is optimized using the gradient descent method; the gradient in equation (11) can be substituted into gradient descent (or a similar algorithm) to minimize the objective function of equation (9):

$$J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} \sum_{j=1}^{k} 1\{y_i = j\} \log \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}}, \tag{9}$$

$$P(y_i = j \mid x_i; \theta) = \frac{e^{\theta_j^{T} x_i}}{\sum_{l=1}^{k} e^{\theta_l^{T} x_i}}, \tag{10}$$

$$\nabla_{\theta_j} J(\theta) = -\frac{1}{m} \sum_{i=1}^{m} x_i \left( 1\{y_i = j\} - P(y_i = j \mid x_i; \theta) \right). \tag{11}$$
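A minimal NumPy sketch of equations (9)–(11); the learning rate and the max-subtraction for numerical stability are assumptions of the sketch.

```python
import numpy as np

def softmax_probs(Theta, X):
    # Equation (10): P(y = j | x) for each class j.
    # Theta: (n_classes, n_features); X: (n_samples, n_features).
    scores = X @ Theta.T
    scores -= scores.max(axis=1, keepdims=True)  # numerical stability
    e = np.exp(scores)
    return e / e.sum(axis=1, keepdims=True)

def gradient_step(Theta, X, y, lr=0.1):
    # One gradient-descent step on equation (9); the indicator 1{y_i = j}
    # becomes a one-hot matrix, and the gradient is equation (11).
    P = softmax_probs(Theta, X)
    onehot = np.eye(Theta.shape[0])[y]  # y holds integer class labels
    grad = -(onehot - P).T @ X / len(y)
    return Theta - lr * grad
```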

2.6. Sparse Autoencoder

Sparse representation can be defined as follows: when the value of an output neuron is close to 1, the neuron is said to be active, and when the value is close to 0, it is said to be inhibited. When a large number of hidden layer neurons are inhibited and only a small number are active, the representation is sparse; i.e., most components of the feature vector are 0. The mathematical expression is equation (12), in which $a_j(x_i)$ denotes the activation of the $j$th hidden neuron on the $i$th of the $N$ training samples, and the average activation $\hat{\rho}_j$ is required to be close to a small target value $\rho$ (e.g., $\rho = 0.05$). A sparsity penalty is added to the original objective function, giving the sparse-representation loss term of equation (13), where $n_2$ denotes the number of neurons in the hidden layer and $j$ indexes the hidden neurons. The penalty is the relative entropy (KL divergence) between two Bernoulli random variables with means $\rho$ and $\hat{\rho}_j$, as shown in equation (14). When the two variables are equal, the relative entropy equals 0; as the difference between them grows, the relative entropy increases toward infinity. Therefore, minimizing the penalty factor drives $\hat{\rho}_j$ toward $\rho$. The loss function of the resulting optimization problem is given in equation (15), where $\mu$ denotes the weight of the sparsity penalty:

$$\hat{\rho}_j = \frac{1}{N} \sum_{i=1}^{N} a_j(x_i) \approx \rho, \tag{12}$$

$$\sum_{j=1}^{n_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j), \tag{13}$$

$$\mathrm{KL}(\rho \,\|\, \hat{\rho}_j) = \rho \log \frac{\rho}{\hat{\rho}_j} + (1 - \rho) \log \frac{1 - \rho}{1 - \hat{\rho}_j}, \tag{14}$$

$$J_{\mathrm{sparse}} = J + \mu \sum_{j=1}^{n_2} \mathrm{KL}(\rho \,\|\, \hat{\rho}_j). \tag{15}$$
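A minimal sketch of the penalty of equations (12)–(15); the values of $\rho$ and $\mu$ are illustrative.

```python
import numpy as np

def kl_penalty(H, rho=0.05):
    # Equations (12)-(14): H holds hidden activations (n_samples x n2);
    # rho_hat[j] is the mean activation of hidden neuron j over the data.
    rho_hat = H.mean(axis=0)
    kl = (rho * np.log(rho / rho_hat)
          + (1 - rho) * np.log((1 - rho) / (1 - rho_hat)))
    return kl.sum()

def sparse_loss(reconstruction_error, H, mu=0.1, rho=0.05):
    # Equation (15): the original loss plus the weighted sparsity penalty.
    return reconstruction_error + mu * kl_penalty(H, rho)
```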

3. Data

3.1. Data Source

This study used two sets of hyperspectral image data for the experiments. Two different datasets were chosen to verify the reliability of the model and method and the generalizability of the experimental results; the two datasets differ in feature types and spatial resolution. They are airborne aerial hyperspectral image data from Pavia, Italy, and ground-based close-range hyperspectral image data acquired with the HySpex imaging spectrometer.

The first dataset is the publicly available hyperspectral image of the city of Pavia, Italy, acquired by the airborne Reflective Optics System Imaging Spectrometer (ROSIS-3). The image is 610 × 340 pixels (Figure 4(a)). The ROSIS-3 sensor covers a spectral range of 430–860 nm with 115 bands. The main categories are asphalt, bare ground, gravel, grass, metal sheets, brick, trees, and shadows.

The second dataset is ground-based close-range hyperspectral image data of the Chengdu University of Technology acquired with the HySpex imaging spectrometer. The image is 400 × 600 pixels (Figure 4(b)). The HySpex sensor has 1600 spatial pixels, a spectral range of 400–1000 nm, and 108 bands. The main features are water, vegetation, concrete roads, lava rocks, steel, glass, and walls.

3.2. Data Preprocessing

The ROSIS-3 data are already preprocessed and can be used directly in the experiments. The HySpex data were radiometrically corrected with the radiometric calibration module supplied with the HySpex imaging spectrometer, and reflectance inversion was performed by the flat-field method based on a statistical model, with a large concrete area used as the flat field. After preprocessing, the image data can be used in the experiments.

4. Results

4.1. Sample Selection

We used the two types of image data to select a total of eight feature classes. The sample data for each ground-object class in the ROSIS-3 data are listed in Table 1, and the spectral curves of the class samples are shown in Figure 4(a). The sample data for each ground-object class in the HySpex data are also listed in Table 1, and the corresponding spectral curves are shown in Figure 4(b). The selected samples were divided into training, validation, and test samples in a ratio of roughly 3 : 1 : 1. We used the training samples to adjust the trainable parameters of the model, the validation samples to adjust the hyperparameters of the model, and the test samples to evaluate the classification accuracy of the model (Figure 5).
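For concreteness, a minimal sketch of such a split; the exact 60/20/20 proportions and the random, unstratified selection are assumptions, as the text only states a roughly 3 : 1 : 1 ratio.

```python
import numpy as np

def split_samples(X, y, seed=0):
    # Shuffle the labeled pixels and cut them 3 : 1 : 1 into
    # training, validation, and test subsets.
    rng = np.random.default_rng(seed)
    idx = rng.permutation(len(y))
    n_train, n_val = int(0.6 * len(y)), int(0.8 * len(y))
    tr, va, te = idx[:n_train], idx[n_train:n_val], idx[n_val:]
    return (X[tr], y[tr]), (X[va], y[va]), (X[te], y[te])
```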

4.2. Classification Experiment Based on the Stacked Autoencoder (SAE)
4.2.1. Analysis of the Autoencoder (AE) Model

AEs with different numbers of hidden layer neurons were trained separately on the same training samples. AE training aims to make the reconstructed data as similar as possible to the original data after encoding and decoding, so the performance of a trained model can be analyzed through its ability to reconstruct the test samples. The AE encoded and decoded the first 100 features of the test samples to obtain the reconstructed features, and the original and reconstructed features were rendered as 10 × 10 pixel frames to visualize the image features. We selected representative asphalt and grass features in the ROSIS-3 data for the first experiment; the visualization results for features reconstructed with different numbers of hidden layer neurons are shown in Figures 6(a) and 6(b). We selected representative water body and concrete road features in the HySpex data for the second experiment; the corresponding visualization results are shown in Figures 6(c) and 6(d). The reflectance of asphalt stays at a stable, low level, so its image is mainly blue, and its reconstruction is best when the number of hidden layer neurons is 30. The reflectance of grass remains low below 700 nm and then jumps sharply at 700 nm before stabilizing, so its image is first blue and then transitions quickly to red; its reconstruction is best with 30–50 hidden layer neurons. In summary, the best reconstruction is achieved when the number of neurons in the AE hidden layer is 30–50, and AE performance is optimal in this range.

4.2.2. Comparison of Traditional Methods with AEs

Having identified a suitable number of hidden layer neurons for the AE, the next step is to verify whether the AE improves the classifier’s accuracy. The AE is compared with several commonly used dimensionality reduction and feature extraction algorithms. The classifiers connected after the feature extraction algorithms are an SVM classifier and the softmax regression classifier. The SVM kernel function is a radial basis kernel suitable for hyperspectral image classification; the kernel hyperparameter σ is set to 0.009, and the penalty parameter is 100. The learning rate of softmax regression is 0.1, and the number of optimization iterations is 500. The number of neurons in the first hidden layer is set to 40, and the number of neurons in the second layer is [5, 10, 15, 20, 25, 30]. The traditional dimensionality reduction and feature extraction algorithms selected are principal component analysis, minimum noise fraction rotation (MNF rotation), factor analysis (FA), and independent component analysis. The experimental results comparing the classification accuracy of the AE with the traditional methods on the ROSIS-3 data are shown in Figures 7(a) and 7(b). The analysis shows that the AE attains the highest classification accuracy when the SVM is connected, and the accuracy changes little as the number of features varies. The classification accuracies of factor analysis and principal component analysis are similar and stable but still significantly lower than that of the AE. When the softmax regression classifier is connected, the classification accuracy of the AE improves further and reaches its optimum (>90%) when the number of features is 20. The experimental results comparing the classification accuracy of the AE with the traditional methods on the HySpex data are shown in Figures 7(c) and 7(d). The analysis shows that the AE again attains the highest classification accuracy when the SVM classifier is connected; the accuracy increases slowly as the number of features increases but decreases significantly when the number of features becomes too large.

In contrast, the classification accuracies of factor analysis and principal component analysis remain stable but lower than that of the AE, and the accuracies of independent component analysis and minimum noise fraction are lower still. When the softmax regression classifier is connected, the classification accuracy of the AE improves and increases steadily, outperforming the other methods.

According to the analysis above, the AE performs significantly better overall than the other four feature extraction algorithms. The classification accuracy exhibits a more stable trend as the number of features changes. Additionally, the classification accuracy is higher when the softmax regression classifier is connected than when the support vector machine classifier is connected.

4.2.3. Analysis of the Impact of the Number of Hidden Layers in SAEs

The previous section demonstrated that the feature extraction ability of the AE is better than that of traditional feature extraction algorithms and that classification accuracy is higher when the AE is connected to the softmax regression classifier. Next, the SAE is constructed from AEs and the classification model is built by connecting the softmax classifier, and the effect of different numbers of AE layers on classification accuracy is compared.

One of the critical factors determining the performance of an SAE is the number of hidden layers, that is, the number of stacked AE layers. The number of AE layers determines what kind of features are extracted and plays a critical role in the final classification accuracy. When the number of layers is too small, only shallow features can be extracted, which limits image classification accuracy; as the number of hidden layers increases, more abstract feature representations can be obtained. However, when the number of layers is too large, the model may overfit [67]. This shows the impact of the number of AE layers on classification performance. From the previous section, it can be tentatively determined that better classification accuracy is obtained when the number of hidden layer neurons is 30–50. Thus, the number of hidden layer neurons is fixed at 40, and the number of AE layers is varied from 1 to 5; the reliability of the experiment is confirmed by repeating it several times under each condition, with the results represented by box plots. The classification accuracy of the SAE with different numbers of hidden layers on the ROSIS-3 data is shown in Figure 8(a), and Figure 8(b) shows the corresponding results for the HySpex data. As the number of AE layers increases from 1 to 3, the classification accuracy increases significantly and becomes gradually more stable; with further increases in the number of AE layers, the classification accuracy starts to show a decreasing trend. When the number of AE layers reaches 6, the classification accuracy decreases significantly and becomes unstable. The median classification accuracy is 92.12% for the ROSIS-3 data and 94.02% for the HySpex data.

4.2.4. Optimal Model Accuracy Evaluation

According to the experimental analysis, the optimal classification model is obtained when the number of hidden layer neurons is 40 and the number of AE layers is 3. Based on the conclusions of the previous section, the two experiments with classification accuracies of 92.12% on the ROSIS-3 data and 94.02% on the HySpex data were selected for accuracy evaluation. The confusion matrix of SAE-SR classification accuracy on the ROSIS-3 data is shown in Table 2. The analysis shows that the classification model recognizes metal sheets, grass, and trees well, performs worse on asphalt, gravel, and bare ground, and performs moderately on bricks and shadows; asphalt, bare ground, gravel, and bricks are easily confused with one another, as are shadows and trees. The confusion matrix of SAE-SR classification accuracy on the HySpex data is shown in Table 3. Except for steel plates and concrete roads, the other features are recognized well; steel plates and walls are easily confused.

4.3. Classification Experiment of Classification Model Based on Deep Belief Networks
4.3.1. Analysis of the RBM Model

Using the ROSIS-3 and HySpex image data, we analyzed the RBM, the structural unit of the deep belief network, and the influence of the number of hidden layer neurons on its performance. The RBM hyperparameters were set according to the hyperparameter selection advice given in [68], with a learning rate of 0.1, and the number of hidden layer neurons was set to [10, 30, 50, 70, 90]. RBMs containing different numbers of hidden layer neurons were each trained on the same training samples until the algorithm converged. In this section, the model’s performance is judged by its ability to reconstruct the test samples after training.

We compared the spectral curves of the original samples with the reconstructed spectral curves under different experimental parameters; the differences in the reconstruction ability of RBMs with different numbers of hidden layer neurons can be compared visually in Figure 9. Representative asphalt and grass features in the ROSIS-3 data were selected for the experiment, and the resulting spectral curves are shown in Figure 9. When the number of hidden layer neurons is 30, the reconstructed asphalt and grass spectral curves are most similar to the originals, and the reconstructed concrete road spectral curve from the HySpex data is likewise most similar to the original. In summary, the performance of the RBM is optimal when the number of hidden layer neurons is 30; too many or too few hidden layer neurons substantially degrade performance.

4.3.2. Comparison of the RBM with the Traditional Method

The previous analysis of the impact of the number of hidden layer neurons found that the RBM performs best with 30 hidden neurons. The best-performing RBM was then compared with the AE and with traditional feature extraction algorithms. According to the conclusions of the experiments in Section 4.2.2, the two traditional feature extraction algorithms with the higher classification accuracy were factor analysis and principal component analysis, so only these two are used as the traditional algorithms in this section. In the experiments, a two-layer RBM network was constructed with 30 neurons in the first hidden layer and [5, 10, 15, 20, 25, 30] neurons in the second layer; a two-layer AE network was constructed with 40 neurons in the first hidden layer and [5, 10, 15, 20, 25, 30] neurons in the second layer; and the traditional dimensionality reduction methods (principal component analysis and factor analysis) extracted [5, 10, 15, 20, 25, 30] features. The number of extracted features was thus consistent across all methods, and each method was connected to two classifiers: the support vector machine and the softmax regression classifier. The kernel function of the support vector machine was a radial basis kernel suitable for hyperspectral image classification; the kernel hyperparameter σ was set to 0.009, the penalty parameter was 100, the learning rate of the softmax regression was 0.1, and the number of optimization iterations was 500. The experimental results comparing the classification accuracy of the RBM with the other methods on the HySpex data are shown in Figure 10. The analysis shows that the classification accuracy of the RBM is significantly higher than that of factor analysis and principal component analysis and is more stable as the number of features changes; however, the classification accuracy of the RBM is overall inferior to that of the AE.

4.3.3. Analysis of the Impact of the Number of Hidden Layers in DBNs

From the conclusion of the previous section, it can be concluded that the feature extraction ability of the RBM is better than that of the two feature extraction methods of factor analysis and principal component analysis, and its feature extraction ability is inferior to that of the AE. The next step is to build a deep belief network by the RBM and connect the softmax classifier to construct the classification model.

The number of hidden layers of the deep belief network determines whether appropriate features can be extracted, which significantly impacts the final classification accuracy. When the number of layers is too small, only shallow features can be extracted, which limits the classification accuracy; when the number of layers is too large, overfitting can occur. From the previous section, it can be preliminarily judged that good classification accuracy is obtained when the number of hidden layer neurons is 30. We therefore fix the number of hidden layer neurons at 30 and vary the number of hidden layers from 1 to 5; the reliability of the experiment is confirmed by repeating it several times under each condition, with the results represented by box plots. The classification accuracy of the deep belief network with different numbers of hidden layers on the ROSIS-3 data is shown in Figure 11(a), and that on the HySpex data is shown in Figure 11(b). The median classification accuracy is 88.22% for the ROSIS-3 data and 92.16% for the HySpex data.

4.3.4. Optimal Model Accuracy Evaluation

According to the experimental analysis of the number of hidden layer neurons and the number of RBM layers, the optimal classification model is obtained when the number of hidden layer neurons is 30 and the number of RBM layers is 2. Based on the conclusions of the previous section, the two experiments with classification accuracies of 88.22% on the ROSIS-3 data and 92.16% on the HySpex data were selected for accuracy evaluation. The confusion matrix of DBN-SR classification accuracy on the ROSIS-3 data is shown in Table 4. The analysis shows that, compared with the SAE, the deep belief network recognizes bare ground and bricks significantly less well and that the overall classification accuracy also decreases somewhat. The confusion matrix of DBN-SR classification accuracy on the HySpex data is shown in Table 5; the analysis shows a significant degradation in the recognition of lava rocks compared with the SAE.

4.4. Comparison of Two Optimal Models

According to the experimental results of the previous two sections, the optimal SAE model has three hidden layers with 40 neurons each, and the optimal DBN model has two hidden layers with 30 neurons each. Comparative analysis of the box plots and confusion matrices shows that the SAE has higher classification accuracy and recognizes each feature more accurately and stably. It can be concluded that the classification model based on the SAE and the softmax regression classifier is the optimal classification model in this experiment. The optimal models of both algorithms were used to classify the entire images; the classification results for the ROSIS-3 and HySpex data are shown in Figure 12. The analysis shows that, on the ROSIS-3 data, the SAE-SR classification model misclassifies some shadows and trees and some bare ground, grass, and trees. The DBN-SR classification model misclassifies bricks and gravel and confuses shadows, trees, asphalt, and bricks, with lower recognition accuracy for gravel and bricks than the SAE-SR model. On the HySpex data, the SAE-SR classification model confuses vegetation, water, glass, and steel plates with walls; the DBN-SR classification model shows a similar pattern, but its misclassification is more severe.

4.5. Optimization

According to the above box plot and accuracy evaluation analysis, we devised an optimization strategy that replaces the SAE with a stacked sparse autoencoder. The numbers of hidden layer neurons and hidden layers were kept at the optimal values obtained above, and the accuracy was compared with that of the ordinary SAE. The accuracy comparison is shown in Table 6, which shows that classification accuracy is improved by using the stacked sparse autoencoder.

The deep learning algorithm is computationally complex and takes a long time to train because the model, a deep neural network, has many parameters, and many floating-point and matrix operations are performed during the parameter-tuning phase of training. The model computation in this paper therefore used GPU parallel computing; Figure 13 shows the resulting analysis of classification speed under GPU parallel computing. The computer used in the experiments had an Intel i7-7500U processor, a GeForce GTX 960M graphics card, and 8 GB of RAM; the software environment was Ubuntu with CUDA 8.0 and cuDNN 5. The model’s training times on the two image datasets (ROSIS-3 and HySpex) were reduced from 171 seconds to 23 seconds and from 321 seconds to 45 seconds, respectively, significantly speeding up image classification.
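The paper does not name the software framework used; the following PyTorch fragment only illustrates the general pattern of moving model parameters and data onto the GPU so that the matrix operations of training run in parallel (the layer sizes are illustrative assumptions).

```python
import torch

# Select the GPU when one is available, otherwise fall back to the CPU.
device = torch.device("cuda" if torch.cuda.is_available() else "cpu")

# Stand-in for the SAE-SR model: assumed sizes, not the paper's architecture.
model = torch.nn.Sequential(
    torch.nn.Linear(103, 40), torch.nn.Sigmoid(),
    torch.nn.Linear(40, 8),
).to(device)  # parameters now live in GPU memory

x = torch.rand(1024, 103, device=device)  # a batch of pixel spectra on the GPU
logits = model(x)                         # the forward pass runs on the GPU
```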

5. Conclusions

In this paper, two deep learning models, the SAE and the DBN, were applied to hyperspectral image classification to address some of its problems, with the classification model formed by connecting a softmax regression classifier. Multiple experiments were conducted by varying the number of hidden layer neurons of the AE and the RBM. The reconstructed image features were visualized and compared with the original image features, and the best reconstruction was obtained with 30–50 hidden layer neurons for the AE and 30 hidden layer neurons for the RBM. The AE and RBM were compared with traditional feature extraction algorithms, and we experimentally verified that classification algorithms based on the AE and the RBM both outperform those based on traditional feature extraction. The AE was then stacked to form an SAE and the RBM to form a deep belief network; that is, single-layer neural networks were stacked into deep networks. Experimental analysis showed classification accuracies of 92.12% and 94.02% with a three-hidden-layer SAE and 88.22% and 92.16% with a two-hidden-layer deep belief network, so the optimal classification model was found to be the SAE. The classification accuracy was further improved to 93.41% and 94.92% by modifying the model into a stacked sparse autoencoder. To shorten the long training time of the deep learning algorithm, GPU computing was applied to the classification model, improving classification speed by more than seven times.

Through comprehensive experimental analysis, we found that hyperspectral image classification models based on deep learning algorithms can obtain better classification accuracy than traditional classification models. However, some problems with deep learning algorithms still need to be addressed. First, the model hyperparameters, including the numbers of neurons and layers in the hidden layers, have a large impact on classification accuracy; that is, parameter selection is complicated and strongly affects classification performance. Second, deep learning algorithms are complicated, and even under the same hardware conditions, their performance varies considerably. How to improve the accuracy of the optimization algorithm while reducing its computing time needs further study.

Data Availability

The data used to support the findings of this research are included within the article.

Conflicts of Interest

The authors declare that they have no conflicts of interest.

Acknowledgments

This work was supported by the National Key Research and Development Program of China (Grant nos. 2021YFB3900505 and 2021YFC3000401), the National Natural Science Foundation of China (Grant no. 41941019), the Sichuan Mineral Resources Research Center (Grant no. SCKCZY2021-ZC003), the Sichuan Provincial Department of Education Humanities and Social Sciences (Zhang Daqian Research) Key Project (Grant no. ZDQ2021−01), the Open Foundation of Sichuan Center for Disaster Economic Research (Grant no. ZHJJ2021-ZD001), and the Industry-School Cooperative Education Program of the Ministry of Education (Grant nos. 202101162001 and 202102245035).