Abstract

For the existed biological degenerate primer design problem, a solution method based on an ant colony optimization algorithm is proposed, in which an optimization model considers primer coverage as well as degeneracy while allowing for a small number of base mismatches. Using the construction graph model, the algorithm makes full use of the dynamic change information during the iterative optimal path optimization process, strengthens the explored new path pheromone, and selects the primers that meet the constraints. The results of DNA template sequence matching experiments show that the degenerate primers designed by the proposed algorithm outperform existing approaches.

1. Introduction

Degenerate primers are a mixture of different sequences encoding all different base probabilities of a single amino acid, and their degeneracy is defined as the number of combinations of all sequences at one or more positions in a PCR primer [1]. They are the primers with a length of k, whose degeneracy is most at d, matching at least m strings in a set of n strings (m represents the coverage of the primer).

Under the condition of certain specificity, degenerate primers should have the lowest degeneracy and the highest coverage. In PCR primer design, the length of primers is mostly fixed; therefore, only coverage degree m and degeneracy degree d need to be optimized [25]. However, as a consequence of incoherence between sequences and inadequacies in the fitness evaluation function, this method frequently results in high degeneracy of primers. Therefore, while designing primer pairs, some rather than all of the target DNA sequences are usually chosen to match. Consequently, this paper focuses on a degenerate primer design (DPD) problem that minimizes the degeneracy and maximizes coverage while allowing for a few mismatches. For a primer S, if its uncovered degree and degeneracy are computed by functions Γ(S) and Ψ(S), then its quality can be computed by the following formula.

The mu and phi are weight parameters used to adjust the importance of coverage and degeneracy. The degeneracy degree is the total number of sequence combinations contained within the primer at one or more positions of a degenerate PCR. The computational procedure of the degeneracy degree is described in detail in [6]. Moreover, the constraints [11] related with primer length, primer pair differential length, primer melting temperature, primer pair differential melting temperature, GC content, 3′end constraint, hairpin structure constraint, and double chain structure constraint are also considered in this work.

The existing related methods for this problem mostly are classical heuristic ways. For example, dynamic pattern matching [7], two-phase MIPS algorithms [8], iterative beam search [9], and HYDEN are based on the approximate algorithm [15]. These methods have been included in some well-known primer design tools [1214], and some of them have been parallelized [16]. In recent years, swarm intelligence optimization algorithms [1721] have proven to be an effective way to tackle difficult optimization problems. The genetic algorithm is also used for the primer design problem, such as the single PCR design [10] and multiple PCR design [11], and get competitive performance. But few swarm intelligence algorithms have been applied to solve the constraint DPD problem.

This research presents an original way to DPD using an ant colony optimization algorithm with an enhanced pheromone update mechanism. With the help of the construction graph model, the constraints of DPD are dynamically satisfied during the construction process. The primer pairs are finally screened by analyzing and comparing the two indexes of degeneracy and coverage with existing methods.

2. Ant Colony Optimization Model for Degenerate Primer Design Problem

In this paper, an enhanced pheromone updating mechanism is proposed for ant colony optimization—ACOH. To optimize the pheromone update and improve convergence speed without compromising quality, the algorithm fully utilizes the dynamic information used in the iterative optimal path optimization process.

To the original pheromone matrix, an incremental matrix is added. All ants in a colony share the same pheromone incremental matrix. IOSolution represents the optimal (best) solution set so far; BIOSolution represents the better (better) solution set, and VInformation represents the differences between them. Using the pheromone dynamic update mechanism can improve the efficiency of the solution based on the existing pheromone update mechanism.

The algorithm generates an iterative optimal solution, known as the IOSolution, through several iterations and then compares it with the iterative better solution BIOSolution generated by the algorithm determining the dynamic change information VInformation, which indicates which path side arcs are newly explored. Then, when the original pheromone is updated, we apply additional dynamic pheromone enhancement to this modified information. The increment is stored in the pheromone increment matrix and set to the reciprocal of the iterative better path fitness evaluation function value. In this way, the probability of ants visiting the side arc in the future may be improved, and the diversity of understanding can be expanded to some level; consequently, the run time of the algorithm is reduced, improving its convergence rate and solution efficiency.

The method, with the assistance of the construction graph model (CGM), used ACOH to complete the design of degenerate primers. The algorithm focuses on DPD with maximum coverage and minimum degeneracy.

The number of layers of the CGM depends on the length of primers and generally ranges from 18 to 26 layers. After the 18th layer, an empty node is set in each layer, which facilitates completing the primer design and exiting the program. Each layer is a combination of 15 base pairs including [ATGC], [GTA], [GCA], [GCT], [CTA], [GT], [GA], [CG], [CA], [CT], [AT], [A], [T], [G], and [C]. The final combination must meet a number of limitations and basic pairing requirements and iterate repeatedly to obtain the best primer pair, or it stops here at the maximum iteration after a series of iterations. With the restrictions of coverage and degeneracy, as well as the principles of base pairing, each layer selects the appropriate base combination branch. It iterates circularly until the suitable primer pair is searched, and the results are output, or the maximum iteration time is reached. Then, the output results are screened and filtered by the constraints of multiple primer designs. The primer pairs that finally meet the requirements of constraints and various parameters are output, such as position, length, annealing temperature of up primers and down primers, the content of GC in primer pairs, the judgment of complementarity and specificity between primers, and so on.

The algorithm consists of five stages: initialization, constructing solutions, updating pheromones, evaluating solutions, and judging the outcome of the algorithm. The whole process of solving is shown in Figure 1.

In order to make the ACOH algorithm applicable to DPD, the initialization, constructing solutions, and judging the outcome of the algorithm were redesigned accordingly, and the other parts were completed using the methods in the ACO algorithm [22]. The details of the redesign are as follows:

In the construction of the initial graph model, each ant randomly selects an initial node and has fifteen choices for the following node. According to the input template DNA sequence and base pairing principle, ants roughly screen out the required base combinations. Following that, a roulette wheel is used to select the next node that will be visited using the probability state transition formula. Here, the ant constructs the solution matrix using a three-dimensional matrix, because in the first 18 nodes, each node initially has 15 choices. To distinguish one node from the same base combination in another node, a sentinel identification bit is specially established. For example, the third node has a base combination of [GC], and the fourth node also has a base combination of [GC]. Some ants are just on the second node, facing the same base combination [GC], which makes it hard to distinguish. At this point, the sentry of the identification position plays a key role in the primer design process, and the third node is selected to ensure order. In this way, these nodes are selected in turn until the 18th node is reached. The length of primers is generally 18–26 bp; therefore, an empty node is set on the 18th through 26th base combinations in addition to the above 15 combinations, which facilitates the exit of the path construction at any time after one iteration. After that, each ant completes the path construction once, which represents the completion of the design of the initial primer pair. Then, each path is evaluated to screen out an iterative optimal path. The specific process will be introduced in detail in the later evaluation stage. After the pheromone is updated in this path, the next iteration is continued. Repeat until the maximum amount of iterations has been reached, or fitness requirements have been met.

In the process of degenerate primer design, the introduction to the second part reveals that constraints are crucial for the final primer pair design. Hard constraint conditions and soft constraint conditions are the two types of constraint conditions, with the hard constraint condition referring to the need that the primer pair must meet. Otherwise, the primer pair cannot be formed at all. For instance, the length of the primer is generally between 18 and 26 bases. The difference between the lengths of the up primer and down primer should be strictly less than 3, and so on. Soft constraints are a type of degenerate primer design rule that includes phrases such as “at most” and “at least.” In the design of degenerate primers with maximum coverage, for example, the degeneracy is required to be d at most.

The ants use the roulette wheel instead of the node with the highest probability to choose the next node in their path and avoid falling into a local optimum. The simplified view of the construction graph model is shown in Figure 2.

At the construction stage, heuristic factors and pheromone factors associated with the constraints guide the ants toward selecting the next base or the next base combination in the primer sequence. The heuristic factors play a vital role, especially in guiding the choice of the next node. The probability state transition formula is shown in the following formula:where i and j, respectively, represent the starting node; τij is a pheromone factor, which mainly includes two operations: pheromone volatilization and pheromone enhancement. The pheromone update strategy used in this paper is as follows: after the kth ant completes a path search, the pheromone is updated for all paths on the route, and the pheromone increment is related to the overall route of this search, which is a global information update. The detailed update process of τij is described in [22]; ηij is a heuristic factor; the weights of the two factors ηij and τij are denoted by β and α, respectively; allowed k is a collection of nodes that ants have not accessed. After heuristic analysis, the following values can be set:where Degeneracyij denotes the degeneracy of edge arc ij and Convergeij denotes the coverage of edge arc ij. As the study in this paper uses a maximum coverage degeneracy primer design, the degeneracy of primers is constant, and the greater coverage represents the better experimental results. Therefore, the heuristic factor η is larger, and the probability of the side arc being selected again is relatively large under the same conditions.

In order to meet different constraints, a local search process called LocalSearch is designed in this paper, including three ideas pretreatment before optimization, processing in the path construction process, and constraint processing after the end of path construction. Next, the situation will be introduced. First, some constraints need to be preprocessed before optimization. GC Clamp, for instance, can detect whether the two bases at the end of the 3′end of primer S are G or C. However, considering another constraint condition, the bases should be distributed randomly; it means that polypurine or polypyrimidine should not appear, and no more than three consecutive G or C should occur at the 3′end of primer S. Thus, three bases can be set at the end of the 3′end of primers, and their combination range must be restricted. Second, some constraints need to be dealt with in the optimization process. For instance, the length of primers is generally between 18 and 26 bases, and this information was covered in detail in the process above on constructing solution paths, so it would not be repeated here. Third, some constraints need to be addressed near or after optimization. For example, the difference between the up primer's length and the down primer's length should be strictly less than 3. After designing the up primer, its length is assumed to be 21, and in the process of finding the down primers, the length range of the down primer is narrowed down to 18–24. This method significantly reduces the search space and enhances the efficiency of the process when dealing with thousands of DNA template sequences. Furthermore, after the final design of primer pairs has been determined, it is also essential to judge the hairpin structure and complementary sequence between primer pairs. If there is a hairpin structure or complementary sequence, the primer pairs must be adjusted further.

A degenerate primer evaluation is also a fundamental stage of ACOH. After a certain number of iterations, many ants generate solution paths, respectively, which involve comparing and analyzing multiple paths (actually primers). In order to express more clearly, this paper simplifies multiple paths into two paths for comparison. The evaluation of the solution path is analyzed and discussed in three cases: the two solution paths meet constraint conditions; one meets the constraint, the other does not; neither one meets the constraint. Next, the above three cases will be introduced, respectively. First, when both paths meet many constraints, they can be divided into two cases according to whether the coverage is the same. The soft constraint condition degeneracy can be used to assess when the coverage is the same. Eventually, the solution path with less degeneracy is taken. When the coverage is different, on the other hand, the result uses the solution path with greater coverage since experiments seek the design of degenerate primers with maximum coverage. Second, one of the two fits the constraint requirements, while the other does not by obviously choosing the solution path that meets the constraint conditions or meets more constraint conditions. Third, when both solutions fail to meet the constraint conditions, the final result discusses two cases based on whether the number of constraints violated by the two solution paths is the same. When two solution paths have the same amount of unsatisfied constraints, the solution chooses the one with greater coverage and then adjusts the primer pair to meet the constraints as much as possible. In contrast, when the number of constraints violated by two solution paths differs, one with fewer constraints violated by the two solution paths should obviously be selected. The specific execution process of the ACOH algorithm to solve the DPD problem is shown in Algorithm ACOH for Degenerate Primer Design pseudo-code, which is as follows:

ACOH algorithm complexity is as follows:

When constructing a complete primer pair, calculating the state transition probability formula requires O(nm) operations. In order to improve the quality of the constructed graph, O(m) operations are necessary. It takes O(n2) operations to complete pheromone updates after each global path optimization. Therefore, the total time complexity is

Nc, n, and m correspond to the number of iterations of the algorithm, the number of vertices in the graph, and the total number of ants, respectively. Thus, time complexity increases, with the size of the problem. The total complexity of space is

3. Comparison and Experimental Analysis

100 sets of DNA template sequences for molecular biology are the experimental subjects of this paper. This algorithm experiment was carried out on a PC (3.40 GHz CPU and 16 GB RAM) using the eclipse. Java was selected as the programming language. The algorithms were run 30 times under all test cases independently, and the experimental results data were processed and analyzed for algorithms’ comparison.

3.1. Parameter Adjustment

For designing degenerate primers, an ant colony optimization algorithm considers a number of parameters, including α, β, ρ, mu, and phi. The parameters are shown in Table 1 as follows:

A significant number of tests are required to determine the exact value of coverage mu, as well as the value of degeneracy phi, within a certain range. The final experimental results are depicted by a box-line plot, which shows that how many restrictions in the vertical axis cannot be satisfied for a given horizontal axis value, as can be seen in Figures 3 and 4.

By analyzing and comparing four typical data sets in Figure 3, it appears that at a parameter mu of 10, the boxplot distribution is more uniform, and the minimum value is the most optimal. In Case2, for instance, when mu = 10, the minimum, maximum, lower quartile, median, and upper quartile values are clearly better than the other values. In Case3, however, it has a superior upper quartile, median, and lower quartile than those in the two parametric cases, even though the minimum value at mu = 10 is the same as at mu = 5 or 10.

According to the analysis and comparison of the boxplots in Figure 4, there is an even distribution of the boxplot when the value of parameter phi is 3, and the minimum value at this time is the most optimal.

3.2. Experimental Comparative Analysis

To make the experiment more convincing, the experiment results of degenerate primer design through the ant colony optimization algorithm are compared with those determined through Primer 3 and Oligo 7 [23]. The final test results need to be compared by two comparative analysis methods, i.e., descriptive statistical analysis and convergence analysis, to finally draw reliable conclusions.

3.2.1. Comparative Study Based on Convergence Analysis

The convergence analysis focuses on comparing the values of the fitness evaluation function of the standardized treatment for a certain number of fitness evaluations. ACOH, software Primer 3, and Oligo 7 solved the convergence curve of degenerate primer design on the DNA template of 500 bps and 600 bps, respectively, as shown in Figure 5.

As shown in Figure 5, on a 500 bps DNA template, Primer3 converged approximately 300 times with a minimum fitness evaluation function of 3.245. Oligo7 converged approximately 600 times with a minimum fitness evaluation function of 3.218. ACOH also converged approximately 600 times with a minimum fitness evaluation function of 3.204. Although Primer3 converges the earliest, it has the highest fitness evaluation function value. For the same number of fitness evaluations, the normalized fitness evaluation function value of ACOH is the smallest among these three methods. ACOH has more advantages in this kind of problem since this section is to solve the minimum meeting constraint optimization problem. Similarly, ACOH has also shown an advantage on the DNA template of 600 bps.

3.2.2. Analysis and Comparison Based on Descriptive Statistical Methods

The up primer pair, the down primer pair, and many constraint parameters were compared at multiple angles. Table 2 and Figure 6 summarize the experimental results.

As can be seen from Table 2 and Figure 6, no matter 500 bps or 600 bps, the ACOH model performs best at degeneracy and convergence. In addition, at 500 bps, the data of up primer GC-clamp and down primer GC-clamp of the three models are [C] [C] [G] and [GG] [GG] [C]. At 600 bps, the up primer GC-clamp and down primer GC-clamp of the three conditions were [GG][G][CC] and [CG][CG][GG]. Self-annealing and pair-annealing are no in all conditions. DPD solved by the ant colony optimization algorithm with the construction graph model is obviously lower than that solved by Primer 3 and Oligo 7. Moreover, the primer pair is specifically compared with Primer 3, which is inseparable from the preprocessing of the search space before constructing the solution path. Finally, only the condition optimized by ACOH has specificity.

Following that, four DNA template sequences, such as 100 bps, 200 bps, 300 bps, and 400 bps, will be used as test cases, with the corresponding primer pairs and constraint parameters acquired. As shown in Table 3 and Figure 7, DNA templates of 100 bps and 200 bps provide the same level of coverage as those of 300 bps and 400 bps, but their degeneracy is relatively smaller. There is a correlation between the number of base pairs in the DNA sequence and the excellent search performance of the ant colony algorithm. In addition, the data of up primer GC-clamp and down primer GC-clamp of the four conditions are [GC] [G] [GG] [G]and [GG] [G] [G] [G]. Both self-annealing and pair-annealing were no in all models, but all models had specificity.

4. Conclusion

DPD is a hot topic in current research. Multiple limitations must be met by the candidate primers identified in this study. In addition, these primers should have a large amount of coverage and a small amount of degeneracy while allowing a minority of base mismatches. As a result, the research develops a DPD optimization model and proposes a solution to the problem based on the notion of ant colony optimization. Results of the experiments suggest that primers designed by this method have obvious effects on coverage, degeneracy, specificity, and stability and finally improve the solving efficiency. Besides, candidate primers are scored in this paper. Scoring is the process of weighing degeneracy and coverage into a single index, which necessitates the weight value before calculation. In practice, more pre-experiments are needed to determine the best weight values. As a result, in our next research, we will investigate how to create a multiobjective optimization model for the DPD and find an effective solution method. The solution method can give multiple candidate degenerate primer design schemes at the same time, which do not dominate each other.

Data Availability

The data used to support the findings of this study are available from the corresponding author upon request.

Conflicts of Interest

The authors declare that there are no conflicts of interest regarding the publication of this paper.

Acknowledgments

This work was supported by Liaoning Province Doctoral Research Start-up Fund Project (no. 2020-BS-152). This work was supported by Xingliao Talent Plan (no. XLYC1902095). This work was supported by Scientific Research Young Talents Project of Liaoning Education Department (no. LJKZ0266).