1. Support Vector Regression
Developed from the Support Vector Machine (SVM), SVR is a statistical machine learning method used primarily for regression prediction [36]. Analogous to the separating hyperplane constructed by SVM for classification, the objective of SVR is to construct a hyperplane in a high-dimensional space for the input sample $x$. Its definition is shown in Eq.
(1).
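In its commonly used linear form, this regression hyperplane can be written as

$$f\left(x\right)=w^{T}x+b,$$

which is the form assumed in the discussion below.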
where $w$ is the normal vector that determines the direction of the hyperplane, and $b$ is the displacement term that determines the offset of the hyperplane from the origin [37]. This construction ensures that the samples closest to the hyperplane are kept as far from it as possible.
If all samples fall exactly on this hyperplane, the prediction error is zero, signifying that a perfect prediction function has been achieved. However, since it is unlikely that all samples fit the defined linear function, a constant $\xi$ is introduced in SVR as an allowable tolerance on the approximation error for each input $x$. In regression prediction, this predefined interval $\xi$ is the tolerable deviation on both the upper and lower sides of the hyperplane. Samples whose values fall within the range from $w^{T}x+b-\xi$ to $w^{T}x+b+\xi$, as shown in Fig. 2, incur a loss that is treated as zero. In this way, the samples closest to the hyperplane are kept as far from it as possible.
Fig. 2. Schematic diagram of support vector regression.
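This tolerance corresponds to the so-called $\varepsilon$-insensitive loss; written with the symbol $\xi$ used here in place of the usual $\varepsilon$ (a reference form, assuming the standard SVR loss), it reads

$$L_{\xi}\left(f(x),y\right)=\begin{cases}0, & \left|f(x)-y\right|\le\xi\\ \left|f(x)-y\right|-\xi, & \text{otherwise.}\end{cases}$$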
In order to ensure generalization ability, it is crucial to choose a value of $\xi$ that allows most samples to incur a loss close to zero. If $\xi$ is too small, the model is forced to fit the data very tightly and the risk of overfitting increases significantly; conversely, an overly large $\xi$ can lead to underfitting [38]. Therefore, slack variables $\zeta$ are introduced, each representing the distance by which a sample lies outside the boundary defined by $w^{T}x+b-\xi$ and $w^{T}x+b+\xi$.
The SVR model can be defined as
this model needs to satisfy the constraints
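For reference, the standard $\varepsilon$-SVR primal problem, written with the tolerance $\xi$ and the slack variables $\zeta_{i}$, $\zeta_{i}^{*}$ used in this section (a sketch of the usual formulation, which the numbered equations presumably follow), is

$$\min_{w,\,b,\,\zeta,\,\zeta^{*}}\ \frac{1}{2}\left\lVert w\right\rVert^{2}+C\sum_{i=1}^{n}\left(\zeta_{i}+\zeta_{i}^{*}\right)$$

$$\text{s.t.}\quad\left(w^{T}x_{i}+b\right)-y_{i}\le\xi+\zeta_{i},\quad y_{i}-\left(w^{T}x_{i}+b\right)\le\xi+\zeta_{i}^{*},\quad\zeta_{i},\ \zeta_{i}^{*}\ge 0.$$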
where ${C}$ is a regularization constant used to balance margin maximization against the training error, and $x$ represents the original sample data. A larger ${C}$ makes the model more inclined to reduce the training-set error, increasing model complexity and tightening the fit to the training data. However, excessive focus on the training set may overlook the overall patterns, leading to overfitting. Conversely, a smaller ${C}$ allows higher tolerance for training-set errors, yielding a smoother and simpler function. Nevertheless, this may result in underfitting, as the model fails to capture the data characteristics effectively. The minimization problem can be transformed into a dual problem by introducing Lagrange multipliers into Eq. (2) and constructing the Lagrangian function. The dual constraints are obtained by setting the partial derivatives with respect to $w$, $b$, $\zeta_{i}$, and $\zeta_{i}^{*}$ to zero.
By combining these constraints with the Sequential Minimal Optimization (SMO) algorithm, the SVR model can be defined as [39]
where $\alpha_i$ and $\alpha^{*}_i$ are Lagrange multipliers. The kernel function appearing in this equation is inherited from SVM; its definition is given by Eq. (7).
Fig. 3. The two-dimensional data is projected into the three-dimensional feature space
by $\phi $.
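In a standard formulation (a reference sketch of the usual dual-form result, written with the symbols used in this section), the resulting regression function and the kernel are

$$f\left(x\right)=\sum_{i=1}^{n}\left(\alpha_{i}-\alpha_{i}^{*}\right)k\left(x_{i},x\right)+b,\qquad k\left(x_{i},x_{j}\right)=\phi\left(x_{i}\right)^{T}\phi\left(x_{j}\right),$$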
where $\phi$ is the mapping function, which maps the input ${x}_{i}$ into a higher-dimensional feature space and makes linear separation more likely, as shown in Fig. 3; and ${k}(x_{i}, x_{j})$ is a kernel function that computes the inner product of feature vectors in the high-dimensional feature space, enabling accurate predictions even when the samples cannot be linearly separated. Thus, SVR can be modified as
The RBF kernel is one of the available kernels, alongside Linear, Polynomial, Sigmoid,
and others. It has been found to exhibit superior performance in various application
areas, including small-signal model prediction [40,41]. The definition of the RBF kernel is given by Eq. (9).
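In its usual form, with $g$ denoting the kernel width parameter discussed below (a reference form that Eq. (9) presumably follows), the RBF kernel is

$$k\left(x_{i},x_{j}\right)=\exp\left(-g\left\lVert x_{i}-x_{j}\right\rVert^{2}\right),$$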
where $g$ represents the impact of a single sample on the hyperplane. A larger $g$ value enhances the model's ability to fit local data points, making the decision boundary more complex and sensitive. When $g$ is too large, it causes overfitting: the model
overemphasizes the local details of training data, leading to poor performance on
test data. Even if the training error is minimal, the model's generalization capability
significantly declines. Conversely, a smaller $g$ value strengthens the model's global
fitting ability, yielding a smoother and simpler decision boundary. When $g$ is too
small, the model becomes overly smooth, failing to capture complex data patterns and
resulting in large errors in both training and testing. The selection of $g$ has been
extensively analyzed in the literature, aiming to achieve the best generalization
ability [42]. As $g$ increases, individual samples are more likely to become support vectors,
leading to a greater number of support vectors. Consequently, adjusting $\xi$ and
$\zeta$ is crucial when using the SVR model for predictions, ensuring globally optimal
prediction results and minimizing risk.
The accuracy, generalization ability, and small-sample characteristics of the SVR model all rest on optimal parameter selection. For an RBF kernel, the parameters that have the greatest impact on prediction performance are $C$ and $g$. Therefore, the Grey Wolf Optimization (GWO) algorithm is adopted to obtain the optimal values of $C$ and $g$, which saves computation time while improving reliability.
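As an illustration of how these two parameters enter the model, the following minimal sketch fits an RBF-kernel SVR with scikit-learn; the training data and the particular values of C, gamma (the $g$ of this section), and epsilon (playing the role of $\xi$) are placeholders, not values used in this study.

```python
import numpy as np
from sklearn.svm import SVR

# Placeholder training data: x_train is the input matrix, y_train the target values.
rng = np.random.default_rng(0)
x_train = rng.uniform(0.0, 10.0, size=(50, 1))
y_train = np.sin(x_train).ravel() + 0.1 * rng.standard_normal(50)

# RBF-kernel SVR: C is the regularization constant, gamma corresponds to g,
# and epsilon acts as the tolerance interval (the xi of this section).
model = SVR(kernel="rbf", C=10.0, gamma=0.5, epsilon=0.05)
model.fit(x_train, y_train)

# Predict at new inputs; support_ lists the indices of samples that became support vectors.
x_test = np.linspace(0.0, 10.0, 20).reshape(-1, 1)
y_pred = model.predict(x_test)
print(len(model.support_), "support vectors; first prediction:", y_pred[0])
```

Larger C or gamma values tighten the fit to the training samples, while smaller values yield a smoother function, mirroring the behavior of $C$ and $g$ described above.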
2. Grey Wolf Optimization
GWO simulates the hunting behavior of grey wolves using mathematical algorithms. The
grey wolf population is divided into four distinct groups: alpha ($\alpha$), beta
($\beta$), delta ($\delta$), and omega ($\omega$). The alpha wolves act as leaders
and are responsible for making decisions, while the beta wolves assist the alpha group
in decision-making processes. The delta wolves hold the third-highest position
in the grey wolf social hierarchy and are required to submit to the dominance of the
higher-ranking wolves. The remaining wolves, categorized as omega, hold the lowest
rank within the hierarchy. As illustrated in Fig. 4, the tracking, surrounding, and attacking behavior of the omega wolves is directed
by the alpha, beta, and delta wolves [43].
Fig. 4. Principle of the Grey Wolf algorithm.
The mathematical model of the grey wolf algorithm consists of three steps: encircling,
hunting, and attacking the prey. The encircling behavior can be mathematically described
as
where $\overrightarrow{A}$ and $\overrightarrow{C}$ are the coefficient vectors, $\overrightarrow{D}$ is the distance vector between the prey and the wolf, $\overrightarrow{X_p}\left(t\right)$ and $\overrightarrow{X}(t)$ are the current position vectors of the prey and the grey wolf, respectively, and $t$ is the current iteration.
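In the standard GWO formulation (a reference form that the encircling equations presumably follow), these quantities are related by

$$\overrightarrow{D}=\left|\overrightarrow{C}\cdot\overrightarrow{X_p}\left(t\right)-\overrightarrow{X}\left(t\right)\right|,\qquad\overrightarrow{X}\left(t+1\right)=\overrightarrow{X_p}\left(t\right)-\overrightarrow{A}\cdot\overrightarrow{D}.$$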
The coefficient vectors are computed as $\overrightarrow{A}=2\overrightarrow{a}\times\overrightarrow{r_1}-\overrightarrow{a}$ and $\overrightarrow{C}=2\times \overrightarrow{r_2}$, where the components of $\overrightarrow{a}$ are linearly decreased from 2 to 0 over the course of the iterations, and $\overrightarrow{r_1}$, $\overrightarrow{r_2}$ are random vectors in $[0$, $1]$. If $\left|\overrightarrow{A}\right|$ is less than 1, the wolves attack the prey, which drives the search toward the current best (possibly locally optimal) solution. If $\left|\overrightarrow{A}\right|$ is greater than 1, the wolves move away from the prey and seek a more suitable target. Drawing $\overrightarrow{r_1}$ uniformly from $[0$, $1]$ keeps these two behaviors balanced and helps the search avoid falling into locally optimal solutions. The vector $\overrightarrow{C}$ contains search coefficients that assign random weights to the prey; because $\overrightarrow{r_2}$ is likewise drawn from $[0$, $1]$, GWO explores stochastically from the initial iteration through to the final one. This promotes a global search of the decision space and, as a result, helps the optimization avoid getting trapped in a local optimum, particularly in the middle and late stages.
The hunting behavior is guided jointly by the alpha, beta, and delta wolves, as shown in Eqs. (12)-(14). $\overrightarrow{D_\alpha}$, $\overrightarrow{D_\beta}$, and $\overrightarrow{D_\delta}$
are the distances between $\alpha$, $\beta$, $\delta$ and other individuals. $\overrightarrow{C_1}$,
$\overrightarrow{C_2}$, $\overrightarrow{C_3}$ are random vectors and $\overrightarrow{X}$
is the current position of the grey wolf. $\overrightarrow{X_1}$, $\overrightarrow{X_2}$,
and $\overrightarrow{X_3}$ determine the step length and direction of each individual
wolf in the pack towards $\alpha$, $\beta$, and $\delta$, respectively. The final
position of the wolf is determined by $\overrightarrow{X}(t+1)$.
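In the standard GWO formulation (a reference sketch of the form Eqs. (12)-(14) presumably take), these quantities are computed as

$$\overrightarrow{D_\alpha}=\left|\overrightarrow{C_1}\cdot\overrightarrow{X_\alpha}-\overrightarrow{X}\right|,\quad\overrightarrow{D_\beta}=\left|\overrightarrow{C_2}\cdot\overrightarrow{X_\beta}-\overrightarrow{X}\right|,\quad\overrightarrow{D_\delta}=\left|\overrightarrow{C_3}\cdot\overrightarrow{X_\delta}-\overrightarrow{X}\right|,$$

$$\overrightarrow{X_1}=\overrightarrow{X_\alpha}-\overrightarrow{A_1}\cdot\overrightarrow{D_\alpha},\quad\overrightarrow{X_2}=\overrightarrow{X_\beta}-\overrightarrow{A_2}\cdot\overrightarrow{D_\beta},\quad\overrightarrow{X_3}=\overrightarrow{X_\delta}-\overrightarrow{A_3}\cdot\overrightarrow{D_\delta},$$

$$\overrightarrow{X}\left(t+1\right)=\frac{\overrightarrow{X_1}+\overrightarrow{X_2}+\overrightarrow{X_3}}{3}.$$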
3. GWO based SVR
The SVR model, known for its use of theoretical analysis to achieve high accuracy
with a limited number of samples, shows good stability and generalization capacity
when tackling non-linear problems. The selection of appropriate parameters, namely the penalty factor $C$ and the kernel parameter $g$, significantly impacts the predictive performance of the SVR model. Typically, the optimal $C$ and $g$ values are manually selected or tuned for each specific problem, which entails a considerable amount of work and results in low reliability. To address this issue, GWO is employed to automatically identify the optimal values of $C$ and $g$ within a designated range, offering both speed and a degree of generalization ability in solving
diverse problems. Specifically, the GWO-SVR algorithm proposed in this study follows
a set of steps, as detailed below, with the algorithm flowchart shown in Fig. 5.
Fig. 5. Flowchart of GWO-SVR algorithm.
1. Initially, the data is partitioned into a training set and a test set, where the ratio of training data to test data significantly impacts the fitting accuracy in regression prediction problems. To evaluate the small-sample properties of the SVR algorithm, two variants are considered: GWO-SVRL, with a training set comprising 30% of the data and a test set comprising 70%, and the GWO-SVR algorithm, with a training set of 70% and a test set of 30% [44]. The contrasting training-test ratios are intended to reveal the algorithm's behavior under varying sample sizes.
2. Subsequently, the GWO-related parameters are initialized with computational cost in mind. Specifically, the number of wolves affects the balance between search accuracy and computational efficiency, while a reasonable number of iterations helps avoid overfitting or convergence to local optima. In pre-experimental tests, setting both the number of wolves and the number of iterations to 10 balanced accuracy and time consumption, and model performance improved only slightly beyond this value, so 10 is chosen. The lower and upper bounds of the parameters $C$ and $g$ are fixed at 0.001 and 300, respectively, to constrain their values within a feasible range.
3. Then, the fitness of the current position of the wolves is computed by applying
the SVR model. The positions of the grey wolves are updated based on the calculated
fitness values, in line with the algorithm's objective to predict complex S-parameters.
The fitness function employed for evaluation is the mean square error (MSE), as indicated
in Eq. (15), chosen for its effectiveness in quantifying the disparity between prediction outcomes
and sample values.
4. After the maximum number of iterations is reached, the parameters that achieve the optimal fitness within the defined range are obtained. These optimized parameters are then used in the SVR model to produce the final regression predictions; an illustrative sketch of this workflow follows these steps.
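As a concrete illustration of steps 1-4, the following minimal Python sketch couples a plain NumPy implementation of the GWO position update with scikit-learn's SVR. The data arrays are synthetic placeholders; the 70/30 split, the ten wolves, the ten iterations, and the [0.001, 300] bounds mirror the settings above, while evaluating the MSE fitness on the held-out split is only one possible reading of step 3. Everything not stated in the text (names, data, random seeds) is an illustrative assumption rather than the implementation used in this study.

```python
import numpy as np
from sklearn.model_selection import train_test_split
from sklearn.svm import SVR
from sklearn.metrics import mean_squared_error

rng = np.random.default_rng(0)

# Synthetic placeholder data standing in for the measured samples.
X = rng.uniform(0.0, 10.0, size=(100, 2))
y = np.sin(X[:, 0]) + 0.5 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Step 1: 70 % training / 30 % test split (the GWO-SVR setting described above).
X_tr, X_te, y_tr, y_te = train_test_split(X, y, test_size=0.3, random_state=0)

def fitness(position):
    """Step 3: MSE fitness (Eq. (15)) of an RBF-SVR using C, g from a wolf's position."""
    C, g = position
    model = SVR(kernel="rbf", C=C, gamma=g).fit(X_tr, y_tr)
    return mean_squared_error(y_te, model.predict(X_te))

# Step 2: GWO settings -- 10 wolves, 10 iterations, C and g bounded to [0.001, 300].
n_wolves, n_iter = 10, 10
lb, ub = np.array([0.001, 0.001]), np.array([300.0, 300.0])
wolves = rng.uniform(lb, ub, size=(n_wolves, 2))
scores = np.array([fitness(w) for w in wolves])

for t in range(n_iter):
    a = 2.0 - 2.0 * t / n_iter                           # a decreases linearly from 2 to 0
    alpha, beta, delta = wolves[np.argsort(scores)[:3]]  # three best wolves lead the hunt
    for i in range(n_wolves):
        new_pos = np.zeros(2)
        for leader in (alpha, beta, delta):
            r1, r2 = rng.random(2), rng.random(2)
            A, C_vec = 2 * a * r1 - a, 2 * r2
            D = np.abs(C_vec * leader - wolves[i])       # encircling distance to the leader
            new_pos += (leader - A * D) / 3.0            # average of X1, X2, X3
        wolves[i] = np.clip(new_pos, lb, ub)
        scores[i] = fitness(wolves[i])

# Step 4: the best C and g found within the bounds feed the final SVR model.
best_C, best_g = wolves[np.argmin(scores)]
final_model = SVR(kernel="rbf", C=best_C, gamma=best_g).fit(X_tr, y_tr)
y_pred = final_model.predict(X_te)                       # final regression predictions
print(f"best C = {best_C:.3f}, best g = {best_g:.3f}, test MSE = {scores.min():.4f}")
```

For the GWO-SVRL variant, only the split would change (test_size=0.7 under the same assumed interface); the optimization loop itself is unaffected.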