Abstract
Background and issues
A generalized linear model is a model in which the probability distribution of the response variable belongs to the exponential family; it includes linear regression, logistic regression, Poisson regression, and others. Feature selection is the problem of selecting effective features (explanatory variables) when building a predictive model. In recent years, feature selection based on mixed-integer optimization techniques has received increasing attention.
Feature selection can reduce the cost of data collection and storage and makes it possible to estimate regression coefficients more efficiently. It also helps to explain the causal relationship between the selected features and the responses inferred by the regression model, and it can improve predictive performance by eliminating redundant features.
An exhaustive way to find the best subset of features is to evaluate the quality of every possible subset. However, this approach is computationally infeasible except when the number of features is small.
Much previous research has therefore focused on heuristic optimization algorithms for large-scale subset selection problems. The most widely known algorithm of this type is the stepwise method, which repeatedly adds or removes a single feature at a time.
Various regularization methods for sparse estimation have also been investigated in recent years, the most widely used being the Lasso. The Lasso applies an L1 regularization term, which shrinks the coefficients of unnecessary features to exactly zero.
Bertsimas et al. proposed a discrete first-order (DFO) method that efficiently finds an approximate optimal solution of the feature subset selection problem. However, the DFO algorithm cannot escape from local optimal solutions. To address this, Bertsimas et al. proposed repeatedly running the DFO algorithm from random initial solutions.
In the present research, we propose a stochastic discrete first-order method, an improved version of the DFO method. In this method, random perturbations are applied to candidate solutions in order to escape from local optimal solutions. In addition, to accelerate convergence, the optimal step size along the gradient descent direction is calculated. Moreover, to improve the predictive performance of the obtained model, an L2 regularization term is introduced.
Proposed method
The feature selection problem is formulated as follows. Let n be the number of samples, p the number of features, y ∈ ℝ^{n×1} the vector of response values (the values to be predicted), X ∈ ℝ^{n×p} the matrix of feature values (explanatory variables), and β ∈ ℝ^{p×1} the vector of partial regression coefficients. The multiple regression model can then be expressed as y = Xβ + e, where e is the prediction error.
The feature selection problem is an optimization problem, where a fixed number of features must be found that minimize e. In other words, the feature selection problem is:
minimize f(β) := (1/2)‖y − Xβ‖₂²    (1)
subject to ‖β‖₀ ≤ k.    (2)
Here, ‖β‖₀ is the number of non-zero elements of β (i.e., the number of selected features), and k is the desired subset size, satisfying k < min{n, p}.
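As a concrete illustration, the objective (1) and the constraint (2) can be written in a few lines of NumPy. This is only a sketch; the function names objective and l0_norm are illustrative and not part of the original formulation.

```python
import numpy as np

def objective(beta, X, y):
    """f(beta) = (1/2) * ||y - X beta||_2^2, the least-squares loss in Equation (1)."""
    r = y - X @ beta
    return 0.5 * float(r @ r)

def l0_norm(beta, tol=1e-12):
    """||beta||_0: the number of non-zero coefficients, bounded by k in constraint (2)."""
    return int(np.sum(np.abs(beta) > tol))
```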
Discrete first-order method
The conventional discrete first-order method can be described as follows.
- Step 0: Define a threshold δ as a sufficiently small positive number. Choose an initial solution β^(1) ∈ ℝ^{p×1} satisfying ‖β^(1)‖₀ ≤ k, and set m ← 1.
- Step 1: Update β: β^(m+1) ← H_k(β^(m) − (1/L)∇f(β^(m))).
- Step 2: If |f(β^(m)) − f(β^(m+1))| ≤ δ, stop; β^(m+1) is the obtained solution.
- Step 3: Set m ← m + 1 and return to Step 1.
Here, L is a Lipschitz constant for the gradient of f; in other words, ‖∇f(η) − ∇f(β)‖₂ ≤ L‖η − β‖₂ holds for all η, β ∈ ℝ^{p×1}. From Equation (1), we have ∇f(β) = XᵀXβ − Xᵀy.
Moreover, H_k is an operator that, when the elements are arranged in descending order of absolute value, sets the elements from the (k + 1)-th onward to zero (Figure 1). For instance, when β = (1.3, 0.8, 0.4, -1.2, -0.2)ᵀ and k = 3, the elements rearranged in descending order of absolute value are (1.3, -1.2, 0.8, 0.4, -0.2). H_k(β) sets the elements from the (k + 1)-th onward of this rearranged sequence to zero, giving H_k(β) = (1.3, 0.8, 0, -1.2, 0)ᵀ.
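The sketch below, assuming NumPy, shows one way the thresholding operator H_k and the conventional DFO iteration could be implemented. The names hard_threshold and dfo are illustrative; using the largest eigenvalue of XᵀX as the Lipschitz constant L and a random feasible initialization follow the description above, but the details are assumptions rather than the authors' exact implementation.

```python
import numpy as np

def hard_threshold(beta, k):
    """H_k: keep the k entries of beta with the largest absolute values and set the rest to zero."""
    out = np.zeros_like(beta)
    keep = np.argsort(-np.abs(beta))[:k]             # positions of the k largest |beta_i|
    out[keep] = beta[keep]
    return out

def dfo(X, y, k, delta=1e-6, max_iter=1000):
    """Conventional discrete first-order method (Steps 0-3 above), as a sketch."""
    p = X.shape[1]
    L = np.linalg.eigvalsh(X.T @ X).max()            # Lipschitz constant of the gradient of f
    f = lambda b: 0.5 * np.sum((y - X @ b) ** 2)     # objective (1)
    beta = hard_threshold(np.random.randn(p), k)     # random feasible start, ||beta||_0 <= k
    for _ in range(max_iter):
        grad = X.T @ X @ beta - X.T @ y              # grad f(beta) = X'X beta - X'y
        beta_new = hard_threshold(beta - grad / L, k)  # Step 1
        if abs(f(beta) - f(beta_new)) <= delta:      # Step 2: stopping criterion
            return beta_new
        beta = beta_new                              # Step 3
    return beta
```

With β = (1.3, 0.8, 0.4, -1.2, -0.2)ᵀ and k = 3, hard_threshold reproduces the example above, returning (1.3, 0.8, 0, -1.2, 0)ᵀ.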
Stochastic discrete first-order method
The proposed stochastic discrete first-order method improves Step 1 of the conventional method as follows.
- Update β: β^(m+1) ← G_k(β^(m) − (1/L)∇f(β^(m)) | ξ^(m)).
Here, ξ^(m) ∈ ℝ^{p×1} is a vector of independent normal random numbers with mean zero and variance σ_m². Like H_k, G_k is an operator that sets the elements from the (k + 1)-th onward to zero when they are arranged in descending order of absolute value; it differs in that a normal random number is added to the elements when they are ranked (Figure 2). This procedure makes it possible to escape from local optimal solutions.
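A minimal sketch of G_k, under one plausible reading of the description above: the perturbation ξ^(m) only affects which k entries are kept, while the kept entries retain their original values. How exactly the noise enters the ranking is an assumption, since it is not spelled out here.

```python
import numpy as np

def g_k(beta, k, sigma):
    """G_k: rank the entries by |beta_i + xi_i| with xi_i ~ N(0, sigma^2) i.i.d.,
    keep the k top-ranked entries (with their original values), and zero out the rest."""
    xi = sigma * np.random.randn(beta.size)      # xi^(m): mean-zero normal perturbation
    keep = np.argsort(-np.abs(beta + xi))[:k]    # the noise perturbs only the ranking
    out = np.zeros_like(beta)
    out[keep] = beta[keep]
    return out
```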
The proposed method also optimizes the step size along the search direction in order to accelerate convergence. In addition, an L2 penalty term is added to the objective function in order to improve predictive performance.
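The step-size formula is not given here, but for the quadratic loss (1), optionally augmented with a penalty of the form (λ/2)‖β‖₂², the loss restricted to the steepest-descent direction is a one-dimensional quadratic, so the optimal step has a closed form. The sketch below, with illustrative names, shows this calculation under that assumed form of the penalty.

```python
import numpy as np

def exact_step_size(X, y, beta, lam=0.0):
    """Closed-form line search along the steepest-descent direction for the loss
    0.5*||y - X beta||^2 + 0.5*lam*||beta||^2 (lam = 0 recovers Equation (1))."""
    grad = X.T @ (X @ beta - y) + lam * beta     # gradient of the (penalized) loss
    d = -grad                                    # steepest-descent direction
    Xd = X @ d
    t = (d @ d) / (Xd @ Xd + lam * (d @ d))      # minimizer of the loss along beta + t*d
    return t, d
```

In the update step, β^(m) + t·d would then replace the fixed step β^(m) − (1/L)∇f(β^(m)) before the thresholding operator is applied.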
Evaluation and demonstration
The effectiveness of the proposed method was demonstrated through an experiment using logistic regression. The experiment data are as follows (a data-generation sketch is given after the list).
Partial regression coefficients: β⁰ = (1, 0ᵀ, 1, 0ᵀ, …, 1, 0ᵀ)ᵀ ∈ ℝ^{p×1}.
Feature matrix X ∈ ℝ^{n×p}: the rows are generated as i.i.d. draws from the multivariate normal distribution N_p(0, Σ), where the (i, j) component of the covariance matrix Σ is 0.7^{|i−j|}.
Response vector y ∈ ℝ^{n×1}.
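A sketch of how such data could be generated, assuming illustrative sizes n = 200, p = 50, k = 5 and a Bernoulli response driven by the logistic model; these concrete values, and the exact placement of the non-zero coefficients, are assumptions rather than settings stated above.

```python
import numpy as np

rng = np.random.default_rng(0)
n, p, k = 200, 50, 5                      # illustrative sizes only

# Covariance matrix Sigma with (i, j) component 0.7^|i - j|
idx = np.arange(p)
Sigma = 0.7 ** np.abs(idx[:, None] - idx[None, :])

# Feature matrix X: n i.i.d. rows drawn from N_p(0, Sigma)
X = rng.multivariate_normal(np.zeros(p), Sigma, size=n)

# True coefficients beta0 = (1, 0^T, 1, 0^T, ..., 1, 0^T)^T: k ones at equally spaced positions
beta0 = np.zeros(p)
beta0[np.arange(k) * (p // k)] = 1.0

# Binary response for logistic regression: y_i ~ Bernoulli(sigmoid(x_i^T beta0))
prob = 1.0 / (1.0 + np.exp(-X @ beta0))
y = rng.binomial(1, prob)
```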
As evaluation measures, we used the prediction accuracy, the F1 score for feature selection (i.e., whether the correct features were selected), and the number of selected features. We compared the predictive performance against other methods (Figure 3).
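As one example of how the feature-selection F1 score could be computed, the helper below compares the support (non-zero positions) of the estimated coefficients with the true support; the function name and tolerance are illustrative.

```python
import numpy as np

def selection_f1(beta_hat, beta_true, tol=1e-12):
    """F1 score for feature selection: precision and recall of the estimated support
    against the true support."""
    selected = np.abs(beta_hat) > tol
    relevant = np.abs(beta_true) > tol
    tp = float(np.sum(selected & relevant))
    precision = tp / max(selected.sum(), 1)
    recall = tp / max(relevant.sum(), 1)
    return 0.0 if precision + recall == 0 else 2 * precision * recall / (precision + recall)
```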
Results and proposal
In the present research, we proposed a stochastic discrete first-order method for selecting a subset of the features of a multiple regression model. In the proposed method, random perturbations are applied to candidate solutions in order to avoid local optimal solutions. The optimal step size along the gradient descent direction was derived to achieve fast convergence, and an L2 regularization term was introduced to improve prediction performance.
Simulation results verified that our method significantly reduces out-of-sample prediction error compared with repeatedly executing a randomly initialized discrete first-order method. Moreover, our method achieved better accuracy than the Lasso and forward-stepwise methods, in terms of both subset selection and prediction.
In addition to the results above, it is worth noting that our method is extremely simple and easy to understand and implement. These properties should make it appealing to researchers and to data analysts in industry.