What are some recent advances in non-convex optimization research? originally appeared on Quora - the knowledge sharing network where compelling questions are answered by people with unique insights.
Non-convex optimization is now ubiquitous in machine learning. While previously, the focus was on convex relaxation methods, now the emphasis is on being able to solve non-convex problems directly.
It is not possible to find the global optimum of every non-convex problem due to NP-hardness barrier. An alternate approach is: when can it be solved efficiently (preferably in low order polynomial time). Recent theoretical work has established that many non-convex problems can be solved near-optimally, but simple iterative algorithms, e.g. gradient descent with random restarts. The conditions for success turn out be mild and natural for many learning problems.
Some examples of guaranteed non-convex methods:
- Many latent variable models can be learnt by decomposing higher order moment tensors (which is a non-convex problem). We are guaranteed to learn a consistent model, when the hidden or latent representation is non-degenerate: the effect of one hidden variable cannot be expressed as a linear combination of effects of others. See our Also see an excellent by Rong Ge.
- Non-negative matrix factorization becomes tractable under a separability condition, see This condition may not always hold, but in many problems such as learning PCFGs in NLP, we can always add more features till we have a separable problem. See
- Non-convex problems tend to work better in practice, but until now theory was only available for convex relaxation methods. This is changing fast: many non-convex methods are proven to succeed in the regimes where convex relaxation succeeds. For instance, for the problem of robust PCA, where the goal is to find a low rank + sparse decomposition of the matrix, we show that a natural non-convex method, which is more efficient than the convex method, have the same regimes of success. See
- For non-convex problems, the main drawback is the need for good initialization. This is where practitioners tend to have good intuitions and employ domain specific knowledge to engineer these initializations. This is also now being tackled in theory. For instance, for sparse coding problem, we can initialize by finding solution to an overlapping clustering problem. See and
- The above works are relying on gradient descent, or other local methods, with specific initializations. An alternative method is based on smoothing, where the function is progressively transformed into a coarse-grained objective through local smoothing. With sufficient amount of smoothing, it becomes convex, and its global optimum is used as initialization to a finer-grained objective which is non-convex. Recent analyze when such smoothing methods succeed for non-convex optimization.
Avoiding saddle points: Saddle points are critical points, where the gradient vanishes, but it is not a local optimum, meaning there exist directions where the objective value improves. Saddle points slow down stochastic gradient descent (SGD), especially as the number of variables grows. This is a very challenging problem, in fact, it has been argued that saddle points are much more of an issue than local optima; see Recent works have made progress on how to escape saddle points efficiently in high dimensions. In fact, if the SGD is noisy enough, this shows that it can escape non-degenerate saddle points, which can be determined by second order derivatives. A more challenging case involves saddle points which require higher order derivatives to escape, and we analyze them in a recent work
Discrete optimization: So far, I have discussed continuous optimization. There is also interesting work in the analysis of discrete optimization, especially in the context of probabilistic inference. For instance, has been analyzing the effect of random projections on information projections; see Gibbs sampling is another popular technique for probabilistic inference, but it is challenging to establish a bound on the mixing time. A recent by Christopher Re's group is trying to provide new analysis tools for Gibbs sampling. Previously, I have worked on learning Markov graph structure of probabilistic models, and show that simple and efficient methods succeed in the regime of "high temperature." You can read more about it
Other resources on non-convex optimization:
- We had a recent workshop on non-convex optimization at NIPS. Slides of invited talks can be found at You can read my blog post about it
- We have an upcoming workshop as well:
- You can also see my recent interview with David Beyer, where I talk about convex vs. non-convex optimization:
- Sanjeev Arora, Moritz Hardt and Nisheeth Vishnoi are maintaining a blog on non-convex optimization:
- Deep Learning: How are Tensor methods used in deep learning?
- Open Source: What are the good open source libraries in Python or R for Tensor methods? Does your team maintain any?
- Gradient Descent: Are there optimization techniques that do not use gradient and are hence not affected by initialization?