Workshop on Optimization and Learning

2025.01.02

Organizers: Thorsten Koch (Professor, ZIB), Wotao Yin (Professor, Alibaba DAMO Academy), Zaiwen Wen (Professor, Peking University)

Dates: 2025.02.09 - 2025.02.15


Workshop Schedule

 

Feb 10, Monday
  09:00-10:00   Talk: Yinyu Ye, "Mathematical Optimization in the Era of AI" (Chair: Yaxiang Yuan)
  10:00-10:30   Break
  10:30-11:30   Panel: Jinyan Fan, Dongdong Ge, Jie Song, Zaiwen Wen, Yinyu Ye, Yaxiang Yuan (Chair: Rujun Jiang)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Ruoyu Sun, "Understanding and Improving the Training of Large Language Models: Some New Progress" (Chair: Ming Yan)
  15:00-15:30   Break
  15:30-16:30   Panel: Cong Fang, Yutong He, Luo Luo, Ruoyu Sun, Xiangfeng Wang (Chair: Caihua Chen)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion

Feb 11, Tuesday
  09:00-10:00   Talk: Xudong Li, "A New Dual Semismooth Newton Method for Polyhedral Projections" (Chair: Yongjin Liu)
  10:00-10:30   Break
  10:30-11:30   Panel: Bo Jiang, Xudong Li, Yongjin Liu, Andre Milzarek, Ruisong Zhou (Chair: Ziyan Luo)
  12:00-14:00   Lunch
  14:00-20:30   Free discussion

Feb 12, Wednesday
  09:00-10:00   Talk: Defeng Sun, "HPR-LP: An implementation of an HPR method for solving linear programming" (Chair: Jie Song)
  10:00-10:30   Break
  10:30-11:30   Panel: Bin Gao, Zheng Qu, Defeng Sun, Lei Yang, Shenglong Zhou (Chair: Yaohua Hu)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Yuling Jiao, "DRM Revisited: A Complete Error Analysis" (Chair: Jinyan Fan)
  15:00-15:30   Break
  15:30-16:30   Panel: Yuling Jiao, Tianyou Li, Xiantao Xiao, Zhonglin Xie, Haijun Zou (Chair: Zuoqiang Shi)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion

Feb 13, Thursday
  09:00-10:00   Talk: Yuhong Dai, "Optimization with Least Constraint Violation" (Chair: Bin Gao)
  10:00-10:30   Break
  10:30-11:30   Panel: Kuang Bai, Yuhong Dai, Houduo Qi, Cong Sun, Jiaxin Xie (Chair: Zi Xu)
  12:00-14:00   Lunch
  14:00-20:30   Free discussion

Feb 14, Friday
  09:00-10:00   Talk: Deren Han, "Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance" (Chair: Cong Sun)
  10:00-10:30   Break
  10:30-11:30   Panel: Deren Han, Bo Jiang, Ruiyang Jin, Yibin Xiao, Pengcheng You, Shangzhi Zeng (Chair: Jingwei Liang)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Kun Yuan, "A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization" (Chair: Xiantao Xiao)
  15:00-15:30   Break
  15:30-16:30   Panel: Shixiang Chen, Yujie Tang, Ming Yan, Kun Yuan, Jiaojiao Zhang (Chair: Qing Ling)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion


 

 

Titles and Abstracts

 

 

09:00-10:00, Feb. 10, Monday

Speaker: Yinyu Ye

Title: Mathematical Optimization in the Era of AI

Abstract: This talk aims to address the question: do the classical mathematical optimization/game models, theories, and algorithms remain valuable in the AI and LLM era? We present several cases showing how mathematical programming and AI/machine-learning technologies complement each other. In particular, we discuss advances in LP, SDP, and market-equilibrium computation aided by machine-learning techniques, first-order methods, and GPU implementations. Conversely, we describe how classic optimization techniques can be applied to accelerate the gradient methods that are widely used for LLM training and fine-tuning.

   

14:00-15:00, Feb. 10, Monday

Speaker: Ruoyu Sun

Title: Understanding and Improving the Training of Large Language Models: Some New Progress

Abstract: This talk discusses understanding and improving the training algorithms of large language models (LLMs). In the first part, we analyze why Adam outperforms SGD on Transformers and propose a lightweight alternative, Adam-mini. We explain why SGD fails on Transformers: (i) Transformers are heterogeneous: the Hessian spectra differ significantly across parameter blocks; (ii) this heterogeneity hampers SGD, which performs poorly on problems with block heterogeneity. Motivated by this finding, we introduce Adam-mini, which assigns a single second-moment term to all weights within each block. Experiments show that Adam-mini saves 35-50% of memory compared with Adam without sacrificing performance, and performs well on a variety of models, including 8B-scale language models and ViT. In the second part, we introduce MoFO, a new algorithm that mitigates forgetting during the SFT stage of LLMs. We observe that one cause of forgetting is that the weights drift away from the pre-trained weights after SFT; we therefore propose MoFO, which combines Adam with a greedy coordinate descent method, and provide a convergence analysis. Compared with the conventional approach of mixing pre-training data into the SFT data, our method requires no pre-training data and performs significantly better in experiments on a 7B model. In the third part, we present ReMax, a new RLHF (reinforcement learning from human feedback) algorithm. We identify three properties of RLHF tasks that are not fully exploited by the classical PPO method and propose the new method ReMax, which reduces GPU memory usage by 50% and speeds up training by 1.6x, outperforming PPO and DPO in tests on the Mistral-7B model.

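To make the Adam-mini idea above concrete, here is a minimal sketch, assuming a PyTorch-style list of parameter blocks and gradients, of an Adam step that keeps one scalar second moment per block instead of one per weight. The function name, block partitioning, and hyperparameters are hypothetical; this is not the authors' released Adam-mini implementation.

    import torch

    def blockwise_adam_step(params, grads, states, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # One update step keeping a full first moment per weight but only a single
        # scalar second moment per parameter block (illustrative sketch only).
        for p, g, st in zip(params, grads, states):
            st["t"] += 1
            st["m"].mul_(beta1).add_(g, alpha=1 - beta1)           # per-weight first moment
            # scalar second moment shared by the whole block: mean of squared gradients
            st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean()
            m_hat = st["m"] / (1 - beta1 ** st["t"])               # bias corrections
            v_hat = st["v"] / (1 - beta2 ** st["t"])
            p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)        # parameter update

    # usage: one state entry per parameter block (e.g., per weight matrix)
    params = [torch.randn(4, 4), torch.randn(4)]
    grads = [torch.randn_like(p) for p in params]
    states = [{"m": torch.zeros_like(p), "v": torch.tensor(0.0), "t": 0} for p in params]
    blockwise_adam_step(params, grads, states)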
  

09:00-10:00, Feb. 11, Tuesday

Speaker: Xudong Li

Title: A New Dual Semismooth Newton Method for Polyhedral Projections

Abstract: We propose a new dual semismooth Newton method for computing the orthogonal projection onto a given polyhedron. Classical semismooth Newton methods typically depend on subgradient regularity assumptions for achieving local superlinear or quadratic convergence. Our approach, however, marks a significant breakthrough by demonstrating that it is always possible to identify a point where the existence of a nonsingular generalized Jacobian is guaranteed, regardless of any regularity conditions. Furthermore, we explain this phenomenon and its relationship with the weak strict Robinson constraint qualification (W-SRCQ) from the perspective of variational analysis. Building on this theoretical advancement, we develop an inexact semismooth Newton method with superlinear convergence for solving the polyhedral projection problem.

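As background for the abstract above, the generic setup is the projection of a point c onto a polyhedron {x : Ax <= b} and a semismooth Newton iteration on its dual; this is a textbook sketch, not the specific dual formulation or generalized Jacobian construction of the talk:
\[
\min_{x}\ \tfrac12\|x-c\|^2 \ \ \text{s.t.}\ \ Ax \le b,
\qquad \text{with Lagrangian dual} \qquad
\min_{y\ge 0}\ \phi(y) := \tfrac12\|A^{\top}y\|^2 - (Ac-b)^{\top}y, \quad x(y) = c - A^{\top}y.
\]
A dual semismooth Newton method solves the nonsmooth optimality system
\[
F(y) := y - \Pi_{\mathbb{R}^m_+}\bigl(y - \nabla\phi(y)\bigr) = 0, \qquad
y^{k+1} = y^k - V_k^{-1}F(y^k), \quad V_k \in \partial F(y^k),
\]
where local superlinear convergence classically hinges on a nonsingular generalized Jacobian near the solution; removing the need for such regularity assumptions is the point of the talk.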
 

 09:00-10:00, Feb. 12, Wednesday

Speaker: Defeng Sun

Title: HPR-LP: An implementation of an HPR method for solving linear programming

Abstract: In this talk, we introduce HPR-LP, a solver implementing a Halpern Peaceman-Rachford (HPR) method with semi-proximal terms for solving linear programming (LP). We start by showing that the HPR method enjoys the highly desirable iteration complexity of O(1/k) in terms of the Karush-Kuhn-Tucker residual and the objective error, via the theory developed recently for accelerated degenerate proximal point methods. Based on these complexity results, we then design an adaptive strategy for restarting and penalty-parameter updates to improve the efficiency and robustness of the HPR method. We conduct extensive numerical experiments on different LP benchmark datasets using an NVIDIA A100-SXM4-80GB GPU under different stopping tolerances. Our solver's Julia version achieves a 2.39x to 5.70x speedup, measured by SGM10, on benchmark datasets with presolve (2.03x to 4.06x without presolve) over the award-winning solver PDLP at a tolerance of 10^{-8}. Several practical techniques underlying the efficiency of the solver will be highlighted.

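For context on the Halpern ingredient of the HPR method, the generic Halpern-accelerated fixed-point iteration for a nonexpansive operator T reads as follows (a standard scheme; the semi-proximal Peaceman-Rachford operator actually used in HPR-LP is not spelled out here):
\[
x^{k+1} = \lambda_k\,x^0 + (1-\lambda_k)\,T(x^k), \qquad \lambda_k = \frac{1}{k+2},
\]
which anchors every iterate to the starting point x^0 and achieves an O(1/k) decay of the fixed-point residual \|x^k - T(x^k)\|; with T chosen as a Peaceman-Rachford splitting operator of the LP, this anchoring is, roughly speaking, the source of the O(1/k) KKT-residual complexity quoted above.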
  

14:00-15:00, Feb. 12, Wednesday

Speaker: Yuling Jiao

Title: DRM Revisited: A Complete Error Analysis

Abstract: The error analysis of deep learning includes approximation error, statistical error, and optimization error. However, existing works often struggle to integrate these three types of errors due to overparameterization. In this talk, we aim to bridge this gap by addressing a key question in the analysis of the Deep Ritz Method (DRM): "Given a desired level of precision, how can we determine the number of training samples, the parameters of the neural networks, the step size of gradient descent, and the number of iterations needed so that the output deep networks from gradient descent closely approximate the true solution of the PDEs with the specified precision?"

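As a concrete reference point for the three error sources above, the Deep Ritz Method for a model Poisson problem -\Delta u = f on \Omega with zero boundary values minimizes the Ritz energy over a neural-network class u_\theta and replaces the integrals by Monte Carlo sums (a standard formulation; the boundary penalty weight \lambda is an assumption of this sketch):
\[
\mathcal{E}(u_\theta) = \int_{\Omega}\Bigl(\tfrac12\|\nabla u_\theta(x)\|^2 - f(x)\,u_\theta(x)\Bigr)\,dx
\;+\; \lambda \int_{\partial\Omega} u_\theta(x)^2\,ds,
\]
\[
\widehat{\mathcal{E}}_{n,m}(u_\theta) = \frac{|\Omega|}{n}\sum_{i=1}^{n}\Bigl(\tfrac12\|\nabla u_\theta(X_i)\|^2 - f(X_i)\,u_\theta(X_i)\Bigr)
\;+\; \lambda\,\frac{|\partial\Omega|}{m}\sum_{j=1}^{m} u_\theta(Y_j)^2,
\]
with X_i sampled uniformly in \Omega and Y_j on \partial\Omega; the gap between a gradient-descent output minimizing \widehat{\mathcal{E}}_{n,m} and the true PDE solution then splits into exactly the approximation, statistical, and optimization errors discussed in the talk.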
 

09:00-10:00, Feb. 13, Thursday

Speaker: Yuhong Dai

Title: Optimization with Least Constraint Violation

Abstract: Feasibility is a fundamental issue in both continuous and discrete optimization, regardless of whether the problem is feasible. Even when the optimization problem is feasible, state-of-the-art solvers have to deal with infeasible subproblems and may generate infeasible cluster points. In this talk, I shall summarize recent advances on optimization with least constraint violation and related applications, and give some discussions.

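To make the notion of least constraint violation concrete, one common formulation (a generic illustration in the notation below, not necessarily the exact model of the talk) replaces a possibly infeasible problem by minimizing the objective over the points of minimal violation:
\[
\min_{x}\ f(x)\ \ \text{s.t.}\ \ c(x)\le 0
\qquad\longrightarrow\qquad
\min_{x}\ f(x)\ \ \text{s.t.}\ \ x \in \operatorname*{arg\,min}_{z}\ \tfrac12\,\bigl\|\max(c(z),0)\bigr\|^2 .
\]
If the original problem is feasible, the inner minimal violation is zero and nothing changes; if it is infeasible, the modified problem still defines a meaningful target, which is why solvers confronted with infeasible subproblems benefit from this viewpoint.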
 

 09:00-10:00, Feb. 14, Friday

Speaker: Deren Han

Title: Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance

Abstract: Saddle point problems form an important class of problems in optimization and game theory, and their theory and applications have long attracted broad attention; primal-dual algorithms are among the core methods for solving them. This talk presents convergence analyses for some basic saddle point problems and proposes new algorithms that account for the balance between the primal and dual updates.

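For readers less familiar with this class of methods, a representative primal-dual scheme for the bilinear saddle point problem \min_x \max_y f(x) + \langle Kx, y\rangle - g(y) is the PDHG (Chambolle-Pock) iteration below; the primal and dual step sizes \tau and \sigma are one natural place where primal-dual balance enters (the talk's specific balancing rule is not reproduced here):
\[
x^{k+1} = \operatorname{prox}_{\tau f}\bigl(x^{k} - \tau K^{\top} y^{k}\bigr), \qquad
y^{k+1} = \operatorname{prox}_{\sigma g}\bigl(y^{k} + \sigma K\,(2x^{k+1} - x^{k})\bigr),
\]
with convergence guaranteed under the step-size condition \tau\sigma\|K\|^2 < 1.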
 

14:00-15:00, Feb. 14, Friday

Speaker: Kun Yuan

Title: A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization

Abstract: Most decentralized optimization algorithms are handcrafted. While endowed with strong theoretical guarantees, these algorithms generally target a broad class of problems and are therefore not adaptive or customized to specific problem features. This paper studies data-driven decentralized algorithms trained to exploit problem features to boost convergence. Existing learning-to-optimize methods typically suffer from poor generalization or prohibitively vast search spaces. In addition, the vast search space of communication choices and the goal of reaching the global solution via limited neighboring communication pose further challenges in decentralized settings. To resolve these challenges, this paper first derives the necessary conditions that successful decentralized algorithmic rules need to satisfy to achieve both optimality and consensus. Based on these conditions, we propose a novel Mathematics-inspired Learning-to-optimize framework for Decentralized optimization (MiLoDo). Empirical results demonstrate that MiLoDo-trained algorithms outperform handcrafted algorithms and exhibit strong generalization. Algorithms learned via MiLoDo over 100 iterations perform robustly when run for 100,000 iterations at inference time. Moreover, MiLoDo-trained algorithms learned on synthetic datasets perform well on problems involving real data, higher dimensions, and different loss functions.
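As background for the optimality-and-consensus conditions mentioned in the abstract, the canonical decentralized problem and a classical handcrafted baseline, decentralized gradient descent with a doubly stochastic mixing matrix W, can be written as follows (shown only as a reference point, not the MiLoDo update rule):
\[
\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
x_i^{k+1} = \sum_{j\in\mathcal{N}_i\cup\{i\}} W_{ij}\,x_j^{k} \;-\; \alpha\,\nabla f_i\bigl(x_i^{k}\bigr),
\]
where agent i communicates only with its neighbors \mathcal{N}_i. Any learned rule must preserve this neighbor-only communication pattern while still driving all local iterates x_i to a common minimizer, which is precisely the pair of requirements, optimality and consensus, that the necessary conditions in the abstract formalize.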