Conveners: Thorsten Koch (ZIB, Professor), Wotao Yin (Alibaba DAMO Academy, Professor), Zaiwen Wen (Peking University, Professor)
Dates: February 9-15, 2025
Workshop Schedule
| Date | Time | Workshop Information |
| --- | --- | --- |
| Feb 10, Monday | 09:00-10:00 | Chair: Yaxiang Yuan; Talk: Yinyu Ye, Mathematical Optimization in the Era of AI |
| | 10:00-10:30 | Break |
| | 10:30-11:30 | Chair: Rujun Jiang; Panel: Jinyan Fan, Dongdong Ge, Jie Song, Zaiwen Wen, Yinyu Ye, Yaxiang Yuan |
| | 12:00-14:00 | Lunch |
| | 14:00-15:00 | Chair: Ming Yan; Talk: Ruoyu Sun, Understanding and Improving the Training of Large Models: Some Recent Progress |
| | 15:00-15:30 | Break |
| | 15:30-16:30 | Chair: Caihua Chen; Panel: Cong Fang, Yutong He, Luo Luo, Ruoyu Sun, Xiangfeng Wang |
| | 17:00-19:00 | Dinner |
| | 20:00-21:00 | Free group discussion |
| Feb 11, Tuesday | 09:00-10:00 | Chair: Yongjin Liu; Talk: Xudong Li, A New Dual Semismooth Newton Method for Polyhedral Projections |
| | 10:00-10:30 | Break |
| | 10:30-11:30 | Chair: Ziyan Luo; Panel: Bo Jiang, Xudong Li, Yongjin Liu, Andre Milzarek, Ruisong Zhou |
| | 12:00-14:00 | Lunch |
| | 14:00-20:30 | Free discussion |
| Feb 12, Wednesday | 09:00-10:00 | Chair: Jie Song; Talk: Defeng Sun, HPR-LP: An implementation of an HPR method for solving linear programming |
| | 10:00-10:30 | Break |
| | 10:30-11:30 | Chair: Yaohua Hu; Panel: Bin Gao, Zheng Qu, Defeng Sun, Lei Yang, Shenglong Zhou |
| | 12:00-14:00 | Lunch |
| | 14:00-15:00 | Chair: Jinyan Fan; Talk: Yuling Jiao, DRM Revisited: A Complete Error Analysis |
| | 15:00-15:30 | Break |
| | 15:30-16:30 | Chair: Zuoqiang Shi; Panel: Yuling Jiao, Tianyou Li, Xiantao Xiao, Zhonglin Xie, Haijun Zou |
| | 17:00-19:00 | Dinner |
| | 20:00-21:00 | Free group discussion |
| Feb 13, Thursday | 09:00-10:00 | Chair: Bin Gao; Talk: Yuhong Dai, Optimization with Least Constraint Violation |
| | 10:00-10:30 | Break |
| | 10:30-11:30 | Chair: Zi Xu; Panel: Kuang Bai, Yuhong Dai, Houduo Qi, Cong Sun, Jiaxin Xie |
| | 12:00-14:00 | Lunch |
| | 14:00-20:30 | Free discussion |
| Feb 14, Friday | 09:00-10:00 | Chair: Cong Sun; Talk: Deren Han, Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance |
| | 10:00-10:30 | Break |
| | 10:30-11:30 | Chair: Jingwei Liang; Panel: Deren Han, Bo Jiang, Ruiyang Jin, Yibin Xiao, Pengcheng You, Shangzhi Zeng |
| | 12:00-14:00 | Lunch |
| | 14:00-15:00 | Chair: Xiantao Xiao; Talk: Kun Yuan, A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization |
| | 15:00-15:30 | Break |
| | 15:30-16:30 | Chair: Qing Ling; Panel: Shixiang Chen, Yujie Tang, Ming Yan, Kun Yuan, Jiaojiao Zhang |
| | 17:00-19:00 | Dinner |
| | 20:00-21:00 | Free group discussion |
Titles and Abstracts
09:00-10:00, Feb. 10, Monday
Speaker: Yinyu Ye
Title: Mathematical Optimization in the Era of AI
Abstract: This talk aims to address the question: do classical mathematical optimization and game-theoretic models, theories, and algorithms remain valuable in the AI and LLM era? We present several cases showing how mathematical programming and AI/machine-learning technologies complement each other. In particular, we discuss advances in LP, SDP, and market-equilibrium computation aided by machine-learning techniques, first-order methods, and GPU implementations. Conversely, we describe how classic optimization techniques can be applied to accelerate the gradient methods widely used for LLM training and fine-tuning.
14:00-15:00, Feb. 10, Monday
Speaker: Ruoyu Sun
Title: Understanding and Improving the Training of Large Models: Some Recent Progress
Abstract: This talk discusses understanding and improving training algorithms for large language models (LLMs). In the first part, we analyze why Adam outperforms SGD on Transformers and propose a lightweight alternative, Adam-mini. We explain the failure of SGD on Transformers: (i) Transformers are "heterogeneous": the Hessian spectra differ significantly across parameter blocks; (ii) heterogeneity hampers SGD: SGD performs poorly on problems with block heterogeneity. Motivated by this finding, we introduce Adam-mini, which assigns a single second-moment term to all weights in each block. Experiments show that Adam-mini saves 35-50% of memory compared with Adam without sacrificing performance, and it performs well on a variety of models, including language models up to 8B parameters and ViT. In the second part, we introduce MoFO, a new algorithm that mitigates forgetting during the SFT stage of LLMs. We observe that one cause of forgetting is that the weights drift away from the pre-trained weights after SFT; we therefore propose MoFO, which combines Adam with greedy coordinate descent, and provide a convergence analysis. Compared with the conventional approach of mixing pre-training data into the SFT data, our method requires no pre-training data and performs significantly better in experiments on a 7B model. In the third part, we introduce ReMax, a new RLHF (reinforcement learning from human feedback) algorithm. We identify three properties of RLHF tasks that are not fully exploited by the classical PPO method and propose the new method ReMax. ReMax reduces GPU memory usage by 50% and speeds up training by 1.6x, and it outperforms PPO and DPO in tests on the Mistral-7B model.
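A minimal sketch of the block-wise second-moment idea behind Adam-mini, assuming a NumPy-style interface; the function name, block layout, and hyperparameters below are illustrative, not the authors' released implementation:

```python
import numpy as np

def adam_mini_step(blocks, grads, state, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
    """One illustrative update: an Adam-style first moment per weight, but a single
    scalar second moment shared by all weights in a parameter block."""
    state["t"] += 1
    t = state["t"]
    for name, w in blocks.items():
        g = grads[name]
        m = state["m"][name] = beta1 * state["m"][name] + (1 - beta1) * g
        # one scalar v per block: running mean of squared gradients over the block
        v = state["v"][name] = beta2 * state["v"][name] + (1 - beta2) * float(np.mean(g * g))
        m_hat = m / (1 - beta1 ** t)
        v_hat = v / (1 - beta2 ** t)
        blocks[name] = w - lr * m_hat / (np.sqrt(v_hat) + eps)
    return blocks, state

# Toy usage: two "blocks" standing in for parameter groups with different curvature.
blocks = {"attn": np.ones(4), "mlp": 0.5 * np.ones(8)}
state = {"t": 0,
         "m": {k: np.zeros_like(v) for k, v in blocks.items()},
         "v": {k: 0.0 for k in blocks}}
grads = {"attn": 0.1 * np.ones(4), "mlp": -0.2 * np.ones(8)}
blocks, state = adam_mini_step(blocks, grads, state)
print(blocks["attn"], blocks["mlp"])
```

Storing one second-moment scalar per block instead of one per weight is what yields the reported memory savings relative to Adam.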
09:00-10:00, Feb. 11, Tuesday
Speaker: Xudong Li
Title: A New Dual Semismooth Newton Method for Polyhedral Projections
Abstract: We propose a new dual semismooth Newton method for computing the orthogonal projection onto a given polyhedron. Classical semismooth Newton methods typically depend on subgradient regularity assumptions for achieving local superlinear or quadratic convergence. Our approach, however, marks a significant breakthrough by demonstrating that it is always possible to identify a point where the existence of a nonsingular generalized Jacobian is guaranteed, regardless of any regularity conditions. Furthermore, we explain this phenomenon and its relationship with the weak strict Robinson constraint qualification (W-SRCQ) from the perspective of variational analysis. Building on this theoretical advancement, we develop an inexact semismooth Newton method with superlinear convergence for solving the polyhedral projection problem.
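As a point of reference, a minimal formulation of the problem the method targets; the inequality description Ax ≤ b is one illustrative representation of the polyhedron, and the dual below is the standard Lagrangian dual, not necessarily the exact reformulation used in the talk:

```latex
% Projection of a point z onto the polyhedron P = { x : Ax \le b }:
\min_{x}\ \tfrac12\|x - z\|^{2} \quad \text{s.t.}\quad Ax \le b,
\qquad\text{with dual}\qquad
\min_{\lambda \ge 0}\ \tfrac12\|A^{\top}\lambda\|^{2} - \lambda^{\top}(Az - b),
\qquad x^{\star} = z - A^{\top}\lambda^{\star}.
```

A dual approach works in the multiplier variable λ and recovers the projection through x* = z - A^⊤λ*.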
09:00-10:00, Feb. 12, Wednesday
Speaker: Defeng Sun
Title: HPR-LP: An implementation of an HPR method for solving linear programming
Abstract: In this talk, we introduce HPR-LP, an implementation of a Halpern Peaceman-Rachford (HPR) method with semi-proximal terms for solving linear programming (LP). We start by showing that the HPR method enjoys the highly desired iteration complexity of O(1/k) in terms of the Karush-Kuhn-Tucker residual and the objective error, via the theory developed recently for accelerated degenerate proximal point methods. Based on these complexity results, we then design an adaptive strategy of restart and penalty-parameter update to improve the efficiency and robustness of the HPR method. We conduct extensive numerical experiments on different LP benchmark datasets using an NVIDIA A100-SXM4-80GB GPU under different stopping tolerances. The Julia version of our solver achieves a 2.39x to 5.70x speedup, measured by SGM10, on benchmark datasets with presolve (2.03x to 4.06x without presolve) over the award-winning solver PDLP at the tolerance of 10^{-8}. Several practical techniques underlying the efficiency of the solver will be highlighted.
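For context on the O(1/k) rate, the Halpern acceleration of a nonexpansive fixed-point map T (such as the Peaceman-Rachford operator associated with the LP) uses the anchored iteration below; the anchoring weights 1/(k+2) are the standard choice, and the actual HPR-LP implementation may differ in details:

```latex
x^{k+1} \;=\; \frac{1}{k+2}\,x^{0} \;+\; \frac{k+1}{k+2}\,T\!\bigl(x^{k}\bigr),
\qquad k = 0,1,2,\ldots,
\qquad \bigl\|x^{k} - T\bigl(x^{k}\bigr)\bigr\| = O\!\left(\frac{1}{k}\right).
```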
14:00-15:00, Feb. 12, Wednesday
Speaker: Yuling Jiao
Title: DRM Revisited: A Complete Error Analysis
Abstract: The error analysis of deep learning includes approximation error, statistical error, and optimization error. However, existing works often struggle to integrate these three types of errors due to overparameterization. In this talk, we aim to bridge this gap by addressing a key question in the analysis of the Deep Ritz Method (DRM): "Given a desired level of precision, how can we determine the number of training samples, the parameters of neural networks, the step size of gradient descent, and the number of iterations needed so that the output deep networks from gradient descent closely approximate the true solution of the PDEs with the specified precision?"
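To fix notation, a sketch of the Deep Ritz loss for the model Poisson problem -Δu = f on Ω with zero Dirichlet boundary data; the Monte Carlo discretization and the boundary penalty weight β are one common choice rather than the exact setting of the talk:

```latex
\mathcal{E}(u) = \int_{\Omega} \Bigl( \tfrac12\|\nabla u(x)\|^{2} - f(x)\,u(x) \Bigr)\,dx,
\qquad
\widehat{\mathcal{E}}_{n,m}(u_{\theta}) = \frac{|\Omega|}{n}\sum_{i=1}^{n}
\Bigl( \tfrac12\|\nabla u_{\theta}(X_i)\|^{2} - f(X_i)\,u_{\theta}(X_i) \Bigr)
+ \beta\,\frac{|\partial\Omega|}{m}\sum_{j=1}^{m} u_{\theta}(Y_j)^{2},
```

where the X_i are sampled uniformly from Ω, the Y_j from ∂Ω, and u_θ is the network trained by gradient descent; a complete error analysis then ties the sample sizes n and m, the network size, the step size, and the iteration count to the accuracy of u_θ.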
09:00-10:00, Feb. 13, Thursday
Speaker: Yuhong Dai
Title: Optimization with Least Constraint Violation
Abstract: Feasibility is a fundamental issue in both continuous and discrete optimization, regardless of whether the problem is feasible. Even if the optimization problem is feasible, state-of-the-art solvers have to deal with infeasible subproblems and may generate infeasible cluster points. In this talk, I shall summarize recent advances on optimization with least constraint violation and related applications, and give some discussions.
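One common way to formalize the notion (with generic notation, not necessarily that of the talk): among all points that minimize the constraint violation, pick one that minimizes the objective.

```latex
\min_{x}\ f(x)
\qquad\text{s.t.}\qquad
x \in \arg\min_{y}\ \operatorname{dist}\bigl(c(y),\,K\bigr),
```

where dist(c(y), K) measures the violation of the constraint c(y) ∈ K; when the original problem is feasible, the inner minimum is zero and this reduces to the usual problem of minimizing f(x) subject to c(x) ∈ K.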
09:00-10:00, Feb. 14, Friday
Speaker: Deren Han
Title: Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance
Abstract: Saddle point problems are an important class of problems in optimization and game theory, and their theory and applications have long attracted broad attention; primal-dual algorithms are among the core methods for solving them. This talk presents convergence analyses for some basic saddle point problems and proposes new algorithms that take the balance between the primal and dual updates into account.
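For concreteness, one representative primal-dual scheme for the bilinear saddle point problem is the primal-dual hybrid gradient (Chambolle-Pock) iteration below; the "balance" in the abstract roughly concerns how the primal step size τ and the dual step size σ are matched, although the specific algorithms in the talk may differ:

```latex
\min_{x}\max_{y}\ f(x) + \langle Kx,\,y\rangle - g(y):
\qquad
\begin{aligned}
x^{k+1} &= \operatorname{prox}_{\tau f}\bigl(x^{k} - \tau K^{\top} y^{k}\bigr),\\
\bar{x}^{k+1} &= 2x^{k+1} - x^{k},\\
y^{k+1} &= \operatorname{prox}_{\sigma g}\bigl(y^{k} + \sigma K \bar{x}^{k+1}\bigr),
\end{aligned}
\qquad \tau\sigma\|K\|^{2} \le 1.
```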
14:00-15:00, Feb. 14, Friday
Speaker: Kun Yuan
Title: A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization
Abstract: Most decentralized optimization algorithms are handcrafted. While endowed with strong theoretical guarantees, these algorithms generally target a broad class of problems and are therefore not adaptive or customized to specific problem features. This paper studies data-driven decentralized algorithms trained to exploit problem features to boost convergence. Existing learning-to-optimize methods typically suffer from poor generalization or prohibitively vast search spaces. In addition, the vast search space of communication choices and the requirement of reaching the global solution through limited neighboring communication pose further challenges in decentralized settings. To resolve these challenges, this paper first derives the necessary conditions that successful decentralized algorithmic rules need to satisfy to achieve both optimality and consensus. Based on these conditions, we propose a novel Mathematics-inspired Learning-to-optimize framework for Decentralized optimization (MiLoDo). Empirical results demonstrate that MiLoDo-trained algorithms outperform handcrafted algorithms and exhibit strong generalization. Algorithms learned via MiLoDo in 100 iterations perform robustly when run for 100,000 iterations during inference. Moreover, algorithms trained via MiLoDo on synthetic datasets perform well on problems involving real data, higher dimensions, and different loss functions.
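A minimal NumPy sketch of the kind of handcrafted baseline (decentralized gradient descent over a mixing matrix) whose update rules a learning-to-optimize framework such as MiLoDo parameterizes and trains; the ring topology, step size, and quadratic local losses are illustrative:

```python
import numpy as np

def dgd(local_grads, x0, W, lr=0.1, iters=200):
    """Decentralized gradient descent: each node averages with its neighbors
    through the mixing matrix W, then takes a local gradient step."""
    x = x0.copy()                      # rows = nodes, columns = variable coordinates
    for _ in range(iters):
        mixed = W @ x                  # communication: neighbor averaging
        grads = np.stack([g(xi) for g, xi in zip(local_grads, mixed)])
        x = mixed - lr * grads         # computation: local gradient step
    return x

# Toy problem: node i holds f_i(x) = 0.5 * ||x - b_i||^2; the consensus optimum is mean(b_i).
n, d = 4, 3
rng = np.random.default_rng(0)
b = rng.normal(size=(n, d))
local_grads = [lambda x, bi=bi: x - bi for bi in b]

# Ring topology with a symmetric, doubly stochastic mixing matrix.
W = np.zeros((n, n))
for i in range(n):
    W[i, i] = 0.5
    W[i, (i - 1) % n] = 0.25
    W[i, (i + 1) % n] = 0.25

x = dgd(local_grads, np.zeros((n, d)), W)
# With a constant step size, DGD only reaches a neighborhood of the consensus
# optimum; closing this gap (e.g. via gradient tracking or learned update rules)
# is the kind of structure a learning-to-optimize framework searches over.
print(np.linalg.norm(x - b.mean(axis=0), axis=1))
```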