Workshop on Optimization and Learning

2025.01.02

Organizers: Thorsten Koch (Professor, ZIB), Wotao Yin (Professor, Alibaba DAMO Academy), Zaiwen Wen (Professor, Peking University)

Dates: 2025.02.09 - 2025.02.15


Workshop Schedule

 

Feb 10, Monday
  09:00-10:00   Talk: Yinyu Ye, "Mathematical Optimization in the Era of AI" (Chair: Yaxiang Yuan)
  10:00-10:30   Break
  10:30-11:30   Panel: Jinyan Fan, Dongdong Ge, Jie Song, Zaiwen Wen, Yinyu Ye, Yaxiang Yuan (Chair: Rujun Jiang)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Ruoyu Sun, "Understanding and Improving the Training of Large Language Models: Some New Progress" (Chair: Ming Yan)
  15:00-15:30   Break
  15:30-16:30   Panel: Cong Fang, Yutong He, Luo Luo, Ruoyu Sun, Xiangfeng Wang (Chair: Caihua Chen)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion

Feb 11, Tuesday
  09:00-10:00   Talk: Xudong Li, "A New Dual Semismooth Newton Method for Polyhedral Projections" (Chair: Yongjin Liu)
  10:00-10:30   Break
  10:30-11:30   Panel: Bo Jiang, Xudong Li, Yongjin Liu, Andre Milzarek, Ruisong Zhou (Chair: Ziyan Luo)
  12:00-14:00   Lunch
  14:00-20:30   Free discussion

Feb 12, Wednesday
  09:00-10:00   Talk: Defeng Sun, "HPR-LP: An implementation of an HPR method for solving linear programming" (Chair: Jie Song)
  10:00-10:30   Break
  10:30-11:30   Panel: Bin Gao, Zheng Qu, Defeng Sun, Lei Yang, Shenglong Zhou (Chair: Yaohua Hu)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Yuling Jiao, "DRM Revisited: A Complete Error Analysis" (Chair: Jinyan Fan)
  15:00-15:30   Break
  15:30-16:30   Panel: Yuling Jiao, Tianyou Li, Xiantao Xiao, Zhonglin Xie, Haijun Zou (Chair: Zuoqiang Shi)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion

Feb 13, Thursday
  09:00-10:00   Talk: Yuhong Dai, "Optimization with Least Constraint Violation" (Chair: Bin Gao)
  10:00-10:30   Break
  10:30-11:30   Panel: Kuang Bai, Yuhong Dai, Houduo Qi, Cong Sun, Jiaxin Xie (Chair: Zi Xu)
  12:00-14:00   Lunch
  14:00-20:30   Free discussion

Feb 14, Friday
  09:00-10:00   Talk: Deren Han, "Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance" (Chair: Cong Sun)
  10:00-10:30   Break
  10:30-11:30   Panel: Deren Han, Bo Jiang, Ruiyang Jin, Yibin Xiao, Pengcheng You, Shangzhi Zeng (Chair: Jingwei Liang)
  12:00-14:00   Lunch
  14:00-15:00   Talk: Kun Yuan, "A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization" (Chair: Xiantao Xiao)
  15:00-15:30   Break
  15:30-16:30   Panel: Shixiang Chen, Yujie Tang, Ming Yan, Kun Yuan, Jiaojiao Zhang (Chair: Qing Ling)
  17:00-19:00   Dinner
  20:00-21:00   Free group discussion


 

 

Titles and Abstracts

 

 

09:00-10:00, Feb. 10, Monday

Speaker: Yinyu Ye

Title: Mathematical Optimization in the Era of AI

Abstract: This talk aims to address the question: do the classical mathematical optimization/game models, theories, and algorithms remain valuable in the AI and LLM era? We present several cases showing how mathematical programming and AI/machine-learning technologies complement each other. In particular, we discuss advances in LP, SDP, and market-equilibrium computation aided by machine-learning techniques, first-order methods, and GPU implementations. Conversely, we describe how classic optimization techniques can be applied to accelerate the gradient methods that are widely used for LLM training and fine-tuning.

   

14:00-15:00, Feb. 10, Monday

Speaker: Ruoyu Sun

Title: Understanding and Improving the Training of Large Language Models: Some New Progress

Abstract: This talk discusses understanding and improving the training algorithms of large language models (LLMs). In the first part, we analyze why Adam outperforms SGD on Transformers and propose a lightweight alternative, Adam-mini. We explain why SGD fails on Transformers: (i) Transformers are heterogeneous: the Hessian spectra differ significantly across parameter blocks; (ii) this heterogeneity hampers SGD, which performs poorly on problems with block heterogeneity. Motivated by this finding, we introduce Adam-mini, which assigns a single second-moment term to all weights within each block. Experiments show that Adam-mini saves 35-50% of memory compared with Adam without sacrificing performance, and performs well on a variety of models, including 8B-scale language models and ViT. In the second part, we introduce MoFO, a new algorithm that mitigates forgetting during the SFT stage of LLMs. We observe that one cause of forgetting is that the weights drift away from the pre-trained weights after SFT; we therefore propose MoFO, which combines Adam with a greedy coordinate descent method, and provide a convergence analysis. Compared with the conventional approach of mixing pre-training data into the SFT data, our method requires no pre-training data and performs significantly better in experiments on a 7B model. In the third part, we present ReMax, a new RLHF (reinforcement learning from human feedback) algorithm. We identify three properties of RLHF tasks that are not fully exploited by the classical PPO method and propose the new method ReMax, which reduces GPU memory usage by 50% and speeds up training by 1.6x, outperforming PPO and DPO in tests on the Mistral-7B model.

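To make the Adam-mini idea above concrete, here is a minimal sketch, assuming a PyTorch-style list of parameter blocks and gradients, of an Adam step that keeps one scalar second moment per block instead of one per weight. The function name, block partitioning, and hyperparameters are hypothetical; this is not the authors' released Adam-mini implementation.

    import torch

    def blockwise_adam_step(params, grads, states, lr=1e-3, beta1=0.9, beta2=0.999, eps=1e-8):
        # One update step keeping a full first moment per weight but only a single
        # scalar second moment per parameter block (illustrative sketch only).
        for p, g, st in zip(params, grads, states):
            st["t"] += 1
            st["m"].mul_(beta1).add_(g, alpha=1 - beta1)           # per-weight first moment
            # scalar second moment shared by the whole block: mean of squared gradients
            st["v"] = beta2 * st["v"] + (1 - beta2) * g.pow(2).mean()
            m_hat = st["m"] / (1 - beta1 ** st["t"])               # bias corrections
            v_hat = st["v"] / (1 - beta2 ** st["t"])
            p.add_(m_hat / (v_hat.sqrt() + eps), alpha=-lr)        # parameter update

    # usage: one state entry per parameter block (e.g., per weight matrix)
    params = [torch.randn(4, 4), torch.randn(4)]
    grads = [torch.randn_like(p) for p in params]
    states = [{"m": torch.zeros_like(p), "v": torch.tensor(0.0), "t": 0} for p in params]
    blockwise_adam_step(params, grads, states)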
  

09:00-10:00, Feb. 11, Tuesday

Speaker: Xudong Li

Title: A New Dual Semismooth Newton Method for Polyhedral Projections

Abstract: We propose a new dual semismooth Newton method for computing the orthogonal projection onto a given polyhedron. Classical semismooth Newton methods typically depend on subgradient regularity assumptions for achieving local superlinear or quadratic convergence. Our approach, however, marks a significant breakthrough by demonstrating that it is always possible to identify a point where the existence of a nonsingular generalized Jacobian is guaranteed, regardless of any regularity conditions. Furthermore, we explain this phenomenon and its relationship with the weak strict Robinson constraint qualification (W-SRCQ) from the perspective of variational analysis. Building on this theoretical advancement, we develop an inexact semismooth Newton method with superlinear convergence for solving the polyhedral projection problem.

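As background for the abstract above, the generic setup is the projection of a point c onto a polyhedron {x : Ax <= b} and a semismooth Newton iteration on its dual; this is a textbook sketch, not the specific dual formulation or generalized Jacobian construction of the talk:
\[
\min_{x}\ \tfrac12\|x-c\|^2 \ \ \text{s.t.}\ \ Ax \le b,
\qquad \text{with Lagrangian dual} \qquad
\min_{y\ge 0}\ \phi(y) := \tfrac12\|A^{\top}y\|^2 - (Ac-b)^{\top}y, \quad x(y) = c - A^{\top}y.
\]
A dual semismooth Newton method solves the nonsmooth optimality system
\[
F(y) := y - \Pi_{\mathbb{R}^m_+}\bigl(y - \nabla\phi(y)\bigr) = 0, \qquad
y^{k+1} = y^k - V_k^{-1}F(y^k), \quad V_k \in \partial F(y^k),
\]
where local superlinear convergence classically hinges on a nonsingular generalized Jacobian near the solution; removing the need for such regularity assumptions is the point of the talk.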
 

 09:00-10:00, Feb. 12, Wednesday

Speaker: Defeng Sun

Title: HPR-LP: An implementation of an HPR method for solving linear programming

Abstract: In this talk, we introduce HPR-LP, a solver implementing a Halpern Peaceman-Rachford (HPR) method with semi-proximal terms for solving linear programming (LP). We start by showing that the HPR method enjoys the highly desirable iteration complexity of O(1/k) in terms of the Karush-Kuhn-Tucker residual and the objective error, via the theory developed recently for accelerated degenerate proximal point methods. Based on these complexity results, we then design an adaptive strategy for restarting and penalty-parameter updates to improve the efficiency and robustness of the HPR method. We conduct extensive numerical experiments on different LP benchmark datasets using an NVIDIA A100-SXM4-80GB GPU under different stopping tolerances. Our solver's Julia version achieves a 2.39x to 5.70x speedup, measured by SGM10, on benchmark datasets with presolve (2.03x to 4.06x without presolve) over the award-winning solver PDLP at a tolerance of 10^{-8}. Several practical techniques underlying the efficiency of the solver will be highlighted.

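For context on the Halpern ingredient of the HPR method, the generic Halpern-accelerated fixed-point iteration for a nonexpansive operator T reads as follows (a standard scheme; the semi-proximal Peaceman-Rachford operator actually used in HPR-LP is not spelled out here):
\[
x^{k+1} = \lambda_k\,x^0 + (1-\lambda_k)\,T(x^k), \qquad \lambda_k = \frac{1}{k+2},
\]
which anchors every iterate to the starting point x^0 and achieves an O(1/k) decay of the fixed-point residual \|x^k - T(x^k)\|; with T chosen as a Peaceman-Rachford splitting operator of the LP, this anchoring is, roughly speaking, the source of the O(1/k) KKT-residual complexity quoted above.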
  

14:00-15:00, Feb. 12, Wednesday

Speaker: Yuling Jiao

Title: DRM Revisited: A Complete Error Analysis

Abstract: The error analysis of deep learning includes approximation error, statistical error, and optimization error. However, existing works often struggle to integrate these three types of errors due to overparameterization. In this talk, we aim to bridge this gap by addressing a key question in the analysis of the Deep Ritz Method (DRM): "Given a desired level of precision, how can we determine the number of training samples, the parameters of the neural networks, the step size of gradient descent, and the number of iterations needed so that the output deep networks from gradient descent closely approximate the true solution of the PDEs with the specified precision?"

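As a concrete reference point for the three error sources above, the Deep Ritz Method for a model Poisson problem -\Delta u = f on \Omega with zero boundary values minimizes the Ritz energy over a neural-network class u_\theta and replaces the integrals by Monte Carlo sums (a standard formulation; the boundary penalty weight \lambda is an assumption of this sketch):
\[
\mathcal{E}(u_\theta) = \int_{\Omega}\Bigl(\tfrac12\|\nabla u_\theta(x)\|^2 - f(x)\,u_\theta(x)\Bigr)\,dx
\;+\; \lambda \int_{\partial\Omega} u_\theta(x)^2\,ds,
\]
\[
\widehat{\mathcal{E}}_{n,m}(u_\theta) = \frac{|\Omega|}{n}\sum_{i=1}^{n}\Bigl(\tfrac12\|\nabla u_\theta(X_i)\|^2 - f(X_i)\,u_\theta(X_i)\Bigr)
\;+\; \lambda\,\frac{|\partial\Omega|}{m}\sum_{j=1}^{m} u_\theta(Y_j)^2,
\]
with X_i sampled uniformly in \Omega and Y_j on \partial\Omega; the gap between a gradient-descent output minimizing \widehat{\mathcal{E}}_{n,m} and the true PDE solution then splits into exactly the approximation, statistical, and optimization errors discussed in the talk.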
 

09:00-10:00, Feb. 13, Thursday

Speaker: Yuhong Dai

Title: Optimization with Least Constraint Violation

Abstract: Feasibility is a fundamental issue in both continuous and discrete optimization, regardless of whether the problem is feasible. Even when the optimization problem is feasible, state-of-the-art solvers have to deal with infeasible subproblems and may generate infeasible cluster points. In this talk, I shall summarize recent advances on optimization with least constraint violation and related applications, and give some discussions.

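To make the notion of least constraint violation concrete, one common formulation (a generic illustration in the notation below, not necessarily the exact model of the talk) replaces a possibly infeasible problem by minimizing the objective over the points of minimal violation:
\[
\min_{x}\ f(x)\ \ \text{s.t.}\ \ c(x)\le 0
\qquad\longrightarrow\qquad
\min_{x}\ f(x)\ \ \text{s.t.}\ \ x \in \operatorname*{arg\,min}_{z}\ \tfrac12\,\bigl\|\max(c(z),0)\bigr\|^2 .
\]
If the original problem is feasible, the inner minimal violation is zero and nothing changes; if it is infeasible, the modified problem still defines a meaningful target, which is why solvers confronted with infeasible subproblems benefit from this viewpoint.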
 

 09:00-10:00, Feb. 14, Friday

Speaker: Deren Han

Title: Primal-Dual Algorithms for Saddle Point Problems: Convergence Analysis and Balance

Abstract: Saddle point problems form an important class of problems in optimization and game theory, and their theory and applications have long attracted broad attention; primal-dual algorithms are among the core methods for solving them. This talk presents convergence analyses for some basic saddle point problems and proposes new algorithms that account for the balance between the primal and dual updates.

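For readers less familiar with this class of methods, a representative primal-dual scheme for the bilinear saddle point problem \min_x \max_y f(x) + \langle Kx, y\rangle - g(y) is the PDHG (Chambolle-Pock) iteration below; the primal and dual step sizes \tau and \sigma are one natural place where primal-dual balance enters (the talk's specific balancing rule is not reproduced here):
\[
x^{k+1} = \operatorname{prox}_{\tau f}\bigl(x^{k} - \tau K^{\top} y^{k}\bigr), \qquad
y^{k+1} = \operatorname{prox}_{\sigma g}\bigl(y^{k} + \sigma K\,(2x^{k+1} - x^{k})\bigr),
\]
with convergence guaranteed under the step-size condition \tau\sigma\|K\|^2 < 1.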
 

14:00-15:00, Feb. 14, Friday

Speaker: Kun Yuan

Title: A Mathematics-Inspired Learning-to-Optimize Framework for Decentralized Optimization

Abstract: Most decentralized optimization algorithms are handcrafted. While endowed with strong theoretical guarantees, these algorithms generally target a broad class of problems and are therefore not adaptive or customized to specific problem features. This paper studies data-driven decentralized algorithms trained to exploit problem features to boost convergence. Existing learning-to-optimize methods typically suffer from poor generalization or prohibitively vast search spaces. In addition, the vast search space of communication choices and the goal of reaching the global solution via limited neighboring communication pose further challenges in decentralized settings. To resolve these challenges, this paper first derives the necessary conditions that successful decentralized algorithmic rules need to satisfy to achieve both optimality and consensus. Based on these conditions, we propose a novel Mathematics-inspired Learning-to-optimize framework for Decentralized optimization (MiLoDo). Empirical results demonstrate that MiLoDo-trained algorithms outperform handcrafted algorithms and exhibit strong generalization. Algorithms learned via MiLoDo over 100 iterations perform robustly when run for 100,000 iterations at inference time. Moreover, MiLoDo-trained algorithms learned on synthetic datasets perform well on problems involving real data, higher dimensions, and different loss functions.
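As background for the optimality-and-consensus conditions mentioned in the abstract, the canonical decentralized problem and a classical handcrafted baseline, decentralized gradient descent with a doubly stochastic mixing matrix W, can be written as follows (shown only as a reference point, not the MiLoDo update rule):
\[
\min_{x\in\mathbb{R}^d}\ \frac{1}{n}\sum_{i=1}^{n} f_i(x),
\qquad
x_i^{k+1} = \sum_{j\in\mathcal{N}_i\cup\{i\}} W_{ij}\,x_j^{k} \;-\; \alpha\,\nabla f_i\bigl(x_i^{k}\bigr),
\]
where agent i communicates only with its neighbors \mathcal{N}_i. Any learned rule must preserve this neighbor-only communication pattern while still driving all local iterates x_i to a common minimizer, which is precisely the pair of requirements, optimality and consensus, that the necessary conditions in the abstract formalize.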