Workshop on Frontiers and Intersections in Statistics in the Context of Big Data

2025.01.23

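Organizers: 王启华 (Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Research Professor), 常晋源 (Southwestern University of Finance and Economics, Research Professor), 张新雨 (Academy of Mathematics and Systems Science, Chinese Academy of Sciences, Research Professor)

Dates: 2025.02.16-2025.02.22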


Program


February 17 (Monday)

Chair: 邹国华
8:35-8:40  Welcome remarks by 王启华
8:40-9:40  金加顺, The Statistics Triangle
9:40-10:10  Discussion
10:10-10:30  Group photo and tea break

Chair: 何煦
10:30-11:30  林乾, Towards A Statistical Understanding of Deep Learning: Beyond the NTK Theory
11:30-12:00  Discussion

Lunch and midday break

Chair: 赵建华
14:10-15:10  孔新兵, Price Staleness and Volatility Estimation
15:10-15:40  Discussion
15:40-16:00  Tea break

Chair: 胡江
16:00-17:00  荆炳义, Recent Advances in Large Models
17:00-17:30  Discussion

  

February 18 (Tuesday)

Chair: 郁文
8:40-9:40  林华珍, Spatial Effect Detection Regression for Large Scale Spatio-Temporal Covariates
9:40-10:10  Discussion
10:10-10:30  Tea break

Chair: 项冬冬
10:30-11:30  刘玉坤, Unsupervised Optimal Deep Transfer Learning for Classification Under General Conditional Shift
11:30-12:00  Discussion

Lunch and midday break

Chair: 李国栋
14:10-15:10  常晋源, Deep Conditional Distribution Learning via Conditional Föllmer Flow
15:10-15:40  Discussion
15:40-16:00  Tea break

Chair: 李长城
16:00-17:00  郭旭, Estimation and Inference of High-Dimensional Factor Augmented Regression Model
17:00-17:30  Discussion

February 19 (Wednesday)

Chair: 刘小惠
8:40-9:40  张正军, Hamiltonian-Clustering Modernized Asymmetric Causality
9:40-10:10  Discussion
10:10-10:30  Tea break

Chair: 赵俊龙
10:30-11:30  郑泽敏, SOFARI: High-Dimensional Manifold-Based Inference
11:30-12:00  Discussion

Lunch and midday break

14:10-17:30  Free discussion

 

February 20 (Thursday)

Chair: 张立新
8:40-9:40  杨宇红, On A Synergistic Learning Phenomenon in Nonparametric Domain Adaptation
9:40-10:10  Discussion
10:10-10:30  Tea break

Chair: 於州
10:30-11:30  邱宇谋, Minimax Optimal Clustering and Signal Recovery Under Block Signals and Computational Constraint
11:30-12:00  Discussion

Lunch and midday break

Chair: 练恒
14:10-15:10  王军辉, Longitudinal Networks: Adaptive Merging and Efficient Estimation
15:10-15:40  Discussion
15:40-16:00  Tea break

Chair: 俞章盛
16:00-17:00  林伟, Demystifying Neural Networks: Provable Generalization and Feature Learning
17:00-17:30  Discussion

 

February 21 (Friday)

Chair: 张日权
8:40-9:40  黄坚, Generative Learning Through Continuous Normalizing Flows
9:40-10:10  Discussion
10:10-10:30  Tea break

Chair: 李子林
10:30-11:30  周永道, Model-Free Subsampling Methods for Big Data
11:30-12:00  Discussion

Lunch and midday break

Chair: 马诗洋
14:10-15:10  席瑞斌, A Generic Graphical Model for Continuous and Discrete Data: Efficiency, Diversity and Heterogeneity
15:10-15:40  Discussion
15:40-16:00  Tea break

Chair: 姜丹丹
16:00-17:00  张新雨, Model Averaging Methods Based on AUC
17:00-17:30  Discussion

 

Abstracts

 

金加顺

Southeast University; Carnegie Mellon University

Title: The Statistics Triangle

Abstract: In his Fisher Lecture in 1996, Efron suggested that there is a philosophical triangle in statistics with Bayesian, Fisherian, and Frequentist being the three vertices, and that most statistical methods can be viewed as a convex combination of the three philosophies. We collected and cleaned a data set consisting of the citation and BibTeX (e.g., title, abstract, author information) data of 83,331 papers published in 36 journals in statistics and related fields, spanning 41 years. Using the data set, we constructed 21 co-citation networks, each for a time window between 1990 and 2015. We propose a dynamic Degree-Corrected Mixed-Membership (dynamic-DCMM) model, where we model the research interests of an author by a low-dimensional weight vector (called the network memberships) that evolves slowly over time. We propose dynamic-SCORE as a new approach to estimating the memberships. We discover a triangle in the spectral domain which we call the Statistical Triangle, and use it to visualize the research trajectories of individual authors. We interpret the three vertices of the triangle as the three primary research areas in statistics: Bayes, Biostatistics and Non-parametrics. The Statistical Triangle further splits into 15 sub-regions, which we interpret as the 15 representative sub-areas in statistics. These results provide useful insights into the research trends and behavior of statisticians.

 

林乾

Tsinghua University

Title: Towards A Statistical Understanding of Deep Learning: Beyond the NTK Theory

Abstract: A primary advantage of neural networks lies in their feature learning characteristics, which are challenging to analyze theoretically due to the complexity of their training dynamics. We propose a new paradigm for studying feature learning and the resulting benefits in generalizability. After reviewing the neural tangent kernel (NTK) theory and recent results in kernel regression, which address the generalization issue of sufficiently wide neural networks, we examine limitations and implications of the fixed kernel theory (such as the NTK theory) and review recent theoretical advancements in feature learning. Moving beyond the fixed kernel/feature theory, we consider neural networks as adaptive feature models. Finally, we propose an over-parameterized Gaussian sequence model as a prototype model to study the feature learning characteristics of neural networks.

 

孔新兵

Southeast University; Nanjing Audit University

Title: Price Staleness and Volatility Estimation

Abstract: In this paper, we introduce a novel nonstationary price staleness factor model allowing for friction pervasive across assets and possible input covariates. With large panel high-frequency data, we give the maximum likelihood estimators of the regression coefficients, the factors, and their loading parameters, which recover the time-varying price staleness probability and an integrated functional of the price staleness over two assets. The asymptotic results are obtained when both the dimension d and the sampling frequency n diverge simultaneously. With the local principal component analysis (PCA) method, we find that the price co-volatilities (including both systematic and idiosyncratic components) are biased upward due to the presence of staleness. Bias-corrected estimators of the systematic and idiosyncratic covolatilities, spot or integrated, are provided and proved to be consistent. Interestingly, besides their dependence on the dimensionality d, the integrated estimates converge with a factor of [formula image missing] though the local PCA estimates converge with a factor of [formula image missing], validating the aggregation efficiency after nonlinear factor analysis. But the bias correction degrades the convergence rates of the estimated systematic covolatilities, spot or integrated. Numerical experiments justify our theoretical findings. Empirically, we observe that the staleness correction leads to reduced out-of-sample portfolio risk almost uniformly in tested gross exposure levels.

 

荆炳义

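Southern University of Science and Technology

Title: Recent Advances in Large Models

Abstract: In recent years, the field of artificial intelligence has undergone rapid change, and the rise of large models is undoubtedly at the core of this change. Large models (LLMs, VLMs, etc.) have demonstrated powerful generative capabilities and broad application potential. This talk will review the development of large models and look ahead to their future directions. It will also report on our research group's recent work.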


林华珍

Southwestern University of Finance and Economics

Title: Spatial Effect Detection Regression for Large Scale Spatio-Temporal Covariates

Abstract: We develop a Spatial Effect Detection Regression (SEDR) model to capture the nonlinear and irregular effects of high-dimensional spatio-temporal predictors on a scalar outcome. Specifically, we assume that both the component and the coefficient functions in the SEDR are unknown smooth functions with respect to location and time. This allows us to leverage spatially and temporally correlated information among ultrahigh dimensional large-scale covariates and to transform the curse of dimensionality of high-dimensional spatio-temporal predictors into a blessing, which is confirmed by our theoretical and numerical results. Moreover, we introduce a set of 0-1 regression coefficients to automatically identify the boundaries of the spatial effect, which is implemented through a novel penalty. Combining penalized approaches and B-spline smoothing techniques, we develop a simple iterative algorithm consisting of explicit forms at each updating step. With the initial values given in the paper, the algorithm is shown to converge. Furthermore, we establish the convergence rate and selection consistency for the proposed estimator to explore the performance of the resulting estimator across various scenarios of the dimensionality and the effect space. We thoroughly evaluate the superior performance of our proposed method in terms of bias and empirical efficiency through simulation studies. Finally, we demonstrate the effectiveness of the proposed method by analyzing and forecasting environmental monitoring data and data from the Alzheimer's Disease Neuroimaging Initiative study, revealing interesting findings and obtaining much smaller out-of-sample prediction errors than those of existing methods.

 

刘玉坤

East China Normal University

Title: Unsupervised Optimal Deep Transfer Learning for Classification Under General Conditional Shift

Abstract: Classifiers trained on labelled source data can yield misleading results when applied to unlabelled target data from a different distribution. Transfer learning can rectify this by transferring knowledge from source to target data, but its validity often hinges on strict assumptions like label shift. In this paper, we introduce a general conditional shift (GCS) assumption, which is novel and encompasses label shift as a special case. We show that the target distribution is identifiable under GCS. Using deep neural networks (DNN), we estimate the conditional probabilities η_p for source data. After transferring the DNN estimator to the target data, we estimate the target label distribution π_Q via a pseudo maximum likelihood method and construct a Bayes classifier. We derive concentration bounds for our estimators of both η_p and π_Q. Unlike existing competitors, our method can mitigate the curse of dimensionality when π_Q exhibits low-dimensional structure. We prove that our DNN-based classifier achieves the optimal minimax rate up to a logarithmic factor. Numerical and real data results demonstrate its superiority.


常晋源

Southwestern University of Finance and Economics; Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Title: Deep Conditional Distribution Learning via Conditional Föllmer Flow

Abstract: We introduce an ordinary differential equation (ODE) based deep generative method for learning conditional distributions, named Conditional Föllmer Flow. Starting from a standard Gaussian distribution, the proposed flow can approximate the target conditional distribution very well when the time is close to 1. For effective implementation, we discretize the flow with Euler's method, where we estimate the velocity field nonparametrically using a deep neural network. Furthermore, we establish the convergence result for the Wasserstein-2 distance between the distribution of the learned samples and the target conditional distribution, providing the first comprehensive end-to-end error analysis for conditional distribution learning via ODE flow. Our numerical experiments showcase its effectiveness across a range of scenarios, from standard nonparametric conditional density estimation problems to more intricate challenges involving image data, illustrating its superiority over various existing conditional density estimation methods.
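The Euler discretization mentioned in this abstract can be illustrated with a minimal sketch. It is not the speaker's implementation: the function names `euler_flow` and `toy_velocity` are hypothetical, and the velocity field, which the talk estimates with a deep neural network, is replaced here by a toy closed-form field (dx/dt = -x) so the result can be checked against the exact solution.

```python
import math
import numpy as np

def euler_flow(x0, velocity, n_steps=100, t_end=1.0):
    """Integrate dx/dt = velocity(x, t) from t = 0 to t_end with Euler's method."""
    x = np.array(x0, dtype=float)
    h = t_end / n_steps              # step size
    for k in range(n_steps):
        t = k * h                    # left endpoint of the current step
        x = x + h * velocity(x, t)   # explicit Euler update
    return x

def toy_velocity(x, t):
    """Stand-in velocity field (assumption): exact flow is x(t) = x0 * exp(-t)."""
    return -x

if __name__ == "__main__":
    out = euler_flow([1.0, 2.0], toy_velocity, n_steps=1000)
    print(out, math.exp(-1.0))
```

With more steps the Euler output approaches the exact flow; in a learned flow, the discretization error would combine with the error of the estimated velocity field.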


郭旭

Beijing Normal University

Title: Estimation and Inference of High-Dimensional Factor Augmented Regression Model

Abstract: The factor model is a powerful tool for dealing with high correlations among predictors. It has also been incorporated in regression analysis. In this talk, I will share recent developments in estimation and inference for high-dimensional factor augmented regression models. In particular, I will discuss high-dimensional semiparametric factor augmented regression models. Among others, the single-index model and the partially linear regression model are two widely investigated semiparametric models. However, existing methods do not perform well when predictors are highly correlated. We first address the question of whether it is necessary to consider the augmented part by introducing a score-type test statistic. Compared with previous test statistics, our proposed test statistic does not need to estimate the high-dimensional regression coefficients or the high-dimensional precision matrix, making it simpler to implement. We also propose a Gaussian multiplier bootstrap to determine the critical value. The validity of our procedure is theoretically established under suitable conditions. We further investigate the penalized estimation of the regression model. With estimated latent factors, we establish the error bounds of the estimators. Lastly, we introduce a debiased estimator and construct confidence intervals for individual coefficients based on asymptotic normality. Simulation studies and real data analysis are conducted to illustrate the proposed methods.

 

张正军

University of Chinese Academy of Sciences

Title: Hamiltonian-Clustering Modernized Asymmetric Causality

Abstract: Understanding causal relationships among variables is crucial in economic, biological, medical, climate, and other applied research. Conventional methods often falter with asymmetric causality and high-dimensional data. From a machine learning perspective, this talk introduces Hamiltonian-Clustering Modernized Asymmetric Causality (HMAC) to address these challenges. HMAC integrates Generalized Measures of Correlation (GMC) into deep clustering with a RadViz-style representation, using an optimal Hamiltonian cycle to map clusters, similarities, and outliers, allowing clear visualization of causal relationships, which is a significantly different causal representation from those in the literature. Under the minimum mean squared prediction error principle, we theoretically justify that GMCs lead to the best causative method. Compared with other causal inference methods, HMAC requires the fewest structural and statistical assumptions. It is widely applicable, easily implementable, and empirically interpretable. Extensive and rigorous experiments across synthetic, engineering, machine learning, economic, and financial data demonstrate HMAC's superior performance over existing methods. Significantly, HMAC reveals that USD/CNY exchange rate changes drive changes in USD/EUR, USD/GBP, and USD/JPY, alongside identifying annual block timing effects in macroeconomic indicators. HMAC demonstrates indirect causal effects in MNIST and fashion designs, which are hard to obtain with other causal methods. Joint work with Tianyi Huang and Shenghui Cheng.

 

郑泽敏

University of Science and Technology of China

Title: SOFARI: High-Dimensional Manifold-Based Inference

Abstract: Multi-task learning is a widely used technique for harnessing information from various tasks. Recently, the sparse orthogonal factor regression (SOFAR) framework, based on the sparse singular value decomposition (SVD) within the coefficient matrix, was introduced for interpretable multi-task learning, enabling the discovery of meaningful latent feature-response association networks across different layers. However, conducting precise inference on the latent factor matrices has remained challenging due to orthogonality constraints inherited from the sparse SVD constraint. In this paper, we suggest a novel approach called high-dimensional manifold-based SOFAR inference (SOFARI), drawing on the Neyman near-orthogonality inference while incorporating the Stiefel manifold structure imposed by the SVD constraints. By leveraging the underlying Stiefel manifold structure, SOFARI provides bias-corrected estimators for both latent left factor vectors and singular values, which we show enjoy asymptotic mean-zero normal distributions with estimable variances. We introduce two SOFARI variants to handle strongly and weakly orthogonal latent factors, where the latter covers a broader range of applications. We illustrate the effectiveness of SOFARI and justify our theoretical results through simulation examples and a real data application in economic forecasting.

 

杨宇红

Tsinghua University

Title: On A Synergistic Learning Phenomenon in Nonparametric Domain Adaptation

Abstract: Consider nonparametric domain adaptation for regression, which assumes the same conditional distribution of the response given the covariates but different marginal distributions of the covariates. An important goal is to understand how the source data may improve the minimax convergence rate of learning the regression function when the likelihood ratio of the covariate distributions of the target data and the source data is unbounded. A previous work of Pathak, Ma and Wainwright (2022) shows that the minimax transfer learning rate is simply determined by the faster rate of using either the source or the target data alone. In this talk, we present a new synergistic learning phenomenon (SLP): the minimax convergence rate based on both data sets may sometimes be faster (even much faster) than the better rate of convergence based on the source or target data only. The SLP occurs when and only when the target sample size is smaller (in order) than, but not too much smaller than, the source sample size, in relation to the smoothness of the regression function and the nature of the covariate densities of the source and target distributions. Interestingly, the SLP happens in two different ways according to the relationship between the two sample sizes, as will be explained. This talk is based on joint work with Ling Zhou at Southwestern University of Finance and Economics.

 

邱宇谋

Peking University

Title: Minimax Optimal Clustering and Signal Recovery Under Block Signals and Computational Constraint

Abstract: This paper derives both statistical and computational minimax lower bounds for high-dimensional clustering and signal recovery under block signal structures, where the computational minimax boundaries are restricted to algorithms with polynomial computational complexity. The minimax boundaries are constructed in terms of the strength, sparsity and block size of signals. They show a phase transition phenomenon where no algorithm (or polynomial time algorithm for the computational boundary) can consistently separate two clusters or identify the components with signals if the signal strength is smaller than that implied by the corresponding minimax boundary. The minimax results show that signal recovery is more difficult than clustering for dense signals and vice versa for sparse signals. Motivated by this, we propose two new sets of methods, moving average PCA (MA-PCA) and cross-block feature aggregation PCA (CFA-PCA), designed for dense and sparse block signals, respectively. Both methods adaptively utilize the block signal structure for spectral clustering, applicable to non-Gaussian data with heterogeneous variances and non-diagonal covariance matrices. In particular, the CFA method utilizes a U-statistic formulation that can consistently select useful features for clustering nonparametrically under sparse signals. We demonstrate that the proposed MA-PCA and CFA-PCA methods can achieve the computational minimax boundaries for clustering and signal recovery in the dense and sparse signal regimes, respectively, indicating that the derived computational minimax boundaries are tight. Our results also show that clusters with much weaker signals can be detected if a block structure exists. Simulation studies are conducted to evaluate the proposed methods, which show their superiority over existing methods. Case studies on air pollution and geoscience data demonstrate the utility of the proposed methods in practice.

 

王军辉

The Chinese University of Hong Kong

Title: Longitudinal Networks: Adaptive Merging and Efficient Estimation

Abstract: A longitudinal network consists of a sequence of temporal edges among multiple nodes, where the temporal edges are observed in real time. Such networks have become ubiquitous with the rise of online social platforms and e-commerce, but remain largely under-investigated in the literature. In this talk, we present an efficient estimation framework for longitudinal networks, leveraging the strengths of adaptive network merging, tensor decomposition and point processes. It merges neighboring sparse networks so as to enlarge the number of observed edges and reduce estimation variance, whereas the estimation bias introduced by network merging is controlled by exploiting local temporal structures for adaptive network neighborhood. A projected gradient descent algorithm is proposed to facilitate estimation, where the upper bound of the estimation error in each iteration is established. Theoretical analysis of the proposed method shows that it can significantly reduce the estimation error and also provides guidelines for network merging under various scenarios. We further demonstrate the advantage of the proposed method through extensive numerical experiments on synthetic datasets and a militarized interstate dispute dataset.

 

林伟

Peking University

Title: Demystifying Neural Networks: Provable Generalization and Feature Learning

Abstract: Neural networks have achieved remarkable successes in modern deep learning practice. Yet, their ability to generalize well even in the overparametrized regime and to learn meaningful representations from data remains controversial and mysterious in theory. In this talk, we suggest new statistical theories to elucidate the superiority of two-layer ReLU neural networks over classical machine learning methods. In the first part, we approach the problem from a nonparametric point of view and derive unified generalization bounds for any finite-width network, thereby providing a justification for the double descent phenomenon. In the second part, we consider the teacher-student setting where the data are generated from a parametric teacher network with well-separated features. By borrowing ideas from high-dimensional statistics, we establish identifiability and estimation bounds for the student network under an appropriate reparametrization, thereby providing a theoretical guarantee for feature learning.

黄坚

The Hong Kong Polytechnic University

Title: Generative Learning Through Continuous Normalizing Flows

Abstract: Continuous normalizing flows (CNFs) are a generative method based on ordinary differential equations for learning probability distributions. This method has shown success in applications like image synthesis, protein structure prediction, and molecule generation. We present the CNF method and study its theoretical properties using a flow matching objective function. We establish non-asymptotic error bounds for the distribution estimator based on CNFs, in terms of the Wasserstein-2 distance, under the assumption that the target distribution has bounded support, is strongly log-concave, or is a mixture of Gaussian distributions. Our convergence analysis addresses errors due to velocity estimation, discretization, and early stopping. We also develop uniform error bounds with Lipschitz regularity control for deep ReLU networks approximating the Lipschitz function class. Our analysis provides theoretical guarantees for using CNFs to learn probability distributions from finite random samples.

 

周永道

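Nankai University

Title: Model-Free Subsampling Methods for Big Data

Abstract: In the era of big data, how to provide high-quality data is an active research topic. With effective data collection methods, high-quality subsamples can be drawn from massive data sets, reducing training cost and allowing models to be trained quickly. True models are often nonlinear, so subsampling methods for big data that do not depend on the model need to be considered. Experimental design is an important data collection method. This talk will introduce model-robust experimental design methods and, based on them, propose model-free subsampling methods for big data. Both simulations and real examples show that the resulting subsamples are model-robust.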

 

席瑞斌

Peking University

Title: A Generic Graphical Model for Continuous and Discrete Data: Efficiency, Diversity and Heterogeneity

Abstract: Traditional network inference methods, such as Gaussian Graphical Models (GGM), which are built on continuity and homogeneity, face challenges when modeling discrete data and heterogeneous frameworks. Furthermore, under high dimensionality the parameter estimation of such models can be hindered by the notorious intractability of high-dimensional integrals. In this paper, we introduce a new and flexible device for graphical models, which accommodates diverse data types, including Gaussian, Poisson log-normal (PLN), and latent Gaussian copula models. The new device is driven by a new marginally recoverable parametric family, which, thanks to the marginal recoverability, can be effectively estimated in high-dimensional settings without evaluating high-dimensional integrals. We further introduce a mixture of marginally recoverable models to capture ubiquitous heterogeneous structures. We show the validity of the desirable properties of the models and the effective estimation methods, and demonstrate their advantages over state-of-the-art network inference methods via extensive simulation studies and a gene regulatory network analysis of real scRNA-seq data.

 

张新雨

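Academy of Mathematics and Systems Science, Chinese Academy of Sciences

Title: Model Averaging Methods Based on AUC

Abstract: ROC curve analysis is commonly used in clinical analysis and the social sciences to assess the trade-off between the sensitivity and specificity of a classification model. The area under the ROC curve (AUC) measures the generalization ability of a classification model and is one of the most commonly used metrics for evaluating the predictive performance of classifiers. In practice, we usually have many candidate models for a binary classification problem, but we are unsure which one to use. In this paper, we propose two L-fold cross-validation model averaging methods based on AUC to address candidate-model uncertainty in binary classification.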
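To make the idea of AUC-driven model averaging concrete, here is a minimal sketch under strong simplifying assumptions. It is not the method proposed in the talk: it averages only two candidate models, searches the weight on a fixed grid, and evaluates on a single held-out set rather than L cross-validation folds. The names `auc` and `best_weight` are hypothetical.

```python
import numpy as np

def auc(scores, labels):
    """Empirical AUC: share of (positive, negative) pairs ranked correctly; ties count 1/2."""
    scores = np.asarray(scores, dtype=float)
    labels = np.asarray(labels)
    pos = scores[labels == 1]
    neg = scores[labels == 0]
    diff = pos[:, None] - neg[None, :]   # all positive-minus-negative score gaps
    return float((diff > 0).mean() + 0.5 * (diff == 0).mean())

def best_weight(score_a, score_b, labels, grid=np.linspace(0.0, 1.0, 11)):
    """Grid search for w in the averaged score w*a + (1-w)*b that maximizes AUC."""
    score_a = np.asarray(score_a, dtype=float)
    score_b = np.asarray(score_b, dtype=float)
    aucs = [auc(w * score_a + (1.0 - w) * score_b, labels) for w in grid]
    k = int(np.argmax(aucs))
    return float(grid[k]), aucs[k]

if __name__ == "__main__":
    y = [0, 0, 1, 1]
    a = [0.1, 0.6, 0.4, 0.9]   # made-up held-out scores from candidate model A
    b = [0.5, 0.2, 0.3, 0.8]   # made-up held-out scores from candidate model B
    print(best_weight(a, b, y))
```

Because the grid contains the weights 0 and 1, the averaged score is never worse on the evaluation set than either candidate alone; choosing weights on held-out folds, as the abstract describes, is what guards against overfitting that choice.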