Workshop on Frontiers of Statistics for Large-Scale Data

2024.01.15

Dates: 2024.1.21-2024.1.27

Organizers: 王启华, 常晋源, 张新雨


Schedule

January 22 (Monday)

Time  Speaker  Title

Morning session (Chair: 邹国华)

8:35-8:40  Welcome remarks by 王启华

8:40-9:30  陈松蹊  Ensemble Kalman Filter for High Resolution Data Assimilation

9:30-9:40  Discussion

9:40-10:00  Group photo and tea break

10:00-10:50  朱仲义  Decentralized Learning of Quantile Regression: a Smoothing Approach with Two Bandwidths

10:50-11:00  Discussion

11:00-11:50  席瑞斌  Statistical Models and Methods for Single-Cell and Spatial Omics Data

11:50-12:00  Discussion

Lunch and midday break

Afternoon session (Chair: 张立新)

14:15-15:05  孙文光  Integrative Conformal P-values for Out-of-distribution Testing with Labeled Outliers

15:05-15:15  Discussion

15:15-16:05  赵俊龙  Residual Importance Weighted Transfer Learning for High-dimensional Linear Regression

16:05-16:15  Discussion

16:15-16:30  Tea break

16:30-17:20  兰伟  Multivariate Reduced Rank Spatiotemporal Models

17:20-17:30  Discussion

January 23 (Tuesday)

Time  Speaker  Title

Morning session (Chair: 赵鹏)

8:45-9:35  朱力行  Weighted Residual Empirical Processes, Martingale Transformations, and Model Specification Tests with Diverging Number of Parameters

9:35-9:45  Discussion

9:45-10:35  郭旭  Model-free Variable Importance Detection with Machine Learning Methods

10:35-10:45  Discussion

10:45-11:00  Tea break

11:00-11:50  李赛  Multi-dimensional Domain Generalization with Low-rank Structures

11:50-12:00  Discussion

Lunch and midday break

Afternoon session (Chair: 张新雨)

14:15-15:05  刘卫东  Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

15:05-15:15  Discussion

15:15-16:05  张庆昭  Graphical Model-based Heterogeneity Analysis

16:05-16:15  Discussion

16:15-16:30  Tea break

16:30-17:20  毛晓军  Decentralized Reduced Rank Regression for Response Partition

17:20-17:30  Discussion

January 24 (Wednesday)

Time  Speaker  Title

Morning session (Chair: 林华珍)

8:45-9:35  荆炳义  GAN-based Policy Learning for Offline Reinforcement Learning

9:35-9:45  Discussion

9:45-10:35  林伟  Heterogeneous Federated Learning on Arbitrary Graphs

10:35-10:45  Discussion

10:45-11:00  Tea break

11:00-11:50  郁文  Neural Frailty Machines for Survival Analysis

11:50-12:00  Discussion

Lunch and midday break

Afternoon session

14:30-17:30  Free discussion

January 25 (Thursday)

Time  Speaker  Title

Morning session (Chair: 王学钦)

8:45-9:35  张正军  Towards Precision Oncology Discovery: Four Less Known Genes and Their Unknown Interactions as Highest-Performed Biomarkers for Colorectal Cancer

9:35-9:45  Discussion

9:45-10:35  苗旺  Correction for Nonresponse Bias in Estimation of US Presidential Election Turnout Using Callback Data

10:35-10:45  Discussion

10:45-11:00  Tea break

11:00-11:50  骆威  On Efficient Dimension Reduction with Respect to the Interaction Between Two Response Variables

11:50-12:00  Discussion

Lunch and midday break

Afternoon session (Chair: 郑术蓉)

14:15-15:05  朱利平  Nonlinear Dependence and Tests of Independence

15:05-15:15  Discussion

15:15-16:05  何煦  Sequentially Refined Latin Hypercube Designs with Flexibly and Adaptively Chosen Sample Sizes

16:05-16:15  Discussion

16:15-16:30  Tea break

16:30-17:20  王典朋  A Subsampling Method for Regression Problems Based on Minimum Energy Criterion

17:20-17:30  Discussion

January 26 (Friday)

Time  Speaker  Title

Morning session (Chair: 耿直)

8:45-9:35  王启华  Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data

9:35-9:45  Discussion

9:45-10:35  赵普映  Sample Empirical Likelihood Inference with Complex Survey Data

10:35-10:45  Discussion

10:45-11:00  Tea break

11:00-11:50  王涛  Factor Augmented Inverse Regression and Its Application to Microbiome Data Analysis

11:50-12:00  Discussion

Lunch and midday break

Afternoon session (Chair: 艾明要)

14:15-15:05  常晋源  Exploring Excellence: Bayesian Penalized Empirical Likelihood and MCMC Sampling

15:05-15:15  Discussion

15:15-16:05  金百锁  Model Estimation and Selection for Spatial Dynamic Panel Data

16:05-16:15  Discussion

16:15-16:30  Tea break

16:30-17:20  马诗洋  Knockoff-based Statistics for the Identification of Putative Causal Genes in Genetic Studies

17:20-17:30  Discussion

Abstracts

陈松蹊

(北京大学)

 

Title: Ensemble Kalman Filter for High Resolution Data Assimilation

Abstract: The ensemble Kalman filter (EnKF), a fundamental data assimilation approach, has been widely used in many fields of earth science, engineering and beyond. However, several theoretical aspects of the EnKF remain unknown, especially when the state variable is high dimensional and the physical model is misspecified. This paper first proposes several high dimensional EnKF methods which provide consistent estimators for the important forecast error covariance and the Kalman gain matrix. It then studies the theoretical properties of the EnKF under both fixed and high dimensional state variables, providing the mean square errors of the analysis states relative to the underlying oracle states offered by the Kalman filter and giving much needed insight into the role played by the forecast error covariance in the accuracy of the EnKF. The accuracy of the data assimilation under the misspecified physical model is also considered. Simulation studies on the Lorenz-96 and the Shallow Water Equation models illustrate that the proposed high dimensional EnKF algorithms perform better than the standard EnKF methods as they provide more robust and accurate assimilated results.

 

朱仲义

(复旦大学)

 

Title: Decentralized Learning of Quantile Regression: a Smoothing Approach with Two Bandwidths

Abstract: Distributed estimation has attracted a significant amount of attention recently due to its advantages in computational efficiency and data privacy preservation. In this article, we focus on quantile regression over a decentralized network. Without a coordinating central node, a decentralized network improves system stability and increases efficiency by communicating with fewer nodes per round. However, existing related works on decentralized quantile regression either have slow (sub-linear) convergence speed or rely on some restrictive modelling assumptions (e.g. homogeneity of errors). We propose a novel method for decentralized quantile regression which is built upon the smoothed quantile loss. However, we argue that the smoothed loss proposed in the existing literature using a single smoothing bandwidth parameter fails to achieve fast convergence and statistical efficiency simultaneously in the decentralized setting, which we refer to as the speed-efficiency dilemma. We propose a novel quadratic approximation of the quantile loss using a big bandwidth for the Hessian and a small bandwidth for the gradient. Our method enjoys a linear convergence rate and has optimal statistical efficiency. Numerical experiments and real data analysis are conducted to demonstrate the effectiveness of our method.

Keywords: Communication efficiency; Decentralized learning; Linear convergence; Quantile

 

 

席瑞斌

(北京大学)

 

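Title: Statistical Models and Methods for Single-Cell and Spatial Omics Data

Abstract: Revolutionary advances in omics technologies such as single-cell and spatial transcriptomics have provided powerful platforms for biomedical research, have been widely applied, and have generated large amounts of data. However, these new types of large-scale omics data also pose many challenges for statistical analysis, and new statistical methods are urgently needed for data cleaning, dimension reduction, denoising, numerical feature extraction, integrative analysis and related tasks. In this talk, I will introduce the statistical and computational methods we have developed for gene fusion detection in single-cell data and for interpretable, multimodal, high-resolution dimension reduction of spatial omics data.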

 

 

孙文光

(浙江大学)

 

Title: Integrative Conformal P-values for Out-of-distribution Testing with Labeled Outliers

Abstract: We present novel conformal inference methods for out-of-distribution testing that leverage side information from labeled outliers, which are commonly underutilized or even discarded by conventional conformal p-values. Blending inductive and transductive conformal inference strategies in a principled way, our methods are computationally efficient and can automatically take advantage of the most powerful model from a collection of one-class and binary classifiers. Then, we study how to control the false discovery rate in multiple testing with a conditional calibration strategy. Simulations with synthetic and real data show that the proposed integrative conformal p-values outperform existing methods.
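For context, the conventional inductive (split) conformal p-value that the abstract takes as its baseline can be sketched as follows. This is only the generic baseline, not the integrative method of the talk; the function name `conformal_p_value` and the toy scores are illustrative assumptions.

```python
from typing import Sequence


def conformal_p_value(calibration_scores: Sequence[float], test_score: float) -> float:
    """Conventional split-conformal p-value for out-of-distribution testing.

    Scores are nonconformity scores (larger = more outlier-like) produced by
    any fixed one-class model on held-out inlier calibration data.
    """
    n = len(calibration_scores)
    # Count calibration inliers that look at least as unusual as the test point.
    at_least_as_extreme = sum(1 for s in calibration_scores if s >= test_score)
    # The +1 terms treat the test point as one more exchangeable draw.
    return (1 + at_least_as_extreme) / (n + 1)


# Hypothetical inlier calibration scores and two test points.
calibration = [0.1, 0.4, 0.35, 0.8, 0.2, 0.5, 0.3, 0.6, 0.25]
p_typical = conformal_p_value(calibration, 0.3)   # resembles the inliers
p_extreme = conformal_p_value(calibration, 0.95)  # more extreme than every inlier
```

A small p-value is evidence that the test point is out of distribution. Note that labeled outliers play no role in this baseline, which is the gap the abstract describes.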

  

 

赵俊龙

(北京师范大学)

 

Title: Residual Importance Weighted Transfer Learning for High-dimensional Linear Regression

Abstract: Transfer learning is an emerging paradigm for leveraging multiple sources to improve the statistical inference on a single target. In this paper, we propose a novel approach named residual importance weighted transfer learning (RIW-TL) for high-dimensional linear models built on LASSO. Compared to existing methods such as Trans-Lasso, which select source data in an all-in-all-out manner, RIW-TL includes samples via importance weighting and thus may permit more effective sample use. To determine the weights, RIW-TL remarkably requires only the knowledge of one-dimensional densities dependent on residuals, thus overcoming the curse of dimensionality of having to estimate high-dimensional densities in naive importance weighting. We show that the oracle RIW-TL provides a faster rate than its competitors and develop a cross-fitting procedure to estimate this oracle. We discuss variants of RIW-TL by adopting different choices for residual weighting. The theoretical properties of RIW-TL and its variants are established and compared with those of LASSO and Trans-Lasso. Extensive simulations and a real data analysis confirm its advantages. The code is freely available at https://github.com/RIW-TL/Transfer-learning.

 

 

兰伟

(西南财经大学)

 

Title: Multivariate Reduced Rank Spatiotemporal Models

Abstract: Multivariate spatio-temporal data arise frequently in practical applications, often involving complex dependencies across cross-sectional units, time points and multivariate variables. In the literature, few studies jointly model the dependence in three dimensions. To simultaneously model the cross-sectional, dynamic and cross-variable dependence, we propose a multivariate reduced-rank spatio-temporal model. By imposing the low-rank assumption on the spatial influence matrix, the proposed model achieves substantial dimension reduction and has a nice interpretation, especially for financial data. Due to the innate endogeneity, we propose the quasi-maximum likelihood estimator (QMLE) to estimate the unknown parameters. A ridge-type ratio estimator is also developed to determine the rank of the spatial influence matrix. We establish the asymptotic distribution of the QMLE and the rank selection consistency of the ridge-type ratio estimator. The proposed methodology is further illustrated via extensive simulation studies and two applications to a stock market dataset and an air pollution dataset.

 

 

朱力行

(北京师范大学)

 

Title: Weighted Residual Empirical Processes, Martingale Transformations, and Model Specification Tests with Diverging Number of Parameters

Abstract: This paper proposes a new methodology for testing the parametric forms of the mean and variance functions based on weighted residual empirical processes and their martingale transformations in regression models. The dimensions of the parameter vectors can be divergent as the sample size goes to infinity. We study the convergence of weighted residual empirical processes and their martingale transformation under the null and alternative hypotheses in diverging dimension settings. The proposed tests based on weighted residual empirical processes can detect local alternatives distinct from the null at the fastest possible rate of order $n^{-1/2}$ but are not asymptotically distribution-free. Tests based on martingale transformed weighted residual empirical processes, by contrast, can be asymptotically distribution-free but, unexpectedly, can only detect local alternatives converging to the null at a much slower rate of order $n^{-1/4}$, which is somewhat different from existing asymptotically distribution-free tests based on martingale transformations. As the tests based on the residual empirical process are not distribution-free, we propose a smooth residual bootstrap and verify the validity of its approximation in diverging dimension settings. Simulation studies and a real data example are conducted to illustrate the effectiveness of our tests.

 

 

郭旭

(北京师范大学)

 

Title: Model-free Variable Importance Detection with Machine Learning Methods

Abstract: In this paper, we propose a new procedure to detect variable importance in a model-free framework. Flexible machine learning methods are adopted to estimate unknown functions. Under the null hypothesis, our proposed test statistic converges to a standard chi-squared distribution, while under local alternative hypotheses it converges to a non-central chi-squared distribution. It has non-trivial power against local alternative hypotheses that converge to the null at the root-n rate. We also extend our procedure to test conditional independence. Asymptotic properties are also developed. Numerical studies and a real data example are conducted to illustrate the performance of our proposed test statistic.

 

 

李赛

(中国人民大学)

 

Title: Multi-dimensional Domain Generalization with Low-rank Structures

Abstract: Conventional machine learning methods typically assume that the test data are identically distributed with the training data. However, this assumption does not always hold, posing a significant challenge for making statistical inferences about minority groups. We present a novel approach to addressing this challenge in linear regression models. We organize the model parameters for all the sub-populations into a tensor. By studying a structured tensor completion problem, we can achieve robust domain generalization. We establish rigorous theoretical guarantees for the proposed method and demonstrate its minimax optimality.


 

刘卫东

(上海交通大学)

 

Title: Online Estimation and Inference for Robust Policy Evaluation in Reinforcement Learning

Abstract: Recently, reinforcement learning has gained prominence in modern statistics, with policy evaluation being a key component. Unlike traditional machine learning literature on this topic, our work places emphasis on statistical inference for the parameter estimates computed using reinforcement learning algorithms. While most existing analyses assume random rewards to follow standard distributions, limiting their applicability, we embrace the concept of robust statistics in reinforcement learning by simultaneously addressing issues of outlier contamination and heavy-tailed rewards within a unified framework. In this paper, we develop an online robust policy evaluation procedure, and establish the limiting distribution of our estimator, based on its Bahadur representation. Furthermore, we develop a fully-online procedure to efficiently conduct statistical inference based on the asymptotic distribution. This paper bridges the gap between robust statistics and statistical inference in reinforcement learning, offering a more versatile and reliable approach to policy evaluation. Finally, we validate the efficacy of our algorithm through numerical studies conducted on real-world reinforcement learning experiments.

 

张庆昭

(厦门大学)

 

Title: Graphical Model-based Heterogeneity Analysis

Abstract: Heterogeneity is a hallmark of cancer, diabetes, cardiovascular diseases, and many other complex diseases. Recent studies have shown that incorporating interconnections among variables can lead to more informative heterogeneity structures. To this end, graphical model-based approaches have been developed. In this talk, we present some developments in graphical model-based heterogeneity analysis.

  

毛晓军

(上海交通大学)

 

Title: Decentralized Reduced Rank Regression for Response Partition

Abstract: Distributed learning in decentralized networks has been extensively studied and applied in various machine-learning scenarios. However, previous research primarily focused on data partitioning based on samples. In this paper, we address the less explored scenario of response partition, where different components of the response vector are collected and stored across multiple nodes in a multi-agent network. To mitigate the information loss resulting from response partitioning, we use the Reduced Rank Regression (RRR) model to establish connections between the response components. Subsequently, we formulate an optimization problem that involves both local and global parameters within the framework of matrix factorization, capturing both inter-node and intra-node correlations. To solve this problem efficiently, we propose an algorithm based on Decentralized Gradient Descent with Gradient Tracking (DGGT), which incorporates an additional step for local estimation. The theoretical analysis yields non-asymptotic error bounds for both estimation error and consensus error. As the number of iterations tends to infinity, the statistical error rate converges to the optimal performance achieved in the centralized case. Furthermore, we validate the effectiveness of our method through simulations and real-world applications. The numerical results not only align with our theoretical findings but also demonstrate the superiority of our approach over local reduced-rank regression methods.

 

 

荆炳义  

(南方科技大学)

 

Title: GAN-based Policy Learning for Offline Reinforcement Learning

Abstract: TBA

 

林伟

(北京大学)

 

Title: Heterogeneous Federated Learning on Arbitrary Graphs

Abstract: Federated learning has emerged as a promising paradigm for privacy-preserving distributed machine learning, where algorithms are trained across multiple decentralized devices without sharing local data. In this talk, we consider parameter estimation in federated learning with heterogeneity in data distribution and communication, and with limited computational capacity of devices. We model the distribution heterogeneity using a latent characteristic graph, in which devices are adjacent if and only if they share the same parameters. With knowledge of a surrogate for the characteristic graph, we propose to jointly estimate parameters for all devices within the $M$-estimation framework with network fusion regularization. We provide nonasymptotic statistical guarantees for our regularized estimator on an arbitrary graph, exhibiting an inherent trade-off between aggregation and heterogeneity. In particular, when a graph fidelity condition is met, our estimator is optimal as if we could aggregate all samples sharing the same distribution. We further propose an edge selection procedure via multiple testing to maximize the graph fidelity. To avoid the need for a central machine and reduce the burden of local computation, a decentralized stochastic version of the ADMM algorithm, termed FedADMM, is developed with strong convergence guarantees. We also extend FedADMM to the case where devices are randomly inaccessible during the training process. The statistical and computational efficiency of our method is evidenced by simulation experiments and an analysis of the 2020 U.S. presidential election data.

 

 

郁文

(复旦大学)

 

Title: Neural Frailty Machines for Survival Analysis

Abstract: We propose a flexible deep neural network modeling framework for semi-parametric regression analysis of survival data. The framework consists of a baseline hazard rate and the nonlinear covariate effect convoluted through a multiplicative frailty. The multiplicative frailty captures the potential (unobserved) heterogeneity among individuals. Deep neural network architectures are adopted to approximate the baseline hazard rate and the nonlinear covariate effect, leading to a class of neural frailty machines (NFM). This NFM may be viewed as an extension of the neural proportional hazard model. To train the neural network and to estimate the frailty parameter, the log-likelihood is used as the objective function and popular stochastic training methods can be applied. Non-asymptotic error bounds based on a Hellinger-type distance are derived. Asymptotic results on consistency and rate of convergence are obtained. In particular, it is shown that the optimal nonparametric rates of convergence are attained. Simulation studies are carried out to assess the finite sample performance and to compare with the theoretical findings. The proposed method is applied to six benchmark datasets and the results show superior performance over the existing state-of-the-art survival models.

  

 

张正军

(中国科学院大学)

 

Title: Towards Precision Oncology Discovery: Four Less Known Genes and Their Unknown Interactions as Highest-Performed Biomarkers for Colorectal Cancer

Abstract: The goal of this study was to use a new interpretable machine-learning framework based on max-logistic competing risk factor models to identify a parsimonious set of differentially expressed genes (DEGs) that play a pivotal role in the development of colorectal cancer (CRC). Transcriptome data from nine public datasets were analyzed, and a new Chinese cohort was collected to validate the findings. The study discovered a set of four critical DEGs - CXCL8, PSMC2, APP, and SLC20A1 - that exhibit the highest accuracy in detecting CRC in diverse populations and ethnicities. Notably, PSMC2 and CXCL8 appear to play a central role in CRC, and CXCL8 alone could potentially serve as an early-stage marker for CRC. This work represents a pioneering effort in applying the max-logistic competing risk factor model to identify critical genes for human malignancies, and the interpretability and reproducibility of the results across diverse populations suggest that the four DEGs identified can provide a comprehensive description of the transcriptomic features of CRC. The practical implications of this research include the potential for personalized risk assessment, precision diagnosis and tailored treatment plans for patients.

 

  

苗旺

(北京大学)

 

Title: Correction for Nonresponse Bias in Estimation of US Presidential Election Turnout Using Callback Data

Abstract: Overestimation of turnout in election surveys has been a longstanding problem in political science, with nonresponse or voter overrepresentation regarded as one of the primary sources of bias. For adjusting nonresponse of covariates, the census data are readily available to obtain the covariates distribution. However, nonresponse adjustment for the turnout is substantially challenging, because identification generally fails to hold in the absence of additional information. Nonetheless, in order to improve response rates, many modern large-scale surveys often continue to contact nonrespondents and record the number of calls, referred to as callback data. Based on a real ANES Non-Response Follow-Up (NRFU) survey concerning the 2020 U.S. presidential election, we investigate the role of callback data in nonresponse bias adjustment in turnout estimation. We show that under a stableness of resistance assumption, the full data distribution is identifiable by leveraging the callback data. We propose semiparametric estimators including a doubly robust one to adjust for nonignorable nonresponse bias in the NRFU study. Our estimates (around 0.666) successfully recover the true vote turnout rate (0.662, obtained after the 2020 election), while traditional estimation methods (around 0.85) show large bias. Besides, our methods successfully capture the tendency of declining to vote as response reluctance or contact difficulty increases. Our analysis results suggest a possible nonignorable missingness mechanism in this political survey concerning turnout, and reveal the potential of using callback data in adjustment for such bias.

 

 

骆威

(浙江大学)

 

Title: On Efficient Dimension Reduction with Respect to the Interaction Between Two Response Variables

Abstract: In this paper, we propose novel theory and methodologies for dimension reduction with respect to the interaction between two response variables, which is a new research problem that has wide applications in missing data analysis, causal inference, graphical models, etc. We formulate the parameters of interest to be the locally and the globally efficient dimension reduction subspaces, and justify the generality of the corresponding low-dimensional assumption. We then construct estimating equations that characterize these parameters, using which we develop a generic family of consistent, model-free, and easily implementable dimension reduction methods called the dual inverse regression methods. We also build the theory regarding the existence of the globally efficient dimension reduction subspace, and provide a handy way to check this in practice. The proposed work differs fundamentally from the literature of sufficient dimension reduction in terms of the research interest, the assumption adopted, the estimation methods, and the corresponding applications, and it potentially creates a new paradigm of dimension reduction research. Its usefulness is illustrated by simulation studies and a real data example at the end.

 

 

朱利平

(中国人民大学)

 

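Title: Nonlinear Dependence and Tests of Independence

Abstract: Statistics "begins with measurement and flourishes with correlation." Measuring dependence and testing independence are fundamental problems in statistics, and they provide an important method and a key basis for assessing predictive ability. Prediction is a core problem in statistics and machine learning. This talk reviews the various measures of linear and nonlinear dependence in statistics, as well as recent advances in independence testing.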

  

 

何煦

(中国科学院数学与系统科学研究院)

 

Title: Sequentially Refined Latin Hypercube Designs with Flexibly and Adaptively Chosen Sample Sizes

Abstract: Latin hypercube designs are the most popular type of experimental design for computer experiments. Sequentially refined Latin hypercube designs are useful for computer experiments that are carried out in batches. In this work, we propose the first type of sequentially refined Latin hypercube designs that allow the size of subsequent batches to be flexibly chosen after completing former batches. Numerical results show that our proposed designs are uniformly better than preceding types of sequentially refined Latin hypercube designs for the problem of uncertainty quantification.
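As background, a basic (non-sequential) random Latin hypercube design can be sketched as below. This illustrates only the defining property that each of the n equal-width strata in every dimension contains exactly one point; it does not implement the sequentially refined designs of the talk, and the function name `latin_hypercube` is an illustrative assumption.

```python
import random
from typing import List


def latin_hypercube(n: int, d: int, seed: int = 0) -> List[List[float]]:
    """Return n points in the d-dimensional unit cube forming a random Latin hypercube."""
    rng = random.Random(seed)
    design = [[0.0] * d for _ in range(n)]
    for j in range(d):
        # Independently permute the n strata 0..n-1 in each dimension.
        strata = list(range(n))
        rng.shuffle(strata)
        for i in range(n):
            # Place the point uniformly at random inside its assigned stratum.
            design[i][j] = (strata[i] + rng.random()) / n
    return design


points = latin_hypercube(n=8, d=3)
```

Projecting `points` onto any single coordinate gives exactly one point in each interval [k/8, (k+1)/8), which is what makes the design space-filling in every one-dimensional margin.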

 

 

王典朋

(北京理工大学)

 

Title: A Subsampling Method for Regression Problems Based on Minimum Energy Criterion

Abstract: The extraordinary amounts of data generated nowadays pose heavy demands on computational resources and time, which hinders the implementation of various statistical methods. An efficient and popular strategy of downsizing data volumes and thus alleviating these challenges is subsampling. However, the existing methods either rely on specific assumptions for the underlying models or acquire partial information from the available data. For regression problems, we propose a novel approach, termed adaptive subsampling with the minimum energy criterion (ASMEC). The proposed method requires no explicit model assumptions and smartly incorporates information on covariates and responses. ASMEC subsamples possess two desirable properties: space-fillingness and spatial adaptiveness. We investigate the limiting distribution of ASMEC subsamples and their theoretical properties under the smoothing spline regression model. The effectiveness and robustness of the ASMEC approach are also supported by a variety of synthetic examples and two real-life examples.

 

 

王启华

(中国科学院数学与系统科学研究院)

 

Title: Distributed Nonparametric Regression Imputation for Missing Response Problems with Large-scale Data

Abstract: Nonparametric regression imputation is commonly used in missing data analysis. However, it suffers from the "curse of dimensionality". The problem can be alleviated by the explosive sample size in the era of big data, while the large-scale data size presents some challenges on the storage of data and the calculation of estimators. These challenges make the classical nonparametric regression imputation methods no longer applicable. This motivates us to develop two distributed nonparametric regression imputation methods. One is based on kernel smoothing and the other is based on the sieve method. The kernel based distributed imputation method has extremely low communication cost and the sieve based distributed imputation method can accommodate more local machines. In order to illustrate the proposed imputation methods, response mean estimation is considered. Two distributed nonparametric regression imputation estimators are proposed for the response mean, which are proved to be asymptotically normal with asymptotic variances achieving the semiparametric efficiency bound. The proposed methods are evaluated through simulation studies and are illustrated by a real data analysis.

 

 

赵普映

 (云南大学)

 

Title: Sample Empirical Likelihood Inference with Complex Survey Data

Abstract: The sample empirical likelihood approach provides a powerful tool for analysis of complex survey data. We present results of sample empirical likelihood for point estimation and linear or nonlinear hypothesis tests on finite population parameters defined through just-identified or over-identified estimating equation systems with smooth or non-differentiable estimating functions under general unequal probability sampling designs. We propose a penalized sample empirical likelihood for variable selection and establish its oracle property under the design-based framework. Practical implementations of the methods are also discussed. Finite sample performances of the proposed methods for quantile regression and variable selection are examined through simulation studies.

 

 

王涛

(上海交通大学)

 

Title: Factor Augmented Inverse Regression and Its Application to Microbiome Data Analysis

Abstract: We investigate the relationship between count data that inform the relative abundance of features of a composition, and factors that influence the composition. We introduce multinomial Factor Augmented Inverse Regression (FAIR) of the count vector onto response factors as a general framework for obtaining low-dimensional summaries of the count vector that preserve information relevant to the response. By augmenting known response factors with random latent factors, FAIR extends multinomial logistic regression to account for overdispersion and general correlations among counts. The method of maximum variational likelihood and a fast variational expectation-maximization algorithm are proposed for approximate inference based on variational approximation, and the asymptotic properties of the resulting estimator are derived. The effectiveness of FAIR is illustrated through application to a microbiome data set.

  

常晋源

(西南财经大学、中国科学院数学与系统科学研究院)

 

Title: Exploring Excellence: Bayesian Penalized Empirical Likelihood and MCMC Sampling

Abstract: In this study, we introduce a novel methodological framework referred to as Bayesian penalized empirical likelihood, designed to tackle the computational challenges associated with empirical likelihood methods. Our approach pursues two primary objectives: firstly, preserving the inherent flexibility of empirical likelihood to accommodate a wide range of model conditions, and secondly, providing convenient access to well-established Markov chain Monte Carlo (MCMC) sampling schemes as alternatives to the intricate optimization steps in empirical likelihood. To achieve the first objective, we propose a penalized approach that effectively selects model conditions by regulating Lagrange multipliers, thereby reducing the dimensionality of the problem while leveraging a comprehensive set of model conditions. For the second objective, our approach overcomes the obstacles inherent in devising sampling schemes for Bayesian applications through efficient dimensionality reduction. Our Bayesian penalized empirical likelihood framework offers a highly flexible and efficient approach, enhancing the adaptability and practicality of empirical likelihood methods in statistical inference. Importantly, our study demonstrates the practical advantages of employing sampling techniques over traditional optimization methods in solving empirical likelihood problems. These techniques exhibit rapid convergence to global optima of posterior distributions, ensuring effective resolution of complex statistical estimation problems.

 

 

金百锁

(中国科学技术大学)

 

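Title: Model Estimation and Selection for Spatial Dynamic Panel Data

Abstract: Spatio-temporal data arise widely in environmental science, epidemiology, econometrics, management science and many other fields. The spatial dynamic panel data model is an effective econometric model for analyzing spatio-temporal data and has been studied extensively. Because the explanatory variables and the errors in the spatial dynamic panel data model are not independent, the model suffers from endogeneity, and how to effectively avoid the influence of endogeneity is the most basic and most central problem in estimating such models. Whether one can depart from existing estimation methods and propose a new one that meets the needs of large-scale spatio-temporal data analysis remains a challenging problem. Our main results are as follows.

(1) To avoid the influence of endogeneity, improve computational speed and handle large and complex computations, we adopt an idea similar to parallel computing, parallelizing not over space or time but over spatial projection directions. We propose a two-step least squares estimation method based on the eigendecomposition of the spatial weight matrix, establish its theoretical properties, and show through simulations and empirical studies that the method achieves high estimation accuracy and computational speed. The computation proceeds as follows:

a. Step 1 (divide): project the model data onto the different eigenvectors of the spatial weight matrix and, in each projection direction, use least squares along the time dimension to obtain the projection result for that direction.

b. Step 2 (combine): aggregate the projection results across directions and apply least squares again to obtain the final model estimate.

(2) To broaden the applicability of the method, we extend it to the case where the eigenvalues and eigenvectors of the spatial weight matrix are complex, and propose a complex two-step least squares estimation method based on the eigendecomposition of the spatial weight matrix.

(3) To make the method easier to apply, for model selection we also improve the orthogonal greedy algorithm and propose a complex orthogonal greedy algorithm. In addition, we give an estimation method for spatial dynamic panel data models with time and individual fixed effects and establish its theoretical properties.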
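The divide-and-combine idea in steps a and b can be illustrated on a toy model. The abstract does not state the model, so the sketch below assumes a deliberately simple one, y_t = rho * W y_t + gamma * y_{t-1} + eps_t, with no covariates or fixed effects and a symmetric weight matrix W (so that the eigenvectors are real). All names and parameter values are hypothetical, and this is not the authors' estimator.

```python
import numpy as np

rng = np.random.default_rng(0)
n, T = 30, 2000                      # spatial units and time periods (toy sizes)
rho_true, gamma_true = 0.3, 0.4      # assumed true parameters of the toy model

# Assumed spatial weight matrix: a symmetric ring lattice with two neighbours per unit.
W = np.zeros((n, n))
for i in range(n):
    W[i, (i - 1) % n] = 0.5
    W[i, (i + 1) % n] = 0.5

# Simulate the toy model  y_t = rho * W y_t + gamma * y_{t-1} + eps_t.
A_inv = np.linalg.inv(np.eye(n) - rho_true * W)
Y = np.zeros((T + 1, n))
for t in range(1, T + 1):
    Y[t] = A_inv @ (gamma_true * Y[t - 1] + rng.standard_normal(n))

# Eigendecomposition of the symmetric weight matrix: W = V diag(lam) V^T.
lam, V = np.linalg.eigh(W)
Z = Y @ V                            # column j is the series projected on eigenvector j

# Step 1 (divide): in each direction, least squares of z_t on z_{t-1} over time.
# Under the toy model the slope in direction j equals gamma / (1 - rho * lam_j).
slopes = np.sum(Z[1:] * Z[:-1], axis=0) / np.sum(Z[:-1] ** 2, axis=0)

# Step 2 (combine): slope_j = rho * (lam_j * slope_j) + gamma, so a second least
# squares fit of the slopes on [lam_j * slope_j, 1] recovers (rho, gamma).
X = np.column_stack([lam * slopes, np.ones(n)])
rho_hat, gamma_hat = np.linalg.lstsq(X, slopes, rcond=None)[0]
```

Each direction in step 1 involves only a scalar time series, so the n regressions are independent of one another and could be run in parallel; the projection also removes the simultaneous term W y_t from each regression, which is where the endogeneity of the toy model comes from.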

 

 

马诗洋

(上海交通大学)

 

Title: Knockoff-based Statistics for the Identification of Putative Causal Genes in Genetic Studies

Abstract: Gene-based tests are important tools for elucidating the genetic basis of complex traits. Despite substantial recent efforts in this direction, the existing tests are still limited, owing to low power and detection of false-positive signals due to the confounding effects of linkage disequilibrium. In this talk, we describe a gene-based test that attempts to address these limitations by incorporating data on long-range chromatin interactions, several recent technical advances for region-based testing, and the knockoff framework for synthetic genotype generation. Through extensive simulations and applications to multiple diseases and traits, we show that the proposed test increases the power over state-of-the-art gene-based tests and provides a narrower focus on the possible causal genes involved at a locus. We also propose a computationally efficient gene-based testing approach for biobank-scale data, and show applications to UK Biobank data with 405,296 participants for multiple binary and quantitative traits.


Participants

No.  Name  Affiliation

1  艾明要  北京大学

2  陈松蹊  北京大学

3  林伟  北京大学

4  苗旺  北京大学

5  丘竞昆  北京大学

6  王守霞  北京大学

7  席瑞斌  北京大学

8  闫晗  北京大学

9  耿直  北京工商大学

10  吴鹏  北京工商大学

11  王典朋  北京理工大学

12  郭旭  北京师范大学

13  赵俊龙  北京师范大学

14  朱力行  北京师范大学

15  李长城  大连理工大学

16  孙发省  东北师范大学

17  郑术蓉  东北师范大学

18  郁文  复旦大学

19  朱仲义  复旦大学

20  刘玉坤  华东师范大学

21  於州  华东师范大学

22  赵鹏  江苏师范大学

23  荆炳义  南方科技大学

24  孔新兵  南京审计大学

25  冯龙  南开大学

26  吴未迟  清华大学

27  刘庆丰  日本法政大学

28  林路  山东大学

29  刘卫东  上海交通大学

30  马诗洋  上海交通大学

31  毛晓军  上海交通大学

32  王涛  上海交通大学

33  邹国华  首都师范大学

34  姜丹丹  西安交通大学

35  喻达磊  西安交通大学

36  常晋源  西南财经大学、中国科学院数学与系统科学研究院

37  杜悦  西南财经大学

38  何婧  西南财经大学

39  兰伟  西南财经大学

40  林华珍  西南财经大学

41  方匡南  厦门大学

42  张庆昭  厦门大学

43  陈飞  云南财经大学

44  杨晓洁  云南财经大学

45  赵普映  云南大学

46  骆威  浙江大学

47  孙文光  浙江大学

48  张立新  浙江大学

49  金百锁  中国科学技术大学

50  王学钦  中国科学技术大学

51  张正军  中国科学院大学

52  何煦  中国科学院数学与系统科学研究院

53  李冬雨  中国科学院数学与系统科学研究院

54  刘子辰  中国科学院数学与系统科学研究院

55  潘越峰  中国科学院数学与系统科学研究院

56  盛赢  中国科学院数学与系统科学研究院

57  王启华  中国科学院数学与系统科学研究院

58  张新雨  中国科学院数学与系统科学研究院

59  黄辉  中国人民大学

60  李赛  中国人民大学

61  李扬  中国人民大学

62  朱利平  中国人民大学