Full Program

Keynote, IPS & CPS

Dec 6. 09:30-10:30.

23 WW Price Theatre. Keynote Talk 1

IN THIS SESSION

Prof Wing Kam Fung (The University of Hong Kong)

Robust Mendelian Randomization Methods Accounting for Horizontal Pleiotropy and Weak Instruments

Mendelian randomization (MR) is an instrumental variable (IV) method that estimates the causal effect of an exposure on an outcome of interest, even in the presence of unmeasured confounding. The method uses genetic variants as IVs. However, the validity of MR analysis relies on three core IV assumptions, which could be violated in the presence of horizontal pleiotropy and weak instruments. Therefore, the estimation of causal effect might be biased if horizontal pleiotropy and weak instruments are not properly accounted for. In this study, we propose a robust approach named MRCIP to account for correlated and idiosyncratic pleiotropy. Additionally, we develop a novel penalized inverse-variance weighted (pIVW) estimator, which adjusts the original IVW estimator to account for the weak IV issue. Extensive simulation studies demonstrate that the proposed methods outperform competing methods. We also illustrate the usefulness of the proposed methods using real datasets.
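
For orientation, the conventional IVW estimator that the abstract takes as its starting point can be sketched as below. This is only the standard fixed-effect IVW formula on per-variant summary statistics; it is not the pIVW or MRCIP method, whose details are not given here. The function name `ivw_estimate` and the numbers are hypothetical.

```python
import math


def ivw_estimate(beta_exposure, beta_outcome, se_outcome):
    """Conventional fixed-effect IVW estimate from per-variant summary statistics.

    beta_exposure: variant-exposure association estimates
    beta_outcome:  variant-outcome association estimates
    se_outcome:    standard errors of the variant-outcome estimates
    """
    if not (len(beta_exposure) == len(beta_outcome) == len(se_outcome)):
        raise ValueError("inputs must have the same length")
    # Each variant is weighted by the inverse variance of its outcome association.
    weights = [1.0 / (s * s) for s in se_outcome]
    numerator = sum(w * bx * by
                    for w, bx, by in zip(weights, beta_exposure, beta_outcome))
    denominator = sum(w * bx * bx for w, bx in zip(weights, beta_exposure))
    if denominator == 0.0:
        raise ValueError("all variant-exposure associations are zero")
    return numerator / denominator, math.sqrt(1.0 / denominator)


# Hypothetical summary statistics for four variants (illustration only).
bx = [0.10, 0.20, 0.15, 0.05]
by = [0.05, 0.10, 0.075, 0.025]
se = [0.01, 0.02, 0.015, 0.01]
estimate, std_error = ivw_estimate(bx, by, se)
```

The denominator shrinks when the variant-exposure associations are small, which is one way to see why weak instruments make this ratio unstable.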

Dec 6. 11:00-12:40.

14SCO Theatre 3. IPS01: Addressing challenges in Bayesian analysis for complex data

IN THIS SESSION

Organiser: Matias Quiroz

Virginia He (University of Technology Sydney), Matt P. Wand (University of Technology Sydney)

Bayesian Generalized Additive Model Selection Including a Fast Variational Option

We use Bayesian model selection paradigms, such as group least absolute shrinkage and selection operator priors, to facilitate generalized additive model selection. Our approach allows for the effects of continuous predictors to be categorized as either zero, linear or non-linear. Employment of carefully tailored auxiliary variables results in Gibbsian Markov chain Monte Carlo schemes for practical implementation of the approach. In addition, mean field variational algorithms with closed form updates are obtained. Whilst not as accurate, this fast variational option enhances scalability to very large data sets. A package in the R language aids use in practice.

Andrew Zammit-Mangion (University of Wollongong), Matthew Sainsbury-Dale (University of Wollongong), Jordan Richards (KAUST), and Raphaël Huser (KAUST)

Neural Bayes Estimators for Irregular Spatial Data using Graph Neural Networks

Neural Bayes estimators are neural networks that approximate Bayes estimators in a fast and likelihood-free manner. Neural Bayes estimators are appealing to use with spatial models and data, where estimation is often a computational bottleneck. However, neural Bayes estimation in spatial applications has, to date, been restricted to data collected over a regular grid. These estimators are currently also implicitly dependent on a prescribed set of sampling locations, which means that the neural network needs to be re-trained for new spatial locations; this renders them impractical in many applications and impedes their widespread adoption. In this work, we employ graph neural networks to tackle the important problem of spatial-model-parameter estimation from arbitrary spatial sampling locations. In addition to extending neural Bayes estimation to irregular spatial data, our architecture leads to substantial computational benefits, since the estimator can be used with any arrangement or number of locations and independent replicates, thus amortising the cost of training for a given spatial model. We also facilitate fast uncertainty quantification by training an accompanying neural Bayes estimator for the marginal posterior quantiles. We illustrate our methodology on Gaussian and max-stable processes, where the latter have an intractable likelihood function. Finally, we showcase our methodology in a global sea-surface temperature application, where we estimate the parameters of a Gaussian process model in 2,161 regions of the globe, each containing a few hundred to more than 12,000 irregularly-spaced data points, in just a few minutes with a single graphics processing unit.

Thomas Goodwin (University of Technology Sydney), Arthur Guillaumin (Queen Mary University of London), Matias Quiroz (University of Technology Sydney), Mattias Villani (Stockholm University) & Robert Kohn (UNSW Sydney)

Bayesian Inference via a Frequency Domain Approach for Latticed Random Fields

Estimation of regularly spaced stationary random fields on a lattice is computationally demanding. Likelihood evaluations for Gaussian random fields have a cost of $O(n^{2})$, where $n^{2}$ is the number of lattice points, and thus quickly become intractable for large datasets. Approximate frequency domain methods with cost $O(n \log n)$ have been proposed for parameter estimation; however, these methods suffer from severe bias, are costly due to data imputations for circulant approximations, or rely on strict assumptions about the data-generating process. We propose a method for Bayesian inference based on the debiased spatial Whittle likelihood, which is an $O(n \log n)$ cost frequency domain pseudo-likelihood that reduces bias and accounts for aliasing. This pseudo-likelihood results in a pseudo-posterior with poor posterior coverage, but we use a curvature adjustment method that provides asymptotically accurate posterior coverage. We illustrate the method on Venus topography data using a Matérn covariance kernel where the slope and scale parameters carry physical meaning pertaining to the underlying geological processes in the observed topography.

Matias Quiroz (University of Technology Sydney), Thomas Goodwin (University of Technology Sydney) & Robert Kohn (UNSW Sydney)

Dynamic Linear Regression Models for Semi Long Memory Time Series

Dynamic linear regression models forecast the values of a time series based on a linear combination of a set of exogenous time series while incorporating a time series process for the error term. This error process is often assumed to follow an autoregressive integrated moving average (ARIMA) model, or seasonal variants thereof, which are unable to capture a long-range dependency structure of the error process. A novel dynamic linear regression model that incorporates the long-range dependency feature of the errors is proposed, showing that it improves the model’s forecasting ability.

14SCO Theatre 4. IPS02: New movements in computational statistics: data-driven approaches, causality, visualisation, generative AI

IN THIS SESSION

Organiser: Yuichi Mori

Sheng-Hsuan Lin (National Yang Ming Chiao Tung University)

From Linear Structural Equation Modeling to Generalized Multiple Mediation Formula

In longitudinal studies with time-varying exposures and mediators, the mediational g-formula is an important method for the assessment of direct and indirect effects. However, current methodologies based on the mediational g-formula can deal with only one mediator. This limitation makes these methodologies inapplicable to many scenarios. Hence, we develop a novel methodology by extending the mediational g-formula to cover cases with multiple time-varying mediators. We formulate two variants of our approach that are each suited to a distinct set of assumptions and effect definitions and present nonparametric identification results of each variant. We further show how complex causal mechanisms (whose complexity derives from the presence of multiple time-varying mediators) can be untangled. A parametric method along with a user-friendly algorithm was implemented in R software. We illustrate our method by investigating the complex causal mechanism underlying the progression of chronic obstructive pulmonary disease. We found that the effects of lung function impairment mediated by dyspnea symptoms and mediated by physical activity accounted for 14.6% and 11.9% of the total effect, respectively. Our analyses thus illustrate the power of this approach, providing evidence for the mediating role of dyspnea and physical activity on the causal pathway from lung function impairment to health status.

Takafumi Kubota (Tama University), Haruko Ishikawa (Tama University), Mutsumi Suganuma (Tama University), Norikazu Yoshimine (Tama University), Sonoko Nakamura (Tama University)

Potentials for Application of Generative AI in Basic Education Subjects at Universities

This study investigates the possibility of using a generative AI such as ChatGPT (hereafter referred to as ChatGPT) in basic subjects in university education, especially mathematics and languages. We envisage ChatGPT being used by university students to work through questions by themselves, by teachers to improve their teaching methods, and as one of the teaching materials.

The research methods include a) sending mathematics and language questions to ChatGPT to answer; b) using exam results to create individual student diagnostic forms to improve students' skills; and c) prompting ChatGPT to create similar questions.

First, sending questions from the Business Mathematics Proficiency Test Level 3 textbook to ChatGPT, the results were 71% on GPT 3.5 and 100% on GPT 4 (whereas SAT Math was 74% on GPT 3.5 and 88% on GPT 4).

Sending practice test questions in the official TOEIC question booklet to ChatGPT, the results were 87% for GPT 3.5 and 98% for GPT 4 (whereas for the SAT EBRW, the results were 84% for GPT 3.5 and 89% for GPT 4). Next, new and improved prompts given to ChatGPT through the chain-of-thought method resulted in responses to almost 100% of the math and language questions.

In addition, for mathematics, an examination is given at the midpoint of each semester, and the examination results are used to give diagnostic results, including different individual strengths and weaknesses, to approximately 300 students. For future studies, as language questions, especially for TOEIC, are not divided into categories, similar questions, such as those in the textbook, are created by prompting ChatGPT. It is then being considered to use these questions to carry out individual diagnostics of students similar to those in mathematics.

Jinfang Wang (Waseda University), Shigetoshi Hosaka (Hosaka Clinic)

Health Improvement Strategies and Causal Effects Based on Health Checkup Data

In this talk, we address the challenges of personalized health management based on observational data and assess the causal effects of such health management. Health management, such as diabetes management, is a complex and individual-specific task, which requires an individual-specific combination of lifestyle modifications and medication regimens. In this study, we propose a data-driven framework to construct personalized health management plans for subjects with diabetes based on a Japanese health checkup dataset. Our approach consists of several key components:

First, we develop an optimal predictive model for estimating glucose levels using state-of-the-art machine learning models. We then identify a subset of “manageable variables” based on this model. These variables can be altered through lifestyle changes or medications. Next, we adapt and refine the concept of the actionability score (Nakamura et al., 2021) to develop a new metric, the “manageability score”, for evaluating the ease of adjusting a variable among the manageable ones. The manageability score is derived from the posterior predictive probabilities based on the modifiable variables, such as BMI, systolic pressure, and gamma-GTP, among others. Finally, we also consider the problem of evaluating the causal effects of the counterfactual health management strategies.

Yoshiro Yamamoto (Tokai University), Sanetoshi Yamada (Tokai University) & Tadashi Imanishi (Tokai University)

Availability and Visualization of the Number of People Infected with COVID-19 in the Region in Japan

New coronavirus infections spread around the world beginning in 2020. Various websites and dashboards were released to visualize the status of the spread of infection in each country.

In Japan, the number of infected persons was published in different forms for each prefecture and municipality. In each prefecture, several data aggregation and publication sites were provided, including NHK’s site, and visualization sites using these data were developed and published.

We collected data on the prefectures where the university is located, developed a visualization system that can grasp the status of the spread of infection in each municipality, and made it publicly available.

As for the application we created, we tried to share it through GitHub and shinyapps.io, but it was not released to the public due to problems with the Japanese language on shinyapps.io.

Data collection, processing, and visualization were done in R. An interactive application was realized using Shiny, a package of the statistical software R.

The data collection method by scraping and details of the interactive visualization are reported.

14SCO Theatre 5. CPS01: Stochastic Processes

IN THIS SESSION

Tsuyoshi Ishizone (Meiji University), Yasuhiro Matsunaga (Saitama University), Sotaro Fuchigami (University of Shizuoka) & Kazuyuki Nakamura (Meiji University)

Biomolecule’s Conformational Representations by Latent Time-structured Model

The three-dimensional structure of proteins involves complex dynamics such as folding/unfolding and collective motion, and the elucidation of these dynamics is closely related to advanced medicine and drug discovery. Protein structures exhibit dynamics on time scales ranging from femtoseconds (e.g., molecular vibrations) to seconds (e.g., folding), and it is important to obtain time-scale-separated representations from structural trajectory data. In particular, “slow” time-scale dynamics such as folding/unfolding are closely related to biological functions, and methods for acquiring slow representations have been actively studied. In this study, we introduce a latent time series model in which the representation series naturally changes slowly, and extract slow structural dynamics of proteins. Our method employs an encoder-decoder architecture consisting of an inference model and a generative model, in the form of a sequential variational auto-encoder (SVAE) that includes time transitions in both models. The generative model can be formulated as a state-space model to represent uncertainty in structural dynamics. We apply the approach to several biomolecular structural trajectories. In the quantitative evaluation, the continuous state is transformed into a discrete-state Markov process, and implied time scales calculated from the Markov kernel are used to show that our method extracts slower dynamics.

Wei-Ann Lin (National Cheng Kung University), Chih-Li Sung (Michigan State University), Ray-Bing Chen (National Cheng Kung University)

Category Tree Gaussian Process for Computer Experiments with Many-Category Qualitative Factors and Application to Cooling System Design

In computer experiments, Gaussian process (GP) models are commonly used for emulation. However, when both qualitative and quantitative factors are in the experiments, emulation using GP models becomes challenging. In particular, when the qualitative factors contain many categories in the experiments, existing methods in the literature become cumbersome due to the curse of dimensionality. Motivated by the computer experiments for the design of a cooling system, a new tree-based GP for emulating computer experiments with many-category qualitative factors is proposed, which we call category tree GP. The proposed method incorporates a tree structure to split the categories of the qualitative factors, and GP or mixed-input GP models are employed for modeling the simulation outputs in the leaf nodes. The splitting rule takes into account the cross-correlations between the categories of the qualitative factors, which have been shown by a recent theoretical study to be a crucial element for improving the prediction accuracy. In addition, a pruning procedure based on the cross-validation error is proposed to ensure the prediction accuracy. The application to the design of a cooling system indicates that the proposed method not only enjoys marked computational advantages and produces accurate predictions, but also provides valuable insights into the cooling system by discovering the tree structure.

Sachin Sachdeva (University of Hyderabad), Barry C Arnold (University of California, Riverside) & B G Manjunath (University of Hyderabad)

Some Power Function Distribution Processes

It is known that Proportional Reverse Hazard (PRH) processes can be derived by a marginal transformation applied to a Power Function Distribution (PFD) process. The purpose of this paper is to study PFD processes and their potential use in deriving PRH processes. Kundu [1] investigated PRH processes that can be viewed as being obtained by marginal transformations applied to a particular PFD process. We critically assess its claimed Markovian and stationary properties. In the present note, we introduce a novel PFD process that exhibits Markovian and stationary properties. We discuss the distinctive distributional features of such a process, explore inferential aspects, and provide an example of the application of the PFD process to real-life data.

Chia-Li Wang (National Dong Hwa University), Yan Su

The Double-Sided Priority Queue and its Application to the Limit Order Book

The double-sided queue was first constructed many decades ago to model taxis and customers waiting at a taxi stand. Owing to its special purpose and analytical difficulty, it has rarely been studied or applied since. Only recently has it found application in the sharing economy, in particular Uber, where the system is often composed of customers and servers that both have different preferences. A more important and intricate application has also arisen recently: the so-called limit order book in modern equity markets. Performance measures of interest in this financial model include the conditional probability of a limit order being executed given the current state of the system, the dynamics of the mid-price movement of the limit order book, and moments of the mid-price movement. To tackle the analytical difficulty, we apply the matrix-analytic method and derive approximation algorithms to calculate these quantities. We then compare their accuracy and computational efficiency with the Laplace transform method and list some computational advantages. We also give a sensitivity analysis of the expected time of the mid-price movement and provide explanations for the counterintuitive results.

Dec 6. 13:30-15:10.

14SCO Theatre 3. IPS03: Interpreting and explaining complex model

IN THIS SESSION

Organiser: Di Cook

Discussant: Jessica (Jess) Leung

Xiaoqian Wang (Monash University), Yanfei Kang (Beihang University, China) and Feng Li (Central University of Finance and Economics, China)

Another Look at Forecast Trimming for Combinations: Robustness, Accuracy and Diversity

Forecast combination is widely recognized as a preferred strategy over forecast selection due to its ability to mitigate the uncertainty associated with identifying a single “best” forecast. Nonetheless, sophisticated combinations are often empirically dominated by simple averaging, which is commonly attributed to the weight estimation error. The issue becomes more problematic when dealing with a forecast pool containing a large number of individual forecasts. In this paper, we propose a new forecast trimming algorithm to identify an optimal subset from the original forecast pool for forecast combination tasks. In contrast to existing approaches, our proposed algorithm simultaneously takes into account the robustness, accuracy and diversity issues of the forecast pool, rather than isolating each one of these issues. We also develop five forecast trimming algorithms as benchmarks, including one trimming-free algorithm and several trimming algorithms that isolate each one of the three key issues. Experimental results show that our algorithm achieves superior forecasting performance in general in terms of both point forecasts and prediction intervals. Nevertheless, we argue that diversity does not always have to be addressed in forecast trimming. Based on the results, we offer some practical guidelines on the selection of forecast trimming algorithms for a target series.

Przemyslaw Biecek (Warsaw Tech)

Shapley Lenses, How to Investigate Models to Extract Useful Information

Shapley values are today the most popular technique for explanatory model analysis (EMA) and explainable artificial intelligence (XAI). Various modifications and extensions are being developed to tailor this method to meet the challenges of a wide variety of applications.

In this talk, I will show examples in which Shapley values (and more broadly, methods used in explainable artificial intelligence) can be used to separate models with different behaviour, even if they look identical when looking at their performance. I will then outline a proposal for a process for iterative analysis of models using Shapley values. This process, inspired by Rashomon perspectives, and referred to as Shapley Lenses, allows for a more nuanced view of predictive models. The knowledge extracted from predictive models can be used to build the next iteration of more interpretable models.

Susan Vanderplas (University of Nebraska Lincoln) & Muxin Hua (University of Nebraska Lincoln)

How Do You Define a Circle? Perception and Computer Vision Diagnostics

Neural Networks are very complicated and very useful models for image recognition, but they are generally used to recognize very complex and multifaceted stimuli, like cars or people. When neural networks are used to recognize simpler objects with overlapping feature sets, things can go a bit haywire. In this talk, we’ll discuss a model built for applications in statistical forensics which uncovered some very interesting problems between model-based perception and human perception. I will show visual diagnostics which provide insight into the model, and talk about ways we might address the discrepancy between human perception and model perception to produce more accurate and useful model predictions.

14SCO Theatre 4. IPS04: Some Inference Strategies for Large Complex Data

IN THIS SESSION

Organiser: Erniel B. Barrios

Adrian Matthew Glova (UP School of Statistics)

Predictive Modelling of Mixed-Frequency Time Series with Structural Change

Time series models can predict poorly in the presence of structural change (characterized as a change in the mean, variance, autoregressive parameter or any combination of these), as data patterns that held in the past no longer hold. This is common among financial and economic variables amidst market shocks and policy regime shifts. This problem is remedied by estimating semiparametric mixed-frequency models, which include high frequency data in the conditional mean or the conditional variance equations. The high frequency data, incorporated through non-parametric smoothing functions, supplement the low frequency data to better capture non-linear relationships arising from the structural change.

Simulation studies indicate that in the presence of structural change, the varying frequency in the mean model provides improved in-sample fit and superior out-of-sample predictive ability relative to low frequency time series models. These findings hold across a broad range of simulation settings, such as differing time series lengths, changing structural break points, and varying degrees of autocorrelation. The proposed method is illustrated with time series models for stock prices and foreign exchange rates relevant to the Philippines.

Louise Adrian Castillo (UP School of Statistics) & Erniel Barrios (Monash University Malaysia)

Nonparametric Density-Based Procedure

Detecting emerging events early enough is key to the mitigation, prevention and containment of events, especially those related to disease outbreaks, natural disasters and public safety, among others. Most goodness-of-fit tests focus on the central location of the density function and fail to detect changes that usually occur at the tails of the distribution. Often, these methods are prone to false negative results. We propose two methods of discovering emerging events based on nonparametric density estimates from a data-generating process reflecting those in social media text data. The first method compares the percentile values between the baseline and speculated distributions, while the second modifies the Kolmogorov-Smirnov test to focus on the upper tail of the distribution. Simulation studies suggest that both algorithms can detect emerging events, especially in cases with a large number of data points.

Shirlee Ocampo (De La Salle University, Philippines) & Erniel Barrios (Monash University, Malaysia)

Bootstrap-based Inference for Sparse Spatio-temporal Models

Bootstrap-based inference for the parameters of a sparse spatio-temporal model is introduced within a hybrid of the backfitting algorithm and the Cochrane-Orcutt procedure. The proposed bootstrap-based method is applied to simulated sparse data with many zeroes. Philippine daily COVID-19 data at the provincial level, which are sparse with many gaps and zeroes, are used to illustrate the proposed method.

14SCO Theatre 5. CPS02: Statistical Methodology

IN THIS SESSION

Yong Wang (University of Auckland) & Xiangjie Xue (University of Auckland)

A New Approach to Null Proportion Estimation in Large-Scale Simultaneous Hypothesis Testing

We present a new approach for estimating the proportion of null effects, a vital yet difficult problem associated with large-scale simultaneous hypothesis testing. Our approach utilises naturally nonparametric mixtures and makes a novel use of a profile likelihood function. Computational methods will also be described. Numerical studies show that the new estimator has an apparently convergent trend and outperforms the existing ones in the literature in various scenarios.

Takai Keiji (Kansai University)

A Bisection Estimation Method for a Gamma Distribution and the Gamma-related Distributions

In this presentation, we present a simple method to estimate the parameters of a gamma distribution and gamma-related distributions. We give a proposition that the maximum likelihood estimate falls in an interval. At the left endpoint of the interval, the function to be solved has a positive value, and at the right endpoint, it has a negative value. In addition, the function to be solved is monotonic. It follows that the bisection method can be applied to compute the gamma parameters. This method has the merits of simplicity and accuracy. In addition, it can be applied to some gamma mixture distributions, such as the negative binomial distribution and the Pareto distribution of the 2nd kind. By reformulating these distributions as an incomplete-data problem, the estimation problems are reduced to those of the gamma distribution. We present some theoretical results, together with simulations that show the numerical performance of the method.
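
The bisection idea can be sketched as follows for the plain gamma case. This is a minimal illustration, not the speaker's implementation. It assumes the standard shape equation log(a) - digamma(a) = s with s = log(mean) - mean(log x), and it brackets the root with the well-known bounds 1/(2a) < log(a) - digamma(a) < 1/a, which may differ from the interval in the talk's proposition. With this choice the function is positive at the left endpoint, negative at the right endpoint, and decreasing in between. The helper names `digamma` and `gamma_mle_bisection` are hypothetical.

```python
import math
import random


def digamma(x):
    """Digamma function for x > 0 via recurrence plus an asymptotic series."""
    result = 0.0
    while x < 6.0:              # shift upward using psi(x) = psi(x + 1) - 1/x
        result -= 1.0 / x
        x += 1.0
    inv2 = 1.0 / (x * x)
    series = inv2 * (1.0 / 12.0 - inv2 * (1.0 / 120.0 - inv2 / 252.0))
    return result + math.log(x) - 0.5 / x - series


def gamma_mle_bisection(data, rel_tol=1e-10):
    """Maximum likelihood (shape, scale) of a gamma sample by bisection."""
    n = len(data)
    mean = sum(data) / n
    s = math.log(mean) - sum(math.log(v) for v in data) / n
    if s <= 0.0:
        raise ValueError("data must be positive and not all equal")

    def score(shape):
        # Decreasing in shape; its root is the ML estimate of the shape.
        return math.log(shape) - digamma(shape) - s

    lo, hi = 0.5 / s, 1.0 / s   # assumed bracket: score(lo) > 0 > score(hi)
    while hi - lo > rel_tol * hi:
        mid = 0.5 * (lo + hi)
        if score(mid) > 0.0:
            lo = mid
        else:
            hi = mid
    shape = 0.5 * (lo + hi)
    return shape, mean / shape  # the scale follows from mean = shape * scale


# Simulated data with shape 3 and scale 2 (illustration only).
random.seed(0)
sample = [random.gammavariate(3.0, 2.0) for _ in range(20000)]
shape_hat, scale_hat = gamma_mle_bisection(sample)
```

Because the bracket is known in advance, no starting value or derivative is needed, which is the simplicity the abstract refers to.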

Eri Kurita (Tokyo University of Science) & Takashi Seo (Tokyo University of Science)

A Modified Test Statistic for Measure of Multivariate Skewness

This talk deals with a sample measure of multivariate skewness, which is used as a test statistic in multivariate normality testing problems. Assessing multivariate normality of the data is a difficult task, and many test procedures have been proposed. Multivariate skewness and multivariate kurtosis are the third-order and fourth-order moments, respectively, which have different definitions by Mardia (1970), Srivastava (1984), and Koziol (1989). The null distributions of the test statistics using the sample measures of multivariate skewness and multivariate kurtosis are given for a large sample. Recently, a multivariate normality test using a normalizing transformation for Mardia’s kurtosis test statistic was given by Enomoto et al. (2020). Moreover, Kurita et al. (2023) gave a modification of its normalizing transformation statistic, and Kurita and Seo (2022) presented a definition of sample measure of multivariate kurtosis and its expectation and variance under the assumption of a two-step monotone missing data. In this talk, we will focus on test statistics based on multivariate skewness defined by Mardia (1970). Mardia (1970) expressed the sample measure of multivariate skewness in terms of the third-order sample moments and the components of the inverse of the sample variance-covariance matrix. Using the expression, the asymptotic distribution was given. The derivation is based on finding the variance for each decomposed individual moment for large sample and standardizing them to obtain a chi-square approximation. Based on that derivation, we propose a modified test statistic by evaluating their variances using a perturbation method. Finally, the accuracy of the chi-square approximation of the proposed test statistic is investigated by using a Monte Carlo simulation. Using these results, we can give a multivariate Jarque-Bera type test statistic, which is the combined test statistic based on multivariate skewness and multivariate kurtosis.
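
As a point of reference, the unmodified sample skewness of Mardia (1970) can be sketched as below. It averages the cubes of (x_i - mean)' S^{-1} (x_j - mean) over all pairs, where S is the sample covariance with divisor n, and uses the large-sample chi-square approximation for n times the skewness divided by 6, with p(p+1)(p+2)/6 degrees of freedom. The modified statistic proposed in the talk is not reproduced here, and the function names are hypothetical.

```python
def invert(matrix):
    """Inverse of a small square matrix by Gauss-Jordan elimination."""
    p = len(matrix)
    aug = [list(row) + [1.0 if i == j else 0.0 for j in range(p)]
           for i, row in enumerate(matrix)]
    for col in range(p):
        pivot = max(range(col, p), key=lambda r: abs(aug[r][col]))
        if abs(aug[pivot][col]) < 1e-12:
            raise ValueError("covariance matrix is singular")
        aug[col], aug[pivot] = aug[pivot], aug[col]
        pivot_value = aug[col][col]
        aug[col] = [v / pivot_value for v in aug[col]]
        for r in range(p):
            if r != col:
                factor = aug[r][col]
                aug[r] = [v - factor * w for v, w in zip(aug[r], aug[col])]
    return [row[p:] for row in aug]


def mardia_skewness(rows):
    """Return (b1p, n * b1p / 6, degrees of freedom) for an n x p data set."""
    n, p = len(rows), len(rows[0])
    means = [sum(r[k] for r in rows) / n for k in range(p)]
    centred = [[r[k] - means[k] for k in range(p)] for r in rows]
    # Sample covariance with divisor n.
    cov = [[sum(c[a] * c[b] for c in centred) / n for b in range(p)]
           for a in range(p)]
    cov_inv = invert(cov)
    # Pre-multiply each centred row by the inverse covariance.
    scaled = [[sum(cov_inv[a][b] * c[b] for b in range(p)) for a in range(p)]
              for c in centred]
    total = 0.0
    for ci in centred:
        for yj in scaled:
            total += sum(a * b for a, b in zip(ci, yj)) ** 3
    b1p = total / (n * n)
    return b1p, n * b1p / 6.0, p * (p + 1) * (p + 2) // 6


# Small checks: a symmetric bivariate sample and a univariate sample.
symmetric = [[1.0, 2.0], [-1.0, -2.0], [2.0, -1.0], [-2.0, 1.0]]
b_sym, stat_sym, df_sym = mardia_skewness(symmetric)
b_uni, stat_uni, df_uni = mardia_skewness([[0.0], [0.0], [1.0]])
```

In the univariate case the measure reduces to the squared moment coefficient of skewness, which gives a quick sanity check on any implementation.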

Hong-Ji Yang (National Cheng Kung University, Taiwan) & Chung-I Li (National Cheng Kung University, Taiwan)

A Study on Development of Constructing Control Limits Incorporating Exceedance Probability Criterion

The main focus of this paper is to improve the initial phase (Phase I) of statistical process monitoring (SPM) by integrating estimation uncertainty assessment. The proposed approach involves the utilization of control charts with the exceedance probability criterion (EPC). These control charts maintain a desired false alarm rate with a specified nominal coverage probability. The EPC criterion has recently gained attention and strong recommendations from various authors for its potential to improve control chart design. The primary focus of the study lies in the investigation of semiparametric and non-parametric approaches, which seek to integrate the EPC into control charts while delineating their individual methodologies and limitations. Particularly noteworthy is the innovative use of extreme value theory within the semiparametric approach, a novel aspect that has yet to be explored in the existing literature. Furthermore, we have put forth two non-parametric approaches that showcase better performance in contrast to the current method.

In conclusion, the findings have been summarized, highlighting the contributions made and outlining potential future avenues for research.

Dec 6. 15:40-17:20.

14SCO Theatre 3. IPS05: Recent Advances in Industrial and Applied Statistics.

IN THIS SESSION

Organiser: Chang-Yun Lin & Tsung-Jen Shen

Frederick Kin Hing Phoa (Institute of Statistical Science, Academia Sinica) & Jing-Wen Huang (National Tsing Hua University)

A Systematic Design Construction and Analysis for Cost-Efficient Order-of-Addition Experiment

In this work, we propose a systematic design construction method for cost-efficient order-of-addition (OofA) experiments, and its corresponding statistical models for analyzing experimental results. Specifically, our designs take the effects of two successive treatments into consideration. Each pair of level settings from two different factors in our design matrix appears exactly once to achieve cost-efficiency. Compared to designs in recent studies of OofA experiments, our design is capable of conducting experiments of one or more factors, so practitioners can insert a placebo, or choose different doses as level settings, when our design is used as their experimental plan. We show that an experimental analysis based on our design performs better than those based on the minimal-point design and the Bayesian D-optimal design with pairwise-order modeling in terms of identifying the optimal order.

Tsung-Jen Shen (National Chung Hsing University), Youhua Chen (Chinese Academy of Sciences), Yongbin Wu (South China Agricultural University) & Chia-Hao Chang (National Chung Hsing University)

Minimal distance to encounter the first organism of new species when conducting biodiversity sampling

Based on limited biodiversity data collected from a line transect, what is the minimal distance to find the first individual of new species that have not been discovered in the original sample? Resolving this puzzle can be of practical value in ecological studies, for example, an assessment of sampling intensity for the purpose of discovering new species. In this study, a simple estimator is developed to predict the minimal distance of finding a single new species. Numerical and empirical tests verified the high predictive power of the proposed estimator.

Chang-Yun Lin (National Chung Hsing University, Taiwan) & Kashinath Chatterjee (Augusta University)

Design Construction and Model Selection for Small Mixture-Process Variable Experiments with High-Dimensional Model Terms

This paper considers the design construction and model selection for mixture-process variable experiments where the number of variables is large. For such experiments the generalized least squares estimates cannot be obtained, and hence it is difficult to identify the important model terms. To overcome these problems, we employ the generalized Bayesian-D criterion to choose the optimal design and apply the Bayesian analysis method to select the best model. Two algorithms are developed to implement the proposed methods. A fish-patty experiment demonstrates how the Bayesian approach can be applied to a real experiment. Simulation studies show that the proposed method has high power to identify important terms and controls the type I error well.

14SCO Theatre 4. IPS06: Machine Learning in Econometric Analysis

IN THIS SESSION

Organiser: Erniel B. Barrios

Jie Ying (Ewilly) Liew (Monash University, Malaysia)

Exploring the Political-Economical Influence of the Middle-Income Trap Using Open Data

The middle-income trap refers to a prolonged slowdown of economic growth faced by middle-income countries. These countries face challenges in advancing national economic development akin to high-income countries. For these countries to escape the middle-income trap, various fiscal stimulus and policy remedies have been injected to sustain the national economic growth momentum. Yet the prospects for a high-income acceleration remain bleak.

Past studies reveal some popular views around investment-driven growth, technology-centric development, and politically mature institutions behind the middle-income trap. This study focuses on the East Asia and Pacific region, aligning with the proposition that the middle-income trap is “more politics than economics.” Strategic perceptions of geopolitical security facing this region differ between the northern and southern spheres and across contentions of the superpower politics, secessionist movements, and inter-ethnic disputes that pose threats to the national economy.

Open data was scraped from the World Bank database API to capture the country names, regions, income classification, and geographical coordinates, including economic data on gross national income (GNI) and gross domestic product (GDP). Political risk assessment data was obtained from trusted PRS Group reports to measure internal conflict risk and investment profile risk on a 12-point Likert scale. Using R, data was joined to understand the middle-income trap from different perspectives through visualizations and statistical models. Data integrity was checked, and missing data was treated. Results identify the middle-income countries trapped in the region and explain the influence of political risk factors on the national economy. Discussions are made to understand the political-economical influence of the middle-income trap in the East Asia and Pacific region.

Nazirul Hazim A Khalim (Monash University Malaysia) & Erniel B. Barrios (Monash University Malaysia)

Evaluating the Efficacy of Monetary Policy Transmission Channels in the Context of Islamic Financial Intermediation: A Quantitative Examination Utilising High-Granularity Datasets in the Malaysian Dual Banking System

Introduction
The growing significance of the ‘non-interest’ approach of Islamic banking and finance raises questions about its impact on the monetary transmission mechanism and, consequently, the effectiveness of monetary policy. This research aims to investigate the implications of Islamic banking for the monetary transmission mechanism, specifically examining how shocks in the overnight interbank money market rate affect output and inflation. Malaysia serves as the context for this study due to its distinctive dual banking system, where both conventional and Islamic banking institutions operate concurrently.

Methodology
This research employs econometric methods including Vector Autoregression (VAR), Cointegration Analysis, Error Correction Model (ECM), impulse response analyses, and Granger causality tests. It aims to reveal differences in the timing, magnitude, and direction of these impacts compared to conventional (interest-based) banking systems. First, it examines the short-term and long-term impact of monetary policy shocks on output and inflation in both sectors. Second, it analyses the interest rate pass-through mechanism in Islamic banking in Malaysia, with a particular focus on the Islamic Interbank Money Market (IBMM) rate. The comparative analysis extends to its conventional counterpart, investigating the extent to which retail bank rates respond to changes in interbank rates.

Results
Preliminary results from a sample of daily data spanning from 1997 to 2022 indicate that the conventional banking sector experiences a more immediate, though not necessarily larger, response to monetary policy shocks in terms of interest rates and credit availability. In contrast, the Islamic sector exhibits a quicker but generally smaller adjustment in its pass-through mechanism.

Significance
The findings offer important considerations for monetary policy formulation, suggesting that policy impact may vary between the conventional and Islamic banking sectors. Consequently, these results highlight the need for policymakers to consider these differences when designing and implementing monetary policy in an increasingly complex and diverse financial landscape.

How Chinh Lee (Monash University Malaysia) & Erniel B. Barrios (Monash University Malaysia)

Stochastic Frontier Models with Varying Frequencies

Stochastic Frontier Model (SFM) is one possible strategy to characterise the production efficiency of firms. The error term of the production function is decomposed into random error arising from firms’ inability to access and utilise factors of production at the frontier level and pure error. Production is at the frontier level if output feasible from given inputs is achieved within the present system. Management options, chance disruptions, and technological innovations often hinder reaching this theoretical frontier.

Even with the enactment of energy laws leading to the active integration of electric cooperatives (distributors of electricity to all consumers), the Philippines’ power generation still faces a supply-demand imbalance that periodically escalates into crises. While financial management can guarantee the operational efficiency of the cooperatives, threats from extreme weather vulnerabilities cannot be discounted. The operational efficiency of electric cooperatives is postulated to depend on extreme weather elements, such as typhoons and heavy rainfall. An SFM that incorporates random effects, assimilating sub-annual typhoon windspeed and rainfall into the efficiency equation and the (annual) production function, is postulated. This unveils the linkage between adverse weather conditions and electricity distribution efficiency. The model, however, suffers from complications arising from the varying frequencies of weather and electricity production attributes, posing challenges to the estimation of the SFM.

A hybrid estimation procedure in a backfitting algorithm framework is proposed for estimation, while hypothesis testing procedures are formulated based on the bootstrap. The model is then estimated with data on Philippine electric cooperatives from 2010-2022. The proposed model holds the potential to advance our understanding of the intricate relationship between weather dynamics and electricity distribution efficiency.

14SCO Theatre 5. CPS03: Bayesian Methodology & Analysis

IN THIS SESSION

Yoshito Tan (The University of Tokyo) & Yujie Zhang (Benesse Educational Research and Development Institute)

A Multilevel Bayesian Cognitive Diagnostic Model for Rule-Based Item Design in Educational Assessments

Cognitive diagnostic assessments (CDAs) are a specialized form of educational assessment used to make inferences as to whether students have mastered each of the target attributes (cognitive skills). Cognitive diagnostic models (CDMs) are a family of statistical models that are used in CDAs to formalize the inference-making process. The cost and time to make a CDA can be reduced by generating test items based on a rule-based item design, where one first specifies an item model (template) that determines the basic structure of test items based on a given attribute pattern (i.e., the set of attributes required to solve those items). Within the specified item model, items are generated by changing the model’s surface features (e.g., specifying values of coefficients in a linear equation). It is generally desirable that the item response probabilities for students with the same attribute mastery status are similar between items generated from the same item model. However, the surface features might cause different response probabilities of the items within each item model. In this study, we developed a novel CDM to examine the variabilities of the response probabilities of items within each item model via Bayesian multilevel modeling. We generated 78 items from our rule-based automatic item generation system. The items were generated from 42 item models to measure 9 attributes in the domain of linear equations in junior high school mathematics. We applied the proposed model to a dataset to examine the variabilities of the item response probabilities. We found that the response probabilities of most items were similar within each item model under the same attribute mastery status, although the probabilities moderately differed across some items within their item models. We discuss the practical utility of the proposed model as well as its limitations and future directions.

Kensuke Okada (The University of Tokyo), Keiichiro Hijikata (The University of Tokyo), Motonori Oka (London School of Economics and Political Science) & Kazuhiro Yamaguchi (University of Tsukuba)

Development of R Package for Variational Bayesian Estimation of Diagnostic Classification Models

Variational Bayesian (VB) inference is a method from machine learning for approximating intractable posterior distributions in large-scale Bayesian modeling. Due to its scalability and efficiency compared to traditional Markov chain Monte Carlo (MCMC) methods, VB methods have recently gained notability in statistics. Our research group has developed VB estimation methods of diagnostic classification models (DCMs)—a class of restricted latent class models for diagnosing respondents’ mastery status of a set of skills or knowledge based on item response data. In order to facilitate the application of these techniques to solve real-world problems, the present study introduces variationalDCM, an R package that provides a collection of recently developed VB estimation methods for DCMs. Currently, the package offers five functions for estimating five sub-classes of DCMs: the deterministic input, noisy “and” gate (DINA) model, deterministic input, noisy “or” gate (DINO) model, saturated DCM, multiple-choice DCM, and hidden Markov DCM. The required arguments that are common to all these functions are the item response data and the Q-matrix, which is a prespecified binary matrix that represents the mapping of items to skills. Optional arguments include the maximum number of iterations performed by the optimization algorithm and convergence tolerance. To investigate the utility of the developed framework, we comparatively analyzed existing datasets by the proposed VB functions as well as existing techniques offered by other packages for performing MCMC estimation. The results demonstrated that the proposed approach is as accurate as—and much faster than—existing methods, although posterior standard deviations are slightly underestimated due to the variational approximation. The VB estimation methods we have developed, which are summarized in this package, will be useful in analyzing large-scale psychometric datasets in order to facilitate personalized and adaptive learning.
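As a hedged illustration of how a Q-matrix enters one of the models named above, the sketch below evaluates the standard DINA item response probability: a respondent who has mastered every skill an item requires answers correctly with probability one minus the slip parameter, and otherwise with the guessing probability. This is not the variationalDCM interface; the function name `dina_success_probability` and the toy values are assumptions made for illustration only.

```python
import numpy as np

def dina_success_probability(alpha, Q, slip, guess):
    """P(correct) under the DINA model.

    alpha: (respondents x skills) binary mastery patterns
    Q:     (items x skills) binary Q-matrix
    slip, guess: per-item slip and guessing probabilities
    """
    alpha = np.asarray(alpha)
    Q = np.asarray(Q)
    # eta[i, j] = 1 when respondent i has mastered every skill required by item j
    eta = np.all(alpha[:, None, :] >= Q[None, :, :], axis=2).astype(float)
    return eta * (1.0 - np.asarray(slip)) + (1.0 - eta) * np.asarray(guess)

# Hypothetical toy example: 2 items, 2 skills, 3 respondents
Q = [[1, 0],
     [1, 1]]
alpha = [[1, 0],
         [1, 1],
         [0, 0]]
slip = [0.10, 0.20]
guess = [0.20, 0.25]
probs = dina_success_probability(alpha, Q, slip, guess)
print(probs)
```

The DINO model differs only in the ideal response: mastering at least one required skill suffices, so `np.all` would be replaced by an "any required skill" check.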

Dongming Huang (National University of Singapore), Feicheng Wang (Harvard University), Samuel Kou (Harvard University) & Donald Rubin (Harvard University)

Catalytic Priors: Using Synthetic Data to Specify Prior Distributions in Bayesian Analysis

Catalytic prior distributions provide general, easy-to-use, and interpretable specifications of prior distributions for Bayesian analysis. They are particularly beneficial when observed data are insufficient for accurately estimating a complex target model. A catalytic prior distribution stabilizes a high-dimensional “working model” by shrinking it toward a “simpler model.” The shrinkage is achieved by supplementing the observed data with a small amount of “synthetic data” generated from a predictive distribution under the simpler model. We apply the catalytic prior to generalized linear models, where we propose various strategies for the specification of a tuning parameter governing the degree of shrinkage and we examine the resulting properties. The catalytic priors have simple interpretations and are easy to formulate. In our numerical experiments and a real-world study, the performance of the inference based on the catalytic prior is either superior or comparable to that of other commonly used prior distributions.

This talk is based on joint work with Feicheng Wang, Donald Rubin, Samuel Kou.

Dec 7. 09:30-10:30.

23 WW Price Theatre. Keynote Talk 2

IN THIS SESSION

Prof Dianne Cook (Monash University)

New Tools for Visualising High-Dimensional Data Using Linear Projections

In the last few years, there have been several huge strides in new methods available for exploring high-dimensional data using “tours”, a collective term for visualisations built on linear projections. A tour consists of two key elements: the path, which generates a sequence, and the display that presents the low-dimensional projection. Numerous path algorithms are available and implemented in the tourr R package. These include the old (grand, guided, little, local, manual), and the new (slice, sage, radial). This talk will highlight these new tools and their application for contemporary challenges. Join me in exploring the fascinating world of high-dimensional data.
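The two elements of a tour described above, a path of projection bases and a display of the projected data, can be sketched roughly as follows. This is a simplified illustration and not the tourr implementation: it jumps between independent random orthonormal bases, whereas a real grand tour interpolates smoothly between them. The names `random_projection_basis` and `tour_frames` are hypothetical.

```python
import numpy as np

def random_projection_basis(n_dims, n_proj, rng):
    """Draw a random n_dims x n_proj matrix with orthonormal columns."""
    gaussian = rng.standard_normal((n_dims, n_proj))
    basis, _ = np.linalg.qr(gaussian)  # reduced QR gives orthonormal columns
    return basis

def tour_frames(data, n_frames, n_proj=2, seed=0):
    """Yield (basis, projected data) pairs: the 'path' and what a 'display' would draw."""
    rng = np.random.default_rng(seed)
    data = np.asarray(data, dtype=float)
    for _ in range(n_frames):
        basis = random_projection_basis(data.shape[1], n_proj, rng)
        yield basis, data @ basis

# Hypothetical 5-dimensional data set with 50 observations
data = np.random.default_rng(1).standard_normal((50, 5))
frames = list(tour_frames(data, n_frames=3))
```

Each frame could then be passed to any 2-D scatterplot routine; a guided tour would replace the random draws with bases chosen to optimise a projection index.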

Dec 7. 11:00-12:40.

14SCO Theatre 3. IPS07: Advances in Symbolic Data Analysis

IN THIS SESSION

Organiser: Paula Brito

Feng Chen (Chinese Academy of Sciences), Bai Huang (Central University of Finance & Economics), Yuying Sun (Chinese Academy of Sciences & University of Chinese Academy of Sciences) & Shouyang Wang (Chinese Academy of Sciences & University of Chinese Academy of Sciences)

The Interval Factor Model: Estimation and Forecasting

Due to their ability to effectively summarize information in high-dimensional data sets, factor models have been a heavily researched topic in economics and finance. The core of the available methods for factor analysis is based on single-valued data. However, sometimes the data are represented by intervals, which are difficult to analyze with traditional techniques. In this paper, we develop an econometric theory for large-dimensional factor models with interval-valued data. We propose a strategy for extending factor analysis to such data in the case where the variables' values are intervals. First, we establish the convergence rate for the factor estimates under the framework of large $N$ and large $T$. We then propose some criteria and show that the number of factors can be consistently estimated using the criteria. Finally, we show that the factors, being estimated from the high-dimensional data, can help to improve forecasts. The finite sample performance of our estimators is investigated using Monte Carlo experiments.

Ann Maharaj (Monash University), Paula Brito (University of Porto) & Paulo Teles (University of Porto)

Classification of Interval Time Series

The aim of this study is to classify interval time series (ITS) into two or more known or pre-conceived groupings. We use two specific methods to obtain the inputs for classification. For the first method, we use wavelet variances of the radius and centre of each ITS at the relevant number of scales and apply the K-nearest neighbour (K-NN) classifier with Euclidean distances as well as linear and quadratic discriminant classifiers. These radius and centre features ensure the interval variability between the upper and lower bounds is encompassed. For the second method, we use a distance matrix to compare the ITS; in particular, we use point-wise distances and autocorrelation-based distances. We apply the K-NN classifier with these respective inputs. For both methods, in all cases, we use the leave-one-out cross-validation technique to evaluate the quality of the classification performance. Simulation studies using ITS generated from space-time autoregressive models show very good classification rates. An application to sets of real time series from which ITS are constructed reveals the usefulness of this approach and its possible advantage over traditional time series classification.

Scott Sisson (UNSW Sydney), Tom Whitaker (UNSW Sydney), Boris Beranger (UNSW Sydney), Huan (Jaslene) Lin (Macquarie University)

Fitting models for large environmental datasets using histogram summaries

Fitting models to environmental datasets is becoming more challenging due to the increase in the size of the datasets. However, intuitively, we may not need to know the exact details of each data point in a database in order to be able to fit a desired model. Perhaps broad indications of e.g. the location, scale, shape and other similar summary aspects of a dataset – such as those captured by a histogram of the relevant data – could be enough information to accurately fit a model. In this work we will develop ideas from symbolic data analysis to construct likelihood-based approaches for fitting data to models where the data has been summarised into a histogram form. We will demonstrate that this allows large datasets to be fitted to models quickly and sufficiently accurately, without needing to know the details of the full dataset.

Ana Santos (Universidade do Porto), Sónia Dias (Instituto Politécnico de Viana do Castelo & LIAAD-INESC TEC), Paula Brito (Universidade do Porto & LIAAD-INESC TEC) & Paula Amaral (Universidade Nova de Lisboa & CMA)

Classifying Distributional Data in More than Two Groups

This work addresses multiclass classification of distributional data (see Brito & Dias, 2022).

The proposed method relies on defining discriminant functions as linear combinations of variables whose observations are empirical distributions, represented by the corresponding quantile functions (Dias et al., 2021). A discriminant function allows obtaining a score (quantile function) for each unit. The Mallows distance between this score and the score obtained for the barycentric histogram of each group is determined, and the unit is then assigned to the group for which the distance is minimum.

In the presence of more than two a priori classes, two approaches are considered. The first one consists in dividing the multiclass classification problem into several binary classification sub-problems. In this case, two well-known multiclass classification techniques may be applied: One-Versus-One (OVO) and One-Versus-All (OVA).

The alternative approach, named Consecutive Linear Discriminant Functions, consecutively defines linear discriminant variables that separate the groups at best, under the condition that each new variable is uncorrelated with all previous ones. This leads to several score histogram-valued variables with null symbolic linear correlation coefficient. Classification is then based on a suitable combination of the corresponding obtained scores, using the Mallows distance.

This method is applied to discrimination of network data, described by the distributions, across the network nodes, of four centrality measures. The goal is to identify the network model used to generate the networks.
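The minimum-distance assignment rule described in this abstract can be illustrated with a small sketch. It assumes the usual $L_2$ form of the Mallows distance between two distributions, $d^2 = \int_0^1 (Q_1(t) - Q_2(t))^2\,dt$, approximated on a grid of quantile levels. Raw samples stand in for the score quantile functions and group barycentres of the actual method; the names `mallows_distance` and `assign_group` and the toy data are assumptions.

```python
import numpy as np

def mallows_distance(sample_a, sample_b, grid_size=1000):
    """Approximate L2 Mallows distance between two empirical distributions."""
    t = (np.arange(grid_size) + 0.5) / grid_size  # midpoints of (0, 1)
    qa = np.quantile(sample_a, t)                  # empirical quantile functions
    qb = np.quantile(sample_b, t)
    return float(np.sqrt(np.mean((qa - qb) ** 2)))

def assign_group(unit_sample, group_samples):
    """Assign a unit to the group whose reference distribution is closest."""
    distances = {name: mallows_distance(unit_sample, ref)
                 for name, ref in group_samples.items()}
    return min(distances, key=distances.get)

# Hypothetical example: two group references that differ by a location shift
rng = np.random.default_rng(0)
base = rng.normal(0.0, 1.0, size=500)
groups = {"low": base, "high": base + 3.0}
unit = base + 0.5
label = assign_group(unit, groups)
print(label, mallows_distance(unit, groups["low"]))
```

Because a pure location shift moves every quantile by the same amount, the distance between `unit` and the "low" reference equals the size of the shift.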

14SCO Theatre 4. IPS08: Novel Statistical Methodologies: Applications in Genomics, Randomized Controlled Trials, and Time Series

IN THIS SESSION

Organiser: Dongming Huang

Kristen Hunter (UNSW Sydney), Luke Miratrix (Harvard Graduate School of Education) & Kristin Porter (K.E. Porter Consulting LLC & UNSW Sydney)

PUMP: Estimating power when adjusting for multiple outcomes in multi-level experiments

For randomized controlled trials (RCTs) with a single intervention’s impact being measured on multiple outcomes, researchers often apply a multiple testing procedure (such as Bonferroni or Benjamini-Hochberg) to adjust p-values. Such an adjustment reduces the likelihood of spurious findings, but also changes the statistical power, sometimes substantially. A reduction in power means a reduction in the probability of detecting effects when they do exist. This consideration is frequently ignored in typical power analyses, as existing tools do not easily accommodate the use of multiple testing procedures. We introduce the PUMP (Power Under Multiplicity Project) R package as a tool for analysts to estimate statistical power, minimum detectable effect size, and sample size requirements for multi-level RCTs with multiple outcomes. PUMP uses a simulation-based approach to flexibly estimate power for a wide variety of experimental designs, number of outcomes, multiple testing procedures, and other user choices. By assuming linear mixed effects models, we can draw directly from the joint distribution of test statistics across outcomes and thus estimate power via simulation. One of PUMP’s main innovations is accommodating multiple outcomes, which are accounted for in two ways. First, power estimates from PUMP properly account for the adjustment in p-values from applying a multiple testing procedure. Second, when considering multiple outcomes rather than a single outcome, different definitions of statistical power emerge. PUMP allows researchers to consider a variety of definitions of power in order to choose the most appropriate types of power for the goals of their study. The package supports a variety of commonly used frequentist multi-level RCT designs and linear mixed effects models. 
In addition to the main functionality of estimating power, minimum detectable effect size, and sample size requirements, the package allows the user to easily explore sensitivity of these quantities to changes in underlying assumptions.
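To make concrete why a multiple testing procedure changes power, the sketch below applies the two adjustments named in the abstract, Bonferroni and Benjamini-Hochberg, to a small set of raw p-values. It is a plain NumPy illustration and not part of the PUMP package; the function names and the example p-values are assumptions.

```python
import numpy as np

def bonferroni(pvalues):
    """Bonferroni-adjusted p-values: multiply by the number of tests, cap at 1."""
    p = np.asarray(pvalues, dtype=float)
    return np.minimum(p * p.size, 1.0)

def benjamini_hochberg(pvalues):
    """Benjamini-Hochberg adjusted p-values (step-up FDR procedure)."""
    p = np.asarray(pvalues, dtype=float)
    m = p.size
    order = np.argsort(p)
    scaled = p[order] * m / np.arange(1, m + 1)
    # Enforce monotonicity, working down from the largest p-value
    adjusted_sorted = np.minimum.accumulate(scaled[::-1])[::-1]
    adjusted = np.empty(m)
    adjusted[order] = np.minimum(adjusted_sorted, 1.0)
    return adjusted

# Hypothetical raw p-values for four outcomes
raw = [0.01, 0.04, 0.03, 0.20]
bonf = bonferroni(raw)
bh = benjamini_hochberg(raw)
print(bonf)
print(bh)
```

At the 0.05 level, three of the raw p-values are significant, but only one survives either adjustment here, which is the loss of power that a power analysis ignoring multiplicity would miss.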

Weihao Li (National University of Singapore) & Dongming Huang (National University of Singapore)

Bayesian inference using Catalytic prior distributions for Cox regression models

The catalytic prior is a general class of prior distributions that can be applied to arbitrary parametric models. This class provides stable posterior estimation, especially when the sample size is small relative to the dimensionality of the model and classical likelihood-based inference is problematic. A similar challenge emerges when the sample size is too small for a Cox proportional hazards model (Cox model) in survival analysis to yield stable maximum partial likelihood estimation of the effects parameters. In this work, we propose an extension of the catalytic prior to stabilize the Bayesian inference for a Cox model. Our catalytic prior is formulated as the likelihood of the Cox model with a known baseline hazards function, supplemented with synthetic data. This baseline hazards function can be provided by the user or estimated from the data, and the synthetic data is generated from a simple model. For point estimation, we approximate the posterior mode via a penalized log partial likelihood estimator. In our simulation study, the proposed method outperforms classical maximum partial likelihood estimation and is comparable to existing shrinkage methods. We also illustrate its application in several real datasets.

Xiaotong Lin (National University of Singapore) & Dongming Huang (National University of Singapore)

Goodness-of-fit Test and Conditional Randomization Test for Gaussian Graphical Models

We propose a novel method for constructing goodness-of-fit (GoF) tests with the null hypothesis that observed data follows a Gaussian graphical model (GGM) with respect to a given graph $G$. This method is statistically valid for both low and high-dimensional data. It also enjoys flexibility in the choice of test statistic, which may enhance the power by incorporating prior information. We conduct extensive numerical experiments to examine the resulting power with various choices of test statistics. We show that when the null hypothesis is not true and the graph has a certain property, the proposed method is more powerful than traditional methods based on the Chi-square test, F test, and generalized likelihood ratio test. We demonstrate the usefulness of our GoF test on real-world data sets.

Kin Wai Chan (The Chinese University of Hong Kong) & Liu Xu Chan (The Chinese University of Hong Kong)

No-lose Converging Kernel Estimation of Long-run Variance

Kernel estimators have been popular for decades in long-run variance estimation. To minimize the loss of efficiency measured by the mean-squared error in important aspects of kernel estimation, we propose a novel class of converging kernel estimators that have the “no-lose” properties including: (1) no efficiency loss from estimating the bandwidth as the optimal choice is universal; (2) no efficiency loss from ensuring positive-definiteness using a principle-driven aggregation technique; and (3) no efficiency loss asymptotically from potentially misspecified prewhitening models and transformations of the time series. A shrinkage prewhitening transformation is proposed for more robust finite-sample performance. The estimator has a positive bias that diminishes with the sample size so that it is more conservative compared with the typically negatively biased classical estimators. The proposal improves upon all standard kernel functions and can be well generalized to the multivariate case. We discuss its performance through simulation results and two real-data applications including the forecast breakdown test and MCMC convergence diagnostics.
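For readers unfamiliar with the classical baseline that this talk improves upon, a kernel long-run variance estimator is a weighted sum of sample autocovariances. The sketch below uses the standard Bartlett kernel with a user-chosen bandwidth; it is not the converging kernel estimator proposed in the abstract, and the function name `bartlett_lrv` and the simulated AR(1) series are assumptions.

```python
import numpy as np

def bartlett_lrv(x, bandwidth):
    """Classical Bartlett-kernel estimate of the long-run variance of a series."""
    x = np.asarray(x, dtype=float)
    n = x.size
    xc = x - x.mean()
    lrv = xc @ xc / n  # lag-0 sample autocovariance
    for k in range(1, min(int(bandwidth), n - 1) + 1):
        weight = 1.0 - k / (bandwidth + 1.0)   # Bartlett (triangular) weight
        gamma_k = xc[k:] @ xc[:-k] / n         # lag-k sample autocovariance
        lrv += 2.0 * weight * gamma_k
    return lrv

# Hypothetical AR(1) series with coefficient 0.5 and unit innovation variance
rng = np.random.default_rng(0)
noise = rng.standard_normal(2000)
series = np.empty_like(noise)
series[0] = noise[0]
for t in range(1, noise.size):
    series[t] = 0.5 * series[t - 1] + noise[t]

print(np.var(series), bartlett_lrv(series, bandwidth=10))
```

With positive autocorrelation the long-run variance exceeds the marginal variance, and the truncated, down-weighted sum tends to fall short of the true value, which is the negative bias of classical estimators mentioned in the abstract.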

14SCO Theatre 5. CPS04: Model Selection & Dimension Reduction

IN THIS SESSION

Takayuki Yamada (Kyoto Women’s University), Sakurai Tetsuro (Suwa University of Science) & Fujikoshi Yasunori (Hiroshima University)

Knock-one-out method for selecting nonzero partial correlations

In this study, we propose a knock-one-out (KOO) method based on a general information criterion, which includes AIC and BIC, for selecting nonzero partial correlations. In general, using such an information criterion for variable selection requires computing the statistic for every element of the power set of all combinations of variables. In contrast, the KOO method reduces the amount of computation needed for model selection, since it suffices to compute the statistics for all combinations of variables. We show consistency, in the sense that the selected model coincides with the true model, under the high-dimensional asymptotic framework in which the dimensionality p and the sample size n go to infinity together such that the ratio p/n converges to a positive constant less than 1. Simulation results reveal that the proposed KOO method chooses the true model with high probability.

Tien-En Chang (National Taiwan University) & Argon Chen (National Taiwan University)

Using Relative Weight for Variable Importance Assessment - How Does It Work and Could It Work Better?

Relative importance analysis quantifies the proportionate contribution of each predictor to the $R^2$ in a regression model, encompassing both the individual and combined effects. While originally introduced for explanatory purposes, this analysis has found utility in solving variable selection problems. However, assessing the relative importance of each predictor remains intricate due to the possible presence of high intercorrelations among them.

Among various approaches, the general dominance index (GD) stands out as one of the most plausible methods. It evaluates a predictor’s importance by averaging its contribution across all possible combinatorial sub-models, providing a comprehensive perspective. Nevertheless, it becomes computationally intractable as the number of predictors increases. Therefore, exploring alternative procedures that yield comparable outcomes with less computational complexity becomes valuable. J. W. Johnson’s relative weight (RW) emerges as the frontrunner in this endeavor by introducing uncorrelated intermediate predictors derived from the orthogonalization of the original predictors. The contributions of these orthogonal predictors towards explaining the variance in the response variable are first calculated and then distributed back to the original predictors to estimate the relative importance. This method is shown to yield results akin to the GD but has faced criticism regarding its underlying principles, particularly the contribution distributing process of the intermediate predictors.

This paper argues that the key to the RW’s success is the orthogonal predictors themselves rather than the distributing process of contributions. Supported by both theoretical and empirical observations, we show that the orthogonalization process to obtain the intermediate predictors significantly outweighs the contribution distributing process in accurately estimating the relative importance resembling the GD. We delve into the theoretical rationale behind the proximity between RW and GD, particularly when R. M. Johnson’s transformation is employed. Moreover, we demonstrate the possibility of identifying improved orthogonalization methods that lead to even stronger alignment with the GD method.
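The relative weight procedure summarised in this abstract (orthogonalise the predictors, regress the response on the orthogonal predictors, then distribute the contributions back) can be sketched as below. This follows the commonly cited form of J. W. Johnson's relative weights using the symmetric square root of the predictor correlation matrix; it is an illustrative sketch, not the authors' code, and the function name `relative_weights` and the simulated data are assumptions.

```python
import numpy as np

def relative_weights(X, y):
    """Johnson-style relative weights; they sum to the regression R^2."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y, dtype=float)
    Xs = (X - X.mean(axis=0)) / X.std(axis=0)
    ys = (y - y.mean()) / y.std()
    n = ys.size
    Rxx = Xs.T @ Xs / n          # predictor correlation matrix
    rxy = Xs.T @ ys / n          # predictor-response correlations
    evals, evecs = np.linalg.eigh(Rxx)
    # Symmetric square root of Rxx: correlations between the original
    # predictors and their closest orthogonal counterparts
    lam = evecs @ np.diag(np.sqrt(evals)) @ evecs.T
    beta = np.linalg.solve(lam, rxy)   # coefficients of y on the orthogonal predictors
    # Distribute each orthogonal predictor's contribution back to the originals
    return (lam ** 2) @ (beta ** 2)

# Hypothetical correlated predictors
rng = np.random.default_rng(0)
z = rng.standard_normal((200, 3))
X = z @ np.array([[1.0, 0.6, 0.3],
                  [0.0, 1.0, 0.5],
                  [0.0, 0.0, 1.0]])
y = X @ np.array([1.0, 0.5, 0.0]) + rng.standard_normal(200)
weights = relative_weights(X, y)
print(weights, weights.sum())
```

The cost is one eigendecomposition of a p-by-p matrix, whereas the general dominance index requires fitting all $2^p$ sub-models, which is the computational contrast the abstract draws.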

Masaaki Okabe (Doshisha University) & Hiroshi Yadohisa (Doshisha University)

Supervised dimensionality reduction method using ordinal compatibility for RNA-seq data

Dimensionality reduction methods are widely used to analyze gene expression data that represent the states of cells during processes such as cell development and differentiation. These methods are also used for data visualization. One such method, $t$-distributed Stochastic Neighbor Embedding ($t$-SNE), is often used for dimensionality reduction in the analysis of biological data as it embeds high-dimensional data onto a lower-dimensional space by focusing on local similarities. However, $t$-SNE has limitations in capturing global structures, which can result in data from the same cluster being embedded far apart. In this study, we propose a method to prevent data from the same cluster in $t$-SNE from being embedded too far apart. Our approach employs the concept of ordinal distance compatibility to impose constraints on external ordinal label information for mapping in a low-dimensional space. This method can be used for dimensionality reduction in applications such as trajectory inference to uncover biological mechanisms using single-cell RNA sequence data.

Junnosuke Yoshikawa (Doshisha University), Masaaki Okabe (Doshisha University), Hiroshi Yadohisa (Doshisha University)

Supervised Non-Negative Matrix Factorization Considering Local Structure for Multi-View Data

Multi-view data are comprehensive data obtained from various information sources about the same object. Such data are prevalent in diverse fields including image classification, audio processing, and data mining. They generally have the advantage of containing complementary information from different perspectives. One approach for dimensionality reduction in multi-view data is multi-view non-negative matrix factorization (NMF). However, existing multi-view NMF methods are unsupervised and do not effectively utilize class labels even when available. Furthermore, if the data can be potentially represented on a latent manifold structure, dimensionality reduction that does not consider its local structure may result in the loss of this valuable information. Original NMF techniques, which are not designed for multi-view data, have been extended to methods like constrained NMF (CNMF) and graph regularized NMF (GNMF) to address these issues. However, these methods are not suitable for multi-view data and do not simultaneously address both local structure and label information. In a situation where local structure and class label information are accounted for simultaneously, training on a single basis may not capture both types of information effectively. In this study, we propose an NMF approach that can simultaneously learn local structure and class label information for multi-view data. This method is based on three main ideas. First, to consider the local structure, we introduced a graph structure based on GNMF as a constraint. Second, to utilize class label information, we incorporated class labels into the objective function based on CNMF. Third, to simultaneously consider these two situations, we decomposed the basis image into two terms: one representing the local structure and the other representing class information. Through this method, it is possible to reduce dimensions while preserving the latent structure of multi-view data.

Dec 7. 13:30-15:10

14SCO Theatre 3. IPS09: Vast and aggregative data analysis from theoretical to practical approaches Part I

IN THIS SESSION

Organiser: Hiroyuki Minami

Fumio Ishioka (Okayama University), Yusuke Takemura (Kyoto Women’s University) & Koji Kurihara (Kyoto Women’s University)

Covid-19 Infection Trends and Visualization in Tokyo, Japan: Insights from Space-time Hotspot Clusters

The novel coronavirus infection (COVID-19), first reported in Wuhan, China, in December 2019, triggered an unprecedented global catastrophe. The current global situation surrounding COVID-19 is approaching a turning point, underscoring the increasing significance of verifying and analyzing various relevant data from a statistical perspective. In this study, we aim to identify and visualize hotspot clusters for each wave of data concerning the number of COVID-19 positive cases in Tokyo. In recent years, in fields such as spatial epidemiology, the identification and visualization of hotspot clusters using spatial scan statistics (Kulldorff, 1997) to discuss their characteristics have been established as a significant method. We will utilize the spatial scan statistics based on echelon analysis (Myers, et al., 1997; Kurihara et al., 2020; Takemura et al., 2022), which enables the detection of hotspot clusters with flexible shape for large-scale data. This approach helps to clarify ‘which areas’ and ‘since when’ the COVID-19 hotspot clusters in Tokyo have existed. Furthermore, we will discuss the differences and features of each detected cluster in every wave.

~~Tajima Yusuke (Shiga University Data Science and AI Innovation Research Promotion Center)~~ (Withdrawn)

Time series analysis of pressure sensor vibration values during sleep and estimation of health condition

Sleep is a very important activity that occupies one-third of the day. However, it is difficult to assess one’s own sleep without using a sensor such as a camera. Therefore, this study aims to analyze sleep using the vibration values of a pressure sensor. Specifically, heart rate and activity level are obtained from the vibration values by frequency analysis. Using these as inputs, the sleep state is output by a state space model. The sleep state represents a sleep stage. Reference sleep stages were derived from the analysis of electroencephalogram data by specialists; they are used only as correct labels for checking the model output and are not used as inputs to the model. The electroencephalogram data were measured by wearing an electroencephalograph at the same time as measuring vibration values during sleep. In the presentation, we focus on the estimation of sleep stages, sleep rhythm, amount of wakefulness, etc. from three days of measured data for three healthy subjects without insomnia or similar conditions, and evaluate their health status.

Yusuke Takemura (Kyoto Women’s University), Fumio Ishioka (Okayama University) & Koji Kurihara (Kyoto Women’s University)

A New Method for Evaluating Reliability of Spatial Clusters Using Echelon Analysis

The detection of spatial clusters has been used in various fields, such as epidemiology. For example, it is important to identify regions at high risk of disease mortality from a public health perspective. Various methods have been proposed to detect spatial clusters. In particular, the circular scan method (Kulldorff, 1997) is widely used as a detection method based on spatial scan statistics (Kulldorff, 1997). In this study, we focus on the echelon scan method (Ishioka et al., 2019), which is based on the same statistics as the circular scan method. Because this method uses the topological structure of spatial data to detect clusters, it can detect clusters with flexible shapes that are not detectable by the circular scan method. However, the echelon scan method is more susceptible to variations in observed data than the circular scan method. Therefore, it is important to evaluate how reliable the detected clusters are. We will utilize echelon analysis (Myers, et al., 1997; Kurihara et al., 2020; Kurihara and Ishioka, 2022) to evaluate whether each region belongs to spatial clusters. This approach will make it possible to detect spatial clusters that are closer to true clusters.

Yoshikazu Terada (Osaka University/RIKEN) & Hidetoshi Matsui (Shiga University)

Dynamic prediction for variable-domain functional data

In classical FDA, all observed curves are defined on the same domain. However, for example, when we consider the atmospheric pressure variations from the genesis to the dissipation of certain typhoons, each observed curve is defined on a different domain. Such functional data with varying domain lengths for functional observation is called variable-domain functional data.

In this talk, we propose a dynamic prediction method for the variable-domain functional data. More precisely, when we partially observe a new trajectory from the starting time to a particular middle time point, the proposed method predicts the future trajectory based on observed variable-domain curves. Through numerical experiments and real data examples, we will show the performance of the proposed method.

14SCO Theatre 4. IPS10: Survival Regression Models and Computations

IN THIS SESSION

Organiser: Jun Ma

Aishwarya Bhaskaran (Macquarie University), Benoit Liquet-Weiland (Macquarie University), Jun Ma (Macquarie University), Serigne Lo (Melanoma Institute of Australia), Stephane Heritier (Monash University)

Accelerated Failure Time Models under Partly Interval Censoring and Time-Varying Covariates

Accelerated failure time (AFT) models are frequently used for modelling survival data. It is an appealing approach as it asserts a direct relationship between the time to event and covariates, wherein the failure times are either accelerated or decelerated by a multiplicative factor in the presence of these covariates. Several methods exist in the current literature for fitting semiparametric AFT models with time-fixed covariates. However, most of these methods do not easily extend to settings involving both time-varying covariates and partly interval censored data. We propose a maximum penalised likelihood approach with constrained optimisation to fit a semiparametric AFT model with both time-fixed and time-varying covariates, for survival data with partly interval censored failure times.
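For orientation, the multiplicative effect described above is usually written as follows for time-fixed covariates. This is standard textbook notation rather than the formulation used in the talk, and the symbols $T$, $\mathbf{x}$, $\boldsymbol{\beta}$, $\varepsilon$ and $S_0$ are assumptions introduced here.

$$
\log T = \mathbf{x}^\top \boldsymbol{\beta} + \varepsilon
\quad\Longleftrightarrow\quad
S(t \mid \mathbf{x}) = S_0\!\left(t\, e^{-\mathbf{x}^\top \boldsymbol{\beta}}\right),
$$

where $S_0$ is the survival function of the baseline time $T_0 = e^{\varepsilon}$. Since $T = e^{\mathbf{x}^\top \boldsymbol{\beta}}\, T_0$, the covariates stretch or shrink the time scale by the factor $e^{\mathbf{x}^\top \boldsymbol{\beta}}$. In a semiparametric AFT model, the distribution of $\varepsilon$ (equivalently $S_0$) is left unspecified.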

Serigne Lo (Melanoma Institute of Australia), Jun Ma (Macquarie University), Maurizio Manuguerra (Macquarie University)

Competing Risks Cause-specific Analysis in Presence of Unknown Cause of Failure: Application to Melanoma Patients Registry

Competing risk models are popular tools to analyse time-to-event data where each individual faces multiple failure types. In health research, for instance, competing risk models are used to assess the burden and etiology attributable to a specific disease. However, a complexity in applying the method arises when some failure types are missing. Some ad-hoc approaches exist, such as imputation (e.g. missing failure types can be re-coded to one of the known failure types, mostly the failure type of interest, as a conservative approach), or removing or censoring individuals with missing failure types before running a standard analysis. Other authors have proposed more sophisticated approaches with considerable gains in terms of bias and precision compared with ad-hoc approaches. However, all those existing methods are based on partial likelihood functions or fully parametric regression models, and provide an estimate of the cause-specific hazards. Assuming unknown failure types are missing at random, we develop novel constrained maximum penalised likelihood estimates for semi-parametric proportional hazard models for both the cause-specific and sub-distribution hazards. Penalty functions are used to smooth the baseline hazards. The appealing feature of our approach is that all relevant estimates in the competing risk setting are provided, including cause-specific hazards and cumulative incidence functions. Asymptotic results for these estimates will also be developed.

The superiority of our methods to alternative approaches is demonstrated by simulation. We illustrate the use of our method in an analysis of a melanoma dataset from an institutional research database, where the cause of death can be missing, especially for patients whose death information is not obtained through a medical institution.

Annabel Webb (Macquarie University) & Jun Ma (Macquarie University)

A maximum penalised likelihood approach for Cox models with time-varying covariates and partly-interval censored survival data

Time-varying covariates can be important predictors when analysing time-to-event data. A Cox model that includes time-varying covariates is usually referred to as an extended Cox model, and when only right censoring is present, the conventional partial likelihood method is applicable to estimate the regression coefficients of this model. However, model estimation becomes more complex in the presence of partly-interval censoring. This talk will detail the fitting of a Cox model using a maximum penalised likelihood method for partly-interval censored survival data with time-varying covariates. The baseline hazard function is approximated using spline functions, and a penalty function is used to regularise this approximation. Asymptotic variance estimates of regression coefficients and survival quantities are computed. The method will be illustrated in an application to breast cancer recurrence.

Jun Ma (Macquarie University)

Penalized Likelihood Estimation of Stratified Semi-Parametric Cox Models under Partly Interval Censoring

This talk focuses on the stratified Cox model under partly interval-censored survival times. We present a penalized likelihood method for estimating the model parameters, including the baseline hazards. Penalty functions are used to produce smoothed baseline hazards estimates, and also to relax the requirement on optimal number and location of the knots used in the baseline hazards estimates. We also derive a large sample normality result for the estimates, which can be used to make inferences on quantities of interest, such as survival probabilities, without relying on computing-intensive resampling methods.

14SCO Theatre 5. CPS05: Survey Methodology & Missing Data

IN THIS SESSION

Shakeel Ahmed (National University of Sciences and Technology)

On Small Area Estimation Strategies using Data from Successive Surveys

When information on similar characteristics is available from two or more surveys conducted on the same population, and the estimates are likely to be stable over the period, the surveys can be combined to obtain reliable estimates at a more granular level. In this article, we suggest four different strategies for obtaining small area estimates by combining the data from two successive surveys under direct, synthetic, and composite methods. The performance of the mean estimators under the proposed strategies is evaluated through a bootstrap study using the demographic health surveys conducted by the National Institute of Population Studies Pakistan in the years 2017-18 and 2019. Strategy 2 (S2) and Strategy 3 (S3) outperform the other strategies considered in this study, both in terms of mean squared error (MSE) and percentage contribution of bias (PCoB) in MSE. The suggested strategies are used to obtain estimates of the parameters (totals or means) of reproductive health characteristics in different geographical units of Pakistan. An R package has been developed to obtain the estimated sample sizes and the estimates of the mean, along with their root mean square errors and 95% confidence intervals, using the suggested strategies.

Hiroko Katayama (Okayama University of Science)

Item Selection for Categorical Data - Consideration of Item Selection

Categorical data is widely used in various surveys, yet concerns arise regarding the analytical methods and the potential burden of excessive item numbers on respondents. Furthermore, there are situations where researchers aim to glean insights into respondents’ abilities and perceptions using categorical data. Thus, the exploration of an approach utilizing Item Response Theory (IRT) is considered. IRT, rooted in test theory, enables the estimation of item discrimination based on response probabilities. It also evaluates difficulty levels and respondent abilities. We are contemplating the application of this theory to analyze categorical data. To address the issue of excessive item numbers, we propose an approach that leverages latent trait values estimated through IRT. This approach involves selecting items where the discrepancy between latent trait values obtained from all items and those obtained after item reduction is minimized. Additionally, we are considering an approach guided by the evaluation of model fit using AIC and BIC as criteria for item selection. We present these methodologies as means to select items. Applying these approaches to item selection not only retains crucial items for future surveys but also eliminates unnecessary ones, potentially serving as valuable resources for future research.

Baoluo Sun (National University of Singapore), Wang Miao (Peking University) & Deshanee Senevirathne Wickramarachchi (National University of Singapore)

Adjusting for Nonignorable Missing Data Using Instrumental Variables

Nonignorable missing data presents a major challenge to health and social science research, as features of the full data law are inextricably entwined with features of the nonresponse mechanism. Therefore, valid inference is compromised even after adjusting for fully observed variables. We develop novel semiparametric estimation methods with nonignorable missing data by leveraging instrumental variables that are associated with the nonresponse mechanism, but independent of the outcome of interest. The proposed estimators remain consistent and asymptotically normal even under partial model misspecifications. We investigate the finite-sample properties of the estimator via a simulation study and apply it in adjusting for missing data induced by HIV testing refusal in the evaluation of HIV seroprevalence in Mochudi, Botswana, using interviewer experience as an instrumental variable.

Yongshi Deng (University of Auckland) & Thomas Lumley (University of Auckland)

Multiple Imputation Through Autoencoders

Standard implementations of multiple imputation have limitations in handling missing data in large datasets with complex data structures. Achieving satisfactory imputation performance often depends on properly specifying the imputation model to account for interactions among variables. Therefore, imputing a large dataset can be daunting, particularly when there is a large number of incomplete variables. In this talk, we will discuss the potential of applying different variants of autoencoders to multiple imputation. A comprehensive analysis of the effect of hyperparameters on imputation performance is given. We provide insights into the suitability of using autoencoders for multiple imputation tasks and give practical suggestions to improve their imputation performance. The proposed procedure is implemented in an R package miae, which uses torch as the backend, so that setting up Python is not required. In addition, miae aims to provide an automated procedure, where the main imputation function can automatically handle tasks such as data preprocessing and postprocessing, without requiring extra work from users. Various statistical techniques have also been implemented to enhance the imputation performance of miae, and its performance is evaluated and compared to those of mice and mixgb. The development version of miae is available at https://github.com/agnesdeng/miae.

Dec 7. 15:40-17:20

14SCO Theatre 3. IPS11: Vast and Aggregative Data Analysis from Theoretical to Practical Approaches Part II

IN THIS SESSION

Organiser: Hiroyuki Minami

Liang-Ching Lin (National Cheng Kung University), Meihui Guo (National Sun Yat-sen University), Sangyeol Lee (Seoul National University)

Monitoring Photochemical Pollutants based on Symbolic Interval-Valued Data Analysis

This study considers monitoring photochemical pollutants for anomaly detection based on symbolic interval-valued data analysis. For this task, we construct control charts based on the principal component scores of symbolic interval-valued data. Herein, the symbolic interval-valued data are assumed to follow a normal distribution and an approximate expectation formula of order statistics from the normal distribution is used in the univariate case to estimate the mean and variance via the method of moments. In addition, we consider the bivariate case wherein we use the maximum likelihood estimator calculated from the likelihood function derived under a bivariate copula. We also establish the procedures for the statistical control chart based on the univariate and bivariate interval-valued variables, and the procedures are potentially extendable to higher dimensional cases. Monte Carlo simulations and real data analysis using photochemical pollutants confirm the validity of the proposed method. The results particularly show the superiority over the conventional method that uses the averages to identify the date on which the abnormal maximum occurred.

Nobuo Shimizu (The Institute of Statistical Mathematics), Junji Nakano (Chuo University), Yoshikazu Yamamoto (Tokushima Bunri University)

A Visualization of Aggregated Symbolic Data by Multiple Correspondence Analysis Method

When large amounts of individual data are available, we often look at them not as a set of individual data but as a set of natural and meaningful groups. We can represent the characteristics of each group by up to second-order moments of a categorical and continuous variable, which include the mean and covariance matrix for continuous variables, the Burt matrix and marginal probability for categorical variables, and the mean of a continuous variable relative to a value of a categorical variable. We call these statistics aggregated symbolic data (ASD) for expressing a group.

It is well known that several categorical variables are analyzed by the multiple correspondence analysis (MCA). MCA visualizes the structure of the data set by deriving a score for a category value of a categorical variable from the Burt matrix and displaying it in a low-dimensional space. We propose a method to apply MCA to ASD containing continuous and categorical variables. An example about the real estate data in Japan is analyzed and visualized by the proposed method.

Hiroyuki Minami (Hokkaido University)

Security Information and Event Management (SIEM) application with Symbolic Data Analysis and its implementation

In the Internet era, cyber security is one of the most important topics, and we need to grasp trends in cyber attacks to protect ourselves.

From the statistical viewpoint, the typical observations are system logs. Their volume is growing day by day and their types vary. For example, we investigate logs of an MTA (Message Transfer Agent, the so-called ‘Mail Server’) to prevent spam and irregular mail relay, and logs of IDS/IPS (Intrusion Detection/Prevention System) and some operating systems to detect suspicious network actions.

SIEM (Security Information and Event Management) is a framework to gather system logs and give an overview based on them. Some SIEM applications say they have an ‘analyse’ function, but in our view they fall short in statistical consideration.

To simplify, the elements of an IDS/IPS system log are 5-tuples: the source IP address and port number, the destination IP address and port number, and the protocol (TCP or UDP). In addition, timestamp and ‘whois’ information should be considered, since suspicious actions might follow some kind of time-series trend and originate from a specific area and/or company. By re-formalizing and summarizing these in a proper way, we aim to grasp short- and/or long-term trends of cyber attacks.

We apply Symbolic Data Analysis to them and try to utilize its framework. We will present the application and some practical examples. For confidentiality reasons, we use a ‘reprocessed’ dataset based on practical observations.
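To make the 5-tuple description concrete, the following Python sketch shows one way such log records could be represented and aggregated per source address into interval-valued and histogram-valued summaries, in the spirit of symbolic data. All names (`LogRecord`, `summarize_by_source`, the field names) are hypothetical, and the grouping by source IP is an assumption for illustration, not the speaker's implementation.

```python
from collections import defaultdict
from dataclasses import dataclass
from datetime import datetime

@dataclass(frozen=True)
class LogRecord:
    """One IDS/IPS log entry: the 5-tuple plus a timestamp (hypothetical fields)."""
    src_ip: str
    src_port: int
    dst_ip: str
    dst_port: int
    protocol: str
    timestamp: datetime

def summarize_by_source(records):
    """Aggregate records per source IP into simple symbolic descriptions."""
    groups = defaultdict(list)
    for rec in records:
        groups[rec.src_ip].append(rec)
    summary = {}
    for src, recs in groups.items():
        ports = [r.dst_port for r in recs]
        times = [r.timestamp for r in recs]
        proto_counts = defaultdict(int)
        for r in recs:
            proto_counts[r.protocol] += 1
        summary[src] = {
            "count": len(recs),
            # Interval-valued description of the targeted ports.
            "dst_port_interval": (min(ports), max(ports)),
            # Histogram-valued description of the protocols used.
            "protocol_hist": dict(proto_counts),
            # Interval-valued description of the activity period.
            "time_interval": (min(times), max(times)),
        }
    return summary
```

Grouping by other keys, such as the ‘whois’ organisation or an hourly time window, would follow the same pattern.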

14SCO Theatre 5. CPS06: Statistical Learning

IN THIS SESSION

Elizabeth Chou (National Chengchi University) & Bo-Cheng Hsieh (National Chengchi University)

Data-Driven Learning for Imbalanced Data

The Siamese neural network is a powerful metric-based few-shot learning approach, capable of extracting features effectively from limited data and enhancing classifier accuracy through its unique training method and architecture. Despite its widespread usage as a feature extraction method for unstructured data in computer vision and natural language processing, its performance in structured data applications has not been thoroughly compared with other algorithms. To address this gap, our study explores the application of the Siamese neural network alongside nine algorithm combinations, four different classifiers, and SMOTE for supervised learning across five imbalanced datasets. Additionally, we investigate the impact of varying imbalanced ratios on the performance of different algorithms. The findings reveal that the Siamese neural network demonstrates impressive performance in structured data settings and exhibits greater resilience than other algorithms.

Qian Jin (UNSW Sydney/DARE center), Pierre Lafaye de Micheaux (UNSW Sydney & Université de Montpellier), Clara Graizan (The University of Sydney & DARE center)

Generalized Partial Least Square in Deep Neural Network

While deep learning has shown exceptional performance in many applications, the model’s mathematical understanding, model designing, and model interpretation are still new areas. Combining the two cultures of deep and statistical learning can provide insight into model interpretation and high-dimensional data applications. This work focuses on combining deep learning with generalized partial least square estimation. In particular, Bilodeau et al. (2015) proposed a generalized regression with orthogonal components (GROC) model, which is more flexible than the standard partial least square (PLS), because it may involve a more complex structure of dependence and the use of a generalized additive model (GAM) instead of linear regression. We propose a deep-GROC (DGROC) model, which allows for different measures of dependence (through their copula representation) to be used and shows a high prediction accuracy. Hyperparameter selection and transfer learning in the training loop are included to boost model performance. The superiority of the proposed method is demonstrated on simulations and real datasets, which show that our method achieves competitive performance compared to GROC, PLS and traditional Neural Networks.

Tatsuki Akaki (Okayama University of Science), Yuichi Mori (Okayama University of Science), Masahiro Kuroda (Okayama University of Science) & Masaya Iizuka (Okayama University)

Dimension-Reduced Fuzzy Clustering for Categorical Data

In fields such as marketing and psychology, categorical data is often dealt with, and the data sometimes consists of many variables. In this situation, we consider how to classify the objects in the data. In this study, we propose a modified fuzzy clustering for multi-dimensional categorical data (catRFCM). The catRFCM quantifies the original categorical data and estimates low-dimensional cluster centers by implementing fuzzy c-means (FCM) with quantification and FCM with dimension reduction simultaneously. The algorithm of catRFCM is as follows: [Step 1] Determine the number of clusters, the number of principal components, and the values of tuning parameters, and initialize the cluster center matrix H, the principal component loading matrix A, and the membership matrix U. [Step 2] Calculate the category quantification scores in each variable. [Step 3] Update H and the reduced cluster-centered score matrix F. [Step 4] Update A. [Step 5] Update U. [Step 6] If H, A and U have converged, stop. Otherwise return to Step 2. Numerical experiments are conducted to evaluate the performance of the proposed method by comparing it with methods that analyze the original categorical data as numerical data and with tandem clusterings in which quantification, dimension reduction, and fuzzy clustering are executed in sequence.
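The alternating structure of the steps above can be illustrated with plain fuzzy c-means on numerical data. The sketch below is the standard FCM algorithm only: it omits the quantification and dimension-reduction steps (Steps 2 and 4) and is not the catRFCM method itself. The function name and default values are assumptions made for illustration.

```python
import numpy as np

def fuzzy_c_means(X, n_clusters, m=2.0, n_iter=100, tol=1e-6, seed=0):
    """Standard fuzzy c-means; returns cluster centers H and memberships U."""
    rng = np.random.default_rng(seed)
    n = X.shape[0]
    # Initialize the membership matrix U with rows summing to one.
    U = rng.random((n, n_clusters))
    U /= U.sum(axis=1, keepdims=True)
    for _ in range(n_iter):
        # Update centers H as membership-weighted means (fuzzifier m).
        W = U ** m
        H = (W.T @ X) / W.sum(axis=0)[:, None]
        # Squared distances from every object to every center.
        d2 = ((X[:, None, :] - H[None, :, :]) ** 2).sum(axis=2)
        d2 = np.maximum(d2, 1e-12)  # guard against division by zero
        # Update memberships: u_ik proportional to d_ik^(-2/(m-1)).
        inv = d2 ** (-1.0 / (m - 1.0))
        U_new = inv / inv.sum(axis=1, keepdims=True)
        # Stop once the memberships no longer change.
        converged = np.abs(U_new - U).max() < tol
        U = U_new
        if converged:
            break
    return H, U
```

In a catRFCM-style procedure, the same loop would additionally update the category quantification scores and the loading matrix A between the center and membership updates.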

Tetta Noguchi (Doshisha University), Takehiro Shoji (Doshisha University), Toshiki Sakai (Doshisha University), Hiroshi Yadohisa (Doshisha University)

Clusterwise Geographically Weighted Regression for Spatial Heterogeneous Data

Clusterwise linear regression (CLR) is a statistical method that assumes a clustered structure in the data and simultaneously estimates both the clusters and unique regression coefficients for each of them. CLR is widely utilized to examine spatial data wherein clusters are identified based on their geographical location. Another method for analyzing spatial data is Geographically weighted regression (GWR), which assumes a gradual variation in the relationship between the objective and explanatory variables across space. GWR estimates different regression coefficients for each location after weighting—based on geographical proximity. Here, the situations assumed by CLR and GWR can coexist simultaneously. For example, previous studies have modeled paved road data using both CLR and GWR, thus necessitating the consideration of cluster structure and spatial variation. We present a method that merges CLR and GWR. Specifically, CLR identifies clusters with different regression coefficients, while GWR simultaneously estimates the regression coefficients within the clusters. This enables us to characterize clusters by geographical location while also incorporating regression coefficients’ gradual variability within them. Consequently, we postulate that prediction accuracy improves in situations wherein regression coefficients’ cluster structure as well as smooth variation must be considered—compared to using CLR or GWR alone.
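As background for the GWR component, the following minimal sketch shows how location-specific coefficients can be obtained by weighted least squares with a Gaussian distance kernel. It covers basic GWR only, not the proposed combination with CLR. The function name, the Gaussian kernel and the fixed `bandwidth` parameter are assumptions chosen for illustration.

```python
import numpy as np

def gwr_coefficients(coords, X, y, bandwidth):
    """Local weighted-least-squares coefficients at every observation location."""
    n = X.shape[0]
    Xd = np.column_stack([np.ones(n), X])  # design matrix with intercept
    betas = np.empty((n, Xd.shape[1]))
    for i in range(n):
        # Squared geographical distance from location i to all locations.
        d2 = ((coords - coords[i]) ** 2).sum(axis=1)
        # Gaussian kernel: nearby observations receive larger weights.
        w = np.exp(-0.5 * d2 / bandwidth ** 2)
        # Solve the weighted normal equations (X'WX) beta = X'Wy.
        XtW = Xd.T * w
        betas[i] = np.linalg.solve(XtW @ Xd, XtW @ y)
    return betas
```

Each row of the result holds the intercept and slopes estimated for one location. A clusterwise variant would restrict or reweight the observations used at each location according to cluster membership.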

Dec 8. 09:30-10:30.

23 WW Price Theatre. Keynote Talk 1

IN THIS SESSION

Prof Thomas Lumley (University of Auckland)

Fitting Mixed Models to Data from Complex Surveys

It is surprisingly hard to fit mixed models to data from multistage surveys, especially if the structure of the model and the structure of the sampling are not the same. I will review why it is hard and some of the approaches. I will describe a very general approach to linear mixed models based on pairwise composite likelihood, which extends work by J.N.K. Rao and Grace Yi and co-workers and was motivated by modelling questions arising in the Hispanic Community Health Study/Study of Latinos in the US. This approach is implemented in a new R package, svylme, and allows for nested and crossed random effects and for the sort of correlations that arise in genetic models; I will present some examples and discuss some computational issues.

Dec 8. 11:00-12:40.

14SCO Theatre 3. IPS12: Visualising high-dimensional and complex data

IN THIS SESSION

Organiser: Di Cook

Discussant: Patrick Li

Jayani P.G. Lakshika (Monash University), Dianne Cook (Monash University), Paul Harrison (Monash University), Michael Lydeamore (Monash University), Thiyanga Talagala (University of Sri Jayewardenepura, Sri Lanka)

Viewing the Model from a Non-Linear Dimension Reduction in the High-Dimensional Data Space

Non-linear dimension reduction (NLDR) techniques such as $t$-SNE, UMAP, PHATE, PaCMAP, and TriMAP provide a low-dimensional representation of high-dimensional data by applying a non-linear transformation. The complexity of the transformations and data structure can create wildly different representations depending on the method and parameter choices. It is difficult to determine whether any are accurate, which is best, or whether they have missed structure. To help assess the NLDR and decide on which, if any, is best, we have developed an algorithm to create a model that is then displayed as a wireframe in high dimensions. The algorithm hexagonally bins the data in the NLDR view and triangulates the bin centroids to create the wireframe. The high-dimensional representation is generated by mapping the centroids to high-dimensional points by averaging the values of observations belonging to each bin. The corresponding lines connected in the NLDR view also connect the corresponding high-dimensional points. The resulting model representation is overlaid on the high-dimensional data and viewed using a tour, a movie of linear projections. From this, we can understand how an NLDR warps the high-dimensional space and fits the data. Different methods and parameters yield the same fits, even when the NLDR view appears different. The process is what Wickham et al. (2015) coined viewing the “model in the data space”, and the NLDR view would be considered viewing the “data in the (NLDR) model space”. This work will interest analysts in understanding the complexity of high-dimensional data, especially in bioinformatics and ecology, where NLDR is prevalent. The algorithm is available via the R package quollr.

Paul Harrison (Monash University)

Visualizing high-dimensional genomics data, and what Non-Linear Dimension Reduction hides and misrepresents

High-throughput genomic sequencing has opened a window on a wide range of information about the inner workings of biological cells. For example, the number of RNA sequences from different genes tells us which proteins a cell may currently be producing. Using this data we can view biological cells as inhabiting a gene-space with around 20,000 dimensions. Recent methods provide information about many thousands of individual cells (“single cell sequencing”), or at high resolution on a microscope slide (“spatial sequencing”). A rich variety of geometric features are present in this high-dimensional space, including cell types, variation within cell types, developmental trajectories, responses to treatments, and also artifacts due to limitations of the technology or data processing. Non-Linear Dimension Reduction (NLDR) methods such as UMAP are typically used to try to represent this data in 2D, and while these can be very effective at revealing real biology, they also have the potential to misrepresent the data. To evaluate NLDR methods, we can compare them to 2D linear projections of the data, which have less potential to distort the data. These linear projections will be animated, providing a “tour” of the data, using an interactive Javascript-based widget called Langevitour. Langevitour explores the space of possible projections, either at random or seeking informative projections, using a method borrowed from statistical mechanics. This comparison will help show which specific features in biological data are represented well by NLDR or are omitted or misrepresented.

Paulo Canas Rodrigues (Federal University of Bahia)

Visualization and Spatio-Temporal Modeling of the Brazilian Wildfires: The Influence of Human and Meteorological Variables

Wildfires are among the most common natural disasters in many world regions and actively impact life quality. These events have become frequent due to climate change, other local policies, and human behavior. This study considers the historical data with the geographical locations of all the “fire spots” detected by the reference satellites covering the Brazilian territory between January 2011 and December 2022, comprising more than 2.2 million fire spots. This data was modeled with a spatio-temporal generalized linear model for areal unit data, whose inferences about its parameters are made in a Bayesian approach and use meteorological variables (precipitation, air temperature, humidity, and wind speed) and a human variable (land-use transition and occupation) as covariates. The change in land use from the forest and green areas to farming significantly impacts the number of fire spots for all six Brazilian biomes. (Joint work with Jonatha Pimentel and Rodrigo Bulhões)

14SCO Theatre 4. IPS13: Recent Advances in the Methods of Network Data Analysis

IN THIS SESSION

Organiser: Frederick Kin Hing Phoa

Hohyun Jung (Sungshin Women’s University)

Analysis of Popularity Bias in Bipartite Networks with Applications to Flickr and Netflix

User-item bipartite networks consist of users and items, where edges indicate the interactions of user-item pairs. We propose a Bayesian generative model for the user-item bipartite network that can measure the two types of rich-get-richer biases: item popularity and user influence biases. Furthermore, the model contains a novel measure of an item, namely the item quality that can be used in the item recommender system. The item quality represents the genuine worth of an item when the biases are removed. The Gibbs sampling algorithm alongside the adaptive rejection sampling is presented to obtain the posterior samples to perform the inference on the parameters. Monte Carlo simulations are performed to validate the presented algorithm. We apply the proposed model to Flickr user-tag and Netflix user-movie networks to yield remarkable interpretations of the rich-get-richer biases. We further discuss genuine item quality using Flickr tags and Netflix movies, considering the importance of bias elimination.

Yuji Mizukami (Nihon University) & Junji Nakano (Chuo University)

Characteristics of Scientists in AI-Related Fields for Several Countries Based on Non-Negative Matrix Factorization of Authorship of Scientific Papers

We extract information on articles and co-authors in AI-related research from the Web of Science (WOS) and compare the trends of research in the field. We analyze 12804 authors from the top 20 countries in terms of number of publications in AI-related papers in 2013. We focus on the number of papers in each of the 23 WOS Essential Science Indicators (ESI) research fields for each author. Our studies use non-negative matrix factorization (NMF), mainly because of the ease of interpretation of the results, from which we can extract the strength and trends of each country’s research in this field.

We must note that NMF analysis requires the number of bases, which represents the granularity of the model, to be specified in advance. In this study, the number of bases was determined by heuristics using the Frobenius norm of the matrix.
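
For readers unfamiliar with NMF, the standard formulation can be written as follows. This is a generic textbook statement; the symbols $X$, $W$, $H$, $n$, $m$ and $k$ are introduced here for illustration, and the specific heuristic used in the study is not given in the abstract. With $X$ an $n \times m$ non-negative matrix (for example, authors by the $m = 23$ ESI fields) and $k$ the number of bases:

$$
X \approx WH, \qquad (W, H) = \arg\min_{W \ge 0,\; H \ge 0} \lVert X - WH \rVert_F^2, \qquad \lVert A \rVert_F = \sqrt{\sum_{i,j} a_{ij}^2},
$$

where $W$ is $n \times k$ and $H$ is $k \times m$. A Frobenius-norm heuristic would typically examine how the reconstruction error $\lVert X - WH \rVert_F$ changes as $k$ increases.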

Wei-chung Liu (Academia Sinica, Taiwan)

Quantifying Biodiversity from Network Perspectives

Conventional biodiversity research is concerned with the number of species and their genomic diversity in an ecosystem. Since species interact trophically, forming a food web, we argue that biodiversity can also be viewed from a network perspective. In this study we propose three approaches. The first is based on the notion of regular equivalence, which measures species positional similarity in a food web. The second is based on the concept of positional centrality, where diversity is defined in terms of how diverse species centrality values are in a food web. The third is based on the ecological concept of direct and indirect interactions, and diversity here is defined by the heterogeneity in species interaction patterns in a food web. We discuss the relationship between our diversity measures and the structural organization of food webs, as well as future developments in this particular field of network research in ecology.

Tso-Jung Yen (Academia Sinica, Taiwan) & Wei-Chung Liu (Academia Sinica, Taiwan)

Link Prediction via Exploring Common Neighborhoods

Social network analysis aims to establish properties of a network by exploring the link structure of the network. However, due to concerns such as confidentiality and privacy, a social network may not provide full information on its links. As some of the links are missing, it is difficult to establish the network’s properties by exploring its link structure. In this paper we propose a method for recovering such missing links. We focus on a situation in which some nodes have fully-observed links. The method relies on exploiting the network of these anchor nodes to recover missing links of nodes that have neighborhoods overlapping with the anchor nodes. It uses a graph neural network to extract information from these neighborhoods, and then applies the information to a regression model for missing link recovery. We demonstrate this method on real-world social network data. The results show that this method can achieve better performance than traditional methods that are solely based on node attributes for missing link recovery.

14SCO Theatre 5. CPS07: Biostatistics & Survival Analysis

IN THIS SESSION

Jo-Ying Hung (National Cheng Kung University) & Pei-Fang Su (National Cheng Kung University)

Integrating Auxiliary Subgroup Restricted Mean Survival Information to the Cox Model when Population Heterogeneity Exists

With the increasing accessibility of data sources, utilizing published information to enhance estimator efficiency in individual-level studies has become a topic of interest. We focus on the restricted mean survival time, a model-free and easily interpretable statistic, as the external information. When combining information from different sources, heterogeneity may arise due to the presence of variations within a population. In this research, we propose a double empirical likelihood method to incorporate auxiliary restricted mean survival time information into the estimation of the Cox proportional hazards model. To account for heterogeneity between different sources, we incorporate a semiparametric density ratio model into the estimating equation. The large sample properties of the proposed estimators are established, and they are shown to be asymptotically more efficient than the classical partial likelihood estimators. To demonstrate our proposed method, simulation studies will be conducted and the method will be applied to a diabetes dataset.
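
As background, the restricted mean survival time up to a truncation time $\tau$ has the following standard definition. The symbols $T$, $\tau$, $\mu$ and $S$ are introduced here for illustration and are not taken from the abstract; $T$ is the survival time and $S(t) = P(T > t)$ its survival function:

$$
\mu(\tau) = E[\min(T, \tau)] = \int_0^{\tau} S(t)\, dt .
$$

The quantity is the area under the survival curve up to $\tau$, which is why it can be reported without assuming any particular hazard model.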

Sajeeka Nanayakkara (University of Otago), Jiaxu Zeng (University of Otago), Robin Turner (University of Otago), Matthew Parry (University of Otago) & Mark Sywak (The University of Sydney)

Empirical Evaluation of Internal Validation Approaches in the Development of Clinical Prediction Models

Background: Risk prediction models are essential tools for aiding clinical decision-making to improve healthcare. Evaluating the predictive performance of such models is crucial, as it reveals the validity of the outcome predictions for new patients. While existing literature suggests cross-validation and optimism-corrected bootstrapping methods for internal validity assessment, their effectiveness is rarely discussed, and empirical evidence is limited. In this study, we empirically evaluate three methods, namely cross-validation, repeated cross-validation and optimism-corrected bootstrapping, for internal validation during the development of prediction models.

Methods: We compared the effectiveness of these internal validation methods using different model building strategies: conventional logistic regression, ridge, lasso, and elastic net regression. We used a tertiary thyroid cancer service database in Australia, which comprises demographic and clinical characteristics, to predict the risk of structural recurrence of thyroid cancer. The predictive performance of the models was assessed in terms of discrimination, calibration and overall performance.

Results: This study included 3561 patients with thyroid cancer, of which 281 (7.9%) patients experienced recurrence requiring further surgeries. The performance measures indicated that all prediction models performed well in predicting the risk of structural recurrence in thyroid cancer patients. Optimism values demonstrated that all shrinkage models were better at mitigating overfitting compared to the logistic regression model. Confidence intervals of the performance measures showed that repeated cross-validation had large variability whereas optimism-corrected bootstrapping had relatively low variability.

Conclusions: In general, the point estimates of internal validity were comparable for all three methods, but the optimism-corrected bootstrapping could yield relatively more precise confidence intervals of those estimates, especially, with prediction models involving rare outcomes.
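
For reference, optimism-corrected bootstrapping is commonly defined as below. This is a generic formulation, not necessarily the exact variant used in this study, and the symbols $P$, $M_D$, $D$, $D_b^*$ and $B$ are introduced here for illustration. Let $P(M, D)$ be a performance measure (higher is better) of model $M$ evaluated on data $D$, let $M_D$ be the model fitted to $D$, and let $D_b^*$ be the $b$-th of $B$ bootstrap resamples of the original data $D$:

$$
\hat{O} = \frac{1}{B} \sum_{b=1}^{B} \left[ P(M_{D_b^*}, D_b^*) - P(M_{D_b^*}, D) \right], \qquad P_{\text{corrected}} = P(M_D, D) - \hat{O} .
$$

The apparent performance $P(M_D, D)$ is thus reduced by the average amount by which models fitted to resamples look better on their own training data than on the original data.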

Dec 8. 13:30-15:10

14SCO Theatre 3. IPS14: Analysis of the pricing and outbreaks using the model and machine learning

IN THIS SESSION

Organiser: Hyojung Lee

Young-Geun Choi (Sungkyunkwan University), Gi-Soo Kim (UNIST), Yunseo Choi (Sookmyung Women’s University), Woo Seong Cho (Seoul National University), Myunghee Cho Paik (Seoul National University) & Min-Hwan Oh (Seoul National University)

Semi-Parametric Contextual Pricing Algorithm using Cox Proportional Hazards Model

Contextual dynamic pricing is a problem of setting prices based on current contextual information and previous sales history to maximize revenue. A popular approach is to postulate a distribution of customer valuation as a function of contextual information and the baseline valuation. A semi-parametric setting, where the context effect is parametric and the baseline is nonparametric, is of growing interest due to its flexibility. A challenge is that customer valuation is almost never observable in practice and is instead type-I interval censored by the offered price. To address this challenge, we propose a novel semi-parametric contextual pricing algorithm for stochastic contexts, called the epoch-based Cox proportional hazards Contextual Pricing (CoxCP) algorithm. To the best of our knowledge, our work is the first to employ the Cox model for customer valuation. The CoxCP algorithm has a high-probability regret upper bound of $\tilde{O}(T^{\frac{2}{3}} d)$, where $T$ is the length of the horizon and $d$ is the dimension of the context. In addition, if the baseline is known, the regret bound can improve to $O(d \log T)$ under certain assumptions. We demonstrate empirically that the proposed algorithm performs better than existing semi-parametric contextual pricing algorithms when the model assumptions of all algorithms are correct.

~~Geunsoo Jang (Kyungpook National University), Hyojung Lee (Kyungpook National University) & Jeonghwa Seo (Kyungpook National University)~~ (Withdrawn)

~~Estimation of the Early Detection for the Outbreaks of the Seasonal Infectious Diseases using Machine Learning~~

Seasonal infectious diseases such as norovirus and influenza can be controlled through personal hygiene management and isolation in the event of an outbreak. Predicting and announcing outbreaks in advance can be a low-cost way to stop the spread of diseases. As such, early detection of outbreaks is key to preventing seasonal infectious diseases. We aim to predict the trend of disease and find the starting point of outbreaks early.

We propose a new model for seasonal infectious diseases and develop methods for early detection. We take into account meteorological characteristics, previous detection rates and other factors. We use machine learning techniques such as LSTM and GRU to predict seasonal infectious diseases. We also apply change-point methods and anomaly detection for early detection. We include RCP scenarios to reflect future climate changes and analyze the change in the starting point of outbreaks under each scenario.

Hyojung Lee (Kyungpook National University)

Modeling for Infectious Diseases to Control an Epidemic

Several coronaviruses have emerged, including SARS-CoV, MERS-CoV, and SARS-CoV-2 within the span of two decades. These viruses, along with new variant strains, pose a significant and escalating global threat. Furthermore, the rapid dissemination of misinformation about the outbreaks has contributed to worldwide panic. To mitigate the spread of infectious diseases, the Republic of Korea has adopted a localized quarantine strategy instead of a global lockdown, effectively containing the disease spread. This policy has been adjusted based on the severity of the epidemic.

To provide scientific information and guide policy decisions, mathematical and statistical modeling have been employed to predict the spread of COVID-19. We first discuss the characteristics of COVID-19 transmission dynamics in Korea and the “K-quarantine” measures implemented during different periods. Second, we analyze mathematical and statistical models to assess the effectiveness of COVID-19 vaccination. Third, we evaluate the impact of control interventions. Finally, we highlight recent research topics concerning COVID-19 transmission in Korea.

14SCO Theatre 4. CPS08: Time Series Analysis

IN THIS SESSION

Ning Zhang (Macquarie University), Nan Zou (Macquarie University), Yanlin Shi (Macquarie University), Georgy Sofronov (Macquarie University), Andrew Grant (The University of Sydney)

Bottom-up Change-Point Detection in Time Series Data

Time series analysis involves the study of the evolution of one or more variables over time. A critical aspect of time series analysis is the detection of significant changes in the underlying data-generating mechanism. In recent years, researchers across various fields have rekindled their interest in solving the change-point detection problem. Specifically, change-point detection poses a challenge and finds application in diverse disciplines, e.g., financial time series analysis (e.g., identifying shifts in volatility), signal processing (e.g., conducting structural analysis on EEG signals), geology data analysis (e.g., examining volcanic eruption patterns), and environmental research (e.g., detecting shifts in ecological systems caused by critical climatic thresholds). In this paper, we introduce a rigorous bottom-up approach to change-point detection. More precisely, our approach divides the original signal into numerous smaller sub-signals, calculates a difference metric between adjacent segments and, at each iteration, selects the time points with the smallest differences. We demonstrate the effectiveness of this bottom-up approach through its application to both simulated and real-world data.
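
The bottom-up idea described above can be sketched in a few lines of Python. This is a minimal illustration under stated assumptions, not the authors' algorithm: the function name `bottom_up_change_points`, the parameter `min_size`, and the choice of absolute difference in segment means as the difference metric are all hypothetical. Here the boundary with the smallest difference is the one selected for removal at each iteration, so the two segments on either side of it are merged.

```python
from statistics import fmean


def bottom_up_change_points(signal, n_change_points, min_size=2):
    """Greedy bottom-up segmentation.

    Start from many small segments and repeatedly remove the boundary whose
    neighbouring segments differ the least, until only `n_change_points`
    boundaries remain.
    """
    # Initial fine partition: a boundary every `min_size` samples.
    bounds = list(range(0, len(signal), min_size))
    if len(signal) - bounds[-1] < min_size and len(bounds) > 1:
        bounds.pop()  # fold a short tail into the previous segment
    bounds.append(len(signal))

    def gap(i):
        # Difference metric (an assumption): absolute difference between
        # the means of the two segments meeting at boundary bounds[i].
        left = signal[bounds[i - 1]:bounds[i]]
        right = signal[bounds[i]:bounds[i + 1]]
        return abs(fmean(left) - fmean(right))

    # Interior boundaries are bounds[1:-1]; drop the weakest one each round.
    while len(bounds) - 2 > n_change_points:
        weakest = min(range(1, len(bounds) - 1), key=gap)
        del bounds[weakest]
    return bounds[1:-1]


# Toy signal with a single level shift at index 10.
toy_signal = [0.0] * 10 + [5.0] * 10
detected = bottom_up_change_points(toy_signal, n_change_points=1)
```

In practice the number of change points is rarely known in advance, so a stopping threshold on the difference metric would usually replace the fixed `n_change_points` argument.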

Jun Seok Han (Macquarie University), Nino Kordzakhia (Macquarie University), Pavel V. Shevchenko (Macquarie University), Stefan Truck (Macquarie University)

Two-Factor and Hybrid with LS-SVR Models: Application to EUA Futures Prices

In this study, we assess the performance of h-step-ahead forecast of logarithms of futures prices, using the extended two-factor model by Han et al. (2023), which was developed based on Schwartz & Smith (2000). The two-factor model assumes that the short and long-term components of the logarithm of futures price are mean-reverting. These two components represent correlated latent stochastic processes and are jointly estimated, along with model parameters, using the Kalman Filter through the maximum likelihood method.

Furthermore, we assume that the error terms in the measurement equation system are interdependent and serially correlated. A comparative analysis has been carried out between three models: a) the reduced-form model, b) the full model, and c) a hybrid model, where futures prices follow an ARIMA process. In c), we use the least squares support vector regression (LS-SVR) model to predict the residuals of the multivariate ARIMA process. The historical daily prices of European Union Allowance (EUA) futures contracts from January 30, 2017 to April 1, 2022 were used in this study.

The talk is based on joint work with N. Kordzakhia, P. Shevchenko and S. Trück.

Kai Kasugai (Chuo University) & Toshinari Kamakura (Chuo University)

Low Variance Method for Gradient Estimator of FIVO

The purpose of time-series analysis is to predict and control systems. In the case of complex systems, it is difficult to describe the time evolution of the system based only on observed data. Recently, variational inference combined with Bayesian nonlinear filtering has produced leading results in latent time-series modeling with the sequential variational auto-encoder (SVAE). These studies have focused on sequential Monte Carlo (SMC), e.g., filtering variational objectives (FIVO). FIVO can theoretically provide a more rigorous solution than IWAE for MLE in sequential models. However, serious problems of particle degeneracy and biased gradient estimators occur. These problems, which arise from the categorical distributions used for resampling in SMC, prevent stable model parameters from being obtained. The second term of the gradient estimator, which comes from resampling, is calculated by REINFORCE, which causes high variance, since there is no method for applying the reparameterization trick to discrete distributions. Some methods have been proposed to reduce the variance of the gradient estimator: one is to ignore the term from resampling, and the other is to relax the categorical distribution. Our goal is to propose a computing method for a lower-variance and unbiased gradient estimator of FIVO. REBAR provides lower-variance and unbiased gradient estimators for models with discrete latent variables. REBAR is one of the candidates for obtaining better estimators, but some devices are required to rearrange the lattice structures into time-ordered ones. Our method is advantageous over existing ones in that it does not lose the unbiasedness of the gradient estimator. We shall present the results of numerical experiments comparing the existing methods with ours in a simulation study.

Rob Hyndman (Monash University), Daniele Girolimetto (University of Padova), Tommaso Di Fonzo (University of Padova), George Athanasopoulos (Monash University)

Cross-temporal Probabilistic Forecast Reconciliation

Forecast reconciliation is a post-forecasting process that involves transforming a set of incoherent forecasts into coherent forecasts which satisfy a given set of linear constraints for a multivariate time series. We extend the current state-of-the-art cross-sectional probabilistic forecast reconciliation approach to encompass a cross-temporal framework, where temporal constraints are also applied. Our proposed methodology employs both parametric Gaussian and non-parametric bootstrap approaches to draw samples from an incoherent cross-temporal distribution. To improve the estimation of the forecast error covariance matrix, we propose using multi-step residuals, especially in the time dimension where the usual one-step residuals fail. We evaluate the proposed methods through a detailed simulation study that investigates their theoretical and empirical properties. We further assess the effectiveness of the proposed cross-temporal reconciliation approach by applying it to two empirical forecasting experiments, using the Australian GDP and the Australian Tourism Demand datasets. For both applications, we show that the optimal cross-temporal reconciliation approaches significantly outperform the incoherent base forecasts in terms of the Continuous Ranked Probability Score and the Energy Score. Overall, our study expands and unifies the notation for cross-sectional, temporal and cross-temporal reconciliation, thus extending and deepening the probabilistic cross-temporal framework. The results highlight the potential of the proposed cross-temporal forecast reconciliation methods in improving the accuracy of probabilistic forecasting models.
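
To make the notion of coherence concrete, the sketch below reconciles point forecasts for the smallest possible hierarchy, total = A + B, by orthogonal (OLS) projection onto the set of coherent forecasts. It is only an illustration of the basic idea: the function name `ols_reconcile` and the numbers are hypothetical, it handles a single cross-sectional constraint, and it is not the cross-temporal probabilistic method of the talk. In a sample-based probabilistic setting, one plausible reading is that a projection of this kind is applied to each draw from the incoherent distribution.

```python
def ols_reconcile(base, constraint):
    """Orthogonally project base forecasts onto {y : constraint . y = 0}."""
    # How far the base forecasts are from satisfying the constraint.
    residual = sum(c * y for c, y in zip(constraint, base))
    norm_sq = sum(c * c for c in constraint)
    # Spread the incoherence across series in proportion to the constraint.
    return [y - c * residual / norm_sq for c, y in zip(constraint, base)]


# Hypothetical hierarchy: total = A + B, written as total - A - B = 0.
base_forecasts = [100.0, 55.0, 42.0]  # incoherent: 55 + 42 = 97, not 100
coherent = ols_reconcile(base_forecasts, [1.0, -1.0, -1.0])
```

With equal weights the incoherence of 3 is split evenly, lowering the total by 1 and raising each component by 1. Weighted variants replace the plain projection with one based on an estimate of the forecast error covariance matrix, which is where the choice of residuals mentioned above matters.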

14SCO Theatre 5. CPS09: Visualising Complex Data & Anomaly Detection

IN THIS SESSION

Cynthia A Huang (Monash University)

Visualising Category Recoding and Redistributions

This paper proposes graphical representations of data and rationale provenance in workflows that convert both category labels and associated numeric data between distinct but semantically related taxonomies. We motivate the graphical representations with a new task abstraction, the cross-taxonomy transformation, and associated graph-based information structure, the crossmap. The task abstraction supports the separation of category recoding and numeric redistribution decisions from the specifics of data manipulation in ex-post data harmonisation. The crossmap structure is illustrated using an example conversion of numeric statistics from a country-specific taxonomy to an international classification standard. We discuss the opportunities and challenges of using visualisation to audit and communicate cross-taxonomy transformations and present candidate graphical representations.

Janith Wanniarachchi (Monash University), Di Cook (Monash University), Patricia Menendez (Monash University), Kate Saunders (Monash University) & Thiyanga Talagala (University of Sri Jayewardenepura, Sri Lanka)

Seeing the Smoke Before the Fire: Visualising Spatiotemporal Indicators of Bushfires through Counterfactual Explanations of Machine Learning Models

The prevailing reliance on high-performance black box models calls for a shift towards Explainable AI (XAI) methodologies. Counterfactuals, a subset of Explainable AI (XAI) methods, work by generating new instances around a specific instance that lead to a specific desired output from a model. Exploring the counterexamples obtained from explaining spatio-temporal black box models provides us with the ability to understand and explore the underlying relationships through space and time.

We introduce high-dimensional data visualisation methods designed to understand explanations derived from black box models, thereby enabling the extraction of latent features and decision pathways across temporal and spatial dimensions. This work is motivated by the need to advance bushfire management strategies with the aim of preventing, mitigating and providing tools for managing future catastrophic bushfires similar to those in 2019-2020. By using early detection data from hotspots combined with a holistic understanding of the scenarios leading to higher fire ignition risk, this work aims to contribute to bushfire risk management.

Rui Tanaka (Chuo University) & Kosuke Okusa (Chuo University)

Model-Based Radar Signal Anomaly Detection Using Autoencoder with Application to Gait Analysis

Autoencoders are widely used in the field of anomaly detection and have a variety of applications, such as detecting defective products on production lines and detecting anomalous patterns in video images. On the other hand, these detection methods require a large amount of training data, which is a problem. Radar signals have the characteristic that the observed waveform differs greatly depending on the positional relationship between the radar and the object, even if the motion is the same. It is difficult to construct training data for all of these movement patterns when considering application to an autoencoder. In this study, we propose a method to reduce the cost of collecting observation data by mathematically modeling the motion of the target and generating training data by simulation. As an example of application, we conducted experiments to detect two patterns of abnormal gait using simulation data generated from a human gait model, and confirmed that the proposed method has a higher detection rate than conventional methods that use CNN or Data-Augmentation.

Yoshiki Kinoshita (Chuo University)

Anomaly Detection via Statistical Signal Processing with Stochastic Differential Equation

Anomaly detection for signal data is important in acoustic surveillance. The method consists of two parts. First, the researcher extracts a feature from the signal data. Second, the feature is used in machine learning methods. One of the basic features is a spectrogram generated by a Fourier or wavelet transform. The spectrogram is regarded as an image, and many image processing methods are applied to it. Statistical signal processing is another way to handle signal data. It treats signal data as stochastic processes and is effective at reducing noise. In this presentation, we regard a spectrogram as a collection of stochastic processes and apply a stochastic differential equation to describe it. Anomaly detection is executed based on the quasi-likelihood function.

Dec 8. 15:40-17:20

14SCO Theatre 3. IPS15: Recent Advances in Rank-based Inference

IN THIS SESSION

Organiser: Leung Ho Philip Yu

Michele La Rocca (University of Salerno, Italy), Bastian Pfeifer (Medical University of Graz, Austria) & Michael G. Schimek (Medical University of Graz, Austria)

Efficient Bootstrapping for Signal Reconstruction and Inference of Two Independent Ranker Groups

Statistical ranking procedures are widely used to rate objects’ relative quality or relevance across multiple assessments. Beyond rank aggregation, estimating the usually unobservable latent signals that inform consensus rankings in groups of assessors is of interest. Under the only assumption of independent rankings, we have previously introduced estimation via convex optimisation. The procedure is computationally efficient even when the number of assessors is much lower than the number of objects to be ranked. It can be seen as an indirect inference approach, and standard asymptotic arguments are not straightforward. However, the suggested signal estimator can be written as a weighted estimator, which opens the possibility of using weighted bootstrap schemes to implement efficient resampling procedures and to derive accurate approximations for the unknown sampling distribution of the statistics involved.

In that general framework, we will explore using an efficient weighted resampling scheme with a low computational burden to implement a test procedure to compare the signals estimated from two groups of independent assessors. The test has a multiple testing structure, and the overall procedure is designed to keep the Familywise Error Rate or the False Discovery Proportion under control. Results of a Monte Carlo simulation study designed to demonstrate the relative merits of the proposed procedure will be discussed, considering different scenarios with increasing problem complexity. Finally, an application to real data will be presented.

Ke Deng (Tsinghua University)

Efficient Aggregation of Heterogeneous Ranking Lists via Unsupervised Statistical Learning

Rank data, especially in the form of a group of ranking lists for a common set of entities, arise in many fields, including recommendation systems, the sports industry, bioinformatics and so on. Rank aggregation, which aims to derive a more reliable ranking list from a group of input ranking lists, has great appeal in these fields for efficient decision making.

Considering that rankers, which can be either domain experts or ranking algorithms, are often heterogeneous in terms of reliability, it is crucial to take the heterogeneity of rankers into account to facilitate rank aggregation effectively. Existing methods for rank aggregation in the literature, however, can barely deal with this critical issue. Therefore, conceptually flexible and computationally efficient methods are highly desirable in the field of rank aggregation. In this talk, we will introduce recent progress on the aggregation of heterogeneous ranking lists via unsupervised statistical learning, which allows us to infer the heterogeneity of the involved ranking lists without outside information and achieve more efficient aggregation. The effectiveness of the proposed methods is guaranteed by theoretical analysis, and further supported by simulation studies as well as real-world examples.

Leung Ho Philip Yu (Education University of Hong Kong) & Yipeng Zhuang (Education University of Hong Kong)

Graph Neural Network for Preference Data

Individual preferences for items arise in many situations in our daily lives. Often, individuals’ choice behaviors may be strongly influenced by their peers or friends on social media. In this talk, we will introduce a novel graph neural network for modeling preference data, where individuals are connected in a network. Empirical studies will be conducted to demonstrate the performance of our model in predicting the preferences for unrated items.

14SCO Theatre 4. IPS16: Statistical Modeling for Medical and Biological Data

IN THIS SESSION

Organiser: Ivan Chang

Osamu Komori (Seikei University), Yusuke Saigusa (Yokohama City University), Shinto Eguchi (The Institute of Statistical Mathematics)

Species Distribution Modeling Using Geometric-Mean Divergence

Species distribution modeling plays a crucial role in estimating the abundance of species based on environmental variables such as temperature, precipitation, evapotranspiration, and more. Maxent, a representative method, is widely used across various scientific fields, especially in ecology and biodiversity research. However, the calculation of the normalizing constant in the likelihood function becomes time-consuming when the number of grid cells sharply increases. In this presentation, we propose geometric-mean divergence and derive an exponential loss function, which can be calculated efficiently even when the number of grid cells increases significantly. A sequential estimating algorithm is also proposed to minimize the exponential loss in a way similar to that of Maxent. The results of simulation studies and the analysis of Japanese vascular plant data are demonstrated to validate the efficacy of this approach.

Sheng-Mao Chang (National Taipei University)

CMILR: An Explainable Approach for Varying-Location Feature Identification of Images

This study was motivated by two image discrimination examples: handwritten digit recognition and COVID-19 lung CT scanning image recognition. These two problems have a significant difference. Handwritten ones, for example, have a slash in the middle of all images, whereas locations of lung damage vary from one person to another. Linear classifiers excel at handling the former due to the consistent patterns, but they struggle with the latter due to the varying lung damage locations. To tackle the latter discrimination problem, we propose a novel approach called convolutional multiple-instance logistic regression (CMILR) that combines convolutional neural network (CNN) and multiple-instance learning. In the case of COVID-19 lung CT scans, CMILR resulted in an accuracy of 0.81 with only 169 parameters. In contrast, a fine-tuned CNN model resulted in an accuracy of 0.88 and 377,858 parameters. Additionally, CMILR provides a probability map indicating the likelihood of lung damage, offering valuable insights for medical diagnosis and making the learning algorithm explainable.

Yuan-chin Chang (Academia Sinica), Zhanfeng Wang (University of Science and Technology of China) & Xinyu Zhang (University of Science and Technology of China)

Ensemble of Sequential Learning Models of Distributed Data Centers and Its Applications

Tackling the formidable task of managing massive datasets stands as a paramount challenge in contemporary data analysis, particularly in the critical domains of epidemiology and medicine. This study introduces a groundbreaking approach that harnesses the power of sequential ensemble learning to masterfully dissect these extensive datasets. Our central focus revolves around optimizing efficiency, meticulously considering both statistical and computing dimensions. Furthermore, we tackle intricate challenges, including seamless data communication and the safeguarding of private information, echoing the discourse within the realm of federated learning in machine learning literature.

Dec 6. Poster Session 1

During Afternoon Tea Break @ 12WW Foyer

Shey-Huei Sheu (Asia University)

Bivariate Optimal Replacement Policy for a Shock Model

This paper discusses analytically the optimal preventive replacement policy with shock damage interaction. The system consists of two units, named unit 1 and unit 2, and is subject to shocks. Shocks are assumed to arrive according to a non-homogeneous pure birth process and can be divided into two types. A type 1 shock causes a minor failure of unit 1, which can be removed by a minimal repair, and also yields some additive damage to unit 2. A type 2 shock causes a catastrophic failure of the system, which is rectified by corrective replacement. The probability of a type 2 shock is assumed to depend on the number of shocks since the last replacement. Furthermore, unit 2 with cumulative damage $x$ may suffer a minor failure with probability $p(x)$ at the instant a type 1 shock occurs. Corrective replacement is implemented immediately at a type 2 shock or when the total damage of unit 2 has exceeded a failure level $K$. In addition, the system undergoes preventive replacement at age $T$ or when the total damage of unit 2 exceeds a managerial level $z$, whichever comes last. The expected cost rate is derived, and the optimal replacement policy $(T, z)$ that minimizes it is discussed.

Masataka Taguri (Tokyo Medical University) & Kentaro Sakamaki (Juntendo University)

Bias Corrected ANCOVA Estimators in Randomized Clinical Trials using Cross-Fitting

In randomized clinical trials, unbiased estimates of treatment effects can be obtained through unadjusted analyses that do not adjust for baseline covariates. However, it is widely known that adjusting for covariates through methods like analysis of covariance (ANCOVA) can enhance statistical efficiency, and guidelines from regulatory agencies such as the FDA and EMA recommend covariate adjustment for the primary analysis of randomized clinical trials (FDA, 2021; EMA, 2015). Recently, a method called PROCOVA (Schuler et al., 2021) has been proposed to conduct adjusted analysis using information from past trials or real-world data, and it has received a positive opinion from EMA (EMA, 2022). Nevertheless, if our interest is in estimating differences in means between two groups, a mis-specified ANCOVA model can produce an estimator that suffers from non-negligible small-sample bias (Tackney et al., 2023). In this study, we propose a new approach to correct the small-sample bias of the ANCOVA estimator using cross-fitting (Zivich and Breskin, 2021). Unlike the ordinary ANCOVA estimator, our corrected estimator is unbiased. We also show that our estimator has the same asymptotic distribution as the ANCOVA estimator. Simulation experiments will be presented to evaluate the performance of the proposed method.

Tatsuhiko Matsumoto (Murata Manufacturing Co., Ltd. and Osaka University) & Yutaka Kano (Osaka University)

Longitudinal Data Analysis on the Relationship between Tendon Vibrations of the Ankle and Lower Limb Muscle Activity Using Structural Equation Modeling (SEM)

Quantifying human muscle activity is essential, particularly for lower limb muscle activity, which plays a pivotal role in motor control and rehabilitation. Historically, electromyography (EMG) has been the primary quantitative measure. However, long-term EMG measurements face challenges due to factors such as electrode placement and noise from sweat. In this study, we focused on the vibrations emitted from the tendons of the ankle as an alternative bio-signal to EMG. The rationale for selecting tendons is that they are connected to muscles, and we hypothesized that signals generated by muscle activity would transmit to the tendons. Additionally, we chose the ankle because it has a concentration of tendons and is a region with less subcutaneous fat, making it easier to capture bio-signals. We conducted an experiment involving 63 subjects, where we applied different levels of load to the lower limb muscles. The data obtained from the experiment were processed into longitudinal data, with the load levels on the x-axis and the magnitude of vibrations on the y-axis. Using Structural Equation Modeling (SEM) on this longitudinal data, we were able to reveal three possibilities. The first is that the magnitude of vibrations emitted from the ankle is proportional to the exerted muscle force. The second is that the relationship between vibration magnitude and exerted muscle force can be represented by a non-linear model equation. The third is that individual variances in the model equation for vibration magnitude and exerted muscle force can be explained by body information such as body fat mass and skeletal muscle mass. At the conference, we plan to explain the process of using SEM to elucidate these three points.

Rinhi Higashiguchi (The University of Tokyo), Kensuke Okada (The University of Tokyo), Saeko Ishibashi (Tsuruga Nursing University), Takuya Makino (University of Fukui), Futoshi Suzuki (Kamibayashi Memorial Hospital), Kei Ohashi (Nagoya City University), Atsurou Yamada (Nagoya City University), Hirotaka Kosaka (University of Fukui), Takeshi Nishiyama (Nagoya City University)

The Application of the Bifactor S-1 Model on Externalizing Disorders Under an Item Response Theory Framework

The internal structure of attention-deficit/hyperactivity disorder (ADHD) alongside oppositional defiant disorder (ODD) and conduct disorder (CD) is often conceptualized using the symmetrical bifactor model in research on externalizing disorders. The symmetrical bifactor model reportedly has problems with its factor loadings taking on values close to zero, negative, and/or non-significant. As an alternative, the bifactor S-1 model, which establishes the general factor by treating one of the specific factors as a general reference factor, has been proposed. The main goal of this work was to evaluate the use of the bifactor S-1 model on the internal structure of externalizing disorders in the clinical population of Japan. Considering the similarities between a categorical factor analysis model and an item response theory (IRT) model, the bifactor S-1 model was applied under the IRT framework. Criterion-related validity of the measures was also considered in order to see whether the bifactor S-1 model can be a tool in the assessment of externalizing disorders. The symmetrical bifactor and bifactor S-1 models were both applied to externalizing disorder ratings by parents for 360 children. To test criterion-related validity, we then applied the two models to ratings by parents of 135 children who had a formal diagnosis after completing a semi-structured interview considered the gold standard (the Kiddie Schedule for Affective Disorders and Schizophrenia-Present and Lifetime Version). Anomalous factor loadings were seen when the symmetrical bifactor model was applied. The bifactor S-1 model, on the other hand, yielded expected factor loadings. Model fit for the bifactor S-1 model was also deemed sufficient, and most indices were consistent with previous findings. Point biserial coefficients were used to test criterion-related validity, and all values were significant. The bifactor S-1 model is considered to have potential for use in the evaluation of symptoms of externalizing disorders.

Sungjune Choi (Kyungpook National University), Geunsoo Jang (Kyungpook National University) & Hyojung Lee (Kyungpook National University)

Comparative forecasting of ICU capacity during the endemic in the capital area (Seoul) and Non-capital area (Daegu/Kyungpook)

Over the past few years, COVID-19 has been a global pandemic. The unexpected contagiousness of the disease has created many challenges, including a lack of policy and healthcare resources. With COVID-19 now transitioning to an endemic phase, it’s important to be prepared for new infectious diseases in the future. When the number of COVID-19 cases increases rapidly, it is important to predict the required number of intensive care unit (ICU) beds, especially to manage severe cases.

We aim to predict the number of ICU beds at hospitals according to the threshold of ICU capacity during COVID-19 epidemic and non-epidemic periods. We construct an SEIHR compartmental model to predict the demand for ICU beds in the Republic of Korea, considering vaccination and variants. The transmission parameters are estimated using epidemiological data such as COVID-19 cases and ICU occupancy. We explore the effect of ICU capacity on controlling severe cases. Moreover, we compare the COVID-19 transmission dynamics of mild cases, severe cases, and deaths by varying the required number of ICU beds when a new outbreak starts. The present study can contribute to managing ICU capacity in hospitals by using the model.

Finally, we will evaluate ICU patient capacity for non-capital areas by forecasting and comparing ICU patient capacity for the capital area (Seoul) and the non-capital area (Daegu). Forecasting healthcare resources such as ICU beds for non-capital areas is essential because healthcare facilities are concentrated in the metropolitan area. We predict the number of ICU beds needed in non-capital areas so that hospitals can be operated efficiently with economic costs taken into account.
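The SEIHR structure mentioned in this abstract can be illustrated with a minimal sketch. The code below is not the authors' model: it omits vaccination, variants, and deaths, uses a simple forward-Euler step, and every parameter name and value (`beta`, `sigma`, `gamma`, `eta`, `delta`, the initial state) is a hypothetical placeholder chosen only for illustration. The compartment H stands in for patients needing hospital or ICU care.

```python
def simulate_seihr(params, state, days, dt=0.1):
    """Forward-Euler simulation of a basic SEIHR model (illustrative only).

    params: dict with hypothetical daily rates
        beta  - transmission rate
        sigma - rate of leaving the exposed compartment (E -> I)
        gamma - recovery rate of mild cases (I -> R)
        eta   - rate at which infectious cases become severe (I -> H)
        delta - discharge rate of severe cases (H -> R)
    state: tuple (S, E, I, H, R) of initial compartment sizes
    Returns a list with one (S, E, I, H, R) tuple per day, including day 0.
    """
    beta, sigma, gamma, eta, delta = (
        params[k] for k in ("beta", "sigma", "gamma", "eta", "delta")
    )
    s, e, i, h, r = state
    n = s + e + i + h + r
    steps_per_day = round(1 / dt)
    trajectory = [(s, e, i, h, r)]
    for _ in range(days):
        for _ in range(steps_per_day):
            # All flows are computed from the state before it is updated.
            new_exposed = beta * s * i / n
            new_infectious = sigma * e
            new_severe = eta * i
            recovered_mild = gamma * i
            discharged = delta * h
            s -= new_exposed * dt
            e += (new_exposed - new_infectious) * dt
            i += (new_infectious - new_severe - recovered_mild) * dt
            h += (new_severe - discharged) * dt
            r += (recovered_mild + discharged) * dt
        trajectory.append((s, e, i, h, r))
    return trajectory


def days_over_capacity(trajectory, icu_capacity):
    """Return the days on which the H compartment exceeds a capacity threshold."""
    return [day for day, st in enumerate(trajectory) if st[3] > icu_capacity]


# Hypothetical values, not estimates from the study.
params = {"beta": 0.4, "sigma": 0.25, "gamma": 0.15, "eta": 0.01, "delta": 0.1}
trajectory = simulate_seihr(params, (999_990.0, 0.0, 10.0, 0.0, 0.0), days=200)
peak_day = max(range(len(trajectory)), key=lambda d: trajectory[d][3])
```

Comparing `days_over_capacity` for different thresholds is one simple way to explore how an assumed ICU capacity interacts with projected severe cases; a fitted model would replace the placeholder rates with values estimated from case and occupancy data.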

Takehiro Shoji (Doshisha University), Jun Tsuchida (Kyoto Women’s University) & Hiroshi Yadohisa (Doshisha University)

Adaptive Group Lasso for Doubly Robust Estimation of Quantile Treatment Effect

Estimating the treatment effect, that is, the impact of a treatment on an outcome, is important. The quantile treatment effect (QTE) helps evaluate the degree of effect on the lower and upper sides of the distribution. When either the model of the propensity score or the outcome is correctly specified, the doubly robust (DR) estimator of QTE is consistent. As with the average treatment effect, when unnecessary variables are included in the model of the propensity score or outcome in a DR estimator of QTE, the variance of the estimator may increase. Therefore, the covariates included in the models must be carefully selected. However, it is often difficult to predetermine the covariates necessary for a model. Data-driven covariate selection methods have been proposed to address this problem. For example, a method that applies an adaptive lasso penalty to the propensity score model, with the coefficients of quantile regression as weights, has been proposed. However, this method assumes that an estimated propensity score is used for IPW estimators and tends to miss the confounders that are weakly related to the outcome. We propose a data-driven covariate selection method for the DR estimators of QTE by simultaneously estimating the model parameters of the propensity score and the outcome regression. Specifically, we use an adaptive group lasso with the regression coefficients of the quantile regression as weights. The proposed method considers covariates that should be simultaneously included in both models; it is expected to enable the selection of the confounders that are weakly related to the outcome.

Jungbok Lee (Asan Medical Center and University of Ulsan College of Medicine)

Model Derived Estimate Based Determination of Cut Off Value in Predictive Scoring Model

For practical clinical application, various prediction models related to disease onset, risk, prognosis, or survival are being developed using EMR or clinical research data. A score is calculated through such a prediction model and used as a scale to present risk, and the score is often categorized for practical use. Determination of an appropriate cut off for score categorization has become an interesting topic. For example, in the case of a time to event outcome, the most intuitive method is to find a cut off that maximizes the log rank test statistic. However, the method using test statistics has a problem in that the cut off may vary depending on the distribution of the data set used for model building. In this study, we present a phenomenon in which the cut off varies when there are relatively many or few high (or low)-risk subjects in the training set, and we propose a method using a resampling technique based on estimates such as the log odds ratio and hazard ratio.

Daisuke Shimada (The University of Tokyo) & Kensuke Okada (The University of Tokyo)

Reliability Coefficients for Knowledge Tracing Models

The problem of efficiently tracking each student’s achievement within online learning systems is known as knowledge tracing (KT). To facilitate KT, a range of statistical models, known as KT models, have been proposed to analyze the data. To accurately track achievement, it is necessary to assess the precision of the results of the achievement estimates from the KT model used by practitioners. In this context, it becomes crucial to establish a high level of reliability, a fundamental concept in psychometrics that gauges the consistency of a measurement. However, the reliability of the KT model in measuring achievement has yet to be directly discussed, and no method for calculating reliability coefficients for the KT model from response data has been established. Therefore, we propose an approach to calculate reliability coefficients for two representative categories of the KT models: Bayesian knowledge tracing (BKT) and item response theory (IRT)-based KT. Specifically, we extend the reliability coefficients for the two major categories of the measurement model, diagnostic classification models (DCM) and IRT models (Templin & Bradshaw, 2013, J Classif), to time series data. Our proposed reliability coefficients can be interpreted as virtual test-retest reliability. In particular, it is a metric that indicates how consistent the measurement results are between these two measurements in a hypothetical situation where the same student responds to the same item twice independently at each time point. By applying the proposed method to actual response data, we demonstrate that this method enables us to evaluate the reliability of the KT model.
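For readers unfamiliar with BKT, the following minimal sketch shows the standard textbook BKT recursion that produces the achievement (mastery) estimates whose reliability the abstract is concerned with. It is not the authors' reliability coefficient, and the parameter names and values (`p_init`, `p_learn`, `p_guess`, `p_slip`) are hypothetical placeholders.

```python
def bkt_update(p_known, correct, p_learn, p_guess, p_slip):
    """One step of the standard Bayesian knowledge tracing recursion.

    p_known: prior probability that the skill is mastered
    correct: True if the observed response was correct
    p_learn: probability of acquiring the skill after a practice opportunity
    p_guess: probability of a correct response without mastery
    p_slip:  probability of an incorrect response despite mastery
    Assumes the observed response has non-zero probability under the model.
    """
    if correct:
        numerator = p_known * (1 - p_slip)
        denominator = numerator + (1 - p_known) * p_guess
    else:
        numerator = p_known * p_slip
        denominator = numerator + (1 - p_known) * (1 - p_guess)
    # Posterior mastery given the response, via Bayes' rule.
    posterior = numerator / denominator
    # Learning transition before the next opportunity.
    return posterior + (1 - posterior) * p_learn


def trace_mastery(responses, p_init, p_learn, p_guess, p_slip):
    """Return the mastery estimate after each response in a sequence."""
    estimates = []
    p_known = p_init
    for correct in responses:
        p_known = bkt_update(p_known, correct, p_learn, p_guess, p_slip)
        estimates.append(p_known)
    return estimates


# Hypothetical parameter values and response sequence.
estimates = trace_mastery(
    [True, False, True, True], p_init=0.3, p_learn=0.1, p_guess=0.2, p_slip=0.1
)
```

A reliability coefficient of the kind proposed in the abstract would then quantify how consistent such time-indexed estimates are across hypothetical repeated measurements.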

Dec 7. Poster Session 2

During Afternoon Tea Break @ 12WW Foyer

Jihyeon Kim (Kyungpook National University), Geunsoo Jang (Kyungpook National University), Hyojung Lee (Kyungpook National University)

Analysis of the impact of age-specific vaccination strategies on COVID-19 epidemic by region

COVID-19 has been prevalent since the first case was reported in 2020. Vaccination against COVID-19 has been implemented in Korea since February 26, 2021, and several orders of vaccination were given during the pandemic. From April 2023, the current strategy entails annual COVID-19 vaccine administration in Korea, considering the varying population demographics across different regions. Regions have different age-specific population distributions. We investigate effective vaccination strategies for each region, considering its population distribution by age.

We address several key research questions. First, we investigate which strategies are most effective in reducing the number of reported COVID-19 cases and associated deaths. Additionally, we determine which specific age groups should be prioritized for vaccination. In the present study, the total population is divided into 17 age groups with 5-year intervals. We develop an age-specific mathematical model that considers vaccine effects, death, and vaccination doses by age group. The transmission rates and case-fatality rates are estimated for each age group. We simulate various vaccination scenarios to identify effective vaccination strategies by age group and region. The comparative analysis is carried out by age group within each region, enabling us to identify the most effective course of action. The results could help identify the strategy that best minimizes both the number of cases and the number of deaths, compared by age group for each region.

Jeonghwa Seo (Kyungpook National University), Geunsoo Jang, Hyojung Lee

Prediction of the Outbreaks for Infectious Diseases with Seasonality

Norovirus is the most prevalent viral cause of gastroenteritis in humans. Its outbreaks commonly spread through groups, and no vaccine is available, underscoring the importance of preventive measures. We aim to develop a hybrid model that incorporates a statistical model into a machine learning method to improve the estimation of long-term outbreaks.

Our strategy involves the integration of meteorological data such as temperature and precipitation with epidemiological data. By comparing various models – including the SIRS compartmental model for mathematical representation, SARIMA for statistical analysis, and LSTM for machine learning – we seek to obtain more precise estimations for norovirus outbreak occurrences.

Building upon this foundation, we intend to extend our model's applicability by incorporating other seasonal infectious diseases such as influenza and scrub typhus. Through this comprehensive framework, we aim to identify the most effective model for predicting outbreaks of various seasonal infectious diseases. The outcomes of this study are expected to greatly contribute to better preparedness and preventive strategies against outbreaks of such seasonal infectious diseases.

Wataru Hongo (Tokyo University of Science & Novartis Pharma K.K.), Shuji Ando (Tokyo University of Science), Jun Tsuchida (Kyoto Women’s University) & Takashi Sozu (Tokyo University of Science)

Performance Evaluation of Augmented Inverse Propensity Weighted Estimators with Penalized Regression Methods

The inverse propensity weighted estimator with outcome-adaptive lasso (IPW-OAL), which is a penalized regression method for propensity score (PS) models, was proposed as an estimator for average treatment effects. Considering that OAL tends to select variables related to the outcome, the performance of IPW-OAL is usually superior to other IPW estimators when the PS and outcome models can be correctly specified. However, IPW-OAL can be biased due to misspecification of the PS or outcome regression model.

To address the shortcoming of IPW-OAL, the augmented IPW (AIPW) estimator can be employed using outcome regression models. However, the choice of penalized outcome regression models for the AIPW estimators with OAL remains unclear. In this study, we evaluated the performance of the AIPW estimators with OAL for a PS model and penalized outcome regression models that can be easily implemented. We employed penalized regression methods with and without the oracle property and evaluated their performance by numerical experiments with various combinations of sample size and number of covariates. Scenarios included situations where PS and/or outcome models could be correctly specified and those where they could not. The bias and variance of AIPW estimators with penalized regression methods with the oracle property for PS and outcome models (i.e., OAL for the PS model and penalized regression methods with the oracle property for the outcome models) were smaller than those of other estimators when the PS and/or outcome models could be correctly specified. The performance among the AIPW estimators differed minimally when using the penalized regression methods with the oracle property.
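For reference, the AIPW estimator of the average treatment effect discussed in this abstract is commonly written in the following standard form. The notation is an assumption introduced here, not taken from the abstract: $A_i$ is the binary treatment indicator, $Y_i$ the outcome, $X_i$ the covariates, $\hat{e}(X_i)$ the estimated propensity score, and $\hat{m}_a(X_i)$ the fitted outcome regression under treatment $a$. In the setting of the abstract, $\hat{e}$ would come from the PS model fitted with OAL and $\hat{m}_a$ from a penalized outcome regression.

$$
\hat{\tau}_{\mathrm{AIPW}} = \frac{1}{n} \sum_{i=1}^{n} \left[ \hat{m}_1(X_i) - \hat{m}_0(X_i) + \frac{A_i \{ Y_i - \hat{m}_1(X_i) \}}{\hat{e}(X_i)} - \frac{(1 - A_i) \{ Y_i - \hat{m}_0(X_i) \}}{1 - \hat{e}(X_i)} \right]
$$

The estimator is consistent when either the PS model or the outcome model is correctly specified, which is why the choice of outcome regression matters when IPW-OAL alone may be biased.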

Mamoru Okuda (Tokyo University of Science), Jun Tsuchida (Kyoto Women’s University) & Takashi Sozu (Tokyo University of Science)

Classification of Starters by Pitch Type and Stats in Nippon Professional Baseball

The objective of the pitcher in baseball is to prevent the opposing team from scoring runs. Because the pitcher’s performance has a considerable impact on the results of a game, many studies have been conducted to examine the factors associated with pitching stats and performance. Clustering plays an important role in finding these factors. Asano (2015) classified pitchers according to proportions of each pitch type to investigate the relationship between estimated clusters and the opponent’s batting average. Camp et al. (2020) classified pitch types into clusters to examine the relationship between pitch type and pitching stats. However, these studies failed to consider interactivity between pitchers and pitch types in the context of clustering. Because the speed and movement of the pitch may vary between pitchers with the same pitch type, the role of the pitch type also differs between pitchers. To effectively cluster pitchers, we must therefore simultaneously consider the features of each pitch type and each pitcher’s results for that pitch type. Although non-negative tensor factorization (NTF) can be employed to reflect the interactivity between pitch type and pitcher, the estimated clusters might not sufficiently relate to factors associated with the pitchers’ performance. Furthermore, it is difficult to define the characteristics associated with stats for each pitch type prior to data analysis. To solve this issue, we adapted supervised NTF for pitcher clustering. In addition to allowing us to examine the factors linked to pitcher performance while considering interactivity between pitch types and pitcher clusters, this method may also eliminate the instability inherent to clusters obtained by NTF. In this study, we demonstrate an application of supervised NTF to three years of data from the Nippon Professional Baseball Organization through the fiscal year 2022, with the pitch speed and swinging strike probability designated as features of pitch types.

Saeed Fayyaz

Improving Official Statistics Literacy Through Effective Data Visualization /Investigation on Polynomial Regression Model (Case study)

The polynomial regression model is useful when there is reason to believe that the relationship between two variables is curvilinear. This project investigates fish market data to fit the best (polynomial) regression model and find the important factors affecting fish weight. In the initial attempt, a linear regression was fitted. Notably, there was severe multicollinearity between the variables, and the results showed that the relationship between the dependent and independent variables may be polynomial. Then hierarchical quadratic and cubic regression models, with and without interactions between variables, were fitted. Also, to compare the models’ adequacy, the data were split into estimation (train) and prediction (test) groups, and the regression models were fitted to calculate goodness-of-fit indices. The results revealed that the quadratic form with three variables is the best polynomial regression model for this data.

Alternatively, PCA was deployed and a regression line was fitted. For the fish data, both a reduced polynomial model (the quadratic is the best) and a regression using new variables from PCA can be fitted. In the first approach, two variables were removed from the model; in the second approach, PCA was deployed to resolve the multicollinearity and an appropriate regression model was fitted. Based on the MSE, PRESS, and adjusted R-squared, the PCA approach has better performance and lower error, and it is recommended for this data.

Zara Mehitabel A. Lagrimas (University of Santo Tomas, Manila), Jazmin Yuri R. Ruma (University of Santo Tomas, Manila), Eduardo C. Dulay Jr. (University of Santo Tomas, Manila)

Multiple Linear Regression Analysis of the Factors Affecting Purchase Intention of Variable Universal Life Insurance Among Generation Z: A Market Research Approach

Variable Universal Life Insurance (VUL) is a popular life insurance policy in the Philippines that provides investment options and a death benefit. Insurance hesitancy is still present among Filipinos, suggesting that there needs to be more research on the factors that influence individuals’ purchase intention of VUL policies. This research aims to analyze the factors that affect Generation Z’s purchase intention of VUL in the National Capital Region. The study used a convenience sampling method and obtained responses from 385 individuals belonging to Generation Z. Multiple linear regression was employed to analyze the data and model the relationship between the purchase intention of VUL insurance and the independent variables. The results of the multiple regression indicated that the following were significant predictors of VUL purchase intention: age, an income/allowance range of Php30,001 - Php50,000, undergraduate educational attainment, mother’s undergraduate educational attainment, self-perceived health, experience in investing, life insurance awareness, and life insurance purpose (protection-driven or investment-driven). The study’s findings shed light on the key factors that insurers in the Philippines should consider when designing VUL policies and developing marketing strategies to promote VUL products for Generation Z consumers, so that more Filipino citizens will be insured and protected. Future studies should explore other possible factors and utilize different modeling techniques on VUL purchase intention.

Jiqiang Wang (Kyushu University) & Kei Hirose (Kyushu University)

SSVD Based Biclustering Methods via Mixed Prenet Penalty

Standard clustering methods typically group samples based on their entire set of observed features. In large datasets, however, only a few features may play a role in distinguishing different clusters.

In our research, we observed that certain biclusters produced by the algorithm can be excessively similar, meaning that they have a high degree of repetition (overlap). Such redundancy can pose challenges to our analysis because it is difficult to identify the useful variables. On the other hand, these repeated portions may also contain valuable information. A simple prohibition or allowance of repetition is therefore not sufficient. We need a method to identify when it is necessary to retain duplicated parts. In our study, we improved the SSVD (Sparse Singular Value Decomposition) by proposing a mixed Prenet penalty (a hybrid of the Prenet (product-based elastic net) penalty and the original elastic net penalty) to replace the original adaptive Lasso penalty in the SSVD method.

The Prenet penalty was originally proposed by Hirose and Terada (2022) to deal with the loading matrix in factor analysis. It is based on the product of a pair of elements in each row of the loading matrix. The Prenet not only shrinks some of the factor loadings toward exactly zero but also enhances the simplicity of the loading matrix, which plays an important role in the interpretation of the common factors. However, in our experience the original Prenet penalty by itself cannot provide a good clustering result, so we extended it to make it compatible with the general elastic net and to allow users to easily control the threshold for allowing overlap by adjusting parameter values. This improvement has a noticeable effect on reducing the degree of dummy overlapping.

Arezoo Orooji (Mashhad University of Medical Sciences), Mohammad Hosseini-Nasab Kei Hirose (Shahid Beheshti University), Hassan Doosti (Macquarie University), Humain Baharvahdat (Mashhad University of Medical Sciences)

Estimation of Historical Functional Linear Model for Subarachnoid Hemorrhage Data

Functional data refers to data with a substantial number of dense repeated measurements for each subject. The underlying philosophy of functional data analysis involves interpreting the numerous repetitions of measurements for each individual as a stochastic process. In the analysis of functional data, different models can be employed based on the nature of the predictor and response variables. One of these models is the historical functional linear model. Aneurysmal subarachnoid hemorrhage (aSAH) is a severe disease with a high mortality rate. In the context of aSAH, intracranial pressure (ICP) plays a critical role in the patient’s status. Therefore, it is crucial to explore the factors influencing intracranial pressure curves. Recognizing that the variables change over time, they are treated as functions of time, and a historical functional linear model is employed for data analysis. Our study involved a cohort of 24 patients diagnosed with aneurysmal subarachnoid hemorrhage, selected from individuals seeking medical care at Ghaem hospital in Iran between March 2020 and May 2022. Among the 24 patients, 14 (58 percent) were female. The patients’ average age was 58.08 (16.86) years. The findings suggested that intervention curves and diastolic blood pressure were significant factors contributing to the elevation of intracerebral pressure. Although the sodium level was not significant for almost the whole period, its effect might become significant with a larger number of repeated measurements.