Ciclo de Palestras 2021 - 1


Factor analysis is a popular method for modeling dependence in multivariate data. However, determining the number of factors and obtaining a sparse orientation of the loadings are still major challenges. In this paper, we propose a decision-theoretic approach that brings to light the relation between a sparse representation of the loadings and the factor dimension. This relation is made explicit through a summary of the information contained in the multivariate posterior. To construct such a summary, we introduce a three-step approach. In the first step, the model is fitted with a conservative factor dimension. In the second step, a series of point estimates with a decreasing number of factors is obtained by minimizing an expected predictive loss function. In the third step, the degradation in utility associated with the sparse loadings and factor dimensions is displayed in the posterior summary. The findings are illustrated with a simulation study and an application to personality data. We use different prior choices and factor dimensions to demonstrate the flexibility of the proposed method. This is joint work with
Henrique Bolfarine (USP), Carlos Carvalho (UT Austin) and Jared Murray (UT Austin).


We revisit the stochastic volatility in mean (SVM) model proposed by Koopman and Uspensky (2002). We propose approximating the likelihood function of the SVM model by applying Hidden Markov Model (HMM) machinery, making real-time Bayesian inference possible. We sample from the posterior distribution of the parameters using importance sampling (IS), with a multivariate normal proposal whose mean is the posterior mode and whose covariance is the inverse of the Hessian matrix evaluated at that mode. We conduct a simulation study to verify the frequentist properties of the estimators. An empirical analysis of five Latin American indexes examines the impact of volatility on the mean of the returns. The results indicate that volatility negatively impacts returns, suggesting that the volatility feedback effect is stronger than the effect related to the expected volatility. This result is exactly opposite to the finding of Koopman and Uspensky (2002). We compare our methodology with the Hamiltonian Monte Carlo (HMC) and Riemannian HMC methods based on Abanto-Valle et al. (2021).
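The importance-sampling step described above can be sketched as follows. This is an illustrative outline, not the authors' code: the proposal is a multivariate normal centred at the posterior mode with covariance equal to the inverse Hessian at that mode, and `log_posterior` stands for a hypothetical user-supplied (unnormalized) log-posterior function.

```python
import numpy as np

def importance_weights(log_posterior, mode, hessian, n_draws=5000, seed=0):
    """Draw from a Gaussian proposal at the posterior mode and return
    the draws together with their normalized importance weights."""
    rng = np.random.default_rng(seed)
    cov = np.linalg.inv(hessian)                      # proposal covariance
    draws = rng.multivariate_normal(mode, cov, size=n_draws)
    diff = draws - mode
    _, logdet = np.linalg.slogdet(cov)
    d = mode.size
    # log-density of the Gaussian proposal at each draw
    log_q = -0.5 * (np.einsum('ij,jk,ik->i', diff, hessian, diff)
                    + logdet + d * np.log(2 * np.pi))
    log_w = np.array([log_posterior(t) for t in draws]) - log_q
    w = np.exp(log_w - log_w.max())                   # stabilise before normalising
    return draws, w / w.sum()
```

Posterior expectations are then estimated as weighted averages of the draws, which is the standard IS estimator.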

Watch the lecture on YouTube

The semiparametric Cox regression model is often fitted in the modeling of survival data. One of its main advantages is ease of interpretation, as long as the ratio of the hazard rates for two individuals does not vary over time. However, the proportionality assumption for the hazards may not hold in some situations. In addition, in many survival datasets it is common to find a proportion of units not susceptible to the event of interest, even when followed for a sufficiently long time; these units are said to be immune, “cured,” or not susceptible to the event of interest. In this context, several cure rate models are available to deal with long-term survivors. Here, we consider the generalized time-dependent logistic (GTDL) model with a power variance function (PVF) frailty term introduced in the hazard function to control for unobservable heterogeneity in patient populations. Our approach enables us to accommodate unobservable heterogeneity and non-proportional hazards, as well as survival data with long-term survivors. Its practical relevance is illustrated with a real medical dataset from a population-based study of incident cases of melanoma diagnosed in the state of São Paulo, Brazil.

Watch the lecture on YouTube

A Multiregression Dynamic Model (MDM) is a class of multivariate time series models that represents various dynamic causal processes in a graphical way. One of the advantages of this class is that, in contrast to many other Dynamic Bayesian Networks, the hypothesised relationships accommodate conditional conjugate inference. We demonstrate how straightforward it is to search over all possible connectivity networks with dynamically changing intensity of transmission to find the Maximum a Posteriori Probability (MAP) model within this class. We proceed to show how diagnostic methods, analogous to those defined for static Bayesian Networks, can be used to suggest embellishment of the model class to extend the process of model selection. All methods are illustrated using simulated and real resting-state functional Magnetic Resonance Imaging (fMRI) data.

Watch the lecture on YouTube

Factor analysis is a flexible technique for the assessment of multivariate dependence and codependence. Besides being an exploratory tool used to reduce the dimensionality of multivariate data, it allows estimation of common factors that often have an interesting theoretical interpretation in real problems. However, standard factor analysis is only applicable when the variables are measured on a continuous scale, which is often inappropriate, for example, in data obtained from questionnaires in the field of psychology, where the variables are often categorical. In this framework, we propose a factor model for the analysis of multivariate ordered and non-ordered polychotomous data. Inference is carried out under the Bayesian approach via Markov chain Monte Carlo methods. Two Monte Carlo simulation studies are presented to investigate the performance of this approach in terms of estimation bias, precision and assessment of the number of factors. We also apply the proposed method to analyze participants’ responses to the Motivational State Questionnaire dataset, developed to study emotions in laboratory and field settings.

Watch the lecture on YouTube

The coronavirus disease (COVID-19) pandemic continues to impose a massive burden worldwide, especially in countries such as Brazil, with poor implementation of strategies to mitigate the transmission of SARS-CoV-2. The numbers of cases, severe cases, and deaths by COVID-19 are important indicators of how the COVID-19 epidemic is affecting a particular region and can be used by decision-makers to act in order to reduce morbidity and mortality. However, a common problem with surveillance data is reporting delays, whereby cases and deaths are recorded in the surveillance system days or even weeks after they occurred. Statistical models can estimate the actual numbers of cases, severe cases, and deaths by COVID-19 while accounting for the delays (nowcasting). We propose a Bayesian hierarchical model to nowcast deaths and hospitalised cases for Brazil as a whole and for each of the 27 federal units. Finally, we provide some general discussion about the COVID-19 situation in Brazil.

Watch the lecture on YouTube
05/05 (this lecture will exceptionally take place at 4 p.m. and is part of the event I Encontro de Mulheres na Estatística e Ciência de Dados)

Respondent-Driven Sampling (RDS) is a sampling method devised to overcome the challenges of sampling hard-to-reach human populations. The sampling starts with a limited number of individuals who are asked to recruit a small number of their contacts. Every surveyed individual is subsequently given the same opportunity to recruit additional members of the target population until a pre-established sample size is achieved. The recruitment process consequently implies that the survey respondents are responsible for deciding who enters the study. Most RDS prevalence estimators assume that participants select among their contacts completely at random. The main objective of this work is to correct the inference for departures from this assumption, such as systematic recruitment based on the characteristics of the individuals or on the nature of their relationships. To accomplish this, we introduce three forms of non-random recruitment, provide estimators for these recruitment behaviors, and extend three estimators and their associated variance procedures. The proposed methods are applied to a public health setting.


We discuss Bayesian foundations for the estimation of the Average Treatment Effect (ATE) in the scenario of multilevel observations and in the presence of confounding. Confounding occurs when a set of covariates (confounders) impacts exposure and outcome simultaneously. In particular, we focus on scenarios where the set of confounders may include unobserved ones. We study the situation wherein multiple observations are made at a given location (e.g. individuals living across cities of a state). We explore the use of the propensity score approach through covariate adjustment to provide balancing of the treatment allocation (exposure). We discuss, through different simulation studies, the need to include location-level random effects in the propensity score model to reduce bias in the estimation of the ATE. We also explore different prior specifications for the location-level random effects. Our motivating example concerns the effectiveness of directly observed therapy (DOT) in the treatment of tuberculosis (TB) for individuals who had TB across cities of the state of São Paulo, Brazil, in 2016. This is joint work with Alexandra M. Schmidt, Erica E. M. Moodie and David A. Stephens.

Watch the lecture on YouTube

Election polling poses a unique applied statistical challenge, as it is one of the very few situations in which the finite population parameters become known shortly after the estimation occurs. In recent years, election polls have been called into question given their performance in major elections, such as the 2016 U.S. Presidential Election. Understanding what went wrong and finding solutions to these problems is of utmost importance for the survival of this important tool in democracies around the world.

Statisticians are generally trained to account for sampling errors in surveys, but there are other sources of error that can have a larger impact on the quality of the survey estimates. The Total Survey Error (TSE) provides a good framework to understand these sources of error and how to address them through design or statistical adjustments. In this presentation, I will review the different sources of errors in election polls from a TSE perspective and present a few case-studies from past election polls to understand how these errors impacted their results.

Watch the lecture on YouTube

Spatio-temporal processes in environmental applications are often assumed to follow Gaussian models, with or without particular transformations. However, heterogeneity in space and time may have patterns that are not accommodated by transforming the data in question. In such scenarios, modelling the variance is paramount. The methodology presented in this paper adds flexibility to the usual dynamical Gaussian model by defining the studied process as a scale mixture of a Gaussian process and a log-Gaussian one. The scale is represented by a process varying smoothly over space and time. State-space equations drive the dynamics over time for both the response and the variance processes, resulting in more computationally efficient estimation and prediction. Two applications are presented: the first models the maximum temperature in the Spanish Basque Country, and the second models ozone levels in a UK dataset. They illustrate the effectiveness of our proposal in modelling variances that vary over both time and space.

Jointly with: Thais C. O. Fonseca (DME/UFRJ, Brazil) and Alexandra M. Schmidt (McGill University, Canada)
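The scale-mixture construction can be illustrated with a minimal simulation sketch. This is an assumption-laden toy version (purely temporal, random-walk state equations, names invented here), not the authors' model: a Gaussian mean process is modulated by a log-Gaussian variance process, both evolving through state-space equations.

```python
import numpy as np

def simulate_scale_mixture(T=200, sigma_mu=0.1, sigma_h=0.05, seed=0):
    """Toy scale mixture: Gaussian mean state + log-Gaussian variance state."""
    rng = np.random.default_rng(seed)
    mu = np.cumsum(sigma_mu * rng.standard_normal(T))   # mean state equation
    h = np.cumsum(sigma_h * rng.standard_normal(T))     # log-variance state
    lam = np.exp(h)                                     # log-Gaussian scale process
    y = mu + np.sqrt(lam) * rng.standard_normal(T)      # observation equation
    return y, mu, lam
```

The smoothly varying `lam` plays the role of the scale process, letting the observation variance evolve over time rather than staying fixed.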

Watch the lecture on YouTube

This work proposes a very simple extension of the usual fully connected hidden layers in deep neural networks for classification. The objective is to transform the latent space of the hidden layers to be more suitable for the linear separation that occurs in the sigmoid/softmax output layer. We call such architectures radial neural networks because they use projections of fully connected hidden layers onto the surface of a hypersphere. We provide a geometrical motivation for the proposed method and show that it helps achieve convergence faster than the analogous architectures it is built upon. As a result, we can significantly reduce training time for classification neural networks that use fully connected hidden layers. The method is illustrated with an application to image classification, although it can be used for other classification tasks.
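The core operation — projecting a hidden layer's activations onto the surface of a hypersphere — can be sketched as below. This is a minimal illustration of the geometric idea, assuming a batch of activation vectors and a fixed radius; the function name and defaults are ours, not from the talk.

```python
import numpy as np

def radial_projection(h, radius=1.0, eps=1e-12):
    """Project each row of activations `h` onto a hypersphere of given radius.

    Each activation vector is rescaled to unit length and then multiplied
    by the radius, so all outputs lie on the sphere's surface.
    """
    norms = np.linalg.norm(h, axis=1, keepdims=True)
    return radius * h / (norms + eps)
```

In a network this would be applied after a fully connected layer and before the next layer (or the softmax output), so that only the direction of each hidden representation, not its magnitude, is passed on.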

Watch the lecture on YouTube

The main goal of this work is to compare the skewness and kurtosis of continuous distributions. It compares their moment skewness and kurtosis as well as their centile skewness and kurtosis. It shows the flexibility in skewness and kurtosis of different continuous distributions (within the gamlss R package), which is informative for the selection of an appropriate distribution. It introduces the bucket plot, a visual tool for detecting skewness and kurtosis in a continuously distributed random variable. Joint work with R. A. Rigby, D. M. Stasinopoulos, G. Z. Heller and L. A. Silva.
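The contrast between moment-based and centile-based shape measures can be sketched as follows. The centile definitions below (Galton's quartile skewness and a tail-to-core quantile-spread ratio for kurtosis) are common choices used here for illustration; the talk's exact definitions may differ.

```python
import numpy as np

def moment_skew_kurt(x):
    """Moment-based skewness and kurtosis (standardized third and fourth moments)."""
    z = (x - x.mean()) / x.std()
    return (z**3).mean(), (z**4).mean()

def centile_skew_kurt(x):
    """Centile-based skewness (Galton) and a quantile-spread kurtosis measure."""
    q01, q25, q50, q75, q99 = np.quantile(x, [0.01, 0.25, 0.50, 0.75, 0.99])
    skew = ((q75 - q50) - (q50 - q25)) / (q75 - q25)  # Galton's skewness
    kurt = (q99 - q01) / (q75 - q25)                  # tail spread / core spread
    return skew, kurt
```

Centile measures are robust to extreme observations, which is one reason for comparing the two families when choosing a distribution.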


Multivariate linear regression models are formed by a vector of responses (variables of interest), a set of explanatory variables, a linear predictor formed by a linear combination of these explanatory variables and regression coefficients, and a random component that connects the systematic part to the response vector. Various experimental or observed phenomena in nature generate data with asymmetric behavior and/or heavy tails, such as phenotypic measurements in athletes and rainfall, among others. Thus, the usual hypothesis of normality of the data is relaxed by using a more general class of distributions that incorporates asymmetry and heavy tails and includes as particular cases the normal distribution, as well as other symmetric/asymmetric distributions. In this lecture, we will discuss some of these classes of distributions, highlighting their properties, parameter estimation methods, and applications to real data.

Watch the lecture on YouTube

Spatial confounding is defined as the confounding between the fixed and spatial random effects in generalized linear mixed models (GLMMs). It has gained attention in recent years, as it may generate unexpected results in modeling. We introduce solutions to alleviate spatial confounding beyond GLMMs for two families of statistical models. In shared component models, multiple count responses are recorded at each spatial location and may exhibit similar spatial patterns; the spatial effect terms may therefore be shared between the outcomes in addition to outcome-specific spatial patterns. Our proposal relies on the use of modified spatial structures for each shared component and specific effect. Spatial frailty models can incorporate spatially structured effects, and it is common to observe more than one sample unit per area, which means that the supports of the fixed and spatial effects differ. Thus, we introduce a projection-based approach for reducing the dimension of the data. An R package named “RASCO: An R package to Alleviate Spatial Confounding” is provided. Cases of lung and bronchus cancer in the state of California are investigated under both methodologies, and the results demonstrate the efficiency of the proposed methodology.

Watch the lecture on YouTube

The recent literature contains many alternative proposals for modeling and estimating a smooth function. In this talk, I focus on variants of smoothing splines called penalized regression splines. This is an attractive approach to modeling nonlinear smoothing effects of covariates. This study discusses knot selection, and a penalty is introduced to control it. The approach is a full Bayesian Lasso with variational inference. Choosing the appropriate number of knots and their positions is a difficult problem, so we propose a two-step procedure: 1. for a fixed number of knots, we use a full Bayesian Lasso, which combines features of shrinkage and variable selection, to obtain the relevant knots; 2. the number of knots is chosen based on the evidence lower bound (ELBO) over a grid of values.
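The basis underlying penalized regression splines can be sketched as a truncated power basis built at a grid of candidate knots. This is a minimal illustration (function name and defaults are ours): the Bayesian Lasso described above would then shrink the knot coefficients, effectively selecting the relevant knots, a step not reproduced here.

```python
import numpy as np

def truncated_power_basis(x, knots, degree=1):
    """Design matrix for a truncated power basis: polynomial columns
    plus one truncated term max(x - k, 0)^degree per candidate knot."""
    cols = [x**p for p in range(degree + 1)]                 # polynomial part
    cols += [np.maximum(x - k, 0.0)**degree for k in knots]  # knot terms
    return np.column_stack(cols)
```

With this design matrix, step 1 of the procedure amounts to a sparse regression over the knot columns, and step 2 repeats the fit across a grid of knot counts, keeping the one with the highest ELBO.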

Watch the lecture on YouTube