Journal Volume: 68      No.: 2     Year: 2014
S.No Title Abstract Download
1 Some Analyses of the Interaction among Local Field Potentials and Neuronal Discharges in a Mouse using Mutual Information
Author: Apratim Guha      Pages: 117-129
Constructing models for neuroscience data is a challenging task, more so when the data sets are of a hybrid nature, for which very little work exists. As a first step, here we introduce a technique based on mutual information to look at bivariate hybrid time series data from the field of neuroscience. As an example, we use a data set on the local field potentials (a continuous time series) and nerve cell firings (a point process) of anaesthetized mice. We explore data-driven confidence bounds for the mutual information statistics and discuss a test of independence between the two components of the hybrid process. A comparative study with the findings from some spectral domain methods is also discussed. It is found that the mutual information, as a time domain tool, complements the spectral domain methods. Keywords: Hybrid process, Time series, Point process, Mutual information, Independence of time series components, Coherence, Phase.
2 Influence of Measures of Significance based Weights in the Weighted Lasso
Author: Tanya P. Garcia and Samuel Miller      Pages: 131-144
When some of the regressors can act on both the response and some of the other explanatory variables, the already challenging problem of selecting variables in a p > n context becomes more difficult. A recent methodology for variable selection in this context links the concept of q-values from multiple testing to the weighted Lasso. In this paper, we show that informative measures of significance other than q-values, such as partial correlation coefficients or Benjamini-Hochberg adjusted p-values, perform as promisingly as q-values. Keywords: Adjusted p-values, Convex optimization, Partial correlation coefficients, q-values, Variable selection, Weighted Lasso.
3 Generic Feature Selection with Short Fat Data
Author: B. Clarke and J.H. Chu      Pages: 145-162
Consider a regression problem in which there are many more explanatory variables than data points, i.e., p >> n. Essentially, without reducing the number of variables, inference is impossible. So, we group the p explanatory variables into blocks by clustering, evaluate statistics on the blocks and then regress the response on these statistics under a penalized error criterion to obtain estimates of the regression coefficients. We examine the performance of this approach for a variety of choices of n, p, classes of statistics, clustering algorithms, penalty terms, and data types. When n is not large, the discrimination over the number of statistics is weak, but computations suggest regressing on approximately [n/K] statistics where K is the number of blocks formed by a clustering algorithm. Small deviations from this are observed when the blocks of variables are of very different sizes. Larger deviations are observed when the penalty term is an Lq norm with high enough q.
Keywords: Large p small n, LASSO, Ridge, Bridge, Clustering, Variance-bias tradeoff, Summary statistics.
4 Variable Selection and Shrinkage via a Conditional Likelihood-based Penalty
Author: Arpita Ghosh, Andrew B. Nobel, Fei Zou and Fred A. Wright      Pages: 227-236
The usefulness of penalized regression to analyze large datasets is increasingly recognized, with a growing role in genome-wide association scans and in the analysis of data from other -omics technologies. Penalized regression has been applied to data in fields as diverse as health sciences, economics, and finance. We investigate connections between procedures to address "significance bias" or "winner's curse" in genome-wide association studies and the shrinkage of coefficient estimates and variable selection that is applied in existing penalized regression procedures. We use a conditional likelihood approach that has been applied to correct for significance bias in order to propose a new penalized regression procedure. The approach has a natural interpretation when the number of predictors is smaller than the sample size. In addition, we describe an analogous procedure when the number of predictors is larger than the sample size. We demonstrate via data examples and simulations that the procedure performs favorably in terms of prediction error in both low-dimensional and high-dimensional settings in comparison to competing approaches, especially when the proportion of true nonzero coefficients is small. Keywords: Variable selection, Shrinkage, Penalized regression, Conditional likelihood, Significance bias, Winner's curse.
5 A Modified Theta-logistic Model with Cooperation for Understanding Species Extinction
Author: Amiya Ranjan Bhowmick, Bapi Saha, Joydev Chattopadhyay and Sabyasachi Bhattacharya      Pages: 163-179
In general, population growth models account for two apparently opposite forces: (1) the natural proclivity of the species population for exponential growth, and (2) a negative density-dependent feedback governed by the environmental carrying capacity. However, cooperation amongst conspecifics is a third factor that enhances population growth, and it is generally ignored in currently available growth models. We consider cooperation as a fundamental aspect of population growth along with the other two factors and propose an extended family of generalized logistic growth models. Modified versions of the proposed model are also discussed for the case when cooperation is feeble. We consider stochastic counterparts of the models incorporating demographic noise to estimate the extinction measures, namely the probability of extinction and the expected time to extinction. Parameters of the proposed model are estimated using both simulated data and real-life data from the Global Population Dynamics Database, and their significance is justified in an ecological context. We develop an inferential procedure to compare the biotic potential (maximum per capita growth rate) of two populations. Our analysis can have an impact on understanding extinction patterns and enables us to identify demographic threats, which informs decision making for conservation management. Keywords: Allee effect, Grid search, Weighted least squares, Biotic potential, Global population dynamics.
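For orientation, the standard theta-logistic model named in the title is usually written as below. This is the textbook form only; the authors' modified model with a cooperation term is not given in the abstract and is not reproduced here. The symbols N (population size), r (intrinsic growth rate), K (carrying capacity) and θ (shape parameter) are the conventional ones and are assumptions as far as this listing is concerned.

```latex
\[
  \frac{dN}{dt} = r\,N\left[1 - \left(\frac{N}{K}\right)^{\theta}\right],
  \qquad
  \frac{1}{N}\frac{dN}{dt} = r\left[1 - \left(\frac{N}{K}\right)^{\theta}\right].
\]
```

In this standard form the per capita growth rate is largest as N approaches 0, where it equals r, and setting θ = 1 recovers the ordinary logistic model.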
6 A Novel Metric Distance on Registered Curves with Application to a Fourier Transform-infrared Spectroscopy Analysis of Maize
Author: Yishi Wang, Susan J. Simmons, Latasha L. Smith and Ann E. Stapleton      Pages: 181-190
Registered curves containing a variety of information are becoming more and more frequent in natural sciences research. To date, most statistical analyses of such curves use only a portion of the information contained within them. In order to utilize information across the entire spectrum of the curve, we propose to consider a shape and/or magnitude distance that measures the similarity of these non-smooth functional curves, and an objective function to compare the effectiveness of different distance measures. Once a similarity/dissimilarity matrix is obtained, various statistical properties can be ascertained about the relationship between two or more curves. Herein, we develop an approach that can identify the most effective distance measure and apply it to an analysis of maize seed Fourier transform-infrared spectroscopy (FT-IR) spectral data. Dimension reduction techniques, such as multi-dimensional scaling (MDS), are then applied to represent the original curves in a lower dimensional space. Keywords: Functional curve, Dimension reduction, Multidimensional scaling, Procrustes distance.
7 Outlier Detection through Independent Components for Non Gaussian Data
Author: Asis Kumar Chattopadhyay and Saptarshi Mondal      Pages: 237-244
Observations lying "far away" from the main part of a data set and probably not following the assumed model may be termed outliers. It is clear from this definition that a significant presence of outlying observations may lead to erroneous results, which in turn affects the statistical analysis of the data. A natural consequence is the need to identify outliers and eliminate them from the data set. Standard outlier detection techniques often fail to detect true outliers for massive, high dimensional and non Gaussian data sets. Several authors have proposed different methods for this purpose. Filzmoser et al. (2008) proposed an algorithm using the properties of Principal Components to identify outliers in the transformed space. In the present work this method is modified by using Independent Components, which is necessary for dealing with non Gaussian data. First the dimension is reduced through Independent Component Analysis, and then the proposed method is applied in the reduced space in order to identify the outliers. The utility of the proposed method has been verified through massive, non Gaussian simulated data as well as real astronomical data related to Globular clusters of the Galaxy NGC 5128. Keywords: Outlier identification, Independent components, Simulation, Globular clusters, Galaxy.
8 Big Data Comes to Functional Neuroscience: New Vistas for Statisticians
Author: Mark Reimers      Pages: 293-301
High-throughput data from new imaging technologies is coming to experimental neuroscience, as big data came to genomics a decade earlier. This represents a significant opportunity for new methodology in statistics. However, many of the challenges facing statisticians will be quite different from those in genomics. This talk will introduce some issues in the analysis of high-throughput functional neuroscience data, and illustrate them with recently published work, mostly drawn from animal studies. First, many new technologies are burdened by significant noise, so signal extraction techniques need development. Classical statistical dimension reduction strategies seem to capture very limited fractions of variance in neuroscience data, and yet multivariate predictions and decoding have yielded some biological insight. Some alternative multivariate strategies have been proposed, but none are entirely satisfactory. How to characterize plasticity from neural activity data remains unclear. Finally, we may anticipate a convergence of theoretical neuroscience with detailed experimental observations, as the heretofore unobservable dynamics of neural networks become visible. This emerging field presents an exciting opportunity to statisticians who are willing to learn neuroscience and engage with the field's questions. Keywords: Neuroscience, Big data, Time series.
9 Multiple Hypothesis Testing: A Review
Author: Stefanie R. Austin, Issac Dialsingh and Naomi S. Altman      Pages: 303-314
Simultaneous inference was introduced as a statistical problem as early as the mid-twentieth century, and it has been recently revived due to advancements in technology that result in the increasing availability of data sets containing a high number of variables. This paper provides a review of some of the significant contributions made to the field of multiple hypothesis testing, and includes a discussion of some of the more recent issues being studied. Keywords: Family-wise error rate, FWER, False discovery rate, FDR, Discrete test, Adaptive FDR, Simultaneous testing, Simultaneous inference.
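As an illustration of one procedure in the false discovery rate family named in the keywords, the following is a minimal sketch of Benjamini-Hochberg adjusted p-values. It is not taken from the paper; the function name `bh_adjust` and the sample p-values are hypothetical, chosen only for this example.

```python
def bh_adjust(pvalues):
    """Return Benjamini-Hochberg adjusted p-values in the input order.

    Illustrative sketch only; `bh_adjust` is a hypothetical helper name.
    """
    m = len(pvalues)
    # Indices of the p-values sorted from smallest to largest.
    order = sorted(range(m), key=lambda i: pvalues[i])
    adjusted = [0.0] * m
    running_min = 1.0
    # Walk from the largest p-value (rank m) down to the smallest (rank 1),
    # keeping a running minimum so the adjusted values are monotone in rank.
    for rank in range(m, 0, -1):
        i = order[rank - 1]
        running_min = min(running_min, pvalues[i] * m / rank)
        adjusted[i] = running_min
    return adjusted


if __name__ == "__main__":
    # Hypothetical p-values from four tests.
    print(bh_adjust([0.01, 0.04, 0.03, 0.005]))
```

A hypothesis is rejected at FDR level alpha when its adjusted p-value is at most alpha; the running minimum is what makes this equivalent to the usual step-up rule.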
10 Back Titles
Author: ISAS      Pages: 4
11 Hindi Supplement
Author: ISAS      Pages: 315-321
12 Preface
Author: ISAS      Pages: 1
13 Statistical Challenges in Analysing Large Longitudinal Patient-level Data: The Danger of Misleading Clinical Inferences with Imputed Data
Author: Gijo Thomas, Kerenaftali Klein and Sanjoy K. Paul      Pages: 191-201
Large patient-level longitudinal databases play a crucial role in providing the evidence base for identifying pathways to optimal health outcomes, either informing effective prevention strategies or optimal clinical interventions. However, there are inherent complex challenges for valid statistical analyses of such data for robust assessment of risk factors and health outcomes. The longitudinal data often have a non-trivial amount of missing data with complex missing patterns. Many of the risk factors are also measured with errors. These crucial issues are often ignored in standard analyses, which often leads to biased estimates and misleading clinical or epidemiological inferences. These issues are addressed in this study, along with an empirical assessment of how different imputation techniques for missing data could affect the clinical inferences. Simulated longitudinal data on systolic blood pressure (SBP) conditional upon the long-term macrovascular events (MVE) were generated following the risk factors' distributions observed in the BP arm of the ADVANCE clinical trial. Missing data on longitudinal SBP measures were created following a random missing pattern. The effects of the dynamic changes in SBP over time on the risk of MVE were evaluated using complete as well as multiply imputed missing data sets. The performances of multiple imputations by Multivariate Normal Imputation and Fully Conditional Specification were compared with the analysis of complete data in relation to the consistency of clinical inferences. The trajectories of longitudinal measures of BP appeared to be significantly different when compared between two sets of multiply imputed data and the original complete data.
Although the clinical inferences in relation to the assessment of the effects of higher levels of BP over time on the risk of MVE were not contradictory between complete and imputed data sets, the multiple imputations of missing data could potentially misrepresent the true trajectory of SBP over time. This exploratory study clearly suggests the need for further methodological assessments of imputation techniques for missing data when dealing with large patient-level longitudinal data. Keywords: Electronic clinical data, Longitudinal data, Missing data analysis, Multiple imputation, Survival analysis.
14 Soil Property Estimation and Design for Agroecosystem Management using Hierarchical Geospatial Functional Data Models
Author: Christopher K. Wikle, Scott H. Holan, Kenneth A. Sudduth and D. Brenton Myers      Pages: 203-216
Sustainable agriculture requires a site-specific approach to address crop management problems and environmental degradation processes that are spatially and temporally variable. These issues lead to production losses (water stress, low fertility, pest problems), soil degradation (erosion, soil organic carbon losses, compaction), and water quality degradation (sediment, nutrients, agrochemicals) - often at the sub-field scale. Management solutions must be implemented at the resolution of the problems; however, changes require information on the magnitude and extent of the issue. Unfortunately, landscape processes and properties can change at a finer spatial resolution than can be practically analyzed with lab methods due to time and cost of sampling and analysis. Thus, it is increasingly important to augment lab methods with field-sensor methods that can accurately characterize within-field variability at a more reasonable cost and with reliability and timeliness. These instruments can produce large data profiles and require calibration and prediction methods that can accommodate "big data." We consider a functional spatial approach to perform calibration, spatial prediction, and design in this big data context. Specifically, using hierarchical Bayesian methodology we develop a signal/feature extraction approach for visible and near-infrared (VNIR) spectroscopic data that facilitates prediction of cation exchange capacity (CEC) over space. This methodology is also used to develop optimal spatial sampling locations to minimize the mean squared prediction error corresponding to a predicted spatial surface of this CEC response variable. Keywords: Adaptive design, Bayesian, DRS, Functional data, Optimal spatial design, Principal components, Stochastic search variable selection, VNIR.
15 A Multivariate Normal Block Versus a Principal Components Approach: Competing Strategies for Multiple Testing in a Genome-wide Case-control Association Framework
Author: Arunabha Majumdar and Saurabh Ghosh      Pages: 217-225
Genome-wide association studies have been partially successful in identifying novel variants involved in complex disorders. However, correcting for multiple testing in such studies is necessary to maintain the appropriate overall false positive error rate. In this article, we consider a block-wise strategy, MVNblock, for multiple testing correction based on an asymptotic multivariate normal framework for performing tests of association at correlated SNPs in a case-control study design. We investigate a few of its important theoretical properties and, using extensive simulations, compare its performance with a principal components analysis (PCA) based approach, simpleM. We find that MVNblock behaves less conservatively than simpleM with respect to controlling the FWER. Moreover, MVNblock consistently produces a lower estimate of the effective number of independent SNPs compared to simpleM, and hence is expected to produce higher power compared to simpleM. Keywords: Genome-wide association analyses, Family-wise error rate, Linkage disequilibrium.
16 Unconstrained Bayesian Model Selection on Inverse Correlation Matrices with Application to Sparse Networks
Author: Nitai D. Mukopadhyay and Sarat C. Dass      Pages: 245-255
Bayesian statistical inference for an inverse correlation matrix is challenging due to non-linear constraints placed on the matrix elements. The aim of this paper is to present a new parametrization for the inverse correlation matrix, in terms of the Cholesky decomposition, that is able to model these constraints explicitly. As a result, the associated computational schemes for inference based on Markov Chain Monte Carlo sampling are greatly simplified and expedited. The Cholesky decomposition is also utilized in the development of a class of hierarchical correlation selection priors that allow for varying levels of network sparsity. An explicit expression is obtained for the normalizing constant of the elicited priors. The Bayesian model selection methodology is developed using a reversible jump algorithm and is applied to a dataset consisting of gene expressions to infer network associations. Keywords: Bayesian, Correlation matrix model, Sparse correction, Reversible Jump MC.
17 A Fay-Herriot Type Approach for Better Prediction in Multi-Indexed Response with Application to Arctic Seawater Data Analysis
Author: Ujjal Mukherjee and Snigdhansu Chatterjee      Pages: 257-272
We consider the problem of fitting a nonparametric curve to Arctic Ocean temperature data. Since several alternative curves may be fitted, we consider borrowing strength over fitted curves to create an ensemble fit. This leads to a novel exercise involving nonparametric curve fitting and small area methods. Our results indicate that climate data analysis is a complex process, and standard statistical techniques may need to be considerably enhanced for applicability to big data arising from climate studies. Keywords: Small area, Fay-Herriot model, Local polynomial regression, Arctic Ocean temperature.
18 Applications of Sufficient Dimension Reduction Algorithms on Non-elliptical Data
Author: Andreas Artemiou      Pages: 273-283
Sufficient dimension reduction (SDR) is a class of supervised dimension reduction techniques which generally perform much better than unsupervised dimension reduction techniques like Principal Component Analysis (PCA). In this paper we present classic methodology in the SDR framework that is based on inverse moments and we discuss the theoretical assumptions. Finally, we demonstrate the advantage of a recently introduced method known as Principal Support Vector Machine (PSVM) in the presence of predictors which violate the theoretical assumption of ellipticity of the marginal distribution. Keywords: Sufficient dimension reduction, Categorical predictors, Sliced inverse regression, Principal support vector machine, Principal component analysis.
19 Bayesian Multiscale Phylogenetics
Author: Marco A.R. Ferreira and M. Alejandra Jaramillo      Pages: 285-292
We propose a computational approach for the construction of Bayesian multiscale phylogenetic trees. Specifically, first we classify the DNA sites or nucleotides in different scales of evolutionary resolution using entropy. After that, for each evolutionary resolution level we run a Markov chain Monte Carlo (MCMC) analysis that uses the molecular data up to that resolution level, as well as the last phylogenetic trees simulated from the immediate coarser level. We illustrate the use of our multiscale phylogenetics framework with an application to a large molecular dataset for primates. Keywords: Bayesian inference, Entropy, Markov chain Monte Carlo, Multiscale analysis.