AT&T Research at Joint Statistical Meetings 2011

Jul 30 - Aug 04, 2011

Miami Beach, FL, Miami Beach Convention Center

Contact: Christopher T. Volinsky

AT&T Research will be heavily represented at this year's JSM conference, presenting eight papers and offering a tutorial in data stream mining. David Poole will serve as Program Chair for the Section on Statistical Computing, and Chris Volinsky is Program Chair-Elect. Simon Urbanek, a past chair of the Section on Statistical Graphics, is an officer.

 

Late-Breaking Session: Heritage Health Prize: Christopher T. Volinsky

Lessons Learned from the Netflix Prize

August 1, 2011 at 10:30 AM

 

Invited Talks:

 

The Future of Statistical Computing Environments — Invited Papers (Section on Statistical Computing, Section on Statistical Graphics): Simon Urbanek

Taking Statistical Computing Beyond S and R (Session Link)
Aleph is an open-source project to create the next generation of statistical computing software, possibly as a successor to R. The goal is to provide a modern, flexible system suitable for statistical analysis. All aspects of the project are currently experimental and up for discussion. The current experimental implementation is written in C and features its own C-level object system.

August 1, 2011 at 08:30 AM

 

Statistics in Computational Advertising — Invited Papers (Section on Statistical Computing, Section on Statistical Graphics): Suhrid Balakrishnan

Issues in Targeted Advertising for Local Search (Session Link)
Advances have recently been made in modeling click-through rate in well-studied settings such as sponsored search and context match. Local search has received relatively less attention. The geographic nature of local search and the associated local browsing creates interesting research challenges and opportunities. We consider a novel application of a relational regression model to local search. The model is attractive in that it allows us to explicitly control and represent geographic and category-based neighborhood-style constraints on the samples that result in superior click-through rate estimates.

August 3, 2011 at 10:30 AM

 

Classification in the Real World: The Development of Practical Predictions from High-dimensional Markers — Invited Papers (Section on Statistical Computing): Christopher T. Volinsky

Large-Scale Network Data Analysis (Session Link)
Telecommunications data is all about networks - packet delivery networks, cell tower networks, fiber optic networks. But perhaps the most interesting network is the virtual one created by billions of telephony transactions every day. This callgraph network represents hundreds of millions of devices and the billions of connections between them - the social network of our customers. How do we make sense of such a massive graph? How do we find communities, or look for influential members? In this talk I will cover our ego-centric representation of the graph (Communities of Interest) and discuss how it helps us to analyze the graph at speed and scale - in applications such as fraud detection, customer loyalty, and targeted marketing.
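
As a rough illustration of what an ego-centric representation can look like, the sketch below keeps, for each device, only its k most frequent contacts. The helper name `top_k_contacts`, the (caller, callee) record format, and the ranking by raw call counts are assumptions made for illustration; the abstract does not say how Communities of Interest are actually computed.

```python
from collections import Counter, defaultdict


def top_k_contacts(calls, k=2):
    """Return each device's k most frequent contacts (hypothetical helper).

    `calls` is an iterable of (caller, callee) pairs. Both the record
    format and the ranking by raw call count are illustrative assumptions.
    """
    counts = defaultdict(Counter)
    for caller, callee in calls:
        counts[caller][callee] += 1
        counts[callee][caller] += 1  # treat each call as an undirected tie
    # Rank by descending count, breaking ties by name so output is deterministic.
    return {
        node: [c for c, _ in sorted(ctr.items(), key=lambda kv: (-kv[1], kv[0]))[:k]]
        for node, ctr in counts.items()
    }


# Toy call records: "a" calls "b" twice, and so on.
calls = [("a", "b"), ("a", "b"), ("a", "c"), ("a", "d"), ("b", "c")]
coi = top_k_contacts(calls, k=2)
```

Storing a small, bounded neighborhood per device is one way such a representation could stay tractable on a graph with billions of edges, since each device's summary can be updated and queried independently.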

August 4, 2011 at 08:30 AM

 

Tutorial:

 

Tutorial (CE_06C): Tamraparni Dasu

Data Stream Mining: Tools and Applications (Session Link)
Mining data streams is a challenging task complicated by the dynamic nature of the data, a high rate of accumulation, and limited, one-time access. The streams are riddled with complex, interdependent data glitches. At the same time, stream mining is important since many critical data mining applications involve streams, such as sensor networks, internet traffic, and mobility applications. This tutorial provides a comprehensive approach to data stream mining with emphasis on tools, techniques, and their application to solve real-world stream mining problems. We start with an introduction to data streams and discuss the analytical and computing challenges posed by the unique constraints associated with them. We present nonparametric methods that are eminently suitable for stream mining and are computationally lightweight. We demonstrate some of these through R code that will be a part of the course. We use running examples from social networking, sensor networks, and financial ticker streams to illustrate a wide variety of stream mining tasks: nonparametric summaries of the stream; detecting outliers and distributional shifts; computing and updating models for evolving streams; visualizing streams and stream summaries; measuring data quality and data cleaning. We conclude with an overview of open research problems in the area of statistical stream mining.
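
To make the idea of a lightweight, one-pass nonparametric summary concrete, here is a minimal sketch using reservoir sampling (Algorithm R) to estimate a stream's median from a fixed-size sample. The abstract does not name the methods covered in the tutorial, and the course itself uses R, so the choice of technique, the Python code, and the function names `reservoir_sample` and `sample_median` are all assumptions for illustration.

```python
import random


def reservoir_sample(stream, k, seed=0):
    """Keep a uniform random sample of size k from a one-pass stream.

    Implements Algorithm R; each item is seen once and memory stays O(k).
    """
    rng = random.Random(seed)
    sample = []
    for i, x in enumerate(stream):
        if i < k:
            sample.append(x)  # fill the reservoir first
        else:
            j = rng.randint(0, i)  # randint is inclusive on both ends
            if j < k:
                sample[j] = x  # replace a slot with probability k / (i + 1)
    return sample


def sample_median(sample):
    """Median of the retained sample, used as an estimate of the stream median."""
    s = sorted(sample)
    n = len(s)
    mid = n // 2
    return s[mid] if n % 2 else (s[mid - 1] + s[mid]) / 2


# One pass over a synthetic stream; only 100 values are ever held in memory.
sample = reservoir_sample(iter(range(10_000)), k=100)
approx_median = sample_median(sample)
```

The same retained sample can also feed other summaries, such as approximate quantiles, without another pass over the data.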

July 31, 2011 at 08:00 AM

 

Contributed Talks:

 

Business Applications — Contributed Papers (Business and Economic Statistics Section): Shu-ngai C. Yeung

Automatic Forecasting Of Double Seasonal Time Series With Applications On Mobility Network Traffic Prediction (Session Link)
Automatic forecasting procedures are common in business practice, where a large number of time series need to be forecast. One such application is mobility network resource planning, which requires accurate prediction of future peak usage at each cell tower location within the network. In this paper, we developed an automatic procedure based on univariate double seasonal ARIMA models (DSARIMA) to forecast a time series database with multiple seasonal patterns. A large-scale empirical study comparing automatic DSARIMA with double seasonal exponential smoothing (DSEXP) is performed using real mobile phone network data. We also considered the performance of combined forecasts of the two models based on OLS and variations. The results show that automatic DSARIMA models and the combined forecast outperform DSEXP, especially at forecasting horizons beyond one day ahead.
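
One common way to combine two forecasts by OLS is to regress the observed values on the two forecast series and use the fitted coefficients as weights. The sketch below does this without an intercept, by solving the 2x2 normal equations directly. The function name `ols_combination_weights`, the no-intercept choice, and the toy numbers are assumptions for illustration; the abstract does not specify which OLS variant the authors used.

```python
def ols_combination_weights(f1, f2, actual):
    """Least-squares weights (w1, w2) such that actual ~= w1*f1 + w2*f2.

    No intercept is fitted; that is an illustrative assumption.
    """
    s11 = sum(a * a for a in f1)
    s22 = sum(b * b for b in f2)
    s12 = sum(a * b for a, b in zip(f1, f2))
    s1y = sum(a * y for a, y in zip(f1, actual))
    s2y = sum(b * y for b, y in zip(f2, actual))
    det = s11 * s22 - s12 * s12
    if det == 0:
        raise ValueError("forecasts are collinear; weights are not identified")
    # Cramer's rule on the normal equations.
    w1 = (s1y * s22 - s2y * s12) / det
    w2 = (s2y * s11 - s1y * s12) / det
    return w1, w2


# Toy data: two forecast series and "observed" values built as a known mix of them.
f1 = [10.0, 12.0, 15.0, 11.0]
f2 = [8.0, 14.0, 13.0, 12.0]
actual = [0.6 * a + 0.4 * b for a, b in zip(f1, f2)]

w1, w2 = ols_combination_weights(f1, f2, actual)
combined = [w1 * a + w2 * b for a, b in zip(f1, f2)]
```

In practice the weights would be estimated on a holdout window and then applied to later forecasts, rather than fitted and evaluated on the same data as in this toy example.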

July 31, 2011 at 04:00 PM

 

Real World Applications of Statistical Learning and Data Mining Methods — Contributed Papers (Section on Statistical Learning and Data Mining): Ganesh K. Subramaniam

Spatio-Temporal Models for Wireless Network Data (Session Link)
Spatio-temporal models arise when data are collected across both space and time. With AT&T network data, a typical example would be monitoring data on the mobility network (a network of towers), collected at regular intervals, say on a monthly basis. We have a time series associated with usage of minutes (voice) and Kb (data) for every tower located throughout the country. Thus the analysis has to take account of spatial dependence among the towers, and also of the fact that the observations at each tower typically are not independent but form a time series. In other words, one must take account of temporal correlations as well as spatial correlations. The topic of interest is how the temporal patterns associated with the time series of a given tower correlate with temporal patterns in neighboring towers. We use a sample of time series from the network data to explore this question.

August 2, 2011 at 08:30 AM

 

Rating Competitors in Games and Sports in the 21st Century — Topic Contributed Papers (Section on Statistics in Sports): Kenneth E. Shirley

The ABCs Of XQJKZ: A New Scrabble Rating System Based On A Statistical Model For Tile-By-Tile Play (Session Link)
We develop a statistical model for Scrabble in which we model the number of points scored on each turn as a function of the individual tiles in a player's rack. The result is a detailed model that describes a player's Scrabble skill in terms of dozens of player-specific variables related to interpretable Scrabble skills such as how often a player plays each tile, how many points he earns per tile, and how much he augments his score by incorporating tiles already on the board into his play. Our data comes from a public database of about 600 games of Scrabble played at the expert level. We find that most of the variation in points scored is explained by the frequency with which tiles are played and the frequency with which players get a "bingo" (playing all 7 tiles in the rack on a single turn). The player-specific model parameters can be used as the basis of a much more detailed player rating system than is currently in use. This work also sheds light on the degree to which the outcomes of Scrabble games depend on the randomness inherent in drawing tiles, as opposed to player skill. The largest component of the model consists of a hierarchical Bayesian logistic regression model.

August 2, 2011 at 10:30 AM

 

Modeling Atmosphere and Oceanic Data — Contributed Papers (Section on Statistics and the Environment): Yi Fang Chen

Statistical Combination Of Climate Models (Session Link)
Atmosphere-ocean general circulation models (AOGCMs) are the primary tool to study how climate responds to increases in the concentration of greenhouse gases in the atmosphere. Outputs from different AOGCMs have been combined using weighting schemes related to how well they reproduce the historical data. Earlier approaches have inferred model weights in the context of a Bayesian hierarchical scheme that treats both the historical record and the several model outputs as independent random samples from a distribution of possible weather data centered around the true climate. However, recent work points to evident correlations among model outputs. Our approach is based on optimizing the fit of model output combinations to historical data, allowing for weighting that is location-specific, with smoothing of both space-time temperature trends and model weight coefficients. Estimated model weights are extrapolated and applied to "future" AOGCM output to produce predictions of future space-time temperature trends. The approach is illustrated using observed summer temperature data for central North America for the 50-year time period 1940-1989, together with corresponding output from two AOGCMs.

August 4, 2011 at 08:30 AM

 

Discussants:

Deborah Swayne: Advances in R Software

 

Session Chairs:

Parni Dasu for Hypothesis Testing (Sun, 7/31/2011, 2:00 PM - 3:50 PM)

Kenny Shirley for New Methods of Estimation (Tue, 8/2/2011, 8:30 AM - 10:20 AM)

David Poole for Statistics in Computational Advertising (Wed, 8/3/2011, 10:30 AM - 12:20 PM)

Suhrid Balakrishnan for Ensemble Methods (Mon, 8/1/2011, 2:00 PM - 3:50 PM)

Chris Volinsky for Classification and Clustering (Tue, 8/2/2011, 10:30 AM - 12:20 PM)

Simon Urbanek for Advances in Monte Carlo Simulation (Sun, 7/31/2011, 4:00 PM - 5:50 PM)