AT&T Labs - Research
Statistics
Research

Data Mining Algorithms

Algorithms for alerting on data streams

We increasingly encounter large-scale streaming and time series data cross-classified by two or more factor variables into a potentially large number of cells. Monitoring such data for deviations can reveal important events that warrant an alert. However, naive approaches that apply a fixed threshold or p-value cutoff to each cell break down in this setting because of the multiple hypothesis testing problem: with many cells tested at once, some will look extreme purely by chance.
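The scale of the multiple testing problem can be illustrated with a little arithmetic. The sketch below is only an illustration, assuming every cell is tested independently at the same per-cell false-alarm rate; the function names and the cell counts are hypothetical and do not come from our applications.

```python
def expected_false_alarms(num_cells: int, alpha: float) -> float:
    """Expected number of cells flagged by chance when nothing has changed."""
    return num_cells * alpha


def prob_any_false_alarm(num_cells: int, alpha: float) -> float:
    """Chance of at least one false alarm, assuming independent cells."""
    return 1.0 - (1.0 - alpha) ** num_cells


if __name__ == "__main__":
    # Hypothetical cell counts, each cell tested at a 1% false-alarm rate.
    for m in (10, 1_000, 100_000):
        print(m, expected_false_alarms(m, 0.01), prob_any_false_alarm(m, 0.01))
```

Under these assumptions, a cutoff that looks conservative for a single cell still produces a steady stream of spurious alerts once the number of cells is large.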

We have developed two novel Bayesian algorithms called hbmix and kfgps that use shrinkage estimation to solve the above problem.
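To give a flavor of what shrinkage estimation does, here is a minimal generic sketch. It is not hbmix or kfgps, whose details are not described on this page. It assumes each cell's standardized deviation z_i is normally distributed around a true effect theta_i with unit variance, and that the true effects share a zero-mean normal prior whose variance is estimated from the data by the method of moments. The function name is hypothetical.

```python
from statistics import mean


def shrink_deviations(z_scores):
    """Shrink standardized cell deviations toward zero.

    Assumed model (illustrative only): z_i ~ N(theta_i, 1) and
    theta_i ~ N(0, tau2). The posterior mean of theta_i is then
    tau2 / (tau2 + 1) * z_i.
    """
    # Marginally, E[z_i^2] = tau2 + 1, so estimate tau2 by moments.
    second_moment = mean(z * z for z in z_scores)
    tau2 = max(second_moment - 1.0, 0.0)
    weight = tau2 / (tau2 + 1.0)
    return [weight * z for z in z_scores]


if __name__ == "__main__":
    # Hypothetical standardized deviations for four cells.
    print(shrink_deviations([0.0, 0.0, 0.0, 4.0]))
```

When the cells as a group look like pure noise, the estimated prior variance is zero and every deviation is shrunk all the way to zero; the more the cells vary beyond what noise explains, the less each one is shrunk.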

We applied this technology first to data in our own industry:

  • packet loss among several source-destination pairs on an ISP network, and
  • incoming call-center calls cross-classified by caller intent and location.

We have found it equally helpful in other areas:

  • daily sales volumes at each store location for a large retail enterprise, and
  • classification of medical tests performed at labs of a major health provider.

To learn more about our work in this area, contact Chris Volinsky.

Classification using PRIM

The patient rule induction method (PRIM), introduced by Jerry Friedman and Nicholas Fisher in 1999, is a powerful data mining algorithm. The objective of PRIM is to find response "hotspots" (or bumps) in a high-dimensional space of predictor variables. PRIM seeks box-shaped subregions of the predictor space where the average value of the response is significantly larger than its average over the entire space. It does this through a sequence of peeling and pasting steps, in which small chunks of the dataset are peeled away (or pasted back on) so that the average response in the resulting box is maximized.

TurboPRIM is a local modification of PRIM designed specifically for massive datasets. Our approach is an out-of-memory, disk-based implementation of PRIM: the dataset is never held in the computer's main memory, and all calculations are performed in a minimal series of passes over the data on disk.
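The peeling phase described above can be sketched in a few lines. This is a simplified in-memory illustration, not TurboPRIM: it omits pasting, rescans the data for every candidate box, and stops once the best available peel would leave fewer than a minimum fraction of the points. The function names, the parameters `alpha` and `min_support`, and the toy data are all assumptions made for the example.

```python
def in_box(x, box):
    """True if point x lies within [lo, hi] on every dimension."""
    return all(lo <= v <= hi for v, (lo, hi) in zip(x, box))


def peel_once(rows, box, alpha):
    """Return (new_box, mean_response, count) for the best single peel."""
    inside = [(x, y) for x, y in rows if in_box(x, box)]
    if not inside:
        return None
    k = max(1, int(alpha * len(inside)))  # roughly how many points to peel
    best = None
    for j, (lo, hi) in enumerate(box):
        vals = sorted(x[j] for x, _ in inside)
        # Lower peel drops everything at or below the k-th smallest value;
        # upper peel drops everything at or above the k-th largest value.
        above = [v for v in vals if v > vals[k - 1]]
        below = [v for v in vals if v < vals[-k]]
        candidates = []
        if above:
            candidates.append((above[0], hi))
        if below:
            candidates.append((lo, below[-1]))
        for bounds in candidates:
            trial = box[:j] + [bounds] + box[j + 1:]
            ys = [y for x, y in inside if in_box(x, trial)]
            avg = sum(ys) / len(ys)
            if best is None or avg > best[1]:
                best = (trial, avg, len(ys))
    return best


def prim_peel(rows, min_support=0.1, alpha=0.1):
    """Peel greedily from the full data range down to a small box."""
    dims = len(rows[0][0])
    box = [(min(x[j] for x, _ in rows), max(x[j] for x, _ in rows))
           for j in range(dims)]
    while True:
        step = peel_once(rows, box, alpha)
        if step is None or step[2] < min_support * len(rows):
            return box
        box = step[0]


if __name__ == "__main__":
    # Toy data: a 10 x 10 grid whose response is 1 only in one corner.
    rows = [((i, j), 1.0 if i >= 5 and j >= 5 else 0.0)
            for i in range(10) for j in range(10)]
    print(prim_peel(rows, min_support=0.25, alpha=0.1))
```

Each candidate peel needs only a count and a sum of the response over the points it would keep, which is the kind of quantity that can be accumulated in a sequential pass over data on disk rather than by holding the dataset in memory.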

To learn more about our work in this area, contact David Poole.