Machine learning for large-scale data streams

Machine learning for large-scale data streams

High-velocity streams of high-dimensional data pose significant “big data” analysis challenges across a range of applications and settings. Modern sensors are collecting very high-dimensional data at unprecedented rates, often from platforms with limited processing power. These large datasets allow scientists and analysts to consider richer physical models with larger numbers of variables, and thereby have the potential to provide new insights into the underlying complex phenomena. While recent advances in online learning have led to novel and rapidly converging algorithms, these methods are unable to adapt to nonstationary environments arising in real-world problems. My lab has developed methods for processing such data “on the fly”. One contribution is a dynamic mirror descent framework which yields low theoretical regret bounds via accurate, adaptive, and computationally efficient algorithms which are applicable to broad classes of problems. The methods are capable of learning and adapting to an underlying and possibly time-varying dynamical model. Empirical results in the context of dynamic texture analysis, solar flare detection, sequential compressed sensing of a dynamic scene, traffic surveillance, tracking self-exciting point processes and network behavior in the Enron email corpus support the core theoretical findings. More recent work proposes a method of online data thinning, in which large-scale streaming datasets are winnowed to preserve unique, anomalous, or salient elements for timely expert analysis. At the heart of this approach is an online anomaly detection method based on dynamic, low-rank Gaussian mixture models. The low-rank modeling mitigates the curse of dimensionality associated with anomaly detection for high-dimensional data, and recent advances in subspace clustering and subspace tracking allow the proposed method to adapt to dynamic environments.

  • X. Jiang and R. Willett, “Online Data Thinning via Multi-Subspace Tracking.” arXiv:1609.03544, 2016. Code.
  • E. Hall and R. Willett, “Online convex optimization in dynamic environments,” IEEE Journal of Selected Topics in Signal Processing – Signal Processing for Big Data, vol. 9, no. 4, arXiv:1307:5944, 2015. code