Taking on the Complexities of Big Data Analysis

Amy Biemiller
February 24, 2014

Computers have long been instrumental in helping government, business, education, health care, and scientific organizations sort and understand data. But as machines networked with sensors and software collect large amounts of structured and unstructured data, analyzing and sharing that data in a timely manner becomes challenging. Using typical data processing methods to manage the confluence of volume and variety of data complicates the process and slows decision-making.

“From a modeling standpoint, inference algorithms can be extremely slow when using a massive amount of data,” explains John Paisley, assistant professor of electrical engineering. “A typical approach is to randomly subsample a small set of the data and throw away the rest. The assumption is that this gives an adequate representation of the larger data set, but empirically we've seen that performance will degrade.”

The solution to more effectively managing big data is the novel research domain of machine learning, a hybrid of statistics and algorithmic computer science that allows for the construction and study of systems that can learn from data. Machine learning research focuses on developing fast, efficient algorithms for real-time data processing that yield accurate predictions. Researchers like Paisley focus on making methodological contributions to statistical machine learning and broadening the reach of its applications. To accomplish that, Paisley is developing Bayesian models and posterior inference techniques that address the big data problem in areas including topic modeling, collaborative filtering and dictionary learning, data analysis and exploration, recommendation systems, information retrieval, and compressed sensing.

“Bayesian nonparametric models can be infinitely complex, but ultimately allow the data to determine their complexity. As data accrues, the greater complexity can give a finer resolution on what's in the data, while still summarizing the content in a way that's far more interpretable than the raw data itself,” he says. For example, Paisley developed a tree-structured model using 1.8 million New York Times documents dated between 1987 and 2007. This model “captured the underlying themes of the data, with themes becoming more refined the farther the analysis progressed down the tree.”
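One way to see what it means for data to determine a model's complexity is the Chinese restaurant process, a standard building block of Bayesian nonparametric clustering. The sketch below is an illustration only and is not Paisley's tree-structured model. The function name, the concentration parameter `alpha`, and the seed are assumptions chosen for the example. Each new data point joins an existing cluster with probability proportional to that cluster's size, or starts a new cluster with probability proportional to `alpha`, so the number of clusters is not fixed in advance and grows slowly as data accrues.

```python
import random


def crp_cluster_growth(n_points, alpha=1.0, seed=0):
    """Simulate a Chinese restaurant process.

    Returns a list whose i-th entry is the number of clusters
    in use after the first i + 1 data points have been assigned.
    """
    rng = random.Random(seed)
    sizes = []    # sizes[k] = number of points currently in cluster k
    growth = []   # number of clusters after each point
    for i in range(n_points):
        # Total unnormalized weight: i points already seated, plus alpha.
        r = rng.uniform(0.0, i + alpha)
        if not sizes or r < alpha:
            # Open a new cluster (always happens for the first point).
            sizes.append(1)
        else:
            # Pick an existing cluster with probability proportional to size.
            r -= alpha
            for k, size in enumerate(sizes):
                if r < size:
                    sizes[k] += 1
                    break
                r -= size
            else:
                # Floating-point edge case: fall back to the last cluster.
                sizes[-1] += 1
        growth.append(len(sizes))
    return growth


growth = crp_cluster_growth(10_000, alpha=1.0, seed=0)
for n in (10, 100, 1_000, 10_000):
    print(f"after {n:>6} points: {growth[n - 1]} clusters")
```

In a run like this, the cluster count keeps rising as more points arrive, but far more slowly than the data itself, which is the sense in which added data buys finer resolution without an unmanageable model.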

“As the amount of data increases, the amount of potentially interesting information that can be uncovered increases as well. That’s why it is important, when working with large quantities of data, that we are able to efficiently analyze all data and exploit the information that is discovered along the way,” he says.

Paisley joined Columbia Engineering in 2013 after completing postdoctoral fellowships at UC Berkeley and Princeton. He received a Notable Paper Award at the International Conference on Artificial Intelligence and Statistics in 2011 and is an affiliated member of Columbia’s Institute for Data Sciences and Engineering.

BSE, Duke University, 2004; MS, Duke University, 2007; PhD, Duke University, 2010