\documentclass[10pt]{article}
\usepackage[left=60pt,right=60pt,top=50pt,bottom=50pt]{geometry}
\usepackage[pdftex]{hyperref}
\usepackage{latexsym}
\usepackage{fancyhdr}
\usepackage{mathptmx}
\usepackage{wrapfig}
\usepackage{graphicx}
\usepackage{float}
\usepackage{subfigure}
\usepackage{cite}
\usepackage{times}
\usepackage{titling}
\usepackage{hyperref}

\hypersetup{
  backref=page,
  colorlinks= true, %Colours links instead of ugly boxes
  urlcolor  = green, %Colour for external hyperlinks
  linkcolor = blue, %Colour of internal links
  citecolor = red %Colour of citations
}

\begin{document}
\begingroup  
  \centering
  \vspace{-0.5in}
  \LARGE \textbf{Research Statement -- Subhabrata Bhattacharya}\par
\endgroup
\vspace{0.2in}

My research in computer vision brings together machine learning, insights 
from psychology, computer graphics, algorithms, and a great deal of 
computation. Over the last few years, I have had the opportunity to explore 
several broad areas of computer vision research, including:\\

\par\noindent\textbf{\large{Recognizing Complex Events in Consumer Videos}}\\\\
The goal of complex event recognition~\cite{trecvid10:cuucf,trecvid11:sri,
trecvid12:sri,ijmir12:survey} is to automatically detect high-level events 
in a given video sequence. Owing to the fast-growing popularity of 
such videos, especially on the Web, solutions to this problem are in high 
demand. A feasible solution can directly make video search and retrieval 
a more efficient and rewarding experience for users. It can also help 
track user interest based on the video content they watch, thereby 
enabling targeted advertising of relevant products. Furthermore, it can help 
broadcast agencies predict important statistics about a video, such as 
its virality and the geographical location of its viewers, moments after 
it is uploaded, so that channel bandwidth can be optimized. In 
addition, such systems can provide human observers with a meaningful textual 
recounting of a video in a relatively short time without substantial human 
intervention. That said, this is an extremely challenging 
problem that requires algorithmic breakthroughs at multiple 
tiers. My research addresses some of the sub-problems that 
are crucial in the context of complex event recognition, listed as 
follows:\\ 

\par\noindent\textbf{(a) Design of features:} Within the purview of this effort, 
we explore two complementary sources of information to design features 
that are useful for content-based video analysis in realistic scenarios.
The first is semi-global in nature, computed from short segments 
of the video~\cite{tpami12:cov}, while the second is based on the ambient 
camera motion~\cite{tmm12:cam} present during the video capture process. \\

The \textbf{semi-global clip-level descriptor} is a concise representation of 
a temporal window/clip of consecutive frames from a video rather than of localized 
spatio-temporal patches, which eliminates the need for specific detectors. The 
descriptor is based on the covariance of complementary low-level motion cues (such as optical 
flow and its derivatives, vorticity, and divergence) and appearance cues 
(such as first- and second-order derivatives of pixel intensities). Since 
covariance matrices capture joint statistics between individual low-level 
feature modalities, they automatically transform our random vector of samples 
into statistically uncorrelated random variables, leading to a compact 
representation of a video. 
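
As a minimal sketch of such a descriptor (the notation here is introduced only 
for illustration), suppose each pixel $i$ of a clip yields a $d$-dimensional 
vector $\mathbf{f}_i$ of motion and appearance features, with $N$ such samples 
per clip. The clip-level covariance descriptor can then be written as
\[
  \mathbf{C} = \frac{1}{N-1}\sum_{i=1}^{N}
  (\mathbf{f}_i - \bar{\mathbf{f}})(\mathbf{f}_i - \bar{\mathbf{f}})^{\top},
  \qquad
  \bar{\mathbf{f}} = \frac{1}{N}\sum_{i=1}^{N}\mathbf{f}_i ,
\]
so that a clip is summarized by a symmetric $d \times d$ matrix whose size 
does not depend on the clip length or resolution.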

In addition to the descriptor itself, we investigate two sparse-coding-based 
approaches~\cite{tpami12:cov} for using the descriptor in the context of action and 
gesture recognition. In the first, the sparse approximation of a set of 
covariance matrices is treated as a determinant maximization problem, where 
the bases (covariance matrices) are obtained from training videos. We compare 
this approach with a sparse linear approximation alternative, suitable for 
equivalent vector spaces of covariance matrices, that uses Orthogonal Matching 
Pursuit. We show the applicability of our video descriptor and the associated 
recognition algorithms through experiments on challenging datasets; 
the results provide promising insights for large-scale video analysis.

\textbf{Camera motion} is often an under-exploited cue in the 
analysis of consumer-uploaded videos depicting complex events. 
Complex events like ``Attempting a board trick'' and ``Parkour'' usually 
involve a lot of jittery camera motion coupled with pan and tilt motions. 
In contrast, videos depicting events such as ``Wedding Ceremony'' and 
``Birthday Party'' are mostly captured by stationary cameras with limited 
pan and some amount of zoom. The objective of this effort~\cite{tmm12:cam} is 
to investigate an efficient set of methodologies that can be leveraged to 
represent videos in terms of their ambient camera motion at large scale, 
without resorting to computationally prohibitive full-3D reconstruction 
techniques.

We devise this novel representation on top of inter-frame homographies, which 
serve as coarse indicators of the camera motion. Next, using the Lie algebra of 
projective groups, we transform the homography matrices to an intermediate 
vector space that preserves the intrinsic geometric structure of the 
transformation. Multiple time series are then 
constructed from these mappings. We perform an exhaustive analysis of 
effective features that can be computed from these time series, based on 
theoretical foundations from both linear (Hankel matrices) and non-linear
(chaotic invariants) dynamical systems. Features computed on these time 
series are used for discriminative classification of video shots. Our 
proposed camera-motion-based shot classification outperforms previously 
published algorithms and achieves performance comparable to an 
implementation that involves recovery of structure from motion on our 
dataset of eight shot categories. This encouraged us to evaluate our 
method for complex event recognition on challenging 
datasets~\cite{trecvid11:sri,trecvid12:sri}, where the results provide conclusive 
evidence of its applicability to open-source video analysis.\\
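
To illustrate the mapping described above, a minimal sketch follows; the 
normalization, the choice of basis, and all symbols are assumptions made for 
illustration. Let $\mathbf{H}_t$ denote the $3 \times 3$ homography between 
frames $t$ and $t+1$, scaled so that $\det \mathbf{H}_t = 1$ and assumed close 
enough to the identity for its principal matrix logarithm to be real. Then
\[
  \log \mathbf{H}_t = \sum_{k=1}^{8} a_{k}(t)\, \mathbf{G}_k ,
\]
where $\mathbf{G}_1,\ldots,\mathbf{G}_8$ is a fixed basis of the traceless 
$3 \times 3$ matrices, i.e., the Lie algebra of the projective group. Each 
coefficient sequence $a_k(1), a_k(2), \ldots$ then forms one scalar time series 
describing the camera motion over a shot.\\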

\par\noindent\textbf{(b) Engineering computationally efficient intermediate 
representations:} Designing intermediate representations on top of raw features 
is crucial for any recognition algorithm in order to handle 
outliers efficiently and reduce the processing of large volumes of 
high-dimensional data. A popular approach in this context is the Bag-of-Visual-Words
(BoVW) method, in which raw features extracted from a video or image are quantized 
using common clustering algorithms and reduced to a histogram, 
which becomes the intermediate representation or signature of the video or 
image. We present an efficient alternative~\cite{cvpr11:anchors} to the 
traditional vocabulary-based BoVW methods used for visual classification 
tasks. 

Our representation is both conceptually and 
computationally superior to the bag-of-visual-words approach: (1) in contrast 
to BoVW methods, we iteratively 
generate a \textbf{maximum likelihood estimate} of an instance given a set of 
characteristic features; and (2) we randomly sample 
this set of characteristic features, called \textbf{anchors}, instead of employing 
the computationally intensive clustering algorithms used during the vocabulary 
generation step of BoVW methods. Our performance compares favorably to the 
state of the art in experiments on three challenging human action datasets and a 
scene categorization dataset, demonstrating the broad applicability of our 
method.
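
One way to make the estimation step concrete is the following sketch, which 
assumes a simple mixture model; the nonnegative kernel $\kappa$ and the symbols 
below are illustrative assumptions rather than the exact formulation 
of~\cite{cvpr11:anchors}. Given anchors $\mathbf{a}_1,\ldots,\mathbf{a}_K$ and 
raw features $\mathbf{x}_1,\ldots,\mathbf{x}_N$ from one instance, let
\[
  p(\mathbf{x} \mid \mathbf{w}) = \sum_{k=1}^{K} w_k\, \kappa(\mathbf{x}, \mathbf{a}_k),
  \qquad w_k \ge 0, \quad \sum_{k=1}^{K} w_k = 1 .
\]
Maximizing $\sum_{i=1}^{N} \log p(\mathbf{x}_i \mid \mathbf{w})$ by 
expectation-maximization gives the iterative update
\[
  w_k \leftarrow \frac{1}{N}\sum_{i=1}^{N}
  \frac{w_k\, \kappa(\mathbf{x}_i, \mathbf{a}_k)}
       {\sum_{j=1}^{K} w_j\, \kappa(\mathbf{x}_i, \mathbf{a}_j)} ,
\]
and the resulting weight vector $\mathbf{w}$ would serve as the fixed-length 
signature of the instance, with no clustering step required.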

We integrate the above representation scheme to detect semantically accurate,
human-understandable mid-level spatio-temporal concepts for modeling complex 
events. To this end, we introduce a benchmark dataset of spatio-temporal concepts 
extracted from amateur videos depicting complex events. This dataset consists 
of $104$ mutually exclusive concept categories over $10{,}000$ annotated 
audio-visual samples extracted from NIST's TRECVID MED 2011 event corpus, which 
replicates complex events observed in common video footage. Detectors specific 
to each concept category are trained on the proposed anchor-based representation 
for different information modalities (motion, static, and audio). This
approach achieved respectable detection performance~\cite{trecvid11:sri} in the 
annual NIST TRECVID Multimedia Event Detection 2011 competition.\\

\par\noindent\textbf{(c) Formulating complex event models:} Just as low-level 
features and the associated intermediate representations are crucial for 
recognition, efficient complex event models can be created if temporal 
dynamics are exploited effectively. So far, researchers have 
proposed the use of various configurations of graphical models in this 
context. Although these models are mathematically intuitive and elegant, 
they are computationally complex and require extensive training coupled 
with substantial domain knowledge.

Here we represent each video depicting a complex event as an ordered vector 
time series, where each time-step is a vector containing the confidences returned 
by a set of pre-trained spatio-temporal concept detectors~\cite{trecvid11:sri}.
Using foundations from linear dynamical systems, we extract two complementary 
features. The first is based on block Hankel matrices, which capture 
dependencies between each observation vector within the context of the entire 
time series. The second exploits statistically meaningful characteristics of 
multiple interacting time series, such as lag-independence, harmonics, and frequency 
proximity. We also integrate the above feature computation steps into a 
Bayesian concept selection framework that automatically identifies the concepts 
necessary to achieve a respectable trade-off between the accuracy and computational 
efficiency of the recognition process. Experiments conducted on NIST's TRECVID 
datasets for Multimedia Event Detection (MED 2011 \& MED 2012) demonstrate that 
our proposed method~\cite{acm13:tempdyn} outperforms the state of the art in 
the context of complex event recognition.\\\\
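
As a sketch of the first feature (the symbols and the window length $m$ are 
illustrative assumptions), let $\mathbf{c}_t$ be the vector of concept-detector 
confidences at time-step $t$, for $t = 1,\ldots,T$. A block Hankel matrix with 
$m$ block rows stacks time-shifted copies of the series:
\[
  \mathbf{M} = \left( \begin{array}{cccc}
    \mathbf{c}_1 & \mathbf{c}_2 & \cdots & \mathbf{c}_{T-m+1} \\
    \mathbf{c}_2 & \mathbf{c}_3 & \cdots & \mathbf{c}_{T-m+2} \\
    \vdots & \vdots & \ddots & \vdots \\
    \mathbf{c}_m & \mathbf{c}_{m+1} & \cdots & \mathbf{c}_{T}
  \end{array} \right) .
\]
Every block anti-diagonal of $\mathbf{M}$ repeats a single observation, so each 
column holds a window of $m$ consecutive observations.\\\\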

\par\noindent\textbf{\large{Computational Photo-aesthetics}}\\\\
The deluge of image-hosting Web sites and the increasing affordability of 
consumer-grade digital cameras have introduced two new problems from an image-sharing 
perspective. The first is the ability to select the best-looking photographs from a 
large pool captured during a particular occasion. The second is the
flexibility to edit a photograph with minimal knowledge of photographic composition 
so that the result looks better than the original. 
These two issues motivate us to propose a set of novel 
algorithms~\cite{mm10:photoqual,tomccap11:photoqual} that enable novice users to improve 
the visual aesthetics of their digital photographs using several spatial 
recompositing techniques. This work differs from earlier efforts in two 
important aspects: (1) it focuses on both photo quality assessment and 
improvement in an integrated fashion, and (2) it enables the user to make informed
decisions about improving the composition of a photograph. 

The tool facilitates interactive selection of one or more foreground 
objects present in a given composition, and the system recommends 
where they can be relocated in a manner that optimizes a learned aesthetic 
metric while obeying semantic constraints. For photographic compositions that 
lack a distinct foreground object, the tool provides the user with crop or 
expansion recommendations that improve the aesthetic appeal by equalizing the 
distribution of visual weights between semantically different regions. The 
recomposition techniques presented here learn support 
vector regression models that capture visual aesthetics from user data, and then 
optimize this learned metric iteratively to increase the image appeal. The tool 
demonstrates promising aesthetic assessment and enhancement results on a variety 
of images and provides insightful directions for future research. This 
work~\cite{mm10:photoqual} was nominated for \textbf{best paper} in the 
full-paper track of ACM MM 2010 and was later extended in~\cite{tomccap11:photoqual}.

\newpage
\par\noindent\textbf{\large{Aerial Video Analysis}}\\\\
Quadrotor helicopters have gained immense visibility in the area of aerial 
surveillance and reconnaissance over the last decade. Due to their portability, 
ease of control, low risk of operation, and affordable cost of deployment, 
these low-flying platforms are becoming popular with law enforcement 
departments around the world for applications such as tracking vehicles or
monitoring suspicious activities. We introduced a technique for 
tracking objects persistently from surveillance systems 
that integrate quadrotor aerial (moving) and ground (fixed) platforms in typical 
urban scenarios. Under this framework~\cite{tr09:camfusion}, we track moving 
objects from a moving aerial platform 
using a conventional three-stage technique~\cite{bc10:objdet} consisting of 
ego-motion compensation, blob detection, and blob tracking with near-real-time 
performance. A hierarchical robust background subtraction followed by a motion 
correspondence algorithm is applied to track objects from the ground 
surveillance camera. 

We further refine~\cite{ei10:aerialsensor} the metadata available at the 
airborne camera; together with the calibration parameters of the ground 
camera, this allows us to transform the object's position from each camera's local 
coordinate system to a common world coordinate system. Trajectories obtained 
in world coordinates are then merged assuming temporal continuity. 
False candidate trajectories are eliminated using a similarity metric based on 
the color intensity of the object that generated them. Our system has been tested in 
three real-world scenarios, where it merged trajectories 
successfully in $80\%$ of the cases. The tools developed~\cite{bc10:objdet,
ei10:aerialsensor} as part of this project were important contributions to 
UCF--Lockheed Martin's involvement in the \textbf{DARPA Video Image Retrieval 
and Analysis Tool (VIRAT)} program and are extensively used to extract 
motion-compensated chips depicting human activities from aerial videos.\\
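
The coordinate transfer can be sketched as follows, assuming a planar ground 
so that each camera is related to the world plane by a homography; this 
assumption and the symbols are introduced only for illustration. If 
$\mathbf{x}^{(c)}_t$ is the homogeneous image position of an object in camera 
$c$ at time $t$ and $\mathbf{P}_c$ is the $3 \times 3$ image-to-ground 
homography of that camera, then
\[
  \mathbf{X}_t \simeq \mathbf{P}_c\, \mathbf{x}^{(c)}_t ,
\]
where $\simeq$ denotes equality up to scale. Under this sketch, $\mathbf{P}_c$ 
for the aerial camera changes at every frame and would be derived from the 
refined metadata, whereas for the fixed ground camera it follows once from 
calibration.\\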

\par\noindent\textbf{\large{Collaborations and Outreach}}\\\\
One of the advantages of working in such a vibrant field is the opportunity 
for fruitful collaboration across both industry and academia. 
In the past, I have collaborated with researchers at Columbia 
University~\cite{trecvid10:cuucf,ijmir12:survey}, Carnegie Mellon 
University~\cite{trecvid11:sri}, the University of Michigan~\cite{cvpr11:anchors}, and 
the University of Klagenfurt~\cite{ei10:aerialsensor}. I have been fortunate to 
publish with several renowned researchers in computer vision and to 
participate in research projects with industrial partners such as Lockheed 
Martin~\cite{bc10:objdet}, SRI Sarnoff, and Google Research. I have also interned 
on two separate occasions, with Microsoft Research and Intel 
Labs~\cite{mm10:photoqual,tomccap11:photoqual}, during the summers of 2012 
and 2010, respectively. My experience in systems research and development 
(IBM Systems \& Tech.\ Groups and Infosys Tech.\ Ltd.) gives me a 
natural edge in contributing effectively to large groups.

In addition to presenting at high-profile conferences in computer vision and multimedia, 
I regularly speak at specialized workshops on recognition. My work has been 
funded by DARPA, IARPA, and Intel. I am actively involved in writing grant proposals
for AFOSR, NSF, and NASA. 
\bibliographystyle{abbrv}
\bibliography{researchstatement}
\end{document}