---

* Research

Ph.D. Theses

Nonconvex Nonsmooth Optimization for Bioinformatics

By Amina Shabbeer
Advisor: Kristin P. Bennett
August 16, 2013

In this talk, we investigate nonconvex nonsmooth optimization problems that arise in bioinformatics and general visual data analytics tasks. This work is motivated by a need for tools that extract information from molecular epidemiological data and provide a view of the genetic diversity in the Mycobacterium tuberculosis complex (MTBC) population. Such tools are crucial for the effective tracking and control of tuberculosis (TB). We survey classification tools that group MTBC strains into genetic families and visualization tools for molecular epidemiology of TB. We develop TB-Lineage a classification tool that employs domain knowledge of signature patterns in DNA fingerprint data and Bayesian Networks for MTBC major lineage classification. We review spoligoforests as a tool for visualization of biogeographic diversity of MTBC that accurately represents both the underlying evolutionary relationships and the genetic distances between strains thus creating new insights on lineage definitions. Generating such a graph embedding involves addressing two challenges (i) preserve proximity relations as measured by some embedding objective, and (ii) simultaneous optimization of an aesthetic criterion, no edge-crossings in the embedding, to create a clear representation of the underlying graph structure. We propose a new approach to generating such an embedding that optimizes for multiple criteria. The method uses the theorems of the alternative to express the condition for no edge-crossings as a system of nonlinear inequality constraints. This approach has an intuitive geometric interpretation closely related to support vector machine classification. While edge crossing minimization can be utilized in conjunction with any optimization-based embedding objective, here we demonstrate the approach on multidimensional scaling by modifying the stress majorization algorithm to include penalties for edge crossings. We use an alternating approach to solve this nonconvex problem, iteratively performing two steps: computing the layout for a given set of constraints, and altering the constraints based on the new embedding. We provide a detailed analysis of the convergence of this algorithm. Alternating Directions of Multiplier Methods (ADMM) are proposed as a method for efficiently handling a large number of non-smooth constraints. MAA+, an iterative solution using ADMM is described for generating graph embeddings that adhere to non-intersection constraints between edges. We create spoligoforests generated for the TB-Insight collection of strains, including all strains observed in patients diagnosed with TB in the U.S. from 2006-2010. We also develop a standalone tool for spoligoforest generation. The method is also demonstrated on a suite of randomly generated graphs with corresponding Euclidean distances that have planar embeddings with high stress. The proposed edge-crossing constraints and iterative penalty algorithm can be readily adapted to other supervised and unsupervised optimization-based embedding or dimensionality reduction methods. The constraints can be generalized to remove intersections of general convex polygons including node-edge and node-node intersections. Host-pathogen maps that visualize relationships between MTBC strains and host groups are proposed as a testbed for minimizing intersections between nodes of arbitrary shape and size. MAA+ is applied to variations of the proximity preservation and edge-crossing minimization problem, to create low stress embeddings with no overlaps between nodes, edges and subgraphs.

* Return to main PhD Theses page


---

---