by Marco Pellegrini
The Institute for Computational Mathematics (IMC-CNR), Pisa, with the collaboration of Prof. Alberto Segre of University of Iowa, is now combining several core areas of expertise (sequential, distributed and parallel algorithms for numerical liner algebra and continuous optimization, methods for efficient computation of electrostatic fields) in a project on Protein Conformation Simulation.
The need for better methodologies to simulate biological molecular systems has not been satisfied by the increased computing power of workstations available nowadays. Instead such increased speed has pushed research into new areas and increasingly complex simulations. A clear example is the problem of protein folding. This problem is at the core of the technology of rational drug design and its impact on future society cannot be underestimated. In a nutshell, the problem is that of determining the 3-dimensional structure of a protein (and especially its biologically active sites) starting from its known DNA encoding. There are favourable opportunities for approaches employing a spectrum of competences, from strictly biological and biochemical to mathematical modeling and computer science, with special openings for the design of efficient algorithms.
Protein Folding requires searching an optimal configuration (minimizing the energy) among exponentially large spaces of possible configurations with a number of degrees of freedom ranging from hundreds to a few thousands. To cope with this challenge, a double action is required. On one hand, the use of High Performance Computing Networks permits the exploitation of the intrinsic parallelism which is present in many aspects of the problem. On the other hand, sophisticated and efficient algorithms are needed to exploit fully the deep mathematical properties of the physics involved. Brute force methods are at a loss here and the way is open for effective sampling methodologies, algorithms for combinatorial and continuous optimization, and strategies for searching over (hyper)-surfaces with multiple local minima.
Innovative techniques will be used in order to push forward the state of the art: a new distributed paradigm (Nagging and Partitioning), techniques from Computational Geometry (hierarchical representations) and innovative algorithms for long range interactions.
Nagging and Partitioning
As observed above, the protein folding problem is reduced to a searching problem in a vast space of conformations. A classical paradigm for finding the optimum over a complex search space is that of partitioning. A master process splits the search space among several slave processes and selects the best one from the solutions returned by the slaves. This approach is valid when it is relatively easy to split the workload evenly among the slave processes and the searching strategy of each slave is already optimized. The total execution time is determined by the slowest of the slaves and, when any slave is faulty, the computation is either blocked or returns a sub-optimal solution. A different and complementary approach is that of nagging. Here slaves operate on the same search space, however each slave uses a different search strategy. The total time is determined by the most efficient slave and the presence of a faulty slave does not block the computation. Moreover, in a branch and bound overall philosophy, it has been shown that the partial solutions of any processor help to speed up the search of others. Searching for the optimal energy conformation is a complex task where parallelism is present at many different levels, so that neither pure nagging nor pure partitioning is the best choice for all of them. A mixed strategy that uses both is more adaptive and yields better chances of success.
Hierarchical Representations
Techniques for representing and manipulating hierarchies of 3-dimensional objects have been developed in computational geometry for the purpose of speeding up visibility computations and collision detection and could be adapted to the folding problem. One of the main sub-problems in finding admissible configurations is to determine the existence or absence of steric clashes among single atoms in the model. This is a problem similar to that of collision detection for complex 3d models of macroscopic objects in robotics and geometric modeling. A popular technique is that of inclusion hierarchies. The hierarchy is organized as a tree and the root is associated with a bounding volume (eg an axis parallel box) enclosing the molecule. Recursively, we split the molecule into two groups of atoms and build the corresponding bounding boxes. The process stops at the leaves of the tree, each of which correspond to a single atom. Such a representation speeds up steric testing since it quickly rules out many pairs of atoms that are distant in the tree hierarchy. In general many such trees may be built with the same input and the aim is to obtain good properties such as minimizing the total surface or volume of the bounding boxes. It is interesting to note that such trees are more flexible and use less storage than uniform grid decompositions and even oct-tree data structures.
Electrostatic Long Range Interactions
A line of research at IMC has found interesting connections between electrostatic fields and integral geometric formulae leading to new efficient and robust algorithms for computing electrostatic forces. In particular, for 3-dimensional continuous distributions of charge, there is a representation without analytic singularities for which an adaptive Gaussian quadrature algorithm converges with exponential speed. Such recent techniques benefit from the calculation of molecular energy since they obtain a good approximation without considering all pairs of atoms thus avoiding quadratic complexity growth.
Please contact:
Marco Pellegrini - IMC-CNR
Tel: +39 050 315 2410
E-mail: pellegrini@imc.pi.cnr.it
http://www.imc.pi.cnr.it/~pellegrini