Solving the Protein Threading Problem in Parallel Nocola Yanev, Rumen AndonovMotivationTalk OverviewTechniquesPowerPoint PresentationReduction to Network Flow: An ExampleReduction to Network Flow: Variables and ConstraintsDrawbacks of ApproachParallel SolutionImproving Parallel SolutionEvaluationResult TablesSlide 13Slide 14Slide 15Related WorkObservationsSlide 18Slide 19Solving the Protein Threading Problem in ParallelNocola Yanev, Rumen AndonovIndrajit BhattacharyaCMSC 838T PresentationCMSC 838T – PresentationMotivationProblem paper is trying to solve3D structure prediction using threadingIs a given target sequence likely to fold to a 3D template core?Find the alignment that minimizes some score functionNP-complete; optimal solution not possibleMAX-SNP-hard; arbitrary approximation not possibleWhy do we care3D structure determines biological function of proteinAmino acid sequence (almost) uniquely determines 3D structureThreading is usually less accurate than comparative modeling but easier to solveCMSC 838T – PresentationTalk OverviewOverview of talkMotivationTechniquesEvaluationRelated workObservationsCMSC 838T – PresentationTechniquesApproachReduce the problem to some known theoretical problem of interestIn this case, network flowUse existing tools for solving the theoretical problem efficientlyCPLEXExplore possibilities for parallelizing the problemInvestigate the intrinsic hardness for real biological examplesCMSC 838T – PresentationTwo structurally similar proteins Spatial adjacencies (interactions) Possible threading with a sequence Objective functionMathematical FormulationCMSC 838T – PresentationReduction to Network Flow: An ExampleCMSC 838T – PresentationReduction to Network Flow:Variables and ConstraintsStandard Network FlowVariable xi,t for each segment to position assignmentRestricted to [0, 1]With standard flow conservation constraintsAdditional cost for non-local interactionsVariable zi,t,i’,t’ for each non-local interactionRestricted to {0, 1}Constrained to sum to 1 for each non-local pair (i, i’)Upper bounded by flow entering (i, t) and leaving (i’, t’)CMSC 838T – PresentationDrawbacks of ApproachInteger programming is hard to solve!Relax to linear programming with (0, 1) variablesApproximate to integer solution using standard heuristicsExisting tools like CPLEXHuge number of variablesFor 36 segments and 81 positions, IP problem has 741264 rows, 360945 columns and 54145231 non-zero variables!Need to reduce number of variables and constraintsCalls for parallelization if possibleCMSC 838T – PresentationParallel SolutionUtilize special flow constraints Split into sub-problems that may be solved parallelySplit the k-th layer in the graph into r intervalsForce path for a sub-problem to pass through a particular interval in the layerPass best bound for objective function found so far as parameter to sub-problemSub-task aborts when dual objective function exceeds the current best boundCMSC 838T – PresentationImproving Parallel SolutionDrawback: Hardest Sub-Problem Dominates! Parallel strategy was found to be slower than the sequential!Sub-problems can potentially become harder to solveMany more difficult sub-problems than easy onesSolution: Break the atomicity of the tasksEach sub-task periodically checks the current best bound and updates its cut-offExtra overhead is still small compared to task granularityNow the easiest executing sub-task dominates!CMSC 838T – PresentationEvaluationExperimental environmentReal protein sequencesILOG CPLEX Callable LibrarySUN Ultra-Sparc II, 450 MhzObjective function coefficients generated from FROSTMaximum of 7 processors and 29 sub-problemsEvaluation resultsSequential version much faster than previous branch-and-bound results for the same problem formulationTime taken comparable to PROSPECTSplitting and parallelization significantly improve turnaroundReally tiny gap between relaxed LP and ILP solutionsMostly integer solutions even for relaxed LP!CMSC 838T – PresentationResult TablesComparison with branch and bound algorithmComment: Self threading results in significantly lower scores (as should be)CMSC 838T – PresentationResult TablesGap between relaxed LP and ILPComment: Tiny relaxation gap. (significance?)CMSC 838T – PresentationResult TablesSize of the LP formulationComment: LP problem size is still too large.CMSC 838T – PresentationResult TablesPerformance with parallel sub-tasksComment: Longer times with more sub-problems??CMSC 838T – PresentationRelated WorkSimilar / previous approachesLathrop and Smith, 1998Uses same cost functionBranch and bound algorithm for searching the space of threadings Xu, Xu and Uberbacher, 1998Divide and conquer algorithmXu, Li, Lin, Kim and Xu, 2003Linear programming formulationSolved using b&b algorithmNone of the above suggest any parallelizing schemeCMSC 838T – PresentationObservationsPoints of InterestMapping to a known problem of interestNicely utilizes particular constraints to break into independent subtasksThreading of real amino acid sequences seems possibleRaises interesting questions about real-life protein threading being in PSolver tailored for this particular problem may yield better resultsCMSC 838T – PresentationObservationsCriticismNot enough experiments with large number of subtasks and processors to show scalingProhibitively large number of variables and constraintsHow accurate are the objective function coefficients?What is the resolution of the objective function?Threading onto multiple sequences for prediction still looks dauntingNot clear how to extend the idea for 3-way and more complex interactionsImprovementsSeems possible to break up the sub-tasks recursivelyCMSC 838T – PresentationThank
View Full Document