Improving the Reuse of Scientific Workflows and their By-products

Xiaorong Xiang
National Evolutionary Synthesis Center (NESCent)
Duke University, University of North Carolina – Chapel Hill, and North Carolina State University

Gregory Madey
Department of Computer Science and Engineering
University of Notre Dame

2007 IEEE International Conference on Web Services (ICWS 2007)
Salt Lake City, Utah, July 2007

Supported in part by the Indiana Center for Insect Genomics (ICIG) & the Indiana 21st Century Fund
Collaborators: Xiaorong Xiang & Jeanne Romero-Severson

Outline: two parts
  Production system (MoGServ) for bioinformatics workflow
    Bioinformatics application
    Productivity improvement
  Prototype system exploring ideas for end-user composition
    Workflow reuse
    Knowledge management/discovery

Figure source: from the article “Genome Sequencing vs. Moore’s Law: Cyber Challenges for the Next Decade” by Folker Meyer, CTWatch Quarterly, August 2006, volume 2, number 3.

Bioinformatics today
  Rapidly accumulating data: DNA sequences, contigs, expression data, annotations, etc.
  Non-standard, independently developed, heterogeneous data sources
  Data sharing and security
  Productivity problem!

SOA in Bioinformatics
  More community efforts needed to provide more shared and reliable services
  More demonstration projects needed => best practices, measured utility, feedback to middleware projects, etc.
  Recent exposure of data & analysis tools as services
    Large public databases and bioinformatics tools
  Middleware projects
    Provide infrastructure to compose, manage, execute, and connect the distributed services

Mother of Green (MoG) project
  Biological science
    In collaboration with Prof. Jeanne Romero-Severson, Biological Sciences, University of Notre Dame
    Study the deep phylogeny of plastids
  Computer science
    Provide an environment to support scientists’ investigations
    A case study of using SOA for data and application integration
    A prototype for future research in the service-oriented architecture domain

Mother of Green
  Malaria causes 1.5 – 2.7 million deaths every year
  3,000 children under age five die of malaria every day
  Plasmodium falciparum (a protozoan parasite) causes human malaria
  Drug resistance is a world-wide problem
  Targeted drug design through phylogenomics
  Image label: P. falciparum

Mother of Green
  P. falciparum has three genomes
    Nuclear, mitochondrial, plastid
  Animals and insects have only two
  Target the third genome
    No harm to animals
    New antimalarial drug
  High risk, high tech, high payoff
  J. Romero-Severson, Department of Biological Sciences
  Greg Madey & Xiaorong Xiang, Department of Computer Science & Engineering

Mother of Green
  Plastids are the third genome
    Intracellular organelles
    Terrestrial plants, algae, apicomplexans
  Functions in plants and algae
    Photosynthesis
      Oxidation of water
      Reduction of NADP
      Synthesis of ATP
    Fatty acid biosynthesis
    Aromatic amino acid biosynthesis
  Functions in apicomplexans
  Image labels: Chloroplast in plant cell; Plastid in Toxoplasma sp.; Apicoplast in P. falciparum; plastid

Mother of Green
  The apicoplast appears to code for <30 proteins
    Repair, replication and transcription proteins
  Why is the apicoplast essential?
    Find the ancestors of the apicoplast
    Identify genes in the ancestors
    Determine gene function
    Look for these genes in the P. falciparum nucleus
    Then study regulatory mechanisms in candidate genes

Mother of Green: Phylogenomics

Phylogenomics of plastids
  Very old lineage (> 2.5 billion years)
  Cyanobacterial ancestor
  Three main plastid lineages
    Glaucophytes
      Group of freshwater algae
      Chloroplast resembles intact cyanobacteria
    Chlorophytes
      Green plant lineage
      Chloroplast genome reduced
      Many chloroplast genes now in nuclear genome
    Rhodophytes
      Red algal lineage
      Chloroplast genome bigger than in green plants
      Oomycetes
      Apicomplexans

Phylogenomics of plastids
  One cyanobacterial ancestor
  Many lineages are not linear
  One plastid origin
  Multiple plastid origins

Figure: The process of endosymbiosis. Horizontal Gene Transfer (arrows) from the plastid to the nucleus. The nucleomorph is a remnant of the original endosymbiont nucleus.
  Diagram labels: Cyanobacteria; Primitive eukaryote; Nucleus; Endosymbiont plastid; Second eukaryote; Secondary endosymbiont; Nucleomorph; Secondary nonphotosynthetic endosymbiont; Plastid disappears; Third eukaryote; Tertiary endosymbiont; Tertiary nonphotosynthetic endosymbiont; Tertiary endosymbiosis; Horizontal Gene Transfer; P. falciparum

The information gathering problem
  Rapid accumulation of raw sequence information
    ~100 sequenced chloroplast genomes
    ~57 sequenced cyanobacterial genomes
    Rate of accumulation is increasing
  Information accumulates faster than analyses finish
  Information in forms not readily accessible
  Solution
    Semi-automated web-services
    “Smart” web-services
    Semantic web

A typical in-silico investigation: data-driven research
  A: Query complete genome sequences for a given taxon
  B: Query protein coding genes for each genome sequence
  C: Eliminate vector sequences
  D: Sequence alignment
  E: Phylogenetic analysis

Time-consuming manual web-based operations
  Data collection: Copy & paste!
  Analysis tool usage: Copy & paste!
  Experiment data recording: Copy & paste!
  Repetitive experiments for scientific discovery: Copy & paste!
  Repeat as new data becomes available: Copy & paste!
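Steps A to E above form a linear pipeline in which each step consumes the previous step's output, which is what makes the investigation a candidate for automation. A minimal sketch in Python, assuming each step is wrapped as a callable; every function name and all toy data below are hypothetical placeholders, not MoGServ's actual services:

```python
from typing import Any, Callable, Dict, List


def run_pipeline(steps: List[Callable[[Any], Any]], initial: Any) -> Any:
    """Feed the output of each step into the next one, in order."""
    data = initial
    for step in steps:
        data = step(data)
    return data


# Hypothetical stand-ins for steps A-E. Real steps would call remote
# data and analysis services instead of building toy strings.
def query_genomes(taxon: str) -> List[str]:
    # A: complete genome sequences for a taxon (two toy genomes)
    return [f"{taxon}_genome_1", f"{taxon}_genome_2"]


def query_protein_genes(genomes: List[str]) -> List[str]:
    # B: protein coding genes for each genome (two toy genes per genome)
    return [f"{g}:gene{i}" for g in genomes for i in (1, 2)]


def remove_vector_sequences(genes: List[str]) -> List[str]:
    # C: toy filter that treats one gene as vector contamination
    return [g for g in genes if not g.endswith("genome_2:gene2")]


def align(genes: List[str]) -> Dict[str, List[str]]:
    # D: placeholder for a sequence alignment
    return {"aligned": sorted(genes)}


def phylogenetic_analysis(alignment: Dict[str, List[str]]) -> Dict[str, int]:
    # E: placeholder that only reports how many sequences the tree covers
    return {"sequences_in_tree": len(alignment["aligned"])}


result = run_pipeline(
    [query_genomes, query_protein_genes, remove_vector_sequences,
     align, phylogenetic_analysis],
    "cyanobacteria",
)
print(result)
```

Re-running the same list of steps when new data becomes available replaces the repeated copy & paste described above.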

MoGServ system architecture
  MoGServ interface
    Web interface
    Application interface
  MoGServ middle layer
    Data access and storage
    Data and analysis services
    Service and workflow registry
    Indexing and querying metadata
    Service and workflow enactment
  Acting in two roles: service requester and service provider

Figure: MoGServ System Architecture
  Diagram labels: Web Interface; Applications; Application Server; Data Access Services; Data Analysis Services; Job Manager; Job Launcher; Service/Workflow Registry; Metadata Search; Local Data Storage; Workflow/SOAP Engines; Services; NCBI; DDBJ; EMBL; Data/Services Providers; MoGServ Middle Layer; Services Access Client; Others

Data storage and access services
  Local database
    Integrating data from multiple data sources according to scientists’ interests
    Supporting repetitive investigations against several subsets of sequences
    Avoiding network traffic and service failure when retrieving data on-the-fly from public data sources
  Accessing the data in the local database by services
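The reasoning behind the local database (fewer on-the-fly retrievals from public sources, and so less exposure to network traffic and service failure) can be illustrated with a small read-through store. This is only a sketch under stated assumptions: the class name, the `fetch_remote` callable, and the accession string are hypothetical, and an in-memory dictionary stands in for the real database.

```python
from typing import Callable, Dict


class LocalSequenceStore:
    """Serve sequences locally; contact the remote source only on a miss."""

    def __init__(self, fetch_remote: Callable[[str], str]) -> None:
        self._fetch_remote = fetch_remote   # hypothetical remote accessor
        self._local: Dict[str, str] = {}    # stands in for the local database
        self.remote_calls = 0               # counts on-the-fly retrievals

    def get(self, accession: str) -> str:
        if accession not in self._local:
            self.remote_calls += 1
            self._local[accession] = self._fetch_remote(accession)
        return self._local[accession]


def fake_remote(accession: str) -> str:
    # Toy replacement for a public data source.
    return "ATGC-" + accession


store = LocalSequenceStore(fake_remote)
first = store.get("ACC0001")    # fetched remotely, then stored locally
second = store.get("ACC0001")   # answered from the local store
print(first, second, store.remote_calls)
```

Repeated investigations against the same subset of sequences then cost one remote retrieval per sequence rather than one per experiment.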

Service and workflow registry
  A table-based description with necessary properties
    Text description
    Service location
    Input/output
    Provider
    Version
    Algorithm
    Invocation method
  Not intended for supporting service discovery or composition
  To answer end-users’ questions about their results
    Provenance: “Which algorithm was used to generate the data and what is the source of the input data?”
  A repository of services and workflows for local application developers

Indexing and querying metadata
  Metadata
    Service and workflow description
    Description of sequence data in order to track the origin of the data
    Experimental data: output, input, and intermediate data
  Indexing and querying with keywords
    Lucene
    Implemented as services

Service and workflow enactment
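The registry properties and the provenance question above can be sketched as two small record types plus a lookup. This is an illustrative sketch only: the field names mirror the listed properties, but the record layout, the example values, and the `provenance` and `search` helpers are assumptions, and the plain keyword scan merely stands in for the Lucene index (it is not Lucene's API).

```python
from dataclasses import dataclass
from typing import Dict, List, Tuple


@dataclass(frozen=True)
class ServiceRecord:
    """One registry row; fields follow the properties listed on the slide."""
    name: str
    description: str
    location: str
    inputs: Tuple[str, ...]
    outputs: Tuple[str, ...]
    provider: str
    version: str
    algorithm: str
    invocation: str


@dataclass(frozen=True)
class ResultRecord:
    """Hypothetical metadata stored with each experiment result."""
    result_id: str
    service_name: str
    input_source: str


registry: Dict[str, ServiceRecord] = {}
results: Dict[str, ResultRecord] = {}


def provenance(result_id: str) -> str:
    """Answer: which algorithm produced this data, and from what input?"""
    res = results[result_id]
    svc = registry[res.service_name]
    return (f"algorithm={svc.algorithm} (version {svc.version}); "
            f"input source={res.input_source}")


def search(keyword: str) -> List[str]:
    """Naive keyword match over descriptions (stand-in for a real index)."""
    kw = keyword.lower()
    return sorted(name for name, rec in registry.items()
                  if kw in rec.description.lower().split())


# Example values are invented for illustration.
registry["align-service"] = ServiceRecord(
    name="align-service",
    description="Multiple sequence alignment of protein coding genes",
    location="http://example.org/services/align",
    inputs=("sequences",), outputs=("alignment",),
    provider="local", version="1.0",
    algorithm="ExampleAligner", invocation="SOAP",
)
results["r1"] = ResultRecord("r1", "align-service", "NCBI")

print(provenance("r1"))
print(search("alignment"))
```

Keeping the algorithm and version in the registry row, and the input source with the result, is what lets the provenance question be answered by a simple join.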

Implementation
  Development and deployment
    J2EE, JSP, XSLT
    Tomcat 5.0.18 / Axis 1.2
  Database
    PostgreSQL 8.1
  Index and search of metadata
    Apache Lucene library
  Service implementation
    Java2WSDL
    Wrap command-line applications with the JLaunch library
  Workflow
    Taverna workbench, part of the myGrid project
    Freefluo workflow engine

Data and services
  Taverna workbench

Figure: A workflow created using the Taverna workbench tool

Improvement opportunities
  Use existing domain ontologies in the bioinformatics community to describe services, workflows, and data
  Integrate semantic web technology to support end-users’ workflow creation based on their knowledge of the scientific domain
  Support users with limited knowledge of scientific processes
  Record various workflow representations
  Facilitate the discovery and reuse of prior workflows
    Knowledge management
    Knowledge discovery

Service composition and workflows
  Service composition
    Ad-hoc
    Semi-automated
      Semantic annotation + reasoning
    Automated
      Semantic annotation + planning
  Scientific workflows
    Workflows composed on a service-oriented architecture to assist scientists in accessing and analyzing data

Conclusion
  Pro
    Increases the correctness of the formed workflow over time
      Avoids incorrect, inaccurate semantic annotations
      Takes advantage of verified knowledge
    Avoids the ontological reasoning process
    Better support for semi-automated and automated service composition over time
    Provides more accurate guidelines to users over time
  Con
    The connectivity graph can be big
      Number of parameters
      Number of services
    Searching the connectivity of a service when it is registered in the system may take a relatively long time
    More complex matching rule
      Number of parameters
    May not have high accuracy at the beginning

Future work
  Integrate GridSAM into MoGServ for execution and monitoring
  Integrate Grid computing technology for resource allocation
  Refine the MoGServ application domain ontology
  Create an interface for end-user workflow creation
  Create an interface for individual workspaces
  Evaluate the scalability and accuracy of the connectivity graph approach and the graph matching approach with a large number of real workflows and services

Thank you
Questions?

