Data Assembly and Integration

Challenge: Molecular sequence datasets are currently scattered across multiple locations and stored in multiple formats. Only a small fraction of the data needed to assemble very large phylogenetic trees is in widely available databases such as GenBank. Data that could be used to construct trees are often not broadly accessible or need pre-processing before they can be included in a phylogenetic analysis.

Leveraging many years of collaboration in the plant phylogenetics community, the Data Assembly and Integration Working Group organizes workshops and outreach activities to bring together data providers to discuss strategies for assembling large-scale sequence data sets for plants to be used by the tree reconstruction team. An equally important part of this process is ensuring orderly integration and interoperability of the data.

Some of the key activities of this group include:

  • My-Plant.org is a scientific networking web site designed to bring together plant scientists to discuss and organize information about plant taxa, with the ultimate goal of bringing data into iPToL. My-Plant uses a phylogenetic tree metaphor to organize species and scientists into related “clades”, each of which has a volunteer manager who moderates activities for that clade.
  • Data intake. The Data Assembly and Integration Working Group is implementing data intake pipelines on compute resources to standardize and scale the assembly of sequence data into character matrices suitable for large-scale phylogenetic analysis. The PHLAWD intake pipeline, developed by Stephen Smith, has been implemented, and work is underway to implement a generalized sequence intake pipeline developed by Gordon Burleigh.
  • Perpetually updating tree. Assembling the tree of life for all green plants is an incremental task. Trees of more than 50,000 plant species are now a reality, but new data keep coming in, so scientists such as Alexis Stamatakis and Casey Dunn are planning for alignments and phylogenetic trees that are updated on an ongoing basis.
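
To make the "character matrix" in the data intake item concrete, the following is a minimal sketch of one common way to combine per-locus alignments into a single taxon-by-character matrix, padding taxa that lack a locus with missing-data symbols. It is an illustration only and is not the PHLAWD pipeline or the generalized intake pipeline; the function name `build_supermatrix`, the `?` missing-data symbol, and the toy sequences are assumptions made for this example.

```python
from typing import Dict, List


def build_supermatrix(alignments: List[Dict[str, str]],
                      missing: str = "?") -> Dict[str, str]:
    """Concatenate per-locus alignments into one taxon-by-character matrix.

    Each alignment maps a taxon name to its aligned sequence for one locus.
    Taxa absent from a locus are padded with the `missing` symbol.
    (Hypothetical helper for illustration, not part of any iPlant pipeline.)
    """
    # Collect every taxon seen in any locus, in a stable order.
    taxa = sorted({taxon for aln in alignments for taxon in aln})
    parts: Dict[str, List[str]] = {taxon: [] for taxon in taxa}

    for aln in alignments:
        if not aln:
            continue  # an empty locus contributes no characters
        lengths = {len(seq) for seq in aln.values()}
        if len(lengths) != 1:
            raise ValueError("sequences within a locus must have equal length")
        width = lengths.pop()
        for taxon in taxa:
            # Use the real sequence if present, otherwise pad with missing data.
            parts[taxon].append(aln.get(taxon, missing * width))

    return {taxon: "".join(chunks) for taxon, chunks in parts.items()}


# Toy input: two loci with partially overlapping taxon sampling.
locus_a = {"Arabidopsis_thaliana": "ATGTCA", "Oryza_sativa": "ATGTCG"}
locus_b = {"Oryza_sativa": "TTGA", "Zea_mays": "TTGC"}

matrix = build_supermatrix([locus_a, locus_b])
for taxon, row in matrix.items():
    print(taxon, row)
```

Every row of the resulting matrix has the same length, which is what downstream tree-building programs generally expect. A real intake pipeline would also need to retrieve, filter, and align the sequences before this step.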

Working Group Members

Name              Role                   Institution
Douglas Soltis    Working Group Co-Lead  University of Florida
Pamela Soltis     Working Group Co-Lead  University of Florida
Michael Donoghue  Collaborator           Yale University
Val Tannen        Collaborator           University of Pennsylvania
Gordon Burleigh   Collaborator           University of Florida
Casey Dunn        Collaborator           Brown University
Sheldon McKay     Scientific Lead        iPlant Collaborative, Cold Spring Harbor Laboratory
Steve Mock        Team Lead, My-Plant    iPlant Collaborative, Texas Advanced Computing Center
Matthew Hanlon    Developer, My-Plant    iPlant Collaborative, Texas Advanced Computing Center
John Cazes        Developer              iPlant Collaborative, Texas Advanced Computing Center