Data Integration

 

Databases of phylogenetic trees already archive hundreds of thousands of relatively small trees that can form the starting point for synthetic analyses, and several other data sources (e.g., sequences, alignments) will be crucial for comprehensive analyses. Tree databases range from those designed explicitly for phylogenetic biology (TreeBASE, PhyLoTA Browser, ToL Web Project) to databases aimed at the gene and protein evolution communities (PFAM, Phytome, Phylofacts, PlantTribes). These databases differ in syntax and semantics, and vary tremendously in the sophistication of their application programming interfaces (APIs). Database integration is notoriously difficult, and efforts to integrate "everything" have consistently failed. We therefore believe that an incremental approach, especially to on-the-fly integration, would be most productive. We envision focusing first on resources that are critical for subsequent steps and that are managed by iPToL participants. This will both focus our efforts and demonstrate the benefit of the environment. Subsequently, we should actively invite and enable other parties to integrate their resources into the discovery environment through activities such as hackathons. This team is working to overcome differences in database design and interfaces among the available tree databases, so that trees already constructed by the community become easier to access.
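As an illustration only, the incremental, on-the-fly approach described above might be sketched as a small registry of per-source adapters that map each database's records onto one common form at query time. Everything in this sketch is hypothetical: the names TreeRecord, TreeRegistry, fetch_source_a and fetch_source_b, the field names, and the in-memory sample rows are assumptions made for the example and do not reflect the schema or API of TreeBASE or any other real database.

```python
from dataclasses import dataclass
from typing import Callable, Dict, FrozenSet, Iterable, List


@dataclass(frozen=True)
class TreeRecord:
    """Common representation that every adapter maps its rows onto."""
    source: str
    tree_id: str
    newick: str
    taxa: FrozenSet[str]


# Hypothetical stand-ins for two tree databases whose records differ in
# field names (syntax) and in how taxon labels are encoded (semantics).
SOURCE_A_ROWS = [
    {"tree_id": "A1", "newick": "((Oryza,Zea),Arabidopsis);",
     "taxa": ["Oryza", "Zea", "Arabidopsis"]},
]
SOURCE_B_ROWS = [
    {"id": 7, "nwk": "(Zea,(Glycine,Medicago));",
     "labels": "Zea;Glycine;Medicago"},
]


def fetch_source_a() -> Iterable[TreeRecord]:
    """Adapter for source A: taxa are already stored as a list."""
    for row in SOURCE_A_ROWS:
        yield TreeRecord("source_a", row["tree_id"], row["newick"],
                         frozenset(row["taxa"]))


def fetch_source_b() -> Iterable[TreeRecord]:
    """Adapter for source B: numeric ids, semicolon-separated labels."""
    for row in SOURCE_B_ROWS:
        yield TreeRecord("source_b", str(row["id"]), row["nwk"],
                         frozenset(row["labels"].split(";")))


class TreeRegistry:
    """Holds one adapter per source; sources can be added one at a time."""

    def __init__(self) -> None:
        self._adapters: Dict[str, Callable[[], Iterable[TreeRecord]]] = {}

    def register(self, name: str,
                 fetch: Callable[[], Iterable[TreeRecord]]) -> None:
        self._adapters[name] = fetch

    def find_by_taxon(self, taxon: str) -> List[TreeRecord]:
        # Each source is queried when the question is asked; nothing is
        # copied into a central warehouse beforehand.
        return [rec
                for fetch in self._adapters.values()
                for rec in fetch()
                if taxon in rec.taxa]


registry = TreeRegistry()
registry.register("source_a", fetch_source_a)
zea_first = registry.find_by_taxon("Zea")   # only source A is integrated

registry.register("source_b", fetch_source_b)
zea_both = registry.find_by_taxon("Zea")    # source B added later
```

In this sketch, integrating a further resource means writing one more adapter function and registering it; existing adapters and queries are left unchanged, which is the property an incremental strategy relies on.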

 

Working Group Members

Val Tannen, Team Lead, University of Pennsylvania
William Piel, Team Lead, Yale University