HOGENOM is a phylogenomic database providing families of homologous genes and associated phylogenetic trees (and sequence alignments) for a wide set of sequenced organisms.

HOGENOM material and methods

Origin of data

We used complete genomes from Ensembl and EnsemblGenomes for the eukaryotes and from NCBI for the bacteria and archaea.

Selection of representative genomes

The HOGENOM-CORE database contains a selection of representative genomes.
For eukaryotes and archaea, genomes were selected manually, based on expert knowledge.
For bacteria, we used a semi-automatic method.

Choice of the 13 specific phyla

We selected all phyla containing at least 50 species, which allowed us to build the 12 phylum-specific HOGENOM-PHYLUM[n] databases.
All remaining phyla were merged into a 13th database, HOGENOM-PHYLUM0.
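
The selection rule can be illustrated with a minimal Python sketch. The function name, the input mapping and the numbering order of the databases are assumptions made for this example; only the 50-species threshold and the PHYLUM0 catch-all come from the description above.

```python
from collections import Counter

MIN_SPECIES = 50  # threshold stated in the text


def assign_phylum_databases(species_to_phylum, min_species=MIN_SPECIES):
    """Map each phylum to a database name (hypothetical helper).

    Phyla with at least `min_species` species get their own
    HOGENOM-PHYLUM[n] database; all other phyla share HOGENOM-PHYLUM0.
    """
    counts = Counter(species_to_phylum.values())
    # Assumption: large phyla are numbered by decreasing size, then by name.
    # The numbering actually used by HOGENOM is not described in the text.
    large = sorted(
        (p for p, n in counts.items() if n >= min_species),
        key=lambda p: (-counts[p], p),
    )
    databases = {p: f"HOGENOM-PHYLUM{i}" for i, p in enumerate(large, start=1)}
    for phylum in counts:
        databases.setdefault(phylum, "HOGENOM-PHYLUM0")
    return databases
```

Called with a mapping from species name to phylum name, the function returns one database name per phylum.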

Clustering pipeline

Clustering was performed on all proteins of the databases. First, we performed a KCLUST clustering to create pre-clusters.
For each pre-cluster, an HMM profile was built, as well as a consensus sequence.
We then performed an HMM search of all consensus sequences against all HMM profiles.
Next, we aggregated similar pre-clusters into clusters using SiLiX on the HMM results.
Finally, two rounds of the Louvain algorithm were applied to the relationship graphs inside the clusters obtained by SiLiX, to split these clusters into more homogeneous families.
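
SiLiX is a single-linkage clustering tool, so the aggregation step can be pictured as merging every pair of pre-clusters connected by an accepted HMM hit. The sketch below is not SiLiX itself and does not reproduce its filtering criteria. It is a simplified union-find illustration in Python, where the function name, the `(query, target, score)` hit format and the `min_score` cutoff are all assumptions.

```python
def aggregate_preclusters(preclusters, hits, min_score=0.0):
    """Single-linkage aggregation of pre-clusters (illustrative only).

    preclusters: iterable of pre-cluster identifiers.
    hits: iterable of (query, target, score) tuples from the HMM search;
          both identifiers must be present in `preclusters`.
    min_score: hypothetical cutoff below which a hit is ignored.
    """
    parent = {p: p for p in preclusters}

    def find(x):
        # Follow parent links to the root, compressing the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for query, target, score in hits:
        if score >= min_score:
            root_q, root_t = find(query), find(target)
            if root_q != root_t:
                parent[root_t] = root_q  # merge the two groups

    clusters = {}
    for p in parent:
        clusters.setdefault(find(p), []).append(p)
    return sorted(sorted(members) for members in clusters.values())
```

Because linkage is transitive, two pre-clusters end up in the same cluster as soon as a chain of accepted hits connects them. This can produce large heterogeneous clusters, which is consistent with applying a community-detection step such as Louvain afterwards.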

Alignments

Alignments were calculated for each cluster with MAFFT.

Alternative splicing

In case of alternative splicing, a unique representative transcript was selected among the transcripts of each gene. Before phylogenetic tree calculation, the non-representative transcripts were removed from the alignment. For this reason, the number of genes in the tree and in the alignment may differ. In the alignment, the alternative transcripts are tagged as: REPRESENTATIVE (the transcript is representative of the gene), ISOFORMIN (non-representative transcript of a gene whose representative transcript is in the alignment) and ISOFORMEX (non-representative transcript of a gene whose representative transcript is not in the alignment).
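
The three tags can be expressed as a small decision rule. The following Python sketch only illustrates that rule. The function names and the two input mappings (transcript to gene, gene to representative transcript) are assumptions, not part of the HOGENOM pipeline.

```python
def tag_transcripts(alignment_ids, transcript_to_gene, representative_of):
    """Tag each transcript of an alignment (hypothetical helper).

    alignment_ids: transcript identifiers present in the alignment.
    transcript_to_gene: maps a transcript identifier to its gene.
    representative_of: maps a gene to its representative transcript.
    """
    in_alignment = set(alignment_ids)
    tags = {}
    for transcript in alignment_ids:
        representative = representative_of[transcript_to_gene[transcript]]
        if transcript == representative:
            tags[transcript] = "REPRESENTATIVE"
        elif representative in in_alignment:
            tags[transcript] = "ISOFORMIN"
        else:
            tags[transcript] = "ISOFORMEX"
    return tags


def tree_sequences(tags):
    """Keep only the representative transcripts, as done before tree building."""
    return [t for t, tag in tags.items() if tag == "REPRESENTATIVE"]
```

For an alignment containing two transcripts of one gene and a single non-representative transcript of another gene, only one sequence is kept for the tree, which is why the alignment and the tree may not contain the same number of sequences.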

Phylogenetic trees

Alignments were split into sub-alignments: one containing sequences from the core only and, for each phylum-specific database, one containing sequences from the core and from that phylum.
First, phylogenetic trees were calculated for each core-only alignment (the "core trees").
Then, phylogenetic trees were calculated with IQ-TREE for each core-and-phylum alignment, using the "core trees" as constraints.
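
The splitting step can be sketched in Python as follows. The function name, the dictionary-based alignment representation and the "CORE" label are assumptions for illustration, and the sketch does not remove columns that become gap-only after splitting.

```python
def split_alignment(alignment, seq_to_db, core="CORE"):
    """Split a global alignment into sub-alignments (illustrative only).

    alignment: maps a sequence identifier to its aligned sequence.
    seq_to_db: maps a sequence identifier to the database it belongs to,
               either `core` or a phylum-specific database name.
    Returns one core-only sub-alignment plus one core-and-phylum
    sub-alignment per phylum-specific database found in the input.
    """
    core_part = {s: a for s, a in alignment.items() if seq_to_db[s] == core}
    sub_alignments = {core: dict(core_part)}
    for seq_id, aligned in alignment.items():
        db = seq_to_db[seq_id]
        if db != core:
            # Each phylum sub-alignment starts from a copy of the core sequences.
            sub_alignments.setdefault(db, dict(core_part))[seq_id] = aligned
    return sub_alignments
```

Since every core-and-phylum sub-alignment contains all core sequences, the tree built from the core-only sub-alignment can act as a topological constraint on the larger trees. IQ-TREE documents a `-g` option for supplying a constraint tree; whether that exact option was used here is not stated.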

Note: Global alignments (i.e. before being split into sub-alignments) are available by clicking on the "ALIGNEMENT" link on the left of the tree.

Facilities

This work was performed using the computing facilities of the CC LBBE/PRABI.

Citation

If you are using HOGENOM families, please cite:
Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M and Perriere G (2009) "Databases of homologous gene families for comparative genomics" BMC Bioinformatics, 10 (Suppl 6):S3

Previous Release

The previous release web page is available here

License


This content is governed by the CeCILL license under French law and abiding by the rules of distribution of free software. You can use, modify and/or redistribute the software under the terms of the CeCILL license as circulated by CEA, CNRS and INRIA at the following URL "http://www.cecill.info". As a counterpart to the access to the source code and rights to copy, modify and redistribute granted by the license, users are provided only with a limited warranty and the software's author, the holder of the economic rights, and the successive licensors have only limited liability. In this respect, the user's attention is drawn to the risks associated with loading, using, modifying and/or developing or reproducing the software by the user in light of its specific status of free software, that may mean that it is complicated to manipulate, and that also therefore means that it is reserved for developers and experienced professionals having in-depth computer knowledge. Users are therefore encouraged to load and test the software's suitability as regards their requirements in conditions enabling the security of their systems and/or data to be ensured and, more generally, to use and operate it in the same conditions as regards security. The fact that you are presently reading this means that you have had knowledge of the CeCILL license and that you accept its terms.