HOGENOM is a phylogenomic database providing families of homologous genes, with their associated phylogenetic trees and sequence alignments, for a wide set of sequenced organisms.

HOGENOM material and methods

Origin of data

We used complete genomes from Ensembl and EnsemblGenomes for the eukaryotes and from NCBI for the bacteria and archaea.

Selection of representative genomes

The HOGENOM-CORE database contains a selection of representative genomes.
For eukaryotes and archaea, the genomes were selected manually, based on expert judgement.
For bacteria, we used a semi-automatic method.

Choice of the 13 specific phyla

We selected all phyla containing at least 50 species, which allowed us to build the 12 phylum-specific HOGENOM-PHYLUM[n] databases.
All remaining phyla were merged into a 13th database, HOGENOM-PHYLUM0.
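
The selection rule above can be illustrated with a minimal Python sketch. It is not the HOGENOM code: the function name, the input mapping and the numbering order of the databases (largest phylum first) are assumptions made for the example; only the 50-species threshold and the HOGENOM-PHYLUM0 catch-all come from the text.

```python
from collections import Counter

MIN_SPECIES = 50  # threshold stated in the text


def assign_phylum_databases(species_to_phylum, min_species=MIN_SPECIES):
    """Map each phylum to a database name.

    Phyla with at least `min_species` species get their own
    HOGENOM-PHYLUM[n] database (n >= 1); all other phyla share
    HOGENOM-PHYLUM0.
    """
    counts = Counter(species_to_phylum.values())
    # Assumption: databases are numbered by decreasing phylum size,
    # with ties broken alphabetically. The real numbering may differ.
    large = sorted(
        (p for p, c in counts.items() if c >= min_species),
        key=lambda p: (-counts[p], p),
    )
    databases = {p: f"HOGENOM-PHYLUM{i}" for i, p in enumerate(large, start=1)}
    for phylum in counts:
        databases.setdefault(phylum, "HOGENOM-PHYLUM0")
    return databases
```

Called on a mapping from species names to phylum names, the function returns one database name per phylum; passing a smaller min_species makes it easy to try on toy data.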

Clustering pipeline

Clustering was performed on all proteins in the databases. First, we ran a KCLUST clustering to create pre-clusters.
For each pre-cluster, an HMM profile and a consensus sequence were built.
We then performed an HMM search of all consensus sequences against all HMM profiles.
Next, we aggregated similar pre-clusters into clusters by running SiLiX on the HMM search results.
Finally, two rounds of the Louvain algorithm were applied to the relationship graph within each cluster obtained by SiLiX, in order to split these clusters into more homogeneous families.
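
To make the aggregation step more concrete, here is a minimal Python sketch of single-linkage grouping, the principle on which SiLiX is based: two pre-clusters end up in the same cluster as soon as a chain of accepted similarity links connects them. This is an illustration only. SiLiX is a standalone program with its own filtering criteria, the function and the identifiers below are hypothetical, and the subsequent Louvain step is not shown.

```python
def single_linkage(items, links):
    """Group `items` into the connected components of the `links` graph.

    `links` is an iterable of (a, b) pairs, each meaning that a and b
    were judged similar enough to be merged. Every a and b must appear
    in `items`.
    """
    parent = {x: x for x in items}

    def find(x):
        # Follow parent pointers to the root, shortening the path on the way.
        while parent[x] != x:
            parent[x] = parent[parent[x]]
            x = parent[x]
        return x

    for a, b in links:
        root_a, root_b = find(a), find(b)
        if root_a != root_b:
            parent[root_b] = root_a

    groups = {}
    for x in items:
        groups.setdefault(find(x), []).append(x)
    # Sort members and groups so that the result is deterministic.
    return sorted(sorted(members) for members in groups.values())
```

For example, with links pc1-pc2 and pc2-pc3, the three pre-clusters fall into one cluster even though pc1 and pc3 are not directly linked. This chaining effect is one reason why a further splitting step into more homogeneous families can be useful.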


Alignments were calculated for each cluster with MAFFT.

Alternative splicing

In case of alternative splicing, a single representative transcript was selected among the transcripts of each gene. Before phylogenetic tree calculation, the non-representative transcripts were removed from the alignment. For this reason, the number of genes in the tree and in the alignment may differ. In the alignment, transcripts are tagged as REPRESENTATIVE (the transcript is the representative of its gene), ISOFORMIN (non-representative transcript of a gene whose representative transcript is in the alignment) or ISOFORMEX (non-representative transcript of a gene whose representative transcript is not in the alignment).
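
The three tags can be summarised with a short Python sketch. The tag names and their meaning come from the description above; the function names, the input dictionaries and the transcript identifiers are hypothetical and only serve the example.

```python
def tag_transcripts(alignment_ids, transcript_to_gene, representative_of):
    """Return {transcript_id: tag} for every transcript in the alignment.

    transcript_to_gene: transcript id -> gene id
    representative_of:  gene id -> id of its representative transcript
    """
    in_alignment = set(alignment_ids)
    tags = {}
    for tid in alignment_ids:
        rep = representative_of[transcript_to_gene[tid]]
        if tid == rep:
            tags[tid] = "REPRESENTATIVE"
        elif rep in in_alignment:
            tags[tid] = "ISOFORMIN"   # representative is in the alignment
        else:
            tags[tid] = "ISOFORMEX"   # representative is not in the alignment
    return tags


def tree_sequences(tags):
    """Keep only representative transcripts, as done before tree building."""
    return [tid for tid, tag in tags.items() if tag == "REPRESENTATIVE"]
```

In this sketch, a gene present in the alignment only through an ISOFORMEX transcript has no sequence left after filtering, which is one way the number of genes in the tree and in the alignment can differ.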

Phylogenetic trees

Alignments were split into sub-alignments: one containing sequences from the core only and, for each phylum, one containing sequences from the core and from that phylum.
First, phylogenetic trees were calculated for each core-only alignment (the "core trees").
Then, phylogenetic trees were calculated with IQ-TREE for each core-and-phylum alignment, using the "core trees" as constraints.
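
The splitting step can be sketched as follows in Python. This is a simplified illustration, not the HOGENOM pipeline: it assumes that each sequence of a family carries a single origin label (either the core or one phylum database), and the function name and labels are hypothetical. Tree inference itself is not shown.

```python
def split_alignment(seq_to_db, core="CORE"):
    """Split the sequence ids of one family into sub-alignment member lists.

    seq_to_db: sequence id -> origin label (the core label or a phylum label).
    Returns one entry for the core alone and one entry per phylum,
    each phylum entry holding the core sequences plus its own.
    """
    core_ids = [s for s, db in seq_to_db.items() if db == core]
    subs = {core: list(core_ids)}
    for db in sorted({d for d in seq_to_db.values() if d != core}):
        phylum_ids = [s for s, d in seq_to_db.items() if d == db]
        subs[db] = core_ids + phylum_ids
    return subs
```

Because every phylum sub-alignment contains all the core sequences, a tree built on the core-only sub-alignment can act as a constraint when the larger sub-alignment is analysed.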

Note: Global alignments (i.e. before being split into sub-alignments) are available by clicking on the "ALIGNEMENT" link on the left of the tree.


This work was performed using the computing facilities of the CC LBBE/PRABI.


If you are using HOGENOM families, please cite:
Penel S, Arigon AM, Dufayard JF, Sertier AS, Daubin V, Duret L, Gouy M and Perriere G (2009) "Databases of homologous gene families for comparative genomics" BMC Bioinformatics, 10 (Suppl 6):S3

Previous Release

The previous release web page is available here


