CD-HIT wiki page at GitHub


The algorithms of CD-HIT and more detailed studies are available from the references below. In addition, if you find CD-HIT useful, please cite ref. #6,#3; if you use the CD-HIT web server, please cite ref. #4; and please cite ref. #5 for CD-HIT-454.

1. "Clustering of highly homologous sequences to reduce the size of large protein database", Weizhong Li, Lukasz Jaroszewski & Adam Godzik Bioinformatics, (2001) 17:282-283. PDF Pubmed Citations

2. "Tolerating some redundancy significantly speeds up clustering of large protein databases", Weizhong Li, Lukasz Jaroszewski & Adam Godzik Bioinformatics, (2002) 18:77-82. PDF Pubmed Citations

3. "Cd-hit: a fast program for clustering and comparing large sets of protein or nucleotide sequences", Weizhong Li & Adam Godzik Bioinformatics, (2006) 22:1658-9. Open access PDF Pubmed Citations

4. Ying Huang, Beifang Niu, Ying Gao, Limin Fu and Weizhong Li. CD-HIT Suite: a web server for clustering and comparing biological sequences. Bioinformatics, (2010). 26:680 PDF Pubmed Citations

5. Beifang Niu, Limin Fu, Shulei Sun and Weizhong Li, Artificial and natural duplicates in pyrosequencing reads of metagenomic data. BMC Bioinformatics, (2010), 11:187 PDF Pubmed Citations

6. Limin Fu, Beifang Niu, Zhengwei Zhu, Sitao Wu and Weizhong Li, CD-HIT: accelerated for clustering the next generation sequencing data. Bioinformatics, (2012), 28 (23): 3150-3152. doi: 10.1093/bioinformatics/bts565. PDF, pubmed.