UMD CMSC 828G - Using Growing Self-Organising Maps to Improve the Binning Process


Hindawi Publishing Corporation
Journal of Biomedicine and Biotechnology
Volume 2008, Article ID 513701, 10 pages
doi:10.1155/2008/513701

Methodology Report

Using Growing Self-Organising Maps to Improve the Binning Process in Environmental Whole-Genome Shotgun Sequencing

Chon-Kit Kenneth Chan,1 Arthur L. Hsu,1 Sen-Lin Tang,2 and Saman K. Halgamuge1

1 Dynamic Systems & Control Group, Department of Mechanical Engineering, University of Melbourne, VIC 3010, Australia
2 Research Center for Biodiversity, Academia Sinica, Taipei 115, Taiwan

Correspondence should be addressed to Chon-Kit Kenneth Chan, [email protected]

Received 31 August 2007; Accepted 18 November 2007

Recommended by Daniel Howard

Metagenomic projects using whole-genome shotgun (WGS) sequencing produce many unassembled DNA sequences and small contigs. The step of clustering these sequences, based on biological and molecular features, is called binning. A reported strategy for binning that combines oligonucleotide frequency and self-organising maps (SOM) shows high potential. We improve this strategy by identifying suitable training features, implementing a better clustering algorithm, and defining quantitative measures for assessing results. We investigated the suitability of each of di-, tri-, tetra-, and pentanucleotide frequencies. The results show that dinucleotide frequency is not a sufficiently strong signature for binning 10 kb long DNA sequences, compared to the other three. Furthermore, we observed that increasing the order of oligonucleotide frequency may deteriorate the assignment result in some cases, which indicates the possible existence of an optimal species-specific oligonucleotide frequency. We replaced SOM with growing self-organising map (GSOM), obtaining comparable results while gaining a 7%–15% speed improvement.

Copyright © 2008 Chon-Kit Kenneth Chan et al. This is an open access article distributed under the Creative Commons Attribution License, which permits unrestricted use, distribution, and reproduction in any medium, provided the original work is properly cited.

1. INTRODUCTION

Metagenomics is an emerging area of genome research that allows culture-independent, functional, and sequence-based studies of microbial communities in environmental samples. Whole-genome shotgun (WGS) sequencing has been applied in most metagenomic projects [1–6]. These projects have unveiled remarkable information on microbial genomics and have also brought an unprecedentedly comprehensive and clear picture of microbial communities.

In the WGS sequencing approach, DNA fragments are randomly sampled from all microbes that form a community in an environmental sample. The individual DNA fragments are sequenced and then assembled into genomes by computational techniques. However, a fundamental limit of WGS sequencing is that only the genomes of high-abundance species can be completely or near-completely assembled [7], because a confident assembly requires multiple overlapping fragments. In one of the prominent metagenomic studies, conducted by Venter et al. [2], about 1 Gb of DNA sequences was successfully sequenced from Sargasso Sea samples. This study clearly indicated the existence of far more diverse microbial communities than previously thought. Most of the environmental genomes sequenced to date contain only a few high-abundance species but many low-abundance species that together account for a large portion of the total genome size of an environmental sample. The presence of a large number of DNA fragments from the low-abundance species poses a problem for assembling genomes. In order to infer the biological functions of a microbial community from sequences, a process named "binning" is used to group these unassembled DNA sequence fragments and small contigs into biologically meaningful "bins," such as phylogenetic groups [8].

There are a number of tools currently available for the binning process. These include Chisel System [9, 10], MetaClust [4, 11], TETRA [12, 13], PhyloPythia [14], and the combination of oligonucleotide frequency and SOM [15]. The Chisel System helps bin the sequences according to the identification, characterisation, and comparative analysis of taxonomic and evolutionary variations of enzymes. The remaining tools listed above analyse the nucleotide composition of sequences, a method that is considered to have the potential to work well for the binning process in WGS sequencing [8]. MetaClust computes different DNA signatures and then uses a clustering algorithm to assign sequences into bins. TETRA bins the species-specific sequences using tetranucleotide-derived z-score correlations. PhyloPythia uses a supervised-learning approach: it trains a multiclass support vector machine (SVM) classifier using all the known genome sequences in the existing database, then assigns the unknown environmental sequences to the closest clade at the selected taxonomic level. This method has been demonstrated to classify most DNA sequence fragments with high accuracy. However, considering that the number of currently known genomes is far less than 1% of all microbial genomes [7], it is reasonable to assume that the currently available training data is insufficient to represent all the extremely diverse microbial genomes for supervised-learning methods.

Unsupervised learning may provide the answer to this problem. The combination of oligonucleotide frequency and the well-known unsupervised learning method self-organising map (SOM) was used by Abe et al. [15] to explore genome signatures. They used the di-, tri-, and tetranucleotide frequencies as the training features of SOM to cluster the 1 kb and 10 kb DNA sequence fragments derived from 65 bacteria and 6 eukaryotes. Clear species-specific separations of sequences were obtained in the 10 kb fragment tests. Their results showed that the combination of oligonucleotide frequency and SOM can be used as a powerful tool to cluster or bin the DNA sequence fragments after WGS sequencing.

In order to successfully bin the DNA sequence fragments, using an appropriate genome signature as the training feature is
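
As an illustration of the oligonucleotide-frequency features discussed in the introduction, the following Python sketch turns a DNA fragment into a fixed-length vector of k-mer frequencies (k = 2 for dinucleotides, 4 for tetranucleotides, and so on). It is a minimal sketch under stated assumptions, not the procedure used in the paper: the function name `kmer_frequencies`, the single-strand overlapping-window counting, the skipping of windows with ambiguous bases, and the normalisation to unit sum are all hypothetical choices, since the preprocessing details are not shown in this excerpt.

```python
from collections import Counter
from itertools import product


def kmer_frequencies(sequence, k):
    """Return normalised frequencies of all 4**k k-mers over A, C, G, T.

    Assumption: overlapping windows on the given strand only; windows that
    contain any other symbol (for example N) are ignored.
    """
    sequence = sequence.upper()
    # Fixed lexicographic ordering so every fragment maps to the same axes.
    kmers = ["".join(p) for p in product("ACGT", repeat=k)]
    counts = Counter(
        sequence[i:i + k] for i in range(len(sequence) - k + 1)
    )
    total = sum(counts[kmer] for kmer in kmers)
    if total == 0:
        # Fragment shorter than k, or no unambiguous window: all-zero vector.
        return [0.0] * len(kmers)
    return [counts[kmer] / total for kmer in kmers]


# Hypothetical toy fragment; real inputs would be 1 kb or 10 kb sequences.
fragment = "ACGTACGTTGCA"
features = kmer_frequencies(fragment, 2)   # 16-dimensional dinucleotide vector
```

Vectors of this kind (16, 64, 256, or 1024 dimensions for di-, tri-, tetra-, and pentanucleotides) could then serve as input rows to a SOM or GSOM implementation; the dimensionality grows as 4^k, which is one reason higher-order frequencies are more expensive to train on.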

