Genome project

When printed, the human genome sequence fills around 100 huge books of close print

Genome projects are scientific endeavours that ultimately aim to determine the complete genome sequence of an organism (be it an animal, a plant, a fungus, a bacterium, an archaean, a protist or a virus) and to annotate protein-coding genes and other important genome-encoded features.^[1] The genome sequence of an organism includes the collective DNA sequences of each chromosome in the organism. For a bacterium containing a single chromosome, a genome project will aim to map the sequence of that chromosome. For the human species, whose genome includes 22 pairs of autosomes and 2 sex chromosomes, a complete genome sequence will involve 46 separate chromosome sequences.

The Human Genome Project was a landmark genome project that is already having a major impact on research across the life sciences, with potential for spurring numerous medical and commercial developments.^[2]

Genome assembly

Main article: Sequence assembly

Genome assembly refers to the process of taking a large number of short DNA sequences and putting them back together to create a representation of the original chromosomes from which the DNA originated. In a shotgun sequencing project, all the DNA from a source (usually a single organism, anything from a bacterium to a mammal) is first fractured into millions of small pieces. These pieces are then "read" by automated sequencing machines, which can read up to 1000 nucleotides or bases at a time. (The four bases are adenine, guanine, cytosine, and thymine, represented as AGCT.) A genome assembly algorithm works by taking all the pieces and aligning them to one another, and detecting all places where two of the short sequences, or reads, overlap. These overlapping reads can be merged, and the process continues.
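
The overlap-and-merge idea described above can be illustrated with a minimal sketch. It assumes error-free reads from a single strand and uses a simple greedy strategy; the function names `overlap` and `greedy_assemble` and the `min_len` threshold are illustrative choices, not part of any particular assembler.

```python
def overlap(a: str, b: str, min_len: int = 3) -> int:
    """Length of the longest suffix of `a` that equals a prefix of `b`.

    Returns 0 if no overlap of at least `min_len` bases exists.
    """
    max_len = min(len(a), len(b))
    for n in range(max_len, min_len - 1, -1):
        if a.endswith(b[:n]):
            return n
    return 0


def greedy_assemble(reads: list[str], min_len: int = 3) -> list[str]:
    """Repeatedly merge the pair of sequences with the longest overlap."""
    contigs = list(dict.fromkeys(reads))  # drop exact duplicates, keep order
    while True:
        best_len, best_i, best_j = 0, -1, -1
        for i, a in enumerate(contigs):
            for j, b in enumerate(contigs):
                if i != j:
                    n = overlap(a, b, min_len)
                    if n > best_len:
                        best_len, best_i, best_j = n, i, j
        if best_len == 0:
            return contigs  # nothing left to merge
        # Join the two sequences, writing the shared bases only once.
        merged = contigs[best_i] + contigs[best_j][best_len:]
        contigs = [c for k, c in enumerate(contigs) if k not in (best_i, best_j)]
        contigs.append(merged)


# Three overlapping reads taken from the toy sequence ATGGCGTACGTTAGC.
reads = ["ATGGCGTA", "GCGTACGT", "TACGTTAGC"]
print(greedy_assemble(reads))
```

Real assemblers must also cope with sequencing errors, reads from both strands and far larger inputs, so this all-against-all comparison is only a conceptual model.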

Genome assembly is a very difficult computational problem, made more difficult because many genomes contain large numbers of identical sequences, known as repeats. These repeats can be thousands of nucleotides long, and some occur in thousands of different locations, especially in the large genomes of plants and animals.
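
A small sketch can show why repeats cause trouble: any read that falls entirely inside a repeated stretch matches several locations equally well. The toy sequence, the helper name `repeated_kmers` and the choice of k = 7 below are assumptions made purely for illustration.

```python
from collections import defaultdict


def repeated_kmers(sequence: str, k: int) -> dict[str, list[int]]:
    """Map every length-k substring that occurs more than once to its start positions."""
    positions = defaultdict(list)
    for i in range(len(sequence) - k + 1):
        positions[sequence[i:i + k]].append(i)
    return {kmer: pos for kmer, pos in positions.items() if len(pos) > 1}


# Toy "genome" in which the stretch GATTACA appears twice.
genome = "ACGTGATTACACCCGATTACATTG"
print(repeated_kmers(genome, 7))
```

A 7-base read consisting of GATTACA could have come from either position, so overlap information alone cannot decide where it belongs.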

The resulting (draft) genome sequence is produced by combining the information from sequenced contigs and then employing linking information to create scaffolds. Scaffolds are positioned along the physical map of the chromosomes, creating a "golden path".
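
As a rough illustration of how linking information can order contigs into scaffolds, the sketch below chains contigs together from a list of "A is followed by B" links. The representation of links as ordered pairs, the contig names and the function `build_scaffolds` are hypothetical simplifications; the sketch assumes the links are consistent and that each contig has at most one successor.

```python
def build_scaffolds(contigs: list[str], links: list[tuple[str, str]]) -> list[list[str]]:
    """Chain contigs into scaffolds using (left, right) adjacency links."""
    successor = dict(links)                   # left contig -> right contig
    has_predecessor = {right for _, right in links}
    scaffolds = []
    for name in contigs:
        if name in has_predecessor:
            continue                          # not the first contig of a chain
        chain = [name]
        seen = {name}
        # Follow links until the chain ends (or would revisit a contig).
        while chain[-1] in successor and successor[chain[-1]] not in seen:
            nxt = successor[chain[-1]]
            chain.append(nxt)
            seen.add(nxt)
        scaffolds.append(chain)
    return scaffolds


contigs = ["c1", "c2", "c3", "c4"]
links = [("c3", "c1"), ("c1", "c4")]          # c3 precedes c1, c1 precedes c4
print(build_scaffolds(contigs, links))
```

Contigs without any link, such as c2 here, remain as single-contig scaffolds. In practice, scaffolding also has to estimate gap sizes and contig orientation and to resolve conflicting links, none of which is modelled here.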

Assembly software

Originally, most large-scale DNA sequencing centers developed their own software for assembling the sequences that they produced. However, this has changed as the software has grown more complex and as the number of sequencing centers has increased. One example of such an assembler is the Short Oligonucleotide Analysis Package (SOAP), developed by BGI for de novo assembly of human-sized genomes, alignment, SNP detection, resequencing, indel finding, and structural variation analysis.^[3]^[4]^[5]

Genome annotation

Main article: DNA annotation

Since the 1980s, molecular biology and bioinformatics have created the need for DNA annotation. DNA annotation, or genome annotation, is the process of attaching biological information to sequences, in particular identifying the locations of genes and determining what those genes do.
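
One of the simplest ways to suggest where protein-coding genes might lie is to scan for open reading frames (ORFs): stretches that begin with the start codon ATG and run, in the same reading frame, to a stop codon (TAA, TAG or TGA in the standard genetic code). The sketch below is a minimal illustration that looks at the forward strand only; the function name `find_orfs` and the `min_codons` threshold are assumptions, and real annotation pipelines use far more evidence than this.

```python
STOP_CODONS = {"TAA", "TAG", "TGA"}


def find_orfs(sequence: str, min_codons: int = 3) -> list[tuple[int, int]]:
    """Return half-open (start, end) coordinates of forward-strand ORFs.

    An ORF starts at an ATG and ends just after the first in-frame stop codon.
    """
    orfs = []
    for frame in range(3):
        start = None
        for i in range(frame, len(sequence) - 2, 3):
            codon = sequence[i:i + 3]
            if start is None and codon == "ATG":
                start = i                     # open a candidate ORF
            elif start is not None and codon in STOP_CODONS:
                if (i + 3 - start) // 3 >= min_codons:
                    orfs.append((start, i + 3))
                start = None                  # look for the next start codon
    return sorted(orfs)


dna = "CCATGAAATTTTAGGG"
for start, end in find_orfs(dna):
    print(start, end, dna[start:end])
```

The coordinates returned could then be attached to the sequence as a basic annotation, for example as candidate gene locations to be checked against other evidence.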

When is a genome project finished?

When sequencing a genome, there are usually regions that are difficult to sequence (often regions with highly repetitive DNA). Thus, 'completed' genome sequences are rarely ever complete, and terms such as 'working draft' or 'essentially complete' have been used to more accurately describe the status of such genome projects. Even when every base pair of a genome sequence has been determined, there are still likely to be errors present because DNA sequencing is not a completely accurate process. It could also be argued that a complete genome project should include the sequences of mitochondria and (for plants) chloroplasts as these organelles have their own genomes.

It is often reported that the goal of sequencing a genome is to obtain information about the complete set of genes in that particular genome sequence. The proportion of a genome that encodes genes may be very small (particularly in eukaryotes such as humans, where coding DNA may account for only a few percent of the entire sequence). However, it is not always possible (or desirable) to sequence only the coding regions. Also, as scientists understand more about the role of this noncoding DNA (often referred to as junk DNA), it will become more important to have a complete genome sequence as a background to understanding the genetics and biology of any given organism.

In many ways genome projects do not confine themselves to only determining a DNA sequence of an organism. Such projects may also include gene prediction to find out where the genes are in a genome, and what those genes do. There may also be related projects to sequence ESTs or mRNAs to help find out where the genes actually are.

Historical and technological perspectives

Historically, when sequencing eukaryotic genomes (such as the worm Caenorhabditis elegans) it was common to first map the genome to provide a series of landmarks across the genome. Rather than sequence a chromosome in one go, it would be sequenced piece by piece (with the prior knowledge of approximately where that piece is located on the larger chromosome). Changes in technology, and in particular improvements to the processing power of computers, mean that genomes can now be 'shotgun sequenced' in one go (though this approach has caveats when compared to the traditional approach).

Improvements in DNA sequencing technology have meant that the cost of sequencing a new genome has steadily fallen (in terms of cost per base pair), and newer technology has also meant that genomes can be sequenced far more quickly.

When research agencies decide what new genomes to sequence, the emphasis has been on species which are of high importance as model organisms, have relevance to human health (e.g. pathogenic bacteria or vectors of disease such as mosquitoes), or have commercial importance (e.g. livestock and crop plants). Secondary emphasis is placed on species whose genomes will help answer important questions in molecular evolution (e.g. the common chimpanzee).

In the future, it is likely that it will become even cheaper and quicker to sequence a genome. This will allow for complete genome sequences to be determined from many different individuals of the same species. For humans, this will allow us to better understand aspects of human genetic diversity.

Example genome projects

Main articles: List of sequenced eukaryotic genomes, List of sequenced archaeal genomes, and List of sequenced prokaryotic genomes

L1 Dominette 01449, the Hereford who serves as the subject of the Bovine Genome Project

Many organisms have genome projects that have either been completed or will be completed shortly, including:

Humans, Homo sapiens; see Human genome project
Humans, Homo sapiens; see The Human Genome Project–Write
Palaeo-Eskimo,^[4] an ancient human
Neanderthal, "Homo neanderthalensis" (partial); see Neanderthal Genome Project
Common chimpanzee Pan troglodytes; see Chimpanzee Genome Project
Domestic Cow^[6]^[7]
Bovine Genome
Honey Bee Genome Sequencing Consortium
Horse genome^[8]
Human microbiome project
International Grape Genome Program
International HapMap Project
Tomato 150+ genome resequencing project
100K Genome Project
Genomics England

References

1. Pevsner, Jonathan (2009). Bioinformatics and functional genomics (2nd ed.). Hoboken, N.J: Wiley-Blackwell. ISBN 9780470085851.
2. "Potential Benefits of Human Genome Project Research". Department of Energy, Human Genome Project Information. 2009-10-09. Retrieved 2010-06-18.
3. Li R, Zhu H, Ruan J, Qian W, Fang X, Shi Z, Li Y, Li S, Shan G, Kristiansen K, Li S, Yang H, Wang J, Wang J (February 2010). "De novo assembly of human genomes with massively parallel short read sequencing". Genome Research. 20 (2): 265–272. doi:10.1101/gr.097261.109. ISSN 1549-5469. PMC 2813482. PMID 20019144.
4. Rasmussen M, Li Y, Lindgreen S, Pedersen JS, Albrechtsen A, Moltke I, Metspalu M, Metspalu E, Kivisild T, Gupta R, Bertalan M, Nielsen K, Gilbert MT, Wang Y, Raghavan M, Campos PF, Kamp HM, Wilson AS, Gledhill A, Tridico S, Bunce M, Lorenzen ED, Binladen J, Guo X, Zhao J, Zhang X, Zhang H, Li Z, Chen M, Orlando L, Kristiansen K, Bak M, Tommerup N, Bendixen C, Pierre TL, Grønnow B, Meldgaard M, Andreasen C, Fedorova SA, Osipova LP, Higham TF, Ramsey CB, Hansen TV, Nielsen FC, Crawford MH, Brunak S, Sicheritz-Pontén T, Villems R, Nielsen R, Krogh A, Wang J, Willerslev E (2010-02-11). "Ancient human genome sequence of an extinct Palaeo-Eskimo". Nature. 463 (7282): 757–762. doi:10.1038/nature08835. ISSN 1476-4687. PMC 3951495. PMID 20148029.
5. Wang J, Wang W, Li R, Li Y, Tian G, Goodman L, Fan W, Zhang J, Li J, Zhang J, Guo Y, Feng B, Li H, Lu Y, Fang X, Liang H, Du Z, Li D, Zhao Y, Hu Y, Yang Z, Zheng H, Hellmann I, Inouye M, Pool J, Yi X, Zhao J, Duan J, Zhou Y, Qin J, Ma L, Li G, Yang Z, Zhang G, Yang B, Yu C, Liang F, Li W, Li S, Li D, Ni P, Ruan J, Li Q, Zhu H, Liu D, Lu Z, Li N, Guo G, Zhang J, Ye J, Fang L, Hao Q, Chen Q, Liang Y, Su Y, San A, Ping C, Yang S, Chen F, Li L, Zhou K, Zheng H, Ren Y, Yang L, Gao Y, Yang G, Li Z, Feng X, Kristiansen K, Wong GK, Nielsen R, Durbin R, Bolund L, Zhang X, Li S, Yang H, Wang J (2008-11-06). "The diploid genome sequence of an Asian individual". Nature. 456 (7218): 60–65. doi:10.1038/nature07484. ISSN 0028-0836. PMC 2716080. PMID 18987735. Retrieved 2012-12-22.
6. Yates, Diana (2009-04-23). "What makes a cow a cow? Genome sequence sheds light on ruminant evolution" (Press Release). EurekAlert!. Retrieved 2012-12-22.
7. Elsik, C. G.; Elsik, R. L.; Tellam, K. C.; Worley, R. A.; Gibbs, D. M.; Muzny, G. M.; Weinstock, D. L.; Adelson, E. E.; Eichler, L.; Elnitski, R.; Guigó, D. L.; Hamernik, S. M.; Kappes, H. A.; Lewin, D. J.; Lynn, F. W.; Nicholas, A.; Reymond, M.; Rijnkels, L. C.; Skow, E. M.; Zdobnov, L.; Schook, J.; Womack, T.; Alioto, S. E.; Antonarakis, A.; Astashyn, C. E.; Chapple, H. -C.; Chen, J.; Chrast, F.; Câmara, O.; Ermolaeva, C. N. (2009). "The Genome Sequence of Taurine Cattle: A Window to Ruminant Biology and Evolution". Science. 324 (5926): 522–528. doi:10.1126/science.1169588. PMC 2943200. PMID 19390049.
8. http://www.genome.gov/20519480

External links

GOLD:Genomes OnLine Database
Genome Project Database
The Protein Naming Utility
SUPERFAMILY
EchinoBase An Echinoderm genomic database, (previous SpBase, a sea urchin genome database)
NRCPB.
Global Invertebrate Genomics Alliance (GIGA)

This article is issued from Wikipedia - version of the 11/19/2016. The text is available under the Creative Commons Attribution/Share Alike but additional terms may apply for the media files.