Bioinformatics
and Role of Software Engineers in It
By-
Makarand Arjun Kokane
BE-2
Roll number 222
Year 2002-2003
Pune Institute of Computer Technology
Pune 411043
CERTIFICATE
This is to certify that Mr. Makarand Arjun Kokane has successfully completed his seminar on the topic “Bioinformatics and role of software engineers in it” under the guidance of Prof. R. B. Ingle towards the partial completion of the Bachelor’s degree in Computer Engineering at Pune Institute of Computer Technology during the academic year 2002-2003.
Date:
Seminar Incharge HOD Principal
(Computer Engineering) PICT
Acknowledgement
I take this opportunity to thank respected Prof. R. B. Ingle Sir (my seminar guide) for his generous assistance. I am immensely grateful to our HOD Prof. Dr. C.V.K Rao for his encouragement and guidance. I extend my sincere thanks to our college library staff and all the staff members for their valuable assistance. I am also thankful to my fellow colleagues for their help and suggestions.
Makarand Kokane
(B.E. – 2)
Abstract
Bioinformatics is the application of computers in biological sciences. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data.
The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. Through this report I have tried to analyze the future requirements for developing advanced technologies in this field and the role that we, as software engineers, can play in their development.
Contents
1 Introduction to Bioinformatics
1.1 What is Bioinformatics?
1.2 Computers and Biology
1.3 Limitations in the use of computers
1.4 Current Stage of Research
1.5 Microbial, Plant and Animal Genomes
1.6 History (Stages of development)
2 Basics of Molecular Biology
2.1 Nucleotide
2.2 Amino acid
2.3 Properties of Genetic Code
2.4 DNA (Deoxy-ribonucleic Acid)
2.5 Chromosomes
2.6 Gene
2.7 Protein
2.8 Sequencing
2.9 Genome
2.10 Clone
2.11 Model Organism
3 Role of Software Engineers and Technology in Biotechnology
3.1 Need for software automation
3.2 Genetic Algorithms
3.2.1 Database Searching
3.2.2 Comparing Two Sequences
3.2.3 Multiple Sequence Alignment
3.3 Genome Projects
3.4 Goals for Advancements in Sequencing Technology
3.5 Developing Technology to handle Sequence Variations
3.6 Need for Technology in Functional Genomics
3.7 Bioinformatics and Computational Biology
3.8 Job Opportunities and Job Requirements
3.9 Training Goals included in the Human Genome Project Plan
4 Human Genome Project
4.1 Introduction
4.2 Details of the Human Genome Project
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
4.3.2 Sequencing Technology
4.3.3 Sequence Variation
4.3.4 Functional Genomics
4.3.5 Comparative Genomics
4.3.6 Ethical, Legal, and Social Implications (ELSI)
4.3.7 Bioinformatics and Computational Biology
4.3.8 Training
5 Biological Databases
5.1 The Biological sequence/structure deficit
5.2 Biological Databases
5.3 Primary Sequence Databases
5.3.1 Nucleic acid Sequence Databases
5.3.2 Protein Sequence Databases
5.4 Composite Protein Sequence Databases
5.5 Secondary Databases
5.6 Tertiary Databases
6 Applications of Bioinformatics
6.1 Application to the Ailments of Diseases
6.2 Application of Bioinformatics to Agriculture
6.2.1 Improvements in Crop Yield and Quality
6.3 Applications of Microbial Genomics
6.4 Risk Assessment
6.5 Evolution and Human Migration
6.6 DNA Forensics (Identification)
Bibliography
Chapter 1
Introduction to Bioinformatics
1.1 What is Bioinformatics?
Bioinformatics is the application of computers in biological sciences, and especially the analysis of biological sequence data. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data. With new sequences being added to DNA databases, on average, once every minute, there is a pressing need to convert this information into biochemical and biophysical knowledge by deciphering the structural, functional and evolutionary clues encoded in the language of biological sequences.
What Bioinformatics therefore offers to the researcher, the entrepreneur, or the venture capitalist is an enormous and exciting array of opportunities to discover how living systems metabolise, grow, combat disease, reproduce and regenerate. The current knowledge represents only the tip of the iceberg. Exciting and startling discoveries are being made every day through Bioinformatics, which is building up an extensive encyclopedia from which life’s mysteries will be unraveled. The importance of computational science in collating this information and its simultaneous interpretation by biologists is the underlying ethos of Bioinformatics.
An interest in biology and a strong inclination towards genetics certainly help. From our point of view, however, the most important point is that biocomputing requires large numbers of software professionals, and there is more for these people to do than there is for experts in biology alone.
1.2 Computers and Biology
Bioinformatics is the symbiotic relationship between computational and biological sciences. The ability to sort and extract genetic codes from a human genomic database of 3 billion base pairs of DNA in a meaningful way is perhaps the simplest form of Bioinformatics. Moving on to another level, Bioinformatics is useful in mapping different people’s genomes and deriving differences in their genetic make-up through pattern recognition software. But that is the easiest part. What is more complex is deciphering what the differences in genetic make-up between different people translate into in terms of physiological traits. And there is yet another level, which is even more intricate: the interpretation of the genetic code itself. The genetic code actually codes for amino acids and thereby proteins, and the specific role played by each of these proteins controls the state of our health. The role or function of each of our genes in coding for a specific protein, which in turn regulates a particular metabolic pathway, is described as “functional genomics”. The true benefit of Bioinformatics therefore lies in harnessing information pertaining to these genetic functions in order to understand how human beings and other living systems operate.
Computational simulation of experimental biology is an important application of Bioinformatics, which is referred to as “in silico” testing. This is perhaps an area that will expand in a prolific way, given the need to obtain a greater degree of predictability in animal and human clinical trials. Added to this is the interesting scope that “in silico” testing provides to deal with the growing hostility towards animal testing. The growth of this sector will largely depend on the acceptance of “in silico” testing by the regulatory authorities. However, irrespective of this, research strategies will certainly find computational modeling to be a vital tool in speeding up research with enormous cost benefits.
1.3 Limitations in the use of computers
The last decade has witnessed the dawn of a new era of ‘silicon-based’ biology, opening the door, for the first time, to the possible investigation and comparative analysis of complete genomes. Genome analysis means elucidating and characterizing the genes and gene products of an organism. It depends on a number of pivotal concepts concerning the processes of evolution (divergence and convergence), the mechanism of protein folding, and the manifestation of protein function.
Today, our use of computers to model such processes is limited by, and must be placed in the context of, the current limits of our understanding of these central themes. At the outset, it is important to recognize that we do not yet fully understand the rules of protein folding; we cannot invariably say that a particular sequence or a fold has arisen by divergent or convergent evolution; and we cannot necessarily diagnose a protein function, given knowledge only of its sequence or of its structure, in isolation. Accepting what we cannot do with computers plays an essential role in forming an appreciation of what, in fact, we can do. Without this kind of understanding, it is easy to be misled, as spurious arguments are often used to promote perhaps rather overenthusiastic points of view about what particular programs and software packages can achieve.
Nature has its own complex rules, which we only poorly understand and which we cannot easily encapsulate within computer programs. No current algorithm can ‘do’ biology. Programs provide mathematical models of biological systems, and such models are necessarily simplifications. To interpret correctly whether sequences or structures are meaningfully similar, whether they have arisen by the processes of divergence or convergence, whether similar sequences or similar folds have the same or different functions: these are the most challenging problems. There are no simple solutions, and computers do not give us the answers; rather, given a sea of data, they help to narrow the options down so that the users can begin to draw informed, biologically reasonable conclusions.
1.4 Current Stage of Research
In the field of Bioinformatics, the current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem, tackling the identification of protein function from the perspectives of sequence analysis and of structure analysis respectively. From the point of view of sequence analysis, we are concerned with the detection of relationships between newly determined sequences and those of known function (usually within a database). This may mean pinpointing functional sites shared by disparate proteins (probably the result of convergent evolution), or identifying related functions in similar proteins (most commonly the result of divergent evolution).
The identification of protein function from sequence sounds straightforward, and indeed, sequence analysis is usually a fruitful technique. However, function cannot be inferred from sequence for about one-third of proteins in any of the sequenced genomes, largely because biological characterization cannot keep pace with the volume of data issuing from the genome projects (a large number of database sequences thus either carry no annotation beyond the parent gene name, or are simply designated as hypothetical proteins). Another important point is that, in some instances, closely related sequences, which may be assumed to share a common structure, may not share the same function. What this means is that, though sequence or structure analysis can be used to deduce gene functions, neither technique can be applied infallibly without reference to the underlying biology.
1.5 Microbial, Plant and Animal Genomes
Although the human genome appears to be the focal point of interest, microbial, plant and animal genomes are equally exciting to explore through Bioinformatics. Mining plant genomics has an important impact on opening up new vistas for research in agriculture. Microbial genomics offers a dual opportunity of developing new fermentation-based products and technologies as well as defining new ways of combating microbial infections. Exploring animal genomics opens up unlimited scope to pursue research in veterinary science and transgenic models.
1.6 History (Stages of development)
The science of sequencing began slowly. The earliest techniques were based on methods for separation of proteins and peptides, coupled with methods for identification and quantification of amino acids. Prior to 1945, there was not a single quantitative analysis available for any one protein. However, significant progress with chromatographic and labeling techniques over the next decade eventually led to the elucidation of the first complete sequence, that of the peptide hormone insulin (1955). Yet it took another five years before the sequence of the first enzyme (ribonuclease) was complete (1960). By 1965, around 20 proteins with more than 100 residues had been sequenced, and by 1980, the number was estimated to be of the order of 1500. Today, there are more than 300,000 sequences available.
Initially, the majority of protein sequences were obtained by the manual process of sequential Edman degradation and dansylation. A key step towards the rapid increase in the number of sequenced proteins was the development of automated sequencers, which, by 1980, offered a 10^4-fold increase in sensitivity relative to the automated procedure implemented by Edman and Begg in 1967.
In the 1960s, scientists struggled to develop methods to sequence nucleic acids, but the first techniques to emerge were really only applicable to tRNA, because tRNA molecules are short (74 to 95 nucleotides in length) and it is possible to purify individual molecules.
As against RNA, DNA is very long: human chromosomal molecules may contain between 55 x 10^6 and 250 x 10^6 base pairs. Assembling the complete nucleotide sequence of a complete DNA molecule is a huge task. Even if the sequence can be broken down into smaller fragments, purification remains a problem. The advent of gene cloning provided a solution to how the fragments could be separated. By 1977, two sequencing methods had emerged, using chain termination and chemical degradation approaches. With only minor changes, the techniques propagated to laboratories throughout the world, and laid the foundation for the sequence revolution of the next two decades, and the subsequent birth of Bioinformatics.
During the last decade, molecular biology has witnessed an information revolution as a result both of the development of rapid DNA sequencing techniques and of the corresponding progress in computer-based technologies, which are allowing us to cope with this information deluge in increasingly efficient ways. The broad term that was coined in the mid-1980s to encompass computer applications in biological sciences is Bioinformatics. The term Bioinformatics has been commandeered by several different disciplines to mean rather different things. In its broadest sense, the term can be considered to mean information technology applied to the management and analysis of biological sequence data; this has implications in diverse areas, ranging from artificial intelligence and robotics to genome analysis. In the context of genome initiatives, the term was originally applied to the computational manipulation and analysis of biological sequence data. However, in view of the recent rapid accumulation of available protein structures, the term now tends also to be used to embrace the manipulation and analysis of 3D structure data.
Chapter 2
Basics of Molecular Biology
This chapter briefly explains some of the common biological terms that are absolutely essential for a clear understanding of what exactly Bioinformatics is all about. I have avoided getting into the intricacies of Genetics because the basic aim of this report is to know the latest developments in the field of Bioinformatics, try to visualize where it is heading, understand what it has to offer to the community, and exploit the opportunities available in this field.
2.1 Nucleotide
A nucleotide is a molecule made up of three sub-units: a pentose sugar, a nitrogen base and a phosphate. Nucleic acids are polymers of nucleotides. The pentose sugar is either ribose or deoxyribose (this decides whether the genetic material formed is RNA or DNA). Nitrogen bases are of two types: purines (Adenine (A), Guanine (G)) and pyrimidines (Cytosine (C), Thymine (T) and Uracil (U)).
2.2 Amino acid
It is the fundamental building block of proteins. There are 20 naturally occurring amino acids in animals and around 100 more found only in plants. A sequence of three nucleotides codes for one amino acid. The logic behind this is as follows: there are four types of nucleotides depending on the nitrogenous base, (A,G,C,T) in DNA and (A,G,C,U) in RNA. 20 different amino acids are to be coded using permutations of 4 types of nucleotides. So obviously, 3 nucleotides are required to signify one amino acid (4^3 = 64 > 20), because fewer than 3 would be insufficient (4^2 = 16) and more than 3 would cause redundancy. The sequence of three nucleotides specifying an amino acid is called a triplet code or codon (coding unit). All 64 codons specify something or the other. Most of them specify amino acids, but a few are instructions for starting and stopping the synthesis.
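As a small illustration of the triplet-code arithmetic above, the Python sketch below enumerates the 4^3 = 64 possible codons and translates a toy DNA fragment three bases at a time. It is purely illustrative: the codon table shown is only a partial excerpt of the standard genetic code, and the input fragment and helper names are made up.

```python
from itertools import product

BASES = "ACGT"

# All possible codons: 4^3 = 64, comfortably enough to encode 20 amino acids.
all_codons = ["".join(c) for c in product(BASES, repeat=3)]
print(len(all_codons))  # 64

# A small excerpt of the standard codon table (three-letter amino acid codes).
# '*' marks a stop codon; the full table has 64 entries.
CODON_TABLE = {
    "ATG": "Met",  # also the usual start codon
    "TTT": "Phe", "TTC": "Phe",
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",  # degeneracy
    "TAA": "*", "TAG": "*", "TGA": "*",
}

def translate(dna: str) -> list[str]:
    """Read the sequence three bases at a time, non-overlapping, from a fixed start."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        aa = CODON_TABLE.get(codon, "Xaa")  # 'Xaa' = codon not in our partial table
        if aa == "*":
            break  # a stop codon ends synthesis
        protein.append(aa)
    return protein

print(translate("ATGTTTGGATAA"))  # ['Met', 'Phe', 'Gly']
```

Note how several codons map to the same amino acid and how a stop codon terminates translation; these correspond to the degeneracy and start/stop properties listed in the next section.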
2.3 Properties of Genetic Code
1. Three nucleotides in a DNA molecule code for one amino acid in the corresponding protein. Such a triplet is called a codon.
2. The code is read from a fixed starting point.
3. Codes for starting and stopping are present, but not for a pause in the middle, or comma.
4. The nucleotides are read three at a time in a non-overlapping manner.
5. Most of the 64 possible nucleotide triplets stand for one amino acid or the other.
6. A few triplets stand for starting and stopping the synthesis.
7. There are two or more different codons for the same amino acid. Because of this, the genetic code is said to be degenerate.
8. The code has polarity because it can be read only in one direction.
9. The code is universal. Practically all the organisms use the same code.
2.4 DNA (Deoxy-ribonucleic Acid)
The long, thread-like DNA molecule consists of two strands that are joined to one another all along their length. Each strand is a polymer made up of repeated sub-units (nucleotides). Hence each strand is also called a polynucleotide. DNA is the basic genetic material in all the living material existing on this earth. The two essential mechanisms possessed by DNA are (1) transmission of hereditary characters and (2) the ability of self-duplication. In the DNA molecule, two long polynucleotide chains are spirally twisted around each other. This is also called helical coiling, and the DNA is often referred to as a double helix. A polynucleotide chain has polarity, and the two strands of a DNA molecule run in opposite directions; hence they are said to be antiparallel. The two chains are joined together by hydrogen bonds existing between the nitrogenous bases on the inside. Adenine (A) forms a bond only with Thymine (T), and Guanine (G) can form a bond only with Cytosine (C). Because of the base pairing restriction, the two strands are always complementary to each other.
The sequence of bases along the polynucleotide is not restricted in any way. An infinite variety of combinations is possible. It is the precise sequence of bases that determines the genetic information. There is no theoretical limit to the length of a DNA molecule.
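Because A pairs only with T and G only with C, either strand fully determines its antiparallel partner. A minimal Python sketch of this complementarity (the example sequence is an arbitrary, made-up fragment):

```python
# Complementary base pairing: A<->T, G<->C.
COMPLEMENT = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    """Return the antiparallel partner strand implied by the base-pairing rules."""
    return "".join(COMPLEMENT[base] for base in reversed(strand))

strand = "ATGCGTTA"            # arbitrary example sequence
partner = reverse_complement(strand)
print(partner)                  # TAACGCAT
```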
2.5 Chromosomes
Chromosomes are the paired, self-replicating genetic structures of cells that contain the cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.
2.6 Gene
A gene is the fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e. a protein or RNA molecule).
2.7 Protein
Protein is a molecule composed of one or more chains of amino acids in a specific order. The order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function and regulation of cells, tissues and organs, each protein having a specific role (e.g., hormones, enzymes and antibodies).
DNA carries the hereditary material, and the only thing it does is direct the synthesis of proteins; thereafter, all the hereditary characteristics get reflected in the activities of the body cells through these proteins.
2.8 Sequencing
Sequencing means the determination of the order of nucleotides (base sequences) in a DNA or RNA molecule, or the order of amino acids in a protein.
2.9 Genome
Genome of an organism means all the genetic material in its chromosomes. Its size is generally given as its total number of base pairs. Genomes of different organisms can be compared to identify similarities and disparities in the strategies for the ‘Logic of Life’.
2.10 Clone
A clone is an exact copy made of biological material such as a DNA segment, a whole cell or a complete organism. The process of creating a clone is called cloning.
2.11 Model Organism
Saccharomyces cerevisiae, commonly known as baker’s yeast, has emerged as the model organism. It has demonstrated the fundamental conservation of the basic informational pathways found in almost all organisms. From the detailed study of the genomes of such organisms (which is possible today), we can gain an insight into their functioning. All this data will lead to fundamental insights into human biology.
The vast amount of genetic data available on this species provides important clues helpful for the ongoing research on human genetics. Saccharomyces cerevisiae has become the workhorse of many biotechnology labs. It can exist either in a haploid or a diploid state and divides by the vegetative process of budding. Yeast cultures can be easily propagated in labs. It has become the model organism partly because of the ease with which genetic manipulations can be carried out. Random mutations can be induced into the genome by the treatment of live cells with chemicals such as ethyl methanesulphonate or by exposure to ultraviolet rays. Targeted gene inactivations can also be carried out; this property is very important during experiments for the unambiguous assignment of gene functions.
Saccharomyces cerevisiae has a compact genome of about 12 million base pairs of DNA present on 16 chromosomes. This presented a reasonable goal for complete sequencing and analysis of its genome. The Saccharomyces Genome Database (SGD) was established at Stanford University in 1995.
Knowing the complete sequence of a genome is only the first step in understanding how the huge amount of information contained in genes is translated into functional proteins.
Chapter 3
Role of Software Engineers and Technology in Biotechnology
The tools of computer science, statistics and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the 3-D and 4-D (time varying) languages of living organisms.
3.1 Need for software automation
DNA encodes the information necessary for building and maintaining life. DNA is a non-branching, double-stranded macromolecule in which the nucleotide building blocks (A,C,G,T) are linked. Bases are arranged in A-T and C-G pairs. Small viral genomes of the order of several thousand bases were the first to be sequenced, in 1970. A few years later, genomes of the order of 40 kilobase pairs represented the limit of what could reasonably be sequenced. At this stage, the need for automation was recognized and methods were applied to the degree possible. By the year 1997, the yeast genome consisting of 12 megabase pairs was completed, and in 1998, the conclusion of the 100 megabase pair nematode genome project was announced. Most recently, the 180 megabase pair fruit-fly genome was also completed. All of these projects relied on substantially higher levels of software automation. We are now in the midst of the most ambitious project so far: sequencing of the 3 gigabase pair human genome. For this effort, and those yet to come, software automation lies at the very core of the planning and execution of the project.
The need for automation is driven largely by the trend of handling ever larger sizes of DNA and the corresponding increase in the amount of raw data this entails. Mathematical analysis indicates that the size of a project is roughly proportional to the size of the genome. This is due to the fact that the amount of information obtained from an individual sequencing experiment is relatively constant and is independent of the genome size. It is estimated that for the human genome, on the order of 10^8 individual experiments are required to cover the genome. To meet the projected goals, modern large-scale sequencing centers have developed throughput capacities of the order of several million experiments per month, with data processing handled on a continuous basis. Managing such large projects without a high degree of automation would clearly be impossible in terms of cost and time requirements.
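A rough back-of-the-envelope check of that 10^8 figure is sketched below in Python. The read length and coverage used are illustrative assumptions for the purpose of the calculation, not figures reported by the sequencing centers.

```python
# Rough estimate of the number of sequencing experiments needed for the human genome.
# The read length and coverage below are illustrative assumptions, not HGP figures.
genome_size = 3e9       # base pairs in the human genome
read_length = 500       # bases of usable sequence per experiment (assumed)
coverage = 10           # each base sequenced ~10 times over for accuracy (assumed)

experiments = genome_size * coverage / read_length
print(f"{experiments:.1e}")   # ~6.0e+07, i.e. roughly the order-of-10^8 figure quoted above
```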
So, DNA is the basic genetic material. It transmits hereditary characters from one generation to the next. During synthesis of proteins, mRNA molecules, which act as the messengers of information (the exact genetic code), are built from DNA. Proteins are synthesized using mRNA molecules. Protein interactions give rise to information pathways and networks, which help in building cells that are identical to their parent cells. Clustering of many cells in a predefined format composes a tissue. An organ is a combination of tissues, and an organism is nothing but an organization of organs. Refer to figure 3.1.
The challenge for computer professionals is to create tools that can capture and integrate these different levels of biological information.
3.2 Genetic Algorithms
All that computers can do is implement algorithms. Hence when we talk of using computers for processing of biological information, we have to define precise mathematical algorithms. Following are a few absolutely basic algorithms in Bioinformatics.
3.2.1 Database Searching
Database interrogation can take the form of text queries (e.g. Display all the human adrenergic receptors) or sequence similarity searches (e.g. Given the sequence of a human adrenergic receptor, display all the similar sequences in the database). Sequence similarity searches are straightforward because the data in the databases is mostly in the form of sequences.
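Below is a minimal Python sketch of both kinds of interrogation over a toy in-memory “database”. Everything here is invented for illustration: the entries, descriptions and sequences are made up, and the shared-k-mer scoring is only a crude stand-in for the statistically rigorous similarity searches (for example, BLAST or FASTA) used in practice.

```python
# A toy sequence database: identifier -> (description, sequence). All entries are made up.
DATABASE = {
    "P1": ("human beta-2 adrenergic receptor (made-up entry)", "MGQPGNGSAFLLAPNGSHA"),
    "P2": ("human rhodopsin (made-up entry)",                  "MNGTEGPNFYVPFSNKTGV"),
    "P3": ("yeast hypothetical protein (made-up entry)",       "MSTNPKPQRKTKRNTNRRP"),
}

def text_query(keyword: str) -> list[str]:
    """Text interrogation: return identifiers whose description mentions the keyword."""
    return [pid for pid, (desc, _seq) in DATABASE.items() if keyword.lower() in desc.lower()]

def similarity_search(query: str, k: int = 3) -> list[tuple[str, int]]:
    """Crude similarity ranking: count k-mers (length-k substrings) shared with each entry."""
    query_kmers = {query[i:i + k] for i in range(len(query) - k + 1)}
    scores = []
    for pid, (_desc, seq) in DATABASE.items():
        seq_kmers = {seq[i:i + k] for i in range(len(seq) - k + 1)}
        scores.append((pid, len(query_kmers & seq_kmers)))
    return sorted(scores, key=lambda item: item[1], reverse=True)

print(text_query("adrenergic"))           # ['P1']
print(similarity_search("MGQPGNGSAFLL"))  # most similar entries first
```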
3.2.2 Comparing Two Sequences
Let us take the case of comparing two protein sequences. The alphabet complexity is 20, since a protein is nothing but a sequence of amino acids and there are 20 possible amino acids. The naïve approach is to line up the sequences against each other and insert additional characters to bring the two strings into vertical alignment. The more the matches, the closer the two sequences.
The process of alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining in the alignment. A metric relating such parameters represents the distance between two sequences.
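As a concrete sketch of such a distance metric, the following Python function computes, by dynamic programming, the minimum total cost of aligning two sequences when every gap and every mismatch carries a fixed penalty. The unit penalties and the two short fragments are arbitrary choices for illustration; real protein comparisons use substitution matrices such as PAM or BLOSUM rather than a flat mismatch cost.

```python
def alignment_distance(seq_a: str, seq_b: str,
                       gap_cost: int = 1, mismatch_cost: int = 1) -> int:
    """Minimum total cost of globally aligning seq_a with seq_b (dynamic programming)."""
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    # dp[i][j] = cost of aligning the first i characters of seq_a with the first j of seq_b
    dp = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dp[i][0] = i * gap_cost
    for j in range(cols):
        dp[0][j] = j * gap_cost
    for i in range(1, rows):
        for j in range(1, cols):
            match = dp[i - 1][j - 1] + (0 if seq_a[i - 1] == seq_b[j - 1] else mismatch_cost)
            gap_a = dp[i - 1][j] + gap_cost   # gap inserted in seq_b
            gap_b = dp[i][j - 1] + gap_cost   # gap inserted in seq_a
            dp[i][j] = min(match, gap_a, gap_b)
    return dp[-1][-1]

# Two short, made-up protein fragments (one-letter amino acid codes).
print(alignment_distance("HEAGAWGHEE", "PAWHEAE"))  # smaller score = closer sequences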
3.2.3 Multiple Sequence Alignment
In the previous sub section, we saw pairwise sequence alignment, which is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a concise information-rich summary of sequence data in order to inform decision-making on the relatedness of sequences to a gene family.
Multiple sequence alignment is a 2D table, in which the rows represent individual sequences and the columns the residue positions. The sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register.
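A toy Python illustration of this row-and-column view follows: the aligned sequences are made-up strings padded with '-' gap characters, and a simple per-column consensus summarizes the family. Real alignment programs, of course, also have to construct the alignment itself, not just summarize it.

```python
from collections import Counter

# A made-up multiple alignment: rows are sequences, columns are residue positions.
alignment = [
    "MK-TAYIA",
    "MKVTAYIA",
    "MK-TSYIA",
]

def column_consensus(rows: list[str]) -> str:
    """Most common residue in each column ('-' treated like any other symbol)."""
    consensus = []
    for column in zip(*rows):                 # iterate column by column
        residue, _count = Counter(column).most_common(1)[0]
        consensus.append(residue)
    return "".join(consensus)

print(column_consensus(alignment))  # MK-TAYIA
```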
3.3 Genome Projects
In the mid-1980s, the United States Department of Energy initiated a number of projects to construct detailed genetic and physical maps of the human genome, to determine its complete nucleotide sequence, and to localize its estimated 100,000 genes. Work on this scale required the development of new computational methods for analysing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analysing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project.
Similar research efforts were also launched to map and sequence the genomes of a variety of organisms used extensively in research labs as model systems. By April 1998, although only a small number of relatively small genomes had been completely sequenced, and the human genome was not expected to be complete until after the year 2003, the results of such projects were already beginning to pour into the public sequence databases in overwhelming numbers.
3.4 Goals for Advancements in Sequencing Technology
DNA sequencing technology has improved dramatically since the genome projects began. The amount of sequence produced each year is increasing steadily; individual centers are now producing tens of millions of base pairs of sequence annually. In the future, de novo sequencing of additional genomes, comparative sequencing of closely related genomes, and sequencing to assess variation within genomes will become increasingly indispensable tools for biological and medical research. Much more efficient sequencing technology will be needed than is currently available. The incremental improvements made to date have not yet resulted in any fundamental paradigm shifts. Nevertheless, the current state-of-the-art technology can still be significantly improved, and resources should be invested to accomplish this. Beyond that, research must be supported on new technologies that will make even higher throughput DNA sequencing efficient, accurate, and cost-effective, thus providing the foundation for other advanced genomic analysis tools. Progress must be achieved in three areas:
a) Continue to increase the throughput and reduce the cost of current sequencing technology.
Increased automation, miniaturization, and integration of the approaches currently in use, together with incremental, evolutionary improvements in all steps of the sequencing process, are needed to yield further increases in throughput (to at least 500 Mb of finished sequence per year by 2003) and reductions in cost. At least a twofold cost reduction from current levels (which average $0.50 per base for finished sequence in large-scale centers) should be achieved in the next 5 years. Production of the working draft of the human sequence will cost considerably less per base pair.
b) Support research on novel technologies that can lead to significant improvements in sequencing technology.
New conceptual approaches to DNA sequencing must be supported to attain substantial improvements over the current sequencing paradigm. For example, microelectromechanical systems (MEMS) may allow significant reduction of reagent use, increase in assay speed, and true integration of sequencing functions. Rapid mass spectrometric analysis methods are achieving impressive results in DNA fragment identification and offer the potential for very rapid DNA sequencing. Other more revolutionary approaches, such as single-molecule sequencing methods, must be explored as well. Significant investment in interdisciplinary research in instrumentation, combining chemistry, physics, biology, computer science, and engineering, will be required to meet this goal. Funding of far-sighted projects that may require 5 to 10 years to reach fruition will be essential. Ultimately, technologies that could, for example, sequence one vertebrate genome per year at affordable cost are highly desirable.
c) Develop effective methods for the advanced development and introduction of new sequencing technologies into the sequencing process.
As the scale of sequencing increases, the introduction of improvements into the production stream becomes more challenging and costly. New technology must therefore be robust and be carefully evaluated and validated in a high-throughput environment before its implementation in a production setting. A strong commitment from both the technology developers and the technology users is essential in this process. It must be recognized that the advanced development process will often require significantly more funds than proof-of-principle studies. Targeted funding allocations and dedicated review mechanisms are needed for advanced technology development.
3.5 Developing Technology to handle Sequence Variations
Natural sequence variation is a fundamental property of all genomes. Any two haploid human genomes show multiple sites and types of polymorphism. Some of these have functional implications, whereas many probably do not. The most common polymorphisms in the human genome are single base-pair differences, also called single-nucleotide polymorphisms (SNPs). When two haploid genomes are compared, SNPs occur every kilobase, on average. Other kinds of sequence variation, such as copy number changes, insertions, deletions, duplications, and rearrangements also exist, but at low frequency and their distribution is poorly understood. Basic information about the types, frequencies, and distribution of polymorphisms in the human genome and in human populations is critical for progress in human genetics. Better high-throughput methods for using such information in the study of human disease are also needed.
SNPs are abundant, stable, widely distributed across the genome, and lend themselves to automated analysis on a very large scale, for example, with DNA array technologies. Because of these properties, SNPs will be a boon for mapping complex traits such as cancer, diabetes, and mental illness. Dense maps of SNPs will make possible genome-wide association studies, which are a powerful method for identifying genes that make a small contribution to disease risk. In some instances, such maps will also permit prediction of individual differences in drug response. Publicly available maps of large numbers of SNPs distributed across the whole genome, together with technology for rapid, large-scale identification and scoring of SNPs, must be developed to facilitate this research.
a) Develop technologies for rapid, large-scale identification of SNPs and other DNA sequence variants. The study of sequence variation requires efficient technologies that can be used on a large scale and that can, for example, rapidly identify many thousands of new SNPs in large numbers of samples. Although the immediate emphasis is on SNPs, ultimately technologies that can be applied to polymorphisms of any type must be developed. Technologies are also needed that can rapidly compare, by large-scale identification of similarities and differences, the DNA of a species that is closely related to one whose DNA has already been sequenced. The technologies that are developed should be cost-effective and broadly accessible.
b) Identify common variants in the coding regions of the majority of identified genes. Initially, association studies involving complex diseases will likely test a large series of candidate genes; eventually, sequences in all genes may be systematically tested. SNPs in coding sequences (also known as cSNPs) and the associated regulatory regions will be immediately useful as specific markers for disease. An effort should be made to identify such SNPs as soon as possible. Ultimately, a catalog of all common variants in all genes will be desirable. This should be cross-referenced with cDNA sequence data.
c) Create an SNP map of at least 100,000 markers. A publicly available SNP map of sufficient density and informativeness to allow effective mapping in any population is the ultimate goal. A map of 100,000 SNPs (one SNP per 30,000 nucleotides) is likely to be sufficient for studies in some relatively homogeneous populations, while denser maps may be required for studies in large, heterogeneous populations. Thus, during this 5-year period, the HGP authorities have planned to create a map of at least 100,000 SNPs. If technological advances permit, a map of greater density is desirable. Research should be initiated to estimate the number of SNPs needed in different populations.
d) Develop the intellectual foundations for studies of sequence variation. The methods and concepts developed for the study of single-gene disorders are not sufficient for the study of complex, multigene traits. The study of the relationship between human DNA sequence variation, phenotypic variation, and complex diseases depends critically on better methods. Effective research design and analysis of linkage, linkage disequilibrium, and association data are areas that need new insights. Questions such as which study designs are appropriate to which specific populations, and with which population genetics characteristics, must be answered. Appropriate statistical and computational tools and rigorous criteria for establishing and confirming associations must also be developed.
e) Create public resources of DNA samples and cell lines. To facilitate SNP discovery it is critical that common public resources of DNA samples and cell lines be made available as rapidly as possible. To maximize discovery of common variants in all human populations, a resource is needed that includes individuals whose ancestors derive from diverse geographic areas. It should encompass as much of the diversity found in the human population as possible. Samples in this initial public repository should be totally anonymous to avoid concerns that arise with linked or identifiable samples.
DNA samples linked to phenotypic data and identified as to their geographic and other origins will be needed to allow studies of the frequency and distribution of DNA polymorphisms in specific populations and their relevance to disease. However, such collections raise many ethical, legal, and social concerns that must be addressed. Credible scientific strategies must be developed before creating these resources.
3.6 Need for Technology in Functional Genomics
Functional genomics is the interpretation of the function of DNA sequence on a genomic scale. Already, the availability of the sequence of entire organisms has demonstrated that many genes and other functional elements of the genome are discovered only when the full DNA sequence is known. Such discoveries will accelerate as sequence data accumulate. However, knowing the structure of a gene or other element is only part of the answer. The next step is to elucidate function, which results from the interaction of genomes with their environment. Current methods for studying DNA function on a genomic scale include comparison and analysis of sequence patterns directly to infer function, large-scale analysis of the messenger RNA and protein products of genes, and various approaches to gene disruption. In the future, a host of novel strategies will be needed for elucidating genomic function. This will be a challenge for all of biology. The HGP will be contributing to this area by emphasizing the development of technology that can be used on a large scale, is efficient, and is capable of generating complete data for the genome as a whole. To the extent that available resources allow, expansion of current approaches as well as innovative technology ideas should be supported in the areas described below.
a) Develop cDNA resources. Complete sets of full-length cDNA clones and sequences for both humans and model organisms would be enormously useful for biologists and are urgently needed. Such resources would help in both gene discovery and functional analysis. High priority should be placed on developing technology for obtaining full-length cDNAs. Complete and validated inventories of full-length cDNA clones and corresponding sequences should be generated and made available to the community once such technology is at hand.
b) Improved technologies are needed for global approaches to the study of non-protein-coding sequences, including production of relevant libraries, comparative sequencing, and computational analysis.
c) Develop technology for comprehensive analysis of gene expression. Information about the spatial and temporal patterns of gene expression in both humans and model organisms offers one key to understanding gene expression. Efficient and cost-effective technology needs to be developed to measure various parameters of gene expression reliably and reproducibly. Complementary DNA sequences and validated sets of clones with unique identifiers will be needed for array technologies, large-scale in situ hybridization, and other strategies for measuring gene expression. Improved methods for quantifying, representing, analyzing, and archiving expression data should also be developed.
d) Improve methods for genome-wide mutagenesis. Creating mutations that cause loss or alteration of function is another prime approach to studying gene function. Technologies, both gene- and phenotype-based, which can be used on a large scale in vivo or in vitro, are needed for generating or finding such mutations in all genes. Such technologies should be piloted in appropriate model systems, including both cell culture and whole organisms.
e) Develop technology for global protein analysis. A full understanding of genome function requires an understanding of protein function on a genome-wide basis. Development of experimental and computational methods to study global spatial and temporal patterns of protein expression, protein-ligand interactions, and protein modification needs to be supported.
3.7 Bioinformatics and Computational Biology
Bioinformatics support is essential to the implementation of genome projects and for public access to their output. Bioinformatics needs for the genome project fall into two broad areas: (i) databases and (ii) development of analytical tools. Collection, analysis, annotation, and storage of the ever increasing amounts of mapping, sequencing, and expression data in publicly accessible, user-friendly databases is critical to the project's success. In addition, the community needs computational methods that will allow scientists to extract, view, annotate, and analyze genomic information efficiently. Thus, the genome project must continue to invest substantially in these areas. Conservation of resources through development of portable software should be encouraged.
a) Improve content and utility of databases. Databases are the ultimate repository of the genome project’s data. As new kinds of data are generated and new biological relationships discovered, databases must provide for continuous and rapid expansion and adaptation to the evolving needs of the scientific community. To encourage broad use, databases should be responsive to a diverse range of users with respect to data display, data deposition, data access, and data analysis. Databases should be structured to allow the queries of greatest interest to the community to be answered in a seamless way. Communication among databases must be improved. Achieving this will require standardization of nomenclature. A database of human genomic information, analogous to the model organism databases and including links to many types of phenotypic information, is needed.
b) Develop better tools for data generation, capture, and annotation. Large-scale, high-throughput genomics centers need readily available, transportable informatics tools for commonly performed tasks such as sample tracking, process management, map generation, sequence finishing, and primary annotation of data. Smaller users urgently need reliable tools to meet their sequencing and sequence analysis needs. Readily accessible information about the availability and utility of various tools should be provided, as well as training in the use of tools.
c) Develop and improve tools and databases for comprehensive functional studies. Massive amounts of data on gene expression and function will be generated in the near future. Databases that can organize and display this data in useful ways need to be developed. New statistical and mathematical methods are needed for analysis and comparison of expression and function data, in a variety of cells and tissues, at various times and under different conditions. Also needed are tools for modeling complex networks and interactions.
d) Develop and improve tools for representing and analyzing sequence similarity and variation. The study of sequence similarity and variation within and among species will become an increasingly important approach to biological problems. There will be many forms of sequence variation, of which SNPs will be only one type. Tools need to be created for capturing, displaying, and analyzing information about sequence variation.
e) Create mechanisms to support effective approaches for producing robust, exportable software that can be widely shared. Many useful software products are being developed in both academia and industry that could be of great benefit to the community. However, these tools generally are not robust enough to make them easily exportable to another laboratory. Mechanisms are needed for supporting the validation and development of such tools into products that can be readily shared and for providing training in the use of these products. Participation by the private sector is strongly encouraged.
3.8 Job Opportunities and Job Requirements
The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. There is an urgent need to train more scientists in interdisciplinary areas that can contribute to genomics. Programs must be developed that will encourage training of both biological and non-biological scientists for careers in genomics. Especially critical is the shortage of individuals trained in Bioinformatics. Also needed are scientists trained in the management skills required to lead large data-production efforts. Another urgent need is for scholars who are trained to undertake studies on the societal impact of genetic discoveries. Such scholars should be knowledgeable in both genome-related sciences and in the social sciences. Ultimately, a stable academic environment for genomic science must be created so that innovative research can be nurtured and training of new individuals can be assured. The latter is the responsibility of the academic sector, but funding agencies can encourage it through their grants programs.
3.9 Training Goals included in the Human Genome Project Plan
a) Nurture the training of scientists skilled in genomics research.
A number of approaches to training for genomics research should be explored. These include providing fellowship and career awards and encouraging the development of institutional training programs and curricula. Training that will facilitate collaboration among scientists from different disciplines, as well as courses that introduce scientists to new technologies or approaches, should also be included.
b) Encourage the establishment of academic career paths for genomic scientists.
Ultimately, a strong academic presence for genomic science is needed to generate the training environment that will encourage individuals to enter the field. Currently, the high demand for genome scientists in industry threatens the retention of genome scientists in academia. Attractive incentives must be developed to maintain the critical mass essential for sponsoring the training of the next generation of genome scientists.
c) Increase the number of scholars who are knowledgeable in both genomic and genetic sciences and in ethics, law, or the social sciences.
As the pace of genetic discoveries increases, the need for individuals who have the necessary training to study the social impact of these discoveries also increases. The ELSI program should expand its efforts to provide postdoctoral and senior fellowship opportunities for cross-training. Such opportunities should be provided both to scientists and health professionals who wish to obtain training in the social sciences and humanities and to scholars trained in law, the social sciences, or the humanities who wish to obtain training in genomic or genetic sciences.
Chapter 4
Human Genome Project
4.1 Introduction
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resources and technological advances have accelerated the expected completion date to 2003. Project goals are to
· identify all the approximately 30,000 genes in human DNA,
· determine the sequences of the 3 billion chemical base pairs that make up human DNA,
· store this information in databases,
· improve tools for data analysis,
· transfer related technologies to the private sector, and
· address the ethical, legal, and social issues that may arise from the project.
4.2 Details of the Human Genome Project
The Human Genome Project (HGP) is fulfilling its promise as the single most important project in biology and the biomedical sciences--one that will permanently change biology and medicine. With the recent completion of the genome sequences of several microorganisms, including Escherichia coli and Saccharomyces cerevisiae, and the imminent completion of the sequence of the metazoan Caenorhabditis elegans, the door has opened wide on the era of whole genome science. The ability to analyze entire genomes is accelerating gene discovery and revolutionizing the breadth and depth of biological questions that can be addressed in model organisms. These exciting successes confirm the view that acquisition of a comprehensive, high-quality human genome sequence will have unprecedented impact and long-lasting value for basic biology, biomedical research, biotechnology, and health care. The transition to sequence-based biology will spur continued progress in understanding gene-environment interactions and in development of highly accurate DNA-based medical diagnostics and therapeutics.
Human DNA sequencing, the flagship endeavor of the HGP, is entering its decisive phase. It will be the project's central focus during the next 5 years. While partial subsets of the DNA sequence, such as expressed sequence tags (ESTs), have proven enormously valuable, experience with simpler organisms confirms that there can be no substitute for the complete genome sequence. In order to move vigorously toward this goal, the crucial task ahead is building sustainable capacity for producing publicly available DNA sequence. The full and incisive use of the human sequence, including comparisons to other vertebrate genomes, will require further increases in sustainable capacity at high accuracy and lower costs. Thus, a high-priority commitment to develop and deploy new and improved sequencing technologies must also be made.
Availability of the human genome sequence presents unique scientific opportunities, chief among them the study of natural genetic variation in humans. Genetic or DNA sequence variation is the fundamental raw material for evolution. Importantly, it is also the basis for variations in risk among individuals for numerous medically important, genetically complex human diseases. An understanding of the relationship between genetic variation and disease risk promises to change significantly the future prevention and treatment of illness. The new focus on genetic variation, as well as other applications of the human genome sequence, raises additional ethical, legal, and social issues that need to be anticipated, considered, and resolved.
The HGP has made genome research a central underpinning of biomedical research. It is essential that it continue to play a lead role in catalyzing large-scale studies of the structure and function of genes, particularly in functional analysis of the genome as a whole. However, full implementation of such methods is a much broader challenge and will ultimately be the responsibility of the entire biomedical research and funding communities.
Success of the HGP critically depends on Bioinformatics and computational biology as well as training of scientists to be skilled in the genome sciences. The project must continue a strong commitment to support of these areas.
As intended, the HGP has become a truly international effort to understand the structure and function of the human genome. Many countries are participating according to their specific interests and capabilities. Coordination is informal and generally effected at the scientist-to-scientist level. The U.S. component of the project is sponsored by the National Human Genome Research Institute at the National Institutes of Health (NIH) and the Office of Biological and Environmental Research at the Department of Energy (DOE). The HGP has benefited greatly from the contributions of its international partners. The private sector has also provided critical assistance. These collaborations will continue, and many will expand. Both NIH and DOE welcome participation of all interested parties in the accomplishment of the HGP's ultimate purpose, which is to develop and make publicly available to the international community the genomic resources that will expedite research to improve the lives of all people.
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
Providing a complete, high-quality sequence of human genomic DNA to the research community as a publicly available resource continues to be the HGP's highest priority goal. The enormous value of the human genome sequence to scientists, and the considerable savings in research costs its widespread availability will allow, are compelling arguments for advancing the timetable for completion. Recent technological developments and experience with large-scale sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality sequence of the human genome by the end of 2003, 2 years sooner than previously predicted. NIH and DOE expect to contribute 60 to 70% of this sequence, with the remainder coming from the effort at the Sanger Center and other international partners.
This is a highly ambitious goal, given that only about 6% of the human genome sequence has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but within reach and well worth the risks and effort. Realizing the goal will require an intense and dedicated effort and a continuation and expansion of the collaborative spirit of the international sequencing community. Only sequence of high accuracy and long-range contiguity will allow a full interpretation of all the information encoded in the human genome.
Availability of the human sequence will not end the need for large-scale sequencing. Full interpretation of that sequence will require much more sequence information from many other organisms, as well as information about sequence variation in humans. Thus, the development of sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals below will require a capacity of at least 500 megabases (Mb) of finished sequence per year by the end of 2003.
a) Finish the complete human genome sequence by the end of 2003.
To best meet the needs of the scientific community, the finished human DNA sequence must be a faithful representation of the genome, with high base-pair accuracy and long-range contiguity. Specific quality standards that balance cost and utility have already been established. These quality standards should be reexamined periodically; as experience in using sequence data is gained, the appropriate standards for sequence quality may change. One of the most important uses for the human sequence will be comparison with other human and nonhuman sequences. The sequence differences identified in such comparisons should, in nearly all cases, reflect real biological differences rather than errors or incomplete sequence. Consequently, the current standard for accuracy--an error rate of no more than 1 base in 10,000--remains appropriate.
The current public sequencing strategy is based on mapped clones and occurs in two phases. The first, or "shotgun" phase, involves random determination of most of the sequence from a mapped clone of interest. Methods for doing this are now highly automated and efficient. Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most of the region of interest but may still contain gaps and ambiguities. In the second, finishing phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor intensive than the shotgun phase. Already, partially finished, working-draft sequence is accumulating in public databases at about twice the rate of finished sequence.
b) Make the sequence totally and freely accessible.
The HGP was initiated because its proponents believed the human sequence is such a precious scientific resource that it must be made totally and publicly available to all who want to use it. Only the wide availability of this unique resource will maximally stimulate the research that will eventually improve human health.
4.3.2 Sequencing Technology
Create a long-term, sustainable sequencing capacity by improving current technology and developing highly efficient novel technologies. Achieving this HGP goal will require current sequencing capacity to be expanded 2-3 times, demanding further incremental advances in standard sequencing technologies and improvements in efficiency and cost. For future sequencing applications, planners emphasize the importance of supporting novel technologies that may be 5-10 years in development.
4.3.3 Sequence Variation
Develop technologies for rapid identification of DNA sequence variants. A new priority for the HGP is examining regions of natural variation that occur among genomes (except those of identical twins). Goals specify development of methods to detect different types of variation, particularly the most common type called single nucleotide polymorphisms (SNPs) that occur about once every 1000 bases. Scientists believe SNP maps will help them identify genes associated with complex diseases such as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to make using conventional gene hunting methods because any individual gene may make only a small contribution to disease risk. DNA sequence variations also underlie many individual differences in responses to the environment and treatments.
4.3.4 Functional Genomics
Expand support for current approaches and innovative technologies. Efficient interpretation of the functions of human genes and other DNA sequences requires developing the resources and strategies to enable large-scale investigations across whole genomes. A technically challenging first priority is to generate complete sets of full-length cDNA clones and sequences for human and model organism genes. Other functional genomics goals include studies into gene expression and control, creation of mutations that cause loss or alteration of function in nonhuman organisms, and development of experimental and computational methods for protein analyses.
4.3.5 Comparative Genomics
Obtain complete genomic sequences for C. elegans (1998), Drosophila (2002), and mouse (2008). A first clue toward identifying and understanding the functions of human genes or other DNA regions is often obtained by studying their parallels in nonhuman genomes. To enable efficient comparisons, complete genomic sequences already have been obtained for the bacterium E. coli and the yeast S. cerevisiae, and work continues on sequencing the genomes of the roundworm, fruit fly, and mouse. Planners note that other genomes will need to be sequenced to realize the full promise of comparative genomics, stressing the need to build a sustainable sequencing capacity.
4.3.6 Ethical, Legal, and Social Implications (ELSI)
· Analyze and address implications of identifying DNA sequence information for individuals, families, and communities.
· Facilitate safe and effective integration of genetic technologies.
· Facilitate education about genomics in nonclinical and research settings.
Rapid advances in genetics and applications present new and complex ethical and policy issues for individuals and society. ELSI programs that identify and address these implications have been an integral part of the US HGP since its inception. These programs have resulted in a body of work that promotes education and helps guide the conduct of genetic research and the development of related health professional and public policies. Continuing and new challenges include safeguarding the privacy of individuals and groups who contribute samples for large-scale sequence variation studies; anticipating how resulting data may affect concepts of race and ethnicity; identifying how genetic data could potentially be used in workplaces, schools, and courts; commercial uses; and the impact of genetic advances on concepts of humanity and personal responsibility.
4.3.7 Bioinformatics and Computational Biology
Improve current databases and develop new databases and better tools for data generation and capture and comprehensive functional studies. Continued investment in current and new databases and analytical tools is critical to the success of the Human Genome Project and to the future usefulness of the data. Databases must be structured to adapt to the evolving needs of the scientific community and allow queries to be answered easily. Planners suggest developing a human genome database analogous to model organism databases with links to phenotypic information. Also needed are databases and analytical tools for the expanding body of gene expression and function data, for modeling complex biological networks and interactions, and for collecting and analyzing sequence variation data.
4.3.8 Training
Nurture the training of genomic scientists and establish career paths.
Increase the number of scholars knowledgeable in genomics and ethics, law, or the social sciences. Planners note that future genomics scientists will require training in interdisciplinary areas that include biology, computer science, engineering, mathematics, physics, and chemistry. Additionally, scientists with management skills will be needed for leading large data-production efforts.
Chapter 5
Biological Databases
5.1 The Biological sequence/structure deficit
At the beginning of 1998, more than 300,000 protein sequences had been deposited in publicly available, non-redundant databases, and the number of partial sequences in public and proprietary expressed sequence tag databases is estimated to run into millions. By contrast, the number of unique 3D structures in the Protein Data Bank (PDB) was less than 1,500. Although structural information is far more complex to derive, store and manipulate than sequence data, these figures nevertheless highlight an enormous information deficit. This situation is likely to get worse as the genome projects around the world begin to bear fruit. Of course, the acquisition of structural data is also accelerating, and a future large-scale structure determination enterprise could conceivably furnish 2,000 3D structures annually. But this is a small yield by comparison with that of the sequence databases, which are doubling in size every year, with a new sequence being added, on average, once a minute.
5.2 Biological Databases
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way; this means establishing, maintaining and disseminating databases; providing easy to use software to access the information they contain; and designing state-of-the-art analysis tools to visualize and interpret the structural and functional clues latent in the data.
The first step, then, in analysing sequence information is to assemble it into central, shareable resources, i.e. databases. Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast amounts of information. There are many different database types, depending both on the nature of the information being stored and on the manner of data storage (e.g. whether in flat files, tables in a relational database, or objects in an object-oriented database).
In the context of protein sequence analysis, we will encounter primary, composite and secondary databases. Such resources store different levels of information in totally different formats. In the past, this has led to a variety of communication problems, but emerging computer technologies are beginning to provide solutions, allowing seamless, transparent access to disparate, distributed data structures over the internet.
Primary and secondary databases are used to address different aspects of sequence analysis, because they store different levels of protein sequence information.
The primary structure of a protein is its amino acid sequence; these are stored in primary databases as linear alphabets that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, which, in sequence alignments, are often apparent as well conserved motifs; these are stored in secondary databases as patterns. The tertiary structure of a protein arises from the packing of its secondary structure elements which may form discrete domains within a fold, or may give rise to autonomous folding units or modules; complete folds, domains and modules are stored in structure databases as sets of atomic co-ordinates.
5.3 Primary Sequence Databases
In the early 1980s, sequence information started to become more abundant in the scientific literature. Realising this, several laboratories saw that there might be advantages to harvesting and storing these sequences in central repositories. Thus, several primary database projects began to evolve in different parts of the world.
5.3.1 Nucleic acid Sequence Databases
The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
EMBL is the nucleotide sequence database from the European Bioinformatics Institute. The growth of DNA databases has been following an exponential trend, with a doubling time of less than a year. More than 50% of the EMBL data comes from model organisms.
DNA Data Bank of Japan is produced, distributed and maintained by the National Institute of Genetics.
GenBank, the DNA database from the National Center for Biotechnology Information, exchanges data with both EMBL and DDBJ to help ensure comprehensive coverage. The database is split into 17 smaller discrete divisions.
5.3.2 Protein Sequence Databases
PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases.
PIR was developed for investigating evolutionary relationships between proteins. In its current form, the database is split into four distinct sections, PIR1-PIR4, which differ in terms of the quality of the data and the level of annotation provided.
MIPS collects and processes sequence data for the tripartite PIR-International Protein sequence Database Project.
SWISS-PROT is a protein sequence database that endeavours to provide high-level annotations, including descriptions of the function of a protein, the structure of its domains, its post-translational modifications, and so on.
TrEMBL was created as a supplement to SWISS-PROT. It was designed to address the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects, without compromising the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.
5.4 Composite Protein Sequences Databases
One solution to the problem of the proliferation of primary databases is to compile a composite, i.e. a database that amalgamates a variety of different primary sources. Composite databases render sequence searching much more efficient, because they obviate the need to interrogate multiple resources. The interrogation process is streamlined still further if the composite has been designed to be non-redundant, as this means that the same sequence need not be searched more than once. The choices of different sources and the application of different redundancy criteria have led to the emergence of different composites. The major composite databases are the Non-Redundant Database, OWL, MIPSX and SWISS-PROT+TrEMBL.
5.5 Secondary Databases
Secondary databases contain the fruits of analyses of the sequences in the primary resources. Because there are several different primary databases and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different. Designing software tools that can search the different types of data, interpret the range of outputs, and assess the biological significance of the results is not a trivial task. SWISS-PROT has emerged as the most popular primary source and many secondary databases now use it as their basis.
Some of the main secondary resources are as follows:
Secondary database    Primary source      Stored information
PROSITE               SWISS-PROT          Regular expressions
Profiles              SWISS-PROT          Weighted matrices
PRINTS                OWL                 Aligned motifs
Pfam                  SWISS-PROT          Hidden Markov Models
BLOCKS                PROSITE/PRINTS      Aligned motifs (blocks)
IDENTIFY              BLOCKS/PRINTS       Fuzzy regular expressions
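As the table indicates, PROSITE stores its patterns as regular expressions. The short Python sketch below illustrates the idea: it converts a pattern written in PROSITE-style notation into a standard regular expression and scans a sequence for it. The conversion handles only the small subset of the syntax used here, and the pattern and sequence, which resemble the well-known P-loop motif and a Ras-like fragment, are shown purely for illustration.

import re

def prosite_to_regex(pattern: str) -> str:
    # Convert a PROSITE-style pattern such as "[AG]-x(4)-G-K-[ST]" into a
    # Python regular expression. Only the small subset of the syntax used
    # in this example is handled.
    regex = ""
    for element in pattern.split("-"):
        if element.startswith("x"):
            # 'x' means any residue; 'x(4)' means any four residues.
            count = element[2:-1] if "(" in element else "1"
            regex += "." if count == "1" else ".{%s}" % count
        else:
            regex += element          # classes like [AG] are already valid regex
    return regex

sequence = "MTEYKLVVVGAGGVGKSALTIQLIQNHFVDE"   # illustrative Ras-like fragment
pattern = "[AG]-x(4)-G-K-[ST]"                 # PROSITE-style P-loop-like pattern
match = re.search(prosite_to_regex(pattern), sequence)
if match:
    print("Motif found at residues", match.start() + 1, "to", match.end())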
5.6 Tertiary Databases
Tertiary databases are the databases derived from information housed in secondary (pattern) databases (e.g. the BLOCKS and eMOTIF databases, which draw on the data stored within PROSITE and PRINTS). The value of such resources is in providing a different scoring perspective on the same underlying data, allowing the possibility to diagnose relationships that might be missed using the original implementation.
Chapter 6
Applications of Bioinformatics
A large amount of investment is being made in the field of biotechnology. In this chapter, I have attempted to review the overall outcomes obtained so far and what is expected in the future.
6.1 Application to the Ailments of Diseases
The miraculous substance that contains all of our genetic instructions, DNA, is rapidly becoming a key to modern medicine. By focusing on the diaphanous and extraordinarily long filaments of DNA that we inherit from our parents, scientists are finding the root causes of dozens of previously mysterious diseases: abnormal genes. These discoveries are allowing researchers to make precise diagnoses and predictions, to design more effective drugs, and to prevent many painful disorders. The new findings also pave the way for the development of the ultimate therapy - substituting a normal gene for a malfunctioning one so as to correct a patient's genetic defect permanently.
Recently, scientists have made spectacular progress against two fatal genetic diseases of children, cystic fibrosis and Duchenne muscular dystrophy. In addition, they have identified the genetic flaws that predispose people to more widespread, though still poorly understood ailments - various forms of heart disease, breast and colon cancer, diabetes, arthritis - which are not usually thought of as genetic in origin.
While many of the researchers who are exploring our genetic wilderness want to find the sources of the nearly 4,000 disorders caused by defects in single genes, others have an even broader goal: They hope to locate and map all of the 50,000 to 100,000 genes on our chromosomes. This map of our complete biological inheritance "the marvelous message, evolved for 3 billion years or more, which gives rise to each one of us," as Robert Sinsheimer of the University of California, Santa Barbara, calls it - will guide biological research for years to come. And it will radically simplify the search for the genetic flaws that cause disease.
Once scientists have identified such a flaw, they need to understand just how it produces a particular illness. They must determine the normal gene's function in human cells: What kind of protein does it instruct the cells to make, in what quantities, at what times, and in what specific places? Then the researchers can ask whether the genetic flaw results in too little protein, the wrong kind of protein, or no protein at all - and how best to counteract the effects of this failure.
For most genetic disorders, researchers are still at the very beginning of the trail. They have no clues to the DNA error that causes a disease, and they are still trying to find large families whose DNA patterns can help them track it down.
By contrast, scientists who work on cystic fibrosis and a few other diseases have covered much of the trail. They have already succeeded in correcting the gene defect inside living human cells by inserting healthy genes into these cells in a laboratory dish - an achievement that may lead to gene therapy.
The farther scientists go along the trail, the broader the implications of their findings. For example, the discovery of the gene defect that causes Duchenne muscular dystrophy, a muscle-wasting disease, led scientists to identify a previously unknown protein that plays an important role in all muscle function. This gives them a clearer view of how muscle cells work and allows them to diagnose other muscle disorders with exceptional precision, as well as devise new approaches to treatment.
Any new treatment will need to be tested on animals. In fact, the next explosion of information in medical genetics is expected to come from the study of animals, particularly those with defects that mimic human disorders. The techniques for producing animal models of disease are improving rapidly. Even today, "designer mice" are playing an increasingly important role in research.
The growth of powerful computerized databases is bringing further insights. Only a month after the discovery of the genetic error involved in neurofibromatosis, a disfiguring and sometimes disabling hereditary disease, a computer search revealed a match between the protein made by normal copies of the newly uncovered gene and a protein that acts to suppress the development of cancers of the lung, liver, and brain - a key finding for cancer researchers.
Such revelations are becoming increasingly frequent. "If a new sequence has no match in the databases as they are, a week later a still newer sequence will match it," observes Walter Gilbert of Harvard University.
Brain disorders such as schizophrenia or Alzheimer's disease may be next to yield to the genetic approach. "We won't know what went wrong in most cases of mental disease until we can find the gene that sets it off," says James Watson, co-discoverer of the structure of DNA.
6.2 Application of Bioinformatics to Agriculture
Techniques aimed at crop improvement have been utilized for centuries. Today, applied plant science has three overall goals: increased crop yield, improved crop quality, and reduced production costs. Biotechnology is proving its value in meeting these goals. Progress has, however, been slower than with medical and other areas of research. Because plants are genetically and physiologically more complex than single-cell organisms such as bacteria and yeasts, the necessary technologies are developing more slowly.
6.2.1 Improvements in Crop Yield and Quality
In one active area of plant research, scientists are exploring ways to use genetic modification to confer desirable characteristics on food crops. Similarly, agronomists are looking for ways to harden plants against adverse environmental conditions such as soil salinity, drought, alkaline earth metals, and anaerobic (lacking air) soil conditions.
Genetic engineering methods to improve fruit and vegetable crop characteristics - such as taste, texture, size, color, acidity or sweetness, and ripening process - are being explored as a potentially superior strategy to the traditional method of cross-breeding.
Research in this area of agricultural biotechnology is complicated by the fact that many of a crop's traits are encoded not by one gene but by many genes working together. Therefore, one must first identify all of the genes that function as a set to express a particular property. This knowledge can then be applied to altering the germlines of commercially important food crops. For example, it will be possible to transfer the genes regulating nutrient content from one variety of tomatoes into a variety that naturally grows to a larger size. Similarly, by modifying the genes that control ripening, agronomists can provide supplies of seasonal fruits and vegetables for extended periods of time.
Biotechnological methods for improving field crops, such as wheat, corn and soybeans, are also being sought, since seeds serve both as a source of nutrition for people and animals and as the material for producing the next plant generation. By increasing the quality and quantity of protein or varying the types in these crops, we can improve their nutritional value.
6.3 Applications of Microbial Genomics
· new energy sources (biofuels)
· environmental monitoring to detect pollutants
· protection from biological and chemical warfare
· safe, efficient toxic waste cleanup
· understanding disease vulnerabilities and revealing drug targets
In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing.
Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Programs like the DOE Microbial Genome Program help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities.
Information gleaned from the characterization of complete genomes in MGP will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility.
Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. Already, microbial enzymes are being used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production. In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens.
Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets.
Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program already have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history.
6.4 Risk Assessment
· assess health damage and risks caused by radiation exposure, including low-dose exposures
· assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
· reduce the likelihood of heritable mutations
Understanding the human genome will have an enormous impact on the ability to assess risks posed to individuals by exposure to toxic agents. Scientists know that genetic differences make some people more susceptible and others more resistant to such agents. Far more work must be done to determine the genetic basis of such variability. This knowledge will directly address DOE's long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially in terms of cancer risk.
6.5 Bioarchaeology, Anthropology, Evolution, and Human Migration
· study evolution through germline mutations in lineages
· study migration of different population groups based on female genetic inheritance
· study mutations on the Y chromosome to trace lineage and migration of males
· compare breakpoints in the evolution of mutations with ages of populations and historical events
Understanding genomics will help us understand human evolution and the common biology we share with all of life. Comparative genomics between humans and other organisms such as mice has already led to the identification of similar genes associated with diseases and traits. Further comparative studies will help determine the yet-unknown function of thousands of other genes.
Comparing the DNA sequences of entire genomes of different microbes will provide new insights about relationships among the three kingdoms of life: archaebacteria, eukaryotes, and prokaryotes.
6.6 DNA Forensics (Identification)
· identify potential suspects whose DNA may match evidence left at crime scenes
· exonerate persons wrongly accused of crimes
· identify crime and catastrophe victims
· establish paternity and other family relationships
· identify endangered and protected species as an aid to wildlife officials (could be used for prosecuting poachers)
· detect bacteria and other organisms that may pollute air, water, soil, and food
· match organ donors with recipients in transplant programs
· determine pedigree for seed or livestock breeds
· authenticate consumables such as caviar and wine
Any type of organism can be identified by examination of DNA sequences unique to that species. Identifying individuals is less precise at this time, although when DNA sequencing technologies progress further, direct characterization of very large DNA segments, and possibly even whole genomes, will become feasible and practical and will allow precise individual identification.
To identify individuals, forensic scientists scan about 10 DNA regions that vary from person to person and use the data to create a DNA profile of that individual (sometimes called a DNA fingerprint). There is an extremely small chance that another person has the same DNA profile for a particular set of regions.
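The claim that a profile built from about ten regions makes a coincidental match extremely unlikely can be illustrated with a back-of-envelope calculation: if the regions are inherited independently, the per-region match probabilities multiply. The sketch below assumes invented per-locus frequencies; real forensic work uses measured population allele frequencies.

# Back-of-envelope estimate of the random-match probability of a DNA profile.
# The per-locus genotype frequencies below are invented for illustration;
# real calculations use measured population allele frequencies.
locus_genotype_frequencies = [0.08, 0.11, 0.05, 0.09, 0.07,
                              0.10, 0.06, 0.12, 0.08, 0.05]   # about 10 regions

random_match_probability = 1.0
for frequency in locus_genotype_frequencies:
    # Assuming the loci are inherited independently, the probabilities multiply.
    random_match_probability *= frequency

print(f"Chance of a coincidental match: about 1 in {1 / random_match_probability:,.0f}")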
Bibliography
1. IEEE Magazine
Engineering in Medicine and Biology
Volume 20, Number 4, July/August 2002
2. Introduction to Bioinformatics
By T. K. Attwood and D. J. Parry-Smith
First Edition
Publication: Pearson Education Ltd.
3. Web Sites
Human Genome Project http://www.ornl.gov/TechResources/Human_Genome/
Beyond Discovery http://www4.nas.edu/beyond/beyonddiscovery.nsf/
Bioinformatics in India http://bioinformatics-india.com
Other sites http://bioinform.com
http://bioinformatics.org
Chapter 1
Introduction to Bioinformatics
1.1 What is Bioinformatics?
Bioinformatics is the application of computers in biological sciences, and especially the analysis of biological sequence data. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data. With new sequences being added to DNA databases, on average, once every minute, there is a pressing need to convert this information into biochemical and biophysical knowledge by deciphering the structural, functional and evolutionary clues encoded in the language of biological sequences.
What Bioinformatics therefore offers to the researcher, the entrepreneur, or the Venture Capitalist is an enormous and exciting array of opportunities to discover how living systems metabolise, grow, combat disease, reproduce and regenerate. The current knowledge represents only the tip of the iceberg. Exciting and startling discoveries are being made everyday through Bioinformatics, which is building up an extensive encyclopedia from which life’s mysteries will be unraveled. The importance of computational science in collating this information and its simultaneous interpretation by biologists is the underlying ethos of Bioinformatics.
An interest in biology and a strong inclination towards genetics are certainly helpful. But from our point of view, the most important observation is that biocomputing requires a large number of software professionals, and there is more work here for software people than for experts in biology.
1.2 Computers and Biology
Bioinformatics is the symbiotic relationship between the computational and biological sciences. The ability to sort and extract genetic codes from a human genomic database of 3 billion base pairs of DNA in a meaningful way is perhaps the simplest form of Bioinformatics. Moving on to another level, Bioinformatics is useful in mapping different people's genomes and deriving differences in their genetic make-up through pattern-recognition software. But that is the easiest part. What is more complex is deciphering the genetic code itself, to see what the differences in genetic make-up between different people translate into in terms of physiological traits. And there is yet another, even more intricate level. The genetic code codes for amino acids and thereby proteins, and the specific role played by each of these proteins controls the state of our health. The role or function of each of our genes in coding for a specific protein, which in turn regulates a particular metabolic pathway, is described as "functional genomics". The true benefit of Bioinformatics therefore lies in harnessing information pertaining to these genetic functions in order to understand how human beings and other living systems operate.
Computational simulation of experimental biology is an important application of Bioinformatics, which is referred to as “in silico” testing. This is perhaps an area that will expand in a prolific way, given the need to obtain a greater degree of predictability in animal and human clinical trials. Added to this, is the interesting scope that “in silico” testing provides to deal with the growing hostility towards animal testing. The growth of this sector will largely depend on the acceptance of “in silico” testing by the regulatory authorities. However, irrespective of this, research strategies will certainly find computational modeling to be a vital tool in speeding up research with enormous cost benefits.
1.3 Limitations in the use of computers
The last decade has witnessed the dawn of a new era of ‘silicon-based’ biology, opening the door, for the first time, to the possible investigation and comparative analysis of complete genomes. Genome analysis means to elucidate and characterize the genes and gene products of an organism. It depends on a number of pivotal concepts, concerning the processes of evolution (divergence and convergence), the mechanism of protein folding, and the manifestation of protein function.
Today, our use of computers to model such processes is limited by, and must be placed in the context of, the current limits of our understanding of these central themes. At the outset, it is important to recognize that we do not yet fully understand the rules of protein folding; we cannot invariably say that a particular sequence or a fold has arisen by divergent or convergent evolution; and we cannot necessarily diagnose a protein function, given knowledge only of its sequence or of its structure, in isolation. Accepting what we cannot do with computers plays an essential role in forming an appreciation of what, in fact, we can do. Without this kind of understanding, it is easy to be misled, as spurious arguments are often used to promote perhaps rather overenthusiastic points of view about what particular programs and software packages can achieve.
Nature has its own complex rules, which we only poorly understand and which we cannot easily encapsulate within computer programs. No current algorithm can 'do' biology. Programs provide mathematical, and therefore exact, models; biological systems themselves are neither mathematical nor exact. To interpret correctly whether sequences or structures are meaningfully similar, whether they have arisen by the processes of divergence or convergence, and whether similar sequences or similar folds have the same or different functions: these are the most challenging problems. There are no simple solutions, and computers do not give us the answers; rather, given a sea of data, they help to narrow the options down so that users can begin to draw informed, biologically reasonable conclusions.
1.4 Current Stage of Research
In the field of Bioinformatics, the current research drive is to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem, tackling the identification of protein function from the perspectives of sequence analysis and of structure analysis respectively. From the point of view of sequence analysis, we are concerned with the detection of relationships between newly determined sequences and those of known function (usually within a database). This may mean pinpointing functional sites shared by disparate proteins (probably the result of convergent evolution), or identifying related functions in similar proteins (most commonly the result of divergent evolution).
The identification of protein function from sequence sounds straightforward, and indeed sequence analysis is usually a fruitful technique. However, function cannot be inferred from sequence for about one-third of the proteins in any of the sequenced genomes, largely because biological characterization cannot keep pace with the volume of data issuing from the genome projects (a large number of database sequences thus either carry no annotation beyond the parent gene name, or are simply designated as hypothetical proteins). Another important point is that, in some instances, closely related sequences, which may be assumed to share a common structure, may not share the same function. What this means is that, although sequence or structure analysis can be used to deduce gene functions, neither technique can be applied infallibly without reference to the underlying biology.
1.5 Microbial, Plant and Animal Genomes
Although the human genome appears to be the focal point of interest, microbial, plant and animal genomes are equally exciting to explore through Bioinformatics. Mining plant genomes opens up new vistas for research in agriculture. Microbial genomics offers a dual opportunity of developing new fermentation-based products and technologies as well as defining new ways of combating microbial infections. Exploring animal genomics opens up unlimited scope to pursue research in veterinary science and transgenic models.
1.6 History (Stages of development)
The science of sequencing began slowly. The earliest techniques were based on methods for the separation of proteins and peptides, coupled with methods for the identification and quantification of amino acids. Prior to 1945, not a single complete quantitative analysis was available for any protein. However, significant progress with chromatographic and labelling techniques over the next decade eventually led to the elucidation of the first complete sequence, that of the peptide hormone insulin (1955). Yet it was another five years before the sequence of the first enzyme (ribonuclease) was completed (1960). By 1965, around 20 proteins with more than 100 residues had been sequenced, and by 1980 the number was estimated to be of the order of 1,500. Today, there are more than 300,000 sequences available.
Initially, the majority of protein sequences were obtained by the manual process of sequential Edman degradation and dansylation. A key step towards the rapid increase in the number of sequenced proteins was the development of automated sequencers, which, by 1980, offered a 10^4-fold increase in sensitivity relative to the automated procedure implemented by Edman and Begg in 1967.
In the 1960s, scientists struggled to develop methods to sequence nucleic acids, but the first techniques to emerge were really only applicable to tRNAs, because these molecules are short (74 to 95 nucleotides in length) and it is possible to purify individual molecules.
In contrast to RNA, DNA is very long: human chromosomal molecules may contain between 55 x 10^6 and 250 x 10^6 base pairs. Assembling the complete nucleotide sequence of such a molecule is a huge task. Even if the sequence can be broken down into smaller fragments, purification remains a problem. The advent of gene cloning provided a solution to the problem of how the fragments can be separated. By 1977, two sequencing methods had emerged, using chain-termination and chemical-degradation approaches. With only minor changes, these techniques propagated to laboratories throughout the world, laid the foundation for the sequence revolution of the next two decades, and led to the subsequent birth of Bioinformatics.
During the last decade, molecular biology has witnessed an information revolution as a result both of the development of rapid DNA sequencing techniques and of the corresponding progress in computer-based technologies, which allow us to cope with this information deluge in increasingly efficient ways. The broad term coined in the mid-1980s to encompass computer applications in the biological sciences is Bioinformatics. The term has been commandeered by several different disciplines to mean rather different things. In its broadest sense, it can be taken to mean information technology applied to the management and analysis of biological sequence data; this has implications in diverse areas, ranging from artificial intelligence and robotics to genome analysis. In the context of genome initiatives, the term was originally applied to the computational manipulation and analysis of biological sequence data. However, in view of the recent rapid accumulation of available protein structures, the term now tends also to be used to embrace the manipulation and analysis of 3D structure data.
Chapter 2
Basics of Molecular Biology
This chapter briefly explains some of the common biological terms that are absolutely essential for a clear understanding of what Bioinformatics is all about. I have avoided getting into the intricacies of genetics because the basic aim of this report is to survey the latest developments in the field of Bioinformatics, try to visualize where it is heading, understand what it has to offer to the community, and identify the opportunities available in this field.
2.1 Nucleotide
A nucleotide is a molecule made up of three sub-units: a pentose sugar, a nitrogen base and a phosphate. Nucleic acids are polymers of nucleotides. The pentose sugar is either ribose or deoxyribose (this decides whether the genetic material formed is RNA or DNA). Nitrogen bases are of two types: purines (Adenine (A), Guanine (G)) and pyrimidines (Cytosine (C), Thymine (T) and Uracil (U)).
2.2 Amino acid
It is the fundamental building block of proteins. There are 20 naturally occurring amino acids in animals and around 100 more found only in plants. A sequence of three nucleotides codes for one amino acid. The logic behind this is as follows: there are four types of nucleotides, depending on the nitrogenous base, (A, G, C, T) in DNA and (A, G, C, U) in RNA. Twenty different amino acids are to be coded using permutations of 4 types of nucleotides, so 3 nucleotides are required to specify one amino acid (4^3 = 64 > 20), because fewer than 3 would be insufficient (4^2 = 16 < 20) and more than 3 would cause unnecessary redundancy. The sequence of three nucleotides specifying an amino acid is called a triplet code or codon (coding unit). All 64 codons specify something or other; most of them specify amino acids, but a few are instructions for starting and stopping the synthesis.
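The counting argument above (4^2 = 16 is too few, 4^3 = 64 is enough for 20 amino acids) can be verified directly by enumerating all possible codons, as in the short sketch below.

from itertools import product

bases = "ACGT"
for codon_length in (1, 2, 3):
    codons = ["".join(p) for p in product(bases, repeat=codon_length)]
    verdict = "enough" if len(codons) >= 20 else "not enough"
    print(f"length {codon_length}: {len(codons)} combinations -> {verdict} for 20 amino acids")
# Output: 4, 16 and 64 combinations; three bases is the shortest code that
# can cover all 20 amino acids.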
2.3 Properties of Genetic Code
1. Three nucleotides in a DNA molecule code for one amino acid in the corresponding protein. Such a triplet is called a codon.
2. The code is read from a fixed starting point.
3. Codes for starting and stopping are present, but there is no code for a pause or comma in the middle.
4. The nucleotides are read three at a time in a non-overlapping manner.
5. Most of the 64 possible nucleotide triplets stand for one amino acid or the other.
6. A few triplets stand for starting and stopping the synthesis.
7. There are two or more different codons for the same amino acid. Because of this, the genetic code is said to be degenerate.
8. The code has polarity because it can be read only in one direction.
9. The code is universal. Practically all the organisms use the same code.
2.4 DNA (Deoxy-ribonucleic Acid)
The long, thread-like DNA molecule consists of two strands that are joined to one another all along their length. Each strand is a polymer made up of repeated sub-units (nucleotides); hence each strand is also called a polynucleotide. DNA is the basic genetic material in all living material existing on this earth. The two essential mechanisms possessed by DNA are (1) transmission of hereditary characters and (2) the ability of self-duplication. In the DNA molecule, two long polynucleotide chains are spirally twisted around each other. This is also called helical coiling, and the DNA is often referred to as a double helix. A polynucleotide chain has polarity, and the two strands of a DNA molecule run in opposite directions; hence they are said to be antiparallel. The two chains are joined together by hydrogen bonds existing between the nitrogenous bases on the inside. Adenine (A) forms a bond only with Thymine (T), and Guanine (G) can form a bond only with Cytosine (C). Because of this base-pairing restriction, the two strands are always complementary to each other.
The sequence of bases along the polynucleotide is not restricted in any way. An infinite variety of combinations is possible. It is the precise sequence of bases that determines the genetic information. There is no theoretical limit to the length of a DNA molecule.
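Because of the A-T and G-C pairing rule, and because the two strands run antiparallel, either strand can be reconstructed from the other. A minimal sketch, assuming one strand is represented as a plain string of bases:

# Derive the complementary strand from one strand using the pairing rules
# above (A<->T, G<->C). Because the strands are antiparallel, the result is
# reversed so that it reads in the conventional 5'->3' direction.
PAIRING = str.maketrans("ACGT", "TGCA")

def reverse_complement(strand: str) -> str:
    return strand.translate(PAIRING)[::-1]

print(reverse_complement("ATGCTTCAG"))   # prints CTGAAGCAT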
2.5 Chromosomes
Chromosomes are the paired, self-replicating genetic structures of cells that contain the cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.
2.6 Gene
A gene is the fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e. a protein or RNA molecule).
2.7 Protein
Protein is a molecule composed of one or more chains of amino acids in a specific order. The order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function and regulation of cells, tissues and organs, each protein having a specific role (e.g., hormones, enzymes and antibodies).
DNA carries the hereditary material, and essentially what it does is direct the synthesis of proteins; all the hereditary characteristics are then reflected in the activities of the body's cells through these proteins.
2.8 Sequencing
Sequencing means the determination of the order of nucleotides (base sequences) in a DNA or RNA molecule, or the order of amino acids in a protein.
2.9 Genome
Genome of an organism means all the genetic material in its chromosomes. Its size is generally given as its total number of base pairs. Genomes of different organisms can be compared to identify similarities and disparities in the strategies for the ‘Logic of Life’.
2.10 Clone
A clone is an exact copy made of biological material such as a DNA segment, a whole cell or a complete organism. The process of creating a clone is called cloning.
2.11 Model Organism
Saccharomyces cerevisiae, commonly known as baker's yeast, has emerged as the model organism. It has demonstrated the fundamental conservation of the basic informational pathways found in almost all organisms. From the detailed study of the genomes of such organisms (which is possible today), we can gain an insight into their functioning. All of this data will lead to fundamental insights into human biology.
The vast amount of genetic data available on this species provides important clues helpful for ongoing research on human genetics. Saccharomyces cerevisiae has become the workhorse of many biotechnology labs. It can exist in either a haploid or a diploid state and divides by the vegetative process of budding. Yeast cultures can be easily propagated in labs. It has become the model organism partly because of the ease with which genetic manipulations can be carried out. Random mutations can be induced in the genome by treating live cells with chemicals such as ethyl methanesulfonate or by exposure to ultraviolet rays. Targeted gene inactivations can also be carried out; this property is very important in experiments for the unambiguous assignment of gene functions.
Saccharomyces cerevisiae has a compact genome of about 12 million base pairs (12 Mb) of DNA present on 16 chromosomes. This presented a reasonable goal for complete sequencing and analysis of its genome. The Saccharomyces Genome Database (SGD) was established at Stanford University in 1995.
Knowing the complete sequence of a genome is only the first step in understanding how the huge amount of information contained in genes is translated into functional proteins.
Chapter 3
Role of software Engineers and Technology in Biotechnology
The tools of computer science, statistics and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the 3-D and 4-D (time-varying) languages of living organisms.
3.1 Need for software automation
DNA encodes the information necessary for building and maintaining life. DNA is a non-branching, double-stranded macromolecule in which the nucleotide building blocks (A, C, G, T) are linked. Bases are arranged in A-T and C-G pairs. Small viral genomes of the order of several thousand bases were the first to be sequenced, in 1970. A few years later, genomes of the order of 40 kilobase pairs represented the limit of what could reasonably be sequenced. At this stage, the need for automation was recognized and automated methods were applied to the degree possible. By 1997, the yeast genome, consisting of 12 megabase pairs, was completed, and in 1998 the conclusion of the 100-megabase-pair nematode genome project was announced. Most recently, the 180-megabase-pair fruit-fly genome was also completed. All of these projects relied on substantially higher levels of software automation. We are now in the midst of the most ambitious project so far: sequencing of the 3-gigabase-pair human genome. For this effort, and those yet to come, software automation lies at the very core of the planning and execution of the project.
The need for automation is driven largely by the trend of handling ever larger stretches of DNA and the corresponding increase in the amount of raw data this entails. Mathematical analysis indicates that the size of a project is roughly proportional to the size of the genome, because the amount of information obtained from an individual sequencing experiment is relatively constant and independent of the genome size. It is estimated that, for the human genome, of the order of 10^8 individual experiments are required to cover the genome. To meet the projected goals, modern large-scale sequencing centers have developed throughput capacities of the order of several million experiments per month, with data processing handled on a continuous basis. Managing such large projects without a high degree of automation would clearly be impossible in terms of cost and time requirements.
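The proportionality argument can be made concrete with a rough calculation: if each experiment (read) yields a roughly fixed number of bases, the number of experiments needed grows linearly with genome size. The read length and coverage figures below are assumptions chosen only for illustration.

# Rough estimate of the number of individual sequencing experiments ("reads")
# a genome project needs. Read length and coverage are illustrative assumptions.
def experiments_needed(genome_size_bp: float, read_length_bp: float = 500.0,
                       coverage: float = 10.0) -> float:
    # Each experiment yields a roughly fixed number of bases, so the number of
    # experiments scales linearly with genome size.
    return genome_size_bp * coverage / read_length_bp

for name, size in [("yeast (12 Mb)", 12e6),
                   ("nematode (100 Mb)", 100e6),
                   ("human (3 Gb)", 3e9)]:
    print(f"{name}: ~{experiments_needed(size):,.0f} experiments")
# For the human genome this gives a figure of the order of 10^7-10^8
# experiments, consistent with the estimate quoted above.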
So, DNA is the basic genetic material. It transmits hereditary characters from one generation to the next. During the synthesis of proteins, mRNA molecules, which act as messengers of information (carrying the exact genetic code), are built from DNA. Proteins are synthesized using mRNA molecules. Protein interactions give rise to information pathways and networks, which help in building cells that are identical to their parent cells. A cluster of many cells in a predefined format composes a tissue; an organ is a combination of tissues, and an organism is an organization of organs. Refer to figure 3.1.
The challenge for computer professionals is to create tools that can capture and integrate these different levels of biological information.
3.2 Genetic Algorithms
All that computers can do is implement algorithms. Hence, when we talk of using computers for the processing of biological information, we have to define precise mathematical algorithms. The following are a few absolutely basic algorithmic problems in Bioinformatics.
3.2.1 Database Searching
Database interrogation can take the form of text queries (e.g. Display all the human adrenergic receptors) or sequence similarity searches (e.g. Given the sequence of a human adrenergic receptor, display all the similar sequences in the database). Sequence similarity searches are straightforward because the data in the databases is mostly in the form of sequences.
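The two kinds of interrogation can be illustrated with a toy in-memory "database"; the entries, descriptions and sequences below are invented, and the identity score is a deliberately naive stand-in for a real similarity search.

# A toy in-memory "sequence database" illustrating the two kinds of query.
# All entries and sequences are invented.
database = {
    "human adrenergic receptor (toy entry)": "MTEYKLVVGA",
    "human rhodopsin (toy entry)":           "MTEYRLVVGC",
    "yeast hexokinase (toy entry)":          "MVHLTPEEKS",
}

# 1) Text query: retrieve entries whose description mentions a keyword.
keyword = "adrenergic"
print("Text query hits:", [name for name in database if keyword in name])

# 2) Similarity query: rank entries against a probe sequence using a naive
#    identity score (fraction of positions with the same residue).
def identity(a: str, b: str) -> float:
    return sum(x == y for x, y in zip(a, b)) / max(len(a), len(b))

probe = "MTEYKLVVGA"
for name in sorted(database, key=lambda n: identity(probe, database[n]), reverse=True):
    print(f"{identity(probe, database[name]):.2f}  {name}")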
3.2.2 Comparing Two Sequences
Let us take the case of comparing two protein sequences. The alphabet complexity is 20, since a protein is nothing but a sequence of amino acids and there are 20 possible amino acids. The naive approach is to line up the sequences against each other and insert additional characters to bring the two strings into vertical alignment. The more positions that match, the closer the two sequences are.
The process of alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining in the alignment. A metric relating such parameters represents the distance between two sequences.
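One common way to formalise such a metric is the edit distance: the minimum number of mismatches and gaps needed to turn one sequence into the other, computed by dynamic programming. The sketch below is a minimal illustration of this idea, not the specific metric the report has in mind, and the two peptide fragments are invented.

def edit_distance(seq_a: str, seq_b: str) -> int:
    # Minimum number of substitutions (mismatches) and insertions/deletions
    # (gaps) needed to turn seq_a into seq_b, computed by dynamic programming.
    rows, cols = len(seq_a) + 1, len(seq_b) + 1
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i                      # delete all remaining residues of seq_a
    for j in range(cols):
        dist[0][j] = j                      # insert all remaining residues of seq_b
    for i in range(1, rows):
        for j in range(1, cols):
            mismatch = 0 if seq_a[i - 1] == seq_b[j - 1] else 1
            dist[i][j] = min(dist[i - 1][j] + 1,         # gap in seq_b
                             dist[i][j - 1] + 1,         # gap in seq_a
                             dist[i - 1][j - 1] + mismatch)
    return dist[-1][-1]

# Two short, invented peptide fragments; a smaller distance means closer sequences.
print(edit_distance("HEAGAWGHEE", "PAWHEAE"))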
3.2.3 Multiple Sequence Alignment
In the previous subsection, we saw pairwise sequence alignment, which is fundamental to sequence analysis. However, analysis of the groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decision-making on the relatedness of sequences to a gene family.
Multiple sequence alignment is a 2D table, in which the rows represent individual sequences and the columns the residue positions. The sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register.
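A tiny, hand-made example makes the 2D-table view concrete: rows are sequences (with '-' marking gaps), columns are residue positions, and a column in which every row agrees is a fully conserved position. The alignment below is invented; real multiple alignments are produced by dedicated programs.

# A tiny hand-made multiple sequence alignment: rows = sequences, columns =
# residue positions, '-' = gap. The alignment itself is invented.
alignment = [
    "MK-TAYIAKQR",
    "MKLTAYIA-QR",
    "MK-TSYIAKQR",
]

conserved = [position + 1 for position, column in enumerate(zip(*alignment))
             if len(set(column)) == 1 and "-" not in column]
print("Fully conserved columns:", conserved)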
3.3 Genome Projects
In the mid-1980s, the United States Department of Energy initiated a number of projects to construct detailed genetic and physical maps of the human genome, to determine its complete nucleotide sequence, and to localize its estimated 100,000 genes. Work on this scale required the development of new computational methods for analysing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analysing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project.
Similar research efforts were also launched to map and sequence the genomes of a variety of organisms used extensively in research labs as model systems. By April 1998, although only a small number of relatively small genomes had been completely sequenced, and the human genome was not expected to be complete until after the year 2003, the results of such projects were already beginning to pour into the public sequence databases in overwhelming numbers.
3.4 Goals for Advancements in Sequencing Technology
DNA sequencing technology has improved dramatically since the genome projects began. The amount of sequence produced each year is increasing steadily; individual centers are now producing tens of millions of base pairs of sequence annually. In the future, de novo sequencing of additional genomes, comparative sequencing of closely related genomes, and sequencing to assess variation within genomes will become increasingly indispensable tools for biological and medical research. Much more efficient sequencing technology will be needed than is currently available. The incremental improvements made to date have not yet resulted in any fundamental paradigm shifts. Nevertheless, the current state-of-the-art technology can still be significantly improved, and resources should be invested to accomplish this. Beyond that, research must be supported on new technologies that will make even higher throughput DNA sequencing efficient, accurate, and cost-effective, thus providing the foundation for other advanced genomic analysis tools. Progress must be achieved in three areas:
a) Continue to increase the throughput and reduce the cost of current sequencing technology.
Increased automation, miniaturization, and integration of the approaches currently in use, together with incremental, evolutionary improvements in all steps of the sequencing process, are needed to yield further increases in throughput (to at least 500 Mb of finished sequence per year by 2003) and reductions in cost. At least a twofold cost reduction from current levels (which average $0.50 per base for finished sequence in large-scale centers) should be achieved in the next 5 years. Production of the working draft of the human sequence will cost considerably less per base pair.
b) Support research on novel technologies that can lead to significant improvements in sequencing technology.
New conceptual approaches to DNA sequencing must be supported to attain substantial improvements over the current sequencing paradigm. For example, microelectromechanical systems (MEMS) may allow significant reduction of reagent use, increase in assay speed, and true integration of sequencing functions. Rapid mass spectrometric analysis methods are achieving impressive results in DNA fragment identification and offer the potential for very rapid DNA sequencing. Other more revolutionary approaches, such as single-molecule sequencing methods, must be explored as well. Significant investment in interdisciplinary research in instrumentation, combining chemistry, physics, biology, computer science, and engineering, will be required to meet this goal. Funding of far-sighted projects that may require 5 to 10 years to reach fruition will be essential. Ultimately, technologies that could, for example, sequence one vertebrate genome per year at affordable cost are highly desirable.
c) Develop effective methods for the advanced development and introduction of new sequencing technologies into the sequencing process.
As the scale of sequencing increases, the introduction of improvements into the production stream becomes more challenging and costly. New technology must therefore be robust and be carefully evaluated and validated in a high-throughput environment before its implementation in a production setting. A strong commitment from both the technology developers and the technology users is essential in this process. It must be recognized that the advanced development process will often require significantly more funds than proof-of-principle studies. Targeted funding allocations and dedicated review mechanisms are needed for advanced technology development.
3.5 Developing Technology to handle Sequence Variations
Natural sequence variation is a fundamental property of all genomes. Any two haploid human genomes show multiple sites and types of polymorphism. Some of these have functional implications, whereas many probably do not. The most common polymorphisms in the human genome are single base-pair differences, also called single-nucleotide polymorphisms (SNPs). When two haploid genomes are compared, SNPs occur every kilobase, on average. Other kinds of sequence variation, such as copy number changes, insertions, deletions, duplications, and rearrangements also exist, but at low frequency and their distribution is poorly understood. Basic information about the types, frequencies, and distribution of polymorphisms in the human genome and in human populations is critical for progress in human genetics. Better high-throughput methods for using such information in the study of human disease are also needed.
SNPs are abundant, stable, widely distributed across the genome, and lend themselves to automated analysis on a very large scale, for example, with DNA array technologies. Because of these properties, SNPs will be a boon for mapping complex traits such as cancer, diabetes, and mental illness. Dense maps of SNPs will make possible genome-wide association studies, which are a powerful method for identifying genes that make a small contribution to disease risk. In some instances, such maps will also permit prediction of individual differences in drug response. Publicly available maps of large numbers of SNPs distributed across the whole genome, together with technology for rapid, large-scale identification and scoring of SNPs, must be developed to facilitate this research.
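As a minimal illustration of the kind of automated analysis involved, the following Python sketch compares two aligned haploid sequence fragments (both invented for illustration) and reports single-base differences as candidate SNPs, skipping gap positions because insertions and deletions are not SNPs.

    def find_snps(seq_a, seq_b):
        # Report positions where two aligned haploid sequences differ by a
        # single base; gap positions are skipped (indels are not SNPs).
        snps = []
        for pos, (a, b) in enumerate(zip(seq_a, seq_b)):
            if a != b and a != '-' and b != '-':
                snps.append((pos, a, b))
        return snps

    # Invented aligned fragments from two individuals.
    individual_1 = "ATGCCGTA-ACGTTAGC"
    individual_2 = "ATGCCGTATACGTCAGC"
    print(find_snps(individual_1, individual_2))   # [(13, 'T', 'C')]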
a) Develop technologies for rapid, large-scale identification of SNPs and other DNA sequence variants. The study of sequence variation requires efficient technologies that can be used on a large scale and that can accomplish tasks such as the rapid identification of many thousands of new SNPs in large numbers of samples. Although the immediate emphasis is on SNPs, ultimately technologies that can be applied to polymorphisms of any type must be developed. Technologies are also needed that can rapidly compare, by large-scale identification of similarities and differences, the DNA of a species that is closely related to one whose DNA has already been sequenced. The technologies that are developed should be cost-effective and broadly accessible.
b) Identify common variants in the coding regions of the majority of identified genes. Initially, association studies involving complex diseases will likely test a large series of candidate genes; eventually, sequences in all genes may be systematically tested. SNPs in coding sequences (also known as cSNPs) and the associated regulatory regions will be immediately useful as specific markers for disease. An effort should be made to identify such SNPs as soon as possible. Ultimately, a catalog of all common variants in all genes will be desirable. This should be cross-referenced with cDNA sequence data.
c) Create an SNP map of at least 100,000 markers. A publicly available SNP map of sufficient density and informativeness to allow effective mapping in any population is the ultimate goal. A map of 100,000 SNPs (one SNP per 30,000 nucleotides) is likely to be sufficient for studies in some relatively homogeneous populations, while denser maps may be required for studies in large, heterogeneous populations. Thus, during this 5-year period, the HGP authorities have planned to create a map of at least 100,000 SNPs. If technological advances permit, a map of greater density is desirable. Research should be initiated to estimate the number of SNPs needed in different populations.
d) Develop the intellectual foundations for studies of sequence variation. The methods and concepts developed for the study of single-gene disorders are not sufficient for the study of complex, multigene traits. The study of the relationship between human DNA sequence variation, phenotypic variation, and complex diseases depends critically on better methods. Effective research design and analysis of linkage, linkage disequilibrium, and association data are areas that need new insights. Questions such as which study designs are appropriate to which specific populations, and with which population genetics characteristics, must be answered. Appropriate statistical and computational tools and rigorous criteria for establishing and confirming associations must also be developed.
e) Create public resources of DNA samples and cell lines. To facilitate SNP discovery it is critical that common public resources of DNA samples and cell lines be made available as rapidly as possible. To maximize discovery of common variants in all human populations, a resource is needed that includes individuals whose ancestors derive from diverse geographic areas. It should encompass as much of the diversity found in the human population as possible. Samples in this initial public repository should be totally anonymous to avoid concerns that arise with linked or identifiable samples.
DNA samples linked to phenotypic data and identified as to their geographic and other origins will be needed to allow studies of the frequency and distribution of DNA polymorphisms in specific populations and their relevance to disease. However, such collections raise many ethical, legal, and social concerns that must be addressed. Credible scientific strategies must be developed before creating these resources.
3.6 Need for Technology in Functional Genomics
Functional genomics is the interpretation of the function of DNA sequence on a genomic scale. Already, the availability of the sequence of entire organisms has demonstrated that many genes and other functional elements of the genome are discovered only when the full DNA sequence is known. Such discoveries will accelerate as sequence data accumulate. However, knowing the structure of a gene or other element is only part of the answer. The next step is to elucidate function, which results from the interaction of genomes with their environment. Current methods for studying DNA function on a genomic scale include comparison and analysis of sequence patterns directly to infer function, large-scale analysis of the messenger RNA and protein products of genes, and various approaches to gene disruption. In the future, a host of novel strategies will be needed for elucidating genomic function. This will be a challenge for all of biology. The HGP will be contributing to this area by emphasizing the development of technology that can be used on a large scale, is efficient, and is capable of generating complete data for the genome as a whole. To the extent that available resources allow, expansion of current approaches as well as innovative technology ideas should be supported in the areas described below.
a) Develop cDNA resources. Complete sets of full-length cDNA clones and sequences for both humans and model organisms would be enormously useful for biologists and are urgently needed. Such resources would help in both gene discovery and functional analysis. High priority should be placed on developing technology for obtaining full-length cDNAs. Complete and validated inventories of full-length cDNA clones and corresponding sequences should be generated and made available to the community once such technology is at hand.
b) Develop improved technologies for global approaches to the study of non-protein-coding sequences, including production of relevant libraries, comparative sequencing, and computational analysis.
c) Develop technology for comprehensive analysis of gene expression. Information about the spatial and temporal patterns of gene expression in both humans and model organisms offers one key to understanding gene function. Efficient and cost-effective technology needs to be developed to measure various parameters of gene expression reliably and reproducibly. Complementary DNA sequences and validated sets of clones with unique identifiers will be needed for array technologies, large-scale in situ hybridization, and other strategies for measuring gene expression. Improved methods for quantifying, representing, analyzing, and archiving expression data should also be developed.
d) Improve methods for genome-wide mutagenesis. Creating mutations that cause loss or alteration of function is another prime approach to studying gene function. Technologies, both gene- and phenotype-based, which can be used on a large scale in vivo or in vitro, are needed for generating or finding such mutations in all genes. Such technologies should be piloted in appropriate model systems, including both cell culture and whole organisms.
e) Develop technology for global protein analysis. A full understanding of genome function requires an understanding of protein function on a genome-wide basis. Development of experimental and computational methods to study global spatial and temporal patterns of protein expression, protein-ligand interactions, and protein modification needs to be supported.
3.7 Bioinformatics and Computational Biology
Bioinformatics support is essential to the implementation of genome projects and for public access to their output. Bioinformatics needs for the genome project fall into two broad areas: (i) databases and (ii) development of analytical tools. Collection, analysis, annotation, and storage of the ever increasing amounts of mapping, sequencing, and expression data in publicly accessible, user-friendly databases is critical to the project's success. In addition, the community needs computational methods that will allow scientists to extract, view, annotate, and analyze genomic information efficiently. Thus, the genome project must continue to invest substantially in these areas. Conservation of resources through development of portable software should be encouraged.
a) Improve content and utility of databases. Databases are the ultimate repository of genome project’s data. As new kinds of data are generated and new biological relationships discovered, databases must provide for continuous and rapid expansion and adaptation to the evolving needs of the scientific community. To encourage broad use, databases should be responsive to a diverse range of users with respect to data display, data deposition, data access, and data analysis. Databases should be structured to allow the queries of greatest interest to the community to be answered in a seamless way. Communication among databases must be improved. Achieving this will require standardization of nomenclature. A database of human genomic information, analogous to the model organism databases and including links to many types of phenotypic information, is needed.
b) Develop better tools for data generation, capture, and annotation. Large-scale, high-throughput genomics centers need readily available, transportable informatics tools for commonly performed tasks such as sample tracking, process management, map generation, sequence finishing, and primary annotation of data. Smaller users urgently need reliable tools to meet their sequencing and sequence analysis needs. Readily accessible information about the availability and utility of various tools should be provided, as well as training in the use of tools.
c) Develop and improve tools and databases for comprehensive functional studies. Massive amounts of data on gene expression and function will be generated in the near future. Databases that can organize and display this data in useful ways need to be developed. New statistical and mathematical methods are needed for analysis and comparison of expression and function data, in a variety of cells and tissues, at various times and under different conditions. Also needed are tools for modeling complex networks and interactions.
d) Develop and improve tools for representing and analyzing sequence similarity and variation. The study of sequence similarity and variation within and among species will become an increasingly important approach to biological problems. There will be many forms of sequence variation, of which SNPs will be only one type. Tools need to be created for capturing, displaying, and analyzing information about sequence variation.
e) Create mechanisms to support effective approaches for producing robust, exportable software that can be widely shared. Many useful software products are being developed in both academia and industry that could be of great benefit to the community. However, these tools generally are not robust enough to make them easily exportable to another laboratory. Mechanisms are needed for supporting the validation and development of such tools into products that can be readily shared and for providing training in the use of these products. Participation by the private sector is strongly encouraged.
3.8 Job Opportunities and Job Requirements
The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. There is an urgent need to train more scientists in interdisciplinary areas that can contribute to genomics. Programs must be developed that will encourage training of both biological and non-biological scientists for careers in genomics. Especially critical is the shortage of individuals trained in Bioinformatics. Also needed are scientists trained in the management skills required to lead large data-production efforts. Another urgent need is for scholars who are trained to undertake studies on the societal impact of genetic discoveries. Such scholars should be knowledgeable in both genome-related sciences and in the social sciences. Ultimately, a stable academic environment for genomic science must be created so that innovative research can be nurtured and training of new individuals can be assured. The latter is the responsibility of the academic sector, but funding agencies can encourage it through their grants programs.
3.9 Training Goals included in the Human Genome Project Plan
a) Nurture the training of scientists skilled in genomics research.
A number of approaches to training for genomics research should be explored. These include providing fellowship and career awards and encouraging the development of institutional training programs and curricula. Training that will facilitate collaboration among scientists from different disciplines, as well as courses that introduce scientists to new technologies or approaches, should also be included.
b) Encourage the establishment of academic career paths for genomic scientists.
Ultimately, a strong academic presence for genomic science is needed to generate the training environment that will encourage individuals to enter the field. Currently, the high demand for genome scientists in industry threatens the retention of genome scientists in academia. Attractive incentives must be developed to maintain the critical mass essential for sponsoring the training of the next generation of genome scientists.
c) Increase the number of scholars who are knowledgeable in both genomic and genetic sciences and in ethics, law, or the social sciences.
As the pace of genetic discoveries increases, the need for individuals who have the necessary training to study the social impact of these discoveries also increases. The ELSI program should expand its efforts to provide postdoctoral and senior fellowship opportunities for cross-training. Such opportunities should be provided both to scientists and health professionals who wish to obtain training in the social sciences and humanities and to scholars trained in law, the social sciences, or the humanities who wish to obtain training in genomic or genetic sciences.
Chapter 4
Human Genome Project
4.1 Introduction
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the U.S. Department of Energy and the National Institutes of Health. The project originally was planned to last 15 years, but effective resources and technological advances have accelerated the expected completion date to 2003. Project goals are to
· identify all the approximately 30,000 genes in human DNA,
· determine the sequences of the 3 billion chemical base pairs that make up human DNA,
· store this information in databases,
· improve tools for data analysis,
· transfer related technologies to the private sector, and
· address the ethical, legal, and social issues that may arise from the project.
4.2 Details of the Human Genome Project
The Human Genome Project (HGP) is fulfilling its promise as the single most important project in biology and the biomedical sciences--one that will permanently change biology and medicine. With the recent completion of the genome sequences of several microorganisms, including Escherichia coli and Saccharomyces cerevisiae, and the imminent completion of the sequence of the metazoan Caenorhabditis elegans, the door has opened wide on the era of whole genome science. The ability to analyze entire genomes is accelerating gene discovery and revolutionizing the breadth and depth of biological questions that can be addressed in model organisms. These exciting successes confirm the view that acquisition of a comprehensive, high-quality human genome sequence will have unprecedented impact and long-lasting value for basic biology, biomedical research, biotechnology, and health care. The transition to sequence-based biology will spur continued progress in understanding gene-environment interactions and in development of highly accurate DNA-based medical diagnostics and therapeutics.
Human DNA sequencing, the flagship endeavor of the HGP, is entering its decisive phase. It will be the project's central focus during the next 5 years. While partial subsets of the DNA sequence, such as expressed sequence tags (ESTs), have proven enormously valuable, experience with simpler organisms confirms that there can be no substitute for the complete genome sequence. In order to move vigorously toward this goal, the crucial task ahead is building sustainable capacity for producing publicly available DNA sequence. The full and incisive use of the human sequence, including comparisons to other vertebrate genomes, will require further increases in sustainable capacity at high accuracy and lower costs. Thus, a high-priority commitment to develop and deploy new and improved sequencing technologies must also be made.
Availability of the human genome sequence presents unique scientific opportunities, chief among them the study of natural genetic variation in humans. Genetic or DNA sequence variation is the fundamental raw material for evolution. Importantly, it is also the basis for variations in risk among individuals for numerous medically important, genetically complex human diseases. An understanding of the relationship between genetic variation and disease risk promises to change significantly the future prevention and treatment of illness. The new focus on genetic variation, as well as other applications of the human genome sequence, raises additional ethical, legal, and social issues that need to be anticipated, considered, and resolved.
The HGP has made genome research a central underpinning of biomedical research. It is essential that it continue to play a lead role in catalyzing large-scale studies of the structure and function of genes, particularly in functional analysis of the genome as a whole. However, full implementation of such methods is a much broader challenge and will ultimately be the responsibility of the entire biomedical research and funding communities.
Success of the HGP critically depends on Bioinformatics and computational biology as well as training of scientists to be skilled in the genome sciences. The project must continue a strong commitment to support of these areas.
As intended, the HGP has become a truly international effort to understand the structure and function of the human genome. Many countries are participating according to their specific interests and capabilities. Coordination is informal and generally effected at the scientist-to-scientist level. The U.S. component of the project is sponsored by the National Human Genome Research Institute at the National Institutes of Health (NIH) and the Office of Biological and Environmental Research at the Department of Energy (DOE). The HGP has benefited greatly from the contributions of its international partners. The private sector has also provided critical assistance. These collaborations will continue, and many will expand. Both NIH and DOE welcome participation of all interested parties in the accomplishment of the HGP's ultimate purpose, which is to develop and make publicly available to the international community the genomic resources that will expedite research to improve the lives of all people.
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
Providing a complete, high-quality sequence of human genomic DNA to the research community as a publicly available resource continues to be the HGP's highest priority goal. The enormous value of the human genome sequence to scientists, and the considerable savings in research costs its widespread availability will allow, are compelling arguments for advancing the timetable for completion. Recent technological developments and experience with large-scale sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality sequence of the human genome by the end of 2003, 2 years sooner than previously predicted. NIH and DOE expect to contribute 60 to 70% of this sequence, with the remainder coming from the effort at the Sanger Center and other international partners.
This is a highly ambitious goal, given that only about 6% of the human genome sequence has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but within reach and well worth the risks and effort. Realizing the goal will require an intense and dedicated effort and a continuation and expansion of the collaborative spirit of the international sequencing community. Only sequence of high accuracy and long-range contiguity will allow a full interpretation of all the information encoded in the human genome.
Availability of the human sequence will not end the need for large-scale sequencing. Full interpretation of that sequence will require much more sequence information from many other organisms, as well as information about sequence variation in humans. Thus, the development of sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals below will require a capacity of at least 500 megabases (Mb) of finished sequence per year by the end of 2003.
a) Finish the complete human genome sequence by the end of 2003.
To best meet the needs of the scientific community, the finished human DNA sequence must be a faithful representation of the genome, with high base-pair accuracy and long-range contiguity. Specific quality standards that balance cost and utility have already been established. These quality standards should be reexamined periodically; as experience in using sequence data is gained, the appropriate standards for sequence quality may change. One of the most important uses for the human sequence will be comparison with other human and nonhuman sequences. The sequence differences identified in such comparisons should, in nearly all cases, reflect real biological differences rather than errors or incomplete sequence. Consequently, the current standard for accuracy--an error rate of no more than 1 base in 10,000--remains appropriate.
The current public sequencing strategy is based on mapped clones and occurs in two phases. The first, or "shotgun" phase, involves random determination of most of the sequence from a mapped clone of interest. Methods for doing this are now highly automated and efficient. Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most of the region of interest but may still contain gaps and ambiguities. In the second, finishing phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor intensive than the shotgun phase. Already, partially finished, working-draft sequence is accumulating in public databases at about twice the rate of finished sequence.
b) Make the sequence totally and freely accessible.
The HGP was initiated because its proponents believed the human sequence is such a precious scientific resource that it must be made totally and publicly available to all who want to use it. Only the wide availability of this unique resource will maximally stimulate the research that will eventually improve human health.
4.3.2 Sequencing Technology
Create a long-term, sustainable sequencing capacity by improving current technology and developing highly efficient novel technologies. Achieving this HGP goal will require current sequencing capacity to be expanded 2-3 times, demanding further incremental advances in standard sequencing technologies and improvements in efficiency and cost. For future sequencing applications, planners emphasize the importance of supporting novel technologies that may be 5-10 years in development.
4.3.3 Sequence Variation
Develop technologies for rapid identification of DNA sequence variants. A new priority for the HGP is examining regions of natural variation that occur among genomes (except those of identical twins). Goals specify development of methods to detect different types of variation, particularly the most common type called single nucleotide polymorphisms (SNPs) that occur about once every 1000 bases. Scientists believe SNP maps will help them identify genes associated with complex diseases such as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to make using conventional gene hunting methods because any individual gene may make only a small contribution to disease risk. DNA sequence variations also underlie many individual differences in responses to the environment and treatments.
4.3.4 Functional Genomics
Expand support for current approaches and innovative technologies. Efficient interpretation of the functions of human genes and other DNA sequences requires developing the resources and strategies to enable large-scale investigations across whole genomes. A technically challenging first priority is to generate complete sets of full-length cDNA clones and sequences for human and model organism genes. Other functional genomics goals include studies into gene expression and control, creation of mutations that cause loss or alteration of function in nonhuman organisms, and development of experimental and computational methods for protein analyses.
4.3.5 Comparative Genomics
Obtain complete genomic sequences for C. elegans (1998), Drosophila (2002), and mouse (2008). A first clue toward identifying and understanding the functions of human genes or other DNA regions is often obtained by studying their parallels in nonhuman genomes. To enable efficient comparisons, complete genomic sequences already have been obtained for the bacterium E. coli and the yeast S. cerevisiae, and work continues on sequencing the genomes of the roundworm, fruit fly, and mouse. Planners note that other genomes will need to be sequenced to realize the full promise of comparative genomics, stressing the need to build a sustainable sequencing capacity.
4.3.6 Ethical, Legal, and Social Implications (ELSI)
· Analyze and address implications of identifying DNA sequence information for individuals, families, and communities.
· Facilitate safe and effective integration of genetic technologies.
· Facilitate education about genomics in nonclinical and research settings.
Rapid advances in genetics and applications present new and complex ethical and policy issues for individuals and society. ELSI programs that identify and address these implications have been an integral part of the US HGP since its inception. These programs have resulted in a body of work that promotes education and helps guide the conduct of genetic research and the development of related health professional and public policies. Continuing and new challenges include safeguarding the privacy of individuals and groups who contribute samples for large-scale sequence variation studies; anticipating how resulting data may affect concepts of race and ethnicity; identifying how genetic data could potentially be used in workplaces, schools, and courts; commercial uses; and the impact of genetic advances on concepts of humanity and personal responsibility.
4.3.7 Bioinformatics and Computational Biology
Improve current databases and develop new databases and better tools for data generation and capture and comprehensive functional studies. Continued investment in current and new databases and analytical tools is critical to the success of the Human Genome Project and to the future usefulness of the data. Databases must be structured to adapt to the evolving needs of the scientific community and allow queries to be answered easily. Planners suggest developing a human genome database analogous to model organism databases with links to phenotypic information. Also needed are databases and analytical tools for the expanding body of gene expression and function data, for modeling complex biological networks and interactions, and for collecting and analyzing sequence variation data.
4.3.8 Training
Nurture the training of genomic scientists and establish career paths.
Increase the number of scholars knowledgeable in genomics and ethics, law, or the social sciences. Planners note that future genomics scientists will require training in interdisciplinary areas that include biology, computer science, engineering, mathematics, physics, and chemistry. Additionally, scientists with management skills will be needed for leading large data-production efforts.
Chapter 5
Biological Databases
5.1 The Biological sequence/structure deficit
At the beginning of 1998, more than 300,000 protein sequences had been deposited in publicly available, non-redundant databases, and the number of partial sequences in public and proprietary expressed sequence tag (EST) databases was estimated to run into millions. By contrast, the number of unique 3D structures in the Protein Data Bank (PDB) was less than 1,500. Although structural information is far more complex to derive, store and manipulate than sequence data, these figures nevertheless highlight an enormous information deficit. This situation is likely to get worse as the genome projects around the world begin to bear fruit. Of course, the acquisition of structural data is also hastening, and the future large-scale structure determination enterprise could conceivably furnish 2,000 3D structures annually. But this is a small yield by comparison with that of the sequence databases, which are doubling in size every year, with a new sequence being added, on average, once a minute.
5.2 Biological Databases
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way; this means establishing, maintaining and disseminating databases; providing easy to use software to access the information they contain; and designing state-of-the-art analysis tools to visualize and interpret the structural and functional clues latent in the data.
The first step, then, in analysing sequence information is to assemble it into central shareable resources, i.e. databases. Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast amounts of information. There are many different database types, depending both on the nature of the information being stored and on the manner of data storage (e.g. whether in flat files, tables in a relational database, or objects in an object-oriented database).
In the context of protein sequence analysis, we will encounter primary, composite and secondary databases. Such resources store different levels of information in totally different formats. In the past, this has led to a variety of communication problems, but emerging computer technologies are beginning to provide solutions, allowing seamless, transparent access to disparate, distributed data structures over the internet.
Primary and secondary databases are used to address different aspects of sequence analysis, because they store different levels of protein sequence information.
The primary structure of a protein is its amino acid sequence; these are stored in primary databases as linear alphabets that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, which, in sequence alignments, are often apparent as well conserved motifs; these are stored in secondary databases as patterns. The tertiary structure of a protein arises from the packing of its secondary structure elements which may form discrete domains within a fold, or may give rise to autonomous folding units or modules; complete folds, domains and modules are stored in structure databases as sets of atomic co-ordinates.
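As an illustration of how a pattern stored in a secondary database can be applied to a primary sequence, the sketch below uses Python's re module to search a toy protein fragment for a PROSITE-style motif. The pattern shown follows the commonly cited P-loop consensus [AG]-x(4)-G-K-[ST]; both the translation into a regular expression and the test sequence are given only as an example, not as any database's actual implementation.

    import re

    # P-loop consensus [AG]-x(4)-G-K-[ST], written as a Python regular expression.
    ploop = re.compile(r"[AG].{4}GK[ST]")

    sequence = "MTEYKLVVVGAGGVGKSALTIQLIQNHF"   # toy protein fragment
    match = ploop.search(sequence)
    if match:
        print(f"motif '{match.group()}' found at position {match.start()}")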
5.3 Primary Sequence Databases
In the early 1980s, sequence information started to become more abundant in the scientific literature. Realising this, several laboratories saw that there might be advantages to harvesting and storing these sequences in central repositories. Thus, several primary database projects began to evolve in different parts of the world.
5.3.1 Nucleic acid Sequence Databases
The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
EMBL is the nucleotide sequence database of the European Bioinformatics Institute. The rate of growth of DNA databases has followed an exponential trend, with a doubling time of less than a year. EMBL data predominantly (more than 50%) consist of sequences from model organisms.
The DNA Data Bank of Japan (DDBJ) is produced, distributed and maintained by the National Institute of Genetics.
GenBank, the DNA database from the National Center for Biotechnology Information, exchanges data with both EMBL and DDBJ to help ensure comprehensive coverage. The database is split into 17 smaller discrete divisions.
5.3.2 Protein Sequence Databases
PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases.
PIR was developed for investigating evolutionary relationships between proteins. In its current form, the database is split into four distinct sections, PIR1-PIR4, which differ in terms of the quality of data and the level of annotation provided.
MIPS collects and processes sequence data for the tripartite PIR-International Protein sequence Database Project.
SWISS-PROT is a protein sequence database which endeavors to provide high-level annotations, including descriptions of the function of the protein, the structure of its domains, its post-translational modifications, and so on.
TrEMBL was created as a supplement to SWISS-PROT. It was designed to address the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects, without compromising the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.
5.4 Composite Protein Sequences Databases
One solution to the problem of the proliferation of primary databases is to compile a composite, i.e. a database that amalgamates a variety of different primary sources. Composite databases render sequence searching much more efficient, because they obviate the need to interrogate multiple resources. The interrogation process is streamlined still further if the composite has been designed to be non-redundant, as this means that the same sequence need not be searched more than once. The choices of different sources and the application of different redundancy criteria have led to the emergence of different composites. The major composite databases are the Non-Redundant Database, OWL, MIPSX, and SWISS-PROT+TrEMBL.
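The following sketch shows, under the simplifying assumption that exact-duplicate removal is the redundancy criterion (real composites apply more elaborate criteria), how a non-redundant composite could be assembled from several primary sources; the accession numbers and sequences are invented.

    def build_composite(*sources):
        # Merge several primary sources (accession -> sequence) into one
        # composite, keeping only the first copy of each identical sequence.
        composite, seen = {}, set()
        for source in sources:
            for accession, sequence in source.items():
                if sequence not in seen:
                    composite[accession] = sequence
                    seen.add(sequence)
        return composite

    # Invented primary sources with one overlapping entry.
    swissprot_like = {"SP:X00001": "MKTAYIAKQR"}
    pir_like = {"PIR:A00001": "MKTAYIAKQR", "PIR:A00002": "MLSDEDFKAV"}
    print(len(build_composite(swissprot_like, pir_like)))   # 2, not 3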
5.5 Secondary Databases
Secondary databases contain the fruits of analyses of the sequences in the primary resources. Because there are several different primary databases and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different. Designing software tools that can search the different types of data, interpret the range of outputs, and assess the biological significance of the results is not a trivial task. SWISS-PROT has emerged as the most popular primary source and many secondary databases now use it as their basis.
Some of the main secondary resources are as follows:
Secondary database    Primary source      Stored information
PROSITE               SWISS-PROT          Regular expressions
Profiles              SWISS-PROT          Weighted matrices
PRINTS                OWL                 Aligned motifs
Pfam                  SWISS-PROT          Hidden Markov Models
BLOCKS                PROSITE/PRINTS      Aligned motifs (blocks)
IDENTIFY              BLOCKS/PRINTS       Fuzzy regular expressions
5.6 Tertiary Databases
Tertiary databases are the databases derived from information housed in secondary (pattern) databases (e.g. the BLOCKS and eMOTIF databases, which draw on the data stored within PROSITE and PRINTS). The value of such resources is in providing a different scoring perspective on the same underlying data, allowing the possibility to diagnose relationships that might be missed using the original implementation.
Chapter 6
Applications of Bioinformatics
A large amount of investment is being made in the field of biotechnology. In this chapter, I have attempted to review the overall outcomes obtained so far and what is anticipated in the future.
6.1 Application to the Treatment of Diseases
The miraculous substance that contains all of our genetic instructions, DNA, is rapidly becoming a key to modern medicine. By focusing on the diaphanous and extraordinarily long filaments of DNA that we inherit from our parents, scientists are finding the root causes of dozens of previously mysterious diseases: abnormal genes. These discoveries are allowing researchers to make precise diagnoses and predictions, to design more effective drugs, and to prevent many painful disorders. The new findings also pave the way for the development of the ultimate therapy - substituting a normal gene for a malfunctioning one so as to correct a patient's genetic defect permanently.
Recently, scientists have made spectacular progress against two fatal genetic diseases of children, cystic fibrosis and Duchenne muscular dystrophy. In addition, they have identified the genetic flaws that predispose people to more widespread, though still poorly understood ailments - various forms of heart disease, breast and colon cancer, diabetes, arthritis - which are not usually thought of as genetic in origin.
While many of the researchers who are exploring our genetic wilderness want to find the sources of the nearly 4,000 disorders caused by defects in single genes, others have an even broader goal: They hope to locate and map all of the 50,000 to 100,000 genes on our chromosomes. This map of our complete biological inheritance "the marvelous message, evolved for 3 billion years or more, which gives rise to each one of us," as Robert Sinsheimer of the University of California, Santa Barbara, calls it - will guide biological research for years to come. And it will radically simplify the search for the genetic flaws that cause disease.
Once scientists have identified such a flaw, they need to understand just how it produces a particular illness. They must determine the normal gene's function in human cells: What kind of protein does it instruct the cells to make, in what quantities, at what times, and in what specific places? Then the researchers can ask whether the genetic flaw results in too little protein, the wrong kind of protein, or no protein at all - and how best to counteract the effects of this failure.
For most genetic disorders, researchers are still at the very beginning of the trail. They have no clues to the DNA error that causes a disease, and they are still trying to find large families whose DNA patterns can help them track it down.
By contrast, scientists who work on cystic fibrosis and a few other diseases have covered much of the trail. They have already succeeded in correcting the gene defect inside living human cells by inserting healthy genes into these cells in a laboratory dish - an achievement that may lead to gene therapy.
The farther scientists go along the trail, the broader the implications of their findings. For example, the discovery of the gene defect that causes Duchenne muscular dystrophy, a muscle-wasting disease, led scientists to identify a previously unknown protein that plays an important role in all muscle function. This gives them a clearer view of how muscle cells work and allows them to diagnose other muscle disorders with exceptional precision, as well as devise new approaches to treatment.
Any new treatment will need to be tested on animals. In fact, the next explosion of information in medical genetics is expected to come from the study of animals - particularly with defects that mimic human disorders. The techniques for producing animal models of disease are improving rapidly. Even today, "designer mice" are playing an increasingly important role in research.
The growth of powerful computerized databases is bringing further insights. Only a month after the discovery of the genetic error involved in neurofibromatosis, a disfiguring and sometimes disabling hereditary disease, a computer search revealed a match between the protein made by normal copies of the newly uncovered gene and a protein that acts to suppress the development of cancers of the lung, liver, and brain - a key finding for cancer researchers.
Such revelations are becoming increasingly frequent. "If a new sequence has no match in the databases as they are, a week later a still newer sequence will match it," observes Walter Gilbert of Harvard University.
Brain disorders such as schizophrenia or Alzheimer's disease may be next to yield to the genetic approach. "We won't know what went wrong in most cases of mental disease until we can find the gene that sets it off," says James Watson, co-discoverer of the structure of DNA.
6.2 Application of Bioinformatics to Agriculture
Techniques aimed at crop improvement have been utilized for centuries. Today, applied plant science has three overall goals: increased crop yield, improved crop quality, and reduced production costs. Biotechnology is proving its value in meeting these goals. Progress has, however, been slower than with medical and other areas of research. Because plants are genetically and physiologically more complex than single-cell organisms such as bacteria and yeasts, the necessary technologies are developing more slowly.
6.2.1 Improvements in Crop Yield and Quality
In one active area of plant research, scientists are exploring ways to use genetic modification to confer desirable characteristics on food crops. Similarly, agronomists are looking for ways to harden plants against adverse environmental conditions such as soil salinity, drought, alkaline earth metals, and anaerobic (lacking air) soil conditions.
Genetic engineering methods to improve fruit and vegetable crop characteristics - such as taste, texture, size, color, acidity or sweetness, and ripening process - are being explored as a potentially superior strategy to the traditional method of cross-breeding.
Research in this area of agricultural biotechnology is complicated by the fact that many of a crop's traits are encoded not by one gene but by many genes working together. Therefore, one must first identify all of the genes that function as a set to express a particular property. This knowledge can then be applied to altering the germlines of commercially important food crops. For example, it will be possible to transfer the genes regulating nutrient content from one variety of tomatoes into a variety that naturally grows to a larger size. Similarly, by modifying the genes that control ripening, agronomists can provide supplies of seasonal fruits and vegetables for extended periods of time.
Biotechnological methods for improving field crops, such as wheat, corn and soybeans, are also being sought, since seeds serve both as a source of nutrition for people and animals and as the material for producing the next plant generation. By increasing the quality and quantity of protein or varying the types in these crops, we can improve their nutritional value.
6.3 Applications of Microbial Genomics
· new energy sources (biofuels)
· environmental monitoring to detect pollutants
· protection from biological and chemical warfare
· safe, efficient toxic waste cleanup
· understanding disease vulnerabilities and revealing drug targets
In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing.
Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Programs like the DOE Microbial Genome Program help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities.
Information gleaned from the characterization of complete genomes in MGP will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility.
Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. Already, microbial enzymes are being used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production. In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens.
Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets.
Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program already have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history.
6.4 Risk Assessment
· assess health damage and risks caused by radiation exposure, including low-dose exposures
· assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
· reduce the likelihood of heritable mutations
Understanding the human genome will have an enormous impact on the ability to assess risks posed to individuals by exposure to toxic agents. Scientists know that genetic differences make some people more susceptible and others more resistant to such agents. Far more work must be done to determine the genetic basis of such variability. This knowledge will directly address DOE's long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially in terms of cancer risk.
6.5 Bioarchaeology, Anthropology, Evolution, and Human Migration
· study evolution through germline mutations in lineages
· study migration of different population groups based on female genetic inheritance
· study mutations on the Y chromosome to trace lineage and migration of males
· compare breakpoints in the evolution of mutations with ages of populations and historical events
Understanding genomics will help us understand human evolution and the common biology we share with all of life. Comparative genomics between humans and other organisms such as mice already has led to similar genes associated with diseases and traits. Further comparative studies will help determine the yet-unknown function of thousands of other genes.
Comparing the DNA sequences of entire genomes of different microbes will provide new insights about relationships among the three kingdoms of life: archaebacteria, eukaryotes, and prokaryotes.
6.6 DNA Forensics (Identification)
· identify potential suspects whose DNA may match evidence left at crime scenes
· exonerate persons wrongly accused of crimes
· identify crime and catastrophe victims
· establish paternity and other family relationships
· identify endangered and protected species as an aid to wildlife officials (could be used for prosecuting poachers)
· detect bacteria and other organisms that may pollute air, water, soil, and food
· match organ donors with recipients in transplant programs
· determine pedigree for seed or livestock breeds
· authenticate consumables such as caviar and wine
Any type of organism can be identified by examination of DNA sequences unique to that species. Identifying individuals is less precise at this time, although when DNA sequencing technologies progress further, direct characterization of very large DNA segments, and possibly even whole genomes, will become feasible and practical and will allow precise individual identification.
To identify individuals, forensic scientists scan about 10 DNA regions that vary from person to person and use the data to create a DNA profile of that individual (sometimes called a DNA fingerprint). There is an extremely small chance that another person has the same DNA profile for a particular set of regions.
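As a rough illustration of why the chance of a coincidental match is so small, the sketch below assumes, purely for the sake of the arithmetic, that each of the roughly 10 scanned regions matches a random, unrelated person independently with probability 0.1; real forensic calculations instead use measured allele frequencies for each locus.

    # Assume each of the 10 regions matches an unrelated person independently
    # with probability 0.1 (an illustrative figure, not a measured one).
    per_region_match = 0.1
    full_profile_match = per_region_match ** 10
    print(f"chance of a coincidental full-profile match: {full_profile_match:.0e}")   # 1e-10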
Bibliography
1. IEEE Magazine
Engineering in Medicine and Biology
Volume 20, Number 4, July/August 2002
2. Introduction to Bioinformatics
By T. K. Attwood and D. J. Parry-Smith
First Edition
Publication: Pearson Education Ltd.
3. Web Sites
Human Genome Project http://www.ornl.gov/TechResources/Human_Genome/
Beyond Discovery http://www4.nas.edu/beyond/beyonddiscovery.nsf/
Bioinformatics in India http://bioinformatics-india.com
Other sites http://bioinform.com
http://bioinformatics.org
BILLY GOAT SYSTEM
SEMINAR REPORT
ON
BILLY GOAT SYSTEM
BY
VENKATESH RAJU
T.E. Information Technology
Roll No. 352
2006-2007
GUIDED BY
Prof Ms S.B.BALRAWAT
BRACT’S
Vishwakarma Institute of Information Technology, Pune-48
Department of Information Technology, Survey No. 2/3/4
Kondhwa (BK), Pune-411048
CERTIFICATE
This is to certify that Mr. Raju Venkatesh Satyanarayan has successfully completed his Seminar on Billy Goat System in partial fulfillment of the third year degree course in Information Technology in the academic year 2006-2007.
Date: /02/2007
Prof.N.P.Pathak Prof. Ms S.B.Balrawat
Head, Information Technology Seminar Guide VIIT, Pune VIIT, Pune
Prof.Dr.A.S. Tavildar
Principal
VIIT, Pune
BRACT’s Vishwakarma Institute of Information Technology, Pune – 48
Department of Information Technology
Survey No. 2/3/4, Kondhwa (Bk), Pune – 411048.
ACKNOWLEDGMENT
I feel great pleasure in submitting this seminar report on “BILLY GOAT SYSTEM”. I wish to express a true sense of gratitude towards my seminar guide, Prof. Ms. S.B. Balrawat, who at every step in the study of this seminar contributed her valuable guidance and helped to solve every problem that arose.
I would wish to thank our H.O.D. Prof. Mr. Pathak for opening the doors of the department towards the realization of the seminar report.
Most likely I would like to express my sincere gratitude towards my family for always being there when I needed them the most. With all respect and gratitude, I would like to thank all the people, who have helped me directly or indirectly. I owe my all success to them.
Venkatesh.S.Raju Roll No: -352
T.E (I.T) VIIT, Pune
ABSTRACT
BILLY GOAT SYSTEM
What is Billy Goat System?
Billy Goat is a sensor system designed and built with the specific purpose of detecting and identifying network-service worms. Because of this specific focus, Billy Goat can take advantage of worm-specific properties that would hinder general-purpose intrusion-detection systems, toward more efficient and accurate detection.
What are the characteristics of Billy Goat System?
The requirements on a worm-detection system (WDS) influence its desired characteristics, particularly in the following aspects:
1. Accuracy: The goal of a WDS is the identification of worm-infected machines. To offer real utility, it must be able to perform this task with a high level of accuracy, so that its reports can be trusted by system and network administrators as the basis for containment and remediation action. A WDS can use highly specialized techniques to detect worm-infected machines. This enables increased accuracy, at the expense of the ability to detect a wider range of attacks.
2. Speed: Given the explosive nature of modern worms, a WDS should be able to detect an infected machine as quickly as possible, to provide its users a chance to contain the damage, or even to function as the basis for an automated response system.
3. Manageability: New worms and worm variants appear almost every day, so the components of a WDS need to be updated regularly. At a systems level, this process must be automated as much as possible, to be able to deal with the monitoring of very large networks. At middleware and architecture levels, this means the base infrastructure must offer sufficient flexibility to enable the rapid creation of new detection capabilities.
4. Interoperability: Many organizations suffer from the proliferation of security tools, each with their own control, monitoring and reporting mechanisms. Furthermore, many places already have some form of monitoring console, virus-response policies and procedures, etc. A WDS should integrate as much as possible with the existing tools and processes.
5. Resilience: A WDS must operate under extreme conditions in terms of network and processing load, particularly during worm outbreaks. These conditions are more likely to induce failures than other environments. However, a WDS has a specific advantage not enjoyed by most other IDSs: because of the repetitive nature of worm activity, the WDS can afford to lose some data without reducing its utility. In practice, this means it is satisfactory to build a system that can “forcefully” recover from failure (for example, by automatically rebooting or even reinstalling itself) rather than trying to resist it.
6. Graceful degradation: While WDS’s may benefit from a distributed architecture, most worm outbreaks have the effect of overloading network links. It is therefore necessary for all sensors to be able to operate on their own (for example, reporting only local data). Given this condition, while the global system may be impeded, its individual sensors can still be useful during a worm outbreak.
A Final Word
Billy Goat has been designed to be scalable, to operate gracefully in a large distributed environment, and to provide extremely accurate detection of worm-infected machines. This report describes a number of interesting and useful techniques and components identified during the process of developing Billy Goat.
TABLE OF CONTENTS
1. INTRODUCTION 07
1.1 Typical Worm Spreading Logic 08
2. BILLY GOAT ARCHITECTURE 09
2.1 High Level Characteristics 09
2.2 Basic Architecture and Implementation 10
3. ENGINEERING DECISIONS AND IMPLEMENTATIONS 13
3.1 Database Tables 13
3.2 Feigning Servers 14
3.3 Address Virtualization 15
4. BILLY GOAT DEPLOYMENT 16
4.1 Modes of Deployment 16
4.2 Data Centralization Mechanisms 19
5. DATA ANALYSIS 20
5.1 Alarm Redistribution 21
6. BILLY GOAT – EFFECTS AND EXTENSIONS 22
6.1 Focus on Attacker-Centric Monitoring 22
6.2 Environmental Effects 23
6.3 Pattern Identification 24
7. FUTURE WORK 25
CONCLUSION 27
FREQUENTLY ASKED QUESTIONS 28
REFERENCES 29
CHAPTER 1:
INTRODUCTION
Recent years have brought a continued increase in both the importance of security in networked systems and the difficulty of securing them. The Internet has continued to expand, its connections have become nearly pervasive, and its protocols and services have grown more complex. Beyond the basic need for integrity, confidentiality and privacy, security has become essential toward providing reliability, safety, and freedom from liability. One of the greatest threats to security has come from automatic self-propagating attacks. These attacks include both viruses and worms. While the presence of these attacks is by no means new, the damage that they are able to inflict and the speed with which they are able to propagate have grown dramatically. Further increases in connectivity and complexity only threaten to increase their virulence.
A computer worm is a self-replicating computer program, similar to a computer virus. A virus attaches itself to, and becomes part of, another executable program; a worm, however, is self-contained and does not need to be part of another program to propagate itself. Worms are often designed to exploit the file-transmission capabilities found on many computers. The main difference between a computer virus and a worm is that a virus cannot propagate by itself whereas a worm can. A worm uses a network to send copies of itself to other systems and does so without any intervention. In general, worms harm the network and consume bandwidth, whereas viruses infect or corrupt files on a targeted computer. Viruses generally do not affect network performance, as their malicious activities are mostly confined within the target computer itself.
In addition to replication, a worm may be designed to do any number of things, such as delete files on a host system or send documents via e-mail. More recent worms may be multi-headed and carry other executables as a payload. However, even in the absence of such a payload, a worm can wreak havoc just with the network traffic generated by its reproduction. Mydoom, for example, caused a noticeable worldwide Internet slowdown at the peak of its spread. A common payload is for a worm to install a backdoor in the infected computer, as was done by Sobig and Mydoom. These zombie computers are used by spam senders for sending junk email or to cloak their website's address. Spammers are thought to pay for the creation of such worms, and worm writers have been caught selling lists of IP addresses of infected machines. Others try to blackmail companies with threatened DoS attacks. The backdoors can also be exploited by other worms, such as Doomjuice, which spreads using the backdoor opened by Mydoom.
1.1 Typical worm spreading logic:
Most worms use random IP address generation for spreading to different computers. The worm sends its code as an HTTP request to the target computer and then, depending on the specific worm, exploits a known vulnerability on that computer. For example, the CodeRed worm sends an HTTP request that exploits a buffer-overflow vulnerability, which allows the worm to run on that computer. The malicious code is not saved as a file, but is inserted into and then run directly from memory. There is no single intrusion strategy shared by all worms. One typical strategy, used by the W32-Blaster worm, is given below.
It generates an IP address and attempts to infect the computer that has that address. The IP address is generated according to the following algorithm (a sketch of this address-generation logic appears after the list):
· For 40% of the time, the generated IP address is of the form A.B.C.0, where A and B are equal to the first two parts of the infected computer's IP address. Once the IP address is calculated, the worm will attempt to find and exploit a computer with the IP address A.B.C.0. The worm will then increment the 0 part of the IP address by 1, attempting to find and exploit other computers based on the new IP address, until it reaches 254.
· With a probability of 60%, the generated IP address is completely random.
· To avoid looping back to infect the source computer, the worm will not make HTTP requests to the IP addresses 127.*.*.*.
· Some fixed characteristics of the TCP and IP headers are:
1. IP identification = 256
2. Time to Live = 128
3. Source IP address = a.b.x.y, where a.b are taken from the host's IP address and x.y are random. In some cases, a.b is also random.
4. Destination IP address = DNS resolution of "windowsupdate.com"
5. TCP Source port is between 1000 and 1999
6. TCP Destination port = 80
7. TCP Sequence number always has the two low bytes set to 0; the 2 high bytes are random.
8. TCP Window size = 16384
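As a rough illustration of the address-generation behavior described above, the following Python sketch reproduces the 40%/60% split and the sweep of the last octet. The handling of the third octet and the exact sweep details are simplifying assumptions of this sketch; it is not the worm's actual code.

import random

def blaster_style_base(local_ip):
    """Pick the base address A.B.C.0 for one scanning sweep.

    With 40% probability the worm stays near the infected host (A.B taken
    from the local address; reuse of C is an assumption of this sketch);
    otherwise the base is completely random. Loopback is never targeted.
    """
    a, b, c, _ = (int(x) for x in local_ip.split("."))
    if random.random() < 0.4:
        base = [a, b, c]
    else:
        base = [random.randint(1, 254), random.randint(0, 254),
                random.randint(0, 254)]
    if base[0] == 127:          # never loop back to 127.*.*.*
        base[0] = random.randint(128, 254)
    return base

def sweep(local_ip):
    """Yield A.B.C.1 .. A.B.C.254, the addresses probed from one base."""
    a, b, c = blaster_style_base(local_ip)
    for last in range(1, 255):
        yield "%d.%d.%d.%d" % (a, b, c, last)

# Example: the first five addresses an infected 9.2.3.4 host might probe.
targets = sweep("9.2.3.4")
for _ in range(5):
    print(next(targets))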
CHAPTER 2:
BILLY GOAT ARCHITECTURE:-
Billy Goat is a sensor system designed and built with the specific purpose of detecting and identifying network-service worms. Because of this specific focus, Billy Goat can take advantage of worm-specific properties that would hinder general-purpose intrusion-detection systems, toward more efficient and accurate detection.
2.1. High Level Characteristics –
The requirements on a worm-detection system (WDS) are different from those of a general-purpose network-based intrusion-detection system (IDS). While the latter needs to detect a wide and unpredictable variety of attacks, the former can focus on specific propagation and attack strategies used by worms. Moreover, the main purpose of a WDS is to detect infected machines in the network, whereas general-purpose IDS must detect the attacks themselves.
These requirements influence the desired characteristics of such a system, particularly in the following aspects:
1. Accuracy: The goal of a WDS is the identification of worm-infected machines. To offer real utility, it must be able to perform this task with a high level of accuracy, so that its reports can be trusted by system and network administrators as the basis for containment and remediation action. A WDS can use highly specialized techniques to detect worm-infected machines. This enables increased accuracy, at the expense of the ability to detect a wider range of attacks.
2. Speed: Given the explosive nature of modern worms, a WDS should be able to detect an infected machine as quickly as possible, to provide its users a chance to contain the damage, or even to function as the basis for an automated response system.
3. Manageability: New worms and worm variants appear almost every day, so the components of a WDS need to be updated regularly. At a systems level, this process must be automated as much as possible, to be able to deal with the monitoring of very large networks. At middleware and architecture levels, this means the base infrastructure must offer sufficient flexibility to enable the rapid creation of new detection capabilities.
4. Interoperability: Many organizations suffer from the proliferation of security tools, each with their own control, monitoring and reporting mechanisms. Furthermore, many places already have some form of monitoring console, virus-response policies and procedures, etc. A WDS should integrate as much as possible with the existing tools and processes.
5. Resilience: A WDS must operate under extreme conditions in terms of network and processing load, particularly during worm outbreaks. These conditions are more likely to induce failures than other environments. However, a WDS has a specific advantage not enjoyed by most other IDSs: because of the repetitive nature of worm activity, the WDS can afford to lose some data without reducing its utility. In practice, this means it is satisfactory to build a system that can “forcefully” recover from failure (for example, by automatically rebooting or even reinstalling itself) rather than trying to resist it.
6. Graceful degradation: While WDS’s may benefit from a distributed architecture, most worm outbreaks have the effect of overloading network links. It is therefore necessary for all sensors to be able to operate on their own (for example, reporting only local data). Given this condition, while the global system may be impeded, its individual sensors can still be useful during a worm outbreak.
2.2. Basic Architecture and Implementation –
Billy Goat is a worm detection system that possesses the characteristics described earlier. Billy Goat is designed to take advantage of the propagation strategies of worms. As explained earlier, most worms try to connect to IP addresses selected at random or scan entire ranges of addresses. By doing so, they can find most of the machines in a network, but they also try to connect to a large number of unused addresses. Billy Goat functions by responding to requests sent to these unused addresses, thereby feigning the existence of a large number of machines and services.
This approach has three immediate consequences:
1. The fact that the addresses are otherwise unused and not advertised means that all traffic destined to these addresses is a priori suspicious.
2. Active feigning of services, rather than the mere recording of connection attempts, enables a greatly improved understanding of the nature of the connection. Billy Goat is a first-person participant in the protocols, rather than a third-person eavesdropper.
3. The large number of addresses used gives Billy Goat an extensive view of the network. This enables on-box correlation of events from a seemingly diverse collection of sensors.
Instead of directly “guarding the valuables,” as traditional intrusion-detection deployments do, Billy Goat guards vast ranges of “nothingness” toward understanding what goes there and why. This is similar to a honeypot. This approach, permitted by the clear focus on detecting worms and coupled with the analysis performed on the data, frees Billy Goat from the high rate of false positives produced by most general-purpose IDSs. For the same reason, it is not a replacement for other IDSs but a complement to them. In particular, Billy Goat will not even see the traffic directed to existing machines and services, so it is unable to detect attacks against them.
Fig 1: Billy Goat internal architecture.
As shown in Fig. 1, at the core of Billy Goat are a virtualization mechanism and a data repository. The virtualization mechanism allows individual services to be written using standard programming models and interfaces, and to respond to multiple IP addresses transparently.
This reduces the difficulty of creating new feigning services and of integrating existing ones. The data repository provides storage for IP header information and for details of the application-level information generated by the feigning servers. The feigned services offered by Billy Goat include those commonly exploited by worms. Each endeavors to offer sufficient functionality to determine accurately the nature of an attack. All the sensors except for SMB (Windows file sharing) are implemented using a specialized framework written in Java that makes it easy to create new services, and which is carefully audited for security to reduce the possibility that a Billy Goat machine could be compromised or affected by a security problem.
A particular advantage of Billy Goat is that it allows us to implement feigned services “preemptively.” For example, when a new vulnerability is announced, it is possible to predict, based on some of its characteristics, that a worm will be written to exploit it. In these cases, we can create a new feigning server for the protocol affected by the vulnerability, and deploy it on all the existing Billy Goats. If and when the new worm appears, it will be immediately visible to the Billy Goat application-layer sensors. This capability is particularly important in recent times, when the window of time between the announcement of a vulnerability and the appearance of code that exploits it has shrunk dramatically, to an average of 5.8 days (as seen in the first half of 2004).
To satisfy the requirement of continued function in times of heavy worm activity, when the performance of the network may be dramatically diminished, WDSs require distributed architectures. Each Billy Goat offers the ability to analyze and report events detected locally, thus providing graceful degradation of the detection service. At the same time, the data of all Billy Goats on an intranet is centralized to assemble a more complete view, as shown in Figure 2.
Fig 2: Distributed Billy Goat architecture.
The nature of the monitoring allows detection of infected machines even on network segments that do not have a Billy Goat sensor installed. Billy Goat includes extensive self-monitoring and recovery mechanisms. When a problem cannot be solved satisfactorily, the machine reboots itself. This provides increased resilience by enabling individual machines to automatically recover from failure. To support the distributed architecture, Billy Goat includes an automatic update mechanism. This ensures that each sensor is always current with respect to both signatures and software versions, and makes it easier to manage a large distributed infrastructure.
CHAPTER 3:
ENGINEERING DECISIONS AND IMPLEMENTATIONS:-
Billy Goat has been implemented with a view to using standard tools, formats, and APIs. The implementation of the system focuses on providing simple, well-documented interfaces by which Billy Goat may be integrated with existing tools and processes. Many open-source components have been used thoughtfully throughout Billy Goat, and its construction would not have been possible without them.
3.1. Database Tables:-
The data collected by Billy Goat at the IP tables and application layers is split into four database tables to accommodate the repetitive, and often verbose, nature of the attacks used by worms. These tables have the form shown in Figure 3 below, where solid lines indicate external keys (references across tables) and dashed lines indicate temporal proximity (the times are generated in different layers and hence may have slightly different values).
Fig 3: Database table structure used.
Time is recorded in TIME, an SQL timestamp, together with TIME OFFSET, which indicates the nth event within a given TIME value. The pair thus creates a unique timestamp for each event.
REPORTER is the IP address of the Billy Goat sensor that observed the event. The presence of this field is important in a distributed system.
SRC, PROTOCOL, DST, SPT, DPT and FLAGS apply to the IP layer, mapping to the three covered protocols (TCP, UDP, and ICMP).
FLAGS are void for UDP and ICMP, and the type of ICMP message is stored in both the SPT and DPT fields. Storing the three types of traffic in a single database table makes it easier to extract all the information with a single SQL query.
The full descriptions of the application-layer activity, REQUEST and HOST, are expressed in XML. The hierarchical and extensible nature of XML allows us to meaningfully encode descriptions of the numerous application sensors in use. For example, a simple UDP listener provides a greatly different data model than an extremely complicated protocol like SMB. XML allows us to keep simple descriptions simple while still allowing complex descriptions. Cryptographic checksums (MD5) of REQUEST and HOST are used as the corresponding indices, rather than native database references (external keys). This offers the significant advantage that references depend only on the representation of the database record to which they refer, rather than on the order in which records were inserted (as would be the case with traditional database external keys). This technique greatly eases data centralization in a distributed system.
SEQID is an automatically incremented value used to keep track of which events have been processed by different components. Finally, SENSOR is a short string identifying the feigning server that produced the record (some of the existing values are http, smb and dcom).
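As a concrete illustration of the fields just described, the sketch below lays out a reduced two-table version of such a schema in SQLite and shows how an MD5 checksum of the XML request description can serve as the cross-table reference. Table and column names, types and values are assumptions for this sketch, not the actual Billy Goat schema.

import hashlib
import sqlite3

conn = sqlite3.connect(":memory:")
conn.executescript("""
CREATE TABLE ip_event (
    time        TEXT,      -- SQL timestamp of the event
    time_offset INTEGER,   -- nth event within the same TIME value
    reporter    TEXT,      -- IP address of the observing Billy Goat sensor
    src TEXT, dst TEXT, protocol TEXT,
    spt INTEGER, dpt INTEGER, flags TEXT,
    request_md5 TEXT,      -- checksum reference into app_request
    seqid INTEGER PRIMARY KEY AUTOINCREMENT,
    sensor TEXT            -- feigning server that produced the record
);
CREATE TABLE app_request (
    md5 TEXT PRIMARY KEY,  -- checksum of the XML description, used as the key
    xml TEXT               -- application-level description of the request
);
""")

# Referencing records by the MD5 of their content (instead of an auto-assigned
# row id) keeps references stable when data from many sensors is merged.
request_xml = "<request server='http'><path>/default.ida</path></request>"
digest = hashlib.md5(request_xml.encode()).hexdigest()
conn.execute("INSERT OR IGNORE INTO app_request VALUES (?, ?)", (digest, request_xml))
conn.execute(
    "INSERT INTO ip_event (time, time_offset, reporter, src, dst, protocol,"
    " spt, dpt, flags, request_md5, sensor)"
    " VALUES (datetime('now'), 0, '10.0.0.5', '9.2.3.4', '10.1.2.3', 'tcp',"
    " 1033, 80, 'SYN', ?, 'http')",
    (digest,),
)
conn.commit()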
3.2. Feigning servers –
The key observation mechanism of Billy Goat is a collection of feigning servers, each covering an infection vector used by worms to propagate. Each server is equipped with enough logic to accurately diagnose the nature of a connection. Real servers may have vulnerabilities in different layers, and this often requires the feigning servers to be written so that they can detect attacks in several layers; for example, a feigning server may need to detect both low-level buffer-overflow exploits and application-layer exploits. In general, the servers follow the corresponding protocol up to a point that allows accurate identification of the activity, but no more.
For example:
· The HTTP feigning server accepts and records a single HTTP request, and always responds with a “page not found” error before closing the connection (a minimal sketch of such a server follows this list).
· The MS/RPC feigning server accepts and records the first 3000 bytes (configurable) transmitted by the client, before closing the connection. This initial payload generally contains either the full code of the worm or an exploit particular to the worm.
· The SMB/Lure server is a special configuration of Samba that appears to be a badly configured machine (open shares, weak passwords, etc.). Because it is a full implementation of the protocol, SMB/Lure can often capture the full code of the worm, as they upload themselves to Billy Goat.
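The sketch below is an illustrative Python stand-in for an HTTP feigning server of the kind described in the first bullet: it accepts a single request, records it, answers “404 Not Found”, and closes the connection. It is not the actual Billy Goat implementation, whose servers are written in Java (see below); the print call is only a placeholder for the data repository, and the port is chosen so the sketch can run without root privileges.

import json
import socketserver
from http.server import BaseHTTPRequestHandler

class FeigningHTTPHandler(BaseHTTPRequestHandler):
    def _record_and_deny(self):
        # Record enough of the request to diagnose the nature of the contact.
        event = {
            "src": self.client_address[0],
            "method": self.command,
            "path": self.path,
            "headers": dict(self.headers.items()),
        }
        print(json.dumps(event))            # placeholder for the data repository
        self.send_error(404, "Not Found")   # always answer "page not found"

    def do_GET(self):
        self._record_and_deny()

    do_POST = do_GET
    do_HEAD = do_GET

    def log_message(self, fmt, *args):
        pass   # suppress default stderr logging; the event above is the record

if __name__ == "__main__":
    with socketserver.TCPServer(("0.0.0.0", 8080), FeigningHTTPHandler) as srv:
        srv.serve_forever()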
The majority of the servers are written in Java and produce XML descriptions of individual interactions. The specific syntax of each record (i.e., the tree and object structure) is left to each individual server. The fact that the JOX framework assembles the components at runtime makes the creation, debugging, testing, and deployment of new services quite easy.
3.3. Address Virtualization –
Address virtualization transparently maps the large ranges of IP addresses covered by a Billy Goat to the single “real” address used by the machine. This virtualization allows the Billy Goat feigning servers to be ordinary server software, written with no special consideration for the large number of IP addresses that a Billy Goat machine monitors. Address virtualization is handled by the operating system, in particular by the iptables mechanism in the 2.6 release of the Linux kernel.
One of the mechanisms built into iptables is Network Address Translation (NAT). Ordinarily, NAT is used to allow several machines inside a network to share a single external address. For Billy Goat, we need the reverse: to allow a single machine to respond to a large number of external addresses. This is called “reverse NAT” (as shown in Fig. 4).
Fig 4: Traditional Network Address Translation (NAT) and reverse NAT.
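For a rough sense of what reverse NAT can look like in practice, the sketch below merely assembles a single iptables DNAT rule that redirects traffic destined to an unused range to the sensor's real address; the range and address are placeholder assumptions, and the actual Billy Goat rules are more involved.

def reverse_nat_rule(spoofed_range, real_address):
    """Build an iptables rule that steers traffic for an unused range
    (spoofed_range) to the sensor's single real address."""
    return ("iptables -t nat -A PREROUTING "
            "-d %s -j DNAT --to-destination %s" % (spoofed_range, real_address))

# Hypothetical example: claim 10.20.0.0/16 and direct it to the sensor.
print(reverse_nat_rule("10.20.0.0/16", "10.20.0.1"))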
CHAPTER 4:
BILLY GOAT DEPLOYMENT
During the design, implementation and deployment of “Billy Goat”, a number of properties, constructions and conventions are to be considered. They are as follows: -
· The importance of homogeneity – To operate a large network of distributed sensors while keeping it manageable, it is imperative that all the sensors are as homogeneous as possible: no special cases, no distinct configurations. This enables automatic update of all the components and configurations. Special adjustments to some of the sensors make them lag behind in terms of updates and maintenance, either because the updates fail or because updates have been disabled to prevent them from overwriting the specialized configuration.
· Centralized configuration – Even in the presence of homogeneity, it is necessary to have some configuration that is different for each sensor. This includes its network and deployment mode configuration information and local addresses that should be ignored or trusted for management purposes.
A related issue is maintaining the configuration for all the distributed sensors in a central place. This offers two advantages:
· If a sensor is completely destroyed (for example, by a catastrophic disk failure), it is trivial to restore its configuration to reinstall the sensor on a different machine.
· It becomes possible to centrally control configuration of machines, similar to network configuration schemes like BOOTP and DHCP. Based on a unique identifier, the central server can provide each sensor with its configuration information, to automate even the initial installation.
This doubly-centralized configuration offers the best of both worlds: it keeps all the sensors as homogeneous as possible, while offering the possibility of having per-sensor configuration in an automated and manageable fashion.
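The following is a minimal sketch of this doubly-centralized idea: a common base configuration shared by every sensor, overlaid with a small per-sensor record keyed by a unique identifier. All names, fields and values here are assumptions for illustration, not the actual Billy Goat configuration.

# Illustrative only: a central store of per-sensor overrides on top of a
# shared base configuration, keyed by a unique sensor identifier.
BASE_CONFIG = {
    "update_url": "https://billygoat.example.com/updates",  # assumed URL
    "report_interval_s": 300,
    "deployment_mode": "static-routes",
}

PER_SENSOR = {
    "sensor-0042": {
        "spoofed_ranges": ["10.20.0.0/16"],
        "trusted_addresses": ["10.20.0.250"],  # scanners to answer truthfully
    },
}

def config_for(sensor_id):
    """Merge the shared base with the sensor-specific overrides."""
    merged = dict(BASE_CONFIG)
    merged.update(PER_SENSOR.get(sensor_id, {}))
    merged["sensor_id"] = sensor_id
    return merged

print(config_for("sensor-0042"))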
4.1. Modes of Deployment
The fundamental premise of Billy Goat is responding to traffic directed to unused IP addresses, as described in Section 2. However, different deployment modes can be used and combined to direct such traffic to Billy Goat.
4.1.1) Specific network ranges with static routes –
This is the “standard” Billy Goat deployment mode. A specific set of IP address ranges (which should not be in use) are designated for Billy Goat, and the appropriate routers are reconfigured to send traffic sent to those ranges to a Billy Goat sensor. The amount of traffic seen by the sensor depends directly on the size of the network range assigned. Addresses within non-routed address ranges that are not used locally may also be routed to the Billy Goat.
Advantages:
A known set of IP addresses is assigned to Billy Goat, which helps in controlling the amount of traffic it has to process. This mode of operation is well understood, and only simple configuration changes need to be made to the routers.
Disadvantages:
Large-enough groups of network addresses must be available, and assigned by the network administrator. If the assigned range is too small, the functionality of Billy Goat is limited because it cannot observe much of the network traffic.
4.1.2) ARP spoofing –
In a local network, the machine that has a particular IP address is found by using ARP (the Address Resolution Protocol). Using this protocol, machines and routers in the local network that need to send traffic to an address X broadcast the question “who has address X?” and wait for a response. If no response is received within a certain period of time, the address is considered nonexistent. This mode of operation allows a malicious host to “hijack” IP addresses in an attack known as ARP spoofing. Incidentally, the same technique can be used by a Billy Goat device to automatically grab unused IP addresses, using the following method (a sketch of this logic appears after the list):
· Observe ARP requests on the local network.
· If a response is not observed within a short period of time, the Billy Goat sensor sends a response, effectively assigning the requested address to itself.
· Future traffic (for a certain period of time) to the spoofed address will be sent by the local router and machines to the Billy Goat sensor.
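A rough sketch of this address take-over logic follows. The packet capture and reply functions (observe_arp_requests, send_arp_reply) are left as hypothetical stand-ins for a real capture library, and the timeout value is an assumption.

import time

PENDING_TIMEOUT_S = 2.0   # how long to wait for a legitimate reply (assumed)
pending = {}              # requested IP -> time the ARP request was seen

def on_arp_request(requested_ip, now=None):
    """Record an ARP request so we can claim the address if nobody answers."""
    pending[requested_ip] = now if now is not None else time.time()

def on_arp_reply(answered_ip):
    """A real device answered: never spoof an address that is in use."""
    pending.pop(answered_ip, None)

def addresses_to_claim(now=None):
    """Return addresses whose requests have gone unanswered long enough."""
    now = now if now is not None else time.time()
    expired = [ip for ip, seen in pending.items()
               if now - seen >= PENDING_TIMEOUT_S]
    for ip in expired:
        del pending[ip]
    return expired

# Usage sketch (the capture loop and send_arp_reply() are hypothetical):
#   for request in observe_arp_requests():
#       on_arp_request(request.target_ip)
#   for ip in addresses_to_claim():
#       send_arp_reply(ip, our_mac_address)   # Billy Goat claims the address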
Advantages:
No previous assignment of IP addresses is needed, so the deployment effort can be very low (simply connect the Billy Goat machine to the network).
Disadvantages:
ARP spoofing is potentially very dangerous, and can cause trouble if Billy Goat attempts to spoof the IP address of an existing device. Additionally ARP spoofing only works in a local-area network, so in this mode, a Billy Goat sensor can only spoof addresses in the LAN to which it is connected. The implementation of this scheme needs to take into account the potential appearance of devices (Billy Goat must stop spoofing an address immediately when another device with the same address appears on the network).
4.1.3) Billy Goat as default route –
Instead of having specific network ranges assigned to the Billy Goat sensor, the router can be configured to forward everything to Billy Goat, except for the ranges that are being used. This essentially makes Billy Goat the default route for the network, and traffic to all the unused network segments will be sent to it. This scheme can be implemented statically (when the router has a static routing table, and the route to the Billy Goat sensor is added as the default route) or dynamically in conjunction with a routing protocol like BGP. In this second mode, Billy Goat will automatically receive all the traffic that is not covered by the current dynamic routing tables.
Advantages:
Large network coverage, and ease of configuration (the Billy Goat sensor can be configured to “spoof everything,” and it will respond to any traffic it receives).
Disadvantages:
It is potentially dangerous, particularly in conjunction with dynamic routing. In a large network, it is common that certain network segments go offline for short periods of time. If Billy Goat automatically starts responding for them, it may disturb services or automated monitoring systems in the network.
4.1.4) ICMP-based Billy Goat –
One of the most recent developments is a mode of deployment in which Billy Goat operates in conjunction with a router to provide automatic utilization of all the unused addresses outside the local network. This is how it works (see Figure 5):
· When an infected machine in the local network tries to contact a remote non-existing address, an ICMP “network unreachable” or “host unreachable” message will be sent back.
· The ICMP message is intercepted by the router local to the infected machine, which sets up a temporary route for that destination address, with the Billy Goat sensor as its next hop.
· When the infected machine, after not receiving a response, retransmits its packet, it will be sent to the Billy Goat sensor, which will respond to it.
Fig 5: ICMP-based Billy Goat.
Advantages:
This mode of operation allows for automatic spoofing of every unused address outside the LAN. This provides Billy Goat with a truly expansive view and allows it to quickly identify local infected machines.
Disadvantages:
Router support is needed to implement this scheme. The implementation also needs to be careful about removing the routes for hosts once they become active again.
Shunning mode – One problem faced by a Billy Goat sensor, particularly when it is spoofing a very large network range, is that it can suffer from effects similar to those caused by a distributed denial-of-service attack, simply from the sheer amount of traffic that it needs to monitor and respond to. To limit this problem, the following technique can be used in combination with any other deployment mode: once an IP address is identified as infected, it is added to a “shun list” that causes its traffic to be ignored for a certain period of time (a minimal sketch follows below). This reduces the overall load on the Billy Goat sensor, particularly in times of heavy worm activity, while still allowing it a complete view of the network.
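A minimal sketch of the shun-list idea just described, with the shun period as an assumed value: infected sources are ignored for a while and automatically reconsidered afterwards.

import time

SHUN_PERIOD_S = 3600.0   # how long to ignore an infected source (assumed)
_shunned = {}            # source IP -> time it was added to the shun list

def shun(src_ip, now=None):
    """Start ignoring traffic from an address identified as infected."""
    _shunned[src_ip] = now if now is not None else time.time()

def is_shunned(src_ip, now=None):
    """True if traffic from src_ip should currently be dropped unprocessed."""
    now = now if now is not None else time.time()
    added = _shunned.get(src_ip)
    if added is None:
        return False
    if now - added >= SHUN_PERIOD_S:
        del _shunned[src_ip]         # period over: watch this source again
        return False
    return True

# Usage sketch: skip the expensive feigning/recording path for shunned hosts.
# if is_shunned(packet.src):
#     return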
4.2. Data Centralization Mechanisms –
One of the recurrent problems in intrusion detection is the transfer of data to a central location. This is technically problematic for a number of reasons:
· The common need to transfer data across firewalls. The usual unsatisfactory solution to this problem is to open additional ports in the firewall to allow for the necessary communication channels.
· With a widely distributed system, reliable transfer of data to a central server may be prone to failure, potentially causing data loss or duplication.
One possible tool for this data transfer is BEEPLite, an implementation of BEEP, a modular, extensible protocol that allows flexible establishment and multiplexing of communication channels. One of its salient features, which makes it particularly suitable for transferring intrusion-detection data, is that it decouples the concept of connection initiator from that of client. This means that either the server or the client can initiate a connection, which eases configuration across firewalls.
BEEP additionally simplifies the addition of encryption, authentication and compression to the data flows. BEEP has gained substantial popularity in recent years, and within the intrusion-detection community it is also used by the IETF Intrusion Detection Working Group in the IDXP specification.
CHAPTER 5:
DATA ANALYSIS
The data analysis is an iterative process that attempts to determine the types of activities that have been seen by Billy Goat from different source addresses in the network. This process is most valuable when done in the central server to which all Billy Goat machines send their data, because it allows discovery of global behavior that may not be visible at the individual sensors.
· In the first analysis step, a precise description of the behavior of each source IP address is constructed, as a data model, from all the data gathered during the specified time period. This includes the list of destinations contacted, the protocols and port numbers used, and all the requests sent to the Billy Goat sensors.
· The second step summarizes the data: for example, individual destination addresses are replaced by the number of destinations contacted, and contact counts are replaced by binary orders of magnitude. This eases later identification of similarities between behavior patterns.
· The third step identifies known worms, attacks, and behaviors; thereby creating a high-level description of the data. This is done using a combination of the following methods:
– Capture of the worm itself (for example, SMB worms that upload themselves to Billy Goat). In this case, the MD5 checksum of its code is used to identify the worm with 100% accuracy.
– Observation of the exploits used by the worm (for example, an HTTP request containing a buffer overflow). Because Billy Goat is a first-person observer, it can accurately collect the full set of exploits used by a worm, and during the analysis phase these sets can be matched against the characteristics of known worms.
– Observation of other behaviors indicative of worm activity (for example, horizontal scanning or account guessing). These are weaker indicators of worm activity in the sense that they do not make it possible to precisely identify the worm.
When precise worm identification is possible (that is, when we can give a name to the worm), the findings are labeled as “alarm” and the worm name is given. Clearly suspicious but unidentifiable findings (for example, a large horizontal scan or a single exploit that does not fully identify a worm) are labeled as “warning” and a description is given. All other data is labeled “unknown” and is available via direct query of the database.
Additional analysis steps may be introduced to add location information or other relevant information concerning a host (for example, DNS lookup, asset or vulnerability information).
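The following sketch ties the summarization and labeling steps together for a single source address: a captured payload checksum or a known exploit yields an “alarm”, a large scan alone yields only a “warning”, and everything else remains “unknown”. The thresholds, signature tables and field names are assumptions for illustration, not the actual Billy Goat rule set.

# Illustrative classification of per-source behaviour.
KNOWN_WORM_MD5 = {"00000000000000000000000000000000": "ExampleWorm.A"}  # placeholder digest
KNOWN_EXPLOITS = {"ms03-026-dcom-overflow": "W32-Blaster"}              # assumed label
SCAN_WARNING_THRESHOLD = 256   # distinct destinations (assumed)

def binary_order(n):
    """Summarize a count as a binary order of magnitude (0, 1, 2, 4, 8, ...)."""
    if n <= 0:
        return 0
    order = 1
    while order * 2 <= n:
        order *= 2
    return order

def classify(source):
    """source: dict with 'src', 'destinations', 'payload_md5s', 'exploits'."""
    summary = {"src": source["src"],
               "destinations": binary_order(len(source["destinations"]))}
    for digest in source.get("payload_md5s", ()):
        if digest in KNOWN_WORM_MD5:
            return ("alarm", KNOWN_WORM_MD5[digest], summary)
    for exploit in source.get("exploits", ()):
        if exploit in KNOWN_EXPLOITS:
            return ("alarm", KNOWN_EXPLOITS[exploit], summary)
    if len(source["destinations"]) >= SCAN_WARNING_THRESHOLD:
        return ("warning", "large horizontal scan", summary)
    return ("unknown", None, summary)

print(classify({"src": "9.2.3.4",
                "destinations": ["10.20.0.%d" % i for i in range(1, 255)],
                "exploits": ["ms03-026-dcom-overflow"]}))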
5.1. Alarm Redistribution –
One important feature of Billy Goat is that each sensor detects infected machines throughout the network. By performing a centralized analysis of the data, we can build an even more complete view of the network and detect otherwise undetected phenomena. Consequently, it is possible to detect and diagnose problems in machines at locations without any IDS installed. While this is a nice property of the Billy Goat infrastructure, it only becomes valuable if the results can be disseminated to the appropriate people throughout the world in a timely fashion.
The subscription service for Billy Goat produces alerts based on the centralized Billy Goat data, to provide the most complete coverage. It allows individual systems administrators to self-register to receive alerts pertinent to their own network ranges, thereby ensuring that alerts are delivered to someone who can actually do something to fix the detected problems. Open registration allows the creation of a “living” mapping between networks and owners. Access control is done via social mechanisms: the subscriber’s manager is notified on registration. The subscription service also permits flexible configurations based on a user-defined policy (a sketch of such a policy follows the list below). The policy focuses on two aspects of the alarm redistribution:
· It can be used to set filters to restrict the alarms to some particular networks of interest, or to select the type of alarms to send.
· It controls the rate at which alarms should be sent, thereby preventing accidental denial of service effects against the subscribers. For example, systems administrators may choose to be notified immediately up to a specifiable limit whereas people concerned with global metrics may opt to receive daily summarized reports.
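The sketch below illustrates such a subscriber policy: a network filter, an alarm-type filter, and a simple hourly rate limit. Class and field names, addresses and limits are assumptions for this sketch, not the real subscription service.

import ipaddress
import time

class Subscription:
    """Illustrative subscriber policy: network filter, alarm-type filter,
    and a simple rate limit."""

    def __init__(self, email, networks, alarm_types, max_per_hour):
        self.email = email
        self.networks = [ipaddress.ip_network(n) for n in networks]
        self.alarm_types = set(alarm_types)      # e.g. {"alarm", "warning"}
        self.max_per_hour = max_per_hour
        self._sent = []                          # delivery timestamps

    def wants(self, alarm_type, infected_ip):
        ip = ipaddress.ip_address(infected_ip)
        return (alarm_type in self.alarm_types
                and any(ip in net for net in self.networks))

    def allow_now(self, now=None):
        now = now if now is not None else time.time()
        self._sent = [t for t in self._sent if now - t < 3600]
        if len(self._sent) >= self.max_per_hour:
            return False                         # hold for the daily summary
        self._sent.append(now)
        return True

admin = Subscription("admin@example.com", ["10.20.0.0/16"], ["alarm"], 20)
if admin.wants("alarm", "10.20.3.7") and admin.allow_now():
    print("send alert to", admin.email)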
CHAPTER 6:
BILLY GOAT – EFFECTS AND EXTENSIONS
In its architecture, Billy Goat is somewhat similar to intrusion-detection systems. The construction and deployment of Billy Goat give some insights into fundamental issues, behaviors and properties of worm-detection systems and of large distributed systems. This section explores the similarities to, and differences from, other protection mechanisms.
6.1. Focus on attacker-centric monitoring –
The focus of traditional network-based IDS’s (NIDS) is on the ability to detect attacks against valuable systems, such as critical servers. This “attack-centric” approach has the following implications:
· Although identifying and diagnosing the attacker may be possible, the priority is detecting the attacks.
· NIDS only need to observe traffic going to the machines they protect, and for performance reasons, they are often prevented from seeing anything else. This limits the view of the IDS.
By contrast, Billy Goat has an “attacker-centric” approach: we are more interested in identifying and diagnosing attackers (infected machines) than on identifying which victims have been attacked. This approach also has implications in the deployment of the sensors i.e. they should be placed so that they can receive as much traffic as possible. A target-centric sensor will have detailed information about the local target machines, but limited information about the attackers. An attacker-centric sensor may have limited information about the targets, but will have detailed information about the attackers. For example, a Billy Goat sensor can detect infected machines anywhere in the network, as long as they try to connect to one of the addresses spoofed by that sensor.
In a distributed Billy Goat deployment, each sensor has a global view of the network, limited only by traffic filtering between different network segments. By centralizing this information as described earlier, it is possible to have an unimpeded, expansive view of infected machines anywhere in the network. Global aggregation of data allows detection of behaviors and patterns that may not be noticeable at the local level, like very stealthy or slow scans. For example, we have seen some worms that scan the network by contacting a single address per class-C network. This would not be detected by a Billy Goat sensor spoofing a single class-C network, but is easily discernible at the global level.
6.2. Environmental Effects –
In addition to detecting worms and curbing their infections, Billy Goat may also have favorable or unfavorable side effects. One of the key areas is the effect of Billy Goat on other systems in the networking environment.
6.2.1) Network discovery –
The first aspect encountered is the interaction of Billy Goat with devices and software that scan the network (for example, asset, vulnerability or network-discovery tools). Normally, Billy Goat responds to these scans for each one of the IP addresses it is spoofing, producing wildly inaccurate results for the scanner. This can be addressed by adding a mechanism that makes Billy Goat respond “truthfully” (responding only to traffic directed to its real IP address, and only for the very limited real services it offers) to a fixed set of addresses. This makes Billy Goat appear like a regular machine to authorized scanning devices.
Fig 6: Possible NIDS placements.
6.2.2) Network intrusion detection systems –
Another area under consideration is the sharp increase in the number of alarms coming from already deployed NIDS systems that could observe the stream of traffic flowing to Billy Goat. This stems from the fact that Billy Goat, in allowing the completion of illegitimate connections, increases the number of real, albeit harmless, attacks seen on the network. The size of the increase depends on the relative sizes of the networks, the saturation level of the network connections, the root cause of the alarm, the signatures and placement of the NIDS.
This effect was first noticed when the Nimda worm was active in a well-connected network whose Billy Goat address space was approximately 100 times larger than the number of actual hosts, and with NIDS at position 1 (see Figure 6). The result was roughly a corresponding 100-fold increase in the number of Nimda-related alarms of attacks from the Intranet against the LAN. A NIDS at position 2 would see an increase in the number of Nimda-related alarms of attacks originating from machines on the LAN (those against the Intranet and those against Billy Goat).
In the case of Nimda, and many other modern worms, the effect on NIDS 2 is augmented by the optimized propagation strategy that probabilistically favors machines with similar addresses. A NIDS at position 3 would see attacks from both the LAN and the Intranet and would have greatly increased fidelity stemming from the fact that it does not see any legitimate traffic.
6.2.3) Failure Modes –
The final area deals with the default route and Router/ICMP modes of deployment. The problem occurs when a machine or network goes down and Billy Goat automatically starts responding for it. In this case various live ness checking mechanism,
such as –
“ICMP echo (ping)”
yield deceptive results. This is especially problematic when these live ness checking mechanisms have been connected to other systems. This failure mode, induced by a relatively passive system, can be considered as an interesting warning for automatic intrusion response.
6.3. Pattern identification –
The data gathered by “Billy Goat” can be collated for creating clusters of hosts corresponding to active worms in the network. There are interesting applications of the classification of suspicious hosts by behavior types. Emergence of new clusters can be indicative of new worm outbreaks (also of network misconfigurations and malfunctions), and can be used as an early-warning system. The clusters contain detailed information about worm behavior, including infection vectors, scanning algorithms, and exploits used.
To explore the use of traditional data-mining techniques, tools such as CLARAty can be used. In order to apply classical data-mining algorithms to the collected behavior descriptions, the data model is simplified by extracting essential features (a sketch of the feature extraction and clustering follows the list). This is done by –
· Extracting descriptions of the ports targeted, along with descriptions of the application-level activity (including identification of exploits when possible).
· Adding features computed from the available data, including the order of magnitude of the number of hosts contacted, the efficiency of the scanning algorithm (whether infection attempts are directed at hosts already contacted or not), and scan density/intensity (defined as the average number of contacts per destination class-C network).
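To make the feature extraction concrete, the sketch below computes the features named above for a few hand-made host records and clusters them. scikit-learn's KMeans is used here purely as a stand-in for a tool such as CLARAty, and every field name, value and record is an assumption for illustration.

import math
from sklearn.cluster import KMeans

def features(host):
    """host: per-source behaviour record with assumed fields."""
    contacts = host["contacts"]                # total connection attempts
    destinations = host["destinations"]        # distinct destination count
    class_c_nets = host["class_c_networks"]    # distinct /24 networks touched
    return [
        host["top_port"],                                     # main targeted port
        int(math.log2(destinations)) if destinations else 0,  # order of magnitude
        host["repeat_ratio"],                                 # scan efficiency
        contacts / max(class_c_nets, 1),                      # scan density
    ]

hosts = [
    {"top_port": 135, "contacts": 5000, "destinations": 4096,
     "class_c_networks": 16, "repeat_ratio": 0.02},
    {"top_port": 445, "contacts": 300, "destinations": 250,
     "class_c_networks": 1, "repeat_ratio": 0.40},
    {"top_port": 135, "contacts": 4800, "destinations": 4000,
     "class_c_networks": 15, "repeat_ratio": 0.03},
]

labels = KMeans(n_clusters=2, n_init=10, random_state=0).fit_predict(
    [features(h) for h in hosts])
print(labels)   # hosts 0 and 2 should land in the same cluster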
CHAPTER 7:
FUTURE WORK
The following areas have future prospects for Billy Goat development.
1. Billy Goat currently provides immediate notification to network administrators about infected machines in their domains. It also provides summarized information about infected machines and suspicious behavior, useful for higher-level management. However, all the information is presently provided in text and XML format, which requires some level of expertise to interpret. To increase the usefulness of Billy Goat data, its reporting and visualization capabilities could be improved. The following improvements could be made:
– Graphic visualization of low-level activity (IP traffic and alarms generated by the feigning servers) both globally and in each Billy Goat sensor. A visual representation of levels of traffic, for example, is very useful in quickly detecting suspicious behavior.
– Graphic visualization of summarized activity. For example, charts showing trends and statistics in number of infected machines and per-region aggregated infection data.
– High-level reports of numbers of infected machines, emerging behaviors, and most common types of worm infections.
2. It is evident that analysis of the data using clustering can produce interesting results. The use of data mining techniques can be further studied and the possibility of automatically generating signatures and metasignatures can be explored (in which multiple signatures are taken as evidence of another, possibly distributed, attack) based on the results.
3. Another interesting development based on the results of anomaly analysis of Billy Goat data is the automatic creation and deployment of feigning servers. For example, if the IP traffic data shows a marked increase in connections to a certain port where no server currently exists, a generic listener for that port could be automatically instantiated and distributed to all the Billy Goat sensors, to capture the payload being sent to that port. This could greatly reduce the reaction time in the face of a new worm outbreak.
4. To ensure accurate identification, the ideal would be to capture the actual worm code. Currently, this is done only by some of the feigning servers (e.g. SMB/Lure, MS/RPC and MS/SQL). This capability could also be incorporated into other servers. For this to happen, a feigning server needs to provide the appropriate responses so that the worm believes its exploit has succeeded and proceeds to upload its code. In the general case this is impossible to do in a feigning server (as it would need to emulate the full behavior and all the vulnerabilities of the service being attacked). One possible approach is to combine Billy Goat with virtual machines running the real services: when a Billy Goat sensor receives a connection on a port for which no feigning server exists, it would pass the connection to a virtual machine, and observe both the network traffic and the disk of the virtual machine after the connection has ended. This would provide valuable information about new worms, making it possible to capture any type of new worm, and make it easier to construct new feigning servers.
5. Finally, having a sensor that produces no false positives for a certain class of attacks might make possible the long-standing dream of intrusion detection: an automated response system. Most such systems to date have been marred by false positives, which often result in the response system causing more damage than good. The aim is to build a system that accurately and efficiently isolates misbehaving machines, while allowing critical technical and business processes to continue unimpeded, and with an extreme focus on potential failure modes, and how they might be eliminated or mitigated.
CONCLUSION
Billy Goat has been designed to be scalable, to operate gracefully in a large distributed environment, and to provide extremely accurate detection of worm-infected machines. This report describes a number of interesting and useful techniques and components identified during the process of developing Billy Goat. It can be used as a reference model for other practitioners faced with similar problems. The report also throws light on a number of related ideas, such as the use of cryptographic checksums as external keys to ease distributed deployment, and the use of social structures to control access to information and to determine who needs to receive alerts.
Billy Goat is useful both for accurately detecting machines infected with known worms, through its signature-based analysis, and for detecting emerging behavior and new worms, through its comprehensive view of the network and its anomaly-analysis capabilities:
· The former reports are immediately useful to system and network administrators, who can be notified of specific infections in machines under their control.
· The latter are useful to security and network analysts interested in large-scale and new behavior of the network.
A single Billy Goat sensor deployed in a network can provide useful information about infected machines, but its real value shows in a multi-sensor environment, which provides better coverage of large networks and where the data can be centralized and analyzed to detect emerging trends and global suspicious behavior. Thus Billy Goat can become an integral part of a network-based security infrastructure.
FREQUENTLY ASKED QUESTIONS
Q1) What is Billy Goat System?
Q2) What are the various characteristics of Billy Goat Systems?
Q3) Who designed Billy Goat System?
Q4) What are its Future Works?
Q5) How does Billy Goat System detect Worms?
REFERENCES
[1] James Riordan, Andreas Wespi and Diego Zamboni, "How to hook worms", IEEE Spectrum, volume 42, number 5, May 2005.
[2] “Lessons learned from Billy Goat – an accurate worm detection system”, www.wormblog.com/detection/.
[3] James Riordan, Andreas Wespi and Diego Zamboni, “Lessons learned from Billy Goat – an accurate worm detection system”, RZ-3609 (#99619).
ON
BILLY GOAT SYSTEM
BY
VENKATESH RAJU
T.E. Information Technology
Roll No. 352
2006-2007
GUIDED BY
Prof Ms S.B.BALRAWAT
BRACT’S
Vishwakarma Institute of Information Technology, Pune-48
Department of Information Technology, Survey No. 2/3/4
Kondhwa (BK), Pune-411048
CERTIFICATE
This is to certify that Mr. Raju Venkatesh Satyanarayan has successfully completed his Seminar on Billy Goat System in partial fulfillment of third year of degree course in Information Technology in the academic year 2006-2007.
Date: /02/2007
Prof.N.P.Pathak Prof. Ms S.B.Balrawat
Head, Information Technology Seminar Guide VIIT, Pune VIIT, Pune
Prof.Dr.A.S. Tavildar
Principal
VIIT, Pune
BRACT’s Vishwakarma Institute of Information Technology, Pune – 48 (12)
Department of Information Technology (12)
Survey No. 2/3/4, Kondhwa (Bk), Pune – 411048. (12)
ACKNOWLEDGMENT
I feel great pleasure in submitting this seminar report on “BILLY GOAT SYSTEM”. I wish to express true sense of gratitude towards my seminar guide, Prof Ms S.B.Balrawat who at very discrete step in study of this seminar contributed her valuable guidance and help to solve every problem that arose.
I would wish to thank our H.O.D. Prof. Mr. Pathak for opening the doors of the department towards the realization of the seminar report.
Most likely I would like to express my sincere gratitude towards my family for always being there when I needed them the most. With all respect and gratitude, I would like to thank all the people, who have helped me directly or indirectly. I owe my all success to them.
Venkatesh.S.Raju Roll No: -352
T.E (I.T) VIIT, Pune
ABSTRACT
BILLY GOAT SYTEM
What is Billy Goat System?
Billy Goat is a sensor designed system, built with the specific purpose of detecting and identifying network service worms. Because of this specific focus, Billy Goat can take advantage of worm specific properties that would hinder general-purpose intrusion-detection systems, toward more efficient and accurate detection.
What are the characteristics of Billy Goat System?
These requirements influence the desired characteristics of such a system, particularly in the following aspects:
1. Accuracy: The goal of a WDS is the identification of worm-infected machines. To offer real utility, it must be able to perform this task with a high level of accuracy, so that its reports can be trusted by system and network administrators as the basis for contention and remediation action. A WDS can use highly-specialized techniques to detect worm-infected machines. This enables increased accuracy, at the expense of the ability to detect a wider range of attacks.
2. Speed: Given the explosive nature of modern worms, a WDS should be able to detect an infected machine as quickly as possible, to provide its users a chance to contain the damage, or even to function as the basis for an automated response system.
3. Manageability: New worms and worm variants appear almost every day, so the components of a WDS need to be updated regularly. At a systems level, this process must be automated as much as possible, to be able to deal with the monitoring of very large networks. At middleware and architecture levels, this means the base infrastructure must offer sufficient flexibility to enable the rapid creation of new detection capabilities.
4. Interoperability: Many organizations suffer from the proliferation of security tools, each with their own control, monitoring and reporting mechanisms. Furthermore, many places already have some form of monitoring console, virus-response policies and procedures, etc. A WDS should integrate as much as possible with the existing tools and processes.
5. Resilience: A WDS must operate under extreme conditions in terms of network and processing load, particularly during worm outbreaks. These conditions are more likely to induce failures than other environments. However, a WDS has a specific advantage that is not enjoyed by most other IDS’s because of the repetitive nature of worm activity; the WDS can afford to lose some data without reducing its utility. In practice, this means it is satisfactory to build a system that can “forcefully” recover from failure (for example, by automatically rebooting or even reinstalling itself) rather than trying to resist it.
6. Graceful degradation: While WDS’s may benefit from a distributed architecture, most worm outbreaks have the effect of overloading network links. It is therefore necessary for all sensors to be able to operate on their own (for example, reporting only local data). Given this condition, while the global system may be impeded, its individual sensors can still be useful during a worm outbreak.
A Final Word
Billy Goat has been designed to be scalable, to operate gracefully in a large distributed environment, and to provide extremely accurate detection of worm-infected machines. This paper describes a number of interesting or useful techniques and components identified during the process, of developing “Billy Goat”.
TABLE OF CONTENTS
1. INTRODUCTION 07
1.1 Typical Worm Spreading Logic……………………………..…... 08
2. BILLY GOAT ARCHITECTURE 09
2.1 High Level Characteristics…………………...………………….. 09
2.2 Basic Architecture and Implementation…………………………10
3. ENGINEERING DECISIONS AND IMPLEMENTATIONS 13
3.1 Database Tables…………..………………………………………..13
3.2 Feigning Servers………...…………………………………………14
3.3 Address Virtualization…………...………………………………..15
4. BILLY GOAT DEPLOYMENT 16
4.1 Modes of Deployment……………………………………………...16
4.2 Data Centralization Mechanism…………………………………..19
5. DATA ANALYSIS 20
5.1 Alarm Redistribution…………………………………………..….21
6. BILLY GOAT-EFFECTS AND EXTENSIONS 22
6.1 Focus on attacker-centric monitoring …………………….……..22
6.2 Environmental Effects…………………………...………………..23
6.3 Pattern Identification……………………………………………...24
7. FUTURE WORKS 25
CONCLUSION 27
FREQUENTLY ASKED QUESTIONS 28
REFERENCES 29
CHAPTER 1:
INTRODUCTION
Recent years have brought a continued increase in both the importance of security in networked systems and the difficulty of securing them. The Internet has continued to expand, its connections have become nearly pervasive, and its protocols and services have grown more complex. Beyond the basic need for integrity, confidentiality and privacy, security has become essential toward providing reliability, safety, and freedom from liability. One of the greatest threats to security has come from automatic self-propagating attacks. These attacks include both viruses and worms. While the presence of these attacks is by no means new, the damage that they are able to inflict and the speed with which they are able to propagate has become paramount. Further increases in connectivity and complexity only threaten to increase their virulence.
A computer worm is a self-replicating computer program, similar to a computer virus. A virus attaches itself too, and becomes part of, another executable program; however, a worm is self-contained and does not need to be part of another program to propagate itself. They are often designed to exploit the file transmission capabilities found on many computers. The main difference between a computer virus and a worm is that a virus can not propagate by itself whereas worms can. A worm uses a network to send copies of it to other systems and does so without any intervention. In general, worms harm the network and consume bandwidth, whereas viruses infect or corrupt files on a targeted computer. Viruses generally do not affect network performance, as their malicious activities are mostly confined within the target computer itself.
In addition to replication, a worm may be designed to do any number of things, such as delete files on a host system or send documents via e-mail. More recent worms may be multi-headed and carry other executables as a payload. However, even in the absence of such a payload, a worm can wreak havoc just with the network traffic generated by its reproduction. Mydoom, for example, caused a noticeable worldwide Internet slowdown at the peak of its spread. A common payload is for a worm to install a backdoor in the infected computer, as was done by Sobig and Mydoom. These zombie computers are used by spam senders for sending junk email or to cloak their website's address. Spammers are thought to pay for the creation of such worms, and worm writers have been caught selling lists of IP addresses of infected machines. Others try to blackmail companies with threatened DoS attacks. The backdoors can also be exploited by other worms, such as Doomjuice, which spreads using the backdoor opened by Mydoom.
1.1 Typical worm spreading logic:
Most worms use random IP address generation to spread to different computers. The worm sends its code as an HTTP request to the target computer and then, depending on the specific worm, exploits a known vulnerability there. For example, the CodeRed worm sends an HTTP request that exploits a buffer-overflow vulnerability, which allows the worm to run on that computer. The malicious code is not saved as a file, but is inserted into and then run directly from memory. There is no single intrusion strategy common to all worms; one typical strategy, used by the W32-Blaster worm, is given below.
It generates an IP address and attempts to infect the computer that has that address. The IP address is generated according to the following algorithm (a sketch of this logic is given at the end of this section):
· For 40% of the time, the generated IP address is of the form A.B.C.0, where A and B are equal to the first two parts of the infected computer's IP address. Once the IP address is calculated, the worm will attempt to find and exploit a computer with the IP address A.B.C.0. The worm will then increment the 0 part of the IP address by 1, attempting to find and exploit other computers based on the new IP address, until it reaches 254.
· With a probability of 60%, the generated IP address is completely random.
· To avoid looping back to infect the source computer, the worm will not make HTTP requests to the IP addresses 127.*.*.*.
· Some fixed characteristics of the TCP and IP headers are:
1. IP identification = 256
2. Time to Live = 128
3. Source IP address = a.b.x.y, where a.b are taken from the host IP and x.y are random. In some cases, a.b is also random.
4. Destination IP address = DNS resolution of "windowsupdate.com"
5. TCP Source port is between 1000 and 1999
6. TCP Destination port = 80
7. TCP Sequence number always has the two low bytes set to 0; the 2 high bytes are random.
8. TCP Window size = 16384
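The following Java sketch illustrates the 40%/60% target-selection logic described above. It is only an illustrative reconstruction of the described behavior; the class, the method names, and the treatment of the third octet are assumptions, not actual worm or Billy Goat code.

import java.util.Random;

/**
 * Sketch of the W32-Blaster-style target-address generation described above.
 * Illustrative reconstruction only; names and details are assumptions.
 */
public class TargetAddressGenerator {
    private final Random random = new Random();
    private final int localA, localB, localC; // first three octets of the infected host

    public TargetAddressGenerator(int localA, int localB, int localC) {
        this.localA = localA;
        this.localB = localB;
        this.localC = localC;
    }

    /** Returns the base address A.B.C.0 of the next range to scan. */
    public String nextBaseAddress() {
        if (random.nextDouble() < 0.4) {
            // 40% of the time: stay close to the local network (A and B taken
            // from the host; the third octet is assumed local here for simplicity).
            return localA + "." + localB + "." + localC + ".0";
        }
        // 60% of the time: a completely random address, avoiding loopback 127.*.*.*
        int a;
        do {
            a = 1 + random.nextInt(254);
        } while (a == 127);
        int b = random.nextInt(256);
        int c = random.nextInt(256);
        return a + "." + b + "." + c + ".0";
    }

    /** The worm then sweeps the last octet from 1 to 254 within the chosen range. */
    public static void main(String[] args) {
        TargetAddressGenerator gen = new TargetAddressGenerator(9, 2, 14);
        String base = gen.nextBaseAddress();
        String prefix = base.substring(0, base.lastIndexOf('.') + 1);
        for (int host = 1; host <= 254; host++) {
            System.out.println("probe " + prefix + host); // an infection attempt would go here
        }
    }
}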
CHAPTER 2:
BILLY GOAT ARCHITECTURE:-
Billy Goat is a sensor-based system built with the specific purpose of detecting and identifying network-service worms. Because of this specific focus, Billy Goat can take advantage of worm-specific properties that would hinder general-purpose intrusion-detection systems, allowing more efficient and accurate detection.
2.1. High Level Characteristics –
The requirements on a worm-detection system (WDS) are different from those of a general-purpose network-based intrusion-detection system (IDS). While the latter needs to detect a wide and unpredictable variety of attacks, the former can focus on the specific propagation and attack strategies used by worms. Moreover, the main purpose of a WDS is to detect infected machines in the network, whereas a general-purpose IDS must detect the attacks themselves.
These requirements influence the desired characteristics of such a system, particularly in the following aspects:
1. Accuracy: The goal of a WDS is the identification of worm-infected machines. To offer real utility, it must be able to perform this task with a high level of accuracy, so that its reports can be trusted by system and network administrators as the basis for contention and remediation action. A WDS can use highly-specialized techniques to detect worm-infected machines. This enables increased accuracy, at the expense of the ability to detect a wider range of attacks.
2. Speed: Given the explosive nature of modern worms, a WDS should be able to detect an infected machine as quickly as possible, to provide its users a chance to contain the damage, or even to function as the basis for an automated response system.
3. Manageability: New worms and worm variants appear almost every day, so the components of a WDS need to be updated regularly. At a systems level, this process must be automated as much as possible, to be able to deal with the monitoring of very large networks. At middleware and architecture levels, this means the base infrastructure must offer sufficient flexibility to enable the rapid creation of new detection capabilities.
4. Interoperability: Many organizations suffer from the proliferation of security tools, each with their own control, monitoring and reporting mechanisms. Furthermore, many places already have some form of monitoring console, virus-response policies and procedures, etc. A WDS should integrate as much as possible with the existing tools and processes.
5. Resilience: A WDS must operate under extreme conditions in terms of network and processing load, particularly during worm outbreaks. These conditions are more likely to induce failures than other environments. However, a WDS has a specific advantage not enjoyed by most other IDSs: because of the repetitive nature of worm activity, the WDS can afford to lose some data without reducing its utility. In practice, this means it is satisfactory to build a system that can "forcefully" recover from failure (for example, by automatically rebooting or even reinstalling itself) rather than trying to resist it.
6. Graceful degradation: While WDS’s may benefit from a distributed architecture, most worm outbreaks have the effect of overloading network links. It is therefore necessary for all sensors to be able to operate on their own (for example, reporting only local data). Given this condition, while the global system may be impeded, its individual sensors can still be useful during a worm outbreak.
2.2. Basic Architecture and Implementation –
Billy Goat is a worm detection system that possesses the characteristics described earlier. Billy Goat is designed to take advantage of the propagation strategies of worms. As explained earlier, most worms try to connect to IP addresses selected at random or scan entire ranges of addresses. By doing so, they can find most of the machines in a network, but they also try to connect to a large number of unused addresses. Billy Goat functions by responding to requests sent to these unused addresses, thereby feigning the existence of a large number of machines and services.
This approach has three immediate consequences:
1. The fact that the addresses are otherwise unused and not advertised means that all traffic destined to these addresses is a priori suspicious.
2. Active feigning of services, rather than the mere recording of connection attempts, enables a greatly improved understanding of the nature of the connection. Billy Goat is a first-person participant in the protocols, rather than a third-person eavesdropper.
3. The large number of addresses used gives Billy Goat an extensive view of the network. This enables on-box correlation of events from a seemingly diverse collection of sensors.
Instead of directly "guarding the valuables," as traditional intrusion-detection deployments do, Billy Goat guards vast ranges of "nothingness" in order to understand what goes there and why. This is similar to a honeypot. This approach, permitted by the clear focus on detecting worms and coupled with the analysis performed on the data, frees Billy Goat from the high rate of false positives produced by most general-purpose IDSs. For the same reason, it is not a replacement for other IDSs but a complement to them. In particular, Billy Goat will not even see the traffic directed to existing machines and services, so it is unable to detect attacks against them.
Fig 1: Billy Goat internal architecture.
As shown in Fig 1, at the core of Billy Goat are a virtualization mechanism and a data repository. The virtualization mechanism allows individual services to be written using standard programming models and interfaces while responding to multiple IP addresses transparently.
This reduces the difficulty of creating new feigning services and of integrating existing ones. The data repository provides storage for IP header information and for details of the application-level information generated by the feigning servers. The feigned services offered by Billy Goat include those commonly exploited by worms. Each endeavors to offer sufficient functionality to determine accurately the nature of an attack. All the sensors except for SMB (Windows file sharing) are implemented using a specialized framework written in Java that makes it easy to create new services, and which is carefully audited for security to reduce the possibility that a Billy Goat machine could be compromised or affected by a security problem.
A particular advantage of Billy Goat is that it allows us to implement feigned services "preemptively." For example, when a new vulnerability is announced, it is often possible to predict, based on some of its characteristics, that a worm will be written to exploit it. In these cases, we can create a new feigning server for the protocol affected by the vulnerability and deploy it on all the existing Billy Goats. If and when the new worm appears, it will be immediately visible to the Billy Goat application-layer sensors. This capability is particularly important in recent times, when the window of time between the announcement of a vulnerability and the appearance of code that exploits it has shrunk dramatically, to an average of 5.8 days (as observed in the first half of 2004).
To satisfy the requirement of continued function in times of heavy worm activity, when the performance of the network may be dramatically diminished, WDSs require distributed architectures. Each Billy Goat offers the ability to analyze and report events detected locally, thus providing graceful degradation of the detection service. At the same time, the data of all Billy Goats on an intranet is centralized to assemble a more complete view, as shown in Figure 2.
Fig 2: Distributed Billy Goat architecture.
The nature of the monitoring allows detection of infected machines even on network segments that do not have a Billy Goat sensor installed. Billy Goat includes extensive self-monitoring and recovery mechanisms. When a problem cannot be solved satisfactorily, the machine reboots itself. This provides increased resilience by enabling individual machines to automatically recover from failure. To support the distributed architecture, Billy Goat includes an automatic update mechanism. This ensures that each sensor is always current with respect to both signatures and software versions, and makes it easier to manage a large distributed infrastructure.
CHAPTER 3:
ENGINEERING DECISIONS AND IMPLEMENTATIONS:-
Billy Goat has been implemented with a view to using standard tools, formats, and APIs. The implementation focuses on providing simple, well-documented interfaces by which Billy Goat may be integrated with existing tools and processes. Many open-source components have been used throughout Billy Goat, and its construction would not have been possible without them.
3.1. Database Tables:-
The data collected by Billy Goat at the IP layer (via iptables) and at the application layer is split into four database tables to accommodate the repetitive, and often verbose, nature of the attacks used by worms. These tables have the form shown in Figure 3 below, where solid lines indicate external keys (references across tables) and dashed lines indicate temporal proximity (the times are generated in different layers and hence may have slightly different values).
Fig 3: Database table structure used.
Time is recorded by TIME, an SQL timestamp, together with TIME OFFSET, which indicates the nth event within a given TIME. The pair thus forms a unique timestamp for each event.
REPORTER is the IP address of the Billy Goat sensor that observed the event. The presence of this field is important in a distributed system.
SRC, PROTOCOL, DST, SPT, DPT and FLAGS apply to the IP layer, mapping to the three covered protocols (TCP, UDP, and ICMP).
FLAGS is empty for UDP and ICMP, and the type of an ICMP message is stored in both the SPT and DPT fields. Storing the three types of traffic in a single database table makes it possible to extract all the information with a single SQL query.
The full descriptions of the application-layer activity, REQUEST and HOST, are expressed in XML. The hierarchical and extensible nature of XML allows the descriptions of the numerous application sensors to be encoded meaningfully. For example, a simple UDP listener has a much simpler data model than an extremely complicated protocol like SMB; XML allows simple descriptions to stay simple while still permitting complex ones. Cryptographic checksums (MD5) of the REQUEST and HOST values are used as the corresponding indices, rather than native database references (external keys), as illustrated in the sketch at the end of this subsection. This offers the significant advantage that references depend only on the representation of the database record to which they refer, rather than on the order in which records were inserted (as would be the case with traditional external keys). This technique greatly eases data centralization in a distributed system.
SEQID is an automatically incremented value used to keep track of which events have been processed by different components. Finally, SENSOR is a short string identifying the feigning server that produced the record (some of the existing values are http, smb and dcom).
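The following Java sketch illustrates the idea of content-derived keys: the MD5 checksum of an XML description is used as the reference value, so that identical requests recorded by different sensors map to the same key without coordinating auto-increment counters. The class name and the sample XML are illustrative assumptions, not the actual Billy Goat schema.

import java.math.BigInteger;
import java.nio.charset.StandardCharsets;
import java.security.MessageDigest;
import java.security.NoSuchAlgorithmException;

/**
 * Sketch of using a cryptographic checksum of a record's content as its
 * cross-table reference, instead of an auto-incremented key.
 */
public class ContentKey {

    /** MD5 of the XML description, rendered as a 32-character hex string. */
    public static String md5Hex(String xmlDescription) {
        try {
            MessageDigest md = MessageDigest.getInstance("MD5");
            byte[] digest = md.digest(xmlDescription.getBytes(StandardCharsets.UTF_8));
            return String.format("%032x", new BigInteger(1, digest));
        } catch (NoSuchAlgorithmException e) {
            throw new IllegalStateException("MD5 not available", e);
        }
    }

    public static void main(String[] args) {
        // Hypothetical XML description of an application-level request.
        String request = "<request sensor=\"http\"><method>GET</method><uri>/default.ida</uri></request>";
        String key = md5Hex(request);
        // The same description always yields the same key on every sensor,
        // so centralized records from different Billy Goats reference the
        // same REQUEST row regardless of insertion order.
        System.out.println("REQUEST key = " + key);
    }
}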
3.2. Feigning servers –
The key observation mechanism of Billy Goat is a collection of feigning servers, each covering an infection vector used by worms to propagate. Each server is equipped with sufficient logic to accurately diagnose the nature of a connection. Real servers may have vulnerabilities at different layers, and this often requires writing the feigning servers so that they can detect attacks at different layers; for example, a feigning server may need to detect both low-level buffer-overflow exploits and application-layer attacks. In general, the servers follow the corresponding protocol up to a point that allows accurate identification of the activity, but no further.
For example:
· The HTTP feigning server accepts and records a single HTTP request, and always responds with a "page not found" error before closing the connection (a sketch of such a server is given at the end of this subsection).
· The MS/RPC feigning server accepts and records the first 3000 bytes (configurable) transmitted by the client, before closing the connection. This initial payload generally contains either the full code of the worm or an exploit particular to the worm.
· The SMB/Lure server is a special configuration of Samba that appears to be a badly configured machine (open shares, weak passwords, etc.). Because it is a full implementation of the protocol, SMB/Lure can often capture the full code of the worm, as they upload themselves to Billy Goat.
The majority of the servers are written in Java and produce XML descriptions of individual interactions; the specific syntax of each record (i.e., its tree and object structure) is left to each individual server. The fact that JOX assembles the components at runtime makes the creation, debugging, testing, and deployment of new services quite easy.
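As an illustration of the HTTP feigning behavior mentioned above, the following minimal Java sketch accepts a single request, records it, and always answers with a "page not found" error before closing the connection. It is a simplified stand-in written for this report, not the actual Billy Goat feigning framework.

import java.io.BufferedReader;
import java.io.InputStreamReader;
import java.io.OutputStream;
import java.net.ServerSocket;
import java.net.Socket;
import java.nio.charset.StandardCharsets;

/** Minimal HTTP feigning server sketch: record one request, reply 404, close. */
public class HttpFeigningServer {
    public static void main(String[] args) throws Exception {
        try (ServerSocket server = new ServerSocket(8080)) {
            while (true) {
                try (Socket client = server.accept()) {
                    BufferedReader in = new BufferedReader(
                            new InputStreamReader(client.getInputStream(), StandardCharsets.US_ASCII));
                    // Read and record the request line and headers only.
                    StringBuilder request = new StringBuilder();
                    String line;
                    while ((line = in.readLine()) != null && !line.isEmpty()) {
                        request.append(line).append('\n');
                    }
                    System.out.println("[" + client.getInetAddress() + "] " + request);

                    // Always reply with a "page not found" error, then close.
                    String body = "not found";
                    String response = "HTTP/1.0 404 Not Found\r\n"
                            + "Content-Type: text/plain\r\n"
                            + "Content-Length: " + body.length() + "\r\n"
                            + "Connection: close\r\n\r\n" + body;
                    OutputStream out = client.getOutputStream();
                    out.write(response.getBytes(StandardCharsets.US_ASCII));
                    out.flush();
                }
            }
        }
    }
}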
3.3. Address Virtualization –
Address virtualization transparently maps the large ranges of IP addresses covered by a Billy Goat to the single "real" address used by the machine. This virtualization allows the Billy Goat feigning servers to be ordinary server software, written with no special consideration for the large number of IP addresses that a Billy Goat machine monitors. Address virtualization is handled by the operating system, in particular by the iptables mechanism in the 2.6 release of the Linux kernel.
One of the mechanisms built into iptables is Network Address Translation (NAT). Ordinarily, NAT is used to allow several machines inside a network to share a single external address. For Billy Goat, we need the reverse: to allow a single machine to respond to a large number of external addresses. This is called "reverse NAT" (as shown in Fig 4).
Fig 4: Traditional Network Address Translation (NAT) and reverse NAT.
CHAPTER 4:
BILLY GOAT DEPLOYMENT
During the design, implementation and deployment of Billy Goat, a number of properties, constructions and conventions had to be considered. They are as follows:
· The importance of homogeneity – To operate a large network of distributed sensors while keeping it manageable, it is imperative that all the sensors be as homogeneous as possible: no special cases, no distinct configurations. This enables automatic updates of all the components and configurations. Special adjustments to some of the sensors make them lag behind in terms of updates and maintenance, either because the updates fail or because updates have been disabled to prevent them from overwriting the specialized configuration.
· Centralized configuration – Even in the presence of homogeneity, it is necessary to have some configuration that is different for each sensor. This includes its network and deployment mode configuration information and local addresses that should be ignored or trusted for management purposes.
A related issue is maintaining the configuration for all the distributed sensors in a central place. This offers two advantages:
· If a sensor is completely destroyed (for example, by a catastrophic disk failure), it is trivial to restore its configuration to reinstall the sensor on a different machine.
· It becomes possible to centrally control configuration of machines, similar to network configuration schemes like BOOTP and DHCP. Based on a unique identifier, the central server can provide each sensor with its configuration information, to automate even the initial installation.
This doubly-centralized configuration offers the best of both worlds: it keeps all the sensors as homogeneous as possible, while offering the possibility of having per-sensor configuration in an automated and manageable fashion.
4.1. Modes of Deployment
The fundamental premise of Billy Goat is responding to traffic directed to unused IP addresses, as described in Section 2. However, different deployment modes can be used and combined to direct such traffic to Billy Goat.
4.1.1) Specific network ranges with static routes –
This is the “standard” Billy Goat deployment mode. A specific set of IP address ranges (which should not be in use) are designated for Billy Goat, and the appropriate routers are reconfigured to send traffic sent to those ranges to a Billy Goat sensor. The amount of traffic seen by the sensor depends directly on the size of the network range assigned. Addresses within non-routed address ranges that are not used locally may also be routed to the Billy Goat.
Advantages:
A known set of IP addresses is assigned to Billy Goat, which helps in controlling the amount of traffic it has to process. This mode of operation is well understood, and only simple configuration changes need to be made to the routers.
Disadvantages:
Large-enough groups of network addresses must be available, and assigned by the network administrator. If the assigned range is too small, the functionality of Billy Goat is limited because it cannot observe much of the network traffic.
4.1.2) ARP spoofing –
In a local network, the machine that has a particular IP address is found using ARP (the Address Resolution Protocol). Using this protocol, machines and routers in the local network that need to send traffic to an address X broadcast the question "who has address X?" and wait for a response. If no response is received within a certain period of time, the address is considered nonexistent. This mode of operation allows a malicious host to "hijack" IP addresses in an attack known as ARP spoofing. Incidentally, the same technique can be used by a Billy Goat device to automatically grab unused IP addresses, using the following method (a sketch of this logic appears at the end of this subsection):
· Observe ARP requests on the local network.
· If a response is not observed within a short period of time, the Billy Goat sensor sends a response, effectively assigning the requested address to itself.
· Future traffic (for a certain period of time) to the spoofed address will be sent by the local router and machines to the Billy Goat sensor.
Advantages:
No previous assignment of IP addresses is needed, so the deployment effort can be very low (simply connect the Billy Goat machine to the network).
Disadvantages:
ARP spoofing is potentially very dangerous and can cause trouble if Billy Goat attempts to spoof the IP address of an existing device. Additionally, ARP spoofing only works in a local-area network, so in this mode a Billy Goat sensor can only spoof addresses in the LAN to which it is connected. The implementation also needs to take into account the potential appearance of new devices: Billy Goat must stop spoofing an address immediately when another device with that address appears on the network.
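The following Java sketch captures only the timing and decision logic of this method; actual ARP capture and injection would require a packet-capture library. The onArpRequest/onHostSeen/respond hooks and the claim delay are hypothetical placeholders, not the Billy Goat implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/** Sketch of the ARP-spoofing decision logic described above (timing logic only). */
public class ArpSpoofLogic {
    private static final long CLAIM_DELAY_MS = 500;    // wait for a real owner first (assumed value)
    private final Map<String, Long> pendingRequests = new ConcurrentHashMap<>();
    private final Map<String, Boolean> spoofed = new ConcurrentHashMap<>();

    /** Called whenever an ARP "who has X?" request is observed on the LAN. */
    public void onArpRequest(String requestedIp, long nowMillis) {
        pendingRequests.putIfAbsent(requestedIp, nowMillis);
    }

    /** Called whenever any host answers an ARP request or otherwise appears. */
    public void onHostSeen(String ip) {
        pendingRequests.remove(ip);
        // A real device owns this address: stop spoofing it immediately.
        spoofed.remove(ip);
    }

    /** Called periodically: claim addresses that nobody has answered for. */
    public void tick(long nowMillis) {
        pendingRequests.forEach((ip, firstSeen) -> {
            if (nowMillis - firstSeen > CLAIM_DELAY_MS) {
                pendingRequests.remove(ip);
                spoofed.put(ip, Boolean.TRUE);
                respond(ip); // hypothetical hook that would send the spoofed ARP reply
            }
        });
    }

    private void respond(String ip) {
        System.out.println("claiming unused address " + ip + " for the Billy Goat sensor");
    }
}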
4.1.3) Billy Goat as default route –
Instead of having specific network ranges assigned to the Billy Goat sensor, the router can be configured to forward everything to Billy Goat, except for the ranges that are being used. This essentially makes Billy Goat the default route for the network, and traffic to all the unused network segments will be sent to it. This scheme can be implemented statically (when the router has a static routing table, and the route to the Billy Goat sensor is added as the default route) or dynamically in conjunction with a routing protocol like BGP. In this second mode, Billy Goat will automatically receive all the traffic that is not covered by the current dynamic routing tables.
Advantages:
Large network coverage, and ease of configuration (the Billy Goat sensor can be configured to “spoof everything,” and it will respond to any traffic it receives).
Disadvantages:
It is potentially dangerous, particularly in conjunction with dynamic routing. In a large network, it is common that certain network segments go offline for short periods of time. If Billy Goat automatically starts responding for them, it may disturb services or automated monitoring systems in the network.
4.1.4) ICMP-based Billy Goat –
One of the most recent developments is a mode of deployment in which Billy Goat operates in conjunction with a router to provide automatic utilization of all the unused addresses outside the local network. This is how it works (see Figure 5):
· When an infected machine in the local network tries to contact a remote non-existing address, an ICMP “network unreachable” or “host unreachable” message will be sent back.
· The ICMP message is intercepted by the router local to the infected machine, which sets up a temporary route for that destination address, with the Billy Goat sensor as its next hop.
· When the infected machine, after not receiving a response, retransmits its packet, it will be sent to the Billy Goat sensor, which will respond to it.
Fig 5: ICMP-based Billy Goat.
Advantages:
This mode of operation allows for automatic spoofing of every unused address outside the LAN. This provides Billy Goat with a truly expansive view and allows it to quickly identify local infected machines.
Disadvantages:
Router support is needed to implement this scheme. The implementation also needs to be careful about removing the routes for hosts once they become active.
Shunning mode –
One problem faced by a Billy Goat sensor, particularly when it is spoofing a very large network range, is that it can suffer from effects similar to those caused by a distributed denial-of-service attack, simply from the sheer amount of traffic that it needs to monitor and respond to. To limit this problem, the following technique can be used in combination with any other deployment mode: once an IP address is identified as infected, it is added to a "shun list" that causes its traffic to be ignored for a certain period of time. This reduces the overall load on the Billy Goat sensor, particularly in times of heavy worm activity, while still allowing it a complete view of the network.
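A minimal Java sketch of such a shun list is shown below; the class name and the handling of the shun duration are illustrative assumptions rather than the actual implementation.

import java.util.Map;
import java.util.concurrent.ConcurrentHashMap;

/**
 * Sketch of a time-bounded "shun list": once a source address has been
 * identified as infected, its traffic is ignored for a fixed period to
 * reduce load during heavy worm activity.
 */
public class ShunList {
    private final long shunMillis;
    private final Map<String, Long> shunnedUntil = new ConcurrentHashMap<>();

    public ShunList(long shunMillis) {
        this.shunMillis = shunMillis;
    }

    /** Called when the analysis identifies an address as infected. */
    public void shun(String sourceIp) {
        shunnedUntil.put(sourceIp, System.currentTimeMillis() + shunMillis);
    }

    /** Called for each incoming connection; true means "drop without processing". */
    public boolean isShunned(String sourceIp) {
        Long until = shunnedUntil.get(sourceIp);
        if (until == null) {
            return false;
        }
        if (System.currentTimeMillis() > until) {
            shunnedUntil.remove(sourceIp); // shun period expired: observe this host again
            return false;
        }
        return true;
    }
}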
4.2. Data Centralization Mechanisms –
One of the recurrent problems in intrusion detection is the transfer of data to a central location. This is technically problematic for a number of reasons:
· The common need to transfer data across firewalls. The usual unsatisfactory solution to this problem is to open additional ports in the firewall to allow for the necessary communication channels.
· With a widely distributed system, reliable transfer of data to a central server may be prone to failure, potentially causing data loss or duplication.
One possible tool for data transfer is BEEPLite, an implementation of BEEP, a modular, extensible protocol that allows flexible establishment and multiplexing of communication channels. One of its salient features, which makes it particularly suitable for transferring intrusion-detection data, is that it decouples the concept of connection initiator from that of client. This means that either the server or the client can initiate a connection, which facilitates configuration across firewalls.
BEEP additionally simplifies the addition of encryption, authentication and compression to the data flows. BEEP has gained substantial popularity in recent years; within the intrusion-detection community it is also used by the IETF Intrusion Detection Working Group in the IDXP specification.
CHAPTER 5:
DATA ANALYSIS
The data analysis is an iterative process that attempts to determine the types of activities that have been seen by Billy Goat from different source addresses in the network. This process is most valuable when done in the central server to which all Billy Goat machines send their data, because it allows discovery of global behavior that may not be visible at the individual sensors.
· In the first analysis step, a precise description of the behavior of each source IP address is constructed in a data model, based on all the data gathered during the specified time period. This includes the list of destinations contacted, the protocols and port numbers used, and all the requests sent to the Billy Goat sensors.
· The second step summarizes the data: individual destination addresses are replaced by the number of destinations contacted, and contact counts are replaced by their binary orders of magnitude (a sketch of this summarization is given below, after the analysis steps). This eases later identification of similarities between behavior patterns.
· The third step identifies known worms, attacks, and behaviors; thereby creating a high-level description of the data. This is done using a combination of the following methods:
– Capture of the worm itself (for example, SMB worms that upload themselves to Billy Goat). In this case, the MD5 checksum of its code is used to identify the worm with 100% accuracy.
– Observation of the exploits used by the worm (for example, an HTTP request containing a buffer overflow). Because Billy Goat is a first-person observer, it can accurately collect the full set of exploits used by a worm, and during the analysis phase these sets can be matched against the characteristics of known worms.
– Observation of other behaviors indicative of worm activity (for example, horizontal scanning or account guessing). These are weaker indicators of worm activity in the sense that they do not make it possible to precisely identify the worm.
When precise worm identification is possible (that is, when we can give the worm a name), the findings are labeled as "alarm" and the worm name is given. Clearly suspicious but unidentifiable findings (for example, a large horizontal scan or a single exploit that does not fully identify a worm) are labeled as "warning" and a description is given. All other data is labeled "unknown" and is available via direct query of the database.
Additional analysis steps may be introduced to add location information or other relevant information concerning a host (for example, DNS lookup, asset or vulnerability information).
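The following Java sketch shows one plausible way to reduce raw counts to binary orders of magnitude, as used in the second analysis step above. The exact rounding convention used by Billy Goat is not stated in the source, so the definition here (smallest n such that 2^n covers the count) is an assumption.

/** Sketch of summarizing counts by binary order of magnitude. */
public class BehaviorSummary {

    /** Smallest n such that 2^n >= count (0 for counts of 0 or 1). */
    public static int binaryOrderOfMagnitude(long count) {
        if (count <= 1) {
            return 0;
        }
        return 64 - Long.numberOfLeadingZeros(count - 1);
    }

    public static void main(String[] args) {
        // 700 and 900 destinations both summarize to order 10 (~1024),
        // so two hosts scanning at that scale compare as equal.
        System.out.println(binaryOrderOfMagnitude(700));   // 10
        System.out.println(binaryOrderOfMagnitude(900));   // 10
        System.out.println(binaryOrderOfMagnitude(70000)); // 17
    }
}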
5.1. Alarm Redistribution –
One important feature of Billy Goat is that each sensor detects infected machines throughout the network. By performing a centralized analysis of the data, we can build an even more complete view of the network and detect otherwise undetected phenomena. Consequently, it is possible to detect and diagnose problems in machines at locations without any IDS installed. While this is a nice property of the Billy Goat infrastructure, it only becomes valuable if the results can be disseminated to the appropriate people throughout the world in a timely fashion.
The subscription service for Billy Goat produces alerts based on the centralized Billy Goat data, to provide the most complete coverage. It allows individual systems administrators to self-register to receive alerts pertinent to their own network ranges, thereby ensuring that alerts are delivered to someone who can actually fix the detected problems. Open registration allows the creation of a "living" mapping between networks and owners. Access control is enforced through social mechanisms, by notifying the subscriber's manager on registration. The subscription service also permits flexible configuration based on a user-defined policy. The policy focuses on two aspects of the alarm redistribution (a sketch of such a policy follows the list):
· It can be used to set filters to restrict the alarms to some particular networks of interest, or to select the type of alarms to send.
· It controls the rate at which alarms should be sent, thereby preventing accidental denial of service effects against the subscribers. For example, systems administrators may choose to be notified immediately up to a specifiable limit whereas people concerned with global metrics may opt to receive daily summarized reports.
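The following Java sketch combines the two policy aspects above into a single subscriber policy: a network filter and a per-day limit on immediate alarms, with anything beyond the limit deferred to the daily summary. The types, field names, and thresholds are illustrative assumptions, not the actual subscription-service configuration.

import java.util.List;

/** Sketch of a subscriber policy: network filter plus alarm rate limit. */
public class SubscriberPolicy {
    private final List<String> networkPrefixes;  // e.g. "9.2." for a range of interest (assumed format)
    private final int maxImmediateAlarmsPerDay;
    private int sentToday = 0;

    public SubscriberPolicy(List<String> networkPrefixes, int maxImmediateAlarmsPerDay) {
        this.networkPrefixes = networkPrefixes;
        this.maxImmediateAlarmsPerDay = maxImmediateAlarmsPerDay;
    }

    /** True if the alarm should be delivered to this subscriber immediately. */
    public boolean deliverNow(String infectedIp) {
        boolean inRange = networkPrefixes.stream().anyMatch(infectedIp::startsWith);
        if (!inRange || sentToday >= maxImmediateAlarmsPerDay) {
            return false;            // either filtered out, or deferred to the daily summary
        }
        sentToday++;
        return true;
    }

    /** Called once a day when the summarized report is generated. */
    public void resetDailyCounter() {
        sentToday = 0;
    }
}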
CHAPTER 6:
BILLY GOAT – EFFECTS AND EXTENSIONS
As its architecture suggests, Billy Goat is in some respects similar to conventional intrusion-detection systems. The construction and deployment of Billy Goat provide some insights into fundamental issues, behaviors and properties of worm-detection systems and of large distributed systems. This chapter explores its similarities to, and differences from, other protection mechanisms.
6.1. Focus on attacker-centric monitoring –
The focus of traditional network-based IDS’s (NIDS) is on the ability to detect attacks against valuable systems, such as critical servers. This “attack-centric” approach has the following implications:
· Although identifying and diagnosing the attacker may be possible, the priority is detecting the attacks.
· NIDS only need to observe traffic going to the machines they protect, and for performance reasons, they are often prevented from seeing anything else. This limits the view of the IDS.
By contrast, Billy Goat has an "attacker-centric" approach: we are more interested in identifying and diagnosing attackers (infected machines) than in identifying which victims have been attacked. This approach also has implications for the deployment of the sensors: they should be placed so that they can receive as much traffic as possible. A target-centric sensor will have detailed information about the local target machines, but limited information about the attackers. An attacker-centric sensor may have limited information about the targets, but will have detailed information about the attackers. For example, a Billy Goat sensor can detect infected machines anywhere in the network, as long as they try to connect to one of the addresses spoofed by that sensor.
In a distributed Billy Goat deployment, each sensor has a global view of the network, limited only by traffic filtering between different network segments. By centralizing this information as described earlier, it is possible to have an unimpeded, expansive view of infected machines anywhere in the network. Global aggregation of data allows detection of behaviors and patterns that may not be noticeable at the local level, like very stealthy or slow scans. For example, we have seen some worms that scan the network by contacting a single address per class-C network. This would not be detected by a Billy Goat sensor spoofing a single class-C network, but is easily discernible at the global level.
6.2. Environmental Effects –
In addition to detecting worms and curbing their spread, Billy Goat may also have favorable or unfavorable side effects. One of the key areas is the effect of Billy Goat on other systems in the networking environment.
6.2.1) Network discovery –
The first aspect is the interaction of Billy Goat with devices and software that scan the network (for example, asset, vulnerability or network-discovery tools). Normally, Billy Goat responds to these scans for each one of the IP addresses it is spoofing, producing wildly inaccurate results for the scanner. This can be improved by adding a mechanism that makes Billy Goat respond "truthfully" (responding only to traffic directed to its real IP address, and only for the very limited real services it offers) to a fixed set of IP addresses. This makes Billy Goat appear like a regular machine to authorized scanning devices.
Fig 6: Possible NIDS placements.
6.2.2) Network intrusion detection systems –
Another consideration is the sharp increase in the number of alarms coming from already deployed NIDSs that observe the stream of traffic flowing to Billy Goat. This stems from the fact that Billy Goat, by allowing illegitimate connections to complete, increases the number of real, albeit harmless, attacks seen on the network. The size of the increase depends on the relative sizes of the networks, the saturation level of the network connections, the root cause of the alarm, and the signatures and placement of the NIDS.
This effect was first noticed when the Nimda worm was active in a well-connected network whose Billy Goat address space was approximately 100 times larger than the number of actual hosts, and with NIDS at position 1 (see Figure 6). The result was roughly a corresponding 100-fold increase in the number of Nimda-related alarms of attacks from the Intranet against the LAN. A NIDS at position 2 would see an increase in the number of Nimda-related alarms of attacks originating from machines on the LAN (those against the Intranet and those against Billy Goat).
In the case of Nimda, and many other modern worms, the effect on NIDS 2 is augmented by the optimized propagation strategy that probabilistically favors machines with similar addresses. A NIDS at position 3 would see attacks from both the LAN and the Intranet and would have greatly increased fidelity stemming from the fact that it does not see any legitimate traffic.
6.2.3) Failure Modes –
The final area concerns the default-route and router/ICMP modes of deployment. The problem occurs when a machine or network goes down and Billy Goat automatically starts responding for it. In this case, liveness-checking mechanisms such as ICMP echo (ping) yield deceptive results. This is especially problematic when these liveness checks are connected to other systems. This failure mode, induced by a relatively passive system, can be considered an interesting warning for automatic intrusion response.
6.3. Pattern identification –
The data gathered by Billy Goat can be collated to create clusters of hosts corresponding to worms active in the network. There are interesting applications of classifying suspicious hosts by behavior type. The emergence of new clusters can be indicative of new worm outbreaks (and also of network misconfigurations and malfunctions), and can be used as an early-warning system. The clusters contain detailed information about worm behavior, including infection vectors, scanning algorithms, and exploits used.
To explore the use of traditional data-mining techniques, tools such as CLARAty can be used. In order to apply classical data-mining algorithms to the behavior descriptions, the data model is simplified by extracting essential features (a sketch of one such feature follows the list). This is done by –
· Extracting descriptions of the ports targeted, along with descriptions of the application-level activity (including identification of exploits when possible).
· Adding features computed from the available data, including the order of magnitude of the number of hosts contacted, the efficiency of the scanning algorithm (whether infection attempts are directed at hosts already contacted or not), and scan density/intensity (defined as the average number of contacts per destination class-C network).
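As an example of the derived features listed above, the following Java sketch computes the scan density of a source host, defined as the average number of contacts per destination class-C network. The input format (a list of dotted-quad destination addresses for one source host) is an assumption made for illustration.

import java.util.HashMap;
import java.util.List;
import java.util.Map;

/** Sketch of the scan-density feature: contacts per destination class-C network. */
public class ScanFeatures {

    public static double scanDensity(List<String> destinations) {
        if (destinations.isEmpty()) {
            return 0.0;
        }
        Map<String, Integer> contactsPerClassC = new HashMap<>();
        for (String ip : destinations) {
            // The class-C network is the first three octets of the address.
            String classC = ip.substring(0, ip.lastIndexOf('.'));
            contactsPerClassC.merge(classC, 1, Integer::sum);
        }
        return (double) destinations.size() / contactsPerClassC.size();
    }

    public static void main(String[] args) {
        // A worm contacting one address per class-C network has density 1.0,
        // while a sequential scanner of a single subnet has a high density.
        System.out.println(scanDensity(List.of("9.2.1.5", "9.2.2.5", "9.2.3.5")));            // 1.0
        System.out.println(scanDensity(List.of("9.2.1.1", "9.2.1.2", "9.2.1.3", "9.2.1.4"))); // 4.0
    }
}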
CHAPTER 7:
FUTURE WORK
The following areas are prospects for future Billy Goat development.
1. Billy Goat currently provides immediate notification to network administrators about infected machines in their domains. It also provides summarized information about infected machines and suspicious behavior, useful for higher-level management. However, all the information is presently provided in text and XML format, which requires some level of expertise to interpret. To increase the usefulness of Billy Goat data, its reporting and visualization capabilities could be improved. The following improvements could be made:
– Graphic visualization of low-level activity (IP traffic and alarms generated by the feigning servers) both globally and in each Billy Goat sensor. A visual representation of levels of traffic, for example, is very useful in quickly detecting suspicious behavior.
– Graphic visualization of summarized activity. For example, charts showing trends and statistics in number of infected machines and per-region aggregated infection data.
– High-level reports of numbers of infected machines, emerging behaviors, and most common types of worm infections.
2. It is evident that analysis of the data using clustering can produce interesting results. The use of data-mining techniques can be studied further, and based on these results the possibility of automatically generating signatures and metasignatures (in which multiple signatures are taken as evidence of another, possibly distributed, attack) can be explored.
3. Another interesting development based on the results of anomaly analysis of Billy Goat data is the automatic creation and deployment of feigning servers. For example, if the IP traffic data shows a marked increase in connections to a certain port where no server currently exists, a generic listener for that port could be automatically instantiated and distributed to all the Billy Goat sensors, to capture the payload being sent to that port. This could greatly reduce the reaction time in the face of a new worm outbreak.
4. To ensure accurate identification, the ideal would be to capture the actual worm code. Currently, this is done only by some of the feigning servers (e.g. SMB/Lure, MS/RPC and MS/SQL). This capability could also be incorporated in other servers. For this to happen, the feigning server needs to provide the appropriate responses so that the worm believes its exploit has succeeded and proceeds to upload its code. In the general case this is impossible to do in a feigning server (as it would need to emulate the full behavior and all the vulnerabilities of the service being attacked); a possible alternative is to hand unknown traffic to real services running inside virtual machines. In this scenario, when a Billy Goat sensor receives a connection on a port for which no feigning server exists, it would pass the connection to a virtual machine, and observe both the network traffic and the disk of the virtual machine after the connection has ended. This would provide valuable information about new worms, making it possible to capture any type of new worm, and would make it easier to construct new feigning servers.
5. Finally, having a sensor that produces no false positives for a certain class of attacks might make possible the long-standing dream of intrusion detection: an automated response system. Most such systems to date have been marred by false positives, which often result in the response system causing more damage than good. The aim is to build a system that accurately and efficiently isolates misbehaving machines while allowing critical technical and business processes to continue unimpeded, with an extreme focus on potential failure modes and how they might be eliminated or mitigated.
CONCLUSION
Billy Goat has been designed to be scalable, to operate gracefully in a large distributed environment, and to provide extremely accurate detection of worm-infected machines. This report describes a number of interesting and useful techniques and components identified during the process of developing Billy Goat, and it can be used as a reference model by other practitioners faced with similar problems. It also throws light on a number of related techniques, such as the use of cryptographic checksums as external keys to ease distributed deployment, and the use of social structures to control access to information and to determine who needs to receive alerts.
Billy Goat is useful both for accurately detecting machines infected with known worms, through its signature-based analysis, and for detecting emerging behavior and new worms, through its comprehensive view of the network and its anomaly-analysis capabilities:
· The former is immediately useful to system and network administrators, who can be notified of specific infections in machines under their control.
· The latter is useful to security and network analysts interested in large-scale and new behavior of the network.
A single Billy Goat sensor deployed in a network can provide useful information about infected machines, but its real value shows in a multi-sensor environment, which provides better coverage of large networks and allows the data to be centralized and analyzed to detect emerging trends and global suspicious behavior. Billy Goat can thus become an integral part of a network's security infrastructure.
FREQUENTLY ASKED QUESTIONS
Q1) What is Billy Goat System?
Q2) What are the various characteristics of Billy Goat Systems?
Q3) Who designed Billy Goat System?
Q4) What are its Future Works?
Q5) How does Billy Goat System detect Worms?
REFERENCES
[1] James Riordan, Andreas Wespi and Diego Zamboni, "How to hook worms", IEEE Spectrum, volume 42, number 5, May 2005.
[2] "Lessons learned from Billy Goat - an accurate worm detection system", www.wormblog.com/detection/.
[3] James Riordan, Andreas Wespi and Diego Zamboni, “Lessons learned from Billy Goat – an accurate worm detection system”, RZ-3609 (#99619).