Thursday, February 12, 2009
active directory
An active directory is a directory structure used on Microsoft Windows based computers and servers to store information and data about networks and domains. It is used primarily to centralize and look up information about a network; Microsoft announced the technology in 1996 and first shipped it with Windows 2000.
An active directory (sometimes referred to as an AD) does a variety of functions including the ability to provide information on objects, helps organize these objects for easy retrieval and access, allows access by end users and administrators and allows the administrator to set security up for the directory.
An active directory can be defined as a hierarchical structure, and this structure is usually broken up into three main categories: resources, which might include hardware such as printers; services for end users, such as web email servers; and objects, which are the main functions of the domain and network.
It is interesting to note the framework for the objects. Remember that an object can be a piece of hardware such as a printer, an end user, or security settings set by the administrator. These objects can hold other objects within their file structure. All objects have an ID, usually an object name (folder name). In addition to being able to hold other objects, every object has its own attributes, which allow it to be characterized by the information it contains. Most IT professionals call these settings, or characterizations, schemas.
The type of schema created for a folder ultimately determines how these objects are used. For instance, some objects with certain schemas cannot be deleted; they can only be deactivated. Other types of schemas with certain attributes can be deleted entirely. For example, a user object can be deleted, but the administrator object cannot.
When learning about active directories, it is important to know the framework within which objects can be viewed. An active directory can be viewed at any one of three levels, called forests, trees, and domains. The highest structure is called the forest because from it you can see all objects included within the active directory.
Within the forest structure are trees; these structures usually hold one or more domains. Going further down the structure of an active directory are single domains. To put forests, trees, and domains into perspective, consider the following example.
A large organization has many dozens of users and processes. The forest might be the entire network of end users and specific computers at a set location. Within this forest directory are now trees that hold information on specific objects such as domain controllers, program data, system, etc. Within these objects are even more objects which can then be controlled and categorized.
DIRECTORY SERVICE:
Active Directory is a full-featured directory service. But what is a directory service? Well, a directory service is actually a combination of two things – a directory, and services that make the directory useful. Simply, a directory is a store of information, similar to other directories, such as a telephone book. A directory can store a variety of useful information relating to users, groups, computers, printers, shared folders, and so forth – we call these objects. A directory also stores information about objects, or properties of objects – we call these attributes. For example, attributes stored in a directory for a particular user object would be the user’s manager, phone numbers, address information, logon name, password, the groups they are a part of, and more.
To make a directory useful, we have services that interact with the directory. For example, we can use the directory as a store of information against which users are authenticated, or as the place we query to find information about an object. For example, I could query a directory to show me all the color printers in the Frankfurt office, the phone number of Bob in the Delhi office, or a list of all the user accounts whose first name starts with the letter ‘G’. In Windows 2000, Active Directory is responsible for creating and organizing not only these smaller objects, but also larger objects – like domains, organizational units, and sites. In order to fully comprehend what Active Directory is all about, we need to take an initial look at a number of concepts. A deeper discussion on Active Directory will be covered once we get to the AD Implementation and Administration portion of the series.
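As a rough illustration of such a query (a sketch only, not part of the original material: it assumes an LDAP command-line client such as OpenLDAP's ldapsearch, a hypothetical domain controller dc1.example.com, and the made-up domain example.com; authentication options are omitted for brevity), finding all user accounts whose first name starts with 'G' might look like this:
# ldapsearch -x -H ldap://dc1.example.com -b "dc=example,dc=com" "(&(objectClass=user)(givenName=G*))" cn mail telephoneNumber
The filter in parentheses selects user objects whose givenName attribute begins with G, and the trailing attribute names limit the output to the common name, mail address, and phone number.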
HIERARCHY OF AD (OBJECT VIEW)
The structure of the Active Directory is a hierarchy, and before installing and implementing the Active Directory, you must have a firm understanding of the structure as well as the components that make up the Active Directory. You will use this hierarchy design to build the Active Directory infrastructure for your organization, so it is important that you have a firm grasp of their meaning and place in the hierarchy before you begin planning. The following sections explore the components in the hierarchy structure.
Object
An Active Directory object represents a physical object of some kind on the network. Common Active Directory objects are users, groups, printers, shared folders, applications, databases, contacts, and so forth. Each of these objects represents something "tangible." Each object is defined by a set of "attributes." An attribute is a quality that helps define the actual object. For example, a user object could have attributes of a username, actual name, and email address. Attributes for each kind of object are defined in the Active Directory. The attributes define the object itself and allow users to search for the particular object.
Organizational Unit
An organizational unit (OU) is like a file folder in a filing cabinet. The OU is designed to hold objects (or even other OUs). It contains attributes like an object, but has no functionality on its own. As with a file folder, its purpose is to hold other objects. As the name implies, an OU helps you "organize" your directory structure. For example, you could have an accounting OU that contains other OUs, such as Accounting Group A and Accounting Group B, and inside those OUs can reside objects that belong there, such as users, groups, computers, printers, etc. OUs also serve as security and administrative boundaries and can be used to replace domains in networks with multiple Windows NT domains.
Domain
By definition, a domain is a logical grouping of users and computers. A domain typically resides in a localized geographic location, but this is not always the case. In reality, a domain is more than a logical grouping — it is actually a security boundary in a Windows 2000 or NT network. You can think of a network with multiple domains as being like a residential neighborhood. All of the homes make up the neighborhood, but each home is a security boundary that holds certain objects inside and keeps others out. Each domain can have its own security policies and can establish trust relationships with other domains. The Active Directory is made up of one or more domains. Domains contain a schema, which is a set of object class definitions. The schema determines how objects are defined within the Active Directory. The schema itself resides within the Active Directory and can be dynamically changed.
Tree
The hierarchy structure of the domain, organizational units, and objects is called a tree. The objects within the tree are referred to as endpoints, while the OUs in the tree structure are nodes. In terms of a physical tree, you can think of the branches as OUs or containers and the leaves as objects — an object is the natural endpoint of a node within the tree.
Domain Trees
A domain tree exists when several domains are linked by trust relationships and share a common schema, configuration, and global catalog. Trust relationships in Windows 2000 are based on the Kerberos security protocol. Kerberos trusts are transitive. In other words, if domain 1 trusts domain 2 and domain 2 trusts domain 3, then domain 1 trusts domain 3. A domain tree also shares a contiguous namespace. A contiguous namespace follows the same DNS naming hierarchy within the domain tree. For example, if the root domain is smithfin.com and domain A and domain B exist in a domain tree, the contiguous namespace for the two would be domaina.smithfin.com and domainb.smithfin.com. If domain A resides in smithfindal.com and domain B resides in the smithfin.com root, then the two would not share a contiguous namespace.
Forest
A forest is one or more trees that do not share a contiguous namespace. The trees in the forest do share a common schema, configuration, and global catalog, but they do not share a contiguous namespace. All trees in the forest trust each other through Kerberos transitive trusts. In actuality, the forest does not have a distinct name; instead, the forest is normally referred to by the name of the tree at the top of the hierarchy. For example, corp.com, production.corp.com, and mgmt.corp.com form a forest with corp.com serving as the forest root.
Site
A site is not actually considered a part of the Active Directory hierarchy, but is configured in the Active Directory for replication purposes. A site is defined as a geographical location in a network containing Active Directory servers with a well-connected TCP/IP subnet. Well-connected means that the network connection to other subnets in the network is highly reliable and fast. Administrators use the Active Directory to configure replication between sites. Users do not have to be aware of site configuration. As far as the Active Directory is concerned, users only see domains.
TRUST
The server uses trusts to determine whether access is allowed or not.
Active Directory uses two types of trust:
· Transitive: The two sides can access each other's domains and trees; that is, a user is allowed access to the other tree or domain, and vice versa.
· Non-transitive (one-way): One side can access the trees and domains of the other, but the other domain does not allow access to the domains and trees of the first, e.g. admin-->user.
GOALS
The two primary goals are:
· USER
A user should be able to access resources throughout the domain using a single login.
· ADMINISTRATOR
The administrator should be able to centrally manage both users and resources.
DESIGN GOALS OF THE ACTIVE DIRECTORY
The Active Directory's design goals are simple, yet very powerful, allowing Active Directory to provide the desired functionality in virtually any computing environment. The following list describes the major features and goals of the Active Directory technology.
Scalable — The Active Directory is highly scalable, which means it can function in small networking environments or global corporations. The Active Directory supports multiple stores, which are wide groupings of objects, and can hold more than one million objects per store.
Extensible — The Active Directory is "extensible," which means it can be customized to meet the needs of an organization.
Secure — The Active Directory is integrated with Windows 2000 security, allowing administrators to control access to objects.
Seamless — The Active Directory is seamlessly integrated with the local network and the intranet/Internet.
Open Standards — The Active Directory is based on open communication standards, which allow integration and communication with other directory services, such as Novell's NDS.
Backwards Compatible — Although Windows 2000 operating systems make the most use of the Active Directory, the Active Directory is backwards compatible for earlier versions of Windows operating systems. This feature allows implementation of the Active Directory to be taken one step at a time.
samba server in linux
Samba allows you to share files with Windows PCs on your network, as well as access Windows file and print servers, making your Linux box fit in better with Windows-centric organizations.
In Windows file and printer sharing, SMB is sometimes referred to as CIFS (Common Internet File System), which is an Internet standard network file system definition based on SMB, or NetBIOS, which was the original SMB communication protocol.
Samba is a software package that enables you to share file systems and printers on a network with computers that use the Server Message Block (SMB) protocol. This package is distributed with most Linux flavors but can be obtained from www.samba.org if you do not find it on your distribution. SMB is the protocol that is delivered with Windows operating systems for sharing files and printers. Although you can’t always count on NFS being installed on Windows clients (unless you install it yourself), SMB is always available (with a bit of setup).
The Samba software package contains a variety of daemon processes, administrative tools, user tools, and configuration files. To do basic Samba configuration, start with the Samba Server Configuration window, which provides a graphical interface for configuring the server and setting directories to share.
Most of the Samba configuration you do ends up in the /etc/samba/smb.conf file. If you need to access features that are not available through the Samba Server Configuration window, you can edit /etc/samba/smb.conf by hand or use SWAT, a Web-based interface, to configure Samba. Daemon processes consist of smbd (the SMB daemon) and nmbd (the NetBIOS name server). The smbd daemon makes the file-sharing and printing services you add to your Linux system available to Windows client computers.
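To give a feel for what that configuration looks like, here is a minimal share definition as it might appear in /etc/samba/smb.conf (a sketch only; the share name, path, and user shown are hypothetical):
[projects]
   comment = Shared project files
   path = /srv/samba/projects
   read only = no
   valid users = chris
Windows clients would then see a share named projects on the server; the testparm command (described later in this section) can be used to check that the syntax of the file is valid.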
SAMBA PACKAGE SUPPORT
The Samba package supports the following client computers:
Windows 9x
Windows NT
Windows ME
Windows 2000
Windows XP
Windows for workgroups
MS Client 3.0 for DOS
OS/2
Dave for Macintosh computers
Mac OS X
Samba for Linux
Mac OS X Server ships with Samba, so you can use a Macintosh system as a server. You can then have Macintosh, Windows, or Linux client computers. In addition, Mac OS X ships with both client and server software for Samba.
As for administrative tools for Samba, you have several shell commands at your disposal: testparm and testprns, with which you can check your configuration files; smbstatus, which tells you what computers are currently connected to your shared resources; and the nmblookup command, with which you can query computers.
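For example (the configuration path and IP address here are only placeholders), you might run:
# testparm /etc/samba/smb.conf
# smbstatus
# nmblookup -A 10.0.0.5
testparm reports any syntax errors in the configuration file, smbstatus lists current client connections and locked files, and nmblookup -A queries the given address for its NetBIOS names.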
Samba uses the NetBIOS service to share resources with SMB clients, but the underlying network must be configured for TCP/IP. Although other SMB hosts can use TCP/IP, NetBEUI, and IPX/SPX to transport data, Samba for Linux supports only TCP/IP. Messages are carried between host computers with TCP/IP and are then handled by NetBIOS.
Getting and Installing Samba
You can get Samba software in different ways, depending on your Linux distribution. Here are a few examples:
Debian
To use Samba in Debian, you must install the samba and smbclient packages using apt-get. Then start the Samba service by running the appropriate scripts from the /etc/init.d directory, as follows:
# apt-get install samba samba-common smbclient swat
# /etc/init.d/samba start
# /etc/init.d/smb-client start
Gentoo
With Gentoo, you need to have configured net-fs support into the kernel to use Samba server features. Installing the samba package (emerge samba) should get the required packages. To start the service, run rc-update and start the service immediately:
# emerge samba
# rc-update add samba default
# /etc/init.d/samba start
Fedora Core and other Red Hat Linux systems
You need to install the samba, samba-client, samba-common, and, optionally, the system-config-samba and samba-swat packages to use Samba in Fedora. You can then start Samba using the service and chkconfig commands as follows:
# service smb start
# chkconfig smb on
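Once the smb service is running, a quick way to check that the server is answering (assuming a Samba user named chris has already been created with smbpasswd; the name is hypothetical) is to list its shares with the smbclient command:
# smbclient -L localhost -U chris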
SWAT
The commands and configuration files are the same on most Linux systems using Samba. The Samba project itself comes with a Web-based interface for administering Samba called Samba Web Administration Tool (SWAT). For someone setting up Samba for the first time, SWAT is a good way to get it up and running.
Configuring Samba with SWAT
In addition to offering an extensive interface to Samba options, SWAT also comes with an excellent help facility. And if you need to administer Samba from another computer, SWAT can be configured to be remotely accessible and secured by requiring an administrative login and password.
Before you can use SWAT, you must do some configuration. The first thing you must do is turn on the SWAT service, which is done differently in different Linux distributions.
Here’s how to set up SWAT in Fedora Core and other Red Hat Linux systems:
1. Turn on the SWAT service by typing the following, as root user, from a Terminal window:
# chkconfig swat on
2. Pick up the change to the service by restarting the xinetd startup script as follows:
# service xinetd restart
Linux distributions such as Debian, Slackware, and Gentoo turn on the SWAT service from the inetd superserver daemon. After SWAT is installed, you simply remove the comment character from in front of the swat line in the /etc/inetd.conf file (as root user, using any text editor) and restart the daemon. Here’s an example of what the swat line looks like in Debian:
swat stream tcp nowait.400 root /usr/sbin/tcpd /usr/sbin/swat
With the SWAT service ready to be activated, restart the inetd daemon so it rereads the inetd.conf file. To do that in Debian, type the following as root user:
# /etc/init.d/inetd restart
The init.d script and xinetd services are the two ways that SWAT services are generally started in Linux. So if you are using a Linux distribution other than Fedora or Debian, look in the /etc/inetd.conf file or the /etc/xinetd.d directory (which is used in Fedora) for the location of your SWAT service.
When you have finished this procedure, a daemon process will be listening on your network interfaces for requests to connect to your SWAT service. You can now use the SWAT program to configure Samba.
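As a rough check that something is actually listening on the SWAT port (901 by default), you can look for the port with netstat; the exact output varies by system:
# netstat -tln | grep :901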
Starting with SWAT
You can run the SWAT program by typing the following URL in your local browser: http://localhost:901/. Enter the root username and password when the browser prompts you. The SWAT window then appears.
mounting in linux
Automatically Mounting:
After a server exports a directory over the network using NFS, a client computer connects that directory to its own file system using the mount command. That’s the same command used to mount file systems from local hard disks, CDs, and floppies, but with slightly different options.
mount can automatically mount NFS directories added to the /etc/fstab file, just as it does with local disks. NFS directories can also be added to the /etc/fstab file in such a way that they are not automatically mounted (so you can mount them manually when you choose). With a noauto option, an NFS directory listed in /etc/fstab is inactive until the mount command is used, after the system is up and running, to mount the file system.
To set up an NFS file system to mount automatically each time you start your Linux system, you need to add an entry for that NFS file system to the /etc/fstab file. That file contains information about all the different kinds of mounted (and available to be mounted) file systems for your system. Here’s the format for adding an NFS file system to your local system:
host:directory mountpoint nfs options 0 0
The first item (host:directory) identifies the NFS server computer and shared directory. mountpoint is the local mount point on which the NFS directory is mounted. It’s followed by the file system type (nfs). Any options related to the mount appear next in a comma-separated list. (The last two zeros configure the system to not dump the contents of the file system and not to run fsck on the file system.) The following are examples of NFS entries in /etc/fstab:
maple:/tmp /mnt/maple nfs rsize=8192,wsize=8192 0 0
oak:/apps /oak/apps nfs noauto,ro 0 0
In the first example, the remote directory /tmp from the computer named maple (maple:/tmp) is mounted on the local directory /mnt/maple (the local directory must already exist). The file system type is nfs, and read (rsize) and write (wsize) buffer sizes (discussed in the OPTIONS list later in this section) are set at 8192 to speed data transfer associated with this connection. In the second example, the remote directory is /apps on the computer named oak. It is set up as an NFS file system (nfs) that can be mounted on the /oak/apps directory locally. This file system is not mounted automatically (noauto), however, and can be mounted only as read-only (ro) using the mount command after the system is already running.
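Once an entry like the first one exists in /etc/fstab, the share can also be mounted by naming only its mount point, because mount looks up the remaining details in /etc/fstab:
# mount /mnt/maple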
Manually Mounting an NFS File System:
If you know that a directory from a computer on your network has been exported (that is, made available for mounting), you can mount that directory manually using the mount command. This is a good way to make sure that it is available and working before you set it up to mount permanently. Here is an example of mounting the /tmp directory from a computer named maple on your local computer:
# mkdir /mnt/maple
# mount maple:/tmp /mnt/maple
The first command (mkdir) creates the mount point directory (/mnt is a common place to put temporarily mounted disks and NFS file systems). The mount command identifies the remote computer and shared file system separated by a colon (maple:/tmp), and the local mount point directory (/mnt/maple) follows.
ENSURE MOUNTING
To ensure that the mount occurred, type mount. This command lists all mounted disks and NFS file systems. Here is an example of the mount command and its output (with file systems not pertinent to this discussion edited out):
# mount
/dev/hda3 on / type ext3 (rw)
..
..
..
maple:/tmp on /mnt/maple type nfs (rw,addr=10.0.0.11)
The output from the mount command shows the mounted disk partitions, special file systems, and NFS file systems. The first output line shows the hard disk (/dev/hda3), mounted on the root file system (/), with read/write permission (rw), with a file system type of ext3 (the standard Linux file system type). The just-mounted NFS file system is the /tmp directory from maple (maple:/tmp). It is mounted on /mnt/maple and its mount type is nfs. The file system was mounted read/write (rw), and the IP address of maple is 10.0.0.11 (addr=10.0.0.11).
This is a simple example of using mount with NFS. The mount is temporary and is not remounted when you reboot your computer. You can also add options for NFS mounts (see the example following the list below):
OPTIONS
-a: Mount all file systems in /etc/fstab (except those indicated as noauto).
-f: This goes through the motions of (fakes) mounting the file systems on the command line (or in /etc/fstab). Used with the -v option, -f is useful for seeing what mount would do before it actually does it.
-r: Mounts the file system as read-only.
-w: Mounts the file system as read/write. (For this to work, the shared file system must have been exported with read/write permission.)
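As a brief example of these options in use (the host and directory names are the same sample ones used above), the first command below mounts a remote directory read-only, and the second mounts every NFS entry listed in /etc/fstab:
# mount -t nfs -o ro maple:/tmp /mnt/maple
# mount -a -t nfs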
Sharing NFS File Systems
To share an NFS file system from your Linux system, you need to export it from the server system. Exporting is done in Linux by adding entries into the /etc/exports file. Each entry identifies a directory in your local file system that you want to share with other computers. The entry also identifies the other computers that can share the resource (or opens it to all computers) and includes other options that reflect the permissions associated with the directory.
Remember that when you share a directory, you are sharing all files and subdirectories below that directory as well (by default). So, you need to be sure that you want to share everything in that directory structure.
Configuring the /etc/exports File
To make a directory from your Linux system available to other systems, you need to export that directory. Exporting is done on a permanent basis by adding information about an exported directory to the /etc/exports file. The format of the /etc/exports file is
Directory Host(Options) # Comments
where Directory is the name of the directory that you want to share, and Host indicates the host computer to which the sharing of this directory is restricted. Options can include a variety of options to define the security measures attached to the shared directory for the host. (You can repeat Host/Options pairs.) Comments are any optional comments you want to add (following the # sign).
As root user, you can use any text editor to configure /etc/exports to modify shared directory entries or add new ones. Here’s an example of an /etc/exports file:
/cal *.linuxtoys.net(rw) # Company events
/pub (ro,insecure,all_squash) # Public dir
/home maple(rw,squash_uids=0-99) spruce(rw,squash_uids=0-99)
The /cal entry represents a directory that contains information about events related to the company. It is made accessible to everyone with accounts to any computers in the company’s domain (*.linuxtoys.net). Users can write files to the directory as well as read them (indicated by the rw option). The comment (# Company events) simply serves to remind you of what the directory contains.
The /pub entry represents a public directory. It allows any computer and user to read files from the directory (indicated by the ro option) but not to write files. The insecure option enables any computer, even one that doesn’t use a secure NFS port, to access the directory. The all_squash option causes all users (UIDs) and groups (GIDs) to be mapped to the nfsnobody user, giving them minimal permission to files and directories.
The /home entry enables a set of users to have the same /home directory on different computers. Say, for example, that you are sharing /home from a computer named oak. The computers named maple and spruce could each mount that directory on their own /home directories. If you gave all users the same username/UIDs on all machines, you could have the same /home/user directory available for each user, regardless of which computer they are logged into. The squash_uids=0-99 option is used to prevent any administrative login from another computer from changing any files in the shared directory.
These are just examples; you can share any directories that you choose, including the entire file system (/). Of course, there are security implications of sharing the whole file system or sensitive parts of it (such as /etc). Security options that you can add to your /etc/exports file are described throughout the sections that follow.
Hostnames in /etc/exports
You can indicate in the /etc/exports file which host computers can have access to your shared directory in the following ways:
Individual host
You can enter one or more TCP/IP hostnames or IP Addresses. If the host is in your local domain, you can simply indicate the hostname. Otherwise, you can use the full host.domain format. These are valid ways of indicating individual host computers:
maple
maple.handsonhistory.com
10.0.0.11
IP network
To allow access to all hosts from a particular network address, indicate a network number and its netmask, separated by a slash (/). These are valid ways of indicating network numbers:
10.0.0.0/255.0.0.0
172.16.0.0/255.255.0.0
192.168.18.0/255.255.255.0
TCP/IP domain
You can include all or some host computers from a particular domain level. Here are some valid uses of the asterisk and question mark wildcards:
*.handsonhistory.com
*craft.handsonhistory.com
???.handsonhistory.com
The first example matches all hosts in the handsonhistory.com domain. The second example matches woodcraft, basketcraft, or any other hostname ending in craft in the handsonhistory.com domain. The final example matches any three-letter hostnames in the domain.
Note: Using an asterisk doesn’t match subdomains. For example, *.handsonhistory.com would not cause the hostname mallard.duck.handsonhistory.com to be included in the access list.
NIS groups
You can allow access to the hosts contained in an NIS group. To indicate an NIS group, precede the group name with an at (@) sign (for example, @group).
Link and access options in /etc/exports
You don’t have to just give away your files and directories when you export a directory with NFS. In the options part of each entry in /etc/exports, you can add options that allow or limit access based on user ID, subdirectory, and read/write permission. These options, which are passed to NFS, are as follows:
ro
Only allow the client to mount this exported file system read-only. The default is to mount the file system read/write.
rw
Explicitly ask that a shared directory be shared with read/write permissions. (If the client chooses, it can still mount the directory read-only.)
noaccess
All files and directories below the given directory are not accessible. This is how you would exclude selected subdirectories of a shared directory from being shared. The directory will still appear to the client that mounts the file system that includes this directory, but the client will not be able to view its contents.
link_relative
If absolute symbolic links are included in the shared file system (that is, ones that identify a full path), the full path is converted to a relative path. To do this, each part of the path is converted to two dots and a slash (../) to reach the root of the file system.
link_absolute
Don’t change any of the symbolic links (default).
User mapping options in /etc/exports
Besides options that define how permissions are handled generally, you can also use options to set the permissions that specific users have to NFS shared file systems.
One method that simplifies this process is to have each user with multiple user accounts have the same user name and UID on each machine. This makes it easier to map users so that they have the same permission on a mounted file system as they do on files stored on their local hard disk. If that method is not convenient, user IDs can be mapped in many other ways. Here are some methods of setting user permissions and the /etc/exports option that you use for each method:
root user
Normally, the client’s root user is mapped into the anonymous user ID. This prevents the root user from a client computer from being able to change all files and directories in the shared file system. If you want the client’s root user to have root permission on the server, use the no_root_squash option.
There may be other administrative users, in addition to root, that you want to squash. I recommend squashing UIDs 0-99 as follows: squash_uids=0-99.
Anonymous user/group
By using anonymous user ID and group ID, you essentially create a user/group whose permissions will not allow access to files that belong to any users on the server (unless those users open permission to everyone). However, files created by the anonymous user/group will be available to anyone assigned as the anonymous user/group. To set all remote users to the anonymous user/group, use the all_squash option.
The anonymous user assigned by NFS is typically the "nobody" user name, with a UID and GID of -2 (because -2 cannot actually be assigned to a file, a UID and GID of 65534 are stored when the "nobody" user owns a file). This prevents the ID from colliding with a valid user or group ID. Using anonuid or anongid, you can change the anonymous user or group, respectively. For example, anonuid=175 sets all anonymous users to UID 175, and anongid=300 sets the GID to 300.
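Put together, an /etc/exports entry using these options might look like the following (the directory and host name here are only examples):
/pub spruce(ro,all_squash,anonuid=175,anongid=300)
With this entry, every request from spruce is treated as coming from UID 175 and GID 300, regardless of which user on the client made it.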
User mapping
If the same users have login accounts for a set of computers (and they have the same IDs), NFS, by default, will map those IDs. This means that if the user named mike (UID 110) on maple has an account on pine (mike, UID 110), from either computer he could use his own remotely mounted files from the other computer.
If a client user that is not set up on the server creates a file on the mounted NFS directory, the file is assigned to the remote client’s UID and GID. (An ls -l on the server would show the UID of the owner.) You can identify a file that contains user mappings using the map_static option.
The exports man page describes the map_static option, which should let you create a file that contains new ID mappings. These mappings should let you remap client IDs into different IDs on the server.
Exporting the Shared File Systems
After you have added entries to your /etc/exports file, run the exportfs command to have those directories exported (made available to other computers on the network). When you reboot your computer or restart the NFS service, the exportfs command runs automatically to export your directories.
If you want to export them immediately, run exportfs from the command line (as root). It’s a good idea to run the exportfs command after you change the exports file. If any errors are in the file, exportfs identifies them for you.
Here’s an example of the exportfs command:
# /usr/sbin/exportfs -a -v
exporting maple:/pub
exporting spruce:/pub
exporting maple:/home
exporting spruce:/home
exporting *:/mnt/win
The -a option indicates that all directories listed in /etc/exports should be exported.
The -v option says to print verbose output. In this example, the /pub and /home directories from the local server are immediately available for mounting by those client computers that are named (maple and spruce). The /mnt/win directory is available to all client computers.
Running the exportfs command temporarily makes your exported NFS directories available. To have your NFS directories available on an ongoing basis (that is, every time your system reboots), you need to set your nfs startup scripts to run at boot time.
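On Fedora and other Red Hat systems, for example, that can be done with chkconfig, and exportfs -r can be used later to re-export the directories after you edit /etc/exports (a sketch, consistent with the commands shown earlier in these notes):
# chkconfig nfs on
# exportfs -r -v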
file server in linux
A file system is usually a structure of files and directories that exists on a single device (such as a hard disk partition or CD-ROM). A Linux file system refers to the entire directory structure (which may include file systems from several disks or NFS resources), beginning from root (/) on a single computer. A shared directory in NFS may represent all or part of a computer’s file system, which can be attached (from the shared directory down the directory tree) to another computer’s file system.
1.2) CENTRALIZED SHARING:
Most networked computers are on the network in the first place so that users can share information. Some users need to collectively edit documents for a project, share access to spreadsheets and forms used in the daily operation of a company, or perform any number of similar file-sharing activities. It also can be efficient for groups of people on a computer network to share common applications and directories of information needed to do their jobs. By far the best way to accomplish the centralized sharing of data is through a file server.
A centralized file server can be backed up, preserving all stored data in one fell swoop. It can focus on the tasks of getting files to end users, rather than running user applications that can use client resources. And a centralized file server can be used to control access to information — security settings can dictate who can access what.
Linux systems include support for the most common file server protocols in use today. Among these are the Network File System (NFS), which has always been the file-sharing protocol of choice for Linux and other UNIX systems, and Samba (the SMB protocol), which is often used by networks with many Windows and OS/2 computers.
2) NFS FILE SERVER
Instead of representing storage devices as drive letters (A, B, C, and so on), as they are in Microsoft operating systems, Linux systems connect file systems from multiple hard disks, floppy disks, CD-ROMs, and other local devices invisibly to form a single Linux file system. The Network File System (NFS) facility enables you to extend your Linux file system in the same way, to connect file systems on other computers to your local directory structure.
An NFS file server provides an easy way to share large amounts of data among the users and computers in an organization.
2.1) STEPS TO SET UP NFS
An administrator of a Linux system that is configured to share its file systems using NFS has to perform the following tasks to set up NFS:
1. Set up the network. If a LAN or other network link is already connecting the computers on which you want to use NFS, you already have the network you need.
2. Choose what to share on the server. Decide which file systems on your Linux NFS server to make available to other computers. You can choose any point in the file system and make all files and directories below that point accessible to other computers.
3. Set up security on the server. You can use several different security features to suit the level of security with which you are comfortable. Mount-level security lets you restrict the computers that can mount a resource and, for those allowed to mount it, lets you specify whether it can be mounted read/write or read-only. With user-level security, you map users from the client systems to users on the NFS server so that they can rely on standard Linux read/write/execute permissions, file ownership, and group permissions to access and protect files. Linux systems that support Security Enhanced Linux (SELinux), such as Fedora and Red Hat Enterprise Linux, offer another means of offering or restricting shared NFS files and directories.
4. Mount the file system on the client. Each client computer that is allowed access to the server’s NFS-shared file system can mount it anywhere the client chooses. For example, you might mount a file system from a computer called maple on the /mnt/maple directory in your local file system (see the example commands following this list). After it is mounted, you can view the contents of that directory by typing ls /mnt/maple. Then you can use the cd command below the /mnt/maple mount point to see the files and directories it contains.
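As a concrete sketch of step 4 (using the same sample host maple used elsewhere in these notes), the client-side commands would look something like this:
# mkdir /mnt/maple
# mount maple:/tmp /mnt/maple
# ls /mnt/maple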
2.3) Getting started with NFS:
While nearly every Linux system supports NFS client and server features, NFS is not always installed by default. You’ll need different packages for different Linux systems to install NFS. Here are some examples:
· Fedora Core and other Red Hat Linux systems:
You need to install the nfs-utils package to use Fedora as an NFS server. There is also a graphical NFS Configuration tool that requires you to install the system-config-nfs package. NFS client features are in the base operating system. To turn on the nfs service, type the following:
# service nfs start
# chkconfig nfs on
· Debian:
To act as an NFS client, the nfs-common and portmap packages are required; for an NFS server, the nfs-kernel-server package must be added. The following apt-get command line (if you are connected to the Internet) installs them all. Then, after you add an exported file system to the /etc/exports file (as described later), you can start the nfs-common and nfs-kernel-server scripts, as shown here:
# apt-get install nfs-common portmap nfs-kernel-server
# /etc/init.d/nfs-kernel-server start
# /etc/init.d/nfs-common start
· Gentoo
With Gentoo, NFS file system and NFS server support must be configured into the kernel to use NFS server features. Installing the nfs-utils package (emerge nfs-utils) should get the required packages. To start the service, run rc-update and start the service immediately:
# emerge nfs-utils
# rc-update add portmap default
# rc-update add nfs default
# /etc/init.d/nfs start
The commands (mount, exportfs, and so on) and files (/etc/exports, /etc/fstab, and so on) for actually configuring NFS are the same on every Linux system I’ve encountered.
Enhanced Data rates for GSM Evolution
EDGE offers on average up to three times the data throughput of GPRS — yielding a level of service that can support widespread data adoption, allowing new service offerings to evolve and making mobile multimedia services more affordable to more subscribers. For example, the EDGE technology enables an operator to handle the mass adoption of services like MMS and Push to Talk while also enhancing the end user’s experience of these as well as other services like Web Browsing, FTP and Video/Audio Streaming. With a global footprint of over 550 GSM networks in more than 180 countries, EDGE is in 111 networks and is expected to continue to be widely adopted given its cost-effective evolution properties.
The network EDGE plays a critical role in addressing these challenges because it is the network EDGE that creates the services customers will value. Enhanced Data rates for GSM Evolution (EDGE) was designed to deliver the capacity and the performance necessary to offer high-speed data services that will hold their own value against other 3G standards. EDGE shares spectrum and resources with GSM and GPRS, solving the spectrum availability dilemma for 3G services, and allowing for a highly flexible implementation, minimizing network impact and costs.
1. EDGE Overview
The edge of the network is the boundary between the network infrastructure and the data center servers. A data request moves from the client, over the Internet using the networking infrastructure, and then must travel across the gray area of the edge of the network before being passed on to a Tier 1 Web server. It is here, at the edge, where all the preparation for moving data into the server room happens, with these functions focusing on traffic processing rather than actual data processing. The reason this is considered a gray area is because these functions can be performed either by or with Tier 1 systems or by networking equipment such as routers and switches.
Enhanced Data rates for Global Evolution (EDGE) is a third-generation (3G) wireless technology that’s capable of high-speed data. EDGE occasionally is called “E-GPRS” because it’s an enhancement of the General Packet Radio Service (GPRS) network. EDGE can’t be deployed by itself; it must be added to an existing GPRS network. So, for example, an operator could offer GSM/GPRS/EDGE but not GSM/EDGE.
Cingular Wireless launched the world’s first commercial EDGE network in June 2003 in Indianapolis, Indiana. In September, CSL deployed EDGE in Hong Kong. Both operators introduced their services with a single handset model – the Nokia 6200 and Nokia 6220 respectively – although they say more models will be available sometime in the near future. The EDGE device that’s most likely to hit the market next is the Sony Ericsson GC82 PC card, although its release date has been pushed back at least once. Cingular’s launch is noteworthy, if only because EDGE has been promised and then postponed so many times. For example, in 1998, Ericsson forecast EDGE deployments by 2000.
Three years later, AT&T Wireless and Nokia forecast commercial launches by 2002. By being late out of the gate, EDGE may have missed its window of opportunity in following key respects. First, EDGE has to catch up with other 3G technologies such as CDMA2000 and W-CDMA, which have been commercially deployed for more than three years.
1.1 GPRS & EDGE: -
Nortel’s EDGE solution integrates almost seamlessly into current GPRS networks. From a Core Network perspective, the same GPRS SGSN and GGSN are used. As more users and services become available, Nortel GGSN can be leveraged to implement new IP-based service offerings including VPNs, personal content portals and content-based billing. All of the nodes in the packet core network follow an aggressive capacity growth curve that is possible through simple software updates. For our Access portfolio, all currently shipped hardware is EDGE-ready with the only upgrade required for older base stations being a new radio transceiver and power amplifier. The resulting EDGE coverage footprint can be better than the original GSM RF plan.
2.1.1 Minimizing Costs, Maximizing Spectrum: -
Nortel has earned a reputation for improving spectral efficiency and minimizing cost in our GSM/GPRS core and access portfolios. For our Access portfolio, all currently shipped hardware is EDGE-ready, with the only upgrade required for older base stations being a new radio transceiver and power amplifier. The resulting EDGE coverage footprint can be better than the original GSM RF plan. Nortel’s EDGE solution was designed such that the power amplifier and radio modules support voice, GPRS and EDGE simultaneously, reducing the need for separate radios and RF spectrum for GPRS/EDGE and voice. Implemented with such flexibility, Nortel’s EDGE solution can be deployed on a cell site configured with only one radio per sector. Cost savings aside, Nortel’s EDGE solution will be differentiated in the marketplace by the spectrally efficient characteristics of the access solution. Nortel continues to demonstrate innovation and leadership in spectral efficiency with major technological breakthroughs in system capacity and end-user quality of service (QoS).
2.1.2 Speeding Time To Revenue: -
With most network elements requiring only a software upgrade to support EDGE, the time interval between the business decision to deploy EDGE and the implementation of EDGE on the network can be quite short. In addition, Nortel has devised other features for driving EDGE services revenue right from the start. The most important of these is bandwidth management for classes of users, which Nortel refers to as the PCUSN QoS Management feature. The ability to differentiate users and services also provides flexibility for the operator to articulate different service packages and address a greater number of market segments. High-end business users with high-bandwidth requirements can be provisioned with a Gold or Silver attribute, for example, while flat-rate mass market (text) subscribers can use a Bronze subscription. At the busy hour, the Gold and Silver segments can be set to leave, perhaps, 10 percent of relative bandwidth to the Bronze subscription. This implementation will ensure a good level of customer satisfaction and loyalty from business users. Overall, this could dramatically increase the revenues being generated during busy hours, depending on the operator’s specific service offerings and user profiles.
In an upcoming software release from Nortel, R99 terminals supporting Packet Flow Context, as defined in the 3GPP Release 99 standards, will have the additional benefit of guaranteed throughput for conversational and streaming services. This additional service differentiation will ensure the availability of bandwidth for services like PTT, delivering customer satisfaction even under very high traffic loading conditions.
2.1.3 Operational excellence: -
The introduction of EDGE in the network translates into more bandwidth delivered to the base station, which requires more backhaul transmission. Furthermore, there is a benefit in being able to closely monitor and fine-tune EDGE radio parameters to maximize EDGE performance for end users. Nortel’s EDGE solution comes with a set of features that address these operational needs and network transmission costs through a robust, intuitive and easy-to-optimize EDGE software solution and two backhaul transport efficiency features that complement Nortel’s unique strengths in the backbone. Nortel’s EDGE software provides intuitive parameter options for simple but highly effective EDGE performance optimization, giving the operator a competitive edge.
Backhaul considerations are integrated into the Nortel EDGE solution portfolio. To support EDGE with a standard TDM interface, EDGE’s new modulation and coding scheme would require up to eight additional DS0s per radio. To mitigate this impact, Nortel is introducing an asynchronous interface on the backhaul that can reduce the need for additional DS0s by dynamically sharing a smaller pool of resources, reducing the additional bandwidth required at the cell site by up to 50 percent. The dynamic AGPRS interface between the BSC and the PCUSN also provides some pooling of resources, with a similar 30 percent gain on the number of T1s required.
3.2 Optimal Resource Use With N1: -
As part of the shift from application delivery to network service delivery, data management becomes considerably more complex. Companies need to be able to consolidate computing resources, easily manage them from a centralized view, and optimize resource utilization to maximize their return on investment. Sun solutions first disaggregate resources by delivering component solutions that are tuned for specific functions. At the lowest end, blade servers will provide tightly integrated edge and Tier 1 functionality, enabling intelligent blades to achieve optimal processing for a specific function, such as SSL encryption or load balancing.
Sun plans to then reaggregate all of a company’s optimized computing resources with N1 — Sun’s vision, architecture, and products for making entire data centers appear as one system. Instead of requiring system management for each server, N1 will provide a single virtual view of a company’s complete computing infrastructure. This will enable a roomful of servers, blades, storage devices, applications, and networking components to appear as a single entity. As a result, system administrators will be able to manage all of these infrastructure components from a single, central view instead of sending out a team of engineers to reconfigure compute resources as workloads change. This virtual view will also support automated configuration, enabling a manager to state a parameter that will then be implemented intelligently across all affected tiers, converting centralized policy into distributed local policy.
N1 data center management promises revolutionary benefits, including the following:
· Increases business agility by supporting dynamic reallocation of resources as processing demands and business needs change
· Eliminates the need for individual systems to maintain excess capacity for peak processing demands by allowing excess capacity to be shared
· Boosts server utilization from industry norms of 15 to 30 percent up to 80 percent or higher
· Significantly reduces resource management complexity and the need for manual intervention
· Simplifies deployment of new services
· Protects technology investments by integrating existing equipment
· Increases availability by leveraging the N1 pool of resources to reassign services
· Provides a Web-based single point of control that delivers anywhere, anytime administration
Ultimately, these benefits can result in significantly lower operations costs by eliminating manual management tasks and simplifying resource allocation. These cost savings are critical, as today’s companies spend more than 70 percent of their information technology budget on managing data center complexity. By integrating edge functions into intelligent blade servers that support centralized N1 management, Sun will help to optimize resource use, simplify installation and administration, and deliver exceptional price/performance both in Tier 1 and at the edge of the network.
Tuesday, January 6, 2009
BIT TORRENT
CERTIFICATE
This is to certify that Mr. Suraj Kadam has successfully completed his seminar on BitTorrent in partial fulfillment of the third year degree course in Information Technology in the academic year 2006 – 2007.
Date :
Prof. Mr. Narendra Pathak Prof. Mrs. Rathi Prof. Dr. A.S Tavildar
(Head of the Department) Seminar Guide Principal
V.I.I.T, Pune. V.I.I.T, Pune. V.I.I.T, Pune.
Bract’s VIIT Pune – 48.
Department Of Information Technology
Sr. No. – 2/3/4,
Kondhwa Budruk.
Bioinformatics and Role of Software Engineers in It
By-
Makarand Arjun Kokane
BE-2
Roll number 222
Year 2002-2003
Pune Institute of Computer Technology
Pune 411043
CERTIFICATE
This is to certify that Mr. Makarand Arjun Kokane has successfully completed his seminar on the topic “Bioinformatics and role of software engineers in it” under the guidance of Prof. R. B. Ingle towards the partial completion of the Bachelor’s degree in Computer Engineering at Pune Institute of Computer Technology during the academic year 2002-2003.
Date:
Seminar Incharge HOD Principal
(Computer Engineering) PICT
Acknowledgement
I take this opportunity to thank respected Prof. R. B. Ingle Sir (my seminar guide) for his generous assistance. I am immensely grateful to our HOD Prof. Dr. C.V.K Rao for his encouragement and guidance. I extend my sincere thanks to our college library staff and all the staff members for their valuable assistance. I am also thankful to my fellow colleagues for their help and suggestions.
Makarand Kokane
(B.E. – 2)
Abstract
Bioinformatics is the application of computers in biological sciences. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data.
The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. Through this report I have tried to analyze the future requirements in the development of advanced technologies in this field and what role we, as software engineers, can play in the development of these technologies.
Contents
1 Introduction to Bioinformatics
1.1 What is Bioinformatics?
1.2 Computers and Biology
1.3 Limitations in the use of computers
1.4 Current Stage of Research
1.5 Microbial, Plant and Animal Genomes
1.6 History (Stages of development)
2 Basics of Molecular Biology
2.1 Nucleotide
2.2 Amino acid
2.3 Properties of Genetic Code
2.4 DNA (Deoxy-ribonucleic Acid)
2.5 Chromosomes
2.6 Gene
2.7 Protein
2.8 Sequencing
2.9 Genome
2.10 Clone
2.11 Model Organism
3 Role of Software Engineers and Technology in Biotechnology
3.1 Need for software automation
3.2 Genetic Algorithms
3.2.1 Database Searching
3.2.2 Comparing Two Sequences
3.2.3 Multiple Sequence Alignment
3.3 Genome Projects
3.4 Goals for Advancements in Sequencing Technology
3.5 Developing Technology to handle Sequence Variations
3.6 Need of Technology in Functional Genomics
3.7 Bioinformatics and Computational Biology
3.8 Job Opportunities and Job Requirements
3.9 Training Goals included in the Human Genome Project Plan
4 Human Genome Project
4.1 Introduction
4.2 Details of the Human Genome Project
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
4.3.2 Sequencing Technology
4.3.3 Sequence Variation
4.3.4 Functional Genomics
4.3.5 Comparative Genomics
4.3.6 Ethical, Legal, and Social Implications (ELSI)
4.3.7 Bioinformatics and Computational Biology
4.3.8 Training
5 Biological Databases
5.1 The Biological sequence/structure deficit
5.2 Biological Databases
5.3 Primary Sequence Databases
5.3.1 Nucleic acid Sequence Databases
5.3.2 Protein Sequence Databases
5.4 Composite Protein Sequence Databases
5.5 Secondary Databases
5.6 Tertiary Databases
6 Applications of Bioinformatics
6.1 Application to the Ailments of Diseases
6.2 Application of Bioinformatics to Agriculture
6.2.1 Improvements in Crop Yield and Quality
6.3 Applications of Microbial Genomics
6.4 Risk Assessment
6.5 Evolution and Human Migration
6.6 DNA Forensics (Identification)
Bibliography
Chapter 1
Introduction to Bioinformatics
1.1 What is Bioinformatics?
Bioinformatics is the application of computers in biological sciences, and especially the analysis of biological sequence data. It is concerned with capturing, storing, graphically displaying, modeling and ultimately distributing biological information. It is becoming an essential tool in molecular biology as genome projects generate vast quantities of data. With new sequences being added to DNA databases on average once every minute, there is a pressing need to convert this information into biochemical and biophysical knowledge by deciphering the structural, functional and evolutionary clues encoded in the language of biological sequences.
What Bioinformatics therefore offers to the researcher, the entrepreneur, or the Venture Capitalist is an enormous and exciting array of opportunities to discover how living systems metabolise, grow, combat disease, reproduce and regenerate. The current knowledge represents only the tip of the iceberg. Exciting and startling discoveries are being made everyday through Bioinformatics, which is building up an extensive encyclopedia from which life’s mysteries will be unraveled. The importance of computational science in collating this information and its simultaneous interpretation by biologists is the underlying ethos of Bioinformatics.
An interest in biology and a strong inclination towards genetics certainly help. From our point of view, however, the most important point is that biocomputing requires large numbers of software professionals, and there is arguably more for these people to do in this field than there is for the experts in biology.
1.2 Computers and Biology
Bioinformatics is the symbiotic relationship between the computational and biological sciences. The ability to sort and extract genetic codes from a human genomic database of 3 billion base pairs of DNA in a meaningful way is perhaps the simplest form of Bioinformatics. Moving on to another level, Bioinformatics is useful in mapping different people’s genomes and deriving differences in their genetic make-up through pattern-recognition software. But that is the easiest part. What is more complex is to decipher the genetic code itself, to see what the differences in genetic make-up between different people translate into in terms of physiological traits. And there is yet another level, which is even more intricate: the function of the genetic code itself. The genetic code codes for amino acids and thereby proteins, and the specific role played by each of these proteins controls the state of our health. The role or function of each of our genes in coding for a specific protein, which in turn regulates a particular metabolic pathway, is described as “functional genomics”. The true benefit of Bioinformatics therefore lies in harnessing information pertaining to these genetic functions in order to understand how human beings and other living systems operate.
Computational simulation of experimental biology is an important application of Bioinformatics, which is referred to as “in silico” testing. This is perhaps an area that will expand in a prolific way, given the need to obtain a greater degree of predictability in animal and human clinical trials. Added to this is the interesting scope that “in silico” testing provides to deal with the growing hostility towards animal testing. The growth of this sector will largely depend on the acceptance of “in silico” testing by the regulatory authorities. However, irrespective of this, research strategies will certainly find computational modeling to be a vital tool in speeding up research, with enormous cost benefits.
1.3 Limitations in the use of computers
The last decade has witnessed the dawn of a new era of ‘silicon-based’ biology, opening the door, for the first time, to the possible investigation and comparative analysis of complete genomes. Genome analysis means to elucidate and characterize the genes and gene products of an organism. It depends on a number of pivotal concepts, concerning the processes of evolution (divergence and convergence), the mechanism of protein folding, and the manifestation of protein function.
Today, our use of computers to model such processes is limited by, and must be placed in the context of, the current limits of our understanding of these central themes. At the outset, it is important to recognize that we do not yet fully understand the rules of protein folding; we cannot invariably say that a particular sequence or a fold has arisen by divergent or convergent evolution; and we cannot necessarily diagnose a protein function, given knowledge only of its sequence or of its structure, in isolation. Accepting what we cannot do with computers plays an essential role in forming an appreciation of what, in fact, we can do. Without this kind of understanding, it is easy to be misled, as spurious arguments are often used to promote perhaps rather overenthusiastic points of view about what particular programs and software packages can achieve.
Nature has its own complex rules, which we only poorly understand and which we cannot easily encapsulate within computer programs. No current algorithm can ‘do’ biology. Programs provide mathematical models of biological systems, and such models are necessarily simplifications. To interpret correctly whether sequences or structures are meaningfully similar, whether they have arisen by the processes of divergence or convergence, and whether similar sequences or similar folds have the same or different functions: these are the most challenging problems. There are no simple solutions, and computers do not give us the answers; rather, given a sea of data, they help to narrow the options down so that the users can begin to draw informed, biologically reasonable conclusions.
1.4 Current Stage of Research
In the field of Bioinformatics, the current research drive is to be able to understand evolutionary relationships in terms of the expression of protein function. Two computational approaches have been brought to bear on the problem, tackling the identification of protein function from the perspectives of sequence analysis and of structure analysis respectively. From the point of view of sequence analysis, we are concerned with the detection of relationships between newly determined sequences and those of known function (usually within a database). This may mean pinpointing functional sites shared by disparate proteins (probably the result of convergent evolution), or identifying related functions in similar proteins (most commonly the result of divergent evolution).
The identification of protein function from sequence sounds straightforward, and indeed, sequence analysis is usually a fruitful technique. But function cannot be inferred from sequence for about one-third of the proteins in any of the sequenced genomes, largely because biological characterization cannot keep pace with the volume of data issuing from the genome projects (a large number of database sequences thus either carry no annotation beyond the parent gene name, or are simply designated as hypothetical proteins). Another important point is that, in some instances, closely related sequences, which may be assumed to share a common structure, may not share the same function. What this means is that, although sequence or structure analysis can be used for deducing gene functions, neither technique can be applied infallibly without reference to the underlying biology.
1.5 Microbial, Plant and Animal Genomes
Although the human genome appears to be the focal point of interest, microbial, plant and animal genomes are equally exciting to explore through Bioinformatics. Mining plant genomics has an important impact on opening up new vistas for research in agriculture. Microbial genomics offers a dual opportunity of developing new fermentation-based products and technologies as well as defining new ways of combating microbial infections. Exploring animal genomics opens up unlimited scope to pursue research in veterinary science and transgenic models.
1.6 History (Stages of development)
The science of sequencing began slowly. The earliest techniques were based on methods for separation of proteins and peptides, coupled with methods for identification and quantification of amino acids. Prior to 1945, there was not a single quantitative analysis available for any one protein. However, significant progress with chromatographic and labeling techniques over the next decade eventually led to the elucidation of the first complete sequence, that of the peptide hormone insulin (1955). Yet it took a further five years before the sequence of the first enzyme (ribonuclease) was completed (1960). By 1965, around 20 proteins with more than 100 residues had been sequenced, and by 1980, the number was estimated to be of the order of 1500. Today, there are more than 300,000 sequences available.
Initially, the majority of protein sequences were obtained by the manual process of sequential Edman degradation and dansylation. A key step towards the rapid increase in the number of sequenced proteins was the development of automated sequencers, which, by 1980, offered a 10⁴-fold increase in sensitivity relative to the automated procedure implemented by Edman and Begg in 1967.
In the 1960s, scientists struggled to develop methods to sequence nucleic acids, but the first techniques to emerge were really only applicable to tRNAs, because these molecules are short (74 to 95 nucleotides in length) and individual molecules can be purified.
In contrast to tRNA, DNA is very long: human chromosomal molecules may contain between 55×10⁶ and 250×10⁶ base pairs. Assembling the complete nucleotide sequence of such a molecule is a huge task. Even if the sequence can be broken down into smaller fragments, purification remains a problem. The advent of gene cloning provided a solution to how the fragments could be separated. By 1977, two sequencing methods had emerged, using chain-termination and chemical-degradation approaches. With only minor changes, the techniques propagated to laboratories throughout the world and laid the foundation for the sequence revolution of the next two decades, and for the subsequent birth of Bioinformatics.
During the last decade, molecular biology has witnessed an information revolution as a result both of the development of rapid DNA sequencing techniques and of the corresponding progress in computer-based technologies, which allow us to cope with this information deluge in increasingly efficient ways. The broad term that was coined in the mid-1980s to encompass computer applications in the biological sciences is Bioinformatics. The term has been commandeered by several different disciplines to mean rather different things. In its broadest sense, it can be taken to mean information technology applied to the management and analysis of biological sequence data; this has implications in diverse areas, ranging from artificial intelligence and robotics to genome analysis. In the context of genome initiatives, the term was originally applied to the computational manipulation and analysis of biological sequence data. However, in view of the recent rapid accumulation of available protein structures, the term now tends also to be used to embrace the manipulation and analysis of 3D structure data.
Chapter 2
Basics of Molecular Biology
This chapter briefly explains some of the common biological terms that are absolutely essential for a clear understanding of what Bioinformatics is all about. I have avoided getting into the intricacies of genetics because the basic aim of this report is to survey the latest developments in the field of Bioinformatics, try to visualize where it is heading, understand what it has to offer to the community, and exploit the opportunities available in this field.
2.1 Nucleotide
A nucleotide is a molecule made up of three sub-units: a pentose sugar, a nitrogen base and a phosphate. Nucleic acids are polymers of nucleotides. The pentose sugar is either ribose or deoxyribose (this decides whether the genetic material formed is RNA or DNA). Nitrogen bases are of two types: purines (Adenine (A) and Guanine (G)) and pyrimidines (Cytosine (C), Thymine (T) and Uracil (U)).
2.2 Amino acid
It is the fundamental building block of proteins. There are 20 naturally occurring amino acids in animals and around 100 more found only in plants. A sequence of three nucleotides codes for one amino acid. The logic behind this is as follows: there are four types of nucleotides, depending on the nitrogenous base, (A, G, C, T) in DNA and (A, G, C, U) in RNA. Twenty different amino acids are to be coded using permutations of 4 types of nucleotides. So 3 nucleotides are required to specify one amino acid, because fewer than 3 would be insufficient (4² = 16 < 20), 3 are enough (4³ = 64 > 20), and more than 3 would only add redundancy. The sequence of three nucleotides specifying an amino acid is called a triplet code or codon (coding unit). All 64 codons specify something or other: most of them specify amino acids, but a few are instructions for starting and stopping the synthesis.
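The triplet code can be illustrated with a short sketch. The following Python snippet translates a toy DNA coding sequence into amino acids using a small, partial codon table; the example sequence and the subset of codons shown are illustrative only and are not drawn from this report.

# A minimal sketch of triplet-code translation (partial codon table, toy input).
# Assumes the input is the coding (sense) strand of DNA, read from a fixed start.

CODON_TABLE = {          # standard genetic code, small subset only
    "ATG": "Met",        # also the usual start codon
    "TTT": "Phe", "TTC": "Phe",
    "GGA": "Gly", "GGC": "Gly", "GGG": "Gly", "GGT": "Gly",
    "AAA": "Lys", "AAG": "Lys",
    "TAA": "STOP", "TAG": "STOP", "TGA": "STOP",
}

def translate(dna: str) -> list[str]:
    """Read the sequence three bases at a time (non-overlapping) until a stop codon."""
    protein = []
    for i in range(0, len(dna) - 2, 3):
        codon = dna[i:i + 3]
        amino_acid = CODON_TABLE.get(codon, "???")   # codon missing from this toy table
        if amino_acid == "STOP":
            break
        protein.append(amino_acid)
    return protein

print(translate("ATGTTTGGAAAATAA"))   # ['Met', 'Phe', 'Gly', 'Lys']

The non-overlapping, fixed-start reading in the loop mirrors properties 2 and 4 of the genetic code listed in the next section.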
2.3 Properties of Genetic Code
1. Three nucleotides in a DNA molecule code for one amino acid in the corresponding protein. Such a triplet is called a codon.
2. The code is read from a fixed starting point.
3. Codes for starting and stopping are present, but there is no code for a pause, or comma, in the middle.
4. The nucleotides are read three at a time in a non-overlapping manner.
5. Most of the 64 possible nucleotide triplets stand for one amino acid or another.
6. A few triplets stand for starting and stopping the synthesis.
7. Most amino acids are specified by two or more different codons. Because of this, the genetic code is said to be degenerate.
8. The code has polarity because it can be read only in one direction.
9. The code is universal: practically all organisms use the same code.
2.4 DNA (Deoxy-ribonucleic Acid)
The long, thread-like DNA molecule consists of two strands that are joined to one another all along their length. Each strand is a polymer made up of repeated sub-units (nucleotides); hence each strand is also called a polynucleotide. DNA is the basic genetic material in all living matter on earth. The two essential capabilities of DNA are (1) transmission of hereditary characters and (2) self-duplication. In the DNA molecule, two long polynucleotide chains are spirally twisted around each other. This is also called helical coiling, and DNA is often referred to as a double helix. A polynucleotide chain has polarity, and the two strands of a DNA molecule run in opposite directions; hence they are said to be antiparallel. The two chains are joined together by hydrogen bonds between the nitrogenous bases on the inside. Adenine (A) forms a bond only with Thymine (T), and Guanine (G) can form a bond only with Cytosine (C). Because of this base-pairing restriction, the two strands are always complementary to each other.
The sequence of bases along the polynucleotide is not restricted in any way. An infinite variety of combinations is possible. It is the precise sequence of bases that determines the genetic information. There is no theoretical limit to the length of a DNA molecule.
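The base-pairing restriction means that one strand fully determines the other. A minimal Python sketch of this idea follows; the input sequence is arbitrary and chosen only for illustration.

# Base-pairing restriction: A pairs with T, G pairs with C.
# The complementary strand runs in the opposite direction (antiparallel),
# so it is read back-to-front relative to the given strand.

PAIR = {"A": "T", "T": "A", "G": "C", "C": "G"}

def reverse_complement(strand: str) -> str:
    return "".join(PAIR[base] for base in reversed(strand))

print(reverse_complement("ATGCCGTA"))   # TACGGCAT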
2.5 Chromosomes
Chromosomes are the paired, self-replicating genetic structures of cells that contain the cellular DNA; the nucleotide sequence of the DNA encodes the linear array of genes.
2.6 Gene
A gene is the fundamental physical and functional unit of heredity. A gene is an ordered sequence of nucleotides located in a particular position on a particular chromosome that encodes a specific functional product (i.e. a protein or RNA molecule).
2.7 Protein
Protein is a molecule composed of one or more chains of amino acids in a specific order. The order is determined by the base sequence of nucleotides in the gene coding for the protein. Proteins are required for the structure, function and regulation of cells, tissues and organs, each protein having a specific role (e.g., hormones, enzymes and antibodies).
DNA carries the hereditary material, and what it does is direct the synthesis of proteins; all the hereditary characteristics are then reflected in the activities of the body's cells through the action of these proteins.
2.8 Sequencing
Sequencing means the determination of the order of nucleotides (base sequences) in a DNA or RNA molecule, or the order of amino acids in a protein.
2.9 Genome
Genome of an organism means all the genetic material in its chromosomes. Its size is generally given as its total number of base pairs. Genomes of different organisms can be compared to identify similarities and disparities in the strategies for the ‘Logic of Life’.
2.10 Clone
A clone is an exact copy made of biological material such as a DNA segment, a whole cell or a complete organism. The process of creating a clone is called cloning.
2.11 Model Organism
Saccharomyces cerevisiae, commonly known as baker’s yeast, has emerged as a model organism. It has demonstrated the fundamental conservation of the basic informational pathways found in almost all organisms. From the detailed study of the genomes of such organisms (which is possible today), we can gain an insight into their functioning. All this data will lead to fundamental insights into human biology.
The vast amount of genetic data available on this species provides important clues for the ongoing research on human genetics. Saccharomyces cerevisiae has become the workhorse of many biotechnology labs. It can exist either in a haploid or a diploid state and divides by the vegetative process of budding. Yeast cultures can be easily propagated in labs. It has become the model organism partly because of the ease with which genetic manipulations can be carried out. Random mutations can be induced into the genome by treating live cells with chemicals such as ethyl methanesulfonate or by exposure to ultraviolet rays. Targeted gene inactivations can also be carried out; this property is very important during experiments for the unambiguous assignment of gene functions.
Saccharomyces cerevisiae has a compact genome of about 12 million base pairs of DNA present on 16 chromosomes. This presented a reasonable goal for complete sequencing and analysis of its genome. The Saccharomyces Genome Database (SGD) was established at Stanford University in 1995.
Knowing the complete sequence of a genome is only the first step in understanding how the huge amount of information contained in genes is translated into functional proteins.
Chapter 3
Role of Software Engineers and Technology in Biotechnology
The tools of computer science, statistics and mathematics are critical for studying biology as an informational science. Curiously, biology is the only science that, at its very heart, employs a digital language. The grand challenge in biology is to determine how the digital language of the chromosomes is converted into the 3-D and 4-D (time-varying) languages of living organisms.
3.1 Need for software automation
DNA encodes the information necessary for building and maintaining life. DNA is a non-branching, double-stranded macromolecule in which the nucleotide building blocks (A, C, G, T) are linked. Bases are arranged in A-T and C-G pairs. Small viral genomes of the order of several thousand bases were the first to be sequenced, in the 1970s. A few years later, genomes of the order of 40 kilobase pairs represented the limit of what could reasonably be sequenced. At this stage, the need for automation was recognized and methods were applied to the degree possible. By the year 1997, the yeast genome consisting of 12 megabase pairs was completed, and in 1998, the conclusion of the 100 megabase pair nematode genome project was announced. Most recently, the 180 megabase pair fruit-fly genome was also completed. All of these projects relied on substantially higher levels of software automation. We are now in the midst of the most ambitious project so far: sequencing of the 3 gigabase pair human genome. For this effort, and those yet to come, software automation lies at the very core of the planning and execution of the project.
The need for automation is driven largely by the trend of handling ever larger stretches of DNA and the corresponding increase in the amount of raw data this entails. Mathematical analysis indicates that the size of a project is roughly proportional to the size of the genome. This is because the amount of information obtained from an individual sequencing experiment is relatively constant and is independent of the genome size. It is estimated that for the human genome, on the order of 10⁸ individual experiments are required to cover the genome. To meet the projected goals, modern large-scale sequencing centers have developed throughput capacities of the order of several million experiments per month, with data processing handled on a continuous basis. Managing such large projects without a high degree of automation would clearly be impossible in terms of cost and time requirements.
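The proportionality argument can be made concrete with a back-of-the-envelope estimate. In the Python sketch below, the read length and coverage values are assumptions chosen purely for illustration; with figures of this order, a mammalian genome does indeed call for tens of millions to roughly 10⁸ individual experiments.

# Rough estimate of how many sequencing experiments (reads) a shotgun project needs.
# Because each read yields a roughly constant amount of sequence, the read count
# scales linearly with genome size.  Read length and coverage are assumed values.

def reads_needed(genome_size_bp: float, read_length_bp: float = 500, coverage: float = 8) -> float:
    """Reads required to cover the genome `coverage` times over."""
    return coverage * genome_size_bp / read_length_bp

for name, size in [("yeast", 12e6), ("nematode", 100e6), ("human", 3e9)]:
    print(f"{name:>8}: ~{reads_needed(size):.1e} reads")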
So, DNA is the basic genetic material. It transmits hereditary characters from one generation to the next. During the synthesis of proteins, mRNA molecules, which act as messengers of information (carrying the exact genetic code), are built from DNA. Proteins are synthesized using these mRNA molecules. Protein interactions give rise to information pathways and networks, which help in building cells that are identical to their parent cells. A cluster of many cells in a predefined format composes a tissue. An organ is a combination of tissues, and an organism is nothing but an organization of organs. Refer to figure 3.1.
The challenge for computer professionals is to create tools that can capture and integrate these different levels of biological information.
3.2 Genetic Algorithms
All that computers can do is implement algorithms. Hence when we talk of using computers for processing of biological information, we have to define precise mathematical algorithms. Following are a few absolutely basic algorithms in Bioinformatics.
3.2.1 Database Searching
Database interrogation can take the form of text queries (e.g. Display all the human adrenergic receptors) or sequence similarity searches (e.g. Given the sequence of a human adrenergic receptor, display all the similar sequences in the database). Sequence similarity searches are straightforward because the data in the databases is mostly in the form of sequences.
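Both forms of interrogation can be caricatured in a few lines of Python. The records, accessions and sequences below are invented for the example, and the "similarity" measure is deliberately crude; real databases use indexed search and statistically grounded alignment scores.

# Toy illustration of database interrogation: a text query over annotations,
# and a crude similarity query over the stored sequences themselves.
# The records below are invented for the example.

DATABASE = {
    "P001": {"description": "human adrenergic receptor (fictional entry)", "sequence": "MKTWLLVAGLF"},
    "P002": {"description": "yeast kinase (fictional entry)",              "sequence": "MSTNPKPQRKT"},
}

def text_query(keyword: str) -> list[str]:
    """Return accession numbers whose description mentions the keyword."""
    return [acc for acc, rec in DATABASE.items() if keyword.lower() in rec["description"].lower()]

def similarity_query(query_seq: str, min_shared: int = 5) -> list[str]:
    """Very crude similarity search: count identical residues at aligned positions."""
    hits = []
    for acc, rec in DATABASE.items():
        shared = sum(a == b for a, b in zip(query_seq, rec["sequence"]))
        if shared >= min_shared:
            hits.append(acc)
    return hits

print(text_query("adrenergic"))            # ['P001']
print(similarity_query("MKTWLLVAGLA"))     # ['P001'] under these toy data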
3.2.2 Comparing Two Sequences
Let us take the case of comparing two protein sequences. The alphabet complexity is 20, since a protein is a sequence of amino acids and there are 20 possible amino acids. The naïve approach is to line up the sequences against each other and insert additional characters to bring the two strings into vertical alignment. The more matches there are, the more closely related the two sequences.
The process of alignment can be measured in terms of the number of gaps introduced and the number of mismatches remaining in the alignment. A metric relating such parameters represents the distance between two sequences.
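One of the simplest such metrics charges a unit cost for every gap and every mismatch and can be computed by dynamic programming. The Python sketch below is a minimal illustration of this idea; the unit costs are assumptions rather than parameters of any particular alignment tool, and the example sequences are arbitrary.

# Minimal edit-style distance between two sequences: each gap (insertion/deletion)
# and each mismatch costs 1, matches cost 0.  This is the simplest metric of the
# kind described above; real alignment tools use richer scoring schemes.

def alignment_distance(a: str, b: str, gap: int = 1, mismatch: int = 1) -> int:
    rows, cols = len(a) + 1, len(b) + 1
    # dist[i][j] = cost of aligning a[:i] with b[:j]
    dist = [[0] * cols for _ in range(rows)]
    for i in range(rows):
        dist[i][0] = i * gap
    for j in range(cols):
        dist[0][j] = j * gap
    for i in range(1, rows):
        for j in range(1, cols):
            sub = 0 if a[i - 1] == b[j - 1] else mismatch
            dist[i][j] = min(dist[i - 1][j] + gap,        # gap in b
                             dist[i][j - 1] + gap,        # gap in a
                             dist[i - 1][j - 1] + sub)    # match / mismatch
    return dist[-1][-1]

print(alignment_distance("HEAGAWGHEE", "PAWHEAE"))   # small toy protein fragments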
3.2.3 Multiple Sequence Alignment
In the previous subsection, we saw pairwise sequence alignment, which is fundamental to sequence analysis. However, analysis of groups of sequences that form gene families requires the ability to make connections between more than two members of the group, in order to reveal subtle conserved family characteristics. The goal of multiple sequence alignment is to generate a concise, information-rich summary of sequence data in order to inform decision-making on the relatedness of sequences to a gene family.
Multiple sequence alignment is a 2D table, in which the rows represent individual sequences and the columns the residue positions. The sequences are laid onto this grid in such a manner that (a) the relative positioning of residues within any one sequence is preserved, and (b) similar residues in all the sequences are brought into vertical register.
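The grid view lends itself to a simple per-column summary of conservation. In the Python sketch below, the three short sequences are invented and assumed to be already aligned; real multiple-alignment programs must also construct the alignment itself, which is the hard part.

# A multiple alignment viewed as a grid: rows are sequences, columns are positions.
# For each column we report the most common residue and its frequency, a crude
# measure of conservation.  The aligned sequences below are invented.

from collections import Counter

alignment = [
    "MK-TAYIAKQR",
    "MKLTAY-AKQR",
    "MK-TAYVAKQR",
]

for col in range(len(alignment[0])):
    column = [seq[col] for seq in alignment]
    residue, count = Counter(column).most_common(1)[0]
    print(f"column {col:2d}: {''.join(column)}  consensus {residue} ({count}/{len(column)})")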
3.3 Genome Projects
In the mid-1980s, the United States Department of Energy initiated a number of projects to construct detailed genetic and physical maps of the human genome, to determine its complete nucleotide sequence, and to localize its estimated 100,000 genes. Work on this scale required the development of new computational methods for analysing genetic map and DNA sequence data, and demanded the design of new techniques and instrumentation for detecting and analysing DNA. To benefit the public most effectively, the projects also necessitated the use of advanced means of information dissemination in order to make the results available as rapidly as possible to scientists and physicians. The international effort arising from this vast initiative became known as the Human Genome Project.
Similar research efforts were also launched to map and sequence the genomes of a variety of organisms used extensively in research labs as model systems. By April 1998, although only a small number of relatively small genomes had been completely sequenced, and the human genome was not expected to be complete until after the year 2003, the results of such projects were already beginning to pour into the public sequence databases in overwhelming numbers.
3.4 Goals for Advancements in Sequencing Technology
DNA sequencing technology has improved dramatically since the genome projects began. The amount of sequence produced each year is increasing steadily; individual centers are now producing tens of millions of base pairs of sequence annually. In the future, de novo sequencing of additional genomes, comparative sequencing of closely related genomes, and sequencing to assess variation within genomes will become increasingly indispensable tools for biological and medical research. Much more efficient sequencing technology will be needed than is currently available. The incremental improvements made to date have not yet resulted in any fundamental paradigm shifts. Nevertheless, the current state-of-the-art technology can still be significantly improved, and resources should be invested to accomplish this. Beyond that, research must be supported on new technologies that will make even higher throughput DNA sequencing efficient, accurate, and cost-effective, thus providing the foundation for other advanced genomic analysis tools. Progress must be achieved in three areas:
a) Continue to increase the throughput and reduce the cost of current sequencing technology.
Increased automation, miniaturization, and integration of the approaches currently in use, together with incremental, evolutionary improvements in all steps of the sequencing process, are needed to yield further increases in throughput (to at least 500 Mb of finished sequence per year by 2003) and reductions in cost. At least a twofold cost reduction from current levels (which average $0.50 per base for finished sequence in large-scale centers) should be achieved in the next 5 years. Production of the working draft of the human sequence will cost considerably less per base pair.
b) Support research on novel technologies that can lead to significant improvements in sequencing technology.
New conceptual approaches to DNA sequencing must be supported to attain substantial improvements over the current sequencing paradigm. For example, microelectromechanical systems (MEMS) may allow significant reduction of reagent use, increase in assay speed, and true integration of sequencing functions. Rapid mass spectrometric analysis methods are achieving impressive results in DNA fragment identification and offer the potential for very rapid DNA sequencing. Other more revolutionary approaches, such as single-molecule sequencing methods, must be explored as well. Significant investment in interdisciplinary research in instrumentation, combining chemistry, physics, biology, computer science, and engineering, will be required to meet this goal. Funding of far-sighted projects that may require 5 to 10 years to reach fruition will be essential. Ultimately, technologies that could, for example, sequence one vertebrate genome per year at affordable cost are highly desirable.
c) Develop effective methods for the advanced development and introduction of new sequencing technologies into the sequencing process.
As the scale of sequencing increases, the introduction of improvements into the production stream becomes more challenging and costly. New technology must therefore be robust and be carefully evaluated and validated in a high-throughput environment before its implementation in a production setting. A strong commitment from both the technology developers and the technology users is essential in this process. It must be recognized that the advanced development process will often require significantly more funds than proof-of-principle studies. Targeted funding allocations and dedicated review mechanisms are needed for advanced technology development.
3.5 Developing Technology to handle Sequence Variations
Natural sequence variation is a fundamental property of all genomes. Any two haploid human genomes show multiple sites and types of polymorphism. Some of these have functional implications, whereas many probably do not. The most common polymorphisms in the human genome are single base-pair differences, also called single-nucleotide polymorphisms (SNPs). When two haploid genomes are compared, SNPs occur every kilobase, on average. Other kinds of sequence variation, such as copy number changes, insertions, deletions, duplications, and rearrangements also exist, but at low frequency and their distribution is poorly understood. Basic information about the types, frequencies, and distribution of polymorphisms in the human genome and in human populations is critical for progress in human genetics. Better high-throughput methods for using such information in the study of human disease are also needed.
SNPs are abundant, stable, widely distributed across the genome, and lend themselves to automated analysis on a very large scale, for example, with DNA array technologies. Because of these properties, SNPs will be a boon for mapping complex traits such as cancer, diabetes, and mental illness. Dense maps of SNPs will make possible genome-wide association studies, which are a powerful method for identifying genes that make a small contribution to disease risk. In some instances, such maps will also permit prediction of individual differences in drug response. Publicly available maps of large numbers of SNPs distributed across the whole genome, together with technology for rapid, large-scale identification and scoring of SNPs, must be developed to facilitate this research.
a) Develop technologies for rapid, large-scale identification of SNPs and other DNA sequence variants. The study of sequence variation requires efficient technologies that can be used on a large scale and that can accomplish tasks such as the rapid identification of many thousands of new SNPs in large numbers of samples. Although the immediate emphasis is on SNPs, ultimately technologies that can be applied to polymorphisms of any type must be developed. Technologies are also needed that can rapidly compare, by large-scale identification of similarities and differences, the DNA of a species that is closely related to one whose DNA has already been sequenced. The technologies that are developed should be cost-effective and broadly accessible.
b) Identify common variants in the coding regions of the majority of identified genes. Initially, association studies involving complex diseases will likely test a large series of candidate genes; eventually, sequences in all genes may be systematically tested. SNPs in coding sequences (also known as cSNPs) and the associated regulatory regions will be immediately useful as specific markers for disease. An effort should be made to identify such SNPs as soon as possible. Ultimately, a catalog of all common variants in all genes will be desirable. This should be cross-referenced with cDNA sequence data.
c) Create an SNP map of at least 100,000 markers. A publicly available SNP map of sufficient density and informativeness to allow effective mapping in any population is the ultimate goal. A map of 100,000 SNPs (one SNP per 30,000 nucleotides) is likely to be sufficient for studies in some relatively homogeneous populations, while denser maps may be required for studies in large, heterogeneous populations. Thus, during this 5-year period, the HGP authorities have planned to create a map of at least 100,000 SNPs. If technological advances permit, a map of greater density is desirable. Research should be initiated to estimate the number of SNPs needed in different populations.
d) Develop the intellectual foundations for studies of sequence variation. The methods and concepts developed for the study of single-gene disorders are not sufficient for the study of complex, multigene traits. The study of the relationship between human DNA sequence variation, phenotypic variation, and complex diseases depends critically on better methods. Effective research design and analysis of linkage, linkage disequilibrium, and association data are areas that need new insights. Questions such as which study designs are appropriate to which specific populations, and with which population genetics characteristics, must be answered. Appropriate statistical and computational tools and rigorous criteria for establishing and confirming associations must also be developed.
e) Create public resources of DNA samples and cell lines. To facilitate SNP discovery it is critical that common public resources of DNA samples and cell lines be made available as rapidly as possible. To maximize discovery of common variants in all human populations, a resource is needed that includes individuals whose ancestors derive from diverse geographic areas. It should encompass as much of the diversity found in the human population as possible. Samples in this initial public repository should be totally anonymous to avoid concerns that arise with linked or identifiable samples.
DNA samples linked to phenotypic data and identified as to their geographic and other origins will be needed to allow studies of the frequency and distribution of DNA polymorphisms in specific populations and their relevance to disease. However, such collections raise many ethical, legal, and social concerns that must be addressed. Credible scientific strategies must be developed before creating these resources.
3.6 Need of Technology in Functional Genomics
Functional genomics is the interpretation of the function of DNA sequence on a genomic scale. Already, the availability of the sequence of entire organisms has demonstrated that many genes and other functional elements of the genome are discovered only when the full DNA sequence is known. Such discoveries will accelerate as sequence data accumulate. However, knowing the structure of a gene or other element is only part of the answer. The next step is to elucidate function, which results from the interaction of genomes with their environment. Current methods for studying DNA function on a genomic scale include comparison and analysis of sequence patterns directly to infer function, large-scale analysis of the messenger RNA and protein products of genes, and various approaches to gene disruption. In the future, a host of novel strategies will be needed for elucidating genomic function. This will be a challenge for all of biology. The HGP will be contributing to this area by emphasizing the development of technology that can be used on a large scale, is efficient, and is capable of generating complete data for the genome as a whole. To the extent that available resources allow, expansion of current approaches as well as innovative technology ideas should be supported in the areas described below.
a) Develop cDNA resources. Complete sets of full-length cDNA clones and sequences for both humans and model organisms would be enormously useful for biologists and are urgently needed. Such resources would help in both gene discovery and functional analysis. High priority should be placed on developing technology for obtaining full-length cDNAs. Complete and validated inventories of full-length cDNA clones and corresponding sequences should be generated and made available to the community once such technology is at hand.
b) Improved technologies are needed for global approaches to the study of non-protein-coding sequences, including production of relevant libraries, comparative sequencing, and computational analysis.
c) Develop technology for comprehensive analysis of gene expression. Information about the spatial and temporal patterns of gene expression in both humans and model organisms offers one key to understanding gene expression. Efficient and cost-effective technology needs to be developed to measure various parameters of gene expression reliably and reproducibly. Complementary DNA sequences and validated sets of clones with unique identifiers will be needed for array technologies, large-scale in situ hybridization, and other strategies for measuring gene expression. Improved methods for quantifying, representing, analyzing, and archiving expression data should also be developed.
d) Improve methods for genome-wide mutagenesis. Creating mutations that cause loss or alteration of function is another prime approach to studying gene function. Technologies, both gene- and phenotype-based, which can be used on a large scale in vivo or in vitro, are needed for generating or finding such mutations in all genes. Such technologies should be piloted in appropriate model systems, including both cell culture and whole organisms.
e) Develop technology for global protein analysis. A full understanding of genome function requires an understanding of protein function on a genome-wide basis. Development of experimental and computational methods to study global spatial and temporal patterns of protein expression, protein-ligand interactions, and protein modification needs to be supported.
3.7 Bioinformatics and Computational Biology
Bioinformatics support is essential to the implementation of genome projects and for public access to their output. Bioinformatics needs for the genome project fall into two broad areas: (i) databases and (ii) development of analytical tools. Collection, analysis, annotation, and storage of the ever increasing amounts of mapping, sequencing, and expression data in publicly accessible, user-friendly databases is critical to the project's success. In addition, the community needs computational methods that will allow scientists to extract, view, annotate, and analyze genomic information efficiently. Thus, the genome project must continue to invest substantially in these areas. Conservation of resources through development of portable software should be encouraged.
a) Improve content and utility of databases. Databases are the ultimate repository of the genome project’s data. As new kinds of data are generated and new biological relationships discovered, databases must provide for continuous and rapid expansion and adaptation to the evolving needs of the scientific community. To encourage broad use, databases should be responsive to a diverse range of users with respect to data display, data deposition, data access, and data analysis. Databases should be structured to allow the queries of greatest interest to the community to be answered in a seamless way. Communication among databases must be improved; achieving this will require standardization of nomenclature. A database of human genomic information, analogous to the model organism databases and including links to many types of phenotypic information, is needed.
b) Develop better tools for data generation, capture, and annotation. Large-scale, high-throughput genomics centers need readily available, transportable informatics tools for commonly performed tasks such as sample tracking, process management, map generation, sequence finishing, and primary annotation of data. Smaller users urgently need reliable tools to meet their sequencing and sequence analysis needs. Readily accessible information about the availability and utility of various tools should be provided, as well as training in the use of tools.
c) Develop and improve tools and databases for comprehensive functional studies. Massive amounts of data on gene expression and function will be generated in the near future. Databases that can organize and display this data in useful ways need to be developed. New statistical and mathematical methods are needed for analysis and comparison of expression and function data, in a variety of cells and tissues, at various times and under different conditions. Also needed are tools for modeling complex networks and interactions.
d) Develop and improve tools for representing and analyzing sequence similarity and variation. The study of sequence similarity and variation within and among species will become an increasingly important approach to biological problems. There will be many forms of sequence variation, of which SNPs will be only one type. Tools need to be created for capturing, displaying, and analyzing information about sequence variation.
e) Create mechanisms to support effective approaches for producing robust, exportable software that can be widely shared. Many useful software products are being developed in both academia and industry that could be of great benefit to the community. However, these tools generally are not robust enough to make them easily exportable to another laboratory. Mechanisms are needed for supporting the validation and development of such tools into products that can be readily shared and for providing training in the use of these products. Participation by the private sector is strongly encouraged.
3.8 Job Opportunities and Job Requirements
The Human Genome Project has created the need for new kinds of scientific specialists who can be creative at the interface of biology and other disciplines, such as computer science, engineering, mathematics, physics, chemistry, and the social sciences. As the popularity of genomic research increases, the demand for these specialists greatly exceeds the supply. In the past, the genome project has benefited immensely from the talents of non-biological scientists, and their participation in the future is likely to be even more crucial. There is an urgent need to train more scientists in interdisciplinary areas that can contribute to genomics. Programs must be developed that will encourage training of both biological and non-biological scientists for careers in genomics. Especially critical is the shortage of individuals trained in Bioinformatics. Also needed are scientists trained in the management skills required to lead large data-production efforts. Another urgent need is for scholars who are trained to undertake studies on the societal impact of genetic discoveries. Such scholars should be knowledgeable in both genome-related sciences and in the social sciences. Ultimately, a stable academic environment for genomic science must be created so that innovative research can be nurtured and training of new individuals can be assured. The latter is the responsibility of the academic sector, but funding agencies can encourage it through their grants programs.
3.9 Training Goals included in the Human Genome Project Plan
a) Nurture the training of scientists skilled in genomics research.
A number of approaches to training for genomics research should be explored. These include providing fellowship and career awards and encouraging the development of institutional training programs and curricula. Training that will facilitate collaboration among scientists from different disciplines, as well as courses that introduce scientists to new technologies or approaches, should also be included.
b) Encourage the establishment of academic career paths for genomic scientists.
Ultimately, a strong academic presence for genomic science is needed to generate the training environment that will encourage individuals to enter the field. Currently, the high demand for genome scientists in industry threatens the retention of genome scientists in academia. Attractive incentives must be developed to maintain the critical mass essential for sponsoring the training of the next generation of genome scientists.
c) Increase the number of scholars who are knowledgeable in both genomic and genetic sciences and in ethics, law, or the social sciences.
As the pace of genetic discoveries increases, the need for individuals who have the necessary training to study the social impact of these discoveries also increases. The ELSI program should expand its efforts to provide postdoctoral and senior fellowship opportunities for cross-training. Such opportunities should be provided both to scientists and health professionals who wish to obtain training in the social sciences and humanities and to scholars trained in law, the social sciences, or the humanities who wish to obtain training in genomic or genetic sciences.
Chapter 4
Human Genome Project
4.1 Introduction
Begun in 1990, the U.S. Human Genome Project is a 13-year effort coordinated by the Department of Energy and the National Institutes of Health. The project was originally planned to last 15 years, but effective use of resources and technological advances have accelerated the expected completion date to 2003. Project goals are to
· identify all the approximately 30,000 genes in human DNA,
· determine the sequences of the 3 billion chemical base pairs that make up human DNA,
· store this information in databases,
· improve tools for data analysis,
· transfer related technologies to the private sector, and
· address the ethical, legal, and social issues that may arise from the project.
4.2 Details of the Human Genome Project
The Human Genome Project (HGP) is fulfilling its promise as the single most important project in biology and the biomedical sciences--one that will permanently change biology and medicine. With the recent completion of the genome sequences of several microorganisms, including Escherichia coli and Saccharomyces cerevisiae, and the imminent completion of the sequence of the metazoan Caenorhabditis elegans, the door has opened wide on the era of whole genome science. The ability to analyze entire genomes is accelerating gene discovery and revolutionizing the breadth and depth of biological questions that can be addressed in model organisms. These exciting successes confirm the view that acquisition of a comprehensive, high-quality human genome sequence will have unprecedented impact and long-lasting value for basic biology, biomedical research, biotechnology, and health care. The transition to sequence-based biology will spur continued progress in understanding gene-environment interactions and in development of highly accurate DNA-based medical diagnostics and therapeutics.
Human DNA sequencing, the flagship endeavor of the HGP, is entering its decisive phase. It will be the project's central focus during the next 5 years. While partial subsets of the DNA sequence, such as expressed sequence tags (ESTs), have proven enormously valuable, experience with simpler organisms confirms that there can be no substitute for the complete genome sequence. In order to move vigorously toward this goal, the crucial task ahead is building sustainable capacity for producing publicly available DNA sequence. The full and incisive use of the human sequence, including comparisons to other vertebrate genomes, will require further increases in sustainable capacity at high accuracy and lower costs. Thus, a high-priority commitment to develop and deploy new and improved sequencing technologies must also be made.
Availability of the human genome sequence presents unique scientific opportunities, chief among them the study of natural genetic variation in humans. Genetic or DNA sequence variation is the fundamental raw material for evolution. Importantly, it is also the basis for variations in risk among individuals for numerous medically important, genetically complex human diseases. An understanding of the relationship between genetic variation and disease risk promises to change significantly the future prevention and treatment of illness. The new focus on genetic variation, as well as other applications of the human genome sequence, raises additional ethical, legal, and social issues that need to be anticipated, considered, and resolved.
The HGP has made genome research a central underpinning of biomedical research. It is essential that it continue to play a lead role in catalyzing large-scale studies of the structure and function of genes, particularly in functional analysis of the genome as a whole. However, full implementation of such methods is a much broader challenge and will ultimately be the responsibility of the entire biomedical research and funding communities.
Success of the HGP critically depends on Bioinformatics and computational biology as well as training of scientists to be skilled in the genome sciences. The project must continue a strong commitment to support of these areas.
As intended, the HGP has become a truly international effort to understand the structure and function of the human genome. Many countries are participating according to their specific interests and capabilities. Coordination is informal and generally effected at the scientist-to-scientist level. The U.S. component of the project is sponsored by the National Human Genome Research Institute at the National Institutes of Health (NIH) and the Office of Biological and Environmental Research at the Department of Energy (DOE). The HGP has benefited greatly from the contributions of its international partners. The private sector has also provided critical assistance. These collaborations will continue, and many will expand. Both NIH and DOE welcome participation of all interested parties in the accomplishment of the HGP's ultimate purpose, which is to develop and make publicly available to the international community the genomic resources that will expedite research to improve the lives of all people.
4.3 U.S. Human Genome Project 5-Year Goals 1998-2003
4.3.1 Human DNA Sequencing
Providing a complete, high-quality sequence of human genomic DNA to the research community as a publicly available resource continues to be the HGP's highest priority goal. The enormous value of the human genome sequence to scientists, and the considerable savings in research costs its widespread availability will allow, are compelling arguments for advancing the timetable for completion. Recent technological developments and experience with large-scale sequencing provide increasing confidence that it will be possible to complete an accurate, high-quality sequence of the human genome by the end of 2003, 2 years sooner than previously predicted. NIH and DOE expect to contribute 60 to 70% of this sequence, with the remainder coming from the effort at the Sanger Center and other international partners.
This is a highly ambitious goal, given that only about 6% of the human genome sequence has been completed thus far. Sequence completion by the end of 2003 is a major challenge, but within reach and well worth the risks and effort. Realizing the goal will require an intense and dedicated effort and a continuation and expansion of the collaborative spirit of the international sequencing community. Only sequence of high accuracy and long-range contiguity will allow a full interpretation of all the information encoded in the human genome.
Availability of the human sequence will not end the need for large-scale sequencing. Full interpretation of that sequence will require much more sequence information from many other organisms, as well as information about sequence variation in humans. Thus, the development of sustainable, long-term sequencing capacity is a critical objective of the HGP. Achieving the goals below will require a capacity of at least 500 megabases (Mb) of finished sequence per year by the end of 2003.
a) Finish the complete human genome sequence by the end of 2003.
To best meet the needs of the scientific community, the finished human DNA sequence must be a faithful representation of the genome, with high base-pair accuracy and long-range contiguity. Specific quality standards that balance cost and utility have already been established. These quality standards should be reexamined periodically; as experience in using sequence data is gained, the appropriate standards for sequence quality may change. One of the most important uses for the human sequence will be comparison with other human and nonhuman sequences. The sequence differences identified in such comparisons should, in nearly all cases, reflect real biological differences rather than errors or incomplete sequence. Consequently, the current standard for accuracy--an error rate of no more than 1 base in 10,000--remains appropriate.
The current public sequencing strategy is based on mapped clones and occurs in two phases. The first, or "shotgun" phase, involves random determination of most of the sequence from a mapped clone of interest. Methods for doing this are now highly automated and efficient. Mapped shotgun data are assembled into a product ("working draft" sequence) that covers most of the region of interest but may still contain gaps and ambiguities. In the second, finishing phase, the gaps are filled and discrepancies resolved. At present, the finishing phase is more labor intensive than the shotgun phase. Already, partially finished, working-draft sequence is accumulating in public databases at about twice the rate of finished sequence.
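The core operation of the shotgun phase, merging reads that overlap, can be caricatured in a few lines. The Python sketch below is a toy illustration only: it merges two invented reads by their longest exact overlap, whereas real assemblers must cope with sequencing errors, repeats, and millions of reads.

# Toy illustration of the shotgun idea: two overlapping reads from the same clone
# are merged wherever the end of one matches the start of the other.  Real
# assemblers handle errors, repeats, and huge read sets; this sketch does not.

def merge(read_a: str, read_b: str, min_overlap: int = 4):
    for k in range(min(len(read_a), len(read_b)), min_overlap - 1, -1):
        if read_a.endswith(read_b[:k]):
            return read_a + read_b[k:]
    return None   # no sufficient overlap; a gap remains for the finishing phase

print(merge("ATGGCGTACGT", "TACGTTAGGC"))   # ATGGCGTACGTTAGGC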
b) Make the sequence totally and freely accessible.
The HGP was initiated because its proponents believed the human sequence is such a precious scientific resource that it must be made totally and publicly available to all who want to use it. Only the wide availability of this unique resource will maximally stimulate the research that will eventually improve human health.
4.3.2 Sequencing Technology
Create a long-term, sustainable sequencing capacity by improving current technology and developing highly efficient novel technologies. Achieving this HGP goal will require current sequencing capacity to be expanded 2-3 times, demanding further incremental advances in standard sequencing technologies and improvements in efficiency and cost. For future sequencing applications, planners emphasize the importance of supporting novel technologies that may be 5-10 years in development.
4.3.3 Sequence Variation
Develop technologies for rapid identification of DNA sequence variants. A new priority for the HGP is examining regions of natural variation that occur among genomes (except those of identical twins). Goals specify development of methods to detect different types of variation, particularly the most common type called single nucleotide polymorphisms (SNPs) that occur about once every 1000 bases. Scientists believe SNP maps will help them identify genes associated with complex diseases such as cancer, diabetes, vascular disease, and some forms of mental illness. These associations are difficult to make using conventional gene hunting methods because any individual gene may make only a small contribution to disease risk. DNA sequence variations also underlie many individual differences in responses to the environment and treatments.
4.3.4 Functional Genomics
Expand support for current approaches and innovative technologies. Efficient interpretation of the functions of human genes and other DNA sequences requires developing the resources and strategies to enable large-scale investigations across whole genomes. A technically challenging first priority is to generate complete sets of full-length cDNA clones and sequences for human and model organism genes. Other functional genomics goals include studies into gene expression and control, creation of mutations that cause loss or alteration of function in nonhuman organisms, and development of experimental and computational methods for protein analyses.
4.3.5 Comparative Genomics
Obtain complete genomic sequences for C. elegans (1998), Drosophila (2002), and mouse (2008). A first clue toward identifying and understanding the functions of human genes or other DNA regions is often obtained by studying their parallels in nonhuman genomes. To enable efficient comparisons, complete genomic sequences already have been obtained for the bacterium E. coli and the yeast S. cerevisiae, and work continues on sequencing the genomes of the roundworm, fruit fly, and mouse. Planners note that other genomes will need to be sequenced to realize the full promise of comparative genomics, stressing the need to build a sustainable sequencing capacity.
4.3.6 Ethical, Legal, and Social Implications (ELSI)
· Analyze and address implications of identifying DNA sequence information for individuals, families, and communities.
· Facilitate safe and effective integration of genetic technologies.
· Facilitate education about genomics in nonclinical and research settings.
Rapid advances in genetics and applications present new and complex ethical and policy issues for individuals and society. ELSI programs that identify and address these implications have been an integral part of the US HGP since its inception. These programs have resulted in a body of work that promotes education and helps guide the conduct of genetic research and the development of related health professional and public policies. Continuing and new challenges include safeguarding the privacy of individuals and groups who contribute samples for large-scale sequence variation studies; anticipating how resulting data may affect concepts of race and ethnicity; identifying how genetic data could potentially be used in workplaces, schools, and courts; commercial uses; and the impact of genetic advances on concepts of humanity and personal responsibility.
4.3.7 Bioinformatics and Computational Biology
Improve current databases and develop new databases and better tools for data generation and capture and comprehensive functional studies. Continued investment in current and new databases and analytical tools is critical to the success of the Human Genome Project and to the future usefulness of the data. Databases must be structured to adapt to the evolving needs of the scientific community and allow queries to be answered easily. Planners suggest developing a human genome database analogous to model organism databases with links to phenotypic information. Also needed are databases and analytical tools for the expanding body of gene expression and function data, for modeling complex biological networks and interactions, and for collecting and analyzing sequence variation data.
4.3.8 Training
Nurture the training of genomic scientists and establish career paths.
Increase the number of scholars knowledgeable in genomics and ethics, law, or the social sciences. Planners note that future genomics scientists will require training in interdisciplinary areas that include biology, computer science, engineering, mathematics, physics, and chemistry. Additionally, scientists with management skills will be needed for leading large data-production efforts.
Chapter 5
Biological Databases
5.1 The Biological Sequence/Structure Deficit
At the beginning of 1998, more than 300,000 protein sequences had been deposited in publicly available, non-redundant databases, and the number of partial sequences in public and proprietary expressed sequence tag (EST) databases is estimated to run into millions. By contrast, the number of unique 3D structures in the Protein Data Bank (PDB) was less than 1,500. Although structural information is far more complex to derive, store and manipulate than are sequence data, these figures nevertheless highlight an enormous information deficit. This situation is likely to get worse as the genome projects around the world begin to bear fruit. Of course, the acquisition of structural data is also accelerating, and a future large-scale structure determination enterprise could conceivably furnish 2,000 3D structures annually. But this is a small yield by comparison with that of the sequence databases, which are doubling in size every year, with a new sequence being added, on average, once a minute.
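To make the scale of this deficit concrete, the back-of-the-envelope projection below (a rough Python sketch, using the figures quoted above as starting points and assuming the stated growth rates simply continue) shows how quickly sequence data outpace structural data:

# Figures quoted above, used as rough starting points.
sequences, structures = 300_000, 1_500

for year in range(1998, 2004):
    print(f"{year}: ~{sequences:>10,} sequences vs ~{structures:>6,} structures")
    sequences *= 2          # sequence databases roughly doubling each year
    structures += 2_000     # optimistic yield from large-scale structure determination

Even under these crude assumptions, sequence growth rapidly outstrips structure determination, which is precisely what motivates the database and analysis infrastructure discussed next.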
5.2 Biological Databases
If we are to derive the maximum benefit from the deluge of sequence information, we must deal with it in a concerted way; this means establishing, maintaining and disseminating databases; providing easy-to-use software to access the information they contain; and designing state-of-the-art analysis tools to visualize and interpret the structural and functional clues latent in the data.
The first step, then, in analysing sequence information is to assemble it into central shareable resources, i.e. databases. Databases are effectively electronic filing cabinets, a convenient and efficient method of storing vast amounts of information. There are many different database types, depending both on the nature of the information being stored and on the manner of data storage (e.g. whether in flat files, tables in a relational database, or objects in an object-oriented database).
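As a simplified illustration of the flat-file style of storage, the sketch below shows a made-up record loosely modelled on the two-letter line codes used by EMBL/SWISS-PROT-style flat files, together with a few lines of Python that parse it into fields. The field names and layout here are hypothetical, not the real formats.

record = """\
ID   EXAMPLE_HUMAN
DE   Hypothetical example protein
OS   Homo sapiens
SQ   MKTAYIAKQRQISFVKSHFSRQLEERLGLIEVQ
//"""

def parse_flat_file(text):
    """Collect two-letter line codes and their values for one record."""
    entry = {}
    for line in text.splitlines():
        if line.startswith("//"):              # end-of-record marker
            break
        code, _, value = line.partition("   ")
        entry.setdefault(code, []).append(value.strip())
    return entry

print(parse_flat_file(record)["ID"])           # ['EXAMPLE_HUMAN']

In a relational database the same information would instead be split across tables (for example, one row per entry and one row per cross-reference) and retrieved with queries rather than by parsing text.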
In the context of protein sequence analysis, we will encounter primary, composite and secondary databases. Such resources store different levels of information in totally different formats. In the past, this has led to a variety of communication problems, but emerging computer technologies are beginning to provide solutions, allowing seamless, transparent access to disparate, distributed data structures over the internet.
Primary and secondary databases are used to address different aspects of sequence analysis, because they store different levels of protein sequence information.
The primary structure of a protein is its amino acid sequence; these are stored in primary databases as linear alphabets that denote the constituent residues. The secondary structure of a protein corresponds to regions of local regularity, which, in sequence alignments, are often apparent as well conserved motifs; these are stored in secondary databases as patterns. The tertiary structure of a protein arises from the packing of its secondary structure elements which may form discrete domains within a fold, or may give rise to autonomous folding units or modules; complete folds, domains and modules are stored in structure databases as sets of atomic co-ordinates.
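The sketch below illustrates, in Python, how these three levels of information might be represented in practice: the primary structure as a plain string of residues, a secondary-database pattern as a regular expression, and tertiary structure as a list of atomic co-ordinates. The sequence, motif and co-ordinates are invented for illustration only.

import re

# Primary structure: the amino acid sequence as a linear alphabet of residues.
primary = "MKVLACAKHQSTRPAKEL"                 # invented sequence

# Secondary-database view: a conserved motif expressed as a pattern, here the
# PROSITE-like idea "C, any two residues, then H" written as a regular expression.
motif = re.compile(r"C.{2}H")

# Tertiary-structure view: atoms stored as sets of 3D co-ordinates (invented values).
tertiary = [
    ("N",  12.4, 8.1, -3.2),
    ("CA", 13.1, 9.0, -2.5),
    ("C",  14.5, 8.7, -2.1),
]

print(bool(motif.search(primary)))             # True: the motif occurs in the sequence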
5.3 Primary Sequence Databases
In the early 1980s, sequence information started to become more abundant in the scientific literature. Realising this, several laboratories saw that there might be advantages to harvesting and storing these sequences in central repositories. Thus, several primary database projects began to evolve in different parts of the world.
5.3.1 Nucleic acid Sequence Databases
The principal DNA sequence databases are GenBank (USA), EMBL (Europe) and DDBJ (Japan), which exchange data on a daily basis to ensure comprehensive coverage at each of the sites.
EMBL is the nucleotide sequence database from the European Bioinformatics Institute. The rate of growth of DNA databases has been following an exponential trend, with a doubling time of less than a year. EMBL data predominantly (more than 50%) consist of sequences from model organisms.
The DNA Data Bank of Japan (DDBJ) is produced, distributed and maintained by the National Institute of Genetics.
GenBank, the DNA database from the National Center for Biotechnology Information, exchanges data with both EMBL and DDBJ to help ensure comprehensive coverage. The database is split into 17 smaller discrete divisions.
5.3.2 Protein Sequence Databases
PIR, MIPS, SWISS-PROT, and TrEMBL are the major protein sequence databases.
PIR was developed for investigating evolutionary relationships between proteins. In its current form, the database is split into four distinct sections, PIR1-PIR4, which differ in terms of the quality of the data and the level of annotation provided.
MIPS collects and processes sequence data for the tripartite PIR-International Protein Sequence Database Project.
SWISS-PROT is a protein sequence database which endeavors to provide high-level annotations, including descriptions of the function of a protein, the structure of its domains, its post-translational modifications, and so on.
TrEMBL was created as a supplement to SWISS-PROT. It was designed to address the need for a well-structured, SWISS-PROT-like resource that would allow very rapid access to sequence data from the genome projects, without compromising the quality of SWISS-PROT itself by incorporating sequences with insufficient analysis and annotation.
5.4 Composite Protein Sequence Databases
One solution to the problem of the proliferation of primary databases is to compile a composite, i.e. a database that amalgamates a variety of different primary sources. Composite databases render sequence searching much more efficient, because they obviate the need to interrogate multiple resources. The interrogation process is streamlined still further if the composite has been designed to be non-redundant, as this means that the same sequence need not be searched more than once. The choice of different sources and the application of different redundancy criteria have led to the emergence of different composites. The major composite databases are the Non-Redundant Database (NRDB), OWL, MIPSX and SWISS-PROT+TrEMBL.
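A minimal Python sketch of the non-redundancy idea: several primary sources are merged and only the first copy of each identical sequence is kept. The source names, accessions and sequences are placeholders, and real composites apply more elaborate redundancy criteria than exact string identity.

sources = {
    "source_A": {"P1": "MKVLQT", "P2": "MALWMR"},
    "source_B": {"X9": "MALWMR", "X7": "MTEYKL"},    # X9 duplicates P2 exactly
}

def build_composite(sources):
    """Merge primary sources, keeping one copy of each identical sequence."""
    composite, seen = {}, set()
    for db, entries in sources.items():
        for accession, seq in entries.items():
            if seq not in seen:                      # redundancy criterion: exact identity
                seen.add(seq)
                composite[f"{db}:{accession}"] = seq
    return composite

print(build_composite(sources))
# {'source_A:P1': 'MKVLQT', 'source_A:P2': 'MALWMR', 'source_B:X7': 'MTEYKL'}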
5.5 Secondary Databases
Secondary databases contain the fruits of analyses of the sequences in the primary resources. Because there are several different primary databases and a variety of ways of analysing protein sequences, the information housed in each of the secondary resources is different. Designing software tools that can search the different types of data, interpret the range of outputs, and assess the biological significance of the results is not a trivial task. SWISS-PROT has emerged as the most popular primary source and many secondary databases now use it as their basis.
Some of the main secondary resources are listed below (a small sketch of PROSITE-style pattern matching follows the table):

Secondary database    Primary source      Stored information
PROSITE               SWISS-PROT          Regular expressions
Profiles              SWISS-PROT          Weighted matrices
PRINTS                OWL                 Aligned motifs
Pfam                  SWISS-PROT          Hidden Markov Models
BLOCKS                PROSITE/PRINTS      Aligned motifs (blocks)
IDENTIFY              BLOCKS/PRINTS       Fuzzy regular expressions
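Since PROSITE (and IDENTIFY) store their motifs as regular expressions, the sketch below shows, under simple assumptions, how a PROSITE-style pattern might be translated into a Python regular expression and used to scan a sequence. The pattern and sequence are made-up illustrations, and the converter handles only a small subset of the full PROSITE syntax (no anchors and no forbidden-residue braces).

import re

def prosite_to_regex(pattern):
    """Translate a small subset of PROSITE-style notation into a Python regex."""
    regex = ""
    for element in pattern.split("-"):
        element = element.replace("x", ".")                    # x = any residue
        element = element.replace("(", "{").replace(")", "}")  # x(2,4) -> .{2,4}
        regex += element
    return re.compile(regex)

pattern  = "C-x(2,4)-C-x(3)-[LIVMFYWC]"     # pattern written in PROSITE-like notation
sequence = "AAKCPECGKSFSQKAA"               # invented sequence fragment

match = prosite_to_regex(pattern).search(sequence)
print(match.group() if match else "no match")    # CPECGKSF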
5.6 Tertiary Databases
Tertiary databases are derived from the information housed in secondary (pattern) databases (e.g. the BLOCKS and eMOTIF databases, which draw on the data stored within PROSITE and PRINTS). The value of such resources lies in providing a different scoring perspective on the same underlying data, making it possible to diagnose relationships that might be missed using the original implementations.
Chapter 6
Applications of Bioinformatics
A large amount of investment is being made in the field of biotechnology. In this chapter, I review the overall outcomes obtained so far and what is anticipated in the future.
6.1 Application to the Diagnosis and Treatment of Diseases
The miraculous substance that contains all of our genetic instructions, DNA, is rapidly becoming a key to modern medicine. By focusing on the diaphanous and extraordinarily long filaments of DNA that we inherit from our parents, scientists are finding the root causes of dozens of previously mysterious diseases: abnormal genes. These discoveries are allowing researchers to make precise diagnoses and predictions, to design more effective drugs, and to prevent many painful disorders. The new findings also pave the way for the development of the ultimate therapy - substituting a normal gene for a malfunctioning one so as to correct a patient's genetic defect permanently.
Recently, scientists have made spectacular progress against two fatal genetic diseases of children, cystic fibrosis and Duchenne muscular dystrophy. In addition, they have identified the genetic flaws that predispose people to more widespread, though still poorly understood ailments - various forms of heart disease, breast and colon cancer, diabetes, arthritis - which are not usually thought of as genetic in origin.
While many of the researchers who are exploring our genetic wilderness want to find the sources of the nearly 4,000 disorders caused by defects in single genes, others have an even broader goal: They hope to locate and map all of the 50,000 to 100,000 genes on our chromosomes. This map of our complete biological inheritance "the marvelous message, evolved for 3 billion years or more, which gives rise to each one of us," as Robert Sinsheimer of the University of California, Santa Barbara, calls it - will guide biological research for years to come. And it will radically simplify the search for the genetic flaws that cause disease.
Once scientists have identified such a flaw, they need to understand just how it produces a particular illness. They must determine the normal gene's function in human cells: What kind of protein does it instruct the cells to make, in what quantities, at what times, and in what specific places? Then the researchers can ask whether the genetic flaw results in too little protein, the wrong kind of protein, or no protein at all - and how best to counteract the effects of this failure.
For most genetic disorders, researchers are still at the very beginning of the trail. They have no clues to the DNA error that causes a disease, and they are still trying to find large families whose DNA patterns can help them track it down.
By contrast, scientists who work on cystic fibrosis and a few other diseases have covered much of the trail. They have already succeeded in correcting the gene defect inside living human cells by inserting healthy genes into these cells in a laboratory dish - an achievement that may lead to gene therapy.
The farther scientists go along the trail, the broader the implications of their findings. For example, the discovery of the gene defect that causes Duchenne muscular dystrophy, a muscle-wasting disease, led scientists to identify a previously unknown protein that plays an important role in all muscle function. This gives them a clearer view of how muscle cells work and allows them to diagnose other muscle disorders with exceptional precision, as well as devise new approaches to treatment.
Any new treatment will need to be tested on animals. In fact, the next explosion of information in medical genetics is expected to come from the study of animals - particularly animals with genetic defects that mimic human disorders. The techniques for producing animal models of disease are improving rapidly. Even today, "designer mice" are playing an increasingly important role in research.
The growth of powerful computerized databases is bringing further insights. Only a month after the discovery of the genetic error involved in neurofibromatosis, a disfiguring and sometimes disabling hereditary disease, a computer search revealed a match between the protein made by normal copies of the newly uncovered gene and a protein that acts to suppress the development of cancers of the lung, liver, and brain - a key finding for cancer researchers.
Such revelations are becoming increasingly frequent. "If a new sequence has no match in the databases as they are, a week later a still newer sequence will match it," observes Walter Gilbert of Harvard University.
Brain disorders such as schizophrenia or Alzheimer's disease may be next to yield to the genetic approach. "We won't know what went wrong in most cases of mental disease until we can find the gene that sets it off," says James Watson, co-discoverer of the structure of DNA.
6.2 Application of Bioinformatics to Agriculture
Techniques aimed at crop improvement have been utilized for centuries. Today, applied plant science has three overall goals: increased crop yield, improved crop quality, and reduced production costs. Biotechnology is proving its value in meeting these goals. Progress has, however, been slower than with medical and other areas of research. Because plants are genetically and physiologically more complex than single-cell organisms such as bacteria and yeasts, the necessary technologies are developing more slowly.
6.2.1 Improvements in Crop Yield and Quality
In one active area of plant research, scientists are exploring ways to use genetic modification to confer desirable characteristics on food crops. Similarly, agronomists are looking for ways to harden plants against adverse environmental conditions such as soil salinity, drought, alkaline earth metals, and anaerobic (lacking air) soil conditions.
Genetic engineering methods to improve fruit and vegetable crop characteristics - such as taste, texture, size, color, acidity or sweetness, and ripening process - are being explored as a potentially superior strategy to the traditional method of cross-breeding.
Research in this area of agricultural biotechnology is complicated by the fact that many of a crop's traits are encoded not by one gene but by many genes working together. Therefore, one must first identify all of the genes that function as a set to express a particular property. This knowledge can then be applied to altering the germlines of commercially important food crops. For example, it will be possible to transfer the genes regulating nutrient content from one variety of tomatoes into a variety that naturally grows to a larger size. Similarly, by modifying the genes that control ripening, agronomists can provide supplies of seasonal fruits and vegetables for extended periods of time.
Biotechnological methods for improving field crops, such as wheat, corn and soybeans, are also being sought, since seeds serve both as a source of nutrition for people and animals and as the material for producing the next plant generation. By increasing the quality and quantity of protein or varying the types in these crops, we can improve their nutritional value.
6.3 Applications of Microbial Genomics
· new energy sources (biofuels)
· environmental monitoring to detect pollutants
· protection from biological and chemical warfare
· safe, efficient toxic waste cleanup
· understanding disease vulnerabilities and revealing drug targets
In 1994, taking advantage of new capabilities developed by the genome project, DOE initiated the Microbial Genome Program (MGP) to sequence the genomes of bacteria useful in energy production, environmental remediation, toxic waste reduction, and industrial processing.
Despite our reliance on the inhabitants of the microbial world, we know little of their number or their nature: estimates are that less than 0.01% of all microbes have been cultivated and characterized. Programs like the DOE Microbial Genome Program help lay a foundation for knowledge that will ultimately benefit human health and the environment. The economy will benefit from further industrial applications of microbial capabilities.
Information gleaned from the characterization of complete genomes in the MGP will lead to insights into the development of such new energy-related biotechnologies as photosynthetic systems, microbial systems that function in extreme environments, and organisms that can metabolize readily available renewable resources and waste material with equal facility.
Expected benefits also include development of diverse new products, processes, and test methods that will open the door to a cleaner environment. Biomanufacturing will use nontoxic chemicals and enzymes to reduce the cost and improve the efficiency of industrial processes. Already, microbial enzymes are being used to bleach paper pulp, stone wash denim, remove lipstick from glassware, break down starch in brewing, and coagulate milk protein for cheese production. In the health arena, microbial sequences may help researchers find new human genes and shed light on the disease-producing properties of pathogens.
Microbial genomics will also help pharmaceutical researchers gain a better understanding of how pathogenic microbes cause disease. Sequencing these microbes will help reveal vulnerabilities and identify new drug targets.
Gaining a deeper understanding of the microbial world also will provide insights into the strategies and limits of life on this planet. Data generated in this young program already have helped scientists identify the minimum number of genes necessary for life and confirm the existence of a third major kingdom of life. Additionally, the new genetic techniques now allow us to establish more precisely the diversity of microorganisms and identify those critical to maintaining or restoring the function and integrity of large and small ecosystems; this knowledge also can be useful in monitoring and predicting environmental change. Finally, studies on microbial communities provide models for understanding biological interactions and evolutionary history.
6.4 Risk Assessment
· assess health damage and risks caused by radiation exposure, including low-dose exposures
· assess health damage and risks caused by exposure to mutagenic chemicals and cancer-causing toxins
· reduce the likelihood of heritable mutations
Understanding the human genome will have an enormous impact on the ability to assess risks posed to individuals by exposure to toxic agents. Scientists know that genetic differences make some people more susceptible and others more resistant to such agents. Far more work must be done to determine the genetic basis of such variability. This knowledge will directly address DOE's long-term mission to understand the effects of low-level exposures to radiation and other energy-related agents, especially in terms of cancer risk.
6.5 Bioarchaeology, Anthropology, Evolution, and Human Migration
· study evolution through germline mutations in lineages
· study migration of different population groups based on female genetic inheritance
· study mutations on the Y chromosome to trace lineage and migration of males
· compare breakpoints in the evolution of mutations with ages of populations and historical events
Understanding genomics will help us understand human evolution and the common biology we share with all of life. Comparative genomics between humans and other organisms such as mice has already led to the identification of similar genes associated with diseases and traits. Further comparative studies will help determine the yet-unknown functions of thousands of other genes.
Comparing the DNA sequences of the entire genomes of different microbes will provide new insights about relationships among the three major kingdoms of life: archaea, bacteria, and eukaryotes.
6.6 DNA Forensics (Identification)
· identify potential suspects whose DNA may match evidence left at crime scenes
· exonerate persons wrongly accused of crimes
· identify crime and catastrophe victims
· establish paternity and other family relationships
· identify endangered and protected species as an aid to wildlife officials (could be used for prosecuting poachers)
· detect bacteria and other organisms that may pollute air, water, soil, and food
· match organ donors with recipients in transplant programs
· determine pedigree for seed or livestock breeds
· authenticate consumables such as caviar and wine
Any type of organism can be identified by examination of DNA sequences unique to that species. Identifying individuals is less precise at this time, although when DNA sequencing technologies progress further, direct characterization of very large DNA segments, and possibly even whole genomes, will become feasible and practical and will allow precise individual identification.
To identify individuals, forensic scientists scan about 10 DNA regions that vary from person to person and use the data to create a DNA profile of that individual (sometimes called a DNA fingerprint). There is an extremely small chance that another person has the same DNA profile for a particular set of regions.
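To see why a multi-region profile is so discriminating, the toy calculation below (a Python sketch using invented per-region match frequencies, not real population data) multiplies the per-region probabilities together under the usual assumption that the regions are inherited independently:

# Invented per-region match frequencies, i.e. the chance that a random person
# shares the suspect's DNA type at each of the ten regions examined.
per_region_match_prob = [0.08, 0.11, 0.05, 0.09, 0.07, 0.10, 0.06, 0.08, 0.12, 0.05]

profile_match_prob = 1.0
for p in per_region_match_prob:
    profile_match_prob *= p                 # regions assumed to be inherited independently

print(f"Coincidental full-profile match: about 1 in {1 / profile_match_prob:,.0f}")

Multiplying ten modest per-region probabilities already yields odds in the billions, which is why a coincidental match on a full profile is considered extremely unlikely.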
Bibliography
1. IEEE Magazine
Engineering in Medicine and Biology
Volume 20, Number 4, July/August 2002
2. Introduction to Bioinformatics
By T. K. Attwood and D. J. Parry-Smith
First Edition
Publisher: Pearson Education Ltd.
3. Web Sites
Human Genome Project http://www.ornl.gov/TechResources/Human_Genome/
Beyond Discovery http://www4.nas.edu/beyond/beyonddiscovery.nsf/
Bioinformatics in India http://bioinformatics-india.com
Other sites http://bioinform.com
http://bioinformatics.org