This problem compares and measures the differences between DNA sequences. This type of analysis is common in the field of computational biology. There are three parts to this assignment. You must provide functions with the given names and parameters. You may name any other supporting functions at your discretion, but each name should reflect what the function does.
Measuring DNA Similarity
DNA is the hereditary material in humans and other species. Almost every cell in a person’s body has the same DNA. All information in DNA is stored as a code made up of four chemical bases: adenine (A), guanine (G), cytosine (C), and thymine (T). The differences in the order of these bases provide a means of specifying different information.
The lectures have shown examples of how strings are represented and used in C++. In this assignment we will create strings that represent possible DNA sequences. A DNA sequence is a string over a limited alphabet (A, C, G, T). Here we will implement a number of functions that search for substrings that may represent genes and protein sequences within the DNA.
One of the challenges in computational biology is determining what the codons in a DNA sequence represent. A codon is a sequence of three nucleotides that forms a unit of genetic code in a DNA or RNA molecule. For example, given the sequence GGGA, the codon could be GGG or GGA, depending on where the gene begins within the sequence. Clues about how to interpret a DNA sequence can be found by comparing an unknown DNA sequence to a known sequence and measuring their similarity. If the sequences are similar, it can be hypothesized that they have similar functions and proteins.
Another common DNA analysis function is to compare two sequences to determine the similarity of newly discovered DNA to the large databases of known DNA sequences.
Hamming distance and similarity between two strings
Hamming distance is one of the most common ways to measure the similarity between two strings of the same length. It is a position-by-position comparison that counts the number of positions at which the corresponding characters in the two strings differ. Two strings with a small Hamming distance are more similar than two strings with a larger Hamming distance.
first string  = “ACCT”
second string = “ACCG”
                 |||*
In this example, there are three matching characters and one mismatch, so the Hamming distance is one.
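The position-by-position comparison described above could be sketched in C++ roughly as follows. The function name hammingDistance and its parameters are hypothetical, chosen only for illustration; the names required by the assignment may differ. Throwing an exception for strings of unequal length is also an assumption, since the text only defines the distance for strings of the same length.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper name, used for illustration only.
// Counts the positions at which two equal-length strings differ.
int hammingDistance(const std::string& first, const std::string& second) {
    // Assumption: unequal lengths are treated as an error, because the
    // Hamming distance is only defined for strings of the same length.
    if (first.size() != second.size()) {
        throw std::invalid_argument("strings must have the same length");
    }

    int distance = 0;
    for (std::size_t i = 0; i < first.size(); ++i) {
        if (first[i] != second[i]) {
            ++distance;  // one more mismatching position
        }
    }
    return distance;
}
```

With this sketch, hammingDistance("ACCT", "ACCG") evaluates to 1, matching the example above.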
The similarity score for two sequences is then calculated as follows:
similarity_score = (string length - hamming distance) / string length
similarity_score = (4 - 1) / 4 = 3/4 = 0.75
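As a minimal sketch, the similarity score formula could be computed in C++ as shown below. The names similarityScore and countMismatches are hypothetical and not taken from the assignment. Returning 0.0 for two empty strings is an assumption made here to avoid dividing by zero; the text does not say how that case should be handled. Note the cast to double before dividing, since integer division would truncate 3/4 to 0.

```cpp
#include <cstddef>
#include <stdexcept>
#include <string>

// Hypothetical helper: number of positions where the two strings differ.
// Assumes the caller has already checked that the lengths match.
static std::size_t countMismatches(const std::string& a, const std::string& b) {
    std::size_t mismatches = 0;
    for (std::size_t i = 0; i < a.size(); ++i) {
        if (a[i] != b[i]) {
            ++mismatches;
        }
    }
    return mismatches;
}

// Hypothetical function name, used for illustration only.
// similarity = (string length - hamming distance) / string length
double similarityScore(const std::string& a, const std::string& b) {
    if (a.size() != b.size()) {
        throw std::invalid_argument("strings must have the same length");
    }
    if (a.empty()) {
        return 0.0;  // assumption: avoid division by zero for empty input
    }
    const std::size_t length = a.size();
    const std::size_t distance = countMismatches(a, b);
    // Convert to double before dividing so the fraction is not truncated.
    return static_cast<double>(length - distance) / static_cast<double>(length);
}
```

For the strings “ACCT” and “ACCG”, this sketch returns 0.75, in line with the calculation above.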