1.) Michael Crichton's fantasy about cloning dinosaurs, Jurassic Park, contains a putative dinosaur DNA sequence. Use nucleotide-nucleotide BLAST against the default nucleotide database, nr, to identify the real source of the following sequence. Select it, then copy and paste it into the BLAST form window. This is probably the most common use of nucleotide-nucleotide BLAST: sequence identification, that is, establishing whether an exact match for a sequence is already present in the database.

>DinoDNA from JURASSIC PARK p. 103 nt 1-1200
GCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAAAATCGACGC
GGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAAGCTCCCTCG
TGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGGAAGCGTGGC
TGCTCACGCTGTACCTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTGGGCTGTGTG
CCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCAACCCGGTAA
AGTAGGACAGGTGCCGGCAGCGCTCTGGGTCATTTTCGGCGAGGACCGCTTTCGCTGGAG
ATCGGCCTGTCGCTTGCGGTATTCGGAATCTTGCACGCCCTCGCTCAAGCCTTCGTCACT
CCAAACGTTTCGGCGAGAAGCAGGCCATTATCGCCGGCATGGCGGCCGACGCGCTGGGCT
GGCGTTCGCGACGCGAGGCTGGATGGCCTTCCCCATTATGATTCTTCTCGCTTCCGGCGG
CCCGCGTTGCAGGCCATGCTGTCCAGGCAGGTAGATGACGACCATCAGGGACAGCTTCAA
CGGCTCTTACCAGCCTAACTTCGATCACTGGACCGCTGATCGTCACGGCGATTTATGCCG
CACATGGACGCGTTGCTGGCGTTTTTCCATAGGCTCCGCCCCCCTGACGAGCATCACAAA
CAAGTCAGAGGTGGCGAAACCCGACAGGACTATAAAGATACCAGGCGTTTCCCCCTGGAA
GCGCTCTCCTGTTCCGACCCTGCCGCTTACCGGATACCTGTCCGCCTTTCTCCCTTCGGG
CTTTCTCAATGCTCACGCTGTAGGTATCTCAGTTCGGTGTAGGTCGTTCGCTCCAAGCTG
ACGAACCCCCCGTTCAGCCCGACCGCTGCGCCTTATCCGGTAACTATCGTCTTGAGTCCA
ACACGACTTAACGGGTTGGCATGGATTGTAGGCGCCGCCCTATACCTTGTCTGCCTCCCC
GCGGTGCATGGAGCCGGGCCACCTCGACCTGAATGGAAGCCGGCGGCACCTCGCTAACGG
CCAAGAATTGGAGCCAATCAATTCTTGCGGAGAACTGTGAATGCGCAAACCAACCCTTGG
CCATCGCGTCCGCCATCTCCAGCAGCCGCACGCGGCGCATCTCGGGCAGCGTTGGGTCCT

Does this sequence make sense?

2.) NCBI scientist Mark Boguski noticed this obvious "contaminant" and supplied Crichton with a better sequence, shown below, for the sequel, The Lost World.
Identify the most likely source of this sequence using nucleotide-nucleotide BLAST.

>DinoDNA from THE LOST WORLD p. 135
GAATTCCGGAAGCGAGCAAGAGATAAGTCCTGGCATCAGATACAGTTGGAGATAAGGACG
GACGTGTGGCAGCTCCCGCAGAGGATTCACTGGAAGTGCATTACCTATCCCATGGGAGCC
ATGGAGTTCGTGGCGCTGGGGGGGCCGGATGCGGGCTCCCCCACTCCGTTCCCTGATGAA
GCCGGAGCCTTCCTGGGGCTGGGGGGGGGCGAGAGGACGGAGGCGGGGGGGCTGCTGGCC
TCCTACCCCCCCTCAGGCCGCGTGTCCCTGGTGCCGTGGGCAGACACGGGTACTTTGGGG
ACCCCCCAGTGGGTGCCGCCCGCCACCCAAATGGAGCCCCCCCACTACCTGGAGCTGCTG
CAACCCCCCCGGGGCAGCCCCCCCCATCCCTCCTCCGGGCCCCTACTGCCACTCAGCAGC
GGGCCCCCACCCTGCGAGGCCCGTGAGTGCGTCATGGCCAGGAAGAACTGCGGAGCGACG
GCAACGCCGCTGTGGCGCCGGGACGGCACCGGGCATTACCTGTGCAACTGGGCCTCAGCC
TGCGGGCTCTACCACCGCCTCAACGGCCAGAACCGCCCGCTCATCCGCCCCAAAAAGCGC
CTGCTGGTGAGTAAGCGCGCAGGCACAGTGTGCAGCCACGAGCGTGAAAACTGCCAGACA
TCCACCACCACTCTGTGGCGTCGCAGCCCCATGGGGGACCCCGTCTGCAACAACATTCAC
GCCTGCGGCCTCTACTACAAACTGCACCAAGTGAACCGCCCCCTCACGATGCGCAAAGAC
GGAATCCAAACCCGAAACCGCAAAGTTTCCTCCAAGGGTAAAAAGCGGCGCCCCCCGGGG
GGGGGAAACCCCTCCGCCACCGCGGGAGGGGGCGCTCCTATGGGGGGAGGGGGGGACCCC
TCTATGCCCCCCCCGCCGCCCCCCCCGGCCGCCGCCCCCCCTCAAAGCGACGCTCTGTAC
GCTCTCGGCCCCGTGGTCCTTTCGGGCCATTTTCTGCCCTTTGGAAACTCCGGAGGGTTT
TTTGGGGGGGGGGCGGGGGGTTACACGGCCCCCCCGGGGCTGAGCCCGCAGATTTAAATA
ATAACTCTGACGTGGGCAAGTGGGCCTTGCTGAGAAGACAGTGTAACATAATAATTTGCA
CCTCGGCAATTGCAGAGGGTCGATCTCCACTTTGGACACAACAGGGCTACTCGGTAGGAC
CAGATAAGCACTTTGCTCCCTGGACTGAAAAAGAAAGGATTTATCTGTTTGCTTCTTGCT
GACAAATCCCTGTGAAAGGTAAAAGTCGGACACAGCAATCGATTATTTCTCGCCTGTGTG
AAATTACTGTGAATATTGTAAATATATATATATATATATATATATCTGTATAGAACAGCC
TCGGAGGCGGCATGGACCCAGCGTAGATCATGCTGGATTTGTACTGCCGGAATTC

3.) The Caenorhabditis elegans gene SMA-4 is a member of the dwarfins gene family, also called the MAD family, which plays a role in transforming growth factor beta-mediated signal transduction. In this example we will attempt to find homologs of the SMA-4 protein (SMA4_CAEEL, Accession P45897) in vertebrate species.

a.) This can be done automatically with BLink.

b.)
To simulate performing a BLAST search with a novel protein, we will use an Entrez query to remove all Caenorhabditis proteins from the BLAST database. Go to the protein-protein BLAST page and enter the SMA-4 accession number (P45897) in the Search text area. We will search against the default database, nr. To remove the Caenorhabditis proteins from the nr database, enter the following Entrez query in the "Limit by Entrez query" box under the "Options" section of the form:

protein all[Filter] NOT Caenorhabditis[Organism]

Because the BLAST database contains a large number of related proteins, we also need to increase the number of descriptions (BLAST hits) that will be shown. Do this by setting the number of descriptions to 1000 in the "Format" section of the BLAST form. Run the search by clicking the BLAST button.

c.) On the formatting page, you can see that the CD-search has identified conserved domains in this protein. Click on the graphic to see what these domains are and what they do.

d.) Look at the BLAST output and find all chicken (Gallus gallus) proteins that are similar to SMA-4. (Use the Tax Blast link at the upper left of the graphic to help find the chicken proteins.)

e.) Open a new browser window and run the same search again. This time restrict the search to chicken proteins using the Entrez query option as before, with the query:

Chicken[Organism]

Are the same proteins found? Compare the expect values (e-values) of these hits with those of the same hits found against nr with no organism restriction. Why are the e-values different for the same scores and alignments?

f.) Return to the BLAST graphical output from the first search (step b) and verify that the Entrez query eliminated the Caenorhabditis proteins from the database; you should see no full-length matches. Now look at the descriptions and their e-values. Among the hits with non-significant e-values (> 1) there are two proteins from sheep (Ovis aries) labeled as MAD proteins (Smad4 and Smad7).
These protein fragments are homologs of SMA-4, but we did not demonstrate that with this particular search. Why? Be sure to retain your formatting page for these results, or copy your Request ID, so that you can format them for PSI-BLAST for the next exercise.

4.) In this exercise we will use PSI-BLAST to show that the above sheep proteins are significant matches to SMA-4. Any protein-protein BLAST search on the NCBI web pages can be extended to a PSI-BLAST search simply by reformatting the results. Check the "Format for PSI-BLAST" box on the formatting page. The results are the same as before, except that they are formatted differently. A line across the descriptions section of the results marks the PSI-BLAST inclusion threshold of 0.005. Position-specific information from a multiple sequence alignment of the sequences above this line is used to generate a position-specific score matrix (PSSM) for the next iteration. Notice that one of the first proteins below this line is Smad4 from sheep (Ovis aries). What is the e-value of this hit?

Now click the "Run PSI-BLAST iteration 2" button. Note that the formatting page is refreshed in its separate window, generating a new Request ID. What is the new expect value for Smad4 (Ovis aries)? Notice that there are now several new sequences above the threshold. Some of them are not annotated as Sma/Mad homologs but are clearly significant hits. These new sequences will be used to construct a new PSSM for iteration 3, and so on. After a few more iterations no new sequences will be found; at this point the search is said to have converged.

5.) Mark embedded a message in the sequence he provided for The Lost World. To see Mark's message, use the translating BLAST (blastx) page with the sequence from exercise 2. What is Mark's message?
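
For exercise 5, it may help to see what "translating" means here: blastx translates the nucleotide query in all six reading frames and compares each translation against a protein database. The sketch below reproduces only that translation step locally, assuming the standard genetic code (NCBI translation table 1). The function names (clean, reverse_complement, translate, six_frame_translation) are illustrative and are not part of any NCBI tool, and the frame labels follow the common +1 to +3 / -1 to -3 convention. It does not perform a database search.

```python
# Minimal six-frame translation sketch (standard genetic code, NCBI table 1).
# All names below are illustrative; this is not NCBI code.

BASES = "TCAG"
AMINO_ACIDS = "FFLLSSSSYY**CC*WLLLLPPPPHHQQRRRRIIIMTTTTNNKKSSRRVVVVAAAADDEEGGGG"

# Map each codon (e.g. "ATG") to its one-letter amino acid ("*" marks a stop).
CODON_TABLE = {
    b1 + b2 + b3: AMINO_ACIDS[16 * i + 4 * j + k]
    for i, b1 in enumerate(BASES)
    for j, b2 in enumerate(BASES)
    for k, b3 in enumerate(BASES)
}

COMPLEMENT = str.maketrans("ACGT", "TGCA")


def clean(seq):
    """Remove whitespace and line breaks, and convert to upper case."""
    return "".join(seq.split()).upper()


def reverse_complement(seq):
    """Return the reverse complement of a DNA sequence."""
    return clean(seq).translate(COMPLEMENT)[::-1]


def translate(seq):
    """Translate from the first base; trailing bases that do not fill a
    codon are dropped, and codons with non-ACGT letters become 'X'."""
    seq = clean(seq)
    usable = len(seq) - len(seq) % 3
    return "".join(
        CODON_TABLE.get(seq[i:i + 3], "X") for i in range(0, usable, 3)
    )


def six_frame_translation(seq):
    """Return a dict mapping frame (+1..+3, -1..-3) to its translation."""
    forward = clean(seq)
    reverse = reverse_complement(forward)
    frames = {}
    for offset in range(3):
        frames[offset + 1] = translate(forward[offset:])
        frames[-(offset + 1)] = translate(reverse[offset:])
    return frames


if __name__ == "__main__":
    demo = "ATGGCCTAA"  # toy input; paste the exercise sequence here instead
    for frame, protein in sorted(six_frame_translation(demo).items()):
        print(f"{frame:+d}  {protein}")
```

To try it on the exercise, replace the toy input with the sequence lines of the Lost World record (without the ">" header line) and scan the six translations by eye. The blastx results page shows the same kind of translated query aligned against database proteins.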