As I said I'm Peter Cooper.
My email address is there.
Peter.Cooper@NIH.gov.
I will talk about what I mean by genomic context.
Then we'll talk about the identical proteins group and how to see the corresponding genomic
proteins for BLAST hits.
We'll use the graphics links on BLAST.
At the end we will wrap it up by showing you some is scripting ways to get to some of this
information using EDirect and some additional programs.
What do I mean by genomic context?
I mean what genes or protein products or regulatory sequences or other features are either upstream
or downstream in the genome for matched proteins.
These are the kind of things that everybody is interested in because he can tell you about
regulatory mechanisms.
It's useful to know from a comparative genomics perspective how does this set of genes change
over times.
Are there multiple copies?
Those kinds of things.
Also, another sort of way to use this is a quality check on genome assemblies and a annotation.
This genome is missing, this particular gene, maybe that is a problem with the assembly?
You could look at things like that.
So the particular set of genes that we will look at today are the ones involved in the
type III restriction endonuclease cassette.
To get rid of foreign DNA.
This is a typical example of what you see if you look into a bacterial genome.
You will see there is helicase at the 5-prime end of this thing.
There is methyl transfer here which is the thing that methyl lates bacterial genomic
DNA of the host so that it does not get cut.
And then there is a binding protein.
This is a type III restriction endonuclease subunit R. That is a typical situation.
Look around and there are some changes in this in some kinds of bacteria.
This is the canonical one that we will be looking at today.
What we will do is focus on the site specific DNA methyl transferase.
Will look for that.
Find it in BLAST and take a look at the genome and see if the other two genes are flanking
it.
I hope that makes some sense.
So if we are doing a nucleotide search, this is actually a straightforward thing to do.
What I mean by nucleotide search is, I mean the database is nucleotide.
We are looking down in the alignment section.
When we have a tblast in search like this one, where I search with a protein, and I
match the protein translation of a genomic sequence, I can click the graphics link and
it will take me directly to a view of that genomic sequence in our graphical sequence
viewer.
Like this one.
And there is my protein match.
There is a number of mismatches they are you can see the bottom of the screen here.
The red marks are mis matches.
That is still a very good protein blast match.
A tblastn alignment against the subject sequence which is the whole genomes shotgun sequence
from this bacterium.
If you check the annotation of these genes on either side, they are similar to the ones
that I showed you a minute ago.
One is labeled a hypothetical protein, if you take that in your BLAST search you will
see that it matches one of the other two things.
This site-specific transferase is annotated in the right way.
So the genomic context in this case is conserved.
We saw it simply by clicking one link on the BLAST output.
What a lot of people do when they do these searches is a protein search.
In part to do a tblastn search against large genomes is burdensome, and they fail for various
reasons.
Protein is more likely to work.
The problem is the graphics link does not do what you think it might.
It takes you to the protein sequence graphic which doesn't tell you about the genomic context.
However, there is another link that is often shown on the right-hand side when there is
more than one protein available, the identical proteins.
This is a particular kind of protein accession number that short that starts with a WP.
This is a nonredundant, RefSeq bacterial proteins.
I will talk more about that in a minute.
So if I link to the protein itself, in our entrez system web database, I can see a link
to identical proteins there.
I can follow the link directly to the IPG report and this is a list of all the proteins
that are identical.
There may be several GenPept submitted sequences identical to this RefSeq.
But the WP is annotated on a number of different genomic sequences.
So you can see those in the second column of the table on the right-hand side of this
slide.
If I followed the link to those, it will take me to the coding region in the nucleotide
database with that particular range and that particular strand of the DNA focused in on.
That will give me a way of looking at the corresponding coding regions in all of these
bacteria that have these protein annotated on it.
For example, this is the protein link from the protein will be looking at the genome.
This is a salmonella enterica assembly.
I am just looking at the 3000 bp corresponds to the coding region of the BLAST match.
If you want to try to look at the view link is there in the slide.
www.ncbi.nlm.nih.gov/nuccore/NZ_AZKP01000022.1?from=149913&to=153038
If I follow this to the genomic view, I can manipulate the graphics view a little bit,
and I can take a look at the things surrounding that particular coding region.
So that is the one that we were zoomed in on a minute ago.
That is labeled as a site specific DNA methyl transferase, what I expect my blast search
to match.
We are not going directly from blast so it doesn't show you the alignment.
We were just aligned protein to protein.
I am looking at the region on the genome that corresponds to that protein.
Upstream of that is the ATP -dependent helicase that we saw the beginning when talking about
the canonical version of this.
And on the other side is the type III restriction protein binding site.
Subunit.
We saw that by following the links to the identical protein report and then to the graphical
sequence viewer.
So that's how you do it on the web.
I will demonstrate that live in a few minutes.
I wanted to mention that you can access these identical protein group reports using the
API in particular we will focus on using EDirect to do that.
And we have a number of webinars on EDirect if you want.
You can go to the link on the slide for YouTube playlist.
EDirect is a UNIX command line version of Entrez utilities.
The API for the Entrez system.
Lets you search and retrieve records from all of our databases.
Has processing scripts and in XML parser, called xtract.
Each program outputs XML and you can pipe it into another EDirect program or a linux,
Unix utility.
We talked about this in various webinars, i fact one last week.
You can check those out if you want more information.
Take a look at the documentation.
Which is very extensive.
It explains a lot about using EDirect.
I won't have time today to talk a lot about that.
I will point out some things that are useful that you can do with these particular reports.
We will be focusing on E fetch and the program xtract to look at some various things.
So the -format ipg, the argument to xtract will give you the identical proteins group
table that we saw a minute ago.
Here's an example of how you can retrieve that and get that report on the command line.
The other thing that is going to be useful to us if you want to get the genomic context
once we know what sequence we will be looking at, we want to use a particular kind of XML
called GBC.
A special XML format that has sort of the wrappers or built-in xtract functions that
let you parse out features from the feature table.
For example this rather complicated command line at the bottom.
efetch -db nuccore -id NZ_AZKPo1000022.1 -format gbc' | \
xtract -insd CDS INSDInterval_from INSDInterval_to protein_id product
I am retrieving this record from a nucleotide database in the format gbc.
I'm passing it through xtract and it is using this INSD argument to look at the CDS feature.
Within that feature the are are various qualifiers that you can out put by giving their names.
I can get the interval, I can get the protein ID.
So we can use that in a very crude way today to get that information.
So here is what we will do.
We will go to the web and do some web BLAST stuff.
I won't run the searches but I will show you the results.
It will save us some time.
I will show you you how to get to the graphics view.
We will talk a little bit and touch on using EDirect to get these things out using the
command line.
I will show you how to do it with a single report.
We will begin to talk about how you would do with BLAST output . That's a more difficult
problem because you have lots and lots of IPG reports.
One for each protein.
Each one of those could have many, many corresponding genomic regions.
Ultimately, in some cases if you wanted to do this in large quantity, you would probably
want to go to the assembly database and download the corresponding GFF files or the annotation
files.
I will talk about that more in a minute.
We will get these nucleotide records and look at the neighboring genes to see if they are
the same, like we saw a minute ago inn the screenshots on the web.
Let me escape out of the slides.
What we're going to be doing for our blast searches is, I went through a whole shotgun
assembly of a meta-genomic project and I found an open reading frame in there.
That corresponds to one of these methyl transferees.
We will use this as a query sequence as we are doing I blest search.
If you want to see what that project is you can see the link in this handout.
The first thing is we will take a look at web searches.
And remember, we did a search against the nucleotide database.
The one that I did was the TBLAST search against representative genomes.
I am just taking that RID.
This should be valid for several months because I preserved it it will not be valid forever
but it should work for the short-term.
This is a BLAST insert.
It took a few minutes to run.
As you can see, there are all these hits to complete genomes.
And we can pick any one of these to look at the corresponding graphical view to see whether
we think these are conserved.
I will pick one that I have not looked at before.
I will take this chlorobium.
The complete genome here.
ANd here it is in the TBLASTN alignment.
If I click on the graphics alignment here, either one will give me the same thing.
There is only one hit to this particular genome.
It looks like we have some differences in annotation than what we saw a minute ago.
Which is fine.
This is the kind of thing you can identify.
You can identify this open reading frame here as a site specific DNA methyl transferase.
But then there are also things that are sort of part of that match that come up as separate
annotated proteins.
That may or may not be correct.
So we already are seeing some genomic context of the BLAST hits here.
Two genes annotated in this region.
I can zoom out and see what else is around.
So here is the gene that is to the five prime that.
It says that is a transposase.
Different than the canonical version of this.
It gives genomic context.
This is another different gene there so keep looking around if we want to.
Here we have an extra gene here.
But, the next gene down is the type III restriction endonuclease subunit.
This is an interesting kind of finding if you're interested in the biology of this organism.
It will be useful for you to know.
So that is that BLAST result.
I will close the graphical overview here for that one.
Let's go ahead and get the other kind of BLAST protein results.
I will go back over here.
In this case I blasted that protein sequence we saw against the Refseq protein database.
This will give me entirely, for bacteria, it will give me entirely WP style records.
So you can see that out methylase has this nice type III restriction methylase superfamily.
That is good.
That is what we would expect.
And we have hits to lots of WP style proteins.
From lots of organisms that you probably recognize the names of.
Here is Salmonella enterica.
This is one we were looking at in slides a moment ago.
We will jump down to the alignment section here.
And so I don't see the, I'm not sure if this is a bug today.
The particular link to identical proteins is not there.
It doesn't matter because I can obviously get it by going directly to the record itself.
And that is a bug.
Let's see if we can get the identical proteins report.
We do have the identical proteins report.
This is a nice table.
It lets me see the coding region.
So I can pick a particular salmonella.
Let's do the first one.
Again we are zoomed in with this particular coding region on this genome.
I can go to graphical view here.
And there is my protein.
And as we are ready know, because the protein says so, it's a site specific DNA methyl transferase.
I can zoom out a little bit.
And see what is there.
We have already seen this in my slideshow a moment ago.
So there is the ATP-dependent helicase.
There's our methyl transferees.
There is the restriction binding site there.
Those are the two things you can see and BLAST and how to get to the genomic context for
these matches.
Let's switch back and forth between this page here and we are going to start using EDirect.
This will be fairly quick you can always come back to this and get the command lines if
you want to practice using EDirect.
Let me grab the first one.
First of all I am just going to get the IPG report.
So you can see that it is the same thing.
So I am going to search the protein database for this protein record, the one we saw minute
ago.
I will ask for the IPG format.
So we get this nice report here.
It's just like we saw on the web.
One of the things I want to do is focus on the Ref sequence here.
There is an entry in the table here that says INSDC.
It's a GenPept record, identical to the RefSeq sequences, the WP accession, it's all just
one WP.
So what I want to do is clean that up and just get the REfSeqs.
I can use awk.
Sort of a swiss army knife of dealing with tab-delimited output.
I can say just give me the Refseqs in column two.
I will print out of that is simply the nucleotide accession numbers.
What we can do with those is go back in and retrieve those in that other format that I
talked about.
And go ahead and parse out the feature table to get the flanking genes on either side of
this.
And this way I'm suggesting is not very sophisticated.
Certainly you might be able to think of better ways.
But it makes the point that it can be done.
Here is what I am going to do.
There's a lot of extra business in here.
I will talk about what is important for this particular command line here.
So we have already done this.
I'm going to get the WP in the IPG format.
I will get out the nucleotide accession's in column three.
I will pass just the first two to show you quickly what happens.
And then I will pass those two things to efetch.
That is very well discussed in the document that goes with EDirect and how to use it.
I will go to the nuccore and get these guys from the database, in the gbd format.
I will use that xtract command to get all of the coding regions.
Then I will do something really crude, basically use grep to get that particular coding region,
the one before, and the one after it.
Let me do that.
So, those are the first two and they are indeed exactly the same.
ATP dependent helicase and the site specific DNA methyl transferase and the type III restriction
protein subunit.
And you can extend this, two up two down if you want.
There is nothing particularly special about it except I change the grep parameters.
Instead of running that, I will just show you what the result is.
So you can see, the only real difference, so I've got two up two down here.
The only real difference is sometimes they are flipped around because a particular coding
region is on the opposite strand.
The order is sometimes different.
The genes are the same.
Unlike where we saw a minute ago where they were not the same, but in this case they are.
These are all the salmonella assemblies we saw minute ago.
Now, what about doing that for BLAST result?
What I did was I took one of those BLAST searches and I saved it from the web onto my hard drive.
I can go ahead and search that and get the IPG reports.
That's what this does here.
Basically it parses the blast report and we have a webinar that is about doing that.
Basically what I am getting it is all of those, I just did the first 10.
That was 100 different BLAST things.
You get the basic IPG reports for all of those.
It is kind of confusing because it wordwraps, the lines are too long.
You can see these are the IPG reports.
You can parse this with a script if you want to.
And do the same kind of things we did before.
The only issue is you now have a different WP and you have to go through and write a
script to handle this in a more sophisticated way.
I can rewrite this if I want to.
I can put it in a more human readable format where it gets the assembly accession highlighted.
That is sort of the next way to do this.
I will not go through this.
But you are certainly welcome to take a look at this.
The idea here is to go ahead and get the assembly out of the last column.
That would be printed out here.
Then there are some strategies that are documented fairly well on the genomes download about
how to download particular files from the assembly.
The file I recommend is the GFF3 file for each one of these.
It is a very simple matter because it is a single line per coding region.
You can parse those out of there pretty easily.
I will show you that IPG report here.
We will stop for some questions.
I used awk to simplify the report a little bit.
But you can continue along those lines and go ahead and go do a search against the assembly
database using in EDirect to get the path to the FTP directory that contains the assembly
files.
You can write another script or make a more complicated script and pull down the gff file
for each one of these and parse it for their particular proteins.
But of course you have to have a different protein for each one of these assemblies.
So I will stop there.
And see if anybody has any questions.
I will leave it open for a couple of questions in case someone has a question at the last
minute.
What we saw is you can link to the IPG report to get the nucleotide coding region for web
BLAST.
You can fetch and parse protein IPG reports using that GBC XML with EDirect and other
commandline tools for a single report or for standalone BLAST output.
The last slide has a number of URLs.
Many of these I have mentioned earlier.
But you might want to download that and take a look at that.
In particular, take a look at the YouTube playlist for EDirect.
In the EDirect manual if you are interested in doing more things with EDirect.
[Rana] Hi Peter, we have a quick question at this point.
A user is interested in how we can get the TBLASTN results in the form of protein sequences
using the web not EDirect.
Going from a T BLAST result and pulling the protein sequences from.
[Peter] Basically you want to get the translations of the genome.
He wants to get the protein translations.
Where there are matches like the ones I saw where you are matching annotated genomic sequences,
yes, you can do that because you can do the same kind of process where you could go and
find the annotated proteins.
However, if you are talking about doing a TBLASTN search against something like a WGS
record or another unannotated sequence, there is no way for you to get the corresponding
coding regions.
You can take the nucleotide sequence, which is what I did to make that protein that we
used as a query, and go to our ORFfinder and do a six frame translation on that to get
the particular open reading frame on it if that's what you want to do.
You would have to take the genomic DNA hit and do some kind of translation yourself.
The ORFfinder on our website will let you do that.
And I can quickly show you where that is in case you do not know.
So it is just the NCBI URL and orffinder.
You can put your open reading frame or accession in here and it will generate the protein sequence
for you.
It doesn't do it in bulk.
And I don't think it does it in a way that would it will let you download.
We can check that out and I will put more elaborate answers in the Q&A document that
goes along with this.
Is there anything else?
[Rana] The same person is interested in how to translate 2000 hits using the web, and
that's not possible, but we can put an example in the documentation after this webinar on
the web.
On the FTP site as to how you can actually get your protein sequences out.
There's another question that somebody had which is, he's interested in IPG's.
But he wants to know if you can filter the IPG cluster shown to those that have specific
sets of organisms?
He's interested in EDirect or using scripting.
Can you filter the IPG hits that are returned to those that only have a certain group of
organisms?
[Peter] So I think if I understand the question, yes, you can.
You can actually search IPG.
I'm not positive you can, you can search IPG on the web.
I have not tried it on the command line.
But it ought to work the same way.
You can put an organism query in their and it will give you the clusters back.
The identical protein groups back that have that particular organism in them.
I think that answers the question.
But I'm not sure.
IPG has its own interface.
You can search here.
We weren't talking about this database per se today.
We were talking about the link to the database and the format that you can get from the protein
side.
But there is a way to search the IPG reports in this particular way.
I think I will close.
Thank you very much for coming.
I will and the webinar now.
Không có nhận xét nào:
Đăng nhận xét