- X. laevis genome build v7b (fasta file)
- X. laevis Mayball gene models (fasta file)
- X. laevis genome build v9.1 (fasta file): ftp://ftp.xenbase.org/pub/Genomics/JGI/Xenla9.1/Xla.v91.repeatMasked.fa.gz
- X. laevis e2f4 RNAseq and ChIPseq data
- X. laevis prdm12 RNAseq and ChIPseq data
- Extensive (and I mean it) RNAseq of X. laevis multiciliated cells
- X. laevis foxj1, myb, rad21, H3K4me3 and H3K27ac ChIPseq data
- X. laevis 3-dimensional genomic structure (HiC) data
Let's face it, Xenopus rules. Great imaging, easy to manipulate, fantastic biochemistry. However, not everybody knows it is also now an extremely tractable genomics model. Above, find genome build version 7b and resources to work with it. Version 7b is in 17,006 scaffolds (counting only those over 1 kb; including the dinky ones, the total is more like 400,000). Still, it'll get you 90% of the way there.
While 90% is pretty good, a 100%-dazzling, chromosome-level assembly is very close. For years now, Dan Rokhsar's group, especially Adam Session and Taejoon Kwon, in conjunction with the Xenopus Genome Project Consortium, has been hard at work producing the highest-quality genome possible. It can be pretty hard to put together the last pieces, though.
To help, I generated HiC data from X. laevis embryos. HiC is usually used to obtain information about long-range looping chromosomal interactions - the way it works is that you fix the genome in its glorious, bowl-of-spaghetti in situ state, cut it up with restriction enzymes, religate and sequence, hoping to catch loops by seeing two distant pieces now stuck together. Most of the data you get, though, are pieces that are linearly right next to each other with no looping required, which happens when you cut them apart and they religate like nothing ever happened. While not informative if you're studying long-range interactions, these data are perfect for figuring out which pieces of DNA are next to each other in a linear sequence and assembling chromosomes.
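To make the idea concrete, here is a minimal Python sketch of the bookkeeping behind it: tally how many HiC read pairs join each pair of scaffolds, on the reasoning that scaffolds sharing many links are likely linear neighbors. This is only an illustration, not the algorithm used by any particular assembler. The function name, the tuple layout of the read pairs, the scaffold names and the 1 kb cutoff are all assumptions made for the example.

```python
from collections import Counter


def count_scaffold_links(read_pairs, min_distance=1000):
    """Tally Hi-C read pairs by where their two ends map.

    read_pairs: iterable of (scaffold_a, pos_a, scaffold_b, pos_b) tuples
                (a hypothetical, already-parsed alignment format).
    min_distance: assumed cutoff (bp) separating "right next to each other"
                  pairs from longer-range pairs on the same scaffold.

    Returns (links, near, far):
      links - Counter of pairs whose ends land on two different scaffolds,
              keyed by the sorted scaffold pair
      near  - number of same-scaffold pairs closer than min_distance
      far   - number of same-scaffold pairs at least min_distance apart
    """
    links = Counter()
    near = 0
    far = 0
    for scaf_a, pos_a, scaf_b, pos_b in read_pairs:
        if scaf_a == scaf_b:
            if abs(pos_a - pos_b) < min_distance:
                near += 1
            else:
                far += 1
        else:
            # Sort the names so (A, B) and (B, A) count as the same link.
            links[tuple(sorted((scaf_a, scaf_b)))] += 1
    return links, near, far


# Made-up read pairs, purely for illustration.
pairs = [
    ("scaffold_1", 9500, "scaffold_2", 200),
    ("scaffold_1", 9800, "scaffold_2", 450),
    ("scaffold_1", 100, "scaffold_1", 350),
    ("scaffold_1", 100, "scaffold_1", 9000),
    ("scaffold_2", 700, "scaffold_3", 50),
]
links, near, far = count_scaffold_links(pairs)
print(links.most_common(1))  # the scaffold pair with the most links
```

In this toy input, scaffold_1 and scaffold_2 share the most links, so they would be the first candidates to place next to each other. Real assemblers also have to work out order and orientation, which this sketch ignores.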
Even with nice data, it's a tricky problem. I made an assembly with Lachesis, from Jay Shendure's lab, that was only so-so. Then Nik Putnam took my raw HiC data and knocked it out of the park with his assembler HiRise, making X. laevis v9.1 with Adam and Taejoon, assisted by BAC-FISH data and extensive fine-grain corrections from the Consortium. It's simply gorgeous.
Here's the genome paper. Check it out!
The gene models
Genomic sequence is handy, but it's a lot handier if you know what's in it. X. laevis has several EST projects (like this one), but they are older and incomplete. To address this, Taejoon took some 2 billion RNAseq reads generated and donated by the community (including 1 billion from me) and generated several collections of gene models. Above, find a fasta file containing Taejoon's Mayball gene model release, which is a nice option to align RNAseq experiments to.
Like the rest of this project, Mayball is an interim release, but it's still excellent. To make comparisons to human biology easier, I inserted the Ensembl gene ID of the best-matching human ortholog in the name of each transcript. This means the name format is
gene ID | Ensembl ID | unique gene identifier | position of gene in genome build v7b: scaffold_start-end, +/- strand.
Yeah, it's clunky. Sue me.
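Clunky or not, the name is easy to pull apart in code. Here's a minimal Python sketch, assuming the four fields are separated by "|" and that the position field looks like `Scaffold12_1000-2500,+`. The exact punctuation in the real files may differ, so check a few headers first. The example header, its Ensembl ID and its unique identifier are made-up placeholders, and `parse_mayball_name` is a hypothetical helper, not part of any package.

```python
from typing import NamedTuple


class MayballName(NamedTuple):
    gene: str
    ensembl_id: str
    unique_id: str
    scaffold: str
    start: int
    end: int
    strand: str


def parse_mayball_name(header: str) -> MayballName:
    """Split a transcript name into its parts.

    Assumed layout (verify against the real fasta headers):
        gene|ensembl_id|unique_id|scaffold_start-end,strand
    """
    fields = [f.strip() for f in header.lstrip(">").split("|")]
    if len(fields) != 4:
        raise ValueError(f"expected 4 '|'-separated fields, got {len(fields)}")
    gene, ensembl_id, unique_id, position = fields
    location, strand = (part.strip() for part in position.split(","))
    # Split on the LAST underscore, in case scaffold names contain underscores.
    scaffold, span = location.rsplit("_", 1)
    start, end = (int(x) for x in span.split("-"))
    return MayballName(gene, ensembl_id, unique_id, scaffold, start, end, strand)


# Placeholder header for illustration only.
parsed = parse_mayball_name(">rfx2|ENSG00000000000|example0001|Scaffold12_1000-2500,+")
print(parsed.gene, parsed.scaffold, parsed.start, parsed.end, parsed.strand)
```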
X. laevis has two pseudogenomes (it's a long story, you'll have to wait for the paper), and historically Xenopus researchers have referred to gene copies (homeologs) as "A" and "B" forms. These are now going to be named "L" and "S", referring to the long and short chromosomes of the pseudogenomes in X. laevis; the new names will be in future builds but aren't in this one. The Mayball naming convention doesn't discriminate between the two forms. As such, you'll see duplicated names for homeologs (e.g., you'll see two rfx2's), but you can discriminate between them using the unique gene identifier or positional information (they're generally on different scaffolds).
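If you want a quick list of gene names that show up more than once, something like the following Python sketch would do it. It assumes the "|"-separated name layout described above, with the gene ID first and the unique gene identifier third. The names in the example are placeholders, not real entries. Note that a shared name only flags a candidate homeolog pair; confirm it with the positional information.

```python
from collections import defaultdict


def group_duplicated_names(names):
    """Map each gene ID seen more than once to its unique identifiers.

    Assumes each name looks like gene|ensembl_id|unique_id|position.
    """
    by_gene = defaultdict(list)
    for name in names:
        fields = [f.strip() for f in name.lstrip(">").split("|")]
        gene, unique_id = fields[0], fields[2]
        by_gene[gene].append(unique_id)
    # Keep only the gene IDs that occur at least twice.
    return {gene: ids for gene, ids in by_gene.items() if len(ids) > 1}


# Placeholder names for illustration only.
names = [
    "rfx2|ENSG00000000000|example0001|Scaffold12_1000-2500,+",
    "rfx2|ENSG00000000000|example0002|Scaffold340_700-2100,-",
    "foxj1|ENSG00000000001|example0003|Scaffold88_5000-7200,+",
]
duplicated = group_duplicated_names(names)
print(duplicated)
```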
Finally, above find a gtf-formatted file containing all exonic positions of Mayball models in v7b. RNAseq reads can have a 3' bias, so it's a good idea to have an independent measure of promoters: to confirm active transcriptional start sites, one can check for overlap with the histone modification H3K4me3. Simon van Heeringen in Gert Veenstra's group and I each performed ChIPseq on H3K4me3 and a few other marks, and Taejoon used the data to refine the transcriptional start sites of the Mayball models.
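The overlap check itself is simple. Below is a minimal Python sketch, assuming you already have H3K4me3 peaks as (scaffold, start, end) tuples and candidate start sites as (scaffold, position) tuples in the same coordinate system. The function names, the 500 bp window and the example coordinates are all assumptions for illustration, not the procedure used to refine the Mayball models.

```python
from collections import defaultdict


def index_peaks(peaks):
    """Group (scaffold, start, end) peaks by scaffold, sorted by start."""
    by_scaffold = defaultdict(list)
    for scaffold, start, end in peaks:
        by_scaffold[scaffold].append((start, end))
    for intervals in by_scaffold.values():
        intervals.sort()
    return by_scaffold


def tss_has_peak(tss, peak_index, window=500):
    """Return True if any peak overlaps [position - window, position + window]."""
    scaffold, position = tss
    lo, hi = position - window, position + window
    for start, end in peak_index.get(scaffold, []):
        if start > hi:
            break  # peaks are sorted by start, so none further can overlap
        if end >= lo:
            return True
    return False


# Made-up peaks and start sites, purely for illustration.
peaks = [
    ("Scaffold12", 900, 1400),
    ("Scaffold12", 50000, 50600),
    ("Scaffold7", 200, 800),
]
peak_index = index_peaks(peaks)
supported = tss_has_peak(("Scaffold12", 1000), peak_index)
unsupported = tss_has_peak(("Scaffold12", 30000), peak_index)
print(supported, unsupported)
```

For real data sets with many peaks per scaffold, a dedicated interval tool would be faster than this linear scan, but the logic is the same.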
You can use this annotation file along with v7b to visualize experiments with IGV or other genome browsers. v7b combined with the Mayball models and my naming convention is also the default X. laevis infrastructure in place if you use HOMER, a fantastic package for sequence manipulation and motif-finding. If you're a frog person and you use HOMER and this infrastructure all together, your life will be a dream.
By mid-2016 we had accumulated a vast trove of RNAseq data across many conditions, along with histone modifications to map promoters and enhancers (H3K4me3 and H3K27ac) and transcription factors in wild-type tissue and in tissue converted to multiciliated cells (e2f4, myb, foxj1, and foxn4). Moreover, we got 3D chromosomal conformation data from both tissue types, along with a smattering of other sequencing experiments in neural tissue for a collaboration. Please download these data and use them!