
ORIGINAL CONTRIBUTIONS

Transcript catalogs of human chromosome 21 and orthologous chimpanzee and mouse regions

Xiaolu Sturgeon • Katheleen J. Gardiner

Received: 16 December 2010 / Accepted: 17 February 2011 / Published online: 13 March 2011
© Springer Science+Business Media, LLC 2011

Abstract  A comprehensive representation of the gene content of the long arm of human chromosome 21 (Hsa21q) remains of interest for the study of Down syndrome, its associated phenotypic features, and mouse models. Here we compare transcript catalogs for Hsa21q, chimpanzee chromosome 21 (Ptr21q), and orthologous regions of mouse chromosomes 16, 17, and 10 for open reading frame (ORF) characteristics and conservation. The Hsa21q and mouse catalogs contain 552 and 444 gene models, respectively, of which only 162 are highly conserved. Hsa21q transcripts were used to identify orthologous exons in Ptr21q and assemble 533 putative transcripts. Transcript catalogs for all three organisms are searchable for nucleotide and amino acid sequence features of ORF length, repeat content, experimental support, gene structure, and conservation. For human and mouse comparisons, three additional summaries are provided: (1) the chromosomal distribution of novel ORF transcripts versus potential functional RNAs, (2) the distribution of species-specific transcripts within Hsa21q and mouse models of Down syndrome, and (3) the organization of sense–antisense and putative sense–antisense structures defining potential regulatory mechanisms. Catalogs, summaries, and nucleotide and amino acid sequences of all composite transcripts are available and searchable at http://gfuncpathdb.ucdenver.edu/iddrc/chr21/home.php.
These data sets provide comprehensive information useful for evaluation of candidate genes and mouse models of Down syndrome and for identification of potential functional RNA genes and novel regulatory mechanisms involving Hsa21q genes. These catalogs and search tools complement and extend information available from other gene annotation projects.

Electronic supplementary material  The online version of this article (doi:10.1007/s00335-011-9321-y) contains supplementary material, which is available to authorized users.

X. Sturgeon • K. J. Gardiner
Department of Pediatrics, Computational Biosciences Program, University of Colorado Denver, Aurora, CO 80045, USA

K. J. Gardiner (✉)
Intellectual and Developmental Disabilities Research Center, Neuroscience and Human Medical Genetics Programs, University of Colorado Denver, Mail Stop 8313, 12800 E. 19th Avenue, PO Box 6511, Aurora, CO 80045, USA
e-mail: katheleen.gardiner@ucdenver.edu

Mamm Genome (2011) 22:261–271
DOI 10.1007/s00335-011-9321-y

Introduction

Increases in the complexity of dbEST and mRNA databases, progress in the ENCyclopedia Of DNA Elements (ENCODE) project (ENCODE Project Consortium 2007), and manual curation efforts such as the VErtebrate Genome Annotation (VEGA) (Wilming et al. 2008) are generating thorough annotation of the gene content of human chromosomes that is graphically displayed in the UCSC Genome Browser and the Ensembl web site. However, an explicit gene catalog for Hsa21q that includes the nucleotide sequence of each gene transcript, equally comprehensive catalogs of putative orthologous transcripts in chimpanzee and mouse, and searchable information on transcript characteristics of open reading frames (ORFs) and conservation are not readily accessible. Here we discuss transcript catalogs for Hsa21q, chimpanzee chromosome 21 (Ptr21), and orthologous regions of mouse chromosomes 16, 17, and 10 (Mmu16, 17, and 10), where entries are supported by experimental evidence, specifically by automated assembly of high-quality mRNA sequences present in GenBank RefSeq, mRNA, and spliced EST databases. Recognizing the difficulties of reliable prediction of protein coding versus functional RNA genes (Mercer et al. 2009; Dinger et al. 2008), rather than assigning functional class for each transcript, we describe the characteristics of the ORF, including length, repeat content, and conservation. The resulting catalogs and resources include the following: (1) identification of 552 Hsa21q genes that include 161 RefSeq protein-coding genes; (2) identification of 444 genes in orthologous regions of Mmu16, 17, and 10 and characterization of their conservation in Hsa21q; (3) sequence comparison of 533 orthologous Ptr21q genes with Hsa21q; (4) annotation and comparison of sense–antisense gene pairs in human and mouse that suggest potential regulatory roles for non-RefSeq genes; and (5) description of the gene content of mouse segmental trisomies used to model Down syndrome. Results are provided as nucleotide and amino acid sequences of each transcript, and both as summaries and in detailed tables that describe, for each splice variant of each gene, genomic features, number of supporting EST sequences, open reading frame characteristics, and conservation. Data are searchable for user-specified features, thus allowing comparative retrieval of subsets of transcripts and genes for the three organisms. These data sets extend and refine previous descriptions of these chromosomes (Hattori et al. 2000; Gardiner et al. 2003; Watanabe et al. 2004).

Materials and methods

Hsa21q sequence data sets

Hsa21q genomic sequence and transcript data sets for RefSeq cDNAs, mRNAs, and spliced ESTs were retrieved from the UCSC hg19 database (12/07/10) (http://genome.ucsc.edu).
In-house software TLocate, applied to the transcript-genomic alignments, identified exon locations, splice sites, and the strand of each transcript.

After manual review of the data, several exclusion criteria were established to eliminate likely cloning artifacts and ensure that only high-quality EST and mRNA sequences were used in subsequent analysis. The in-house software TCleanse was used to eliminate ESTs and mRNAs with the following features: ESTs and intronless mRNAs with ≥80% repeat content or located entirely within an exon of another transcript; intronless mRNAs less than 500 nt in length or having a genomic polyA sequence ≥6A; intronless ESTs and those with total length less than 100 nt, an intron less than 25 nt in a single-intron EST, a majority of nonconsensus splice sites (accepting consensus splice sites as GT:AG, GC:AG, and AT:AC), or one exon of a two-exon transcript less than 10 nt in length or less than 15 nt in length with low complexity (≥90% of the exon is a single nucleotide repeat). ESTs and mRNAs with these features typically were unique and/or clearly artifactual representations of a well-established gene structure. Lastly, ESTs and mRNAs with matches in multiple genomic locations and where the best match was not the target chromosome region were excluded. With these criteria, the original 72,149 HSA21q transcripts were reduced to 32,548.

Generation of the HSA21q gene catalog

A flowchart for processing Hsa21q gene information is shown in Fig. 1. We developed the computational tool RCDAgene, composed of RCluster and DAssemble (unpublished), which is designed to cluster ESTs and assemble a composite transcript for each group. RCluster was used to cluster 32,548 Hsa21q transcripts into independent gene groups based on partial or complete exon overlaps. The RefSeq database at NCBI annotates 210 protein-coding genes and 56 RNA genes on Hsa21q. We refer to the first as RefSeqP genes.
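RCluster itself is unpublished, so its exact procedure is not reproduced here. As a minimal sketch of what clustering transcripts into gene groups by partial or complete exon overlap might look like, the following uses a union-find structure. The function names, the input layout, the half-open coordinates, and the restriction to same-strand overlaps are all assumptions made for illustration, not details taken from RCluster.

```python
from collections import defaultdict


def exons_overlap(a, b):
    """True if two half-open exon intervals (start, end) share at least one base."""
    return a[0] < b[1] and b[0] < a[1]


def cluster_transcripts(transcripts):
    """Group transcripts that share any exon overlap on the same strand.

    transcripts: dict mapping name -> (strand, [(start, end), ...])
    Returns a sorted list of name lists, one list per gene group.
    Assumption: only same-strand overlaps join two transcripts.
    """
    names = sorted(transcripts)
    parent = {n: n for n in names}

    def find(n):
        # Follow parent links to the group representative, compressing the path.
        while parent[n] != n:
            parent[n] = parent[parent[n]]
            n = parent[n]
        return n

    # Pairwise comparison: quadratic, fine for a sketch but not for 32,548 inputs.
    for i, a in enumerate(names):
        strand_a, exons_a = transcripts[a]
        for b in names[i + 1:]:
            strand_b, exons_b = transcripts[b]
            if strand_a != strand_b:
                continue
            if any(exons_overlap(x, y) for x in exons_a for y in exons_b):
                parent[find(a)] = find(b)

    groups = defaultdict(list)
    for n in names:
        groups[find(n)].append(n)
    return sorted(groups.values())


example = {
    "t1": ("+", [(100, 200), (300, 400)]),
    "t2": ("+", [(350, 500)]),   # overlaps the second exon of t1
    "t3": ("-", [(120, 180)]),   # opposite strand, kept separate
    "t4": ("+", [(1000, 1100)]),
}
groups = cluster_transcripts(example)
```

A production version would sort exons by coordinate and sweep along the chromosome rather than compare every pair of transcripts.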
We do not refer to RefSeq RNA genes as a class because they are not reliably distinguished from protein-coding genes (Dinger et al. 2008). RefSeqP genes include 49 keratin-associated protein (KRTAP) genes, a subset of which has been described elsewhere (Shibuya et al. 2004). Because these are intronless and frequently not associated with ESTs, they are not distinguishable from pseudogenes without experimental analysis and are not discussed here. This reduces the number of RefSeqP genes for analysis to 161. RefSeqP mRNAs were retained as the definitive gene structure with the exception of Claudin 17 (CLDN17), which is represented by the more complete mRNA sequence AY358094. We note also that previous analysis has shown that the mitochondrial ribosomal protein S6 (MRPS6) and the solute carrier family 5 (sodium/myo-inositol cotransporter) member 3 (SLC5A3) share a first exon (Gardiner et al. 2003). In spite of this overlap, they are annotated as separate genes because of their functional properties. The remaining EST/mRNA groups comprise 391 nonredundant non-RefSeqP transcript groups. DAssemble was used to generate a composite nucleotide transcript sequence for each non-RefSeqP gene, representing the most complete transcript obtainable for that gene and its major splice variants. For this assembly, genomic sequence was used as the reference rather than the cDNAs. As a measure of the strength of experimental support for each gene, we count the number of mRNAs plus ESTs supporting each exon of the structure and report the largest number.

Fig. 1  Flowchart for processing transcript information

X. Sturgeon, K. J. Gardiner: Transcripts from HSA21, PTR21, and mouse orthologs

Generation of the orthologous PTR21q and mouse gene catalogs

The computational tool HtoP_TMap was developed to identify orthologous exons of Hsa21q in PTR21q genomic sequence and to assemble putative orthologous chimpanzee transcripts.
HtoC_TMap was applied to Hsa21q RefSeqP and non-RefSeqP composite transcripts from RCDAgene and the UCSC net file generated from a human–chimp BLASTZ alignment. HtoC_TMap results were validated through comparison of human–chimp exon number, splice site quality, and RefSeq protein-coding gene ORF conservation. This identified 533 orthologous PTR21q genes (excluding KRTAP genes).

The gene catalog for orthologous regions of mouse chromosomes 16, 17, and 10 (gene and genomic boundaries are provided in Table 1) was generated by the same methods as for Hsa21q, using genomic sequence and transcript databases for mouse RefSeq protein genes, mRNAs, and spliced ESTs retrieved from UCSC mm9 (12/07/2010).

cDNA nucleotide and ORF comparisons

Hsa21q composite transcripts were aligned with orthologous Ptr21q predicted transcripts using BLASTN, and the overall percent identity was calculated relative to the human transcript sequence. BLASTP between the human and the orthologous chimpanzee ORFs provides the percent amino acid similarity relative to the human ORF. If a transcript contains more than one ORF, the representative ORF is chosen as the longest ORF that is conserved in chimpanzee.

BLASTN was used to align Hsa21q transcripts with mouse transcripts mapping to syntenic mouse chromosomal regions. A gene is classed as conserved at the transcript level if the alignment showed ≥70% identity over 300 or more nucleotides or if E ≤ 1e−10. To detect exons of orthologous genes that lack identified transcripts, BLASTN was used to align Hsa21q transcripts to the syntenic mouse genomic sequence. A gene was classed as conserved at the genomic level if the alignment showed ≥70% identity over more than 100 contiguous nucleotides, or if at least one exon was ≥70% identical, or if the entire cDNA was ≥60% identical.
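The two conservation rules for human–mouse comparisons reduce to simple threshold tests. The sketch below restates them as predicates; the function and parameter names are hypothetical, and the inputs are assumed to have been extracted beforehand from BLASTN results (percent identity, aligned length, E-value, and per-exon and whole-cDNA identities), which is not something the text specifies.

```python
def conserved_at_transcript_level(pct_identity, aligned_length, evalue):
    """Transcript-level rule: >=70% identity over >=300 nt, or E <= 1e-10."""
    return (pct_identity >= 70.0 and aligned_length >= 300) or evalue <= 1e-10


def conserved_at_genomic_level(hsp_identity, hsp_length,
                               exon_identities, cdna_identity):
    """Genomic-level rule, true if any one of three conditions holds:

    - >=70% identity over more than 100 contiguous nucleotides,
    - at least one exon >=70% identical,
    - the entire cDNA >=60% identical.
    """
    return (
        (hsp_identity >= 70.0 and hsp_length > 100)
        or any(identity >= 70.0 for identity in exon_identities)
        or cdna_identity >= 60.0
    )


# Hypothetical values for illustration only.
strong_hit = conserved_at_transcript_level(85.0, 450, 1e-3)
short_but_significant = conserved_at_transcript_level(65.0, 120, 1e-20)
weak_hit = conserved_at_transcript_level(65.0, 120, 0.5)
one_exon_only = conserved_at_genomic_level(60.0, 80, [45.0, 72.0], 50.0)
```

Keeping the two rules as separate predicates mirrors the text, where genomic-level conservation is only consulted for genes that lack an identified orthologous transcript.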
A similar reciprocal alignment of mouse transcripts to Hsa21q was carried out.

Sense–antisense transcript pair annotation

Antisense transcript pairs were identified as transcripts on opposite strands having one or more complete or partially complementary exons. These are described as 5′–5′ or 3′–3′ when the genes partially overlap, and as "internal" when one gene is entirely contained within the genomic span of the other. Potential antisense pairs were identified as transcripts on opposite strands in a 5′–5′ and a 3′–3′ orientation with an intergenic distance less than 5 kb.

Complete transcript catalogs, including splice variants, for Hsa21q, Ptr21q, and Mmu16, 17, and 10 segments, nucleotide sequences of complete transcripts, amino acid sequences of longest ORFs, and sense–antisense gene tables are available at http://gfuncpathdb.ucdenver.edu/iddrc/chr21/home.php. The GeneQuest tool allows the user to retrieve lists of genes with values for these features specified. In all tables, genes are listed from centromere to telomere. Additional information for each gene and variant includes accession number, exon number, number of supporting ESTs, genomic start and stop, cDNA length, ORF length and location within cDNA sequence, % repeat content of cDNA and ORF, nucleotide identity, and ORF similarity between human and chimpanzee transcripts and between human and mouse.

Table 1  Mouse chromosomal regions orthologous to Hsa21q

           Centromere proximal      Telomere proximal
Mmu Chr    Gene      Start nt       Gene        Stop nt
16         Lipi      75540789       Telomere    Telomere
17         Umodl1    31091628       Rrp1b       32197549
10         Prmt2     75669967       Pdxk        77899492

Results

Hsa21q gene classification

Clustering and assembly of Hsa21q transcripts identified 552 independent gene models (excluding 49 KRTAP genes). Of these, 161 are annotated with RefSeq protein sequences; we refer to this set as RefSeqP genes.
The remaining 391 gene models are referred to as novel genes or non-RefSeqP genes (we discuss RefSeq RNA gene annotation below). The complete list of genes is provided in Supplementary Table 1. We examined transcripts for features of nucleotide and open reading frame (ORF) sequence, leading to the separation of genes into five categories. We first considered ORFs ≥50 amino acids in length, with <25% repeat content and ≥90% similarity to the orthologous Ptr ORF. This ORF length, which is lower than those used by some groups, was chosen because 7% of proteins in the SwissProt database have ORFs within the range of 60–100 amino acids and 0.7% are in the range of 50–60 amino acids (e.g., FAM165B and PCP4 that map to Hsa21q are 58 and 62 amino acids, respectively). We chose a cutoff of 25% repeat content because, although rare, some RefSeqP ORFs (including Hsa21 C21orf7) meet this criterion.

Transcripts in Classes 1 and 2 have complete ORFs that likely are not subject to Nonsense Mediated Decay (NMD), based on the location of the last intron within the coding region relative to the stop codon. They differ only in that the ORFs in Class 1 genes have at least one intron within the ORF, while the ORFs in Class 2 genes are intronless. Requiring splicing across an intron to maintain an ORF adds weight to novel protein-coding potential, as discussed in VEGA annotations (Harrow et al. 2007). Class 3 genes have ORFs that lack either a Met or a stop codon, or have an intron location that creates a potential target for NMD. Entries in Class 3 may represent incomplete gene structures obscuring significant ORFs or alternative splice variants of transcripts that are easier to classify. Due to the limited information and ambiguous nature of these genes, we do not discuss them further.
Class 4 genes have ORFs that are either less than 50 amino acids in length, lack conservation in chimpanzee (although the nucleotide sequence is conserved), or contain 25% or more repetitive sequences. Class 5 genes cannot be accurately evaluated for conservation in chimpanzee due to one or more exons falling within gaps in the PTR genomic sequence.

Numbers and characteristics of RefSeqP and novel genes in the five classes are summarized in Table 2 (details for each gene in each class are available in Supplementary Tables 2–6). Note that when a gene has been identified with one or more splice variants, it has been assigned to the highest class possible considering ORF characteristics of all splice variants. Column 3 of Table 2 shows that while the majority of RefSeqP genes (146 of 161) are in Class 1, 15 are in Class 2. Classes 1 and 2 also contain 50 and 99 novel genes, respectively; we call these non-RefSeq ORF genes or novel ORF genes. Column 4 in Table 2 shows that non-RefSeq ORF genes are located within genomic regions of base composition similar to RefSeqP genes, i.e., 47%. Indeed, as shown in Fig. 2a, b, the distribution of novel ORF genes is not significantly different from that of RefSeqP genes. Both groups show telomeric enrichment, where 66 (41%) RefSeqP and 62 (42%) novel ORF genes are located within a 5-Mb segment (<10% of the chromosome). Both also show a relative "gene desert" in 21q21, where only one RefSeqP and seven (5%) novel ORF genes are located within a 6-Mb segment. Novel ORF genes, however, do differ from RefSeqP genes in several features: novel ORFs averaged for Class 1 plus 2 are shorter (92 vs. 639 amino acids) and transcripts are supported by fewer ESTs (3 on average vs. 134). Importantly, while the majority of RefSeqP genes are annotated with some functional domain or motifs, none of the novel ORFs has any significant functional association. We note that 11 and 16 genes from Class 1 and Class 2, respectively, are annotated in NCBI as RefSeq RNA genes.
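The five class definitions given in this section can be read as a short decision procedure over the features of a transcript's representative ORF. The sketch below is one possible reading, not the authors' software: the `Orf` record, its field names, and the order in which the tests are applied (Class 5 first, then Class 4, then Class 3) are assumptions, since the text defines the classes but not the evaluation order.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class Orf:
    """Features of a transcript's representative ORF (hypothetical record)."""
    length_aa: int
    repeat_pct: float
    chimp_similarity_pct: Optional[float]  # None: exons fall in Ptr sequence gaps
    has_start: bool       # ORF begins with a Met codon
    has_stop: bool        # ORF ends with a stop codon
    nmd_target: bool      # intron position makes the transcript a putative NMD target
    intron_in_orf: bool   # at least one intron lies within the ORF


def assign_class(orf: Orf) -> int:
    """Return the class (1-5) for one ORF, using the thresholds stated in the text."""
    if orf.chimp_similarity_pct is None:
        return 5  # conservation in chimpanzee cannot be evaluated
    if (orf.length_aa < 50 or orf.repeat_pct >= 25.0
            or orf.chimp_similarity_pct < 90.0):
        return 4  # too short, too repetitive, or ORF not conserved
    if not (orf.has_start and orf.has_stop) or orf.nmd_target:
        return 3  # incomplete ORF or putative NMD target
    return 1 if orf.intron_in_orf else 2


# Hypothetical ORFs for illustration only.
spliced = Orf(62, 0.0, 98.0, True, True, False, True)
intronless = Orf(75, 10.0, 95.0, True, True, False, False)
no_stop = Orf(68, 5.0, 93.0, True, False, False, True)
too_short = Orf(40, 0.0, 99.0, True, True, False, True)
in_gap = Orf(120, 0.0, None, True, True, False, True)
classes = [assign_class(o) for o in (spliced, intronless, no_stop, too_short, in_gap)]
```

For a gene with several splice variants, the text assigns the gene to the highest class possible across its variants; that step is left out here because it depends on how "highest" is ordered.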
The difference between our assignment of these genes and their RefSeq RNA annotation derives from our use of a 50-amino-acid minimal ORF (vs. 100 in RefSeq) and our requirement of ORF conservation only in chimpanzee (vs. mouse). This is consistent with our goal of describing ORFs rather than assigning functional class.

The largest class is Class 4, with 192 gene models. Thirty-one high-confidence Class 4 genes are supported by three or more ESTs, and they are not, on average, highly repetitive. Twenty are included in RefSeq as RNA genes. In contrast to novel ORF genes, Class 4 genes appear uniformly distributed within HSA21q, as shown in Fig. 2c. Consistent with this distribution, they are also more often found in lower-GC%-content regions (average 42%).

Table 2  Numbers and features of Hsa21q genes conserved in Ptr21q

                                          ORF length (AA)      Transcript support(c)   % unique sequence (cDNA)
Class(a)  Gene type  No. genes  %GC(b)    Average  Range       Average  Range          Average  Range
1         RefSeqP    146        46        641      59–3337     144      5–1481         97       23–100
          Non        47         48        95       52–229      4        1–29           92       17–100
2         RefSeqP    15         47        321      75–1159     36       4–133          89       32–100
          Non        99         47        86       51–245      3        1–52           79       15–100
3         Non        31         48        68       52–135      3        1–16           81       26–100
4         Non        192        42        NA       NA          2        1–29           71       2–100

(a) Class 1, complete ORF ≥50 aa, <25% repeat, intron within ORF; class 2, complete ORF ≥50 aa, <25% repeat, intronless ORF; class 3, ORF ≥50 aa lacks Met and/or Stop codon, or ORF is a putative target of NMD; class 4, no ORF ≥50 aa or ORF ≥50 aa is ≥25% repeat, or the ORF is not conserved in chimpanzee
(b) %GC, base composition of genomic region
(c) Transcript support, largest number of ESTs supporting any single exon of an individual gene

Comparisons of Hsa21q and Mmu16, 17, and 10 RefSeqP and novel genes

A similar analysis of the orthologous regions of mouse chromosomes 16, 17, and 10 identified 444 genes (without KRTAP genes) that include orthologs of 157 Hsa21q RefSeqP genes (Supplementary Table 7).
Relative order of all RefSeqP genes is also conserved. Absent in mouse are the Hsa21 gene POTED, which maps proximal to LIPI in the pericentromeric region of Hsa21q, and TCP10L, DSCR4, and PLAC4, which map within the Mmu16 syntenic segment. There are four mouse-specific RefSeqP genes, all mapping to Mmu16: the multiexon Itgb2l gene that maps between Igsf5 and Pcp4; the two intronless transcripts 2310079G19Rik and 2310061NO2Rik that map within the proximal KRTAP cluster; and 4930563D23Rik that maps between Fam165b and Kcne1 (but within introns of a splice variant of the Hsa21 composite FAM165b/C21orf51 gene).

Fig. 2  Gene distribution within HSA21q. Bars indicate the number of genes in each 1-Mb bin within HSA21q. a RefSeqP genes. b Non-RefSeqP novel ORF genes from Classes 1 and 2. c Non-RefSeqP genes from Class 4, i.e., lacking ORFs >50 aa (amino acids). Gene location is defined by the centromere proximal end. Brackets indicate the "gene desert" in a GC-poor 6-Mb segment of 21q21 and the most "gene-rich" GC-rich 5-Mb segment in the telomeric region. Locations of selected HSA21q RefSeqP genes are indicated for reference

Of the additional 284 mouse non-RefSeqP genes, 144 encode nonrepetitive ORFs 50 or more amino acids in length (Classes 1 and 2), 8 have incomplete ORFs or are predicted targets of NMD (Class 3), and 132 lack such ORFs (Class 4) (Supplementary Tables 8–11). As with Hsa21q, mouse non-RefSeqP genes have lower levels of