Overview

The Gene Transfer Format (GTF) is a widely used format for storing gene annotations. You can obtain GTF files from the UCSC table browser and Ensembl. For example, the first few lines of UCSC’s gene annotation for hg19 look like the following:

chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg19_knownGene  exon    12613   12721   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg19_knownGene  exon    13221   14409   0.000000    +   .   gene_id "uc001aaa.3"; transcript_id "uc001aaa.3";
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1    hg19_knownGene  exon    12646   12697   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1    hg19_knownGene  exon    13221   14409   0.000000    +   .   gene_id "uc010nxr.1"; transcript_id "uc010nxr.1";
chr1    hg19_knownGene  start_codon 12190   12192   0.000000    +   .   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1    hg19_knownGene  CDS 12190   12227   0.000000    +   0   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1    hg19_knownGene  exon    11874   12227   0.000000    +   .   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";
chr1    hg19_knownGene  CDS 12595   12721   0.000000    +   1   gene_id "uc010nxq.1"; transcript_id "uc010nxq.1";

The columns are tab separated and are defined by the GTF standard specified here. PrimerSeq only uses the lines with the “exon” feature (column 3) and ignores other lines.
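As a rough illustration of the nine-column layout and the exon-only filter described above, here is a minimal Python sketch. It is not PrimerSeq's own parser; the names `GtfRecord`, `parse_attributes`, and `iter_exons` are hypothetical, and the attribute parsing assumes values do not contain semicolons.

```python
from typing import Dict, Iterable, Iterator, NamedTuple


class GtfRecord(NamedTuple):
    """One GTF line split into its nine tab-separated columns."""
    seqname: str
    source: str
    feature: str
    start: int
    end: int
    score: str
    strand: str
    frame: str
    attributes: Dict[str, str]


def parse_attributes(field: str) -> Dict[str, str]:
    """Turn 'gene_id "x"; transcript_id "y";' into a dict.

    Assumption: attribute values contain no ';' characters.
    """
    attrs = {}
    for item in field.strip().split(";"):
        item = item.strip()
        if not item:
            continue
        key, _, value = item.partition(" ")
        attrs[key] = value.strip().strip('"')
    return attrs


def iter_exons(lines: Iterable[str]) -> Iterator[GtfRecord]:
    """Yield only records whose feature (column 3) is 'exon'."""
    for line in lines:
        line = line.rstrip("\n")
        if not line or line.startswith("#"):
            continue  # skip blank lines and comments
        cols = line.split("\t")
        if len(cols) != 9 or cols[2] != "exon":
            continue  # ignore CDS, start_codon, malformed lines, etc.
        yield GtfRecord(cols[0], cols[1], cols[2], int(cols[3]),
                        int(cols[4]), cols[5], cols[6], cols[7],
                        parse_attributes(cols[8]))
```

Passing an open file handle to `iter_exons` would stream through the annotation one line at a time, so the whole file never needs to be held in memory just to inspect it.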

Be Careful

If you are particularly observant, you will notice that the above example GTF from the UCSC known gene annotation uses the same value for the gene ID and the transcript ID. This is because the GTF format requires a gene ID attribute, so UCSC simply fills in the transcript ID instead of a true gene ID. This example, like many GTFs, is also not sorted by chromosome and position. PrimerSeq requires a sorted GTF as input.
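To make "sorted by chromosome and position" concrete, the sketch below scans a GTF and reports the first record that is out of order. The exact ordering PrimerSeq expects is not spelled out here, so the key used (chromosome name compared as text, then start, then end) is an assumption, and `sort_key` and `first_unsorted_line` are hypothetical names rather than part of PrimerSeq or gtf.py.

```python
from typing import Iterable, Optional, Tuple


def sort_key(line: str) -> Tuple[str, int, int]:
    """Assumed ordering: chromosome (as text), then start, then end."""
    cols = line.rstrip("\n").split("\t")
    return (cols[0], int(cols[3]), int(cols[4]))


def first_unsorted_line(lines: Iterable[str]) -> Optional[int]:
    """Return the 1-based number of the first out-of-order line, or None."""
    previous = None
    for number, line in enumerate(lines, start=1):
        if not line.strip() or line.startswith("#"):
            continue  # blank lines and comments carry no position
        key = sort_key(line)
        if previous is not None and key < previous:
            return number
        previous = key
    return None
```

Applied to the UCSC excerpt above, a check like this would flag the fourth line, where the start position drops from 13221 back to 11874. Note that comparing chromosome names as text places "chr10" before "chr2"; whether that matches PrimerSeq's ordering is not confirmed here.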

I have listed ways to obtain sorted GTF files below, in suggested order.

1. Pre-sorted GTF files

If you are using a common annotation I strongly suggest you download it from the list below.

If you feel a widely used annotation is missing, feel free to suggest that I include it by email.

2. Sorting a GTF file from the command line

Since most RNA-Seq data analysis tasks are run on a server, I have provided a Python script to sort a GTF file. An example usage scenario for a GTF file called “your_gtf_file.gtf” is shown below:

$ wget --no-check-certificate https://raw.github.com/ctokheim/PrimerSeq/master/gtf.py -O gtf.py  # get command line script
$ python gtf.py -c your_gtf_file.gtf  # check if GTF is sorted
your_gtf_file.gtf is not correctly sorted. please sort before use.
$ python gtf.py -i your_gtf_file.gtf -o your_gtf_file.sorted.gtf  # GTF was not sorted, so sort it

3. Sorting a GTF file from PrimerSeq

PrimerSeq also includes the ability to handle such ill-specified GTF files. For any GTF file not downloaded from the PrimerSeq website, you should sort the GTF by Edit -> Sort GTF in the PrimerSeq GUI.

Warning! Depending on your platform and hardware, you may not have sufficient memory to sort a GTF. I recommend you use a computer with at least 4 GB of RAM. If this is not an option, you can use either SortGtf.java or gtf.py from the command line to properly sort your GTF.

Warning!

Your GTF must be sorted. Do not assume it is unless you downloaded the GTF from this website.
Open the sort dialog by Edit -> Sort GTF

Select the GTF you wish to sort using the “GTF:” button. Next, select the output file path for the sorted GTF by pressing the “Sorted GTF:” button. Now sort the GTF by pressing the “Sort” button. While PrimerSeq is sorting your GTF, the “Sort” button text changes to “Sorting …”. When sorting is finished, the button text returns to “Sort”.

Adding Gene IDs

Warning!

If you download a GTF from UCSC, you will need to add correct gene IDs.

If your GTF is from UCSC, you can use Edit -> Add Genes to add correct gene IDs. A dialog will appear and ask for your original GTF and a kgXref file. You can obtain a kgXref file from UCSC by doing the following:

  1. Go to the UCSC table browser
  2. Select Genes and Gene Prediction tracks from the group dropdown
  3. Select UCSC Genes from the track dropdown
  4. Select kgXref from the table dropdown
  5. Make sure the output format is ‘all fields from selected table’
  6. Click ‘get output’
Open the "Add Valid Gene IDs" dialog by Edit -> Add Genes

Select your GTF file from UCSC by pressing the “GTF:” button. You will also need to select the kgXref file you downloaded from UCSC by pressing the “kgXref” button. Next, choose the file name for the output GTF by pressing the “GTF W/ Genes” button. To start changing the gene IDs, press the “Change Gene IDs” button. The button is disabled while PrimerSeq is working. When it finishes, the “Change Gene IDs” button becomes available again.
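For readers curious about what replacing the gene IDs involves, the following Python sketch shows one plausible approach. It is not the code behind the dialog. It assumes the kgXref download is tab-separated with a header line beginning with "#" that names a "kgID" column (matching the GTF transcript ID) and a "geneSymbol" column; `load_gene_symbols` and `add_gene_ids` are hypothetical names.

```python
import re
from typing import Dict, Iterable, Iterator

GENE_ID = re.compile(r'gene_id "[^"]*"')
TRANSCRIPT_ID = re.compile(r'transcript_id "([^"]*)"')


def load_gene_symbols(kgxref_lines: Iterable[str]) -> Dict[str, str]:
    """Map known-gene IDs to gene symbols.

    Assumption: the first line is a header such as '#kgID<TAB>mRNA<TAB>...'
    containing columns named 'kgID' and 'geneSymbol'.
    """
    lines = iter(kgxref_lines)
    header = next(lines).rstrip("\n").lstrip("#").split("\t")
    id_col = header.index("kgID")
    symbol_col = header.index("geneSymbol")
    symbols = {}
    for line in lines:
        cols = line.rstrip("\n").split("\t")
        if len(cols) > max(id_col, symbol_col) and cols[symbol_col]:
            symbols[cols[id_col]] = cols[symbol_col]
    return symbols


def add_gene_ids(gtf_lines: Iterable[str],
                 symbols: Dict[str, str]) -> Iterator[str]:
    """Replace gene_id with the symbol looked up via transcript_id.

    Lines whose transcript has no entry in the mapping pass through unchanged.
    """
    for line in gtf_lines:
        match = TRANSCRIPT_ID.search(line)
        symbol = symbols.get(match.group(1)) if match else None
        if symbol:
            # A function replacement avoids regex escape handling in symbols.
            line = GENE_ID.sub(lambda _m: 'gene_id "%s"' % symbol, line, count=1)
        yield line
```

Leaving unmatched lines untouched is a deliberate choice in this sketch: a transcript without a symbol keeps its original (transcript-based) gene ID rather than being dropped.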

Advanced (use at your own risk)

PrimerSeq also allows you to mix GTF files together or use output from transcript assembly programs like Cufflinks. Like any normal GTF, output from programs like Cufflinks can be used as long as it is properly sorted (see above for details). In addition, you can combine two or more GTF files into a single input GTF file for PrimerSeq. Besides sorting, the only additional action you must perform is to select “Not Valid” from the Gene ID dropdown in the Optional tab of the GUI. You need to specify the gene IDs as “Not Valid” because the combined GTFs will likely not share gene IDs, even though their annotations may cover many of the same genes.
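Combining annotations amounts to concatenating the records and sorting the result. The Python sketch below illustrates that idea under the assumption that the required order is chromosome (compared as text), then start, then end; `combine_gtfs` and `sort_key` are hypothetical names, and for real input the provided gtf.py or the PrimerSeq sort dialog remains the safer way to produce the final sorted file.

```python
from typing import Iterable, List, Tuple


def sort_key(line: str) -> Tuple[str, int, int]:
    """Assumed ordering: chromosome (as text), then start, then end."""
    cols = line.rstrip("\n").split("\t")
    return (cols[0], int(cols[3]), int(cols[4]))


def combine_gtfs(sources: Iterable[Iterable[str]]) -> List[str]:
    """Merge records from several GTF inputs into one sorted list of lines.

    All records are held in memory at once, so large annotations need
    correspondingly more RAM.
    """
    records = []
    for lines in sources:
        for line in lines:
            if not line.strip() or line.startswith("#"):
                continue  # drop blank lines and per-file comments
            # Normalise line endings so the output can be written directly.
            records.append(line if line.endswith("\n") else line + "\n")
    records.sort(key=sort_key)
    return records
```

The merged lines could then be written out with `writelines` and loaded into PrimerSeq with the Gene ID option set to “Not Valid”, as described above.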