UCSCTableQuery-class {rtracklayer}    R Documentation

Querying UCSC Tables

Description

The UCSC genome browser is backed by a large database, which is exposed by the Table Browser web interface. Tracks are stored as tables, so this is also the mechanism for retrieving tracks. The UCSCTableQuery class represents a query against the Table Browser. Storing the query fields in a formal class facilitates incremental construction and adjustment of a query.

Details

There are six supported fields for a table query:

provider

The provider should be a session, a genome identifier, or a TrackHub URI. session: The UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, generally differs between sessions.

tableName

The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed (see below).

range

A genome identifier, a GRanges or an IntegerRangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for a subset, e.g. an entire chromosome.

hubUrl

The URI of the specific TrackHub

genome

The genome identifier within the specific TrackHub. It only needs to be provided if the provider is a TrackHub URI.

names

Names/accessions of the desired features

A common workflow for querying the UCSC database is to create an instance of UCSCTableQuery using the ucscTableQuery constructor, invoke tableNames to list the available tables for a track, and finally to retrieve the desired table either as a data.frame via getTable or as a track via track. See the examples.

The reason for a formal query class is to facilitate multiple queries when the differences between the queries are small. For example, one might want to query multiple tables within the same track and/or genomic region, or query the same table for multiple regions. The UCSCTableQuery instance can be incrementally adjusted for each new query. Some caching is also performed, which enhances performance.
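As a rough illustration of incremental adjustment, the sketch below reuses a single query object for two regions. The genome, table name and coordinates are assumptions chosen for illustration only (the second region in particular is arbitrary), and the code requires network access to UCSC.

```r
library(rtracklayer)

# Assumed inputs: "mm9" and "multiz30waySummary" are illustrative choices.
query <- ucscTableQuery("mm9", table = "multiz30waySummary")

# Hypothetical regions of interest on chr12 (start/end pairs)
starts <- c(57795963, 58000000)
ends   <- c(57815592, 58020000)

results <- vector("list", length(starts))
for (i in seq_along(starts)) {
  # Only the range changes; the rest of the query is reused as-is
  range(query) <- GRangesForUCSCGenome("mm9", "chr12",
                                       IRanges(starts[i], ends[i]))
  results[[i]] <- getTable(query)
}
```

Each element of results is then whatever getTable returned for the corresponding region.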

Constructor

ucscTableQuery(x, range = seqinfo(x), table = NULL, names = NULL, hubUrl = NULL, genome = NULL): Creates a UCSCTableQuery with the UCSCSession, genome identifier or TrackHub URI given as x and the table name given by the single string table. range should be a genome string identifier, a GRanges instance or IntegerRangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the provider. Feature names, such as gene identifiers, may be passed via names as a character vector.
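The different forms accepted by range might be used as in the following sketch. The genome "hg19", the chromosome and the coordinates are assumptions for illustration, and GRangesForUCSCGenome needs network access to look up the chromosome lengths.

```r
library(rtracklayer)

# Entire genome: a genome identifier string
whole_genome <- "hg19"

# Entire chromosome: omit the ranges argument
whole_chrom <- GRangesForUCSCGenome("hg19", "chr22")

# A sub-region of a chromosome (coordinates are arbitrary)
sub_region <- GRangesForUCSCGenome("hg19", "chr22",
                                   IRanges(20000000, 21000000))

# Any of the three can be passed as the range of a query
query <- ucscTableQuery("hg19", range = sub_region)
```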

Executing Queries

Below, object is a UCSCTableQuery instance.

track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.

getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.

tableNames(object): Gets the names of the tables available for the provider, table and range specified by the query.

Accessor methods

In the code snippets below, x/object is a UCSCTableQuery object.

genome(x), genome(x) <- value: Gets or sets the genome identifier (e.g. “hg18”) of the object.

hubUrl(x), hubUrl(x) <- value: Gets or sets the TrackHub URI.

tableName(x), tableName(x) <- value: Get or set the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.

range(x), range(x) <- value: Get or set the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or an IntegerRangesList.

names(x), names(x) <- value: Get or set the names of the features to retrieve. If NULL, this filter is disabled.

ucscSchema(x): Get the UCSCSchema object describing the selected table.

ucscTables(genome, track): Get the list of tables for the specified track (e.g. “Assembly”) and genome identifier (e.g. “hg19”). Here genome and track must each be a single non-NA string.
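A minimal sketch of the accessors in use, assuming the "hg18" genome and the "snp129" table from the examples are available; it requires network access to UCSC.

```r
library(rtracklayer)

query <- ucscTableQuery("hg18", table = "snp129")

# Inspect the current settings
tableName(query)
genome(query)

# Restrict the query to specific features, then remove the filter again
names(query) <- c("rs10003974", "rs10087355")
names(query)
names(query) <- NULL  # NULL disables the name filter
```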

Author(s)

Michael Lawrence

Examples

## Not run: 
# query using `session` provider
session <- browserSession()
genome(session) <- "mm9"
## choose the phastCons30way table for a portion of mm9 chr12
query <- ucscTableQuery(session, table = "phastCons30way",
                        range = GRangesForUCSCGenome("mm9", "chr12",
                                             IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## retrieve the track data
track(query)  # a GRanges object
## switch the query to the multiz30waySummary table
tableName(query) <- "multiz30waySummary"
## get a data.frame summarizing the multiple alignment
getTable(query)

# query using `genome identifier` provider
query <- ucscTableQuery("hg18", table = "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)

# query using `TrackHub URI` provider
query <- ucscTableQuery("https://ftp.ncbi.nlm.nih.gov/snp/population_frequency/TrackHub/20200227123210/",
                        genome = "hg19", table = "ALFA_GLB")
getTable(query)
# get the list of tables for 'Assembly' track and 'hg19' genome identifier
ucscTables("hg19", "Assembly")

## End(Not run)

[Package rtracklayer version 1.62.0 Index]