UCSCTableQuery-class {rtracklayer} | R Documentation
Querying UCSC Tables
Description
The UCSC genome browser is backed by a large database, which is
exposed by the Table Browser web interface. Tracks are stored as
tables, so this is also the mechanism for retrieving tracks. The
UCSCTableQuery class represents a query against the Table Browser.
Storing the query fields in a formal class facilitates incremental
construction and adjustment of a query.
Details
There are six supported fields for a table query:
- provider
The provider should be a session, a genome identifier, or a TrackHub URI.
session: The UCSCSession instance from which the tables are retrieved. Although all sessions are based on the same database, the set of user-uploaded tracks, which are represented as tables, is not the same, in general.
- tableName
The name of the specific table to retrieve. May be NULL, in which case the behavior depends on how the query is executed; see below.
- range
A genome identifier, a GRanges or an IntegerRangesList indicating the portion of the table to retrieve, in genome coordinates. Simply specifying the genome string is the easiest way to download data for the entire genome, and GRangesForUCSCGenome facilitates downloading data for e.g. an entire chromosome.
- hubUrl
The URI of the specific TrackHub.
- genome
A genome identifier of the specific TrackHub. It only needs to be provided if the provider is a TrackHub URI.
- names
Names/accessions of the desired features.
A common workflow for querying the UCSC database is to create an
instance of UCSCTableQuery using the ucscTableQuery constructor,
invoke tableNames to list the available tables for a track, and
finally to retrieve the desired table either as a data.frame via
getTable or as a track via track. See the examples.
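The workflow above can be sketched as a small helper. This is an illustrative sketch, not part of rtracklayer: the function name get_table_checked is hypothetical, and it assumes that tableNames returns a character vector that contains the requested name whenever the table is available. Running the usage portion requires network access to UCSC.

```r
library(rtracklayer)

## Hypothetical helper (not part of rtracklayer): confirm that a table is
## listed for the query before retrieving it as a data.frame.
get_table_checked <- function(query, table) {
  available <- tableNames(query)      # tables available for this query
  if (!(table %in% available)) {
    stop("table '", table, "' is not available for this query")
  }
  tableName(query) <- table           # select the table on a local copy
  getTable(query)                     # retrieve it as a data.frame
}

## Usage is guarded because it contacts the UCSC Table Browser.
if (interactive()) {
  query <- ucscTableQuery("mm9",
                          range = GRangesForUCSCGenome("mm9", "chr12",
                                                       IRanges(57795963, 57815592)))
  head(get_table_checked(query, "multiz30waySummary"))
}
```

Because R copies the query object on modification inside the function, the caller's query keeps its original table name.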
The reason for a formal query class is to facilitate multiple queries
when the differences between the queries are small. For example, one
might want to query multiple tables within the same track and/or the
same genomic region, or query the same table for multiple regions. The
UCSCTableQuery instance can be incrementally adjusted for each new
query. Some caching is also performed, which enhances performance.
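As a minimal sketch of this kind of incremental adjustment, the helper below reuses one query and changes only its range between retrievals. The name fetch_regions is hypothetical (it is not part of rtracklayer), the second region in the usage code is an arbitrary adjacent window chosen for illustration, and the usage portion assumes network access to UCSC.

```r
library(rtracklayer)

## Hypothetical helper (not part of rtracklayer): query the same table for
## several regions by adjusting only the range of a single query object.
fetch_regions <- function(query, regions) {
  lapply(regions, function(region) {
    range(query) <- region   # adjust the region on a local copy of the query
    getTable(query)          # retrieve the table for that region
  })
}

## Usage is guarded because it contacts the UCSC Table Browser.
if (interactive()) {
  query <- ucscTableQuery("mm9", table = "multiz30waySummary")
  regions <- list(
    first  = GRangesForUCSCGenome("mm9", "chr12", IRanges(57795963, 57815592)),
    second = GRangesForUCSCGenome("mm9", "chr12", IRanges(57815593, 57835222))
  )
  tables <- fetch_regions(query, regions)
  vapply(tables, nrow, integer(1))   # number of rows retrieved per region
}
```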
Constructor
- ucscTableQuery(x, range = seqinfo(x), table = NULL, names = NULL, hubUrl = NULL, genome = NULL): Creates a UCSCTableQuery with the UCSCSession, genome identifier or TrackHub URI given as x and the table name given by the single string table. range should be a genome string identifier, a GRanges instance or an IntegerRangesList instance, and it effectively defaults to genome(x). If the genome is missing, it is taken from the provider. Feature names, such as gene identifiers, may be passed via names as a character vector.
Executing Queries
Below, object is a UCSCTableQuery instance.
- track(object): Retrieves the indicated table as a track, i.e. a GRanges object. Note that not all tables are available as tracks.
- getTable(object): Retrieves the indicated table as a data.frame. Note that not all tables are output in parseable form, and that UCSC will truncate responses if they exceed certain limits (usually around 100,000 records). The safest (and most efficient) bet for large queries is to download the file via FTP and query it locally.
- tableNames(object): Gets the names of the tables available for the provider, table and range specified by the query.
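For moderately large regions, one possible way to keep each response smaller is to split the region into tiles and retrieve the table tile by tile. The sketch below is an assumption-laden illustration, not a documented rtracklayer feature: split_region and get_table_in_chunks are hypothetical names, tile comes from GenomicRanges, and there is no guarantee that any given tile stays under the UCSC limit. For genuinely large queries the FTP route described above remains the safer choice.

```r
library(rtracklayer)   # attaches GenomicRanges, which provides tile()

## Hypothetical helper: split a single-range GRanges into n contiguous tiles.
split_region <- function(region, n) {
  stopifnot(length(region) == 1)
  tile(region, n = n)[[1]]
}

## Hypothetical helper: retrieve a table tile by tile and combine the parts.
## Features spanning a tile boundary may be returned twice, so rows that are
## exact duplicates are dropped at the end.
get_table_in_chunks <- function(query, region, n = 4) {
  tiles <- split_region(region, n)
  parts <- lapply(seq_along(tiles), function(i) {
    range(query) <- tiles[i]   # restrict a local copy of the query to one tile
    getTable(query)            # requires network access to UCSC
  })
  unique(do.call(rbind, parts))
}
```

A call such as get_table_in_chunks(query, GRangesForUCSCGenome("mm9", "chr12"), n = 10) would then issue ten smaller requests instead of one large one.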
Accessor methods
In the code snippets below, x/object is a UCSCTableQuery object.
- genome(x), genome(x) <- value: Get or set the genome identifier (e.g. “hg18”) of the object.
- hubUrl(x), hubUrl(x) <- value: Get or set the TrackHub URI.
- tableName(x), tableName(x) <- value: Get or set the single string indicating the name of the table to retrieve. May be NULL, in which case the table is automatically determined.
- range(x), range(x) <- value: Get or set the GRanges indicating the portion of the table to retrieve in genomic coordinates. Any missing information, such as the genome identifier, is filled in using range(browserSession(x)). It is also possible to set the genome identifier string or an IntegerRangesList.
- names(x), names(x) <- value: Get or set the names of the features to retrieve. If NULL, this filter is disabled.
- ucscSchema(x): Get the UCSCSchema object describing the selected table.
- ucscTables(genome, track): Get the list of tables for the specified track (e.g. “Assembly”) and genome identifier (e.g. “hg19”). Here genome and track must each be a single non-NA string.
Author(s)
Michael Lawrence
Examples
## Not run:
# query using `session` provider
session <- browserSession()
genome(session) <- "mm9"
## choose the phastCons30way table for a portion of mm9 chr12
query <- ucscTableQuery(session, table = "phastCons30way",
                        range = GRangesForUCSCGenome("mm9", "chr12",
                                                     IRanges(57795963, 57815592)))
## list the table names
tableNames(query)
## retrieve the track data
track(query) # a GRanges object
## get the multiz30waySummary track
tableName(query) <- "multiz30waySummary"
## get a data.frame summarizing the multiple alignment
getTable(query)
# query using `genome identifier` provider
query <- ucscTableQuery("hg18", table = "snp129",
                        names = c("rs10003974", "rs10087355", "rs10075230"))
ucscSchema(query)
getTable(query)
# query using `TrackHub URI` provider
query <- ucscTableQuery("https://ftp.ncbi.nlm.nih.gov/snp/population_frequency/TrackHub/20200227123210/",
                        genome = "hg19", table = "ALFA_GLB")
getTable(query)
# get the list of tables for 'Assembly' track and 'hg19' genome identifier
ucscTables("hg19", "Assembly")
## End(Not run)