data_handling Module

Functions for handling data.

data_handling.request

Data handling for request step.

gtexquery.data_handling.request.gtex_request(region: str, gene: str, output: str) → None

Make a thread-safe GTEx request against mediantranscriptexpression.

If gene starts with “ENSG”, a query is made to GTEx. Otherwise, no query is made and no file is created. This behavior is designed for use with Snakemake checkpoints.

A thread-local session is provided by a call to _get_session. This allows sessions to be reused, which, among other things, provides a significant speedup.

Parameters
  • region (str) – The GTEx region to query.

  • gene (str) – The Ensembl gene ID (ENSG) to query.

  • output (str) – Where to save the output file.

Raises

requests.HTTPError – When the GET request returns an error
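
The thread-local session reuse described for gtex_request can be sketched roughly as follows. This is a minimal illustration, not the module's actual _get_session: the names get_session and FakeSession are hypothetical, and FakeSession merely stands in for requests.Session so that the sketch runs without third-party packages.

```python
import threading

# Hypothetical module-level storage. Attributes set on a threading.local
# object are visible only to the thread that set them.
_thread_local = threading.local()


class FakeSession:
    """Stand-in for requests.Session (assumption, keeps the sketch stdlib-only)."""


def get_session(factory=FakeSession):
    """Return the calling thread's session, creating it on first use."""
    if not hasattr(_thread_local, "session"):
        _thread_local.session = factory()
    return _thread_local.session
```

Repeated calls from one thread return the same object, so connections can be reused, while each worker thread gets its own session and never shares it with another thread.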

gtexquery.data_handling.request.lut_check(gene: str, lut: pandas.core.frame.DataFrame) → str

Check that a gene is found in the Gencode annotations.

If the gene is found, then it is converted to its Ensembl ID. If it is not found, then the gene name is returned. The found status can be queried by seeing if the resulting string starts with “ENSG”, a pattern that will only occur for Ensembl IDs.

Note

It’s likely that your gene is in Gencode even if it is not found. Common reasons (at least for me!) that a gene might not be found include spelling errors and name errors (i.e., using NGN2 instead of NEUROG2).

Parameters
  • gene (str) – The gene name to be queried

  • lut (pd.DataFrame) – The dataframe containing the name-to-id conversion for the genes

Returns

The Ensembl ID if the gene is found in lut, otherwise the unchanged gene name.

Return type

str

Example

>>> lut = pd.DataFrame.from_dict({"name": ["ASCL1"], "id": ["ENSG00000139352.3"]})
>>> lut_check("ASCL1", lut)
'ENSG00000139352.3'
>>> lut_check("NotAGene", lut)
'NotAGene'
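
The found status can be checked with a plain string test on the value returned by lut_check. The helper name is_ensembl_id is hypothetical and not part of gtexquery; it only wraps the “starts with ENSG” convention described above.

```python
def is_ensembl_id(result: str) -> bool:
    """Return True if a lut_check result looks like an Ensembl gene ID."""
    return result.startswith("ENSG")


# Values taken from the lut_check example above.
for result in ("ENSG00000139352.3", "NotAGene"):
    status = "found" if is_ensembl_id(result) else "not found"
    print(f"{result}: {status}")
```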

data_handling.biomart

Data handling for biomart step.

gtexquery.data_handling.biomart.XML_QUERY

A lambda function encapsulating the unwieldy XML query string required by Biomart. The list of transcripts is joined to form the ensembl_transcript_id field.

Type

Callable[[list[str]], str]
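
As a rough idea of what such a callable might look like, here is a sketch that fills a BioMart XML template with comma-joined transcript IDs. It is not the module's actual XML_QUERY: the names xml_query and _TEMPLATE, the dataset name, and the requested attributes are all assumptions for illustration, and a plain function is used in place of a lambda. Only the ensembl_transcript_id filter comes from the description above.

```python
from typing import List

# Hypothetical template. The dataset and the Attribute entries are assumed;
# the real query may request different fields.
_TEMPLATE = """<?xml version="1.0" encoding="UTF-8"?>
<!DOCTYPE Query>
<Query virtualSchemaName="default" formatter="TSV" header="1" uniqueRows="1" datasetConfigVersion="0.6">
    <Dataset name="hsapiens_gene_ensembl" interface="default">
        <Filter name="ensembl_transcript_id" value="{ids}"/>
        <Attribute name="ensembl_transcript_id"/>
        <Attribute name="transcript_biotype"/>
    </Dataset>
</Query>"""


def xml_query(transcripts: List[str]) -> str:
    """Join the transcript IDs with commas and place them in the filter value."""
    return _TEMPLATE.format(ids=",".join(transcripts))


# Placeholder transcript IDs, used only to show the joined filter value.
print(xml_query(["ENST00000001", "ENST00000002"]))
```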

gtexquery.data_handling.biomart.biomart_request(infile: str, output: str) → None

Query Biomart with a list of transcripts.

Instantiates a thread-local requests.Session before querying Biomart with a list of transcript IDs. Should an error occur, it is logged using the logging.exception method.

Parameters
  • infile (str) – The input file. This is expected to be the output of the GTEx query, and will fail if the expected columns are not present.

  • output (str) – Where to save results

Raises

requests.HTTPError – When the GET request fails
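
The error logging described for biomart_request can be sketched with the standard logging module. The wrapper name call_and_log is hypothetical, and re-raising after logging is an assumption made here so that the failure still reaches the caller, in line with the Raises entry.

```python
import logging
from typing import Callable, TypeVar

T = TypeVar("T")


def call_and_log(action: Callable[[], T]) -> T:
    """Run action; on failure, log the traceback and re-raise."""
    try:
        return action()
    except Exception:
        # logging.exception logs at ERROR level and appends the active traceback.
        logging.exception("Biomart request failed")
        raise
```

In the real function the wrapped action would presumably be the GET request followed by a status check, so an HTTP failure is both recorded in the log and visible to the calling pipeline step.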

data_handling.process

Data handling for process step.

gtexquery.data_handling.process.merge_data(gtex_path: Union[pathlib.Path, str], bm_path: Union[pathlib.Path, str], mane: pandas.core.frame.DataFrame, out_path: Union[pathlib.Path, str]) → None

Merge the data from previous pipeline queries.

Parameters
  • gtex_path (Union[Path, str]) – Path to the file containing GTEx query data.

  • bm_path (Union[Path, str]) – Path to the file containing BioMart query data.

  • mane (pd.DataFrame) – A DataFrame containing MANE annotations.

  • out_path (Union[Path, str]) – Path to the output file.