Running¶
Run PolyA from the command line:
polyA -h
or
python -m polyA -h
Command line usage is also included below for convenience.
usage: polyA [-h] [-v] [--chunk-size CHUNK_SIZE] [--trans-penalty TRANS_PENALTY]
[--confidence] [--prior-counts FILE] [--shard-gap SHARD_GAP]
[--sequences SEQS] [--ultra-data FILE] [--easel-path BIN]
[--ultra-path BIN] [--log-file LOG] [--log-level LEVEL]
[--matrix-position] [--output-path PATH] [--sequence-position] [--soda]
ALIGNMENTS MATRICES
positional arguments:
ALIGNMENTS alignments file in Stockholm format
MATRICES substitution matrices file in PolyA matrix format
optional arguments:
-h, --help show this help message and exit
-v, --version show version and exit
--chunk-size CHUNK_SIZE
size of the window in base pairs analyzed together
--trans-penalty TRANS_PENALTY
penalty for changing annotations
--confidence run the confidence calculation and then exit
--prior-counts FILE file containing query genomic counts
--shard-gap SHARD_GAP
maximum alignment gap before sharding occurs
--sequences SEQS FASTA file of the target sequence for using ULTRA
--ultra-data FILE text file of the output from ULTRA ran on the FASTA file
of the target sequence
--easel-path BIN path to the esl_scorematrix program, if necessary
(assumed to be in PATH)
--ultra-path BIN path to the ULTRA binary to use, if necessary (assumed
to be in PATH)
--log-file LOG file to store log output in, defaults to stderr
--log-level LEVEL logging level to use, 'debug' is the most noisy
--matrix-position produce output in terms of the matrix position
--output-path PATH directory to write output files to, defaults to
working directory
--sequence-position produce output in terms of the target sequence
position
--soda write a SODA visualization file to the output
directory
--complexity-adjustment complexity-adjust alignment scores (scores of matches
between sequence regions of biassed nucleotide
composition are adjusted downwards)
Input Formats¶
PolyA accepts two required inputs and several optional inputs that affect its behavior. The required inputs are an alignment file which must contain alignments for all possible queries matching the target sequence. This file must be in Stockholm format with several custom metadata fields. The other required input is a set of substitution matrices. This file uses a custom, but extremely simple format.
Alignment File¶
Alignments for all possible queries matching the target sequence should be contained in a single file in Stockholm format.
There are several special metadata fields that must exist for each alignment in this file. See the example below. An explanation is indented to the right of each field with additional detail as noted.
#=GF ID MERX#DNA/TcMar-Tigger query sequence (1)
#=GF TR chr1:11543-28567 target sequence
#=GF SC 1153 alignment score
#=GF SD + strand
#=GF TQ -1 (2)
#=GF ST 127 alignment start position on target
#=GF SP 601 alignment stop position on target
#=GF CST 135 alignment start position on query
#=GF CSP 628 alignment stop position on query
#=GF FL 128 (3)
#=GF MX matrix_name (4)
#=GF GI -25 gap init
#=GF GE -5 gap extension
query sequence names must be in the format ‘name#family/class’
valid values: ‘q’ if the alignment is on the reverse strand and the reversed sequence is the query; ‘t’ if the alignment is on the reverse strand and the reversed sequence is the target; ‘-1’ if the alignment is on the positive strand
the flanking region of the unaligned query sequence
the name of the substitution matrix file used to create alignment
Converting Alignments¶
PolyA can convert a Cross Match or Repeat Masker alignment file to the
particular version of Stockholm format it requires. To do this, use either the
--cm-to-stockholm
or --rm-to-stockholm
options, respectively, passing the
path to the file to be converted.
The script will produce two files in the same directory as the input, one with a
.sto
extension and the other with a .matrix
extension. These can be passed
to the PolyA command line tool as ALIGNMENTS
and MATRICES
, respectively (see
--help
).
Example:
python -m polyA --cm-to-stockholm my_alignments.cm
Substitution Matrix File¶
The substitution matrix file example format (can include ambiguity codes):
this file must include all of the matrices specified in the
#=GF MX
field of the alignment file, with corresponding and matching matrix namesif lambda is not included polyA will use esl_scorematrix to calculate it for all matrices
matrix_name lambda(optional)
A G C T N
8 -6 -13 -15 -1
-2 10 -13 -13 -1
-13 -13 10 -2 -1
-15 -13 -6 8 -1
-1 -1 -1 -1 -1
//
matrix_name2 lambda2(optional)
A G C T N
8 -6 -13 -15 -1
-2 10 -13 -13 -1
-13 -13 10 -2 -1
-15 -13 -6 8 -1
-1 -1 -1 -1 -1
//
...
Sequence File¶
A FASTA file of the target sequence is needed when using ULTRA. The target sequence must be the same genomic region that was used to get the cross_match alignment file. This file must follow the format of
>chrom:start-end
target_sequence
as shown in the example below.
>chr1:152302175-152325203
AATAGTTTATTTTTAATTTAGATGCAGCTTACTATAATATTAATTATGTCCAAGATGATT
TTTTGAATACAGAATACTAGAATTCCAATAGAAGGATAATAGAGAAAGATGTGCTAGCCC
...
Output Formats¶
start stop ID name
----------------------------------------
11990879 11991268 eaa042dd09f944f68dba2fd4727c64e2 LTR40a#LTR/ERVL
11991272 11991444 fb5ef5e0e2ca4e05837ddc34ca7ef9e4 MSTA1#LTR/ERVL-MaLR
11991445 11991562 bdfc4039b7d947d0b25bf1115cc282ed AluJr4#SINE/Alu
11991563 11991573 4871d91441a146209b98f645feae68c8 FLAM_C#SINE/Alu
11991574 11991818 fb5ef5e0e2ca4e05837ddc34ca7ef9e4 MSTB1#LTR/ERVL-MaLR
11991819 11991875 eaa042dd09f944f68dba2fd4727c64e2 LTR40a#LTR/ERVL
* Matching IDnums correspond to partial sequences that originate from
the same ancestral sequence.
Confidence Output Format¶
Computes confidence of a single input alignment region. Does not perform annotation or adjudication, simply outputs the confidence of all competing queries given in the input.
query_label confidence
LTR40a#LTR/ERVL 0.875
LTR40b#LTR/ERVL 0.052
LTR40c#LTR/ERVL 0.001
...
Extensions¶
Visualizing Annotations¶
The command line option --soda
will output the annotation data to a json file
(output.0.viz) that can be used for visualization in SODA (linked below).
The json file can be submitted on the browser to view the TE annotations from PolyA
as well as the annotations from the UCSC Genome Browser for the same region of the
human genome (hg38). The PolyA visualization can display the confidence values for all competing
annotations of a selected region as well as their corresponding sequence alignments.
Prior Counts Files¶
Default confidence calculations assume a uniform distribution over all competing queries. In the case of non uniform priors, the command line option –prior-counts prior_counts.txt includes prior genome counts in confidence calculations (see paper for more details).
https://www.biorxiv.org/content/10.1101/2021.02.13.430877v1
Prior counts file example format:
subfamily genome_count
AluYk2 6855
LTR38 255
L1PA7_5end 13261
...
Using ULTRA¶
The optional use of ULTRA allows polyA to include tandem repeats (TRs) in the competing annotations
of the target sequence. Doing so removes the dependency on pre-masking TRs prior to annotation, allows
TRs to outcompete potentially weak fragmentary family annotation, and allows a family annotation
to outcompete a TR.
The command line option --sequences seq.fasta
(with --ultra-path
if necessary) will
run ULTRA with polyA or --ultra-data ultra_data.txt
can be used if ULTRA was ran on seq.fasta prior.