The tidysq package is designed for storing and operating on
biological sequences. This vignette is a guide to the basic usage
of tidysq, i.e. reading, manipulating and writing sequences
to a file.
The most recent version of tidysq can be installed with the
install_github() function from devtools.
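A minimal installation sketch follows. The repository path BioGenies/tidysq is an assumption, since the text above does not name it; check the package's own documentation before running this.

```r
# Install devtools first if it is not available yet.
if (!requireNamespace("devtools", quietly = TRUE)) {
  install.packages("devtools")
}

# Assumption: the development version lives in the BioGenies/tidysq
# GitHub repository; the vignette text does not state the path.
devtools::install_github("BioGenies/tidysq")
```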
Biological sequences can be and often are represented as strings –
sequences of letters. For example, a DNA sequence can take the form of
"TAGGCCCTAGACCTG", where A means adenine,
C – cytosine, G – guanine and T –
thymine. Exact IUPAC recommendations for one-letter codes can be found
in Cornish-Bowden A. Nomenclature for incompletely specified bases
in nucleic acid sequences: recommendations 1984. Nucleic Acids Res. 1985
May 10;13(9):3021-30. doi: 10.1093/nar/13.9.3021. PMID: 2582368; PMCID:
PMC341218.
Within the tidysq package, sequence data is stored in
sq objects, that is, vectors of biological sequences. They
can be created from character vectors of strings like the one above:
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
sq_dna
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG                                                         <15>
#> [2] TAGGCCCTGGGCATG                                                         <15>
There are several things to note. First, each sequence is an element
of the sq object. Many operations in R are vectorized, that is, they are
applied to all elements of a vector, and sq objects are
no different in this regard. Second, the first line of the output says
"basic DNA sequences list". This means that all sequences of
this object are of DNA type and do not use ambiguous letters (more about
that in the “Advanced alphabet techniques” vignette).
Manipulating sequence objects is an integral part of
tidysq. sq objects can be subsetted easily
using the usual R syntax.
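A short sketch of such subsetting, using the sq_dna object created above. The variable names are illustrative only, and the standard base R index types (positive, negative, logical) are assumed to behave as they do for ordinary vectors.

```r
library(tidysq)

sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))

# Positive index: keep only the first sequence.
first_seq <- sq_dna[1]

# Negative index: drop the first sequence and keep the rest.
without_first <- sq_dna[-1]

# Logical index: keep the sequences flagged TRUE.
second_seq <- sq_dna[c(FALSE, TRUE)]

first_seq
```

In each case the result is again an sq object, so it can be passed on to any other tidysq function.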
Extracting subsequences is slightly more involved, because
it uses a dedicated function, bite(). Its syntax, however,
closely resembles that of base R: indexing starts at one and negative
indices are interpreted as “everything except these positions”. It returns an
sq object in which every sequence has been subsetted:
bite(sq_dna, 5:10)
#> basic DNA sequences list:
#> [1] CCCTAG                                                                   <6>
#> [2] CCCTGG                                                                   <6>
bite(sq_dna, c(-9, -11, -13))
#> basic DNA sequences list:
#> [1] TAGGCCCTGCTG                                                            <12>
#> [2] TAGGCCCTGCTG                                                            <12>
It’s possible to reverse sequences using this function:
# Don't do it like that!
bite(sq_dna, 15:1)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT                                                         <15>
#> [2] GTACGGGTCCCGGAT                                                         <15>
However, this usage is strongly discouraged, because it is both
inefficient and works badly with sequences of different lengths.
Instead, there is a dedicated function, reverse():
reverse(sq_dna)
#> basic DNA sequences list:
#> [1] GTCCAGATCCCGGAT                                                         <15>
#> [2] GTACGGGTCCCGGAT                                                         <15>
Note that this is very different from base rev(), which
reverses only the order of the sequences in the vector, not the letters within them.
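The contrast between the two functions can be sketched as follows; the variable names are illustrative only.

```r
library(tidysq)

sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))

# reverse() flips the letters inside each sequence; the order of the
# sequences in the vector stays the same.
letters_reversed <- reverse(sq_dna)

# Base rev() flips the order of the vector elements; each sequence
# keeps its letters in the original order.
order_reversed <- rev(sq_dna)

order_reversed
```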
We can combine two or more sq objects using the base
c() function.
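The outputs in the remainder of this vignette list four sequences, the last two being the reversals of the first two. The sketch below would produce such an object; that sq_dna was extended in exactly this way is an assumption, since the original call is not shown.

```r
library(tidysq)

sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))

# Append the letter-reversed sequences to the original ones.
# Assumption: this mirrors how the four-element object used in the
# later examples was built.
sq_dna <- c(sq_dna, reverse(sq_dna))

length(sq_dna)
```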
tidysq offers two functions specific to DNA/RNA
sequences, namely complement() and
translate(). The former creates sequences with
complementary bases, that is, it replaces A with
T, C with G and vice
versa. The latter translates its input into amino acid sequences using a
translation table that maps three-letter codons to amino acids.
These functions can be called as shown below:
complement(sq_dna)
#> basic DNA sequences list:
#> [1] ATCCGGGATCTGGAC                                                         <15>
#> [2] ATCCGGGACCCGTAC                                                         <15>
#> [3] CAGGTCTAGGGCCTA                                                         <15>
#> [4] CATGCCCAGGGCCTA                                                         <15>
translate(sq_dna)
#> basic amino acid sequences list:
#> [1] *ALDL                                                                    <5>
#> [2] *ALGM                                                                    <5>
#> [3] VQIPD                                                                    <5>
#> [4] VRVPD                                                                    <5>
One noteworthy feature here is that translation can use any of the genetic code tables from the Wikipedia page listing genetic codes.
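A hedged sketch of choosing a non-default table follows. It assumes that translate() accepts the table number through an argument named table, that the default is the standard code (table 1), and that the numbering follows the NCBI convention. The sequence is a hypothetical one, chosen because the codons ATA, AGA and TGA are read differently by the standard and the vertebrate mitochondrial codes.

```r
library(tidysq)

# Hypothetical sequence made of the codons ATA, AGA and TGA.
sq_codons <- sq("ATAAGATGA", alphabet = "dna_bsc")

# Default call; assumed to use the standard genetic code (table 1).
standard <- translate(sq_codons)

# Assumed to select the vertebrate mitochondrial code (table 2).
mitochondrial <- translate(sq_codons, table = 2)

standard
mitochondrial
```

Comparing the two printed results shows whether the choice of table matters for a given input; for many sequences, including the four used above, both tables give the same translation.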
Motifs are short subsequences that are often searched for in
biological sequences. tidysq has two distinct functions
that allow the user to perform such a search.
One of them is the %has% operator, which takes an
sq object on the left and a character vector of motifs on the right.
It returns a logical vector of the same length as the sq
object, where each element says whether all motifs passed as strings
were found in the given sequence:
sq_dna %has% "ATC"
#> [1] FALSE FALSE  TRUE FALSE
# It can be used to subset sq
sq_dna[sq_dna %has% c("AG", "CC")]
#> basic DNA sequences list:
#> [1] TAGGCCCTAGACCTG                                                         <15>
#> [2] TAGGCCCTGGGCATG                                                         <15>
#> [3] GTCCAGATCCCGGAT                                                         <15>
It says nothing about the placement of a motif within a sequence or its exact
form, however. For that purpose there is the find_motifs()
function, which returns a whole tibble (from the
tibble package; basically an improved version of
data.frame) with various information about the motifs found. An important
thing to note here is that the second argument is a character vector of
sequence names. Names are used so that potentially long sequences do not have to be
embedded, possibly many times, in the resulting tibble:
find_motifs(sq_dna, c("seq1", "seq2", "rev1", "rev2"), c("ATC", "TAG"))
#> # A tibble: 4 × 5
#>   names found     sought start   end
#>   <chr> <dna_bsc> <chr>  <int> <int>
#> 1 rev1  ATC <3>   ATC        7     9
#> 2 seq1  TAG <3>   TAG        1     3
#> 3 seq1  TAG <3>   TAG        8    10
#> 4 seq2  TAG <3>   TAG        1     3
You can also pass this function a data.frame
(or, as we recommend, a tibble) with one column
called sq that contains the sequences and another column
called name that contains the names.
sqibble <- tibble::tibble(sq = sq_dna, 
                          name = c("seq1", "seq2", "rev1", "rev2"))
# does the same as the call from previous chunk of code
find_motifs(sqibble, c("ATC", "TAG"))
#> # A tibble: 4 × 5
#>   names found     sought start   end
#>   <chr> <dna_bsc> <chr>  <int> <int>
#> 1 rev1  ATC <3>   ATC        7     9
#> 2 seq1  TAG <3>   TAG        1     3
#> 3 seq1  TAG <3>   TAG        8    10
#> 4 seq2  TAG <3>   TAG        1     3
There are ambiguous DNA bases in the IUPAC codes and these can be used in
motifs. One of them is "N", which means “any of
A, C, G or T”:
find_motifs(sqibble, "GNCC")
#> # A tibble: 7 × 5
#>   names found     sought start   end
#>   <chr> <dna_bsc> <chr>  <int> <int>
#> 1 seq1  GGCC <4>  GNCC       3     6
#> 2 seq1  GCCC <4>  GNCC       4     7
#> 3 seq1  GACC <4>  GNCC      10    13
#> 4 seq2  GGCC <4>  GNCC       3     6
#> 5 seq2  GCCC <4>  GNCC       4     7
#> 6 rev1  GTCC <4>  GNCC       1     4
#> 7 rev2  GTCC <4>  GNCC       7    10
This example shows the difference between the "sought"
and "found" columns. The former contains the string
representation of the motif that the user was looking for, while the latter
contains a tidysq-encoded sequence with the “instance” of the
motif that was actually matched.
Two additional characters are reserved because of their special
meaning in motifs. "^" means that the motif must be found
at the start of a sequence, while "$" means that it must be found
at the end. They can, of course, be combined with ambiguous
letters.
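A short sketch of anchored motifs follows, using the same four sequences and names as the outputs above. The motifs "^TAG" and "GNT$" are illustrative choices, not taken from the original text.

```r
library(tidysq)

# The four sequences and names that appear in the outputs above.
sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG",
               "GTCCAGATCCCGGAT", "GTACGGGTCCCGGAT"))
seq_names <- c("seq1", "seq2", "rev1", "rev2")

# "^TAG": TAG counts only when it opens a sequence, so the second
# TAG inside seq1 is not reported.
at_start <- find_motifs(sq_dna, seq_names, "^TAG")

# "GNT$": G, then any base, then T, but only at the very end of a
# sequence; this combines an anchor with an ambiguous letter.
at_end <- find_motifs(sq_dna, seq_names, "GNT$")

at_start
at_end
```

Only seq1 and seq2 begin with TAG, and only rev1 and rev2 end with G, any base, T, so each call should report one match for each of those two sequences.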
After finishing their computations, the user might wish to save the sequences
for future use. One of the most popular formats for storing biological
sequences is FASTA. tidysq allows the user to write
sequences to a FASTA file with the write_fasta() function.
An important thing to remember here is that the arguments of this function are
analogous to those of find_motifs(): either an
sq object and a vector of names, or a tibble
with columns of sequences and names.
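Both calling styles can be sketched as follows. The output path is a hypothetical temporary file, and the argument order (sequences, names, file, or tibble, file) is assumed by analogy with find_motifs() as described above.

```r
library(tidysq)

sq_dna <- sq(c("TAGGCCCTAGACCTG", "TAGGCCCTGGGCATG"))
seq_names <- c("seq1", "seq2")

# Hypothetical output path; a temporary file keeps the example tidy.
fasta_path <- tempfile(fileext = ".fasta")

# Variant 1: an sq object plus a character vector of names.
write_fasta(sq_dna, seq_names, fasta_path)

# Variant 2: a tibble with "sq" and "name" columns; this writes the
# same content to the same file again.
sqibble <- tibble::tibble(sq = sq_dna, name = seq_names)
write_fasta(sqibble, fasta_path)

# Inspect the written file line by line.
fasta_lines <- readLines(fasta_path)
fasta_lines
```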