Welcome to Saguaro!
Saguaro (Genome-Wide) is a program to detect signatures of selection
within populations, strains, or species. It takes SNPs or
nucleotides as input, and creates statistical local phylogenies for
each region in the genome. Saguaro was developed at the Science for
Life Laboratory, Department of Medical Biochemistry and
Microbiology, Uppsala University, and the Broad Institute of MIT and
Harvard.
The software is licensed under the GNU Library or Lesser General
Public License version 2.0 (LGPLv2). You can download the source code
from here.
Background
When species or populations diverge, their genomes accumulate
differences from each other, so that the overall phylogeny follows
ancestry and is recognizable by computing pairwise genomic distances
genome-wide. However, several evolutionary forces act on very local
genomic regions, so that these regions appear to violate the
dominant phylogeny. One example is parallel evolution that sweeps
certain haplotypes to fixation independently in different
populations, driven by the need to adapt to similar environments.
Saguaro implements an algorithm that sets out to identify and
pinpoint such regions, without the need for any a priori hypotheses:
a Hidden Markov Model and a Neural Network, applied in an
interleaved fashion, will hypothesize local "phylogenies",
especially when they occur several times over in the genome, and
report them for further biological analysis.
Supported platforms
Saguaro is written in C++ and requires Linux and the GNU gcc
compiler (we tested it with gcc versions 4.2, 4.4, and 4.6). The
amount of RAM required to run depends on the data set. Please note
that even the small test sample data set included in the download
will take up several GB, since the data conversion is optimized to
process large amounts of data quickly. The sample data does run on a
MacBook Pro with 4GB of RAM, but only if no other programs are
loaded into memory, and even then it will start swapping.
Installing Saguaro
Download the source code from here. To
compile the executables, type
> make
on the command line.
Input file format
Saguaro takes binary "feature" files only. You can convert your
input data to Saguaro binary format from Multiple Alignment Format
(MAF) files and from multi-fasta files. To convert a MAF, run:
> ./Maf2HMMFeature -i <maf> -o <out> -n <names> -c <center> -nosame
All available arguments:
-i<string> : input multiple alignment file (MAF format)
-o<string> : binary feature output files
-n<string> : names of the genomes to be extracted (must match MAF)
-nosame<bool> : skip positions in which all calls are the same (def=0)
-m<int> : minimum coverage (def=2)
-c<string> : name of the genome in which the coordinates will be reported
Note that you need to provide a plain text file to option -n that
contains all the organisms to be analyzed, with one entry per line
(see sample data below). The option -c defines which genome to use
as the coordinate system, and this organism needs to be in the list
of names. If organisms are either fairly closely related (more bases
match than mismatch across all organisms), or the experiment
normalizes conserved regions, we recommend the -nosame option.
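The conversion step above can be scripted. The following is a minimal Python sketch, not part of the Saguaro distribution: the helper names (write_names_file, maf2hmm_command) and the file names (names.txt, input.maf, features.bin) are assumptions made for illustration. It writes the names file with one genome per line, checks that the -c genome is in that list, and builds the argument list without running anything.

```python
from pathlib import Path


def write_names_file(path, genomes):
    """Write one genome name per line, the layout described for the -n option."""
    Path(path).write_text("".join(name + "\n" for name in genomes))


def maf2hmm_command(maf, out, names_file, center, genomes, nosame=True):
    """Build the Maf2HMMFeature argument list (hypothetical helper).

    Raises ValueError if the coordinate genome is not in the names list.
    """
    if center not in genomes:
        raise ValueError(f"center genome {center!r} is not in the names list")
    cmd = ["./Maf2HMMFeature", "-i", maf, "-o", out,
           "-n", names_file, "-c", center]
    if nosame:
        # Passed as a bare flag, as in the example command line above.
        cmd.append("-nosame")
    return cmd


# Illustrative genome names and file names (assumptions, not shipped files).
genomes = ["hg18", "panTro2", "rheMac2", "calJac1", "tarSyr1", "gorGor1"]
write_names_file("names.txt", genomes)
cmd = maf2hmm_command("input.maf", "features.bin", "names.txt", "hg18", genomes)
print(" ".join(cmd))
```

The resulting list can be passed to subprocess.run(cmd, check=True) once the executables have been compiled.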
To convert from multi-fasta format, use:
> ./Fasta2HMMFeature
-i<string> : input fasta file (multiple alignment)
-o<string> : binary output file
-nosame<bool> : skip positions in which all calls are the same (def=0)
-m<int> : minimum coverage (def=2)
All entries in the multi-fasta file will be converted to Saguaro
format. Note that the multi-fasta file needs to contain all the gaps
required to form a consistent multiple alignment, and that the
alignment positions will be used as the coordinate system to report
results.
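Because the alignment positions serve as the coordinate system, every entry in the multi-fasta file should have the same length once gaps are included. A minimal Python sketch of such a pre-flight check follows; the function names and the toy sequences are illustrative assumptions, and Saguaro itself may validate its input differently.

```python
def read_fasta(lines):
    """Parse FASTA lines into a {header: sequence} dict, preserving order."""
    records = {}
    name = None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            name = line[1:].strip()
            records[name] = []
        elif name is not None:
            records[name].append(line)
    return {n: "".join(parts) for n, parts in records.items()}


def check_alignment(records):
    """Return the common alignment length; raise if entries differ in length."""
    lengths = {name: len(seq) for name, seq in records.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"entries differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Toy alignment (made-up sequences) with gaps written as '-'.
example = """>hg18
ACGT-ACG
>panTro2
ACGTTACG
>rheMac2
AC-TTACG
""".splitlines()

print(check_alignment(read_fasta(example)))  # prints 8
```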
Running Saguaro
Once the data has been converted into features, run Saguaro. The
options are:
> ./Saguaro
-f<string> : Feature vector (def=)
-l<string> : Feature vector list file (def=)
-o<string> : output directory
-cycle<int> : iterations per cycle (def=2)
-iter<int> : iterations with split (def=40)
-t<double> : transition penalty (def=150)
where either a single feature file is processed via the -f option,
or a list of files (one file name per line) is supplied via the -l
option. The option -iter controls both the number of iterations and
how many hypotheses will be output.
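The choice between -f and -l can be sketched as a small Python helper. The function name saguaro_command, the list file name features.list, and the feature file names are assumptions for illustration. The defaults mirror the values listed above, and the command is only built, not executed.

```python
from pathlib import Path


def saguaro_command(features, out_dir, iterations=40, cycle=2, transition=150,
                    list_path="features.list"):
    """Build a Saguaro argument list for one feature file (-f) or several (-l)."""
    if not features:
        raise ValueError("at least one feature file is required")
    if len(features) == 1:
        source = ["-f", features[0]]
    else:
        # The list file holds one feature file name per line.
        Path(list_path).write_text("".join(f + "\n" for f in features))
        source = ["-l", list_path]
    return ["./Saguaro", *source, "-o", out_dir,
            "-cycle", str(cycle), "-iter", str(iterations),
            "-t", str(transition)]


# Illustrative file names (assumptions).
single = saguaro_command(["chr6.bin"], "out")
multi = saguaro_command(["chr1.bin", "chr2.bin"], "out", iterations=20)
print(" ".join(single))
print(" ".join(multi))
```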
Output
The final result is written to the file LocalTrees.out in the output
directory. After a file header, it lists a phylogeny for each
genomic location, including coordinates and the distance matrix that
best describes this region, e.g.:
cactus0 chr6: 32886361 - 32893560 length: 7199 (frames 72-284 l=212) score=21.1358
         hg18   panTro2  rheMac2  calJac1  tarSyr1  gorGor1
hg18     -0.00  0.10     0.76     1.47     2.14     0.10
panTro2  0.10   -0.00    0.87     1.58     2.14     0.12
rheMac2  0.76   0.87     0.01     0.99     1.16     0.84
calJac1  1.47   1.58     0.99     0.05     0.87     1.78
tarSyr1  2.14   2.14     1.16     0.87     0.04     2.08
gorGor1  0.10   0.12     0.84     1.78     2.08     -0.00
In addition, the file saguaro.cactus lists all hypotheses that have
been generated during the run and fit the individual regions best
genome-wide. Note that Saguaro is not forced to assign genomic
regions to all hypotheses, so some might not be used to classify
local regions.
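For downstream analysis, a single region record from LocalTrees.out can be read into a nested dictionary. The Python sketch below assumes the layout shown in the example above: a header line, a row of genome names, then one row per genome containing its name followed by the distances. The function name parse_tree_block is hypothetical, and real files may differ in whitespace or contain additional fields.

```python
def parse_tree_block(text):
    """Parse one region record into (header, names, matrix).

    matrix[a][b] is the distance between genomes a and b.
    """
    lines = [ln for ln in text.splitlines() if ln.strip()]
    header = lines[0].strip()
    names = lines[1].split()
    matrix = {}
    for line in lines[2:2 + len(names)]:
        fields = line.split()
        row_name, values = fields[0], [float(v) for v in fields[1:]]
        if len(values) != len(names):
            raise ValueError(
                f"row {row_name!r} has {len(values)} values, "
                f"expected {len(names)}")
        matrix[row_name] = dict(zip(names, values))
    return header, names, matrix


# The example record from this section.
record = """\
cactus0 chr6: 32886361 - 32893560 length: 7199 (frames 72-284 l=212) score=21.1358
         hg18   panTro2  rheMac2  calJac1  tarSyr1  gorGor1
hg18     -0.00  0.10     0.76     1.47     2.14     0.10
panTro2  0.10   -0.00    0.87     1.58     2.14     0.12
rheMac2  0.76   0.87     0.01     0.99     1.16     0.84
calJac1  1.47   1.58     0.99     0.05     0.87     1.78
tarSyr1  2.14   2.14     1.16     0.87     0.04     2.08
gorGor1  0.10   0.12     0.84     1.78     2.08     -0.00
"""

header, names, matrix = parse_tree_block(record)
print(matrix["hg18"]["gorGor1"])  # prints 0.1
```

In this record, gorGor1 is as close to hg18 as panTro2 is (0.10 in both cases).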
Example data set
For example data, see the script test_Saguaro that is distributed
with the software. The sample data is part of the 29 mammalian
genomes alignment, and maps to a part of human chromosome 6.