Welcome to Saguaro!

Saguaro (Genome-Wide) is a program to detect signatures of selection within populations, strains, or species. It takes SNPs or nucleotides as input, and creates statistical local phylogenies for each region in the genome. Saguaro was developed at the Science for Life Laboratory, Department of Medical Biochemistry and Microbiology, Uppsala University, and the Broad Institute of MIT and Harvard.

The software is licensed under the GNU Library or Lesser General Public License version 2.0 (LGPLv2). You can download the source code from here.

Background
When species or populations diverge, their genomes accumulate differences from each other, so that the overall phylogeny follows ancestry and can be recognized by computing pairwise genomic distances genome-wide. However, several evolutionary forces act on very local genomic regions, so that these regions appear to violate the dominant phylogeny, e.g. parallel evolution that sweeps certain haplotypes to fixation independently in different populations, driven by the need to adapt to similar environments. Saguaro implements an algorithm that sets out to identify and pinpoint such regions without the need for any a priori hypotheses: a Hidden Markov Model and a Neural Network, applied in an interleaved fashion, hypothesize local "phylogenies", especially when they occur several times over in the genome, and report them for further biological analysis.

Supported platforms
Saguaro is written in C++ and requires Linux and the GNU gcc compiler (we tested it with gcc versions 4.2, 4.4, and 4.6). The amount of RAM required depends on the data set. Please note that even the small sample data set included in the download will take up several GB, since the data conversion is optimized to process large amounts of data quickly. The sample data does run on a MacBook Pro with 4 GB of RAM, but only if no other programs are loaded into memory, and even then it will start swapping.

Installing Saguaro
Download the source code from here. To compile the executables, type

> make

on the command line.

Input file format
Saguaro takes binary "feature" files only. You can convert your input data to Saguaro binary format from multiple sequence alignment format (MAF) or from multi-fasta. To convert a MAF, run:

> ./Maf2HMMFeature -i <maf> -o <out> -n <names> -c <center> -nosame

All available arguments:

-i<string> : input multiple alignment file (MAF format)
-o<string> : binary feature output files
-n<string> : names of the genomes to be extracted (must match MAF)
-nosame<bool> : skip positions in which all calls are the same (def=0)
-m<int> : minimum coverage (def=2)
-c<string> : name of the genome in which the coordinates will be reported

Note that you need to provide a plain text file to option -n that contains all the organisms to be analyzed, with one entry per line (see sample data below). The option -c defines which genome to use as the coordinate system, and this organism needs to be in the list of names. If organisms are either fairly closely related (more bases match than mismatch across all organisms), or the experiment normalizes conserved regions, we recommend the -nosame option.
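The names file for -n can be written by hand, but a small helper makes the constraint on -c explicit. The following is a minimal sketch, not part of Saguaro: the function name write_names_file and the file name names.txt are assumptions, and the genome names are taken from the sample output shown further down in this document.

```python
import tempfile
from pathlib import Path


def write_names_file(path, genomes, center):
    """Write one genome name per line and check that the -c genome is listed."""
    names = [g.strip() for g in genomes if g.strip()]
    if len(set(names)) != len(names):
        raise ValueError("duplicate genome names in list")
    if center not in names:
        raise ValueError(f"center genome {center!r} must be in the names list")
    Path(path).write_text("".join(name + "\n" for name in names))
    return names


# Genome names as they appear in the sample output of this document.
genomes = ["hg18", "panTro2", "rheMac2", "calJac1", "tarSyr1", "gorGor1"]

# names.txt is a hypothetical file name; a temporary directory keeps the sketch tidy.
names_path = Path(tempfile.mkdtemp()) / "names.txt"
write_names_file(names_path, genomes, center="hg18")
print(names_path.read_text(), end="")
```

The resulting file would then be passed to -n, with the same center genome given to -c. Whether the names match the MAF still has to be checked against the alignment itself.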

To convert from multi-fasta format, use:

> ./Fasta2HMMFeature

-i<string> : input fasta file (multiple alignment)
-o<string> : binary output file
-nosame<bool> : skip positions in which all calls are the same (def=0)
-m<int> : minimum coverage (def=2)

and all entries in the multi-fasta file will be converted to Saguaro format. Note that the multi-fasta file needs to contain all the gaps required to build a consistent multi-alignment, and that the multi-alignment positions will be used as the coordinate system to report results.
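Because the alignment columns serve as coordinates, every entry in the multi-fasta file should have the same length once gaps are included. The sketch below is one way to check this before conversion; it is an illustration only, the helper names read_fasta and check_alignment are assumptions, and the three short sequences are made up for the example.

```python
def read_fasta(lines):
    """Return {name: sequence} from an iterable of FASTA lines."""
    parts, name = {}, None
    for line in lines:
        line = line.strip()
        if not line:
            continue
        if line.startswith(">"):
            fields = line[1:].split()
            if not fields:
                raise ValueError("FASTA header without a name")
            name = fields[0]
            if name in parts:
                raise ValueError(f"duplicate entry {name!r}")
            parts[name] = []
        elif name is None:
            raise ValueError("sequence data before the first header")
        else:
            parts[name].append(line)
    return {n: "".join(p) for n, p in parts.items()}


def check_alignment(records):
    """Raise if entries differ in length; return the common alignment length."""
    lengths = {n: len(s) for n, s in records.items()}
    if len(set(lengths.values())) > 1:
        raise ValueError(f"entries differ in length: {lengths}")
    return next(iter(lengths.values()), 0)


# Made-up toy alignment; '-' marks a gap column.
example = """>hg18
ACGT-ACGT
>panTro2
ACGTTACGT
>rheMac2
AC-TTACGA
""".splitlines()

records = read_fasta(example)
alignment_length = check_alignment(records)
print(alignment_length)
```

For a real file, the lines would come from an open file handle instead of the inline string.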

Running Saguaro
Once the data has been converted into features, run Saguaro. The options are:

> ./Saguaro

-f<string> : Feature vector (def=)
-l<string> : Feature vector list file (def=)
-o<string> : output directory
-cycle<int> : iterations per cycle (def=2)
-iter<int> : iterations with split (def=40)
-t<double> : transition penalty (def=150)

where either a single feature file is processed via the -f option, or a list of files (one file name per line) is supplied via the -l option. The option -iter controls both the number of iterations and how many hypotheses will be output.
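When several feature files are to be processed together, the list file for -l can be generated rather than typed. The sketch below writes such a list and assembles a command line using only the -l, -o, and -iter options described above. The helper names, the feature file names, and the output directory name are assumptions, and the command is only printed, not executed.

```python
import shlex
import tempfile
from pathlib import Path


def write_feature_list(list_path, feature_files):
    """Write one feature file name per line, as expected by the -l option."""
    files = [str(f) for f in feature_files]
    if not files:
        raise ValueError("no feature files given")
    Path(list_path).write_text("".join(f + "\n" for f in files))
    return files


def saguaro_command(list_path, out_dir, iterations=40):
    """Assemble the argument list; 40 is the documented default for -iter."""
    return ["./Saguaro", "-l", str(list_path),
            "-o", str(out_dir), "-iter", str(iterations)]


work = Path(tempfile.mkdtemp())
# Hypothetical feature files, e.g. one per chromosome.
feature_files = [work / "chr1.feature", work / "chr6.feature"]
list_path = work / "features.list"

write_feature_list(list_path, feature_files)
cmd = saguaro_command(list_path, work / "saguaro_out")
print(shlex.join(cmd))
```

Building the command as a list keeps file names with spaces intact if it is later handed to subprocess.run.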

Output
The final result is written to the file LocalTrees.out in the output directory. After a file header, it lists a phylogeny for each genomic location, including coordinates and the distance matrix that best describes this region, e.g.:

cactus0 chr6: 32886361 - 32893560       length: 7199    (frames 72-284 l=212)   score=21.1358
hg18    panTro2 rheMac2 calJac1 tarSyr1 gorGor1
hg18    -0.00   0.10    0.76    1.47    2.14    0.10
panTro2 0.10    -0.00   0.87    1.58    2.14    0.12
rheMac2 0.76    0.87    0.01    0.99    1.16    0.84
calJac1 1.47    1.58    0.99    0.05    0.87    1.78
tarSyr1 2.14    2.14    1.16    0.87    0.04    2.08
gorGor1 0.10    0.12    0.84    1.78    2.08    -0.00
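For downstream analysis, a region entry like the one above can be read into a small data structure. The sketch below parses exactly the layout of this excerpt (one header line, one line of genome names, then one matrix row per genome). It is an assumption that every entry in LocalTrees.out follows this layout, the function name parse_region is hypothetical, and the file header is not handled.

```python
import re

# Pattern for the region header line, modelled on the excerpt above.
HEADER = re.compile(
    r"^(?P<tree>\S+)\s+(?P<chrom>\S+):\s+(?P<start>\d+)\s+-\s+(?P<end>\d+)"
    r"\s+length:\s+(?P<length>\d+)"
    r"\s+\(frames\s+(?P<frame_start>\d+)-(?P<frame_end>\d+)\s+l=(?P<frames>\d+)\)"
    r"\s+score=(?P<score>\S+)"
)


def parse_region(block):
    """Parse one region entry into a dict with coordinates and a distance matrix."""
    lines = [ln.strip() for ln in block.strip().splitlines() if ln.strip()]
    match = HEADER.match(lines[0])
    if match is None:
        raise ValueError("unrecognized region header: " + lines[0])
    names = lines[1].split()
    rows = lines[2:2 + len(names)]
    if len(rows) != len(names):
        raise ValueError("distance matrix is incomplete")
    matrix = {}
    for row in rows:
        fields = row.split()
        if len(fields) != len(names) + 1:
            raise ValueError("unexpected number of columns: " + row)
        matrix[fields[0]] = dict(zip(names, map(float, fields[1:])))
    region = {k: int(v) if v.isdigit() else v for k, v in match.groupdict().items()}
    region["score"] = float(match.group("score"))
    region["names"] = names
    region["matrix"] = matrix
    return region


block = """
cactus0 chr6: 32886361 - 32893560       length: 7199    (frames 72-284 l=212)   score=21.1358
hg18    panTro2 rheMac2 calJac1 tarSyr1 gorGor1
hg18    -0.00   0.10    0.76    1.47    2.14    0.10
panTro2 0.10    -0.00   0.87    1.58    2.14    0.12
rheMac2 0.76    0.87    0.01    0.99    1.16    0.84
calJac1 1.47    1.58    0.99    0.05    0.87    1.78
tarSyr1 2.14    2.14    1.16    0.87    0.04    2.08
gorGor1 0.10    0.12    0.84    1.78    2.08    -0.00
"""

region = parse_region(block)
print(region["tree"], region["chrom"], region["start"], region["end"])
print(region["matrix"]["hg18"]["calJac1"])
```

The nested dictionary gives direct access to any pairwise distance by genome name, which is convenient when comparing a local phylogeny against the genome-wide one.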

In addition, the file saguaro.cactus lists all hypotheses that have been generated during the run and fit the individual regions best genome-wide. Note that Saguaro is not forced to assign genomic regions to all hypotheses, so some might not be used to classify local regions.

Example data set
For example data, see the script test_Saguaro that is distributed with the software. The sample data is part of the 29 mammalian genomes alignments and maps to part of human chromosome 6.