The first step of using BWA is to make an index of the reference genome in fasta format. The basic usage of the bwa index is:
where input_reference.fasta is an input file of the reference genome in fasta format, and index_prefix is the prefix of the generated index files. The option -a is required and can have two values: bwtsw (does not work for short genomes) and is (does not work for long genomes). Therefore, this value is chosen according to the length of the genome.
The bwa mem algorithm is one of the three algorithms provided by BWA. It performs local alignment and produces alignments for different part of the query sequence.
where index_prefix is the index for the reference genome generated from bwa index, and input_reads.fastq, input_reads_pair_1.fastq, input_reads_pair_2.fastq are the input files of sequencing data that can be single-end or paired-end respectively. Additional options for bwa mem can be found in the BWA manual.
Simple SLURM script for running bwa mem on Tusker with paired-end fastq input data, index_prefix as reference genome index, SAM output file and 8 CPUs is shown below:
The bwa bwasw algorithm is another algorithm provided by BWA. For input files with single-end reads it aligns the query sequences. For input files with paired-ends reads it performs paired-end alignment that only works for Illumina reads. An example of bwa bwasw for single-end input file input-reads.fasta in fasta format and output file bwa_bwasw_alignments.sam where the alignments are stored, is shown below:
The third BWA algorithm, bwa aln, aligns the input file of sequence data to the reference genome. In addition, there is an example of running bwa aln with single-end input_reads.fasta input file and 8 CPUs:
The command bwa samse uses the bwa_aln_alignments.sai output form bwa aln in order to generate SAM file from the alignments for single-end reads.
The command bwa sampe uses the bwa_aln_alignments.sai output form bwa aln in order to generate SAM file from the alignments for paired-end reads.
The command bwa fastmap identifies and outputs super-maximal exact matches (SMEMs).
The command bwa pemerge merges overlapping paired ends and can print either only the merged reads or the unmerged ones. An example of bwa pemerge of input_reads_pair_1.fastq and input_reads_pair_2.fastq with 8 CPUs and output file output_reads_merged.fastq that contains only the merged reads is shown below:
The command bwa fa2pac converts fasta to pac files.
BWA Pac2bwt and BWA Pac2bwtgen:
The commands bwa pac2bwt and bwa pac2bwtgen convert pac to bwt files.
The command bwa bwtupdate updates bwt files to the new format.
The command bwa bwt2sa generates sa files from bwt and Occ files.
In order to test the scalability of BWA (bwa/0.7) on Crane, we used two paired-end input fastq files: large_1.fastq and large_2.fastq, and one single-end input fasta file, large.fasta. Some statistics about the input files and the time and memory resources required for bwa mem are shown on the table below:
|total # of sequences||total size in MB||# of used CPUs||running time for 4 CPUs||required memory for 4 CPUs||# of used CPUs||running time for 8 CPUs||required memory for 8 CPUs||# of used CPUs||running time for 16 CPUs||required memory for 16 CPUs|
|large_1.fastq||10,174,715||3,376 MB||4||~ 35 minutes||~ 12 GB||8||~ 18.5 minutes||~ 18 GB||16||~ 10 minutes||~ 19 GB|
|large.fasta||592,593||836 MB||4||~ 5.5 minutes||~ 3 GB||8||~ 3 minutes||~ 4 GB||16||~ 2 minutes||~ 6.2 GB|