Usage
Introduction
Coordinate system
One piece of dhtslib's novel functionality is pervasive throughout the library: its compile-time, type-safe coordinate system. Wherever dhtslib deals with integer-based positions or coordinates, it requires the types in dhtslib.coordinates. This helps prevent off-by-one errors by requiring, at compile time, that the coordinate system of every coordinate or pair of coordinates is known.
To define a Coordinate:
import dhtslib.coordinates;
auto c1 = Coordinate!(Basis.zero)(0);
auto c2 = Coordinate!(Basis.one)(1);
c2 = 2;
// c2 = 0; would result in an error, as
// a one-based system cannot have a coordinate of zero
This defines a single coordinate as zero-based or one-based. An easier way of specifying a coordinate is:
import dhtslib;
auto c1 = ZeroBased(0);
auto c2 = OneBased(1);
auto c3 = ZB(0);
auto c4 = OB(1);
A Coordinate can be converted from one Basis to another:
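As a rough illustration of what a basis conversion involves, the following self-contained sketch models a typed coordinate without using dhtslib. The names Coord and its to method are hypothetical stand-ins, not dhtslib's API. The point is only the arithmetic (add one going from zero-based to one-based, subtract one going back) and the fact that the basis is part of the type.

```d
// Hypothetical stand-ins for illustration; these are NOT dhtslib's definitions.
enum Basis { zero, one }

struct Coord(Basis bs)
{
    long pos;

    // Convert to the target basis: zero -> one adds 1, one -> zero subtracts 1.
    auto to(Basis target)() const
    {
        static if (bs == target)
            return Coord!target(pos);
        else static if (bs == Basis.zero)
            return Coord!target(pos + 1);
        else
            return Coord!target(pos - 1);
    }
}

void main()
{
    auto zb = Coord!(Basis.zero)(0);
    auto ob = zb.to!(Basis.one);

    assert(ob.pos == 1);
    assert(ob.to!(Basis.zero).pos == 0);

    // The two bases are distinct types, so mixing them is rejected at compile time.
    static assert(!__traits(compiles, zb = ob));
}
```

Because the basis lives in the type rather than in a comment or a naming convention, forgetting a conversion becomes a compile error instead of a silent off-by-one.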
To specify an Interval:
This defines a single coordinate pair in a coordinate system that combines a basis and an end. The basis is zero-based or one-based. The end is open or closed; an interval with an open end is referred to as half-open because the starting coordinate is always closed. The available coordinate systems are: zero-based half-open, one-based half-open, zero-based closed, and one-based closed. An easier way of specifying a coordinate pair is:
Intervals can be converted to different coordinate systems.
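To make the four coordinate systems concrete, here is a minimal self-contained sketch of interval conversion. It does not use dhtslib: Span, its to method, and size are hypothetical names chosen for this example. The sketch normalizes any interval to zero-based half-open and then shifts it into the target system.

```d
// Hypothetical stand-ins for illustration; these are NOT dhtslib's definitions.
enum Basis { zero, one }
enum End { open, closed }

struct Span(Basis bs, End en)
{
    long start;
    long end;

    // Convert by normalizing to zero-based half-open, then shifting to the target.
    auto to(Basis tb, End te)() const
    {
        long s = start - (bs == Basis.one ? 1 : 0);
        long e = end - (bs == Basis.one ? 1 : 0) + (en == End.closed ? 1 : 0);
        s += (tb == Basis.one ? 1 : 0);
        e += (tb == Basis.one ? 1 : 0) - (te == End.closed ? 1 : 0);
        return Span!(tb, te)(s, e);
    }

    // Number of positions covered; a closed end includes its last position.
    long size() const
    {
        return end - start + (en == End.closed ? 1 : 0);
    }
}

void main()
{
    alias ZBHO = Span!(Basis.zero, End.open);

    auto a = ZBHO(0, 10);                    // [0, 10) zero-based half-open
    auto b = a.to!(Basis.one, End.closed);   // [1, 10] one-based closed

    assert(b.start == 1 && b.end == 10);
    assert(a.size == b.size);                // both cover 10 positions
    assert(b.to!(Basis.zero, End.open) == a); // round trip is lossless
}
```

The same ten positions are written as [0, 10) in zero-based half-open (the BED convention) and as [1, 10] in one-based closed (the convention used by SAM and VCF text), which is exactly the kind of mismatch a typed interval is meant to catch.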
All of the readers, writers, and records for dhtslib will return either an Interval or a Coordinate and will accept an Interval instead of integer-based coordinates.
All functions that take coordinates are responsible for converting them to the correct Coordsystem.
Readers
dhtslib provides readers for SAM/BAM(/CRAM untested), VCF/BCF, BED, GFF, FASTQ, faidx'd FASTA, generic BGZF- or GZIP-compressed files, and tabix-indexed files. Through htslib, the readers automatically support reading compressed and remote (HTTPS or AWS S3) files.
dhtslib.sam.reader: SAMReader (SAM/BAM(/CRAM untested))
dhtslib.vcf.reader: VCFReader (VCF/BCF)
dhtslib.bed.reader: BedReader
dhtslib.gff.reader: GFFReader, GTFReader, GFF2Reader, GFF3Reader
dhtslib.fastq: FastqFile
The readers generally follow this pattern:
They are structs, but generally must be initialized
They act as InputRanges that return the appropriate Record type
They handle any available filtering
They store headers as the appropriate Header type or as a string
They own the data that backs the underlying htslib htsFile
They are reference counted
They control allocating and freeing that data
Exceptions:
BGZFile acts as an InputRange via its byLine and byLineCopy methods.
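The InputRange part of this pattern can be sketched without dhtslib. In the example below, DemoReader and DemoRecord are hypothetical types that read from an in-memory array rather than a file. The sketch shows only that a struct exposing empty, front, and popFront composes with the standard range algorithms, and it leaves out headers, reference counting, and htslib entirely.

```d
import std.algorithm.iteration : filter, map;
import std.array : array;
import std.range.primitives : ElementType, isInputRange;

// Hypothetical record and reader types, for illustration only.
struct DemoRecord
{
    string name;
    long pos;
}

struct DemoReader
{
    private DemoRecord[] records; // stands in for data read from a file
    private size_t idx;

    this(DemoRecord[] records) { this.records = records; }

    // The three primitives that make a type an InputRange.
    @property bool empty() { return idx >= records.length; }
    @property DemoRecord front() { return records[idx]; }
    void popFront() { ++idx; }
}

// Checked at compile time: the reader is an InputRange of DemoRecord.
static assert(isInputRange!DemoReader);
static assert(is(ElementType!DemoReader == DemoRecord));

void main()
{
    auto reader = DemoReader([DemoRecord("r1", 10), DemoRecord("r2", 250)]);

    // Any range algorithm can consume the reader.
    auto names = reader
        .filter!(r => r.pos > 100)
        .map!(r => r.name)
        .array;

    assert(names == ["r2"]);
}
```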
Writers
dhtslib provides writers for SAM/BAM(/CRAM untested), VCF/BCF, BED, and GFF.
dhtslib.sam.writer: SAMWriter (SAM/BAM(/CRAM untested))
dhtslib.vcf.writer: VCFWriter (VCF)
dhtslib.bed.writer: BedWriter
dhtslib.gff.writer: GFF2Writer, GFF3Writer
The writers generally follow this pattern:
They are structs, but generally must be initialized
They require the header upon initialization
They have a write method that accepts their specific Record type
They own the data that backs the underlying htslib htsFile
They are reference counted
They control allocating and freeing that data
Notes:
SAM/BAM(/CRAM untested) writing accepts an enum that selects the output format: SAM, compressed SAM, BAM, uncompressed BAM, or CRAM. By default, it tries to deduce the format from the extension of the file it is writing to.
VCF/BCF writing does not currently have a similar feature, but it will.
GFF, VCF, BED writing can only output uncompressed text.
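Deducing an output format from a file extension can be sketched as follows. This is an assumption-laden illustration, not dhtslib's code: the OutputFormat enum, the deduceFormat function, and the specific extension mapping (including .sam.gz for compressed SAM) are all hypothetical. Uncompressed BAM is left out of the enum because it shares the .bam extension and so could not be deduced this way.

```d
import std.algorithm.searching : endsWith;
import std.exception : assertThrown;
import std.uni : toLower;

// Hypothetical names for illustration; dhtslib's own enum may differ.
enum OutputFormat { sam, compressedSam, bam, cram }

// Map a file name to an output format by its extension (assumed mapping).
OutputFormat deduceFormat(string path)
{
    auto p = path.toLower;
    // Check the longer suffix first so "x.sam.gz" is not mistaken for another type.
    if (p.endsWith(".sam.gz")) return OutputFormat.compressedSam;
    if (p.endsWith(".sam"))    return OutputFormat.sam;
    if (p.endsWith(".bam"))    return OutputFormat.bam;
    if (p.endsWith(".cram"))   return OutputFormat.cram;
    throw new Exception("cannot deduce output format from: " ~ path);
}

void main()
{
    assert(deduceFormat("out.bam") == OutputFormat.bam);
    assert(deduceFormat("OUT.SAM") == OutputFormat.sam);
    assert(deduceFormat("out.sam.gz") == OutputFormat.compressedSam);
    assert(deduceFormat("out.cram") == OutputFormat.cram);

    // An unknown extension is an error rather than a silent default.
    assertThrown(deduceFormat("out.txt"));
}
```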
Records
dhtslib provides record types for SAM/BAM(/CRAM untested), VCF/BCF, BED, and GFF.
dhtslib.sam.record: SAMRecord (SAM/BAM(/CRAM untested))
dhtslib.vcf.record: VCFRecord (VCF/BCF)
dhtslib.bed.record: BedRecord
dhtslib.gff.record: GFFRecord, GTFRecord, GFF2Record, GFF3Record
dhtslib.fastq: FastqRecord
The records generally follow this pattern:
They are structs, but generally must be initialized
They can be built from scratch, though they are usually generated by a reader
Some require a header (VCF, SAM (optional)) upon initialization
They own the data that backs the underlying htslib record type or string
They are reference counted
They control allocating and freeing that data
They have helper methods for mutating the underlying data
They allow access to the underlying htslib datatype pointers
Examples
BAM/SAM manipulation
Loop over the records in a SAM file and do something, then write to a BAM file.
BAM filtering
For each region in a BED file, filter a BAM file and do something. Requires test.bam.bai (the BAM file must be indexed).
VCF manipulation
Loop over the records in a compressed VCF file and do something, then write to a VCF file.
VCF filtering
For each region in a GFF file, filter a VCF file via tabix and do something. Requires test.vcf.gz.tbi (the VCF file must be bgzipped and tabix-indexed).