Package core :: Module AnalyseMapping

Module AnalyseMapping

This script is used to evaluate the mapping results.

Classes
  CustomRead
Class for exhaustive reading of the sam alignments.
  TPRead
Class for exhaustive reading of the sam alignments.
  ReadID
Class for efficient addressing for np.
Functions
 
savepickle(dictionaary, outputname)
 
loadpickle(inputname)
TPRead
getTPRead(alngt, compareList, readdic)
Function which transforms an HTSeq alignment to a TPRead class.
CustomRead
getCustomRead(alngt, compareList, readdic)
Function which transforms an HTSeq alignment to a CustomRead class.
dictionary
GetOrderDictionary(referenceSAM)
Function to get a dictionary containing the rank of a read (given a sorted samfile by readname, samtools).
Readobj
getNextLine(textfile, compareList, readdic)
Function which returns the next line from a SAMFile.
gaps,mismatches
getMisGap(mdtag, cigar)
Parses the MD tag and the CIGAR string of an alignment to count gaps and mismatches.
(bool,list,bool)
getAllID(textfile, read, compareList, readdic)
Reads all alignments for the current read which are in the SAM file (sorted).
np.array
CompareAlignments(reflist, artlist, file)
Compares alignments for the reference and artificial reference for a specific read id.
read obj.
SkipHeader(file, compareList, readdic)
Skips the header from a SAM file, but reads the first line of the alignment section.
int,int
getRanks(RefRead, ArtRead, rankdic)
Function which returns the ranks of 2 given readIDs (read from reference,read from artificial).
ranks
ReadSAMnoMem(ref, art, output, compareList, readdic, rankdic)
Main function for comparing two mappings.
int
getNumberOf(line, tag)
Reads the alignment tag, given the text and a tag to search for.
int
getsum(strlist)
Sum of strings (as ints)
float
getmean(strlist)
Mean of strings.
float
ComputeRQScore(quality)
Computes the ReadQualityScore given a quality string
int
getAlignmentLength(cigar)
Computes the alignment length given a cigar string.
readobj, readname
isSaneAlignment(alignment, identifier, compareList, readdic)
Checks an alignment line for unnecessary information that can be skipped.
bool
CheckForSameAlignments(readref, readart)
Function for comparison of artificial alignments and reference alignments.
int
getHammingdistance(CompareString, start, end)
Computes the number of substitutions in the artificial reference using the CompareString.
dictionary
readReadQualities(fastqfile)
Reads a .fastq file and calculates a defined read score. Input: fastq file. Output: fastq dictionary with key = readid and value = qualstr.
dictionary
extendReadDic(readdic)
Extends a given dictionary with KEY = readid and VALUE = qualstr such that an internal naming is generated which can be used to efficiently create a numpy array.
string
returnSequence(fasta)
Returns a sequence string from a fasta file.
list
CreateCompareList(Reference, ARG)
Creates a list which is used for comparisons between aligned reads (exact number of mismatches)
read obj.
readSAMline(alignment, identifier, compareList, readdic)
Function for reading SAM alignment text file (one line)
int
returnIndex(readdic, readname)
Returns the index of a read.
array
ReadArtificialSAMfileHTSeq(art, compareList, RefArray, readdic)
Function for reading the artificial reference genome using HTSeq. This function is mainly used.
 
writeToTabArt(ReadDic, outfile)
Write function for hits on the reference genome.
 
writeToTabRef(RefArray, outfile, reverseReadArray, ReadDic)
Write function for hits on the reference genome.
np.array(obj)
ReadFromTab(infile, arraysize)
Reads the results from writeToTab in for plotting and analysis
files
readInput(file)
Function for reading in the control dictionary from the input file.
 
GetCompletePath(path)
 
initOutFiles(controlDic, mapper, outpath)
Since we append in the program, we need to make sure no old files remain...
Variables
  alngts = 0
  AlignedReadsdic = {}
  algnedToRef = 0
  algnedToArt = 0
Helper debug functions....
  __package__ = 'core'
Function Details

getTPRead(alngt, compareList, readdic)

Function which transforms an HTSeq alignment to a TPRead class.

Parameters:
  • alngt (alignment) - alignment from the sam file.
  • compareList (list) - list which indicates differences between reference / artificial
  • readdic (dictionary) - Contains ranks and qualities for every entry in the fastq file
Returns: TPRead
Transformed ReadObject.

getCustomRead(alngt, compareList, readdic)

Function which transforms an HTSeq alignment to a CustomRead class.

Parameters:
  • alngt (alignment) - alignment from the sam file.
  • compareList (list) - list which indicates differences between reference / artificial
  • readdic (dictionary) - Contains ranks and qualities for every entry in the fastq file
Returns: CustomRead
Transformed ReadObject.

GetOrderDictionary(referenceSAM)

Function to get a dictionary containing the rank of a read (given a sorted samfile by readname, samtools).

Parameters:
  • referenceSAM (string) - Inputfile name for reference SAM file.
Returns: dictionary
Internal naming according to the sorting of samtools. Key = ReadID, Value = rank.
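
As an illustration of the rank dictionary described above, here is a minimal sketch. It is not the module's implementation: the name `get_order_dictionary_sketch` is hypothetical, it takes an iterable of SAM lines instead of a file name, and it assumes the rank is the 0-based order in which a read ID first appears in the name-sorted file.

```python
def get_order_dictionary_sketch(sam_lines):
    """Map each read ID to its rank in a name-sorted SAM file.

    Assumption: rank = 0-based order of first appearance of the read ID.
    """
    ranks = {}
    for line in sam_lines:
        # Skip blank lines and SAM header lines (they start with '@').
        if not line.strip() or line.startswith("@"):
            continue
        read_id = line.split("\t", 1)[0]
        if read_id not in ranks:
            ranks[read_id] = len(ranks)
    return ranks


example_lines = [
    "@HD\tVN:1.0\tSO:queryname",
    "r1\t0\tchr1\t10",
    "r1\t256\tchr1\t90",
    "r2\t0\tchr1\t40",
]
example_ranks = get_order_dictionary_sketch(example_lines)
```

With a real file, the same function could be fed an open file handle, since a text file iterates line by line.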

getNextLine(textfile, compareList, readdic)

Function which returns the next line from a SAMFile.

Parameters:
  • textfile (fileobject stream) - SAMfile
  • compareList (list) - list which indicates differences between reference / artificial
  • readdic (dictionary) - dictionary containing read ids and read qualities.
Returns: Readobj
Parsed read from text line.

getMisGap(mdtag, cigar)

Parses the MD tag and the CIGAR string of an alignment to count gaps and mismatches.

Parameters:
  • mdtag (string) - MDTag from alignment
  • cigar (string) - Cigar from alignment
Returns: gaps,mismatches
Parsed gaps and mismatches
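
One possible way to derive both counts is sketched below. The exact counting rules of `getMisGap` are not documented here, so this sketch assumes that gaps are the number of inserted plus deleted bases in the CIGAR string and that mismatches are the substituted bases listed in the MD tag value (deletions, marked with `^`, are excluded). The name `get_mis_gap_sketch` is hypothetical.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")


def get_mis_gap_sketch(mdtag, cigar):
    """Return (gaps, mismatches) for one alignment.

    Assumptions: `mdtag` is the bare MD value (without the 'MD:Z:' prefix),
    gaps = inserted + deleted bases, mismatches = substituted bases.
    """
    # Gaps: sum the lengths of all insertion (I) and deletion (D) operations.
    gaps = sum(int(n) for n, op in CIGAR_OP.findall(cigar) if op in "ID")
    # Deleted reference bases appear in MD as ^ACG; drop them first so that
    # only substituted bases remain as letters.
    without_deletions = re.sub(r"\^[A-Za-z]+", "", mdtag)
    mismatches = len(re.findall(r"[A-Za-z]", without_deletions))
    return gaps, mismatches
```

For example, the MD value `10A5^AC6` with CIGAR `16M2D6M` contains one substitution and a two-base deletion.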

getAllID(textfile, read, compareList, readdic)

Reads all alignments for the current read which are in the SAM file (sorted). If a new read ID is scanned the results are returned.

Parameters:
  • textfile (fileobject stream) - SAM file for reading.
  • read (read obj) - the last read obj which defines the current read id
  • readdic (dictionary) - Look up for quality values.
Returns: (bool,list,bool)
A triple consisting of two bools, which serve as indicator variables, and a list with all alignments for one read.

CompareAlignments(reflist, artlist, file)

Compares alignments for the reference and artificial reference for a specific read id. The goal is to identify false positives.

Parameters:
  • reflist (read obj list) - list containing alignments with the same ID (reference)
  • artlist (read obj list) - list containing alignments with the same ID (artificial)
Returns: np.array
indices of unique alignments

SkipHeader(file, compareList, readdic)

Skips the header from a SAM file, but reads the first line of the alignment section.

Parameters:
  • file (fileobject stream) - SAM file for reading.
  • compareList (list) - list for accumulation of the same read id
  • readdic (dictionary) - dictionary containing read ID - read quality mappings.
Returns: read obj.
Returns a read obj. from the SAM file.
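
The header-skipping step can be sketched as follows. This is a simplified illustration, not the module's code: the hypothetical `skip_header_sketch` returns the raw first alignment line instead of a parsed read object, and it relies only on the SAM convention that header lines start with `@`.

```python
import io


def skip_header_sketch(stream):
    """Consume SAM header lines and return the first alignment line.

    Returns None if the stream contains no alignment lines.
    """
    for line in stream:
        if not line.startswith("@"):
            return line.rstrip("\n")
    return None


example_stream = io.StringIO(
    "@HD\tVN:1.0\n"
    "@SQ\tSN:chr1\tLN:100\n"
    "r1\t0\tchr1\t10\n"
    "r2\t0\tchr1\t40\n"
)
first_alignment = skip_header_sketch(example_stream)
```

After the call the stream is positioned at the second alignment line, so subsequent reads continue where the header skipping stopped.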

getRanks(RefRead, ArtRead, rankdic)

Function which returns the ranks of 2 given readIDs (read from reference,read from artificial).

Parameters:
  • RefRead (read obj) - read obj (reference)
  • ArtRead (read obj) - read obj (artificial)
  • rankdic (dictionary) - dictionary containing ranks of the read IDs (according to the sorted SAM files).
Returns: int,int
The ranks of the two read IDs (read from reference, read from artificial).

ReadSAMnoMem(ref, art, output, compareList, readdic, rankdic)

Main function for comparing two mappings. This function takes the complete alignments for the artificial and the reference genome and goes through them in parallel. Since the mappings are sorted, the function alternates the parsing of the SAM files in such a way that the alignments do not have to be held in memory for the comparison.

Parameters:
  • ref (string) - path to reference alignments (SAM file)
  • art (string) - path to artificial alignments (SAM file)
  • output (read obj) - read obj (artificial)
  • compareList (list) - list which indicates differences between reference / artificial
  • readdic (dictionary) - dictionary containing read ID - read quality mappings.
  • rankdic (dictionary) - dictionary containing ranks of the read IDs (according to the sorted SAM files).
Returns: ranks
Returns the ranks for the 2 read ids.
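
The alternating traversal of two sorted files can be illustrated with a merge-style loop. The sketch below is an assumption about the general technique, not a copy of `ReadSAMnoMem`: the names `read_groups` and `walk_in_parallel` are hypothetical, alignments are kept as lists of tab-separated fields, and `rankdic` is assumed to map every read ID to its position in the shared sort order.

```python
from itertools import groupby


def read_groups(lines):
    """Yield (read_id, alignments) for consecutive lines with the same read ID."""
    records = (
        line.rstrip("\n").split("\t")
        for line in lines
        if line.strip() and not line.startswith("@")
    )
    for read_id, group in groupby(records, key=lambda fields: fields[0]):
        yield read_id, list(group)


def walk_in_parallel(ref_lines, art_lines, rankdic):
    """Yield (read_id, ref_alignments, art_alignments), one read at a time.

    Only the alignments of the current read are held in memory.
    """
    ref_groups = read_groups(ref_lines)
    art_groups = read_groups(art_lines)
    ref = next(ref_groups, None)
    art = next(art_groups, None)
    while ref is not None or art is not None:
        if art is None or (ref is not None and rankdic[ref[0]] < rankdic[art[0]]):
            # Read occurs only in the reference mapping.
            yield ref[0], ref[1], []
            ref = next(ref_groups, None)
        elif ref is None or rankdic[art[0]] < rankdic[ref[0]]:
            # Read occurs only in the artificial mapping.
            yield art[0], [], art[1]
            art = next(art_groups, None)
        else:
            # Same read in both mappings: hand both lists to the comparison.
            yield ref[0], ref[1], art[1]
            ref = next(ref_groups, None)
            art = next(art_groups, None)
```

Comparing ranks instead of the read names themselves avoids depending on how samtools orders names, which is presumably why a rank dictionary is passed in.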

getNumberOf(line, tag)

Reads the alignment tag, given the text and a tag to search for.

Parameters:
  • line (string) - SAM line
  • tag (string) - SAM tag, e.g. NM, MD
Returns: int
The number x following the desired tag (tag:i:x).
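
A minimal sketch of such a tag lookup, assuming integer-typed optional fields of the form `TAG:i:value` separated by tabs. The name `get_number_of_sketch` is hypothetical, and returning `None` for a missing tag is an assumption; the behaviour of `getNumberOf` in that case is not documented here.

```python
import re


def get_number_of_sketch(line, tag):
    """Return the integer value of `tag` (TAG:i:x) in a SAM line, or None."""
    pattern = r"(?:^|\t)" + re.escape(tag) + r":i:(-?\d+)"
    match = re.search(pattern, line)
    return int(match.group(1)) if match else None


example_line = "r1\t0\tchr1\t100\t60\t50M\t*\t0\t0\tACGT\tIIII\tNM:i:2\tMD:Z:10A39"
```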

getsum(strlist)

Sum of strings (as ints)

Parameters:
  • strlist (list(str)) - list of numeric strings
Returns: int
Sum of the values (used for the MD tag calculation).

getmean(strlist)

Mean of strings.

Parameters:
  • strlist (list(str)) - list of numeric strings
Returns: float
mean

ComputeRQScore(quality)

Computes the ReadQualityScore given a quality string

Parameters:
  • quality (string) - quality string of a read.
Returns: float
ReadQualityScore (RQS)

getAlignmentLength(cigar)

Computes the alignment length given a cigar string. Needed for Start + End calculation.

Parameters:
  • cigar (string) - Cigar string (SAM)
Returns: int
alignmentlength
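
Because the length is needed for the start and end calculation, a natural reading is the number of reference bases the alignment spans. The sketch below follows that assumption and counts only the CIGAR operations that consume the reference (M, D, N, = and X in the SAM specification); the name `alignment_length_sketch` is hypothetical and the module may count differently.

```python
import re

CIGAR_OP = re.compile(r"(\d+)([MIDNSHP=X])")
REFERENCE_CONSUMING = frozenset("MDN=X")


def alignment_length_sketch(cigar):
    """Number of reference bases covered by an alignment with this CIGAR."""
    return sum(
        int(length)
        for length, op in CIGAR_OP.findall(cigar)
        if op in REFERENCE_CONSUMING
    )
```

Under this reading, `10M2I5M3D4S` spans 10 + 5 + 3 = 18 reference bases, because the insertion and the soft clip do not advance the reference position.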

isSaneAlignment(alignment, identifier, compareList, readdic)

Checks an alignment line for unnecessary information that can be skipped.

Parameters:
  • alignment (string) - Line from SAM
  • identifier (string) - read id
  • compareList (list) - list of alignments with same read id.
  • readdic (dictionary) - Dictionary containing a read id; read quality mapping
Returns: readobj, readname
Returns the read and its identifier.

CheckForSameAlignments(readref, readart)

Function for comparison of artificial alignments and reference alignments. False positives (FP) are defined such that the start and end position must be unique to the artificial reference. Returns 0 if no equal alignment is found (FP found) and 1 if an equal alignment is found.

Parameters:
  • readref (read obj.) - reference
  • readart (read obj.) - artificial
Returns: bool
Indicator if alignment is the same (start & end equal)
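
The false-positive definition can be made concrete with a small sketch. The `Alignment` tuple and both function names are hypothetical stand-ins for the module's read objects; only the rule itself (same start and same end means same alignment) is taken from the description above.

```python
from collections import namedtuple

# Hypothetical stand-in for a read object; only start and end matter here.
Alignment = namedtuple("Alignment", ["start", "end"])


def same_alignment_sketch(readref, readart):
    """True if both alignments share start and end position."""
    return readref.start == readart.start and readref.end == readart.end


def false_positive_indices(reflist, artlist):
    """Indices of artificial alignments that have no equal reference alignment."""
    return [
        index
        for index, art in enumerate(artlist)
        if not any(same_alignment_sketch(ref, art) for ref in reflist)
    ]


ref_hits = [Alignment(10, 60), Alignment(200, 250)]
art_hits = [Alignment(10, 60), Alignment(400, 450)]
```

Here the second artificial hit has no counterpart in the reference mapping, so it is the only one reported.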

getHammingdistance(CompareString, start, end)

Computes the number of substitutions in the artificial reference using the CompareString.

Parameters:
  • CompareString (string) - string of 0s and 1s; 1 marks a position where reference and artificial reference differ.
  • start (int) - start of alignment
  • end (int) - end of alignment
Returns: int
hamming distance
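
Given such a 0/1 sequence, the distance for one alignment is the number of 1s inside its interval. The sketch below assumes 0-based coordinates with an exclusive end, which the documentation does not specify; the name `hamming_distance_sketch` is hypothetical.

```python
def hamming_distance_sketch(compare, start, end):
    """Count the positions marked 1 in compare[start:end].

    Works for a string of '0'/'1' characters as well as a list of ints.
    Assumption: 0-based start, exclusive end.
    """
    return sum(int(flag) for flag in compare[start:end])
```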

readReadQualities(fastqfile)

Reads a .fastq file and calculates a defined read score. Input: fastq file. Output: fastq dictionary with key = readid and value = qualstr.

Parameters:
  • fastqfile (string) - path to fastq file
Returns: dictionary
dictionary containing read ids and read qualities.
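
A minimal sketch of building such a dictionary, assuming well-formed FASTQ records of exactly four lines each and taking the read ID as the first whitespace-delimited token after `@`. The name `read_qualities_sketch` is hypothetical, and the function takes an iterable of lines instead of a path.

```python
def read_qualities_sketch(lines):
    """Return {read_id: quality_string} from four-line FASTQ records."""
    qualities = {}
    lines = iter(lines)
    for header in lines:
        header = header.strip()
        if not header:
            continue  # tolerate blank lines between records
        next(lines)            # sequence line (not needed here)
        next(lines)            # '+' separator line
        quality = next(lines)  # quality string
        read_id = header[1:].split()[0]
        qualities[read_id] = quality.strip()
    return qualities


example_fastq = [
    "@r1 first read\n", "ACGT\n", "+\n", "IIII\n",
    "@r2\n", "GG\n", "+\n", "#I\n",
]
```

With a file on disk, the open handle can be passed directly, for example inside a `with open(path) as handle:` block.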

extendReadDic(readdic)

Extends a given dictionary with KEY = readid and VALUE = qualstr such that an internal naming is generated which can be used to efficiently create a numpy array.

Parameters:
  • readdic (dictionary) - dictionary containing read ids and read qualities.
Returns: dictionary
extended readdic with KEY = ID, VALUE = READID object with READID.internalid and READID.qulstr = qualstr
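
The idea of an internal naming is that every read ID gets a consecutive integer, which can then serve as an array index. The sketch below uses a hypothetical `ReadIDSketch` class with the attributes `internalid` and `quality`, and assigns the integers in dictionary insertion order; the attribute names and ordering used by the module's ReadID class may differ.

```python
class ReadIDSketch:
    """Hypothetical container pairing an integer index with a quality string."""

    def __init__(self, internalid, quality):
        self.internalid = internalid
        self.quality = quality


def extend_read_dic_sketch(readdic):
    """Replace each quality string by an object that also carries an index."""
    return {
        read_id: ReadIDSketch(index, quality)
        for index, (read_id, quality) in enumerate(readdic.items())
    }


extended = extend_read_dic_sketch({"r1": "IIII", "r2": "#I"})
```

An array of length `len(extended)` can then be addressed with `extended[read_id].internalid` instead of a string lookup per result row.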

returnSequence(fasta)

Returns a sequence string from a fasta file.

Parameters:
  • fasta (string) - path to fasta file.
Returns: string
sequence
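
For a FASTA file with a single record, the sequence can be obtained by joining all non-header lines. The sketch assumes exactly that (one record, header lines starting with `>`); the name `return_sequence_sketch` is hypothetical, and it takes an iterable of lines instead of a path.

```python
def return_sequence_sketch(lines):
    """Concatenate all sequence lines of a single-record FASTA file."""
    return "".join(
        line.strip() for line in lines if not line.startswith(">")
    )


example_fasta = [">chr_example description\n", "ACGT\n", "TTGA\n", "C\n"]
```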

CreateCompareList(Reference, ARG)

Creates a list which is used for comparisons between aligned reads (exact number of mismatches)

Parameters:
  • Reference (string) - reference genome
  • ARG (string) - artificial reference genome.
Returns: list
List containing 1s where there is a difference between the genomes and 0s where the nucleotides are equal.
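
A position-wise comparison of this kind can be sketched in a few lines. The sketch assumes both genomes have the same length, which holds if the artificial reference differs from the reference only by substitutions; the name `create_compare_list_sketch` is hypothetical.

```python
def create_compare_list_sketch(reference, arg):
    """Return a list with 1 where the genomes differ and 0 where they agree."""
    if len(reference) != len(arg):
        raise ValueError("both genomes must have the same length")
    return [0 if ref_base == art_base else 1
            for ref_base, art_base in zip(reference, arg)]
```

For example, `ACGTAC` against `ACCTAA` differs at the third and the sixth position.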

readSAMline(alignment, identifier, compareList, readdic)

Function for reading SAM alignment text file (one line)

Parameters:
  • alignment (string) - SAM alignment
  • identifier (string) - read id
  • compareList (list) - list containing 1s where there is a difference between the genomes and 0s where the nucleotides are equal.
  • readdic (dictionary) - dictionary containing read ids and read qualities.
Returns: read obj.
Returns a CustomRead object.

returnIndex(readdic, readname)

Returns the index of a read. The index is prescribed by the ordering in the sam file.

Parameters:
  • readname (string) - read id
  • readdic (dictionary) - dictionary containing read ids and read qualities.
Returns: int
index

ReadArtificialSAMfileHTSeq(art, compareList, RefArray, readdic)

Function for reading the artificial reference genome using HTSeq. This function is mainly used; the custom SAM reading function is used only if no quality string is in the SAM line.

Parameters:
  • art (string) - artificial file.
  • RefArray (array) - Results from reading the reference SAM file.
  • compareList (list) - list containing 1s where there is a difference between the genomes and 0s where the nucleotides are equal.
  • readdic (dictionary) - dictionary containing read ids and read qualities.
Returns: array
aligned read objects in an array.

writeToTabArt(ReadDic, outfile)

Write function for hits on the reference genome. Only the best alignment (fewest mismatches) is written.

Header : ReadID MatchedReference Substitutions NumberOfMismatches ReadQuality MappingQuality

Parameters:
  • outfile (string) - Path to outfile.
  • ReadDic (dictionary) - dictionary containing read ids and read qualities.

writeToTabRef(RefArray, outfile, reverseReadArray, ReadDic)

Write function for hits on the reference genome. Only the best alignment (fewest mismatches) is written.

Header : ReadID MatchedReference Substitutions NumberOfMismatches ReadQuality MappingQuality

Parameters:
  • RefArray (array) - Results from reading the reference SAM file.
  • outfile (string) - Path to outfile.
  • reverseReadArray (dictionary) - Contains reads = values and ranks = keys
  • ReadDic (dictionary) - dictionary containing read ids and read qualities.

ReadFromTab(infile, arraysize)

Reads the results written by writeToTab back in for plotting and analysis.

Header : ReadID MatchedReference Substitutions NumberOfMismatches ReadQuality MappingQuality

Parameters:
  • infile (string) - path to inputfile.
  • arraysize (int) - rows in the infile
Returns: np.array(obj)
array for unique classified reads.
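
The tab-separated layout shared by the write and read functions can be illustrated with the standard `csv` module. This is only a sketch of the file format implied by the header line: `write_rows` and `read_rows` are hypothetical names, values come back as strings, and the module itself returns a numpy array instead of a list of dictionaries.

```python
import csv
import io

HEADER = [
    "ReadID", "MatchedReference", "Substitutions",
    "NumberOfMismatches", "ReadQuality", "MappingQuality",
]


def write_rows(handle, rows):
    """Write the header followed by one tab-separated line per row."""
    writer = csv.writer(handle, delimiter="\t", lineterminator="\n")
    writer.writerow(HEADER)
    writer.writerows(rows)


def read_rows(handle):
    """Read the table back as a list of dictionaries keyed by column name."""
    return list(csv.DictReader(handle, delimiter="\t"))


buffer = io.StringIO()
write_rows(buffer, [["r1", "chr1", 0, 2, 37.5, 60]])
buffer.seek(0)
rows = read_rows(buffer)
```

Keeping the column names in one constant makes it harder for the writer and the reader to drift apart.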

readInput(file)

Function for reading in the control dictionary from the input file.

Parameters:
  • file (string) - Path to the input file.
Returns: files
Control dictionary which is used to regulate the program's workflow.

initOutFiles(controlDic, mapper, outpath)

Since we append in the program, we need to make sure no old files remain...

Parameters:
  • controlDic (dictionary) - dictionary containing future filenames.
  • mapper (string) - current identifier mapper for which the results are written
  • outpath (string) - existing path, where the outfiles will be written.