Tag: TFBS

  • matchPWM is now part of Biostrings 2.7.47 in…

    matchPWM() is now part of Biostrings 2.7.47 (in BioC devel, so you need R-2.7).
    
    Load Biostrings:
    
       library(Biostrings)
    
    Suppose 'pwm' contains a Position Weight Matrix, let's say:
    
       pwm <- rbind(A=c( 1,  0, 19, 20, 18,  1, 20,  7),
                    C=c( 1,  0,  1,  0,  1, 18,  0,  2),
                    G=c(17,  0,  0,  0,  1,  0,  0,  3),
                    T=c( 1, 20,  0,  0,  0,  1,  0,  8))
    
    Note that this is just a standard integer matrix with the 4 DNA base letters
    as row names (having these row names is mandatory).
    Some low-level utility functions are available for manipulating this kind of
    matrix:
    
       > maxWeights(pwm)  # the max weight in each column
       [1] 17 20 19 20 18 18 20  8
    
       > maxScore(pwm)  # the max possible score
       [1] 140
    
    Let's match 'pwm' against Human chr1:
    
       library(BSgenome.Hsapiens.UCSC.hg18)
       chr1 <- Hsapiens$chr1
    
    Number of "best" matches:
    
       > countPWM(pwm, chr1, min.score="100%")  # takes about 5 seconds on my system
       [1] 5152
    
    With a lower cut-off value:
    
       m <- matchPWM(pwm, chr1, min.score="90%")
    
    See the 10 first matches ("first" means "smallest chromosome location", NOT "best"
    matches):
    
       > m[1:10]
         Views on a 247249719-letter DNAString subject
       subject: TAACCCTAACCCTAACCCTAACCCTAACCCTAAC...NNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNNN
       views:
            start   end width
        [1] 31931 31938     8 [GTAAACAA]
        [2] 33324 33331     8 [GTAAACAT]
        [3] 38425 38432     8 [GTAAACAG]
        [4] 39177 39184     8 [GTAAACAC]
        [5] 46971 46978     8 [GTAAACAT]
        [6] 49952 49959     8 [GTAAACAT]
        [7] 70381 70388     8 [GTAAACAG]
        [8] 74359 74366     8 [GTAAACAC]
        [9] 90714 90721     8 [GTAAACAT]
       [10] 96544 96551     8 [GTAAACAC]
    
    The speed could be improved, maybe by a factor 2 (or more, for longest PWMs).
    
    Also maybe an additional argument could be added to let the user control how
    the returned matches should be sorted ("left-to-right" or "best-first")?
    
    Unlike MatInspector or the transfac-tool, there is no facility yet to suggest
    individual cut-off values depending on the length of the PWMs.
    
    See '?matchPWM' for more information (e.g. how to search the minus strand of
    the chromosome).
    

    https://stat.ethz.ch/pipermail/bioconductor/2008-April/022029.html