python – 查找所有常见的非重叠子字符串

给定两个字符串,我想识别从最长到最短的所有常见子字符串.

我想删除任何“子”子字符串.例如,’1234’的任何子串都不会包含在’12345’和’51234’之间的匹配中.

string1 = '51234'string2 = '12345'result = ['1234', '5']

我想找到longest common substring,然后递归地找到左/右最长的子串.但是,我不希望在找到之后删除一个公共子字符串.例如,下面的结果在中间共享一个6：

string1 = '12345623456'string2 = '623456'result = ['623456', '23456']

最后,我需要针对数千个字符串的固定列表检查一个字符串.我不确定是否有一个聪明的步骤可以用来散列这些字符串中的所有子串.

以前的答案：

在这个thread中,发现一个动态编程解决方案需要O(nm)时间,其中n和m是字符串的长度.我对更有效的方法感兴趣,它将使用后缀树.

背景：

我正在用旋律片段创作歌曲旋律.有时,组合设法生成在现有的一行中匹配太多音符的旋律.

我可以使用字符串相似性度量,例如编辑距离,但相信与旋律有很小差异的曲调是独特且有趣的.不幸的是,这些曲调与连续复制旋律的许多音符的歌曲具有相似的相似度.

解决方法:

让我们从树开始吧

from collections import defaultdictdef identity(x):    return xclass TreeReprMixin(object):    def __repr__(self):        base = dict(self)        return repr(base)class PrefixTree(TreeReprMixin, defaultdict):    '''    A hash-based Prefix or Suffix Tree for testing for    sequence inclusion. This implementation works for any    slice-able sequence of hashable objects, not just strings.    '''    def __init__(self):        defaultdict.__init__(self, PrefixTree)        self.labels = set()    def add(self, sequence, label=None):        layer = self        if label is None:            label = sequence        if label:            layer.labels.add(label)        for i in range(len(sequence)):            layer = layer[sequence[i]]            if label:                layer.labels.add(label)        return self    def add_ngram(self, sequence, label=None):        if label is None:            label = sequence        for i in range(1, len(sequence)   1):            self.add(sequence[:i], label)    def __contains__(self, sequence):        layer = self        j = 0        for i in sequence:            j  = 1            if not dict.__contains__(layer, i):                break            layer = layer[i]        return len(sequence) == j    def depth_in(self, sequence):        layer = self        count = 0        for i in sequence:            if not dict.__contains__(layer, i):                print "Breaking"                break            else:                layer = layer[i]            count  = 1        return count    def subsequences_of(self, sequence):        layer = self        for i in sequence:            layer = layer[i]        return layer.labels    def __iter__(self):        return iter(self.labels)class SuffixTree(PrefixTree):    '''    A hash-based Prefix or Suffix Tree for testing for    sequence inclusion. This implementation works for any    slice-able sequence of hashable objects, not just strings.    '''    def __init__(self):        defaultdict.__init__(self, SuffixTree)        self.labels = set()    def add_ngram(self, sequence, label=None):        if label is None:            label = sequence        for i in range(len(sequence)):            self.add(sequence[i:], label=label)

要填充树,您将使用.add_ngram方法.

下一部分有点棘手,因为您正在寻找并发遍历字符串同时跟踪树坐标.为了解决所有这些问题,我们需要一些在树上运行的函数和一个查询字符串

def overlapping_substrings(string, tree, solved=None):    if solved is None:        solved = PrefixTree()    i = 1    last = 0    matching = True    solutions = []    while i < len(string)   1:        if string[last:i] in tree:            if not matching:                matching = True            else:                i  = 1                continue        else:            if matching:                matching = False                solutions.append(string[last:i - 1])                last = i - 1                i -= 1        i  = 1    if matching:        solutions.append(string[last:i])    for solution in solutions:        if solution in solved:            continue        else:            solved.add_ngram(solution)            yield solutiondef slide_start(string):    for i in range(len(string)):        yield string[i:]def seek_subtree(tree, sequence):    # Find the node of the search tree which    # is found by this sequence of items    node = tree    for i in sequence:        if i in node:            node = node[i]        else:            raise KeyError(i)    return nodedef find_all_common_spans(string, tree):    # We can keep track of solutions to avoid duplicates    # and incomplete prefixes using a Prefix Tree    seen = PrefixTree()    for substring in slide_start(string):        # Drive generator forward        list(overlapping_substrings(substring, tree, seen))    # Some substrings are suffixes of other substrings which you do not    # want    compress = SuffixTree()    for solution in sorted(seen.labels, key=len, reverse=True):        # A substrings may be a suffix of another substrings, but that substrings        # is actually a repeating pattern. If a solution is        # a repeating pattern, `not solution in seek_subtree(tree, solution)` will tell us.        # Otherwise, discard the solution        if solution in compress and not solution in seek_subtree(tree, solution):            continue        else:            compress.add_ngram(solution)    return compress.labelsdef search(query, corpus):    tree = SuffixTree()    if isinstance(corpus, SuffixTree):        tree = corpus    else:        for elem in corpus:            tree.add_ngram(elem)    return list(find_all_common_spans(query, tree))

所以现在做你想做的事,做到这一点：

search("12345", ["51234"])search("623456", ["12345623456"])

如果有什么不清楚的地方,请告诉我,我会尽力澄清.

来源：https://www.icode9.com/content-1-303851.html

本站仅提供存储服务，所有内容均由用户发布，如发现有害或侵权内容，请点击举报。