打开APP
userphoto
未登录

开通VIP,畅享免费电子书等14项超值服

开通VIP
Fuzzy String Matching in Python – Marco Bonzanini

Fuzzy String Matching, also called Approximate String Matching, is the process of finding strings that approximatively match a given pattern.
The closeness of a match is often measured in terms of edit distance, which is the number of primitive operations necessary to convert the string into an exact match.
Primitive operations are usually: insertion (to insert a new character at a given position), deletion (to delete a particular character) and substitution (to replace a character with a new one).

Fuzzy String Matching can have different practical applications. Typical examples are spell-checking, text re-use detection (the politically correct way of calling plagiarism detection), spam filtering, as well as several applications in the bioinformatics domain, e.g. matching DNA sequences.

This article plays around with fuzzywuzzy, a Python library for Fuzzy String Matching.

Getting Started with Fuzzywuzzy

FuzzyWuzzy has been developed and open-sourced by SeatGeek, a service to find sport and concert tickets. Their original use case, as discussed in their blog, was the problem given by the many different ways of labelling the same event, adding or hiding location, dates, venue, etc. This problem is also arising with different entities like persons or companies.

To install the library, you can use pip as usual:

pip install fuzzywuzzy

The main modules in FuzzyWuzzy are called fuzz, for string-to-string comparisons, and process to compare a string with a list of strings.

Under the hood, FuzzyWuzzy uses difflib, part of the standard library, so there is nothing extra to install. We can anyway benefit from the performance of python-Levenshtein for sequence matching, so let’s also install this library:

pip install python-Levenshtein

Examples of Usage

Firstly, let’s import the main modules:

1
2
from fuzzywuzzy import fuzz
from fuzzywuzzy import process

In order to calculate a similarity score between two strings, we can use the methods ratio() or partial_ratio():

1
2
3
4
fuzz.ratio("ACME Factory", "ACME Factory Inc.")
# 83
fuzz.partial_ratio("ACME Factory", "ACME Factory Inc.")
# 100

We can see how the ratio() function is confused by the suffix “Inc.” used in company names, but really the two strings refer to the same entity. This is captured by the partial ratio.

More examples:

1
2
3
4
5
6
7
8
9
fuzz.ratio('Barack Obama', 'Barack H. Obama')
# 89
fuzz.partial_ratio('Barack Obama', 'Barack H. Obama')
# 75
fuzz.ratio('Barack H Obama', 'Barack H. Obama')
# 97
fuzz.partial_ratio('Barack H Obama', 'Barack H. Obama')
# 92

Here we observe the opposite behaviour: different variations in Barack Obama’s name produce a lower score for the partial ratio, why is that? Probably because the extra token for the middle name is right in the middle of the string. For this particular case, we can benefit by other functions that tokenise the string and treat it as a set or as a sequence of words:

1
2
3
4
5
6
7
8
9
fuzz.token_sort_ratio('Barack Obama', 'Barack H. Obama')
# 92
fuzz.token_set_ratio('Barack Obama', 'Barack H. Obama')
# 100
fuzz.token_sort_ratio('Barack H Obama', 'Barack H. Obama')
# 100
fuzz.token_set_ratio('Barack H Obama', 'Barack H. Obama')
# 100

The token_* functions split the string on white-spaces, lowercase everything and get rid of non-alpha non-numeric characters, which means punctuation is ignored (as well as weird unicode symbols).

In case we have a list of options and we want to find the closest match(es), we can use the process module:

1
2
3
4
5
6
7
8
9
query = 'Barack Obama'
choices = ['Barack H Obama', 'Barack H. Obama', 'B. Obama']
# Get a list of matches ordered by score, default limit to 5
process.extract(query, choices)
# [('Barack H Obama', 95), ('Barack H. Obama', 95), ('B. Obama', 85)]
# If we want only the top one
process.extractOne(query, choices)
# ('Barack H Obama', 95)

Summary

This article has introduced Fuzzy String Matching, which is a well understood problem with some interesting practical applications.

Python has a very simple option to tackle the problem: the FuzzyWuzzy library, which is built on top of difflib (and python-Levenshtein for speed). It can take a while to figure out how to scope our string matching problem, but the easy interface of fuzzywuzzy should help speeding up the development.

本站仅提供存储服务,所有内容均由用户发布,如发现有害或侵权内容,请点击举报
打开APP,阅读全文并永久保存 查看更多类似文章
猜你喜欢
类似文章
【热】打开小程序,算一算2024你的财运
python fuzzywuzzy模块 模糊字符串匹配详细用法
使用 fuzzywuzzy 模块计算两个字符串之间的相似度
一个非常好用的 Python 魔法库
揭秘 | Barack Obama 之 5 百万美元大宅
Barack Obama's Speeches - 奥巴马演讲集
Can Barack Obama help U.S. out of troubles?
更多类似文章 >>
生活服务
热点新闻
分享 收藏 导长图 关注 下载文章
绑定账号成功
后续可登录账号畅享VIP特权!
如果VIP功能使用有故障,
可点击这里联系客服!

联系客服