To AI or not to AI? That should not be the question.

Documenting all the business rules

The first challenge was to figure out the actual business rules that make up the matching criteria. Because the code had been written by multiple people over multiple years, the only way to document the rules was to read the source code, which took some time. Many thanks to my teammate Dylan Manville, whose hard work made it possible to produce this documentation.

Considerations for choosing a Python library

From a data science perspective, fuzzy string matching is at its core built on Levenshtein distance, the minimum number of single-character edits needed to turn one string into another. From a business logic perspective, our data contains both ASCII and non-ASCII characters, so non-ASCII support was one of the concerns. From a long-term maintenance perspective, I wanted a library with an active development history that shows signs of growth.
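To make the idea concrete, here is a minimal sketch of Levenshtein distance in plain Python. It is only an illustration, not the implementation either candidate library uses, and the `similarity` helper is a simple normalization I am assuming for the example. Because Python strings are sequences of Unicode code points, non-ASCII characters are compared just like ASCII ones.

```python
def levenshtein(a: str, b: str) -> int:
    """Return the minimum number of single-character edits turning a into b."""
    if len(a) < len(b):
        a, b = b, a  # keep the inner row as short as possible
    previous = list(range(len(b) + 1))
    for i, ch_a in enumerate(a, start=1):
        current = [i]
        for j, ch_b in enumerate(b, start=1):
            cost = 0 if ch_a == ch_b else 1
            current.append(min(
                previous[j] + 1,         # deletion
                current[j - 1] + 1,      # insertion
                previous[j - 1] + cost,  # substitution
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Scale the distance to a 0-100 score (100 means identical).

    Illustrative normalization only; real libraries use their own formulas.
    """
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)


print(levenshtein("kitten", "sitting"))        # 3
print(levenshtein("M\u00fcnchen", "Munchen"))  # 1 (one substitution)
```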

Python library candidates

Among several Python libraries, we selected fuzzywuzzy and rapidfuzz as candidates. Each library offers two approaches (packages), and each approach offers eight functions that use different algorithms. Funnily enough, the functions share the same names across the two libraries. In total, that gave us 32 library-package-function combinations (2 × 2 × 8) to evaluate.
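The count can be reproduced with a few lines of Python. Treat the names as assumptions for illustration: the article only names the fuzz approach explicitly, so "process" as the second approach is my guess, and the eight function names are scorers that exist under the same name in both libraries, not necessarily the exact set we evaluated.

```python
from itertools import product

libraries = ["fuzzywuzzy", "rapidfuzz"]
packages = ["fuzz", "process"]  # "process" is an assumption
functions = [                   # assumed set of eight shared scorer names
    "ratio", "partial_ratio",
    "token_sort_ratio", "partial_token_sort_ratio",
    "token_set_ratio", "partial_token_set_ratio",
    "QRatio", "WRatio",
]

# Every library-package-function triple to evaluate.
combinations = list(product(libraries, packages, functions))
print(len(combinations))  # 2 * 2 * 8 = 32
```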

We evaluated each library-package-function combination on three criteria:
  1. Accuracy, also known as matching rate: how closely the function's result agrees with the existing result
  2. Confidence, also known as confidence rate: how certain the function is of its result
  3. Speed, also known as average time per shipment: how quickly the function calculates the result for each shipment record
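The three criteria above can be sketched as a small evaluation harness. This is a rough illustration rather than our actual benchmark code: `stand_in_scorer` uses the standard library's `difflib` so the example runs without extra dependencies, the function names `best_match` and `evaluate` are hypothetical, and the sample records are made up. Any scorer that returns a 0-100 score could be passed in instead.

```python
import time
from difflib import SequenceMatcher
from statistics import mean


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in 0-100 similarity score built on the standard library."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def best_match(query, choices, scorer):
    """Return (choice, score) for the highest-scoring choice."""
    scored = ((choice, scorer(query, choice)) for choice in choices)
    return max(scored, key=lambda pair: pair[1])


def evaluate(records, choices, scorer):
    """records: (query, expected) pairs, where expected is the existing result."""
    hits, scores, durations = 0, [], []
    for query, expected in records:
        start = time.perf_counter()
        choice, score = best_match(query, choices, scorer)
        durations.append(time.perf_counter() - start)
        scores.append(score)
        hits += int(choice == expected)
    return {
        "accuracy": hits / len(records),        # matching rate
        "confidence": mean(scores),             # average confidence rate
        "seconds_per_record": mean(durations),  # average time per record
    }


# Made-up sample data, for illustration only.
choices = ["Acme Logistics", "Z\u00fcrich Freight AG", "Northwind Traders"]
records = [
    ("Acme Logistic", "Acme Logistics"),
    ("Zurich Freight", "Z\u00fcrich Freight AG"),
    ("Northwind Trader", "Northwind Traders"),
]
report = evaluate(records, choices, stand_in_scorer)
print(report)
```

Running the same `evaluate` call once per combination gives a comparable row of numbers for each candidate.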

Results

After running the combinations against several weeks of historical data, we found the WRatio function of the fuzz approach in the rapidfuzz library to be the most reliable. If you are interested in the details of the research, please check out Zhuoyi’s article.

Client-facing design and algorithm tune-up

Because the confidence rate was part of the evaluation, I decided to expose it to the caller and let the caller set the threshold according to their needs. Our implementation uses 90 as the confidence threshold, but future users of this “match engine” might want to specify their own match threshold based on their requirements. Another consideration was the need to monitor performance over time. To that end, I decided to implement a logging mechanism so that we can evaluate the running performance, make new decisions, and tune the algorithm in the future.
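A minimal sketch of that interface might look like the following. It is an assumption-laden illustration, not our production code: the function name `match`, the logger name, and the log format are hypothetical, and the `difflib`-based scorer is a stand-in so the example runs on the standard library alone. In practice the scorer argument would be the selected library function.

```python
import logging
from difflib import SequenceMatcher

logger = logging.getLogger("match_engine")  # hypothetical logger name

DEFAULT_THRESHOLD = 90  # the threshold our implementation uses


def stand_in_scorer(a: str, b: str) -> float:
    """Stand-in 0-100 similarity score; swap in the real scorer here."""
    return 100.0 * SequenceMatcher(None, a, b).ratio()


def match(query, choices, threshold=DEFAULT_THRESHOLD, scorer=stand_in_scorer):
    """Return (choice, confidence); choice is None when below the threshold."""
    best_choice, confidence = None, 0.0
    for choice in choices:
        score = scorer(query, choice)
        if score > confidence:
            best_choice, confidence = choice, score
    accepted = confidence >= threshold
    # One log record per call, so the confidence distribution can be reviewed later.
    logger.info(
        "query=%r best=%r confidence=%.1f threshold=%s accepted=%s",
        query, best_choice, confidence, threshold, accepted,
    )
    return (best_choice if accepted else None), confidence


logging.basicConfig(level=logging.INFO)
choices = ["Northwind Traders", "Acme Logistics"]
print(match("Northwind Trader", choices))                # default threshold of 90
print(match("Northwind Trader", choices, threshold=99))  # stricter caller
```

Returning the confidence even when the match is rejected lets the caller see how close a near miss was, and the log lines give us the raw material for tuning the threshold later.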
