To AI or not to AI? That should not be the question.

Yiqing Lan
Jul 21, 2021

This post is quite different from my earlier ones. Instead of covering a specific technical topic, I want to share my experience solving a particular business problem with an AI-based (or non-AI) solution, and the decisions that led to the final implementation.

The problem I wanted to solve was to improve the matching rate of shipment records to their PeriShip account IDs. This is a crucial function in our operational system, and it impacts all the “down line” functions. In the existing implementation, this functionality was written as a group of “if else” conditional statements. The matching rate was close to 90%. For the 10% of shipment records where our program failed to find a matching account ID, our operations staff had to assign the correct account ID manually. To illustrate the magnitude of the problem: if we have 10,000 shipments a day, there will be 1,000 shipment records that our operations folks must assign manually. Of course, PeriShip’s daily shipment volume is a lot higher than 10,000. As you can imagine, the manual labor spent on these unmatched shipment records is quite high.

Improving the matching rate so that we have less manual work has always been one of my “unfinished” tasks. Since AI is a popular topic these days, it inevitably came to mind when I was searching for a solution. A few months ago, I approached my colleague Zhuoyi Zhao to see if she could use AI to improve the matching rate. After discussing the goal of the project and how to measure the result, we started the actual work. Our first step was to document the current business rules used to match the records. Then Zhuoyi started to research the various ways to solve the “record match” problem. After a week or so, she presented her findings and evaluation results for the different Python fuzzy string matching packages. Along with the results, Zhuoyi recommended narrowing the Python packages down to rapidfuzz and fuzzywuzzy. We had a couple of team meetings to go over her research results and recommendation. In the end, as a team, we decided to go with her recommendation. Once we had decided on the approach, the rest of the design and implementation process was less eventful. Here is a quick rundown of what I consider the key aspects of this effort:

Documenting all the business rules

The first challenge we had was to figure out the actual business rules that constitute the matching criteria. Since the code had been written by multiple people over multiple years, the only way to produce the documentation was to read the source code, which took some time. Thanks to the hard work of my teammate Dylan Manville, we were able to produce the related documentation.

Considerations for choosing a Python library

From a data science perspective, fuzzy string matching is, at its core, built on Levenshtein distance. From a business logic perspective, our data contains both ASCII and non-ASCII characters, so support for non-ASCII text was one of our concerns. From a long-term maintenance perspective, I wanted a library that is actively developed and shows signs of continued growth.
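To make the Levenshtein idea concrete, here is a minimal pure-Python sketch. It is not the code that fuzzywuzzy or rapidfuzz run internally; the function names and the 0-100 scaling in `similarity` are my own illustrative choices, and each library defines its scores in its own way.

```python
def levenshtein(a: str, b: str) -> int:
    """Minimum number of single-character insertions, deletions,
    or substitutions needed to turn string a into string b."""
    if len(a) < len(b):
        a, b = b, a  # iterate over the longer string, keep rows short
    # previous[j] = distance between the empty prefix of a and b[:j]
    previous = list(range(len(b) + 1))
    for i, char_a in enumerate(a, start=1):
        current = [i]  # distance between a[:i] and the empty string
        for j, char_b in enumerate(b, start=1):
            cost = 0 if char_a == char_b else 1
            current.append(min(
                previous[j] + 1,         # delete char_a
                current[j - 1] + 1,      # insert char_b
                previous[j - 1] + cost,  # substitute (or keep) the character
            ))
        previous = current
    return previous[-1]


def similarity(a: str, b: str) -> float:
    """Illustrative 0-100 score: 100 means the strings are identical.
    This scaling is an assumption made for the example only."""
    longest = max(len(a), len(b))
    if longest == 0:
        return 100.0
    return 100.0 * (1 - levenshtein(a, b) / longest)
```

Because Python strings are sequences of Unicode code points, this sketch treats a non-ASCII character such as "é" like any other single character, which is the behavior we needed from a library.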

Python library candidates

Among a few Python libraries, we selected fuzzywuzzy and rapidfuzz as candidates. Each library has two approaches, each offering eight functions that use different algorithms. Funnily enough, the functions share the same names across the two libraries. In total, we had 32 combinations of library-package-function to evaluate for performance.

Combinations of library-package-function
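As a rough illustration of where the number 32 comes from, the combinations can be enumerated as a simple product. The names below are assumptions made for this sketch: I take the two approaches to be the `fuzz` and `process` modules, and the eight function names are ones that exist in both libraries, which may not be exactly the set evaluated in the research.

```python
from itertools import product

LIBRARIES = ["fuzzywuzzy", "rapidfuzz"]

# Assumption: the two "approaches" are the fuzz and process modules.
APPROACHES = ["fuzz", "process"]

# Assumption: eight scoring functions that share names in both libraries.
FUNCTIONS = [
    "ratio",
    "partial_ratio",
    "token_sort_ratio",
    "token_set_ratio",
    "partial_token_sort_ratio",
    "partial_token_set_ratio",
    "QRatio",
    "WRatio",
]

# Every (library, approach, function) triple to be evaluated.
combinations = list(product(LIBRARIES, APPROACHES, FUNCTIONS))

for library, approach, function in combinations:
    print(f"{library}.{approach} with {function}")
```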

We came up with three main aspects of performance. They are accuracy, confidence, and speed.

  1. Accuracy, also known as matching rate, represents how closely the result agrees with the existing result
  2. Confidence, also known as confidence rate, represents how certain the function considers its result to be
  3. Speed, also known as average time per shipment, represents how quickly the function calculates a result for each shipment record

Since the goal is to reduce human labor, I believe accuracy is the most critical aspect. I also prefer confidence over speed, because the more confident the result is, the less attention a human needs to pay to double-checking it. In addition, Zhuoyi came up with a fourth aspect, “Accuracy-over-Confidence”, which I completely support. It represents how closely the result agrees with the existing result when the confidence rate is above 90; throughout the research, we used 90 as the threshold for the confidence rate. While we do want a faster matching process, speed is a lesser factor than accuracy and confidence.
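The four aspects can be sketched as a small evaluation routine. This is only an illustration under assumptions, not the code used in the research: the `MatchResult` fields and the `evaluate` function are hypothetical, confidence is summarized here as the mean score, and a score equal to the threshold is counted as confident.

```python
from dataclasses import dataclass
from typing import Dict, List, Optional


@dataclass
class MatchResult:
    predicted_id: str   # account ID chosen by the fuzzy matcher
    expected_id: str    # account ID from the existing (reference) result
    confidence: float   # matcher score on a 0-100 scale
    seconds: float      # time spent matching this shipment


def evaluate(results: List[MatchResult],
             threshold: float = 90.0) -> Dict[str, Optional[float]]:
    """Summarize accuracy, confidence, speed, and accuracy-over-confidence."""
    if not results:
        raise ValueError("at least one result is required")
    total = len(results)
    correct = sum(r.predicted_id == r.expected_id for r in results)
    # Assumption: "confident" means a score at or above the threshold.
    confident = [r for r in results if r.confidence >= threshold]
    confident_correct = sum(r.predicted_id == r.expected_id for r in confident)
    return {
        "accuracy": correct / total,
        "confidence": sum(r.confidence for r in results) / total,
        "speed": sum(r.seconds for r in results) / total,
        # Undefined (None) when no result reaches the threshold.
        "accuracy_over_confidence":
            confident_correct / len(confident) if confident else None,
    }


# Made-up sample data, for illustration only.
sample = [
    MatchResult("A1", "A1", 100.0, 0.002),
    MatchResult("B2", "B2", 95.0, 0.004),
    MatchResult("C3", "C9", 92.0, 0.003),
    MatchResult("D4", "D4", 60.0, 0.003),
]
report = evaluate(sample)
print(report)
```

In the sample, three of the four results agree with the reference, but only two of the three confident results do, which is exactly the kind of gap that “Accuracy-over-Confidence” is meant to expose.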

Results

After running several weeks of historical data through the candidates, we considered the WRatio function of the fuzz approach in the rapidfuzz library to be the most reliable one. For those who are interested in the details of the research, please check out Zhuoyi’s article.

Client facing, algorithm tune-up

Given that the confidence rate is part of the evaluation, I decided that we would expose it to the caller and allow the caller to set the confidence threshold based on his or her needs. My thinking was that, while our implementation uses 90 as the default threshold, future users of this “match engine” might want to specify their own match threshold based on their requirements. Another consideration was the need to monitor performance over time. To do that, I decided to implement a logging mechanism so that we can evaluate the running performance, make new decisions, and tune the algorithm in the future.
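Here is a minimal sketch of what such a caller-facing interface could look like. Everything in it is hypothetical: the function and logger names are made up, and the scorer is the standard library's `difflib.SequenceMatcher`, swapped in as a stand-in so the example runs without extra packages. It is not rapidfuzz's WRatio and will give different scores.

```python
import logging
from difflib import SequenceMatcher
from typing import Dict, Optional, Tuple

logger = logging.getLogger("match_engine")  # hypothetical logger name


def score(a: str, b: str) -> float:
    """Stand-in similarity score on a 0-100 scale (not WRatio)."""
    return 100.0 * SequenceMatcher(None, a.lower(), b.lower()).ratio()


def match_account(shipment_name: str,
                  accounts: Dict[str, str],
                  threshold: float = 90.0) -> Tuple[Optional[str], float]:
    """Return (account_id, confidence) for the best-scoring account.

    accounts maps account ID to account name. The account ID is None
    when the best score is below the caller-supplied threshold, so the
    record can be routed to manual assignment.
    """
    best_id, best_score = None, 0.0
    for account_id, account_name in accounts.items():
        current = score(shipment_name, account_name)
        if current > best_score:
            best_id, best_score = account_id, current
    matched = best_id is not None and best_score >= threshold
    # Log every decision so the running performance can be reviewed later.
    logger.info(
        "shipment=%r best=%r confidence=%.1f threshold=%.1f matched=%s",
        shipment_name, best_id, best_score, threshold, matched,
    )
    return (best_id if matched else None), best_score
```

A caller who needs stricter matching can pass a higher value, for example `match_account(name, accounts, threshold=95.0)`, and the log lines can later be aggregated to see how often records fall below the threshold.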

I am very happy with the result of this new approach. The improvement was significant, and it produced concrete cost savings. However, I did have a “small disappointment”: we did not get the chance to use an AI-based solution, since “Fuzzy String Matching” cannot really be considered an AI-based technology. I had AI in mind when I started this “journey”, and in the end we did not use it. What really counted, however, was the result. AI or not, it is hard to dispute a 99+% matching rate. The lesson I learned from this process is that when it comes to creating a good business solution, “to AI or not to AI?” should not be the question. The right question is: “what is the best way to solve the problem?”

Many thanks to Zhuoyi Zhao and Dylan Manville. Without their hard work, we could not have produced this wonderful solution. Thanks for reading. I hope this article sheds some light on my attempt to introduce AI technology into our system design and implementation. While we did not use AI in this project, I know there are many other areas where we can and will apply it. I will continue to share my experience in this area with you. Stay tuned.
