When your colleague asks you ‘What options are there for fuzzy text matching?’ and your brain struggles to give an appropriate answer…
This was how this post was born. Behold, my feeble attempt at translating the available options~
Length of text
- One/ a few words
Case in point, entity name matching. When you have external vendor data with names that don’t match your internal names (and you have no common ID to link them up, of course)… ‘Apple Inc’ doesn’t match exactly with ‘Apple Incorporated’ is just the tip of the iceberg. You’re dealing with potentially misspellings, scrambled words (‘Congo, Republic of the’ vs ‘Republic of the Congo’) and short forms (‘Pte Ltd’ vs ‘Private Limited’)… the list goes on! Hardcoding these cleanup rules would be crazy (people have tried, believe me).
this is a good start: https://www.datacamp.com/community/tutorials/fuzzy-string-python
if you have a giant dataset and need speed: https://bergvca.github.io/2017/10/14/super-fast-string-matching.html
can i do a shameless #selfplug here lol i combined several algorithms to meet my own need…! : https://github.com/yinglinglow/namematching
- A sentence
This comes in handy when you are working on a chatbot! Say, someone types in ‘I would like to know what is my account balance today.’ You would need to extract the intent from this sentence. Again, tagging each word and all other possible substitutions for that word would be a tedious task. Thankfully, we have our friendly tech giants who have created APIs to make life easier! Check out the below:
- Amazon Web Services Lex: https://aws.amazon.com/lex/
- Google Dialogflow: https://dialogflow.com
- Facebook wit.ai: https://wit.ai
- Microsoft Cognitive Services Language Understanding (LUIS): https://www.luis.ai
of course if you want to train your own: https://towardsdatascience.com/a-brief-introduction-to-intent-classification-96fda6b1f557
- A paragraph
Imagine you have a paragraph (or two) of text that someone (say…audit? lol) has keyed in. Now, you would like to tag this paragraph of words with a relevant label. Perhaps it is the category ‘Fraud’ out of other categories. How do I go about this? Perhaps we can see what are the most frequently occuring words?
sentiment analysis: https://www.datacamp.com/community/tutorials/simplifying-sentiment-analysis-python
moarrr: https://www.kaggle.com/ngyptr/python-nltk-sentiment-analysis
ok there are a million more things to write but… till next time!!
in the meantime CHECK THIS OUT THIS IS INSANE
https://medium.com/dair-ai/deep-learning-for-nlp-an-overview-of-recent-trends-d0d8f40a776d