So I'm working on a chatbot project, or at least I've reached the stage where I'm thinking about how to implement it.
It's going to be an ELIZA-like chatbot: it will have a list of keywords and a corresponding list of responses for each keyword.
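To make the keyword/response idea concrete, here is a minimal sketch of such a table in C. The `rule` struct, the `rules` array and `find_response` are names made up for illustration, and the "bye" entry is a hypothetical second rule. Lookup here is plain exact matching, nothing fuzzy yet.

```c
#include <stddef.h>
#include <string.h>

/* One keyword and the canned reply that goes with it. */
struct rule {
    const char *keyword;   /* stored in lower case */
    const char *response;
};

static const struct rule rules[] = {
    { "hey", "Hello there!" },
    { "bye", "Goodbye!" },   /* hypothetical entry, for illustration only */
};

/* Returns the response for an exactly matching keyword, or NULL if none. */
const char *find_response(const char *word)
{
    for (size_t i = 0; i < sizeof rules / sizeof rules[0]; i++) {
        if (strcmp(rules[i].keyword, word) == 0)
            return rules[i].response;
    }
    return NULL;
}
```

Calling `find_response("hey")` gives back "Hello there!", while an unknown word gives `NULL`, so the caller can decide what to print in that case.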
Right now I'm working on the user input, and I want it to be as flexible as possible. First of all, I've decided to convert all user input to lower case and to store the keywords only in lower case. The plan is to split the sentence the user entered into words, compare each word against each keyword using the Levenshtein distance, and then pick the keyword with the lowest distance.
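A rough sketch of that plan in C, under a few assumptions: words are runs of letters and digits, no word or keyword is longer than `MAX_WORD - 1` characters (longer input words are truncated, longer keywords never match), and the names `levenshtein` and `best_keyword` are invented here. The distance is the plain Levenshtein one, so a swap of two adjacent letters counts as 2 edits.

```c
#include <ctype.h>
#include <limits.h>
#include <stddef.h>
#include <string.h>

#define MAX_WORD 64   /* assumed upper bound on word length, including '\0' */

/* Plain Levenshtein distance, computed with two rolling rows.
   Returns INT_MAX if either string does not fit in MAX_WORD. */
int levenshtein(const char *a, const char *b)
{
    size_t la = strlen(a), lb = strlen(b);
    int prev[MAX_WORD], curr[MAX_WORD];

    if (la >= MAX_WORD || lb >= MAX_WORD)
        return INT_MAX;
    for (size_t j = 0; j <= lb; j++)
        prev[j] = (int)j;
    for (size_t i = 1; i <= la; i++) {
        curr[0] = (int)i;
        for (size_t j = 1; j <= lb; j++) {
            int cost = (a[i - 1] == b[j - 1]) ? 0 : 1;
            int best = prev[j - 1] + cost;          /* substitution */
            if (prev[j] + 1 < best)
                best = prev[j] + 1;                 /* deletion */
            if (curr[j - 1] + 1 < best)
                best = curr[j - 1] + 1;             /* insertion */
            curr[j] = best;
        }
        memcpy(prev, curr, (lb + 1) * sizeof prev[0]);
    }
    return prev[lb];
}

/* Lower-cases and splits `sentence` into words, then returns the index of
   the keyword closest to any word (or -1 if there are no words).
   The winning distance is stored in *best_dist. */
int best_keyword(const char *sentence, const char *const keywords[],
                 size_t nkeys, int *best_dist)
{
    char word[MAX_WORD];
    int best_idx = -1;
    const char *p = sentence;

    *best_dist = INT_MAX;
    while (*p) {
        size_t len = 0;
        while (*p && !isalnum((unsigned char)*p))
            p++;                                    /* skip separators */
        while (*p && isalnum((unsigned char)*p)) {
            if (len < MAX_WORD - 1)
                word[len++] = (char)tolower((unsigned char)*p);
            p++;
        }
        if (len == 0)
            break;                                  /* only separators left */
        word[len] = '\0';
        for (size_t k = 0; k < nkeys; k++) {
            int d = levenshtein(word, keywords[k]);
            if (d < *best_dist) {
                *best_dist = d;
                best_idx = (int)k;
            }
        }
    }
    return best_idx;
}
```

With keywords "hello", "weather" and "name", the input "Well, HELO there!" picks "hello" at distance 1 (from the word "helo"). The caller still has to decide whether the returned distance is small enough to trust.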
The thing is, I need a threshold. Without one, if the user enters a sentence where the best match has a distance of 15, the bot will still pick that response, and it will seem like the chatbot is on shrooms. So I've now added a check: if the best match has a distance of more than 3, the bot prints "I don't know what you mean." instead of the response it would otherwise have chosen.
Now the problem is this: I have a keyword "hey", which makes the bot respond with "Hello there!". If the user enters "qqq" or something like that, it matches "hey" and the bot greets the user. That happens because the distance between "qqq" and "hey" is exactly 3 (three substitutions), which passes the check, but that obviously isn't the behaviour I'm looking for. I've tried lowering the threshold to 2, but that is too strict; then I might as well just do a straight string comparison.
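One possible way around the fixed threshold, sketched here as an assumption rather than a known-good answer, is to scale the allowed distance with the keyword's length, so that a 3-letter keyword cannot be matched by replacing every letter. The ratio of one edit per three characters and the function names are arbitrary choices for illustration.

```c
#include <stddef.h>
#include <string.h>

/* Assumption: allow roughly one edit per three characters of the keyword.
   "hey" (3 letters) tolerates 1 edit, "weather" (7 letters) tolerates 2. */
int max_edits_for(const char *keyword)
{
    return (int)(strlen(keyword) / 3);
}

/* Returns 1 if `distance` is small enough for this keyword, 0 otherwise. */
int is_acceptable_match(const char *keyword, int distance)
{
    return distance >= 0 && distance <= max_edits_for(keyword);
}
```

Under this rule "qqq" (distance 3 from "hey") is rejected while "hej" (distance 1) still passes. The downside is that keywords of one or two letters only match exactly.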
So I stumbled across the Soundex algorithm, which is a way to encode names so that similar-sounding names encode to the same thing. Am I missing something, or could this be used for all kinds of text strings, not just names?
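For reference, a small sketch of the usual American Soundex rules in C: keep the first letter, turn the following consonants into digits, collapse neighbouring letters with the same digit, and pad with zeros to four characters. The function names are made up, and the handling of empty input (it yields "0000") is a choice made here, not part of the algorithm. Nothing in the code cares whether the input is a name, although the digit groups were designed around English surnames.

```c
#include <ctype.h>
#include <string.h>

/* Soundex digit for a lower-case letter; '0' means "not coded"
   (vowels, h, w, y and anything that is not a letter). */
static char soundex_digit(char c)
{
    switch (c) {
    case 'b': case 'f': case 'p': case 'v':
        return '1';
    case 'c': case 'g': case 'j': case 'k':
    case 'q': case 's': case 'x': case 'z':
        return '2';
    case 'd': case 't':
        return '3';
    case 'l':
        return '4';
    case 'm': case 'n':
        return '5';
    case 'r':
        return '6';
    default:
        return '0';
    }
}

/* Writes the 4-character Soundex code of `word` into `out` (plus '\0').
   Non-letters are skipped; input with no letters gives "0000". */
void soundex(const char *word, char out[5])
{
    int n = 0;
    char last = '0';

    memset(out, '0', 4);    /* zero padding for short codes */
    out[4] = '\0';
    for (; *word && n < 4; word++) {
        char c, d;
        if (!isalpha((unsigned char)*word))
            continue;
        c = (char)tolower((unsigned char)*word);
        d = soundex_digit(c);
        if (n == 0) {
            out[n++] = (char)toupper((unsigned char)c);
            last = d;       /* first letter's digit also suppresses repeats */
        } else if (c == 'h' || c == 'w') {
            /* h and w do not separate equal digits: keep `last` as it is */
        } else {
            if (d != '0' && d != last)
                out[n++] = d;
            last = d;       /* a vowel resets `last`, so a repeat is coded again */
        }
    }
}
```

With this, "hey" encodes to H000 and "qqq" to Q000, so the two no longer look alike, while "Robert" and "Rupert" both give R163. Note that the first letter is kept verbatim, so a typo in the first letter always changes the code.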
My idea was to keep using the edit-distance check, but also to Soundex-encode both the keyword and the input word. If they encode to different strings (i.e., they are two words that don't sound alike), I could factor that in somehow when the chatbot chooses which keyword to match.
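One way to "factor it in", sketched under the assumption that an edit distance and the two Soundex codes have already been computed elsewhere, is to add a fixed penalty to the distance whenever the codes differ and then apply the usual threshold to the combined score. `match_score` and the penalty of 2 are hypothetical choices for illustration, not tuned values.

```c
#include <string.h>

/* Assumption: a Soundex mismatch costs as much as two extra edits. */
#define SOUNDEX_PENALTY 2

/* Combines an edit distance with two Soundex codes (such as "H000")
   into one score; lower is better. */
int match_score(int distance, const char *code_a, const char *code_b)
{
    if (strcmp(code_a, code_b) != 0)
        return distance + SOUNDEX_PENALTY;
    return distance;
}
```

For "qqq" against "hey" (distance 3, codes Q000 and H000) the score becomes 5, which a threshold of 3 rejects. Because Soundex keeps the first letter, an input like "jey" is penalised as well (score 1 + 2 = 3), so the penalty and threshold need to be balanced against each other.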
I seem to recall a thread somewhere on cboard a while ago about implementing a spell checker in C, where CommonTater mentioned an algorithm that matches similar-sounding words, but I can't find it now. Are there any others besides Soundex?
Is this the way you would go about making a chatbot? Any suggestions on how I can make it more tolerant when matching user input to keywords, without overdoing it so that some nonsense string matches a keyword?