Realizing the conversational version of “Maybe”? Programmatically detect mistakes
Hello.
I'm Mandai, in charge of Wild on the development team.
Misspoken words and misunderstandings are common occurrences in conversations.
It depends on the atmosphere and situation, but I don't think it's a bad thing since there are usually some elements that make the situation more relaxing (although I may be clearly remembering it wrong).
This time, I will introduce some algorithms that come up in Google search that may (or may) be possible to implement "maybe".
table of contents
Levenshtein distance
First, Levenshtein distance, which quantifies the distance between characters.
Levenshtein distance - according to
It is a type of distance that shows how much two strings differ. Also called edit distance.
Let's take a look at the distance between kitten and sitting, which is also introduced on Wikipedia.
- kitten
- sitten # k → s substitution
- sittin # e → i replacement
- sitting # add g
It is possible to convert from kitten to sitting through these three steps, and the Levenshtein distance in this case is 3.
This number of operations is defined as the distance between characters, and the degree of approximation is measured.
Addition, deletion, and replacement can be considered as operations on characters, but since replacement can be considered to be an operation that adds and inserts at the same time, there is also a view that the cost of replacement is originally 2.
The Levenshtein distance in this case is 5.
There is some question as to what it is useful for, but as I read further, it seems that it is being applied to DNA research.
It is true that DNA is represented by a combination of the four letters AGCT, so roughly speaking, Levenshtein distance can be replaced with genetic closeness.
This Levenshtein distance can be easily calculated using PHP, and there is a dedicated standard function.
echo levenshtein('kitten', 'sitting'); // Result is 3 echo levenshtein('kitten', 'sitting', 1, 2, 1); // If the replacement cost is 2 // Result is 5 echo levenshtein('Yamada', 'Yamaguchi'); // Result is 3
Gah. It seems that multi-byte characters are not supported.
Since I executed it in a UTF8 environment, Kanji characters are counted as 3 bytes. Therefore, if the second argument and subsequent arguments are omitted, the replacement is also counted as cost 1, so it is considered that 3 bytes have been changed.
Let's try inserting another string and see what's going on inside.
echo bin2hex('mountain'); // e5b1b1 echo bin2hex('mouth'); // e58fa3 echo bin2hex('field'); // e794b0 echo bin2hex('bureau'); // e79480 'mouth' and lower 2 Only the digits are different echo levenshtein('Yamada', 'Yamaguchi'); // The result is 3 echo levenshtein('Yamada', 'Yamatsuke'); // The result is 3
It seems that multi-byte characters are recognized as multi-byte characters, rather than being looked at in bytes, but it seems that the final number of characters converted is roughly checked around strlen.
It seems to be relatively easy to implement, and I found a page with Python and Perl implementations, so I'll post a link.
Edit Distance (Levenshtein Distance) - naoya's Hatena Diary
soundex
Soundex is an algorithm that takes a different approach from Levenshtein distance and is more likely to detect mistakes.
It seems to be specifically designed to handle personal names.
Wikipedia doesn't have any Japanese articles, but there is an English page. → Soundex - Wikipedia
Soundex analyzes the pronunciation of the characters you input into a string called a 4-character soundex key.
I feel like it's a bit unreasonable to use 4 characters no matter how long the string is...
PHP also has a standard function for this, so I would like to easily execute it.
I have prepared some English words that are the same when written in katakana.
echo soundex('rock'); // Result is R200 echo soundex('lock'); // Result is L200 echo soundex('free'); // Result is F600 echo soundex('flea'); // Result is F400 echo soundex('flee'); // The result is F400 echo soundex('aerosmith'); // The result is A625
I don't understand the return value at all, but I do understand the difference between rock and lock.
The only difference between free and flea is the second letter, and flea and flee are exactly the same.
There are surprisingly many detailed rules for conversion, so measure the similarity of strings (2) Focus on pronunciation | If you refer to what is written in Colorless Green Ideas, if you combine soundex and Levenshtein distance, From the perspective of pronunciation, it seems possible to implement a "maybe" function.
Soundex is also supported by MySQL and provides the soundex function.
MySQL :: MySQL 5.6 Reference Manual :: 12.5 String Function SOUNDEX(str)
MySQL's soundex is slightly different from PHP's output, and you must be careful that it is not fixed at 4 characters.
mysql> SELECT SOUNDEX('Quadratically'); -> 'Q36324'
The soundex function does not support Japanese.
Due to the language specifications, it can't be measured using the same yardstick, so it may be natural.
metaphone
Metaphone is an algorithm devised by Lawrence Philips, and like soundex, it is an algorithm that outputs a string called a metaphone key based on pronunciation.
Soundex identifies pronunciation by looking at the letters in order, but metaphone has defined combinations of letters and pronunciations, and seems to be able to obtain more accurate data than the soundex key.
The metaphone key consists only of the alphabet without vowels.
The metaphone key consists of 16 characters: "0BFHJKLMNPRSTWXY".
0 seems to represent the sound of th, and it seems to be distinguished from the single T and H.
As an exception, there is a rule that a vowel is added only when the word starts with a vowel, such as authentication.
In addition, there are many more combinations of conversion rules than Soundex, which is why you can get more accurate data than Soundex.
PHP also has a standard function called metaphone, so let's take a look at a sample.
echo metaphone('authentication'); // The result is A0NTKXN echo metaphone('supercalifragilisticexpialidocious'); // The famous one from Mary Poppins // The result is SPRKLFRJLSTSKSPLTSS
They are similar in nature to phonetic symbols in the sense that they convert pronunciation into letters, but phonetic symbols are for correct pronunciation, and metaphone and soundex are extracts of the sounds included in the pronunciation, so you can use the metaphone key. You won't know how to pronounce a word just by looking at it.
By obtaining the metaphone keys of multiple words and measuring the Levenshtein distance, it seems possible to obtain a more accurate degree of approximation than soundex.
Metaphone-related information can be found here: Lawrence Philips' Metaphone Algorithm (information may be outdated as it has many broken links and is no longer maintained).
There is also an improved version of the metaphone algorithm called double metaphone, which outputs two types of keys: a primary key and a secondary key.
Although the basic rules such as the strings to be handled do not seem to have changed, the internal implementation has changed considerably, so it seems impossible to compare the results between metaphone and double metaphone.
Information related to double metaphone is collected in PHP DoubleMetaPhone
Since double metaphone is not a standard PHP function, you need to load the library provided by PECL from PECL::Package::doublemetaphone
connection
This is an explanation page for the Java API that performs spell checking called Jazzy, and at the beginning there is an explanation about soundex, metaphone, and Levenshtein distance.
I'm surprised that soundex has a patent.
Some people have researched interesting things using this technology, so I'll post a link to them.
Although it is a research paper, the content is relatively frank and easy to read.
A study towards automatic generation of empty phrases from Western music lyrics
That's it.