Realizing the conversational version of “Maybe”? Programmatically detect mistakes

Hello.
I'm Mandai, in charge of Wild on the development team.
I think that mis-speaking or mis-hearing is something that happens often in conversation.
It depends on the atmosphere and situation, but in most cases it can lighten the mood, so I don't think it's a bad thing (although there are times when it's obvious that you've misremembered something).
This time, I will introduce some algorithms that may (maybe) be possible to implement, which appear in Google searches
table of contents
Levenshtein distance
First, there is the Levenshtein distance, which quantifies the distance between characters
Levenshtein distance - Wikipedia
It is a type of distance that indicates how different two strings are. It is also called the edit distance
Let's take a look at the distance between kitten and sitting, which is also mentioned on Wikipedia
- kitten
- sitten # k → s substitution
- sitting # substitution of e → i
- Add sitting # g
It is possible to convert kitten to sitting through these three steps, and the Levenshtein distance in this case is 3.
The number of these steps is defined as the distance between characters, and we measure the degree of approximation.
Character operations can be addition, deletion, or substitution, but substitution can be considered as an operation that simultaneously performs addition and insertion, so some say that the cost of substitution is actually 2.
In this case, the Levenshtein distance is 5.
Although one may wonder what its usefulness is, as one reads further, it appears to be applied to DNA research.
Indeed, DNA is expressed as a combination of the four letters AGCT, so roughly speaking, the Levenshtein distance can be replaced with genetic closeness.
This Levenshtein distance can be easily calculated in PHP, and there is a dedicated standard function for it
echo levenshtein('kitten', 'sitting'); // The result is 3 echo levenshtein('kitten', 'sitting', 1, 2, 1); // The cost of the substitution is 2 // The result is 5 echo levenshtein('Yamada', 'Yamaguchi'); // The result is 3
Oh no! It doesn't seem to support multi-byte characters.
Since I ran it in a UTF8 environment, Kanji characters are counted as 3 bytes. Therefore, if you omit the second argument and onwards, the replacement is also counted as a cost of 1, so it is considered that a 3-byte change has occurred. Let
's try inserting a different string and take a look at what's going on inside from the outside.
echo bin2hex('Yama'); // e5b1b1 echo bin2hex('Kuchi'); // e58fa3 echo bin2hex('Ta'); // e794b0 echo bin2hex('Yamaguchi'); // e79480 Only the last two digits are different from "Kuchi" echo levenshtein('Yamada', 'Yamaguchi'); // The result is 3 echo levenshtein('Yamada', 'Yamaguchi'); // The result is 3
It does not look at multi-byte characters byte by byte, but rather recognizes them as multi-byte characters, but it appears that the final number of converted characters is roughly checked using strlen or similar
It seems to be fairly easy to implement, and I found a page with implementations in Python and Perl, so I've included a link here
Edit Distance (Levenshtein Distance) - naoya's Hatena Diary
soundex
Soundex is an algorithm that approaches speech mistakes differently from Levenshtein distance, and
is designed to be particularly suited to personal names.
There is no Japanese article on Wikipedia, but there is an English page. → Soundex - Wikipedia
Soundex analyzes the phonetics of the characters you input and converts them into a string of four characters called a soundex key.
It seems a bit rough to reduce any string, no matter how long, to four characters...
There is also a standard function for this in PHP, so let's try it out.
I have prepared some English words that are the same when written in katakana.
echo soundex('rock'); // Result is R200 echo soundex('lock'); // Result is L200 echo soundex('free'); // Result is F600 echo soundex('flea'); // Result is F400 echo soundex('flee'); // Result is F400 echo soundex('aerosmith'); // Result is A625
Although I don't understand the return value at all, I can somehow understand the difference between rock and lock.
The difference between free and flea is only the second letter, while flea and flee are exactly the same.
There are surprisingly many detailed rules for conversion, so to Measuring the similarity of strings (2) Focusing on pronunciation | Colorless Green Ideas . By combining soundex and Levenshtein distance, it seems possible to implement a "maybe" function from the perspective of pronunciation.
Soundex is also supported by MySQL, which provides a soundex function
MySQL :: MySQL 5.6 Reference Manual :: 12.5 String Functions SOUNDEX(str)
It is important to note that MySQL's soundex is slightly different from the PHP output and is not fixed at 4 characters
mysql> SELECT SOUNDEX('Quadratically'); -> 'Q36324'
The soundex function does not support Japanese.
This may be natural, since the language specifications mean that Japanese cannot be measured with the same yardstick.
metaphone
Metaphone is an algorithm invented by Lawrence Philips that, like soundex, outputs a string of characters called a metaphone key based on pronunciation.
While soundex identifies pronunciation by looking at letters in order, metaphone defines letter combinations and pronunciations, and appears to produce more accurate data than soundex keys.
Metaphone keys are made up of only the alphabet, excluding vowels.
They consist of 16 characters: "0BFHJKLMNPRSTWXY."
The 0 appears to represent the "th" sound, and is distinct from the standalone T and H. There is
an exception to this rule, where a vowel is added only when the word begins with a vowel, such as "authentication." In addition
, there are many more conversion rules based on combinations than soundex, which is why metaphone keys produce more accurate data than soundex.
PHP also has a standard function called metaphone, so let's take a look at an example
echo metaphone('authentication'); // Result is A0NTKXN echo metaphone('supercalifragilisticexpialidocious'); // The one famous from Mary Poppins // Result is SPRKLFRJLSTSKSPLTSS
In the sense that they represent pronunciation in writing, they are similar to phonetic symbols, but phonetic symbols are used to ensure correct pronunciation, while metaphone and soundex are merely extracted sounds contained in pronunciation, so looking at the metaphone key does not mean you can find out how a word is pronounced
Obtaining the metaphone keys of multiple words and measuring the Levenshtein distance is likely to provide a more accurate approximation than soundex
For more information on metaphone, Lawrence Philips' Metaphone Algorithm (the links are often broken and unmaintained, so the information may be out of date).
There is also an improved version of the metaphone algorithm called double metaphone, which has been modified to output two types of keys: a primary key and a secondary key. While
the basic rules for handling strings and other information remain the same, the implementation has changed significantly, so it is unlikely that the results of metaphone and double metaphone can be compared.
Information about double metaphone is collected at PHP DoubleMetaPhone
Since double metaphone is not a standard PHP function, you need to load the library provided by PECL from PECL::Package::doublemetaphone
connection
This is a page explaining Jazzy, a Java API for spell checking. At the beginning, there is an explanation of soundex, metaphone, and Levenshtein distance.
I was surprised to learn that soundex is patented.
I found a person who was researching something interesting using this technology, so I'll include a link here.
Although it's a research paper, the content is fairly straightforward and easy to read.
A study on automatic generation of misheard phrases from Western music lyrics
That's it.
0