A "Did you mean...?" for conversation? Detecting slips of the tongue programmatically

Hello.
I'm Mandai, the Wild team member in charge of development.
Mispronouncing or mishearing things is quite common in conversations.
Depending on the atmosphere and situation, it can often lighten the mood, so I don't think it's necessarily a bad thing (although sometimes it's clearly just a misremembering).
This time, I'll introduce some algorithms that could (maybe) be used to implement the "Did you mean...?" feature you see in Google search.
Levenshtein distance
First, there is the Levenshtein distance, which quantifies the distance between two strings.
According to the Wikipedia article on Levenshtein distance, it is a type of distance that indicates how different two strings are. It is also called the edit distance.
Let's take a look at the distance between kitten and sitting, an example that also appears on Wikipedia:
- kitten
- sitten # substitute k → s
- sittin # substitute e → i
- sitting # add g
Through these three steps, the conversion from kitten to sitting is possible, and in this case, the Levenshtein distance is 3.
We define the number of these operations as the distance between the strings and use it to measure how similar they are.
The available edit operations are insertion, deletion, and substitution, but some argue that since a substitution can be regarded as a deletion and an insertion performed together, the cost of a substitution should be 2.
In that case, the Levenshtein distance becomes 5.
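To make the counting concrete, here is a rough sketch of the usual dynamic-programming calculation with adjustable costs. The function name edit_distance and its parameter names are my own choices for illustration, and it compares bytes, so it assumes single-byte (ASCII) input.

```php
<?php
// Edit distance with configurable costs (hypothetical helper, byte-based).
function edit_distance(string $a, string $b, int $ins = 1, int $sub = 1, int $del = 1): int
{
    $m = strlen($a);
    $n = strlen($b);

    // Row for the empty prefix of $a: building $j bytes of $b takes $j insertions.
    $prev = [];
    for ($j = 0; $j <= $n; $j++) {
        $prev[$j] = $j * $ins;
    }

    for ($i = 1; $i <= $m; $i++) {
        // Turning $i bytes of $a into an empty string takes $i deletions.
        $curr = [$i * $del];
        for ($j = 1; $j <= $n; $j++) {
            $same = $a[$i - 1] === $b[$j - 1];
            $curr[$j] = min(
                $prev[$j] + $del,                  // delete a byte of $a
                $curr[$j - 1] + $ins,              // insert a byte of $b
                $prev[$j - 1] + ($same ? 0 : $sub) // keep or substitute
            );
        }
        $prev = $curr;
    }

    return $prev[$n];
}

echo edit_distance('kitten', 'sitting'), PHP_EOL;          // 3
echo edit_distance('kitten', 'sitting', 1, 2, 1), PHP_EOL; // 5
```

With a substitution cost of 2, the two substitutions cost 4 and the added g costs 1, which gives the 5 mentioned above.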
You might wonder what practical use it has, but reading further, it appears to be applied in DNA research.
Indeed, since DNA is represented by a combination of four letters, A, G, C, and T, roughly speaking, the Levenshtein distance can be equated to genetic proximity.
This Levenshtein distance can be easily calculated in PHP, which has a dedicated standard function for it:
    echo levenshtein('kitten', 'sitting'); // The result is 3
    echo levenshtein('kitten', 'sitting', 1, 2, 1); // With a substitution cost of 2, the result is 5
    echo levenshtein('山田', '山口'); // The result is 3
Oh no. It seems it doesn't support multibyte characters.
I ran it in a UTF-8 environment, so each kanji takes 3 bytes. If you omit the third and later arguments, a substitution costs 1, so it's likely that 3 bytes were counted as changed.
Let's look at the bytes of each character and work out from the outside what is happening internally.
    echo bin2hex('山'); // e5b1b1
    echo bin2hex('口'); // e58fa3
    echo bin2hex('田'); // e794b0
    echo bin2hex('甀'); // e79480, only the last two digits differ from 田
    echo levenshtein('山田', '山口'); // The result is 3
The bytes of 田 (e794b0) and 口 (e58fa3) differ in all three positions, which matches the result of 3. In other words, levenshtein() does not recognize multibyte characters; it compares the strings byte by byte, so replacing a single kanji can cost as much as 3. By the same reasoning, swapping 田 for 甀, which shares its first two bytes, should cost only 1.
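As far as I can tell, the built-in levenshtein() compares bytes, so counting kanji as single characters means splitting the strings into characters first. Below is a minimal sketch of that idea, assuming valid UTF-8 input. mb_levenshtein is a hypothetical helper name, not a standard PHP function.

```php
<?php
// Character-based Levenshtein distance for UTF-8 strings (hypothetical helper).
// Assumes both arguments are valid UTF-8.
function mb_levenshtein(string $a, string $b): int
{
    // Split into arrays of UTF-8 characters instead of bytes.
    $x = preg_split('//u', $a, -1, PREG_SPLIT_NO_EMPTY);
    $y = preg_split('//u', $b, -1, PREG_SPLIT_NO_EMPTY);

    // Row for the empty prefix of $a: 0, 1, 2, ... insertions.
    $prev = range(0, count($y));

    foreach ($x as $i => $cx) {
        $curr = [$i + 1]; // deleting the first $i + 1 characters of $a
        foreach ($y as $j => $cy) {
            $curr[] = min(
                $prev[$j + 1] + 1,                // deletion
                $curr[$j] + 1,                    // insertion
                $prev[$j] + ($cx === $cy ? 0 : 1) // match or substitution
            );
        }
        $prev = $curr;
    }

    return $prev[count($y)];
}

echo levenshtein('山田', '山口'), PHP_EOL;    // 3 (counted in bytes)
echo mb_levenshtein('山田', '山口'), PHP_EOL; // 1 (counted in characters)
```

Splitting with preg_split and the u modifier is just one option; mb_str_split would also work on PHP 7.4 or later.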
It seems to be fairly easy to implement, and I found a page with implementations in Python and Perl, so here is the link:
Edit Distance (Levenshtein Distance) - naoya's Hatena Diary
soundex
Soundex is an algorithm that takes a different approach than the Levenshtein distance, and is more focused on detecting mispronunciations.
It appears to be specifically designed to handle personal names.
There isn't a Japanese article on Wikipedia, but there is an English page. → Soundex - Wikipedia
Soundex analyzes the pronunciation of the entered characters and converts them into a four-character string called a "Soundex key."
It seems a bit crude that it reduces even long strings to just four characters...
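To get a feel for how a key is built, here is a simplified sketch: keep the first letter, map the following consonants to digit groups, skip vowels, collapse repeated codes, and pad to four characters. simple_soundex is a hypothetical name; the sketch assumes ASCII input and the commonly described American Soundex letter groups, and it may not match PHP's soundex() in every edge case.

```php
<?php
// Simplified Soundex (hypothetical helper, ASCII letters only).
function simple_soundex(string $word): string
{
    $codes = [
        'B' => '1', 'F' => '1', 'P' => '1', 'V' => '1',
        'C' => '2', 'G' => '2', 'J' => '2', 'K' => '2',
        'Q' => '2', 'S' => '2', 'X' => '2', 'Z' => '2',
        'D' => '3', 'T' => '3',
        'L' => '4',
        'M' => '5', 'N' => '5',
        'R' => '6',
    ];

    // Keep only the letters A-Z.
    $letters = preg_replace('/[^A-Z]/', '', strtoupper($word));
    if ($letters === '') {
        return '';
    }

    $key = $letters[0];                  // the first letter is kept as-is
    $last = $codes[$letters[0]] ?? '';   // code of the previous letter

    for ($i = 1; $i < strlen($letters) && strlen($key) < 4; $i++) {
        $ch = $letters[$i];
        if ($ch === 'H' || $ch === 'W') {
            continue; // assumption: H and W are skipped and do not separate equal codes
        }
        $code = $codes[$ch] ?? '';       // vowels and Y have no code
        if ($code !== '' && $code !== $last) {
            $key .= $code;
        }
        $last = $code;
    }

    return str_pad($key, 4, '0');        // pad short keys with zeros
}

echo simple_soundex('rock'), PHP_EOL;      // R200
echo simple_soundex('aerosmith'), PHP_EOL; // A625
```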
Since PHP has a standard function for this, let's give it a quick try.
I've prepared a few English words that sound the same when written in katakana.
    echo soundex('rock'); // Result is R200
    echo soundex('lock'); // Result is L200
    echo soundex('free'); // Result is F600
    echo soundex('flea'); // Result is F400
    echo soundex('flee'); // Result is F400
    echo soundex('aerosmith'); // Result is A625
I don't really understand the return values, but I can more or less see the difference between rock and lock.
The keys for free and flea differ only in the second character, and flea and flee produce exactly the same key.
There are surprisingly many detailed conversion rules, so for the details please refer to "Measuring String Similarity (2) Focusing on Pronunciation | Colorless Green Ideas". By combining soundex with the Levenshtein distance, it seems possible to implement a "Did you mean...?" function from the perspective of pronunciation.
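As a rough illustration of that combination, the sketch below picks the dictionary word whose soundex key is closest to the key of the input, using the spelling distance as a tie-breaker. did_you_mean and the tiny word list are hypothetical and only serve as an example; a real dictionary would need something faster than a full scan.

```php
<?php
// Suggest the word whose Soundex key is closest to the input's key
// (hypothetical helper; ties are broken by the distance between spellings).
function did_you_mean(string $input, array $dictionary): ?string
{
    $best = null;
    $bestKeyDist = 0;
    $bestWordDist = 0;
    $inputKey = soundex($input);

    foreach ($dictionary as $word) {
        $keyDist = levenshtein($inputKey, soundex($word)); // distance between sound keys
        $wordDist = levenshtein($input, $word);            // distance between spellings

        if ($best === null
            || $keyDist < $bestKeyDist
            || ($keyDist === $bestKeyDist && $wordDist < $bestWordDist)
        ) {
            $best = $word;
            $bestKeyDist = $keyDist;
            $bestWordDist = $wordDist;
        }
    }

    return $best; // null when the dictionary is empty
}

$words = ['rock', 'free', 'flea'];
echo did_you_mean('flee', $words), PHP_EOL; // flea
echo did_you_mean('lock', $words), PHP_EOL; // rock
```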
Soundex is also supported by MySQL, which provides a SOUNDEX function.
MySQL :: MySQL 5.6 Reference Manual :: 12.5 String Functions SOUNDEX(str)
It is important to note that MySQL's soundex differs slightly from PHP's output: the result is not fixed at 4 characters.
    mysql> SELECT SOUNDEX('Quadratically');
            -> 'Q36324'
The soundex function does not support Japanese.
This is perhaps to be expected, given that the same standards cannot be applied to different languages.
metaphone
Metaphone is an algorithm devised by Lawrence Philips that, like soundex, outputs a string called a metaphone key based on pronunciation.
While soundex distinguishes pronunciation by examining letters one at a time, metaphone determines the pronunciation from combinations of letters, and it seems to produce more accurate results than soundex keys.
The metaphone key consists only of alphabetic characters, excluding vowels.
The metaphone key is composed of 16 characters: "0BFHJKLMNPRSTWXY".
0 seems to represent the "th" sound and is distinguished from the standalone T and H.
As an exception, a vowel is kept only when the word begins with one, as in "authentication".
In addition, there are many more combination-based conversion rules than in soundex, which is why it provides more accurate results than soundex.
PHP also has a standard function called metaphone, so let's take a look at an example:
    echo metaphone('authentication'); // Result is A0NTKXN
    echo metaphone('supercalifragilisticexpialidocious'); // The famous word from Mary Poppins
    // Result is SPRKLFRJLSTSKSPLTSS
In the sense that they represent pronunciation in writing, these keys resemble phonetic symbols. However, phonetic symbols exist to convey the correct pronunciation, while metaphone and soundex keys merely extract some of the sounds contained in the pronunciation, so you cannot tell how a word is pronounced by looking at its metaphone key.
Obtaining the metaphone keys of multiple words and measuring the Levenshtein distance between them is likely to give a more accurate approximation than soundex.
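A minimal sketch of that idea: compute the metaphone key of the input and of each candidate, then sort the candidates by the Levenshtein distance between keys. rank_by_sound and the sample words are hypothetical, and whether this really beats soundex would need testing on real data.

```php
<?php
// Sort candidate words by how close their metaphone key is to the input's key.
// rank_by_sound is a hypothetical helper name used only for this sketch.
function rank_by_sound(string $input, array $candidates): array
{
    $key = metaphone($input);

    usort($candidates, function (string $a, string $b) use ($key): int {
        return levenshtein($key, metaphone($a)) <=> levenshtein($key, metaphone($b));
    });

    return $candidates; // closest-sounding word first
}

// For "flee", the closest key belongs to "flea", then "free", then "rock".
print_r(rank_by_sound('flee', ['rock', 'free', 'flea']));
```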
Information related to metaphone can be found here: Lawrence Philips' Metaphone Algorithm (however, many of its links are broken and the site is not being maintained, so the information may be outdated).
Furthermore, the metaphone algorithm also has an improved version called double metaphone, which has been modified to output two types of keys: a primary key and a secondary key.
The basic rules regarding the strings used seem to remain the same, but the internal implementation is quite different, so it seems that the results of metaphone and double metaphone cannot be compared.
Information on double metaphone can be found at PHP DoubleMetaPhone, which also includes links to double metaphone implementations in languages other than PHP.
Since `double metaphone` is not a standard PHP function, you need to load the library provided by PECL from `PECL::Package::doublemetaphone`.
Related
This is a description page for Jazzy, a Java API for spell checking, and it begins with an explanation of soundex, metaphone, and Levenshtein distance.
I was surprised to learn that soundex is patented.
I've included a link to someone who was doing some interesting research using these technologies.
It's a research paper, but the content is fairly informal and easy to read.
A study on automatic generation of misheard phrases from Western music lyrics
That's all
