Before we continue with the analysis of the schematic, we should take the time to establish a much simplified, yet sufficient, phonetic model of the command set. We do this the armchair-linguistics way, without having worked with the actual object of study.
Vowel generation: Air passing through the throat is excited into wide-spectrum vibrations (noise). By changing the tongue’s position, the oral cavity functions as two separate bandpass filters and thereby generates two distinct main tones, the formants F1 and F2. It is actually the difference F2-F1 that determines which vowel a listener detects. The airflow is not blocked and remains free of turbulence.
Some vowel examples for an average male voice (all values in Hz):
     F1    F2     F2-F1
i    240   2400   2160
e    390   2300   1910
ɒ    700   760    60
ɤ    460   1310   850
u    250   595    345    <- for our model, ə/ʊ are roughly similar to u
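As a rough illustration of the F2-F1 idea, the vowel table above can be turned into a nearest-match lookup. This is only a sketch under the assumption that the difference alone is enough to tell the listed vowels apart; the helper names formant_difference and closest_vowel are made up for this example and say nothing about how the circuit actually decides.

```python
# Formant values (Hz) copied from the vowel table above, average male voice.
FORMANTS = {
    "i": (240, 2400),
    "e": (390, 2300),
    "ɒ": (700, 760),
    "ɤ": (460, 1310),
    "u": (250, 595),
}


def formant_difference(f1, f2):
    """Return F2 - F1 in Hz."""
    return f2 - f1


def closest_vowel(f1, f2):
    """Pick the table vowel whose F2-F1 is nearest to the measured one.

    Hypothetical helper: assumes the difference alone identifies the vowel.
    """
    measured = formant_difference(f1, f2)
    return min(
        FORMANTS,
        key=lambda v: abs(formant_difference(*FORMANTS[v]) - measured),
    )


if __name__ == "__main__":
    # Reproduce the F2-F1 column of the table.
    for vowel, (f1, f2) in FORMANTS.items():
        print(vowel, f1, f2, formant_difference(f1, f2))
    # A slightly detuned [u] should still land on "u".
    print(closest_vowel(260, 620))
```

Note that [ ɒ ] with its difference of only 60 Hz sits far away from every other entry, which is why even such a crude lookup separates it easily.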
edit 2211281407:
Liquid sounds:
I started to adapt and expand the command model with additional phonetic effects that could trigger a pulse train for the JK flip-flops.
This brought my attention to the liquid sounds [ l ] and [ ɹ ] in [ left ] and [ ɹaɪt ].
From https://en.wikipedia.org/wiki/Formant : "The liquid [l] usually has an extra formant at 1500 Hz, whereas the English "r" sound ([ ɹ ]) is distinguished by a very low third formant (well below 2000 Hz)."
I am also starting to doubt my initial interpretation of the filters' function. The filter that generates the MFHP2_PULSE could be interpreted as a bandpass, with the potentiometer controlling the lower cut-off frequency.
/edit 2211281407
Consonant generation: A characteristic of consonant sounds is that the airflow is restricted, or even completely blocked and then released. The result is a sound signal containing high-frequency noise.
Some consonant examples:
[ s ] alveolar fricative sound, produced by forcing air through a narrow channel between the tip of the tongue and the alveolar ridge
[ d ]/[ t ] alveolar plosive sound, formed by blocking the escape of air by pressing the tongue against the alveolar ridge
[ g ] dorsal plosive sound, formed by pressing the back of the tongue against the soft palate
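One way to picture the difference between a low vowel tone and high-frequency consonant noise is to count zero crossings in a short window: a signal dominated by high frequencies changes sign far more often than a low tone. The sketch below is an assumption for illustration only, not a description of the circuit; the sample rate, window length and test frequencies are arbitrary choices, and pure sine tones stand in for real speech.

```python
import math

SAMPLE_RATE = 16000  # Hz, arbitrary choice for this sketch


def sine_window(freq_hz, duration_s=0.05, phase=0.1):
    """Generate a sine tone as a stand-in for one speech segment.

    The small phase offset avoids samples that fall exactly on zero.
    """
    n = int(SAMPLE_RATE * duration_s)
    return [
        math.sin(2 * math.pi * freq_hz * i / SAMPLE_RATE + phase)
        for i in range(n)
    ]


def zero_crossings(samples):
    """Count sign changes between neighbouring samples."""
    return sum(
        1 for a, b in zip(samples, samples[1:]) if (a < 0) != (b < 0)
    )


if __name__ == "__main__":
    low = zero_crossings(sine_window(250))    # close to F1 of [ u ]
    high = zero_crossings(sine_window(4000))  # stand-in for fricative noise
    print("low tone crossings:", low)
    print("high tone crossings:", high)
```

In the same window, the high tone produces many more sign changes than the low one, which is the kind of contrast a pulse-counting stage could pick up.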
https://en.wikipedia.org/wiki/Manner_of_articulation
https://en.wikipedia.org/wiki/Formant
For the actual decoding, we can expect a combination of time, frequency and command logic state information.
Command set, frequency over time (may be adjusted after getting more data; liquids added during edit 2211281420):
go [ ɡəʊ ] [ ɡ ] high frequency - [ ə ] low frequency - [ ʊ ] low frequency
ahead [ ə ˈhed ] low - pause - weak noise - low - [ d ] noise <- long sequence
right [ ɹaɪt ] liquid - low - high - [ t ] strong noise
left [ left ] liquid - low - noise - strong noise
stop [ stɒp ] [ s ] noise - strong noise - low - strong noise
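To make the command set above easier to reason about, each command can be written as a sequence of coarse segment labels and compared against an observed sequence. The following is a minimal sketch, assuming the decoder could already label each segment; the label names and the match_command helper are invented for this example, and exact matching is far stricter than a real decoder would need to be.

```python
# Segment patterns transcribed from the command set above.
# The label names are assumptions made for this sketch.
COMMANDS = {
    "go": ["high", "low", "low"],
    "ahead": ["low", "pause", "weak noise", "low", "noise"],
    "right": ["liquid", "low", "high", "strong noise"],
    "left": ["liquid", "low", "noise", "strong noise"],
    "stop": ["noise", "strong noise", "low", "strong noise"],
}


def match_command(segments):
    """Return the command whose pattern equals the observed segments, else None."""
    observed = list(segments)
    for name, pattern in COMMANDS.items():
        if pattern == observed:
            return name
    return None


if __name__ == "__main__":
    print(match_command(["liquid", "low", "noise", "strong noise"]))
    print(match_command(["low", "low"]))
```

Written this way, it becomes visible that right and left differ only in their third segment, so the model has to separate "high" from "noise" reliably to tell the two apart.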
I attached the current draft of the schematic plus some selected composites from Photoshop.