Speech to Text - Wio Terminal
In this part of the lesson, you will write code to convert speech from the captured audio into text using the speech service.
Sending Audio to the Speech Service
Audio can be sent to the speech service using the REST API. To use the speech service, you first need to request an access token, which is then used to access the REST API. These access tokens expire after 10 minutes, so your code should regularly request new tokens to ensure they remain valid.
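One way to keep a token valid is to record when it was issued and request a new one shortly before the 10-minute lifetime runs out. The sketch below is plain C++ with no Arduino dependencies. `TokenClock`, `kTokenLifetimeMs` and the one-minute safety margin are hypothetical names and values chosen for illustration; they are not part of the lesson code. On the device, the timestamps could come from `millis()`.

```cpp
#include <cassert>
#include <cstdint>

// Token lifetime stated in the lesson: 10 minutes.
constexpr uint32_t kTokenLifetimeMs = 10UL * 60UL * 1000UL;
// Assumed safety margin: refresh one minute early (illustrative value).
constexpr uint32_t kRefreshMarginMs = 60UL * 1000UL;

// Tracks when the current token was issued (hypothetical helper).
struct TokenClock
{
    uint32_t issuedAtMs = 0;
    bool hasToken = false;

    // Record that a token was obtained at time nowMs.
    void markIssued(uint32_t nowMs)
    {
        issuedAtMs = nowMs;
        hasToken = true;
    }

    // True when no token is held, or the held token is close to expiring.
    // Unsigned subtraction stays correct when a millis()-style counter wraps.
    bool needsRefresh(uint32_t nowMs) const
    {
        if (!hasToken)
        {
            return true;
        }
        uint32_t age = nowMs - issuedAtMs;
        return age >= kTokenLifetimeMs - kRefreshMarginMs;
    }
};
```

With this approach the code would check `needsRefresh` before each request, rather than waiting for the service to reject an expired token.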
Task - Obtain an Access Token
- Open the smart-timer project if it is not already open.
- Add the following library dependencies to the platformio.ini file to enable WiFi access and handle JSON:
    seeed-studio/Seeed Arduino rpcWiFi @ 1.0.5
    seeed-studio/Seeed Arduino rpcUnified @ 2.1.3
    seeed-studio/Seeed_Arduino_mbedtls @ 3.0.1
    seeed-studio/Seeed Arduino RTC @ 2.0.0
    bblanchon/ArduinoJson @ 6.17.3
- Add the following code to the config.h header file:
    const char *SSID = "<SSID>";
    const char *PASSWORD = "<PASSWORD>";
    const char *SPEECH_API_KEY = "<API_KEY>";
    const char *SPEECH_LOCATION = "<LOCATION>";
    const char *LANGUAGE = "<LANGUAGE>";
    const char *TOKEN_URL = "https://%s.api.cognitive.microsoft.com/sts/v1.0/issuetoken";
  Replace <SSID> and <PASSWORD> with your WiFi credentials.
  Replace <API_KEY> with the API key for your speech service resource. Replace <LOCATION> with the location you used when creating the speech service resource.
  Replace <LANGUAGE> with the locale name of the language you will be speaking, for example, en-GB for English or zh-HK for Cantonese. You can find a list of supported languages and their locale names in the Language and Voice Support documentation on Microsoft Docs.
  The TOKEN_URL constant is the URL of the token issuer without the location. This will later be combined with the location to form the full URL.
-
Similar to connecting to Custom Vision, you will need to use an HTTPS connection to connect to the token issuing service. Add the following code to the end of
config.h
:const char *TOKEN_CERTIFICATE = "-----BEGIN CERTIFICATE-----\r\n" "MIIF8zCCBNugAwIBAgIQAueRcfuAIek/4tmDg0xQwDANBgkqhkiG9w0BAQwFADBh\r\n" "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n" "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n" "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n" "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n" "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwNjCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n" "ggIPADCCAgoCggIBALVGARl56bx3KBUSGuPc4H5uoNFkFH4e7pvTCxRi4j/+z+Xb\r\n" "wjEz+5CipDOqjx9/jWjskL5dk7PaQkzItidsAAnDCW1leZBOIi68Lff1bjTeZgMY\r\n" "iwdRd3Y39b/lcGpiuP2d23W95YHkMMT8IlWosYIX0f4kYb62rphyfnAjYb/4Od99\r\n" "ThnhlAxGtfvSbXcBVIKCYfZgqRvV+5lReUnd1aNjRYVzPOoifgSx2fRyy1+pO1Uz\r\n" "aMMNnIOE71bVYW0A1hr19w7kOb0KkJXoALTDDj1ukUEDqQuBfBxReL5mXiu1O7WG\r\n" "0vltg0VZ/SZzctBsdBlx1BkmWYBW261KZgBivrql5ELTKKd8qgtHcLQA5fl6JB0Q\r\n" "gs5XDaWehN86Gps5JW8ArjGtjcWAIP+X8CQaWfaCnuRm6Bk/03PQWhgdi84qwA0s\r\n" "sRfFJwHUPTNSnE8EiGVk2frt0u8PG1pwSQsFuNJfcYIHEv1vOzP7uEOuDydsmCjh\r\n" "lxuoK2n5/2aVR3BMTu+p4+gl8alXoBycyLmj3J/PUgqD8SL5fTCUegGsdia/Sa60\r\n" "N2oV7vQ17wjMN+LXa2rjj/b4ZlZgXVojDmAjDwIRdDUujQu0RVsJqFLMzSIHpp2C\r\n" "Zp7mIoLrySay2YYBu7SiNwL95X6He2kS8eefBBHjzwW/9FxGqry57i71c2cDAgMB\r\n" "AAGjggGtMIIBqTAdBgNVHQ4EFgQU1cFnOsKjnfR3UltZEjgp5lVou6UwHwYDVR0j\r\n" "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n" "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n" "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n" "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n" "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n" "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n" "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n" "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n" "AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQB2oWc93fB8esci/8esixj++N22meiGDjgF\r\n" 
"+rA2LUK5IOQOgcUSTGKSqF9lYfAxPjrqPjDCUPHCURv+26ad5P/BYtXtbmtxJWu+\r\n" "cS5BhMDPPeG3oPZwXRHBJFAkY4O4AF7RIAAUW6EzDflUoDHKv83zOiPfYGcpHc9s\r\n" "kxAInCedk7QSgXvMARjjOqdakor21DTmNIUotxo8kHv5hwRlGhBJwps6fEVi1Bt0\r\n" "trpM/3wYxlr473WSPUFZPgP1j519kLpWOJ8z09wxay+Br29irPcBYv0GMXlHqThy\r\n" "8y4m/HyTQeI2IMvMrQnwqPpY+rLIXyviI2vLoI+4xKE4Rn38ZZ8m\r\n" "-----END CERTIFICATE-----\r\n";
This is the same certificate you used when connecting to Custom Vision.
- Add an include directive for the WiFi header file and the config header file at the top of the main.cpp file:
    #include <rpcWiFi.h>
    #include "config.h"
- Add code to connect to WiFi in main.cpp above the setup function:
    void connectWiFi()
    {
        while (WiFi.status() != WL_CONNECTED)
        {
            Serial.println("Connecting to WiFi..");
            WiFi.begin(SSID, PASSWORD);
            delay(500);
        }
        Serial.println("Connected!");
    }
- Call this function from the setup function after the serial connection has been established:
    connectWiFi();
- Create a new header file in the src folder called speech_to_text.h. Add the following code to this header file:
    #pragma once
    #include <Arduino.h>
    #include <ArduinoJson.h>
    #include <HTTPClient.h>
    #include <WiFiClientSecure.h>
    #include "config.h"
    #include "mic.h"
    class SpeechToText
    {
    public:
    private:
    };
    SpeechToText speechToText;
  This includes necessary header files for an HTTP connection, configuration, and the mic.h header file. It also defines a class called SpeechToText and declares an instance of this class for later use.
- Add the following two fields to the private section of this class:
    WiFiClientSecure _token_client;
    String _access_token;
  The _token_client is a WiFi Client that uses HTTPS and will be used to get the access token. The token will then be stored in _access_token.
- Add the following method to the private section:
    String getAccessToken()
    {
        char url[128];
        sprintf(url, TOKEN_URL, SPEECH_LOCATION);
        HTTPClient httpClient;
        httpClient.begin(_token_client, url);
        httpClient.addHeader("Ocp-Apim-Subscription-Key", SPEECH_API_KEY);
        int httpResultCode = httpClient.POST("{}");
        if (httpResultCode != 200)
        {
            Serial.println("Error getting access token, trying again...");
            delay(10000);
            return getAccessToken();
        }
        Serial.println("Got access token.");
        String result = httpClient.getString();
        httpClient.end();
        return result;
    }
  This code constructs the URL for the token issuer API using the location of the speech resource. It then creates an HTTPClient to make the web request, setting it up to use the WiFi client configured with the token endpoint's certificate. The API key is set as a header for the call. A POST request is made to retrieve the token, retrying in case of errors. Finally, the access token is returned.
- Add a method to the public section to retrieve the access token. This will be needed in later lessons to convert text to speech:
    String AccessToken()
    {
        return _access_token;
    }
- Add an init method to the public section to set up the token client:
    void init()
    {
        _token_client.setCACert(TOKEN_CERTIFICATE);
        _access_token = getAccessToken();
    }
  This sets the certificate on the WiFi client and retrieves the access token.
- In main.cpp, include this new header file in the include directives:
    #include "speech_to_text.h"
- Initialize the SpeechToText class at the end of the setup function, after the mic.init call but before writing Ready to the serial monitor:
    speechToText.init();
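The getAccessToken method in this task retries by calling itself, so every failed attempt adds a stack frame and there is no upper limit on attempts. As a sketch of one alternative, the plain C++ below retries in a loop with a cap. `RetryResult`, `retryUntilOk` and the callback parameters are hypothetical names for illustration. On the device, `request` would wrap the HTTP POST and `waitBetween` could wrap `delay`.

```cpp
#include <cassert>
#include <functional>

// Outcome of a bounded retry (hypothetical type for this sketch).
struct RetryResult
{
    bool ok;       // true if any attempt returned HTTP 200
    int attempts;  // number of attempts actually made
};

// Calls `request` until it returns 200 or maxAttempts is reached.
// `waitBetween` is invoked only between failed attempts.
RetryResult retryUntilOk(const std::function<int()> &request,
                         int maxAttempts,
                         const std::function<void()> &waitBetween)
{
    assert(maxAttempts > 0);
    for (int attempt = 1; attempt <= maxAttempts; ++attempt)
    {
        if (request() == 200)
        {
            return {true, attempt};
        }
        if (attempt < maxAttempts)
        {
            waitBetween();
        }
    }
    return {false, maxAttempts};
}
```

A caller can then decide what to do when `ok` is false, for example by reporting the failure on the serial monitor instead of blocking indefinitely.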
Task - Read Audio from Flash Memory
- In an earlier part of this lesson, audio was recorded to flash memory. To send this audio to the Speech Services REST API, it must be read back from flash memory. It is too large to load into an in-memory buffer. The HTTPClient class, which makes REST calls, can stream data using an Arduino Stream: a class that supplies data in small chunks, which are sent one at a time as part of the request. Each call to read on a stream returns the next byte of data. An Arduino stream can be created to read from flash memory. Create a new file called flash_stream.h in the src folder, and add the following code:
    #pragma once
    #include <Arduino.h>
    #include <HTTPClient.h>
    #include <sfud.h>
    #include "config.h"
    class FlashStream : public Stream
    {
    public:
        virtual size_t write(uint8_t val)
        {
        }
        virtual int available()
        {
        }
        virtual int read()
        {
        }
        virtual int peek()
        {
        }
    private:
    };
  This declares the FlashStream class, which derives from the Arduino Stream class. Stream is an abstract class, meaning derived classes must implement certain methods before the class can be instantiated. These methods are declared here with empty bodies and are filled in over the following steps.
  ✅ Learn more about Arduino Streams in the Arduino Stream documentation
- Add the following fields to the private section:
    size_t _pos;
    size_t _flash_address;
    const sfud_flash *_flash;
    byte _buffer[HTTP_TCP_BUFFER_SIZE];
  This defines a temporary buffer to store data read from flash memory, along with fields to store the current position when reading from the buffer, the current address to read from flash memory, and the flash memory device.
- Add the following method to the private section:
    void populateBuffer()
    {
        sfud_read(_flash, _flash_address, HTTP_TCP_BUFFER_SIZE, _buffer);
        _flash_address += HTTP_TCP_BUFFER_SIZE;
        _pos = 0;
    }
  This code reads from flash memory at the current address and stores the data in a buffer. It then increments the address so the next call reads the next block of memory. The buffer is sized based on the largest chunk the HTTPClient will send to the REST API at one time.
  💁 Flash memory must be erased using the grain size, but reading can be done without this restriction.
- Add a constructor to the public section of this class:
    FlashStream()
    {
        _pos = 0;
        _flash_address = 0;
        _flash = sfud_get_device_table() + 0;
        populateBuffer();
    }
  This constructor initializes all fields to start reading from the beginning of the flash memory block and loads the first chunk of data into the buffer.
- Implement the write method. Since this stream will only read data, this method does nothing and returns 0:
    virtual size_t write(uint8_t val)
    {
        return 0;
    }
- Implement the peek method. This returns the data at the current position without advancing the stream. Calling peek multiple times will always return the same data as long as no data is read from the stream:
    virtual int peek()
    {
        return _buffer[_pos];
    }
- Implement the available function. This returns how many bytes can be read from the stream, or -1 if the stream is complete. For this class, the maximum available will not exceed the HTTPClient's chunk size. When this stream is used in the HTTP client, it calls this function to determine how much data is available, then requests that amount to send to the REST API. If more than the chunk size is available, the chunk size is returned. If less, the available amount is returned. Once all data has been streamed, -1 is returned:
    virtual int available()
    {
        int remaining = BUFFER_SIZE - ((_flash_address - HTTP_TCP_BUFFER_SIZE) + _pos);
        int bytes_available = min(HTTP_TCP_BUFFER_SIZE, remaining);
        if (bytes_available == 0)
        {
            bytes_available = -1;
        }
        return bytes_available;
    }
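To make the arithmetic in available easier to follow, here is the same calculation as a free function in plain C++, with no Arduino dependencies. `bytesAvailable` and its parameter names are hypothetical. One deliberate difference from the lesson code: this sketch uses unsigned sizes and returns -1 as soon as the consumed count reaches the total, so the subtraction cannot go below zero.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>

// Mirrors the arithmetic in FlashStream::available() as a free function.
// totalSize  : total bytes of audio stored in flash (BUFFER_SIZE in the lesson)
// nextAddress: address the next populateBuffer() call would read from
// pos        : read position inside the current chunk
// chunkSize  : size of one chunk (HTTP_TCP_BUFFER_SIZE in the lesson)
int bytesAvailable(std::size_t totalSize, std::size_t nextAddress,
                   std::size_t pos, std::size_t chunkSize)
{
    assert(nextAddress >= chunkSize);
    // The current chunk starts one chunk before nextAddress.
    std::size_t consumed = (nextAddress - chunkSize) + pos;
    if (consumed >= totalSize)
    {
        return -1; // everything has been streamed
    }
    std::size_t remaining = totalSize - consumed;
    return static_cast<int>(std::min(chunkSize, remaining));
}
```

With an illustrative total of 250 bytes and a chunk size of 100, the function reports 100 at the start, 50 once 200 bytes have been consumed, and -1 once all 250 have been consumed.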
- Implement the read method to return the next byte from the buffer, incrementing the position. If the position reaches the end of the buffer, the buffer is populated with the next block from flash memory, and the position is reset:
    virtual int read()
    {
        int retVal = _buffer[_pos++];
        if (_pos == HTTP_TCP_BUFFER_SIZE)
        {
            populateBuffer();
        }
        return retVal;
    }
- In the speech_to_text.h header file, add an include directive for this new header file:
    #include "flash_stream.h"
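The refill behaviour of FlashStream can be tried on a desktop machine with a small stand-in. In the sketch below a `std::vector` plays the role of flash memory, and `ChunkedReader` (a hypothetical name) copies it into a fixed-size chunk buffer the way populateBuffer does. The `refills` counter and the zero padding past the end of the data are additions for illustration only; the lesson code reads whatever is stored in flash.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <vector>

// Host-side stand-in for FlashStream: a std::vector plays the role of flash memory.
// The storage vector must outlive the reader, because only a reference is kept.
class ChunkedReader
{
public:
    ChunkedReader(const std::vector<uint8_t> &storage, std::size_t chunkSize)
        : _storage(storage), _buffer(chunkSize, 0), _pos(0), _address(0)
    {
        assert(chunkSize > 0);
        populateBuffer();
    }

    // Returns the current byte without advancing.
    int peek() const { return _buffer[_pos]; }

    // Returns the current byte and advances; refills when the chunk is used up.
    int read()
    {
        int value = _buffer[_pos++];
        if (_pos == _buffer.size())
        {
            populateBuffer();
        }
        return value;
    }

    // Number of refills so far, exposed only to make the behaviour observable.
    int refills() const { return _refills; }

private:
    // Copies the next chunk from storage; bytes past the end are set to zero.
    void populateBuffer()
    {
        for (std::size_t i = 0; i < _buffer.size(); ++i)
        {
            std::size_t src = _address + i;
            _buffer[i] = src < _storage.size() ? _storage[src] : static_cast<uint8_t>(0);
        }
        _address += _buffer.size();
        _pos = 0;
        ++_refills;
    }

    const std::vector<uint8_t> &_storage;
    std::vector<uint8_t> _buffer;
    std::size_t _pos;
    std::size_t _address;
    int _refills = 0;
};
```

Reading seven bytes with a chunk size of three triggers a refill after the third and sixth reads, which is the point where the real class would call sfud_read again.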
Task - Convert Speech to Text
-
Speech can be converted to text by sending the audio to the Speech Service via a REST API. This REST API uses a different certificate than the token issuer, so add the following code to the
config.h
header file to define this certificate:const char *SPEECH_CERTIFICATE = "-----BEGIN CERTIFICATE-----\r\n" "MIIF8zCCBNugAwIBAgIQCq+mxcpjxFFB6jvh98dTFzANBgkqhkiG9w0BAQwFADBh\r\n" "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n" "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n" "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n" "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n" "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwMTCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n" "ggIPADCCAgoCggIBAMedcDrkXufP7pxVm1FHLDNA9IjwHaMoaY8arqqZ4Gff4xyr\r\n" "RygnavXL7g12MPAx8Q6Dd9hfBzrfWxkF0Br2wIvlvkzW01naNVSkHp+OS3hL3W6n\r\n" "l/jYvZnVeJXjtsKYcXIf/6WtspcF5awlQ9LZJcjwaH7KoZuK+THpXCMtzD8XNVdm\r\n" "GW/JI0C/7U/E7evXn9XDio8SYkGSM63aLO5BtLCv092+1d4GGBSQYolRq+7Pd1kR\r\n" "EkWBPm0ywZ2Vb8GIS5DLrjelEkBnKCyy3B0yQud9dpVsiUeE7F5sY8Me96WVxQcb\r\n" "OyYdEY/j/9UpDlOG+vA+YgOvBhkKEjiqygVpP8EZoMMijephzg43b5Qi9r5UrvYo\r\n" "o19oR/8pf4HJNDPF0/FJwFVMW8PmCBLGstin3NE1+NeWTkGt0TzpHjgKyfaDP2tO\r\n" "4bCk1G7pP2kDFT7SYfc8xbgCkFQ2UCEXsaH/f5YmpLn4YPiNFCeeIida7xnfTvc4\r\n" "7IxyVccHHq1FzGygOqemrxEETKh8hvDR6eBdrBwmCHVgZrnAqnn93JtGyPLi6+cj\r\n" "WGVGtMZHwzVvX1HvSFG771sskcEjJxiQNQDQRWHEh3NxvNb7kFlAXnVdRkkvhjpR\r\n" "GchFhTAzqmwltdWhWDEyCMKC2x/mSZvZtlZGY+g37Y72qHzidwtyW7rBetZJAgMB\r\n" "AAGjggGtMIIBqTAdBgNVHQ4EFgQUDyBd16FXlduSzyvQx8J3BM5ygHYwHwYDVR0j\r\n" "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n" "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n" "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n" "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n" "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n" "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n" "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n" "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n" 
"AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQAlFvNh7QgXVLAZSsNR2XRmIn9iS8OHFCBA\r\n" "WxKJoi8YYQafpMTkMqeuzoL3HWb1pYEipsDkhiMnrpfeYZEA7Lz7yqEEtfgHcEBs\r\n" "K9KcStQGGZRfmWU07hPXHnFz+5gTXqzCE2PBMlRgVUYJiA25mJPXfB00gDvGhtYa\r\n" "+mENwM9Bq1B9YYLyLjRtUz8cyGsdyTIG/bBM/Q9jcV8JGqMU/UjAdh1pFyTnnHEl\r\n" "Y59Npi7F87ZqYYJEHJM2LGD+le8VsHjgeWX2CJQko7klXvcizuZvUEDTjHaQcs2J\r\n" "+kPgfyMIOY1DMJ21NxOJ2xPRC/wAh/hzSBRVtoAnyuxtkZ4VjIOh\r\n" "-----END CERTIFICATE-----\r\n";
- Add a constant to this file for the speech URL without the location. This will later be combined with the location and language to form the full URL:
    const char *SPEECH_URL = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s";
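The template is filled in with the location and language using printf-style formatting. Because the formatted URL is already around 100 characters long for a short location name, a bounded write with a truncation check may be worth considering. The sketch below is plain C++; `buildSpeechUrl` and `kSpeechUrlTemplate` are hypothetical names, and `westus2` in the usage note is only an example location.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Same template as SPEECH_URL in config.h.
const char *const kSpeechUrlTemplate =
    "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s";

// Formats the speech URL into `out`. Returns false if the result did not fit,
// in which case `out` holds a truncated, NUL-terminated string that should not be used.
bool buildSpeechUrl(char *out, std::size_t outSize,
                    const char *location, const char *language)
{
    assert(out != nullptr && outSize > 0);
    int written = std::snprintf(out, outSize, kSpeechUrlTemplate, location, language);
    return written >= 0 && static_cast<std::size_t>(written) < outSize;
}
```

For example, `buildSpeechUrl(url, sizeof(url), "westus2", "en-GB")` with a 128-byte `url` buffer succeeds, while the same call with a 32-byte buffer returns false instead of writing past the end.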
- In the speech_to_text.h header file, add a field to the private section of the SpeechToText class for a WiFi Client using the speech certificate:
    WiFiClientSecure _speech_client;
- In the init method, set the certificate on this WiFi Client:
    _speech_client.setCACert(SPEECH_CERTIFICATE);
- Add the following code to the public section of the SpeechToText class to define a method for converting speech to text:
    String convertSpeechToText()
    {
    }
- Add the following code to this method to create an HTTP client using the WiFi client configured with the speech certificate, and using the speech URL set with the location and language:
    char url[128];
    sprintf(url, SPEECH_URL, SPEECH_LOCATION, LANGUAGE);
    HTTPClient httpClient;
    httpClient.begin(_speech_client, url);
- Set the necessary headers for the connection:
    httpClient.addHeader("Authorization", String("Bearer ") + _access_token);
    httpClient.addHeader("Content-Type", String("audio/wav; codecs=audio/pcm; samplerate=") + String(RATE));
    httpClient.addHeader("Accept", "application/json;text/xml");
  This sets headers for authorization using the access token, the audio format using the sample rate, and specifies that the client expects the result in JSON format.
- Add the following code to make the REST API call:
    Serial.println("Sending speech...");
    FlashStream stream;
    int httpResponseCode = httpClient.sendRequest("POST", &stream, BUFFER_SIZE);
    Serial.println("Speech sent!");
  This creates a FlashStream and uses it to stream data to the REST API.
- Add the following code to check the response code:
    String text = "";
    if (httpResponseCode == 200)
    {
        String result = httpClient.getString();
        Serial.println(result);
        DynamicJsonDocument doc(1024);
        deserializeJson(doc, result.c_str());
        JsonObject obj = doc.as<JsonObject>();
        text = obj["DisplayText"].as<String>();
    }
    else if (httpResponseCode == 401)
    {
        Serial.println("Access token expired, trying again with a new token");
        _access_token = getAccessToken();
        return convertSpeechToText();
    }
    else
    {
        Serial.print("Failed to convert speech to text - error ");
        Serial.println(httpResponseCode);
    }
  If the response code is 200 (success), the result is retrieved, decoded from JSON, and the DisplayText property is assigned to the text variable. This property contains the text version of the speech.
  If the response code is 401, the access token has expired (tokens last only 10 minutes). A new access token is requested, and the call is retried.
  Otherwise, an error is logged to the serial monitor, and the text variable is left blank.
- Add the following code to the end of this method to close the HTTP client and return the text:
    httpClient.end();
    return text;
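The three branches in convertSpeechToText can be summarised as a small decision function. The sketch below is plain C++ and adds one thing the lesson code does not have: a cap on how many times a 401 triggers a token refresh, in case a 401 were to persist for some reason other than expiry. `SpeechAction`, `nextAction` and the counter parameters are hypothetical names for illustration.

```cpp
#include <cassert>

// What the caller should do next (hypothetical names for this sketch).
enum class SpeechAction
{
    UseResult,    // 200: parse the JSON body
    RefreshToken, // 401: request a new access token and retry
    GiveUp        // anything else, or too many token refreshes
};

// Maps an HTTP status code to an action. refreshesSoFar counts earlier 401 retries
// for the same request; maxRefreshes caps them so the retry cannot repeat forever.
SpeechAction nextAction(int httpStatus, int refreshesSoFar, int maxRefreshes)
{
    assert(refreshesSoFar >= 0 && maxRefreshes >= 0);
    if (httpStatus == 200)
    {
        return SpeechAction::UseResult;
    }
    if (httpStatus == 401 && refreshesSoFar < maxRefreshes)
    {
        return SpeechAction::RefreshToken;
    }
    return SpeechAction::GiveUp;
}
```

With `maxRefreshes` set to 1, a first 401 leads to a token refresh and a second consecutive 401 is treated as a failure.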
- In main.cpp, call the new convertSpeechToText method in the processAudio function, then log the speech to the serial monitor:
    String text = speechToText.convertSpeechToText();
    Serial.println(text);
- Build the code, upload it to your Wio Terminal, and test it through the serial monitor. Once you see Ready in the serial monitor, press the C button (the one on the left-hand side, closest to the power switch), and speak. Four seconds of audio will be captured and converted to text.
    --- Available filters and text transformations: colorize, debug, default, direct, hexlify, log2file, nocontrol, printable, send_on_enter, time
    --- More details at http://bit.ly/pio-monitor-filters
    --- Miniterm on /dev/cu.usbmodem1101 9600,8,N,1 ---
    --- Quit: Ctrl+C | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
    Connecting to WiFi..
    Connected!
    Got access token.
    Ready.
    Starting recording...
    Finished recording
    Sending speech...
    Speech sent!
    {"RecognitionStatus":"Success","DisplayText":"Set a 2 minute and 27 second timer.","Offset":4700000,"Duration":35300000}
    Set a 2 minute and 27 second timer.
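The Offset and Duration values in the JSON response are, according to the Speech service REST documentation, expressed in 100-nanosecond units. The helper below is a plain C++ sketch for converting them to seconds; `ticksToSeconds` and `kTicksPerSecond` are hypothetical names, not part of the lesson code.

```cpp
#include <cassert>
#include <cstdint>

// Offset and Duration are reported in 100-nanosecond units,
// so 10,000,000 units correspond to one second.
constexpr int64_t kTicksPerSecond = 10000000;

// Converts a tick count from the response into seconds.
double ticksToSeconds(int64_t ticks)
{
    assert(ticks >= 0);
    return static_cast<double>(ticks) / static_cast<double>(kTicksPerSecond);
}
```

Applied to the sample output, the offset of 4700000 is about 0.47 seconds and the duration of 35300000 is about 3.53 seconds, which together account for the four seconds of captured audio.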
💁 You can find this code in the code-speech-to-text/wio-terminal folder.
😀 Congratulations! Your speech-to-text program is working successfully!
Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.