Speech to Text - Wio Terminal

In this part of the lesson, you will write code to convert speech from the captured audio into text using the speech service.

Sending Audio to the Speech Service

Audio can be sent to the speech service using the REST API. To use the speech service, you first need to request an access token, which is then used to access the REST API. These access tokens expire after 10 minutes, so your code should regularly request new tokens to ensure they remain valid.
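One way to keep tokens fresh is to record when a token was obtained and request a new one shortly before the 10 minutes are up. The sketch below is a minimal illustration in standard C++, not code from this lesson: the `tokenNeedsRefresh` helper and the 9-minute threshold are assumptions, and on the device the timestamps could come from `millis()`.

```cpp
#include <cassert>
#include <cstdint>

// Tokens last 10 minutes; refreshing after 9 leaves a safety margin.
// The 9-minute threshold is an assumption made for this sketch.
constexpr uint32_t TOKEN_REFRESH_AFTER_MS = 9UL * 60UL * 1000UL;

// Returns true when the token obtained at issuedAtMs should be replaced.
// Unsigned subtraction stays correct even when a 32-bit millisecond
// counter wraps around to zero.
bool tokenNeedsRefresh(uint32_t nowMs, uint32_t issuedAtMs)
{
    return (nowMs - issuedAtMs) >= TOKEN_REFRESH_AFTER_MS;
}
```

A check like this could run before each REST call, complementing the approach used later in this lesson of requesting a new token when the service responds with a 401.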

Task - Obtain an Access Token

Open the smart-timer project if it is not already open.

Add the following library dependencies to the platformio.ini file to enable WiFi access and handle JSON:

```
seeed-studio/Seeed Arduino rpcWiFi @ 1.0.5
seeed-studio/Seeed Arduino rpcUnified @ 2.1.3
seeed-studio/Seeed_Arduino_mbedtls @ 3.0.1
seeed-studio/Seeed Arduino RTC @ 2.0.0
bblanchon/ArduinoJson @ 6.17.3
```

Add the following code to the config.h header file:
```
const char *SSID = "<SSID>";
const char *PASSWORD = "<PASSWORD>";

const char *SPEECH_API_KEY = "<API_KEY>";
const char *SPEECH_LOCATION = "<LOCATION>";
const char *LANGUAGE = "<LANGUAGE>";

const char *TOKEN_URL = "https://%s.api.cognitive.microsoft.com/sts/v1.0/issuetoken";
```
Replace <SSID> and <PASSWORD> with your WiFi credentials.

Replace <API_KEY> with the API key for your speech service resource. Replace <LOCATION> with the location you used when creating the speech service resource.

Replace <LANGUAGE> with the locale name of the language you will be speaking, for example, en-GB for English (United Kingdom) or zh-HK for Cantonese. You can find a list of supported languages and their locale names in the Language and Voice Support documentation on Microsoft Docs.

The TOKEN_URL constant is the URL of the token issuer without the location. This will later be combined with the location to form the full URL.
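Combining the format string with the location can be sketched as follows. This is a host-side illustration in standard C++ under stated assumptions: `buildTokenUrl` is a hypothetical helper, not part of the lesson code, and it uses `snprintf` so that an over-long location is reported instead of overflowing the buffer.

```cpp
#include <cassert>
#include <cstddef>
#include <cstdio>

// Same format string as TOKEN_URL in config.h.
constexpr char TOKEN_URL_FORMAT[] = "https://%s.api.cognitive.microsoft.com/sts/v1.0/issuetoken";

// Hypothetical helper: writes the full token URL into out.
// Returns false if the URL did not fit, so a truncated URL is never used.
bool buildTokenUrl(char *out, size_t outSize, const char *location)
{
    assert(out != nullptr && location != nullptr);
    int written = std::snprintf(out, outSize, TOKEN_URL_FORMAT, location);
    return written > 0 && static_cast<size_t>(written) < outSize;
}
```

For example, with a 128-byte buffer and the location `westus2`, the buffer would hold `https://westus2.api.cognitive.microsoft.com/sts/v1.0/issuetoken`.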

Similar to connecting to Custom Vision, you will need to use an HTTPS connection to connect to the token issuing service. Add the following code to the end of config.h:

```
const char *TOKEN_CERTIFICATE =
    "-----BEGIN CERTIFICATE-----\r\n"
    "MIIF8zCCBNugAwIBAgIQAueRcfuAIek/4tmDg0xQwDANBgkqhkiG9w0BAQwFADBh\r\n"
    "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
    "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
    "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
    "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
    "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwNjCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
    "ggIPADCCAgoCggIBALVGARl56bx3KBUSGuPc4H5uoNFkFH4e7pvTCxRi4j/+z+Xb\r\n"
    "wjEz+5CipDOqjx9/jWjskL5dk7PaQkzItidsAAnDCW1leZBOIi68Lff1bjTeZgMY\r\n"
    "iwdRd3Y39b/lcGpiuP2d23W95YHkMMT8IlWosYIX0f4kYb62rphyfnAjYb/4Od99\r\n"
    "ThnhlAxGtfvSbXcBVIKCYfZgqRvV+5lReUnd1aNjRYVzPOoifgSx2fRyy1+pO1Uz\r\n"
    "aMMNnIOE71bVYW0A1hr19w7kOb0KkJXoALTDDj1ukUEDqQuBfBxReL5mXiu1O7WG\r\n"
    "0vltg0VZ/SZzctBsdBlx1BkmWYBW261KZgBivrql5ELTKKd8qgtHcLQA5fl6JB0Q\r\n"
    "gs5XDaWehN86Gps5JW8ArjGtjcWAIP+X8CQaWfaCnuRm6Bk/03PQWhgdi84qwA0s\r\n"
    "sRfFJwHUPTNSnE8EiGVk2frt0u8PG1pwSQsFuNJfcYIHEv1vOzP7uEOuDydsmCjh\r\n"
    "lxuoK2n5/2aVR3BMTu+p4+gl8alXoBycyLmj3J/PUgqD8SL5fTCUegGsdia/Sa60\r\n"
    "N2oV7vQ17wjMN+LXa2rjj/b4ZlZgXVojDmAjDwIRdDUujQu0RVsJqFLMzSIHpp2C\r\n"
    "Zp7mIoLrySay2YYBu7SiNwL95X6He2kS8eefBBHjzwW/9FxGqry57i71c2cDAgMB\r\n"
    "AAGjggGtMIIBqTAdBgNVHQ4EFgQU1cFnOsKjnfR3UltZEjgp5lVou6UwHwYDVR0j\r\n"
    "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
    "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
    "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
    "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
    "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
    "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
    "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
    "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
    "AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQB2oWc93fB8esci/8esixj++N22meiGDjgF\r\n"
    "+rA2LUK5IOQOgcUSTGKSqF9lYfAxPjrqPjDCUPHCURv+26ad5P/BYtXtbmtxJWu+\r\n"
    "cS5BhMDPPeG3oPZwXRHBJFAkY4O4AF7RIAAUW6EzDflUoDHKv83zOiPfYGcpHc9s\r\n"
    "kxAInCedk7QSgXvMARjjOqdakor21DTmNIUotxo8kHv5hwRlGhBJwps6fEVi1Bt0\r\n"
    "trpM/3wYxlr473WSPUFZPgP1j519kLpWOJ8z09wxay+Br29irPcBYv0GMXlHqThy\r\n"
    "8y4m/HyTQeI2IMvMrQnwqPpY+rLIXyviI2vLoI+4xKE4Rn38ZZ8m\r\n"
    "-----END CERTIFICATE-----\r\n";
```

This is the same certificate you used when connecting to Custom Vision.

Add an include directive for the WiFi header file and the config header file at the top of the main.cpp file:
```
#include <rpcWiFi.h>

#include "config.h"
```

Add code to connect to WiFi in main.cpp above the setup function:

```
void connectWiFi()
{
    while (WiFi.status() != WL_CONNECTED)
    {
        Serial.println("Connecting to WiFi..");
        WiFi.begin(SSID, PASSWORD);
        delay(500);
    }

    Serial.println("Connected!");
}
```

Call this function from the setup function after the serial connection has been established:
```
connectWiFi();
```
Create a new header file in the src folder called speech_to_text.h. Add the following code to this header file:
```
#pragma once

#include <Arduino.h>
#include <ArduinoJson.h>
#include <HTTPClient.h>
#include <WiFiClientSecure.h>

#include "config.h"
#include "mic.h"

class SpeechToText
{
public:

private:

};

SpeechToText speechToText;
```
This includes the header files needed for an HTTP connection and JSON handling, along with the config.h and mic.h header files. It also defines a class called SpeechToText and declares an instance of this class for later use.

Add the following two fields to the private section of this class:
```
WiFiClientSecure _token_client;
String _access_token;
```
The _token_client is a WiFi Client that uses HTTPS and will be used to get the access token. The token will then be stored in _access_token.

Add the following method to the private section:

```
String getAccessToken()
{
    char url[128];
    sprintf(url, TOKEN_URL, SPEECH_LOCATION);

    HTTPClient httpClient;
    httpClient.begin(_token_client, url);

    httpClient.addHeader("Ocp-Apim-Subscription-Key", SPEECH_API_KEY);
    int httpResultCode = httpClient.POST("{}");

    if (httpResultCode != 200)
    {
        Serial.println("Error getting access token, trying again...");
        delay(10000);
        return getAccessToken();
    }

    Serial.println("Got access token.");
    String result = httpClient.getString();

    httpClient.end();

    return result;
}
```

This code constructs the URL for the token issuer API using the location of the speech resource. It then creates an HTTPClient to make the web request, setting it up to use the WiFi client configured with the token endpoint's certificate. The API key is set as a header for the call. A POST request is made to retrieve the token, retrying in case of errors. Finally, the access token is returned.
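The method above retries by calling itself, so each failed attempt adds another stack frame. One alternative is a loop with a cap on the number of attempts. The sketch below is a host-side illustration in standard C++, not the lesson's implementation: `fetchTokenWithRetry` and its parameters are hypothetical, the `attempt` callback stands in for the HTTP POST, and `betweenAttempts` stands in for the `delay` call.

```cpp
#include <cassert>
#include <functional>
#include <string>

// Hypothetical helper: calls attempt() until it returns HTTP 200 or
// maxAttempts is reached. attempt fills `token` and returns the status code.
// betweenAttempts runs after each failure (on the device it could wait).
bool fetchTokenWithRetry(const std::function<int(std::string &)> &attempt,
                         const std::function<void()> &betweenAttempts,
                         int maxAttempts,
                         std::string &token)
{
    assert(maxAttempts > 0);

    for (int i = 0; i < maxAttempts; ++i)
    {
        if (attempt(token) == 200)
        {
            return true;
        }
        betweenAttempts();
    }

    // Every attempt failed: make sure no stale token is left behind.
    token.clear();
    return false;
}
```

Returning a success flag lets the caller decide what to do when the service stays unreachable, for example showing an error instead of waiting indefinitely.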

Add a method to the public section to retrieve the access token. This will be needed in later lessons to convert text to speech:
```
String AccessToken()
{
    return _access_token;
}
```
Add an init method to the public section to set up the token client:
```
void init()
{
    _token_client.setCACert(TOKEN_CERTIFICATE);
    _access_token = getAccessToken();
}
```
This sets the certificate on the WiFi client and retrieves the access token.

In main.cpp, include this new header file in the include directives:
```
#include "speech_to_text.h"
```
Initialize the SpeechToText class at the end of the setup function, after the mic.init call but before writing Ready to the serial monitor:
```
speechToText.init();
```

Task - Read Audio from Flash Memory

In an earlier part of this lesson, audio was recorded to flash memory. This audio needs to be sent to the Speech Services REST API, so it must be read back from flash memory. It is too large to load into an in-memory buffer. Instead, the HTTPClient class that makes REST calls can stream data from an Arduino Stream, a class that supplies data a little at a time so that it can be sent in small chunks as part of the request. Each call to read on a stream returns the next byte of data. An Arduino stream can be created to read from flash memory.

Create a new file called flash_stream.h in the src folder, and add the following code:
```
#pragma once

#include <Arduino.h>
#include <HTTPClient.h>
#include <sfud.h>

#include "config.h"

class FlashStream : public Stream
{
public:
    virtual size_t write(uint8_t val)
    {    
    }

    virtual int available()
    {
    }

    virtual int read()
    {
    }

    virtual int peek()
    {
    }
private:

};
```
This declares the FlashStream class, which derives from the Arduino Stream class. Stream is an abstract class, meaning derived classes must implement certain methods before they can be instantiated. The code above declares these methods with empty bodies; they are implemented in the steps that follow.

✅ Learn more about Arduino Streams in the Arduino Stream documentation

Add the following fields to the private section:
```
size_t _pos;
size_t _flash_address;
const sfud_flash *_flash;

byte _buffer[HTTP_TCP_BUFFER_SIZE];
```
This defines a temporary buffer to store data read from flash memory, along with fields to store the current position when reading from the buffer, the current address to read from flash memory, and the flash memory device.

Add the following method to the private section:
```
void populateBuffer()
{
    sfud_read(_flash, _flash_address, HTTP_TCP_BUFFER_SIZE, _buffer);
    _flash_address += HTTP_TCP_BUFFER_SIZE;
    _pos = 0;
}
```
This code reads from flash memory at the current address and stores the data in a buffer. It then increments the address so the next call reads the next block of memory. The buffer is sized based on the largest chunk the HTTPClient will send to the REST API at one time.

💁 Erasing flash memory must be done in units of the grain size, but reading has no such restriction.

Add a constructor to the public section of this class:
```
FlashStream()
{
    _pos = 0;
    _flash_address = 0;
    _flash = sfud_get_device_table() + 0;

    populateBuffer();
}
```
This constructor initializes all fields to start reading from the beginning of the flash memory block and loads the first chunk of data into the buffer.

Implement the write method. Since this stream will only read data, this method does nothing and returns 0:
```
virtual size_t write(uint8_t val)
{
    return 0;
}
```
Implement the peek method. This returns the data at the current position without advancing the stream. Calling peek multiple times will always return the same data as long as no data is read from the stream:
```
virtual int peek()
{
    return _buffer[_pos];
}
```
Implement the available method. The HTTP client calls this method to find out how much data it can request from the stream before sending it to the REST API. It returns the number of bytes that can be read, capped at the HTTPClient's chunk size: if more than a chunk's worth of data remains, the chunk size is returned; if less remains, the remaining amount is returned. Once all the data has been streamed, it returns -1:
```
virtual int available()
{
    int remaining = BUFFER_SIZE - ((_flash_address - HTTP_TCP_BUFFER_SIZE) + _pos);
    int bytes_available = min(HTTP_TCP_BUFFER_SIZE, remaining);

    if (bytes_available == 0)
    {
        bytes_available = -1;
    }

    return bytes_available;
}
```
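The arithmetic in available can be checked with small numbers. The sketch below restates it as a free function in standard C++ so it can run on a desktop; `bytesAvailable` and its parameter names are assumptions made for illustration, standing in for BUFFER_SIZE, _flash_address, _pos, and HTTP_TCP_BUFFER_SIZE.

```cpp
#include <algorithm>
#include <cassert>

// Host-side restatement of the arithmetic in available().
// totalSize    : total bytes of audio to send (BUFFER_SIZE in the lesson)
// flashAddress : address of the next block to load (_flash_address)
// pos          : read position inside the current block (_pos)
// chunkSize    : block size (HTTP_TCP_BUFFER_SIZE in the lesson)
int bytesAvailable(int totalSize, int flashAddress, int pos, int chunkSize)
{
    assert(chunkSize > 0);

    // The current block started one chunk before flashAddress, so the
    // bytes already consumed are (flashAddress - chunkSize) + pos.
    int consumed = (flashAddress - chunkSize) + pos;
    int remaining = totalSize - consumed;
    int available = std::min(chunkSize, remaining);

    return available == 0 ? -1 : available;
}
```

With an illustrative total of 10 bytes and a chunk size of 4, a fresh stream (`flashAddress` 4, `pos` 0) reports 4 bytes; after two full chunks (`flashAddress` 12, `pos` 0) it reports the final 2 bytes; and once those are read (`pos` 2) it reports -1.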
Implement the read method to return the next byte from the buffer, incrementing the position. When the position reaches the end of the buffer, the buffer is populated with the next block from flash memory, and the position is reset:
```
virtual int read()
{
    int retVal = _buffer[_pos++];

    if (_pos == HTTP_TCP_BUFFER_SIZE)
    {
        populateBuffer();
    }

    return retVal;
}
```
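To see how read, available, and populateBuffer work together across chunk boundaries, the same pattern can be run on a desktop against an in-memory array. This is a sketch under stated assumptions, not the lesson's class: `ChunkedReader` is hypothetical, a `std::vector` plays the role of flash memory, and the chunk size is passed in instead of using HTTP_TCP_BUFFER_SIZE.

```cpp
#include <algorithm>
#include <cassert>
#include <cstddef>
#include <cstdint>
#include <utility>
#include <vector>

// Host-side stand-in for FlashStream (an assumption for illustration).
class ChunkedReader
{
public:
    ChunkedReader(std::vector<uint8_t> storage, size_t chunkSize)
        : _storage(std::move(storage)), _buffer(chunkSize), _pos(0), _address(0)
    {
        assert(chunkSize > 0);
        populateBuffer();
    }

    // Bytes left to read, capped at one chunk; -1 once all data is consumed.
    int available() const
    {
        size_t consumed = (_address - _buffer.size()) + _pos;
        size_t remaining = _storage.size() - consumed;
        size_t count = std::min(_buffer.size(), remaining);
        return count == 0 ? -1 : static_cast<int>(count);
    }

    // Returns the next byte and refills the buffer at a chunk boundary.
    int read()
    {
        int value = _buffer[_pos++];
        if (_pos == _buffer.size())
        {
            populateBuffer();
        }
        return value;
    }

private:
    // Copies the next chunk from storage, zero-filling past the end.
    void populateBuffer()
    {
        for (size_t i = 0; i < _buffer.size(); ++i)
        {
            size_t source = _address + i;
            _buffer[i] = source < _storage.size() ? _storage[source] : 0;
        }
        _address += _buffer.size();
        _pos = 0;
    }

    std::vector<uint8_t> _storage; // stands in for flash memory
    std::vector<uint8_t> _buffer;  // one chunk of data
    size_t _pos;                   // read position within _buffer
    size_t _address;               // address of the next chunk to load
};
```

Reading 10 bytes through a reader with a chunk size of 4 crosses two chunk boundaries and returns the bytes in their original order. The caller is expected to stop reading once available returns -1.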
In the speech_to_text.h header file, add an include directive for this new header file:
```
#include "flash_stream.h"
```

Task - Convert Speech to Text

Speech can be converted to text by sending the audio to the Speech Service via a REST API. This REST API uses a different certificate than the token issuer, so add the following code to the config.h header file to define this certificate:

```
const char *SPEECH_CERTIFICATE =
    "-----BEGIN CERTIFICATE-----\r\n"
    "MIIF8zCCBNugAwIBAgIQCq+mxcpjxFFB6jvh98dTFzANBgkqhkiG9w0BAQwFADBh\r\n"
    "MQswCQYDVQQGEwJVUzEVMBMGA1UEChMMRGlnaUNlcnQgSW5jMRkwFwYDVQQLExB3\r\n"
    "d3cuZGlnaWNlcnQuY29tMSAwHgYDVQQDExdEaWdpQ2VydCBHbG9iYWwgUm9vdCBH\r\n"
    "MjAeFw0yMDA3MjkxMjMwMDBaFw0yNDA2MjcyMzU5NTlaMFkxCzAJBgNVBAYTAlVT\r\n"
    "MR4wHAYDVQQKExVNaWNyb3NvZnQgQ29ycG9yYXRpb24xKjAoBgNVBAMTIU1pY3Jv\r\n"
    "c29mdCBBenVyZSBUTFMgSXNzdWluZyBDQSAwMTCCAiIwDQYJKoZIhvcNAQEBBQAD\r\n"
    "ggIPADCCAgoCggIBAMedcDrkXufP7pxVm1FHLDNA9IjwHaMoaY8arqqZ4Gff4xyr\r\n"
    "RygnavXL7g12MPAx8Q6Dd9hfBzrfWxkF0Br2wIvlvkzW01naNVSkHp+OS3hL3W6n\r\n"
    "l/jYvZnVeJXjtsKYcXIf/6WtspcF5awlQ9LZJcjwaH7KoZuK+THpXCMtzD8XNVdm\r\n"
    "GW/JI0C/7U/E7evXn9XDio8SYkGSM63aLO5BtLCv092+1d4GGBSQYolRq+7Pd1kR\r\n"
    "EkWBPm0ywZ2Vb8GIS5DLrjelEkBnKCyy3B0yQud9dpVsiUeE7F5sY8Me96WVxQcb\r\n"
    "OyYdEY/j/9UpDlOG+vA+YgOvBhkKEjiqygVpP8EZoMMijephzg43b5Qi9r5UrvYo\r\n"
    "o19oR/8pf4HJNDPF0/FJwFVMW8PmCBLGstin3NE1+NeWTkGt0TzpHjgKyfaDP2tO\r\n"
    "4bCk1G7pP2kDFT7SYfc8xbgCkFQ2UCEXsaH/f5YmpLn4YPiNFCeeIida7xnfTvc4\r\n"
    "7IxyVccHHq1FzGygOqemrxEETKh8hvDR6eBdrBwmCHVgZrnAqnn93JtGyPLi6+cj\r\n"
    "WGVGtMZHwzVvX1HvSFG771sskcEjJxiQNQDQRWHEh3NxvNb7kFlAXnVdRkkvhjpR\r\n"
    "GchFhTAzqmwltdWhWDEyCMKC2x/mSZvZtlZGY+g37Y72qHzidwtyW7rBetZJAgMB\r\n"
    "AAGjggGtMIIBqTAdBgNVHQ4EFgQUDyBd16FXlduSzyvQx8J3BM5ygHYwHwYDVR0j\r\n"
    "BBgwFoAUTiJUIBiV5uNu5g/6+rkS7QYXjzkwDgYDVR0PAQH/BAQDAgGGMB0GA1Ud\r\n"
    "JQQWMBQGCCsGAQUFBwMBBggrBgEFBQcDAjASBgNVHRMBAf8ECDAGAQH/AgEAMHYG\r\n"
    "CCsGAQUFBwEBBGowaDAkBggrBgEFBQcwAYYYaHR0cDovL29jc3AuZGlnaWNlcnQu\r\n"
    "Y29tMEAGCCsGAQUFBzAChjRodHRwOi8vY2FjZXJ0cy5kaWdpY2VydC5jb20vRGln\r\n"
    "aUNlcnRHbG9iYWxSb290RzIuY3J0MHsGA1UdHwR0MHIwN6A1oDOGMWh0dHA6Ly9j\r\n"
    "cmwzLmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5jcmwwN6A1oDOG\r\n"
    "MWh0dHA6Ly9jcmw0LmRpZ2ljZXJ0LmNvbS9EaWdpQ2VydEdsb2JhbFJvb3RHMi5j\r\n"
    "cmwwHQYDVR0gBBYwFDAIBgZngQwBAgEwCAYGZ4EMAQICMBAGCSsGAQQBgjcVAQQD\r\n"
    "AgEAMA0GCSqGSIb3DQEBDAUAA4IBAQAlFvNh7QgXVLAZSsNR2XRmIn9iS8OHFCBA\r\n"
    "WxKJoi8YYQafpMTkMqeuzoL3HWb1pYEipsDkhiMnrpfeYZEA7Lz7yqEEtfgHcEBs\r\n"
    "K9KcStQGGZRfmWU07hPXHnFz+5gTXqzCE2PBMlRgVUYJiA25mJPXfB00gDvGhtYa\r\n"
    "+mENwM9Bq1B9YYLyLjRtUz8cyGsdyTIG/bBM/Q9jcV8JGqMU/UjAdh1pFyTnnHEl\r\n"
    "Y59Npi7F87ZqYYJEHJM2LGD+le8VsHjgeWX2CJQko7klXvcizuZvUEDTjHaQcs2J\r\n"
    "+kPgfyMIOY1DMJ21NxOJ2xPRC/wAh/hzSBRVtoAnyuxtkZ4VjIOh\r\n"
    "-----END CERTIFICATE-----\r\n";
```

Add a constant to this file for the speech URL without the location. This will later be combined with the location and language to form the full URL:
```
const char *SPEECH_URL = "https://%s.stt.speech.microsoft.com/speech/recognition/conversation/cognitiveservices/v1?language=%s";
```
In the speech_to_text.h header file, add a field to the private section of the SpeechToText class for a WiFi Client using the speech certificate:
```
WiFiClientSecure _speech_client;
```
In the init method, set the certificate on this WiFi Client:
```
_speech_client.setCACert(SPEECH_CERTIFICATE);
```
Add the following code to the public section of the SpeechToText class to define a method for converting speech to text:
```
String convertSpeechToText()
{

}
```
Add the following code to this method. It builds the speech URL from the location and language, then creates an HTTP client that uses the WiFi client configured with the speech certificate:
```
char url[128];
sprintf(url, SPEECH_URL, SPEECH_LOCATION, LANGUAGE);

HTTPClient httpClient;
httpClient.begin(_speech_client, url);
```

Set the necessary headers for the connection:

```
httpClient.addHeader("Authorization", String("Bearer ") + _access_token);
httpClient.addHeader("Content-Type", String("audio/wav; codecs=audio/pcm; samplerate=") + String(RATE));
httpClient.addHeader("Accept", "application/json;text/xml");
```

This sets headers for authorization using the access token, the audio format using the sample rate, and specifies that the client expects the result in JSON format.

Add the following code to make the REST API call:

```
Serial.println("Sending speech...");

FlashStream stream;
int httpResponseCode = httpClient.sendRequest("POST", &stream, BUFFER_SIZE);

Serial.println("Speech sent!");
```

This creates a FlashStream and uses it to stream data to the REST API.

Add the following code to check the response code:

```
String text = "";

if (httpResponseCode == 200)
{
    String result = httpClient.getString();
    Serial.println(result);

    DynamicJsonDocument doc(1024);
    deserializeJson(doc, result.c_str());

    JsonObject obj = doc.as<JsonObject>();
    text = obj["DisplayText"].as<String>();
}
else if (httpResponseCode == 401)
{
    Serial.println("Access token expired, trying again with a new token");
    _access_token = getAccessToken();
    return convertSpeechToText();
}
else
{
    Serial.print("Failed to convert speech to text - error ");
    Serial.println(httpResponseCode);
}
```

If the response code is 200 (success), the result is retrieved, decoded from JSON, and the DisplayText property is assigned to the text variable. This property contains the text version of the speech.

If the response code is 401, the access token has expired (tokens last only 10 minutes). A new access token is requested, and the call is retried.

Otherwise, an error is logged to the serial monitor, and the text variable is left blank.
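The three outcomes described above can be written as a small decision function. The sketch below is an illustration in standard C++ rather than the lesson's code: `SpeechAction` and `nextAction` are hypothetical names, and the `retriesLeft` parameter is an added assumption that caps how often a 401 triggers a resend, which the method above does not do.

```cpp
#include <cassert>

// What the caller should do next for a given HTTP status code.
enum class SpeechAction
{
    UseResult,       // 200: parse the JSON body and read DisplayText
    RefreshAndRetry, // 401: request a new access token, then resend
    ReportError      // anything else: log the code and return empty text
};

// Hypothetical helper. retriesLeft limits how many times a 401 leads to
// a resend, so a token that keeps being rejected cannot retry forever.
SpeechAction nextAction(int httpResponseCode, int retriesLeft)
{
    assert(retriesLeft >= 0);

    if (httpResponseCode == 200)
    {
        return SpeechAction::UseResult;
    }

    if (httpResponseCode == 401 && retriesLeft > 0)
    {
        return SpeechAction::RefreshAndRetry;
    }

    return SpeechAction::ReportError;
}
```

Keeping the decision separate from the HTTP call makes it possible to test the retry policy without a network connection.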

Add the following code to the end of this method to close the HTTP client and return the text:
```
httpClient.end();

return text;
```
In main.cpp, call the new convertSpeechToText method in the processAudio function, then log the speech to the serial monitor:
```
String text = speechToText.convertSpeechToText();
Serial.println(text);
```

Build the code, upload it to your Wio Terminal, and test it through the serial monitor. Once you see Ready in the serial monitor, press the C button (the one on the left-hand side, closest to the power switch), and speak. Four seconds of audio will be captured and converted to text.

```
--- Available filters and text transformations: colorize, debug, default, direct, hexlify, log2file, nocontrol, printable, send_on_enter, time
--- More details at http://bit.ly/pio-monitor-filters
--- Miniterm on /dev/cu.usbmodem1101  9600,8,N,1 ---
--- Quit: Ctrl+C | Menu: Ctrl+T | Help: Ctrl+T followed by Ctrl+H ---
Connecting to WiFi..
Connected!
Got access token.
Ready.
Starting recording...
Finished recording
Sending speech...
Speech sent!
{"RecognitionStatus":"Success","DisplayText":"Set a 2 minute and 27 second timer.","Offset":4700000,"Duration":35300000}
Set a 2 minute and 27 second timer.
```
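The Offset and Duration values in the JSON response are expressed in 100-nanosecond units. The following sketch, in standard C++, shows one way to convert them to milliseconds; `ticksToMilliseconds` is a hypothetical helper and is not part of the lesson code.

```cpp
#include <cassert>
#include <cstdint>

// One tick is 100 nanoseconds, so there are 10,000 ticks per millisecond.
constexpr int64_t TICKS_PER_MILLISECOND = 10000;

// Converts a tick count from the response into whole milliseconds.
int64_t ticksToMilliseconds(int64_t ticks)
{
    assert(ticks >= 0);
    return ticks / TICKS_PER_MILLISECOND;
}
```

For the response shown above, an Offset of 4700000 corresponds to 470 ms and a Duration of 35300000 to 3530 ms, so the recognized speech starts 0.47 seconds into the recording and lasts 3.53 seconds.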

💁 You can find this code in the code-speech-to-text/wio-terminal folder.

😀 Congratulations! Your speech-to-text program is working successfully!

Disclaimer:
This document has been translated using the AI translation service Co-op Translator. While we aim for accuracy, please note that automated translations may include errors or inaccuracies. The original document in its native language should be regarded as the authoritative source. For critical information, professional human translation is advised. We are not responsible for any misunderstandings or misinterpretations resulting from the use of this translation.
