diff --git a/.vscode/settings.json b/.vscode/settings.json
index 0794c279..dde853fc 100644
--- a/.vscode/settings.json
+++ b/.vscode/settings.json
@@ -5,6 +5,7 @@
"Geospatial",
"Kbps",
"Mbps",
+ "SSML",
"Seeed",
"Siri",
"Twilio",
diff --git a/6-consumer/lessons/2-language-understanding/assignment.md b/6-consumer/lessons/2-language-understanding/assignment.md
index 69c4645f..f3c5e301 100644
--- a/6-consumer/lessons/2-language-understanding/assignment.md
+++ b/6-consumer/lessons/2-language-understanding/assignment.md
@@ -2,7 +2,7 @@
## Instructions
-So far in this lesson you have trained a model to understand setting a timer. Another useful feature is cancelling a timer - maybe your bread is ready and can be taken out of the oven.
+So far in this lesson you have trained a model to understand setting a timer. Another useful feature is cancelling a timer - maybe your bread is ready and can be taken out of the oven before the timer has elapsed.
Add a new intent to your LUIS app to cancel the timer. It won't need any entities, but will need some example sentences. Handle this in your serverless code if it is the top intent, logging that the intent was recognized.
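+
+As a rough sketch (assuming the `prediction_response` object and `logging` setup from this lesson's serverless code, and that you name the new intent `cancel timer`), the top intent check might look like this:
+
+```python
+if prediction_response.prediction.top_intent == 'cancel timer':
+    # For this assignment it is enough to log that the intent was seen
+    logging.info('Cancel timer intent recognized')
+```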
diff --git a/6-consumer/lessons/3-spoken-feedback/README.md b/6-consumer/lessons/3-spoken-feedback/README.md
index b114baec..9fecc411 100644
--- a/6-consumer/lessons/3-spoken-feedback/README.md
+++ b/6-consumer/lessons/3-spoken-feedback/README.md
@@ -26,17 +26,49 @@ In this lesson we'll cover:
## Text to speech
+Text to speech, as the name suggests, is the process of converting text into audio that contains the text as spoken words. The basic principle is to break down the words in the text into their constituent sounds (known as phonemes), and stitch together audio for those sounds, either using pre-recorded audio or using audio generated by AI models.
+
+Text to speech systems typically have 3 stages:
+
+* Text analysis
+* Linguistic analysis
+* Wave-form generation
+
+### Text analysis
+
+Text analysis involves taking the text provided and converting it into words that can be used to generate speech. For example, if you convert "Hello world", there is no text analysis needed, the two words can be converted to speech. If you have "1234" however, then this might need to be converted either into the words "One thousand, two hundred thirty four" or "One, two, three, four" depending on the context. For "I have 1234 apples" it would be "One thousand, two hundred thirty four", but for "The child counted 1234" it would be "One, two, three, four".
+
+The words created vary not only with the language, but with the locale of that language. For example, in American English 120 would be "One hundred twenty", whereas in British English it would be "One hundred and twenty", with the use of "and" after the hundreds.
+
+✅ Some other examples that require text analysis include "in" as a short form of inch, and "st" as a short form of saint and street. Can you think of other examples in your language of words that are ambiguous without context?
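+
+As a minimal sketch of this stage, the context-dependent expansion of "1234" could look something like the code below. The `counting` flag and the simple word lists are illustrative assumptions, not how production text to speech systems work:
+
+```python
+# Toy text analysis: expand a number as a cardinal reading
+# ("one thousand two hundred thirty four") or digit-by-digit
+# ("one, two, three, four") depending on the context.
+ONES = ['zero', 'one', 'two', 'three', 'four', 'five', 'six', 'seven', 'eight', 'nine']
+TEENS = ['ten', 'eleven', 'twelve', 'thirteen', 'fourteen', 'fifteen',
+         'sixteen', 'seventeen', 'eighteen', 'nineteen']
+TENS = ['', '', 'twenty', 'thirty', 'forty', 'fifty', 'sixty', 'seventy', 'eighty', 'ninety']
+
+def cardinal(n: int) -> str:
+    """Convert 0-9999 to American English words."""
+    if n < 10:
+        return ONES[n]
+    if n < 20:
+        return TEENS[n - 10]
+    if n < 100:
+        tens, ones = divmod(n, 10)
+        return TENS[tens] + (' ' + ONES[ones] if ones else '')
+    if n < 1000:
+        hundreds, rest = divmod(n, 100)
+        return ONES[hundreds] + ' hundred' + (' ' + cardinal(rest) if rest else '')
+    thousands, rest = divmod(n, 1000)
+    return ONES[thousands] + ' thousand' + (' ' + cardinal(rest) if rest else '')
+
+def expand(number: str, counting: bool) -> str:
+    if counting:
+        return ', '.join(ONES[int(d)] for d in number)
+    return cardinal(int(number))
+
+print(expand('1234', counting=False))  # one thousand two hundred thirty four
+print(expand('1234', counting=True))   # one, two, three, four
+```
+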
+Once the words have been defined, they are sent for linguistic analysis.
+
+### Linguistic analysis
+
+Linguistic analysis breaks the words down into phonemes. Phonemes are based not just on the letters used, but also on the other letters in the word. For example, in English the 'a' sound in 'car' and 'care' is different. The English language has 44 different phonemes for the 26 letters in the alphabet, some shared by different letters, such as the same phoneme used at the start of 'circle' and 'serpent'.
+
+✅ Do some research: What are the phonemes for your language?
+
+Once the words have been converted to phonemes, these phonemes need additional data to support intonation, adjusting the tone or duration depending on the context. One example in English is that a pitch increase can be used to convert a sentence into a question - a raised pitch for the last word implies a question.
+
+For example - the sentence "You have an apple" is a statement saying that you have an apple. If the pitch goes up at the end, rising on the word "apple", it becomes the question "You have an apple?", asking if you have an apple. The linguistic analysis needs to use the question mark at the end to decide to increase the pitch.
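+
+A toy sketch of this stage is below, combining a tiny hand-written lexicon with a question-mark rule. The ARPAbet-style phoneme symbols and the single intonation rule are illustrative assumptions - real systems use far richer models:
+
+```python
+# Toy linguistic analysis: look up phonemes for each word, then mark
+# rising pitch on the final word when the sentence ends with '?'.
+LEXICON = {
+    'you': ['Y', 'UW'],
+    'have': ['HH', 'AE', 'V'],
+    'an': ['AE', 'N'],
+    'apple': ['AE', 'P', 'AH', 'L'],
+}
+
+def analyze(sentence: str) -> list:
+    is_question = sentence.strip().endswith('?')
+    words = sentence.strip(' ?.!').lower().split()
+    analyzed = []
+    for i, word in enumerate(words):
+        pitch = 'rising' if is_question and i == len(words) - 1 else 'neutral'
+        analyzed.append((word, LEXICON[word], pitch))
+    return analyzed
+
+print(analyze('You have an apple'))   # 'apple' is neutral - a statement
+print(analyze('You have an apple?'))  # 'apple' is rising - a question
+```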
+
+Once the phonemes have been generated, they can be sent for wave-form generation to produce the audio output.
+
+### Wave-form generation
+
+The first electronic text to speech systems used single audio recordings for each phoneme, leading to very monotonous, robotic sounding voices. The linguistic analysis would produce phonemes, and these would be loaded from a database of sounds and stitched together to make the audio.
+
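+A minimal sketch of this concatenative approach is below, assuming a hypothetical `clips` folder containing one pre-recorded WAV file per phoneme, all with the same sample format:
+
+```python
+# Toy wave-form generation: stitch per-phoneme recordings into one file.
+import wave
+
+def stitch(phonemes, output_path='speech.wav'):
+    params = None
+    frames = []
+    for phoneme in phonemes:
+        with wave.open(f'clips/{phoneme}.wav', 'rb') as clip:
+            # Assumes every clip shares the same channels/rate/sample width
+            params = clip.getparams()
+            frames.append(clip.readframes(clip.getnframes()))
+    with wave.open(output_path, 'wb') as out:
+        out.setparams(params)
+        for data in frames:
+            out.writeframes(data)
+
+stitch(['HH', 'AH', 'L', 'OW'])  # a robotic-sounding "hello"
+```
+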
+✅ Do some research: Find some audio recordings from early speech synthesis systems. Compare them to modern speech synthesis, such as that used in smart assistants.
+
+More modern wave-form generation uses ML models built using deep learning (very large neural networks that act in a similar way to neurons in the brain) to produce more natural sounding voices that can be indistinguishable from humans.
+
+> 💁 Some of these ML models can be re-trained using transfer learning to sound like real people. This means using voice as a security system, something banks are increasingly trying to do, is no longer a good idea as anyone with a recording of a few minutes of your voice can impersonate you.
+
+These large ML models are being trained to combine all three steps into end-to-end speech synthesizers.
## Set the timer
@@ -74,14 +106,21 @@ The timer can be set by sending a command from the serverless code, instructing
## Convert text to speech
-The same speech service you used to convert speech to text can be used to convert text back into speech, and this can be played through a speaker on your IoT device.
+The same speech service you used to convert speech to text can be used to convert text back into speech, and this can be played through a speaker on your IoT device. The text to convert is sent to the speech service, along with the type of audio required (such as the sample rate), and binary data containing the audio is returned.
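+
+For example, with the Azure Speech SDK used by the virtual device code you can request a specific audio format on the speech config before creating the synthesizer. This is a sketch, assuming the `speech_config` object from the lesson; the format shown is just one of the options the SDK enumerates:
+
+```python
+import azure.cognitiveservices.speech as speechsdk
+
+# Ask the service to return 16KHz, 16-bit mono PCM WAV data
+speech_config.set_speech_synthesis_output_format(
+    speechsdk.SpeechSynthesisOutputFormat.Riff16Khz16BitMonoPcm)
+```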
-Voices, neural, others
-
-SSML
+When you send this request, you send it using *Speech Synthesis Markup Language* (SSML), an XML-based markup language for speech synthesis applications. This defines not only the text to be converted, but also the language of the text and the voice to use, and can even be used to define the speed, volume, and pitch of some or all of the words in the text.
+
+For example, this SSML defines a request to convert the text "Your 3 minute 5 second timer has been set" to speech using a British English voice called `en-GB-MiaNeural`:
+```xml
+<speak version='1.0' xml:lang='en-GB'>
+    <voice xml:lang='en-GB' name='en-GB-MiaNeural'>
+        Your 3 minute 5 second timer has been set
+    </voice>
+</speak>
+```
+> 💁 Most text to speech systems have multiple voices for different languages, with relevant accents such as a British English voice with an English accent and a New Zealand English voice with a New Zealand accent.
### Task - convert text to speech
@@ -95,12 +134,17 @@ Work through the relevant guide to convert text to speech using your IoT device:
## 🚀 Challenge
+SSML has ways to change how words are spoken, such as adding emphasis to certain words, adding pauses, or changing pitch. Try some of these out, sending different SSML from your IoT device and comparing the output. You can read more about SSML, including how to change the way words are spoken, in the [Speech Synthesis Markup Language (SSML) Version 1.1 specification from the World Wide Web Consortium](https://www.w3.org/TR/speech-synthesis11/).
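+
+As a starting point, this sketch uses the `emphasis`, `break`, and `prosody` elements from the SSML specification - how (and whether) each element is rendered varies between services and voices:
+
+```xml
+<speak version='1.0' xml:lang='en-GB'>
+    <voice xml:lang='en-GB' name='en-GB-MiaNeural'>
+        Your timer has <emphasis level='strong'>finished</emphasis>.
+        <break time='500ms'/>
+        <prosody rate='slow' pitch='+10%'>Enjoy your bread!</prosody>
+    </voice>
+</speak>
+```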
+
## Post-lecture quiz
[Post-lecture quiz](https://brave-island-0b7c7f50f.azurestaticapps.net/quiz/46)
## Review & Self Study
+* Read more on speech synthesis on the [Speech synthesis page on Wikipedia](https://wikipedia.org/wiki/Speech_synthesis)
+* Read more on ways criminals are using speech synthesis to steal in the [Fake voices 'help cyber crooks steal cash' story on BBC News](https://www.bbc.com/news/technology-48908736)
+
## Assignment
-[](assignment.md)
+[Cancel the timer](assignment.md)
diff --git a/6-consumer/lessons/3-spoken-feedback/assignment.md b/6-consumer/lessons/3-spoken-feedback/assignment.md
index da157d5c..efaad571 100644
--- a/6-consumer/lessons/3-spoken-feedback/assignment.md
+++ b/6-consumer/lessons/3-spoken-feedback/assignment.md
@@ -1,9 +1,12 @@
-#
+# Cancel the timer
## Instructions
+In the assignment for the last lesson, you added a cancel timer intent to LUIS. For this assignment you need to handle this intent in the serverless code, send a command to the IoT device, then cancel the timer on the device.
+
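+As a rough starting point only (a sketch that assumes the `registry_manager`, `device_id`, and `prediction_response` objects from the lesson's serverless code, and a hypothetical `cancel-timer` method name that your device code would need to handle):
+
+```python
+from azure.iot.hub.models import CloudToDeviceMethod
+
+if prediction_response.prediction.top_intent == 'cancel timer':
+    logging.info('Cancel timer intent recognized')
+    # Invoke a direct method on the device; the device handler cancels the timer
+    direct_method = CloudToDeviceMethod(method_name='cancel-timer', payload='{}')
+    registry_manager.invoke_device_method(device_id, direct_method)
+```
+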
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
-| | | | |
+| Handle the intent in serverless code and send a command | Was able to handle the intent and send a command to the device | Was able to handle the intent but was unable to send the command to the device | Was unable to handle the intent |
+| Cancel the timer on the device | Was able to receive the command and cancel the timer | Was able to receive the command but not cancel the timer | Was unable to receive the command |
diff --git a/6-consumer/lessons/3-spoken-feedback/code-spoken-response/virtual-iot-device/smart-timer/app.py b/6-consumer/lessons/3-spoken-feedback/code-spoken-response/virtual-iot-device/smart-timer/app.py
index 3b2e7dcf..cd1a8feb 100644
--- a/6-consumer/lessons/3-spoken-feedback/code-spoken-response/virtual-iot-device/smart-timer/app.py
+++ b/6-consumer/lessons/3-spoken-feedback/code-spoken-response/virtual-iot-device/smart-timer/app.py
@@ -40,7 +40,15 @@ first_voice = next(x for x in voices if x.locale.lower() == language.lower())
speech_config.speech_synthesis_voice_name = first_voice.short_name
def say(text):
- speech_synthesizer.speak_text(text)
+    # Build the SSML for the text, using the language and voice set above
+    ssml = f'<speak version=\'1.0\' xml:lang=\'{language}\'>'
+    ssml += f'<voice xml:lang=\'{language}\' name=\'{first_voice.short_name}\'>'
+    ssml += text
+    ssml += '</voice>'
+    ssml += '</speak>'
+
+ recognizer.stop_continuous_recognition()
+ speech_synthesizer.speak_ssml(ssml)
+ recognizer.start_continuous_recognition()
def announce_timer(minutes, seconds):
announcement = 'Times up on your '
diff --git a/6-consumer/lessons/3-spoken-feedback/virtual-device-text-to-speech.md b/6-consumer/lessons/3-spoken-feedback/virtual-device-text-to-speech.md
index af72004e..df71c4a0 100644
--- a/6-consumer/lessons/3-spoken-feedback/virtual-device-text-to-speech.md
+++ b/6-consumer/lessons/3-spoken-feedback/virtual-device-text-to-speech.md
@@ -43,12 +43,28 @@ Each language supports a range of different voices, and you can get the list of
> speech_config.speech_synthesis_voice_name = 'hi-IN-SwaraNeural'
> ```
-1. Finally update the contents of the `say` function to use the speech synthesizer to speak the response:
+1. Update the contents of the `say` function to generate SSML for the response:
```python
- speech_synthesizer.speak_text(text)
+    ssml = f'<speak version=\'1.0\' xml:lang=\'{language}\'>'
+    ssml += f'<voice xml:lang=\'{language}\' name=\'{first_voice.short_name}\'>'
+    ssml += text
+    ssml += '</voice>'
+    ssml += '</speak>'
```
+1. Below this, stop the speech recognition, speak the SSML, then start the recognition again:
+
+ ```python
+ recognizer.stop_continuous_recognition()
+ speech_synthesizer.speak_ssml(ssml)
+ recognizer.start_continuous_recognition()
+ ```
+
+    The recognition is stopped whilst the text is spoken. This stops the announcement that the timer is starting from being detected, sent to LUIS, and possibly interpreted as a request to set a new timer.
+
+    > 💁 You can test this out by commenting out the lines to stop and restart the recognition. Set one timer, and you may find the announcement sets a new timer, which causes a new announcement, leading to a new timer, and so on forever!
+
1. Run the app, and ensure the function app is also running. Set some timers, and you will hear a spoken response saying that your timer has been set, then another spoken response when the timer is complete.
> 💁 You can find this code in the [code-spoken-response/virtual-iot-device](code-spoken-response/virtual-iot-device) folder.
diff --git a/images/Diagrams.sketch b/images/Diagrams.sketch
index c298bcff..468d852f 100644
Binary files a/images/Diagrams.sketch and b/images/Diagrams.sketch differ
diff --git a/images/tts-overview.png b/images/tts-overview.png
new file mode 100644
index 00000000..d4181271
Binary files /dev/null and b/images/tts-overview.png differ