pull/34/head
Jen Looper 3 years ago
parent 0978f8d6c5
commit df8e4bc1c8

@ -8,16 +8,16 @@ Add a sketchnote if possible/appropriate
For most *Natural Language Processing* tasks, the text to be processed must be broken down, examined, and the results stored or cross referenced with rules and data sets. This allows the programmer to derive the meaning or intent or only the frequency of terms and words in a text.
Let's discover common techniques used in processing text. Combined with machine learning, these techniques help you to analyse large amounts of text efficiently.
## The Tools of NLP
Let's discover common techniques used in processing text. Combined with machine learning, these techniques help you to analyse large amounts of text efficiently. Before applying ML to these tasks, however, let's understand the problems encountered by an NLP specialist.
## Tasks common to NLP
> 🎓 **Tokenization**
> 
> Probably the first thing most NLP algorithms have to do is split the text into tokens, or words. While this sounds simple, having to take punctuation and different language's word and sentence delimiters can make it tricky.
> Probably the first thing most NLP algorithms have to do is split the text into tokens, or words. While this sounds simple, having to account for punctuation and different language's word and sentence delimiters can make it tricky.
> 🎓 **Parsing & Part-of-speech Tagging**
>
> Every word that has been tokenised can be tagged as a part of speech - is the word a noun, a verb, or adjective etc. The sentence `the quick red fox jumped over the lazy brown dog` might be POS tagged as *fox* = noun, *jumped* = verb etc.
> Every word that has been tokenized can be tagged as a part of speech - a noun, verb, or adjective etc. The sentence `the quick red fox jumped over the lazy brown dog` might be POS tagged as *fox* = noun, *jumped* = verb etc.
>
> Parsing is recognizing what words are related to each other in a sentence - for instance `the quick red fox jumped` is an adjective-noun-verb sequence that is is separate from `lazy brown dog` sequence.
@ -78,11 +78,7 @@ In the sentence `the quick red fox jumped over the lazy brown dog` there are 2 n
>
> A sentence or text can be analysed for sentiment, or how *positive* or *negative* it is. Sentiment is measured in *polarity* and *objectivity/subjectivity*. Polarity is measured from -1.0 to 1.0 (negative to positive) and 0.0 to 1.0 (most objective to most subjective).
✅ Later you'll learn that there are different ways to determine sentiment using machine learning, but one way is to have a list of words and phrases that are categorised as positive or negative by a human expert and apply that model to text to calculate a polarity score. Can you see how this would work in some circumstances and less well in others?
> 🎓 **WordNet**
>
> [WordNet](https://wordnet.princeton.edu/) is a database of words, synonyms, antonyms and many other details for every word in many different languages. It is incredibly useful when attempting to build translations, spell checkers, or language tools of any type.
✅ Later you'll learn that there are different ways to determine sentiment using machine learning, but one way is to have a list of words and phrases that are categorized as positive or negative by a human expert and apply that model to text to calculate a polarity score. Can you see how this would work in some circumstances and less well in others?
> 🎓 **Inflection**
>
@ -90,14 +86,23 @@ In the sentence `the quick red fox jumped over the lazy brown dog` there are 2 n
> 🎓 **Lemmatization**
>
> A *lemma* is the root or headword for a set of words, for instance *flew*, *flies*, *flying* have a lemma of the verb *fly*.
## TextBlob & NLTK
> A *lemma* is the root or headword for a set of words, for instance *flew*, *flies*, *flying* have a lemma of the verb *fly*.
There are also useful databases available for the NLP researcher, notably:
> 🎓 **WordNet**
>
> [WordNet](https://wordnet.princeton.edu/) is a database of words, synonyms, antonyms and many other details for every word in many different languages. It is incredibly useful when attempting to build translations, spell checkers, or language tools of any type.
Luckily, you don't have to build all of these techniques yourself, as there are excellent Python libraries available that make it much more accessible to developers who aren't specialised in natural language processing or machine learning. The next lesson includes more examples on these, but here you will learn some useful examples to help you with the next task.
## NLP Libraries
Luckily, you don't have to build all of these techniques yourself, as there are excellent Python libraries available that make it much more accessible to developers who aren't specialized in natural language processing or machine learning. The next lessons include more examples of these, but here you will learn some useful examples to help you with the next task.
Let's use a library called TextBlob as it contains helpful APIs for tackling these types of tasks. TextBlob "stands on the giant shoulders of [NLTK](https://nltk.org) and [pattern](https://github.com/clips/pattern), and plays nicely with both." It has a considerable amount of ML embedded in its API.
> Note: A useful [Quick Start](https://textblob.readthedocs.io/en/dev/quickstart.html#quickstart) guide is available for TextBlob that is recommended for experienced Python developers
When attempting to identify *noun phrases*, the default extractor seems to miss quite a few, but there is the option to use a different one.
When attempting to identify *noun phrases*, TextBlob offers several options of extractors to find noun phrases. Take a look at `ConllExtractor`.
```python
from textblob import TextBlob
@ -110,6 +115,8 @@ user_input = input("> ")
user_input_blob = TextBlob(user_input, np_extractor=extractor) # note non-default extractor specified
np = user_input_blob.noun_phrases
```
> What's going on here? [ConllExtractor](https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=Conll#textblob.en.np_extractors.ConllExtractor) is "A noun phrase extractor that uses chunk parsing trained with the ConLL-2000 training corpus." ConLL-2000 refers to the Conference on Computational Natural Language Learning (CoNLL-2000). Each year the conference hosted a workshop to tackle a thorny NLP problem, and in 2000 it was noun chunking. A model was trained on the Wall Street Journal, with "sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens)". You can look at the procedures used [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and the [results](https://ifarm.nl/erikt/research/np-chunking.html).
## Task: Improving your bot with a little NLP
In the previous lesson you built a very simple Q&A bot. Now, you'll make Marvin a bit more sympathetic by analyzing your input for sentiment and printing out a response to match the sentiment. You'll also need to identify a `noun_phrase` and ask about it.

@ -1,31 +1,46 @@
# Translation with NLP and Introduction to Sentiment
# Translation and Sentiment Analysis with ML
Add a sketchnote if possible/appropriate
![Embed a video here if available](video-url)
## [Pre-lecture quiz](link-to-quiz-app)
An important challenge in computational linguistics is accurate *translation* of a sentence from one spoken or written language to another. This is a very hard problem compounded by the fact that there are thousands of languages and each can have very different grammar rules. One approach is to convert the formal grammar rules for one language, such as English, into a non-language dependent structure, and then translate it by converting back to another language. This means that you would take the following steps:
In the previous lessons you learned how to build a basic bot using TextBlob, a library that embeds ML behind-the-scenes to perform basic NLP tasks such as noun phrase extraction. Another important challenge in computational linguistics is accurate *translation* of a sentence from one spoken or written language to another.
This is a very hard problem compounded by the fact that there are thousands of languages and each can have very different grammar rules. One approach is to convert the formal grammar rules for one language, such as English, into a non-language dependent structure, and then translate it by converting back to another language. This means that you would take the following steps:
1. Identify or tag the words in input language into nouns, verbs etc.
2. Produce a direct translation of each word in the target language format
> ☘️ **Example**: In **English**, the simple sentence `I feel happy` is 3 words in the order **subject** (I), **verb** (feel), **adjective** (happy). However, in the **Irish** language, the same sentence has a very different grammatical structure - emotions like "*happy*" or "*sad*" are expressed as being *upon* you. The English phrase `I feel happy` in Irish would be `Tá athas orm`. A *literal* translation would be `Happy is upon me`. Of course, an Irish speaker translating to English would say `I feel happy`, not `Happy is upon me`, because they understand the meaning of the sentence, even if the words and sentence structure are different. The formal order for the sentence in Irish are **verb** (Tá or is), **adjective** (athas, or happy), **subject** (orm, or upon me).
A naive translation program might translate words only, ignoring the sentence structure. This leads to bad (and sometimes hilarious) mistranslations: `I feel happy` translates literally to `Mise bhraitheann athas`. In Irish that means (literally) `me feel happy` and not a valid Irish sentence. Even though English and Irish are languages spoken on two closely neighboring islands, they are very different languages with different grammar structures.
## Translation
A naive translation program might translate words only, ignoring the sentence structure.
✅ If you've learned a second (or third or more) language as an adult, you might have started by thinking in your native language, translating a concept word by word in your head to the second language, and then speaking out your translation. This is similar to what naive translation computer programs are doing. It's important to get past this phase to attain fluency!
Naive translation leads to bad (and sometimes hilarious) mistranslations: `I feel happy` translates literally to `Mise bhraitheann athas`. In Irish that means (literally) `me feel happy` and not a valid Irish sentence. Even though English and Irish are languages spoken on two closely neighboring islands, they are very different languages with different grammar structures.
> You can watch some videos about Irish linguistic traditions such as [this one](https://www.youtube.com/watch?v=mRIaLSdRMMs)
### Machine Learning Approaches
### Machine learning approaches
So far, you've learned about the formal rules approach to natural language processing. Another approach is to ignore the meaning of the words, and _instead use machine learning to detect patterns_. This can work in translation if you have lots of text (a *corpus*) or texts (*corpora*) in both the origin and target languages. For instance, if you have *Pride and Prejudice* in English and a human translation of the book in *French*, you could detect phrases in one that are idiomatically translated into the other.
For instance, when an English phrase such as `John looked at the cake with a wolfish grin` is translated literally, to, say French, it might become `John regarda le gâteau avec un sourire de loup`. A reader of both languages would understand that the direct translation of `wolfish grin` is not the French translation `wolf smile` but a synonym - in this case for being very hungry or voracious. A better translation that a human might make would be `John regarda le gâteau avec voracité`, because it better conveys the meaning. If a ML model has enough human translations to build a model on, it can improve the accuracy of translations by identifying common patterns in texts that have been previously translated by expert human speakers of both languages.
TODO: Demo
## Sentiment analysis
Another area where machine learning can work very well is sentiment analysis. A non-ML approach to sentiment is to identify words and phrases which are 'positive' and 'negative'. Then, given a new piece of text, calculate the total value of the positive, negative and neutral words to identify the overall sentiment. This approach is easily tricked as you may have seen in the Marvin task - the sentence `Great, that was a wonderful waste of time, I'm glad we are lost on this dark road` is a sarcastic, negative sentiment sentence, but the simple algorithm detects 'great', 'wonderful', 'glad' as positive and 'waste', 'lost' and 'dark' as negative. The overall sentiment is swayed by these conflicting words.
✅ Stop a second and think about how we convey sarcasm as human speakers. Tone inflection plays a large role. Try to say the phrase "Well, that film was awesome" to discover how your voice conveys meaning.
### Machine learning approaches
The ML approach would be to hand gather negative and positive bodies of text - tweets, or movie reviews, or anything where the human has given a score *and* a written opinion. Then NLP techniques can be applied to opinions and scores, so that patterns emerge (e.g., positive movie reviews tend to have the phrase 'Oscar worthy' more than negative movie reviews, or positive restaurant reviews say 'gourmet' much more than 'disgusting').
> ⚖️ **Example**: If you worked in a politician's office and there was some new law being debated, constituents might write to the office with emails supporting or emails against the particular new law. Let's say you are tasked with reading the emails and sorting them in 2 piles, *for* and *against*. If there were a lot of emails, you might be overwhelmed attempting to read them all. Wouldn't it be nice if a bot could read them all for you, understand them and tell you in which pile each email belonged?
@ -97,10 +112,13 @@ Here is a sample [solution](solutions/book.py).
- It would be dreadful!
✅ Any aficionado of Jane Austen will understand that she often uses her books to critique the more ridiculous aspects of English Regency society. Elizabeth Bennett, the main character in Pride and Prejudice, is a keen social observer (like the author) and her language is often heavily nuanced. Even Mr. Darcy (the love interest in the story) notes Elizabeth's playful and teasing use of language: "I have had the pleasure of your acquaintance long enough to know that you find great enjoyment in occasionally professing opinions which in fact are not your own."
## 🚀Challenge
Can you make Marvin even better by extracting other features from the user input?
## [Post-lecture quiz](link-to-quiz-app)
## Review & Self Study
There are many ways to extract sentiment from text. Think of the business applications that might make use of this technique. Think about how it can go awry. Read more about sophisticated enterprise-ready systems that analyze sentiment such as [Azure Text Analysis](https://docs.microsoft.com/en-us/azure/cognitive-services/Text-Analytics/how-tos/text-analytics-how-to-sentiment-analysis?tabs=version-3-1?WT.mc_id=academic-15963-cxa). Test some of the Pride and Prejudice sentences above and see if it can detect nuance.

@ -1,6 +1,8 @@
# Getting Started with Natural Language Processing
In this section of the curriculum, you will be introduced to one of the most widespread uses of machine learning: Natural Language Processing (NLP). Derived from Computational Linguistics, this category of Artificial Intelligence is the bridge between humans and machines via voice or textual communication. In these lessons we'll learn the basics of NLP by building small conversational bots to learn how Machine Learning aids in making these conversations more and more 'smart'. You'll travel back in time, chatting with Elizabeth Bennett and Mr. Darcy from Jane Austen's classic novel, **Pride and Prejudice**, published in 1813. Then, you'll further your knowledge by learning about sentiment analysis via hotel reviews in Europe.
In this section of the curriculum, you will be introduced to one of the most widespread uses of machine learning: Natural Language Processing (NLP). Derived from Computational Linguistics, this category of Artificial Intelligence is the bridge between humans and machines via voice or textual communication.
In these lessons we'll learn the basics of NLP by building small conversational bots to learn how Machine Learning aids in making these conversations more and more 'smart'. You'll travel back in time, chatting with Elizabeth Bennett and Mr. Darcy from Jane Austen's classic novel, **Pride and Prejudice**, published in 1813. Then, you'll further your knowledge by learning about sentiment analysis via hotel reviews in Europe.
![Pride and Prejudice book and tea](images/p&p.jpg)
> Photo by <a href="https://unsplash.com/@elaineh?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Elaine Howlin</a> on <a href="https://unsplash.com/s/photos/pride-and-prejudice?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
@ -8,12 +10,11 @@ In this section of the curriculum, you will be introduced to one of the most wid
## Lessons
1. [Introduction to Natural Language Processing](1-Introduction-to-NLP/README.md)
2. [Common NLP Tasks](2-Tasks/README.md)
3. [Translation with NLP](3-Translation/README.md)
2. [Common NLP Tasks and Techniques](2-Tasks/README.md)
3. [Translation and Sentiment Analysis with Machine Learning](3-Translation-Sentiment/README.md)
4. TBD
5. TBD
## Credits
These Natural Language Processing lessons were written with ☕ by [Stephen Howell]([Twitter](https://twitter.com/Howell_MSFT))

@ -80,7 +80,7 @@ By ensuring that the content aligns with projects, the process is made more enga
| 14 | Exploring Nigerian Musical Tastes 🎧 | [Clustering](Clustering/README.md) | Explore the K-Means Clustering Method | [lesson](Clustering/2-K-Means/README.md) | Jen |
| 15 | Introduction to Natural Language Processing ☕️ | [Natural Language Processing](NLP/README.md) | Learn the basics about NLP by building a simple bot | [lesson](NLP/1-Introduction-to-NLP/README.md) | Stephen |
| 16 | Common NLP Tasks ☕️ | [Natural Language Processing](NLP/README.md) | Deepen your NLP knowledge by understanding common tasks required when dealing with language structures | [lesson](NLP/2-Tasks/README.md) | Stephen |
| 17 | Translation and Sentiment Analysis ❤️ | [Natural Language Processing](NLP/README.md) | Translation and Sentiment analysis with Jane Austen | [lesson](NLP/3-Translation/README.md) | Stephen |
| 17 | Translation and Sentiment Analysis ❤️ | [Natural Language Processing](NLP/README.md) | Translation and Sentiment analysis with Jane Austen | [lesson](NLP/3-Translation-Sentiment/README.md) | Stephen |
| 18 | Romantic Hotels of Europe ♥️ | [Natural Language Processing](NLP/README.md) | Sentiment analysis, continued | [lesson]() | Stephen |
| 19 | Romantic Hotels of Europe ♥️ | [Natural Language Processing](NLP/README.md) | Sentiment analysis, continued | [lesson]() | Stephen |
| 20 | Introduction to Time Series Forecasting | [Time Series](TimeSeries/README.md) | Introduction to Time Series Forecasting | [lesson](TimeSeries/1-Introduction/README.md) | Francesca |

Loading…
Cancel
Save