some infographics for NLP

pull/38/head
Jen Looper 3 years ago
parent 6b04e8f2a2
commit 8db171a0d4

@ -7,25 +7,35 @@ Let's discover common techniques used in processing text. Combined with machine
## Tasks common to NLP
> 🎓 **Tokenization**
> 
> Probably the first thing most NLP algorithms have to do is split the text into tokens, or words. While this sounds simple, having to account for punctuation and different languages' word and sentence delimiters can make it tricky. Thought it might seem very straightforward to split a sentence into words, you might have to use some other methods to determine demarcations.
🎓 **Tokenization**
Probably the first thing most NLP algorithms have to do is split the text into tokens, or words. While this sounds simple, having to account for punctuation and different languages' word and sentence delimiters can make it tricky. You might have to use various methods to determine demarcations.
![tokenization](images/tokenization.png)
> Tokenizing a sentence from **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
🎓 **Embeddings**
> [Word embeddings](https://wikipedia.org/wiki/Word_embedding) are a way to convert your text data numerically. This is done in a way so that words with a similar meaning or words used together cluster together.
[Word embeddings](https://wikipedia.org/wiki/Word_embedding) are a way to convert your text data numerically. This is done in a way so that words with a similar meaning or words used together cluster together.
![word embeddings](images/embedding.png)
> "I have the highest respect for your nerves, they are my old friends." - Word embeddings for a sentence in **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
✅ Try [this interesting tool](https://projector.tensorflow.org/) to experiment with word embeddings. Clicking on one word shows clusters of similar words: 'toy' clusters with 'disney', 'lego', 'playstation', and 'console'.
> 🎓 **Parsing & Part-of-speech Tagging**
>
> Every word that has been tokenized can be tagged as a part of speech - a noun, verb, or adjective etc. The sentence `the quick red fox jumped over the lazy brown dog` might be POS tagged as *fox* = noun, *jumped* = verb etc.
>
> Parsing is recognizing what words are related to each other in a sentence - for instance `the quick red fox jumped` is an adjective-noun-verb sequence that is is separate from `lazy brown dog` sequence.
🎓 **Parsing & Part-of-speech Tagging**
Every word that has been tokenized can be tagged as a part of speech - a noun, verb, or adjective etc. The sentence `the quick red fox jumped over the lazy brown dog` might be POS tagged as *fox* = noun, *jumped* = verb etc.
![parsing](images/parse.png)
> 🎓 **Word and Phrase Frequencies**
>
> A useful tool when analyzing a large body of text is to build a dictionary of every word or phrase of interest and how often it appears. The phrase `the quick red fox jumped over the lazy brown dog` has a word frequency of 2 for `the`.
> Parsing a sentence from **Pride and Prejudice**. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Parsing is recognizing what words are related to each other in a sentence - for instance `the quick red fox jumped` is an adjective-noun-verb sequence that is separate from the `lazy brown dog` sequence.
🎓 **Word and Phrase Frequencies**
A useful tool when analyzing a large body of text is to build a dictionary of every word or phrase of interest and how often it appears. The phrase `the quick red fox jumped over the lazy brown dog` has a word frequency of 2 for `the`.
### Example:
The Rudyard Kipling poem *The Winners* has a verse:
@ -41,9 +51,9 @@ He travels the fastest who travels alone.
As phrase frequencies can be case insensitive or case sensitive as required, the phrase `a friend` has a frequency of 2 and `the` has a frequency of 6, and `travels` is 2.
> 🎓 **N-grams**
>
> A text can be split into sequences of words of a set length, a single word (unigram), two words (bigrams), three words (trigrams) or any number of words (n-grams).
🎓 **N-grams**
A text can be split into sequences of words of a set length, a single word (unigram), two words (bigrams), three words (trigrams) or any number of words (n-grams).
### Example
For instance `the quick red fox jumped over the lazy brown dog` with a n-gram score of 2 produces the following n-grams:
@ -69,32 +79,38 @@ It might be easier to visualise it as a sliding box over the sentence. Here it i
7. the quick red fox jumped over <u>**the lazy brown**</u> dog
8. the quick red fox jumped over the **<u>lazy brown dog</u>**
> 🎓 **Noun phrase Extraction**
>
> In most sentences, there is a noun that is the subject, or object of the sentence. In English, it is often identifiable as having 'a' or 'an' or 'the' preceding it. Identifying the subject or object of a sentence by 'extracting the noun phrase' is a common task in NLP when attempting to understand the meaning of a sentence.
![n-grams sliding window](images/n-grams.gif)
> N-gram value of 3: Infographic by [Jen Looper](https://twitter.com/jenlooper)
🎓 **Noun phrase Extraction**
In most sentences, there is a noun that is the subject, or object of the sentence. In English, it is often identifiable as having 'a' or 'an' or 'the' preceding it. Identifying the subject or object of a sentence by 'extracting the noun phrase' is a common task in NLP when attempting to understand the meaning of a sentence.
✅ In the sentence "I cannot fix on the hour, or the spot, or the look or the words, which laid the foundation. It is too long ago. I was in the middle before I knew that I had begun.", can you identify the noun phrases?
### Example
In the sentence `the quick red fox jumped over the lazy brown dog` there are 2 noun phrases: **quick red fox** and **lazy brown dog**.
> 🎓 **Sentiment analysis**
>
> A sentence or text can be analysed for sentiment, or how *positive* or *negative* it is. Sentiment is measured in *polarity* and *objectivity/subjectivity*. Polarity is measured from -1.0 to 1.0 (negative to positive) and 0.0 to 1.0 (most objective to most subjective).
🎓 **Sentiment analysis**
A sentence or text can be analysed for sentiment, or how *positive* or *negative* it is. Sentiment is measured in *polarity* and *objectivity/subjectivity*. Polarity is measured from -1.0 to 1.0 (negative to positive) and 0.0 to 1.0 (most objective to most subjective).
✅ Later you'll learn that there are different ways to determine sentiment using machine learning, but one way is to have a list of words and phrases that are categorized as positive or negative by a human expert and apply that model to text to calculate a polarity score. Can you see how this would work in some circumstances and less well in others?
> 🎓 **Inflection**
>
> Inflection enables you to take a word and get the singular or plural of the word.
🎓 **Inflection**
> 🎓 **Lemmatization**
>
> A *lemma* is the root or headword for a set of words, for instance *flew*, *flies*, *flying* have a lemma of the verb *fly*.
Inflection enables you to take a word and get the singular or plural of the word.
🎓 **Lemmatization**
A *lemma* is the root or headword for a set of words, for instance *flew*, *flies*, *flying* have a lemma of the verb *fly*.
There are also useful databases available for the NLP researcher, notably:
> 🎓 **WordNet**
>
> [WordNet](https://wordnet.princeton.edu/) is a database of words, synonyms, antonyms and many other details for every word in many different languages. It is incredibly useful when attempting to build translations, spell checkers, or language tools of any type.
🎓 **WordNet**
[WordNet](https://wordnet.princeton.edu/) is a database of words, synonyms, antonyms and many other details for every word in many different languages. It is incredibly useful when attempting to build translations, spell checkers, or language tools of any type.
## NLP Libraries
@ -118,7 +134,7 @@ user_input_blob = TextBlob(user_input, np_extractor=extractor) # note non-defau
np = user_input_blob.noun_phrases
```
> What's going on here? [ConllExtractor](https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=Conll#textblob.en.np_extractors.ConllExtractor) is "A noun phrase extractor that uses chunk parsing trained with the ConLL-2000 training corpus." ConLL-2000 refers to the Conference on Computational Natural Language Learning (CoNLL-2000). Each year the conference hosted a workshop to tackle a thorny NLP problem, and in 2000 it was noun chunking. A model was trained on the Wall Street Journal, with "sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens)". You can look at the procedures used [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and the [results](https://ifarm.nl/erikt/research/np-chunking.html).
> What's going on here? [ConllExtractor](https://textblob.readthedocs.io/en/dev/api_reference.html?highlight=Conll#textblob.en.np_extractors.ConllExtractor) is "A noun phrase extractor that uses chunk parsing trained with the ConLL-2000 training corpus." ConLL-2000 refers to the 2000 Conference on Computational Natural Language Learning. Each year the conference hosted a workshop to tackle a thorny NLP problem, and in 2000 it was noun chunking. A model was trained on the Wall Street Journal, with "sections 15-18 as training data (211727 tokens) and section 20 as test data (47377 tokens)". You can look at the procedures used [here](https://www.clips.uantwerpen.be/conll2000/chunking/) and the [results](https://ifarm.nl/erikt/research/np-chunking.html).
## Task: Improving your bot with a little NLP
@ -184,7 +200,6 @@ Take a task in the prior knowledge check and try to implement it. Test the bot o
## Review & Self Study
In the next few lessons you will learn more about sentiment analysis. Research this interesting technique in articles such as these on [KDNuggets](https://www.kdnuggets.com/tag/nlp)
## Assignment
[Make a bot talk back](assignment.md)

Binary file not shown.

After

Width:  |  Height:  |  Size: 88 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 56 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 40 KiB

Binary file not shown.

After

Width:  |  Height:  |  Size: 73 KiB

Loading…
Cancel
Save