24 KiB

Raw Permalink Blame History

သဘာဝဘာသာစကားကို အလုပ်လုပ်စေခြင်းဆိုင်ရာ အလုပ်များနှင့် နည်းလမ်းများ

သဘာဝဘာသာစကားကို အလုပ်လုပ်စေခြင်း အလုပ်များအတွက်၊ အလုပ်လုပ်စေလိုသော စာသားကို အစိတ်အပိုင်းများ ခွဲခြားပြီး၊ စစ်ဆေးကာ၊ ရလဒ်များကို စည်းမျဉ်းများနှင့် ဒေတာအစုများနှင့် တွဲဆက်ထားရမည်။ အလုပ်များသည်၊ အရေးအသား၏ အဓိပ္ပါယ်၊ ရည်ရွယ်ချက် သို့မဟုတ် စကားလုံးများ၏ ထပ်တလဲလဲဖြစ်မှု ကို ရယူရန် အခွင့်အရေးပေးသည်။

Pre-lecture quiz

စာသားကို အလုပ်လုပ်စေရာတွင် အသုံးပြုသော နည်းလမ်းများကို ရှာဖွေကြမည်။ စက်ရုပ်သင်ယူမှုနှင့် ပေါင်းစပ်ပြီး၊ ဤနည်းလမ်းများသည် စာသားအများအပြားကို ထိရောက်စွာ ခွဲခြားစစ်ဆေးရန် ကူညီပေးသည်။ သို့သော် စက်ရုပ်သင်ယူမှုကို ဤအလုပ်များတွင် အသုံးပြုမီ၊ သဘာဝဘာသာစကားကို အလုပ်လုပ်စေသူများ ရင်ဆိုင်ရသော ပြဿနာများကို နားလည်ရမည်။

သဘာဝဘာသာစကားကို အလုပ်လုပ်စေခြင်းဆိုင်ရာ အလုပ်များ

သင်လုပ်ဆောင်လိုသော စာသားကို ခွဲခြားစစ်ဆေးရန် နည်းလမ်းများ မျိုးစုံရှိသည်။ သင်လုပ်ဆောင်နိုင်သော အလုပ်များရှိပြီး၊ ဤအလုပ်များမှတစ်ဆင့် စာသားကို နားလည်နိုင်ပြီး သုံးသပ်ချက်များ ဆွဲယူနိုင်သည်။ အလုပ်များကို အဆင့်ဆင့် လုပ်ဆောင်လေ့ရှိသည်။

Tokenization

NLP အယ်လဂိုရစ်သမားများအတွက် ပထမဆုံးလုပ်ဆောင်ရမည့်အရာမှာ စာသားကို token သို့မဟုတ် စကားလုံးများအဖြစ် ခွဲခြားခြင်းဖြစ်သည်။ ဤအရာသည် ရိုးရှင်းသလို ထင်ရပေမယ့်၊ စာကြောင်းအတွင်းရှိ သတ်မှတ်ချက်များနှင့် ဘာသာစကားအမျိုးမျိုး၏ စကားလုံးနှင့် စာကြောင်းအဆုံးသတ်ချက်များကို ထည့်သွင်းစဉ်းစားရခြင်းကြောင့် ခက်ခဲနိုင်သည်။ သတ်မှတ်ချက်များကို ဆုံးဖြတ်ရန် နည်းလမ်းများ မျိုးစုံကို အသုံးပြုရနိုင်သည်။

Pride and Prejudice စာအုပ်မှ စာကြောင်းတစ်ကြောင်းကို token ခွဲခြင်း။ Jen Looper မှ Infographic

Embeddings

Word embeddings သည် စာသားဒေတာကို ကိန်းဂဏန်းအဖြစ် ပြောင်းလဲရန် နည်းလမ်းတစ်ခုဖြစ်သည်။ Embeddings ကို အဓိပ္ပါယ်တူသော စကားလုံးများ သို့မဟုတ် အတူတူအသုံးပြုသော စကားလုံးများကို အစုအဖွဲ့တစ်ခုအဖြစ် စုပုံစေသော နည်းလမ်းဖြင့် ပြုလုပ်သည်။

"I have the highest respect for your nerves, they are my old friends." - Pride and Prejudice စာအုပ်မှ စာကြောင်းတစ်ကြောင်းအတွက် Word embeddings။ Jen Looper မှ Infographic

✅ ဤစိတ်ဝင်စားဖွယ် tool ကို အသုံးပြု၍ word embeddings ကို စမ်းသပ်ပါ။ စကားလုံးတစ်လုံးကို နှိပ်ပါက အဓိပ္ပါယ်တူသော စကားလုံးများ၏ အစုအဖွဲ့ကို ပြသသည်။ ဥပမာ - 'toy' သည် 'disney', 'lego', 'playstation', နှင့် 'console' တို့နှင့် အစုအဖွဲ့တူသည်။

Parsing & Part-of-speech Tagging

Tokenized ဖြစ်ထားသော စကားလုံးတစ်ခုစီကို noun, verb, adjective စသည့် part of speech အဖြစ် tag လုပ်နိုင်သည်။ ဥပမာ - the quick red fox jumped over the lazy brown dog စာကြောင်းကို POS tagging လုပ်ပါက fox = noun, jumped = verb ဖြစ်နိုင်သည်။

Pride and Prejudice စာအုပ်မှ စာကြောင်းတစ်ကြောင်းကို Parsing လုပ်ခြင်း။ Jen Looper မှ Infographic

Parsing သည် စာကြောင်းအတွင်း စကားလုံးများသည် အချင်းချင်း ဆက်စပ်နေမှုကို သိရှိခြင်းဖြစ်သည်။ ဥပမာ - the quick red fox jumped သည် adjective-noun-verb အစဉ်ဖြစ်ပြီး၊ lazy brown dog အစဉ်နှင့် သီးသန့်ဖြစ်သည်။

Word and Phrase Frequencies

စာသားအများအပြားကို ခွဲခြားစစ်ဆေးရာတွင် အသုံးဝင်သော နည်းလမ်းတစ်ခုမှာ စိတ်ဝင်စားဖွယ် စကားလုံး သို့မဟုတ် စကားစုများ၏ အကြိမ်ရေကို စာရင်းပြုစုခြင်းဖြစ်သည်။ the quick red fox jumped over the lazy brown dog စာကြောင်းတွင် the စကားလုံး၏ frequency သည် 2 ဖြစ်သည်။

ဥပမာစာသားတစ်ခုကို ကြည့်ပြီး စကားလုံးများ၏ frequency ကို ရေတွက်ကြည့်ပါ။ Rudyard Kipling ရေးသားသော The Winners ကဗျာတွင် အောက်ပါ စာပိုဒ်ပါဝင်သည် -

What the moral? Who rides may read.
When the night is thick and the tracks are blind
A friend at a pinch is a friend, indeed,
But a fool to wait for the laggard behind.
Down to Gehenna or up to the Throne,
He travels the fastest who travels alone.

Phrase frequencies သည် case sensitive သို့မဟုတ် case insensitive ဖြစ်နိုင်သည်။ ဥပမာ - a friend phrase ၏ frequency သည် 2 ဖြစ်ပြီး၊ the phrase ၏ frequency သည် 6 ဖြစ်သည်။ travels phrase ၏ frequency သည် 2 ဖြစ်သည်။

N-grams

စာသားကို စကားလုံးအရေအတွက် သတ်မှတ်ထားသော အစဉ်အတိုင်း ခွဲခြားနိုင်သည်။ တစ်လုံး (unigram), နှစ်လုံး (bigram), သုံးလုံး (trigram) သို့မဟုတ် စကားလုံးအရေအတွက် မည်သည့်အတိုင်း (n-grams) ဖြစ်နိုင်သည်။

ဥပမာ - the quick red fox jumped over the lazy brown dog ကို n-gram score 2 ဖြင့် ခွဲခြားပါက အောက်ပါ n-grams များရရှိသည် -

the quick
quick red
red fox
fox jumped
jumped over
over the
the lazy
lazy brown
brown dog

ဤအရာကို sliding box အဖြစ် စဉ်းစားပါက ပိုမိုလွယ်ကူနိုင်သည်။ ဤသည်မှာ 3 စကားလုံး n-grams အတွက် sliding box ဖြစ်သည် -

the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog
the quick red fox jumped over the lazy brown dog

N-gram value of 3: Jen Looper မှ Infographic

Noun phrase Extraction

စာကြောင်းအများစုတွင် subject သို့မဟုတ် object ဖြစ်သော noun phrase တစ်ခုရှိသည်။ အင်္ဂလိပ်ဘာသာစကားတွင် a, an, the စသည်ဖြင့် စတင်သော noun phrase ကို ရှာဖွေခြင်းသည် စာကြောင်း၏ အဓိပ္ပါယ်ကို နားလည်ရန် အသုံးဝင်သော NLP အလုပ်တစ်ခုဖြစ်သည်။

✅ "I cannot fix on the hour, or the spot, or the look or the words, which laid the foundation. It is too long ago. I was in the middle before I knew that I had begun." စာကြောင်းတွင် noun phrases ကို ရှာဖွေပါ။

the quick red fox jumped over the lazy brown dog စာကြောင်းတွင် noun phrases 2 ခုရှိသည် - quick red fox နှင့် lazy brown dog။

Sentiment analysis

စာကြောင်း သို့မဟုတ် စာသားကို အပျော် သို့မဟုတ် အနည်းငယ်စိတ်ဆိုး ဖြစ်မှုအပေါ် အခြေခံ၍ sentiment ကို ခွဲခြားနိုင်သည်။ Sentiment ကို polarity နှင့် objectivity/subjectivity ဖြင့် တိုင်းတာသည်။ Polarity သည် -1.0 မှ 1.0 (negative to positive) ဖြစ်ပြီး၊ 0.0 မှ 1.0 (most objective to most subjective) ဖြစ်သည်။

✅ နောက်ပိုင်းတွင် sentiment ကို စက်ရုပ်သင်ယူမှုဖြင့် ဆုံးဖြတ်နိုင်သော နည်းလမ်းများကို သင်လေ့လာမည်။ သို့သော် လူ့အတတ်ပညာရှင်က စိတ်ဝင်စားဖွယ် positive သို့မဟုတ် negative စကားလုံးများနှင့် phrase များကို စာရင်းပြုစုထားသော နည်းလမ်းတစ်ခုကို စဉ်းစားပါ။ ဤနည်းလမ်းသည် အချို့သောအခြေအနေများတွင် အလုပ်လုပ်နိုင်ပြီး အခြားအခြေအနေများတွင် မလုပ်နိုင်ခြင်းကို သင်မြင်နိုင်ပါသလား။

Inflection

Inflection သည် စကားလုံးတစ်လုံးကို singular သို့မဟုတ် plural အဖြစ် ပြောင်းလဲရန် ခွင့်ပြုသည်။

Lemmatization

Lemma သည် စကားလုံးများအစုအဖွဲ့၏ root သို့မဟုတ် headword ဖြစ်သည်။ ဥပမာ - flew, flies, flying ၏ lemma သည် fly ဖြစ်သည်။

NLP သုတေသနသူများအတွက် အသုံးဝင်သော ဒေတာဘေ့စ်များလည်း ရရှိနိုင်သည်၊ အထူးသဖြင့် -

WordNet

WordNet သည် စကားလုံးများ၊ အဓိပ္ပါယ်တူစကားလုံးများ၊ အဓိပ္ပါယ်ဆန်စကားလုံးများနှင့် အခြားသော အသေးစိတ်များကို ဘာသာစကားအမျိုးမျိုးအတွက် စုစည်းထားသော ဒေတာဘေ့စ်ဖြစ်သည်။ ဘာသာပြန်များ၊ စာလုံးပေါင်းစစ်ဆေးသူများ သို့မဟုတ် ဘာသာစကား tools များကို တည်ဆောက်ရာတွင် အလွန်အသုံးဝင်သည်။

NLP Libraries

ကံကောင်းစွာ၊ ဤနည်းလမ်းများအားလုံးကို ကိုယ်တိုင် တည်ဆောက်ရန် မလိုအပ်ပါ၊ အလွန်ထိရောက်သော Python libraries များ ရရှိနိုင်ပြီး၊ သဘာဝဘာသာစကားကို အလုပ်လုပ်စေခြင်း သို့မဟုတ် စက်ရုပ်သင်ယူမှုတွင် အထူးကျွမ်းကျင်မဟုတ်သော developer များအတွက် ပိုမိုလွယ်ကူစေသည်။ နောက်ပိုင်းသင်ခန်းစာများတွင် ဤ libraries များ၏ နမူနာများကို ပိုမိုလေ့လာမည်၊ သို့သော် အခုမှာ သင်၏ အလုပ်ကို ကူညီနိုင်သော အသုံးဝင်သော နမူနာများကို လေ့လာပါ။

Exercise - `TextBlob` library ကို အသုံးပြုခြင်း

TextBlob ဟုခေါ်သော library ကို အသုံးပြုကြည့်ပါ၊ ဤ library တွင် ဤအမျိုးအစားအလုပ်များကို လုပ်ဆောင်ရန် API များပါဝင်သည်။ TextBlob သည် "NLTK နှင့် pattern ၏ အကြီးမားသော အခြေခံအရပ်များပေါ်တွင် ရပ်တည်ပြီး၊ နှစ်ခုစလုံးနှင့် သင့်တော်စွာ အလုပ်လုပ်သည်။" ၎င်း၏ API တွင် ML အများအပြား ပါဝင်သည်။

Note: Quick Start လမ်းညွှန်သည် အတွေ့အကြုံရှိ Python developer များအတွက် အကြံပြုထားသည်။

Noun phrases ကို ရှာဖွေရာတွင် TextBlob သည် extractors များစွာကို ပေးထားသည်။

ConllExtractor ကို ကြည့်ပါ။
```
from textblob import TextBlob
from textblob.np_extractors import ConllExtractor
# import and create a Conll extractor to use later 
extractor = ConllExtractor()

# later when you need a noun phrase extractor:
user_input = input("> ")
user_input_blob = TextBlob(user_input, np_extractor=extractor)  # note non-default extractor specified
np = user_input_blob.noun_phrases                                    
```
ဤနေရာတွင် ဘာဖြစ်နေသနည်း? ConllExtractor သည် "ConLL-2000 training corpus ဖြင့် လေ့ကျင့်ထားသော chunk parsing ကို အသုံးပြုသော noun phrase extractor" ဖြစ်သည်။ ConLL-2000 သည် 2000 ခုနှစ် Computational Natural Language Learning Conference ကို ရည်ညွှန်းသည်။ 2000 ခုနှစ်တွင် noun chunking ကို လေ့လာရန် model တစ်ခုကို Wall Street Journal မှ training data (211727 tokens) နှင့် test data (47377 tokens) ဖြင့် လေ့ကျင့်ခဲ့သည်။ Procedures နှင့် results ကို ကြည့်နိုင်သည်။

Challenge - သင့် bot ကို NLP ဖြင့် တိုးတက်စေခြင်း

ယခင်သင်ခန်းစာတွင် သင်အလွန်ရိုးရှင်းသော Q&A bot တစ်ခုကို တည်ဆောက်ခဲ့သည်။ အခုမှာ Marvin ကို သင့် input ကို sentiment အပေါ်အခြေခံ၍ သုံးသပ်ပြီး၊ sentiment နှင့် ကိုက်ညီသော အဖြေကို ပေးနိုင်စေရန် တိုးတက်စေပါမည်။ သင် noun phrase ကို ရှာဖွေပြီး၊ pluralize လုပ်ကာ အဲဒီအကြောင်းအရာအပေါ် input ပေးရန် မေးမြန်းရပါမည်။

သင့် bot ကို ပိုမိုကောင်းမွန်စေရန် လုပ်ဆောင်ရမည့် အဆင့်များ -

အသုံးပြုသူကို bot နှင့် အပြန်အလှန် ဆက်သွယ်ရန် လမ်းညွှန်ချက်များကို print လုပ်ပါ
loop ကို စတင်ပါ
1. အသုံးပြုသူ input ကို လက်ခံပါ
2. အသုံးပြုသူသည် exit မေးမြန်းပါက ထွက်ပါ
3. အသုံးပြုသူ input ကို စစ်ဆေးပြီး sentiment response ကို ဆုံးဖြတ်ပါ
4. sentiment တွင် noun phrase ရှိပါက pluralize လုပ်ပြီး အဲဒီအကြောင်းအရာအပေါ် input ပေးရန် မေးမြန်းပါ
5. အဖြေကို print လုပ်ပါ
အဆင့် 2 သို့ ပြန်သွားပါ

TextBlob ကို အသုံးပြု၍ sentiment ကို ဆုံးဖြတ်ရန် code snippet ကို ကြည့်ပါ။ Sentiment response ၏ gradients သာ 4 ခုရှိသည် (သင်လိုချင်ပါက ပိုမိုများစွာ ထည့်နိုင်သည်) -

if user_input_blob.polarity <= -0.5:
  response = "Oh dear, that sounds bad. "
elif user_input_blob.polarity <= 0:
  response = "Hmm, that's not great. "
elif user_input_blob.polarity <= 0.5:
  response = "Well, that sounds positive. "
elif user_input_blob.polarity <= 1:
  response = "Wow, that sounds great. "

ဤနမူနာ output ကို လမ်းညွှန်အဖြစ် အသုံးပြုပါ (အသုံးပြုသူ input သည် > ဖြင့် စတင်သော အကြောင်းအရာများတွင် ရှိသည်) -

Hello, I am Marvin, the friendly robot.
You can end this conversation at any time by typing 'bye'
After typing each answer, press 'enter'
How are you today?
> I am ok
Well, that sounds positive. Can you tell me more?
> I went for a walk and saw a lovely cat
Well, that sounds positive. Can you tell me more about lovely cats?
> cats are the best. But I also have a cool dog
Wow, that sounds great. Can you tell me more about cool dogs?
> I have an old hounddog but he is sick
Hmm, that's not great. Can you tell me more about old hounddogs?
> bye
It was nice talking to you, goodbye!

ဤအလုပ်ကို ပြုလုပ်ရန် အဖြေတစ်ခုကို ဒီမှာ ကြည့်နိုင်သည်။

✅ Knowledge Check

Sympathetic responses သည် bot ကို အသုံးပြုသူက အမှန်တကယ် နားလည်နေသည်ဟု ထင်ရစေမည်လား?
Noun phrase ကို ရှာဖွေခြင်းသည် bot ကို ပိုမို 'ယုံကြည်စရာကောင်း' စေမည်လား?
စာကြောင်းမှ 'noun phrase' ကို ရှာဖွေခြင်းသည် ဘာကြောင့် အသုံးဝင်သနည်း?

ယခင် knowledge check တွင် bot ကို တည်ဆောက်ပြီး၊ သူငယ်ချင်းတစ်ဦးကို စမ်းသပ်ပါ။ Bot သည် သူတို့ကို လှည့်စားနိုင်ပါသလား? Bot ကို ပိုမို 'ယုံကြည်စရာကောင်း'

ဝက်ဘ်ဆိုက်မှတ်ချက်:
ဒီစာရွက်စာတမ်းကို AI ဘာသာပြန်ဝန်ဆောင်မှု Co-op Translator ကို အသုံးပြု၍ ဘာသာပြန်ထားပါသည်။ ကျွန်ုပ်တို့သည် တိကျမှန်ကန်မှုအတွက် ကြိုးစားနေပါသော်လည်း၊ အလိုအလျောက်ဘာသာပြန်မှုများတွင် အမှားများ သို့မဟုတ် မမှန်ကန်မှုများ ပါဝင်နိုင်သည်ကို သတိပြုပါ။ မူလဘာသာစကားဖြင့် ရေးသားထားသော စာရွက်စာတမ်းကို အာဏာတည်သော ရင်းမြစ်အဖြစ် သတ်မှတ်သင့်ပါသည်။ အရေးကြီးသော အချက်အလက်များအတွက် လူ့ဘာသာပြန်ပညာရှင်များကို အကြံပြုပါသည်။ ဒီဘာသာပြန်မှုကို အသုံးပြုခြင်းမှ ဖြစ်ပေါ်လာသော နားလည်မှုမှားမှုများ သို့မဟုတ် အဓိပ္ပာယ်မှားမှုများအတွက် ကျွန်ုပ်တို့သည် တာဝန်မယူပါ။

24 KiB Raw Permalink Blame History