Merge branch 'microsoft:main' into 2-regression-ja

pull/159/head
kenya-sk 4 years ago committed by GitHub
commit bd5f6c2c59

.gitignore

@@ -33,6 +33,8 @@ bld/
# Visual Studio 2015/2017 cache/options directory
.vs/
# Visual Studio Code cache/options directory
.vscode/
# Uncomment if you have tasks that create the project's static files in wwwroot
#wwwroot/

@@ -102,6 +102,8 @@ Sketch, on paper or using an online app like [Excalidraw](https://excalidraw.com
To learn more about how you can work with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-15963-cxa).
Take a [Learning Path](https://docs.microsoft.com/learn/modules/introduction-to-machine-learning/?WT.mc_id=academic-15963-cxa) about the basics of ML.
## Assignment
[Get up and running](assignment.md)

@@ -0,0 +1,109 @@
# Introduction to machine learning
[![ML, AI, deep learning - What's the difference?](https://img.youtube.com/vi/lTd9RSxS9ZE/0.jpg)](https://youtu.be/lTd9RSxS9ZE "ML, AI, deep learning - What's the difference?")
> 🎥 Click the image above for a video discussing the difference between machine learning, AI, and deep learning.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1?loc=fr)
### Introduction
Welcome to this course on classical machine learning for beginners! Whether you're completely new to this topic or an experienced ML practitioner looking to brush up, we're happy to have you join us! We want to create a friendly launching spot for your ML studies and would be glad to evaluate, respond to, and learn from your [feedback](https://github.com/microsoft/ML-For-Beginners/discussions).
[![Introduction to ML](https://img.youtube.com/vi/h0e2HAPTGF4/0.jpg)](https://youtu.be/h0e2HAPTGF4 "Introduction to ML")
> 🎥 Click the image above for a video: MIT's John Guttag introduces machine learning.
### Getting started with machine learning
Before starting this course, you will need a computer set up and ready to run (Jupyter) notebooks locally.
- **Configure your machine with these videos**. Learn how to set up your machine in this [set of videos](https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6).
- **Learn Python**. It's also recommended to have a basic understanding of [Python](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), a programming language useful for data scientists that we use throughout this course.
- **Learn Node.js and JavaScript**. We also use JavaScript a few times in this course to build web apps, so you will need [node](https://nodejs.org) and [npm](https://www.npmjs.com/) installed, as well as [Visual Studio Code](https://code.visualstudio.com/) for both Python and JavaScript development.
- **Create a GitHub account**. Since you found us on [GitHub](https://github.com), you probably already have an account, but if not, create one and then fork this curriculum to use on your own. (Don't forget to give us a star, too 😊)
- **Explore Scikit-learn**. Familiarize yourself with [Scikit-learn](https://scikit-learn.org/stable/user_guide.html), a set of ML libraries that we reference in these lessons; a short sketch of what using it looks like follows this list.
### What is machine learning?
The term 'machine learning' is one of the most popular and frequently used terms of today. There is a nontrivial possibility that you have heard it at least once if you have some sort of familiarity with technology, no matter what domain you work in. The mechanics of machine learning, however, remain a mystery to most people. For a machine learning beginner, the subject can sometimes feel overwhelming, so it is important to understand what machine learning actually is and to learn about it step by step, through practical examples.
![ml hype curve](../images/hype.png)
> Google Trends shows the recent 'hype curve' of the term 'machine learning'
We live in a universe full of fascinating mysteries. Great scientists such as Stephen Hawking, Albert Einstein, and many more have devoted their lives to searching for meaningful information that uncovers the mysteries of the world around us. This is the human condition of learning: a child learns new things and uncovers the structure of their world year by year as they grow to adulthood.
A child's brain and senses perceive their surroundings and gradually learn the hidden patterns of life, which help the child craft logical rules to identify learned patterns. This learning process of the human brain is what makes humans the most sophisticated living creatures of this world. Learning continuously by discovering hidden patterns and then innovating on those patterns enables us to keep improving throughout our lifetime. This capacity to learn and evolve is related to the concept of [brain plasticity](https://www.simplypsychology.org/brain-plasticity.html); superficially, we can draw some motivational similarities between the learning process of the human brain and the concepts of machine learning.
The [human brain](https://www.livescience.com/29365-human-brain.html) perceives things from the real world, processes the perceived information, makes rational decisions, and performs certain actions based on circumstances. This is what we call behaving intelligently. When we program a facsimile of this intelligent behavioral process into a machine, it is called artificial intelligence (AI).
Although the terms can be confusing, machine learning (ML) is an important subset of artificial intelligence. **ML refers to the use of specialized algorithms to uncover meaningful information and find hidden patterns in perceived data to support rational decision-making processes**.
![AI, ML, deep learning, data science](../images/ai-ml-ds.png)
> A diagram showing the relationships between AI, ML, deep learning, and data science. Infographic by [Jen Looper](https://twitter.com/jenlooper) inspired by [this graphic](https://softwareengineering.stackexchange.com/questions/366996/distinction-between-ai-ml-neural-networks-deep-learning-and-data-mining)
## What you will learn in this course
In this course, we will focus on the core concepts of machine learning that a beginner must know. We cover what we call 'classical machine learning', primarily using Scikit-learn, an excellent library many students use to learn the basics. To understand the broader concepts of artificial intelligence or deep learning, a strong fundamental knowledge of machine learning is indispensable, and that is what we would like to offer here.
In this course you will learn:
- core concepts of machine learning
- the history of ML
- ML and fairness
- regression ML techniques
- classification ML techniques
- clustering ML techniques
- natural language processing (NLP) ML techniques
- time series forecasting ML techniques
- reinforcement learning
- real-world applications of ML
## What we will not cover
- deep learning
- neural networks
- AI
To provide the best learning experience, we will avoid the complexities of neural networks, 'deep learning' (building many-layered models using neural networks), and AI, which we will cover in a different curriculum. We will also offer a forthcoming data science curriculum to focus on that aspect of this larger field.
## Why study machine learning?
Machine learning, from a systems perspective, is defined as the creation of automated systems that can learn hidden patterns from data to aid in making intelligent decisions.
This motivation is loosely inspired by how the human brain learns certain things based on the data it perceives from the outside world.
✅ Think for a minute about why a business would want to use machine learning strategies rather than creating hard-coded rules; the sketch below illustrates the contrast.
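As one way to frame that question, here is a hypothetical sketch contrasting a hand-written rule with a rule learned from data; the 'number of links' feature, the threshold, and the toy labels are all invented for illustration.

```python
# A hypothetical sketch: hard-coded rule vs. rule learned from examples.
from sklearn.linear_model import LogisticRegression

# Hard-coded rule: a person picks the threshold, and must revisit it
# by hand whenever the data or the business changes.
def is_spam_rule(num_links: int) -> bool:
    return num_links > 5

# Learned rule: the decision boundary is estimated from labeled
# examples and can simply be re-fit when new data arrives.
X = [[0], [1], [2], [8], [9], [12]]  # number of links in each email
y = [0, 0, 0, 1, 1, 1]               # 0 = not spam, 1 = spam
model = LogisticRegression().fit(X, y)
print(model.predict([[7]]))          # decision learned from the data
```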
### Applications of machine learning
Applications of machine learning are now almost everywhere, and are as ubiquitous as the data flowing around our societies, generated by our smartphones, connected devices, and other systems. Considering the immense potential of state-of-the-art machine learning algorithms, researchers have been exploring their capability to solve multidimensional and multidisciplinary real-life problems with great positive outcomes.
**You can use machine learning in many ways**:
- To predict the likelihood of disease from a patient's medical data.
- To leverage weather data to predict weather events.
- To understand the sentiment of a text.
- To detect fake news and stop the spread of propaganda.
Finance, economics, earth science, space exploration, biomedical engineering, cognitive science, and even the humanities have adapted machine learning to solve the arduous, data-processing-heavy problems of their domains.
Machine learning automates the process of pattern discovery by finding meaningful insights from real-world or generated data. It has proven highly valuable in business, health, and financial applications, among others.
In the near future, understanding the basics of machine learning is going to be a must for people from any domain due to its widespread adoption.
---
## 🚀 Challenge
Sketch, on paper or using an online app like [Excalidraw](https://excalidraw.com/), your understanding of the differences between AI, ML, deep learning, and data science. Add some ideas of problems that each of these techniques is good at solving.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/2?loc=fr)
## Review & self study
To learn more about how you can work with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-15963-cxa).
## Assignment
[Get up and running](assignment.fr.md)

@@ -0,0 +1,107 @@
# Introduction to machine learning
[![ML, AI, deep learning - What's the difference?](https://img.youtube.com/vi/lTd9RSxS9ZE/0.jpg)](https://youtu.be/lTd9RSxS9ZE "ML, AI, deep learning - What's the difference?")
> 🎥 Click the image above for a video discussing the difference between machine learning, AI, and deep learning.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1/)
### Introduction
Welcome to this course on classical machine learning for beginners! Whether you're completely new to this topic or an experienced ML practitioner looking to brush up on an area, we're happy to have you join us! We want to create a friendly launching spot for your ML study and would be happy to evaluate, respond to, and incorporate your [feedback](https://github.com/microsoft/ML-For-Beginners/discussions).
[![Introduction to machine learning](https://img.youtube.com/vi/h0e2HAPTGF4/0.jpg)](https://youtu.be/h0e2HAPTGF4 "Introduction to machine learning")
> 🎥 Click the image above for a video: MIT's John Guttag introduces machine learning.
### Getting started with machine learning
Before starting this curriculum, you need to make sure your computer is set up to run notebooks locally.
- **Configure your machine with these videos**. Learn how to set up your machine in this [set of videos](https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6).
- **Learn Python**. It's also recommended to have a basic understanding of [Python](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), a programming language used by data scientists that we also use in this course.
- **Learn Node.js and JavaScript**. We also use JavaScript a few times in this course when building web apps, so you will need [node](https://nodejs.org) and [npm](https://www.npmjs.com/) installed, as well as [Visual Studio Code](https://code.visualstudio.com/), which is available for both Python and JavaScript development.
- **Create a GitHub account**. Since you found us on [GitHub](https://github.com), you might already have an account, but if not, create one and then fork this curriculum to use on your own. (Feel free to give us a star, too 😊)
- **Explore Scikit-learn**. Familiarize yourself with [Scikit-learn](https://scikit-learn.org/stable/user_guide.html), a set of ML libraries that we reference in these lessons; a brief classification sketch follows this list.
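As a hint of what working with Scikit-learn looks like, here is a minimal classification sketch; the dataset and model choice are illustrative only and are not prescribed by the lessons.

```python
# A minimal sketch: classifying the classic iris dataset with k-nearest neighbors.
from sklearn.datasets import load_iris
from sklearn.model_selection import train_test_split
from sklearn.neighbors import KNeighborsClassifier

X, y = load_iris(return_X_y=True)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = KNeighborsClassifier(n_neighbors=3)
model.fit(X_train, y_train)                       # learn from labeled examples
print(f"test accuracy: {model.score(X_test, y_test):.2f}")
```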
### What is machine learning?
The term 'machine learning' is one of the most popular and frequently used terms of today. There is a nontrivial possibility that you have heard this term at least once if you have some sort of familiarity with technology. The mechanics of machine learning, however, remain a mystery to most people. For a machine learning beginner, the subject can sometimes feel overwhelming, so it is important to understand what machine learning actually is and to learn about it step by step, through practical examples.
![ml hype curve](../images/hype.png)
> Google Trends shows the recent 'hype curve' of the term 'machine learning'.
We live in a universe full of fascinating mysteries. Great scientists such as Stephen Hawking, Albert Einstein, and many more have devoted their lives to searching for meaningful information that uncovers the mysteries of the world around us. This is the human condition of learning: a human child learns new things and uncovers the structure of their world year by year as they grow to adulthood.
A child's brain and senses perceive the facts of their surroundings and gradually learn the hidden patterns of life, which help the child craft logical rules to identify learned patterns. The learning process of the human brain makes humans the most sophisticated living creatures of this world. Learning continuously by discovering hidden patterns and then innovating on those patterns enables us to keep making ourselves better throughout our lifetime. This learning capacity and evolving capability is related to a concept called *[brain plasticity](https://www.simplypsychology.org/brain-plasticity.html)*. Superficially, we can draw some motivational similarities between the learning process of the human brain and the concepts of machine learning.
The [human brain](https://www.livescience.com/29365-human-brain.html) perceives things from the real world, processes the perceived information, makes rational decisions, and performs certain actions based on circumstances. This is what we call behaving intelligently. When we program a facsimile of this intelligent behavioral process into a machine, it is called artificial intelligence (AI).
Although the terms can be confusing, machine learning (ML) is an important subset of artificial intelligence. **ML is concerned with using specialized algorithms to uncover meaningful information and find hidden patterns from perceived data to support rational decision-making processes**.
![AI, ML, deep learning, data science](../images/ai-ml-ds.png)
> A diagram showing the relationships between AI, ML, deep learning, and data science. Infographic by [Jen Looper](https://twitter.com/jenlooper) inspired by [this graphic](https://softwareengineering.stackexchange.com/questions/366996/distinction-between-ai-ml-neural-networks-deep-learning-and-data-mining)
## What you will learn
In this curriculum, we will cover only the core concepts of machine learning that a beginner must know. We cover what we call 'classical machine learning', primarily using Scikit-learn, an excellent library many students use to learn the basics. To understand broader concepts of artificial intelligence or deep learning, a strong fundamental knowledge of machine learning is indispensable, and that is what we would like to offer here.
You will learn:
- core concepts of ML
- the history of ML
- fairness and ML
- regression ML techniques
- classification ML techniques
- clustering ML techniques
- natural language processing ML techniques
- time series forecasting ML techniques
- reinforcement learning
- real-world applications of ML
## What we will not cover
- deep learning
- neural networks
- AI
To make for a better learning experience, we will avoid the complexities of neural networks, 'deep learning' (many-layered model-building using neural networks), and AI, which we will discuss in a different curriculum. We will also offer a data science curriculum focusing on that aspect of this larger field.
## Why study machine learning?
Machine learning, from a systems perspective, is defined as the creation of automated systems that can learn hidden patterns from data to aid in making intelligent decisions.
This motivation is loosely inspired by how the human brain learns certain things based on the data it perceives from the outside world.
✅ Think for a minute about why a business would want to try to use machine learning strategies rather than creating a hard-coded rules-based engine.
### Applications of machine learning
Applications of machine learning are now almost everywhere, as ubiquitous as the data that flows around us, generated by our smartphones, connected devices, and other systems. Considering the immense potential of state-of-the-art machine learning algorithms, researchers have been exploring their capability to solve multidimensional and multidisciplinary real-life problems with great positive outcomes.
**You can use machine learning in many ways**:
- To predict the likelihood of disease from a patient's medical history or reports.
- To leverage weather data to predict weather events.
- To understand the sentiment of a text.
- To detect fake news to stop the spread of propaganda.
Finance, economics, earth science, space exploration, biomedical engineering, cognitive science, and even the humanities have adapted machine learning to solve the arduous data-processing problems of their domains.
Machine learning automates the process of pattern discovery by finding meaningful insights from real-world or generated data. It has proven itself highly valuable in business, health, and financial applications, among others.
In the near future, understanding the basics of machine learning is going to be a must for people from any domain due to its widespread adoption.
---
## 🚀 Challenge
Sketch, on paper or using an app like [Excalidraw](https://excalidraw.com/), your understanding of the differences between AI, ML, deep learning, and data science. Add some ideas of problems that each technique is good at solving.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/2/)
## Review & self study
To learn more about how you can work with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-15963-cxa).
## Assignment
[Get up and running](assignment.id.md)

@@ -4,7 +4,7 @@
> 🎥 Click the image above for a video discussing the difference between machine learning, AI, and deep learning.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1/)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/1?loc=ja)
### Introduction
@@ -94,12 +94,12 @@
## 🚀 Challenge
Sketch, on paper or using an online app like [Excalidraw](https://excalidraw.com/), your understanding of the differences between AI, ML, deep learning, and data science. Add some ideas of problems that each technique is good at solving.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/2/)
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/2?loc=ja)
## Review & self study
To learn more about how you can work with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-15963-cxa)..
To learn more about how you can work with ML algorithms in the cloud, follow this [Learning Path](https://docs.microsoft.com/learn/paths/create-no-code-predictive-models-azure-machine-learning/?WT.mc_id=academic-15963-cxa).
## Assignment
[Get up and running](assignment.md)
[Get up and running](assignment.ja.md)

@@ -104,4 +104,4 @@
## Assignment
[Get up and running](../assignment.md)
[Get up and running](assignment.zh-cn.md)

@@ -0,0 +1,9 @@
# Get up and running
## Instructions
In this ungraded assignment, you should brush up on Python and get your environment up and running and able to run notebooks.
Take this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), and then get your systems set up by going through these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6

@@ -0,0 +1,10 @@
# Get up and running
## Instructions
In this ungraded assignment, you should brush up on Python and get your environment up and running and able to run notebooks.
Take this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), and then set up your system by going through these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6

@@ -0,0 +1,9 @@
# Get up and running
## Instructions
In this ungraded assignment, you will brush up on Python and get your environment up and running and able to run notebooks.
Take this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), and then set up your system by watching these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6

@@ -0,0 +1,9 @@
# Get up and running
## Instructions
In this ungraded assignment, you should brush up on Python and get your environment up and running and able to run notebooks.
Take this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), and then set up your system by following these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6

@@ -0,0 +1,9 @@
# Get up and running
## Instructions
In this ungraded assignment, you should brush up on Python, get your Python environment up and running, and able to run notebooks.
Take this [Python Learning Path](https://docs.microsoft.com/learn/paths/python-language/?WT.mc_id=academic-15963-cxa), and then set up your system by going through these introductory videos:
https://www.youtube.com/playlist?list=PLlrxD0HtieHhS8VzuMCfQD4uJ9yne1mE6

@@ -0,0 +1,117 @@
# History of machine learning
![Summary of the history of machine learning in a sketchnote](../../../sketchnotes/ml-history.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/3?loc=fr)
In this lesson, we will walk through the major milestones in the history of machine learning and artificial intelligence.
The history of artificial intelligence, AI, as a field is intertwined with the history of machine learning, as the algorithms and computational advances that underpin ML fed into the development of AI. While these fields as distinct areas of inquiry began to crystallize in the 1950s, it is important to remember that [algorithmic, statistical, mathematical, computational and technical discoveries](https://wikipedia.org/wiki/Timeline_of_machine_learning) predated and overlapped this era. In fact, people have been thinking about these questions for [hundreds of years](https://fr.wikipedia.org/wiki/Histoire_de_l%27intelligence_artificielle): this article discusses the historical intellectual underpinnings of the idea of a 'thinking machine'.
## Notable discoveries
- 1763, 1812 [Bayes Theorem](https://wikipedia.org/wiki/Bayes%27_theorem) and its predecessors. This theorem and its applications underlie inference, describing the probability of an event occurring based on prior knowledge (the theorem is written out just after this list).
- 1805 [Least Squares Theory](https://wikipedia.org/wiki/Least_squares) by French mathematician Adrien-Marie Legendre. This theory, which you will learn about in our Regression unit, helps in data fitting.
- 1913 [Markov Chains](https://wikipedia.org/wiki/Markov_chain), named after Russian mathematician Andrey Markov, are used to describe a sequence of possible events based on a previous state.
- 1957 [Perceptron](https://wikipedia.org/wiki/Perceptron) is a type of linear classifier invented by American psychologist Frank Rosenblatt that underlies advances in deep learning.
- 1967 [Nearest Neighbor](https://wikipedia.org/wiki/Nearest_neighbor) is an algorithm originally designed to map routes. In an ML context it is used to detect patterns.
- 1970 [Backpropagation](https://wikipedia.org/wiki/Backpropagation) is used to train [feedforward neural networks](https://fr.wikipedia.org/wiki/R%C3%A9seau_de_neurones_%C3%A0_propagation_avant).
- 1982 [Recurrent Neural Networks](https://wikipedia.org/wiki/Recurrent_neural_network) are artificial neural networks derived from feedforward neural networks that create temporal graphs.
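For reference, the first entry above is the standard statement of Bayes' theorem; this is well-known mathematics, not something specific to this curriculum:

```latex
P(A \mid B) = \frac{P(B \mid A)\, P(A)}{P(B)}
```

Here \(P(A \mid B)\) is the probability of event \(A\) given that \(B\) has occurred, with \(P(A)\) playing the role of the 'prior knowledge' the bullet refers to.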
✅ Do a little research. What other dates stand out as pivotal in the history of ML and AI?
## 1950: Machines that think
Alan Turing, a truly remarkable person who was voted [by the public in 2019](https://wikipedia.org/wiki/Icons:_The_Greatest_Person_of_the_20th_Century) as the greatest scientist of the 20th century, is credited with helping to lay the foundation for the concept of a 'machine that can think'. He grappled with naysayers and his own need for empirical evidence of his theory in part by creating the [Turing Test](https://www.bbc.com/news/technology-18475646), which you will explore in our NLP lessons.
## 1956: Dartmouth Summer Research Project
"The Dartmouth Summer Research Project on artificial intelligence was a seminal event for artificial intelligence as a field," and it was here that the term 'artificial intelligence' was coined ([source](https://250.dartmouth.edu/highlights/artificial-intelligence-ai-coined-dartmouth)).
> Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.
The lead researcher, mathematics professor John McCarthy, hoped "to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." The participants included another luminary in the field, Marvin Minsky.
The workshop is credited with having initiated and encouraged several discussions including "the rise of symbolic methods, systems focused on limited domains (early expert systems), and deductive systems versus inductive systems." ([source](https://fr.wikipedia.org/wiki/Conf%C3%A9rence_de_Dartmouth)).
## 1956 - 1974: "The golden years"
From the 1950s through the mid-'70s, optimism ran high in the hope that AI could solve many problems. In 1967, Marvin Minsky stated confidently that "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved." (Minsky, Marvin (1967), Computation: Finite and Infinite Machines, Englewood Cliffs, N.J.: Prentice-Hall)
Natural language processing research flourished, search was refined and made more powerful, and the concept of 'micro-worlds' was created, where simple tasks were completed using plain language instructions.
Research was well funded by government agencies, advances were made in computation and algorithms, and prototypes of intelligent machines were built. Some of these machines include:
* [Shakey the robot](https://fr.wikipedia.org/wiki/Shakey_le_robot), who could maneuver and decide how to perform tasks 'intelligently'.
![Shakey, an intelligent robot](../images/shakey.jpg)
> Shakey in 1972
* Eliza, an early 'chatterbot', could converse with people and act as a primitive 'therapist'. You'll learn more about Eliza in the NLP lessons.
![Eliza, a bot](../images/eliza.png)
> A version of Eliza, a chatbot
* 'Blocks world' was an example of a micro-world where blocks could be stacked and sorted, and experiments in teaching machines to make decisions could be tested. Advances built with libraries such as [SHRDLU](https://fr.wikipedia.org/wiki/SHRDLU) helped propel natural language processing forward.
[![blocks world with SHRDLU](https://img.youtube.com/vi/QAJz4YKUwqw/0.jpg)](https://www.youtube.com/watch?v=QAJz4YKUwqw "blocks world with SHRDLU")
> 🎥 Click the image above for a video: Blocks world with SHRDLU
## 1974 - 1980: "AI Winter"
By the mid-1970s, it had become apparent that the complexity of making 'intelligent machines' had been understated and that its promise, given the available compute power, had been overblown. Funding dried up and confidence in the field slowed. Some issues that impacted confidence included:
- **Limitations**. Compute power was too limited.
- **Combinatorial explosion**. The number of parameters needing to be trained grew exponentially as more was asked of computers, without a parallel evolution of compute power and capability.
- **Paucity of data**. There was a paucity of data that hindered the process of testing, developing, and refining algorithms.
- **Were we asking the right questions?**. The very questions that were being asked began to be questioned. Researchers began to field criticism about their approaches:
  - Turing tests came into question by means, among other ideas, of the 'Chinese room theory' which posited that "programming a digital computer may make it appear to understand language but could not produce real understanding." ([source](https://plato.stanford.edu/entries/chinese-room/))
  - The ethics of introducing artificial intelligences such as the 'therapist' ELIZA into society was challenged.
At the same time, various schools of AI thought began to form. A dichotomy was established between ["scruffy" and "neat"](https://wikipedia.org/wiki/Neats_and_scruffies) AI practices. _Scruffy_ labs tweaked their programs for hours until they got the desired results. _Neat_ labs "focused on logic and formal problem solving". ELIZA and SHRDLU were well-known _scruffy_ systems. In the 1980s, as demand emerged to make ML systems reproducible, the _neat_ approach gradually took the forefront as its results are more explainable.
## 1980s: Expert systems
As the field grew, its benefit to business became clearer, particularly via 'expert systems' in the 1980s. "Expert systems were among the first truly successful forms of artificial intelligence (AI) software." ([source](https://fr.wikipedia.org/wiki/Syst%C3%A8me_expert)).
This type of system is actually _hybrid_, consisting partially of a rules engine defining business requirements, and an inference engine that leverages the rules system to deduce new facts.
This era also saw increasing attention paid to neural networks.
## 1987 - 1993: AI 'Chill'
The proliferation of specialized expert systems hardware had the unfortunate effect of becoming too specialized. The rise of personal computers also competed with these large, specialized, centralized systems. The democratization of computing had begun, and it eventually paved the way for the modern explosion of big data.
## 1993 - 2011
This epoch saw a new era for ML and AI to be able to solve some of the problems that had earlier gone unsolved for lack of data and compute power. The amount of data began to rapidly increase and become more widely available, for better and for worse, especially with the advent of the smartphone around 2007. Compute power expanded exponentially, and algorithms evolved alongside. The field began to gain maturity as ingenuity began to crystallize into a true discipline.
## Now
Today, machine learning and AI touch almost every part of our lives. This era calls for careful understanding of the risks and potential effects of these algorithms on human lives. As Microsoft's Brad Smith has stated, "Information technology raises issues that go to the heart of fundamental human-rights protections like privacy and freedom of expression. These issues heighten responsibility for tech companies that create these products. In our view, they also call for thoughtful government regulation and for the development of norms around acceptable uses" ([source](https://www.technologyreview.com/2019/12/18/102365/the-future-of-ais-impact-on-society/)).
It remains to be seen what the future holds, but it is important to understand these computer systems and the software and algorithms that they run. We hope that this curriculum will help you to gain a better understanding so that you can decide for yourself.
[![The history of deep learning](https://img.youtube.com/vi/mTtDfKgLm54/0.jpg)](https://www.youtube.com/watch?v=mTtDfKgLm54 "The history of deep learning")
> 🎥 Click the image above for a video: Yann LeCun discusses the history of deep learning in this lecture
---
## 🚀Challenge
Dig into one of these historical moments and learn more about the people behind them. There are fascinating characters, and no scientific discovery was ever created in a cultural vacuum. What do you discover?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/4?loc=fr)
## Review & self study
Here are items to watch and listen to:
[This podcast where Amy Boyd discusses the evolution of AI](http://runasradio.com/Shows/Show/739)
[![The history of AI by Amy Boyd](https://img.youtube.com/vi/EJt3_bFYKss/0.jpg)](https://www.youtube.com/watch?v=EJt3_bFYKss "The history of AI by Amy Boyd")
## Assignment
[Create a timeline](assignment.fr.md)

@@ -0,0 +1,116 @@
# History of machine learning
![Summary of the history of machine learning in a sketchnote](../../../sketchnotes/ml-history.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/3/)
In this lesson, we will walk through the major milestones in the history of machine learning and artificial intelligence.
The history of artificial intelligence, AI, as a field is intertwined with the history of machine learning, as the algorithms and computational advances that underpin ML fed into the development of AI. It is important to remember that, while these fields as distinct areas of inquiry began to crystallize in the 1950s, important [algorithmic, statistical, mathematical, computational and technical discoveries](https://wikipedia.org/wiki/Timeline_of_machine_learning) predated and overlapped this era. In fact, people have been thinking about these questions for [hundreds of years](https://wikipedia.org/wiki/History_of_artificial_intelligence): this article discusses the historical intellectual underpinnings of the idea of a 'thinking machine'.
## Notable discoveries
- 1763, 1812 [Bayes Theorem](https://wikipedia.org/wiki/Bayes%27_theorem) and its predecessors. This theorem and its applications underlie inference, describing the probability of an event occurring based on prior knowledge.
- 1805 [Least Square Theory](https://wikipedia.org/wiki/Least_squares) by French mathematician Adrien-Marie Legendre. This theory, which you will learn about in our Regression unit, helps in data fitting (the objective it minimizes is written out just after this list).
- 1913 [Markov Chains](https://wikipedia.org/wiki/Markov_chain), named after Russian mathematician Andrey Markov, are used to describe a sequence of possible events based on a previous state.
- 1957 [Perceptron](https://wikipedia.org/wiki/Perceptron) is a type of linear classifier invented by American psychologist Frank Rosenblatt that underlies advances in deep learning.
- 1967 [Nearest Neighbor](https://wikipedia.org/wiki/Nearest_neighbor) is an algorithm originally designed to map routes. In an ML context it is used to detect patterns.
- 1970 [Backpropagation](https://wikipedia.org/wiki/Backpropagation) is used to train [feedforward neural networks](https://wikipedia.org/wiki/Feedforward_neural_network).
- 1982 [Recurrent Neural Networks](https://wikipedia.org/wiki/Recurrent_neural_network) are artificial neural networks derived from feedforward neural networks that create temporal graphs.
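For reference, the least squares entry above minimizes the squared residuals of a linear fit; this is standard mathematics, not something specific to this curriculum:

```latex
\hat{\beta} = \arg\min_{\beta} \lVert y - X\beta \rVert_2^2
            = (X^\top X)^{-1} X^\top y \qquad \text{(when } X^\top X \text{ is invertible)}
```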
✅ Do a little research. What other dates stand out as pivotal in the history of ML and AI?
## 1950: Machines that think
Alan Turing, a truly remarkable person who was voted [by the public in 2019](https://wikipedia.org/wiki/Icons:_The_Greatest_Person_of_the_20th_Century) as the greatest scientist of the 20th century, is credited with helping to lay the foundation for the concept of a 'machine that can think'. He grappled with naysayers and his own need for empirical evidence of this concept in part by creating the [Turing Test](https://www.bbc.com/news/technology-18475646), which you will explore in our NLP lessons.
## 1956: Dartmouth Summer Research Project
"The Dartmouth Summer Research Project on artificial intelligence was a seminal event for artificial intelligence as a field," and it was here that the term 'artificial intelligence' was coined ([source](https://250.dartmouth.edu/highlights/artificial-intelligence-ai-coined-dartmouth)).
> Every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it.
The lead researcher, mathematics professor John McCarthy, hoped "to proceed on the basis of the conjecture that every aspect of learning or any other feature of intelligence can in principle be so precisely described that a machine can be made to simulate it." The participants included another luminary in the field, Marvin Minsky.
The workshop is credited with having initiated and encouraged several discussions including "the rise of symbolic methods, systems focused on limited domains (early expert systems), and deductive systems versus inductive systems." ([source](https://wikipedia.org/wiki/Dartmouth_workshop)).
## 1956 - 1974: "The golden years"
From the 1950s through the mid-'70s, optimism ran high in the hope that AI could solve many problems. In 1967, Marvin Minsky stated confidently that "Within a generation ... the problem of creating 'artificial intelligence' will substantially be solved." (Minsky, Marvin (1967), Computation: Finite and Infinite Machines, Englewood Cliffs, N.J.: Prentice-Hall)
Natural language processing research flourished, search was refined and made more powerful, and the concept of 'micro-worlds' was created, where simple tasks were completed using plain language instructions.
Research was well funded by government agencies, advances were made in computation and algorithms, and prototypes of intelligent machines were built. Some of these machines include:
* [Shakey the robot](https://wikipedia.org/wiki/Shakey_the_robot), who could maneuver and decide how to perform tasks 'intelligently'.
![Shakey, an intelligent robot](../images/shakey.jpg)
> Shakey in 1972
* Eliza, an early 'chatterbot', could converse with people and act as a primitive 'therapist'. You'll learn more about Eliza in the NLP lessons.
![Eliza, a bot](../images/eliza.png)
> A version of Eliza, a chatbot
* "Blocks world" was an example of a micro-world where blocks could be stacked and sorted, and experiments in teaching machines to make decisions could be tested. Advances built with libraries such as [SHRDLU](https://wikipedia.org/wiki/SHRDLU) helped propel language processing forward.
[![blocks world with SHRDLU](https://img.youtube.com/vi/QAJz4YKUwqw/0.jpg)](https://www.youtube.com/watch?v=QAJz4YKUwqw "blocks world with SHRDLU")
> 🎥 Click the image above for a video: Blocks world with SHRDLU
## 1974 - 1980: "AI Winter"
By the mid-1970s, it had become apparent that the complexity of making 'intelligent machines' had been understated and that its promise, given the available compute power, had been overblown. Funding dried up and confidence in the field slowed. Some issues that impacted confidence included:
- **Limitations**. Compute power was too limited.
- **Combinatorial explosion**. The number of parameters needing to be trained grew exponentially as more was asked of computers, without a parallel evolution of compute power and capability.
- **Paucity of data**. There was a paucity of data that hindered the process of testing, developing, and refining algorithms.
- **Were we asking the right questions?**. The very questions that were being asked began to be questioned. Researchers began to field criticism about their approaches:
  - Turing tests came into question by means, among other ideas, of the 'Chinese room theory' which posited that "programming a digital computer may make it appear to understand language but could not produce real understanding." ([source](https://plato.stanford.edu/entries/chinese-room/))
  - The ethics of introducing artificial intelligences such as the "therapist" ELIZA into society was challenged.
At the same time, various schools of AI thought began to form. A dichotomy was established between ["scruffy" vs. "neat AI"](https://wikipedia.org/wiki/Neats_and_scruffies) practices. _Scruffy_ labs tweaked programs for hours until they got the desired results. _Neat_ labs "focused on logic and formal problem solving". ELIZA and SHRDLU were well-known _scruffy_ systems. In the 1980s, as demand emerged to make ML systems reproducible, the _neat_ approach gradually took the forefront as its results are more explainable.
## 1980s: Expert systems
As the field grew, its benefit to business became clearer, and so did the proliferation of 'expert systems' in the 1980s. "Expert systems were among the first truly successful forms of artificial intelligence (AI) software." ([source](https://wikipedia.org/wiki/Expert_system)).
This type of system is actually _hybrid_, consisting partially of a rules engine defining business requirements, and an inference engine that leverages the rules system to deduce new facts.
This era also saw increasing attention paid to neural networks.
## 1987 - 1993: AI 'Chill'
The proliferation of specialized expert systems hardware had the unfortunate effect of becoming too specialized. The rise of personal computers also competed with these large, specialized, centralized systems. The democratization of computing had begun, and it eventually paved the way for the modern explosion of big data.
## 1993 - 2011
This epoch saw a new era for ML and AI to be able to solve some of the problems that had earlier been caused by the lack of data and compute power. The amount of data began to rapidly increase and become more widely available, for better and for worse, especially with the advent of the smartphone around 2007. Compute power expanded exponentially, and algorithms evolved alongside. The field began to gain maturity as the freewheeling days of the past began to crystallize into a true discipline.
## Now
Today, machine learning and AI touch almost every part of our lives. This era calls for careful understanding of the risks and potential effects of these algorithms on human lives. As Microsoft's Brad Smith has stated, "Information technology raises issues that go to the heart of fundamental human-rights protections like privacy and freedom of expression. These issues heighten responsibility for tech companies that create these products. In our view, they also call for thoughtful government regulation and for the development of norms around acceptable uses" ([source](https://www.technologyreview.com/2019/12/18/102365/the-future-of-ais-impact-on-society/)).
It remains to be seen what the future holds, but it is important to understand these computer systems and the software and algorithms that they run. We hope that this curriculum will help you to gain a better understanding so that you can decide for yourself.
[![The history of deep learning](https://img.youtube.com/vi/mTtDfKgLm54/0.jpg)](https://www.youtube.com/watch?v=mTtDfKgLm54 "The history of deep learning")
> 🎥 Click the image above for a video: Yann LeCun discusses the history of deep learning in this lecture
---
## 🚀Challenge
Dig into one of these historical moments and learn more about the people behind them. There are fascinating characters, and no scientific discovery was ever created in a cultural vacuum. What do you discover?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/4/)
## Review & self study
Here are items to watch and listen to:
[This podcast where Amy Boyd discusses the evolution of AI](http://runasradio.com/Shows/Show/739)
[![The history of AI by Amy Boyd](https://img.youtube.com/vi/EJt3_bFYKss/0.jpg)](https://www.youtube.com/watch?v=EJt3_bFYKss "The history of AI by Amy Boyd")
## Assignment
[Create a timeline](assignment.id.md)

@@ -3,7 +3,7 @@
![Summary of the history of machine learning in a sketchnote](../../../sketchnotes/ml-history.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/3/)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/3?loc=ja)
In this lesson, we will walk through the major milestones in the history of machine learning and artificial intelligence.
@@ -99,7 +99,7 @@
Dig into one of these historical moments and learn more about the people behind them. There are fascinating characters, and no scientific discovery was ever created in a cultural vacuum. What will you discover?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/4/)
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/4?loc=ja)
## Review & self study
@@ -111,4 +111,4 @@
## Assignment
[Create a time series](../assignment.md)
[Create a timeline](./assignment.ja.md)

@@ -113,4 +113,4 @@ Alan Turing, a truly remarkable person who was [voted by the public in 2019](https
## Assignment
[Create a timeline](../assignment.md)
[Create a timeline](assignment.zh-cn.md)

@@ -0,0 +1,11 @@
# Create a timeline
## Instructions
Using [this repo](https://github.com/Digital-Humanities-Toolkit/timeline-builder), create a timeline of some aspect of the history of algorithms, mathematics, statistics, AI, or ML, or a combination of these. You can focus on one person, one idea, or a long timespan of thought. Make sure to add multimedia elements.
## Rubric
| Criteria | Exemplary                                         | Adequate                                | Needs Improvement                                                  |
| -------- | ------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------ |
|          | A deployed timeline is presented as a GitHub page | The code is incomplete and not deployed | The timeline is incomplete, not well researched, and not deployed |

@@ -0,0 +1,11 @@
# Create a timeline
## Instructions
Using [this repo](https://github.com/Digital-Humanities-Toolkit/timeline-builder), create a timeline of some aspect of the history of algorithms, mathematics, statistics, AI, or ML, or a combination of these. You can focus on one person, one idea, or a long timespan of thought. Make sure to add multimedia elements.
## Rubric
| Criteria | Exemplary                                         | Adequate                                | Needs Improvement                                                  |
| -------- | ------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------ |
|          | A deployed timeline is presented as a GitHub page | The code is incomplete and not deployed | The timeline is incomplete, not well researched, and not deployed |

@@ -0,0 +1,11 @@
# Create a timeline
## Instructions
Using [this repository](https://github.com/Digital-Humanities-Toolkit/timeline-builder), create a timeline of some aspect of the history of algorithms, mathematics, statistics, AI, or machine learning, or a combination of these. You can focus on one person, one idea, or a long timespan of thought. Make sure to add multimedia elements.
## Rubric
| Criteria | Exemplary                                         | Adequate                                | Needs Improvement                                                  |
| -------- | ------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------ |
|          | A deployed timeline is presented as a GitHub page | The code is incomplete and not deployed | The timeline is incomplete, not well researched, and not deployed |

@@ -0,0 +1,11 @@
# Create a timeline
## Instructions
Using this [repository](https://github.com/Digital-Humanities-Toolkit/timeline-builder), create a timeline of some aspect of the history of algorithms, mathematics, statistics, artificial intelligence, or machine learning, or a combination of several of these fields. You can focus on one person, one idea, or an enduring school of thought. Make sure to add multimedia elements to your timeline.
## Rubric
| Criteria | Exemplary                                         | Adequate                                | Needs Improvement                                                  |
| -------- | ------------------------------------------------- | --------------------------------------- | ------------------------------------------------------------------ |
|          | A deployed timeline is presented as a GitHub page | The code is incomplete and not deployed | The timeline is incomplete, not well researched, and not deployed |

@@ -0,0 +1,213 @@
# Fairness in machine learning
![Summary of fairness in machine learning in a sketchnote](../../../sketchnotes/ml-fairness.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/5/)
## Introduction
In this curriculum, you will start to discover how machine learning can and does impact our everyday lives. Even now, systems and models are involved in daily decision-making tasks, such as health care diagnoses or detecting fraud. So it is important that these models work well in order to provide fair outcomes for everyone.
Imagine what can happen when the data you are using to build these models lacks certain demographics, such as race, gender, political view, or religion, or disproportionately represents such demographics. What about when the model's output is interpreted to favor some demographic? What are the consequences for the application?
In this lesson, you will:
- Raise your awareness of the importance of fairness in machine learning.
- Learn about fairness-related harms.
- Learn about unfairness assessment and mitigation.
## Prerequisite
As a prerequisite, please take the "Responsible AI Principles" learning path and watch the video below on the topic:
Learn more about Responsible AI by following this [Learning Path](https://docs.microsoft.com/learn/modules/responsible-ai-principles/?WT.mc_id=academic-15963-cxa)
[![Microsoft's Approach to Responsible AI](https://img.youtube.com/vi/dnC8-uUZXSc/0.jpg)](https://youtu.be/dnC8-uUZXSc "Microsoft's Approach to Responsible AI")
> 🎥 Click the image above for a video: Microsoft's Approach to Responsible AI
## Unfairness in data and algorithms
> "If you torture the data long enough, it will confess to anything" - Ronald Coase
This statement sounds extreme, but it is true that data can be manipulated to support any conclusion. Such manipulation can sometimes happen unintentionally. As humans, we all have bias, and it is often difficult to consciously know when you are introducing bias into data.
Guaranteeing fairness in AI and machine learning remains a complex sociotechnical challenge. This means that it cannot be addressed from either a purely social or a purely technical perspective.
### Fairness-related harms
What do you mean by unfairness? "Unfairness" encompasses negative impacts, or "harms", for a group of people, such as those defined in terms of race, gender, age, or disability status.
The main fairness-related harms can be classified as:
- **Allocation**, if a gender or an ethnicity, for example, is favored over another.
- **Quality of service**. If you train the data for one specific scenario but reality is much more complex, the result is a poorly performing service.
- **Stereotyping**. Associating a given group with pre-assigned attributes.
- **Denigration**. Unfairly criticizing and labeling something or someone.
- **Over- or under-representation**. The idea is that a certain group is not seen in a certain profession, and any service or function that keeps promoting that is contributing to harm.
Let's take a look at the examples.
### Allocation
Imagine a system for screening loan applications. The system tends to pick white men as better candidates over other groups. As a result, loans are withheld from certain applicants.
Another example would be an experimental hiring tool developed by a large corporation to screen candidates. The tool systematically discriminated against one gender by using models that were trained to prefer words associated with the other. It resulted in penalizing candidates whose resumes contained words such as "women's rugby team".
✅ Do a little research to find a real-world example of something like this.
### Quality of service
Researchers found that several commercial gender classifiers had higher error rates around images of women with darker skin tones as opposed to images of men with lighter skin tones. [Reference](https://www.media.mit.edu/publications/gender-shades-intersectional-accuracy-disparities-in-commercial-gender-classification/)
Another infamous example is a hand soap dispenser that did not seem to be able to sense people with dark skin. [Reference](https://gizmodo.com/why-cant-this-soap-dispenser-identify-dark-skin-1797931773)
### Stereotyping
Stereotypical gender views were found in machine translation. When translating "he is a nurse and she is a doctor" into Turkish, problems were encountered. Turkish is a genderless language which has one pronoun, "o", to convey a singular third person, but translating the sentence back from Turkish to English yields the stereotypical and incorrect "she is a nurse and he is a doctor".
![translation to Turkish](../images/gender-bias-translate-en-tr.png)
![translation back to English](../images/gender-bias-translate-tr-en.png)
### Denigration
An infamous image labeling technology mislabeled images of dark-skinned people as gorillas. Mislabeling is harmful not just because the system made a mistake, but because it specifically applied a label that has a long history of being purposefully used to denigrate Black people.
[![AI: Ain't I a Woman?](https://img.youtube.com/vi/QxuyfWoVV98/0.jpg)](https://www.youtube.com/watch?v=QxuyfWoVV98 "AI, Ain't I a Woman?")
> 🎥 Click the image above for a video: AI, Ain't I a Woman - a performance showing the harm caused by racist denigration by AI
### Over- or under-representation
Skewed image search results can be a good example of this harm. When searching for images of professions with an equal or higher percentage of men than women, such as engineering or CEO, watch for results that are more heavily skewed toward a given gender.
![Bing CEO search](../images/ceos.png)
> This search on Bing for 'CEO' produces pretty inclusive results
These five main types of harms are not mutually exclusive, and a single system can exhibit more than one type of harm. In addition, each case varies in its severity. For instance, unfairly labeling someone as a criminal is a much more severe harm than mislabeling an image. It's important, however, to remember that even relatively non-severe harms can make people feel alienated or singled out, and the cumulative impact can be extremely oppressive.
**Discussion**: Revisit some of the examples and see whether they show different harms.
|                         | Allocation | Quality of service | Stereotyping | Denigration | Over- or under-representation |
| ----------------------- | :--------: | :----------------: | :----------: | :---------: | :---------------------------: |
| Automated hiring system |     x      |         x          |      x       |             |               x               |
| Machine translation     |            |                    |              |             |                               |
| Photo labeling          |            |                    |              |             |                               |
## Detecting unfairness
There are many reasons why a given system behaves unfairly. Social biases, for example, might be reflected in the datasets used to train it. For example, hiring unfairness might have been exacerbated by over-reliance on historical data. By using the patterns in resumes submitted to a company over a 10-year period, a model determined that men were more qualified because the majority of resumes came from men, a reflection of past male dominance across the tech industry.
Inadequate data about a certain group of people can be a reason for unfairness. For example, image classifiers have a higher rate of error for images of dark-skinned people because darker skin tones were underrepresented in the data.
Wrong assumptions made during development cause unfairness too. For example, a facial analysis system intended to predict who is going to commit a crime based on images of people's faces can lead to damaging assumptions. This could lead to substantial harms for people who are misclassified. One simple starting point for detecting such disparities is to compare a model's error rate across groups, as in the sketch below.
## Pahami model kamu dan bangun dalam keadilan
Meskipun banyak aspek keadilan tidak tercakup dalam metrik keadilan kuantitatif, dan tidak mungkin menghilangkan bias sepenuhnya dari sistem untuk menjamin keadilan, Kamu tetap bertanggung jawab untuk mendeteksi dan mengurangi masalah keadilan sebanyak mungkin.
Saat Kamu bekerja dengan model pembelajaran mesin, penting untuk memahami model Kamu dengan cara memastikan interpretasinya dan dengan menilai serta mengurangi ketidakadilan.
Mari kita gunakan contoh pemilihan pinjaman untuk mengisolasi kasus untuk mengetahui tingkat dampak setiap faktor pada prediksi.
## Metode Penilaian
1. **Identifikasi bahaya (dan manfaat)**. Langkah pertama adalah mengidentifikasi bahaya dan manfaat. Pikirkan tentang bagaimana tindakan dan keputusan dapat memengaruhi calon pelanggan dan bisnis itu sendiri.
1. **Identifikasi kelompok yang terkena dampak**. Setelah Kamu memahami jenis kerugian atau manfaat apa yang dapat terjadi, identifikasi kelompok-kelompok yang mungkin terpengaruh. Apakah kelompok-kelompok ini ditentukan oleh jenis kelamin, etnis, atau kelompok sosial?
1. **Tentukan metrik keadilan**. Terakhir, tentukan metrik sehingga Kamu memiliki sesuatu untuk diukur dalam pekerjaan Kamu untuk memperbaiki situasi.
### Identifikasi bahaya (dan manfaat)
Apa bahaya dan manfaat yang terkait dengan pinjaman? Pikirkan tentang skenario negatif palsu dan positif palsu:
**False negatives** (ditolak, tapi Y=1) - dalam hal ini, pemohon yang akan mampu membayar kembali pinjaman ditolak. Ini adalah peristiwa yang merugikan karena sumber pinjaman ditahan dari pemohon yang memenuhi syarat.
**False positives** (diterima, tapi Y=0) - dalam hal ini, pemohon memang mendapatkan pinjaman tetapi akhirnya wanprestasi. Akibatnya, kasus pemohon akan dikirim ke agen penagihan utang yang dapat mempengaruhi permohonan pinjaman mereka di masa depan.
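As a minimal sketch tying these definitions to code (assuming scikit-learn is installed; the `y_true` and `y_pred` arrays are hypothetical stand-ins for real loan data):
```python
# With Y=1 meaning "would repay", a confusion matrix separates the false
# negatives and false positives described above. Arrays are hypothetical.
from sklearn.metrics import confusion_matrix

y_true = [1, 1, 0, 0, 1, 0]   # actual ability to repay (Y)
y_pred = [0, 1, 1, 0, 1, 0]   # the model's loan decisions
tn, fp, fn, tp = confusion_matrix(y_true, y_pred).ravel()
print(f"false negatives (rejected, but Y=1): {fn}")
print(f"false positives (accepted, but Y=0): {fp}")
```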
### Identify the affected groups
The next step is to determine which groups are likely to be affected. For example, in the case of a credit card application, a model might determine that women should receive much lower credit limits compared with their spouses who share household assets. An entire demographic, defined by gender, is thereby affected.
### Define fairness metrics
You have identified harms and an affected group, in this case delineated by gender. Now, use quantified factors to disaggregate their metrics. For example, using the data below, you can see that women have the largest false positive rate and men have the smallest, and that the opposite is true for false negatives.
✅ In a future lesson on Clustering, you will see how to build this 'confusion matrix' in code
| | False positive rate | False negative rate | count |
| ---------- | ------------------- | ------------------- | ----- |
| Women | 0.37 | 0.27 | 54032 |
| Men | 0.31 | 0.35 | 28620 |
| Non-binary | 0.33 | 0.31 | 1266 |
This table tells us several things. First, we note that there are comparatively few non-binary people in the data. The data is skewed, so you need to be careful interpreting these numbers.
In this case, we have 3 groups and 2 metrics. When we are thinking about how our system affects the group of customers with their loan applications, this may be sufficient, but when you want to define a larger number of groups, you may want to distill this down to smaller sets of summaries. To do that, you can add more metrics, such as the largest difference or smallest ratio of each false negative and false positive.
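As a hedged sketch of how a disaggregated table like the one above could be computed (assuming the Fairlearn package is installed; the toy arrays are hypothetical stand-ins):
```python
# MetricFrame groups a metric by a sensitive feature, producing one row per
# group, similar to the table above.
import numpy as np
from fairlearn.metrics import MetricFrame, false_positive_rate, false_negative_rate

y_true = np.array([1, 0, 1, 1, 0, 0, 1, 0])   # actual repayment outcomes
y_pred = np.array([1, 1, 0, 1, 0, 1, 1, 0])   # the model's loan decisions
sex = np.array(["female", "male", "female", "non-binary",
                "male", "female", "male", "non-binary"])

mf = MetricFrame(
    metrics={"false positive rate": false_positive_rate,
             "false negative rate": false_negative_rate},
    y_true=y_true, y_pred=y_pred,
    sensitive_features=sex,
)
print(mf.by_group)      # one row per group, as in the table above
print(mf.difference())  # the largest difference across groups, per metric
```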
✅ Stop and Think: What other groups might be affected when applying for a loan?
## Mitigating unfairness
To mitigate unfairness, explore the model to generate various mitigated models and compare the tradeoffs they make between accuracy and fairness to select the most fair model.
This introductory lesson does not dive deeply into the details of algorithmic unfairness mitigation, such as post-processing and reductions approaches, but here is a tool that you might want to try.
### Fairlearn
[Fairlearn](https://fairlearn.github.io/) is an open-source Python package that allows you to assess your systems' fairness and mitigate unfairness.
The tool helps you to assess how a model's predictions affect different groups, enabling you to compare multiple models by using fairness and performance metrics, and supplying a set of algorithms to mitigate unfairness in binary classification and regression.
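As a taste of the mitigation side, here is a hedged sketch of Fairlearn's reductions approach, which retrains a base estimator under a fairness constraint (the dataset is synthetic and the variable names are assumptions, not a definitive recipe):
```python
# ExponentiatedGradient repeatedly reweights and refits the base classifier so
# that its predictions satisfy the chosen fairness constraint.
import numpy as np
from sklearn.linear_model import LogisticRegression
from fairlearn.reductions import ExponentiatedGradient, DemographicParity

rng = np.random.default_rng(0)
X = rng.normal(size=(200, 3))                        # toy feature matrix
y = (X[:, 0] + rng.normal(size=200) > 0).astype(int)
sex = rng.choice(["female", "male"], size=200)

mitigator = ExponentiatedGradient(
    LogisticRegression(),
    constraints=DemographicParity(),                 # equalize selection rates
)
mitigator.fit(X, y, sensitive_features=sex)
y_mitigated = mitigator.predict(X)                   # fairness-constrained predictions
```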
- Learn how to use the different components by checking out the Fairlearn [GitHub](https://github.com/fairlearn/fairlearn/)
- Explore the [user guide](https://fairlearn.github.io/main/user_guide/index.html), [examples](https://fairlearn.github.io/main/auto_examples/index.html)
- Try some [sample notebooks](https://github.com/fairlearn/fairlearn/tree/master/notebooks).
- Learn [how to enable fairness assessments](https://docs.microsoft.com/azure/machine-learning/how-to-machine-learning-fairness-aml?WT.mc_id=academic-15963-cxa) of machine learning models in Azure Machine Learning.
- Check out these [sample notebooks](https://github.com/Azure/MachineLearningNotebooks/tree/master/contrib/fairness) for more fairness assessment scenarios in Azure Machine Learning.
---
## 🚀 Challenge
To prevent biases from being introduced in the first place, we should:
- have a diversity of backgrounds and perspectives among the people working on systems
- invest in datasets that reflect the diversity of our society
- develop better methods for detecting and correcting bias when it occurs
Think about real-life scenarios where unfairness is evident in model-building and usage. What else should we consider?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/6/)
## Review & Self Study
In this lesson, you have learned some basics of the concepts of fairness and unfairness in machine learning.
Watch this workshop to dive deeper into the topics:
- YouTube: Fairness-related harms in AI systems: Examples, assessment, and mitigation by Hanna Wallach and Miro Dudik [Fairness-related harms in AI systems: Examples, assessment, and mitigation - YouTube](https://www.youtube.com/watch?v=1RptHwfkx_k)
You can also read:
- Microsoft's RAI resource center: [Responsible AI Resources Microsoft AI](https://www.microsoft.com/ai/responsible-ai-resources?activetab=pivot1%3aprimaryr4)
- Microsoft's FATE research group: [FATE: Fairness, Accountability, Transparency, and Ethics in AI - Microsoft Research](https://www.microsoft.com/research/theme/fate/)
Explore the Fairlearn toolkit
[Fairlearn](https://fairlearn.org/)
Read about Azure Machine Learning's tools to ensure fairness
- [Azure Machine Learning](https://docs.microsoft.com/azure/machine-learning/concept-fairness-ml?WT.mc_id=academic-15963-cxa)
## Assignment
[Explore Fairlearn](assignment.id.md)

@ -3,7 +3,7 @@
![Sketchnote summarizing fairness in machine learning](../../../sketchnotes/ml-fairness.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/5/)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/5?loc=ja)
## Introduction
@ -178,7 +178,7 @@ Ensuring fairness in AI and machine learning remains a complex socio
Think about real-life scenarios where unfairness becomes evident in building or using models. What else should we consider?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/6/)
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/6?loc=ja)
## Review & Self Study
In this lesson, you have learned some basics of the concepts of fairness and unfairness in machine learning.
@ -201,4 +201,4 @@ Read about Azure Machine Learning's tools for ensuring fairness
## Assignment
[Explore Fairlearn](../assignment.md)
[Explore Fairlearn](./assignment.ja.md)

@ -89,11 +89,11 @@
**Discussion**: Revisit some of the examples and see whether they show different harms.
| | Allocation | Quality of service | Stereotyping | Denigration | Over- or under-representation |
| ----------------------- | :--------: | :----------------: | :----------: | :---------: | :----------------------------: |
| Automated hiring system | x | x | x | | x |
| Machine translation | | | | | |
| Photo labeling | | | | | |
| | Allocation | Quality of service | Stereotyping | Denigration | Over- or under-representation |
| ----------------------- | :---: | :------: | :------: | :---: | :--------------: |
| Automated hiring system | x | x | x | | x |
| Machine translation | | | | | |
| Photo labeling | | | | | |
## Detecting unfairness
@ -138,14 +138,14 @@
✅ In a future lesson on Clustering, you will see how to build this 'confusion matrix' in code
| | False positive rate | False negative rate | count |
| ----------------- | ------------------- | ------------------- | ----- |
| Women | 0.37 | 0.27 | 54032 |
| Men | 0.31 | 0.35 | 28620 |
| Gender not listed | 0.33 | 0.31 | 1266 |
| | False positive rate | False negative rate | count |
| ----------------- | ------------------- | ------------------- | ----- |
| Women | 0.37 | 0.27 | 54032 |
| Men | 0.31 | 0.35 | 28620 |
| Gender not listed | 0.33 | 0.31 | 1266 |
This desk tells us several things. First, we note that there are relatively few people of unlisted gender in the data. The data is skewed, so you need to be careful interpreting these numbers.
This table tells us several things. First, we note that there are relatively few people of unlisted gender in the data. The data is skewed, so you need to be careful interpreting these numbers.
In this case, we have 3 groups and 2 metrics. When we think about how our system affects the customer groups applying for loans, this may be sufficient, but when you want to define more groups, you may want to distill this down to smaller summary sets. To do that, you can add more metrics, such as the largest difference or smallest ratio of each false negative and false positive.
@ -211,4 +211,4 @@
## Assignment
[Explore Fairlearn](../assignment.md)
[Explore Fairlearn](assignment.zh-cn.md)

@ -0,0 +1,11 @@
# Explore Fairlearn
## Instructions
In this lesson, you learned about Fairlearn, a "community-driven open-source project to help data scientists improve the fairness of AI systems." For this assignment, explore one of Fairlearn's [notebooks](https://fairlearn.org/v0.6.2/auto_examples/index.html) and report your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is presented discussing Fairlearn's systems, the notebook that was run, and the conclusions drawn from running it | A paper is presented without conclusions | No paper is presented |

@ -0,0 +1,11 @@
# Explore Fairlearn
## Instructions
In this lesson, you learned about Fairlearn, a "community-driven open-source project to help data scientists improve the fairness of AI systems." For this assignment, explore one of Fairlearn's [notebooks](https://fairlearn.org/v0.6.2/auto_examples/index.html) and report your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is presented discussing Fairlearn's systems, the notebook that was run, and the conclusions drawn from running it | A paper is presented without conclusions | No paper is presented |

@ -0,0 +1,11 @@
# Explore Fairlearn
## Instructions
In this lesson, you learned about Fairlearn, a "community-driven open-source project to help data scientists improve the fairness of AI systems." For this assignment, explore one of Fairlearn's [notebooks](https://fairlearn.org/v0.6.2/auto_examples/index.html) and report your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is presented discussing Fairlearn's systems, the notebook that was run, and the conclusions drawn from running it | A paper is presented without conclusions | No paper is presented |

@ -0,0 +1,11 @@
# Explore Fairlearn
## Instructions
In this lesson, you learned about Fairlearn, a "community-driven open-source project to help data scientists improve the fairness of AI systems." For this assignment, explore one of Fairlearn's [notebooks](https://fairlearn.org/v0.6.2/auto_examples/index.html) and report your findings in a paper or presentation.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A paper or PowerPoint presentation is presented discussing Fairlearn's systems, the notebook that was run, and the conclusions drawn from running it | A paper is presented without conclusions | No paper is presented |

@ -6,6 +6,7 @@ The process of building, using, and maintaining machine learning models and the
- Explore base concepts such as 'models', 'predictions', and 'training data'.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7/)
## Introduction
On a high level, the craft of creating machine learning (ML) processes is comprised of a number of steps:
@ -39,14 +40,20 @@ To be able to answer your question with any kind of certainty, you need a good a
✅ After collecting and processing your data, take a moment to see if its shape will allow you to address your intended question. It may be that the data will not perform well in your given task, as we discover in our [Clustering](../../5-Clustering/1-Visualize/README.md) lessons!
### Selecting your feature variable
### Features and Target
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your feature variables, usually represented as `X` in code, are the input variables that will be used to train the model.
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date' 'size' or 'color'. Your feature variable, usually represented as `y` in code, represents the answer to the question you are trying to ask of your data: in December, what **color** pumpkins will be cheapest? in San Francisco, what neighborhoods will have the best real estate **price**?
A target is the thing you are trying to predict. The target, usually represented as `y` in code, represents the answer to the question you are trying to ask of your data: in December, what **color** pumpkins will be cheapest? In San Francisco, what neighborhoods will have the best real estate **price**? Sometimes the target is also referred to as the label attribute.
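As an illustrative sketch of separating features `X` from a target `y` (the DataFrame and its column names are assumptions, not course data):
```python
# Features are the model's inputs; the target is the value it should predict.
import pandas as pd

df = pd.DataFrame({
    "month": [9, 10, 11, 12],
    "color": ["orange", "orange", "white", "white"],
    "price": [14.0, 12.5, 16.0, 18.5],
})
X = pd.get_dummies(df[["month", "color"]])  # features: inputs (strings one-hot encoded)
y = df["price"]                             # target: the answer we want to predict
```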
### Selecting your feature variable
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
### Visualize your data
An important aspect of the data scientist's toolkit is the power to visualize data using several excellent libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../4-Classification/2-Classifiers-1/README.md)).
An important aspect of the data scientist's toolkit is the power to visualize data using several excellent libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../4-Classification/2-Classifiers-1/README.md)).
### Split your dataset
Prior to training, you need to split your dataset into two or more parts of unequal size that still represent the data well.
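A minimal sketch of such a split with scikit-learn (the arrays are synthetic stand-ins):
```python
# train_test_split shuffles the rows and carves off a held-out test portion.
import numpy as np
from sklearn.model_selection import train_test_split

X = np.arange(20).reshape(10, 2)   # 10 samples, 2 features
y = np.arange(10)                  # 10 target values
X_train, X_test, y_train, y_test = train_test_split(
    X, y, test_size=0.2, random_state=0  # hold out 20% of the rows for testing
)
```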
@ -61,10 +68,12 @@ Using your training data, your goal is to build a model, or a statistical repres
### Decide on a training method
Depending on your question and the nature of your data, your will choose a method to train it. Stepping through [Scikit-learn's documentation](https://scikit-learn.org/stable/user_guide.html) - which we use in this course - you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process whereby data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through [Scikit-learn's documentation](https://scikit-learn.org/stable/user_guide.html) - which we use in this course - you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process whereby data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
### Train a model
Armed with your training data, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you will find the code 'model.fit' - it is at this time that you send in your data as an array of values (usually 'X') and a feature variable (usually 'y').
Armed with your training data, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you will find the code 'model.fit' - it is at this time that you send in your feature variable as an array of values (usually 'X') and a target variable (usually 'y').
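A sketch of that 'fit' moment with a scikit-learn model (the data values are illustrative):
```python
# model.fit receives the feature array X and the target array y.
import numpy as np
from sklearn.linear_model import LinearRegression

X = np.array([[1.0], [2.0], [3.0], [4.0]])  # feature variable
y = np.array([2.1, 4.0, 6.2, 7.9])          # target variable
model = LinearRegression()
model.fit(X, y)                             # the 'model.fit' step described above
```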
### Evaluate the model
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
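A hedged sketch of this evaluation step with scikit-learn (synthetic data, illustrative metric choices):
```python
# Evaluate on held-out test data and print a small table of quality metrics.
import numpy as np
from sklearn.linear_model import LinearRegression
from sklearn.metrics import mean_squared_error, r2_score
from sklearn.model_selection import train_test_split

X = np.linspace(0, 10, 50).reshape(-1, 1)
y = 3 * X.ravel() + np.random.default_rng(0).normal(size=50)
X_train, X_test, y_train, y_test = train_test_split(X, y, random_state=0)

model = LinearRegression().fit(X_train, y_train)
y_pred = model.predict(X_test)                     # predictions on unseen data
print("MSE:", mean_squared_error(y_test, y_pred))
print("R^2:", r2_score(y_test, y_pred))
```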

@ -0,0 +1,107 @@
# Techniques of Machine Learning
The process of building, using, and maintaining machine learning models and the data they use is a very different process from many other development workflows. In this lesson, we will demystify the process and outline the main techniques you need to know. You will:
- Understand the processes that underpin machine learning at a high level.
- Explore base concepts such as 'models', 'predictions', and 'training data'.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7/)
## Introduction
On a high level, the craft of creating machine learning (ML) processes is comprised of a number of steps:
1. **Decide on the question**. Most ML processes start by asking a question that cannot be answered by a simple conditional program or a rules-based engine. These questions often revolve around predictions based on a collection of data.
2. **Collect and prepare data**. To be able to answer your question, you need data. The quality and, sometimes, quantity of your data will determine how well you can answer your initial question. Visualizing data is an important aspect of this phase. This phase also includes splitting the data into a training and testing group to build a model.
3. **Choose a training method**. Depending on your question and the nature of your data, you need to choose how you want to train a model to best reflect your data and make accurate predictions against it. This is the part of your ML process that requires specific expertise and, often, a considerable amount of experimentation.
4. **Train the model**. Using your training data, you'll use various algorithms to train a model to recognize patterns in the data. The model might leverage internal weights that can be adjusted to privilege certain parts of the data over others to build a better model.
5. **Evaluate the model**. You use never-before-seen data (your testing data) from your collected set to see how the model is performing.
6. **Parameter tuning**. Based on the performance of your model, you can redo the process using different parameters, or variables, that control the behavior of the algorithms used to train the model.
7. **Predict**. Use new inputs to test the accuracy of your model.
## What question to ask
Computers are particularly skilled at discovering hidden patterns in data. This utility is very helpful for researchers who have questions about a given domain that cannot be easily answered by creating a conditionally-based rules engine. Given an actuarial task, for example, a data scientist might be able to construct handcrafted rules around the mortality of smokers vs non-smokers.
When many other variables are brought into the equation, however, an ML model might prove more efficient to predict future mortality rates based on past health history. A more cheerful example might be making weather predictions for the month of April in a given location based on data that includes latitude, longitude, climate change, proximity to the ocean, patterns of the jet stream, and more.
✅ This [slide deck](https://www2.cisl.ucar.edu/sites/default/files/0900%20June%2024%20Haupt_0.pdf) on weather models offers a historical perspective for using ML in weather analysis.
## Pre-building tasks
Before starting to build your model, there are several tasks you need to complete. To test your question and form a hypothesis based on a model's predictions, you need to identify and configure several elements.
### Data
To be able to answer your question with any kind of certainty, you need a good amount of data of the right type.
There are two things you need to do at this point:
- **Collect data**. Keeping in mind the previous lesson on fairness in data analysis, collect your data with care. Be aware of the sources of this data, any inherent biases it might have, and document its origin.
- **Prepare data**. There are several steps in the data preparation process. You might need to collate data and normalize it if it comes from diverse sources. You can improve the data's quality and quantity through various methods such as converting strings to numbers (as we do in [Clustering](../../5-Clustering/1-Visualize/README.md)). You might also generate new data, based on the original (as we do in [Classification](../../4-Classification/1-Introduction/README.md)). You can clean and edit the data (as we will do prior to the [Web App](../../3-Web-App/README.md) lesson). Finally, you might also need to randomize and shuffle it, depending on your training techniques.
✅ After collecting and processing your data, take a moment to see if its shape will allow you to address your intended question. It may be that the data will not perform well in your given task, as we discover in our [Clustering](../../5-Clustering/1-Visualize/README.md) lessons!
### Selecting your feature variable
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your feature variables, usually represented as `X` in code, are the inputs used to train the model, while the target, usually `y`, represents the answer to the question you are trying to ask of your data: in December, what **color** pumpkins will be cheapest? In San Francisco, which neighborhoods will have the best real estate **price**?
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
### Visualize your data
An important aspect of the data scientist's toolkit is the power to visualize data using several excellent libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../4-Classification/2-Classifiers-1/README.md)).
### Split your dataset
Prior to training, you need to split your dataset into two or more parts of unequal size that still represent the data well.
- **Training**. This part of the dataset is fit to your model to train it. This set constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent group of data, often gathered from the original data, that you use to confirm the performance of the built model.
- **Validating**. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. Depending on your data's size and the question you are asking, you might not need to build this third set (as we note in [Time Series Forecasting](../../7-TimeSeries/1-Introduction/README.md)).
## Building a model
Using your training data, your goal is to build a model, or a statistical representation of your data, using various algorithms to **train** it. Training a model exposes it to data and allows it to make assumptions about perceived patterns it discovers, validates, and accepts or rejects.
### Decide on a training method
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through [Scikit-learn's documentation](https://scikit-learn.org/stable/user_guide.html) - which we use in this course - you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process whereby data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
### Train a model
Armed with your training data, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you will find the code 'model.fit' - it is at this time that you send in your data as an array of values (usually 'X') and a target variable (usually 'y').
### Evaluate the model
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
🎓 **Model fitting**
In the context of machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar.
🎓 **Underfitting** and **overfitting** are common problems that degrade the quality of the model, as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
![overfitting model](images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
## Parameter tuning
Once your initial training is complete, observe the quality of the model and consider improving it by tweaking its 'hyperparameters'. Read more about the process [in the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?WT.mc_id=academic-15963-cxa).
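A hedged sketch of one common way to tune hyperparameters with scikit-learn (the model, grid values, and data are illustrative assumptions):
```python
# GridSearchCV tries every hyperparameter combination under cross-validation
# and reports the one that scored best.
import numpy as np
from sklearn.model_selection import GridSearchCV
from sklearn.svm import SVC

X = np.random.default_rng(0).normal(size=(60, 4))
y = (X[:, 0] > 0).astype(int)

search = GridSearchCV(
    SVC(),
    param_grid={"C": [0.1, 1, 10], "kernel": ["linear", "rbf"]},
    cv=3,                        # 3-fold cross-validation over the training data
)
search.fit(X, y)
print(search.best_params_)       # the hyperparameter combination that scored best
```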
## Prediction
This is the moment where you can use completely new data to test your model's accuracy. In an 'applied' ML setting, where you are building web assets to use the model in production, this process might involve gathering user input (a button press, for example) to set a variable and send it to the model for inference, or evaluation.
In these lessons, you will discover how to use these steps to prepare, build, test, evaluate, and predict - all the gestures of a data scientist and more, as you progress in your journey to become a 'full stack' ML engineer.
---
## 🚀Challenge
Draw a flow chart reflecting the steps of an ML practitioner. Where do you see yourself right now in the process? Where do you predict you will find difficulty? What seems easy to you?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/8/)
## Review & Self Study
Search online for interviews with data scientists who discuss their daily work. Here is [one](https://www.youtube.com/watch?v=Z3IjgbbCEfs).
## Assignment
[Interview a data scientist](assignment.md)

@ -0,0 +1,105 @@
# Techniques of Machine Learning
The process of building, using, and maintaining machine learning models and the data they use is a very different process from many other development workflows. In this lesson, we will demystify the process and outline the main techniques you need to know. You will:
- Understand, at a high level, the processes underpinning machine learning.
- Explore base concepts such as 'models', 'predictions', and 'training data'.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7/)
## Introduction
On a high level, the craft of creating machine learning (ML) processes is comprised of a number of steps:
1. **Decide on the question**. Most ML processes start by asking a question that cannot be answered by a simple conditional program or a rules-based engine. These questions often revolve around predictions based on a collection of data.
2. **Collect and prepare data**. To be able to answer your question, you need data. The quality and, sometimes, quantity of your data will determine how well you can answer your initial question. Visualizing data is an important aspect of this phase. This phase also includes splitting the data into a training and testing group to build a model.
3. **Choose a training method**. Depending on your question and the nature of your data, you need to choose how you want to train a model to best reflect your data and make accurate predictions against it. This is the part of your ML process that requires specific expertise and, often, a considerable amount of experimentation.
4. **Train the model**. Using your training data, you'll use various algorithms to train a model to recognize patterns in the data. The model might leverage internal weights that can be adjusted to privilege certain parts of the data over others to build a better model.
5. **Evaluate the model**. Use never-before-seen data (your testing data) to see how the model is performing.
6. **Parameter tuning**. Based on the performance of your model, you can redo the process using different parameters, or variables, that control the behavior of the algorithms used to train the model.
7. **Predict**. Use new inputs to test the accuracy of your model.
## What question to ask
Computers are particularly skilled at discovering hidden patterns in data. This is very helpful for researchers who have questions about a given domain that cannot be easily answered by creating a conditionally-based rules engine. Given an actuarial task, for example, a data scientist might be able to construct handcrafted rules around the mortality of smokers vs non-smokers.
When many other variables are brought into the equation, however, an ML model might prove more efficient to predict future mortality rates based on past health history. A more cheerful example might be making weather predictions for the month of April in a given location based on data that includes latitude, longitude, climate change, proximity to the ocean, patterns of the jet stream, and more.
✅ This [slide deck](https://www2.cisl.ucar.edu/sites/default/files/0900%20June%2024%20Haupt_0.pdf) on weather models offers a historical perspective for using ML in weather analysis.
## Pre-building tasks
Before starting to build your model, there are several tasks you need to complete. To test your question and form a hypothesis based on a model's predictions, you need to identify and configure several elements.
### Data
To be able to answer your question with any kind of certainty, you need a good amount of data of the right type. There are two things you need to do at this point:
- **Collect data**. Keeping in mind the previous lesson on fairness in data analysis, collect your data with care. Be aware of the sources of this data, any inherent biases it might have, and document its origin.
- **Prepare data**. There are several steps in the data preparation process. You might need to collate data and normalize it if it comes from diverse sources. You can improve the data's quality and quantity through various methods such as converting strings to numbers, as we do in [Clustering](../../5-Clustering/1-Visualize/translations/README.id.md) (a short sketch of this follows the note below). You might also generate new data based on the original (as we do in [Classification](../../4-Classification/1-Introduction/translations/README.id.md)). You can clean and edit the data (as we do prior to the [Web App](../3-Web-App/translations/README.id.md) lesson). Finally, you might also need to randomize and shuffle it, depending on your training techniques.
✅ After collecting and processing your data, take a moment to see if its shape will allow you to address your intended question. It may be that the data will not perform well in your given task, as we discover in our [Clustering](../../5-Clustering/1-Visualize/translations/README.id.md) lessons.
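A minimal sketch of the string-to-number conversion mentioned above (the column name and values are hypothetical):
```python
# Encode a string column as integer category codes.
import pandas as pd

df = pd.DataFrame({"color": ["orange", "white", "orange"]})
df["color_code"] = df["color"].astype("category").cat.codes  # orange -> 0, white -> 1
print(df)
```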
### Selecting your feature variable
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your feature variables, usually represented as `X` in code, are the inputs used to train the model, while the target, usually `y`, represents the answer to the question you are trying to ask of your data: in December, what **color** pumpkins will be cheapest? In San Francisco, which neighborhoods will have the best real estate **price**?
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
### Visualize your data
An important aspect of the data scientist's toolkit is the power to visualize data using several libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../4-Classification/2-Classifiers-1/translations/README.id.md)).
### Split your dataset
Prior to training, you need to split your dataset into two or more parts of unequal size that still represent the data well.
- **Training**. This part of the dataset is fit to your model to train it. This set constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent group of data, often gathered from the original data, that is used to confirm the performance of the built model.
- **Validating**. A validation set is a smaller independent group of examples that you use to tune the model's hyperparameters, or architecture, to improve the model. Depending on your data's size and the question you are asking, you might not need to build this third set (as we note in [Time Series Forecasting](../7-TimeSeries/1-Introduction/translations/README.id.md)).
## Building a model
Using your training data, your goal is to build a model, or a statistical representation of your data, using various algorithms to **train** it. Training a model exposes it to data and allows it to make assumptions about the patterns it discovers, validates, and accepts or rejects.
### Decide on a training method
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through the [Scikit-learn](https://scikit-learn.org/stable/user_guide.html) documentation - which we use in this course - you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through a process whereby data scientists evaluate the performance of a model by feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
### Train a model
Armed with your training data, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you will find the code 'model.fit' - it is at this time that you send in your data as an array of values (usually 'X') and a target variable (usually 'y').
### Evaluate the model
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
🎓 **Model fitting**
In the context of machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar.
🎓 **Underfitting** and **overfitting** are common problems that degrade the quality of the model, as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
![overfitting model](../images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
## Parameter tuning
Once your initial training is complete, observe the quality of the model and consider improving it by tweaking its 'hyperparameters'. Read more about the process [in the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?WT.mc_id=academic-15963-cxa).
## Prediction
This is the moment where you can use completely new data to test your model's accuracy. In an 'applied' ML setting, where you are building web assets to use the model in production, this process might involve gathering user input (a button press, for example) to set a variable and send it to the model for inference, or evaluation.
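An illustrative sketch of that prediction step: sending a brand-new input to a trained model for inference (the model and values are assumptions):
```python
# A new, unseen input - e.g., a value collected from a web form - is passed
# to the trained model's predict method.
import numpy as np
from sklearn.linear_model import LinearRegression

X_train = np.array([[1.0], [2.0], [3.0]])
y_train = np.array([10.0, 20.0, 30.0])
model = LinearRegression().fit(X_train, y_train)

new_input = np.array([[4.0]])    # e.g., a value set from user input
print(model.predict(new_input))  # the model's inference for the new input
```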
In these lessons, you will discover how to use these steps to prepare, build, test, evaluate, and predict - all the gestures of a data scientist and more, as you progress in your journey to become a 'full stack' ML engineer.
---
## 🚀Challenge
Draw a flow chart reflecting the steps of an ML practitioner. Where do you see yourself right now in the process? Where do you predict you will find difficulty? What seems easy to you?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/8/)
## Review & Self Study
Search online for interviews with data scientists who discuss their daily work. Here is [one](https://www.youtube.com/watch?v=Z3IjgbbCEfs).
## Assignment
[Interview a data scientist](assignment.id.md)

@ -0,0 +1,110 @@
# Techniques of Machine Learning
The process of building, using, and maintaining machine learning models and the data they use is a very different process from many other development workflows. In this lesson, we will demystify the process and outline the main techniques you need to know. You will:
- Understand, at a high level, the processes underpinning machine learning.
- Explore base concepts such as 'models', 'predictions', and 'training data'.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/7?loc=ja)
## Introduction
On a high level, the craft of creating machine learning (ML) processes is comprised of a number of steps:
1. **Decide on the question**. Most ML processes start by asking a question that cannot be answered by a simple conditional program or a rules-based engine. These questions often revolve around predictions based on a collection of data.
2. **Collect and prepare data**. To be able to answer your question, you need data. The quality and, sometimes, quantity of your data will determine how well you can answer your initial question. Visualizing data is an important aspect of this phase. This phase also includes splitting the data into a training and testing group to build a model.
3. **Choose a training method**. Depending on your question and the nature of your data, you need to choose how you want to train a model to best reflect your data and make accurate predictions against it. This is the part of the ML process that requires specific expertise and, often, a considerable number of attempts.
4. **Train the model**. Using your training data and various algorithms, you train a model to recognize patterns in the data. The model might leverage internal weights that can be adjusted to privilege certain parts of the data over others to build a better model.
5. **Evaluate the model**. You use never-before-seen data (your testing data) from your collected set to see how the model is performing.
6. **Parameter tuning**. Based on the performance of your model, you can redo the process using different parameters, or variables, that control the behavior of the algorithms used to train the model.
7. **Predict**. Use new inputs to test the accuracy of your model.
## What question to ask
Computers are particularly skilled at discovering hidden patterns in data. This is very helpful for researchers who have questions about a given domain that cannot be easily answered by creating a condition-based rules engine. Given an actuarial task, for example, a data scientist might be able to construct handcrafted rules around the mortality of smokers vs non-smokers.
When many other variables are brought into the equation, however, an ML model might prove more efficient at predicting future mortality rates based on past health history. A more cheerful example might be making weather predictions for the month of April in a given location based on data that includes latitude, longitude, climate change, proximity to the ocean, patterns of the jet stream, and more.
✅ This [slide deck](https://www2.cisl.ucar.edu/sites/default/files/0900%20June%2024%20Haupt_0.pdf) on weather models offers a historical perspective for using ML in weather analysis.
## Pre-building tasks
Before starting to build your model, there are several tasks you need to complete. To test your question and form a hypothesis based on a model's predictions, you need to identify and configure several elements.
### Data
To be able to answer your question with any kind of certainty, you need a good amount of data of the right type. There are two things you need to do at this point:
- **Collect data**. Keeping in mind the previous lesson on fairness in data analysis, collect your data with care. Be aware of data sources that might carry particular biases, and document their origin.
- **Prepare data**. There are several steps in the data preparation process. If the data comes from different sources, it might need collating and normalizing. You can improve the data's quality and quantity through various methods such as converting strings to numbers (as we do in [Clustering](../../../5-Clustering/1-Visualize/README.md)). You can also generate new data from the original (as we do in [Classification](../../../4-Classification/1-Introduction/README.md)). You can clean and edit the data (as we do prior to the [Web App](../../../3-Web-App/README.md) lesson). Finally, depending on your training techniques, you might also need to randomize and shuffle it.
✅ After collecting and processing your data, take a moment to see if its shape will allow you to address your intended question. As we find in the [Clustering](../../../5-Clustering/1-Visualize/README.md) lessons, the data may not perform well in your given task!
### Selecting your feature variable
A [feature](https://www.datasciencecentral.com/profiles/blogs/an-introduction-to-variable-and-feature-selection) is a measurable property of your data. In many datasets it is expressed as a column heading like 'date', 'size', or 'color'. Your feature variables, usually represented as `X` in code, are the inputs used to train the model, while the target, usually `y`, represents the answer to questions such as: "In December, what **color** pumpkins will be cheapest?" "In San Francisco, which neighborhoods will have the highest real estate **prices**?"
🎓 **Feature Selection and Feature Extraction** How do you know which variable to choose when building a model? You'll probably go through a process of feature selection or feature extraction to choose the right variables for the most performant model. They're not the same thing, however: "Feature extraction creates new features from functions of the original features, whereas feature selection returns a subset of the features." ([source](https://wikipedia.org/wiki/Feature_selection))
### Visualize your data
An important aspect of the data scientist's toolkit is the power to visualize data using several excellent libraries such as Seaborn or MatPlotLib. Representing your data visually might allow you to uncover hidden correlations that you can leverage. Your visualizations might also help you to uncover bias or unbalanced data (as we discover in [Classification](../../../4-Classification/2-Classifiers-1/README.md)).
### Split your dataset
Prior to training, you need to split your dataset into two or more parts of unequal size that still represent the data well (a short code sketch follows the list below).
- **Training**. This part of the dataset is fit to your model to train it. This set constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent group of data, often gathered from the original data, that is used to confirm the performance of the built model.
- **Validating**. A validation set is a smaller independent group of examples that is used to tune the model's hyperparameters, or architecture, to improve the model. (Depending on your data's size and the question you are asking, you might not need to build this third set, as we note in [Time Series Forecasting](../../../7-TimeSeries/1-Introduction/README.md).)
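A sketch of carving out all three sets described above by applying `train_test_split` twice (the sizes are illustrative):
```python
# First split off training data, then divide the remainder into
# validation and test portions.
import numpy as np
from sklearn.model_selection import train_test_split

X, y = np.arange(100).reshape(50, 2), np.arange(50)
X_train, X_tmp, y_train, y_tmp = train_test_split(X, y, test_size=0.4, random_state=0)
X_val, X_test, y_val, y_test = train_test_split(X_tmp, y_tmp, test_size=0.5, random_state=0)
print(len(X_train), len(X_val), len(X_test))   # 30 10 10
```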
## Building a model
Using your training data, your goal is to build a model, or a statistical representation of your data, using various algorithms to **train** it. Training a model exposes it to data and allows it to make assumptions about the patterns it discovers, validates, and accepts or rejects.
### Decide on a training method
Depending on your question and the nature of your data, you will choose a method to train it. Stepping through the [Scikit-learn documentation](https://scikit-learn.org/stable/user_guide.html) - which we use in this course - you can explore many ways to train a model. Depending on your experience, you might have to try several different methods to build the best model. You are likely to go through the process data scientists use to evaluate a model's performance: feeding it unseen data, checking for accuracy, bias, and other quality-degrading issues, and selecting the most appropriate training method for the task at hand.
### Train a model
With your training data prepared, you are ready to 'fit' it to create a model. You will notice that in many ML libraries you will find the code 'model.fit' - it is at this time that you send in your data as an array of values (usually 'X') and a target variable (usually 'y').
### Evaluate the model
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
🎓 **Model fitting**
In the context of machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar.
🎓 **Underfitting** and **overfitting** are common problems that degrade the quality of the model, as the model fits either not well enough or too well. This causes the model to make predictions either too closely aligned or too loosely aligned with its training data. An overfit model predicts training data too well because it has learned the data's details and noise too well. An underfit model is not accurate as it can neither accurately analyze its training data nor data it has not yet 'seen'.
![overfitting model](../images/overfitting.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
## Parameter tuning
Once your initial training is complete, observe the quality of the model and consider improving it by tweaking its 'hyperparameters'. Read more about the process [in the documentation](https://docs.microsoft.com/en-us/azure/machine-learning/how-to-tune-hyperparameters?WT.mc_id=academic-15963-cxa).
## Prediction
This is the moment where you can use completely new data to test your model's accuracy. In an 'applied' ML setting, where you are building web assets to use the model in production, this process might involve gathering user input (a button press, for example) to set a variable and send it to the model for inference, or evaluation.
In these lessons, you will discover how to use these steps to prepare, build, test, evaluate, and predict - all the gestures of a data scientist and more, as you progress in your journey to become a 'full stack' ML engineer.
---
## 🚀Challenge
Draw a flow chart reflecting the steps of an ML practitioner. Where do you see yourself right now in the process? Where do you predict you will find difficulty? What seems easy to you?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/8?loc=ja)
## Review & Self Study
Search online for interviews with data scientists who discuss their daily work. Here is [one](https://www.youtube.com/watch?v=Z3IjgbbCEfs).
## Assignment
[Interview a data scientist](assignment.ja.md)

@ -54,7 +54,7 @@
- **Training**. This part of the dataset is fit to your model to train it. This set constitutes the majority of the original dataset.
- **Testing**. A test dataset is an independent group of data, often gathered from the original data, that is used to confirm the performance of the built model.
- **Validating**. A validation set is a smaller independent group of examples that is used to tune the model's hyperparameters or architecture to improve the model. Depending on your data's size and the question you are asking, you might not need to build this third set (as we note in [Time Series Forecasting](../../7-TimeSeries/1-Introduction/README.md)).
- **Validating**. A validation set is a smaller independent group of examples that is used to tune the model's hyperparameters or architecture to improve the model. Depending on your data's size and the question you are asking, you might not need to build this third set (as we note in [Time Series Forecasting](../../../7-TimeSeries/1-Introduction/README.md)).
## Building a model
@ -72,7 +72,7 @@
Once the training process is complete (it can take many iterations, or 'epochs', to train a large model), you will be able to evaluate the model's quality by using test data to gauge its performance. This data is a subset of the original data that the model has not previously analyzed. You can print out a table of metrics about your model's quality.
🎓 **Model fitting **
🎓 **Model fitting**
In the context of machine learning, model fitting refers to the accuracy of the model's underlying function as it attempts to analyze data with which it is not familiar.
@ -105,4 +105,4 @@
## Assignment
[Interview a data scientist](../assignment.md)
[Interview a data scientist](assignment.zh-cn.md)

@ -0,0 +1,11 @@
# Interview a data scientist
## Instructions
In your company, in a user group, or among your friends or fellow students, talk to someone who works professionally as a data scientist. Write a short paper (500 words) about their daily occupations. Are they specialists, or do they work 'full stack'?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | --------------------- |
| | An essay of the correct length, with attributed sources, is presented as a .doc file | The essay is poorly attributed or shorter than the required length | No essay is presented |

@ -0,0 +1,11 @@
# Interview a data scientist
## Instructions
In your company, in a user group, or among your friends or fellow students, talk to someone who works professionally as a data scientist. Write a short report (500 words) about their daily work. Are they specialists, or do they work 'full stack'?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ---------------------------------------------------------------------- | -------------------------------------------------------------- | -------------------------- |
| | A report of appropriate length, with attributed sources, is presented as a .doc file | The report lacks attribution or is shorter than the required length | No report is presented |

@ -0,0 +1,11 @@
# Interview a data scientist
## Instructions
In your company, in your community, or among your friends and classmates, find someone who works professionally as a data scientist and talk with them. Write a short essay (about 500 words) about their daily work. Are they specialists, or do they work 'full stack'?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------------------------------ | ------------------------------------------------------------------ | --------------------- |
| | A Word document that clearly describes the role and meets the length requirement is submitted | The submitted document describes the role unclearly or does not meet the length requirement | Nothing is submitted |

@ -0,0 +1,22 @@
# Introduction to machine learning
In this section of the curriculum, you will be introduced to the base concepts underlying the field of machine learning, what it is, and you will learn about its history and the techniques researchers use to work with it. Let's explore this new world of ML together!
![globe](../images/globe.jpg)
> Photo by <a href="https://unsplash.com/@bill_oxford?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Bill Oxford</a> on <a href="https://unsplash.com/s/photos/globe?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
### Lessons
1. [Introduction to machine learning](../1-intro-to-ML/translations/README.fr.md)
1. [The history of machine learning and AI](../2-history-of-ML/translations/README.fr.md)
1. [Fairness and machine learning](../3-fairness/translations/README.fr.md)
1. [Techniques of machine learning](../4-techniques-of-ML/translations/README.fr.md)
### Credits
"Introduction to machine learning" was written with ♥️ by a team of folks including [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan), [Ornella Altunyan](https://twitter.com/ornelladotcom) and [Jen Looper](https://twitter.com/jenlooper)
"The history of machine learning" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Amy Boyd](https://twitter.com/AmyKateNicho)
"Fairness and machine learning" was written with ♥️ by [Tomomi Imura](https://twitter.com/girliemac)
"Techniques of machine learning" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Chris Noring](https://twitter.com/softchris)

@ -0,0 +1,23 @@
# Introduction to machine learning
In this section of the curriculum, you will be introduced to the concepts that underlie the field of machine learning, what it is, and you will learn about its history and the techniques researchers use to work with it. Let's explore this new world of ML together!
![globe](../images/globe.jpg)
> Photo by <a href="https://unsplash.com/@bill_oxford?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Bill Oxford</a> on <a href="https://unsplash.com/s/photos/globe?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
### Lessons
1. [Introduction to machine learning](../1-intro-to-ML/translations/README.id.md)
1. [The history of machine learning and AI](../2-history-of-ML/translations/README.id.md)
1. [Fairness and machine learning](../3-fairness/translations/README.id.md)
1. [Techniques of machine learning](../4-techniques-of-ML/translations/README.id.md)
### Credits
"Introduction to machine learning" was written with ♥️ by a team of folks including [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan), [Ornella Altunyan](https://twitter.com/ornelladotcom) and [Jen Looper](https://twitter.com/jenlooper)
"The history of machine learning and AI" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Amy Boyd](https://twitter.com/AmyKateNicho)
"Fairness and machine learning" was written with ♥️ by [Tomomi Imura](https://twitter.com/girliemac)
"Techniques of machine learning" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Chris Noring](https://twitter.com/softchris)

@ -0,0 +1,22 @@
# Introduction to machine learning
In this section of the curriculum, you will be introduced to the basic concepts behind the field of machine learning, what it is, and you will learn about its history and the researchers whose techniques shaped it. Let's explore this new world of machine learning together!
![globe](../images/globe.jpg)
> Photo by <a href="https://unsplash.com/@bill_oxford?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Bill Oxford</a> on <a href="https://unsplash.com/s/photos/globe?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
### Lessons
1. [Introduction to machine learning](../1-intro-to-ML/translations/README.zh-cn.md)
1. [The history of machine learning](../2-history-of-ML/translations/README.zh-cn.md)
1. [Fairness in machine learning](../3-fairness/translations/README.zh-cn.md)
1. [Techniques of machine learning](../4-techniques-of-ML/translations/README.zh-cn.md)
### Credits
"Introduction to machine learning" was written with ♥️ by [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan), [Ornella Altunyan](https://twitter.com/ornelladotcom) and [Jen Looper](https://twitter.com/jenlooper)
"The history of machine learning and AI" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Amy Boyd](https://twitter.com/AmyKateNicho)
"Fairness and machine learning" was written with ♥️ by [Tomomi Imura](https://twitter.com/girliemac)
"Techniques of machine learning" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper) and [Chris Noring](https://twitter.com/softchris)

@ -5,6 +5,9 @@
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)
> ### [This lesson is available in R!](./solution/lesson_1-R.ipynb)
## Introduction
In these four lessons, you will discover how to build regression models. We will discuss what these are for shortly. But before you do anything, make sure you have the right tools in place to start the process!
@ -95,7 +98,7 @@ For this task we will import some libraries:
- **matplotlib**. It's a useful [graphing tool](https://matplotlib.org/) and we will use it to create a line plot.
- **numpy**. [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a useful library for handling numeric data in Python.
- **sklearn**. This is the Scikit-learn library.
- **sklearn**. This is the [Scikit-learn](https://scikit-learn.org/stable/user_guide.html) library.
Import some libraries to help with your tasks.
@ -180,6 +183,9 @@ In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.xlabel('Scaled BMIs')
plt.ylabel('Disease Progression')
plt.title('A Graph Plot Showing Diabetes Progression Against BMI')
plt.show()
```


@ -0,0 +1,436 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "lesson_1-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "YJUHCXqK57yz"
},
"source": [
"#Build a regression model: Get started with R and Tidymodels for regression models"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "LWNNzfqd6feZ"
},
"source": [
"## Introduction to Regression - Lesson 1\n",
"\n",
"#### Putting it into perspective\n",
"\n",
"✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.\n",
"\n",
"In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.\n",
"\n",
"That said, let's get started on this task!\n",
"\n",
"![Artwork by \\@allison_horst](../images/encouRage.jpg)<br>Artwork by @allison_horst"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "FIo2YhO26wI9"
},
"source": [
"## 1. Loading up our tool set\n",
"\n",
"For this task, we'll require the following packages:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
"\n",
"- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.\n",
"\n",
"You can have them installed as:\n",
"\n",
"`install.packages(c(\"tidyverse\", \"tidymodels\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "cIA9fz9v7Dss",
"colab": {
"base_uri": "https://localhost:8080/"
},
"outputId": "2df7073b-86b2-4b32-cb86-0da605a0dc11"
},
"source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n",
"pacman::p_load(tidyverse, tidymodels)"
],
"execution_count": 2,
"outputs": [
{
"output_type": "stream",
"text": [
"Loading required package: pacman\n",
"\n"
],
"name": "stderr"
}
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "gpO_P_6f9WUG"
},
"source": [
"Now, let's load these awesome packages and make them available in our current R session.(This is for mere illustration, `pacman::p_load()` already did that for you)"
]
},
{
"cell_type": "code",
"metadata": {
"id": "NLMycgG-9ezO"
},
"source": [
"# load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# load the core Tidymodels packages\n",
"library(tidymodels)\n"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "KM6iXLH996Cl"
},
"source": [
"## 2. The diabetes dataset\n",
"\n",
"In this exercise, we'll put our regression skills into display by making predictions on a diabetes dataset. The [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt) includes `442 samples` of data around diabetes, with 10 predictor feature variables, `age`, `sex`, `body mass index`, `average blood pressure`, and `six blood serum measurements` as well as an outcome variable `y`: a quantitative measure of disease progression one year after baseline.\n",
"\n",
"|Number of observations|442|\n",
"|----------------------|:---|\n",
"|Number of predictors|First 10 columns are numeric predictive|\n",
"|Outcome/Target|Column 11 is a quantitative measure of disease progression one year after baseline|\n",
"|Predictor Information|- age in years\n",
"||- sex\n",
"||- bmi body mass index\n",
"||- bp average blood pressure\n",
"||- s1 tc, total serum cholesterol\n",
"||- s2 ldl, low-density lipoproteins\n",
"||- s3 hdl, high-density lipoproteins\n",
"||- s4 tch, total cholesterol / HDL\n",
"||- s5 ltg, possibly log of serum triglycerides level\n",
"||- s6 glu, blood sugar level|\n",
"\n",
"\n",
"\n",
"\n",
"> 🎓 Remember, this is supervised learning, and we need a named 'y' target.\n",
"\n",
"Before you can manipulate data with R, you need to import the data into R's memory, or build a connection to the data that R can use to access the data remotely.\n",
"\n",
"> The [readr](https://readr.tidyverse.org/) package, which is part of the Tidyverse, provides a fast and friendly way to read rectangular data into R.\n",
"\n",
"Now, let's load the diabetes dataset provided in this source URL: <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html>\n",
"\n",
"Also, we'll perform a sanity check on our data using `glimpse()` and dsiplay the first 5 rows using `slice()`.\n",
"\n",
"Before going any further, let's also introduce something you will encounter often in R code 🥁🥁: the pipe operator `%>%`\n",
"\n",
"The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code."
]
},
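{
"cell_type": "markdown",
"metadata": {},
"source": [
"For instance, here is a minimal sketch of the pipe in action (the numbers are purely illustrative; `mean()` and `round()` are base R functions):\n",
"\n",
"```r\n",
"# Without the pipe: read inside-out\n",
"round(mean(c(1, 3, 5)), 1)\n",
"\n",
"# With the pipe: read left to right, \"and then\"\n",
"c(1, 3, 5) %>% mean() %>% round(1)\n",
"```"
]
},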
{
"cell_type": "code",
"metadata": {
"id": "Z1geAMhM-bSP"
},
"source": [
"# Import the data set\n",
"diabetes <- read_table2(file = \"https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(diabetes)\n",
"\n",
"\n",
"# Select the first 5 rows of the data\n",
"diabetes %>% \n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "UwjVT1Hz-c3Z"
},
"source": [
"`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double` \n",
"\n",
"<br>\n",
"\n",
"\n",
"\n",
"> glimpse() and slice() are functions in [`dplyr`](https://dplyr.tidyverse.org/). Dplyr, part of the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges\n",
"\n",
"<br>\n",
"\n",
"Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?\n",
"\n",
"[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame."
]
},
{
"cell_type": "code",
"metadata": {
"id": "RDY1oAKI-m80"
},
"source": [
"# Select predictor feature `bmi` and outcome `y`\n",
"diabetes_select <- diabetes %>% \n",
" select(c(bmi, y))\n",
"\n",
"# Print the first 5 rows\n",
"diabetes_select %>% \n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "SDk668xK-tc3"
},
"source": [
"## 3. Training and Testing data\n",
"\n",
"It's common practice in supervised learning to *split* the data into two subsets; a (typically larger) set with which to train the model, and a smaller \"hold-back\" set with which to see how the model performed.\n",
"\n",
"Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "EqtHx129-1h-"
},
"source": [
"set.seed(2056)\n",
"# Split 67% of the data for training and the rest for tesing\n",
"diabetes_split <- diabetes_select %>% \n",
" initial_split(prop = 0.67)\n",
"\n",
"# Extract the resulting train and test sets\n",
"diabetes_train <- training(diabetes_split)\n",
"diabetes_test <- testing(diabetes_split)\n",
"\n",
"# Print the first 3 rows of the training set\n",
"diabetes_train %>% \n",
" slice(1:10)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "sBOS-XhB-6v7"
},
"source": [
"## 4. Train a linear regression model with Tidymodels\n",
"\n",
"Now we are ready to train our model!\n",
"\n",
"In Tidymodels, you specify models using `parsnip()` by specifying three concepts:\n",
"\n",
"- Model **type** differentiates models such as linear regression, logistic regression, decision tree models, and so forth.\n",
"\n",
"- Model **mode** includes common options like regression and classification; some model types support either of these while some only have one mode.\n",
"\n",
"- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`\"lm\"`** or **`\"ranger\"`**\n",
"\n",
"This modeling information is captured in a model specification, so let's build one!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "20OwEw20--t3"
},
"source": [
"# Build a linear model specification\n",
"lm_spec <- \n",
" # Type\n",
" linear_reg() %>% \n",
" # Engine\n",
" set_engine(\"lm\") %>% \n",
" # Mode\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Print the model specification\n",
"lm_spec"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "_oDHs89k_CJj"
},
"source": [
"After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.\n",
"\n",
"`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features ie, `.` (in this case, we only have one predictor: `bmi` )"
]
},
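{
"cell_type": "markdown",
"metadata": {},
"source": [
"Since this data has a single predictor, the sketch below (reusing the `lm_spec` and `diabetes_train` objects created above) would be an equivalent, more explicit formula:\n",
"\n",
"```r\n",
"# Name the predictor explicitly instead of using `.`\n",
"lm_spec %>% \n",
"  fit(y ~ bmi, data = diabetes_train)\n",
"```"
]
},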
{
"cell_type": "code",
"metadata": {
"id": "YlsHqd-q_GJQ"
},
"source": [
"# Build a linear model specification\n",
"lm_spec <- linear_reg() %>% \n",
" set_engine(\"lm\") %>%\n",
" set_mode(\"regression\")\n",
"\n",
"\n",
"# Train a linear regression model\n",
"lm_mod <- lm_spec %>% \n",
" fit(y ~ ., data = diabetes_train)\n",
"\n",
"# Print the model\n",
"lm_mod"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "kGZ22RQj_Olu"
},
"source": [
"From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.\n",
"<br>\n",
"\n",
"## 5. Make predictions on the test set\n",
"\n",
"Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups."
]
},
{
"cell_type": "code",
"metadata": {
"id": "nXHbY7M2_aao"
},
"source": [
"# Make predictions for the test set\n",
"predictions <- lm_mod %>% \n",
" predict(new_data = diabetes_test)\n",
"\n",
"# Print out some of the predictions\n",
"predictions %>% \n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "R_JstwUY_bIs"
},
"source": [
"Woohoo! 💃🕺 We just trained a model and used it to make predictions!\n",
"\n",
"When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.\n",
"\n",
"`dplyr::bind_cols()` efficiently binds multiple data frames column."
]
},
{
"cell_type": "code",
"metadata": {
"id": "RybsMJR7_iI8"
},
"source": [
"# Combine the predictions and the original test set\n",
"results <- diabetes_test %>% \n",
" bind_cols(predictions)\n",
"\n",
"\n",
"results %>% \n",
" slice(1:5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "XJbYbMZW_n_s"
},
"source": [
"## 6. Plot modelling results\n",
"\n",
"Now, its time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.\n",
"\n",
"R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**."
]
},
{
"cell_type": "code",
"metadata": {
"id": "R9tYp3VW_sTn"
},
"source": [
"# Set a theme for the plot\n",
"theme_set(theme_light())\n",
"# Create a scatter plot\n",
"results %>% \n",
" ggplot(aes(x = bmi)) +\n",
" # Add a scatter plot\n",
" geom_point(aes(y = y), size = 1.6) +\n",
" # Add a line plot\n",
" geom_line(aes(y = .pred), color = \"blue\", size = 1.5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zrPtHIxx_tNI"
},
"source": [
"> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.\n",
"\n",
"Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!\n"
]
}
]
}

@ -0,0 +1,250 @@
---
title: 'Build a regression model: Get started with R and Tidymodels for regression models'
output:
html_document:
df_print: paged
theme: flatly
highlight: breezedark
toc: yes
toc_float: yes
code_download: yes
---
## Introduction to Regression - Lesson 1
#### Putting it into perspective
✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use `linear regression`, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment** so you would use `logistic regression`. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.
In this section, you will work with a [small dataset about diabetes](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html). Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.
That said, let's get started on this task!
![Artwork by \@allison_horst](../images/encouRage.jpg){width="630"}
## 1. Loading up our tool set
For this task, we'll require the following packages:
- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.
You can install them with:
`install.packages(c("tidyverse", "tidymodels"))`
The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
```{r, message=F, warning=F}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse, tidymodels)
```
Now, let's load these awesome packages and make them available in our current R session. (This is just for illustration; `pacman::p_load()` has already done that for you.)
```{r load_tidy_verse_models, message=F, warning=F}
# load the core Tidyverse packages
library(tidyverse)
# load the core Tidymodels packages
library(tidymodels)
```
## 2. The diabetes dataset
In this exercise, we'll put our regression skills on display by making predictions on a diabetes dataset. The [diabetes dataset](https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt) includes `442 samples` of data around diabetes, with 10 predictor feature variables, `age`, `sex`, `body mass index`, `average blood pressure`, and `six blood serum measurements` as well as an outcome variable `y`: a quantitative measure of disease progression one year after baseline.
+----------------------------+------------------------------------------------------------------------------------+
| **Number of observations** | **442** |
+============================+====================================================================================+
| **Number of predictors** | First 10 columns are numeric predictive values |
+----------------------------+------------------------------------------------------------------------------------+
| **Outcome/Target** | Column 11 is a quantitative measure of disease progression one year after baseline |
+----------------------------+------------------------------------------------------------------------------------+
| **Predictor Information** | - age age in years |
| | - sex |
| | - bmi body mass index |
| | - bp average blood pressure |
| | - s1 tc, total serum cholesterol |
| | - s2 ldl, low-density lipoproteins |
| | - s3 hdl, high-density lipoproteins |
| | - s4 tch, total cholesterol / HDL |
| | - s5 ltg, possibly log of serum triglycerides level |
| | - s6 glu, blood sugar level |
+----------------------------+------------------------------------------------------------------------------------+
> 🎓 Remember, this is supervised learning, and we need a named 'y' target.
Before you can manipulate data with R, you need to import the data into R's memory, or build a connection to the data that R can use to access the data remotely.\
> The [readr](https://readr.tidyverse.org/) package, which is part of the Tidyverse, provides a fast and friendly way to read rectangular data into R.
Now, let's load the diabetes dataset provided in this source URL: <https://www4.stat.ncsu.edu/~boos/var.select/diabetes.html>
Also, we'll perform a sanity check on our data using `glimpse()` and display the first 5 rows using `slice()`.
Before going any further, let's introduce something you will encounter quite often in R code: the pipe operator `%>%`
The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying "and then" in your code.\
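For instance, here is a minimal sketch of the pipe in action (the numbers are purely illustrative; `mean()` and `round()` are base R functions):
```{r pipe_demo, message=F, warning=F}
# Without the pipe: read inside-out
round(mean(c(1, 3, 5)), 1)
# With the pipe: read left to right, "and then"
c(1, 3, 5) %>%
  mean() %>%
  round(1)
```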
```{r load_dataset, message=F, warning=F}
# Import the data set
diabetes <- read_table2(file = "https://www4.stat.ncsu.edu/~boos/var.select/diabetes.rwrite1.txt")
# Get a glimpse and dimensions of the data
glimpse(diabetes)
# Select the first 5 rows of the data
diabetes %>%
slice(1:5)
```
`glimpse()` shows us that this data has 442 rows and 11 columns with all the columns being of data type `double`
> glimpse() and slice() are functions in [`dplyr`](https://dplyr.tidyverse.org/). Dplyr, part of the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges
Now that we have the data, let's narrow down to one feature (`bmi`) to target for this exercise. This will require us to select the desired columns. So, how do we do this?
[`dplyr::select()`](https://dplyr.tidyverse.org/reference/select.html) allows us to *select* (and optionally rename) columns in a data frame.
```{r select, message=F, warning=F}
# Select predictor feature `bmi` and outcome `y`
diabetes_select <- diabetes %>%
select(c(bmi, y))
# Print the first 5 rows
diabetes_select %>%
slice(1:5)
```
## 3. Training and Testing data
It's common practice in supervised learning to *split* the data into two subsets: a (typically larger) set with which to train the model, and a smaller "hold-back" set with which to see how the model performed.
Now that we have data ready, we can see if a machine can help determine a logical split between the numbers in this dataset. We can use the [rsample](https://tidymodels.github.io/rsample/) package, which is part of the Tidymodels framework, to create an object that contains the information on *how* to split the data, and then two more rsample functions to extract the created training and testing sets:
```{r split, message=F, warning=F}
set.seed(2056)
# Split 67% of the data for training and the rest for testing
diabetes_split <- diabetes_select %>%
initial_split(prop = 0.67)
# Extract the resulting train and test sets
diabetes_train <- training(diabetes_split)
diabetes_test <- testing(diabetes_split)
# Print the first 3 rows of the training set
diabetes_train %>%
slice(1:3)
```
## 4. Train a linear regression model with Tidymodels
Now we are ready to train our model!
In Tidymodels, you specify models using the `parsnip` package by specifying three concepts:
- Model **type** differentiates models such as linear regression, logistic regression, decision tree models, and so forth.
- Model **mode** includes common options like regression and classification; some model types support either of these while some only have one mode.
- Model **engine** is the computational tool which will be used to fit the model. Often these are R packages, such as **`"lm"`** or **`"ranger"`**
This modeling information is captured in a model specification, so let's build one!
```{r lm_model_spec, message=F, warning=F}
# Build a linear model specification
lm_spec <-
# Type
linear_reg() %>%
# Engine
set_engine("lm") %>%
# Mode
set_mode("regression")
# Print the model specification
lm_spec
```
After a model has been *specified*, the model can be `estimated` or `trained` using the [`fit()`](https://parsnip.tidymodels.org/reference/fit.html) function, typically using a formula and some data.
`y ~ .` means we'll fit `y` as the predicted quantity/target, explained by all the predictors/features, i.e., `.` (in this case, we only have one predictor: `bmi`)
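Since this data has a single predictor, the sketch below (reusing the `lm_spec` and `diabetes_train` objects created above) would be an equivalent, more explicit formula:
```{r explicit_formula, message=F, warning=F}
# Name the predictor explicitly instead of using `.`
lm_spec %>% 
  fit(y ~ bmi, data = diabetes_train)
```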
```{r train, message=F, warning=F}
# Build a linear model specification
lm_spec <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# Train a linear regression model
lm_mod <- lm_spec %>%
fit(y ~ ., data = diabetes_train)
# Print the model
lm_mod
```
From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.
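If you would like those coefficients as a tidy data frame, one possible sketch uses `tidy()` (from the broom package, which `library(tidymodels)` attaches):
```{r tidy_coefficients, message=F, warning=F}
# Extract the estimated intercept and slope as a data frame
tidy(lm_mod)
```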
## 5. Make predictions on the test set
Now that we've trained a model, we can use it to predict the disease progression y for the test dataset using [parsnip::predict()](https://parsnip.tidymodels.org/reference/predict.model_fit.html). This will be used to draw the line between data groups.
```{r test, message=F, warning=F}
# Make predictions for the test set
predictions <- lm_mod %>%
predict(new_data = diabetes_test)
# Print out some of the predictions
predictions %>%
slice(1:5)
```
Woohoo! 💃🕺 We just trained a model and used it to make predictions!
When making predictions, the tidymodels convention is to always produce a tibble/data frame of results with standardized column names. This makes it easy to combine the original data and the predictions in a usable format for subsequent operations such as plotting.
`dplyr::bind_cols()` efficiently binds multiple data frames by column.
```{r test_pred, message=F, warning=F}
# Combine the predictions and the original test set
results <- diabetes_test %>%
bind_cols(predictions)
results %>%
slice(1:5)
```
## 6. Plot modelling results
Now, it's time to see this visually 📈. We'll create a scatter plot of all the `y` and `bmi` values of the test set, then use the predictions to draw a line in the most appropriate place, between the model's data groupings.
R has several systems for making graphs, but `ggplot2` is one of the most elegant and most versatile. This allows you to compose graphs by **combining independent components**.
```{r plot_pred, message=F, warning=F}
# Set a theme for the plot
theme_set(theme_light())
# Create a scatter plot
results %>%
ggplot(aes(x = bmi)) +
# Add a scatter plot
geom_point(aes(y = y), size = 1.6) +
# Add a line plot
geom_line(aes(y = .pred), color = "blue", size = 1.5)
```
> ✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.
Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!

@ -182,13 +182,6 @@
"plt.plot(X_test, y_pred, color='blue', linewidth=3)\n",
"plt.show()"
]
},
{
"cell_type": "code",
"execution_count": null,
"metadata": {},
"outputs": [],
"source": []
}
]
}

@ -0,0 +1,211 @@
# Get started with Python and Scikit-learn for regression models
![Summary of regressions in a sketchnote](../../../sketchnotes/ml-regression.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)
## Introduction
In these four lessons, you will discover how to build regression models. We will discuss what these are for shortly.
But before you do anything, make sure you have the right tools in place to start the process!
In this lesson, you will learn how to:
- Configure your computer for local machine learning tasks.
- Work with Jupyter notebooks.
- Use Scikit-learn, including installation.
- Explore linear regression with a hands-on exercise.
## Installations and configurations
[![Using Python with Visual Studio Code](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "Using Python with Visual Studio Code")
> 🎥 Click the image above for a video: using Python within VS Code.
1. **Install Python**. Make sure [Python](https://www.python.org/downloads/) is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. There are also useful [Python Coding Packs](https://code.visualstudio.com/learn/educators/installers?WT.mc_id=academic-15963-cxa) available to ease the setup for some users.
Some usages of Python, however, require one version of the software, whereas others require a different one. For this reason, it is useful to work in a [virtual environment](https://docs.python.org/3/library/venv.html).
2. **Install Visual Studio Code**. Make sure you have Visual Studio Code installed on your computer. Follow these instructions to [install Visual Studio Code](https://code.visualstudio.com/) for the basic installation. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to [configure Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa) for Python development.
> Get comfortable with Python by working through this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa)
3. **Install Scikit-learn**, by following [these instructions](https://scikit-learn.org/stable/install.html). Since you need to ensure that you use Python 3, it's recommended that you use a virtual environment. Note that if you are installing this library on an M1 Mac, there are special instructions on the page linked above.
1. **Install Jupyter Notebook**. You will need to [install the Jupyter package](https://pypi.org/project/jupyter/).
## Your ML authoring environment
You are going to use **notebooks** to develop your Python code and create machine learning models. This type of file is a common tool for data scientists, and it can be identified by its suffix or extension `.ipynb`.
Notebooks are an interactive environment that allow the developer to both code and add notes and write documentation around the code, which is quite helpful for experimental or research-oriented projects.
### Exercise - work with a notebook
In this folder, you will find the file _notebook.ipynb_.
1. Open _notebook.ipynb_ in Visual Studio Code.
A Jupyter server will start with Python 3+. You will find areas of the notebook that can be `run`, pieces of code. You can run a code block by selecting the icon that looks like a play button.
1. Select the `md` icon and add a bit of markdown, and the following text **# Welcome to your notebook**.
Next, add some Python code.
1. Type **print('hello notebook')** in the code area.
1. Select the arrow to run the code.
You should see the printed statement:
```output
hello notebook
```
![VS Code with a notebook open](../images/notebook.png)
You can interleave your code with comments to self-document the notebook.
✅ Think for a minute about how different a web developer's working environment is from that of a data scientist.
## Up and running with Scikit-learn
Now that Python is set up in your local environment, and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-learn (pronounce it `sci` as in `science`). Scikit-learn provides an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) that helps you perform ML tasks.
According to their [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
In this course, you will use Scikit-learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum.
Scikit-learn makes it straightforward to build models and evaluate them for use. It is primarily focused on using numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data, and using a built-in estimator to create a first ML model with Scikit-learn with some basic data.
## Exercise - your first Scikit-learn notebook
> This tutorial was inspired by the [linear regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-learn's website.
In the _notebook.ipynb_ file associated with this lesson, clear out all the cells by pressing the 'trash can' icon.
In this section, you will work with a small dataset about diabetes that is built into Scikit-learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.
✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use linear regression, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment**, so you would use logistic regression. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.
Let's get started on this task.
### Import libraries
For this task we will import some libraries:
- **matplotlib**. It's a useful [graphing tool](https://matplotlib.org/) and we will use it to create a line plot.
- **numpy**. [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a useful library for handling numeric data in Python.
- **sklearn**. This is the Scikit-learn library.
Import some libraries to help with your tasks.
1. Add imports by typing the following code:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection
```
Above you are importing `matplotlib` and `numpy`, and importing `datasets`, `linear_model` and `model_selection` from `sklearn`. `model_selection` is used for splitting data into training and test sets.
### The diabetes dataset
The built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) includes 442 samples of data around diabetes, with 10 feature variables, some of which include:
- age: age in years
- bmi: body mass index
- bp: average blood pressure
- s1 tc: T-Cells (a type of white blood cells)
✅ This dataset includes the concept of 'sex' as a feature variable important to research around diabetes. Many medical datasets include this type of binary classification. Think a bit about how categorizations such as this might exclude certain parts of a population from treatments.
Now, load up the X and y data.
> 🎓 Remember, this is supervised learning, and we need a named 'y' target.
In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The input `return_X_y=True` signals that `X` will be a data matrix, and `y` will be the regression target.
1. Add some print commands to show the shape of the data matrix and its first element:
```python
X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape)
print(X[0])
```
What you get back as a response is a tuple. What you are doing is assigning the first two values of the tuple to `X` and `y`, respectively. Learn more [about tuples](https://wikipedia.org/wiki/Tuple).
You can see that this data has 442 items shaped in arrays of 10 elements:
```text
(442, 10)
[ 0.03807591 0.05068012 0.06169621 0.02187235 -0.0442235 -0.03482076
-0.04340085 -0.00259226 0.01990842 -0.01764613]
```
✅ Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?
2. Next, select a portion of this dataset to plot by arranging it into a new array using numpy's `newaxis` function. We will use linear regression to generate a line between values in this data, according to a pattern it determines.
```python
X = X[:, np.newaxis, 2]
```
✅ At any time, print out the data to check its shape.
3. Now that you have data ready to be plotted, you can see if a machine can help determine a logical split between the numbers in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-learn has a straightforward way to do this; you can split your test data at a given point.
```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
```
4. Now you are ready to train your model! Load the linear regression model and train it with your X and y training sets using `model.fit()`:
```python
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
```
`model.fit()` is a function you'll see in many ML libraries such as TensorFlow
5. Then, create a prediction using test data, using the function `predict()`. This will be used to draw the line between the data groups
```python
y_pred = model.predict(X_test)
```
6. Now it's time to display the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the prediction to draw a line in the most appropriate place, between the model's data groupings.
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.show()
```
![a scatterplot showing datapoints around diabetes](../images/scatterplot.png)
✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.
Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!
---
## 🚀Challenge
Plot a different variable from this dataset. Hint: edit this line: `X = X[:, np.newaxis, 2]`. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/)
## Review & Self Study
In this tutorial, you worked with simple linear regression, rather than univariate or multiple linear regression. Read a little about the differences between these methods, or take a look at [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef)
Read more about the concept of regression and think about what kinds of questions can be answered by this technique. Take this [tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-15963-cxa) to deepen your understanding.
## Assignment
[A different dataset](assignment.it.md)

@ -4,7 +4,7 @@
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9?loc=ja)
## Introduction
@ -205,7 +205,7 @@ s1 tc: T-Cells (a type of white blood cells)
## 🚀Challenge
Plot a different variable from this dataset. Hint: edit the line `X = X[:, np.newaxis, 2]`. Given this dataset's target, what can you discover about the progression of diabetes as a disease?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/)
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10?loc=ja)
## Review & Self Study
@ -215,4 +215,4 @@ s1 tc: T-Cells (a type of white blood cells)
## Assignment
[A different dataset](assignment.md)
[A different dataset](./assignment.ja.md)

@ -0,0 +1,205 @@
# Get started with Python and Scikit-learn for regression models
![Summary of regressions in a sketchnote](../../../sketchnotes/ml-regression.png)
> Sketchnote by [Tomomi Imura](https://www.twitter.com/girlie_mac)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/9/)
## Introduction
In these four lessons, you will discover how to build regression models. We will discuss what these are for shortly. But before you do anything, make sure you have the right tools in place to start the process!
In this lesson, you will learn how to:
- Configure your computer for local machine learning tasks.
- Work with Jupyter notebooks.
- Use Scikit-learn, including installation.
- Explore linear regression with a hands-on exercise.
## Installations and configurations
[![Using Python with Visual Studio Code](https://img.youtube.com/vi/7EXd4_ttIuw/0.jpg)](https://youtu.be/7EXd4_ttIuw "Using Python with Visual Studio Code")
> 🎥 Click the image above for a video: using Python within VS Code.
1. **Install Python**. Make sure [Python](https://www.python.org/downloads/) is installed on your computer. You will use Python for many data science and machine learning tasks. Most computer systems already include a Python installation. There are also useful [Python Coding Packs](https://code.visualstudio.com/learn/educations/installers?WT.mc_id=academic-15963-cxa) available to ease the setup for some users.
Some usages of Python, however, require one version of the software, whereas others require a different one. For this reason, it is useful to work in a [virtual environment](https://docs.python.org/3/library/venv.html).
2. **Install Visual Studio Code**. Make sure you have Visual Studio Code installed on your computer. Follow these instructions to [install Visual Studio Code](https://code.visualstudio.com/) for the basic installation. You are going to use Python in Visual Studio Code in this course, so you might want to brush up on how to [configure Visual Studio Code](https://docs.microsoft.com/learn/modules/python-install-vscode?WT.mc_id=academic-15963-cxa) for Python development.
> Get comfortable with Python by working through this collection of [Learn modules](https://docs.microsoft.com/users/jenlooper-2911/collections/mp1pagggd5qrq7?WT.mc_id=academic-15963-cxa)
3. **Install Scikit-learn** by following [these instructions](https://scikit-learn.org/stable/install.html). Since you need to make sure that you use Python 3, it's recommended that you use a virtual environment. Note that if you are installing this library on an M1 Mac, there are special instructions on the page linked above.
4. **Install Jupyter Notebook**. You will need to [install the Jupyter package](https://pypi.org/project/jupyter/).
## Your ML authoring environment
You are going to use **notebooks** to develop your Python code and create machine learning models. This type of file is a common tool for data scientists, and it can be identified by its suffix or extension `.ipynb`.
Notebooks are an interactive environment that allow the developer to both code and add notes and documentation around the code, which is quite helpful for experimental or research-oriented projects.
### Exercise - work with a notebook
1. Open _notebook.ipynb_ in Visual Studio Code.
A Jupyter server will start with Python 3+. You will find areas of the notebook that can be "run" - code blocks. You can run a code block by selecting the icon that looks like a play button.
2. Select the `md` icon and add a bit of markdown: type **# Welcome to your notebook**.
Next, add some Python code.
1. Type **print("hello notebook")** in a code block.
2. Select the arrow to run the code.
You should see the printed statement:
```output
hello notebook
```
![VS Code with a notebook open](../images/notebook.png)
You can add comments to your code so that the notebook is self-documenting.
✅ Think for a minute about how different a web developer's working environment is from that of a data scientist.
## Up and running with Scikit-learn
Now that Python is set up in your local environment, and you are comfortable with Jupyter notebooks, let's get equally comfortable with Scikit-learn (pronounce it "sci" as in "science"). Scikit-learn provides an [extensive API](https://scikit-learn.org/stable/modules/classes.html#api-ref) to help you perform ML tasks.
According to their [website](https://scikit-learn.org/stable/getting_started.html), "Scikit-learn is an open source machine learning library that supports supervised and unsupervised learning. It also provides various tools for model fitting, data preprocessing, model selection and evaluation, and many other utilities."
In this course, you will use Scikit-learn and other tools to build machine learning models to perform what we call 'traditional machine learning' tasks. We have deliberately avoided neural networks and deep learning, as they are better covered in our forthcoming 'AI for Beginners' curriculum.
Scikit-learn makes it straightforward to build models and evaluate them for use. It is primarily focused on using numeric data and contains several ready-made datasets for use as learning tools. It also includes pre-built models for students to try. Let's explore the process of loading prepackaged data, and using a built-in estimator to create a first ML model with Scikit-learn with some basic data.
## Exercise - your first Scikit-learn notebook
> This tutorial was inspired by the [linear regression example](https://scikit-learn.org/stable/auto_examples/linear_model/plot_ols.html#sphx-glr-auto-examples-linear-model-plot-ols-py) on Scikit-learn's website.
In the _notebook.ipynb_ file associated with this lesson, clear out all the cells by pressing the 'trash can' icon.
In this section, you will work with a small dataset about diabetes that is built into Scikit-learn for learning purposes. Imagine that you wanted to test a treatment for diabetic patients. Machine Learning models might help you determine which patients would respond better to the treatment, based on combinations of variables. Even a very basic regression model, when visualized, might show information about variables that would help you organize your theoretical clinical trials.
✅ There are many types of regression methods, and which one you pick depends on the answer you're looking for. If you want to predict the probable height for a person of a given age, you'd use linear regression, as you're seeking a **numeric value**. If you're interested in discovering whether a type of cuisine should be considered vegan or not, you're looking for a **category assignment**, so you would use logistic regression. You'll learn more about logistic regression later. Think a bit about some questions you can ask of data, and which of these methods would be more appropriate.
Let's get started on this task.
### Import libraries
For this task we will import some libraries:
- **matplotlib**. It's a useful [graphing tool](https://matplotlib.org/) and we will use it to create a line plot.
- **numpy**. [numpy](https://numpy.org/doc/stable/user/whatisnumpy.html) is a useful library for handling numeric data in Python.
- **sklearn**. This is the Scikit-learn library.
Import some libraries to help with your tasks.
1. Add imports by typing the following code:
```python
import matplotlib.pyplot as plt
import numpy as np
from sklearn import datasets, linear_model, model_selection
```
In the code above, you are importing `matplotlib` and `numpy`, and importing `datasets`, `linear_model` and `model_selection` from `sklearn`. `model_selection` is used for splitting data into training and test sets.
### The diabetes dataset
The built-in [diabetes dataset](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) includes 442 samples of data around diabetes, with 10 feature variables, some of which include:
- age: age in years
- bmi: body mass index
- bp: average blood pressure
- s1 tc: T-Cells (a type of white blood cells)
✅ This dataset includes the concept of 'sex' as a feature variable important to research around diabetes. Many medical datasets include this type of binary classification. Think a bit about how categorizations such as this might exclude certain parts of a population from treatments.
Now, load up the X and y data.
> 🎓 Remember, this is supervised learning, and we need a named 'y' target.
In a new code cell, load the diabetes dataset by calling `load_diabetes()`. The input `return_X_y=True` signals that `X` will be a data matrix, and `y` will be the regression target.
1. Add some print commands to show the shape of the data matrix and its first element:
```python
X, y = datasets.load_diabetes(return_X_y=True)
print(X.shape)
print(X[0])
```
What you get back as a response is a tuple. What you are doing is assigning the first two values of the tuple to `X` and `y`, respectively. Learn more [about tuples](https://wikipedia.org/wiki/Tuple).
You can see that this data has 442 items shaped in arrays of 10 elements:
```text
(442, 10)
[ 0.03807591 0.05068012 0.06169621 0.02187235 -0.0442235 -0.03482076
-0.04340085 -0.00259226 0.01990842 -0.01764613]
```
✅ Think a bit about the relationship between the data and the regression target. Linear regression predicts relationships between feature X and target variable y. Can you find the [target](https://scikit-learn.org/stable/datasets/toy_dataset.html#diabetes-dataset) for the diabetes dataset in the documentation? What is this dataset demonstrating, given that target?
2. Next, select a portion of this dataset to plot by arranging it into a new array using numpy's `newaxis` function. We will use linear regression to generate a line between values in this data, according to a pattern it determines.
```python
X = X[:, np.newaxis, 2]
```
✅ At any time, print out the data to check its shape.
3. Now that you have data ready to be plotted, you can see if a machine can help determine a logical split between the numbers in this dataset. To do this, you need to split both the data (X) and the target (y) into test and training sets. Scikit-learn has a straightforward way to do this; you can split your test data at a given point.
```python
X_train, X_test, y_train, y_test = model_selection.train_test_split(X, y, test_size=0.33)
```
4. Now you are ready to train your model! Load the linear regression model and train it with your X and y training sets using `model.fit()`:
```python
model = linear_model.LinearRegression()
model.fit(X_train, y_train)
```
`model.fit()` is a function you'll see in many ML libraries such as TensorFlow
5. Then, create a prediction using test data, using the function `predict()`. This will be used to draw the line between the data groups
```python
y_pred = model.predict(X_test)
```
6. Now it's time to display the data in a plot. Matplotlib is a very useful tool for this task. Create a scatterplot of all the X and y test data, and use the prediction to draw a line in the most appropriate place, between the model's data groupings.
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, y_pred, color='blue', linewidth=3)
plt.show()
```
![a scatterplot showing datapoints around diabetes](../images/scatterplot.png)
✅ Think a bit about what's going on here. A straight line is running through many small dots of data, but what is it doing exactly? Can you see how you should be able to use this line to predict where a new, unseen data point should fit in relationship to the plot's y axis? Try to put into words the practical use of this model.
Congratulations, you built your first linear regression model, created a prediction with it, and displayed it in a plot!
---
## 🚀Challenge
Plot a different variable from this dataset. Hint: edit this line: `X = X[:, np.newaxis, 2]`. Given this dataset's target, what are you able to discover about the progression of diabetes as a disease?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/10/)
## Review & Self Study
In this tutorial, you worked with simple linear regression, rather than univariate or multiple linear regression. Read a little about the differences between these methods, or take a look at [this video](https://www.coursera.org/lecture/quantifying-relationships-regression-models/linear-vs-nonlinear-categorical-variables-ai2Ef)
Read more about the concept of regression and think about what kinds of questions can be answered by this technique. Take this [tutorial](https://docs.microsoft.com/learn/modules/train-evaluate-regression-models?WT.mc_id=academic-15963-cxa) to deepen your understanding.
## Assignment
[A different dataset](../assignment.md)

@ -0,0 +1,13 @@
# Regression with Scikit-learn
## Instructions
Take a look at the [Linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud) in Scikit-learn. This dataset has multiple [targets](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset): "It consists of three exercise (data) and three physiological (target) variables collected from twenty middle-aged men in a fitness club".
In your own words, describe how to create a regression model that would plot the relationship between the waistline and how many situps are accomplished. Do the same for the other datapoints in this dataset.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| ------------------------------ | ----------------------------------- | ----------------------------- | -------------------------- |
| Submit a descriptive paragraph | A well-written paragraph is submitted | A few sentences are submitted | No description is supplied |

@ -0,0 +1,13 @@
# Regression with Scikit-learn
## Instructions
Take a look at the [Linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud) in Scikit-learn. This dataset has multiple [targets](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset): it consists of three exercise variables (data) and three physiological variables (target) collected from twenty middle-aged men in a fitness club.
In your own words, describe how to create a regression model that would plot the relationship between the waistline and the number of situps accomplished. Try to describe the same for the other datapoints in this dataset as well.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| ------------------------------ | ----------------------------------- | ----------------------------- | -------------------------- |
| Submit a descriptive paragraph | A well-written paragraph is submitted | A few sentences are submitted | No description is supplied |

@ -0,0 +1,14 @@
# Regression with Scikit-learn
## Instructions
Take a look at the [Linnerud dataset](https://scikit-learn.org/stable/modules/generated/sklearn.datasets.load_linnerud.html#sklearn.datasets.load_linnerud) in Scikit-learn.
This dataset has multiple [targets](https://scikit-learn.org/stable/datasets/toy_dataset.html#linnerrud-dataset): it consists of three exercise variables (data) and three physiological variables (target) collected from twenty middle-aged men in a fitness club.
Then, in your own words, create a regression model that would describe the relationship between the waistline and the number of situps completed. In the same way, build models for the other data points in this dataset and explore the relationships among them.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| ------------------------------ | ----------------------------------- | ----------------------------- | -------------------------- |
| Submit a paragraph describing the relationships in the dataset | The relationships in the dataset are described well | Only a few of the relationships are described | Nothing is submitted |

@ -1,10 +1,13 @@
# Build a regression model using Scikit-learn: prepare and visualize data
> ![Data visualization infographic](./images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
![Data visualization infographic](./images/data-visualization.png)
Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)
> ### [This lesson is available in R!](./solution/lesson_2-R.ipynb)
## Introduction
Now that you are set up with the tools you need to start tackling machine learning model building with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.


@ -0,0 +1,644 @@
{
"nbformat": 4,
"nbformat_minor": 0,
"metadata": {
"colab": {
"name": "lesson_2-R.ipynb",
"provenance": [],
"collapsed_sections": [],
"toc_visible": true
},
"kernelspec": {
"name": "ir",
"display_name": "R"
},
"language_info": {
"name": "R"
}
},
"cells": [
{
"cell_type": "markdown",
"metadata": {
"id": "Pg5aexcOPqAZ"
},
"source": [
"# Build a regression model: prepare and visualize data\n",
"\n",
"## **Linear Regression for Pumpkins - Lesson 2**\n",
"#### Introduction\n",
"\n",
"Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potentials of your dataset.\n",
"\n",
"In this lesson, you will learn:\n",
"\n",
"- How to prepare your data for model-building.\n",
"\n",
"- How to use `ggplot2` for data visualization.\n",
"\n",
"The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.\n",
"\n",
"Let's see this by working through a practical exercise.\n",
"\n",
"![Artwork by \\@allison_horst](../images/unruly_data.jpg)<br>Artwork by \\@allison_horst"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "dc5WhyVdXAjR"
},
"source": [
"## 1. Importing pumpkins data and summoning the Tidyverse\n",
"\n",
"We'll require the following packages to slice and dice this lesson:\n",
"\n",
"- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to makes data science faster, easier and more fun!\n",
"\n",
"You can have them installed as:\n",
"\n",
"`install.packages(c(\"tidyverse\"))`\n",
"\n",
"The script below checks whether you have the packages required to complete this module and installs them for you in case some are missing."
]
},
{
"cell_type": "code",
"metadata": {
"id": "GqPYUZgfXOBt"
},
"source": [
"if (!require(\"pacman\")) install.packages(\"pacman\")\n",
"pacman::p_load(tidyverse)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "kvjDTPDSXRr2"
},
"source": [
"Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "VMri-t2zXqgD"
},
"source": [
"# Load the core Tidyverse packages\n",
"library(tidyverse)\n",
"\n",
"# Import the pumpkins data\n",
"pumpkins <- read_csv(file = \"https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv\")\n",
"\n",
"\n",
"# Get a glimpse and dimensions of the data\n",
"glimpse(pumpkins)\n",
"\n",
"\n",
"# Print the first 50 rows of the data set\n",
"pumpkins %>% \n",
" slice_head(n =50)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "REWcIv9yX29v"
},
"source": [
"A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` is of type character and there's also a strange column called `Package` where the data is a mix between `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.\n",
"\n",
"In fact, it is not very common to be gifted a dataset that is completely ready to use to create a ML model out of the box. But worry not, in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques to visualize the data.📈📊\n",
"<br>\n",
"\n",
"> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying \"and then\" in your code.\n",
"\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "Zxfb3AM5YbUe"
},
"source": [
"## 2. Check for missing data\n",
"\n",
"One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing, or unknown values, with special sentinel value: `NA` (Not Available).\n",
"\n",
"So how would we know that the data frame contains missing values?\n",
"<br>\n",
"- One straight forward way would be to use the base R function `anyNA` which returns the logical objects `TRUE` or `FALSE`"
]
},
{
"cell_type": "code",
"metadata": {
"id": "G--DQutAYltj"
},
"source": [
"pumpkins %>% \n",
" anyNA()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "mU-7-SB6YokF"
},
"source": [
"Great, there seems to be some missing data! That's a good place to start.\n",
"\n",
"- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`."
]
},
{
"cell_type": "code",
"metadata": {
"id": "W-DxDOR4YxSW"
},
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" head(n = 7)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "xUWxipKYY0o7"
},
"source": [
"Okay, got the job done but with a large data frame such as this, it would be inefficient and practically impossible to review all of the rows and columns individually😴.\n",
"\n",
"- A more intuitive way would be to calculate the sum of the missing values for each column:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "ZRBWV6P9ZArL"
},
"source": [
"pumpkins %>% \n",
" is.na() %>% \n",
" colSums()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "9gv-crB6ZD1Y"
},
"source": [
"Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.\n",
"\n",
"> Along with the awesome sets of packages and functions, R has a very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function."
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "o4jLY5-VZO2C"
},
"source": [
"## 3. Dplyr: A Grammar of Data Manipulation\n",
"\n",
"![Artwork by \\@allison_horst](../images/dplyr_wrangling.png)<br/>Artwork by \\@allison_horst"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "i5o33MQBZWWw"
},
"source": [
"[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!\n",
"<br>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "x3VGMAGBZiUr"
},
"source": [
"#### dplyr::select()\n",
"\n",
"`select()` is a function in the package `dplyr` which helps you pick columns to keep or exclude.\n",
"\n",
"To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.\n",
"\n",
"For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns."
]
},
{
"cell_type": "code",
"metadata": {
"id": "F_FgxQnVZnM0"
},
"source": [
"# Select desired columns\n",
"pumpkins <- pumpkins %>% \n",
" select(Package, `Low Price`, `High Price`, Date)\n",
"\n",
"\n",
"# Print data set\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "2KKo0Ed9Z1VB"
},
"source": [
"#### dplyr::mutate()\n",
"\n",
"`mutate()` is a function in the package `dplyr` which helps you create or modify columns, while keeping the existing columns.\n",
"\n",
"The general structure of mutate is:\n",
"\n",
"`data %>% mutate(new_column_name = what_it_contains)`\n",
"\n",
"Let's take `mutate` out for a spin using the `Date` column by doing the following operations:\n",
"\n",
"1. Convert the dates (currently of type character) to a month format (these are US dates, so the format is `MM/DD/YYYY`).\n",
"\n",
"2. Extract the month from the dates to a new column.\n",
"\n",
"In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations."
]
},
{
"cell_type": "code",
"metadata": {
"id": "5joszIVSZ6xe"
},
"source": [
"# Load lubridate\n",
"library(lubridate)\n",
"\n",
"pumpkins <- pumpkins %>% \n",
" # Convert the Date column to a date object\n",
" mutate(Date = mdy(Date)) %>% \n",
" # Extract month from Date\n",
" mutate(Month = month(Date)) %>% \n",
" # Drop Date column\n",
" select(-Date)\n",
"\n",
"# View the first few rows\n",
"pumpkins %>% \n",
" slice_head(n = 7)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "nIgLjNMCZ-6Y"
},
"source": [
"Woohoo! 🤩\n",
"\n",
"Next, let's create a new column `Price`, which represents the average price of a pumpkin. Now, let's take the average of the `Low Price` and `High Price` columns to populate the new Price column.\n",
"<br>"
]
},
{
"cell_type": "code",
"metadata": {
"id": "Zo0BsqqtaJw2"
},
"source": [
"# Create a new column Price\n",
"pumpkins <- pumpkins %>% \n",
" mutate(Price = (`Low Price` + `High Price`)/2)\n",
"\n",
"# View the first few rows of the data\n",
"pumpkins %>% \n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "p77WZr-9aQAR"
},
"source": [
"Yeees!💪\n",
"\n",
"\"But wait!\", you'll say after skimming through the whole data set with `View(pumpkins)`, \"There's something odd here!\"🤔\n",
"\n",
"If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.\n",
"\n",
"Let's verify this:"
]
},
{
"cell_type": "code",
"metadata": {
"id": "XISGfh0IaUy6"
},
"source": [
"# Verify the distinct observations in Package column\n",
"pumpkins %>% \n",
" distinct(Package)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "7sMjiVujaZxY"
},
"source": [
"Amazing!👏\n",
"\n",
"Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "L8Qfcs92ageF"
},
"source": [
"#### dplyr::filter() and stringr::str_detect()\n",
"\n",
"[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data only containing **rows** that satisfy your conditions, in this case, pumpkins with the string *bushel* in the `Package` column.\n",
"\n",
"[stringr::str_detect()](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.\n",
"\n",
"The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations."
]
},
{
"cell_type": "code",
"metadata": {
"id": "hy_SGYREampd"
},
"source": [
"# Retain only pumpkins with \"bushel\"\n",
"new_pumpkins <- pumpkins %>% \n",
" filter(str_detect(Package, \"bushel\"))\n",
"\n",
"# Get the dimensions of the new data\n",
"dim(new_pumpkins)\n",
"\n",
"# View a few rows of the new data\n",
"new_pumpkins %>% \n",
" slice_head(n = 5)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "VrDwF031avlR"
},
"source": [
"You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mLpw2jH4a0tx"
},
"source": [
"#### dplyr::case_when()\n",
"\n",
"**But wait! There's one more thing to do**\n",
"\n",
"Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.\n",
"\n",
"We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when` allows you to vectorise multiple `if_else()`statements.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "P68kLVQmbM6I"
},
"source": [
"# Convert the price if the Package contains fractional bushel values\n",
"new_pumpkins <- new_pumpkins %>% \n",
" mutate(Price = case_when(\n",
" str_detect(Package, \"1 1/9\") ~ Price/(1 + 1/9),\n",
" str_detect(Package, \"1/2\") ~ Price/(1/2),\n",
" TRUE ~ Price))\n",
"\n",
"# View the first few rows of the data\n",
"new_pumpkins %>% \n",
" slice_head(n = 30)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "pS2GNPagbSdb"
},
"source": [
"Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very `important` it is to `understand the nature of your data`!\n",
"\n",
"> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. \"A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds.\" It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!\n",
">\n",
"> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.\n",
"<br>\n"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "qql1SowfbdnP"
},
"source": [
"Now lastly, for the sheer sake of adventure 💁‍♀️, let's also move the Month column to the first position i.e `before` column `Package`.\n",
"\n",
"`dplyr::relocate()` is used to change column positions."
]
},
{
"cell_type": "code",
"metadata": {
"id": "JJ1x6kw8bixF"
},
"source": [
"# Create a new data frame new_pumpkins\n",
"new_pumpkins <- new_pumpkins %>% \n",
" relocate(Month, .before = Package)\n",
"\n",
"new_pumpkins %>% \n",
" slice_head(n = 7)"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "y8TJ0Za_bn5Y"
},
"source": [
"Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "mYSH6-EtbvNa"
},
"source": [
"## 4. Data visualization with ggplot2\n",
"\n",
"![Infographic by Dasani Madipalli](../images/data-visualization.png){width=\"600\"}\n",
"\n",
"There is a *wise* saying that goes like this:\n",
"\n",
"> \"The simple graph has brought more information to the data analyst's mind than any other device.\" --- John Tukey\n",
"\n",
"Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.\n",
"\n",
"Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.\n",
"\n",
"R offers a number of several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.\n",
"\n",
"Let's start with a simple scatter plot for the Price and Month columns.\n",
"\n",
"So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)) then add a layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.\n"
]
},
{
"cell_type": "code",
"metadata": {
"id": "g2YjnGeOcLo4"
},
"source": [
"# Set a theme for the plots\n",
"theme_set(theme_light())\n",
"\n",
"# Create a scatter plot\n",
"p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))\n",
"p + geom_point()"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Ml7SDCLQcPvE"
},
"source": [
"Is this a useful plot 🤷? Does anything about it surprise you?\n",
"\n",
"It's not particularly useful as all it does is display in your data as a spread of points in a given month.\n",
"<br>"
]
},
{
"cell_type": "markdown",
"metadata": {
"id": "jMakvJZIcVkh"
},
"source": [
"### **How do we make it useful?**\n",
"\n",
"To get charts to display useful data, you usually need to group the data somehow. For instance in our case, finding the average price of pumpkins for each month would provide more insights to the underlying patterns in our data. This leads us to one more **dplyr** flyby:\n",
"\n",
"#### `dplyr::group_by() %>% summarize()`\n",
"\n",
"Grouped aggregation in R can be easily computed using\n",
"\n",
"`dplyr::group_by() %>% summarize()`\n",
"\n",
"- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups such as per month.\n",
"\n",
"- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.\n",
"\n",
"For example, we can use the `dplyr::group_by() %>% summarize()` to group the pumpkins into groups based on the **Month** columns and then find the **mean price** for each month."
]
},
{
"cell_type": "code",
"metadata": {
"id": "6kVSUa2Bcilf"
},
"source": [
"# Find the average price of pumpkins per month\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price))"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "Kds48GUBcj3W"
},
"source": [
"Succinct!✨\n",
"\n",
"Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.\n",
"\n",
"Let's whip up one!"
]
},
{
"cell_type": "code",
"metadata": {
"id": "VNbU1S3BcrxO"
},
"source": [
"# Find the average price of pumpkins per month then plot a bar chart\n",
"new_pumpkins %>%\n",
" group_by(Month) %>% \n",
" summarise(mean_price = mean(Price)) %>% \n",
" ggplot(aes(x = Month, y = mean_price)) +\n",
" geom_col(fill = \"midnightblue\", alpha = 0.7) +\n",
" ylab(\"Pumpkin Price\")"
],
"execution_count": null,
"outputs": []
},
{
"cell_type": "markdown",
"metadata": {
"id": "zDm0VOzzcuzR"
},
"source": [
"🤩🤩This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?\n",
"\n",
"Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!"
]
}
]
}

@ -0,0 +1,345 @@
---
title: 'Build a regression model: prepare and visualize data'
output:
html_document:
df_print: paged
theme: flatly
highlight: breezedark
toc: yes
toc_float: yes
code_download: yes
---
## **Linear Regression for Pumpkins - Lesson 2**
#### Introduction
Now that you are set up with the tools you need to start tackling machine learning model building with Tidymodels and the Tidyverse, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.
In this lesson, you will learn:
- How to prepare your data for model-building.
- How to use `ggplot2` for data visualization.
The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.
Let's see this by working through a practical exercise.
![Artwork by \@allison_horst](../images/unruly_data.jpg){width="700"}
## 1. Importing pumpkins data and summoning the Tidyverse
We'll require the following packages to slice and dice this lesson:
- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
You can install them as follows:
`install.packages(c("tidyverse"))`
The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
```{r, message=F, warning=F}
if (!require("pacman")) install.packages("pacman")
pacman::p_load(tidyverse)
```
Now, let's fire up some packages and load the [data](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/data/US-pumpkins.csv) provided for this lesson!
```{r load_tidy_verse_models, message=F, warning=F}
# Load the core Tidyverse packages
library(tidyverse)
# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")
# Get a glimpse and dimensions of the data
glimpse(pumpkins)
# Print the first 50 rows of the data set
pumpkins %>%
slice_head(n =50)
```
A quick `glimpse()` immediately shows that there are blanks and a mix of strings (`chr`) and numeric data (`dbl`). The `Date` column is of type character, and there's also a strange column called `Package` where the data is a mix of `sacks`, `bins` and other values. The data, in fact, is a bit of a mess 😤.
It's actually not very common to be handed a dataset that is completely ready to use for creating an ML model out of the box. But worry not: in this lesson, you will learn how to prepare a raw dataset using standard R libraries 🧑‍🔧. You will also learn various techniques to visualize the data. 📈📊
> A refresher: The pipe operator (`%>%`) performs operations in logical sequence by passing an object forward into a function or call expression. You can think of the pipe operator as saying "and then" in your code.
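To see the pipe in action, here is a tiny, self-contained sketch (using base R's built-in `mtcars` data rather than our pumpkins, purely for illustration) that writes the same operation first as a nested call and then as a pipe:
```{r pipe_demo, message=F, warning=F}
# Nested call: read from the inside out
head(filter(mtcars, cyl == 6), n = 3)

# The same operation with the pipe: read left to right as "and then"
mtcars %>% 
  filter(cyl == 6) %>% 
  head(n = 3)
```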
## 2. Check for missing data
One of the most common issues data scientists need to deal with is incomplete or missing data. R represents missing or unknown values with the special sentinel value `NA` (Not Available).
So how would we know that the data frame contains missing values?
- One straightforward way would be to use the base R function `anyNA()`, which returns a single logical value: `TRUE` or `FALSE`
```{r anyNA, message=F, warning=F}
pumpkins %>%
anyNA()
```
Great, there seems to be some missing data! That's a good place to start.
- Another way would be to use the function `is.na()` that indicates which individual column elements are missing with a logical `TRUE`.
```{r is_na, message=F, warning=F}
pumpkins %>%
is.na() %>%
head(n = 7)
```
Okay, that got the job done, but with a large data frame like this, it would be inefficient and practically impossible to review every row and column individually 😴.
- A more intuitive way would be to calculate the sum of the missing values for each column:
```{r colSum_NA, message=F, warning=F}
pumpkins %>%
is.na() %>%
colSums()
```
Much better! There is missing data, but maybe it won't matter for the task at hand. Let's see what further analysis brings forth.
> Along with its awesome sets of packages and functions, R has very good documentation. For instance, use `help(colSums)` or `?colSums` to find out more about the function.
## 3. Dplyr: A Grammar of Data Manipulation
![Artwork by \@allison_horst](../images/dplyr_wrangling.png){width="569"}
[`dplyr`](https://dplyr.tidyverse.org/), a package in the Tidyverse, is a grammar of data manipulation that provides a consistent set of verbs that help you solve the most common data manipulation challenges. In this section, we'll explore some of dplyr's verbs!
#### dplyr::select()
`select()` is a function in the package `dplyr` which helps you pick columns to keep or exclude.
To make your data frame easier to work with, drop several of its columns, using `select()`, keeping only the columns you need.
For instance, in this exercise, our analysis will involve the columns `Package`, `Low Price`, `High Price` and `Date`. Let's select these columns.
```{r select, message=F, warning=F}
# Select desired columns
pumpkins <- pumpkins %>%
select(Package, `Low Price`, `High Price`, Date)
# Print data set
pumpkins %>%
slice_head(n = 5)
```
#### dplyr::mutate()
`mutate()` is a function in the package `dplyr` which helps you create or modify columns, while keeping the existing columns.
The general structure of mutate is:
`data %>% mutate(new_column_name = what_it_contains)`
Let's take `mutate` out for a spin using the `Date` column by doing the following operations:
1. Convert the dates (currently of type character) to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month from the dates to a new column.
In R, the package [lubridate](https://lubridate.tidyverse.org/) makes it easier to work with Date-time data. So, let's use `dplyr::mutate()`, `lubridate::mdy()`, `lubridate::month()` and see how to achieve the above objectives. We can drop the Date column since we won't be needing it again in subsequent operations.
```{r mut_date, message=F, warning=F}
# Load lubridate
library(lubridate)
pumpkins <- pumpkins %>%
# Convert the Date column to a date object
mutate(Date = mdy(Date)) %>%
# Extract month from Date
mutate(Month = month(Date)) %>%
# Drop Date column
select(-Date)
# View the first few rows
pumpkins %>%
slice_head(n = 7)
```
Woohoo! 🤩
Next, let's create a new column `Price`, which represents the average price of a pumpkin, by taking the average of the `Low Price` and `High Price` columns.
```{r price, message=F, warning=F}
# Create a new column Price
pumpkins <- pumpkins %>%
mutate(Price = (`Low Price` + `High Price`)/2)
# View the first few rows of the data
pumpkins %>%
slice_head(n = 5)
```
Yeees!💪
"But wait!", you'll say after skimming through the whole data set with `View(pumpkins)`, "There's something odd here!"🤔
If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold in `1 1/9 bushel` measures, and some in `1/2 bushel` measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
Let's verify this:
```{r Package, message=F, warning=F}
# Verify the distinct observations in Package column
pumpkins %>%
distinct(Package)
```
Amazing!👏
Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string *bushel* in the `Package` column and put this in a new data frame `new_pumpkins`.
#### dplyr::filter() and stringr::str_detect()
[`dplyr::filter()`](https://dplyr.tidyverse.org/reference/filter.html): creates a subset of the data only containing **rows** that satisfy your conditions, in this case, pumpkins with the string *bushel* in the `Package` column.
[`stringr::str_detect()`](https://stringr.tidyverse.org/reference/str_detect.html): detects the presence or absence of a pattern in a string.
The [`stringr`](https://github.com/tidyverse/stringr) package provides simple functions for common string operations.
```{r filter, message=F, warning=F}
# Retain only pumpkins with "bushel"
new_pumpkins <- pumpkins %>%
filter(str_detect(Package, "bushel"))
# Get the dimensions of the new data
dim(new_pumpkins)
# View a few rows of the new data
new_pumpkins %>%
slice_head(n = 5)
```
You can see that we have narrowed down to 415 or so rows of data containing pumpkins by the bushel.🤩
#### dplyr::case_when()
**But wait! There's one more thing to do**
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel. Time to do some math to standardize it.
We'll use the function [`case_when()`](https://dplyr.tidyverse.org/reference/case_when.html) to *mutate* the Price column depending on some conditions. `case_when()` allows you to vectorise multiple `if_else()` statements.
```{r normalize_price, message=F, warning=F}
# Convert the price if the Package contains fractional bushel values
new_pumpkins <- new_pumpkins %>%
mutate(Price = case_when(
str_detect(Package, "1 1/9") ~ Price/(1 + 1/9),
str_detect(Package, "1/2") ~ Price/(1/2),
TRUE ~ Price))
# View the first few rows of the data
new_pumpkins %>%
slice_head(n = 30)
```
Now, we can analyze the pricing per unit based on their bushel measurement. All this study of bushels of pumpkins, however, goes to show how very *important* it is to *understand the nature of your data*!
> ✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel.
>
> ✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
Now lastly, for the sheer sake of adventure 💁‍♀️, let's also move the Month column to the first position, i.e. *before* the `Package` column.
`dplyr::relocate()` is used to change column positions.
```{r new_pumpkins, message=F, warning=F}
# Create a new data frame new_pumpkins
new_pumpkins <- new_pumpkins %>%
relocate(Month, .before = Package)
new_pumpkins %>%
slice_head(n = 7)
```
Good job!👌 You now have a clean, tidy dataset on which you can build your new regression model!
## 4. Data visualization with ggplot2
![Infographic by Dasani Madipalli](../images/data-visualization.png){width="600"}
There is a *wise* saying that goes like this:
> "The simple graph has brought more information to the data analyst's mind than any other device." --- John Tukey
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
R offers several systems for making graphs, but [`ggplot2`](https://ggplot2.tidyverse.org/index.html) is one of the most elegant and most versatile. `ggplot2` allows you to compose graphs by **combining independent components**.
Let's start with a simple scatter plot for the Price and Month columns.
So in this case, we'll start with [`ggplot()`](https://ggplot2.tidyverse.org/reference/ggplot.html), supply a dataset and aesthetic mapping (with [`aes()`](https://ggplot2.tidyverse.org/reference/aes.html)), then add layers (like [`geom_point()`](https://ggplot2.tidyverse.org/reference/geom_point.html)) for scatter plots.
```{r scatter_plt, message=F, warning=F}
# Set a theme for the plots
theme_set(theme_light())
# Create a scatter plot
p <- ggplot(data = new_pumpkins, aes(x = Price, y = Month))
p + geom_point()
```
Is this a useful plot 🤷? Does anything about it surprise you?
It's not particularly useful, as all it does is display your data as a spread of points in a given month.
### **How do we make it useful?**
To get charts to display useful data, you usually need to group the data somehow. For instance, in our case, finding the average price of pumpkins for each month would provide more insight into the underlying patterns in our data. This leads us to one more **dplyr** flyby:
#### `dplyr::group_by() %>% summarize()`
Grouped aggregation in R can be easily computed using
`dplyr::group_by() %>% summarize()`
- `dplyr::group_by()` changes the unit of analysis from the complete dataset to individual groups such as per month.
- `dplyr::summarize()` creates a new data frame with one column for each grouping variable and one column for each of the summary statistics that you have specified.
For example, we can use `dplyr::group_by() %>% summarize()` to group the pumpkins based on the **Month** column and then find the **mean price** for each month.
```{r grp_sumry, message=F, warning=F}
# Find the average price of pumpkins per month
new_pumpkins %>%
group_by(Month) %>%
summarise(mean_price = mean(Price))
```
Succinct!✨
Categorical features such as months are better represented using a bar plot 📊. The layers responsible for bar charts are `geom_bar()` and `geom_col()`. Consult `?geom_bar` to find out more.
Let's whip up one!
```{r bar_plt, message=F, warning=F}
# Find the average price of pumpkins per month then plot a bar chart
new_pumpkins %>%
group_by(Month) %>%
summarise(mean_price = mean(Price)) %>%
ggplot(aes(x = Month, y = mean_price)) +
geom_col(fill = "midnightblue", alpha = 0.7) +
ylab("Pumpkin Price")
```
🤩🤩 This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
Congratulations on finishing the second lesson 👏! You prepared your data for model building, then uncovered more insights using visualizations!

@ -0,0 +1,201 @@
# Build a regression model using Scikit-learn: prepare and visualize data
> ![Data visualization infographic](../images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)
## Introduction
Now that you have the tools you need to start tackling machine learning model building with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it's very important to understand how to ask the right question to properly unlock the potential of your dataset.
In this lesson, you will learn:
- How to prepare your data for model building.
- How to use Matplotlib for data visualization.
## Asking the right question of your data
The question you need answered will determine what type of ML algorithms you will leverage. And the quality of the answer you get back will be heavily dependent on the nature of your data.
Take a look at the [data](../../data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick skim immediately shows that there are blanks and a mix of strings and numeric data. There's also a strange column called "Package" where the data is a mix of "sacks", "bins" and other values. The data, in fact, is a bit of a mess.
Indeed, it is not very common to be handed a dataset that is completely ready to use for creating an ML model out of the box. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data.
## Case study: 'the pumpkin market'
In this folder you will find a .csv file in the root `data` folder called [US-pumpkins.csv](../../data/US-pumpkins.csv) which includes 1757 lines of data about the pumpkin market, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.
### Preparing the data
This data is in the public domain. It can be downloaded in many separate files, per city, from the USDA web site. To avoid too many separate files, we have concatenated all the city data into one spreadsheet, so the data has already been _prepared_ a bit. Next, let's take a closer look at the data.
### The pumpkin data - early conclusions
What do you notice about this data? You've already seen that there is a mix of strings, numbers, blanks and strange values that you need to make sense of.
What question can you ask of this data, using a regression technique? What about "Predict the price of a pumpkin for sale during a given month"? Looking again at the data, there are some changes you need to make to create the data structure required for the task.
## Exercise - analyze the pumpkin data
Let's use [Pandas](https://pandas.pydata.org/) (the name stands for `Python Data Analysis`), a very useful tool for shaping data, to analyze and prepare this pumpkin data.
### First, check for missing dates
You will first need to take steps to check for missing dates:
1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month to a new column.
Open the _notebook.ipynb_ file in Visual Studio Code and import the spreadsheet into a new Pandas dataframe.
1. Use the `head()` function to view the first five rows.
```python
import pandas as pd
pumpkins = pd.read_csv('../data/US-pumpkins.csv')
pumpkins.head()
```
✅ What function would you use to view the last five rows?
1. Check if there is missing data in the current dataframe:
```python
pumpkins.isnull().sum()
```
There is missing data, but maybe it won't matter for the task at hand.
1. To make your dataframe easier to work with, drop several of its columns, using `drop()`, keeping only the columns you need:
```python
new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
```
### Second, determine the average price of a pumpkin
Think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns.
Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices.
1. To calculate the average, add the following code:
```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
month = pd.DatetimeIndex(pumpkins['Date']).month
```
✅ Feel free to print any data you'd like to check using `print(month)`.
2. Now, copy your converted data into a fresh Pandas dataframe:
```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
```
Printing out your dataframe will show you a clean, tidy dataset on which you can build your new regression model.
### But wait! There's something odd here.
Looking at the `Package` column, pumpkins are sold in many different configurations. Some are sold in '1 1/9 bushel' measures, some in '1/2 bushel' measures, some per pumpkin, some per pound, and some in big boxes with varying widths.
> Pumpkins seem very hard to weigh consistently
Digging into the original data, it's interesting that anything with `Unit of Sale` equalling 'EACH' or 'PER BIN' also has the `Package` type 'per inch', 'per bin', or 'each'. Pumpkins seem to be very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string 'bushel' in their `Package` column.
1. Add a filter at the top of the file, under the initial .csv import:
```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
```
If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins by the bushel.
### But wait! There's one more thing to do.
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the pricing per bushel, so do some math to standardize it.
1. Add these lines after the block creating the new_pumpkins dataframe:
```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
```
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it's a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
Now, you can analyze the pricing per unit based on their bushel measurement. If you print out the data one more time, you can see how it's standardized.
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken by one big hollow pie pumpkin.
## Visualization strategies
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).
> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).
## Exercise - experiment with Matplotlib
Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show?
1. Import Matplotlib at the top of the file, under the Pandas import:
```python
import matplotlib.pyplot as plt
```
1. Rerun the entire notebook to refresh.
1. At the bottom of the notebook, add a cell to plot the data:
```python
price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(price, month)
plt.show()
```
![A scatterplot showing the relationship between price and month](../images/scatterplot.png)
Is this a useful plot? Does anything about it surprise you?
It's not particularly useful, as all it does is display your data as a spread of points in a given month.
### Make it useful
To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot showing the distribution of the data, where the x axis shows the months.
1. Add a cell to create a grouped bar chart:
```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
```
![A bar chart showing the relationship between price and month](../images/barchart.png)
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectation? Why or why not?
---
## 🚀 Challenge
Explore the different types of visualization that Matplotlib offers. Which types are most appropriate for regression problems?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/)
## Review & Self Study
Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations versus 3D visualizations. What did you discover?
## Assignment
[Exploring visualizations](assignment.it.md)

@ -0,0 +1,206 @@
# Building a regression model with Scikit-learn: prepare and visualize data
> ![Data visualization infographic](../images/data-visualization.png)
>
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11?loc=ja)
## Introduction
Now that you have the tools you need to start building machine learning models with Scikit-learn, you are ready to start asking questions of your data. When working with data and applying ML solutions, it is very important to ask the right questions in order to properly unlock the potential of your dataset.
In this lesson, you will learn:
- How to prepare your data for model building
- How to use Matplotlib for data visualization
## Asking the right questions of your data
The question you want answered determines which ML algorithms you will use. And the quality of the answer you get back depends heavily on the nature of your data.
Take a look at the [data](../../data/US-pumpkins.csv) prepared for this lesson. You can open this .csv file in VS Code. A quick skim shows that there are blanks and a mix of string and numeric data. There is also a strange column called "Package" that mixes values with different units, such as "sacks" and "bins". The data is, in fact, a bit of a mess.
In practice, it is not very common to receive a tidy dataset that is ready to use for building an ML model as-is. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques for visualizing data.
## Case study: the pumpkin market
In the root `data` folder there is a .csv file called [US-pumpkins.csv](../../data/US-pumpkins.csv), which contains 1757 rows of data about the pumpkin market, grouped by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.
### Preparing the data
This data is in the public domain. It can be downloaded from the USDA website as separate files per city. To avoid ending up with too many separate files, we have concatenated all the city data into one spreadsheet. Next, let's take a closer look at the data.
### The pumpkin data - first conclusions
What do you notice about this data? You have probably noticed that it is a mix of strings, numbers, blanks and strange values that you need to make sense of.
What question could you ask of this data using regression? How about "predict the price of a pumpkin for sale during a given month"? Looking at the data again, there are a few changes needed to create the data structure required for this task.
## Exercise - analyze the pumpkin data
Let's use [Pandas](https://pandas.pydata.org/) (short for Python Data Analysis), a very useful tool for shaping data, to analyze and prepare this pumpkin data.
### First, check for missing dates
There are a few steps to check for missing dates:
1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month into a new column.
Open the _notebook.ipynb_ file in Visual Studio Code and import the spreadsheet as a Pandas DataFrame.
1. Use the `head()` function to view the first five rows.
```python
import pandas as pd
pumpkins = pd.read_csv('../data/US-pumpkins.csv')
pumpkins.head()
```
✅ What function would you use to view the last five rows?
2. Check whether there is missing data in the current dataframe:
```python
pumpkins.isnull().sum()
```
There is missing data, but it probably won't matter for the task at hand.
3. To make the dataframe easier to work with, use the `drop()` function to remove several columns, keeping only the columns you need:
```python
new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
```
### Next, determine the average price of a pumpkin
Think about how to determine the average price of a pumpkin in a given month. Which columns would you need for this task? Hint: you'll need 3 columns.
Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices.
1. To calculate the average, add the following code:
```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
month = pd.DatetimeIndex(pumpkins['Date']).month
```
Feel free to inspect any data you'd like to check, for example with `print(month)`.
2. Now copy your converted data into a fresh Pandas dataframe:
```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
```
Printing out the dataframe will show you a clean, tidy dataset on which you can build your new regression model.
### But wait! Something looks odd.
Looking at the `Package` column, pumpkins are sold in many different configurations: some by the "1 1/9 bushel", some by the "1/2 bushel", some per pumpkin, some per pound, and some in big boxes of varying widths.
> Pumpkins seem very hard to weigh consistently.
Digging into the original data, it is interesting that anything with a `Unit of Sale` of "EACH" or "PER BIN" also has a `Package` of "per inch", "per bin" or "each". Pumpkins seem very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string "bushel" in the `Package` column.
1. Add a filter at the top of the file:
```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
```
If you print the data now, you can see that you only get the 415 or so rows of data containing pumpkins by the bushel.
### But wait! There is one more thing to do.
Did you notice that the bushel amount varies per row? To show the price per bushel, you need to do some math to standardize the pricing.
1. Add these lines after the block that creates the new_pumpkins dataframe:
```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
```
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, since it is a volume measurement. A bushel of tomatoes, for example, is supposed to weigh 56 pounds. Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds. It's all pretty complicated! Let's not bother converting bushels to pounds and instead price by the bushel. All this discussion of bushels of pumpkins, however, shows just how important it is to understand the nature of your data!
Now you can analyze the pricing per unit based on the bushel measurement. If you print the data one more time, you can see that it has been standardized.
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Small pumpkins are far pricier than big ones, probably because a big, hollow pumpkin takes up a lot of unused space per unit of volume.
## Visualization strategies
Part of a data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations (plots, graphs and charts) showing different aspects of the data. In this way, they can visually show relationships and gaps that would otherwise be hard to discover.
Visualizations can also help you determine the machine learning technique best suited to the data. For example, a scatterplot that appears to follow a line suggests that linear regression is a good candidate technique to apply.
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).
> Get more hands-on experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).
## Exercise - experiment with Matplotlib
Try creating some basic plots to display the new dataframe you just created. What could you read from a basic line plot?
1. Import Matplotlib at the top of the file, under the Pandas import.
```python
import matplotlib.pyplot as plt
```
1. Rerun the entire notebook to refresh.
2. At the bottom of the notebook, add a cell to plot the data:
```python
price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(price, month)
plt.show()
```
![A scatterplot showing the relationship between price and month](../images/scatterplot.png)
Is this a useful plot? Does anything about it surprise you?
It is not particularly useful, as all it does is display your data as a spread of points for a given month.
### Make it useful
To get charts to display useful data, you usually need to group the data somehow. Here, let's make a plot where the x axis shows the months and the data demonstrates the distribution of the data.
1. Add a cell to create a grouped bar chart:
```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
```
![A bar chart showing the relationship between price and month](../images/barchart.png)
This plot is a more useful data visualization! It seems to indicate that pumpkin prices are highest in September and October. Does that meet your expectations? In what ways does it, and in what ways does it not?
---
## 🚀 Challenge
Explore the different types of visualization that Matplotlib offers. Which types are best suited to regression problems?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12?loc=ja)
## Review & Self Study
Look at the many ways to visualize data. Make a list of the various libraries available and note which are best for particular types of tasks, for example 2D versus 3D visualizations. What did you discover?
## Assignment
[Exploring visualizations](./assignment.ja.md)

@ -0,0 +1,202 @@
# Build a regression model using Scikit-learn: prepare and visualize data
> ![Data visualization infographic](../images/data-visualization.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/11/)
## Introduction
Now that you have set up the tools you need to start tackling machine learning model building with Scikit-learn, you are ready to start asking questions of your data. As you work with data and apply ML solutions, it is very important to understand how to ask the right question to properly unlock the potential of your dataset.
In this lesson, you will learn:
- How to prepare your data for model building.
- How to use Matplotlib for data visualization.
## Asking the right question of your data
The question you need answered will determine what type of ML algorithms you will use. And the quality of the answer you get back will depend heavily on the nature of your data.
Take a look at the [data](../data/US-pumpkins.csv) provided for this lesson. You can open this .csv file in VS Code. A quick skim shows that there are blanks and a mix of strings and numeric data. There is also a strange column called "Package" whose values are a mix of "sacks", "bins" and other values. The data, in fact, is a bit of a mess.
Indeed, it is not very common to be handed a dataset that is completely ready to use for creating an ML model out of the box. In this lesson, you will learn how to prepare a raw dataset using standard Python libraries. You will also learn various techniques to visualize the data.
## Case study: 'the pumpkin market'
In the `data` folder you will find a .csv file called [US-pumpkins.csv](../data/US-pumpkins.csv), which includes 1757 rows of data about the pumpkin market, sorted into groupings by city. This is raw data extracted from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture.
### Preparing data
This data is in the public domain. It can be downloaded from the USDA website, with many different files per city. To avoid too many separate files, we have merged all the city data into one spreadsheet, so the data has already been prepared a bit. Next, let's take a closer look at the data.
### The pumpkin data - early conclusions
What do you notice about this data? You have already seen a mix of strings, numbers, blanks and strange values that you need to make sense of.
What question can you ask of this data using a regression technique? What about "Predict the price of a pumpkin for sale during a given month"? Looking at the data again, you need to make some changes to create the data structure required for the task.
## Exercise - analyze the pumpkin data
Let's use [Pandas](https://pandas.pydata.org/) (the name stands for "Python Data Analysis"), a very useful tool, to analyze and prepare the pumpkin data.
### First, check for missing dates
You will first need to take these steps to check for missing dates:
1. Convert the dates to a month format (these are US dates, so the format is `MM/DD/YYYY`).
2. Extract the month to a new column.
Open the notebook.ipynb file in Visual Studio Code and import the spreadsheet into a new Pandas dataframe.
1. Use the `head()` function to view the first five rows.
```python
import pandas as pd
pumpkins = pd.read_csv('../../data/US-pumpkins.csv')
pumpkins.head()
```
✅ What function would you use to view the last five rows?
2. Check if there is missing data in the current dataframe:
```python
pumpkins.isnull().sum()
```
There is missing data, but maybe it won't matter for the task at hand.
3. To make your dataframe easier to work with, use `drop()` to remove several of its columns, keeping only the columns you need:
```python
new_columns = ['Package', 'Month', 'Low Price', 'High Price', 'Date']
pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
```
### Then, determine the average price of a pumpkin
Think about how to determine the average price of a pumpkin in a given month. What columns would you pick for this task? Hint: you'll need 3 columns.
Solution: take the average of the `Low Price` and `High Price` columns to populate the new Price column, and convert the Date column to only show the month. Fortunately, according to the check above, there is no missing data for dates or prices.
1. To calculate the average, add the following code:
```python
price = (pumpkins['Low Price'] + pumpkins['High Price']) / 2
month = pd.DatetimeIndex(pumpkins['Date']).month
```
✅ Feel free to print any data you'd like to check using `print(month)`.
2. Now copy the converted data into a new Pandas dataframe:
```python
new_pumpkins = pd.DataFrame({'Month': month, 'Package': pumpkins['Package'], 'Low Price': pumpkins['Low Price'],'High Price': pumpkins['High Price'], 'Price': price})
```
Printing out the dataframe will show you a clean, tidy dataset on which you can build your new regression model.
### But wait! Something's odd here
If you look at the `Package` column, pumpkins are sold in many different configurations. Some are sold by the "1 1/9 bushel", some by the "1/2 bushel", some per pumpkin, some per pound, and some in big boxes of varying widths.
> Pumpkins seem very hard to weigh consistently
Digging into the original data, it is interesting that anything with a `Unit of Sale` equalling "EACH" or "PER BIN" also has a `Package` type of per inch, per bin, or "each". Pumpkins seem very hard to weigh consistently, so let's filter them by selecting only pumpkins with the string "bushel" in their `Package` column.
1. Add a filter under the initial .csv import:
```python
pumpkins = pumpkins[pumpkins['Package'].str.contains('bushel', case=True, regex=True)]
```
If you print the data now, you can see that you are only getting the 415 or so rows of data containing pumpkins by the bushel.
### But wait! There's one more thing to do
Did you notice that the bushel amount varies per row? You need to normalize the pricing so that you show the price per bushel, so do some math to standardize it.
1. Add these lines after the block creating the new_pumpkins dataframe:
```python
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1 1/9'), 'Price'] = price/(1 + 1/9)
new_pumpkins.loc[new_pumpkins['Package'].str.contains('1/2'), 'Price'] = price/(1/2)
```
✅ According to [The Spruce Eats](https://www.thespruceeats.com/how-much-is-a-bushel-1389308), a bushel's weight depends on the type of produce, as it is a volume measurement. "A bushel of tomatoes, for example, is supposed to weigh 56 pounds... Leaves and greens take up more space with less weight, so a bushel of spinach is only 20 pounds." It's all pretty complicated! Let's not bother with making a bushel-to-pound conversion, and instead price by the bushel. All this study of bushels of pumpkins, however, goes to show how very important it is to understand the nature of your data!
Now, you can analyze the pricing per unit based on the bushel measurement. If you print the data one more time, you can see how it has been standardized.
✅ Did you notice that pumpkins sold by the half-bushel are very expensive? Can you figure out why? Hint: little pumpkins are way pricier than big ones, probably because there are so many more of them per bushel, given the unused space taken up by one big hollow pie pumpkin.
## Visualization strategies
Part of the data scientist's role is to demonstrate the quality and nature of the data they are working with. To do this, they often create interesting visualizations, or plots, graphs, and charts, showing different aspects of the data. In this way, they are able to visually show relationships and gaps that are otherwise hard to uncover.
Visualizations can also help determine the machine learning technique most appropriate for the data. A scatterplot that seems to follow a line, for example, indicates that the data is a good candidate for a linear regression exercise.
One data visualization library that works well in Jupyter notebooks is [Matplotlib](https://matplotlib.org/) (which you also saw in the previous lesson).
> Get more experience with data visualization in [these tutorials](https://docs.microsoft.com/learn/modules/explore-analyze-data-with-python?WT.mc_id=academic-15963-cxa).
## Exercise - experiment with Matplotlib
Try to create some basic plots to display the new dataframe you just created. What would a basic line plot show?
1. Import Matplotlib at the top of the file:
```python
import matplotlib.pyplot as plt
```
2. Rerun the entire notebook to refresh.
3. At the bottom of the notebook, add a cell to plot the data:
```python
price = new_pumpkins.Price
month = new_pumpkins.Month
plt.scatter(price, month)
plt.show()
```
![A scatterplot showing the relationship between price and month](../images/scatterplot.png)
Is this a useful plot? Does anything about it surprise you?
It is not particularly useful, as all it does is display your data as a spread of points in a given month.
### Make it useful
To get charts to display useful data, you usually need to group the data somehow. Let's try creating a plot showing the distribution of the data, with the months along the x axis.
1. Add a cell to create a grouped bar chart:
```python
new_pumpkins.groupby(['Month'])['Price'].mean().plot(kind='bar')
plt.ylabel("Pumpkin Price")
```
![A bar chart showing the relationship between price and month](../images/barchart.png)
This is a more useful data visualization! It seems to indicate that the highest price for pumpkins occurs in September and October. Does that meet your expectations? Why or why not?
---
## 🚀 Challenge
Explore the different types of visualization that Matplotlib offers. Which types are most appropriate for regression problems?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/12/)
## Review & Self Study
Take a look at the many ways to visualize data. Make a list of the various libraries available and note which are best for given types of tasks, for example 2D visualizations versus 3D visualizations. What did you discover?
## Assignment
[Explore visualization](../assignment.md)

@ -0,0 +1,9 @@
# Exploring visualizations
There are several different libraries available for data visualization. Create some visualizations using the pumpkin data from this lesson with matplotlib and seaborn in a sample notebook. Which libraries are easier to work with?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is submitted with two explorations/visualizations | A notebook is submitted with one exploration/visualization | A notebook is not submitted |

@ -0,0 +1,9 @@
# Exploring visualizations
There are several different libraries for data visualization. Using the pumpkin data from this lesson, create some visualizations in a sample notebook with matplotlib and seaborn. Which libraries are easier to work with?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is submitted with two explorations/visualizations | A notebook is submitted with one exploration/visualization | No notebook is submitted |

@ -0,0 +1,9 @@
# Explore data visualization
Several libraries are available for data visualization. Create some visualizations of the Pumpkin dataset from this lesson using matplotlib and seaborn, and consider which library is easier to use.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | --------- | -------- | ----------------- |
| | A notebook is submitted with two exploration/visualization approaches | A notebook is submitted with one exploration/visualization approach | No notebook is submitted |

Binary file not shown (new image, 873 KiB).

Binary file not shown (new image, 6.3 MiB).

File diff suppressed because it is too large.

@ -0,0 +1,679 @@
---
title: 'Build a regression model: linear and polynomial regression models'
output:
html_document:
df_print: paged
theme: flatly
highlight: breezedark
toc: yes
toc_float: yes
code_download: yes
---
## Linear and Polynomial Regression for Pumpkin Pricing - Lesson 3
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
#### Introduction
So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that we will use throughout this lesson. You have also visualized it using `ggplot2`.💪
Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: *basic linear regression* and *polynomial regression*, along with some of the math underlying these techniques.
> Throughout this curriculum, we assume minimal knowledge of math, and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid in comprehension.
#### Preparation
As a reminder, you are loading this data so as to ask questions of it.
- When is the best time to buy pumpkins?
- What price can I expect of a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or by the 1 1/9 bushel box? Let's keep digging into this data.
In the previous lesson, you created a `tibble` (a modern reimagining of the data frame) and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points and only for the fall months. Maybe we can get a little more detail about the nature of the data by cleaning it more? We'll see... 🕵️‍♀️
For this task, we'll require the following packages:
- `tidyverse`: The [tidyverse](https://www.tidyverse.org/) is a [collection of R packages](https://www.tidyverse.org/packages) designed to make data science faster, easier and more fun!
- `tidymodels`: The [tidymodels](https://www.tidymodels.org/) framework is a [collection of packages](https://www.tidymodels.org/packages/) for modeling and machine learning.
- `janitor`: The [janitor package](https://github.com/sfirke/janitor) provides simple little tools for examining and cleaning dirty data.
- `corrplot`: The [corrplot package](https://cran.r-project.org/web/packages/corrplot/vignettes/corrplot-intro.html) provides a visual exploratory tool on correlation matrix that supports automatic variable reordering to help detect hidden patterns among variables.
You can install them as follows:
`install.packages(c("tidyverse", "tidymodels", "janitor", "corrplot"))`
The script below checks whether you have the packages required to complete this module and installs them for you in case they are missing.
```{r, message=F, warning=F}
suppressWarnings(if (!require("pacman")) install.packages("pacman"))
pacman::p_load(tidyverse, tidymodels, janitor, corrplot)
```
We'll later load these awesome packages and make them available in our current R session. (This is just for illustration; `pacman::p_load()` has already done that for you.)
## 1. A linear regression line
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a *line of best fit* to:
- **Show variable relationships**. Show the relationship between variables.
- **Make predictions**. Make accurate predictions on where a new data point would fall in relation to that line.
To draw this type of line, we use a statistical technique called **Least-Squares Regression**. The term `least-squares` means that all the data points surrounding the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`. As such, the line of best fit is the line that gives us the lowest value for the sum of the squared errors - hence the name *least squares regression*.
We do so since we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them since we are concerned with its magnitude rather than its direction.
> **🧮 Show me the math**
>
> This line, called the *line of best fit* can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):
>
> Y = a + bX
>
> `X` is the '`explanatory variable` or `predictor`'. `Y` is the '`dependent variable` or `outcome`'. The slope of the line is `b` and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
>
> ![Infographic by Jen Looper](../images/slope.png){width="400"}
>
> First, calculate the slope `b`.
>
> In other words, and referring to our pumpkin data's original question: "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` would refer to the month of sale.
>
> ![Infographic by Jen Looper](../images/calculation.png)
>
> Calculate the value of Y. If you're paying around \$4, it must be April!
>
> The math that calculates the line must demonstrate the slope of the line, which is also dependent on the intercept, or where `Y` is situated when `X = 0`.
>
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) web site. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.
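>
> For reference, the slope and intercept that those calculators compute come from the standard least-squares formulas, reproduced here as a sketch in the same notation:
>
> ```latex
> b = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sum_i (X_i - \bar{X})^2}
> \qquad
> a = \bar{Y} - b\,\bar{X}
> ```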
Not so scary, right? 🤓
#### Correlation
One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatterplot, you can quickly visualize this coefficient. A plot with datapoints scattered in a neat line has high correlation, but a plot with datapoints scattered everywhere between X and Y has low correlation.
A good linear regression model will be one that has a high (nearer to 1 than 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
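If you want the formula behind that intuition, the correlation coefficient in question is Pearson's *r*, a standard definition included here for reference:

```latex
r = \frac{\sum_i (X_i - \bar{X})(Y_i - \bar{Y})}{\sqrt{\sum_i (X_i - \bar{X})^2}\,\sqrt{\sum_i (Y_i - \bar{Y})^2}}
```

Values near 1 (or -1) indicate a strong linear relationship; values near 0 indicate a weak one.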
## **2. A dance with data: creating a data frame that will be used for modelling**
![Artwork by \@allison_horst](../images/janitor.jpg){width="700"}
Load up required libraries and dataset. Convert the data to a data frame containing a subset of the data:
- Only get pumpkins priced by the bushel
- Convert the date to a month
- Calculate the price to be an average of high and low prices
- Convert the price to reflect the pricing by bushel quantity
> We covered these steps in the [previous lesson](https://github.com/microsoft/ML-For-Beginners/blob/main/2-Regression/2-Data/solution/lesson_2-R.ipynb).
```{r load_tidy_verse_models, message=F, warning=F}
# Load the core Tidyverse packages
library(tidyverse)
library(lubridate)
# Import the pumpkins data
pumpkins <- read_csv(file = "https://raw.githubusercontent.com/microsoft/ML-For-Beginners/main/2-Regression/data/US-pumpkins.csv")
# Get a glimpse and dimensions of the data
glimpse(pumpkins)
# Print the first 50 rows of the data set
pumpkins %>%
slice_head(n = 5)
```
In the spirit of sheer adventure, let's explore the [`janitor package`](https://github.com/sfirke/janitor) that provides simple functions for examining and cleaning dirty data. For instance, let's take a look at the column names for our data:
```{r col_names}
# Return column names
pumpkins %>%
names()
```
🤔 We can do better. Let's make these column names `friendR` by converting them to the [snake_case](https://en.wikipedia.org/wiki/Snake_case) convention using `janitor::clean_names`. To find out more about this function: `?clean_names`
```{r friendR}
# Clean names to the snake_case convention
pumpkins <- pumpkins %>%
clean_names(case = "snake")
# Return column names
pumpkins %>%
names()
```
Much tidyR 🧹! Now, a dance with the data using `dplyr` as in the previous lesson! 💃
```{r prep_data, message=F, warning=F}
# Select desired columns
pumpkins <- pumpkins %>%
select(variety, city_name, package, low_price, high_price, date)
# Extract the month from the dates to a new column
pumpkins <- pumpkins %>%
mutate(date = mdy(date),
month = month(date)) %>%
select(-date)
# Create a new column for average Price
pumpkins <- pumpkins %>%
mutate(price = (low_price + high_price)/2)
# Retain only pumpkins with the string "bushel"
new_pumpkins <- pumpkins %>%
filter(str_detect(string = package, pattern = "bushel"))
# Normalize the pricing so that you show the pricing per bushel, not per 1 1/9 or 1/2 bushel
new_pumpkins <- new_pumpkins %>%
mutate(price = case_when(
str_detect(package, "1 1/9") ~ price/(1.1),
str_detect(package, "1/2") ~ price*2,
TRUE ~ price))
# Relocate column positions
new_pumpkins <- new_pumpkins %>%
relocate(month, .before = variety)
# Display the first 5 rows
new_pumpkins %>%
slice_head(n = 5)
```
Good job!👌 You now have a clean, tidy data set on which you can build your new regression model!
Mind a scatter plot?
```{r scatter_price_month}
# Set theme
theme_set(theme_light())
# Make a scatter plot of month and price
new_pumpkins %>%
ggplot(mapping = aes(x = month, y = price)) +
geom_point(size = 1.6)
```
A scatter plot reminds us that we only have month data from August through December. We probably need more data to be able to draw conclusions in a linear fashion.
Let's take a look at our modelling data again:
```{r modelling data}
# Display first 5 rows
new_pumpkins %>%
slice_head(n = 5)
```
What if we wanted to predict the `price` of a pumpkin based on the `city` or `package` columns, which are of type character? Or even more simply, how could we find the correlation (which requires both of its inputs to be numeric) between, say, `package` and `price`? 🤷🤷
Machine learning models work best with numeric features rather than text values, so you generally need to convert categorical features into numeric representations.
This means that we have to find a way to reformat our predictors to make them easier for a model to use effectively, a process known as `feature engineering`.
## 3. Preprocessing data for modelling with recipes 👩‍🍳👨‍🍳
Different models have different preprocessing requirements. For instance, least squares requires `encoding categorical variables` such as month, variety and city_name. This simply involves `translating` a column with `categorical values` into one or more `numeric columns` that take the place of the original.
For example, suppose your data includes the following categorical feature:
| city |
|:-------:|
| Denver |
| Nairobi |
| Tokyo |
You can apply *ordinal encoding* to substitute a unique integer value for each category, like this:
| city |
|:----:|
| 0 |
| 1 |
| 2 |
And that's what we'll do to our data!
In this section, we'll explore another amazing Tidymodels package: [recipes](https://tidymodels.github.io/recipes/) - which is designed to help you preprocess your data **before** training your model. At its core, a recipe is an object that defines what steps should be applied to a data set in order to get it ready for modelling.
Now, let's create a recipe that prepares our data for modelling by substituting a unique integer for all the observations in the predictor columns:
```{r pumpkins_recipe}
# Specify a recipe
pumpkins_recipe <- recipe(price ~ ., data = new_pumpkins) %>%
step_integer(all_predictors(), zero_based = TRUE)
# Print out the recipe
pumpkins_recipe
```
Awesome! 👏 We just created our first recipe that specifies an outcome (price) and its corresponding predictors and that all the predictor columns should be encoded into a set of integers 🙌! Let's quickly break it down:
- The call to `recipe()` with a formula tells the recipe the *roles* of the variables using `new_pumpkins` data as the reference. For instance the `price` column has been assigned an `outcome` role while the rest of the columns have been assigned a `predictor` role.
- `step_integer(all_predictors(), zero_based = TRUE)` specifies that all the predictors should be converted into a set of integers with the numbering starting at 0.
We are sure you may be having thoughts such as: "This is so cool!! But what if I needed to confirm that the recipes are doing exactly what I expect them to do? 🤔"
That's an awesome thought! You see, once your recipe is defined, you can estimate the parameters required to actually preprocess the data, and then extract the processed data. You don't typically need to do this when you use Tidymodels (we'll see the normal convention in just a minute: `workflows`), but it can come in handy when you want to do some kind of sanity check to confirm that recipes are doing what you expect.
For that, you'll need two more verbs: `prep()` and `bake()`. As always, our little R friends by [`Allison Horst`](https://github.com/allisonhorst/stats-illustrations) help you understand this better!
![Artwork by \@allison_horst](../images/recipes.png){width="550"}
[`prep()`](https://recipes.tidymodels.org/reference/prep.html): estimates the required parameters from a training set that can later be applied to other data sets. For instance, for a given predictor column, which observation will be assigned integer 0, 1, 2, etc.
[`bake()`](https://recipes.tidymodels.org/reference/bake.html): takes a prepped recipe and applies the operations to any data set.
That said, let's prep and bake our recipes to really confirm that under the hood, the predictor columns will be first encoded before a model is fit.
```{r prep_bake}
# Prep the recipe
pumpkins_prep <- prep(pumpkins_recipe)
# Bake the recipe to extract a preprocessed new_pumpkins data
baked_pumpkins <- bake(pumpkins_prep, new_data = NULL)
# Print out the baked data set
baked_pumpkins %>%
slice_head(n = 10)
```
Woo-hoo!🥳 The processed data `baked_pumpkins` has all its predictors encoded, confirming that the preprocessing steps defined in our recipe will work as expected. This makes it harder for you to read but much more intelligible for Tidymodels! Take some time to find out which observation has been mapped to which integer.
It is also worth mentioning that `baked_pumpkins` is a data frame that we can perform computations on.
For instance, let's try to find a good correlation between two variables of your data to potentially build a good predictive model. We'll use the function `cor()` to do this. Type `?cor()` to find out more about the function.
```{r corr}
# Find the correlation between the city_name and the price
cor(baked_pumpkins$city_name, baked_pumpkins$price)
# Find the correlation between the package and the price
cor(baked_pumpkins$package, baked_pumpkins$price)
```
As it turns out, there's only weak correlation between the City and the Price. However, there's somewhat better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
While we are at it, let's also try and visualize a correlation matrix of all the columns using the `corrplot` package.
```{r corrplot}
# Load the corrplot package
library(corrplot)
# Obtain correlation matrix
corr_mat <- cor(baked_pumpkins %>%
# Drop columns that are not really informative
select(-c(low_price, high_price)))
# Make a correlation plot between the variables
corrplot(corr_mat, method = "shade", shade.col = NA, tl.col = "black", tl.srt = 45, addCoef.col = "black", cl.pos = "n", order = "original")
```
🤩🤩 Much better.
A good question to now ask of this data would be: '`What price can I expect of a given pumpkin package?`' Let's get right into it!
> Note: When you **`bake()`** the prepped recipe **`pumpkins_prep`** with **`new_data = NULL`**, you extract the processed (i.e. encoded) training data. If you had another data set, for example a test set, and wanted to see how a recipe would pre-process it, you would simply bake **`pumpkins_prep`** with **`new_data = test_set`**.
## 4. Build a linear regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
Now that we have built a recipe, and actually confirmed that the data will be pre-processed appropriately, let's now build a regression model to answer the question: `What price can I expect of a given pumpkin package?`
#### Train a linear regression model using the training set
As you may have already figured out, the column *price* is the `outcome` variable while the *package* column is the `predictor` variable.
To do this, we'll first split the data such that 80% goes into a training set and 20% into a test set, then define a recipe that will encode the predictor column into a set of integers, and then build a model specification. We won't prep and bake our recipe since we already know it will preprocess the data as expected.
```{r lm_rec_spec}
set.seed(2056)
# Split the data into training and test sets
pumpkins_split <- new_pumpkins %>%
initial_split(prop = 0.8)
# Extract training and test data
pumpkins_train <- training(pumpkins_split)
pumpkins_test <- testing(pumpkins_split)
# Create a recipe for preprocessing the data
lm_pumpkins_recipe <- recipe(price ~ package, data = pumpkins_train) %>%
step_integer(all_predictors(), zero_based = TRUE)
# Create a linear model specification
lm_spec <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
```
Good job! Now that we have a recipe and a model specification, we need to find a way of bundling them together into an object that will first preprocess the data (prep+bake behind the scenes), fit the model on the preprocessed data and also allow for potential post-processing activities. How's that for your peace of mind!🤩
In Tidymodels, this convenient object is called a [`workflow`](https://workflows.tidymodels.org/) and conveniently holds your modeling components! This is what we'd call *pipelines* in *Python*.
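If the *pipelines* analogy helps, here is a minimal Python sketch of the same idea using scikit-learn (a comparison under assumed inputs, not part of this lesson's R code):

```python
# A rough scikit-learn analogue of a recipe + model spec bundled in a workflow:
# preprocessing and the estimator are chained into a single fittable object.
from sklearn.pipeline import Pipeline
from sklearn.preprocessing import OrdinalEncoder
from sklearn.linear_model import LinearRegression

pipeline = Pipeline([
    ("encode", OrdinalEncoder()),   # plays the role of step_integer()
    ("model", LinearRegression()),  # plays the role of linear_reg()
])
# pipeline.fit(X_train, y_train) would then preprocess and fit in one call.
```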
So let's bundle everything up into a workflow!📦
```{r lm_workflow}
# Hold modelling components in a workflow
lm_wf <- workflow() %>%
add_recipe(lm_pumpkins_recipe) %>%
add_model(lm_spec)
# Print out the workflow
lm_wf
```
👌 Into the bargain, a workflow can be fit/trained in much the same way a model can.
```{r lm_wf_fit}
# Train the model
lm_wf_fit <- lm_wf %>%
fit(data = pumpkins_train)
# Print the model coefficients learned
lm_wf_fit
```
From the model output, we can see the coefficients learned during training. They represent the coefficients of the line of best fit that gives us the lowest overall error between the actual and predicted variable.
#### Evaluate model performance using the test set
It's time to see how the model performed 📏! How do we do this?
Now that we've trained the model, we can use it to make predictions for the test_set using `parsnip::predict()`. Then we can compare these predictions to the actual label values to evaluate how well (or not!) the model is working.
Let's start with making predictions for the test set then bind the columns to the test set.
```{r lm_pred}
# Make predictions for the test set
predictions <- lm_wf_fit %>%
predict(new_data = pumpkins_test)
# Bind predictions to the test set
lm_results <- pumpkins_test %>%
select(c(package, price)) %>%
bind_cols(predictions)
# Print the first ten rows of the tibble
lm_results %>%
slice_head(n = 10)
```
Yes, you have just trained a model and used it to make predictions!🔮 Is it any good? Let's evaluate the model's performance!
In Tidymodels, we do this using `yardstick::metrics()`! For linear regression, let's focus on the following metrics:
- `Root Mean Square Error (RMSE)`: The square root of the [MSE](https://en.wikipedia.org/wiki/Mean_squared_error). This yields an absolute metric in the same unit as the label (in this case, the price of a pumpkin). The smaller the value, the better the model (in a simplistic sense, it represents the average price by which the predictions are wrong!)
- `Coefficient of Determination (usually known as R-squared or R2)`: A relative metric in which the higher the value, the better the fit of the model. In essence, this metric represents how much of the variance between predicted and actual label values the model is able to explain.
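As a reference, here are the standard textbook definitions of those two metrics (a sketch of the math, not `yardstick`'s exact source; `yardstick::rsq()` itself computes R² as the squared correlation between truth and estimate, a closely related quantity):

```latex
\mathrm{RMSE} = \sqrt{\frac{1}{n}\sum_{i=1}^{n}(y_i - \hat{y}_i)^2}
\qquad
R^2 = 1 - \frac{\sum_i (y_i - \hat{y}_i)^2}{\sum_i (y_i - \bar{y})^2}
```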
```{r lm_yardstick}
# Evaluate performance of linear regression
metrics(data = lm_results,
truth = price,
estimate = .pred)
```
There goes the model performance. Let's see if we can get a better indication by visualizing a scatter plot of the package and price then use the predictions made to overlay a line of best fit.
This means we'll have to prep and bake the test set in order to encode the package column then bind this to the predictions made by our model.
```{r lm_plot}
# Encode package column
package_encode <- lm_pumpkins_recipe %>%
prep() %>%
bake(new_data = pumpkins_test) %>%
select(package)
# Bind encoded package column to the results
lm_results <- lm_results %>%
bind_cols(package_encode %>%
rename(package_integer = package)) %>%
relocate(package_integer, .after = package)
# Print new results data frame
lm_results %>%
slice_head(n = 5)
# Make a scatter plot
lm_results %>%
ggplot(mapping = aes(x = package_integer, y = price)) +
geom_point(size = 1.6) +
# Overlay a line of best fit
geom_line(aes(y = .pred), color = "orange", size = 1.2) +
xlab("package")
```
Great! As you can see, though, the linear regression model does not generalize the relationship between a package and its corresponding price very well.
🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!
## 5. Build a polynomial regression model
![Infographic by Dasani Madipalli](../images/linear-polynomial.png){width="800"}
Sometimes our data may not have a linear relationship, but we still want to predict an outcome. Polynomial regression can help us make predictions for more complex non-linear relationships.
Take for instance the relationship between the package and price for our pumpkins data set. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.
> ✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression
>
> Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.
>
> ✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients
#### Train a polynomial regression model using the training set
Polynomial regression creates a *curved line* to better fit nonlinear data.
Let's see whether a polynomial model will perform better in making predictions. We'll follow a somewhat similar procedure as we did before:
- Create a recipe that specifies the preprocessing steps that should be carried out on our data to get it ready for modelling, i.e., encoding predictors and computing polynomials of degree *n*
- Build a model specification
- Bundle the recipe and model specification into a workflow
- Create a model by fitting the workflow
- Evaluate how well the model performs on the test data
Let's get right into it!
```{r polynomial_reg}
# Specify a recipe
poly_pumpkins_recipe <-
recipe(price ~ package, data = pumpkins_train) %>%
step_integer(all_predictors(), zero_based = TRUE) %>%
step_poly(all_predictors(), degree = 4)
# Create a model specification
poly_spec <- linear_reg() %>%
set_engine("lm") %>%
set_mode("regression")
# Bundle recipe and model spec into a workflow
poly_wf <- workflow() %>%
add_recipe(poly_pumpkins_recipe) %>%
add_model(poly_spec)
# Create a model
poly_wf_fit <- poly_wf %>%
fit(data = pumpkins_train)
# Print learned model coefficients
poly_wf_fit
```
#### Evaluate model performance
👏👏 You've built a polynomial model; let's make predictions on the test set!
```{r poly_predict}
# Make price predictions on test data
poly_results <- poly_wf_fit %>% predict(new_data = pumpkins_test) %>%
bind_cols(pumpkins_test %>% select(c(package, price))) %>%
relocate(.pred, .after = last_col())
# Print the results
poly_results %>%
slice_head(n = 10)
```
Woo-hoo, let's evaluate how the model performed on the test_set using `yardstick::metrics()`.
```{r poly_eval}
metrics(data = poly_results, truth = price, estimate = .pred)
```
🤩🤩 Much better performance.
The `rmse` decreased from about 7 to about 3, an indication of reduced error between the actual price and the predicted price. You can *loosely* interpret this as meaning that, on average, incorrect predictions are wrong by around \$3. The `rsq` increased from about 0.4 to 0.8.
All these metrics indicate that the polynomial model performs way better than the linear model. Good job!
Let's see if we can visualize this!
```{r poly_viz}
# Bind encoded package column to the results
poly_results <- poly_results %>%
bind_cols(package_encode %>%
rename(package_integer = package)) %>%
relocate(package_integer, .after = package)
# Print new results data frame
poly_results %>%
slice_head(n = 5)
# Make a scatter plot
poly_results %>%
ggplot(mapping = aes(x = package_integer, y = price)) +
geom_point(size = 1.6) +
# Overlay a line of best fit
geom_line(aes(y = .pred), color = "midnightblue", size = 1.2) +
xlab("package")
```
You can see a curved line that fits your data better! 🤩
You can make this smoother by passing a polynomial formula to `geom_smooth` like this:
```{r smooth curve}
# Make a scatter plot
poly_results %>%
ggplot(mapping = aes(x = package_integer, y = price)) +
geom_point(size = 1.6) +
# Overlay a line of best fit
geom_smooth(method = lm, formula = y ~ poly(x, degree = 4), color = "midnightblue", size = 1.2, se = FALSE) +
xlab("package")
```
Much more like a smooth curve!🤩
Here's how you would make a new prediction:
```{r predict}
# Make a hypothetical data frame
hypo_tibble <- tibble(package = "bushel baskets")
# Make predictions using linear model
lm_pred <- lm_wf_fit %>% predict(new_data = hypo_tibble)
# Make predictions using polynomial model
poly_pred <- poly_wf_fit %>% predict(new_data = hypo_tibble)
# Return predictions in a list
list("linear model prediction" = lm_pred,
"polynomial model prediction" = poly_pred)
```
The `polynomial model` prediction does make sense, given the scatter plots of `price` and `package`! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.
## **🚀Challenge**
Test several different variables in this notebook to see how correlation corresponds to model accuracy.
## [**Post-lecture quiz**](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/14/)
## **Review & Self Study**
In this lesson we learned about Linear Regression. There are other important types of Regression. Read about Stepwise, Ridge, Lasso and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning)
If you want to learn more about how to use the amazing Tidymodels framework, please check out the following resources:
- Tidymodels website: [Get started with Tidymodels](https://www.tidymodels.org/start/)
- Max Kuhn and Julia Silge, [*Tidy Modeling with R*](https://www.tmwr.org/)*.*
###### **THANK YOU TO:**
[Allison Horst](https://twitter.com/allison_horst?lang=en) for creating the amazing illustrations that make R more welcoming and engaging. Find more illustrations at her [gallery](https://github.com/allisonhorst/stats-illustrations).

@@ -0,0 +1,339 @@
# Build a regression model using Scikit-learn: regression two ways
![Linear vs polynomial regression infographic](../images/linear-polynomial.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/13/)
### Introduction
So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that will be used throughout this lesson. You have also visualized it using Matplotlib.
Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the math underlying these techniques.
> Throughout this curriculum, we assume minimal knowledge of math and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid comprehension.
### Prerequisite
By now you should be familiar with the structure of the pumpkin data that we are examining. You can find it preloaded and pre-cleaned in this lesson's _notebook.ipynb_ file. In the file, the pumpkin price is displayed per bushel in a new data frame. Make sure you can run these notebooks in kernels in Visual Studio Code.
### Preparation
As a reminder, you are loading this data so as to ask questions of it.
- When is the best time to buy pumpkins?
- What price can I expect for a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or in 1 1/9 bushel boxes? Let's keep digging into this data.
In the previous lesson, you created a Pandas data frame and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points, and only for the fall months.
Take a look at the data preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatter plot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.
## A linear regression line
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line to:
- **Show variable relationships**.
- **Make predictions**. Make accurate predictions on where a new data point would fall in relation to that line.
It is typical of **Least-Squares Regression** to draw this type of line. The term 'least-squares' means that the distances from all the data points surrounding the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or `least-squares`.
We do this because we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them, since we are concerned with their magnitude rather than their direction.
> **🧮 Show me the math**
>
> This line, called the _line of best fit_, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):
>
> ```
> Y = a + bX
> ```
>
> `X` is the 'explanatory variable'. `Y` is the 'dependent variable'. The slope of the line is `b`, and `a` is the y-intercept, which refers to the value of `Y` when `X = 0`.
>
> ![calculate the slope](../images/slope.png)
>
> First, calculate the slope `b`. Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
> In other words, and referring to our pumpkin data's original question, "predict the price of a pumpkin per bushel by month", `X` would refer to the price and `Y` to the month of sale.
>
> ![complete the equation](../images/calculation.png)
>
> Calculate the value of Y. If you're paying around $4, it must be April! Infographic by [Jen Looper](https://twitter.com/jenlooper)
>
> The math that calculates the line must demonstrate the slope of the line, which also depends on the intercept, or where `Y` sits when `X = 0`.
>
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. Also visit [this least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to watch how the numbers' values impact the line.
## Correlation
One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatter plot, you can quickly visualize this coefficient. A plot with data points scattered in a neat line has high correlation, but a plot with data points scattered everywhere between X and Y has low correlation.
A good linear regression model will be one that has a high (closer to 1 than to 0) Correlation Coefficient using the Least-Squares Regression method with a line of regression.
✅ Run the notebook accompanying this lesson and look at the City-to-Price scatter plot. Does the data associating city to price for pumpkin sales seem to have high or low correlation, according to your visual interpretation of the scatter plot?
## Prepare your data for regression
Now that you have an understanding of the math behind this exercise, create a regression model to see if you can predict which pumpkin package will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.
Since you'll use Scikit-learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, add a library from Scikit-learn to automatically convert all string data to numbers:
```python
from sklearn.preprocessing import LabelEncoder
new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
```
If you look at the new_pumpkins data frame now, you'll see that all the strings are now numeric. This makes it harder for a human to read, but much more intelligible to Scikit-learn!
Now you can make more educated decisions (not just based on eyeballing a scatter plot) about the data that is best suited to regression.
Try to find a good correlation between two points of your data to potentially build a good predictive model. As it turns out, there's only weak correlation between the City and the Price:
```python
print(new_pumpkins['City'].corr(new_pumpkins['Price']))
0.32363971816089226
```
However, there's slightly better correlation between the Package and its Price. That makes sense, right? Normally, the bigger the produce box, the higher the price.
```python
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
0.6061712937226021
```
A good question to ask of this data would be: "What price can I expect for a given pumpkin package?"
Let's build this regression model!
## Build a linear model
Before building the model, do one more tidy-up of the data. Drop any null data and check once more what the data looks like.
```python
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()
```
Then, create a new data frame from this minimal set and print it out:
```python
new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
lin_pumpkins
```
```output
Package Price
70 0 13.636364
71 0 16.363636
72 0 16.363636
73 0 15.454545
74 0 13.636364
... ... ...
1738 2 30.000000
1739 2 28.750000
1740 2 25.750000
1741 2 24.000000
1742 2 24.000000
415 rows × 2 columns
```
1. Now you can assign your X and y coordinate data:
```python
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]
```
What's going on here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`.
2. Next, start the regression model-building routines:
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)
pred = lin_reg.predict(X_test)
accuracy_score = lin_reg.score(X_train,y_train)
print('Model Accuracy: ', accuracy_score)
```
Because the correlation isn't particularly good, the model produced isn't very accurate.
```output
Model Accuracy: 0.3315342327998987
```
3. You can visualize the line that is drawn in the process:
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.show()
```
![A scatter plot showing the package-to-price relationship](../images/linear.png)
4. Test the model against a hypothetical variety:
```python
lin_reg.predict( np.array([ [2.75] ]) )
```
The returned price for this mythological variety is:
```output
array([[33.15655975]])
```
That number makes sense, if the logic of the regression line holds true.
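If you want to sanity-check that number, you can reproduce it by hand from the fitted coefficients. A minimal sketch, reusing the `lin_reg` object fitted above (`coef_` and `intercept_` are 2-D/1-D here because the model was fit on 2-D arrays):

```python
# The fitted line is price = intercept + slope * package_code,
# so the prediction for an input of 2.75 can be recomputed manually.
slope = lin_reg.coef_[0][0]
intercept = lin_reg.intercept_[0]
print(intercept + slope * 2.75)  # should match lin_reg.predict(...) above
```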
🎃 Congratulations, you just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!
## Polynomial regression
Another type of linear regression is polynomial regression. While sometimes there's a linear relationship between variables - the bigger the pumpkin in volume, the higher the price - sometimes these relationships can't be plotted as a plane or straight line.
✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression
Take another look at the relationship between Variety and Price in the previous plot. Does this scatterplot seem like it should necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.
✅ Polynomials are mathematical expressions that might consist of one or more variables and coefficients
Polynomial regression creates a curved line to better fit nonlinear data.
1. Recreate a data frame populated with a segment of the original pumpkin data:
```python
new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
poly_pumpkins
```
A good way to visualize the correlations between data in data frames is to display them in a 'coolwarm' chart:
2. Use the `background_gradient()` method with `coolwarm` as its argument value:
```python
corr = poly_pumpkins.corr()
corr.style.background_gradient(cmap='coolwarm')
```
This code creates a heatmap:
![A heatmap showing data correlation](../images/heatmap.png)
Looking at this chart, you can visualize the good correlation between Package and Price. So you should be able to create a somewhat better model than the last one.
### Create a pipeline
Scikit-learn includes a helpful API for building polynomial regression models: the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). A 'pipeline' is created, which is a chain of estimators. In this case, the pipeline includes polynomial features, or predictions that form a nonlinear path.
1. Build the X and y columns:
```python
X=poly_pumpkins.iloc[:,3:4].values
y=poly_pumpkins.iloc[:,4:5].values
```
2. Create the pipeline by calling the `make_pipeline()` method:
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipeline.fit(np.array(X_train), y_train)
y_pred=pipeline.predict(X_test)
```
### Create a sequence
At this point, you need to create a new data frame with _sorted_ data so that the pipeline can create a sequence.
Add the following code:
```python
df = pd.DataFrame({'x': X_test[:,0], 'y': y_pred[:,0]})
df.sort_values(by='x',inplace = True)
points = pd.DataFrame(df).to_numpy()
plt.plot(points[:, 0], points[:, 1],color="blue", linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.scatter(X,y, color="black")
plt.show()
```
You created a new data frame by calling `pd.DataFrame`. Then you sorted the values by calling `sort_values()`. Finally you created a polynomial plot:
![A polynomial plot showing the package-to-price relationship](../images/polynomial.png)
You can see a curved line that fits your data better.
Let's check the model's accuracy:
```python
accuracy_score = pipeline.score(X_train,y_train)
print('Model Accuracy: ', accuracy_score)
```
And voilà!
```output
Model Accuracy: 0.8537946517073784
```
That's better! Try to predict a price:
### Make a prediction
Can we input a new value and get a prediction?
Call `predict()` to make a prediction:
```python
pipeline.predict( np.array([ [2.75] ]) )
```
This prediction is returned:
```output
array([[46.34509342]])
```
It does make sense, given the plot! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.
---
## 🚀 Challenge
Test several different variables in this notebook to see how correlation corresponds to model accuracy.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/14/)
## Review & Self Study
In this lesson we learned about linear regression. There are other important types of regression. Read about Stepwise, Ridge, Lasso and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning)
## Assignment
[Build a Model](assignment.it.md)

@@ -0,0 +1,334 @@
# Build a regression model using Scikit-learn: regression two ways
![Linear vs polynomial regression infographic](../images/linear-polynomial.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/13/)
### Introduction
So far you have explored what regression is with sample data gathered from the pumpkin pricing dataset that will be used throughout this lesson. You have also visualized it using Matplotlib.
Now you are ready to dive deeper into regression for ML. In this lesson, you will learn more about two types of regression: _basic linear regression_ and _polynomial regression_, along with some of the math underlying these techniques.
> Throughout this curriculum, we assume minimal knowledge of math and seek to make it accessible for students coming from other fields, so watch for notes, 🧮 callouts, diagrams, and other learning tools to aid comprehension.
### Prerequisite
By now you should be familiar with the structure of the pumpkin data that we are examining. This lesson's _notebook.ipynb_ file contains the data preloaded and pre-cleaned. In the file, the pumpkin price is displayed per bushel in a new data frame. Make sure you can run these notebooks in kernels in Visual Studio Code.
### Preparation
Remember: once you load the data, ask questions of it.
- When is the best time to buy pumpkins?
- What price can I expect for a case of miniature pumpkins?
- Should I buy them in half-bushel baskets or in 1 1/9 bushel boxes?
Let's keep digging into this data.
In the previous lesson, you created a Pandas data frame and populated it with part of the original dataset, standardizing the pricing by the bushel. By doing that, however, you were only able to gather about 400 data points, and only for the fall months.
Take a look at the data preloaded in this lesson's accompanying notebook. The data is preloaded and an initial scatter plot is charted to show month data. Maybe we can get a little more detail about the nature of the data by cleaning it more.
## A linear regression line
As you learned in Lesson 1, the goal of a linear regression exercise is to be able to plot a line that can:
- **Show variable relationships.**
- **Make predictions.** Make accurate predictions on where a new data point would fall in relation to that line.
Drawing this kind of line is a typical example of **Least-Squares Regression**. The term 'least-squares' means that the distances from all the data points surrounding the regression line are squared and then added up. Ideally, that final sum is as small as possible, because we want a low number of errors, or 'least-squares'.
This is because we want to model a line that has the least cumulative distance from all of our data points. We also square the terms before adding them, since we are concerned with their magnitude rather than their direction.
> **🧮 Show me the math**
>
> This line, called the _line of best fit_, can be expressed by [an equation](https://en.wikipedia.org/wiki/Simple_linear_regression):
>
> ```
> Y = a + bX
> ```
>
> `X` is the 'explanatory variable'. `Y` is the 'dependent variable'. `a` is the intercept and `b` is the slope of the line. When `X = 0`, the value of `Y` is the intercept `a`.
>
> ![calculate the slope](../images/slope.png)
>
> First, let's calculate the slope `b`. Infographic by [Jen Looper](https://twitter.com/jenlooper).
>
> In other words, rephrasing our pumpkin data's original question, "predict the price of a pumpkin per bushel by month", `X` refers to the price and `Y` to the month of sale.
>
> ![complete the equation](../images/calculation.png)
>
> Let's calculate the value of Y. If you're paying around $4, it must be April! Infographic by [Jen Looper](https://twitter.com/jenlooper).
>
> The math that calculates the line must demonstrate the slope of the line, which also depends on the intercept, or where `Y` sits when `X = 0`.
>
> You can observe the method of calculation for these values on the [Math is Fun](https://www.mathsisfun.com/data/least-squares-regression.html) website. Also visit [this Least-squares calculator](https://www.mathsisfun.com/data/least-squares-calculator.html) to see how the values affect the line.
## Correlation
One more term to understand is the **Correlation Coefficient** between given X and Y variables. Using a scatter plot, you can quickly visualize this coefficient. A plot with data points scattered in a neat line has high correlation, but a plot with data points scattered everywhere between X and Y has low correlation.
A good linear regression model will be one whose regression line, found by the least-squares method, has a high Correlation Coefficient (closer to 1 than to 0).
✅ Open this lesson's notebook and look at the City-to-Price scatter plot. According to your visual interpretation of the scatter plot, does the data associating city to price for pumpkin sales seem to have high or low correlation?
## Prepare your data for regression
Now that you understand the math behind this exercise, create a regression model to see if you can predict which pumpkin package will have the best pumpkin prices. Someone buying pumpkins for a holiday pumpkin patch might want this information to be able to optimize their purchases of pumpkin packages for the patch.
Since you'll use Scikit-learn, there's no reason to do this by hand (although you could!). In the main data-processing block of your lesson notebook, add a library from Scikit-learn to automatically convert all string data to numbers:
```python
from sklearn.preprocessing import LabelEncoder
new_pumpkins.iloc[:, 0:-1] = new_pumpkins.iloc[:, 0:-1].apply(LabelEncoder().fit_transform)
```
If you look at the new_pumpkins data frame, you'll see that all the strings are now numeric. This makes it harder for a human to read, but much more intelligible to Scikit-learn!
Now you can make more educated decisions (not just based on eyeballing a scatter plot) about the data best suited to regression.
To build a good predictive model, try to find a good correlation between two points of your data. As it turns out, there's only a weak correlation between the City and the Price:
```python
print(new_pumpkins['City'].corr(new_pumpkins['Price']))
0.32363971816089226
```
However, there's a somewhat stronger correlation between the Package and the Price. Does that make sense? Normally, the bigger the produce box, the higher the price.
```python
print(new_pumpkins['Package'].corr(new_pumpkins['Price']))
0.6061712937226021
```
A good question to ask of this data would be: "What price can I expect for a given pumpkin package?"
Let's build this regression model!
## Build a linear model
Before building the model, let's tidy the data once more. Drop any NULL data and check once more what the data looks like.
```python
new_pumpkins.dropna(inplace=True)
new_pumpkins.info()
```
Then, create a new data frame from this minimal set and print it out:
```python
new_columns = ['Package', 'Price']
lin_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
lin_pumpkins
```
```output
Package Price
70 0 13.636364
71 0 16.363636
72 0 16.363636
73 0 15.454545
74 0 13.636364
... ... ...
1738 2 30.000000
1739 2 28.750000
1740 2 25.750000
1741 2 24.000000
1742 2 24.000000
415 rows × 2 columns
```
1. Now you can assign the X and y coordinate data:
```python
X = lin_pumpkins.values[:, :1]
y = lin_pumpkins.values[:, 1:2]
```
✅ What's happening here? You're using [Python slice notation](https://stackoverflow.com/questions/509211/understanding-slice-notation/509295#509295) to create arrays to populate `X` and `y`.
2. Next, start the regression model-building routines:
```python
from sklearn.linear_model import LinearRegression
from sklearn.metrics import r2_score, mean_squared_error, mean_absolute_error
from sklearn.model_selection import train_test_split
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
lin_reg = LinearRegression()
lin_reg.fit(X_train,y_train)
pred = lin_reg.predict(X_test)
accuracy_score = lin_reg.score(X_train,y_train)
print('Model Accuracy: ', accuracy_score)
```
Because the correlation isn't particularly good, the model produced isn't very accurate.
```output
Model Accuracy: 0.3315342327998987
```
3. Visualize the line drawn in this process:
```python
plt.scatter(X_test, y_test, color='black')
plt.plot(X_test, pred, color='blue', linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.show()
```
![A scatter plot showing the package-to-price relationship](../images/linear.png)
4. Test the model against a hypothetical value:
```python
lin_reg.predict( np.array([ [2.75] ]) )
```
For this hypothetical value, the following price is returned:
```output
array([[33.15655975]])
```
If the regression line has been drawn correctly, that number makes sense.
🎃 Congratulations! You just created a model that can help predict the price of a few varieties of pumpkins. Your holiday pumpkin patch will be beautiful. But you can probably create a better model!
## Polynomial regression
Another type of linear regression is polynomial regression. While sometimes there's a linear relationship between variables (the bigger the pumpkin in volume, the higher the price), sometimes these relationships can't be plotted as a plane or straight line.
✅ Here are [some more examples](https://online.stat.psu.edu/stat501/lesson/9/9.8) of data that could use polynomial regression.
Take another look at the relationship between Variety and Price in the earlier scatter plot. Does this scatterplot look like it must necessarily be analyzed by a straight line? Perhaps not. In this case, you can try polynomial regression.
✅ Polynomials are mathematical expressions that may consist of one or more variables and coefficients.
Polynomial regression creates a curved line to better fit nonlinear data.
1. Let's create a data frame populated with a segment of the original pumpkin data:
```python
new_columns = ['Variety', 'Package', 'City', 'Month', 'Price']
poly_pumpkins = new_pumpkins.drop([c for c in new_pumpkins.columns if c not in new_columns], axis='columns')
poly_pumpkins
```
A good way to visualize the correlations between data in data frames is to display them in a 'coolwarm' chart:
2. Use the `background_gradient()` method with `coolwarm` as its argument value:
```python
corr = poly_pumpkins.corr()
corr.style.background_gradient(cmap='coolwarm')
```
This code creates a heatmap:
![A heatmap showing data correlation](../images/heatmap.png)
Looking at this chart, you can visualize the positive correlation between Package and Price. So you should be able to create a somewhat better model than the last one.
### Create a pipeline
Scikit-learn provides a helpful API for building polynomial regression models: the `make_pipeline` [API](https://scikit-learn.org/stable/modules/generated/sklearn.pipeline.make_pipeline.html?highlight=pipeline#sklearn.pipeline.make_pipeline). A 'pipeline' is created as a chain of estimators. In this case, the pipeline includes polynomial features, or predictions that form a nonlinear path.
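To see what the polynomial-features step will actually do, here is a small standalone sketch with toy values (not the pumpkin data): for a single input x, `PolynomialFeatures(4)` expands it into the columns 1, x, x², x³, x⁴, and the linear model then learns one weight per column.

```python
import numpy as np
from sklearn.preprocessing import PolynomialFeatures

X_toy = np.array([[1.0], [2.0], [3.0]])  # one feature, three toy rows

# degree=4 expands each value x into [1, x, x^2, x^3, x^4]
print(PolynomialFeatures(4).fit_transform(X_toy))
# expected output:
# [[ 1.  1.  1.  1.  1.]
#  [ 1.  2.  4.  8. 16.]
#  [ 1.  3.  9. 27. 81.]]
```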
1. Build the X and y columns:
```python
X=poly_pumpkins.iloc[:,3:4].values
y=poly_pumpkins.iloc[:,4:5].values
```
2. Create the pipeline by calling the `make_pipeline()` method:
```python
from sklearn.preprocessing import PolynomialFeatures
from sklearn.pipeline import make_pipeline
pipeline = make_pipeline(PolynomialFeatures(4), LinearRegression())
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
pipeline.fit(np.array(X_train), y_train)
y_pred=pipeline.predict(X_test)
```
### Create a sequence
At this point, you need to create a new data frame with _sorted_ data so that the pipeline can create a sequence.
Add the following code:
```python
df = pd.DataFrame({'x': X_test[:,0], 'y': y_pred[:,0]})
df.sort_values(by='x',inplace = True)
points = pd.DataFrame(df).to_numpy()
plt.plot(points[:, 0], points[:, 1],color="blue", linewidth=3)
plt.xlabel('Package')
plt.ylabel('Price')
plt.scatter(X,y, color="black")
plt.show()
```
You created a new data frame by calling `pd.DataFrame`, then sorted the values by calling `sort_values()`, and finally created a polynomial plot:
![A polynomial plot showing the package-to-price relationship](../images/polynomial.png)
You can see a curved line that fits your data better.
Let's check the model's accuracy:
```python
accuracy_score = pipeline.score(X_train,y_train)
print('Model Accuracy: ', accuracy_score)
```
And there you have it!
```output
Model Accuracy: 0.8537946517073784
```
Nice! Let's try to predict a price:
### Make a prediction
Can we input a new value and get a prediction?
Call the `predict()` method to make a prediction:
```python
pipeline.predict( np.array([ [2.75] ]) )
```
The following prediction is returned:
```output
array([[46.34509342]])
```
Looking at the plot, this seems plausible! And, if this is a better model than the previous one, looking at the same data, you need to budget for these more expensive pumpkins!
🏆 Well done! You created two regression models in one lesson. In the final section on regression, you will learn about logistic regression to determine categories.
---
## 🚀 Challenge
Test several different variables in this notebook to see how correlation corresponds to model accuracy.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/14/)
## Review & Self Study
In this lesson we learned about linear regression. There are other important types of regression. Read about Stepwise, Ridge, Lasso, and Elasticnet techniques. A good course to study to learn more is the [Stanford Statistical Learning course](https://online.stanford.edu/courses/sohs-ystatslearning-statistical-learning).
## Assignment
[Build a Model](./assignment.ja.md)

@@ -0,0 +1,11 @@
# Create a Regression Model
## Instructions
In this lesson you were shown how to build a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset or use one of Scikit-learn's built-in sets to build a fresh model. Explain in your notebook why you chose the technique you did, and demonstrate your model's accuracy. If it is not accurate, explain why.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
| | Presents a complete notebook with a well-documented solution | The solution is incomplete | The solution is flawed or buggy |

@@ -0,0 +1,11 @@
# Create a Regression Model
## Instructions
In this lesson you were shown how to build a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset of your own or use one of Scikit-learn's built-in sets to build a fresh model. Explain in your notebook why you chose the technique you did, and demonstrate your model's accuracy. If it is not accurate enough, explain why.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
| | Presents a complete notebook with a well-documented solution | The solution is incomplete | The solution is flawed or buggy |

@@ -0,0 +1,12 @@
# Create your own regression model
## Instructions
In this lesson you learned how to build a model using both Linear and Polynomial Regression. Using this knowledge, find a dataset that interests you or use one of Scikit-learn's built-in datasets to build a fresh model. Explain in your notebook why you chose this technique to model the dataset, and demonstrate your model's accuracy. If it is not as accurate as you expected, think about why and explain your reasoning.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ------------------------------------------------------------ | -------------------------- | ------------------------------- |
| | Presents a complete, readable notebook containing the solution | The solution is incomplete | The solution is flawed or buggy |

@@ -140,7 +140,7 @@ Now that we have an idea of the relationship between the binary categories of co
> **🧮 Show Me The Math**
>
> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like thus:
> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![logistic function](images/sigmoid.png)
>
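> For reference, the formula pictured there is the standard logistic (sigmoid) function:
>
> ```latex
> \sigma(x) = \frac{1}{1 + e^{-x}}
> ```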
@@ -206,7 +206,7 @@ While you can get a scoreboard report [terms](https://scikit-learn.org/stable/mo
> 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.
1. To use a confusion metrics, call `confusin_matrix()`:
1. To use a confusion metrics, call `confusion_matrix()`:
```python
from sklearn.metrics import confusion_matrix
@@ -220,26 +220,35 @@ While you can get a scoreboard report [terms](https://scikit-learn.org/stable/mo
[ 33, 0]])
```
What's going on here? Let's say our model is asked to classify items between two binary categories, category 'pumpkin' and category 'not-a-pumpkin'.
In Scikit-learn confusion matrices, rows (axis 0) are actual labels and columns (axis 1) are predicted labels.
- If your model predicts something as a pumpkin and it belongs to category 'pumpkin' in reality we call it a true positive, shown by the top left number.
- If your model predicts something as not a pumpkin and it belongs to category 'pumpkin' in reality we call it a false positive, shown by the top right number.
- If your model predicts something as a pumpkin and it belongs to category 'not-a-pumpkin' in reality we call it a false negative, shown by the bottom left number.
- If your model predicts something as not a pumpkin and it belongs to category 'not-a-pumpkin' in reality we call it a true negative, shown by the bottom right number.
| |0|1|
|:-:|:-:|:-:|
|0|TN|FP|
|1|FN|TP|
![Confusion Matrix](images/confusion-matrix.png)
What's going on here? Let's say our model is asked to classify pumpkins between two binary categories, category 'orange' and category 'not-orange'.
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
- If your model predicts a pumpkin as not orange and it belongs to category 'not-orange' in reality we call it a true negative, shown by the top left number.
- If your model predicts a pumpkin as orange and it belongs to category 'not-orange' in reality we call it a false negative, shown by the bottom left number.
- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality we call it a false positive, shown by the top right number.
- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality we call it a true positive, shown by the bottom right number.
As you might have guessed it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true positives but also several false negatives.
How does the confusion matrix relate to precision and recall? Remember, the classification report printed above showed precision (0.83) and recall (0.98).
Precision = tp / (tp + fp) = 162 / (162 + 33) = 0.8307692307692308
Recall = tp / (tp + fn) = 162 / (162 + 4) = 0.9759036144578314
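A minimal sketch of that arithmetic in code, using the counts quoted above (`tp`, `fp`, and `fn` are just named scalars here, not variables from the lesson's notebook):

```python
# Reproduce the precision and recall reported above from raw counts.
tp, fp, fn = 162, 33, 4

precision = tp / (tp + fp)  # 162 / 195 ≈ 0.83
recall = tp / (tp + fn)     # 162 / 166 ≈ 0.98
print(precision, recall)
```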
✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true negatives but also several false negatives.
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
🎓 Precision: TP/(TP + FN) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 Precision: TP/(TP + FP) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 Recall: TP/(TP + FP) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 Recall: TP/(TP + FN) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 f1-score: (2 * precision * recall)/(precision + recall) A weighted average of the precision and recall, with best being 1 and worst being 0
@@ -252,6 +261,7 @@ Let's revisit the terms we saw earlier with the help of the confusion matrix's m
🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).
✅ Can you think which metric you should watch if you want your model to reduce the number of false negatives?
## Visualize the ROC curve of this model
This is not a bad model; its accuracy is in the 80% range so ideally you could use it to predict the color of a pumpkin given a set of variables.
@@ -284,7 +294,8 @@ In future lessons on classifications, you will learn how to iterate to improve y
---
## 🚀Challenge
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? tip: try [Kaggle](https://kaggle.com) for interesting datasets.
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/16/)
## Review & Self Study

@@ -230,10 +230,6 @@ What's going on here? Let's first assume that our model is asked
- If your model predicts something as a pumpkin but in reality it is not a pumpkin, we call that a false negative, shown by the number in the bottom left corner.
- If your model predicts something as not a pumpkin and it really is not a pumpkin, we call that a true negative, shown by the number in the bottom right corner.
![Confusion Matrix](../images/confusion-matrix.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
As you may have guessed, it is preferable to have more true positives and true negatives and fewer false positives and false negatives, which implies that the model performs better.
✅ Question: According to the confusion matrix, how did the model do? Answer: Not bad; there are a good number of true positives and only a few false negatives.
@ -245,9 +241,9 @@ Let's revisit the terms we saw earlier with the help of the confusion matrix's
> TN: True negative
> FN: False negative
🎓 Precision: TP/(TP + FN) The ratio of relevant data points among all the data points (e.g. which labels were labeled correctly)
🎓 Precision: TP/(TP + FP) The ratio of relevant data points among all the data points (e.g. which labels were labeled correctly)
🎓 Recall: TP/(TP + FP) The ratio of relevant data points that were retrieved, whether labeled correctly or not.
🎓 Recall: TP/(TP + FN) The ratio of relevant data points that were retrieved, whether labeled correctly or not.
🎓 f1-score: (2 * Precision * Recall)/(Precision + Recall) A weighted average of precision and recall, with 1 being best and 0 being worst.

@ -0,0 +1,295 @@
# Logistic regression to predict categories
![Logistic vs. linear regression infographic](../images/logistic-linear.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)
## Introduction
In this final lesson on Regression, one of the basic _classic_ ML techniques, we will take a look at Logistic Regression. You would use this technique to discover patterns to predict binary categories. Is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this lesson, you will learn:
- A new library for data visualization
- Techniques for logistic regression
✅ Deepen your understanding of working with this type of regression in this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa)
## Prerequisite
Having worked with the pumpkin data, we are now familiar enough with it to realize that there's one binary category that we can work with: `Color`.
Let's build a logistic regression model to predict, given some variables, _what color a given pumpkin is likely to be_ (orange 🎃 or white 👻).
> Why are we talking about binary classification in a lesson grouping about regression? Only for linguistic convenience, as logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. You will learn about other ways to classify data in the next lesson group.
## Define the question
For our purposes, we will express it as a binary: 'Orange' or 'Not Orange'. There is also a 'striped' category in our dataset, but there are few instances of it, so we will not use it. It disappears once we remove null values from the dataset, anyway.
> 🎃 Fun fact, we sometimes call white pumpkins 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they are cool looking!
## About logistic regression
Logistic regression differs from linear regression, which you learned about previously, in a few important ways.
### Binary classification
Logistic regression does not offer the same features as linear regression. The former offers a prediction about a binary category ("orange or not orange"), whereas the latter is capable of predicting continual values, for example, given the origin of a pumpkin and the time of harvest, _how much its price will rise_.
![Pumpkin classification model](../images/pumpkin-classifier.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### Other classifications
There are other types of logistic regression, including multinomial and ordinal:
- **Multinomial**, which involves having more than one category: "orange, white, and striped" (a minimal sketch of a multinomial fit follows the infographic below).
- **Ordinal**, which involves ordered categories, useful if we wanted to order our outcomes logically, like our pumpkins that are ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).
![Multinomial vs ordinal regression](../images/multinomial-ordinal.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
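To see what a multinomial setup looks like in Scikit-learn, here is a minimal sketch on made-up toy data (the feature values and labels below are invented for illustration, not drawn from the pumpkin dataset):
```python
from sklearn.linear_model import LogisticRegression
# One toy feature whose value ranges cleanly separate three classes
X = [[0], [1], [2], [10], [11], [12], [20], [21], [22]]
y = ['orange'] * 3 + ['white'] * 3 + ['striped'] * 3
# multi_class='multinomial' fits a single softmax model over all classes
clf = LogisticRegression(multi_class='multinomial', solver='lbfgs')
clf.fit(X, y)
print(clf.predict([[1], [11], [21]]))  # expected: ['orange' 'white' 'striped']
```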
### It's still linear
Even though this type of regression is all about 'category predictions', it still works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, like city name and size). It's good to get an idea of whether there is any linearity dividing these variables or not.
### Variables DO NOT have to correlate
Remember how linear regression worked better with more correlated variables? Logistic regression is the opposite: the variables don't have to align. That works for this data, which has somewhat weak correlations; the optional sketch below shows one way to check.
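If you are curious, here is a quick sketch of that check; it assumes the label-encoded `new_pumpkins` dataframe built in the exercise below is already in scope, so run it after that step:
```python
# Pairwise Pearson correlations between the encoded columns;
# values near 0 reflect the weak correlations mentioned above.
print(new_pumpkins.corr())
# How weakly each column tracks Color specifically:
print(new_pumpkins.corr()['Color'])
```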
### You need a lot of clean data
Logistic regression will give you more accurate results if you use more data; our small dataset is not optimal for this task, so keep that in mind.
✅ Think about the types of data that would lend themselves well to logistic regression
## Exercise - tidy the data
First, clean the data a bit, dropping null values and selecting only some of the columns:
1. Add the following code:
```python
from sklearn.preprocessing import LabelEncoder
new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
new_pumpkins.dropna(inplace=True)
new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
```
You can always take a peek at your new dataframe:
```python
new_pumpkins.info()
```
### Visualization - side-by-side grid
By now you have loaded up the [starter notebook](../notebook.ipynb) with pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on Matplotlib, which we used earlier.
Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for every point in a side-by-side grid.
1. Create such a grid by instantiating a `PairGrid`, using our pumpkin data `new_pumpkins`, followed by calling `map()`:
```python
import seaborn as sns
g = sns.PairGrid(new_pumpkins)
g.map(sns.scatterplot)
```
![A grid of visualized data](../images/grid.png)
By observing data side-by-side, you can see how the Color data relates to the other columns.
✅ Given this scatterplot grid, what are some interesting explorations you can envision?
### Use a swarm plot
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables.
You can visualize variables side-by-side with Seaborn plots.
1. Try a 'swarm' plot to show the distribution of values:
```python
sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
```
![A swarm of visualized data](../images/swarm.png)
### Violin plot
A 'violin' type plot is useful as you can easily visualize the way that data in the two categories is distributed. Violin plots don't work so well with smaller datasets, as the distribution is displayed more 'smoothly'.
1. Call `catplot()`, passing the parameters `x=Color` and `kind="violin"`:
```python
sns.catplot(x="Color", y="Item Size",
kind="violin", data=new_pumpkins)
```
![a violin type chart](../images/violin.png)
✅ Try creating this plot, and other Seaborn plots, using other variables.
Now that we have an idea of the relationship between the binary categories of color and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
> **🧮 Show Me The Math**
>
> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). A 'Sigmoid Function' on a plot looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![logistic function](../images/sigmoid.png)
>
> where the sigmoid's midpoint finds itself at x's 0 point, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question will be given the class '1' of the binary choice. If not, it will be classified as '0'.
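To make that description concrete, here is a minimal numpy sketch of the standard logistic curve (assuming L = 1, k = 1, and a midpoint at x = 0) together with the 0.5 thresholding rule:
```python
import numpy as np
def sigmoid(x, L=1.0, k=1.0, x0=0.0):
    # Logistic curve with maximum L, steepness k, and midpoint x0
    return L / (1 + np.exp(-k * (x - x0)))
x = np.array([-4.0, -1.0, 0.0, 1.0, 4.0])
p = sigmoid(x)
print(p)                      # values squeezed into (0, 1)
print((p > 0.5).astype(int))  # the 0.5 threshold yields class 1 or 0
```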
## Build your model
Building a model to find these binary classifications is surprisingly straightforward in Scikit-learn.
1. Select the variables you want to use in your classification model and split the training and test sets by calling `train_test_split()`:
```python
from sklearn.model_selection import train_test_split
Selected_features = ['Origin','Item Size','Variety','City Name','Package']
X = new_pumpkins[Selected_features]
y = new_pumpkins['Color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
1. Now you can train your model, by calling `fit()` with your training data, and print out its result:
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
Take a look at your model's scoreboard. It's not too bad, considering you have only about 1000 rows of data:
```output
precision recall f1-score support
0 0.85 0.95 0.90 166
1 0.38 0.15 0.22 33
accuracy 0.82 199
macro avg 0.62 0.55 0.56 199
weighted avg 0.77 0.82 0.78 199
Predicted labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 0 1 0 0 0 1 0]
```
## Better comprehension via a confusion matrix
While you can get a scoreboard report of [terms](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you might be able to understand your model more easily by using a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix) to help us understand how the model is performing.
> 🎓 A '[confusion matrix](https://it.wikipedia.org/wiki/Matrice_di_confusione)' (or 'error matrix') is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of predictions.
1. To use a confusion matrix, call `confusion_matrix()`:
```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
```
Take a look at your model's confusion matrix:
```output
array([[162, 4],
[ 33, 0]])
```
What's going on here? Let's say our model is asked to classify items between two binary categories, category 'pumpkin' and category 'not-a-pumpkin'.
- If your model predicts something as a pumpkin and it belongs to category 'pumpkin' in reality, we call it a true positive, shown by the top left number.
- If your model predicts something as not a pumpkin and it belongs to category 'pumpkin' in reality, we call it a false positive, shown by the top right number.
- If your model predicts something as a pumpkin and it belongs to category 'not-a-pumpkin' in reality, we call it a false negative, shown by the bottom left number.
- If your model predicts something as not a pumpkin and it belongs to category 'not-a-pumpkin' in reality, we call it a true negative, shown by the bottom right number.
As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
✅ Q: According to the confusion matrix, how did the model do? A: Not too bad; there are a good number of true positives but also several false negatives.
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
🎓 Precision: TP/(TP + FP) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 Recall: TP/(TP + FN) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 f1-score: (2 * precision * recall)/(precision + recall) A weighted average of the precision and recall, with best being 1 and worst being 0
🎓 Support: The number of occurrences of each label retrieved
🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN) The percentage of labels predicted accurately for a sample.
🎓 Macro Avg: The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.
🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting them by their support (the number of true instances for each label).
✅ Can you think which metric you should watch if you want your model to reduce the number of false negatives?
## Visualize the ROC curve of this model
This is not a bad model; its accuracy is in the 80% range, so ideally you could use it to predict the color of a pumpkin given a set of variables.
Let's do one more visualization to see the so-called 'ROC' score:
```python
from sklearn.metrics import roc_curve, roc_auc_score
y_scores = model.predict_proba(X_test)
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
sns.lineplot(x=[0, 1], y=[0, 1])
sns.lineplot(x=fpr, y=tpr)
```
Using Seaborn again, plot the model's [Receiving Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. "ROC curves typically feature true positive rate on the Y axis, and false positive rate on the X axis." Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:
![ROC](../images/ROC.png)
Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):
```python
auc = roc_auc_score(y_test,y_scores[:,1])
print(auc)
```
The result is `0.6976998904709748`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1; in this case, the model is _pretty good_.
In future lessons on classifications, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!
---
## 🚀 Challenge
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. What do you learn? Tip: try [Kaggle](https://kaggle.com) for interesting datasets.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/16/)
## Review & Self Study
Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses for logistic regression. Think about tasks that are better suited for one or the other type of regression tasks that we have studied up to this point. What would work best?
## Assignment
[Retry this regression](assignment.it.md)

@ -0,0 +1,310 @@
# Logistic regression to predict categories
![Logistic vs. linear regression infographic](../images/logistic-linear.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/15/)
## Introduction
In this final lesson on Regression, we will look at Logistic Regression, one of the classic machine learning techniques. You use this technique to discover patterns for predicting binary categories. For example: is this candy chocolate or not? Is this disease contagious or not? Will this customer choose this product or not?
In this lesson, you will learn:
- A new library for data visualization
- About logistic regression
✅ You can deepen your understanding of this type of regression with this [Learn module](https://docs.microsoft.com/learn/modules/train-evaluate-classification-models?WT.mc_id=academic-15963-cxa).
## Prerequisite
Having worked with the pumpkin data, we are now familiar enough with it to notice that there is one binary category we can work with: `Color`.
Let's build a logistic regression model to predict which color a given pumpkin is likely to be (orange 🎃 or white 👻), given some variables.
> Why are we talking about binary classification in a lesson about regression? Only for linguistic convenience: logistic regression is [really a classification method](https://scikit-learn.org/stable/modules/linear_model.html#logistic-regression), albeit a linear-based one. You'll learn about other ways to classify data in the next group of lessons.
## Define the question
Here, we express it as a binary: 'Orange' or 'Not Orange'. The dataset also contains a 'striped' category, but there are hardly any instances of it, so we won't use it. It disappears once we remove the null values from the dataset anyway.
> 🎃 Fun fact: white pumpkins are sometimes called 'ghost' pumpkins. They aren't very easy to carve, so they aren't as popular as the orange ones, but they look cool!
## About logistic regression
Logistic regression differs from the linear regression you learned about previously in a few important ways.
### Binary classification
Logistic regression has different characteristics from linear regression. Logistic regression makes predictions about a binary category ('orange or not orange'), whereas linear regression predicts continuous values: for example, given a pumpkin's origin and harvest time, it can predict how much its price will rise.
![Pumpkin classification model](../images/pumpkin-classifier.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### Other classifications
There are other kinds of logistic regression, including multinomial and ordinal:
- **Multinomial**: involves more than two categories (orange, white, striped).
- **Ordinal**: involves ordered categories, useful when you want to order the outcomes logically, like pumpkins ordered by a finite number of sizes (mini, sm, med, lg, xl, xxl).
![Multinomial vs ordinal regression](../images/multinomial-ordinal.png)
> Infographic by [Dasani Madipalli](https://twitter.com/dasani_decoded)
### It's still linear
Although this type of regression aims at 'category prediction', it works best when there is a clear linear relationship between the dependent variable (color) and the other independent variables (the rest of the dataset, such as city name and size). It's good to get an idea of whether any linearity separates these variables.
### Variables do NOT have to correlate
Remember how linear regression worked better with more correlated variables? Logistic regression is not necessarily so: it works for this data, whose correlations are somewhat weak.
### You need a lot of clean data
In general, logistic regression gives more accurate results the more data you use; our small dataset is not optimal for this task, so keep that in mind.
✅ Think about the kinds of data that lend themselves well to logistic regression.
## Exercise - tidy the data
First, clean the data a bit by dropping null values and selecting only some of the columns.
1. Add the following code:
```python
from sklearn.preprocessing import LabelEncoder
new_columns = ['Color','Origin','Item Size','Variety','City Name','Package']
new_pumpkins = pumpkins.drop([c for c in pumpkins.columns if c not in new_columns], axis=1)
new_pumpkins.dropna(inplace=True)
new_pumpkins = new_pumpkins.apply(LabelEncoder().fit_transform)
```
You can always take a look at the new dataframe:
```python
new_pumpkins.info()
```
### Visualization - side-by-side grid
By now you have loaded up the [starter notebook](../notebook.ipynb) with the pumpkin data once again and cleaned it so as to preserve a dataset containing a few variables, including `Color`. Let's visualize the dataframe in the notebook using a different library: [Seaborn](https://seaborn.pydata.org/index.html), which is built on the Matplotlib we have been using.
Seaborn offers some neat ways to visualize your data. For example, you can compare distributions of the data for every point in a side-by-side grid.
1. Create such a grid by instantiating a `PairGrid` with the pumpkin data `new_pumpkins` and calling the `map()` method:
```python
import seaborn as sns
g = sns.PairGrid(new_pumpkins)
g.map(sns.scatterplot)
```
![A grid of visualized data](../images/grid.png)
By observing data side by side, you can see how the Color data relates to the other columns.
✅ Given this scatterplot grid, what are some interesting explorations you can envision?
### Use a swarm plot
Since Color is a binary category (Orange or Not), it's called 'categorical data' and needs 'a more [specialized approach](https://seaborn.pydata.org/tutorial/categorical.html?highlight=bar) to visualization'. There are other ways to visualize the relationship of this category with other variables.
You can visualize variables side by side with Seaborn plots.
1. Try a 'swarm' plot to show the distribution of values:
```python
sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
```
![A swarm of visualized data](../images/swarm.png)
### Violin plot
A 'violin' type plot is useful because it lets you easily visualize how data in the two categories is distributed. Violin plots don't work so well with smaller datasets, as the distribution is displayed more 'smoothly'.
1. Call the `catplot()` method with the parameters `x=Color` and `kind="violin"`:
```python
sns.catplot(x="Color", y="Item Size",
kind="violin", data=new_pumpkins)
```
![a violin type chart](../images/violin.png)
✅ Try creating this plot, and other Seaborn plots, with other variables.
Now that we have an idea of the relationship between the binary categories of `Color` and the larger group of sizes, let's explore logistic regression to determine a given pumpkin's likely color.
> **🧮 Show me the math**
>
> Remember how linear regression often used ordinary least squares to arrive at a value? Logistic regression relies on the concept of 'maximum likelihood' using [sigmoid functions](https://wikipedia.org/wiki/Sigmoid_function). On a plot, a sigmoid function looks like an 'S' shape. It takes a value and maps it to somewhere between 0 and 1. Its curve is also called a 'logistic curve'. Its formula looks like this:
>
> ![logistic function](../images/sigmoid.png)
>
> where the sigmoid's midpoint is at x = 0, L is the curve's maximum value, and k is the curve's steepness. If the outcome of the function is more than 0.5, the label in question is assigned the class '1' of the binary choice. Otherwise, it is classified as '0'.
## Build your model
Building a model to perform these binary classifications is surprisingly straightforward in Scikit-learn.
1. Select the variables you want to use in your classification model and split the training and test sets with the `train_test_split()` method:
```python
from sklearn.model_selection import train_test_split
Selected_features = ['Origin','Item Size','Variety','City Name','Package']
X = new_pumpkins[Selected_features]
y = new_pumpkins['Color']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
2. Now you can train the model by calling the `fit()` method with your training data, and print out its result:
```python
from sklearn.model_selection import train_test_split
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
Take a look at your model's scoreboard. It's not bad, considering you only have about 1000 rows of data:
```output
precision recall f1-score support
0 0.85 0.95 0.90 166
1 0.38 0.15 0.22 33
accuracy 0.82 199
macro avg 0.62 0.55 0.56 199
weighted avg 0.77 0.82 0.78 199
Predicted labels: [0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0
0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 0 0 0 0 0 0 0 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 1 1 1
0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0 0
0 0 0 1 0 1 0 0 1 0 0 0 1 0]
```
## Better comprehension via a confusion matrix
While you can get a [scoreboard report](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.classification_report.html?highlight=classification_report#sklearn.metrics.classification_report) by printing out the items above, you might be able to understand your model more easily by using a [confusion matrix](https://scikit-learn.org/stable/modules/model_evaluation.html#confusion-matrix).
> 🎓 A '[confusion matrix](https://wikipedia.org/wiki/Confusion_matrix)' is a table that expresses your model's true vs. false positives and negatives, thus gauging the accuracy of its predictions.
1. Call the `confusion_matrix()` method to create a confusion matrix:
```python
from sklearn.metrics import confusion_matrix
confusion_matrix(y_test, predictions)
```
Take a look at your model's confusion matrix:
```output
array([[162, 4],
[ 33, 0]])
```
In Scikit-learn confusion matrices, the rows (axis 0) are the actual labels and the columns (axis 1) are the predicted labels; the tiny sketch after the table below confirms this convention.
| |0|1|
|:-:|:-:|:-:|
|0|TN|FP|
|1|FN|TP|
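Here is a tiny sketch (with made-up labels, independent of the pumpkin model) that confirms the row/column convention:
```python
from sklearn.metrics import confusion_matrix
y_true = [0, 0, 0, 1, 1, 1]  # actual labels
y_pred = [0, 0, 1, 1, 1, 0]  # predicted labels
# Rows are actual (axis 0), columns are predicted (axis 1):
# [[TN, FP],
#  [FN, TP]]
print(confusion_matrix(y_true, y_pred))
# -> [[2 1]
#     [1 2]]
```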
What's going on here? Say the model is asked to classify pumpkins between two binary categories, category 'orange' and category 'not orange'.
- If your model predicts a pumpkin as not orange and it belongs to category 'not orange' in reality, we call it a 'true negative', shown by the top left number.
- If your model predicts a pumpkin as orange and it belongs to category 'not orange' in reality, we call it a 'false negative', shown by the bottom left number.
- If your model predicts a pumpkin as not orange and it belongs to category 'orange' in reality, we call it a 'false positive', shown by the top right number.
- If your model predicts a pumpkin as orange and it belongs to category 'orange' in reality, we call it a 'true positive', shown by the bottom right number.
As you might have guessed, it's preferable to have a larger number of true positives and true negatives and a lower number of false positives and false negatives, which implies that the model performs better.
How does the confusion matrix relate to precision and recall? The classification report printed above showed precision (0.83) and recall (0.98).
Precision = tp / (tp + fp) = 162 / (162 + 33) = 0.8307692307692308
Recall = tp / (tp + fn) = 162 / (162 + 4) = 0.9759036144578314
✅ Q: According to the confusion matrix, how did the model do? A: Not bad; there are a good number of true negatives, but also several false negatives.
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
🎓 Precision: TP/(TP + FP) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled).
🎓 Recall: TP/(TP + FN) The fraction of relevant instances that were retrieved, whether well-labeled or not.
🎓 f1-score: (2 * precision * recall)/(precision + recall) A weighted average of precision and recall, with best being 1 and worst being 0.
🎓 Support: The number of occurrences of each label retrieved.
🎓 Accuracy: (TP + TN)/(TP + TN + FP + FN) The percentage of labels predicted accurately for a sample.
🎓 Macro Avg: The calculation of the unweighted mean metrics for each label, not taking label imbalance into account.
🎓 Weighted Avg: The calculation of the mean metrics for each label, taking label imbalance into account by weighting each label by its support (the number of true instances for each label).
✅ Can you think which metric you should watch if you want your model to reduce the number of false negatives?
## Visualize the ROC curve of this model
This is not a bad model; its accuracy is in the 80% range, so ideally you could use it to predict a pumpkin's color given a set of variables.
Let's do one more visualization to see the so-called 'ROC' score:
```python
from sklearn.metrics import roc_curve, roc_auc_score
y_scores = model.predict_proba(X_test)
# calculate ROC curve
fpr, tpr, thresholds = roc_curve(y_test, y_scores[:,1])
sns.lineplot(x=[0, 1], y=[0, 1])
sns.lineplot(x=fpr, y=tpr)
```
Using Seaborn again, plot the model's [Receiving Operating Characteristic](https://scikit-learn.org/stable/auto_examples/model_selection/plot_roc.html?highlight=roc) or ROC. ROC curves are often used to get a view of the output of a classifier in terms of its true vs. false positives. ROC curves typically feature the true positive rate on the Y axis and the false positive rate on the X axis. Thus, the steepness of the curve and the space between the midpoint line and the curve matter: you want a curve that quickly heads up and over the line. In our case, there are false positives to start with, and then the line heads up and over properly:
![ROC](../images/ROC.png)
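As an aside, Scikit-learn versions 1.0 and newer also ship a display helper that computes the probabilities and the FPR/TPR pairs for you; a sketch, assuming the fitted `model` and the test split from above:
```python
from sklearn.metrics import RocCurveDisplay
# Draws the same ROC curve, with the AUC shown in the legend, in one call
RocCurveDisplay.from_estimator(model, X_test, y_test)
```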
Finally, use Scikit-learn's [`roc_auc_score` API](https://scikit-learn.org/stable/modules/generated/sklearn.metrics.roc_auc_score.html?highlight=roc_auc#sklearn.metrics.roc_auc_score) to compute the actual 'Area Under the Curve' (AUC):
```python
auc = roc_auc_score(y_test,y_scores[:,1])
print(auc)
```
The result is `0.6976998904709748`. Given that the AUC ranges from 0 to 1, you want a big score, since a model that is 100% correct in its predictions will have an AUC of 1.
In future lessons on classification, you will learn how to iterate to improve your model's scores. But for now, congratulations! You've completed these regression lessons!
---
## 🚀Challenge
There's a lot more to unpack regarding logistic regression! But the best way to learn is to experiment. Find a dataset that lends itself to this type of analysis and build a model with it. Tip: try [Kaggle](https://www.kaggle.com/search?q=logistic+regression+datasets) for interesting datasets.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/16/)
## Review & Self Study
Read the first few pages of [this paper from Stanford](https://web.stanford.edu/~jurafsky/slp3/5.pdf) on some practical uses of logistic regression. Think about tasks that are better suited for one or the other type of regression task we have studied so far. What would work best?
## Assignment
[Retry this regression](./assignment.ja.md)

@ -120,7 +120,7 @@ Seaborn offers some neat ways to visualize your data. For example, you can
sns.swarmplot(x="Color", y="Item Size", data=new_pumpkins)
```
![Swarm plot of visualized data](images/swarm.png)
![Swarm plot of visualized data](../images/swarm.png)
### Violin plot
@ -133,7 +133,7 @@ Seaborn offers some neat ways to visualize your data. For example, you can
kind="violin", data=new_pumpkins)
```
![Violin plot](images/violin.png)
![Violin plot](../images/violin.png)
✅ Try creating this plot, and other Seaborn plots, with other variables.
@ -228,19 +228,15 @@ Seaborn offers some neat ways to visualize your data. For example, you can
- If your model predicts something as a pumpkin and it actually belongs to the 'not-a-pumpkin' category, we call it a false negative, shown by the bottom left number.
- If your model predicts something as not a pumpkin and it actually belongs to the 'not-a-pumpkin' category, we call it a true negative, shown by the bottom right number.
![Confusion matrix](../images/confusion-matrix.png)
> Infographic by [Jen Looper](https://twitter.com/jenlooper)
As you might have guessed, it's preferable to have more true positives and true negatives and fewer false positives and false negatives, which implies that the model performs better.
✅ Q: According to the confusion matrix, how did the model do? A: Not bad; there are a good number of true positives but also several false negatives.
Let's revisit the terms we saw earlier with the help of the confusion matrix's mapping of TP/TN and FP/FN:
🎓 Precision: TP/(TP + FN) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 Precision: TP/(TP + FP) The fraction of relevant instances among the retrieved instances (e.g. which labels were well-labeled)
🎓 Recall: TP/(TP + FP) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 Recall: TP/(TP + FN) The fraction of relevant instances that were retrieved, whether well-labeled or not
🎓 F1-score: (2 * precision * recall)/(precision + recall) A weighted average of precision and recall, with best being 1 and worst being 0

@ -0,0 +1,10 @@
# Retry some regression
## Instructions
In the lesson, you used a subset of the pumpkin data. Now, go back to the original data and try to use all of it, cleaned and standardized, to build a logistic regression model.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------- | ------------------------------------------------------------ | ----------------------------------------------------------- |
| | A notebook is presented with a well-explained and well-performing model | A notebook is presented with a model that performs minimally | A notebook is presented with a sub-performing model or none |

@ -0,0 +1,11 @@
# Retry some regression
## Instructions
In the lesson, you used a subset of the pumpkin data. Now, go back to the original data and try to use all of it, cleaned and standardized, to build a logistic regression model.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------- | ------------------------------------------------------------ | ----------------------------------------------------------- |
| | A notebook is presented with a well-explained and well-performing model | A notebook is presented with a model that performs minimally | A notebook is presented with a sub-performing model or none |

@ -0,0 +1,11 @@
# Retry some regression
## Instructions
In this lesson, you used a subset of the pumpkin dataset. Now, go back to the original data and try to use all of it, cleaned and standardized, to build a logistic regression model.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------- | ------------------------------------------------------------ | ----------------------------------------------------------- |
| | A notebook is presented with a well-explained and well-performing model | A notebook is presented with a model that performs minimally | A notebook is presented with a sub-performing model or none |

@ -0,0 +1,33 @@
# Regression models for machine learning
## Regional topic: Regression models for pumpkin prices in North America 🎃
In North America, pumpkins are often carved into scary faces for Halloween. Let's discover more about these fascinating vegetables!
![jack-o-lanterns](../images/jack-o-lanterns.jpg)
> Photo by <a href="https://unsplash.com/@teutschmann?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Beth Teutschmann</a> on <a href="https://unsplash.com/s/photos/jack-o-lanterns?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
The lessons in this section cover types of regression in the context of machine learning. Regression models can help determine the _relationship_ between variables. This type of model can predict values such as length, temperature, or age, thus uncovering relationships between variables as it analyzes data points.
In this series of lessons, you'll discover the difference between linear and logistic regression, and when you should use one or the other.
In this group of lessons, you will get set up to begin machine learning tasks, including configuring Visual Studio Code to manage notebooks, the common environment for data scientists. You will discover Scikit-learn, a library for machine learning, and you will build your first models, focusing on regression models in this chapter.
> There are useful low-code tools that can help you learn about working with regression models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-regression-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)
### Lessons
1. [Tools of the trade](1-Tools/translations/README.fr.md)
2. [Manage data](2-Data/translations/README.fr.md)
3. [Linear and polynomial regression](3-Linear/translations/README.fr.md)
4. [Logistic regression](4-Logistic/translations/README.fr.md)
---
### Credits
"ML with regression" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper)
♥️ Quiz contributors include: [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan) and [Ornella Altunyan](https://twitter.com/ornelladotcom)
The pumpkin dataset is suggested by [this project on Kaggle](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) and its data is sourced from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture. We have added some points around color based on variety to normalize the distribution. This data is in the public domain.

@ -0,0 +1,34 @@
# Regression models for machine learning
## Regional topic: Regression models for pumpkin prices in North America 🎃
In North America, pumpkins are often carved into scary faces for Halloween. Let's discover more about these fascinating vegetables!
![jack-o-lantern](../images/jack-o-lanterns.jpg)
> Photo by <a href="https://unsplash.com/@teutschmann?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Beth Teutschmann</a> on <a href="https://unsplash.com/s/photos/jack-o-lanterns?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
The lessons in this section cover types of regression in the context of machine learning. Regression models can help determine the _relationship_ between variables. This type of model can predict values such as length, temperature, or age, thus uncovering relationships between variables as it analyzes data points.
In this series of lessons, you'll discover the difference between linear and logistic regression, and when you should use one or the other.
In this group of lessons, you will get set up to begin machine learning tasks, including configuring Visual Studio Code to manage notebooks, the common environment for data scientists. You will discover Scikit-learn, a library for machine learning, and you will build your first models, focusing on regression models in this chapter.
> There are useful low-code tools that can help you learn about working with regression models. Try [Azure Machine Learning for this task](https://docs.microsoft.com/learn/modules/create-regression-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)
### Lessons
1. [Tools of the trade](../1-Tools/translations/README.it.md)
2. [Manage data](../2-Data/translations/README.it.md)
3. [Linear and polynomial regression](../3-Linear/translations/README.it.md)
4. [Logistic regression](../4-Logistic/translations/README.it.md)
---
### Credits
"ML with regression" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper)
♥️ Quiz contributors include: [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan) and [Ornella Altunyan](https://twitter.com/ornelladotcom)
The pumpkin dataset is suggested by [this project on Kaggle](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) and its data is sourced from the [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) distributed by the United States Department of Agriculture. We have added some points around color based on variety to normalize the distribution. This data is in the public domain.

@ -0,0 +1,34 @@
# Regression models for machine learning
## Regional topic: Regression models for pumpkin prices in North America 🎃
In North America, pumpkins are often carved into scary faces for Halloween. Let's dig deeper into these fascinating vegetables!
![jack-o-lantern](../images/jack-o-lanterns.jpg)
> Photo by <a href="https://unsplash.com/@teutschmann?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Beth Teutschmann</a> on <a href="https://unsplash.com/s/photos/jack-o-lanterns?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## What you will learn
The lessons in this section cover the types of regression models used in machine learning. Regression models can determine the _relationships_ between variables. This type of model can predict values such as length, temperature, or age, uncovering relationships between variables as it analyzes data points.
In this series of lessons, you'll learn the difference between linear and logistic regression, and when you should choose one over the other for a given problem.
In this group of lessons, you will get set up for the initial tasks of machine learning, including configuring VS Code for managing notebooks, the environment commonly used by data scientists. You will get started with Scikit-learn, a library for machine learning, and you will build your first models, focusing on regression models.
> There are some useful low-code tools that can help you learn to work with regression models. Try [Azure ML for this task](https://docs.microsoft.com/learn/modules/create-regression-model-azure-machine-learning-designer/?WT.mc_id=academic-15963-cxa)
### Lessons
1. [Tools of the trade](../1-Tools/translations/README.zh-cn.md)
2. [Manage data](../2-Data/translations/README.zh-cn.md)
3. [Linear and polynomial regression](../3-Linear/translations/README.zh-cn.md)
4. [Logistic regression](../4-Logistic/translations/README.zh-cn.md)
---
### Credits
"机器学习中的回归" 由[Jen Looper](https://twitter.com/jenlooper)♥️ 撰写
♥️ 测试的贡献者: [Muhammad Sakib Khan Inan](https://twitter.com/Sakibinan) 和 [Ornella Altunyan](https://twitter.com/ornelladotcom)
南瓜数据集受此启发 [this project on Kaggle](https://www.kaggle.com/usda/a-year-of-pumpkin-prices) 并且其数据源自 [Specialty Crops Terminal Markets Standard Reports](https://www.marketnews.usda.gov/mnp/fv-report-config-step1?type=termPrice) 由美国农业部上传分享。我们根据种类添加了围绕颜色的一些数据点。这些数据处在公共的域名上。

@ -1,6 +1,6 @@
# Build a Web App to use an ML Model
In this lesson, you will train an ML model on a data set that's out of this world: _UFO sightings over the past century_, sourced from [NUFORC's database](https://www.nuforc.org).
In this lesson, you will train an ML model on a data set that's out of this world: _UFO sightings over the past century_, sourced from NUFORC's database.
You will learn:
@ -165,7 +165,7 @@ Now you can build a Flask app to call your model and return similar results, but
web-app/
static/
css/
templates/
templates/
notebook.ipynb
ufo-model.pkl
```
@ -187,7 +187,7 @@ Now you can build a Flask app to call your model and return similar results, but
cd web-app
```
1. In your terminal type `pip install`, to install the libraries listed in _reuirements.txt_:
1. In your terminal type `pip install`, to install the libraries listed in _requirements.txt_:
```bash
pip install -r requirements.txt
@ -199,7 +199,7 @@ Now you can build a Flask app to call your model and return similar results, but
2. Create **index.html** in _templates_ directory.
3. Create **styles.css** in _static/css_ directory.
1. Build out the _styles.css__ file with a few styles:
1. Build out the _styles.css_ file with a few styles:
```css
body {

@ -0,0 +1,347 @@
# Build a Web App to use an ML Model
In this lesson, you will train an ML model on a data set that's out of this world: _UFO sightings over the past century_, sourced from [NUFORC's database](https://www.nuforc.org).
You will learn:
- How to serialize/deserialize a trained model
- How to use that model in a Flask app
We will continue our use of notebooks to clean data and train our model, but you can take the process one step further by exploring the use of the model directly in a web app.
To do this, you need to build a web app using Flask.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/17/)
## Building an app
There are several ways to build web apps to consume machine learning models. Your web architecture may influence the way your model is trained. Imagine that you are working in a business where the data science group has trained a model that they want you to use in an app.
### Considerations
There are many questions you need to ask:
- **Is it a web app or a mobile app?** If you are building a mobile app or need to use the model in an IoT context, you could use [TensorFlow Lite](https://www.tensorflow.org/lite/) and use the model in an Android or iOS app.
- **Where will the model reside?** In the cloud or locally?
- **Offline support.** Does the app have to work offline?
- **What technology was used to train the model?** The chosen technology may influence the tooling you need to use.
- **Using TensorFlow.** If you are training a model using TensorFlow, for example, that ecosystem provides the ability to convert a TensorFlow model for use in a web app by using [TensorFlow.js](https://www.tensorflow.org/js/).
- **Using PyTorch.** If you are building a model using a library such as [PyTorch](https://pytorch.org/), you have the option to export it in [ONNX](https://onnx.ai/) (Open Neural Network Exchange) format for use in JavaScript web apps that can use the [Onnx Runtime](https://www.onnxruntime.ai/). This option will be explored in a future lesson for a Scikit-learn-trained model.
- **Using Lobe.ai or Azure Custom Vision.** If you are using an ML SaaS (Software as a Service) system such as [Lobe.ai](https://lobe.ai/) or [Azure Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-15963-cxa) to train a model, this type of software provides ways to export the model for many platforms, including building a bespoke API to be queried in the cloud by your online application.
You also have the opportunity to build an entire Flask web app that would be able to train the model itself in a web browser. This can also be done using TensorFlow.js in a JavaScript context.
For our purposes, since we have been working with Python-based notebooks, let's explore the steps you need to take to export a trained model from such a notebook to a format readable by a Python-built web app.
## Tools
For this task, you need two tools: Flask and Pickle, both of which run on Python.
✅ What's [Flask](https://palletsprojects.com/p/flask/)? Defined as a 'micro-framework' by its creators, Flask provides the basic features of web frameworks using Python and a templating engine to build web pages. Take a look at [this Learn module](https://docs.microsoft.com/learn/modules/python-flask-build-ai-web-app?WT.mc_id=academic-15963-cxa) to practice building with Flask.
✅ What's [Pickle](https://docs.python.org/3/library/pickle.html)? Pickle 🥒 is a Python module that serializes and de-serializes a Python object structure. When you 'pickle' a model, you serialize or flatten its structure for use on the web. Be careful: pickle is not intrinsically secure, so be careful if prompted to 'un-pickle' a file. A pickled file has the suffix `.pkl`.
## Exercise - clean your data
In this lesson you'll use data from 80,000 UFO sightings, gathered by [NUFORC](https://nuforc.org) (The National UFO Reporting Center). This data has some interesting descriptions of UFO sightings, for example:
- **Long example description.** "A man emerges from a beam of light that shines on a grassy field at night and he runs towards the Texas Instruments parking lot".
- **Short example description.** "the lights chased us".
The [ufos.csv](../data/ufos.csv) spreadsheet includes columns about the `city`, `state` and `country` where the sighting occurred, the object's `shape` and its `latitude` and `longitude`.
In the blank [notebook](../notebook.ipynb) included in this lesson:
1. Import `pandas`, `matplotlib`, and `numpy` as you did in previous lessons, and import the ufos spreadsheet. You can take a look at a sample data set:
```python
import pandas as pd
import numpy as np
ufos = pd.read_csv('../data/ufos.csv')
ufos.head()
```
1. Convert the ufos data to a small dataframe with fresh titles. Check the unique values in the `Country` field.
```python
ufos = pd.DataFrame({'Seconds': ufos['duration (seconds)'], 'Country': ufos['country'],'Latitude': ufos['latitude'],'Longitude': ufos['longitude']})
ufos.Country.unique()
```
1. Now, you can reduce the amount of data we need to deal with by dropping any null values and only importing sightings between 1-60 seconds:
```python
ufos.dropna(inplace=True)
ufos = ufos[(ufos['Seconds'] >= 1) & (ufos['Seconds'] <= 60)]
ufos.info()
```
1. Import Scikit-learn's `LabelEncoder` library to convert the text values for countries to a number:
✅ LabelEncoder encodes data alphabetically
```python
from sklearn.preprocessing import LabelEncoder
ufos['Country'] = LabelEncoder().fit_transform(ufos['Country'])
ufos.head()
```
Your data should look like this:
```output
Seconds Country Latitude Longitude
2 20.0 3 53.200000 -2.916667
3 20.0 4 28.978333 -96.645833
14 30.0 4 35.823889 -80.253611
23 60.0 4 45.582778 -122.352222
24 3.0 3 51.783333 -0.783333
```
## Exercise - build your model
Now you can get ready to train a model by dividing the data into the training and testing group.
1. Select the three features you want to train on as your X vector; the y vector will be the `Country`. You want to be able to input `Seconds`, `Latitude` and `Longitude` and get a country id to return.
```python
from sklearn.model_selection import train_test_split
Selected_features = ['Seconds','Latitude','Longitude']
X = ufos[Selected_features]
y = ufos['Country']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
1. Train your model using logistic regression:
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
The accuracy isn't bad **(around 95%)**, unsurprisingly, as `Country` and `Latitude/Longitude` correlate.
The model you created isn't very revolutionary, as you should be able to infer a `Country` from its `Latitude` and `Longitude`, but it's a good exercise to try to train from raw data that you cleaned and exported, and then to use this model in a web app.
## Exercise - 'pickle' your model
Now, it's time to _pickle_ your model! You can do that in a few lines of code. Once it's _pickled_, load your pickled model and test it against a sample data array containing values for seconds, latitude and longitude:
```python
import pickle
model_filename = 'ufo-model.pkl'
pickle.dump(model, open(model_filename,'wb'))
model = pickle.load(open('ufo-model.pkl','rb'))
print(model.predict([[50,44,-12]]))
```
The model returns **'3'**, which is the country code for the UK. Wild! 👽
## Exercise - build a Flask app
Now you can build a Flask app to call your model and return similar results, but in a more visually pleasing way.
1. Start by creating a folder called **web-app** next to the _notebook.ipynb_ file where your _ufo-model.pkl_ file resides.
1. In that folder create three more folders: **static**, with a folder **css** inside it, and **templates**. You should now have the following files and directories:
```output
web-app/
static/
css/
templates/
notebook.ipynb
ufo-model.pkl
```
✅ Refer to the solution folder for a view of the finished app
1. The first file to create in the _web-app_ folder is the **requirements.txt** file. Like _package.json_ in a JavaScript app, this file lists the dependencies required by the app. In **requirements.txt** add the lines:
```text
scikit-learn
pandas
numpy
flask
```
1. Now, run this file by navigating to _web-app_:
```bash
cd web-app
```
1. In your terminal, where requirements.txt resides, type `pip install` to install the libraries listed in _requirements.txt_:
```bash
pip install -r requirements.txt
```
1. Now, you're ready to create three more files to finish the app:
    1. Create **app.py** in the root directory.
    2. Create **index.html** in the _templates_ directory.
    3. Create **styles.css** in the _static/css_ directory.
1. Build out the _styles.css_ file with a few styles:
```css
body {
width: 100%;
height: 100%;
font-family: 'Helvetica';
background: black;
color: #fff;
text-align: center;
letter-spacing: 1.4px;
font-size: 30px;
}
input {
min-width: 150px;
}
.grid {
width: 300px;
border: 1px solid #2d2d2d;
display: grid;
justify-content: center;
margin: 20px auto;
}
.box {
color: #fff;
background: #2d2d2d;
padding: 12px;
display: inline-block;
}
```
1. Next, build the _index.html_ file:
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>🛸 UFO Appearance Prediction! 👽</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
<div class="grid">
<div class="box">
<p>According to the number of seconds, latitude and longitude, which country is likely to have reported seeing a UFO?</p>
<form action="{{ url_for('predict')}}" method="post">
<input type="number" name="seconds" placeholder="Seconds" required="required" min="0" max="60" />
<input type="text" name="latitude" placeholder="Latitude" required="required" />
<input type="text" name="longitude" placeholder="Longitude" required="required" />
<button type="submit" class="btn">Predict country where the UFO is seen</button>
</form>
<p>{{ prediction_text }}</p>
</div>
</div>
</body>
</html>
```
Take a look at the templating in this file. Notice the 'mustache' syntax with curly braces around variables that will be provided by the app, like the prediction text: `{{}}`. There is also a form that posts a prediction to the `/predict` route.
Finally, you're ready to build the Python file that drives the consumption of the model and the display of predictions:
1. In `app.py` add:
```python
import numpy as np
from flask import Flask, request, render_template
import pickle
app = Flask(__name__)
model = pickle.load(open("../ufo-model.pkl", "rb"))
@app.route("/")
def home():
return render_template("index.html")
@app.route("/predict", methods=["POST"])
def predict():
int_features = [int(x) for x in request.form.values()]
final_features = [np.array(int_features)]
prediction = model.predict(final_features)
output = prediction[0]
countries = ["Australia", "Canada", "Germany", "UK", "US"]
return render_template(
"index.html", prediction_text="Likely country: {}".format(countries[output])
)
if __name__ == "__main__":
app.run(debug=True)
```
> 💡 Tip: when you add [`debug=True`](https://www.askpython.com/python-modules/flask/flask-debug-mode) while running the web app using Flask, any changes you make to your application will be reflected immediately without the need to restart the server. Beware! Don't enable this mode in a production app.
If you run `python app.py` or `python3 app.py`, your web server starts up locally, and you can fill out a short form to get an answer to your burning question about where UFOs have been sighted!
Before doing that, take a look at the parts of `app.py`:
1. First, dependencies are loaded and the app starts.
1. Then, the model is imported.
1. Finally, index.html is rendered on the home route.
On the `/predict` route, several things happen when the form is posted:
1. The form variables are gathered and converted to a numpy array. They are then sent to the model, and a prediction is returned.
2. The countries that we want displayed are re-rendered as readable text from their predicted country code, and that value is sent back to index.html to be rendered in the web page template.
Using a model this way, with Flask and a pickled model, is relatively straightforward. The hardest thing is to understand what shape the data is that must be sent to the model to get a prediction. That all depends on how the model was trained. This one has three data points to input in order to get a prediction.
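To see that shape requirement in action without opening a browser, here is an optional sketch that posts the same three form fields to the running app with the `requests` library (assuming the Flask server is up locally on its default port 5000 and that `requests` is installed):
```python
import requests
# The field names match the <input> names in index.html
form_data = {"seconds": "50", "latitude": "44", "longitude": "-12"}
response = requests.post("http://127.0.0.1:5000/predict", data=form_data)
print(response.status_code)                # 200 if the route worked
print("Likely country" in response.text)   # the rendered prediction text
```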
In a professional setting, you can see how good communication is necessary between the folks who train the model and those who consume it in a web or mobile app. In our case, you play both roles!
---
## 🚀 Challenge
Instead of working in a notebook and importing the model into the Flask app, you could train the model right within the Flask app! Try converting your Python code in the notebook, perhaps after your data is cleaned, to train the model from within the app on a route called `/train`. What are the pros and cons of pursuing this method?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/18/)
## Review & Self Study
There are many ways to build a web app to consume ML models. Make a list of the ways you could use JavaScript or Python to build a web app to leverage machine learning. Consider architecture: should the model stay in the app or live in the cloud? If the latter, how would you access it? Draw out an architectural model for an applied ML web solution.
## Assignment
[Try a different model](assignment.it.md)

@ -0,0 +1,347 @@
# Build a Web App to use an ML Model
In this lesson, you will train an ML model on a data set that's out of this world: _UFO sightings over the past century_, sourced from [NUFORC's database](https://www.nuforc.org).
You will learn:
- How to 'pickle' a trained model
- How to use that model in a Flask app
We will continue to use notebooks to clean data and train our model, but you can take it one step further by exploring the use of the model in a web app.
To do this, you need to build a web app using Flask.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/17/)
## Building an app
There are several ways to build web apps to consume machine learning models. Your web architecture may influence the way your model is trained. Imagine that you are working in a business where the data science group has trained a model that they want you to use in an app.
### Considerations
There are many questions you need to ask:
- **Is it a web app or a mobile app?** If you are building a mobile app or need to use the model in an IoT context, you could use [TensorFlow Lite](https://www.tensorflow.org/lite/) and use the model in an Android or iOS app.
- **Where will the model reside?** In the cloud or locally?
- **Offline support.** Does the app have to work offline?
- **What technology was used to train the model?** The chosen technology may influence the tooling you need to use.
- **Using TensorFlow.** If you are training a model using TensorFlow, for example, that ecosystem provides the ability to convert a TensorFlow model for use in a web app by using [TensorFlow.js](https://www.tensorflow.org/js/).
- **Using PyTorch.** If you are building a model using a library such as [PyTorch](https://pytorch.org/), you have the option to export it to the [ONNX](https://onnx.ai/) (Open Neural Network Exchange) format for use in JavaScript web apps that can use the [Onnx Runtime](https://www.onnxruntime.ai/). This option will be explored in a future lesson for a Scikit-learn-trained model.
- **Using Lobe.ai or Azure Custom Vision.** If you are using an ML SaaS (Software as a Service) system such as [Lobe.ai](https://lobe.ai/) or [Azure Custom Vision](https://azure.microsoft.com/services/cognitive-services/custom-vision-service/?WT.mc_id=academic-15963-cxa) to train a model, this type of software provides ways to export the model for many platforms, including building a bespoke API to be queried in the cloud by your online application.
You also have the chance to build an entire Flask web app that would be able to train the model itself in a web browser. This can also be done using TensorFlow.js in a JavaScript context.
For our purposes, since we have been working with Python-based notebooks, let's explore the steps you need to take to export a trained model from such a notebook to a format readable by a Python-built web app.
## Tools
For this task you need two tools: Flask and Pickle, both of which run on Python.
✅ What's [Flask](https://palletsprojects.com/p/flask/)? Defined as a 'micro-framework' by its creators, Flask provides the basic features of web frameworks using Python and a templating engine to build web pages. Take a look at [this Learn module](https://docs.microsoft.com/learn/modules/python-flask-build-ai-web-app?WT.mc_id=academic-15963-cxa) to practice building with Flask.
✅ What's [Pickle](https://docs.python.org/3/library/pickle.html)? Pickle 🥒 is a Python module that serializes and de-serializes Python object structures. When you 'pickle' a model, you serialize or flatten its structure for use on the web. Careful: pickle is not intrinsically secure, so be careful if prompted to 'un-pickle' a file. A pickled file has the suffix `.pkl`.
## Exercise - clean your data
In this lesson you'll use data from 80,000 UFO sightings, gathered by [NUFORC](https://nuforc.org) (The National UFO Reporting Center). This data has some interesting descriptions of UFO sightings, for example:
- **Long example description.** "A man emerges from a beam of light that shines on a grassy field at night and he runs towards the Texas Instruments parking lot".
- **Short example description.** "the lights chased us".
The [ufos.csv](./data/ufos.csv) spreadsheet includes columns about the `city`, `state` and `country` where the sighting occurred, the object's `shape` and its `latitude` and `longitude`.
In the blank [notebook](notebook.ipynb) included in this lesson:
1. Import `pandas`, `matplotlib`, and `numpy` as you did in previous lessons, then import the ufos spreadsheet. You can take a look at a sample data set:
```python
import pandas as pd
import numpy as np
ufos = pd.read_csv('../data/ufos.csv')
ufos.head()
```
2. Convert the ufos data to a small dataframe with fresh titles. Check the unique values in the `country` field.
```python
ufos = pd.DataFrame({'Seconds': ufos['duration (seconds)'], 'Country': ufos['country'],'Latitude': ufos['latitude'],'Longitude': ufos['longitude']})
ufos.Country.unique()
```
3. Now, you can reduce the amount of data we need to deal with by dropping any null values and only importing sightings between 1-60 seconds:
```python
ufos.dropna(inplace=True)
ufos = ufos[(ufos['Seconds'] >= 1) & (ufos['Seconds'] <= 60)]
ufos.info()
```
4. 导入Scikit-learn的`LabelEncoder`库,将国家的文本值转换为数字:
✅ LabelEncoder按字母顺序编码数据
```python
from sklearn.preprocessing import LabelEncoder
ufos['Country'] = LabelEncoder().fit_transform(ufos['Country'])
ufos.head()
```
你的数据应如下所示:
```output
Seconds Country Latitude Longitude
2 20.0 3 53.200000 -2.916667
3 20.0 4 28.978333 -96.645833
14 30.0 4 35.823889 -80.253611
23 60.0 4 45.582778 -122.352222
24 3.0 3 51.783333 -0.783333
```
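Since `LabelEncoder` assigns codes in alphabetical order, you can recover the mapping later. A small sketch, assuming country values like the two-letter codes in the ufos data:

```python
from sklearn.preprocessing import LabelEncoder

encoder = LabelEncoder()
# Alphabetical order of classes: au=0, ca=1, de=2, gb=3, us=4
print(encoder.fit_transform(["us", "gb", "au", "ca", "de"]))  # [4 3 0 1 2]
print(encoder.classes_)                 # ['au' 'ca' 'de' 'gb' 'us']
print(encoder.inverse_transform([3]))   # ['gb']
```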
## Exercise - build your model
Now you can get ready to train a model by dividing the data into the training and testing group.
1. Select the three features you want to train on as your X vector, and the y vector will be the `Country`. You want to be able to input `Seconds`, `Latitude` and `Longitude` and get a country id to return.
```python
from sklearn.model_selection import train_test_split
Selected_features = ['Seconds','Latitude','Longitude']
X = ufos[Selected_features]
y = ufos['Country']
X_train, X_test, y_train, y_test = train_test_split(X, y, test_size=0.2, random_state=0)
```
2. Train your model using logistic regression:
```python
from sklearn.metrics import accuracy_score, classification_report
from sklearn.linear_model import LogisticRegression
model = LogisticRegression()
model.fit(X_train, y_train)
predictions = model.predict(X_test)
print(classification_report(y_test, predictions))
print('Predicted labels: ', predictions)
print('Accuracy: ', accuracy_score(y_test, predictions))
```
The accuracy isn't bad **(around 95%)**, unsurprisingly, as `Country` and `Latitude/Longitude` correlate.
The model you created isn't very revolutionary, as you should be able to infer a `Country` from its `Latitude` and `Longitude`, but it's a good exercise to try to train from raw data that you cleaned and exported, and then to use the model in a web app.
## Exercise - 'pickle' your model
Now, it's time to _pickle_ your model! You can do that in a few lines of code. Once it's _pickled_, load your pickled model and test it against a sample data array containing values for seconds, latitude and longitude:
```python
import pickle
model_filename = 'ufo-model.pkl'
pickle.dump(model, open(model_filename,'wb'))
model = pickle.load(open('ufo-model.pkl','rb'))
print(model.predict([[50,44,-12]]))
```
The model returns **'3'**, which is the country code for the UK. 👽
## Exercise - build a Flask app
Now you can build a Flask app to call your model and return similar results, but in a more visually pleasing way.
1. Start by creating a folder called **web-app** next to the _notebook.ipynb_ file where your _ufo-model.pkl_ file resides.
2. In that folder create three more folders: **static**, with a folder **css** inside it, and **templates**. You should now have the following files and directories:
```output
web-app/
static/
css/
templates/
notebook.ipynb
ufo-model.pkl
```
✅ Refer to the solution folder for a view of the finished app
3. The first file to create in the _web-app_ folder is the **requirements.txt** file. Like _package.json_ in a JavaScript app, this file lists dependencies required by the app. In **requirements.txt** add the lines:
```text
scikit-learn
pandas
numpy
flask
```
4. Now, navigate to the _web-app_ folder:
```bash
cd web-app
```
5. In your terminal type `pip install` to install the libraries listed in _requirements.txt_:
```bash
pip install -r requirements.txt
```
6. Now, you're ready to create three more files to finish the app:
1. Create **app.py** in the root.
2. Create **index.html** in the _templates_ directory.
3. Create **styles.css** in the _static/css_ directory.
7. Build out the _styles.css_ file with a few styles:
```css
body {
width: 100%;
height: 100%;
font-family: 'Helvetica';
background: black;
color: #fff;
text-align: center;
letter-spacing: 1.4px;
font-size: 30px;
}
input {
min-width: 150px;
}
.grid {
width: 300px;
border: 1px solid #2d2d2d;
display: grid;
justify-content: center;
margin: 20px auto;
}
.box {
color: #fff;
background: #2d2d2d;
padding: 12px;
display: inline-block;
}
```
8. 接下来构建_index.html_文件
```html
<!DOCTYPE html>
<html>
<head>
<meta charset="UTF-8">
<title>🛸 UFO Appearance Prediction! 👽</title>
<link rel="stylesheet" href="{{ url_for('static', filename='css/styles.css') }}">
</head>
<body>
<div class="grid">
<div class="box">
<p>According to the number of seconds, latitude and longitude, which country is likely to have reported seeing a UFO?</p>
<form action="{{ url_for('predict')}}" method="post">
<input type="number" name="seconds" placeholder="Seconds" required="required" min="0" max="60" />
<input type="text" name="latitude" placeholder="Latitude" required="required" />
<input type="text" name="longitude" placeholder="Longitude" required="required" />
<button type="submit" class="btn">Predict country where the UFO is seen</button>
</form>
<p>{{ prediction_text }}</p>
</div>
</div>
</body>
</html>
```
Take a look at the templating in this file. Notice the 'mustache' syntax around variables that will be provided by the app, like the prediction text: `{{}}`. There's also a form that posts a prediction to the `/predict` route.
Finally, you're ready to build the Python file that drives the consumption of the model and the display of predictions:
9. In `app.py`, add:
```python
import numpy as np
from flask import Flask, request, render_template
import pickle

app = Flask(__name__)

model = pickle.load(open("../ufo-model.pkl", "rb"))


@app.route("/")
def home():
    return render_template("index.html")


@app.route("/predict", methods=["POST"])
def predict():
    int_features = [int(x) for x in request.form.values()]
    final_features = [np.array(int_features)]
    prediction = model.predict(final_features)

    output = prediction[0]

    countries = ["Australia", "Canada", "Germany", "UK", "US"]

    return render_template(
        "index.html", prediction_text="Likely country: {}".format(countries[output])
    )


if __name__ == "__main__":
    app.run(debug=True)
```
> 💡 Tip: when you add [`debug=True`](https://www.askpython.com/python-modules/flask/flask-debug-mode) while running the web app using Flask, any changes you make to your application will be reflected immediately without the need to restart the server. Beware! Don't enable this mode in a production app.
If you run `python app.py` or `python3 app.py` - your web server starts up locally, and you can fill out a short form to get an answer to your burning question about where UFOs have been sighted!
Before doing that, take a look at the parts of `app.py`:
1. First, dependencies are loaded and the app starts.
2. Then, the model is imported.
3. Then, index.html is rendered on the home route.
On the `/predict` route, several things happen when the form is posted:
1. The form variables are gathered and converted to a numpy array. They are then sent to the model and a prediction is returned.
2. The countries that we want displayed are re-rendered as readable text from their predicted country code, and that value is sent back to index.html to be rendered in the template.
Using a model this way, with Flask and a pickled model, is relatively straightforward. The hardest thing is to understand what shape the data is that must be sent to the model to get a prediction. That all depends on how the model was trained. This one has three data points to be input in order to get a prediction; a small sanity-check sketch follows below.
In a professional setting, you can see how good communication is necessary between the folks who train the model and those who consume it in a web or mobile app. In our case, it's only one person, you!
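As a quick illustration of that shape, a sketch (assuming the pickled model is loaded as in `app.py` above):

```python
import numpy as np

# One row, three features, in training order: Seconds, Latitude, Longitude
sample = np.array([[50, 44, -12]])
print(sample.shape)           # (1, 3) - the 2D shape model.predict() expects
print(model.predict(sample))  # e.g. [3], the encoded country code for 'UK'
```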
---
## 🚀 Challenge:
Instead of working in a notebook and importing the model to the Flask app, you could train the model right within the Flask app! Try converting your Python code from the notebook, perhaps after your data is cleaned, to train the model from within the app on a route called `train`. What are the pros and cons of pursuing this method?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/18/)
## Review & Self Study
There are many ways to build a web app to consume ML models. Make a list of the ways you could use JavaScript or Python to build a web app to leverage machine learning. Consider architecture: should the model stay in the app or live in the cloud? If the latter, how would you access it? Draw out an architectural model for an applied ML web solution.
## Assignment
[Try a different model](../assignment.md)

@ -0,0 +1,11 @@
# Try a different model
## Instructions
Now that you've built a web app using a trained Regression model, use one of the models from an earlier Regression lesson to redo this web app. You can keep the style or design it differently to reflect the pumpkin data. Be careful to change the inputs to reflect your model's training method.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------------------------- | --------------------------------------------------------- | --------------------------------------------------------- | -------------------------------------- |
| | The web app runs as expected and is deployed to the cloud | The web app contains flaws or exhibits unexpected results | The web app does not function properly |

@ -0,0 +1,12 @@
# Try a different model
## Instructions
Now that you've built a web app using a trained Regression model, use one of the models from an earlier Regression lesson to redo this web app. You can keep the original style or design it differently to reflect the pumpkin data. Be careful to change the inputs to reflect your model's training method.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------------------------- | --------------------------------------------------------- | --------------------------------------------------------- | -------------------------------------- |
| | The web app runs as expected and is deployed to the cloud | The web app contains flaws or exhibits unexpected results | The web app does not function properly |

@ -0,0 +1,22 @@
# Build a web app to use your ML model
In this section of the curriculum, you will be introduced to an applied ML topic: how to save your Scikit-learn model as a file that can be used to make predictions within a web application. Once the model is saved, you'll learn how to use it in a web app built in Flask. You'll first create a model using some data that's all about UFO sightings! Then, you'll build a web app that will allow you to input a number of seconds with a latitude and a longitude value to predict which country reported seeing a UFO.
![UFO Parking](../images/ufo.jpg)
Photo by <a href="https://unsplash.com/@mdherren?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Michael Herren</a> on <a href="https://unsplash.com/s/photos/ufo?utm_source=unsplash&utm_medium=referral&utm_content=creditCopyText">Unsplash</a>
## Lessons
1. [Build a Web App](../1-Web-App/translations/README.it.md)
## Credits
"Build a Web App" was written with ♥️ by [Jen Looper](https://twitter.com/jenlooper).
♥️ The quizzes were written by Rohan Raj.
The dataset is sourced from [Kaggle](https://www.kaggle.com/NUFORC/ufo-sightings).
The web app architecture was suggested in part by [this article](https://towardsdatascience.com/how-to-easily-deploy-machine-learning-models-using-flask-b95af8fe34d4) and [this repo](https://github.com/abhinavsagar/machine-learning-deployment) by Abhinav Sagar.

@ -163,7 +163,7 @@ Now you can dig deeper into the data and learn what are the typical ingredients
def create_ingredient_df(df):
ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
ingredient_df = ingredient_df.sort_values(by='value', ascending=False
ingredient_df = ingredient_df.sort_values(by='value', ascending=False,
inplace=False)
return ingredient_df
```
@ -275,7 +275,7 @@ Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev
```python
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../data/cleaned_cuisine.csv")
transformed_df.to_csv("../data/cleaned_cuisines.csv")
```
This fresh CSV can now be found in the root data folder.

@ -622,7 +622,7 @@
"metadata": {},
"outputs": [],
"source": [
"transformed_df.to_csv(\"../../data/cleaned_cuisine.csv\")"
"transformed_df.to_csv(\"../../data/cleaned_cuisines.csv\")"
]
},
{

@ -0,0 +1,297 @@
# Introduction to classification
In these four lessons, you will explore a fundamental focus of classic machine learning - _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
![just a pinch!](../images/pinch.png)
> Celebrate pan-Asian cuisines in these lessons! Image by [Jen Looper](https://twitter.com/jenlooper)
Classification is a form of [supervised learning](https://it.wikipedia.org/wiki/Apprendimento_supervisionato) that bears a lot in common with regression techniques. If machine learning is all about predicting values or names of things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_.
[![Introduction to classification](https://img.youtube.com/vi/eg8DJYwdMyg/0.jpg)](https://youtu.be/eg8DJYwdMyg "Introduction to classification")
> 🎥 Click the image above for a video: MIT's John Guttag introduces classification
Remember:
- **Linear regression** helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict _what price a pumpkin would be in September vs. December_, for example.
- **Logistic regression** helped you discover "binary categories": at this price point, _is this pumpkin orange or not-orange_?
Classification uses various algorithms to determine other ways of assigning a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/19/)
### Introduction
Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.
To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables and output variables.
![binary vs. multiclass classification](../images/binary-multiclass.png)
> Binary vs. multiclass problems for classification algorithms to handle. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
Derived from [statistics](https://it.wikipedia.org/wiki/Classificazione_statistica), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine _the likelihood of developing disease X._ As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome. A toy sketch of this idea follows below.
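As an illustration only (the numbers below are invented), a classifier can map such features to a probability of the label:

```python
import pandas as pd
from sklearn.linear_model import LogisticRegression

# Invented mini-dataset: features -> labeled outcome 'disease_x'
df = pd.DataFrame({
    "smoker":    [0, 1, 0, 1, 1, 0],
    "weight":    [70, 90, 60, 85, 95, 65],
    "age":       [30, 55, 25, 60, 50, 40],
    "disease_x": [0, 1, 0, 1, 1, 0],
})

model = LogisticRegression().fit(df[["smoker", "weight", "age"]].values,
                                 df["disease_x"])
# predict_proba returns [P(class 0), P(class 1)] for each row
print(model.predict_proba([[1, 88, 52]]))
```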
✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a gift of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?
[![Crazy mystery baskets](https://img.youtube.com/vi/GuTeDbaNoEU/0.jpg)](https://youtu.be/GuTeDbaNoEU "Crazy mystery baskets")
> 🎥 Click the image above for a video. The whole premise of the show 'Chopped' is the 'mystery basket' where chefs have to make some dish out of a random choice of ingredients. Surely a ML model would have helped!
## Hello 'classifier'
The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
Scikit-learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.
## Exercise - clean and balance your data
The first task at hand, before starting this project, is to clean and **balance** your data to get better results. Start with the blank _notebook.ipynb_ file in the root of this folder.
The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you'll learn more about this task in a minute).
1. To install `imblearn`, run `pip install`, like so:
```python
pip install imblearn
```
1. Import the packages you need to import your data and visualize it; also import `SMOTE` from `imblearn`.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE
```
Now you are set up to read in the data next.
1. The next task will be to import the data:
```python
df = pd.read_csv('../data/cuisines.csv')
```
Using `read_csv()` will read the content of the csv file _cuisines.csv_ and place it in the variable `df`.
1. Check the data's shape:
```python
df.head()
```
The first five rows look like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 65 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 66 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 67 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 69 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Get info about this data by calling `info()`:
```python
df.info()
```
Your output resembles:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2448 entries, 0 to 2447
Columns: 385 entries, Unnamed: 0 to zucchini
dtypes: int64(384), object(1)
memory usage: 7.2+ MB
```
## Exercise - learning about cuisines
Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine.
1. Plot the data as bars by calling `barh()`:
```python
df.cuisine.value_counts().plot.barh()
```
![cuisine data distribution](../images/cuisine-dist.png)
There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.
1. Find out how much data is available per cuisine and print it out:
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]
print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
the output looks like so:
```output
thai df: (289, 385)
japanese df: (320, 385)
chinese df: (442, 385)
indian df: (598, 385)
korean df: (799, 385)
```
## Discovering ingredients
Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
1. Create a function `create_ingredient_df()` in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sorting ingredients by their count:
```python
def create_ingredient_df(df):
    ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False,
    inplace=False)
    return ingredient_df
```
Now you can use that function to get an idea of the top ten most popular ingredients by cuisine.
1. Call `create_ingredient_df()` and plot it by calling `barh()`:
```python
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
```
![thai](../images/thai.png)
1. Do the same for the japanese data:
```python
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
```
![japanese](../images/japanese.png)
1. Now for the chinese ingredients:
```python
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
```
![chinese](../images/chinese.png)
1. Plot the indian ingredients:
```python
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
```
![indian](../images/indian.png)
1. Finally, plot the korean ingredients:
```python
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
```
![korean](../images/korean.png)
1. Now, drop the most common ingredients that create confusion between distinct cuisines, by calling `drop()`:
Everyone loves rice, garlic and ginger!
```python
feature_df= df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine #.unique()
feature_df.head()
```
## Balance the dataset
Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
1. Call `fit_resample()`; this strategy generates new samples by interpolation.
```python
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
```
By balancing your data, you'll have better results when classifying it. Think about a binary classification: if most of your data is one class, a ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance.
1. Now you can check the numbers of labels per ingredient:
```output
new label count: korean 799
chinese 799
indian 799
japanese 799
thai 799
Name: cuisine, dtype: int64
old label count: korean 799
indian 598
chinese 442
japanese 320
thai 289
Name: cuisine, dtype: int64
```
The data is nice and clean, balanced, and very delicious!
1. The last step is to save your balanced data, including labels and features, into a new dataframe that can be exported into a file:
```python
transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
```
1. You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
```python
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../data/cleaned_cuisines.csv")
```
This fresh CSV can now be found in the root data folder.
---
## 🚀 Challenge
This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification. What questions would you ask of this dataset?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/20/)
## Review & Self Study
Explore SMOTE's API. What use cases is it best used for? What problems does it solve?
## Assignment
[Explore classification methods](assignment.it.md)

@ -0,0 +1,298 @@
# Introduction to classification
In these four lessons, you will explore a fundamental focus of classic machine learning - _classification_. We will walk through using various classification algorithms with a dataset about all the brilliant cuisines of Asia and India. Hope you're hungry!
![just a pinch!](../images/pinch.png)
> Celebrate pan-Asian cuisines in these lessons! Image by [Jen Looper](https://twitter.com/jenlooper)
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning) that bears a lot in common with regression techniques. If machine learning is all about predicting values or names of things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_.
[![Introduction to classification](https://img.youtube.com/vi/eg8DJYwdMyg/0.jpg)](https://youtu.be/eg8DJYwdMyg "Introduction to classification")
> :movie_camera: Click the image above for a video: MIT's John Guttag introduces classification
Remember:
- **Linear regression** helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict _what price a pumpkin would be in September vs. December_, for example.
- **Logistic regression** helped you discover "binary categories": at this price point, _is this pumpkin orange or not-orange_?
Classification uses various algorithms to determine other ways of assigning a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/19/?loc=tr)
### Introduction
Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.
To state the process in a more scientific way, your classification method creates a predictive model that enables you to map the relationship between input variables and output variables.
![binary vs. multiclass classification](../images/binary-multiclass.png)
> Binary vs. multiclass problems for classification algorithms to handle. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as `smoker`, `weight`, and `age`, to determine _the likelihood of developing disease X._ As a supervised learning technique similar to the regression exercises you performed earlier, your data is labeled and the ML algorithms use those labels to classify and predict classes (or 'features') of a dataset and assign them to a group or outcome.
:white_check_mark: Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek? What if you wanted to see whether, given a gift of a grocery bag full of star anise, artichokes, cauliflower, and horseradish, you could create a typical Indian dish?
[![Crazy mystery baskets](https://img.youtube.com/vi/GuTeDbaNoEU/0.jpg)](https://youtu.be/GuTeDbaNoEU "Crazy mystery baskets")
> :movie_camera: Click the image above for a video. The whole premise of the show 'Chopped' is the 'mystery basket' where chefs have to make some dish out of a random choice of ingredients. Surely a ML model would have helped!
## Hello 'classifier'
The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
Scikit-learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.
## Exercise - clean and balance your data
The first task at hand, before starting this project, is to clean and **balance** your data to get better results. Start with the blank _notebook.ipynb_ file in the root of this folder.
The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you'll learn more about this task in a minute).
1. To install `imblearn`, run `pip install`, like so:
```python
pip install imblearn
```
1. Import the packages you need to import your data and visualize it; also import `SMOTE` from `imblearn`.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE
```
Now you are set up to read in the data next.
1. The next task will be to import the data:
```python
df = pd.read_csv('../data/cuisines.csv')
```
Using `read_csv()` will read the content of the csv file _cuisines.csv_ and place it in the variable `df`.
1. Check the data's shape:
```python
df.head()
```
The first five rows look like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 65 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 66 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 67 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 69 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Get info about this data by calling `info()`:
```python
df.info()
```
Your output resembles:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2448 entries, 0 to 2447
Columns: 385 entries, Unnamed: 0 to zucchini
dtypes: int64(384), object(1)
memory usage: 7.2+ MB
```
## Exercise - learning about cuisines
Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine.
1. Plot the data as bars by calling `barh()`:
```python
df.cuisine.value_counts().plot.barh()
```
![cuisine data distribution](../images/cuisine-dist.png)
There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.
1. Find out how much data is available per cuisine and print it out:
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]
print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
the output looks like so:
```output
thai df: (289, 385)
japanese df: (320, 385)
chinese df: (442, 385)
indian df: (598, 385)
korean df: (799, 385)
```
## Discovering ingredients
Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
1. Create a function `create_ingredient_df()` in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sorting ingredients by their count:
```python
def create_ingredient_df(df):
    ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False,
    inplace=False)
    return ingredient_df
```
Now you can use that function to get an idea of the top ten most popular ingredients by cuisine.
1. Call `create_ingredient_df()` and plot it by calling `barh()`:
```python
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
```
![thai](../images/thai.png)
1. Do the same for the japanese data:
```python
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
```
![japanese](../images/japanese.png)
1. Now for the chinese ingredients:
```python
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
```
![chinese](../images/chinese.png)
1. Plot the indian ingredients:
```python
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
```
![indian](../images/indian.png)
1. Finally, plot the korean ingredients:
```python
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
```
![korean](../images/korean.png)
1. Now, drop the most common ingredients that create confusion between distinct cuisines, by calling `drop()`:
Everyone loves rice, garlic and ginger!
```python
feature_df= df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine #.unique()
feature_df.head()
```
## Balance the dataset
Now that you have cleaned the data, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
1. Call `fit_resample()`; this strategy generates new samples by interpolation.
```python
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
```
By balancing your data, you'll have better results when classifying it. Think about a binary classification: if most of your data is one class, a ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance.
1. Now you can check the numbers of labels per ingredient:
```python
print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')
```
Your output resembles:
```output
new label count: korean 799
chinese 799
indian 799
japanese 799
thai 799
Name: cuisine, dtype: int64
old label count: korean 799
indian 598
chinese 442
japanese 320
thai 289
Name: cuisine, dtype: int64
```
The data is nice and clean, balanced, and very delicious!
1. The last step is to save your balanced data, including labels and features, into a new dataframe that can be exported into a file:
```python
transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
```
1. You can take one more look at the data using `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
```python
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../../data/cleaned_cuisines.csv")
```
This fresh CSV can now be found in the root data folder.
---
## :rocket: Challenge
This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification. What questions would you ask of this dataset?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/20/?loc=tr)
## Review & Self Study
Explore SMOTE's API. What use cases is it best used for? What problems does it solve?
## Assignment
[Explore classification methods](assignment.tr.md)

@ -0,0 +1,291 @@
# Introduction to classification
In these four lessons, you will explore a fundamental focus of classic machine learning - _classification_. We will walk through using various classification algorithms with a dataset about the marvelous cuisines of Asia and India. Hope you're hungry!
![just a pinch!](../images/pinch.png)
> Celebrate pan-Asian cuisines in these lessons! Image by [Jen Looper](https://twitter.com/jenlooper)
Classification is a form of [supervised learning](https://wikipedia.org/wiki/Supervised_learning). It bears a lot in common with regression techniques. If machine learning is all about predicting values or names of things by using datasets, then classification generally falls into two groups: _binary classification_ and _multiclass classification_.
[![Introduction to classification](https://img.youtube.com/vi/eg8DJYwdMyg/0.jpg)](https://youtu.be/eg8DJYwdMyg "Introduction to classification")
> 🎥 Click the image above for a video: MIT's John Guttag introduces classification
Remember:
- **Linear regression** helped you predict relationships between variables and make accurate predictions on where a new datapoint would fall in relationship to that line. So, you could predict _what price a pumpkin would be in September vs. October_, for example.
- **Logistic regression** helped you discover "binary categories": at this price point, _is this pumpkin orange or not-orange_?
Classification uses various algorithms to determine other ways of assigning a data point's label or class. Let's work with this cuisine data to see whether, by observing a group of ingredients, we can determine its cuisine of origin.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/19/)
Classification is one of the fundamental activities of the machine learning researcher and data scientist. From basic classification of a binary value ("is this email spam or not?"), to complex image classification and segmentation using computer vision, it's always useful to be able to sort data into classes and ask questions of it.
![binary vs. multiclass classification](../images/binary-multiclass.png)
> Binary vs. multiclass problems for classification algorithms to handle. Infographic by [Jen Looper](https://twitter.com/jenlooper)
Before starting the process of cleaning our data, visualizing it, and prepping it for our ML tasks, let's learn a bit about the various ways machine learning can be leveraged to classify data.
Derived from [statistics](https://wikipedia.org/wiki/Statistical_classification), classification using classic machine learning uses features, such as 'smoker', 'weight', and 'age', to determine _the likelihood of developing disease X_. As a supervised learning technique similar to the regression exercises you just performed, your data is labeled and the algorithms use those labels to classify, predict, and output classes.
✅ Take a moment to imagine a dataset about cuisines. What would a multiclass model be able to answer? What would a binary model be able to answer? What if you wanted to determine whether a given cuisine was likely to use fenugreek (a plant whose seeds are used for seasoning)? What if you wanted to see whether, given a grocery bag full of star anise, cauliflower, and horseradish, you could create a typical Indian dish?
[![Crazy mystery baskets](https://img.youtube.com/vi/GuTeDbaNoEU/0.jpg)](https://youtu.be/GuTeDbaNoEU "Crazy mystery baskets")
> 🎥 Click the image above for a video. The whole premise of the show 'Chopped' is the 'mystery basket' where chefs have to make some dish out of a random choice of ingredients. Surely a ML model would have helped!
## Hello 'classifier'
The question we want to ask of this cuisine dataset is actually a **multiclass question**, as we have several potential national cuisines to work with. Given a batch of ingredients, which of these many classes will the data fit?
Scikit-learn offers several different algorithms to use to classify data, depending on the kind of problem you want to solve. In the next two lessons, you'll learn about several of these algorithms.
## Exercise - clean and balance your data
The first hands-on task before starting this project is to clean and **balance** your data to get better results. Start with the blank _notebook.ipynb_ file in the root of this folder.
The first thing to install is [imblearn](https://imbalanced-learn.org/stable/). This is a Scikit-learn package that will allow you to better balance the data (you'll learn more about this task in a minute).
1. To install `imblearn`, run `pip install`:
```python
pip install imblearn
```
1. Import the packages you need to import your data and visualize it; also import `SMOTE` from `imblearn`.
```python
import pandas as pd
import matplotlib.pyplot as plt
import matplotlib as mpl
import numpy as np
from imblearn.over_sampling import SMOTE
```
Now you are set up to import the data.
1. The next task will be to import the data:
```python
df = pd.read_csv('../data/cuisines.csv')
```
Using `read_csv()` will read the content of the csv file _cuisines.csv_ and place it in the variable `df`.
1. Check that the data's shape is right:
```python
df.head()
```
The first five rows of output look like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 65 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 66 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 67 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 68 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 69 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Get info about this data by calling `info()`:
```python
df.info()
```
Your output resembles:
```output
<class 'pandas.core.frame.DataFrame'>
RangeIndex: 2448 entries, 0 to 2447
Columns: 385 entries, Unnamed: 0 to zucchini
dtypes: int64(384), object(1)
memory usage: 7.2+ MB
```
## Exercise - learning about cuisines
Now the work starts to become more interesting. Let's discover the distribution of data, per cuisine.
1. Plot the data as bars by calling `barh()`:
```python
df.cuisine.value_counts().plot.barh()
```
![cuisine data distribution](../images/cuisine-dist.png)
There are a finite number of cuisines, but the distribution of data is uneven. You can fix that! Before doing so, explore a little more.
1. Find out how much data is available per cuisine and print it out:
```python
thai_df = df[(df.cuisine == "thai")]
japanese_df = df[(df.cuisine == "japanese")]
chinese_df = df[(df.cuisine == "chinese")]
indian_df = df[(df.cuisine == "indian")]
korean_df = df[(df.cuisine == "korean")]
print(f'thai df: {thai_df.shape}')
print(f'japanese df: {japanese_df.shape}')
print(f'chinese df: {chinese_df.shape}')
print(f'indian df: {indian_df.shape}')
print(f'korean df: {korean_df.shape}')
```
the output looks like so:
```output
thai df: (289, 385)
japanese df: (320, 385)
chinese df: (442, 385)
indian df: (598, 385)
korean df: (799, 385)
```
## Discovering ingredients
Now you can dig deeper into the data and learn what the typical ingredients per cuisine are. You should clean out recurrent data that creates confusion between cuisines, so let's learn about this problem.
1. Create a function `create_ingredient_df()` in Python to create an ingredient dataframe. This function will start by dropping an unhelpful column and sorting ingredients by their count:
```python
def create_ingredient_df(df):
    ingredient_df = df.T.drop(['cuisine','Unnamed: 0']).sum(axis=1).to_frame('value')
    ingredient_df = ingredient_df[(ingredient_df.T != 0).any()]
    ingredient_df = ingredient_df.sort_values(by='value', ascending=False,
    inplace=False)
    return ingredient_df
```
Now you can use that function to get an idea of the top ten most popular ingredients by cuisine.
1. Call `create_ingredient_df()` and plot it by calling `barh()`:
```python
thai_ingredient_df = create_ingredient_df(thai_df)
thai_ingredient_df.head(10).plot.barh()
```
![thai](../images/thai.png)
1. Do the same for the japanese data:
```python
japanese_ingredient_df = create_ingredient_df(japanese_df)
japanese_ingredient_df.head(10).plot.barh()
```
![japanese](../images/japanese.png)
1. Now for the chinese ingredients:
```python
chinese_ingredient_df = create_ingredient_df(chinese_df)
chinese_ingredient_df.head(10).plot.barh()
```
![chinese](../images/chinese.png)
1. Plot the indian ingredients:
```python
indian_ingredient_df = create_ingredient_df(indian_df)
indian_ingredient_df.head(10).plot.barh()
```
![indian](../images/indian.png)
1. Finally, plot the korean ingredients:
```python
korean_ingredient_df = create_ingredient_df(korean_df)
korean_ingredient_df.head(10).plot.barh()
```
![korean](../images/korean.png)
1. Now, drop the most common ingredients that create confusion between distinct cuisines, by calling `drop()`:
Everyone loves rice, garlic and ginger!
```python
feature_df= df.drop(['cuisine','Unnamed: 0','rice','garlic','ginger'], axis=1)
labels_df = df.cuisine #.unique()
feature_df.head()
```
## Balance the dataset
Now that you have cleaned the dataset, use [SMOTE](https://imbalanced-learn.org/dev/references/generated/imblearn.over_sampling.SMOTE.html) - "Synthetic Minority Over-sampling Technique" - to balance it.
1. Call `fit_resample()`; this strategy generates new samples by interpolation.
```python
oversample = SMOTE()
transformed_feature_df, transformed_label_df = oversample.fit_resample(feature_df, labels_df)
```
By balancing your data, you'll have better results when classifying it. Think about a binary classification: if most of your data is one class, a ML model is going to predict that class more frequently, just because there is more data for it. Balancing the data takes any skewed data and helps remove this imbalance.
1. Now you can check the numbers of labels per ingredient:
```python
print(f'new label count: {transformed_label_df.value_counts()}')
print(f'old label count: {df.cuisine.value_counts()}')
```
the output looks like so:
```output
new label count: korean 799
chinese 799
indian 799
japanese 799
thai 799
Name: cuisine, dtype: int64
old label count: korean 799
indian 598
chinese 442
japanese 320
thai 289
Name: cuisine, dtype: int64
```
Now the dataset is not only clean and balanced, but also very delicious!
1. The last step is to save your balanced data, including labels and features, into a new dataframe that can be exported into a file:
```python
transformed_df = pd.concat([transformed_label_df,transformed_feature_df],axis=1, join='outer')
```
1. You can take one more look at the data by calling `transformed_df.head()` and `transformed_df.info()`. Save a copy of this data for use in future lessons:
```python
transformed_df.head()
transformed_df.info()
transformed_df.to_csv("../data/cleaned_cuisines.csv")
```
This fresh CSV can now be found in the root data folder.
---
## 🚀 Challenge
This curriculum contains several interesting datasets. Dig through the `data` folders and see if any contain datasets that would be appropriate for binary or multi-class classification. What questions would you ask of these datasets?
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/20/)
## Review & Self Study
Explore SMOTE's API documentation. What use cases is it best suited for? What problems does it solve?
## Assignment
[Explore classification methods](../assignment.md)

@ -0,0 +1,11 @@
# Explore classification methods
## Instructions
In the [Scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html) you will find a large list of ways to classify data. Do a little scavenger hunt in these docs: your goal is to look for classification methods and match them to a dataset in this curriculum, a question you can ask of it, and a technique of classification. Create a spreadsheet or table in a .doc file and explain how the dataset would work with the classification algorithm.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| | a document is presented overviewing 5 algorithms alongside a classification technique. The overview is well-explained and detailed. | a document is presented overviewing 3 algorithms alongside a classification technique. The overview is well-explained and detailed. | a document is presented overviewing fewer than three algorithms alongside a classification technique and the overview is neither well-explained nor detailed. |

@ -0,0 +1,11 @@
# Explore classification methods
## Instructions
In the [Scikit-learn documentation](https://scikit-learn.org/stable/supervised_learning.html) you will find a large list of ways to classify data. Do a little scavenger hunt in these docs: your goal is to look for classification methods and match them to a dataset in this curriculum, a question you can ask of it, and a technique of classification. Create a spreadsheet or table in a .doc file and explain how the dataset would work with the classification algorithm.
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ----------------------------------------------------------------------------------------------------------------------------------- | ----------------------------------------------------------------------------------------------------------------------------------- | ------------------------------------------------------------------------------------------------------------------------------------------------------------- |
| | a document is presented overviewing 5 algorithms alongside a classification technique. The overview is well-explained and detailed. | a document is presented overviewing 3 algorithms alongside a classification technique. The overview is well-explained and detailed. | a document is presented overviewing fewer than three algorithms alongside a classification technique and the overview is neither well-explained nor detailed. |

@ -15,21 +15,20 @@ Assuming you completed [Lesson 1](../1-Introduction/README.md), make sure that a
```python
import pandas as pd
cuisines_df = pd.read_csv("../../data/cleaned_cuisine.csv")
cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
cuisines_df.head()
```
The data looks like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1. Now, import several more libraries:
@ -68,13 +67,13 @@ Assuming you completed [Lesson 1](../1-Introduction/README.md), make sure that a
Your features look like this:
| almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini | |
| -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: | --- |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: | -----: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Now you are ready to train your model!
@ -200,13 +199,13 @@ Since you are using the multiclass case, you need to choose what _scheme_ to use
The result is printed - Indian cuisine is its best guess, with good probability:
| | 0 | | | | | | | | | | | | | | | | | | | | |
| -------: | -------: | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| indian | 0.715851 | | | | | | | | | | | | | | | | | | | | |
| chinese | 0.229475 | | | | | | | | | | | | | | | | | | | | |
| japanese | 0.029763 | | | | | | | | | | | | | | | | | | | | |
| korean | 0.017277 | | | | | | | | | | | | | | | | | | | | |
| thai | 0.007634 | | | | | | | | | | | | | | | | | | | | |
| | 0 |
| -------: | -------: |
| indian | 0.715851 |
| chinese | 0.229475 |
| japanese | 0.029763 |
| korean | 0.017277 |
| thai | 0.007634 |
✅ Can you explain why the model is pretty sure this is an Indian cuisine?
@ -217,22 +216,23 @@ Since you are using the multiclass case, you need to choose what _scheme_ to use
print(classification_report(y_test,y_pred))
```
| precision | recall | f1-score | support | | | | | | | | | | | | | | | | | | |
| ------------ | ------ | -------- | ------- | ---- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- | --- |
| chinese | 0.73 | 0.71 | 0.72 | 229 | | | | | | | | | | | | | | | | | |
| indian | 0.91 | 0.93 | 0.92 | 254 | | | | | | | | | | | | | | | | | |
| japanese | 0.70 | 0.75 | 0.72 | 220 | | | | | | | | | | | | | | | | | |
| korean | 0.86 | 0.76 | 0.81 | 242 | | | | | | | | | | | | | | | | | |
| thai | 0.79 | 0.85 | 0.82 | 254 | | | | | | | | | | | | | | | | | |
| accuracy | 0.80 | 1199 | | | | | | | | | | | | | | | | | | | |
| macro avg | 0.80 | 0.80 | 0.80 | 1199 | | | | | | | | | | | | | | | | | |
| weighted avg | 0.80 | 0.80 | 0.80 | 1199 | | | | | | | | | | | | | | | | | |
| | precision | recall | f1-score | support |
| ------------ | ------ | -------- | ------- | ---- |
| chinese | 0.73 | 0.71 | 0.72 | 229 |
| indian | 0.91 | 0.93 | 0.92 | 254 |
| japanese | 0.70 | 0.75 | 0.72 | 220 |
| korean | 0.86 | 0.76 | 0.81 | 242 |
| thai | 0.79 | 0.85 | 0.82 | 254 |
| accuracy | 0.80 | 1199 | | |
| macro avg | 0.80 | 0.80 | 0.80 | 1199 |
| weighted avg | 0.80 | 0.80 | 0.80 | 1199 |
## 🚀Challenge
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/22/)
## Review & Self Study
Dig a little more into the math behind logistic regression in [this lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf)

@ -47,7 +47,7 @@
],
"source": [
"import pandas as pd\n",
"cuisines_df = pd.read_csv(\"../../data/cleaned_cuisine.csv\")\n",
"cuisines_df = pd.read_csv(\"../../data/cleaned_cuisines.csv\")\n",
"cuisines_df.head()"
]
},

@ -0,0 +1,241 @@
# Cuisine classifiers 1
In this lesson, you will use the dataset you saved from the last lesson, full of balanced, clean data all about cuisines.
You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/)
# Preparation
Assuming you completed [Lesson 1](../1-Introduction/README.md), make sure that a _cleaned_cuisines.csv_ file exists in the root `/data` folder for these four lessons.
## Exercise - predict a national cuisine
1. Working in this lesson's _notebook.ipynb_ folder, import that file along with the Pandas library:
```python
import pandas as pd
cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
cuisines_df.head()
```
The data looks like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Now, import several more libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
1. Divide the X and y coordinates into two dataframes for training. `cuisine` can be the labels dataframe:
```python
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()
```
It will look like this:
```output
0 indian
1 indian
2 indian
3 indian
4 indian
Name: cuisine, dtype: object
```
1. Drop the `Unnamed: 0` column and the `cuisine` column, calling `drop()`. Save the rest of the data as trainable features:
```python
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()
```
Your features look like this:
| | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | ---: | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | ---: | ----: | -----: | -------: | -----: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 | 0 |
Now you are ready to train your model!
## Choosing your classifier
Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.
Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods all include classification techniques:
- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (voting classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
> You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this lesson.
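One practical upside of this variety is that scikit-learn estimators share a common `fit`/`predict`/`score` interface, so swapping classifiers is cheap. Below is a minimal, hedged sketch (my own illustration, not part of the lesson's notebook, using `make_classification` as a synthetic stand-in for the cuisine data) that exercises a few of the families listed above:
```python
from sklearn.datasets import make_classification
from sklearn.model_selection import train_test_split
from sklearn.linear_model import LogisticRegression
from sklearn.svm import SVC
from sklearn.neighbors import KNeighborsClassifier
from sklearn.tree import DecisionTreeClassifier

# Synthetic stand-in data: 4 classes, 20 features
X, y = make_classification(n_samples=500, n_classes=4, n_informative=8, random_state=0)
X_tr, X_te, y_tr, y_te = train_test_split(X, y, random_state=0)

for clf in [LogisticRegression(max_iter=1000), SVC(), KNeighborsClassifier(), DecisionTreeClassifier()]:
    clf.fit(X_tr, y_tr)  # the same API works for every estimator
    print(type(clf).__name__, clf.score(X_te, y_te))
```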
### What classifier to go with?
So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:
![comparison of classifiers](../images/comparison.png)
> Plots generated from Scikit-learn's documentation
> AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa)
### A better approach
A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat Sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for this multiclass problem, we have some choices:
![cheatsheet for multiclass problems](../images/cheatsheet.png)
> A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options
✅ Download this cheat sheet, print it out, and hang it on your wall!
### Reasoning
Let's see if we can reason our way through different approaches given the constraints we have:
- **Neural networks are too heavy**. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data.
- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
### Using Scikit-learn
We will use Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
Essentially there are two important parameters, `multi_class` and `solver`, that we need to specify when we ask Scikit-learn to perform a logistic regression. The `multi_class` value applies a certain behavior, while the `solver` value sets which algorithm to use. Not all solvers can be paired with all `multi_class` values.
According to the docs, in the multiclass case, the training algorithm:
- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`
- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the 'lbfgs', 'sag', 'saga' and 'newton-cg' solvers.)
> 🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
> 🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression)
Scikit-learn offers this table to explain how solvers handle the different challenges presented by different kinds of data structures:
![solvers](../images/solvers.png)
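If you would rather verify those pairings than memorize the table, a small probe like the hedged sketch below (my own illustration, not part of the lesson's notebook; it uses the iris dataset as a stand-in) makes the constraint visible by attempting each combination and catching the error scikit-learn raises for invalid ones:
```python
from sklearn.datasets import load_iris
from sklearn.linear_model import LogisticRegression

X, y = load_iris(return_X_y=True)  # any small multiclass dataset will do

for solver in ['liblinear', 'lbfgs', 'newton-cg', 'sag', 'saga']:
    for scheme in ['ovr', 'multinomial']:
        try:
            LogisticRegression(multi_class=scheme, solver=solver, max_iter=5000).fit(X, y)
            print(f"{solver} + {scheme}: ok")
        except ValueError:
            # e.g. liblinear refuses the multinomial scheme
            print(f"{solver} + {scheme}: not supported")
```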
## Exercise - split the data
We can focus on logistic regression for our first training trial since you recently learned about it in a previous lesson.
Split your data into training and testing groups by calling `train_test_split()`:
```python
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
## Exercise - apply logistic regression
Since you are using the multiclass case, you need to choose what _scheme_ to use and what _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.
1. Create a logistic regression with multi_class set to `ovr` and the solver set to `liblinear`:
```python
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))
accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))
```
✅ Try a different solver like `lbfgs`, which is often set as default
> Note, use Pandas' [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.
The accuracy is good at over **80%**!
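For instance, following the tip above, an `lbfgs` run might look like this hedged sketch (the raised `max_iter` is my own assumption, since `lbfgs` can need more iterations to converge on data like this):
```python
# Hypothetical variant of the cell above: same pipeline, lbfgs solver.
lr_lbfgs = LogisticRegression(multi_class='ovr', solver='lbfgs', max_iter=1000)
model_lbfgs = lr_lbfgs.fit(X_train, np.ravel(y_train))
print("Accuracy (lbfgs) is {}".format(model_lbfgs.score(X_test, y_test)))
```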
1. You can see this model in action by testing one row of data (#50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The result is printed:
```output
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number and check the results
1. Digging deeper, you can check the accuracy of this prediction:
```python
test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
```
The result is printed - Indian cuisine is its best guess, with good probability:
|          |        0 |
| -------: | -------: |
|   indian | 0.715851 |
|  chinese | 0.229475 |
| japanese | 0.029763 |
|   korean | 0.017277 |
|     thai | 0.007634 |
✅ Can you explain why the model is pretty sure this is an Indian cuisine?
1. Get more detail by printing a classification report, as you did in the regression lessons:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
```
|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| chinese      | 0.73      | 0.71   | 0.72     | 229     |
| indian       | 0.91      | 0.93   | 0.92     | 254     |
| japanese     | 0.70      | 0.75   | 0.72     | 220     |
| korean       | 0.86      | 0.76   | 0.81     | 242     |
| thai         | 0.79      | 0.85   | 0.82     | 254     |
| accuracy     | 0.80      | 1199   |          |         |
| macro avg    | 0.80      | 0.80   | 0.80     | 1199    |
| weighted avg | 0.80      | 0.80   | 0.80     | 1199    |
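> 🎓 For reference (standard definitions, not printed by the lesson): per class, precision = TP / (TP + FP) and recall = TP / (TP + FN), where TP/FP/FN are true positives, false positives, and false negatives; f1-score is their harmonic mean, $F_1 = 2\cdot\frac{\text{precision}\cdot\text{recall}}{\text{precision}+\text{recall}}$; support is simply the number of true instances of each class.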
## 🚀 Challenge
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a series of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/22/)
## Review & Self Study
Dig a little more into the math behind logistic regression in [this lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf)
## Assignment
[Study the solvers](assignment.it.md)

@ -0,0 +1,241 @@
# Cuisine classifiers 1
In this lesson, you will use the dataset you saved from the last lesson, full of balanced, clean data all about cuisines.
You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/?loc=tr)
# Preparation
Assuming you completed [Lesson 1](../../1-Introduction/README.md), make sure that a _cleaned_cuisines.csv_ file exists in the root `/data` folder for these four lessons.
## Exercise - predict a national cuisine
1. Working in this lesson's _notebook.ipynb_ file, import that file along with the Pandas library:
```python
import pandas as pd
cuisines_df = pd.read_csv("../data/cleaned_cuisines.csv")
cuisines_df.head()
```
The data looks like this:
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
1. Now import several more libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
1. Divide the X and y coordinates into two dataframes for training. `cuisine` can be the labels dataframe:
```python
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()
```
It will look like this:
```output
0 indian
1 indian
2 indian
3 indian
4 indian
Name: cuisine, dtype: object
```
1. Drop the `Unnamed: 0` and `cuisine` columns by calling `drop()`. Save the rest of the data as trainable features:
```python
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()
```
Your features look like this:
|   | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| -: | -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | --- | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | --: | ----: | -----: | -------: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Now you are ready to train your model!
## Choosing your classifier
Now that your data is clean and ready for training, you have to decide which algorithm to use for the job.
Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods all include classification techniques:
- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (voting classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
> You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this lesson.
### What classifier to go with?
So, which classifier should you choose? Often, running through several and looking for a good result is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:
![comparison of classifiers](../images/comparison.png)
> Plots generated from Scikit-learn's documentation
> AutoML solves this problem neatly by running these comparisons in the cloud, allowing you to choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa).
### A better approach
A better way than wildly guessing, however, is to follow the ideas on this downloadable [ML Cheat Sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). Here, we discover that, for our multiclass problem, we have some choices:
![cheatsheet for multiclass problems](../images/cheatsheet.png)
> A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options
✅ Download this cheat sheet, print it out, and hang it on your wall!
### Reasoning
Let's see if we can reason our way through different approaches given the constraints we have:
- **Neural networks are too heavy**. Given our clean but minimal dataset, and the fact that we are running training locally via notebooks, neural networks are too heavyweight for this task.
- **No two-class classifier**. We do not use a two-class classifier, so that rules out one-vs-all.
- **Decision tree or logistic regression could work**. A decision tree might work, or logistic regression for multiclass data.
- **Multiclass Boosted Decision Trees solve a different problem**. The multiclass boosted decision tree is most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so it is not useful for us.
### Using Scikit-learn
We will use Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
Essentially there are two important parameters, `multi_class` and `solver`, that we need to specify when we ask Scikit-learn to perform a logistic regression. The `multi_class` value applies a certain behavior, while the `solver` value sets which algorithm to use. Not all solvers can be paired with all `multi_class` values.
According to the docs, in the multiclass case, the training algorithm:
- **Uses the one-vs-rest (OvR) scheme**, if the `multi_class` option is set to `ovr`
- **Uses the cross-entropy loss**, if the `multi_class` option is set to `multinomial`. (Currently the `multinomial` option is supported only by the lbfgs, sag, saga and newton-cg solvers.)
> 🎓 The 'scheme' here can either be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression is really designed to support binary classification, these schemes allow it to better handle multiclass classification tasks. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
> 🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression)
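> 🎓 To make the `multinomial` option concrete (this is standard background, not from the lesson itself): with $K$ classes and predicted softmax probabilities $\hat{p}_{ik}$, the cross-entropy loss being minimized is $\mathcal{L} = -\sum_{i=1}^{n}\sum_{k=1}^{K} y_{ik}\log\hat{p}_{ik}$, where $y_{ik}$ is 1 if sample $i$ belongs to class $k$ and 0 otherwise.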
Scikit-learn offers this table to explain how solvers handle the different challenges presented by different kinds of data structures:
![çözücüler](../images/solvers.png)
## Exercise - split the data
We can focus on logistic regression for our first training trial since you recently learned about it in a previous lesson.
Split your data into training and testing groups by calling `train_test_split()`:
```python
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
## Exercise - apply logistic regression
Since you are using the multiclass case, you need to choose what _scheme_ to use and what _solver_ to set. Use LogisticRegression with a multiclass setting and the **liblinear** solver to train.
1. Create a logistic regression with multi_class set to `ovr` and the solver set to `liblinear`:
```python
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))
accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))
```
✅ Try a different solver like `lbfgs`, which is often set as default.
> Note, use Pandas' [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) function to flatten your data when needed.
The accuracy is good at over **80%**!
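As a hedged aside (my addition, not part of the lesson's notebook): `cross_val_score`, which was imported earlier but never used, gives a more robust estimate than the single train/test split above. A minimal sketch:
```python
# 5-fold cross-validation on the full feature/label frames.
scores = cross_val_score(
    LogisticRegression(multi_class='ovr', solver='liblinear'),
    cuisines_feature_df, np.ravel(cuisines_label_df), cv=5)
print("CV accuracy: {:.3f} (+/- {:.3f})".format(scores.mean(), scores.std()))
```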
1. You can see this model in action by testing one row of data (#50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The result is printed:
```output
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row number and check the results
1. Digging deeper, you can check the accuracy of this prediction:
```python
test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
```
The result is printed - Indian cuisine is its best guess, with good probability:
| | 0 |
| -------: | -------: |
| indian | 0.715851 |
| chinese | 0.229475 |
| japanese | 0.029763 |
| korean | 0.017277 |
| thai | 0.007634 |
✅ Can you explain why the model is pretty sure this is an Indian cuisine?
1. Get more detail by printing a classification report, as you did in the regression lessons:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
```
| | precision | recall | f1-score | support |
| ------------ | ------ | -------- | ------- | ---- |
| chinese | 0.73 | 0.71 | 0.72 | 229 |
| indian | 0.91 | 0.93 | 0.92 | 254 |
| japanese | 0.70 | 0.75 | 0.72 | 220 |
| korean | 0.86 | 0.76 | 0.81 | 242 |
| thai | 0.79 | 0.85 | 0.82 | 254 |
| accuracy | 0.80 | 1199 | | |
| macro avg | 0.80 | 0.80 | 0.80 | 1199 |
| weighted avg | 0.80 | 0.80 | 0.80 | 1199 |
## 🚀 Challenge
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine based on a group of ingredients. Take some time to read through the many options Scikit-learn provides to classify data. Dig deeper into the concept of 'solver' to understand what goes on behind the scenes.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/22/?loc=tr)
## Review & Self Study
Dig a little more into the math behind logistic regression in [this lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf)
## Assignment
[Study the solvers](assignment.tr.md)

@ -0,0 +1,242 @@
# Cuisine classifiers 1
In this lesson, you will use the dataset you saved from the last lesson, full of balanced, clean data all about cuisines.
You will use this dataset with a variety of classifiers to _predict a given national cuisine based on a group of ingredients_. While doing so, you'll learn more about some of the ways that algorithms can be leveraged for classification tasks.
## [Pre-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/21/)
# Preparation
Assuming you completed [Lesson 1](../../1-Introduction/translations/README.zh-cn.md), make sure that a _cleaned_cuisines.csv_ file exists in the root `/data` folder for these four lessons.
## Exercise - predict a national cuisine
1. In this lesson's _notebook.ipynb_ file, import Pandas and read in the data file:
```python
import pandas as pd
   cuisines_df = pd.read_csv("../../data/cleaned_cuisines.csv")
cuisines_df.head()
```
The data looks like this:
```output
| | Unnamed: 0 | cuisine | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| --- | ---------- | ------- | ------ | -------- | ----- | ---------- | ----- | ------------ | ------- | -------- | --- | ------- | ----------- | ---------- | ----------------------- | ---- | ---- | --- | ----- | ------ | -------- |
| 0 | 0 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | indian | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 2 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 3 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 4 | indian | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
```
1. Now import several more libraries:
```python
from sklearn.linear_model import LogisticRegression
from sklearn.model_selection import train_test_split, cross_val_score
from sklearn.metrics import accuracy_score,precision_score,confusion_matrix,classification_report, precision_recall_curve
from sklearn.svm import SVC
import numpy as np
```
1. Next, divide the data into the two dataframes needed for training: X (the features) and y (the labels). Start by saving the `cuisine` column on its own as the labels dataframe:
```python
cuisines_label_df = cuisines_df['cuisine']
cuisines_label_df.head()
```
The output looks like this:
```output
0 indian
1 indian
2 indian
3 indian
4 indian
Name: cuisine, dtype: object
```
1. Call `drop()` to remove the `Unnamed: 0` and `cuisine` columns, and save the remaining data as trainable features:
```python
cuisines_feature_df = cuisines_df.drop(['Unnamed: 0', 'cuisine'], axis=1)
cuisines_feature_df.head()
```
Your feature set looks like this:
|   | almond | angelica | anise | anise_seed | apple | apple_brandy | apricot | armagnac | artemisia | artichoke | ... | whiskey | white_bread | white_wine | whole_grain_wheat_flour | wine | wood | yam | yeast | yogurt | zucchini |
| -: | -----: | -------: | ----: | ---------: | ----: | -----------: | ------: | -------: | --------: | --------: | --- | ------: | ----------: | ---------: | ----------------------: | ---: | ---: | --: | ----: | -----: | -------: |
| 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 1 | 1 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 2 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 3 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 |
| 4 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | ... | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 0 | 1 | 0 |
Now you are ready to train your model!
## Choosing your classifier
Your data is clean and ready for training; now you have to decide which algorithm to use for the job.
Scikit-learn groups classification under Supervised Learning, and in that category you will find many ways to classify. [The variety](https://scikit-learn.org/stable/supervised_learning.html) is quite bewildering at first sight. The following methods can all be used for classification:
- Linear Models
- Support Vector Machines
- Stochastic Gradient Descent
- Nearest Neighbors
- Gaussian Processes
- Decision Trees
- Ensemble methods (voting classifier)
- Multiclass and multioutput algorithms (multiclass and multilabel classification, multiclass-multioutput classification)
> You can also use [neural networks to classify data](https://scikit-learn.org/stable/modules/neural_networks_supervised.html#classification), but that is outside the scope of this lesson.
### How to choose a classifier?
So, which classifier should you choose? Often, running several of them and comparing the results is a way to test. Scikit-learn offers a [side-by-side comparison](https://scikit-learn.org/stable/auto_examples/classification/plot_classifier_comparison.html) on a created dataset, comparing KNeighbors, SVC two ways, GaussianProcessClassifier, DecisionTreeClassifier, RandomForestClassifier, MLPClassifier, AdaBoostClassifier, GaussianNB and QuadraticDiscriminantAnalysis, showing the results visualized:
![comparison of classifiers](../images/comparison.png)
> Plots generated from Scikit-learn's documentation
> AutoML solves this problem neatly by running these comparisons in the cloud, helping you choose the best algorithm for your data. Try it [here](https://docs.microsoft.com/learn/modules/automate-model-selection-with-azure-automl/?WT.mc_id=academic-15963-cxa) to learn more.
### A better approach
A better way than wildly guessing is to follow the ideas in this downloadable [ML Cheat Sheet](https://docs.microsoft.com/azure/machine-learning/algorithm-cheat-sheet?WT.mc_id=academic-15963-cxa). It compares the algorithms and helps us choose more effectively. For the multiclass classification task in this course, it offers the following choices:
![cheatsheet for multiclass problems](../images/cheatsheet.png)
> A section of Microsoft's Algorithm Cheat Sheet, detailing multiclass classification options
✅ Download this cheat sheet, print it out, and hang it on your wall!
### Reasoning
Let's reason through the feasibility of the different approaches given the constraints we have:
- **Neural networks are too heavy**. Our data is clean but minimal, and we are running training locally via notebooks; neural networks are too heavyweight for this task.
- **No two-class classifier**. We cannot use a two-class classifier, so that rules out one-vs-all.
- **Decision tree or logistic regression could work**. A decision tree should work, and logistic regression can also handle multiclass data.
- **Multiclass Boosted Decision Trees solve a different problem**. They are most suitable for nonparametric tasks, e.g. tasks designed to build rankings, so they are not useful for us here.
### Using Scikit-learn
We will use Scikit-learn to analyze our data. However, there are many ways to use logistic regression in Scikit-learn. Take a look at the [parameters to pass](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
`multi_class` and `solver` are the two most important parameters to specify when we ask Scikit-learn to perform a logistic regression. `multi_class` selects the classification scheme, while `solver` selects the optimization algorithm. Note that not every solver can be paired with every `multi_class` value.
According to the docs, in the multiclass case, the training algorithm:
- **Uses the one-vs-rest (OvR) scheme**, if `multi_class` is set to `ovr`.
- **Uses the cross-entropy loss**, if `multi_class` is set to `multinomial`. (Note that currently `multinomial` is supported only by the lbfgs, sag, saga and newton-cg solvers.)
> 🎓 The 'scheme' here can be 'ovr' (one-vs-rest) or 'multinomial'. Since logistic regression was originally designed for binary classification, either scheme lets it handle multiclass classification tasks well. [source](https://machinelearningmastery.com/one-vs-rest-and-one-vs-one-for-multi-class-classification/)
> 🎓 The 'solver' is defined as "the algorithm to use in the optimization problem". [source](https://scikit-learn.org/stable/modules/generated/sklearn.linear_model.LogisticRegression.html?highlight=logistic%20regressio#sklearn.linear_model.LogisticRegression).
Scikit-learn offers this table to explain how the various solvers handle the different challenges posed by different kinds of data structures:
![solvers](../images/solvers.png)
## Exercise - split the data
Since you just learned about logistic regression in the previous lesson, we will use it here to walk through training your first machine learning model. Start by splitting your data into training and testing sets with `train_test_split()`:
```python
X_train, X_test, y_train, y_test = train_test_split(cuisines_feature_df, cuisines_label_df, test_size=0.3)
```
## Exercise - apply logistic regression
Next, decide which _scheme_ and which _solver_ to use for our multiclass case. Here we use LogisticRegression with a multiclass setting and the **liblinear** solver to train the model.
1. Create a logistic regression with multi_class set to `ovr` and solver set to `liblinear`:
```python
lr = LogisticRegression(multi_class='ovr',solver='liblinear')
model = lr.fit(X_train, np.ravel(y_train))
accuracy = model.score(X_test, y_test)
print ("Accuracy is {}".format(accuracy))
```
✅ Also try another solver such as `lbfgs`, which is the default
> Note, use Pandas' [`ravel`](https://pandas.pydata.org/pandas-docs/stable/reference/api/pandas.Series.ravel.html) method to flatten your data when needed
After running it, you can see the accuracy is over **80%**!
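As an illustrative, hedged sketch (my addition, not part of the original notebook), you could also compare the `ovr` scheme with the `multinomial` scheme on the same split; both pairings below are valid according to the docs, and the raised `max_iter` is an assumption for convergence:
```python
# Compare one-vs-rest (liblinear) with multinomial (lbfgs) on the same split.
for scheme, solver in [('ovr', 'liblinear'), ('multinomial', 'lbfgs')]:
    clf = LogisticRegression(multi_class=scheme, solver=solver, max_iter=1000)
    clf.fit(X_train, np.ravel(y_train))
    print("{} + {}: {:.3f}".format(scheme, solver, clf.score(X_test, y_test)))
```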
1. You can see the model in action by checking one row of data (e.g. row #50):
```python
print(f'ingredients: {X_test.iloc[50][X_test.iloc[50]!=0].keys()}')
print(f'cuisine: {y_test.iloc[50]}')
```
The output is as follows:
```output
ingredients: Index(['cilantro', 'onion', 'pea', 'potato', 'tomato', 'vegetable_oil'], dtype='object')
cuisine: indian
```
✅ Try a different row index and check the results
1. Digging a step deeper, you can check the accuracy of this prediction:
```python
test= X_test.iloc[50].values.reshape(-1, 1).T
proba = model.predict_proba(test)
classes = model.classes_
resultdf = pd.DataFrame(data=proba, columns=classes)
topPrediction = resultdf.T.sort_values(by=[0], ascending = [False])
topPrediction.head()
```
The output follows - Indian cuisine has the highest probability, which is a reasonable guess:
| | 0 |
| -------: | -------: |
| indian | 0.715851 |
| chinese | 0.229475 |
| japanese | 0.029763 |
| korean | 0.017277 |
| thai | 0.007634 |
✅ Can you explain why the model is so sure this is an Indian cuisine?
1. As in the earlier regression lessons, you can also get more detail about the model by printing a classification report:
```python
y_pred = model.predict(X_test)
print(classification_report(y_test,y_pred))
```
|              | precision | recall | f1-score | support |
| ------------ | --------- | ------ | -------- | ------- |
| chinese      | 0.73      | 0.71   | 0.72     | 229     |
| indian       | 0.91      | 0.93   | 0.92     | 254     |
| japanese     | 0.70      | 0.75   | 0.72     | 220     |
| korean       | 0.86      | 0.76   | 0.81     | 242     |
| thai         | 0.79      | 0.85   | 0.82     | 254     |
| accuracy     | 0.80      | 1199   |          |         |
| macro avg    | 0.80      | 0.80   | 0.80     | 1199    |
| weighted avg | 0.80      | 0.80   | 0.80     | 1199    |
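One more hedged extra (my addition, not in the lesson): `confusion_matrix`, which was imported at the top but never used, shows which cuisines get mistaken for which:
```python
# Rows are true labels, columns are predicted labels.
cm = confusion_matrix(y_test, y_pred, labels=model.classes_)
print(pd.DataFrame(cm, index=model.classes_, columns=model.classes_))
```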
## Challenge
In this lesson, you used your cleaned data to build a machine learning model that can predict a national cuisine from a set of ingredients. Take some time to read through the many options Scikit-learn provides to classify data, and dig deeper into the concept of the 'solver' to understand what happens behind the scenes.
## [Post-lecture quiz](https://jolly-sea-0a877260f.azurestaticapps.net/quiz/22/)
## Review & Self Study
[This lesson](https://people.eecs.berkeley.edu/~russell/classes/cs194/f11/lectures/CS194%20Fall%202011%20Lecture%2006.pdf) goes deeper into the math behind logistic regression
## Assignment
[Study the solvers](assignment.md)

@ -0,0 +1,10 @@
# Study the solvers
## Instructions
In this lesson you learned about the various solvers that pair algorithms with a machine learning process to create an accurate model. Walk through the solvers listed in the lesson and pick two. In your own words, compare and contrast these two solvers. What kind of problem do they address? How do they work with various data structures? Why would you pick one over the other?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------ | ---------------------------- |
|          | A .doc file is presented with two paragraphs, one on each solver, comparing them thoughtfully. | A .doc file is presented with only one paragraph | The assignment is incomplete |

@ -0,0 +1,9 @@
# Study the solvers
## Instructions
In this lesson you learned about the various solvers that pair algorithms with a machine learning process to create an accurate model. Walk through the solvers listed in the lesson and pick two. In your own words, write down the similarities and differences between these two solvers. What kinds of problems do they address? How do they work with various data structures? Why would you prefer one over the other?
## Rubric
| Criteria | Exemplary | Adequate | Needs Improvement |
| -------- | ---------------------------------------------------------------------------------------------- | ------------------------------------------------ | ---------------------------- |
|          | A .doc file is presented with two paragraphs, one on each solver, comparing them thoughtfully. | A .doc file is presented with only one paragraph | The assignment is incomplete |
