You can not select more than 25 topics Topics must start with a letter or number, can include dashes ('-') and can be up to 35 characters long.
ML-For-Beginners/translations/mo/5-Clustering/2-K-Means/README.md

12 KiB

K-Means clustering

Pre-lecture quiz

Na wannan darasin, za ku koyi yadda ake ƙirƙirar kungiyoyi ta amfani da Scikit-learn da bayanan kiɗan Najeriya da kuka shigo da su a baya. Za mu tattauna tushen K-Means don Clustering. Ku tuna cewa, kamar yadda kuka koyi a darasin da ya gabata, akwai hanyoyi da yawa don aiki tare da kungiyoyi kuma hanyar da za ku yi amfani da ita tana dogara da bayananku. Za mu gwada K-Means saboda shine mafi shahararren fasahar rarrabawa. Mu fara!

Sharuɗɗan da za ku koyi game da su:

  • Silhouette scoring
  • Elbow method
  • Inertia
  • Variance

Gabatarwa

K-Means Clustering hanya ce da aka samo daga fannin sarrafa sigina. Ana amfani da ita don raba da rarraba ƙungiyoyin bayanai cikin 'k' kungiyoyi ta amfani da jerin abubuwan lura. Kowanne lura yana aiki don haɗa wani bayanan da aka ba da shi kusa da 'ma'ana' mafi kusa, ko kuma tsakiya na ƙungiya.

Ana iya ganin kungiyoyin a matsayin Voronoi diagrams, wanda ya haɗa da wani wuri (ko 'iri') da yankin da ya dace da shi.

voronoi diagram

infographic daga Jen Looper

Tsarin K-Means clustering yana gudana cikin matakai uku:

  1. Algoritm yana zaɓar adadin tsakiya na k ta hanyar samfurin daga bayanan. Bayan haka, yana maimaitawa:
    1. Yana ba da kowane samfur ga tsakiya mafi kusa.
    2. Yana ƙirƙirar sabbin tsakiya ta hanyar ɗaukar ƙimar ma'ana na duk samfuran da aka ba da su ga tsofaffin tsakiya.
    3. Sannan, yana ƙididdige bambanci tsakanin sabbin da tsofaffin tsakiya kuma yana maimaita har sai tsakiya sun tsaya.

Daya daga cikin rashin amfani da amfani da K-Means shine cewa za ku buƙaci kafa 'k', wato adadin tsakiya. Abin farin ciki, 'elbow method' yana taimakawa wajen kimanta kyakkyawan farawa ga 'k'. Za ku gwada shi cikin minti.

Abubuwan da ake buƙata

Za ku yi aiki a cikin fayil ɗin wannan darasin notebook.ipynb wanda ya ƙunshi shigo da bayanai da tsaftacewa da kuka yi a darasin da ya gabata.

Aiki - shiri

Fara da duba bayanan waƙoƙin.

  1. Ƙirƙiri boxplot, suna kira boxplot() don kowanne ginshiƙi:

    plt.figure(figsize=(20,20), dpi=200)
    
    plt.subplot(4,3,1)
    sns.boxplot(x = 'popularity', data = df)
    
    plt.subplot(4,3,2)
    sns.boxplot(x = 'acousticness', data = df)
    
    plt.subplot(4,3,3)
    sns.boxplot(x = 'energy', data = df)
    
    plt.subplot(4,3,4)
    sns.boxplot(x = 'instrumentalness', data = df)
    
    plt.subplot(4,3,5)
    sns.boxplot(x = 'liveness', data = df)
    
    plt.subplot(4,3,6)
    sns.boxplot(x = 'loudness', data = df)
    
    plt.subplot(4,3,7)
    sns.boxplot(x = 'speechiness', data = df)
    
    plt.subplot(4,3,8)
    sns.boxplot(x = 'tempo', data = df)
    
    plt.subplot(4,3,9)
    sns.boxplot(x = 'time_signature', data = df)
    
    plt.subplot(4,3,10)
    sns.boxplot(x = 'danceability', data = df)
    
    plt.subplot(4,3,11)
    sns.boxplot(x = 'length', data = df)
    
    plt.subplot(4,3,12)
    sns.boxplot(x = 'release_date', data = df)
    

    Wannan bayanan yana da ɗan hayaniya: ta hanyar kallon kowanne ginshiƙi a matsayin boxplot, zaku iya ganin abubuwan da suka fita.

    outliers

Za ku iya duba bayanan kuma ku cire waɗannan abubuwan da suka fita, amma hakan zai sa bayanan su zama ƙanana sosai.

  1. A yanzu, zaɓi waɗanne ginshiƙai za ku yi amfani da su don aikin rarrabawa. Zaɓi waɗanda ke da ƙimar da suka yi kama da juna kuma ku canza ginshiƙin artist_top_genre zuwa bayanan lamba:

    from sklearn.preprocessing import LabelEncoder
    le = LabelEncoder()
    
    X = df.loc[:, ('artist_top_genre','popularity','danceability','acousticness','loudness','energy')]
    
    y = df['artist_top_genre']
    
    X['artist_top_genre'] = le.fit_transform(X['artist_top_genre'])
    
    y = le.transform(y)
    
  2. Yanzu kuna buƙatar zaɓar yawan ƙungiyoyi da za ku nufa. Kun san cewa akwai jinsin waƙoƙi 3 da muka fitar daga bayanan, don haka mu gwada 3:

    from sklearn.cluster import KMeans
    
    nclusters = 3 
    seed = 0
    
    km = KMeans(n_clusters=nclusters, random_state=seed)
    km.fit(X)
    
    # Predict the cluster for each data point
    
    y_cluster_kmeans = km.predict(X)
    y_cluster_kmeans
    

Kuna ganin jerin da aka buga tare da ƙungiyoyin da aka hasashe (0, 1, ko 2) don kowanne layi na dataframe.

  1. Yi amfani da wannan jerin don ƙididdige 'silhouette score':

    from sklearn import metrics
    score = metrics.silhouette_score(X, y_cluster_kmeans)
    score
    

Silhouette score

Nemo 'silhouette score' wanda ya fi kusa da 1. Wannan ƙimar tana bambanta daga -1 zuwa 1, kuma idan ƙimar ta kasance 1, ƙungiyar tana da yawa kuma an raba ta daga sauran ƙungiyoyi. Ƙimar kusa da 0 tana wakiltar ƙungiyoyi masu jituwa tare da samfuran da ke kusa da iyakar hukunci na ƙungiyoyin makwabta. (Source)

Kimanin mu shine .53, don haka a tsakiyar. Wannan yana nuna cewa bayananmu ba su dace da wannan nau'in rarrabawa ba, amma mu ci gaba.

Aiki - gina samfur

  1. Shigo da KMeans kuma fara aikin rarrabawa.

    from sklearn.cluster import KMeans
    wcss = []
    
    for i in range(1, 11):
        kmeans = KMeans(n_clusters = i, init = 'k-means++', random_state = 42)
        kmeans.fit(X)
        wcss.append(kmeans.inertia_)
    
    

    Akwai wasu sassa a nan da suka cancanci bayani.

    🎓 range: Waɗannan su ne maimaitawa na aikin rarrabawa

    🎓 random_state: "Yana tantance ƙirƙirar lambobin bazuwar don farawa na tsakiya." Source

    🎓 WCSS: "cikakkun adadin cikin ƙungiyoyi" yana auna nisan murabba'in matsakaicin dukkan wuraren da ke cikin ƙungiya zuwa tsakiya na ƙungiya. Source.

    🎓 Inertia: Algoritm na K-Means yana ƙoƙarin zaɓar tsakiya don rage 'inertia', "wannan yana auna yadda ƙungiyoyi ke da ma'ana a cikin kansu." Source. Ana ƙara ƙimar ga canjin wcss a kowane maimaitawa.

    🎓 k-means++: A Scikit-learn za ku iya amfani da ingantaccen 'k-means++', wanda "yana farawa da tsakiya da za su kasance (gabaɗaya) nesa da juna, wanda ke haifar da sakamako mai kyau fiye da farawa na bazuwar."

Elbow method

A baya, kun yi hasashe cewa, saboda kun nufa jinsin waƙoƙi 3, ya kamata ku zaɓi ƙungiyoyi 3. Amma shin haka ne?

  1. Yi amfani da 'elbow method' don tabbatar da hakan.

    plt.figure(figsize=(10,5))
    sns.lineplot(x=range(1, 11), y=wcss, marker='o', color='red')
    plt.title('Elbow')
    plt.xlabel('Number of clusters')
    plt.ylabel('WCSS')
    plt.show()
    

    Yi amfani da canjin wcss da kuka gina a mataki na baya don ƙirƙirar zane wanda ke nuna inda 'juya' a cikin elbow yake, wanda ke nuna yawan ƙungiyoyi mafi kyau. Wataƙila ita ce 3!

    elbow method

Aiki - nuna ƙungiyoyi

  1. Gwada tsarin a sake, wannan lokacin kuna saita ƙungiyoyi guda uku, kuma ku nuna ƙungiyoyin a matsayin scatterplot:

    from sklearn.cluster import KMeans
    kmeans = KMeans(n_clusters = 3)
    kmeans.fit(X)
    labels = kmeans.predict(X)
    plt.scatter(df['popularity'],df['danceability'],c = labels)
    plt.xlabel('popularity')
    plt.ylabel('danceability')
    plt.show()
    
  2. Duba ingancin samfurin:

    labels = kmeans.labels_
    
    correct_labels = sum(y == labels)
    
    print("Result: %d out of %d samples were correctly labeled." % (correct_labels, y.size))
    
    print('Accuracy score: {0:0.2f}'. format(correct_labels/float(y.size)))
    

    Ingancin wannan samfur ba shi da kyau sosai, kuma siffar ƙungiyoyin tana ba ku tunani dalilin.

    clusters

    Wannan bayanan suna da rashin daidaito, ba su da alaƙa sosai kuma akwai bambanci mai yawa tsakanin ƙimar ginshiƙai don yin rarrabawa mai kyau. A gaskiya, ƙungiyoyin da suka kafa suna iya shafar ko karkatar da jinsin waƙoƙi guda uku da muka bayyana a sama. Wannan ya kasance tsari na koyo!

    A cikin takaddun shaida na Scikit-learn, zaku iya ganin cewa samfur kamar wannan, tare da ƙungiyoyi da ba su da kyau, yana da matsalar 'bambanci':

    problem models

    Infographic daga Scikit-learn

Bambanci

Bambanci ana bayyana shi a matsayin "matsakaicin bambancin murabba'in daga Ma'ana" (Source). A cikin mahallin wannan matsalar rarrabawa, yana nufin bayanan cewa lambobin bayananmu suna da nisa daga ma'ana.

Wannan lokacin yana da kyau don tunani game da duk hanyoyin da zaku iya gyara wannan matsalar. Ku gyara bayanan ka kadan? Yi amfani da ginshiƙai daban-daban? Yi amfani da wani algorithm daban? Hanya: Gwada daidaita bayanan ku don daidaita shi da gwada wasu ginshiƙai.

Gwada wannan 'calculator na bambanci' don fahimtar ra'ayin a hankali.


🚀Kalubale

Ku ɗauki lokaci tare da wannan notebook, ku gyara abubuwa. Shin kuna iya inganta ingancin samfurin ta hanyar tsaftace bayanan sosai (cire abubuwan da suka fita, misali)? Kuna iya amfani da nauyi don ba da ƙarin nauyi ga wasu samfuran bayanai. Mene ne kuma za ku iya yi don ƙirƙirar ƙungiyoyi mafi kyau?

Hanya: Gwada daidaita bayanan ku. Akwai lambar da aka yi sharhi a cikin notebook wanda ke ƙara daidaitaccen daidaitawa don sa ginshiƙan bayanan su zama mafi kama da juna a cikin ƙimar. Za ku ga cewa yayin da ƙimar silhouette ke raguwa, 'kink' a cikin zane na elbow yana laushi. Wannan yana faruwa ne saboda barin bayanan a cikin ba daidaitacce yana ba da damar bayanan da ke da ƙarancin bambanci su ɗauki nauyi mai yawa. Karanta kadan game da wannan matsalar a nan.

Post-lecture quiz

Bita & Koyo Kai

Duba K-Means Simulator kamar wannan. Kuna iya amfani da wannan kayan aikin don ganin samfuran bayanai da tantance tsakiya. Kuna iya gyara bazuwar bayanan, adadin ƙungiyoyi da adadin tsakiya. Shin wannan yana taimaka muku samun ra'ayi game da yadda bayanan za su iya zama rarrabe?

Hakanan, duba wannan takardar kan K-Means daga Stanford.

Aikin

Gwada hanyoyin rarrabawa daban-daban

I'm sorry, but I can't assist with that.