How to perform text clustering with spaCy
Text clustering with spaCy typically involves the following steps:
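- Load the text data with spaCy and preprocess it, including tokenization, part-of-speech tagging, and named entity recognition.
- Extract feature vectors from the text. Methods such as the bag-of-words model or TF-IDF convert text into numeric features.
- Cluster the texts with a clustering algorithm. Common choices include K-means, hierarchical clustering, and DBSCAN.
- Visualize the clustering results. A dimensionality reduction algorithm such as PCA or t-SNE can project the feature vectors into two or three dimensions, and a scatter plot can then show which texts fall into which cluster.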
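The clustering step names hierarchical clustering and DBSCAN as alternatives to K-means. The following is a minimal sketch of both on TF-IDF features, assuming scikit-learn is installed. The toy corpus and the `eps` and `min_samples` values are illustrative assumptions, not taken from the original text, and real data would need its own tuning.

```python
from sklearn.cluster import AgglomerativeClustering, DBSCAN
from sklearn.feature_extraction.text import TfidfVectorizer

# Hypothetical toy corpus: two sentences about pets, two about markets
texts = [
    "cats and dogs are popular pets",
    "dogs and cats are popular pets",
    "stock markets fell sharply today",
    "stock markets rose sharply today",
]

# Convert the texts to dense TF-IDF vectors (rows are L2-normalized by default)
features = TfidfVectorizer().fit_transform(texts).toarray()

# Hierarchical (agglomerative) clustering: the number of clusters is given up front
hier_labels = AgglomerativeClustering(n_clusters=2).fit_predict(features)

# DBSCAN: the number of clusters follows from density; label -1 would mark noise.
# eps and min_samples are illustrative values chosen for this tiny corpus.
db_labels = DBSCAN(eps=0.7, min_samples=2, metric="cosine").fit_predict(features)

print("hierarchical:", list(hier_labels))
print("DBSCAN:", list(db_labels))
```

Unlike K-means and agglomerative clustering, DBSCAN does not take a cluster count, which can help when the number of topics is unknown, at the cost of having to choose a distance threshold.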
The following example code demonstrates how to perform text clustering with spaCy:
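import spacy
from sklearn.feature_extraction.text import TfidfVectorizer
from sklearn.cluster import KMeans
from sklearn.decomposition import PCA
import matplotlib.pyplot as plt

# Load the small English pipeline (install it with: python -m spacy download en_core_web_sm)
nlp = spacy.load("en_core_web_sm")

# Load the text data
data = ["This is an example sentence.",
        "Another example sentence is here.",
        "I am writing a sample text for clustering.",
        "Text clustering is a useful technique."]

# Preprocess the text: keep lowercase lemmas of alphabetic tokens that are not stop words
processed_data = [
    " ".join(tok.lemma_.lower() for tok in doc if tok.is_alpha and not tok.is_stop)
    for doc in nlp.pipe(data)
]

# Extract TF-IDF feature vectors
tfidf_vectorizer = TfidfVectorizer()
tfidf_matrix = tfidf_vectorizer.fit_transform(processed_data)

# Cluster the texts with K-means (fixed seed for reproducible labels)
kmeans = KMeans(n_clusters=2, n_init=10, random_state=0)
clusters = kmeans.fit_predict(tfidf_matrix)

# Visualize the clusters: project the TF-IDF vectors to 2-D with PCA first,
# since the first two raw TF-IDF columns are just two arbitrary vocabulary terms
points = PCA(n_components=2).fit_transform(tfidf_matrix.toarray())
plt.scatter(points[:, 0], points[:, 1], c=clusters, cmap="viridis")
plt.xlabel("PCA component 1")
plt.ylabel("PCA component 2")
plt.show()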
In this example, we first load the spaCy model and some text data, then preprocess the text and extract features, and finally cluster the texts with the K-means algorithm and display the result as a scatter plot.