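As discussed earlier, the relevant cluster ID can replace the other features for all examples in that cluster. This substitution reduces the number of features, and therefore the resources needed to store, process, and train models on the data. For very large datasets, these savings become significant.
For example, a single YouTube video can have feature data that includes:
viewer location, time, and demographics
comment timestamps, text, and user IDs
video tags
Clustering YouTube videos replaces this set of features with a single cluster ID, which compresses the data.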
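The compression step can be sketched as follows. This is a minimal illustration, not a production pipeline: the video names, the three numeric features per video, and the tiny `kmeans` helper are all hypothetical, and any clustering algorithm could stand in for it.

```python
from math import dist

def kmeans(points, k, iterations=20):
    """Tiny k-means sketch: returns one cluster ID per input point."""
    # Deterministic initialization (an assumption for this sketch):
    # use the first k points as the starting centroids.
    centroids = [tuple(p) for p in points[:k]]
    labels = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: each point joins its nearest centroid.
        labels = [
            min(range(k), key=lambda c: dist(p, centroids[c]))
            for p in points
        ]
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, lab in zip(points, labels) if lab == c]
            if members:
                centroids[c] = tuple(sum(v) / len(members) for v in zip(*members))
    return labels

# Hypothetical, already-numeric features for four videos.
videos = {
    "video_a": (0.9, 0.1, 0.8),
    "video_c": (0.1, 0.9, 0.2),
    "video_b": (0.8, 0.2, 0.9),
    "video_d": (0.2, 0.8, 0.1),
}

names = list(videos)
labels = kmeans([videos[name] for name in names], k=2)

# The compressed representation keeps one integer per video
# instead of the full feature tuple.
compressed = dict(zip(names, labels))
```

Here each video goes from three stored values to one. With the much wider feature sets described above, the reduction would be correspondingly larger, at the cost of losing the detail that distinguishes videos within the same cluster.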
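Privacy preservation
You can preserve privacy to some degree by clustering users and associating user data with cluster IDs instead of user IDs. For example, say you want to train a model on YouTube users' watch history. Instead of passing user IDs to the model, you can cluster users and pass only the cluster ID. This keeps individual watch histories from being linked to individual users. Note that each cluster must contain a sufficiently large number of users for this to preserve privacy.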
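A minimal sketch of this idea, assuming users have already been assigned to clusters. The `anonymize` function, the user and video names, and the `min_cluster_size` threshold are all hypothetical; the right threshold depends on the dataset, and dropping small clusters is only one option (merging them into a larger cluster is another).

```python
from collections import Counter

def anonymize(records, user_to_cluster, min_cluster_size):
    """Replace user IDs with cluster IDs.

    Records from clusters with fewer than min_cluster_size users are
    dropped, since a very small cluster could still identify a user.
    """
    # Each user appears once in the mapping, so counting the values
    # gives the number of users per cluster.
    cluster_sizes = Counter(user_to_cluster.values())
    anonymized = []
    for user_id, video in records:
        cluster_id = user_to_cluster[user_id]
        if cluster_sizes[cluster_id] >= min_cluster_size:
            anonymized.append((cluster_id, video))
    return anonymized

# Hypothetical cluster assignments and watch-history records.
user_to_cluster = {"u1": 0, "u2": 0, "u3": 0, "u4": 1}
watch_history = [
    ("u1", "v_cooking"),
    ("u2", "v_baking"),
    ("u3", "v_cooking"),
    ("u4", "v_chess"),
]

training_rows = anonymize(watch_history, user_to_cluster, min_cluster_size=3)
```

In this sketch the record for `u4` is dropped because its cluster contains only one user, so passing that cluster ID would be equivalent to passing the user ID.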
Last updated 2025-02-25 UTC.