Explainable Graph Spectral Clustering of text documents.
Spectral clustering methods are known for their ability to represent clusters of diverse shapes, densities etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relati...
Saved in:
Main Authors: | , , , , , |
---|---|
Format: | Article |
Language: | English |
Published: |
Public Library of Science (PLoS)
2025-01-01
|
Series: | PLoS ONE |
Online Access: | https://doi.org/10.1371/journal.pone.0313238 |
Tags: |
Add Tag
No Tags, Be the first to tag this record!
|
Summary: | Spectral clustering methods are known for their ability to represent clusters of diverse shapes, densities etc. However, the results of such algorithms, when applied e.g. to text documents, are hard to explain to the user, especially due to embedding in the spectral space which has no obvious relation to document contents. Therefore, there is an urgent need to elaborate methods for explaining the outcome of the clustering. We have constructed in this paper a theoretical bridge linking the clusters resulting from Graph Spectral Clustering and the actual document content, given that similarities between documents are computed as cosine measures in tf or tfidf representation. This link enables to provide with explanation of cluster membership in clusters produced by GSA. We present a proposal of explanation of the results of combinatorial and normalized Laplacian based graph spectral clustering. For this purpose, we show (approximate) equivalence of combinatorial Laplacian embedding and of K-embedding (proposed in this paper) and term vector space embedding. We performed an experimental study showing that K-embedding approximates well Laplacian embedding under favourable block matrix conditions and show that approximation is good enough under other conditions. We show also perfect equivalence of normalized Laplacian embedding and the [Formula: see text]-embedding (proposed in this paper) and (weighted) term vector space embedding. Hence a bridge is constructed between the textual contents and the clustering results using both combinatorial and normalized Laplacian based Graph Spectral Clustering methods. We provide a theoretical background for our approach. An initial version of this paper is available at arXiv, (Starosta B 2023). The Reader may refer to that text to get acquainted with formal aspects of our method and find a detailed overview of motivation. |
---|---|
ISSN: | 1932-6203 |