DWDM \ Text Clustering

Clustering is the process of grouping similar items together, so that items in the same group are more alike than items in different groups.
Text clustering is the process of grouping a set of unlabelled texts so that similar texts end up in the same group.

Text clustering steps
Step name: Details
Text pre-processing: It involves tokenization, transformation, normalization and filtering.
Feature extraction: It extracts features (words, tokens, documents, corpus) from the textual data.
Clustering: The extracted features are used to cluster the text documents.
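The feature extraction step can be illustrated with TF-IDF weighting, which is one common choice and not something these notes prescribe. This is a minimal sketch: the function name tfidf_vectors, the sample documents and the plain log(N / df) weighting are all assumptions made for illustration.

```python
import math
from collections import Counter

def tfidf_vectors(docs):
    """Return (vocabulary, vectors) for a list of non-empty tokenized documents."""
    vocab = sorted({tok for doc in docs for tok in doc})
    n_docs = len(docs)
    # Document frequency: in how many documents each token appears.
    df = Counter(tok for doc in docs for tok in set(doc))
    vectors = []
    for doc in docs:
        counts = Counter(doc)
        # Term frequency times a plain logarithmic inverse document frequency.
        vectors.append([
            (counts[tok] / len(doc)) * math.log(n_docs / df[tok])
            for tok in vocab
        ])
    return vocab, vectors

# Hypothetical, already tokenized documents.
docs = [
    ["cats", "chase", "mice"],
    ["dogs", "chase", "cats"],
    ["stocks", "fell", "today"],
]
vocab, vectors = tfidf_vectors(docs)
```

Each document becomes one numeric vector with one entry per vocabulary word, which is the form most clustering algorithms expect. Words that occur in fewer documents receive a higher weight.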


Keyword: Definition
Tokenization: It parses text data into smaller parts (tokens = words or phrases).
Transformation: It converts the text to lowercase.
Normalization: It transforms a text into a canonical (root) form. Techniques for deriving the root word are stemming and lemmatization.
Filtering: Words that carry little meaning on their own (stop words) are removed from the texts before clustering.
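The four pre-processing operations above can be sketched in a few lines of Python. This is only an illustration: the stop-word list is a tiny hand-picked sample, and crude_stem is a hypothetical stand-in for a real stemmer or lemmatizer.

```python
import re

# Tiny illustrative stop-word list (an assumption); real lists are much longer.
STOP_WORDS = {"a", "an", "the", "is", "are", "of", "to", "and", "in", "on"}

def crude_stem(token):
    """Strip a few common English suffixes (a stand-in for real stemming)."""
    for suffix in ("ing", "ed", "es", "s"):
        # Keep at least three characters so short words are left alone.
        if token.endswith(suffix) and len(token) - len(suffix) >= 3:
            return token[: -len(suffix)]
    return token

def preprocess(text):
    tokens = re.findall(r"[A-Za-z]+", text)            # tokenization
    tokens = [t.lower() for t in tokens]               # transformation
    tokens = [crude_stem(t) for t in tokens]           # normalization
    return [t for t in tokens if t not in STOP_WORDS]  # filtering

print(preprocess("The cats are chasing the mice"))
# → ['cat', 'chas', 'mice']
```

The output shows a typical side effect of stemming: "chasing" becomes "chas", which is not a dictionary word. Lemmatization would return "chase" instead, at the cost of needing a dictionary.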


3 Levels of text clustering
Word level clustering: It groups words, for example by collecting the synonyms of a particular word.
Sentence level clustering: It groups sentences from different documents. Example: Twitter analysis.
Document level clustering: It groups documents based on topic. Examples: emails, search engines, etc.


Text clustering similarity measures
Words can be checked for 2 types of similarity: lexical similarity and semantic similarity.

Lexical similarity: Words are lexically similar if they have a similar character sequence. It is measured using string-based algorithms.
Semantic similarity: Words are semantically similar if their meanings are related, for example if they mean the same thing or are opposites of each other. It is measured using knowledge-based algorithms.
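As an example of a string-based measure for lexical similarity, the sketch below uses Levenshtein edit distance. This is one possible choice among many, and the function names and the normalization to a 0..1 score are assumptions. Semantic similarity is not shown because it needs an external knowledge base such as WordNet.

```python
def edit_distance(a, b):
    """Levenshtein distance: minimum number of single-character edits."""
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        curr = [i]
        for j, cb in enumerate(b, start=1):
            cost = 0 if ca == cb else 1
            curr.append(min(
                prev[j] + 1,         # deletion
                curr[j - 1] + 1,     # insertion
                prev[j - 1] + cost,  # substitution or match
            ))
        prev = curr
    return prev[-1]

def lexical_similarity(a, b):
    """Scale the distance to a score between 0 (different) and 1 (identical)."""
    longest = max(len(a), len(b))
    return 1.0 if longest == 0 else 1.0 - edit_distance(a, b) / longest

print(edit_distance("kitten", "sitting"))          # → 3
print(lexical_similarity("cluster", "clusters"))   # → 0.875
```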


Text Clustering Algorithms
Hierarchical clustering algorithm: It is of 2 types:
a. Divisive approach.
b. Agglomerative approach.

Divisive approach
It starts with one cluster that contains all items and splits it into sub-clusters (top-down).
Example algorithms: DIANA and MONA.

Agglomerative approach
It starts with small clusters and repeatedly merges them to form bigger clusters (bottom-up).
Example algorithms: BIRCH and CURE.
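The agglomerative idea can be sketched with a naive single-linkage procedure. This is not BIRCH or CURE, which are considerably more involved. The function name, the Euclidean distance and the sample points are assumptions for illustration.

```python
def agglomerative(points, n_clusters):
    """Naive single-linkage clustering; returns lists of point indices."""
    clusters = [[i] for i in range(len(points))]  # every point starts alone

    def dist(p, q):
        return sum((x - y) ** 2 for x, y in zip(p, q)) ** 0.5

    def linkage(c1, c2):
        # Single linkage: distance between the two closest members.
        return min(dist(points[i], points[j]) for i in c1 for j in c2)

    while len(clusters) > n_clusters:
        # Find the pair of clusters that are closest to each other.
        a, b = min(
            ((i, j) for i in range(len(clusters))
             for j in range(i + 1, len(clusters))),
            key=lambda ij: linkage(clusters[ij[0]], clusters[ij[1]]),
        )
        clusters[a] = clusters[a] + clusters[b]  # merge b into a
        del clusters[b]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (10, 0)]
print(agglomerative(points, 3))
# → [[0, 1], [2, 3], [4]]
```

For text, the points would be document feature vectors, and a cosine-based distance is a common replacement for the Euclidean one.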

Partitioning
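It splits the data into a flat set of non-overlapping clusters, typically by iteratively refining an initial partition.
Example algorithms: k-means, ISODATA and PAM.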
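A minimal sketch of k-means, the first algorithm in this list, is given below. To keep the result reproducible it takes the first k points as initial centroids, which is an assumption (real implementations usually pick them at random or with k-means++). The fixed iteration count is also a simplification.

```python
def kmeans(points, k, iterations=10):
    """Plain k-means; returns (assignment, centroids)."""
    centroids = [list(p) for p in points[:k]]  # deterministic start (assumption)
    assignment = [0] * len(points)
    for _ in range(iterations):
        # Assignment step: attach every point to its nearest centroid.
        for idx, p in enumerate(points):
            assignment[idx] = min(
                range(k),
                key=lambda c: sum((x - y) ** 2 for x, y in zip(p, centroids[c])),
            )
        # Update step: move each centroid to the mean of its members.
        for c in range(k):
            members = [p for p, a in zip(points, assignment) if a == c]
            if members:
                centroids[c] = [sum(dim) / len(members) for dim in zip(*members)]
    return assignment, centroids

points = [(1.0, 1.0), (8.0, 8.0), (1.5, 2.0), (9.0, 9.5), (1.0, 0.5), (8.5, 9.0)]
assignment, centroids = kmeans(points, k=2)
print(assignment)
# → [0, 1, 0, 1, 0, 1]
```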

Density
Clusters are formed based on how many data points fall within a given radius. Example algorithm: DBSCAN.
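The radius-and-count idea behind DBSCAN can be sketched as follows. This is a simplified, quadratic-time version for illustration only. The parameter names eps (the radius) and min_pts (the minimum number of points within that radius) follow common usage, and the sample points are made up.

```python
def dbscan(points, eps, min_pts):
    """Simplified DBSCAN; returns one label per point, -1 meaning noise."""
    NOISE, UNVISITED = -1, None
    labels = [UNVISITED] * len(points)

    def neighbors(i):
        # Indices of all points within radius eps of point i (including i).
        return [j for j, q in enumerate(points)
                if sum((x - y) ** 2 for x, y in zip(points[i], q)) <= eps ** 2]

    cluster_id = 0
    for i in range(len(points)):
        if labels[i] is not UNVISITED:
            continue
        seeds = neighbors(i)
        if len(seeds) < min_pts:
            labels[i] = NOISE  # may later become a border point
            continue
        labels[i] = cluster_id
        queue = [j for j in seeds if j != i]
        while queue:
            j = queue.pop()
            if labels[j] == NOISE:
                labels[j] = cluster_id  # border point joins the cluster
            if labels[j] is not UNVISITED:
                continue
            labels[j] = cluster_id
            j_neighbors = neighbors(j)
            if len(j_neighbors) >= min_pts:  # j is a core point: keep expanding
                queue.extend(j_neighbors)
        cluster_id += 1
    return labels

points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10), (50, 50)]
print(dbscan(points, eps=1.5, min_pts=3))
# → [0, 0, 0, 1, 1, 1, -1]
```

Unlike k-means, the number of clusters is not given in advance, and isolated points are reported as noise instead of being forced into a cluster.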

Graph
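It models documents as nodes of a graph whose edges reflect document similarity, and finds clusters as groups of closely connected nodes.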
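One simple way to turn document similarity into a graph-based clustering is sketched below, as an assumption and not as the method these notes refer to: connect two documents when their cosine similarity reaches a threshold, then treat each connected component as a cluster. The function names, the threshold and the sample documents are illustrative.

```python
import math
from collections import Counter

def cosine(a, b):
    """Cosine similarity between two bag-of-words Counters."""
    dot = sum(a[t] * b[t] for t in a if t in b)
    norm = (math.sqrt(sum(v * v for v in a.values()))
            * math.sqrt(sum(v * v for v in b.values())))
    return dot / norm if norm else 0.0

def graph_clusters(docs, threshold):
    bags = [Counter(d.lower().split()) for d in docs]
    n = len(docs)
    # Build the similarity graph: an edge joins sufficiently similar documents.
    adj = {i: [j for j in range(n)
               if j != i and cosine(bags[i], bags[j]) >= threshold]
           for i in range(n)}
    seen, clusters = set(), []
    for start in range(n):
        if start in seen:
            continue
        # Depth-first search collects one connected component.
        stack, component = [start], []
        seen.add(start)
        while stack:
            node = stack.pop()
            component.append(node)
            for nb in adj[node]:
                if nb not in seen:
                    seen.add(nb)
                    stack.append(nb)
        clusters.append(sorted(component))
    return clusters

docs = ["apple banana fruit", "banana fruit salad",
        "python code bug", "code bug fix"]
print(graph_clusters(docs, threshold=0.5))
# → [[0, 1], [2, 3]]
```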

Probabilistic
Here, words are assigned probabilities of belonging to topics, and these probabilities are used to cluster the texts.
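The sketch below shows only the soft-assignment part of a probabilistic approach: given word probabilities per topic, it computes how likely each topic is for a text. The topic names and all probabilities in TOPICS are hand-picked, hypothetical values. A real topic model such as LDA would learn them from the corpus.

```python
import math

# Hypothetical word probabilities for two topics (each row sums to 1).
TOPICS = {
    "sports":  {"game": 0.4, "team": 0.4, "market": 0.1, "stock": 0.1},
    "finance": {"game": 0.1, "team": 0.1, "market": 0.4, "stock": 0.4},
}

def topic_posteriors(tokens, topics=TOPICS):
    """Probability of each topic for a text, assuming equally likely topics."""
    # Sum log-probabilities of known words to avoid numeric underflow.
    log_scores = {
        name: sum(math.log(dist[t]) for t in tokens if t in dist)
        for name, dist in topics.items()
    }
    top = max(log_scores.values())
    weights = {name: math.exp(s - top) for name, s in log_scores.items()}
    total = sum(weights.values())
    return {name: w / total for name, w in weights.items()}

result = topic_posteriors(["team", "game", "game"])
print(max(result, key=result.get))
# → sports
```

Each text can then be placed in the cluster of its most probable topic, or kept as a soft assignment across several topics.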


