DOCUMENT SIMILARITY IN R

What is Document Similarity?

Document similarity is a numeric score that quantifies how alike two or more documents are in content.




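There are several measures for calculating document similarity:

1) Euclidean Distance
2) Cosine Similarity
3) Pearson's Correlation Coefficient

The source code in this practical uses a fourth measure, Jaccard similarity, through the textreuse package.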
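The three measures in the list above can be sketched in base R. This is a minimal illustration rather than part of the practical: the two term-frequency vectors doc_a and doc_b are made-up counts over an assumed shared five-word vocabulary.

```r
# Toy term-frequency vectors for two documents over the same vocabulary
# (hypothetical counts, chosen only for illustration)
doc_a <- c(2, 1, 0, 3, 1)
doc_b <- c(1, 1, 1, 2, 0)

# 1) Euclidean distance: 0 for identical vectors, larger means less similar
euclidean <- sqrt(sum((doc_a - doc_b)^2))

# 2) Cosine similarity: cosine of the angle between the two vectors
cosine <- sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))

# 3) Pearson's correlation coefficient: linear correlation of the counts
pearson <- cor(doc_a, doc_b, method = "pearson")

print(c(euclidean = euclidean, cosine = cosine, pearson = pearson))
```

Note that Euclidean distance is a distance (smaller is more similar), while the other two are similarities (larger is more similar).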


PROBLEM

Source Code

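# Install textreuse only if it is missing, then load it
if (!requireNamespace("textreuse", quietly = TRUE)) install.packages("textreuse")
library(textreuse)

# 200 minhash functions; the fixed seed makes the signatures reproducible
minhash <- minhash_generator(n = 200, seed = 235)
color <- c("red", "green", "blue", "orange", "yellow", "pink")

# Run 1: tokenize each document into 5-word n-grams (n = 5)
ats <- TextReuseCorpus(dir = "C:/neha/DS/prac 11/french-plays", n = 5, minhash_func = minhash)
buckets <- lsh(ats, bands = 50, progress = interactive())  # 200 hashes / 50 bands = 4 per band
candidates <- lsh_candidates(buckets)                      # pairs that share a bucket
my.df <- lsh_compare(candidates, ats, jaccard_similarity)  # fills in the score column
my.df
# Plot only the numeric score column, one bar per candidate pair
barplot(my.df$score, names.arg = paste(my.df$a, my.df$b, sep = " / "), col = color)

# Run 2: the same pipeline with 2-word n-grams (n = 2)
ats <- TextReuseCorpus(dir = "C:/neha/DS/prac 11/french-plays", n = 2, minhash_func = minhash)
buckets <- lsh(ats, bands = 50, progress = interactive())
candidates <- lsh_candidates(buckets)
my.df <- lsh_compare(candidates, ats, jaccard_similarity)
my.df
barplot(my.df$score, names.arg = paste(my.df$a, my.df$b, sep = " / "), col = color)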

Output

[Figure: Document Similarity]


lsh - Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.

bands - The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands. In the source code, 200 minhashes are split into 50 bands, which gives 4 hashes per band.
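The effect of the band count can be sketched with the standard LSH banding formula: with h hashes split into b bands of r = h / b rows, a pair with Jaccard similarity s shares at least one bucket with probability 1 - (1 - s^r)^b. The helper names below (lsh_pair_probability, lsh_approx_threshold) are hypothetical and written in base R for illustration; they are not taken from the source code.

```r
# Probability that two documents with Jaccard similarity s become a candidate
# pair when h minhashes are split into b bands of r = h / b rows each
lsh_pair_probability <- function(h, b, s) {
  stopifnot(h %% b == 0)  # the hashes must divide evenly into the bands
  r <- h / b
  1 - (1 - s^r)^b
}

# Approximate similarity at which the candidate probability rises most steeply,
# computed as (1 / b)^(1 / r)
lsh_approx_threshold <- function(h, b) {
  stopifnot(h %% b == 0)
  r <- h / b
  (1 / b)^(1 / r)
}

h <- 200  # number of minhashes, as in the source code
b <- 50   # number of bands, as in the source code

threshold <- lsh_approx_threshold(h, b)
probabilities <- sapply(c(0.1, 0.3, 0.5, 0.7),
                        function(s) lsh_pair_probability(h, b, s))
print(threshold)
print(probabilities)
```

More bands with fewer rows each lower the threshold, so more pairs are proposed as candidates; fewer bands raise it and make the search stricter.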

progress - Whether to display a progress bar while comparing documents. Passing interactive() shows the bar only when R is running in an interactive session.

lsh_candidates - Given a data frame of LSH buckets returned from lsh, this function returns the potential candidates.

lsh_compare - lsh_candidates only identifies potential matches; it cannot estimate the actual similarity of the documents. lsh_compare takes the data frame returned by lsh_candidates and applies a comparison function (here jaccard_similarity) to each candidate pair of documents in the corpus, thereby calculating the document similarity score.
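To make the similarity score concrete, here is a minimal base R sketch of Jaccard similarity on word n-grams. The helpers word_shingles and jaccard and the two sample sentences are hypothetical, written only to illustrate the idea; they are not the textreuse implementation, and details such as tokenization may differ from the package.

```r
# Split a text into overlapping word n-grams (hypothetical helper)
word_shingles <- function(text, n) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  vapply(starts,
         function(i) paste(words[i:(i + n - 1)], collapse = " "),
         character(1))
}

# Jaccard similarity: size of the intersection divided by size of the union
jaccard <- function(a, b) {
  a <- unique(a)
  b <- unique(b)
  all_items <- union(a, b)
  if (length(all_items) == 0) return(0)
  length(intersect(a, b)) / length(all_items)
}

# Two made-up sentences that differ in a single word
doc_a <- "the quick brown fox jumps over the lazy dog"
doc_b <- "the quick brown cat jumps over the lazy dog"

sim_n2 <- jaccard(word_shingles(doc_a, 2), word_shingles(doc_b, 2))
sim_n5 <- jaccard(word_shingles(doc_a, 5), word_shingles(doc_b, 5))
print(c(n2 = sim_n2, n5 = sim_n5))
```

On these sentences the score is lower for n = 5 than for n = 2, because a single changed word affects more of the longer n-grams. This is one reason the two runs in the source code, which differ only in n, can report different scores for the same documents.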


[Figure: Document Similarity]

According to the colors in the plot, the two documents are very similar.

[Figure: Document Similarity]

According to the graph above, the documents are not similar.
