DOCUMENT SIMILARITY
What is Document Similarity?
Document similarity is a measure of similarity between two or more documents.
There are different algorithms for calculating document similarity:
1) Euclidean Distance
2) Cosine Similarity
3) Pearson's Correlation Coefficient
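As a rough illustration of the three measures listed above, the sketch below applies them to two toy term-frequency vectors in base R. The vectors doc_a and doc_b are made-up counts over an assumed shared vocabulary, not data from this practical.

```r
# Toy term-frequency vectors over the same four-word vocabulary (made-up counts).
# doc_b is doc_a with every count doubled, i.e. the same text repeated twice.
doc_a <- c(3, 0, 1, 2)
doc_b <- c(6, 0, 2, 4)

# 1) Euclidean distance: straight-line distance between the two vectors.
euclidean <- sqrt(sum((doc_a - doc_b)^2))

# 2) Cosine similarity: cosine of the angle between the vectors (ignores length).
cosine <- sum(doc_a * doc_b) / (sqrt(sum(doc_a^2)) * sqrt(sum(doc_b^2)))

# 3) Pearson's correlation coefficient: cor() uses method = "pearson" by default.
pearson <- cor(doc_a, doc_b)

print(c(euclidean = euclidean, cosine = cosine, pearson = pearson))
```

Because doc_b is a scaled copy of doc_a, cosine similarity and Pearson's correlation both treat the two documents as identical, while the Euclidean distance is still greater than zero. This is one reason length-insensitive measures are often preferred for documents of different sizes.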
PROBLEM
Source Code
install.packages("textreuse")
library(textreuse)

# First run: tokenize each document into 5-word n-grams
# 200 minhash functions with a fixed seed, so results are reproducible
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = "C:/neha/DS/prac 11/french-plays", n = 5, minhash_func = minhash)
# 200 hashes split into 50 bands (4 hashes per band)
buckets <- lsh(ats, bands = 50, progress = interactive())
candidates <- lsh_candidates(buckets)
# Score each candidate pair with the Jaccard similarity
my.df <- lsh_compare(candidates, ats, jaccard_similarity)
my.df
color <- c("red", "green", "blue", "orange", "yellow", "pink")
barplot(as.matrix(my.df), col = color)

# Second run: same steps, but with 2-word n-grams
minhash <- minhash_generator(200, seed = 235)
ats <- TextReuseCorpus(dir = "C:/neha/DS/prac 11/french-plays", n = 2, minhash_func = minhash)
buckets <- lsh(ats, bands = 50, progress = interactive())
candidates <- lsh_candidates(buckets)
my.df <- lsh_compare(candidates, ats, jaccard_similarity)
my.df
color <- c("red", "green", "blue", "orange", "yellow", "pink")
barplot(as.matrix(my.df), col = color)
Output
lsh - Locality sensitive hashing (LSH) discovers potential matches among a corpus of documents quickly, so that only likely pairs can be compared.
bands - The number of bands to use for locality sensitive hashing. The number of hashes in the documents in the corpus must be evenly divisible by the number of bands.
progress - Display a progress bar while comparing documents.
lsh_candidates - Given a data frame of LSH buckets returned from lsh, this function returns the pairs of documents that are potential matches.
lsh_compare - lsh_candidates only identifies potential matches; it cannot estimate the actual similarity of the documents. This function takes the data frame returned by lsh_candidates and applies a comparison function (here, jaccard_similarity) to each candidate pair of documents in the corpus, thereby calculating a similarity score for each pair.
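To make the relationship between the number of hashes and the number of bands concrete, here is a small base-R sketch using the values from the source code (200 hashes, 50 bands). The formulas are the standard textbook approximations for banded minhash LSH, and prob_candidate is a hypothetical helper name, not a textreuse function.

```r
hashes <- 200
bands  <- 50

# The hashes must divide evenly into bands, as noted above.
stopifnot(hashes %% bands == 0)
rows <- hashes / bands  # number of hashes in each band

# Approximate Jaccard similarity at which a pair becomes likely to be a candidate.
threshold <- (1 / bands)^(1 / rows)

# Probability that a pair with Jaccard similarity s shares at least one band
# (hypothetical helper for illustration).
prob_candidate <- function(s, b, r) 1 - (1 - s^r)^b

p_low  <- prob_candidate(0.1, bands, rows)  # dissimilar pair
p_high <- prob_candidate(0.5, bands, rows)  # moderately similar pair

print(c(rows = rows, threshold = threshold, p_low = p_low, p_high = p_high))
```

With these settings each band holds 4 hashes. Pairs well below the threshold are rarely returned as candidates, while pairs well above it are almost always returned. Using more bands lowers the threshold and yields more candidates for lsh_compare to score.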
According to the colors in the bar plot, the two documents are very similar.
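The score passed to lsh_compare in the source code is the Jaccard similarity of the documents' n-gram sets, which differs from the three measures listed at the start. The base-R sketch below shows the idea on two made-up sentences; shingles and jaccard are hypothetical helpers written for illustration, not the textreuse implementations.

```r
# Split a text into its set of overlapping n-word shingles (hypothetical helper).
shingles <- function(text, n) {
  words <- strsplit(tolower(text), "[^[:alnum:]]+")[[1]]
  words <- words[nzchar(words)]
  if (length(words) < n) return(character(0))
  starts <- seq_len(length(words) - n + 1)
  unique(vapply(starts,
                function(i) paste(words[i:(i + n - 1)], collapse = " "),
                character(1)))
}

# Jaccard similarity: size of the intersection divided by size of the union.
jaccard <- function(a, b) length(intersect(a, b)) / length(union(a, b))

# Two made-up sentences that differ in a single word.
doc1 <- "the quick brown fox jumps over the lazy dog"
doc2 <- "the quick brown fox leaps over the lazy dog"

j2 <- jaccard(shingles(doc1, 2), shingles(doc2, 2))  # 2-word shingles
j5 <- jaccard(shingles(doc1, 5), shingles(doc2, 5))  # 5-word shingles

print(c(n2 = j2, n5 = j5))
```

A single changed word affects every shingle that contains it, so longer shingles give a stricter comparison. This is why running the source code with n = 5 and with n = 2 can produce different similarity scores for the same pair of documents.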