TOPIC MODELLING
Description Of Topic Modelling
Topic modelling is a machine learning technique that automatically analyses text data to find the clusters of words that characterise a set of documents. It is a form of ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that has been previously classified by humans.
Topic modelling involves counting words and grouping similar word patterns to infer topics within unstructured data. Let’s say you’re a software company and you want to know what customers are saying about particular features of your product. Instead of spending hours going through heaps of feedback to work out which texts mention your topics of interest, you could analyse them with a topic modelling algorithm.
By detecting patterns such as word frequency and distance between words, a topic model clusters feedback that is similar, along with the words and expressions that appear most often. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is ‘unsupervised’, meaning that no labelled training data is required.
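To make the word-counting idea concrete, here is a minimal base R sketch. The three feedback snippets are invented for illustration, and the variable names (feedback, count_matrix, shared) are assumptions rather than part of any package. It only builds the word counts that a topic model would start from; it does not fit a topic model.

```r
# Three toy feedback snippets (hypothetical data, for illustration only)
feedback <- c(
  "the login page is slow",
  "login fails on the mobile app",
  "billing page shows the wrong invoice"
)

# Split each text into lower-case words
tokens <- strsplit(tolower(feedback), "[^a-z]+")

# Vocabulary: every distinct word across all texts
vocab <- sort(unique(unlist(tokens)))

# Count how often each vocabulary word occurs in each text
count_matrix <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
rownames(count_matrix) <- paste0("doc", seq_along(feedback))
colnames(count_matrix) <- vocab

# Words that occur in more than one text hint at shared subject matter
shared <- vocab[colSums(count_matrix > 0) > 1]

print(count_matrix)
print(shared)
```

Note that "the" shows up among the shared words even though it says nothing about the subject, which is why the script further down removes stop words before modelling.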
PROBLEM
Source Code
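# Install the required packages (only needed once)
install.packages("tm")
install.packages("topicmodels")
install.packages("corpus")
# Load the libraries (add lib.loc only if NLP lives in a non-default library folder)
library(NLP)
library(tm)
library(topicmodels)
# Work from the folder that holds the corpus
setwd("C:/Semester 6/Data Science/british-fiction-corpus")
# pattern is a regular expression, so match the .txt extension explicitly
filenames<-list.files(path="C:/Semester 6/Data Science/british-fiction-corpus",pattern="\\.txt$")
# Read each file and collapse its lines so that one file becomes one document
filetext<-lapply(filenames, function(f) paste(readLines(f), collapse=" "))
mycorpus<-Corpus(VectorSource(unlist(filetext)))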
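# Clean the text: strip digits and punctuation, lower-case, then drop stop words
mycorpus<-tm_map(mycorpus,removeNumbers)
mycorpus<-tm_map(mycorpus,removePunctuation)
mystopwords<-c("of","a","and","the","in","to","for","that","is","on","are","with","as","by","be","an","which","it","from","or","can","have","these","has","such")
# content_transformer() lets a base R function such as tolower work on a tm corpus
mycorpus<-tm_map(mycorpus,content_transformer(tolower))
mycorpus<-tm_map(mycorpus,removeWords,mystopwords)
# Build the document-term matrix (one row per document, one column per term)
dtm<-DocumentTermMatrix(mycorpus)
# Fit an LDA model with k topics using variational EM
k<-3
lda_output_3<-LDA(dtm,k,method = "VEM")
# Print a summary of the document-term matrix
dtm
# Most likely topic for each document
topics(lda_output_3)
# Ten most probable terms for each topic
terms(lda_output_3,10)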
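The same pipeline can be tried without the novel files. The sketch below is an assumption-laden illustration rather than part of the original script: the four toy documents are invented, and the names toy_corpus, toy_dtm and toy_lda are hypothetical. It also passes a seed through the control argument so that repeated runs give the same fit, and uses posterior() from topicmodels to show the probability of each topic per document instead of only the single most likely one.

```r
library(tm)
library(topicmodels)

# Toy documents (hypothetical; they stand in for the novel files)
docs <- c(
  "ship captain voyage storm harbour",
  "captain ship island voyage crew",
  "marriage estate sister letter ball",
  "sister marriage letter estate visit"
)
toy_corpus <- Corpus(VectorSource(docs))
toy_dtm <- DocumentTermMatrix(toy_corpus)

# LDA() needs at least one term in every document
stopifnot(all(rowSums(as.matrix(toy_dtm)) > 0))

# A fixed seed makes the VEM fit repeatable between runs
toy_lda <- LDA(toy_dtm, k = 2, method = "VEM", control = list(seed = 1234))

# posterior() returns the per-document topic probabilities in $topics
doc_topic_probs <- posterior(toy_lda)$topics
print(round(doc_topic_probs, 3))
```

Each row of doc_topic_probs sums to 1. With so little text the split between the two topics is not guaranteed to look meaningful; the point is only to show the shape of the output.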
Output
setwd - Returns the current directory before the change
lapply - Returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
corpus - This package contains functions for text corpus analysis
LDA - Latent Dirichlet Allocation
LDA() - Estimates an LDA model using, for example, the VEM algorithm
topics - Displays the most likely topic for each document, i.e. the first document is assigned one topic number, the second document another, etc.
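The idea behind topics() and terms() can be sketched in base R: the first picks the most probable topic for each document, the second the most probable terms for each topic. The two probability matrices below are invented for illustration, and the names doc_topic, topic_term, most_likely_topic and top_terms are assumptions, not objects produced by topicmodels.

```r
# Hypothetical document-topic probabilities (rows: documents, columns: topics)
doc_topic <- matrix(
  c(0.7, 0.2, 0.1,
    0.1, 0.8, 0.1,
    0.3, 0.3, 0.4),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("doc1", "doc2", "doc3"), c("Topic 1", "Topic 2", "Topic 3"))
)

# Hypothetical topic-term probabilities (rows: topics, columns: terms)
topic_term <- matrix(
  c(0.50, 0.30, 0.15, 0.05,
    0.05, 0.10, 0.60, 0.25,
    0.10, 0.20, 0.30, 0.40),
  nrow = 3, byrow = TRUE,
  dimnames = list(c("Topic 1", "Topic 2", "Topic 3"),
                  c("ship", "sea", "love", "house"))
)

# Like topics(): index of the most probable topic for each document
most_likely_topic <- apply(doc_topic, 1, which.max)

# Like terms(model, 2): the two most probable terms for each topic
top_terms <- apply(topic_term, 1, function(p) {
  colnames(topic_term)[order(p, decreasing = TRUE)[1:2]]
})

print(most_likely_topic)
print(top_terms)
```

Here doc1 lands in topic 1 because 0.7 is the largest value in its row, and the top terms for topic 1 are "ship" and "sea" for the same reason.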