TOPIC MODELLING

Description Of Topic Modelling

Topic modelling is a machine learning technique that automatically analyses text data to find the clusters of words that characterise a set of documents. It is known as ‘unsupervised’ machine learning because it doesn’t require a predefined list of tags or training data that has been previously classified by humans.

Topic modelling involves counting words and grouping similar word patterns to infer topics within unstructured data. Let’s say you’re a software company and you want to know what customers are saying about particular features of your product. Instead of spending hours going through heaps of feedback to work out which texts talk about your topics of interest, you could analyse them with a topic modelling algorithm.

By detecting patterns such as word frequency and the distance between words, a topic model groups together feedback that is similar, along with the words and expressions that appear most often in each group. With this information, you can quickly deduce what each set of texts is talking about. Remember, this approach is ‘unsupervised’, meaning that no labelled training data is required.
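The word counting described above can be sketched in base R. This is only an illustration: the three feedback texts are made up, and the names docs, tokens, vocab and counts are not part of the original script.

```r
# Three tiny example documents (invented for illustration).
docs <- c(
  "the app crashes when I export a report",
  "export to pdf is slow and the report looks wrong",
  "love the new dark theme"
)

# Split each document into lowercase words.
tokens <- strsplit(tolower(docs), "[^a-z]+")

# Build the vocabulary, then count every vocabulary word in every document.
vocab <- sort(unique(unlist(tokens)))
counts <- t(vapply(
  tokens,
  function(words) as.integer(table(factor(words, levels = vocab))),
  integer(length(vocab))
))
dimnames(counts) <- list(paste0("doc", seq_along(docs)), vocab)

# Rows are documents, columns are words: a small document-term matrix.
print(counts[, c("export", "report", "theme")])
```

The first two documents share the words "export" and "report" while the third shares neither, and a topic model picks up on exactly this kind of overlap.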

PROBLEM

Source Code

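# Install the required packages (only needed once).
install.packages("tm")
install.packages("topicmodels")
install.packages("corpus")

# Load the libraries; tm attaches NLP automatically.
library(tm)
library(topicmodels)

# Folder that holds the plain-text files.
corpus_dir <- "C:/Semester 6/Data Science/british-fiction-corpus"
setwd(corpus_dir)

# pattern is a regular expression, so match names ending in ".txt".
filenames <- list.files(path = corpus_dir, pattern = "\\.txt$")

# Read each file and join its lines into one string per document.
filetext <- lapply(filenames, function(f) paste(readLines(f), collapse = " "))
mycorpus <- Corpus(VectorSource(unlist(filetext)))

# Clean the text: drop numbers and punctuation, then lowercase.
mycorpus <- tm_map(mycorpus, removeNumbers)
mycorpus <- tm_map(mycorpus, removePunctuation)
mycorpus <- tm_map(mycorpus, content_transformer(tolower))

# Remove common words that carry no topical meaning.
mystopwords <- c("of","a","and","the","in","to","for","that","is","on","are","with","as","by","be","an","which","it","from","or","can","have","these","has","such")
mycorpus <- tm_map(mycorpus, removeWords, mystopwords)

# Count words per document and fit an LDA model with 3 topics.
dtm <- DocumentTermMatrix(mycorpus)
k <- 3
lda_output_3 <- LDA(dtm, k, method = "VEM")

# Inspect the matrix, the most likely topic per document,
# and the 10 most likely terms per topic.
dtm
topics(lda_output_3)
terms(lda_output_3, 10)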

Output

setwd sets the working directory and returns (invisibly) the directory that was current before the change.

lapply returns a list of the same length as X, each element of which is the result of applying FUN to the corresponding element of X.
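A minimal sketch of that behaviour, using made-up file names rather than the real corpus files:

```r
# Hypothetical file names, only for illustration.
example_files <- c("emma.txt", "dracula.txt", "ivanhoe.txt")

# Apply nchar() to every element; the result is a list with one entry per input.
name_lengths <- lapply(example_files, nchar)

print(length(name_lengths))  # same length as example_files
print(name_lengths[[1]])     # number of characters in "emma.txt"
```

In the script, the same pattern applies a file-reading function to every file name, giving one list element per file.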

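corpus - This package contains functions for text corpus analysis. The script installs it but does not load it; the Corpus() function used in the code comes from tm.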

LDA - Latent Dirichlet Allocation

LDA() estimates an LDA model, here using the VEM (variational expectation-maximisation) algorithm.

method

The method to be used for fitting; topicmodels supports "VEM" and "Gibbs".
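To see the method argument in action without the fiction corpus, here is a hedged sketch that fits both methods on the AssociatedPress document-term matrix that ships with topicmodels. The 50-document subset, the seed and the iteration count are arbitrary choices for the illustration, not values from the original post.

```r
library(topicmodels)

# AssociatedPress is a document-term matrix bundled with topicmodels.
data("AssociatedPress", package = "topicmodels")
small_dtm <- AssociatedPress[1:50, ]  # small subset so the fit is quick

# Same data and number of topics, two different fitting methods.
fit_vem   <- LDA(small_dtm, k = 3, method = "VEM",
                 control = list(seed = 123))
fit_gibbs <- LDA(small_dtm, k = 3, method = "Gibbs",
                 control = list(seed = 123, iter = 200))

# Top 5 terms per topic for each fit (one column per topic).
print(terms(fit_vem, 5))
print(terms(fit_gibbs, 5))
```

The two methods generally give different term rankings, and topic numbers are arbitrary labels, so topic 1 from one fit need not correspond to topic 1 from the other.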

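topics(lda_output_3) displays the topic that each document is most similar to, i.e. topic 1 has 1, topic 2 has 1, etc., and terms(lda_output_3,10) lists the 10 most likely terms for each topic.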
