Topic Modeling With ML Techniques
Introduction

Topic modeling is a method for identifying the themes that exist in large sets of text data. It is an unsupervised learning technique: the model tries to uncover underlying topics without ground-truth labels. It is helpful in a wide range of industries, including healthcare, finance, and marketing, where there is a lot of text-based data to analyze. Using topic modeling, organizations can quickly gain valuable insights from the topics that matter most to their business, which can help them make better decisions and improve their products and services.
This article was published as a part of the Data Science Blogathon.
Project Description

Topic modeling is valuable for numerous industries, including but not limited to finance, healthcare, and marketing. It is especially beneficial for industries that deal with huge amounts of unstructured text data, such as customer reviews, social media posts, or medical records, as it can drastically reduce the time and labor required to analyze that text by hand.
For example, in the healthcare industry, topic modeling can identify common themes or patterns in patient records that can help improve patient outcomes, identify risk factors, and guide clinical decision-making. In finance, topic modeling can analyze news articles, financial reports, and other text data to identify trends, market sentiment, and potential investment opportunities.
In the marketing industry, topic modeling can analyze customer feedback, social media posts, and other text data to identify customer needs and preferences and to develop targeted marketing campaigns. This can help companies improve customer satisfaction, increase sales, and gain a competitive edge in the market.
Problem Statement

The aim is to perform topic modeling on the "A Million Headlines" news dataset, a collection of over one million news article headlines published by the ABC.
By identifying the main themes in the news headlines dataset, the project aims to provide insights into the types of news stories that the ABC covers. Journalists, editors, and media organizations can use this information to better understand their audience and to tailor their news coverage to meet the needs and interests of their readers.
Dataset Description

The dataset contains a large collection of news headlines published over a period of nineteen years, between February 19, 2003, and December 31, 2021. The data is sourced from the Australian Broadcasting Corporation (ABC), a reputable news organization in Australia. The dataset is provided in CSV format and contains two columns: "publish_date" and "headline_text".
The "publish_date" column provides the date when the news article was published, in the YYYYMMDD format. The "headline_text" column contains the text of the headline, written in lowercase ASCII English.
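For instance, the raw YYYYMMDD values can be parsed explicitly with pandas. A minimal sketch on two sample values (the project code later uses parse_dates when reading the CSV instead):

import pandas as pd

# publish_date values look like 20030219; parse them into datetimes
raw = pd.Series([20030219, 20211231], name='publish_date')
parsed = pd.to_datetime(raw.astype(str), format='%Y%m%d')
print(parsed.iloc[0])  # 2003-02-19 00:00:00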
Project Plan

The steps for applying topic modeling to the news headlines dataset are as follows:
1. Exploratory Data Analysis: The first step is analyzing the data to understand the distribution of headlines over time, the frequency of different words and phrases, and other patterns in the data. You can also visualize the data using charts and graphs to gain insights.
2. Data Pre-processing: The next step is cleaning and preprocessing the text to remove stop words, punctuation, and so on. It also involves tokenization, stemming, and lemmatization to standardize the text data and make it suitable for analysis.
3. Topic Modeling: The core of the project is applying techniques such as LDA to identify the main topics and themes in the news headlines dataset. This requires selecting appropriate parameters for the topic modeling algorithm, for example, the number of topics, the size of the vocabulary, and the similarity measure (see the sketch after this list).
4. Topic Interpretation: After identifying the main topics, the next step is interpreting them and assigning human-readable labels. This includes analyzing the top words and phrases associated with each topic and identifying the main themes and trends.
5. Evaluation: The final step involves evaluating the performance of the topic modeling algorithms and comparing them based on metrics such as coherence score and perplexity, as well as identifying the limitations and challenges of the topic modeling approach and proposing possible solutions.
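As a preview of step 3, here is a minimal sketch of how these knobs map onto gensim's API. The parameter values are illustrative assumptions, not the ones finally used; the article builds the real dictionary and corpus later:

import gensim.corpora as corpora
from gensim.models import LdaMulticore

def fit_lda(texts, num_topics=15, keep_n=50000):
    """Fit LDA on tokenized documents with an explicit vocabulary cap."""
    id2word = corpora.Dictionary(texts)  # texts: list of token lists
    # Control vocabulary size: drop rare/ubiquitous tokens, keep at most keep_n
    # (no_below, no_above, keep_n are illustrative values, not tuned)
    id2word.filter_extremes(no_below=2, no_above=0.8, keep_n=keep_n)
    corpus = [id2word.doc2bow(doc) for doc in texts]
    model = LdaMulticore(corpus=corpus, id2word=id2word,
                         num_topics=num_topics, passes=10, random_state=100)
    return model, corpus, id2word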
Steps for the Project

First, importing the necessary libraries.
import numpy as np
import pandas as pd
from IPython.display import display
from tqdm import tqdm
from collections import Counter
import matplotlib.pyplot as plt
import seaborn as sns
from sklearn.feature_extraction.text import CountVectorizer
from textblob import TextBlob
import scipy.stats as stats
from sklearn.decomposition import LatentDirichletAllocation
from sklearn.manifold import TSNE
from wordcloud import WordCloud, STOPWORDS
from bokeh.plotting import figure, output_file, show
from bokeh.models import Label
from bokeh.io import output_notebook

output_notebook()
%matplotlib inline

Loading the CSV data into a dataframe while parsing the dates into a usable format.
path = '/content/drive/MyDrive/topic_modeling/abcnews-date-text.csv'  # path of your dataset
df = pd.read_csv(path, parse_dates=[0], infer_datetime_format=True)
reindexed_data = df['headline_text']
reindexed_data.index = df['publish_date']

Seeing a glimpse of the loaded data through the first five rows.
df.head()

There are two columns named publish_date and headline_text, as mentioned above in the dataset description.
df.info()  # general description of data

We can see that there are 1,244,184 rows in the dataset with no null values.
Later, we will use a sample of 100,000 rows for convenience and feasibility when fitting the LDA model.
Exploratory Data Analysis

Starting with visualizing the top 15 words in the data, excluding stopwords.
def get_top_n_words(n_top_words, count_vectorizer, text_data):
    '''
    Returns a tuple of the top n words in a sample and their
    accompanying counts, given a CountVectorizer object and text sample
    '''
    vectorized_headlines = count_vectorizer.fit_transform(text_data.values)
    vectorized_total = np.sum(vectorized_headlines, axis=0)
    word_indices = np.flip(np.argsort(vectorized_total)[0,:], 1)
    word_values = np.flip(np.sort(vectorized_total)[0,:], 1)
    word_vectors = np.zeros((n_top_words, vectorized_headlines.shape[1]))
    for i in range(n_top_words):
        word_vectors[i, word_indices[0,i]] = 1
    words = [word[0].encode('ascii').decode('utf-8')
             for word in count_vectorizer.inverse_transform(word_vectors)]
    return (words, word_values[0,:n_top_words].tolist()[0])

# CountVectorizer maps words to a vector space with similar words closer together
count_vectorizer = CountVectorizer(max_df=0.8, min_df=2, stop_words='english')
words, word_values = get_top_n_words(n_top_words=15,
                                     count_vectorizer=count_vectorizer,
                                     text_data=reindexed_data)

fig, ax = plt.subplots(figsize=(16,8))
ax.bar(range(len(words)), word_values)
ax.set_xticks(range(len(words)))
ax.set_xticklabels(words, rotation='vertical')
ax.set_title('Top words in headlines dataset (excluding stop words)')
ax.set_xlabel('Word')
ax.set_ylabel('Number of occurrences')
plt.show()

Now, doing part-of-speech tagging for the headlines.
import nltk
nltk.download('punkt')
nltk.download('averaged_perceptron_tagger')

tagged_headlines = [TextBlob(reindexed_data[i]).pos_tags
                    for i in range(reindexed_data.shape[0])]
tagged_headlines[10]  # checking the 10th headline

tagged_headlines_df = pd.DataFrame({'tags': tagged_headlines})

word_counts = []
pos_counts = {}
for headline in tagged_headlines_df[u'tags']:
    word_counts.append(len(headline))
    for tag in headline:
        if tag[1] in pos_counts:
            pos_counts[tag[1]] += 1
        else:
            pos_counts[tag[1]] = 1

print('Total number of words: ', np.sum(word_counts))
print('Mean number of words per headline: ', np.mean(word_counts))

Output
Total number of words: 8166553
Mean number of words per headline: 6.563782366595294
Checking if the distribution is normal.
y = stats.norm.pdf(np.linspace(0,14,50), np.mean(word_counts), np.std(word_counts))

fig, ax = plt.subplots(figsize=(8,4))
ax.hist(word_counts, bins=range(1,14), density=True)
ax.plot(np.linspace(0,14,50), y, 'r--', linewidth=1)
ax.set_title('Headline word lengths')
ax.set_xticks(range(1,14))
ax.set_xlabel('Number of words')
plt.show()
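The overlaid curve is only an eyeball check; a formal test can back it up. A minimal sketch (an addition beyond the original analysis) using SciPy's normaltest on the word counts computed above:

# D'Agostino-Pearson normality test; a small p-value means the
# word-count distribution deviates significantly from normal
statistic, p_value = stats.normaltest(word_counts)
print('statistic = %.2f, p-value = %.4f' % (statistic, p_value))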
Visualizing the proportion of the top 5 used parts of speech.

# importing libraries
import matplotlib.pyplot as plt
import seaborn as sns

# declaring data
pos_sorted_types = sorted(pos_counts, key=pos_counts.__getitem__, reverse=True)
pos_sorted_counts = sorted(pos_counts.values(), reverse=True)
top_five = pos_sorted_types[:5]
data = pos_sorted_counts[:5]

# declaring exploding pie
explode = [0, 0.1, 0, 0, 0]

# define Seaborn color palette to use
palette_color = sns.color_palette('dark')

# plotting data on chart
plt.pie(data, labels=top_five, colors=palette_color,
        explode=explode, autopct='%.0f%%')

# displaying chart
plt.show()

Here, it's visible that 50% of the words in the headlines are nouns, which sounds reasonable.
Pre-processing

First, sampling 100,000 headlines and converting sentences to words.
import gensim

def sent_to_words(sentences):
    for sentence in sentences:
        # deacc=True removes punctuation
        yield gensim.utils.simple_preprocess(str(sentence), deacc=True)

text_sample = reindexed_data.sample(n=100000, random_state=0).values
data = text_sample.tolist()
data_words = list(sent_to_words(data))

print(data_words[0])

Making bigram and trigram models.
# Build the bigram and trigram models
bigram = gensim.models.Phrases(data_words, min_count=5, threshold=100)
trigram = gensim.models.Phrases(bigram[data_words], threshold=100)  # higher threshold, fewer phrases

# Faster way to get a sentence clubbed as a trigram/bigram
bigram_mod = gensim.models.phrases.Phraser(bigram)
trigram_mod = gensim.models.phrases.Phraser(trigram)
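As a quick sanity check (a sketch; which phrases get merged depends on the sampled headlines), you can run a tokenized headline through the trained phrasers and look for underscore-joined tokens:

# Frequently co-occurring tokens are joined with an underscore,
# e.g. ['new', 'south', 'wales'] may become ['new_south_wales']
print(bigram_mod[data_words[0]])
print(trigram_mod[bigram_mod[data_words[0]]])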
We will do stop-word removal, bigram and trigram formation, and lemmatization in this step.

import nltk
nltk.download('stopwords')
from nltk.corpus import stopwords
from gensim.utils import simple_preprocess

stop_words = stopwords.words('english')
stop_words.extend(['from', 'subject', 're', 'edu', 'use'])

# Define functions for stopwords, bigrams, trigrams and lemmatization
def remove_stopwords(texts):
    return [[word for word in simple_preprocess(str(doc)) if word not in stop_words]
            for doc in texts]

def make_bigrams(texts):
    return [bigram_mod[doc] for doc in texts]

def make_trigrams(texts):
    return [trigram_mod[bigram_mod[doc]] for doc in texts]

def lemmatization(texts, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV']):
    texts_out = []
    for sent in texts:
        doc = nlp(" ".join(sent))
        texts_out.append([token.lemma_ for token in doc if token.pos_ in allowed_postags])
    return texts_out

# !python -m spacy download en_core_web_sm
import spacy

# Remove Stop Words
data_words_nostops = remove_stopwords(text_sample)

# Form Bigrams
data_words_bigrams = make_bigrams(data_words_nostops)

# Initialize spacy 'en' model, keeping only tagger component (for efficiency)
nlp = spacy.load("en_core_web_sm", disable=['parser', 'ner'])

data_lemmatized = lemmatization(data_words_bigrams, allowed_postags=['NOUN', 'ADJ', 'VERB', 'ADV'])

import gensim.corpora as corpora

# Create Dictionary
id2word = corpora.Dictionary(data_lemmatized)

# Create Corpus
texts = data_lemmatized

# Term Document Frequency
corpus = [id2word.doc2bow(text) for text in texts]
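Before fitting the model, it helps to confirm the corpus looks sensible. A minimal sketch (the exact output depends on your sample) that maps token ids back to words:

# Each document is a bag-of-words: a list of (token_id, count) pairs
print(corpus[0])
# Map the ids back to readable words
print([(id2word[token_id], count) for token_id, count in corpus[0]])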
Topic Modeling

Applying the LDA model, assuming 15 themes in the whole dataset.

num_topics = 15

lda_model = gensim.models.LdaMulticore(corpus=corpus,
                                       id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100,
                                       chunksize=100,
                                       passes=10,
                                       alpha=0.01,
                                       eta=0.9)

Topic Interpretation

from pprint import pprint

# Print the keywords in the 15 topics
pprint(lda_model.print_topics())
doc_lda = lda_model[corpus]

Output:

[(0,
  '0.046*"new" + 0.034*"fire" + 0.020*"year" + 0.018*"ban" + 0.016*"open" + '
  '0.014*"set" + 0.011*"consider" + 0.009*"security" + 0.009*"name" + '
  '0.008*"melbourne"'),
 (1,
  '0.021*"urge" + 0.020*"attack" + 0.016*"government" + 0.014*"lead" + '
  '0.014*"driver" + 0.013*"public" + 0.011*"want" + 0.010*"rise" + '
  '0.010*"student" + 0.010*"funding"'),
 (2,
  '0.019*"day" + 0.015*"flood" + 0.013*"go" + 0.013*"work" + 0.011*"fine" + '
  '0.010*"launch" + 0.009*"union" + 0.009*"final" + 0.007*"run" + '
  '0.006*"game"'),
 (3,
  '0.023*"australian" + 0.023*"crash" + 0.016*"health" + 0.016*"arrest" + '
  '0.013*"fight" + 0.013*"community" + 0.013*"job" + 0.013*"indigenous" + '
  '0.012*"victim" + 0.012*"support"'),
 (4,
  '0.024*"face" + 0.022*"nsw" + 0.018*"council" + 0.018*"seek" + 0.017*"talk" '
  '+ 0.016*"home" + 0.012*"price" + 0.011*"bushfire" + 0.010*"high" + '
  '0.010*"return"'),
 (5,
  '0.068*"police" + 0.019*"car" + 0.015*"accuse" + 0.014*"change" + '
  '0.013*"road" + 0.010*"strike" + 0.008*"safety" + 0.008*"federal" + '
  '0.008*"keep" + 0.007*"problem"'),
 (6,
  '0.042*"call" + 0.029*"win" + 0.015*"first" + 0.013*"show" + 0.013*"time" + '
  '0.012*"trial" + 0.012*"cut" + 0.009*"review" + 0.009*"top" + 0.009*"look"'),
 (7,
  '0.027*"take" + 0.021*"make" + 0.014*"farmer" + 0.014*"probe" + '
  '0.011*"target" + 0.011*"rule" + 0.008*"season" + 0.008*"drought" + '
  '0.007*"confirm" + 0.006*"point"'),
 (8,
  '0.047*"say" + 0.026*"water" + 0.021*"report" + 0.020*"fear" + 0.015*"test" '
  '+ 0.015*"power" + 0.014*"hold" + 0.013*"continue" + 0.013*"search" + '
  '0.012*"election"'),
 (9,
  '0.024*"warn" + 0.020*"worker" + 0.014*"end" + 0.011*"industry" + '
  '0.011*"business" + 0.009*"speak" + 0.008*"stop" + 0.008*"regional" + '
  '0.007*"turn" + 0.007*"park"'),
 (10,
  '0.050*"man" + 0.035*"charge" + 0.017*"jail" + 0.016*"murder" + '
  '0.016*"woman" + 0.016*"miss" + 0.016*"get" + 0.014*"claim" + 0.014*"school" '
  '+ 0.011*"leave"'),
 (11,
  '0.024*"find" + 0.015*"push" + 0.015*"drug" + 0.014*"govt" + 0.010*"labor" + '
  '0.008*"state" + 0.008*"investigate" + 0.008*"threaten" + 0.008*"mp" + '
  '0.008*"world"'),
 (12,
  '0.028*"court" + 0.026*"interview" + 0.025*"kill" + 0.021*"death" + '
  '0.017*"die" + 0.015*"national" + 0.014*"hospital" + 0.010*"pay" + '
  '0.009*"announce" + 0.008*"rail"'),
 (13,
  '0.020*"help" + 0.017*"boost" + 0.016*"child" + 0.016*"hit" + 0.016*"group" '
  '+ 0.013*"case" + 0.011*"fund" + 0.011*"market" + 0.011*"appeal" + '
  '0.010*"local"'),
 (14,
  '0.036*"plan" + 0.021*"back" + 0.015*"service" + 0.012*"concern" + '
  '0.012*"move" + 0.011*"centre" + 0.010*"inquiry" + 0.010*"budget" + '
  '0.010*"law" + 0.009*"remain"')]
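To aid interpretation, each headline can also be assigned its dominant topic. A sketch, assuming the lda_model and corpus objects defined above:

# Pick the highest-probability topic for a single bag-of-words document
def dominant_topic(bow):
    return max(lda_model.get_document_topics(bow), key=lambda pair: pair[1])

topic_id, prob = dominant_topic(corpus[0])
print('Dominant topic: %d (p = %.2f)' % (topic_id, prob))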
Evaluation

1. Calculating the coherence score, a measure of how semantically similar the top words in each topic are (higher is better; the c_v measure used here falls between 0 and 1).

from gensim.models import CoherenceModel

# Compute Coherence Score
coherence_model_lda = CoherenceModel(model=lda_model, texts=data_lemmatized,
                                     dictionary=id2word, coherence='c_v')
coherence_lda = coherence_model_lda.get_coherence()
print('Coherence Score: ', coherence_lda)

Output
Coherence Score: 0.38355488160129025
2. Calculating the perplexity score, which measures how well the model's probability distribution predicts a held-out sample (a lower perplexity indicates a better model). Note that gensim's log_perplexity returns a per-word likelihood bound on a log scale, which is why the printed value below is negative.
perplexity = lda_model.log_perplexity(corpus)
print(perplexity)

Output
-10.416591518443418
We can see that the coherence score is fairly low, yet the model still surfaces relevant themes, and the score could surely be improved with hyperparameter tuning. The perplexity is also reasonably low, which is consistent with the roughly normal distribution of headline lengths seen in the exploratory data analysis section.
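For example, coherence can guide the choice of the number of topics. A sketch of a simple grid search (assuming the corpus, id2word, and data_lemmatized objects built earlier; refitting LDA several times is slow on 100,000 headlines):

from gensim.models import CoherenceModel

def coherence_for(num_topics):
    model = gensim.models.LdaMulticore(corpus=corpus, id2word=id2word,
                                       num_topics=num_topics,
                                       random_state=100, passes=10)
    cm = CoherenceModel(model=model, texts=data_lemmatized,
                        dictionary=id2word, coherence='c_v')
    return cm.get_coherence()

# Candidate topic counts are illustrative; keep the best-scoring one
scores = {k: coherence_for(k) for k in (5, 10, 15, 20)}
print(scores, 'best:', max(scores, key=scores.get))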
Conclusion

Topic modeling is an unsupervised learning technique for identifying themes in large sets of data. It is useful in various domains such as healthcare, finance, and marketing, where there is a huge amount of text-based data to analyze. In this project, we applied topic modeling to a dataset called "A Million Headlines", consisting of over one million news article headlines published by the ABC. The aim was to use the Latent Dirichlet Allocation (LDA) algorithm, a probabilistic generative model, to identify the main topics in the dataset.
The project plan involves several steps: exploratory data analysis to understand the data distribution, preprocessing the text by removing stop words, punctuation, etc., and applying techniques like tokenization, stemming, and lemmatization. The essence of the project revolves around topic modeling, leveraging LDA to identify the primary topics and themes within the news headlines. We analyze associated words and phrases to interpret the topics and assign human-readable labels to them. The evaluation of topic modeling algorithms encompasses metrics such as coherence score and perplexity, while also taking into account the limitations of the approach.
Key Takeaways
Topic Modeling is an effective way of finding broad themes from the data with Machine Learning (ML) without labels.
It has a wide range of applications from healthcare to recommender systems.
LDA is one effective way of implementing topic modeling.
Coherence score and perplexity are effective metrics for evaluating the performance of ML-based topic models.
Frequently Asked Questions

Q1. What is topic modeling in ML?
A. Topic modeling in ML refers to a technique that automatically extracts underlying themes or topics from a collection of text documents. It helps uncover latent patterns and structures, enabling tasks like document clustering, text summarization, and content recommendation in natural language processing (NLP) and machine learning.
Q2. What is topic modeling with examples?
A. Topic modeling, with an example, involves extracting topics from a set of news articles. The algorithm identifies topics such as “politics,” “sports,” and “technology” based on word co-occurrence patterns. This helps organize and categorize articles, making browsing and searching for specific topics of interest easier.
Q3. What is the best algorithm for topic modeling?
A. The best algorithm for topic modeling depends on the specific requirements and characteristics of the dataset. Popular algorithms include Latent Dirichlet Allocation (LDA), Non-negative Matrix Factorization (NMF), and Latent Semantic Analysis (LSA). Each algorithm has its strengths and weaknesses, so the choice should align with the task at hand.
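For comparison, here is a minimal, self-contained NMF sketch using scikit-learn (the toy documents and topic count are illustrative assumptions):

from sklearn.decomposition import NMF
from sklearn.feature_extraction.text import TfidfVectorizer

# Toy documents for illustration only
docs = ['stock market rises on bank profits',
        'team wins final after penalty shootout',
        'new bank rules hit market confidence',
        'injured striker misses cup final']

# TF-IDF features pair well with NMF on short texts
tfidf = TfidfVectorizer(stop_words='english')
X = tfidf.fit_transform(docs)

nmf = NMF(n_components=2, random_state=0).fit(X)

terms = tfidf.get_feature_names_out()
for idx, component in enumerate(nmf.components_):
    top = component.argsort()[-4:][::-1]  # indices of the 4 strongest terms
    print('Topic %d:' % idx, [terms[i] for i in top])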
Q4. Is topic modeling an NLP technique?
A. Yes, topic modeling is a technique commonly used in natural language processing (NLP). It leverages machine learning algorithms to identify and extract topics from text data, allowing for better understanding, organization, and analysis of textual information. It aids in various NLP tasks, including text classification, sentiment analysis, and information retrieval.
The media shown in this article is not owned by Analytics Vidhya and is used at the Author’s discretion.