[Tutorial] Relative Word Frequency in Python: Learn a Practical Data Analysis Technique and Find a Text's Core Vocabulary

Published 2025-11-24 09:30:33

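In text analysis, relative word frequency (each word's share of the total word count) is a useful indicator of how often each word appears and how prominent it is in the text. Python offers several ways to compute it. This article walks through a relative word frequency analysis in Python and uses it to identify the core vocabulary of a text.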

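Preparation

Before starting, we need the following tools and libraries:

Python environment
Jupyter Notebook (optional, for demonstration)
collections module (part of the Python standard library): counts word frequencies
nltk library: tokenization and stop word removal (optional in general, but the preprocessing function in Step 1 depends on it; install it with pip install nltk)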

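Step 1: Text preprocessing

Before computing relative word frequencies, the text usually needs preprocessing: removing punctuation, converting to lowercase, and removing stop words. Here is a simple preprocessing function: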

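import re
from nltk.corpus import stopwords
from nltk.tokenize import word_tokenize

# The first run needs the NLTK data: nltk.download('punkt') and nltk.download('stopwords')
# (recent NLTK releases may ask for 'punkt_tab' instead of 'punkt')

def preprocess_text(text):
    # Replace punctuation with a space so that hyphenated words such as
    # "high-level" split into "high" and "level" instead of fusing into "highlevel"
    text = re.sub(r'[^\w\s]', ' ', text)
    # Convert to lowercase
    text = text.lower()
    # Tokenize
    tokens = word_tokenize(text)
    # Remove stop words
    stop_words = set(stopwords.words('english'))
    filtered_tokens = [word for word in tokens if word not in stop_words]
    return filtered_tokens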
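
For readers who want to follow along without installing NLTK, the same three preprocessing steps can be approximated with the standard library alone. This is only a minimal sketch: the function name preprocess_text_simple and the tiny BASIC_STOP_WORDS set are hypothetical and are not part of the original tutorial, and the regular expression handles only ASCII letters and digits.

```python
import re

# Hypothetical, deliberately tiny stop word list for illustration only;
# a real analysis would use a fuller list such as NLTK's.
BASIC_STOP_WORDS = {"a", "an", "and", "is", "its", "of", "the", "to", "with"}

def preprocess_text_simple(text, stop_words=BASIC_STOP_WORDS):
    """Lowercase the text, extract word tokens, and drop stop words."""
    # [a-z0-9]+ matches runs of ASCII letters and digits, so punctuation
    # and hyphens act as separators ("high-level" gives "high", "level").
    tokens = re.findall(r"[a-z0-9]+", text.lower())
    return [token for token in tokens if token not in stop_words]

sample_tokens = preprocess_text_simple("Python is a high-level language.")
print(sample_tokens)  # ['python', 'high', 'level', 'language']
```

The trade-off is accuracy: a regular expression does not handle contractions or non-English text the way a dedicated tokenizer does, so this variant is best suited to quick experiments.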

Step 2: Counting word frequencies

The collections.Counter class makes it easy to count word frequencies. Here is a function that does so:

from collections import Counter

def calculate_frequency(tokens):
    # Counter maps each distinct token to the number of times it occurs
    return Counter(tokens)
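
To make the behaviour of Counter concrete, here is a small illustration on a hand-written token list. The sample_tokens list is made up for the example; the point is only to show that a Counter behaves like a dictionary of counts, returns 0 for unseen words, and can rank entries with most_common.

```python
from collections import Counter

# Hypothetical token list, as it might come out of a preprocessing step
sample_tokens = ["python", "code", "python", "language", "code", "python"]
sample_counter = Counter(sample_tokens)

print(sample_counter["python"])       # 3
print(sample_counter["missing"])      # 0 (unseen words do not raise KeyError)
print(sample_counter.most_common(2))  # [('python', 3), ('code', 2)]
```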

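Step 3: Computing relative word frequencies

To get each word's share of the text, divide its count by the total number of words. Here is a function that computes these shares: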

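def calculate_frequency_percentage(counter):
    total_words = sum(counter.values())
    if total_words == 0:
        # Empty input: avoid dividing by zero
        return {}
    # Each value is a fraction between 0 and 1, not a number out of 100
    return {word: count / total_words for word, count in counter.items()}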
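
As a quick sanity check of the arithmetic, the division can be worked through on a small hand-built Counter. The counts below are invented for illustration; with 6 words in total, a word that appears 3 times has a share of 3 / 6 = 0.5, and all shares together should add up to 1.

```python
from collections import Counter

# Hypothetical counts: 3 + 2 + 1 = 6 words in total
sample_counter = Counter({"python": 3, "code": 2, "language": 1})

total = sum(sample_counter.values())
shares = {word: count / total for word, count in sample_counter.items()}

print(total)                 # 6
print(shares["python"])      # 0.5
print(f"{shares['code']:.2%}")  # 33.33%
```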

Step 4: Displaying the results

Finally, we can print the computed shares to see which words dominate the text. Here is an example:

def display_top_words(counter, top_n=10):
    frequency_percentage = calculate_frequency_percentage(counter)
    # Sort by share, highest first, and keep the top_n entries
    top_words = sorted(frequency_percentage.items(), key=lambda x: x[1], reverse=True)[:top_n]
    for word, percentage in top_words:
        # .2% formats the fraction as a percentage with two decimal places
        print(f"{word}: {percentage:.2%}")

# Sample text
text = "Python is a high-level, interpreted, general-purpose programming language. Its design philosophy emphasizes code readability with its notable use of significant whitespace."
# Preprocess the text
tokens = preprocess_text(text)
# Count word frequencies
counter = calculate_frequency(tokens)
# Show the 10 most frequent words
display_top_words(counter)
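
If the ranked shares are needed for further processing rather than just printing, one possible variation is to return them as a list. The sketch below is an alternative, not the tutorial's own function: top_word_shares is a hypothetical name, and it relies on Counter.most_common for the ranking instead of sorted.

```python
from collections import Counter

def top_word_shares(counter, top_n=10):
    """Return the top_n (word, share) pairs, highest share first."""
    total = sum(counter.values())
    if total == 0:
        # Nothing was counted, so there is nothing to rank
        return []
    return [(word, count / total) for word, count in counter.most_common(top_n)]

# Hypothetical sample: 8 tokens, "the" appears 3 times and "and" twice
sample_counter = Counter("the cat and the hat and the bat".split())
top_two = top_word_shares(sample_counter, top_n=2)
print(top_two)  # [('the', 0.375), ('and', 0.25)]
```

Returning the pairs keeps the calculation separate from the presentation, so the same result can be printed, written to a file, or plotted.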

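Summary

With the steps above, we can compute relative word frequencies in Python and identify the core vocabulary of a text. This helps us understand the text better and can serve as a starting point for other text analysis tasks such as sentiment analysis and topic modeling.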
