2021-11-21 20:13:32
Design a program that reads the full text of Romance of the Three Kingdoms from the file "三国演义.txt", merges the different names used for the same character, lists the 10-20 most frequent words, and generates a word cloud (which may take different shapes).
import jieba  # third-party Chinese word segmentation library
import wordcloud
from matplotlib import pyplot

mk = pyplot.imread('caochao.jpg')  # image whose shape is used as the cloud mask
with open('三国演义.txt', 'r', encoding='utf-8') as f:
    txt = f.read()

# Frequent words that are not character names
excludes = {"将军", "却说", "荆州", "二人", "不可", "不能", "如此", "商议", "如何", "主公",
            "军士", "左右", "军马", "引兵", "次日", "大喜", "天下", "东吴", "于是", "今日",
            "不敢", "魏兵", "陛下", "一人", "都督", "人马", "不知", "汉中", "只见", "众将",
            "后主", "蜀兵", "上马", "大叫", "太守", "此人", "夫人", "先主", "后人", "背后",
            "城中", "天子", "一面", "何不", "大军", "忽报", "先生", "百姓", "何故", "然后",
            "先锋", "不如", "赶来", "原来", "令人", "江东", "下马", "喊声", "正是", "徐州",
            "忽然", "因此", "成都", "不见", "未知", "大败", "大事", "之后", "一军", "引军",
            "起兵", "军中", "接应", "进兵", "大惊", "可以", "以为", "大怒", "不得", "心中",
            "下文", "一声", "追赶", "粮草", "曹兵", "一齐", "分解", "回报", "分付", "只得",
            "出马", "三千", "大将", "许都", "随后", "报知", "前面", "之兵", "且说", "众官",
            "洛阳", "领兵", "何人", "星夜", "精兵", "城上", "之计", "不肯", "相见", "其言",
            "一日", "而行", "文武", "襄阳", "准备", "若何", "出战", "亲自", "必有", "此事",
            "军师", "之中", "伏兵", "祁山", "乘势", "忽见", "大笑", "樊城", "兄弟", "首级",
            "立于", "西川", "朝廷", "三军", "大王", "传令", "当先", "五百", "一彪", "坚守",
            "此时", "之间", "投降", "五千", "埋伏", "长安", "三路", "遣使", "英雄", "回见",
            "大将军", "是夜", "小路", "望见", "无不", "有人", "马下", "必然", "将士", "甘宁",
            "下寨", "杀出", "诸葛", "中原", "屯兵", "邓艾", "蛮兵", "之意", "城下", "前来",
            "武士", "城外", "出迎", "本部", "两路", "一阵", "连夜", "四面", "奔走", "交锋",
            "冀州", "细作", "使者", "江南", "杀来", "人报", "而出", "心腹", "何处", "皇叔",
            "众人", "当日", "吴兵", "兴兵", "何以", "如之奈何", "先帝", "江夏", "前进", "国家",
            "城门", "杀入", "两军", "来到", "厮杀", "两个", "拜谢", "岂可", "慌忙", "饮酒",
            "为首", "性命", "进发", "谋士", "此言"}

# Precise mode: splits the text exactly, with no redundant words, and returns a list
words = jieba.lcut(txt)
# Dictionary mapping each word to its number of occurrences
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"
    else:
        rword = word
    # Add 1 to the existing count, starting from 0 if the key is new
    counts[rword] = counts.get(rword, 0) + 1

# Remove the excluded words; pop() with a default avoids a KeyError
# when an excluded word never occurs in the text
for word in excludes:
    counts.pop(word, None)

# Convert to a list and sort by the second element of each pair (the count),
# from largest to smallest, so the first element is the most frequent word
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)

# Print the 40 most frequent words and their counts
name = []
times = []
for i in range(min(40, len(items))):
    word, count = items[i]
    print("{0:<10}{1:>5}".format(word, count))
    name.append(word)
    times.append(count)

# Word cloud
w = wordcloud.WordCloud(
    font_path='songti.TTF',    # font (must support Chinese)
    background_color="white",  # background color
    max_words=1000,            # maximum number of words in the cloud
    max_font_size=100,         # maximum font size
    random_state=50,           # random seed, so layout and colors are reproducible
    mask=mk,
)
# Use the counts so that word size reflects frequency
w.generate_from_frequencies(dict(zip(name, times)))
w.to_file("ciyun.png")
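As a side note on the "list the most frequent words" step: the manual sort can also be written with `collections.Counter`. The sketch below is only an illustration; it uses a small hand-made token list in place of real jieba output, and the helper name `top_words` is hypothetical.

```python
from collections import Counter

def top_words(tokens, n):
    """Return the n most frequent multi-character tokens as (word, count) pairs."""
    counts = Counter(t for t in tokens if len(t) > 1)
    return counts.most_common(n)

# Hand-made tokens standing in for the output of jieba.lcut (an assumption).
sample_tokens = ["曹操", "孔明", "曹操", "曰", "刘备", "孔明", "曹操"]
for word, count in top_words(sample_tokens, 2):
    print("{0:<10}{1:>5}".format(word, count))
```

`most_common(n)` already returns the pairs sorted from largest to smallest count, so no separate `sort` call is needed.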
-
Python: building a word cloud from Romance of the Three Kingdoms
2021-05-17 20:02:39
Task:
Design a program that reads the full text of Romance of the Three Kingdoms from threekingdoms.txt, merges the different names of common characters, generates a word cloud, and lists the 5 most frequent words.
Example: '玄德', '刘备', '玄德曰', '刘皇叔', and '皇叔' all refer to the same person.
A dictionary can hold the names that need to be merged.
dupDict={'曹操' : ['孟德','丞相'],
         '玄德' : ['刘备','皇叔','刘皇叔','玄德曰'],
         '云长' : ['关羽','关云长','关公'],
         '孔明' : ['诸葛亮','诸葛','孔明曰'],
         '张飞' : ['翼德'],
         '赵云' : ['子龙','赵子龙'],
         '周瑜' : ['公瑾','都督']}
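One possible way to put such a dictionary to work is to invert it once, so that every alias points at its canonical name, and then look each token up. This is a minimal sketch, not the author's code: it uses only part of the dictionary, the token list is hand-made rather than produced by jieba, and the names `alias_to_name` and `merge_names` are hypothetical.

```python
from collections import Counter

# Canonical name -> aliases, following part of the dupDict from the task.
dup_dict = {
    '曹操': ['孟德', '丞相'],
    '玄德': ['刘备', '皇叔', '刘皇叔', '玄德曰'],
    '云长': ['关羽', '关云长', '关公'],
}

# Invert it once so each alias maps to its canonical name.
alias_to_name = {alias: name
                 for name, aliases in dup_dict.items()
                 for alias in aliases}

def merge_names(tokens):
    """Map aliases to canonical names and drop single-character tokens."""
    return [alias_to_name.get(t, t) for t in tokens if len(t) > 1]

# Hand-made tokens standing in for jieba output (an assumption).
sample_tokens = ['刘备', '玄德', '曰', '皇叔', '丞相', '关公', '孟德']
merged_counts = Counter(merge_names(sample_tokens))
print(merged_counts.most_common(3))
```

Compared with a chain of `elif` branches, adding a new alias only requires editing the dictionary.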
First, install jieba, wordcloud, and imageio (which provides imread).
Code:
import jieba
from wordcloud import WordCloud
from imageio import imread

# Read the file
filename = 'threekingdoms.txt'
with open(filename, encoding='utf-8') as f:
    mytext = f.read()
# Segment the text with jieba
words = jieba.lcut(mytext)
# Drop unimportant words
removes = ['将军', '二人', '却说', '次日', '主公', '不能', '不可', '罗贯中', '上卷', '长江', '滚滚', '逝水']
words = [word for word in words if word not in removes]
# Merge the different names of the same character
ls = []
for i in words:
    if len(i) == 1:
        continue
    elif i in ['孟德', '丞相']:
        ls.append('曹操')
    elif i in ['诸葛亮', '诸葛', '孔明曰']:
        ls.append('孔明')
    elif i in ['玄德', '刘备', '皇叔', '刘皇叔', '玄德曰']:  # '玄德' added so it is merged too
        ls.append('刘备')
    elif i in ['关羽', '关云长', '关公']:
        ls.append('云长')
    elif i in ['公瑾', '都督']:
        ls.append('周瑜')
    elif i in ['子龙', '赵子龙']:
        ls.append('赵云')
    elif i in ['翼德']:
        ls.append('张飞')
    else:
        ls.append(i)
words = ls
# Word frequency count with a dictionary
word_count = {}
for word in words:
    if len(word) > 1:
        word_count[word] = word_count.get(word, 0) + 1
print(sorted(word_count.items(), key=lambda kv: kv[1], reverse=True)[:5])
# Mask image and font (raw strings keep the backslashes in the paths literal)
mask = imread(r"PeiJian\Zhangfei.png")
w = WordCloud(font_path=r"PeiJian\STXINWEI.TTF", background_color='white',
              width=1000, height=500, max_words=2000, mask=mask)
# The tokens must be joined with a separator into one string, otherwise the cloud cannot be drawn
w.generate(" ".join(words))
# Save the resulting image
w.to_file(r'PeiJian\1.png')
-
Counting the ten most frequent characters in Romance of the Three Kingdoms with Python, with visualization
2020-12-29 15:09:55
1. Install the required third-party libraries
jieba (a Chinese word segmentation library)
pyecharts (a data visualization library)
Installing libraries with PyCharm
Open PyCharm and choose Settings under [File].
On the page that appears, click the [+] on the right, search for the library you want at the top of the next page, and install it.
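Outside PyCharm, the same two packages can be installed from a terminal with `pip install jieba pyecharts`. A quick way to confirm that they are importable is sketched below; the helper name `missing_packages` is hypothetical.

```python
import importlib.util

def missing_packages(names):
    """Return the package names from `names` that cannot be imported."""
    return [n for n in names if importlib.util.find_spec(n) is None]

required = ["jieba", "pyecharts"]
missing = missing_packages(required)
if missing:
    print("Not installed:", ", ".join(missing))
else:
    print("All required packages are available.")
```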
2. Write the code
import jieba  # word segmentation library

print("Top ten characters by number of appearances:")
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{}:{}".format(word, count))  # print the top ten
Running it prints the ten most frequent words.
Many of them are not character names, so they have to be removed. The modified code:
import jieba  # word segmentation library

print("Top ten characters by number of appearances:")
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
for i in range(10):
    word, count = items[i]
    print("{}:{}".format(word, count))  # print the top ten
Running it again, the list now contains only character names.
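A caveat when removing excluded words from the counts: `del counts[word]` raises `KeyError` for a word that is not in the dictionary, whereas `dict.pop(word, None)` does not. The sketch below shows two tolerant alternatives; the counts are made up purely for illustration.

```python
# Made-up counts for illustration only.
counts = {"曹操": 5, "将军": 4, "孔明": 3}
# "却说" is deliberately absent from counts.
remove = {"将军", "却说"}

# Option 1: pop with a default silently ignores missing keys.
cleaned = dict(counts)
for word in remove:
    cleaned.pop(word, None)

# Option 2: build a filtered copy in a single expression.
filtered = {w: c for w, c in counts.items() if w not in remove}

print(cleaned)
```

Both options leave the original `counts` untouched, which makes it easy to try different exclusion lists.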
Export the data with the following code:
import jieba  # word segmentation library

print("Top ten characters by number of appearances:")
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Export the data ("w" overwrites the file, so repeated runs do not append duplicate lines)
with open("三国人物出场次数.txt", "w", encoding='utf-8') as fo:
    for i in range(10):
        word, count = items[i]
        fo.write("{}:{}\n".format(word, count))  # name and count separated by a colon, one pair per line
Now run it and check whether the data was exported.
A file named 三国人物出场次数.txt has been generated, and it contains the data printed earlier.
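The export format (one `name:count` pair per line) can be checked with a small round trip. The sketch below writes to a temporary directory so it does not touch real files; the sample numbers are made up, and the helper names `save_counts` and `load_counts` are hypothetical.

```python
import os
import tempfile

def save_counts(path, items):
    """Write (word, count) pairs as 'word:count' lines, overwriting the file."""
    with open(path, "w", encoding="utf-8") as fo:
        for word, count in items:
            fo.write("{}:{}\n".format(word, count))

def load_counts(path):
    """Read 'word:count' lines back into a dict with integer counts."""
    result = {}
    with open(path, "r", encoding="utf-8") as fr:
        for line in fr:
            line = line.strip()
            if not line:
                continue
            word, count = line.split(":")
            result[word] = int(count)
    return result

sample_items = [("曹操", 30), ("孔明", 20)]  # made-up numbers
with tempfile.TemporaryDirectory() as tmp:
    path = os.path.join(tmp, "counts.txt")
    save_counts(path, sample_items)
    save_counts(path, sample_items)  # a second run overwrites instead of appending
    loaded = load_counts(path)
print(loaded)
```

Opening the file in `"w"` mode rather than `"a"` is a deliberate choice here: in append mode every run would add another ten lines to the file.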
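3. Data visualization
Visualization needs data first, so convert the exported data into a dictionary. The code:
# Convert the data in the text file into a dictionary
dic = {}
keys = []  # records the order in which names were read
with open('三国人物出场次数.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        v = line.strip().split(':')
        dic[v[0]] = int(v[1])  # store the count as a number rather than a string
        keys.append(v[0])
print(dic)
Running it prints the dictionary.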
Plotting with pyecharts
First import the modules:
from pyecharts import options as opts
from pyecharts.charts import Bar
The code:
# Plot
list1 = list(dic.keys())
list2 = list(dic.values())  # take the dictionary's data as the plotting data
c = (
    Bar()
    .add_xaxis(list1)
    .add_yaxis("人物出场次数", list2)
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
    )
    .render("人物出场次数可视化图.html")
)
After running the program, a file named 人物出场次数可视化图.html appears in the directory.
Open it in a browser to see the data presented as a chart.
4. Full code
# Python code for counting character appearances in Romance of the Three Kingdoms:
import jieba  # word segmentation library
from pyecharts import options as opts
from pyecharts.charts import Bar

print("Top ten characters by number of appearances:")
with open('三国演义.txt', 'r', encoding='gb18030') as f:
    txt = f.read()
# Words to exclude, collected by running the program several times
remove = {"将军", "却说", "不能", "后主", "上马", "不知", "天子", "大叫", "众将", "不可",
          "主公", "蜀兵", "只见", "如何", "商议", "都督", "一人", "汉中", "人马",
          "陛下", "魏兵", "天下", "今日", "左右", "东吴", "于是", "荆州", "如此",
          "大喜", "引兵", "次日", "军士", "军马", "二人", "不敢"}
words = jieba.lcut(txt)
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == "诸葛亮" or word == "孔明曰":
        rword = "孔明"
    elif word == "关公" or word == "云长":
        rword = "关羽"
    elif word == "玄德" or word == "玄德曰":
        rword = "刘备"
    elif word == "孟德" or word == "丞相":
        rword = "曹操"  # merge different names for the same person
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
for word in remove:
    counts.pop(word, None)  # delete excluded words; no error if a word is absent
items = list(counts.items())
items.sort(key=lambda x: x[1], reverse=True)
# Export the data ("w" overwrites the file, so repeated runs do not append duplicate lines)
with open("三国人物出场次数.txt", "w", encoding='utf-8') as fo:
    for i in range(10):
        word, count = items[i]
        fo.write("{}:{}\n".format(word, count))  # name and count separated by a colon, one pair per line
# Convert the data in the text file into a dictionary
dic = {}
keys = []  # records the order in which names were read
with open('三国人物出场次数.txt', 'r', encoding='utf-8') as fr:
    for line in fr:
        v = line.strip().split(':')
        dic[v[0]] = int(v[1])  # store the count as a number rather than a string
        keys.append(v[0])
print(dic)
# Plot
list1 = list(dic.keys())
list2 = list(dic.values())  # take the dictionary's data as the plotting data
c = (
    Bar()
    .add_xaxis(list1)
    .add_yaxis("人物出场次数", list2)
    .set_global_opts(
        xaxis_opts=opts.AxisOpts(axislabel_opts=opts.LabelOpts(rotate=-15)),
    )
    .render("人物出场次数可视化图.html")
)
Source: https://www.cnblogs.com/ke-wu-a/p/14026658.html
-
Romance of the Three Kingdoms character word frequency statistics (4)
2018-09-08 11:30:46
Source of the exercise: Python语言程序设计 (Python Programming)
Instructors: 嵩天, 黄天羽, 礼欣
Download link for the novel text: https://python123.io/resources/pye/threekingdoms.txt
Romance of the Three Kingdoms character word frequency statistics (3): https://blog.csdn.net/Mzjuser/article/details/82527464
Romance of the Three Kingdoms character word frequency statistics (2): https://blog.csdn.net/Mzjuser/article/details/82527412
Romance of the Three Kingdoms character word frequency statistics (1): https://blog.csdn.net/Mzjuser/article/details/82527289
Code
import jieba

path = 'C:\\Users\\Desktop\\三国演义.txt'
with open(path, 'r', encoding='utf-8') as f:
    text = f.read()
# Segment the text with jieba
words = jieba.lcut(text)
# Words that are not character names and need to be excluded
excludes = ['将军', '却说', '二人', '不可', '荆州', '不能', '如此', '商议', '如何',
            '军士', '左右', '天下', '次日', '大喜', '引兵', '军马', '东吴', '于是',
            '今日', '不敢', '魏兵', '陛下', '一人', '人马', '汉中', '不知', '只见',
            '众将', '蜀兵', '上马', '大叫']
# Dictionary storing each word and its number of occurrences
counts = {}
for word in words:
    if len(word) == 1:
        continue
    elif word == '诸葛亮' or word == '孔明曰':
        rword = '孔明'
    elif word == '玄德' or word == '玄德曰' or word == '主公':
        rword = '刘备'
    elif word == '孟德' or word == '丞相':
        rword = '曹操'
    elif word == '关公' or word == '云长':
        rword = '关羽'
    elif word == '都督':
        rword = '周瑜'
    elif word == '后主':
        rword = '刘禅'
    elif word == '太守':
        rword = '刘度'
    else:
        rword = word
    counts[rword] = counts.get(rword, 0) + 1
# Remove the words that are not character names; pop() avoids a KeyError for absent words
for word in excludes:
    counts.pop(word, None)
items = list(counts.items())
# Sort by the second value of each item (the count), from largest to smallest
items.sort(key=lambda x: x[1], reverse=True)
for i in range(15):
    word, count = items[i]
    # Word left-aligned in a 10-character field, count right-aligned in 5, padded with spaces
    print("{0:<10}{1:>5}".format(word, count))
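Result
Other approaches
Alternatively, count the appearances of each character directly by name (this requires detailed knowledge of the characters in the novel) and then sort the counts.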
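This name-list approach could look roughly like the sketch below. It is only an illustration under stated assumptions: the name list is a tiny hand-picked sample, the text is a made-up sentence rather than the novel, and the helper name `count_known_names` is hypothetical.

```python
# Known characters and the names they go by (a tiny hand-picked sample;
# a real list would need many more entries).
known_names = {
    "曹操": ["曹操", "孟德"],
    "刘备": ["刘备", "玄德"],
    "关羽": ["关羽", "云长", "关公"],
}

def count_known_names(text, names):
    """Count substring occurrences of each character's names in text."""
    totals = {}
    for person, aliases in names.items():
        totals[person] = sum(text.count(alias) for alias in aliases)
    # Sort from most to least frequent.
    return sorted(totals.items(), key=lambda kv: kv[1], reverse=True)

# Made-up sample sentence, not a quotation from the novel.
sample_text = "玄德曰:云长何在?曹操大笑。玄德与关公同往,刘备不语。"
for person, total in count_known_names(sample_text, known_names):
    print("{0:<10}{1:>5}".format(person, total))
```

Because this counts raw substrings, no segmentation or exclusion list is needed, but aliases that contain one another (for example '关云长' and '云长') would be counted twice, so the alias lists have to be chosen with care.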