
Natural Language Processing: Example Applications

爬虫俱乐部 Stata and Python数据分析 2023-10-24

Author: 胡思航, School of Statistics and Mathematics, Zhongnan University of Economics and Law

Editor: 胡艺粼

Technical editor-in-chief: 孙一博


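This article introduces several practical applications of natural language processing: text classification, reading comprehension, cloze (fill-in-the-blank), text generation, text summarization, and translation.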


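Natural Language Processing (NLP) is an important branch of computer science and artificial intelligence. It studies the theories and methods that enable effective communication between people and computers in natural language, and it draws on linguistics, computer science, and mathematics. Its goal is to extract information from text data and to let computers process or "understand" natural language so that they can perform tasks such as machine translation and text classification. NLP is one of the hardest problems in artificial intelligence.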






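Text classification (Text Classification or Text Categorization, TC), also called Automatic Text Categorization, is the process by which a computer maps a document to one or more predefined categories or topics. It is one of the core tasks of natural language processing. Typical applications include sentiment analysis, news topic classification, and email filtering. The code below uses sentiment analysis as an example: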
from transformers import pipeline # import pipeline from the transformers library
classifier = pipeline("sentiment-analysis") # create a sentiment-analysis pipeline
result = classifier("I hate you")[0] # classify "I hate you"
print(result) # print the result
result = classifier("I love you")[0] # classify "I love you"
print(result) # print the result
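  • Output:

  • Explanation:

"I hate you" expresses a negative sentiment, so the model gives the higher score to NEGATIVE and labels the sentence NEGATIVE. "I love you" expresses a positive sentiment, so the higher score goes to POSITIVE and the sentence is labeled POSITIVE.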
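
The step from "higher score" to "predicted label" can be sketched in plain Python. This is a minimal illustration, not the pipeline's actual implementation: it assumes a two-class model that outputs one raw score (logit) per class, converts the scores to probabilities with softmax, and picks the largest. The function names and the logit values are hypothetical, chosen by hand for illustration.

```python
import math

def softmax(logits):
    """Convert raw scores (logits) into probabilities that sum to 1."""
    peak = max(logits)
    exps = [math.exp(x - peak) for x in logits]  # subtract the max for numerical stability
    total = sum(exps)
    return [e / total for e in exps]

def classify(logits, labels=("NEGATIVE", "POSITIVE")):
    """Return the most probable label in a pipeline-like dict."""
    probs = softmax(logits)
    best = max(range(len(probs)), key=probs.__getitem__)  # index of the largest probability
    return {"label": labels[best], "score": probs[best]}

# Hypothetical logits, made up for illustration (not real model output).
print(classify([4.0, -3.5]))   # the NEGATIVE score is higher
print(classify([-4.2, 4.6]))   # the POSITIVE score is higher
```

The `score` in the returned dict is the probability of the winning class, which is why a confident prediction has a score close to 1.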





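Machine Reading Comprehension (MRC) is a technique that uses algorithms to make a computer understand the meaning of a passage and answer questions about it. Because both the passage and the questions are written in human language, MRC falls within Natural Language Processing (NLP). The code below gives an example: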
from transformers import pipeline # import pipeline from the transformers library
question_answerer = pipeline("question-answering") # create a question-answering pipeline
context = r"""Extractive Question Answering is the task of extracting an answer from a text given a question. An example of a question answering dataset is the SQuAD dataset, which is entirely based on that task. If you would like to fine-tune a model on a SQuAD task, you may leverage the examples/pytorch/question-answering/run_squad.py script."""
result = question_answerer(question="What is extractive question answering?", context=context) # ask the first question
print(result) # print the result
result = question_answerer(question="What is a good example of a question answering dataset?", context=context) # ask the second question
print(result) # print the result
  • Output:

  • Explanation:

For the first question, "What is extractive question answering?", the model answers: 'the task of extracting an answer from a text given a question. ' For the second question, "What is a good example of a question answering dataset?", the model answers: 'SQuAD dataset'
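
"Extractive" means the answer is a span copied out of the context, which is why the question-answering pipeline's result dict carries `start` and `end` character offsets alongside `answer` and `score`. The sketch below shows that relationship without running a model. The helper `make_span_result` is hypothetical, and it locates the span with `str.find`, whereas a real model predicts the offsets.

```python
context = (
    "Extractive Question Answering is the task of extracting an answer from a text "
    "given a question. An example of a question answering dataset is the SQuAD dataset, "
    "which is entirely based on that task."
)

def make_span_result(context, answer):
    """Build a pipeline-like dict for an answer that occurs verbatim in the context."""
    start = context.find(answer)  # character offset where the answer begins
    if start == -1:
        raise ValueError("an extractive answer must be a substring of the context")
    end = start + len(answer)     # offset one past the last character
    return {"answer": context[start:end], "start": start, "end": end}

result = make_span_result(context, "SQuAD dataset")
print(result)

# Slicing the context with the offsets recovers the answer exactly.
assert context[result["start"]:result["end"]] == result["answer"]
```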





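In the cloze (fill-mask) task, the model fills in the blank at the <mask> position; each score is the probability of filling in that word.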

from transformers import pipeline # import pipeline from the transformers library
unmasker = pipeline("fill-mask") # create a fill-mask pipeline
sentence = 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.' # sentence with one masked word
unmasker(sentence) # returns the candidate words with their scores
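  • Output:

  • Explanation:

We define the sentence 'HuggingFace is creating a <mask> that the community uses to solve NLP tasks.' The model fills in the <mask> position, and each score is the probability of filling in that word.

Taking the highest-scoring candidate as an example, the model produces: 'HuggingFace is creating a tool that the community uses to solve NLP tasks.'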
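
Choosing the highest-scoring candidate can be sketched as follows. The fill-mask pipeline returns a list of dicts that include `token_str` and `score` keys. The candidate words and scores below are hypothetical values made up for illustration, not real model output, and `fill_best` is an illustrative helper rather than part of the library.

```python
sentence = "HuggingFace is creating a <mask> that the community uses to solve NLP tasks."

# Hypothetical candidates and scores, invented for illustration only.
candidates = [
    {"token_str": " framework", "score": 0.10},
    {"token_str": " tool", "score": 0.18},
    {"token_str": " library", "score": 0.05},
]

def fill_best(sentence, candidates, mask="<mask>"):
    """Replace the mask with the highest-scoring candidate word."""
    best = max(candidates, key=lambda c: c["score"])
    # Some tokenizers attach a leading space to the token, so strip it.
    return sentence.replace(mask, best["token_str"].strip(), 1)

print(fill_best(sentence, candidates))
```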





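Given a passage or a single sentence, the model continues it by generating the text that follows; the length of the output is controlled by max_length.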

from transformers import pipeline # import pipeline from the transformers library
text_generator = pipeline("text-generation") # create a text-generation pipeline
text_generator("As far as I am concerned, I will", max_length=50, do_sample=False) # generate with a maximum length of 50 and no sampling
  • Output:

  • Explanation:

We give the model the opening 'As far as I am concerned, I will', and it automatically generates: 'As far as I am concerned, I will be the first to admit that I am not a fan of the idea of a "free market." I think that the idea of a free market is a bit of a stretch. I think that the idea'
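
With `do_sample=False` and no beam search, generation is greedy: at each step the most probable next token is chosen, so the same prompt always yields the same continuation. The toy sketch below illustrates the idea with whole words instead of tokens. The `NEXT_WORD` table and its probabilities are invented for illustration and do not come from any model. As with `max_length` in the pipeline call, the limit here counts the prompt as part of the length.

```python
# Toy next-word probabilities, invented for illustration (not from a real model).
NEXT_WORD = {
    "I": {"will": 0.6, "am": 0.4},
    "will": {"be": 0.7, "go": 0.3},
    "be": {"there": 0.5, "late": 0.3, "<eos>": 0.2},
    "there": {"<eos>": 0.9, "soon": 0.1},
}

def greedy_generate(prompt, max_length):
    """Extend the prompt one word at a time, always taking the most probable word."""
    words = prompt.split()
    while len(words) < max_length:
        options = NEXT_WORD.get(words[-1])
        if not options:
            break                               # no known continuation
        best = max(options, key=options.get)    # greedy choice: the arg-max
        if best == "<eos>":
            break                               # end-of-sequence marker
        words.append(best)
    return " ".join(words)

print(greedy_generate("I will", max_length=50))  # stops at the end-of-sequence marker
print(greedy_generate("I will", max_length=3))   # stops at the length limit
```

Sampling (`do_sample=True`) would instead draw the next word at random according to these probabilities, which produces more varied text.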





Given a passage of text, the summarizer pipeline returns a summary of it.
from transformers import pipeline # import pipeline from the transformers library
summarizer = pipeline("summarization") # create a summarization pipeline
ARTICLE = """ New York (CNN)When Liana Barrientos was 23 years old, she got married in Westchester County, New York. A year later, she got married again in Westchester County, but to a different man and without divorcing her first husband. Only 18 days after that marriage, she got hitched yet again. Then, Barrientos declared "I do" five more times, sometimes only within two weeks of each other. In 2010, she married once more, this time in the Bronx. In an application for a marriage license, she stated it was her "first and only" marriage. Barrientos, now 39, is facing two criminal counts of "offering a false instrument for filing in the first degree," referring to her false statements on the 2010 marriage license application, according to court documents. Prosecutors said the marriages were part of an immigration scam. On Friday, she pleaded not guilty at State Supreme Court in the Bronx, according to her attorney, Christopher Wright, who declined to comment further. After leaving court, Barrientos was arrested and charged with theft of service and criminal trespass for allegedly sneaking into the New York subway through an emergency exit, said Detective Annette Markowski, a police spokeswoman. In total, Barrientos has been married 10 times, with nine of her marriages occurring between 1999 and 2002. All occurred either in Westchester County, Long Island, New Jersey or the Bronx. She is believed to still be married to four men, and at one time, she was married to eight men at once, prosecutors say. Prosecutors said the immigration scam involved some of her husbands, who filed for permanent residence status shortly after the marriages. Any divorces happened only after such filings were approved. It was unclear whether any of the men will be prosecuted. The case was referred to the Bronx District Attorney\'s Office by Immigration and Customs Enforcement and the Department of Homeland Security\'s Investigation Division. 
Seven of the men are from so-called "red-flagged" countries, including Egypt, Turkey, Georgia, Pakistan and Mali. Her eighth husband, Rashid Rajput, was deported in 2006 to his native Pakistan after an investigation by the Joint Terrorism Task Force. If convicted, Barrientos faces up to four years in prison. Her next court appearance is scheduled for May 18."""
summarizer(ARTICLE, max_length=130, min_length=30, do_sample=False) # summarize with a maximum length of 130 and a minimum length of 30
  • Output:

  • Explanation:

The summary of ARTICLE is: ' Liana Barrientos, 39, is charged with two counts of "offering a false instrument for filing in the first degree" In total, she has been married 10 times, with nine of her marriages occurring between 1999 and 2002 . At one time, she was married to eight men at once, prosecutors say .'
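
The summarization pipeline returns a list with one dict per input, and the text sits under the `summary_text` key. The raw string often carries a leading space and a space before each period, as in the summary quoted above. The sketch below is a small, hypothetical post-processing helper, not part of transformers. It uses the quoted summary as a hard-coded stand-in for the pipeline's return value.

```python
# Stand-in for the pipeline's return value, using the summary quoted above.
outputs = [{
    "summary_text": ' Liana Barrientos, 39, is charged with two counts of "offering a false '
                    'instrument for filing in the first degree" In total, she has been married '
                    '10 times, with nine of her marriages occurring between 1999 and 2002 . '
                    'At one time, she was married to eight men at once, prosecutors say .'
}]

def tidy_summary(outputs):
    """Take the first summary, trim the ends, and drop the space before periods."""
    text = outputs[0]["summary_text"]
    return text.strip().replace(" .", ".")

print(tidy_summary(outputs))
```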


Translation converts the input text into another specified language; in this example we translate English into German.
from transformers import pipeline # import pipeline from the transformers library
translator = pipeline("translation_en_to_de") # create an English-to-German translation pipeline
sentence = "Hugging Face is a technology company based in New York and Paris" # sentence to translate
translator(sentence, max_length=40) # translate with a maximum length of 40
  • Output:
  • Explanation:
Using the translation_en_to_de pipeline, "Hugging Face is a technology company based in New York and Paris" is translated into German as: 'Hugging Face ist ein Technologieunternehmen mit Sitz in New York und Paris.'
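Beyond the examples shown here, Python has many more applications in natural language processing, and the field has advanced especially quickly since the BERT model was introduced. We welcome you to learn and exchange ideas with us!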
