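The WikiText dataset is drawn from Wikipedia (Wiki) entries. Only the content of verified, high-quality articles is included, for a total of more than 100 million tokens (a token is one unit obtained when sentences are split into words).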

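Compared with the preprocessed version of Penn Treebank (PTB), WikiText-2 is more than 2 times larger and WikiText-103 is more than 110 times larger. The WikiText datasets also have a larger vocabulary and keep the original case, punctuation, and numbers, all of which are removed in PTB. Because the dataset is made up of complete articles, it is well suited to models that can exploit long-term dependencies.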

WikiText-2 is about 4.3 MB and WikiText-103 is about 181 MB.

Example

The dataset is stored as txt files, roughly in the following format:

= Homarus gammarus = Homarus gammarus , known as the European lobster or common lobster , is a species of <unk> lobster from the eastern Atlantic Ocean , Mediterranean Sea and parts of the Black Sea . It is closely related to the American lobster , H. americanus . It may grow to a length of 60 cm ( 24 in ) and a mass of 6 kilograms ( 13 lb ) , and bears a conspicuous pair of claws . In life , the lobsters are blue , only becoming " lobster red " on cooking . Mating occurs in the summer , producing eggs which are carried by the females for up to a year before hatching into <unk> larvae . Homarus gammarus is a highly esteemed food , and is widely caught using lobster pots , mostly around the British Isles . = = Description = = Homarus gammarus is a large <unk> , with a body length up to 60 centimetres ( 24 in ) and weighing up to 5 – 6 kilograms ( 11 – 13 lb ) , although the lobsters caught in lobster pots are usually 23 – 38 cm ( 9 – 15 in ) long and weigh 0 @.@ 7 – 2 @.@ 2 kg ( 1 @.@ 5 – 4 @.@ 9 lb ) . Like other crustaceans , lobsters have a hard <unk> which they must shed in order to grow , in a process called <unk> ( <unk> ) . This may occur several times a year for young lobsters , but decreases to once every 1 – 2 years for larger animals . The first pair of <unk> is armed with a large , asymmetrical pair of claws . The larger one is the " <unk> " , and has rounded <unk> used for crushing prey ; the other is the " cutter " , which has sharp inner edges , and is used for holding or tearing the prey . Usually , the left claw is the <unk> , and the right is the cutter . The <unk> is generally blue above , with spots that <unk> , and yellow below . The red colour associated with lobsters only appears after cooking . This occurs because , in life , the red pigment <unk> is bound to a protein complex , but the complex is broken up by the heat of cooking , releasing the red pigment . The closest relative of H. gammarus is the American lobster , Homarus americanus . 
The two species are very similar , and can be crossed artificially , although hybrids are unlikely to occur in the wild since their ranges do not overlap . The two species can be distinguished by a number of characteristics : The <unk> of H. americanus bears one or more spines on the underside , which are lacking in H. gammarus . The spines on the claws of H. americanus are red or red @-@ tipped , while those of H. gammarus are white or white @-@ tipped . The underside of the claw of H. americanus is orange or red , while that of H. gammarus is creamy white or very pale red . = = Life cycle = = Female H. gammarus reach sexual maturity when they have grown to a carapace length of 80 – 85 millimetres ( 3 @.@ 1 – 3 @.@ 3 in ) , whereas males mature at a slightly smaller size . Mating typically occurs in summer between a recently <unk> female , whose shell is therefore soft , and a hard @-@ shelled male . The female carries the eggs for up to 12 months , depending on the temperature , attached to her <unk> . Females carrying eggs are said to be " <unk> " and can be found throughout the year . The eggs hatch at night , and the larvae swim to the water surface where they drift with the ocean currents , preying on <unk> . This stage involves three <unk> and lasts for 15 – 35 days . After the third moult , the juvenile takes on a form closer to the adult , and adopts a <unk> lifestyle . The juveniles are rarely seen in the wild , and are poorly known , although they are known to be capable of digging extensive burrows . It is estimated that only 1 larva in every 20 @,@ 000 survives to the <unk> phase . When they reach a carapace length of 15 mm ( 0 @.@ 59 in ) , the juveniles leave their burrows and start their adult lives . .....

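The example above is the entry = Homarus gammarus = (a species of lobster). That marker is also the level-1 heading of the entry; the later = = Description = = and = = Life cycle = = are level-2 headings.
By the same pattern, = = = xxx = = = is a level-3 heading under the entry.

Note that everything under one level-1 heading is called an article. An article can contain several level-2 and level-3 headings together with their content.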
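The heading convention described above can be sketched in Python. This is a minimal illustration, not part of the dataset tooling: the names `HEADING_RE`, `heading_level`, and `split_articles` are hypothetical, and the code assumes each heading sits on its own line, with the "=" marks separated by single spaces as in the sample.

```python
import re

# Matches a heading line such as " = Homarus gammarus = " or " = = Description = = ".
# Group 1 is the run of "=" marks on the left, group 3 the run on the right.
HEADING_RE = re.compile(r"^\s*((?:= )+)(.+?)((?: =)+)\s*$")


def heading_level(line):
    """Return 1, 2, 3, ... for a heading line and 0 for ordinary text."""
    match = HEADING_RE.match(line)
    if match is None:
        return 0
    left = match.group(1).count("=")
    right = match.group(3).count("=")
    # Assumption: a heading has the same number of "=" marks on both sides.
    return left if left == right else 0


def split_articles(lines):
    """Group lines into articles; every level-1 heading starts a new article."""
    articles = []
    for line in lines:
        if heading_level(line) == 1:
            articles.append([line])
        elif articles and line.strip():
            # Non-empty lines (text or lower-level headings) join the current article.
            articles[-1].append(line)
    return articles
```

Calling `split_articles` on the lines of a split file returns one list of lines per article, with the level-1 heading as the first element of each list.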

Special tokens

Note the special markers, common in NLP, that appear frequently in the text:

  • <unk>

Stands for a low-frequency word that falls outside the vocabulary.

  • @

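Marks a joining character. For example, if an entry contains one-apple, the dataset text records it as one @-@ apple. The sample above shows the same convention inside numbers: 0 @.@ 7 and 20 @,@ 000.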

  • <eos>

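The marker appended at the end of a paragraph. A paragraph is usually held in one string and then split into a list, and <eos> becomes the last element of that list.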
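As a small sketch of how these markers are handled in practice, the Python below splits one line into tokens, appends <eos>, and reverses the @-@, @.@ and @,@ joiners. The function names `tokenize_line` and `undo_joiners` are made up for this illustration, and whitespace splitting is an assumption based on the sample text, where every token is already separated by spaces.

```python
import re

EOS = "<eos>"


def tokenize_line(line):
    """Split one line of the dataset on whitespace and append the end marker."""
    return line.split() + [EOS]


def undo_joiners(text):
    """Turn ' @-@ ', ' @.@ ' and ' @,@ ' back into '-', '.' and ','."""
    return re.sub(r" @([-.,])@ ", r"\1", text)
```

For example, `undo_joiners("hard @-@ shelled male")` gives back "hard-shelled male", which is useful when the text has to be shown to a reader rather than fed to a model.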

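Processing

Processing starts by building a corpus. A dictionary is usually constructed to index every word in the vocabulary.
The WikiText-2 corpus contains 33,278 distinct words and the WikiText-103 corpus contains 267,735 distinct words. The text is made up of these words, and the dictionary has the following form:

{ word A: 0, word B: 1, ..., word X: N }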
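A word-to-index dictionary of this shape can be sketched as follows. The class name `Dictionary` and the helper `encode` are illustrative choices, not an official API, and assigning indices in order of first appearance is an assumption; any consistent assignment would serve.

```python
class Dictionary:
    """Maps every distinct word to an integer index and back."""

    def __init__(self):
        self.word2idx = {}  # word -> index
        self.idx2word = []  # index -> word

    def add_word(self, word):
        """Register a word if it is new and return its index."""
        if word not in self.word2idx:
            self.word2idx[word] = len(self.idx2word)
            self.idx2word.append(word)
        return self.word2idx[word]

    def __len__(self):
        return len(self.idx2word)


def encode(lines, dictionary):
    """Convert lines of text into one flat list of word indices."""
    ids = []
    for line in lines:
        # Each line contributes its words plus a closing <eos> marker.
        for word in line.split() + ["<eos>"]:
            ids.append(dictionary.add_word(word))
    return ids
```

After encoding, `len(dictionary)` is the vocabulary size and the length of the returned list is the token count of the text that was passed in.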

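In the officially published statistics, note the Tokens figure: it is the total number of words obtained by splitting the entire dataset text, counted in order of appearance with repeats included, unlike the vocabulary size, which counts distinct words.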

Counted by line instead (that is, separated by <eos>), the Train, Valid, and Test splits of WikiText-2 contain 36,718, 3,760, and 4,358 lines respectively.
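The difference between the line count, the token count, and the vocabulary size can be made concrete with a short sketch. The function name `corpus_stats` is hypothetical, and counting one <eos> per line as a token is an assumption of this example; the official figures may follow a different convention.

```python
def corpus_stats(lines):
    """Return (number of lines, total tokens, distinct words) for a list of lines."""
    tokens = []
    for line in lines:
        # Assumption: every line ends with one <eos> token.
        tokens.extend(line.split() + ["<eos>"])
    return len(lines), len(tokens), len(set(tokens))
```

Running it on the lines of one split gives three numbers that can be compared against the published statistics for that split.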

Download and references

  • https://blog.salesforceairesearch.com/the-wikitext-long-term-dependency-language-modeling-dataset/
  • https://paperswithcode.com/dataset/wikitext-2
