The Enron corpus with an extensive annotation of organizational hierarchy.
http://www1.ccls.columbia.edu/~rambow/enron/

======================================================================================================================

CMU AI Repository (very comprehensive!)
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/
Names Corpus
http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/nlp/corpora/names/
======================================================================================================================

Email Datasets

http://www.cs.cmu.edu/~einat/datasets.html


1. Personal Name Annotation

Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find
a couple of email datasets, as well as a dataset of newsgroup text, annotated with personal name spans.

The full description of these datasets, including relevant statistics and references, is available in:

Einat Minkov, Richard C. Wang and William W. Cohen,
"Extracting Personal Names from Emails: Applying Named Entity Recognition to Informal Text",
HLT/EMNLP 2005

Some quick details:

  • The email corpora given here were extracted from the Enron corpus, made public by the Federal
    Energy Regulatory Commission. A version of this data was later purchased by the CALO project,
    and made available for research purposes.
  • The first dataset, 'Enron-Meetings', consists of all messages located in folders named "meetings"
    or "calendar" (excluding a few very large files). Most of these messages are meeting related. The second
    subset, 'Enron-Random', was formed by uniformly sampling a user name (out of 158 users) and then
    randomly sampling an email from that user.
  • As a second type of informal text, we also annotated a collection of newsgroups postings. The
    'Newsgroups' dataset was extracted from the 20Newsgroups corpus, by Vitor R. Carvalho.
  • These datasets are given here in a Minorthird format (plain text, with separate labels files), as well as
    in a 'general' format, where the personal labels are embedded in the text using XML tags.
  • The given zipped files construct a directory tree. The separation into train and test folders corresponds
    to the data splits described in the abovementioned paper. Further separation is for convenience purposes.
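
The two-stage sampling described above for 'Enron-Random' can be sketched as follows. This is a minimal illustration, not the authors' script: the function name, the toy mailboxes, and the choice to sample with replacement are all assumptions, since the page does not say how duplicates were handled.

```python
import random

def sample_enron_random(mailboxes, n_messages, seed=0):
    """Pick a user uniformly at random, then pick one of that user's emails.

    `mailboxes` maps a user name to that user's list of messages.
    Sampling is with replacement (an assumption, not stated in the source).
    """
    rng = random.Random(seed)
    # Users with an empty mailbox cannot contribute a message.
    users = sorted(user for user, msgs in mailboxes.items() if msgs)
    sample = []
    for _ in range(n_messages):
        user = rng.choice(users)               # stage 1: uniform over users
        message = rng.choice(mailboxes[user])  # stage 2: uniform within that user
        sample.append((user, message))
    return sample

# Toy data (invented): one heavy user and one light user.
toy = {"alice": [f"a{i}" for i in range(1000)], "bob": ["b0", "b1"]}
picked = sample_enron_random(toy, n_messages=2000, seed=1)
bob_share = sum(1 for user, _ in picked if user == "bob") / len(picked)
print(f"share of sampled messages from bob: {bob_share:.2f}")
```

In the toy data "bob" owns about 0.2% of all messages but supplies roughly half of the sample, which is the point of sampling the user first: heavy mailboxes do not dominate the subset.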

Download:

  • Enron-Meetings: Minorthird format, XML tags
  • Enron-Random: Minorthird format, XML tags
  • Newsgroups: Minorthird format, XML tags
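
For the 'general' format, where name labels are embedded in the text as XML tags, a reader might turn the embedded tags into character offsets over the plain text, roughly as below. The tag name `PERSON` and the example sentence are hypothetical; check the downloaded files for the actual tag name before using anything like this.

```python
import re

TAG = "PERSON"  # hypothetical tag name; the real files may use a different one
PATTERN = re.compile(rf"<{TAG}>(.*?)</{TAG}>", re.DOTALL)

def extract_name_spans(tagged_text):
    """Return (plain_text, spans); spans are (start, end, name) offsets into plain_text."""
    plain_parts = []
    spans = []
    cursor = 0  # read position in tagged_text
    offset = 0  # length of the plain text built so far
    for match in PATTERN.finditer(tagged_text):
        before = tagged_text[cursor:match.start()]
        plain_parts.append(before)
        offset += len(before)
        name = match.group(1)
        spans.append((offset, offset + len(name), name))
        plain_parts.append(name)
        offset += len(name)
        cursor = match.end()
    plain_parts.append(tagged_text[cursor:])
    return "".join(plain_parts), spans

# Invented example sentence, not taken from the corpus.
example = "Please call <PERSON>Ann Lee</PERSON> or <PERSON>Bob</PERSON> today."
text, spans = extract_name_spans(example)
for start, end, name in spans:
    assert text[start:end] == name  # each offset pair points at its name
    print(start, end, name)
```

Keeping the labels as offsets next to untouched text is the same stand-off idea as the plain-text-plus-labels layout mentioned for the Minorthird format, although the exact syntax of those label files is not shown here.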

2. Person name disambiguation and threading

Here you can download Enron corpora and datasets, used for the general problems of entity disambiguation
and the extraction of inter-entity relations. Email here is represented as a relational database, which includes
text. Specifically, the tasks considered in these subsets of the Enron corpus are person name disambiguation
in email and intelligent message threading.

Two variations of the data are provided:

A. Raw email messages, and the corresponding datasets (queries and correct answers), as used in

  • Einat Minkov, William W. Cohen and Andrew Y. Ng,
    "Contextual Search and Name Disambiguation in Email using Graphs", SIGIR 2006

    Download: Person name disambiguation corpora, datasets
    Threading corpora, datasets

B. Graph files (net relations and entity declarations), and the corresponding datasets, as used in

  • Einat Minkov and William W. Cohen,
    "Learning to Rank Typed Graph Walks: Local and Global Approaches",
    WebKDD and SNA-KDD joint workshop 2007

    Download: Person name disambiguation corpora, datasets
    Threading corpora, datasets

Note: the corpora files of (A) and (B) are different representations of the same data (reply lines
have been removed in the latter). The datasets are mostly identical, except that some examples
were moved from the training and test sets to a development set.

======================================================================================================================

Software and Datasets


Software: Jangada

Jangada is an API for signature block extraction and reply-to extraction from email messages. It follows the ideas of the paper "Learning to Extract Signature and Reply Lines from Email" (CEAS 2004), but performance was slightly improved by using a new set of features not mentioned in the original reference.

Some Features:

Extracts signature blocks and reply lines in email messages with very good accuracy.

Can be easily integrated into other Java applications (for instance, the entire email message can be passed in as a String).

Can be easily integrated into other Minorthird applications (using the TextLabels format, it accepts as input email messages with other annotations, such as dates, personal names, speech acts, etc.).

Licensing: University of Illinois/NCSA Open Source License

Documentation: Very poor. An initial javadocs page is available. There is some documentation on how to use Jangada in the example files below.

Requires: j2sdk1.4 or later. Uses MinorThird.jar.

Recommended: When using email files as input, results will be better if the messages are in MIME (.eml) format.

Usage example:

1. Create a new directory (for instance, jangadaDir).

2. Download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir.

3. Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, as well as the email files.

4. Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH.

5. For a quick demo, compile the example files. For instance: "javac Demo2.java" (in case of errors, please check your CLASSPATH again).

6. Run the examples on the email files directory: "java Demo2 emails/*".

7. Check the documentation in the DemoX.java files and try your own application.

Reminder 1: If you'd like to have access to the source code, please send me an email.

Reminder 2: If you use this package, please cite the following reference:

· Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004


Software: Ciranda

A Java application that predicts the Email-Acts (or email speech acts) of email messages. It follows the ideas of two papers (EMNLP 2004 and SIGIR 2005), but performance was significantly improved by careful feature selection and additional features.

Some Features:

Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.

Provides the confidence in each prediction.

Easy way to use these acts as features in your application.

Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!

Documentation: Very poor. An initial javadocs page is available. Please check Example.java on how to use it.

Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below).

Questions: I'll be happy to help, especially if you tell me what a good Ciranda is :-)

Usage example:

1. Create a new directory called ciranda, and ciranda/lib.

2. Download ciranda.jar and minorThird.jar to ciranda/lib.

3. Add ciranda/ and lib/ciranda.jar to the CLASSPATH.

4. Download the example file Example.java to ciranda/.

5. Compile it: "javac Example.java" (in case of errors, please check your CLASSPATH again).

6. Run the example: "java Example".

7. Or run the main application on a directory with emails in text format (without headers):

8. Create the test directory ciranda/testdir.

9. Add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir.

10. Run "java -jar lib/ciranda.jar testdir".

11. Or try your own application.
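
Since the main application expects emails in text format without headers, raw messages need their headers stripped first. One possible way to prepare such a directory is sketched below using Python's standard `email` package. The function names and the `raw_mail` directory mentioned afterwards are hypothetical, and this is not part of Ciranda itself.

```python
from email import message_from_string
from pathlib import Path

def body_only(raw_message):
    """Return the plain-text body of an RFC 822 style message, without headers."""
    msg = message_from_string(raw_message)
    if msg.is_multipart():
        # Keep only text/plain parts; attachments and HTML parts are skipped.
        parts = [p for p in msg.walk() if p.get_content_type() == "text/plain"]
        return "\n".join(str(p.get_payload()) for p in parts)
    return str(msg.get_payload())

def write_bodies(raw_dir, out_dir):
    """Write one header-free text file per message file found in raw_dir."""
    out = Path(out_dir)
    out.mkdir(parents=True, exist_ok=True)
    count = 0
    for path in sorted(Path(raw_dir).iterdir()):
        if path.is_file():
            raw = path.read_text(encoding="utf-8", errors="replace")
            (out / path.name).write_text(body_only(raw), encoding="utf-8")
            count += 1
    return count

# Invented message used only to show the header stripping.
raw = (
    "From: a@example.com\n"
    "To: b@example.com\n"
    "Subject: Meeting\n"
    "\n"
    "Can you send me the report by Friday?\n"
)
print(body_only(raw))
```

Calling `write_bodies("raw_mail", "ciranda/testdir")` would then fill the test directory from step 8. Note that `get_payload()` is used without decoding, so messages with base64 or quoted-printable bodies would need extra handling.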

Reminder: Send me an email if you'd like the source code. If you use this package, please cite the following reference:

· Learning to Classify Email into "Speech Acts", William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004


Dataset: Signature and Reply Dataset [Datasets in Minorthird Format]

These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-1990s).



======================================================================================================================

The Enron dataset seems to be popular: email often has privacy restrictions, and the Enron set has none. The Enron material dates from 2001 and earlier.

The Enron datasets at CMU:
http://www.cs.cmu.edu/~einat/dat...

List of the Enron data in other places, and variations:
http://infochimps.com/search?que...

Here is a source for chat postings, which should be similar to email.
However, it is from the Naval Postgraduate School in Monterey, CA so
it may not be as "normal", but it is 2006
http://faculty.nps.edu/cmartell/...

That's the best info I could find immediately.
There are some academic resources here:
http://www.clres.com/corparchive...

There are a number of these datasets listed here:
http://infochimps.com/tags/text?...

======================================================================================================================

Spam email datasets

http://www.csmining.org/index.php/spam-email-datasets-.html

======================================================================================================================

ACL SIGLEX Links to the CORPORA Mailing List Archive

http://www.clres.com/corparchive.html

Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded. This system is based on the webmaster's categorization and ontology, both of which can easily be modified, for which your suggestions are solicited. Messages can be put into multiple categories. New categories can easily be created. Existing categories can easily be renamed and reorganized. Many messages have been categorized but do not appear here; we are working to improve the automatic linking.

SIGLEX Lexical Resources

  • Corpus Linguistics

    • Definitional Issues
    • Representativeness
    • History
    • Legal Issues
    • Course Design

    Corpora

    • Linguistically Annotated Corpora

      • English Corpora

        • British National Corpus
        • Brown Corpus
    • Multilingual Corpora
    • Written Language Corpora
      • Sublanguage Corpora

        • Learner Corpora
    • Spoken Language Corpora
    • Language Specific
  • Lexicons
    • Thesauri WordNets
    • New Sense Discovery
    • Language Phenomena
  • Text Tokenisation
    • Stop Lists
    • Text Format Conversions
    • Tokenizers
    • Markup
    • Sentence Splitting
    • Spellchecking
  • Concordancing
    • Collocations
    • Lexical Cohesion
  • Tagging
    • POS-Tagging
  • Mathematical Methods
    • Mutual Information
    • Perplexity
    • Maximum Entropy
    • Chi-Square
    • N-Gram Analysis
    • Frequency Analysis
    • Significance Tests
    • Semantic Similarity
  • Grammars
  • Software
    • Taggers

Last modified August 12, 2004

Maintained by Ken Litkowski (webmaster@siglex.org)

======================================================================================================================

http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf

