The Enron corpus with an extensive annotation of organizational hierarchy.

http://www1.ccls.columbia.edu/~rambow/enron/

======================================================================================================================

CMU AI Repository (very comprehensive!)

http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/

Names Corpus

http://www.cs.cmu.edu/afs/cs/project/ai-repository/ai/util/areas/nlp/corpora/names/

======================================================================================================================

Email Datasets

http://www.cs.cmu.edu/~einat/datasets.html

1. Personal Name Annotation

Due to privacy issues, it is very hard to get hold of large and realistic email corpora. Here you can find
a couple of email datasets, as well as a dataset of news groups text - annotated with personal names spans.

The full description of these datasets, including relevant statistics and references, is available in:

Einat Minkov, Richard C. Wang & William W. Cohen, Extracting Personal Names from Emails:
Applying Named Entity Recognition to Informal Text, in HLT/EMNLP 2005 (PDF)

Some quick details:

The email corpora given here were extracted from the Enron corpus, made public by the Federal
Energy Regulatory Commission. A version of this data was later purchased by the CALO project,
and made available for research purposes.
The first dataset, 'Enron-Meetings', consists of all messages located in folders named "meetings"
or "calendar" (excluding a few very large files). Most of these messages are meeting related. The second
subset, 'Enron-Random', was formed by uniformly sampling a user name (out of 158 users) and then
randomly sampling an email from that user.
As a second type of informal text, we also annotated a collection of newsgroups postings. The
'Newsgroups' dataset was extracted from the 20Newsgroups corpus, by Vitor R. Carvalho.
These datasets are given here in a Minorthird format (plain text, with separate labels files), as well as
in a 'general' format, where the personal labels are embedded in the text using XML tags.
The given zipped files construct a directory tree. The separation into train and test folders corresponds
to the data splits described in the abovementioned paper. Further separation is for convenience purposes.
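
The two-stage sampling described above for 'Enron-Random' (pick a user uniformly, then pick one of that user's emails at random) can be sketched as follows. This is only an illustration, not the authors' script: the toy mailboxes, the function name `sample_enron_random`, and the choice to sample with replacement are all assumptions.

```python
import random
from collections import Counter


def sample_enron_random(mailboxes, n, seed=0):
    """Two-stage draw: choose a user uniformly, then one of that user's emails uniformly.

    Assumption: draws are made with replacement, so duplicates are possible.
    """
    rng = random.Random(seed)
    users = sorted(mailboxes)  # fixed order so the seed gives repeatable draws
    picks = []
    for _ in range(n):
        user = rng.choice(users)
        picks.append((user, rng.choice(mailboxes[user])))
    return picks


# Hypothetical toy data: one very large mailbox and one small one.
mailboxes = {
    "user_a": ["a_%d" % i for i in range(1000)],
    "user_b": ["b_%d" % i for i in range(10)],
}

picks = sample_enron_random(mailboxes, 2000)
per_user = Counter(user for user, _ in picks)
print(per_user)
```

Because the user is drawn first, each user contributes roughly the same number of messages regardless of mailbox size. Sampling messages uniformly from the pooled corpus would instead be dominated by the largest mailboxes.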

Download:

Enron-Meetings: Minorthird format, XML tags
Enron-Random: Minorthird format, XML tags
Newsgroups: Minorthird format, XML tags
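
The two download formats differ in where the labels live: embedded in the text (XML tags) or kept apart from it (Minorthird). The sketch below converts inline tags into plain text plus standoff character spans, to show how the two views relate. The tag name `true_name` and the sample sentence are assumptions made for illustration, so check the downloaded files for the actual tag. The exact syntax of the Minorthird labels files is not reproduced here.

```python
import re

TAG = "true_name"  # assumed tag name, not confirmed by the dataset description


def inline_to_standoff(tagged, tag=TAG):
    """Strip inline tags and return (plain_text, spans).

    Each span is (start, length, surface), with offsets into plain_text.
    """
    pattern = re.compile(r"<{0}>(.*?)</{0}>".format(re.escape(tag)), re.DOTALL)
    parts = []
    spans = []
    pos = 0      # read position in the tagged input
    out_len = 0  # length of the plain text emitted so far
    for match in pattern.finditer(tagged):
        before = tagged[pos:match.start()]
        parts.append(before)
        out_len += len(before)
        surface = match.group(1)
        spans.append((out_len, len(surface), surface))
        parts.append(surface)
        out_len += len(surface)
        pos = match.end()
    parts.append(tagged[pos:])
    return "".join(parts), spans


# Hypothetical example text.
sample = "Please call <true_name>Jane Doe</true_name> or <true_name>Bob</true_name> today."
plain, spans = inline_to_standoff(sample)
print(plain)
for start, length, surface in spans:
    print(start, length, surface)
```

Keeping offsets relative to the stripped text means the spans stay valid when the plain-text file is read on its own, which is the point of a standoff format.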

2. Person name disambiguation and threading

Here you can download Enron corpora and datasets, used for the general problems of entity disambiguation
and the extraction of inter-entity relations. Email here is represented as a relational database, which includes
text. Specifically, the tasks considered in these subsets of the Enron corpus are person name disambiguation
in email and intelligent message threading.

Two variations of the data are provided:

A. Raw email messages, and the corresponding datasets (queries and correct answers), as used in

Einat Minkov, William W. Cohen and Andrew Y. Ng,
"Contextual Search and Name Disambiguation in Email using Graphs", SIGIR 2006 (PDF)

Download: Person name disambiguation corpora, datasets
Threading corpora, datasets

B. Graph files (net relations and entity declarations), and the corresponding datasets, as used in

Einat Minkov and William W. Cohen,
"Learning to Rank Typed Graph Walks: Local and Global Approaches",
WebKDD and SNA-KDD joint workshop 2007 (PDF)

Download: Person name disambiguation corpora, datasets
Threading corpora, datasets

Note: the corpora files of (A) and (B) are different representations of the same data (where reply lines
have been removed in the latter). The datasets are mostly identical, with the exception that some examples
were moved from the training and test sets to a development set.

======================================================================================================================

Software and Datasets

Software: Jangada

Jangada is an API for signature block extraction and reply-to extraction from email messages. It follows the ideas of the paper "Learning to Extract Signature and Reply Lines from Email" (CEAS 2004), but performance was slightly improved by using a new set of features not mentioned in the original reference.

Some Features:

Extracts signature blocks and reply lines in email messages with very good accuracy.

Can be easily integrated in other Java applications (for instance, the entire email message as a String can be used as input).

Can be easily integrated in other Minorthird applications (using the TextLabels format, it accepts as input email messages with other annotations, such as dates, personal names, speech acts, etc.).

Licensing: University of Illinois/NCSA Open Source License

Documentation: Very poor. An initial javadocs page is here. There is some documentation on how to use Jangada in the example files below.

Requires: j2sdk1.4 or later. Uses MinorThird.jar.

Recommended: When using email files as input, results will be better if the messages are in MIME (.eml) format.

Usage example:

1. Create a new directory (for instance, jangadaDir).

2. Download jangada.jar, minorThird.jar, the example files, and the email files to jangadaDir.

3. Unzip (gunzip Demos.tar.gz) and untar (tar -xvf Demos.tar) the example files, as well as the email files.

4. Add jangadaDir, jangadaDir/minorThird.jar and jangadaDir/jangada.jar to the CLASSPATH.

5. For a quick demo, compile the example files, for instance: “javac Demo2.java” (in case of errors, please check your CLASSPATH again).

6. Run the examples on the email files directory: “java Demo2 emails/*”.

7. Check the documentation in the DemoX.java files and try your own application.

Reminder 1: If you’d like to have access to the source code, please send me an email.

Reminder 2: If you used this package, please cite the following reference:

· Learning to Extract Signature and Reply Lines from Email, Vitor R. Carvalho and William W. Cohen, CEAS-2004 (Conference on Email and Anti-Spam), Mountain View, CA, July 2004
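
To make the task concrete, here is a naive rule-based baseline for separating reply lines and a signature block from the rest of a message. It is not Jangada's method and does not use the Jangada API: Jangada learns its extractors from features, whereas this sketch relies on just two common conventions, quoted lines starting with ">" and a signature introduced by a "-- " delimiter line. The function name and the sample message are hypothetical.

```python
def split_reply_and_signature(message):
    """Naive baseline: split a message into new text, quoted reply lines, and signature.

    Assumptions: reply lines start with '>', and the signature (if any)
    follows the last line that consists only of '--' plus optional spaces.
    """
    lines = message.splitlines()

    # Find the last signature delimiter, scanning from the bottom.
    sig_start = None
    for i in range(len(lines) - 1, -1, -1):
        if lines[i].rstrip() == "--":
            sig_start = i
            break

    body = lines if sig_start is None else lines[:sig_start]
    signature = [] if sig_start is None else lines[sig_start + 1:]

    reply = [ln for ln in body if ln.lstrip().startswith(">")]
    new = [ln for ln in body if not ln.lstrip().startswith(">")]
    return {"new": new, "reply": reply, "signature": signature}


# Hypothetical sample message.
message = (
    "Hi Bob,\n"
    "> Are you free at 3pm?\n"
    "Yes, see you then.\n"
    "-- \n"
    "Alice Smith\n"
    "Example Corp"
)
result = split_reply_and_signature(message)
print(result)
```

Real email often lacks the delimiter line or quotes replies in other ways (for example "On ... wrote:" headers), which is the case a learned extractor is meant to handle.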

Software: Ciranda

A Java application that predicts the Email-Acts (or email speech acts) of email messages. It follows the ideas of the following papers (emnlp04 and sigir05), but performance was significantly improved by careful feature selection and additional features.

Some Features:

Predicts the following acts: Request, Commit, Deliver, Propose, Meet, dData.

Provides the confidence in each prediction.

Easy way to use these acts as features in your application.

Licensing: No guarantees are provided. Lots of bugs for sure. Use at your own risk!

Documentation: Very poor. An initial javadocs page is here. Please check Example.java on how to use it.

Requires: j2sdk1.4 or later. Uses MinorThird.jar (see below).

Questions: I’ll be happy to help, especially if you tell me what a good Ciranda is :-)

Usage example:

1. Create a new directory called ciranda, and ciranda/lib.

2. Download ciranda.jar and minorThird.jar to ciranda/lib.

3. Add ciranda/ and lib/ciranda.jar to the CLASSPATH.

4. Download the example file Example.java to ciranda/.

5. Compile it: “javac Example.java” (in case of errors, please check your CLASSPATH again).

6. Run the example: “java Example”.

7. Or run the main application on a directory with emails in text format (without headers):

8. Create the test directory ciranda/testdir.

9. Add some emails in text format (such as msg1, msg2, msg3) to ciranda/testdir.

10. Run “java -jar lib/ciranda.jar testdir”.

11. Or try your own application.

Reminder: Send me an email if you'd like the source code. If you use this package, please use the following reference:

· Learning to Classify Email into “Speech Acts”, William W. Cohen, Vitor R. Carvalho and Tom M. Mitchell, EMNLP-2004 (Conference on Empirical Methods in Natural Language Processing), Barcelona, Spain, July 2004

Dataset: Signature and Reply Dataset [Datasets in Minorthird Format]

These 617 email messages are annotated with signature lines and reply-to lines. The messages are a subset of the 20 Newsgroups dataset (produced by Ken Lang at CMU in the mid-90's).

======================================================================================================================

The Enron dataset seems to be popular: email usually comes with privacy restrictions, and the Enron set has none. The Enron messages date from 2001 and earlier.

The Enron datasets at CMU:
http://www.cs.cmu.edu/~einat/dat...

List of the Enron data in other places, and variations:
http://infochimps.com/search?que...

Here is a source for chat postings, which should be similar to email.
However, it comes from the Naval Postgraduate School in Monterey, CA, so
it may not be as "normal". It dates from 2006.
http://faculty.nps.edu/cmartell/...

That's the best info I could find immediately.
There are some academic resources here:
http://www.clres.com/corparchive...

There are a number of these datasets listed here:
http://infochimps.com/tags/text?...

======================================================================================================================

Spam email datasets

http://www.csmining.org/index.php/spam-email-datasets-.html

======================================================================================================================

ACL SIGLEX Links to the CORPORA Mailing List Archive

http://www.clres.com/corparchive.html

Selected messages to the CORPORA mailing list have been categorized and links to the threads have been provided. The categorization is based on a SIGLEX ontology. The links have been generated automatically based on subject, the date, and the sender. The links include only the years 1997 to the present. Before 2000, the CORPORA archive is not threaded. This system is based on the webmaster's categorization and ontology, both of which can easily be modified, for which your suggestions are solicited. Messages can be put into multiple categories. New categories can easily be created. Existing categories can easily be renamed and reorganized. Many messages have been categorized but do not appear here; we are working to improve the automatic linking.

SIGLEX Lexical Resources

Corpus Linguistics
- Definitional Issues
- Representativeness
- History
- Legal Issues
- Course Design
Corpora
- Linguistically Annotated Corpora
  - English Corpora
    - British National Corpus
    - Brown Corpus
- Multilingual Corpora
- Written Language Corpora
  - Sublanguage Corpora
    - Learner Corpora
- Spoken Language Corpora
- Language Specific
Lexicons
- Thesauri WordNets
- New Sense Discovery
- Language Phenomena
Text Tokenisation
- Stop Lists
- Text Format Conversions
- Tokenizers
- Markup
- Sentence Splitting
- Spellchecking
Concordancing
- Collocations
- Lexical Cohesion
Tagging
- POS-Tagging
Mathematical Methods
- Mutual Information
- Perplexity
- Maximum Entropy
- Chi-Square
- N-Gram Analysis
- Frequency Analysis
- Significance Tests
- Semantic Similarity
Grammars
Software
- Taggers

Last modified August 12, 2004

Maintained by Ken Litkowski (webmaster@siglex.org)

======================================================================================================================

http://www.cs.cmu.edu/~vitor/papers/sigFilePaper_finalversion.pdf
