NLP datasets: Fudan
2012-06-03 01:13:37 robinliu2010 Views: 5779

http://www.oschina.net/p/fudannlp

 

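FudanNLP is a toolkit developed mainly for Chinese natural language processing. It also includes the machine learning algorithms and datasets used to implement these tasks.

Demo: http://jkx.fudan.edu.cn/nlp/query

FudanNLP currently implements the following: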

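  1. Chinese processing tools
    1. Chinese word segmentation
    2. Part-of-speech tagging
    3. Named entity recognition
    4. Syntactic parsing
    5. Temporal expression recognition
  2. Information retrieval
    1. Text classification
    2. News clustering
    3. Chinese word segmentation for Lucene
  3. Machine learning
    1. Average Perceptron
    2. Passive-aggressive Algorithm
    3. K-means
    4. Exact Inference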
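
To give a feel for the first algorithm in the machine learning list, here is a minimal sketch of a binary averaged perceptron over sparse bag-of-words features. It is not FudanNLP's implementation (that is written in Java and handles structured outputs). The function names, the feature dictionaries and the toy data are all assumptions made for illustration.

```python
from collections import defaultdict


def train_averaged_perceptron(data, epochs=10):
    """Train on (feature_dict, label) pairs with labels in {-1, +1}.

    Returns the average of the weight vector over all training steps,
    which is usually more stable than the final weights alone.
    """
    weights = defaultdict(float)  # current weights
    totals = defaultdict(float)   # running sum of the weights after each step
    steps = 0
    for _ in range(epochs):
        for feats, label in data:
            score = sum(weights[f] * v for f, v in feats.items())
            if label * score <= 0:  # misclassified (or on the boundary): update
                for f, v in feats.items():
                    weights[f] += label * v
            for f, w in weights.items():
                totals[f] += w
            steps += 1
    return {f: total / steps for f, total in totals.items()}


def predict(avg_weights, feats):
    """Return +1 or -1 according to the sign of the averaged score."""
    score = sum(avg_weights.get(f, 0.0) * v for f, v in feats.items())
    return 1 if score > 0 else -1


# Toy data (hypothetical): +1 = spam, -1 = not spam.
toy_data = [
    ({"free": 1, "money": 1}, 1),
    ({"meeting": 1, "tomorrow": 1}, -1),
    ({"free": 1, "prize": 1}, 1),
    ({"project": 1, "meeting": 1}, -1),
]
model = train_averaged_perceptron(toy_data)
print(predict(model, {"free": 1}), predict(model, {"meeting": 1}))
```

Summing the full weight vector after every example keeps the sketch short. Practical implementations usually track timestamps per feature so that averaging does not cost a full pass over the weights at each step.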

 

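2019-05-08 11:45:22 Yonggie Views: 28

FudanNLP: https://github.com/FudanNLP/fnlp

For installation and building, see its QuickStart: https://github.com/FudanNLP/fnlp/wiki/quicktutorial#12-%E5%91%BD%E4%BB%A4%25E

Errors encountered:

1. If the command  mvn install -Dmaven.test.skip=true  fails with the error  Unknown lifecycle phase ".test.skip=true".  then separate the D from maven, that is, run  mvn install -D maven.test.skip=true  instead. (This error usually means the shell, typically Windows PowerShell, split the argument at the dot. Quoting the whole argument as "-Dmaven.test.skip=true" should also work.)

2. When verifying from the command line, you type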

java -Xmx1024m -Dfile.encoding=UTF-8 -classpath "fnlp-core/target/fnlp-core-2.1-SNAPSHOT.jar;libs/trove-3.1a1.jar;libs/commons-cli-1.4.jar" org.fnlp.nlp.cn.tag.CWSTagger -s models/seg.m "自然语言是人类交流和思维的主要工具,是人类智慧的结晶。"

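and it reports an error (shown as a screenshot in the original post).

In that case, delete -Dfile.encoding=UTF-8 and go straight to the -classpath argument. (The command in my screenshot is also missing the closing " at the end, but that is not what causes the error.)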
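
Both errors above look like the shell splitting a -D argument before Java or Maven sees it. One way to sidestep shell parsing is to launch the command from a script that passes each argument as a separate list element. The sketch below only builds that argument list; the helper name build_cws_command and its parameters are hypothetical, and the jar and model paths are copied from the command above and assumed to exist relative to the working directory.

```python
import os


def build_cws_command(text, jar_dir="fnlp-core/target", lib_dir="libs",
                      model="models/seg.m", set_encoding=True):
    """Return the java invocation as a list with one token per element."""
    jars = [
        f"{jar_dir}/fnlp-core-2.1-SNAPSHOT.jar",
        f"{lib_dir}/trove-3.1a1.jar",
        f"{lib_dir}/commons-cli-1.4.jar",
    ]
    cmd = ["java", "-Xmx1024m"]
    if set_encoding:
        # Stays a single token, so no shell can split it at the dot.
        cmd.append("-Dfile.encoding=UTF-8")
    # The classpath separator is ";" on Windows and ":" on Linux/macOS.
    cmd += ["-classpath", os.pathsep.join(jars),
            "org.fnlp.nlp.cn.tag.CWSTagger", "-s", model, text]
    return cmd


if __name__ == "__main__":
    sentence = "自然语言是人类交流和思维的主要工具,是人类智慧的结晶。"
    print(build_cws_command(sentence))
```

Passing the returned list to subprocess.run(cmd, check=True) would start the tagger once the jars and the model file are in place. Note that the ";" separator in the original command is Windows-specific, which is why the sketch uses os.pathsep.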

 

 

 

2017-12-28 10:30:46 chuchus Views: 1883

Yelp reviews

Yelp is comparable to Dianping (大众点评) in China. For an introduction to the dataset, see reference [1].
Figure: A review on the Yelp website. The number of stars is the rating.

Yahoo Answers

A topic classification task with 10 classes:

  1. Society & Culture
  2. Science & Mathematics
  3. Health
  4. Education & Reference
  5. Computers & Internet
  6. Sports
  7. Business & Finance
  8. Entertainment & Music
  9. Family & Relationships
  10. Politics & Government

Each document includes the question title, the question context and the best answer. There are 140,000 training samples and 5,000 testing samples.
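
As a rough illustration of how such a record could be turned into a classification example, the sketch below joins the three text fields into one document and maps the class name to an integer id. The class names come from the list above. The function name, the field order and the zero-based ids are assumptions, not the dataset's official format.

```python
# The ten topic classes, in the order listed above.
YAHOO_CLASSES = [
    "Society & Culture",
    "Science & Mathematics",
    "Health",
    "Education & Reference",
    "Computers & Internet",
    "Sports",
    "Business & Finance",
    "Entertainment & Music",
    "Family & Relationships",
    "Politics & Government",
]
LABEL_TO_ID = {name: i for i, name in enumerate(YAHOO_CLASSES)}


def make_example(title, context, best_answer, label):
    """Join the non-empty text fields and return (document_text, class_id)."""
    parts = (title, context, best_answer)
    text = " ".join(p.strip() for p in parts if p and p.strip())
    return text, LABEL_TO_ID[label]


example = make_example(
    "Why is the sky blue?", "", "Rayleigh scattering.", "Science & Mathematics"
)
print(example)
```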

References

  1. Yelp Dataset Challenge official website

2017-10-19 16:25:48 Gavin__Zhou Views: 5220

Original article: https://machinelearningmastery.com/datasets-natural-language-processing/


Commonly used datasets, grouped under seven common NLP tasks. Bookmarked for reference.

  • Text Classification
  • Language Modeling
  • Image Captioning
  • Machine Translation
  • Question Answering
  • Speech Recognition
  • Document Summarization

Text Classification

Text classification refers to labeling sentences or documents, such as email spam classification and sentiment analysis.

Below are some good beginner text classification datasets.

  1. Reuters Newswire Topic Classification (Reuters-21578). A collection of news documents that appeared on Reuters in 1987 indexed by categories. Also see RCV1, RCV2 and TRC2.
  2. IMDB Movie Review Sentiment Classification (Stanford): http://ai.stanford.edu/~amaas/data/sentiment/. A collection of movie reviews from the website imdb.com and their positive or negative sentiment.
  3. News Group Movie Review Sentiment Classification (Cornell). A collection of movie reviews from the website imdb.com and their positive or negative sentiment.

For more, see the post:
Datasets for single-label text categorization.

Language Modeling

Language modeling involves developing a statistical model that predicts the next word in a sentence, or the next letter in a word, given whatever has come before. It is a precursor to tasks such as speech recognition and machine translation.
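
A minimal sketch of the idea, assuming a bigram model: count which word follows which, then predict the most frequent follower. The three-sentence corpus and the function names are made up for illustration. Real language models are trained on far larger corpora such as the ones listed in this section and use smoothing or neural networks.

```python
from collections import Counter, defaultdict


def train_bigram(sentences):
    """Count, for every word, how often each other word follows it."""
    counts = defaultdict(Counter)
    for sentence in sentences:
        tokens = ["<s>"] + sentence.lower().split() + ["</s>"]
        for prev, nxt in zip(tokens, tokens[1:]):
            counts[prev][nxt] += 1
    return counts


def next_word_probs(counts, prev):
    """Maximum-likelihood estimate of P(next word | previous word)."""
    following = counts.get(prev, {})
    total = sum(following.values())
    return {w: c / total for w, c in following.items()} if total else {}


def predict_next(counts, prev):
    """Most probable next word, or None if prev was never seen."""
    probs = next_word_probs(counts, prev)
    return max(probs, key=probs.get) if probs else None


corpus = [
    "the cat sat on the mat",
    "the cat ate the fish",
    "the dog sat on the rug",
]
bigrams = train_bigram(corpus)
print(predict_next(bigrams, "the"))  # "cat" follows "the" most often here
```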

Below are some good beginner language modeling datasets.

  1. Project Gutenberg, a large collection of free books that can be retrieved in plain text for a variety of languages.

  2. There are more formal corpora that are well studied; for example:
    1. Brown University Standard Corpus of Present-Day American English. A large sample of English words.
    2. Google 1 Billion Word Corpus.

Image Captioning

Image captioning is the task of generating a textual description for a given image.

Below are some good beginner image captioning datasets.

  1. Common Objects in Context (COCO). A collection of more than 120 thousand images with descriptions.
  2. Flickr 8K. A collection of 8 thousand described images taken from flickr.com.
  3. Flickr 30K. A collection of 30 thousand described images taken from flickr.com.

For more, see the post:
Exploring Image Captioning Datasets, 2016

Machine Translation

Machine translation is the task of translating text from one language to another.

Below are some good beginner machine translation datasets.

  1. Aligned Hansards of the 36th Parliament of Canada. Pairs of sentences in English and French.
  2. European Parliament Proceedings Parallel Corpus 1996-2011. Sentence pairs for a suite of European languages.

Many standard datasets are used for the annual machine translation challenges; see:
Statistical Machine Translation

Question Answering

Question answering is a task where a sentence or sample of text is provided from which questions are asked and must be answered.

Below are some good beginner question answering datasets.

  1. Stanford Question Answering Dataset (SQuAD). Question answering about Wikipedia articles.
  2. Deepmind Question Answering Corpus. Question answering about news articles from the Daily Mail.
  3. Amazon question/answer data. Question answering about Amazon products.

For more, see the post:
Datasets: How can I get corpus of a question-answering website like Quora or Yahoo Answers or Stack Overflow for analyzing answer quality?

Speech Recognition

Speech recognition is the task of transforming audio of a spoken language into human-readable text.

Below are some good beginner speech recognition datasets.

  1. TIMIT Acoustic-Phonetic Continuous Speech Corpus. Not free, but listed because of its wide use. Spoken American English and associated transcription.
  2. VoxForge. Project to build an open source database for speech recognition.
  3. LibriSpeech ASR corpus. Large collection of English audiobooks taken from LibriVox.

Document Summarization

Document summarization is the task of creating a short meaningful description of a larger document.

Below are some good beginner document summarization datasets.

  1. Legal Case Reports Data Set. A collection of 4 thousand legal cases and their summarization.
  2. TIPSTER Text Summarization Evaluation Conference Corpus. A collection of nearly 200 documents and their summaries.
  3. The AQUAINT Corpus of English News Text. Not free, but widely used. A corpus of news articles.

For more, see:
Document Understanding Conference (DUC) Tasks.
Where can I find good data sets for text summarization?

Further Reading

This section provides additional lists of datasets if you are looking to go deeper.

  1. Text Datasets Used in Research on Wikipedia
  2. Datasets: What are the major text corpora used by computational linguists and natural language processing researchers?
  3. Stanford Statistical Natural Language Processing Corpora
  4. Alphabetical list of NLP Datasets
  5. NLTK Corpora
  6. Open Data for Deep Learning on DL4J
  7. NLP datasets

2018-09-21 15:37:46 biubiubiu888 Views: 391
