• us-atlas: pre-built TopoJSON from the U.S. Census Bureau. This library provides a convenient mechanism for generating TopoJSON files from the Census Bureau's 2015 cartographic boundary shapefiles. For usage in the browser (with d3-geo and SVG), see bl.ocks.org/41
PAD-US is America's official national inventory of U.S. terrestrial and marine protected areas that are dedicated to the preservation of biological diversity and to other natural, recreation, and cultural uses, managed for these purposes through legal or other effective means. The database is separated into four table assets: designation, easement, fee, and proclamation.
The 'Designation' asset includes areas expected to overlap fee-owned lands, including designations such as 'Wilderness Area', leases, agreements, and areas where the protection mechanism (Category) is 'Unknown'.
The PAD-US database strives to be a complete inventory of areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational, or cultural uses, managed for these purposes through legal or other effective means. PAD-US is an aggregation of "best available" spatial data provided by agencies and organizations at a point in time. This includes both fee ownership of lands as well as management through leases, easements, or other binding agreements. The data also track Congressional designations, Executive designations, and administrative designations identified in management plans (e.g., the Bureau of Land Management's 'Area of Critical Environmental Concern'). These factors provide a robust dataset offering a spatial representation of the complex U.S. protected areas network. It is important to have a specific analysis question in mind when approaching how to work with the data. As a full inventory of areas aggregated from authoritative source data, PAD-US includes overlapping designation types and small boundary discrepancies between agency datasets. Overlapping designations largely occur in the Federal estate of the 'Designation' or 'Combined' feature classes (e.g. a 'Wilderness Area' over a 'Wild and Scenic River' and 'National Forest').
It is important to note the presence of overlaps, especially when trying to calculate area statistics; overlapping boundaries count the same area of ground multiple times. While minor boundary discrepancies remain, most major overlaps have been removed from the 'Fee' asset and this is the best source for overall land area calculations by land manager ('Manager Name') within the PAD-US database (data gaps limit calculations by fee ownership or 'Owner Name'). Statistics summarizing 'Public Access' or Protection Status ('GAP Status Code') by managing agency or organization from an analysis of the PAD-US 1.4 'Combined' feature class are available and will be updated with PAD-US 2.0. As the PAD-US database is a direct aggregation of source data, the PAD-US development team does not alter spatial linework. The exception is to "clip" lands data along State boundary lines (using the authoritative State boundary file provided by the U.S. Census Bureau) and remove the small segments of boundaries created by this process associated with State or local lands (not Federal or nonprofit lands). Some boundary discrepancies (or slivers) remain in the dataset. Data overlaps have been identified and are shared, along with the U.S. Census Bureau State jurisdictional boundary file, with agency data stewards to facilitate edits in source files that will then be incorporated in subsequent PAD-US versions over time. The PAD-US database is built in collaboration with many partners and data stewards. Information regarding data stewards is available.
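The double-counting issue described above is easy to see with a toy example. The sketch below is plain JavaScript (not the Earth Engine API), and the two rectangles are hypothetical stand-ins for, say, a Wilderness Area overlapping a National Forest:

```javascript
// Why overlapping designations inflate naive area sums.
// Rectangles and numbers are illustrative, not real PAD-US geometries.
function area(r) { return (r.xmax - r.xmin) * (r.ymax - r.ymin); }

function intersectionArea(a, b) {
  const w = Math.min(a.xmax, b.xmax) - Math.max(a.xmin, b.xmin);
  const h = Math.min(a.ymax, b.ymax) - Math.max(a.ymin, b.ymin);
  return (w > 0 && h > 0) ? w * h : 0;
}

const wilderness = { xmin: 0, ymin: 0, xmax: 10, ymax: 10 }; // 100 units²
const forest     = { xmin: 5, ymin: 0, xmax: 20, ymax: 10 }; // 150 units²

// Summing per-designation areas counts the overlap twice.
const naiveSum = area(wilderness) + area(forest);
// Subtracting the intersection recovers the true ground area.
const unionArea = naiveSum - intersectionArea(wilderness, forest);

console.log(naiveSum, unionArea); // 250 200
```

Summing per-designation areas counts the 50-unit overlap twice; dissolving or subtracting overlaps before measuring gives the true ground area, which is why the 'Fee' asset, with most major overlaps removed, is the recommended source for area statistics.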
Dataset Availability
2018-09-01T00:00:00 - 2018-09-01T00:00:00
Dataset Provider
US Geological Survey
Collection Snippet
ee.FeatureCollection("USGS/GAP/PAD-US/v20/designation")
Usage notes:

Citation:
U.S. Geological Survey (USGS) Gap Analysis Project (GAP), 2018, Protected Areas Database of the United States (PAD-US) 2.0: U.S. Geological Survey data release, ScienceBase-Catalog.
Code:
var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/designation');
var styleParams = {
  fillColor: '000070',
  color: '0000be',
  width: 3.0,
};
var regions = dataset.style(styleParams);
Map.setCenter(-73, 43, 8);
// Add the styled collection to the map so it is actually displayed.
Map.addLayer(regions, {}, 'designation');


Other related assets:
Dataset Availability
2018-09-01T00:00:00 - 2018-09-01T00:00:00
Dataset Provider
US Geological Survey
Collection Snippet
ee.FeatureCollection("USGS/GAP/PAD-US/v20/easement")
var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/easement');
var styleParams = {
  fillColor: '000070',
  color: '0000be',
  width: 3.0,
};
var regions = dataset.style(styleParams);
Map.setCenter(-73, 43, 8);
// Add the styled collection to the map so it is actually displayed.
Map.addLayer(regions, {}, 'easement');

Dataset Availability
2018-09-01T00:00:00 - 2018-09-01T00:00:00
Dataset Provider
US Geological Survey
Collection Snippet
ee.FeatureCollection("USGS/GAP/PAD-US/v20/fee")

var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/fee');
var styleParams = {
  fillColor: '000070',
  color: '0000be',
  width: 3.0,
};
var regions = dataset.style(styleParams);
Map.setCenter(-73, 43, 8);
// Add the styled collection to the map so it is actually displayed.
Map.addLayer(regions, {}, 'fee');


Dataset Availability
2018-09-01T00:00:00 - 2018-09-01T00:00:00
Dataset Provider
US Geological Survey
Collection Snippet
ee.FeatureCollection("USGS/GAP/PAD-US/v20/proclamation")
Code:
var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/proclamation');
var styleParams = {
  fillColor: '000070',
  color: '0000be',
  width: 3.0,
};
var regions = dataset.style(styleParams);
Map.setCenter(-73, 43, 8);
// Add the styled collection to the map so it is actually displayed.
Map.addLayer(regions, {}, 'proclamation');



• List of datasets for machine learning research
Face recognition
In computer vision, face images have been used extensively to develop face recognition systems, face detection, and many other projects that use images of faces.
Action recognition
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Human Motion DataBase (HMDB51) | 51 action categories, each containing at least 101 clips, extracted from a range of sources. | None. | 6,766 video clips | Video clips | Action classification | 2011 | [42] | H. Kuehne et al.
TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | Video clips | Action prediction | 2013 | [43] | Patron-Perez, A. et al.
UT Interaction | People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch), sometimes with multiple groups in the same video clip. | None. | 120 video clips | Video clips | Action prediction | 2009 | [44] | Ryoo, M. S. et al.
UT Kinect | 10 different people performing one of 6 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. | None. | 200 video clips with depth information at 15 frames per second | Video clips with depth information | Action classification | 2012 | [45] | Xia, L. et al.
SBU Interact | Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. | None. | Around 300 interactions | Video clips with depth information | Action classification | 2012 | [46] | Yun, K. et al.
Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions. | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 stereo cameras, 4 quad cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [47] | Ofli, F. et al.
UCF 101 Dataset | Self-described as "a dataset of 101 human actions classes from videos in the wild." Dataset is large, with over 27 hours of video. | Actions classified and labeled. | 13,000 | Video, images, text | Classification, action detection | 2012 | [48][49] | K. Soomro et al.
THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [50][51] | Y. Jiang et al.
ActivityNet | Large video dataset for activity recognition and detection. | Actions classified and labeled. | 10,024 | Video, images, text | Classification, action detection | 2015 | [52] | Heilbron et al.
MSP-AVATAR | Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. | Actions classified and labeled. | 74 sessions | Motion-captured video, audio | Classification, action detection | 2015 | [53] | Sadoughi, N. et al.
LILiR Twotalk Corpus | Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. | Actions classified and labeled. | 527 | Video | Action detection | 2011 | [54] | Sheerman-Chase et al.
Object detection & recognition
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
DAVIS: Densely Annotated VIdeo Segmentation | 150 video sequences containing 10,459 frames with a total of 376 objects annotated. | Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high-quality segmentation annotation. | 10,459 | Annotated frames | Video object segmentation | 2017 | [55] | Pont-Tuset, J. et al.
T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects | 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. | 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground-truth poses. | 49,000 | RGB-D images, 3D object models | 6D object pose estimation, object detection | 2017 | [56] | T. Hodan et al.
Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | Labeled images, text | Object recognition | 2014 | [57][58] | A. Janoch et al.
Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets, plus benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [59] | University of California, Berkeley
Microsoft Common Objects in Context (COCO) | Complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [60][61] | T. Lin et al.
SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [62][63] | J. Xiao et al.
ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge. | Labeled objects, bounding boxes, descriptive words, SIFT features | 14,197,122 | Images, text | Object recognition, scene recognition | 2014 | [64][65] | J. Deng et al.
TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [66][67] | P. Guha et al.
Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [68] | University of Massachusetts
Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition | 2003 | [69][70] | F. Li et al.
Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, text | Classification, object detection | 2007 | [71][72] | G. Griffin et al.
SIFT10M Dataset | SIFT features of the Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [73] | X. Fu et al.
LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [74] | MIT Computer Science and Artificial Intelligence Laboratory
Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling | 25,000 | Images, text | Classification, object detection | 2016 | [75] | Daimler AG et al.
PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included | 500,000 | Images, text | Classification, object detection | 2010 | [76][77] | M. Everingham et al.
CIFAR-10 Dataset | Many small, low-resolution images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al.
CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al.
German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled | 900 | Images | Classification | 2013 | [79][80] | S. Houben et al.
KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [81][82] | A. Geiger et al.
Handwriting and character recognition
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [83] | H. Guvenir et al.
Letter Dataset | Upper-case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [84][85] | D. Slate et al.
Character Trajectories Dataset | Labeled samples of pen-tip trajectories for people writing simple characters. | 3-dimensional pen-tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [86][87] | B. Williams
Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada. | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [88] | T. de Campos
UJI Pen Characters Dataset | Isolated handwritten characters. | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [89][90] | F. Prat et al.
Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [91] | Yann LeCun et al.
MNIST Database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1998 | [92][93] | National Institute of Standards and Technology
Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size-normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [94] | E. Alpaydin et al.
Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on an electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [95][96] | E. Alpaydin et al.
Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [97] | T. Srl
HASYv2 | Handwritten mathematical symbols. | All symbols are centered and of size 32px x 32px. | 168,233 | Images, text | Classification | 2017 | [98] | Martin Thoma
Aerial images
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial classification, object detection | 2013 | [99][100] | J. Yuan et al.
KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~150 | Images with paths | People tracking, aerial tracking | 2012 | [101][102] | M. Butenuth et al.
Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [103][104] | B. Johnson
Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [105][106] | B. Johnson
Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [107][108] | F. Tanner et al.
Other images
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [109][110] | M. Rohrbach et al.
Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [111][112] | A. Khosla et al.
The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~7,400 | Images, text | Classification, object detection | 2012 | [112][113] | O. Parkhi et al.
Corel Image Features Data Set | Database of images with features extracted. | Many features, including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [114][115] | M. Ortega-Bindenberger et al.
Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various different videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [116] | T. Deneke et al.
Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language. | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [117] | Microsoft Research
Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [118][119] | C. Wah et al.
YouTube-8M | Large and diverse labeled video dataset. | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [120][121] | S. Abu-El-Haija et al.
YFCC100M | Large and diverse labeled image and video dataset. | Flickr videos and images and associated descriptions, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, image, text | Video and image classification | 2016 | [122][123] | B. Thomee et al.
Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [124] | Y. Baveye et al.
Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [125] | Y. Baveye et al.
MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10,900 | Video | Video emotion elicitation detection | 2015 | [126] | Y. Baveye et al.
Text data
Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.
Reviews
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Amazon reviews | US product reviews from Amazon.com. | None. | ~82M | Text | Classification, sentiment analysis | 2015 | [127] | McAuley et al.
OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [128][129] | K. Ganesan et al.
MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~22M | Text | Regression, clustering, classification | 2016 | [130] | GroupLens Research
Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~10M | Text | Clustering, regression | 2004 | [131][132] | Yahoo!
Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [133][134] | M. Bohanec
YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [135][136] | Google
Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41,396 | Text | Classification, regression | 2015 | [137] | Q. Nguyen
Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [138][139] | W. Loh et al.
News articles
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [140] | Dermouche, M. et al.
The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [141] | Reuters
The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [142] | Reuters
Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [143] | T. Rose et al.
Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [144] | M. Alhagri
Messages
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Enron Email Dataset | Emails from employees at Enron, organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [145][146] | Klimt, B. and Y. Yang
Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus, depending on whether a lemmatiser or stop-list was enabled. | | Text | Classification | 2000 | [147][148] | Androutsopoulos, J. et al.
SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5574 | Text | Classification | 2011 | [149][150] | T. Almeida et al.
Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [151] | T. Mitchell et al.
Spambase Dataset | Spam emails. | Many text features extracted. | 4601 | Text | Spam detection, classification | 1999 | [152] | M. Hopkins et al.
Twitter and tweets
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from the presence of an emoticon in the tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [153][154] | A. Go et al.
ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [155][156] | R. Zafarani et al.
SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [157][158] | J. McAuley et al.
Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [159][160] | N. Abdulla
Buzz in Social Media Dataset | Data from Twitter and Tom's Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, classification | 2013 | [161][162] | F. Kawala et al.
Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging | 18,762 | Text | Regression, classification | 2015 | [163][164] | Xu et al.
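The "distant supervision" labeling that Sentiment140 is built on can be sketched in a few lines of plain JavaScript. The emoticon lists and the distantLabel helper below are illustrative assumptions, not the authors' actual pipeline:

```javascript
// Sketch of emoticon-based distant supervision: tweets with positive
// emoticons are auto-labeled positive, negative emoticons negative, and
// the emoticon is stripped so a classifier cannot simply memorize it.
// These emoticon lists are illustrative, not the exact set used.
const POSITIVE = [':)', ':-)', ':D'];
const NEGATIVE = [':(', ':-('];

function distantLabel(tweet) {
  const pos = POSITIVE.some(e => tweet.includes(e));
  const neg = NEGATIVE.some(e => tweet.includes(e));
  if (pos === neg) return null; // ambiguous or no emoticon: discard
  let text = tweet;
  for (const e of POSITIVE.concat(NEGATIVE)) text = text.split(e).join('');
  return { text: text.trim(), label: pos ? 'positive' : 'negative' };
}

console.log(distantLabel('great game :)'));     // { text: 'great game', label: 'positive' }
console.log(distantLabel('flight delayed :(')); // { text: 'flight delayed', label: 'negative' }
console.log(distantLabel('meh'));               // null
```

The labels are noisy but free, which is what makes a 1.5M-tweet training set feasible without manual annotation.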
Other text
Sound data
Datasets of sounds and sound features.
Speech
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Zero Resource Speech Challenge 2015 | Spontaneous speech (English), read speech (Xitsonga). | Raw wav | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | Sound | Unsupervised discovery of speech features/subword units/word units | 2015 | [187][188] | Versteegh et al. (www.zerospeech.com/2015)
Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson's disease. | Voice features extracted, disease scored by physician using the unified Parkinson's disease rating scale | 1,040 | Text | Classification, regression | 2013 | [189][190] | B. E. Sakar et al.
Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [191][192] | M. Bedda et al.
ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [193][194] | R. Cole et al.
Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | Applied 12-degree linear prediction analysis to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [195][196] | M. Kudo et al.
Parkinson's Telemonitoring Dataset | Multiple recordings of people with and without Parkinson's disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [197][198] | A. Tsanas et al.
TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification | 1986 | [199][200] | J. Garofolo et al.
Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to the phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech synthesis, speech recognition, corpus alignment, speech therapy, education | 2016 | [201] | N. Halabi
Music
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Geographical Original of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographical classification, clustering | 2014 | [202][203] | F. Zhou et al.
Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [204][205] | T. Bertin-Mahieux et al.
Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1 TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [206] | M. Defferrard et al.
Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [207][208] | D. Radicioni et al.
Other sounds
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events, with metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound (WAV) | Classification | 2014 | [209][210] | J. Salamon et al.
Signal data
Datasets containing electric signal information requiring some sort of signal processing for further analysis.
Electrical
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [211][212] | Center for Applied Internet Data Analysis
Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [213][214] | M. Kachuee et al.
Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [215][216] | A. Vergara
Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [217][218] | K. Ullrich
UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [219][220] | D. Rambla et al.
Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [221][222] | M. Bator
Motion-tracking
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [223][224] | Pontifical Catholic University of Rio de Janeiro
Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [225][226] | R. Madeo et al.
Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [227][228] | T. Theodoridis
Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [229][230] | B. Barshan et al.
Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [231][232] | J. Reyes-Ortiz et al.
Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [233][234] | M. Kadous
Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [235][236] | W. Ugulino et al.
sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [237][238] | C. Sapsanis et al.
REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [238][239] | O. Banos et al.
Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [240][241] | A. Stisen et al.
Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [242][243] | D. Bacciu
PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [244] | A. Reiss
OPPORTUNITY Activity Recognition Dataset | Human activity recognition from wearable, object, and ambient sensors; a dataset devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [245][246] | D. Roggen et al.
Real World Activity Recognition Dataset | Human activity recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [247] | T. Sztyler et al.
Other signals
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given. | 178 | Text | Classification, regression | 1991 | [248][249] | M. Forina et al.
Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None. | 9568 | Text | Regression | 2014 | [250][251] | P. Tufekci et al.
Physical data
Datasets from physical systems.
High-energy physics
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [252][253][254] | D. Whiteson
HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [253][254][255] | D. Whiteson
Systems
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [256][257] | R. Lopez
Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer-valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [258] | L. Seabra et al.
Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [259][260] | Y. Reich et al.
Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [261][262] | J. Schimmer et al.
Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [263] | Carnegie Mellon University
Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [264][265] | A. Xifara et al.
Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two- and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [266] | R. Lopez
Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [267][268] | D. Draper et al.
Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [269] | NASA
Astronomy
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [270][271] | M. Burl
MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [271][272] | R. Bock
Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar-flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [273] | G. Bradshaw
Earth science
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, and dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [274] | E. Venzke et al.
Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [275][276] | M. Sikora et al.
Other physical
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [277][278] | I. Yeh
Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given, such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [279][280] | I. Yeh
Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [281] | Arris Pharmaceutical Corp.
Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [282] | Semeion Research Center
Biological data
Datasets from biological systems.
Human
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp, sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [283][284] | H. Begleiter
P300 Interface Dataset | Data from nine subjects collected using a P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [285][286] | U. Hoffman et al.
Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient, with some missing values. | 303 | Text | Classification | 1988 | [287][288] | A. Janosi et al.
Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [289][290] | W. Wolberg et al.
National Survey on Drug Use and Health | Large-scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [291] | United States Department of Health and Human Services
Lung Cancer Dataset | Lung cancer dataset without attribute definitions. | 56 features are given for each case. | 32 | Text | Classification | 1992 | [292][293] | Z. Hong et al.
Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [294][295] | H. Altay et al.
Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [296][297] | J. Clore et al.
Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [298][299] | B. Antal et al.
Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [300][301] | Bupa Medical Research Ltd.
Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [302][303] | R. Quinlan
Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | 324 | Text | Classification | 2016 | [304][305] | A. Tanrikulu et al.
KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [306] | M. Naeem et al.
Animal
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Abalone Dataset | Physical measurements of abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [307] | Marine Research Laboratories – Taroona
Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [308] | R. Forsyth
Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [309] | E. Armengol et al.
Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [293] | G. Towell et al.
Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, clustering | 2015 | [310][311] | C. Higuera et al.
Plant
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [312][313] | P. Cortez et al.
Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [314][315] | R. Fisher
Plant Species Leaves Dataset | Sixteen leaf samples for each of one hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [316][317] | J. Cope et al.
Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [318] | J. Schlimmer
Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [319] | R. Michalski et al.
Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [320][321] | Charytanowicz et al.
Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [322][323] | J. Blackard et al.
Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine the set of rules that governs the network. | None. | 300 | Text | Causal discovery | 2008 | [324] | J. Jenkens et al.
Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [325][326] | T. Munisami et al.
Oxford Flower Dataset | 17-category dataset of flowers. | Train/test splits, labeled images. | 1360 | Images, text | Classification | 2006 | [113][327] | M-E Nilsback et al.
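Many of the UCI tables above, the Iris data among them, ship as plain comma-separated text with the numeric attributes first and the class label in the last column. A minimal parsing sketch, using two illustrative rows rather than the full 150-instance file:

```python
import csv
import io

# Two sample rows in the UCI Iris layout: four numeric attributes, then a label.
# These rows are illustrative stand-ins, not a download of the real file.
sample = "5.1,3.5,1.4,0.2,Iris-setosa\n6.3,3.3,6.0,2.5,Iris-virginica\n"

# Each parsed row becomes (feature_vector, label).
rows = [([float(v) for v in rec[:4]], rec[4])
        for rec in csv.reader(io.StringIO(sample))]
print(rows[0])  # ([5.1, 3.5, 1.4, 0.2], 'Iris-setosa')
```

The same split-features-from-label pattern applies to most of the single-table text datasets listed in this section.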
Microbe
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | [328][329] | K. Nakai et al.
MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [330][331] | P. Mahe et al.
Yeast Dataset | Predictions of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [332][333] | K. Nakai et al.
Drug Discovery
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12,707 | Text | Classification | 2016 | [334] | A. Mayr et al.
Anomaly data
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | ? | 50+ files | Comma-separated values | Anomaly detection | 2016 (continually updated) | [335] | Numenta
Multivariate data
Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.
Financial
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma-separated values | Classification, regression, time series | 2014 | [336][337] | M. Brown et al.
Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected, and attributes about the application. | Attribute names are removed, as well as identifying information. Factors have been relabeled. | 690 | Comma-separated values | Classification | 1987 | [338][339] | R. Quinlan
eBay auction data | Auction data from various eBay.com objects over various-length auctions. | Contains all bids, bidder IDs, bid times, and opening prices. | ~550 | Text | Regression, classification | 2012 | [340][341] | G. Shmueli et al.
Statlog (German Credit Data) | Binary credit classification into "good" or "bad" with many features. | Various financial features of each person are given. | 690 | Text | Classification | 1994 | [342] | H. Hofmann
Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed is also given. | 45,211 | Text | Classification | 2012 | [343][344] | S. Moro et al.
Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [345][346] | O. Akbilgic
Default of Credit Card Clients | Credit default data for Taiwanese creditors. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [347][348] | I. Yeh
Weather
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [349] | P. Collard
El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178,080 | Text | Regression | 1999 | [350] | Pacific Marine Environmental Laboratory
Greenhouse Gas Observing Network Dataset | Time series of greenhouse gas concentrations at 2921 grid cells in California, created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [351] | D. Lucas
Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [352] | Mauna Loa Observatory
Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [303][353] | Johns Hopkins University
Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [354][355] | K. Zhang et al.
Census
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma-separated values | Classification | 1996 | [356] | United States Census Bureau
Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma-separated values | Classification | 2000 | [357][358] | United States Census Bureau
IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None. | 256,932 | Text | Classification, regression | 1999 | [359] | IPUMS
US Census Data 1990 | Partial data from the 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [360] | United States Census Bureau
Transit
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [361][362] | H. Fanaee-T
New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick-up and drop-off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [363] | New York City Taxi and Limousine Commission
Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal discovery | 2015 | [364][365] | M. Ferreira et al.
Internet
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [366] | V. Granville
Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [367][368] | N. Kushmerick
Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [369] | D. Cook
URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [370][371] | J. Ma
Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [372] | R. Mustafa et al.
Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [373] | D. Chen
Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [374][375] | Freebase
Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [376][377] | C. Masterharm et al.
Games
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Poker Hand Dataset | 5-card hands from a standard 52-card deck. | Attributes of each hand are given, including the poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [378] | R. Cattral
Connect-4 Dataset | Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [379] | J. Tromp
Chess (King-Rook vs. King) Dataset | Endgame database for White king and rook against Black king. | None. | 28,056 | Text | Classification | 1994 | [380][381] | M. Bain et al.
Chess (King-Rook vs. King-Pawn) Dataset | King+rook versus king+pawn on a7. | None. | 3196 | Text | Classification | 1989 | [382] | R. Holte
Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [383] | D. Aha
Other multivariate
Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [384] | D. Harrison et al.
The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [385] | Getty Center
Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [386][387] | Chu et al.
British Oceanographic Data Centre | Biological, chemical, physical, and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [388] | British Oceanographic Data Centre
Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [389] | J. Schlimmer
Entree Chicago Recommendation Dataset | Record of user interactions with the Entree Chicago recommendation system. | Details of each user's usage of the app are recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [390] | R. Burke
Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [391][392] | P. van der Putten
Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [393][394] | V. Rajkovic et al.
University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [395] | S. Sounders et al.
Blood Transfusion Service Center Dataset | Data from a blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [396][397] | I. Yeh
Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [398][399] | University of Mainz
Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [400][401] | Nomao Labs
Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [402] | G. Wiederhold
Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~30,000 | Text | Classification, clustering, regression | 2015 | [403][404] | J. Kuzilek et al.

Ref:
List of datasets for machine learning research - Wikipedia


The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.
The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.
In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.
We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.
Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.

Perceptrons

What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.
So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:

$$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \le \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}$$

That's all there is to how a perceptron works!
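Rule (1) is short enough to express directly in code. A minimal sketch, with function name and example weights of my own choosing rather than anything from the text:

```python
def perceptron_output(weights, inputs, threshold):
    # Rule (1): output 1 iff the weighted sum of the inputs exceeds the threshold.
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0

# A three-input perceptron with equal weights and threshold 1 fires
# only when at least two inputs are on.
print(perceptron_output([1, 1, 1], [1, 1, 0], 1))  # 1
print(perceptron_output([1, 1, 1], [1, 0, 0], 1))  # 0
```

Note that the weights and threshold are fixed parameters here; learning them from data is exactly what the rest of the chapter builds toward.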

That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:
1. Is the weather good?
2. Does your boyfriend or girlfriend want to accompany you?
3. Is the festival near public transit? (You don't own a car.)
We can represent these three factors by corresponding binary variables $x_1$, $x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
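The effect of varying the threshold can be checked mechanically. A small sketch of the cheese-festival perceptron using the weights from the text (the function and variable names are mine):

```python
def goes_to_festival(weather, partner, transit, threshold):
    # Weights from the text: weather 6, partner 2, transit 2.
    weighted_sum = 6 * weather + 2 * partner + 2 * transit
    return 1 if weighted_sum > threshold else 0

# Threshold 5: only good weather can push the sum past the threshold.
print(goes_to_festival(1, 0, 0, 5))  # 1 (good weather alone is enough)
print(goes_to_festival(0, 1, 1, 5))  # 0 (partner + transit gives only 4)

# Threshold 3: partner and transit together now suffice.
print(goes_to_festival(0, 1, 1, 3))  # 1
print(goes_to_festival(0, 1, 0, 3))  # 0
```

Lowering the threshold from 5 to 3 flips the partner-plus-transit case from "stay home" to "go", exactly as described above.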
Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

In this network, the first column of perceptrons - what we'll call the first layer of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.
Let's simplify the way we describe perceptrons. The condition

∑jwjxj>threshold

∑

j

w

j

x

j

>

threshold

is cumbersome, and we can make two notational changes to simplify it. The first change is to write

∑jwjxj

∑

j

w

j

x

j

as a dot product,

w⋅x≡∑jwjxj

w

⋅

x

≡

∑

j

w

j

x

j

, where

w

w

and

x

x

are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias,

b≡−threshold

b

≡

−

threshold

. Using the bias instead of the threshold, the perceptron rule can be rewritten:

output={01if w⋅x+b≤0if w⋅x+b>0(2)

(2)

output

=

{

0

if

w

⋅

x

+

b

≤

0

1

if

w

⋅

x

+

b

>

0

You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
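The perceptron rule with a bias can be sketched in a few lines of code. This is a minimal illustration of Equation (2), not part of any library; the name `perceptron` is just for this example.

```python
def perceptron(w, x, b):
    """Return 1 if w . x + b > 0, and 0 otherwise (Equation (2))."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return 1 if z > 0 else 0

# A big positive bias makes it easy to output a 1, even with zero inputs:
print(perceptron([0.5, 0.5], [0, 0], b=10))   # 1
# A very negative bias makes it hard, even with active inputs:
print(perceptron([0.5, 0.5], [1, 1], b=-10))  # 0
```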

I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight

$-2$, and an overall bias of $3$. Here's our perceptron:

Then we see that input $00$ produces output $1$, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
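The truth-table check above is easy to run for all four inputs. This is a sketch using the weights $-2, -2$ and bias $3$ from the text; the name `nand` is just illustrative.

```python
def nand(x1, x2):
    """Perceptron with weights -2, -2 and bias 3, which computes NAND."""
    z = (-2) * x1 + (-2) * x2 + 3
    return 1 if z > 0 else 0

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", nand(x1, x2))
# 0 0 -> 1
# 0 1 -> 1
# 1 0 -> 1
# 1 1 -> 0
```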

The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits,

$x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:

One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines into a single connection with a weight of $-4$ instead of two connections with $-2$ weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to $-2$, all biases equal to $3$, and a single weight of $-4$, as marked:
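The adder circuit built from NAND perceptrons can be simulated directly. This is a sketch using the standard half-adder wiring of NAND gates, with each gate implemented as the weight $-2, -2$, bias $3$ perceptron; the function names `nand` and `add_bits` are illustrative.

```python
def nand(x1, x2):
    """NAND as a perceptron: weights -2, -2 and bias 3."""
    return 1 if (-2) * x1 + (-2) * x2 + 3 > 0 else 0

def add_bits(x1, x2):
    """Return (bitwise sum x1 XOR x2, carry bit x1 AND x2),
    computed entirely from NAND perceptrons."""
    a = nand(x1, x2)                    # the leftmost perceptron, used twice below
    s = nand(nand(x1, a), nand(x2, a))  # bitwise sum x1 XOR x2
    carry = nand(a, a)                  # carry bit x1 AND x2
    return s, carry

print(add_bits(1, 1))  # (0, 1): sum 0, carry 1
print(add_bits(1, 0))  # (1, 0): sum 1, no carry
```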

Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:

This notation for input perceptrons, in which we have an output, but no inputs, is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.
The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!
However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

Sigmoid neurons

Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.
The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.
Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function* and is defined by:

*Incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons. It's useful to remember this terminology, since these terms are used by many people working with neural nets. However, we'll stick with the sigmoid terminology.

$$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$

To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is

$$\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}$$
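Equations (3) and (4) translate directly into code. This is a minimal sketch; the names `sigmoid` and `sigmoid_neuron` are illustrative, not from any library.

```python
import math

def sigmoid(z):
    """The sigmoid function of Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_neuron(w, x, b):
    """Output of a sigmoid neuron, Equation (4)."""
    z = sum(wj * xj for wj, xj in zip(w, x)) + b
    return sigmoid(z)

print(sigmoid(0))  # 0.5
# Inputs need not be 0 or 1; any value between 0 and 1 is valid:
print(sigmoid_neuron([1.0, 1.0], [0.638, 0.5], -1.0))
```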

At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.
To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
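The limiting behaviour is easy to confirm numerically. A quick sketch, with $z = \pm 20$ standing in for "large positive" and "very negative":

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

print(sigmoid(20))    # very close to 1, like a perceptron that fires
print(sigmoid(-20))   # very close to 0, like a perceptron that doesn't
print(sigmoid(0.5))   # modest z: well away from both 0 and 1
```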
What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

[Figure: plot of the sigmoid function, rising smoothly from $0$ to $1$ as $z$ runs from $-4$ to $4$.]

This shape is a smoothed out version of a step function:

[Figure: plot of the step function, jumping from $0$ to $1$ at $z = 0$, for $z$ from $-4$ to $4$.]

If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative**.

**Actually, when $w \cdot x + b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.

By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \mbox{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \mbox{output}$ is well approximated by

$$\Delta \mbox{output} \approx \sum_j \frac{\partial \, \mbox{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \mbox{output}}{\partial b} \Delta b, \tag{5}$$

where the sum is over all the weights, $w_j$, and $\partial \, \mbox{output} / \partial w_j$ and $\partial \, \mbox{output} / \partial b$ denote partial derivatives of the output with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \mbox{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
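Equation (5) can be checked numerically. This sketch uses the fact that $\sigma'(z) = \sigma(z)(1 - \sigma(z))$ to compute the partial derivatives; the particular weights, inputs, and perturbation sizes are arbitrary choices for illustration.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def output(w, x, b):
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)

w, x, b = [0.6, -0.4], [1.0, 0.5], 0.1
dw, db = [1e-4, -2e-4], 5e-5          # small changes to the weights and bias

# Linear prediction from Equation (5): d(output)/dw_j = sigma'(z) * x_j,
# d(output)/db = sigma'(z), with sigma'(z) = sigma(z) * (1 - sigma(z)).
z = sum(wj * xj for wj, xj in zip(w, x)) + b
sp = sigmoid(z) * (1 - sigmoid(z))
linear = sum(sp * xj * dwj for xj, dwj in zip(x, dw)) + sp * db

# Actual change in the output after perturbing the weights and bias:
actual = output([wj + dwj for wj, dwj in zip(w, dw)], x, b + db) - output(w, x, b)

print(abs(actual - linear) < 1e-8)  # True: the linear approximation is excellent
```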

If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.
How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or