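• us-atlas: pre-built TopoJSON from the U.S. Census Bureau. The U.S. Atlas TopoJSON library provides a convenient mechanism for generating TopoJSON files from the Census Bureau's cartographic boundary shapefiles, 2015 edition. Usage: in a browser (using the d3-geo plugin and SVG), bl.ocks.org/41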
  • Protected Areas Database of the United States (PAD-US)

    PAD-US is America's official national inventory of U.S. terrestrial and marine protected areas that are dedicated to the preservation of biological diversity and to other natural, recreation and cultural uses, managed for these purposes through legal or other effective means. The database is split into four table assets: designation, easement, fee, and proclamation.

    The 'Designation' asset includes areas expected to overlap fee-owned lands, including designations such as 'Wilderness Area', leases, agreements, and areas where the protection mechanism (Category) is 'Unknown'.

    The PAD-US database strives to be a complete inventory of areas dedicated to the preservation of biological diversity, and other natural (including extraction), recreational or cultural uses, managed for these purposes through legal or other effective means. PAD-US is an aggregation of "best available" spatial data provided by agencies and organizations at a point in time. This includes both fee ownership of lands as well as management through leases, easements, or other binding agreements. The data also tracks Congressional designations, Executive designations, and administrative designations identified in management plans (e.g. Bureau of Land Management's 'Area of Environmental Concern'). These factors provide for a robust dataset offering a spatial representation of the complex U.S. protected areas network. It is important to have in mind a specific analysis question when approaching how to work with the data. As a full inventory of areas aggregated from authoritative source data, PAD-US includes overlapping designation types and small boundary discrepancies between agency datasets. Overlapping designations largely occur in the Federal estate of the 'Designation' or 'Combined' feature classes (e.g. 'Wilderness Area' over a 'Wild and Scenic River' and 'National Forest').

    It is important to note the presence of overlaps, especially when trying to calculate area statistics; overlapping boundaries count the same area of ground multiple times. While minor boundary discrepancies remain, most major overlaps have been removed from the 'Fee' asset and this is the best source for overall land area calculations by land manager ('Manager Name') within the PAD-US database (data gaps limit calculations by fee ownership or 'Owner Name'). Statistics summarizing 'Public Access' or Protection Status ('GAP Status Code') by managing agency or organization from an analysis of the PAD-US 1.4 'Combined' feature class are available and will be updated with PAD-US 2.0. As the PAD-US database is a direct aggregation of source data, the PAD-US development team does not alter spatial linework. The exception is to "clip" lands data along State boundary lines (using the authoritative State boundary file provided by the U.S. Census Bureau) and remove the small segments of boundaries created by this process associated with State or local lands (not Federal or nonprofit lands). Some boundary discrepancies (or slivers) remain in the dataset. Data overlaps have been identified and are shared, along with the U.S. Census Bureau State jurisdictional boundary file, with agency data stewards to facilitate edits in source files that will then be incorporated in subsequent PAD-US versions over time. The PAD-US database is built in collaboration with many partners and data stewards. Information regarding data stewards is available.
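    The double counting described above can be sketched in plain JavaScript. This is a toy illustration only, not PAD-US data or the Earth Engine API: the `designations` rectangles, their names, and the helpers `summedArea` and `unionArea` are all hypothetical, and real boundaries are polygons rather than grid-aligned rectangles.

```javascript
// Hypothetical designations as axis-aligned rectangles on a unit grid.
// Corners are integers; x1 and y1 are exclusive. Not real PAD-US geometry.
const designations = [
  {name: 'National Forest', x0: 0, y0: 0, x1: 10, y1: 10},
  {name: 'Wilderness Area', x0: 2, y0: 2, x1: 6, y1: 6},
  {name: 'Wild and Scenic River', x0: 0, y0: 4, x1: 10, y1: 5},
];

// Naive total: add up each rectangle's own area, overlaps included.
function summedArea(rects) {
  return rects.reduce((sum, r) => sum + (r.x1 - r.x0) * (r.y1 - r.y0), 0);
}

// Footprint: count every grid cell once, however many rectangles cover it.
function unionArea(rects) {
  const covered = new Set();
  for (const r of rects) {
    for (let x = r.x0; x < r.x1; x++) {
      for (let y = r.y0; y < r.y1; y++) {
        covered.add(x + ',' + y);
      }
    }
  }
  return covered.size;
}

console.log(summedArea(designations)); // 126 (100 + 16 + 10)
console.log(unionArea(designations));  // 100 (both overlays lie inside the forest)
```

    The gap between the two numbers is the ground counted more than once, which is why the text points to the 'Fee' asset, where most major overlaps have been removed, for overall area totals.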



    Dataset Availability

    2018-09-01T00:00:00 - 2018-09-01T00:00:00

    Dataset Provider

    US Geological Survey

    Collection Snippet




    The Digital Object Identifier (DOI) for PAD-US 2.0, https://doi.org/10.5066/P955KPLE, provides the persistent reference that should be used to obtain the data. The U.S. Geological Survey and all contributing data partners shall not be held liable for improper or incorrect use of the data described and (or) contained herein.

    All information is created with a specific end use or uses in mind. This is especially true for GIS data, which is expensive to produce and must be directed to meet immediate program needs. These data were created with the expectation that they would be used for other applications; however, inappropriate uses are listed below. The list is in no way exhaustive but should serve as a guide to assess whether a proposed use can or cannot be supported by these data. For many uses, it is unlikely that PAD-US will provide the only data needed, and for uses with a regulatory outcome, authoritative agency data and field surveys should verify the result. PAD-US is recommended for users seeking basic information about the lands of more than one agency or organization. Users should seek authoritative source data directly to answer questions regarding a single agency or those requiring more frequent updates. Ultimately, it is the responsibility of each data user to determine whether these data can answer the question being asked.

    Inappropriate uses include:

      • Using PAD-US for applications or analyses associated with one agency or a particular unit (agencies are always the best and authoritative source of their lands data, and many publish updates more frequently than PAD-US).
      • Using some data to map small areas (less than thousands of hectares), typically requiring mapping resolution at 1:24,000 scale (as boundary quality varies by data source) and using aerial photographs or ground surveys in areas where data are incomplete.
      • Combining these data with other data finer than 1:100,000 scale to produce new hybrid maps or answer queries.
      • Generating specific areal measurements from the data finer than the nearest thousand hectares.
      • Representing boundaries as a legal representation for regulation or acquisition.
      • Establishing definite occurrence or non-occurrence of any feature for an exact geographic area.
      • Determining abundance, health, or condition of any feature.
      • Using the data without acquiring and reviewing the metadata.
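    Two of the limits in the terms of use are simple arithmetic and can be sketched as checks. The helper names `roundToThousandHectares` and `isFinerThan` are hypothetical and are not part of PAD-US or Earth Engine. The sketch assumes a map scale is given by its denominator, so 1:24,000 is passed as 24000.

```javascript
// Hypothetical helper: report an area no finer than the nearest thousand hectares.
function roundToThousandHectares(hectares) {
  return Math.round(hectares / 1000) * 1000;
}

// A scale is "finer" than a limit when its denominator is smaller,
// e.g. 1:24,000 is finer than 1:100,000.
function isFinerThan(scaleDenominator, limitDenominator) {
  return scaleDenominator < limitDenominator;
}

console.log(roundToThousandHectares(12345.6)); // 12000
console.log(isFinerThan(24000, 100000));       // true
```

    A check like `isFinerThan(otherScale, 100000)` could flag a companion dataset that the terms say should not be combined with PAD-US.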


    U.S. Geological Survey (USGS) Gap Analysis Project (GAP), 2018, Protected Areas Database of the United States (PAD-US): U.S. Geological Survey data release, https://doi.org/10.5066/P955KPLE.


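    // Load the PAD-US v2.0 'designation' table asset.
    var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/designation');
    // Dark blue fill with a lighter blue outline, 3 pixels wide.
    var styleParams = {
      fillColor: '000070',
      color: '0000be',
      width: 3.0,
    };
    var regions = dataset.style(styleParams);
    // Center the map at longitude -73, latitude 43, zoom level 8.
    Map.setCenter(-73, 43, 8);
    Map.addLayer(regions, {}, 'USGS/GAP/PAD-US/v20/designation');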


    Easement: USGS GAP PAD-US v2.0

    Dataset Availability

    2018-09-01T00:00:00 - 2018-09-01T00:00:00

    Dataset Provider

    US Geological Survey

    Collection Snippet



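    // Load the PAD-US v2.0 'easement' table asset.
    var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/easement');
    // Dark blue fill with a lighter blue outline, 3 pixels wide.
    var styleParams = {
      fillColor: '000070',
      color: '0000be',
      width: 3.0,
    };
    var regions = dataset.style(styleParams);
    // Center the map at longitude -73, latitude 43, zoom level 8.
    Map.setCenter(-73, 43, 8);
    Map.addLayer(regions, {}, 'USGS/GAP/PAD-US/v20/easement');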

    Fee: USGS GAP PAD-US v2.0 

    Dataset Availability

    2018-09-01T00:00:00 - 2018-09-01T00:00:00

    Dataset Provider

    US Geological Survey

    Collection Snippet



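    // Load the PAD-US v2.0 'fee' table asset.
    var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/fee');
    // Dark blue fill with a lighter blue outline, 3 pixels wide.
    var styleParams = {
      fillColor: '000070',
      color: '0000be',
      width: 3.0,
    };
    var regions = dataset.style(styleParams);
    // Center the map at longitude -73, latitude 43, zoom level 8.
    Map.setCenter(-73, 43, 8);
    Map.addLayer(regions, {}, 'USGS/GAP/PAD-US/v20/fee');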

     Proclamation: USGS GAP PAD-US v2.0

    Dataset Availability

    2018-09-01T00:00:00 - 2018-09-01T00:00:00

    Dataset Provider

    US Geological Survey

    Collection Snippet




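    // Load the PAD-US v2.0 'proclamation' table asset.
    var dataset = ee.FeatureCollection('USGS/GAP/PAD-US/v20/proclamation');
    // Dark blue fill with a lighter blue outline, 3 pixels wide.
    var styleParams = {
      fillColor: '000070',
      color: '0000be',
      width: 3.0,
    };
    var regions = dataset.style(styleParams);
    // Center the map at longitude -73, latitude 43, zoom level 8.
    Map.setCenter(-73, 43, 8);
    Map.addLayer(regions, {}, 'USGS/GAP/PAD-US/v20/proclamation');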


  • List of datasets for machine learning research

    2017-06-28 20:54:54

    Face recognition

    In computer vision, face images have been used extensively to develop face recognition systems, face detection, and many other projects that use images of faces.

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Face Recognition Technology (FERET) | 11338 images of 1199 individuals in different positions and at different times. | None. | 11,338 | Images | Classification, face recognition | 2003 | [6][7] | United States Department of Defense |
    | CMU Pose, Illumination, and Expression (PIE) | 41,368 color images of 68 people in 13 different poses. | Images labeled with expressions. | 41,368 | Images, text | Classification, face recognition | 2000 | [8][9] | R. Gross et al. |
    | SCFace | Color images of faces at various angles. | Location of facial features extracted. Coordinates of features given. | 4,160 | Images, text | Classification, face recognition | 2011 | [10][11] | M. Grgic et al. |
    | YouTube Faces DB | Videos of 1,595 different people gathered from YouTube. Each clip is between 48 and 6,070 frames. | Identity of those appearing in videos and descriptors. | 3,425 videos | Video, text | Video classification, face recognition | 2011 | [12][13] | L. Wolf et al. |
    | 300 videos in-the-Wild | 114 videos annotated for facial landmark tracking. The 68 landmark mark-up is applied to every frame. | None | 114 videos, 218,000 frames. | Video, annotation file. | Facial landmark tracking. | 2015 | [14] | Shen, Jie et al. |
    | Grammatical Facial Expressions Dataset | Grammatical Facial Expressions from Brazilian Sign Language. | Microsoft Kinect features extracted. | 27,965 | Text | Facial gesture recognition | 2014 | [15] | F. Freitas et al. |
    | CMU Face Images Dataset | Images of faces. Each person is photographed multiple times to capture different expressions. | Labels and features. | 640 | Images, Text | Face recognition | 1999 | [16][17] | T. Mitchell |
    | Yale Face Database | Faces of 15 individuals in 11 different expressions. | Labels of expressions. | 165 | Images | Face recognition | 1997 | [18][19] | J. Yang et al. |
    | Cohn-Kanade AU-Coded Expression Database | Large database of images with labels for expressions. | Tracking of certain facial features. | 500+ sequences | Images, text | Facial expression analysis | 2000 | [20][21] | T. Kanade et al. |
    | FaceScrub | Images of public figures scrubbed from image searching. | Name and m/f annotation. | 107,818 | Images, text | Face recognition | 2014 | [22][23] | H. Ng et al. |
    | BioID Face Database | Images of faces with eye positions marked. | Manually set eye positions. | 1521 | Images, text | Face recognition | 2001 | [24][25] | BioID |
    | Skin Segmentation Dataset | Randomly sampled color values from face images. | B, G, R, values extracted. | 245,057 | Text | Segmentation, classification | 2012 | [26][27] | R. Bhatt. |
    | Bosphorus | 3D Face image database. | 34 action units and 6 expressions labeled; 24 facial landmarks labeled. | 4652 | Images, text | Face recognition, classification | 2008 | [28][29] | A Savran et al. |
    | UOY 3D-Face | neutral face, 5 expressions: anger, happiness, sadness, eyes closed, eyebrows raised. | labeling. | 5250 | Images, text | Face recognition, classification | 2004 | [30][31] | University of York |
    | CASIA | Expressions: Anger, smile, laugh, surprise, closed eyes. | None. | 4624 | Images, text | Face recognition, classification | 2007 | [32][33] | Institute of Automation, Chinese Academy of Sciences |
    | CASIA | Expressions: Anger Disgust Fear Happiness Sadness Surprise | None. | 480 | Annotated Visible Spectrum and Near Infrared Video captures at 25 frames per second | Face recognition, classification | 2011 | [34] | Zhao, G. et al. |
    | BU-3DFE | neutral face, and 6 expressions: anger, happiness, sadness, surprise, disgust, fear (4 levels). 3D images extracted. | None. | 2500 | Images, text | Facial expression recognition, classification | 2006 | [35] | Binghamton University |
    | Face Recognition Grand Challenge Dataset | Up to 22 samples for each subject. Expressions: anger, happiness, sadness, surprise, disgust, puffy. 3D Data. | None. | 4007 | Images, text | Face recognition, classification | 2004 | [36][37] | National Institute of Standards and Technology |
    | Gavabdb | Up to 61 samples for each subject. Expressions neutral face, smile, frontal accentuated laugh, frontal random gesture. 3D images. | None. | 549 | Images, text | Face recognition, classification | 2008 | [38][39] | King Juan Carlos University |
    | 3D-RMA | Up to 100 subjects, expressions mostly neutral. Several poses as well. | None. | 9971 | Images, text | Face recognition, classification | 2004 | [40][41] | Royal Military Academy (Belgium) |

    Action recognition

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Human Motion DataBase (HMDB51) | 51 action categories, each containing at least 101 clips, extracted from a range of sources. | None. | 6,766 video clips | video clips | Action classification | 2011 | [42] | H. Kuehne et al. |
    | TV Human Interaction Dataset | Videos from 20 different TV shows for predicting social actions: handshake, high five, hug, kiss and none. | None. | 6,766 video clips | video clips | Action prediction | 2013 | [43] | Patron-Perez, A. et al. |
    | UT Interaction | People acting out one of 6 actions (shake-hands, point, hug, push, kick, and punch) sometimes with multiple groups in the same video clip. | None. | 120 video clips | video clips | Action prediction | 2009 | [44] | Ryoo, M. S. et al. |
    | UT Kinect | 10 different people performing one of 10 actions (walk, sit down, stand up, pick up, carry, throw, push, pull, wave hands and clap hands) in an office setting. | None. | 200 video clips with depth information at 15 frames per second | video clips with depth information | Action classification | 2012 | [45] | Xia, L. et al. |
    | SBU Interact | Seven participants performing one of 8 actions together (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) in an office setting. | None. | Around 300 interactions | video clips with depth information | Action classification | 2012 | [46] | Yun, K. et al. |
    | Berkeley Multimodal Human Action Database (MHAD) | Recordings of a single person performing 12 actions | MoCap pre-processing | 660 action samples | 8 PhaseSpace Motion Capture, 2 Stereo Cameras, 4 Quad Cameras, 6 accelerometers, 4 microphones | Action classification | 2013 | [47] | Ofli, F. et al. |
    | UCF 101 Dataset | Self described as "a dataset of 101 human actions classes from videos in the wild." Dataset is large with over 27 hours of video. | Actions classified and labeled. | 13,000 | Video, images, text | Classification, action detection | 2012 | [48][49] | K. Soomro et al. |
    | THUMOS Dataset | Large video dataset for action classification. | Actions classified and labeled. | 45M frames of video | Video, images, text | Classification, action detection | 2013 | [50][51] | Y. Jiang et al. |
    | Activitynet | Large video dataset for activity recognition and detection. | Actions classified and labeled. | 10,024 | Video, images, text | Classification, action detection | 2015 | [52] | Heilbron et al. |
    | MSP-AVATAR | Improvised scenarios annotated for discourse functions: contrast, confirmation/negation, question, uncertainty, suggest, giving orders, warn, inform, size description, using pronouns. | Actions classified and labeled. | 74 sessions | Motion-captured video, audio | Classification, action detection | 2015 | [53] | Sadoughi, N. et al. |
    | LILiR Twotalk Corpus | Video datasets for non-verbal communication activity recognition: agreement, thinking, asking and understanding. | Actions classified and labeled. | 527 | Video | Action detection | 2011 | [54] | Sheerman-Chase et al. |

    Object detection & recognition

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | DAVIS: Densely Annotated VIdeo Segmentation | 150 video sequences containing 10459 frames with a total of 376 objects annotated. | Dataset released for the 2017 DAVIS Challenge with a dedicated workshop co-located with CVPR 2017. The videos contain several types of objects and humans with a high quality segmentation annotation. | 10,459 | Frames annotated | Video object segmentation | 2017 | [55] | Pont-Tuset, J. et al. |
    | T-LESS: An RGB-D Dataset for 6D Pose Estimation of Texture-less Objects | 30 industry-relevant objects. 39K training and 10K test images from each of three sensors. Two types of 3D models for each object. | 6D poses for all modeled objects in all images. Per-pixel labelling can be obtained by rendering of the object models at the ground truth poses. | 49,000 | RGB-D images, 3D object models | 6D object pose estimation, object detection | 2017 | [56] | T. Hodan et al. |
    | Berkeley 3-D Object Dataset | 849 images taken in 75 different scenes. About 50 different object classes are labeled. | Object bounding boxes and labeling. | 849 | labeled images, text | Object recognition | 2014 | [57][58] | A. Janoch et al. |
    | Berkeley Segmentation Data Set and Benchmarks 500 (BSDS500) | 500 natural images, explicitly separated into disjoint train, validation and test subsets + benchmarking code. Based on BSDS300. | Each image segmented by five different subjects on average. | 500 | Segmented images | Contour detection and hierarchical image segmentation | 2011 | [59] | University of California, Berkeley |
    | Microsoft Common Objects in Context (COCO) | complex everyday scenes of common objects in their natural context. | Object highlighting, labeling, and classification into 91 object types. | 2,500,000 | Labeled images, text | Object recognition | 2015 | [60][61] | T. Lin et al. |
    | SUN Database | Very large scene and object recognition database. | Places and objects are labeled. Objects are segmented. | 131,067 | Images, text | Object recognition, scene recognition | 2014 | [62][63] | J. Xiao et al. |
    | ImageNet | Labeled object image database, used in the ImageNet Large Scale Visual Recognition Challenge | Labeled objects, bounding boxes, descriptive words, SIFT features | 14,197,122 | Images, text | Object recognition, scene recognition | 2014 | [64][65] | J. Deng et al. |
    | TV News Channel Commercial Detection Dataset | TV commercials and news broadcasts. | Audio and video features extracted from still images. | 129,685 | Text | Clustering, classification | 2015 | [66][67] | P. Guha et al. |
    | Statlog (Image Segmentation) Dataset | The instances were drawn randomly from a database of 7 outdoor images and hand-segmented to create a classification for every pixel. | Many features calculated. | 2310 | Text | Classification | 1990 | [68] | University of Massachusetts |
    | Caltech 101 | Pictures of objects. | Detailed object outlines marked. | 9146 | Images | Classification, object recognition. | 2003 | [69][70] | F. Li et al. |
    | Caltech-256 | Large dataset of images for object classification. | Images categorized and hand-sorted. | 30,607 | Images, Text | Classification, object detection | 2007 | [71][72] | G. Griffin et al. |
    | SIFT10M Dataset | SIFT features of Caltech-256 dataset. | Extensive SIFT feature extraction. | 11,164,866 | Text | Classification, object detection | 2016 | [73] | X. Fu et al. |
    | LabelMe | Annotated pictures of scenes. | Objects outlined. | 187,240 | Images, text | Classification, object detection | 2005 | [74] | MIT Computer Science and Artificial Intelligence Laboratory |
    | Cityscapes Dataset | Stereo video sequences recorded in street scenes, with pixel-level annotations. Metadata also included. | Pixel-level segmentation and labeling | 25,000 | Images, text | Classification, object detection | 2016 | [75] | Daimler AG et al. |
    | PASCAL VOC Dataset | Large number of images for classification tasks. | Labeling, bounding box included | 500,000 | Images, text | Classification, object detection | 2010 | [76][77] | M. Everingham et al. |
    | CIFAR-10 Dataset | Many small, low-resolution, images of 10 classes of objects. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al. |
    | CIFAR-100 Dataset | Like CIFAR-10, above, but 100 classes of objects are given. | Classes labelled, training set splits created. | 60,000 | Images | Classification | 2009 | [65][78] | A. Krizhevsky et al. |
    | German Traffic Sign Detection Benchmark Dataset | Images from vehicles of traffic signs on German roads. These signs comply with UN standards and therefore are the same as in other countries. | Signs manually labeled | 900 | Images | Classification | 2013 | [79][80] | S Houben et al. |
    | KITTI Vision Benchmark Dataset | Autonomous vehicles driving through a mid-size city captured images of various areas using cameras and laser scanners. | Many benchmarks extracted from data. | >100 GB of data | Images, text | Classification, object detection | 2012 | [81][82] | A Geiger et al. |

    Handwriting and character recognition

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Artificial Characters Dataset | Artificially generated data describing the structure of 10 capital English letters. | Coordinates of lines drawn given as integers. Various other features. | 6000 | Text | Handwriting recognition, classification | 1992 | [83] | H. Guvenir et al. |
    | Letter Dataset | Upper case printed letters. | 17 features are extracted from all images. | 20,000 | Text | OCR, classification | 1991 | [84][85] | D. Slate et al. |
    | Character Trajectories Dataset | Labeled samples of pen tip trajectories for people writing simple characters. | 3-dimensional pen tip velocity trajectory matrix for each sample | 2858 | Text | Handwriting recognition, classification | 2008 | [86][87] | B. Williams |
    | Chars74K Dataset | Character recognition in natural images of symbols used in both English and Kannada | | 74,107 | | Character recognition, handwriting recognition, OCR, classification | 2009 | [88] | T. de Campos |
    | UJI Pen Characters Dataset | Isolated handwritten characters | Coordinates of pen position as characters were written given. | 11,640 | Text | Handwriting recognition, classification | 2009 | [89][90] | F. Prat et al. |
    | Gisette Dataset | Handwriting samples from the often-confused 4 and 9 characters. | Features extracted from images, split into train/test, handwriting images size-normalized. | 13,500 | Images, text | Handwriting recognition, classification | 2003 | [91] | Yann LeCun et al. |
    | MNIST Database | Database of handwritten digits. | Hand-labeled. | 60,000 | Images, text | Classification | 1998 | [92][93] | National Institute of Standards and Technology |
    | Optical Recognition of Handwritten Digits Dataset | Normalized bitmaps of handwritten data. | Size normalized and mapped to bitmaps. | 5620 | Images, text | Handwriting recognition, classification | 1998 | [94] | E. Alpaydin et al. |
    | Pen-Based Recognition of Handwritten Digits Dataset | Handwritten digits on electronic pen-tablet. | Feature vectors extracted to be uniformly spaced. | 10,992 | Images, text | Handwriting recognition, classification | 1998 | [95][96] | E. Alpaydin et al. |
    | Semeion Handwritten Digit Dataset | Handwritten digits from 80 people. | All handwritten digits have been normalized for size and mapped to the same grid. | 1593 | Images, text | Handwriting recognition, classification | 2008 | [97] | T. Srl |
    | HASYv2 | Handwritten mathematical symbols | All symbols are centered and of size 32px x 32px. | 168233 | Images, text | Classification | 2017 | [98] | Martin Thoma |

    Aerial images

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Aerial Image Segmentation Dataset | 80 high-resolution aerial images with spatial resolution ranging from 0.3 to 1.0. | Images manually segmented. | 80 | Images | Aerial Classification, object detection | 2013 | [99][100] | J. Yuan et al. |
    | KIT AIS Data Set | Multiple labeled training and evaluation datasets of aerial images of crowds. | Images manually labeled to show paths of individuals through crowds. | ~ 150 | Images with paths | People tracking, aerial tracking | 2012 | [101][102] | M. Butenuth et al. |
    | Wilt Dataset | Remote sensing data of diseased trees and other land cover. | Various features extracted. | 4899 | Images | Classification, aerial object detection | 2014 | [103][104] | B. Johnson |
    | Forest Type Mapping Dataset | Satellite imagery of forests in Japan. | Image wavelength bands extracted. | 326 | Text | Classification | 2015 | [105][106] | B. Johnson |
    | Overhead Imagery Research Data Set | Annotated overhead imagery. Images with multiple objects. | Over 30 annotations and over 60 statistics that describe the target within the context of the image. | 1000 | Images, text | Classification | 2009 | [107][108] | F. Tanner et al. |

    Other images

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | MPII Cooking Activities Dataset | Videos and images of various cooking activities. | Activity paths and directions, labels, fine-grained motion labeling, activity class, still image extraction and labeling. | 881,755 frames | Labeled video, images, text | Classification | 2012 | [109][110] | M. Rohrbach et al. |
    | Stanford Dogs Dataset | Images of 120 breeds of dogs from around the world. | Train/test splits and ImageNet annotations provided. | 20,580 | Images, text | Fine-grain classification | 2011 | [111][112] | A. Khosla et al. |
    | The Oxford-IIIT Pet Dataset | 37 categories of pets with roughly 200 images of each. | Breed labeled, tight bounding box, foreground-background segmentation. | ~ 7,400 | Images, text | Classification, object detection | 2012 | [112][113] | O. Parkhi et al. |
    | Corel Image Features Data Set | Database of images with features extracted. | Many features including color histogram, co-occurrence texture, and color moments. | 68,040 | Text | Classification, object detection | 1999 | [114][115] | M. Ortega-Bindenberger et al. |
    | Online Video Characteristics and Transcoding Time Dataset | Transcoding times for various different videos and video properties. | Video features given. | 168,286 | Text | Regression | 2015 | [116] | T. Deneke et al. |
    | Microsoft Sequential Image Narrative Dataset (SIND) | Dataset for sequential vision-to-language | Descriptive caption and storytelling given for each photo, and photos are arranged in sequences | 81,743 | Images, text | Visual storytelling | 2016 | [117] | Microsoft Research |
    | Caltech-UCSD Birds-200-2011 Dataset | Large dataset of images of birds. | Part locations for birds, bounding boxes, 312 binary attributes given | 11,788 | Images, text | Classification | 2011 | [118][119] | C. Wah et al. |
    | YouTube-8M | Large and diverse labeled video dataset | YouTube video IDs and associated labels from a diverse vocabulary of 4800 visual entities | 8 million | Video, text | Video classification | 2016 | [120][121] | S. Abu-El-Haija et al. |
    | YFCC100M | Large and diverse labeled image and video dataset | Flickr Videos and Images and associated description, titles, tags, and other metadata (such as EXIF and geotags) | 100 million | Video, Image, Text | Video and Image classification | 2016 | [122][123] | B. Thomee et al. |
    | Discrete LIRIS-ACCEDE | Short videos annotated for valence and arousal. | Valence and arousal labels. | 9800 | Video | Video emotion elicitation detection | 2015 | [124] | Y. Baveye et al. |
    | Continuous LIRIS-ACCEDE | Long videos annotated for valence and arousal while also collecting Galvanic Skin Response. | Valence and arousal labels. | 30 | Video | Video emotion elicitation detection | 2015 | [125] | Y. Baveye et al. |
    | MediaEval LIRIS-ACCEDE | Extension of Discrete LIRIS-ACCEDE including annotations for violence levels of the films. | Violence, valence and arousal labels. | 10900 | Video | Video emotion elicitation detection | 2015 | [126] | Y. Baveye et al. |

    Text data

    Datasets consisting primarily of text for tasks such as natural language processing, sentiment analysis, translation, and cluster analysis.


    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Amazon reviews | US product reviews from Amazon.com. | None. | ~ 82M | Text | Classification, sentiment analysis | 2015 | [127] | McAuley et al. |
    | OpinRank Review Dataset | Reviews of cars and hotels from Edmunds.com and TripAdvisor respectively. | None. | 42,230 / ~259,000 respectively | Text | Sentiment analysis, clustering | 2011 | [128][129] | K. Ganesan et al. |
    | MovieLens | 22,000,000 ratings and 580,000 tags applied to 33,000 movies by 240,000 users. | None. | ~ 22M | Text | Regression, clustering, classification | 2016 | [130] | GroupLens Research |
    | Yahoo! Music User Ratings of Musical Artists | Over 10M ratings of artists by Yahoo users. | None described. | ~ 10M | Text | Clustering, regression | 2004 | [131][132] | Yahoo! |
    | Car Evaluation Data Set | Car properties and their overall acceptability. | Six categorical features given. | 1728 | Text | Classification | 1997 | [133][134] | M. Bohanec |
    | YouTube Comedy Slam Preference Dataset | User vote data for pairs of videos shown on YouTube. Users voted on funnier videos. | Video metadata given. | 1,138,562 | Text | Classification | 2012 | [135][136] | Google |
    | Skytrax User Reviews Dataset | User reviews of airlines, airports, seats, and lounges from Skytrax. | Ratings are fine-grain and include many aspects of airport experience. | 41396 | Text | Classification, regression | 2015 | [137] | Q. Nguyen |
    | Teaching Assistant Evaluation Dataset | Teaching assistant reviews. | Features of each instance such as class, class size, and instructor are given. | 151 | Text | Classification | 1997 | [138][139] | W. Loh et al. |

    News articles

    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | NYSK Dataset | English news articles about the case relating to allegations of sexual assault against the former IMF director Dominique Strauss-Kahn. | Filtered and presented in XML format. | 10,421 | XML, text | Sentiment analysis, topic extraction | 2013 | [140] | Dermouche, M. et al. |
    | The Reuters Corpus Volume 1 | Large corpus of Reuters news stories in English. | Fine-grain categorization and topic codes. | 810,000 | Text | Classification, clustering, summarization | 2002 | [141] | Reuters |
    | The Reuters Corpus Volume 2 | Large corpus of Reuters news stories in multiple languages. | Fine-grain categorization and topic codes. | 487,000 | Text | Classification, clustering, summarization | 2005 | [142] | Reuters |
    | Thomson Reuters Text Research Collection | Large corpus of news stories. | Details not described. | 1,800,370 | Text | Classification, clustering, summarization | 2009 | [143] | T. Rose et al. |
    | Saudi Newspapers Corpus | 31,030 Arabic newspaper articles. | Metadata extracted. | 31,030 | JSON | Summarization, clustering | 2015 | [144] | M. Alhagri |


    | Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator |
    | --- | --- | --- | --- | --- | --- | --- | --- | --- |
    | Enron Email Dataset | Emails from employees at Enron organized into folders. | Attachments removed, invalid email addresses converted to user@enron.com or no_address@enron.com. | ~ 500,000 | Text | Network analysis, sentiment analysis | 2004 (2015) | [145][146] | Klimt, B. and Y. Yang |
    | Ling-Spam Dataset | Corpus containing both legitimate and spam emails. | Four versions of the corpus involving whether or not a lemmatiser or stop-list was enabled. | | Text | Classification | 2000 | [147][148] | Androutsopoulos, J. et al. |
    | SMS Spam Collection Dataset | Collected SMS spam messages. | None. | 5574 | Text | Classification | 2011 | [149][150] | T. Almeida et al. |
    | Twenty Newsgroups Dataset | Messages from 20 different newsgroups. | None. | 20,000 | Text | Natural language processing | 1999 | [151] | T. Mitchell et al. |
    | Spambase Dataset | Spam emails. | Many text features extracted. | 4601 | Text | Spam detection, classification | 1999 | [152] | M. Hopkins et al. |

    Twitter and tweets

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Sentiment140 | Tweet data from 2009 including original text, time stamp, user and sentiment. | Classified using distant supervision from presence of emoticon in tweet. | 1,578,627 | Tweets, comma-separated values | Sentiment analysis | 2009 | [153][154] | A. Go et al.
    ASU Twitter Dataset | Twitter network data, not actual tweets. Shows connections between a large number of users. | None. | 11,316,811 users, 85,331,846 connections | Text | Clustering, graph analysis | 2009 | [155][156] | R. Zafarani et al.
    SNAP Social Circles: Twitter Database | Large Twitter network data. | Node features, circles, and ego networks. | 1,768,149 | Text | Clustering, graph analysis | 2012 | [157][158] | J. McAuley et al.
    Twitter Dataset for Arabic Sentiment Analysis | Arabic tweets. | Samples hand-labeled as positive or negative. | 2000 | Text | Classification | 2014 | [159][160] | N. Abdulla
    Buzz in Social Media Dataset | Data from Twitter and Tom’s Hardware. This dataset focuses on specific buzz topics being discussed on those sites. | Data is windowed so that the user can attempt to predict the events leading up to social media buzz. | 140,000 | Text | Regression, classification | 2013 | [161][162] | F. Kawala et al.
    Paraphrase and Semantic Similarity in Twitter (PIT) | This dataset focuses on whether tweets have (almost) the same meaning/information or not. Manually labeled. | Tokenization, part-of-speech and named entity tagging. | 18,762 | Text | Regression, classification | 2015 | [163][164] | Xu et al.

    Other text

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Legal Case Reports | Federal Court of Australia cases from 2006–2009. | None. | 4,000 | Text | Summarization, citation analysis | 2012 | [165][166] | F. Galgani et al.
    Blogger Authorship Corpus | Blog entries of 19,320 people from blogger.com. | Blogger self-provided gender, age, industry, and astrological sign. | 681,288 | Text | Sentiment analysis, summarization, classification | 2006 | [167][168] | J. Schler et al.
    Social Structure of Facebook Networks | Large dataset of the social structure of Facebook. | None. | 100 colleges covered | Text | Network analysis, clustering | 2012 | [169][170] | A. Traud et al.
    Dataset for the Machine Comprehension of Text | Stories and associated questions for testing comprehension of text. | None. | 660 | Text | Natural language processing, machine comprehension | 2013 | [171][172] | M. Richardson et al.
    The Penn Treebank Project | Naturally occurring text annotated for linguistic structure. | Text is parsed into semantic trees. | ~ 1M words | Text | Natural language processing, summarization | 1995 | [173][174] | M. Marcus et al.
    DEXTER Dataset | Task given is to determine, from features given, which articles are about corporate acquisitions. | Features extracted include word stems. Distractor features included. | 2600 | Text | Classification | 2008 | [175] | Reuters
    Google Books N-grams | N-grams from a very large corpus of books. | None. | 2.2 TB of text | Text | Classification, clustering, regression | 2011 | [176][177] | Google
    Personae Corpus | Collected for experiments in Authorship Attribution and Personality Prediction. Consists of 145 Dutch-language essays. | In addition to normal texts, syntactically annotated texts are given. | 145 | Text | Classification, regression | 2008 | [178][179] | K. Luyckx et al.
    CNAE-9 Dataset | Categorization task for free text descriptions of Brazilian companies. | Word frequency has been extracted. | 1080 | Text | Classification | 2012 | [180][181] | P. Ciarelli et al.
    Sentiment Labeled Sentences Dataset | 3000 sentiment labeled sentences. | Sentiment of each sentence has been hand labeled as positive or negative. | 3000 | Text | Classification, sentiment analysis | 2015 | [182][183] | D. Kotzias
    BlogFeedback Dataset | Dataset to predict the number of comments a post will receive based on features of that post. | Many features of each post extracted. | 60,021 | Text | Regression | 2014 | [184][185] | K. Buza
    Stanford Natural Language Inference (SNLI) Corpus | Image captions matched with newly constructed sentences to form entailment, contradiction, or neutral pairs. | Entailment class labels, syntactic parsing by the Stanford PCFG parser. | 570,000 | Text | Natural language inference/recognizing textual entailment | 2015 | [186] | S. Bowman et al.

    Sound data

    Datasets of sounds and sound features.

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Zero Resource Speech Challenge 2015 | Spontaneous speech (English), read speech (Xitsonga). | Raw WAV. | English: 5h, 12 speakers; Xitsonga: 2h30, 24 speakers | Sound | Unsupervised discovery of speech features/subword units/word units | 2015 | [187][188] www.zerospeech.com/2015 | Versteegh et al.
    Parkinson Speech Dataset | Multiple recordings of people with and without Parkinson’s Disease. | Voice features extracted, disease scored by physician using unified Parkinson’s disease rating scale. | 1,040 | Text | Classification, regression | 2013 | [189][190] | B. E. Sakar et al.
    Spoken Arabic Digits | Spoken Arabic digits from 44 male and 44 female speakers. | Time-series of mel-frequency cepstrum coefficients. | 8,800 | Text | Classification | 2010 | [191][192] | M. Bedda et al.
    ISOLET Dataset | Spoken letter names. | Features extracted from sounds. | 7797 | Text | Classification | 1994 | [193][194] | R. Cole et al.
    Japanese Vowels Dataset | Nine male speakers uttered two Japanese vowels successively. | 12-degree linear prediction analysis applied to obtain a discrete-time series with 12 cepstrum coefficients. | 640 | Text | Classification | 1999 | [195][196] | M. Kudo et al.
    Parkinson’s Telemonitoring Dataset | Multiple recordings of people with and without Parkinson’s Disease. | Sound features extracted. | 5875 | Text | Classification | 2009 | [197][198] | A. Tsanas et al.
    TIMIT | Recordings of 630 speakers of eight major dialects of American English, each reading ten phonetically rich sentences. | Speech is lexically and phonemically transcribed. | 6300 | Text | Speech recognition, classification | 1986 | [199][200] | J. Garofolo et al.
    Arabic Speech Corpus | A single-speaker, Modern Standard Arabic (MSA) speech corpus with phonetic and orthographic transcripts aligned to phoneme level. | Speech is orthographically and phonetically transcribed with stress marks. | ~1900 | Text, WAV | Speech synthesis, speech recognition, corpus alignment, speech therapy, education | 2016 | [201] | N. Halabi


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Geographical Original of Music Data Set | Audio features of music samples from different locations. | Audio features extracted using MARSYAS software. | 1,059 | Text | Geographical classification, clustering | 2014 | [202][203] | F. Zhou et al.
    Million Song Dataset | Audio features from one million different songs. | Audio features extracted. | 1M | Text | Classification, clustering | 2011 | [204][205] | T. Bertin-Mahieux et al.
    Free Music Archive | Audio under Creative Commons from 100k songs (343 days, 1TiB) with a hierarchy of 161 genres, metadata, user data, free-form text. | Raw audio and audio features. | 106,574 | Text, MP3 | Classification, recommendation | 2017 | [206] | M. Defferrard et al.
    Bach Choral Harmony Dataset | Bach chorale chords. | Audio features extracted. | 5665 | Text | Classification | 2014 | [207][208] | D. Radicioni et al.

    Other sounds

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    UrbanSound | Labeled sound recordings of sounds like air conditioners, car horns and children playing. | Sorted into folders by class of events as well as metadata in a JSON file and annotations in a CSV file. | 1,059 | Sound | Classification | 2014 | [209][210] | J. Salamon et al.

    Signal data

    Datasets containing electric signal information requiring some sort of signal processing for further analysis.

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Witty Worm Dataset | Dataset detailing the spread of the Witty worm and the infected computers. | Split into a publicly available set and a restricted set containing more sensitive information like IP and UDP headers. | 55,909 IP addresses | Text | Classification | 2004 | [211][212] | Center for Applied Internet Data Analysis
    Cuff-Less Blood Pressure Estimation Dataset | Cleaned vital signals from human patients which can be used to estimate blood pressure. | 125 Hz vital signs have been cleaned. | 12,000 | Text | Classification, regression | 2015 | [213][214] | M. Kachuee et al.
    Gas Sensor Array Drift Dataset | Measurements from 16 chemical sensors utilized in simulations for drift compensation. | Extensive number of features given. | 13,910 | Text | Classification | 2012 | [215][216] | A. Vergara
    Servo Dataset | Data covering the nonlinear relationships observed in a servo-amplifier circuit. | Levels of various components as a function of other components are given. | 167 | Text | Regression | 1993 | [217][218] | K. Ullrich
    UJIIndoorLoc-Mag Dataset | Indoor localization database to test indoor positioning systems. Data is magnetic field based. | Train and test splits given. | 40,000 | Text | Classification, regression, clustering | 2015 | [219][220] | D. Rambla et al.
    Sensorless Drive Diagnosis Dataset | Electrical signals from motors with defective components. | Statistical features extracted. | 58,508 | Text | Classification | 2015 | [221][222] | M. Bator


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Wearable Computing: Classification of Body Postures and Movements (PUC-Rio) | People performing five standard actions while wearing motion trackers. | None. | 165,632 | Text | Classification | 2013 | [223][224] | Pontifical Catholic University of Rio de Janeiro
    Gesture Phase Segmentation Dataset | Features extracted from video of people doing various gestures. | Features extracted aim at studying gesture phase segmentation. | 9900 | Text | Classification, clustering | 2014 | [225][226] | R. Madeo et al.
    Vicon Physical Action Data Set | 10 normal and 10 aggressive physical actions that measure the human activity tracked by a 3D tracker. | Many parameters recorded by 3D tracker. | 3000 | Text | Classification | 2011 | [227][228] | T. Theodoridis
    Daily and Sports Activities Dataset | Motor sensor data for 19 daily and sports activities. | Many sensors given, no preprocessing done on signals. | 9120 | Text | Classification | 2013 | [229][230] | B. Barshan et al.
    Human Activity Recognition Using Smartphones Dataset | Gyroscope and accelerometer data from people wearing smartphones and performing normal actions. | Actions performed are labeled, all signals preprocessed for noise. | 10,299 | Text | Classification | 2012 | [231][232] | J. Reyes-Ortiz et al.
    Australian Sign Language Signs | Australian sign language signs captured by motion-tracking gloves. | None. | 2565 | Text | Classification | 2002 | [233][234] | M. Kadous
    Weight Lifting Exercises monitored with Inertial Measurement Units | Five variations of the biceps curl exercise monitored with IMUs. | Some statistics calculated from raw data. | 39,242 | Text | Classification | 2013 | [235][236] | W. Ugulino et al.
    sEMG for Basic Hand movements Dataset | Two databases of surface electromyographic signals of 6 hand movements. | None. | 3000 | Text | Classification | 2014 | [237][238] | C. Sapsanis et al.
    REALDISP Activity Recognition Dataset | Evaluate techniques dealing with the effects of sensor displacement in wearable activity recognition. | None. | 1419 | Text | Classification | 2014 | [238][239] | O. Banos et al.
    Heterogeneity Activity Recognition Dataset | Data from multiple different smart devices for humans performing various activities. | None. | 43,930,257 | Text | Classification, clustering | 2015 | [240][241] | A. Stisen et al.
    Indoor User Movement Prediction from RSS Data | Temporal wireless network data that can be used to track the movement of people in an office. | None. | 13,197 | Text | Classification | 2016 | [242][243] | D. Bacciu
    PAMAP2 Physical Activity Monitoring Dataset | 18 different types of physical activities performed by 9 subjects wearing 3 IMUs. | None. | 3,850,505 | Text | Classification | 2012 | [244] | A. Reiss
    OPPORTUNITY Activity Recognition Dataset | Human activity recognition from wearable, object, and ambient sensors; devised to benchmark human activity recognition algorithms. | None. | 2551 | Text | Classification | 2012 | [245][246] | D. Roggen et al.
    Real World Activity Recognition Dataset | Human activity recognition from wearable devices. Distinguishes between seven on-body device positions and comprises six different kinds of sensors. | None. | 3,150,000 (per sensor) | Text | Classification | 2016 | [247] | T. Sztyler et al.

    Other signals

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Wine Dataset | Chemical analysis of wines grown in the same region in Italy but derived from three different cultivars. | 13 properties of each wine are given. | 178 | Text | Classification, regression | 1991 | [248][249] | M. Forina et al.
    Combined Cycle Power Plant Data Set | Data from various sensors within a power plant running for 6 years. | None. | 9568 | Text | Regression | 2014 | [250][251] | P. Tufekci et al.

    Physical data

    Datasets from physical systems.

    High-energy physics

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    HIGGS Dataset | Monte Carlo simulations of particle accelerator collisions. | 28 features of each collision are given. | 11M | Text | Classification | 2014 | [252][253][254] | D. Whiteson
    HEPMASS Dataset | Monte Carlo simulations of particle accelerator collisions. Goal is to separate the signal from noise. | 28 features of each collision are given. | 10,500,000 | Text | Classification | 2016 | [253][254][255] | D. Whiteson


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Yacht Hydrodynamics Dataset | Yacht performance based on dimensions. | Six features are given for each yacht. | 308 | Text | Regression | 2013 | [256][257] | R. Lopez
    Robot Execution Failures Dataset | 5 data sets that center around robotic failure to execute common tasks. | Integer valued features such as torque and other sensor measurements. | 463 | Text | Classification | 1999 | [258] | L. Seabra et al.
    Pittsburgh Bridges Dataset | Design description is given in terms of several properties of various bridges. | Various bridge features are given. | 108 | Text | Classification | 1990 | [259][260] | Y. Reich et al.
    Automobile Dataset | Data about automobiles, their insurance risk, and their normalized losses. | Car features extracted. | 205 | Text | Regression | 1987 | [261][262] | J. Schlimmer et al.
    Auto MPG Dataset | MPG data for cars. | Eight features of each car given. | 398 | Text | Regression | 1993 | [263] | Carnegie Mellon University
    Energy Efficiency Dataset | Heating and cooling requirements given as a function of building parameters. | Building parameters given. | 768 | Text | Classification, regression | 2012 | [264][265] | A. Xifara et al.
    Airfoil Self-Noise Dataset | A series of aerodynamic and acoustic tests of two and three-dimensional airfoil blade sections. | Data about frequency, angle of attack, etc., are given. | 1503 | Text | Regression | 2014 | [266] | R. Lopez
    Challenger USA Space Shuttle O-Ring Dataset | Attempt to predict O-ring problems given past Challenger data. | Several features of each flight, such as launch temperature, are given. | 23 | Text | Regression | 1993 | [267][268] | D. Draper et al.
    Statlog (Shuttle) Dataset | NASA space shuttle datasets. | Nine features given. | 58,000 | Text | Classification | 2002 | [269] | NASA


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Volcanoes on Venus – JARtool experiment Dataset | Venus images returned by the Magellan spacecraft. | Images are labeled by humans. | not given | Images | Classification | 1991 | [270][271] | M. Burl
    MAGIC Gamma Telescope Dataset | Monte Carlo generated high-energy gamma particle events. | Numerous features extracted from the simulations. | 19,020 | Text | Classification | 2007 | [271][272] | R. Bock
    Solar Flare Dataset | Measurements of the number of certain types of solar flare events occurring in a 24-hour period. | Many solar flare-specific features are given. | 1389 | Text | Regression, classification | 1989 | [273] | G. Bradshaw

    Earth science

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Volcanoes of the World | Volcanic eruption data for all known volcanic events on earth. | Details such as region, subregion, tectonic setting, dominant rock type are given. | 1535 | Text | Regression, classification | 2013 | [274] | E. Venzke et al.
    Seismic-bumps Dataset | Seismic activities from a coal mine. | Seismic activity was classified as hazardous or not. | 2584 | Text | Classification | 2013 | [275][276] | M. Sikora et al.

    Other physical

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Concrete Compressive Strength Dataset | Dataset of concrete properties and compressive strength. | Nine features are given for each sample. | 1030 | Text | Regression | 2007 | [277][278] | I. Yeh
    Concrete Slump Test Dataset | Concrete slump flow given in terms of properties. | Features of concrete given such as fly ash, water, etc. | 103 | Text | Regression | 2009 | [279][280] | I. Yeh
    Musk Dataset | Predict if a molecule, given the features, will be a musk or a non-musk. | 168 features given for each molecule. | 6598 | Text | Classification | 1994 | [281] | Arris Pharmaceutical Corp.
    Steel Plates Faults Dataset | Steel plates of 7 different types. | 27 features given for each sample. | 1941 | Text | Classification | 2010 | [282] | Semeion Research Center

    Biological data

    Datasets from biological systems.

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    EEG Database | Study to examine EEG correlates of genetic predisposition to alcoholism. | Measurements from 64 electrodes placed on the scalp sampled at 256 Hz (3.9 ms epoch) for 1 second. | 122 | Text | Classification | 1999 | [283][284] | H. Begleiter
    P300 Interface Dataset | Data from nine subjects collected using P300-based brain-computer interface for disabled subjects. | Split into four sessions for each subject. MATLAB code given. | 1,224 | Text | Classification | 2008 | [285][286] | U. Hoffman et al.
    Heart Disease Data Set | Attributes of patients with and without heart disease. | 75 attributes given for each patient with some missing values. | 303 | Text | Classification | 1988 | [287][288] | A. Janosi et al.
    Breast Cancer Wisconsin (Diagnostic) Dataset | Dataset of features of breast masses. Diagnosis by physician is given. | 10 features for each sample are given. | 569 | Text | Classification | 1995 | [289][290] | W. Wolberg et al.
    National Survey on Drug Use and Health | Large scale survey on health and drug use in the United States. | None. | 55,268 | Text | Classification, regression | 2012 | [291] | United States Department of Health and Human Services
    Lung Cancer Dataset | Lung cancer dataset without attribute definitions. | 56 features are given for each case. | 32 | Text | Classification | 1992 | [292][293] | Z. Hong et al.
    Arrhythmia Dataset | Data for a group of patients, of which some have cardiac arrhythmia. | 276 features for each instance. | 452 | Text | Classification | 1998 | [294][295] | H. Altay et al.
    Diabetes 130-US hospitals for years 1999–2008 Dataset | 9 years of readmission data across 130 US hospitals for patients with diabetes. | Many features of each readmission are given. | 100,000 | Text | Classification, clustering | 2014 | [296][297] | J. Clore et al.
    Diabetic Retinopathy Debrecen Dataset | Features extracted from images of eyes with and without diabetic retinopathy. | Features extracted and conditions diagnosed. | 1151 | Text | Classification | 2014 | [298][299] | B. Antal et al.
    Liver Disorders Dataset | Data for people with liver disorders. | Seven biological features given for each patient. | 345 | Text | Classification | 1990 | [300][301] | Bupa Medical Research Ltd.
    Thyroid Disease Dataset | 10 databases of thyroid disease patient data. | None. | 7200 | Text | Classification | 1987 | [302][303] | R. Quinlan
    Mesothelioma Dataset | Mesothelioma patient data. | Large number of features, including asbestos exposure, are given. | 324 | Text | Classification | 2016 | [304][305] | A. Tanrikulu et al.
    KEGG Metabolic Reaction Network (Undirected) Dataset | Network of metabolic pathways. A reaction network and a relation network are given. | Detailed features for each network node and pathway are given. | 65,554 | Text | Classification, clustering, regression | 2011 | [306] | M. Naeem et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Abalone Dataset | Physical measurements of Abalone. Weather patterns and location are also given. | None. | 4177 | Text | Regression | 1995 | [307] | Marine Research Laboratories – Taroona
    Zoo Dataset | Artificial dataset covering 7 classes of animals. | Animals are classed into 7 categories and features are given for each. | 101 | Text | Classification | 1990 | [308] | R. Forsyth
    Demospongiae Dataset | Data about marine sponges. | 503 sponges in the Demosponge class are described by various features. | 503 | Text | Classification | 2010 | [309] | E. Armengol et al.
    Splice-junction Gene Sequences Dataset | Primate splice-junction gene sequences (DNA) with associated imperfect domain theory. | None. | 3190 | Text | Classification | 1992 | [293] | G. Towell et al.
    Mice Protein Expression Dataset | Expression levels of 77 proteins measured in the cerebral cortex of mice. | None. | 1080 | Text | Classification, clustering | 2015 | [310][311] | C. Higuera et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Forest Fires Dataset | Forest fires and their properties. | 13 features of each fire are extracted. | 517 | Text | Regression | 2008 | [312][313] | P. Cortez et al.
    Iris Dataset | Three types of iris plants are described by 4 different attributes. | None. | 150 | Text | Classification | 1936 | [314][315] | R. Fisher
    Plant Species Leaves Dataset | Sixteen leaf samples for each of one hundred plant species. | Shape descriptor, fine-scale margin, and texture histograms are given. | 1600 | Text | Classification | 2012 | [316][317] | J. Cope et al.
    Mushroom Dataset | Mushroom attributes and classification. | Many properties of each mushroom are given. | 8124 | Text | Classification | 1987 | [318] | J. Schlimmer
    Soybean Dataset | Database of diseased soybean plants. | 35 features for each plant are given. Plants are classified into 19 categories. | 307 | Text | Classification | 1988 | [319] | R. Michalski et al.
    Seeds Dataset | Measurements of geometrical properties of kernels belonging to three different varieties of wheat. | None. | 210 | Text | Classification, clustering | 2012 | [320][321] | Charytanowicz et al.
    Covertype Dataset | Data for predicting forest cover type strictly from cartographic variables. | Many geographical features given. | 581,012 | Text | Classification | 1998 | [322][323] | J. Blackard et al.
    Abscisic Acid Signaling Network Dataset | Data for a plant signaling network. Goal is to determine the set of rules that governs the network. | None. | 300 | Text | Causal-discovery | 2008 | [324] | J. Jenkens et al.
    Folio Dataset | 20 photos of leaves for each of 32 species. | None. | 637 | Images, text | Classification, clustering | 2015 | [325][326] | T. Munisami et al.
    Oxford Flower Dataset | 17 category dataset of flowers. | Train/test splits, labeled images. | 1360 | Images, text | Classification | 2006 | [113][327] | M-E Nilsback et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Ecoli Dataset | Protein localization sites. | Various features of the protein localization sites are given. | 336 | Text | Classification | 1996 | [328][329] | K. Nakai et al.
    MicroMass Dataset | Identification of microorganisms from mass-spectrometry data. | Various mass spectrometer features. | 931 | Text | Classification | 2013 | [330][331] | P. Mahe et al.
    Yeast Dataset | Predictions of cellular localization sites of proteins. | Eight features given per instance. | 1484 | Text | Classification | 1996 | [332][333] | K. Nakai et al.

    Drug Discovery

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Tox21 Dataset | Prediction of outcome of biological assays. | Chemical descriptors of molecules are given. | 12707 | Text | Classification | 2016 | [334] | A. Mayr et al.

    Anomaly data

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Numenta Anomaly Benchmark (NAB) | Data are ordered, timestamped, single-valued metrics. All data files contain anomalies, unless otherwise noted. | ? | 50+ files | Comma separated values | Anomaly detection | 2016 (continually updated) | [335] | Numenta

    Multivariate data

    Datasets consisting of rows of observations and columns of attributes characterizing those observations. Typically used for regression analysis or classification, but other types of algorithms can also be used. This section includes datasets that do not fit in the above categories.

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Dow Jones Index | Weekly data of stocks from the first and second quarters of 2011. | Calculated values included, such as percentage change and lags. | 750 | Comma separated values | Classification, regression, time series | 2014 | [336][337] | M. Brown et al.
    Statlog (Australian Credit Approval) | Credit card applications either accepted or rejected and attributes about the application. | Attribute names are removed as well as identifying information. Factors have been relabeled. | 690 | Comma separated values | Classification | 1987 | [338][339] | R. Quinlan
    eBay auction data | Auction data from various eBay.com objects over various length auctions. | Contains all bids, bidderID, bid times, and opening prices. | ~ 550 | Text | Regression, classification | 2012 | [340][341] | G. Shmueli et al.
    Statlog (German Credit Data) | Binary credit classification into “good” or “bad” with many features. | Various financial features of each person are given. | 690 | Text | Classification | 1994 | [342] | H. Hofmann
    Bank Marketing Dataset | Data from a large marketing campaign carried out by a large bank. | Many attributes of the clients contacted are given. Whether the client subscribed to the bank is also given. | 45,211 | Text | Classification | 2012 | [343][344] | S. Moro et al.
    Istanbul Stock Exchange Dataset | Several stock indexes tracked for almost two years. | None. | 536 | Text | Classification, regression | 2013 | [345][346] | O. Akbilgic
    Default of Credit Card Clients | Credit default data for Taiwanese creditors. | Various features about each account are given. | 30,000 | Text | Classification | 2016 | [347][348] | I. Yeh


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Cloud DataSet | Data about 1024 different clouds. | Image features extracted. | 1024 | Text | Classification, clustering | 1989 | [349] | P. Collard
    El Nino Dataset | Oceanographic and surface meteorological readings taken from a series of buoys positioned throughout the equatorial Pacific. | 12 weather attributes are measured at each buoy. | 178080 | Text | Regression | 1999 | [350] | Pacific Marine Environmental Laboratory
    Greenhouse Gas Observing Network Dataset | Time-series of greenhouse gas concentrations at 2921 grid cells in California created using simulations of the weather. | None. | 2921 | Text | Regression | 2015 | [351] | D. Lucas
    Atmospheric CO2 from Continuous Air Samples at Mauna Loa Observatory | Continuous air samples in Hawaii, USA. 44 years of records. | None. | 44 years | Text | Regression | 2001 | [352] | Mauna Loa Observatory
    Ionosphere Dataset | Radar data from the ionosphere. Task is to classify into good and bad radar returns. | Many radar features given. | 351 | Text | Classification | 1989 | [303][353] | Johns Hopkins University
    Ozone Level Detection Dataset | Two ground ozone level datasets. | Many features given, including weather conditions at time of measurement. | 2536 | Text | Classification | 2008 | [354][355] | K. Zhang et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Adult Dataset | Census data from 1994 containing demographic features of adults and their income. | Cleaned and anonymized. | 48,842 | Comma separated values | Classification | 1996 | [356] | United States Census Bureau
    Census-Income (KDD) | Weighted census data from the 1994 and 1995 Current Population Surveys. | Split into training and test sets. | 299,285 | Comma separated values | Classification | 2000 | [357][358] | United States Census Bureau
    IPUMS Census Database | Census data from the Los Angeles and Long Beach areas. | None. | 256,932 | Text | Classification, regression | 1999 | [359] | IPUMS
    US Census Data 1990 | Partial data from 1990 US census. | Results randomized and useful attributes selected. | 2,458,285 | Text | Classification, regression | 1990 | [360] | United States Census Bureau


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Bike Sharing Dataset | Hourly and daily count of rental bikes in a large city. | Many features, including weather, length of trip, etc., are given. | 17,389 | Text | Regression | 2013 | [361][362] | H. Fanaee-T
    New York City Taxi Trip Data | Trip data for yellow and green taxis in New York City. | Gives pick up and drop off locations, fares, and other details of trips. | 6 years | Text | Classification, clustering | 2015 | [363] | New York City Taxi and Limousine Commission
    Taxi Service Trajectory ECML PKDD | Trajectories of all taxis in a large city. | Many features given, including start and stop points. | 1,710,671 | Text | Clustering, causal-discovery | 2015 | [364][365] | M. Ferreira et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Webpages from Common Crawl 2012 | Large collection of webpages and how they are connected via hyperlinks. | None. | 3.5B | Text | Clustering, classification | 2013 | [366] | V. Granville
    Internet Advertisements Dataset | Dataset for predicting if a given image is an advertisement or not. | Features encode geometry of ads and phrases occurring in the URL. | 3279 | Text | Classification | 1998 | [367][368] | N. Kushmerick
    Internet Usage Dataset | General demographics of internet users. | None. | 10,104 | Text | Classification, clustering | 1999 | [369] | D. Cook
    URL Dataset | 120 days of URL data from a large conference. | Many features of each URL are given. | 2,396,130 | Text | Classification | 2009 | [370][371] | J. Ma
    Phishing Websites Dataset | Dataset of phishing websites. | Many features of each site are given. | 2456 | Text | Classification | 2015 | [372] | R. Mustafa et al.
    Online Retail Dataset | Online transactions for a UK online retailer. | Details of each transaction given. | 541,909 | Text | Classification, clustering | 2015 | [373] | D. Chen
    Freebase Simple Topic Dump | Freebase is an online effort to structure all human knowledge. | Topics from Freebase have been extracted. | large | Text | Classification, clustering | 2011 | [374][375] | Freebase
    Farm Ads Dataset | The text of farm ads from websites. Binary approval or disapproval by content owners is given. | SVMlight sparse vectors of text words in ads calculated. | 4143 | Text | Classification | 2011 | [376][377] | C. Masterharm et al.


    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Poker Hand Dataset | 5 card hands from a standard 52 card deck. | Attributes of each hand are given, including the Poker hands formed by the cards it contains. | 1,025,010 | Text | Regression, classification | 2007 | [378] | R. Cattral
    Connect-4 Dataset | Contains all legal 8-ply positions in the game of connect-4 in which neither player has won yet, and in which the next move is not forced. | None. | 67,557 | Text | Classification | 1995 | [379] | J. Tromp
    Chess (King-Rook vs. King) Dataset | Endgame Database for White King and Rook against Black King. | None. | 28,056 | Text | Classification | 1994 | [380][381] | M. Bain et al.
    Chess (King-Rook vs. King-Pawn) Dataset | King+Rook versus King+Pawn on a7. | None. | 3196 | Text | Classification | 1989 | [382] | R. Holte
    Tic-Tac-Toe Endgame Dataset | Binary classification for win conditions in tic-tac-toe. | None. | 958 | Text | Classification | 1991 | [383] | D. Aha

    Other multivariate[edit]

    Dataset Name | Brief description | Preprocessing | Instances | Format | Default Task | Created (updated) | Reference | Creator
    Housing Data Set | Median home values of Boston with associated home and neighborhood attributes. | None. | 506 | Text | Regression | 1993 | [384] | D. Harrison et al.
    The Getty Vocabularies | Structured terminology for art and other material culture, archival materials, visual surrogates, and bibliographic materials. | None. | large | Text | Classification | 2015 | [385] | Getty Center
    Yahoo! Front Page Today Module User Click Log | User click log for news articles displayed in the Featured Tab of the Today Module on Yahoo! Front Page. | Conjoint analysis with a bilinear model. | 45,811,883 user visits | Text | Regression, clustering | 2009 | [386][387] | Chu et al.
    British Oceanographic Data Centre | Biological, chemical, physical and geophysical data for oceans. 22K variables tracked. | Various. | 22K variables, many instances | Text | Regression, clustering | 2015 | [388] | British Oceanographic Data Centre
    Congressional Voting Records Dataset | Voting data for all USA representatives on 16 issues. | Beyond the raw voting data, various other features are provided. | 435 | Text | Classification | 1987 | [389] | J. Schlimmer
    Entree Chicago Recommendation Dataset | Record of user interactions with Entree Chicago recommendation system. | Details of each user's usage of the app are recorded in detail. | 50,672 | Text | Regression, recommendation | 2000 | [390] | R. Burke
    Insurance Company Benchmark (COIL 2000) | Information on customers of an insurance company. | Many features of each customer and the services they use. | 9,000 | Text | Regression, classification | 2000 | [391][392] | P. van der Putten
    Nursery Dataset | Data from applicants to nursery schools. | Data about applicant's family and various other factors included. | 12,960 | Text | Classification | 1997 | [393][394] | V. Rajkovic et al.
    University Dataset | Data describing attributes of a large number of universities. | None. | 285 | Text | Clustering, classification | 1988 | [395] | S. Sounders et al.
    Blood Transfusion Service Center Dataset | Data from blood transfusion service center. Gives data on donors' return rate, frequency, etc. | None. | 748 | Text | Classification | 2008 | [396][397] | I. Yeh
    Record Linkage Comparison Patterns Dataset | Large dataset of records. Task is to link relevant records together. | Blocking procedure applied to select only certain record pairs. | 5,749,132 | Text | Classification | 2011 | [398][399] | University of Mainz
    Nomao Dataset | Nomao collects data about places from many different sources. Task is to detect items that describe the same place. | Duplicates labeled. | 34,465 | Text | Classification | 2012 | [400][401] | Nomao Labs
    Movie Dataset | Data for 10,000 movies. | Several features for each movie are given. | 10,000 | Text | Clustering, classification | 1999 | [402] | G. Wiederhold
    Open University Learning Analytics Dataset | Information about students and their interactions with a virtual learning environment. | None. | ~ 30,000 | Text | Classification, clustering, regression | 2015 | [403][404] | J. Kuzilek et al.


  • Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem ...

    The human visual system is one of the wonders of the world. Consider the following sequence of handwritten digits:

    Most people effortlessly recognize those digits as 504192. That ease is deceptive. In each hemisphere of our brain, humans have a primary visual cortex, also known as V1, containing 140 million neurons, with tens of billions of connections between them. And yet human vision involves not just V1, but an entire series of visual cortices - V2, V3, V4, and V5 - doing progressively more complex image processing. We carry in our heads a supercomputer, tuned by evolution over hundreds of millions of years, and superbly adapted to understand the visual world. Recognizing handwritten digits isn't easy. Rather, we humans are stupendously, astoundingly good at making sense of what our eyes show us. But nearly all that work is done unconsciously. And so we don't usually appreciate how tough a problem our visual systems solve.

    The difficulty of visual pattern recognition becomes apparent if you attempt to write a computer program to recognize digits like those above. What seems easy when we do it ourselves suddenly becomes extremely difficult. Simple intuitions about how we recognize shapes - "a 9 has a loop at the top, and a vertical stroke in the bottom right" - turn out to be not so simple to express algorithmically. When you try to make such rules precise, you quickly get lost in a morass of exceptions and caveats and special cases. It seems hopeless.

    Neural networks approach the problem in a different way. The idea is to take a large number of handwritten digits, known as training examples,

    and then develop a system which can learn from those training examples. In other words, the neural network uses the examples to automatically infer rules for recognizing handwritten digits. Furthermore, by increasing the number of training examples, the network can learn more about handwriting, and so improve its accuracy. So while I've shown just 100 training digits above, perhaps we could build a better handwriting recognizer by using thousands or even millions or billions of training examples.

    In this chapter we'll write a computer program implementing a neural network that learns to recognize handwritten digits. The program is just 74 lines long, and uses no special neural network libraries. But this short program can recognize digits with an accuracy over 96 percent, without human intervention. Furthermore, in later chapters we'll develop ideas which can improve accuracy to over 99 percent. In fact, the best commercial neural networks are now so good that they are used by banks to process cheques, and by post offices to recognize addresses.

    We're focusing on handwriting recognition because it's an excellent prototype problem for learning about neural networks in general. As a prototype it hits a sweet spot: it's challenging - it's no small feat to recognize handwritten digits - but it's not so difficult as to require an extremely complicated solution, or tremendous computational power. Furthermore, it's a great way to develop more advanced techniques, such as deep learning. And so throughout the book we'll return repeatedly to the problem of handwriting recognition. Later in the book, we'll discuss how these ideas may be applied to other problems in computer vision, and also in speech, natural language processing, and other domains.

    Of course, if the point of the chapter was only to write a computer program to recognize handwritten digits, then the chapter would be much shorter! But along the way we'll develop many key ideas about neural networks, including two important types of artificial neuron (the perceptron and the sigmoid neuron), and the standard learning algorithm for neural networks, known as stochastic gradient descent. Throughout, I focus on explaining why things are done the way they are, and on building your neural networks intuition. That requires a lengthier discussion than if I just presented the basic mechanics of what's going on, but it's worth it for the deeper understanding you'll attain. Amongst the payoffs, by the end of the chapter we'll be in position to understand what deep learning is, and why it matters.


    What is a neural network? To get started, I'll explain a type of artificial neuron called a perceptron. Perceptrons were developed in the 1950s and 1960s by the scientist Frank Rosenblatt, inspired by earlier work by Warren McCulloch and Walter Pitts. Today, it's more common to use other models of artificial neurons - in this book, and in much modern work on neural networks, the main neuron model used is one called the sigmoid neuron. We'll get to sigmoid neurons shortly. But to understand why sigmoid neurons are defined the way they are, it's worth taking the time to first understand perceptrons.

    So how do perceptrons work? A perceptron takes several binary inputs, $x_1, x_2, \ldots$, and produces a single binary output:

    In the example shown the perceptron has three inputs, $x_1, x_2, x_3$. In general it could have more or fewer inputs. Rosenblatt proposed a simple rule to compute the output. He introduced weights, $w_1, w_2, \ldots$, real numbers expressing the importance of the respective inputs to the output. The neuron's output, $0$ or $1$, is determined by whether the weighted sum $\sum_j w_j x_j$ is less than or greater than some threshold value. Just like the weights, the threshold is a real number which is a parameter of the neuron. To put it in more precise algebraic terms:
    $$\text{output} = \begin{cases} 0 & \text{if } \sum_j w_j x_j \leq \text{threshold} \\ 1 & \text{if } \sum_j w_j x_j > \text{threshold} \end{cases} \tag{1}$$
    That's all there is to how a perceptron works!
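
    As a minimal sketch, the rule in Equation (1) can be written in a few lines of Python. The function name `perceptron_output` and the sample weights below are illustrative assumptions, not something defined in the text.

```python
def perceptron_output(weights, inputs, threshold):
    """Perceptron rule: 1 if the weighted sum exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Hypothetical example: three binary inputs with made-up weights.
print(perceptron_output([0.5, 0.5, 1.0], [1, 0, 1], threshold=1.0))  # weighted sum is 1.5
print(perceptron_output([0.5, 0.5, 1.0], [1, 1, 0], threshold=1.0))  # weighted sum is 1.0
```

    Note that a weighted sum exactly equal to the threshold gives output 0, matching the "less than or equal" branch of Equation (1).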

    That's the basic mathematical model. A way you can think about the perceptron is that it's a device that makes decisions by weighing up evidence. Let me give an example. It's not a very realistic example, but it's easy to understand, and we'll soon get to more realistic examples. Suppose the weekend is coming up, and you've heard that there's going to be a cheese festival in your city. You like cheese, and are trying to decide whether or not to go to the festival. You might make your decision by weighing up three factors:

    1. Is the weather good?
    2. Does your boyfriend or girlfriend want to accompany you?
    3. Is the festival near public transit? (You don't own a car).
    We can represent these three factors by corresponding binary variables $x_1$, $x_2$, and $x_3$. For instance, we'd have $x_1 = 1$ if the weather is good, and $x_1 = 0$ if the weather is bad. Similarly, $x_2 = 1$ if your boyfriend or girlfriend wants to go, and $x_2 = 0$ if not. And similarly again for $x_3$ and public transit.

    Now, suppose you absolutely adore cheese, so much so that you're happy to go to the festival even if your boyfriend or girlfriend is uninterested and the festival is hard to get to. But perhaps you really loathe bad weather, and there's no way you'd go to the festival if the weather is bad. You can use perceptrons to model this kind of decision-making. One way to do this is to choose a weight $w_1 = 6$ for the weather, and $w_2 = 2$ and $w_3 = 2$ for the other conditions. The larger value of $w_1$ indicates that the weather matters a lot to you, much more than whether your boyfriend or girlfriend joins you, or the nearness of public transit. Finally, suppose you choose a threshold of $5$ for the perceptron. With these choices, the perceptron implements the desired decision-making model, outputting $1$ whenever the weather is good, and $0$ whenever the weather is bad. It makes no difference to the output whether your boyfriend or girlfriend wants to go, or whether public transit is nearby.

    By varying the weights and the threshold, we can get different models of decision-making. For example, suppose we instead chose a threshold of $3$. Then the perceptron would decide that you should go to the festival whenever the weather was good or when both the festival was near public transit and your boyfriend or girlfriend was willing to join you. In other words, it'd be a different model of decision-making. Dropping the threshold means you're more willing to go to the festival.
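
    To check the cheese festival example by hand, we can enumerate all eight input combinations for both thresholds. This is only a sketch: the weights 6, 2, 2 and the thresholds 5 and 3 come from the text, while the helper names `perceptron_output` and `decisions` are my own.

```python
from itertools import product


def perceptron_output(weights, inputs, threshold):
    """Perceptron rule: 1 if the weighted sum exceeds the threshold, else 0."""
    weighted_sum = sum(w * x for w, x in zip(weights, inputs))
    return 1 if weighted_sum > threshold else 0


# Weights from the text: weather = 6, partner = 2, public transit = 2.
WEIGHTS = (6, 2, 2)


def decisions(threshold):
    """Map every (weather, partner, transit) combination to the go / don't-go output."""
    return {x: perceptron_output(WEIGHTS, x, threshold) for x in product((0, 1), repeat=3)}


for threshold in (5, 3):
    print(f"threshold = {threshold}")
    for (weather, partner, transit), go in decisions(threshold).items():
        print(f"  weather={weather} partner={partner} transit={transit} -> {go}")
```

    With threshold 5 the output simply copies the weather input. With threshold 3 the combination "bad weather, partner willing, transit nearby" also produces a 1, since its weighted sum of 4 exceeds 3.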

    Obviously, the perceptron isn't a complete model of human decision-making! But what the example illustrates is how a perceptron can weigh up different kinds of evidence in order to make decisions. And it should seem plausible that a complex network of perceptrons could make quite subtle decisions:

    In this network, the first column of perceptrons - what we'll call the first  layer  of perceptrons - is making three very simple decisions, by weighing the input evidence. What about the perceptrons in the second layer? Each of those perceptrons is making a decision by weighing up the results from the first layer of decision-making. In this way a perceptron in the second layer can make a decision at a more complex and more abstract level than perceptrons in the first layer. And even more complex decisions can be made by the perceptron in the third layer. In this way, a many-layer network of perceptrons can engage in sophisticated decision making.

    Incidentally, when I defined perceptrons I said that a perceptron has just a single output. In the network above the perceptrons look like they have multiple outputs. In fact, they're still single output. The multiple output arrows are merely a useful way of indicating that the output from a perceptron is being used as the input to several other perceptrons. It's less unwieldy than drawing a single output line which then splits.

    Let's simplify the way we describe perceptrons. The condition $\sum_j w_j x_j > \text{threshold}$ is cumbersome, and we can make two notational changes to simplify it. The first change is to write $\sum_j w_j x_j$ as a dot product, $w \cdot x \equiv \sum_j w_j x_j$, where $w$ and $x$ are vectors whose components are the weights and inputs, respectively. The second change is to move the threshold to the other side of the inequality, and to replace it by what's known as the perceptron's bias, $b \equiv -\text{threshold}$. Using the bias instead of the threshold, the perceptron rule can be rewritten:

    $$\text{output} = \begin{cases} 0 & \text{if } w \cdot x + b \leq 0 \\ 1 & \text{if } w \cdot x + b > 0 \end{cases} \tag{2}$$
    You can think of the bias as a measure of how easy it is to get the perceptron to output a $1$. Or to put it in more biological terms, the bias is a measure of how easy it is to get the perceptron to fire. For a perceptron with a really big bias, it's extremely easy for the perceptron to output a $1$. But if the bias is very negative, then it's difficult for the perceptron to output a $1$. Obviously, introducing the bias is only a small change in how we describe perceptrons, but we'll see later that it leads to further notational simplifications. Because of this, in the remainder of the book we won't use the threshold, we'll always use the bias.
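
    As a quick sanity check, the sketch below compares the threshold form with the bias form on every binary input, using $b = -\text{threshold}$. The function names are illustrative assumptions, and the weights and threshold are borrowed from the cheese festival example.

```python
from itertools import product


def perceptron_threshold(w, x, threshold):
    """Original form: compare the weighted sum with the threshold."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) > threshold else 0


def perceptron_bias(w, x, b):
    """Bias form: output 1 exactly when w . x + b > 0."""
    return 1 if sum(wj * xj for wj, xj in zip(w, x)) + b > 0 else 0


# Weights and threshold reused from the cheese festival example; b = -threshold.
w, threshold = (6, 2, 2), 5
b = -threshold
agree = all(
    perceptron_threshold(w, x, threshold) == perceptron_bias(w, x, b)
    for x in product((0, 1), repeat=3)
)
print(agree)
```

    The two forms agree because subtracting the threshold from both sides of the inequality doesn't change which side is larger.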

    I've described perceptrons as a method for weighing evidence to make decisions. Another way perceptrons can be used is to compute the elementary logical functions we usually think of as underlying computation, functions such as AND, OR, and NAND. For example, suppose we have a perceptron with two inputs, each with weight $-2$, and an overall bias of $3$. Here's our perceptron:

    Then we see that input $00$ produces output $1$, since $(-2)*0 + (-2)*0 + 3 = 3$ is positive. Here, I've introduced the $*$ symbol to make the multiplications explicit. Similar calculations show that the inputs $01$ and $10$ produce output $1$. But the input $11$ produces output $0$, since $(-2)*1 + (-2)*1 + 3 = -1$ is negative. And so our perceptron implements a NAND gate!
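
    The same calculation can be run for all four inputs. The sketch below hard-codes the weights $-2$, $-2$ and bias $3$ from the text; the function name `nand_perceptron` is just a label I've chosen.

```python
def nand_perceptron(x1, x2):
    """Perceptron with weights -2, -2 and bias 3, as described in the text."""
    z = (-2) * x1 + (-2) * x2 + 3
    return 1 if z > 0 else 0


# Print the full truth table.
for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", nand_perceptron(x1, x2))
```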

    The NAND example shows that we can use perceptrons to compute simple logical functions. In fact, we can use networks of perceptrons to compute any logical function at all. The reason is that the NAND gate is universal for computation, that is, we can build any computation up out of NAND gates. For example, we can use NAND gates to build a circuit which adds two bits, $x_1$ and $x_2$. This requires computing the bitwise sum, $x_1 \oplus x_2$, as well as a carry bit which is set to $1$ when both $x_1$ and $x_2$ are $1$, i.e., the carry bit is just the bitwise product $x_1 x_2$:

    To get an equivalent network of perceptrons we replace all the NAND gates by perceptrons with two inputs, each with weight $-2$, and an overall bias of $3$. Here's the resulting network. Note that I've moved the perceptron corresponding to the bottom right NAND gate a little, just to make it easier to draw the arrows on the diagram:
    One notable aspect of this network of perceptrons is that the output from the leftmost perceptron is used twice as input to the bottommost perceptron. When I defined the perceptron model I didn't say whether this kind of double-output-to-the-same-place was allowed. Actually, it doesn't much matter. If we don't want to allow this kind of thing, then it's possible to simply merge the two lines, into a single connection with a weight of -4 instead of two connections with -2 weights. (If you don't find this obvious, you should stop and prove to yourself that this is equivalent.) With that change, the network looks as follows, with all unmarked weights equal to -2, all biases equal to 3, and a single weight of -4, as marked:
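
    Here's a rough sketch of such an adder in Python. The diagram isn't reproduced in this text, so the wiring below is an assumption: it's the standard NAND construction of a half adder, with the carry computed by a single connection of weight $-4$ from the leftmost perceptron. The function names are likewise my own.

```python
def perceptron(weights, inputs, bias):
    """Bias-form perceptron: 1 if w . x + b > 0, else 0."""
    return 1 if sum(w * x for w, x in zip(weights, inputs)) + bias > 0 else 0


def nand(a, b):
    """NAND gate as a perceptron with weights -2, -2 and bias 3."""
    return perceptron((-2, -2), (a, b), 3)


def half_adder(x1, x2):
    """Return (sum_bit, carry_bit) for two input bits, using the assumed wiring."""
    n1 = nand(x1, x2)          # leftmost perceptron
    n2 = nand(x1, n1)
    n3 = nand(x2, n1)
    sum_bit = nand(n2, n3)     # bitwise sum, x1 XOR x2
    # Carry: the two -2 connections from n1 merged into one connection of weight -4.
    carry_bit = perceptron((-4,), (n1,), 3)
    return sum_bit, carry_bit


for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, "->", half_adder(x1, x2))
```

    The carry unit gives the same result as a NAND gate fed the same value twice, since $(-2)*n + (-2)*n + 3 = (-4)*n + 3$ for either value of $n$.
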
    Up to now I've been drawing inputs like $x_1$ and $x_2$ as variables floating to the left of the network of perceptrons. In fact, it's conventional to draw an extra layer of perceptrons - the input layer - to encode the inputs:
    This notation for input perceptrons, in which we have an output, but no inputs,
    is a shorthand. It doesn't actually mean a perceptron with no inputs. To see this, suppose we did have a perceptron with no inputs. Then the weighted sum $\sum_j w_j x_j$ would always be zero, and so the perceptron would output $1$ if $b > 0$, and $0$ if $b \leq 0$. That is, the perceptron would simply output a fixed value, not the desired value ($x_1$, in the example above). It's better to think of the input perceptrons as not really being perceptrons at all, but rather special units which are simply defined to output the desired values, $x_1, x_2, \ldots$.

    The adder example demonstrates how a network of perceptrons can be used to simulate a circuit containing many NAND gates. And because NAND gates are universal for computation, it follows that perceptrons are also universal for computation.

    The computational universality of perceptrons is simultaneously reassuring and disappointing. It's reassuring because it tells us that networks of perceptrons can be as powerful as any other computing device. But it's also disappointing, because it makes it seem as though perceptrons are merely a new type of NAND gate. That's hardly big news!

    However, the situation is better than this view suggests. It turns out that we can devise learning algorithms which can automatically tune the weights and biases of a network of artificial neurons. This tuning happens in response to external stimuli, without direct intervention by a programmer. These learning algorithms enable us to use artificial neurons in a way which is radically different to conventional logic gates. Instead of explicitly laying out a circuit of NAND and other gates, our neural networks can simply learn to solve problems, sometimes problems where it would be extremely difficult to directly design a conventional circuit.

    Sigmoid neurons

    Learning algorithms sound terrific. But how can we devise such algorithms for a neural network? Suppose we have a network of perceptrons that we'd like to use to learn to solve some problem. For example, the inputs to the network might be the raw pixel data from a scanned, handwritten image of a digit. And we'd like the network to learn weights and biases so that the output from the network correctly classifies the digit. To see how learning might work, suppose we make a small change in some weight (or bias) in the network. What we'd like is for this small change in weight to cause only a small corresponding change in the output from the network. As we'll see in a moment, this property will make learning possible. Schematically, here's what we want (obviously this network is too simple to do handwriting recognition!):

    If it were true that a small change in a weight (or bias) causes only a small change in output, then we could use this fact to modify the weights and biases to get our network to behave more in the manner we want. For example, suppose the network was mistakenly classifying an image as an "8" when it should be a "9". We could figure out how to make a small change in the weights and biases so the network gets a little closer to classifying the image as a "9". And then we'd repeat this, changing the weights and biases over and over to produce better and better output. The network would be learning.

    The problem is that this isn't what happens when our network contains perceptrons. In fact, a small change in the weights or bias of any single perceptron in the network can sometimes cause the output of that perceptron to completely flip, say from $0$ to $1$. That flip may then cause the behaviour of the rest of the network to completely change in some very complicated way. So while your "9" might now be classified correctly, the behaviour of the network on all the other images is likely to have completely changed in some hard-to-control way. That makes it difficult to see how to gradually modify the weights and biases so that the network gets closer to the desired behaviour. Perhaps there's some clever way of getting around this problem. But it's not immediately obvious how we can get a network of perceptrons to learn.
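
    A tiny example makes the flip concrete. The numbers below are hypothetical, chosen so that the perceptron sits exactly on its decision boundary; a change of one part in a thousand in the weight is then enough to switch the output.

```python
def perceptron(w, x, b):
    """Bias-form perceptron with a single input: 1 if w * x + b > 0, else 0."""
    return 1 if w * x + b > 0 else 0


x, b = 1.0, -1.0
before = perceptron(1.0, x, b)     # w * x + b is exactly 0, so the output is 0
after = perceptron(1.001, x, b)    # w * x + b is slightly positive, so the output is 1
print(before, after)
```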

    We can overcome this problem by introducing a new type of artificial neuron called a sigmoid neuron. Sigmoid neurons are similar to perceptrons, but modified so that small changes in their weights and bias cause only a small change in their output. That's the crucial fact which will allow a network of sigmoid neurons to learn.

    Okay, let me describe the sigmoid neuron. We'll depict sigmoid neurons in the same way we depicted perceptrons:

    Just like a perceptron, the sigmoid neuron has inputs, $x_1, x_2, \ldots$. But instead of being just $0$ or $1$, these inputs can also take on any values between $0$ and $1$. So, for instance, $0.638\ldots$ is a valid input for a sigmoid neuron. Also just like a perceptron, the sigmoid neuron has weights for each input, $w_1, w_2, \ldots$, and an overall bias, $b$. But the output is not $0$ or $1$. Instead, it's $\sigma(w \cdot x + b)$, where $\sigma$ is called the sigmoid function (incidentally, $\sigma$ is sometimes called the logistic function, and this new class of neurons called logistic neurons; it's useful to remember this terminology, since these terms are used by many people working with neural nets, but we'll stick with the sigmoid terminology), and is defined by:
    $$\sigma(z) \equiv \frac{1}{1 + e^{-z}}. \tag{3}$$
    To put it all a little more explicitly, the output of a sigmoid neuron with inputs $x_1, x_2, \ldots$, weights $w_1, w_2, \ldots$, and bias $b$ is
    $$\frac{1}{1 + \exp(-\sum_j w_j x_j - b)}. \tag{4}$$
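
    Equations (3) and (4) translate almost directly into Python. This is a minimal sketch using only the standard `math` module; the function names and the sample weights and bias are assumptions made for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid function from Equation (3)."""
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_neuron(weights, inputs, bias):
    """Output of a sigmoid neuron, Equation (4): sigma(w . x + b)."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return sigmoid(z)


# Hypothetical weights and bias; inputs may be any values between 0 and 1.
print(sigmoid_neuron([0.7, -1.2], [0.638, 0.25], bias=0.1))
```

    Whatever the weights and bias, the output always lies strictly between 0 and 1.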

    At first sight, sigmoid neurons appear very different to perceptrons. The algebraic form of the sigmoid function may seem opaque and forbidding if you're not already familiar with it. In fact, there are many similarities between perceptrons and sigmoid neurons, and the algebraic form of the sigmoid function turns out to be more of a technical detail than a true barrier to understanding.

    To understand the similarity to the perceptron model, suppose $z \equiv w \cdot x + b$ is a large positive number. Then $e^{-z} \approx 0$ and so $\sigma(z) \approx 1$. In other words, when $z = w \cdot x + b$ is large and positive, the output from the sigmoid neuron is approximately $1$, just as it would have been for a perceptron. Suppose on the other hand that $z = w \cdot x + b$ is very negative. Then $e^{-z} \rightarrow \infty$, and $\sigma(z) \approx 0$. So when $z = w \cdot x + b$ is very negative, the behaviour of a sigmoid neuron also closely approximates a perceptron. It's only when $w \cdot x + b$ is of modest size that there's much deviation from the perceptron model.
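
    We can see this numerically by evaluating both models at a few values of $z$. The sample values below are arbitrary choices of mine, and `step` is simply a name for the perceptron's output as a function of $z$.

```python
import math


def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def step(z):
    """Perceptron-style output: 1 if z > 0, else 0."""
    return 1 if z > 0 else 0


for z in (-20, -5, -0.5, 0.5, 5, 20):
    print(f"z = {z:6}: sigmoid = {sigmoid(z):.6f}, perceptron = {step(z)}")
```

    At $z = \pm 20$ the two outputs are nearly indistinguishable, while at $z = \pm 0.5$ they differ substantially.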

    What about the algebraic form of $\sigma$? How can we understand that? In fact, the exact form of $\sigma$ isn't so important - what really matters is the shape of the function when plotted. Here's the shape:

    [Figure: plot of the sigmoid function against z, for z from -4 to 4, with values from 0.0 to 1.0]

    This shape is a smoothed out version of a step function:

    [Figure: plot of the step function against z, for z from -4 to 4, with values from 0.0 to 1.0]

    If $\sigma$ had in fact been a step function, then the sigmoid neuron would be a perceptron, since the output would be $1$ or $0$ depending on whether $w \cdot x + b$ was positive or negative. (Actually, when $w \cdot x + b = 0$ the perceptron outputs $0$, while the step function outputs $1$. So, strictly speaking, we'd need to modify the step function at that one point. But you get the idea.) By using the actual $\sigma$ function we get, as already implied above, a smoothed out perceptron. Indeed, it's the smoothness of the $\sigma$ function that is the crucial fact, not its detailed form. The smoothness of $\sigma$ means that small changes $\Delta w_j$ in the weights and $\Delta b$ in the bias will produce a small change $\Delta \text{output}$ in the output from the neuron. In fact, calculus tells us that $\Delta \text{output}$ is well approximated by

    $$\Delta \text{output} \approx \sum_j \frac{\partial \, \text{output}}{\partial w_j} \Delta w_j + \frac{\partial \, \text{output}}{\partial b} \Delta b, \tag{5}$$
    where the sum is over all the weights, $w_j$, and $\partial \, \text{output} / \partial w_j$ and $\partial \, \text{output} / \partial b$ denote partial derivatives of the $\text{output}$ with respect to $w_j$ and $b$, respectively. Don't panic if you're not comfortable with partial derivatives! While the expression above looks complicated, with all the partial derivatives, it's actually saying something very simple (and which is very good news): $\Delta \text{output}$ is a linear function of the changes $\Delta w_j$ and $\Delta b$ in the weights and bias. This linearity makes it easy to choose small changes in the weights and biases to achieve any desired small change in the output. So while sigmoid neurons have much of the same qualitative behaviour as perceptrons, they make it much easier to figure out how changing the weights and biases will change the output.
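
    To make this concrete, here's a small numerical check of the linear approximation for a single sigmoid neuron. It relies on a standard calculus fact not derived in the text so far: $\sigma'(z) = \sigma(z)(1 - \sigma(z))$, so that $\partial \, \text{output} / \partial w_j = \sigma'(z) x_j$ and $\partial \, \text{output} / \partial b = \sigma'(z)$. The weights, inputs and changes are hypothetical values chosen only for illustration.

```python
import math


def sigmoid(z):
    """Sigmoid function: 1 / (1 + e^(-z))."""
    return 1.0 / (1.0 + math.exp(-z))


def neuron_output(w, x, b):
    """Sigmoid neuron output sigma(w . x + b)."""
    return sigmoid(sum(wj * xj for wj, xj in zip(w, x)) + b)


# Hypothetical values chosen only for illustration.
w, x, b = [0.5, -0.3], [0.8, 0.2], 0.1
dw, db = [0.01, -0.02], 0.005

# Partial derivatives: d(output)/d(w_j) = sigma'(z) * x_j and d(output)/d(b) = sigma'(z),
# where sigma'(z) = sigma(z) * (1 - sigma(z)).
out = neuron_output(w, x, b)
slope = out * (1.0 - out)
approx_change = sum(slope * xj * dwj for xj, dwj in zip(x, dw)) + slope * db

# Actual change in output after applying the small changes.
new_w = [wj + dwj for wj, dwj in zip(w, dw)]
actual_change = neuron_output(new_w, x, b + db) - out

print(approx_change, actual_change)
```

    The two printed numbers should agree closely, and the agreement improves as the changes in the weights and bias are made smaller.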

    If it's the shape of $\sigma$ which really matters, and not its exact form, then why use the particular form used for $\sigma$ in Equation (3)? In fact, later in the book we will occasionally consider neurons where the output is $f(w \cdot x + b)$ for some other activation function $f(\cdot)$. The main thing that changes when we use a different activation function is that the particular values for the partial derivatives in Equation (5) change. It turns out that when we compute those partial derivatives later, using $\sigma$ will simplify the algebra, simply because exponentials have lovely properties when differentiated. In any case, $\sigma$ is commonly used in work on neural nets, and is the activation function we'll use most often in this book.

    How should we interpret the output from a sigmoid neuron? Obviously, one big difference between perceptrons and sigmoid neurons is that sigmoid neurons don't just output $0$ or