COCO Dataset: Introduction, Download, and Usage — A Detailed Guide (2018-10-04)
MS COCO (Microsoft Common Objects in Context) originates from the Microsoft COCO dataset, which Microsoft funded and had annotated starting in 2014. Like the ImageNet challenge, the associated competition is regarded as one of the most closely watched and authoritative benchmarks in computer vision.
COCO is a large, rich dataset for object detection, segmentation, and captioning. It targets scene understanding: images are drawn mainly from complex everyday scenes, and objects in them are localized with precise segmentations. The original release defined 91 object categories across 328,000 images with 2,500,000 labels. To date it is the largest dataset with semantic segmentation annotations: 80 categories are provided, with over 330,000 images (about 200,000 of them labeled) and more than 1.5 million object instances in the whole dataset.
bicycle, car, motorbike, aeroplane, bus, train, truck, boat
traffic light, fire hydrant, stop sign, parking meter, bench
bird, cat, dog, horse, sheep, cow, elephant, bear, zebra, giraffe
backpack, umbrella, handbag, tie, suitcase
frisbee, skis, snowboard, sports ball, kite, baseball bat, baseball glove, skateboard, surfboard, tennis racket
bottle, wine glass, cup, fork, knife
banana, apple, sandwich, orange, broccoli, carrot, hot dog, pizza, donut, cake
chair, sofa, potted plant, bed, dining table, toilet, tv monitor
laptop, mouse, remote, keyboard, cell phone
microwave, oven, toaster, sink, refrigerator
book, clock, vase, scissors, teddy bear, hair drier, toothbrush
COCO is a large-scale object detection, segmentation, and captioning dataset. COCO has several features:
- Object segmentation
- Recognition in context
- Superpixel stuff segmentation
- 330K images (>200K labeled)
- 1.5 million object instances
- 80 object categories
- 91 stuff categories
- 5 captions per image
- 250,000 people with keypoints
COCO was released in two parts, the first in 2014 and the second in 2015. The 2014 release contains 82,783 training, 40,504 validation, and 40,775 test images, with 270K segmented people and 886K segmented objects. The 2015 release contains 165,482 training, 81,208 validation, and 81,434 test images.
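A quick sanity check on the split sizes quoted above (a minimal sketch; the counts are taken verbatim from the release notes):

```python
# Split sizes quoted above for the two COCO releases.
splits_2014 = {"train": 82_783, "val": 40_504, "test": 40_775}
splits_2015 = {"train": 165_482, "val": 81_208, "test": 81_434}

total_2014 = sum(splits_2014.values())
total_2015 = sum(splits_2015.values())

print(total_2014)  # 164062
print(total_2015)  # 328124 -- consistent with the "about 330K images" figure
```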
(1) Download images and annotations from [MSCOCO] (to be updated)
(2) Get the coco code (to be updated)
(3) Build the coco code (to be updated)
(4) Split the annotations into one file per image and get the image size info (to be updated)
(5) Create the LMDB file (to be updated)
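The usual Python API for reading these annotations is pycocotools, but the annotation files themselves are plain JSON with top-level `images`, `annotations`, and `categories` arrays, so the layout can be explored with the standard library alone. A minimal sketch (the image ids, file names, and boxes below are invented for illustration):

```python
import json
from collections import Counter

# Hand-built fragment in the COCO annotation layout; a real file such as
# annotations/instances_train2014.json is read the same way via json.load().
# COCO bounding boxes are [x, y, width, height].
coco = {
    "images": [{"id": 1, "file_name": "000001.jpg", "width": 640, "height": 480}],
    "annotations": [
        {"id": 10, "image_id": 1, "category_id": 3, "bbox": [100, 120, 50, 80]},
        {"id": 11, "image_id": 1, "category_id": 1, "bbox": [300, 200, 40, 90]},
    ],
    "categories": [{"id": 1, "name": "person"}, {"id": 3, "name": "car"}],
}

# Round-trip through JSON text, as one would with a file on disk.
coco = json.loads(json.dumps(coco))

# Map category ids to names and count annotated instances per category.
id_to_name = {c["id"]: c["name"] for c in coco["categories"]}
counts = Counter(id_to_name[a["category_id"]] for a in coco["annotations"])
print(dict(sorted(counts.items())))  # {'car': 1, 'person': 1}
```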
Vehicle trajectory dataset (2016-08-11):
1. The dataset covers about 30 days (May 17 - June 10, 2008) of driving data from 500 taxis.
2. The sampling interval of the driving data is 1 minute.
3. Each trajectory record contains: vehicle ID, longitude/latitude (position), occupied flag, and timestamp.
4. No instantaneous speed is included.
I. San Francisco Bay Area
Motion-related datasets (2016-03-11)
Survey of related motion databases
50 Salads
It captures 25 people preparing 2 mixed salads each and contains over 4 hours of annotated accelerometer and RGB-D video data. With its detailed annotations, multiple sensor types, and two sequences per participant, the 50 Salads dataset may be used for research in areas such as activity recognition, activity spotting, sequence analysis, progress tracking, sensor fusion, transfer learning, and user adaptation. http://cvip.computing.dundee.ac.uk/datasets/foodpreparation/50salads/
The dataset comprises two views of various scenarios of people acting out various interactions. Ten basic scenarios were acted out by members of the Vision Group team: InGroup (IG), Approach (A), WalkTogether (WT), Split (S), Ignore (I), Following (FO), Chase (C), Fight (FI), RunTogether (RT), and Meet (M). Many of the interactions in the video sequences are labelled accordingly.
The data is captured at 25 frames per second at a resolution of 640x480. The videos are available either as AVIs or as numbered sets of single-frame JPEG files.
A lot (but not all) of the video sequences have ground truth bounding boxes of the pedestrians in the scene.
The Berkeley Multimodal Human Action Database (MHAD) contains 11 actions performed by 7 male and 5 female subjects in the range 23-30 years of age except for one elderly subject. All the subjects performed 5 repetitions of each action, yielding about 660 action sequences which correspond to about 82 minutes of total recording time. In addition, we have recorded a T-pose for each subject which can be used for the skeleton extraction; and the background data (with and without the chair used in some of the activities).
The specified set of actions comprises: (1) actions with movement in both upper and lower extremities, e.g., jumping in place, jumping jacks, throwing; (2) actions with high dynamics in the upper extremities, e.g., waving hands, clapping hands; and (3) actions with high dynamics in the lower extremities, e.g., sitting down, standing up.
8 interaction classes: bow, boxing, handshake, high-five, hug, kick, pat, and push. 400 video clips.
CAD120 (Cornell Activity Datasets)
The CAD-120 dataset comprises 120 RGB-D video sequences of humans performing activities, recorded with the Microsoft Kinect sensor. 4 subjects: two male, two female.
10 high-level activities: making cereal, taking medicine, stacking objects, unstacking objects, microwaving food, picking objects, cleaning objects, taking food, arranging objects, having a meal
10 sub-activity labels: reaching, moving, pouring, eating, drinking, opening, placing, closing, scrubbing, null
12 object affordance labels: reachable, movable, pourable, pourto, containable, drinkable, openable, placeable, closable, scrubbable, scrubber, stationary
CAD60 (Cornell Activity Datasets)
CAD-60 comprises 60 RGB-D video sequences of humans performing activities, recorded with the Microsoft Kinect sensor. 4 subjects: two male, two female. There are 5 different environments (office, kitchen, bedroom, bathroom, and living room) and 12 activities: rinsing mouth, brushing teeth, wearing contact lenses, talking on the phone, drinking water, opening a pill container, cooking (chopping), cooking (stirring), talking on a couch, relaxing on a couch, writing on a whiteboard, and working on a computer.
CASIA (action database for recognition)
The CASIA action database is a collection of sequences of human activities captured outdoors by video cameras from different angles of view. There are 1,446 sequences in all, containing eight types of single-person actions (walk, run, bend, jump, crouch, faint, wander, and punch a car), each performed by 24 subjects, and seven types of two-person interactions (rob, fight, follow, follow and gather, meet and part, meet and gather, overtake), each performed by pairs of subjects.
For the CAVIAR project, a number of video clips were recorded acting out the different scenarios of interest. These include people walking alone, meeting with others, window shopping, entering and exiting shops, fighting, passing out, and, last but not least, leaving a package in a public place. The ground truth for these sequences was obtained by hand-labeling the images.
CMU MMAC (CMU Multi-Modal Activity Database)
The CMU Multi-Modal Activity Database (CMU-MMAC) contains multimodal measures of the human activity of subjects performing the tasks involved in cooking and food preparation. A kitchen was built and to date twenty-five subjects have been recorded cooking five different recipes: brownies, pizza, sandwich, salad, and scrambled eggs.
CMU MoCap (CMU Graphics Lab Motion Capture Database)
CMU MoCap is a database of human interaction with the environment and locomotion. There are 2,605 trials in 6 categories and 23 subcategories, which include common two-person interactions, everyday activities such as walking and running, and sports such as basketball and dance.
CONVERSE (Human Conversational Interaction Dataset)
This is a human interaction recognition dataset intended for the exploration of classifying naturally executed conversational scenarios between a pair of individuals via the use of pose- and appearance-based features. The motivation behind CONVERSE is to present the problem of classifying subtle and complex behaviors between participants with pose-based information, classes which are not easily defined by the poses they contain. A pair of individuals are recorded performing natural dialogues across 7 different conversational scenarios by use of commercial depth sensor, providing pose-based representation of the interactions in the form of the extracted human skeletal models. Baseline classification results are presented in the associated publication to allow cross-comparison with future research into pose-based interaction recognition.
Drinking/Smoking (drinking & smoking action annotation)
The annotation describes each action by a cuboid in space-time, a keyframe, and the position of the head on the keyframe. 308 events extracted from movies, with labels PersonDrinking (159) and PersonSmoking (149).
ETISEO focuses on the treatment and interpretation of videos involving pedestrians and vehicles, indoors or outdoors, obtained from fixed cameras.
ETISEO aims at studying the dependency between algorithms and the video characteristics.
G3D (gaming dataset)
G3D dataset contains a range of gaming actions captured with Microsoft Kinect. The Kinect enabled us to record synchronised video, depth and skeleton data. The dataset contains 10 subjects performing 20 gaming actions: punch right, punch left, kick right, kick left, defend, golf swing, tennis swing forehand, tennis swing backhand, tennis serve, throw bowling ball, aim and fire gun, walk, run, jump, climb, crouch, steer a car, wave, flap and clap. The 20 gaming actions are recorded in 7 action sequences. Most sequences contain multiple actions in a controlled indoor environment with a fixed camera, a typical setup for gesture based gaming.
G3Di (gaming dataset)
G3Di is a realistic and challenging human interaction dataset for multiplayer gaming, containing synchronised colour, depth and skeleton data. This dataset contains 12 people split into 6 pairs. Each pair interacted through a gaming interface showcasing six sports: boxing, volleyball, football, table tennis, sprint and hurdles. The interactions can be collaborative or competitive depending on the specific sport and game mode. In this dataset volleyball was played collaboratively and the other sports in competitive mode. In most sports the interactions were explicit and can be decomposed into an action and counter-action, but in the sprint and hurdles the interactions were implicit, as the players competed with each other for the fastest time. The actions for each sport are: boxing (right punch, left punch, defend), volleyball (serve, overhand hit, underhand hit, and jump hit), football (kick, block and save), table tennis (serve, forehand hit and backhand hit), sprint (run) and hurdles (run and jump).
HMDB51 (A Large Human Motion Database)
HMDB was collected from various sources, mostly movies, with a small proportion from public databases such as the Prelinger Archive, YouTube and Google Videos. The dataset contains 6,849 clips divided into 51 action categories, each containing a minimum of 101 clips. The action categories can be grouped into five types: general facial actions, facial actions with object manipulation, general body movements, body movements with object interaction and body movements for human interaction.
General facial actions: smile, laugh, chew, talk.
Facial actions with object manipulation: smoke, eat, drink.
General body movements: cartwheel, clap hands, climb, climb stairs, dive, fall on the floor, backhand flip, handstand, jump, pull up, push up, run, sit down, sit up, somersault, stand up, turn, walk, wave.
Body movements with object interaction: brush hair, catch, draw sword, dribble, golf, hit something, kick ball, pick, pour, push something, ride bike, ride horse, shoot ball, shoot bow, shoot gun, swing baseball bat, sword exercise, throw.
Body movements for human interaction: fencing, hug, kick someone, kiss, punch, shake hands, sword fight.
Hollywood (Hollywood Human Actions dataset)
Hollywood dataset contains video samples with human action from 32 movies. Each sample is labeled according to one or more of 8 action classes: AnswerPhone, GetOutCar, HandShake, HugPerson, Kiss, SitDown, SitUp, StandUp. The dataset is divided into a test set obtained from 20 movies and two training sets obtained from 12 movies different from the test set. The Automatic training set is obtained using automatic script-based action annotation and contains 233 video samples with approximately 60% correct labels. The Clean training set contains 219 video samples with manually verified labels. The test set contains 211 samples with manually verified labels.
Hollywood2 (Hollywood-2 Human Actions and Scenes dataset)
Hollywood2 contains 12 classes of human actions and 10 classes of scenes distributed over 3669 video clips and approximately 20.1 hours of video in total. The dataset intends to provide a comprehensive benchmark for human action recognition in realistic and challenging settings. The dataset is composed of video clips extracted from 69 movies, it contains approximately 150 samples per action class and 130 samples per scene class in training and test subsets.
Hollywood 3D
The dataset contains around 650 video clips across 14 classes. In addition, two state-of-the-art action recognition algorithms are extended to make use of the 3D data, and five new interest point detection strategies that extend to the 3D data are also proposed. The evaluation compares all 4 feature descriptors, using 7 different types of interest point, over a variety of threshold levels, for the Hollywood 3D dataset. The dataset, including stereo video, estimated depth maps and all code required to reproduce the benchmark results, is made available to the wider community.
The HumanEva-I dataset contains 7 calibrated video sequences (4 grayscale and 3 color) that are synchronized with 3D body poses obtained from a motion capture system. The database contains 4 subjects performing 6 common actions (e.g. walking, jogging, gesturing). Error metrics for computing error in 2D and 3D pose are provided to participants. The dataset contains training, validation and testing (with withheld ground truth) sets.
HumanEva-II contains only 2 subjects (both of whom also appear in the HumanEva-I dataset) performing an extended sequence of actions called Combo. In this sequence a subject starts by walking along an elliptical path, then continues to jog in the same direction, and concludes by alternately balancing on each of the two feet roughly in the center of the viewing volume.
The HumanEva-I training and validation data is intended to be shared across the two datasets, with test results primarily being reported on HumanEva-II.
IXMAS (INRIA Xmas Motion Acquisition Sequences)
INRIA Xmas Motion Acquisition Sequences (IXMAS) is a multi-view dataset for view-invariant human action recognition. There are 13 daily-life motions, each performed 3 times by 11 actors. The actors freely chose their position and orientation.
Framewise ground truth labeling: 0 - nothing, 1 - check watch, 2 - cross arms, 3 - scratch head, 4 - sit down, 5 - get up, 6 - turn around, 7 - walk, 8 - wave, 9 - punch, 10 - kick, 11 - point, 12 - pick up, 13 - throw (over head), 14 - throw (from bottom up).
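Framewise labels like these are commonly collapsed into (action, start-frame, end-frame) segments by run-length encoding the label stream. A minimal sketch, where the label map follows the list above and the example frame sequence is invented:

```python
from itertools import groupby

# Label ids 0-14 as listed in the IXMAS framewise ground truth.
LABELS = ["nothing", "check watch", "cross arms", "scratch head", "sit down",
          "get up", "turn around", "walk", "wave", "punch", "kick", "point",
          "pick up", "throw (over head)", "throw (from bottom up)"]

def segments(framewise):
    """Collapse a per-frame label stream into (name, start, end) runs,
    dropping the 'nothing' (0) background label."""
    out, frame = [], 0
    for label, run in groupby(framewise):
        length = len(list(run))
        if label != 0:
            out.append((LABELS[label], frame, frame + length - 1))
        frame += length
    return out

print(segments([0, 0, 7, 7, 7, 0, 8, 8]))
# [('walk', 2, 4), ('wave', 6, 7)]
```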
JPL (JPL First-Person Interaction dataset)
JPL First-Person Interaction dataset (JPL-Interaction dataset) is composed of human activity videos taken from a first-person viewpoint. The dataset particularly aims to provide first-person videos of interaction-level activities, recording how things visually look from the perspective (i.e., viewpoint) of a person/robot participating in such physical interactions.
We attached a GoPro2 camera to the head of our humanoid model and asked human participants to interact with the humanoid by performing activities. In order to emulate the mobility of a real robot, we also placed wheels below the humanoid and had an operator move the humanoid by pushing it from behind.
There are 7 different types of activities in the dataset, including 4 positive (i.e., friendly) interactions with the observer, 1 neutral interaction, and 2 negative (i.e., hostile) interactions. 'Shaking hands with the observer', 'hugging the observer', 'petting the observer', and 'waving a hand to the observer' are the four friendly interactions. The neutral interaction is the situation where two persons have a conversation about the observer while occasionally pointing at it. 'Punching the observer' and 'throwing objects at the observer' are the two negative interactions. Videos were recorded continuously during human activities, and each video sequence contains 0 to 3 activities. The videos are in 320x240 resolution at 30 fps.
KTH (Action Database)
The current video database contains six types of human actions (walking, jogging, running, boxing, hand waving and hand clapping) performed several times by 25 subjects in four different scenarios: outdoors (s1), outdoors with scale variation (s2), outdoors with different clothes (s3) and indoors (s4). Currently the database contains 2,391 sequences. All sequences were taken over homogeneous backgrounds with a static camera at a 25 fps frame rate. The sequences were downsampled to a spatial resolution of 160x120 pixels and have an average length of four seconds.
LIRIS (The LIRIS human activities dataset)
The LIRIS human activities dataset contains (gray/rgb/depth) videos showing people performing various activities taken from daily life (discussing, telephone calls, giving an item, etc.). The dataset is fully annotated, and the annotation contains not only the action class but also its spatial and temporal position in the video.
The dataset has been shot with two different cameras:
Subset D1 has been shot with a MS Kinect module mounted on a remotely controlled Wany robotics Pekee II mobile robot which is part of the LIRIS-VOIR platform.
Subset D2 has been shot with a Sony consumer camcorder.
The indoor motion capture dataset (MPI08) contains:
sequences : multi-view sequences obtained from 8 calibrated cameras.
silhouettes : binary segmented images obtained with chroma-keying.
meshes : 3D laser scans for each of the four actors in the dataset, as well as registered meshes with an inserted skeleton.
projection matrices : one for each of the 8 cameras.
orientation data : raw and calibrated sensor orientation data (5 sensors).
All takes have been recorded in a lab environment using eight calibrated video cameras and five inertial sensors fixed to the two lower legs, the two hands, and the neck. The evaluation dataset comprises various actions, including standard motions such as walking, sitting down and standing up, as well as fast and complex motions such as jumping, throwing, arm rotations, and cartwheels.
MPII Cooking (MPII (Max Planck Institute for Informatics) Cooking Activities dataset)
The dataset records 12 participants performing 65 different cooking activities, such as cut slices, pour, or spice. To record realistic behavior we did not record activities individually but asked participants to prepare one to six of a total of 14 dishes such as fruit salad or cake containing several cooking activities. In total we recorded 44 videos with a total length of more than 8 hours or 881,755 frames.
We also provide an annotated body pose training and test set. This allows working on the raw data as well as on higher-level modeling of activities. Activities are distinguished by fine-grained body motions that have low inter-class variability and high intra-class variability due to diverse subjects and ingredients.
We record a dataset containing different cooking activities. We discard some of the composite activities in the script corpus that are either too elementary to form a composite activity (e.g. how to secure a chopping board), duplicates with slightly different titles, or limited by the availability of ingredients (e.g. butternut squash). This resulted in 41 composite cooking activities for evaluation. For each composite activity, we asked the subjects to give tutorial-like sequential instructions for executing the respective kitchen task. The instructions had to be divided into sequential steps, with at most 15 steps per sequence. We select 53 relevant kitchen tasks as composite activities by mining the tutorials for basic kitchen tasks on the webpage "Jamie's Home Cooking Skills". All those tasks are steps to process ingredients or to use certain kitchen tools. In addition to the data we collected in this experiment, we use data from the OMICS corpus for 6 kitchen-related composite activities. This results in a corpus of 2,124 sequences in total, with 12,958 event descriptions.
Microsoft Research Action Data Set
This is a data set used for human action-detection experiments. It consists of a number of video sequences we have recorded.
It contains 16 video sequences with a total of 63 actions: 14 hand clapping, 24 hand waving, and 25 boxing, performed by 10 subjects. Each sequence contains multiple types of actions. Some sequences contain actions performed by different people. There are both indoor and outdoor scenes. All of the video sequences are captured with clutter and moving backgrounds. Each video has a low resolution of 320 x 240 and a frame rate of 15 frames per second. Their lengths are between 32 and 76 seconds. To evaluate performance, we manually label a spatio-temporal bounding box for each action. The ground truth labeling can be found in the groundtruth.txt file. The ground truth format of each labeled action is "X width Y height T length".
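Assuming each labeled action in groundtruth.txt is stored as six whitespace-separated numbers in the stated "X width Y height T length" order (the exact file layout should be verified against the file itself), a parser might look like:

```python
def parse_action(line):
    """Parse one labeled action in 'X width Y height T length' order
    into a spatio-temporal bounding-box dict."""
    x, width, y, height, t, length = (int(v) for v in line.split())
    return {
        "x_range": (x, x + width),    # horizontal extent in pixels
        "y_range": (y, y + height),   # vertical extent in pixels
        "t_range": (t, t + length),   # temporal extent in frames
    }

# Invented example line for illustration.
box = parse_action("120 60 80 150 300 45")
print(box["t_range"])  # (300, 345)
```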
Microsoft Research Action Data Set II is an extended version of the Microsoft Research Action Data Set. It consists of 54 video sequences recorded in a crowded environment. Each video sequence consists of multiple actions. There are three action types: hand waving, handclapping, and boxing. These action types are overlapped with the KTH data set. One could perform cross-data-set action recognition by using the KTH data set for training while using this data set for testing.
MSR-Action3D dataset contains twenty actions: high arm wave, horizontal arm wave, hammer, hand catch, forward punch, high throw, draw x, draw tick, draw circle, hand clap, two hand wave, side-boxing, bend, forward kick, side kick, jogging, tennis swing, tennis serve, golf swing, pick up & throw. There are 10 subjects, each subject performs each action 2 or 3 times. There are 567 depth map sequences in total. The resolution is 320x240. The data was recorded with a depth sensor similar to the Kinect device.
DailyActivity3D dataset is a daily activity dataset captured by a Kinect device. There are 16 activity types: drink, eat, read book, call cellphone, write on a paper, use laptop, use vacuum cleaner, cheer up, sit still, toss paper, play game, lay down on sofa, walk, play guitar, stand up, sit down. There are 10 subjects. Each subject performs each activity twice, once in standing position, and once in sitting position. There is a sofa in the scene. Three channels are recorded: depth maps (.bin), skeleton joint positions (.txt), and RGB video (.avi).
MSR Gesture 3D Dataset (weak correlation)
The dataset was captured by a Kinect device. There are 12 dynamic American Sign Language (ASL) gestures, and 10 people. Each person performs each gesture 2-3 times. There are 336 files in total, each corresponding to a depth sequence. The hand portion (above the wrist) has been segmented.
MuHAVi (Multicamera Human Action Video Data)
We have collected a large body of human action video data (MuHAVi) using 8 cameras. There are 17 action classes performed by 14 actors: WalkTurnBack, RunStop, Punch, Kick, ShotGunCollapse, PullHeavyObject, PickupThrowObject, WalkFall, LookInCar, CrawlOnKnees, WaveArms, DrawGraffiti, JumpOverFence, DrunkWalk, ClimbLadder, SmashObject, and JumpOverGap.
The Olympic Sports Dataset contains videos of athletes practicing different sports. We have obtained all video sequences from YouTube and annotated their class label with the help of Amazon Mechanical Turk.
The current release contains 16 sports: high jump, long jump, triple jump, pole vault, discus throw, hammer throw, javelin throw, shot put, basketball lay-up, bowling, tennis serve, platform diving, springboard diving, snatch (weightlifting), clean-and-jerk (weightlifting), and gymnastic vault.
The POETICON video dataset is used for several experiments. It is separated into 6 activities (Cleaning, Make Salad, Make Sangria, Packing a Parcel, Planting, and Table Setting), with the segmented actions that describe each activity saved in separate zip files.
For example, the actions of the Cleaning activity are: sweeping with a broom (ub1), cleaning a chair with a cloth (ub2), clearing a trash bin (I), cleaning a lamp with a cloth (ub1), cleaning glasses with a cloth (T), changing a light bulb (ub2), folding a napkin (T), cleaning a small table with a cloth (K), changing clock batteries (ub2), and adjusting the clock time (ub2).
Rochester AoDL (University of Rochester Activities of Daily Living Dataset)
A high-resolution video dataset was recorded of activities of daily living, such as answering a phone, dialing a phone, looking up a phone number in a telephone directory, writing a phone number on a whiteboard, drinking a glass of water, eating snack chips, peeling a banana, eating a banana, chopping a banana, and eating food with silverware.
These activities were each performed three times by five different people, all members of a computer science department who were naive to the details of our model when the data was collected.
SBU Kinect Interaction
We collected eight interactions (approaching, departing, pushing, kicking, punching, exchanging objects, hugging, and shaking hands) from seven participants and 21 pairs of two-actor sets. The entire dataset has approximately 300 interactions in total. It comprises RGB-D video sequences of humans performing interaction activities, recorded with the Microsoft Kinect sensor. In our dataset, color-depth video and motion capture data have been synchronized and annotated with an action label for each frame.
Stanford 40 Actions
The Stanford 40 Action Dataset contains images of humans performing 40 actions. In each image, we provide a bounding box of the person who is performing the action indicated by the filename of the image. There are 9532 images in total with 180-300 images per action class.
The TUM Kitchen Data Set is provided to foster research in the areas of markerless human motion capture, motion segmentation and human activity recognition. It should aid researchers in these fields by providing a comprehensive collection of sensory input data that can be used to try out and to verify their algorithms. It is also meant to serve as a benchmark for comparative studies given the manually annotated "ground truth" labels of the underlying actions. The recorded activities have been selected with the intention to provide realistic and seemingly natural motions, and consist of everyday manipulation activities in a natural kitchen environment.
UCF101 is an action recognition data set of realistic action videos, collected from YouTube, having 101 action categories. This data set is an extension of UCF50 data set which has 50 action categories.
With 13320 videos from 101 action categories, UCF101 gives the largest diversity in terms of actions and with the presence of large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc, it is the most challenging data set to date. As most of the available action recognition data sets are not realistic and are staged by actors, UCF101 aims to encourage further research into action recognition by learning and exploring new realistic action categories.
The videos in 101 action categories are grouped into 25 groups, where each group can consist of 4-7 videos of an action. The videos from the same group may share some common features, such as similar background, similar viewpoint, etc.
The action categories can be divided into five types: 1) Human-Object Interaction, 2) Body-Motion Only, 3) Human-Human Interaction, 4) Playing Musical Instruments, and 5) Sports.
UCF YouTube Action (UCF11)
It contains 11 action categories: basketball shooting, biking/cycling, diving, golf swinging, horseback riding, soccer juggling, swinging, tennis swinging, trampoline jumping, volleyball spiking, and walking with a dog.
This data set is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc.
For each category, the videos are grouped into 25 groups with more than 4 action clips in it. The video clips in the same group share some common features, such as the same actor, similar background, similar viewpoint, and so on.
The videos are in MS MPEG-4 format. You need to install the right codec (e.g., the K-Lite Codec Pack contains a collection of codecs) to play them.
UCF50 is an action recognition data set with 50 action categories, consisting of realistic videos taken from YouTube. This data set is an extension of the YouTube Action data set (UCF11), which has 11 action categories.
Most of the available action recognition data sets are not realistic and are staged by actors. In our data set, the primary focus is to provide the computer vision community with an action recognition data set consisting of realistic videos taken from YouTube. Our data set is very challenging due to large variations in camera motion, object appearance and pose, object scale, viewpoint, cluttered background, illumination conditions, etc. For all 50 categories, the videos are grouped into 25 groups, where each group consists of more than 4 action clips. The video clips in the same group may share some common features, such as the same person, similar background, similar viewpoint, and so on.
UCF Sports dataset consists of a set of actions collected from various sports which are typically featured on broadcast television channels such as the BBC and ESPN. The video sequences were obtained from a wide range of stock footage websites including BBC Motion gallery and GettyImages.
The dataset includes a total of 150 sequences with the resolution of 720 x 480. The collection represents a natural pool of actions featured in a wide range of scenes and viewpoints. By releasing the data set we hope to encourage further research into this class of action recognition in unconstrained environments. Since its introduction, the dataset has been used for numerous applications such as: action recognition, action localization, and saliency detection.
The dataset includes the following 10 actions: diving (14 videos), golf swing (18 videos), kicking (20 videos), lifting (6 videos), riding horse (12 videos), running (13 videos), skateboarding (12 videos), swing-bench (20 videos), swing-side (13 videos), and walking (22 videos).
The UMPM Benchmark is a collection of video recordings together with a ground truth based on motion capture data. It is intended to be used for assessing the quality of methods for recognition of poses from multiple persons from video data, both using a single or multiple cameras.
The (UMPM) benchmark includes synchronized motion capture data and video sequences from multiple viewpoints for multi-person motion including multi-person interaction. The data set is available to the research community to promote research in multi-person articulated human motion analysis.
The recordings should also show the main challenges of multi-person motion: visibility (a (part of a) person is not visible because of occlusions by other persons or static objects, or by self-occlusion) and ambiguity (body parts are identified ambiguously when persons are close to each other).

The body poses and gestures are classified as natural (commonly used in daily life) and synthetic (special human movements for some particular purpose such as human-computer interaction, sports or gaming). Each of these two classes is subdivided into a few scenarios; in total, our data set consists of 9 different scenarios. Each scenario is recorded with 1, 2, 3 and 4 persons in the scene and is recorded multiple times to provide variation, i.e. different subject combinations, order of poses and motion patterns.

For natural motion we defined 5 different scenarios where the subjects (1) walk, jog and run in an arbitrary way among each other, (2) walk along a circle or triangle of a predetermined size, (3) walk around while one of them sits or hangs on a chair, (4) sit, lie, hang or stand on a table or walk around it, and (5) grab objects from a table. These scenarios include individual actions, but the number of subjects moving around in the restricted area causes inter-person occlusions. We also include two scenarios with interaction between the subjects: (6) a conversation with natural gestures, and (7) the subjects throw or pass a ball to each other while walking around. The scenarios with synthetic motions include poses as shown in Figure 1, performed when the subjects (8) stand still and (9) move around. These scenarios are recorded without any static occluders to focus only on inter-person occlusions.
The UT-Interaction dataset contains videos of continuous executions of 6 classes of human-human interactions: shake-hands, point, hug, push, kick, and punch. Ground truth labels for these interactions are provided, including time intervals and bounding boxes. There are a total of 20 video sequences, each around 1 minute long. Each video contains at least one execution per interaction, yielding on average 8 executions of human activities per video. Several participants with more than 15 different clothing conditions appear in the videos. The videos are taken at a resolution of 720 × 480 at 30 fps, and the height of a person in the video is about 200 pixels.
We divide the videos into two sets. Set 1 is composed of 10 video sequences taken in a parking lot; these videos are taken with slightly different zoom rates, and their backgrounds are mostly static with little camera jitter. Set 2 (the other 10 sequences) was taken on a lawn on a windy day; the background moves slightly (e.g., trees sway), and the videos contain more camera jitter. In sequences 1 to 4 and 11 to 13, only the two interacting persons appear in the scene. In sequences 5 to 8 and 14 to 17, both interacting persons and pedestrians are present in the scene. In sequences 9, 10, 18, 19, and 20, several pairs of interacting persons execute the activities simultaneously. Each set has a different background, scale, and illumination.
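The interaction labels above come with time intervals. A standard way to score a detected interval against a ground-truth interval is temporal intersection-over-union; a minimal sketch follows (the `(start_frame, end_frame)` tuple representation is an assumption for illustration, not the dataset's annotation format):

```python
def temporal_iou(a, b):
    """IoU of two frame intervals given as (start, end), end inclusive."""
    # Overlap in frames; non-positive means the intervals are disjoint.
    inter = min(a[1], b[1]) - max(a[0], b[0]) + 1
    if inter <= 0:
        return 0.0
    union = (a[1] - a[0] + 1) + (b[1] - b[0] + 1) - inter
    return inter / union
```

A detection is then typically counted as correct when its temporal IoU with a ground-truth interval exceeds a fixed threshold (e.g., 0.5).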
We provide a large body of synthetic video data generated for evaluating silhouette-based human action recognition algorithms. The data consist of 20 action classes, 9 actors, and up to 40 synchronized perspective camera views. For action recognition algorithms that are purely based on human body masks, where other image properties such as colour and intensity are not used, it is well known that obtaining accurate silhouette data from video frames is important. This problem is usually considered not as part of action recognition, but as a lower-level problem in motion tracking and change detection. Hence, for researchers working on the recognition side, access to reliable Virtual Human Action Silhouette (ViHASi) data is both a necessity and a relief: such data enable comprehensive experimentation with and evaluation of the methods under study, which may even lead to their improvement.
The dataset is designed to be more realistic, natural, and challenging for the video surveillance domain than existing action recognition datasets in terms of its resolution, background clutter, diversity of scenes, and human activity/event categories.
Data was collected in natural scenes showing people performing normal actions in standard contexts, with uncontrolled, cluttered backgrounds. There are frequent incidental movers and background activities. Actions performed by directed actors were minimized; most actions were performed by the general population. Data was collected at multiple sites distributed throughout the USA. A variety of camera viewpoints and resolutions were included, and actions are performed by many different people. Diverse types of human actions and human-vehicle interactions are included, with a large number of examples (>30) per action class. Many applications such as video surveillance operate across a wide range of spatial and temporal resolutions. The dataset is designed to capture these ranges, with 2–30 Hz frame rates and 10–200 pixels in person-height. The dataset provides both the original videos in HD quality and spatially and temporally downsampled versions. Both ground camera videos and aerial videos are collected and released as part of the VIRAT Video Dataset.
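The range of frame rates and resolutions described above can be simulated from the HD originals by subsampling. A minimal NumPy sketch (the function name and the naive strided subsampling are illustrative assumptions; a real pipeline would low-pass filter before decimating to avoid aliasing):

```python
import numpy as np

def downsample_clip(frames, fps_in=30.0, fps_out=2.0, scale=4):
    """Return a spatially and temporally subsampled copy of a clip.

    frames: array of shape (T, H, W, C) sampled at fps_in.
    Keeps every round(fps_in / fps_out)-th frame and every
    `scale`-th pixel along each spatial axis.
    """
    step = max(1, round(fps_in / fps_out))
    return np.asarray(frames)[::step, ::scale, ::scale]
```

For example, a 2-second 30 Hz clip reduced to 2 Hz keeps 4 frames, and `scale=4` shrinks a 200-pixel person to roughly 50 pixels in height.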
We collected a database of 90 low-resolution (180 × 144, deinterlaced, 50 fps) video sequences showing nine different people, each performing 10 natural actions such as "run," "walk," "skip," "jumping-jack" (or, for short, "jack"), "jump-forward-on-two-legs" ("jump"), "jump-in-place-on-two-legs" ("pjump"), "gallop-sideways" ("side"), "wave-two-hands" ("wave2"), "wave-one-hand" ("wave1"), and "bend." To obtain space-time shapes of the actions, we subtracted the median background from each of the sequences and applied simple thresholding in color space.
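The median-background procedure described above can be sketched in a few lines of NumPy (the function name and the fixed threshold value are illustrative assumptions, not the authors' exact settings):

```python
import numpy as np

def extract_silhouettes(frames, threshold=30.0):
    """Extract binary space-time shapes from a video clip.

    frames: uint8 array of shape (T, H, W, 3).
    Returns a boolean array of shape (T, H, W).
    """
    frames = np.asarray(frames, dtype=np.float32)
    # The per-pixel median over time approximates the static background.
    background = np.median(frames, axis=0)
    # Euclidean color distance to the background, then a fixed threshold.
    diff = np.linalg.norm(frames - background, axis=-1)
    return diff > threshold
```

Stacking the resulting masks along the time axis yields the space-time shape of an action, which the Weizmann-style methods then analyze directly.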
As part of our research on real-time multi-view human action recognition in a camera network, we collected data of subjects performing several actions from different views using a network of 8 embedded cameras. This data could potentially be useful for related research on activity recognition.
Dataset 1: This dataset was used to evaluate recognition of unit actions: each sample consists of a subject performing only one action, the start and end times of each action are known, and the input provided corresponds exactly to the duration of an action. The subject performs a set of 12 actions at approximately the same pace. The data was collected at a rate of 20 fps with 640 × 480 resolution.
Dataset 2: This dataset was used for evaluating interleaved sequences of actions. Each sequence consists of multiple unit actions, and each unit action may be of varying duration. The data was collected at a rate of 20 fps with 960 × 720 resolution.
The multi-camera network system consists of 8 cameras that provide completely overlapping coverage of a rectangular region R (about 50 × 50 feet) from different viewing directions.