ROAD is designed to test an autonomous vehicle's ability to detect road events, defined as triplets composed of an active agent, the action(s) it performs, and the corresponding scene locations. ROAD comprises videos originally from the Oxford RobotCar Dataset, annotated with bounding boxes showing the location in the image plane of each road event.
SEND (Stanford Emotional Narratives Dataset) is a set of rich, multimodal videos of self-paced, unscripted emotional narratives, annotated for emotional valence over time. The complex narratives and naturalistic expressions in this dataset provide a challenging test for contemporary time-series emotion recognition models.
SeaDronesSee is a large-scale data set aimed at helping develop systems for Search and Rescue (SAR) using Unmanned Aerial Vehicles (UAVs) in maritime scenarios. Building highly complex autonomous UAV systems that aid in SAR missions requires robust computer vision algorithms to detect and track objects or persons of interest. This data set provides three sets of tracks: object detection, single-object tracking and multi-object tracking. Each track consists of its own data set and leaderboard.
SynPick is a synthetic dataset for dynamic scene understanding in bin-picking scenarios. In contrast to existing datasets, this dataset is both situated in a realistic industrial application domain -- inspired by the well-known Amazon Robotics Challenge (ARC) -- and features dynamic scenes with authentic picking actions as chosen by our picking heuristic developed for the ARC 2017. The dataset is compatible with the popular BOP dataset format.
To create the TED-talks dataset, 3,035 YouTube videos were downloaded using the "TED talks" query. From these initial candidates, videos were selected in which the upper part of the person is visible for at least 64 frames and the height of the person bounding box is at least 384 pixels. Static videos, and videos in which the person is doing something other than presenting, were then manually filtered out.
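The automatic part of this selection can be sketched as a simple threshold filter. This is a minimal illustration, not the authors' pipeline: the `CandidateVideo` record and its field names are hypothetical, and it assumes the bounding-box height has already been summarized as one value per video. Only the two thresholds (64 frames, 384 pixels) come from the description above. The manual filtering step is not modeled.

```python
from dataclasses import dataclass
from typing import Iterable, List

MIN_VISIBLE_FRAMES = 64    # threshold stated in the description
MIN_BBOX_HEIGHT_PX = 384   # threshold stated in the description


@dataclass
class CandidateVideo:
    """Hypothetical per-video summary; field names are illustrative only."""
    video_id: str
    upper_body_visible_frames: int  # frames in which the upper body is visible
    person_bbox_height_px: int      # height of the person bounding box


def passes_automatic_filter(video: CandidateVideo) -> bool:
    """Return True when the video meets both stated thresholds."""
    return (
        video.upper_body_visible_frames >= MIN_VISIBLE_FRAMES
        and video.person_bbox_height_px >= MIN_BBOX_HEIGHT_PX
    )


def select_candidates(videos: Iterable[CandidateVideo]) -> List[CandidateVideo]:
    """Keep only the candidates that pass the automatic filter."""
    return [v for v in videos if passes_automatic_filter(v)]
```

A video with 64 visible frames and a 384-pixel box passes, since both thresholds are inclusive ("at least").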
The dataset is composed of 10,000 videos annotated with memorability scores. In contrast to previous work on image memorability -- where memorability was measured a few minutes after memorization -- memory performance is measured twice: a few minutes after memorization and again 24-72 hours later.
The iQIYI-VID dataset comprises video clips from iQIYI variety shows, films, and television dramas. The whole dataset contains 500,000 video clips of 5,000 celebrities. Each clip is 1 to 30 seconds long.
300 Videos in the Wild (300-VW) is a dataset for evaluating facial landmark tracking algorithms in the wild. The dataset authors collected a large number of long facial videos recorded in the wild. The dataset includes 114 videos, each about 1 minute long (at 25-30 fps). All frames are annotated with the same mark-up (i.e., set of facial landmarks) used in the 300 W competition (a total of 68 landmarks).
AIST++ is a 3D dance dataset which contains 3D motion reconstructed from real dancers paired with music. The AIST++ Dance Motion Dataset is constructed from the AIST Dance Video DB. With multi-view videos, an elaborate pipeline is designed to estimate the camera parameters, 3D human keypoints, and 3D human dance motion sequences.
BL30K is a synthetic dataset rendered using Blender with ShapeNet's data. We break the dataset into six segments, each with approximately 5K videos. The videos are organized in a format similar to DAVIS and YouTubeVOS, so dataloaders for those datasets can be used directly. Each video is 160 frames long, and each frame has a resolution of 768×512. There are 3-5 objects per video, and each object follows a random smooth trajectory. We tried to optimize the trajectories greedily to minimize object intersection (not guaranteed), so occlusions are still possible, as they often are in real footage. See MiVOS for details.
BLVD is a large-scale 5D semantics dataset collected by the Visual Cognitive Computing and Intelligent Vehicles Lab. This dataset contains 654 high-resolution video clips comprising 120k frames captured in Changshu, Jiangsu Province, China, where the Intelligent Vehicle Proving Center of China (IVPCC) is located. The frame rate is 10 fps for both RGB data and 3D point clouds. The dataset contains fully annotated frames which yield 249,129 3D annotations, 4,902 independent individuals for tracking with an overall length of 214,922 points, 6,004 valid fragments for 5D interactive event recognition, and 4,900 individuals for 5D intention prediction. These tasks are contained in four kinds of scenarios depending on the object density (low and high) and light conditions (daytime and nighttime).
BosphorusSign22k is a benchmark dataset for vision-based user-independent isolated Sign Language Recognition (SLR). The dataset is based on the BosphorusSign (Camgoz et al., 2016c) corpus, which was collected with the purpose of helping both the linguistics and computer science communities. It contains isolated videos of Turkish Sign Language glosses from three different domains: health, finance, and commonly used everyday signs. Videos in this dataset were performed by six native signers, which makes this dataset valuable for user-independent sign language studies.
The Car Crash Dataset (CCD) is collected for traffic accident analysis. It contains real traffic accident videos captured by dashcams mounted on driving vehicles, which is critical to developing safety-guaranteed self-driving systems. CCD is distinguished from existing datasets by its diversified accident annotations, including environmental attributes (day/night, snowy/rainy/good weather conditions), whether ego-vehicles are involved, accident participants, and accident reason descriptions.
We provide manual annotations of 14 semantic keypoints for 100,000 car instances (sedan, SUV, bus, and truck) from 53,000 images captured from 18 moving cameras at multiple intersections in Pittsburgh, PA. Please fill in the Google form to receive an email with the download links.
The GTA Indoor Motion dataset (GTA-IM) emphasizes human-scene interactions in indoor environments. It consists of HD RGB-D image sequences of 3D human motion from a realistic game engine. The dataset has clean 3D human pose and camera pose annotations, and large diversity in human appearances, indoor environments, camera views, and human activities.
HAA500 is a manually annotated human-centric atomic action dataset for action recognition on 500 classes with over 591k labeled frames. Unlike existing atomic action datasets, where coarse-grained atomic actions were labeled with action verbs, e.g., "Throw", HAA500 contains fine-grained atomic actions where only consistent actions fall under the same label, e.g., "Baseball Pitching" vs "Free Throw in Basketball", to minimize ambiguities in action classification. HAA500 has been carefully curated to capture the movement of human figures with less spatio-temporal label noise, to greatly enhance the training of deep neural networks.
Kitchen Scenes is a multi-view RGB-D dataset of nine kitchen scenes, each containing several objects in realistic cluttered environments, including a subset of objects from the BigBird dataset. The viewpoints of the scenes are densely sampled, and objects in the scenes are annotated both with bounding boxes and in the 3D point cloud.
LIVE Livestream, formally the Laboratory for Image and Video Engineering (LIVE) Livestream Database, is a database for Video Quality Assessment (VQA) designed specifically for live-streaming VQA research. It includes 315 videos of 45 contents impaired by 6 types of distortions.
LSVTD is a large-scale video text dataset intended to advance video text spotting research. It contains 100 text videos from 22 different real-life scenarios. LSVTD covers 13 indoor scenarios (e.g., bookstore, shopping mall) and 9 outdoor scenarios, which is more than 3 times the diversity of IC15.
The MLB-YouTube dataset is a new, large-scale dataset consisting of 20 baseball games from the 2017 MLB post-season available on YouTube with over 42 hours of video footage. The dataset consists of two components: segmented videos for activity recognition and continuous videos for activity classification. It is quite challenging as it is created from TV broadcast baseball games where multiple different activities share the camera angle. Further, the motion/appearance difference between the various activities is quite small.
This is a dataset for a shot boundary detection task. The dataset contains 2 existing datasets and 19 manually marked-up open-source videos with a total length of more than 1,200 minutes and 10,000 scene transitions. The dataset includes different types of videos with resolutions from 360×288 to 1920×1080 in MP4 and MKV formats. Videos include samples in RGB or in grayscale, with frame rates from 23 to 60 FPS.
The Privacy Annotated HMDB51 (PA-HMDB51) dataset is a video-based dataset for evaluating privacy protection in visual action recognition algorithms. The dataset contains both target task labels (action) and selected privacy attributes (skin color, face, gender, nudity, and relationship) annotated on a per-frame basis.
The Query-based Video Highlights (QVHighlights) dataset is a dataset for detecting customized moments and highlights from videos given natural language (NL) queries. It consists of over 10,000 YouTube videos, covering a wide range of topics, from everyday activities and travel in lifestyle vlog videos to social and political activities in news videos. Each video in the dataset is annotated with: (1) a human-written free-form NL query, (2) relevant moments in the video w.r.t. the query, and (3) five-point scale saliency scores for all query-relevant clips.
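The three annotation components listed above could be held in a small record type like the one below. This is only a sketch to make the structure concrete: the class name, field names, the use of seconds for moment boundaries, and keying saliency scores by clip index are all assumptions for illustration, not the dataset's released file format.

```python
from dataclasses import dataclass, field
from typing import Dict, List, Tuple


@dataclass
class HighlightAnnotation:
    """Hypothetical container for one annotated video (illustrative schema)."""
    query: str                                     # (1) free-form NL query
    relevant_moments: List[Tuple[float, float]]    # (2) (start, end), assumed seconds
    saliency_scores: Dict[int, int] = field(default_factory=dict)  # (3) clip index -> score

    def total_relevant_duration(self) -> float:
        """Sum of the lengths of all query-relevant moments."""
        return sum(end - start for start, end in self.relevant_moments)

    def is_well_formed(self) -> bool:
        """Check for a non-empty query and moments with 0 <= start < end."""
        return bool(self.query.strip()) and all(
            0 <= start < end for start, end in self.relevant_moments
        )
```

For example, an annotation with moments (10.0, 24.0) and (40.0, 46.0) has a total relevant duration of 20.0 under this sketch.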
A large-scale dataset for retrieval and event localisation in video. A unique feature of the dataset is the availability of two audio tracks for each video: the original audio, and a high-quality spoken description of the visual content.