Browse » Documents

Top Documents Datasets

Datasets related to using computer vision with images of documents, invoices, papers, contracts, screenshots, text, signatures, pdfs, jpegs, pngs, and more. Commonly used with optical character recognition (OCR) to translate text into usable data.

Guide to extract document structure
https://blog.roboflow.com/using-computer-vision-extract-document-structure/

Document processing case study
https://roboflow.com/case-study/column

Research for object detection with PDFs
https://vtechworks.lib.vt.edu/bitstream/handle/10919/109979/ObjectDetectionReport.pdf?sequence=16&isAllowed=y

About This Dataset

The Roboflow Website Screenshots dataset is a synthetically generated dataset composed of screenshots from over 1000 of the world's top websites. They have been automatically annotated to label the following classes:
:fa-spacer:

  • button - navigation links, tabs, etc.
  • heading - text that was enclosed in <h1> to <h6> tags.
  • link - inline, textual <a> tags.
  • label - text labeling form fields.
  • text - all other text.
  • image - <img>, <svg>, or <video> tags, and icons.
  • iframe - ads and 3rd party content.

Example

This is an example image and annotation from the dataset:
WIkipedia Screenshot

Usage

Annotated screenshots are very useful in Robotic Process Automation. But they can be expensive to label. This dataset would cost over $4000 for humans to label on popular labeling services. We hope this dataset provides a good starting point for your project. Try it with a model from our model library.

Collecting Custom Data

Roboflow is happy to provide a custom screenshots dataset to meet your particular needs. We can crawl public or internal web applications. Just reach out and we'll be happy to provide a quote!

About Roboflow

Roboflow makes managing, preprocessing, augmenting, and versioning datasets for computer vision seamless.
:fa-spacer:
Developers reduce 50% of their boilerplate code when using Roboflow's workflow, save training time, and increase model reproducibility.
:fa-spacer:

Roboflow Wordmark

The dataset comes from Devashish Prasad, Ayan Gadpal, Kshitij Kapadni, Manish Visave, and Kavita Sultanpure - creators of CascadeTabNet.

Depending on the dataset version downloaded, the images will include annotations for 'borderless' tables, 'bordered' tables', and 'cells'. Borderless tables are those in which every cell in the table does not have a border. Bordered tables are those in which every cell in the table has a border, and the table is bordered. Cells are the individual data points within the table.

A subset of the full dataset, the ICDAR Table Cells Dataset, was extracted and imported to Roboflow to create this hosted version of the Cascade TabNet project. All the additional dataset components used in the full project are available here: All Files.

Versions:

  1. Version 1, raw-images : 342 raw images of tables. No augmentations, preprocessing step of auto-orient was all that was added.
  2. Version 2, tableBordersOnly-rawImages : 342 raw images of tables. This dataset version contains the same images as version 1, but with the caveat of Modify Classes being applied to omit the 'cell' class from all images (rendering these images to be apt for creating a model to detect 'borderless' tables and 'bordered' tables.

For the versions below: Preprocessing step of Resize (416by416 Fit within-white edges) was added along with more augmentations to increase the size of the training set and to make our images more uniform. Preprocessing applies to all images whereas augmentations only apply to training set images.
3. Version 3, augmented-FAST-model : 818 raw images of tables. Trained from Scratch (no transfer learning) with the "Fast" model from Roboflow Train. 3X augmentation (generated images).
4. Version 4, augmented-ACCURATE-model : 818 raw images of tables. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation.
5. Version 5, tableBordersOnly-augmented-FAST-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Fast" model from Roboflow Train. 3X augmentation.
6. Version 6, tableBordersOnly-augmented-ACCURATE-model : 818 raw images of tables. 'Cell' class ommitted with Modify Classes. Trained from Scratch with the "Accurate" model from Roboflow Train. 3X augmentation.

Example Image from the DatasetExample Image from the Dataset

Cascade TabNet in ActionCascade TabNet in Action
CascadeTabNet is an automatic table recognition method for interpretation of tabular data in document images. We present an improved deep learning-based end to end approach for solving both problems of table detection and structure recognition using a single Convolution Neural Network (CNN) model. CascadeTabNet is a Cascade mask Region-based CNN High-Resolution Network (Cascade mask R-CNN HRNet) based model that detects the regions of tables and recognizes the structural body cells from the detected tables at the same time. We evaluate our results on ICDAR 2013, ICDAR 2019 and TableBank public datasets. We achieved 3rd rank in ICDAR 2019 post-competition results for table detection while attaining the best accuracy results for the ICDAR 2013 and TableBank dataset. We also attain the highest accuracy results on the ICDAR 2019 table structure recognition dataset.

From the Original Authors:

If you find this work useful for your research, please cite our paper:
@misc{ cascadetabnet2020,
title={CascadeTabNet: An approach for end to end table detection and structure recognition from image-based documents},
author={Devashish Prasad and Ayan Gadpal and Kshitij Kapadni and Manish Visave and Kavita Sultanpure},
year={2020},
eprint={2004.12629},
archivePrefix={arXiv},
primaryClass={cs.CV}
}

The BIB_Detection image dataset contains over one thousand images of athletes running in races, with their annotations for all bibs. This dataset and the already-trained object detection model are a great place to start for anyone looking to build a computer vision project around a race. Additionally, this data could be used to augment an existing detection dataset that detects numbers or characters more generally.

This segementation model is trained on old and new reports from the oil and gas industry.

It focusses on finding relevant images in these reports that can then be extracted and later on be classified with a CNN.

The tools to extract the images from pdf documets using the a google colab trained yolov5s model and a trained image classifier for oil and gas related geology and gepohphysics images can be found on the github page
https://github.com/bolgebrygg

Overview

The Poster dataset is an object detection dataset of posters in instagram-images. Annotations include examples of just "poster".

Example Image:
Example Image

Use Cases

This dataset is used to cut raw posters out of an image.

Using this Dataset

Use the fork button to copy this dataset to your own Roboflow account and export it with new preprocessing settings (perhaps resized for your model's desired format or converted to grayscale), or additional augmentations to make your model generalize better. This particular dataset would be very well suited for Roboflow's new advanced Bounding Box Only Augmentations.

This dataset was originally created by Montso Mokake. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/tweeter/tweeter.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Pierrick Dossin. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/signature-detection/signaturesdetectiob.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Parfait Ahouanto. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/public1/activity-diagrams-s7sxv.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Wojciech Blachowski. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/wojciech-blachowski/tweets.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Oliver Giesecke, Christian Green. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/new-workspace-vf0ib/calpers.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Rik Biswas, Aakansha Prasad, Sarmistha Das. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/rik-biswas/tabular-data-dh4ek.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Mohamed Refai, Abarna, Amjad Hafiz, Sutheshan Maiu, and Thanusha Sritharan. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/new-workspace-4vus5/singlemcq.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark

This dataset was originally created by Aman Ahuja and Alan Devera. To see the current project, which may have been updated since this version, please go here: https://universe.roboflow.com/object-detection-alan-devera/object-detection-ycqjb.

This dataset is part of RF100, an Intel-sponsored initiative to create a new object detection benchmark for model generalizability.

Access the RF100 Github repo: https://github.com/roboflow-ai/roboflow-100-benchmark