Dataset-Collecting Various Text Datasets

Datasets related to scene text.

Preface

A collection of various text-related datasets! Thanks to my senior labmate for sharing.

Main Text

Real Datasets

CTW 数据集 (Chinese Text in the Wild)

A dataset of Chinese characters in natural scenes.

Resources:

Contents:

  • 32,285 high-resolution images
  • 1,018,402 character instances
  • 3,850 character categories
  • 6 attribute types

Dataset structure:

  • Training + validation sets: images-trainval
  • Test set: images-test
  • Pre-trained models: trained-models
    • alexnet
    • inception
    • overfeat
    • resnet
    • vgg
    • yolo
  • Annotation files: ctw-annotations

For example, take the file 0000172.jpg in the training set:

[image: 0000172.jpg]
json
{"annotations": [[
    {
        "adjusted_bbox": [140.26028096262758, 897.1957001682758, 22.167573140645146, 38.36424196832945], 
        "attributes": ["distorted", "raised"], 
        "is_chinese": true, 
        "polygon": [[140.26028096262758, 896.7550603352049], [162.42785410327272, 898.0769798344178], [162.42785410327272, 935.7929346470926], [140.26028096262758, 935.0939571156308]], 
        "text": "\u660e"
    }, 
    {
        "adjusted_bbox": [162.42785410327272, 898.5416545674744, 23.376713493771263, 37.74268246537315], 
        "attributes": ["distorted", "raised"],
        "is_chinese": true,
        "polygon": [[162.42785410327272, 898.0769798344178], [185.80456759704398, 899.4710040335876], [185.80456759704398, 936.5300382257251], [162.42785410327272, 935.7929346470926]],
        "text": "\u6d77"
    },
    ...
]], "image_id": "0000172", "width": 2048}

In the corresponding annotation, each character has:

  • adjusted_bbox: the adjusted bounding box

  • attributes: text attributes

    • distorted: distorted
    • raised: raised / embossed
    • occluded: occluded
    • bgcomplex: complex background
    • handwritten: handwritten
    • wordart: word art
  • is_chinese: whether the character is Chinese

  • polygon: the actual bounding polygon

  • text: the Chinese character, stored in Unicode escape form
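The fields above can be pulled out with a small parser. A minimal sketch, assuming each annotation record is a JSON object shaped like the example above (character dicts nested in sentence-level lists under `annotations`); the sample is abbreviated from that example, and the `iter_characters` helper is my own, not part of the official CTW toolkit.

```python
import json

# One annotation record, abbreviated from the example above.
sample = json.loads('''
{"image_id": "0000172", "width": 2048,
 "annotations": [[
    {"adjusted_bbox": [140.26, 897.20, 22.17, 38.36],
     "attributes": ["distorted", "raised"],
     "is_chinese": true,
     "polygon": [[140.26, 896.76], [162.43, 898.08],
                 [162.43, 935.79], [140.26, 935.09]],
     "text": "\\u660e"},
    {"adjusted_bbox": [162.43, 898.54, 23.38, 37.74],
     "attributes": ["distorted", "raised"],
     "is_chinese": true,
     "polygon": [[162.43, 898.08], [185.80, 899.47],
                 [185.80, 936.53], [162.43, 935.79]],
     "text": "\\u6d77"}
 ]]}
''')

def iter_characters(record):
    """Flatten the nested sentence lists into (text, bbox, attributes) tuples."""
    for sentence in record["annotations"]:
        for char in sentence:
            if char["is_chinese"]:
                yield char["text"], char["adjusted_bbox"], char["attributes"]

chars = list(iter_characters(sample))
print([c[0] for c in chars])  # ['明', '海']
```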

SVT (Street View Text Dataset)

The Street View Text (SVT) dataset was harvested from Google Street View. Image text in this data exhibits high variability and often has low resolution. In dealing with outdoor street level imagery, we note two characteristics.

(1) Image text often comes from business signage and

(2) business names are easily available through geographic business searches.

These factors make the SVT set uniquely suited for word spotting in the wild: given a street view image, the goal is to identify words from nearby businesses.


Resources:

For example, take the file 17_18.jpg in the dataset:

[image: 17_18.jpg]

The corresponding ground truth has one text box per word, and also includes the address, environment, and other information:

xml
<image>
    <imageName>img/17_18.jpg</imageName>
    <address>420 South 1st Street San Jose CA 95112</address>
    <lex>SOUTH,FIRST,BILLIARDS,CLUB,AND,LOUNGE,AGENDA,RESTAURANT,BAR,RAMADA,LIMITED,SAN,JOSE,WET,NIGHTCLUB,MOTIF,ANNO,DOMINI,EULIPIA,DOWNTOWN,YOGA,SHALA,WHIPSAW,INC,ZOE,SAINTE,CLAIRE,HOTEL,SCORES,SPORTS,GRILL,WORKS,SPY,MUSEUM,QUILTS,TEXTILES,MIAMI,BEACH,STAGE,COMPANY,CACTUS,ANGELS,DAI,THANH,SUPERMARKET</lex>
    <Resolution x="1024" y="768"/>
    <taggedRectangles>
        <taggedRectangle height="41" width="152" x="480" y="403">
            <tag>BILLIARDS</tag>
        </taggedRectangle>
        <taggedRectangle height="33" width="78" x="407" y="410">
            <tag>FIRST</tag>
        </taggedRectangle>
        <taggedRectangle height="30" width="85" x="322" y="416">
            <tag>SOUTH</tag>
        </taggedRectangle>
    </taggedRectangles>
</image>
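The SVT ground truth can be read with the standard library's `xml.etree`. A minimal sketch over a string abbreviated from the example above; for real use you would point `ET.parse` at the dataset's XML file instead.

```python
import xml.etree.ElementTree as ET

# Abbreviated from the ground-truth example above.
svt_xml = """
<image>
    <imageName>img/17_18.jpg</imageName>
    <Resolution x="1024" y="768"/>
    <taggedRectangles>
        <taggedRectangle height="41" width="152" x="480" y="403">
            <tag>BILLIARDS</tag>
        </taggedRectangle>
        <taggedRectangle height="33" width="78" x="407" y="410">
            <tag>FIRST</tag>
        </taggedRectangle>
    </taggedRectangles>
</image>
"""

root = ET.fromstring(svt_xml)
words = []
for rect in root.iter("taggedRectangle"):
    # x, y, width, height are attributes; the transcription is a child <tag>.
    box = tuple(int(rect.get(k)) for k in ("x", "y", "width", "height"))
    words.append((rect.findtext("tag"), box))

print(words)  # [('BILLIARDS', (480, 403, 152, 41)), ('FIRST', (407, 410, 78, 33))]
```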

ICDAR

Resources:

Downloads - Focused Scene Text

Task 2.1: Text Localization (2013 edition)
  • Training set: 229 images
  • Test set: 233 images

For example, take the file img_1.jpg in the training set:

[image: img_1.jpg]

The corresponding ground truth gt_img_1.txt:

38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
128, 705, 483, 839, "short"
542, 710, 938, 841, "break"
87, 884, 457, 1021, "could"
517, 919, 831, 1024, "save"
166, 1095, 468, 1231, "your"
530, 1069, 743, 1206, "life"

Dataset visualization code:

python
import cv2
import os
import matplotlib.pyplot as plt
import numpy as np

index = 1

image_dir = r'XXX/ICDAR 2013/Challenge2_Test_Task12_Images/'
label_dir = r'XXX/ICDAR 2013/Challenge2_Test_Task1_GT/'

image_path = os.path.join(image_dir, 'img_' + str(index) + '.jpg')
label_path = os.path.join(label_dir, 'gt_img_' + str(index) + '.txt')

image_origin = cv2.imread(image_path)
image = image_origin.copy()
height, width, _ = image.shape

# Each annotation line looks like: x1, y1, x2, y2, "transcription"
with open(label_path, 'r') as label_file:
    annotations = label_file.readlines()

for annotation in annotations:
    # The first four fields are the corners of the axis-aligned box.
    coords = list(map(int, annotation.split(',')[:-1]))
    # The last field is the quoted transcription; [2:-2] strips ' "' and '"\n'.
    transcriptions = annotation.split(',')[-1][2:-2]
    points = np.array([(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)])
    # Draw the box, its corner points, and the transcription.
    cv2.rectangle(image, (points[0][0], points[0][1]), (points[1][0], points[1][1]), (255, 0, 0), 2)
    for p in points:
        cv2.circle(image, (p[0], p[1]), int(min(height, width) / 150), (0, 255, 255), -1)
    cv2.putText(image, transcriptions, (points[0][0], points[0][1] - int(min(height, width) / 150)),
                cv2.FONT_HERSHEY_SIMPLEX, min(height, width) / 1000, (0, 255, 0), int(min(height, width) / 500))
    
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 9))
axes = axes.flatten()
 
axes[0].imshow(cv2.cvtColor(image_origin, cv2.COLOR_BGR2RGB))
axes[0].axis('off')
axes[0].set_title('Origin')
 
axes[1].imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
axes[1].axis('off')
axes[1].set_title('Annotation')
 
plt.tight_layout()
plt.show()
[image: visualization result]
Task 2.2: Text Segmentation (2013 edition)

The dataset is the same as 2.1, except the ground truth consists of segmentation masks, e.g. gt_img_1.png:

[image: gt_img_1.png]
Task 2.3: Word Recognition (2013 edition)
  • Training set: 848 word images: Challenge2_Training_Task3_Images_GT
  • Test set: 1,095 word images: Challenge2_Test_Task3_Images, with Challenge2_Test_Task3_GT.txt

These images are all cropped from the previous datasets.

For example, take the file word_1.png in the training set:

[image: word_1.png]

The corresponding line in the ground truth gt.txt:

word_1.png, "PROPER"
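Each ground-truth line pairs a crop filename with its quoted transcription, so a one-line parser is enough. A minimal sketch, assuming every line follows the quoting shown above; `parse_word_gt` is a hypothetical helper name.

```python
def parse_word_gt(line):
    """Split a 'word_1.png, "PROPER"' line into (filename, transcription)."""
    name, _, rest = line.partition(',')
    return name.strip(), rest.strip().strip('"')

sample = 'word_1.png, "PROPER"'
print(parse_word_gt(sample))  # ('word_1.png', 'PROPER')
```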
Task 2.4: End to End (2015 edition)

Presumably the goal is for the network to recognize the words end to end, with a provided vocabulary?

  • Training set: 229 images
  • Test set: 233 images

The image img_1.jpg, its corresponding ground truth gt_img_1.txt, and the vocabulary voc_img_1.txt:

[image: img_1.jpg, gt_img_1.txt, voc_img_1.txt]

Downloads - Incidental Scene Text

Task 4.1: Text Localization (2015 edition)

The image quality here is really brutal orz

  • Training set: 1,000 images
  • Test set: 500 images

For example, take the file img_2.jpg in the test set:

[image: img_2.jpg]

The corresponding ground truth gt_img_2.txt:

790,302,903,304,902,335,790,335,JOINT
822,288,872,286,871,298,823,300,yourself
641,138,657,139,657,151,641,151,###
669,139,693,140,693,154,669,153,154
700,141,723,142,723,155,701,154,197
637,101,721,106,722,115,637,110,###
668,157,693,158,693,170,668,170,727
636,155,661,156,662,169,636,168,198
660,82,700,85,700,99,660,96,20029
925,252,973,254,973,262,925,262,###
789,284,818,284,818,297,789,297,Free
875,286,902,289,903,298,875,298,from
791,337,863,337,863,364,791,364,PAIN
794,445,818,445,818,473,794,473,###
922,440,962,442,963,462,922,463,###
924,476,967,476,968,489,924,491,###
924,505,962,506,965,518,923,519,###
847,524,887,524,887,555,847,555,###
791,474,822,474,822,500,791,500,###
780,582,910,576,909,583,780,588,###
854,456,902,455,902,465,854,467,###
854,467,903,467,903,480,854,480,###

Dataset visualization code:

python
import cv2
import os
import matplotlib.pyplot as plt
import numpy as np

index = 463

image_dir = r'XXX/ICDAR_2015/test_img/'
label_dir = r'XXX/ICDAR_2015/test_gt/'

image_path = os.path.join(image_dir, 'img_' + str(index) + '.jpg')
label_path = os.path.join(label_dir, 'gt_img_' + str(index) + '.txt')

image_origin = cv2.imread(image_path)
image = image_origin.copy()
height, width, _ = image.shape

# Each annotation line looks like: x1,y1,x2,y2,x3,y3,x4,y4,transcription
with open(label_path, 'r') as label_file:
    annotations = label_file.readlines()

for annotation in annotations:
    # The first eight fields are the four corners of the quadrilateral.
    coords = list(map(int, annotation.split(',')[:-1]))
    transcriptions = annotation.split(',')[-1].strip()
    points = np.array([(coords[i], coords[i + 1]) for i in range(0, len(coords), 2)])
    # Draw the (possibly rotated) quadrilateral, its corners, and the text.
    cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)
    for p in points:
        cv2.circle(image, (p[0], p[1]), int(min(height, width) / 150), (0, 255, 255), -1)
    cv2.putText(image, transcriptions, (points[0][0], points[0][1] - int(min(height, width) / 150)),
                cv2.FONT_HERSHEY_SIMPLEX, min(height, width) / 1000, (0, 255, 0), int(min(height, width) / 500))
    
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 9))
axes = axes.flatten()
 
axes[0].imshow(cv2.cvtColor(image_origin, cv2.COLOR_BGR2RGB))
axes[0].axis('off')
axes[0].set_title('Origin')
 
axes[1].imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
axes[1].axis('off')
axes[1].set_title('Annotation')
 
plt.tight_layout()
plt.show()
[image: visualization result]

The annotations drawn in this example:

1111,459,1266,495,1259,586,1104,550,FESTIVE
1100,523,1261,603,1244,719,1083,639,SALE
Task 4.2: Text Segmentation (N/A)

Not available.

Task 4.3: Word Recognition (2015 edition)

Word images cropped from the previous dataset.

  • Training set: 4,468 cropped word images
  • Test set: 2,077 cropped word images

For example, take the file word_10.png in the test set:

[image: word_10.png]

The corresponding line in Challenge4_Test_Task3_GT.txt:

word_10.png, "PAIN"
Task 4.4: End to End (2015 edition)

Emmm, this feels like a combination of the earlier tasks, with a vocabulary added.

  • Training set: 1,000 images
  • Test set: 500 images

ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW-17)

The images in this one are a real mixed bag...

Resources:


For example, take the file image_0.jpg in the training set:

[image: image_0.jpg]

The corresponding ground truth image_0.txt:

Each line gives the bounding box, a flag for whether the text is recognizable, and the transcription:

390,902,1856,902,1856,1225,390,1225,0,"金氏眼镜"
1875,1170,2149,1170,2149,1245,1875,1245,0,"创于 1989"
2054,1277,2190,1277,2190,1323,2054,1323,0,"城建店"
768,1648,987,1648,987,1714,768,1714,0,"金氏眼"
897,2152,988,2152,988,2182,897,2182,0,"金氏眼镜"
1457,2228,1575,2228,1575,2259,1457,2259,0,"金氏眼镜"
1858,2218,1966,2218,1966,2250,1858,2250,0,"金氏眼镜"
231,1853,308,1843,309,1885,230,1899,1,"谢#惠顾"
125,2270,180,2270,180,2288,125,2288,1,"###"
106,2297,160,2297,160,2316,106,2316,1,"###"
22,2363,82,2363,82,2383,22,2383,1,"###"
524,2511,837,2511,837,2554,524,2554,1,"###"
455,2456,921,2437,920,2478,455,2501,0,"欢迎光临"
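A minimal sketch of a parser for these lines, assuming the transcription is always the tenth comma-separated field (limiting the split keeps any commas inside the quotes intact); `parse_rctw_line` is a hypothetical helper name.

```python
def parse_rctw_line(line):
    """Parse '<8 coords>,<flag>,"text"' into (points, is_difficult, text)."""
    parts = line.strip().split(',', 9)  # at most 9 splits keeps the quoted text whole
    coords = list(map(int, parts[:8]))
    points = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    is_difficult = parts[8] == '1'
    text = parts[9].strip('"')
    return points, is_difficult, text

line = '390,902,1856,902,1856,1225,390,1225,0,"金氏眼镜"'
points, hard, text = parse_rctw_line(line)
print(points[0], hard, text)
```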

Total-Text

Resources: Total-Text Dataset | Papers With Code

A curved-text dataset:

  • Training set: 1,255 images
  • Test set: 300 images

Mostly English text, with a small amount of Chinese.

For example, take the file img11.jpg in the training set:

[image: img11.jpg]

The corresponding Character_Level_Mask ground truth img11.jpg:

[image: img11.jpg (character-level mask)]

The corresponding Text_Region_Mask ground truth img11.png:

[image: img11.png (text-region mask)]

MAT-format files poly_gt_img11.mat and rect_gt_img11.mat are also provided; they presumably store some shape information.

TextSeg

Resources: Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

A text segmentation dataset of stylized (word-art) text:

  • 4,024 images, each with a text segmentation map

For example, take the file a00001.jpg under image/ in the dataset:

[image: a00001.jpg]

The corresponding per-character segmentation mask a00001_mask.png under bpoly_label/:

[image: a00001_mask.png]

The JSON file a00001_anno.json:

json
{
    "0000": {
        "text": "WHY",
        "bbox": [
            300,
            264,
            799,
            264,
            799,
            521,
            300,
            521
        ],
        "char": {
            "00": {
                "text": "W",
                "bbox": [
                    304,
                    270,
                    519,
                    270,
                    519,
                    517,
                    304,
                    517
                ],
                "mask_value": 1
            },
            "01": {
                "text": "H",
                "bbox": [
                    514,
                    278,
                    650,
                    278,
                    650,
                    521,
                    514,
                    521
                ],
                "mask_value": 2
            },
            "02": {
                "text": "Y",
                "bbox": [
                    651,
                    272,
                    800,
                    272,
                    800,
                    521,
                    651,
                    521
                ],
                "mask_value": 3
            }
        }
    },
    "0001": {
        "text": "ME?",
        "bbox": [
            334,
            514,
            762,
            514,
            762,
            764,
            334,
            764
        ],
        "char": {
            "00": {
                "text": "M",
                "bbox": [
                    336,
                    513,
                    518,
                    513,
                    518,
                    761,
                    336,
                    761
                ],
                "mask_value": 4
            },
            "01": {
                "text": "E",
                "bbox": [
                    514,
                    514,
                    639,
                    514,
                    639,
                    761,
                    514,
                    761
                ],
                "mask_value": 5
            },
            "02": {
                "text": "?",
                "bbox": [
                    637,
                    517,
                    758,
                    517,
                    758,
                    762,
                    637,
                    762
                ],
                "mask_value": 6
            }
        }
    }
}
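Each character's `mask_value` is the pixel value assigned to it in the per-character mask, so the annotation can be inverted into a pixel-value-to-character lookup table. A minimal sketch over the structure above (the JSON is embedded directly here instead of being read from a00001_anno.json, and the bbox fields are omitted for brevity):

```python
import json

# Abbreviated from a00001_anno.json above.
anno = json.loads("""
{"0000": {"text": "WHY",
          "char": {"00": {"text": "W", "mask_value": 1},
                   "01": {"text": "H", "mask_value": 2},
                   "02": {"text": "Y", "mask_value": 3}}},
 "0001": {"text": "ME?",
          "char": {"00": {"text": "M", "mask_value": 4},
                   "01": {"text": "E", "mask_value": 5},
                   "02": {"text": "?", "mask_value": 6}}}}
""")

# Map each mask pixel value back to the character it labels.
value_to_char = {
    char["mask_value"]: char["text"]
    for word in anno.values()
    for char in word["char"].values()
}
print(value_to_char)  # {1: 'W', 2: 'H', 3: 'Y', 4: 'M', 5: 'E', 6: '?'}
```

With this table, comparing the mask image against a given pixel value (e.g. `mask == 4`) isolates the region of a single character.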

The segmentation map a00001_maskfg.png under semantic_label/:

[image: a00001_maskfg.png]

CTW 1500

  • [Paper-Detecting Curve Text in the Wild-New Dataset and New Solution-Zi-Zi's Journey](..//Paper-Detecting Curve Text in the Wild-New Dataset and New Solution/)

Synthetic Datasets

SynthText

  • [Paper-Synthetic Data for Text Localisation in Natural Images-Zi-Zi's Journey](..//Paper-Synthetic Data for Text Localisation in Natural Images/)

  • [Paper-重读-Synthetic Data for Text Localisation in Natural Images-Zi-Zi's Journey](..//Paper-重读-Synthetic Data for Text Localisation in Natural Images/)

VISD

  • [Paper-Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes-Zi-Zi's Journey](..//Paper-Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes/)

SynthText3D

  • [Paper-SynthText3D-Synthesizing Scene Text Images from 3D Virtual Worlds-Zi-Zi's Journey](..//Paper-SynthText3D-Synthesizing Scene Text Images from 3D Virtual Worlds/)

UnrealText

  • Plan-对论文的目前想法-Zi-Zi's Journey

  • [Paper-UnrealText-Synthesizing Realistic Scene Text Images from the Unreal World-Zi-Zi's Journey](..//Paper-UnrealText-Synthesizing Realistic Scene Text Images from the Unreal World/)