Preface

A collection of text-related datasets. Thanks to my senior labmate for sharing them.

Main Text

Real Datasets

CTW Dataset (Chinese Text in the Wild)

A dataset of Chinese characters in natural scenes.

Resources:

Homepage: CTW Dataset
Paper: A Large Chinese Text Dataset in the Wild (ict.ac.cn)
Code: ctw-baseline/detection at master · yuantailing/ctw-baseline (github.com)
Download: ctw-public - OneDrive (live.com)
Tutorial: 1-basics (ctwdataset.github.io)

Contents:

32,285 high-resolution images
1,018,402 character instances
3,850 character categories
6 attributes

Dataset structure:

Training + validation set: images-trainval
Test set: images-test
Pretrained models: trained-models
- alexnet
- inception
- overfeat
- resnet
- vgg
- yolo
Annotation files: ctw-annotations

For example, for the file 0000172.jpg in the training set:

0000172.jpg

{"annotations": [[
    {
        "adjusted_bbox": [140.26028096262758, 897.1957001682758, 22.167573140645146, 38.36424196832945], 
        "attributes": ["distorted", "raised"], 
        "is_chinese": true, 
        "polygon": [[140.26028096262758, 896.7550603352049], [162.42785410327272, 898.0769798344178], [162.42785410327272, 935.7929346470926], [140.26028096262758, 935.0939571156308]], 
        "text": "\u660e"
    }, 
    {
        "adjusted_bbox": [162.42785410327272, 898.5416545674744, 23.376713493771263, 37.74268246537315], 
        "attributes": ["distorted", "raised"],
        "is_chinese": true,
        "polygon": [[162.42785410327272, 898.0769798344178], [185.80456759704398, 899.4710040335876], [185.80456759704398, 936.5300382257251], [162.42785410327272, 935.7929346470926]],
        "text": "\u6d77"
    },
    ……
    "image_id": "0000172", "width": 2048}

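The corresponding annotation records the following for each character:

adjusted_bbox: the adjusted bounding box
attributes: character attributes
- distorted: distorted
- raised: raised (embossed)
- occluded: occluded
- bgcomplex: complex background
- handwritten: handwritten
- wordart: word art
is_chinese: whether the character is Chinese
polygon: the actual bounding polygon
text: the character, stored as a Unicode escape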
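
As a minimal sketch of reading these fields, the snippet below parses a trimmed copy of the record above with Python's json module. It assumes that the "annotations" value is a list of text lines, each a list of character dicts, and that adjusted_bbox is ordered [x, y, width, height]. That ordering is inferred from the sample numbers, not stated above. The helper name iter_characters and the trimmed record are illustrative only.

```python
import json

# Trimmed, illustrative copy of the record above (one character kept,
# coordinates rounded).
record_text = r'''
{"annotations": [[
    {"adjusted_bbox": [140.26, 897.20, 22.17, 38.36],
     "attributes": ["distorted", "raised"],
     "is_chinese": true,
     "polygon": [[140.26, 896.76], [162.43, 898.08], [162.43, 935.79], [140.26, 935.09]],
     "text": "\u660e"}
]], "image_id": "0000172", "width": 2048}
'''


def iter_characters(record):
    """Yield every character dict in a record (hypothetical helper)."""
    for line in record["annotations"]:  # assumed: one inner list per text line
        for char in line:
            yield char


record = json.loads(record_text)
chars = list(iter_characters(record))
for char in chars:
    # Assumption: adjusted_bbox is [x, y, width, height].
    x, y, w, h = char["adjusted_bbox"]
    # json.loads has already decoded the \uXXXX escape into the character.
    print(char["text"], char["attributes"], (x, y, x + w, y + h))
```

Filtering on the attributes list (for example, skipping occluded characters) works the same way once the characters are flattened like this.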

SVT (Street View Text Dataset)

The Street View Text (SVT) dataset was harvested from Google Street View. Image text in this data exhibits high variability and often has low resolution. In dealing with outdoor street level imagery, we note two characteristics.

(1) Image text often comes from business signage and

(2) business names are easily available through geographic business searches.

These factors make the SVT set uniquely suited for word spotting in the wild: given a street view image, the goal is to identify words from nearby businesses.

Resources:

Paper (appears to be unavailable): SVT Dataset | Papers With Code
Download: The Street View Text Dataset 街景文字数据集_数据集-阿里云天池 (aliyun.com)

For example, for the file 17_18.jpg in the dataset:

17_18.jpg

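The corresponding ground truth has one bounding box per word, and also includes information such as the address and the lexicon of nearby words: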

<image>
    <imageName>img/17_18.jpg</imageName>
    <address>420 South 1st Street San Jose CA 95112</address>
    <lex>SOUTH,FIRST,BILLIARDS,CLUB,AND,LOUNGE,AGENDA,RESTAURANT,BAR,RAMADA,LIMITED,SAN,JOSE,WET,NIGHTCLUB,MOTIF,ANNO,DOMINI,EULIPIA,DOWNTOWN,YOGA,SHALA,WHIPSAW,INC,ZOE,SAINTE,CLAIRE,HOTEL,SCORES,SPORTS,GRILL,WORKS,SPY,MUSEUM,QUILTS,TEXTILES,MIAMI,BEACH,STAGE,COMPANY,CACTUS,ANGELS,DAI,THANH,SUPERMARKET</lex>
    <Resolution x="1024" y="768"/>
    <taggedRectangles>
        <taggedRectangle height="41" width="152" x="480" y="403">
            <tag>BILLIARDS</tag>
        </taggedRectangle>
        <taggedRectangle height="33" width="78" x="407" y="410">
            <tag>FIRST</tag>
        </taggedRectangle>
        <taggedRectangle height="30" width="85" x="322" y="416">
            <tag>SOUTH</tag>
        </taggedRectangle>
    </taggedRectangles>
</image>
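
A minimal sketch of reading one such entry with Python's xml.etree.ElementTree follows. It uses a shortened copy of the entry with matching open and close tags, since ElementTree rejects mismatched ones. The function name parse_image and the (left, top, right, bottom, word) box layout are choices made for this example, not part of the dataset.

```python
import xml.etree.ElementTree as ET

# Shortened, illustrative copy of the entry above.
xml_text = """
<image>
    <imageName>img/17_18.jpg</imageName>
    <address>420 South 1st Street San Jose CA 95112</address>
    <lex>SOUTH,FIRST,BILLIARDS,CLUB</lex>
    <Resolution x="1024" y="768"/>
    <taggedRectangles>
        <taggedRectangle height="41" width="152" x="480" y="403">
            <tag>BILLIARDS</tag>
        </taggedRectangle>
        <taggedRectangle height="33" width="78" x="407" y="410">
            <tag>FIRST</tag>
        </taggedRectangle>
    </taggedRectangles>
</image>
"""


def parse_image(node):
    """Return the name, lexicon, and word boxes of one <image> element."""
    boxes = []
    for rect in node.iter("taggedRectangle"):
        x, y = int(rect.get("x")), int(rect.get("y"))
        w, h = int(rect.get("width")), int(rect.get("height"))
        # Store each box as (left, top, right, bottom, word).
        boxes.append((x, y, x + w, y + h, rect.findtext("tag")))
    return {
        "name": node.findtext("imageName"),
        "lexicon": node.findtext("lex").split(","),
        "boxes": boxes,
    }


entry = parse_image(ET.fromstring(xml_text))
print(entry["name"], entry["boxes"])
```

For word spotting, the lexicon list is the per-image candidate set against which each recognized word would be matched.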

ICDAR

Resources:

Download: Participant - Robust Reading Competition (uab.es)

Downloads - Focused Scene Text

Task 2.1: Text Localization (2013 edition)

Training set: 229 images
Test set: 233 images

For example, for the file img_1.jpg in the training set:

img_1.jpg

The corresponding ground truth, gt_img_1.txt:

38, 43, 920, 215, "Tiredness"
275, 264, 665, 450, "kills"
0, 699, 77, 830, "A"
128, 705, 483, 839, "short"
542, 710, 938, 841, "break"
87, 884, 457, 1021, "could"
517, 919, 831, 1024, "save"
166, 1095, 468, 1231, "your"
530, 1069, 743, 1206, "life"

Dataset visualization code:

import os

import cv2
import matplotlib.pyplot as plt

index = 1

# Replace XXX with the local dataset root.
image_dir = r'XXX/ICDAR 2013/Challenge2_Test_Task12_Images/'
label_dir = r'XXX/ICDAR 2013/Challenge2_Test_Task1_GT/'

image_path = os.path.join(image_dir, 'img_' + str(index) + '.jpg')
label_path = os.path.join(label_dir, 'gt_img_' + str(index) + '.txt')

image_origin = cv2.imread(image_path)
if image_origin is None:  # cv2.imread returns None instead of raising
    raise FileNotFoundError(image_path)
image = image_origin.copy()
height, width, _ = image.shape
radius = max(1, int(min(height, width) / 150))
thickness = max(1, int(min(height, width) / 500))

# utf-8-sig drops a leading byte-order mark if the file has one.
with open(label_path, 'r', encoding='utf-8-sig') as label_file:
    annotations = [line.strip() for line in label_file if line.strip()]

for annotation in annotations:
    # Each line is: left, top, right, bottom, "transcription".
    # Split on the first four commas only, so commas inside the text survive.
    *coords, transcription = annotation.split(',', 4)
    left, top, right, bottom = map(int, coords)
    transcription = transcription.strip()
    if len(transcription) >= 2 and transcription[0] == transcription[-1] == '"':
        transcription = transcription[1:-1]  # drop the surrounding quotes
    cv2.rectangle(image, (left, top), (right, bottom), (255, 0, 0), 2)
    for p in ((left, top), (right, bottom)):
        cv2.circle(image, p, radius, (0, 255, 255), -1)
    cv2.putText(image, transcription, (left, top - radius), cv2.FONT_HERSHEY_SIMPLEX,
                min(height, width) / 1000, (0, 255, 0), thickness)
    
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 9))
axes = axes.flatten()

axes[0].imshow(cv2.cvtColor(image_origin, cv2.COLOR_BGR2RGB))
axes[0].axis('off')
axes[0].set_title('Origin')

axes[1].imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
axes[1].axis('off')
axes[1].set_title('Annotation')

plt.tight_layout()
plt.show()

Task 2.2: Text Segmentation (2013 edition)

The dataset is the same as in 2.1, except that the ground truth is a segmentation mask, gt_img_1.png:

gt_img_1.png

Task 2.3: Word Recognition (2013 edition)

Training set: 848 word images (Challenge2_Training_Task3_Images_GT)
Test set: 1095 word images (Challenge2_Test_Task3_Images and Challenge2_Test_Task3_GT.txt)

These images are all cropped from the preceding dataset.

For example, for the file word_1.png in the training set:

word_1.png

The corresponding line in the ground truth file gt.txt:

word_1.png, "PROPER"
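
A line in this format can be split into a file name and a transcription. The following is a minimal sketch, assuming every line has the form file name, comma, quoted text; the function name parse_word_gt is made up for this example.

```python
def parse_word_gt(line):
    """Split one 'file, "text"' line into (file_name, transcription)."""
    # Split on the first comma only, so commas inside the text are kept.
    file_name, text = line.strip().split(",", 1)
    text = text.strip()
    # Drop the surrounding double quotes if both are present.
    if len(text) >= 2 and text[0] == text[-1] == '"':
        text = text[1:-1]
    return file_name.strip(), text


sample = 'word_1.png, "PROPER"'
parsed = parse_word_gt(sample)
print(parsed)
```

Splitting on the first comma only matters because a transcription may itself contain commas.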

Task 2.4: End to End (2015 edition)

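The goal seems to be for the network to recognize words, with a vocabulary provided.

Training set: 229 images
Test set: 233 images

The image img_1.jpg, with its ground truth gt_img_1.txt and vocabulary voc_img_1.txt: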

img_1.jpg、gt_img_1.txt、voc_img_1.txt

Downloads - Incidental Scene Text

Task 4.1: Text Localization (2015 edition)

The image quality here is really tough, orz.

Training set: 1000 images
Test set: 500 images

For example, for the file img_2.jpg in the test set:

img_2.jpg

The corresponding ground truth, gt_img_2.txt:

790,302,903,304,902,335,790,335,JOINT
822,288,872,286,871,298,823,300,yourself
641,138,657,139,657,151,641,151,###
669,139,693,140,693,154,669,153,154
700,141,723,142,723,155,701,154,197
637,101,721,106,722,115,637,110,###
668,157,693,158,693,170,668,170,727
636,155,661,156,662,169,636,168,198
660,82,700,85,700,99,660,96,20029
925,252,973,254,973,262,925,262,###
789,284,818,284,818,297,789,297,Free
875,286,902,289,903,298,875,298,from
791,337,863,337,863,364,791,364,PAIN
794,445,818,445,818,473,794,473,###
922,440,962,442,963,462,922,463,###
924,476,967,476,968,489,924,491,###
924,505,962,506,965,518,923,519,###
847,524,887,524,887,555,847,555,###
791,474,822,474,822,500,791,500,###
780,582,910,576,909,583,780,588,###
854,456,902,455,902,465,854,467,###
854,467,903,467,903,480,854,480,###

Dataset visualization code:

import os

import cv2
import matplotlib.pyplot as plt
import numpy as np

index = 463

# Replace XXX with the local dataset root.
image_dir = r'XXX/ICDAR_2015/test_img/'
label_dir = r'XXX/ICDAR_2015/test_gt/'

image_path = os.path.join(image_dir, 'img_' + str(index) + '.jpg')
label_path = os.path.join(label_dir, 'gt_img_' + str(index) + '.txt')

image_origin = cv2.imread(image_path)
if image_origin is None:  # cv2.imread returns None instead of raising
    raise FileNotFoundError(image_path)
image = image_origin.copy()
height, width, _ = image.shape
radius = max(1, int(min(height, width) / 150))
thickness = max(1, int(min(height, width) / 500))

# utf-8-sig drops a leading byte-order mark, which would otherwise break int().
with open(label_path, 'r', encoding='utf-8-sig') as label_file:
    annotations = [line.strip() for line in label_file if line.strip()]

for annotation in annotations:
    # Each line is: x1,y1,x2,y2,x3,y3,x4,y4,transcription.
    # Split on the first eight commas only, so commas inside the text survive.
    *coords, transcription = annotation.split(',', 8)
    # cv2.polylines expects int32 points; the default int64 fails on some platforms.
    points = np.array(list(map(int, coords)), dtype=np.int32).reshape(-1, 2)
    cv2.polylines(image, [points], isClosed=True, color=(255, 0, 0), thickness=2)
    for x, y in points.tolist():
        cv2.circle(image, (x, y), radius, (0, 255, 255), -1)

    x0, y0 = points[0].tolist()
    cv2.putText(image, transcription, (x0, y0 - radius), cv2.FONT_HERSHEY_SIMPLEX,
                min(height, width) / 1000, (0, 255, 0), thickness)
    
fig, axes = plt.subplots(nrows=1, ncols=2, figsize=(16, 9))
axes = axes.flatten()

axes[0].imshow(cv2.cvtColor(image_origin, cv2.COLOR_BGR2RGB))
axes[0].axis('off')
axes[0].set_title('Origin')

axes[1].imshow(cv2.cvtColor(image, cv2.COLOR_BGR2RGB))
axes[1].axis('off')
axes[1].set_title('Annotation')

plt.tight_layout()
plt.show()

1111,459,1266,495,1259,586,1104,550,FESTIVE
1100,523,1261,603,1244,719,1083,639,SALE

Task 4.2: Text Segmentation (N/A)

Not available.

Task 4.3: Word Recognition (2015 edition)

Word images cropped from the preceding dataset.

Training set: 4468 cropped word images
Test set: 2077 cropped word images

For example, for the file word_10.png in the test set:

word_10.png

The corresponding line in Challenge4_Test_Task3_GT.txt:

word_10.png, "PAIN"

Task 4.4: End to End (2015 edition)

Hmm, this feels like a combination of the previous tasks, with a vocabulary added.

Training set: 1000 images
Test set: 500 images

ICDAR2017 Competition on Reading Chinese Text in the Wild (RCTW-17)

The images in this one are quite a mixed bag...

For example, for the file image_0.jpg in the training set:

image_0.jpg

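The corresponding ground truth, image_0.txt:

Each line holds the bounding box, a flag for whether the text is recognizable, and the corresponding text: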

390,902,1856,902,1856,1225,390,1225,0,"金氏眼镜"
1875,1170,2149,1170,2149,1245,1875,1245,0,"创于 1989"
2054,1277,2190,1277,2190,1323,2054,1323,0,"城建店"
768,1648,987,1648,987,1714,768,1714,0,"金氏眼"
897,2152,988,2152,988,2182,897,2182,0,"金氏眼镜"
1457,2228,1575,2228,1575,2259,1457,2259,0,"金氏眼镜"
1858,2218,1966,2218,1966,2250,1858,2250,0,"金氏眼镜"
231,1853,308,1843,309,1885,230,1899,1,"谢#惠顾"
125,2270,180,2270,180,2288,125,2288,1,"###"
106,2297,160,2297,160,2316,106,2316,1,"###"
22,2363,82,2363,82,2383,22,2383,1,"###"
524,2511,837,2511,837,2554,524,2554,1,"###"
455,2456,921,2437,920,2478,455,2501,0,"欢迎光临"
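
A minimal sketch of parsing these lines follows. It assumes eight integer coordinates, one integer flag, and a double-quoted text per line. The reading of the flag (0 for readable text, 1 for entries containing "#") is inferred from the sample above rather than taken from the dataset documentation, and parse_rctw_line is a made-up name.

```python
def parse_rctw_line(line):
    """Parse one ground truth line into (polygon, flag, text)."""
    # Eight coordinates, one flag, then the text. Split on the first nine
    # commas only, so commas inside the text are kept.
    parts = line.strip().split(",", 9)
    coords = [int(v) for v in parts[:8]]
    polygon = [(coords[i], coords[i + 1]) for i in range(0, 8, 2)]
    flag = int(parts[8])
    text = parts[9].strip()
    # Drop the surrounding double quotes if both are present.
    if len(text) >= 2 and text[0] == text[-1] == '"':
        text = text[1:-1]
    return polygon, flag, text


lines = [
    '390,902,1856,902,1856,1225,390,1225,0,"金氏眼镜"',
    '125,2270,180,2270,180,2288,125,2288,1,"###"',
]
parsed = [parse_rctw_line(line) for line in lines]
# Keep only entries whose flag is 0 (the readable ones in the sample above).
readable = [text for _, flag, text in parsed if flag == 0]
print(readable)
```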

Total-Text

Paper-TotalText-Zi-Zi’s Journey

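Resources: Total-Text Dataset | Papers With Code

A curved-text dataset:

Training set: 1255 images
Test set: 300 images

Mostly English text, with a small amount of Chinese text.

For example, for the file img11.jpg in the training set: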

img11.jpg

The corresponding Character_Level_Mask ground truth, img11.jpg:

img11.jpg

The corresponding Text_Region_Mask ground truth, img11.png:

img11.png

The dataset also ships poly_gt_img11.mat and rect_gt_img11.mat in MAT format, which presumably store the annotation shapes.

TextSeg

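Resources: Rethinking Text Segmentation: A Novel Dataset and A Text-Specific Refinement Approach

A text segmentation dataset of stylized (word-art) text:

4024 images, each with a text segmentation map

For example, for the file a00001.jpg under image/ in the dataset: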

a00001.jpg

The corresponding per-character segmentation mask under bpoly_label/ is a00001_mask.png.

The JSON file a00001_anno.json:

{
    "0000": {
        "text": "WHY",
        "bbox": [
            300,
            264,
            799,
            264,
            799,
            521,
            300,
            521
        ],
        "char": {
            "00": {
                "text": "W",
                "bbox": [
                    304,
                    270,
                    519,
                    270,
                    519,
                    517,
                    304,
                    517
                ],
                "mask_value": 1
            },
            "01": {
                "text": "H",
                "bbox": [
                    514,
                    278,
                    650,
                    278,
                    650,
                    521,
                    514,
                    521
                ],
                "mask_value": 2
            },
            "02": {
                "text": "Y",
                "bbox": [
                    651,
                    272,
                    800,
                    272,
                    800,
                    521,
                    651,
                    521
                ],
                "mask_value": 3
            }
        }
    },
    "0001": {
        "text": "ME?",
        "bbox": [
            334,
            514,
            762,
            514,
            762,
            764,
            334,
            764
        ],
        "char": {
            "00": {
                "text": "M",
                "bbox": [
                    336,
                    513,
                    518,
                    513,
                    518,
                    761,
                    336,
                    761
                ],
                "mask_value": 4
            },
            "01": {
                "text": "E",
                "bbox": [
                    514,
                    514,
                    639,
                    514,
                    639,
                    761,
                    514,
                    761
                ],
                "mask_value": 5
            },
            "02": {
                "text": "?",
                "bbox": [
                    637,
                    517,
                    758,
                    517,
                    758,
                    762,
                    637,
                    762
                ],
                "mask_value": 6
            }
        }
    }
}
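
A minimal sketch of walking this structure with Python's json module follows, using a shortened copy of the annotation (first word only). It assumes each bbox is a flat list of four (x, y) corners, and that mask_value is the pixel value of that character in a00001_mask.png. The second point is a guess from the field name, not something stated above. The helper name to_points is illustrative.

```python
import json

# Shortened, illustrative copy of the annotation above (first word only).
anno_text = """
{"0000": {"text": "WHY",
          "bbox": [300, 264, 799, 264, 799, 521, 300, 521],
          "char": {
              "00": {"text": "W", "bbox": [304, 270, 519, 270, 519, 517, 304, 517], "mask_value": 1},
              "01": {"text": "H", "bbox": [514, 278, 650, 278, 650, 521, 514, 521], "mask_value": 2},
              "02": {"text": "Y", "bbox": [651, 272, 800, 272, 800, 521, 651, 521], "mask_value": 3}}}}
"""


def to_points(bbox):
    """Turn a flat [x1, y1, ..., x4, y4] list into four (x, y) corners."""
    return [(bbox[i], bbox[i + 1]) for i in range(0, len(bbox), 2)]


anno = json.loads(anno_text)
mask_values = {}
for word_id in sorted(anno):
    word = anno[word_id]
    print(word_id, word["text"], to_points(word["bbox"]))
    for char_id in sorted(word["char"]):
        char = word["char"][char_id]
        # Assumption: mask_value is this character's pixel value in the mask image.
        mask_values[char["mask_value"]] = char["text"]

print(mask_values)
```

If the assumption about mask_value holds, a lookup table like mask_values is what would tie each region of the mask image back to its character.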

The segmentation map under semantic_label/, a00001_maskfg.png:

a00001_maskfg.png

CTW 1500

[Paper-Detecting Curve Text in the Wild-New Dataset and New Solution-Zi-Zi’s Journey](…//Paper-Detecting Curve Text in the Wild-New Dataset and New Solution/)

Synthetic Datasets

SynthText

[Paper-Synthetic Data for Text Localisation in Natural Images-Zi-Zi’s Journey](…//Paper-Synthetic Data for Text Localisation in Natural Images/)
[Paper-重读-Synthetic Data for Text Localisation in Natural Images-Zi-Zi’s Journey](…//Paper-重读-Synthetic Data for Text Localisation in Natural Images/)

VISD

[Paper-Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes-Zi-Zi’s Journey](…//Paper-Verisimilar Image Synthesis for Accurate Detection and Recognition of Texts in Scenes/)

SynthText3D

[Paper-SynthText3D-Synthesizing Scene Text Images from 3D Virtual Worlds-Zi-Zi’s Journey](…//Paper-SynthText3D-Synthesizing Scene Text Images from 3D Virtual Worlds/)

UnrealText

Plan-对论文的目前想法-Zi-Zi’s Journey
[Paper-UnrealText-Synthesizing Realistic Scene Text Images from the Unreal World-Zi-Zi’s Journey](…//Paper-UnrealText-Synthesizing Realistic Scene Text Images from the Unreal World/)