Introduction to the Chars74K dataset and how to read its handwritten character data

Last Update: 2016-09-12 Source: Internet

Author: User

Chars74K is a classic character recognition dataset that covers both English and Kannada characters. It contains about 74K images in total, hence the name Chars74K.

The English dataset is divided into three categories according to how the images were acquired:

1. Character images collected from natural scenes;

2. Handwritten character images;

3. Computer-generated character images synthesized from different fonts.

Only the English handwritten character dataset is covered here. It contains 52 letter classes (A-Z, a-z) and 10 digit classes (0-9), 62 classes in total, with 3,410 images handwritten by 55 volunteers.

This dataset ships in the file EnglishHnd.tgz (English Hand writing). The images are in the Img folder, stored in 62 subfolders named Sample001 through Sample062. Each subfolder holds 55 images, all in PNG format, each a three-channel RGB image with a resolution of 1200*900.
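As a rough illustration of this layout, the Python sketch below enumerates the image paths one would expect after extracting the archive. The folder pattern SampleNNN, the file pattern imgNNN-MMM.png, the root path, and the helper name expected_image_paths are all assumptions made for illustration, so check them against the extracted files before relying on them.

```python
from pathlib import PurePosixPath

NUM_CLASSES = 62        # 10 digits + 26 uppercase + 26 lowercase letters
IMAGES_PER_CLASS = 55   # one sample per volunteer


def expected_image_paths(root="English/Hnd/Img"):
    """Return the relative paths the handwritten set is expected to contain.

    The "SampleNNN" folder pattern and the "imgNNN-MMM.png" file pattern are
    assumptions; verify them against the extracted archive.
    """
    paths = []
    for class_id in range(1, NUM_CLASSES + 1):
        folder = PurePosixPath(root) / f"Sample{class_id:03d}"
        for sample_id in range(1, IMAGES_PER_CLASS + 1):
            name = f"img{class_id:03d}-{sample_id:03d}.png"
            paths.append(str(folder / name))
    return paths


paths = expected_image_paths()
print(len(paths))   # 62 * 55 = 3410
print(paths[0])     # English/Hnd/Img/Sample001/img001-001.png
```

Comparing this list with the files actually on disk is a cheap way to confirm that the archive was extracted completely.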

[Figure: sample images from the handwritten set]

The dataset authors provide a way to read the data in MATLAB: the Lists.tgz file contains a lists_var_size.mat file under the English/Hnd folder. This file only defines a struct holding the relevant metadata; the actual image data still has to be read in by your own code.

Loading the file produces a struct named list; its fields and their sizes are listed in the comments at the top of the MATLAB code below.

The dataset authors have split the training and test data into 30 different subsets, stored in the TRNind and TSTind fields of the struct, which hold image indices. Note that some training subsets contain fewer than 930 images; the trailing entries of those columns are filled with 0.
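The zero padding can be handled as in the following Python sketch. It is a minimal illustration on a toy matrix rather than on the real file: the function name split_indices and the list-of-rows representation are assumptions, and the shift to 0-based indices is only needed because Python, unlike MATLAB, counts from 0.

```python
def split_indices(index_matrix, split):
    """Return the 0-based image indices of one split.

    index_matrix is a list of rows that mirrors the MATLAB index matrix
    (rows = slots, columns = splits); split is the 1-based column number.
    Entries equal to 0 are padding and are dropped; the remaining 1-based
    MATLAB indices are shifted to 0-based Python indices.
    """
    column = [row[split - 1] for row in index_matrix]
    return [idx - 1 for idx in column if idx != 0]


# Toy matrix with 4 slots and 2 splits; the second split is one image short.
toy_trnind = [
    [3, 7],
    [1, 2],
    [5, 9],
    [8, 0],
]
print(split_indices(toy_trnind, 1))  # [2, 0, 4, 7]
print(split_indices(toy_trnind, 2))  # [6, 1, 8]
```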

The MATLAB code below uses the .mat file provided by the authors to read the training data, the test data, and the labels (the ground-truth classes) of one subset. The image data is read into cell arrays and the label data into uint16 arrays. Note that label 1 represents the digit 0, label 2 represents the digit 1, and so on.

    % Read images from the Chars74K English Hnd dataset.
    clc; clear;

    % list is a struct which contains:
    %   ALLlabels:   [3410*1 uint16]
    %   ALLnames:    [3410*24 char]
    %   classlabels: [62*1 double]
    %   classnames:  [62*13 char]
    %   NUMclasses:  62
    %   TSTind:      [1674*30 uint16]
    %   VALind:      []
    %   TXNind:      [930*30 uint16]
    %   TRNind:      [930*30 uint16]
    % Field names are case-sensitive; run fieldnames(list) to confirm them.
    load('lists_var_size.mat');

    % Extract the training and test datasets.
    % There are 30 splits in the dataset (training & test);
    % we select the Nth training and test split.
    N = 14;

    % Separate the training and test indices of split N.
    training_index = list.TRNind(:, N);
    test_index = list.TSTind(:, N);

    % Some training splits have elements equal to 0, which we must ignore.
    locate_zero = find(training_index == 0);
    training_index(locate_zero) = [];

    % The class labels for the training set.
    training_labels = list.ALLlabels(training_index);
    % The ground-truth labels for the test set.
    test_true_labels = list.ALLlabels(test_index);

    % Read the image data into cell arrays.
    training_imgs = cell(1, length(training_index));
    for ii = 1:length(training_index)
        img = imread(['../../../English/Hnd/', ...
            list.ALLnames(training_index(ii), :), '.png']);
        training_imgs{ii} = img;
        % If we want to see the image:
        % image(img);
        % pause();
    end

    test_imgs = cell(1, length(test_index));
    for ii = 1:length(test_index)
        img = imread(['../../../English/Hnd/', ...
            list.ALLnames(test_index(ii), :), '.png']);
        test_imgs{ii} = img;
        % If we want to see the image:
        % image(img);
        % pause();
    end
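The label convention mentioned before the MATLAB listing (label 1 is the digit 0, label 2 is the digit 1, and so on) can be sketched in Python as follows. The text only spells out the digit labels; the order assumed here for the remaining classes (uppercase letters, then lowercase letters) and the names CLASS_CHARS, label_to_char, and char_to_label are assumptions for illustration.

```python
import string

# Assumed class order: digits, then uppercase, then lowercase letters.
CLASS_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Map a 1-based class label (1..62) to the character it stands for."""
    if not 1 <= label <= len(CLASS_CHARS):
        raise ValueError(f"label out of range: {label}")
    return CLASS_CHARS[label - 1]


def char_to_label(char):
    """Inverse mapping: a single character to its 1-based class label."""
    index = CLASS_CHARS.find(char)
    if len(char) != 1 or index < 0:
        raise ValueError(f"not a dataset character: {char!r}")
    return index + 1


print(label_to_char(1), label_to_char(2))  # 0 1
print(char_to_label("9"))                  # 10
```

Keeping the mapping in one string makes the off-by-one shift between labels and characters explicit in a single place.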

A Python/OpenCV version will follow in a later update; anyone interested is welcome to get in touch and work on it together.

If you spot any mistakes or anything inappropriate, please let me know.

Reference Links:

http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

References:

Teófilo Emídio de Campos, Bodla Rakesh Babu, Manik Varma. Character Recognition in Natural Images. In: VISAPP 2009, Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, February 2009: 273-280.

Note: This article was first posted on the July Online forum as homework for a computer vision open class.
