Introduction to the Chars74K dataset and how to read its handwritten character subset


The Chars74K dataset is a classic character recognition dataset that covers both English and Kannada characters. It contains about 74K images in total, hence the name Chars74K.

The English portion is divided into three subsets according to how the images were obtained:

1. Character images extracted from natural scenes;

2. Handwritten character images;

3. Character images synthesized from different computer fonts.

Only the English handwritten subset is covered here. It contains 52 letter classes (A-Z, a-z) and 10 digit classes (0-9), 62 classes in total, with 3410 images handwritten by 55 volunteers (55 images per class).

This subset ships in the file EnglishHnd.tgz (English handwriting). The images are in the Img folder, stored in 62 subfolders named Sample001 through Sample062. Each subfolder holds 55 images, all three-channel RGB PNG files with a resolution of 1200*900.
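The folder layout described above can be enumerated programmatically. The following Python sketch builds the expected relative paths of all images. The file-name pattern img<class>-<writer>.png and the root folder English/Hnd are assumptions (the pattern matches the 24-character names stored in the authors' list file), so check them against the extracted archive before relying on them.

```python
NUM_CLASSES = 62        # 10 digits + 26 uppercase + 26 lowercase letters
IMAGES_PER_CLASS = 55   # one image per volunteer


def expected_image_paths(root="English/Hnd"):
    """Return the relative paths of all handwritten images, in class order."""
    paths = []
    for cls in range(1, NUM_CLASSES + 1):
        for writer in range(1, IMAGES_PER_CLASS + 1):
            # Assumed naming pattern: Img/Sample001/img001-001.png
            paths.append(
                f"{root}/Img/Sample{cls:03d}/img{cls:03d}-{writer:03d}.png"
            )
    return paths


paths = expected_image_paths()
print(len(paths))  # 3410
print(paths[0])    # English/Hnd/Img/Sample001/img001-001.png
```

The count of 3410 is simply 62 classes times 55 writers, which agrees with the size of the dataset stated above.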

The dataset authors provide support for reading the data in MATLAB: the archive Lists.tgz contains a file named lists_var_size.mat under the English/Hnd folder. However, this file only defines a struct with the relevant metadata; the actual image data still has to be read in by your own code.

Loading the file yields a struct named list; its fields are listed in the comments at the top of the code below.

The dataset authors have prepared 30 different train/test splits, stored column by column in the TRNind and TSTind fields of the struct. These fields hold image indices. Note that some training splits contain fewer than 930 images, in which case the tail of the column is padded with 0.
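The zero padding matters whenever the indices are used outside MATLAB. The Python sketch below shows one way to clean a single split column: it drops the padding and converts MATLAB's 1-based indices to 0-based ones. The function name and the toy column are hypothetical and only illustrate the idea.

```python
def split_indices(column):
    """Drop zero padding from one split column and return 0-based indices."""
    return [int(i) - 1 for i in column if int(i) != 0]


# Hypothetical toy column: four valid 1-based indices followed by padding.
toy_column = [12, 7, 3410, 1, 0, 0]
print(split_indices(toy_column))  # [11, 6, 3409, 0]
```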

The MATLAB code below uses the mat file provided by the authors to read the training data, the test data, and the labels (the true classes) of one split. The images are read into cell arrays and the labels into uint16 arrays. Note that label 1 stands for the digit 0, label 2 for the digit 1, and so on.
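The offset between labels and characters is easy to get wrong, so here is a small Python sketch of the mapping. Only the digit part (label 1 is 0, label 2 is 1, and so on) comes from the text above; the order of the remaining classes (uppercase letters, then lowercase letters) is an assumption that should be checked against the sample folders.

```python
import string

# Assumed class order: digits 0-9, then A-Z, then a-z (62 classes).
CLASS_CHARS = string.digits + string.ascii_uppercase + string.ascii_lowercase


def label_to_char(label):
    """Map a 1-based class label (1..62) to the character it represents."""
    if not 1 <= label <= len(CLASS_CHARS):
        raise ValueError(f"label out of range: {label}")
    return CLASS_CHARS[label - 1]


print(label_to_char(1), label_to_char(2), label_to_char(11), label_to_char(62))
# 0 1 A z
```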

%% Read images from the Chars74K English Hnd dataset.
clc, clear;
% list is a struct which contains (field names are case-sensitive):
%   ALLlabels:   [3410*1 uint16]
%   ALLnames:    [3410*24 char]
%   classlabels: [62*1 double]
%   classnames:  [62*13 char]
%   NUMclasses:  62
%   TSTind:      [1674*30 uint16]
%   VALind:      []
%   TXNind:      [930*30 uint16]
%   TRNind:      [930*30 uint16]
load('lists_var_size.mat');

%% Extract training and test datasets
%{
There are 30 splits in the dataset (training & test).
We select the Nth training and test split.
%}
N = 14;
% Separate the training and test indices of the chosen split
training_index = list.TRNind(:, N);
test_index = list.TSTind(:, N);
% Some training splits are padded with elements equal to 0,
% which must be ignored.
locate_zero = find(training_index == 0);
training_index(locate_zero) = [];
% The class labels for the training set
training_labels = list.ALLlabels(training_index);
% The ground truth labels for the test set
test_true_labels = list.ALLlabels(test_index);

% Read image data
for ii = 1:length(training_index)
    img = imread(['../../../English/Hnd/', ...
        list.ALLnames(training_index(ii), :), '.png']);
    training_imgs{ii} = img;
    % If we want to see the image:
    % image(img);
    % pause();
end
for ii = 1:length(test_index)
    img = imread(['../../../English/Hnd/', ...
        list.ALLnames(test_index(ii), :), '.png']);
    test_imgs{ii} = img;
    % If we want to see the image:
    % image(img);
    % pause();
end
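For readers who prefer Python, the same steps might look roughly like the sketch below. This is not the author's code. It assumes SciPy and OpenCV are installed, that the struct in the mat file is named list, and that its fields are spelled ALLnames, ALLlabels, TRNind, and TSTind; load_split and image_path are hypothetical helper names. Note that cv2.imread returns images in BGR channel order and returns None for a path it cannot read.

```python
import os


def image_path(data_root, name):
    """Build the PNG path for one entry of the name list."""
    # Names may carry trailing padding from the MATLAB char matrix.
    return os.path.join(data_root, str(name).strip() + ".png")


def load_split(mat_path, data_root, n):
    """Load images and labels of the n-th (1-based) train/test split."""
    # Third-party imports are local so this module imports without them.
    import cv2                     # pip install opencv-python
    from scipy.io import loadmat   # pip install scipy

    # squeeze_me/struct_as_record=False expose struct fields as attributes.
    lst = loadmat(mat_path, squeeze_me=True, struct_as_record=False)["list"]

    def read(index_column):
        # Drop zero padding, then convert 1-based to 0-based indices.
        idx = [int(i) - 1 for i in index_column if int(i) != 0]
        images = [cv2.imread(image_path(data_root, lst.ALLnames[i]))
                  for i in idx]
        labels = [int(lst.ALLlabels[i]) for i in idx]
        return images, labels

    return read(lst.TRNind[:, n - 1]), read(lst.TSTind[:, n - 1])
```

A call such as load_split("lists_var_size.mat", "../../../English/Hnd", 14) would then correspond to N = 14 in the MATLAB code, returning a (images, labels) pair for the training set and another for the test set.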

  

A Python/OpenCV version will be added in a later update; anyone interested in working on it together or discussing it is welcome.

If you find any mistakes or anything inappropriate, please let me know.

Reference Links:

http://www.ee.surrey.ac.uk/CVSSP/demos/chars74k/

References:

Teófilo Emídio de Campos, Bodla Rakesh Babu, Manik Varma. Character Recognition in Natural Images. In: VISAPP 2009, Proceedings of the Fourth International Conference on Computer Vision Theory and Applications, Lisboa, Portugal, February 2009: 273-280.

Note: This article was originally posted on the July Online forum as homework for an open computer vision course.
