Basic methods:
(1) Use a CNN to process the images.
(2) Weight the extracted features and feed the result to the RNN as input.
Figure 1. The four original images at the bottom of the model structure diagram are the inputs to the CNN feature extractor. After passing through the same CNN, four feature maps f are obtained; these are combined with the attention weights a_t into u_t. u_t is a fixed-length feature vector and serves as the input to the RNN.
CNN:
f = \{f_{i,j,c}\} is the output of the CNN extractor, where i, j are the spatial position indices and c is the channel index.
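As a shape check, the feature map can be sketched with NumPy (the CNN itself is assumed away here; only the indexing f_{i,j,c} matters, and the sizes H, W, C are made up):

```python
import numpy as np

# Hypothetical output of a CNN feature extractor: a feature map f indexed
# by spatial position (i, j) and channel c, i.e. shape (H, W, C).
H, W, C = 8, 8, 64          # assumed spatial size and channel count
rng = np.random.default_rng(0)
f = rng.standard_normal((H, W, C))

# f[i, j, c] is one activation: position (i, j), channel c.
print(f.shape)  # (8, 8, 64)
```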
RNN:
Problem: convert the feature map f into a single text string.
Method: use an RNN.
Input: the CNN output features f are combined, via attention weighting, into a fixed-length feature vector u_t.
Variable representation:
(1) s_t
The hidden state of the RNN at time step t.
(2) u_t
The input to the RNN at time step t.
(3) a_t = \{a_{t,i,j}\}
The attention weights, one per spatial position (i, j) at time step t.
(4) u_{t,c} = \sum_{i,j} a_{t,i,j} f_{i,j,c}
At time step t, the input for channel c is computed as this attention-weighted sum of the feature map over all spatial positions.
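The weighted sum in (4) can be written as a single einsum over the spatial indices. This is only a sketch with made-up sizes; the softmax normalization of a_t is an assumption about how the weights are produced:

```python
import numpy as np

rng = np.random.default_rng(1)
H, W, C = 8, 8, 64
f = rng.standard_normal((H, W, C))        # CNN feature map f_{i,j,c}

# Attention weights a_{t,i,j} for one time step t: one scalar per spatial
# position, normalized with a softmax so they sum to 1 (assumed here).
logits = rng.standard_normal((H, W))
a_t = np.exp(logits) / np.exp(logits).sum()

# u_{t,c} = sum_{i,j} a_{t,i,j} * f_{i,j,c}
# -> a fixed-length vector with one entry per channel.
u_t = np.einsum('ij,ijc->c', a_t, f)

print(u_t.shape)  # (64,)
```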
(5) \hat{x}_t = W_c c_{t-1} + W_{u1} u_{t-1}
where c_{t-1} is the one-hot encoding of the character output at the previous time step, u_{t-1} is the input at the previous time step, and W_c, W_{u1} are the two learned parameter matrices.
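Equation (5) as a sketch. The vocabulary size V, channel count C, and input size D are made-up; W_c and W_{u1} are the learned projection matrices (randomly initialized here), and c_{t-1} is a one-hot character vector:

```python
import numpy as np

rng = np.random.default_rng(2)
V, C, D = 30, 64, 128       # assumed: vocabulary size, channels, input size

W_c  = rng.standard_normal((D, V))   # projects the previous character
W_u1 = rng.standard_normal((D, C))   # projects the previous attention input

# c_{t-1}: one-hot encoding of the character emitted at the previous step.
c_prev = np.zeros(V)
c_prev[7] = 1.0
u_prev = rng.standard_normal(C)      # u_{t-1}: previous attention vector

# x_hat_t = W_c c_{t-1} + W_{u1} u_{t-1}
x_hat_t = W_c @ c_prev + W_u1 @ u_prev
print(x_hat_t.shape)  # (128,)
```

Because c_prev is one-hot, the term W_c @ c_prev simply selects one column of W_c, i.e. a learned embedding of the previous character.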
(6) (o_t, s_t) = RNNstep(\hat{x}_t, s_{t-1})
One RNN step takes the current input \hat{x}_t and the previous hidden state s_{t-1}, and produces the output o_t and the new hidden state s_t.
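Putting the step function together: a minimal sketch of RNNstep using a plain tanh cell. The actual model may use an LSTM or GRU cell; this cell choice and all sizes are assumptions, and the parameters are randomly initialized rather than learned:

```python
import numpy as np

rng = np.random.default_rng(3)
D, S, V = 128, 96, 30       # assumed: input size, state size, vocabulary size

# Parameters of the RNN cell (randomly initialized stand-ins).
W_x = rng.standard_normal((S, D)) * 0.1
W_s = rng.standard_normal((S, S)) * 0.1
W_o = rng.standard_normal((V, S)) * 0.1

def rnn_step(x_hat_t, s_prev):
    """(o_t, s_t) = RNNstep(x_hat_t, s_{t-1}) with a simple tanh cell."""
    s_t = np.tanh(W_x @ x_hat_t + W_s @ s_prev)   # new hidden state
    o_t = W_o @ s_t                               # output logits over characters
    return o_t, s_t

x_hat_t = rng.standard_normal(D)
s_prev = np.zeros(S)                              # initial hidden state
o_t, s_t = rnn_step(x_hat_t, s_prev)
print(o_t.shape, s_t.shape)  # (30,) (96,)
```

At decoding time, o_t would be turned into the character c_t (e.g. by argmax or sampling), which then feeds back into equation (5) at the next step.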