Supplementary Materials of the Paper 1077 "Write Like You:
Synthesizing Your Cursive Online Chinese Handwriting via Metric-based Meta Learning"
ANONYMOUS AUTHOR(S)
1 MONOTONIC ATTENTION
The monotonic attention mechanism [4] attends to the memory (in our case, the output of the content encoder, $[h_{c,1}, h_{c,2}, \ldots, h_{c,N}]$) in a monotonic manner: if the decoder attended to $h_{c,t_{i-1}}$ at the previous decoding time step $i-1$, then at the current decoding time step $i$ we begin processing memory entries starting at index $t_{i-1}$; that is, we calculate the score scalar of $h_{c,j}$ for $j = t_{i-1}, t_{i-1}+1, \ldots, N$. We then use the logistic sigmoid function $\sigma(\cdot)$ to transform these score scalars into probabilities $p_{i,j}$ and sample $z_{i,j}$ from a Bernoulli distribution parameterized by $p_{i,j}$:
$$\alpha_{i,j} = \mathrm{score}(h_{i-1}, h_{c,j}) \tag{1}$$
$$p_{i,j} = \sigma(\alpha_{i,j}) \tag{2}$$
$$z_{i,j} \sim \mathrm{Bernoulli}(p_{i,j}), \tag{3}$$
where $\mathrm{score}(h_{i-1}, h_{c,j})$ measures how well $h_{i-1}$ and $h_{c,j}$ match; it can be defined as described in [1] or [3].
The sampled $z_{i,j}$ in Equation (3) is a binary value that determines whether to pick $h_{c,j}$. As soon as $z_{i,j} = 1$ for some $j$, we stop and set $t_i = j$ and $c_i = h_{c,j}$. Because $z_{i,j}$ is sampled from a Bernoulli distribution, the model cannot be trained using backpropagation. As suggested in [4], we can instead use soft monotonic attention and compute the expected value of $c_i$ over the complete memory.
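The hard selection rule and its soft counterpart can be illustrated with a short Python sketch. It is not the authors' implementation: the function names are hypothetical, returning `None` when no entry is selected is an assumption (the text does not say what happens in that case), and the soft step uses our reading of the expected-alignment recurrence in [4], where the probability of reaching entry $j$ is accumulated from left to right. The scores of Equation (1) are taken as given inputs.

```python
import math
import random


def sigmoid(x):
    """Logistic sigmoid, Eq. (2)."""
    return 1.0 / (1.0 + math.exp(-x))


def hard_monotonic_step(scores, t_prev, rng):
    """Scan memory entries from t_prev onward and return the first index j
    with z_{i,j} = 1. Returns None if nothing is selected (an assumption)."""
    for j in range(t_prev, len(scores)):
        p = sigmoid(scores[j])      # Eq. (2)
        if rng.random() < p:        # Eq. (3): z_{i,j} ~ Bernoulli(p_{i,j})
            return j
    return None


def soft_monotonic_step(scores, prev_attn):
    """Expected attention over the whole memory at step i, given the
    attention distribution of step i-1 (recurrence as we read it in [4])."""
    probs = [sigmoid(s) for s in scores]
    attn = []
    reach = 0.0  # probability that entry j is inspected at this step
    for j, p in enumerate(probs):
        carried = reach * (1.0 - probs[j - 1]) if j > 0 else 0.0
        reach = carried + prev_attn[j]
        attn.append(p * reach)
    return attn


def expected_context(attn, memory):
    """Expected value of c_i: attention-weighted sum of memory entries."""
    dim = len(memory[0])
    return [sum(a * h[d] for a, h in zip(attn, memory)) for d in range(dim)]


if __name__ == "__main__":
    rng = random.Random(0)
    scores = [-2.0, 0.5, 3.0]
    print(hard_monotonic_step(scores, 0, rng))
    attn = soft_monotonic_step(scores, [1.0, 0.0, 0.0])
    print(attn, expected_context(attn, [[1.0, 0.0], [0.0, 1.0], [2.0, 2.0]]))
```

Because the soft step is a chain of differentiable operations, it can be trained with backpropagation, while the hard step can be used at inference time.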
2 EVALUATION METRIC
2.1 DTW
We use the dynamic time warping (DTW) distance to evaluate the accuracy of coordinate prediction. We follow the same DTW calculation as [5], which proceeds in two steps.
(1) Convert the target and the predicted offsets (i.e., relative coordinates) into the corresponding absolute coordinates $C$ and $C'$:
$$C = [(x_1, y_1), (x_2, y_2), \ldots, (x_{|C|}, y_{|C|})]$$
$$C' = [(x'_1, y'_1), (x'_2, y'_2), \ldots, (x'_{|C'|}, y'_{|C'|})], \tag{4}$$
where $|C|$ and $|C'|$ are the lengths of $C$ and $C'$, respectively.
(2) Normalize the DTW distance between $C$ and $C'$ by the spatial scale and length of the real handwriting to eliminate the effects of different scales and lengths:
$$\text{normalized DTW}(C, C') = \frac{\mathrm{DTW}(C, C')}{|C| \sqrt{(x_{\max} - x_{\min})^2 + (y_{\max} - y_{\min})^2}}, \tag{5}$$
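where $x_{\max}$, $x_{\min}$, $y_{\max}$, and $y_{\min}$ are the extreme coordinates of the real handwriting $C$.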
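The two steps can be sketched in Python as follows. This is a minimal illustration, not the implementation from [5]: it assumes a Euclidean point-to-point cost, the standard three-direction DTW recursion, offsets given as plain (dx, dy) pairs without pen states, and a bounding box taken from the target sequence $C$, following the phrase "spatial scale and length of real handwriting". All function names are hypothetical.

```python
import math


def offsets_to_absolute(offsets):
    """Step (1): accumulate (dx, dy) offsets into absolute (x, y) points."""
    points, x, y = [], 0.0, 0.0
    for dx, dy in offsets:
        x, y = x + dx, y + dy
        points.append((x, y))
    return points


def dtw_distance(a, b):
    """Classic O(|a||b|) DTW with Euclidean local cost (an assumption)."""
    inf = float("inf")
    acc = [[inf] * (len(b) + 1) for _ in range(len(a) + 1)]
    acc[0][0] = 0.0
    for i in range(1, len(a) + 1):
        for j in range(1, len(b) + 1):
            cost = math.hypot(a[i - 1][0] - b[j - 1][0],
                              a[i - 1][1] - b[j - 1][1])
            acc[i][j] = cost + min(acc[i - 1][j],      # advance in a only
                                   acc[i][j - 1],      # advance in b only
                                   acc[i - 1][j - 1])  # advance in both
    return acc[len(a)][len(b)]


def normalized_dtw(target_offsets, predicted_offsets):
    """Step (2): divide DTW by |C| and by the diagonal of C's bounding box."""
    c = offsets_to_absolute(target_offsets)
    c_pred = offsets_to_absolute(predicted_offsets)
    xs = [p[0] for p in c]
    ys = [p[1] for p in c]
    scale = math.hypot(max(xs) - min(xs), max(ys) - min(ys))
    return dtw_distance(c, c_pred) / (len(c) * scale)


if __name__ == "__main__":
    target = [(1.0, 0.0), (1.0, 0.0), (1.0, 0.0)]
    predicted = [(1.0, 1.0), (1.0, 0.0), (1.0, 0.0)]
    print(normalized_dtw(target, predicted))
```

The sketch does not guard against a degenerate target whose bounding box has zero diagonal; a real evaluation script would need to handle that case.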
2.2 Content Score and Style Score
We utilize two classifiers to quantitatively evaluate the generated handwriting in terms of content and style separately. The architectures of these two classifiers are depicted in Fig. 1.
For the content evaluation, we train a character recognizer (see Fig. 1(a)) on the training set and use its recognition accuracy as the Content Score. We randomly select 20% of the training set as the validation set. We use the Adam [2] optimizer to train the recognizer with a batch size of 1024, a learning rate of 0.001, and gradient clipping of 1.0. For data augmentation, we multiply the offsets $(\Delta x, \Delta y)$ by a random scale factor in the range $[0.90, 1.10]$ and randomly drop points with a probability of 0.10. After training, the validation accuracy is 0.9702 and the test-set accuracy is 0.9627.
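The data augmentation described above could look roughly like the following Python sketch. Several details are assumptions, since the text does not specify them: a single scale factor is drawn per sample (rather than one per axis), the offset of a dropped point is folded into the next kept point so that later points keep their absolute positions, the final point is never dropped, and pen states are ignored. The function name and parameters are hypothetical.

```python
import random


def augment_offsets(offsets, rng, scale_range=(0.90, 1.10), drop_prob=0.10):
    """Randomly scale a sequence of (dx, dy) offsets and drop points."""
    scale = rng.uniform(*scale_range)  # one factor per sample (assumption)
    augmented = []
    carry_x = carry_y = 0.0
    last = len(offsets) - 1
    for k, (dx, dy) in enumerate(offsets):
        # Add any displacement inherited from previously dropped points.
        dx, dy = carry_x + dx * scale, carry_y + dy * scale
        if k != last and rng.random() < drop_prob:
            # Drop this point but keep its displacement for the next one.
            carry_x, carry_y = dx, dy
        else:
            augmented.append((dx, dy))
            carry_x = carry_y = 0.0
    return augmented


if __name__ == "__main__":
    rng = random.Random(0)
    sample = [(1.0, 2.0)] * 10
    print(augment_offsets(sample, rng))
```

Folding dropped offsets forward keeps the overall shape of the trajectory intact; simply deleting offset entries would instead shift every later point.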
For the style evaluation, we train a writer identification network (Fig. 1(b)), a 4-layer LSTM with a hidden size of 256, on the test set, which contains 60 writers. The training settings are the same as for the character recognizer above. After training, the validation accuracy is 0.9112.
[Figure 1 panels: (a) Bi-LSTM (hidden size = 512), 512 FC, 6763 FC, character id; (b) 4-layer Bi-LSTM (hidden size = 256), 256 FC, 60 FC, writer id.]
Fig. 1. The architectures of our two classifiers to score the generated handwriting in terms of content (a) and style (b), respectively.
[Figure 2 panels (a) and (b): questionnaire screenshots. Prompts shown: "Below are samples of a writer's handwriting. Please choose from the four characters above which you think is most likely written by this writer." and "Please choose the one from the four characters below that you think is most similar to the character above."]
Fig. 2. Two examples of the user study questionnaires described in Sections 5.4.2 and 5.5.2, respectively.
3 USER STUDY QUESTIONNAIRES
In our paper, we conduct two user studies, which are described in Sections 5.4.2 and 5.5.2, respectively. Examples of the two questionnaires are shown in Fig. 2.
In the first user study, we ask participants to point out one character from the four candidates which they think is most likely written by that writer. The four randomly arranged candidates are the genuine handwriting, handwriting generated by our model without and with fine-tuning, and the same character written by a random different writer.
In the second user study, participants choose, from four candidate fake handwritten characters, the one that is most similar to the real one. The four randomly arranged candidates are generated by DeepImitator [6], FontRNN [5], and our model without and with fine-tuning, respectively.
REFERENCES
[1] Dzmitry Bahdanau, Kyunghyun Cho, and Yoshua Bengio. 2014. Neural Machine Translation by Jointly Learning to Align and Translate. CoRR abs/1409.0473 (2014).
[2] Diederik P. Kingma and Jimmy Ba. 2014. Adam: A Method for Stochastic Optimization. CoRR abs/1412.6980 (2014).
[3] Thang Luong, Hieu Pham, and Christopher D. Manning. 2015. Effective Approaches to Attention-based Neural Machine Translation. In EMNLP.
[4] Colin Raffel, Minh-Thang Luong, Peter J. Liu, Ron J. Weiss, and Douglas Eck. 2017. Online and Linear-Time Attention by Enforcing Monotonic Alignments. In Proceedings of the 34th International Conference on Machine Learning, Volume 70. JMLR.org, 2837–2846.
[5] Shusen Tang, Zeqing Xia, Zhouhui Lian, Yingmin Tang, and Jianguo Xiao. 2019. FontRNN: Generating Large-scale Chinese Fonts via Recurrent Neural Network. In Computer Graphics Forum, Vol. 38. Wiley Online Library, 567–577.
[6] Bocheng Zhao, Jianhua Tao, Minghao Yang, Zhengkun Tian, Cunhang Fan, and Ye Bai. 2020. Deep Imitator: Handwriting Calligraphy Imitation via Deep Attention Networks. Pattern Recognition (2020), 107080.
A APPENDIX
Here we show a large number of generated results. Each page corresponds to one test writer.
[Result figures: on each page, generated characters are arranged in repeated column groups labeled "w/o FT", "FT-100", and "Ground truth".]