Layered Temporal Dataset for Anime Drawings
Ningyuan ZHENG
Cheng-hsin Emily WUU
[Project Proposal]
[Slides]
[Code]
Last updated: May 15, 2021


Example input: we collect the underlying stroke data in PSD format, containing composition and blending information for RGBA image layers. It captures the temporality of the illustration's creation along with the layered information at each timestamp.

Abstract

In this project, we create a dataset of high-quality anime illustrations in PSD format (20,000 samples, ~1.6TB raw size), which is useful for tackling self-supervised segmentation and generation refinement tasks. The dataset consists of high-quality anime illustrations voted on by online users. For each illustration sample, we collect the underlying stroke data that gives rise to the finished work and replay the data through a drawing engine to render a series of PSD (Photoshop Document) files that capture the art creation process. The PSD format contains composition and blending information for RGBA image layers, which is useful for production, and the temporal nature of our data allows us to refine sketches autoregressively. We also build a differentiable renderer for different channel blending modes that allows us to explore self-supervised segmentation tasks.
To show the merit of our dataset, we establish baselines for the generation task with existing models such as MUNIT [4] and Pix2Pix HD [5]. Our next steps are to customize models for the generation/refinement task and to establish segmentation baselines with existing models.
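As a concrete illustration of what each sample contains, the layered structure can be inspected with an off-the-shelf PSD reader. The sketch below assumes the open-source psd-tools library and a hypothetical file name; it is not part of our pipeline.

# Minimal sketch: inspecting one dataset sample with psd-tools (assumed
# dependency); "sample_0001.psd" is a hypothetical file name.
from psd_tools import PSDImage

psd = PSDImage.open("sample_0001.psd")
for layer in psd:
    # Each layer carries an RGBA image plus its compositing metadata.
    print(layer.name, layer.blend_mode, layer.opacity, layer.offset)
    rgba = layer.numpy()  # H x W x 4 float array for downstream processing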


Videos





Related Work

The success of modern computer vision approaches is largely attributable to growing amounts of data. Richly annotated and cross-modal datasets open avenues to various research topics. These methods have also been applied to the anime domain, such as sketch colorization [3] and image generation [1]. Existing datasets such as Danbooru [2] contain images of anime illustrations and tags (4.2m images, 130m tags, 3.4TB total), enabling generation conditioned on tags. However, real-world applications usually require more fine-grained control, such as layered editing and refinement. Due to the lack of such data, current data-driven approaches fail to tackle these problems.

Danbooru2020


[Website]

Danbooru2020 is a large-scale anime image database with 4.2m+ images annotated with 130m+ tags; it can be useful for machine learning purposes such as image recognition and generation.




Style2paints


[Website]

Style2paints V4 is a state-of-the-art AI-driven lineart colorization tool. Different from previous end-to-end image-to-image translation methods, Style2paints V4 is the first system to colorize a lineart following a real-life human workflow, and its outputs are layered.



Samples of the Layered Temporal Dataset

Data Process

The metadata describing the strokes that give rise to the artwork is collected from an online platform. The data is replayed through a rendering engine and captured in PSD format. Since flipping and warping occur during the creation process, we calculate SIFT [6] features for each time step and apply an affine transform to align the intermediate frames with the finished frame.
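A minimal sketch of this alignment step, assuming OpenCV >= 4.4 (where SIFT lives in the main module); the function and variable names here are ours, not the pipeline's:

# Sketch: align an intermediate frame to the finished frame via SIFT + affine fit.
import cv2
import numpy as np

def align_to_final(frame, final):
    """Warp an intermediate BGR frame onto the finished frame."""
    sift = cv2.SIFT_create()
    gray_a = cv2.cvtColor(frame, cv2.COLOR_BGR2GRAY)
    gray_b = cv2.cvtColor(final, cv2.COLOR_BGR2GRAY)
    kp_a, des_a = sift.detectAndCompute(gray_a, None)
    kp_b, des_b = sift.detectAndCompute(gray_b, None)

    # Match descriptors and keep correspondences passing Lowe's ratio test.
    matches = cv2.BFMatcher(cv2.NORM_L2).knnMatch(des_a, des_b, k=2)
    good = [m for m, n in matches if m.distance < 0.75 * n.distance]

    src = np.float32([kp_a[m.queryIdx].pt for m in good]).reshape(-1, 1, 2)
    dst = np.float32([kp_b[m.trainIdx].pt for m in good]).reshape(-1, 1, 2)

    # A full affine fit can absorb flips (negative determinant) as well as warps.
    M, _ = cv2.estimateAffine2D(src, dst, method=cv2.RANSAC)
    h, w = final.shape[:2]
    return cv2.warpAffine(frame, M, (w, h))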

Samples


[Image grid: three sample illustrations, each shown as the final illustration alongside its decomposed layers (Layer 1, Layer 2, and for the third sample Layer 3)]



Task Description & Approach

We present two specific tasks that can be tackled with our layered temporal dataset: self-supervised generation refinement and self-supervised segmentation. Along with each task description, we also describe the implementation details of our model for the generation refinement task.

Self-supervised Generation Refinement



For this task, the input is a sketch, and the output is a refined illustration.
Our dataset enables generalized illustration refinement, using different layers and intermediate stages of the creation process.
We adapt the MUNIT [4] model for the refinement task; the model learns to recolor the content.
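The temporal ordering makes training pairs free to construct: any earlier frame can serve as the coarse input and any later frame of the same drawing as its refinement target. Below is a minimal sketch of such a pair sampler in PyTorch, where sequences is a hypothetical list holding one temporally ordered [T, 4, H, W] tensor of aligned RGBA frames per illustration:

# Sketch: self-supervised (coarse, refined) pair sampling over replayed frames.
import random
from torch.utils.data import Dataset

class RefinementPairs(Dataset):
    """An earlier frame of a drawing is the input; a later, more finished
    frame of the same drawing is the target. No human labels are needed."""

    def __init__(self, sequences):
        self.sequences = sequences  # hypothetical list of [T, 4, H, W] tensors

    def __len__(self):
        return len(self.sequences)

    def __getitem__(self, idx):
        seq = self.sequences[idx]
        t = random.randrange(0, len(seq) - 1)   # any intermediate stage
        s = random.randrange(t + 1, len(seq))   # any strictly later stage
        return seq[t], seq[s]                   # (coarse input, refined target)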



Self-supervised Segmentation



For this task, the input is an illustration, and the output is a set of segmented layers.
Our dataset enables fine-grained layer segmentation on any illustration, even when no layer information is available.
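The differentiable renderer mentioned in the abstract is what makes this self-supervised: predicted layers can be composited back and penalized against the original illustration with a reconstruction loss. Below is a minimal sketch of such a compositor in PyTorch, covering only the "normal" and "multiply" blend modes; all names are ours, and the full renderer handles further PSD blend modes in the same way.

# Sketch: differentiable back-to-front compositing of RGBA layers.
import torch

def composite(layers, modes):
    """layers: float tensor [N, 4, H, W] in [0, 1], bottom layer first.
    modes: one of {"normal", "multiply"} per layer."""
    out = torch.zeros_like(layers[0, :3])  # start from a black backdrop
    for layer, mode in zip(layers, modes):
        rgb, alpha = layer[:3], layer[3:4]
        blended = rgb if mode == "normal" else out * rgb  # "multiply" blend
        out = alpha * blended + (1.0 - alpha) * out       # alpha-over, differentiable
    return out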



Results on Generation Refinement Task

We show results using the conditional generation model MUNIT [4].

[Code Available]









References

[1] Gwern: Making anime faces with StyleGAN (Feb 2019), https://www.gwern.net/Faces

[2] Gwern: Danbooru2020: A large-scale crowdsourced and tagged anime illustration dataset (Jan 2021), https://www.gwern.net/Danbooru2020

[3] Zhang, L., Li, C., Wong, T.T., Ji, Y., Liu, C.: Two-stage sketch colorization. ACM Transactions on Graphics 37(6) (Nov 2018). https://doi.org/10.1145/3272127.3275090

[4] Huang, X., Liu, M.Y., Belongie, S., Kautz, J.: Multimodal unsupervised image-to-image translation. In: Proceedings of the European Conference on Computer Vision (ECCV) (2018)

[5] Wang, T.C., Liu, M.Y., Zhu, J.Y., Tao, A., Kautz, J., Catanzaro, B.: High-resolution image synthesis and semantic manipulation with conditional GANs. In: Proceedings of the IEEE Conference on Computer Vision and Pattern Recognition (2018)

[6] Lowe, D.G.: Distinctive image features from scale-invariant keypoints. International Journal of Computer Vision 60(2), 91-110 (2004)




Contact

Please email Ningyuan ZHENG (ningyuaz@andrew.cmu.edu) or Cheng-hsin Emily WUU (cwuu@andrew.cmu.edu) with any questions.