[Paper Reivew] TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

STE 에서 Style, glyph를 나누어 diffusion model의 guidance로 활용하는 TextCtrl을 제안한 논문입니다.

Posted Jan 6, 2025

By Daemin Park

15 min read

[Paper Reivew] TextCtrl: Diffusion-based Scene Text Editing with Prior Guidance Control

NeurIPS 2024 Highlight [Paper] [Github]
Weichao Zeng, Yan Shu, Zhenhang Li, Dongbao Yang, Yu Zhou
Chinese Academy of Sciences, Nankai University
14 Oct, 2024

TL;DR

STE 에서 Style, glyph를 나누어 diffusion model의 guidance로 활용.
Source image의 추가적인 prior를 활용하는 glyph-adaptive mutual self-attention mechanism 제안.
새로운 evaluation benchmark ScenePair 제안.

1. Introduction

Scene Text Editing (STE)란 input image의 text를 style, background를 보존한 채로, desired text로 바꾸는 task를 말합니다.

GAN을 사용한 초기 연구들은 STE를 3단계의 sub-task로 나누어 divide-and-conquer 방식으로 해결하려 했습니다.

Foreground text style transfer
Background restoration
Fusion

하지만 일반화 성능이 떨어지며, “bucket effect”라 불리는 불안정한 배경 복구가 messy fusion artifact로 이어지는 문제가 발생했다고 합니다.

더욱 최근에는 Large-scale의 text-to-image diffusion model을 이용하려는 연구가 진행되었다고 합니다.

Conditional synthesis manner : coarse-grained 학습으로 제한됨.
Inpainting manner : style guidance는 주로 unmasked region에서 비롯되며, 복합한 시나리오에서 스타일 일관성이 지켜지지 않는 문제.

게다가 diffusion 기반의 STE 방법론들은 text-prompt와 glyph 간의 약한 상관관계 때문에 typo가 생기거나 text rendering 정확도가 떨어지는 문제가 있다고 합니다.

glyph

저자들은 이러한 문제들의 원인이 style, structure의 불충분한 prior guidance라고 주장합니다. 따라서 저자들은 STE를 두 가지 측면으로 나누어 분석합니다.

Text style disentanglement : Visual coherency
Text glyph representation : Text rendering accuracy

또한 diffusion-based STE 방법론에서는 때떄로 의도치 않은 color deviation이나 texture degradation 문제가 발생한다고 합니다. 이를 해결하기 위해 저자들은 generator를 개선하는 glyph-adaptive mutual self-attention mechanism를 제안하며 visual inconsistency를 효과적으로 제거할 수 있었다고 합니다.

마지막으로 저자들은 마땅한 STE 평가 벤치마크가 없다는 점을 언급하며, ScenePair Dataset을 제안합니다.

저자들의 contribution을 요약하자면 다음과 같습니다.

STE 에서 Style, glyph를 나누어 diffusion model의 guidance로 활용.
Source image의 추가적인 prior를 활용하는 glyph-adaptive mutual self-attention mechanism 제안.
새로운 evaluation benchmark ScenePair 제안.

2.1. GAN-based STE

SRNet: 단어 수준 편집을 위한 최초의 분할-정복 방식 제안.
SwapText: 곡선 텍스트 편집을 위해 Thin Plate Spline Interpolation Network 추가.
STRIVE: 동영상 내 텍스트 교체로 응용 확장.
TextStyleBrush: StyleGAN2를 기반으로 한 Self-supervised 학습 방식 도입.
MOSTEL: Semi-supervised 학습 기반의 STE 방식 설계.

2.2. Diffusion-based STE

DiffSTE: Dual 인코더 설계로 렌더링 정확성과 스타일 제어 개선.
DiffUTE: OCR 기반 이미지 인코더 도입.
TextDiffuser:Character segmentation masks 사용
UDiffText: Supervised labels
LEG: Text-style 이용하기 위해 source image를 concat.
DBEST: Inference 중 fine-tuning 적용.
AnyText: ControlNet을 활용하여 다중 언어 STE 및 STG를 지원하는 범용 프레임워크 제안.

2.3. Image Editing with Diffusion Models

Model-Tuning 방식 : 전체 모델을 fine-tuning
- Imagic
- DreamBooth
Null-text Inversion : CFG를 통해 Null-text prompt를 optimize.
LDM 의 Self-attention layer 관련 연구들도 존재.

3. Method

전체적인 framework는 다음과 같습니다.

각 구조를 하나씩 살펴봅시다.

3.1. Text Glyph Structure Representation Pre-training

STE에서 이상적인 text encoder는 semantic 정보를 활용하지 않고 glyph structure에 기반한 encoding을 하는 것입니다. 예를들어 \(C_{text} = "Sifted"\)가 있으면, text encoder \(\mathcal{T}\)는 각 glyph structure \("S", "i", "f", "t", "e", "d"\)를 아는 것입니다.

저자들은 character-level text encoder를 활용하여, target text feature를 visual glyph structure와 align 시켰다고 합니다.

UDiffText와는 다르게 많은 양의 Font cluster를 구성해 font-variance augmentation을 했다고 합니다.

3.2. Text Style Disentanglement Pre-training

Text Style은 font, color, spatial transformation등 많은 양의 정보가 합쳐져 있는 정보입니다. Fine-grained disentanglement를 위해 저자들은 4가지 sub-task로 나누어 학습을 진행했다고 합니다. 1) text color transfer, 2) text font transfer, 3) text removal, 4) text segmentation.

Text Color Transfer

Intrinsic style이나, Lighting 조건 때문에 text color를 결정하거나 전체적인 color를 classification하는 것은 어려운 task라고 합니다. 이에 저자들은 image style transfer를 참고하여 colorization training을 통해 implicit하게 color을 추출했다고 합니다. Source text color image에 AdaIN 적용.

\[i_{\text{out}}^c = \mathcal{F}_{\text{dec}}^c \left( \mathcal{A} \left( \mathcal{F}_{\text{enc}}^c(i_{\text{in}}^c), c_{\text{texture}} \right) \right)\]

Text Font Transfer

Text Font Transfer는 Text color와 비슷한 목적이지만 glyph boundary에 조금 더 focus합니다. Pyramid Pooling Module을 사용한다고 합니다.

\[i_{\text{out}}^f = \mathcal{F}_{\text{dec}}^f \left( \mathcal{P} \left( \mathcal{F}_{\text{enc}}^f(i_{\text{in}}^f), c_{\text{texture}} \right) \right)\]

Text Removal and Text Segmentation

Text removal: text pixel을 지우고, text가 가리고 있던 background color를 예측하는 task. \(i_{\text{out}}^r = \mathcal{F}^r(c_{\text{spatial}})\)

Text segmentation : background와 text 사이 spatial 관계를 decouple 하는 task. \(i_{\text{out}}^s = \mathcal{F}^s(c_{\text{spatial}})\)

Multi-task Loss

전체 pre-training loss는 다음과 같습니다.

\[\mathcal{L}_{\text{disentangle}} = \mathcal{L}_{\text{color}}(i_{\text{out}}^c, i_{\text{gt}}^c) + \mathcal{L}_{\text{font}}(i_{\text{out}}^f, i_{\text{gt}}^f) + \mathcal{L}_{\text{rem}}(i_{\text{out}}^r, i_{\text{gt}}^r) + \mathcal{L}_{\text{seg}}(i_{\text{out}}^s, i_{\text{gt}}^s)\]

\(\mathcal{L}_{\text{color}}\) : MSE loss
\(\mathcal{L}_{\text{rem}}\) : MAE loss
\(\mathcal{L}_{\text{font}} , \mathcal{L}_{\text{seg}}\) : Dice loss

3.3. Prior Guided Generation

위에서 만든 \(C_{\text{struct}}\), \(C_{\text{style}}\)을 이용해 diffusion generator를 guide할 수 있습니다.

\(C_{\text{struct}}\): Cross-attention의 key-value를 \(C_{\text{struct}}\)의 projection으로 교체.
\(C_{\text{style}}\) : multi-scale style feature를 skip-connection, middle block에 적용해 high-fidelity rendering.

3.4. Inference Control

Diffusion-based STE 모델을 사용하면 때때로 의도치 않은 color deviation, texture degradation이 발생했다고 합니다. 저자들은 이런 문제의 원인을 denoising 과정에서 축적된 오차, training, inference set의 domain gap 때문이라고 주장합니다. 이를 해결하기 위해 저자들은 Source image의 prior를 injection하는 방법론인 Glyph-adaptive Mutual Self-Attention mechanism(GaMuSa)를 제안합니다.

Reconstruction Branch

STE 과정은 random noise를 image로 생성하는 것이 아니라, image-to-image translation 과정입니다. 저자들은 DDIM inversion을 통해 initial latent를 생성하고, deconstructed inversion 과정을 통해 reconstruction branch \((z_{\text{source}}^T, z_{\text{source}}^{T-1}, \dots, z_{\text{source}}^0)\)와 editing branch \((z_{\text{edit}}^T, z_{\text{edit}}^{T-1}, \dots, z_{\text{edit}}^0)\)를 동시에 만들 수 있다고 합니다.

GaMuSa

일반적인 image editing과는 다르게, STE에서 Target text modification는 상당한 condition representation 변화를 야기할 수 있다고 합니다. 최근 self-attention layer가 latent internal relation에 focus한다는 연구 결과가 발표됨에 따라, 저자들은 reconstruction branch, editing branch사이 mutual self-attention process를 적용했다고 합니다.

이때 domain gap을 줄이기 위해 editing branch의 key, value를 완전히 대체하는 것이 아니라, editing branch와 reconstruction branch를 적절히 integration하여 사용했다고 하며, 이때 합치는 정도는 vision encoder \(\mathcal{R}\)을 이용해 조절할 수 있다고 합니다.

\[\text{GaMuSa} = \text{Softmax}\left( \frac{Q_e \cdot (\lambda K_s + \mu K_e)^\top}{\sqrt{d}} \right) \cdot (\lambda V_s + \mu V_e), \quad \mu = 1 - \lambda.`\]

4. Experiments

4.1. Dataset and Metrics

Training Data

200k 개의 paired text image들을 생성하여 스타일 분리 사전 학습 및 TextCtrl의 지도 학습에 사용.
각 쌍은 동일한 스타일(폰트, 크기, 색상, 공간 변환, 배경)을 가지며, 서로 다른 텍스트, 분할 마스크(segmentation mask), 배경 이미지를 포함.
730개 폰트를 사용하여 glyph 구조 사전 학습용 시각적 텍스트 이미지를 합성.
ScenePair Benchmark
real-world image-pair dataset으로, ICDAR 2013, HierText, MLT 2017에서 수집된 1,280개의 이미지 쌍 포함.
각 쌍은 유사한 텍스트 길이, 스타일, 배경을 가진 크롭된 텍스트 이미지 2개와 원본 전체 크기 이미지를 포함.
시각적 품질과 텍스트 렌더링 정확도를 평가하는 데 사용.

Evaluation Dataset

ScenePair(Proposed):

Style fidelity와 텍스트 렌더링 정확도를 평가할 수 있는 1,280개 cropped text image pair와 원본 전체 크기 이미지 제공.

TamperScene:

7,725개의 크롭 텍스트 이미지와 사전 정의된 타겟 텍스트 제공.
텍스트 렌더링 정확도는 평가 가능하지만, 스타일 평가 쌍 데이터나 전체 이미지 기반 평가를 포함하지 않아 ScenePair의 필요성을 강조.

Evaluation Metrics

Visual quality:
1. SSIM: 구조적 유사성(Structural Similarity).
2. PSNR: 신호 대 잡음비(Peak Signal-to-Noise Ratio).
3. MSE: 픽셀 수준 평균 제곱 오차(Mean Squared Error).
4. FID: 특징 벡터 간 통계적 차이(Fréchet Inception Distance).
Text rendering accuracy:
1. ACC: 단어 정확도(Word Accuracy).
2. NED: 정규화된 편집 거리(Normalized Edit Distance).

4.2. Performance Comparison

Implementation

TextCtrl 비교 대상:
- GAN 기반: SRNet, MOSTEL.
- Diffusion 기반: DiffSTE, TextDiffuser, AnyText.
STE 수행 방식:
- Diffusion 기반 방법: inpainting manner로 전체 크기 이미지를 처리하며, 마스크된 타겟 텍스트 영역을 입력으로 사용.
- GAN 기반 방법과 TextCtrl: synthesis manner로 텍스트 이미지를 생성한 뒤, 전체 크기 이미지에 삽입.

Text Style Fidelity

Disentangled style prior과 추론 제어를 활용하여 높은 스타일 충실도와 정교한 편집 결과 달성.

Text Rendering Accuracy

TextCtrl은 강력한 glyph structure representation을 통해 모든 방법 중 최고의 spelling accuracy 달성.

4.3. Ablation Study

Text Glyph Structure Representation

Conditional text prompt는 STE에서 편집된 텍스트 글리프 glyph guide 하는 데 중요한 역할을 합니다. 저자들은 여러가지 text encoder를 실험한 결과 일반적인 CLIP은 spelling mistakes을 자주 겪었으며, 저자들이 제안한 font-variance 방법이 가장 높은 성능을 보여주었다고 합니다.

Text Style Disentanglement

TextCtrl은 explicit text style disentanglement pre-training을 통해 이전 방법들보다 폰트, 색상, glyph, 배경 텍스처에서 세분화된 특징 표현을 달성했습니다. t-SNE 시각화를 통해 유사한 색상의 텍스트가 feature space에서 cluster, 가까운 곳에 같은 style의 text image가 인접해 있음을 확인할 수 있습니다.

또한 저자들은 style-disentanglement를 ControlNet과 비교했을 때에도, 저자들의 style-disentanglement가 더 성능이 좋았다고 합니다.

Inference control with Glyph-adaptive Mutual Self-attention (GaMuSa)

GaMuSa는 ScenePair에서 direct sampling 및 MasaCtrl과 비교하여 style fidelity를 강화.

5. Limitations

Challenging arbitrary shape text editing : 초승달 모양 간판이나 원형 아이콘과 같은 복잡한 기하학적 속성을 가진 텍스트는 스타일 참조만으로 분리하는 데 어려움을 겪었다고 합니다.

Sub-optimal visual quality assessments metric: 기존 평가 지표는 pixel-level discrepancy 또는 feature similarity에 중점을 두며, 텍스트 스타일 일관성을 평가하기에는 한계가 있다고 합니다. 저자들의 ScenePair 데이터셋을 통해 일부 평가를 가능하게 했으나, 여전히 다수의 실세계 텍스트 이미지는 쌍이 없어 평가에 제약이 있습니다.

Deep Learning, Generative Model

This post is licensed under CC BY 4.0 by the author.