Paper: https://arxiv.org/pdf/1911.08947.pdf

Intro

The major contribution in this paper is the proposed DB module that is differentiable, which makes the process of binarization end-to-end trainable in a CNN.

Mask

The detection targets are mask-based.

Fast

One-shot (single forward pass); a very small backbone can be used.

Composition

Backbone

ResNet, MobileNet, etc.

DB Module

The CNN produces two masks: a probability map and a threshold map.

Applying the threshold map to the probability map yields the binary map, i.e., the detection result. Each connected region can be turned into a polygon by taking its hull, from which the minimum bounding rectangle is computed.

At inference time the threshold map can be dropped: the probability map plus a fixed threshold is used directly.

For multi-class problems, it suffices to add channels to the head: each class gets its own probability map, and the n per-class probability maps yield n binary maps.
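
As a rough sketch (not the paper's exact head architecture), the following assumes a shared feature map `feat` from the backbone/neck; for n classes the probability branch simply outputs n channels, while sharing a single threshold map across classes is an assumption made here for illustration.

```python
import torch
import torch.nn as nn

class MultiClassDBHead(nn.Module):
    """Illustrative head: one probability channel per class plus one shared threshold map."""
    def __init__(self, in_ch: int, num_classes: int = 1):
        super().__init__()
        self.prob_conv = nn.Conv2d(in_ch, num_classes, kernel_size=1)
        self.thresh_conv = nn.Conv2d(in_ch, 1, kernel_size=1)

    def forward(self, feat: torch.Tensor):
        prob_maps = torch.sigmoid(self.prob_conv(feat))     # (B, num_classes, H, W)
        thresh_map = torch.sigmoid(self.thresh_conv(feat))  # (B, 1, H, W)
        # each of the n probability maps later yields its own binary map
        return prob_maps, thresh_map
```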

Data

Similar to PSENet, the mask is obtained by shrinking the bbox inward.

(Figure from the paper) The text polygon annotation is drawn in red; the shrunk and dilated polygons are drawn in blue and green, respectively.

The bbox is shrunk inward by an offset D.

The offset D of shrinking is computed from the perimeter L and area A of the original polygon.

D = \frac{A(1 - r^2)}{L}

where r is the shrink ratio, set to 0.4 empirically.

If the text box is very long, the shrunk region may become too small, so a minimum size should be enforced. The text box height can serve as the reference, e.g., the shrunk box should be no smaller than 40% of the text box height.
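
A minimal sketch of the shrinking step, assuming the `shapely` and `pyclipper` packages (commonly used in DB implementations); the height-based clamp is one possible reading of the 40% floor mentioned above, and the helper name is illustrative.

```python
import numpy as np
import pyclipper
from shapely.geometry import Polygon

def shrink_polygon(points: np.ndarray, r: float = 0.4, min_h_ratio: float = 0.4):
    """points: (N, 2) polygon vertices; r: shrink ratio from the paper."""
    poly = Polygon(points)
    A, L = poly.area, poly.length            # area and perimeter of the original polygon
    D = A * (1 - r ** 2) / L                 # shrink offset D = A(1 - r^2) / L

    # Illustrative clamp: keep the shrunk box at least min_h_ratio of the box height,
    # approximating "height" by the shorter side of the axis-aligned extent.
    h = min(points[:, 0].ptp(), points[:, 1].ptp())
    D = min(D, h * (1 - min_h_ratio) / 2)

    pco = pyclipper.PyclipperOffset()
    pco.AddPath(points.round().astype(int).tolist(),
                pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
    shrunk = pco.Execute(-int(round(D)))     # negative offset shrinks the polygon inward
    return np.array(shrunk[0]) if shrunk else None
```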

Loss

  • L1 loss for the threshold map
  • The probability map and the binary map are treated as binary classification and use BCE loss

The loss function L can be expressed as a weighted sum of the loss for the probability map L_s, the loss for the binary map L_b, and the loss for the threshold map L_t:

L = L_s + \alpha \times L_b + \beta \times L_t

According to the numeric values of the losses, \alpha and \beta are set to 1.0 and 10, respectively.

BCE loss is applied for L_s and L_b, and L1 loss for L_t.
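
A hedged PyTorch sketch of this weighted sum; the hard-negative sampling the paper applies to the BCE terms is omitted for brevity, and the argument names (predicted maps, ground-truth shrunk mask, threshold target, border mask) are illustrative. All maps are assumed to be in [0, 1].

```python
import torch
import torch.nn.functional as F

def db_loss(prob_map, binary_map, thresh_map,
            gt_shrink, gt_thresh, thresh_mask,
            alpha: float = 1.0, beta: float = 10.0):
    """L = L_s + alpha * L_b + beta * L_t (hard-negative mining omitted)."""
    l_s = F.binary_cross_entropy(prob_map, gt_shrink)     # probability map vs. shrunk mask
    l_b = F.binary_cross_entropy(binary_map, gt_shrink)   # approximate binary map vs. shrunk mask
    # L1 on the threshold map, computed only inside the dilated border region
    l_t = (torch.abs(thresh_map - gt_thresh) * thresh_mask).sum() / (thresh_mask.sum() + 1e-6)
    return l_s + alpha * l_b + beta * l_t
```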

DB Function

Obtaining the binary map from the probability map via the threshold map is not differentiable:

B = \begin{cases} 1 & \text{if } P \ge T \\ 0 & \text{otherwise} \end{cases}

So the paper proposes the DB function (differentiable binarization):

B = \frac{1}{1 + e^{-k(P - T)}}

B is the binary map, P is the probability map, and T is the threshold map.
k is a hyperparameter, empirically set to 50.
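
A one-line sketch of the DB function with PyTorch tensors for the two head outputs; `torch.sigmoid(k * (P - T))` is exactly 1 / (1 + e^{-k(P - T)}).

```python
import torch

def differentiable_binarization(prob_map: torch.Tensor,
                                thresh_map: torch.Tensor,
                                k: float = 50.0) -> torch.Tensor:
    """Approximate binary map B = sigmoid(k * (P - T)); k = 50 as in the paper."""
    return torch.sigmoid(k * (prob_map - thresh_map))
```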

(Figure from the paper) Illustration of differentiable binarization and its derivative. (a) Numerical comparison of standard binarization (SB) and differentiable binarization (DB). (b) Derivative of l_+. (c) Derivative of l_-.

With cross-entropy, the loss function becomes:

\text{let } x = P - T \\ l_+ = -\log\frac{1}{1+e^{-kx}} \\ l_- = -\log\left(1 - \frac{1}{1+e^{-kx}}\right)

The derivatives can be computed as:

\text{let } f(x) = \frac{1}{1+e^{-kx}} \\ \frac{\partial l_+}{\partial x} = -k f(x) e^{-kx} \\ \frac{\partial l_-}{\partial x} = k f(x)
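
A quick numerical check (not from the paper) that these analytic gradients match autograd, sketched with PyTorch at k = 50 and a small x:

```python
import torch

k = 50.0
x = torch.tensor(0.03, requires_grad=True)
f = lambda t: 1.0 / (1.0 + torch.exp(-k * t))   # f(x) = sigmoid(kx)

# label = 1 case: l_+ = -log f(x)
g_pos, = torch.autograd.grad(-torch.log(f(x)), x)
print(g_pos.item(), (-k * f(x) * torch.exp(-k * x)).item())   # both ≈ -k f(x) e^{-kx}

# label = 0 case: l_- = -log(1 - f(x))
g_neg, = torch.autograd.grad(-torch.log(1.0 - f(x)), x)
print(g_neg.item(), (k * f(x)).item())                        # both ≈ k f(x)
```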

Predict

Set a fixed threshold, e.g., 0.2, and take the points in the probability map above this threshold.

Dilate the connected regions (what is predicted is the shrunk box); the dilation coefficient is an important parameter that controls the size of the text boxes.

The dilation offset D' is computed as:

D' = \frac{A' \times r'}{L'}

where A′ is the area of the shrunk polygon; L′ is the perimeter of the shrunk polygon; r′ is set to 1.5 empirically.
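
A hedged sketch of this post-processing, assuming OpenCV and pyclipper: binarize the probability map with the fixed threshold, treat each connected region's contour as the shrunk polygon, dilate it by D', and fit the minimum-area rectangle. Function and variable names are illustrative.

```python
import cv2
import numpy as np
import pyclipper

def boxes_from_prob_map(prob_map: np.ndarray, thresh: float = 0.2, r_prime: float = 1.5):
    """prob_map: (H, W) float array in [0, 1]; returns a list of 4-point boxes."""
    binary = (prob_map > thresh).astype(np.uint8)
    contours, _ = cv2.findContours(binary, cv2.RETR_LIST, cv2.CHAIN_APPROX_SIMPLE)
    boxes = []
    for cnt in contours:
        if len(cnt) < 3:
            continue
        A = cv2.contourArea(cnt)              # area A' of the shrunk polygon
        L = cv2.arcLength(cnt, True)          # perimeter L' of the shrunk polygon
        if L == 0:
            continue
        D = A * r_prime / L                   # dilation offset D' = A' * r' / L'
        pco = pyclipper.PyclipperOffset()
        pco.AddPath(cnt.reshape(-1, 2).tolist(),
                    pyclipper.JT_ROUND, pyclipper.ET_CLOSEDPOLYGON)
        dilated = pco.Execute(int(round(D)))  # positive offset expands the polygon outward
        if not dilated:
            continue
        rect = cv2.minAreaRect(np.array(dilated[0], dtype=np.float32))
        boxes.append(cv2.boxPoints(rect))     # minimum-area bounding rectangle corners
    return boxes
```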

(Figure from the paper) The threshold map with/without supervision. (a) Input image. (b) Probability map. (c) Threshold map without supervision. (d) Threshold map with supervision.

Personal Thoughts

Why is the threshold region slightly larger than the predicted box and graded? For example, the threshold is 0.7 on the text box border and spreads both outward and inward, decaying to 0.3 at the edges.

Since the target is the probability map, such a dynamic threshold helps the probability map polarize toward the two extremes and avoids values hovering around 0.5. In the region spreading outward, the threshold decreases and the label is 0, so the probability can be pushed down. In the region spreading inward, the threshold decreases and the label is 1, so the probability map becomes easier to learn; the loss drops quickly and the network converges fast. Even though the threshold inside the text box is lowered, once the network has converged the probability still trends higher and does not stall near the threshold.

Experimentally, the probability inside the text box ends up close to the maximum of the threshold: if the threshold ranges over 30~80, the final probability map is mostly 80~100; if it ranges over 30~70, the probability map is 70~100. The fixed threshold used at inference can be set to 0.3.