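Depth, Stack, and Channel Multiplier in Depthwise Separable Convolution

Differences in the number of channels

Single-channel convolution

The code below tests a depthwise convolution on a single channel (depth 1). Its result is the same as that of an ordinary convolution: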

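// Assumes tf is in scope, e.g. import * as tf from '@tensorflow/tfjs';
async function depthwiseConv2dTestSingleDepth() {
  const fSize = 2;
  const pad = 'valid';
  const stride = 1;
  const chMul = 1;
  const inDepth = 1;
  // Input: one 3x3 image with a single channel (NHWC).
  const x = tf.tensor4d([0, 1, 2, 5, 0, 5, 0, 1, 5], [1, 3, 3, inDepth]);
  // Filter shape: [filterHeight, filterWidth, inDepth, channelMultiplier].
  const w = tf.tensor4d([1.0, 2.0, 2.0, 2.0], [fSize, fSize, inDepth, chMul]);
  const result = tf.depthwiseConv2d(x, w, stride, pad);
  result.print();
}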

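The output is:

Tensor
    [[[[12], [15]],
      [[7 ], [22]]]]

The figure below (in order: input data, filter, output) shows the data involved in the convolution that produces the output 12: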
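
As a cross-check that does not need TensorFlow.js, the following minimal sketch recomputes the four outputs in plain JavaScript. The helper name conv2dValid is hypothetical (it is not a TensorFlow.js API), and it assumes a single channel, stride 1, and 'valid' padding.

```javascript
// Hypothetical helper: single-channel 2D convolution (computed as a
// sliding dot product, without flipping the filter), stride 1, 'valid' padding.
function conv2dValid(x, w) {
  const outH = x.length - w.length + 1;
  const outW = x[0].length - w[0].length + 1;
  const out = [];
  for (let r = 0; r < outH; r++) {
    const row = [];
    for (let c = 0; c < outW; c++) {
      let sum = 0;
      for (let i = 0; i < w.length; i++) {
        for (let j = 0; j < w[0].length; j++) {
          sum += x[r + i][c + j] * w[i][j];
        }
      }
      row.push(sum);
    }
    out.push(row);
  }
  return out;
}

const image = [[0, 1, 2], [5, 0, 5], [0, 1, 5]];
const kernel = [[1, 2], [2, 2]];
// Top-left window: 0*1 + 1*2 + 5*2 + 0*2 = 12
console.log(conv2dValid(image, kernel)); // [[12, 15], [7, 22]]
```

The top-left window covers the input values 0, 1, 5, 0, which is where the output 12 comes from.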

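Two-channel convolution

The next case has depth 2 (NHWC). The input has two channels (call them red and yellow), each of size 2x2, so the input shape is 1x2x2x2. The convolution provides two windows (again red and yellow), and each window extracts the features of one channel. The filter shape is 2x2x2x1, so the output has two channels. The padding mode is valid, so each channel's output is 1x1, and the output shape is 1x1x1x2.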

/*
[[[[6, 2]]]]
*/
async function depthwiseConv2dTestMultipleDepth3() {
  const fSize = 2;
  const pad = 'valid';
  const stride = 1;
  // const dilation = 2;
  const inDepth = 2;
  const x = tf.tensor4d([0, 1, 3, 1, 0, 2, 2, 1], [1, fSize, fSize, inDepth]);
  // Stack two 2x2 per-channel filters along axis 2, then add the
  // channel-multiplier axis to get shape [2, 2, 2, 1].
  const w = tf.stack(
      [
        tf.tensor2d([2, 0, 1, 3], [fSize, fSize]),
        tf.tensor2d([1, 0, 0, 1], [fSize, fSize]),
      ],
      2).expandDims(3);
  const result = tf.depthwiseConv2d(x, w, stride, pad);
  result.print();
}

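The input data is:

  [0, 1, 3, 1, 0, 2, 2, 1],

However, because the depth is 2 and the data format is NHWC (the default), the input is read as channel 1, channel 2, channel 1, channel 2, and so on. Viewed channel by channel, the data is:

Channel 1: 0, 3, 0, 2;
Channel 2: 1, 1, 2, 1.

The filter w, on the other hand, is defined channel by channel. So for the filter:

Channel 1: 2, 0, 1, 3;
Channel 2: 1, 0, 0, 1.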
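
The per-channel view above can be reproduced with a short plain-JavaScript sketch. The helpers splitChannels and dot are hypothetical names introduced here for illustration; they only de-interleave a channels-last array and take a dot product per channel.

```javascript
// Hypothetical helper: split an interleaved (channels-last) flat array
// into one array per channel.
function splitChannels(flat, depth) {
  const channels = Array.from({ length: depth }, () => []);
  flat.forEach((v, i) => channels[i % depth].push(v));
  return channels;
}

// Dot product of two equally long arrays.
const dot = (a, b) => a.reduce((s, v, i) => s + v * b[i], 0);

const xChannels = splitChannels([0, 1, 3, 1, 0, 2, 2, 1], 2);
console.log(xChannels); // [[0, 3, 0, 2], [1, 1, 2, 1]]

// Per-channel filters as listed above.
const wChannels = [[2, 0, 1, 3], [1, 0, 0, 1]];
// With a 2x2 input, a 2x2 filter, and 'valid' padding there is a single
// window, so each channel's output is one dot product.
const perChannelOut = xChannels.map((ch, k) => dot(ch, wChannels[k]));
console.log(perChannelOut); // [6, 2]
```

The result [6, 2] matches the output noted in the comment of the test function.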

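Note: although depthwise separable convolution computes the convolution one channel at a time internally, the input to the depthwiseConv2d operator is the data of all channels.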

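Stacked or not

A filter defined with stack (fSize = 2):

  const w = tf.stack(
      [
        tf.tensor2d([2, 0, 1, 3], [fSize, fSize]),
        tf.tensor2d([1, 0, 0, 1], [fSize, fSize]),
      ],
      2).expandDims(3);

The two per-channel filters parsed from it are channel 1: 2, 0, 1, 3 and channel 2: 1, 0, 0, 1.

It follows that with stack the data is written channel by channel: all of channel 1's data, then all of channel 2's data.

A filter defined without stack:

  const fSize = 2;
  const pad = 'valid';
  const stride = 1;
  const chMul = 1;
  const inDepth = 2;
  const w = tf.tensor4d([2, 0, 1, 3, 1, 0, 0, 1], [fSize, fSize, inDepth, chMul]);

If the data is not built with stack, it is still read in channels-last order, as with NHWC input: channel 1, channel 2, channel 1, channel 2, with the channels interleaved. The flat array above is therefore read as channel 1: 2, 1, 1, 0 and channel 2: 0, 3, 0, 1.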
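
The difference between the two definitions can be illustrated in plain JavaScript. This is a sketch under the assumption that stacking along the channel axis yields the interleaved element order; interleave and deinterleave are hypothetical helper names, not TensorFlow.js functions.

```javascript
// Hypothetical helper: interleave per-channel arrays into one flat array,
// the element order that results from stacking along the channel axis.
function interleave(channels) {
  const flat = [];
  for (let i = 0; i < channels[0].length; i++) {
    for (const ch of channels) flat.push(ch[i]);
  }
  return flat;
}

// Hypothetical helper: the inverse, splitting a flat array by channel.
function deinterleave(flat, depth) {
  const channels = Array.from({ length: depth }, () => []);
  flat.forEach((v, i) => channels[i % depth].push(v));
  return channels;
}

// Stack-style definition: one array per channel.
const stackedFlat = interleave([[2, 0, 1, 3], [1, 0, 0, 1]]);
console.log(stackedFlat); // [2, 1, 0, 0, 1, 0, 3, 1]

// Flat definition with shape [2, 2, 2, 1]: the same numbers read interleaved.
const flatChannels = deinterleave([2, 0, 1, 3, 1, 0, 0, 1], 2);
console.log(flatChannels); // [[2, 1, 1, 0], [0, 3, 0, 1]]
```

So the same eight numbers describe different filters depending on whether they are supplied per channel (stack) or as one flat array.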

Test code:

async function testDepthwiseConv2dChannelmul1() {
  const fSize = 2;
  const pad = 'valid';
  const stride = 1;
  const chMul = 1;
  const inDepth = 2;
  const x = tf.tensor4d(
      [0, 1, 3, 1, 0, 3, 1, 0, 1, 2, 1, 2, 0, 2, 0, 0, 1, 1],
      [1, 3, 3, inDepth]);
  const w = tf.tensor4d([2, 0, 1, 3, 1, 0, 0, 1], [fSize, fSize, inDepth, chMul]);
  const result = tf.depthwiseConv2d(x, w, stride, pad);
  result.print();
}
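
To predict what this test prints, here is a minimal reference implementation in plain JavaScript. It is a sketch, not TensorFlow.js code: depthwiseConv2dRef is a hypothetical name, and it assumes batch size 1, 'valid' padding, channels-last input, and a filter laid out as [fH, fW, inDepth, chMul].

```javascript
// Reference depthwise convolution on flat arrays (illustrative sketch).
// x: channels-last data, shape [1, inH, inW, inDepth].
// w: filter data, shape [fH, fW, inDepth, chMul]. Padding is 'valid'.
function depthwiseConv2dRef(x, [inH, inW, inDepth], w, [fH, fW, chMul], stride) {
  const outH = Math.floor((inH - fH) / stride) + 1;
  const outW = Math.floor((inW - fW) / stride) + 1;
  const out = [];
  for (let r = 0; r < outH; r++) {
    for (let c = 0; c < outW; c++) {
      for (let d = 0; d < inDepth; d++) {
        for (let m = 0; m < chMul; m++) {
          let sum = 0;
          for (let i = 0; i < fH; i++) {
            for (let j = 0; j < fW; j++) {
              const xv = x[((r * stride + i) * inW + (c * stride + j)) * inDepth + d];
              const wv = w[((i * fW + j) * inDepth + d) * chMul + m];
              sum += xv * wv;
            }
          }
          out.push(sum); // output channel index is d * chMul + m
        }
      }
    }
  }
  return out; // flat, shape [1, outH, outW, inDepth * chMul]
}

const xData = [0, 1, 3, 1, 0, 3, 1, 0, 1, 2, 1, 2, 0, 2, 0, 0, 1, 1];
const wData = [2, 0, 1, 3, 1, 0, 0, 1];
const refOut = depthwiseConv2dRef(xData, [3, 3, 2], wData, [2, 2, 1], 1);
console.log(refOut); // [4, 5, 7, 11, 3, 6, 3, 7]
```

Read as a 1x2x2x2 tensor, the result is channel 1: 4, 7, 3, 3 and channel 2: 5, 11, 6, 7.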

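A more complex example

Input data
Here is another example. The code in this section stores 4 channels, each holding the values 0 to 24 as a 5x5 grid.
The input data as actually laid out is:
0, 0, 0, 0, 1, 1, 1, 1, 2, 2, 2, 2, 3, 3, 3, 3,
…
23, 23, 23, 23, 24, 24, 24, 24

Expanded logically channel by channel, each channel's input data looks like this (note that every channel holds the same data):

The data as actually stored in the texture looks like this; it is a 4x25 texture:

getX reads the input data from the stored texture data: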

float getX(int row, int col, int depth) {
  int texR = row * 5 + col;  // texture row: flattened spatial position
  int texC = depth;          // texture column: channel
  ivec2 uv = ivec2(texC, texR);
  return sampleTexture(x, uv);
}

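depth is the depth (channel) of the current element. It can be obtained from gl_GlobalInvocationID.x, or from d1 (int d1 = d2 / ${channelMul};). Note that the call below passes a leading batch argument, which the simplified three-argument getX above omits.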

int tempRow = int(gl_GlobalInvocationID.y)/5;
int tempCol = int(gl_GlobalInvocationID.y)- tempRow*5;
dotProd = getX(batch, tempRow, tempCol, gl_GlobalInvocationID.x);
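
The addressing can be emulated in plain JavaScript. This is only a sketch: the texture is modeled as a flat row-major array of 25 rows by 4 columns, which is an assumption made for illustration (real textures are sampled through the GPU API), and decode is a hypothetical helper name.

```javascript
// Texture modeled as a flat array, 25 rows by 4 columns (row-major).
const WIDTH = 5;  // spatial width and height of the input
const DEPTH = 4;  // number of channels, one texture column per channel

// Build the example input: every channel of pixel p holds the value p.
const texture = [];
for (let p = 0; p < WIDTH * WIDTH; p++) {
  for (let d = 0; d < DEPTH; d++) texture.push(p);
}

// Mirrors getX: texture row = row * 5 + col, texture column = depth.
function getX(row, col, depth) {
  const texR = row * WIDTH + col;
  const texC = depth;
  return texture[texR * DEPTH + texC];
}

// Mirrors the tempRow/tempCol decoding of a flat position y in 0..24.
function decode(y) {
  const tempRow = Math.floor(y / WIDTH);
  const tempCol = y - tempRow * WIDTH;
  return [tempRow, tempCol];
}

console.log(decode(13));     // [2, 3]
console.log(getX(2, 3, 1));  // 13
```

Because every channel holds the same data, getX(2, 3, d) returns 13 for any depth d from 0 to 3.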

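Filter
The filter is defined in stack mode: its data is written one channel after another rather than interleaved, so the per-channel logical layout matches the layout of the data as written:
[1,0,0,0,0,0,0,0,1], // channel 1
[0,0,0,1,0,0,0,0,0], // channel 2
[0,0,0,0,0,0,0,0,1], // channel 3
[1,0,0,0,0,0,0,0,0], // channel 4

Note: the stride is 2.
Input shape: 1 5 5 4
Filter shape: 3 3 4 1
Output shape: 1 2 2 4.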
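
The output height and width follow from the usual size formula for 'valid' padding, floor((in - filter) / stride) + 1. A minimal sketch (validOutSize is a hypothetical helper name):

```javascript
// Output size along one spatial dimension for 'valid' padding.
function validOutSize(inSize, filterSize, stride) {
  return Math.floor((inSize - filterSize) / stride) + 1;
}

console.log(validOutSize(5, 3, 2)); // 2: the 5x5 input here gives a 2x2 output
console.log(validOutSize(3, 2, 1)); // 2: the earlier 3x3 input with a 2x2 filter
console.log(validOutSize(2, 2, 1)); // 1: the earlier 2x2 input with a 2x2 filter
```

With a channel multiplier of 1, the output depth equals the input depth, which gives 1 2 2 4.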

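For a compute shader, the shape of the output determines the number of output threads.
Output coordinates
The code that computes the output UV coordinates is: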

    ivec4 getOutputCoords() {
      int index = int(gl_GlobalInvocationID.y) * 4 + int(gl_GlobalInvocationID.x);
      int r = index / 16; index -= r * 16;
      int c = index / 8; index -= c * 8;
      int d = index / 4;
      int d2 = index - d * 4;
      return ivec4(r, c, d, d2);
    }

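In the code, the 4 in "int(gl_GlobalInvocationID.y) * 4" corresponds to the depth of the input.
Given the input data layout discussed earlier, gl_GlobalInvocationID.y corresponds to the position index 0 to 24 (row * 5 + col), and gl_GlobalInvocationID.x corresponds to the depth, 0 to 3.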

gl_GlobalInvocationID = gl_WorkGroupID * gl_WorkGroupSize + gl_LocalInvocationID;
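
To see what getOutputCoords returns, here is a plain-JavaScript port. It is a sketch in which gx and gy stand in for gl_GlobalInvocationID.x and gl_GlobalInvocationID.y. The divisors 16, 8, and 4 are the row-major strides of the output shape 1 2 2 4, so one way to read the four components is as the index along each of the four output dimensions.

```javascript
// JavaScript port of getOutputCoords for output shape [1, 2, 2, 4].
// gx and gy stand in for gl_GlobalInvocationID.x and .y.
function getOutputCoords(gx, gy) {
  let index = gy * 4 + gx;
  const r = Math.floor(index / 16); index -= r * 16;
  const c = Math.floor(index / 8); index -= c * 8;
  const d = Math.floor(index / 4);
  const d2 = index - d * 4;
  return [r, c, d, d2];
}

console.log(getOutputCoords(0, 0)); // [0, 0, 0, 0]
console.log(getOutputCoords(2, 3)); // [0, 1, 1, 2]
```

For gx = 2 and gy = 3 the flat index is 14, which decodes to 0, 1, 1, 2.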

It follows that the output for channel 0 is: 12, 16, 32, 36

// size = 5 reproduces the 5x5 example above.
async function depthwiseConv2dTestArrayPad0(size) {
  const fSize = 3;
  const pad = 'valid';
  const stride = 2;
  // const dilation = 2;
  const inDepth = 4;
  const SQRT = size;
  const ARRAY_SIZE = SQRT * SQRT;
  const array1 = [];
  const arrayd0 = [];
  const arrayd1 = [];
  const arrayd2 = [];
  const arrayd3 = [];
  // Interleaved (NHWC) input: every channel of pixel i holds the value i.
  for (let i = 0; i < ARRAY_SIZE; i++) {
    for (let j = 0; j < inDepth; j++) {
      array1[i * inDepth + j] = i;
    }
  }
  console.log(array1);
  // Split the interleaved data into per-channel arrays for inspection.
  let j = 0;
  for (let i = 0; i < ARRAY_SIZE * inDepth; i += inDepth) {
    arrayd0[j] = array1[i];
    arrayd1[j] = array1[i + 1];
    arrayd2[j] = array1[i + 2];
    arrayd3[j] = array1[i + 3];
    j++;
  }
  console.log(arrayd0);
  console.log(arrayd1);
  console.log(arrayd2);
  console.log(arrayd3);
  const x = tf.tensor4d(array1, [1, SQRT, SQRT, inDepth]);
  // One 3x3 filter per channel, stacked along axis 2: shape [3, 3, 4, 1].
  const w = tf.stack(
      [
        tf.tensor2d([1, 0, 0, 0, 0, 0, 0, 0, 1], [fSize, fSize]),
        tf.tensor2d([0, 0, 0, 1, 0, 0, 0, 0, 0], [fSize, fSize]),
        tf.tensor2d([0, 0, 0, 0, 0, 0, 0, 0, 1], [fSize, fSize]),
        tf.tensor2d([1, 0, 0, 0, 0, 0, 0, 0, 0], [fSize, fSize]),
      ],
      2).expandDims(3);
  const result = tf.depthwiseConv2d(x, w, stride, pad);
  result.print();
}
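
The channel 0 result, and the other three channels, can be worked out by hand because each filter has only one or two nonzero taps. The sketch below does this in plain JavaScript for the 5x5 case; the names at and taps are hypothetical, and the values for channels 1 to 3 are computed by this sketch rather than quoted from TensorFlow.js output.

```javascript
// Input value at (row, col) is 5 * row + col, identical in all channels.
const at = (row, col) => 5 * row + col;

// Nonzero filter taps per channel as [rowOffset, colOffset], read from the
// four 3x3 filters above.
const taps = [
  [[0, 0], [2, 2]], // channel 0: top-left and bottom-right
  [[1, 0]],         // channel 1: middle-left
  [[2, 2]],         // channel 2: bottom-right
  [[0, 0]],         // channel 3: top-left
];

const step = 2; // the stride
const outputs = taps.map((channelTaps) => {
  const values = [];
  for (let r = 0; r < 2; r++) {
    for (let c = 0; c < 2; c++) {
      let sum = 0;
      for (const [dr, dc] of channelTaps) sum += at(r * step + dr, c * step + dc);
      values.push(sum);
    }
  }
  return values;
});
console.log(outputs);
// [[12, 16, 32, 36], [5, 7, 15, 17], [12, 14, 22, 24], [0, 2, 10, 12]]
```

For channel 0, the first output is at(0, 0) + at(2, 2) = 0 + 12 = 12, and the last is at(2, 2) + at(4, 4) = 12 + 24 = 36.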

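Channel Multiplier

This parameter is somewhat hard to understand. Fortunately, in MobileNet it is 1. Even so, this section explains what it means.
Consider why the following code produces the output [7, 0, 5, 6].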

// [[[[7, 0, 5, 6]]]]
async function testDepthwiseConv2dChannelmul2_d2() {
  const fSize = 2;
  const pad = 'valid';
  const stride = 1;
  const chMul = 2;
  const inDepth = 2;
  const x = tf.tensor4d([0, 1, 3, 1, 0, 1, 2, 1], [1, 2, 2, inDepth]);
  const w = tf.tensor4d(
      [2, 0, 1, 3, 1, 0, 0, 1, 0, 1, 2, 1, 2, 0, 2, 1],
      [fSize, fSize, inDepth, chMul]);
  const result = tf.depthwiseConv2d(x, w, stride, pad);
  result.print();
}

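Rules for the channel multiplier:

The channel multiplier affects how the filter data is organized. For example, with inDepth = 2 and chMul = 2, the filter data must be read as each input channel appearing twice, for 4 filter channels in total, in the order CH0, CH0, CH1, CH1. The output therefore has inDepth * chMul = 4 channels.
It has no effect on the input data. It only changes the data organization of the filter tensor.

With these two rules, the data layout of the input and the filter follows, as in the table below, and the corresponding output is then easy to derive: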
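
The two rules can be checked with a short plain-JavaScript sketch that applies them to the data from the test function. It assumes the filter is indexed as [position, input channel, multiplier index] and that output channel c * chMul + m pairs input channel c with its m-th filter; the variable names are for illustration only.

```javascript
const inDepth = 2;
const chMul = 2;
// Input, shape [1, 2, 2, inDepth]; filter, shape [2, 2, inDepth, chMul].
const x = [0, 1, 3, 1, 0, 1, 2, 1];
const w = [2, 0, 1, 3, 1, 0, 0, 1, 0, 1, 2, 1, 2, 0, 2, 1];

// The 2x2 filter covers the whole 2x2 input, so there is one window and
// each output channel is a single sum over the 4 spatial positions p.
const out = new Array(inDepth * chMul).fill(0);
for (let p = 0; p < 4; p++) {
  for (let c = 0; c < inDepth; c++) {
    for (let m = 0; m < chMul; m++) {
      out[c * chMul + m] += x[p * inDepth + c] * w[(p * inDepth + c) * chMul + m];
    }
  }
}
console.log(out); // [7, 0, 5, 6]
```

Here the input channels are CH0: 0, 3, 0, 2 and CH1: 1, 1, 1, 1, and the four filters are 2, 1, 0, 2 and 0, 0, 1, 0 (both applied to CH0) and 1, 0, 2, 2 and 3, 1, 1, 1 (both applied to CH1).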

The following articles also discuss the depth multiplier:
https://towardsdatascience.com/a-basic-introduction-to-separable-convolutions-b99ec3102728
https://towardsdatascience.com/review-mobilenetv1-depthwise-separable-convolution-light-weight-model-a382df364b69