fst 共享后缀_FST源代码解读2—

本篇博客介绍FST的生成。它使用的是生成器模式，因为FST的生成太复杂了，所以必须使用生成器模式，他的类是：org.apache.lucene.util.fst.Builder.Builder(INPUT_TYPE, int, int, boolean, boolean, int, Outputs, boolean, float, boolean, int)，代码如下：

/**

inputType 表示的是输入的类型，FST规定只能输入数字类型，字符串的输入也要转化为数字，这里是判断字符串的范围，不同的INPUT_TYPE有不同的范围

minSuffixCount1：这个说的是穿越某个节点的边的数量的下限，如果小于这个值，则要删除这个节点，在lucene中是0，所以永远不会删除，不用管这个参数就可以。

minSuffixCount2：不用管这个参数，是0，在代码中不起作用

doShareSuffix：最后编译节点进入FST的时候，要不要共享后缀，

doShareNonSingletonNodes：当编译某一个节点的时候，如果这个节点是多个term都穿过的，是否要共享此节点，如果不共享，则直接编译入fst中，否则要放入一个去重的对象中，让其他的节点共享这个节点。

shareMaxTailLength：当编译进fst时要共享后缀的额时候，最多共享多少个

outputs：输出的类型

doPackFST：对最终生成的FST，要不要继续压缩(后面会专门将压缩的过程)

acceptableOverheadRatio：在使用packedInts的时候使用的参数(参见我描述packedInts的博客：https://www.iteye.com/blog/suichangkele-2427364 有多篇，不一一列出)

allowArrayArcs：在存储一个节点的多个发出的arc的时候，对于某一个arc的存储，是否可以使用fixedArray的格式，后面会有介绍

bytesPageBits：在使用BytesStore对象记录fst编译后的二进制内容时，使用的byte[]的大小

public Builder(FST.INPUT_TYPE inputType, int minSuffixCount1, int minSuffixCount2, boolean doShareSuffix,

boolean doShareNonSingletonNodes, int shareMaxTailLength, Outputs outputs, boolean doPackFST,

float acceptableOverheadRatio, boolean allowArrayArcs, int bytesPageBits) {

this.minSuffixCount1 = minSuffixCount1;

this.minSuffixCount2 = minSuffixCount2;

this.doShareNonSingletonNodes = doShareNonSingletonNodes;

this.shareMaxTailLength = shareMaxTailLength;

this.doPackFST = doPackFST;

this.acceptableOverheadRatio = acceptableOverheadRatio;

this.allowArrayArcs = allowArrayArcs;

fst = new FST<>(inputType, outputs, doPackFST, acceptableOverheadRatio, bytesPageBits);//生成一个fst，

bytes = fst.bytes;

assert bytes != null;

if (doShareSuffix) {//这个就是共享后缀的部分，如果是的话，需要一个单独的对象，用来查找共享后缀，这个对象后面有介绍。

dedupHash = new NodeHash<>(fst, bytes.getReverseReader(false));

} else {

dedupHash = null;

}

NO_OUTPUT = outputs.getNoOutput();//对于没有输出的arc，统一通这个作为输出

@SuppressWarnings({"unchecked" })

final UnCompiledNode[] f = (UnCompiledNode[]) new UnCompiledNode[10];//frontier的数组

frontier = f;

for (int idx = 0; idx < frontier.length; idx++) {

frontier[idx] = new UnCompiledNode<>(this, idx);

}

上面提到了UnCompiledNode这个类，这个类就是构成Frontier的最小单位，在，用于表示没有编译进FST的节点类，看一下这个类：

/** 还没有写入FST的term的节点，保存一个出现的但是还没有写入FST的值
Expert: holds a pending (seen but not yet serialized) Node. */

public static final class UnCompiledNode implements Node {

final Builder owner;

/** 发出的arc边的数量，一个节点上可能有多个arc，如果是最后一个Node，则是0 */

public int numArcs;

/** 由这个节点出发的所有的arc */

public Arc[] arcs;

// TODO: instead of recording isFinal/output on the node, maybe we should use -1 arc to mean "end" (like we do when reading the FST). Would simplify much code here...

/** 节点的输出，有时候节点也是有输出的，就在下面的prependOutput方法中，其实就是以后的finalOutput，以后会有介绍 */

public T output;

/** 到这个节点截止是不是一个完整的term，如果这个节点是final的，则其可能含有output，后面会解释 */

public boolean isFinal;

/** 有多少个线路走过这个节点，无论是不是最后一个，都算是经过，第一个node的这个属性表示一共多少个term */

public long inputCount;

/** 节点在term中的偏移量(下标)，第一个是不用的，是root
This node's depth, starting from the automaton root. */

public final int depth;

/** @param depth The node's depth starting from the automaton root. Needed for LUCENE-2934 (node expansion based on conditions other than the fanout size). */

@SuppressWarnings({"unchecked" })

public UnCompiledNode(Builder owner, int depth) {

this.owner = owner;

arcs = (Arc[]) new Arc[1];

arcs[0] = new Arc<>();

output = owner.NO_OUTPUT;

this.depth = depth;

}

@Override

public boolean isCompiled() {

return false;

}

/** 清除，当把这个节点编译进fst之后调用，因为他已经进入FST了，所以留着也没用了。 */

public void clear() {

numArcs = 0;

isFinal = false;

output = owner.NO_OUTPUT;

inputCount = 0;

// We don't clear the depth here because it never changes for nodes on the frontier (even when reused).

}

/** 获得最后添加的一个arc的输出值 */

public T getLastOutput(int labelToMatch) {

assert numArcs > 0;

assert arcs[numArcs - 1].label == labelToMatch;

return arcs[numArcs - 1].output;

}

/**

* 添加一个arc，到下一个节点，此时并没有添加输出，输出暂定为没有输出，并且设置为不是final

* @param labelarc上的值(不是输出)

* @param target 链接的下一个节点

public void addArc(int label, Node target) {

assert label >= 0;

assert numArcs == 0 || label > arcs[numArcs - 1].label : "arc[-1].label=" + arcs[numArcs - 1].label+ " new label=" + label + " numArcs=" + numArcs;

if (numArcs == arcs.length) {//扩容

final Arc[] newArcs = ArrayUtil.grow(arcs, numArcs + 1);

for (int arcIdx = numArcs; arcIdx < newArcs.length; arcIdx++) {

newArcs[arcIdx] = new Arc<>();

}

arcs = newArcs;

}

final Arc arc = arcs[numArcs++];

arc.label = label;

arc.target = target;

arc.output = arc.nextFinalOutput = owner.NO_OUTPUT;//输出暂定为没有输出

arc.isFinal = false;

}

/** 将某一个节点编译进FST之后，需要将他在frontier中的前一个节点指向这个新的编译的节点，就是用的这个方法，target就是fst中的节点，这里的nextFinalOutput就是刚刚编译的节点output，在下面有介绍，isFinal表示刚刚编译进入FST的节点是不是final节点，如果是的话，则刚刚添加的arc也是final的 */

public void replaceLast(int labelToMatch, Node target /*已经编译进fst的节点*/ , T nextFinalOutput, boolean isFinal) {

assert numArcs > 0;

final Arc arc = arcs[numArcs - 1];

assert arc.label == labelToMatch : "arc.label=" + arc.label + " vs " + labelToMatch;

arc.target = target;//指向新的节点

// 在替换fst中的节点的时候，要把nextFinalOutput设置到arc中，因为fst中的节点是没有nextFinalOutput的

arc.nextFinalOutput = nextFinalOutput;//刚刚编译进fst的节点上的输出(也就是Node上也有输出，这种是前一个arc是final，没有后续的arc了，只能将多余的输出写在节点上)

arc.isFinal = isFinal;

}

上面的replaceLast方法中的nextFinalOutput是这个意思：假设之前的term是a，他的输出是10，现在又来了一个term：ab，输出是5，如果一个arc上只有一个output的输出，因为a和ab是共享前缀a的，但是a上已经有输出10了，10>5，那么只能让b输出-5，在这个例子中我们是可以把-5移动到label是b的arc上的，因为这里还没有已经编译过的arc(这个例子不是很好)，但是如果遇到存在已经编译过的情况下，就会失败，因为编译过的arc上的值是不允许改变的，所以就增加了这么一个nextFinalOutput，即a也输出5，但是在a指向的节点也添加一个输出，就是这个nextFinalOutput，也是5，这个节点的输出只能用在这个节点之前的arc中，而不影响后面的b的输出，也就是他不是对所有经过这个节点的term起作用，仅仅是经过他之前的term起作用，这样a的arc输出是5，a的节点还有个专属于a的输出也是5，所以a的输出还是10，b的节点没有输出，ab的总输出还是5(仅仅计算a的输出，而不计算a指向的节点的输出)，这就是nextFinalOutput的含义。上面的prependOutput方法就涉及到这个，当新来一个ab时，计算两者的共享的前缀a的arc的输出是5，那么a的arc剩余是5，这个剩余的5就作为参数调用这个方法，是a指向的节点调用的这个方法，进入这个方法以后，第一个for循环会把5追加到b的arc上(有的人看到这里会觉得这不是乱了吗，b上的是0才对啊，别急，在真正添加代码的时候还有一步会纠正的，最后并不是5，我要强调的是下面的if(isFinal)部分)，我们看第二个if(isFinal)部分，对于a来说，a指向的节点就是final的，也就是说a的term在这个node上已经没有了后续的arc，多余的输出5只能写在节点上，这就是这里的output = owner.fst.outputs.add(outputPrefix, output)的解释，output也就是介绍的nextFinalOutput 。nextFinalOutput在frontier中是体现在node上的，当这个节点编译到fst中的时候，就会转移到指向它的arc上，在这个例子中就是label是a的arc上。

了解了两个基本的类Node和arc后，下面开始看真正的添加term构成fst的方法：

/**

* 添加一个输入，先计算共享前缀，把共享前缀以后的都冷冻，编译进FST，然后写入新的后缀到Frontier，最后重新计算每个arc的输出。

* Add the next input/output pair. The provided input must be sorted after the previous one according to {@link IntsRef#compareTo}.

* It's also OK to add the same input twice in a row with different outputs, as long as {@link Outputs} implements the {@link Outputs#merge} method.

* Note that input is fully consumed after this method is returned (so caller is free to reuse), but output is not.

* So if your outputs are changeable (eg {@link ByteSequenceOutputs} or {@link IntSequenceOutputs}) then you cannot reuse across calls.

public void add(IntsRef input, T output) throws IOException {

// De-dup NO_OUTPUT since it must be a singleton: 所有的没有输出的输入的输出必须是同一个值，NO_OUTPUT

if (output.equals(NO_OUTPUT)) {

output = NO_OUTPUT;

}

assert lastInput.length() == 0 || input.compareTo(lastInput.get()) >= 0 : "inputs are added out of order lastInput=" + lastInput.get() + " vs input=" + input;

assert validOutput(output);

if (input.length == 0) {//TODO 还没看

// empty input: only allowed as first input. we have to special case this because the packed FST

// format cannot represent the empty input since 'finalness' is stored on the incoming arc, not on the node

frontier[0].inputCount++;

frontier[0].isFinal = true;

fst.setEmptyOutput(output);

return;

}

// compare shared prefix length

int pos1 = 0;// pos表示共享前缀的个数

int pos2 = input.offset;

final int pos1Stop = Math.min(lastInput.length(), input.length);

while (true) {

frontier[pos1].inputCount++;// 通过这个节点的term又多了一个

if (pos1 >= pos1Stop || lastInput.intAt(pos1) != input.ints[pos2]) {//一直到一个不同的结束

break;

}

pos1++;

pos2++;

}

//input和lastInput具有公共前缀的节点的数量，第一个是root，所以要加一，prefixLenPlus1下标的这个节点包括后续的节点都要冷冻，也就是进入FST

final int prefixLenPlus1 = pos1 + 1;

if (frontier.length < input.length + 1) {//扩容frontier数组

final UnCompiledNode[] next = ArrayUtil.grow(frontier, input.length + 1);// 创建一个新的数组

for (int idx = frontier.length; idx < next.length; idx++) {

next[idx] = new UnCompiledNode<>(this, idx);

}

frontier = next;

}

// minimize/compile states from previous input's orphan'd suffix 将上一个添加的input的后面几个node进入FST，从prefixLenPlus1下标开始(之前的节点仍然作为frontier)

freezeTail(prefixLenPlus1);

// init tail states for current input 将新插入的input的剩余部分写入frontier，注意是从共享前缀以后开始写的，共享前缀的部分不写，此时共享前缀后面的已经进入FST了。写入时是没有输出的，输出的在下面判定

for (int idx = prefixLenPlus1; idx <= input.length; idx++) { //一直到idx=input.length，表示结束的点(节点比插入的input的长度多一个)

frontier[idx - 1].addArc(input.ints[input.offset + idx - 1], frontier[idx]);//添加一个arc，到下一个节点，此时每个arc转移上的输出都是no_out，输出由下面的代码产生

frontier[idx].inputCount++;//通过下一个节点的路径有多了一个，共享前缀的那些已经在查找共享前缀的时候都添加过了

}

final UnCompiledNode lastNode = frontier[input.length];//结束的节点

// 这个判断是说如果当前添加的这个完全匹配上一个，即两者长度一样，且prefixLenPlus1 == input.length + 1，完全一样的话，不要设置isFinal=true，因为之前这条路径上已经设置过了

if (lastInput.length() != input.length || prefixLenPlus1 != input.length + 1) {

lastNode.isFinal = true;//标识这个点是一个完整的路径

lastNode.output = NO_OUTPUT;//这个点没有输出

}

// push conflicting outputs forward, only as far as needed

for (int idx = 1; idx < prefixLenPlus1; idx++) {//只看共享前缀的部分，从第二个开始，因为第一个是root，即从下标1开始，到最后共享前缀的节点(也就是prefixLenPlus1 - 1个)。操作是修改转移到的节点的输出

final UnCompiledNode node = frontier[idx];

final UnCompiledNode parentNode = frontier[idx - 1];//获得输出必须由parent来获得，从parent上获得

final T lastOutput = parentNode.getLastOutput(input.ints[input.offset + idx - 1]);//获得最后添加的arc转移的输出

assert validOutput(lastOutput);

final T commonOutputPrefix;//公共输出

final T wordSuffix;//现在的输出减去公共输出，挪到下一个输出中

if (lastOutput != NO_OUTPUT) {

commonOutputPrefix = fst.outputs.common(output, lastOutput);//计算输出的交集(也就是多个线路共同使用的公共输出)

assert validOutput(commonOutputPrefix);

wordSuffix = fst.outputs.subtract(lastOutput, commonOutputPrefix);//计算原来的输出中剩余的输出

assert validOutput(wordSuffix);

parentNode.setLastOutput(input.ints[input.offset + idx - 1], commonOutputPrefix);//设置父节点刚刚添加的转移的输出为公共输出

node.prependOutput(wordSuffix);// 将父节点多余的值追加到下一个节点的输出上，这个地方需要额外注意一下。

} else {

commonOutputPrefix = wordSuffix = NO_OUTPUT;

}

output = fst.outputs.subtract(output, commonOutputPrefix);//更新output，然后继续看下一个节点

assert validOutput(output);

}

if (lastInput.length() == input.length && prefixLenPlus1 == 1 + input.length) {//相同的内容，多次添加，将结果merge

// same input more than 1 time in a row, mapping to multiple outputs

lastNode.output = fst.outputs.merge(lastNode.output, output);

} else {//将剩余的output写在新添加的arc上。 (这就是上面写的没有乱的原因，这里会更新的)

// this new arc is private to this new input; set its arc output to the leftover output:

frontier[prefixLenPlus1 - 1].setLastOutput(input.ints[input.offset + prefixLenPlus1 - 1], output);

}

// save last input 保存上一个输入

lastInput.copyInts(input);

}

freezeTail方法就是将非共享前缀的部分写入到FST中，先不看这个也可以很好的理解添加的整个流程。当新来一个term后，和现在的frontier中的节点比较共享前缀，因为term是按照词典顺序写入的，超过共享前缀的一定不会再用到了，所以就进入FST，也就是调用freezeTail方法(怎么进入fst的先不用管)，然后再将term除共享前缀的部分写入frontier，记住此时是没有写输出的，因为输出的确定是需要和共享前缀的部分来共同确定的。在将剩余的term写入到frontier之后，需要计算新的输出，例如之前说的a和ab的为例，先写入a，输出是10，输出写在从根节点出发的lebal是a的arc上，然后再来一个ab，输出是5，此时从根节点触发的lebal是a的arc上的输出和5计算公共输出，也就是交集，也是5，此时将从根节点触发的lebal是a的arc的输出设置为5，然后将剩余的5调用node.prependOutput，写在lebal是b的arc上，同时写在lebal是a的arc指向的node上，因为a已经是final了。最后再将剩余的output(已经是0了)写在lebal是b的arc的输出上(之前是5，这次覆盖，变为0，所以上面说的没有乱)。

再看一下freezeTail方法，也就是如何将一些不再在frontier中的node编译进fst的。上面的a、ab的例子中没有涉及到这个方法，那继续在那个例子上继续举例，假设又来了一个ac，输出是3，先查找共享前缀，是a，因为是按照字典顺序来的，所以后续再添加的term一定会比ac更大，那么ab中的b这个arc以及他的指向的节点就一定不会再被用到了，所以可以编译进fst了，调用下面的方法。调用完了之后将剩余的c写入到frontier中，然后重新计算输出，lebal是a的arc上的输出是5，这次是3，所以交集是3，剩余2作为参数由a的指向的节点调用prependOutput方法，这里两个term，一个是a，一个是ab，(建议回看一下prependOutput方法)将2追加到b的arc中，而a是final的没有arc了，所以只能是在这个节点(lebal是a的arc指向的节点)的output上添加；最后将3中剩余的0，写在lebal是c的arc上(其实就是NO_OUT)，这样frontier就是一个新的了，包含a和c两个arc。下面看看freezeTail方法，即b是如何写入到fst中的。

/** 下标是prefixLenPlus1还有以后的节点，都要冷冻，也就是进入FST */

private void freezeTail(int prefixLenPlus1) throws IOException {

final int downTo = Math.max(1, prefixLenPlus1);//至少是1，因为root是第一个节点，至少从第二个节点开始冷冻，所以下标至少是1。只有在最后冷冻所有节点的时候才会传入0

for (int idx = lastInput.length(); idx >= downTo; idx--) {// 从最后面的一个节点开始向前，这样能查找共同的后缀

boolean doPrune = false;//在lucene中这个都是false

boolean doCompile = false;//在lucene中这个一定是true

final UnCompiledNode node = frontier[idx];//要冷冻的节点

final UnCompiledNode parent = frontier[idx - 1];//要冷冻的节点的父节点

if (node.inputCount < minSuffixCount1) {//之前已经说过minSuffixCount1=0，所以不进入

doPrune = true;

doCompile = true;

} else if (idx > prefixLenPlus1) {//如果不是要冷冻的左边的第一个(等于的话就是左边第一个要冷冻的节点)

// prune if parent's inputCount is less than suffixMinCount2

if (parent.inputCount < minSuffixCount2|| (minSuffixCount2 == 1 && parent.inputCount == 1 && idx > 1)) {//minSuffixCount=0，所以不进入

// my parent, about to be compiled, doesn't make the cut, so I'm definitely pruned

// if minSuffixCount2 is 1, we keep only up until the 'distinguished edge', ie we keep only the

// 'divergent' part of the FST. if my parent, about to be compiled, has inputCount 1 then we are already past the

// distinguished edge. NOTE: this only works if the FST outputs are not "compressible" (simple ords ARE compressible).

doPrune = true;

} else {

// my parent, about to be compiled, does make the cut, so I'm definitely not pruned

doPrune = false;

}

doCompile = true;

} else {

// if pruning is disabled (count is 0) we can always compile current node

doCompile = minSuffixCount2 == 0;//就是 0

}

if (node.inputCount < minSuffixCount2 || (minSuffixCount2 == 1 && node.inputCount == 1 && idx > 1)) {//不进入，在lucene中minSuffixCount2是0

// drop all arcs

for (int arcIdx = 0; arcIdx < node.numArcs; arcIdx++) {

@SuppressWarnings({"unchecked" })

final UnCompiledNode target = (UnCompiledNode) node.arcs[arcIdx].target;

target.clear();

}

node.numArcs = 0;

}

if (doPrune) {//不进入

// this node doesn't make it -- deref it

node.clear();

parent.deleteLast(lastInput.intAt(idx - 1), node);

} else {

if (minSuffixCount2 != 0) {

compileAllTargets(node, lastInput.length() - idx);

}

final T nextFinalOutput = node.output;//要冷冻的节点的输出，也就是在他位置结束的term的finalOutput，这个必须单独拿出来，因为在已经编译到fst的节点中是没有nextFinalOutput的，所以要将这个nextFinalOutput转移到前面的arc中去

// We "fake" the node as being final if it has no outgoing arcs; in theory we could leave it

// as non-final (the FST can represent this), but FSTEnum, Util, etc., have trouble w/ non-final dead-end states:

final boolean isFinal = node.isFinal || node.numArcs == 0;

if (doCompile) {

// this node makes it and we now compile it. first, compile any targets that were previously undecided:

// 先对node进行编译，进入fst，返回一个fst中的节点(编译过的节点)，然后再用父节点的arc指向这个新的编译过的节点(使用的父节点的arc一定是最后一个，如果不是最后一个早就进入fst了)

parent.replaceLast(lastInput.intAt(idx - 1), compileNode(node, 1 + lastInput.length() - idx), nextFinalOutput, isFinal);

} else {

// replaceLast just to install

// nextFinalOutput/isFinal onto the arc

parent.replaceLast(lastInput.intAt(idx - 1), node, nextFinalOutput, isFinal);

// this node will stay in play for now, since we are

// undecided on whether to prune it. later, it

// will be either compiled or pruned, so we must

// allocate a new node:

frontier[idx] = new UnCompiledNode<>(this, idx);

}

里面有一个compileNode方法，就是讲一个节点写入到fst中，其实就是编译这个节点为二进制，也就是byte[]，编译后返回一个编译后的节点，然后调用父节点通过刚刚添加的arc指向这个编译后的节点。如下：

/**

* 将一个节点编译进FST

* @param nodeIn要编译的节点

* @param tailLength共享后缀的长度，计算的时候包括这个节点，如果太长了，就不查找共享后缀了

private CompiledNode compileNode(UnCompiledNode nodeIn, int tailLength) throws IOException {

final long node;//写入fst时的返回值

long bytesPosStart = bytes.getPosition();//bytes就是保存编译后的byte[]的对象，

if (dedupHash != null && (doShareNonSingletonNodes /*配置多个路线穿过的节点要共享*/ || nodeIn.numArcs <= 1/*单个的当然更要共享*/) && tailLength <= shareMaxTailLength /* 最大的共享后缀的长度允许 */ ) {

if (nodeIn.numArcs == 0) {//是最后一个节点直接写入FST(因为此时没法判断是否是重复的，因为判断重复是根据arc判断的，没有arc没法判断)

node = fst.addNode(this, nodeIn);//写入fst，这种没有arc的，添加进fst后的返回值都是一样的，这样就可以认为所有的没有arc的节点是同一个节点。

lastFrozenNode = node;//这个也是一个很巧妙的，如果两个节点挨着，那么在fst中就不用记录一个arc的target node了，直接用当前的node减一即可。

} else {//不是最后一个节点，要去重，因为可能和之前的路径是共享后缀的

node = dedupHash.add(this, nodeIn);//用来去重，不过这个也会加入到FST中的

}

} else {//不去重的逻辑，比如共享后缀的多了，此时直接进入fst

node = fst.addNode(this, nodeIn);

}

assert node != -2;

long bytesPosEnd = bytes.getPosition();

if (bytesPosEnd != bytesPosStart) {//不相等，说明添加了一个新的节点，表示这个节点没有被去重，所以最后freeze的节点就是这个。

// The FST added a new node:

assert bytesPosEnd > bytesPosStart;

lastFrozenNode = node;

}

nodeIn.clear();//添加进fst之后，清楚所有的arc

final CompiledNode fn = new CompiledNode();

fn.node = node;

return fn;

}

里面出现了一个dedupHash(类型是NodeHash)，这个是用来给编译后的节点去重的，在编译进FST的时候，可能会出现重复的，举例：第一个term是abc，第二个term是bbc，第三个term是c，当第三个term来的时候，abc中的b、c两个arc已经编译进入fst了，a、b的指向节点已经存在dedupHash中了，frontier中是bbc的节点，此时需要编译bbc中的b的arc和c的arc，从后面按开始，当编译c时，发现是最后一个，也就是nodeIn.numArcs==0，直接进入fst，fst对于这种结束的节点返回的都是一个小于1的常数值(可以将其视为是同一个结束的节点)，其实这样abc的c和bbc的c都指向了同一个节点，也就是结束的节点，这样abc中b的指向节点和bbc的第二个b的指向节点其实就一样了，因为他俩的发出的arc都是只有一个，且label都是c，且都是指向的结束节点，所有的特征都一样，所以就可以将两个重合了，所以bbc中第二个b就可以指向abc中的b的指向节点，同样的逻辑a指向的节点和bbc中第一个b指向的节点也是一样的，所以b和c这两个arc完全可以合并，最终形成的结果是从root发出两个arc，一个的lebal是a，一个的lebal是b，都指向同一个节点，然后这个节点发出一个b指向下一个，下一个发出一个c指向最后一个。在代码中就是体现在deduphash.add(this,nodeIn)方法中，如果两个节点完全一样，则会返回之前的节点的node(其实就是一个编号)，这就是去重或者叫做共享后缀的实现。deduphash的代码稍后写。经过这个步骤，frontier中不再共享的节点就进入了FST中，且是实现了共享后缀。

node编译进fst的代码在下一个博客中介绍，这里再次回到当前添加的过程，添加到fst中的节点会返回一个数字，这个数字其实就是编译后的节点的编号(后续会介绍通过这个编号就可以查找到这个编译后的节点在bytes中的开始位置)，通过这个编号就可以获得未编译之前的整个term了，或者说完整的arc路径。

最后介绍一下NodeHash的代码，他其实就是一个hashmap，只不过他计算hash值的时候考虑的因素很多，如下：

/**

* 写入一个NODE，如果是第一次写入，则写入fst，否则不写入而是返回原来写入fst时的返回值。

* @param builder

* @param nodeIn

public long add(Builder builder, Builder.UnCompiledNode nodeIn) throws IOException {

final long h = hash(nodeIn);//计算hash值

long pos = h & mask;//通过hash值获得桶位

int c = 0;

while (true) {

final long v = table.get(pos);//这个桶位对应的值，如果是0表示第一次写入

if (v == 0) {// 第一次写入，不是去重

// freeze & add

final long node = fst.addNode(builder, nodeIn);//写入到fst中

// System.out.println(" now freeze node=" + node);

assert hash(node) == h : "frozenHash=" + hash(node) + " vs h=" + h;

count++;

table.set(pos, node);//写入真正的hash表，key是桶位，value是fst返回的值，table其实就是一个mutable(packedints部分的，可以将其视为一个大的数字数组)

// Rehash at 2/3 occupancy:

if (count > 2 * table.size() / 3) {

rehash();

}

return node;

} else if (nodesEqual(nodeIn, v)) {// 不是第一次且是相同的(没有hash冲突的时候，否则就是hash冲突)

// same node is already here

return v;

}

// quadratic probe 继续查找

pos = (pos + (++c)) & mask;

}

/** 计算一个未编译的节点的hash值，用节点的所有的输出arc的属性来计算的，包括label，指向的下一个节点，输出(两个，一个是单独的，一个是共享的)，是否是final */

private long hash(Builder.UnCompiledNode node) {

final int PRIME = 31;

// System.out.println("hash unfrozen");

long h = 0;

// TODO: maybe if number of arcs is high we can safely subsample?

for (int arcIdx = 0; arcIdx < node.numArcs; arcIdx++) {

final Builder.Arc arc = node.arcs[arcIdx];

// System.out.println(" label=" + arc.label + " target=" +

// ((Builder.CompiledNode) arc.target).node + " h=" + h + " output="

// + fst.outputs.outputToString(arc.output) + " isFinal?=" +

// arc.isFinal);

h = PRIME * h + arc.label;

long n = ((Builder.CompiledNode) arc.target).node;// 要编译进入fst的节点的转移节点，一定是已经写入fst的编译节点，所以这里是CompiledNode

h = PRIME * h + (int) (n ^ (n >> 32));

h = PRIME * h + arc.output.hashCode();

h = PRIME * h + arc.nextFinalOutput.hashCode();

if (arc.isFinal) {

h += 17;

}

return h & Long.MAX_VALUE;

}

/** 判断是不是相等的，如果是，表示不是hash冲突，注意里面的nextFinalOutput，这个在节点上是不存的，最后还是要转移到arc上 */

private boolean nodesEqual(Builder.UnCompiledNode node, long address) throws IOException {

fst.readFirstRealTargetArc(address, scratchArc, in);

if (scratchArc.bytesPerArc != 0 && node.numArcs != scratchArc.numArcs) {

return false;

}

for (int arcUpto = 0; arcUpto < node.numArcs; arcUpto++) {//遍历这个节点上的多个arc，都一样才说明是同一个节点

final Builder.Arc arc = node.arcs[arcUpto];

if (arc.label != scratchArc.label || !arc.output.equals(scratchArc.output)

|| ((Builder.CompiledNode) arc.target).node != scratchArc.target

|| !arc.nextFinalOutput.equals(scratchArc.nextFinalOutput) || arc.isFinal != scratchArc.isFinal()) {

return false;

}

if (scratchArc.isLast()) {

if (arcUpto == node.numArcs - 1) {//数量一样多

return true;

} else {

return false;

}

fst.readNextRealArc(scratchArc, in);

}

return false;

}

除此之外，下面的都是fst的代码了，在下一个博客中介绍。

分享到：

2019-12-17 13:56

fst 共享后缀_FST源代码解读2——FST的生成相关推荐

fst 共享后缀_FST源代码解读1——FST是什么
在之前研究IK的时候,看了他存储词典的格式,接触到了前缀树这种数据结构,他的好处是可以共享前缀,但是并不能共享后缀,而FST就能共享后缀,实现更加高效的存储或者查找.当然FST比前缀树添加了一个要求: ...
fst 共享后缀_关于Lucene的词典FST深入剖析
核心是在于如何快速的依据查询词快速的查找到所有的相关文档,这也是倒排索引(Inverted Index) 的核心思想.那么如何设计一个快速的(常量,或者1)定位词典的数据结构就显得尤其重要.简单 ...
fst 共享后缀_谈谈lucene中的FST
FST是lucene中用来存储字典,并进行检索的核心数据结构,我记得在lucene3.0版本前都用的是跳跃链表(不清楚的同学以后我会专门写一篇磁盘索引技术来说清楚,网上也有相关资料),实际上采用该数据 ...
java计算机毕业设计校园共享单车系统源代码+系统+数据库+lw文档
java计算机毕业设计校园共享单车系统源代码+系统+数据库+lw文档 java计算机毕业设计校园共享单车系统源代码+系统+数据库+lw文档本源码技术栈: 项目架构:B/S架构开发语言:Java语言 ...
jqfactor_analyzer源代码解读02
jqfactor_analyzer单因子分析02 首先jqfactor_analyzer源代码解读01中已经得到了self._clean_factor_data(注意binning_by_group参 ...
ML：LGBMClassifier、XGBClassifier和CatBoostClassifier的feature_importances_计算方法源代码解读之详细攻略
ML:LGBMClassifier.XGBClassifier和CatBoostClassifier的feature_importances_计算方法源代码解读之详细攻略目录 LGBMClassif ...
《加强5G公众移动通信系统无线电频率共享管理》最新解读来啦
<加强5G公众移动通信系统无线电频率共享管理>最新解读来啦! 人工智能技术与咨询近日,工业和信息化部印发了<关于加强5G公众移动通信系统无线电频率共享管理有关事项的通知>(下 ...
基于ssm高校共享单车管理系统 (源代码+数据库) 604
部分代码地址 https://gitee.com/ynwynwy/webike-public 基于ssm高校共享单车管理系统 (源代码+数据库) 一.系统介绍用户管理,服务点管理,单车管理,分类管理 ...
linux内核奇遇记之md源代码解读之八阵列同步二
linux内核奇遇记之md源代码解读之八阵列同步二转载请注明出处:http://blog.csdn.net/liumangxiong 在上一小节里讲到启动同步线程: 7824 mddev->s ...

fst 共享后缀_FST源代码解读2——FST的生成

fst 共享后缀_FST源代码解读2——FST的生成相关推荐

最新文章

热门文章