Text Similarity Algorithms in .NET: A Brief Analysis of Cosine Similarity and SimHash, with Worked Examples

2019-05-23 06:12:16 刘景俊

                for (int i=0; i < input.Length; i++)
                    if (!list.Contains(input[i])) // N-GRAM SIMILARITY?
                        list.Add(input[i]);
                return Tokeniser.ArrayListToArray(list);
            }
        }

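        // Counts how many times 'word' occurs in 'words'.
        // 'words' must already be sorted, because Array.BinarySearch requires it.
        private int CountWords(string word, string[] words)
        {
            int itemIdx = Array.BinarySearch(words, word);
            if (itemIdx < 0) return 0; // negative result: the word is absent

            // BinarySearch may land anywhere inside a run of duplicates,
            // so step back to the first occurrence.
            while (itemIdx > 0 && words[itemIdx - 1].Equals(word))
                itemIdx--;

            // Count the run of consecutive matches.
            int count = 0;
            while (itemIdx < words.Length && words[itemIdx].Equals(word))
            {
                count++;
                itemIdx++;
            }
            return count;
        }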
}
 
Drawback:
 
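A single article can contain a very large number of feature words, so its feature vector has a very high dimension. Every comparison then becomes expensive to compute, which makes this approach unsuitable for large data volumes.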
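To make that cost concrete, here is a minimal sketch of cosine similarity over term-frequency vectors. It is not the article's code: the class name `CosineSketch`, the method names, and the whitespace tokenization are all assumptions made for illustration. The vector has one dimension per distinct word, so each comparison has to touch the vocabulary of both texts.

```csharp
using System;
using System.Collections.Generic;

// Hypothetical illustration; none of these names come from the original article.
static class CosineSketch
{
    // Build a term-frequency map: one entry (one dimension) per distinct word.
    static Dictionary<string, int> TermFrequencies(string text)
    {
        var tf = new Dictionary<string, int>();
        foreach (string w in text.Split(new[] { ' ' }, StringSplitOptions.RemoveEmptyEntries))
        {
            int n;
            tf.TryGetValue(w, out n); // n stays 0 when the word is new
            tf[w] = n + 1;
        }
        return tf;
    }

    // cos = dot(a, b) / (|a| * |b|); the work grows with the number of distinct words.
    static double Cosine(Dictionary<string, int> a, Dictionary<string, int> b)
    {
        double dot = 0, normA = 0, normB = 0;
        foreach (var kv in a)
        {
            normA += (double)kv.Value * kv.Value;
            int other;
            if (b.TryGetValue(kv.Key, out other)) dot += (double)kv.Value * other;
        }
        foreach (var kv in b) normB += (double)kv.Value * kv.Value;
        if (normA == 0 || normB == 0) return 0; // empty text: treat similarity as 0
        return dot / (Math.Sqrt(normA) * Math.Sqrt(normB));
    }

    static void Main()
    {
        var a = TermFrequencies("the cat sat on the mat");
        var b = TermFrequencies("the cat lay on the rug");
        Console.WriteLine("dimensions: {0} and {1}", a.Count, b.Count);
        Console.WriteLine("cosine: {0}", Cosine(a, b));
    }
}
```

In this toy input each text has only 5 distinct words and the similarity works out to 6/8 = 0.75. A real article can have thousands of distinct words, and comparing every pair of documents in a large corpus multiplies that per-pair cost.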
 
How SimHash works: