[Java into NLP] The simhash algorithm computes the similarity of two articles






A Python implementation of the simhash similarity algorithm is described here:
https://blog.csdn.net/u013421629/article/details/85052915

Simhash is suited to long texts (500 words or more). A Java implementation is pasted below.
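Before the full implementation, the core of simhash can be sketched in a few lines of self-contained Java. This is a minimal sketch, not the author's class: whitespace tokenization and `String.hashCode()` stand in for the HanLP segmenter and token hash used in the real code, and all names here are illustrative.

```java
import java.math.BigInteger;

public class SimHashSketch {
    static final int HASHBITS = 64;

    // 1. Hash each token to a fixed-length value (hashCode() widened, for illustration).
    static BigInteger tokenHash(String word) {
        BigInteger mask = BigInteger.ONE.shiftLeft(HASHBITS).subtract(BigInteger.ONE);
        return BigInteger.valueOf(word.hashCode()).and(mask);
    }

    // 2.-3. Build the weighted bit vector, then reduce it to a fingerprint.
    static BigInteger fingerprint(String text) {
        int[] v = new int[HASHBITS];
        for (String word : text.toLowerCase().split("\\s+")) {
            BigInteger h = tokenHash(word);
            for (int i = 0; i < HASHBITS; i++) {
                v[i] += h.testBit(i) ? 1 : -1;  // every token gets weight 1 in this sketch
            }
        }
        BigInteger fp = BigInteger.ZERO;
        for (int i = 0; i < HASHBITS; i++) {
            if (v[i] > 0) fp = fp.setBit(i);    // bit i is 1 iff the weighted sum is positive
        }
        return fp;
    }

    // 4. Similarity = 1 - hammingDistance / hashbits.
    static double semblance(String a, String b) {
        int distance = fingerprint(a).xor(fingerprint(b)).bitCount();
        return 1.0 - (double) distance / HASHBITS;
    }

    public static void main(String[] args) {
        String a = "the quick brown fox jumps over the lazy dog";
        String b = "the quick brown fox jumped over the lazy dog";
        System.out.println(semblance(a, a));  // identical texts -> 1.0
        System.out.println(semblance(a, b));  // one changed token -> close to 1.0
    }
}
```

The full implementation below follows the same four steps, but weights tokens by part of speech and filters punctuation and over-frequent words.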



Add the dependency to pom.xml:

<dependency>
    <groupId>org.jsoup</groupId>
    <artifactId>jsoup</artifactId>
    <version>1.8.1</version>
</dependency>

The code below also imports HanLP (com.hankcs.hanlp) and Apache Commons Lang (org.apache.commons.lang3), so those dependencies are needed as well.

Create the file MySimHash.java:



/* Calculate the similarity of two articles */
import com.hankcs.hanlp.seg.common.Term;
import com.hankcs.hanlp.tokenizer.StandardTokenizer;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.safety.Whitelist;

import java.math.BigInteger;
import java.util.HashMap;
import java.util.List;
import java.util.Map;

public class MySimHash {
    private String tokens;         // the input string
    private BigInteger strSimHash; // the hash fingerprint produced for the string
    private int hashbits = 64;     // length of the hash after word segmentation

    public MySimHash(String tokens) {
        this.tokens = tokens;
        this.strSimHash = this.simHash();
    }

    private MySimHash(String tokens, int hashbits) {
        this.tokens = tokens;
        this.hashbits = hashbits;
        this.strSimHash = this.simHash();
    }

    /**
     * Clear HTML tags.
     * @param content input text, possibly HTML
     * @return plain lower-case text with whitespace removed
     */
    private String cleanResume(String content) {
        // If the input is HTML, the following filters out all HTML tags.
        content = Jsoup.clean(content, Whitelist.none());
        content = StringUtils.lowerCase(content);
        String[] strings = {" ", "&nbsp;", "\r", "\n", "\t"};
        for (String s : strings) {
            content = content.replaceAll(s, "");
        }
        return content;
    }

    /**
     * Hash the entire string into a simhash fingerprint.
     */
    private BigInteger simHash() {
        this.tokens = cleanResume(this.tokens); // cleanResume removes some special characters
        int[] v = new int[this.hashbits];
        List<Term> termList = StandardTokenizer.segment(this.tokens); // segment the string
        // Special handling of the segmentation result: weight by part of speech,
        // filter out punctuation, filter over-frequent words, etc.
        Map<String, Integer> weightOfNature = new HashMap<>(); // weight by part of speech
        weightOfNature.put("n", 2);                            // nouns get a weight of 2
        Map<String, String> stopNatures = new HashMap<>();     // stop parts of speech, e.g. punctuation
        stopNatures.put("w", "");
        int overCount = 5;                                     // cap for over-frequent words
        Map<String, Integer> wordCount = new HashMap<>();
        for (Term term : termList) {
            String word = term.word;                 // the token text
            String nature = term.nature.toString();  // its part of speech
            // Filter over-frequent words
            if (wordCount.containsKey(word)) {
                int count = wordCount.get(word);
                if (count > overCount) {
                    continue;
                }
                wordCount.put(word, count + 1);
            } else {
                wordCount.put(word, 1);
            }
            // Filter stop parts of speech
            if (stopNatures.containsKey(nature)) {
                continue;
            }
            // 2. Hash each token into a fixed-length value, e.g. a 64-bit integer.
            BigInteger t = this.hash(word);
            for (int i = 0; i < this.hashbits; i++) {
                BigInteger bitmask = BigInteger.ONE.shiftLeft(i);
                // 3. Build the weighted vector: add the weight where the bit is 1,
                // subtract it where the bit is 0.
                int weight = weightOfNature.getOrDefault(nature, 1);
                if (t.and(bitmask).signum() != 0) {
                    v[i] += weight;
                } else {
                    v[i] -= weight;
                }
            }
        }
        // 4. Reduce the vector to the final fingerprint: bit i is 1 iff v[i] >= 0.
        BigInteger fingerprint = BigInteger.ZERO;
        for (int i = 0; i < this.hashbits; i++) {
            if (v[i] >= 0) {
                fingerprint = fingerprint.add(BigInteger.ONE.shiftLeft(i));
            }
        }
        return fingerprint;
    }

    /**
     * Hash a single token to a hashbits-length value.
     */
    private BigInteger hash(String source) {
        if (source == null || source.length() == 0) {
            return BigInteger.ZERO;
        }
        // Very short tokens would bias the hash; pad them first.
        while (source.length() < 3) {
            source = source + source.charAt(0);
        }
        char[] sourceArray = source.toCharArray();
        BigInteger x = BigInteger.valueOf(((long) sourceArray[0]) << 7);
        BigInteger m = new BigInteger("1000003");
        BigInteger mask = BigInteger.ONE.shiftLeft(this.hashbits).subtract(BigInteger.ONE);
        for (char item : sourceArray) {
            BigInteger temp = BigInteger.valueOf((long) item);
            x = x.multiply(m).xor(temp).and(mask);
        }
        x = x.xor(BigInteger.valueOf(source.length()));
        if (x.equals(BigInteger.valueOf(-1))) {
            x = BigInteger.valueOf(-2);
        }
        return x;
    }

    /**
     * Hamming distance between two fingerprints: the number of differing bits.
     */
    public int hammingDistance(MySimHash other) {
        return this.strSimHash.xor(other.strSimHash).bitCount();
    }

    /**
     * Similarity in [0, 1]: 1 - hammingDistance / hashbits.
     */
    public double getSemblance(MySimHash other) {
        double distance = (double) this.hammingDistance(other);
        return 1.0 - distance / this.hashbits;
    }
}

Output:

======================================
0
20
0.6875
0.6875
Total time: 429 milliseconds
======================================
Process finished with exit code 0
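As a sanity check on the numbers above: assuming the 20 is a Hamming distance over 64 hash bits, the printed similarity follows directly from the formula 1 - distance / hashbits, since 1 - 20/64 = 0.6875. A minimal sketch of that arithmetic:

```java
public class SemblanceCheck {
    public static void main(String[] args) {
        int hashbits = 64;          // fingerprint length used by the class above
        int hammingDistance = 20;   // distance reported in the run above
        // similarity = 1 - distance / hashbits
        double semblance = 1.0 - (double) hammingDistance / hashbits;
        System.out.println(semblance); // prints 0.6875
    }
}
```

A distance of 0 (the first line of output) corresponds to identical fingerprints, i.e. a similarity of 1.0.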