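Contents

1, Download Lucene and get the demo jars
2, Source code walkthrough
- org.apache.lucene.demo.IndexFiles : StandardAnalyzer at index time
- org.apache.lucene.demo.SearchFiles : StandardAnalyzer at query time
- StandardAnalyzer Chinese word segmentation example: index, query
3, Downloading Chinese word lists: synonyms and stopwords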

The Lucene demo shows how to create an index and how to search it: https://lucene.apache.org/core/8_6_0/demo/index.html

1, Download Lucene and get the demo jars

https://lucene.apache.org/core/downloads.html

2, Source code walkthrough

Configure pom.xml

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-demo</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.6.0</version>
</dependency>

org.apache.lucene.demo.IndexFiles : StandardAnalyzer at index time

StandardAnalyzer is used to analyze (tokenize) the file contents.

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

/** Index all text files under a directory. */
public class IndexFiles {

    public static void main(String[] args) throws IOException {
        // Path of the source documents: absolute or relative
        // cmd> ls D:\download\lucene-7.7.3\demo\txt\
        //      123.txt  456.txt  hello123.txt  hello12345678.txt
        String docsPath = "D:\\download\\lucene-7.7.3\\demo\\txt";
        // Where the index is stored: absolute or relative path
        String indexPath = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // true: create a new index; false: update an existing one
        boolean create = true;

        final Path docDir = Paths.get(docsPath);
        Date start = new Date();
        System.out.println("Indexing to directory '" + indexPath + "'...");

        Directory indexPathDir = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        if (create) {
            // Create a new index in the directory, removing any previously indexed documents:
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            // Add new documents to an existing index:
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(indexPathDir, iwc);
        indexDocs(writer, docDir);

        // NOTE: if you want to maximize search performance,
        // you can optionally call forceMerge here.  This can be
        // a terribly costly operation, so generally it's only
        // worth it when your index is relatively static (ie
        // you're done adding documents to it):
        //
        // writer.forceMerge(1);
        writer.close();

        Date end = new Date();
        System.out.println(end.getTime() - start.getTime() + " total milliseconds");
    }

    /**
     * Indexes the given file using the given writer, or if a directory is given,
     * recurses over files and directories found under the given directory.
     *
     * NOTE: This method indexes one document per input file.  This is slow.  For good
     * throughput, put multiple documents into your input file(s).  An example of this is
     * in the benchmark module, which can create "line doc" files, one document per line,
     * using the
     * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
     * >WriteLineDocTask</a>.
     *
     * @param writer Writer to the index where the given file/dir info will be stored
     * @param path The file to index, or the directory to recurse into to find files to index
     */
    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path)) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // don't index files that can't be read.
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    /** Indexes a single document */
    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            // make a new, empty document
            Document doc = new Document();

            // Add the path of the file as a field named "path".  Use a
            // field that is indexed (i.e. searchable), but don't tokenize
            // the field into separate words and don't index term frequency
            // or positional information:
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);

            // Add the last modified date of the file as a field named "modified".
            doc.add(new LongPoint("modified", lastModified));

            // Add the contents of the file to a field named "contents".  Specify a Reader,
            // so that the text of the file is tokenized and indexed, but not stored.
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                // New index, so we just add the document (no old document can be there):
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                // Existing index (an old copy of this document may have been indexed) so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present:
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}

org.apache.lucene.demo.SearchFiles : StandardAnalyzer at query time

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

/** Search demo. */
public class SearchFiles {

    public static void main(String[] args) throws Exception {
        // Where the index is stored: absolute or relative path
        String index = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // Text to search for
        String queryString = "hello";
        // Index field to search in
        String field = "contents";
        // Page size
        int hitsPerPage = 10;
        // Whether to print raw details (doc id and score) for each hit
        boolean raw = true;

        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser(field, analyzer);
        Query query = parser.parse(queryString);
        System.out.println("Searching for: " + query.toString(field));

        doSearch(searcher, query, hitsPerPage, raw);
        reader.close();
    }

    // Run the query and print the hits
    public static void doSearch(IndexSearcher searcher, Query query, int hitsPerPage, boolean raw) throws IOException {
        // Collect enough docs to show 5 pages
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact(results.totalHits.value);
        System.out.println(numTotalHits + " total matching documents ");
        System.out.println("\n\n>>>>>>>> doc details >>>>>>>>>");

        // Loop over the collected hits only: the total hit count can exceed hits.length
        for (int i = 0; i < hits.length; i++) {
            if (raw) {
                // output raw format
                System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
                //doc=2 score=0.29767057
            }
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
                System.out.println((i + 1) + ". " + path);
                //1. D:\download\lucene-7.7.3\demo\txt\hello123.txt
                String title = doc.get("title");
                if (title != null) {
                    System.out.println("   Title: " + title);
                }
            } else {
                System.out.println((i + 1) + ". " + "No path for this document");
            }
            System.out.println();
        }
    }
}
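The two demo programs above split the work into an index phase and a query phase. As a rough mental model only, the sketch below builds a toy inverted index in plain Java: each term maps to the ids of the documents that contain it, and a query is normalized the same way before lookup. The class `ToyIndex`, its methods, and the sample file names and contents are hypothetical; this is not how Lucene stores or scores data.

```java
import java.util.ArrayList;
import java.util.HashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.SortedSet;
import java.util.TreeSet;

/** Toy inverted index (hypothetical, for illustration; not Lucene's implementation). */
final class ToyIndex {
    // term -> sorted ids of the documents that contain it
    private final Map<String, SortedSet<Integer>> postings = new HashMap<>();
    // document id -> stored "path" value
    private final List<String> paths = new ArrayList<>();

    /** Index phase: split the contents into lowercase terms and record the doc id under each. */
    int add(String path, String contents) {
        int docId = paths.size();
        paths.add(path);
        for (String term : contents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Query phase: normalize the term the same way, then look up its postings. */
    List<String> search(String term) {
        List<String> hits = new ArrayList<>();
        SortedSet<Integer> docIds = postings.get(term.toLowerCase(Locale.ROOT));
        if (docIds != null) {
            for (int docId : docIds) {
                hits.add(paths.get(docId));
            }
        }
        return hits;
    }

    public static void main(String[] args) {
        ToyIndex index = new ToyIndex();
        // Made-up sample documents
        index.add("a.txt", "Hello Lucene");
        index.add("b.txt", "hello Solr");
        index.add("c.txt", "goodbye");
        System.out.println(index.search("HELLO")); // [a.txt, b.txt]
    }
}
```

The point of the sketch is that both phases must normalize text the same way, which is why IndexFiles and SearchFiles both construct a StandardAnalyzer.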

StandardAnalyzer Chinese word segmentation example: index, query

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

@org.junit.Test
public void analyzeTest() throws IOException {
    StringReader stopwords = new StringReader("the \n bigger");
    StringReader stringReader = new StringReader("The quick BIGGER brown   fox jumped over the bigger lazy dog. ");

    // StandardAnalyzer has a built-in LowerCaseFilter and StopFilter
    StandardAnalyzer analyzer = new StandardAnalyzer(stopwords);
    TokenStream tokenStream = analyzer.tokenStream("contents", stringReader);
    // Roughly equivalent chain:
    //    final StandardTokenizer src = new StandardTokenizer();
    //    TokenStream tok = new LowerCaseFilter(src);
    //    tok = new StopFilter(tok, stopwords);

    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        //[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    }
    tokenStream.end();
    tokenStream.close();
    analyzer.close();
}
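The test above relies on StandardAnalyzer lowercasing tokens and dropping stopwords. A minimal plain-Java sketch of that chain, run on the same sentence and stopwords, might look like the following. `ToyAnalyzer` is a hypothetical name, and splitting on non-letter, non-digit characters is a simplification assumed here; Lucene's StandardTokenizer follows the Unicode text segmentation rules (UAX #29) instead.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

/** Toy analysis chain: tokenize, lowercase, drop stopwords (illustrative, not Lucene code). */
final class ToyAnalyzer {
    private final Set<String> stopwords;

    ToyAnalyzer(String... stopwords) {
        this.stopwords = new HashSet<>(Arrays.asList(stopwords));
    }

    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Simplified tokenizer: split on runs of characters that are neither letters nor digits
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue;
            }
            // Lowercase first, then filter, so "BIGGER" matches the stopword "bigger"
            String term = raw.toLowerCase(Locale.ROOT);
            if (!stopwords.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        ToyAnalyzer analyzer = new ToyAnalyzer("the", "bigger");
        System.out.println(analyzer.analyze("The quick BIGGER brown   fox jumped over the bigger lazy dog. "));
        // [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

The order matters: lowercasing before the stopword check is what removes both "BIGGER" and "The" even though the stopword list is lowercase.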

3, Downloading Chinese word lists: synonyms and stopwords

Link: https://github.com/fighting41love/funNLP
To download a file from GitHub directly, replace github.com with raw.githubusercontent.com and remove /blob:

url="https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt"
echo $url|sed -e 's@github.com@raw.githubusercontent.com@' -e 's@blob/@@'
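The same rewrite can be sketched in Java, for example when the download is scripted from the indexing code. `RawGithubUrl` and `toRaw` are hypothetical names; the method only mirrors the two sed substitutions above and assumes the input is a github.com blob URL.

```java
/** Rewrites a github.com "blob" page URL to its raw.githubusercontent.com form (illustrative helper). */
final class RawGithubUrl {

    static String toRaw(String url) {
        return url
                // Same as: sed -e 's@github.com@raw.githubusercontent.com@'
                .replaceFirst("^https://github\\.com/", "https://raw.githubusercontent.com/")
                // Same as: sed -e 's@blob/@@'
                .replaceFirst("/blob/", "/");
    }

    public static void main(String[] args) {
        String url = "https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt";
        System.out.println(toRaw(url));
        // https://raw.githubusercontent.com/guotong1988/chinese_dictionary/master/dict_synonym.txt
    }
}
```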

Generate a Solr-format synonym list: comma-separated

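import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.IOException;
import java.nio.charset.StandardCharsets;
import java.nio.file.Files;
import java.nio.file.Paths;
import java.util.ArrayList;
import java.util.List;

// The wrapper class name is illustrative; the original shows only main()
public class SynonymConverter {

    public static void main(String[] args) throws IOException {
        String file = "D:\\download\\solr-近义词，停用词\\synonym.txt";
        String outfile = "D:\\download\\solr-近义词，停用词\\synonym2.txt";

        // Read and write as UTF-8 explicitly (assumes the downloaded file is UTF-8),
        // so the Chinese text does not depend on the platform default charset
        try (BufferedReader reader = Files.newBufferedReader(Paths.get(file), StandardCharsets.UTF_8);
             BufferedWriter writer = Files.newBufferedWriter(Paths.get(outfile), StandardCharsets.UTF_8)) {
            // Input format (synonyms):
            //Aa01A01= 人 士 人物 人士 人氏 人选
            //Aa01A02= 人类 生人 全人类
            String line;
            while ((line = reader.readLine()) != null) {
                String[] parts = line.split("=");
                if (parts.length < 2) continue;

                // Keep the non-empty words and join them with commas (no trailing comma)
                List<String> words = new ArrayList<>();
                for (String word : parts[1].split("\\s")) {
                    if (!word.trim().isEmpty()) {
                        words.add(word);
                    }
                }
                String joined = String.join(",", words);
                System.out.println(joined);
                writer.write(joined);
                writer.write("\n");
            }
        }
    }
}
// Output:
//人,士,人物,人士,人氏,人选
//人类,生人,全人类
//人手,人员,人口,人丁,口,食指
//劳力,劳动力,工作者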