Contents

  • 1, Download Lucene and get the demo JARs
  • 2, Source code walkthrough
    • org.apache.lucene.demo.IndexFiles: StandardAnalyzer at index time
    • org.apache.lucene.demo.SearchFiles: StandardAnalyzer at query time
    • StandardAnalyzer Chinese word segmentation example: index, query
  • 3, Downloading Chinese word lists: synonyms and stop words

Lucene demo walkthrough: how to create an index and how to search it. https://lucene.apache.org/core/8_6_0/demo/index.html

1, Download Lucene and get the demo JARs

https://lucene.apache.org/core/downloads.html

2, Source code walkthrough

Configure pom.xml

<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-demo</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.6.0</version>
</dependency>

org.apache.lucene.demo.IndexFiles: StandardAnalyzer at index time

  • Uses StandardAnalyzer to analyze the file contents

import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

// Index all text files under a directory.
public class IndexFiles {

    /** Index all text files under a directory. */
    public static void main(String[] args) throws IOException {
        // Path of the source documents: absolute or relative.
        // cmd> ls D:\download\lucene-7.7.3\demo\txt\
        //      123.txt  456.txt  hello123.txt  hello12345678.txt
        String docsPath = "D:\\download\\lucene-7.7.3\\demo\\txt";
        // Where the index is stored: absolute or relative path.
        String indexPath = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // true: create a new index; false: update an existing one.
        boolean create = true;

        final Path docDir = Paths.get(docsPath);
        Date start = new Date();
        System.out.println("Indexing to directory '" + indexPath + "'...");

        Directory indexPathDir = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        if (create) {
            // Create a new index in the directory, removing any previously indexed documents:
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            // Add new documents to an existing index:
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(indexPathDir, iwc);
        indexDocs(writer, docDir);

        // NOTE: if you want to maximize search performance,
        // you can optionally call forceMerge here.  This can be
        // a terribly costly operation, so generally it's only
        // worth it when your index is relatively static (ie
        // you're done adding documents to it):
        //
        // writer.forceMerge(1);

        writer.close();

        Date end = new Date();
        System.out.println(end.getTime() - start.getTime() + " total milliseconds");
    }

    /**
     * Indexes the given file using the given writer, or if a directory is given,
     * recurses over files and directories found under the given directory.
     *
     * NOTE: This method indexes one document per input file.  This is slow.  For good
     * throughput, put multiple documents into your input file(s).  An example of this is
     * in the benchmark module, which can create "line doc" files, one document per line,
     * using the
     * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
     * >WriteLineDocTask</a>.
     *
     * @param writer Writer to the index where the given file/dir info will be stored
     * @param path The file to index, or the directory to recurse into to find files to index
     */
    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path)) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // don't index files that can't be read.
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    /** Indexes a single document */
    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            // make a new, empty document
            Document doc = new Document();

            // Add the path of the file as a field named "path".  Use a
            // field that is indexed (i.e. searchable), but don't tokenize
            // the field into separate words:
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);

            // Add the last modified date of the file a field named "modified".
            doc.add(new LongPoint("modified", lastModified));

            // Add the contents of the file to a field named "contents".  Specify a Reader,
            // so that the text of the file is tokenized and indexed, but not stored.
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                // New index, so we just add the document (no old document can be there):
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                // Existing index (an old copy of this document may have been indexed) so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present:
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}

org.apache.lucene.demo.SearchFiles: StandardAnalyzer at query time

import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

// search demo.
public class SearchFiles {

    public static void main(String[] args) throws Exception {
        // Where the index is stored: absolute or relative path.
        String index = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // Text to search for.
        String queryString = "hello";
        // Index field to search in.
        String field = "contents";
        // Page size.
        int hitsPerPage = 10;
        // Whether to print raw details (doc id and match score).
        boolean raw = true;

        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser(field, analyzer);
        Query query = parser.parse(queryString);
        System.out.println("Searching for: " + query.toString(field));

        doSearch(searcher, query, hitsPerPage, raw);
        reader.close();
    }

    // Run the query and print the hits.
    public static void doSearch(IndexSearcher searcher, Query query, int hitsPerPage, boolean raw) throws IOException {
        // Collect enough docs to show 5 pages
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact(results.totalHits.value);
        System.out.println(numTotalHits + " total matching documents ");
        System.out.println("\n\n>>>>>>>> doc details >>>>>>>>>");

        // hits holds at most 5 * hitsPerPage entries, so iterate over hits.length
        // rather than the total hit count, which may be larger.
        for (int i = 0; i < hits.length; i++) {
            if (raw) {
                // output raw format
                System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
                // doc=2 score=0.29767057
            }
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
                System.out.println((i + 1) + ". " + path);
                // 1. D:\download\lucene-7.7.3\demo\txt\hello123.txt
                String title = doc.get("title");
                if (title != null) {
                    System.out.println("   Title: " + title);
                }
            } else {
                System.out.println((i + 1) + ". " + "No path for this document");
            }
            System.out.println();
        }
    }
}
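
Conceptually, the two programs above build and then query an inverted index: a map from each term to the documents that contain it. The plain-Java sketch below illustrates only that idea. It is not how Lucene stores or scores data, and `TinyIndex`, `add`, and `search` are hypothetical names. The point to notice is that the same normalization (here, lowercasing and splitting on non-alphanumeric characters) runs at both index and query time, which is why both demos construct the same analyzer.

```java
import java.util.Collections;
import java.util.HashMap;
import java.util.Locale;
import java.util.Map;
import java.util.Set;
import java.util.TreeSet;

// Hypothetical toy class for illustration; not part of Lucene.
final class TinyIndex {
    // term -> ids of the documents containing that term
    private final Map<String, Set<String>> postings = new HashMap<>();

    // Crude stand-in for an analyzer: lowercase, then split on non-alphanumerics.
    private static String[] analyze(String text) {
        return text.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+");
    }

    // Index phase: record every term of the document.
    void add(String docId, String contents) {
        for (String term : analyze(contents)) {
            if (!term.isEmpty()) {
                postings.computeIfAbsent(term, t -> new TreeSet<>()).add(docId);
            }
        }
    }

    // Query phase: analyze the query the same way, then OR the matching postings.
    Set<String> search(String queryText) {
        Set<String> result = new TreeSet<>();
        for (String term : analyze(queryText)) {
            result.addAll(postings.getOrDefault(term, Collections.emptySet()));
        }
        return result;
    }

    public static void main(String[] args) {
        TinyIndex index = new TinyIndex();
        index.add("a.txt", "Hello Lucene");
        index.add("b.txt", "hello world");
        index.add("c.txt", "goodbye");
        System.out.println(index.search("HELLO")); // [a.txt, b.txt]
    }
}
```

If the query were not lowercased the same way as the indexed text, `HELLO` would find nothing, which is the kind of mismatch that using different analyzers for indexing and searching can cause.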

StandardAnalyzer Chinese word segmentation example: index, query

import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

// Requires JUnit 4 on the test classpath.
@org.junit.Test
public void analyzeTest() throws IOException {
    // Stop words, one per line.
    StringReader stopwords = new StringReader("the \n bigger");
    StringReader stringReader = new StringReader("The quick BIGGER brown   fox jumped over the bigger lazy dog. ");

    // StandardAnalyzer has a built-in LowerCaseFilter and StopFilter, roughly:
    //    final StandardTokenizer src = new StandardTokenizer();
    //    TokenStream tok = new LowerCaseFilter(src);
    //    tok = new StopFilter(tok, stopwords);
    StandardAnalyzer analyzer = new StandardAnalyzer(stopwords);
    TokenStream tokenStream = analyzer.tokenStream("contents", stringReader);
    CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);

    tokenStream.reset();
    while (tokenStream.incrementToken()) {
        System.out.print("[" + term.toString() + "] ");
        // [quick] [brown] [fox] [jumped] [over] [lazy] [dog]
    }
    tokenStream.end();
    tokenStream.close();
}
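
To make the analysis chain in the test above concrete, here is a minimal plain-Java sketch of the same three steps: split into words, lowercase, drop stop words. It is only an approximation for illustration. The class `MiniAnalyzer` and its `analyze` method are hypothetical, and the regex split stands in for Lucene's StandardTokenizer, which follows Unicode text segmentation rules and can behave differently on punctuation, numbers, and CJK text.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.HashSet;
import java.util.List;
import java.util.Locale;
import java.util.Set;

// Hypothetical helper, not part of Lucene: tokenizer -> lowercase -> stop filter.
final class MiniAnalyzer {
    private final Set<String> stopWords;

    MiniAnalyzer(String... stopWords) {
        this.stopWords = new HashSet<>(Arrays.asList(stopWords));
    }

    List<String> analyze(String text) {
        List<String> tokens = new ArrayList<>();
        // Crude tokenizer: split on anything that is not a letter or a digit.
        for (String raw : text.split("[^\\p{L}\\p{N}]+")) {
            if (raw.isEmpty()) {
                continue; // a leading separator yields one empty piece
            }
            // Lowercase first, so the stop-word check is case-insensitive.
            String term = raw.toLowerCase(Locale.ROOT);
            if (!stopWords.contains(term)) {
                tokens.add(term);
            }
        }
        return tokens;
    }

    public static void main(String[] args) {
        MiniAnalyzer analyzer = new MiniAnalyzer("the", "bigger");
        System.out.println(
                analyzer.analyze("The quick BIGGER brown   fox jumped over the bigger lazy dog. "));
        // [quick, brown, fox, jumped, over, lazy, dog]
    }
}
```

The order of the steps matters: because lowercasing happens before the stop-word check, both `The` and `BIGGER` are removed even though the stop list contains only lowercase entries.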

3, Downloading Chinese word lists: synonyms and stop words

  • Link: https://github.com/fighting41love/funNLP
  • To download a file from GitHub directly, replace github.com with raw.githubusercontent.com and remove /blob:
url="https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt"
echo "$url" | sed -e 's@github.com@raw.githubusercontent.com@' -e 's@blob/@@'
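
The same rewrite can be done from Java, for example before fetching a word list programmatically. This is a minimal sketch: the names `GithubRawUrl` and `toRawUrl` are hypothetical, and it assumes the URL has the `https://github.com/<user>/<repo>/blob/<branch>/<path>` shape used above.

```java
// Hypothetical helper mirroring the sed command above.
final class GithubRawUrl {
    private static final String WEB_PREFIX = "https://github.com/";
    private static final String RAW_PREFIX = "https://raw.githubusercontent.com/";

    static String toRawUrl(String blobUrl) {
        if (!blobUrl.startsWith(WEB_PREFIX)) {
            throw new IllegalArgumentException("Not a github.com URL: " + blobUrl);
        }
        // Swap the host, then drop the first "/blob" path segment.
        String rest = blobUrl.substring(WEB_PREFIX.length());
        return RAW_PREFIX + rest.replaceFirst("/blob/", "/");
    }

    public static void main(String[] args) {
        String url = "https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt";
        System.out.println(toRawUrl(url));
        // https://raw.githubusercontent.com/guotong1988/chinese_dictionary/master/dict_synonym.txt
    }
}
```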

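Generate Solr-format synonyms: comma-separated

import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;
import java.util.ArrayList;
import java.util.List;

public static void main(String[] args) throws IOException {
    String file = "D:\\download\\solr-近义词,停用词\\synonym.txt";
    String outfile = "D:\\download\\solr-近义词,停用词\\synonym2.txt";

    // Open character streams for the input and output files.
    // FileReader/FileWriter use the platform default charset; if the word list is
    // UTF-8 and the default is not, open the files with an explicit charset instead.
    try (BufferedReader reader = new BufferedReader(new FileReader(file));
         BufferedWriter bw = new BufferedWriter(new FileWriter(outfile))) {
        // Synonym input format:
        // Aa01A01= 人 士 人物 人士 人氏 人选
        // Aa01A02= 人类 生人 全人类
        String s;
        while ((s = reader.readLine()) != null) {
            String[] arr = s.split("=");
            if (arr.length < 2) continue;

            // Collect the non-empty words after '=' and join them with commas.
            List<String> words = new ArrayList<>();
            for (String word : arr[1].split("\\s")) {
                if (!word.trim().isEmpty()) {
                    words.add(word);
                }
            }
            String line = String.join(",", words);
            System.out.println(line);
            bw.write(line);
            bw.write("\n");
        }
    }
}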
//人,士,人物,人士,人氏,人选
//人类,生人,全人类
//人手,人员,人口,人丁,口,食指
//劳力,劳动力,工作者
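
The per-line conversion above can also be isolated into a small function that is easy to test without touching any files. The sketch below is one possible variant, not the author's code: `SynonymLine` and `toSolrLine` are hypothetical names, it assumes each input line has the `CODE= word word ...` shape shown in the comments, and the sample uses ASCII placeholder words in place of real dictionary entries.

```java
import java.util.Arrays;
import java.util.Optional;
import java.util.stream.Collectors;

// Hypothetical helper: converts one "CODE= w1 w2 ..." line to "w1,w2,...".
final class SynonymLine {
    static Optional<String> toSolrLine(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            return Optional.empty(); // no '=' separator: not a synonym line
        }
        String joined = Arrays.stream(line.substring(eq + 1).trim().split("\\s+"))
                .filter(word -> !word.isEmpty())
                .collect(Collectors.joining(","));
        // Lines with nothing after '=' are skipped as well.
        return joined.isEmpty() ? Optional.empty() : Optional.of(joined);
    }

    public static void main(String[] args) {
        System.out.println(toSolrLine("Aa01A02= humanity  mankind people").orElse("(skipped)"));
        // humanity,mankind,people
        System.out.println(toSolrLine("no separator here").orElse("(skipped)"));
        // (skipped)
    }
}
```

Returning an empty `Optional` for malformed lines lets the caller decide whether to skip them silently, as the loop above does, or to log them.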
