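Understanding how Solr works: Lucene
Contents
- 1, Download Lucene and get the demo-related JARs
- 2, Source code walkthrough
- org.apache.lucene.demo.IndexFiles : StandardAnalyzer, index phase
- org.apache.lucene.demo.SearchFiles : StandardAnalyzer, query phase
- StandardAnalyzer Chinese word-segmentation example: index, query
- 3, Chinese word lists to download: synonyms, stop words
A walkthrough of the Lucene demo: how to create an index and how to search it. https://lucene.apache.org/core/8_6_0/demo/index.html
1, Download Lucene and get the demo-related JARs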
https://lucene.apache.org/core/downloads.html
2, Source code walkthrough
Configure pom.xml
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-demo</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-core</artifactId>
    <version>8.6.0</version>
</dependency>
<dependency>
    <groupId>org.apache.lucene</groupId>
    <artifactId>lucene-queryparser</artifactId>
    <version>8.6.0</version>
</dependency>
org.apache.lucene.demo.IndexFiles : StandardAnalyzer, index phase
- StandardAnalyzer is used to analyze the file contents
// Index all text files under a directory.
import java.io.BufferedReader;
import java.io.IOException;
import java.io.InputStream;
import java.io.InputStreamReader;
import java.nio.charset.StandardCharsets;
import java.nio.file.FileVisitResult;
import java.nio.file.Files;
import java.nio.file.Path;
import java.nio.file.Paths;
import java.nio.file.SimpleFileVisitor;
import java.nio.file.attribute.BasicFileAttributes;
import java.util.Date;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.document.Field;
import org.apache.lucene.document.LongPoint;
import org.apache.lucene.document.StringField;
import org.apache.lucene.document.TextField;
import org.apache.lucene.index.IndexWriter;
import org.apache.lucene.index.IndexWriterConfig;
import org.apache.lucene.index.IndexWriterConfig.OpenMode;
import org.apache.lucene.index.Term;
import org.apache.lucene.store.Directory;
import org.apache.lucene.store.FSDirectory;

public class IndexFiles {

    /** Index all text files under a directory. */
    public static void main(String[] args) throws IOException {
        // Path of the source documents: absolute or relative.
        // cmd> ls D:\download\lucene-7.7.3\demo\txt\
        //   123.txt  456.txt  hello123.txt  hello12345678.txt
        String docsPath = "D:\\download\\lucene-7.7.3\\demo\\txt";
        // Where the index is stored: absolute or relative path.
        String indexPath = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // true: create a new index; false: update an existing one.
        boolean create = true;

        final Path docDir = Paths.get(docsPath);
        Date start = new Date();
        System.out.println("Indexing to directory '" + indexPath + "'...");

        Directory indexPathDir = FSDirectory.open(Paths.get(indexPath));
        Analyzer analyzer = new StandardAnalyzer();
        IndexWriterConfig iwc = new IndexWriterConfig(analyzer);
        if (create) {
            // Create a new index in the directory, removing any old indexes.
            iwc.setOpenMode(OpenMode.CREATE);
        } else {
            // Add new documents to an existing index.
            iwc.setOpenMode(OpenMode.CREATE_OR_APPEND);
        }

        IndexWriter writer = new IndexWriter(indexPathDir, iwc);
        indexDocs(writer, docDir);

        // NOTE: if you want to maximize search performance,
        // you can optionally call forceMerge here. This can be
        // a terribly costly operation, so generally it's only
        // worth it when your index is relatively static (i.e.
        // you're done adding documents to it):
        //
        // writer.forceMerge(1);
        writer.close();

        Date end = new Date();
        System.out.println(end.getTime() - start.getTime() + " total milliseconds");
    }

    /**
     * Indexes the given file using the given writer, or if a directory is given,
     * recurses over files and directories found under the given directory.
     *
     * NOTE: This method indexes one document per input file. This is slow. For good
     * throughput, put multiple documents into your input file(s). An example of this is
     * in the benchmark module, which can create "line doc" files, one document per line,
     * using the
     * <a href="../../../../../contrib-benchmark/org/apache/lucene/benchmark/byTask/tasks/WriteLineDocTask.html"
     * >WriteLineDocTask</a>.
     *
     * @param writer Writer to the index where the given file/dir info will be stored
     * @param path The file to index, or the directory to recurse into to find files to index
     */
    static void indexDocs(final IndexWriter writer, Path path) throws IOException {
        if (Files.isDirectory(path)) {
            Files.walkFileTree(path, new SimpleFileVisitor<Path>() {
                @Override
                public FileVisitResult visitFile(Path file, BasicFileAttributes attrs) throws IOException {
                    try {
                        indexDoc(writer, file, attrs.lastModifiedTime().toMillis());
                    } catch (IOException ignore) {
                        // Don't index files that can't be read.
                    }
                    return FileVisitResult.CONTINUE;
                }
            });
        } else {
            indexDoc(writer, path, Files.getLastModifiedTime(path).toMillis());
        }
    }

    /** Indexes a single document. */
    static void indexDoc(IndexWriter writer, Path file, long lastModified) throws IOException {
        try (InputStream stream = Files.newInputStream(file)) {
            // Make a new, empty document.
            Document doc = new Document();

            // Add the path of the file as a field named "path".
            // A StringField is indexed (searchable) as a single token and is not tokenized.
            Field pathField = new StringField("path", file.toString(), Field.Store.YES);
            doc.add(pathField);

            // Add the last modified date of the file as a field named "modified".
            doc.add(new LongPoint("modified", lastModified));

            // Add the contents of the file to a field named "contents". Specify a Reader,
            // so that the text of the file is tokenized and indexed, but not stored.
            doc.add(new TextField("contents",
                    new BufferedReader(new InputStreamReader(stream, StandardCharsets.UTF_8))));

            if (writer.getConfig().getOpenMode() == OpenMode.CREATE) {
                // New index, so we just add the document (no old document can be there).
                System.out.println("adding " + file);
                writer.addDocument(doc);
            } else {
                // Existing index (an old copy of this document may have been indexed), so
                // we use updateDocument instead to replace the old one matching the exact
                // path, if present.
                System.out.println("updating " + file);
                writer.updateDocument(new Term("path", file.toString()), doc);
            }
        }
    }
}
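To make the index phase less abstract, here is a minimal sketch of the idea behind it: an inverted index that maps each term to the documents containing it. This is a toy model built on the JDK only, not Lucene's actual data structure; the class name `ToyInvertedIndex`, the crude lowercase-and-split analysis, and the sample file contents are all assumptions made for illustration.

```java
import java.util.ArrayList;
import java.util.LinkedHashMap;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.TreeSet;

/** Toy inverted index: term -> sorted set of document ids. Not Lucene's implementation. */
public class ToyInvertedIndex {
    private final Map<String, TreeSet<Integer>> postings = new LinkedHashMap<>();
    private final List<String> paths = new ArrayList<>();

    /** Adds one document and returns its id (ids are assigned in insertion order). */
    public int addDocument(String path, String contents) {
        int docId = paths.size();
        paths.add(path);
        // Crude analysis step: lowercase, then split on anything that is not a letter or digit.
        for (String token : contents.toLowerCase(Locale.ROOT).split("[^\\p{L}\\p{N}]+")) {
            if (!token.isEmpty()) {
                postings.computeIfAbsent(token, t -> new TreeSet<>()).add(docId);
            }
        }
        return docId;
    }

    /** Returns the paths of all documents that contain the (lowercased) term. */
    public List<String> search(String term) {
        List<String> result = new ArrayList<>();
        for (int docId : postings.getOrDefault(term.toLowerCase(Locale.ROOT), new TreeSet<>())) {
            result.add(paths.get(docId));
        }
        return result;
    }

    public static void main(String[] args) {
        ToyInvertedIndex index = new ToyInvertedIndex();
        // Hypothetical file contents, chosen only for the demonstration.
        index.addDocument("123.txt", "one two three");
        index.addDocument("hello123.txt", "Hello one two three");
        System.out.println(index.search("hello")); // [hello123.txt]
        System.out.println(index.search("two"));   // [123.txt, hello123.txt]
    }
}
```

The point of the sketch is that a lookup touches only the posting list of the queried term, which is why the analyzer used at index time has to agree with the one used at query time.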
org.apache.lucene.demo.SearchFiles : StandardAnalyzer, query phase
// Search demo.
import java.io.IOException;
import java.nio.file.Paths;

import org.apache.lucene.analysis.Analyzer;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.document.Document;
import org.apache.lucene.index.DirectoryReader;
import org.apache.lucene.index.IndexReader;
import org.apache.lucene.queryparser.classic.QueryParser;
import org.apache.lucene.search.IndexSearcher;
import org.apache.lucene.search.Query;
import org.apache.lucene.search.ScoreDoc;
import org.apache.lucene.search.TopDocs;
import org.apache.lucene.store.FSDirectory;

public class SearchFiles {

    public static void main(String[] args) throws Exception {
        // Where the index is stored: absolute or relative path.
        String index = "D:\\download\\lucene-7.7.3\\demo\\index1";
        // The text to search for.
        String queryString = "hello";
        // The index field to search in.
        String field = "contents";
        // Page size.
        int hitsPerPage = 10;
        // Whether to print raw details (doc id and score) for each hit.
        boolean raw = true;

        IndexReader reader = DirectoryReader.open(FSDirectory.open(Paths.get(index)));
        IndexSearcher searcher = new IndexSearcher(reader);
        Analyzer analyzer = new StandardAnalyzer();
        QueryParser parser = new QueryParser(field, analyzer);
        Query query = parser.parse(queryString);
        System.out.println("Searching for: " + query.toString(field));

        doSearch(searcher, query, hitsPerPage, raw);
        reader.close();
    }

    // Run the query and print the hits.
    public static void doSearch(IndexSearcher searcher, Query query, int hitsPerPage, boolean raw)
            throws IOException {
        // Collect enough docs to show 5 pages.
        TopDocs results = searcher.search(query, 5 * hitsPerPage);
        ScoreDoc[] hits = results.scoreDocs;
        int numTotalHits = Math.toIntExact(results.totalHits.value);
        System.out.println(numTotalHits + " total matching documents ");
        System.out.println("\n\n>>>>>>>> doc details >>>>>>>>>");

        // Iterate over the collected hits only: numTotalHits can exceed hits.length,
        // and indexing hits[] up to numTotalHits would then run past the end of the array.
        for (int i = 0; i < hits.length; i++) {
            if (raw) { // output raw format
                System.out.println("doc=" + hits[i].doc + " score=" + hits[i].score);
                //doc=2 score=0.29767057
            }
            Document doc = searcher.doc(hits[i].doc);
            String path = doc.get("path");
            if (path != null) {
                System.out.println((i + 1) + ". " + path);
                //1. D:\download\lucene-7.7.3\demo\txt\hello123.txt
                String title = doc.get("title");
                if (title != null) {
                    System.out.println(" Title: " + title);
                }
            } else {
                System.out.println((i + 1) + ". " + "No path for this document");
            }
            System.out.println();
        }
    }
}
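SearchFiles fetches up to 5 * hitsPerPage hits in a single call, so showing them page by page is a matter of slicing that array. The helper below is a small sketch of the arithmetic, assuming 0-based page numbers; `HitPager` and `pageBounds` are hypothetical names and not part of the Lucene demo.

```java
import java.util.Arrays;

/** Sketch: compute which slice of an already collected hits array belongs to a page. */
public class HitPager {

    /**
     * Returns {start, end} with end exclusive for the given 0-based page.
     * Both bounds are clamped to availableHits, so a page past the end is empty.
     */
    public static int[] pageBounds(int availableHits, int hitsPerPage, int page) {
        if (availableHits < 0 || hitsPerPage <= 0 || page < 0) {
            throw new IllegalArgumentException("availableHits >= 0, hitsPerPage > 0, page >= 0 required");
        }
        long first = (long) page * hitsPerPage; // long avoids int overflow for large pages
        int start = (int) Math.min(first, availableHits);
        int end = (int) Math.min(first + hitsPerPage, availableHits);
        return new int[] {start, end};
    }

    public static void main(String[] args) {
        // 23 collected hits, 10 per page: the third page (index 2) holds hits 20..22.
        System.out.println(Arrays.toString(pageBounds(23, 10, 2))); // [20, 23]
    }
}
```

With these bounds, a display loop would run from start to end over the `ScoreDoc[]` array instead of over the total hit count.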
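StandardAnalyzer Chinese word-segmentation example: index, query
import java.io.IOException;
import java.io.StringReader;

import org.apache.lucene.analysis.TokenStream;
import org.apache.lucene.analysis.standard.StandardAnalyzer;
import org.apache.lucene.analysis.tokenattributes.CharTermAttribute;

public class StandardAnalyzerTest {

    @org.junit.Test
    public void analyzeTest() throws IOException {
        // Custom stop words, one per line.
        StringReader stopwords = new StringReader("the \n bigger");
        StringReader stringReader = new StringReader("The quick BIGGER brown fox jumped over the bigger lazy dog. ");

        // StandardAnalyzer has a built-in LowerCaseFilter and StopFilter.
        StandardAnalyzer analyzer = new StandardAnalyzer(stopwords);
        TokenStream tokenStream = analyzer.tokenStream("contents", stringReader);
        // The analyzer is roughly equivalent to this chain:
        // final StandardTokenizer src = new StandardTokenizer();
        // TokenStream tok = new LowerCaseFilter(src);
        // tok = new StopFilter(tok, stopwords);

        CharTermAttribute term = tokenStream.addAttribute(CharTermAttribute.class);
        tokenStream.reset();
        while (tokenStream.incrementToken()) {
            System.out.print("[" + term.toString() + "] ");
            //[quick] [brown] [fox] [jumped] [over] [lazy] [dog]
        }
        tokenStream.end();
        tokenStream.close();
        analyzer.close();
    }
}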
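The sample text above is English. For Chinese input, StandardAnalyzer emits each Han character as its own token, because its tokenizer follows Unicode word-break rules rather than a dictionary. The sketch below imitates that observable behaviour with the JDK only, so it can be run without Lucene. It is an approximation, not the StandardTokenizer algorithm, and `HanUnigramSketch` is a hypothetical name.

```java
import java.util.ArrayList;
import java.util.List;
import java.util.Locale;

/** Approximation of how Han text ends up as single-character tokens. Not Lucene code. */
public class HanUnigramSketch {

    public static List<String> tokenize(String text) {
        List<String> tokens = new ArrayList<>();
        StringBuilder word = new StringBuilder();
        for (int i = 0; i < text.length(); ) {
            int cp = text.codePointAt(i);
            i += Character.charCount(cp);
            boolean han = Character.UnicodeScript.of(cp) == Character.UnicodeScript.HAN;
            if (!han && Character.isLetterOrDigit(cp)) {
                word.appendCodePoint(cp); // extend the current non-Han word
                continue;
            }
            flush(word, tokens);          // a Han character or a separator ends the word
            if (han) {
                tokens.add(new String(Character.toChars(cp))); // one token per Han character
            }
        }
        flush(word, tokens);
        return tokens;
    }

    /** Emits the pending non-Han word in lowercase, if there is one. */
    private static void flush(StringBuilder word, List<String> tokens) {
        if (word.length() > 0) {
            tokens.add(word.toString().toLowerCase(Locale.ROOT));
            word.setLength(0);
        }
    }

    public static void main(String[] args) {
        System.out.println(tokenize("Solr 全文检索")); // [solr, 全, 文, 检, 索]
    }
}
```

Single-character tokens are why a dedicated Chinese analyzer, together with the synonym and stop-word lists discussed next, is usually worth considering for Chinese text.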
3, Chinese word lists to download: synonyms, stop words
- Link: https://github.com/fighting41love/funNLP
- To download a file from GitHub directly, replace github.com with raw.githubusercontent.com and remove /blob:
url="https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt"
echo $url|sed -e 's@github.com@raw.githubusercontent.com@' -e 's@blob/@@'
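The same rewrite can be done in Java when the download is part of a program. This is a minimal sketch that assumes the input is a regular https://github.com/.../blob/... file URL; `GithubRawUrl` and `toRawUrl` are hypothetical names.

```java
/** Sketch: turn a GitHub "blob" page URL into the matching raw-content URL. */
public class GithubRawUrl {

    public static String toRawUrl(String url) {
        return url
                // Swap the host, anchored at the start so only the host part is touched.
                .replaceFirst("^https://github\\.com/", "https://raw.githubusercontent.com/")
                // Drop the first "/blob" path segment.
                .replaceFirst("/blob/", "/");
    }

    public static void main(String[] args) {
        String url = "https://github.com/guotong1988/chinese_dictionary/blob/master/dict_synonym.txt";
        System.out.println(toRawUrl(url));
        // https://raw.githubusercontent.com/guotong1988/chinese_dictionary/master/dict_synonym.txt
    }
}
```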
Generate Solr-format synonyms: comma-separated
import java.io.BufferedReader;
import java.io.BufferedWriter;
import java.io.FileReader;
import java.io.FileWriter;
import java.io.IOException;

public class SynonymFileConverter {

    public static void main(String[] args) throws IOException {
        // Input and output files.
        String file = "D:\\download\\solr-近义词,停用词\\synonym.txt";
        String outfile = "D:\\download\\solr-近义词,停用词\\synonym2.txt";

        // Open character streams for both files; try-with-resources closes them.
        try (BufferedWriter bw = new BufferedWriter(new FileWriter(outfile));
             BufferedReader raf = new BufferedReader(new FileReader(file))) {
            // Synonym input format:
            //Aa01A01= 人 士 人物 人士 人氏 人选
            //Aa01A02= 人类 生人 全人类
            String s;
            while ((s = raf.readLine()) != null) {
                String[] arr = s.split("=");
                if (arr.length < 2) continue; // skip lines without a word list
                String[] arr2 = arr[1].split("\\s");
                for (int i = 0; i < arr2.length; i++) {
                    if (i != arr2.length - 1) {
                        // Skip the empty strings produced by leading or repeated whitespace.
                        if (!arr2[i].trim().equals("")) {
                            System.out.print(arr2[i] + ",");
                            bw.write(arr2[i] + ",");
                        }
                    } else {
                        System.out.print(arr2[i]);
                        bw.write(arr2[i]);
                    }
                }
                System.out.println();
                bw.write("\n");
            }
            bw.flush();
        }
    }
}
// Output:
//人,士,人物,人士,人氏,人选
//人类,生人,全人类
//人手,人员,人口,人丁,口,食指
//劳力,劳动力,工作者
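The per-line conversion can also be isolated as a pure function, which makes it easy to test without touching any files. This is a sketch under the assumption that every useful line has the form `code= word word ...`; `SynonymLine` and `toSolrLine` are hypothetical names.

```java
import java.util.ArrayList;
import java.util.List;

/** Sketch: convert one "code= w1 w2 ..." line into a comma-separated synonym line. */
public class SynonymLine {

    /** Returns the comma-separated words, or null when the line has no '=' or no words. */
    public static String toSolrLine(String line) {
        int eq = line.indexOf('=');
        if (eq < 0) {
            return null;
        }
        List<String> words = new ArrayList<>();
        for (String w : line.substring(eq + 1).trim().split("\\s+")) {
            if (!w.isEmpty()) {
                words.add(w);
            }
        }
        return words.isEmpty() ? null : String.join(",", words);
    }

    public static void main(String[] args) {
        System.out.println(toSolrLine("Aa01A02= 人类 生人 全人类")); // 人类,生人,全人类
    }
}
```

Joining a filtered list means the separator logic does not depend on the position of the last word.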