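Extracting Word file content with POI (plain text and HTML-formatted)

Use Apache POI to extract the content of a Word file, either as plain text with no formatting or as HTML that preserves the formatting.

Add the dependency jars to pom.xml: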

        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-scratchpad</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>org.apache.poi</groupId>
            <artifactId>poi-ooxml</artifactId>
            <version>3.17</version>
        </dependency>
        <dependency>
            <groupId>fr.opensagres.xdocreport</groupId>
            <artifactId>fr.opensagres.poi.xwpf.converter.xhtml</artifactId>
            <version>2.0.1</version>
        </dependency>

Extraction utility class:


import com.datahub.aimindgraph.exception.WordExtractorException;
import fr.opensagres.poi.xwpf.converter.core.FileImageExtractor;
import fr.opensagres.poi.xwpf.converter.core.FileURIResolver;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLConverter;
import fr.opensagres.poi.xwpf.converter.xhtml.XHTMLOptions;
import org.apache.poi.hwpf.HWPFDocument;
import org.apache.poi.hwpf.converter.WordToHtmlConverter;
import org.apache.poi.hwpf.extractor.WordExtractor;
import org.apache.poi.xwpf.extractor.XWPFWordExtractor;
import org.apache.poi.xwpf.usermodel.XWPFDocument;
import org.w3c.dom.Document;

import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.transform.OutputKeys;
import javax.xml.transform.Transformer;
import javax.xml.transform.TransformerFactory;
import javax.xml.transform.dom.DOMSource;
import javax.xml.transform.stream.StreamResult;
import java.io.*;

/**
 * @Desc word extraction
 * @Author wadu
 * @Date 2020/1/19
 * @Version 1.0
 **/
public class WordUtil {

    /** Word 2007 */
    public final static String DOCX = ".docx";

    /** Word 2003 */
    public final static String DOC = ".doc";

    public static void main(String[] args) {
        File file = new File("D:\\temp\\test.doc");
        File imageFolderFile = new File("D:\\temp\\images\\");
        // Print the result so the sample run shows the converted HTML.
        System.out.println(wordToHtmlString(file, imageFolderFile));
    }

    public static String wordToHtmlString(String filePath, String imageFolderPath) {
        return wordToHtmlString(new File(filePath), new File(imageFolderPath));
    }

    public static String wordToString(String filePath) {
        return wordToString(new File(filePath));
    }

    /**
     * Extracts HTML-formatted text from a Word file.
     *
     * @param file            the .doc or .docx file
     * @param imageFolderFile folder that receives the embedded images
     * @return the document content as HTML
     */
    public static String wordToHtmlString(File file, File imageFolderFile) {
        if (!file.exists()) {
            throw new WordExtractorException("file does not exist!");
        }
        if (!imageFolderFile.exists()) {
            imageFolderFile.mkdirs();
        }
        String name = file.getName().toLowerCase();
        if (name.endsWith(DOCX)) {
            return word2007ToHtmlString(file, imageFolderFile);
        } else if (name.endsWith(DOC)) {
            return word2003ToHtmlString(file, imageFolderFile);
        } else {
            throw new WordExtractorException("Only doc or docx files are supported");
        }
    }

    /**
     * Extracts unformatted text from a Word file.
     *
     * @param file the .doc or .docx file
     * @return the document content as plain text
     */
    public static String wordToString(File file) {
        if (!file.exists()) {
            throw new WordExtractorException("file does not exist!");
        }
        String name = file.getName().toLowerCase();
        if (name.endsWith(DOCX)) {
            return word2007ToString(file);
        } else if (name.endsWith(DOC)) {
            return word2003ToString(file);
        } else {
            throw new WordExtractorException("Only doc or docx files are supported");
        }
    }

    /** Plain text from a .docx file. */
    private static String word2007ToString(File wordFile) {
        try (InputStream in = new FileInputStream(wordFile)) {
            XWPFDocument document = new XWPFDocument(in);
            // Closing the extractor also releases the underlying document.
            try (XWPFWordExtractor extractor = new XWPFWordExtractor(document)) {
                return extractor.getText();
            }
        } catch (Exception e) {
            throw new WordExtractorException(e.getMessage());
        }
    }

    /** Plain text from a .doc file. */
    private static String word2003ToString(File wordFile) {
        try (InputStream in = new FileInputStream(wordFile);
             WordExtractor wordExtractor = new WordExtractor(in)) {
            return wordExtractor.getText();
        } catch (Exception e) {
            throw new WordExtractorException(e.getMessage());
        }
    }

    /** HTML from a .docx file; images are written to imageFolderFile. */
    private static String word2007ToHtmlString(File wordFile, File imageFolderFile) {
        try (InputStream in = new FileInputStream(wordFile);
             XWPFDocument document = new XWPFDocument(in);
             ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            XHTMLOptions options = XHTMLOptions.create().URIResolver(new FileURIResolver(imageFolderFile));
            options.setExtractor(new FileImageExtractor(imageFolderFile));
            options.setIgnoreStylesIfUnused(false);
            options.setFragment(true);
            XHTMLConverter.getInstance().convert(document, baos, options);
            return baos.toString();
        } catch (Exception e) {
            throw new WordExtractorException(e.getMessage());
        }
    }

    /** HTML from a .doc file; images are written to imageFolderFile. */
    private static String word2003ToHtmlString(File wordFile, File imageFolderFile) {
        String absolutePath = imageFolderFile.getAbsolutePath();
        String imagePath = absolutePath.endsWith(File.separator) ? absolutePath : absolutePath + File.separator;
        try (InputStream input = new FileInputStream(wordFile);
             HWPFDocument wordDocument = new HWPFDocument(input);
             ByteArrayOutputStream baos = new ByteArrayOutputStream()) {
            WordToHtmlConverter wordToHtmlConverter = new WordToHtmlConverter(
                    DocumentBuilderFactory.newInstance().newDocumentBuilder().newDocument());
            // Where the extracted images are stored
            wordToHtmlConverter.setPicturesManager((content, pictureType, suggestedName, widthInches, heightInches) -> {
                String imageFile = imagePath + suggestedName;
                // try-with-resources closes the stream even if the write fails
                try (OutputStream os = new FileOutputStream(new File(imageFile))) {
                    os.write(content);
                } catch (IOException e) {
                    e.printStackTrace();
                }
                return imageFile;
            });
            // Parse the Word document
            wordToHtmlConverter.processDocument(wordDocument);
            Document htmlDocument = wordToHtmlConverter.getDocument();
            DOMSource domSource = new DOMSource(htmlDocument);
            // Write straight into baos: an extra buffered wrapper would have to be
            // flushed before baos is read, otherwise output could be lost.
            StreamResult streamResult = new StreamResult(baos);
            TransformerFactory factory = TransformerFactory.newInstance();
            Transformer serializer = factory.newTransformer();
            serializer.setOutputProperty(OutputKeys.ENCODING, "utf-8");
            serializer.setOutputProperty(OutputKeys.INDENT, "yes");
            serializer.setOutputProperty(OutputKeys.METHOD, "html");
            serializer.transform(domSource, streamResult);
            // Decode with the same charset the serializer was told to emit.
            return baos.toString("UTF-8");
        } catch (Exception e) {
            throw new WordExtractorException(e.getMessage());
        }
    }
}
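
The utility above picks the .doc or .docx code path from the file extension alone, so a renamed file ends up in the wrong branch. As a rough sketch, and not part of the original class, the container type can also be read from the first bytes of the file using only the JDK: binary .doc files are OLE2 compound files, and .docx files are ZIP packages. The names `WordFormatSniffer`, `Container` and `detect` are made up for this illustration. Note that the ZIP signature only says "this is a ZIP archive", which also matches .xlsx or .jar files, so it complements the extension check rather than replacing it.

```java
import java.io.IOException;
import java.io.InputStream;
import java.nio.file.Files;
import java.nio.file.Path;
import java.util.Arrays;

/** Hypothetical helper for illustration; not part of the original WordUtil. */
final class WordFormatSniffer {

    enum Container { OLE2, ZIP, UNKNOWN }

    // Signature of an OLE2 compound file (the container used by binary .doc).
    private static final byte[] OLE2_MAGIC = {
        (byte) 0xD0, (byte) 0xCF, 0x11, (byte) 0xE0,
        (byte) 0xA1, (byte) 0xB1, 0x1A, (byte) 0xE1
    };

    // Signature of a ZIP local file header (the container used by .docx).
    private static final byte[] ZIP_MAGIC = { 'P', 'K', 0x03, 0x04 };

    /** Classifies a buffer holding the first bytes of a file. */
    static Container detect(byte[] header) {
        if (startsWith(header, OLE2_MAGIC)) {
            return Container.OLE2;
        }
        if (startsWith(header, ZIP_MAGIC)) {
            return Container.ZIP;
        }
        return Container.UNKNOWN;
    }

    /** Reads up to eight leading bytes of the file and classifies them. */
    static Container detect(Path file) throws IOException {
        try (InputStream in = Files.newInputStream(file)) {
            byte[] header = new byte[OLE2_MAGIC.length];
            int read = 0;
            while (read < header.length) {
                int n = in.read(header, read, header.length - read);
                if (n < 0) {
                    break; // file is shorter than the longest signature
                }
                read += n;
            }
            return detect(Arrays.copyOf(header, read));
        }
    }

    private static boolean startsWith(byte[] data, byte[] prefix) {
        if (data == null || data.length < prefix.length) {
            return false;
        }
        for (int i = 0; i < prefix.length; i++) {
            if (data[i] != prefix[i]) {
                return false;
            }
        }
        return true;
    }
}
```

A caller could run `WordFormatSniffer.detect(file.toPath())` before dispatching and reject the file when the result is `UNKNOWN` or disagrees with the extension.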
