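I've recently been working on a translation system that has to automatically split a passage uploaded by a teacher into sentences. At first it sounded simple: just split on periods and you're done! ... Too young, too naive. I walked right into the trap: there are also exclamation marks, question marks and double quotes to deal with. Still no big deal, just find a regular expression, right? ^_^ Naive again!!! I suddenly realized that English also has abbreviations, and how on earth do you handle those? After a long trek through search results, I found that the Chinese-language articles on the topic are limited and none of them handle abbreviations well. So, at a time when getting over the firewall is strictly off limits, I quietly hopped over it to ask the folks abroad (officers, please don't arrest me, I'm just a kid who loves learning Σ( ° △ °|||)︴).

As it turns out, native English speakers are just as annoyed at how fussy their own language is. Why can't it have clear sentence boundaries like Chinese does... Just when I was completely stuck, a white-bearded old man descended from the sky and said, "Young man, do you need help?" (Don't get the wrong idea ヽ(=^・ω・^=)丿.) OK, back to the point: I came across NLP and found LingPipe. It is quite easy to pull in, and I went from first contact to a working implementation in a single afternoon. Enough rambling, on to the main part!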
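To make the abbreviation problem concrete before the LingPipe version, here is a minimal sketch of the naive regex approach. The class name `NaiveSplitDemo` and the method `naiveSplit` are made up for this illustration, and the regex is just one plausible "simple" attempt, not the one from any particular article.

```java
import java.util.Arrays;
import java.util.List;

class NaiveSplitDemo {

    // Naive approach (illustrative only): break wherever '.', '!' or '?'
    // is followed by whitespace.
    static List<String> naiveSplit(String text) {
        return Arrays.asList(text.split("(?<=[.!?])\\s+"));
    }

    public static void main(String[] args) {
        String text = "My friend holds a Msc. in Computer Science.";
        for (String sentence : naiveSplit(text)) {
            System.out.println("[" + sentence + "]");
        }
        // Prints two "sentences", although the input is a single one:
        // [My friend holds a Msc.]
        // [in Computer Science.]
    }
}
```

The period after the abbreviation looks exactly like a sentence end to the regex, which is the kind of case the examples in the code below are meant to exercise.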

import java.util.ArrayList;
import java.util.List;

import com.aliasi.sentences.IndoEuropeanSentenceModel;
import com.aliasi.sentences.SentenceModel;
import com.aliasi.tokenizer.IndoEuropeanTokenizerFactory;
import com.aliasi.tokenizer.Tokenizer;
import com.aliasi.tokenizer.TokenizerFactory;

public class SpliteTextInSentence {

    static final TokenizerFactory TOKENIZER_FACTORY = IndoEuropeanTokenizerFactory.INSTANCE;
    static final SentenceModel SENTENCE_MODEL = new IndoEuropeanSentenceModel();

    // I picked a bunch of typical examples that trip up regex-based splitting.
    // If your regex handles all of them, you rock: please leave a comment, I want it too.
    public static void main(String[] args) {
        SpliteTextInSentence s = new SpliteTextInSentence();

        String str1 = "Water-splashing Festival is one of the most important festivals in the world, which is popular among Dai people of China and the southeast Asia. It has been celebrated by people for more than 700 years and now this festival is an necessary way for people to promote the cooperation and communication among countries.";
        String str2 = "This is how I tried to split a paragraph into a sentence. But, there is a problem. My paragraph includes dates like Jan.13, 2014 , words like U.S and numbers like 2.2. They all got split by the above code.";
        String str3 = "My friend holds a Msc. in Computer Science.";
        String str4 = "This is a test? This is a T.L.A. test!";
        String text = "50 Cent XYZ120 DVD Player 50 Cent lawyer. Person is john, he is a lawyer.";
        String str5 = "\"I do not ask for your forgiveness,\" he said, in a tone that became more firm and forceful. \"I have no illusions, and I am convinced that death is waiting for me: it is just.\"";
        String str6 = "\"The Times have had too much influence on me.\" He laughed bitterly and said to himself, \"it is only two steps away from death. Alone with me, I am still hypocritical... Ah, the 19th century!\"";
        String str7 = "潑水節是世界上最重要節日之一,深受中國傣族和東南亞人民的喜愛。七百多年來,人們一直在慶祝這個節日,現在這個節日是促進國家間合作和交流的必要方式。";

        // Swap str7 for any of the strings above to try the other examples.
        System.out.println(s.splitfuhao(str7));
        List<String> sl = testChunkSentences(s.splitfuhao(str7));
        if (sl.isEmpty()) {
            System.out.println("No sentences detected");
        }
        for (String row : sl) {
            System.out.println(row);
        }
    }

    // This is the method that calls the sentence detector. After digging through
    // a lot of material, I found it in a post that uses LingPipe for text analysis:
    // https://blog.csdn.net/textboy/article/details/45580009
    private static List<String> testChunkSentences(String text) {
        List<String> result = new ArrayList<String>();
        List<String> tokenList = new ArrayList<String>();
        List<String> whiteList = new ArrayList<String>();

        Tokenizer tokenizer = TOKENIZER_FACTORY.tokenizer(text.toCharArray(), 0, text.length());
        tokenizer.tokenize(tokenList, whiteList);

        String[] tokens = new String[tokenList.size()];
        String[] whites = new String[whiteList.size()];
        tokenList.toArray(tokens);
        whiteList.toArray(whites);

        // Each entry is the index of the last token of one sentence.
        int[] sentenceBoundaries = SENTENCE_MODEL.boundaryIndices(tokens, whites);

        int sentStartTok = 0;
        int sentEndTok = 0;
        for (int i = 0; i < sentenceBoundaries.length; ++i) {
            System.out.println("Sentence " + (i + 1) + ", index of last token (0-based): " + sentenceBoundaries[i]);
            StringBuilder sb = new StringBuilder();
            sentEndTok = sentenceBoundaries[i];
            // whites[j + 1] is the whitespace that follows tokens[j].
            for (int j = sentStartTok; j <= sentEndTok; j++) {
                sb.append(tokens[j]).append(whites[j + 1]);
            }
            sentStartTok = sentEndTok + 1;
            result.add(sb.toString());
        }
        //System.out.println("Final result:" + result);
        return result;
    }

    // Replace Chinese punctuation with English equivalents, to test whether
    // Chinese text can be split into sentences.
    public String splitfuhao(String str) {
        // Full-width (Chinese) marks; the two arrays must have the same length.
        String[] ChineseInterpunction = { "“", "”", "‘", "’", "。", ",", ";", ":", "?", "!", "……", "—", "~", "(", ")", "《", "》" };
        // "《" and "》" have no English counterpart, so both map to an empty string.
        String[] EnglishInterpunction = { "\"", "\"", "'", "'", ".", ",", ";", ":", "?", "!", "…", "-", "~", "(", ")", "", "" };
        for (int j = 0; j < ChineseInterpunction.length; j++) {
            str = str.replace(ChineseInterpunction[j], EnglishInterpunction[j] + " ");
        }
        return str;
    }
}
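A side note on `splitfuhao`: two parallel arrays are easy to get out of sync (one entry too few and the loop throws `ArrayIndexOutOfBoundsException`). A minimal sketch of the same idea with a single map is shown here. The class `PunctuationNormalizer`, the method `normalize` and the shortened mapping table are all hypothetical names and choices for this illustration, not part of LingPipe or of the code above.

```java
import java.util.LinkedHashMap;
import java.util.Map;

class PunctuationNormalizer {

    // Each full-width mark sits next to its replacement, so the two sides
    // cannot drift apart. Only a subset of marks is listed (assumption:
    // extend the table as needed).
    private static final Map<String, String> MAPPING = new LinkedHashMap<String, String>();
    static {
        MAPPING.put("。", ".");
        MAPPING.put(",", ",");
        MAPPING.put(";", ";");
        MAPPING.put(":", ":");
        MAPPING.put("?", "?");
        MAPPING.put("!", "!");
        MAPPING.put("《", "");
        MAPPING.put("》", "");
    }

    // Replace every mapped mark with its English counterpart plus a space,
    // mirroring what splitfuhao does.
    static String normalize(String text) {
        String result = text;
        for (Map.Entry<String, String> entry : MAPPING.entrySet()) {
            result = result.replace(entry.getKey(), entry.getValue() + " ");
        }
        return result;
    }

    public static void main(String[] args) {
        System.out.println(normalize("你好。世界!"));
    }
}
```

The trailing space after each replacement matters because the tokenizer and sentence model look at the whitespace between tokens, and Chinese text usually has none after its punctuation.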

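I won't explain how it works under the hood, because I don't know either... The code can be copied and pasted as-is. Oh, and mind your JDK version: anything below 1.5 won't do (the code uses for-each loops).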

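Also, the English input has to be well-formed: without punctuation, no sentence is recognized at all. (This one cost me a long time; I thought this NLP thing was a scam.)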
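One way to guard against the missing-punctuation case is to patch the input before handing it to the sentence model. The sketch below is only an assumption about what a reasonable pre-check could look like. `FinalStopHelper` and `ensureFinalStop` are hypothetical names, and the rule (append a period unless the text already ends in `.`, `!` or `?`, optionally followed by closing quotes or brackets) is a simplification.

```java
class FinalStopHelper {

    // Hypothetical helper: append a period when the trimmed text does not
    // already end with '.', '!' or '?' (optionally followed by closing
    // quotes, parentheses or square brackets).
    static String ensureFinalStop(String text) {
        String trimmed = text.trim();
        if (trimmed.isEmpty()) {
            return trimmed;
        }
        if (trimmed.matches("(?s).*[.!?][\"')\\]]*")) {
            return trimmed;
        }
        return trimmed + ".";
    }

    public static void main(String[] args) {
        System.out.println(ensureFinalStop("This line has no final punctuation"));
        System.out.println(ensureFinalStop("He said \"stop!\""));
    }
}
```

This only fixes the end of the text. Sentences in the middle of a passage that lack punctuation still cannot be told apart.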

See, super simple! Hope this helps.
