HTTP scraping in Java with HttpUnit and Jsoup

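A quick note, following the usual sequence: gather information, analyze the problem, solve the problem.

Ready-made libraries for working with HTML documents include:

HttpUnit: very old and no longer updated. http://www.httpunit.org/ (20 May 2008: HttpUnit 1.7 released)
Jsoup: still maintained. http://jsoup.org/
HtmlUnit: http://htmlunit.sourceforge.net/
Selenium WebDriver: ships a driver built on HtmlUnit
PhantomJS: screenshots

and so on.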

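The goal is to scrape Xiami's music "roam" list, the popularity ranking, and the download links. On the player page (second screenshot), the Network panel of the browser's developer tools shows the list of ajax requests plus one file of several MB.

The original Java jsoup, or its C# port NSoup, is enough for the job. Responses come back as either HTML or JSON.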

Default playlist request, returns JSON:

http://www.xiami.com/song/playlist-default/cat/json

The returned JSON has to be transcoded (decoded) before use.

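Download link format, returns a binary stream:

http://m5.file.xiami.com/{artist_id}/{artist_id}/{album_id}/{song_id}_{unknown_number}_{[hl]}.mp3?auth_key={unknown_hash_code_with_length_32}-{current_Time_Millis/1000}-0-null

Here {artist_id} is the artist's id. So far, the first {artist_id} has been observed to be truncated to 3 digits when the id has 4 digits.

The code that generates {unknown_number} and {unknown_hash_code_with_length_32} has not been found yet, so both are unknown (the hash may be an MD5, 32 characters, and seems to depend on the time in seconds). The auth_key differs between requests for the same song, simply because the timestamp differs.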
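The URL template above can be sketched as a small helper. This is only an illustration under stated assumptions: the class and method names are made up here, the "truncate a 4-digit id to 3 digits" rule is assumed to keep the first three digits (the post does not say which ones), and the unknown number and the 32-character hash are passed in as parameters because their derivation is not known.

```java
// Hypothetical helper: assembles a download URL from the observed template.
final class XiamiDownloadUrl {

    // Assumption: a 4-digit artist id keeps its first 3 digits in the first path segment.
    static String firstSegment(String artistId) {
        return artistId.length() == 4 ? artistId.substring(0, 3) : artistId;
    }

    // quality is 'h' or 'l', matching the {[hl]} placeholder in the template.
    static String build(String artistId, String albumId, String songId,
                        String unknownNumber, char quality,
                        String authHash, long epochSeconds) {
        return "http://m5.file.xiami.com/"
                + firstSegment(artistId) + "/" + artistId + "/" + albumId + "/"
                + songId + "_" + unknownNumber + "_" + quality + ".mp3"
                + "?auth_key=" + authHash + "-" + epochSeconds + "-0-null";
    }

    public static void main(String[] args) {
        long now = System.currentTimeMillis() / 1000; // {current_Time_Millis/1000}
        System.out.println(build("1234", "5678", "99", "42", 'h', "placeholderhash", now));
    }
}
```

Since the hash cannot be computed yet, a URL built this way will not actually authenticate; the helper only documents the structure of the link.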

Roam list request, returns JSON:

http://www.xiami.com/play/get-manyou-song?song_id={song_id}
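The rows of this response are printed as CSV by the `outCsvLine` method in the code further down, which joins values with commas and does no quoting. If a title or artist name contains a comma or a quote, the columns shift. A minimal sketch of quoting in the usual CSV style (the class and method names are illustrative, not from the post):

```java
import java.util.Arrays;
import java.util.Collection;
import java.util.stream.Collectors;

// Hypothetical helper: builds one CSV line with basic quoting.
final class CsvLines {

    // Wrap a value in double quotes when it contains a comma, quote, or line break;
    // embedded quotes are doubled.
    static String escape(Object value) {
        String s = String.valueOf(value);
        boolean needsQuotes = s.contains(",") || s.contains("\"")
                || s.contains("\n") || s.contains("\r");
        return needsQuotes ? "\"" + s.replace("\"", "\"\"") + "\"" : s;
    }

    // Join all values of a row into one comma-separated line (no trailing newline).
    static String line(Collection<?> values) {
        return values.stream().map(CsvLines::escape).collect(Collectors.joining(","));
    }

    public static void main(String[] args) {
        System.out.println(line(Arrays.asList("Movie Star", "JTR", "Touchdown, Deluxe")));
    }
}
```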

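Artist request, returns a static HTML page:

http://www.xiami.com/artist/{artist_id}

The UI that displays the results is a WPF front end written in C#.

Code snippets, for the record: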

    // Imports needed by the two methods below. The JSON classes are assumed to be json-simple;
    // IoUtil is a helper class that the post does not show.
    import java.io.FileOutputStream;
    import java.io.IOException;
    import java.net.MalformedURLException;
    import java.util.Arrays;
    import java.util.Collection;
    import org.json.simple.JSONArray;
    import org.json.simple.JSONObject;
    import org.json.simple.JSONValue;
    import org.jsoup.Jsoup;
    import org.jsoup.nodes.Document;
    import org.jsoup.nodes.Element;
    import org.jsoup.select.Elements;
    import org.xml.sax.SAXException;
    import com.meterware.httpunit.*;

    /**
     * Sample markup of one roam-list item on the player page:
     * <div class="ui-roam-item" ondblclick="SEIYAEVENT.roamDblclick(this, 1772621103)"
     *      id="J_roamItem1772621103">
     *   <div class="ui-roam-sort"><em data-type="roam" data-sid="1772621103"></em></div>
     *   <div class="ui-roam-item-column c1">Movie Star</div>
     *   <div class="ui-roam-item-column c2"><a href="http://www.xiami.com/artist/1294715676"
     *        target="_blank" title="JTR">JTR</a></div>
     *   <div class="ui-roam-item-column c3"><a href="http://www.xiami.com/album/1794716293"
     *        target="_blank" title="Touchdown">Touchdown</a></div>
     */
    public static void xiamiRoamTest(String song_id) throws IOException, SAXException {
        System.out.println("Sending the request and reading the response:");
        WebConversation wc = new WebConversation();
        HttpUnitOptions.setExceptionsThrownOnScriptError(false); // do not throw on JavaScript errors
        // http://www.xiami.com/play/get-manyou-song?song_id=1773737813&_ksTS=1434841673746_1939&callback=jsonp1940
        WebRequest req = new GetMethodWebRequest("http://www.xiami.com/play/get-manyou-song");
        req.setParameter("song_id", song_id);
        // req.setParameter("_ksTS", "1434841673746_1939");
        // Song title: jsonRow.get("title")
        // Artist:     "http://www.xiami.com/artist/" + jsonRow.get("artist_id")
        // Album:      "http://www.xiami.com/album/" + jsonRow.get("album_id")
        // Fetch the response
        WebResponse resp = wc.getResponse(req);
        if ("application/json".equals(resp.getContentType())) {
            String jsonText = resp.getText();
            JSONObject jsonResult = (JSONObject) JSONValue.parse(jsonText);
            JSONArray jsonData = (JSONArray) jsonResult.get("data");
            int count = jsonData.size();
            for (int idex = 0; idex < count; idex++) {
                JSONObject jsonRow = (JSONObject) jsonData.get(idex);
                if (0 == idex) {
                    outCsvLine(jsonRow.keySet(), System.out); // header row: the field names
                }
                outCsvLine(jsonRow.values(), System.out);
            }
        }
        // String jsonText = resp.getText();
        // System.out.println(jsonText);
        System.out.println("Page loaded : " + resp.getURL());
    }

    /** Writes the values as one comma-separated line (no quoting or escaping). */
    public static void outCsvLine(Collection<?> collection, Appendable appendable) throws IOException {
        boolean isFirst = true;
        for (Object col : collection) {
            if (isFirst) {
                isFirst = false;
            } else {
                appendable.append(',');
            }
            appendable.append(String.valueOf(col));
        }
        appendable.append('\n');
    }

    public static void xiamiCrawArtistlTest(String artist_id) throws Exception {
        // HttpUnit
        System.out.println("Sending the request and reading the response:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // wc.setExceptionsThrownOnErrorStatus(false);
        // Request the given URL
        // http://www.xiami.com/play?ids=/song/playlist/id/1316821699/type/1#loaded
        // "http://www.xiami.com/artist/62500?spm=a1z1s.7154410.1996860241.133.9mIf7i"
        WebRequest req = new GetMethodWebRequest("http://www.xiami.com/artist/" + artist_id);
        HttpUnitOptions.setExceptionsThrownOnScriptError(false); // do not throw on JavaScript errors
        // Dynamically loaded files (external JavaScript and stylesheets referenced by tags) show up here
        wc.addClientListener(new WebClientListener() {
            public void requestSent(WebClient webclient, WebRequest webrequest) {
                try {
                    System.out.println("WebClientListener.requestSent() " + webrequest.getURL());
                } catch (MalformedURLException e) {
                    e.printStackTrace();
                }
            }

            public void responseReceived(WebClient webclient, WebResponse webresponse) {
                // FileOutputStream fos;
                // try {
                //     fos = new FileOutputStream(new File(webresponse.getURL().getPath()).getName());
                //     IoUtil.copyLarge(webresponse.getInputStream(), fos, 0, -1, new byte[8192]);
                //     fos.close();
                // } catch (FileNotFoundException e) {
                //     e.printStackTrace();
                // } catch (IOException e) {
                //     e.printStackTrace();
                // }
                System.out.println("WebClientListener.responseReceived() " + webresponse.getURL());
            }
        });
        wc.addWindowListener(new WebWindowListener() {
            public void windowOpened(WebClient webclient, WebWindow webwindow) {
                System.out.println("WebWindowListener.windowOpened(" + webclient + ", " + webwindow + ")");
            }

            public void windowClosed(WebClient webclient, WebWindow webwindow) {
                System.out.println("WebWindowListener.windowClosed(" + webclient + ", " + webwindow + ")");
            }
        });
        // Fetch the response
        WebResponse resp = wc.getResponse(req);
        // System.out.println("getExternalStyleSheet : " + resp.getExternalStyleSheet());
        // System.out.println("getScriptableObject.getURL : " + resp.getScriptableObject().getURL());
        // Jsoup has good selectors but focuses on static documents; HttpUnit handles dynamic documents
        // but has no good selectors. So HttpUnit makes the request and Jsoup parses the result.
        Document jpDoc = Jsoup.parse(resp.getInputStream(), resp.getCharacterSet(), resp.getURL().getHost());
        Elements jpTables = jpDoc.select("#artist_trends TABLE.track_list");
        Element jpTable = jpTables.get(0);
        Elements jpTRs = jpTable.getElementsByTag("tr");
        // Each row looks like this:
        // <td class="song_name"> <a href="/song/1769937495" title="">Super Mario Bros. - Small Mario Jump</a> </td>
        // <td class="song_hot">5440892</td>
        // <td class="song_hot_bar"><span style="width:99%;">&nbsp;</span></td>
        String[] header = new String[] { "song_name", "song_id", "song_hot", "song_hot_percent" };
        System.out.println(Arrays.toString(header));
        for (int i = 0; i < jpTRs.size(); i++) {
            Element jpTR = jpTRs.get(i);
            // Fresh array per row, so a missing cell is not filled with the previous row's value
            String[] cells = new String[header.length];
            Elements jpAs = jpTR.select("td.song_name a");
            if (null != jpAs && 0 < jpAs.size()) {
                cells[0] = jpAs.get(0).text();       // song_name: the link text
                cells[1] = jpAs.get(0).attr("href"); // song_id: the href, e.g. /song/1769937495
            }
            Elements jpTDs = jpTR.select("td.song_hot");
            if (null != jpTDs && 0 < jpTDs.size()) {
                cells[2] = jpTDs.get(0).text();
            }
            Elements jpSPANs = jpTR.select("td.song_hot_bar>span");
            if (null != jpSPANs && 0 < jpSPANs.size()) {
                cells[3] = jpSPANs.get(0).attr("style"); // e.g. width:99%;
            }
            System.out.println(Arrays.toString(cells));
        }
        // Save a copy of the whole page
        FileOutputStream fos = new FileOutputStream("xiami.html");
        IoUtil.copyLarge(resp.getInputStream(), fos, 0, -1, new byte[8192]);
        fos.close();
        System.out.println("Finished : " + resp.getURL());
    }
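The scraping code above stores the raw `href` (for example `/song/1769937495`) and the raw `style` attribute (for example `width:99%;`) in the output cells. If the bare song id and the bare percentage are wanted, a regular expression per field is enough. A minimal sketch; the class and method names are illustrative and the patterns assume the two formats shown in the sample row:

```java
import java.util.regex.Matcher;
import java.util.regex.Pattern;

// Hypothetical helper: pulls the numeric parts out of the scraped attributes.
final class TrackCells {

    private static final Pattern SONG_HREF = Pattern.compile("/song/(\\d+)");
    private static final Pattern WIDTH = Pattern.compile("width\\s*:\\s*(\\d+(?:\\.\\d+)?)%");

    // "/song/1769937495" yields "1769937495"; an empty string if the href has another shape.
    static String songId(String href) {
        Matcher m = SONG_HREF.matcher(href);
        return m.find() ? m.group(1) : "";
    }

    // "width:99%;" yields "99"; an empty string if no percentage width is present.
    static String hotPercent(String style) {
        Matcher m = WIDTH.matcher(style);
        return m.find() ? m.group(1) : "";
    }

    public static void main(String[] args) {
        System.out.println(songId("/song/1769937495") + " " + hotPercent("width:99%;"));
    }
}
```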

Other ready-made tools for working with HTML:

jsdom

https://github.com/princehaku/pyrailgun

WebCollector (Java crawler)

Nutch (Java, search)

Automated screenshots in multi-language environments with Selenium WebDriver

larbin-2.6.3.tar (C++)

(in Chinese)

import java.io.IOException;
import java.io.PrintWriter;

import javax.servlet.ServletException;
import javax.servlet.http.HttpServlet;
import javax.servlet.http.HttpServletRequest;
import javax.servlet.http.HttpServletResponse;

import junit.framework.Assert;

import org.xml.sax.SAXException;

import com.meterware.httpunit.GetMethodWebRequest;
import com.meterware.httpunit.HTMLElement;
import com.meterware.httpunit.PostMethodWebRequest;
import com.meterware.httpunit.WebConversation;
import com.meterware.httpunit.WebForm;
import com.meterware.httpunit.WebLink;
import com.meterware.httpunit.WebRequest;
import com.meterware.httpunit.WebResponse;
import com.meterware.httpunit.WebTable;
import com.meterware.servletunit.InvocationContext;
import com.meterware.servletunit.ServletRunner;
import com.meterware.servletunit.ServletUnitClient;

/**
 * <h1>Summary</h1>
 * <p>
 * HttpUnit is an integration testing tool aimed at web applications. Its helper classes let a tester
 * talk to a server from Java and handle the server's response either as text or as a DOM object.<br>
 * HttpUnit also provides a simulated servlet container, so a servlet's internal code can be tested
 * without deploying the servlet. This article shows how to use the HttpUnit classes for integration testing.
 * </p>
 * <h2>1 About HttpUnit</h2>
 * <p>
 * HttpUnit is an open-source project hosted on SourceForge. It is a testing framework built on JUnit
 * that targets web applications and fills the gap that JUnit alone cannot test remote web content.<br>
 * At the time of the original article the latest version was 1.5.4. HttpUnit requires JDK 1.3.1 or later.
 * </p>
 * <h3>1.1 How it works</h3>
 * HttpUnit simulates a browser: it handles frames, cookies, redirects and so on. With it you can
 * exchange data with a server and treat the returned page as plain text, as an XML DOM object, or as a
 * collection of links, frames, images, forms, tables and so on, then test it with JUnit. You can also
 * navigate to a new page and process that page, which makes it possible to handle a chain of pages
 * belonging to one operation.
 * <h3>1.2 Comparison with commercial tools</h3>
 * Commercial tools usually test by record and replay. The drawback is that once the page design
 * changes, the recorded actions can no longer be reused and must be recorded again.
 * <p>
 * For example: if an element was first designed as a radio button, the tool records your radio
 * selection. If the design then changes to a drop-down list, or to a text box that accepts user input,
 * the recorded test is invalid and has to be recorded again.
 * </p>
 * HttpUnit focuses on the content of these controls, so however their appearance changes, the tests
 * you have already written stay reusable.
 * <p>
 * More information about HttpUnit is on its home page: http://httpunit.sourceforge.net
 * </p>
 * <h2>How to process page content with HttpUnit</h2>
 * WebConversation is the most important class in HttpUnit; it simulates the browser. Other important classes:
 * <ul>
 * <li>WebRequest models a client request and is used to send data to the server.</li>
 * <li>WebResponse models the server's response as the browser receives it.</li>
 * </ul>
 *
 * @author Sansan
 * @see http://www.blogjava.net/relax/archive/2005/01/27/743.html
 */
public class JHttpUnitTest {

    public static void main(String[] args) {
        testWithException();
    }

    public static void testWithException() {
        try {
            httpGetBaidu();
        } catch (IOException e) {
            e.printStackTrace();
        } catch (SAXException e) {
            e.printStackTrace();
        }
    }

    public static void httpGetBaidu() throws IOException, SAXException {
        System.out.println("Sending the request and reading the response:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        wc.setExceptionsThrownOnErrorStatus(false);
        // Request the given URL
        WebRequest req = new GetMethodWebRequest("http://www.baidu.com");
        // Fetch the response
        WebResponse resp = wc.getResponse(req);
        // Fill in the search box and locate the search button
        HTMLElement searchInput = resp.getElementWithID("kw");
        searchInput.setAttribute("value", "HttpUnit");
        HTMLElement searchButton = resp.getElementWithID("su");
        // Simulate a user click
        searchButton.handleEvent("click");
        // Get the page that is current after the click
        WebResponse currentPage = wc.getCurrentPage();
        // Print the full text of that page to the console
        System.out.println(currentPage.getText());
    }

    // 4.1 Fetching the content of a page

    /**
     * 4.1.2 Request a page with GET and add a parameter
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpGetExample() throws IOException, SAXException {
        System.out.println("Sending data to the server with GET, then reading the page:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Request the given URL
        WebRequest req = new GetMethodWebRequest("http://localhost:6888/HelloWorld.jsp");
        // Add a parameter to the request
        req.setParameter("username", "姓名");
        // Fetch the response
        WebResponse resp = wc.getResponse(req);
        // Print the full response text to the console
        System.out.println(resp.getText());
    }

    /**
     * 4.1.3 Request a page with POST and add a parameter
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpPostExample() throws IOException, SAXException {
        System.out.println("Sending data to the server with POST, then reading the page:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Request the given URL
        WebRequest req = new PostMethodWebRequest("http://localhost:6888/HelloWorld.jsp");
        // Add a parameter to the request
        req.setParameter("username", "姓名");
        // Fetch the response
        WebResponse resp = wc.getResponse(req);
        // Print the full response text to the console
        System.out.println(resp.getText());
    }

    /**
     * 4.2 Handling links in a page (same steps as {@link #httpAndHandleLink()})
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpAndHandleElement() throws IOException, SAXException {
        // Find a link in the page, simulate a user click, and read the content it points to.
        // Here HelloWorld.html contains a link whose text is TestLink and which points to
        // TestLink.htm; that page only shows the characters "TestLink.html".
        System.out.println("Reading the page a link points to:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Fetch the response
        WebResponse resp = wc.getResponse("http://localhost:6888/HelloWorld.html");
        // Get the link object
        WebLink link = resp.getLinkWith("TestLink");
        // Simulate a user click
        link.click();
        // Get the page that is current after the click
        WebResponse nextLink = wc.getCurrentPage();
        // Print the full response text to the console
        System.out.println(nextLink.getText());
    }

    /**
     * 4.2 Handling links in a page
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpAndHandleLink() throws IOException, SAXException {
        // Find a link in the page, simulate a user click, and read the content it points to.
        // Here HelloWorld.html contains a link whose text is TestLink and which points to
        // TestLink.htm; that page only shows the characters "TestLink.html".
        System.out.println("Reading the page a link points to:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Fetch the response
        WebResponse resp = wc.getResponse("http://localhost:6888/HelloWorld.html");
        // Get the link object
        WebLink link = resp.getLinkWith("TestLink");
        // Simulate a user click
        link.click();
        // Get the page that is current after the click
        WebResponse nextLink = wc.getCurrentPage();
        // Print the full response text to the console
        System.out.println(nextLink.getText());
    }

    /**
     * 4.3 Handling tables in a page
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpAndHandleTable() throws IOException, SAXException {
        // Tables are a common way to lay out a page. HttpUnit exposes the tables of a page as an
        // array: resp.getTables() returns all table objects in the order they appear in the page.
        // [Note] Java arrays are zero-based, so the first table is resp.getTables()[0], and so on.
        // The example below reads the first table of the page and prints every cell:
        System.out.println("Reading the content of a table:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Fetch the response
        WebResponse resp = wc.getResponse("http://localhost:6888/HelloWorld.html");
        // Get the table object
        WebTable webTable = resp.getTables()[0];
        // Convert the table content to a two-dimensional string array
        String[][] datas = webTable.asText();
        // Print every cell
        int m = datas[0].length;
        int n = datas.length;
        for (int i = 0; i < n; i++) {
            for (int j = 0; j < m; j++) {
                System.out.println("Row " + (i + 1) + ", column " + (j + 1) + " contains: " + datas[i][j]);
            }
        }
    }

    /**
     * 4.4 Handling forms in a page
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpAndHandleForm() throws IOException, SAXException {
        // Forms accept user input and can also show what the user entered earlier (for example
        // when the user is asked to edit existing data). HttpUnit exposes the forms of a page as
        // an array: resp.getForms() returns all form objects in the order they appear in the page.
        // [Note] Java arrays are zero-based, so the first form is resp.getForms()[0], and so on.
        // The example below reads the first form of the page and prints every control:
        System.out.println("Reading the content of a form:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Fetch the response
        WebResponse resp = wc.getResponse("http://localhost:6888/HelloWorld.html");
        // Get the form object
        WebForm webForm = resp.getForms()[0];
        // Get the names of all controls in the form
        String[] pNames = webForm.getParameterNames();
        // Print the name and value of every control
        for (int i = 0; i < pNames.length; i++) {
            System.out.println("Control " + (i + 1) + " is named " + pNames[i]
                    + " and contains " + webForm.getParameterValue(pNames[i]));
        }
    }

    // 5 Testing with HttpUnit

    /**
     * 5.1 Testing page content
     *
     * @throws IOException
     * @throws SAXException
     */
    public static void httpAndDoWebContent() throws IOException, SAXException {
        // This kind of test follows plain JUnit practice: compare the expected result with what
        // the page actually outputs. Here that is just a string-to-string comparison.
        // Suppose the first table of the page is expected to show a given value in row 1,
        // column 1; the code below automates that check:
        System.out.println("Reading the content of a table and testing it:");
        // Create a WebConversation instance
        WebConversation wc = new WebConversation();
        // Fetch the response
        WebResponse resp = wc.getResponse("http://localhost:6888/TableTest.html");
        // Get the table object
        WebTable webTable = resp.getTables()[0];
        // Convert the table content to a two-dimensional string array
        String[][] datas = webTable.asText();
        // Test the table content
        String expect = "中文";
        Assert.assertEquals(expect, datas[0][0]);
    }

    /*
     * 5.2 Testing servlets
     *
     * Besides page content, you sometimes need to test the code of a servlet itself (for example
     * when developing complex servlets). HttpUnit provides a simulated servlet container, so the
     * servlet can be tested directly without deploying it to a real container such as Tomcat.
     *
     * 5.2.1 How it works
     *
     * To test a servlet, create a ServletRunner instance; it simulates the container environment.
     * To test a single servlet, register it with registerServlet. To configure several servlets,
     * write your own web.xml and pass its location to the ServletRunner constructor.
     *
     * Use ServletUnitClient as the client. Like WebConversation it extends WebClient, so the two
     * are used in much the same way. The difference to keep in mind is that ServletUnitClient
     * ignores the host part of the URL and always goes to the environment simulated by its
     * ServletRunner.
     */

    /*
     * 5.2.2 A simple test
     *
     * This example only shows how to call a servlet and read its output. The servlet simply
     * answers every request with the string "Hello World!".
     */

    // 5.2.2.1 The servlet:
    public static class MyServlet extends HttpServlet {
        public void service(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            PrintWriter out = resp.getWriter();
            // Write the string Hello World! to the browser
            out.println("Hello World!");
            out.close();
        }
    }

    // 5.2.2.2 The calling test code:
    public static void httpServletExmaple() throws IOException, SAXException {
        // Create the servlet runtime environment
        ServletRunner sr = new ServletRunner();
        // Register the servlet with the environment
        sr.registerServlet("myServlet", MyServlet.class.getName());
        // Create the client that calls the servlet
        ServletUnitClient sc = sr.newClient();
        // Send the request
        WebRequest request = new GetMethodWebRequest("http://localhost/myServlet");
        // Get the response from the simulated server
        WebResponse response = sc.getResponse(request);
        // Print the result to the console
        System.out.println(response.getText());
    }

    /*
     * 5.2.3 Testing a servlet's internal behavior
     *
     * For developers, testing only the request and the returned data is not enough, so the
     * ServletRunner simulator also lets you test the internals of the servlet being called.
     * Unlike the simple test, this one obtains the servlet's environment through an
     * InvocationContext, which gives access to the request and response objects and to the
     * servlet's own non-service methods.
     *
     * The code below uses HttpUnit's simulated container and an InvocationContext to exercise
     * most of a servlet's internal behavior, such as controlling the request, session and response.
     */

    // 5.2.3 The servlet under test:
    public static class InternalServlet extends HttpServlet {
        public void service(HttpServletRequest req, HttpServletResponse resp) throws IOException {
            PrintWriter out = resp.getWriter();
            // Write the string Hello World! to the browser
            out.println("Hello World!");
            out.close();
        }

        public void myMethod() {
            System.out.println("InternalServlet.myMethod()");
        }
    }

    /**
     * [Note]
     *
     * Before testing the servlet, you must do through the InvocationContext the work that the
     * servlet's service method would normally do, because service is not called when the
     * InvocationContext is obtained via newInvocation.
     *
     * @throws IOException
     * @throws SAXException
     * @throws ServletException
     */
    public static void httpServletExmaple2() throws IOException, SAXException, ServletException {
        // Create the servlet runtime environment
        ServletRunner sr = new ServletRunner();
        // Register the servlet with the environment
        sr.registerServlet("InternalServlet", InternalServlet.class.getName());
        // Create the client that calls the servlet
        ServletUnitClient sc = sr.newClient();
        // Build the request
        WebRequest request = new GetMethodWebRequest("http://localhost/InternalServlet");
        request.setParameter("pwd", "pwd");
        // Get the invocation context of this request
        InvocationContext ic = sc.newInvocation(request);
        // Call a non-service method of the servlet
        InternalServlet is = (InternalServlet) ic.getServlet();
        is.myMethod();
        // Read the request object directly from the context
        System.out.println("Value read from the request: " + ic.getRequest().getParameter("pwd"));
        // Get the response object from the context and write to the client
        ic.getResponse().getWriter().write("haha");
        // Get the session from the context and set a value on it
        ic.getRequest().getSession().setAttribute("username", "timeson");
        // Read the value back from the session
        System.out.println("Value in the session: " + ic.getRequest().getSession().getAttribute("username"));
        // Get the response through the client side and print it
        WebResponse response = ic.getServletResponse();
        System.out.println(response.getText());
    }

    /*
     * 6 Summary
     *
     * The article demonstrated how to use the HttpUnit classes for integration testing, mainly:
     *
     * 1. Simulating a user sending requests with parameters to a server
     *
     * 2. Simulating a user receiving the server's response, analyzing it with the helper classes,
     *    and testing it with JUnit
     *
     * 3. Using HttpUnit's simulated servlet container to test the internal behavior of a servlet
     *    under development
     */
}

HttpUnit excerpt

import java.io.FileWriter;
import java.io.IOException;
import java.net.URL;

import com.gargoylesoftware.htmlunit.FailingHttpStatusCodeException;
import com.gargoylesoftware.htmlunit.WebClient;
import com.gargoylesoftware.htmlunit.WebRequest;
import com.gargoylesoftware.htmlunit.html.HtmlPage;

public class HtmlUnit {

    /**
     * Notes:
     *
     * 1. HtmlUnit throws a great many exceptions while it runs, so be prepared.
     *
     * 2. On particularly awkward pages HtmlUnit may still fail to return complete, correct content,
     * but getting roughly correct content is no trouble; after that, something like Jsoup does the extraction.
     *
     * 3. HtmlUnit is powerful and leaves plenty of room to explore.
     *
     * <pre>
     * WebClient wc = new WebClient();
     * HtmlPage page = (HtmlPage) wc.getPage(&quot;http://www.google.com&quot;);
     * HtmlForm form = page.getFormByName(&quot;f&quot;);
     * HtmlSubmitInput button = (HtmlSubmitInput) form.getInputByName(&quot;btnG&quot;);
     * HtmlPage page2 = (HtmlPage) button.click();
     * </pre>
     *
     * @param args
     * @throws FailingHttpStatusCodeException
     * @throws IOException
     */
    public static void main(String[] args) throws FailingHttpStatusCodeException, IOException {
        String url = "http://www.xiami.com/play?ids=/song/playlist/id/1/type/9"; // the page to fetch
        String refer = "http://www.xiami.com/play";
        URL link = new URL(url);
        WebClient wc = new WebClient();
        WebRequest request = new WebRequest(link);
        request.setCharset("UTF-8");
        // request.setProxyHost("120.120.120.x");
        // request.setProxyPort(8080);
        // Set the Referer header of the request
        request.setAdditionalHeader("Referer", refer);
        // Set the User-Agent header of the request
        request.setAdditionalHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        // wc.addRequestHeader("User-Agent", "Mozilla/5.0 (Windows NT 5.1; rv:6.0.2) Gecko/20100101 Firefox/6.0.2");
        // wc.addRequestHeader and request.setAdditionalHeader should do the same thing; pick one.
        // Other headers can be added as needed.
        wc.getCookieManager().setCookiesEnabled(true); // enable cookie handling
        wc.getOptions().setJavaScriptEnabled(true);    // enable JavaScript; required for awkward pages
        wc.getOptions().setCssEnabled(true);           // enable CSS; required for awkward pages
        wc.getOptions().setThrowExceptionOnFailingStatusCode(false);
        wc.getOptions().setThrowExceptionOnScriptError(false);
        wc.getOptions().setTimeout(10000);
        // Set cookies here if you have any:
        // Set<Cookie> cookies = null;
        // Iterator<Cookie> i = cookies.iterator();
        // while (i.hasNext()) {
        //     wc.getCookieManager().addCookie(i.next());
        // }
        // Preparation is done; fetch the page
        HtmlPage page = wc.getPage(request);
        if (page == null) {
            System.out.println("Fetching " + url + " failed!!!");
            return;
        }
        String content = page.asXml(); // the page content
        if (content == null) {
            System.out.println("Fetching " + url + " failed!!!");
            return;
        }
        FileWriter fw = new FileWriter("xiami2.html");
        fw.write(content);
        fw.close();
        // Done.
        // CookieManager CM = wc.getCookieManager(); // wc is the WebClient
        // Set<Cookie> cookies_ret = CM.getCookies(); // the returned cookies, possibly useful for the next request
    }
}

HtmlUnit excerpt

import java.io.File;
import java.io.IOException;

import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;

/**
 * <h1>Parsing and manipulating HTML documents with jsoup</h1>
 * jsoup is an HTML parser for Java that can parse a URL or a string of HTML directly. Its API takes
 * little effort to use: data can be extracted and manipulated through the DOM, CSS selectors, and
 * jQuery-like methods. This class walks through the common ways to parse HTML with jsoup.<br>
 * It is small, focused on parsing document text, supports selectors, and can sanitize untrusted
 * HTML (to prevent XSS).
 *
 * @see http://www.ibm.com/developerworks/cn/java/j-lo-jsouphtml/ (in Chinese)
 * @see http://jsoup.org/cookbook/extracting-data/selector-syntax
 * @author Sansan
 */
public class JsonpTest {

    public static void main(String[] args) {
        try {
            getTest();
        } catch (Exception e) {
            e.printStackTrace();
        }
    }

    /**
     * <h1>Document input</h1>
     * jsoup can load an HTML document from a string, a URL, or a local file and build a Document from it.
     * <p>
     * Note the third argument of parse in the last input method below. Why pass a URL there (it can be
     * omitted, as in the first method)? An HTML document contains many links, images, and references to
     * external scripts and CSS files. The third argument, baseURL, is the prefix jsoup uses to resolve
     * such references into absolute URLs when the document refers to them by relative path.<br>
     * For example, {@code <a href="/project">Open source software</a>} resolves to
     * {@code <a href="http://www.oschina.net/project">Open source software</a>}.
     * </p>
     *
     * @throws IOException
     */
    public static void parseTest() throws IOException {
        System.out.println("Jsoup Fetching...");
        Document doc;
        // Parse an HTML document directly from a string
        String html = "<html><head><title> 开源中国社区 </title></head>"
                + "<body><p> 这里是 jsoup 项目的相关文章 </p></body></html>";
        doc = Jsoup.parse(html);
        // Load an HTML document from a URL, the simple way
        doc = Jsoup.connect("http://www.oschina.net/").get();
        String title = doc.title();
        // Load an HTML document from a URL, with parameters
        doc = Jsoup.connect("http://www.oschina.net/")
                .data("query", "Java")    // request parameter
                .userAgent("I'm jsoup")   // User-Agent
                .cookie("auth", "token")  // cookie
                .timeout(3000)            // connection timeout in milliseconds
                .post();                  // request the URL with POST
        // Load an HTML document from a file
        File input = new File("D:/test.html");
        doc = Jsoup.parse(input, "UTF-8", "http://www.oschina.net/");
        System.out.println("Document = " + doc);
    }

    public static void getTest() throws IOException {
        System.out.println("Jsoup.connect");
        Document doc = Jsoup.connect("http://www.xiami.com/artist/62500?spm=a1z1s.7154410.1996860241.133.9mIf7i").get();
        System.out.println("Document = " + doc);
    }

    public static void baiduSearchTest() throws IOException {
        System.out.println("Jsoup.connect");
        Document doc = Jsoup.connect("http://www.baidu.com/").get();
        // System.out.println("Document = " + doc);
        Element searchInput = doc.getElementById("kw");
        System.out.println("searchInput = " + searchInput);
        searchInput.val("Jsoup");
        Element searchButton = doc.getElementById("su");
        System.out.println("searchButton = " + searchButton);
    }
}

Jsoup excerpt
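The base URL argument discussed in the Jsoup code above exists so that relative references can be turned into absolute ones. The resolution rule itself is ordinary URL resolution and can be tried with the JDK alone. A minimal sketch using `java.net.URI` (the class and method names are illustrative; this shows the rule, not jsoup's own implementation):

```java
import java.net.URI;

// Hypothetical demo: resolves a relative href against a base URL, as a browser would.
final class BaseUrlDemo {

    static String absolutize(String baseUrl, String href) {
        return URI.create(baseUrl).resolve(href).toString();
    }

    public static void main(String[] args) {
        // A root-relative path replaces the whole path of the base URL
        System.out.println(absolutize("http://www.oschina.net/", "/project"));
        // A plain relative path is appended to the directory of the base URL
        System.out.println(absolutize("http://www.oschina.net/news/", "img/a.png"));
    }
}
```

Inside jsoup, the resolved value of an attribute can be read with `element.absUrl("href")` once the document has a base URI.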

<?xml version="1.0" encoding="UTF-8"?>
<html xmlns="http://www.w3.org/1999/xhtml" xml:lang="en">
<head>
<title>Apache Tomcat WebSocket Test</title>
<style type="text/css">
#connect-container { float: left; width: 400px }
#connect-container div { padding: 5px; }
#console-container { float: left; margin-left: 15px; width: 400px; }
#console { border: 1px solid #CCCCCC; border-right-color: #999999; border-bottom-color: #999999; height: 170px; overflow-y: scroll; padding: 5px; width: 100%; }
#console p { padding: 0; margin: 0; }
table.mytable { margin-top: 2px; border-collapse: collapse; }
/*div:hover {border: 1px solid red;}*/
table.mytable tbody { display: table-row-group; vertical-align: middle; }
table.mytable th { padding: 3px; height: 15px; vertical-align: middle; text-align: center; background-color: #efefef; border: 1px solid #c3c3c3; }
table.mytable td { padding: 2px; vertical-align: middle; border: 1px solid #dadada; }
</style></head>
<body><div class="noscript"><h2 style="color: #ff0000">Chrome (a WebKit-based browser) can open local HTML files (file protocol: double-click the file or drag it into Chrome). Compared with viewing over HTTP from a web server, however, this has several limitations, notably with ajax and cookies.</h2><ul><li>An ajax request for a local file fails with a cross-origin error like this:<br>XMLHttpRequest cannot load file:///E:/webs/xx.txt?param=2222. Cross origin requests are only supported for protocol schemes: http, data, chrome-extension, https, chrome-extension-resource.<br>The fix is to start Chrome with the flag --allow-file-access-from-files; local ajax requests then no longer raise the cross-origin error. (When passing several flags, each "--" must be preceded by a space, e.g. "C:\Program Files\Google\Chrome\Application\chrome.exe" --enable-file-cookies --allow-file-access-from-files)<br>The --allow-file-access-from-files flag also fixes two other problems with files loaded over file: an iframe and its parent page being unable to access each other, and opener being null in a page opened with window.open.</li><li>By default Chrome (a WebKit-based browser) only keeps cookies set by pages served over HTTP. Cookies set during local testing (file protocol: double-click or drag into Chrome) are not saved, as this JavaScript console session shows:<p><a href="http://www.coding123.net/article/20140701/chrome-webkit-save-local-set-cookie.aspx">Figure</a>document.cookie='abc=123'<br>"abc=123"<br>document.cookie<br>""</p> The figure above shows that Chrome's default configuration does not save cookies set by local files.<br>To make Chrome keep them, add the --enable-file-cookies flag to the Chrome shortcut: right-click the Chrome desktop shortcut, choose Properties, and append --enable-file-cookies to the Target field, with a space before the "--". Cookies can then be saved during local testing.</li></ul></div>
<div><div id="connect-container"><div><button id="connect" onclick="connect();">Connect</button><button id="disconnect" disabled="disabled" onclick="disconnect();">Disconnect</button><button id="clearlog" onclick="document.getElementById('console').innerHTML='';">Clear Message Log</button></div><div><textarea id="message" style="width: 350px">Here is a message!</textarea></div><div><button id="echo" onclick="echo();" disabled="disabled">Echo message</button><button id="roam">Roam</button></div></div><div id="console-container"><div id="console" /></div><!-- The float set above takes the elements out of the document flow, so they no longer affect the parent. The extra node with clear:both clears the float and stretches the outer div. --><div style="clear:both;"></div></div>
<div><table id="roam_result" class="mytable"><tr><th class='th_song_id'>Song ID</th><th class='th_album_id'>Album ID</th><th class='th_tryhq'>Step description</th><th class='th_artist'>Artist</th><th class='th_insert_type'>Type</th></tr></table></div>
<h3>Today's recommended playlist</h3>
<IFRAME id="xiamiFrame" width="100%" height="90%" src="http://www.xiami.com/play?ids=/song/playlist/id/1/type/9" onload="onLoad()"></IFRAME>
<script type="text/javascript" src="jquery.js"></script>
<script type="text/javascript">
$("#roam").click(function () {
    var fieldNames = ["song_id", "album_id", "tryhq", "artist", "insert_type"];
    // http://www.xiami.com/play/get-manyou-song?song_id=1773737813&_ksTS=1434841673746_1939&callback=jsonp1940
    $.getJSON("http://www.xiami.com/play/get-manyou-song?song_id=1773737813&_ksTS=1434841673746_1939", function (result) {
        var rows = result.data;
        // var $RoamResult = $("#roam_result");
        var eRoamResult = document.getElementById("roam_result");
        var f = ["td_song_id", "td_album_id", "td_tryhq", "td_artist", "td_insert_type"];
        // for (var idx = 0; idx < result.length; idx++) { var jsonRow = rows[idx]; }
        $.each(result.data, function (i, jsonRow) {
            var eRow = eRoamResult.insertRow();
            for (var iCol = 0; iCol < 5; iCol++) {
                var eCell = eRow.insertCell();
                eCell.className = f[iCol];
                eCell.innerHTML = jsonRow[fieldNames[iCol]];
            }
        });
    });
});
/*
jQuery's $.getJSON() loads JSON-encoded data from the server with an HTTP GET request. Usage:
$.getJSON( url [, data ] [, success(data, textStatus, jqXHR) ] )
url: required. The URL the request is sent to, i.e. the address of the JSON data.
data: optional. Data sent to the server with the request.
success(data, status, xhr): optional. A callback that runs when the request succeeds and handles the returned data.
Callback arguments:
data - the data returned by the server
status - the request status ("success", "notmodified", "error", "timeout", "parsererror")
xhr - the XMLHttpRequest object
$("button").click(function(){
    $.getJSON("demo_ajax_json.js", function(result){ //result = []
        $.each(result, function(i, field){
            $("div").append(field + " ");
        });
    });
});
*/
var ws = null;
function setConnected(connected) {
    document.getElementById('connect').disabled = connected;
    document.getElementById('disconnect').disabled = !connected;
    document.getElementById('echo').disabled = !connected;
}
function connect() {
    var target = document.getElementById('target').value;
    if (target == '') {
        alert('Please select server side connection implementation.');
        return;
    }
    if ('WebSocket' in window) {
        ws = new WebSocket(target);
    } else if ('MozWebSocket' in window) {
        ws = new MozWebSocket(target);
    } else {
        alert('WebSocket is not supported by this browser.');
        return;
    }
    ws.onopen = function () {
        setConnected(true);
        log('Info: WebSocket connection opened.');
    };
    ws.onmessage = function (event) {
        log('Received:' + event.data);
    };
    ws.onclose = function (event) {
        setConnected(false);
        log('Info: WebSocket connection closed, Code:' + event.code + (event.reason == "" ? "" : ", Reason:" + event.reason));
    };
}
function disconnect() {
    if (ws != null) {
        ws.close();
        ws = null;
    }
    setConnected(false);
}
function echo() {
    if (ws != null) {
        var message = document.getElementById('message').value;
        log('Sent:' + message);
        ws.send(message);
    } else {
        alert('WebSocket connection not established, please connect.');
    }
}
function updateTarget(target) {
    var host = window.location.host || "127.0.0.1:8080";
    if (window.location.protocol == 'https:') {
        document.getElementById('target').value = 'wss://' + host + target;
    } else {
        document.getElementById('target').value = 'ws://' + host + target;
    }
}
function log(message) {
    var console = document.getElementById('console');
    var p = document.createElement('p');
    p.style.wordWrap = 'break-word';
    p.appendChild(document.createTextNode(message));
    console.appendChild(p);
    while (console.childNodes.length > 25) {
        console.removeChild(console.firstChild);
    }
    console.scrollTop = console.scrollHeight;
}
document.addEventListener("DOMContentLoaded", function () {
    // Remove elements with "noscript" class - <noscript> is not allowed in XHTML
    var noscripts = document.getElementsByClassName("noscript");
    for (var i = 0; i < noscripts.length; i++) {
        noscripts[i].parentNode.removeChild(noscripts[i]);
    }
}, false);
function onLoad() {
    var xiamiFrame = document.getElementById('xiamiFrame');
    // Cross-origin: the iframe's window object can be obtained, but almost none of its properties or methods are usable
    var xiamiWindow = xiamiFrame.contentWindow;
    try {
        // Cross-origin: the document property of the iframe's window cannot be read
        var xiaDocument = xiamiWindow.document;
        // Likewise, the name property of the iframe's window cannot be read
        var xiamiWindowName = xiamiWindow.name;
    } catch (e) {
        console.log(e)
    }
    console.log(xiamiWindow);
    console.log(xiaDocument);
    console.log(xiamiWindowName);
}
</script>
</body>
</html>

Cross-origin access has to be enabled for this page to work.
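The roam-list URL recorded in the page's comments carries a `callback=jsonp1940` parameter, which is the usual way a browser page reads JSON across origins. Assuming the server then wraps the JSON in a call such as `jsonp1940({...})` (the post does not show such a response), a Java client that sends the same parameter would have to strip that wrapper before parsing. A minimal sketch with illustrative names:

```java
// Hypothetical helper: removes a JSONP wrapper such as jsonp1940({...}); from a response body.
final class Jsonp {

    static String unwrap(String body) {
        String s = body.trim();
        // Plain JSON (object or array) is returned unchanged
        if (s.startsWith("{") || s.startsWith("[")) {
            return s;
        }
        int open = s.indexOf('(');
        int close = s.lastIndexOf(')');
        if (open < 0 || close < open) {
            return s; // not a recognizable wrapper, leave it alone
        }
        return s.substring(open + 1, close).trim();
    }

    public static void main(String[] args) {
        System.out.println(unwrap("jsonp1940({\"data\":[]});"));
    }
}
```

On the Java side the simpler option is to leave out the `callback` parameter altogether, as the HttpUnit snippet earlier in the post does, since the same-origin restriction only applies inside a browser.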

Reposted from: https://www.cnblogs.com/Fang3s/p/4592372.html
