如何使用PHP开发爬虫功能去爬取网页资讯

如何使用PHP开发爬虫功能去爬取网页资讯

一．网页数据分析

A ) 爬取目标：

比如我们现在需要爬取网页https://brands.cnblogs.com/ 中资讯列表数据。

那我们现在应该怎么分析呢？

图一：网页资讯列表

在本文中为了更简单的讲解，我的先爬取列表数据。

B ) Html标签分析：

分析列表数据部分html源代码中的标签,按F12后可以看到网页源码，并找到文章列表部分的代码。

图二：列表对应的代码

图三：专栏名称和专栏副标题

此次准备提取资讯如下字段：

文章标题："title",（对应图二中内容）

文章摘要："details",（对应图二中内容）

文章作者："diggs1",（对应图二中内容）

浏览数量："diggs2",（对应图二中内容）

点赞人数："diggs3",（对应图二中内容）

缩略图片："img",（对应图二中内容）

详情地址："href",（对应图二中内容）

文章页数："page",（对应图三中内容）

专栏名称："titles",（对应图三中内容）

专栏副标题："subtitle",（对应图三中内容）

下面我们来分析各字段所在标签：

专栏名称：div div div div div div h1 a；

专栏副标题：div.wrapper div.bannerpage div div div div p，副标题里面有两个P标签，获取时需要foreach循环获取；

文章页数：div div div div div div ul li a；

文章标题：div.article-item div.title h2；

文章摘要：div.article-item div.summary p；

文章作者：div.article-item div.post-info div.author span；

浏览数量：div.article-item div.post-info div.views span；

点赞人数：div.article-item div.post-info div.likes span；

缩略图片：div.article-item figure a img；

详情地址：div.article-item div.title h2 a；

图四：文章相关标签

图五：翻页相关标签

二．编写或下载第三方网页标签分析SDK

类似第三方的标签分析提取SDK比较多，笔者在此使用的是simplehtmldom。

SDK的代码如下：

<?php
/*** Website: http://sourceforge.net/projects/simplehtmldom/* Additional projects: http://sourceforge.net/projects/debugobject/* Acknowledge: Jose Solorzano (https://sourceforge.net/projects/php-html/)** Licensed under The MIT License* See the LICENSE file in the project root for more information.** Authors:*   S.C. Chen*   John Schlick*   Rus Carroll*   logmanoriginal** Contributors:*   Yousuke Kumakura*   Vadim Voituk*   Antcs** Version Rev. 1.9.1 (291)*/
namespace Htbz\Api;define('HDOM_TYPE_ELEMENT', 1);
define('HDOM_TYPE_COMMENT', 2);
define('HDOM_TYPE_TEXT', 3);
define('HDOM_TYPE_ENDTAG', 4);
define('HDOM_TYPE_ROOT', 5);
define('HDOM_TYPE_UNKNOWN', 6);
define('HDOM_QUOTE_DOUBLE', 0);
define('HDOM_QUOTE_SINGLE', 1);
define('HDOM_QUOTE_NO', 3);
define('HDOM_INFO_BEGIN', 0);
define('HDOM_INFO_END', 1);
define('HDOM_INFO_QUOTE', 2);
define('HDOM_INFO_SPACE', 3);
define('HDOM_INFO_TEXT', 4);
define('HDOM_INFO_INNER', 5);
define('HDOM_INFO_OUTER', 6);
define('HDOM_INFO_ENDSPACE', 7);defined('DEFAULT_TARGET_CHARSET') || define('DEFAULT_TARGET_CHARSET', 'UTF-8');
defined('DEFAULT_BR_TEXT') || define('DEFAULT_BR_TEXT', "\r\n");
defined('DEFAULT_SPAN_TEXT') || define('DEFAULT_SPAN_TEXT', ' ');
defined('MAX_FILE_SIZE') || define('MAX_FILE_SIZE', 600000);
define('HDOM_SMARTY_AS_TEXT', 1);function file_get_html($url,$use_include_path = false,$context = null,$offset = 0,$maxLen = -1,$lowercase = true,$forceTagsClosed = true,$target_charset = DEFAULT_TARGET_CHARSET,$stripRN = true,$defaultBRText = DEFAULT_BR_TEXT,$defaultSpanText = DEFAULT_SPAN_TEXT)
{if($maxLen <= 0) { $maxLen = MAX_FILE_SIZE; }$dom = new simple_html_dom(null,$lowercase,$forceTagsClosed,$target_charset,$stripRN,$defaultBRText,$defaultSpanText);/*** For sourceforge users: uncomment the next line and comment the* retrieve_url_contents line 2 lines down if it is not already done.*/$contents = file_get_contents($url,$use_include_path,$context,$offset,$maxLen);// $contents = retrieve_url_contents($url);if (empty($contents) || strlen($contents) > $maxLen) {$dom->clear();return false;}return $dom->load($contents, $lowercase, $stripRN);
}function str_get_html($str,$lowercase = true,$forceTagsClosed = true,$target_charset = DEFAULT_TARGET_CHARSET,$stripRN = true,$defaultBRText = DEFAULT_BR_TEXT,$defaultSpanText = DEFAULT_SPAN_TEXT)
{$dom = new simple_html_dom(null,$lowercase,$forceTagsClosed,$target_charset,$stripRN,$defaultBRText,$defaultSpanText);if (empty($str) || strlen($str) > MAX_FILE_SIZE) {$dom->clear();return false;}return $dom->load($str, $lowercase, $stripRN);
}function dump_html_tree($node, $show_attr = true, $deep = 0)
{$node->dump($node);
}class simple_html_dom_node
{public $nodetype = HDOM_TYPE_TEXT;public $tag = 'text';public $attr = array();public $children = array();public $nodes = array();public $parent = null;public $_ = array();public $tag_start = 0;private $dom = null;function __construct($dom){$this->dom = $dom;$dom->nodes[] = $this;}function __destruct(){$this->clear();}function __toString(){return $this->outertext();}function clear(){$this->dom = null;$this->nodes = null;$this->parent = null;$this->children = null;}function dump($show_attr = true, $depth = 0){echo str_repeat("\t", $depth) . $this->tag;if ($show_attr && count($this->attr) > 0) {echo '(';foreach ($this->attr as $k => $v) {echo "[$k]=>\"$v\", ";}echo ')';}echo "\n";if ($this->nodes) {foreach ($this->nodes as $node) {$node->dump($show_attr, $depth + 1);}}}function dump_node($echo = true){$string = $this->tag;if (count($this->attr) > 0) {$string .= '(';foreach ($this->attr as $k => $v) {$string .= "[$k]=>\"$v\", ";}$string .= ')';}if (count($this->_) > 0) {$string .= ' $_ (';foreach ($this->_ as $k => $v) {if (is_array($v)) {$string .= "[$k]=>(";foreach ($v as $k2 => $v2) {$string .= "[$k2]=>\"$v2\", ";}$string .= ')';} else {$string .= "[$k]=>\"$v\", ";}}$string .= ')';}if (isset($this->text)) {$string .= " text: ({$this->text})";}$string .= ' HDOM_INNER_INFO: ';if (isset($node->_[HDOM_INFO_INNER])) {$string .= "'" . $node->_[HDOM_INFO_INNER] . "'";} else {$string .= ' NULL ';}$string .= ' children: ' . count($this->children);$string .= ' nodes: ' . count($this->nodes);$string .= ' tag_start: ' . $this->tag_start;$string .= "\n";if ($echo) {echo $string;return;} else {return $string;}}function parent($parent = null){// I am SURE that this doesn't work properly.// It fails to unset the current node from it's current parents nodes or// children list first.if ($parent !== null) {$this->parent = $parent;$this->parent->nodes[] = $this;$this->parent->children[] = $this;}return $this->parent;}function has_child(){return !empty($this->children);}function children($idx = -1){if ($idx === -1) {return $this->children;}if (isset($this->children[$idx])) {return $this->children[$idx];}return null;}function first_child(){if (count($this->children) > 0) {return $this->children[0];}return null;}function last_child(){if (count($this->children) > 0) {return end($this->children);}return null;}function next_sibling(){if ($this->parent === null) {return null;}$idx = array_search($this, $this->parent->children, true);if ($idx !== false && isset($this->parent->children[$idx + 1])) {return $this->parent->children[$idx + 1];}return null;}function prev_sibling(){if ($this->parent === null) {return null;}$idx = array_search($this, $this->parent->children, true);if ($idx !== false && $idx > 0) {return $this->parent->children[$idx - 1];}return null;}function find_ancestor_tag($tag){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }if ($this->parent === null) {return null;}$ancestor = $this->parent;while (!is_null($ancestor)) {if (is_object($debug_object)) {$debug_object->debug_log(2, 'Current tag is: ' . $ancestor->tag);}if ($ancestor->tag === $tag) {break;}$ancestor = $ancestor->parent;}return $ancestor;}function innertext(){if (isset($this->_[HDOM_INFO_INNER])) {return $this->_[HDOM_INFO_INNER];}if (isset($this->_[HDOM_INFO_TEXT])) {return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);}$ret = '';foreach ($this->nodes as $n) {$ret .= $n->outertext();}return $ret;}function outertext(){global $debug_object;if (is_object($debug_object)) {$text = '';if ($this->tag === 'text') {if (!empty($this->text)) {$text = ' with text: ' . $this->text;}}$debug_object->debug_log(1, 'Innertext of tag: ' . $this->tag . $text);}if ($this->tag === 'root') {return $this->innertext();}// todo: What is the use of this callback? Remove?if ($this->dom && $this->dom->callback !== null) {call_user_func_array($this->dom->callback, array($this));}if (isset($this->_[HDOM_INFO_OUTER])) {return $this->_[HDOM_INFO_OUTER];}if (isset($this->_[HDOM_INFO_TEXT])) {return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);}$ret = '';if ($this->dom && $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]) {$ret = $this->dom->nodes[$this->_[HDOM_INFO_BEGIN]]->makeup();}if (isset($this->_[HDOM_INFO_INNER])) {// todo: <br> should either never have HDOM_INFO_INNER or alwaysif ($this->tag !== 'br') {$ret .= $this->_[HDOM_INFO_INNER];}} elseif ($this->nodes) {foreach ($this->nodes as $n) {$ret .= $this->convert_text($n->outertext());}}if (isset($this->_[HDOM_INFO_END]) && $this->_[HDOM_INFO_END] != 0) {$ret .= '</' . $this->tag . '>';}return $ret;}function text(){if (isset($this->_[HDOM_INFO_INNER])) {return $this->_[HDOM_INFO_INNER];}switch ($this->nodetype) {case HDOM_TYPE_TEXT: return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);case HDOM_TYPE_COMMENT: return '';case HDOM_TYPE_UNKNOWN: return '';}if (strcasecmp($this->tag, 'script') === 0) { return ''; }if (strcasecmp($this->tag, 'style') === 0) { return ''; }$ret = '';// In rare cases, (always node type 1 or HDOM_TYPE_ELEMENT - observed// for some span tags, and some p tags) $this->nodes is set to NULL.// NOTE: This indicates that there is a problem where it's set to NULL// without a clear happening.// WHY is this happening?if (!is_null($this->nodes)) {foreach ($this->nodes as $n) {// Start paragraph after a blank lineif ($n->tag === 'p') {$ret = trim($ret) . "\n\n";}$ret .= $this->convert_text($n->text());// If this node is a span... add a space at the end of it so// multiple spans don't run into each other.  This is plaintext// after all.if ($n->tag === 'span') {$ret .= $this->dom->default_span_text;}}}return $ret;}function xmltext(){$ret = $this->innertext();$ret = str_ireplace('<![CDATA[', '', $ret);$ret = str_replace(']]>', '', $ret);return $ret;}function makeup(){// text, comment, unknownif (isset($this->_[HDOM_INFO_TEXT])) {return $this->dom->restore_noise($this->_[HDOM_INFO_TEXT]);}$ret = '<' . $this->tag;$i = -1;foreach ($this->attr as $key => $val) {++$i;// skip removed attributeif ($val === null || $val === false) { continue; }$ret .= $this->_[HDOM_INFO_SPACE][$i][0];//no value attr: nowrap, checked selected...if ($val === true) {$ret .= $key;} else {switch ($this->_[HDOM_INFO_QUOTE][$i]){case HDOM_QUOTE_DOUBLE: $quote = '"'; break;case HDOM_QUOTE_SINGLE: $quote = '\''; break;default: $quote = '';}$ret .= $key. $this->_[HDOM_INFO_SPACE][$i][1]. '='. $this->_[HDOM_INFO_SPACE][$i][2]. $quote. $val. $quote;}}$ret = $this->dom->restore_noise($ret);return $ret . $this->_[HDOM_INFO_ENDSPACE] . '>';}function find($selector, $idx = null, $lowercase = false){$selectors = $this->parse_selector($selector);if (($count = count($selectors)) === 0) { return array(); }$found_keys = array();// find each selectorfor ($c = 0; $c < $count; ++$c) {// The change on the below line was documented on the sourceforge// code tracker id 2788009// used to be: if (($levle=count($selectors[0]))===0) return array();if (($levle = count($selectors[$c])) === 0) { return array(); }if (!isset($this->_[HDOM_INFO_BEGIN])) { return array(); }$head = array($this->_[HDOM_INFO_BEGIN] => 1);$cmd = ' '; // Combinator// handle descendant selectors, no recursive!for ($l = 0; $l < $levle; ++$l) {$ret = array();foreach ($head as $k => $v) {$n = ($k === -1) ? $this->dom->root : $this->dom->nodes[$k];//PaperG - Pass this optional parameter on to the seek function.$n->seek($selectors[$c][$l], $ret, $cmd, $lowercase);}$head = $ret;$cmd = $selectors[$c][$l][4]; // Next Combinator}foreach ($head as $k => $v) {if (!isset($found_keys[$k])) {$found_keys[$k] = 1;}}}// sort keysksort($found_keys);$found = array();foreach ($found_keys as $k => $v) {$found[] = $this->dom->nodes[$k];}// return nth-element or arrayif (is_null($idx)) { return $found; }elseif ($idx < 0) { $idx = count($found) + $idx; }return (isset($found[$idx])) ? $found[$idx] : null;}protected function seek($selector, &$ret, $parent_cmd, $lowercase = false){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }list($tag, $id, $class, $attributes, $cmb) = $selector;$nodes = array();if ($parent_cmd === ' ') { // Descendant Combinator// Find parent closing tag if the current element doesn't have a closing// tag (i.e. void element)$end = (!empty($this->_[HDOM_INFO_END])) ? $this->_[HDOM_INFO_END] : 0;if ($end == 0) {$parent = $this->parent;while (!isset($parent->_[HDOM_INFO_END]) && $parent !== null) {$end -= 1;$parent = $parent->parent;}$end += $parent->_[HDOM_INFO_END];}// Get list of target nodes$nodes_start = $this->_[HDOM_INFO_BEGIN] + 1;$nodes_count = $end - $nodes_start;$nodes = array_slice($this->dom->nodes, $nodes_start, $nodes_count, true);} elseif ($parent_cmd === '>') { // Child Combinator$nodes = $this->children;} elseif ($parent_cmd === '+'&& $this->parent&& in_array($this, $this->parent->children)) { // Next-Sibling Combinator$index = array_search($this, $this->parent->children, true) + 1;if ($index < count($this->parent->children))$nodes[] = $this->parent->children[$index];} elseif ($parent_cmd === '~'&& $this->parent&& in_array($this, $this->parent->children)) { // Subsequent Sibling Combinator$index = array_search($this, $this->parent->children, true);$nodes = array_slice($this->parent->children, $index);}// Go throgh each element starting at this element until the end tag// Note: If this element is a void tag, any previous void element is// skipped.foreach($nodes as $node) {$pass = true;// Skip root nodesif(!$node->parent) {$pass = false;}// Handle 'text' selectorif($pass && $tag === 'text' && $node->tag === 'text') {$ret[array_search($node, $this->dom->nodes, true)] = 1;unset($node);continue;}// Skip if node isn't a child node (i.e. text nodes)if($pass && !in_array($node, $node->parent->children, true)) {$pass = false;}// Skip if tag doesn't matchif ($pass && $tag !== '' && $tag !== $node->tag && $tag !== '*') {$pass = false;}// Skip if ID doesn't existif ($pass && $id !== '' && !isset($node->attr['id'])) {$pass = false;}// Check if ID matchesif ($pass && $id !== '' && isset($node->attr['id'])) {// Note: Only consider the first ID (as browsers do)$node_id = explode(' ', trim($node->attr['id']))[0];if($id !== $node_id) { $pass = false; }}// Check if all class(es) existif ($pass && $class !== '' && is_array($class) && !empty($class)) {if (isset($node->attr['class'])) {$node_classes = explode(' ', $node->attr['class']);if ($lowercase) {$node_classes = array_map('strtolower', $node_classes);}foreach($class as $c) {if(!in_array($c, $node_classes)) {$pass = false;break;}}} else {$pass = false;}}// Check attributesif ($pass&& $attributes !== ''&& is_array($attributes)&& !empty($attributes)) {foreach($attributes as $a) {list ($att_name,$att_expr,$att_val,$att_inv,$att_case_sensitivity) = $a;// Handle indexing attributes (i.e. "[2]")/*** Note: This is not supported by the CSS Standard but adds* the ability to select items compatible to XPath (i.e.* the 3rd element within it's parent).** Note: This doesn't conflict with the CSS Standard which* doesn't work on numeric attributes anyway.*/if (is_numeric($att_name)&& $att_expr === ''&& $att_val === '') {$count = 0;// Find index of current element in parentforeach ($node->parent->children as $c) {if ($c->tag === $node->tag) ++$count;if ($c === $node) break;}// If this is the correct node, continue with next// attributeif ($count === (int)$att_name) continue;}// Check attribute availabilityif ($att_inv) { // Attribute should NOT be setif (isset($node->attr[$att_name])) {$pass = false;break;}} else { // Attribute should be set// todo: "plaintext" is not a valid CSS selector!if ($att_name !== 'plaintext'&& !isset($node->attr[$att_name])) {$pass = false;break;}}// Continue with next attribute if expression isn't definedif ($att_expr === '') continue;// If they have told us that this is a "plaintext"// search then we want the plaintext of the node - right?// todo "plaintext" is not a valid CSS selector!if ($att_name === 'plaintext') {$nodeKeyValue = $node->text();} else {$nodeKeyValue = $node->attr[$att_name];}if (is_object($debug_object)) {$debug_object->debug_log(2,'testing node: '. $node->tag. ' for attribute: '. $att_name. $att_expr. $att_val. ' where nodes value is: '. $nodeKeyValue);}// If lowercase is set, do a case insensitive test of// the value of the selector.if ($lowercase) {$check = $this->match($att_expr,strtolower($att_val),strtolower($nodeKeyValue),$att_case_sensitivity);} else {$check = $this->match($att_expr,$att_val,$nodeKeyValue,$att_case_sensitivity);}if (is_object($debug_object)) {$debug_object->debug_log(2,'after match: '. ($check ? 'true' : 'false'));}if (!$check) {$pass = false;break;}}}// Found a match. Add to list and clear nodeif ($pass) $ret[$node->_[HDOM_INFO_BEGIN]] = 1;unset($node);}// It's passed by reference so this is actually what this function returns.if (is_object($debug_object)) {$debug_object->debug_log(1, 'EXIT - ret: ', $ret);}}protected function match($exp, $pattern, $value, $case_sensitivity){global $debug_object;if (is_object($debug_object)) {$debug_object->debug_log_entry(1);}if ($case_sensitivity === 'i') {$pattern = strtolower($pattern);$value = strtolower($value);}switch ($exp) {case '=':return ($value === $pattern);case '!=':return ($value !== $pattern);case '^=':return preg_match('/^' . preg_quote($pattern, '/') . '/', $value);case '$=':return preg_match('/' . preg_quote($pattern, '/') . '$/', $value);case '*=':return preg_match('/' . preg_quote($pattern, '/') . '/', $value);case '|=':/*** [att|=val]** Represents an element with the att attribute, its value* either being exactly "val" or beginning with "val"* immediately followed by "-" (U+002D).*/return strpos($value, $pattern) === 0;case '~=':/*** [att~=val]** Represents an element with the att attribute whose value is a* whitespace-separated list of words, one of which is exactly* "val". If "val" contains whitespace, it will never represent* anything (since the words are separated by spaces). Also if* "val" is the empty string, it will never represent anything.*/return in_array($pattern, explode(' ', trim($value)), true);}return false;}protected function parse_selector($selector_string){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }/*** Pattern of CSS selectors, modified from mootools (https://mootools.net/)** Paperg: Add the colon to the attribute, so that it properly finds* <tag attr:ibute="something" > like google does.** Note: if you try to look at this attribute, you MUST use getAttribute* since $dom->x:y will fail the php syntax check.** Notice the \[ starting the attribute? and the @? following? This* implies that an attribute can begin with an @ sign that is not* captured. This implies that an html attribute specifier may start* with an @ sign that is NOT captured by the expression. Farther study* is required to determine of this should be documented or removed.** Matches selectors in this order:** [0] - full match** [1] - tag name*     ([\w:\*-]*)*     Matches the tag name consisting of zero or more words, colons,*     asterisks and hyphens.** [2] - id name*     (?:\#([\w-]+))*     Optionally matches a id name, consisting of an "#" followed by*     the id name (one or more words and hyphens).** [3] - class names (including dots)*     (?:\.([\w\.-]+))?*     Optionally matches a list of classs, consisting of an "."*     followed by the class name (one or more words and hyphens)*     where multiple classes can be chained (i.e. ".foo.bar.baz")** [4] - attributes*     ((?:\[@?(?:!?[\w:-]+)(?:(?:[!*^$|~]?=)[\"']?(?:.*?)[\"']?)?(?:\s*?(?:[iIsS])?)?\])+)?*     Optionally matches the attributes list** [5] - separator*     ([\/, >+~]+)*     Matches the selector list separator*/// phpcs:ignore Generic.Files.LineLength$pattern = "/([\w:\*-]*)(?:\#([\w-]+))?(?:|\.([\w\.-]+))?((?:\[@?(?:!?[\w:-]+)(?:(?:[!*^$|~]?=)[\"']?(?:.*?)[\"']?)?(?:\s*?(?:[iIsS])?)?\])+)?([\/, >+~]+)/is";preg_match_all($pattern,trim($selector_string) . ' ', // Add final ' ' as pseudo separator$matches,PREG_SET_ORDER);if (is_object($debug_object)) {$debug_object->debug_log(2, 'Matches Array: ', $matches);}$selectors = array();$result = array();foreach ($matches as $m) {$m[0] = trim($m[0]);// Skip NoOpsif ($m[0] === '' || $m[0] === '/' || $m[0] === '//') { continue; }// Convert to lowercaseif ($this->dom->lowercase) {$m[1] = strtolower($m[1]);}// Extract classesif ($m[3] !== '') { $m[3] = explode('.', $m[3]); }/* Extract attributes (pattern based on the pattern above!)* [0] - full match* [1] - attribute name* [2] - attribute expression* [3] - attribute value* [4] - case sensitivity** Note: Attributes can be negated with a "!" prefix to their name*/if($m[4] !== '') {preg_match_all("/\[@?(!?[\w:-]+)(?:([!*^$|~]?=)[\"']?(.*?)[\"']?)?(?:\s+?([iIsS])?)?\]/is",trim($m[4]),$attributes,PREG_SET_ORDER);// Replace element by array$m[4] = array();foreach($attributes as $att) {// Skip empty matchesif(trim($att[0]) === '') { continue; }$inverted = (isset($att[1][0]) && $att[1][0] === '!');$m[4][] = array($inverted ? substr($att[1], 1) : $att[1], // Name(isset($att[2])) ? $att[2] : '', // Expression(isset($att[3])) ? $att[3] : '', // Value$inverted, // Inverted Flag(isset($att[4])) ? strtolower($att[4]) : '', // Case-Sensitivity);}}// Sanitize Separatorif ($m[5] !== '' && trim($m[5]) === '') { // Descendant Separator$m[5] = ' ';} else { // Other Separator$m[5] = trim($m[5]);}// Clear Separator if it's a Selector Listif ($is_list = ($m[5] === ',')) { $m[5] = ''; }// Remove full match before adding to resultsarray_shift($m);$result[] = $m;if ($is_list) { // Selector List$selectors[] = $result;$result = array();}}if (count($result) > 0) { $selectors[] = $result; }return $selectors;}function __get($name){if (isset($this->attr[$name])) {return $this->convert_text($this->attr[$name]);}switch ($name) {case 'outertext': return $this->outertext();case 'innertext': return $this->innertext();case 'plaintext': return $this->text();case 'xmltext': return $this->xmltext();default: return array_key_exists($name, $this->attr);}}function __set($name, $value){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }switch ($name) {case 'outertext': return $this->_[HDOM_INFO_OUTER] = $value;case 'innertext':if (isset($this->_[HDOM_INFO_TEXT])) {return $this->_[HDOM_INFO_TEXT] = $value;}return $this->_[HDOM_INFO_INNER] = $value;}if (!isset($this->attr[$name])) {$this->_[HDOM_INFO_SPACE][] = array(' ', '', '');$this->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_DOUBLE;}$this->attr[$name] = $value;}function __isset($name){switch ($name) {case 'outertext': return true;case 'innertext': return true;case 'plaintext': return true;}//no value attr: nowrap, checked selected...return (array_key_exists($name, $this->attr)) ? true : isset($this->attr[$name]);}function __unset($name){if (isset($this->attr[$name])) { unset($this->attr[$name]); }}function convert_text($text){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }$converted_text = $text;$sourceCharset = '';$targetCharset = '';if ($this->dom) {$sourceCharset = strtoupper($this->dom->_charset);$targetCharset = strtoupper($this->dom->_target_charset);}if (is_object($debug_object)) {$debug_object->debug_log(3,'source charset: '. $sourceCharset. ' target charaset: '. $targetCharset);}if (!empty($sourceCharset)&& !empty($targetCharset)&& (strcasecmp($sourceCharset, $targetCharset) != 0)) {// Check if the reported encoding could have been incorrect and the text is actually already UTF-8if ((strcasecmp($targetCharset, 'UTF-8') == 0)&& ($this->is_utf8($text))) {$converted_text = $text;} else {$converted_text = iconv($sourceCharset, $targetCharset, $text);}}// Lets make sure that we don't have that silly BOM issue with any of the utf-8 text we output.if ($targetCharset === 'UTF-8') {if (substr($converted_text, 0, 3) === "\xef\xbb\xbf") {$converted_text = substr($converted_text, 3);}if (substr($converted_text, -3) === "\xef\xbb\xbf") {$converted_text = substr($converted_text, 0, -3);}}return $converted_text;}static function is_utf8($str){$c = 0; $b = 0;$bits = 0;$len = strlen($str);for($i = 0; $i < $len; $i++) {$c = ord($str[$i]);if($c > 128) {if(($c >= 254)) { return false; }elseif($c >= 252) { $bits = 6; }elseif($c >= 248) { $bits = 5; }elseif($c >= 240) { $bits = 4; }elseif($c >= 224) { $bits = 3; }elseif($c >= 192) { $bits = 2; }else { return false; }if(($i + $bits) > $len) { return false; }while($bits > 1) {$i++;$b = ord($str[$i]);if($b < 128 || $b > 191) { return false; }$bits--;}}}return true;}function get_display_size(){global $debug_object;$width = -1;$height = -1;if ($this->tag !== 'img') {return false;}// See if there is aheight or width attribute in the tag itself.if (isset($this->attr['width'])) {$width = $this->attr['width'];}if (isset($this->attr['height'])) {$height = $this->attr['height'];}// Now look for an inline style.if (isset($this->attr['style'])) {// Thanks to user gnarf from stackoverflow for this regular expression.$attributes = array();preg_match_all('/([\w-]+)\s*:\s*([^;]+)\s*;?/',$this->attr['style'],$matches,PREG_SET_ORDER);foreach ($matches as $match) {$attributes[$match[1]] = $match[2];}// If there is a width in the style attributes:if (isset($attributes['width']) && $width == -1) {// check that the last two characters are px (pixels)if (strtolower(substr($attributes['width'], -2)) === 'px') {$proposed_width = substr($attributes['width'], 0, -2);// Now make sure that it's an integer and not something stupid.if (filter_var($proposed_width, FILTER_VALIDATE_INT)) {$width = $proposed_width;}}}// If there is a width in the style attributes:if (isset($attributes['height']) && $height == -1) {// check that the last two characters are px (pixels)if (strtolower(substr($attributes['height'], -2)) == 'px') {$proposed_height = substr($attributes['height'], 0, -2);// Now make sure that it's an integer and not something stupid.if (filter_var($proposed_height, FILTER_VALIDATE_INT)) {$height = $proposed_height;}}}}// Future enhancement:// Look in the tag to see if there is a class or id specified that has// a height or width attribute to it.// Far future enhancement// Look at all the parent tags of this image to see if they specify a// class or id that has an img selector that specifies a height or width// Note that in this case, the class or id will have the img subselector// for it to apply to the image.// ridiculously far future development// If the class or id is specified in a SEPARATE css file thats not on// the page, go get it and do what we were just doing for the ones on// the page.$result = array('height' => $height,'width' => $width);return $result;}function save($filepath = ''){$ret = $this->outertext();if ($filepath !== '') {file_put_contents($filepath, $ret, LOCK_EX);}return $ret;}function addClass($class){if (is_string($class)) {$class = explode(' ', $class);}if (is_array($class)) {foreach($class as $c) {if (isset($this->class)) {if ($this->hasClass($c)) {continue;} else {$this->class .= ' ' . $c;}} else {$this->class = $c;}}} else {if (is_object($debug_object)) {$debug_object->debug_log(2, 'Invalid type: ', gettype($class));}}}function hasClass($class){if (is_string($class)) {if (isset($this->class)) {return in_array($class, explode(' ', $this->class), true);}} else {if (is_object($debug_object)) {$debug_object->debug_log(2, 'Invalid type: ', gettype($class));}}return false;}function removeClass($class = null){if (!isset($this->class)) {return;}if (is_null($class)) {$this->removeAttribute('class');return;}if (is_string($class)) {$class = explode(' ', $class);}if (is_array($class)) {$class = array_diff(explode(' ', $this->class), $class);if (empty($class)) {$this->removeAttribute('class');} else {$this->class = implode(' ', $class);}}}function getAllAttributes(){return $this->attr;}function getAttribute($name){return $this->__get($name);}function setAttribute($name, $value){$this->__set($name, $value);}function hasAttribute($name){return $this->__isset($name);}function removeAttribute($name){$this->__set($name, null);}function remove(){if ($this->parent) {$this->parent->removeChild($this);}}function removeChild($node){$nidx = array_search($node, $this->nodes, true);$cidx = array_search($node, $this->children, true);$didx = array_search($node, $this->dom->nodes, true);if ($nidx !== false && $cidx !== false && $didx !== false) {foreach($node->children as $child) {$node->removeChild($child);}foreach($node->nodes as $entity) {$enidx = array_search($entity, $node->nodes, true);$edidx = array_search($entity, $node->dom->nodes, true);if ($enidx !== false && $edidx !== false) {unset($node->nodes[$enidx]);unset($node->dom->nodes[$edidx]);}}unset($this->nodes[$nidx]);unset($this->children[$cidx]);unset($this->dom->nodes[$didx]);$node->clear();}}function getElementById($id){return $this->find("#$id", 0);}function getElementsById($id, $idx = null){return $this->find("#$id", $idx);}function getElementByTagName($name){return $this->find($name, 0);}function getElementsByTagName($name, $idx = null){return $this->find($name, $idx);}function parentNode(){return $this->parent();}function childNodes($idx = -1){return $this->children($idx);}function firstChild(){return $this->first_child();}function lastChild(){return $this->last_child();}function nextSibling(){return $this->next_sibling();}function previousSibling(){return $this->prev_sibling();}function hasChildNodes(){return $this->has_child();}function nodeName(){return $this->tag;}function appendChild($node){$node->parent($this);return $node;}}class simple_html_dom
{public $root = null;public $nodes = array();public $callback = null;public $lowercase = false;public $original_size;public $size;protected $pos;protected $doc;protected $char;protected $cursor;protected $parent;protected $noise = array();protected $token_blank = " \t\r\n";protected $token_equal = ' =/>';protected $token_slash = " />\r\n\t";protected $token_attr = ' >';public $_charset = '';public $_target_charset = '';protected $default_br_text = '';public $default_span_text = '';protected $self_closing_tags = array('area' => 1,'base' => 1,'br' => 1,'col' => 1,'embed' => 1,'hr' => 1,'img' => 1,'input' => 1,'link' => 1,'meta' => 1,'param' => 1,'source' => 1,'track' => 1,'wbr' => 1);protected $block_tags = array('body' => 1,'div' => 1,'form' => 1,'root' => 1,'span' => 1,'table' => 1);protected $optional_closing_tags = array(// Not optional, see// https://www.w3.org/TR/html/textlevel-semantics.html#the-b-element'b' => array('b' => 1),'dd' => array('dd' => 1, 'dt' => 1),// Not optional, see// https://www.w3.org/TR/html/grouping-content.html#the-dl-element'dl' => array('dd' => 1, 'dt' => 1),'dt' => array('dd' => 1, 'dt' => 1),'li' => array('li' => 1),'optgroup' => array('optgroup' => 1, 'option' => 1),'option' => array('optgroup' => 1, 'option' => 1),'p' => array('p' => 1),'rp' => array('rp' => 1, 'rt' => 1),'rt' => array('rp' => 1, 'rt' => 1),'td' => array('td' => 1, 'th' => 1),'th' => array('td' => 1, 'th' => 1),'tr' => array('td' => 1, 'th' => 1, 'tr' => 1),);function __construct($str = null,$lowercase = true,$forceTagsClosed = true,$target_charset = DEFAULT_TARGET_CHARSET,$stripRN = true,$defaultBRText = DEFAULT_BR_TEXT,$defaultSpanText = DEFAULT_SPAN_TEXT,$options = 0){if ($str) {if (preg_match('/^http:\/\//i', $str) || is_file($str)) {$this->load_file($str);} else {$this->load($str,$lowercase,$stripRN,$defaultBRText,$defaultSpanText,$options);}}// Forcing tags to be closed implies that we don't trust the html, but// it can lead to parsing errors if we SHOULD trust the html.if (!$forceTagsClosed) {$this->optional_closing_array = array();}$this->_target_charset = $target_charset;}function __destruct(){$this->clear();}function load($str,$lowercase = true,$stripRN = true,$defaultBRText = DEFAULT_BR_TEXT,$defaultSpanText = DEFAULT_SPAN_TEXT,$options = 0){global $debug_object;// prepare$this->prepare($str, $lowercase, $defaultBRText, $defaultSpanText);// Per sourceforge http://sourceforge.net/tracker/?func=detail&aid=2949097&group_id=218559&atid=1044037// Script tags removal now preceeds style tag removal.// strip out <script> tags$this->remove_noise("'<\s*script[^>]*[^/]>(.*?)<\s*/\s*script\s*>'is");$this->remove_noise("'<\s*script\s*>(.*?)<\s*/\s*script\s*>'is");// strip out the \r \n's if we are told to.if ($stripRN) {$this->doc = str_replace("\r", ' ', $this->doc);$this->doc = str_replace("\n", ' ', $this->doc);// set the length of content since we have changed it.$this->size = strlen($this->doc);}// strip out cdata$this->remove_noise("'<!\[CDATA\[(.*?)\]\]>'is", true);// strip out comments$this->remove_noise("'<!--(.*?)-->'is");// strip out <style> tags$this->remove_noise("'<\s*style[^>]*[^/]>(.*?)<\s*/\s*style\s*>'is");$this->remove_noise("'<\s*style\s*>(.*?)<\s*/\s*style\s*>'is");// strip out preformatted tags$this->remove_noise("'<\s*(?:code)[^>]*>(.*?)<\s*/\s*(?:code)\s*>'is");// strip out server side scripts$this->remove_noise("'(<\?)(.*?)(\?>)'s", true);if($options & HDOM_SMARTY_AS_TEXT) { // Strip Smarty scripts$this->remove_noise("'(\{\w)(.*?)(\})'s", true);}// parsing$this->parse();// end$this->root->_[HDOM_INFO_END] = $this->cursor;$this->parse_charset();// make load function chainablereturn $this;}function load_file(){$args = func_get_args();if(($doc = call_user_func_array('file_get_contents', $args)) !== false) {$this->load($doc, true);} else {return false;}}function set_callback($function_name){$this->callback = $function_name;}function remove_callback(){$this->callback = null;}function save($filepath = ''){$ret = $this->root->innertext();if ($filepath !== '') { file_put_contents($filepath, $ret, LOCK_EX); }return $ret;}function find($selector, $idx = null, $lowercase = false){return $this->root->find($selector, $idx, $lowercase);}function clear(){if (isset($this->nodes)) {foreach ($this->nodes as $n) {$n->clear();$n = null;}}// This add next line is documented in the sourceforge repository.// 2977248 as a fix for ongoing memory leaks that occur even with the// use of clear.if (isset($this->children)) {foreach ($this->children as $n) {$n->clear();$n = null;}}if (isset($this->parent)) {$this->parent->clear();unset($this->parent);}if (isset($this->root)) {$this->root->clear();unset($this->root);}unset($this->doc);unset($this->noise);}function dump($show_attr = true){$this->root->dump($show_attr);}protected function prepare($str, $lowercase = true,$defaultBRText = DEFAULT_BR_TEXT,$defaultSpanText = DEFAULT_SPAN_TEXT){$this->clear();$this->doc = trim($str);$this->size = strlen($this->doc);$this->original_size = $this->size; // original size of the html$this->pos = 0;$this->cursor = 1;$this->noise = array();$this->nodes = array();$this->lowercase = $lowercase;$this->default_br_text = $defaultBRText;$this->default_span_text = $defaultSpanText;$this->root = new simple_html_dom_node($this);$this->root->tag = 'root';$this->root->_[HDOM_INFO_BEGIN] = -1;$this->root->nodetype = HDOM_TYPE_ROOT;$this->parent = $this->root;if ($this->size > 0) { $this->char = $this->doc[0]; }}protected function parse(){while (true) {// Read next tag if there is no text between current position and the// next opening tag.if (($s = $this->copy_until_char('<')) === '') {if($this->read_tag()) {continue;} else {return true;}}// Add a text node for text between tags$node = new simple_html_dom_node($this);++$this->cursor;$node->_[HDOM_INFO_TEXT] = $s;$this->link_nodes($node, false);}}protected function parse_charset(){global $debug_object;$charset = null;if (function_exists('get_last_retrieve_url_contents_content_type')) {$contentTypeHeader = get_last_retrieve_url_contents_content_type();$success = preg_match('/charset=(.+)/', $contentTypeHeader, $matches);if ($success) {$charset = $matches[1];if (is_object($debug_object)) {$debug_object->debug_log(2,'header content-type found charset of: '. $charset);}}}if (empty($charset)) {// https://www.w3.org/TR/html/document-metadata.html#statedef-http-equiv-content-type$el = $this->root->find('meta[http-equiv=Content-Type]', 0, true);if (!empty($el)) {$fullvalue = $el->content;if (is_object($debug_object)) {$debug_object->debug_log(2,'meta content-type tag found'. $fullvalue);}if (!empty($fullvalue)) {$success = preg_match('/charset=(.+)/i',$fullvalue,$matches);if ($success) {$charset = $matches[1];} else {// If there is a meta tag, and they don't specify the// character set, research says that it's typically// ISO-8859-1if (is_object($debug_object)) {$debug_object->debug_log(2,'meta content-type tag couldn\'t be parsed. using iso-8859 default.');}$charset = 'ISO-8859-1';}}}}if (empty($charset)) {// https://www.w3.org/TR/html/document-metadata.html#character-encoding-declarationif ($meta = $this->root->find('meta[charset]', 0)) {$charset = $meta->charset;if (is_object($debug_object)) {$debug_object->debug_log(2, 'meta charset: ' . $charset);}}}if (empty($charset)) {// Try to guess the charset based on the content// Requires Multibyte String (mbstring) support (optional)if (function_exists('mb_detect_encoding')) {/*** mb_detect_encoding() is not intended to distinguish between* charsets, especially single-byte charsets. Its primary* purpose is to detect which multibyte encoding is in use,* i.e. UTF-8, UTF-16, shift-JIS, etc.** -- https://bugs.php.net/bug.php?id=38138** Adding both CP1251/ISO-8859-5 and CP1252/ISO-8859-1 will* always result in CP1251/ISO-8859-5 and vice versa.** Thus, only detect if it's either UTF-8 or CP1252/ISO-8859-1* to stay compatible.*/$encoding = mb_detect_encoding($this->doc,array( 'UTF-8', 'CP1252', 'ISO-8859-1' ));if ($encoding === 'CP1252' || $encoding === 'ISO-8859-1') {// Due to a limitation of mb_detect_encoding// 'CP1251'/'ISO-8859-5' will be detected as// 'CP1252'/'ISO-8859-1'. This will cause iconv to fail, in// which case we can simply assume it is the other charset.if (!@iconv('CP1252', 'UTF-8', $this->doc)) {$encoding = 'CP1251';}}if ($encoding !== false) {$charset = $encoding;if (is_object($debug_object)) {$debug_object->debug_log(2, 'mb_detect: ' . $charset);}}}}if (empty($charset)) {// Assume it's UTF-8 as it is the most likely charset to be used$charset = 'UTF-8';if (is_object($debug_object)) {$debug_object->debug_log(2, 'No match found, assume ' . $charset);}}// Since CP1252 is a superset, if we get one of it's subsets, we want// it instead.if ((strtolower($charset) == 'iso-8859-1')|| (strtolower($charset) == 'latin1')|| (strtolower($charset) == 'latin-1')) {$charset = 'CP1252';if (is_object($debug_object)) {$debug_object->debug_log(2,'replacing ' . $charset . ' with CP1252 as its a superset');}}if (is_object($debug_object)) {$debug_object->debug_log(1, 'EXIT - ' . $charset);}return $this->_charset = $charset;}protected function read_tag(){// Set end position if no further tags foundif ($this->char !== '<') {$this->root->_[HDOM_INFO_END] = $this->cursor;return false;}$begin_tag_pos = $this->pos;$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next// end tagif ($this->char === '/') {$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next// Skip whitespace in end tags (i.e. in "</   html>")$this->skip($this->token_blank);$tag = $this->copy_until_char('>');// Skip attributes in end tagsif (($pos = strpos($tag, ' ')) !== false) {$tag = substr($tag, 0, $pos);}$parent_lower = strtolower($this->parent->tag);$tag_lower = strtolower($tag);// The end tag is supposed to close the parent tag. Handle situations// when it doesn'tif ($parent_lower !== $tag_lower) {// Parent tag does not have to be closed necessarily (optional closing tag)// Current tag is a block tag, so it may close an ancestorif (isset($this->optional_closing_tags[$parent_lower])&& isset($this->block_tags[$tag_lower])) {$this->parent->_[HDOM_INFO_END] = 0;$org_parent = $this->parent;// Traverse ancestors to find a matching opening tag// Stop at root nodewhile (($this->parent->parent)&& strtolower($this->parent->tag) !== $tag_lower){$this->parent = $this->parent->parent;}// If we don't have a match add current tag as text nodeif (strtolower($this->parent->tag) !== $tag_lower) {$this->parent = $org_parent; // restore origonal parentif ($this->parent->parent) {$this->parent = $this->parent->parent;}$this->parent->_[HDOM_INFO_END] = $this->cursor;return $this->as_text_node($tag);}} elseif (($this->parent->parent)&& isset($this->block_tags[$tag_lower])) {// Grandparent exists and current tag is a block tag, so our// parent doesn't have an end tag$this->parent->_[HDOM_INFO_END] = 0; // No end tag$org_parent = $this->parent;// Traverse ancestors to find a matching opening tag// Stop at root nodewhile (($this->parent->parent)&& strtolower($this->parent->tag) !== $tag_lower) {$this->parent = $this->parent->parent;}// If we don't have a match add current tag as text nodeif (strtolower($this->parent->tag) !== $tag_lower) {$this->parent = $org_parent; // restore origonal parent$this->parent->_[HDOM_INFO_END] = $this->cursor;return $this->as_text_node($tag);}} elseif (($this->parent->parent)&& strtolower($this->parent->parent->tag) === $tag_lower) { // Grandparent exists and current tag closes it$this->parent->_[HDOM_INFO_END] = 0;$this->parent = $this->parent->parent;} else { // Random tag, add as text nodereturn $this->as_text_node($tag);}}// Set end position of parent tag to current cursor position$this->parent->_[HDOM_INFO_END] = $this->cursor;if ($this->parent->parent) {$this->parent = $this->parent->parent;}$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextreturn true;}// start tag$node = new simple_html_dom_node($this);$node->_[HDOM_INFO_BEGIN] = $this->cursor;++$this->cursor;$tag = $this->copy_until($this->token_slash); // Get tag name$node->tag_start = $begin_tag_pos;// doctype, cdata & comments...// <!DOCTYPE html>// <![CDATA[ ... ]]>// <!-- Comment -->if (isset($tag[0]) && $tag[0] === '!') {$node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until_char('>');if (isset($tag[2]) && $tag[1] === '-' && $tag[2] === '-') { // Comment ("<!--")$node->nodetype = HDOM_TYPE_COMMENT;$node->tag = 'comment';} else { // Could be doctype or CDATA but we don't care$node->nodetype = HDOM_TYPE_UNKNOWN;$node->tag = 'unknown';}if ($this->char === '>') { $node->_[HDOM_INFO_TEXT] .= '>'; }$this->link_nodes($node, true);$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextreturn true;}// The start tag cannot contain another start tag, if so add as text// i.e. "<<html>"if ($pos = strpos($tag, '<') !== false) {$tag = '<' . substr($tag, 0, -1);$node->_[HDOM_INFO_TEXT] = $tag;$this->link_nodes($node, false);$this->char = $this->doc[--$this->pos]; // prevreturn true;}// Handle invalid tag names (i.e. "<html#doc>")if (!preg_match('/^\w[\w:-]*$/', $tag)) {$node->_[HDOM_INFO_TEXT] = '<' . $tag . $this->copy_until('<>');// Next char is the beginning of a new tag, don't touch it.if ($this->char === '<') {$this->link_nodes($node, false);return true;}// Next char closes current tag, add and be done with it.if ($this->char === '>') { $node->_[HDOM_INFO_TEXT] .= '>'; }$this->link_nodes($node, false);$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextreturn true;}// begin tag, add new node$node->nodetype = HDOM_TYPE_ELEMENT;$tag_lower = strtolower($tag);$node->tag = ($this->lowercase) ? $tag_lower : $tag;// handle optional closing tagsif (isset($this->optional_closing_tags[$tag_lower])) {// Traverse ancestors to close all optional closing tagswhile (isset($this->optional_closing_tags[$tag_lower][strtolower($this->parent->tag)])) {$this->parent->_[HDOM_INFO_END] = 0;$this->parent = $this->parent->parent;}$node->parent = $this->parent;}$guard = 0; // prevent infinity loop// [0] Space between tag and first attribute$space = array($this->copy_skip($this->token_blank), '', '');// attributesdo {// Everything until the first equal sign should be the attribute name$name = $this->copy_until($this->token_equal);if ($name === '' && $this->char !== null && $space[0] === '') {break;}if ($guard === $this->pos) { // Escape infinite loop$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextcontinue;}$guard = $this->pos;// handle endless '<'// Out of bounds before the tag endedif ($this->pos >= $this->size - 1 && $this->char !== '>') {$node->nodetype = HDOM_TYPE_TEXT;$node->_[HDOM_INFO_END] = 0;$node->_[HDOM_INFO_TEXT] = '<' . $tag . $space[0] . $name;$node->tag = 'text';$this->link_nodes($node, false);return true;}// handle mismatch '<'// Attributes cannot start after opening tagif ($this->doc[$this->pos - 1] == '<') {$node->nodetype = HDOM_TYPE_TEXT;$node->tag = 'text';$node->attr = array();$node->_[HDOM_INFO_END] = 0;$node->_[HDOM_INFO_TEXT] = substr($this->doc,$begin_tag_pos,$this->pos - $begin_tag_pos - 1);$this->pos -= 2;$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next$this->link_nodes($node, false);return true;}if ($name !== '/' && $name !== '') { // this is a attribute name// [1] Whitespace after attribute name$space[1] = $this->copy_skip($this->token_blank);$name = $this->restore_noise($name); // might be a noisy nameif ($this->lowercase) { $name = strtolower($name); }if ($this->char === '=') { // attribute with value$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next$this->parse_attr($node, $name, $space); // get attribute value} else {//no value attr: nowrap, checked selected...$node->_[HDOM_INFO_QUOTE][] = HDOM_QUOTE_NO;$node->attr[$name] = true;if ($this->char != '>') { $this->char = $this->doc[--$this->pos]; } // prev}$node->_[HDOM_INFO_SPACE][] = $space;// prepare for next attribute$space = array($this->copy_skip($this->token_blank),'','');} else { // no more attributesbreak;}} while ($this->char !== '>' && $this->char !== '/'); // go until the tag ended$this->link_nodes($node, true);$node->_[HDOM_INFO_ENDSPACE] = $space[0];// handle empty tags (i.e. "<div/>")if ($this->copy_until_char('>') === '/') {$node->_[HDOM_INFO_ENDSPACE] .= '/';$node->_[HDOM_INFO_END] = 0;} else {// reset parentif (!isset($this->self_closing_tags[strtolower($node->tag)])) {$this->parent = $node;}}$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next// If it's a BR tag, we need to set it's text to the default text.// This way when we see it in plaintext, we can generate formatting that the user wants.// since a br tag never has sub nodes, this works well.if ($node->tag === 'br') {$node->_[HDOM_INFO_INNER] = $this->default_br_text;}return true;}protected function parse_attr($node, $name, &$space){$is_duplicate = isset($node->attr[$name]);if (!$is_duplicate) // Copy whitespace between "=" and value$space[2] = $this->copy_skip($this->token_blank);switch ($this->char) {case '"':$quote_type = HDOM_QUOTE_DOUBLE;$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next$value = $this->copy_until_char('"');$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextbreak;case '\'':$quote_type = HDOM_QUOTE_SINGLE;$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // next$value = $this->copy_until_char('\'');$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextbreak;default:$quote_type = HDOM_QUOTE_NO;$value = $this->copy_until($this->token_attr);}$value = $this->restore_noise($value);// PaperG: Attributes should not have \r or \n in them, that counts as// html whitespace.$value = str_replace("\r", '', $value);$value = str_replace("\n", '', $value);// PaperG: If this is a "class" selector, lets get rid of the preceeding// and trailing space since some people leave it in the multi class case.if ($name === 'class') {$value = trim($value);}if (!$is_duplicate) {$node->_[HDOM_INFO_QUOTE][] = $quote_type;$node->attr[$name] = $value;}}protected function link_nodes(&$node, $is_child){$node->parent = $this->parent;$this->parent->nodes[] = $node;if ($is_child) {$this->parent->children[] = $node;}}protected function as_text_node($tag){$node = new simple_html_dom_node($this);++$this->cursor;$node->_[HDOM_INFO_TEXT] = '</' . $tag . '>';$this->link_nodes($node, false);$this->char = (++$this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextreturn true;}protected function skip($chars){$this->pos += strspn($this->doc, $chars, $this->pos);$this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // next}protected function copy_skip($chars){$pos = $this->pos;$len = strspn($this->doc, $chars, $pos);$this->pos += $len;$this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextif ($len === 0) { return ''; }return substr($this->doc, $pos, $len);}protected function copy_until($chars){$pos = $this->pos;$len = strcspn($this->doc, $chars, $pos);$this->pos += $len;$this->char = ($this->pos < $this->size) ? $this->doc[$this->pos] : null; // nextreturn substr($this->doc, $pos, $len);}protected function copy_until_char($char){if ($this->char === null) { return ''; }if (($pos = strpos($this->doc, $char, $this->pos)) === false) {$ret = substr($this->doc, $this->pos, $this->size - $this->pos);$this->char = null;$this->pos = $this->size;return $ret;}if ($pos === $this->pos) { return ''; }$pos_old = $this->pos;$this->char = $this->doc[$pos];$this->pos = $pos;return substr($this->doc, $pos_old, $pos - $pos_old);}protected function remove_noise($pattern, $remove_tag = false){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }$count = preg_match_all($pattern,$this->doc,$matches,PREG_SET_ORDER | PREG_OFFSET_CAPTURE);for ($i = $count - 1; $i > -1; --$i) {$key = '___noise___' . sprintf('% 5d', count($this->noise) + 1000);if (is_object($debug_object)) {$debug_object->debug_log(2, 'key is: ' . $key);}$idx = ($remove_tag) ? 0 : 1; // 0 = entire match, 1 = submatch$this->noise[$key] = $matches[$i][$idx][0];$this->doc = substr_replace($this->doc, $key, $matches[$i][$idx][1], strlen($matches[$i][$idx][0]));}// reset the length of content$this->size = strlen($this->doc);if ($this->size > 0) {$this->char = $this->doc[0];}}function restore_noise($text){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }while (($pos = strpos($text, '___noise___')) !== false) {// Sometimes there is a broken piece of markup, and we don't GET the// pos+11 etc... token which indicates a problem outside of us...// todo: "___noise___1000" (or any number with four or more digits)// in the DOM causes an infinite loop which could be utilized by// malicious softwareif (strlen($text) > $pos + 15) {$key = '___noise___'. $text[$pos + 11]. $text[$pos + 12]. $text[$pos + 13]. $text[$pos + 14]. $text[$pos + 15];if (is_object($debug_object)) {$debug_object->debug_log(2, 'located key of: ' . $key);}if (isset($this->noise[$key])) {$text = substr($text, 0, $pos). $this->noise[$key]. substr($text, $pos + 16);} else {// do this to prevent an infinite loop.$text = substr($text, 0, $pos). 'UNDEFINED NOISE FOR KEY: '. $key. substr($text, $pos + 16);}} else {// There is no valid key being given back to us... We must get// rid of the ___noise___ or we will have a problem.$text = substr($text, 0, $pos). 'NO NUMERIC NOISE KEY'. substr($text, $pos + 11);}}return $text;}function search_noise($text){global $debug_object;if (is_object($debug_object)) { $debug_object->debug_log_entry(1); }foreach($this->noise as $noiseElement) {if (strpos($noiseElement, $text) !== false) {return $noiseElement;}}}function __toString(){return $this->root->innertext();}function __get($name){switch ($name) {case 'outertext':return $this->root->innertext();case 'innertext':return $this->root->innertext();case 'plaintext':return $this->root->text();case 'charset':return $this->_charset;case 'target_charset':return $this->_target_charset;}}function childNodes($idx = -1){return $this->root->childNodes($idx);}function firstChild(){return $this->root->first_child();}function lastChild(){return $this->root->last_child();}function createElement($name, $value = null){return @str_get_html("<$name>$value</$name>")->firstChild();}function createTextNode($value){return @end(str_get_html($value)->nodes);}function getElementById($id){return $this->find("#$id", 0);}function getElementsById($id, $idx = null){return $this->find("#$id", $idx);}function getElementByTagName($name){return $this->find($name, 0);}function getElementsByTagName($name, $idx = -1){return $this->find($name, $idx);}function loadFile(){$args = func_get_args();$this->load_file($args);}
}

至于代码放置何处，只要放在能引用到的地方都可以。笔者是放在自己封装的SDK中的。

三．编写代码

笔者准备写个接口，将爬取到的资讯信息通过接口返回;

A ）添加路由：

代码如下：

Route::group('html/suppplier', function () {Route::get('crawler_news', '/crawlerNews')->name('htmlSupplierCrawlerNews');//爬虫爬取资讯数据})->prefix('html.Supplier');

B ）实现刚配置的接口函数：

代码如下：

    /*** TODO 爬虫爬取资讯数据* @return mixed* @author huidaoli* @day 2021-11-17*/public function crawlerNews(EchartRepository $EchartRepository){$data = $this->request->params(['page']);$res = $EchartRepository->getHtmlTags($data);return app('json')->success($res);}

C ）爬取专栏相关信息：

代码如下：

    //爬虫爬取资讯数据public function getHtmlTags(array $data){$htmlObj = new simple_html_dom();   //工具类对象初始化$htmlObj->load_file('https://brands.cnblogs.com/');  // 从url中加载$data['page'] = $htmlObj->find('div div div div div div ul li a', 3)->innertext;$ret = $this->scraping_digg($data);$ret['titles'] = $htmlObj->find('div div div div div div h1 a', 0)->innertext;foreach($htmlObj->find('div.wrapper div.bannerpage div div div div p') as $e){$ret['subtitle'][] = $e->plaintext;}return $ret;}

D ）爬取文章信息：

代码如下：

    //爬虫爬取资讯数据public function scraping_digg(array $data) {   $limit = 0;for($i=1;$i<=$data['page'];$i++){$URL = 'https://brands.cnblogs.com';$url = $URL.'/'.$i;$html = new simple_html_dom();$html->load_file($url);foreach($html->find('div.article-item') as $article) {$item['title'] = trim($article->find('div.title h2', 0)->plaintext);$item['details'] = trim($article->find('div.summary p', 0)->plaintext);$item['diggs1'] = trim($article->find('div.post-info div.author span', 0)->plaintext);$item['diggs2'] = trim($article->find('div.post-info div.views span', 0)->plaintext);$item['diggs3'] = trim($article->find('div.post-info div.likes span', 0)->plaintext);$item['img'] = trim($article->find('figure a img', 0)->src);$item['href'] = $URL.trim($article->find('div.title h2 a', 1)->href);$ret['list'][] = $item;$limit += 1;}$html->clear();unset($html);}$ret['limit'] =  $limit;$ret['page'] =  $data['page'];return $ret;}

四．代码测试

返回数据细多，这里暂时获取1页3条数据返回：

{"status": 200,"message": "success","data": {"list": [{"title": "赶紧收藏！7大类400多种组件，鸿蒙三方库来了！","details": "目前，HarmonyOS三方库涵盖7大类400多种，并且还在持续壮大中。以下为三方库的介绍及资源地址，赶紧收藏！","diggs1": "eva3w","diggs2": "3075","diggs3": "0","img": "https://img2020.cnblogs.com/blog/2016690/202105/2016690-20210512143438411-894995434.png?v=20200821","href": "https://brands.cnblogs.com/huawei/p/2174"},{"title": "上云没错，上错了云，才是错！","details": "疫情更像加速器，加速了企业云化的进程。企业 IT 需要更弹性和稳定的基础架构、更先进并持续更新的技术栈、永远在线的体验和有保障的 SLA。 「上云」是不可逆转的趋势，但是，上云要找老司机，选择科学的方法和正确的步调。 上错了云，就像上贼船，...","diggs1": "小黑羊","diggs2": "1510","diggs3": "7","img": "https://img2020.cnblogs.com/blog/1/202102/1-20210224163516159-1710022906.jpg?v=20200821","href": "https://brands.cnblogs.com/aws/p/2081"}],"limit": 3,"page": "1","titles": "博客园品牌专区","subtitle": ["你的远见，被千万开发者看见","Stay with developers. Build your brand."]}
}

五．总结

本次使用PHP开发爬虫去爬取资讯数据还是第一次，之前都是使用Ptyhon去实现。大家可以去找下我之前的文章。

如何使用PHP开发爬虫功能去爬取网页资讯相关推荐

Python爬虫：Xpath爬取网页信息（附代码）
Python爬虫:Xpath爬取网页信息(附代码) 上一次分享了使用Python简单爬取网页信息的方法.但是仅仅对于单一网页的信息爬取一般无法满足我们的数据需求.对于一般的数据需求,我们通常需要从一个 ...
爬虫系列-jsoup爬取网页你需要了解的一切
爬虫系列-jsoup爬取网页概述解析和遍历文档文档的对象模型加载HTML数据从String解析文档从String中加载解析片段从URL加载文档描述从文件加载文档描述提取数据使用 ...
Python实训day04am【爬虫介绍、爬取网页测试、Python第三方库】
Python实训-15天-博客汇总表目录 1.文本文件编程题 2.爬虫(Scrapy) 2.1.安装第三方库 2.2.爬取网页测试 2.2.1.样例1 2.2.2.样例2 3.PyCharm导入第三 ...
（第一个java爬虫）java爬取网页文本并抽取中英文关键词
问题引入对于给定的任意一个网址,爬取这个网页上的所有文本信息,并抽取出文本的中文和英文关键词搭建环境首先需要具备eclipse开发环境.然后导入jsoup包,用来爬取网页,再到这个网址http: ...
网络爬虫学习2 - 爬取网页的通用代码框架、HTTP协议、Requests库的各种方法
网络爬虫MOOC学习打卡 - 第二天文章目录网络爬虫MOOC学习打卡 - 第二天一.爬取网页的通用代码框架 1.理解requests库的异常 2.Respones类提供了一个方法 -- r.ra ...
python爬虫scrapy框架爬取网页数据_Scrapy-Python
scrapy Scrapy:Python的爬虫框架实例Demo 抓取:汽车之家.瓜子.链家等数据信息版本+环境库 Python2.7 + Scrapy1.12 初窥Scrapy Scrapy是一 ...
【java爬虫】jsoup爬取网页数据-搜索算法评测/竞品评测
百度抽奖概率改4个小时频繁黑屏频繁弹出源码的前端FE T8李森:请云端高level的同学参加会议...对,我级别到了... 666666 业务背景:如何保证搜索算法的好坏?所以有了竞品评测,自己的AP ...
不会Python爬虫？教你一个通用爬虫思路轻松爬取网页数据，赶紧收藏！！
前言其实爬虫的思路很简单,但是对于很多初学者而言,看得懂,但是自己写的时候就不知道怎么去分析了!说实话还是写少了,自己不要老是抄代码,多动手! 其实一个爬虫无非就三步:下载数据.解析数据.保存数据. ...
不会Python爬虫？教你一个通用爬虫思路轻松爬取网页数据
前言其实爬虫的思路很简单,但是对于很多初学者而言,看得懂,但是自己写的时候就不知道怎么去分析了!说实话还是写少了,自己不要老是抄代码,多动手! 本人对于Python学习创建了一个小小的学习圈子,为各 ...

如何使用PHP开发爬虫功能去爬取网页资讯

如何使用PHP开发爬虫功能去爬取网页资讯相关推荐

最新文章

热门文章