
There has been lots of buzz about many of the new features in PHP 5.4, like the traits support, the short array syntax and all those other syntax improvements.

But one set of changes that I think is particularly important was largely overlooked: for PHP 5.4, cataphract (Artefacto on StackOverflow) heroically rewrote large parts of htmlspecialchars, thus fixing various quirks and adding some really nice new features.

(The changes discussed here apply not only to htmlspecialchars, but also to the related htmlentities and in parts to htmlspecialchars_decode, html_entity_decode and get_html_translation_table.)

Here is a quick summary of the most important changes:

  • UTF-8 as the default charset
  • Improved error handling (ENT_SUBSTITUTE)
  • Doctype handling (ENT_HTML401, …)

UTF-8 as the default charset

As you hopefully know, the third argument for htmlspecialchars is the character set. The thing is: most people just leave that argument out, thus falling back to the default charset. This default charset was ISO-8859-1 before PHP 5.4 and as such did not match the UTF-8 encoding most people use. PHP 5.4 fixes this by making UTF-8 the default.
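To see why the default matters, here is a small sketch using htmlentities, where the effect is easier to observe than with htmlspecialchars. It assumes the input is UTF-8 encoded; the variable name is only for illustration.

[php]

// "é" encoded as UTF-8 is the two bytes 0xC3 0xA9.
$utf8 = "\xC3\xA9";

// Interpreted as ISO-8859-1 (the pre-5.4 default), each byte is treated
// as a separate character: "Ã" and "©".
var_dump(htmlentities($utf8, ENT_QUOTES, 'ISO-8859-1')); // string(14) "&Atilde;&copy;"

// Interpreted as UTF-8, the two bytes form the single character "é".
var_dump(htmlentities($utf8, ENT_QUOTES, 'UTF-8')); // string(8) "&eacute;"

[/php]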

Improved error handling

Error handling in htmlspecialchars before PHP 5.4 was … uhm, let’s call it “unintuitive”:

If you passed a string containing an “invalid code unit sequence” (which is Unicode slang for “not encoded correctly”), htmlspecialchars would return an empty string. Well, okay, so far so good. The funny thing was that it would additionally raise an error, but only if error display was disabled. So it would only error if errors were hidden. Nice, innit?

This basically meant that on your development machine you wouldn’t see any errors, but on your production machine the error log would be flooded with them. Awesome.

So, as of PHP 5.4 thankfully this behavior is gone. The error will not be generated anymore.
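Since the error is gone, invalid input now only shows up as an empty return value. If you would rather fail loudly, a check along the following lines might do. This is only a sketch: escape_strict is a hypothetical helper name, not something PHP provides, and the flags are passed explicitly so the behavior does not depend on the default flags.

[php]

// Hypothetical helper: escape a UTF-8 string, but throw instead of
// silently returning an empty string for invalid input.
function escape_strict($string)
{
    // Only ENT_QUOTES is passed, so invalid UTF-8 yields ''.
    $escaped = htmlspecialchars($string, ENT_QUOTES, 'UTF-8');

    if ($escaped === '' && $string !== '') {
        throw new InvalidArgumentException('String is not valid UTF-8');
    }

    return $escaped;
}

var_dump(escape_strict('<b>')); // string(9) "&lt;b&gt;"

try {
    escape_strict("a\x80b");
} catch (InvalidArgumentException $e) {
    var_dump($e->getMessage()); // string(25) "String is not valid UTF-8"
}

[/php]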

Additionally there are two options that allow you to specify an alternative to just returning an empty string:

  • ENT_IGNORE: This option (which isn’t actually new, it was there in PHP 5.3 already) will just drop all invalid code unit sequences. This is bad for two reasons: First, you won’t notice invalid encoding because it’ll simply be dropped. Second, this poses a certain security risk (for more info see the Unicode Security Considerations).
  • ENT_SUBSTITUTE: This new alternative option takes a much more sensible approach to the problem: Instead of just dropping the invalid code units, it replaces them with a Unicode Replacement Character (U+FFFD). So invalid code unit sequences will be replaced by � characters.
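As a rough illustration of the security concern with ENT_IGNORE, consider the following contrived sketch. The naive substring check and the variable name are made up for this example and are not a recommended filter; the point is only that a stray invalid byte can hide a string from a check that runs before escaping, and dropping the byte afterwards brings the string back.

[php]

// Contrived input: an invalid byte splits the word "javascript".
$url = "java\x80script:alert(1)";

// A naive check made up for this example does not see the scheme...
var_dump(strpos($url, 'javascript:')); // bool(false)

// ...but ENT_IGNORE silently drops the invalid byte and reassembles it.
var_dump(htmlspecialchars($url, ENT_QUOTES | ENT_IGNORE, 'UTF-8'));
// string(19) "javascript:alert(1)"

// ENT_SUBSTITUTE keeps the two halves apart with U+FFFD.
var_dump(strpos(htmlspecialchars($url, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8'), 'javascript:'));
// bool(false)

[/php]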

Let’s have a look at the different behaviors (demo):

[php]

// "\x80" is invalid UTF-8 in this context
var_dump(htmlspecialchars("a\x80b"));                 // string(0) ""
var_dump(htmlspecialchars("a\x80b", ENT_IGNORE));     // string(2) "ab"
var_dump(htmlspecialchars("a\x80b", ENT_SUBSTITUTE)); // string(5) "a�b"

[/php]

Clearly, you want the last behavior. In your real code it will probably look like this:

[php]

// this goes into the bootstrap (or where appropriate) to make the code
// not throw a notice on PHP 5.3
if (!defined('ENT_SUBSTITUTE')) {
    // if you want the empty string behavior on 5.3
    define('ENT_SUBSTITUTE', 0);

    // or, if you want the char removal behavior on 5.3, use this line instead
    // (don't forget about the security issues though!)
    // define('ENT_SUBSTITUTE', ENT_IGNORE);
}

// don't forget to specify the charset! Otherwise you'll get the old default charset on 5.3.
$escaped = htmlspecialchars($string, ENT_QUOTES | ENT_SUBSTITUTE, 'UTF-8');

[/php]

Doctype handling

In PHP 5.4 there are four additional flags for specifying the doctype in use:

  • ENT_HTML401 (HTML 4.01) => this is the default
  • ENT_HTML5 (HTML 5)
  • ENT_XML1 (XML 1)
  • ENT_XHTML (XHTML)

Depending on which doctype you specify, htmlspecialchars (and the other related functions) will use different entity tables.

You can see this in the following example (demo):

[php]

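// ENT_QUOTES is needed so that single quotes are escaped at all
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML401)); // string(6) "&#039;"
var_dump(htmlspecialchars("'", ENT_QUOTES | ENT_HTML5));   // string(6) "&apos;"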

[/php]

So for HTML 5 an &apos; entity will be generated, whereas for HTML 4.01 - which does not yet support &apos; - a numerical &#039; entity is returned.

The difference becomes more evident when using htmlentities, because the differences are larger there. You can easily see this by having a look at the raw translation tables.

To do this, we can use the get_html_translation_table function. Here is a first example for the XML 1 doctype (demo):

[php]

var_dump(get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | ENT_XML1));

[/php]

The result will look like this:

array(5) {
  ["""]=>
  string(6) "&quot;"
  ["&"]=>
  string(5) "&amp;"
  ["'"]=>
  string(6) "&apos;"
  ["<"]=>
  string(4) "&lt;"
  [">"]=>
  string(4) "&gt;"
}

This matches our expectations: XML by itself defines only the five basic entities.

Now try the same thing for HTML 5 (demo) and you’ll see something like this:

array(1510) {
  ["    "]=>
  string(5) "&Tab;"
  ["
"]=>
  string(9) "&NewLine;"
  ["!"]=>
  string(6) "&excl;"
  ["""]=>
  string(6) "&quot;"
  ["#"]=>
  string(5) "&num;"
  ["$"]=>
  string(8) "&dollar;"
  ["%"]=>
  string(8) "&percnt;"
  ["&"]=>
  string(5) "&amp;"
  ["'"]=>
  string(6) "&apos;"
  // ...
}

So HTML 5 defines a vast number of entities - 1510 to be precise. You can also try HTML 4.01 and XHTML; they both define 253 entities.
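If you would rather compare the table sizes than scroll through the dumps, something along these lines should work. The array of labels is just for this sketch, and the charset is passed explicitly as 'UTF-8'. The exact counts may differ between PHP versions, so none are shown here.

[php]

$doctypes = array(
    'HTML 4.01' => ENT_HTML401,
    'XHTML'     => ENT_XHTML,
    'XML 1'     => ENT_XML1,
    'HTML 5'    => ENT_HTML5,
);

foreach ($doctypes as $name => $flag) {
    // Fetch the full entity table for this doctype and count its entries.
    $table = get_html_translation_table(HTML_ENTITIES, ENT_QUOTES | $flag, 'UTF-8');
    echo $name, ': ', count($table), " entities\n";
}

[/php]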

Also affected by the chosen doctype is another new error handling flag which I did not mention above: ENT_DISALLOWED. This flag replaces characters that are formally valid code unit sequences, but are invalid in the given doctype, with a Unicode Replacement Character.

This way you can ensure that the returned string is always well formed regarding encoding (in the given doctype). I’m not sure though how much sense it makes to use this flag. The browser will handle invalid characters gracefully anyways, so this seems unnecessary to me (though I’m probably wrong).
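A small sketch of what ENT_DISALLOWED does, using the control character U+0001 as input. That byte is a perfectly valid UTF-8 sequence, but to my understanding it is not an allowed character in an HTML 5 document. The output is run through bin2hex only to make the otherwise invisible bytes readable.

[php]

// U+0001 is valid UTF-8, so without ENT_DISALLOWED it passes through as is.
var_dump(bin2hex(htmlspecialchars("a\x01b", ENT_QUOTES | ENT_HTML5, 'UTF-8')));
// string(6) "610162"

// With ENT_DISALLOWED it is replaced by U+FFFD (bytes EF BF BD).
var_dump(bin2hex(htmlspecialchars("a\x01b", ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED, 'UTF-8')));
// string(10) "61efbfbd62"

[/php]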

There is other stuff too…

… but I don’t want to list everything here. I think the three changes mentioned above are the most important improvements.

[php]

htmlspecialchars("<\x80The End\xef\xbf\xbf>", ENT_QUOTES | ENT_HTML5 | ENT_DISALLOWED | ENT_SUBSTITUTE, 'UTF-8');

[/php]

Reposted from: https://my.oschina.net/clearchen/blog/113105
