URL parsing functions in Python (urlparse, urlunparse, urlsplit, urlunsplit, urljoin)

urlparse()

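urlparse() splits a URL into six components and returns them as a named tuple (scheme, netloc, path, params, query, fragment). The components can later be put back together with urlunparse(); a 5-tuple produced by urlsplit() is reassembled with urlunsplit().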

def urlparse(url, scheme='', allow_fragments=True):
    """Parse a URL into 6 components:
    <scheme>://<netloc>/<path>;<params>?<query>#<fragment>
    Return a 6-tuple: (scheme, netloc, path, params, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    splitresult = urlsplit(url, scheme, allow_fragments)
    scheme, netloc, url, query, fragment = splitresult
    if scheme in uses_params and ';' in url:
        url, params = _splitparams(url)
    else:
        params = ''
    result = ParseResult(scheme, netloc, url, params, query, fragment)
    return _coerce_result(result)

Note: the tuple returned by urlparse() can be used to determine the network protocol (HTTP, FTP, etc.), the server address, the file path, and so on.

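Example code:

from urllib.parse import urlparse

# Parse the URL into a ParseResult named tuple
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
# Components are available as attributes; netloc is 'www.baidu.com'
print(url.netloc)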
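
To make the six components more concrete, here is a minimal sketch that reads each field by name. The URL is a made-up variation of the one above (the port, the ;type=a params, and the #top fragment are assumptions added only so that every field is non-empty); hostname and port are extra attributes that the result object derives from netloc.

```python
from urllib.parse import urlparse

# Hypothetical URL, extended so that all six components are non-empty.
parts = urlparse('http://www.baidu.com:8080/index.php;type=a?username=dgw#top')

print(parts.scheme)    # http
print(parts.netloc)    # www.baidu.com:8080
print(parts.path)      # /index.php
print(parts.params)    # type=a
print(parts.query)     # username=dgw
print(parts.fragment)  # top

# hostname and port are derived from netloc; port is an int.
print(parts.hostname)  # www.baidu.com
print(parts.port)      # 8080

# The result is still a plain tuple, so unpacking works as well.
scheme, netloc, path, params, query, fragment = parts
```

Attribute access is usually easier to read than positional indexing, but both refer to the same values.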

urlunparse()

urlunparse() takes a tuple (scheme, netloc, path, params, query, fragment) and assembles it into a correctly formatted URL.

def urlunparse(components):
    """Put a parsed URL back together again.  This may result in a
    slightly different, but equivalent URL, if the URL that was parsed
    originally had redundant delimiters, e.g. a ? with an empty query
    (the draft states that these are equivalent)."""
    scheme, netloc, url, params, query, fragment, _coerce_result = (
        _coerce_args(*components))
    if params:
        url = "%s;%s" % (url, params)
    return _coerce_result(urlunsplit((scheme, netloc, url, query, fragment)))

Example code:

from urllib.parse import urlparse, urlunparse

# Round trip: parse a URL, then rebuild it from the parsed result
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)
url_join1 = urlunparse(url)
print(url_join1)

# Build a URL from a hand-written 6-tuple
url_tuple = ("http", "www.baidu.com", "index.php", "", "username=dgw", "")
url_join2 = urlunparse(url_tuple)
print(url_join2)
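
A common use of the parse/unparse pair is changing one component while leaving the rest alone. The sketch below relies on the fact that the parsed result is a named tuple, so its _replace() method returns a modified copy; the replacement value username=admin is an assumption chosen only for illustration.

```python
from urllib.parse import urlparse, urlunparse

parts = urlparse('http://www.baidu.com/index.php?username=dgw')

# _replace() comes from namedtuple and returns a new ParseResult;
# 'username=admin' is a hypothetical replacement query string.
updated = parts._replace(query='username=admin')

new_url = urlunparse(updated)
print(new_url)  # http://www.baidu.com/index.php?username=admin
```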

urlsplit()

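urlsplit() is mainly used to analyze a URL string. It returns a 5-element tuple (scheme, netloc, path, query, fragment). urlsplit() is much like urlparse(), except that it does not split the params (the ;params part) off from the path.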

def urlsplit(url, scheme='', allow_fragments=True):
    """Parse a URL into 5 components:
    <scheme>://<netloc>/<path>?<query>#<fragment>
    Return a 5-tuple: (scheme, netloc, path, query, fragment).
    Note that we don't break the components up in smaller bits
    (e.g. netloc is a single string) and we don't expand % escapes."""
    url, scheme, _coerce_result = _coerce_args(url, scheme)
    allow_fragments = bool(allow_fragments)
    key = url, scheme, allow_fragments, type(url), type(scheme)
    cached = _parse_cache.get(key, None)
    ......

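Example code:

from urllib.parse import urlparse, urlsplit

# urlparse() returns six components
url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)

# urlsplit() returns five components (no separate params field)
url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)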
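
The example URL above has no ;params part, so the two results look almost identical. The following sketch uses a hypothetical URL with ;type=a appended to the path (an assumption, not part of the original example) to show where the two functions actually differ.

```python
from urllib.parse import urlparse, urlsplit

# Hypothetical URL with a ';params' section after the path.
url = 'http://www.baidu.com/index.php;type=a?username=dgw'

parsed = urlparse(url)
split = urlsplit(url)

# urlparse() separates the params from the path.
print(parsed.path)    # /index.php
print(parsed.params)  # type=a

# urlsplit() leaves them attached to the path.
print(split.path)     # /index.php;type=a

print(len(parsed), len(split))  # 6 5
```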

urlunsplit()

def urlunsplit(components):
    """Combine the elements of a tuple as returned by urlsplit() into a
    complete URL as a string. The data argument can be any five-item iterable.
    This may result in a slightly different, but equivalent URL, if the URL that
    was parsed originally had unnecessary delimiters (for example, a ? with an
    empty query; the RFC states that these are equivalent)."""
    scheme, netloc, url, query, fragment, _coerce_result = (
        _coerce_args(*components))
    if netloc or (scheme and scheme in uses_netloc and url[:2] != '//'):
        if url and url[:1] != '/': url = '/' + url

Example code:

from urllib.parse import urlparse, urlsplit, urlunsplit

url = urlparse('http://www.baidu.com/index.php?username=dgw')
print(url)

# Split into five components, then rebuild the URL from them
url2 = urlsplit('http://www.baidu.com/index.php?username=dgw')
print(url2)
url3 = urlunsplit(url2)
print(url3)
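
As the docstring says, urlunsplit() accepts any five-item iterable, not only the result of urlsplit(). A small sketch with a hand-written list follows; the https scheme and the top fragment are assumptions made up for the illustration. Note that the path is given without a leading slash, and urlunsplit() inserts one because a netloc is present.

```python
from urllib.parse import urlunsplit

# Hypothetical components: (scheme, netloc, path, query, fragment).
components = ['https', 'www.baidu.com', 'index.php', 'username=dgw', 'top']

rebuilt = urlunsplit(components)
print(rebuilt)  # https://www.baidu.com/index.php?username=dgw#top
```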

urljoin()

urljoin() joins a base URL with a possibly relative URL to form the absolute address of the latter.

Note: if the base URL does not end with the character /, the rightmost segment of the base URL's path is replaced by the relative path.

def urljoin(base, url, allow_fragments=True):
    """Join a base URL and a possibly relative URL to form an absolute
    interpretation of the latter."""
    if not base:
        return url
    if not url:
        return base
    base, url, _coerce_result = _coerce_args(base, url)
    ......

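Example code:

from urllib.parse import urljoin

# Base ends with '/': the relative path is appended under /test/
url = urljoin('http://www.baidu.com/test/', 'index.php?username=dgw')
print(url)  # http://www.baidu.com/test/index.php?username=dgw

# Base does not end with '/': the last segment 'test' is replaced
url2 = urljoin('http://www.baidu.com/test', 'index.php?username=dgw')
print(url2)  # http://www.baidu.com/index.php?username=dgw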
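
The trailing-slash rule is only one case of relative resolution. The sketch below covers a few other forms the second argument can take; the extra path segment a/ and the host example.com are assumptions introduced purely for illustration.

```python
from urllib.parse import urljoin

base = 'http://www.baidu.com/test/a/'  # hypothetical base with two path segments

# '..' climbs one directory up from /test/a/.
up = urljoin(base, '../index.php')
print(up)        # http://www.baidu.com/test/index.php

# A path starting with '/' replaces the whole base path.
rooted = urljoin(base, '/index.php')
print(rooted)    # http://www.baidu.com/index.php

# An absolute URL ignores the base entirely.
absolute = urljoin(base, 'http://example.com/index.php')
print(absolute)  # http://example.com/index.php
```

In short, the further the second argument is from a plain relative path, the less of the base URL survives in the result.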