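Preface
Anyone who has written a web crawler knows that the hyperlinks it collects usually have to be normalized into absolute paths before they can be used. This article presents a regex-based function for that job. Without further ado, here are the details.
A crawler typically encounters links like the following: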
<!-- Empty link -->
<a href=""></a>
<!-- Whitespace only -->
<a href=" "> </a>
<!-- a tag with other attributes -->
<a href="index.html" alt="hyperlink"> index.html </a>
<a href="/" target="_blank"> / target="_blank" </a>
<a target="_blank" href="/" alt="hyperlink"> target="_blank" / alt="hyperlink" </a>
<a target="_blank" title="hyperlink" href="/" alt="hyperlink"> target="_blank" title="hyperlink" / alt="hyperlink" </a>
<!-- Root directory -->
<a href="/"> / </a>
<a href="a"> a </a>
<!-- With query parameters -->
<a href="/index.html?id=1"> /index.html?id=1 </a>
<a href="?id=2"> ?id=2 </a>
<!-- // -->
<a href="//index.html"> //index.html </a>
<a href="//www.mafutian.net"> //www.mafutian.net </a>
<!-- Internal link -->
<a href="http://www.hole_1.com/index.html"> http://www.php.cn/ </a>
<!-- External links -->
<a href="http://www.mafutian.net"> http://www.php.cn/ </a>
<a href="http://www.numberer.net"> http://www.php.cn/ </a>
<!-- Links to image and text files -->
<a href="1.jpg"> 1.jpg </a>
<a href="1.jpeg"> 1.jpeg </a>
<a href="1.gif"> 1.gif </a>
<a href="1.png"> 1.png </a>
<a href="1.txt"> 1.txt </a>
<!-- Ordinary links -->
<a href="index.html"> index.html </a>
<a href="index.html"> index.html </a>
<a href="./index.html"> ./index.html </a>
<a href="../index.html"> ../index.html </a>
<a href=".../"> .../ </a>
<a href="..."> ... </a>
<!-- Not page links, but they contain a colon -->
<a href="javascript:void(0)"> javascript:void(0) </a>
<a href="a:b"> a:b </a>
<a href="/a#a:b"> /a#a:b </a>
<a href="mailto:'mafutian@126.com'"> mailto:'mafutian@126.com' </a>
<a href="/tencent://message/?uin=335134463"> /tencent://message/?uin=335134463 </a>
<!-- Relative paths -->
<a href="."> . </a>
<a href=".."> .. </a>
<a href="../"> ../ </a>
<a href="/a/b/.."> /a/b/.. </a>
<a href="/a"> /a </a>
<a href="./b"> ./b </a>
<a href="./././././././././b"> ./././././././././b </a> <!-- equivalent to ./b -->
<a href="../c"> ../c </a>
<a href="../../d"> ../../d </a>
<a href="../a/../b/c/../d"> ../a/../b/c/../d </a>
<a href="./../e"> ./../e </a>
<a href="http://www.hole_1.org/./../e"> http://www.php.cn/ </a>
<a href="./.././f"> ./.././f </a>
<a href="http://www.hole_1.org/../a/.../../b/c/../d/.."> http://www.php.cn/ </a>
<!-- With port numbers -->
<a href=":8081/index.html"> :8081/index.html </a>
<a href="http://www.mafutian.net:80/index.html"> :80/index.html </a>
<a href="http://www.mafutian.net:8081/index.html"> http://www.php.cn/:8081/index.html </a>
<a href="http://www.mafutian.net:8082/index.html"> http://www.php.cn/:8082/index.html </a>
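Several of the sample links (empty values, javascript:, mailto:, a:b) are not pages a crawler can fetch, so it can help to screen them out before any path handling. The following is only a minimal sketch under that assumption; the function name is_crawlable_href and the rule "only http and https schemes are kept" are illustrative choices, not something the article specifies.

```php
<?php
// Hypothetical filter (not from the article): decide whether a href value
// is worth resolving into an absolute URL at all.
function is_crawlable_href(string $href): bool
{
    $href = trim($href);
    if ($href === '') {
        return false; // empty or whitespace-only href
    }
    // A leading "scheme:" other than http/https is treated as a non-page link
    // (javascript:, mailto:, a:b, ...). This rule is an assumption.
    if (preg_match('/^([a-z][a-z0-9+.\-]*):/i', $href, $m)) {
        return in_array(strtolower($m[1]), ['http', 'https'], true);
    }
    // Everything else is relative, root-relative or protocol-relative.
    return true;
}

var_dump(is_crawlable_href('javascript:void(0)')); // bool(false)
var_dump(is_crawlable_href('/a#a:b'));             // bool(true)
```

Note that /a#a:b passes because the colon sits in the fragment, not in a leading scheme.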
The first step is to turn every link into an absolute path:
http:// ... / ../ ../
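The article does not show code for this first step, so here is one possible sketch of it. The helper name prefix_with_base and the example base URL are assumptions made for illustration. It only prepends the missing parts of the base URL and deliberately leaves './' and '../' in place, giving the shape shown above.

```php
<?php
// Hypothetical helper (not from the article): prefix a href with whatever
// parts of the base URL it is missing. Dot segments are NOT resolved here.
function prefix_with_base(string $base, string $href): string
{
    $href = trim($href);
    // Already has a scheme (http:, mailto:, javascript:, ...): leave as is.
    if (preg_match('/^[a-z][a-z0-9+.\-]*:/i', $href)) {
        return $href;
    }
    $parts  = parse_url($base);
    $origin = $parts['scheme'] . '://' . $parts['host']
            . (isset($parts['port']) ? ':' . $parts['port'] : '');
    $path   = $parts['path'] ?? '/';

    if (strpos($href, '//') === 0) {            // protocol-relative: //host/...
        return $parts['scheme'] . ':' . $href;
    }
    if ($href !== '' && $href[0] === '/') {     // root-relative: /index.html
        return $origin . $href;
    }
    if ($href === '' || $href[0] === '?' || $href[0] === '#') {
        return $origin . $path . $href;         // same document as the base
    }
    // Anything else is relative to the directory part of the base path.
    $dir = substr($path, 0, strrpos($path, '/') + 1);
    return $origin . $dir . $href;
}

// The base URL below is an assumed example.
echo prefix_with_base('http://www.example.com/a/b/index.html', '../c'), "\n";
// http://www.example.com/a/b/../c
```

A value such as :8081/index.html is treated as an ordinary relative path in this sketch; whether a crawler should interpret it as a port is a separate decision.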
The rest of this article shows code that removes the './', '../', and '/..' segments from such an absolute path:
function url_to_absolute($relative)
{
    // Remove every './' that is not part of a '../'
    $absolute = preg_replace('/(?<!\.)\.\//', '', $relative);

    // Repeatedly collapse every '/abc/../' into '/' until none is left.
    // The lookbehind keeps the '//' after the scheme from matching.
    $pattern = '/(?<!\/)\/([^\/]{1,}?)\/\.\.\//';
    do {
        $absolute = preg_replace($pattern, '/', $absolute, -1, $count);
    } while ($count >= 1);

    // Remove a trailing '/abc/..' and then a bare trailing '/..'
    $absolute = preg_replace('/(?<!\/)\/([^\/]{1,}?)\/\.\.$/', '/', $absolute);
    $absolute = preg_replace('/\/\.\.$/', '', $absolute);

    // Remove any '../' that is still left (for example directly after the host)
    $absolute = preg_replace('/(?<!\.)\.\.\//', '', $absolute);

    return $absolute;
}

$relative = 'http://www.mytest.org/../a/.../../b/c/../d/..';
var_dump(url_to_absolute($relative));
// Output: string 'http://www.mytest.org/a/b/' (length=26)
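For comparison, the same cleanup can also be done without lookbehind regexes by splitting the path into segments and keeping a stack. This is an alternative sketch, not the article's method; the name normalize_path_segments is hypothetical, and it assumes the input is already an absolute URL of the form scheme://host/path.

```php
<?php
// Hypothetical alternative to url_to_absolute(): resolve '.' and '..'
// segment by segment instead of with regular expressions.
function normalize_path_segments(string $url): string
{
    // Split into "scheme://authority", path, and the "?query#fragment" rest.
    if (!preg_match('~^([a-z][a-z0-9+.\-]*://[^/?#]*)([^?#]*)(.*)$~i', $url, $m)) {
        return $url; // not an absolute URL of the assumed shape
    }
    $origin = $m[1];
    $path   = $m[2] === '' ? '/' : $m[2];
    $suffix = $m[3]; // query string and fragment, kept verbatim

    $segments = explode('/', substr($path, 1)); // drop the leading '/'
    $stack = [];
    foreach ($segments as $segment) {
        if ($segment === '.') {
            continue;              // current directory: nothing to do
        }
        if ($segment === '..') {
            array_pop($stack);     // parent directory (no-op at the root)
            continue;
        }
        $stack[] = $segment;       // ordinary segment, including '...'
    }
    // A path ending in '/.' or '/..' names a directory: keep the final '/'.
    $last = $segments[count($segments) - 1];
    if ($last === '.' || $last === '..') {
        $stack[] = '';
    }
    return $origin . '/' . implode('/', $stack) . $suffix;
}

echo normalize_path_segments('http://www.mytest.org/../a/.../../b/c/../d/..'), "\n";
// http://www.mytest.org/a/b/
```

On the article's test URL this gives the same result as the function above. The stack version additionally leaves any query string or fragment untouched.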