Java Basics: Problems When JTidy Parses Scripts


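Problem description:

I have recently been working on structured information extraction from web pages, using JTidy and XSLT. On pages that contain a lot of script, JTidy failed to clean the markup and reported the exception referred to in the title.

It turned out that the failure happens while parsing scripts: some non-standard content inside the scripts makes it impossible to determine where a script ends, and that causes the exception.

Solution:

At first I wanted to fix this by modifying the JTidy source, but that proved impractical. Changing the source could introduce other problems, and reading through it would take a long time.

So in the end I chose preprocessing: remove the scripts before handing the page to JTidy.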

Code

[java]

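import org.htmlparser.Node;
import org.htmlparser.Parser;
import org.htmlparser.tags.ScriptTag;
import org.htmlparser.tags.StyleTag;
import org.htmlparser.util.NodeList;
import org.htmlparser.util.ParserException;

public static String getFilterBody(String strBody) {
    // Parse the page with htmlparser
    Parser parser = Parser.createParser(strBody, "utf-8");
    String reValue = strBody;
    try {
        NodeList list = parser.parse(null);
        visitNodeList(list);
        reValue = list.toHtml();
    } catch (ParserException e1) {
        // Parsing failed: fall back to returning the unfiltered input
    }
    return reValue;
}

// Recursively filter out unwanted tags
private static void visitNodeList(NodeList list) {
    for (int i = 0; i < list.size(); i++) {
        Node node = list.elementAt(i);
        // Add further tag types to this condition to remove them as well
        if (node instanceof ScriptTag || node instanceof StyleTag) {
            list.remove(i);
            i--; // the next node has shifted into slot i; do not skip it
            continue;
        }
        NodeList children = node.getChildren();
        if (children != null && children.size() > 0) {
            visitNodeList(children);
        }
    }
}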
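One detail worth checking in a filter loop of this shape: removing element i from an index-based list shifts the later elements down by one, so advancing i straight afterwards skips the element that has just moved into slot i. The sketch below illustrates the effect with a plain java.util.ArrayList standing in for htmlparser's NodeList. That substitution is an assumption made so the example runs without the library, and the class and method names are hypothetical.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

final class RemoveWhileIterating {

    /** Forward loop that removes without adjusting the index. */
    static List<String> naive(List<String> input, String unwanted) {
        List<String> list = new ArrayList<>(input);
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i).equals(unwanted)) {
                list.remove(i);
                continue; // i++ now jumps over the element that slid into slot i
            }
        }
        return list;
    }

    /** Same loop, stepping the index back after each removal. */
    static List<String> adjusted(List<String> input, String unwanted) {
        List<String> list = new ArrayList<>(input);
        for (int i = 0; i < list.size(); i++) {
            if (list.get(i).equals(unwanted)) {
                list.remove(i);
                i--; // re-examine slot i on the next pass
            }
        }
        return list;
    }

    public static void main(String[] args) {
        List<String> tags = Arrays.asList("script", "script", "div");
        System.out.println(naive(tags, "script"));
        System.out.println(adjusted(tags, "script"));
    }
}
```

With two adjacent unwanted entries, the naive loop leaves the second one in place, while the adjusted loop removes both. Iterating from the end of the list towards the start would be another way to get the same result.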

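However, removing the scripts ran into the same problem: the parser got confused inside scripts and treated tags that appear in script text as ordinary tags. For example, with '<span></span>' inside a <script> element, the '</' is taken as the end of the script, so the script is captured only partially and removed only partially. I eventually found a fix online: setting the following two parameters changes how htmlparser handles scripts.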

[java]

org.htmlparser.scanners.ScriptScanner.STRICT = false;

org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

Setting either one of them is enough. The official documentation for the two parameters follows.

org.htmlparser.scanners.ScriptScanner.STRICT = false;

[java]

/**
 * Strict parsing of CDATA flag.
 * If this flag is set true, the parsing of script is performed without
 * regard to quotes. This means that erroneous script such as:
 * <pre>
 * document.write("</script>");
 * </pre>
 * will be parsed in strict accordance with appendix
 * <a href="/TR/html4/appendix/notes.html#notes-specifying-data">
 * B.3.2 Specifying non-HTML data</a> of the
 * <a href="/TR/html4/">HTML 4.01 Specification</a> and
 * hence will be split into two or more nodes. Correct javascript would
 * escape the ETAGO:
 * <pre>
 * document.write("<\/script>");
 * </pre>
 * If true, CDATA parsing will stop at the first ETAGO ("</") no matter
 * whether it is quoted or not. If false, balanced quotes (either single or
 * double) will shield an ETAGO. Because of the possibility of quotes within
 * single or multiline comments, these are also parsed. In most cases,
 * users prefer non-strict handling since there is so much broken script
 * out in the wild.
 */
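
To make the difference described in that comment concrete, here is a minimal sketch of the two scanning modes. It is not htmlparser's implementation: the class and method names are hypothetical, and it deliberately ignores the JavaScript comment handling that the Javadoc mentions. In strict mode the scan stops at the first "</"; in non-strict mode a "</" inside a balanced quoted string is skipped.

```java
final class ScriptEndSketch {

    /**
     * Returns the index of the "</" that ends the script text, or -1 if none.
     * strict = true : stop at the first "</", quoted or not.
     * strict = false: a "</" inside a single- or double-quoted string
     *                 literal is skipped.
     * Simplification: JavaScript comments and regex literals are not handled.
     */
    static int findScriptEnd(String text, boolean strict) {
        char quote = 0; // 0 means we are outside a string literal
        for (int i = 0; i < text.length(); i++) {
            char c = text.charAt(i);
            if (!strict && quote != 0) {
                if (c == '\\') {
                    i++;       // skip the escaped character
                } else if (c == quote) {
                    quote = 0; // string literal closed
                }
                continue;
            }
            if (!strict && (c == '"' || c == '\'')) {
                quote = c;     // string literal opened
                continue;
            }
            if (c == '<' && i + 1 < text.length() && text.charAt(i + 1) == '/') {
                return i;
            }
        }
        return -1;
    }

    public static void main(String[] args) {
        String page = "document.write('<span></span>');</script><p>body</p>";
        // Print what each mode considers to be the end of the script
        System.out.println(page.substring(findScriptEnd(page, true)));
        System.out.println(page.substring(findScriptEnd(page, false)));
    }
}
```

On the sample input, the strict scan stops at the quoted `</span>`, which is the truncated-script symptom described earlier, while the non-strict scan reaches the real `</script>`. A script that escapes the ETAGO as `<\/script>` is read correctly in both modes.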

org.htmlparser.lexer.Lexer.STRICT_REMARKS = false;

[java]

/**
 * Process remarks strictly flag.
 * If <code>true</code>, remarks are not terminated by ---&gt;
 * or --!&gt;, i.e. more than two dashes. If <code>false</code>,
 * a more lax (and closer to typical browser handling) remark parsing
 * is used.
 * Default <code>true</code>.
 */

By default, htmlparser parses strictly according to the HTML standard, so non-standard markup can cause errors.

With these two parameters set to false, parsing is no longer strict, and htmlparser copes with the kind of broken scripts and comments that are common in real pages.
