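An important step in a search engine is extracting the useful content from web pages. Put simply, this means stripping the HTML markup out of the HTML text and keeping only what we would see when opening the document in a browser such as IE (images are not considered here).
The markup in an HTML text falls into four groups, comments, script, style, and all other tags, and each group is removed separately: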
1. Remove comments. The regex is:
output = Regex.Replace(input, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
2. Remove script. The regexes are:
output = Regex.Replace(input, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
output2 = Regex.Replace(output, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
3. Remove style. The regex is:
output = Regex.Replace(input, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
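4. Remove the remaining HTML tags and decode the common entities:
// Strip tags before decoding entities, so that escaped text such as &lt;b&gt; is not mistaken for a tag.
result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty);
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp;", "&");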
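The order of the two halves of step 4 matters: if entities such as &lt; and &gt; are decoded before the tag regex runs, escaped text can be mistaken for markup and deleted. The sketch below strips tags first and then decodes entities with System.Net.WebUtility.HtmlDecode, a framework call swapped in for the hand-written Replace chain. The helper name TagsToText is hypothetical, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Net;
using System.Text.RegularExpressions;

// Hypothetical helper: turns <br> into a line break, strips the remaining
// tags, and only then decodes entities.
static string TagsToText(string html)
{
    string text = Regex.Replace(html, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
    text = Regex.Replace(text, @"<[^>]*>", string.Empty);
    return WebUtility.HtmlDecode(text);
}

Console.WriteLine(TagsToText("<p>1 &lt; 2 &amp;&amp; 3 &gt; 2</p>"));
// prints: 1 < 2 && 3 > 2
```

Decoding last keeps the comparison operators in the sample intact; with the opposite order the tag regex would swallow everything between the decoded "<" and ">".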
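As you can see in the code above, I use the RegexOptions.Singleline option. This option is important: it makes "." (the dot) match newline characters as well. Without it, the regular expressions listed above fail in most cases, because the comments, scripts, and styles in real pages usually span several lines.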
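The effect of RegexOptions.Singleline is easy to check on a script block that spans several lines. This is a minimal sketch; the sample HTML and variable names are made up for illustration, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

string html = "<script>\nvar x = 1;\n</script>visible text";
string pattern = @"<script[^>]*>.*?</script>";

// Without Singleline, "." cannot match "\n", so the pattern never reaches
// the closing tag and the input comes back unchanged.
string withoutFlag = Regex.Replace(html, pattern, string.Empty, RegexOptions.IgnoreCase);

// With Singleline, "." also matches "\n" and the whole block is removed.
string withFlag = Regex.Replace(html, pattern, string.Empty,
    RegexOptions.IgnoreCase | RegexOptions.Singleline);

Console.WriteLine(withoutFlag == html); // True
Console.WriteLine(withFlag);            // visible text
```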
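HTML syntax has grown quite complex over the years, and the list above covers only the most important kinds of markup. I will add more tag-stripping regexes as the development of Rost WebSpider continues.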
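As more tags need the same treatment, one way to keep the patterns manageable is a single regex with an alternation of tag names and a backreference for the closing tag. This is only a sketch under that assumption, not the approach Rost WebSpider takes; the helper name StripBlocks and the tag list are illustrative, and the top-level statements assume C# 9 or later.

```csharp
using System;
using System.Text.RegularExpressions;

// Hypothetical helper: removes <script>, <noscript>, and <style> elements
// together with their contents. \1 refers back to whichever tag name matched,
// so adding a tag only means extending the alternation.
static string StripBlocks(string html)
{
    const string pattern = @"<(script|noscript|style)\b[^>]*>.*?</\1\s*>";
    return Regex.Replace(html, pattern, string.Empty,
        RegexOptions.IgnoreCase | RegexOptions.Singleline);
}

string sample = "<SCRIPT type='x'>\nvar a = 1;\n</script>Hello<style>p { color: red; }</style>";
Console.WriteLine(StripBlocks(sample)); // Hello
```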
Below is a C# class that extracts the useful content from an HTML string:
using System; 
using System.Collections.Generic; 
using System.Text; 
using System.Text.RegularExpressions; 
class HtmlExtract 
{ 
#region private attributes 
private string _strHtml; 
#endregion 
#region public methods
public HtmlExtract(string inStrHtml)
{
_strHtml = inStrHtml;
}
// Runs every removal step in order and returns the remaining visible text.
public string ExtractText()
{ 
string result = _strHtml; 
result = RemoveComment(result); 
result = RemoveScript(result); 
result = RemoveStyle(result); 
result = RemoveTags(result); 
return result.Trim(); 
} 
#endregion 
#region private methods 
private string RemoveComment(string input) 
{ 
string result = input; 
// Remove comments; Singleline lets "." span line breaks inside a comment.
result = Regex.Replace(result, @"<!--.*?-->", string.Empty, RegexOptions.Singleline);
return result; 
} 
private string RemoveStyle(string input) 
{ 
string result = input; 
// Remove <style> blocks together with their contents.
result = Regex.Replace(result, @"<style[^>]*?>.*?</style>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
return result; 
} 
private string RemoveScript(string input) 
{ 
string result = input; 
// Remove <script> and <noscript> blocks together with their contents.
result = Regex.Replace(result, @"<script[^>]*?>.*?</script>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
result = Regex.Replace(result, @"<noscript[^>]*?>.*?</noscript>", string.Empty, RegexOptions.IgnoreCase | RegexOptions.Singleline);
return result; 
} 
private string RemoveTags(string input) 
{ 
string result = input; 
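// Strip tags before decoding entities, so that escaped text such as &lt;b&gt; is not mistaken for a tag.
result = Regex.Replace(result, @"<br\s*/?>", "\r\n", RegexOptions.IgnoreCase);
result = Regex.Replace(result, @"<[\s\S]*?>", string.Empty);
result = result.Replace("&nbsp;", " ");
result = result.Replace("&quot;", "\"");
result = result.Replace("&lt;", "<");
result = result.Replace("&gt;", ">");
result = result.Replace("&amp;", "&");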
return result; 
} 
#endregion