Removing HTML Tags in C# with a for Loop

2019-12-30 13:47:56 于海丽

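Removing the HTML tags from a piece of text, to get rid of the styling and paragraph markup it contains, is most commonly done with a regular expression. Note, however, that regular expressions cannot handle every HTML document, so an iterative approach such as a for loop is sometimes the better choice.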
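As a small illustration of that limitation, the sketch below applies the pattern `<.*?>` (the one used by the regex methods in this article) to two inputs that it handles poorly. The class name `RegexLimitDemo` and the two sample strings are made up for this example; only `Regex.Replace` and `RegexOptions.Singleline` come from .NET itself.

```csharp
using System;
using System.Text.RegularExpressions;

public static class RegexLimitDemo
{
    // Same pattern as the regex-based methods in this article.
    public static string Strip(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    // Variant with RegexOptions.Singleline, so that '.' also matches '\n'.
    public static string StripSingleline(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty, RegexOptions.Singleline);
    }

    public static void Main()
    {
        // Hypothetical inputs: a tag that spans two lines,
        // and an attribute value that contains '>'.
        string multiLine = "<p\nclass=\"x\">Hi</p>";
        string attribute = "<a title=\"a > b\">link</a>";

        // The opening tag survives because '.' stops at the line break.
        Console.WriteLine(Strip(multiLine) == "<p\nclass=\"x\">Hi");   // True

        // Singleline fixes the line-break case...
        Console.WriteLine(StripSingleline(multiLine) == "Hi");         // True

        // ...but the quoted '>' still ends the match too early.
        Console.WriteLine(StripSingleline(attribute) == " b\">link");  // True
    }
}
```

A character-by-character loop that only tracks `<` and `>` copes with the line break but is fooled by the quoted `>` in the same way, so neither technique replaces a real HTML parser for untrusted or irregular markup.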

Consider the following code:


using System;
using System.Text.RegularExpressions;

/// <summary>
/// Methods to remove HTML from strings.
/// </summary>
public static class HtmlRemoval
{
    /// <summary>
    /// Remove HTML from string with Regex.
    /// </summary>
    public static string StripTagsRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    /// <summary>
    /// Compiled regular expression for performance.
    /// </summary>
    static readonly Regex _htmlRegex = new Regex("<.*?>", RegexOptions.Compiled);

    /// <summary>
    /// Remove HTML from string with compiled Regex.
    /// </summary>
    public static string StripTagsRegexCompiled(string source)
    {
        return _htmlRegex.Replace(source, string.Empty);
    }

    /// <summary>
    /// Remove HTML tags from string using char array.
    /// </summary>
    public static string StripTagsCharArray(string source)
    {
        // The output can never be longer than the input.
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false; // true while between '<' and '>'

        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<')
            {
                inside = true;
                continue;
            }
            if (let == '>')
            {
                inside = false;
                continue;
            }
            if (!inside)
            {
                array[arrayIndex] = let;
                arrayIndex++;
            }
        }
        return new string(array, 0, arrayIndex);
    }
}




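The HtmlRemoval class offers two different approaches to removing the HTML tags from a given string: one uses a regular expression (in a plain and a compiled variant), the other processes the string with a char array inside a for loop. Here is a test program: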

	
using System;
using System.Text.RegularExpressions;

class Program
{
    static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer " +
            "and he stripped the <i>HTML</i> tags.</p>";

        Console.WriteLine(HtmlRemoval.StripTagsRegex(html));
        Console.WriteLine(HtmlRemoval.StripTagsRegexCompiled(html));
        Console.WriteLine(HtmlRemoval.StripTagsCharArray(html));
    }
}



The output is as follows:

	There was a .NET programmer and he stripped the HTML tags.

	There was a .NET programmer and he stripped the HTML tags.

	There was a .NET programmer and he stripped the HTML tags.

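The test program calls the three methods of the HtmlRemoval class in turn, and all of them return the same result: the given string with its HTML tags removed. Of the two regex versions, the second is recommended. It reuses a predefined Regex object created with RegexOptions.Compiled, which is faster than the first method. RegexOptions.Compiled does have drawbacks, though: in some cases it increases startup time by a factor of tens. For details, see these two articles: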

	RegexOption.Compiled

	Regex Performance
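To get a feel for the startup cost on your own machine, a rough sketch like the following times the construction and first use of a Regex with and without RegexOptions.Compiled. The class name `CompiledStartupDemo` and the helper `TimeFirstUse` are hypothetical names for this example, and a single Stopwatch reading is only an indication, not a proper benchmark.

```csharp
using System;
using System.Diagnostics;
using System.Text.RegularExpressions;

public static class CompiledStartupDemo
{
    // Times constructing a Regex with the given options plus its first Replace call.
    public static TimeSpan TimeFirstUse(RegexOptions options, string input, out string result)
    {
        Stopwatch watch = Stopwatch.StartNew();
        Regex regex = new Regex("<.*?>", options);
        result = regex.Replace(input, string.Empty);
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        const string html = "<p>There was a <b>.NET</b> programmer</p>";

        // Warm-up with an unrelated pattern, so that loading the regex
        // engine itself is not charged to the first measurement.
        new Regex("x").IsMatch("x");

        string plain, compiled;
        TimeSpan withoutCompiled = TimeFirstUse(RegexOptions.None, html, out plain);
        TimeSpan withCompiled = TimeFirstUse(RegexOptions.Compiled, html, out compiled);

        Console.WriteLine("None:     {0:F3} ms", withoutCompiled.TotalMilliseconds);
        Console.WriteLine("Compiled: {0:F3} ms", withCompiled.TotalMilliseconds);

        // Both variants should strip the tags identically.
        Console.WriteLine(plain == compiled);
    }
}
```

The numbers depend on the runtime version and hardware, so treat them as a rough guide. Since the compiled variant's extra cost is paid at construction, keeping the object in a static field, as HtmlRemoval does, means it is paid only once per process.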

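In general, regular expressions are not the fastest option, which is why the HtmlRemoval class also provides another method that processes the string with a char array. The test program was given 1,000 HTML files of roughly 8,000 characters each, all read with File.ReadAllText, and the results showed the char-array method to be the fastest.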
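The original test files are not available here, so the following is only a sketch of how such a comparison could be set up. It assumes synthetic in-memory documents of about 8,000 characters instead of files read with File.ReadAllText. The names `StripBenchmark`, `MakeDocument` and `Measure` are invented for this example, and the char-array method is a copy of the article's loop so that the sketch compiles on its own.

```csharp
using System;
using System.Diagnostics;
using System.Text;
using System.Text.RegularExpressions;

public static class StripBenchmark
{
    static readonly Regex CompiledRegex = new Regex("<.*?>", RegexOptions.Compiled);

    public static string StripRegex(string source)
    {
        return Regex.Replace(source, "<.*?>", string.Empty);
    }

    public static string StripRegexCompiled(string source)
    {
        return CompiledRegex.Replace(source, string.Empty);
    }

    // Copy of the article's for-loop approach.
    public static string StripCharArray(string source)
    {
        char[] array = new char[source.Length];
        int arrayIndex = 0;
        bool inside = false;
        for (int i = 0; i < source.Length; i++)
        {
            char let = source[i];
            if (let == '<') { inside = true; continue; }
            if (let == '>') { inside = false; continue; }
            if (!inside) { array[arrayIndex++] = let; }
        }
        return new string(array, 0, arrayIndex);
    }

    // Builds a synthetic document of at least minLength characters (an assumption;
    // the article's test read real HTML files instead).
    public static string MakeDocument(int minLength)
    {
        const string fragment =
            "<p>There was a <b>.NET</b> programmer and he stripped the <i>HTML</i> tags.</p>\n";
        StringBuilder builder = new StringBuilder();
        while (builder.Length < minLength) { builder.Append(fragment); }
        return builder.ToString();
    }

    // Runs strip over every document once and returns the elapsed time.
    public static TimeSpan Measure(Func<string, string> strip, string[] documents)
    {
        strip(documents[0]); // warm-up, so JIT and regex compilation are not timed
        Stopwatch watch = Stopwatch.StartNew();
        foreach (string document in documents) { strip(document); }
        watch.Stop();
        return watch.Elapsed;
    }

    public static void Main()
    {
        string[] documents = new string[1000];
        for (int i = 0; i < documents.Length; i++) { documents[i] = MakeDocument(8000); }

        Console.WriteLine("Regex:          {0:F1} ms", Measure(StripRegex, documents).TotalMilliseconds);
        Console.WriteLine("Regex compiled: {0:F1} ms", Measure(StripRegexCompiled, documents).TotalMilliseconds);
        Console.WriteLine("Char array:     {0:F1} ms", Measure(StripCharArray, documents).TotalMilliseconds);
    }
}
```

Timings from a single pass like this vary between runs and machines, so repeating the measurement several times and comparing the spread is advisable before drawing conclusions.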