How to Filter Web Page Source Code in VB2010

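Using `WebClient` (simple and direct, suited to simple pages)

WebClient is the simplest option and works well for fetching the source of static pages. It is less suitable for complex pages that depend on cookies or sessions, or that need fine control over the request: WebClient can set headers through its Headers property, but it has no built-in cookie container or timeout setting.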

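Steps:

Create a WebClient instance.
Set the encoding: to decode Chinese and other non-ASCII characters correctly, set the WebClient's Encoding property to match the page's character set (UTF8 for most modern pages).
Call DownloadString: pass in the target URL to get the page source as a string.
Filter the string: use string methods or regular expressions to extract what you need.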
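
The filtering step can be tried on its own, without any network access. The following is a minimal sketch that runs the same kind of href regular expression against a hard-coded HTML fragment. The module name `LinkFilterSketch`, the function `ExtractLinks`, and the sample markup are hypothetical names chosen for illustration, and the pattern only recognizes double-quoted href values.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.Text.RegularExpressions

Module LinkFilterSketch
    ' Returns the value of every double-quoted href found in an <a> tag.
    Function ExtractLinks(html As String) As List(Of String)
        Dim links As New List(Of String)
        Dim pattern As String = "<a\s+(?:[^>]*?\s+)?href=""([^""]*)"""
        For Each m As Match In Regex.Matches(html, pattern, RegexOptions.IgnoreCase)
            links.Add(m.Groups(1).Value)
        Next
        Return links
    End Function

    Sub Main()
        ' Hypothetical sample markup, used only for illustration
        Dim sample As String =
            "<p><a href=""/news"">News</a> <a class=""x"" href=""http://example.com/"">Home</a></p>"
        For Each link As String In ExtractLinks(sample)
            Console.WriteLine(link)
        Next
    End Sub
End Module
```

Keeping the filter in a separate function like this makes it easy to check the pattern against saved page fragments before pointing it at a live site.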

Example code (VB2010)

This example downloads the home page source of "http://www.cnblogs.com" and extracts the href attribute of every <a> tag.

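Imports System.Net
Imports System.Text
Imports System.Text.RegularExpressions
Public Class Form1
    Private Sub btnGetAndFilter_Click(sender As Object, e As EventArgs) Handles btnGetAndFilter.Click
        Dim url As String = "http://www.cnblogs.com"
        Try
            ' 1. Create the WebClient instance.
            '    It is not wrapped in a Using block: the download is asynchronous, so the
            '    client must stay alive until the completion callback has run.
            Dim client As New WebClient()
            ' 2. Set the encoding to UTF-8 so that Chinese text is decoded correctly
            client.Encoding = Encoding.UTF8
            ' 3. Register the completion handler before starting the download
            AddHandler client.DownloadStringCompleted, AddressOf WebClient_DownloadStringCompleted
            ' 4. Start the download; download errors are reported through e.Error in the callback.
            '    Note: WebClient has no Timeout property; use HttpWebRequest when a timeout is needed.
            client.DownloadStringAsync(New Uri(url))
        Catch ex As Exception
            ' Only start-up failures (for example a malformed URL) end up here
            MessageBox.Show("Failed to start the download: " & ex.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
        End Try
    End Sub
    ' Callback invoked when the asynchronous download completes
    Private Sub WebClient_DownloadStringCompleted(sender As Object, e As DownloadStringCompletedEventArgs)
        ' The download has finished, so the client can be released
        DirectCast(sender, WebClient).Dispose()
        If e.Error IsNot Nothing Then
            MessageBox.Show("Download error: " & e.Error.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
            Return
        End If
        Dim htmlSource As String = e.Result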
        Dim links As New List(Of String)
        ' --- Filtering logic ---
        ' Method A: regular expression (more powerful and flexible)
        ' Match the href attribute of <a> tags (double-quoted values only)
        Dim regex As New Regex("<a\s+(?:[^>]*?\s+)?href=""([^""]*)""", RegexOptions.IgnoreCase Or RegexOptions.Compiled)
        Dim matches As MatchCollection = regex.Matches(htmlSource)
        For Each match As Match In matches
            If match.Groups.Count > 1 Then
                links.Add(match.Groups(1).Value)
            End If
        Next
        ' Method B: string methods (simple cases)
        ' Dim startPos As Integer = 0
        ' While True
        '     Dim hrefIndex As Integer = htmlSource.IndexOf("href=""", startPos)
        '     If hrefIndex = -1 Then Exit While
        '
        '     startPos = hrefIndex + 6 ' length of "href="""
        '     Dim endIndex As Integer = htmlSource.IndexOf("""", startPos)
        '     If endIndex = -1 Then Exit While
        '
        '     links.Add(htmlSource.Substring(startPos, endIndex - startPos))
        '     startPos = endIndex
        ' End While
        ' Show the results
        If links.Count > 0 Then
            ' Display the results in a ListBox (a TextBox would also work)
            lstLinks.Items.Clear()
            For Each link As String In links
                lstLinks.Items.Add(link)
            Next
            MessageBox.Show(String.Format("Found {0} links.", links.Count), "Success")
        Else
            MessageBox.Show("No links found.", "Info")
        End If
    End Sub
End Class
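
WebClient does not expose a timeout setting of its own. One common workaround is to derive a class from WebClient and set the timeout on the underlying request by overriding GetWebRequest. The sketch below assumes that approach; `TimeoutWebClient`, `TimeoutMs`, and `TimeoutSketch` are hypothetical names. WebRequest.Timeout is documented as applying to synchronous requests, so the sketch uses DownloadString rather than DownloadStringAsync.

```vbnet
Imports System
Imports System.Net
Imports System.Text

' Hypothetical helper class: a WebClient whose requests carry a timeout
Public Class TimeoutWebClient
    Inherits WebClient

    ' Timeout in milliseconds; 15 seconds is an arbitrary default
    Public Property TimeoutMs As Integer = 15000

    Protected Overrides Function GetWebRequest(address As Uri) As WebRequest
        Dim request As WebRequest = MyBase.GetWebRequest(address)
        If request IsNot Nothing Then
            request.Timeout = TimeoutMs
        End If
        Return request
    End Function
End Class

Module TimeoutSketch
    Sub Main()
        Using client As New TimeoutWebClient()
            client.Encoding = Encoding.UTF8
            client.TimeoutMs = 5000
            Try
                Dim html As String = client.DownloadString("http://www.cnblogs.com")
                Console.WriteLine("Downloaded {0} characters", html.Length)
            Catch ex As WebException
                ' A timeout surfaces as a WebException with Status = Timeout
                Console.WriteLine("Request failed: " & ex.Status.ToString())
            End Try
        End Using
    End Sub
End Module
```

In a Windows Forms application a synchronous download blocks the UI thread while it runs, so this variant fits best in a console tool or a background worker.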

Using `HttpWebRequest` / `HttpWebResponse` (more powerful)

HttpWebRequest is a lower-level class that gives you more control over the request. It lets you:

Set custom request headers (such as User-Agent and Referer).
Handle cookies.
Send POST requests.
Set a timeout.
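
The capabilities listed above can be combined in a single request. The following is a minimal sketch, not a definitive implementation: `PostSketch`, `EncodeForm`, and `PostForm` are hypothetical helpers, and the URL in the commented-out call is a placeholder rather than a real endpoint.

```vbnet
Imports System
Imports System.Collections.Generic
Imports System.IO
Imports System.Net
Imports System.Text

Module PostSketch
    ' Encodes name/value pairs as application/x-www-form-urlencoded text.
    Function EncodeForm(fields As Dictionary(Of String, String)) As String
        Dim parts As New List(Of String)
        For Each pair As KeyValuePair(Of String, String) In fields
            parts.Add(Uri.EscapeDataString(pair.Key) & "=" & Uri.EscapeDataString(pair.Value))
        Next
        Return String.Join("&", parts)
    End Function

    ' Sends a POST request and returns the response body (assumed to be UTF-8).
    Function PostForm(url As String, fields As Dictionary(Of String, String),
                      cookies As CookieContainer) As String
        Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
        request.Method = "POST"
        request.ContentType = "application/x-www-form-urlencoded"
        request.Referer = url
        ' Cookies set by the server are stored here and sent on later requests
        request.CookieContainer = cookies
        request.Timeout = 15000

        Dim body As Byte() = Encoding.UTF8.GetBytes(EncodeForm(fields))
        request.ContentLength = body.Length
        Using stream As Stream = request.GetRequestStream()
            stream.Write(body, 0, body.Length)
        End Using

        Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
            Using reader As New StreamReader(response.GetResponseStream(), Encoding.UTF8)
                Return reader.ReadToEnd()
            End Using
        End Using
    End Function

    Sub Main()
        Dim fields As New Dictionary(Of String, String)
        fields.Add("user", "demo")
        fields.Add("note", "a b&c")
        Console.WriteLine(EncodeForm(fields))
        ' Hypothetical call; replace the URL with a real form endpoint:
        ' Dim html As String = PostForm("http://example.com/login", fields, New CookieContainer())
    End Sub
End Module
```

Passing the same CookieContainer instance to later requests is what keeps a session alive across calls.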

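Many modern sites reject requests that do not look like they come from a browser, so sending browser-like headers is often necessary.

Example code (VB2010)

This example mimics a common browser when requesting the page and then reads the source.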

Imports System.Net
Imports System.IO
Imports System.Text
Imports System.Text.RegularExpressions
Public Class Form2
    Private Sub btnAdvancedRequest_Click(sender As Object, e As EventArgs) Handles btnAdvancedRequest.Click
        Dim url As String = "http://www.cnblogs.com"
        Dim htmlSource As String = String.Empty
        Try
            ' 1. Create the HttpWebRequest object
            Dim request As HttpWebRequest = DirectCast(WebRequest.Create(url), HttpWebRequest)
            ' 2. Set request properties to mimic a browser
            request.Method = "GET"
            request.UserAgent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36"
            request.Accept = "text/html,application/xhtml+xml,application/xml;q=0.9,image/webp,*/*;q=0.8"
            request.Headers.Add("Accept-Language", "zh-CN,zh;q=0.9,en;q=0.8")
            request.Timeout = 15000 ' 15-second timeout
            ' 3. Get the response
            Using response As HttpWebResponse = DirectCast(request.GetResponse(), HttpWebResponse)
                ' Check that the status code is 200 (OK)
                If response.StatusCode = HttpStatusCode.OK Then
                    ' 4. Read the data from the response stream (assumes the page is UTF-8)
                    Using reader As New StreamReader(response.GetResponseStream(), Encoding.UTF8)
                        htmlSource = reader.ReadToEnd()
                    End Using
                End If
            End Using
            ' 5. Filter the source (here we extract the page title, <title>...</title>)
            '    Singleline lets "." match line breaks, so a title that spans lines is found too
            Dim titleRegex As New Regex("<title[^>]*>(.*?)</title>", RegexOptions.IgnoreCase Or RegexOptions.Singleline)
            Dim titleMatch As Match = titleRegex.Match(htmlSource)
            If titleMatch.Success Then
                txtTitle.Text = titleMatch.Groups(1).Value.Trim()
            Else
                txtTitle.Text = "Title not found"
            End If
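        Catch ex As WebException
            ' ex.Status is a WebExceptionStatus value (such as Timeout or ProtocolError), not an HTTP status code
            MessageBox.Show("Network request failed: " & ex.Message & vbCrLf & "Status: " & ex.Status.ToString(), "Web error", MessageBoxButtons.OK, MessageBoxIcon.Error)
        Catch ex As Exception
            MessageBox.Show("Unexpected error: " & ex.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
        End Try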
    End Sub
End Class
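
The example above decodes the response as UTF-8 unconditionally. When a page may use another character set, one option is to pick the encoding from the charset the server reports, which HttpWebResponse exposes through its CharacterSet property. The helper below is a sketch under that assumption; `CharsetSketch` and `ResolveEncoding` are hypothetical names, and pages that declare their charset only in a <meta> tag are not handled.

```vbnet
Imports System
Imports System.Text

Module CharsetSketch
    ' Maps a charset name (for example from HttpWebResponse.CharacterSet)
    ' to an Encoding, falling back to UTF-8 when it is missing or unknown.
    Function ResolveEncoding(charset As String) As Encoding
        If String.IsNullOrWhiteSpace(charset) Then Return Encoding.UTF8
        Try
            ' Some servers wrap the charset name in quotes, so strip them first
            Return Encoding.GetEncoding(charset.Trim().Trim(""""c))
        Catch ex As ArgumentException
            ' GetEncoding throws ArgumentException for names it does not recognize
            Return Encoding.UTF8
        End Try
    End Function

    Sub Main()
        Console.WriteLine(ResolveEncoding("gb2312").WebName)
        Console.WriteLine(ResolveEncoding("").WebName)
        Console.WriteLine(ResolveEncoding("no-such-charset").WebName)
    End Sub
End Module
```

With this helper, the reader in the request code could be created as `New StreamReader(response.GetResponseStream(), ResolveEncoding(response.CharacterSet))`.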

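Using `Html Agility Pack` (recommended! the most professional and robust approach)

Parsing HTML directly with string methods or regular expressions is fragile: a small change in the page structure can break your code. Html Agility Pack (HAP) is a powerful HTML parsing library that lets you work with "imperfect" HTML much as you would with XML, which makes data extraction simpler and more robust.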

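Step 1: Add the references

In VB2010, right-click your project -> "Add Reference".
Select the ".NET" tab.
Find and select System.Xml and System.Xml.Linq (if the project targets .NET Framework 4.0 or later).
More importantly, download the HtmlAgilityPack.dll file, then switch to the "Browse" tab and locate and add that DLL.
- Download: https://html-agility-pack.net/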

Example code (VB2010)

This example uses HAP to parse a list of blog posts and extract each post's title and link.

Imports HtmlAgilityPack
Imports System.Net
Imports System.Text
Public Class Form3
    Private Sub btnHAP_Click(sender As Object, e As EventArgs) Handles btnHAP.Click
        Dim url As String = "http://www.cnblogs.com/"
        Dim lstPosts As New List(Of String)
        Try
            ' 1. Download the page source (with WebClient or HttpWebRequest)
            Dim htmlSource As String
            Using client As New WebClient()
                client.Encoding = Encoding.UTF8
                htmlSource = client.DownloadString(url)
            End Using
            ' 2. Create an HtmlDocument and load the source.
            '    The type is fully qualified because System.Windows.Forms also defines an HtmlDocument class.
            Dim doc As New HtmlAgilityPack.HtmlDocument()
            doc.LoadHtml(htmlSource)
            ' 3. Use XPath to locate the elements to extract.
            '    This XPath targets the <a> tag of each post title; adjust it if the page structure differs.
            Dim postNodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='postTitle']/a")
            If postNodes IsNot Nothing Then
                For Each node As HtmlNode In postNodes
                    ' Get the <a> tag's text (title) and href attribute (link)
                    Dim title As String = node.InnerText.Trim()
                    Dim link As String = node.GetAttributeValue("href", "")
                    lstPosts.Add(String.Format("Title: {0}", title))
                    lstPosts.Add(String.Format("Link: {0}", link))
                    lstPosts.Add("------------------")
                Next
                ' Show the results
                txtPosts.Text = String.Join(Environment.NewLine, lstPosts)
            Else
                txtPosts.Text = "No posts found; the page structure may have changed or parsing failed."
            End If
        Catch ex As Exception
            MessageBox.Show("HAP parsing failed: " & ex.Message, "Error", MessageBoxButtons.OK, MessageBoxIcon.Error)
        End Try
    End Sub
End Class
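
An XPath expression can be tried out without touching the network by loading a hard-coded fragment. The sketch below assumes the HtmlAgilityPack.dll reference has been added; the module name `HapOfflineSketch` and the sample markup (which deliberately leaves a <p> and a <div> unclosed) are hypothetical. The HtmlDocument type is fully qualified so that it is not confused with the class of the same name in System.Windows.Forms.

```vbnet
Imports System
Imports HtmlAgilityPack

Module HapOfflineSketch
    Sub Main()
        ' Hypothetical fragment with unclosed tags, for illustration only
        Dim html As String =
            "<div class='postTitle'><a href='/p/1.html'>First &amp; foremost</a></div>" &
            "<div class='postTitle'><a href='/p/2.html'>Second post</a><p>unclosed"

        Dim doc As New HtmlAgilityPack.HtmlDocument()
        doc.LoadHtml(html)

        ' SelectNodes returns Nothing when no node matches the XPath
        Dim nodes As HtmlNodeCollection = doc.DocumentNode.SelectNodes("//div[@class='postTitle']/a")
        If nodes Is Nothing Then
            Console.WriteLine("No matching nodes.")
            Return
        End If

        For Each node As HtmlNode In nodes
            ' DeEntitize turns entities such as &amp; back into plain characters
            Dim title As String = HtmlEntity.DeEntitize(node.InnerText).Trim()
            Dim link As String = node.GetAttributeValue("href", "")
            Console.WriteLine("{0} -> {1}", title, link)
        Next
    End Sub
End Module
```

Saving a copy of the real page and loading it the same way is a quick way to adjust the XPath when the site layout changes.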

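Summary and recommendations

Method	Advantages	Disadvantages	Best suited for
`WebClient`	Simple to use, very little code.	Limited control over the request; when paired with regex or string parsing, the extraction is sensitive to changes in page structure.	Fetching simple, static pages; quick prototyping.
`HttpWebRequest`	Powerful; requests can be customized to mimic a browser.	More code; streams and encodings must be handled manually.	Complex sites that require login, cookies, or specific request headers.
`Html Agility Pack`	Most recommended; strong, robust parsing with XPath element selection, less affected by minor page changes.	Requires downloading the DLL and adding a reference.	Any scenario that involves parsing HTML, especially long-lived projects.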

Recommendations:

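For beginners or simple tasks: start with WebClient to get something working quickly.
For any real project that needs to parse HTML: use Html Agility Pack from the start. It saves a great deal of debugging and maintenance time and makes your program more robust. Adding one reference is a small price for those benefits.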
