Screen Scraping in ASP.NET...
Screen scraping is the process of reading, and optionally displaying, the content of a web site. It can be
a useful technique and is quite easy to do in .NET.
By: John Kilgo
Date: October 20, 2003
Download the code.
Screen scraping is a lot easier in .NET than it was in classic ASP, where you had to use the infamous
INET.DLL. The classes in the System.Net namespace, together with System.IO.StreamReader, provide everything
we need to perform screen scraping in .NET.
I will show you the code without a great deal of explanation. In this first example program we are going
to screen scrape www.dotnetjohn.com (the default page). When we scrape the page and place the result in a
textbox, the raw HTML is displayed. When the result is placed in a label, the HTML is actually rendered,
just as it is when you visit the site. That the label would show the rendered page was something I
discovered quite by accident; going in, I didn't know that would be the case.
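The difference comes down to encoding: a multi-line TextBox HTML-encodes its text when it renders, while a Label writes its Text into the page as-is, so the browser treats it as markup. If you would rather have the label show the source as well, one option is to encode the string yourself first. The following is only a minimal sketch; the module and function names (LabelEncodingSketch, ToDisplayableSource) are made up for illustration and are not part of the downloadable code.

```vb
Imports System.Web

Public Module LabelEncodingSketch
    ' Hypothetical helper: returns a copy of html in which characters such as
    ' < and > are replaced by entities, so a Label shows the source text
    ' instead of handing live markup to the browser.
    Public Function ToDisplayableSource(ByVal html As String) As String
        Return HttpUtility.HtmlEncode(html)
    End Function
End Module
```

In the click handler listed below, this would be used as lblResponse.Text = ToDisplayableSource(strHTML). Keep in mind that an unencoded label also passes along any script contained in the scraped page, which is worth considering before rendering pages you do not control.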
The .aspx page is as follows:
<%@ Page Language="vb" AutoEventWireup="false" Codebehind="ScreenScraper.aspx.vb" Inherits="DotNetJohn.ScreenScraper"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<HTML>
<HEAD>
<title>ScreenScraper</title>
<meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
<meta name="CODE_LANGUAGE" content="Visual Basic .NET 7.1">
<meta name=vs_defaultClientScript content="JavaScript">
<meta name=vs_targetSchema content="http://schemas.microsoft.com/intellisense/ie5">
</HEAD>
<body MS_POSITIONING="GridLayout">
<h3>Screen Scraping in .NET</h3>
<p></p>
<form id="Form1" method="post" runat="server">
<asp:Button ID="btnSubmit" Runat="server" Text="Scrape DotNetJohn" OnClick="btnSubmit_Click" />
<br>
<asp:TextBox ID="txtResponse" Runat="server" Width="760" Height="360" TextMode="MultiLine" />
<br>
<asp:Label ID="lblResponse" Runat="server" />
</form>
</body>
</HTML>
The .vb program is listed below.
Public Class ScreenScraper
    Inherits System.Web.UI.Page

    ' < Designer Code Omitted >

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
    End Sub

    ' Wired up by OnClick="btnSubmit_Click" in the .aspx page, so no Handles clause
    ' is needed here (using both would run the handler twice per click).
    Public Sub btnSubmit_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)
        Dim strURL As String = "http://www.dotnetjohn.com"
        Dim objWebRequest As System.Net.HttpWebRequest
        Dim objWebResponse As System.Net.HttpWebResponse
        Dim streamReader As System.IO.StreamReader
        Dim strHTML As String

        ' Request the page and read the whole response body into strHTML.
        objWebRequest = CType(System.Net.WebRequest.Create(strURL), System.Net.HttpWebRequest)
        objWebRequest.Method = "GET"
        objWebResponse = CType(objWebRequest.GetResponse(), System.Net.HttpWebResponse)
        streamReader = New System.IO.StreamReader(objWebResponse.GetResponseStream)
        strHTML = streamReader.ReadToEnd

        ' The textbox shows the raw HTML; the label renders it.
        txtResponse.Text = strHTML
        lblResponse.Text = strHTML

        ' Release the reader and the connection.
        streamReader.Close()
        objWebResponse.Close()
        objWebRequest.Abort()
    End Sub
End Class
As you can see above, we create HttpWebRequest and HttpWebResponse objects, a StreamReader, and a string
variable to hold the results of the screen scrape. We then set the Text properties of the textbox and the
label to that string. You may run this example program here.
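The listing above closes the reader and the response only when every call succeeds; if GetResponse or ReadToEnd throws, the Close calls are skipped. As a rough sketch of one way to tidy that up, the request can be moved into a small reusable function that cleans up in a Finally block. The names ScrapeHelper and FetchHtml are invented here for illustration, and the sketch uses the base WebRequest and WebResponse types rather than the Http-specific ones, which is an assumption on my part and not something the original program does.

```vb
Imports System.IO
Imports System.Net

Public Module ScrapeHelper
    ' Hypothetical helper: downloads the resource at url and returns its
    ' content as a string. Exceptions from the request propagate to the caller.
    Public Function FetchHtml(ByVal url As String) As String
        Dim request As WebRequest = WebRequest.Create(url)
        Dim response As WebResponse = Nothing
        Dim reader As StreamReader = Nothing
        Try
            response = request.GetResponse()
            reader = New StreamReader(response.GetResponseStream())
            Return reader.ReadToEnd()
        Finally
            ' Runs whether or not an exception was thrown above.
            If Not reader Is Nothing Then reader.Close()
            If Not response Is Nothing Then response.Close()
        End Try
    End Function
End Module
```

With a helper like this, the click handler would shrink to a single call such as strHTML = FetchHtml(strURL), followed by the two Text assignments.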
You now have a functioning screen scraping program. It is cool and all that, but does it really do anything
useful? The answer is, probably not. It might be useful in a teaching environment where you can show the
raw html and the rendered page on the same screen, but I can't think of much other use for it.
It would be useful, however, if we needed to display some information that was included in the web page
being scraped. The example we will explore is the stock price information available from yahoo.com.
If you go to Yahoo and request a stock quote for Microsoft, the resulting page showing the
results will have a URL of
"http://finance.yahoo.com/q?s=msft". Try the link and you will see what I mean.
Notice that there is the text "Last Trade:" followed by the stock price in large, bolded text. That will be
our key to finding and displaying stock price information from the screen scrape. We will have to do
some string parsing of the string variable (strHTML) holding the HTML returned from the scrape. Let's see
how this can be done.
The following is the .aspx file. All it does is create a table containing some text and a label to hold the
stock price value.
<%@ Page Language="vb" AutoEventWireup="false" Codebehind="MSStockPrice.aspx.vb" Inherits="DotNetJohn.MSStockPrice"%>
<!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
<html>
<head>
<title>MSStockPrice</title>
<meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
<meta name="CODE_LANGUAGE" content="Visual Basic .NET 7.1">
<meta name=vs_defaultClientScript content="JavaScript">
<meta name=vs_targetSchema content="http://schemas.microsoft.com/intellisense/ie5">
</head>
<body MS_POSITIONING="GridLayout">
<h3>Microsoft Stock Price</h3>
<p></p>
<form id="Form1" method="post" runat="server">
<table border="0" cellpadding="4" cellspacing="4">
<tr>
<td align="left">Current Price:</td>
<td align="right"><asp:Label ID="lblPrice" Runat="server" /></td>
</tr>
</table>
</form>
</body>
</html>
The codebehind is as follows. It is mostly the same as the first program, with the addition of the string
parsing to find the stock price within the html results of the scrape.
Public Class MSStockPrice
    Inherits System.Web.UI.Page

    ' < Designer Code Omitted >

    Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        Dim strURL As String = "http://finance.yahoo.com/q?s=msft"
        Dim objWebRequest As System.Net.HttpWebRequest
        Dim objWebResponse As System.Net.HttpWebResponse
        Dim streamReader As System.IO.StreamReader
        Dim strHTML As String

        ' Request the quote page and read the whole response into strHTML.
        objWebRequest = CType(System.Net.WebRequest.Create(strURL), System.Net.HttpWebRequest)
        objWebRequest.Method = "GET"
        objWebResponse = CType(objWebRequest.GetResponse(), System.Net.HttpWebResponse)
        streamReader = New System.IO.StreamReader(objWebResponse.GetResponseStream)
        strHTML = streamReader.ReadToEnd

        ' Locate "Last Trade:", then the <b>...</b> pair that follows it.
        Dim intPos1, intPos2, intPos3 As Integer
        intPos1 = strHTML.IndexOf("Last Trade:", 0)
        intPos2 = strHTML.IndexOf("<b>", intPos1)
        intPos3 = strHTML.IndexOf("</b>", intPos2)

        ' Substring takes a start index and a length. The price starts 3 characters
        ' past intPos2 (the length of "<b>") and ends just before intPos3.
        lblPrice.Text = strHTML.Substring(intPos2 + 3, intPos3 - intPos2 - 3)

        streamReader.Close()
        objWebResponse.Close()
        objWebRequest.Abort()
    End Sub
End Class
If we do a View Source on the yahoo page and do a find on "Last Trade:" we will see the following bit of
html:
Last Trade:</td><td class="yfnc_tabledata1"><big><b>28.93</b></big>
The "28.93" is what the value is at this writing. If you do a View Source you will probably see a different
price. But that is the whole point of this example - to find that price no matter what it is at the time.
Near the bottom of the code-behind above we do the string parsing to find the price. We first find the
position of "Last Trade:". We then use that position as the start to find the opening <b> tag that occurs
just before the price. We then use that position to find the closing </b> tag that comes after the price.
Finally, we use Substring to pull out the price, which sits between the end of the <b> tag found at intPos2
and the start of the </b> tag found at intPos3.
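The three-step search described above can also be packaged as a small function that guards against the markers not being found, which matters because IndexOf returns -1 in that case and the page layout of a scraped site can change at any time. The following is a sketch only; PriceParser and ExtractAfterMarker are hypothetical names introduced for this illustration.

```vb
Imports System

Public Module PriceParser
    ' Hypothetical helper: returns the text between startTag and endTag that
    ' follows the first occurrence of marker, or Nothing if any of the three
    ' cannot be found.
    Public Function ExtractAfterMarker(ByVal html As String, ByVal marker As String, _
                                       ByVal startTag As String, ByVal endTag As String) As String
        Dim markerPos As Integer = html.IndexOf(marker, StringComparison.Ordinal)
        If markerPos < 0 Then Return Nothing

        Dim startPos As Integer = html.IndexOf(startTag, markerPos, StringComparison.Ordinal)
        If startPos < 0 Then Return Nothing
        startPos += startTag.Length   ' first character after the opening tag

        Dim endPos As Integer = html.IndexOf(endTag, startPos, StringComparison.Ordinal)
        If endPos < 0 Then Return Nothing

        ' Substring takes a start index and a length, not an end index.
        Return html.Substring(startPos, endPos - startPos)
    End Function
End Module
```

Called as ExtractAfterMarker(strHTML, "Last Trade:", "<b>", "</b>") on the fragment shown earlier, this returns "28.93". A result of Nothing tells the page that the expected markup was not there, so it can show a placeholder instead of failing.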
I hope you have learned something useful from this article. I've had to do this in the past with ASP and
.NET sure makes it a lot easier.
You may run the first example program here.
You may run the Microsoft stock price example here.
You may download the code here.