Screen Scraping in ASP.NET...
Screen Scraping is the process of reading, and optionally displaying, the content of a web site from code. It can be a useful technique and is quite easy to do in .NET.
Screen scraping is a lot easier in .NET than it was in classic ASP, where you had to use the infamous INET.DLL. The System.Net namespace provides all the classes we need to perform screen scraping in .NET. I will show you the code without a great deal of explanation. In this first example program we are going to screen scrape www.dotnetjohn.com (the default page). When we scrape the page and place the result in a textbox, the raw html is displayed, because the TextBox control HTML-encodes its contents. When the result is placed in a label, the html is actually rendered just as it is when you visit the site, because the Label control writes its Text to the page as-is and the browser interprets it as markup. Having the rendered page show up in the label was something that I discovered quite by accident. Going in I didn't know that would be the case.
The .aspx page is as follows:
    <%@ Page Language="vb" AutoEventWireup="false" Codebehind="ScreenScraper.aspx.vb" Inherits="DotNetJohn.ScreenScraper"%>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <HTML>
      <HEAD>
        <title>ScreenScraper</title>
        <meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
        <meta name="CODE_LANGUAGE" content="Visual Basic .NET 7.1">
        <meta name=vs_defaultClientScript content="JavaScript">
        <meta name=vs_targetSchema content="http://schemas.microsoft.com/intellisense/ie5">
      </HEAD>
      <body MS_POSITIONING="GridLayout">
        <h3>Screen Scraping in .NET</h3>
        <p></p>
        <form id="Form1" method="post" runat="server">
          <asp:Button ID="btnSubmit" Runat="server" Text="Scrape DotNetJohn" OnClick="btnSubmit_Click" />
          <br>
          <asp:TextBox ID="txtResponse" Runat="server" Width="760" Height="360" TextMode="MultiLine" />
          <br>
          <asp:Label ID="lblResponse" Runat="server" />
        </form>
      </body>
    </HTML>
The .vb program is listed below.
    Public Class ScreenScraper
        Inherits System.Web.UI.Page

        ' < Designer Code Omitted >

        Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
        End Sub

        ' Wired to the button by OnClick="btnSubmit_Click" in the .aspx page.
        ' A Handles clause is left off so the handler does not run twice.
        Public Sub btnSubmit_Click(ByVal sender As System.Object, ByVal e As System.EventArgs)
            Dim strURL As String = "http://www.dotnetjohn.com"
            Dim objWebRequest As System.Net.HttpWebRequest
            Dim objWebResponse As System.Net.HttpWebResponse
            Dim streamReader As System.IO.StreamReader
            Dim strHTML As String

            ' Build the request and read the whole response body into a string.
            objWebRequest = CType(System.Net.WebRequest.Create(strURL), System.Net.HttpWebRequest)
            objWebRequest.Method = "GET"
            objWebResponse = CType(objWebRequest.GetResponse(), System.Net.HttpWebResponse)
            streamReader = New System.IO.StreamReader(objWebResponse.GetResponseStream)
            strHTML = streamReader.ReadToEnd

            ' The textbox shows the raw html; the label renders it.
            txtResponse.Text = strHTML
            lblResponse.Text = strHTML

            ' Release the connection.
            streamReader.Close()
            objWebResponse.Close()
            objWebRequest.Abort()
        End Sub
    End Class
As you can see above we have created webrequest and webresponse objects, a streamreader, and a string variable to hold the results of the screen scrape. We then set the text properties of the textbox and the label to the string variable. You may run this example program here.
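The request, response, and reader steps can also be pulled into one reusable function. The following is only a sketch of that idea: the module name ScrapeHelper and the function name FetchHtml are made up for illustration and are not part of the downloadable code. It uses the same System.Net and System.IO calls as the listing above, and adds a Try/Finally so the reader and response are closed even if the download fails partway through.

```vb
Imports System.IO
Imports System.Net

Public Module ScrapeHelper

    ' Hypothetical helper (not from the article's code):
    ' downloads the page at strURL and returns its html as a string.
    Public Function FetchHtml(ByVal strURL As String) As String
        Dim objWebRequest As HttpWebRequest = CType(WebRequest.Create(strURL), HttpWebRequest)
        objWebRequest.Method = "GET"

        Dim objWebResponse As HttpWebResponse = Nothing
        Dim objReader As StreamReader = Nothing
        Try
            objWebResponse = CType(objWebRequest.GetResponse(), HttpWebResponse)
            objReader = New StreamReader(objWebResponse.GetResponseStream())
            Return objReader.ReadToEnd()
        Finally
            ' Runs whether or not the download succeeded.
            If Not objReader Is Nothing Then objReader.Close()
            If Not objWebResponse Is Nothing Then objWebResponse.Close()
        End Try
    End Function

End Module
```

With a helper like this, the click handler would shrink to something like txtResponse.Text = FetchHtml("http://www.dotnetjohn.com"). Any error from the request (a bad URL, a timeout, a 404) still surfaces as an exception, so the caller decides how to report it.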
You now have a functioning screen scraping program. It is cool and all that, but does it really do anything useful? The answer is, probably not. It might be useful in a teaching environment where you can show the raw html and the rendered page on the same screen, but I can't think of much other use for it.
It would be useful, however, if we needed to display some piece of information contained in the web page being scraped. The example we will explore is stock price information available from yahoo.com. If you go to Yahoo and request a stock quote for Microsoft, the resulting page will have a URL of "http://finance.yahoo.com/q?s=msft". Try the link and you will see what I mean.
Notice that there is the text "Last Trade:" followed by the stock price in large, bolded text. That will be our key to finding and displaying stock price information from the screen scrape. We will have to do some parsing of the string variable (strHTML) holding the html returned from the scrape. Let's see how this can be done.
The following is the .aspx file. All it does is create a table containing some text and a label to hold the stock price value.
    <%@ Page Language="vb" AutoEventWireup="false" Codebehind="MSStockPrice.aspx.vb" Inherits="DotNetJohn.MSStockPrice"%>
    <!DOCTYPE HTML PUBLIC "-//W3C//DTD HTML 4.0 Transitional//EN">
    <html>
      <head>
        <title>MSStockPrice</title>
        <meta name="GENERATOR" content="Microsoft Visual Studio .NET 7.1">
        <meta name="CODE_LANGUAGE" content="Visual Basic .NET 7.1">
        <meta name=vs_defaultClientScript content="JavaScript">
        <meta name=vs_targetSchema content="http://schemas.microsoft.com/intellisense/ie5">
      </head>
      <body MS_POSITIONING="GridLayout">
        <h3>Microsoft Stock Price</h3>
        <p></p>
        <form id="Form1" method="post" runat="server">
          <table border="0" cellpadding="4" cellspacing="4">
            <tr>
              <td align="left">Current Price:</td>
              <td align="right"><asp:Label ID="lblPrice" Runat="server" /></td>
            </tr>
          </table>
        </form>
      </body>
    </html>
The codebehind is as follows. It is mostly the same as the first program, with the addition of the string parsing to find the stock price within the html results of the scrape.
    Public Class MSStockPrice
        Inherits System.Web.UI.Page

        ' < Designer Code Omitted >

        Private Sub Page_Load(ByVal sender As System.Object, ByVal e As System.EventArgs) Handles MyBase.Load
            Dim strURL As String = "http://finance.yahoo.com/q?s=msft"
            Dim objWebRequest As System.Net.HttpWebRequest
            Dim objWebResponse As System.Net.HttpWebResponse
            Dim streamReader As System.IO.StreamReader
            Dim strHTML As String

            ' Same request/response steps as in the first program.
            objWebRequest = CType(System.Net.WebRequest.Create(strURL), System.Net.HttpWebRequest)
            objWebRequest.Method = "GET"
            objWebResponse = CType(objWebRequest.GetResponse(), System.Net.HttpWebResponse)
            streamReader = New System.IO.StreamReader(objWebResponse.GetResponseStream)
            strHTML = streamReader.ReadToEnd

            ' Locate the label text, then the <b>...</b> pair that wraps the price.
            Dim intPos1, intPos2, intPos3 As Integer
            intPos1 = strHTML.IndexOf("Last Trade:", 0)
            intPos2 = strHTML.IndexOf("<b>", intPos1)
            intPos3 = strHTML.IndexOf("</b>", intPos2)

            ' Substring takes a start index and a length. The price starts
            ' 3 characters past intPos2 (the length of "<b>") and stops at intPos3.
            lblPrice.Text = strHTML.Substring(intPos2 + 3, intPos3 - intPos2 - 3)

            streamReader.Close()
            objWebResponse.Close()
            objWebRequest.Abort()
        End Sub
    End Class
If we do a View Source on the Yahoo page and do a find on "Last Trade:" we will see the following bit of html:
Last Trade:</td><td class="yfnc_tabledata1"><big><b>28.93</b></big>
The "28.93" is what the value is at this writing. If you do a View Source you will probably see a different price. But that is the whole point of this example - to find that price no matter what it is at the time.
Near the bottom of the code-behind above we do the string parsing to find the price. We first find the position of "Last Trade:". We then search from that position for the opening <b> tag that occurs just before the price, and from there for the closing </b> tag that comes after it. Finally, we use Substring to pull out the price, which starts just past the three characters of the <b> tag at intPos2 and ends just before intPos3.
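The same three-step search can be written as a small standalone function, which makes it easier to try against a sample string. This is a sketch only: PriceParser and ExtractAfterMarker are hypothetical names, not part of the downloadable code. It also assumes you would rather get Nothing back than an exception when the page layout changes and one of the three pieces can no longer be found (IndexOf returns -1 in that case).

```vb
Public Module PriceParser

    ' Hypothetical helper: returns the text between strOpen and strClose,
    ' searching only from the first occurrence of strMarker onward.
    ' Returns Nothing when any of the three pieces is missing.
    Public Function ExtractAfterMarker(ByVal strHTML As String, ByVal strMarker As String, _
                                       ByVal strOpen As String, ByVal strClose As String) As String
        Dim intMarker As Integer = strHTML.IndexOf(strMarker)
        If intMarker < 0 Then Return Nothing

        Dim intOpen As Integer = strHTML.IndexOf(strOpen, intMarker)
        If intOpen < 0 Then Return Nothing

        ' First character of the value, just past the opening tag.
        Dim intStart As Integer = intOpen + strOpen.Length

        Dim intClose As Integer = strHTML.IndexOf(strClose, intStart)
        If intClose < 0 Then Return Nothing

        ' Substring takes a start index and a length, not an end index.
        Return strHTML.Substring(intStart, intClose - intStart)
    End Function

End Module
```

Called with the html fragment shown earlier, ExtractAfterMarker(strHTML, "Last Trade:", "<b>", "</b>") returns "28.93". Note that the <big> tag in that fragment does not confuse the search, since "<b>" only matches when the closing bracket follows the b directly.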
I hope you have learned something useful from this article. I've had to do this in the past with ASP and .NET sure makes it a lot easier.
You may run the first example program here.
You may run the Microsoft stock price example here.
You may download the code here.