Href and Src Validation of HTML Documents - Link Validation

Hi guys, have you ever had a boring job like checking whether the hyperlinks and source tags of an HTML document actually work? I guess some of us have done this before without thinking much about it. But what if you were asked to validate the hrefs and src tags of a few thousand HTML files?

Would you love it? I am sure the answer is no. Who's going to love this kind of boring work, done over and over again?

Yeah, you are thinking right. I ran into this situation a few days ago, when I had to validate around 1,500 HTML files. I hated the job, but what to do? I had to do it anyhow, because I was responsible for it.

I started doing it manually for 4-5 files and then got bored and really frustrated. Then I thought of some kind of automation and built it in ASP.NET with C#, which helped me verify all 1,500 HTML files in a single day.

So, in this article I will show you how to do it, and I will also provide the tool to download for direct use.



So, let's start.

Why do you need a Link Validation Tool?


This need comes up very often if you work with a campaign tool like Adobe Campaign, Fidelity, IBM Campaign or any other campaign tool where you handle a lot of creatives (HTML files). Here it becomes very important to validate each and every link in those HTML files; otherwise, when the emails go out, the wrong or broken links will take your target users nowhere, which hurts your revenue.

This link validation tool processes 80+ HTML files at once, extracts all the links, from both href and src attributes, and gives you a well-defined report with the individual status of each link, so that you know which links work and which do not.

Here is what it looks like.

Prerequisites for Link Validation Tool


I made this one in ASP.NET, so it is a web application. To run it, you can host it in your local IIS or just run it from Visual Studio if you have it.

If you do not have Visual Studio, then download Microsoft WebMatrix, which is a free IDE provided by Microsoft. I made this tool in WebMatrix.

How to create Link Validation Tool


As I made this tool in WebMatrix, this article explains how to do it in WebMatrix, but if you are a .NET guy you can easily replicate it in Visual Studio.

OK, let's get into it.

Step : 1


First download WebMatrix from here. I have uploaded it to my Google Drive so you do not need to search for it on Google. Just get it if you do not have Visual Studio.


Step : 2


Open WebMatrix and create a new site. Add an aspx page there. As WebMatrix does not give us an aspx.cs code-behind file for the C# code, we have to write the code inside a script tag with runat="server".



Name the page index.aspx or whatever you want to call it.

Now import the namespaces below into the page; they will be used while building the tool.

<%@ Import Namespace="System.Text.RegularExpressions" %>
<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Linq" %>
<%@ Import Namespace="System.Text" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Configuration" %>
<%@ Import Namespace="System.Threading.Tasks" %>

To make the web page look better I am using Bootstrap, and I have put all the custom styles in an external CSS file.

Here are the links to the Bootstrap CDN and the linkvalidation.css file.

<link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
<link rel="stylesheet" href="/linkvalidation.css">

Step : 3


Let me first simplify the concept of what I am going to do.

I put a batch of up to 80 HTML files into a particular folder. The tool parses those files, finds all the links and requests each one. If the response is 200 (or a 301/302 redirect) it displays Found, otherwise Not Found. Not Found can come from several different responses: it may be 404 Not Found, 500 Internal Server Error, etc.

So my Page_Load event looks something like this.

  StringBuilder sbContent = new StringBuilder();
 
  protected void Page_Load(object sender, EventArgs e)
  {
      string folderLocation = ConfigurationManager.AppSettings["folderLocation"];
      var filePaths = Directory.GetFiles(folderLocation).Where(s => s.EndsWith(".htm") || s.EndsWith(".html"));
 
      if(filePaths.Count() > 0)
      {                    
         StringBuilder sbPBlocks = new StringBuilder();
 
         foreach (var file in filePaths) {                                             
 
             string html;
             // WebClient also accepts a local file path; dispose it once the content is read.
             using (WebClient client = new WebClient())
             {
                 byte[] buffer = client.DownloadData(file);
                 html = Encoding.UTF8.GetString(buffer);
             }
 
             List<string> list = Extract(html);
 
             sbContent.Append("<br><strong class='p-block'> P-Block: <span class='c-red'>" + Path.GetFileNameWithoutExtension(file.ToString()) + "</span></strong> ");
             sbContent.Append("<strong class='p-block-count'>Total No of Links Verified: <span class='c-red badge'>"+list.Count()+"</span></strong>");
             sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
             sbPBlocks.Append(Path.GetFileNameWithoutExtension(file.ToString()) + "<br>");
 
             // Validate the links of this file and append the result table to sbContent.
             CheckURLInParallel(list);
         }
 
         sbContent.Append("<br><br>");
         sbContent.Append("<br><strong class='p-block'>Total P-Blocks Parsed: <span class='c-red badge'>"+filePaths.Count()+"</span></strong>");                                             
         sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
         sbContent.Append(sbPBlocks.ToString());
         sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
         sbContent.Append("<br>");
         if (ConfigurationManager.AppSettings["deleteFilesAfterparsing"] == "true")
         {
             // Deletes every file in the folder, not only the HTML files.
             Array.ForEach(Directory.GetFiles(folderLocation), File.Delete);
         }
     }
     else
     {
         sbContent.Append("<br/><br/><p class='text-center'><strong class='c-red'>No HTML files in given directory</strong> <br/>"+folderLocation+"</p>");
     }
 
     divContent.InnerHtml = sbContent.ToString();                            
  }

Here I am getting the folder location from the web.config file; this is the folder where all my HTML files are stored. And yes, I am picking up only the files ending with the extension .html or .htm, so if you have .txt files, images or anything else at that location, the tool will not parse them.

I declare a StringBuilder at page level to hold the tables we create during parsing, and at the end I write its content into the HTML body.

So I am going through each HTML file in a foreach loop and reading its entire content.

WebClient client = new WebClient();
byte[] buffer = client.DownloadData(file.ToString());

WebClient downloads the whole content of the current HTML file and holds it in a byte array. Now we need to parse that HTML content to find all the links it contains.
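
Since the files sit on the local disk, the same step can also be done without WebClient. Below is a minimal sketch, not the code the tool uses: FindHtmlFiles is a hypothetical helper name, and the case-insensitive extension check (so that a file like B.HTM is also picked up) is my assumption about what you would want. It builds a throwaway folder so it can run on its own.

```csharp
using System;
using System.Collections.Generic;
using System.IO;
using System.Linq;
using System.Text;

static class ReadHtmlFilesDemo
{
    // Hypothetical helper: returns the .htm/.html files in a folder, ignoring case.
    public static List<string> FindHtmlFiles(string folder)
    {
        return Directory.GetFiles(folder)
            .Where(p => p.EndsWith(".htm", StringComparison.OrdinalIgnoreCase)
                     || p.EndsWith(".html", StringComparison.OrdinalIgnoreCase))
            .OrderBy(p => p, StringComparer.OrdinalIgnoreCase)
            .ToList();
    }

    static void Main()
    {
        // Build a throwaway folder so the sketch runs anywhere.
        string folder = Path.Combine(Path.GetTempPath(), Path.GetRandomFileName());
        Directory.CreateDirectory(folder);
        File.WriteAllText(Path.Combine(folder, "a.html"), "<a href='https://example.com'>x</a>");
        File.WriteAllText(Path.Combine(folder, "B.HTM"), "<img src='logo.png'>");
        File.WriteAllText(Path.Combine(folder, "notes.txt"), "ignored");

        foreach (string file in FindHtmlFiles(folder))
        {
            // File.ReadAllText reads a local file directly; no WebClient is involved.
            string html = File.ReadAllText(file, Encoding.UTF8);
            Console.WriteLine(Path.GetFileName(file) + ": " + html.Length + " characters");
        }

        Directory.Delete(folder, true);
    }
}
```

The rest of the tool does not care how the html string was obtained, so either way of reading the file feeds the Extract step in the same way.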

Step : 4


The Extract method uses a regular expression that picks out all the links, whether they are in an href or a src attribute, and puts those into a list.

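public List<string> Extract(string html)
{
    List<string> list = new List<string>();

    // Capture the value that follows href= or src=, with or without quotes.
    // Inside a character class '|' is a literal character, not "or", so it is left out;
    // otherwise a URL containing '|' would be cut short.
    Regex regex = new Regex("(?:href|src)=[\"']?(.*?)[\"'>]+", RegexOptions.Singleline | RegexOptions.CultureInvariant);

    // Matches() returns an empty collection when nothing matches, so no IsMatch check is needed.
    foreach (Match match in regex.Matches(html))
    {
        list.Add(match.Groups[1].Value);
    }

    return list;
}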
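
If you want something a bit stricter, here is a small stand-alone sketch of an alternative pattern. This is not the pattern the tool uses; the pattern itself, the ExtractDemo class name and the sample HTML are all my assumptions for illustration. It handles double-quoted, single-quoted and unquoted values separately, and tolerates spaces around the equals sign and upper-case attribute names.

```csharp
using System;
using System.Collections.Generic;
using System.Text.RegularExpressions;

static class ExtractDemo
{
    // Assumed alternative pattern (not the one used by the tool):
    // group 1 = double-quoted value, group 2 = single-quoted value, group 3 = unquoted value.
    static readonly Regex LinkPattern = new Regex(
        @"\b(?:href|src)\s*=\s*(?:""([^""]*)""|'([^']*)'|([^\s>]+))",
        RegexOptions.IgnoreCase | RegexOptions.CultureInvariant);

    public static List<string> Extract(string html)
    {
        var list = new List<string>();
        foreach (Match m in LinkPattern.Matches(html))
        {
            // Exactly one of the three groups participates in each match.
            string value = m.Groups[1].Success ? m.Groups[1].Value
                         : m.Groups[2].Success ? m.Groups[2].Value
                         : m.Groups[3].Value;
            list.Add(value);
        }
        return list;
    }

    static void Main()
    {
        string html = "<a href=\"https://example.com/a?x=1|2\">A</a>"
                    + "<img src='images/test.png'>"
                    + "<a HREF = mailto:someone@example.com>Mail</a>";

        foreach (string link in Extract(html))
            Console.WriteLine(link);
        // Prints:
        // https://example.com/a?x=1|2
        // images/test.png
        // mailto:someone@example.com
    }
}
```

Keep in mind that any regular expression is only an approximation of HTML parsing; for email creatives with simple markup it is usually good enough.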

Step : 5


Now we need to create the table that lists the links along with their status.

private void CheckURLInParallel(List<string> listUrls)
{
    var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
    StringBuilder sb = new StringBuilder();
    sb.Append("<table class='customers'><tr><th style='width:80%;'>URL</th><th style='width:20%;'>Status</th></tr>");

    // StringBuilder is not thread-safe, so appends from the parallel loop are serialized with a lock.
    object sync = new object();

    Parallel.ForEach(listUrls, options, x =>
    {
        // Declared inside the lambda so that the threads do not share these variables.
        string linkClass;   // CSS class of the link
        string statusCode;  // markup for the Status column

        Uri uriResult;
        bool isValidURL = Uri.TryCreate(x, UriKind.Absolute, out uriResult)
            && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps);

        if (!isValidURL)
        {
            // Relative paths, mailto: links and the like are reported without being requested.
            linkClass = "c-red";
            statusCode = "<strong class='not-found'>Not Valid </strong> <br/> ";
        }
        else
        {
            string result = GetStatusCode(x);

            if (result == "Not Found")
            {
                linkClass = "c-red";
                statusCode = "<strong class='not-found'>Not Found</strong>";
            }
            else if (x.Contains("trackingid="))
            {
                // Working link that carries a tracking id: shown in orange.
                linkClass = "c-orange";
                statusCode = "<strong class='c-orange'>" + result + "</strong>";
            }
            else
            {
                linkClass = "";
                statusCode = "<strong class='c-green'>" + result + "</strong>";
            }
        }

        // Encode the URL and quote the href so that characters like '&' or spaces cannot break the markup.
        string url = WebUtility.HtmlEncode(x);
        string row = "<tr><td style='width:80%;'><a class='" + linkClass + "' target='_blank' href='" + url + "'>" + url + "</a></td>"
                   + "<td style='width:20%;'>" + statusCode + "</td></tr>";

        lock (sync)
        {
            sb.Append(row);
        }
    });

    sb.Append("</table>");
    sbContent.Append(sb.ToString());
}

Here I am creating a table structure and appending its HTML to the page-level StringBuilder.

While building the table we need to check the status of each link to see whether it is valid or not.



I first check that each entry is a valid absolute http or https URL. Sometimes your page will have relative paths like "images/test.png", which are not absolute URLs, so requesting them would only lead to exceptions. Your page may also contain email addresses; those are excluded too, as they are not web URLs. Both kinds are reported as Not Valid.

Parallel.ForEach is used for multithreading. It makes the program faster by validating up to 10 links at once, as I set MaxDegreeOfParallelism to 10.
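
One side effect of running the checks in parallel is that the links finish in no particular order, so the rows of the report may not follow the order of the links in the HTML file. If that matters to you, here is a minimal sketch of one way to keep the order. It is an assumption on my side, not what the tool does: CheckAll is a hypothetical helper, and the lambda passed to it is only a stand-in for GetStatusCode so that the sketch runs without network access.

```csharp
using System;
using System.Collections.Generic;
using System.Threading.Tasks;

static class OrderedParallelCheck
{
    // Hypothetical helper: runs `check` for every URL, at most 10 at a time,
    // and returns the results in the same order as the input list.
    public static string[] CheckAll(IList<string> urls, Func<string, string> check)
    {
        var results = new string[urls.Count];
        var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };

        // Each iteration writes only to its own slot, so no lock is needed.
        Parallel.For(0, urls.Count, options, i =>
        {
            results[i] = check(urls[i]);
        });

        return results;
    }

    static void Main()
    {
        var urls = new List<string> { "http://a.example/", "images/test.png", "https://b.example/" };

        // Stand-in for GetStatusCode so the sketch runs without network access.
        string[] statuses = CheckAll(urls, u => u.StartsWith("http") ? "Found" : "Not Valid");

        for (int i = 0; i < urls.Count; i++)
            Console.WriteLine(urls[i] + " -> " + statuses[i]);
    }
}
```

With the results sitting in an array, the table rows can then be built in a plain for loop after the parallel part has finished.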

Step : 6


GetStatusCode is the method that takes a URL, requests it over HTTP and maps the response status to Found or Not Found.

private string GetStatusCode(string url)
{
    try
    {
        // Allow the TLS versions; SSL 3.0 is obsolete and insecure, so it is left out.
        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls;

        var request = (HttpWebRequest)WebRequest.Create(url);
        request.Method = "GET";
        // Accept gzip-compressed responses; the page returned with a 404 can be quite big.
        request.AutomaticDecompression = DecompressionMethods.GZip;
        // Do not follow redirects; a 301/302 answer is itself treated as Found below.
        request.AllowAutoRedirect = false;
        request.Timeout = 30000; // milliseconds
        request.KeepAlive = false;

        using (var response = (HttpWebResponse)request.GetResponse())
        {
            if (response.StatusCode == HttpStatusCode.OK ||
                response.StatusCode == HttpStatusCode.Redirect ||
                response.StatusCode == HttpStatusCode.MovedPermanently)
                return "Found";
            else
                return "Not Found";
        }
    }
    catch (Exception)
    {
        // 4xx/5xx responses, timeouts and DNS failures all surface as exceptions here.
        return "Not Found";
    }
}

Here you might run into a situation where requesting a secure (https) URL fails because the server only accepts certain protocol versions. To solve this I set the security protocol to allow TLS 1.2, TLS 1.1 and TLS 1.0. SSL 3.0 is left out because it is obsolete and insecure. For this I have used the line below.

ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls;

I also set some properties on the HttpWebRequest, such as AutomaticDecompression. With it the request tells the server that a gzip-compressed response is acceptable, and the response is decompressed automatically. The page that comes back with a 404 error can be large, so compression saves download time.

Apart from that, I guess everything is self-explanatory, including how I am classifying the responses by their HttpStatusCode.
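
As Not Found can stand for many different responses, it can help to keep the numeric status code in the report. Here is a minimal sketch under that assumption; Describe and DescribeFailure are hypothetical helper names, not part of the tool. It relies on the fact that HttpWebRequest.GetResponse throws a WebException for 4xx and 5xx responses, with the server's response attached to the exception when there was one.

```csharp
using System;
using System.Net;

static class StatusDemo
{
    // Hypothetical helper: same Found / Not Found rule as GetStatusCode,
    // but the numeric code is kept so the report can tell a 404 from a 500.
    public static string Describe(HttpStatusCode code)
    {
        bool found = code == HttpStatusCode.OK
                  || code == HttpStatusCode.Redirect
                  || code == HttpStatusCode.MovedPermanently;
        return (found ? "Found" : "Not Found") + " (" + (int)code + ")";
    }

    // Hypothetical helper: when the server did answer, its response is attached
    // to the WebException; otherwise only the failure kind is known.
    public static string DescribeFailure(WebException ex)
    {
        var response = ex.Response as HttpWebResponse;
        return response != null
            ? Describe(response.StatusCode)
            : "Not Found (" + ex.Status + ")";   // e.g. Timeout, NameResolutionFailure
    }

    static void Main()
    {
        Console.WriteLine(Describe(HttpStatusCode.OK));                  // Found (200)
        Console.WriteLine(Describe(HttpStatusCode.NotFound));            // Not Found (404)
        Console.WriteLine(Describe(HttpStatusCode.InternalServerError)); // Not Found (500)
    }
}
```

To use it, you would catch WebException separately in GetStatusCode and return DescribeFailure(ex) instead of the fixed "Not Found" string. The table code compares against the exact text "Not Found", so that comparison would need to change to a StartsWith check.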

Step : 7


Now the coding part is done, and we need to set some configuration for the program, such as where it will find the HTML files and what to do after parsing.



To hold those settings I have used the appSettings section of Web.config. Below is the appSettings section of my program.

<appSettings>
    <add key="folderLocation" value="C:\Users\tapkumar\Documents\Projects\GMO\CC Retention - P Block Validation\New folder" />
    <add key="deleteFilesAfterparsing" value="false" />
</appSettings>

I made this to work for a specific purpose, so the code might not be fully usable for you as it is, but I am sure you will find it very helpful with a little modification.

The folderLocation key holds the path of the folder where the program will find all the HTML files. The deleteFilesAfterparsing key controls whether the files get deleted automatically after parsing. Note that when it is set to true, every file in that folder is deleted, not just the HTML files.

That's it. We have made a link validation tool that will save you a lot of time by automating the manual effort you would otherwise have to put in.
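
One small thing to watch: the code compares the setting with the exact string "true", so a value like "True" in Web.config would silently count as false. Below is a minimal sketch of a more forgiving check. IsEnabled is a hypothetical helper name, and reading the flag this way is my suggestion, not how the tool does it.

```csharp
using System;

static class SettingsDemo
{
    // Hypothetical helper: treats "true" in any casing (with optional surrounding
    // spaces) as enabled; a missing key (null) or any other text counts as disabled.
    public static bool IsEnabled(string raw)
    {
        bool value;
        return bool.TryParse(raw, out value) && value;
    }

    static void Main()
    {
        Console.WriteLine(IsEnabled("true"));   // True
        Console.WriteLine(IsEnabled("True"));   // True
        Console.WriteLine(IsEnabled(" TRUE ")); // True
        Console.WriteLine(IsEnabled("false"));  // False
        Console.WriteLine(IsEnabled(null));     // False
    }
}
```

In the page you would call it as IsEnabled(ConfigurationManager.AppSettings["deleteFilesAfterparsing"]).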


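Below is the entire source code for this tool. Check it out. Note that this full version also picks up .txt files and pulls plain URLs out of them with a second regular expression.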

<%@ Page Language="C#" %>
<%@ Import Namespace="System.Text.RegularExpressions" %>
<%@ Import Namespace="System.IO" %>
<%@ Import Namespace="System.Linq" %>
<%@ Import Namespace="System.Text" %>
<%@ Import Namespace="System.Net" %>
<%@ Import Namespace="System.Configuration" %>
<%@ Import Namespace="System.Threading.Tasks" %>
 
<!DOCTYPE html>
<html lang="en">
    <head>
        <meta charset="utf-8" />
        <title>Link Validation Tool by Tapan kumar</title>
        <link rel="stylesheet" href="https://maxcdn.bootstrapcdn.com/bootstrap/3.3.7/css/bootstrap.min.css">
        <link rel="stylesheet" href="/linkvalidation.css">
    </head>
    <body>
        <form id="form1" runat="server">
            <div class="container">
                <div class="row">
                    <div class="col-sm-12">
                        <br><br>
                        <h1 class="c-blue">Link Validation <small>This tool checks all the <strong class="c-blue">href</strong> and <strong class="c-blue">src</strong> element in the html document</small></h1>
                        <div class="alert alert-info">
                            <strong class="c-blue">Tool Color Meaning</strong><br><br>
                            <strong class="c-green">Green: </strong> Link is valid.
                            <br><strong class="c-red">Red: </strong> Link not Found, click on the link to double check and note it down.
                            <br><strong class="c-orange">Orange: </strong> Link is valid, and contains a tracking id, if you want then double check those.
                        </div>
                        <div id="divContent" runat="server">
                            <script runat="server">
                               
                                StringBuilder sbContent = new StringBuilder();
 
                                protected void Page_Load(object sender, EventArgs e)
                                {
                                    string folderLocation = ConfigurationManager.AppSettings["folderLocation"];
                                    var filePaths = Directory.GetFiles(folderLocation).Where(s => s.EndsWith(".htm") || s.EndsWith(".html") || s.EndsWith(".txt"));
                                    string ext;
 
                                    if (filePaths.Count() > 0)
                                    {
                                        StringBuilder sbPBlocks = new StringBuilder();
 
                                        foreach (var file in filePaths)
                                        {
                                            ext = Path.GetExtension(file);
 
                                            string html;
                                            // WebClient also accepts a local file path; dispose it once the content is read.
                                            using (WebClient client = new WebClient())
                                            {
                                                byte[] buffer = client.DownloadData(file);
                                                html = Encoding.UTF8.GetString(buffer);
                                            }
 
                                            List<string> list = Extract(html, ext);
 
                                            sbContent.Append("<br><strong class='p-block'> P-Block: <span class='c-red'>" + Path.GetFileNameWithoutExtension(file.ToString()) + "</span></strong> ");
                                            sbContent.Append("<strong class='p-block-count'>Total No of Links Verified: <span class='c-red badge'>" + list.Count() + "</span></strong>");
                                            sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
                                            sbPBlocks.Append(Path.GetFileNameWithoutExtension(file.ToString()) + "<br>");
 
                                            CheckURLInParallel(list);
                                        }
 
                                        sbContent.Append("<br><br>");
                                        sbContent.Append("<br><strong class='p-block'>Total P-Blocks Parsed: <span class='c-red badge'>" + filePaths.Count() + "</span></strong>");
                                        sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
                                        sbContent.Append(sbPBlocks.ToString());
                                        sbContent.Append("<br>******************************************************************************************************************************************************************************************************************<br>");
 
                                        sbContent.Append("<br>");
                                        if (ConfigurationManager.AppSettings["deleteFilesAfterparsing"] == "true")
                                        {
                                            Array.ForEach(Directory.GetFiles(folderLocation), File.Delete);
                                        }
                                    }
                                    else
                                    {
                                        sbContent.Append("<br/><br/><p class='text-center'><strong class='c-red'>No HTML files in given directory</strong> <br/>" + folderLocation + "</p>");
                                    }
 
                                    divContent.InnerHtml = sbContent.ToString();
                                }
        
                                private void CheckURLInParallel(List<string> listUrls)
                                {
                                    var options = new ParallelOptions { MaxDegreeOfParallelism = 10 };
                                    StringBuilder sb = new StringBuilder();
                                    sb.Append("<table class='customers'><tr><th style='width:80%;'>URL</th><th style='width:20%;'>Status</th></tr>");

                                    // StringBuilder is not thread-safe, so appends from the parallel loop are serialized with a lock.
                                    object sync = new object();

                                    Parallel.ForEach(listUrls, options, x =>
                                    {
                                        // Declared inside the lambda so that the threads do not share these variables.
                                        string linkClass;   // CSS class of the link
                                        string statusCode;  // markup for the Status column

                                        Uri uriResult;
                                        bool isValidURL = Uri.TryCreate(x, UriKind.Absolute, out uriResult)
                                            && (uriResult.Scheme == Uri.UriSchemeHttp || uriResult.Scheme == Uri.UriSchemeHttps);

                                        if (!isValidURL)
                                        {
                                            // Relative paths, mailto: links and the like are reported without being requested.
                                            linkClass = "c-red";
                                            statusCode = "<strong class='not-found'>Not Valid </strong> <br/> ";
                                        }
                                        else
                                        {
                                            string result = GetStatusCode(x);

                                            if (result == "Not Found")
                                            {
                                                linkClass = "c-red";
                                                statusCode = "<strong class='not-found'>Not Found</strong>";
                                            }
                                            else if (x.Contains("trackingid="))
                                            {
                                                // Working link that carries a tracking id: shown in orange.
                                                linkClass = "c-orange";
                                                statusCode = "<strong class='c-orange'>" + result + "</strong>";
                                            }
                                            else
                                            {
                                                linkClass = "";
                                                statusCode = "<strong class='c-green'>" + result + "</strong>";
                                            }
                                        }

                                        // Encode the URL and quote the href so that characters like '&' or spaces cannot break the markup.
                                        string url = WebUtility.HtmlEncode(x);
                                        string row = "<tr><td style='width:80%;'><a class='" + linkClass + "' target='_blank' href='" + url + "'>" + url + "</a></td>"
                                                   + "<td style='width:20%;'>" + statusCode + "</td></tr>";

                                        lock (sync)
                                        {
                                            sb.Append(row);
                                        }
                                    });

                                    sb.Append("</table>");
                                    sbContent.Append(sb.ToString());
                                }
 
                                private string GetStatusCode(string url)
                                {
                                    try
                                    {
                                        // Allow the TLS versions; SSL 3.0 is obsolete and insecure, so it is left out.
                                        ServicePointManager.SecurityProtocol = SecurityProtocolType.Tls12 | SecurityProtocolType.Tls11 | SecurityProtocolType.Tls;

                                        var request = (HttpWebRequest)WebRequest.Create(url);
                                        request.Method = "GET";
                                        // Accept gzip-compressed responses; the page returned with a 404 can be quite big.
                                        request.AutomaticDecompression = DecompressionMethods.GZip;
                                        // Do not follow redirects; a 301/302 answer is itself treated as Found below.
                                        request.AllowAutoRedirect = false;
                                        request.Timeout = 30000; // milliseconds
                                        request.KeepAlive = false;

                                        using (var response = (HttpWebResponse)request.GetResponse())
                                        {
                                            if (response.StatusCode == HttpStatusCode.OK ||
                                                response.StatusCode == HttpStatusCode.Redirect ||
                                                response.StatusCode == HttpStatusCode.MovedPermanently)
                                                return "Found";
                                            else
                                                return "Not Found";
                                        }
                                    }
                                    catch (Exception)
                                    {
                                        // 4xx/5xx responses, timeouts and DNS failures all surface as exceptions here.
                                        return "Not Found";
                                    }
                                }
       
                                public List<string> Extract(string html, string ext)
                                {
                                    List<string> list = new List<string>();
 
                                    if(ext == ".txt")
                                    {
                                        foreach (Match item in Regex.Matches(html, @"(http|ftp|https):\/\/([\w\-_]+(?:(?:\.[\w\-_]+)+))([\w\-\.,@?^=%&:/~\+#]*[\w\-\@?^=%&/~\+#])?"))
                                        {
                                            list.Add(item.Value);
                                        }
                                    }
                                    else
                                    {
                                        // Inside a character class '|' is a literal character, not "or", so it is left out.
                                        Regex regex = new Regex("(?:href|src)=[\"']?(.*?)[\"'>]+", RegexOptions.Singleline | RegexOptions.CultureInvariant);
 
                                        if (regex.IsMatch(html))
                                        {
                                            foreach (Match match in regex.Matches(html))
                                            {
                                                list.Add(match.Groups[1].Value);
                                            }
                                        }
                                    }
 
                                    return list;
                                }
    
                            </script>
                        </div>
                        <p class="text-center p-block">Made with  <span class="glyphicon glyphicon-heart c-red"></span> in C# by Tapan kumar</p>
                    </div>
                </div>
 
            </div>
        </form>
    </body>
</html>
