Introduction: This article is intended for knowledge sharing. Here I solve a problem I faced in my MVC web application (the implementation can be done with any ASP.NET technology).
Problem Statement: We have a public-facing website. In case you are not familiar with the term, a web crawler is a program or script that scans the web, either to index pages or to extract email addresses from websites for spamming purposes. A crawler will therefore index our pages and our media folders. We can discourage indexing of the media folder by disallowing it in the robots.txt file. This article is solely about restricting a web crawler from indexing the media folder content using C#.
Image source: http://seo-advisors.com/wp-content/uploads/2013/07/Web-Crawler-Route-Map-Chart.jpg
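For comparison, the robots.txt approach mentioned above would look something like the entry below. Since robots.txt is purely advisory and only well-behaved crawlers honor it, a server-side check like the one in this article is still worthwhile:

User-agent: *
Disallow: /Media/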
1.1 Creating a Simple MVC application
If you want to learn ASP.NET MVC, you can start from here.
Figure 1: Creating New Project
Figure 2: Creating New Web Application
Figure 3: Select MVC and click on Ok.
We will create a form that contains just one image, which we want to restrict from web crawlers.
1.2 To create a view, we will first create a Login controller (you can name it as per your requirements)
Figure 4: Creating LoginController
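For reference, a minimal sketch of the controller is shown below. The LoginUser action is assumed here as the post target that the Index view will submit to, and the namespace follows the usual MVC convention:

using System.Web.Mvc;

namespace WebsiteDemo.Controllers
{
    public class LoginController : Controller
    {
        // GET: /Login/
        public ActionResult Index()
        {
            return View();
        }

        // POST target for the login form (left as a stub for this demo)
        [HttpPost]
        [ValidateAntiForgeryToken]
        public ActionResult LoginUser()
        {
            return RedirectToAction("Index");
        }
    }
}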
1.3 Adding Media Folder
We will add a Media folder to hold our media files; I will just paste in one image, which I will reference in the application as a sample.
Figure 5: Media folder
1.4 Adding Login View
Figure 6: Creating Index View
@{
    ViewBag.Title = "Index";
}

@using (Html.BeginForm("LoginUser", "Login"))
{
    @Html.AntiForgeryToken()
}

@Html.ActionLink("Back to List", "Index")

@section Scripts {
    @Scripts.Render("~/bundles/jqueryval")
}
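The view is meant to render the one image we placed in the Media folder; inside the form, that can be as simple as the tag below (the file name is a placeholder for whichever image you pasted):

<img src="~/Media/sample.jpg" alt="Sample image" />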
Now run the application to check the View.
Figure 7: Home Screen
Now let's treat Fiddler as our web crawler. If I try to fetch the image via Fiddler, the way a crawler indexing the image folder would, I issue a GET request to the server as shown below:
Figure 8: Trying to index or download the image file.
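In Fiddler's Composer, the raw request might look like this (host, port, and file name are placeholders); note the User-Agent value, which becomes important when we configure browserCaps later:

GET http://localhost:1234/Media/sample.jpg HTTP/1.1
User-Agent: Fiddler
Host: localhost:1234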
Now we have to protect our media folder from web crawlers: if a crawler tries to index our media folder, we will not allow it and will return a 401 HTTP status code to the crawler.
To solve this problem, I will take the help of an HttpModule in ASP.NET.
An HTTP module is an assembly that is called on every request made to our application. The HTTP module gives us the opportunity to inspect the incoming request and take the necessary action.
3.0 Creating a BasicHttpModule for our application
We first need to register our HttpModule in the web.config file as shown below:
<system.web>
  <httpModules>
    <add name="BasicHttpModule" type="WebsiteDemo.HttpModule.BasicHttpModule" />
    <add name="ScriptModule" type="System.Web.Handlers.ScriptModule, System.Web.Extensions, Version=3.5.0.0, Culture=neutral, PublicKeyToken=31BF3856AD364E35" />
  </httpModules>
</system.web>
<system.webServer>
  <modules>
    <remove name="FormsAuthentication" />
    <add name="BasicHttpModule" type="WebsiteDemo.HttpModule.BasicHttpModule" />
  </modules>
</system.webServer>
Here BasicHttpModule is our custom class name. When ASP.NET creates an instance of our web application, i.e. the HttpApplication class, an instance of every custom HTTP module class registered in web.config is created as well. As soon as this process completes, the Init method is called and the module initializes itself. In the Init method we specify the methods we want to call on BeginRequest or EndRequest by binding those events to our methods. If you want to learn more about HttpModules, see The Two Interceptors: HttpModule and HttpHandlers.
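In sketch form, the wiring inside Init looks like this (the handler names here are illustrative; our real handler, CheckMediaRequest, comes later):

public void Init(HttpApplication context)
{
    // Illustrative: run a handler at the start of every request...
    context.BeginRequest += OnBeginRequest;
    // ...and/or just before the response is sent
    context.EndRequest += OnEndRequest;
}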
Let's get started: create a folder named HttpModule and, inside it, a class called BasicHttpModule.
Figure 9: HTTP module pipeline (Image Source: http://www.codeproject.com/Articles/30907/The-Two-Interceptors-HttpModule-and-HttpHandlers)
Figure 10: Creating BasicHttpModule class
Now we need to implement the IHttpModule interface:
using System;
using System.Web;

namespace WebsiteDemo.HttpModule
{
    public class BasicHttpModule : IHttpModule
    {
        public void Init(HttpApplication context)
        {
            throw new NotImplementedException();
        }

        public void Dispose()
        {
            throw new NotImplementedException();
        }
    }
}
Every request that comes to our web application will now pass through this module; the Init method is where we hook up the method we want to call for each request.
We will create a custom method that checks whether the request's HTTP method is GET and whether the request URI contains the media folder. If both are true, we will not allow access to the resource until the user is authenticated.
To accomplish this, I have added a key to the web.config appSettings section that contains the folder location we want to disallow:
<add key="disAllow" value="/Media"/>
ASP.NET provides the HttpCapabilitiesBase.Crawler property, which gets a value indicating whether the browser is a search-engine web crawler.
In my testing this property always returned false, so to make it work we need to add some configuration settings to web.config, as shown below.
I want to thank Erwin's blog, where I found the section configuration that actually worked for me. We need to add the browserCaps element, which specifies settings for browsers; this element can be updated as per your needs.
<browserCaps>
  <filter>
    <!-- OpenWebSpider -->
    <case match="openwebspider">
      browser=openwebspider
      crawler=true
    </case>
    <!-- Fiddler -->
    <case match="Fiddler">
      browser=Fiddler
      crawler=true
    </case>
    <!-- Google Crawler -->
    <case match="Googlebot">
      browser=Googlebot
      crawler=true
    </case>
  </filter>
</browserCaps>
So when a request comes to the ASP.NET web application, the framework checks the browser (the User-Agent) against these entries; if it matches, the request is flagged as coming from a crawler, which lets us restrict its access.
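As a quick sanity check, the resulting flag can be read anywhere a request is in scope; a minimal sketch:

// True when the incoming User-Agent matched a browserCaps <case>
// entry that sets crawler=true
bool isCrawler = HttpContext.Current.Request.Browser.Crawler;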
So now, in the Init method of our BasicHttpModule class, we will register a method to be called on BeginRequest and apply our logic to restrict the crawler, sending back a 401 Access Denied response.
public void Init(HttpApplication context)
{
    // Register the CheckMediaRequest method on the BeginRequest event
    context.BeginRequest += CheckMediaRequest;
}

private void CheckMediaRequest(object source, EventArgs eventArgs)
{
    var httpApplication = (HttpApplication)source;

    // Get the path we want to restrict from web crawlers
    var restrictPath = ConfigurationManager.AppSettings["disAllow"];

    HttpBrowserCapabilities myBrowserCaps = httpApplication.Request.Browser;

    // Check whether the browser is a crawler, the HTTP verb is GET,
    // and the URL contains the restricted path
    if (((System.Web.Configuration.HttpCapabilitiesBase)myBrowserCaps).Crawler
        && httpApplication.Request.HttpMethod.Equals("GET")
        && httpApplication.Request.Url.AbsoluteUri.Contains(restrictPath))
    {
        DenyAccess(httpApplication);
    }
}

private void DenyAccess(HttpApplication app)
{
    app.Response.StatusCode = 401;
    app.Response.StatusDescription = "Access Denied";
    app.Response.Write("401 Access Denied");
    app.CompleteRequest();
}
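With DenyAccess in place, a crawler requesting anything under /Media should receive a response along these lines (exact headers will vary):

HTTP/1.1 401 Access Denied
Content-Type: text/html

401 Access Denied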
Now it's time to test our solution. I tried several crawler extensions for Google Chrome, but all of them report Chrome as their browser, so instead I will use Fiddler to request the media file from the server. To do that, I issue a GET request as shown in the figure below:
Figure 11: GET request for media
Once I execute the request, the BeginRequest event fires and invokes our CheckMediaRequest method, as shown below:
Figure 12: Browser checking
I have registered Fiddler as a crawler whose access to the media folder I want to restrict, which is why I added the attributes below to the browserCaps section. Every web crawler's browser name must be added to this config section for its access to the media folder to be restricted.
<!-- Fiddler -->
<case match="Fiddler">
  browser=Fiddler
  crawler=true
</case>
Figure 13: Web Crawler check
The ((System.Web.Configuration.HttpCapabilitiesBase) myBrowserCaps).Crawler expression returns true, so we disallow the request and return the 401 status code.
Figure 14: Returning 401 Status code.
Status Result in Fiddler
Acknowledgements
I hope this article proves useful; please share your thoughts on how you would solve this problem in a better way. The sample source code is uploaded to GitHub: https://github.com/SailleshPawar/WebsiteDemo. Stay tuned for upcoming articles.