HttpClone

 

A simple website clone, export, and/or publishing utility.

Why did you build this?

This tool was built so that I could remove my blog engine from my production web server. I’ve never really cared for the idea of self-editing websites to begin with. When you add together the security issues, install requirements, and performance problems of some blogs, you start looking for another answer. HttpClone is that answer. I can now run WordPress locally, take a snapshot, and publish it securely using PKI authentication.

What is this for?

This tool is for anyone looking to make a working clone of a website. It can capture output from an existing server, modify or clean up the content, and then republish it via a built-in host or in IIS.

Who is this for?

People looking to use this will need a strong working knowledge of HTTP, HTML/XML, XPath, regular expressions, and probably a little C#. The tool is mostly usable out of the box but requires a lot of configuration.

Why should I care?

If you’re like me and want a clean, secure, and fast site, this tool can get you there. The example site used (http://w3example.wordpress.com) can take anywhere from 1 to 3.5 seconds to load with almost nothing on it. In the example, the optimizations reduce the total number of requests from 24 to 7 and the overall download size from 79k to 9k. This all makes a significant impact on user experience.
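To put those figures in relative terms, here is a quick back-of-the-envelope calculation in Python. Only the four numbers come from the example above; the helper name is made up for illustration.

```python
def percent_reduction(before: float, after: float) -> float:
    """Return how much smaller `after` is than `before`, as a percentage."""
    return (before - after) / before * 100.0

# Figures quoted above for the example site.
requests_saved = percent_reduction(24, 7)  # number of HTTP requests
size_saved = percent_reduction(79, 9)      # download size in kilobytes

print(f"Requests: {requests_saved:.0f}% fewer")
print(f"Download size: {size_saved:.0f}% smaller")
```

That works out to roughly 71% fewer requests and about 89% less data to download.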

Getting started:

The best place to start is to look at the source for example.bat in this directory. It performs a basic walk-through of the capabilities, from the initial capture of a website to clean-up and republication.

The next place to look is the configuration file. Currently example.bat relies on the configuration found at “/src/HttpClone/app.config”. This configuration file has loads of comments that should help, as well as an accompanying XSD file for validation.

Once you get a feel for what it’s doing and how, you can review the detailed command-line reference at “/HttpClone-Help.html”.

If you’ve missed it, there is a great article entitled “Keep it secret, keep it safe” by Eric Lippert. Essentially it attempts to dissect the essence of typical crypto issues in plain English (i.e. crypto for dummies). He did a great job of explaining the difficulties of key management; it is worth a read.

I found it particularly interesting that he brought up this topic, since just days ago I released a “SecureTransfer” class. I’ll get more into the details of that later, but it is relevant here because it is susceptible to the very issue he warns about. To put the problem in simple terms:

The best implementations of cryptography out there are only as secure as their key storage.

This is most certainly true for the SecureTransfer client/server classes. That doesn’t mean that implementing a secure communication channel is easy. Far from it. It just means that *if* you’ve implemented a secure channel, its most obvious attack vector is to crack the key store.

Key storage for HttpClone

This very problem was one of the first things I had to address with HttpClone (which is now serving this website). To publish content from my local machine over HTTP, I need to know the server’s public key, and the server needs to know my local public key. In addition, both client and server have to store their own private keys.

So I asked myself: what kind of assurances do I want regarding key security for HttpClone? It turns out that simply placing the keys in the web server’s /bin directory is probably all that is required. If someone can modify files in my web server’s bin directory, the game is already lost. They can freely change the assemblies, web.config, etc. and serve up any malicious content they want.

In the end I chose to add a level of password protection to the private key file and then store that password elsewhere. Why? Well, you only need to remember back to last year at this time, when Microsoft released this announcement: “Important: ASP.NET Security Vulnerability”. One of the possible gains from this attack was being able to read any file in the web directory (web.config included). Because of this, and the potential of a co-hosted site not running ASP.NET being hacked, I thought adding the extra layer was worthwhile.

For HttpClone’s purposes it’s still not necessary to further protect the server’s private key. This is due in part to the fact that the data being sent (a copy of a public website) is not private and does not need to be secure. In fact, the only reason for involving cryptography at all is for authorization, not privacy.
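To illustrate the difference between authorization and privacy, here is a small Python sketch. It is not how HttpClone does it: HttpClone uses public/private key pairs, whereas this sketch swaps in a shared-secret HMAC from the Python standard library just to keep it short. The names (sign_upload, verify_upload, SHARED_SECRET) are invented for illustration. The point is that the content travels in the clear and only the tag proves who sent it.

```python
import hashlib
import hmac

# Assumption: a shared secret stands in for the key pairs HttpClone really uses.
SHARED_SECRET = b"example-secret-not-for-production"

def sign_upload(content: bytes, secret: bytes) -> bytes:
    """Compute an authentication tag over the content; the content is NOT encrypted."""
    return hmac.new(secret, content, hashlib.sha256).digest()

def verify_upload(content: bytes, tag: bytes, secret: bytes) -> bool:
    """Accept the upload only if the tag matches (constant-time comparison)."""
    expected = hmac.new(secret, content, hashlib.sha256).digest()
    return hmac.compare_digest(expected, tag)

page = b"<html><body>Public blog content</body></html>"
tag = sign_upload(page, SHARED_SECRET)

# A genuine upload is accepted.
accepted = verify_upload(page, tag, SHARED_SECRET)
# Content altered in transit no longer matches the tag.
tampered = verify_upload(page + b"<script>evil()</script>", tag, SHARED_SECRET)
# A sender without the right key cannot produce a valid tag.
forged = verify_upload(page, sign_upload(page, b"wrong-secret"), SHARED_SECRET)

print(accepted, tampered, forged)
```

Anyone watching the wire can read the page, which is fine for a public website. What they cannot do without the key is get the server to accept content of their own.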

 

Yes, this site is still using WordPress; in fact, I’m writing this in WordPress right now. The interesting thing is I’ve completely uninstalled WordPress and MySQL from my production server. I know, crazy huh?

So if I’ve piqued your curiosity you’ll want to stay tuned. Right now I don’t have time for a lot of details, but what I can tell you is this:

  • Search still works
  • RSS still works
  • Postback still works
  • All the WordPress admin goodies still work

 
How? Well, I’m using a new project I started called HttpClone to create a snapshot of the site. From there I can pretty much do anything I want to it, including:

  • Rename it to a different domain
  • Add, edit, or remove content
  • Remove and insert html tags
  • Index the content with Lucene.Net
  • View, modify, validate and track down links
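As a rough idea of what an operation like the domain rename involves, here is a minimal Python sketch. It is not HttpClone’s implementation or configuration; the function name rename_domain and the target host www.example.com are made up for illustration, and it only handles double-quoted href and src attributes.

```python
import re
from urllib.parse import urlsplit, urlunsplit

# Matches double-quoted href/src attributes only (an assumption to keep the sketch short).
_LINK_ATTR = re.compile(
    r'(?P<prefix>\b(?:href|src)\s*=\s*")(?P<url>[^"]*)(?P<suffix>")',
    re.IGNORECASE,
)

def rename_domain(html: str, old_host: str, new_host: str) -> str:
    """Point absolute links at new_host when they currently point at old_host."""

    def fix(match):
        url = match.group("url")
        parts = urlsplit(url)
        # Relative links have no hostname and are left untouched.
        if parts.hostname == old_host:
            # Note: replacing netloc also drops any explicit port or credentials.
            url = urlunsplit(parts._replace(netloc=new_host))
        return match.group("prefix") + url + match.group("suffix")

    return _LINK_ATTR.sub(fix, html)

snapshot = (
    '<a href="http://w3example.wordpress.com/about/">About</a> '
    '<img src="/logo.png"> '
    '<a href="http://other.example.org/">Other</a>'
)
result = rename_domain(snapshot, "w3example.wordpress.com", "www.example.com")
print(result)
```

Links to the old host are rewritten, while relative links and links to other sites pass through unchanged. A real snapshot would also need to cover URLs in stylesheets, scripts, and feeds.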

 
Once I’m happy with the changes being made I run a publish command and presto-changeo it’s live!

The project is definitely still ‘Alpha’ material, but the server side of things should be solid enough for most uses. I’ve actually been running this site on early versions for two weeks now without issue. In the bargain, the site should be around 3x-5x faster than when WordPress was serving the content directly.