HttpClone

A simple website clone, export, and/or publishing utility.

Why did you build this?

This tool was built so that I can remove my blog engine from my production web server. I’ve never really cared for the idea of self-editing websites to begin with. When you add the security issues, install requirements, and the performance problems of some blogs together you start looking for another answer. HttpClone is that answer. I can now run wordpress locally, take a snapshot, and publish it securely over PKI authentication.

What is this for?

This tool is for anyone looking to make a working clone of a website. It can capture output from an existing server, modify/cleanup the content, and then can republish the content via a build-in host or in IIS.

Who is this for?

People looking to use this will need a strong working knowledge of http, html/xml, xpath, regular expressions, and probably a little C#. The tool is mostly usable out of the box but requires a lot of configuration.

Why should I care?

If your like me and want a clean, secure, and fast site this tool can get you there. The example site used (http://w3example.wordpress.com) can take anywhere from 1 to 3.5 seconds to load with almost nothing on it. In the example, the optimizations reduce the total number of requests from 24 to 7 and reducing the overall download size from 79k to 9k. This all makes a significant impact on user experience.

Getting started:

The best place to start is to look at the source for example.bat in this directory. It performs a basic walk-though of the capabilities. From the initial capturing of a website to clean-up and republication.

The next place to start looking is the configuration file. Currently the example.bat relies on the configuration found at “/src/HttpClone/app.config”. This configuration file has loads of comments that should help as well as an accompanying XSD file for validation.

Once you get a feel for what it’s doing and how you can review the detailed command-line reference at “/HttpClone-Help.html”.

Links

Documentation: http://httpclone.googlecode.com/hg/HttpClone-Help.html
Source Code: http://code.google.com/p/httpclone/source/browse/
Downloads: http://code.google.com/p/httpclone/downloads/list

Recently I ran across an article on the subject of Content Management Systems and their inability to separate content editing from content publishing. The article titled “EditingPublishingSeparation” by Martin Fowler is worth a read.

I completely agree with his assertion that, from an architecture point of view, the editing and publishing of content should be separated. I would however take the assertion much farther than that. Websites should NOT be capable of editing themselves. The mere idea of this is absurd IMHO. I’ve written CMS systems before back in the late 90′s, and even then it was obvious. You can not secure a self-editing website.

Why is a self-editing website a bad idea?

1. The group take down. To say most CMS systems have a vulnerability or two is putting it mildly. Attackers love to take these vulnerabilities and then proceed to use automated software to seek out sites using that CMS and exploit them. This allows them to inexpensively disperse malware to a large audience in a very short period of time. This, IMHO, is the worst thing about running a CMS solution. Nobody specifically targeted your site, it just happened to be running on software they knew how to attack. No provocation needed, you got taken down with 10,000 other unfortunate people.

2. It runs in the browser. The issue here is that some form of logon allows users to modify the content on the web server. This means that the user’s horribly insecure browser environment is entirely in control of ‘production’ content. Thus a simple XSS script, a malicious browser plugin, or other common vulnerability can allow an attacker to modify content. Browsers are the worst place to be editing content. Even with the advent of Windows Live Writer and other rich-client authoring tools you still occasionally need to log into the website. So these tools help, but they do not fix the problem.

3. Preview is not a preview. Most all the of the CMS systems out there will allow you to preview the content before publishing it. Most of them get it wrong. It seems CMS systems are more and more moving to a “wysiwyg” display editing where they modify the output HTML so that you can edit it, even in preview. This then gives you no assurance about how it will actually format and display since the authoring widgets on screen change the HTML being rendered. Furthermore while previewing a single page is possible many CMS systems will not allow you to preview entirely new sections and navigation elements. Lastly previewing an entire redesign of the site’s look-and-feel, navigation structure, etc is also not possible.

4. My web server runs DRY. CMS systems often fail to appropriately cache the rendered HTML. This produces lags in performance as you server must reprocess the same content against the template over and over again. I prefer my sever to run as DRY as possible, Don’t Repeat Yourself. There is just no point in reprocessing the content for every request.

5. User provided content. IMHO, user authored content does not belong on your sever. This is one of the driving factors behind #4 and is simply not necessary. Using Facebook or another discussion server is easy. If you need something more fancy that what is freely available, go build it. Stand up a completely different site on a different domain with a completely different authentication model. Users should never log in to your site.

6. XCopy backup and deployment. Asside from backup and deployment there is also the issue with applying a version control system to most CMS systems. This is one of my biggest pet-peeves with CMS systems. They absolutely love to rely on a database back-end. Although some newer CMS solutions can use embeded sql servers, most do not support it and this is not an option if you are farming the content across several servers. I suspect most CMS sites are not being backed up regularly and if the server is lost or it’s drive corrupted their likely to loose most if not all of their site.

What are my alternatives?

1. Find a better CMS. I’m not aware of a single CMS system in operation today that avoids the issues above. Please correct me in the comments if this is inaccurate, I’d love to know if one exists.

2. Using a CDN (Content Distribution Network). These are often very powerful tools and can be configured to avoid many of the issues mentioned above. If you are looking for one I would consider CloudFlare a viable starting point.

3. HttpClone or similar product. I’m sure there are other solutions that have similar capabilities, but honestly I love using HttpClone. I use WordPress on the back-end and have a deployment script that automates the process end-to-end. Whether I’m publishing the result to a test server or to production it’s relatively easy once you get it working. The hard part was the configuration of the crawler to identify content I wanted removed or changed, and indexing for search. Once that was complete I wrote a simple batch file to do the deployment that looks roughly like:

@ECHO OFF
HttpClone.exe crawlsite http://csharptest.net/index.html
HttpClone.exe copysite http://csharptest.net/index.html http://csharptest.net/index.html /overwrite
HttpClone.exe optimize http://csharptest.net/index.html
HttpClone.exe index http://csharptest.net/index.html
HttpClone.exe addrelated http://csharptest.net/index.html
HttpClone.exe publish http://csharptest.net/index.html
mysqldump.exe -u root -ppassword --create-options --skip-extended-insert --databases csharptest --result-file=csharptest.sql

Basically what this does is crawls my locally running copy of this website (admin.csharptest.net) and captures the results. Then it crawls all the pages and changes references from admin.csharptest.net to csharptest.net overwriting the content that was previously there. Then it performs a series of steps: optimizing the content, creating the search index, and injecting related article links. Finally it packages and publishes all the content to the remote site, and then backs up the database. The entire site is instantly switched to the new content once it is ready. For small edits I can choose to publish the content directly to production, or more often I push to a local site to then verify the content package.

Obviously the most vulnerable part of the process is the code on the server that allows publication. This is why the entire thing requires the client and server to know each-other’s public key. They negotiate a session key, transfer the file, and sign/verify every request and response. This code uses the CSharpTest.Net.Crypto.SecureTransfer class from my library if you are interested in the details.

The benefit to both client and server using a public/private key is that an observer knowing only one of the two keys can learn very little about the content being transferred. It should be obvious that if an attacker obtains the servers private key they can replace the server (assuming some form of DNS poisoning or the like); however, they will not be able to then forward it to the actual server and still be able to read the content. Again it should be obvious that if someone were to obtain my client private key they can publish new or modified content to the server since this is the only form of authentication. I will add that even with my client private key they still can not upload anything that is executable on the server. This leaves my server secure and in-tact and all that is needed for me to recover is replacing the client key and republishing the content.

I wish the guys at WordPress or another CMS would just do this out of the box.

If you’ve missed it there is great article entitled Keep it secret, keep it safe by Eric Lippert. Essentially it attempts to dissect the essence of typical crypto issues in plain English (i.e. crypto for dummies). He did a great job of explaining the difficulties in key management, worth a read.

I found it particularly interesting that he brought up this topic since just days ago I released a “SecureTransfer” class. I’ll get more into the details of that later, but it is interesting here because it happens to be very susceptible to the very issue he warns about. To put the problem in simple terms:

The best implementations of cryptography out there are only as secure as their key storage.

This is most certainly true for the SecureTransfer client/server classes. That doesn’t mean that implementing a secure communication channel is easy. Far from it. It just means that *if* you’ve implemented a secure channel it’s most obvious attack vector is to crack the key store.

Key storage for HttpClone

This very problem was one of the first things I had to address with HttpClone (which is now serving this website). To publish content from my local machine over HTTP I needed to know the server’s public key, and the server needs to know my local public key. In addition both client and server also have to store their own private keys.

So I asked myself what kind of assurances do I want regarding key security for HttpClone? It turns out that simply placing the keys in the web server’s /bin directory is probably all that is required. I mean to say if they can modify files on my web server’s bin directory the game is already lost. They can freely change the assemblies, web.config, etc and serve up any malicious content they want.

In the end I chose to allow an added a level of password security to the private key file and then store that password elsewhere. Why? Well you only need to remember back to last year at this time when Microsoft released this annoucement: “Important: ASP.NET Security Vulnerability“. One of the possible gains from this attack was being able to read any file in the web directory (web.config included). Due to this and the potential of a co-hosted site not running ASP.NET being hacked I thought adding the extra layer was worthwhile.

For HttpClone’s purposes it’s still not necessary to further protect the server’s private key. This is due in part to the fact that the data being sent (a copy of a public website) is not private and does not need to be secure. In fact the only reason for involving cryptography at all is for authorization not privacy.

Yes this site is still using wordpress, in fact, I’m writing this in wordpress right now. The interesting thing is I’ve completely uninstalled wordpress and MySql from my production server. I know crazy huh?

So if I’ve peaked your curiosity you’ll want to stay tuned. Right now I don’t have time for a lot of details, but what I can tell you is this:

Search still works
RSS still works
Postback still works
All the wordpress admin goodies still work

How? Well I’m using a new project I started called HttpClone to create a snap-shot of the site. From there I can pretty much do anything i want to it, including:

Rename it to a different domain
Add, edit, or remove content
Remove and insert html tags
Index the content with Lucene.Net
View, modify, validate and track down links

Once I’m happy with the changes being made I run a publish command and presto-changeo it’s live!

The project is definately still ‘Alpha’ material but the server-side of things should be solid enough for most uses. I’ve actually been running this site on early versions for two weeks now without issue. In the bargain the site should be around 3x-5x faster than when wordpress was serving the content directly.

csharptest.net

HttpClone

Why did you build this?

What is this for?

Who is this for?

Why should I care?

Getting started:

Links

Separation of Concerns – Edit then Publish

Why is a self-editing website a bad idea?

What are my alternatives?

RE: Keep it secret, keep it safe

Finally, wordpress is no longer running on my web server

Search

Projects

Recent Updates