End of GSOC and project status

First of all, I’d like to apologize to my readers (there are few of them, but anyway) for not posting an update after GSOC.

I spent the last 2 weeks of GSOC working furiously to be able to demonstrate a working version. There were quite a few problems in the end, but my mentor approved the project so that I could receive the last GSOC payment, provided that I continue working on it (as already promised in an earlier post).

Anyway, I don’t want to ramble more about what went wrong or not. Today I’m resuming my work, after 3 weeks of holidays and the start of a new academic year in another country. To those interested, most of the work is done but many things need to be reworked. I’ve been talking to Mozilla developers and most problems should be easily solvable.

I expect (but cannot promise) a working version in January 2010, and a public release after Mozilla’s authorization, which should not take long.

Yet another status update

I have been experimenting with the code sections to port from the apache cache client to the ff-crcsync extension.

The good news is that most of the code can be ported as-is, although I’m eliminating the Apache Portable Library dependencies where possible (to eliminate them totally I need to coordinate with the http-crcsync team).

Creating crcsync requests is mostly done, but there is still some work to do on recreating replies. Note that this code still lives outside the extension; I’m experimenting with it before integrating it in the C++ XPCOM component.

I still need to dig deeper into existing code samples and components in order to fully understand how the StreamConverter mechanism works.

I’m being very careful with the implementation. I want this extension to be used in a production environment reliably and not just be a simple proof-of-concept. This adds little overhead to the project and allows me to understand better what alternatives and possibilities exist for specific problems.

Right now I’m not sure if I will be able to honor the July 31st alpha release date, but it should not be too far off that. I’m also very dependent on the http-crcsync developers for the code changes that need to be done to the server. In an extreme situation I should be able to modify their code to show off what is done in the extension.

“Final” plan update

Well, it seems that calling the plan in the last post “final” was too hasty :-)

There is no need to have 3 different XPCOM components. I only need a StreamConverter which can convert from crcsync to text/html and from text/html to crcsync, the latter only being used in the Javascript main component of the extension.

I have the component skeleton done and right now I’m reading up on deflate (necessary to unzip the literal blocks) and thinking about how to save the crc blocks, etc.

That’s about it, I’ll keep this blog updated with my progress.

Final plan and status update

Midterm has passed and we are now entering the final phase!

The extension will be composed of 4 XPCOM components:
- init, header and cache manipulation (Javascript)
- crcsync compressor (C++)
- crcsync decompressor which is a streamConverter from crcsync to uncompressed (C++)
- crcsync library to be used by the two above (C)

On application init:
- Register two observers, http-on-modify-request and http-on-examine-response
- Add crcsync to the Accept-Encoding preferences (network.http.accept-encoding)
- Register the crcsync library component (let’s call it XPCOM_1)
- Register the crcsync “compressor” (let’s call it XPCOM_2)
- Register the streamConverter from crcsync to uncompressed (let’s call it XPCOM_3)

When a request is made:
- Check if there is a cache entry in the crcsync cache for this URL.
   If there isn’t any, ignore.
   There is a cache entry
      Open an inputStream into the cache and pass it to component XPCOM_2
      XPCOM_2 returns a block number, file size and block array
      Add the crcsync specific headers
      Add the etag and other headers saved in cache

When a response is received:
(The use of Content-Encoding: crcsync in the response will trigger the streamConverter component XPCOM_3. This process is automatic – confirmed by the Gecko developers)
- Check if there is a Content-Encoding: crcsync header
   If there isn’t any, ignore.
   This is a crcsync encoded response
      Decompression is done automatically by XPCOM_3 as specified above
      Save the etag and other necessary headers
      Save the crcsync compressed body – NOT YET CLEAR HOW TO DO
      Leave the crcsync headers so that Firefox doesn’t use its cache – MAYBE NOT NECESSARY?

What is done:
- Basic init
- Header manipulation
- Cache handling

What is left:
- The first thing to do is to XPCOM-ify the crcsync library (XPCOM_1).
- Following that, the compressor (XPCOM_2) and streamConverter (XPCOM_3) need to be implemented.
- There are still a few open points when the response is received. For a proof-of-concept we can skip saving the compressed body if accessing it is difficult; however, to make this extension useful and efficient, saving it is a necessity. It can be dealt with later though.

There have been some major changes in the HTTP-crcsync protocol. I’ll post the changes here once they are reflected in the code base, which will happen by July 24th according to the crcsync developers.

It’s midterm time!

Midterm evaluations end tomorrow!

Lately, I’ve been playing around with the cache while I wait for some finishing touches in the protocol. We’ve been redefining some headers which are of crucial importance.

A few thoughts about the cache:

-Cache entries will be saved with the URL hash as the cache key.

-A header file should be kept, because there is a need to keep AT LEAST the etag and If-Modified-Since so we can build conditional requests.

-When a request is made, there is an access to the cache to determine if a header file is present: if it is, then we are accessing a crcsync server and a crcsync request should be built.

-The block file received from the server should be kept in the cache to avoid having to crc the whole file each time a request is made or a response is received. So:
The first time a request is made to a specific resource, it should be block-compressed (if not already) and saved in the cache.
When a response is received in subsequent requests, the existing block file is merged with the response blocks and saved in the cache (replacing the old copy).
When a request is made, the block file is sent in the body and the original etag and If-Modified-Since are restored.
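
The bookkeeping described above could be modelled roughly as follows. This is only a sketch under assumptions: `createCrcsyncCache`, `save` and `lookup` are hypothetical names, the URL itself stands in for the URL hash, and a plain in-memory Map stands in for the real Firefox cache.

```javascript
// In-memory model of the crcsync cache described above.
// ASSUMPTION: the URL itself is used as the key instead of a URL hash,
// and a Map replaces the real Firefox cache.
function createCrcsyncCache() {
  const entries = new Map();
  return {
    // Save (or replace) the header data and block file for a resource.
    save(url, etag, ifModifiedSince, blockFile) {
      entries.set(url, { etag, ifModifiedSince, blockFile });
    },
    // Return the saved entry, or null when nothing was saved for this URL,
    // in which case no crcsync request should be built.
    lookup(url) {
      return entries.has(url) ? entries.get(url) : null;
    },
  };
}
```

A `null` result from `lookup` corresponds to the "no header file present" case, so the request goes out unmodified.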

Looking good!

I have modified liveHTTPHeaders a bit and the FF-crcsync 0.0.1 extension is working!

The HTTP-crcsync protocol defines new headers and modifies existing ones:

 

Request crcsync encoded response

A-IM: crcsync

If-Block: <base 64 encoded hashes>

File-Size: <size of cached file>

 

Crcsync Delta Response headers

HTTP/1.x 226 IM Used

IM: crcsync

Cache headers, content-encoding and Content-MD5 headers have BI- appended to them (Before instance coding).

New cache header ‘no cache’ added:

  • Pragma: no-cache

  • Cache-Control: no-cache

  • Expires: Thu, 01 Dec 1994 16:00:00 GMT

etag (if present in origin server response) has -crcsync appended to it

 

Crcsync Delta Response body

block := <Single byte block identifier> <block header> <block body>

compressed_literal_block := ‘Z’ <literal data, compressed with zlib library>

block_match := ‘B’ <single byte block ID>
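
To make the block format a bit more concrete, here is a minimal sketch of how a client might rebuild a page from these blocks. It assumes the body has already been split into block records (the wire framing and block header are not covered), and that the caller supplies the decompression function. `rebuildBody`, the record shape and `cachedBlocks` are hypothetical names, not part of the protocol.

```javascript
// Rebuild a response body from parsed crcsync delta blocks.
// ASSUMPTION: `blocks` is an array of already-parsed records:
//   { type: "B", id: <number> }    -> reuse block <id> of the cached copy
//   { type: "Z", data: <payload> } -> compressed literal data
// `cachedBlocks` holds the blocks of the cached copy, indexed by block ID.
// `inflate` decompresses a literal payload and returns a string.
function rebuildBody(blocks, cachedBlocks, inflate) {
  const parts = [];
  for (const block of blocks) {
    if (block.type === "B") {
      if (!(block.id in cachedBlocks)) {
        throw new Error("Server referenced unknown block " + block.id);
      }
      parts.push(cachedBlocks[block.id]);
    } else if (block.type === "Z") {
      parts.push(inflate(block.data));
    } else {
      throw new Error("Unknown block type " + block.type);
    }
  }
  return parts.join("");
}
```

Passing `inflate` in as a parameter keeps the sketch independent of any particular zlib binding.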

 

What I did was always inject these headers with If-Block: 1 whenever a request is made. This tells the server that I have only 1 block (supposedly base64 encoded), which doesn’t even exist, so the server will return all the blocks.

Of course this will return garbage to Firefox, but the server log is interesting – the crcsync compressed response is 70% smaller in size.

Keep in mind this is the response with all the blocks and the purpose of crcsync is to avoid receiving them all. In reality this 70% can also be achieved using gzip compression, which uses the same algorithm.

But if only half of the blocks are transmitted, we can get a response 85% smaller than with no compression! Hopefully it will only need 1/4 of the blocks, making it 92.5% smaller. In my opinion over 90% is the magic mark – that way it will be easy to convince network administrators to give it a try!
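
The arithmetic behind these figures can be written down as a small sketch. It assumes the response size scales linearly with the number of blocks sent, which is a simplification; `estimatedSavingsPercent` is a hypothetical helper name.

```javascript
// Estimate how much smaller a crcsync response is than an uncompressed one.
// compressedRatio: size of the fully compressed response relative to the
//                  original (0.30 when compression alone saves 70%).
// blockFraction:   fraction of the blocks the server actually has to send.
// ASSUMPTION: response size scales linearly with the number of blocks sent.
function estimatedSavingsPercent(compressedRatio, blockFraction) {
  return (1 - compressedRatio * blockFraction) * 100;
}

// Half of the blocks, then a quarter of the blocks.
console.log(estimatedSavingsPercent(0.3, 0.5).toFixed(1));
console.log(estimatedSavingsPercent(0.3, 0.25).toFixed(1));
```

With all blocks sent the estimate falls back to the plain 70% compression saving.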

And Mozilla especially could benefit a lot from it, since most of its pages are HTML text, where crcsync excels.

Plan for the coming weeks

Here is what is coming in the next few weeks:

I’ll attempt to describe my plan to the maximum detail possible. Please keep in mind some details might be altered and there are a few design decisions to make, but the core of this plan will not change.

The following is the pseudo-code algorithm for the Firefox crcsync extension:

intercept outgoing http request
 check the cache to see if there is a cached version of the url
   if there is a cached version
      calculate the CRC of the webpage
      add the crcsync protocol headers
   else
      do nothing

intercept the http response and convert the response body to text/html
   if it is a crcsync response (identified by the crcsync protocol response headers)
      extract the response body
      decompress the response body
      force removal of no-cache headers to enforce caching
   else
      force removal of no-cache headers to enforce caching

The first part is easy, and all the header manipulating extensions (Tamperdata, liveHTTPHeaders,…) do this. All I need to do is to register an “http-on-modify-request” observer:
———————
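// Subscribe to outgoing HTTP requests; obj must implement nsIObserver (an observe(subject, topic, data) method).
var observerService = Components.classes["@mozilla.org/observer-service;1"]
    .getService(Components.interfaces.nsIObserverService);
observerService.addObserver(obj, "http-on-modify-request", false);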

———————

The process is also described in https://developer.mozilla.org/en/Setting_HTTP_request_headers.

In the crcsync protocol specification a new Content-Encoding is defined: crcsync. So I’ll modify the Accept-Encoding header to include crcsync, add new headers:
A-IM: crcsync
If-Block: <base 64 encoded hashes>
File-Size: <size of cached file>

And send the request.
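
As a rough sketch of this step, the helper below builds the header values as a plain object. `buildCrcsyncRequestHeaders` and its parameters are hypothetical names; it assumes the block hashes are already base64 encoded and the cached file size is known.

```javascript
// Build the extra request headers for a crcsync-capable request.
// ASSUMPTION: `blockHashesBase64` is the already base64-encoded hash string
// and `cachedFileSize` is the size in bytes of the cached copy.
function buildCrcsyncRequestHeaders(acceptEncoding, blockHashesBase64, cachedFileSize) {
  // Split the existing Accept-Encoding value and append crcsync if missing.
  const encodings = acceptEncoding
    ? acceptEncoding.split(",").map((e) => e.trim()).filter(Boolean)
    : [];
  if (!encodings.includes("crcsync")) {
    encodings.push("crcsync");
  }
  return {
    "Accept-Encoding": encodings.join(", "),
    "A-IM": "crcsync",
    "If-Block": blockHashesBase64,
    "File-Size": String(cachedFileSize),
  };
}
```

Inside the extension, each name/value pair would presumably be applied to the channel from the observer with nsIHttpChannel.setRequestHeader(name, value, false).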

For the second part of the algorithm, I will intercept the responses by registering a “http-on-examine-response” observer the same way as for the “http-on-modify-request”.

The tricky part is to modify the response body. I contacted a Necko developer, who told me: “I don’t recall an API for doing this.” However, he suggested a workaround using a StreamConverter, which is used to convert between different content encodings.
If the reply contains Content-Encoding: crcsync, the response body can be passed through the StreamConverter, which will decompress the data.

To register a StreamConverter, the answer is in netwerk/streamconv/public/nsIStreamConverter.idl:


 * Registering a stream converter:
 * Stream converter registration is a two step process. First of all the stream
 * converter implementation must register itself with the component manager using
 * a contractid in the format below. Second, the stream converter must add the contractid
 * to the registry.
 *
 * Stream converter contractid format (the stream converter root key is defined in this
 * file):
 *
 * @mozilla.org/streamconv;1?from=FROM_MIME_TYPE&to=TO_MIME_TYPE

Which would translate to @mozilla.org/streamconv;1?from=crcsync&to=uncompressed.
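
The contractid format quoted above is easy to build programmatically. A tiny sketch, where `streamConverterContractId` is a hypothetical helper name:

```javascript
// Build a stream converter contract ID in the format quoted from
// nsIStreamConverter.idl: @mozilla.org/streamconv;1?from=FROM&to=TO
function streamConverterContractId(fromType, toType) {
  return "@mozilla.org/streamconv;1?from=" + fromType + "&to=" + toType;
}

// Contract ID under which the crcsync converter would be registered.
console.log(streamConverterContractId("crcsync", "uncompressed"));
```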

Since the response has various “no-cache” headers and an expiry date set in the past, Firefox will not cache it. The best way to force caching is to strip the offending response headers (the code can be taken from the BetterCache extension – http://netticat.ath.cx/BetterCache/BetterCache.htm).
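
A minimal sketch of the stripping step, not taken from BetterCache: it assumes the response headers are available as a plain object, and `stripNoCacheHeaders` is a hypothetical name. It drops a whole Cache-Control value whenever it contains no-cache, which a real implementation might want to refine.

```javascript
// Return a copy of `headers` without the values that prevent caching.
// ASSUMPTION: `headers` is a plain object mapping header names to strings.
function stripNoCacheHeaders(headers) {
  const result = {};
  for (const [name, value] of Object.entries(headers)) {
    const lower = name.toLowerCase();
    // Drop Pragma / Cache-Control values that carry the no-cache directive.
    if (lower === "pragma" && /\bno-cache\b/i.test(value)) continue;
    if (lower === "cache-control" && /\bno-cache\b/i.test(value)) continue;
    // Drop an Expires date that already lies in the past.
    if (lower === "expires" && Date.parse(value) < Date.now()) continue;
    result[name] = value;
  }
  return result;
}
```

All other headers, such as the etag, pass through unchanged.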

It is my opinion that the cache mechanism should be integrated with the regular Firefox cache, BUT on a separate CacheSession in the offline cache.

Right now I can think of one way to do it. By keeping the crcsync cache in the offline cache, we can assure that ONLY the crcsync responses are cached in this special cache. That way, the extension will only send the request headers if it finds an existing cache copy of the remote resource. If not, then the remote server is not crcsync-aware and there is no need to modify headers.

To use the cache for read / write access, I need to create a CacheSession with the CacheService (@mozilla.org/network/cache-service;1). This CacheSession would run in parallel with the regular cache for non-crcsync enabled servers.

On to performance. Most of the CPU-intensive tasks will be done in the crcsync library, which will be compiled as a C++ XPCOM component.
Now there is an important design decision to be made. Should the rest of the extension be written in C++ or Javascript?

Although I haven’t written a line in Javascript and I have experience with C++, Javascript seems really easy and manageable.
I’ll probably do a bit of both.
My original idea to use other extensions as a base doesn’t seem as interesting as I thought it was. Most of these extensions have complex XUL interfaces, and I don’t think I need a GUI (only for on and off, but that can be done by the “disable” button in the addons dialog).
But of course, looking at their source code for ideas and for “robbing” snippets of code is still an excellent idea.

I expect to finish coding in July. In August I will do some serious testing and eventually improve the performance and efficiency of the protocol and the extension. During this phase I will also distribute the beta version of the extension to the http-crcsync developers.

I’m committing myself to maintaining it after the end of GSoC and upgrading it whenever a new version of Firefox or the crcsync protocol emerges.

First timeline – update

After reading up on the extensions referred to in the last post, Tamperdata seems overkill, especially since I don’t need a GUI!
The liveHTTPHeaders extension seems more appropriate. It does practically the same thing with much less code and bloat (from the point of view of my extension of course), so adapting it will be much easier.

So here is an update on the first timeline:

* – done
> – work in progress
x – TODO

crcsync side
————
* – Read current spec
* – Get it building
* – Get a system set up where I can play with it (perhaps between two Apache proxies)
* – Investigate libcrcsync
> – Read the code to understand how it works

Mozilla side
————
* – Decide between an extension or a patch
* – Find which addons do stuff similar to what I want to do
> – Examine source code and construction of those addons
> – Find how to manipulate network streams with an addon
x – Write a simple addon to change all occurrences of “swine flu” to “SCARY KILLER DISEASE” in network streams

Much of it is done and in a few days I will post the second timeline and more detailed information about what I’ve done so far.

I found the perfect base for my extension!

The last few days I have been searching around for extensions which do similar things (like manipulating HTTP request / responses) and I seem to have found one which does exactly what I want – “TamperData is an extension to track and modify http/https requests”.

It is used for “security testing of Web based applications”. I have been using it for the past few days and it’s quite cool…

It has an HTTP listener which processes responses and requests (of course) and it is quite complex. Its interface provides a lot of info, showing when a cache hit happens, the request / response headers and optionally the content. It also has a feature which forces caching of all requests – exactly what I need!

This is a great base for developing my extension… and for testing it!

There are also other extensions which allow manipulation of headers, such as LiveHTTPHeaders and modifyheaders but they don’t seem so evolved… I’ll have to look into it.

First timeline

I’m back!!

Well it’s about time – I’ve been reading RFCs in the past few days and I should make a blog post about my progress.

Here is the schedule that was proposed by my mentor, Gervase Markham, and modified by me:

* – done
> – work in progress
x – TODO

crcsync side
————
* – Read current spec
* – Get it building
* – Get a system set up where I can play with it (perhaps between two Apache proxies)
> – Investigate libcrcsync
> – Read the code to understand how it works

Mozilla side
————
> – Decide between an extension or a patch
> – Find which addons do stuff similar to what I want to do
x – Examine source code and construction of those addons
x – Find how to manipulate network streams with an addon
x – Write a simple addon to change all occurrences of “swine flu” to “SCARY KILLER DISEASE” in network streams

(I especially like the last part :) )

Now things have to get up to speed – after all, it’s been almost 3 weeks since GSoC started!

Anyway, the guys at the http-crcsync list have been quite supportive. Even though I haven’t said much on that list, most of them keep referring to me in their emails. It seems that they are also very excited about this idea…

Gotta go – the code is waiting for me…
