Background: Various groups track information flow by logging the URLs exchanged when data is uploaded and downloaded. The US Department of Justice is currently pursuing an initiative, and one has already passed in Europe, to force internet service providers (ISPs) to log data transfers and to retain these logs for a long time (despite the large amount of storage this requires). This is an ongoing effort and follows a prior effort in the same direction. It seems the objective is not to save the actual data, just the information about which URLs were used: which sites you visited and what filenames you uploaded. Such information is already used for DMCA "take-down" requests, and the legal page at the infamous Pirate Bay BitTorrent tracker provides many (amusing) examples. It also represents a huge incursion into personal privacy, which is threatening in many ways.
Utility?: While anti-terrorism is cited as one of the benefits of this plan, it seems unlikely that actual terrorists operate by uploading data to public sites. The initiative is more likely to be motivated by DMCA enforcement. Even the most trivial password protection and encryption by organized groups (such as terrorists) gets around this measure. I suppose one still might be able to catch a really foolish bad guy, which sounds insignificant, but perhaps that is not as irrelevant as it seems.
Once such linkages between users and URLs are made, there are a lot of very interesting data mining possibilities. Google is surely looking at doing this right now for commercial purposes (e.g. targeted advertising). This is very close to work we are doing (abstract) (pdf) to unravel the positions of robots or sensors deployed in space. The connections one might get can be insightful, but also very misleading at times, and this is worrisome. It means, in principle, that you could get into trouble for using certain Google search words, without even downloading anything. This is akin to patrolling people's thoughts.
In the case of web traffic and DMCA enforcement, however, it seems like this effort to simply log traffic can be easily circumvented or obfuscated. A current practice is to pursue people if an upload of theirs has been downloaded "too many" times. If the data provider simply uses cryptic URLs and rotates them often, as illustrated below, then the logged data becomes almost useless. The URL doesn't tell you anything and surely doesn't prove much (e.g. the URL for this article might be blog/41 now, but tomorrow it becomes blog/21111). This makes permanent links tricky for the user, but many such links already lead only to index pages that provide the connections between the URLs and descriptions of the content they provide. Of course, one could log that too, but then it becomes much, much more complicated, since doing it without human intervention involves producing (essentially) a snapshot of the whole internet on a regular basis. In short, this proposal seems fraught with problems, both from a technical standpoint and with respect to personal privacy.
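To make the URL rotation idea concrete, here is a minimal sketch in Python. It is not how any particular site does it; the class name RotatingUrlMap, the content identifiers, and the blog/ prefix are all made up for illustration. The only point is that a logged path stops meaning anything once the map has been rotated, while the index page always lists the current paths.

```python
import secrets


class RotatingUrlMap:
    """Maps opaque, short-lived URL paths to stable content identifiers.

    Hypothetical helper for illustration only.
    """

    def __init__(self, content_ids):
        self._content_ids = list(content_ids)
        self._path_to_content = {}
        self.rotate()

    def rotate(self):
        """Discard every current path and issue fresh random ones."""
        self._path_to_content = {
            "blog/" + secrets.token_urlsafe(8): content_id
            for content_id in self._content_ids
        }

    def index(self):
        """Return the current index page as {content id: current path}."""
        return {cid: path for path, cid in self._path_to_content.items()}

    def resolve(self, path):
        """Return the content id for a path, or None if the path has expired."""
        return self._path_to_content.get(path)


urls = RotatingUrlMap(["privacy-article", "robot-localization"])
logged_path = urls.index()["privacy-article"]  # what a traffic log would record
urls.rotate()                                  # e.g. run once a day
current_path = urls.index()["privacy-article"]
```

After the rotation, urls.resolve(logged_path) no longer leads anywhere, so the log entry cannot be tied back to the article unless someone also kept a copy of the index page from that day.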
Try it: URL content changes after a few clicks.
The simple example shown here illustrates a URL (for the picture) that delivers different content at different times. Note that this is not the same as just changing which images are linked into a page: the URL of the image itself doesn't change, only the content it points to. The first time you click it you get an image of "secret" troop deployments (that might violate the DMCA). If you reload the same URL a few times, you get something more benign. Hence, knowing who accessed the URL doesn't provide any information, unless you actually store the data too (which isn't practical). (Approximate source code for the above example here.)
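For readers who want to try something similar locally, the following is a rough sketch of the same trick using only the Python standard library. It is not the source code of the demo above. The /picture path, the threshold of three requests, the single shared counter, and the placeholder bytes standing in for the two images are all assumptions made for the example.

```python
from http.server import BaseHTTPRequestHandler, HTTPServer

SWITCH_AFTER = 3  # assumed threshold; the text above only says "a few" reloads
FIRST_BODY = b"placeholder bytes standing in for the first image"
LATER_BODY = b"placeholder bytes standing in for the benign image"


def body_for_hit(hit_number):
    """Pick the response body for the nth request (1-based) to the URL."""
    return FIRST_BODY if hit_number <= SWITCH_AFTER else LATER_BODY


class ChangingContentHandler(BaseHTTPRequestHandler):
    hits = 0  # one counter shared by all visitors (an assumption)

    def do_GET(self):
        if self.path != "/picture":
            self.send_error(404)
            return
        ChangingContentHandler.hits += 1
        body = body_for_hit(ChangingContentHandler.hits)
        self.send_response(200)
        self.send_header("Content-Type", "application/octet-stream")
        self.send_header("Content-Length", str(len(body)))
        # Ask browsers and proxies not to reuse an earlier response.
        self.send_header("Cache-Control", "no-store")
        self.end_headers()
        self.wfile.write(body)


def serve(port=8000):
    """Serve the changing URL on localhost until interrupted."""
    HTTPServer(("127.0.0.1", port), ChangingContentHandler).serve_forever()
```

Calling serve() and requesting http://127.0.0.1:8000/picture repeatedly returns the first body for the first three requests and the second body afterwards. A traffic log would show the identical URL every time, which is exactly why the log alone says nothing about what was delivered.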