On Twitter: @JamesFirth and @s_r_o_c (post feed)

Thursday, 26 April 2012

"The internet is filling up!" - it raised a laugh but I said it to express an important concept

Last week I almost became embroiled in a Ted Stevens "series of tubes" moment when I told a group of internet rights and privacy advocates that the internet was filling up, and that this fact had quite profound implications for the way we understand online privacy and disclosure.

The Open Rights Group executive director Jim Killock almost spilled his drink as he snorked at my suggestion, but I said it for a reason: we need to look beyond the engineering challenge of data storage to the macroscopic properties of a massively distributed stochastic system where billions of users provide input and absorb output.

"The internet never forgets" is a common mantra amongst both privacy and anti-censorship advocates.

It captures how information, once published online, has the tenacity to stick around for decades - perhaps forever. This has obvious privacy implications, but it also has the capacity to undermine censorship, as any attempt to block or cleanse the information from the internet often turns into a game of whack-a-mole.

But "the internet never forgets" was coined in an era of limited participation. Now internet evangelists such as Vint Cerf talk about bitrot - the loss of data, or the inability to read or interpret stored data at some future point.

Bitrot comes in at least five forms I can identify:
  1. Deletion of the last copy of the data (accidental; but also deliberate, without realising the future value)
  2. Storage of the data in a format which can't be interpreted at some future point due to the unavailability of the software or expertise to interpret the data
  3. Storage on a medium which can't be read due to lack of a physical device to read the data
  4. Failure or degradation of the storage medium so the data can no longer be read
  5. Storage in a forgotten place or a location which can't be found with available search tools

Archivists worry about bitrot because we often don't realise the value of past works and historical data until some future point, when our understanding of the subject has improved sufficiently to make sense of the historical record.

But I'm surprisingly relaxed about bitrot. Bitrot is an engineer's-eye view of a problem which can be solved by further engineering.

We're on the verge of producing zettabytes of data per year. That's approaching two trillion DVDs' worth of data which, stacked end-to-end(!), would reach out into space approximately five times further than the moon.
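
For anyone who wants to check the back-of-envelope sums, here's a rough sketch. It assumes single-layer 4.7 GB DVDs, each 1.2 mm thick and stacked flat, and a mean Earth-Moon distance of about 384,400 km. None of those figures come from this post, and the function names are purely illustrative.

```python
# Back-of-envelope check of the DVD comparison.
# Assumptions (not from the post): single-layer DVD capacity and
# thickness, discs stacked flat, and the mean Earth-Moon distance.
ZETTABYTE = 10**21            # bytes, decimal (SI) definition
DVD_BYTES = 4.7 * 10**9       # single-layer DVD, 4.7 GB
DVD_THICKNESS_M = 1.2e-3      # 1.2 mm per disc
MOON_DISTANCE_M = 384_400e3   # mean Earth-Moon distance in metres


def dvds_for(zettabytes: float) -> float:
    """Number of single-layer DVDs needed to hold this many zettabytes."""
    return zettabytes * ZETTABYTE / DVD_BYTES


def zettabytes_for(dvd_count: float) -> float:
    """Data volume, in zettabytes, held by this many single-layer DVDs."""
    return dvd_count * DVD_BYTES / ZETTABYTE


def stack_in_moon_distances(dvd_count: float) -> float:
    """Height of a flat stack of discs, measured in Earth-Moon distances."""
    return dvd_count * DVD_THICKNESS_M / MOON_DISTANCE_M


if __name__ == "__main__":
    two_trillion = 2e12
    print(f"DVDs per zettabyte: {dvds_for(1):.3g}")
    print(f"Zettabytes on two trillion DVDs: {zettabytes_for(two_trillion):.1f}")
    print(f"Stack height in moon distances: {stack_in_moon_distances(two_trillion):.1f}")
```

On those assumptions a single zettabyte comes to something like 213 billion discs, so two trillion DVDs corresponds to roughly 9.4 zettabytes, and a flat stack of them would be about six times the distance to the moon.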

But this fact doesn't cause engineers to lose much sleep. In fact it excites them. We'll continue to find more efficient ways of storing data than DVDs. We'll filter the noise and de-duplicate.

But the fact that data, without some nurturing, risks disappearing into oblivion should be seen as an opportunity to explore some of the benefits of forgetting.

An opportunity to look beyond the data and towards how people and data interact; look at how we use and organise data, and in particular look at the social implications of a connected society now, rather than the benefit to archivists later.

A collective brain

Bitrot is not simply a threat to archivists that we should fight with brute storage. It's also a phenomenon we might embrace to weed out garbage and let useful data sit at the front of our collective minds.

When I use the internet today it sometimes feels as though digital content is organising itself along the lines of a giant central human brain, with short and long-term memory.

Short term memory: things we can't help being reminded of. Current events, trends, things people talk about on social networks. Information that's easy to find on Google. You get the idea.

Long term memory: things we have to hunt out. We have to piece together multiple parts of a jigsaw with many well-crafted web searches, or combine data unique to ourselves - data we've stored on our local systems.

I started thinking along these lines a few years ago when the world's best search engine* started to show some of the creaking flaws of its predecessors.

Anyone who remembers using Lycos, Yahoo and Excite to find content back in the '90s will perhaps also remember the wow factor when they first used Google. Google just found stuff - the stuff you wanted - with minimal effort.

Now things are very different. I can find some stuff easily, and some just falls between the gaps.

The principle of extreme sharing (or, as some call it, toxic disinhibition) has simultaneously led to an increase in serendipitous moments, a massive increase in noise and a loss of privacy.

Mass participation has also introduced chaos into a well-engineered electronic system.

Many people now (rightly) want to have their say and put their spin on news events.  Plus, commercial drivers lead to the creation of content, content, content; optimised to appear high up the search rankings despite having little inherent knowledge value.

Data is growing.  A good proportion of the population has at their fingertips a means to input and retrieve data from this thing we call the internet.  The internet itself grew autonomously from a series of basic regulated principles - protocols.

Now even Google can't help me find an article I read on a reasonably well-established professional source just last week. Twitter only allows tweets from the last 2 weeks to be searched - beyond which you need to rely on expensive third-party tweet archives.

The internet is filling up!

Some feel that search technology will catch up, but I disagree.  When Google became the dominant search engine it wasn't facing the same challenge we see now.  The challenge back then was to scan all available content and maintain an algorithm which brought the most relevant results to the top.

A decade after the advent of the World Wide Web, relevance was pretty much a universal constant. Now it's highly subjective. Less than 20 years ago no national newspaper put their content online. Now when a major news event breaks in the UK it can spawn a thousand articles - many times more if you count personal blogs and social media.

Depending on what you're looking for, any one of 100,000 websites could be relevant.

Enter the personalised web.  But that doesn't solve the problem of discovery - finding something completely new to the searcher. And it doesn't solve the problem of finding something specific if that something happens to be outside your normal personalised search bubble or circle of relevance.

Personalisation is a fig leaf based around the premise that around 90% of what we're looking for can be guessed from our past behaviour and, in some cases, our friends' past behaviour. But this thinking locks us into our past and can make the thing we actually seek even harder to find.

We're facing a new set of challenges that increasingly make the old adage, that the internet never forgets, look naive.

The internet is filling up, and that not only has profound implications for the way we use and store data; it will impact all online businesses, have implications for free and open online competition when a few dominant providers act as gatekeepers, and will affect legislation attempting to tackle e.g. government transparency, privacy and regulation of online content.

The prominence of data is becoming governed not by how engineers store and organise it but by how humans interact with each other in a way that simply can't be predicted, to bring certain information to prominence and let other facts linger at the back of our collective minds.


1 comment:

  1. I really get this, and the privacy implications. Before the internet we had publishers and publishers had the power to keep things in the public's short term memory. Now the "short term memory" is distributed across all people who choose to connect and is effectively uncontrollable.

    Which makes a mockery of any notion that this can or should be controlled. It's inherently democratic.

    As far as privacy goes, it means things which are privacy invasive but in the "long term memory" are less of a worry as they're not being rammed down people's throats. And this is scalable right down to communities and right up to a globalized view.

    I search for my friend - something is privacy invasive if it comes at the top of the search results for her. It's less invasive if it sits on some internet backwater.

    Thanks for the post!

