Preserving Web-based digital material
Andrea Goethals
Harvard University Library



Why Books Site Visit, 28 October 2010

Agenda
  Why preserve Web content?
  A look at the Web
  Web archiving
  Web archiving at Harvard
  Open challenges in Web archiving
  Questions

1. Why preserve Web content?


Books have moved off the shelves and onto the Web!

A few other things on the Web
  TV shows, blogs, images, scholarly papers, stores, discussions, maps, virtual worlds, art exhibits, documents, music, articles, magazines, newspapers, tutorials, software, databases, social networking, advertising, courses, museums, libraries, archives, recipes, data sets, oral history, poetry, broadcasts, wikis, movies

But is it valuable?
  May be historically significant: White House web site, March 20, 2003
  May be the only version: Harvard Magazine, May/June 2009

  May document human behavior: World of Warcraft, Fizzcrank realm, Morc the Orc’s view, Oct. 25, 2010
  Important to researchers: ABC News, Aug. 2007
  Important to researchers:
    Strangers and friends: Collaborative play in World of Warcraft
    From tree house to barracks: The social life of guilds in World of Warcraft
    The life and death of online gaming communities: A look at guilds in World of Warcraft
    Learning conversations in World of Warcraft
    The ideal elf: Identity exploration in World of Warcraft
    Traffic analysis and modeling for World of Warcraft
    E-collaboration and e-commerce in virtual worlds: The potential of Second Life and World of Warcraft
    Understanding social interaction in World of Warcraft
    Communication, coordination, and camaraderie in World of Warcraft
    An online community as a new tribalism: The World of Warcraft
    A hybrid cultural ecology: World of Warcraft in China
    etc.

  May be a work of art: YouTube Play. A Biennial of Creative Video (Oct. 2010 -)
  May be important data for scholarship: NOAA Satellite and Information Service
  May be an important reference

  May be of personal value

2. A look at the Web

Remember this?
  1993: “First” graphical Web browser (Mosaic)

Volume of content is immense!
  1998: First Google index has 26 million pages
  2000: Google index has 1 billion pages
  2008: Google processes 1 trillion unique URLs
  “and the number of individual Web pages out there is growing by several billion pages per day” (Source: the official Google blog)

Prolific self-publishers
  “Humanity’s total digital output currently stands at 8,000,000 petabytes but is expected to pass 1.2 zettabytes this year. One zettabyte is equal to one million terabytes”
  “Around 70 per cent of the world’s digital content is generated by individuals, but it is stored by companies on content-sharing websites such as Flickr and YouTube.”
  Source: Telegraph.co.uk, May 2010, on IDC study

Ever-increasing number of web sites
  96 million out of 233 million web sites are active (Netcraft.com)

A moving target
  Flickr (Feb 2004)
  Facebook (Feb 2004)
  YouTube (Feb 2005)
  Twitter (2006)

Anatomy of a web page
  Typically 1 web page = ~35 files
    1 HTML file
    7 text/css
    8 image/gif
    17 image/jpeg
    2 javascript
  Source: representative samples taken by Internet Archive

Can’t rely on it always being out there


Web content is transient
  The average lifespan of a web site is between 44 and 100 days
  Captured April 8, 2009; visited October 13, 2010

Disappearing web sites
  2000 Sydney Olympics: most of the Web record is only held by the National Library of Australia
  Half of the URLs cited in D-Lib Magazine inaccessible 10 years after publication (McCown et al., 2005)

3. Web archiving

Web archiving 101
  Web harvesting
    Select and capture it
  Preservation of captured Web content
    “Digital preservation”
    Keep it safe
    Keep it usable to people long-term, despite technological changes
  Diagram labels: acquisition of web content, acquisition of other digital content, preservation of web content, preservation of other digital content

Web harvesting
  Download all files needed to reproduce the Web page
  Try to capture the original form of the Web page as it would have been experienced at the time of capture
  Also collect information about the capture process
  Must be some kind of selection

Types of harvesting
  Domain harvesting
    Collect the web space of an entire country
    The French Web including the .fr domain
  Selective harvesting
    Collect based on a theme, event, individual, organization, etc.
    The London 2012 Olympics
    Hurricane Katrina
    Women’s blogs
    President Obama
  Any type of regular harvesting results in a large quantity of content to manage.
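As a rough illustration of the "download all files needed to reproduce the Web page" step described above, the sketch below lists the embedded files a harvester would have to fetch for a single page. It is not how any particular crawler works: the names `ResourceCollector`, `files_to_capture`, and `EMBED_ATTRS` are hypothetical, the tag list is deliberately incomplete (it ignores, for example, images referenced from CSS), and only the Python standard library is used.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Tag/attribute pairs that point at files a browser fetches to render a page.
# Illustrative assumption only; a real harvester follows many more cases.
EMBED_ATTRS = {
    ("img", "src"),
    ("script", "src"),
    ("link", "href"),
    ("iframe", "src"),
}


class ResourceCollector(HTMLParser):
    """Collects absolute URLs of embedded resources found in one HTML page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.resources = []

    def handle_starttag(self, tag, attrs):
        for name, value in attrs:
            if value and (tag, name) in EMBED_ATTRS:
                # Resolve relative references against the page's own URL.
                url = urljoin(self.base_url, value)
                if url not in self.resources:
                    self.resources.append(url)


def files_to_capture(base_url, html):
    """Return the page URL followed by every embedded resource URL."""
    collector = ResourceCollector(base_url)
    collector.feed(html)
    return [base_url] + collector.resources
```

Ordinary hyperlinks (`<a href>`) are left out on purpose: they lead to other pages, which is a selection decision, whereas the embedded files are part of the page itself.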

How do we eliminate:
  Web spam
    Intentional and unintentional crawler traps
    Potential solutions: spam filters during or after a crawl
  Duplicate content
    Exact copies of content previously captured
    Within a harvest: Heritrix already de-dupes
    Among harvests: a “smart crawler” version of Heritrix exists

What should we collect?
  Inability to determine now what will be valuable in the future
  Potential strategies
    Only do large domain crawls
      But there’s a price to pay for these crawls!
      Internet Archive, Swedish National Library, Library and Archives Canada
    Selective crawls complemented with periodic broad domain crawls (e.g. BnF, Denmark)

How do we describe it, given its volume?
  Volume prohibits technical metadata description and storage
  But technical metadata is necessary to know what you have and to plan its preservation
  Limited amounts of metadata (Harvard: formats, admin flags)
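The duplicate-content challenge above (exact copies within a harvest and among harvests) can be sketched with content digests. This is a minimal illustration under stated assumptions, not Heritrix's implementation: the function name `dedupe`, the choice of SHA-1, and the in-memory set of digests are all assumptions made for the example.

```python
import hashlib


def dedupe(captures, seen_digests=None):
    """Split captures into new and duplicate URLs by content digest.

    captures: iterable of (url, content_bytes) pairs from one harvest.
    seen_digests: set of digests from earlier harvests; updated in place,
        so passing the same set to later calls de-dupes among harvests.
    """
    if seen_digests is None:
        seen_digests = set()
    new, duplicates = [], []
    for url, content in captures:
        # Byte-identical content always produces the same digest.
        digest = hashlib.sha1(content).hexdigest()
        if digest in seen_digests:
            duplicates.append(url)
        else:
            seen_digests.add(digest)
            new.append(url)
    return new, duplicates
```

Calling `dedupe` once catches exact copies within a harvest; keeping the `seen_digests` set between calls extends the same check across harvests. Only byte-identical copies are detected, so pages that differ by a timestamp or an advertisement still count as new.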
