Here's the video.
Technorati Tags: blogs, podcasting, tags, technorati, video
This page was generated by translation through the Osaka dialect conversion filter.
Web wide crawl with initial seedlist and crawler configuration from March 2011. This uses the new HQ software for distributed crawling by Kenji Nagahashi.
What’s in the data set:
Crawl start date: 09 March, 2011
Crawl end date: 23 December, 2011
Number of captures: 2,713,676,341
Number of unique URLs: 2,273,840,159
Number of hosts: 29,032,069
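For a rough sense of scale, the figures above can be turned into a few averages. This is only a back-of-the-envelope sketch: the variable names are my own, and the ratios are plain averages over the whole crawl, which say nothing about how captures were actually distributed across URLs or hosts.

```python
from datetime import date

# Figures copied from the data set summary above.
crawl_start = date(2011, 3, 9)
crawl_end = date(2011, 12, 23)
captures = 2_713_676_341
unique_urls = 2_273_840_159
hosts = 29_032_069

# Derived averages (illustrative only; not part of the original summary).
duration_days = (crawl_end - crawl_start).days
captures_per_url = captures / unique_urls
urls_per_host = unique_urls / hosts
captures_per_day = captures / duration_days

print(f"Crawl duration:          {duration_days} days")
print(f"Captures per unique URL: {captures_per_url:.2f}")
print(f"Unique URLs per host:    {urls_per_host:.1f}")
print(f"Captures per day:        {captures_per_day:,.0f}")
```

On these numbers the crawl ran for 289 days and averaged roughly 1.19 captures per unique URL and about 78 unique URLs per host.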
The seed list for this crawl was a list of Alexa’s top 1 million web sites, retrieved close to the crawl start date. We used Heritrix (3.1.1-SNAPSHOT) crawler software and respected robots.txt directives. The scope of the crawl was not limited except for a few manually excluded sites.
However, this was a somewhat experimental crawl for us: we were using newly minted software to feed URLs to the crawlers, and we know it had some operational issues. For example, in many cases we may not have crawled all of the embedded and linked objects in a page, since the URLs for these resources were added to queues that quickly grew bigger than the intended size of the crawl (and therefore we never got to them). We also included repeated crawls of some Argentinian government sites, so results broken down by country will be somewhat skewed.
We have made many changes to how we do these wide crawls since this particular example, but we wanted to make the data available “warts and all” for people to experiment with. We have also done some further analysis of the content.
If you would like access to this set of crawl data, please contact us at info at archive dot org and let us know who you are and what you’re hoping to do with it. We may not be able to say “yes” to all requests, since we’re just figuring out whether this is a good idea, but everyone will be considered.
Edifying exquisite equine entrapments
I didn't attend Blogher, but many of my friends and colleagues did, and mostly got lots out of it. I did pick up an undercurrent of discomfort from my female geek friends at what they saw as the low tech content of the conference, and even 'all these women in high heels giggling together'. Melinda Casino, Shelley Powers and Tara Hunt express various concerns with tone and with intrusive sponsorship.
The problems of sponsorship and product pitches always intrude into conferences. With the BarCamp model they get minimised by the low-budget ethos and the emphasis on emergent scheduling. But having watched several friends put together big conferences that involve taking over hotels for a few days, I have seen that the need to raise significant sponsorship money does lead to editorial pressure on the schedule, and it is difficult to walk the line between Jane Jacobs' Commercial and Guardian modes.
However, reading some of the posts by non-techie Blogher attendees, like IzzyMom and tastetheworld, what I see is the sheer pleasure at meeting people you have only known through their online writing, and making the personal connection with them. I recognise the experience I had when I crashed O'Reilly's eTech in 2003, and was able to pick up conversations with people based on what we'd been writing about, and overcome my previous inability to make smalltalk in big groups. The continual growth of blogging means that there are now many more interest groups out there beyond my techie clan. Lisa, Jory, Elise and the other Blogher organisers enabled lots of women with different interests to get together and have these personal epiphanies, and resolve Ford Prefect's quest for 'a peer group and a stiff drink' - well done.
Technorati Tags: Barcamp, Blogher, blogs, conference, emergence, etech
This is my personal blog. Any views you read here are mine, and not my employer's.
encourage copying, expect payment