Four short links: 16 March 2010
Nat Torkington @gnat 2010-03-16
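Nat has chaired the O'Reilly Open Source Convention and other O'Reilly conferences for over a decade. He ran the first web server in New Zealand, co-wrote the best-selling Perl Cookbook, and was one of the earliest bloggers on Radar. He lives in New Zealand and focuses mainly on the Asia-Pacific region.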
Nat Torkington @gnat 2010-03-16
When I blogged about truly open data, readers sent me a lot of interesting links. I've collected them all below. Enjoy!
Nat Torkington @gnat 2010-03-15
Nat Torkington @gnat 2010-03-12
Nat Torkington @gnat 2010-03-11
Nat Torkington @gnat 2010-03-10
Nat Torkington @gnat 2010-03-09
I'm kicking myself. I have spent a non-trivial number of hours talking to government departments and scientists about open data, talking up an "open source approach" to data, pushing hard to get them to release datasets in machine-readable formats with reuse-friendly licenses. I've had more successes than failures, met and helped some wonderful people, and now have more mail about open data in my inbox than about open source.

So why am I kicking myself? I'm kicking myself because I've been taking far too narrow an interpretation of "an open source approach". I've been focused on getting people to release data. That's the data analogue of tossing code over the wall, and we know it takes more than a tarball on an FTP server to get the benefits of open source. The same is true of data.

Open source discourages laziness (because everyone can see the corners you've cut), it can get bugs fixed or at least identified much faster (many eyes), it promotes collaboration, and it's a great training ground for skills development. I see no reason why open data shouldn't bring the same opportunities to data projects. And a lot of data projects need these things.

From talking to government folks and scientists, it's become obvious that serious problems exist in some datasets. Sometimes corners were cut in gathering the data, or there's a poor chain of provenance for the data so it's impossible to figure out what's trustworthy and what's not. Sometimes the dataset is delivered as a tarball, then immediately forks as all the users add their new records to their own copy and don't share the additions. Sometimes the dataset is delivered as a tarball but nobody has provided a way for users to collaborate even if they want to.

So lately I've been asking myself: What if we applied the best thinking and practices from open source to open data? What if we ran an open data project like an open source project? What would this look like?

First, we'd collaboratively build the dataset.
This means we'd have a curator who is the equivalent of a project leader, taking patches and filtering for quality. Successful open source project leaders foster a group of developers of different skills, rewarding on merit while fostering new talent. As with open source projects, the nirvana state is a project that can survive the retirement or death of its founder.

But collaboration takes more than leadership--open source projects have tools that help. An open data project would need a mailing list to collaborate on, IRC or equivalent to chat in real time, and a bug-tracker to identify what needs work and ensure that the users' needs are being met. The official dataset of New Zealand school zones has errors but there's nobody to report them to, much less a way to submit a fix to a maintainer. Oh, and don't forget a way to acknowledge and credit contributors: think not just of credits.txt but also of the difference between patch submitter, committer, and project maintainer.

Open source software developers have a powerful set of tools to make distributed authoring of software possible: diff to identify what's changed, patch to apply those changes elsewhere, version control to track changes over time and show provenance. Patch management would be just as important in a collaborative open data project, where users and other researchers might be submitting new or revised data. What would git for data look like? Heck, what would a local branch look like? I have a new attribute, you have a different projection, she has new rows, how does this all tie back together? (I eagerly await claims that RDF will solve this problem and all others.)

That's just development. The interface between developers and users is the release. State of the art for a lot of government data is the equivalent of source.tar.gz. No version numbers, much less the ability to download older versions of the datasets or separate stable and development branches.
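To make the diff-and-patch idea above a little more concrete, here is a minimal sketch of what row-level `diff` and `patch` might look like for a small tabular dataset. Everything in it is an assumption for illustration: the dataset is modelled as a dictionary from a primary key to a row, the function names `diff` and `apply_patch` are made up, and the school-zone rows are invented rather than taken from any real dataset. It does not attempt the harder cases the questions above raise, such as new attributes or different projections.

```python
from typing import Any, Dict

Row = Dict[str, Any]
Dataset = Dict[str, Row]  # hypothetical layout: primary key -> row


def diff(old: Dataset, new: Dataset) -> dict:
    """Return a patch describing how to turn `old` into `new`."""
    return {
        "added": {k: new[k] for k in new.keys() - old.keys()},
        "removed": sorted(old.keys() - new.keys()),
        "changed": {
            k: new[k] for k in old.keys() & new.keys() if old[k] != new[k]
        },
    }


def apply_patch(base: Dataset, patch: dict) -> Dataset:
    """Apply a patch to a copy of `base`, refusing edits that do not fit."""
    result = {k: dict(v) for k, v in base.items()}
    for key in patch["removed"]:
        if key not in result:
            raise KeyError(f"cannot remove missing row {key!r}")
        del result[key]
    for key, row in patch["added"].items():
        if key in result:
            raise KeyError(f"row {key!r} already exists")
        result[key] = dict(row)
    for key, row in patch["changed"].items():
        if key not in result:
            raise KeyError(f"cannot change missing row {key!r}")
        result[key] = dict(row)
    return result


# Invented example rows: the maintainer's copy and a user's local copy.
upstream = {
    "s1": {"school": "Example High", "zone": "north"},
    "s2": {"school": "Sample College", "zone": "east"},
}
my_copy = {
    "s1": {"school": "Example High", "zone": "north-west"},  # a correction
    "s2": {"school": "Sample College", "zone": "east"},
    "s3": {"school": "Demo Primary", "zone": "south"},  # a new row
}

patch = diff(upstream, my_copy)          # what the user would submit
merged = apply_patch(upstream, patch)    # what the curator would apply
```

The patch is small enough for a curator to review by eye, which is the point: the user sends the change, not a whole forked copy of the dataset.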
Why would we want to download the historic version of a dataset? Because a paper used it and we want to test the analysis software that the paper used to ensure we get the same answer. Or because we want to see what our analysis technique would have shown with the knowledge that was available back then. Or simply to be able to track defects.

The users of data will have to adapt to the idea of versions, like the users of software have. The maintainers of the dataset might release five different versions of it while you're writing your analysis code, so it can't be a painful process to incorporate the revised data into your project. With software we have shared libraries and dynamic libraries, supported by autotools and such packages. Our code has interfaces and a branch that promises backwards compatibility. What would that look like for data? And what is the data version of the dependency hell that software developers know all too well (M 1.5 depends on N 1.7 and P 2.0, but P 2.0 requires N 2.0, and upgrading N to 2.0 breaks M which expects the 1.x set of interfaces from N ...)?

And, of course, there's documentation. As with software, I imagine we'll see some docs structured and some unstructured. The state of the art isn't great for government datasets, it has to be said: if you're lucky you get a "code X means ABCD" but rarely are you told exactly how the data were generated, the limits on their accuracy, situations where they shouldn't be used, etc.

Finally, we need to change attitudes and social systems. Data is produced as the product of work done, and is rarely conceived of as having a life outside the original work that produced it. Some datasets will (some won't--think of how many projects fail to interest anyone but the person who started them).
This means thinking of yourself not just as the person who does the work, but the person who leads a project of interested outsiders and (in some cases) collaborators and who is building something that will last beyond their time. This is not a natural mindset within government nor, in many cases, science. Funding and budgeting systems at the moment may prevent this, and would need to change.

The good news is that while government datasets are rarely generated collaboratively, science is a little further along. PubMed and GenBank are just two examples of great science collaborations that we can learn from, and I'm sure there are more. Beyond science, OpenStreetMap is an important example of collaborative data gathering and the Open Knowledge Foundation folks may have work in this area already.

I'm keen to learn more about the open data projects that are more than just data-over-the-wall and share what I find. Time to stop kicking myself and start learning!
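As a footnote to the dependency-hell scenario mentioned earlier (M 1.5 depends on N 1.7 and P 2.0, but P 2.0 requires N 2.0), here is a rough sketch of how such a conflict could be detected if datasets carried version numbers and declared what they were built against. The `REQUIRES` catalogue, the function names, and the compatibility rule (same major version, minor at least the required one) are all assumptions for illustration, not an existing tool or standard for data.

```python
from typing import Dict, List, Tuple

Version = Tuple[int, int]  # (major, minor)

# Hypothetical catalogue: (name, version) -> {dependency: minimum version}
REQUIRES: Dict[Tuple[str, Version], Dict[str, Version]] = {
    ("M", (1, 5)): {"N": (1, 7), "P": (2, 0)},
    ("P", (2, 0)): {"N": (2, 0)},
    ("N", (1, 7)): {},
    ("N", (2, 0)): {},
}


def fmt(version: Version) -> str:
    """Render (1, 5) as '1.5'."""
    return f"{version[0]}.{version[1]}"


def compatible(installed: Version, required: Version) -> bool:
    """Assumed rule: same major version, and at least the required minor."""
    return installed[0] == required[0] and installed[1] >= required[1]


def conflicts(installed: Dict[str, Version]) -> List[str]:
    """List every unmet requirement among the installed releases."""
    problems = []
    for name, version in sorted(installed.items()):
        for dep, required in sorted(REQUIRES[(name, version)].items()):
            have = installed.get(dep)
            if have is None or not compatible(have, required):
                found = "nothing" if have is None else fmt(have)
                problems.append(
                    f"{name} {fmt(version)} needs {dep} {fmt(required)}, "
                    f"found {found}"
                )
    return problems


# Keeping N at 1.7 breaks P; upgrading N to 2.0 breaks M instead.
with_old_n = conflicts({"M": (1, 5), "N": (1, 7), "P": (2, 0)})
with_new_n = conflicts({"M": (1, 5), "N": (2, 0), "P": (2, 0)})
```

Neither choice of N satisfies everyone, which is exactly the bind described: no single copy of N works for both M and P until one of them is re-released.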
Nat Torkington @gnat 2010-03-08
I just got interesting email from Amazon: "the Colorado government recently enacted a law to impose sales tax regulations on online retailers [...] We and many others strongly opposed this legislation, known as HB 10-1193, but it was enacted anyway. Regrettably, as a result of the new law, we have decided to stop advertising through Associates based in Colorado. We plan to continue to sell to Colorado residents, however, and will advertise through other channels, including through Associates based in other states." The message goes on to say that they'll pay out all the money they owe me but I won't earn any more money for referring people to them. Interesting!

So let me get this straight: I've done nothing, and Amazon just fired me? Now, I haven't used referrals a whole lot so it doesn't hit me in the pocketbook but this should send chills down the spine of anyone who thought they were building a business, or at least an income, around Amazon services. It's one thing to be fired for something you did (hey doofus, don't cause a heap of MPAA infringement notices to land on Amazon's desk because you were running the new Pirate Bay on EC2) but it's entirely another to be fired for something outside your control.

A farmer friend told me that the goats to keep are female goats: when one doe headbutts another, the recipient then turns to the next in the hierarchy and headbutts them. With male goats, though, you get prolonged headbutt battles that are loud, intimidating, and potentially damaging. Amazon is obviously hoping the female goat scenario plays out: Amazon headbutts me, so I'll go headbutt my representative--punish Amazon's associates and hope they'll pass the pain on. I wonder whether any of Amazon's (former) Colorado associates will turn out to be male goats who, grumpy at being set upon, retaliate....

The full text of the letter follows, and TechFlash covered the new law.
|
Nat Torkington @gnat 2009-12-31
|
|
Nat Torkington @gnat 2009-12-30
|
Nat Torkington @gnat 2009-12-29
|
Nat Torkington @gnat 2009-12-28
|
|
Nat Torkington @gnat 2009-12-25
|
Nat Torkington @gnat 2009-12-24
|
Nat Torkington @gnat 2009-12-23
|
Nat Torkington @gnat 2009-12-22
|
Nat Torkington @gnat 2009-12-21
|
Nat Torkington @gnat 2009-12-18
|