Jesse Robbins

Jesse Robbins is passionate about Infrastructure, Emergency Management, and technology that helps people be Safe, Happy, and Free.

He currently serves as the Director of Infrastructure at Etelos and as co-chair of the Velocity Performance & Operations Conference, and is part of O’Reilly Radar. He previously worked at Amazon.com, where his title was “Master of Disaster” and he was responsible for Website Availability.

Jesse is a volunteer Firefighter/EMT & Emergency Manager, and led a task force deployed in Operation Hurricane Katrina.


Velocity 2009: Themes, ideas, and call for participation...

2008-11-21

Last year's Velocity conference was an incredible success. We expected around 400 people and ended up maxing out the facility with over 600. This year we're moving the conference to a bigger space and extending it to 3 days to accommodate workshops and longer sessions. Velocity 2009 will be held June 22-24, 2009 at the Fairmont Hotel in San Jose, CA.

This year's conference will be especially important. I've said many times that Web Performance and Operations is critical to the success of every company that depends on the web. In the current economic situation, it's becoming a matter of survival. The competitive advantage comes from the ability to do two things:
  1. Generate more revenue with fewer resources
  2. Respond quickly to change
Our Velocity 2009 mantra is "Fast, Scalable, Efficient, Available", a slight change from last year. (We've replaced "Resilient" with "Efficient" to make the focus clear.)

I'm excited to announce that joining Steve Souders and me on this year's program committee are John Allspaw, Artur Bergman, Scott Ruthfield, Eric Schurman, and Mandi Walls. We've already started working on the program, and have just opened the Call for Participation.

We're especially interested in the following topics:

  • How to tie web performance and operations to the bottom line
  • Real-world incident management - getting “tight like a pit crew”
  • Making websites as fast and reliable as desktop apps
  • Networking, DNS, and load balancing
  • Profiling everywhere: JavaScript, CSS, and the network
  • Managing web services - flaming disasters you survived and lessons learned
  • The intersection between performance and design
  • Wicked cool (and actionable) metrics
  • Ads, ads, ads - the performance killer?
  • Troubleshooting in production
  • How to scale and be fast on the social web
  • Capacity planning and load testing
  • Establishing performance and operations best practices within your organization
  • Configuration management best (and worst) tools and practices
  • Monitoring and instrumentation experiences: Open Source, as a service, commercially supported solutions
  • Using multiple CDNs to improve customer experience and reduce cost
The submission deadline is January 5th, so get your talks in.  If you have any questions or suggestions for the committee, send them to velocity-idea@oreilly.com.

Major milestone for ProgrammableWeb & "The Web as Platform"

2008-11-03

Last week marked an important milestone for the "Web as Platform" as the 1,000th API was added to the ProgrammableWeb registry. John Musser (see: Web2.0 Report) started tracking the first few web service APIs back in 2005.

"How do these 1000 APIs break down by type? The following chart, derived from our database, shows the top 15 sectors or markets with the greatest number of competing API providers. As you can see there are already 71 mapping-related APIs alone."

[Chart: top 15 sectors by number of competing API providers]

Congratulations!

DisasterTech: "Decisions for Heroes"

2008-11-01

One of the most interesting DisasterTech projects I've been following is "Decisions for Heroes" led by developer and Irish Coast Guard volunteer Robin Blandford.

Decisions is like Basecamp for volunteer Search & Rescue teams. The focus is on providing "just enough" process to complement the real-world workflow of a rescue team, without unnecessary complexity. One of Robin's design goals is that:

User requirements are nil. Nobody likes reading manuals - if we have to write one, we've gotten too complicated.

This is the winning approach for building systems that "serve those that serve others", and is echoed by InSTEDD's design philosophy and the Sahana disaster management system.

Teams begin by entering their responses to incidents and training exercises. They then tag them with things like the weather conditions, the tools and skills required, and who from the team was deployed.

As a team's incident database grows, this information can be used to show heatmaps and provide powerful insight into the locations, weather conditions, and times of year at which various incidents occur. Over time this kind of data could be analyzed in aggregate across multiple teams and regions to create an incredibly powerful resource for Emergency Managers. This is very similar to what Wesabe does for consumers with financial transaction data today (disclosure: OATV investment).


Rescue team members enter training dates and levels. The system tracks certification expiration dates and prompts team members & leaders to plan classes and remain current. This is a huge issue for volunteers who have to manage professional-level training requirements with the demands of a regular career.

As more incidents are entered into the system, it compares the skills required for each of the rescues with the team training exercises. This allows teams to identify areas to focus, train, and develop new skills.
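The comparison described above can be sketched in a few lines. This is only an illustration of the idea, not how Decisions for Heroes is actually built: the function name `skill_gaps`, the skill tags, and the sample records are all made up for the example.

```python
from collections import Counter

def skill_gaps(incidents, exercises):
    """Rank skills that incidents required more often than the team trained them.

    Both arguments are lists of sets of skill tags (a hypothetical record
    layout). Returns (skill, shortfall) pairs, largest shortfall first.
    """
    needed = Counter(skill for tags in incidents for skill in tags)
    trained = Counter(skill for tags in exercises for skill in tags)
    gaps = ((skill, needed[skill] - trained[skill]) for skill in needed)
    # Keep only real shortfalls; break ties alphabetically for stable output.
    return sorted((g for g in gaps if g[1] > 0), key=lambda g: (-g[1], g[0]))

# Illustrative data only.
incidents = [
    {"cliff rescue", "first aid"},
    {"cliff rescue", "boat handling"},
    {"cliff rescue"},
]
exercises = [{"first aid"}, {"boat handling"}, {"first aid"}]

print(skill_gaps(incidents, exercises))  # [('cliff rescue', 3)]
```

Counting tags is deliberately crude. A real system would probably weight recent incidents more heavily and account for who on the team holds each skill.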


This is an innovative project with tremendous potential, and hopefully an early signal of coming changes in Emergency Management.

(Note: "How to Serve those that Serve Others" will be the theme of my "High Order Bit" session at the Web2.0 Summit. I'll be sure to post video/slides/notes when they are available.)

Sprint blocking Cogent network traffic...

2008-10-31

It appears that Sprint has stopped routing traffic from Cogent (called "depeering") as a result of some sort of legal dispute. Sprint customers cannot reach Cogent customers, and vice versa. The effect is similar to what would happen if Sprint were to block voice phone calls to AT&T customers.

Here's a graph that shows the outage, courtesy of Keynote:
[Graph: Sprint-Cogent routing problems, from Keynote]

Rich Miller at DataCenterKnowledge has a great summary of the issues behind the incident, which has happened with Cogent before. Rich says:

At the heart of it, peering disputes are really loud business negotiations, and angry customers can be used as leverage by either side. This one will end as they always do, with one side agreeing to pay up or manage their traffic differently.

I think this is particularly Radar-worthy because it provides an example of the complex issues around Net Neutrality. In this case customers are harmed and most (especially Sprint wireless customers) will have no immediate recourse.

Todd Underwood of Renesys has posted an incredibly detailed explanation of the scope and impact of this issue. Here is a summary:

Another way to look at the scope of this event is to identify the number, size and ownership of the network prefixes affected by the outage. [...] So, in total, at least 3500 networks on the Internet have less than full connectivity right now. [...]

One might suspect that these single-homed autonomous systems are simply incautious or insignificant networks. After all, given the history of Internet partitions, who would be rash enough to have important network services located on a single-homed prefix in this day and age?

The following prefixes are some of the more interesting networks single-homed behind Sprint:

  • 208.95.96.0/21 Expedia, Inc.
  • 164.62.0.0/16 Federal Trade Commission
  • 204.108.8.0/24 Federal Aviation Administration
  • 198.9.201.0/24 National Aeronautics and Space Administration
  • 170.189.200.0/24 Occidental Petroleum Corporation
  • 148.168.0.0/16 Pfizer Inc.
  • 128.6.0.0/16 Rutgers University
  • 173.100.0.0/16 Sprint PCS (lots of networks here, of course)
  • 149.24.174.0/23 SUNGARD HIGHER EDUCATION INC.

And that is just a few.

The following prefixes are some of the more interesting networks single-homed behind Cogent:

  • 89.251.2.0/24 Joost Production Benelux Network
  • 72.5.224.0/24 Loopt, Inc.
  • 198.185.178.0/23 National Aeronautics and Space Administration (and many more)
  • 204.201.48.0/21 NTT America, Inc. (and many more like it, from the T1 and hosting customers acquired from NTT/Verio)
  • 204.9.56.0/24 Skynet Access (this might actually be good news, if the loss of connectivity to Skynet prevents or delays sentience).
  • 142.155.0.0/16 St. Lawrence College
  • 128.100.0.0/16 University of Toronto (and a bunch of other colleges and universities)
[...] The point here is that this is a big deal. There are lots of significant organizations that appear to have lost connectivity due to this dispute.
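To get a feel for what the prefixes in the lists above cover, here is a small sketch using Python's standard `ipaddress` module. The prefixes are copied from the quoted lists (only a subset). The helper names `address_count` and `find_prefix` and the sample host addresses are my own, chosen just for illustration.

```python
import ipaddress

# A subset of the single-homed prefixes quoted above.
SINGLE_HOMED = {
    "208.95.96.0/21": "Expedia, Inc.",
    "164.62.0.0/16": "Federal Trade Commission",
    "128.6.0.0/16": "Rutgers University",
    "72.5.224.0/24": "Loopt, Inc.",
    "128.100.0.0/16": "University of Toronto",
}

def address_count(prefix: str) -> int:
    """Number of IPv4 addresses covered by a CIDR prefix."""
    return ipaddress.ip_network(prefix).num_addresses

def find_prefix(host: str):
    """Return the listed prefix containing `host`, or None if there is none."""
    address = ipaddress.ip_address(host)
    for prefix in SINGLE_HOMED:
        if address in ipaddress.ip_network(prefix):
            return prefix
    return None

for prefix, owner in SINGLE_HOMED.items():
    print(f"{prefix:<18} {address_count(prefix):>6} addresses  {owner}")

# The host addresses below are arbitrary examples, not known servers.
print(find_prefix("128.6.4.2"))   # 128.6.0.0/16
print(find_prefix("8.8.8.8"))     # None
```

A /16 covers 65,536 addresses and a /24 only 256, so the raw count of "networks" affected understates how uneven the impact is.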

These same kinds of issues will likely happen with cloud service providers as well. As we've already learned from the evolution of VoIP, you become what you disrupt.

Amazon's new EC2 SLA

2008-10-24

Amazon announced a new SLA for EC2, similar to the one for S3. This is a notable step for Amazon and cloud computing as a whole, as it establishes a new bar for utility computing services.

Amazon is committing to 99.95% availability for the EC2 service on a yearly basis, which corresponds to approximately four hours and twenty-three minutes of downtime per year. It's important to remember that an SLA is just a contract that provides a commitment to a certain level of performance and some form of compensation when a provider fails to meet it.
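The downtime figure is simple arithmetic: 0.05% of a 365-day year. A quick sketch, where the function name `downtime_budget` is my own and a non-leap year is assumed:

```python
from datetime import timedelta

def downtime_budget(availability_pct: float,
                    period: timedelta = timedelta(days=365)) -> timedelta:
    """Maximum downtime allowed in `period` at the given availability."""
    if not 0 <= availability_pct <= 100:
        raise ValueError("availability_pct must be between 0 and 100")
    return period * (1 - availability_pct / 100)

budget = downtime_budget(99.95)
total_minutes = round(budget.total_seconds() / 60)   # 262.8 rounds to 263
hours, minutes = divmod(total_minutes, 60)
print(f"{hours} h {minutes} min")  # 4 h 23 min
```

The same function shows how quickly the budget shrinks: each additional "nine" cuts the allowance by a factor of ten.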

Here's the summary of the EC2 SLA (emphasis added):
Service Commitment AWS will use commercially reasonable efforts to make Amazon EC2 available with an Annual Uptime Percentage (defined below) of at least 99.95% during the Service Year. In the event Amazon EC2 does not meet the Annual Uptime Percentage commitment, you will be eligible to receive a Service Credit as described below. [...]
  • “Annual Uptime Percentage” is calculated by subtracting from 100% the percentage of 5 minute periods during the Service Year in which Amazon EC2 was in the state of “Region Unavailable.” If you have been using Amazon EC2 for less than 365 days, your Service Year is still the preceding 365 days but any days prior to your use of the service will be deemed to have had 100% Region Availability [...]
  • “Unavailable” means that all of your running instances have no external connectivity during a five minute period and you are unable to launch replacement instances. [...]
To receive a Service Credit, you must submit a request by sending an e-mail message to aws-sla-request @ amazon.com. To be eligible, the credit request must [...] include your server request logs that document the errors and corroborate your claimed outage (any confidential or sensitive information in these logs should be removed or replaced with asterisks)

This new SLA does not appear to address the reliability of server instances individually or in aggregate. For example, if half of a customer's EC2 instances lose their connections or die every 6 minutes, EC2 would still be considered "available" even if it is essentially unusable.
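To make the gap concrete, here is a rough sketch of the calculation as I read the quoted definitions. The `Period` record and the function names are hypothetical, and this simplifies the SLA by treating one customer's view as the "Region Unavailable" state.

```python
from dataclasses import dataclass
from typing import List

@dataclass
class Period:
    """One five-minute window (a hypothetical record layout)."""
    instances_connected: List[bool]   # external connectivity per running instance
    can_launch_replacements: bool

def is_unavailable(period: Period) -> bool:
    # Per the quoted text: ALL running instances lack external connectivity
    # AND replacement instances cannot be launched.
    return (not any(period.instances_connected)
            and not period.can_launch_replacements)

def annual_uptime_percentage(periods: List[Period]) -> float:
    """100% minus the percentage of five-minute periods that were unavailable."""
    unavailable = sum(is_unavailable(p) for p in periods)
    return 100.0 * (1 - unavailable / len(periods))

# Half the instances are down in every window, alternating each time,
# yet no window counts as "Unavailable".
flapping = [Period([i % 2 == 0, i % 2 == 1], True) for i in range(12)]
print(annual_uptime_percentage(flapping))  # 100.0
```

Because both conditions must hold at once, a single reachable instance, or a working launch API, is enough to keep a window out of the downtime count.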

If the entire EC2 service is down for more than a cumulative four hours and twenty-three minutes in a year, customers must furnish proof of the outage to Amazon to be eligible for the 10% credit. This seems like an onerous process for very little compensation, and isn't in line with Amazon's famous "Relentless Customer Obsession". Amazon takes monitoring very seriously and should take the lead by tracking, reporting, and proactively compensating customers when it lets them down.

Incredible images of the Sun

2008-10-15

The Boston Globe has assembled a beautiful gallery of images of the Sun.
This LASCO C2 image, taken 8 January 2002, shows a widely spreading coronal mass ejection (CME) as it blasts more than a billion tons of matter out into space at millions of kilometers per hour. The C2 image was turned 90 degrees so that the blast seems to be pointing down. An EIT 304 Angstrom image from a different day was enlarged and superimposed on the C2 image so that it filled the occulting disk for effect (Courtesy of SOHO/LASCO consortium)

[link courtesy Barry Brumitt]

Apple's restrictions mean more jailbreaking & Android adoption

2008-09-24

When Apple announced the iPhone SDK last year I said:

[...] Jobs makes it clear that the platform won't be completely open. While he says that this is to balance the benefits of an open platform with user security protection, it's unclear where Apple will draw those lines. Will there be a Skype client? Third-party media apps?

It would have been better if Apple had announced [the details] when it released the iPhone. I'm hopeful that Apple will now embrace the existing iPhone developer community, and won't use “security” as a way to keep potential competitors off its platform.

Almost a year later, Apple is using its control of the App Store to block innovative developers from reaching their customers. The most recent example is the "Podcaster" iPhone app, which allows you to download and manage podcasts on the iPhone directly, without having to boot your computer to sync in iTunes.

According to the developer, Apple blocked this application from the App Store, saying:

Since Podcaster assists in the distribution of podcasts, it duplicates the functionality of the Podcast section of iTunes.

If you want to build a platform, you have to compete fairly with the developers on your platform (if you must compete at all). By restricting developers, Apple is stifling innovation and its own long-term growth. Frustrated customers and developers who "think different" are jailbreaking their iPhones and getting excited about Google's Android.

Remember: Successful platforms create more value than they capture.


Kaminsky DNS Patch Visualization

2008-08-07

Dan Kaminsky has posted the details of the widespread DNS vulnerability. Clarified Networks created this visualization of DNS patch deployment over the past month:

Red = Unpatched
Yellow = Patched, "but NAT is screwing things up"
Green = OK

The new internet traffic spikes

2008-06-28

Theo Schlossnagle, author of Scalable Internet Architectures, gave a great explanation of how internet traffic spikes are shifting:

Lately, I see more sudden eyeballs and what used to be an established trend seems to fall into a more chaotic pattern that is the aggregate of different spike signatures around a smooth curve. This graph is from two consecutive days where we have a beautiful comparison of a relatively uneventful day followed by long-exposure spike (nytimes.com) compounded by a short-exposure spike (digg.com):

The disturbing part is that this occurs even on larger sites now due to the sheer magnitude of eyeballs looking at today's already popular sites. Long story short, this makes planning a real bitch.

[...] What isn't entirely obvious in the above graphs? These spikes happen inside 60 seconds. The idea of provisioning more servers (virtual or not) is unrealistic. Even in a cloud computing system, getting new system images up and integrated in 60 seconds is pushing the envelope and that would assume a zero second response time. This means it is about time to adjust what our systems architecture should support. The old rule of 70% utilization accommodating an unexpected 40% increase in traffic is unraveling. At least eight times in the past month, we've experienced from 100% to 1000% sudden increases in traffic across many of our clients.
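The arithmetic behind the rule quoted above is worth spelling out. This sketch assumes load scales linearly with traffic and ignores queueing effects, which is my simplification rather than anything Theo claims; the function names are my own.

```python
def max_absorbable_increase(utilization: float) -> float:
    """Largest fractional traffic increase that still fits in total capacity."""
    if not 0 < utilization <= 1:
        raise ValueError("utilization must be in (0, 1]")
    return 1 / utilization - 1

def max_safe_utilization(increase: float) -> float:
    """Highest steady-state utilization that can absorb `increase`."""
    if increase < 0:
        raise ValueError("increase must be non-negative")
    return 1 / (1 + increase)

# At 70% utilization there is room for roughly a 43% jump,
# so a 40% spike lands at 0.70 * 1.40 = 98% of capacity.
print(f"{max_absorbable_increase(0.70):.0%}")  # 43%

# Spikes of 100% and 1000% need far more idle capacity.
print(f"{max_safe_utilization(1.0):.0%}")      # 50%
print(f"{max_safe_utilization(10.0):.0%}")     # 9%
```

Under that assumption, surviving a 1000% spike without adding capacity means running at about 9% utilization the rest of the time, which is why this is an architecture problem more than a provisioning one.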


Video of Rich Wolski's EUCALYPTUS talk at Velocity

2008-06-24

Rich Wolski gave a truly impressive talk at Velocity about an open-source software infrastructure for cloud computing called EUCALYPTUS. The API is compatible with Amazon's EC2 interface, and the underlying infrastructure is designed to support multiple client-side interfaces. EUCALYPTUS is implemented using commonly available Linux tools and basic Web-service technologies, making it easy to install and maintain. Watch and learn...

You can see more videos from Velocity on Blip.tv.

