Inside Innovation at Xerox PARC

Ben Lorica 2008/04/29

We were part of a group of journalists and bloggers invited to hear presentations from 10 research groups across Xerox, PARC, and Fuji-Xerox. The format was similar to a science fair or a poster session at an academic conference, with small groups moving around to hear presentations from the different projects. While other research labs use a large auditorium and parade different researchers in, I thought the smaller, science fair format made for better interactions between the visitors and the researchers.

We saw early prototypes created by the researchers themselves, so the user interfaces were far from polished. Here are some of the highlights from our visit:

Seamless Document Viewer
A J2ME application designed to help solve the problem of viewing documents on small screens (cell phones and other mobile devices), this app automatically segments a document into blocks and displays the keyphrase for each block. The keyphrases are intended to help users navigate quickly to sections of interest. The cell phone demo we saw used a fairly intuitive touchscreen interface that included an interesting way to pan and zoom in and out of sections of a document. Because documents viewed through the application need to be processed and analyzed in advance, it is better suited to PDFs and other static documents than to frequently updated web pages.
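To make the "segment into blocks, then show a keyphrase per block" idea concrete, here is a minimal Python sketch. It is not the method used in the demo, which was not described in detail: it assumes blocks are separated by blank lines and uses the most frequent non-stopword as a stand-in keyphrase. The function names and the stopword list are hypothetical.

```python
import re
from collections import Counter

# Tiny illustrative stopword list (an assumption, not from the demo).
STOPWORDS = {"the", "a", "an", "of", "and", "to", "in", "is", "for", "on", "with", "are", "by"}


def segment(document: str) -> list[str]:
    """Split a document into blocks at blank lines (a naive segmentation rule)."""
    return [block.strip() for block in re.split(r"\n\s*\n", document) if block.strip()]


def keyphrase(block: str) -> str:
    """Return the most frequent non-stopword in the block as a stand-in keyphrase."""
    words = [w for w in re.findall(r"[a-z]+", block.lower()) if w not in STOPWORDS]
    if not words:
        return ""
    return Counter(words).most_common(1)[0][0]


def outline(document: str) -> list[tuple[str, str]]:
    """Pair each block with its keyphrase, in document order."""
    return [(keyphrase(block), block) for block in segment(document)]
```

A viewer on a small screen could show only the keyphrases from `outline(...)` first and expand a block when the user selects it. Because the outline is computed over the whole document up front, the sketch also shows why pre-processing suits static documents better than pages that change often.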

Hybrid Categorization
Categorizing documents automatically is an old topic in information science. Most tools rely only on the text portion of documents and use a combination of Natural Language Processing and Machine Learning. I was looking forward to this presentation because we use text-only automatic classifiers to help organize some of our data sources.

Hybrid categorization uses both the text and the images contained in documents. It isn't clear how scalable their hybrid categorizer is, since the results we saw were based on small numbers of documents. Precision is one measure of a categorizer's accuracy, and judging from the results of an academic competition, Xerox's hybrid (text + images) approach may hold some promise.
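For readers unfamiliar with the metric, precision for a category is the fraction of documents assigned to that category that actually belong to it. The Python sketch below shows the calculation on made-up labels; the function name and the sample data are illustrative assumptions, not the competition results.

```python
def precision(predicted: list[str], actual: list[str], category: str) -> float:
    """Fraction of documents labeled `category` by the classifier that truly belong to it."""
    assigned = [i for i, label in enumerate(predicted) if label == category]
    if not assigned:
        return 0.0  # convention chosen here: no assignments means precision 0
    correct = sum(1 for i in assigned if actual[i] == category)
    return correct / len(assigned)


# Hypothetical classifier output versus hand-assigned labels for five documents.
predicted = ["invoice", "invoice", "memo", "invoice", "memo"]
actual = ["invoice", "memo", "memo", "invoice", "invoice"]

invoice_precision = precision(predicted, actual, "invoice")  # 2 correct out of 3 assigned
memo_precision = precision(predicted, actual, "memo")        # 1 correct out of 2 assigned
```

Precision says nothing about documents the categorizer missed, so a score measured on a small document set should be read with the scalability caveat in mind.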
Erasable Paper
"Reusable paper" refers to paper coated with special materials, used with a custom printer that shoots UV light onto it. The printed document is designed to fade within 24 hours, and the paper can be fed back into the printer and reused multiple (10+) times. The printer can also erase the printing on the specially coated paper and print an entirely new document on the same sheets. We raised the possibility that a sheet of paper that has nominally erased itself could be reverse engineered to reveal sensitive content: think security agencies or dumpster-diving identity thieves. Surprisingly, the researchers had not seriously investigated the possibility of "recovering erased documents".

The cost of the specially coated paper is projected to be only 2-3 times the cost of normal paper, while the accompanying printer will cost about the same as a laser printer. Since the paper can be reused multiple (10+) times, the obvious environmental benefits also lead to savings. Further savings come from the design of the printer itself: since the printing is done with light (a UV LED bar), the printer does not use ink or toner.
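The projected figures allow a rough back-of-the-envelope check. The Python sketch below assumes a sheet's cost is spread evenly over its uses and ignores the printer, electricity, and any sheets that wear out early; the function name is hypothetical, and the inputs are the projections quoted above.

```python
def relative_cost_per_print(price_multiplier: float, reuses: int) -> float:
    """Paper cost per print, as a multiple of the cost of one sheet of normal paper."""
    if reuses < 1:
        raise ValueError("reuses must be at least 1")
    return price_multiplier / reuses


# Worst projected price (3x normal paper), reused 10 times.
worst_case = relative_cost_per_print(3, 10)   # 0.3 of a normal sheet per print
# Best projected price (2x normal paper), reused 10 times.
best_case = relative_cost_per_print(2, 10)    # 0.2 of a normal sheet per print
```

Under these assumptions the paper breaks even once a sheet is used as many times as its price multiplier, that is, after 2-3 uses.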

Intelligent Redaction
Redaction is the process of removing sensitive information from documents. Popular examples include government/intelligence documents released to the public and medical records. Text redaction is normally a tedious manual process that requires staff with significant domain expertise. As an example, privacy rules governing medical records in the U.S. require redaction of terms associated with HIV/AIDS, mental health, and drug/alcohol problems. In the demo we saw, the software tool examined a corpus of documents, automatically came up with terms/phrases associated with the listed illnesses, and redacted them from every document in the corpus.
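The last step of that pipeline, masking known terms in text, can be sketched in a few lines of Python. This covers only the replacement step: the demo derived its term list automatically, whereas the list here is hand-written and hypothetical, as are the function name and the mask string.

```python
import re


def redact(text: str, terms: list[str], mask: str = "[REDACTED]") -> str:
    """Replace every whole-word occurrence of any term with the mask, ignoring case."""
    if not terms:
        return text
    # Try longer phrases first so "alcohol dependence" is masked as one unit.
    ordered = sorted(terms, key=len, reverse=True)
    pattern = re.compile(
        r"\b(?:" + "|".join(re.escape(term) for term in ordered) + r")\b",
        re.IGNORECASE,
    )
    return pattern.sub(mask, text)


# Hypothetical term list and record text, for illustration only.
sensitive_terms = ["HIV", "alcohol dependence", "alcohol"]
record = "Patient reports alcohol dependence and was tested for HIV."
redacted = redact(record, sensitive_terms)
```

Exact-match masking like this misses paraphrases and misspellings, which is part of why building the term list normally takes domain expertise.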

Other Notables

  • Clean technology: solar concentrators and membrane-less water filtration

  • "Environmentally-friendly" plastic: plastic with more than 30% of its weight made from biomass

  • Cancer detection tools: rare cell detection

