Doug Cutting interview

Source: Internet
Author: User

While reading about Hadoop, I came across Doug Cutting. The Hadoop team was established quite early; of course, the premise of that team was Doug Cutting joining Yahoo. That is how I learned that he is the founder of the famous full-text search library Lucene and of the open source web search engine Nutch.

The following is the original text of the Doug Cutting interview.

Doug Cutting is the primary developer of the Lucene and Nutch open source search projects. He has worked in the search technology field for nearly two decades, including five years at Xerox PARC, three years at Apple, and four years at Excite.

 

What do you do for a living, and how are you involved in search engine development?

I work from home on Lucene and Nutch, two open source search projects. I get paid through various contracts related to these projects. I have an ongoing relationship with Yahoo! Labs that funds me to work part-time on Nutch. I take other short-term contract work related to both projects.

Could you briefly tell us about Nutch, and where you are trying to take it?

First, let me say what Lucene is, to provide context. Lucene is a software library for full-text search. It's not an application, but rather technology that can be incorporated into applications. It's an Apache project and is widely used. A small subset of the folks using Lucene are listed at wiki.apache.org/Jakarta-Lucene/poweredby.

Nutch builds on Lucene to implement web search. Nutch is an application: you can download it and run it. It adds a crawler and other web-specific stuff to Lucene. Nutch aims to scale from simple intranet searching to search of the entire web, like Google and Yahoo!. To rival these guys you need a lot of tricks. We've demoed it on over 100M pages, and it's designed to scale to over 1B pages. But it also works well on a single machine, searching just a few servers.
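
For context, here is a minimal sketch of what embedding Lucene in an application looks like: index one document, then search it. It assumes a Lucene 8/9-era Java API (class names like ByteBuffersDirectory postdate the Lucene version contemporary with this interview), but the overall shape of the API has been similar for a long time.

    import org.apache.lucene.analysis.standard.StandardAnalyzer;
    import org.apache.lucene.document.Document;
    import org.apache.lucene.document.Field;
    import org.apache.lucene.document.TextField;
    import org.apache.lucene.index.DirectoryReader;
    import org.apache.lucene.index.IndexWriter;
    import org.apache.lucene.index.IndexWriterConfig;
    import org.apache.lucene.queryparser.classic.QueryParser;
    import org.apache.lucene.search.IndexSearcher;
    import org.apache.lucene.search.ScoreDoc;
    import org.apache.lucene.search.TopDocs;
    import org.apache.lucene.store.ByteBuffersDirectory;
    import org.apache.lucene.store.Directory;

    public class LuceneSketch {
        public static void main(String[] args) throws Exception {
            Directory dir = new ByteBuffersDirectory();  // in-memory index for the demo
            StandardAnalyzer analyzer = new StandardAnalyzer();

            // Index one document with a stored full-text field.
            try (IndexWriter writer = new IndexWriter(dir, new IndexWriterConfig(analyzer))) {
                Document doc = new Document();
                doc.add(new TextField("body",
                        "Lucene is a software library for full-text search",
                        Field.Store.YES));
                writer.addDocument(doc);
            }

            // Parse a query and print the matching documents.
            try (DirectoryReader reader = DirectoryReader.open(dir)) {
                IndexSearcher searcher = new IndexSearcher(reader);
                TopDocs hits = searcher.search(
                        new QueryParser("body", analyzer).parse("library"), 10);
                for (ScoreDoc hit : hits.scoreDocs) {
                    System.out.println(searcher.doc(hit.doc).get("body"));
                }
            }
        }
    }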

From your perspective, what are the core principles of search engine architecture? What are the main things to consider, and what are the big modules that search engine software can be broken up into?

Let's see, the major bits are:

  • Fetching: downloading lists of pages that have been referenced.

  • Database: keeping track of what pages you've fetched, when you fetched them, what they've linked to, etc.

  • Link analysis: analyzing the database to assign a priori scores to pages (e.g., PageRank and WebRank) and to prioritize fetching. The value of this is somewhat overrated. Indexing anchor text is probably more important (that's what makes, e.g., Google bombing so effective).

  • Indexing: combines content from the fetcher, incoming links from the database, and link analysis scores into a data structure that's quickly searchable.

  • Searching: ranks pages against a query using an index.

To scale to billions of pages, all of these must be distributable, i.e., each must be able to run in parallel on multiple machines.
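
As a sketch of how those modules fit together, the Java interfaces below restate the list above in code. These names are illustrative only, not Nutch's actual classes.

    import java.util.List;

    // Illustrative interfaces for the modules described above; not Nutch's APIs.
    interface Fetcher {
        byte[] fetch(String url);                    // download a referenced page
    }

    interface CrawlDb {
        void recordFetch(String url, long when, List<String> outlinks);
        List<String> nextUrlsToFetch(int n);         // the fetch queue, prioritized
    }

    interface LinkAnalyzer {
        double aprioriScore(String url);             // e.g., a PageRank-style score
    }

    interface Indexer {
        void index(String url, String content, List<String> anchorTexts, double score);
    }

    interface Searcher {
        List<String> search(String query, int topN); // rank indexed pages against a query
    }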

You are saying people can download Nutch to run on their own machines. Is there a possibility for small-time webmasters who don't have full control over their Apache servers to make use of Nutch?

Unfortunately, most of them probably won't. Nutch requires a Java servlet container, which some ISPs support, but most do not.

Can I combine Lucene and the Google Web API, or Lucene and some other application I wrote?

A couple of folks have contributed Google-like APIs for Nutch, but none has yet made it into the system. We should have something like this soon, however.

What do you think is the biggest hurdle to overcome when implementing a search engine: is it the hardware and storage barrier, or the ranking algorithms? Also, how much space do you need to ensure the search engine makes some sense, say, for an engine restricted to searching a million RSS feeds?

Nutch requires around 10KB of total storage per web page. RSS feeds tend to point to small pages, so you'd probably do better than that. Nutch doesn't yet have specific support for RSS.
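
As a rough back-of-the-envelope check on the question's scenario, a million feed pages at ~10KB each comes to roughly 10GB; a sketch:

    public class IndexSizeEstimate {
        public static void main(String[] args) {
            long pages = 1_000_000L;         // one small page per RSS feed
            long bytesPerPage = 10 * 1024L;  // ~10KB per page, per the answer above
            double totalGb = pages * bytesPerPage / 1e9;
            System.out.printf("~%.1f GB for %d pages%n", totalGb, pages);  // ~10.5 GB
        }
    }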

Is it easy to get funded by Yahoo! Labs? Who can apply, and what do you need to give back in return?

I was invited; I didn't apply. I have no idea what the process is.

Did Google Inc. ever show interest in Nutch?

I've talked with folks there, including Larry Page. They'd like to help, but they can't see a way to do that without also helping their competitors.

In Nutch, do you implement your own PageRank or WebRank system? What considerations go into ranking?

Yes, Nutch has a link analysis module. Use of it is optional. For intranet search we find that it's really not needed.

I guess you've heard it before, but doesn't an open-source search engine open itself up to blackhat search engine optimization?

Potentially.

Let's say it takes spammers six weeks to reverse-engineer a closed-source search engine's latest spam-detection algorithm. With an open source engine, this can be done much faster. But in either case, the spammers will eventually figure out how it works; the only difference is how quickly. So the best anti-spam techniques, open or closed source, are those that continue to work even when their mechanism is known.

Also, if you, e.g., remove detected spammers from the index for six months, then there's not much they can do, once detected, to change their sites to elude detection. And if your spam detectors are based on statistical analyses of good and bad example sites, then you can, overnight, notice new patterns and remove the spammers before they have a chance to respond.

So open source can make it a little harder to stop spam, but it doesn't make it impossible. And closed-source search engines have not been able to use secrecy to solve this problem. I think the closed-source advantage here is not as great as folks imagine it to be.

How does Nutch relate to the distributed web crawler Grub, and what do you think of it?

As far as I can tell, Grub is a project that lets folks donate their hardware and bandwidth to LookSmart's crawling effort. Only the client is open source, not the server, so folks can neither deploy their own version of Grub, nor can they access the data that Grub gathers.

What about distributed crawling more generally? When a search engine gets big, crawl-related expenses are dwarfed by search-related expenses. So a distributed crawler doesn't significantly improve costs; rather, it makes more complicated something that is already relatively inexpensive. That's not a good tradeoff.

Widely distributed search is interesting, but I'm not sure it can yet be done while keeping things as fast as they need to be. A faster search engine is a better search engine. When folks can quickly revise queries, they more frequently find what they're looking for before they get impatient. But building a widely distributed search system that can search billions of pages in a fraction of a second is difficult, since network latencies are high. Most of the half-second or so that Google takes to perform a search is network latency within a single datacenter. If you were to spread that same system over a bunch of PCs in people's houses, even connected by DSL and cable modems, latencies would be much higher, and searches would probably take several seconds or longer. Hence it wouldn't be as good of a search engine.

You are emphasizing the importance of speed in a search engine. I'm often puzzled by how fast Google returns a result. Do you have an idea how they do it, and what is your experience with Nutch?

I believe Google does roughly what Nutch does: they broadcast queries to a number of nodes, each of which returns the top results over its set of pages. With a couple of million pages per node, disk accesses can be avoided for most queries, and each node can process tens to hundreds of queries per second. If you want to search billions of pages, then you have to broadcast each query to thousands of nodes. That's a lot of network traffic.

Some of this is described in www.computer.org/micro/mi2003/m2022.pdf
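
A toy version of that broadcast-and-merge scheme, in Java: the frontend fans a query out to every node in parallel, gathers each node's local top hits, and re-sorts them into a global top N. The interface and names here are assumptions for illustration, not Google's or Nutch's actual code.

    import java.util.ArrayList;
    import java.util.Comparator;
    import java.util.List;
    import java.util.concurrent.ExecutionException;
    import java.util.concurrent.ExecutorService;
    import java.util.concurrent.Executors;
    import java.util.concurrent.Future;

    // Toy broadcast-and-merge search frontend; illustrative names only.
    public class BroadcastSearch {

        record Hit(String url, float score) {}

        // Each node searches only its own slice of the index.
        interface SearchNode {
            List<Hit> topHits(String query, int n);
        }

        static List<Hit> search(List<SearchNode> nodes, String query, int n)
                throws InterruptedException, ExecutionException {
            ExecutorService pool = Executors.newFixedThreadPool(nodes.size());
            try {
                // Broadcast: send the query to every node in parallel.
                List<Future<List<Hit>>> futures = new ArrayList<>();
                for (SearchNode node : nodes) {
                    futures.add(pool.submit(() -> node.topHits(query, n)));
                }
                // Gather: collect each node's local top hits.
                List<Hit> all = new ArrayList<>();
                for (Future<List<Hit>> f : futures) {
                    all.addAll(f.get());
                }
                // Merge: re-rank globally and keep the overall top n.
                all.sort(Comparator.comparingDouble(Hit::score).reversed());
                return all.subList(0, Math.min(n, all.size()));
            } finally {
                pool.shutdown();
            }
        }
    }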

When you mention spam, do you have any spam-fighting algorithms in Nutch? How can one differentiate between spam patterns like link farms and sites which just happen to be very popular?

We haven't yet had time to start working on this, but it's obviously an important area. Before we get to link farms we need to do the simple stuff: look for word stuffing, white-on-white text, etc.
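
To make the "simple stuff" concrete, here are two toy heuristics of that kind. They are hypothetical illustrations, not Nutch code; a real detector would, e.g., need to parse HTML and resolve CSS to catch white-on-white text reliably.

    import java.util.HashMap;
    import java.util.Map;

    // Toy spam heuristics: illustrative only, not Nutch code.
    public class SimpleSpamChecks {

        // Word stuffing: flag a page if any single term dominates its text.
        static boolean looksStuffed(String text, double maxShare) {
            String[] words = text.toLowerCase().split("\\W+");
            Map<String, Integer> counts = new HashMap<>();
            int max = 0;
            for (String w : words) {
                if (w.isEmpty()) continue;
                int c = counts.merge(w, 1, Integer::sum);
                if (c > max) max = c;
            }
            return words.length > 0 && (double) max / words.length > maxShare;
        }

        // White-on-white: flag markup that sets the text color to an (assumed
        // white) background color.
        static boolean hasWhiteOnWhite(String html) {
            String h = html.toLowerCase();
            return h.contains("color=\"#ffffff\"") || h.contains("color:#fff");
        }
    }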

I think the key to search quality in general (of which spam detection is a sub-problem) is to have a trusted supply of hand-evaluated search results. With this, one can train a ranking algorithm to generate better results. (Spammy results are just one kind of bad results.) Commercial search engines get trusted evaluations by hiring folks. It remains to be seen how Nutch will do this. We obviously cannot just accept all donated evaluations, or else spammers will spam the evaluations. So we need a means of establishing the trustworthiness of volunteer evaluators. I think a peer-review system, perhaps something like Slashdot's karma system, could work here.

Where do you see search engines heading in the near and far future, and what do you think are the biggest hurdles to overcome from a developer's perspective?

Sorry, I'm not very imaginative here. My prediction is that web search in the coming decade is going to look more or less like web search of today. It's a safe bet. Web search evolved quickly for the first few years. It started in 1994 with WebCrawler, using standard information retrieval methods. The development of more web-specific methods took a few years, culminating in Google's 1998 launch. Since then, the introduction of new methods has slowed dramatically. The low-hanging fruit has been harvested. Innovation is easy when an area is young, and becomes more difficult as the field matures. Web search grew up in the 1990s, is now a cash cow, and will soon be a commodity.

As far as development challenges, I think operational reliability is a big one. We're working on developing something like GFS, the Google Filesystem. Stuff like this is essential to large-scale web search: you cannot let a failure of any single component cause a major hiccough; you must be able to easily scale by throwing more hardware into the pool, without massive reconfiguration; and you can't require an army of operators; things should largely fix themselves.
