Monday, December 6, 2010

The Anatomy of a Large-Scale Hypertextual Web Search Engine - the first publication on the GOOGLE search engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

       In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
       To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.
       Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
 Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

Architecture of Google File System


A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.
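The read path above can be sketched in a few lines. This is a minimal illustrative model, not the real GFS API: the class names, `CHUNK_SIZE` constant, and method signatures are assumptions made for clarity. It shows the key division of labor: the client computes a chunk index from the byte offset, contacts the master only for metadata, and then reads data directly from a chunkserver.

```python
# A minimal sketch of the GFS client read path described above.
# All names (Master, Chunkserver, CHUNK_SIZE, lookup) are illustrative,
# not the actual GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks

class Master:
    """Holds metadata only: file -> chunk handles, handle -> replica locations."""
    def __init__(self):
        self.file_chunks = {}      # filename -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver, ...]

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

class Chunkserver:
    """Stores chunk data as local files, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, filename, offset, length):
    # 1. Translate the byte offset into a chunk index (chunks are fixed-size).
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    # 2. Metadata operation: ask the master for the handle and replica list.
    handle, replicas = master.lookup(filename, chunk_index)
    # 3. Data operation: go directly to a chunkserver, never through the master.
    return replicas[0].read(handle, chunk_offset, length)

# Tiny usage example: one chunk stored on one chunkserver.
cs = Chunkserver()
cs.chunks["h1"] = b"hello, gfs"
m = Master()
m.file_chunks["/logs/a"] = ["h1"]
m.chunk_locations["h1"] = [cs]
print(client_read(m, "/logs/a", 0, 5))  # b'hello'
```

Keeping the master out of the data path is what lets a single master scale: it answers small metadata lookups (which clients also cache) while the bulk bytes flow between clients and chunkservers.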

[Figure: The GFS architecture]

Design Assumptions of Google File System:

  • The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
  • The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
  • The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
  • High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
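The concurrent-append assumption above is what motivates GFS's atomic record append: the file system, not the client, picks the offset, so hundreds of producers can append without coordinating among themselves. The sketch below illustrates that idea in a single process; the class name and method are hypothetical stand-ins, and a single lock plays the role of GFS's per-chunk lease machinery.

```python
import threading

class AppendOnlyFile:
    """Illustrative sketch of GFS-style record append: the system, not the
    client, chooses the offset, so many producers can append concurrently.
    A single lock here stands in for GFS's distributed lease mechanism."""
    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        with self._lock:
            offset = len(self._data)   # offset is chosen by the file system
            self._data.extend(record)
            return offset              # returned so the producer can find its record

# 100 producers appending concurrently, as in a producer-consumer queue.
f = AppendOnlyFile()
threads = [threading.Thread(target=f.record_append, args=(b"rec%d;" % i,))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every record lands exactly once, at some system-chosen offset,
# with no coordination between the producers themselves.
```

Contrast this with a traditional write at a client-chosen offset, where concurrent producers would have to serialize among themselves to avoid overwriting each other.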

Google File System - Abstract

A fantastic paper presented by Google.
Google designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, the design has been driven by observations of Google's application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led Google to reexamine traditional choices and explore radically different design points.
The file system has successfully met Google's storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by Google services, as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Overview of Amazon Web Services:

The infrastructure of the Amazon AWS environment encompasses a variety of infrastructure services:
Amazon EC2
Amazon Simple Storage Service (Amazon S3)
Amazon Simple Queue Service (Amazon SQS)
Amazon CloudFront
Amazon SimpleDB





Cloud Application Architectures:
1) Grid Computing
2) Transactional Computing

The main focus of this blog is how you write an application so that it can take advantage of the Cloud.





The Cloud

The cloud can be both software and infrastructure.
A cloud service is accessible either via a web browser or via a Web Services API.
You pay only for what you use.
SaaS (Software as a Service) is a term that refers to software delivered in the cloud.
SaaS is a web-based software model that makes the software available entirely through a browser.
Examples: Gmail and Salesforce.com are the classic examples.
For instance, Salesforce is an enterprise CRM application: you simply open the Salesforce (or Gmail) URL in a browser and directly access the CRM application (or your mail).
SaaS characteristics:
Availability via a web browser
On-demand availability
Payment terms based on usage
Minimal IT demands
Turning to hardware virtualization: through Amazon Web Services (AWS), one physical server can be divided into any number of virtual servers. The Amazon solution is an extension of the popular open-source virtualization technology called Xen.

Cloud Storage:
Cloud storage involves dividing data into small chunks and storing them, along with checksums, across multiple servers. This allows the data to be processed in parallel and retrieved rapidly, and the checksums let corruption on any one server be detected.
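The chunk-plus-checksum idea can be sketched in a few lines. This is a toy illustration of the principle, not any particular cloud provider's implementation; the function names and the tiny chunk size are made up for the example (real systems use chunks of many megabytes).

```python
import hashlib

def split_with_checksums(data: bytes, chunk_size: int = 4):
    """Divide data into fixed-size chunks, each paired with a checksum,
    so corruption of any stored chunk can be detected on read.
    chunk_size is tiny here for illustration only."""
    chunks = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunks.append((chunk, hashlib.sha256(chunk).hexdigest()))
    return chunks

def reassemble(chunks):
    """Verify each chunk against its checksum, then join them back together."""
    out = bytearray()
    for chunk, digest in chunks:
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("corrupt chunk detected")
        out.extend(chunk)
    return bytes(out)

# Usage: split, then verify-and-join; each (chunk, checksum) pair could
# live on a different server and be fetched in parallel.
stored = split_with_checksums(b"cloud storage demo")
assert reassemble(stored) == b"cloud storage demo"
```

Because each chunk carries its own checksum and is independent of the others, different servers can serve different chunks simultaneously, which is what makes the parallel retrieval mentioned above possible.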