Monday, December 6, 2010

The Anatomy of a Large-Scale Hypertextual Web Search Engine - the first publication on the GOOGLE search engine

The Anatomy of a Large-Scale Hypertextual Web Search Engine

Sergey Brin and Lawrence Page
{sergey, page}@cs.stanford.edu
Computer Science Department, Stanford University, Stanford, CA 94305

Abstract

       In this paper, we present Google, a prototype of a large-scale search engine which makes heavy use of the structure present in hypertext. Google is designed to crawl and index the Web efficiently and produce much more satisfying search results than existing systems. The prototype with a full text and hyperlink database of at least 24 million pages is available at http://google.stanford.edu/
       To engineer a search engine is a challenging task. Search engines index tens to hundreds of millions of web pages involving a comparable number of distinct terms. They answer tens of millions of queries every day. Despite the importance of large-scale search engines on the web, very little academic research has been done on them. Furthermore, due to rapid advance in technology and web proliferation, creating a web search engine today is very different from three years ago. This paper provides an in-depth description of our large-scale web search engine -- the first such detailed public description we know of to date.
       Apart from the problems of scaling traditional search techniques to data of this magnitude, there are new technical challenges involved with using the additional information present in hypertext to produce better search results. This paper addresses this question of how to build a practical large-scale system which can exploit the additional information present in hypertext. Also we look at the problem of how to effectively deal with uncontrolled hypertext collections where anyone can publish anything they want.
 Keywords: World Wide Web, Search Engines, Information Retrieval, PageRank, Google

Architecture of Google File System


A GFS cluster consists of a single master and multiple chunkservers and is accessed by multiple clients. Each of these is typically a commodity Linux machine running a user-level server process. It is easy to run both a chunkserver and a client on the same machine, as long as machine resources permit and the lower reliability caused by running possibly flaky application code is acceptable.

Files are divided into fixed-size chunks. Each chunk is identified by an immutable and globally unique 64 bit chunk handle assigned by the master at the time of chunk creation. Chunkservers store chunks on local disks as Linux files and read or write chunk data specified by a chunk handle and byte range. For reliability, each chunk is replicated on multiple chunkservers. By default, we store three replicas, though users can designate different replication levels for different regions of the file namespace.

The master maintains all file system metadata. This includes the namespace, access control information, the mapping from files to chunks, and the current locations of chunks. It also controls system-wide activities such as chunk lease management, garbage collection of orphaned chunks, and chunk migration between chunkservers. The master periodically communicates with each chunkserver in HeartBeat messages to give it instructions and collect its state.

GFS client code linked into each application implements the file system API and communicates with the master and chunkservers to read or write data on behalf of the application. Clients interact with the master for metadata operations, but all data-bearing communication goes directly to the chunkservers. We do not provide the POSIX API and therefore need not hook into the Linux vnode layer.

Neither the client nor the chunkserver caches file data. Client caches offer little benefit because most applications stream through huge files or have working sets too large to be cached. Not having them simplifies the client and the overall system by eliminating cache coherence issues. (Clients do cache metadata, however.) Chunkservers need not cache file data because chunks are stored as local files and so Linux's buffer cache already keeps frequently accessed data in memory.
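The read path above can be sketched in a few lines. This is a minimal illustrative model, not the real GFS API: the class names, `CHUNK_SIZE` constant, and method signatures are assumptions made for clarity. It shows the key division of labor: the client computes a chunk index from the byte offset, contacts the master only for metadata, and then reads data directly from a chunkserver.

```python
# A minimal sketch of the GFS client read path described above.
# All names (Master, Chunkserver, CHUNK_SIZE, lookup) are illustrative,
# not the actual GFS interfaces.

CHUNK_SIZE = 64 * 1024 * 1024  # GFS uses fixed-size 64 MB chunks

class Master:
    """Holds metadata only: file -> chunk handles, handle -> replica locations."""
    def __init__(self):
        self.file_chunks = {}      # filename -> [chunk handle, ...]
        self.chunk_locations = {}  # chunk handle -> [chunkserver, ...]

    def lookup(self, filename, chunk_index):
        handle = self.file_chunks[filename][chunk_index]
        return handle, self.chunk_locations[handle]

class Chunkserver:
    """Stores chunk data as local files, keyed by chunk handle."""
    def __init__(self):
        self.chunks = {}  # chunk handle -> bytes

    def read(self, handle, offset, length):
        return self.chunks[handle][offset:offset + length]

def client_read(master, filename, offset, length):
    # 1. Translate the byte offset into a chunk index (chunks are fixed-size).
    chunk_index, chunk_offset = divmod(offset, CHUNK_SIZE)
    # 2. Metadata operation: ask the master for the handle and replica list.
    handle, replicas = master.lookup(filename, chunk_index)
    # 3. Data operation: go directly to a chunkserver, never through the master.
    return replicas[0].read(handle, chunk_offset, length)

# Tiny usage example: one chunk stored on one chunkserver.
cs = Chunkserver()
cs.chunks["h1"] = b"hello, gfs"
m = Master()
m.file_chunks["/logs/a"] = ["h1"]
m.chunk_locations["h1"] = [cs]
print(client_read(m, "/logs/a", 0, 5))  # b'hello'
```

Keeping the master out of the data path is what lets a single master scale: it answers small metadata lookups (which clients also cache) while the bulk bytes flow between clients and chunkservers.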

[Figure: The GFS architecture]

Design Assumptions of Google File System:

  • The system is built from many inexpensive commodity components that often fail. It must constantly monitor itself and detect, tolerate, and recover promptly from component failures on a routine basis.
  • The system stores a modest number of large files. We expect a few million files, each typically 100 MB or larger in size. Multi-GB files are the common case and should be managed efficiently. Small files must be supported, but we need not optimize for them.
  • The workloads primarily consist of two kinds of reads: large streaming reads and small random reads. In large streaming reads, individual operations typically read hundreds of KBs, more commonly 1 MB or more. Successive operations from the same client often read through a contiguous region of a file. A small random read typically reads a few KBs at some arbitrary offset. Performance-conscious applications often batch and sort their small reads to advance steadily through the file rather than go back and forth.
  • The workloads also have many large, sequential writes that append data to files. Typical operation sizes are similar to those for reads. Once written, files are seldom modified again. Small writes at arbitrary positions in a file are supported but do not have to be efficient.
  • The system must efficiently implement well-defined semantics for multiple clients that concurrently append to the same file. Our files are often used as producer-consumer queues or for many-way merging. Hundreds of producers, running one per machine, will concurrently append to a file. Atomicity with minimal synchronization overhead is essential. The file may be read later, or a consumer may be reading through the file simultaneously.
  • High sustained bandwidth is more important than low latency. Most of our target applications place a premium on processing data in bulk at a high rate, while few have stringent response time requirements for an individual read or write.
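The concurrent-append assumption above is what motivates GFS's atomic record append: the file system, not the client, picks the offset, so hundreds of producers can append without coordinating among themselves. The sketch below illustrates that idea in a single process; the class name and method are hypothetical stand-ins, and a single lock plays the role of GFS's per-chunk lease machinery.

```python
import threading

class AppendOnlyFile:
    """Illustrative sketch of GFS-style record append: the system, not the
    client, chooses the offset, so many producers can append concurrently.
    A single lock here stands in for GFS's distributed lease mechanism."""
    def __init__(self):
        self._data = bytearray()
        self._lock = threading.Lock()

    def record_append(self, record: bytes) -> int:
        with self._lock:
            offset = len(self._data)   # offset is chosen by the file system
            self._data.extend(record)
            return offset              # returned so the producer can find its record

# 100 producers appending concurrently, as in a producer-consumer queue.
f = AppendOnlyFile()
threads = [threading.Thread(target=f.record_append, args=(b"rec%d;" % i,))
           for i in range(100)]
for t in threads:
    t.start()
for t in threads:
    t.join()
# Every record lands exactly once, at some system-chosen offset,
# with no coordination between the producers themselves.
```

Contrast this with a traditional write at a client-chosen offset, where concurrent producers would have to serialize among themselves to avoid overwriting each other.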

Google File System - Abstract

A fantastic paper presented by Google.
Google designed and implemented the Google File System, a scalable distributed file system for large distributed data-intensive applications. It provides fault tolerance while running on inexpensive commodity hardware, and it delivers high aggregate performance to a large number of clients.
While sharing many of the same goals as previous distributed file systems, the design has been driven by observations of Google's application workloads and technological environment, both current and anticipated, that reflect a marked departure from some earlier file system assumptions. This has led Google to reexamine traditional choices and explore radically different design points.
The file system has successfully met Google's storage needs. It is widely deployed within Google as the storage platform for the generation and processing of data used by Google services, as well as research and development efforts that require large data sets. The largest cluster to date provides hundreds of terabytes of storage across thousands of disks on over a thousand machines, and it is concurrently accessed by hundreds of clients.
In this paper, we present file system interface extensions designed to support distributed applications, discuss many aspects of our design, and report measurements from both micro-benchmarks and real world use.

Overview of Amazon Web Services:

The infrastructure of the Amazon AWS environment encompasses a variety of infrastructure services:
Amazon EC2
Amazon Simple Storage Service (Amazon S3)
Amazon Simple Queue Service (Amazon SQS)
Amazon CloudFront
Amazon SimpleDB





Cloud Application Architectures:
1) Grid Computing
2) Transactional Computing

The main focus of this blog is how you write an application so that it can take advantage of the Cloud.





The Cloud

The cloud can be both software and infrastructure.
A cloud service is accessible either via a web browser or via a Web Services API.
You pay only for what you use.
SaaS (Software as a Service) is a term that refers to software delivered in the cloud.
SaaS is a web-based software model that makes the software available entirely through a browser.
Examples: Gmail and Salesforce.com are the classic examples.
For instance, Salesforce is an enterprise CRM application: you simply open the Salesforce (or Gmail) URL in a browser and directly access the CRM application (or your mail).
SaaS characteristics:
Availability via a web browser
On-demand availability
Payment terms based on usage
Minimal IT demands
Turning to hardware virtualization: through Amazon Web Services (AWS), one physical server can be divided into any number of virtual servers. The Amazon solution is an extension of the popular open-source virtualization technology called Xen.

Cloud Storage:
Cloud storage involves dividing data into small chunks and storing them, along with checksums, across multiple servers. This allows the data to be processed in parallel and retrieved rapidly, and the checksums let corruption on any one server be detected.
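The chunk-plus-checksum idea can be sketched in a few lines. This is a toy illustration of the principle, not any particular cloud provider's implementation; the function names and the tiny chunk size are made up for the example (real systems use chunks of many megabytes).

```python
import hashlib

def split_with_checksums(data: bytes, chunk_size: int = 4):
    """Divide data into fixed-size chunks, each paired with a checksum,
    so corruption of any stored chunk can be detected on read.
    chunk_size is tiny here for illustration only."""
    chunks = []
    for i in range(0, len(data), chunk_size):
        chunk = data[i:i + chunk_size]
        chunks.append((chunk, hashlib.sha256(chunk).hexdigest()))
    return chunks

def reassemble(chunks):
    """Verify each chunk against its checksum, then join them back together."""
    out = bytearray()
    for chunk, digest in chunks:
        if hashlib.sha256(chunk).hexdigest() != digest:
            raise ValueError("corrupt chunk detected")
        out.extend(chunk)
    return bytes(out)

# Usage: split, then verify-and-join; each (chunk, checksum) pair could
# live on a different server and be fetched in parallel.
stored = split_with_checksums(b"cloud storage demo")
assert reassemble(stored) == b"cloud storage demo"
```

Because each chunk carries its own checksum and is independent of the others, different servers can serve different chunks simultaneously, which is what makes the parallel retrieval mentioned above possible.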