Search Engine
Project name |
| Search Engine |
Customer |
| USA-based company (covered by NDA) |
Business case |
| The current scale and growth of the World Wide Web make effective and accurate search for Web pages crucial. Today the only feasible way for a searcher to locate a particular Web-based source is to use a Web search engine. Generic large-scale search engines return thousands of web pages, and since many of them lack relevance to the query, searchers tend to look only at the first few results. That is why accurate ranking is critically important. |
| The Customer decided to build a Web-scale search engine that mitigates the problems of existing search systems. The goal of the project was to address issues of both quality and scalability by scaling search engine technology to keep pace with the extraordinary growth of the web. Creating a search engine that scales even to today's web presents many challenges. Fast crawling technology is needed to gather web documents and keep them up to date. Storage space must be used efficiently to store the indices and the documents themselves. |
| The system had to keep local copies of documents retrieved from the Internet, which required fast data storage. The full size of the document repository, which contains all information about web pages (including the document header, the archived document body, etc.), was estimated at dozens of terabytes. |
Solution overview |
| Inteks specialists spent over 1,000 man-hours investigating relevance calculation for large collections of documents. As a result, they designed an architecture that can support novel research activities on large-scale web data. Thanks to its distributed data processing architecture, the search engine can be scaled to any target system, from desktops to high-end computers. A novel "search by meaning" feature employs a thesaurus-based retrieval concept: the purpose of the thesaurus is to provide meanings and synonyms for a given word and to store relations between words.
The front-end application uses this information to provide a search-refining capability, which substantially improves the quality and relevance of search results. Two dictionaries were implemented: one as a wrapper around WordNet, the other user-defined. WordNet is an online lexical reference system whose design is inspired by current psycholinguistic theories of human lexical memory.
|
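The thesaurus-based refinement described above can be illustrated with a minimal Java sketch. It is not the project's actual code, which is covered by NDA. All names here (`Thesaurus`, `UserThesaurus`, `CompositeThesaurus`, `refinements`) are hypothetical, and the WordNet-backed dictionary is replaced by a hard-coded stand-in, since the real wrapper's interface is not described in the source. The sketch assumes Java 10 or later.

```java
import java.util.LinkedHashMap;
import java.util.LinkedHashSet;
import java.util.List;
import java.util.Locale;
import java.util.Map;
import java.util.Set;

public class ThesaurusSketch {

    /** Hypothetical abstraction: maps a word to its related terms. */
    interface Thesaurus {
        Set<String> synonyms(String word);
    }

    /** In-memory, user-defined dictionary of symmetric synonym relations. */
    static class UserThesaurus implements Thesaurus {
        private final Map<String, Set<String>> entries = new LinkedHashMap<>();

        /** Registers the relation in both directions. */
        void relate(String a, String b) {
            link(a, b);
            link(b, a);
        }

        private void link(String from, String to) {
            entries.computeIfAbsent(normalize(from), k -> new LinkedHashSet<>())
                   .add(normalize(to));
        }

        @Override
        public Set<String> synonyms(String word) {
            return entries.getOrDefault(normalize(word), Set.of());
        }
    }

    /** Consults several dictionaries in order and merges their answers. */
    static class CompositeThesaurus implements Thesaurus {
        private final List<Thesaurus> sources;

        CompositeThesaurus(List<Thesaurus> sources) {
            this.sources = List.copyOf(sources);
        }

        @Override
        public Set<String> synonyms(String word) {
            Set<String> merged = new LinkedHashSet<>();
            for (Thesaurus source : sources) {
                merged.addAll(source.synonyms(word));
            }
            return merged;
        }
    }

    static String normalize(String word) {
        return word.trim().toLowerCase(Locale.ROOT);
    }

    /** For each query term with known relations, lists alternative terms to offer the user. */
    static Map<String, Set<String>> refinements(String query, Thesaurus thesaurus) {
        Map<String, Set<String>> suggestions = new LinkedHashMap<>();
        for (String term : query.trim().split("\\s+")) {
            if (term.isEmpty()) {
                continue;
            }
            Set<String> alternatives = thesaurus.synonyms(term);
            if (!alternatives.isEmpty()) {
                suggestions.put(normalize(term), alternatives);
            }
        }
        return suggestions;
    }

    public static void main(String[] args) {
        // Stand-in for the WordNet-backed dictionary (assumption: hard-coded data).
        UserThesaurus wordNetStandIn = new UserThesaurus();
        wordNetStandIn.relate("fast", "quick");

        UserThesaurus userDefined = new UserThesaurus();
        userDefined.relate("car", "automobile");
        userDefined.relate("car", "vehicle");

        Thesaurus combined = new CompositeThesaurus(List.of(wordNetStandIn, userDefined));
        System.out.println(refinements("fast car rental", combined));
    }
}
```

In this sketch the front end would present the returned alternatives as refinement options rather than silently rewriting the query, which keeps the searcher in control of which meaning is intended.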
Benefits |
| The Customer received a system that meets the highest market requirements. The functionality of the system is on par with the world's leading search engines, as the following facts show: |
The search engine can index billions of web pages. |
The search system processes a query in under 1 second on an index of 1 billion documents. |
Tools and technologies |
| Java, JSP, TCP/IP, WordNet® lexical database (by Cognitive Science Laboratory), Linux |