Information technologies of information search. Stages of information search

Information technologies of information retrieval

Search for information: basic concepts, types and forms of organization

Information search, or information retrieval, is one of the main information processes, and mankind has engaged in it since ancient times. The goals, possibilities and nature of a search have always depended on the availability of information, its importance and accessibility, and the means available for organizing the search.

The end of the 20th and the beginning of the 21st century are characterized by huge, constantly growing arrays of diverse information that is accessible and of interest to the widest strata of society. Moreover, Internet technologies and software and hardware tools, which are also available to most people, allow a search to be carried out at any time, almost anywhere, and for any request.

Search is a process during which, in one sequence or another, the object sought is compared with each object stored in the array. The purpose of any search is to satisfy a need or desire to find information that helps the searcher obtain the knowledge he or she needs: to raise a professional, cultural or other level; to create new information and form new knowledge; to make managerial decisions; and so on.

According to experts, there are billions of users on the Internet. Of these, hundreds of millions are online (English "on-line", interactive access at any time), and the number of such users is constantly growing. This makes it difficult to organize prompt search and retrieval of the information needed by so many users. Problems arise from the variety of types of information retrieval, the different ways they are implemented in information retrieval systems (IRS, or IPS), and the different levels of user knowledge about the capabilities of such systems, especially in formulating queries and processing the data returned by them.

It is assumed that in the future IRSs will be created that can automatically adapt to the level of knowledge and the requests of specific users, understand requests in natural language and, using artificial intelligence, return relevant and pertinent information. The creation of such systems will require the intelligence and knowledge of specific IRS users or their intermediaries. In the meantime, the broad range of search engine users is required to have a fairly good command of this subject area.

There are various interpretations of the term "information search", or "information retrieval".

The term "information retrieval" was introduced by the American mathematician C. Mooers. He noticed that the motive for such a search is an information need, expressed in the form of an information request. Mooers classified documents, information about their existence and/or location, and factual information as objects of information retrieval.

Representatives of libraries were the first to solve the problems of factographic search. They developed information retrieval tools called the "reference and search apparatus" (catalogues, bibliographic indexes, etc.). In the professional domestic press, this term has been used since the 1970s. Librarians define "information retrieval" as finding, in an information array, the documents that correspond to users' information requests.

From the point of view of computer technology, "information retrieval" is a set of logical and technical operations whose ultimate goal is to find documents, information about them, facts and data relevant to the consumer's request.

"Relevance" is the correspondence, established during information retrieval, between the content of a document and the information request, or between the search image of the document and the search prescription.

There are other definitions as well. In any case, information retrieval is driven by the need to satisfy the information needs of users who expect to quickly receive the data or information they need with the help of search engines. It is a method of targeted search and retrieval of relevant documents and/or facts from various sources of information, such as data banks or storage devices. These sources may be living or non-living objects that act as sources and carriers of information.

Systems that implement such information retrieval are called search systems (PS). In traditional technologies, search systems take the form of card files and catalogs, address books and other directories, indexes, encyclopedias, and the reference apparatus of publications and other materials.

In 1945, the American scientist and engineer Vannevar Bush, in his article "As We May Think", was the first to raise widely the question of the need to mechanize information retrieval.

Since the 1960s, automated search systems that work with information have appeared. Since then, intensive work has been carried out on formulating and implementing the principles and methods of information retrieval.

"Search engines" search a database or other array of machine-readable data for documents that contain the given words.

Electronic search systems accessed from conventional or intelligent terminals (PCs) enable users to build search queries from formal and content-descriptive elements combined with special logical operators, and to search a database or other array of machine-readable data for documents containing the specified words. Search engines support only search procedures and related processes.

Information retrieval systems

Search systems with a large set of functions and capabilities are usually part of a DBMS and are called information retrieval systems. They are also created and used to find needed data efficiently, including on the Internet.

Terminologically, an "information retrieval system" (IRS) is a system designed to store and search for information: a software package that implements the processes of creating, updating, storing and searching information databases and data banks.

An information retrieval system is interpreted as a system that provides search and selection of the necessary data based on an information retrieval language and the corresponding search rules, and a database as a set of means and methods for describing, storing and manipulating data that facilitate the collection, accumulation and processing of large information arrays. Databases differ in organization by the type of data objects and the relationships between them.

The functioning of modern IRSs is based on two assumptions:

    the documents required by the user are united by the presence of some feature or combination of features;

    the user is able to specify this feature.

Neither of these assumptions is fully met in practice; one can speak only of the probability that they hold. Therefore, the process of information retrieval is usually a sequence of steps that leads through the system to a certain result and makes it possible to evaluate its completeness. At the same time, the user's behavior, as the organizing principle that controls the search process, is motivated not only by the information need but also by the variety of strategies, technologies and tools provided by the system.

The user usually does not have comprehensive knowledge of the content of the resource being searched. He can evaluate the adequacy of the query expression, and the completeness of the result obtained, by finding additional information or by organizing the process so that part of the search results can be used to confirm or refute the adequacy of the other part. Professional users, in turn, are characterized by a stable thematic profile. When they are "information-oriented", they are marked by the desire and ability to organize the information space of the problem. This means that the user essentially creates a new, "independent", problem-oriented, individually updated and replenished information resource (IR), which, in addition to document collections, also includes meta-information, for example dictionaries of specialized terminology, subject-area classifiers and resource descriptions.

The peculiarity of the user's work in "self-service" mode, in the context of automating the activity as a whole, is that the system must provide an environment that supports not only the consumer's functions of processing the information found and the functions traditionally performed by an information intermediary (interpreting the request, translating it into an information retrieval language, choosing the information resource, automated search and manual selection of materials), but also such "supporting" functions as structuring the information need, lexical adaptation of the query, and evaluation, systematization and processing of search results at the level of both an individual document and information resources in general. The technical capabilities available to the user allow him to create an information resource: to form arrays, systematize them and create external representations of their content for his own or external use.

IRSs are divided into traditional (manual, mechanical, electromechanical) and automated (electronic).

Automated IRSs (AIRS, or AIPS) use computer software, hardware and technologies and are intended to find and deliver information to users according to specified criteria. The following two factors are decisive for understanding search automation methods:

    it is not the objects themselves that are compared but their descriptions, the so-called "search images";

    the process itself is complex (composite rather than a single act) and is usually implemented as a sequence of operations.

Data is entered into an automated IRS using specially developed input formats. All information about one object in the IRS is presented as systematized data that forms one row of a table and is called a record. For example, if the system is the electronic catalog of a library, then each bibliographic description (BD) of a document is one record, whose fields correspond to the elements of the bibliographic description. A set of records forms a database, which is usually stored in one file. A set of databases united by one DBMS forms a data bank.
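The record, field, database and data bank hierarchy can be illustrated with a minimal Python sketch. The field names (author, title, year, keywords), the sample entries and the helper `find_by_field` are assumptions invented for this example and do not describe any particular system.

```python
from dataclasses import dataclass, field
from typing import List


@dataclass
class Record:
    """One row of the table: all data about a single document."""
    author: str
    title: str
    year: int
    keywords: List[str] = field(default_factory=list)


# A database is a set of records; a data bank groups several databases.
# The two entries below are made-up sample data.
catalog = [
    Record("Ivanov I.", "Library catalogs", 1998, ["library", "catalog"]),
    Record("Petrov P.", "Full-text search", 2005, ["search", "index"]),
]
data_bank = {"electronic_catalog": catalog}


def find_by_field(database, field_name, value):
    """Return the records whose given field equals the value."""
    return [rec for rec in database if getattr(rec, field_name) == value]


matches = find_by_field(data_bank["electronic_catalog"], "year", 2005)
print([rec.title for rec in matches])
```

Each `Record` instance plays the role of one bibliographic description, and each attribute plays the role of one field.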

Because an automated IRS is a tool that a person uses when searching for information (and not an intelligent machine that delivers ready-made solutions to the problems of the main activity), the effectiveness of its use depends on how well the person knows the nature of the operational objects and the properties of the tool through which he works with them.

Information retrieval involves the use of certain strategies, methods, mechanisms and tools. The behavior of the user who manages the search process is determined not only by the information need, but also by the instrumental diversity of the system - the technologies and tools provided by the system.

Search strategy is the general plan (concept, preference, disposition) of the behavior of the system or user for expressing and satisfying the user's information need. It is determined both by the nature of the goal and type of search and by the system's "strategic" decisions: the database architecture and the search methods and tools of a particular automated IRS.

The choice of strategy in the general case is an optimization problem. In practice, it is largely determined by the art of achieving a compromise between practical needs and the possibilities of available means.

Search method is a set of models and algorithms for implementing individual technological stages: building the query search image, selecting documents (comparing the search images of queries and documents), expanding and reformulating the query, and localizing and evaluating the output.

Query search image is a text written in an information retrieval language (IRL) that expresses the semantic content of the information request and contains the instructions needed to carry out information retrieval as effectively as possible.

Search methods, that is, methods for selecting a subset of documents that potentially contain a description of the solution to the main-activity task, reflect the process of finding a solution and depend on the nature of the task and the subject area.

If search is considered an iterative process, the methods for reducing the search space (the subset being scanned) essentially form the methodological basis of the search strategy. They can be divided into the following classes, namely methods of searching in:

    one space (usually thematic);

    hierarchically ordered space;

    alternative spaces;

    dynamic (changing during the search) space.

The method implemented for constructing the query search image should provide efficient ways of building a query for various types of goals.

Search mechanisms are the set of models and algorithms implemented in the system for generating the output of documents in response to a search query.

Search tools are, on the one hand, an interdependent complex of information retrieval languages (IRLs) and data definition/management languages that provides structural and semantic transformations of the objects being processed (documents, dictionaries, sets of search results) and, on the other hand, the user interface objects that control the sequence in which the operational objects of a particular automated IRS are selected.

Search technologies are unified sequences (optimized within a specific automated IRS) for the effective use of individual search tools as the user interacts with the system to obtain stable final and intermediate results.

Navigation, as the implementation of the search process for a request in the selected database, is a targeted, strategy-defined sequence of using the methods, tools and technologies of a particular automated IRS to obtain and evaluate the result.

Navigation aids allow the user to control the search process. They are provided to the user in the form of an interface, which makes it possible to organize a more or less efficient process of interaction with the database. The "friendliness" of the interface is characterized not only by ergonomics and clarity but also by the variety of operational objects the user can choose from.

The process of information retrieval is a sequence of steps that leads through the system to a certain result and makes it possible to evaluate its completeness. Since the user usually does not have comprehensive knowledge of the information content of the resource being searched, he can evaluate the adequacy of the query expression and the completeness of the result only from external estimates or from intermediate results and generalizations, comparing them, for example, with previous ones.

The search process can be represented as the following main components:

    formulating a query in natural language, choosing search engines and services, and formalizing the query in the corresponding information retrieval language;

    conducting a search in one or more search engines;

    review of the obtained results (references);

    preliminary processing of the obtained results: viewing the content of links, extracting and saving relevant and pertinent data;

    if necessary, modifying the request and conducting a repeated (clarifying) search with subsequent processing of the results.

To reduce the volume of selected materials, search results are filtered by type of source (sites, portals), topic and other criteria.

The search technologies used by IRSs can be divided into four categories:

    Thematic catalogs;

    Specialized catalogs (online directories);

    Search engines (full-text search);

    Metasearch tools.

On the Internet, an IRS is hosted on one or more servers. The system collects, indexes and registers information about the documents available on the group of web servers it serves. Either all significant words in the documents are indexed, or only the words from headings.
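The two indexing options (all significant words, or heading words only) can be sketched as a small inverted index. This is an illustration under stated assumptions: the stop-word list, the document structure with `heading` and `body` keys, and the example.org URLs are all invented for the example.

```python
import re
from collections import defaultdict

# Assumed stop-word list; real systems use much larger ones.
STOP_WORDS = {"a", "an", "the", "of", "and", "in", "on", "for", "to", "is"}


def significant_words(text):
    """Lowercase the text, split it into words and drop stop words."""
    return [w for w in re.findall(r"[a-z0-9]+", text.lower())
            if w not in STOP_WORDS]


def build_index(documents, headings_only=False):
    """Map each significant word to the set of URLs that contain it."""
    index = defaultdict(set)
    for url, doc in documents.items():
        text = doc["heading"]
        if not headings_only:
            text = text + " " + doc["body"]
        for word in significant_words(text):
            index[word].add(url)
    return index


documents = {
    "http://example.org/a": {"heading": "History of the library",
                             "body": "Catalogs and indexes in libraries."},
    "http://example.org/b": {"heading": "Search engines",
                             "body": "Full-text search on the library site."},
}

full_index = build_index(documents)
heading_index = build_index(documents, headings_only=True)
print(sorted(full_index["library"]))
print(sorted(heading_index["library"]))
```

Indexing headings only gives a smaller index but misses documents that mention a word just in the body, as the second document shows.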

Thematic catalogs process documents and assign each of them to one of several categories from a predetermined list. In effect, this is indexing based on classification. Indexing can be done automatically or manually, with the help of experts who browse popular web sites and compile short descriptions of the documents (keywords, annotation, abstract).

Specialized catalogs, or reference directories, are created for individual industries and topics, for news, cities, e-mail addresses, etc.

Search engines (the most advanced search facility on the Internet) implement full-text search technology. The texts located on the polled servers are indexed. An index can contain information about several million documents. For example, the index of the popular IRS "AltaVista" contains more than 56 million URLs.

When metasearch tools are used, the request is executed simultaneously by several search engines. The results are combined into a common list sorted by relevance. Each system processes only part of the network's nodes, so combining them expands the search base. This class also includes "personal search programs" that allow you to create your own metasearch tools (for example, to automatically query frequently visited sites).
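Combining the answers of several engines into one list sorted by relevance can be sketched as follows. The sketch assumes that every engine returns (url, score) pairs with comparable scores and that a URL found by several engines keeps its highest score; both are simplifying assumptions, and the function name `merge_results` is hypothetical.

```python
def merge_results(*result_lists):
    """Combine ranked lists from several engines into one list.

    Each input is a list of (url, score) pairs. A URL returned by more
    than one engine keeps its highest score (an assumed merging rule).
    """
    best = {}
    for results in result_lists:
        for url, score in results:
            if score > best.get(url, float("-inf")):
                best[url] = score
    # Sort by score, highest first; ties are broken by URL.
    return sorted(best.items(), key=lambda item: (-item[1], item[0]))


# Made-up answers from two engines that cover different parts of the network.
engine_a = [("http://example.org/1", 0.9), ("http://example.org/2", 0.4)]
engine_b = [("http://example.org/2", 0.7), ("http://example.org/3", 0.5)]

merged = merge_results(engine_a, engine_b)
print(merged)
```

The merged list contains three documents although each engine alone returned only two, which is the sense in which metasearch expands the search base.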

Information databases can contain almost any kind of information, in any combination. Retrieval is carried out both by the terms that occur in full-text electronic information resources (EIR) and by special elements that are part of the information retrieval language. Special information retrieval languages are used to form queries.

IRSs usually try to arrange the documents within the retrieved sample in order of their "relevance", that is, their closeness to the query entered by the user. There are many criteria for such closeness, and identifying documents close "in meaning" to the query does not solve the problem of obtaining information when no relevant document exists. This situation is quite trivial, in part because the user often searches for a document ... It should be noted that a search can return subsets of data that are relevant and pertinent as well as subsets that are irrelevant and non-pertinent.
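One of the many possible closeness criteria is simple term overlap: the share of query words that occur in a document. The sketch below uses this criterion only as an assumption for illustration; it is not the ranking formula of any real system, and the document texts are invented.

```python
import re


def tokenize(text):
    """Split text into lowercase words."""
    return re.findall(r"[a-z0-9]+", text.lower())


def overlap_score(query, document):
    """Share of distinct query terms that occur in the document (0.0 to 1.0)."""
    query_terms = set(tokenize(query))
    if not query_terms:
        return 0.0
    doc_terms = set(tokenize(document))
    return len(query_terms & doc_terms) / len(query_terms)


def rank(query, documents):
    """Return (doc_id, score) pairs ordered by decreasing score."""
    scored = [(doc_id, overlap_score(query, text))
              for doc_id, text in documents.items()]
    return sorted(scored, key=lambda pair: (-pair[1], pair[0]))


documents = {
    "d1": "Library catalogs and bibliographic indexes",
    "d2": "Full-text search in library catalogs",
    "d3": "Animation file formats",
}
ranking = rank("library catalogs search", documents)
print(ranking)
```

A document with a score of zero still appears at the end of the list, which mirrors the remark that a result set can contain irrelevant items alongside relevant ones.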

IRSs are in fact information support systems and take the form of databases and data banks. The object of information support may be an individual, an organization, an industry, a region, etc. The subject of information support is a computer scientist or any other consumer of information.

Search organization

It is proposed to divide the procedure for searching for the necessary information into nine main stages:

    Definition of a field of knowledge;

    Selecting the type and sources of data;

    Collection of materials necessary for filling the information model;

    Selection of the most useful information;

    Choice of information processing method (classification, clustering, regression analysis, etc.);

    Choosing an algorithm for searching for patterns;

    Search for patterns, formal rules and structural relationships in the collected information;

    Creative interpretation of the obtained results;

    Integration of extracted "knowledge".

To conduct a search, the interface for working with the corresponding database is first loaded on the user's computer. The database may be local or remote. The user should first decide on the type of search (simple, advanced, etc.) and then on the set of fields offered for searching. The IRS may offer one or more input fields. In the latter case, these are usually author, title, time period, type of document, keywords, subject headings, etc. When forming a query, almost all systems allow the logical operators "AND", "OR" and "NOT" to be used.
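The effect of the three logical operators on a fielded search can be shown with Python sets: AND corresponds to intersection, OR to union and NOT to complement. The records, field names and helper functions below are assumptions made up for the illustration.

```python
# Made-up records with two searchable fields.
records = [
    {"id": 1, "author": "ivanov", "keywords": {"library", "catalog"}},
    {"id": 2, "author": "petrov", "keywords": {"library", "search"}},
    {"id": 3, "author": "ivanov", "keywords": {"search", "index"}},
]

all_ids = {rec["id"] for rec in records}


def with_keyword(word):
    """Ids of the records whose keyword field contains the word."""
    return {rec["id"] for rec in records if word in rec["keywords"]}


def with_author(name):
    """Ids of the records whose author field equals the name."""
    return {rec["id"] for rec in records if rec["author"] == name}


# AND -> set intersection, OR -> set union, NOT -> complement.
library_and_search = with_keyword("library") & with_keyword("search")
catalog_or_index = with_keyword("catalog") | with_keyword("index")
ivanov_not_library = with_author("ivanov") & (all_ids - with_keyword("library"))

print(library_and_search, catalog_or_index, ivanov_not_library)
```

AND narrows the result, OR widens it, and NOT excludes records, which is why combining fields with these operators is the usual way to refine an advanced search.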

Information retrieval technologies

Search tools and technologies used to fulfill information needs are determined by the type and state of the main-activity task being solved by the user, that is, by the ratio of his knowledge and ignorance about the object under study. In addition, the user's interaction with the system is determined by the user's knowledge of the content of the resource (completeness of coverage, reliability of the source, etc.) and of the functionality of the system as a tool. In general, these factors are usually reduced to the concept of "professionalism": informational (trained/untrained user) and subject (professional/non-professional).

The process of searching for information is usually empirical in nature.

The operational objects directly involved in the interaction of users with the search engine are the search image of the document and the search image of the query, whose correspondence is established at the formal level by the search mechanism of the automated IRS. The adequacy of the image to the actual content of the document is determined by the quality of the information-condensation process and by how well the subject knows the means of representation: the conceptual scheme of the subject area and the capabilities of the information retrieval language.

Document search image is a description of the document, expressed by means of the information retrieval language, that characterizes the main semantic content or any other features of the document needed to find it on request.

Most search systems initially offer users either bibliographic records or links to full or partial documents, their descriptions and other materials stored in various automated IRSs. Modern search systems make it possible to determine and specify which source of information interests the user, and in what form.

Methods for processing search results

According to the nature of the transformations (in the context of further use of the processing results), the methods for processing search results can be divided into two groups:

    Structural-format transformations;

    Structural-semantic transformations (information-analytical, logical-semantic).

Search Implementation

The following are commonly searched for on the Internet: personal data about individuals and organizations; various address data; specific materials (articles, books, photographs, reference data, software, etc.), including where they are stored; where and for how much certain materials, services, products, etc. are sold; information sites and portals, etc.

It is common practice to search by the initial fragment of a word (search with right truncation). For example, instead of the word "library" you can enter its fragment "librar*". In this case, documents will be found that contain not only the word "library" but also "libraries", "librarian", "librarianship", etc. In each case the user must have a clear idea of what exactly he wants to find, since this option returns many more documents than when the word is specified in full (without truncation). In such a case, a refining search can be conducted within the retrieved array to obtain more relevant and pertinent data.
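Right truncation amounts to a prefix search over the index vocabulary. A minimal sketch, assuming the vocabulary is kept as a sorted list of words and that the truncation sign is a trailing "*"; the function name `right_truncation` and the word list are invented for the example, which uses the fragment "librar*".

```python
import bisect

# Assumed index vocabulary, kept in sorted order.
vocabulary = sorted(["index", "librarian", "librarianship", "libraries",
                     "library", "literature", "search"])


def right_truncation(pattern, words):
    """Return all words that start with the fragment before the '*'.

    `words` must be sorted so that all matches are adjacent.
    """
    prefix = pattern.rstrip("*")
    start = bisect.bisect_left(words, prefix)
    matches = []
    for word in words[start:]:
        if not word.startswith(prefix):
            break
        matches.append(word)
    return matches


print(right_truncation("librar*", vocabulary))
print(right_truncation("library", vocabulary))
```

The truncated query matches four vocabulary words while the full word matches one, which illustrates why truncation returns a much larger result set.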

Presentation of results

From the point of view of the IRS, the search result is a set (subset) of the documents found or links to them. It is usually presented to the user as a list. The simplest form of output is therefore a list of links in the form of full or partial bibliographic records found by the system. Such a list can be printed immediately or sent to any e-mail address, if the system provides this option and the user is connected to the Internet.

Graphic and full-text electronic information resources (EIR) can be offered to the user for viewing only or for copying, in various formats and scales, in whole or in part. Graphic resources usually exist in generally accepted formats such as JPG, GIF, TIFF and BMP; text materials usually use the text formats TXT, DOC, etc., as well as HTML and PDF, the latter being in effect a graphic format in which both text and graphic data can be saved.

The documents obtained as a result of the search are saved.

Search Evaluation Criteria

The criterion for a successful search is that the user receives a list of documents, a single document or parts thereof that best meet the needs formulated in the search query. In an IRS, the list of documents obtained from a search is customarily ordered by relevance. There are criteria of semantic and of formal correspondence between the search prescription and the returned document.


Internet search engines

The search engines Google, Yahoo, Yandex, Mail and others are used to find the necessary resource on the Internet by keywords. These systems crawl millions of WWW servers every day and index and catalog the resources they find. The ability to search for a resource on the Internet is very convenient, but we must not forget that the Web lives its own life: thousands of new pages appear every day and some old ones disappear. Therefore, search engines do not always return the most accurate information.

Search and structuring tools, sometimes referred to as search engines, help people find the information they need. Programs such as agents, spiders, crawlers and robots collect information about documents located on the Internet. These special programs look for pages on the Web, extract the hypertext links on those pages and automatically index the information they find to build a database. Each search engine has its own set of rules that determine how documents are collected. Some follow every link on every page they find and then, in turn, examine every link on each of the new pages, and so on. Some ignore links that lead to graphics, sound and animation files; others ignore links to resources such as WAIS databases; still others are instructed to look at the most popular pages first.
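The link-extraction step of such a robot can be sketched with Python's standard `html.parser` module. The page is an in-memory string so that nothing is fetched from the network; the list of ignored extensions, the class name `LinkExtractor` and the example URLs are assumptions for the illustration, not the rules of any real crawler.

```python
from html.parser import HTMLParser
from urllib.parse import urljoin

# Assumed list of extensions that this illustrative robot ignores.
SKIPPED_EXTENSIONS = (".gif", ".jpg", ".png", ".wav", ".mp3")


class LinkExtractor(HTMLParser):
    """Collect absolute URLs from the <a href="..."> tags of one page."""

    def __init__(self, base_url):
        super().__init__()
        self.base_url = base_url
        self.links = []

    def handle_starttag(self, tag, attrs):
        if tag != "a":
            return
        href = dict(attrs).get("href")
        # Skip links to graphics and sound files, keep everything else.
        if href and not href.lower().endswith(SKIPPED_EXTENSIONS):
            self.links.append(urljoin(self.base_url, href))


page = """
<html><body>
  <a href="/about.html">About</a>
  <a href="logo.gif">Logo</a>
  <a href="http://example.net/news">News</a>
</body></html>
"""

extractor = LinkExtractor("http://example.org/index.html")
extractor.feed(page)
print(extractor.links)
```

A full robot would place the collected links in a queue, fetch each page in turn and repeat the extraction, subject to its own collection rules.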

Google is the largest search engine network, owned by Google Inc.

It is the most popular system: it processes 41 billion 345 million requests per month, indexes more than 25 billion web pages and can find information in 195 languages.

The Google interface contains a rather sophisticated query language that allows you to limit your search to specific domains, languages, file types, and so on.

For search results, Google previously offered a refinement feature that allowed a more detailed search. Users specified additional parameters by which the results were filtered, so that the results page showed not only the query but also the context in which it occurred. This simplified the search procedure by eliminating the need to open each result. On September 22, 2010, the company launched voice search in Russia. To search, the user presses the button next to the search bar on the phone and says the query; the phone sends the recording to the server, and the browser displays the recognized query string and the search results for it.

Due to the popularity of the search engine, the neologism to google or to Google has appeared in English, which is used to refer to searching for information on the Internet using Google. It is with this definition that the verb is listed in the most authoritative dictionaries of the English language - the Oxford English Dictionary and Merriam-Webster, although other sources give examples of its use to mean searching for anything on the Internet at all.

Yandex is a Russian IT company that owns a web search system and an Internet portal of the same name. The Yandex search engine is the fourth among the world's search engines in terms of the number of processed search queries. As of February 8, 2013, according to the Alexa.com rating, the site yandex.ru ranks 20th in popularity in the world and 1st in Russia.

The Yandex.ru search engine was officially announced on September 23, 1997, and at first developed within the framework of CompTek International. As a separate company, Yandex was formed in 2000. In May 2011, Yandex held an initial public offering, earning more from it than any Internet company since Google's IPO in 2004.

Management of indexing in the Yandex search engine

All search engines take permissions and prohibitions for indexing from the robots.txt file located in the root directory of the server. Yandex also supports the META robots tag, the NOINDEX tag and a non-standard robots.txt extension, the Host directive. A ban on indexing certain pages may stem, for example, from the desire not to index the same documents in different encodings. The smaller the server, the faster the robot will crawl it. It is therefore advisable to disallow in robots.txt all documents that there is no point in indexing.
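How a robot reads such prohibitions can be checked with Python's standard `urllib.robotparser` module. The robots.txt content, the directory names and the robot name "ExampleBot" below are assumptions invented for the example; the non-standard Host directive is left out of the sketch.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt that hides a duplicate-encoding copy of the site
# and a directory with temporary files.
robots_txt = """\
User-agent: *
Disallow: /win1251/
Disallow: /tmp/
"""

parser = RobotFileParser()
parser.parse(robots_txt.splitlines())

allowed = parser.can_fetch("ExampleBot", "http://example.org/index.html")
blocked = parser.can_fetch("ExampleBot", "http://example.org/win1251/index.html")
print(allowed, blocked)
```

A well-behaved robot performs this check before requesting each page, so disallowed directories never reach the index.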

Adding pages in the Yandex search engine

Yandex scans hundreds of thousands of Web pages every day looking for changes or new links. Resource owners can add their own site by filling out the AddURL form.

The Yandex search engine is full-text, that is, only those words that are written on the pages of sites get into its index (and become available for searching).

Indexing in the Yandex search engine

When Yandex detects a new or modified page, it indexes it. In the process, the page is divided into elements (text, headings, image captions, links, and so on), whose content is entered into the index. The positions of words, that is, their location in the document or its element, are taken into account. The document itself is not stored in the database.
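The idea of an index that records the element and position of each word can be sketched as follows. This is a generic positional index written for illustration, not a description of Yandex's actual data structures; the element names and the function `index_page` are assumptions.

```python
import re
from collections import defaultdict


def index_page(url, elements, index):
    """Add every word of a page to the index with its element and position."""
    for element_name, text in elements.items():
        for position, word in enumerate(re.findall(r"\w+", text.lower())):
            index[word].append((url, element_name, position))


index = defaultdict(list)
index_page(
    "http://example.org/a",
    {"heading": "Library search", "text": "Search the library catalog"},
    index,
)
print(index["library"])
```

Only the postings (URL, element, position) are kept, not the page text itself, which matches the remark that the document is not stored in the database.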

Yahoo! is an American company that owns the second most popular search engine in the world and provides a number of services united by the Yahoo! Directory Internet portal; the portal includes the popular Yahoo! e-mail service.

According to Alexa Internet statistics, in February-April 2012 Yahoo! was the fourth most visited website on the Internet, and approximately 28% of visits consisted of viewing only one page.

Mail.Ru is a major communication portal of the Russian Internet whose monthly audience, as of October 2012, exceeded 31.9 million people.

The number of employees is 2800 people.

The resource belongs to the investment group Mail.Ru Group.

The key service of the portal, the e-mail service Pochta@Mail.Ru, was created in 1998 at the American software company DataArt, founded by Russian emigrants. Programmers from the St. Petersburg office of DataArt created new software for a web mail server, which was supposed to be sold to Western companies in the future. To test the service, it was temporarily made publicly available to Russian users in November 1998, and it unexpectedly began to gain popularity rapidly.

According to Vladimir Gabrielyan, VP and CTO of Mail.Ru, the portal has eight data centers with 9000 servers. The technical department of Mail.Ru employs more than seven hundred specialists.

Search organization

A search form is a very useful and popular feature, especially on large (in number of pages and amount of material) and heavily visited sites. Finding the right information on such a site using only the navigation menu and internal links can be difficult. It is much easier to type a couple of relevant words into the appropriate field, press the “find” button and get links to pages that may contain the information of interest.

Search can usually be implemented in two ways:

1. search implemented by the site engine itself (in PHP or another web programming language); this requires web programming experience, so for most site owners method 2 is preferable;

2. a search form that addresses an external search engine. This method is available to anyone who has mastered the basics of HTML and suits any site, even one consisting of a set of static HTML pages. However, such a search covers only those pages that are in the search engine's database. For all pages of the site to be indexed properly, two rules must be observed: 1) each page of the site must be reachable by a direct link without a redirect; 2) the site must not violate the search license of the search engine used.
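As a rough illustration of the second method, the sketch below builds the kind of address a search form might send to an external search engine. The endpoint https://search.example.com/search, the parameter name q, and the function site_search_url are hypothetical. The site: restriction is a convention supported by several engines, but the exact syntax should be checked in the help of the engine actually used.

```python
from urllib.parse import urlencode

def site_search_url(engine_url, site, words, param="q"):
    """Build a query URL that limits the search to one site.

    engine_url and param are placeholders, not a real engine's API.
    """
    query = f"{words} site:{site}"
    return f"{engine_url}?{urlencode({param: query})}"

url = site_search_url("https://search.example.com/search", "example.org", "price list")
print(url)  # https://search.example.com/search?q=price+list+site%3Aexample.org
```

An HTML form with a text field and a hidden site restriction would produce an equivalent request when submitted.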

Relevance

Relevance in information retrieval is the semantic correspondence between the search query and the search image of the document. In a more general sense, one of the concepts closest to the quality of "relevance" is "adequacy", that is, an assessment not only of the degree of correspondence but also of the practical applicability of the result and of the social applicability of the solution to the problem.

Types of relevance

1. Substantive relevance

The correspondence of a document to an information request, determined informally.

2. Formal relevance

A correspondence determined by comparing the image of the search query with the search image of the document according to a certain algorithm.
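Formal relevance can be made concrete with a small sketch. The text does not name a particular algorithm, so the Jaccard coefficient (shared terms divided by all distinct terms) is used here purely as one possible choice; the function name formal_relevance and the sample terms are assumptions.

```python
def formal_relevance(query_image, document_image):
    """Jaccard similarity between the query terms and the document terms.

    Returns a value from 0.0 (no common terms) to 1.0 (identical term sets).
    """
    query_terms = set(query_image)
    doc_terms = set(document_image)
    if not query_terms or not doc_terms:
        return 0.0
    return len(query_terms & doc_terms) / len(query_terms | doc_terms)

score = formal_relevance(
    ["weather", "novgorod"],
    ["weather", "forecast", "novgorod", "today"],
)
print(score)  # 0.5
```

A score like this is purely mechanical: it says nothing about whether the document actually satisfies the user's need, which is the domain of substantive relevance.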

Lecture: ORGANIZATION AND TECHNOLOGY OF INFORMATION SEARCH ON THE INTERNET

1. Information retrieval tools

2. Information retrieval technology

The characteristics of the Internet make searching for information faster than with traditional means. Although the network holds a significant amount of information, it is only semi-structured. For this reason, information retrieval tools that automate the search process in this environment are being actively developed.

Information retrieval tools

Internet search services (tools designed to search for information) are divided into:

search engines;
catalogs (directories);
metasearch engines.

[Figure: classification of Internet search tools (search engines, catalogs, metasearch engines) by the breadth of coverage of information resources; the types shown are global, regional, local, specialized, and network.]

A specific search tool can correspond to several of the listed types at once. The type of a search tool determines how broadly it covers the information resources of the Internet.

Information retrieval system

An information retrieval system (IPS) is a system that selects, indexes, and retrieves information on the basis of a document index. Indexing means assigning to each document keywords that reflect its content and guide the search toward those documents whose words are most similar to the words of the request. An IPS is an information system (IS) that solves the problems of collecting, storing, processing, and issuing information. It searches for documents, analyzes their content, builds search images of documents (information extracted from documents and used by the system as knowledge about them), stores the search images, analyzes user requests, searches for documents that are relevant (corresponding) to the request, and issues links to these documents to users.

[Figure: typical IPS scheme. Elements: client, user interface, search engine, document index, indexing robot, information resources; requests and responses pass between the client and the search engine.]

Features of the IPS

Each specific search engine stores information not about all Internet documents but only about those known to it (the percentage of indexed documents differs between systems but, as a rule, does not exceed 70%). Search engines do not store the documents themselves, only information about them sufficient for the user to find them. As a result, a given system may fail to return some documents that correspond to the request. In the response to a request, the system sorts documents by their degree of correspondence to the request as judged by the search engine's algorithm, not by their actual correspondence to the request.

Using IPS

Search engines are the most voluminous source of knowledge about the pages (documents) of the Internet. In most cases, information on the Internet has to be searched for with the help of information retrieval systems. In speed and completeness of response to a user's request, they have no equal. Many search services combine a search engine and a directory.

Information retrieval systems

Popular global information retrieval systems on the Internet are:

Google (http://www.google.com);
Bing (http://search.msn.com/);
Ask.com (http://www.ask.com).

Russian IPS include:

Yandex (http://www.yandex.ru, http://www.ya.ru);
Rambler (http://www.rambler.ru);
Webalta (http://www.aport.ru/).

Catalog

A catalog is a system that provides classification of information. Its distinguishing feature is a hierarchy (ordering scheme) of resources in which each resource belongs to one or more sections. Catalogs store descriptions (annotations) of Internet resources. They are filled in by webmasters (people who create information resources) or by special editors who review the information resources of the network. In response to a user's request, catalogs search these descriptions. Catalogs do not automatically detect changes to network information resources.
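The structure of a catalog, a hierarchy of sections holding human-written descriptions, can be sketched as nested dictionaries. Everything here is invented for illustration: the section names, the addresses under .example, the descriptions, and the function search_catalog.

```python
# Two-level hierarchy: section -> subsection -> list of (address, description).
catalog = {
    "Media": {
        "Radio": [("radio.example", "Live radio station with news and music")],
        "Newspapers": [("paper.example", "Daily newspaper with regional news")],
    },
    "Reference": {
        "Organizations": [("orgs.example", "Contact information for organizations")],
    },
}

def search_catalog(catalog, word):
    """Search the stored descriptions only, never the resources themselves."""
    word = word.lower()
    hits = []
    for section, subsections in catalog.items():
        for subsection, resources in subsections.items():
            for address, description in resources:
                if word in description.lower().split():
                    hits.append((f"{section}/{subsection}", address))
    return hits

print(search_catalog(catalog, "news"))
# [('Media/Radio', 'radio.example'), ('Media/Newspapers', 'paper.example')]
```

Because only the descriptions are searched, a change on a listed site goes unnoticed until an editor updates the entry, which matches the limitation noted above.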

[Figure: typical catalog scheme. Elements: client, user interface, search engine, hierarchy of information resources and their descriptions, technical staff, information resources; the response to a query consists of hypertext links.]

Using the catalog

When you need to find a group of information resources on a fairly broad topic, the catalog is the best search tool, for example, when searching for sites that provide contact information for organizations in Moscow or for electronic media sites. Search results in catalogs can be more meaningful, since the descriptions of information resources in them are prepared by people.

Catalogs

Electronic catalogs of global scale on the Internet are:

Yahoo (http://www.yahoo.com);
Open Directory (http://www.dmoz.org);
LookSmart (http://www.looksmart.com).

The most important Russian electronic catalogs are:

Yandex catalog (http://yaca.yandex.ru);
Mail.ru catalog (http://www.list.ru/);
Rambler's Top 100 catalog (http://top100.rambler.ru).

Metasearch system

A metasearch system is an add-on to search engines and electronic catalogs that has no database (index) of its own. When processing a user's search prescription, it automatically generates queries to several external search tools, automatically analyzes the results received from them, and returns a list of links ordered by the combined ratings of the answers across several search engines at once. Differences in strategy and in breadth of coverage often lead different search engines to give different answers to the same query. Metasearch systems thus draw on the potential of other information retrieval tools.
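How a metasearch system might combine answers from several engines can be shown with a short sketch. The text does not specify the combining rule, so summing reciprocal ranks (1/position in each list) is an assumption chosen for simplicity; the function merge_results and the result lists are invented.

```python
from collections import defaultdict

def merge_results(result_lists):
    """Merge ranked lists of links from several search tools.

    Each link scores 1/rank in every list where it appears; links are
    returned by descending total score, ties broken alphabetically.
    """
    scores = defaultdict(float)
    for results in result_lists:
        for rank, link in enumerate(results, start=1):
            scores[link] += 1.0 / rank
    return sorted(scores, key=lambda link: (-scores[link], link))

# Invented answers from two external search tools for the same query.
engine_a = ["a.example", "b.example", "c.example"]
engine_b = ["b.example", "d.example", "a.example"]
merged = merge_results([engine_a, engine_b])
print(merged)  # ['b.example', 'a.example', 'd.example', 'c.example']
```

A link ranked highly by several engines rises to the top, while a link known to only one engine still appears in the merged list.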

[Figure: typical scheme of a metasearch system. Elements: client, user interface, search engine, IPS 1 to IPS N, catalog 1 to catalog N, information resources; the system sends requests to the external tools and collects their answers.]

Using a metasearch engine

Metasearch engines are most effective at the initial stages of information retrieval. They let you quickly check whether the necessary information is on the Internet and identify the search tools in which it is present. They also reduce the time spent searching, since in processing a user request they access several different search engines simultaneously.

Types of metasearch engines

Network metasearch engines are available through the network. Global metasearch engines available via the Internet include:

MetaCrawler (http://www.metacrawler.com);
WebCrawler (http://www.webcrawler.com);
Search.com (http://www.search.com).

The best-known Russian metasearch engines are:

MetaBot.ru (http://metabot.ru);
Nigma (http://nigma.ru).

The advantage of Russian search tools is the correct processing of requests in the national language.

Specialized search tools

File search systems, for example, FileSearch.ru (http://www.filesearch.ru).
Systems that search electronic media news, for example, Yandex News (http://news.yandex.ru), Google News (http://news.google.ru).
Product search, for example, Yandex Market (http://market.yandex.ru), Torg.ru (http://www.torg.ru).
People search, for example, POISKI.ru (http://poiski.ru), Poisk 24 (http://www.poisk24.de), Yahoo! People Search (http://people.yahoo.com).
Image search, for example, Yandex Pictures (http://images.yandex.ru), Google Pictures (http://images.google.ru).
Video search, for example, Yandex Video (http://video.yandex.ru), Google Video (http://video.google.ru).

Additional search tools and methods

On the Internet, you can search for information not only with search engines but also in other ways. Many sites, services, and users on the web can help with your search. Such services include question-answer systems, forums, Internet communities (social networks), e-mail, and chats. What all these ways of obtaining information have in common is that other people (not programs) answer your questions.

Question-answer systems: Answers Mail.ru (http://answer.mail.ru), Google Questions and Answers (http://answer.google.ru), Znatok.ru (http://znatok.ru).

These methods are additional because:

they are not universal (they accumulate addresses in insufficient volume or only in narrow directions);
there is no guarantee of getting an answer (the question can simply be ignored), and in such systems an answer can sometimes take a long time.

The main advantage of additional search methods is the high accuracy of the information obtained.

Recommendations for information search

Make sure that the word (phrase) of the query is spelled correctly. The search engine may correct your request if the misspelled word is in common use; rare words or phrases may not be found. Be aware that, because of the large volume of the Internet, search engines usually return something for any request: for example, for the request asgr vkt 5, which at first glance is a meaningless set of characters, the Yandex search engine found 12 web pages containing this phrase. Be careful.

Specify the request. The more precise the query phrase, the more likely you are to find the information quickly; for example, the results for Yesenin's poem and for Yesenin's poem of the early years will differ.

Use synonyms. If your query did not find the information you need, try replacing a word with its synonym, such as RAM, random-access memory, or main memory. Different words and phrases produce different results. Use words that are likely to appear on the websites you are looking for.

When drafting a request, always try to imagine the likely content of the document you want. For example, if you need information about A. S. Pushkin, it is not enough to give only his last name in the request (the result list will contain many institutions located on Pushkin streets in different cities). The search will be more effective if the titles of the poet's works are added to the surname. To find the texts of works, enter individual lines from them (preferably lines that are rarely quoted).

Do not enter a query in ordinary colloquial form. For the request What is the weather now in Nizhny Novgorod? the search engine will find documents that include all the words of the query, namely texts containing this question (for example, texts of literary works). Here it is more effective to enter the request weather in Nizhny Novgorod; the required information will appear within the first ten links of the answer. Try to write query words in lowercase letters only: such a query can find additional documents.

Search for similar documents. If one of the found documents is closer to your topic than the rest, click the "find similar documents" link. The search engine will analyze the page and find documents similar to it. If this page has been deleted from the server and the search engine has not yet removed it from the index, you will receive the message "The requested document was not found".

Use the signs "+" and "-". To exclude documents in which a certain word occurs, put a minus sign in front of it. Conversely, to make sure that a certain word is present in the document, put a plus sign in front of it. Note that there must be no space between the sign and the word. You can also use other special commands to refine the query. A list of them can be found in the system help, usually on the "Query Language" page.
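The effect of the "+" and "-" signs can be sketched as a simple filter over documents. This only illustrates the rule as stated above and is not any engine's real query processor; the function matches and the sample texts are assumptions, and words without a sign are treated as optional and ignored.

```python
def matches(query, document):
    """Return True if the document satisfies the +word and -word terms."""
    words = set(document.lower().split())
    for term in query.lower().split():
        if term.startswith("+") and term[1:] not in words:
            return False  # a required word is missing
        if term.startswith("-") and term[1:] in words:
            return False  # an excluded word is present
    return True

print(matches("+pushkin -street", "Poems by Pushkin"))        # True
print(matches("+pushkin -street", "Pushkin street 10 cafe"))  # False
```

Attaching the sign directly to the word is what lets the query be split into terms unambiguously, which is why a space after the sign is not allowed.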

Search for exact phrases. If you know the exact phrase that should appear on the page, put it in the request in quotation marks. For example, "Wide scope for dreams and for life The coming years open up for us".

Use regional search engines. For more complete information in a language other than English, you can use regional systems that work with this language. In many countries, regional systems cover a wide range of resources. The largest search engine in Russia is Yandex (http://www.yandex.ru).
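The exact-phrase tip can be illustrated with a minimal sketch that checks whether the words of a phrase occur in a document consecutively and in the same order. The function contains_phrase is hypothetical, and real engines normalize punctuation and word forms in ways this sketch does not.

```python
def contains_phrase(document, phrase):
    """True if the phrase words appear in the document as one unbroken run."""
    doc_words = document.lower().split()
    phrase_words = phrase.lower().split()
    n = len(phrase_words)
    if n == 0:
        return False
    return any(
        doc_words[i:i + n] == phrase_words
        for i in range(len(doc_words) - n + 1)
    )

print(contains_phrase("The coming years open up for us", "years open up"))  # True
print(contains_phrase("open the years up", "years open up"))                # False
```

The second call fails even though all three words are present, which is exactly the difference between a quoted phrase and an ordinary query.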

Use specialized search engines. If you are looking for pictures, videos, products, maps, and some other kinds of information, you can find them faster with specialized search engines designed for these purposes. Many general-purpose search engines have special interfaces for searching these types of information (see the descriptions of specific systems). The search request in this case can be, for example: image search.

If the source of information is an organization, try to search for the information on the website of that organization. Search engines may not be aware of all the information stored on Internet sites. Go to the website of the organization from which the information came; it may contain more detail. Sites often have local search engines (which search only that site), or you can try to find the information by navigating through the sections of the site. For example, if you heard a broadcast on the radio and know the name of the radio station, look up information about the program on the station's official website.

Ask other people for help in finding information. There are special systems on the Internet (for example, question-answer systems) in which users help each other find information. Other people may already have been interested in the same question and may know the correct answer.

Internet search methods

Three Ways to Search the Web

The Internet in general, and the World Wide Web in particular, provide the subscriber with access to thousands of servers and millions of Web pages that store an unimaginable amount of information. How not to get lost in this "information ocean"? To do this, you need to learn how to search and find the necessary information on the network.

As already mentioned, there are three main ways to find information on the Internet.

1. Specifying the page address. This is the fastest search method, but it can only be used if the address of the document is known exactly.

2. Navigation via hyperlinks. This is the least convenient method, since it only leads to documents that are close in meaning to the current one. If the current document is devoted to music, for example, its hyperlinks are unlikely to lead to a site devoted to sports.

3. Contacting a search server (search engine). Using search engines is the most convenient way to find information. Currently, the following search servers are popular in the Russian-speaking part of the Internet:

Yandex;
Rambler;
Aport.

There are other search engines as well. For example, an efficient search system is implemented on the mail.ru mail service server.

Search servers

The most accessible and convenient way to find information on the World Wide Web is to use search engines. At the same time, information can be searched for by catalogs, as well as by a set of keywords characterizing the searched text document.

Consider the use of search servers in more detail. A search server contains a large number of links to a variety of documents, and all these links are systematized in thematic directories, for example: sports, movies, cars, games, science, and so on. The server collects these links itself, automatically, by regularly viewing the Web pages that appear on the World Wide Web. In addition, search servers let the user search for information by keywords. After keywords are entered, the search server looks them up in its index of documents from other Web servers and displays links to those documents in which the specified words are found. Typically, search results are sorted in descending order of a special document rating that indicates how well a given document matches the search criteria or how often it is requested on the web.



Search engine query language

A group of keywords formed according to certain rules, that is, using the query language, is called a request to the search server. Query languages for different search engines are very similar. You can learn more in the "Help" section of the relevant search server. Consider the rules for generating queries using the Yandex search engine as an example.

Operator syntax    Meaning                                                       Request example
space or &         Logical AND (within a sentence)                               physical therapy
&&                 Logical AND (within the document)                             recipes && (processed cheese)
|                  Logical OR                                                    photo | photography | snapshot | photographic image
+                  Mandatory presence of the word in the found document          +to be or +not to be
()                 Grouping of words                                             (technology | production) (cheese | cottage cheese)
~                  Binary AND NOT operator (within a sentence)                   banks ~ law
~~ or -            Binary AND NOT operator (within the document)                 Paris travel guide ~~ (agency | tour)
/(n m)             Distance in words (minus (-): back, plus (+): forward)        suppliers /2 coffee; music /(-2 4) education; vacancies ~ /+1 students
" "                Phrase search                                                 "little red riding hood" (equivalent to: red /+1 riding hood)
&& /(n m)          Distance in sentences (minus (-): back, plus (+): forward)    bank && /1 taxes
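The word-distance operator from the table can be approximated with a short sketch. It assumes that a request such as suppliers /2 coffee means the two words occur at most two positions apart in either direction; the exact semantics should be checked in the Yandex help. The function within_distance and the sample texts are invented for illustration.

```python
def within_distance(document, first, second, max_distance):
    """True if some occurrence of `first` lies within max_distance words
    of some occurrence of `second` (in either direction)."""
    words = document.lower().split()
    first_positions = [i for i, w in enumerate(words) if w == first]
    second_positions = [i for i, w in enumerate(words) if w == second]
    return any(
        abs(i - j) <= max_distance
        for i in first_positions
        for j in second_positions
    )

print(within_distance("suppliers of coffee", "suppliers", "coffee", 2))                # True
print(within_distance("suppliers of green arabica coffee", "suppliers", "coffee", 2))  # False
```

Checks like this rely on the index storing word positions, not just the list of documents in which each word occurs.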

To get the best search results, you need to remember a few simple rules:

1. Do not search for information on only one keyword.

2. It's best not to enter keywords in capital letters, as this may result in the same words written in lower case not being found.

3. If your search doesn't return any results, check for spelling errors in your keywords.

Modern search engines can apply a semantic analyzer to the generated query. With its help, entering a word selects documents that contain derivatives of this word in various cases, tenses, and so on.

Information technologies of information retrieval

Search for information: basic concepts, types and forms of organization

Information search, or information retrieval, is one of the main information processes. Mankind has been doing it since ancient times. The goals, possibilities, and nature of the search have always depended on the availability of information, its importance and accessibility, and the means of organizing the search.

The end of the 20th - the beginning of the 21st century is characterized by huge arrays of constantly growing diverse information that is accessible and of interest to the widest strata of society. Moreover, Internet technologies and software and hardware tools, also available to most people, allow this process to be carried out at any time, almost anywhere, for any request.

Search is a process in which the item sought is compared, in one sequence or another, with each object stored in the array. The purpose of any search stems from the need or desire to find various types of information that help the searcher obtain the information and knowledge needed to raise his or her professional, cultural, or other level; to create new information and form new knowledge; to make managerial decisions; and so on.

According to experts, there are 30 or more million users on the Internet. Of these, tens of thousands are online (English "on-line": interactive access at any time), and the number of such users is constantly growing. This makes it difficult to organize prompt search and to find the information needed by so many users. Problems arise from the variety of possibilities (types) of information retrieval, the different ways they are implemented in information retrieval systems (IPS), and the different levels of user knowledge about the capabilities of such systems, especially in generating queries and processing the data obtained by executing them.

It is assumed that in the future IPS will be created that can automatically adapt to the level of knowledge and the requests of specific users, accept requests in natural language, and, using artificial intelligence, return relevant and pertinent information. The creation of such IPS will require the intelligence and knowledge of specific IPS users or their intermediaries. In the meantime, the wide range of search engine users is expected to have a fairly good command of this subject area.

There are various interpretations of the term "information search", or "information retrieval".

The term "information retrieval" was introduced by the American mathematician C. Mooers. He noted that the motive for such a search is an information need expressed in the form of an information request. Mooers classified documents, information about their presence and (or) location, and factual information as objects of information retrieval.

Librarians were the first to solve the problems of factographic search. They developed information retrieval tools called the "reference and search apparatus" (catalogs, bibliographic indexes, and so on). In the domestic professional press, this term has been used since the 1970s. Librarians define "information retrieval" as finding, in an information array, the documents that correspond to the information request of users.

From the point of view of computer technology, "information retrieval" is a set of logical and technical operations whose ultimate goal is to find documents, information about them, facts, and data relevant to the consumer's request.

"Relevance" is the correspondence, established during information retrieval, of the content of a document to the information request, or of the search image of a document to the search prescription.

There are other definitions as well. In any case, information retrieval arises from the need to satisfy the information needs of users, who expect to receive the data or information they need quickly with the help of search engines. It is a method of targeted search and retrieval of relevant documents and/or facts from various sources of information, such as databanks or storage devices. These sources may be living or non-living objects that act as sources and carriers of information.

Systems that implement such information retrieval are called search engines (PS). In traditional technologies, PS are card files and catalogs, address and other directories, indexes, encyclopedias, and the reference apparatus of publications and other materials.

In 1945, the American scientist and engineer V. Bush, in his article "As We May Think", was the first to raise widely the question of the need to mechanize information retrieval. Automated search engines that work with information have appeared since the 1960s. Since then, intensive work has been carried out on forming and implementing the principles and methods of information retrieval.

"Search engines" perform a search among the documents of the database or other arrays of machine-readable data containing the given words.

Electronic PS, accessed from conventional or intelligent terminals (PCs), enable users to compose search queries from formal and content-descriptive elements with special logical operators, and they search among database documents or other arrays of machine-readable data containing the specified words. Search engines support only search procedures and related processes.
