
Multiple-Domain Crawler Performance over the Single-Domain Crawler


We can see that the crawling time of the single domain-specific crawler is higher than that of the multiple domain-specific crawler. When the single domain-specific crawler works through a large number of Web-pages, most of them are irrelevant and are discarded, whereas in the multiple domain-specific crawler most Web-pages are not irrelevant, because each page can belong to any one of the considered domains. Hence, the performance of the crawler increases when we consider multiple domains.

Multilevel Domain-Specific Web Search Crawler

In this subsection, we present an experimental study and discuss how to set up our multilevel domain-specific Web search crawler.

Experiment Procedure

The performance of our system depends on various parameters, which need to be set before running the system. The considered parameters are the domain relevance limit, the weight value assignment, the Ontology terms, etc. These parameters are assigned by tuning our system through experiments. We have assigned 20 seed URLs as input for each crawler to start the initial crawl.
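For illustration, these tuning parameters can be pictured as a small configuration structure that is fixed before the crawl starts. The field names and sample values below are assumptions made for this sketch, except the 20 seed URLs per crawler, which matches the setup described above.

```python
# Illustrative crawler configuration; names and sample values are assumptions,
# except the 20 seed URLs per crawler, which is stated in the text.
crawler_config = {
    "domain_relevance_limit": 12,      # static relevance cut-off (value discussed later)
    "tolerance_limit": 5,              # relaxation of the cut-off (value discussed later)
    "seed_urls_per_crawler": 20,       # each crawler starts from 20 seed URLs
    "ontology_terms": ["term_a", "term_b"],          # placeholder Ontology terms
    "term_weights": {"term_a": 0.6, "term_b": 0.4},  # weight value assignment per term
}
```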

Complexity Analysis

A few assumptions are made to calculate the time complexity of our system. (a) We deal with "n" terms, which include both the Ontology terms and their synonyms. (b) Downloading a Web-page takes time "d", because Internet speed is a major factor in downloading a Web-page. (c) We deal with "m" URL extension domains. (d) On average, we assume that "p" hyperlink URLs exist in a Web-page's content. (e) Constant time complexities are denoted by Ci, where "i" is a positive integer. In our approach, we use a parallel crawler mechanism, so even when multiple crawlers are used the time complexity remains the same.
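Under these assumptions, one plausible way to write the per-page cost is sketched below; the grouping of the constant factors is our own illustration and not a formula derived from the prototype.

```latex
% Illustrative per-page time cost under assumptions (a)-(e); the grouping is an assumption.
T_{\text{page}} \;\approx\; d \;+\; C_1\, n \;+\; C_2\, p\, m
% d       : time to download the Web-page
% C_1 n   : matching the n Ontology terms and synonyms against the page content
% C_2 p m : classifying the p extracted hyperlink URLs over the m URL extension domains
% With parallel crawlers each crawler handles its own pages, so this bound is unchanged.
```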

Accuracy Testing of Our Prototype

To produce an accuracy report, we have used the harvest rate. We have also given a comparative study with the performance of an existing unfocused crawler. In Fig. 4, we have given a harvest rate plot for the unfocused crawler: it crawls a large number of Web-pages but finds very few domain-specific Web-pages. For a focused crawler, the harvest rate is governed by the relevance limit of the domain and the tolerance limit value. The relevance limit is a predefined static relevance cut-off used to recognize whether a Web-page is domain-specific or not. The tolerance limit is also a numeric value; it relaxes the relevance limit by subtracting the tolerance limit from the relevance limit. We have reached an optimal tolerance limit value by testing in various phases and achieved a satisfactory harvest rate. We have shown the harvest rate of our multilevel domain-specific Web search crawler by taking the relevance limit and the tolerance limit as 12 and 5, respectively.
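A minimal sketch of how the relevance limit and tolerance limit interact with the harvest rate is given below. The relevance limit of 12 and tolerance limit of 5 come from the text above; the page scores and the harvest_rate helper are illustrative assumptions.

```python
RELEVANCE_LIMIT = 12   # static cut-off for domain relevance (from the text)
TOLERANCE_LIMIT = 5    # relaxes the cut-off (from the text)
EFFECTIVE_CUTOFF = RELEVANCE_LIMIT - TOLERANCE_LIMIT  # 12 - 5 = 7

def is_domain_specific(page_score: float) -> bool:
    """A page counts as domain-specific if its relevance score meets the relaxed cut-off."""
    return page_score >= EFFECTIVE_CUTOFF

def harvest_rate(page_scores: list[float]) -> float:
    """Fraction of crawled pages that are domain-specific (illustrative definition)."""
    if not page_scores:
        return 0.0
    relevant = sum(1 for s in page_scores if is_domain_specific(s))
    return relevant / len(page_scores)

# Example: 3 of 5 crawled pages clear the relaxed cut-off -> harvest rate 0.6.
print(harvest_rate([4.5, 8.0, 13.2, 7.1, 2.0]))
```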

Parallel Crawling Performance Report

We have given a performance report of our system. Starting from their respective seed URLs, the individual crawlers, i.e., the .com crawler, the .edu crawler, etc., crawl Web-pages simultaneously. We have taken our statistics at various time intervals. For example, after 10 min the .com crawler had crawled 111 Web-pages, the .edu crawler 113 Web-pages, and the .net and .in crawlers 109 and 101 Web-pages, respectively. According to our strategy, all the considered parallel crawlers crawl Web-pages simultaneously; hence the total performance of our system after 10 min is the summation of all the individual crawler outputs, i.e., 434 Web-pages. In the same way, we consider the other time intervals and measure our system performance.
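The aggregation itself is a simple summation over the per-crawler counts at each interval. The dictionary below reproduces the 10-minute figures quoted above; the helper function is only a sketch.

```python
# Pages crawled per URL-extension crawler after 10 minutes (figures from the text).
counts_after_10_min = {".com": 111, ".edu": 113, ".net": 109, ".in": 101}

def total_crawled(per_crawler_counts: dict) -> int:
    """System-wide throughput = sum of the simultaneously running crawlers' outputs."""
    return sum(per_crawler_counts.values())

print(total_crawled(counts_after_10_min))  # 434 Web-pages
```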

We have presented the results obtained from our proposed single, multiple, and multilevel domain-specific Web crawler prototypes. As discussed earlier, we found some drawbacks in the single and multiple domain-specific Web crawlers, and those drawbacks are resolved in the multilevel domain-specific Web crawler. The multilevel domain-specific Web crawler uses two classifiers, i.e., a Web-page content classifier and a Web-page URL classifier, to identify the Web-page domain and the URL extension region more precisely. In addition, we use the parallel crawling mechanism to download Web-pages faster. To perform a search operation, the Web searcher must give a search string together with the classifier 1 and classifier 2 inputs. Based on these user-given inputs, our prototype retrieves Web-pages from the Web-page repository.
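A minimal sketch of such a repository lookup, assuming a very simple repository layout, is given below; the field names, sample domains, and matching rules are illustrative assumptions, not the prototype's actual storage structure.

```python
# Illustrative Web-page repository entries; field names and domains are assumptions.
repository = [
    {"url": "http://example.edu/courses", "domain": "education", "extension": ".edu",
     "content": "course catalogue and admission details"},
    {"url": "http://example.com/cricket-news", "domain": "sports", "extension": ".com",
     "content": "latest cricket match reports"},
]

def search(query: str, classifier1_domain: str, classifier2_extension: str):
    """Return repository pages matching the searcher's domain (classifier 1),
    URL extension region (classifier 2), and search string."""
    return [page for page in repository
            if page["domain"] == classifier1_domain
            and page["extension"] == classifier2_extension
            and query.lower() in page["content"].lower()]

print(search("cricket", "sports", ".com"))
```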

Our prototype supports multiple domains by using multiple Ontologies, and it is scalable: if we need to increase the number of supported domains, we only need to include the new domain Ontology and its related details, such as the weight table, syntable, etc. Because Web-page retrieval is driven by the classifier 1 and classifier 2 inputs, our prototype traverses very few Web-pages and therefore produces results faster. Finally, our prototype gives the Web searcher a provision to customize the search result by varying the classifier 1 and classifier 2 inputs. In this post, we have given a detailed description of crawling domain-specific Web-pages from the Internet and generating a domain-specific Web-page repository. The structure of this Web-page repository plays a big role in search engine performance.
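As a rough illustration of this scalability, adding a domain could amount to registering one more Ontology entry together with its weight table and syntable; the data structure and the sample domain below are assumptions made for the sketch.

```python
# Illustrative registry of supported domains; the structure is an assumption.
domain_ontologies = {}

def add_domain(name, ontology_terms, weight_table, syntable):
    """Scaling up: a new domain only needs its Ontology terms, weight table, and syntable."""
    domain_ontologies[name] = {
        "ontology_terms": ontology_terms,
        "weight_table": weight_table,
        "syntable": syntable,
    }

# Hypothetical new domain registration.
add_domain("sports",
           ontology_terms=["cricket", "tournament"],
           weight_table={"cricket": 0.7, "tournament": 0.3},
           syntable={"cricket": ["cricket match"]})
```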

