33 open source crawler software tools for capturing data

Want to work with big data? You can't do much without data. Here are 33 open source crawler tools recommended for everyone.

A crawler, or web crawler, is a program that automatically retrieves web content. It is an important component of search engines, so much of search engine optimization is, in effect, optimization for crawlers.

A web crawler is a program that automatically fetches web pages; it downloads pages from the World Wide Web for a search engine and is an important component of it. A traditional crawler starts from the URLs of one or more seed pages and collects the URLs found on them; as it crawls, it keeps extracting new URLs from the current page and adding them to a queue until some stop condition is met. The workflow of a focused crawler is more complicated: it uses web page analysis algorithms to filter out links irrelevant to its topic, keeps the useful ones, and puts them into the queue of URLs waiting to be crawled. It then selects the next URL to fetch from the queue according to a search strategy and repeats the process until a stop condition is reached. In addition, every page the crawler fetches is stored, analyzed, filtered, and indexed by the system for later querying and retrieval; for a focused crawler, the results of this analysis can also feed back into and guide the subsequent crawling process.
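To make that workflow concrete, here is a minimal sketch of the crawl loop in Java, using the standard java.net.http client and a crude regex for link extraction. The seed URL and stop condition are placeholders chosen for illustration; a real crawler would add robots.txt handling, politeness delays, and proper HTML parsing.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.ArrayDeque;
import java.util.HashSet;
import java.util.Queue;
import java.util.Set;
import java.util.regex.Matcher;
import java.util.regex.Pattern;

public class MiniCrawler {

    // Placeholder stop condition and a crude href pattern, for illustration only.
    private static final int MAX_PAGES = 50;
    private static final Pattern LINK = Pattern.compile("href=[\"'](https?://[^\"'#]+)[\"']");

    public static void main(String[] args) {
        HttpClient client = HttpClient.newHttpClient();
        Queue<String> queue = new ArrayDeque<>();   // URLs waiting to be crawled
        Set<String> seen = new HashSet<>();         // URLs already taken from the queue
        queue.add("https://example.org/");          // the initial (seed) URL
        int fetched = 0;

        while (!queue.isEmpty() && fetched < MAX_PAGES) {   // the "stop condition"
            String url = queue.poll();
            if (!seen.add(url)) continue;                   // skip pages already visited
            try {
                HttpRequest request = HttpRequest.newBuilder(URI.create(url)).GET().build();
                String body = client.send(request, HttpResponse.BodyHandlers.ofString()).body();
                fetched++;

                // A real system would store/analyze/index the page here; a focused crawler
                // would also score each extracted link for topic relevance before queueing it.
                Matcher m = LINK.matcher(body);
                while (m.find()) {
                    String next = m.group(1);
                    if (!seen.contains(next)) queue.add(next);  // new URLs join the queue
                }
                System.out.println("fetched " + url + ", queue size " + queue.size());
            } catch (Exception e) {
                System.err.println("skipping " + url + ": " + e.getMessage());
            }
        }
    }
}
```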

Hundreds of crawler programs have already been built around the world. This article collects the well-known, commonly used open source crawlers and groups them by development language. Search engines have crawlers of their own, but this round-up covers only crawler software, not large, complex search engines, because many readers just want to crawl data rather than run a search engine.

Java crawler

1. Arachnid
Arachnid is a Java-based web spider framework. It contains a simple HTML parser that can parse an input stream containing HTML content. By subclassing Arachnid you can develop a simple web spider, adding just a few lines of code that are called after each page of a website is parsed. The Arachnid download package includes two example spider applications that demonstrate how to use the framework.

Features: Mini crawler framework with a small HTML parser

License: GPL

2. Crawlzilla
Crawlzilla is free software that helps you easily build a search engine. With it, you neither have to rely on a commercial company's search engine nor worry about the hassle of indexing your company's internal website.

It uses the Nutch project as its core, integrates more related packages, and adds a designed installation and management UI, making it easier for users to get started.

In addition to crawling basic HTML, Crawlzilla can also analyze files found on web pages, such as doc, pdf, ppt, ooo, and rss, so your search engine is not just a web page search engine but a full-data index of the site.

It also has Chinese word segmentation, which makes searches more accurate.

Crawlzilla's main feature and goal is to give users a convenient, easy-to-use search platform.

License Agreement: Apache License 2
Development Language: Java JavaScript SHELL
Operating System: Linux

Project home page: https://github.com/shunfa/crawlzilla

Download address: http://sourceforge.net/projects/crawlzilla/

Features: easy to install, with Chinese word segmentation

3. Ex-Crawler
Ex-Crawler is a web crawler developed in Java. The project is split into two parts: a daemon and a flexible, configurable web crawler. It uses a database to store web page information.

License Agreement: GPLv3
Development Language: Java
Operating system: cross platform

Features: runs as a daemon and uses a database to store web page information

4. Heritrix
Heritrix is an open source web crawler developed in Java that users can use to crawl the resources they want from the web. Its greatest strength is its extensibility, which makes it easy for users to implement their own crawling logic.
Heritrix uses a modular design; the modules are coordinated by a controller class (CrawlController), which is the core of the whole system.
Code hosting: https://github.com/internetarchive/heritrix3

• License Agreement: Apache
• Development Language: Java
• Operating System: Cross-platform

Features: strictly follows the exclusion directives in robots.txt files and META robots tags

5. heyDr

heyDr is a lightweight, open source, multi-threaded vertical-retrieval crawler framework written in Java and released under the GNU GPL v3 license.
Users can build their own vertical resource crawlers with heyDr to prepare the data for a vertical search engine.

License Agreement: GPLv3
Development Language: Java
Operating system: cross platform

Features: Lightweight open source multi-threaded vertical search crawler framework

6. ItSucks
ItSucks is an open source Java web spider (web robot, crawler) project. It supports download rules defined via download templates and regular expressions, and provides a Swing GUI.
Features: provides a Swing GUI

7. jcrawl
Jcrawl is a small, high-performance web crawler that can grab various types of files from web pages according to user-defined patterns, such as email addresses and QQ numbers.
License Agreement: Apache
Development Language: Java
Operating system: cross platform

Features: lightweight and high performance; can grab various types of files from web pages

8. JSpider
JSpider is a WebSpider implemented in Java. It is invoked as follows:
Jspider [URL] [ConfigName]

The URL must include the protocol name, such as http://, otherwise an error is reported. If ConfigName is omitted, the default configuration is used.

JSpider's behavior is configured entirely through configuration files: which plugins are used, how results are stored, and so on are all set in the conf\[ConfigName]\ directory. JSpider's default configuration is minimal and not very useful, but JSpider is very easy to extend and can be used to build powerful web crawling and data analysis tools. To do so, you need a thorough understanding of how JSpider works, and then develop plugins for your own needs and write the configuration files.

License Agreement: LGPL
Development Language: Java
Operating system: cross platform

Features: powerful and easy to extend

9. Leopdo
A web search engine and crawler written in Java, including full-text and categorized vertical search as well as a word segmentation system

License Agreement: Apache
Development Language: Java
Operating system: cross platform

Features: includes full-text and categorized vertical search, as well as a word segmentation system

10. MetaSeeker
MetaSeeker is a complete solution for web content capture, formatting, data integration, storage management, and search.
Web crawlers can be implemented in several ways. Grouped by where they are deployed, they fall into:

1. Server-side: usually a multi-threaded program that downloads many target pages at once. It can be written in PHP, Java, Python (currently popular), and so on, and can be made very fast; search engine crawlers are generally built this way. However, if the target site dislikes crawlers it may well block your IP, a server IP is not easy to change, and the bandwidth consumed is quite expensive. It is recommended to take a look at Beautiful Soup. (A minimal multi-threaded download sketch follows this list.)
2. Client-side: usually a focused (topic-specific) crawler. Building a general-purpose search engine this way is unlikely to succeed, but a vertical search, comparison-shopping, or recommendation engine is comparatively easy. Many such crawlers do not fetch every page; they take only the pages that matter to you, and only the content on those pages that matters, for example extracting yellow-pages information, product prices, or competitors' advertising (searching for SpyFu is interesting). Crawlers of this type can be deployed in large numbers, can behave aggressively, and are hard for the other side to block.
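As a rough illustration of the server-side pattern from item 1, here is a minimal multi-threaded download sketch in Java using a fixed thread pool and the standard java.net.http client. The target URLs are placeholders; a production crawler would add per-host rate limiting, retries, and robots.txt handling.

```java
import java.net.URI;
import java.net.http.HttpClient;
import java.net.http.HttpRequest;
import java.net.http.HttpResponse;
import java.util.List;
import java.util.concurrent.ExecutorService;
import java.util.concurrent.Executors;

public class ParallelFetcher {
    public static void main(String[] args) {
        // Hypothetical target list; a real server-side crawler feeds this from its URL queue.
        List<String> targets = List.of(
                "https://example.org/a",
                "https://example.org/b",
                "https://example.org/c");

        HttpClient client = HttpClient.newHttpClient();
        ExecutorService pool = Executors.newFixedThreadPool(4);  // download several pages at once

        for (String url : targets) {
            pool.submit(() -> {
                try {
                    HttpRequest req = HttpRequest.newBuilder(URI.create(url)).GET().build();
                    String body = client.send(req, HttpResponse.BodyHandlers.ofString()).body();
                    System.out.println(url + " -> " + body.length() + " bytes");
                } catch (Exception e) {
                    System.err.println(url + " failed: " + e.getMessage());
                }
            });
        }
        pool.shutdown();  // let submitted downloads finish, then exit
    }
}
```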
MetaSeeker's web crawler belongs to the second (client-side) category. The MetaSeeker toolkit leverages the capabilities of the Mozilla platform: anything Firefox can display, it can extract.

The MetaSeeker toolkit is free to use and can be downloaded at www.gooseeker.com/cn/node/download/front

Features: web crawling, information extraction, and data extraction toolkit; easy to operate

11. Playfish
Playfish is a web crawler that uses Java technology to integrate multiple open source Java components and implements highly customizable and extensible web crawling through XML configuration files.
The open source jar packages it uses include httpclient (content fetching), dom4j (configuration file parsing), and jericho (HTML parsing); they are already included in the lib directory of the war package.
The project is still immature, but the functionality is basically complete. It requires users to be familiar with XML and with regular expressions. At present the tool can capture all kinds of forums, message boards, and CMS systems; articles on sites built with Discuz!, phpBB, and common forum and blog software can be easily crawled. The crawl definitions are entirely XML-based, which suits Java developers.
How to use it: 1. Download the .war package and import it into Eclipse. 2. Create the sample database using the wcc.sql file under WebContent/sql. 3. Modify dbConfig.txt under the wcc.core package in src, setting your own MySQL username and password. 4. Run SystemCore; output appears in the console. With no arguments it runs the default example.xml configuration file; with an argument, the argument is taken as the configuration file name.
The system comes with three examples: baidu.xml crawls Baidu, example.xml crawls the author's JavaEye blog, and bbs.xml crawls the content of a Discuz forum.

License Agreement: MIT
Development Language: Java
Operating system: cross platform

Features: Highly customizable and extensible through XML configuration files

12. Spiderman
Spiderman is a web spider based on a microkernel + plugin architecture. Its goal is to capture complex target-page information and parse it into the business data you need in a simple way.
How do you use it?

First, identify your target site and target pages (i.e. a certain type of page from which you want to get data, such as the news article pages of NetEase News).
Then, open a target page, analyze its HTML structure, and work out the XPath expressions for the data you want (a generic illustration of XPath extraction follows these steps).
Finally, fill the parameters into an XML configuration file and run Spiderman!
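As a generic illustration of the second step, the sketch below extracts two fields with XPath expressions using only the JDK's built-in javax.xml.xpath, on a tiny well-formed stand-in page. The markup and expressions are invented for the example; Spiderman itself reads such expressions from its XML configuration and handles real, messy HTML for you.

```java
import java.io.StringReader;
import javax.xml.parsers.DocumentBuilderFactory;
import javax.xml.xpath.XPath;
import javax.xml.xpath.XPathFactory;
import org.w3c.dom.Document;
import org.xml.sax.InputSource;

public class XPathExtractDemo {
    public static void main(String[] args) throws Exception {
        // A tiny, well-formed stand-in for a real target page.
        String page = "<html><body><div id=\"article\">"
                    + "<h1>Example headline</h1>"
                    + "<span id=\"pubtime\">2014-01-01</span>"
                    + "</div></body></html>";

        Document doc = DocumentBuilderFactory.newInstance()
                .newDocumentBuilder()
                .parse(new InputSource(new StringReader(page)));

        XPath xpath = XPathFactory.newInstance().newXPath();
        // The kind of expressions you would place in Spiderman's XML configuration.
        String title = xpath.evaluate("//div[@id='article']/h1/text()", doc);
        String time  = xpath.evaluate("//span[@id='pubtime']/text()", doc);

        System.out.println(title + " | " + time);
    }
}
```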

License Agreement: Apache
Development Language: Java
Operating system: cross platform

Features: flexible and extensible microkernel + plugin architecture; data capture can be done through simple configuration, without writing any code

13. webmagic
Webmagic is a crawler framework that requires no configuration and is convenient for secondary development. It provides a simple and flexible API, so a crawler can be implemented with a small amount of code.

Webmagic adopts a completely modular design, and its functions cover the whole crawler life cycle (link extraction, page downloading, content extraction, and persistence). It supports multi-threaded crawling, distributed crawling, automatic retries, custom UA/cookies, and more.

Webmagic includes powerful page extraction capabilities: developers can easily use CSS selectors, XPath, and regular expressions for link and content extraction, and multiple selectors can be chained together.
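To show how little code a webmagic crawler needs, here is a minimal sketch written against webmagic's documented PageProcessor API. The seed URL, link pattern, and XPath below are illustrative placeholders; see the documentation linked below for the authoritative interface.

```java
import us.codecraft.webmagic.Page;
import us.codecraft.webmagic.Site;
import us.codecraft.webmagic.Spider;
import us.codecraft.webmagic.processor.PageProcessor;

public class BlogPageProcessor implements PageProcessor {

    // Site-level options: retries, politeness delay, and so on.
    private final Site site = Site.me().setRetryTimes(3).setSleepTime(1000);

    @Override
    public void process(Page page) {
        // Queue further links discovered on the current page (link extraction).
        page.addTargetRequests(page.getHtml().links().regex(".*/article/\\d+").all());
        // Extract content with XPath; results are handed to the configured pipeline (persistence).
        page.putField("title", page.getHtml().xpath("//h1/text()").toString());
    }

    @Override
    public Site getSite() {
        return site;
    }

    public static void main(String[] args) {
        Spider.create(new BlogPageProcessor())
              .addUrl("http://example.com/")    // hypothetical seed URL
              .thread(5)                        // multi-threaded fetching
              .run();
    }
}
```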
Webmagic documentation: http://webmagic.io/docs/

View source code: http://git.oschina.net/flashsword20/webmagic

License Agreement: Apache
Development Language: Java
Operating system: cross platform

Features: covers the entire crawler life cycle; uses XPath and regular expressions for link and content extraction.

Remarks: a Chinese open source project contributed by Huang Yihua

14. Web-Harvest
Web-Harvest is a Java open source web data extraction tool. It collects specified web pages and extracts useful data from them. Web-Harvest mainly uses technologies such as XSLT, XQuery, and regular expressions to operate on text/xml.
It works by fetching the entire content of a page with httpclient according to a predefined configuration file (httpclient has been covered in other articles on this blog), and then applying XPath, XQuery, regular expressions, and other techniques to the text/xml to filter it down to the exact data you want. The vertical search sites of the previous couple of years (such as Kuxun) were built on a similar principle. The key to a Web-Harvest application is understanding and defining the configuration files; the rest is Java code for processing the data. Of course, before the crawler starts you can also populate Java variables into the configuration file to achieve dynamic configuration.
License Agreement: BSD
Development Language: Java

Features: uses XSLT, XQuery, regular expressions, and other technologies to operate on text or XML, and has a visual interface

15. WebSPHINX
WebSPHINX is a Java class library and an interactive development environment for web crawlers. A web crawler (also called a robot or spider) is a program that automatically browses and processes web pages. WebSPHINX consists of two parts: the crawler workbench and the WebSPHINX class library.
License Agreement: Apache

Development language: Java

Features: two parts, a crawler workbench and the WebSPHINX class library

16. YaCy
YaCy is a P2P-based distributed web search engine. It is also an HTTP cache proxy server. The project is a new approach to building a P2P-based web index network. It can search your own index or the global one, crawl your own web pages, start distributed crawling, and more.

License Agreement: GPL
Development Language: Java Perl
Operating system: cross platform

Features: P2P-based distributed web search engine

Python crawler
17. QuickRecon
QuickRecon is a simple information-gathering tool that helps you find subdomains, perform zone transfers, collect email addresses, use microformats to find relationships, and more. QuickRecon is written in Python and supports the Linux and Windows operating systems.

License Agreement: GPLv3
Development Language: Python
Operating System: Windows Linux

Features: finds subdomains, collects email addresses, finds relationships, and more

18. PyRailgun
PyRailgun is a very easy-to-use crawler: a simple, practical, and efficient Python web crawling module that supports crawling pages rendered with JavaScript.

License Agreement: MIT
Development Language: Python
Operating System: Cross-platform Windows Linux OS X

Features: simple, lightweight, and efficient web crawling framework

Remarks: this software is also open-sourced by a Chinese developer.

Github download: https://github.com/princehaku/pyrailgun#readme

19. Scrapy
Scrapy is an asynchronous processing framework based on Twisted and a crawler framework implemented in pure Python. Users only need to customize a few modules to implement a crawler for grabbing web content and all kinds of images. It is very convenient.

License Agreement: BSD
Development Language: Python
Operating system: cross platform

Github source code: https://github.com/scrapy/scrapy

Features: Twisted-based asynchronous processing framework with complete documentation

C++ crawler

20. hispider
HiSpider is a fast, high-performance spider.

Strictly speaking it is only the skeleton of a spider system, without the requirements fleshed out. At present it can only extract URLs, deduplicate URLs, resolve DNS asynchronously, queue tasks, support distributed downloading across N machines, and support site-targeted downloading (which requires configuring the hispiderd.ini whitelist).
Features and usage:
• Development based on unix/linux system
• Asynchronous DNS resolution
• URL deduplication
• Support HTTP compression encoding transfer gzip/deflate
• Character set detection with automatic conversion to UTF-8 encoding
• Document compression storage
• Support multiple download nodes distributed download
• Support website directed download (need to configure hispiderd.ini whitelist )
• View download statistics and control download tasks (stop and resume) via http://127.0.0.1:3721/
• Depends on the basic communication libraries libevbase and libsbase (you need to install these two libraries first)

Work process:
• Fetch a URL from the central node (including the task number and the IP and port corresponding to the URL; you may also need to resolve the address yourself)
• Connect to the server and send the request
• Wait for the response headers to decide whether the data is needed (currently mainly text-type content is taken)
• Wait for the data to complete (if there is a length header, wait for that many bytes; otherwise wait up to a fairly large size and then time out)
• When the data is complete or the timeout hits, zlib-compress the data and return it to the central server; the payload may include self-resolved DNS information plus the compressed length and compressed data, or, on error, just the task number and related information
• The central server receives the data with its task number and checks whether data is included; if not, the status for that task number is set to error, and if there is data, it extracts the links and appends the data to a document file
• Fetch a new task when done
License Agreement: BSD
Development Language: C/C++
Operating System: Linux

Features: supports multi-machine distributed downloading and site-targeted downloading

21. larbin
Larbin is an open source web crawler/web spider developed independently by the young Frenchman Sébastien Ailleret. Larbin's purpose is to follow the URLs on pages for extended crawling and ultimately provide a broad data source for search engines. Larbin is only a crawler: it only fetches web pages, and parsing is left entirely to the user. It also provides no database storage or indexing. A simple larbin crawler can fetch 5 million web pages a day.
With larbin, we can easily obtain or enumerate all the links of a single website and even mirror a site; we can also use it to build URL lists, for example retrieving the URLs of all pages and collecting their linked XML or mp3 files, or customizing larbin to serve as an information source for a search engine.

License Agreement: GPL
Development Language: C/C++
Operating System: Linux

Features: high-performance crawler software that is only responsible for crawling, not parsing

22. Methabot
Methabot is a speed-optimized, highly configurable crawler for the web, FTP, and local file systems.
License Agreement: Unknown
Development Language: C/C++
Operating System: Windows Linux

Features: speed-optimized; can crawl the web, FTP, and local file systems

Source code: http://www.oschina.net/code/tag/methabot

C# crawler

23. NWebCrawler
NWebCrawler is an open source web crawler developed in C#.
Characteristics:

• Configurable: number of threads, wait time, connection timeout, allowed MIME types and priorities, download folder.
• Statistics: number of URLs, total downloaded files, total downloaded bytes, CPU utilization, and available memory.
• Preferential crawling: the user can set priority MIME types.
• Robust: 10+ URL normalization rules and crawler-trap avoidance rules.

License Agreement: GPLv2
Development Language: C#
Operating System: Windows

Project homepage: http://www.open-open.com/lib/view/home/1350117470448

Features: statistics and process visualization

24. Sinawler
The first crawler program in China aimed at Weibo data! Formerly known as the "Sina Weibo crawler."
After logging in, you can specify a user as the starting point and, using that user's followees and followers as leads, follow the social graph to collect users' basic information, microblog posts, and comment data.
The data obtained with this application can serve as data support for scientific research and for research and development related to Sina Weibo, but must not be used for commercial purposes. The application is based on the .NET 2.0 framework, requires SQL Server as the back-end database, and provides database script files for SQL Server.
In addition, because of the limitations of the Sina Weibo API, the crawled data may be incomplete (for example, limits on the number of followers and the number of posts that can be retrieved).
This program is copyrighted by the author. You may, free of charge, copy, distribute, display, and perform the current work and create derivative works, but you may not use the current work for commercial purposes.
Version 5.x has been released! This version has six background worker threads: a robot that crawls basic user information, one that crawls user relationships, one that crawls user tags, one that crawls microblog content, one that crawls microblog comments, and one that adjusts the request frequency. Higher performance! The crawler's potential is maximized! Based on current test results, it already satisfies the author's own needs.
Features of this program:

1. Six background worker threads that maximize the crawler's performance potential!

2. Parameter settings available in the UI, flexible and convenient

3. The app.config configuration file is abandoned; configuration information is stored encrypted by the program itself, protecting database account information

4. Automatically adjusts the request frequency to avoid exceeding limits while also avoiding slowing down and losing efficiency

5. Full control over the crawler: it can be paused, resumed, or stopped at any time

6. Good user experience

License Agreement: GPLv3
Development Language: C# .NET
Operating System: Windows

25. spidernet
Spidernet is a multi-threaded web crawler based on a recursive tree model. It supports fetching text/html resources, can set the crawl depth and a maximum download byte limit, supports gzip decoding, supports resources encoded in GBK (gb2312) and UTF-8, and stores results in sqlite data files.
TODO: tags in the source code mark unfinished features; code contributions are welcome.

License Agreement: MIT
Development Language: C#
Operating System: Windows
Github source code: https://github.com/nsnail/spidernet

Features: multi-threaded web crawler with a recursive tree model; supports resources encoded in GBK (gb2312) and UTF-8; stores data in sqlite

26. Web Crawler
Smart and Simple Web Crawler is a web crawler framework with integrated Lucene support. The crawler can start from a single link or an array of links and offers two traversal modes: maximum iterations and maximum depth. Filters can be set to limit which links are crawled; three filters are provided by default, ServerFilter, BeginningPathFilter, and RegularExpressionFilter, and they can be combined with AND, OR, and NOT. Listeners can be added before and after the parsing process and before and after page loading. (Description from Open-Open.)
Development Language: Java
Operating system: cross platform
License Agreement: LGPL

Features: Multi-threading, support for capturing PDF/DOC/EXCEL and other document sources

27. Network Miner
Website data collection software: Network Miner Collector (formerly Soukey Picking)
Soukey Picking is open source website data collection software based on the .NET platform, and the only open source software in the web data collection category. Although Soukey Picking is open source, that does not limit what it offers; its functionality even exceeds that of some commercial software.

License Agreement: BSD
Development Language: C# .NET
Operating System: Windows

Features: feature-rich, in no way inferior to commercial software

PHP crawler

28. OpenWebSpider
OpenWebSpider is an open source multi-threaded web spider (robot, crawler) and search engine with many interesting features.
License Agreement: Unknown
Development Language: PHP
Operating system: cross platform

Features: Open source multi-threaded web crawler with many interesting features

29. PhpDig

PhpDig is a web crawler and search engine developed in PHP. It builds a vocabulary by indexing both dynamic and static pages. When a search query is made, it displays a results page containing the keywords, sorted by a certain ranking. PhpDig includes a templating system and can index PDF, Word, Excel, and PowerPoint documents. PhpDig is suited to more specialized, deeper personalized search and is the best choice for building a vertical search engine for a specific field.

Demo: http://www.phpdig.net/navigation.php?action=demo

License Agreement: GPL
Development Language: PHP
Operating system: cross platform

Features: capable of collecting web content and submitting forms

30. ThinkUp
ThinkUp is a social media insights engine that collects data from social networks such as Twitter and Facebook. It gathers data from an individual's social network accounts, archives and processes it with an interactive analysis tool, and graphs the data for more intuitive viewing.

License Agreement: GPL
Development Language: PHP
Operating system: cross platform
Github source code: https://github.com/ThinkUpLLC/ThinkUp

Features: a social media insights engine that collects data from social networks such as Twitter and Facebook; supports interactive analysis and visualizes the results

31. Weigou (micro-purchase)
The Weigou social shopping system is an open source shopping-sharing system built on the ThinkPHP framework. It is also an open source Taobao-affiliate website program for webmasters. It integrates more than 300 product data collection interfaces for Taobao, Tmall, Taobao affiliates, and other sites, providing affiliate webmasters with a turnkey site-building service: anyone who knows HTML can make a program template. It is free and open for download, and a first choice for Taobao affiliate webmasters.

Demo URL: http://tlx.wego360.com

License Agreement: GPL
Development Language: PHP

Operating system: cross platform

Erlang crawler

32. Ebot

Ebot is a scalable distributed web crawler developed in Erlang. URLs are stored in a database and can be queried via RESTful HTTP requests.

License Agreement: GPLv3
Development Language: Erlang
Operating system: cross platform

Github source code: https://github.com/matteoredaelli/ebot

Project home page: http://www.redaelli.org/matteo/blog/projects/ebot

Features: Scalable distributed web crawler

Ruby crawler

33. Spidr

Spidr is a Ruby web crawler that can crawl an entire website, multiple websites, or individual links completely to the local machine.

Development Language: Ruby

License Agreement: MIT

Features: can completely crawl one or more websites, or a single link, to the local machine
