Krzysztof Łuczka 2024, MIT license

Introduction

scrapium.h is a simple, lightweight and fast C++ web scraping library. It aims to provide a high-level, easy way to scrape the web without worrying about code portability. The source code is distributed under the MIT license and can be freely viewed.

Note that this pre-release version is likely not bug-free. The full version will be released together with the additional features described in 'Development plans'.

Scraping

A basic scraping program can look like this:

#include "scrapium.h"

int main() {
    scrapium::contents content = scrapium::scrape( "https://www.example.com/", "<p>", "</p>" );

    content.print( scrapium::print_type::JSON );
}

The first argument is the address of the website. The scrape() function stores the scraped content in the predefined contents class, which takes care of saving and printing it.

Custom scraping

If we provide three arguments, the last two are used as the start and end delimiters. The scrape() function returns everything found between each pair of these delimiters. For example, the code above prints this:

{
    "0": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. ",
    "1": "<a href="https://www.iana.org/domains/example">More information...</a> ",
}
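To make the start/end matching concrete, here is a minimal standalone sketch of delimiter-based extraction. It is not the scrapium.h implementation: the helper name extract_between is hypothetical, and the sketch assumes matches are found left to right without overlapping.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of scrapium.h: collects every substring that
// lies between an occurrence of `start` and the next occurrence of `end`.
std::vector<std::string> extract_between(const std::string& text,
                                         const std::string& start,
                                         const std::string& end) {
    std::vector<std::string> results;
    if (start.empty() || end.empty()) return results;  // nothing sensible to match

    std::size_t pos = 0;
    while (true) {
        const std::size_t open = text.find(start, pos);
        if (open == std::string::npos) break;           // no more start markers
        const std::size_t begin = open + start.size();
        const std::size_t close = text.find(end, begin);
        if (close == std::string::npos) break;          // unterminated match: stop
        results.push_back(text.substr(begin, close - begin));
        pos = close + end.size();                       // continue after the end marker
    }
    return results;
}
```

Called as extract_between( page, "<p>", "</p>" ) on a downloaded page, this would yield one entry per paragraph, in document order.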

HTML tag scraping

If we provide two arguments, the last one is used as a tag name. The function captures the contents of every such tag, regardless of its properties/attributes.

scrapium::contents content = scrapium::scrape( "http://www.wierszespodtaboreta.pl/", "a" );

After calling content.print( scrapium::print_type::JSON ), this prints:

{
    "0": "<div class="button"> <cite>Taboret I</cite>, Zbigniew Kaczmarek </div>",
    "1": "<div class="button"> <cite>Parvum Opus</cite>, Krzysztof Łuczka i Jakub Pinkowski </div>",
    "2": "<div class="button"> <cite>Brzydota</cite>, Rafał Skałecki </div>",
    "3": "<div class="button"> <cite>Bańka</cite>, Rafał Skałecki </div>",
}

Note that it will not capture standalone tags.
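The behaviour described in this section can be approximated with the following standalone sketch. It is only an illustration, not the library's code: the helper name extract_tag is hypothetical, matching is case-sensitive, nested tags of the same name are not handled, and a '>' inside an attribute value would confuse it.

```cpp
#include <string>
#include <vector>

// Hypothetical helper, not part of scrapium.h: returns the contents of every
// <tag ...>...</tag> pair, ignoring whatever attributes the opening tag has.
std::vector<std::string> extract_tag(const std::string& html, const std::string& tag) {
    std::vector<std::string> results;
    if (tag.empty()) return results;

    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";

    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        const std::size_t after = pos + open.size();
        // The tag name must end here, so that "<a" does not match "<abbr".
        if (after >= html.size() ||
            (html[after] != '>' && html[after] != ' ' &&
             html[after] != '\t' && html[after] != '\n')) {
            pos = after;
            continue;
        }
        const std::size_t tag_end = html.find('>', after);
        if (tag_end == std::string::npos) break;
        const std::size_t content_begin = tag_end + 1;
        const std::size_t content_end = html.find(close, content_begin);
        if (content_end == std::string::npos) break;  // no closing tag: nothing to capture
        results.push_back(html.substr(content_begin, content_end - content_begin));
        pos = content_end + close.size();
    }
    return results;
}
```

Because a match requires a closing tag, a tag that is never closed produces no entry in this sketch.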

Results

The scrapium::contents class provides a print() function that supports the following view types:

  • scrapium::print_type::RAW - will display every line separated by a newline character
  • scrapium::print_type::JSON - will display results in a JSON format
  • scrapium::print_type::XML - will display results in an XML format
  • scrapium::print_type::PHP - will display results in a PHP serialization format
  • scrapium::print_type::YAML - will display results in a YAML format
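As an illustration of how a list of results maps onto two of these view types, here is a minimal standalone sketch covering RAW and JSON only. The names view_type, json_escape and render are hypothetical and are not part of scrapium.h. The sketch escapes only quotes, backslashes and newlines, and omits the trailing comma, so its output may differ from what the library prints.

```cpp
#include <string>
#include <vector>

// Hypothetical stand-in for two of the view types listed above.
enum class view_type { RAW, JSON };

// Escapes the characters that would otherwise end or corrupt a JSON string.
// Other control characters are passed through unchanged in this sketch.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            default:   out += c;
        }
    }
    return out;
}

// Renders the items either one per line (RAW) or as an index-keyed object (JSON).
std::string render(const std::vector<std::string>& items, view_type type) {
    std::string out;
    if (type == view_type::RAW) {
        for (const std::string& item : items) out += item + "\n";
        return out;
    }
    out += "{\n";
    for (std::size_t i = 0; i < items.size(); ++i) {
        out += "    \"" + std::to_string(i) + "\": \"" + json_escape(items[i]) + "\"";
        if (i + 1 < items.size()) out += ",";  // no comma after the last entry
        out += "\n";
    }
    out += "}\n";
    return out;
}
```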

Saving results

To save the results to a file, pass a path as the second argument of print().

content.print( scrapium::print_type::JSON, "example/path.json" );

Flags and properties

Unicode escaping

To make sure results print correctly, we can enable the unicode_escape flag, which converts Unicode characters to their \uXXXX form.

scrapium::unicode_escape = true;
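For readers unfamiliar with the \uXXXX form, the sketch below shows one way such a conversion can work on UTF-8 input. It is not taken from scrapium.h: the function names are hypothetical, and the use of uppercase hex digits and of UTF-16 surrogate pairs for characters above U+FFFF are assumptions. ASCII is copied unchanged, and continuation bytes are not validated.

```cpp
#include <cstdint>
#include <cstdio>
#include <string>

// Appends one 16-bit unit in \uXXXX form (uppercase hex is an assumption).
void append_escape(std::string& out, unsigned unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating NUL
    std::snprintf(buf, sizeof buf, "\\u%04X", unit);
    out += buf;
}

// Hypothetical helper: converts every non-ASCII UTF-8 sequence to \uXXXX.
std::string unicode_escape_utf8(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        const unsigned char lead = static_cast<unsigned char>(utf8[i]);
        std::uint32_t cp = 0;
        std::size_t len = 0;
        if ((lead & 0xE0) == 0xC0)      { cp = lead & 0x1F; len = 2; }
        else if ((lead & 0xF0) == 0xE0) { cp = lead & 0x0F; len = 3; }
        else if ((lead & 0xF8) == 0xF0) { cp = lead & 0x07; len = 4; }
        else { out += utf8[i]; ++i; continue; }  // ASCII or stray byte: copy as-is

        if (i + len > utf8.size()) { out += utf8.substr(i); break; }  // truncated input
        for (std::size_t k = 1; k < len; ++k)
            cp = (cp << 6) | (static_cast<unsigned char>(utf8[i + k]) & 0x3F);
        i += len;

        if (cp >= 0x10000) {  // outside the BMP: emit a surrogate pair
            cp -= 0x10000;
            append_escape(out, static_cast<unsigned>(0xD800 + (cp >> 10)));
            append_escape(out, static_cast<unsigned>(0xDC00 + (cp & 0x3FF)));
        } else {
            append_escape(out, static_cast<unsigned>(cp));
        }
    }
    return out;
}
```

With this sketch, the UTF-8 bytes of "Łuczka" become \u0141uczka.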

Browser emulation

If we want to disable browser emulation and use a plain HTTP GET request, we can switch off the browser_emulation flag.

scrapium::browser_emulation = false;

When the flag is true (the default), scrape() emulates a browser connection to download the loaded site. When it is false, the function uses a plain HTTP GET request. This can be significantly faster, but it cannot simulate sessions and cookies, and some websites may block this kind of raw download.

Note that if scrape() encounters a redirect (HTTP 301 or HTTP 302) while browser emulation is disabled, it will force browser emulation, slowing down the whole process. It is recommended not to change this flag unless you understand the consequences.
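The fallback rule above can be summarised as a small decision function. This is only a sketch of the documented behaviour, not the library's code; the names fetch_mode and resolve_mode are hypothetical.

```cpp
// Hypothetical names, not part of scrapium.h.
enum class fetch_mode { plain_get, browser };

// Decides how the page ends up being downloaded, given the browser_emulation
// flag and the HTTP status returned by a plain GET attempt.
fetch_mode resolve_mode(bool browser_emulation, int plain_get_status) {
    if (browser_emulation) return fetch_mode::browser;  // default behaviour
    if (plain_get_status == 301 || plain_get_status == 302)
        return fetch_mode::browser;                      // redirect forces emulation
    return fetch_mode::plain_get;
}
```

In other words, disabling the flag only pays off for addresses that respond directly, without a 301 or 302 redirect.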

User agent

We can change user_agent to make sure browser emulation sends the proper data. By default it is set to:

scrapium::user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36";

It is recommended not to change it unless you understand the consequences.

Development plans

scrapium.h ver 1

The features planned for implementation before the full release of scrapium.h version 1:
  • scrape() should take HTML tag properties/attributes into account

scrapium.h ver 2

The features planned for implementation before the full release of scrapium.h version 2:
  • I/O methods to connect this program with any other application
  • Linux support

Contributors

This project has no other contributors yet. This documentation is current as of July 31, 2024 and describes the scrapium.h pre-release version.

    MIT License

    Copyright (c) 2024, Krzysztof Łuczka
    
    Permission is hereby granted, free of charge, to any person obtaining a copy
    of this software and associated documentation files (the "Software"), to deal
    in the Software without restriction, including without limitation the rights
    to use, copy, modify, merge, publish, distribute, sublicense, and/or sell
    copies of the Software, and to permit persons to whom the Software is
    furnished to do so, subject to the following conditions:
    
    The above copyright notice and this permission notice shall be included in all
    copies or substantial portions of the Software.
    
    THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR
    IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY,
    FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE
    AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER
    LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM,
    OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE
    SOFTWARE.