Introduction
scrapium.h is a simple, lightweight and fast C++ web scraping library. It aims to provide a high-level, easy way to scrape the web without worrying about code portability. The source code is distributed under the MIT license and can be freely viewed.
Note that this pre-release version is likely not bug-free. The full version will be released along with the other features described in 'Development plans'.
Scraping
Basic code to scrape a web page can look like this:
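#include "scrapium.h"
int main() {
    // Scrape everything found between each <p> and </p> on the page.
    scrapium::contents content = scrapium::scrape( "https://www.example.com/", "<p>", "</p>" );
    // Print the scraped results in JSON format.
    content.print( scrapium::print_type::JSON );
}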
The first argument is the website's address. The scrape() function saves the scraped content into the predefined contents class, which takes care of saving and printing the results.
Custom scraping
If we provide three arguments, the last two will be used as the start and end delimiters. The scrape() function will return everything found between each pair of these two delimiters. For example, the code given above will output:
{
"0": "This domain is for use in illustrative examples in documents. You may use this domain in literature without prior coordination or asking for permission. ",
"1": "<a href="https://www.iana.org/domains/example">More information...</a> ",
}
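To illustrate the idea of start and end delimiters, here is a minimal sketch of delimiter-based extraction on a string that is already in memory. The helper name extract_between is hypothetical and is not part of scrapium.h. It is only an assumption about how such matching could work, not the library's actual implementation.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical helper, not part of scrapium.h: collects every substring of
// `text` that lies between an occurrence of `start` and the next `end`.
std::vector<std::string> extract_between(const std::string& text,
                                         const std::string& start,
                                         const std::string& end) {
    // Empty delimiters would match everywhere, so reject them up front.
    assert(!start.empty() && !end.empty());

    std::vector<std::string> results;
    std::size_t pos = 0;
    while (true) {
        // Find the next opening delimiter.
        std::size_t open = text.find(start, pos);
        if (open == std::string::npos) break;
        std::size_t begin = open + start.size();

        // Find the closing delimiter that follows it.
        std::size_t close = text.find(end, begin);
        if (close == std::string::npos) break;

        results.push_back(text.substr(begin, close - begin));
        pos = close + end.size();  // continue after the closing delimiter
    }
    return results;
}
```

Calling extract_between( page, "<p>", "</p>" ) on a downloaded page would yield one entry per paragraph, in the order they appear.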
HTML tag scraping
If we provide two arguments, the last one will be used as a tag name. The function will capture the contents of every such tag, regardless of its properties/attributes.
scrapium::contents content = scrapium::scrape( "http://www.wierszespodtaboreta.pl/", "a" );
Calling content.print( scrapium::print_type::JSON ) afterwards will output:
{
"0": "<div class="button"> <cite>Taboret I</cite>, Zbigniew Kaczmarek </div>",
"1": "<div class="button"> <cite>Parvum Opus</cite>, Krzysztof Łuczka i Jakub Pinkowski </div>",
"2": "<div class="button"> <cite>Brzydota</cite>, Rafał Skałecki </div>",
"3": "<div class="button"> <cite>Bańka</cite>, Rafał Skałecki </div>",
}
Note that it will not capture standalone tags.
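The behaviour described above (capture tag contents regardless of attributes, skip standalone tags) can be sketched as follows. The function extract_tag_contents is a hypothetical illustration, not the scrapium.h implementation. It is deliberately naive: it assumes no nested tags of the same name and no '>' character inside attribute values.

```cpp
#include <cassert>
#include <cctype>
#include <string>
#include <vector>

// Hypothetical helper, not part of scrapium.h: returns the contents of every
// <tag ...>...</tag> pair in `html`, ignoring whatever attributes the tag has.
std::vector<std::string> extract_tag_contents(const std::string& html,
                                              const std::string& tag) {
    assert(!tag.empty());
    const std::string open = "<" + tag;
    const std::string close = "</" + tag + ">";

    std::vector<std::string> results;
    std::size_t pos = 0;
    while ((pos = html.find(open, pos)) != std::string::npos) {
        std::size_t after = pos + open.size();
        // The tag name must end here: "<a>" and "<a href=...>" match,
        // while "<abbr>" does not.
        if (after >= html.size() ||
            (html[after] != '>' && html[after] != '/' &&
             !std::isspace(static_cast<unsigned char>(html[after])))) {
            pos = after;
            continue;
        }
        // Skip over any attributes to the end of the opening tag.
        std::size_t tag_end = html.find('>', after);
        if (tag_end == std::string::npos) break;
        // A standalone form such as <a/> has no contents to capture.
        if (html[tag_end - 1] == '/') {
            pos = tag_end + 1;
            continue;
        }
        std::size_t content_begin = tag_end + 1;
        std::size_t content_end = html.find(close, content_begin);
        if (content_end == std::string::npos) break;
        results.push_back(html.substr(content_begin, content_end - content_begin));
        pos = content_end + close.size();
    }
    return results;
}
```

A production-quality version would need a real HTML parser to handle nesting, comments and malformed markup.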
Results
The scrapium::contents class provides a print() function with the following view types:
- scrapium::print_type::RAW - will display every result separated by a newline character
- scrapium::print_type::JSON - will display results in JSON format
- scrapium::print_type::XML - will display results in XML format
- scrapium::print_type::PHP - will display results in PHP serialization format
- scrapium::print_type::YAML - will display results in YAML format
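To make the JSON view more concrete, here is a small sketch that renders a list of results as a JSON object keyed by position, similar in shape to the outputs shown earlier. The functions json_escape and to_json are hypothetical and are not part of scrapium.h. Unlike the sample outputs above, this sketch escapes quotes inside values and omits the trailing comma, so the exact text may differ from what the library prints.

```cpp
#include <string>
#include <vector>

// Hypothetical helper: escapes characters that must not appear raw inside a
// JSON string. A fuller version would also escape other control characters.
std::string json_escape(const std::string& s) {
    std::string out;
    for (char c : s) {
        switch (c) {
            case '"':  out += "\\\""; break;
            case '\\': out += "\\\\"; break;
            case '\n': out += "\\n";  break;
            case '\r': out += "\\r";  break;
            case '\t': out += "\\t";  break;
            default:   out += c;
        }
    }
    return out;
}

// Hypothetical helper: renders results as {"0": "...", "1": "..."},
// one entry per line, using each result's index as its key.
std::string to_json(const std::vector<std::string>& results) {
    std::string out = "{\n";
    for (std::size_t i = 0; i < results.size(); ++i) {
        out += "\"" + std::to_string(i) + "\": \"" + json_escape(results[i]) + "\"";
        if (i + 1 < results.size()) out += ",";  // no comma after the last entry
        out += "\n";
    }
    out += "}";
    return out;
}
```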
Saving results
To save the results to a file, simply provide a path as the second argument of print().
content.print( scrapium::print_type::JSON, "example/path.json" );
Flags and properties
Unicode escaping
To ensure results are printed correctly, we can set the unicode_escape flag, which converts Unicode characters to their \uXXXX form.
scrapium::unicode_escape = true;
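As an illustration of what the \uXXXX form means, the sketch below converts a UTF-8 string into ASCII with \uXXXX escapes. The function escape_unicode is hypothetical, not the scrapium.h implementation. It assumes well-formed UTF-8 input, lowercase hex digits, and surrogate pairs for code points above U+FFFF, all of which are assumptions rather than documented library behaviour.

```cpp
#include <cassert>
#include <cstdint>
#include <cstdio>
#include <string>

// Appends one UTF-16 code unit to `out` in \uXXXX form.
void append_escape(std::string& out, unsigned unit) {
    char buf[7];  // backslash, 'u', four hex digits, terminating null
    std::snprintf(buf, sizeof buf, "\\u%04x", unit);
    out += buf;
}

// Hypothetical helper, not part of scrapium.h: keeps ASCII as-is and replaces
// every other UTF-8 encoded character with its \uXXXX escape.
std::string escape_unicode(const std::string& utf8) {
    std::string out;
    std::size_t i = 0;
    while (i < utf8.size()) {
        unsigned char lead = static_cast<unsigned char>(utf8[i]);
        if (lead < 0x80) {  // plain ASCII byte
            out += utf8[i];
            ++i;
            continue;
        }
        // Determine the sequence length and the payload bits of the lead byte.
        std::size_t len = 0;
        std::uint32_t cp = 0;
        if ((lead & 0xE0) == 0xC0)      { len = 2; cp = lead & 0x1F; }
        else if ((lead & 0xF0) == 0xE0) { len = 3; cp = lead & 0x0F; }
        else if ((lead & 0xF8) == 0xF0) { len = 4; cp = lead & 0x07; }
        // This sketch assumes well-formed input.
        assert(len != 0 && i + len <= utf8.size());
        for (std::size_t k = 1; k < len; ++k) {
            unsigned char c = static_cast<unsigned char>(utf8[i + k]);
            assert((c & 0xC0) == 0x80);  // must be a continuation byte
            cp = (cp << 6) | (c & 0x3F);
        }
        i += len;
        if (cp >= 0x10000) {
            // Code points beyond the BMP become a UTF-16 surrogate pair.
            cp -= 0x10000;
            append_escape(out, 0xD800 + (cp >> 10));
            append_escape(out, 0xDC00 + (cp & 0x3FF));
        } else {
            append_escape(out, cp);
        }
    }
    return out;
}
```

With this sketch, the letter Ł (U+0141) from the sample output above becomes \u0141, while ASCII text passes through unchanged.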
Browser emulation
If we want to disable browser emulation and use a plain HTTP GET request, we can switch the browser_emulation flag.
scrapium::browser_emulation = false;
When the flag is true (the default), the scrape() function will emulate a browser connection to download the loaded site. When it is false, the function will use a plain HTTP GET request.
A plain request might be significantly faster, but it cannot simulate sessions and cookies, and some websites might block this kind of raw download of their contents.
Note that if the scrape() function encounters a redirect (HTTP 301 or HTTP 302) while browser emulation is disabled, it will force browser emulation, slowing down the whole process.
It is recommended not to change this flag unless you understand the consequences.
User agent
We can change user_agent to control the User-Agent string sent during browser emulation.
By default it is set to:
scrapium::user_agent = "Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36 (KHTML, like Gecko) Chrome/91.0.4472.124 Safari/537.36";
It is recommended not to change it unless you understand the consequences.
Development plans
scrapium.h ver 1
The list of features planned to be implemented before the full release of scrapium.h version 1:
- scrape should take HTML tag properties/attributes into account
scrapium.h ver 2
The list of features planned to be implemented before the full release of scrapium.h version 2:
- I/O methods to connect this program with any other application
- Linux support
Contributors
This project has no other contributors yet. This documentation is current as of July 31, 2024 and describes the scrapium.h pre-release version.
MIT License

Copyright (c) 2024, Krzysztof Łuczka

Permission is hereby granted, free of charge, to any person obtaining a copy of this software and associated documentation files (the "Software"), to deal in the Software without restriction, including without limitation the rights to use, copy, modify, merge, publish, distribute, sublicense, and/or sell copies of the Software, and to permit persons to whom the Software is furnished to do so, subject to the following conditions:

The above copyright notice and this permission notice shall be included in all copies or substantial portions of the Software.

THE SOFTWARE IS PROVIDED "AS IS", WITHOUT WARRANTY OF ANY KIND, EXPRESS OR IMPLIED, INCLUDING BUT NOT LIMITED TO THE WARRANTIES OF MERCHANTABILITY, FITNESS FOR A PARTICULAR PURPOSE AND NONINFRINGEMENT. IN NO EVENT SHALL THE AUTHORS OR COPYRIGHT HOLDERS BE LIABLE FOR ANY CLAIM, DAMAGES OR OTHER LIABILITY, WHETHER IN AN ACTION OF CONTRACT, TORT OR OTHERWISE, ARISING FROM, OUT OF OR IN CONNECTION WITH THE SOFTWARE OR THE USE OR OTHER DEALINGS IN THE SOFTWARE.