PawnScraper
A powerful scraper plugin that provides an interface for using HTML parsers and CSS selectors in Pawn.
Installing
Thanks to Southclaws, plugin installation is now much easier with sampctl:
PHP Code:
sampctl p install Sreyas-Sreelal/pawn-scraper
OR
- Download suitable binary files from releases for your operating system
- Add it to your plugins folder
- Add PawnScraper to the plugins line in server.cfg (PawnScraper.so on Linux)
- Add pawnscraper.inc to your includes folder
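For the manual route, the server.cfg entry could look like the sketch below. It assumes the downloaded binary keeps the name PawnScraper; any other plugins you already load stay on the same line, separated by spaces.

```
plugins PawnScraper
```

On Linux the same line would read `plugins PawnScraper.so`.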
Building
- Clone the repo
PHP Code:
git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git
- Compile the plugin using the nightly Rust toolchain
- Windows
PHP Code:
cargo +nightly-i686-pc-windows-msvc build --release
- Linux
PHP Code:
cargo +nightly-i686-unknown-linux-gnu build --release
API
- ParseHtmlDocument(document[])
- Params
- document[] - string containing the HTML document
- Returns
- Html document instance id
- INVALID_HTML_DOC if the document could not be parsed
- Example Usage
PHP Code:
new Html:doc = ParseHtmlDocument("\
<!DOCTYPE html>\
<meta charset=\"utf-8\">\
<title>Hello, world!</title>\
<h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);
DeleteHtml(doc);
- ResponseParseHtml(Response:id)
- Params
- id - HTTP response id returned from HttpGet
- Returns
- Html document instance id
- INVALID_HTML_DOC if the document could not be parsed
- Example Usage
PHP Code:
new Response:response = HttpGet("https://www.sa-mp.com");
new Html:doc = ResponseParseHtml(response);
ASSERT(doc != INVALID_HTML_DOC);
DeleteHtml(doc);
DeleteResponse(response);
- HttpGet(url[],Header:headerid=INVALID_HEADER)
- Params
- url[] - URL of a website
- headerid - id of a header object created using CreateHeader (optional)
- Returns
- Response id if successful
- INVALID_HTTP_RESPONSE if the request failed
- Example Usage
PHP Code:
new Response:response = HttpGet("https://www.sa-mp.com");
ASSERT(response != INVALID_HTTP_RESPONSE);
DeleteResponse(response);
- HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)
- Params
- playerid - id of the player
- callback[] - name of the callback function that handles the response
- url[] - URL of a website
- headerid - id of a header object created using CreateHeader (optional)
- Example Usage
PHP Code:
HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
//********
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    ASSERT(responseid != INVALID_HTTP_RESPONSE);
    DeleteResponse(responseid);
}
- ParseSelector(string[])
- Params
- string[] - CSS selector
- Returns
- Selector instance id if successful
- INVALID_SELECTOR if the selector could not be parsed
- Example Usage
PHP Code:
new Selector:selector = ParseSelector("h1 .foo");
ASSERT(selector != INVALID_SELECTOR);
DeleteSelector(selector);
- CreateHeader(...)
- Params
- key, value pairs of string type
- Returns
- Header instance id if successful
- INVALID_HEADER if the header could not be created
- Example Usage
PHP Code:
new Header:header = CreateHeader(
    "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
);
ASSERT(header != INVALID_HEADER);
new Response:response = HttpGet("https://sa-mp.com/",header);
ASSERT(response != INVALID_HTTP_RESPONSE);
ASSERT(DeleteHeader(header) == 1);
- GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
- Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the nth occurrence of the element in the document (starts from 0)
- string[] - buffer in which the element name is stored
- size - size of the string buffer
- Returns
- 1 if successful
- 0 if failed
- Example Usage
PHP Code:
new Html:doc = ParseHtmlDocument("\
    <!DOCTYPE html>\
    <meta charset=\"utf-8\">\
    <title>Hello, world!</title>\
    <h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);
new Selector:selector = ParseSelector("i");
ASSERT(selector != INVALID_SELECTOR);
new i= -1,element_name[10];
while(GetNthElementName(doc,selector,++i,element_name)!=0){
    ASSERT(strcmp(element_name,"i") == 0);
}
DeleteSelector(selector);
DeleteHtml(doc);
- GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))
- Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the nth occurrence of the element in the document (starts from 0)
- string[] - buffer in which the element text is stored
- size - size of the string buffer
- Returns
- 1 if successful
- 0 if failed
- Example Usage
PHP Code:
new Html:doc = ParseHtmlDocument("\
    <!DOCTYPE html>\
    <meta charset=\"utf-8\">\
    <title>Hello, world!</title>\
    <h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);
new Selector:selector = ParseSelector("h1.foo");
ASSERT(selector != INVALID_SELECTOR);
new element_text[20];
ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);
new check = strcmp(element_text,("Hello, world!"));
ASSERT(check == 0);
DeleteSelector(selector);
DeleteHtml(doc);
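The example above reads a single match. To walk every match, the idx parameter can be incremented until the native returns 0. The helper below is only a sketch: PrintAllElementText is a hypothetical name, the 128-cell buffer is an arbitrary choice, and it assumes the include is available as pawnscraper and that a return value of 0 means there are no further matches, as the loops elsewhere in this post suggest.

```pawn
#include <a_samp>
#include <pawnscraper>

// Hypothetical helper (not part of the plugin): prints the text of every
// element in `doc` that matches `css`.
// Returns the number of matches, or -1 if the selector could not be parsed.
stock PrintAllElementText(Html:doc, css[])
{
    new Selector:selector = ParseSelector(css);
    if (selector == INVALID_SELECTOR) {
        return -1;
    }

    new text[128], count = 0;
    // idx starts at 0; assumed to return 0 once no further match exists.
    while (GetNthElementText(doc, selector, count, text)) {
        printf("%d: %s", count, text);
        count++;
    }

    // The selector was created here, so it is released here.
    // The caller still owns `doc` and must call DeleteHtml on it.
    DeleteSelector(selector);
    return count;
}
```

Calling it as `PrintAllElementText(doc, "h1.foo");` before `DeleteHtml(doc);` would list each matching heading's text. Keeping the document's lifetime with the caller lets the same parsed document be queried with several selectors.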
- GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))
- Params
- docid - Html instance id
- selectorid - CSS selector instance id
- idx - the nth occurrence of the element in the document (starts from 0)
- attribute[] - name of the element attribute to read
- string[] - buffer in which the attribute value is stored
- size - size of the string buffer
- Returns
- 1 if successful
- 0 if failed
- Example Usage
PHP Code:
new Html:doc = ParseHtmlDocument("\
<!DOCTYPE html>\
<meta charset=\"utf-8\">\
<title>Hello, world!</title>\
<h1 class=\"foo\">Hello, <i>world!</i></h1>\
");
ASSERT(doc != INVALID_HTML_DOC);
new Selector:selector = ParseSelector("h1");
ASSERT(selector != INVALID_SELECTOR);
new element_attribute[20];
ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);
new check = strcmp(element_attribute,("foo"));
ASSERT(check == 0);
DeleteSelector(selector);
DeleteHtml(doc);
- DeleteHtml(Html:id)
- Params
- id - html instance to be deleted
- Returns
- 1 if successful
- 0 if failed
- DeleteSelector(Selector:id)
- Params
- id - selector instance to be deleted
- Returns
- 1 if successful
- 0 if failed
- DeleteResponse(Response:id)
- Params
- id - response instance to be deleted
- Returns
- 1 if successful
- 0 if failed
- DeleteHeader(Header:id)
- Params
- id - header instance to be deleted
- Returns
- 1 if successful
- 0 if failed
Example Usage
A small example to fetch all links in wiki.sa-mp.com
PHP Code:
new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}
new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}
new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}
new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}
//delete created objects after use
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);
The same example with a threaded HTTP call would be:
PHP Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }
    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }
    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }
    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }
    //delete created objects after use
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}
More examples can be found in the examples folder of the repository.
Repository
https://github.com/Sreyas-Sreelal/pawn-scraper
Note
The plugin is at an early stage, and more tests and features still need to be added. I'm open to any kind of contribution; just open a pull request if you have something to improve or a new feature to add.
Special thanks
- Eva for samp-rust-sdk
- Y_Less for y_tests
- Discord members in SAMP discord channel