


PawnScraper - SyS - 2019-04-14

PawnScraper









A powerful scraper plugin that provides an interface for utilising HTML parsers and CSS selectors in Pawn.





Installing



Thanks to Southclaws, plugin installation is now much easier with sampctl:



PHP Code:
sampctl p install Sreyas-Sreelal/pawn-scraper 



OR


  • Download suitable binary files from releases for your operating system

  • Add it to your plugins folder

  • Add PawnScraper to server.cfg, or PawnScraper.so (for Linux)

  • Add pawnscraper.inc to your includes folder
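
For the manual route, the server.cfg entry might look like the sketch below. It assumes the standard SA-MP plugins line and a server with no other plugins loaded; if other plugins are already listed, PawnScraper is appended to the same line, separated by a space.

```
plugins PawnScraper
```

On Linux the same line would name the shared object instead, i.e. plugins PawnScraper.so.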





Building


  • Clone the repo



    PHP Code:
    git clone https://github.com/Sreyas-Sreelal/pawn-scraper.git 



  • Compile the plugin using the nightly Rust toolchain


    • Windows

      PHP Code:
      cargo +nightly-i686-pc-windows-msvc build --release

    • Linux

      PHP Code:
      cargo +nightly-i686-unknown-linux-gnu build --release






API


  • ParseHtmlDocument(document[])


    • Params


      • document[] - string of html document


    • Returns


      • Html document instance id

      • if failed to parse document INVALID_HTML_DOC is returned


    • Example Usage



      PHP Code:
      new Html:doc = ParseHtmlDocument("\
       <!DOCTYPE html>\
       <meta charset=\"utf-8\">\
       <title>Hello, world!</title>\
       <h1 class=\"foo\">Hello, <i>world!</i></h1>\
       "
      );
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);




  • ResponseParseHtml(Response:id)


    • Params


      • id - Http response id returned from HttpGet


    • Returns


      • Html document instance id

      • if failed to parse document INVALID_HTML_DOC is returned


    • Example Usage



      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");
      new Html:doc = ResponseParseHtml(response);
      ASSERT(doc != INVALID_HTML_DOC);
      DeleteHtml(doc);




  • HttpGet(url[],Header:headerid=INVALID_HEADER)


    • Params


      • url[] - Url of a website

      • headerid - id of the header object created using CreateHeader


    • Returns


      • Response id if successful

      • if the request fails, INVALID_HTTP_RESPONSE is returned


    • Example Usage



      PHP Code:
      new Response:response = HttpGet("https://www.sa-mp.com");

      ASSERT(response != INVALID_HTTP_RESPONSE);

      DeleteResponse(response); 




  • HttpGetThreaded(playerid,callback[],url[],Header:headerid=INVALID_HEADER)


    • Params


      • playerid - id of the player

      • callback[] - name of the callback function to handle the response.

      • url[] - Url of a website

      • headerid - id of the header object created using CreateHeader


    • Example Usage

      PHP Code:
      HttpGetThreaded(0,"MyHandler","https://sa-mp.com");
      //********
      forward MyHandler(playerid,Response:responseid);
      public MyHandler(playerid,Response:responseid){
          ASSERT(responseid != INVALID_HTTP_RESPONSE);
          DeleteResponse(responseid);
      }




  • ParseSelector(string[])


    • Params


      • string[] - CSS selector


    • Returns


      • Selector instance id if successful

      • if the selector fails to parse, INVALID_SELECTOR is returned


    • Example Usage



      PHP Code:
      new Selector:selector = ParseSelector("h1 .foo");

      ASSERT(selector != INVALID_SELECTOR);

      DeleteSelector(selector); 




  • CreateHeader(...)


    • Params


      • key, value pairs of string type


    • Returns


      • Header instance id if successful

      • if the header cannot be created, INVALID_HEADER is returned


    • Example Usage



      PHP Code:
      new Header:header = CreateHeader(
          "User-Agent","Mozilla/5.0 (Windows NT 10.0; Win64; x64) AppleWebKit/537.36"
      );
      ASSERT(header != INVALID_HEADER);
      new Response:response = HttpGet("https://sa-mp.com/",header);
      ASSERT(response != INVALID_HTTP_RESPONSE);
      ASSERT(DeleteHeader(header) == 1);




  • GetNthElementName(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))


    • Params


      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - the n'th occurrence of the element in the document (starts from 0)

      • string[] - buffer in which the element name is stored

      • size - sizeof string


    • Returns


      • 1 if successful

      • 0 if failed


    • Example Usage



      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("i");
      ASSERT(selector != INVALID_SELECTOR);

      // i starts at -1 so that the pre-increment yields index 0 on the first call
      new i = -1,element_name[10];
      while(GetNthElementName(doc,selector,++i,element_name) != 0){
          ASSERT(strcmp(element_name,"i") == 0);
      }

      DeleteSelector(selector);
      DeleteHtml(doc);


  • GetNthElementText(Html:docid,Selector:selectorid,idx,string[],size = sizeof(string))


    • Params


      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - the n'th occurrence of the element in the document (starts from 0)

      • string[] - buffer in which the element text is stored

      • size - sizeof string


    • Returns


      • 1 if successful

      • 0 if failed


    • Example Usage



      PHP Code:
      new Html:doc = ParseHtmlDocument("\
          <!DOCTYPE html>\
          <meta charset=\"utf-8\">\
          <title>Hello, world!</title>\
          <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1.foo");
      ASSERT(selector != INVALID_SELECTOR);

      new element_text[20];
      ASSERT(GetNthElementText(doc,selector,0,element_text) == 1);

      new check = strcmp(element_text,("Hello, world!"));
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);


  • GetNthElementAttrVal(Html:docid,Selector:selectorid,idx,attribute[],string[],size = sizeof(string))


    • Params


      • docid - Html instance id

      • selectorid - CSS selector instance id

      • idx - the n'th occurrence of the element in the document (starts from 0)

      • attribute[] - the attribute of the element

      • string[] - buffer in which the attribute value is stored

      • size - sizeof string


    • Returns


      • 1 if successful

      • 0 if failed


    • Example Usage



      PHP Code:
      new Html:doc = ParseHtmlDocument("\
       <!DOCTYPE html>\
       <meta charset=\"utf-8\">\
       <title>Hello, world!</title>\
       <h1 class=\"foo\">Hello, <i>world!</i></h1>\
      "
      );
      ASSERT(doc != INVALID_HTML_DOC);

      new Selector:selector = ParseSelector("h1");
      ASSERT(selector != INVALID_SELECTOR);

      new element_attribute[20];
      ASSERT(GetNthElementAttrVal(doc,selector,0,"class",element_attribute) == 1);

      new check = strcmp(element_attribute,("foo"));
      ASSERT(check == 0);

      DeleteSelector(selector);
      DeleteHtml(doc);


  • DeleteHtml(Html:id)


    • Params


      • id - html instance to be deleted


    • Returns


      • 1 if successful

      • 0 if failed



  • DeleteSelector(Selector:id)


    • Params


      • id - selector instance to be deleted


    • Returns


      • 1 if successful

      • 0 if failed



  • DeleteResponse(Response:id)


    • Params


      • id - response instance to be deleted


    • Returns


      • 1 if successful

      • 0 if failed



  • DeleteHeader(Header:id)


    • Params


      • id - header instance to be deleted


    • Returns


      • 1 if successful

      • 0 if failed








Example Usage



A small example that fetches all links on wiki.sa-mp.com:



PHP Code:
new Response:response = HttpGet("https://wiki.sa-mp.com");
if(response == INVALID_HTTP_RESPONSE){
    printf("HTTP ERROR");
    return;
}

new Html:html = ResponseParseHtml(response);
if(html == INVALID_HTML_DOC){
    DeleteResponse(response);
    return;
}

new Selector:selector = ParseSelector("a");
if(selector == INVALID_SELECTOR){
    DeleteResponse(response);
    DeleteHtml(html);
    return;
}

new str[500],i;
while(GetNthElementAttrVal(html,selector,i,"href",str)){
    printf("%s",str);
    ++i;
}

//delete created objects after use
DeleteHtml(html);
DeleteResponse(response);
DeleteSelector(selector);



The same example using a threaded HTTP call would be:



PHP Code:
HttpGetThreaded(0,"MyHandler","https://wiki.sa-mp.com");
//...
forward MyHandler(playerid,Response:responseid);
public MyHandler(playerid,Response:responseid){
    if(responseid == INVALID_HTTP_RESPONSE){
        printf("HTTP ERROR");
        return 0;
    }

    new Html:html = ResponseParseHtml(responseid);
    if(html == INVALID_HTML_DOC){
        DeleteResponse(responseid);
        return 0;
    }

    new Selector:selector = ParseSelector("a");
    if(selector == INVALID_SELECTOR){
        DeleteResponse(responseid);
        DeleteHtml(html);
        return 0;
    }

    new str[500],i;
    while(GetNthElementAttrVal(html,selector,i,"href",str)){
        printf("%s",str);
        ++i;
    }

    //delete created objects after use
    DeleteHtml(html);
    DeleteResponse(responseid);
    DeleteSelector(selector);
    return 1;
}






More examples can be found in the examples folder of the repository.



Repository

https://github.com/Sreyas-Sreelal/pawn-scraper



Note



The plugin is at an early stage, and more tests and features need to be added. I'm open to any kind of contribution; just open a pull request if you have anything to improve or new features to add.



Special thanks