
Get libcurl to return only a specific amount of HTML from a site

asked 9 hours ago by @qa-ldtwvoulqzrq2sgqgrft 0 rep · 90 views

c libcurl

I don't want all the HTML from a specific site so I tried to alter CURL_MAX_WRITE_SIZE.

Unfortunately it still returns way more than 650,000 characters and gives me the entire site.

#include <stdio.h>
#include <curl/curl.h>

#ifdef CURL_MAX_WRITE_SIZE
#undef CURL_MAX_WRITE_SIZE
#define CURL_MAX_WRITE_SIZE 650000
#endif

//for now this just prints the code for debugging purposes
int processCode(char html[]){
    printf("%s", html);
    return 0;
}

int main(){
    CURL *curl = curl_easy_init();
    curl_easy_setopt(curl, CURLOPT_URL, "https://amazon.com");
    curl_easy_setopt(curl, CURLOPT_FOLLOWLOCATION, 1L);
    curl_easy_setopt(curl, CURLOPT_BUFFERSIZE, 650000L);
    curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, processCode);
    curl_easy_perform(curl);
    curl_easy_cleanup(curl);
    return 0;
}


2 answers

3

The code here compiles on my machine, gcc throws no errors and the file it spits out runs, just not in the way I would like.

Welcome to C, where the compiler can't save you all the time, no matter how strictly you tell it to check your code.

Your processCode() function is wrong: it doesn't match the callback prototype libcurl expects. The compiler doesn't complain because of the way the curl_easy_setopt() function is defined - the third argument can be anything that can be cast to void *. (Pedantically, casting a function pointer to void * is actually a bit sketchy per the C standard...)

Per the libcurl documentation (bolded text mine):

Name

CURLOPT_WRITEFUNCTION - callback for writing received data

Synopsis

#include <curl/curl.h>

size_t write_callback(char *ptr, size_t size, size_t nmemb, void *userdata);

CURLcode curl_easy_setopt(CURL *handle, CURLOPT_WRITEFUNCTION, write_callback);

Description

Pass a pointer to your callback function, which should match the prototype shown above.

This callback function gets called by libcurl as soon as there is data received that needs to be saved. For most transfers, this callback gets called many times and each invoke delivers another chunk of data. ptr points to the delivered data, and the size of that data is nmemb; size is always 1.

The data passed to this function is not null-terminated.

...
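Applied to the question's processCode(), a minimal sketch of a callback with that prototype could look like the following. The name print_chunk and the convention of passing an optional FILE * through userdata are assumptions made for illustration. It uses fwrite() with an explicit length instead of printf("%s", ...), because the chunk is not null-terminated.

```c
#include <stdio.h>

/* Hypothetical replacement for processCode(): prints one chunk as delivered.
 * Assumption: userdata is either NULL (print to stdout) or a FILE * that was
 * passed via CURLOPT_WRITEDATA. */
static size_t print_chunk(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    FILE *out = userdata ? (FILE *)userdata : stdout;

    /* fwrite() takes an explicit length, so no terminating NUL is needed. */
    size_t written = fwrite(ptr, size, nmemb, out);

    /* Report the number of bytes handled; a short count aborts the transfer. */
    return written * size;
}
```

It would be registered with curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, print_chunk).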

There's an example on that very page:

#include <stdlib.h> /* for realloc */
#include <string.h> /* for memcpy */
 
struct memory {
  char *response;
  size_t size;
};
 
static size_t cb(char *data, size_t size, size_t nmemb, void *clientp)
{
  size_t realsize = nmemb;
  struct memory *mem = (struct memory *)clientp;
 
  char *ptr = realloc(mem->response, mem->size + realsize + 1);
  if(!ptr)
    return 0;  /* out of memory */
 
  mem->response = ptr;
  memcpy(&(mem->response[mem->size]), data, realsize);
  mem->size += realsize;
  mem->response[mem->size] = 0;
 
  return realsize;
}

Although if you want to collect the output into a memory buffer, it's much easier to use CURLOPT_WRITEDATA with the FILE * argument being a memory stream opened using the POSIX open_memstream() function.

Though you should probably start with the example on that last page - that will teach you how to use libcurl to write output to a file.
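For the open_memstream() route, here is a rough sketch of the plumbing. The struct mem_sink type and the two helper functions are hypothetical names, not part of libcurl, and the libcurl calls appear only in the trailing comment. It relies on libcurl's default behavior of fwrite()-ing into the FILE * given via CURLOPT_WRITEDATA when no write callback is set.

```c
#define _POSIX_C_SOURCE 200809L  /* expose open_memstream() */
#include <assert.h>
#include <stdio.h>
#include <stdlib.h>

/* Hypothetical helper: bundles the stream with the buffer it grows. */
struct mem_sink {
    FILE  *fp;    /* pass this to CURLOPT_WRITEDATA */
    char  *data;  /* valid after mem_sink_finish() */
    size_t size;  /* bytes in data, excluding the trailing NUL */
};

/* Returns 0 on success, -1 if the stream could not be created. */
static int mem_sink_open(struct mem_sink *sink)
{
    assert(sink != NULL);
    sink->data = NULL;
    sink->size = 0;
    sink->fp = open_memstream(&sink->data, &sink->size);
    return sink->fp ? 0 : -1;
}

/* Closing the stream finalizes data and size; the caller frees data. */
static int mem_sink_finish(struct mem_sink *sink)
{
    int rc = fclose(sink->fp);
    sink->fp = NULL;
    return rc == 0 ? 0 : -1;
}

/* Wiring sketch (needs <curl/curl.h>); no CURLOPT_WRITEFUNCTION is set,
 * so libcurl's default writer fwrite()s into the stream:
 *   struct mem_sink sink;
 *   mem_sink_open(&sink);
 *   curl_easy_setopt(curl, CURLOPT_WRITEDATA, sink.fp);
 *   curl_easy_perform(curl);
 *   mem_sink_finish(&sink);   // sink.data now holds sink.size bytes
 *   free(sink.data);
 */
```

Unlike the realloc-based callback, the buffer is only safe to read after the stream has been closed (or flushed).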

Skyler Patel · 0 rep · 9 hours ago

2

A few other issues, beyond what the first answer covers.

You misunderstood how CURL_MAX_WRITE_SIZE and CURLOPT_BUFFERSIZE work.

The former is a constant that tells a client app the maximum amount of data libcurl passes to the write callback in one call; it is also used as the default size of the receive buffer. It is a libcurl build-time value, so redefining it in a client app does not affect the default receive buffer size.

CURLOPT_BUFFERSIZE sets the receive buffer size. Roughly, this just sets the maximum value of the nmemb parameter in the callback function, which is called in a loop with data chunks until all data is received. To stop early, collect the received data, appending chunks until the desired size is reached, and then return 0 from the callback. libcurl then aborts the transfer and curl_easy_perform() returns CURLE_WRITE_ERROR.

CURLOPT_WRITEFUNCTION

For most transfers, this callback gets called many times and each invoke delivers another chunk of data. ptr points to the delivered data, and the size of that data is nmemb
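The collect-then-abort approach described in this answer might be sketched as below. The struct capped_buf type and the capped_write name are hypothetical, and the caller is assumed to allocate cap + 1 bytes of storage. The callback itself needs nothing from libcurl beyond the prototype, so the wiring is shown only in the trailing comment.

```c
#include <assert.h>
#include <string.h>

/* Hypothetical helper type: fixed-capacity buffer that ends the transfer
 * once it is full. */
struct capped_buf {
    char  *data;  /* caller-allocated storage of cap + 1 bytes */
    size_t len;   /* bytes stored so far */
    size_t cap;   /* maximum number of bytes to keep */
};

/* Matches the CURLOPT_WRITEFUNCTION prototype. */
static size_t capped_write(char *ptr, size_t size, size_t nmemb, void *userdata)
{
    struct capped_buf *buf = userdata;
    assert(buf != NULL && buf->len <= buf->cap);

    size_t incoming = size * nmemb;
    size_t room = buf->cap - buf->len;
    size_t take = incoming < room ? incoming : room;

    memcpy(buf->data + buf->len, ptr, take);
    buf->len += take;
    buf->data[buf->len] = '\0';  /* keep the buffer usable as a C string */

    /* Returning fewer bytes than were delivered makes libcurl abort. */
    return take;
}

/* Wiring sketch (needs <curl/curl.h> and <stdlib.h>), with LIMIT = 650000:
 *   struct capped_buf buf = { malloc(LIMIT + 1), 0, LIMIT };
 *   curl_easy_setopt(curl, CURLOPT_WRITEFUNCTION, capped_write);
 *   curl_easy_setopt(curl, CURLOPT_WRITEDATA, &buf);
 *   CURLcode rc = curl_easy_perform(curl);
 *   // rc == CURLE_WRITE_ERROR with buf.len == buf.cap means the cap was hit
 */
```

Because the abort is reported as an error, the caller has to treat CURLE_WRITE_ERROR as the expected outcome when the buffer is full.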

Skyler Brooks · 0 rep · 9 hours ago
