How to Build a Web Scraping API using Java, Spring Boot, and Jsoup?

3i Data Scraping

Overview

At 3i Data Scraping, we will build an API that scrapes data from two vehicle-selling websites and extracts the ads matching the vehicle model passed to the API. Such an API could be called from a UI to show ads from several websites in one place.

Prerequisites

IntelliJ as the IDE of choice
Maven 3.0+ as the build tool
JDK 1.8+

Getting Started

First, we need to initialize the project using Spring Initializr.

This can be done by visiting http://start.spring.io/

Make sure to select the following dependencies:

Lombok: A Java library that makes code cleaner and removes boilerplate.
Spring Web: The Spring Boot starter for building web applications, including RESTful services, with Spring MVC.

After generating the project, we will use two third-party libraries: Jsoup and Apache Commons Lang. Their dependencies can be added to the pom.xml file.

<dependencies>

   <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-web</artifactId>
   </dependency>

   <!-- https://mvnrepository.com/artifact/org.jsoup/jsoup -->
   <dependency>
      <groupId>org.jsoup</groupId>
      <artifactId>jsoup</artifactId>
      <version>1.13.1</version>
   </dependency>

   <!-- https://mvnrepository.com/artifact/org.apache.commons/commons-lang3 -->
   <dependency>
      <groupId>org.apache.commons</groupId>
      <artifactId>commons-lang3</artifactId>
      <version>3.11</version>
   </dependency>

   <dependency>
      <groupId>org.projectlombok</groupId>
      <artifactId>lombok</artifactId>
      <optional>true</optional>
   </dependency>

   <dependency>
      <groupId>org.springframework.boot</groupId>
      <artifactId>spring-boot-starter-test</artifactId>
      <scope>test</scope>
   </dependency>

</dependencies>

Analyze HTML to Extract Data

Before implementing the API, we need to visit https://riyasewana.com/ and https://ikman.lk/ to locate the data that we want to extract from these sites.

We can do that by opening the sites in a browser and inspecting the HTML with the developer tools.

If you are using Chrome, right-click on the page and choose Inspect.

The result will look like this:

screenshot

After opening the websites, we need to navigate through the HTML to identify the DOM element where the ad list is located. The identified elements will be used in the Spring Boot project to get the relevant data.

Navigating through the ikman.lk HTML, we can see that the list of ads sits under the class name list--3NxGO.

screenshot

Next, we do the same for riyasewana.com, where the ad data sits under a div with the id content.

screenshot

Now that we have located the data, let's create our API for scraping it!

Implementation

First, we need to define the website URLs in the application.yml file (or the equivalent application.properties):

website:
  urls: https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query=,https://riyasewana.com/search/
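The website.urls value is one comma-separated string. The service shown later splits it into a list and appends the vehicle model to each entry. A minimal sketch of that splitting and concatenation in plain Java, without Spring, might look like this. The class name UrlListDemo, its helper methods, and the sample model "axio" are illustrative and not part of the original project.

```java
import java.util.ArrayList;
import java.util.Arrays;
import java.util.List;

// Illustrative only: mimics what the service does with the website.urls property.
final class UrlListDemo {

    // Same value as website.urls in application.yml
    static final String WEBSITE_URLS =
            "https://ikman.lk/en/ads/sri-lanka/vehicles?sort=relevance&buy_now=0&urgent=0&query="
            + ",https://riyasewana.com/search/";

    // Splits the comma-separated property value into individual base URLs
    static List<String> splitUrls(String property) {
        return Arrays.asList(property.split(","));
    }

    // Appends the vehicle model to every base URL, as the service does before scraping
    static List<String> searchUrls(String property, String vehicleModel) {
        List<String> result = new ArrayList<>();
        for (String url : splitUrls(property)) {
            result.add(url + vehicleModel);
        }
        return result;
    }

    public static void main(String[] args) {
        // "axio" is a made-up sample model name
        for (String url : searchUrls(WEBSITE_URLS, "axio")) {
            System.out.println(url);
        }
    }
}
```

Because the ikman.lk entry ends with query= and the riyasewana.com entry ends with /search/, plain concatenation is enough to form each search URL. This only works as long as none of the base URLs contains a comma.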

After that, create a simple model class to hold the data mapped from the HTML.

package com.scraper.api.model;

import lombok.Data;

@Data
public class ResponseDTO {
    String title;
    String url;
}

In the code above, Lombok's @Data annotation generates the getters and setters for the attributes.
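Besides accessors, @Data also generates equals, hashCode, and toString, which is what lets a HashSet treat two ads with the same title and URL as a single entry. The sketch below is a hand-written approximation of those members, not Lombok's exact output. The class name PlainResponseDTO, the of factory method, and the sample values are illustrative.

```java
import java.util.HashSet;
import java.util.Objects;
import java.util.Set;

// Illustrative hand-written approximation of the members Lombok generates for ResponseDTO.
final class PlainResponseDTO {
    private String title;
    private String url;

    public String getTitle() { return title; }
    public void setTitle(String title) { this.title = title; }
    public String getUrl() { return url; }
    public void setUrl(String url) { this.url = url; }

    // Two DTOs are equal when both fields match
    @Override
    public boolean equals(Object o) {
        if (this == o) return true;
        if (!(o instanceof PlainResponseDTO)) return false;
        PlainResponseDTO other = (PlainResponseDTO) o;
        return Objects.equals(title, other.title) && Objects.equals(url, other.url);
    }

    // Must be consistent with equals so hash-based collections can detect duplicates
    @Override
    public int hashCode() { return Objects.hash(title, url); }

    @Override
    public String toString() {
        return "PlainResponseDTO(title=" + title + ", url=" + url + ")";
    }

    // Convenience factory for the demo below (not something Lombok generates)
    static PlainResponseDTO of(String title, String url) {
        PlainResponseDTO dto = new PlainResponseDTO();
        dto.setTitle(title);
        dto.setUrl(url);
        return dto;
    }

    public static void main(String[] args) {
        Set<PlainResponseDTO> ads = new HashSet<>();
        // Made-up sample values
        ads.add(of("Sample ad", "https://example.com/ad/1"));
        ads.add(of("Sample ad", "https://example.com/ad/1")); // equal to the first, so not stored twice
        System.out.println(ads.size());
    }
}
```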

Next, it's time to create the service layer and scrape the data from these websites.

package com.scraper.api.service;

import com.scraper.api.model.ResponseDTO;
import org.apache.commons.lang3.StringUtils;
import org.jsoup.Jsoup;
import org.jsoup.nodes.Document;
import org.jsoup.nodes.Element;
import org.jsoup.select.Elements;
import org.springframework.beans.factory.annotation.Value;
import org.springframework.stereotype.Service;

import java.io.IOException;
import java.util.HashSet;
import java.util.List;
import java.util.Set;

@Service
public class ScraperServiceImpl implements ScraperService {
    //Reading data from property file to a list
    @Value("#{'${website.urls}'.split(',')}")
    List<String> urls;

    @Override
    public Set<ResponseDTO> getVehicleByModel(String vehicleModel) {
        //Using a set here to only store unique elements
        Set<ResponseDTO> responseDTOS = new HashSet<>();
        //Traversing through the urls
        for (String url: urls) {

            if (url.contains("ikman")) {
                //method to extract data from Ikman.lk
                extractDataFromIkman(responseDTOS, url + vehicleModel);
            } else if (url.contains("riyasewana")) {
                //method to extract data from riyasewana.com
                extractDataFromRiyasewana(responseDTOS, url + vehicleModel);
            }

        }

        return responseDTOS;
    }

    private void extractDataFromRiyasewana(Set<ResponseDTO> responseDTOS, String url) {

        try {
            //loading the HTML to a Document Object
            Document document = Jsoup.connect(url).get();
            //Selecting the element which contains the ad list
            Element element = document.getElementById("content");
            //nothing to extract if the content div is missing (e.g. the page layout changed)
            if (element == null) {
                return;
            }
            //getting all the <a> tag elements inside the content div tag
            Elements elements = element.getElementsByTag("a");
            //traversing through the elements
            for (Element ads: elements) {
                //only anchors that carry a title attribute are treated as ads
                if (StringUtils.isNotEmpty(ads.attr("title"))) {
                    //mapping data to the model class
                    ResponseDTO responseDTO = new ResponseDTO();
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl(ads.attr("href"));
                    responseDTOS.add(responseDTO);
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

    private void extractDataFromIkman(Set<ResponseDTO> responseDTOS, String url) {
        try {
            //loading the HTML to a Document Object
            Document document = Jsoup.connect(url).get();
            //Selecting the element which contains the ad list
            Element element = document.getElementsByClass("list--3NxGO").first();
            //nothing to extract if the list element is missing (e.g. the class name changed)
            if (element == null) {
                return;
            }
            //getting all the <a> tag elements inside the list--3NxGO class
            Elements elements = element.getElementsByTag("a");
            //traversing through the elements
            for (Element ads: elements) {
                //only anchors that carry an href attribute are treated as ads
                if (StringUtils.isNotEmpty(ads.attr("href"))) {
                    //mapping data to our model class
                    ResponseDTO responseDTO = new ResponseDTO();
                    responseDTO.setTitle(ads.attr("title"));
                    responseDTO.setUrl("https://ikman.lk" + ads.attr("href"));
                    responseDTOS.add(responseDTO);
                }
            }
        } catch (IOException ex) {
            ex.printStackTrace();
        }
    }

}
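The service appends vehicleModel to each base URL as-is, so a model containing spaces or characters such as & could produce a malformed request URL. One possible safeguard, sketched here as an assumption and not as part of the original project, is to percent-encode the value with the JDK's URLEncoder before concatenating. The class name SearchUrlBuilder and its build method are hypothetical, and whether either site handles encoded search terms as expected has not been verified.

```java
import java.io.UnsupportedEncodingException;
import java.net.URLEncoder;

// Hypothetical helper: builds a search URL with the vehicle model safely encoded.
final class SearchUrlBuilder {

    static String build(String baseUrl, String vehicleModel) {
        try {
            // URLEncoder produces form encoding, where a space becomes '+';
            // a literal '+' in the input is already escaped as %2B, so the
            // remaining '+' characters can safely be rewritten to %20.
            String encoded = URLEncoder.encode(vehicleModel, "UTF-8").replace("+", "%20");
            return baseUrl + encoded;
        } catch (UnsupportedEncodingException ex) {
            // UTF-8 is required to be supported on every Java platform
            throw new IllegalStateException(ex);
        }
    }

    public static void main(String[] args) {
        // Made-up sample model name containing a space
        System.out.println(build("https://riyasewana.com/search/", "toyota axio"));
    }
}
```

In the service, the calls that pass url + vehicleModel could pass SearchUrlBuilder.build(url, vehicleModel) instead. The two-argument encode overload is used so that the sketch also compiles on JDK 1.8.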
