
How to Extract the Web to Get Data about the Top Rated Movies on TV?

By 3i Data Scraping


The list of these films is stored in a SQLite database and emailed to you. That way, you will never miss a blockbuster movie on TV again.


Getting a Good Webpage to Extract

We will start with the online TV guide to find films on different Belgian TV channels. However, you can easily adapt our code to use it for other websites. To make your life easier when scraping for films, make sure the website you want to scrape:

  • has HTML tags with a clear id or class
  • uses ids and classes in a consistent way
  • provides well-structured URLs
  • contains all relevant TV channels on a single page
  • has a separate page for every weekday
  • lists only films and no other programs such as news, live shows, or reportage, unless you can easily distinguish the films from the other program types

With these results, we will scrape The Movie Database (TMDB) for film ratings and other information.

Decide Which Data to Store

We will extract the following details about films:

  • Film Title
  • TV Channel
  • TMDB Rating
  • The Time When a Film Starts
  • The Date Film is on TV
  • Release Date
  • Plot
  • Link To The Details Page On TMDB
  • Genre

You can complement this list with actors, the director, interesting facts, and so on: all the data you'd love to know about.

In Scrapy, this information is stored in the fields of an Item.

Creating a Scrapy Project

We assume here that you have already installed Scrapy. If not, just install it first.

Once Scrapy is installed, open a command line and go to the directory where you want to store the Scrapy project. Then run:

scrapy startproject topfilms

This creates the folder structure for the topfilms project shown below. You can ignore the topfilms.db file for now.

[Image: folder structure of the topfilms project]
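
For reference, a freshly generated Scrapy project typically looks like this (exact files vary slightly between Scrapy versions; topfilms.db only appears once our pipeline has run):

topfilms/
    scrapy.cfg            # deploy configuration file
    topfilms/             # the project's Python module
        __init__.py
        items.py          # item definitions (edited below)
        pipelines.py      # item pipelines (edited below)
        settings.py       # project settings
        spiders/          # folder where your spiders live
            __init__.py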

Define Scrapy Items

We will be working with the file items.py, which is created by default when you create a Scrapy project.

A scrapy.Item is a container that is filled during the web scraping. It holds the fields that we want to scrape from the web page(s). The contents of an Item can be accessed in the same way as a Python dict.

Open items.py and add a TVGuideItem class, subclassing scrapy.Item, with the following fields:

import scrapy


class TVGuideItem(scrapy.Item):
    title = scrapy.Field()
    channel = scrapy.Field()
    start_ts = scrapy.Field()
    film_date_long = scrapy.Field()
    film_date_short = scrapy.Field()
    genre = scrapy.Field()
    plot = scrapy.Field()
    rating = scrapy.Field()
    tmdb_link = scrapy.Field()
    release_date = scrapy.Field()
    nb_votes = scrapy.Field()
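
The dict-like access mentioned above works just as you would expect. A minimal sketch (the values are only examples):

from topfilms.items import TVGuideItem

item = TVGuideItem()
item['title'] = 'Casablanca'  # fields are set like dict keys
item['channel'] = 'één'
print(item['title'])          # and read back the same way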
                                

Process Items using Pipelines

After starting a new Scrapy project, you get a file named pipelines.py. Open this file and copy-paste the code below. Afterwards, we will show you step-by-step what each part of the code does.

import sqlite3 as lite

con = None  # database connection


class StoreInDBPipeline(object):
    def __init__(self):
        self.setupDBCon()
        self.dropTopFilmsTable()
        self.createTopFilmsTable()

    def process_item(self, item, spider):
        self.storeInDb(item)
        return item

    def storeInDb(self, item):
        self.cur.execute("INSERT INTO topfilms( \
            title, \
            channel, \
            start_ts, \
            film_date_long, \
            film_date_short, \
            rating, \
            genre, \
            plot, \
            tmdb_link, \
            release_date, \
            nb_votes \
            ) \
            VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )",
            (
                item['title'],
                item['channel'],
                item['start_ts'],
                item['film_date_long'],
                item['film_date_short'],
                float(item['rating']),
                item['genre'],
                item['plot'],
                item['tmdb_link'],
                item['release_date'],
                item['nb_votes'],
            ))
        self.con.commit()

    def setupDBCon(self):
        self.con = lite.connect('topfilms.db')
        self.cur = self.con.cursor()

    def __del__(self):
        self.closeDB()

    def createTopFilmsTable(self):
        self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms( \
            id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \
            title TEXT, \
            channel TEXT, \
            start_ts TEXT, \
            film_date_long TEXT, \
            film_date_short TEXT, \
            rating TEXT, \
            genre TEXT, \
            plot TEXT, \
            tmdb_link TEXT, \
            release_date TEXT, \
            nb_votes INTEGER \
            )")

    def dropTopFilmsTable(self):
        self.cur.execute("DROP TABLE IF EXISTS topfilms")

    def closeDB(self):
        self.con.close()

First, we import the SQLite package and give it the alias lite. We also initialize a variable con that is used for the database connection.

Create a Class for Storing Items in a Database

Next, you create a class with a logical name. This is the name you will refer to when enabling the pipeline in the settings file.

class StoreInDBPipeline(object):
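
To actually enable the pipeline, register it in the project's settings.py. A minimal sketch (the number determines the order in which pipelines run and can be any integer from 0 to 1000):

# settings.py
ITEM_PIPELINES = {
    'topfilms.pipelines.StoreInDBPipeline': 300,
}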

Define a Constructor Method

The constructor is the method named __init__. It runs automatically when an instance of the StoreInDBPipeline class is created.

def __init__(self):
    self.setupDBCon()
    self.dropTopFilmsTable()
    self.createTopFilmsTable()

In the constructor, we call three other methods, which are defined below it.

SetupDBCon Method

In the setupDBCon method, we create the topfilms database (if it does not exist yet) and connect to it with the connect function.

def setupDBCon(self):
    self.con = lite.connect('topfilms.db')
    self.cur = self.con.cursor()

Here we use the alias lite for the SQLite package. Then we create a Cursor object with the cursor function. With this Cursor object we can execute SQL statements against the database.

DropTopFilmsTable Method

The second method is dropTopFilmsTable. It drops the table in the SQLite database.

Every time the web scraper runs, the database is completely removed. Whether you want this is up to you: if you want to do some querying or analysis of the film data afterwards, you can keep the scraping results of each run.

We only want to see the top-rated films of the coming days and nothing more, so we decided to delete the database on every run.

def dropTopFilmsTable(self):
    self.cur.execute("DROP TABLE IF EXISTS topfilms")

With the Cursor object cur we execute the DROP statement.

CreateTopFilmsTable Method

After dropping the topfilms table, we need to create it again. This is done by the last method call in the constructor.

def createTopFilmsTable(self):
    self.cur.execute("CREATE TABLE IF NOT EXISTS topfilms(id INTEGER PRIMARY KEY AUTOINCREMENT NOT NULL, \
    title TEXT, \
    channel TEXT, \
    start_ts TEXT, \
    film_date_long TEXT, \
    film_date_short TEXT, \
    rating TEXT, \
    genre TEXT, \
    plot TEXT, \
    tmdb_link TEXT, \
    release_date TEXT, \
    nb_votes INTEGER \
    )")

Again we use the Cursor object cur to execute the CREATE TABLE statement. The fields added to the topfilms table are the same as in the Scrapy Item we created before. To keep things simple, we use exactly the same names in the SQLite table as in the Item; only the id field is extra.

Note: A good application for looking at SQLite databases is the SQLite Manager plug-in for Firefox. You can watch a SQLite Manager tutorial on YouTube to learn how to use this plug-in.
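
If you prefer not to install a browser plug-in, you can also inspect the results with a few lines of Python using the built-in sqlite3 module. A minimal sketch:

import sqlite3 as lite

con = lite.connect('topfilms.db')
cur = con.cursor()
# Print every stored film, best rated first
for row in cur.execute("SELECT title, channel, start_ts, rating FROM topfilms "
                       "ORDER BY CAST(rating AS REAL) DESC"):
    print(row)
con.close()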

Process_item Method

This method must be implemented in every pipeline class, and it must either return an item (or dict) or raise a DropItem exception. In our web scraper, we return the item.

def process_item(self, item, spider):
    self.storeInDb(item)
    return item

In contrast to the other methods shown here, it takes two extra arguments: the item that was scraped and the spider that scraped it. From this method we call the storeInDb method and afterwards return the item.
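
This is also where a DropItem exception would come in. A hypothetical variant that filters out low-rated films instead of storing everything (the 6.5 threshold is made up):

from scrapy.exceptions import DropItem

def process_item(self, item, spider):
    # Hypothetical filter: drop films below a minimum TMDB rating
    if float(item['rating']) < 6.5:
        raise DropItem("Rating too low: %s" % item['title'])
    self.storeInDb(item)
    return item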

StoreInDb Method

This method implements the INSERT statement that writes the scraped item into the SQLite database.

def storeInDb(self, item):
    self.cur.execute("INSERT INTO topfilms(\
    title, \
    channel, \
    start_ts, \
    film_date_long, \
    film_date_short, \
    rating, \
    genre, \
    plot, \
    tmdb_link, \
    release_date, \
    nb_votes \
    ) \
    VALUES( ?, ?, ?, ?, ?, ?, ?, ?, ?, ?, ? )",
                     (
                         item['title'],
                         item['channel'],
                         item['start_ts'],
                         item['film_date_long'],
                         item['film_date_short'],
                         float(item['rating']),
                         item['genre'],
                         item['plot'],
                         item['tmdb_link'],
                         item['release_date'],
                         item['nb_votes']
                     ))
    self.con.commit()

The values for the table fields come from the item, which is an argument of this method. These values are accessed as dict values (remember that an Item is nothing more than a dict).

Every Constructor Comes with a Destructor!

The counterpart of the constructor is the destructor method, named __del__. In the destructor of the pipeline class we close the connection to the database.

def __del__(self):
    self.closeDB()
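
Relying on __del__ works, but Scrapy also offers an explicit hook that pipelines can implement: a close_spider method is called when the spider finishes. A minimal alternative sketch:

def close_spider(self, spider):
    # Called by Scrapy when the spider closes: a more predictable
    # place to clean up than the __del__ destructor.
    self.closeDB()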


Read More: https://www.3idatascraping.com/how-to-extract-the-web-to-get-data-about-the-top-rated-movies-on-tv.php
