Wednesday, September 20, 2017

Backup articles from "tian.yam.com" using Python Scrapy framework as a crawler

Preface

One of my friends needed to back up all the articles from a blog site called “tian.yam.com” (天空部落格). She might move her blog to another platform such as Blogger or a self-hosted WordPress site. Unfortunately, “tian.yam.com” does not seem to provide a tool to back up all of her articles (or at least I could not find one), so I started to research feasible ways to do it myself.

Scrapy

I found a perfect tool for this kind of job: Scrapy
Scrapy is “an open source and collaborative framework for extracting the data you need from websites. In a fast, simple, yet extensible way.”
I used Scrapy as a Python crawler to fetch the contents of the blog and save them to local disk.

Installation

To install Scrapy using conda, run:
conda install -c conda-forge scrapy 
(I used conda for the installation since I already have Anaconda installed.)
or
pip install Scrapy
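
Either way, the scrapy command should be available afterwards; a quick way to confirm the installation is to print the installed version:
scrapy version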

Start A Project

It is easy to get started by following the official tutorial.
We can start a project by typing the command:
scrapy startproject blog
Then it will create the whole blog directory:
blog/
    scrapy.cfg   # deploy configuration file
    blog/        # project's Python module, you'll import your code from here
        __init__.py
        items.py      # project items definition file
        pipelines.py  # project pipelines file
        settings.py   # project settings file
        spiders/      # a directory where you'll later put your spiders
            __init__.py
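
The spider in this article keeps its results in a plain Python list, so items.py and pipelines.py stay untouched. Just as a quick sketch of what items.py can hold (a hypothetical item, not something the backup spider below uses), a blog post item could look like this:
import scrapy

class BlogPostItem(scrapy.Item):
    # A hypothetical item describing one blog post; the backup spider
    # below does not use it, but this is the kind of definition items.py holds.
    title = scrapy.Field()
    date = scrapy.Field()
    content = scrapy.Field()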

Write The Spider

Then I wrote a Python program called blog.py under the directory blog/spiders.
A class ‘BlogCrawler’ is declared, which subclasses scrapy.Spider. The methods ‘start_requests’ and ‘parse’ are the hooks Scrapy expects; all we need to do is fill in their bodies.
In addition, Beautiful Soup is a powerful Python library for parsing HTML and XML documents, and it is used here to extract data from the downloaded pages.
import scrapy
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import shutil
import os

class BlogCrawler(scrapy.Spider):
  name = 'blog'

  # Shared at module level: the output file name and the list that
  # accumulates every post parsed by the spider.
  global mt_filename, post_list
  mt_filename = 'duduh_blog.txt'
  post_list = []

  # Delete the MT output file if it already exists.
  if os.path.exists(mt_filename):
    os.remove(mt_filename)

  def start_requests(self):
    # Get every post from page 1 to 34
    #
    post_url = 'https://duduh.tian.yam.com/posts?page='
    urls = []
    for i in range(1,35):
      urls.append(post_url + str(i))

    # Request each index page; parse() handles the posts on it
    for url in urls:
      yield scrapy.Request(url=url, callback=self.parse)

  def closed(self, reason):
    global post_list
    post_list.sort(reverse=True)
    # print(post_list)
    # print(len(post_list))
    self.output_to_mt(post_list)

  def parse(self, response):
    global post_list
    res = BeautifulSoup(response.body, "html.parser")
    titles = res.select('.post-content-block')

    # Iterate over each post block found on the page
    for title in titles:
      link = title.select('a')[0]['href']
      post_date = title.select('.post-date')[0].text
      yield scrapy.Request(link, self.parse_detail)
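
The loop above hard-codes the fact that the blog has 34 index pages. As a rough sketch (assuming the index pages expose numbered pagination links under a .pagination a selector, which is only a guess about the site's markup), the last page number could be detected automatically instead:
import requests
from bs4 import BeautifulSoup

def get_last_page(index_url='https://duduh.tian.yam.com/posts?page=1'):
    # Read the first index page and return the largest page number
    # found among the pagination links; fall back to 1 if none are found.
    soup = BeautifulSoup(requests.get(index_url).text, 'html.parser')
    pages = [int(a.text) for a in soup.select('.pagination a')
             if a.text.strip().isdigit()]
    return max(pages) if pages else 1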

Parse Multiple Layers of Web Pages

If multiple layers of web pages need to be parsed, another scrapy.Request can be yielded from within the parse function. In our code, parse_detail is passed as the callback of the Request yielded by parse.
def parse_detail(self, response):
    global post_list

    def get_and_save_img(post_image_url):
      res = requests.get(post_image_url, stream = True)
      directory = os.getcwd() + '/images/'
      if not os.path.exists(directory):
        os.makedirs(directory)

      # Only save the image when the download succeeded and the file is a jpg.
      if res.status_code == 200 and post_image_url.split('.')[-1] == 'jpg':
        filename = post_image_url.split('/')[-1]
        filepath = directory + filename
        with open(filepath, 'wb') as f:
          res.raw.decode_content = True
          shutil.copyfileobj(res.raw, f)
        del res

    def convert_datetime(datetime_string):
      # code example:
      # d = datetime.strptime('2007-07-18 10:03:19', '%Y-%m-%d %H:%M:%S')
      # day_string = d.strftime('%Y-%m-%d')
      #
      # Input date format example from the blog:
      #   31, Jul 2014 15:19
      # This should be converted into the format MM/DD/YYYY hh:mm:ss AM|PM
      # (the AM|PM part is optional).
      #
      # Note: %b matches the abbreviated month name.
      #
      real_date  = datetime.strptime(datetime_string, '%d, %b %Y %H:%M')
      mt_date = real_date.strftime('%m/%d/%Y %H:%M:%S')
      return real_date, mt_date

    res = BeautifulSoup(response.body, "lxml")
    detail = res.select('.post-content-block')

    post_title   = detail[0].select('h3')[0].text
    real_date, post_date = convert_datetime(detail[0].select('.post-date')[0].text)

    # Collect the images before extracting the content block, because
    # extract() removes that tag (and the images inside it) from the tree.
    post_images  = detail[0].select('img')
    post_content = detail[0].select('.post-content')[0].extract()

    ## Get and save images
    for post_image in post_images:
      get_and_save_img(post_image['src'])
    # Save the results in the global list 'post_list' 
    post_list.append([real_date, post_title, post_date, post_content])
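
As a side note, Scrapy itself offers a way to hand data from one parsing layer to the next without a module-level list: the Request's meta dictionary. The following is only a minimal sketch of that pattern (it reuses the article's CSS selectors, but it is not the spider used here):
import scrapy

class MetaExampleSpider(scrapy.Spider):
    # A minimal sketch of passing data between callbacks with Request.meta.
    name = 'meta_example'
    start_urls = ['https://duduh.tian.yam.com/posts?page=1']

    def parse(self, response):
        for block in response.css('.post-content-block'):
            link = block.css('a::attr(href)').get()
            post_date = block.css('.post-date::text').get()
            # Carry the post date along to the detail-page callback.
            yield scrapy.Request(response.urljoin(link),
                                 callback=self.parse_detail,
                                 meta={'post_date': post_date})

    def parse_detail(self, response):
        yield {
            'title': response.css('h3::text').get(),
            'date': response.meta['post_date'],
        }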

The Close Function

After all the iterations, all the data has been retrieved and saved to the global list post_list, and it is time to write it to a file. In the Scrapy framework there is a built-in method called closed that is invoked once the spider has finished all of its work.
def closed(self, reason):
  global post_list
  post_list.sort(reverse=True)
  self.output_to_mt(post_list)
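
Each entry appended to post_list is [real_date, post_title, post_date, post_content], so sorting with reverse=True orders the posts from newest to oldest by the parsed datetime. An equivalent, slightly more explicit variant (my own tweak, not the original code) sorts on that datetime alone:
# Sort strictly by the parsed datetime (the first element of each entry),
# newest post first, without ever falling back to comparing the other fields.
post_list.sort(key=lambda post: post[0], reverse=True)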

Save Backup File in MT (Movable Type Import / Export) Format

The Movable Type Import / Export format document is here.
An example is as follows:
TITLE: A dummy title
BASENAME: a-dummy-title
AUTHOR: Foo Bar
DATE: 01/31/2002 03:31:05 PM
PRIMARY CATEGORY: Media
CATEGORY: News
-----
BODY:
This is the body.
Another paragraph here.
Another paragraph here.
-----
EXTENDED BODY:
Here is some more text.
Another paragraph here.
Another paragraph here.
-----
COMMENT:
AUTHOR: Foo
DATE: 01/31/2002 15:47:06
This is
the body of this comment.
-----
COMMENT:
AUTHOR: Bar
DATE: 02/01/2002 04:02:07 AM
IP: 205.66.1.32
EMAIL: me@bar.com
This is the body of
another comment. It goes
up to here.
-----
PING:
TITLE: My Entry
URL: http://www.foo.com/old/2002/08/
IP: 206.22.1.53
BLOG NAME: My Weblog
DATE: 08/05/2002 16:09:12
This is the start of my
entry, and here it…
-----
--------
The code is here:
def output_to_mt(self, post_list):
  global mt_filename
  for post in post_list:
    # Build one MT entry: title, author and date, then the body,
    # using the MT field (-----) and entry (--------) separators.
    mt  = 'TITLE: ' + post[1] + '\n'
    mt += 'AUTHOR: duduh' + '\n'
    mt += 'DATE: ' + post[2] + '\n'
    mt += '-----\n'
    mt += 'BODY:' + '\n'
    mt += str(post[3]) + '\n'
    mt += '-----\n'
    mt += '--------\n'
    with open(mt_filename, 'a+') as f:
      f.write(mt)
    self.log('Saved file %s' % mt_filename)
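
A small variant of the same function (my own sketch, not the code I actually ran) opens the backup file once and writes all entries in a single pass, instead of re-opening it for every post:
def output_to_mt(self, post_list):
  # Open the backup file once and append every post as one MT entry.
  with open(mt_filename, 'a+') as f:
    for post in post_list:
      f.write('TITLE: ' + post[1] + '\n')
      f.write('AUTHOR: duduh\n')
      f.write('DATE: ' + post[2] + '\n')
      f.write('-----\n')
      f.write('BODY:\n')
      f.write(str(post[3]) + '\n')
      f.write('-----\n')
      f.write('--------\n')
  self.log('Saved file %s' % mt_filename)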

Run The Spider

The command to run the spider is:
scrapy crawl blog
Then we get the resulting backup file. Done!
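
For a long crawl it can be handy to keep the crawl log for later inspection; Scrapy's --logfile option writes it to a file:
scrapy crawl blog --logfile crawl.log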