Preface
One of my friends needed to back up all the articles from her blog on “tian.yam.com” (天空部落格). She might move the blog to another platform, such as Blogger or a self-hosted WordPress site. Unfortunately, “tian.yam.com” did not seem to provide any tool for backing up all the articles (or maybe I just did not find one), so I started to look for a feasible way to do it myself.
Scrapy
I found a perfect tool for this kind of job: Scrapy. Scrapy is an open source and collaborative framework for extracting the data you need from websites in a fast, simple, yet extensible way.
I used Scrapy as a Python crawler to fetch the contents of the blog and save them to the local disk.
Installation
To install Scrapy using conda, run:
conda install -c conda-forge scrapy
(I use conda because I already have Anaconda installed.) Alternatively, install it with pip:
pip install Scrapy
Start A Project
The official tutorial is an easy place to start learning. We can start a project by typing the command:
scrapy startproject blog
This creates the blog directory with the following contents:
blog/
    scrapy.cfg            # deploy configuration file
    blog/                 # project's Python module, you'll import your code from here
        __init__.py
        items.py          # project items definition file
        pipelines.py      # project pipelines file
        settings.py       # project settings file
        spiders/          # a directory where you'll later put your spiders
            __init__.py
Write The Spider
Then I wrote a Python program called blog.py under the directory blog/spiders. It declares a class ‘BlogCrawler’. Scrapy calls the methods ‘start_requests’ and ‘parse’ by default, so all we need to do is add our code to them.
In addition, Beautiful Soup is a powerful Python library for parsing HTML and XML documents; it is used here to pick elements out of each page.
import scrapy
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import shutil
import os
class BlogCrawler(scrapy.Spider):
    name = 'blog'
    global mt_filename, post_list
    mt_filename = 'duduh_blog.txt'
    post_list = []
    # Delete the MT file if it exists.
    if os.path.exists(mt_filename):
        os.remove(mt_filename)

    def start_requests(self):
        # Get every post from page 1 to 34
        post_url = 'https://duduh.tian.yam.com/posts?page='
        urls = []
        for i in range(1, 35):
            urls.append(post_url + str(i))
        # Parse each post in every page
        for url in urls:
            yield scrapy.Request(url=url, callback=self.parse)

    def closed(self, reason):
        global post_list
        post_list.sort(reverse=True)
        # print(post_list)
        # print(len(post_list))
        self.output_to_mt(post_list)

    def parse(self, response):
        global post_list
        res = BeautifulSoup(response.body, "html.parser")
        titles = res.select('.post-content-block')
        # Iterate over the titles
        for title in titles:
            link = title.select('a')[0]['href']
            post_date = title.select('.post-date')[0].text
            yield scrapy.Request(link, self.parse_detail)
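The page range in start_requests is hard-coded for this one blog. As a minimal sketch, the same URL construction can be pulled into a standalone helper and checked without running a crawl. The name build_page_urls is hypothetical and not part of Scrapy.

```python
def build_page_urls(base_url, first_page, last_page):
    """Return the listing-page URLs for first_page..last_page, inclusive."""
    return [f"{base_url}{page}" for page in range(first_page, last_page + 1)]


# Same values as in start_requests: pages 1 to 34 of the blog's post listing.
urls = build_page_urls('https://duduh.tian.yam.com/posts?page=', 1, 34)
print(len(urls))   # 34
print(urls[0])     # https://duduh.tian.yam.com/posts?page=1
print(urls[-1])    # https://duduh.tian.yam.com/posts?page=34
```

For a different blog, only the base URL and the last page number would need to change.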
Parse Multiple Layers of Web Page
If several layers of web pages need to be parsed, a callback such as parse can yield a new scrapy.Request with a different callback. In our code, parse yields one request per post, and Scrapy passes each response to parse_detail.
    def parse_detail(self, response):
        global post_list

        def get_and_save_img(post_image_url):
            res = requests.get(post_image_url, stream=True)
            directory = os.getcwd() + '/images/'
            if not os.path.exists(directory):
                os.makedirs(directory)
            # Only save successful responses for .jpg files
            if res.status_code == 200 and post_image_url.split('.')[-1] == 'jpg':
                filename = post_image_url.split('/')[-1]
                filepath = directory + filename
                res.raw.decode_content = True
                # The with block closes the file even if copying fails
                with open(filepath, 'wb') as f:
                    shutil.copyfileobj(res.raw, f)
            del res

        def convert_datetime(datetime_string):
            # Code example:
            #   d = datetime.strptime('2007-07-18 10:03:19', '%Y-%m-%d %H:%M:%S')
            #   day_string = d.strftime('%Y-%m-%d')
            #
            # The blog's date format looks like:
            #   31, Jul 2014 15:19
            # It should be converted to the MT format MM/DD/YYYY hh:mm:ss AM|PM.
            # The AM|PM part is optional.
            #
            # ** %b matches the abbreviated month name.
            real_date = datetime.strptime(datetime_string, '%d, %b %Y %H:%M')
            mt_date = real_date.strftime('%m/%d/%Y %H:%M:%S')
            return real_date, mt_date

        res = BeautifulSoup(response.body, "lxml")
        detail = res.select('.post-content-block')
        post_title = detail[0].select('h3')[0].text
        real_date, post_date = convert_datetime(detail[0].select('.post-date')[0].text)
        # Look up the images first: extract() below removes .post-content from the tree
        post_images = detail[0].select('img')
        post_content = detail[0].select('.post-content')[0].extract()
        # Get and save images
        for post_image in post_images:
            get_and_save_img(post_image['src'])
        # Save the results in the global list 'post_list'
        post_list.append([real_date, post_title, post_date, post_content])
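The date conversion can be tried on its own with only the standard library. The sketch below shows both the 24-hour output used in the spider and the 12-hour variant with an AM/PM marker. The helper name to_mt_date and its twelve_hour flag are made up for illustration, and %b and %p depend on the locale, so an English locale is assumed.

```python
from datetime import datetime


def to_mt_date(blog_date, twelve_hour=False):
    """Convert a date such as '31, Jul 2014 15:19' to the MT date format."""
    parsed = datetime.strptime(blog_date, '%d, %b %Y %H:%M')
    if twelve_hour:
        # 12-hour clock with an explicit AM/PM marker.
        return parsed.strftime('%m/%d/%Y %I:%M:%S %p')
    # 24-hour clock; the AM/PM marker is omitted.
    return parsed.strftime('%m/%d/%Y %H:%M:%S')


print(to_mt_date('31, Jul 2014 15:19'))                    # 07/31/2014 15:19:00
print(to_mt_date('31, Jul 2014 15:19', twelve_hour=True))  # 07/31/2014 03:19:00 PM
```

The blog's dates carry no seconds, so the seconds field is always 00 in the output.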
The Close Function
After all the iterations, all the data has been retrieved and saved to the global list post_list. It’s time to save it to a file. Scrapy calls a spider's closed method, if one is defined, when the spider finishes, so that is where the output is written.
    def closed(self, reason):
        global post_list
        post_list.sort(reverse=True)
        self.output_to_mt(post_list)
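Sorting a list of lists compares the elements in order, so two posts with the same timestamp would fall back to comparing the titles and, if those also match, the content objects, which may not support ordering. A minimal sketch that sorts on the date only is shown here, using made-up sample rows in the same layout as post_list.

```python
from datetime import datetime

# Hypothetical sample rows in the same layout as post_list:
# [real_date, post_title, post_date, post_content]
sample_posts = [
    [datetime(2014, 7, 31, 15, 19), 'Older post', '07/31/2014 15:19:00', '<p>a</p>'],
    [datetime(2015, 1, 2, 9, 0), 'Newest post', '01/02/2015 09:00:00', '<p>b</p>'],
    [datetime(2014, 12, 25, 8, 30), 'Middle post', '12/25/2014 08:30:00', '<p>c</p>'],
]

# Sort newest first, comparing only the datetime in position 0.
sample_posts.sort(key=lambda post: post[0], reverse=True)

for post in sample_posts:
    print(post[2], post[1])
```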
Save Backup File in MT (Movable Type Import / Export) Format
The format is described in the Movable Type Import / Export format documentation. Each section ends with a line of five hyphens (-----\n), and each entry ends with a line of eight hyphens (--------\n). An example is as follows:
TITLE: A dummy title
BASENAME: a-dummy-title
AUTHOR: Foo Bar
DATE: 01/31/2002 03:31:05 PM
PRIMARY CATEGORY: Media
CATEGORY: News
-----
BODY:
This is the body.
Another paragraph here.
Another paragraph here.
-----
EXTENDED BODY:
Here is some more text.
Another paragraph here.
Another paragraph here.
-----
COMMENT:
AUTHOR: Foo
DATE: 01/31/2002 15:47:06
This is
the body of this comment.
-----
COMMENT:
AUTHOR: Bar
DATE: 02/01/2002 04:02:07 AM
IP: 205.66.1.32
EMAIL: me@bar.com
This is the body of
another comment. It goes
up to here.
-----
PING:
TITLE: My Entry
URL: http://www.foo.com/old/2002/08/
IP: 206.22.1.53
BLOG NAME: My Weblog
DATE: 08/05/2002 16:09:12
This is the start of my
entry, and here it…
-----
--------
The code that writes this format is:
    def output_to_mt(self, post_list):
        global mt_filename
        for post in post_list:
            mt = 'TITLE: ' + post[1] + '\n'
            mt += 'AUTHOR: duduh\n'
            mt += 'DATE: ' + post[2] + '\n'
            mt += '-----\n'
            mt += 'BODY:\n'
            mt += str(post[3]) + '\n'
            mt += '-----\n'
            mt += '--------\n'
            # Append each entry; write UTF-8 explicitly so the Chinese text
            # does not depend on the platform's default encoding
            with open(mt_filename, 'a+', encoding='utf-8') as f:
                f.write(mt)
        self.log('Saved file %s' % mt_filename)
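The entry layout can also be built as a plain string, which makes it easy to inspect without writing a file. This is a sketch of the same fields that output_to_mt writes, not a full implementation of the MT format; the function name format_mt_entry is hypothetical.

```python
def format_mt_entry(title, author, date, body):
    """Return one entry in Movable Type import format as a string."""
    lines = [
        f'TITLE: {title}',
        f'AUTHOR: {author}',
        f'DATE: {date}',
        '-----',       # end of the metadata section
        'BODY:',
        str(body),
        '-----',       # end of the body section
        '--------',    # end of the entry
    ]
    return '\n'.join(lines) + '\n'


entry = format_mt_entry('A dummy title', 'duduh', '07/31/2014 15:19:00',
                        '<p>This is the body.</p>')
print(entry)
```

Returning a string instead of writing directly keeps the formatting separate from the file handling, so all entries could be joined and written in one pass.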
Run The Spider
The command to run the spider is:
scrapy crawl blog
Then we get the result file. Done!
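One way to check the result is to list the titles in the generated file and compare their count with the number of posts on the blog. The sketch below assumes the file follows the layout written by output_to_mt; read_mt_titles is a hypothetical helper, and the sample text stands in for the real file.

```python
def read_mt_titles(text):
    """Return the TITLE value of every entry in MT-format text."""
    titles = []
    in_header = True  # TITLE only counts before the first '-----' of an entry
    for line in text.splitlines():
        if line == '--------':
            in_header = True       # the next line starts a new entry
        elif line == '-----':
            in_header = False      # the metadata section is over
        elif in_header and line.startswith('TITLE: '):
            titles.append(line[len('TITLE: '):])
    return titles


# Made-up sample with two entries; the second body contains a TITLE-like line.
sample = (
    'TITLE: First\nAUTHOR: duduh\nDATE: 07/31/2014 15:19:00\n'
    '-----\nBODY:\n<p>one</p>\n-----\n--------\n'
    'TITLE: Second\nAUTHOR: duduh\nDATE: 01/02/2015 09:00:00\n'
    '-----\nBODY:\nTITLE: not a real title\n-----\n--------\n'
)
print(read_mt_titles(sample))  # ['First', 'Second']
```

To check the real backup, read duduh_blog.txt with open(..., encoding='utf-8') and pass its contents to read_mt_titles.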