
Wednesday, September 20, 2017

Backup articles from "tian.yam.com" using Python Scrapy framework as a crawler


Preface

A friend of mine needed to back up all of the articles on her blog hosted at “tian.yam.com” (天空部落格), since she might move her blog to another platform such as Blogger or a self-hosted WordPress site. Unfortunately, “tian.yam.com” did not seem to provide any tool to back up all of her articles (at least, I could not find one), so I started researching feasible ways to do it myself.

Scrapy

I found a perfect tool for this kind of job: Scrapy.
Scrapy describes itself as “an open source and collaborative framework for extracting the data you need from websites, in a fast, simple, yet extensible way.”
I used Scrapy as a Python crawler to fetch the contents of the blog and save them to local disk.

Installation

To install Scrapy using conda, run:
conda install -c conda-forge scrapy 
(I used conda for the installation since I already have Anaconda installed.)
or
pip install Scrapy
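
To confirm the installation succeeded, a quick check from the Python interpreter (Scrapy exposes its version string):

import scrapy
print(scrapy.__version__)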

Start A Project

It is easy to get started by following the official tutorial.
We can create a project by typing the command:
scrapy startproject blog
Then it will create the whole blog directory:
blog/
    scrapy.cfg   # deploy configuration file
    blog/        # project's Python module, you'll import your code from here
        __init__.py
        items.py      # project items definition file
        pipelines.py  # project pipelines file
        settings.py   # project settings file
        spiders/      # a directory where you'll later put your spiders
            __init__.py

Write The Spider

Then I wrote a Python program called blog.py under the directory blog/spiders.
It declares a class BlogCrawler that subclasses scrapy.Spider. The methods start_requests and parse are hooks provided by the framework; all we need to do is add our code to those functions.
In addition, Beautiful Soup is a powerful Python library for parsing HTML and XML documents, and I used it to extract data from each page.
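As a minimal sketch of how Beautiful Soup extracts elements with CSS selectors (the HTML snippet below is a made-up example, not taken from the real blog):

from bs4 import BeautifulSoup

html = '<div class="post-content-block"><a href="/posts/1">First post</a></div>'
soup = BeautifulSoup(html, 'html.parser')
print(soup.select('.post-content-block a')[0]['href'])   # prints '/posts/1'

The spider itself looks like this: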
import scrapy
from bs4 import BeautifulSoup
from datetime import datetime
import requests
import shutil
import os

class BlogCrawler(scrapy.Spider):
  name = 'blog'

  # Module-level globals shared across the spider's callbacks
  global mt_filename, post_list
  mt_filename = 'duduh_blog.txt'
  post_list = []

  # Delete the MT output file if it already exists,
  # since output_to_mt() appends to it.
  if os.path.exists(mt_filename):
    os.remove(mt_filename)

  def start_requests(self):
    # Get every post from page 1 to 34
    #
    post_url = 'https://duduh.tian.yam.com/posts?page='
    urls = []
    for i in range(1,35):
      urls.append(post_url + str(i))

    # parse each post in every page
    for url in urls:
      yield scrapy.Request(url=url, callback=self.parse)

  def closed(self, reason):
    global post_list
    post_list.sort(reverse=True)
    # print(post_list)
    # print(len(post_list))
    self.output_to_mt(post_list)

  def parse(self, response):
    global post_list
    res = BeautifulSoup(response.body, "html.parser")
    titles = res.select('.post-content-block')

    # Iterate over each post block in the page
    for title in titles:
      link = title.select('a')[0]['href']
      post_date = title.select('.post-date')[0].text  # not used here; parsed again in parse_detail
      yield scrapy.Request(link, self.parse_detail)

Parse Multiple Layers of Web Page

If multiple layers of web pages need to be parsed, a scrapy.Request can be yielded from within the parse function with another callback. In our code, parse_detail is registered as the callback of each request yielded by parse.
def parse_detail(self, response):
    global post_list

    def get_and_save_img(post_image_url):
      res = requests.get(post_image_url, stream=True)
      directory = os.getcwd() + '/images/'
      if not os.path.exists(directory):
        os.makedirs(directory)

      # Save only JPEG images; stream the raw bytes straight to disk
      if res.status_code == 200 and post_image_url.split('.')[-1] == 'jpg':
        filename = post_image_url.split('/')[-1]
        filepath = directory + filename
        res.raw.decode_content = True
        with open(filepath, 'wb') as f:
          shutil.copyfileobj(res.raw, f)
        del res

    def convert_datetime(datetime_string):
      # code example:
      # d = datetime.strptime('2007-07-18 10:03:19', '%Y-%m-%d %H:%M:%S')
      # day_string = d.strftime('%Y-%m-%d')
      #
      # Now, date input format example:
      #   31, Jul 2014 15:19
      # This should be converted in the format MM/DD/YYYY hh:mm:ss AM|PM.
      # The AM|PM is optional.
      #
      # ** using %b for month name.
      #
      real_date  = datetime.strptime(datetime_string, '%d, %b %Y %H:%M')
      mt_date = real_date.strftime('%m/%d/%Y %H:%M:%S')
      return real_date, mt_date

    res = BeautifulSoup(response.body, "lxml")
    detail = res.select('.post-content-block')

    post_title   = detail[0].select('h3')[0].text
    real_date, post_date    = convert_datetime(detail[0].select('.post-date')[0].text)
    post_content = detail[0].select('.post-content')[0].extract()

    post_images  = detail[0].select('img')
    ## Get and save images
    for post_image in post_images:
      get_and_save_img(post_image['src'])
    # Save the results in the global list 'post_list' 
    post_list.append([real_date, post_title, post_date, post_content])

The closed Function

After all the iterations, all the data has been retrieved and saved to the global list post_list, and it's time to write it to a file. In the Scrapy framework, a spider's built-in closed method is invoked when all the work is done.
def closed(self, reason):
  global post_list
  post_list.sort(reverse=True)
  self.output_to_mt(post_list)

Save Backup File in MT (Movable Type Import / Export) Format

The Movable Type Import / Export format document is here.
An example is as follows:
TITLE: A dummy title
BASENAME: a-dummy-title
AUTHOR: Foo Bar
DATE: 01/31/2002 03:31:05 PM
PRIMARY CATEGORY: Media
CATEGORY: News
----- (-----\n)
BODY:
This is the body.
Another paragraph here.
Another paragraph here.
-----
EXTENDED BODY:
Here is some more text.
Another paragraph here.
Another paragraph here.
-----
COMMENT:
AUTHOR: Foo
DATE: 01/31/2002 15:47:06
This is
the body of this comment.
-----
COMMENT:
AUTHOR: Bar
DATE: 02/01/2002 04:02:07 AM
IP: 205.66.1.32
EMAIL: me@bar.com
This is the body of
another comment. It goes
up to here.
-----
PING:
TITLE: My Entry
URL: http://www.foo.com/old/2002/08/
IP: 206.22.1.53
BLOG NAME: My Weblog
DATE: 08/05/2002 16:09:12
This is the start of my
entry, and here it…
----- (-----\n)
-------- (--------\n)
The code is here:
def output_to_mt(self, post_list):
  global mt_filename
  for post in post_list:
    mt  = 'TITLE: ' + post[1] + '\n'
    mt += 'AUTHOR: duduh\n'
    mt += 'DATE: ' + post[2] + '\n'
    mt += '-----\n'
    mt += 'BODY:\n'
    mt += str(post[3]) + '\n'
    mt += '-----\n'
    mt += '--------\n'
    with open(mt_filename, 'a+') as f:
      f.write(mt)
    self.log('Saved file %s' % mt_filename)

Run The Spider

The command to run the spider is:
scrapy crawl blog
Then we get the result file, duduh_blog.txt.
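
As a quick sanity check, we can peek at the beginning of the exported file (a minimal sketch; duduh_blog.txt is the output filename set in the spider):

with open('duduh_blog.txt') as f:
    print(f.read(200))   # the first entry should start with 'TITLE: ...'

Done!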

Tuesday, May 9, 2017

How to Install OpenCV 3 for Python3 through Conda in OSX?


My system is running Python 3:

$ python --version
Python 3.6.1 :: Anaconda custom (x86_64)

Type the following command to look up the package:

$ anaconda show menpo/opencv3

It shows:

Using Anaconda API: https://api.anaconda.org
Name:    opencv3
Summary:
Access:  public
Package Types:  conda
Versions:
   + 3.1.0
   + 3.2.0

To install this package with conda run:
     conda install --channel https://conda.anaconda.org/menpo opencv3

So, just type the last command shown above:

$ conda install --channel https://conda.anaconda.org/menpo opencv3

or

$ conda install -c menpo opencv3=3.2.0

(from https://anaconda.org/menpo/opencv3)
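
To verify the installation, a quick check from the Python interpreter (the cv2 module exposes its version string):

import cv2
print(cv2.__version__)   # e.g. '3.2.0'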

Done!

Friday, March 24, 2017

Edit $PATH After Anaconda Installed If 'conda' Command Can't Be Found


Anaconda is a Python distribution that bundles many great packages for scientific and mathematical computation. Users just install it once and everything is set up.

Problem after installation:

The ‘conda’ command can’t be found after Anaconda is installed.

Solution:

Edit the shell configuration file in the user's home directory, according to the shell in use (bash or zsh).

If using the bash shell, edit the .bashrc file.
If using the zsh shell, edit the .zshrc file.

Add the following line to the end of the file and save:

export PATH="$HOME/anaconda/bin:$PATH"

Reopen the terminal and it’s all done!
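
To confirm that conda is now on the PATH, a minimal check using Python's standard library (shutil.which reports where a command resolves from):

import shutil
print(shutil.which('conda'))   # should print a path under $HOME/anaconda/bin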