Metadata-Version: 2.0
Name: comiccrawler
Version: 2016.4.21
Summary: An image crawler with extendible modules and gui
Home-page: https://github.com/eight04/ComicCrawler
Author: eight
Author-email: eight04@gmail.com
License: MIT
Keywords: image crawler
Platform: UNKNOWN
Classifier: Development Status :: 5 - Production/Stable
Classifier: Environment :: Console
Classifier: Environment :: Win32 (MS Windows)
Classifier: Intended Audience :: End Users/Desktop
Classifier: License :: OSI Approved :: MIT License
Classifier: Natural Language :: Chinese (Traditional)
Classifier: Operating System :: Microsoft :: Windows :: Windows 7
Classifier: Programming Language :: Python :: 3.4
Classifier: Topic :: Internet
Requires-Dist: docopt (~=0.6.2)
Requires-Dist: pyexecjs (~=1.3.1)
Requires-Dist: pythreadworker (~=0.4.0)
Requires-Dist: requests (~=2.9.1)
Requires-Dist: safeprint (~=0.1.1)

Comic Crawler
=============

Comic Crawler 是用來扒圖的一支 Python Script。擁有簡易的下載管理員、圖書館功能、 與方便的擴充能力。

2016.2.27 更新
--------------

-  "www.comicvip.com" 被 "www.comicbus.com" 取代。詳細請參考 `#7 <https://github.com/eight04/ComicCrawler/issues/7>`__

Todos
-----

-  Make grabber be able to return verbose info?
-  Need a better error log system.
-  Support pool in Sankaku.
-  Add module.get_episode_id to make the module decide how to compare episodes.

Dependencies
------------

-  docopt - command line interface.
-  pyexecjs - to execute javascript.
-  pythreadworker - a small threading library.
-  safeprint - to print unicode chars on Windows.
-  requests - http library.

Development Dependencies
------------------------

-  wheel - create python wheel.
-  twine - upload package.
-  docutils - to test rst.
-  pyxcute - task runner.

下載和安裝（Windows）
---------------------

Comic Crawler is on
`PyPI <https://pypi.python.org/pypi/comiccrawler/>`__. 安裝完
python 後，可以直接用 pip 指令自動安裝。

Install Python
~~~~~~~~~~~~~~

你需要 Python 3.4 以上。安裝檔可以從它的
`官方網站 <https://www.python.org/>`__ 下載。

安裝時記得要選「Add python.exe to path」，才能使用 pip 指令。

Install Node.js
~~~~~~~~~~~~~~~

有些網站的 JavaScript 用 Windows 內建的 Windows Script Host
會解析失敗，建議安裝 `Node.js <https://nodejs.org/>`__.

Install Comic Crawler
~~~~~~~~~~~~~~~~~~~~~

在 cmd 底下輸入以下指令︰

::

    pip install comiccrawler

更新時︰

::

    pip install --upgrade comiccrawler

Supported domains
-----------------

.. DOMAINS
..

    chan.sankakucomplex.com comic.acgn.cc comic.ck101.com comic.sfacg.com danbooru.donmai.us deviantart.com exhentai.org g.e-hentai.org imgbox.com konachan.com m.dmzj.com manhua.dmzj.com seiga.nicovideo.jp tel.dm5.com tsundora.com tumblr.com tw.seemh.com www.8comic.com www.99comic.com www.chuixue.com www.comicbus.com www.comicvip.com www.dm5.com www.facebook.com www.iibq.com www.manhuadao.com www.pixiv.net www.seemh.com yande.re

.. END DOMAINS

使用說明
--------

::

    Usage:
      comiccrawler domains
      comiccrawler download URL [--dest SAVE_FOLDER]
      comiccrawler gui
      comiccrawler (--help | --version)

    Commands:
      domains             列出支援的網址
      download URL        下載指定的 url
      gui                 啟動主視窗

    Options:
      --dest SAVE_FOLDER  設定下載目錄（預設為 "."）
      --help              顯示幫助訊息
      --version           顯示版本

圖形介面
--------

.. figure:: http://i.imgur.com/ZzF0YFx.png
   :alt: 主視窗

-  在文字欄貼上網址後點「加入連結」或是按 Enter
-  若是剪貼簿裡有支援的網址，且文字欄同時是空的，程式會自動貼上
-  對著任務右鍵，可以選擇把任務加入圖書館。圖書館內的任務，在每次程式啟動時，都會檢查是否有更新。

設定檔
------

::

    [DEFAULT]
    ; 設定下載完成後要執行的程式，會傳入下載資料夾的位置
    runafterdownload =

    ; 啟動時自動檢查圖書館更新
    libraryautocheck = true

    ; 下載目的資料夾
    savepath = ~/comiccrawler/download

    ; 開啟 grabber 偵錯
    errorlog = false

    ; 每隔 5 分鐘自動存檔
    autosave = 5

-  設定檔位於 ``%USERPROFILE%\comiccrawler\setting.ini``
-  執行一次 ``comiccrawler gui`` 後關閉，設定檔會自動產生
-  各別的網站會有自己的設定，通常是要填入一些登入相關資訊
-  設定檔會在重新啟動後生效。若 ComicCrawler 正在執行中，可以點「重載設定檔」來載入新設定

Module example
--------------

.. code:: python

    #! python3
    """
    This is an example to show how to write a comiccrawler module.

    """

    import re, urllib.parse
    from ..core import Episode

    # The header used in grabber method
    header = {}

    # The cookies
    cookie = {}

    # Match domain. Support sub-domain.
    domain = ["www.example.com", "comic.example.com"]

    # Module name
    name = "Example"

    # With noepfolder = True, Comic Crawler won't generate subfolder for each episode.
    noepfolder = False

    # Wait 5 seconds between each download.
    rest = 5

    # Specific user settings
    config = {
        "user": "user-default-value",
        "hash": "hash-default-value"
    }

    def load_config():
        """This function will be called each time the config reloaded.
        """
        cookie.update(config)

    def get_title(html, url):
        """Return mission title.

        Title will be used in saving filepath, so be sure to avoid duplicate title.
        """
        return re.search("<h1 id='title'>(.+?)</h1>", html).group(1)

    def get_episodes(html, url):
        """Return episode list.

        The episode list should be sorted by date, oldest first.
        """
        match_iter = re.finditer("<a href='(.+?)'>(.+?)</a>", html)
        episodes = []
        for match in match_iter:
            m_url, title = match.groups()
            episodes.append(Episode(title, urllib.parse.urljoin(url, m_url)))
        return episodes

    def get_images(html, url):
        """Get the URL of all images. Return list, iterator, or string.

        The list and iterator may generate URL string or a callback function to get URL string.
        """

        match_iter = re.finditer("<img src='(.+?)'>", html)
        return [match.group(1) for match in match_iter]

    def get_next_page(html, url):
        """Return the url of the next page."""
        match = re.search("<a id='nextpage' href='(.+?)'>next</a>", html)
        if match:
            return match.group(1)

    def errorhandler(error, episode):
        """Downloader will call errorhandler if there is an error happened when
        downloading image. Normally you can just ignore this function.
        """
        pass


Changelog
---------

-  2016.4.21

   -  Big rewrite.
   -  Move to requests.
   -  Move to pythreadworker 0.4.0.
   -  Add the ability to load module from ``~/comiccrawler/mods``
   -  Drop migrate command.

-  2016.4.20

   -  Update install_requires.

-  2016.4.13

   -  Fix facebook bug.
   -  Move to doit.

-  2016.4.8

   -  Fix get_next_page error.
   -  Fix key error in CLI.

-  2016.4.4

   -  Use new API!
   -  Analyzer will check the last episode to decide whether to analyze all pages.
   -  Support multiple images in one page.
   -  Change how getimgurl and getimgurls work.

-  2016.4.2

   -  Add tumblr module.
   -  Enhance: support sub-domain in ``mods.get_module``.

-  2016.3.27

   -  Fix: handle deleted post (konachan).
   -  Fix: enhance dialog. try to fix `#8 <https://github.com/eight04/ComicCrawler/issues/8>`__.

-  2016.2.29

   -  Fix: use latest comicview.js (8comic).

-  2016.2.27

   -  Fix: lastcheckupdate doesn't work.
   -  Add: comicbus domain (8comic).

-  2016.2.15.1

   -  Fix: can not add mission.

-  2016.2.15

   -  Add `lastcheckupdate` setting. Now the library will only automatically check updates once a day.
   -  Refactor. Use MissionProxy, Mission doesn't inherit UserWorker anymore.

-  2016.1.26

   -  Change: checking updates won't affect mission which is downloading.
   -  Fix: page won't skip if the savepath contains "~".
   -  Add: a new url pattern in facebook.

-  2016.1.17

   -  Fix: an url matching issue in Facebook.
   -  Enhance: downloader will loop through other episodes rather than stop current mission on crawlpage error.

-  2016.1.15

   -  Fix: ComicCrawler doesn't save session during downloading.

-  2016.1.13

   -  Handle HTTPError 429.

-  2016.1.12

   -  Add facebook module.
   -  Add ``circular`` option in module. Which should be set to ``True`` if downloader doesn't know which is the last page of the album. (e.g. Facebook)

-  2016.1.3

   -  Fix downloading failed in seemh.

-  2015.12.9

   -  Fix build-time dependencies.

-  2015.11.8

   -  Fix next page issue in danbooru.

-  2015.10.25

   -  Support nico seiga.
   -  Try to fix MemoryError when writing files.

-  2015.10.9

   -  Fix unicode range error in gui. See http://is.gd/F6JfjD

-  2015.10.8

   -  Fix an error that unable to skip episode in pixiv module.

-  2015.10.7

   -  Fix errors that unable to create folder if title contains "{}" characters.

-  2015.10.6

   -  Support search page in pixiv module.

-  2015.9.29

   -  Support http://www.chuixue.com.

-  2015.8.7

   -  Fixed sfacg bug.

-  2015.7.31

   -  Fixed: libraryautocheck option does not work.

-  2015.7.23

   -  Add module dmzj\_m. Some expunged manga may be accessed from mobile page. ``http://manhua.dmzj.com/name => http://m.dmzj.com/info/name.html``

-  2015.7.22

   -  Fix bug in module eight.

-  2015.7.17

   -  Fix episode selecting bug.

-  2015.7.16

   -  Added:

      -  Cleanup unused missions after session loads.
      -  Handle ajax episode list in seemh.
      -  Show an error if no update to download when clicking "download updates".
      -  Show an error if failing to load session.

   -  Changed:

      -  Always use "UPDATE" state if the mission is not complete after re-analyzing.
      -  Create backup if failing to load session instead of moving them to "invalid-save" folder.
      -  Check edit flag in MissionManager.save().

   -  Fixed:

      -  Can not download "updated" mission.
      -  Update checking will stop on error.
      -  Sankaku module is still using old method to create Episode.

-  2015.7.15

   -  Add module seemh.

-  2015.7.14

   -  Refactor: pull out download\_manager, mission\_manager.
   -  Enhance content\_write: use os.replace.
   -  Fix mission\_manager save loop interval.

-  2015.7.7

   -  Fix danbooru bug.
   -  Fix dmzj bug.

-  2015.7.6

   -  Fix getepisodes regex in exh.

-  2015.7.5

   -  Add error handler to dm5.
   -  Add error handler to acgn.

-  2015.7.4

   -  Support imgbox.

-  2015.6.22

   -  Support tsundora.

-  2015.6.18

   -  Fix url quoting issue.

-  2015.6.14

   -  Enhance ``safeprint``. Use ``echo`` command.
   -  Enhance ``content_write``. Add ``append=False`` option.
   -  Enhance ``Crawler``. Cache imgurl.
   -  Enhance ``grabber``. Add ``cookie=None`` option. Change errorlog behavior.
   -  Fix ``grabber`` unicode encoding issue.
   -  Some module update.

-  2015.6.13

   -  Fix ``clean_finished``
   -  Fix ``console_download``
   -  Enhance ``get_by_state``

Author
------

-  eight eight04@gmail.com


