packtpub-crawler: Error attempting to claim book from newsletter

~ $ python script/spider.py --config config/prod.cfg --notify ifttt --claimOnly

                      __   __              __                                   __
    ____  ____ ______/ /__/ /_____  __  __/ /_        ______________ __      __/ /__  _____
   / __ \/ __ `/ ___/ //_/ __/ __ \/ / / / __ \______/ ___/ ___/ __ `/ | /| / / / _ \/ ___/
  / /_/ / /_/ / /__/ ,< / /_/ /_/ / /_/ / /_/ /_____/ /__/ /  / /_/ /| |/ |/ / /  __/ /
 / .___/\__,_/\___/_/|_|\__/ .___/\__,_/_.___/      \___/_/   \__,_/ |__/|__/_/\___/_/
/_/                       /_/

Download FREE eBook every day from www.packtpub.com
@see github.com/niqdev/packtpub-crawler

[*] 2017-01-31 10:30 - fetching today's eBooks
[*] configuration file: /app/config/prod.cfg
[*] getting daily free eBook
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/packt/offers/free-learning
[*] fetching url... 200 | https://www.packtpub.com/account/my-ebooks
[+] book successfully claimed
[+] notification sent to IFTTT
[*] getting free eBook from newsletter
[*] fetching url... 200 | https://www.packtpub.com/packt/free-ebook/practical-data-analysis
[-] <type 'exceptions.IndexError'> list index out of range | spider.py@123
Traceback (most recent call last):
  File "script/spider.py", line 123, in main
    packtpub.runNewsletter(currentNewsletterUrl)
  File "/app/script/packtpub.py", line 160, in runNewsletter
    self.__parseNewsletterBookInfo(soup)
  File "/app/script/packtpub.py", line 98, in __parseNewsletterBookInfo
    title = urlWithTitle.split('/')[4].replace('-', ' ').title()
IndexError: list index out of range
[+] error notification sent to IFTTT
[*] done
~ $

It has successfully claimed the book from the newsletter already, but on subsequent days I’m getting the above error.

And it sends an IFTTT notification for the second one 😦
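
For context, the failing line in the traceback takes a fixed index (`[4]`) from the split URL, so any URL with fewer than five segments raises `IndexError`. Below is a minimal sketch of a more defensive variant. It assumes the title is the last non-empty path segment, which is not what the original line does, and `title_from_url` is a hypothetical helper name, not part of packtpub-crawler. It is written for Python 3, while the log above comes from Python 2.

```python
from urllib.parse import urlparse


def title_from_url(url_with_title):
    """Derive a display title from the last path segment of a URL.

    Hypothetical helper: returns None instead of raising IndexError
    when the URL has no usable path segment.
    """
    path = urlparse(url_with_title or "").path
    # Drop empty segments produced by leading, trailing or doubled slashes.
    segments = [segment for segment in path.split("/") if segment]
    if not segments:
        return None
    return segments[-1].replace("-", " ").title()
```

With the newsletter URL from the log, `title_from_url("https://www.packtpub.com/packt/free-ebook/practical-data-analysis")` gives `"Practical Data Analysis"`, and a URL without a path gives `None`, which the caller can treat as "title unknown" rather than as a crash.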

About this issue

  • State: open
  • Created 7 years ago
  • Comments: 39 (30 by maintainers)

Most upvoted comments

Hi guys, I’m creating a Google script that parses PacktPub tweets (it is based on @juzim’s Google script). I’m not sure, but there is a chance that all books from the newsletters will also be published on their Twitter, so there would be no need to fix this 😃 (joking). It’s not finished: it should still exclude duplicates and check whether a link is still available. If you have time, please look at the output and tell me whether it’s fine for the crawler: https://goo.gl/AXtAC8
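
The two missing steps (excluding duplicates and checking whether a link is still available) could look roughly like this in Python. This is only a sketch under assumptions: the real script is a Google script, the names `is_available` and `filter_links` are made up for illustration, and "available" is taken to mean that a HEAD request returns a 2xx status.

```python
from http.client import HTTPException
from urllib.request import Request, urlopen


def is_available(url, timeout=10):
    """Return True if a HEAD request to the URL ends with a 2xx status."""
    try:
        with urlopen(Request(url, method="HEAD"), timeout=timeout) as response:
            return 200 <= response.status < 300
    except (OSError, ValueError, HTTPException):
        # Network errors, HTTP error statuses and malformed URLs all
        # count as "not available".
        return False


def filter_links(urls, check=is_available):
    """Drop duplicate links (keeping first occurrences) and dead links."""
    seen = set()
    result = []
    for url in urls:
        if url in seen:
            continue
        seen.add(url)
        if check(url):
            result.append(url)
    return result
```

Passing the check as a parameter keeps the duplicate filtering testable without network access, for example `filter_links(links, check=lambda url: True)` only removes duplicates.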

Looks like some of the divs have been renamed on the newsletter’s landing page. I compared the page for an older book:

    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left promo-landing-book-picture">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a href="/web/20170113204509/https://dz13w8afd47il.cloudfront.net/networking-and-servers/mastering-aws-development">
                        <img src="/web/20170113204509im_/https://d1ldz4te4covpm.cloudfront.net/sites/default/files/3632EN_Mastering%20AWS%20Development.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left promo-landing-book-info">
                <div class="promo-landing-book-body-title">
                                    </div>
                <div class="promo-landing-book-body">
                    <div><h1>Claim your free 416 page Amazon Web Services eBook!</h1>
<p>This book is a practical guide to developing, administering, and managing applications and infrastructures with AWS. With this, you'll be able to create, design, and manage an entire application life cycle on AWS by using the AWS SDKs, APIs, and the AWS Management Console.</p>
</div>
                </div>
                            </div>

with the current one:

<div id="main-book" class="cf nano" itemscope itemtype="http://schema.org/Book">
    <div class="book-top-block-wrapper cf">
        <div class="cf section-inner">
            <div class="float-left nano-book-main-image">
                <div itemprop="image" itemtype="http://schema.org/URL" itemscope>
                    <a class="fancybox" href="///d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg">
                        <img src="//d1ldz4te4covpm.cloudfront.net/sites/default/files/imagecache/nano_main_image/5612_WYNTKAngular_eBook_500x617.jpg" class="bookimage" />
                    </a>
                </div>
            <div class="float-left nano-book-text">
                <h1>What you need to know about Angular 2</h1>
                <div><strong>Get to grips with the ins and outs of one of the biggest web dev revolutions of this decade with the aid of this free eGuide! From setting up the very basics of Angular to making the most of Directives and Components you’ll discover everything you need to get started building your own web apps today.</strong></div>
                <div id="nano-learn">
                    <div id="nano-learn-title">
                        <div id="nano-learn-title-text">
                            <span id="nano-learn-title-text-inner">
                                What You Will Learn                            </span>
                        </div>
                    </div>

and came up with this hotfix: https://github.com/niqdev/packtpub-crawler/compare/master...mkarpiarz:fix_newsletter_divs I haven’t tested email notifications yet, so I’m not sure what the description would look like, but claiming a newsletter ebook seems to work now. Happy to submit a PR if @juzim hasn’t started working on this yet.
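
This is not the code from the linked branch, just a sketch of the general idea: accept both the old and the new class name for the image div, so the lookup survives the rename. It uses only the standard library `html.parser`, which may differ from the parser the crawler uses, and `BookImageLinkParser` and `find_book_image_link` are hypothetical names. The class names are taken from the two snippets above.

```python
from html.parser import HTMLParser

# Old and new class names of the div that wraps the book image link.
IMAGE_DIV_CLASSES = ("promo-landing-book-picture", "nano-book-main-image")


class BookImageLinkParser(HTMLParser):
    """Remember the href of the first <a> that follows a known image div."""

    def __init__(self):
        super().__init__()
        self.in_image_div = False
        self.href = None

    def handle_starttag(self, tag, attrs):
        attrs = dict(attrs)
        classes = (attrs.get("class") or "").split()
        if tag == "div" and any(name in classes for name in IMAGE_DIV_CLASSES):
            self.in_image_div = True
        elif tag == "a" and self.in_image_div and self.href is None:
            self.href = attrs.get("href")


def find_book_image_link(html):
    """Return the image link href, or None if neither div is present."""
    parser = BookImageLinkParser()
    parser.feed(html)
    return parser.href
```

Returning `None` when neither class is found lets the caller report that the page layout changed, instead of failing later with an index error.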

That’s it?! I’ll try to fix it soon but it might take till next week, sorry.

niqdev notifications@github.com wrote on Sun, 2 Apr 2017, 11:10:

The div promo-landing-book-picture doesn’t exist

The script would just claim the book; you can download it later manually, or run it with the “downloadAll” parameter, which only syncs the archive with the local folder. Notifications etc. are handled on claim, not on download.