Seeking Reader Help with Evil Site Scrapers

In classic cobbler’s-children-go-unshod fashion, I do a horrible job of monetizing the site (but I have made some progress on that front) and have even fallen behind on the sort of thing I normally do religiously, submitting for expense reimbursements (I still haven’t sent receipts off to my publisher for what they’d cover of my UK trip, the one in which I became a volcano refugee).

Another thing I have too often been slow to address is people stealing my work. The last time I found I had a site scraper (meaning someone who rips off entire posts, with no permission and often no attribution), it took me over half a year to get around to it, and readers were enormously helpful in getting the offender to cease.

This time, I learned of two in a day, one via Barry Ritholtz, the other quite by random (a Google search). So I figured I needed to take action faster.

My experience last time was that readers were all over this as soon as I asked, which was very gratifying, but in an uncoordinated fashion, which meant there was a lot of duplicated effort (and yes, in theory I could deal with this myself, but it takes time from posting, which I see as much greater value added).

The key bit is not finding out who the owner of the URL of the offending site is, although that is progress. Even if the operator of the site has registered it in his own name, and whoever the official admin contact is should get a nastygram, one can assume they know they are up to no good and are not likely to desist. The critical party is an admin contact at the webhost. If it is a commercial webhost, they do NOT like getting nastygrams, and they typically shut sites down first and ask questions later.
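For anyone doing the lookup, here is a minimal sketch of pulling likely abuse contacts out of WHOIS output that has been pasted into a script. The sample text and the function name `abuse_contacts` are made up for illustration, and real WHOIS output varies a good deal by registry and host, so treat this as a starting point rather than a reliable parser.

```python
import re

# Hypothetical WHOIS-style text; real output differs between registries and hosts.
SAMPLE_WHOIS = """\
Registrar: Example Registrar, Inc.
Registrar Abuse Contact Email: abuse@registrar.example
Admin Email: owner@scraper.example
OrgAbuseEmail: abuse@webhost.example
"""

# Deliberately loose e-mail pattern; good enough for scanning WHOIS text.
EMAIL = re.compile(r"[\w.+-]+@[\w-]+(?:\.[\w-]+)+")


def abuse_contacts(whois_text: str) -> list[str]:
    """Return e-mail addresses found on lines that mention 'abuse'."""
    found = []
    for line in whois_text.splitlines():
        if "abuse" in line.lower():
            for addr in EMAIL.findall(line):
                if addr not in found:
                    found.append(addr)
    return found


print(abuse_contacts(SAMPLE_WHOIS))
```

The admin address is skipped on purpose, since the point above is that the webhost's contact is the one more likely to act.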

The sites in question are:

HedgeHogs. Here is an example: http://www.hedgehogs.net/pg/newsfeeds/hhwebadmin/item/6339370/guest-post-a-recession-to-remember-lessons-from-the-us-19371938 (this is the one Barry found). Here it is on NC.

Zmarter.com. This is a post they ripped off (here is the NC original)

DragonParadox. Took the same post as above.

The process I suggest is:

1. If you are game for this mission, I really appreciate the help. Check comments first and see what progress has been made.

2. If you want to track down either site, I’d clearly be delighted, please proceed and let everyone know what you find in comments (the last time, some readers did an impressive amount of forensics. A bit scary to see what you can learn about someone from the Web).

3. Another way to help is to prepare the cease and desist letters for me to send off. There are various versions on the Internet. If someone wants to do that, please volunteer yourself in comments and provide your real e-mail address (not in the message body, but in the address line) so I can ping you (I’d normally do this part myself, but I am using an antique, faithful but very sluggish Mac right now and won’t get my tricked out MacBook Air with hopefully my recovered data on it till Friday at best, so even easy stuff is taking longer than it ought to).

Thanks a lot!


27 comments

  1. dearieme

    If any of these bums turn out to be in Britain, feel free to send them a message along the lines of “My chum dearieme is a Very Big Fella. I might ask him to pay you a friendly visit. Know wot I mean?”

  2. Kevin Smith

    Hi Yves,

    You can run your material through
    http://www.copyscape.com/
    for free, maybe after a delay of a few hours or a few days,
    to see where it is being duplicated on the web.

    In a few minutes I found a ton of people who had stolen ~2,000 word blocks of my material.

    Maybe give a hungry young lawyer a percentage of the take
    [“eat what you kill”] to go and [using provisions of the Digital Millennium Copyright Act, “DMCA”]
    extract money from the rip artists,
    and/or make their ISPs shut them down.

    The lawyer could run your material through copyscape every week, hunting for new targets.

    This could become a good little legal niche, with quite a large market potential, and could be highly automated, so might be easily scalable.

    Cheers,

    KCS
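To make the "highly automated" part concrete, here is a rough sketch of how a script might flag a page that copies large blocks of a post. This is not how Copyscape works internally (that is not described here); it is a generic word-shingle overlap check, and the function names `shingles` and `copied_fraction` are made up for illustration.

```python
import re


def shingles(text: str, k: int = 5) -> set[tuple[str, ...]]:
    """Split text into overlapping runs of k lowercase words."""
    words = re.findall(r"[a-z0-9']+", text.lower())
    return {tuple(words[i:i + k]) for i in range(len(words) - k + 1)}


def copied_fraction(original: str, suspect: str, k: int = 5) -> float:
    """Fraction of the original's k-word runs that also appear in the suspect page."""
    orig = shingles(original, k)
    if not orig:
        return 0.0  # original shorter than k words: nothing to compare
    return len(orig & shingles(suspect, k)) / len(orig)


post = "the quick brown fox jumps over the lazy dog again"
scraped = "Posted by admin: " + post + " Read more on our site."
print(copied_fraction(post, scraped))
```

A value near 1.0 suggests a wholesale copy, while a short quotation with a hat tip should score much lower. Where to draw the line is a judgment call.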

    1. Paul Repstock

      “Eat what you kill”..great idea as is going with that fee model..:)

      Just think, it will even help reduce the eager ranks of the ‘foreclosers’..lol

    2. Fractal

      Wow! A new legal business! You can work from home!

      Seriously, I do agree there could be revenues available for this biz model under DMCA, but Yves might NOT want her site to gain the reputation for extracting funds from offenders, but rather might prefer to protect our robust free speech on this site while suppressing parasites.

      Here from findlaw (dot) com are the damages provisions of 17 USC 1203 (apologies for formatting defects):

      “(2) Actual damages. – The court shall award to the complaining
      party the actual damages suffered by the party as a result of the
      violation, and any profits of the violator that are attributable
      to the violation and are not taken into account in computing the
      actual damages, if the complaining party elects such damages at
      any time before final judgment is entered.
      (3) Statutory damages. – (A) At any time before final judgment
      is entered, a complaining party may elect to recover an award of
      statutory damages for each violation of section 1201 in the sum
      of not less than $200 or more than $2,500 per act of
      circumvention, device, product, component, offer, or performance
      of service, as the court considers just.
      (B) At any time before final judgment is entered, a complaining
      party may elect to recover an award of statutory damages for each
      violation of section 1202 in the sum of not less than $2,500 or
      more than $25,000.
      (4) Repeated violations. – In any case in which the injured
      party sustains the burden of proving, and the court finds, that a
      person has violated section 1201 or 1202 within 3 years after a
      final judgment was entered against the person for another such
      violation, the court may increase the award of damages up to
      triple the amount that would otherwise be awarded, as the court
      considers just.
      (5) Innocent violations. –
      (A) In general. – The court in its discretion may reduce or
      remit the total award of damages in any case in which the
      violator sustains the burden of proving, and the court finds,
      that the violator was not aware and had no reason to believe
      that its acts constituted a violation.
      (B) Nonprofit library, archives, educational institutions, or
      public broadcasting entities.”

      1. jeff

        Wrong section. You cited damages related to circumvention, which is not at issue here. The appropriate section is 17 U.S.C. 504.
        http://www.copyright.gov/title17/92chap5.html

        Note that statutory damages are only available if the work is registered with the copyright office prior to filing suit.

        Any lawyer looking to get into the sue-for-profit business might want to investigate how that’s working out for Righthaven…
        http://arstechnica.com/web/news/2010/11/righthaven-retreats-on-democratic-underground-troll-suit.ars
        http://arstechnica.com/tech-policy/news/2010/10/judge-tells-copyright-troll-righthaven-no-its-fair-use.ars
        or for the U.S. Copyright Group…
        http://arstechnica.com/tech-policy/news/2010/11/p2p-settlement-lawyers-lied-committed-fraud-says-new-lawsuit.ars
        While the issue of fair use does not seem to be an issue here as it was in the Righthaven cases, courts don’t seem to be keen on rewarding entrepreneurial lawyers seeking statutory damages.

    3. OregonChris

      Hungry young lawyer, definitely would like to help Yves. No experience in this area but I have a friend who has done IP and I can learn fast.

  3. Foreclosureblues

    i do copy some of your posts (because i believe that they are some of the best on the planet) and you had previously asked that you be credited; and i always do.

    i am not selling nor advertising anything and i don’t receive monetary support from anyone (literally).

    if this is unsatisfactory pls let me know.

    foreclosureblues.wordpress.com

  4. Keating Willcox

    The articles I quote on FB include a hat tip. Is that OK? I post the photos of your daily anecdote, usually with a h/t, is that OK?

    I think that fair use comes into play here. Stuff on web sites is often re-posted. What, after all, is a viral video but just that.

    Much of your NC site consists of articles that you refer to. Should we credit you, or the original author? Do they require written permission to re-post?

    To monetize your site, why not put Amazon.com links to books you like and get revenue from that?

    You do great work, and I visit your site daily.

  5. Dan Duncan

    While you sort it out, always include several internal links to other posts. As long as you have internal links to your other work, then at least the scraped content will get you deep links to your back pages.

    Other considerations: Instead of a simple HTAccess denial (i.e., simply denying access from the offending IP address), do an HTAccess “re-write”. By doing this, you don’t block access…rather, you send the asshole “false” content of your choice. It could be a HUGE file of gibberish like “hy^&GBHBDFNLG#$&H%” …or even better send them “The Best of DownSouth”! [“Please Yves of Naked Cap, we won’t ever scrape your site again. Please, just-make-it-stop! We’re begging you!”] [Of course, you are more than welcome to send them my commentary as well.]

    Or, you could send the scraper into an infinite loop with something like this in HTAccess:

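    # Redirect requests from the scraper's IP range to the scraper's own feed
    RewriteEngine On
    RewriteCond %{REMOTE_ADDR} ^123\.123\.123\.
    RewriteRule ^(.*)$ http://domain.tld/feed [R=302,L]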

    Replace the IP address with that of the scraper and replace the feed URL with the feed from the scraper’s site. That would actually be amusing. If you do this, please let us know what happens.
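If you end up doing this for more than one scraper, a small helper can fill in the placeholders and escape the dots in the IP prefix (an unescaped dot in a RewriteCond pattern matches any character). This is only a sketch: the function name `scraper_redirect_rule` is made up, the IP prefix and feed URL are the same placeholders as above, and the `[R=302,L]` flags are one reasonable choice rather than the only one.

```python
import re


def scraper_redirect_rule(ip_prefix: str, target_url: str) -> str:
    """Build .htaccess lines that redirect one IP prefix to target_url."""
    # re.escape turns "123.123.123." into "123\.123\.123\." so dots match literally.
    pattern = "^" + re.escape(ip_prefix)
    return "\n".join([
        "RewriteEngine On",
        f"RewriteCond %{{REMOTE_ADDR}} {pattern}",
        f"RewriteRule ^(.*)$ {target_url} [R=302,L]",
    ])


print(scraper_redirect_rule("123.123.123.", "http://domain.tld/feed"))
```

Keeping the trailing dot in the prefix pins the match to a whole octet, so `123.123.123.` does not accidentally cover some other range.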

    Here are some other good blacklist options from a helpful site:

    http://perishablepress.com/press/2009/02/03/eight-ways-to-blacklist-with-apaches-mod_rewrite/

    Also, beyond the Cease and Desist, you need to file DMCA Reports with the Search Engines.

    http://www.mcanerin.com/EN/articles/copyright-03.asp

    And finally, since they are scraping to game Google go to Google:

    http://googlewebmastercentral.blogspot.com/2008/06/duplicate-content-due-to-scrapers.html

  6. deeeringothamnus

    Yves, if you have anything valuable without the $ to protect it, you will attract flies like @#$$%. As an extrapolation on the general state of disrespect of intellectual output, which has huge economic implications, you might find reader interest if you could offer coverage on patents and inventors. There are some eye opening doozies here. Under Bush, they pretty much defunded the patent office, so not only does it take longer to issue patents, but patents get issued that never should have been in light of prior art. In this regard, there are large US companies in China running patent mills that plagiarize patents of US citizens. A wrongly issued patent is a right to sue for any and no reason, and, since this kind of litigation can easily be 8 figures, the less wealthy (less government bailed out) party will settle long before going to court. In other words, litigation is an actual business strategy used to steal intellectual property and other assets.

  7. deeringothamnus

    One more thing. If any of these plagiarizing sites are run by the ruling oligarch class, watch out, unless you want to go underground like that wikileaks guy. For example, if your situation is similar to other forms of intellectual property, once you send a cease and desist letter, that gives the other party a right to front-run you into court and sue you to death until you “settle”. Worse, they could run into court with a T.R.O. (“temporary restraining order”), claiming they will suffer irreparable harm, and get a judge to rubber stamp a gag order against you. This is where they take away your so called right of free speech, and you would be the one shut down, not them.

  8. Andrew Macpherson

    I really wouldn’t put too much energy into policing yesterday’s news, because it will take your focus away from processing today’s.

    I’m now at the point where I just follow you, Baseline Scenario, David Rosenberg and Zero Hedge, that’s it. As a group I feel I am able to reach everything I need to know through you.

    Now I’ve settled into knowing what I like to get in my mailbox, I’d be very happy to pay a yearly sub. I’d suggest that your service is easily worth $50 a year, and I doubt any of your consistent subscribers would bat an eyelid at a number like that. Of course I don’t know how many subs you have, but I’d suggest it’ll help to at least ask everyone what their contribution comfort level is in a poll, that would be a start.

      1. fullFrontal patDown

        Second the motion. Yesterday’s rice approaches zero as today approaches tomorrow.

        When sniffing out the web for culprits, be sure to have cookies turned off and firewall turned on. Don’t be tempted into compromising your defense just long enough to steal a peek. Recheck your settings now.

        Grazia
        !

  9. Adam's Myth

    Remove your own post’s links to these offending sites. You have a lot of Google juice, which you are handing to your enemies for free.

    Instead, display the offending URL in a way that Google won’t follow, i.e. text, not href, and without the http.
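As a small illustration of the "text, not href, and without the http" suggestion, here is a sketch that strips the scheme from a URL so it can be pasted into a post as plain text. The function name `as_plain_text` is made up, and whether a given blog platform auto-links bare domains anyway is something to check separately.

```python
import re


def as_plain_text(url: str) -> str:
    """Strip the scheme (http://, https://, ...) so the address reads as plain text."""
    return re.sub(r"^[a-z][a-z0-9+.-]*://", "", url.strip(), flags=re.IGNORECASE)


print(as_plain_text("http://www.hedgehogs.net/pg/newsfeeds"))
```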

    Dan Duncan’s suggestions are excellent, by the way.

  10. Fractal

    The Digital Millennium Copyright Act (Pub. L. 105-304, Oct. 28, 1998, 112 Stat. 2860, enrolled as H.R. 2281) includes important protections for domain hosts (“service providers”), including this (apologies again for any formatting errors):

    “SEC. 202. LIMITATIONS ON LIABILITY FOR COPYRIGHT INFRINGEMENT.

    (a) IN GENERAL- Chapter 5 of title 17, United States Code, is amended by adding after section 511 the following new section:

    `Sec. 512. Limitations on liability relating to material online

    `(a) TRANSITORY DIGITAL NETWORK COMMUNICATIONS- A service provider shall not be liable for monetary relief, or, except as provided in subsection (j), for injunctive or other equitable relief, for infringement of copyright by reason of the provider’s transmitting, routing, or providing connections for, material through a system or network controlled or operated by or for the service provider, or by reason of the intermediate and transient storage of that material in the course of such transmitting, routing, or providing connections, if–

    `(1) the transmission of the material was initiated by or at the direction of a person other than the service provider;

    `(2) the transmission, routing, provision of connections, or storage is carried out through an automatic technical process without selection of the material by the service provider;

    `(3) the service provider does not select the recipients of the material except as an automatic response to the request of another person;

    `(4) no copy of the material made by the service provider in the course of such intermediate or transient storage is maintained on the system or network in a manner ordinarily accessible to anyone other than anticipated recipients ************

    ***********************

    `(c) INFORMATION RESIDING ON SYSTEMS OR NETWORKS AT DIRECTION OF USERS-

    `(1) IN GENERAL- A service provider shall not be liable for monetary relief, or, except as provided in subsection (j), for injunctive or other equitable relief, for infringement of copyright by reason of the storage at the direction of a user of material that resides on a system or network controlled or operated by or for the service provider, if the service provider–

    `(A)(i) does not have actual knowledge that the material or an activity using the material on the system or network is infringing;

    `(ii) in the absence of such actual knowledge, is not aware of facts or circumstances from which infringing activity is apparent; or

    `(iii) upon obtaining such knowledge or awareness, acts expeditiously to remove, or disable access to, the material;

    `(B) does not receive a financial benefit directly attributable to the infringing activity, in a case in which the service provider has the right and ability to control such activity; and

    `(C) upon notification of claimed infringement as described in paragraph (3), responds expeditiously to remove, or disable access to, the material that is claimed to be infringing or to be the subject of infringing activity.

    ***********************

    `(3) ELEMENTS OF NOTIFICATION-

    `(A) To be effective under this subsection, a notification of claimed infringement must be a written communication provided to the designated agent of a service provider that includes substantially the following:

    `(i) A physical or electronic signature of a person authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.

    `(ii) Identification of the copyrighted work claimed to have been infringed, or, if multiple copyrighted works at a single online site are covered by a single notification, a representative list of such works at that site.

    `(iii) Identification of the material that is claimed to be infringing or to be the subject of infringing activity and that is to be removed or access to which is to be disabled, and information reasonably sufficient to permit the service provider to locate the material.

    `(iv) Information reasonably sufficient to permit the service provider to contact the complaining party, such as an address, telephone number, and, if available, an electronic mail address at which the complaining party may be contacted.

    `(v) A statement that the complaining party has a good faith belief that use of the material in the manner complained of is not authorized by the copyright owner, its agent, or the law.

    `(vi) A statement that the information in the notification is accurate, and under penalty of perjury, that the complaining party is authorized to act on behalf of the owner of an exclusive right that is allegedly infringed.

    `(B)(i) Subject to clause (ii), a notification from a copyright owner or from a person authorized to act on behalf of the copyright owner that fails to comply substantially with the provisions of subparagraph (A) shall not be considered under paragraph (1)(A) in determining whether a service provider has actual knowledge or is aware of facts or circumstances from which infringing activity is apparent.

    `(ii) In a case in which the notification that is provided to the service provider’s designated agent fails to comply substantially with all the provisions of subparagraph (A) but substantially complies with clauses (ii), (iii), and (iv) of subparagraph (A), clause (i) of this subparagraph applies only if the service provider promptly attempts to contact the person making the notification or takes other reasonable steps to assist in the receipt of notification that substantially complies with all the provisions of subparagraph (A).”
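For whoever volunteers to draft the notices, the six elements in paragraph (3)(A) above can be treated as a checklist. The sketch below is only that, a checklist in code: the class and field names are made up for illustration, and since the statute asks for "substantially" these elements, a non-empty field is a reminder, not proof that the notice is sufficient.

```python
from dataclasses import dataclass, fields


@dataclass
class TakedownNotice:
    """One field per element of Sec. 512(c)(3)(A); names are illustrative only."""
    signature: str             # (i) physical or electronic signature
    copyrighted_work: str      # (ii) the work claimed to have been infringed
    infringing_material: str   # (iii) what to remove and where to find it
    contact_info: str          # (iv) address, telephone number, e-mail
    good_faith_statement: str  # (v) good faith belief the use is not authorized
    accuracy_statement: str    # (vi) accuracy and authority, under penalty of perjury


def missing_elements(notice: TakedownNotice) -> list[str]:
    """List the fields that are still blank."""
    return [f.name for f in fields(notice) if not getattr(notice, f.name).strip()]


draft = TakedownNotice(
    signature="",
    copyrighted_work="URL of the original post",
    infringing_material="URL of the scraped copy",
    contact_info="name, address, e-mail",
    good_faith_statement="I have a good faith belief that ...",
    accuracy_statement="The information in this notification is accurate ...",
)
print(missing_elements(draft))
```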

  11. c.

    TechDirt rips off the site I run. Grrrr. You’ve really gotta get better at this guys.

    When you run a technology news website and the big companies eg Intel tell you “we don’t brief the NYT, WSJ, or any of those sites, including ones like CNET etc. we just brief the small guys like you because it’s not worth our time and you guys get the tech and they don’t” Man that feels like dirt.

    Most ISPs have a DMCA process available online as well as blogger etc. that do the simple scrapes. All that will happen is that they take the site down, site owner moves host and goes back up.

    Yves, don’t ever get forums, your intelligent comments in your forums will be spun into stories that you don’t have time for.

    Anyone who wants to do this for Yves will need to be her agent or authority, so she’ll need paperwork because the DMCA requires you to “attest” to the originality of the works being taken.

    I’d recommend a cleaner copyright notice on your site that limits the taking of works to a paragraph or less upon condition that they link your original and credit your author prominently in their works. That means that you’re setting standards in compliance with fair use but also making sure that you get credit and links back to your site if someone wants to read more. It allows for full commentary and discussion within the field you write.

    If you want a “former” lawyer’s thoughts on how I run my site to deal with this, feel free to email me directly.

  12. razzz

    Do your posts in .PDF format or convert them to JPG and post the pictures of the pages. Then you’ll see how much trouble someone will take to re-post your work somewhere else.

  13. ChrisPacific

    Prevention is better than cure. I think your main problem is probably the RSS feeds – it’s very easy to aggregate multiple RSS feeds and put them online with very little manual work involved.

    I suggest taking one or more of the following steps:

    1. Add explicit Terms and Conditions or a syndication policy for use of your RSS feeds – these seem to be absent right now. Most major sites that offer content feeds via RSS do this – see NY Times or Boston.com for example. At present you seem to be relying on standard copyright/fair use protection rather than explicit terms, which raises the possibility that people might unintentionally syndicate your content in a way that you’re not happy with. (Either that or they may lie and claim it was unintentional, which you want to make it as difficult as possible for them to do plausibly).

    2. Add your syndication T&Cs to the end of every post that is distributed over RSS. That way they either include the T&C entry on their site (making it really obvious they are stealing it) or they have to clip it manually, which is tedious to do and also makes the fair use violation crystal clear.

    3. Syndicate only part of each post over RSS and include a ‘More’ link that brings readers back to NC if they want to read the whole article.

    4. Remove the RSS feed from your site and require everyone to read your content at NC.

    That’s roughly in order of severity and increasing impact to regular RSS readers. Numbers 3 and 4 might be an overreaction at this stage but #1 and possibly #2 would have minimal impact on your regular readers, and would probably help quite a bit.
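Options 2 and 3 can be combined in whatever builds the feed. The following is a rough sketch, not a plugin for any particular blogging platform: the function name `feed_entry`, the 100-word cutoff, and the default notice wording are all made up for illustration.

```python
def feed_entry(body: str, permalink: str, max_words: int = 100,
               notice: str = "Syndicated from the original site; see its terms for reuse.") -> str:
    """Return the text to publish in the feed for one post."""
    words = body.split()
    excerpt = " ".join(words[:max_words])
    if len(words) > max_words:
        # Option 3: cut the post short and point readers back to the full article.
        excerpt += f" ... More: {permalink}"
    # Option 2: every feed item ends with the syndication notice.
    return f"{excerpt}\n\n{notice}"


print(feed_entry("one two three four five", "http://example.com/post", max_words=3))
```

Setting `max_words` very high gives option 2 on its own (full posts plus the notice), which leaves feed readers with the complete text.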

    1. anon

      This would be very annoying to those of us who read NC via an RSS feed (for example, I use Google Reader to read new NC posts). If I had to physically go to every website to read content, I’d never make it to most of them–NC included.

  14. ChrisPacific

    Agreed, which is why I listed it as the last resort (the nuclear option, if you will). But I think #1 and possibly #2 would have minimal impact as far as people like you were concerned.

  15. Eric Cohen

    Be very careful!

    As I understand it, if you submit DMCA takedown notices upon which an ISP or search engine provider acts, YOU as copyright holder may be held responsible for the financial consequences and any associated damages if the party involved can demonstrate that they are protected by the various “safe harbour” provisions that apply under Section 512 of the Digital Millennium Copyright Act:

    Quote
    There are four major categories of network systems offered by service providers that qualify for protection under the safe harbor provisions:

    -Conduit Communications include the transmission and routing of information, such as an email or Internet service provider, which store the material only temporarily on their networks. [Sec. 512(a)]

    -System Caching refers to the temporary copies of data that are made by service providers in providing the various services that require such copying in order to transfer data. [Sec. 512(b)]

    -Storage Systems refers to services which allow users to store information on their networks, such as a web hosting service or a chat room. [Sec. 512(c)]

    -Information Location Tools refer to services such as search engines, directories, or pages of recommended web sites which provide links to the allegedly infringing material. [Sec. 512(d)]

    “If it is determined that the copyright holder misrepresented its claim regarding the infringing material, the copyright holder then becomes liable to the OSP for any damages that resulted from the improper removal of the material. [512(f)]”
    UNQUOTE

    If you for example caused a site to be taken down by sending a DMCA takedown notice to its ISP, or its search engine visibility to be compromised by causing material to be removed from a search listing by submitting a takedown notice to a search engine, and the operator can establish that they themselves enjoyed the appropriate safe harbour provisions, the material was not infringing, or the claim was incorrectly submitted, and brings a counterclaim for damages, you as copyright holder could be deemed liable, since the ISP/OSP itself is sheltered by the same safe harbour provisions.
