Bojago Philip James's Blog

25 Jan 2011 · 44 comments

An Apology to CellarTracker

Thank you to those who gave me, and the company, the time to investigate the claims made. It’s very easy in the digital age to simply hit “retweet” and join the virtual mob, and to those who hung back to wait for the truth, thank you. Separate from the public discussion being played out over Twitter, there’s been a parallel discussion between Rich Tomko, CEO of Snooth, and Eric LeVine, Founder of CellarTracker, as we all work to understand the genesis of the issue.

To be clear, we do not scrape data, and we do not steal data. Typically, data on Snooth is sent to us, uploaded via our Merchant Hub, or crawled in accordance with web standards (we honor robots.txt, which is how Google and other search engines work).

I think CellarTracker provides a great service to serious wine collectors out there. I know this because members of the team here continue to rave to me about how awesome it is. It was because I knew this that, way back in April 2007 when Snooth first launched, I negotiated an agreement with CellarTracker to feature reviews from their site (announced here). The relationship evolved over the ensuing six months, and ultimately, in October 2007, the partnership ended and we were asked to remove “user reviews” -- which we did, in their entirety.

At the end of October 2007, CellarTracker wrote to us asking us to remove not only the reviews (which had been done), but to take down other pieces of content on our site. This is where everything gets complicated, so bear with me as I explain how Snooth is built.

Snooth is an aggregator -- we collect data from around the web. We’re more than that today, with a thriving social community and a strong editorial component, but back then when there were just two of us and we worked off my co-founder's kitchen table in Brooklyn, to make our efforts stretch, we decided to go build a “Google meets Expedia” for wine.

Today we aggregate information from over 10,000 sources, with thousands of these sources actually sending us information in the form of Excel and csv files, or data feeds. All of this is managed via our Merchant Hub, which is an incredibly sophisticated tool able to recognize and fix wine data errors on the fly. It’s something we are very proud of, and continuously working to improve. In addition to the Hub, we have the Snooth Crawler. This is essentially a robot that behaves like a mini-Google, wandering around the web looking for wine content to add to the site. We currently monitor more than 50 million pages focused on wine, and we check in every few days to see if there’s any new information on each page. The Snooth Crawler “behaves” well: We don’t visit sites too frequently, and we leave a signature behind so that webmasters can see that it was us calling, and if they want to stop us from coming, it’s simple to request that and our Robot will not visit that site again.
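
As a rough illustration of what “behaving well” means for a crawler, here is a minimal Python sketch of a robots.txt check. It is not Snooth's actual code: the robots.txt content, the example URLs, and the `may_fetch` helper are all hypothetical, and it simply uses Python's standard `urllib.robotparser` module to show how a crawler that identifies itself by name can be turned away by a site owner.

```python
from urllib.robotparser import RobotFileParser

# Hypothetical robots.txt a site owner might publish to turn one crawler away
# while leaving most of the site open to everyone else.
ROBOTS_TXT = """\
User-agent: Snoothbot
Disallow: /

User-agent: *
Disallow: /private/
"""


def may_fetch(robots_txt: str, user_agent: str, url: str) -> bool:
    """Return True if robots_txt permits user_agent to fetch url."""
    parser = RobotFileParser()
    parser.parse(robots_txt.splitlines())
    return parser.can_fetch(user_agent, url)


if __name__ == "__main__":
    # The named crawler is blocked from the whole site; other agents are not.
    print(may_fetch(ROBOTS_TXT, "Snoothbot", "http://example.com/wine/1"))
    print(may_fetch(ROBOTS_TXT, "OtherBot", "http://example.com/wine/1"))
```

A real crawler would download each site's robots.txt before requesting pages and would send its agent name with every request, so that the “signature” shows up in the webmaster's logs.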

To date, no site has ever asked to be removed from our index. It sometimes takes some explaining (“What are you doing with my data?”), but site owners mostly recognize the value of what we’re doing: namely, raising the visibility of a winery’s products and driving sales for both wineries and retailers. It’s the wine equivalent of Google, Expedia, or CitySearch.

Back to CellarTracker. We realized yesterday that some of our wine descriptors, namely the user tags (not reviews or information that is, or could be, copyright-protected) that are automatically extracted from user reviews, were still being calculated using CellarTracker information as one of the thousands of data sources. This data was being pulled via the original CellarTracker XML feed, which was set up under the 2007 agreement. Once we discovered that this data was still contributing to the compilation of these tags, and having been informed by CellarTracker of their position on this issue, we immediately switched the feed off and began the process of removing the information from the site. The process will take several days, as we have to recalculate the wine tags across more than 3 million wines.
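
To give a sense of why switching off a feed is not enough and the tags have to be recalculated, here is a small Python sketch. It is purely illustrative and assumes a much simpler scheme than the real one: the source names, the `(source, tag)` records, and the `compute_tags` function are all hypothetical. Each wine's tags are taken to be the most frequently mentioned words across all sources, so dropping a source changes the counts and the result has to be rebuilt.

```python
from collections import Counter


def compute_tags(tag_mentions, excluded_sources=frozenset(), top_n=2):
    """Return the top_n most-mentioned tags, ignoring excluded sources.

    tag_mentions is an iterable of (source, tag) pairs. Ties are broken
    alphabetically so the result is deterministic.
    """
    counts = Counter(
        tag for source, tag in tag_mentions if source not in excluded_sources
    )
    ranked = sorted(counts.items(), key=lambda item: (-item[1], item[0]))
    return [tag for tag, _ in ranked[:top_n]]


# Hypothetical tag mentions for a single wine, keyed by where they came from.
MENTIONS = [
    ("cellartracker", "cherry"),
    ("cellartracker", "cherry"),
    ("cellartracker", "oak"),
    ("merchant_a", "cherry"),
    ("merchant_a", "plum"),
    ("merchant_b", "plum"),
    ("merchant_b", "plum"),
    ("merchant_b", "spice"),
]

if __name__ == "__main__":
    # Tags computed from every source, then again with one source removed.
    print(compute_tags(MENTIONS))
    print(compute_tags(MENTIONS, excluded_sources={"cellartracker"}))
```

Repeating a calculation like this for every wine is what makes the cleanup take days rather than minutes.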

Snooth’s code base, the guts of the code that make the site live and breathe, has been written by tens of people over four years, involving more than 10,000 individual code submissions and over 1 million lines of code. Amongst that spaghetti of code, this routine slipped through. Specifically, the code that pulled the CellarTracker feed was not switched off when the agreement between us ended. Although no reviews went on the site, the wine tags were still being calculated with CellarTracker as one of the sources.

For that I’m sorry. To Eric LeVine, who single-handedly has built CellarTracker into the best tool for the serious wine collector, and to any CellarTracker users who feel their data was misused, my apologies.

Again, thank you to those who gave us the time to understand the situation and to respond appropriately.

Comments (44) Trackbacks (13)
  1. Thanks for explaining–however, it is a simple matter, and a standard methodology for web crawlers, RSS users, and web scrapers to include a back-link to the source; that would be both courteous and professional and it is the standard practice, as with quoting in print to include a citation.

    To do otherwise leaves you open for a DMCA complaint.

  2. I really appreciate the transparent response. As a former product strategist for CNN.com, I understand the complexities of data and information. Leveraging data (aka scraping) is common. In fact, most sites do some sort of this…google has built their business on this. One question…why does Paul Mabray hate you so much?

  3. Great to see a response from you guys. Keep up the good work

  4. Great post with a sincere apology. I use both services (Snooth & CellarTracker) and find them both indispensable for a satisfying online wine experience.

  5. My reviews from CT are still featured on your snooth site!

  6. Lisa – the reviews did feature attribution when they were present on the site.

  7. This is much appreciated…thanks Philip.

  8. CW – your reviews would only be on the site if you uploaded them, the tags are different, and, as per the post are being recalculated/removed as soon as the server reindexes the pages.

    Can you point me to a link showing where the reviews are? They shouldn’t be there. Thanks.

  9. Here is just one example.

    http://www.snooth.com/wine/ayres-vineyard-pinot-blanc-2009/?t=reviews&r=2655127

    While I greatly appreciate the apology, I am reserving judgment until I see how thoroughly you guys “undo” this. Also I have a series of emails with a Snooth representative (AL) from 2008 that actually conflict with a number of the assertions made.

  10. Thanks for a great, honest, and transparent explanation. The animosity is amazing, and unbelievable. For someone to blame, and continue to draw attention to their brand, while not acknowledging your side of the story is unprofessional, to say the least.

  11. Thanks for the good response and explanation. In an already confusing industry it’s nice to see some transparency.

  12. Eric – Thanks for posting the link. We’re looking into this now. It’s from our crawler (SnoothBot); I don’t think it was ever blocked from your site (robots.txt). We’re manually removing your site now. Reindexing next…

    * http.agent.name = Snoothbot
    * http.agent.url = http://www.snooth.com/bot/

    Our site is big, none of us here knew that CT was on the crawlers list.

  13. You don’t scrape data… but you have a bot that crawls website gathering data. Sigh… It might help if you understood what you were doing. There’s zero functional difference between scraping data off the generated HTML and crawling the page.

    Taking data without asking, and then deciding to stop only if you’re caught and blocked via robots.txt, isn’t OK. It’s not good net citizenship. Oh, I suppose that if you grabbed an excerpt of an article and displayed that with attribution and a link, that might be defensible, but the wholesale hoovering up of content that other people have created without so much as a request is reprehensible. If you’d like to aggregate data, approach the people generating it and get their permission.

  14. Jeff,
    I do not hate Philip James at all. In fact I complimented their organization on many fronts in my posts. However, in this instance, I am just doing my job at VinTank. We are the leading digital think tank for the wine industry. It is our job to elevate the awareness and understanding of technology as it relates to the wine industry. This includes telling what technology is relevant, how it relates to the goals and aspirations of wineries, and creating transparency in wine technology companies’ activities, statements and agendas.

  15. Rick – there’s a major difference, scrapers are written per site, our crawler crawls over 50M pages every 48 hours, we don’t monitor the sites it covers. We also give attribution – in fact Snooth and CT had an agreement in place for 6 months back in 2007 where we, with permission, hosted reviews from their site and gave full attribution.

    I think you’re mixing things up – what Paul wrote about was not generated by the crawler, but an xml feed which we polled. Paul was also not writing about reviews, but key words extracted from sites we polled data from.

    The key point here is that Snooth is working with Eric to fix this.

  16. What a load of steaming poop. Sorry, but when you say “we don’t scrape, we crawl,” you’re really saying “yes, we scrape.” As for the idea that you can steal people’s content for your own profit, but if they happen to catch you they can ask you to stop, well, that’s just not how the law works. When you steal copyrighted material for your own profit, you’re breaking the rules. It is YOUR duty to NOT steal, not the victim’s duty to catch you.

    Here’s an idea. Create original content. Don’t steal other people’s intellectual property. I don’t care if you call it “scrape,” “crawl,” or anything else. If you do not have PRIOR permission to use it, and if you use it for your own profit, you are in violation of Section 107 of the Copyright Act of 1976. That statute, by the way, has statutory monetary penalties.

    We will be reviewing Snooth’s use of Palate Press and the more than 100 members of The Palate Press Advertising Network. I suggest everybody producing original wine content do the same. We are entitled to our own intellectual property. Nobody else is entitled to profit from it, and nobody else is entitled to steal it until we notice.

  17. Philip,

    You mean to say that you led a small company for three years (whose business is an online product), and you had no idea how the tags were generated? Given the relative size of your business that seems almost preposterous.

    Put another way, if you oversaw the same web site within the confines of a larger corporation, it is doubtful that your rationale and apology would hold any water with your Sr. Mgmt., who would hold you very accountable for knowing exactly where you were getting the information in your “aggregation” in an Expedia meets wine kind of way.

  18. David – CellarTracker and Snooth had an agreement in place originally, Eric and I agreed a process by which we’d take data from their XML feed and host it on our site with attribution.

    Jeff – I know exactly “how” the tags were generated, I worked with the engineering team to write the algorithm. What none of us was aware of was that, of thousands of different sources that come to us from many different channels, there was a lone process that should have been switched off.

    I’m not sure what the sentiment towards Google or wine sites like Wine Searcher or other is, but crawling the web and creating directories is, in my opinion, a useful exercise and provides a useful tool to consumers. The majority of the thousands of sources we have are more than happy to be a part of our site, in fact, several thousand actually submit data to us directly.

  19. If sites submit data to you directly, they choose to give you their intellectual property. However, your explanation above was that many sites do not give you that permission, you use it anyway, and it is the site’s obligation to discover your perfidy and demand its end. Fortunately for all who care about their own property, including their own intellectual property, that is not how it works.

    I’m sorry to sound so strident, but it appears the business model is to build your product upon the work of others, without at least some of the others’ permission. I find that quite disturbing.

  20. The pure fact that when your agreement with CT ended, you didn’t remove them from your crawl list, which you knew contained CT…. You also didn’t stop pulling data from the XML feed, again, which you knew about, makes all of your responses suspect, in my opinion….

    Sure the aggregation of data may go on post the agreement with CT ending for a couple days/weeks, but for 3+ years….

  21. I’m still waiting for at least one of my tasting notes converted to tag format to be pulled down from Snooth. I often post my notes to a blog from CellarTracker, and if others do the same and Snooth is scraping their blogs, Snooth may have a much larger problem to fix than just scraping the CT feed.

    I’m all for sharing info, but if someone is monetizing *my* intellectual property without asking permission and not granting me a piece of the profit, I have a serious problem with that.

  22. Thanks for the response Philip. Am also a user of both services. I chose to wait to let the drama and rabble rousing pass, as most barely read then Re-Tweet; especially when it was written with what smacked to me of vehemence and self righteousness, and ended with an arrow pointing to the truer genesis. (The first article.)

  23. Jeez louiz lighten up people – data should be free! Who cares if it gets scraped/crawled/vacuumed by a bot?
    Some people are just looking to get themselves excited I guess.

  24. Pete,

    What an odd comment. Data should be free? Is your position truly that other people’s data should be free for Snooth to use for its own profit? Really?

  25. Pete, the issue is that Snooth monetizes the information housed on its own website presumably by selling ad space, making apps for smart phones, etc. By your reasoning–data should be free–Snooth should not make any revenue by aggregating data. Why bother if they can’t make money doing what they do?

    Snooth has every right to run a profitable business by organizing information in a useful manner. But they do not have absolute freedom to publish information as their own. That’s why copyrights exists. Fair Use–quoting or paraphrasing and linking–is fine. Claiming content as original is not.

  26. Greg, Actually, the idea you meet “fair use” by using a short quote is a bit of a myth. “Fair use” is determined by a court using multiple considerations. Length of quote is only one. The more important is the purpose. “Fair use” is not a stand-alone phrase. It actually refers to “Fair use … for purposes such as criticism, comment, news reporting, teaching (including multiple copies for classroom use), scholarship, or research,” and does NOT include for commercial purposes. In other words, just using a snippet might not be fair use if for commercial, rather than journalistic or scholarly purposes. In Snooth’s case, there is very little argument that it is for anything other than pure commercial purpose.

  28. post not working in firefox

  29. Folks, Philip and his team really are working very hard to “undo” all of this. Let’s leave it at that for now.

    Thanks,
    -Eric LeVine

  30. I think everyone has lost their grip on reality. IF you are in the wine industry (which I am…I own a small retailer in NY), you should promote anything that would help drive wine information and ultimately, wine sales. I have always looked at Snooth as one of the best vehicles for consumers to learn about wine and ultimately, buy more of it. Clearly, CellarTracker and Snooth need to work something out (in private). Still disgusted by the public attack that was made without knowing any of the facts. Poor form on Vintank.

  32. Eric,

    I respect your point of view. That said, I hope Philip and his team are working hard to undo EVERYTHING they have done wrong, ALL the intellectual property they have stolen, not just what they got caught doing. And just in case I’m not being sufficiently clear, that means I want to know if they have been stealing any of the intellectual property of any of the 100+ websites in The Palate Press Advertising Network, and I don’t want to have to dig to figure that out. I suspect other websites have the same question they want answered.

  34. Jeff Del Vino – we did know the facts. We demonstrated them clearly on our site which has stimulated the apology and changes in the Snooth site. According to many of Eric’s comments, this issue had been mentioned by Eric in private multiple times. There was no “public attack” only a revelation of evidence that pointed to data from CellarTracker appearing on Snooth.

    We look forward to Snooth’s efforts to clean this “snafu” up. I know they are working hard and we are looking forward to the final resolution.

  35. Philip, looks like your cleanup has been fantastic, at least concerning the tags that appear to have been borrowed from my notes. Very impressed and very much appreciate the directness in resolving the issue.

    David, point taken on Fair Use as far as commercial use. Seems like retailers still use WA, WS and WE notes based on this principle, though. It does not seem to fit the technical definition of Fair Use, however.

    Jeff, would you work for another wine shop free of charge? I doubt it. But isn’t your goal identical, to sell wine? Then why aren’t you working free of charge for your colleague (or perhaps competitor)?

  37. Snooth shouldn’t have taken this material. Period.

    OTOH many of the comments here turn copyrights on their head. The idea behind copyrights was to create a richer public domain while ensuring that authors had limited time periods to recoup a return if indeed they wanted one. Initially copyrights were much more limited in term and required the author to at least mark the material as copyrighted. Now, thanks to Disney and others, every time Mickey Mouse’s copyright comes up, it’s extended, to the point that it is now an endless right, and other legislation has added huge statutory damages while removing even the need to notify a potential user that the material is copyrighted. Result: a vastly reduced public domain and restricted creativity. My concern is that anything anyone creates becomes a potential hidden landmine for others: an unmarked, endless potential for huge statutory damages.

  38. I would take this as a simple case of the oops, if it weren’t for *other* repeated uses of data without author permission. Images, for example, *can* be copyrighted, but Snooth regularly “borrows” without attribution and, according to several of the photographers, is fairly slow to respond to non-formal attempts to communicate on this issue.

  39. It’s really not as straightforward as some comment, as many sites scrape or crawl however you define it. Gee, Google comes to mind.

    What is clear is that Snooth has or had some information they shouldn’t have had, and they seem to be working on fixing it. I don’t know all the background, but am a long time CT user

  40. Scanning the web using robots.txt does not mean YOU can then use what you find anyway you want. Copyright still applies. You CANNOT ‘aggregate’ simply because you find something. Please consult an attorney skilled in this specialty. A commenter here, including me, may or may not know what YOU need to know.

  41. One of the things not being discussed is the fact that CT has no true copyright to the wines in their database. The “facts” of the wine (vintage, producer, AVA) I believe are considered “information” and under Feist v. Rural [ http://en.wikipedia.org/wiki/Feist_v._Rural ] would not meet the threshold for copyright. The individual reviews submitted by users are another story. Who owns the copyright? The user? CT? A combination of both? Can you copyright a word? A tag? A tweet? If Snooth tags a Pinot Noir with “Cherry” or “Wood” from the CT database, how could that be considered a copyright issue? No matter what you believe about this scenario, copyright requires one thing: originality. The fact is that Pinot almost always has cherry aromas, Merlot consistently smells of blue fruit, and American Oak has the unmistakable odor of toasted coconut. Telling me something different is original. Most of the reviews on CT are non-original; regurgitated tasting notes that all kinda look, sound and feel the same.

    Here’s a scenario… Will CT disqualify a review on their own site because it too closely resembles another review on their site? If not, it begs the question: does CT take copyright issues seriously?

