license for a web corpus

Dec 20, 2012 at 11:56

Dear Metashare helpdesk team,

 

I have a question regarding a web corpus.

The data was crawled from German and Russian web pages using a BootCaT-style crawler. Each text in the corpus comes with an XML header indicating the web page, the crawling date, the seeds used, etc. In the gold data, I provide the (rather small) subcorpora in two versions: one with all the headers, and one without headers for easier processing. Consequently, each text element in the data can be traced back to a given web page. Basically, the data was collected from freely available sources, but no agreement was entered into with the owners of the web pages. Some of the texts in the corpus were converted by the crawler from PDF or Word documents that are downloadable from the web. I wonder whether the documentation provided by the headers gives enough credit to the owners. I also wonder whether I need my annotators (who created the gold data) to transfer all rights in their annotations to me; if I have not asked them to do so already, should I ask them before I publish the data? Also, I would like to ask Tilde to enter into a publication agreement with me, since my data is uploaded on their Metashare node.

Discussion 5 answers

  • Answer by Anne-Kathrin Schumann on Jan 09, 2013 at 11:28

    Rereading the text, it seems my question is unclear. Here it is in plain language: I do not know which license to choose for the data described above in order to publish it on Metashare.

  • Answer by Khalid Choukri, on behalf of the META-SHARE legal team on Jan 17, 2013 at 15:21

    Dear Anne-Kathrin,

    While browsing my emails on the helpdesk list, I realized that we may have left your question unanswered. So here are some brief comments.

     

    The case is very common, as you can guess. (ELDA is preparing a report on this issue in the framework of the PANACEA project; it will be made public soon, see www.panacea-LR.eu.)

    A) Data sources:

    1) Giving credit to the owners is not enough; you really need to get their consent.

    2) What is the risk if you did not get their consent? If the sources are R&D labs, blogs, forums, public institutions, etc., the risk is low (even none for the public sector). If they are major newspapers with a record of suing whoever harvests or crawls their data, then it is up to you to decide whether to withdraw such sources.

     

    3) In all cases, you need to add a statement/disclaimer saying that you downloaded the data from sites that did not carry any statement prohibiting this (I assume you did not break any password and did not see any sentence like "Please do not harvest or crawl this site", etc.) and that you are ready to delete any part if its rights owner asks you to.

     

    B) Rights of the annotators (who created the gold data)

    This depends on their contract:

    · If they had a contract with you (or your institution) and were hired and paid to do this work, then it is work for hire and it is yours, unless their contracts state otherwise.

    · If they did not have contracts, or if the contracts did not cover this work, then it is better to have them transfer ownership to you.

    Sorry again for being so late. I hope this helps.

  • Answer by Prodromos Tsiavos on Jan 30, 2013 at 16:36

    Dear Anne-Kathrin,

     

    Just a brief addition to Khalid's answer, with which I fully agree.

    Once you have secured all necessary permissions (i.e. with an implied licence, such as the one obtained in scenario A that Khalid describes) and have put in place the disclaimer and the notice-and-takedown procedure suggested by Khalid, you may use whichever licence you deem most appropriate.

    Hope this helps.

     

    With best wishes,

    pRo
