Extending DICE
Adding your own web scraper
Adding your own web scraper can range from easy to quite difficult, depending on the complexity of the resource from which you are acquiring documents. While this guide is meant to serve a broad range of purposes, please make sure that you are not breaking any laws while scraping. Many sites prohibit scraping outright, and others may require express permission. The authors of DICE encourage you to use discretion and good judgement.
Wikipedia is provided as an example. Example code can be found in <DICE>/database-specific/wikipedia.
To get access to DICE library functions, add an import statement like the following:
import os
import sys
sys.path.insert(0, "%s/../../lib" % os.path.dirname(sys.argv[0]))
import dice
DICE module ~ common functions
dice.associate_document_with_concept(concept, term, document_identifier)
Writes the given concept, term, and document_identifier to a CSV file so that the extract step can track which concepts and terms introduced a document into the dataset.
A document_identifier can appear multiple times in the pairing document.
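As a rough illustration of the bookkeeping described above, the following standalone sketch appends one pairing row per call to a CSV file. It does not use DICE itself: the function name `append_pairing`, the explicit `path` argument, and the column order are assumptions made for this example, not the library's actual implementation.

```python
import csv


def append_pairing(path, concept, term, document_identifier):
    """Append one (concept, term, document_identifier) row to a CSV file.

    Hypothetical stand-in for the behaviour described above; the column
    order and the explicit path argument are assumptions.
    """
    # newline="" lets the csv module control line endings itself.
    with open(path, "a", newline="") as handle:
        csv.writer(handle).writerow([concept, term, document_identifier])
```

Because rows are only ever appended, calling the function twice with the same document identifier produces two rows, which matches the note that an identifier can appear multiple times.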
dice.check_pairing_doc_exists()
Provides a way to error out if the pairing document is not present.
Returns: True if present, False otherwise
dice.gen_document_terms_map(database_prefix='-')
Generates a map of terms associated with each document.
Params: database_prefix: e.g. "wikipedia-"
This function will look through _DICE_CONCEPT_TERM_DOCUMENT_PAIRINGS to find entries relevant to the database given by database_prefix.
Returns: document_terms: a dictionary
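To make the idea of a document-to-terms map concrete, here is a minimal standalone sketch that reads a pairing CSV and groups terms by document. It is not the DICE implementation: the function name `build_document_terms_map`, the three-column layout (concept, term, document_identifier), the assumption that relevant document identifiers start with the database prefix, and the use of sets for the values are all guesses made for illustration.

```python
import csv
from collections import defaultdict


def build_document_terms_map(pairing_path, database_prefix="-"):
    """Group terms by document identifier for one database.

    Hypothetical sketch: assumes rows of (concept, term, document_identifier)
    and that identifiers for a database begin with its prefix.
    """
    document_terms = defaultdict(set)
    with open(pairing_path, newline="") as handle:
        for row in csv.reader(handle):
            if len(row) != 3:
                continue  # skip blank or malformed rows
            _concept, term, document_identifier = row
            # Keep only entries that belong to the requested database.
            if document_identifier.startswith(database_prefix):
                document_terms[document_identifier].add(term)
    return dict(document_terms)
```

Using a set per document means that a term paired with the same document more than once is recorded only once in this sketch.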
dice.get_concept()
Return the concept for the current run
dice.get_default_delay()
Return how long to wait in between downloading articles
Returns: 1
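One way a scraper might apply such a delay is sketched below. This is a standalone example, not DICE code: `download_all`, the `fetch` callable, and the injectable `sleep` parameter are hypothetical names, and treating the delay value as seconds is an assumption based on how `time.sleep` interprets its argument.

```python
import time


def download_all(identifiers, fetch, delay=1, sleep=time.sleep):
    """Fetch each identifier in turn, pausing between downloads.

    Hypothetical sketch: `fetch` is any callable that downloads one
    document; `delay` is assumed to be in seconds.
    """
    results = {}
    for index, identifier in enumerate(identifiers):
        if index > 0:
            sleep(delay)  # wait before every download except the first
        results[identifier] = fetch(identifier)
    return results
```

Passing `sleep` as a parameter keeps the pause easy to replace in tests, so the loop can be exercised without actually waiting.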
dice.get_term()
Return the term for the current run
dice.start_NER_server(port=8080)
Calls a wrapper script to start up Stanford NER in the background
dice.url_escape_query(term)
If the given term contains spaces, wrap it in double quotes. Next, run the term through urllib2.quote.
The value returned is useful for submitting an escaped value to a form.
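The two steps described above can be sketched in a few lines. This is a reimplementation for illustration, not the DICE source, and it assumes Python 3, where the equivalent quoting function is `urllib.parse.quote` rather than `urllib2.quote`.

```python
from urllib.parse import quote


def url_escape_query(term):
    """Quote multi-word terms, then percent-encode the result.

    Illustrative sketch of the behaviour described above, written for
    Python 3; the real DICE function may differ in detail.
    """
    if " " in term:
        term = '"%s"' % term  # wrap phrases in double quotes
    return quote(term)


print(url_escape_query("aspirin"))       # aspirin
print(url_escape_query("heart attack"))  # %22heart%20attack%22
```

Wrapping a phrase in double quotes before encoding means a search form receives it as an exact phrase rather than as separate words.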