download_tables_from_dokuwiki_to_json

Downloads the main wiki page containing the links to all related pages and parses these into two json files with all the table data.

The script needs to be able to access dokuwiki.datapunt.amsterdam.nl.

Example command lines:

For applicaties on our internal dokuwiki:

python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-informatievoorzining h2 output informatievoorziening applicatie_gegevens

For gebruik on our internal dokuwiki:

python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-organsiaties h3 output Directie,Organisatie,Stadsdeel gebruik_basisregistraties

usage: download_tables_from_dokuwiki_to_json [-h]
                                             url cluster_headertype
                                             output_folder header_name_urls
                                             [header_name_urls ...] filename

Positional Arguments

url
    Full url of the main webpage containing the wiki url links to the subpages, in quotes. For example: "https://dokuwiki.datapunt.amsterdam.nl/doku.php?id=start:gebruik:systeem"
cluster_headertype
    Specify the header css style used to select the cluster titles: h2 for applicaties, h3 for gebruik.
output_folder
    Specify the desired output folder path, for example: output
header_name_urls
    Specify the name of the field where the wiki urls are defined. You can also use multiple fields if different column names are used on one page. For example: "informatievoorziening, directie"
filename
    Specify the desired filename, for example: clusters.json

Functions

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.create_dir_if_not_exists(directory)

Create the directory if it does not yet exist.

Args:
    directory: name of the directory, for example: dir/anotherdir
Returns:
    Creates the directory if it does not exist, or returns the error message.
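
A minimal sketch of such a helper, assuming the standard library os module is used (the actual implementation may differ):

    import os

    def create_dir_if_not_exists(directory):
        """Create the directory (including parents) if it does not yet exist."""
        try:
            os.makedirs(directory, exist_ok=True)
        except OSError as error:
            # Return the error message instead of raising, as described above
            return error
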
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getPage(url)

Get parsed text data from urls. Waits 1 second between attempts for slow networks and retries 5 times.
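
The wait-and-retry behaviour could look roughly like this; the use of requests here is an assumption, not necessarily what the module actually does:

    import time
    import requests

    def getPage(url, retries=5):
        """Fetch a url, waiting 1 second between attempts for slow networks."""
        for attempt in range(retries):
            try:
                response = requests.get(url)
                response.raise_for_status()
                return response.text
            except requests.RequestException:
                time.sleep(1)  # wait 1 second before the next of 5 attempts
        return None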

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getRows(url, headers, row)

Get all rows from the tables, add them into a dict and prefix the wiki urls with the host url.
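
A rough sketch of that row handling, assuming BeautifulSoup table rows and the standard urljoin helper (names and behaviour are illustrative, not the actual implementation):

    from urllib.parse import urljoin

    def getRows(url, headers, row):
        """Pair header names with cell values and prefix relative wiki links
        with the host url of the main page."""
        values = {}
        for header, cell in zip(headers, row.find_all('td')):
            link = cell.find('a')
            if link and link.get('href', '').startswith('/'):
                # Relative wiki link: prepend the host url
                values[header] = urljoin(url, link['href'])
            else:
                values[header] = cell.get_text(strip=True)
        return values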

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parseHtmlTable(url, html_doc, header_name_urls, cluster_headertype, table_headertype='h3')

Retrieve one html page and parse the tables and header names from it.

Args:

  • html_doc: wiki url
  • name: name of the page
  • headertype: h1, h2 or h3 type of the titles used above each table (h3 is not used)

Result:

  {table_title: h3 text, [{name: value}, ...]}
  If no name is specified: [{cluster: title of the page}, {name: value}, ...]
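
Put together, the parsing step could be sketched as follows, assuming BeautifulSoup is used to walk from each cluster title to the table below it (illustrative only):

    from bs4 import BeautifulSoup

    def parseHtmlTable(url, html_doc, header_name_urls, cluster_headertype, table_headertype='h3'):
        """Collect the rows of every table found below a cluster title."""
        # header_name_urls and table_headertype are not used in this simplified sketch
        soup = BeautifulSoup(html_doc, 'html.parser')
        results = []
        for title in soup.find_all(cluster_headertype):
            table = title.find_next('table')
            if table is None:
                continue
            headers = [th.get_text(strip=True) for th in table.find_all('th')]
            rows = [getRows(url, headers, tr) for tr in table.find_all('tr')[1:]]
            results.append({'table_title': title.get_text(strip=True), 'rows': rows})
        return results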
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parser()

Parser function to read arguments from the command line and to add a description to the sphinx docs. For possible styling options, see: https://pythonhosted.org/an_example_pypi_project/sphinx.html
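
The argument definitions behind the usage shown under Positional Arguments would look roughly like this with argparse (help texts shortened; exact wording is an assumption):

    import argparse

    def parser():
        """Define the command line arguments listed under Positional Arguments."""
        parser = argparse.ArgumentParser(
            description='Download dokuwiki tables from a main wiki page and its subpages into json files.')
        parser.add_argument('url', help='Full url of the main webpage, in quotes')
        parser.add_argument('cluster_headertype', help='h2 for applicaties, h3 for gebruik')
        parser.add_argument('output_folder', help='Desired output folder path, for example: output')
        parser.add_argument('header_name_urls', nargs='+', help='Name(s) of the field(s) containing the wiki urls')
        parser.add_argument('filename', help='Desired filename, for example: clusters.json')
        return parser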

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.saveFile(data, folder, name)

Save file as json and return the full path.
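
A minimal version of this save step, assuming the standard json module (the actual output formatting may differ):

    import json
    import os

    def saveFile(data, folder, name):
        """Write data to folder/name as json and return the full path."""
        create_dir_if_not_exists(folder)
        path = os.path.join(folder, name)
        with open(path, 'w') as output_file:
            json.dump(data, output_file, indent=2)
        return path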