download_tables_from_dokuwiki_to_json

Downloads the main wiki page containing the links to all related pages and parses these into two json files with all the table data.

The script needs to be able to access dokuwiki.datapunt.amsterdam.nl.

Example command lines:

For applicaties on our internal dokuwiki:

python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-informatievoorzining h2 output informatievoorziening applicatie_gegevens

For gebruik on our internal dokuwiki:

python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-organsiaties h3 output Directie,Organisatie,Stadsdeel gebruik_basisregistraties

usage: download_tables_from_dokuwiki_to_json [-h]
                                             url cluster_headertype
                                             output_folder header_name_urls
                                             [header_name_urls ...] filename

Positional Arguments

url
    Full url of the main webpage containing the wiki url links to the subpages, in quotes. For example: "https://dokuwiki.datapunt.amsterdam.nl/doku.php?id=start:gebruik:systeem"
cluster_headertype
    Specify the header css style used to select the cluster titles: h2 for applicaties, h3 for gebruik.
output_folder
    Specify the desired output folder path, for example: output
header_name_urls
    Specify the name of the field where the wiki urls are defined. You can also use multiple fields if different column names are used on one page. For example: "informatievoorziening, directie"
filename
    Specify the desired filename, for example: clusters.json

Functions

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.create_dir_if_not_exists(directory)

Create the directory if it does not yet exist.

Args:
    directory: name of the directory, for example: dir/anotherdir
Returns:
    Creates the directory if it does not exist, or returns the error message.
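
A minimal sketch of such a helper, assuming the standard library os module is used (the actual implementation may differ):

    import os

    def create_dir_if_not_exists(directory):
        """Create the directory (including parents) if it does not yet exist."""
        try:
            os.makedirs(directory, exist_ok=True)
        except OSError as error:
            # Return the error message instead of raising, as described above
            return error
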
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getPage(url)

Get parsed text data from urls. Waits 1 second between attempts for slow networks and retries 5 times.
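
The wait-and-retry behaviour could look roughly like this; the use of requests here is an assumption, not necessarily what the module actually does:

    import time
    import requests

    def getPage(url, retries=5):
        """Fetch a url, waiting 1 second between attempts for slow networks."""
        for attempt in range(retries):
            try:
                response = requests.get(url)
                response.raise_for_status()
                return response.text
            except requests.RequestException:
                time.sleep(1)  # wait 1 second before the next of 5 attempts
        return None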

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getRows(url, headers, row)

Get all rows from the tables, add them into a dict and prefix the wiki urls with the host url.
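
A rough sketch of that row handling, assuming BeautifulSoup table rows and the standard urljoin helper (names and behaviour are illustrative, not the actual implementation):

    from urllib.parse import urljoin

    def getRows(url, headers, row):
        """Pair header names with cell values and prefix relative wiki links
        with the host url of the main page."""
        values = {}
        for header, cell in zip(headers, row.find_all('td')):
            link = cell.find('a')
            if link and link.get('href', '').startswith('/'):
                # Relative wiki link: prepend the host url
                values[header] = urljoin(url, link['href'])
            else:
                values[header] = cell.get_text(strip=True)
        return values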

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parseHtmlTable(url, html_doc, header_name_urls, cluster_headertype, table_headertype='h3')

Retrieve one html page and parse the tables and header names from it.

Args:

  • html_doc: wiki url
  • name: name of the page
  • headertype: h1, h2 or h3 type of the titles used above each table (h3 is not used)

Result:

  {table_title: h3 text, [{name: value}, ...]}
  If no name is specified: [{cluster: title of the page}, {name: value}, ...]
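
Put together, the parsing step could be sketched as follows, assuming BeautifulSoup is used to walk from each cluster title to the table below it (illustrative only):

    from bs4 import BeautifulSoup

    def parseHtmlTable(url, html_doc, header_name_urls, cluster_headertype, table_headertype='h3'):
        """Collect the rows of every table found below a cluster title."""
        # header_name_urls and table_headertype are not used in this simplified sketch
        soup = BeautifulSoup(html_doc, 'html.parser')
        results = []
        for title in soup.find_all(cluster_headertype):
            table = title.find_next('table')
            if table is None:
                continue
            headers = [th.get_text(strip=True) for th in table.find_all('th')]
            rows = [getRows(url, headers, tr) for tr in table.find_all('tr')[1:]]
            results.append({'table_title': title.get_text(strip=True), 'rows': rows})
        return results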
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parser()

Parser function to read arguments from the command line and to add a description to the sphinx docs. For possible styling options, see: https://pythonhosted.org/an_example_pypi_project/sphinx.html
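
The argument definitions behind the usage shown under Positional Arguments would look roughly like this with argparse (help texts shortened; exact wording is an assumption):

    import argparse

    def parser():
        """Define the command line arguments listed under Positional Arguments."""
        parser = argparse.ArgumentParser(
            description='Download dokuwiki tables from a main wiki page and its subpages into json files.')
        parser.add_argument('url', help='Full url of the main webpage, in quotes')
        parser.add_argument('cluster_headertype', help='h2 for applicaties, h3 for gebruik')
        parser.add_argument('output_folder', help='Desired output folder path, for example: output')
        parser.add_argument('header_name_urls', nargs='+', help='Name(s) of the field(s) containing the wiki urls')
        parser.add_argument('filename', help='Desired filename, for example: clusters.json')
        return parser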

datapunt_processing.extract.download_tables_from_dokuwiki_to_json.saveFile(data, folder, name)

Save file as json and return the full path.
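
A minimal version of this save step, assuming the standard json module (the actual output formatting may differ):

    import json
    import os

    def saveFile(data, folder, name):
        """Write data to folder/name as json and return the full path."""
        create_dir_if_not_exists(folder)
        path = os.path.join(folder, name)
        with open(path, 'w') as output_file:
            json.dump(data, output_file, indent=2)
        return path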