download_tables_from_dokuwiki_to_json¶
Get the main wiki page containing all links to related pages and parse it into two json files with all the table data.
The script needs to be able to access dokuwiki.datapunt.amsterdam.nl.
Example command lines:
- For applicaties on our internal dokuwiki:
python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-informatievoorzining h2 output informatievoorziening applicatie_gegevens
- For gebruik on our internal dokuwiki:
python download_tables_from_dokuwiki_to_json.py https://dokuwiki.datapunt.amsterdam.nl/doku.php\?id\=start:gebruik:overzicht-organsiaties h3 output Directie,Organisatie,Stadsdeel gebruik_basisregistraties
usage: download_tables_from_dokuwiki_to_json [-h]
       url cluster_headertype output_folder
       header_name_urls [header_name_urls ...] filename
Positional Arguments¶
url |
cluster_headertype | Specify the header css style to select the cluster titles: h2 for applicaties, h3 for gebruik
output_folder | Specify the desired output folder path, for example: output
header_name_urls |
filename | Specify the desired filename, for example: clusters.json
Functions¶
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.create_dir_if_not_exists(directory)¶
Create the directory if it does not yet exist.
- Args:
  - directory: Specify the name of the directory, for example: dir/anotherdir
- Returns:
  - Creates the directory if it does not exist, or returns the error message.
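As a rough sketch (not the actual implementation), the behaviour described above can be written with the standard library alone:

```python
import os


def create_dir_if_not_exists(directory):
    """Create the directory (including parents) if it does not yet exist."""
    try:
        os.makedirs(directory, exist_ok=True)
    except OSError as error:
        # on failure, return the error message instead of raising
        return str(error)
```

On success the function returns None; on failure it returns the error message, matching the docstring above.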
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getPage(url)¶
Get parsed text data from urls. Wait 1 second for slow networks; retry 5 times.
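A minimal sketch of the fetch-and-retry behaviour described above, using only the standard library (the real script may well use a different HTTP client):

```python
import time
import urllib.request
from urllib.error import URLError


def getPage(url, retries=5, wait=1):
    """Return the page body as text; on failure wait and retry up to 5 times."""
    last_error = None
    for attempt in range(retries):
        try:
            with urllib.request.urlopen(url) as response:
                return response.read().decode("utf-8")
        except URLError as error:
            last_error = error
            time.sleep(wait)  # give slow networks a moment before retrying
    raise last_error
```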
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.getRows(url, headers, row)¶
Get all rows from the tables, add them into a dict and prepend the host url to relative wiki urls.
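The row extraction can be illustrated with a hypothetical helper (TableRowParser is not part of the script) that collects table cells into dicts keyed by the header names and turns relative wiki links into absolute ones:

```python
from html.parser import HTMLParser
from urllib.parse import urljoin


class TableRowParser(HTMLParser):
    """Collect <td> rows as dicts keyed by headers; link cells become absolute urls."""

    def __init__(self, base_url, headers):
        super().__init__()
        self.base_url = base_url
        self.headers = headers
        self.rows = []
        self._cells = None      # cell values while inside a <tr>
        self._in_cell = False
        self._text = ""
        self._link = None

    def handle_starttag(self, tag, attrs):
        if tag == "tr":
            self._cells = []
        elif tag == "td" and self._cells is not None:
            self._in_cell, self._text, self._link = True, "", None
        elif tag == "a" and self._in_cell:
            # prepend the host url to relative wiki urls
            self._link = urljoin(self.base_url, dict(attrs).get("href", ""))

    def handle_data(self, data):
        if self._in_cell:
            self._text += data

    def handle_endtag(self, tag):
        if tag == "td" and self._in_cell:
            self._cells.append(self._link or self._text.strip())
            self._in_cell = False
        elif tag == "tr" and self._cells is not None:
            if self._cells:
                self.rows.append(dict(zip(self.headers, self._cells)))
            self._cells = None
```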
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parseHtmlTable(url, html_doc, header_name_urls, cluster_headertype, table_headertype='h3')¶
Retrieve one html page and parse the tables and heading names from it.
- Args:
  - html_doc: wiki url
  - name: name of the page
  - headertype: h1, h2 or h3 type of the titles used above each table (h3 is not used)
- Result:
  - {table_title: h3 text, [{name: value}, ..]}; if no name is specified: [{cluster: title of the page}, {name: value}, …]
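To show how the cluster titles pair up with the tables that follow them, here is a simplified regex-based sketch (the actual script presumably uses a proper HTML parser):

```python
import re


def split_by_headers(html_doc, cluster_headertype="h2"):
    """Split a wiki page into (cluster title, html chunk) pairs on the given heading tag."""
    pattern = re.compile(
        r"<{0}[^>]*>(.*?)</{0}>".format(cluster_headertype), re.S
    )
    titles = pattern.findall(html_doc)
    # re.split with a capture group keeps the titles; the html after each
    # heading sits at the even indices starting from 2
    chunks = pattern.split(html_doc)[2::2]
    return list(zip(titles, chunks))
```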
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.parser()¶
Parser function to run arguments from the command line and to add a description to the Sphinx docs. For possible styling options, see: https://pythonhosted.org/an_example_pypi_project/sphinx.html
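An argparse sketch matching the usage block and the argument descriptions above (the help texts are taken from this page; the exact wording in the script may differ):

```python
import argparse


def parser():
    """Build the command-line parser shown in the usage block."""
    arg_parser = argparse.ArgumentParser(
        prog="download_tables_from_dokuwiki_to_json",
        description="Download dokuwiki tables and save them as json files.",
    )
    arg_parser.add_argument("url", help="url of the main wiki page")
    arg_parser.add_argument(
        "cluster_headertype",
        help="header css style of the cluster titles: h2 for applicaties, h3 for gebruik",
    )
    arg_parser.add_argument(
        "output_folder", help="desired output folder path, for example: output"
    )
    arg_parser.add_argument(
        "header_name_urls",
        nargs="+",
        help="header names, for example: Directie,Organisatie,Stadsdeel",
    )
    arg_parser.add_argument(
        "filename", help="desired filename, for example: clusters.json"
    )
    return arg_parser
```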
datapunt_processing.extract.download_tables_from_dokuwiki_to_json.saveFile(data, folder, name)¶
Save the file as json and return the full path.
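A minimal sketch of this save-and-return-path behaviour (assuming the folder already exists, e.g. created by create_dir_if_not_exists):

```python
import json
import os


def saveFile(data, folder, name):
    """Save data as json in the given folder and return the full path."""
    full_path = os.path.join(folder, name)
    with open(full_path, "w") as json_file:
        json.dump(data, json_file, indent=2)
    return full_path
```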