NAME App::CSVUtils - CLI utilities related to CSV VERSION This document describes version 1.022 of App::CSVUtils (from Perl distribution App-CSVUtils), released on 2023-03-10. DESCRIPTION This distribution contains the following CLI utilities: 1. csv-add-fields 2. csv-avg 3. csv-check-cell-values 4. csv-check-field-names 5. csv-check-field-values 6. csv-check-rows 7. csv-cmp 8. csv-concat 9. csv-convert-to-hash 10. csv-csv 11. csv-delete-fields 12. csv-dump 13. csv-each-row 14. csv-fill-template 15. csv-find-values 16. csv-freqtable 17. csv-gen 18. csv-get-cells 19. csv-grep 20. csv-info 21. csv-intrange 22. csv-list-field-names 23. csv-lookup-fields 24. csv-ltrim 25. csv-map 26. csv-munge-field 27. csv-munge-row 28. csv-pick 29. csv-pick-fields 30. csv-pick-rows 31. csv-quote 32. csv-replace-newline 33. csv-rtrim 34. csv-select-fields 35. csv-select-rows 36. csv-setop 37. csv-shuf 38. csv-shuf-fields 39. csv-shuf-rows 40. csv-sort 41. csv-sort-fields 42. csv-sort-rows 43. csv-sorted 44. csv-sorted-fields 45. csv-sorted-rows 46. csv-split 47. csv-sum 48. csv-transpose 49. csv-trim 50. csv-uniq 51. csv-unquote 52. csv2ltsv 53. csv2paras 54. csv2td 55. csv2tsv 56. csv2vcf 57. paras2csv 58. tsv2csv FUNCTIONS gen_csv_util Usage: gen_csv_util(%args) -> bool Generate a CSV utility. This routine is used to generate a CSV utility in the form of a Rinci function (code and metadata). You can then produce a CLI from the Rinci function simply using Perinci::CmdLine::Gen or, if you use Dist::Zilla, Dist::Zilla::Plugin::GenPericmdScript or, if on the command-line, gen-pericmd-script. Using this routine, by providing just one or a few hooks and setting some parameters like a couple of extra arguments, you will get a complete CLI with decent POD/manpage, ability to read one or multiple CSV's and write one or multiple CSV's, some command-line options to customize how the input CSV's should be parsed and how the output CSV's should be formatted and named. Your CLI also has tab completion, usage and help message, and other features. To create a CSV utility, you specify a "name" (e.g. "csv_dump"; must be a valid unqualified Perl identifier/function name) and optionally "summary", "description", and other metadata like "links" or even "add_meta_props". Then you specify one or more of "on_*" or "before_*" or "after_*" arguments to supply handlers (coderefs) for your CSV utility at various hook points. *THE HOOKS* All code for hooks should accept a single argument "r". "r" is a stash (hashref) of various data, the keys of which will depend on which hook point being called. You can also add more keys to store data or for flow control (see hook documentation below for more details). The order of the hooks, in processing chronological order: * on_begin Called when utility begins, before reading CSV. You can use this hook e.g. to process arguments, set output filenames (if you allow custom output filenames). * before_read_input Called before opening any input CSV file. This hook is *still* called even if your utility sets "reads_csv" to false. At this point, the "input_filenames" stash key (as well as other keys like "input_filename", "input_filenum", etc) has not been set. You can use this hook e.g. to set a custom "input_filenames". * before_open_input_files Called before an input CSV file is about to be opened, including for stdin ("-"). You can use this hook e.g. to check/preprocess input file. Flow control is available by setting "$r->{wants_skip_files}" to skip reading all the input file and go directly to the "after_read_input" hook. * before_open_input_file Called before an input CSV file is about to be opened, including for stdin ("-"). For the first file, called after "before_open_input_file" hook. You can use this hook e.g. to check/preprocess input file. Flow control is available by setting "$r->{wants_skip_file}" to skip reading a single input file and go to the next file, or "$r->{wants_skip_files}" to skip reading the rest of the files and go directly to the "after_read_input" hook. * on_input_header_row Called when receiving header row. Will be called for every input file, and called even when user specify "--no-input-header", in which case the header row will be the generated "["field1", "field2", ...]". You can use this hook e.g. to add/remove/rearrange fields. You can set "$r->{wants_fill_rows}" to a defined false if you do not want "$r->{input_rows}" to be filled with empty string elements when it contains less than the number of fields (in case of sparse values at the end). Normally you only want to do this when you want to do checking, e.g. in csv-check-rows. * on_input_data_row Called when receiving each data row. You can use this hook e.g. to modify the row or print output (for line-by-line transformation or filtering). * after_close_input_file Called after each input file is closed, including for stdin ("-") (although for stdin, the handle is not actually closed). Flow control is possible by setting "$r->{wants_skip_files}" to skip reading the rest of the files and go straight to the "after_close_input_files" hook. * after_close_input_files Called after the last input file is closed, after the last "after_close_input_file" hook, including for stdin ("-") (although for stdin, the handle is not actually closed). * after_read_input Called after the last row of the last CSV file is read and the last file is closed. This hook is *still* called, if you set "reads_csv" option to false. At this point the stash keys related to CSV reading have all been cleared, including "input_filenames", "input_filename", "input_fh", etc. You can use this hook e.g. to print output if you buffer the output. * on_end Called when utility is about to exit. You can use this hook e.g. to return the final result. *THE STASH* The common keys that "r" will contain: * "gen_args", hash. The arguments used to generate the CSV utility. * "util_args", hash. The arguments that your CSV utility accepts. Parsed from command-line arguments (or configuration files, or environment variables). * "name", str. The name of the CSV utility. Which can also be retrieved via "gen_args". * "code_print", coderef. Routine provided for you to print something. Accepts a string. Takes care of opening the output files for you. * "code_print_row", coderef. Routine provided for you to print a data row. You pass the row (either arrayref or hashref). Takes care of opening the output files for you, as well as printing header row the first time, if needed. * "code_print_header_row", coderef. Routine provided for you to print header row. You don't need to pass any arguments. Will only print the header row once per output file if output header is enabled, even if called multiple times. If you are accepting CSV data ("reads_csv" gen argument set to true), the following keys will also be available (in "on_input_header_row" and "on_input_data_row" hooks): * "input_parser", a Text::CSV_XS instance for input parsing. * "input_filenames", array of str. * "input_filename", str. The name of the current input file being read ("-" if reading from stdin). * "input_filenum", uint. The number of the current input file, 1 being the first file, 2 for the second, and so on. * "input_fh", the handle to the current file being read. * "input_rownum", uint. The number of rows that have been read (reset after each input file). In "on_input_header_row" phase, this will be 1 since header row (including the generated one) is the first row. Then in "on_input_data_row" phase (called the first time for a file), it will be 2 for the first data row, even if physically it is the first row for CSV file that does not have a header. * "input_data_rownum", uint. The number of data rows that have been read (reset after each input file). This will be equal to "input_rownum" less 1 if input file has header. * "input_row", aos (array of str). The current input CSV row as an arrayref. * "input_row_as_hashref", hos (hash of str). The current input CSV row as a hashref, with field names as hash keys and field values as hash values. This will only be calculated if utility wants it. Utility can express so by setting "$r->{wants_input_row_as_hashref}" to true, e.g. in the "on_begin" hook. * "input_header_row_count", uint. Contains the number of actual header rows that have been read. If CLI user specifies "--no-input-header", this will stay at zero. Will be reset for each CSV file. * "input_data_row_count", int. Contains the number of actual data rows that have read. Will be reset for each CSV file. If you are outputting CSV ("writes_csv" gen argument set to true), the following keys will be available: * "output_emitter", a Text::CSV_XS instance for output. * "output_filenames", array of str. * "output_filename", str, name of current output file. * "output_filenum", uint, the number of the current output file, 1 being the first file, 2 for the second, and so on. * "output_fh", handle to the current output file. * "output_rownum", uint. The number of rows that have been outputted (reset after each output file). * "output_data_rownum", uint. The number of data rows that have been outputted (reset after each output file). This will be equal to "input_rownum" less 1 if input file has header. For other hook-specific keys, see the documentation for associated hook point. *ACCEPTING ADDITIONAL COMMAND-LINE OPTIONS/ARGUMENTS* As mentioned above, you will get additional command-line options/arguments in "$r->{util_args}" hashref. Some options/arguments are already added by "gen_csv_util", e.g. "input_filename" or "input_filenames" along with "input_sep_char", etc (when your utility declares "reads_csv"), "output_filename" or "output_filenames" along with "overwrite", "output_sep_char", etc (when your utility declares "writes_csv"). If you want to accept additional arguments/options, you specify them in "add_args" (hashref, with key being Each option/argument has to be specified first via "add_args" (as hashref, with key being argument name and value the argument specification as defined in Rinci::function)). Some argument specifications have been defined in App::CSVUtils and can be used. See existing utilities for examples. *READING CSV DATA* To read CSV data, normally your utility would provide handler for the "on_input_data_row" hook and sometimes additionally "on_input_header_row". *OUTPUTTING STRING OR RETURNING RESULT* To output string, usually you call the provided routine "$r->{code_print}". This routine will open the output files for you. You can also return enveloped result directly by setting "$r->{result}". *OUTPUTTING CSV DATA* To output CSV data, usually you call the provided routine "$r->{code_print_row}". This routine accepts a row (arrayref or hashref). This routine will open the output files for you when needed, as well as print header row automatically. You can also buffer rows from input to e.g. "$r->{output_rows}", then call "$r->{code_print_row}" repeatedly in the "after_read_input" hook to print all the rows. *READING MULTIPLE CSV FILES* To read multiple CSV files, you first specify "reads_multiple_csv". Then, you can supply handler for "on_input_header_row" and "on_input_data_row" as usual. If you want to do something before/after each input file, you can also supply handler for "before_open_input_file" or "after_close_input_file". *WRITING TO MULTIPLE CSV FILES* Similarly, to write to many CSv files, you first specify "writes_multiple_csv". Then, you can supply handler for "on_input_header_row" and "on_input_data_row" as usual. To switch to the next file, set "$r->{wants_switch_to_next_output_file}" to true, in which case the next call to "$r->{code_print_row}" will close the current file and open the next file. *CHANGING THE OUTPUT FIELDS* When calling "$r->{code_print_row}", you can output whatever fields you want. By convention, you can set "$r->{output_fields}" and "$r->{output_fields_idx}" to let other handlers know about the output fields. For example, see the implementation of csv-concat. This function is not exported by default, but exportable. Arguments ('*' denotes required arguments): * add_args => *hash* (No description) * add_args_rels => *hash* (No description) * add_meta_props => *hash* Add additional Rinci function metadata properties. * after_close_input_file => *code* (No description) * after_close_input_files => *code* (No description) * after_read_input => *code* (No description) * before_open_input_file => *code* (No description) * before_open_input_files => *code* (No description) * before_read_input => *code* (No description) * description => *str* (No description) * examples => *array* (No description) * links => *array[hash]* (No description) * name* => *perl::identifier::unqualified_ascii* (No description) * on_begin => *code* (No description) * on_end => *code* (No description) * on_input_data_row => *code* (No description) * on_input_header_row => *code* (No description) * reads_csv => *bool* (default: 1) Whether utility reads CSV data. * reads_multiple_csv => *bool* Whether utility accepts CSV data. Setting this option to true will implicitly set the "reads_csv" option to true, obviously. * summary => *str* (No description) * writes_csv => *bool* (default: 1) Whether utility writes CSV data. * writes_multiple_csv => *bool* Whether utility outputs CSV data. Setting this option to true will implicitly set the "writes_csv" option to true, obviously. Return value: (bool) compile_eval_code Usage: $coderef = compile_eval_code($str, $label); Compile string code $str to coderef in 'main' package, without "use strict" or "use warnings". Die on compile error. eval_code Usage: $res = eval_code($coderef, $r, $topic_var_value, $return_topic_var); FAQ My CSV does not have a header? Use the "--no-header" option. Fields will be named "field1", "field2", and so on. My data is TSV, not CSV? Use the "--tsv" option. I have a big CSV and the utilities are too slow or eat too much RAM! These utilities are not (yet) optimized, patches welcome. If your CSV is very big, perhaps a C-based solution is what you need. HOMEPAGE Please visit the project's homepage at <https://metacpan.org/release/App-CSVUtils>. SOURCE Source repository is at <https://github.com/perlancar/perl-App-CSVUtils>. SEE ALSO Similar CLI bundles for other format App::TSVUtils, App::LTSVUtils, App::SerializeUtils. Other CSV-related utilities xls2csv and xlsx2csv from Spreadsheet::Read import-csv-to-sqlite from App::SQLiteUtils Query CSV with SQL using fsql from App::fsql csvgrep from csvgrep Other non-Perl-based CSV utilities Python csvkit, <https://csvkit.readthedocs.io/en/latest/> AUTHOR perlancar <perlancar@cpan.org> CONTRIBUTOR Adam Hopkins <violapiratejunky@gmail.com> CONTRIBUTING To contribute, you can send patches by email/via RT, or send pull requests on GitHub. Most of the time, you don't need to build the distribution yourself. You can simply modify the code, then test via: % prove -l If you want to build the distribution (e.g. to try to install it locally on your system), you can install Dist::Zilla, Dist::Zilla::PluginBundle::Author::PERLANCAR, Pod::Weaver::PluginBundle::Author::PERLANCAR, and sometimes one or two other Dist::Zilla- and/or Pod::Weaver plugins. Any additional steps required beyond that are considered a bug and can be reported to me. COPYRIGHT AND LICENSE This software is copyright (c) 2023, 2022, 2021, 2020, 2019, 2018, 2017, 2016 by perlancar <perlancar@cpan.org>. This is free software; you can redistribute it and/or modify it under the same terms as the Perl 5 programming language system itself. BUGS Please report any bugs or feature requests on the bugtracker website <https://rt.cpan.org/Public/Dist/Display.html?Name=App-CSVUtils> When submitting a bug or request, please include a test-file or a patch to an existing test-file that illustrates the bug or desired feature.