| 1 | Copyright (C) 1992, 1997-2002, 2004-2012 Free Software Foundation, Inc.
|
|---|
| 2 |
|
|---|
| 3 | Copying and distribution of this file, with or without modification,
|
|---|
| 4 | are permitted in any medium without royalty provided the copyright
|
|---|
| 5 | notice and this notice are preserved.
|
|---|
| 6 |
|
|---|
| 7 | ===============
|
|---|
| 8 | Short term work
|
|---|
| 9 | ===============
|
|---|
| 10 |
|
|---|
| 11 | See where we are with UTF-8 performance.
|
|---|
| 12 |
|
|---|
| 13 | Merge Debian patches 55-bigfile.patch, 69-mbtowc.patch and
|
|---|
| 14 | 70-man_apostrophe.patch. Go through patches in Savannah.
|
|---|
| 15 |
|
|---|
| 16 | Cleanup of the grep(), grepdir(), recursion (the "main loop") to use fts.
|
|---|
| 17 | Fix --directories=read.
|
|---|
| 18 |
|
|---|
| 19 | Write better Texinfo documentation for grep. The manual page would be a
|
|---|
| 20 | good place to start, but Info documents are also supposed to contain a
|
|---|
| 21 | tutorial and examples.
|
|---|
| 22 |
|
|---|
| 23 | Some test in tests/spencer2.tests should have failed! Need to filter out
|
|---|
| 24 | some bugs in dfa.[ch]/regex.[ch].
|
|---|
| 25 |
|
|---|
| 26 | Multithreading?
|
|---|
| 27 |
|
|---|
| 28 | GNU grep does 32-bit arithmetic, it needs to move to 64-bit (i.e.
|
|---|
| 29 | size_t/ptrdiff_t).
|
|---|
| 30 |
|
|---|
| 31 | Lazy dynamic linking of libpcre.
|
|---|
| 32 |
|
|---|
| 33 | Check FreeBSD's integration of zgrep (-Z) and bzgrep (-J) in one
|
|---|
| 34 | binary. Is there a possibility of doing even better by automatically
|
|---|
| 35 | checking the magic of binary files ourselves (0x1F 0x8B for gzip, 0x1F
|
|---|
| 36 | 0x9D for compress, and 0x42 0x5A 0x68 for bzip2)? Once what to do with
|
|---|
| 37 | libpcre is decided, do the same for libz and libbz2.
|
|---|
| 38 |
|
|---|
| 39 | |
|---|
| 40 |
|
|---|
| 41 | ==================
|
|---|
| 42 | Matching algorithms
|
|---|
| 43 | ==================
|
|---|
| 44 |
|
|---|
| 45 | Check <http://tony.abou-assaleh.net/greps.html>. Take a look at these
|
|---|
| 46 | and consider opportunities for merging or cloning:
|
|---|
| 47 |
|
|---|
| 48 | -- ja-grep's mlb2 patch (Japanese grep)
|
|---|
| 49 | <ftp://ftp.freebsd.org/pub/FreeBSD/ports/distfiles/grep-2.4.2-mlb2.patch.gz>
|
|---|
| 50 | -- lgrep (from lv, a Powerful Multilingual File Viewer / Grep)
|
|---|
| 51 | <http://www.ff.iij4u.or.jp/~nrt/lv/>;
|
|---|
| 52 | -- cgrep (Context grep) <http://plg.uwaterloo.ca/~ftp/mt/cgrep/>
|
|---|
| 53 | seems like nice work;
|
|---|
| 54 | -- sgrep (Struct grep) <http://www.cs.helsinki.fi/u/jjaakkol/sgrep.html>;
|
|---|
| 55 | -- agrep (Approximate grep) <http://www.tgries.de/agrep/>,
|
|---|
| 56 | from glimpse;
|
|---|
| 57 | -- nr-grep (Nondeterministic reverse grep)
|
|---|
| 58 | <http://www.dcc.uchile.cl/~gnavarro/software/>;
|
|---|
| 59 | -- ggrep (Grouse grep) <http://www.grouse.com.au/ggrep/>;
|
|---|
| 60 | -- grep.py (Python grep) <http://www.vdesmedt.com/~vds2212/grep.html>;
|
|---|
| 61 | -- freegrep <http://www.vocito.com/downloads/software/grep/>;
|
|---|
| 62 |
|
|---|
| 63 | Check some new algorithms for matching; talk to Karl Berry and Nelson.
|
|---|
| 64 | Sunday's "Quick Search" Algorithm (CACM 33, 1990-08-08 pp. 132-142)
|
|---|
| 65 | claim that his algorithm is faster than Boyer-More. Worth checking.
|
|---|
| 66 |
|
|---|
| 67 | Fix the DFA matcher to never use exponential space. (Fortunately, these
|
|---|
| 68 | cases are rare.)
|
|---|
| 69 |
|
|---|
| 70 | |
|---|
| 71 |
|
|---|
| 72 | ============================
|
|---|
| 73 | Standards: POSIX and Unicode
|
|---|
| 74 | ============================
|
|---|
| 75 |
|
|---|
| 76 | For POSIX compliance, see p10003.x. Current support for the POSIX [= =]
|
|---|
| 77 | and [. .] constructs is limited. This is difficult because it requires
|
|---|
| 78 | locale-dependent details of the character set and collating sequence,
|
|---|
| 79 | but POSIX does not standardize any method for accessing this information!
|
|---|
| 80 |
|
|---|
| 81 | For Unicode, interesting things to check include the Unicode Standard
|
|---|
| 82 | <http://www.unicode.org/standard/standard.html> and the Unicode Technical
|
|---|
| 83 | Standard #18 (<http://www.unicode.org/reports/tr18/> “Unicode Regular
|
|---|
| 84 | Expressions”). Talk to Bruno Haible who's maintaining GNU libunistring.
|
|---|
| 85 | See also Unicode Standard Annex #15 (<http://www.unicode.org/reports/tr15/>
|
|---|
| 86 | “Unicode Normalization Forms”), already implemented by GNU libunistring.
|
|---|
| 87 |
|
|---|
| 88 | In particular, --ignore-case needs to be evaluated against the standards.
|
|---|
| 89 | We may want to deviate from POSIX if Unicode provides better or clearer
|
|---|
| 90 | semantics.
|
|---|
| 91 |
|
|---|
| 92 | POSIX and --ignore-case
|
|---|
| 93 | -----------------------
|
|---|
| 94 |
|
|---|
| 95 | For this issue, interesting things to check in POSIX include the
|
|---|
| 96 | Volume “Base Definitions (XBD)”, Chapter “Regular Expressions” and in
|
|---|
| 97 | particular Section “Regular Expression General Requirements” and its
|
|---|
| 98 | paragraph about caseless matching (note that this may not have been
|
|---|
| 99 | fully thought through and that this text may be self-contradicting
|
|---|
| 100 | [specifically: “of either data or patterns” versus all the rest]).
|
|---|
| 101 |
|
|---|
| 102 | In particular, consider the following with POSIX's approach to case
|
|---|
| 103 | folding in mind. Assume a non-Turkic locale with a character
|
|---|
| 104 | repertoire reduced to the following various forms of “LATIN LETTER I”:
|
|---|
| 105 |
|
|---|
| 106 | 0049;LATIN CAPITAL LETTER I;Lu;0;L;;;;;N;;;;0069;
|
|---|
| 107 | 0069;LATIN SMALL LETTER I;Ll;0;L;;;;;N;;;0049;;0049
|
|---|
| 108 | 0130;LATIN CAPITAL LETTER I WITH DOT ABOVE;Lu;0;L;0049 0307;;;;N;LATIN CAPITAL LETTER I DOT;;;0069;
|
|---|
| 109 | 0131;LATIN SMALL LETTER DOTLESS I;Ll;0;L;;;;;N;;;0049;;0049
|
|---|
| 110 |
|
|---|
| 111 | First note the differing UTF-8 octet lengths of U+0049 (0x49) and
|
|---|
| 112 | U+0069 (0x69) versus U+0130 (0xC4 0xB0) and U+0131 (0xC4 0xB1). This
|
|---|
| 113 | implies that whole UTF-8 strings cannot be case-converted in place,
|
|---|
| 114 | using the same memory buffer, and that the needed octet-size of the
|
|---|
| 115 | new buffer cannot merely be guessed (although there's a simple upper
|
|---|
| 116 | bound of six times the size of the input, as the longest UTF-8
|
|---|
| 117 | encoding of any character is six bytes).
|
|---|
| 118 |
|
|---|
| 119 | We have
|
|---|
| 120 |
|
|---|
| 121 | lc(I) = i, uc(I) = I
|
|---|
| 122 | lc(i) = i, uc(i) = I
|
|---|
| 123 | lc(İ) = i, uc(İ) = İ
|
|---|
| 124 | lc(ı) = ı, uc(ı) = I
|
|---|
| 125 |
|
|---|
| 126 | where lc() and uc() denote lower-case and upper-case conversions.
|
|---|
| 127 |
|
|---|
| 128 | There are several candidate --ignore-case logics (including the one
|
|---|
| 129 | mandated by POSIX):
|
|---|
| 130 |
|
|---|
| 131 | Using the
|
|---|
| 132 |
|
|---|
| 133 | if (lc(input_wchar) == lc(pattern_wchar))
|
|---|
| 134 |
|
|---|
| 135 | logic leads to the following matches:
|
|---|
| 136 |
|
|---|
| 137 | \in I i İ ı
|
|---|
| 138 | pat\ ----------
|
|---|
| 139 | "I" | Y Y Y n
|
|---|
| 140 | "i" | Y Y Y n
|
|---|
| 141 | "İ" | Y Y Y n
|
|---|
| 142 | "ı" | n n n Y
|
|---|
| 143 |
|
|---|
| 144 | There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
|---|
| 145 | this.
|
|---|
| 146 |
|
|---|
| 147 | Using the
|
|---|
| 148 |
|
|---|
| 149 | if (uc(input_wchar) == uc(pattern_wchar))
|
|---|
| 150 |
|
|---|
| 151 | logic leads to the following matches:
|
|---|
| 152 |
|
|---|
| 153 | \in I i İ ı
|
|---|
| 154 | pat\ ----------
|
|---|
| 155 | "I" | Y Y n Y
|
|---|
| 156 | "i" | Y Y n Y
|
|---|
| 157 | "İ" | n n Y n
|
|---|
| 158 | "ı" | Y Y n Y
|
|---|
| 159 |
|
|---|
| 160 | There is a lack of symmetry between CAPITAL and SMALL LETTERs with
|
|---|
| 161 | this.
|
|---|
| 162 |
|
|---|
| 163 | Using the
|
|---|
| 164 |
|
|---|
| 165 | if ( lc(input_wchar) == lc(pattern_wchar)
|
|---|
| 166 | || uc(input_wchar) == uc(pattern_wchar))
|
|---|
| 167 |
|
|---|
| 168 | logic leads to the following matches:
|
|---|
| 169 |
|
|---|
| 170 | \in I i İ ı
|
|---|
| 171 | pat\ ----------
|
|---|
| 172 | "I" | Y Y Y Y
|
|---|
| 173 | "i" | Y Y Y Y
|
|---|
| 174 | "İ" | Y Y Y n
|
|---|
| 175 | "ı" | Y Y n Y
|
|---|
| 176 |
|
|---|
| 177 | There is some elegance and symmetry with this. But there are
|
|---|
| 178 | potentially two conversions to be made per input character. If the
|
|---|
| 179 | pattern is pre-converted, two copies of it need to be kept and used in
|
|---|
| 180 | a mutually coherent fashion.
|
|---|
| 181 |
|
|---|
| 182 | Using the
|
|---|
| 183 |
|
|---|
| 184 | if ( input_wchar == pattern_wchar
|
|---|
| 185 | || lc(input_wchar) == pattern_wchar
|
|---|
| 186 | || uc(input_wchar) == pattern_wchar)
|
|---|
| 187 |
|
|---|
| 188 | logic (as mandated by POSIX) leads to the following matches:
|
|---|
| 189 |
|
|---|
| 190 | \in I i İ ı
|
|---|
| 191 | pat\ ----------
|
|---|
| 192 | "I" | Y Y n Y
|
|---|
| 193 | "i" | Y Y Y n
|
|---|
| 194 | "İ" | n n Y n
|
|---|
| 195 | "ı" | n n n Y
|
|---|
| 196 |
|
|---|
| 197 | There is a different CAPITAL/SMALL symmetry with this. But there's
|
|---|
| 198 | also a loss of pattern/input symmetry that's unique to it. Also there
|
|---|
| 199 | are potentially two conversions to be made per input character.
|
|---|
| 200 |
|
|---|
| 201 | Using the
|
|---|
| 202 |
|
|---|
| 203 | if (lc(uc(input_wchar)) == lc(uc(pattern_wchar)))
|
|---|
| 204 |
|
|---|
| 205 |
|
|---|
| 206 | logic leads to the following matches:
|
|---|
| 207 |
|
|---|
| 208 | \in I i İ ı
|
|---|
| 209 | pat\ ----------
|
|---|
| 210 | "I" | Y Y Y Y
|
|---|
| 211 | "i" | Y Y Y Y
|
|---|
| 212 | "İ" | Y Y Y Y
|
|---|
| 213 | "ı" | Y Y Y Y
|
|---|
| 214 |
|
|---|
| 215 | This shows total symmetry and transitivity
|
|---|
| 216 | (at least in this example analysis).
|
|---|
| 217 | There are two conversions to be made per input character,
|
|---|
| 218 | but support could be added for having
|
|---|
| 219 | a single straight mapping performing
|
|---|
| 220 | a composition of the two conversions.
|
|---|
| 221 |
|
|---|
| 222 | Any optimization in the implementation of each logic
|
|---|
| 223 | must not change its basic semantic.
|
|---|
| 224 |
|
|---|
| 225 |
|
|---|
| 226 | Unicode and --ignore-case
|
|---|
| 227 | -------------------------
|
|---|
| 228 |
|
|---|
| 229 | For this issue, interesting things to check in Unicode include:
|
|---|
| 230 |
|
|---|
| 231 | -- The Unicode Standard, Chapter 3
|
|---|
| 232 | (<http://www.unicode.org/versions/Unicode5.2.0/ch03.pdf>
|
|---|
| 233 | “Conformance”), Section 3.13 (“Default Case Operations”) and the
|
|---|
| 234 | toCasefold() case conversion operation.
|
|---|
| 235 |
|
|---|
| 236 | -- The Unicode Standard, Chapter 4
|
|---|
| 237 | (<http://www.unicode.org/versions/Unicode5.2.0/ch04.pdf>
|
|---|
| 238 | “Character Properties”), Section 4.2 (“Case—Normative”) and
|
|---|
| 239 | the <http://www.unicode.org/Public/UNIDATA/SpecialCasing.txt>
|
|---|
| 240 | SpecialCasing.txt and
|
|---|
| 241 | <http://www.unicode.org/Public/UNIDATA/CaseFolding.txt>
|
|---|
| 242 | CaseFolding.txt files from the
|
|---|
| 243 | <http://www.unicode.org/Public/UNIDATA/UCD.html> Unicode
|
|---|
| 244 | Character Database .
|
|---|
| 245 |
|
|---|
| 246 | The <http://www.unicode.org/standard/standard.html> Unicode Standard,
|
|---|
| 247 | Chapter 5 (“<http://www.unicode.org/versions/Unicode5.2.0/ch05.pdf>
|
|---|
| 248 | Implementation Guidelines ”), Section 5.18 (“Case Mappings”),
|
|---|
| 249 | Subsection “Caseless Matching”.
|
|---|
| 250 |
|
|---|
| 251 | The Unicode <http://www.unicode.org/charts/case/> case charts.
|
|---|
| 252 |
|
|---|
| 253 | Unicode uses the
|
|---|
| 254 |
|
|---|
| 255 | if (toCasefold(input_wchar_string) == toCasefold(pattern_wchar_string))
|
|---|
| 256 |
|
|---|
| 257 | logic for caseless matching. Let's consider the “LATIN LETTER I”
|
|---|
| 258 | example mentioned above. In a non-Turkic locale, simple case folding
|
|---|
| 259 | yields
|
|---|
| 260 |
|
|---|
| 261 | toCasefold_simple(U+0049) = U+0069
|
|---|
| 262 | toCasefold_simple(U+0069) = U+0069
|
|---|
| 263 | toCasefold_simple(U+0130) = U+0130
|
|---|
| 264 | toCasefold_simple(U+0131) = U+0131
|
|---|
| 265 |
|
|---|
| 266 | which leads to the following matches:
|
|---|
| 267 |
|
|---|
| 268 | \in I i İ ı
|
|---|
| 269 | pat\ ----------
|
|---|
| 270 | "I" | Y Y n n
|
|---|
| 271 | "i" | Y Y n n
|
|---|
| 272 | "İ" | n n Y n
|
|---|
| 273 | "ı" | n n n Y
|
|---|
| 274 |
|
|---|
| 275 | This is different from anything so far!
|
|---|
| 276 |
|
|---|
| 277 | In a non-Turkic locale, full case folding yields
|
|---|
| 278 |
|
|---|
| 279 | toCasefold_full(U+0049) = U+0069
|
|---|
| 280 | toCasefold_full(U+0069) = U+0069
|
|---|
| 281 | toCasefold_full(U+0130) = <U+0069, U+0307>
|
|---|
| 282 | toCasefold_full(U+0131) = U+0131
|
|---|
| 283 |
|
|---|
| 284 | with
|
|---|
| 285 |
|
|---|
| 286 | 0307;COMBINING DOT ABOVE;Mn;230;NSM;;;;;N;NON-SPACING DOT ABOVE;;;;
|
|---|
| 287 |
|
|---|
| 288 | which leads to the following matches:
|
|---|
| 289 |
|
|---|
| 290 | \in I i İ ı
|
|---|
| 291 | pat\ ----------
|
|---|
| 292 | "I" | Y Y * n
|
|---|
| 293 | "i" | Y Y * n
|
|---|
| 294 | "İ" | n n Y n
|
|---|
| 295 | "ı" | n n n Y
|
|---|
| 296 |
|
|---|
| 297 | This is just sad!
|
|---|
| 298 |
|
|---|
| 299 | Note that having toCasefold(U+0131), simple or full, map to itself
|
|---|
| 300 | instead of U+0069 is in contradiction with the rules of Section 5.18
|
|---|
| 301 | of the Unicode Standard since toUpperCase(U+0131) is U+0049. Same
|
|---|
| 302 | thing for toCasefold_simple(U+0130) since toLowerCase(U+0131) is
|
|---|
| 303 | U+0069. The justification for the weird toCasefold_full(U+0130)
|
|---|
| 304 | mapping is unknown; it doesn't even make sense to add a dot (U+0307)
|
|---|
| 305 | to a letter that already has one (U+0069). It would have been so
|
|---|
| 306 | simple to put them all in the same equivalence class!
|
|---|
| 307 |
|
|---|
| 308 | Otherwise, also consider the following problem with Unicode's approach
|
|---|
| 309 | on case folding in mind. Assume that we want to perform
|
|---|
| 310 |
|
|---|
| 311 | echo 'AßBC | grep -i 'Sb'
|
|---|
| 312 |
|
|---|
| 313 | which corresponds to
|
|---|
| 314 |
|
|---|
| 315 | input: U+0041 U+00DF U+0042 U+0043 U+000A
|
|---|
| 316 | pattern: U+0053 U+0062
|
|---|
| 317 |
|
|---|
| 318 | Following “CaseFolding-4.1.0.txt”, applying the toCasefold()
|
|---|
| 319 | transformation to these yields
|
|---|
| 320 |
|
|---|
| 321 | input: U+0061 U+0073 U+0073 U+0062 U+0063 U+000A
|
|---|
| 322 | pattern: U+0073 U+0062
|
|---|
| 323 |
|
|---|
| 324 | so, according to this approach, the input should match the pattern. As
|
|---|
| 325 | long as the original input line is to be reported to the user as a
|
|---|
| 326 | whole, there is no problem (from the user's point-of-view;
|
|---|
| 327 | implementation is complicated by this).
|
|---|
| 328 |
|
|---|
| 329 | However, consider both these GNU extensions:
|
|---|
| 330 |
|
|---|
| 331 | echo 'AßBC' | grep -i --only-matching 'Sb'
|
|---|
| 332 | echo 'AßBC' | grep -i --color=always 'Sb'
|
|---|
| 333 |
|
|---|
| 334 | What is to be reported in these cases, since the match begins in the
|
|---|
| 335 | *middle* of the original input character 'ß'?
|
|---|
| 336 |
|
|---|
| 337 | Note that Unicode's toCasefold() cannot be implemented in terms of
|
|---|
| 338 | POSIX' towctrans() since that can only return a single wint_t value
|
|---|
| 339 | per input wint_t value.
|
|---|