Subject: Re: Snippet of code needed.
Fri Nov 20 19:08:43 1998

> A while back, there was a discussion on parsing bodies of text to find
> URL's, then inserting the proper anchor tag so that the address is
> turned into a working link on the fly. Can anyone shed some light on
> how this is done? I'm guessing I'll need to use =~ in there
> somewhere...

okay.. assuming a unix system, the following will take the name of a source file on the command line and dump a processed version to standard output. the assumed way to run the thing would be:

    $ ./do_links.pl source_file.html > target_file.html

-- do_links.pl --
#!/usr/local/bin/perl

$file = $ARGV[0];
open (FILE, $file) or die qq(can't read "$file", $!);

while (<FILE>) {
    s|(http://\S*)|<a href="$1">$1</a>|ig;
    print;
}

close FILE;
-- EOF --

the structure of the inner loop is, shall we say, nontrivial.

first of all, it reads each line of the file directly into the default scalar, $_. many perl commands use either the default scalar or the default list if you don't give them explicit variables. a more well-behaved version, which does exactly the same thing, would be:

    while ($line = <FILE>) {
        $line =~ s|(http://\S*)|<a href="$1">$1</a>|ig;
        print $line;
    }

the substitution expression is also a bear, but so is almost any substitution..

first of all, the actual command is just 's'. in the familiar form:

    s/something/something else/;

using slashes to separate the 'find' and 'replace' sections is only a convention. in practice, you can use almost any single punctuation character ('|' in the example), or sets of balanced container characters:

    s[something][something else];

i chose the vertical bar because URLs use slashes, and changing the separator makes things easier to read. the slash version is less attractive because you have to escape the slashes in both the pattern and the replacement with backslashes:

    s/(http:\/\/\S*)/<a href="$1">$1<\/a>/;

and that's even more arcane.

the search pattern:

    (http://\S*)

consists of three parts:
  - a start string:        http://
  - a regular expression:  \S*
  - and capture markers:   ()

translating it all into english you get something like: "look for a string that begins with 'http://', then add every non-whitespace character you find directly following it. store the results in a private variable named $1".

the start string is pretty much self-explanatory. the interpreter hums along until it sees that sequence of characters, and then it gets interested.

the '\S' is a wildcard which means 'any non-whitespace character'. the splat means 'zero or more of whatever is to the left', so the combination means 'everything up to (but not including) the next whitespace character'.

the issues surrounding private variables are deeply geekish, and completely unnecessary for normal use. just assume the value gets there by magic, and you'll be fine.

the replacement string:

    <a href="$1">$1</a>

is pretty straightforward. it simply pops two copies of whatever URL was just found into a new string: one inside the href attribute, and one as the visible link text.

finally, there are two modifiers at the end of the expression: 'i' and 'g'.

the 'i' modifier tells the interpreter to Ignore case when matching strings.. 'http' will also match 'HTTP', even though the two are different sequences of characters.

the 'g' modifier says to perform the entire search & replace thing Globally within the string. in other words, if there are two URLs in the line, it does both. without the 'g' modifier, the expression will only find and replace the first one.
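
if you want to see the 'i' and 'g' modifiers at work without touching a file, here's a minimal sketch. the sub names (link_urls, link_first_url) and the example.com addresses are made up for illustration; only the substitution itself comes from the script above.

```perl
#!/usr/bin/perl
use strict;
use warnings;

# hypothetical helper: wrap every http:// URL in the text in an anchor tag
sub link_urls {
    my ($text) = @_;
    $text =~ s|(http://\S*)|<a href="$1">$1</a>|ig;
    return $text;
}

# same substitution without 'g': only the first URL gets wrapped
sub link_first_url {
    my ($text) = @_;
    $text =~ s|(http://\S*)|<a href="$1">$1</a>|i;
    return $text;
}

# one lowercase and one uppercase scheme, to exercise the 'i' modifier
my $line = "see http://example.com/a and HTTP://example.com/b for details\n";

print link_urls($line);       # both URLs become links
print link_first_url($line);  # the second URL is left as plain text
```

one thing to keep in mind: since \S* grabs everything up to the next whitespace, a URL that sits right before a period or comma will drag that punctuation into the link.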