Parsing a large text file and writing the output for each case to a separate file

3
I am working with the following data corpus: a very large plain-text file (400MB) that contains almost every English legal case from the Norman Conquest to the 19th century. Within the document, which is just one very long text file, the cases are separated by the citation for each case.

For example, the cases are laid out like this:

CITATION NUMBER (e.g., 1 Report 10)

THE TEXT OF THE CASE .... blah blah blah...

CITATION NUMBER (e.g., 1 Report 11)

ETC...

Because I want to build a document index and a search query function, which I hope to make freely available online to anyone, I first need to split the document into individual .txt files, one per case.

(It is important to mention that the citation always contains two letters, such as ER, which stands for English Reports, followed by a number that differs from case to case.)

How can I run a script that operates with the following logic:

  1. Find the first citation by locating the first occurrence of number + "ER" + another number.
  2. Find the next occurrence of a case citation by locating [number + "ER" + another number].
  3. Print to a file all the text between the first citation and the next citation, excluding the next citation itself.
  4. Name the output file with the value of the citation reference found.
  5. Repeat this process for all subsequent citations until no further citation is found (i.e., the end of the document).

======

Any ideas or guidance on where I would start?

I have used cut on Linux to work with CSV files, and I think what I am doing here is similar. For example, with cut, I could say: cut -d[citation instance consisting of a pattern of (number)_"ER"_(number)] | cat > nameoffile.txt

The following is the case Lambert v. Lambert. The English Reports citation (1 ER 764) marks the start of the case, but the citation appears to be repeated at every page break: 1 ER 765, 1 ER 767, 1 ER 768, 1 ER 769, 1 ER 770. Then the next case starts at 1 ER 770. In addition, each page break is separated by a blank line in my document.

==BEGIN===
'Lambert v Lambert [1767] 2 Brown PC 18, 1 ER 764
Report Date: 1767
[2-Brown-18]  CASE 5 WALTER LAMBERT, Appellant; CATHERINE LAMBERT, Respondent  [18th May 176 7].
[Mew's Dig. vii. 1132.]
[Where a husband by force, etc. compels his wife to execute a deed of separation and thereby to accept of a very small maintenance, much inferior to his rank and fortune; a Court of Equity will relieve the wife against this deed, and refer it to a Master, to settle a proper maintenance.]
    Arthur Humphrey, gent. the respondent's former husband, was in his lifetime entitled to a very valuable interest in the lands of Gortern and Ballybrien, in the county of Hallway, containing 300 acres, by virtue of a lease for three lives from Frederick French, esq. He died in the year 1740, leaving the respondent his widow and several children, whose principal subsistence was to arise from the profits of this lease; and at the time of his death, he owed Mr. French, for rent, £119 18s. 9d. for which arrear Mr. French threatened to bring an ejectment against the respondent. Upon which occasion, in May 1740, she applied to the appellant, who was then considered as a very wealthy man, and proposed to grant him a lease of part of the lands at a low rent, if he would pay off the arrear, to which proposal the appellant readily agreed; and accordingly, the respondent executed a lease of 130 acres, part of the said lands, to the appellant, at the yearly rent of 6s. 4d. by the acre, although they were then worth considerably more; and assigned to him the original lease, as a security for the money which he was to advance without interest.
    In 1741, the appellant came to the respondent's house, and solicited her in marriage; and to induce her to comply, proposed to give up his mortgage of the said lease; and accordingly, on the 1st of September 1741, the appellant executed a writing directed to his son Charles Lambert, in the words following: " I do hereby order " that the lease of Gorteran, assigned to me by the widow Catherine Humphrey, as " a security for what money I paid that was due on the said farm, may be given to "her without any demand to it. September 1st, 1741. Walter Lambert.-To my " son Charles Lambert.-The above lease is in the upper drawer."
    Things remained in this situation till the year 1743, when a marriage was agreed on between the appellant and the respondent, and articles were duly executed, dated the 22d of September 1743, whereby the appellant covenanted, that the respondent should have a provision of £20 a year, and should be acquitted from the said sum of £119 18s. 9d. in case any part thereof should remain due at the appellant's death which provision was to be in full satisfaction of all dower, thirds, or jointure.
    Soon after the execution of these articles, the appellant and respondent intermarried; and from the time of their marriage, for a long series of years, lived together as man and wife in perfect [2-Brown-19] harmony and affection. And the appellant (who for many years had been afflicted with disorders and infirmities) many times expressed his acknowledgments and sense of the respondent's great tenderness and affection for him, and of her care in the preservation of his substance; and often declared he 

2 Brown 20, 1 ER p765
would make an ample provision for her, in case she should survive him. This raised the jealousy of the appellant's children by his former wives, and a scheme was formed to supplant the respondent in his affection and esteem; and one Edward Cloran was pitched upon, as a proper instrument for carrying it into execution. This man acted in the appellant's house in the character of overseer, and by his pretended honesty acquired the confidence of the appellant, to whom he insinuated that the respondent had embezzled his substance, to supply the wants of her children by a former husband. At length his insolence arose to such a pitch, that he abused and beat the respondent's eldest son, who frequently visited at the appellant's house.
    The respondent complained to the appellant of the treatment her son had met with; but the appellant's mind had been so poisoned against the respondent by the false insinuations of Cloran, that instead of redressing the complaint, he flew into a violent passion, called the respondent many opprobrious names, and swore she should never lie in the same bed or room with him; and upon the respondent's expostulating with the appellant, he gave her a violent punch of a bill-hook in her side, threw her down, and seized her violently by the throat. And although a very sickly woman, and advanced in years, she was soon after, by the directions of the appellant, confined in a small cold damp room, and fed with the leavings and fragments of Cloran and one Lynch a thatcher, who frequented the appellant's house, and was called the Governor, with intent to starve her into a compliance with their schemes. This Lynch was employed to bar the room door where the respondent was confined, and to fix an iron chain and padlock to it every night; and the respondent, from the cold and damp of the room, lost the sight of one of her eyes: and the appellant often, during her confinement, told her, that if she would not agree to quit his house, and take a separate maintenance, he would lock up all the doors, and would not leave one living creature in the house but herself; and that she should have neither fire or candle light, or any subsistence whatsoever; and that if she did not take £20 a year, she should be still confined, and should never have so good an over made to her again.
    The respondent in this distressed situation, was obliged to execute the following instrument, which the appellant, or his son Charles, had caused to be drawn up. " I do hereby promise and agree, to pay my wife, Catherine Lambert, otherwise " Rolleston, the sum of £20 sterling yearly, during our separation, to be paid in two " payments; that is to say, £10 every May, and £10 every November. And in case " the said Catherine Lambert, otherwise Rolleston, should survive me, this instrument " to be [2-Brown-20] then void, and she is to have the benefit of our marriage articles, and no " more; which articles are witnessed by her brother Francis Rolleston, esq. and John " Leary. In witness whereof, we have hereunto set our hands and seals, the 29th " day of November 1762, Walter Lambert, Catherine Lambert. Present John Lynch " John Butler."
    At the time the respondent signed the above writing, her treatment was such that she was in dread of her life, and would have signed any paper they produced to her, in order to procure her liberty; but after the execution of the writing, having been visited by some of her friends, she was advised by them not to quit the appellant's house at any rate, until she was actually turned out; for that the provision made for her by the appellant, was too poor for the wife of a man of so considerable a fortune and thereupon the respondent absolutely refused to quit her husband's house: upon which, the appellant and his son Charles Lambert redoubled their cruelty to the respondent; kept her more closely confined in the damp room, and turned off a servant for bringing her a little turf for fire to warm herself. But finding that all this barbarous treatment did not produce the intended effect of making the respondent quit the house, the said Charles Lambert prevailed on the appellant to order all the furniture and kitchen utensils to be removed into the brewhouse, and to quit his own house, and to reside with him, leaving the respondent confined under the controul and dominion of the two instruments of his cruelty, Cloran and Lynch.
    In some time afterwards, the appellant, with his son Charles, returned to his house where the respondent was still confined, and seised all the papers belonging to the respondent, which they could find; and immediately after, Cloran, by the directions of the appellant and his son Charles, forcibly dragged the respondent out of the appellant's house, and greatly abused and cut her.
    Not yet satisfied with what had been done, the appellant and his confederates 

2 Brown 21, 1 ER p766
carried their cruelty to the respondent to the most infamous extremity. They stabbed her reputation, which had ever been unblemished; they traduced her as a thief; and even dared to deny her marriage with the appellant, weakly imagining to apologise to the world for the wanton cruelty exercised against her, and maliciously intending to deprive her of all resources from the friendship of her relations and friends, to enable her to seek for justice, or even to procure the means of necessary subsistence. But the respondent's character was too well established to suffer material ally by this wicked device; and her friends, convinced of her innocence, and shocked at the treatment she had received, advised her to seek redress in a Court of Justice.
    Accordingly, on the 2d of December 1763, the respondent, by her brother and next friend Francis Rolleston, esq. exhibited her bill in the Court of Chancery in Ireland against her said husband, the appellant, and also against Charles Lambert and Megg his wife, [2-Brown-21] John Lambert and Mary his wife, Thomas Lambert and Mary his wife, Mary Lambert widow of Peter Lambert, who were the sons of the appellant; and against Robert Hamilton, esq. Brother-in-law to the appellant, and the said Cloran; stating several of the matters and acts of cruelty before mentioned and also charging her marriage and cohabitation with the appellant for nineteen years, during which time, the appellant had frequently represented to his friends and relations, her care and tenderness of him: her being introduced by the appellant to, and visited by ladies of the first distinction in the country, as his wife. That she was always called mother by the appellant's sons and daughters-in-law, and was received as such by them; and that she stood sponsor to several of the appellant's grandchildren. That she from time to tine, received letters from his sons, addressed to her as their mother-in-law, and written in a dutiful manner. That the appellant joined the respondent in making two leases of her farm of Gortern, in each of which leases there was a proviso, that the respondent should take the profits to and for her sole use, notwithstanding her coverture. That the appellant executed several wills which were all in his own handwriting; the first of which was dated the 3d of December 1751; another dated the 17th of March 1752; another dated the 15th of June 1756 another dated in March 1757; and another dated the 25th of November 1758; in each of which wills he made certain provisions for the respondent, and in every one of them called her his wife, and even willed to her her said farm of Gortern and Ballybrien. 
That the appellant, upon the marriage of his second son John Lambert, with the daughter of Sir Henry Burke, in the year 1756, having occasion to levy a fine of some part of his estate, which was to be settled on the marriage, applied to the respondent to Join In levying such fine; and in order to induce her so to do, he signed the following writing: " I do hereby assure my wife Catherine Lambert, that " she shall not suffer in any shape by her levying fines for my son John Lambert. " Witness my hand, September 10th, 1756, Walter Lambert. Present William " Nethercott." That she accordingly joined in the said fine, as the wife of the appellant. That the appellant had, notwithstanding his agreement with the respondent, received the issues and profits of her said farm, and converted the same to his own use for upwards of ten years, which amounted to £1500, and that the respondent had, by means of the waste which the appellant had committed thereon, lost her said farm the lives for which the same was held having drops, and Colonel French having refused to grant a renewal thereof, which the respondent charged would be well worth to her and her children, upwards of £3000 if the same had been renewed. That the appellant was possessed of a real estate of £1500 a year, and of a personal estate to the amount of £12,000 and upwards. 
And the bill prayed, that the deed of the 29th of November 1762, whereby the appellant agreed to give the respondent his wife the sum of £20 a year, by way of a separate maintenance, might be set aside; and that the appellant might be compelled to give the respondent [2-Brown-22] such maintenance from the time of her separation from him, as the Court should judge reasonable to support her as his wife, and to continue as long as she should live separate from him and that her bill should be taker as a bill of discovery, against such of the defendants as it was improper to pray relief against; and that the respondent might have such other and further relief, as the nature of her case required.
    To this bill the appellant put in two answers, and admitted his agreeing in the year 1740, to discharge the arrear due to Colonel French, and the execution of the deed of the 25th of November 1740. He admitted, that he frequently called upon 

2 Brown 23, 1 ER p767
the respondent at her house, but never solicited her to marry him; and although in his first answer he said, he did not believe that he had executed the writing of the 1st of September 1741, yet in his second answer he recollected, that he was persuaded and accordingly did execute such writing. He denied his marriage with the respondent, but said, that in 1743, Francis Rolleston, esq. the respondent's brother by the contrivance of the respondent as he believed, proposed to him to marry the respondent, representing her as a careful, industrious, good, humane woman and that he consented to such proposal, and admitted that thereupon such articles of the Id of September 1743, as are before stated, were drawn up and executed by him. He said, that before such marriage could be had, he met two persons, who, on his asking them some questions concerning the respondent, represented her to him as a turbulent troublesome woman, and that thereupon he determined not to marry her, and that in some time after he acquainted her with his resolution; but said, that the respondent thereupon made frequent applications to him, and requested that he would permit her to live in his house; that he then looking upon her to be a person capable of managing his family affairs, agreed to it, at the same time informing her, that for several reasons he never would marry her; and that she accordingly came to his house, and cohabited with him. That he was prevailed upon by her, to agree that she should go by his name, and be called his wife, and that she for some time behaved in a manner very agreeable to him and his friends; but that she afterwards behaved otherwise, and he was so much ashamed, that he would not let any of his family or friends know that he was not married to her, as he had before consented that she should pass for his wife. 
He said, that the respondent, as he was informed by Edmund Claron, embezzled his substance, and exercised her supposed authority in his family in a most arbitrary manner; that she misbehaved to his children and relations, and ill-treated the appellant himself, on his not listening to a charge which she had made against Cloran; he denied striking the respondent with a bill-book, or seizing her by the throat. He said, that on his refusing to let her lie in his room, she made choice of a bed-chamber for herself; and that he being informed, that the respondent frequently went about the house at unseasonable hours in the night, and sent away his goods, he ordered a padlock to be fixed on her chamber door and that it should [2-Brown-23] be locked every night after she went to bed, and believed the same was accordingly done; but he said he did not mean thereby to confine her He said he was prevailed upon to give her £20 a year, provided she would remove from his house, and live separate from him, and that thereupon the deed of the 29th of November 1762 was drawn, and that the respondent freely executed the same. He said he never gave directions that the respondent should be treated with any cruelty, nor did he believe that she received such treatment. He said he expected that she would have left the house immediately on the perfection of the deed of separation, but she put off her departure from time to time, and he not thinking himself safe in the house with her, went to his son's house, where he continued several weeks expecting the respondent would withdraw; but on his return he found she still continued there, and upon his telling her that he never would cohabit with her, she went away voluntarily. He admitted that he was possessed of an estate of £1500 a year, and of a very considerable personal estate, but refused to discover how much. He admitted his making the wills before mentioned, and believed, that in every of them he called her his wife. 
He admitted, that he joined with the respondent in making leases of her farm; and that the respondent joined with him in levying the fine stated in the bill, but said he did not believe he executed any instrument to induce her to join therein. He admitted, that he treated the respondent as his wife, and introduced her to all his relations, friends, and acquaintance as such, and that she stood sponsor to several of his grandchildren, and believed she was esteemed in the country to be a prudent, virtuous woman. He denied that any waste committed by him was the cause of the respondent's losing her farm; but admitted, that she had lost the benefit of the said lease. He said, that on the 3d of September 1760, he made a will, which was of his own hand writing, and admitted he therein stiled her his wife, and devised to her £30 a year, in addition to the £20 a year mentioned in the articles, and he bequeathed to her £100. He admitted, that for some part of the time, the respondent was constant in her care and seeming tenderness for him, and that he often expressed his acknowledgments and sense of her care and tenderness, 

2 Brown 24, 1 ER p768
and often declared he would make a good provision for her, in case she survived him. And finally, he insisted, that the matters sought by the bill to be relieved in, were properly cognizable in the Ecclesiastical Court.
    The several other defendants also put in their answers, and all of them admitted, that the appellant and respondent lied together as man and wife.
    Issue having been joined in the cause, several witnesses were examined on both sides. The respondent, on her part, produced many witnesses, persons of character and reputation, and proved every material part of her case; and even as to the actual solemnisation of the marriage, it was proved by the Rev. Dean Crowe, that he had been sent for in order to marry the appellant to the [2-Brown-24] respondent. That the day being exceedingly wet, the then Bishop of Clonfert prevailed on the Dean not to venture his life on that day, by undertaking such a journey. That on the next day he went to Gortern, in order to marry them. That the appellant then told the Dean, that he intended he should be the person to marry him to the respondent, but as he did not come when sent for, he had that ceremony performed by another, and at the same time introduced the respondent to the Dean as his wife. This evidence was confirmed by the deposition of Samuel Simpson, esq. who, amongst other particulars swore, that the winter before the marriage, the appellant had told him, as a secret that he intended to marry the respondent. That some time afterwards he met the appellant, and having heard that the appellant had been privately married to the respondent, he as heard the appellant, whether he might wish him joy ? and that the appellant told him, he might, for that he was married to the respondent; and at the same time he told the deponent, that he had a resentment against Dean Crowe for not coming to marry him when sent for, and as he did not chose to wait, that he procured another clergyman for that purpose. The evidence on the part of the appellant went to prove several instances of the respondent's ill behaviour, that she wasted his substance, embezzled his effects, procured false keys to his locks, stole away his papers, and attempted his life. 
Two papers were also proved to have been accidentally dropped by the respondent out of an handkerchief, which were certificates of her marriage in her own hand writing, with the appellant's name subscribed thereto, but which was not of his hand writing.
    Publication having passed, the cause came on to be heard before the Lord Chan-cellor of Ireland, on the 17th of November 1766, and to be further heard on the 18th, 19th, and 20th of the same month, when his Lordship was pleased to decree, that the deed of the 29th of November 1762, so far as the same might prevent the respondent's recovering a maintenance, during the separation between her and the appellant should be set aside; and it was referred to a Master, to enquire into and report the circumstances of the estate and fortune, both of the appellant and respondent, and what would be proper to allow the respondent annually for her maintenance, during the said separation.
    The respondent being in very great distress, and being likely to meet with every possible delay to retard the proceedings before the Master, on the 22d of November 1766, applied to the Court, upon an affidavit, stating her distress, for a sum of money to maintain her, and to enable her to carry on the suit; whereupon, and upon hearing counsel on behalf of the appellant, his Lordship was pleased to order the appellant to pay the respondent, in a month, the sum of £200, subject to the further order of the Court.
From this decree and order the appellant appealed, insisting (W. de Grey, A. Forrester, D. Graham), that no actual marriage was proved to have been solemnised [2-Brown-25] between him and the respondent, and he had positively denied it upon oath. That had there really been a marriage, the respondent might have proved it by various kinds of evidence, which she not only had not attempted, but from her own bill, and the evidence produced in support of it, it clearly appeared there never was any marriage between them. The bill charged the marriage to have been in the year 1742 but did not state the day, or the month, or the place where, or the person by whom the ceremony was performed, nor whether any one was present at it. Simpson, her own witness, contradicted her, by fixing the marriage to be in the year 1740; and both were contradicted by the articles made previous to the supposed marriage, which could not mistake, and were not made till September 174.3; thereby plainly proving both the charge and the testimony to be false. Dean Crowe, another of the respon-

2 Brown 26, 1 ER p769
dent's witnesses, said, he was sent for to marry them, and that the appellant told him, he had been married the day before, but he did not recollect what year this was in. Besides, the respondent's attempting to support her pretended marriage by fictitious and forged evidence, were clear proofs against the reality of it. That cohabitation and acknowledgment of marriage may be sufficient, as between the reputed husband and the creditors of the supposed wife, to oblige him to pay her debts; but would not be good as between him and her, to entitle her to dower out of his estate. And in this case, where the question was between the reputed husband and wife, evidence of cohabitation and acknowledgment was not sufficient for the Court of Chancery, if it had jurisdiction at all, to found a decree for alimony. The time when it was solemnized, the place where, and the person by whom the marriage was performed, ought to be fully and indisputably proved; and where the marriage was denied, the Court of Chancery, until it was clearly established, could have no jurisdiction to set aside the agreement of November 1762; for if there was no marriage, there could be no constraint or force in the execution of that agreement.
    [...] On the part of the respondent it was contended (F. Norton, A. Wedderburn), that her marriage was established by every imaginable circumstance; and the deed of separation itself, which was sought to be set aside, was conclusive against the appellant as to the fact of the marriage, which he had been induced to dispute by the same artifices, that had prevailed upon him to treat the respondent in the barbarous manner he had done. But a mere denial of the marriage under such circumstances, and opposed by a course of twenty years public cohabitation, could not even raise a doubt upon the question. The respondent's bill was filed to set aside a deed extorted from her by the most infamous means, of which the Court of Chancery undoubtedly had cognizance: the husband in his answer had declared, that he never would cohabit with her; the reference to the Master to enquire what would be proper to be allowed for her maintenance, and the subsequent order of the 22d of November 1766, were consequential to the original relief; and it would have been absurd in such a case, to have turned the respondent round to sue in another jurisdiction for alimony; especially as the Court had done no more in this case, than it would have done upon a supplicavit, where the husband had refused to maintain his wife. That the Court had as yet made no order with regard to the quantum of maintenance, and as to the £200 directed to be paid to the respondent by the order of November 1766, it, could not be thought too large, either with respect to the appellant's fortune, or the respondent's condition; who had lost by his misconduct her own separate fortune, and been for above four years destitute of any provision, and engaged in a most expensive litigation. As therefore the decree and order were equitable and just, and the appeal frivolous, vexatious, and oppressive, it ought to be dismissed with most exemplary costs.
    Accordingly, after hearing counsel on this appeal, it was ORDERED and ADJUDGED, that the same should be dismissed, and the decree and order therein complained of, affirmed: and it was further ORDERED, that the appellant should pay the respondent £200 for her costs in respect of the said appeal. (Jour. vol. 31. p. 604.)
    
by hef 01.11.2015 / 22:02

3 answers

2

I would use the csplit command.

csplit -z citations '/ v .*[0-9] ER [0-9]/' '{*}'

will split the file at every line that contains this sequence of characters:

space, v, space, any other characters, digit, space, E, R, space, digit

and store each split section in its own file.

Once the files are split, they can be renamed to the correct names.

The complete solution script accepts a file name argument or reads standard input:

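#!/bin/sh

# Split the input (file argument, or stdin when none is given) at every
# line that looks like a case heading: " v ... <digit> ER <digit>".
csplit -z "${1:--}" '/ v .*[0-9] ER [0-9]/' '{*}'

# Rename each piece (xx00, xx01, ...) to the first line it contains.
find . -maxdepth 1 -name 'xx*' |
while IFS= read -r filename
do
    mv -- "$filename" "$(head -n 1 "$filename")"
done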
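
To try the pattern on a small sample before running it on the full 400MB file, a sketch along these lines may help. The function name try_split and the sample text are made up for illustration; -s, -z and '{*}' are GNU csplit options, so this assumes GNU coreutils.

```shell
#!/bin/sh
# try_split FILE: split FILE inside a scratch directory and print how many
# pieces csplit produced. (Hypothetical helper, for illustration only.)
try_split() {
    # Resolve FILE to an absolute path, because csplit runs in another directory.
    input=$(cd "$(dirname "$1")" && pwd)/$(basename "$1")
    scratch=$(mktemp -d) || return 1
    (
        cd "$scratch" || exit 1
        # -s: do not print byte counts; -z: drop empty pieces.
        csplit -s -z "$input" '/ v .*[0-9] ER [0-9]/' '{*}'
        ls | wc -l | tr -d ' '
    )
    rm -rf "$scratch"
}

# Example: a sample with two case headings and one page marker.
# printf '%s\n' "'A v B [1767] 2 Brown PC 18, 1 ER 764" "text one" \
#     "2 Brown 20, 1 ER p765" "more text" \
#     "'C v D [1768] 2 Brown PC 30, 1 ER 770" "text two" > sample.txt
# try_split sample.txt
```

If the count does not match the number of case headings in the sample, the pattern probably needs adjusting before the rename step is run.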
    
by 03.11.2015 / 07:03
1

The following perl script extracts the file name into $outfile if a line matches the ER pattern (space, E, R, space) and is not a page-number line (m/ ER (?!p\d+)/), and writes all subsequent lines (until it encounters that pattern again, and with it a new file name) to a file called out/$outfile.txt.

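#! /usr/bin/perl

use strict;
use warnings;

# Until the first case heading is seen, lines go to standard output.
my $outfile='/dev/stdout';
open(OUTFILE,">","$outfile") || die "couldn't open $outfile for write: $!\n";

while (<>) {
    chomp;
    # A case heading contains " ER " followed by something other than a
    # page marker such as "p765".
    if (m/ ER (?!p\d+)/) {
       # Use at most the first 200 characters of the heading as the file name.
       $outfile = substr($_,0,200);
       open(OUTFILE,">>","./out/$outfile.txt") || die "couldn't open ./out/$outfile.txt for write: $!\n";

       # A non-empty file means a duplicate case name: write a separator first.
       if (-s "./out/$outfile.txt") {
           print OUTFILE "\n\n-=-=-=-=-=-=-=-=\n\n";
       }
    }
    print OUTFILE $_,"\n";
}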

The output is too long to show here, but I tested it on the input you provided and it worked as expected. If you can make all or part of the file (with more cases included) available for download somewhere, I can test (and possibly refine) the script further.

Using list-of-cases-volume-1.txt (with line numbers removed and saved as cases2.txt), the output (using the Lambert* cases as examples) is:
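
One way to spot-check the heading test on a few lines, without running the whole script, is sketched below. The helper name is_case_heading is hypothetical, and the grep pattern is a simplified stand-in for the Perl lookahead (it requires a digit right after " ER "), not an exact equivalent.

```shell
#!/bin/sh
# is_case_heading LINE: succeed if LINE contains " ER " followed directly
# by a digit. Simplified stand-in for m/ ER (?!p\d+)/ (an assumption, not
# the same regex): page markers such as "1 ER p765" are rejected.
is_case_heading() {
    printf '%s\n' "$1" | grep -q ' ER [0-9]'
}

# Classify one heading line and one page-marker line from the sample case.
for line in \
    "'Lambert v Lambert [1767] 2 Brown PC 18, 1 ER 764" \
    "2 Brown 20, 1 ER p765"
do
    if is_case_heading "$line"; then
        echo "heading:     $line"
    else
        echo "not heading: $line"
    fi
done
```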

$ mkdir -p out/
$ ./hef.pl cases2.txt
$ ls -1 out/Lambert*
out/Lambert v Aeretree 1 Lord Raymond 223, 91 ER 1045.txt
out/Lambert v Atkins and Another 2 Campbell 272, 170 ER 1153.txt
out/Lambert v Cook 1 Lord Raymond 237, 91 ER 1055.txt
out/Lambert v Oakes 1 Lord Raymond 443, 91 ER 1194.txt
out/Lambert v Pack 1 Salkeld 127, 91 ER 120.txt
out/Lambert v Peyton [1860] 7 House of Lords Cases 423, 11 ER 169.txt

Some of the input lines (4176 of 12861 lines) were for duplicate case names, so I modified the script above to append the extra lines for such a case to the existing file, with -=-=-=-=-=-=-=-= as a separator.

Some of the case titles were too long to use as a file name, so I used substr($_,0,200) to limit the file name to the first 200 characters. Another alternative, which would result in file names that are not human-readable, would be to use the md5sum hash of the case name as the file name. The case name would still be on the first line of the file.

One, hopefully final, comment. It would not be difficult to modify the script above to use the perl DBI module and a postgres or mysql database to store all of these records in a searchable database, with the case name as the indexed title field and the text in a text field.
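
As a rough sketch of the hash-based naming, assuming the GNU md5sum tool is available (the helper name hashed_name is hypothetical and not part of the script above):

```shell
#!/bin/sh
# hashed_name TITLE: print a file name built from the MD5 hash of TITLE.
# The result is always 32 hex characters plus ".txt", however long the
# case title is.
hashed_name() {
    printf '%s' "$1" | md5sum | cut -d ' ' -f 1 | sed 's/$/.txt/'
}

hashed_name "'Lambert v Lambert [1767] 2 Brown PC 18, 1 ER 764"
```

Identical titles hash to the same name, so duplicate case names would still end up in the same file.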

    
by 02.11.2015 / 01:23
0

Perl is ideal for this... the following is untested, but should work.

Edit: modified to read a fixed file.

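#!/usr/bin/perl
use strict;
use warnings;

open IN, "<", "this_is_input" or die "cannot open this_is_input: $!";
# Text that appears before the first citation goes here.
open OUT, ">", "before_ER" or die "cannot open before_ER: $!";
while(<IN>) {
  if(/^\d+\s+ER\s+\d+$/) {
     # Line containing <number><spaces>ER<spaces><number> only
     chomp;
     close OUT;
     open OUT, ">", $_ or die "cannot open $_: $!";
  }
  else {
    print OUT $_;
  }
}
close OUT;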
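
For comparison, roughly the same logic can be sketched with awk inside a shell function. This is an illustration under the same assumption as the Perl above (the citation sits alone on its own line); the name split_by_citation and the .txt suffix are choices made here, and unlike the Perl version it discards any text before the first citation.

```shell
#!/bin/sh
# split_by_citation INPUT OUTDIR (hypothetical helper name)
# Writes the text following each citation line to OUTDIR/<citation>.txt.
split_by_citation() {
    mkdir -p "$2" || return 1
    awk -v dir="$2" '
        # A line consisting only of <number> ER <number> starts a new case.
        /^[0-9]+[ \t]+ER[ \t]+[0-9]+$/ {
            if (out != "") close(out)
            out = dir "/" $0 ".txt"
            next
        }
        # Copy case text to the current file; skip text before the first citation.
        out != "" { print > out }
    ' "$1"
}

# Usage: split_by_citation this_is_input out
```

A citation that occurs twice would overwrite the earlier file, so repeated citations would need separate handling.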
    
by 01.11.2015 / 23:53